# 1. Search, Extract, and Crawl the Web 🌐

Welcome! In this tutorial, you'll gain hands-on experience with the core capabilities of the Tavily API—searching the web with semantic understanding, extracting content from live web pages, and crawling entire websites. 

These skills are essential for anyone building AI agents or applications that need up-to-date, relevant information from the internet. By learning how to programmatically access and process real-time web data, you'll be able to bridge the gap between static language models and the dynamic world they operate in, making your agents smarter, more accurate, and context-aware.

We'll cover:
- How to perform web searches and retrieve the most relevant results
- How to extract clean, usable content from any URL
- How to crawl websites to gather comprehensive information
- How to fine-tune your queries with advanced parameters
 

---

## Getting Started

Follow these steps to set up:

1. **Sign up** for Tavily at [app.tavily.com](https://app.tavily.com/home/?utm_source=github&utm_medium=referral&utm_campaign=nir_diamant) to get your API key.

   *The screenshots below show each step:*

   - ![Screenshot: Signup Page](assets/sign-up.png)
   - ![Screenshot: Tavily API Keys Dashboard](assets/api-key.png)


2. **Copy your API key** from your Tavily account dashboard.

3. **Paste your API key** into the cell below and execute the cell.

In [None]:
# Write your API key to a .env file by running this cell (replace the placeholder with your actual key):
!echo "TAVILY_API_KEY=<your-tavily-api-key>" >> .env

Install dependencies in the cell below.

In [None]:
%pip install --upgrade tavily-python python-dotenv --quiet

### Setting Up Your Tavily API Client

The code below will instantiate the Tavily client with your API key.


In [2]:
import os
import getpass
from dotenv import load_dotenv
from tavily import TavilyClient

# Load environment variables from .env file
load_dotenv()

# Prompt the user to securely input the API key if not already set in the environment
if not os.environ.get("TAVILY_API_KEY"):
    os.environ["TAVILY_API_KEY"] = getpass.getpass("TAVILY_API_KEY:\n")

# Initialize the Tavily API client using the loaded or provided API key
tavily_client = TavilyClient(api_key=os.getenv("TAVILY_API_KEY"))

## Search 🔍 

Let's run a basic web search query to retrieve up-to-date information about NYC. 


In [3]:
# Perform a web search to retrieve the most up-to-date information available.
search_results = tavily_client.search(
    query="What happened in NYC today?", max_results=5
)

This search returns up to 5 results. Each result includes the web page's title, URL, a content snippet suited to RAG, and a semantic score indicating how closely the result matches your query.

In [None]:
# Print the results
for result in search_results["results"]:
    print(result["title"])
    print(result["url"])
    print(result["content"])
    print(result["score"])
    print("\n")
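
The `score` field can also drive a simple relevance filter before results reach a model. The sketch below is illustrative and not part of the Tavily SDK: `filter_by_score`, the `0.5` threshold, and the sample records are all assumptions made for the example, and a sensible cutoff depends on your queries.

```python
from typing import Any


def filter_by_score(
    results: list[dict[str, Any]], min_score: float = 0.5
) -> list[dict[str, Any]]:
    """Keep results whose score is at least min_score, highest score first."""
    kept = [r for r in results if r.get("score", 0.0) >= min_score]
    return sorted(kept, key=lambda r: r["score"], reverse=True)


# Hypothetical records shaped like search_results["results"]; not real API output.
sample_results = [
    {"title": "A", "url": "https://example.com/a", "content": "...", "score": 0.82},
    {"title": "B", "url": "https://example.com/b", "content": "...", "score": 0.31},
    {"title": "C", "url": "https://example.com/c", "content": "...", "score": 0.67},
]

relevant = filter_by_score(sample_results, min_score=0.5)
print([r["title"] for r in relevant])
```

In the notebook, you could pass `search_results["results"]` in place of `sample_results`.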

Let's experiment with different API parameter configurations to see Tavily in action. Try everything from broad topics to specific questions! You can adjust parameters such as the number of results, time range, and domain filters to tailor your search. For more information, read the search [API reference](https://docs.tavily.com/documentation/api-reference/endpoint/search?utm_source=github&utm_medium=referral&utm_campaign=nir_diamant) and [best practices guide](https://docs.tavily.com/documentation/best-practices/best-practices-search?utm_source=github&utm_medium=referral&utm_campaign=nir_diamant). Next, let's apply a time range filter and a domain filter, and set the search topic to `news`.


In [5]:
# Perform a web search with a specific time range, domain filter, and topic filter.
search_results = tavily_client.search(
    query="Anthropic model release?",
    max_results=5,
    time_range="month",
    include_domains=["techcrunch.com"],
    topic="news",
)

Notice that all the results are from `techcrunch.com` and are limited to the past month. Setting the `news` topic focuses the search on trusted third-party news sources.

In [None]:
# Print the results
for result in search_results["results"]:
    print(result["title"])
    print(result["url"])
    print(result["content"])
    print("\n")
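
One way to confirm that the domain filter behaved as expected is to check each result's hostname locally. This is a minimal sketch using only the standard library; `is_from_domain` and the sample URLs are hypothetical and exist only for illustration.

```python
from urllib.parse import urlparse


def is_from_domain(url: str, domain: str) -> bool:
    """Return True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)


# Hypothetical URLs for illustration; not real search output.
sample_urls = [
    "https://techcrunch.com/example-story/",
    "https://www.techcrunch.com/tag/example/",
    "https://nottechcrunch.com/post",
]

off_domain = [u for u in sample_urls if not is_from_domain(u, "techcrunch.com")]
print(off_domain)
```

In the notebook, you would build the list from `[r["url"] for r in search_results["results"]]`; an empty `off_domain` list means every result came from the requested domain.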

## Extract 📄 

Next, we'll use the Tavily extract endpoint to retrieve the complete content (i.e., `raw_content`) of each page using the URLs from our previous search results. Instead of just using the short content snippets from the search, this allows us to access the full text of each page. For efficiency, the extract endpoint can process up to 20 URLs at once in a single call. For more information, read the extract [API reference](https://docs.tavily.com/documentation/api-reference/endpoint/extract?utm_source=github&utm_medium=referral&utm_campaign=nir_diamant) and [best practices guide](https://docs.tavily.com/documentation/best-practices/best-practices-extract?utm_source=github&utm_medium=referral&utm_campaign=nir_diamant). 


In [7]:
# Extract the full-page content from the URLs in the search results.
extract_results = tavily_client.extract(
    urls=[result["url"] for result in search_results["results"]],
    # Uncomment to use the advanced extract depth for complex web pages with
    # dynamic content, embedded media, or structured data:
    # extract_depth="advanced",
)
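
If you have more URLs than the 20-per-call limit mentioned above, you can split them into batches first. The helper below is a sketch under that assumption; `chunked` and the generated URL list are hypothetical names introduced for the example.

```python
from typing import Iterator


def chunked(urls: list[str], size: int = 20) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


# Hypothetical URL list, purely for illustration.
many_urls = [f"https://example.com/page-{i}" for i in range(45)]
batches = list(chunked(many_urls))
print([len(batch) for batch in batches])
```

Each batch could then be passed to `tavily_client.extract(urls=batch)` in a loop, collecting the `results` from every response.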

Let's look at the raw content, which provides much more detailed and complete information than the short content snippets shown earlier. If you use raw content as input to LLMs, remember to consider your model's context window limits.

In [None]:
# Print the results
for result in extract_results["results"]:
    print(result["url"])
    print(result["raw_content"])
    print("\n")
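
To act on the context-window caveat, one simple option is to cap each page's text before passing it to a model. The sketch below counts characters as a rough stand-in for tokens; `truncate_for_context`, the 4000-character default, and the `[truncated]` marker are assumptions for illustration, not Tavily features.

```python
from typing import Optional


def truncate_for_context(text: Optional[str], max_chars: int = 4000) -> str:
    """Cut text to about max_chars characters, breaking at whitespace when possible.

    A short marker is appended when text is cut, so the result can be
    slightly longer than max_chars.
    """
    if not text:
        return ""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # Back up to the last space so a word is not split in half.
    last_space = cut.rfind(" ")
    if last_space > 0:
        cut = cut[:last_space]
    return cut.rstrip() + " [truncated]"


# Hypothetical long page text for illustration.
long_text = "word " * 2000
short = truncate_for_context(long_text, max_chars=50)
print(short)
```

For accurate budgeting you would count tokens with your model's own tokenizer instead of characters.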

Rather than calling the extract endpoint separately to get raw page content, we can combine search and extraction into a single API call by passing `include_raw_content=True` to the search endpoint.

In [9]:
# Perform a web search with live content extraction.
search_results = tavily_client.search(
    query="Anthropic model release?",
    max_results=1,
    include_raw_content=True,
)

Each search result now contains the web page's title, URL, semantic score, a content snippet, and the complete raw content. Tavily's flexible and modular API supports building a wide range of agentic systems, regardless of model size.

In [None]:
# Print the results
for result in search_results["results"]:
    print(result["url"])
    print(result["content"])
    print(result["score"])
    print(result["raw_content"])
    print("\n")
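
Results like these are often assembled into a single context block for a model prompt. The following is a minimal sketch of one way to do that; `build_context`, the per-source character cap, and the sample records are hypothetical. It also assumes `raw_content` may be missing or `None` for some pages, in which case it falls back to the snippet.

```python
from typing import Any


def build_context(results: list[dict[str, Any]], max_chars_per_source: int = 2000) -> str:
    """Join results into one text block, labeling each passage with its source URL."""
    sections = []
    for index, result in enumerate(results, start=1):
        # Prefer the full page text; fall back to the snippet if it is missing or empty.
        body = result.get("raw_content") or result.get("content") or ""
        sections.append(f"[{index}] {result['url']}\n{body[:max_chars_per_source]}")
    return "\n\n".join(sections)


# Hypothetical records shaped like search_results["results"]; not real API output.
sample_results = [
    {"url": "https://example.com/a", "content": "Snippet A.", "raw_content": "Full text of page A."},
    {"url": "https://example.com/b", "content": "Snippet B.", "raw_content": None},
]

context = build_context(sample_results)
print(context)
```

Numbering each source makes it easier to ask the model to cite where a claim came from.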

## Crawl 🕸️ 

Now let's use Tavily to crawl a website and collect its nested pages. Web crawling is the process of automatically navigating through websites by following hyperlinks to discover many web pages and URLs (think of it like falling down a Wikipedia rabbit hole 🐇: clicking from page to page, diving deeper into interconnected topics). For autonomous web agents, this capability is essential for reaching deeply nested content that can be difficult to retrieve via search. For more information, read the crawl [API reference](https://docs.tavily.com/documentation/api-reference/endpoint/crawl?utm_source=github&utm_medium=referral&utm_campaign=nir_diamant) and [best practices guide](https://docs.tavily.com/documentation/best-practices/best-practices-crawl?utm_source=github&utm_medium=referral&utm_campaign=nir_diamant).


Let's begin by crawling the Tavily website to gather all nested pages.

In [11]:
# Crawl the Tavily website
crawl_results = tavily_client.crawl(url="tavily.com")

We can see all the nested URLs.

In [None]:
# Print the results
for result in crawl_results["results"]:
    print(result["url"])

The crawl endpoint also returns the raw page content of each URL.

In [None]:
# Print the results
for result in crawl_results["results"]:
    print(result["url"])
    print(result["raw_content"])
    print("\n")
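
Printing every page in full can be overwhelming for larger sites, so a quick size summary can help you decide which pages are worth reading. This sketch assumes each crawl result carries `url` and `raw_content` keys as shown above; `summarize_pages` and the sample records are hypothetical.

```python
from typing import Any


def summarize_pages(results: list[dict[str, Any]]) -> list[tuple[str, int]]:
    """Return (url, character count) pairs, largest page first."""
    sizes = [(r["url"], len(r.get("raw_content") or "")) for r in results]
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)


# Hypothetical records shaped like crawl_results["results"]; not real API output.
sample_pages = [
    {"url": "https://example.com/", "raw_content": "Home"},
    {"url": "https://example.com/docs", "raw_content": "Documentation page text"},
    {"url": "https://example.com/empty", "raw_content": None},
]

for url, size in summarize_pages(sample_pages):
    print(f"{size:>6}  {url}")
```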

If you're interested in just the links (without the full page content), use the Map endpoint. It's a faster and more cost-effective way to retrieve all the links from a site.

In [14]:
# Map the Tavily website
map_results = tavily_client.map(url="tavily.com")

Let's view the results, which only contain the links in this case.

In [None]:
# Print the results
map_results
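
A flat list of links is easier to reason about once it is grouped by site section. The sketch below groups URLs by their first path segment using only the standard library. It assumes the links are available as a plain list of URL strings (for example from `map_results["results"]`, which is an assumption about the response shape); `group_by_section` and the sample links are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_section(urls: list[str]) -> dict[str, list[str]]:
    """Group URLs by their first path segment; the site root is filed under '/'."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        segments = [s for s in urlparse(url).path.split("/") if s]
        key = segments[0] if segments else "/"
        groups[key].append(url)
    return dict(groups)


# Hypothetical links for illustration; not actual output from the Map endpoint.
sample_links = [
    "https://example.com/",
    "https://example.com/blog/post-1",
    "https://example.com/blog/post-2",
    "https://example.com/pricing",
]

for section, links in group_by_section(sample_links).items():
    print(section, len(links))
```

A grouping like this gives an agent a cheap overview of a site before it decides which sections to crawl in full.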

The `instructions` parameter of the crawl/map endpoint is a powerful feature that lets you guide the web crawl using natural language instructions.

In [16]:
# Map the Tavily website with natural language instructions
guided_map_results = tavily_client.map(
    url="tavily.com", instructions="find only the developer docs"
)

Now the results should include only developer docs from the Tavily website.

In [None]:
guided_map_results

Experiment with different URLs to see how Tavily maps the structure of different websites. How would you integrate this into your agentic systems...🤔?

## Conclusion & Next Steps
 
In this tutorial, you learned how to:
- Perform real-time web searches using the Tavily API
- Extract content from web pages
- Crawl and map websites to gather links and information
- Guide crawls with natural language instructions for targeted data extraction
 
These foundational skills enable your agents to access and utilize up-to-date web information, making them more powerful and context-aware. Feel free to experiment with the Tavily API in the [playground](https://app.tavily.com/playground?utm_source=github&utm_medium=referral&utm_campaign=nir_diamant) and read the [best practices guide](https://docs.tavily.com/documentation/best-practices/best-practices-search?utm_source=github&utm_medium=referral&utm_campaign=nir_diamant) to optimize for your use case.
 
**Ready to take the next step?**  
In **Tutorial #2: Building a Web Agent**, you'll combine these capabilities to build a fully autonomous web agent. This agent will be able to reason, decide when to search, crawl, or extract, and integrate web data into its workflow, all powered by Tavily.
 
[👉 **Continue to Tutorial #2!**](./web-agent-tutorial.ipynb)
