# Build a RAG System with Tavily Web Crawling
---

This notebook demonstrates how to build a retrieval-augmented generation (RAG) system by crawling web content, splitting it into chunks, and using those chunks to answer questions.

## Overview

We'll cover the following steps:

1. Crawl a website using Tavily's crawling API
2. Extract and process the raw content
3. Create documents with metadata
4. Split documents into manageable chunks
5. Create vector embeddings
6. Build a question-answering system

Let's get started!

In [1]:
import getpass
import os

import requests

if not os.environ.get("TAVILY_API_KEY"):
    os.environ["TAVILY_API_KEY"] = getpass.getpass("TAVILY_API_KEY:\n")

TAVILY_API_KEY = os.getenv("TAVILY_API_KEY")



## Step 1: Define the Target Website

We'll specify the base URL to crawl. For this example, we're using `tavily.com`.

---

In [2]:
base_url = "tavily.com"

## Step 2: Crawl the Website

Now we'll use Tavily's crawling API to extract content from the website. We can control the crawling behavior with parameters like:

- `limit`: Maximum number of pages to crawl
- `max_depth`: How many levels deep to crawl from the starting page
- `max_breadth`: Maximum number of links to follow at each level
- `extract_depth`: Level of content extraction ("basic" or "advanced")
- `select_paths`: Specific URL paths to include
- `select_domains`: Specific domains to include

---

In [3]:
crawl_result = requests.post(
    "https://api.tavily.com/crawl",
    headers={"Authorization": f"Bearer {TAVILY_API_KEY}"},
    json={
        "url": base_url,
        "limit": 100,
        "max_depth": 5,
        "max_breadth": 100,
        "extract_depth": "advanced",
        "select_paths": ["/documentation/*", "/api-reference/*"],
        "select_domains": ["docs.tavily.com", "blog.tavily.com"],
    },
)


## Step 3: Examine Crawled URLs

Let's look at the URLs that were successfully crawled. Note: the results are limited to the `/documentation` and `/api-reference` paths on the domains we listed, as set in the `select_paths` and `select_domains` arguments.

Hint: because these parameters control exactly which pages are crawled, we can use them to build vector databases that contain only the content we care about.

---

In [None]:
for page in crawl_result.json()["data"]:
    print(page["url"])
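
To check that the domain and path filters behaved as expected, it can help to group the crawled URLs by host. The following is a minimal sketch using only the standard library. The helper `group_urls_by_host` and the sample URLs are hypothetical and not part of the Tavily API, and the sketch assumes each URL includes a scheme such as `https://`.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_urls_by_host(urls):
    """Group URLs by hostname so the effect of domain filters is easy to inspect."""
    grouped = defaultdict(list)
    for url in urls:
        # urlparse only populates hostname when the URL has a scheme (assumed here).
        host = urlparse(url).hostname or "(unknown)"
        grouped[host].append(url)
    return dict(grouped)


# Hypothetical sample URLs for illustration only.
sample_urls = [
    "https://docs.example.com/documentation/intro",
    "https://docs.example.com/api-reference/endpoint",
    "https://blog.example.com/some-post",
]

for host, host_urls in group_urls_by_host(sample_urls).items():
    print(f"{host}: {len(host_urls)} page(s)")
```

In this notebook you would pass `[page["url"] for page in crawl_result.json()["data"]]` instead of the sample list.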

## Step 4: Preview the Raw Content

Let's examine a sample of the raw content from one of the crawled pages to understand what we're working with:

---

In [None]:
# Access the data array from the JSON response
data = crawl_result.json()["data"]

# View a single sample page from the data array
if data:
    page = data[0]  # Get the first page
    raw_content = page["raw_content"]
    print(f"URL: {page['url']}")
    print(f"Raw Content: {raw_content[:200]}...")  # Print first 200 chars with ellipsis
    print("-" * 50)  # Print a separator

## Step 5: Process Content into Documents

We'll convert the crawled content into LangChain Document objects, which will allow us to:

1. Maintain important metadata (source URL, page name)
2. Prepare the text for chunking
3. Make the content ready for vectorization

---

In [None]:
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Create a list to store the Document objects
documents = []

# Loop through each page in your crawl results
for page in crawl_result.json()["data"]:
    # Extract the content and URL/name for each page
    page_content = page["raw_content"]
    page_url = page["url"]

    # Create a Document for each page with the URL and page name as metadata
    doc = Document(
        page_content=page_content,
        metadata={"source": page_url, "page_name": page_url.split("/")[-1]},
    )

    documents.append(doc)
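
One caveat with `page_url.split("/")[-1]`: it returns an empty string when a URL ends with a slash, and it keeps any query string. The sketch below shows one possible way to derive a more robust page name with the standard library. The helper `page_name_from_url` and the example URLs are hypothetical, and the fallback to the hostname is an assumption about what makes a sensible default.

```python
from urllib.parse import urlparse


def page_name_from_url(url):
    """Return the last non-empty path segment, falling back to the hostname."""
    parsed = urlparse(url)
    # Drop empty segments produced by leading, trailing, or doubled slashes.
    segments = [segment for segment in parsed.path.split("/") if segment]
    if segments:
        return segments[-1]
    # No path at all (for example a bare domain): use the hostname instead.
    return parsed.hostname or url


print(page_name_from_url("https://docs.example.com/documentation/quickstart/"))
print(page_name_from_url("https://docs.example.com/"))
```

To use it in the loop above, the `page_name` metadata value could be computed as `page_name_from_url(page_url)`.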


## Step 6: Split Documents into Chunks

We'll split the documents into smaller, more manageable chunks using the `RecursiveCharacterTextSplitter` and preview the result.

---

In [None]:
# Split each page into smaller chunks for retrieval
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # Maximum chunk length, measured by length_function (characters)
    chunk_overlap=100,  # Characters shared between consecutive chunks
    length_function=len,
    is_separator_regex=False,
)
# Split the documents while preserving metadata
all_chunks = text_splitter.split_documents(documents)

# Now you have each page as a separate document with proper metadata
print(f"Created {len(documents)} page-level documents")
print(f"Split into {len(all_chunks)} total chunks")

# Preview the first two page-level documents
for i, doc in enumerate(documents[:2]):
    print(f"\nDocument {i+1}:")
    print(f"Page: {doc.metadata.get('page_name')}")
    print(f"Source: {doc.metadata.get('source')}")
    print(f"Content length: {len(doc.page_content)} characters")
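
To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified sliding-window chunker. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph and line boundaries); the function `sliding_window_chunks` is a hypothetical helper that only illustrates the idea of overlapping windows.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows where neighbors share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # How far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk begins with the last character of the previous one, which is what the overlap buys you: a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.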

## Step 7: Create Vector Embeddings

Now we'll create vector embeddings for our document chunks using OpenAI's embedding model and store them in a Chroma vector database. This allows us to perform semantic search on our document collection.

---

In [None]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Create the embedding model (requires the OPENAI_API_KEY environment variable)
embeddings = OpenAIEmbeddings()

# Create a vector store from the document chunks
vector_store = Chroma.from_documents(all_chunks, embeddings)
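
Conceptually, semantic search compares the embedding of a query against the embedding of every chunk and returns the closest ones. The sketch below illustrates that idea with cosine similarity on tiny hand-made vectors. The vectors, the chunk labels, and the helpers `cosine_similarity` and `top_k` are all made up for illustration; real embeddings have many more dimensions, and the vector store may use a different distance metric and an approximate index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, chunk_vectors, k=2):
    """Return the labels of the k chunks most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_vector, vector), label)
        for label, vector in chunk_vectors.items()
    ]
    scored.sort(reverse=True)
    return [label for _, label in scored[:k]]


# Toy 3-dimensional "embeddings" (purely illustrative).
toy_chunks = {
    "rate limits": [0.9, 0.1, 0.0],
    "pricing": [0.1, 0.9, 0.0],
    "crawl api": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.0]

print(top_k(toy_query, toy_chunks, k=2))
```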

## Step 8: Build the Question-Answering System

Finally, we'll create a retrieval-based question-answering system using gpt-4o-mini. We use the "stuff" chain type, which combines all relevant retrieved documents into a single context for the model.

---

In [24]:
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

# Initialize the language model
llm = ChatOpenAI(model="gpt-4o-mini", streaming=True)

# Create a QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(),
)
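
The idea behind the "stuff" strategy can be sketched in a few lines: every retrieved chunk is concatenated into one context block, and the question is appended. The prompt wording and the helper `build_stuff_prompt` below are hypothetical and are not the template LangChain uses internally; the sketch only shows why this strategy works best when the retrieved chunks fit comfortably in the model's context window.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate all retrieved chunks into one context block, then append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks standing in for retrieved document text.
example_chunks = ["First retrieved chunk.", "Second retrieved chunk."]
print(build_stuff_prompt("What does the first chunk say?", example_chunks))
```

Because the prompt grows with every chunk, the number of documents the retriever returns directly affects prompt length and cost.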

## Step 9: Test the System

Let's test our RAG system by asking a question about Tavily's documentation.

---

In [29]:
# Example question
query = "What is Tavily's production rate limit?"
answer = qa_chain.invoke(query)

In [30]:
answer["result"]

"Tavily's production rate limit is 1,000 requests per minute (RPM)."

## Conclusion

We've successfully built a complete RAG system that can:

1. Crawl web content from a specific domain
2. Process and structure the content
3. Create vector embeddings for semantic search
4. Answer questions based on the crawled information

This approach can be extended to create knowledge bases from any website, documentation, or content repository, making it a powerful tool for building domain-specific assistants and search systems.

How could you enhance this by combining the Tavily `/search` endpoint with the `/crawl` endpoint? 🤔 Find out in the [Agentic Crawling Tutorial](./agentic-crawl.ipynb)!