# 🤖 TavilyCrawl Tutorial: Intelligent Web Crawling

> **📚 Part of the LangChain - Develop AI Agents with LangChain & LangGraph**  
> [🎓 Get the full course](https://www.udemy.com/course/langchain/?referralCode=D981B8213164A3EA91AC)

## What is TavilyCrawl?

**TavilyCrawl** is an intelligent web crawler that uses AI to decide which paths to explore during a crawl, and it explores those paths in parallel.

### Key Features:

- **AI-Powered Path Selection**: Uses AI to determine which paths to explore
- **Parallel Processing**: Explores hundreds of paths simultaneously  
- **Advanced Extraction**: Extracts content from dynamically rendered pages
- **Instruction-Driven**: Follows natural language instructions to guide exploration
- **Targeted Content**: Returns content tailored for LLM integration and RAG systems

In this tutorial, we'll demonstrate TavilyCrawl by comparing different instruction approaches on the **LangChain documentation**:
1. 🔍 **Regular crawl without instructions** - baseline behavior
2. ❌ **Regular crawl with poor instructions** - demonstrates what to avoid
3. ✅ **Regular crawl with good instructions** - targeted results
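
The three runs differ mainly in the `instructions` field of the request. As a minimal sketch, the request dictionary could be assembled like this, assuming the `url`, `instructions`, `max_depth`, and `extract_depth` keys used in the demo cells of this tutorial. `build_crawl_args` is a hypothetical helper for illustration, not part of `langchain-tavily`.

```python
from typing import Any, Dict, Optional


def build_crawl_args(
    url: str,
    instructions: Optional[str] = None,
    max_depth: int = 2,
    extract_depth: str = "advanced",
) -> Dict[str, Any]:
    """Assemble the dictionary passed to the crawler (hypothetical helper)."""
    args: Dict[str, Any] = {
        "url": url,
        "max_depth": max_depth,
        "extract_depth": extract_depth,
    }
    # Only include instructions when they are provided, so the baseline
    # run sends no "instructions" key at all.
    if instructions:
        args["instructions"] = instructions
    return args


baseline_args = build_crawl_args("https://python.langchain.com/")
targeted_args = build_crawl_args(
    "https://python.langchain.com/",
    instructions="Find all pages about AI agents",
)
print(baseline_args)
print(targeted_args)
```

Each dictionary could then be passed to `tavily_crawl.invoke(...)` once the crawler has been initialized.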

---

In [1]:
# Install required packages
%pip install langchain-tavily certifi

# For pretty printing and visualization (json ships with Python, so it is not installed via pip)
%pip install rich pandas


In [2]:
import os
import ssl
import json
from typing import Any, Dict, List

import certifi
from langchain_tavily import TavilyCrawl
from rich.console import Console
from rich.panel import Panel
from rich.table import Table
from rich.json import JSON

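# Point SSL verification at certifi's CA bundle (helps on systems with missing root certificates)
ssl_context = ssl.create_default_context(cafile=certifi.where())  # not used below; the env vars do the work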
os.environ["SSL_CERT_FILE"] = certifi.where()
os.environ["REQUESTS_CA_BUNDLE"] = certifi.where()

# Initialize rich console for pretty printing
console = Console()

print("✅ All imports successful!")

✅ All imports successful!


In [3]:
# Set your Tavily API key here
import getpass

# For Google Colab, you can use getpass for secure input
if 'TAVILY_API_KEY' not in os.environ:
    os.environ['TAVILY_API_KEY'] = getpass.getpass('Enter your Tavily API key: ')

# Alternative: Set directly (uncomment and add your key)
# os.environ["TAVILY_API_KEY"] = "your_tavily_api_key_here"

print("✅ API key set successfully!")

✅ API key set successfully!


## 🚀 Initialize TavilyCrawl

Let's initialize TavilyCrawl and set up our target URL for demonstration.

In [4]:
# Initialize TavilyCrawl
tavily_crawl = TavilyCrawl()

# Target URL: LangChain Documentation
target_url = "https://python.langchain.com/"

console.print(Panel.fit(
    f"🎯 **Target Website**: {target_url}\n🤖 **Crawler**: TavilyCrawl",
    title="Demo Setup",
    border_style="bright_blue"
))

print("TavilyCrawl initialized successfully")

TavilyCrawl initialized successfully


## 🔍 Demo 1: Regular Crawl Without Instructions

First, let's see what happens when we use TavilyCrawl without any specific instructions. This will show us the baseline behavior on the LangChain documentation.

In [9]:
# Demo 1: Crawl without instructions
console.print(Panel.fit(
    f"🎯 **Target**: {target_url}\n📋 **Instructions**: None (baseline crawl)\n⚙️ **Max Depth**: 2\n🎨 **Extract Depth**: advanced",
    title="Demo 1: Regular Crawl Without Instructions",
    border_style="blue"
))

console.print("Running TavilyCrawl without instructions...", style="blue")

# Basic crawl without instructions
basic_result = tavily_crawl.invoke({
    "url": target_url,
    "max_depth": 2,  # same depth as shown in the panel above and used in Demos 2 and 3
    "extract_depth": "advanced"
})

basic_results = basic_result.get("results", [])
console.print(f"Basic crawl completed. Found {len(basic_results)} pages", style="green")

# Show what we got without instructions
console.print(f"\n📊 **Results Without Instructions**: {len(basic_results)} pages", style="cyan")
console.print("   📄 Mix of all content types from LangChain docs")
console.print("   🔍 No filtering - everything from the crawled sections")
console.print("   ⚠️  Requires manual work to find specific content")

In [10]:
# Display sample results from basic crawl
console.print("\n📋 **Sample Results from Basic Crawl (No Filtering):**\n", style="cyan")

for i, result in enumerate(basic_results[:3], 1):  # Show first 3 results
    title = result.get("title") or result.get("url", "No URL")  # Use URL if no title
    url = result.get("url", "No URL")
    content = (result.get("raw_content") or "No content")[:150] + "..."  # also handles raw_content=None
    
    panel_content = f"""🔗 **URL**: {url}

📖 **Content Preview**:
{content}"""
    
    console.print(Panel(
        panel_content,
        title=f"📄 {i}. {title}",
        border_style="blue"
    ))
    print()

console.print(f"... and {max(0, len(basic_results) - 3)} more mixed results", style="italic cyan")
console.print("🔍 **Note**: Mixed content types - guides, integrations, concepts, etc.", style="cyan")
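
The preview logic above can be factored into a small function that also copes with pages whose `raw_content` is missing or `None`. This is a sketch on mock data. `preview` is a hypothetical helper, and the mock dictionaries only imitate the `url` and `raw_content` keys read in this tutorial.

```python
from typing import Any, Dict


def preview(result: Dict[str, Any], limit: int = 150) -> str:
    """Return a short preview of a result's raw_content (hypothetical helper)."""
    # `or` covers both a missing key and an explicit None value.
    text = result.get("raw_content") or "No content"
    # Only append an ellipsis when text was actually cut off.
    return text[:limit] + "..." if len(text) > limit else text


# Mock results standing in for real crawl output.
mock_results = [
    {"url": "https://example.com/a", "raw_content": "x" * 200},
    {"url": "https://example.com/b", "raw_content": None},
    {"url": "https://example.com/c"},
]
previews = [preview(r) for r in mock_results]
for p in previews:
    print(p)
```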


## 🔍 Understanding TavilyCrawl Result Format

Let's examine the structure and fields of TavilyCrawl results to understand what data is available.

In [11]:
# Examine the structure of TavilyCrawl results
console.print("📋 **TavilyCrawl Result Structure Analysis**\n", style="blue")

# Let's examine a single result from our basic crawl
if basic_results:
    sample_result = basic_results[0]
    
    console.print("🔍 **Available Fields in Each Result:**", style="cyan")
    for field, value in sample_result.items():
        field_type = type(value).__name__
        value_preview = str(value)[:100] + "..." if len(str(value)) > 100 else str(value)
        console.print(f"   📄 **{field}** ({field_type}): {value_preview}")
    
    console.print(f"\n📊 **Complete Structure of Result #1:**", style="green")
    # Display the full structure as formatted JSON
    result_json = JSON.from_data(sample_result)
    console.print(result_json)
    
else:
    console.print("⚠️ No results available for structure analysis", style="red")
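
Once the fields are known, results can be reshaped for downstream use such as a RAG pipeline. The sketch below keeps only pages that have extracted text and turns them into source/text pairs. `to_records` and the `source`/`text` key names are hypothetical choices for illustration, and the mock data only imitates the `url` and `raw_content` keys read in this tutorial.

```python
from typing import Any, Dict, List


def to_records(results: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Keep only results with text and reshape them as source/text pairs (hypothetical helper)."""
    records = []
    for result in results:
        text = result.get("raw_content")
        if not text:
            continue  # skip pages where no content was extracted
        records.append({"source": result.get("url", ""), "text": text.strip()})
    return records


# Mock results standing in for real crawl output.
mock_results = [
    {"url": "https://example.com/a", "raw_content": "  Agents overview  "},
    {"url": "https://example.com/b", "raw_content": None},
]
records = to_records(mock_results)
print(records)
```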

## ❌ Demo 2: Regular Crawl With Poor Instructions

Next, let's see what happens when we use poor instructions. This demonstrates what to avoid when working with TavilyCrawl.


In [12]:
bad_instructions = "What is LangChain and how does it work?"

console.print(Panel.fit(
    f"🎯 **Target**: {target_url} (same as Demo 1)\n📋 **Instructions**: {bad_instructions}\n⚠️ **Type**: Poor (asks a question instead of guiding crawl)\n⚙️ **Max Depth**: 2\n🎨 **Extract Depth**: advanced",
    title="Demo 2: Regular Crawl With Poor Instructions", 
    border_style="red"
))

console.print("Starting crawl with poor instructions...", style="red")
console.print("This demonstrates an ineffective instruction pattern", style="italic")

In [13]:
# Execute the crawl with poor instructions
bad_result = tavily_crawl.invoke({
    "url": target_url,
    "instructions": bad_instructions,
    "max_depth": 2,
    "extract_depth": "advanced"
})

console.print("Crawl with poor instructions completed", style="yellow")

# Show what happens with poor instructions
bad_results = bad_result.get("results", [])

console.print(f"\n❌ **Why These Instructions Are Ineffective:**", style="red")
console.print(f"   📊 Results found: {len(bad_results)} pages")
console.print("   🤔 Instructions ask a question instead of guiding the crawler")
console.print("   🎯 Crawl doesn't answer questions - it finds pages that might contain answers")
console.print("   ⚠️  Results may be unfocused or too broad")

In [14]:
# Display results from poor instructions
console.print("\n❌ **Results from Poor Instructions:**\n", style="red")

for i, result in enumerate(bad_results[:3], 1):  # Show first 3 results
    title = result.get("title") or result.get("url", "No URL")  # Use URL if no title
    url = result.get("url", "No URL")
    content = (result.get("raw_content") or "No content")[:150] + "..."  # also handles raw_content=None
    
    panel_content = f"""🔗 **URL**: {url}

📖 **Content Preview**:
{content}"""
    
    console.print(Panel(
        panel_content,
        title=f"📄 {i}. {title}",
        border_style="red"
    ))
    print()

console.print(f"... and {max(0, len(bad_results) - 3)} more potentially unfocused results", style="italic red")
console.print("⚠️ **Note**: Results might be too general or unfocused", style="red")


## ✅ Demo 3: Regular Crawl With Good Instructions

Now let's see how TavilyCrawl performs when provided with effective instructions. Good instructions allow the AI to filter and target specific content precisely.

### How Good Instructions Help:
- Provide clear direction for the crawler
- Target specific content types or topics
- Enable AI to make intelligent filtering decisions
- Reduce manual post-processing work
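
One rough way to catch the poor-instruction pattern before running a crawl is to check whether the text is phrased as a question rather than a directive. This is only a heuristic sketch. `looks_like_question` and its word list are hypothetical and not part of TavilyCrawl.

```python
QUESTION_STARTERS = (
    "what", "how", "why", "when", "where", "who", "which",
    "is", "are", "can", "does", "do",
)


def looks_like_question(instructions: str) -> bool:
    """Rough heuristic: flag instructions phrased as a question (hypothetical helper)."""
    text = instructions.strip().lower()
    if not text:
        return False
    first_word = text.split()[0]
    # A trailing "?" or a leading question word suggests a question, not a directive.
    return text.endswith("?") or first_word in QUESTION_STARTERS


bad = "What is LangChain and how does it work?"
good = "Find all pages about ai agents"
print(looks_like_question(bad))
print(looks_like_question(good))
```

A check like this could prompt a rewrite into an action-oriented form (for example, starting with "Find") before the crawl is executed.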



In [18]:
good_instructions = "Find all pages about AI agents"

console.print(Panel.fit(
    f"🎯 **Target**: {target_url} (same as Demo 1 & 2)\n📋 **Instructions**: {good_instructions}\n✅ **Type**: Good (specific, action-oriented)\n⚙️ **Max Depth**: 2\n🎨 **Extract Depth**: advanced",
    title="Demo 3: Regular Crawl With Good Instructions", 
    border_style="green"
))

console.print("Starting crawl with good instructions...", style="green")
console.print("Instructions will guide the AI to target specific content", style="italic")

In [19]:
# Execute the crawl with good instructions
good_result = tavily_crawl.invoke({
    "url": target_url,
    "instructions": good_instructions,
    "max_depth": 2,
    "extract_depth": "advanced"
})

console.print("Crawl with good instructions completed", style="green")

# Show the results of instruction-based filtering
good_results = good_result.get("results", [])

# Compare with previous demos
console.print(f"\n🎯 **Impact of Good Instructions:**", style="blue")
console.print(f"   📊 Demo 1 (no instructions): {len(basic_results)} mixed pages")
console.print(f"   ❌ Demo 2 (poor instructions): {len(bad_results)} potentially unfocused pages")
console.print(f"   ✅ Demo 3 (good instructions): {len(good_results)} targeted agent pages")
console.print("   📉 **Compare the counts above to see how much noise the good instructions removed**", style="green")

In [20]:
# Display the targeted agent documentation found
console.print("\n🎯 **LangChain Agent Documentation Found:**\n", style="green")

for i, result in enumerate(good_results, 1):
    title = result.get("title") or result.get("url", "No URL")  # Use URL if no title
    url = result.get("url", "No URL")
    content = (result.get("raw_content") or "No content")[:200] + "..."  # also handles raw_content=None
    
    panel_content = f"""🔗 **URL**: {url}

📖 **Content Preview**:
{content}"""
    
    console.print(Panel(
        panel_content,
        title=f"📑 {i}. {title}",
        border_style="green"
    ))
    print()

console.print("📝 **Note**: Results should focus on agents in LangChain; spot-check the URLs above to confirm", style="green")
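
Whether the returned pages are really about agents can be spot-checked with a simple keyword test. The sketch below reports the share of results whose URL or content mentions a keyword. `keyword_share` is a hypothetical helper, a substring match is only a crude proxy for relevance, and the mock data imitates the `url` and `raw_content` keys read in this tutorial.

```python
from typing import Any, Dict, List


def keyword_share(results: List[Dict[str, Any]], keyword: str) -> float:
    """Fraction of results whose URL or content mentions the keyword (hypothetical helper)."""
    if not results:
        return 0.0
    needle = keyword.lower()
    hits = 0
    for result in results:
        # Search the URL and the extracted text together, case-insensitively.
        haystack = f"{result.get('url', '')} {result.get('raw_content') or ''}".lower()
        if needle in haystack:
            hits += 1
    return hits / len(results)


# Mock results standing in for real crawl output.
mock_results = [
    {"url": "https://example.com/agents", "raw_content": "Build an agent"},
    {"url": "https://example.com/install", "raw_content": "pip install"},
]
share = keyword_share(mock_results, "agent")
print(f"{share:.0%} of results mention 'agent'")
```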




## 📊 Comparison of All Three Approaches

Now let's compare all three approaches side by side to understand the impact of instruction quality.

In [21]:
# Create comparison table
comparison_table = Table(title="📊 TavilyCrawl: Instruction Quality Comparison")
comparison_table.add_column("Approach", style="cyan", no_wrap=True)
comparison_table.add_column("Instructions", style="yellow")
comparison_table.add_column("Pages Found", style="blue")
comparison_table.add_column("Content Quality", style="green")
comparison_table.add_column("Usefulness", style="red")

comparison_table.add_row(
    "🔍 No Instructions",
    "None (baseline)",
    f"{len(basic_results)}",
    "Mixed (all types)",
    "Low (requires filtering)"
)

comparison_table.add_row(
    "❌ Poor Instructions",
    bad_instructions,
    f"{len(bad_results)}",
    "Potentially unfocused",
    "Medium (may need filtering)"
)

comparison_table.add_row(
    "✅ Good Instructions",
    good_instructions,
    f"{len(good_results)}",
    "Highly targeted",
    "High (ready to use)"
)

console.print(comparison_table)

console.print("\n🎯 **Key Observations:**", style="blue")
console.print("   🔍 **No instructions**: everything is returned, requiring manual filtering")
console.print("   ❌ **Poor instructions** may return unfocused results")
console.print("   ✅ **Good instructions** provide highly targeted, ready-to-use results")
console.print("   💡 **Best practice**: Use specific, action-oriented instructions")

console.print(f"\n📈 **Efficiency with Good Instructions:**", style="green")
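if basic_results:  # avoid dividing by zero when the baseline crawl returned nothing
    reduction = (len(basic_results) - len(good_results)) / len(basic_results) * 100
    console.print(f"   🎯 Filtering efficiency: {reduction:.1f}% fewer pages than the baseline crawl")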
console.print("   ⚡ Time saved: No manual post-processing required")
console.print("   🧠 AI-powered: Intelligent path selection and content filtering")
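
The efficiency figure is a plain percentage drop in page count relative to the baseline crawl. A minimal sketch of that arithmetic, with a guard for an empty baseline, is shown below. `page_reduction` is a hypothetical helper and the counts are illustrative numbers, not measured results.

```python
from typing import Optional


def page_reduction(baseline_count: int, targeted_count: int) -> Optional[float]:
    """Percentage drop in page count relative to the baseline, or None if undefined."""
    if baseline_count <= 0:
        return None  # no baseline pages, so a percentage is meaningless
    return (baseline_count - targeted_count) / baseline_count * 100


# Illustrative counts only.
print(page_reduction(40, 10))
print(page_reduction(0, 5))
```

A negative value would mean the targeted crawl returned more pages than the baseline, which is worth checking before reporting a "reduction".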