# TavilyCrawl Tutorial: Intelligent Web Crawling

## What We'll Build

In this tutorial, you'll learn how to use TavilyCrawl to intelligently crawl websites using AI-guided instructions. We'll demonstrate:

1. **Basic Web Crawling** - Crawl a website without specific instructions
2. **Instruction-Guided Crawling** - Use natural language to target specific content
3. **Results Comparison** - Compare the effectiveness of both approaches
4. **Best Practices** - Learn how to write effective crawling instructions

### Target Website
We'll crawl the LangChain documentation (https://python.langchain.com/) to find content about AI agents.

## What is TavilyCrawl?

TavilyCrawl is an intelligent web crawler that uses AI to determine which paths to explore during crawling. It combines AI-powered decision making with parallel processing capabilities.

### Key Features:

- **AI-Powered Path Selection**: Uses AI to determine which paths to explore
- **Parallel Processing**: Explores hundreds of paths simultaneously  
- **Advanced Extraction**: Extracts content from dynamically rendered pages
- **Instruction-Driven**: Follows natural language instructions to guide exploration
- **Targeted Content**: Returns content tailored for LLM integration and RAG systems

### Tavily Resources:
- <a href="https://tavily.com" target="_blank">Official Website</a>
- <a href="https://docs.tavily.com" target="_blank">API Documentation</a>
- <a href="https://docs.tavily.com/documentation/api-reference/endpoint/crawl" target="_blank">Crawl API Reference</a>
- <a href="https://pypi.org/project/langchain-tavily/" target="_blank">LangChain Python Integration</a>
- <a href="https://app.tavily.com/home" target="_blank">Get API Key</a>

This tutorial demonstrates TavilyCrawl by comparing crawl results with and without instructions on the LangChain documentation.

---

## Setup & Installation

First, let's install the required packages and set up our environment.


In [None]:
# Install required packages
%pip install langchain-tavily certifi

# For pretty printing and visualization
%pip install rich pandas

In [None]:
import os
import ssl
import json
from typing import Any, Dict, List

import certifi
from langchain_tavily import TavilyCrawl
from rich.console import Console
from rich.panel import Panel
from rich.table import Table
from rich.json import JSON

# Use certifi's CA bundle for HTTPS verification (helps on systems with missing or outdated root certificates)
ssl_context = ssl.create_default_context(cafile=certifi.where())
os.environ["SSL_CERT_FILE"] = certifi.where()
os.environ["REQUESTS_CA_BUNDLE"] = certifi.where()

# Initialize rich console for pretty printing
console = Console()

print("All imports successful!")

## API Key Setup

You'll need a Tavily API key to use TavilyCrawl. Get yours at [https://app.tavily.com/home](https://app.tavily.com/home).

Set the `TAVILY_API_KEY` environment variable before running the notebook, or enter the key when the next cell prompts for it.


In [None]:
# Set your Tavily API key here
import getpass

# Prompt for the key without echoing it (works in Jupyter and Google Colab)
if 'TAVILY_API_KEY' not in os.environ:
    os.environ['TAVILY_API_KEY'] = getpass.getpass('Enter your Tavily API key: ')

# Alternative: Set directly (uncomment and add your key)
# os.environ["TAVILY_API_KEY"] = "your_tavily_api_key_here"

print("API key set successfully!")

## Initialize TavilyCrawl

Initialize TavilyCrawl and set the target URL for the demonstration.

In [None]:
# Initialize TavilyCrawl
tavily_crawl = TavilyCrawl()

# Target URL: LangChain Documentation
target_url = "https://python.langchain.com/"

console.print(Panel.fit(
    f"Target Website: {target_url}\nCrawler: TavilyCrawl",
    title="Demo Setup",
    border_style="bright_blue"
))

print("TavilyCrawl initialized successfully")

## Demo 1: Crawl Without Instructions

Crawl without specific instructions to show baseline behavior on the LangChain documentation.

In [None]:
# Demo 1: Crawl without instructions
console.print(Panel.fit(
    f"Target: {target_url}\nInstructions: None (baseline crawl)\nMax Depth: 1\nExtract Depth: advanced",
    title="Demo 1: Crawl Without Instructions",
    border_style="blue"
))

console.print("Running TavilyCrawl without instructions...", style="blue")

# Basic crawl without instructions
basic_result = tavily_crawl.invoke({
    "url": target_url,
    "max_depth": 1,
    "extract_depth": "advanced"
})

# Show raw output immediately
console.print(basic_result)

# Extract results for analysis
basic_results = basic_result.get("results", [])

# The next cell displays a formatted preview of these results


In [None]:
console.print(f"\nResults Without Instructions: {len(basic_results)} pages", style="cyan")
console.print("   Mix of all content types from LangChain docs")
console.print("   No filtering - everything from the crawled sections")
console.print("   Requires manual work to find specific content")

console.print("\nSample Results from Basic Crawl (No Filtering):\n", style="cyan")

for i, result in enumerate(basic_results[:3], 1):  # Show first 3 results
    url = result.get("url", "No URL")
    # Fall back to a placeholder when raw_content is missing or None
    content = (result.get("raw_content") or "No content")[:150] + "..."
    
    panel_content = f"""URL: {url}

Content Preview:
{content}"""
    
    console.print(Panel(
        panel_content,
        title=f"{i}. {url}",
        border_style="blue"
    ))
    print()

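# Avoid printing a negative count when fewer than 3 pages were returned
remaining = max(len(basic_results) - 3, 0)
console.print(f"... and {remaining} more mixed results", style="italic cyan")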
console.print("Note: Mixed content types - guides, integrations, concepts, etc.", style="cyan")
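
To check how mixed the baseline results are, one option is to group the returned URLs by their leading path segments. The sketch below is a minimal example, assuming each result is a dictionary with a `url` key as in the cells above. The helper name `count_by_section` and the sample URLs are hypothetical and used only for illustration. In the notebook you would pass `basic_results` instead.

```python
from collections import Counter
from typing import Any, Dict, List
from urllib.parse import urlparse


def count_by_section(results: List[Dict[str, Any]], depth: int = 2) -> Counter:
    """Count results by the first `depth` path segments of each URL (hypothetical helper)."""
    counts: Counter = Counter()
    for result in results:
        path = urlparse(result.get("url") or "").path
        segments = [segment for segment in path.split("/") if segment]
        # URLs with no path segments are grouped under "(root)"
        section = "/".join(segments[:depth]) or "(root)"
        counts[section] += 1
    return counts


# Made-up sample data; real URLs would come from basic_results.
sample_results = [
    {"url": "https://example.com/docs/integrations/providers/"},
    {"url": "https://example.com/docs/integrations/tools/"},
    {"url": "https://example.com/docs/concepts/"},
    {"url": "https://example.com/"},
]

section_counts = count_by_section(sample_results)
for section, count in section_counts.most_common():
    print(f"{section}: {count}")
```

A wide spread of sections in the output suggests the crawl returned many content types, which is the situation instructions are meant to narrow.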

## Demo 2: Crawl With Instructions

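Natural language instructions tell the crawler which paths and pages to prioritize, so the results focus on the requested topic rather than everything reachable from the start URL. This demo also raises `max_depth` from 1 to 3, so the crawl can reach pages that Demo 1 could not.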
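
The baseline and guided crawls differ only in the dictionary passed to `invoke`. As a minimal sketch, assuming the four keys this tutorial uses (`url`, `instructions`, `max_depth`, `extract_depth`), a small helper can build both payloads so the differences between the runs are explicit. The name `build_crawl_params` is hypothetical and is not part of `langchain-tavily`.

```python
from typing import Any, Dict, Optional


def build_crawl_params(
    url: str,
    instructions: Optional[str] = None,
    max_depth: int = 1,
    extract_depth: str = "advanced",
) -> Dict[str, Any]:
    """Build the dictionary passed to tavily_crawl.invoke (hypothetical helper)."""
    params: Dict[str, Any] = {
        "url": url,
        "max_depth": max_depth,
        "extract_depth": extract_depth,
    }
    # Include instructions only when non-empty, so the baseline crawl stays unguided.
    if instructions and instructions.strip():
        params["instructions"] = instructions.strip()
    return params


baseline_params = build_crawl_params("https://python.langchain.com/")
guided_params = build_crawl_params(
    "https://python.langchain.com/",
    instructions="Find all pages about ai agents",
    max_depth=3,
)
print(baseline_params)
print(guided_params)
```

Either dictionary could then be passed to `tavily_crawl.invoke(...)` in the same way as the cells in this tutorial.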

In [None]:
instructions = "Find all pages about AI agents"

console.print(Panel.fit(
    f"Target: {target_url} (same as Demo 1)\nInstructions: {instructions}\nType: Specific, action-oriented\nMax Depth: 3\nExtract Depth: advanced",
    title="Demo 2: Crawl With Instructions", 
    border_style="green"
))

console.print("Starting crawl with instructions...", style="green")
console.print("Instructions will guide the AI to target specific content", style="italic")

In [None]:
# Execute the crawl with instructions
result_with_instructions = tavily_crawl.invoke({
    "url": target_url,
    "instructions": instructions,
    "max_depth": 3,
    "extract_depth": "advanced"
})

# Show raw output immediately
console.print("\nRaw TavilyCrawl Output:", style="yellow")
console.print(result_with_instructions)

console.print("\nCrawl with instructions completed", style="green")

# Show the results of instruction-based filtering
results_with_instructions = result_with_instructions.get("results", [])
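
Crawls can take a while, so it may help to save the results before analyzing them. The sketch below assumes the results are a list of JSON-serializable dictionaries, as the raw output above suggests. The helper names `save_results` and `load_results`, the file name, and the sample record are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def save_results(results: List[Dict[str, Any]], path: Path) -> int:
    """Write crawl results to a JSON file and return how many were saved."""
    path.write_text(json.dumps(results, indent=2, ensure_ascii=False), encoding="utf-8")
    return len(results)


def load_results(path: Path) -> List[Dict[str, Any]]:
    """Read crawl results back from a JSON file."""
    return json.loads(path.read_text(encoding="utf-8"))


# Made-up sample record; in the notebook, pass results_with_instructions instead.
sample = [{"url": "https://example.com/docs/agents", "raw_content": "Agents overview"}]
out_path = Path(tempfile.gettempdir()) / "crawl_results_sample.json"

saved = save_results(sample, out_path)
reloaded = load_results(out_path)
print(f"Saved {saved} result(s) to {out_path}")
```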

In [None]:
# Display the targeted agent documentation found
console.print("\nLangChain Agent Documentation Found:\n", style="green")

for i, result in enumerate(results_with_instructions, 1):
    url = result.get("url", "No URL")
    # Fall back to a placeholder when raw_content is missing or None
    content = (result.get("raw_content") or "No content")[:200] + "..."
    
    panel_content = f"""URL: {url}

Content Preview:
{content}"""
    
    console.print(Panel(
        panel_content,
        title=f"{i}. {url}",
        border_style="green"
    ))
    print()

console.print("Note: Results should focus on agents in LangChain; spot-check the URLs above to confirm", style="green")

## Comparison of Both Approaches

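Compare the two crawls to see how instructions change the results. The runs also differ in `max_depth` (1 versus 3), so treat the page counts as indicative rather than as a controlled measurement.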

In [None]:
# Create comparison table
comparison_table = Table(title="TavilyCrawl: Instruction Quality Comparison")
comparison_table.add_column("Approach", style="cyan", no_wrap=True)
comparison_table.add_column("Instructions", style="yellow")
comparison_table.add_column("Pages Found", style="blue")
comparison_table.add_column("Content Quality", style="green")
comparison_table.add_column("Usefulness", style="red")

comparison_table.add_row(
    "No Instructions",
    "None (baseline)",
    f"{len(basic_results)}",
    "Mixed (all types)",
    "Low (requires filtering)"
)

comparison_table.add_row(
    "With Instructions",
    instructions,
    f"{len(results_with_instructions)}",
    "Highly targeted",
    "High (ready to use)"
)

console.print(comparison_table)

console.print("\nKey Observations:", style="blue")
console.print("   Without instructions, the crawl returns everything, requiring manual filtering")
console.print("   Instructions provide highly targeted, ready-to-use results")
console.print("   Best practice: Use specific, action-oriented instructions")

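console.print("\nEfficiency with Instructions:", style="green")
if basic_results:  # guard against division by zero when the baseline crawl returned nothing
    reduction = (len(basic_results) - len(results_with_instructions)) / len(basic_results) * 100
    console.print(f"   Filtering efficiency: {reduction:.1f}% fewer pages than the baseline crawl")
else:
    console.print("   Filtering efficiency: not available (baseline crawl returned no pages)")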
console.print("   Time saved: No manual post-processing required")
console.print("   AI-powered: Intelligent path selection and content filtering")
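
The page-count reduction above does not measure relevance on its own. As a rough check, the sketch below computes the share of results whose URL or `raw_content` mentions a keyword. It assumes each result is a dictionary with `url` and `raw_content` keys, as in the earlier cells. The helper `keyword_hit_rate` and the sample data are hypothetical. In the notebook you would pass `basic_results` and `results_with_instructions`.

```python
from typing import Any, Dict, List


def keyword_hit_rate(results: List[Dict[str, Any]], keyword: str) -> float:
    """Fraction of results whose URL or raw_content mentions `keyword` (case-insensitive)."""
    if not results:
        return 0.0
    needle = keyword.lower()
    hits = 0
    for result in results:
        url = (result.get("url") or "").lower()
        content = (result.get("raw_content") or "").lower()
        if needle in url or needle in content:
            hits += 1
    return hits / len(results)


# Made-up sample data for illustration only.
sample_basic = [
    {"url": "https://example.com/docs/integrations", "raw_content": "Provider list"},
    {"url": "https://example.com/docs/agents", "raw_content": "Build an agent"},
    {"url": "https://example.com/docs/concepts", "raw_content": None},
    {"url": "https://example.com/docs/tutorials", "raw_content": "Chat models"},
]
sample_guided = [
    {"url": "https://example.com/docs/agents", "raw_content": "Build an agent"},
    {"url": "https://example.com/docs/how_to/tools", "raw_content": "Agents call tools"},
]

basic_rate = keyword_hit_rate(sample_basic, "agent")
guided_rate = keyword_hit_rate(sample_guided, "agent")
print(f"baseline: {basic_rate:.0%}, guided: {guided_rate:.0%}")
```

A substring match is a crude proxy for relevance, so read the rates as a sanity check on the comparison table, not as a quality score.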