# 🗺️ TavilyMap & TavilyExtract Tutorial

> **📚 Part of the LangChain Course: Building AI Agents & RAG Apps**  
> [🎓 Get the full course](https://www.udemy.com/course/langchain/?referralCode=D981B8213164A3EA91AC)


This notebook demonstrates two powerful tools from Tavily AI:
- **TavilyMap**: Automatically discovers and maps website structures
- **TavilyExtract**: Extracts clean, structured content from web pages

Perfect for documentation scraping, research, and content extraction! 🚀

---


## 📦 Setup & Installation

First, let's install the required packages and set up our environment.


In [None]:
# Install required packages
!pip install langchain-tavily certifi

# For pretty printing and visualization
!pip install rich pandas


In [None]:
import asyncio
import os
import ssl
from typing import Any, Dict, List

import certifi
from langchain_tavily import TavilyExtract, TavilyMap
from rich.console import Console
from rich.panel import Panel

# Configure SSL context
ssl_context = ssl.create_default_context(cafile=certifi.where())
os.environ["SSL_CERT_FILE"] = certifi.where()
os.environ["REQUESTS_CA_BUNDLE"] = certifi.where()

# Initialize rich console for pretty printing
console = Console()


print("✅ All imports successful!")


## 🔑 API Key Setup

You'll need a Tavily API key to use these tools. Get yours at [tavily.com](https://app.tavily.com/home?utm_campaign=eden_marco&utm_medium=socials&utm_source=linkedin).

Set the `TAVILY_API_KEY` environment variable so the tools can authenticate.

In [None]:


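# Read the key from the environment and prompt for it if it is not set.
# Never hard-code a real API key in a notebook that may be shared or committed.
from getpass import getpass

if not os.environ.get("TAVILY_API_KEY"):
    os.environ["TAVILY_API_KEY"] = getpass("Enter your Tavily API key: ")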

## 🗺️ TavilyMap: Website Structure Discovery

TavilyMap automatically discovers and maps website structures by crawling through links. It's perfect for:
- Documentation sites
- Blog archives
- Knowledge bases
- Any structured website

### Key Parameters:
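- `max_depth`: How many link levels away from the start URL to crawl
- `max_breadth`: Maximum number of links to follow on each page
- `max_pages`: Maximum total number of pages to discover

The demo below sets all three explicitly rather than relying on defaults.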
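
To make the three limits concrete, here is a toy sketch of a breadth-first walk over an in-memory link graph. It is not Tavily's implementation: the `bounded_crawl` function and the `toy_site` data are hypothetical, and it assumes the start page counts toward the page limit. It only illustrates how depth, breadth, and a total page cap interact.

```python
from collections import deque
from typing import Dict, List


def bounded_crawl(
    links: Dict[str, List[str]],
    start: str,
    max_depth: int,
    max_breadth: int,
    max_pages: int,
) -> List[str]:
    """Breadth-first walk over an in-memory link graph with three limits."""
    discovered = [start]          # assumption: the start page counts toward max_pages
    seen = {start}
    queue = deque([(start, 0)])   # (page, depth of that page)

    while queue and len(discovered) < max_pages:
        page, depth = queue.popleft()
        if depth >= max_depth:
            continue              # do not follow links beyond the depth limit
        # Only the first `max_breadth` links on each page are followed.
        for target in links.get(page, [])[:max_breadth]:
            if target in seen:
                continue
            seen.add(target)
            discovered.append(target)
            queue.append((target, depth + 1))
            if len(discovered) >= max_pages:
                break
    return discovered


# Hypothetical toy site: each key links to the pages in its list.
toy_site = {
    "/": ["/docs", "/blog", "/about"],
    "/docs": ["/docs/a", "/docs/b"],
    "/docs/a": ["/docs/a/deep"],
}

print(bounded_crawl(toy_site, "/", max_depth=1, max_breadth=2, max_pages=10))
# ['/', '/docs', '/blog']
print(bounded_crawl(toy_site, "/", max_depth=3, max_breadth=3, max_pages=4))
# ['/', '/docs', '/blog', '/about']
```

In the first call the breadth limit drops `/about` and the depth limit stops the walk after one level. In the second call the page cap ends the walk before any second-level page is reached.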


In [None]:
# Initialize TavilyMap with custom settings
tavily_map = TavilyMap(
    max_depth=3,        # Crawl up to 3 levels deep
    max_breadth=15,     # Follow up to 15 links per page
    max_pages=50        # Limit to 50 total pages for demo
)

print("✅ TavilyMap initialized successfully!")


### 📊 Demo: Mapping a Documentation Site

Let's map the structure of a popular documentation site. We'll use the LangChain documentation as an example.


In [None]:
# Example website to map
demo_url = "https://python.langchain.com/docs/introduction/"

console.print(f"🔍 Mapping website structure for: {demo_url}", style="bold blue")
console.print("This may take a moment...")

# Map the website structure
site_map = tavily_map.invoke(demo_url)

# Display results
urls = site_map.get('results', [])
console.print(f"\n✅ Successfully mapped {len(urls)} URLs!", style="bold green")

# Show the first few URLs as examples
preview_count = 10
console.print(f"\n📋 First {min(preview_count, len(urls))} discovered URLs:", style="bold yellow")
for i, url in enumerate(urls[:preview_count], 1):
    console.print(f"  {i:2d}. {url}")

# Report how many URLs were left out of the preview
if len(urls) > preview_count:
    console.print(f"  ... and {len(urls) - preview_count} more URLs")
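
A mapped URL list often contains duplicates that differ only by a `#fragment` or a trailing slash, plus pages outside the section you care about. The sketch below shows one way to clean such a list with the standard library before extraction. The `filter_urls` helper and the `sample` list are hypothetical stand-ins for the `urls` returned above, and stripping trailing slashes assumes the site serves both forms identically.

```python
from typing import Iterable, List
from urllib.parse import urldefrag, urlparse


def filter_urls(urls: Iterable[str], path_prefix: str) -> List[str]:
    """Drop fragments and duplicates, keep URLs whose path starts with path_prefix."""
    kept: List[str] = []
    seen = set()
    for raw in urls:
        url, _fragment = urldefrag(raw)   # strip "#section" anchors
        url = url.rstrip("/")             # treat ".../page" and ".../page/" as one URL
        if url in seen:
            continue
        seen.add(url)
        if urlparse(url).path.startswith(path_prefix):
            kept.append(url)
    return kept


# Hypothetical sample standing in for the mapped `urls` list.
sample = [
    "https://python.langchain.com/docs/introduction/",
    "https://python.langchain.com/docs/introduction/#overview",
    "https://python.langchain.com/docs/tutorials/rag",
    "https://python.langchain.com/api_reference/",
]
print(filter_urls(sample, "/docs/"))
```

Here the fragment variant collapses into the first URL and the `/api_reference/` page is dropped, leaving two `/docs/` pages.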


## 🔍 TavilyExtract: Clean Content Extraction

TavilyExtract takes URLs and returns clean, structured content without ads, navigation, or other noise. Perfect for:
- Documentation processing
- Content analysis
- Research and data collection
- Building knowledge bases

### Key Features:
- Removes HTML markup and navigation
- Extracts main content only
- Handles JavaScript-rendered content
- Batch processing support

In [None]:
# Initialize TavilyExtract
tavily_extract = TavilyExtract()

print("✅ TavilyExtract initialized successfully!")


### 📄 Demo: Extracting Content from URLs

Let's extract clean content from some of the URLs we discovered earlier.


In [None]:
# Select a few of the discovered URLs for extraction
sample_urls = urls[:5]  # Take the first 5 URLs (fewer if the map returned fewer)
console.print(f"📚 Extracting content from {len(sample_urls)} URLs...", style="bold blue")

# Extract content (top-level await works inside a Jupyter notebook)
extraction_result = await tavily_extract.ainvoke(input={"urls": sample_urls})

# Display results
extracted_docs = extraction_result.get('results', [])
console.print(f"\n✅ Successfully extracted {len(extracted_docs)} documents!", style="bold green")

# Show summary of each extracted document
for i, doc in enumerate(extracted_docs, 1):
    url = doc.get('url', 'Unknown')
    content = doc.get('raw_content') or ''  # Tolerate missing or empty content

    # Create a panel for each document, previewing only the first 500 characters
    preview = content[:500]
    panel_content = f"""URL: {url}
Content Length: {len(content):,} characters
Preview: {preview}..."""

    console.print(Panel(panel_content, title=f"Document {i}", border_style="blue"))
    print()  # Add spacing
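
Once documents are extracted, a compact per-document summary is often easier to scan than full panels. The following is a minimal sketch that assumes each result is a dict with `url` and `raw_content` keys, as used in the cell above. The `summarize_results` helper and the `fake_docs` data are hypothetical.

```python
from typing import Any, Dict, List


def summarize_results(docs: List[Dict[str, Any]], preview_chars: int = 80) -> List[Dict[str, Any]]:
    """Build one small summary row per extracted document."""
    rows = []
    for doc in docs:
        content = doc.get("raw_content") or ""   # tolerate missing or None content
        words = content.split()
        rows.append({
            "url": doc.get("url", "Unknown"),
            "chars": len(content),
            "words": len(words),
            # Collapse whitespace so the preview fits on one line.
            "preview": " ".join(words)[:preview_chars],
        })
    return rows


# Hypothetical data in the same shape as `extracted_docs`.
fake_docs = [
    {"url": "https://example.com/a", "raw_content": "Hello   world.\nThis is a page."},
    {"url": "https://example.com/b", "raw_content": None},
]
for row in summarize_results(fake_docs):
    print(row)
```

The resulting list of dicts can be passed straight to `pandas.DataFrame(...)` if you prefer a table view.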


### ⚡ Batch Processing Demo

For larger datasets, we can split the URLs into batches and run the batches concurrently. This keeps each request small and confines a failure to a single batch instead of the whole run.


In [None]:
def chunk_urls(urls: List[str], chunk_size: int = 3) -> List[List[str]]:
    """Split URLs into chunks of specified size."""
    chunks = []
    for i in range(0, len(urls), chunk_size):
        chunk = urls[i:i + chunk_size]
        chunks.append(chunk)
    return chunks

async def extract_batch(urls: List[str], batch_num: int) -> List[Dict[str, Any]]:
    """Extract documents from a batch of URLs."""
    try:
        console.print(f"🔄 Processing batch {batch_num} with {len(urls)} URLs", style="blue")
        docs = await tavily_extract.ainvoke(input={"urls": urls})
        results = docs.get('results', [])
        console.print(f"✅ Batch {batch_num} completed - extracted {len(results)} documents", style="green")
        return results
    except Exception as e:
        console.print(f"❌ Batch {batch_num} failed: {e}", style="red")
        return []

# Process a larger set of URLs in batches
batch_urls = urls[:9]  # Take up to 9 URLs for the batch demo
url_batches = chunk_urls(batch_urls, chunk_size=3)  # Split into batches of 3

console.print(f"📦 Processing {len(batch_urls)} URLs in {len(url_batches)} batches", style="bold yellow")

# Process batches concurrently
tasks = [extract_batch(batch, i + 1) for i, batch in enumerate(url_batches)]
batch_results = await asyncio.gather(*tasks)

# Flatten results
all_extracted = []
for batch_result in batch_results:
    all_extracted.extend(batch_result)

console.print(f"\n🎉 Batch processing complete! Total documents extracted: {len(all_extracted)}", style="bold green")
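
`asyncio.gather` starts every batch at once, which can hit API rate limits when there are many batches. One common way to cap the number of requests in flight is an `asyncio.Semaphore`. The sketch below is a hedged illustration, not part of the Tavily API: `gather_limited` and `fake_extract` are hypothetical names, and `fake_extract` stands in for a real extraction call so the example runs without network access.

```python
import asyncio
from typing import Awaitable, Callable, List


async def gather_limited(
    batches: List[List[str]],
    worker: Callable[[List[str]], Awaitable[List[dict]]],
    max_concurrent: int = 2,
) -> List[dict]:
    """Run `worker` on every batch with at most `max_concurrent` running at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(batch: List[str]) -> List[dict]:
        async with semaphore:             # wait here while the limit is reached
            return await worker(batch)

    per_batch = await asyncio.gather(*(run_one(b) for b in batches))
    # Flatten the per-batch lists; gather preserves the input order.
    return [doc for docs in per_batch for doc in docs]


async def fake_extract(batch: List[str]) -> List[dict]:
    """Hypothetical stand-in for a real extraction call."""
    await asyncio.sleep(0.01)
    return [{"url": u, "raw_content": f"content of {u}"} for u in batch]


if __name__ == "__main__":
    demo_batches = [["a", "b"], ["c"], ["d", "e"]]
    docs = asyncio.run(gather_limited(demo_batches, fake_extract, max_concurrent=2))
    print([d["url"] for d in docs])
```

Inside a notebook, where an event loop is already running, call `await gather_limited(...)` directly instead of `asyncio.run(...)`.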


## 🎯 Real-World Use Cases

Here are some practical applications of TavilyMap and TavilyExtract:

### 1. Documentation Scraping
- Map entire documentation sites
- Extract clean content for search indexes
- Build knowledge bases from existing docs

### 2. Competitive Analysis
- Map competitor websites
- Extract product information
- Monitor content changes

### 3. Research & Content Collection
- Gather information from multiple sources
- Build datasets for analysis
- Create content archives

### 4. SEO & Site Analysis
- Discover all pages on a site
- Analyze content structure
- Identify content gaps


## 🎯 Conclusion

This tutorial demonstrated the power of TavilyMap and TavilyExtract for automated web content discovery and extraction:

### Key Takeaways:

1. **TavilyMap** is perfect for:
   - Discovering website structures
   - Finding all pages on a site
   - Site auditing

2. **TavilyExtract** excels at:
   - Clean content extraction
   - Removing HTML noise
   - Batch processing
   - Structured data collection

3. **Combined** they enable:
   - Complete documentation scraping
   - Automated content pipelines
   - Knowledge base creation
   - Research automation

### Next Steps:
- Integrate with vector databases for semantic search
- Add content filtering and classification
- Build monitoring systems for content changes
- Create automated reporting dashboards
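
As a starting point for the change-monitoring idea, the sketch below fingerprints extracted content with a hash and reports which URLs are new or changed since a previous run. It assumes documents shaped like the extraction results used earlier (`url` and `raw_content` keys). The `content_fingerprint` and `changed_urls` helpers are hypothetical, and how you persist the previous fingerprints is left open.

```python
import hashlib
from typing import Dict, List


def content_fingerprint(text: str) -> str:
    """Return the SHA-256 hex digest of whitespace-normalized text."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def changed_urls(previous: Dict[str, str], docs: List[dict]) -> List[str]:
    """Return URLs that are new or whose fingerprint differs from `previous`."""
    changed = []
    for doc in docs:
        url = doc.get("url", "")
        digest = content_fingerprint(doc.get("raw_content") or "")
        if previous.get(url) != digest:
            changed.append(url)
    return changed


# Hypothetical fingerprints saved from an earlier run.
previous = {"https://example.com/a": content_fingerprint("old text")}

# Hypothetical latest extraction results.
latest = [
    {"url": "https://example.com/a", "raw_content": "old   text\n"},
    {"url": "https://example.com/b", "raw_content": "brand new page"},
]
print(changed_urls(previous, latest))
```

Normalizing whitespace before hashing means purely cosmetic spacing differences do not count as a change, so only the new page is reported here.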

---

**Happy scraping!** 🚀