# 🕸️ Modern Web Data Ingestion with CAMEL

<div class="align-center">
  <a href="https://www.camel-ai.org/"><img src="https://i.postimg.cc/KzQ5rfBC/button.png" width="150"></a>
  <a href="https://discord.camel-ai.org"><img src="https://i.postimg.cc/L4wPdG9N/join-2.png" width="150"></a>

⭐ <i>Star us on [*GitHub*](https://github.com/camel-ai/camel), join our [*Discord*](https://discord.camel-ai.org) or follow our [*X*](https://x.com/camelaiorg)</i>
</div>

This notebook demonstrates **3 modern ways** to ingest data from websites using CAMEL. Getting clean, LLM-ready data from the messy web is often the first bottleneck in building AI agents. 

We will cover:

1.  **Firecrawl**: A powerful external service that turns any website into clean Markdown.
2.  **Crawl4AI**: An open-source, local-first web crawler built on asyncio.
3.  **Unstructured.io**: A robust library for parsing complex documents and HTML.

By the end of this cookbook, you will know how to programmatically read websites for your RAG pipelines or Agent contexts.

## 📦 Installation

Install CAMEL and the necessary loader libraries.

In [None]:
%pip install "camel-ai[all]" 
!playwright install chromium
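After installing, it can help to confirm that the loader backends are importable before running the rest of the notebook. The sketch below is one possible check using only the standard library; the module names in `wanted` are assumptions about how the packages are imported and may need adjusting for your environment.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names; adjust them if your installed packages differ.
wanted = ["camel", "firecrawl", "crawl4ai", "unstructured"]
print("Missing modules:", missing_modules(wanted) or "none")
```

If anything is reported as missing, rerunning the install cell above is the first thing to try.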

## 🔑 Setting Up API Keys

Firecrawl requires an API key. You can get one from [firecrawl.dev](https://firecrawl.dev). Crawl4AI and Unstructured (local mode) do not require keys.

In [None]:
import os
from getpass import getpass

# Prompt for Firecrawl API key if you plan to use it
firecrawl_key = getpass('Enter your Firecrawl API key (optional if using others): ')
if firecrawl_key:
    os.environ["FIRECRAWL_API_KEY"] = firecrawl_key

## Method 1: Firecrawl (Service-Based)

Firecrawl is excellent when you need high-quality Markdown conversion and don't want to manage browser infrastructure. It handles dynamic content, infinite scrolls, and complex layouts automatically.

In [None]:
from camel.loaders import Firecrawl

# Initialize Firecrawl
firecrawl = Firecrawl()

target_url = "https://www.camel-ai.org"

# Scrape a single URL
try:
    # scrape() is expected to return a dict with keys such as 'markdown' and 'metadata'
    result = firecrawl.scrape(url=target_url)
    
    print("\n--- Metadata ---")
    print(result.get('metadata', {}))
    
    print("\n--- Markdown Content (Snippet) ---")
    print(result.get('markdown', '')[:500] + "...")
except Exception as e:
    print(f"Firecrawl error: {e}")
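The snippet printing above always appends `...`, even when the content is shorter than the limit, and it would fail if the `markdown` value were `None`. A small helper can make previews more forgiving. This is a minimal sketch: `preview` is a hypothetical name, and the `sample` dict is a stand-in, not a real Firecrawl response.

```python
from typing import Any, Mapping


def preview(result: Mapping[str, Any], key: str = "markdown", limit: int = 500) -> str:
    """Return at most `limit` characters of result[key], marking truncation."""
    text = result.get(key) or ""  # tolerate a missing key or a None value
    if not isinstance(text, str):
        text = str(text)
    if len(text) <= limit:
        return text
    return text[:limit] + "..."


# Stand-in payload for illustration; a real result would come from the loader.
sample = {"markdown": "# CAMEL\n" + "x" * 600, "metadata": {"title": "CAMEL"}}
print(preview(sample, limit=20))
print(repr(preview({}, limit=20)))
```

The same helper works for any of the dict results in this notebook, for example `preview(result, "markdown")`.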

## Method 2: Crawl4AI (Local Async Crawling)

Crawl4AI is a powerful open-source crawler that runs locally. It's fast, asynchronous, and free, which makes it useful for high-volume scraping when you have local compute resources to spare.

In [None]:
from camel.loaders import Crawl4AI
import asyncio
import sys
from concurrent.futures import ThreadPoolExecutor

# Helper to run async code in a worker thread with its own event loop
# (safe inside Jupyter, where a loop is already running, and on Windows)
def run_in_proactor_thread(coro):
    """
    Jupyter already runs an event loop in the main thread, so calling
    asyncio.run() there raises a RuntimeError. On Windows, that loop is
    also often a SelectorEventLoop, which is incompatible with Playwright.
    This helper runs the coroutine in a separate thread with a fresh loop
    (a ProactorEventLoop on Windows).
    """
    def _target():
        if sys.platform == 'win32':
            asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())
        return asyncio.run(coro)

    with ThreadPoolExecutor(max_workers=1) as executor:
        return executor.submit(_target).result()

async def crawl_task():
    crawler = Crawl4AI()
    target_url = "https://www.camel-ai.org"
    return await crawler.scrape(url=target_url)

# Execute
try:
    result_dict = run_in_proactor_thread(crawl_task())
    
    print("\n--- Cleaned HTML (Snippet) ---")
    print(result_dict.get('cleaned_html', '')[:300] + "...")
    
    print("\n--- Markdown Content (Snippet) ---")
    print(result_dict.get('markdown', '')[:500] + "...")
except Exception as e:
    print(f"Crawl4AI error: {e}")
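For high-volume scraping, you would typically crawl several URLs at once while capping how many run in parallel. The sketch below shows one way to do that with `asyncio.gather` and a semaphore. It is an illustration under assumptions: `crawl_many` and `fake_scrape` are hypothetical names, and `fake_scrape` is a stand-in that performs no network access. In practice you could pass a real async scrape callable in its place.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Awaitable, Callable, Dict, List


async def crawl_many(
    urls: List[str],
    scrape: Callable[[str], Awaitable[dict]],
    max_concurrency: int = 3,
) -> Dict[str, dict]:
    """Scrape all URLs concurrently, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def _one(url: str) -> dict:
        async with semaphore:
            try:
                return await scrape(url)
            except Exception as exc:
                # Record the failure instead of aborting the whole batch.
                return {"error": str(exc)}

    results = await asyncio.gather(*(_one(u) for u in urls))
    return dict(zip(urls, results))


async def fake_scrape(url: str) -> dict:
    """Hypothetical stand-in for a real scraper; no network access."""
    await asyncio.sleep(0.01)
    if "bad" in url:
        raise ValueError("unreachable")
    return {"markdown": f"# Page at {url}"}


demo_urls = ["https://example.com/a", "https://example.com/bad", "https://example.com/b"]

# Run in a worker thread so this also works where an event loop is already running.
with ThreadPoolExecutor(max_workers=1) as pool:
    results = pool.submit(
        asyncio.run, crawl_many(demo_urls, fake_scrape, max_concurrency=2)
    ).result()

for url, res in results.items():
    print(url, "->", res)
```

Catching exceptions per URL keeps one failing page from discarding the results of the others, which matters more as the batch grows.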

## Method 3: Unstructured (Universal Parser)

Unstructured is technically a document parser, but it handles HTML URLs exceptionally well. It partitions the page into semantic elements (Title, NarrativeText, ListItem), which is great for RAG chunking.

In [None]:
from camel.loaders import UnstructuredIO

target_url = "https://www.camel-ai.org"

# Unstructured parses the URL into a list of 'Elements'
elements = UnstructuredIO.parse_file_or_url(target_url)

print(f"Found {len(elements)} elements on the page.\n")

# Let's inspect the first few elements
for el in elements[:10]:
    print(f"Type: {type(el).__name__} | Text: {el.text}")

# We can also join the element texts and clean them using Unstructured's cleaning utils

full_text = "\n".join([el.text for el in elements])
cleaned_text = UnstructuredIO.clean_text_data(full_text)

print("\n--- Cleaned Text Snippet ---")
print(cleaned_text[:500])
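To illustrate why semantic elements help with RAG chunking, the sketch below groups a sequence of `(type name, text)` pairs into sections that each start at a `Title`. It is a plain-Python illustration, not Unstructured's own chunker: `group_by_title` is a hypothetical helper and `sample` is made-up data. With real elements, the pairs could be built as `[(type(el).__name__, el.text) for el in elements]`.

```python
from typing import Iterable, List, Tuple


def group_by_title(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Group (type_name, text) pairs into sections, starting a new one at each Title."""
    sections: List[List[str]] = []
    for kind, text in pairs:
        text = text.strip()
        if not text:
            continue  # skip empty elements
        if kind == "Title" or not sections:
            sections.append([])  # open a new section
        sections[-1].append(text)
    return ["\n".join(parts) for parts in sections]


# Made-up sample data for illustration only.
sample = [
    ("NarrativeText", "Intro before any heading."),
    ("Title", "Agents"),
    ("NarrativeText", "Agents talk to each other."),
    ("ListItem", "Role-playing"),
    ("Title", "Data"),
    ("NarrativeText", ""),
]

for section in group_by_title(sample):
    print(repr(section))
```

Each resulting section keeps a heading together with the text beneath it, which tends to give retrieval chunks that stand on their own.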

## 🏁 Conclusion

You now have 3 robust ways to ingest web data:

| Method | Best For | Pros | Cons |
| :--- | :--- | :--- | :--- |
| **Firecrawl** | Production RAG | High quality Markdown, handles dynamic JS | Paid component |
| **Crawl4AI** | High volume, Free | Fast, Async, Open Source | Requires local resources |
| **Unstructured** | Granular Chunking | Semantic partitioning (Title vs Text) | Slower than raw scraping |

Choose the one that fits your pipeline's infrastructure and budget!
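If your pipeline can tolerate more than one backend, the methods can also be combined as a fallback chain: try the preferred loader first and move on when it fails or returns nothing. The sketch below shows the pattern with stub loaders. All names here (`ingest_with_fallback`, `paid_service`, `local_crawler`) are hypothetical, and the stubs stand in for real calls to the loaders shown earlier.

```python
from typing import Callable, List, Tuple

Loader = Callable[[str], str]


def ingest_with_fallback(url: str, loaders: List[Tuple[str, Loader]]) -> Tuple[str, str]:
    """Try each (name, loader) in order; return the first non-empty result."""
    errors: List[str] = []
    for name, load in loaders:
        try:
            text = load(url)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text
        errors.append(f"{name}: empty result")
    raise RuntimeError("all loaders failed: " + "; ".join(errors))


# Hypothetical stub loaders; replace them with wrappers around real loaders.
def paid_service(url: str) -> str:
    raise ConnectionError("no API key configured")


def local_crawler(url: str) -> str:
    return f"# Markdown for {url}"


used, text = ingest_with_fallback(
    "https://www.camel-ai.org",
    [("paid_service", paid_service), ("local_crawler", local_crawler)],
)
print(f"Loaded via {used}: {text}")
```

Collecting the per-loader errors makes it easier to see why a page could not be ingested when every backend fails.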

That's everything! Got questions about 🐫 CAMEL-AI? Join us on [Discord](https://discord.camel-ai.org)! Whether you want to share feedback, explore the latest in multi-agent systems, get support, or connect with others on exciting projects, we'd love to have you in the community! 🤝