<img src="https://drive.google.com/uc?export=view&id=1wYSMgJtARFdvTt5g7E20mE4NmwUFUuog" width="200">

[![Build Fast with AI](https://img.shields.io/badge/BuildFastWithAI-GenAI%20Bootcamp-blue?style=for-the-badge&logo=artificial-intelligence)](https://www.buildfastwithai.com/genai-course)
[![EduChain GitHub](https://img.shields.io/github/stars/satvik314/educhain?style=for-the-badge&logo=github&color=gold)](https://github.com/satvik314/educhain)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/12TOxopyRs1xfFg8nb542U9wTehSLouzL#scrollTo=EDJ18EXZ5BUZ)
## Master Generative AI in 6 Weeks

**What You'll Learn:**
- Build with Latest LLMs
- Create Custom AI Apps
- Learn from Industry Experts
- Join Innovation Community

Transform your AI ideas into reality through hands-on projects and expert mentorship.

[Start Your Journey](https://www.buildfastwithai.com/genai-course)

*Empowering the Next Generation of AI Innovators*

## 🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper


🕷️🤖 Crawl4AI is an open-source Python library that simplifies web crawling and the extraction of useful information from web pages. This notebook walks through installation, a basic crawl, dynamic content, content cleaning, link and media handling, and LLM-based extraction.



### **Quickstart with Crawl4AI**

#### **Installation**
Install Crawl4AI and necessary dependencies:

In [1]:
%%capture
!pip install -U crawl4ai
!pip install nest_asyncio
!playwright install

#### **Setting Up API Keys**

In [5]:
import os
from google.colab import userdata
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
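
The cell above relies on `google.colab.userdata`, which only exists inside Colab. For other environments, a minimal sketch of a fallback is to read the key from the process environment and fail early if it is missing. The helper name `require_env` and the placeholder key are assumptions made for illustration; they are not part of Crawl4AI or Colab.

```python
import os
from typing import Mapping


def require_env(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the value of an environment variable, failing loudly if it is unset or empty."""
    value = env.get(name, "")
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the notebook")
    return value


# Demonstration with a stand-in mapping (hypothetical placeholder value).
# In practice, call require_env("OPENAI_API_KEY") to read the real environment.
demo_env = {"OPENAI_API_KEY": "sk-placeholder"}
api_key = require_env("OPENAI_API_KEY", env=demo_env)
print(f"Loaded a key of length {len(api_key)}")
```

Failing at setup time keeps a missing key from surfacing later as an opaque provider error during extraction.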

In [2]:
import asyncio
import nest_asyncio
nest_asyncio.apply()

####  **Basic Setup and Simple Crawl**

In [3]:
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode

async def simple_crawl():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.buildfastwithai.com/",
            cache_mode=CacheMode.ENABLED,  # Default is ENABLED
        )
        print(result.markdown_v2.raw_markdown[:500].replace("\n", " -- "))  # Print the first 500 characters

asyncio.run(simple_crawl())

[INIT].... → Crawl4AI 0.4.0
[COMPLETE] ● Database backup created at: /root/.crawl4ai/crawl4ai.db.backup_20241209_102044
[INIT].... → Starting database migration...
[COMPLETE] ● Migration completed. 0 records processed.
[FETCH]... ↓ https://www.buildfastwithai.com/... | Status: True | Time: 10.60s
[SCRAPE].. ◆ Processed https://www.buildfastwithai.com/... | Time: 824ms
[COMPLETE] ● https://www.buildfastwithai.com/... | Status: True | Total: 12.61s
[![buildfastwithai](/_next/static/media/light.5e8e48b7.svg)](/) --  -- [GenAI Bootcamp](/genai-course)[Daily GenAI Quiz](/daily-quiz) --  -- [Resources](/#resources) --  -- [App Showcase](https://apps.buildfastwithai.com) --  -- [Events](/#events) --  -- More  --  -- Sign In --  -- [](https://www.linkedin.com/company/build-fast-with-ai/)[](https://x.com/satvikps) --  -- [![buildfastwithai](/_next/static/media/light.5e8e48b7.svg)](/) --  -- Sign In --  -- [](https://www.linkedin.com/company/build-fast-with-ai/)[](https://x.com/satvikps) --  -- 

#### **Dynamic Content Handling**

In [6]:
async def crawl_dynamic_content():

    async with AsyncWebCrawler(verbose=True) as crawler:
        js_code = [
            "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
        ]
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            js_code=js_code,
            # wait_for=wait_for,
            cache_mode=CacheMode.ENABLED,  # Default is ENABLED
        )
        print(result.markdown_v2.raw_markdown[:500].replace("\n", " -- "))  # Print first 500 characters

asyncio.run(crawl_dynamic_content())

[INIT].... → Crawl4AI 0.4.0
[FETCH]... ↓ https://www.nbcnews.com/business... | Status: True | Time: 10.76s
[SCRAPE].. ◆ Processed https://www.nbcnews.com/business... | Time: 2151ms
[COMPLETE] ● https://www.nbcnews.com/business... | Status: True | Total: 13.03s
IE 11 is not supported. For an optimal experience visit our site on another browser. --  -- Skip to Content --  -- [NBC News Logo](https://www.nbcnews.com) --  -- Sponsored By --  --   * [Politics](https://www.nbcnews.com/politics) --   * [U.S. News](https://www.nbcnews.com/us-news) --   * Local --   * [New York](https://www.nbcnews.com/new-york) --   * [Los Angeles](https://www.nbcnews.com/los-angeles) --   * [Chicago](https://www.nbcnews.com/chicago) --   * [Dallas-Fort Worth](https://www.nbcnews.com/dallas-fort-worth) --   * [Philadelph
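
The `js_code` entry above hard-codes the button label inside a long JavaScript string. As a rough sketch, the same snippet can be generated from Python so the label is easy to change and is escaped safely. `click_button_js` is a hypothetical helper written for this notebook, not a Crawl4AI function; it only builds the string that would be passed as `js_code`.

```python
import json


def click_button_js(label: str) -> str:
    """Build JavaScript that clicks the first <button> whose text contains `label`."""
    quoted = json.dumps(label)  # produces a quoted, escaped JavaScript string literal
    return (
        "const btn = Array.from(document.querySelectorAll('button'))"
        f".find(b => b.textContent.includes({quoted}));"
        " btn && btn.click();"
    )


# Same behaviour as the inline snippet above, with the label as a parameter.
js_code = [click_button_js("Load More")]
print(js_code[0])
```

The resulting list can be passed as `js_code=js_code` exactly as in the cell above.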


#### **Content Cleaning and Fit Markdown**

In [7]:
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def clean_content():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Apple",
            excluded_tags=['nav', 'footer', 'aside'],
            remove_overlay_elements=True,
            # word_count_threshold=10,
            cache_mode=CacheMode.ENABLED,
            markdown_generator=DefaultMarkdownGenerator(
                content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0),
                options={
                    "ignore_links": True
                }
            ),

        )
        full_markdown_length = len(result.markdown_v2.raw_markdown)
        fit_markdown_length = len(result.markdown_v2.fit_markdown)
        print(f"Full Markdown Length: {full_markdown_length}")
        print(f"Fit Markdown Length: {fit_markdown_length}")


asyncio.run(clean_content())

[INIT].... → Crawl4AI 0.4.0
[FETCH]... ↓ https://en.wikipedia.org/wiki/Apple... | Status: True | Time: 6.15s
[SCRAPE].. ◆ Processed https://en.wikipedia.org/wiki/Apple... | Time: 3517ms
[COMPLETE] ● https://en.wikipedia.org/wiki/Apple... | Status: True | Total: 9.71s
Full Markdown Length: 80379
Fit Markdown Length: 72208
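
To judge how much the pruning filter actually removed, the two lengths printed above can be combined into a single percentage. This is a small sketch that works only on those two integers; `markdown_reduction` is a hypothetical helper name, not part of Crawl4AI.

```python
def markdown_reduction(full_length: int, fit_length: int) -> float:
    """Return the share of characters removed by the content filter, in percent."""
    if full_length <= 0:
        raise ValueError("full_length must be positive")
    return 100.0 * (full_length - fit_length) / full_length


# Lengths taken from the run above.
reduction = markdown_reduction(80379, 72208)
print(f"Pruning removed {reduction:.1f}% of the raw markdown")
```

A small reduction like this one suggests the threshold is conservative for this page; raising `threshold` in `PruningContentFilter` would be the obvious knob to experiment with.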


####  **Link Analysis and Smart Filtering**

In [8]:
async def link_analysis():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            cache_mode=CacheMode.ENABLED,
            exclude_external_links=True,
            exclude_social_media_links=True,
            # exclude_domains=["facebook.com", "twitter.com"]
        )
        print(f"Found {len(result.links['internal'])} internal links")
        print(f"Found {len(result.links['external'])} external links")

        for link in result.links['internal'][:5]:
            print(f"Href: {link['href']}\nText: {link['text']}\n")


asyncio.run(link_analysis())

Found 139 internal links
Found 39 external links
Href: https://www.nbcnews.com
Text: NBC News Logo

Href: https://www.nbcnews.com/politics
Text: Politics

Href: https://www.nbcnews.com/us-news
Text: U.S. News

Href: https://www.nbcnews.com/new-york
Text: New York

Href: https://www.nbcnews.com/los-angeles
Text: Los Angeles
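
Each entry in `result.links` is a dictionary with `href` and `text` keys, so the lists can be post-processed with plain Python. The sketch below counts links per hostname, which is one way to check where the returned links actually point. `count_links_by_host` is a hypothetical helper, and the sample entries are copied from the output above.

```python
from collections import Counter
from urllib.parse import urlparse


def count_links_by_host(links: list[dict]) -> Counter:
    """Count links per hostname, given dicts that carry an 'href' key."""
    return Counter(
        urlparse(link["href"]).netloc for link in links if link.get("href")
    )


# Sample entries copied from the output above.
sample_links = [
    {"href": "https://www.nbcnews.com", "text": "NBC News Logo"},
    {"href": "https://www.nbcnews.com/politics", "text": "Politics"},
    {"href": "https://www.nbcnews.com/us-news", "text": "U.S. News"},
]
host_counts = count_links_by_host(sample_links)
print(host_counts.most_common(3))
```

In the notebook, the same function could be called on `result.links['internal']` or `result.links['external']`.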



####  **Media Handling**

In [9]:
async def media_handling():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            cache_mode=CacheMode.ENABLED,
            exclude_external_images=False,
            # screenshot=True # Set this to True if you want to take a screenshot
        )
        for img in result.media['images'][:5]:
            print(f"Image URL: {img['src']}, Alt: {img['alt']}, Score: {img['score']}")

asyncio.run(media_handling())

Image URL: https://media-cldnry.s-nbcnews.com/image/upload/t_focal-762x508,f_auto,q_auto:best/rockcms/2024-12/241203-donald-trump-al-1014-2ee816.jpg, Alt: Donald Trump, Score: 5
Image URL: https://media-cldnry.s-nbcnews.com/image/upload/t_focal-762x508,f_auto,q_auto:best/rockcms/2024-12/241208-retail-wm-1127p-b77e92.jpg, Alt: Package Deliveries As Cyber Monday Deals Hit, Score: 5
Image URL: https://media-cldnry.s-nbcnews.com/image/upload/t_focal-80x80,f_auto,q_auto:best/rockcms/2024-12/241206-donald-trump-MTP-interview-ac-936p-2e9e2d.jpg, Alt: donald trump mtp exclusive interview politics political politician meet the press, Score: 5
Image URL: https://media-cldnry.s-nbcnews.com/image/upload/t_focal-80x80,f_auto,q_auto:best/rockcms/2024-12/241205-bitcoin-se-122p-ae5e82.jpg, Alt: Attendees during the Bitcoin 2024 conference, Score: 5
Image URL: https://media-cldnry.s-nbcnews.com/image/upload/t_focal-80x80,f_auto,q_auto:best/rockcms/2024-05/240503-aetna-mn-1605-04ad07.jpg, Alt: Aetna hea
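
The image entries printed above carry `src`, `alt`, and `score` fields, so a simple filter can keep only the highest-scoring images. This is a minimal sketch: `top_images` is a hypothetical helper, and the sample entries use placeholder `example.com` URLs and made-up scores rather than real crawl output.

```python
def top_images(images: list[dict], min_score: int = 5, limit: int = 3) -> list[dict]:
    """Keep images whose 'score' is at least min_score, highest score first."""
    kept = [img for img in images if img.get("score", 0) >= min_score]
    return sorted(kept, key=lambda img: img["score"], reverse=True)[:limit]


# Hypothetical sample data shaped like the entries in result.media['images'].
sample_images = [
    {"src": "https://example.com/a.jpg", "alt": "Lead photo", "score": 5},
    {"src": "https://example.com/b.jpg", "alt": "", "score": 2},
    {"src": "https://example.com/c.jpg", "alt": "Chart", "score": 6},
]
selected = top_images(sample_images)
for img in selected:
    print(f"{img['score']}: {img['src']} ({img['alt']})")
```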

### **LLM Extraction**

This example uses `LLMExtractionStrategy` with a Pydantic schema (`OpenAIModelFee`) to extract structured data, namely model names and their input and output token fees, from the API pricing page on OpenAI’s site.

In [7]:
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
import os, json


class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(
        ..., description="Fee for output token for the OpenAI model."
    )

async def extract_structured_data_using_llm(provider: str, api_token: str = None, extra_headers: dict = None):
    print(f"\n--- Extracting Structured Data with {provider} ---")

    # Skip if API token is missing (for providers that require it)
    if api_token is None and provider != "ollama":
        print(f"API token is required for {provider}. Skipping this example.")
        return

    extra_args = {"extra_headers": extra_headers} if extra_headers else {}

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/",
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider=provider,
                api_token=api_token,
                schema=OpenAIModelFee.schema(),
                extraction_type="schema",
                instruction="""Extract all model names along with fees for input and output tokens.
                Return one object per model, for example:
                {model_name: 'GPT-4', input_fee: 'US$10.00 / 1M tokens', output_fee: 'US$30.00 / 1M tokens'}.""",
                **extra_args
            ),
            cache_mode=CacheMode.ENABLED
        )
        print(json.loads(result.extracted_content)[:5])

# Usage:
await extract_structured_data_using_llm("openai/gpt-4o-mini", os.getenv("OPENAI_API_KEY"))


--- Extracting Structured Data with openai/gpt-4o-mini ---
[INIT].... → Crawl4AI 0.4.0


<ipython-input-7-9cb9ad09a3f6>:30: PydanticDeprecatedSince20: The `schema` method is deprecated; use `model_json_schema` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.10/migration/
  schema=OpenAIModelFee.schema(),


[FETCH]... ↓ https://openai.com/api/pricing/... | Status: True | Time: 0.04s
[SCRAPE].. ◆ Processed https://openai.com/api/pricing/... | Time: 76ms
[COMPLETE] ● https://openai.com/api/pricing/... | Status: True | Total: 0.15s
[{'model_name': 'GPT-4', 'input_fee': 'US$10.00 / 1M tokens', 'output_fee': 'US$30.00 / 1M tokens', 'error': False}]
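
`result.extracted_content` is a JSON string, and each parsed row in the output above carries an extra `error` field next to the schema fields. The sketch below shows one way to turn that string into typed records using only the standard library. `ModelFee` and `parse_fees` are hypothetical names, and treating `error: true` as "skip this row" is an assumption about how the flag should be interpreted.

```python
import json
from dataclasses import dataclass


@dataclass
class ModelFee:
    model_name: str
    input_fee: str
    output_fee: str


def parse_fees(extracted_content: str) -> list[ModelFee]:
    """Parse the JSON string and keep only rows that are not flagged as errors."""
    fees = []
    for row in json.loads(extracted_content):
        if row.get("error"):
            continue  # assumption: a truthy 'error' marks a failed extraction
        fees.append(ModelFee(row["model_name"], row["input_fee"], row["output_fee"]))
    return fees


# JSON equivalent of the list printed above.
sample = (
    '[{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", '
    '"output_fee": "US$30.00 / 1M tokens", "error": false}]'
)
fees = parse_fees(sample)
print(fees[0].model_name, fees[0].input_fee, fees[0].output_fee)
```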
