https://crawl4ai.com/mkdocs/quickstart/

In [4]:
from crawl4ai import WebCrawler

def create_crawler():
    crawler = WebCrawler(verbose=True)
    crawler.warmup()
    return crawler

crawler = create_crawler()


[LOG] 🚀 Initializing LocalSeleniumCrawlerStrategy
[LOG] 🌤️  Warming up the WebCrawler
[LOG] 🌞 WebCrawler is ready to crawl


Basic Usage

Simply provide a URL and let Crawl4AI do the magic!

In [5]:
result = crawler.run(url="https://www.nba.com/news")
print(f"Basic crawl result: {result}")

[LOG] 🚀 Content extracted for https://www.nba.com/news, success: True, time taken: 0.06369686126708984 seconds
Basic crawl result: url='https://www.nba.com/news' html='<html lang="en" data-version="4.44.0" data-build="16175" data-theme="" class="userconsent-cntry-us userconsent-state- userconsent-reg-us"><head class="at-element-marker"><script type="text/javascript" src="https://bam.nr-data.net/1/NRJS-93744526e47188ec9f0?a=927622108&amp;sa=1&amp;v=1177.96a4d39&amp;t=Unnamed%20Transaction&amp;rst=3369&amp;ck=1&amp;ref=https://www.nba.com/news&amp;be=635&amp;fe=3270&amp;dc=1075&amp;af=err,xhr,stn,ins,spa&amp;perf=%7B%22timing%22:%7B%22of%22:1727192330120,%22n%22:0,%22f%22:0,%22dn%22:1,%22dne%22:32,%22c%22:32,%22s%22:71,%22ce%22:116,%22rq%22:116,%22rp%22:544,%22rpe%22:583,%22dl%22:547,%22di%22:972,%22ds%22:1074,%22de%22:1074,%22dc%22:3268,%22l%22:3270,%22le%22:3275%7D,%22navigation%22:%7B%7D%7D&amp;fp=876&amp;fcp=876&amp;jsonp=NREUM.setToken"></script><script src="https://js-agent.newreli

Taking Screenshots 📸

Let's take a screenshot of the page!

In [6]:
import base64

result = crawler.run(url="https://www.nba.com/news", screenshot=True)
with open("screenshot.png", "wb") as f:
    f.write(base64.b64decode(result.screenshot))
print("Screenshot saved to 'screenshot.png'!")

[LOG] 🕸️ Crawling https://www.nba.com/news using LocalSeleniumCrawlerStrategy...
[LOG] ✅ Crawled https://www.nba.com/news successfully!
[LOG] 🚀 Crawling done for https://www.nba.com/news, success: True, time taken: 5.7097578048706055 seconds
[LOG] 📸 Screenshot taken and converted to base64
[LOG] 🚀 Content extracted for https://www.nba.com/news, success: True, time taken: 0.061559438705444336 seconds
Screenshot saved to 'screenshot.png'!
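
Before writing the decoded bytes to disk, it can help to confirm that they really look like an image. The sketch below assumes the screenshot is PNG data (as the file name above suggests) and checks for the standard 8-byte PNG signature. The helper name `decode_screenshot` and the sample payload are made up for illustration and are not part of Crawl4AI.

```python
import base64

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def decode_screenshot(b64_data: str) -> bytes:
    """Decode a base64 string and confirm it starts like a PNG file."""
    raw = base64.b64decode(b64_data)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data does not start with the PNG signature")
    return raw

# Stand-in payload (signature plus dummy bytes), not a real screenshot.
sample_b64 = base64.b64encode(PNG_SIGNATURE + b"dummy").decode("ascii")
png_bytes = decode_screenshot(sample_b64)
print(f"Decoded {len(png_bytes)} bytes that start with the PNG signature")
```

In the cell above, the same check could be applied to `result.screenshot` before opening the output file.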


Understanding Parameters 🧠

By default, Crawl4AI caches the results of your crawls. This means that subsequent crawls of the same URL will be much faster! Let's see this in action.

First crawl (caches the result):

In [7]:
result = crawler.run(url="https://www.nba.com/news")
print(f"First crawl result: {result}")

[LOG] 🚀 Content extracted for https://www.nba.com/news, success: True, time taken: 0.05968046188354492 seconds
First crawl result: url='https://www.nba.com/news' html='<html lang="en" data-version="4.44.0" data-build="16175" data-theme="" class="userconsent-cntry-us userconsent-state- userconsent-reg-us"><head class="at-element-marker"><script type="text/javascript" src="https://bam.nr-data.net/1/NRJS-93744526e47188ec9f0?a=927622108&amp;sa=1&amp;v=1177.96a4d39&amp;t=Unnamed%20Transaction&amp;rst=5667&amp;ck=1&amp;ref=https://www.nba.com/news&amp;be=868&amp;fe=5570&amp;dc=1541&amp;af=err,xhr,stn,ins,spa&amp;perf=%7B%22timing%22:%7B%22of%22:1727192411948,%22n%22:0,%22f%22:0,%22dn%22:1,%22dne%22:32,%22c%22:32,%22s%22:77,%22ce%22:128,%22rq%22:128,%22rp%22:656,%22rpe%22:697,%22dl%22:659,%22di%22:1255,%22ds%22:1540,%22de%22:1540,%22dc%22:5566,%22l%22:5569,%22le%22:5573%7D,%22navigation%22:%7B%7D%7D&amp;fp=1044&amp;fcp=1044&amp;jsonp=NREUM.setToken"></script><script src="https://js-agent.newr

Force a fresh crawl by bypassing the cache:

In [None]:
result = crawler.run(url="https://www.nba.com/news", bypass_cache=True)
print(f"Second crawl result: {result}")
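
To make the caching behavior concrete, here is a toy, in-memory sketch of a URL-keyed cache with a `bypass_cache` switch. It is not how Crawl4AI stores results internally. `CachedFetcher` and `fake_fetch` are hypothetical names, and the fetch function returns canned text instead of touching the network.

```python
from typing import Callable, Dict

class CachedFetcher:
    """Toy URL-keyed cache; an illustration only, not Crawl4AI's implementation."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch
        self._cache: Dict[str, str] = {}
        self.fetch_count = 0  # how many times the real fetch ran

    def run(self, url: str, bypass_cache: bool = False) -> str:
        # Serve from the cache unless the caller asks to bypass it.
        if not bypass_cache and url in self._cache:
            return self._cache[url]
        self.fetch_count += 1
        content = self._fetch(url)
        self._cache[url] = content  # refresh the cached copy
        return content

def fake_fetch(url: str) -> str:
    # Hypothetical stand-in for a real page download.
    return f"<html>content of {url}</html>"

fetcher = CachedFetcher(fake_fetch)
fetcher.run("https://example.com")                     # fetched
fetcher.run("https://example.com")                     # served from the cache
fetcher.run("https://example.com", bypass_cache=True)  # fetched again
print(f"Real fetches performed: {fetcher.fetch_count}")
```

Note that bypassing the cache still refreshes the stored copy in this sketch, so later calls without the flag see the newest content.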

Adding a Chunking Strategy 🧩

Let's add a chunking strategy: RegexChunking! This strategy splits the text based on a given regex pattern.

In [9]:
from crawl4ai.chunking_strategy import RegexChunking

result = crawler.run(
    url="https://www.nba.com/news",
    chunking_strategy=RegexChunking(patterns=["\n\n"])
)
print(f"RegexChunking result: {result}")

[LOG] 🚀 Content extracted for https://www.nba.com/news, success: True, time taken: 0.06035208702087402 seconds
RegexChunking result: url='https://www.nba.com/news' html='<html lang="en" data-version="4.44.0" data-build="16175" data-theme="" class="userconsent-cntry-us userconsent-state- userconsent-reg-us"><head class="at-element-marker"><script type="text/javascript" src="https://bam.nr-data.net/1/NRJS-93744526e47188ec9f0?a=927622108&amp;sa=1&amp;v=1177.96a4d39&amp;t=Unnamed%20Transaction&amp;rst=5667&amp;ck=1&amp;ref=https://www.nba.com/news&amp;be=868&amp;fe=5570&amp;dc=1541&amp;af=err,xhr,stn,ins,spa&amp;perf=%7B%22timing%22:%7B%22of%22:1727192411948,%22n%22:0,%22f%22:0,%22dn%22:1,%22dne%22:32,%22c%22:32,%22s%22:77,%22ce%22:128,%22rq%22:128,%22rp%22:656,%22rpe%22:697,%22dl%22:659,%22di%22:1255,%22ds%22:1540,%22de%22:1540,%22dc%22:5566,%22l%22:5569,%22le%22:5573%7D,%22navigation%22:%7B%7D%7D&amp;fp=1044&amp;fcp=1044&amp;jsonp=NREUM.setToken"></script><script src="https://js-agent.ne
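
To see what splitting on a regex pattern means, here is a minimal standard-library sketch that applies each pattern in turn with `re.split`. The function `regex_chunk` is a hypothetical helper written for this illustration; RegexChunking itself may differ in details such as how it handles empty chunks.

```python
import re
from typing import List

def regex_chunk(text: str, patterns: List[str]) -> List[str]:
    """Split text on each pattern in turn and drop empty chunks."""
    chunks = [text]
    for pattern in patterns:
        next_chunks: List[str] = []
        for chunk in chunks:
            next_chunks.extend(re.split(pattern, chunk))
        chunks = next_chunks
    # Dropping whitespace-only pieces is a choice made for this sketch.
    return [chunk for chunk in chunks if chunk.strip()]

sample = "First paragraph.\n\nSecond paragraph.\n\n\n\nThird paragraph."
chunks = regex_chunk(sample, [r"\n\n"])
for index, chunk in enumerate(chunks):
    print(index, repr(chunk))
```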

Next, let's try NlpSentenceChunking, which splits the text into sentences using NLP techniques.

In [10]:
from crawl4ai.chunking_strategy import NlpSentenceChunking

result = crawler.run(
    url="https://www.nba.com/news",
    chunking_strategy=NlpSentenceChunking()
)
print(f"NlpSentenceChunking result: {result}")

[LOG] 🚀 Content extracted for https://www.nba.com/news, success: True, time taken: 0.05998086929321289 seconds
NlpSentenceChunking result: url='https://www.nba.com/news' html='<html lang="en" data-version="4.44.0" data-build="16175" data-theme="" class="userconsent-cntry-us userconsent-state- userconsent-reg-us"><head class="at-element-marker"><script type="text/javascript" src="https://bam.nr-data.net/1/NRJS-93744526e47188ec9f0?a=927622108&amp;sa=1&amp;v=1177.96a4d39&amp;t=Unnamed%20Transaction&amp;rst=5667&amp;ck=1&amp;ref=https://www.nba.com/news&amp;be=868&amp;fe=5570&amp;dc=1541&amp;af=err,xhr,stn,ins,spa&amp;perf=%7B%22timing%22:%7B%22of%22:1727192411948,%22n%22:0,%22f%22:0,%22dn%22:1,%22dne%22:32,%22c%22:32,%22s%22:77,%22ce%22:128,%22rq%22:128,%22rp%22:656,%22rpe%22:697,%22dl%22:659,%22di%22:1255,%22ds%22:1540,%22de%22:1540,%22dc%22:5566,%22l%22:5569,%22le%22:5573%7D,%22navigation%22:%7B%7D%7D&amp;fp=1044&amp;fcp=1044&amp;jsonp=NREUM.setToken"></script><script src="https://js-ag
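
For intuition about sentence-level chunking, the sketch below uses a deliberately naive regex as a stand-in for a real NLP sentence tokenizer. It splits after `.`, `!`, or `?` when whitespace follows, so it will mis-handle abbreviations such as "Dr." that a proper tokenizer copes with. The helper `naive_sentence_chunk` is hypothetical and is not what NlpSentenceChunking runs internally.

```python
import re
from typing import List

def naive_sentence_chunk(text: str) -> List[str]:
    """Split after sentence-ending punctuation that is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

sample = "Crawling is done. Is the text clean? Split it now!"
sentences = naive_sentence_chunk(sample)
for sentence in sentences:
    print(sentence)
```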

Adding an Extraction Strategy 🧠

Let's get smarter with an extraction strategy: CosineStrategy! This strategy uses cosine similarity to extract semantically similar blocks of text.

In [18]:
from crawl4ai.extraction_strategy import CosineStrategy

result = crawler.run(
    url="https://www.nba.com/news",
    extraction_strategy=CosineStrategy(
        word_count_threshold=10, 
        max_dist=0.2, 
        linkage_method="ward", 
        top_k=3
    )
)
print(f"CosineStrategy result: {result}")

[LOG] 🚀 Content extracted for https://www.nba.com/news, success: True, time taken: 0.06273198127746582 seconds
CosineStrategy result: url='https://www.nba.com/news' html='<html lang="en" data-version="4.44.0" data-build="16175" data-theme="" class="userconsent-cntry-us userconsent-state- userconsent-reg-us"><head class="at-element-marker"><script type="text/javascript" src="https://bam.nr-data.net/1/NRJS-93744526e47188ec9f0?a=927622108&amp;sa=1&amp;v=1177.96a4d39&amp;t=Unnamed%20Transaction&amp;rst=4323&amp;ck=1&amp;ref=https://www.nba.com/news&amp;be=1410&amp;fe=4134&amp;dc=1694&amp;af=err,xhr,stn,ins,spa&amp;perf=%7B%22timing%22:%7B%22of%22:1727193102298,%22n%22:0,%22f%22:0,%22dn%22:1,%22dne%22:28,%22c%22:28,%22s%22:68,%22ce%22:111,%22rq%22:111,%22rp%22:684,%22rpe%22:726,%22dl%22:687,%22di%22:1589,%22ds%22:1694,%22de%22:1694,%22dc%22:4131,%22l%22:4133,%22le%22:4137%7D,%22navigation%22:%7B%7D%7D&amp;fp=1482&amp;fcp=1482&amp;jsonp=NREUM.setToken"></script><script src="https://js-agent.

You can also pass other parameters like semantic_filter to extract specific content.

In [13]:
result = crawler.run(
    url="https://www.nba.com/news",
    extraction_strategy=CosineStrategy(
        semantic_filter="inflation rent prices"
    )
)
print(f"CosineStrategy result with semantic filter: {result}")

[LOG] 🚀 Content extracted for https://www.nba.com/news, success: True, time taken: 0.0619351863861084 seconds
CosineStrategy result with semantic filter: url='https://www.nba.com/news' html='<html lang="en" data-version="4.44.0" data-build="16175" data-theme="" class="userconsent-cntry-us userconsent-state- userconsent-reg-us"><head class="at-element-marker"><script type="text/javascript" src="https://bam.nr-data.net/1/NRJS-93744526e47188ec9f0?a=927622108&amp;sa=1&amp;v=1177.96a4d39&amp;t=Unnamed%20Transaction&amp;rst=5667&amp;ck=1&amp;ref=https://www.nba.com/news&amp;be=868&amp;fe=5570&amp;dc=1541&amp;af=err,xhr,stn,ins,spa&amp;perf=%7B%22timing%22:%7B%22of%22:1727192411948,%22n%22:0,%22f%22:0,%22dn%22:1,%22dne%22:32,%22c%22:32,%22s%22:77,%22ce%22:128,%22rq%22:128,%22rp%22:656,%22rpe%22:697,%22dl%22:659,%22di%22:1255,%22ds%22:1540,%22de%22:1540,%22dc%22:5566,%22l%22:5569,%22le%22:5573%7D,%22navigation%22:%7B%7D%7D&amp;fp=1044&amp;fcp=1044&amp;jsonp=NREUM.setToken"></script><script src
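
To illustrate the idea behind ranking text blocks by cosine similarity to a filter phrase, here is a toy sketch using plain word counts. It is a simplification under stated assumptions: it uses bag-of-words vectors and no clustering, so it does not reproduce what CosineStrategy does with `max_dist` or `linkage_method`. The helpers `bag_of_words`, `cosine_similarity`, and `top_k_blocks`, and the sample blocks, are all made up for this example.

```python
import math
import re
from collections import Counter
from typing import List

def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_blocks(blocks: List[str], query: str, top_k: int = 3) -> List[str]:
    """Return up to top_k blocks with non-zero similarity, best first."""
    query_vec = bag_of_words(query)
    scored = [(cosine_similarity(bag_of_words(block), query_vec), block) for block in blocks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [block for score, block in scored[:top_k] if score > 0]

# Made-up sample blocks, not crawled content.
blocks = [
    "Rent prices keep rising as inflation stays high",
    "The team signed a new point guard",
    "Inflation cooled slightly last month",
]
best = top_k_blocks(blocks, "inflation rent prices", top_k=2)
print(best)
```

Blocks that share no words with the filter phrase score zero and are dropped, which is why a filter that does not match the page's topic can return very little.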

Using LLMExtractionStrategy 🤖

Time to bring in the big guns: LLMExtractionStrategy with a custom instruction! This strategy uses a large language model to extract relevant information from the web page.

In [19]:
import json
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from langchain_community.chat_models import ChatOllama
import re

def clean_content(content: str):
    # Remove unwanted control characters
    return re.sub(r'[\x00-\x1F\x7F]', '', content)

# Initialize the local LLM using Ollama with qwen2.5
# (this object is not referenced again in this cell)
llm = ChatOllama(model="qwen2.5", temperature=0)

# Create the web scraper function using Crawl4AI
def scrape_nba_stories(url: str):
    # Create a WebCrawler instance and warm it up
    crawler = WebCrawler(verbose=True)
    crawler.warmup()

    # Use LLMExtractionStrategy with the local qwen2.5 model via Ollama
    result = crawler.run(
        url=url,
        word_count_threshold=1,
        extraction_strategy=LLMExtractionStrategy(
            provider="ollama/qwen2.5",  # Use the local qwen2.5 model via Ollama
            api_token="no-token",  # No token needed for local models
            apply_chunking=False,
            instruction="""
            From the crawled content, extract the following details for the latest NBA news stories:
            1. Title of the article or headline.
            2. A brief summary or description of the article.
            3. Author's name, if available.
            4. Date of publication, if available.
            Only extract content that appears to be related to NBA news articles.
            Ignore any subscription prompts, login messages, or promotional content.
            The extracted JSON format should look like this:
            {
                "articles": [
                    {
                        "title": "Article Title 1",
                        "summary": "Brief summary of the article.",
                        "author": "Author Name",
                        "date": "Publication Date"
                    },
                    {
                        "title": "Article Title 2",
                        "summary": "Brief summary of the article.",
                        "author": "Author Name",
                        "date": "Publication Date"
                    }
                ]
            }
            """
        ),
        bypass_cache=True,
    )

    # Decode escape sequences and strip control characters before parsing the JSON
    result_converted = result.extracted_content.encode('utf-8', errors='ignore').decode("unicode_escape")
    result_cleaned = clean_content(result_converted)
    
    try:
        result_json = json.loads(result_cleaned)
    except json.JSONDecodeError as e:
        print(f"JSON decoding error: {e}")
        print(f"Problematic content: {result_cleaned}")
        result_json = None  # Set to None if there is an error

    return result_json

# Test the scraper function
if __name__ == "__main__":
    url = 'https://www.espn.com/mlb/insider/story/_/id/41399848/mlb-2024-regular-season-grades-all-30-teams'
    
    scraped_data = scrape_nba_stories(url)
    
    # Print the scraped NBA stories if available
    if scraped_data:
        print("Scraped NBA Stories:", json.dumps(scraped_data, indent=2))
    else:
        print("No valid data extracted.")
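
The clean-then-parse step in the function above can be tried in isolation, without a crawler or an LLM. The sketch below is a minimal stand-alone version of that step. `parse_llm_json` is a hypothetical helper name, and the sample strings are invented rather than real model output.

```python
import json
import re
from typing import Any, Optional

def parse_llm_json(raw: str) -> Optional[Any]:
    """Strip ASCII control characters, then parse JSON; return None on failure."""
    cleaned = re.sub(r"[\x00-\x1F\x7F]", "", raw)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None

# A stray control character (BEL) inside a string would make json.loads fail.
good = parse_llm_json('{"articles": [{"title": "Example\x07 Title"}]}')
bad = parse_llm_json("not json at all")
print(good)
print(bad)
```

Returning `None` keeps the caller's `if scraped_data:` check simple, at the cost of hiding the reason for the failure, so logging the error as the original function does is still worthwhile.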


[LOG] 🚀 Initializing LocalSeleniumCrawlerStrategy
[LOG] 🌤️  Warming up the WebCrawler
[LOG] 🌞 WebCrawler is ready to crawl
[LOG] 🕸️ Crawling https://www.espn.com/mlb/insider/story/_/id/41399848/mlb-2024-regular-season-grades-all-30-teams using LocalSeleniumCrawlerStrategy...
[LOG] ✅ Crawled https://www.espn.com/mlb/insider/story/_/id/41399848/mlb-2024-regular-season-grades-all-30-teams successfully!
[LOG] 🚀 Crawling done for https://www.espn.com/mlb/insider/story/_/id/41399848/mlb-2024-regular-season-grades-all-30-teams, success: True, time taken: 3.145214319229126 seconds
[LOG] 🚀 Content extracted for https://www.espn.com/mlb/insider/story/_/id/41399848/mlb-2024-regular-season-grades-all-30-teams, success: True, time taken: 0.17802000045776367 seconds
[LOG] 🔥 Extracting semantic blocks for https://www.espn.com/mlb/insider/story/_/id/41399848/mlb-2024-regular-season-grades-all-30-teams, Strategy: LLMExtractionStrategy
[LOG] Call LLM for https://www.espn.com/mlb/insider/story/_/id/41399