<img src="https://drive.google.com/uc?export=view&id=1wYSMgJtARFdvTt5g7E20mE4NmwUFUuog" width="200">

[![Gen AI Experiments](https://img.shields.io/badge/Gen%20AI%20Experiments-GenAI%20Bootcamp-blue?style=for-the-badge&logo=artificial-intelligence)](https://github.com/buildfastwithai/gen-ai-experiments)
[![Gen AI Experiments GitHub](https://img.shields.io/github/stars/buildfastwithai/gen-ai-experiments?style=for-the-badge&logo=github&color=gold)](http://github.com/buildfastwithai/gen-ai-experiments)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1N_GUpqymBxqBW3uqzF1QK1m3r9tqbpdU?usp=sharing)


**What You'll Learn:**
- Master cutting-edge AI tools & frameworks
- 6 weeks of hands-on, project-based learning
- Weekly live mentorship sessions

Transform your AI ideas into reality through hands-on projects and expert mentorship.


[Start Your Journey](https://www.buildfastwithai.com/genai-course)




## 🕷️ AnyCrawl Web Scraping

The AnyCrawl scrape API converts any webpage into structured data optimized for large language models (LLMs). It supports multiple scraping engines, including Cheerio, Playwright, and Puppeteer, and returns output in formats such as HTML, Markdown, and JSON.

### **Setup and Installation**



In [1]:
# Install dependencies (run this cell in Colab or Jupyter)
!pip install requests tqdm nbformat --quiet

### **Set Up the API Key**


In [3]:
from google.colab import userdata
import os

os.environ['ANYCRAWL_API_KEY']=userdata.get('ANYCRAWL_API_KEY')

API_KEY = os.getenv("ANYCRAWL_API_KEY")
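Outside Colab, `google.colab` cannot be imported, so the cell above fails. A minimal sketch of a fallback, assuming the key is exported as the `ANYCRAWL_API_KEY` environment variable; the helper name `load_api_key` is hypothetical and not part of AnyCrawl:

```python
import os

def load_api_key(env=None, name="ANYCRAWL_API_KEY"):
    """Return the key from Colab secrets when available, else from the environment."""
    env = os.environ if env is None else env
    try:
        from google.colab import userdata  # only importable inside Colab
        key = userdata.get(name)
    except Exception:
        key = None  # not in Colab, or the secret is not defined
    key = key or env.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; add it as a Colab secret or export it in your shell.")
    return key

# Demo with a placeholder mapping; in the notebook you would call load_api_key() with no arguments
print(load_api_key({"ANYCRAWL_API_KEY": "demo-key"}))
```

Failing early with a clear message avoids sending requests with a `Bearer None` header.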

## Configuration
Build the request headers from your AnyCrawl API key and define a helper function for POST requests.

In [30]:
import requests, json, os, time
from pathlib import Path

ANYCRAWL_API = "https://api.anycrawl.dev/v1/scrape"

HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

def call_anycrawl(payload):
    """Call AnyCrawl /v1/scrape and return JSON response (with basic error handling)."""
    resp = requests.post(ANYCRAWL_API, headers=HEADERS, json=payload, timeout=120)
    try:
        data = resp.json()
    except Exception as e:
        raise RuntimeError(f"Non-JSON response: {resp.status_code}\n{resp.text}") from e
    if not data.get("success", False):
        raise RuntimeError(f"AnyCrawl error: {data.get('error')} - {data.get('message')}")
    return data


## 1) Basic scraping (cheerio): static page
Scrape a simple static page and print its URL, status, title, and a short preview of the returned markdown.

In [12]:
# Example: scrape a simple page (cheerio - fast, static)
payload = {
    "url": "https://docs.agno.com/introduction", # Agno agent framework documentation
    # "engine": "cheerio",  # default
    "formats": ["markdown"],
    "timeout": 30000
}

try:
    result = call_anycrawl(payload)
    data = result['data']
    print("URL:", data.get('url'))
    print("Status:", data.get('status'))
    print("Title:", data.get('title'))
    print('\n--- Markdown preview ---\n')
    print(data.get('markdown', '')[:1500])   # preview first 1500 chars
except Exception as e:
    print('Error:', e)


URL: https://docs.agno.com/introduction
Status: completed
Title: What is Agno? - Agno

--- Markdown preview ---

What is Agno? - Agno

[Agno home page![light logo](https://mintlify.s3.us-west-1.amazonaws.com/agno/logo/black.svg)![dark logo](https://mintlify.s3.us-west-1.amazonaws.com/agno/logo/white.svg)](https://docs.agno.com/)

Search...

⌘KAsk AI

Search...

Navigation

Introduction

What is Agno?

[User Guide

](https://docs.agno.com/introduction)[Examples

](https://docs.agno.com/examples/introduction)[Workspaces

](https://docs.agno.com/workspaces/introduction)[FAQs

](https://docs.agno.com/faq/environment-variables)[API reference

](https://docs.agno.com/reference/agents/agent)[Changelog

](https://docs.agno.com/changelog/overview)

On this page

*   [Getting Started](https://docs.agno.com/introduction#getting-started)
*   [Why Agno?](https://docs.agno.com/introduction#why-agno%3F)
*   [Dive deeper](https://docs.agno.com/introduction#dive-deeper)

Engineers and researchers use

## 2) Dynamic scraping (playwright): pages that require JS
Use `engine: 'playwright'` for SPAs and JS-heavy pages. This will be slower but can capture dynamic content and screenshots.

In [31]:
# Dynamic scraping example (playwright) with full-page screenshot
payload = {
    "url": "https://news.ycombinator.com/",   # example dynamic site (usually loads with static HTML too)
    "engine": "playwright",
    "formats": ["markdown", "screenshot@fullPage", "rawHtml"],
    "timeout": 45000,
    "wait_for": 2000   # ms - small wait after load
}

try:
    result = call_anycrawl(payload)
    data = result['data']
    print('Status:', data.get('status'))
    # Markdown preview
    md = data.get('markdown', '')
    print('\n--- Markdown preview ---\n', md[:1500])
    # Screenshot URL (if returned)
    if data.get('screenshot'):
        print('\nScreenshot URL:', data.get('screenshot'))
except Exception as e:
    print('Error:', e)


Status: completed

--- Markdown preview ---
 Hacker News

[![](https://news.ycombinator.com/y18.svg)](https://news.ycombinator.com/)

**[Hacker News](https://news.ycombinator.com/news)**[new](https://news.ycombinator.com/newest) | [past](https://news.ycombinator.com/front) | [comments](https://news.ycombinator.com/newcomments) | [ask](https://news.ycombinator.com/ask) | [show](https://news.ycombinator.com/show) | [jobs](https://news.ycombinator.com/jobs) | [submit](https://news.ycombinator.com/submit)

[login](https://news.ycombinator.com/login?goto=news)

1.

[

](https://news.ycombinator.com/vote?id=44861106&how=up&goto=news)

[Google paid a $250K reward for a bug](https://issues.chromium.org/issues/412578726) ([chromium.org](https://news.ycombinator.com/from?site=chromium.org))

274 points by [alexcos](https://news.ycombinator.com/user?id=alexcos) [4 hours ago](https://news.ycombinator.com/item?id=44861106) | [hide](https://news.ycombinator.com/hide?id=44861106&goto=news) | [105 co

## 3) Scraping with proxy
If you use proxies, pass the `proxy` parameter. Example:

In [32]:
payload = {
    "url": "https://example.com",
    "engine": "playwright",
    "proxy": "http://proxy.example.com:8080",
    "formats": ["markdown"],
    "timeout": 30000
}

# This cell only prints the payload. Replace the placeholder with a working proxy URL before passing it to call_anycrawl.
print('Payload preview:', payload)

Payload preview: {'url': 'https://example.com', 'engine': 'playwright', 'proxy': 'http://proxy.example.com:8080', 'formats': ['markdown'], 'timeout': 30000}
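Since the payloads in this notebook differ only in a few fields, a small builder can keep them consistent. The sketch below uses only field names that already appear in the cells above; the helper name `build_payload` and its defaults are hypothetical:

```python
def build_payload(url, engine="cheerio", formats=("markdown",), timeout_ms=30000, proxy=None):
    """Assemble a scrape payload, adding the proxy key only when one is provided."""
    payload = {"url": url, "engine": engine, "formats": list(formats), "timeout": timeout_ms}
    if proxy:
        payload["proxy"] = proxy  # omitted entirely for direct requests
    return payload

print(build_payload("https://example.com"))
print(build_payload("https://example.com", engine="playwright",
                    proxy="http://proxy.example.com:8080"))
```

Leaving the `proxy` key out when it is unset avoids sending an empty or placeholder value to the API.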


## 4) Extracting structured JSON via `json_options` (schema)
You can provide a JSON schema and a prompt to extract structured fields from the page. This is great for scraping product pages, job postings, or any structured content.

In [45]:
payload = {
    "url": "https://amzn.in/d/b9gh2om",
    # Assumption: the "json" format must be requested for json_options to produce output;
    # confirm the exact format and field names in the AnyCrawl docs.
    "formats": ["markdown", "json"],
    "json_options": { # Schema and prompt that guide the structured extraction
        "schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "price": {"type": "string"},
                "description": {"type": "string"},
                "images": {
                    "type": "array",
                    "items": {"type": "string"}
                }
            },
            "required": ["title"]
        },
        "prompt": "Extract product title, price, description and image URLs from the page."
    }
}



try:
    res = call_anycrawl(payload)
    # Print the extracted object rather than the markdown (assumed to be returned under data['json'])
    print('Structured JSON output:\n', json.dumps(res['data'].get('json'), indent=2))
except Exception as e:
    print('Error:', e)


Structured JSON output:
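Extraction can come back empty or partial, as the blank output here suggests, so it may help to check the result against the schema's `required` list before using it. A minimal sketch; the helper `missing_required` is hypothetical and the sample record is made-up illustration data, not real API output:

```python
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "string"},
        "description": {"type": "string"},
        "images": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title"],
}

def missing_required(record, schema):
    """Return the required property names that are absent or empty in record."""
    required = list(schema.get("required", []))
    if not isinstance(record, dict):
        return required  # nothing usable was extracted
    return [k for k in required if record.get(k) in (None, "", [])]

# Made-up records for illustration only
print(missing_required({"title": "Sample product", "price": "499"}, schema))
print(missing_required({"price": "499"}, schema))
print(missing_required(None, schema))
```

For full validation of types and nested structure, a dedicated JSON Schema validator would be a better fit than this presence check.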


## 5) Batch & concurrency example
You can send multiple requests concurrently. Below is a simple example using `concurrent.futures` to run several scrapes in parallel. Use responsibly: don't overload target sites.

In [29]:
from concurrent.futures import ThreadPoolExecutor, as_completed

urls = [
    'https://example.com',
    'https://httpbin.org/html',
    'https://www.python.org/'
]

def scrape_url(u, engine='cheerio'):
    try:
        payload = {"url": u, "engine": engine, "formats": ["markdown"], "timeout": 30000}
        r = call_anycrawl(payload)
        return u, r['data'].get('title', ''), r['data'].get('status')
    except Exception as e:
        return u, None, str(e)

results = []
with ThreadPoolExecutor(max_workers=3) as ex:
    futures = [ex.submit(scrape_url, u) for u in urls]
    for f in as_completed(futures):
        results.append(f.result())

for row in results:
    print(row) # Print the URL, title, and status of each scraped website


('https://example.com', 'Example Domain', 'completed')
('https://www.python.org/', 'Welcome to Python.org', 'completed')
('https://httpbin.org/html', '', 'completed')
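The rows above arrive in completion order rather than input order, because `as_completed` yields futures as they finish. If input order matters, `ThreadPoolExecutor.map` is one alternative. A minimal sketch, using a stand-in function so it runs without network access; `scrape_many` and `fake_scrape` are hypothetical names:

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_many(urls, scrape_fn, max_workers=3):
    """Run scrape_fn over urls in parallel and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        return list(ex.map(scrape_fn, urls))

# Stand-in for scrape_url so the sketch is self-contained
def fake_scrape(u):
    return u, "title for " + u, "completed"

rows = scrape_many(["https://example.com", "https://httpbin.org/html"], fake_scrape)
for row in rows:
    print(row)
```

In the notebook, `scrape_url` could be passed as `scrape_fn`. It already catches its own exceptions, which matters because `map` re-raises the first uncaught exception when results are consumed.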


## 6) Save outputs locally (Markdown, HTML, screenshots)
Save the returned markdown and HTML to local files, and print the screenshot link.

In [46]:
out_dir = Path("anycrawl_outputs")
out_dir.mkdir(exist_ok=True)
# This cell assumes you have result from earlier (variable 'result' or 'res'); we'll demonstrate with 'result' if present.
try:
    sample = globals().get('result') or globals().get('res')
    if not sample:
        print('No sample result available - run an earlier scrape cell first.')
    else:
        data = sample['data']
        # Save markdown
        if data.get('markdown'):
            (out_dir / 'page.md').write_text(data['markdown'], encoding='utf-8')
            print('Saved markdown ->', out_dir/'page.md')
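        # Save HTML. Section 2 requested "rawHtml"; the response key is assumed to match the format name.
        html = data.get('html') or data.get('rawHtml')
        if html:
            (out_dir / 'page.html').write_text(html, encoding='utf-8')
            print('Saved html ->', out_dir/'page.html')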
        # If screenshot link present, just print the url (downloading might require additional auth)
        if data.get('screenshot'):
            print('Screenshot URL:', data['screenshot'])
except Exception as e:
    print('Error saving outputs:', e)


Saved markdown -> anycrawl_outputs/page.md
Screenshot URL: https://api.anycrawl.dev/v1/public/storage/file/screenshot-fullPage-867bc456-3747-4ffb-b7e1-9748625da11b.jpeg
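Writing every result to `page.md` means each new scrape overwrites the previous one. One option is to derive the file name from the page URL. A minimal sketch; the helper `url_to_filename` is hypothetical:

```python
import re
from urllib.parse import urlparse

def url_to_filename(url, ext="md"):
    """Turn a URL into a filesystem-safe file name based on its host and path."""
    parsed = urlparse(url)
    raw = parsed.netloc + parsed.path
    # Collapse every run of unsafe characters into a single underscore
    slug = re.sub(r"[^A-Za-z0-9._-]+", "_", raw).strip("_") or "page"
    return f"{slug}.{ext}"

print(url_to_filename("https://docs.agno.com/introduction"))
print(url_to_filename("https://news.ycombinator.com/", ext="html"))
```

The result can be used in place of the fixed name, for example `out_dir / url_to_filename(data.get('url', ''))`. Query strings are ignored here, so two URLs that differ only in their query would still collide.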


## 7) Preparing content for LLM usage (cleaning & chunking)
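A simple example: strip fenced code blocks and `<script>` elements, then split the text into fixed 3,000-character chunks. Character counts are only a rough stand-in for tokens; at about four characters per token, each chunk is in the region of 750 tokens.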

In [51]:
import re

def clean_markdown(md):
    # Remove fenced code blocks and <script> elements for cleaner LLM input
    md = re.sub(r"```[\s\S]*?```", "", md)
    md = re.sub(r"<script[\s\S]*?</script>", "", md, flags=re.I)
    return md.strip()

def chunk_text(text, chunk_size_chars=3000):
    chunks = []
    start = 0
    while start < len(text):
        chunk = text[start:start+chunk_size_chars]
        chunks.append(chunk)
        start += chunk_size_chars
    return chunks

# Example usage
sample_md = None
if 'result' in globals():
    sample_md = result['data'].get('markdown')
elif 'res' in globals():
    sample_md = res['data'].get('markdown')

if sample_md:
    cleaned = clean_markdown(sample_md)
    chunks = chunk_text(cleaned, 3000)
    print('Found', len(chunks), 'chunks. Example chunk length:', len(chunks[0]))
    for chunk in chunks[:5]:
        print(chunk[:50])
        print('---')
else:
    print('No markdown sample available. Run earlier scrape cell.')


Found 6 chunks. Example chunk length: 3000
Hacker News

[![](https://news.ycombinator.com/y18
---
mbinator.com/hide?id=44860908&goto=news) | [10 com
---
oints by [beariish](https://news.ycombinator.com/u
---
/news.ycombinator.com/from?site=mrwint.github.io))
---
ws)

[1910: The year the modern world lost its min
---
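
The previews above show the fixed-size splitter cutting through URLs and words. A minimal sketch of a paragraph-aware alternative, which packs blank-line-separated paragraphs into chunks and only hard-splits a paragraph that exceeds the limit on its own; `chunk_by_paragraph` is a hypothetical helper:

```python
def chunk_by_paragraph(text, max_chars=3000):
    """Group blank-line-separated paragraphs into chunks of at most max_chars characters."""
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        # A single oversized paragraph falls back to a hard character split
        while len(para) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate  # paragraph still fits in the open chunk
        else:
            chunks.append(current)  # close the open chunk and start a new one
            current = para
    if current:
        chunks.append(current)
    return chunks

demo = "aaaa\n\nbbbb\n\ncccc"
print(chunk_by_paragraph(demo, max_chars=10))
```

Chunks produced this way vary in length, but each one starts and ends on a paragraph boundary whenever the paragraphs fit within `max_chars`.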


## 8) Troubleshooting & Best Practices
- Use `cheerio` for static pages, `playwright`/`puppeteer` for dynamic pages.
- Use proxies & rotate IPs for large-scale scraping.
- Set sensible timeouts and retry logic.
- Respect robots.txt and target website terms of service.
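
The retry advice above can be sketched as a small wrapper with exponential backoff. This is one possible approach, not an AnyCrawl feature; `with_retries` and the flaky demo function are hypothetical, and the `sleep` parameter exists only so the demo runs without actually waiting:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception, wait base_delay * 2**n seconds and try again."""
    for n in range(attempts):
        try:
            return fn()
        except Exception:
            if n == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** n)

# Demo: a function that fails twice, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

delays = []  # records the waits instead of sleeping
print(with_retries(flaky, sleep=delays.append), delays)
```

In the notebook this could wrap the helper defined earlier, for example `with_retries(lambda: call_anycrawl(payload))`. Retrying on every exception is a simplification; in practice you may want to retry only on timeouts and server-side errors.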

---
