---
title: "Scraper Demo"
format: html
jupyter: python3
---


# Scraper Demo: Crawl4AI + Polars

This document walks through the scraping pipeline using the project modules: it extracts the content URLs, scrapes one sample URL, downloads that page's PDFs, and writes the result to JSON.

## 1. Extract Content URLs


In [None]:
import sys
import os

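# Add the parent directory to sys.path so we can import from src.
# The quoted "__file__" is a plain string, so abspath resolves it against the
# current working directory; the two dirname calls then yield its parent.
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath("__file__"))))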
from src.url_extractor import extract_content_urls

urls = extract_content_urls("https://www.versnellingsplan.nl/kennisbank/")
print(f"Found {len(urls)} URLs:")
for url in urls[:3]:
    print(url)
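
The path setup above only works when the notebook runs from a directory one level below the project root. A minimal sketch of the same idea with `pathlib`, assuming that layout; `add_project_root` is a hypothetical helper and not part of the project modules:

```python
import sys
from pathlib import Path
from typing import Optional


def add_project_root(start: Optional[Path] = None) -> Path:
    """Append the parent of `start` (default: current working directory) to sys.path."""
    root = (start or Path.cwd()).resolve().parent
    root_str = str(root)
    # Avoid duplicate entries when the cell is re-run.
    if root_str not in sys.path:
        sys.path.append(root_str)
    return root


project_root = add_project_root()
print(project_root)
```

Checking for an existing entry keeps `sys.path` from growing each time the cell is executed again.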

## 2. Scrape Content for a Sample URL


In [None]:
from src.content_scraper import scrape_content

sample_url = urls[0] if urls else "https://www.versnellingsplan.nl/kennisbank/some-article"
content = scrape_content(sample_url)
print(content)
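
Printing the full result can flood the output for long articles. A minimal sketch of a shorter preview, assuming `scrape_content` returns a dictionary (the later call `content.get('pdf_links', [])` suggests it does); `preview` and every sample key other than `pdf_links` are hypothetical:

```python
def preview(content: dict, max_chars: int = 80) -> dict:
    """Return a copy of `content` with long string values truncated."""
    short = {}
    for key, value in content.items():
        if isinstance(value, str) and len(value) > max_chars:
            short[key] = value[:max_chars] + "..."
        else:
            short[key] = value
    return short


# Hypothetical sample; the real keys depend on src.content_scraper.
sample = {"title": "Example", "text": "x" * 200, "pdf_links": []}
print(preview(sample))
```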

## 3. Download PDFs for the Sample URL


In [None]:
# Create output directory if it doesn't exist
os.makedirs("../output/demo/pdfs", exist_ok=True)

from src.pdf_downloader import download_pdfs
pdfs = download_pdfs(content.get('pdf_links', []), "../output/demo/pdfs")
print(pdfs)
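
Any PDF downloader has to decide where each file lands on disk. The sketch below shows one way to derive a local path from a PDF URL; it is not the implementation in `src.pdf_downloader`, and `pdf_target_path`, the example URL, and the `download.pdf` fallback name are assumptions made for illustration:

```python
from pathlib import Path
from urllib.parse import unquote, urlparse


def pdf_target_path(pdf_url: str, output_dir: str) -> Path:
    """Map a PDF URL to a local path inside `output_dir`."""
    # Use only the URL path, so query strings such as "?dl=1" are ignored.
    name = Path(unquote(urlparse(pdf_url).path)).name or "download.pdf"
    if not name.lower().endswith(".pdf"):
        name += ".pdf"
    return Path(output_dir) / name


print(pdf_target_path("https://example.org/files/report%202023.pdf?dl=1", "../output/demo/pdfs"))
```

Two URLs ending in the same file name would map to the same path here, so a real downloader would also need a collision strategy.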

## 4. Generate JSON Output


In [None]:
# Create output directory if it doesn't exist
os.makedirs("../output/demo", exist_ok=True)

from src.json_generator import generate_json
json_path = generate_json(content, pdfs, "../output/demo")
print(f"JSON saved at: {json_path}")
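
For readers without access to the project modules, the JSON step can be approximated with the standard library. This is a sketch under assumptions, not the behavior of `src.json_generator`: `write_record`, the `record.json` file name, and the `pdfs` key are hypothetical, and the sample data is invented for the demonstration.

```python
import json
import tempfile
from pathlib import Path


def write_record(content: dict, pdfs: list, output_dir: str, name: str = "record.json") -> Path:
    """Write the scraped content and the PDF list to one JSON file; return its path."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / name
    # Store PDF paths as strings so Path objects serialize cleanly.
    record = {**content, "pdfs": [str(p) for p in pdfs]}
    with path.open("w", encoding="utf-8") as fh:
        json.dump(record, fh, ensure_ascii=False, indent=2)
    return path


# Hypothetical sample data written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    saved = write_record({"title": "Example", "pdf_links": []}, [], tmp)
    print(saved.read_text(encoding="utf-8"))
```

Setting `ensure_ascii=False` keeps non-ASCII characters, such as those in Dutch article text, readable in the output file.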

---

This demo shows the end-to-end workflow of URL extraction, content scraping, PDF downloading, and JSON output using Crawl4AI and the project modules.