Scrape-Evals is an evaluation framework for web scraping engines ("engines") that benchmarks quality and robustness on a fixed dataset. We focus on: (1) whether an engine successfully retrieves page content (Coverage/Success Rate); and (2) how well the retrieved content captures a human-curated core snippet while avoiding noise (Recall/Precision/F1). The F1 score measures content quality by balancing how much important content is captured (recall) against how much irrelevant content is filtered out (precision). In our results, we refer to the F1 score as "quality" for simplicity.
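For reference, F1 is the harmonic mean of recall and precision; a minimal sketch of the computation:

```python
def f1_score(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision; 0.0 when both are 0."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# An engine that captures most of a snippet but also pulls in noise:
print(f1_score(recall=0.90, precision=0.55))  # ~0.68
```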
The framework supports hosted scraping APIs (Firecrawl, Apify, ScraperAPI, ScrapingBee, Zyte, and more) as well as self-hosted engines (Crawl4AI, Playwright, Puppeteer, Rest, Scrapy, and Selenium). Additional APIs can be integrated easily.
Below are evaluation results across different engines.
| Engine | Coverage (Success Rate) (%) | Quality (F1) |
|---|---|---|
| Firecrawl | 80.9 | 0.68 |
| Exa | 76.3 | 0.53 |
| Tavily | 67.6 | 0.50 |
| ScraperAPI | 63.5 | 0.45 |
| Zyte | 62.9 | 0.47 |
| ScrapingBee | 60.6 | 0.45 |
| Apify | 60.2 | 0.42 |
| Crawl4AI | 58.0 | 0.45 |
| Selenium | 55.0 | 0.40 |
| Scrapy | 54.0 | 0.43 |
| Puppeteer | 53.7 | 0.41 |
| Rest (requests) | 50.6 | 0.36 |
| Playwright | 39.5 | 0.34 |
```bash
pip install -r requirements.txt

# Optional: install Playwright browsers if using the Playwright engine
python -m playwright install chromium
```

The `datasets/1-0-0.csv` dataset contains 1,000 web pages with human-annotated ground truth for evaluating how well web scraping engines capture core content while avoiding noise (navigation, ads, footers, etc.). Check the dataset README for more details.
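Before running anything, it can help to inspect the manifest; a minimal sketch (the column schema is defined in the dataset README, so no column names are assumed here):

```python
import pandas as pd

# Load the evaluation manifest and inspect its size and schema.
df = pd.read_csv("datasets/1-0-0.csv")
print(len(df))               # expected: 1,000 rows, one per annotated page
print(df.columns.tolist())   # actual columns are documented in the dataset README
print(df.head(3))
```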
From the `scraper_evals` directory:
```bash
python run_eval.py \
  --scrape_engine rest_scraper \
  --suite quality \
  --output-dir runs \
  --dataset datasets/1-0-0.csv \
  --resume
```

Flags:

- `--resume`: do not delete existing outputs; skip scrape if present
- `--rerun`: start fresh (deletes the per-engine output dir)
- `--analysis-only`: recompute metrics only; requires existing outputs
- `--dry-run`: test with a temporary directory and limited data (5 tasks); automatically cleans up
- `--max-workers N`: internal per-engine concurrency
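If you prefer to drive the evaluation from Python (for example, scrape once with `--resume`, then re-grade with `--analysis-only`), the flags compose as in this sketch; the engine and paths are just the example values from above:

```python
import subprocess

# Shared invocation; flags are the documented ones, values are illustrative.
base = [
    "python", "run_eval.py",
    "--scrape_engine", "rest_scraper",
    "--suite", "quality",
    "--output-dir", "runs",
    "--dataset", "datasets/1-0-0.csv",
]

subprocess.run(base + ["--resume"], check=True)         # scrape, skipping tasks already done
subprocess.run(base + ["--analysis-only"], check=True)  # recompute metrics from existing outputs
```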
Outputs:

- Per-engine summary: `runs/results/<engine>_<suite>.json`
- Per-URL artifacts: `runs/<engine>_<suite>/<task_id>/{task.json,scrape_output.json,grader_output.json}`
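Once a run finishes, the per-engine summaries can be aggregated with a few lines of Python; note that the JSON field names below are assumptions (inspect an actual summary file for the real keys):

```python
import json
from pathlib import Path

# Print one line per engine summary found under runs/results/.
for summary_path in sorted(Path("runs/results").glob("*_quality.json")):
    data = json.loads(summary_path.read_text())
    # NOTE: "coverage" and "f1" are hypothetical keys; check a real file first.
    print(summary_path.stem, data.get("coverage"), data.get("f1"))
```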
To evaluate all engines at once:

```bash
python run_all.py \
  --dataset datasets/1-0-0.csv \
  --suite quality \
  --output-dir runs \
  --concurrency 6 \
  --resume
```

Notes:

- Use `--rerun` for a fresh run. The runner pre-cleans per-engine dirs, then runs children with `--resume` to avoid concurrent deletes.
- `--timeout-minutes` caps each engine's total run time (default 45).
- Logs are unbuffered; each line is prefixed with the engine name.
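The pre-clean-then-resume pattern described in the notes can be sketched as follows; this illustrates the design, not the actual `run_all.py` code, and the engine name and paths are example values:

```python
import shutil
import subprocess
from pathlib import Path

def run_engine(engine: str, rerun: bool = False) -> None:
    out_dir = Path("runs") / f"{engine}_quality"
    if rerun and out_dir.exists():
        # The parent deletes the output dir up front, so children never race on deletes...
        shutil.rmtree(out_dir)
    # ...and each child is launched with --resume, which never deletes outputs.
    proc = subprocess.Popen(
        ["python", "run_eval.py", "--scrape_engine", engine,
         "--suite", "quality", "--output-dir", "runs",
         "--dataset", "datasets/1-0-0.csv", "--resume"],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True,
    )
    for line in proc.stdout:  # prefix each log line with the engine name
        print(f"[{engine}] {line}", end="", flush=True)
    proc.wait()
```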
Use the `--dry-run` option to test your setup and verify that engines work correctly:
```bash
# Test a single engine with a dry run
python run_eval.py \
  --scrape_engine selenium_scraper \
  --suite quality \
  --output-dir runs \
  --dataset datasets/1-0-0.csv \
  --dry-run

# Test all engines with a dry run
python run_all.py \
  --dataset datasets/1-0-0.csv \
  --dry-run
```
Metrics:

- Coverage (Success Rate): indicates successful content retrieval when the response returns a valid HTTP status (2xx/3xx), contains substantive content (`content_size > 0`), and represents actual page content rather than an access-intermediary page. Rows requiring specific validation text are marked unsuccessful when that text is unavailable.
- Snippet Quality Metrics (evaluated on the optimal content window matching the expected text length):
  - Recall: proportion of expected content successfully captured
  - Precision: proportion of captured content that matches the expected material
  - F1 Score: balanced measure combining recall and precision for overall content quality
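Since quality is evaluated on the optimal content window matching the expected text length, the idea can be sketched as below; this uses whitespace tokens and bag-of-words overlap purely for illustration, and the framework's actual grader (see `grader_output.json`) may tokenize and match differently:

```python
from collections import Counter

def window_f1(scraped: str, expected: str) -> float:
    """Best F1 over windows of scraped text sized to the expected snippet.

    Illustrative only: word-level multiset overlap; O(n*m), fine for a sketch.
    """
    scraped_toks, expected_toks = scraped.split(), expected.split()
    n = len(expected_toks)
    if n == 0 or not scraped_toks:
        return 0.0

    expected_counts = Counter(expected_toks)
    best = 0.0
    for start in range(max(1, len(scraped_toks) - n + 1)):
        window = scraped_toks[start:start + n]
        overlap = sum((Counter(window) & expected_counts).values())
        recall = overlap / n               # expected content captured
        precision = overlap / len(window)  # captured content that is expected
        if recall + precision:
            best = max(best, 2 * recall * precision / (recall + precision))
    return best
```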
Reproducibility:

- Seeded dataset; fixed CSV manifest for URLs/snippets
- Engine versions pinned via `requirements.txt`
We welcome contributions! Please feel free to open an issue or submit a pull request.
To contribute:
- Fork the repository
- Create a new branch
- Make your changes
- Submit a pull request
Having issues? Check out our TROUBLESHOOTING.md guide for solutions to common problems including:
- Missing dependencies and environment setup
- Selenium WebDriver and Playwright browser issues
- Memory and permission errors
- Network/API connectivity problems
- Dataset and file path issues
Quick tip: Use `--dry-run` to test your setup safely with limited data and automatic cleanup.
This repository is made available under the MIT license. By contributing to this repository, you agree to make your contribution publicly available under the same license, and you represent that you have the authority to do so.