This project measures OSINT-adjacent tools for document parsing, browser automation, and scraping. It produces JSON run artifacts plus a static HTML report in public/index.html.
The first implemented category is runnable locally with pdftotext; additional local/open-source PDF tools, browser tools, Firecrawl, and Obscura are wired into the runner. Paid API tools skip until their keys are present, and paid/network execution still requires explicit flags.
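The gating rules above can be sketched as a small decision function. This is an illustration only, not the runner's actual code: the `Tool` descriptor, its field names, the `skip_reason` helper, and the `exa_contents` tool name in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool descriptor; the real runner's fields may differ."""
    name: str
    needs_network: bool = False
    is_paid: bool = False
    required_env: Tuple[str, ...] = ()


def skip_reason(
    tool: Tool,
    env: Mapping[str, str],
    allow_network: bool = False,
    allow_paid: bool = False,
) -> Optional[str]:
    """Return why a tool would be skipped, or None if it may run."""
    # Explicit flags are checked first: a key alone never enables a paid call.
    if tool.needs_network and not allow_network:
        return "network access not allowed (pass --allow-network)"
    if tool.is_paid and not allow_paid:
        return "paid execution not allowed (pass --allow-paid)"
    # Then the keys: a missing or empty variable means the tool is skipped.
    missing = [key for key in tool.required_env if not env.get(key)]
    if missing:
        return "missing environment variables: " + ", ".join(missing)
    return None
```

With this shape, `skip_reason(Tool("pdftotext_baseline"), {})` returns `None`, while a paid network tool such as a hypothetical `Tool("exa_contents", needs_network=True, is_paid=True, required_env=("EXA_API_KEY",))` stays skipped until both flags and the key are present.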
- pdf_extraction: LlamaParse, Fireparse/Firecrawl document parse, LangExtract, Surya OCR, Extend Parse 2.0, Docling, and a Poppler baseline.
- browser_automation: browser-use terminal, browser-harness, dev-browser, and a Playwright script on the same form-driven investigative tasks.
- scraping: Firecrawl scrape, Exa contents, and Obscura fetch. Search endpoints are intentionally excluded because they answer a different use case.
python3 -m benchmarkers.cli list
python3 -m benchmarkers.cli doctor
python3 -m benchmarkers.cli run --category pdf_extraction --tool pdftotext_baseline
python3 -m benchmarkers.cli report
open public/index.html
If benchmarks/.env exists, it is loaded automatically before doctor and run.
Use --env-file <path> only when you want to point at a different dotenv.
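For readers unfamiliar with dotenv loading, here is a minimal sketch of the behavior described above. The function names `parse_dotenv` and `load_env` are hypothetical, and the rule that already-set environment variables take precedence over file values is an assumption, not something the runner is known to do.

```python
import os
from pathlib import Path


def parse_dotenv(text):
    """Parse KEY=VALUE lines; skip blanks, comments, and malformed lines."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        if key:
            values[key] = value
    return values


def load_env(env_file=None, default="benchmarks/.env", environ=None):
    """Load env_file when given, otherwise the default path if it exists."""
    environ = os.environ if environ is None else environ
    path = Path(env_file).expanduser() if env_file else Path(default)
    if not path.is_file():
        return {}
    loaded = parse_dotenv(path.read_text(encoding="utf-8"))
    for key, value in loaded.items():
        # Assumption: variables already set in the environment win.
        environ.setdefault(key, value)
    return loaded
```

Calling `load_env()` with no arguments mirrors the automatic case; `load_env("~/.claude/.env")` mirrors passing `--env-file`.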
Network/API tools are gated:
python3 -m benchmarkers.cli doctor --allow-network --allow-paid
python3 -m benchmarkers.cli run --category scraping --allow-network --allow-paid
Reports can be built from one run, or combined from several runs to avoid re-spending paid credits:
python3 -m benchmarkers.cli combine \
results/run-a/results.json \
results/run-b/results.json \
--output results/combined-current.json \
--update-latest
python3 -m benchmarkers.cli report
To load keys from a global dotenv instead of the local .env:
python3 -m benchmarkers.cli doctor --allow-network --allow-paid --env-file ~/.claude/.env
python3 -m benchmarkers.cli run --category scraping --allow-network --allow-paid --env-file ~/.claude/.env
The runner writes:
- results/<timestamp>/results.json
- results/latest.json
- public/index.html
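As a rough illustration of the JSON part of that layout, the sketch below writes a timestamped run file and refreshes `latest.json`. The `write_run` name, the timestamp format, and the choice to copy rather than symlink are all assumptions; the runner's real format may differ.

```python
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def write_run(results, root="results", now=None):
    """Write results/<timestamp>/results.json and refresh results/latest.json."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")  # assumed timestamp format
    run_dir = Path(root) / stamp
    run_dir.mkdir(parents=True, exist_ok=True)
    run_file = run_dir / "results.json"
    run_file.write_text(
        json.dumps(results, indent=2, sort_keys=True), encoding="utf-8"
    )
    # Copy instead of symlinking so latest.json works on static hosts.
    shutil.copyfile(run_file, Path(root) / "latest.json")
    return run_file
```

Keeping every run under its own directory is what makes it possible to combine earlier runs later without re-executing paid tools.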
Copy the static report and JSON into another static directory when needed:
python3 -m benchmarkers.cli export-site --target site-static/benchmarks
Do not commit keys. The runner reads keys from environment variables and skips any tool whose key is missing.
Expected variables:
- FIRECRAWL_API_KEY
- EXA_API_KEY
- LLAMA_CLOUD_API_KEY
- EXTEND_API_KEY
- LANGEXTRACT_MODEL
- LANGEXTRACT_PROVIDER: optional; set to openai for OpenAI-compatible endpoints
- LANGEXTRACT_MODEL_URL: optional; set to an Ollama URL such as http://localhost:11434
- LANGEXTRACT_BASE_URL: optional; OpenAI-compatible base URL when LANGEXTRACT_PROVIDER=openai
- LANGEXTRACT_API_KEY: only when the selected LangExtract model backend requires a cloud/API key
Some tools may also use their own CLI auth stores. Firecrawl, for example, can use its configured CLI auth.
LANGEXTRACT_MODEL is required to run LangExtract. For local Ollama, use a model such as gemma2:2b and optionally LANGEXTRACT_MODEL_URL=http://localhost:11434; no LangExtract API key is required. For OpenAI-compatible endpoints, set LANGEXTRACT_PROVIDER=openai, LANGEXTRACT_BASE_URL, and the provider key in LANGEXTRACT_API_KEY.
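The LangExtract variable rules above can be sketched as a resolver that turns the environment into a settings dictionary. The `langextract_config` name and the keys of the returned dictionary are hypothetical; only the variable names come from this README.

```python
def langextract_config(env):
    """Resolve LangExtract settings from environment variables.

    Returns None when LANGEXTRACT_MODEL is unset, meaning the tool is skipped.
    The output keys are illustrative, not the adapter's real schema.
    """
    model = env.get("LANGEXTRACT_MODEL")
    if not model:
        return None
    config = {"model": model}
    if env.get("LANGEXTRACT_PROVIDER") == "openai":
        # OpenAI-compatible endpoint: provider plus optional base URL.
        config["provider"] = "openai"
        if env.get("LANGEXTRACT_BASE_URL"):
            config["base_url"] = env["LANGEXTRACT_BASE_URL"]
    elif env.get("LANGEXTRACT_MODEL_URL"):
        # Local backend such as Ollama: only a model URL is needed.
        config["model_url"] = env["LANGEXTRACT_MODEL_URL"]
    if env.get("LANGEXTRACT_API_KEY"):
        config["api_key"] = env["LANGEXTRACT_API_KEY"]
    return config
```

For a local Ollama setup, passing `{"LANGEXTRACT_MODEL": "gemma2:2b", "LANGEXTRACT_MODEL_URL": "http://localhost:11434"}` yields a configuration with no API key.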
Large binaries and model caches are deliberately not committed.
Obscura is installed locally under bin/obscura/ from the official macOS Apple Silicon release:
mkdir -p bin/obscura
curl -sSL https://github.com/h4ckf0r0day/obscura/releases/latest/download/obscura-aarch64-macos.tar.gz \
-o /private/tmp/obscura-aarch64-macos.tar.gz
tar -xzf /private/tmp/obscura-aarch64-macos.tar.gz -C bin/obscura
The adapter uses:
bin/obscura/obscura fetch <url> --dump text --timeout <seconds> --quiet
Docling, Surya, and browser-use are run through uvx so their large Python environments and model artifacts stay outside git:
- Docling: uvx --from docling-slim docling
- Surya OCR: uvx --from surya-ocr surya_ocr
- browser-use: uvx --from browser-use browser-use
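An adapter could assemble these uvx invocations from a small lookup table, as in the sketch below. The `UVX_TOOLS` table, its keys, and both helper names are hypothetical; only the package and entry point pairs come from the list above, and no tool-specific arguments are assumed.

```python
import shutil

# (package, entry point) pairs taken from the uvx commands listed above.
# The dictionary keys are illustrative identifiers, not the runner's names.
UVX_TOOLS = {
    "docling": ("docling-slim", "docling"),
    "surya_ocr": ("surya-ocr", "surya_ocr"),
    "browser-use": ("browser-use", "browser-use"),
}


def uvx_command(tool, *args):
    """Build the argv list for running a tool through uvx."""
    package, entry_point = UVX_TOOLS[tool]
    return ["uvx", "--from", package, entry_point, *args]


def uvx_available():
    """Report whether a uvx executable is on PATH."""
    return shutil.which("uvx") is not None
```

The resulting list can be passed directly to `subprocess.run`, which avoids shell quoting issues with file paths.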
The PDF set is intentionally small and limited to public source URLs, with local cached copies under ../fine-tuning/source-pdfs used for repeatable local parsing:
- shultz-follow-the-money.pdf: public policy manual with non-linear front matter, budget/oil revenue terminology, and policy-report layout.
- unesco-story-based-inquiry.pdf: public UNESCO investigative manual for throughput, chapter extraction, and method terminology.
- gijn-citizen-investigations.pdf: public GIJN guide with concrete OSINT tasks, organization/entity probes, and image-heavy pages.
Rank available PDFs by rough extraction difficulty:
python3 -m benchmarkers.cli pdf-audit
The repository includes the latest static report at public/index.html. GitHub Pages can publish that directory directly, which makes the report easy to iframe from another site.
Current notable findings in the report:
- pdftotext_baseline is fast and strong on the current public born-digital PDF set.
- Browser automation is now scored on four investigative form workflows, not snapshots: Companies House filing history, OpenSanctions entity screening, Wikidata entity identity, and OpenStreetMap place lookup.
- dev-browser and the Playwright script completed all four browser workflows and returned the target evidence.
- browser-harness currently fails the browser tasks with a CDP keepalive timeout.
- browser-use terminal executed but returned no target evidence with the current adapter, so it is shown as missed rather than successful.
- Scraping now uses harder registry/civic-monitoring sources from the Scoutpost benchmark family: Companies House, Basel-Stadt protocols, Zurich Gemeinderat protocols, and Lausanne Conseil communal PVs.
- Firecrawl scrape, Exa contents API, and Obscura headless browser are compared as scraping/content retrieval tools; Firecrawl and Exa search endpoints are excluded.
- Exa contents API returned explicit retrieval errors on two civic pages in the latest scraping run; those are scored as failures, not accidental URL/probe matches.
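The distinction between a failure, a miss, and a hit in the findings above can be illustrated with a small scoring function. This is a sketch under stated assumptions: the `score_fetch` name, the status labels, and the rule that every probe must match for a hit are hypothetical, not the report's actual scoring.

```python
def score_fetch(text, probes, error=None):
    """Classify one retrieval result as failed, hit, or missed.

    An explicit tool error is always a failure, even if the error text
    happens to contain a probe string such as part of the requested URL.
    """
    if error:
        return {"status": "failed", "matched": [], "error": error}
    lowered = (text or "").lower()
    matched = [probe for probe in probes if probe.lower() in lowered]
    # Assumption: a hit requires every probe; anything less is a miss.
    status = "hit" if probes and len(matched) == len(probes) else "missed"
    return {"status": status, "matched": matched, "error": None}
```

Checking the error before matching probes is the design choice that matters here: it prevents an error message that echoes the requested URL from being counted as retrieved evidence.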