Switch extraction fetcher to curl_cffi for Cloudflare-fronted sources by smodee · Pull Request #20 · algorithmicgovernance/BioScanCast

smodee · 2026-05-12T09:14:10Z

Summary

Replaces httpx with curl_cffi in bioscancast/extraction/fetcher.py so HTTP fetches use a browser TLS fingerprint (Chrome by default). This gets past Cloudflare/JA3-based blocks that returned HTTP 401 on Reuters and intermittent 403s on CIDRAP during the docling eval.
ExtractionConfig.user_agent is replaced with .impersonate ("chrome" default); the impersonation profile sets a matching browser UA automatically.
Streaming, the 25 MB cap, Content-Length pre-check, magic-byte sniffing, and the never-raise-on-network-errors contract are preserved. Uses try/finally + response.close() since the streaming Response in curl_cffi 0.15.0 isn't a context manager.
New tests/conftest.py registers a live pytest marker and a --live CLI flag; live integration tests are skipped by default.
New live test hits https://www.reuters.com/ to confirm the Cloudflare path actually works end-to-end.
README gains a Dependencies section explaining the curl_cffi choice and the pytest --live opt-in.
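
The fetch path described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in bioscancast/extraction/fetcher.py: `FetchResult`, `sniff_kind`, `read_capped`, `fetch`, and the sniffing rules are hypothetical names and choices made for the example. Only the use of curl_cffi with `impersonate`, streaming, the 25 MB cap, the Content-Length pre-check, magic-byte sniffing, try/finally with `response.close()`, and the never-raise contract come from the summary.

```python
from dataclasses import dataclass
from typing import Optional

MAX_BYTES = 25 * 1024 * 1024  # the 25 MB cap named in the summary


@dataclass
class FetchResult:
    """Hypothetical result type; the project's real return type may differ."""
    ok: bool
    status: Optional[int] = None
    body: bytes = b""
    kind: Optional[str] = None
    error: Optional[str] = None


def sniff_kind(head: bytes) -> str:
    """Guess the payload type from its first bytes (illustrative rules only)."""
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.lstrip().startswith(b"<"):
        return "html"
    return "unknown"


def read_capped(response, max_bytes: int = MAX_BYTES) -> FetchResult:
    """Drain a streaming response without exceeding max_bytes; never raises."""
    try:
        status = response.status_code
        # Pre-check: reject early when the server declares an oversized body.
        declared = response.headers.get("Content-Length")
        if declared is not None and declared.isdigit() and int(declared) > max_bytes:
            return FetchResult(ok=False, status=status, error="too_large")
        buf = bytearray()
        for chunk in response.iter_content():
            buf.extend(chunk)
            # Content-Length can be absent or wrong, so enforce the cap while reading.
            if len(buf) > max_bytes:
                return FetchResult(ok=False, status=status, error="too_large")
        body = bytes(buf)
        return FetchResult(ok=status == 200, status=status, body=body,
                           kind=sniff_kind(body[:16]))
    except Exception as exc:
        # Network errors become a failed result instead of propagating.
        return FetchResult(ok=False, error=f"{type(exc).__name__}: {exc}")
    finally:
        # Explicit close, since the streaming response is not used as a context manager.
        response.close()


def fetch(url: str, impersonate: str = "chrome", timeout: float = 30.0) -> FetchResult:
    """Open a streaming GET with a browser TLS fingerprint; never raises."""
    try:
        # Imported lazily so the helpers above stay usable without curl_cffi.
        from curl_cffi import requests
        response = requests.get(url, impersonate=impersonate, stream=True,
                                timeout=timeout)
    except Exception as exc:
        # Covers a missing dependency as well as connection failures.
        return FetchResult(ok=False, error=f"{type(exc).__name__}: {exc}")
    return read_capped(response)
```

Keeping the drain logic in a helper that accepts any response-like object means the cap and the close-on-every-path behaviour can be tested with a fake response, without network access.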

Closes #18.

Test plan

pytest bioscancast/tests/ — 179 passed, 1 skipped (live)
pytest bioscancast/tests/test_extraction_fetcher.py --live — 15 passed (Reuters returns 200 + HTML, confirming the Cloudflare bypass)
Reviewer: spot-check that no other module still expects ExtractionConfig.user_agent (grep confirms only the new impersonate field is referenced)
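
The --live opt-in exercised in the test plan can be wired up with standard pytest hooks. The sketch below shows one common way to do it and is an assumption about the shape of tests/conftest.py, not a copy of it; the help text and skip reason are invented for illustration.

```python
import pytest


def pytest_addoption(parser):
    # Register the --live CLI flag; it is off unless passed explicitly.
    parser.addoption(
        "--live",
        action="store_true",
        default=False,
        help="run tests marked 'live' (these make real network requests)",
    )


def pytest_configure(config):
    # Declare the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line(
        "markers", "live: test performs real network requests; needs --live"
    )


def pytest_collection_modifyitems(config, items):
    # With --live, run everything as collected.
    if config.getoption("--live"):
        return
    # Otherwise skip every test that carries the 'live' marker.
    skip_live = pytest.mark.skip(reason="needs --live to run")
    for item in items:
        if "live" in item.keywords:
            item.add_marker(skip_live)
```

A test then opts in by decorating itself with `@pytest.mark.live`. A plain `pytest` run reports it as skipped, which matches the "1 skipped (live)" line above, and `pytest --live` runs it.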

🤖 Generated with Claude Code

Replaces httpx with curl_cffi in bioscancast/extraction/fetcher.py so the fetcher uses a browser TLS fingerprint (Chrome by default) and gets past Cloudflare/JA3 blocks that returned 401 on Reuters and intermittent 403s on CIDRAP during the docling eval. Streaming, the 25 MB cap, magic-byte sniffing, and the never-raise contract are preserved. ExtractionConfig.user_agent is replaced with .impersonate; the impersonation profile sets a matching browser UA automatically. A pytest --live flag (registered in a new conftest.py) gates a live integration test against Reuters; the suite stays hermetic by default.

Closes #18.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

smodee mentioned this pull request May 17, 2026

Fix five OpenAI-integration bugs surfaced by end-to-end run #22

Merged

rapsoj approved these changes May 18, 2026

rapsoj merged commit c4ab5c9 into main May 18, 2026

rapsoj deleted the feat/stage3-extraction-curl_cffi branch May 18, 2026 16:43

rapsoj merged 1 commit into main from feat/stage3-extraction-curl_cffi
