Switch extraction fetcher to curl_cffi for Cloudflare-fronted sources #20

Merged
rapsoj merged 1 commit into main from feat/stage3-extraction-curl_cffi
May 18, 2026

Conversation

@smodee
Collaborator

@smodee smodee commented May 12, 2026

Summary

  • Replaces httpx with curl_cffi in bioscancast/extraction/fetcher.py so HTTP fetches use a browser TLS fingerprint (Chrome by default). This gets past Cloudflare/JA3-based blocks that returned HTTP 401 on Reuters and intermittent 403s on CIDRAP during the docling eval.
  • ExtractionConfig.user_agent is replaced with .impersonate ("chrome" default); the impersonation profile sets a matching browser UA automatically.
  • Streaming, the 25 MB cap, the Content-Length pre-check, magic-byte sniffing, and the never-raise-on-network-errors contract are all preserved. The fetcher now closes the response with try/finally + response.close(), because the streaming Response in curl_cffi 0.15.0 isn't a context manager.
  • New tests/conftest.py registers a live pytest marker and a --live CLI flag; live integration tests are skipped by default.
  • New live test hits https://www.reuters.com/ to confirm the Cloudflare path actually works end-to-end.
  • README gains a Dependencies section explaining the curl_cffi choice and the pytest --live opt-in.
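
The never-raise contract and the try/finally + response.close() pattern from the summary could look roughly like the sketch below. It is not the code in bioscancast/extraction/fetcher.py: FetchResult, safe_fetch, and the injected open_stream callable are hypothetical names chosen for illustration, and the curl_cffi call in the comment is an assumption about how the real fetcher opens the stream.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class FetchResult:
    """Hypothetical result type; the fetcher's real return type is not shown in this PR."""
    status: Optional[int] = None
    body: bytes = b""
    error: Optional[str] = None  # set instead of raising


def safe_fetch(open_stream: Callable[[str], Any], url: str) -> FetchResult:
    """Fetch url without ever raising, and always close the streaming response."""
    try:
        # Assumption: in the real fetcher this would be something like
        # curl_cffi.requests.get(url, impersonate="chrome", stream=True).
        response = open_stream(url)
    except Exception as exc:  # connection errors, timeouts, TLS failures
        return FetchResult(error=f"{type(exc).__name__}: {exc}")

    try:
        body = b"".join(response.iter_content())
        return FetchResult(status=response.status_code, body=body)
    except Exception as exc:  # errors raised while reading the stream
        return FetchResult(error=f"{type(exc).__name__}: {exc}")
    finally:
        # Explicit close, since the response is not used as a context manager.
        response.close()
```

Passing the opener in as a callable keeps the sketch testable without network access; callers inspect the error field rather than catching exceptions.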

Closes #18.

Test plan

  • pytest bioscancast/tests/ — 179 passed, 1 skipped (live)
  • pytest bioscancast/tests/test_extraction_fetcher.py --live — 15 passed (Reuters returns 200 + HTML, confirming the Cloudflare bypass)
  • Reviewer: spot-check that no other module still expects ExtractionConfig.user_agent (grep confirms only the new impersonate field is referenced)
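
For reviewers unfamiliar with the opt-in pattern, a conftest.py that registers a live marker and a --live flag is commonly written along the lines below. This is a minimal sketch using standard pytest hooks, not the conftest.py added in this PR; the help text and skip reason are made up for illustration.

```python
import pytest


def pytest_addoption(parser):
    # Adds the --live command-line flag; off by default so the suite stays hermetic.
    parser.addoption(
        "--live",
        action="store_true",
        default=False,
        help="run tests marked 'live' (they make real network requests)",
    )


def pytest_configure(config):
    # Registers the marker so @pytest.mark.live does not trigger an unknown-marker warning.
    config.addinivalue_line("markers", "live: test makes real network requests")


def pytest_collection_modifyitems(config, items):
    # With --live, run everything; without it, skip every test marked live.
    if config.getoption("--live"):
        return
    skip_live = pytest.mark.skip(reason="needs --live to run")
    for item in items:
        if "live" in item.keywords:
            item.add_marker(skip_live)
```

A test then opts in by carrying the @pytest.mark.live decorator, and is reported as skipped unless pytest is invoked with --live.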

🤖 Generated with Claude Code

Replaces httpx with curl_cffi in bioscancast/extraction/fetcher.py so the
fetcher uses a browser TLS fingerprint (Chrome by default) and gets past
Cloudflare/JA3 blocks that returned 401 on Reuters and intermittent 403s
on CIDRAP during the docling eval. Streaming, 25 MB cap, magic-byte
sniffing, and the never-raise contract are preserved.
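
The size and content-type guards named above can be illustrated with a small stdlib-only sketch. All function names here are hypothetical, the interpretation of 25 MB as 25 * 1024 * 1024 bytes is an assumption, and the byte signatures checked are examples only; the PR does not say which types the fetcher sniffs.

```python
from typing import Iterable, Mapping, Optional

# Assumption: "25 MB" is taken as 25 * 1024 * 1024 bytes.
MAX_BYTES = 25 * 1024 * 1024


def declared_size_too_large(headers: Mapping[str, str], max_bytes: int = MAX_BYTES) -> bool:
    """Content-Length pre-check: reject early when the server declares an oversized body."""
    raw = headers.get("Content-Length")  # plain dict lookup, so the key is case-sensitive here
    if raw is None:
        return False  # no declared size; the streaming cap still applies
    try:
        return int(raw) > max_bytes
    except ValueError:
        return False  # malformed header; fall back to the streaming cap


def read_capped(chunks: Iterable[bytes], max_bytes: int = MAX_BYTES) -> Optional[bytes]:
    """Accumulate streamed chunks, giving up (None) once the cap is exceeded."""
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        if len(buf) > max_bytes:
            return None
    return bytes(buf)


def sniff_kind(body: bytes) -> str:
    """Magic-byte sniffing: classify by leading bytes rather than trusting headers."""
    if body.startswith(b"%PDF-"):
        return "pdf"
    head = body[:1024].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    return "unknown"
```

The pre-check avoids downloading bodies the server already admits are too large, while the streaming cap covers responses with a missing or wrong Content-Length.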

ExtractionConfig.user_agent is replaced with .impersonate; the
impersonation profile sets a matching browser UA automatically. A pytest
--live flag (registered in a new conftest.py) gates a live integration
test against Reuters; suite stays hermetic by default.

Closes #18.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@rapsoj rapsoj merged commit c4ab5c9 into main May 18, 2026
@rapsoj rapsoj deleted the feat/stage3-extraction-curl_cffi branch May 18, 2026 16:43

Development

Successfully merging this pull request may close these issues.

Add curl_cffi to fetcher to handle Cloudflare-fronted HTML sources

2 participants