Background
During the Docling vs in-tree extraction evaluation (see data/docling_eval/FINDINGS.md), the Reuters source returned HTTP 401 (Unauthorized) when fetched with both Docling's default httpx fetcher and our in-tree fetcher.py. The same URL fetched fine with curl from the command line.
Root cause: Cloudflare Bot Management fingerprints the TLS handshake (JA3/JA4), not just the User-Agent string, so changing the User-Agent alone does not help. httpx and requests both negotiate TLS through Python's ssl module, so they present a similar non-browser fingerprint, which Cloudflare blocks. curl presents a different TLS fingerprint that some sites allow.
CIDRAP also intermittently 403'd on httpx mid-eval while succeeding with curl. As more news sources adopt Cloudflare or similar bot protection, the in-tree fetcher will hit this wall more often.
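To illustrate why a browser-like User-Agent does not change the outcome, here is a minimal sketch of how a JA3 fingerprint is derived from ClientHello fields. The numeric values below are hypothetical and chosen only for illustration; real bot-management systems combine JA3/JA4 with other signals, so this is not a model of Cloudflare's actual logic.

```python
import hashlib


def ja3_hash(tls_version, ciphers, extensions, curves, point_formats):
    """Return the JA3 digest for ClientHello fields given as decimal ints."""
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    # JA3 joins the five fields with commas and takes the MD5 of the result.
    ja3_string = ",".join(fields)
    return hashlib.md5(ja3_string.encode("ascii")).hexdigest()


# Hypothetical ClientHello values for two clients (not captured from real traffic).
python_like = ja3_hash(771, [4866, 4867, 4865], [0, 11, 10, 35], [29, 23, 24], [0])
browser_like = ja3_hash(771, [4865, 4866, 4867], [0, 23, 65281, 10], [29, 23, 24], [0])

# The User-Agent header is not an input to the fingerprint at all:
# only the cipher/extension/curve lists and their order matter.
print(python_like != browser_like)
```

Because the digest depends on the order of ciphers and extensions, two clients with identical headers can still be told apart at the TLS layer.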
Goal
Replace (or augment) the in-tree fetcher's HTTP client with curl_cffi, a Python wrapper around curl that impersonates Chrome/Firefox TLS fingerprints. It exposes a requests-like API, so the swap should be close to drop-in.
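A minimal sketch of what the call could look like. The function name fetch_impersonated, its default timeout, and the injectable get parameter are assumptions for illustration and are not taken from fetcher.py. The curl_cffi import is deferred so the sketch can be exercised with a fake getter where curl_cffi is not installed.

```python
from typing import Any, Callable, Optional


def fetch_impersonated(
    url: str,
    profile: str = "chrome",
    timeout: float = 30.0,  # assumed value, not the fetcher's real timeout
    get: Optional[Callable[..., Any]] = None,
) -> Any:
    """GET url with a browser-like TLS fingerprint and return the response."""
    if get is None:
        # Imported lazily so this module still imports without curl_cffi.
        from curl_cffi import requests as curl_requests

        get = curl_requests.get
    return get(url, impersonate=profile, timeout=timeout)
```

Passing a fake get callable in tests keeps them off the network, which matches the mocked unit tests proposed under Approach.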
Approach
- Add curl_cffi to requirements.txt (pin a reasonable minimum).
- In bioscancast/extraction/fetcher.py:
  - Replace the httpx call with curl_cffi.requests.get(url, impersonate="chrome", timeout=..., stream=...).
  - Or: keep httpx as the default and fall back to curl_cffi on 401/403 with a UA-related error message.
  - Preserve current behaviour: streaming, 25 MB cap, User-Agent header (browser-like now, no longer custom), magic-byte sniffing, Content-Length pre-check.
- Pick an impersonation profile: chrome is the safe default; revisit if specific sources need firefox or a versioned profile.
- Tests:
  - Unit tests with mocked curl_cffi (no real network).
  - Integration smoke test (marked @pytest.mark.live) hitting a CDN-fronted URL that 401s on httpx.
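The fallback variant and the preserved behaviours above could be sketched roughly as follows. Every name here (fetch_with_fallback, read_capped, sniff_kind, TooLarge) is hypothetical, as is the convention that a fetch callable returns (status, headers, chunk iterator) with lower-cased header names. In real code the primary would wrap httpx and the fallback curl_cffi. For simplicity the sketch falls back on any 401/403 and does not inspect the error message.

```python
from typing import Callable, Iterable, Tuple

MAX_BYTES = 25 * 1024 * 1024  # the 25 MB cap
BLOCKED_STATUSES = {401, 403}

# Assumed shape of a fetch callable: (status, headers, iterator of body chunks).
Fetch = Callable[[str], Tuple[int, dict, Iterable[bytes]]]


class TooLarge(Exception):
    """Raised when a response body exceeds the size cap."""


def read_capped(headers: dict, chunks: Iterable[bytes], limit: int = MAX_BYTES) -> bytes:
    """Read a streamed body, checking Content-Length first and the running size after."""
    declared = headers.get("content-length")
    if declared is not None and str(declared).isdigit() and int(declared) > limit:
        raise TooLarge(f"Content-Length {declared} exceeds {limit}")
    body = bytearray()
    for chunk in chunks:
        body.extend(chunk)
        if len(body) > limit:
            raise TooLarge(f"body exceeds {limit} bytes")
    return bytes(body)


def sniff_kind(body: bytes) -> str:
    """Tiny magic-byte check: distinguishes PDF from everything else."""
    return "pdf" if body.startswith(b"%PDF-") else "other"


def fetch_with_fallback(
    url: str, primary: Fetch, fallback: Fetch, limit: int = MAX_BYTES
) -> Tuple[int, bytes]:
    """Try the primary client; on 401/403 retry once with the fallback client."""
    status, headers, chunks = primary(url)
    if status in BLOCKED_STATUSES:
        status, headers, chunks = fallback(url)
    return status, read_capped(headers, chunks, limit)


# Usage with fake clients, so nothing touches the network.
def blocked(url):
    return 401, {}, iter([b"denied"])


def allowed(url):
    return 200, {"content-length": "9"}, iter([b"%PDF-", b"1.7\n"])


status, body = fetch_with_fallback("https://example.org/report.pdf", blocked, allowed)
print(status, sniff_kind(body))  # prints: 200 pdf
```

Keeping the size cap and sniffing independent of the HTTP client means the same checks apply whichever client served the response. A real implementation would also need to close the blocked primary response before retrying.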
Definition of done
- curl_cffi is in requirements.txt.
- [...] curl_cffi for HTTP requests, preserving streaming/size-cap behaviour.
- [...] --live).
Out of scope
- Full browser automation (Playwright/Selenium) for sites that defeat even curl_cffi. If we hit one, file a separate issue and consider a per-source playwright fallback or a news API.
- Residential proxies. Not needed yet.
References
- data/docling_eval/FINDINGS.md, "Cloudflare: trafilatura won't help" section.
- curl_cffi: https://github.com/yifeikong/curl_cffi