Add curl_cffi to fetcher to handle Cloudflare-fronted HTML sources #18

@smodee

Background

During the Docling vs in-tree extraction evaluation (see data/docling_eval/FINDINGS.md), the Reuters source returned HTTP 401 (Unauthorized) when fetched with both Docling's default httpx fetcher and our in-tree fetcher.py. The same URL fetched fine with curl from the command line.

Root cause: Cloudflare Bot Management fingerprints the TLS handshake (JA3/JA4), not just the User-Agent string. Both httpx and Python's requests negotiate TLS through Python's ssl module (OpenSSL), so they present a similar, non-browser handshake fingerprint, which Cloudflare blocks. curl presents a different TLS fingerprint that some sites allow.
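
To make the fingerprinting point concrete, here is a minimal sketch of how a JA3 hash is derived from ClientHello fields. The `ClientHello` class and the numeric values are illustrative assumptions, not captured fingerprints of httpx, curl, or any browser. Note that the User-Agent header is not an input at all, which is why changing it alone does not help.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ClientHello:
    """The ClientHello fields JA3 looks at (hypothetical container, example values)."""
    tls_version: int
    ciphers: List[int]
    extensions: List[int]
    curves: List[int]
    point_formats: List[int]


def ja3_string(hello: ClientHello) -> str:
    # JA3 joins the values inside each field with "-" and the fields with ",".
    fields = [
        str(hello.tls_version),
        "-".join(map(str, hello.ciphers)),
        "-".join(map(str, hello.extensions)),
        "-".join(map(str, hello.curves)),
        "-".join(map(str, hello.point_formats)),
    ]
    return ",".join(fields)


def ja3_hash(hello: ClientHello) -> str:
    # The fingerprint is the MD5 hex digest of the JA3 string.
    return hashlib.md5(ja3_string(hello).encode("ascii")).hexdigest()


# Two made-up clients that differ only in cipher-suite order.
client_a = ClientHello(771, [4865, 4866, 4867], [0, 10, 11], [29, 23], [0])
client_b = ClientHello(771, [4867, 4866, 4865], [0, 10, 11], [29, 23], [0])

print(ja3_string(client_a))
print(ja3_hash(client_a) == ja3_hash(client_b))
```

The two clients hash differently even though they would send identical HTTP headers, so a bot filter keyed on the hash can tell them apart before any request is read.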

CIDRAP also intermittently 403'd on httpx mid-eval while succeeding with curl. As more news sources adopt Cloudflare or similar bot protection, the in-tree fetcher will hit this wall more often.

Goal

Replace (or augment) the in-tree fetcher's HTTP client with curl_cffi, a Python binding for curl-impersonate that can reproduce Chrome/Firefox TLS fingerprints. It exposes a requests-like API, so the swap should be close to drop-in.

Approach

  1. Add curl_cffi to requirements.txt (pin a reasonable minimum).
  2. In bioscancast/extraction/fetcher.py:
    • Replace the httpx call with curl_cffi.requests.get(url, impersonate="chrome", timeout=..., stream=...).
    • Or: keep httpx as the default and fall back to curl_cffi when a response comes back 401/403 with a UA-related error message.
    • Preserve current behaviour: streaming, 25 MB cap, User-Agent header (browser-like now, no longer custom), magic-byte sniffing, Content-Length pre-check.
  3. Pick an impersonation profile: chrome is the safe default; revisit if specific sources need firefox or a versioned profile.
  4. Tests:
    • Unit tests with mocked curl_cffi (no real network).
    • Integration smoke test (marked @pytest.mark.live) hitting a CDN-fronted URL that 401s on httpx.
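
The two options in step 2 can be sketched roughly as follows. This is not the current fetcher.py: `fetch_impersonated`, `fetch_with_fallback`, `read_capped`, `ResponseTooLarge`, the 30 second timeout, and the `(status, body)` shape of the httpx path are all hypothetical names and assumptions made for illustration. The `get` parameter exists so unit tests can pass a fake instead of touching the network.

```python
from typing import Callable, Iterable, Optional, Tuple

MAX_BYTES = 25 * 1024 * 1024  # the 25 MB cap mentioned above
BLOCKED_STATUSES = {401, 403}  # statuses treated as "probably bot-blocked"


class ResponseTooLarge(Exception):
    """Raised when a body exceeds the size cap (hypothetical exception)."""


def read_capped(chunks: Iterable[bytes], max_bytes: int = MAX_BYTES) -> bytes:
    """Accumulate streamed chunks, aborting once the cap is exceeded."""
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        if len(buf) > max_bytes:
            raise ResponseTooLarge(f"body exceeds {max_bytes} bytes")
    return bytes(buf)


def fetch_impersonated(
    url: str,
    get: Optional[Callable] = None,
    timeout: float = 30.0,  # assumed value, not taken from fetcher.py
    max_bytes: int = MAX_BYTES,
) -> bytes:
    """Option A: fetch with curl_cffi, keeping streaming and the size cap."""
    if get is None:
        # Imported lazily so tests can inject a fake `get` without curl_cffi.
        from curl_cffi import requests as curl_requests
        get = curl_requests.get
    resp = get(url, impersonate="chrome", timeout=timeout, stream=True)
    try:
        resp.raise_for_status()
        # Content-Length pre-check: reject early when the server declares too much.
        declared = resp.headers.get("Content-Length")
        if declared is not None and declared.isdigit() and int(declared) > max_bytes:
            raise ResponseTooLarge(f"declared length {declared} exceeds cap")
        return read_capped(resp.iter_content(), max_bytes)
    finally:
        resp.close()


def fetch_with_fallback(
    url: str,
    primary: Callable[[str], Tuple[int, bytes]],
    fallback: Callable[[str], bytes] = fetch_impersonated,
) -> bytes:
    """Option B: keep the existing client, retry via curl_cffi on 401/403.

    `primary` stands in for the current httpx path and is assumed to
    return (status_code, body).
    """
    status, body = primary(url)
    if status in BLOCKED_STATUSES:
        return fallback(url)
    return body
```

Because the HTTP call is injected, the mocked unit tests in step 4 only need a small fake object with `headers`, `raise_for_status()`, `iter_content()`, and `close()`. Magic-byte sniffing is left out here and would still run on the returned bytes.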

Definition of done

  • curl_cffi is in requirements.txt.
  • Fetcher uses curl_cffi for HTTP requests, preserving streaming/size-cap behaviour.
  • Existing fetcher tests pass against the new implementation.
  • A live test confirms a Reuters or other Cloudflare-fronted URL fetches successfully (skipped in CI, runnable locally with --live).
  • Documented in README.md under deps.
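
For the live-test item, one common way to get "skipped in CI, runnable locally with --live" is a small conftest.py hook. The sketch below assumes the repository does not already define a `--live` option or a `live` marker; the help text and skip reason are placeholders.

```python
# conftest.py (sketch)
import pytest


def pytest_addoption(parser):
    # Register the --live flag; it is off unless passed explicitly.
    parser.addoption(
        "--live",
        action="store_true",
        default=False,
        help="run tests that hit the real network",
    )


def pytest_configure(config):
    # Declare the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line("markers", "live: test performs real network requests")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--live"):
        return  # --live given: run everything, including live tests
    skip_live = pytest.mark.skip(reason="needs --live to run")
    for item in items:
        if "live" in item.keywords:
            item.add_marker(skip_live)
```

With this in place, a test decorated with `@pytest.mark.live` is collected but skipped under a plain `pytest` run, and executes under `pytest --live`.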

Out of scope

  • Full browser automation (Playwright/Selenium) for sites that defeat even curl_cffi. If we hit one, file a separate issue and consider a per-source playwright fallback or a news API.
  • Residential proxies. Not needed yet.
