Add curl_cffi to fetcher to handle Cloudflare-fronted HTML sources

## Background

During the Docling vs in-tree extraction evaluation (see [`data/docling_eval/FINDINGS.md`](data/docling_eval/FINDINGS.md)), the Reuters source returned **HTTP 401 Unauthorized** when fetched with both Docling's default httpx fetcher and our in-tree `fetcher.py`. The same URL fetched fine with `curl` from the command line.

Root cause: Cloudflare Bot Management fingerprints the **TLS handshake** (JA3/JA4), not just the User-Agent string. `httpx` and `requests` both negotiate TLS through Python's `ssl` module, so they present a similar, recognisably non-browser fingerprint, which Cloudflare blocks. `curl` presents a different fingerprint that some sites allow.
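
To illustrate why changing the User-Agent alone does not help, here is a minimal sketch of how a JA3 fingerprint is derived: the decimal values of five ClientHello fields are joined into a string and hashed with MD5. The field values below are made up for illustration and are not captured from `httpx`, `curl`, or any browser.

```python
import hashlib
from typing import Sequence


def ja3_string(
    version: int,
    ciphers: Sequence[int],
    extensions: Sequence[int],
    curves: Sequence[int],
    point_formats: Sequence[int],
) -> str:
    """Build the JA3 input string: five comma-separated fields,
    with the values inside each list field joined by '-'."""
    list_fields = (ciphers, extensions, curves, point_formats)
    parts = [str(version)] + ["-".join(str(v) for v in f) for f in list_fields]
    return ",".join(parts)


def ja3_hash(*fields) -> str:
    """JA3 is the MD5 hex digest of the JA3 string."""
    return hashlib.md5(ja3_string(*fields).encode("ascii")).hexdigest()


# Two hypothetical clients that differ only in cipher order. Note that the
# User-Agent header is not an input at all: it is sent after the handshake.
client_a = (771, [4865, 4866, 4867], [0, 23, 10], [29, 23], [0])
client_b = (771, [4866, 4865, 4867], [0, 23, 10], [29, 23], [0])

print(ja3_string(*client_a))
print(ja3_hash(*client_a) == ja3_hash(*client_b))
```

Because the hash depends only on handshake parameters, two clients with identical headers can still be told apart, which matches the behaviour seen in the eval.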

CIDRAP also intermittently 403'd on `httpx` mid-eval while succeeding with `curl`. As more news sources adopt Cloudflare or similar bot protection, the in-tree fetcher will hit this wall more often.

## Goal

Replace (or augment) the in-tree fetcher's HTTP client with [`curl_cffi`](https://github.com/yifeikong/curl_cffi), a Python binding for `curl-impersonate` that reproduces Chrome/Firefox TLS fingerprints. It exposes a largely drop-in, `requests`-like API.

## Approach

1. **Add `curl_cffi` to `requirements.txt`** (pin a reasonable minimum).
2. **In `bioscancast/extraction/fetcher.py`**:
   - Replace the httpx call with `curl_cffi.requests.get(url, impersonate="chrome", timeout=..., stream=...)`.
   - Or: keep httpx as the default and fall back to `curl_cffi` when a request gets a 401/403 whose error message points to User-Agent or bot blocking.
   - Preserve current behaviour: streaming, 25 MB cap, User-Agent header (browser-like now, no longer custom), magic-byte sniffing, Content-Length pre-check.
3. **Pick an impersonation profile**: `chrome` is the safe default; revisit if specific sources need `firefox` or a versioned profile.
4. **Tests**:
   - Unit tests with mocked `curl_cffi` (no real network).
   - Integration smoke test (marked `@pytest.mark.live`) hitting a CDN-fronted URL that 401s on `httpx`.
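
The first option in step 2 could look roughly like the sketch below. It is not the existing `fetcher.py`: the names `fetch`, `FetchError`, `sniff`, `MAGIC`, and the injectable `get` parameter are hypothetical, and the magic-byte table is a placeholder for whatever the real fetcher checks. The `curl_cffi` call is imported lazily and can be swapped for a fake in unit tests, so no real network is needed.

```python
from typing import Callable, Optional, Tuple

MAX_BYTES = 25 * 1024 * 1024  # 25 MB cap, as in the current fetcher

# Hypothetical magic-byte table; the real sniffing rules may differ.
MAGIC = {b"%PDF-": "application/pdf"}


class FetchError(Exception):
    """Hypothetical error type for this sketch."""


def _default_get(url: str, timeout: float):
    # Imported lazily so tests can run without curl_cffi installed.
    from curl_cffi import requests as cffi_requests

    return cffi_requests.get(url, impersonate="chrome", timeout=timeout, stream=True)


def sniff(head: bytes) -> str:
    """Guess a MIME type from the first bytes of the body."""
    for magic, mime in MAGIC.items():
        if head.startswith(magic):
            return mime
    stripped = head.lstrip().lower()
    if stripped.startswith((b"<!doctype html", b"<html")):
        return "text/html"
    return "application/octet-stream"


def fetch(
    url: str,
    timeout: float = 30.0,
    max_bytes: int = MAX_BYTES,
    get: Optional[Callable] = None,
) -> Tuple[bytes, str]:
    """Stream a URL, enforcing the size cap before and during download."""
    resp = (get or _default_get)(url, timeout)
    try:
        if resp.status_code >= 400:
            raise FetchError(f"HTTP {resp.status_code} for {url}")

        # Content-Length pre-check: reject early if the server declares too much.
        declared = resp.headers.get("Content-Length")
        if declared is not None and declared.isdigit() and int(declared) > max_bytes:
            raise FetchError(f"declared size {declared} exceeds cap {max_bytes}")

        # Streaming check: Content-Length may be absent or wrong.
        buf = bytearray()
        for chunk in resp.iter_content():
            buf.extend(chunk)
            if len(buf) > max_bytes:
                raise FetchError(f"body exceeds cap {max_bytes}")

        body = bytes(buf)
        return body, sniff(body[:64])
    finally:
        resp.close()
```

Passing `get=` keeps the unit tests independent of `curl_cffi` itself; a fake response only needs `status_code`, `headers`, `iter_content()`, and `close()`.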

## Definition of done

- [ ] `curl_cffi` is in `requirements.txt`.
- [ ] Fetcher uses `curl_cffi` for HTTP requests, preserving streaming/size-cap behaviour.
- [ ] Existing fetcher tests pass against the new implementation.
- [ ] A live test confirms a Reuters or other Cloudflare-fronted URL fetches successfully (skipped in CI, runnable locally with `--live`).
- [ ] Documented in [README.md](README.md) under deps.
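
One way to get the "skipped in CI, runnable locally with `--live`" behaviour is the standard pytest pattern of a command-line option plus a collection hook. The sketch below assumes it lives in a `conftest.py` at the test root; the file location and the skip message are assumptions, not something the repo already has.

```python
import pytest


def pytest_addoption(parser):
    # Register the --live flag; off by default so CI never touches the network.
    parser.addoption(
        "--live",
        action="store_true",
        default=False,
        help="run tests marked @pytest.mark.live against real URLs",
    )


def pytest_configure(config):
    # Declare the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line("markers", "live: test performs real network requests")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--live"):
        return  # flag given: run live tests as normal
    skip_live = pytest.mark.skip(reason="needs --live to run")
    for item in items:
        if "live" in item.keywords:
            item.add_marker(skip_live)
```

With this in place, a test decorated with `@pytest.mark.live` is reported as skipped under a plain `pytest` run and executes under `pytest --live`.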

## Out of scope

- Full browser automation (Playwright/Selenium) for sites that defeat even `curl_cffi`. If we hit one, file a separate issue and consider a per-source `playwright` fallback or a news API.
- Residential proxies. Not needed yet.

## References

- Findings: [`data/docling_eval/FINDINGS.md`](data/docling_eval/FINDINGS.md) — "Cloudflare: trafilatura won't help" section.
- `curl_cffi`: https://github.com/yifeikong/curl_cffi

