
Release v0.2.0: raw passthrough + Docker cleanup + WCXB benchmark #1

Merged
dongwhee merged 32 commits into main from develop
Apr 15, 2026

Conversation

@dongwhee (Collaborator)

Summary

First minor release beyond v0.1.0. Bundles three feature streams that accumulated on `develop`.

Raw passthrough for JSON/XML responses

  • `fetch_page` bypasses extraction/chunking/retrieval for structured-data URLs
  • Two-stage detection: URL suffix (httpx fast path) + Playwright `Content-Type` post-check (re-fetches raw bytes via httpx)
  • 256 KB default cap via `TRAWL_PASSTHROUGH_MAX_BYTES`
  • Response carries `content_type` and `truncated` fields; `path="raw_passthrough"`
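
Concretely, a raw-passthrough result might look like the sketch below. Only `path`, `content_type`, and `truncated` are named in this release; the surrounding shape (a single-chunk body) is illustrative, not the exact response schema.

```python
# Illustrative shape of a fetch_page raw-passthrough result. Only path,
# content_type, and truncated are documented fields; "chunks" is assumed.
response = {
    "path": "raw_passthrough",           # extraction/chunking/retrieval skipped
    "content_type": "application/json",  # from URL suffix or Playwright header
    "truncated": False,                  # True once the 256 KB cap is hit
    "chunks": ['{"status": "ok"}'],      # original body returned verbatim
}
```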

Docker / runtime config

  • Removed host-specific `ENV TRAWL_*` layers from `Dockerfile`
  • Injection via `-e` / compose / `--env-file`; comment block documents required vs optional
  • `profile_page` hidden from MCP tool list when `TRAWL_VLM_URL` is unset
  • Model-name defaults generalized (`bge-m3`, `gemma`) to match llama-server `--alias` conventions

WCXB benchmark scaffolding

  • External word-F1 evaluation against the WCXB dev split (1,497 pages, 7 page types)
  • Baseline Trafilatura comparison + aggregated markdown report

Validation

  • Parity matrix: 12/12 PASS
  • Passthrough tests: 13/13 PASS
  • MCP stdio smoke (VLM set + unset): PASS
  • WCXB dev: external Trafilatura sanity check consistent with the upstream-published full-split F1 0.791 (within the ±0.025 tolerance)

Test plan

  • Pull the merged main and verify `python tests/test_pipeline.py` on clean checkout
  • `docker build` with no TRAWL_* env → confirm image runs but fetch_page errors until env injected
  • MCP client sees `profile_page` only when `TRAWL_VLM_URL` is set

🤖 Generated with Claude Code

lyla and others added 30 commits April 14, 2026 16:02
Spec: docs/superpowers/specs/2026-04-14-wcxb-benchmark-design.md
Plan: docs/superpowers/plans/2026-04-14-wcxb-benchmark.md

Task 0 (upstream verification) completed by controller. Pinned commit
c039d5ee9f5a3a984a0e167e63aacd04e76e78a9 of WCXB with schema corrections
applied for Task 1-10 subagents.
Vendors evaluate.py from Murrough-Foley/web-content-extraction-benchmark
at commit c039d5ee (CC-BY-4.0). Prepends a license/source header;
original code is unmodified. Adds 5 unit tests covering word_f1
edge cases (identical, disjoint, partial overlap, empty inputs).

Also adds pythonpath=["."] to pytest ini so benchmarks.wcxb is
importable from the repo root without a separate install step.
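
For intuition, bag-of-words F1 of the kind WCXB-style scorers compute can be sketched as below. This is an illustrative reimplementation covering the listed edge cases, not the vendored evaluate.py; the both-empty convention returning 1.0 is an assumption.

```python
from collections import Counter

def word_f1(predicted: str, reference: str) -> float:
    """Word-level F1: harmonic mean of precision and recall over the
    bag-of-words overlap between prediction and reference."""
    pred, ref = Counter(predicted.split()), Counter(reference.split())
    if not pred and not ref:
        return 1.0  # both empty: treated as perfect agreement (assumption)
    overlap = sum((pred & ref).values())  # multiset intersection size
    if overlap == 0:
        return 0.0  # disjoint vocabularies, or exactly one side empty
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```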
Three small WCXB-shaped HTML+JSON pairs under tests/fixtures/wcxb/.
Used by Task 3–7 runner unit tests; real WCXB data is gitignored so
these committed fixtures are the test-time substitute.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements evaluate_page() in benchmarks/wcxb/run.py that loads a
WCXB-shaped page (html.gz + json), runs trawl.extraction.html_to_markdown,
scores with word_f1, and returns a dict matching the raw.json schema for
the trawl column. Supports both flat (fixture) and split (real WCXB)
directory layouts. Covered by 4 unit tests (4/4 pass).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds evaluate_page_with_baseline() that runs both trawl (3-way + BS fallback)
and a plain Trafilatura baseline (same markdown options, no favor flags) on the
same HTML, plus with_snippets_hit / without_snippets_hit counts for both.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pure-function layer: aggregate() computes overall + per-type F1
summaries with error exclusion, top wins/losses by delta; render_report()
produces a structured Markdown report. No IO, independently testable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
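
A toy version of the aggregation layer might look as follows. The record shape (`page_type`, `trawl_f1`, `error`) is assumed for illustration and is not the real raw.json schema; only the structure (error exclusion, overall plus per-type means) mirrors the description above.

```python
def aggregate(rows: list[dict]) -> dict:
    """Overall + per-type F1 means with error exclusion. Pure function:
    no IO, independently testable, like the layer described above."""
    ok = [r for r in rows if not r.get("error")]  # exclude errored pages
    by_type: dict[str, list[float]] = {}
    for r in ok:
        by_type.setdefault(r["page_type"], []).append(r["trawl_f1"])
    return {
        "overall": sum(r["trawl_f1"] for r in ok) / len(ok) if ok else 0.0,
        "per_type": {t: sum(v) / len(v) for t, v in by_type.items()},
        "errors": len(rows) - len(ok),
    }
```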
Appends run_all() to benchmarks/wcxb/run.py: iterates page IDs (flat +
split layout), calls evaluate_page / evaluate_page_with_baseline per
page, writes raw.json + report.md, logs progress every 100 pages, and
returns non-zero exit on >=5% trawl error rate.

Adds _main() / argparse CLI (--data-dir, --out-dir, --limit, --type,
--no-baseline) runnable as `python -m benchmarks.wcxb.run`.

Appends 4 tests to test_wcxb_runner.py covering raw+report output,
--limit, --type filter, and --no-baseline null fields. 10/10 pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds sanity_traf_default to evaluate_page_with_baseline output. Calls
trafilatura.extract(html) with no options, matching how WCXB upstream
measured the published F1=0.958, so the vendored evaluate.py can later
be verified to reproduce that number within ±0.02. Field lands in
raw.json only; aggregation and report rendering are unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Idempotent downloader for the WCXB dev split (1,497 html + 1,497
ground-truth = 2,994 files) pinned to commit c039d5ee. manifest.json
records per-file SHA-256 hashes generated via the git trees API.
Unit tests cover hash verification and download idempotency.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds WCXB dev-split F1 results (1,497 pages, 7 page types):
- trawl html_to_markdown: F1 = 0.777
- Trafilatura (same env): F1 = 0.750

Also fixes run.py to handle schema_version 2.0 pages that omit
main_content (2 of 1497 pages); uses .get("main_content", "") so
these score F1=0 rather than crashing with KeyError.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
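
The defensive lookup amounts to the following (field name from the commit message; the surrounding page-JSON schema is assumed):

```python
def ground_truth_text(page_json: dict) -> str:
    """schema_version 2.0 pages may omit main_content; returning "" makes
    them score F1=0 instead of raising KeyError mid-run."""
    return page_json.get("main_content", "")
```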
…tal 0.791

The spec and plan referenced 0.958 as the Trafilatura sanity target, which
is WCXB's article-only sub-score. Our runner measures the full dev split
(7 page types), whose upstream-published Trafilatura F1 is 0.791.

Corrected tolerance: [0.766, 0.816] (±0.025 around 0.791). Reference
measurement from the 2026-04-14 full run: 0.773, vendor verified.

Historical note retained in both files so the old number isn't re-used by
mistake. No code change — the run.py sanity field itself is unchanged;
only the downstream assertion threshold in Task 10 Step 3.
C1 late chunking spike. Opt-in toggle, dedicated :8084 server, 8K
truncate with per-chunk baseline fallback. Measurement via extended
parity runner (recall@k + MRR). Go/no-go: MRR Δ ≥ +0.03, no recall@5
regression, 12/12 parity, <2x latency.
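
MRR, one of the two go/no-go metrics named above, can be sketched as follows. This is the standard metric definition, not the extended parity runner itself; the single-relevant-document-per-query shape is an assumption.

```python
def mrr(ranked_ids: list[list[str]], relevant: list[str]) -> float:
    """Mean reciprocal rank: per query, 1/rank of the first relevant hit
    (0 if absent), averaged over all queries."""
    total = 0.0
    for ids, rel in zip(ranked_ids, relevant):
        for rank, doc_id in enumerate(ids, start=1):
            if doc_id == rel:
                total += 1.0 / rank
                break
    return total / len(ranked_ids) if ranked_ids else 0.0
```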
Spec for bypassing extraction on structured data responses. Two-stage
detection (URL hint → httpx, Playwright Content-Type post-check → raw
body re-fetch), single-chunk response with content_type + truncated
metadata, 256 KB default cap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Step-by-step plan to implement raw passthrough for JSON/XML responses,
organized as TDD tasks with bite-sized steps, local HTTP server test
fixture, and checkpoint after each task.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add `matches()` to detect structured-data URL suffixes (.json, .xml, .rss, .atom)
and `is_passthrough_content_type()` to identify passthrough-eligible media types
using explicit allow-list plus RFC 6838 structured-syntax suffix pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
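
A minimal sketch of the two helpers; the actual allow-list and regex in the repo may differ:

```python
import re
from urllib.parse import urlparse

_SUFFIXES = (".json", ".xml", ".rss", ".atom")
_ALLOW = {"application/json", "application/xml", "text/xml",
          "application/rss+xml", "application/atom+xml"}
# RFC 6838 structured-syntax suffixes, e.g. application/vnd.api+json
_SUFFIX_RE = re.compile(r"^[\w.+-]+/[\w.+-]+\+(json|xml)$")

def matches(url: str) -> bool:
    """URL-suffix fast path: does the path end in a structured-data extension?"""
    return urlparse(url).path.lower().endswith(_SUFFIXES)

def is_passthrough_content_type(content_type: str) -> bool:
    """Explicit allow-list plus +json/+xml structured-syntax suffix match."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in _ALLOW or bool(_SUFFIX_RE.match(media_type))
```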
PassthroughResult dataclass and fetch() use httpx streaming to GET
structured-data URLs, capped at PASSTHROUGH_MAX_BYTES, with ok=False
paths for HTTP errors, content-type mismatches, and empty bodies.
PASSTHROUGH_MAX_BYTES is read via module globals at call time so
monkeypatch.setattr works correctly in tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
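
The cap logic can be isolated as a pure function over streamed chunks, roughly as below. The real fetch() streams via httpx and re-reads PASSTHROUGH_MAX_BYTES from module globals on each call (which is what makes monkeypatch.setattr work); this sketch shows only the truncation behaviour.

```python
from typing import Iterable

def read_capped(chunks: Iterable[bytes], max_bytes: int) -> tuple[bytes, bool]:
    """Accumulate streamed body chunks up to max_bytes; the second element
    reports whether the body was truncated at the cap."""
    buf = bytearray()
    truncated = False
    for chunk in chunks:
        room = max_bytes - len(buf)
        if room <= 0 or len(chunk) > room:
            buf.extend(chunk[:max(room, 0)])  # keep only what fits
            truncated = True
            break
        buf.extend(chunk)
    return bytes(buf), truncated
```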
Base image mcr.microsoft.com/playwright/python already ships chromium
at /ms-playwright, so `playwright install --with-deps chromium` just
re-downloads it (+~300MB, ~1min build time).

Also reorder so source `COPY` happens after `pip install -e .`, using
stub package dirs to resolve the editable install. Previously any edit
under src/ invalidated the deps layer and triggered a full reinstall.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add `content_type: str | None = None` field to `FetchResult`. Update
`_open_context` to capture the navigation response's Content-Type header
and yield a 4-tuple (context, page, html, content_type). Update `fetch()`
and `render_session()` to unpack the new element accordingly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…URLs

Add `_decode_passthrough_body`, `_build_passthrough_result`, and
`_fetch_html` helpers. In `fetch_relevant`, check `passthrough.matches`
before the query=None guard so JSON/XML/RSS/Atom URLs bypass embedding
entirely and work without a query. In `_run_full_pipeline`, the same
check handles the query-provided code path. The generic fetcher chain
is extracted into `_fetch_html` to avoid duplication.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…heck

When a suffix-less URL (e.g. /api/weather) returns JSON or XML,
Playwright's navigation response carries the real Content-Type but the
rendered HTML is Chromium's JSON/XML viewer DOM, not the raw body.

After _fetch_html returns, inspect fetched.content_type. If it passes
is_passthrough_content_type(), discard the rendered HTML and re-fetch
the raw bytes via the new passthrough.fetch_raw_body() (httpx, no
Content-Type gate). Use the Playwright-detected ct as authoritative
(it may carry charset info absent from the raw server header).

On re-fetch failure return a terminal error with path=raw_passthrough
rather than falling back to the extraction pipeline (garbled viewer DOM
would produce nonsense chunks).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
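
The post-check control flow described above can be sketched like this. All names are illustrative; `fetch_raw_body` stands in for the real passthrough.fetch_raw_body() (httpx, no Content-Type gate), and the dict returns stand in for the real result types.

```python
def postcheck(content_type, rendered_html, fetch_raw_body,
              is_passthrough_content_type):
    """Content-Type post-check: if the navigation response advertised a
    structured media type, discard the Chromium viewer DOM and re-fetch
    raw bytes; on re-fetch failure, fail terminally (no pipeline fallback)."""
    if not content_type or not is_passthrough_content_type(content_type):
        return {"path": "pipeline", "html": rendered_html}  # normal extraction
    raw = fetch_raw_body()  # viewer DOM discarded; get the raw body instead
    if raw is None:
        # Terminal error: the viewer DOM would produce nonsense chunks.
        return {"path": "raw_passthrough", "error": "raw re-fetch failed"}
    return {"path": "raw_passthrough", "body": raw, "content_type": content_type}
```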
Adds a passthrough test case to the MCP smoke test that:
- Starts a local HTTP server serving raw JSON content
- Calls fetch_page via MCP on a local URL with no query
- Verifies path == "raw_passthrough", content_type == "application/json"
- Confirms truncated == False and chunk text matches the raw response

This exercises the raw-passthrough detection path (MIME type without
query → skip embedding/retrieval entirely) via the MCP protocol.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Raw passthrough for JSON/XML/RSS/Atom responses: bypasses extraction
and chunking, returns the original body as a single chunk. Two-stage
detection (URL suffix hint via httpx; Playwright Content-Type
post-check re-fetches raw bytes). 256 KB default cap via
TRAWL_PASSTHROUGH_MAX_BYTES.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…VLM_URL

Model-name defaults were tied to specific GGUF filenames from one local
llama-server setup. Replace with short aliases (`bge-m3`, `gemma`) that
match the default `--alias` strings users typically configure.

profile_page depends on a vision LLM; when TRAWL_VLM_URL is unset the
tool would fail at call time with an opaque error. Hide it from the
MCP tool list instead, and return an explicit "disabled" payload if an
agent calls it anyway. MCP smoke test exercises both states.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
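
The gating amounts to filtering the advertised tool list on the env var, roughly as below. Real registration goes through the MCP server API; this pure-function sketch only shows the visibility rule.

```python
import os

def visible_tools(all_tools: dict[str, object]) -> dict[str, object]:
    """Hide profile_page from the advertised tool list when TRAWL_VLM_URL
    is unset; a call to the hidden tool would still get an explicit
    'disabled' payload rather than an opaque failure."""
    if os.environ.get("TRAWL_VLM_URL"):
        return dict(all_tools)
    return {name: t for name, t in all_tools.items() if name != "profile_page"}
```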
lyla and others added 2 commits April 15, 2026 10:30
Image previously baked host-specific URLs (host.docker.internal:80xx)
and GGUF filenames as ENV layers. Replace with a comment block listing
required/optional env vars and referencing .env.example. Same image now
works across local-dev, LAN, and remote llama-server setups.

Also generalize model-name examples in .env.example to match the alias
defaults (bge-m3, gemma) and note that TRAWL_VLM_URL gates profile_page
visibility in the MCP tool list.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dongwhee merged commit 1a674f9 into main Apr 15, 2026
3 of 4 checks passed