Conversation
Spec: docs/superpowers/specs/2026-04-14-wcxb-benchmark-design.md Plan: docs/superpowers/plans/2026-04-14-wcxb-benchmark.md Task 0 (upstream verification) completed by the controller. Commit c039d5ee9f5a3a984a0e167e63aacd04e76e78a9 of WCXB is pinned, with schema corrections applied, for the Task 1–10 subagents.
Vendors evaluate.py from Murrough-Foley/web-content-extraction-benchmark at commit c039d5ee (CC-BY-4.0). Prepends a license/source header; original code is unmodified. Adds 5 unit tests covering word_f1 edge cases (identical, disjoint, partial overlap, empty inputs). Also adds pythonpath=["."] to pytest ini so benchmarks.wcxb is importable from the repo root without a separate install step.
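The vendored evaluate.py is the authoritative scorer; as a rough illustration of the edge cases those five tests cover, a word-level F1 can be sketched like this (hypothetical re-implementation for illustration only, not the vendored code):

```python
from collections import Counter

def word_f1(predicted: str, reference: str) -> float:
    """Word-level F1: harmonic mean of token precision and recall.

    Illustrative sketch; behaviour on the edge cases (identical,
    disjoint, partial overlap, empty inputs) is an assumption.
    """
    pred_tokens = predicted.split()
    ref_tokens = reference.split()
    if not pred_tokens and not ref_tokens:
        return 1.0  # both empty -> treat as identical
    if not pred_tokens or not ref_tokens:
        return 0.0  # one side empty -> no overlap possible
    # Multiset intersection counts shared tokens with multiplicity.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0  # disjoint token sets
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```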
Three small WCXB-shaped HTML+JSON pairs under tests/fixtures/wcxb/. Used by Task 3–7 runner unit tests; real WCXB data is gitignored so these committed fixtures are the test-time substitute. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements evaluate_page() in benchmarks/wcxb/run.py that loads a WCXB-shaped page (html.gz + json), runs trawl.extraction.html_to_markdown, scores with word_f1, and returns a dict matching the raw.json schema for the trawl column. Supports both flat (fixture) and split (real WCXB) directory layouts. Covered by 4 unit tests (4/4 pass). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
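A minimal sketch of what such a page evaluator might look like, assuming injected `extract`/`score` callables and illustrative raw.json field names (the real function calls trawl's html_to_markdown and the vendored word_f1 directly):

```python
import gzip
import json
from pathlib import Path

def evaluate_page(page_id: str, data_dir: Path, extract, score) -> dict:
    """Score one extractor on one WCXB-shaped page (illustrative sketch).

    `extract` maps HTML -> text and `score` maps (extracted, reference)
    -> float; the returned field names approximate the raw.json schema.
    """
    # Flat fixture layout: <id>.html.gz next to <id>.json.
    # Real split layout: html/<id>.html.gz and ground_truth/<id>.json.
    html_path = data_dir / f"{page_id}.html.gz"
    truth_path = data_dir / f"{page_id}.json"
    if not html_path.exists():
        html_path = data_dir / "html" / f"{page_id}.html.gz"
        truth_path = data_dir / "ground_truth" / f"{page_id}.json"

    html = gzip.decompress(html_path.read_bytes()).decode("utf-8")
    truth = json.loads(truth_path.read_text(encoding="utf-8"))
    # .get() tolerates schema_version 2.0 pages that omit main_content.
    reference = truth.get("main_content", "")

    try:
        return {"page_id": page_id,
                "f1": score(extract(html), reference),
                "error": None}
    except Exception as exc:  # errors become data, not crashes
        return {"page_id": page_id, "f1": None, "error": repr(exc)}
```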
Adds evaluate_page_with_baseline() that runs both trawl (3-way + BS fallback) and a plain Trafilatura baseline (same markdown options, no favor flags) on the same HTML, plus with_snippets_hit / without_snippets_hit counts for both. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pure-function layer: aggregate() computes overall + per-type F1 summaries with error exclusion, top wins/losses by delta; render_report() produces a structured Markdown report. No IO, independently testable. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
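The summary portion of that pure-function split might look roughly like this; the row field names (`f1`, `page_type`, `error`) are assumptions for illustration, and the wins/losses and report rendering are omitted:

```python
from statistics import mean

def aggregate(rows: list[dict]) -> dict:
    """Pure summary over per-page result rows (illustrative sketch).

    Errored pages are excluded from the means but still counted,
    matching the error-exclusion behaviour described above. No IO.
    """
    ok = [r for r in rows if r.get("error") is None]
    by_type: dict[str, list[float]] = {}
    for r in ok:
        by_type.setdefault(r.get("page_type", "unknown"), []).append(r["f1"])
    return {
        "pages": len(rows),
        "errors": len(rows) - len(ok),
        "overall_f1": mean(r["f1"] for r in ok) if ok else None,
        "per_type_f1": {t: mean(v) for t, v in sorted(by_type.items())},
    }
```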
Appends run_all() to benchmarks/wcxb/run.py: iterates page IDs (flat + split layout), calls evaluate_page / evaluate_page_with_baseline per page, writes raw.json + report.md, logs progress every 100 pages, and returns non-zero exit on >=5% trawl error rate. Adds _main() / argparse CLI (--data-dir, --out-dir, --limit, --type, --no-baseline) runnable as `python -m benchmarks.wcxb.run`. Appends 4 tests to test_wcxb_runner.py covering raw+report output, --limit, --type filter, and --no-baseline null fields. 10/10 pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
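The CLI surface described above can be sketched with argparse; the flag names come from the description, while the defaults and help strings are assumptions:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the `python -m benchmarks.wcxb.run` argument surface."""
    p = argparse.ArgumentParser(prog="python -m benchmarks.wcxb.run")
    p.add_argument("--data-dir", required=True,
                   help="WCXB pages (flat or split layout)")
    p.add_argument("--out-dir", required=True,
                   help="where raw.json and report.md are written")
    p.add_argument("--limit", type=int, default=None,
                   help="evaluate at most N pages")
    p.add_argument("--type", dest="page_type", default=None,
                   help="restrict the run to one page type")
    p.add_argument("--no-baseline", action="store_true",
                   help="skip the Trafilatura baseline column")
    return p
```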
Adds sanity_traf_default to evaluate_page_with_baseline output. Calls trafilatura.extract(html) with no options, matching how WCXB upstream measured the published F1=0.958, so the vendored evaluate.py can later be verified to reproduce that number within ±0.02. Field lands in raw.json only; aggregation and report rendering are unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Idempotent downloader for the WCXB dev split (1,497 html + 1,497 ground-truth = 2,994 files) pinned to commit c039d5ee. manifest.json records per-file SHA-256 hashes generated via the git trees API. Unit tests cover hash verification and download idempotency. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
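The idempotency check at the heart of such a downloader can be sketched as follows (the manifest shape and helper name are assumptions; the real module also does the GitHub trees-API fetch):

```python
import hashlib
from pathlib import Path

def needs_download(path: Path, expected_sha256: str) -> bool:
    """Skip files whose on-disk SHA-256 already matches the manifest.

    Re-running the downloader then touches only missing or corrupt
    files, which is what makes it idempotent.
    """
    if not path.exists():
        return True
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return digest != expected_sha256
```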
Adds WCXB dev-split F1 results (1,497 pages, 7 page types):
- trawl html_to_markdown: F1 = 0.777
- Trafilatura (same env): F1 = 0.750
Also fixes run.py to handle schema_version 2.0 pages that omit main_content (2 of 1,497 pages); uses .get("main_content", "") so these score F1=0 rather than crashing with KeyError.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tal 0.791 The spec and plan referenced 0.958 as the Trafilatura sanity target, but that number is WCXB's article-only sub-score. Our runner measures the full dev split (7 page types), whose upstream-published Trafilatura F1 is 0.791. Corrected tolerance: [0.766, 0.816] (±0.025 around 0.791). Reference measurement from the 2026-04-14 full run: 0.773, vendor verified. A historical note is retained in both files so the old number isn't re-used by mistake. No code change: the run.py sanity field itself is unchanged; only the downstream assertion threshold in Task 10 Step 3 is updated.
C1 late chunking spike. Opt-in toggle, dedicated :8084 server, 8K truncate with per-chunk baseline fallback. Measurement via extended parity runner (recall@k + MRR). Go/no-go: MRR Δ ≥ +0.03, no recall@5 regression, 12/12 parity, <2x latency.
Spec for bypassing extraction on structured data responses. Two-stage detection (URL hint → httpx, Playwright Content-Type post-check → raw body re-fetch), single-chunk response with content_type + truncated metadata, 256 KB default cap. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Step-by-step plan to implement raw passthrough for JSON/XML responses, organized as TDD tasks with bite-sized steps, local HTTP server test fixture, and checkpoint after each task. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add `matches()` to detect structured-data URL suffixes (.json, .xml, .rss, .atom) and `is_passthrough_content_type()` to identify passthrough-eligible media types using explicit allow-list plus RFC 6838 structured-syntax suffix pattern. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
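A plausible shape for the two detectors, under the assumption that the allow-list and the RFC 6838 suffix pattern look roughly like this (names match the description; constants are illustrative):

```python
import re

STRUCTURED_SUFFIXES = (".json", ".xml", ".rss", ".atom")
PASSTHROUGH_TYPES = {
    "application/json", "application/xml", "text/xml",
    "application/rss+xml", "application/atom+xml",
}
# RFC 6838 structured-syntax suffixes, e.g. application/ld+json.
_SUFFIX_RE = re.compile(r"^[\w.-]+/[\w.+-]+\+(json|xml)$")

def matches(url: str) -> bool:
    """URL-suffix hint: does the path component look like structured data?"""
    path = url.split("?", 1)[0].split("#", 1)[0]
    return path.lower().endswith(STRUCTURED_SUFFIXES)

def is_passthrough_content_type(content_type: str) -> bool:
    """Allow-list hit, or a +json/+xml structured-syntax suffix."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in PASSTHROUGH_TYPES or bool(_SUFFIX_RE.match(media_type))
```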
PassthroughResult dataclass and fetch() use httpx streaming to GET structured-data URLs, capped at PASSTHROUGH_MAX_BYTES, with ok=False paths for HTTP errors, content-type mismatches, and empty bodies. PASSTHROUGH_MAX_BYTES is read via module globals at call time so monkeypatch.setattr works correctly in tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Base image mcr.microsoft.com/playwright/python already ships chromium at /ms-playwright, so `playwright install --with-deps chromium` just re-downloads it (+~300MB, ~1min build time). Also reorder so source `COPY` happens after `pip install -e .`, using stub package dirs to resolve the editable install. Previously any edit under src/ invalidated the deps layer and triggered a full reinstall. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add `content_type: str | None = None` field to `FetchResult`. Update `_open_context` to capture the navigation response's Content-Type header and yield a 4-tuple (context, page, html, content_type). Update `fetch()` and `render_session()` to unpack the new element accordingly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…URLs Add `_decode_passthrough_body`, `_build_passthrough_result`, and `_fetch_html` helpers. In `fetch_relevant`, check `passthrough.matches` before the query=None guard so JSON/XML/RSS/Atom URLs bypass embedding entirely and work without a query. In `_run_full_pipeline`, the same check handles the query-provided code path. The generic fetcher chain is extracted into `_fetch_html` to avoid duplication. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…heck When a suffix-less URL (e.g. /api/weather) returns JSON or XML, Playwright's navigation response carries the real Content-Type but the rendered HTML is Chromium's JSON/XML viewer DOM, not the raw body. After _fetch_html returns, inspect fetched.content_type. If it passes is_passthrough_content_type(), discard the rendered HTML and re-fetch the raw bytes via the new passthrough.fetch_raw_body() (httpx, no Content-Type gate). Use the Playwright-detected ct as authoritative (it may carry charset info absent from the raw server header). On re-fetch failure return a terminal error with path=raw_passthrough rather than falling back to the extraction pipeline (garbled viewer DOM would produce nonsense chunks). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a passthrough test case to the MCP smoke test that: - Starts a local HTTP server serving raw JSON content - Calls fetch_page via MCP on a local URL with no query - Verifies path == "raw_passthrough", content_type == "application/json" - Confirms truncated == False and chunk text matches the raw response This exercises the raw-passthrough detection path (MIME type without query → skip embedding/retrieval entirely) via the MCP protocol. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Raw passthrough for JSON/XML/RSS/Atom responses: bypasses extraction and chunking, returns the original body as a single chunk. Two-stage detection (URL suffix hint via httpx; Playwright Content-Type post-check re-fetches raw bytes). 256 KB default cap via TRAWL_PASSTHROUGH_MAX_BYTES. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…VLM_URL Model-name defaults were tied to specific GGUF filenames from one local llama-server setup. Replace with short aliases (`bge-m3`, `gemma`) that match the default `--alias` strings users typically configure. profile_page depends on a vision LLM; when TRAWL_VLM_URL is unset the tool would fail at call time with an opaque error. Hide it from the MCP tool list instead, and return an explicit "disabled" payload if an agent calls it anyway. MCP smoke test exercises both states. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Image previously baked host-specific URLs (host.docker.internal:80xx) and GGUF filenames as ENV layers. Replace with a comment block listing required/optional env vars and referencing .env.example. Same image now works across local-dev, LAN, and remote llama-server setups. Also generalize model-name examples in .env.example to match the alias defaults (bge-m3, gemma) and note that TRAWL_VLM_URL gates profile_page visibility in the MCP tool list. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
First minor release beyond v0.1.0. Bundles three feature streams that accumulated on `develop`.
Raw passthrough for JSON/XML responses
Docker / runtime config
WCXB benchmark scaffolding
Validation
Test plan
🤖 Generated with Claude Code