
Release v0.2.0: raw passthrough + Docker cleanup + WCXB benchmark #1

Merged
dongwhee merged 32 commits into main from develop
Apr 15, 2026

Conversation

@dongwhee (Collaborator)

Summary

First minor release beyond v0.1.0. Bundles three feature streams that accumulated on `develop`.

Raw passthrough for JSON/XML responses

  • `fetch_page` bypasses extraction/chunking/retrieval for structured-data URLs
  • Two-stage detection: URL suffix (httpx fast path) + Playwright `Content-Type` post-check (re-fetches raw bytes via httpx)
  • 256 KB default cap via `TRAWL_PASSTHROUGH_MAX_BYTES`
  • Response carries `content_type` and `truncated` fields; `path="raw_passthrough"`
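
Concretely, a raw-passthrough result might look like the sketch below. Only `path`, `content_type`, and `truncated` are named in this release; the surrounding shape (a single-chunk body) is illustrative, not the exact response schema.

```python
# Illustrative shape of a fetch_page raw-passthrough result. Only path,
# content_type, and truncated are documented fields; "chunks" is assumed.
response = {
    "path": "raw_passthrough",           # extraction/chunking/retrieval skipped
    "content_type": "application/json",  # from URL suffix or Playwright header
    "truncated": False,                  # True once the 256 KB cap is hit
    "chunks": ['{"status": "ok"}'],      # original body returned verbatim
}
```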

Docker / runtime config

  • Removed host-specific `ENV TRAWL_*` layers from `Dockerfile`
  • Injection via `-e` / compose / `--env-file`; comment block documents required vs optional
  • `profile_page` hidden from MCP tool list when `TRAWL_VLM_URL` is unset
  • Model-name defaults generalized (`bge-m3`, `gemma`) to match llama-server `--alias` conventions

WCXB benchmark scaffolding

  • External word-F1 evaluation against the WCXB dev split (1,497 pages, 7 page types)
  • Baseline Trafilatura comparison + aggregated markdown report

Validation

  • Parity matrix: 12/12 PASS
  • Passthrough tests: 13/13 PASS
  • MCP stdio smoke (VLM set + unset): PASS
  • WCXB dev: external Trafilatura sanity check consistent with the upstream-published full-split F1 0.791 (within the ±0.025 tolerance)

Test plan

  • Pull the merged main and verify `python tests/test_pipeline.py` on clean checkout
  • `docker build` with no TRAWL_* env → confirm image runs but fetch_page errors until env injected
  • MCP client sees `profile_page` only when `TRAWL_VLM_URL` is set

🤖 Generated with Claude Code

lyla and others added 30 commits April 14, 2026 16:02
Spec: docs/superpowers/specs/2026-04-14-wcxb-benchmark-design.md
Plan: docs/superpowers/plans/2026-04-14-wcxb-benchmark.md

Task 0 (upstream verification) completed by controller. Pinned commit
c039d5ee9f5a3a984a0e167e63aacd04e76e78a9 of WCXB with schema corrections
applied for Task 1-10 subagents.
Vendors evaluate.py from Murrough-Foley/web-content-extraction-benchmark
at commit c039d5ee (CC-BY-4.0). Prepends a license/source header;
original code is unmodified. Adds 5 unit tests covering word_f1
edge cases (identical, disjoint, partial overlap, empty inputs).

Also adds pythonpath=["."] to pytest ini so benchmarks.wcxb is
importable from the repo root without a separate install step.
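
For intuition, bag-of-words F1 of the kind WCXB-style scorers compute can be sketched as below. This is an illustrative reimplementation covering the listed edge cases, not the vendored evaluate.py; the both-empty convention returning 1.0 is an assumption.

```python
from collections import Counter

def word_f1(predicted: str, reference: str) -> float:
    """Word-level F1: harmonic mean of precision and recall over the
    bag-of-words overlap between prediction and reference."""
    pred, ref = Counter(predicted.split()), Counter(reference.split())
    if not pred and not ref:
        return 1.0  # both empty: treated as perfect agreement (assumption)
    overlap = sum((pred & ref).values())  # multiset intersection size
    if overlap == 0:
        return 0.0  # disjoint vocabularies, or exactly one side empty
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```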
Three small WCXB-shaped HTML+JSON pairs under tests/fixtures/wcxb/.
Used by Task 3–7 runner unit tests; real WCXB data is gitignored so
these committed fixtures are the test-time substitute.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements evaluate_page() in benchmarks/wcxb/run.py that loads a
WCXB-shaped page (html.gz + json), runs trawl.extraction.html_to_markdown,
scores with word_f1, and returns a dict matching the raw.json schema for
the trawl column. Supports both flat (fixture) and split (real WCXB)
directory layouts. Covered by 4 unit tests (4/4 pass).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds evaluate_page_with_baseline() that runs both trawl (3-way + BS fallback)
and a plain Trafilatura baseline (same markdown options, no favor flags) on the
same HTML, plus with_snippets_hit / without_snippets_hit counts for both.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pure-function layer: aggregate() computes overall + per-type F1
summaries with error exclusion, top wins/losses by delta; render_report()
produces a structured Markdown report. No IO, independently testable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
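
A toy version of the aggregation layer might look as follows. The record shape (`page_type`, `trawl_f1`, `error`) is assumed for illustration and is not the real raw.json schema; only the structure (error exclusion, overall plus per-type means) mirrors the description above.

```python
def aggregate(rows: list[dict]) -> dict:
    """Overall + per-type F1 means with error exclusion. Pure function:
    no IO, independently testable, like the layer described above."""
    ok = [r for r in rows if not r.get("error")]  # exclude errored pages
    by_type: dict[str, list[float]] = {}
    for r in ok:
        by_type.setdefault(r["page_type"], []).append(r["trawl_f1"])
    return {
        "overall": sum(r["trawl_f1"] for r in ok) / len(ok) if ok else 0.0,
        "per_type": {t: sum(v) / len(v) for t, v in by_type.items()},
        "errors": len(rows) - len(ok),
    }
```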
Appends run_all() to benchmarks/wcxb/run.py: iterates page IDs (flat +
split layout), calls evaluate_page / evaluate_page_with_baseline per
page, writes raw.json + report.md, logs progress every 100 pages, and
returns non-zero exit on >=5% trawl error rate.

Adds _main() / argparse CLI (--data-dir, --out-dir, --limit, --type,
--no-baseline) runnable as `python -m benchmarks.wcxb.run`.

Appends 4 tests to test_wcxb_runner.py covering raw+report output,
--limit, --type filter, and --no-baseline null fields. 10/10 pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds sanity_traf_default to evaluate_page_with_baseline output. Calls
trafilatura.extract(html) with no options, matching how WCXB upstream
measured the published F1=0.958, so the vendored evaluate.py can later
be verified to reproduce that number within ±0.02. Field lands in
raw.json only; aggregation and report rendering are unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Idempotent downloader for the WCXB dev split (1,497 html + 1,497
ground-truth = 2,994 files) pinned to commit c039d5ee. manifest.json
records per-file SHA-256 hashes generated via the git trees API.
Unit tests cover hash verification and download idempotency.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds WCXB dev-split F1 results (1,497 pages, 7 page types):
- trawl html_to_markdown: F1 = 0.777
- Trafilatura (same env): F1 = 0.750

Also fixes run.py to handle schema_version 2.0 pages that omit
main_content (2 of 1497 pages); uses .get("main_content", "") so
these score F1=0 rather than crashing with KeyError.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
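
The defensive lookup amounts to the following (field name from the commit message; the surrounding page-JSON schema is assumed):

```python
def ground_truth_text(page_json: dict) -> str:
    """schema_version 2.0 pages may omit main_content; returning "" makes
    them score F1=0 instead of raising KeyError mid-run."""
    return page_json.get("main_content", "")
```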
…tal 0.791

The spec and plan referenced 0.958 as the Trafilatura sanity target, which
is WCXB's article-only sub-score. Our runner measures the full dev split
(7 page types), whose upstream-published Trafilatura F1 is 0.791.

Corrected tolerance: [0.766, 0.816] (±0.025 around 0.791). Reference
measurement from the 2026-04-14 full run: 0.773, vendor verified.

Historical note retained in both files so the old number isn't re-used by
mistake. No code change — the run.py sanity field itself is unchanged;
only the downstream assertion threshold in Task 10 Step 3.
C1 late chunking spike. Opt-in toggle, dedicated :8084 server, 8K
truncate with per-chunk baseline fallback. Measurement via extended
parity runner (recall@k + MRR). Go/no-go: MRR Δ ≥ +0.03, no recall@5
regression, 12/12 parity, <2x latency.
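
MRR, one of the two go/no-go metrics named above, can be sketched as follows. This is the standard metric definition, not the extended parity runner itself; the single-relevant-document-per-query shape is an assumption.

```python
def mrr(ranked_ids: list[list[str]], relevant: list[str]) -> float:
    """Mean reciprocal rank: per query, 1/rank of the first relevant hit
    (0 if absent), averaged over all queries."""
    total = 0.0
    for ids, rel in zip(ranked_ids, relevant):
        for rank, doc_id in enumerate(ids, start=1):
            if doc_id == rel:
                total += 1.0 / rank
                break
    return total / len(ranked_ids) if ranked_ids else 0.0
```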
Spec for bypassing extraction on structured data responses. Two-stage
detection (URL hint → httpx, Playwright Content-Type post-check → raw
body re-fetch), single-chunk response with content_type + truncated
metadata, 256 KB default cap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Step-by-step plan to implement raw passthrough for JSON/XML responses,
organized as TDD tasks with bite-sized steps, local HTTP server test
fixture, and checkpoint after each task.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add `matches()` to detect structured-data URL suffixes (.json, .xml, .rss, .atom)
and `is_passthrough_content_type()` to identify passthrough-eligible media types
using explicit allow-list plus RFC 6838 structured-syntax suffix pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
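
A minimal sketch of the two helpers; the actual allow-list and regex in the repo may differ:

```python
import re
from urllib.parse import urlparse

_SUFFIXES = (".json", ".xml", ".rss", ".atom")
_ALLOW = {"application/json", "application/xml", "text/xml",
          "application/rss+xml", "application/atom+xml"}
# RFC 6838 structured-syntax suffixes, e.g. application/vnd.api+json
_SUFFIX_RE = re.compile(r"^[\w.+-]+/[\w.+-]+\+(json|xml)$")

def matches(url: str) -> bool:
    """URL-suffix fast path: does the path end in a structured-data extension?"""
    return urlparse(url).path.lower().endswith(_SUFFIXES)

def is_passthrough_content_type(content_type: str) -> bool:
    """Explicit allow-list plus +json/+xml structured-syntax suffix match."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in _ALLOW or bool(_SUFFIX_RE.match(media_type))
```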
PassthroughResult dataclass and fetch() use httpx streaming to GET
structured-data URLs, capped at PASSTHROUGH_MAX_BYTES, with ok=False
paths for HTTP errors, content-type mismatches, and empty bodies.
PASSTHROUGH_MAX_BYTES is read via module globals at call time so
monkeypatch.setattr works correctly in tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
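
The cap logic can be isolated as a pure function over streamed chunks, roughly as below. The real fetch() streams via httpx and re-reads PASSTHROUGH_MAX_BYTES from module globals on each call (which is what makes monkeypatch.setattr work); this sketch shows only the truncation behaviour.

```python
from typing import Iterable

def read_capped(chunks: Iterable[bytes], max_bytes: int) -> tuple[bytes, bool]:
    """Accumulate streamed body chunks up to max_bytes; the second element
    reports whether the body was truncated at the cap."""
    buf = bytearray()
    truncated = False
    for chunk in chunks:
        room = max_bytes - len(buf)
        if room <= 0 or len(chunk) > room:
            buf.extend(chunk[:max(room, 0)])  # keep only what fits
            truncated = True
            break
        buf.extend(chunk)
    return bytes(buf), truncated
```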
Base image mcr.microsoft.com/playwright/python already ships chromium
at /ms-playwright, so `playwright install --with-deps chromium` just
re-downloads it (+~300MB, ~1min build time).

Also reorder so source `COPY` happens after `pip install -e .`, using
stub package dirs to resolve the editable install. Previously any edit
under src/ invalidated the deps layer and triggered a full reinstall.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add `content_type: str | None = None` field to `FetchResult`. Update
`_open_context` to capture the navigation response's Content-Type header
and yield a 4-tuple (context, page, html, content_type). Update `fetch()`
and `render_session()` to unpack the new element accordingly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…URLs

Add `_decode_passthrough_body`, `_build_passthrough_result`, and
`_fetch_html` helpers. In `fetch_relevant`, check `passthrough.matches`
before the query=None guard so JSON/XML/RSS/Atom URLs bypass embedding
entirely and work without a query. In `_run_full_pipeline`, the same
check handles the query-provided code path. The generic fetcher chain
is extracted into `_fetch_html` to avoid duplication.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…heck

When a suffix-less URL (e.g. /api/weather) returns JSON or XML,
Playwright's navigation response carries the real Content-Type but the
rendered HTML is Chromium's JSON/XML viewer DOM, not the raw body.

After _fetch_html returns, inspect fetched.content_type. If it passes
is_passthrough_content_type(), discard the rendered HTML and re-fetch
the raw bytes via the new passthrough.fetch_raw_body() (httpx, no
Content-Type gate). Use the Playwright-detected ct as authoritative
(it may carry charset info absent from the raw server header).

On re-fetch failure return a terminal error with path=raw_passthrough
rather than falling back to the extraction pipeline (garbled viewer DOM
would produce nonsense chunks).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
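
The post-check control flow described above can be sketched like this. All names are illustrative; `fetch_raw_body` stands in for the real passthrough.fetch_raw_body() (httpx, no Content-Type gate), and the dict returns stand in for the real result types.

```python
def postcheck(content_type, rendered_html, fetch_raw_body,
              is_passthrough_content_type):
    """Content-Type post-check: if the navigation response advertised a
    structured media type, discard the Chromium viewer DOM and re-fetch
    raw bytes; on re-fetch failure, fail terminally (no pipeline fallback)."""
    if not content_type or not is_passthrough_content_type(content_type):
        return {"path": "pipeline", "html": rendered_html}  # normal extraction
    raw = fetch_raw_body()  # viewer DOM discarded; get the raw body instead
    if raw is None:
        # Terminal error: the viewer DOM would produce nonsense chunks.
        return {"path": "raw_passthrough", "error": "raw re-fetch failed"}
    return {"path": "raw_passthrough", "body": raw, "content_type": content_type}
```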
Adds a passthrough test case to the MCP smoke test that:
- Starts a local HTTP server serving raw JSON content
- Calls fetch_page via MCP on a local URL with no query
- Verifies path == "raw_passthrough", content_type == "application/json"
- Confirms truncated == False and chunk text matches the raw response

This exercises the raw-passthrough detection path (MIME type without
query → skip embedding/retrieval entirely) via the MCP protocol.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Raw passthrough for JSON/XML/RSS/Atom responses: bypasses extraction
and chunking, returns the original body as a single chunk. Two-stage
detection (URL suffix hint via httpx; Playwright Content-Type
post-check re-fetches raw bytes). 256 KB default cap via
TRAWL_PASSTHROUGH_MAX_BYTES.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…VLM_URL

Model-name defaults were tied to specific GGUF filenames from one local
llama-server setup. Replace with short aliases (`bge-m3`, `gemma`) that
match the default `--alias` strings users typically configure.

profile_page depends on a vision LLM; when TRAWL_VLM_URL is unset the
tool would fail at call time with an opaque error. Hide it from the
MCP tool list instead, and return an explicit "disabled" payload if an
agent calls it anyway. MCP smoke test exercises both states.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
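
The gating amounts to filtering the advertised tool list on the env var, roughly as below. Real registration goes through the MCP server API; this pure-function sketch only shows the visibility rule.

```python
import os

def visible_tools(all_tools: dict[str, object]) -> dict[str, object]:
    """Hide profile_page from the advertised tool list when TRAWL_VLM_URL
    is unset; a call to the hidden tool would still get an explicit
    'disabled' payload rather than an opaque failure."""
    if os.environ.get("TRAWL_VLM_URL"):
        return dict(all_tools)
    return {name: t for name, t in all_tools.items() if name != "profile_page"}
```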
lyla and others added 2 commits April 15, 2026 10:30
Image previously baked host-specific URLs (host.docker.internal:80xx)
and GGUF filenames as ENV layers. Replace with a comment block listing
required/optional env vars and referencing .env.example. Same image now
works across local-dev, LAN, and remote llama-server setups.

Also generalize model-name examples in .env.example to match the alias
defaults (bge-m3, gemma) and note that TRAWL_VLM_URL gates profile_page
visibility in the MCP tool list.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dongwhee merged commit 1a674f9 into main Apr 15, 2026
3 of 4 checks passed