feat(cli,context,routing,observability,contracts): adopter-inspection surfaces (#106, #221, #224, #225, #226) #231
… surfaces (#106, #221, #224, #225, #226)

Lands the 5-issue "decision-surface inspectability" group selected by the
triage pass. Shared blast radius: envelope.py, __main__.py,
routing/router.py + new explanation.py, extras/otel.py, new schemas/
directory. Owner-mode (Mode B) authorised; bumps version to 0.5.0 with
documented public-API deltas under CHANGELOG ## [Unreleased].

#221 — argparse → Typer + Rich rewrite of __main__.py
- typer>=0.9 + rich>=13.0 promoted from [cli] extra to core deps
- [cli] extra kept as empty alias for one cycle (removal: v0.6)
- 7 existing subcommands preserved verbatim (build, route, demo,
  print-tree, init, ingest, replay); 1 new subcommand (stats, #106)
- tests/test_cli.py exit-code-0 assertion relaxed to accept Typer's
  code-2 no-args convention; all other golden assertions unchanged

#106 — BuildStats diagnostic report
- BuildStats.prompt_tokens @property (single source of truth; replaces
  sum(tokens_per_section.values()) + header_footer_tokens across 6+
  inline call sites in extras/otel.py, __main__.py, metrics.py)
- BuildStats.report(format='text'|'rich', *, phase, budget) returns
  deterministic, paste-friendly diagnostic string with sections,
  drop-reasons, and budget-utilisation recommendations
- BuildStats.report_dict(...) returns versioned ({"version": 1, ...})
  structured payload for programmatic consumers
- contextweaver stats CLI subcommand renders against ingested session
  JSON; --format {rich,text}, --phase, --budget flags

#226 — RouteResult.explanation()
- New routing/explanation.py module keeps router.py under soft cap
- RouteResult.explanation(format='md'|'dict') overload pair
- Markdown: top-k table, confidence-gap line, ambiguity flag +
  clarifying question, context-hints / filters sections
- Dict: versioned schema; safe for OTel span attributes
- Privacy: never emits args_schema or full descriptions
- docs/troubleshooting.md gains paste-ready example

#225 — JSON Schemas + drift gate (closes #196)
- 6 schemas under schemas/ and docs/schemas/v0/ (mkdocs-published $id
  URLs): catalog, choice_card, result_envelope, route_trace,
  build_stats, graph_manifest
- src/contextweaver/_schema_gen.py — stdlib-only dataclass → Draft
  2020-12 generator; deterministic byte-stable output
- ChoiceCard.kind tightened to Literal[...]; __post_init__ enforces
  gateway-spec §2 size bounds (name ≤64, ≤5 tags each ≤24 chars)
  on every code path including from_dict
- make schemas / make schemas-check (gating in make ci); CI workflow
  runs --check after the scorecard drift gate
- new docs/contracts.md; examples/sample_catalog.yaml gets the
  # yaml-language-server: $schema= header

#224 — OTel GenAI semantic conventions
- extras/otel.py rewritten on top of
  opentelemetry.semconv._incubating.attributes.gen_ai_attributes
  (opentelemetry-api>=1.27 floor +
  opentelemetry-semantic-conventions>=0.48b0)
- Span shapes: invoke_agent for build(), execute_tool for route()
- Stable attrs: gen_ai.system='contextweaver', gen_ai.operation.name,
  gen_ai.usage.input_tokens, gen_ai.tool.name
- Engine-specific telemetry under contextweaver.* namespace
- Token-usage histogram renamed to canonical gen_ai.client.token.usage
- otel_emit_experimental flag (default False) gates PII-prone attrs
- tests/test_otel.py uses InMemorySpanExporter for deterministic
  SemConv-name assertions
- new docs/integration_otel.md (Laminar + Phoenix worked examples,
  PII-safety guidance)

Verification (all green on v0.5 branch):
  ruff format --check src/ tests/ examples/ scripts/ → clean
  ruff check src/ tests/ examples/ scripts/ → clean
  mypy src/ → 0 issues / 66 files
  pytest --cov=contextweaver -q → 995 passed, 2 skipped
    (+41 new tests over baseline)
  python scripts/gen_schemas.py --check → schemas up to date
  python -m contextweaver demo → completes
  make example → all examples clean
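The `prompt_tokens` consolidation described under #106 can be pictured with a minimal sketch. The field names `tokens_per_section` and `header_footer_tokens` come from the commit message; `BuildStatsSketch` is a simplified, hypothetical stand-in and not the actual `BuildStats` class in envelope.py.

```python
from dataclasses import dataclass, field


@dataclass
class BuildStatsSketch:
    """Hypothetical stand-in for BuildStats; the real class carries more fields."""

    tokens_per_section: dict[str, int] = field(default_factory=dict)
    header_footer_tokens: int = 0

    @property
    def prompt_tokens(self) -> int:
        # One definition of the total, instead of repeating this sum at each call site.
        return sum(self.tokens_per_section.values()) + self.header_footer_tokens


stats = BuildStatsSketch(
    tokens_per_section={"history": 120, "tools": 45},
    header_footer_tokens=10,
)
print(stats.prompt_tokens)
```

Callers such as a metrics exporter would then read `stats.prompt_tokens` rather than recomputing the sum, so a future change to how the total is derived touches one place.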
Pull request overview
Lands the five "decision-surface inspectability" issues (#106, #221, #224, #225, #226) in a single drop: BuildStats diagnostic reports + stats CLI, RouteResult.explanation(), JSON-Schema publishing with a CI drift gate, an argparse→Typer/Rich CLI rewrite, and an OTel rewrite to the GenAI semantic conventions. Bumps to 0.5.0. Promotes typer/rich from [cli] extra to core deps (and keeps [cli] as an empty alias for one cycle). The OTel and CLI changes are user-visible breaking changes documented under CHANGELOG [Unreleased].
Changes:
- `BuildStats` gains a `prompt_tokens` property plus `report()`/`report_dict()`, with a new `contextweaver stats` Typer subcommand; `RouteResult.explanation()` renders a Markdown/dict rationale (extracted into `routing/explanation.py`).
- Six JSON Schemas committed under `schemas/` and `docs/schemas/v0/`, generated by the new stdlib `_schema_gen.py`; `make schemas`/`schemas-check` wired into `make ci` and the GitHub Actions workflow; `ChoiceCard.kind` tightened to `Literal[...]` with `__post_init__` size-bound enforcement.
- `extras/otel.py` rewritten on `opentelemetry.semconv._incubating.attributes.gen_ai_attributes` (floor bumped to `>=1.27` + new `opentelemetry-semantic-conventions>=0.48b0`); spans renamed to `invoke_agent`/`execute_tool`; token-usage histogram renamed to `gen_ai.client.token.usage`; `__main__.py` rewritten on Typer + Rich.
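The schema generation and drift gate can be illustrated with a small stdlib-only sketch. It is not the project's `_schema_gen.py`: the helper names (`schema_for`, `render`, `has_drifted`), the primitive-type mapping, and the flat `CardSketch` dataclass are all assumptions made for illustration. The idea shown is that sorted keys plus a fixed indent give byte-stable output, so a drift check reduces to a string comparison against the committed file.

```python
import dataclasses
import json
from dataclasses import dataclass
from typing import get_type_hints

# Hypothetical mapping from Python primitives to JSON Schema type names.
_PRIMITIVES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def schema_for(cls: type) -> dict:
    """Build a minimal Draft 2020-12 object schema from a flat dataclass."""
    hints = get_type_hints(cls)
    properties = {}
    required = []
    for f in dataclasses.fields(cls):
        properties[f.name] = {"type": _PRIMITIVES[hints[f.name]]}
        # A field with neither a default nor a default factory is required.
        if f.default is dataclasses.MISSING and f.default_factory is dataclasses.MISSING:
            required.append(f.name)
    return {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "title": cls.__name__,
        "type": "object",
        "properties": properties,
        "required": sorted(required),
    }


def render(schema: dict) -> str:
    """Serialise with sorted keys and a trailing newline so output is byte-stable."""
    return json.dumps(schema, indent=2, sort_keys=True) + "\n"


def has_drifted(committed_text: str, cls: type) -> bool:
    """Return True when the committed schema text no longer matches the dataclass."""
    return committed_text != render(schema_for(cls))


@dataclass
class CardSketch:
    name: str
    score: float
    pinned: bool = False


committed = render(schema_for(CardSketch))
print(has_drifted(committed, CardSketch))
```

A `--check` mode in this style would exit non-zero when `has_drifted` returns `True` for any published schema, which is what lets CI gate on it.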
Reviewed changes
Copilot reviewed 35 out of 35 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `src/contextweaver/envelope.py` | Adds `BuildStats.prompt_tokens`, `report()`/`report_dict()`, and `ChoiceCard` size-bound enforcement + `Literal` kind. |
| `src/contextweaver/_schema_gen.py` | New stdlib dataclass→JSON-Schema generator with per-type extras and deterministic serialisation. |
| `src/contextweaver/routing/router.py` | Adds `RouteResult.explanation()` with overload pair, delegating to `routing.explanation`. |
| `src/contextweaver/routing/explanation.py` | New module rendering routing rationale as Markdown or versioned dict. |
| `src/contextweaver/extras/otel.py` | Rewrite onto GenAI SemConv; `invoke_agent`/`execute_tool` spans; engine-specific attrs under `contextweaver.*` namespace; `otel_emit_experimental` gate. |
| `src/contextweaver/__main__.py` | Argparse→Typer+Rich rewrite, adds `stats` subcommand, factored `_restore_manager_from_session` helper. |
| `src/contextweaver/metrics.py` | Switches to `BuildStats.prompt_tokens`. |
| `scripts/gen_schemas.py` | New regenerator/drift-checker for the six published schemas. |
| `schemas/*.schema.json`, `docs/schemas/v0/*.schema.json` | Six committed schemas mirrored under docs for `$id` publishing. |
| `pyproject.toml` | Version 0.5.0; typer/rich into core; `[cli]` emptied; OTel floor bumped; mypy carve-out for `__main__`. |
| `Makefile`, `.github/workflows/ci.yml` | Add `schemas`/`schemas-check` targets and wire the drift gate into CI. |
| `mkdocs.yml` | Adds Contracts + Observability nav and excludes schema JSONs from docs nav. |
| `examples/sample_catalog.yaml` | Adds `# yaml-language-server: $schema=...` header. |
| `docs/contracts.md`, `docs/integration_otel.md`, `docs/troubleshooting.md` | New contracts page, OTel guide with Laminar/Phoenix examples, troubleshooting addition for `explanation()`. |
| `AGENTS.md`, `CHANGELOG.md` | Updated module map + comprehensive `[Unreleased]` entries. |
| `tests/test_envelope.py`, `tests/test_router.py`, `tests/test_otel.py`, `tests/test_cli.py`, `tests/test_schema_gen.py` | New + updated tests covering report, explanation, OTel SemConv assertions, stats CLI, and schema round-trips/drift. |
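To make the routing rationale rendered by `routing/explanation.py` more concrete, here is a rough sketch of a versioned dict plus a Markdown view built from it. The function names, the dict keys other than `version`, and the `ambiguity_threshold` default are hypothetical; only the general shape (top-k table, confidence-gap line, ambiguity flag, no argument schemas or descriptions in the output) follows the PR description.

```python
from typing import Optional


def explanation_dict(
    ranked: list[tuple[str, float]],
    *,
    ambiguity_threshold: float = 0.05,  # hypothetical default
) -> dict:
    """Summarise a best-first list of (tool name, score) pairs as a versioned dict."""
    top_k = [
        {"rank": rank, "name": name, "score": score}
        for rank, (name, score) in enumerate(ranked, start=1)
    ]
    gap: Optional[float] = None
    if len(ranked) >= 2:
        # Distance between the best and second-best candidate.
        gap = ranked[0][1] - ranked[1][1]
    return {
        "version": 1,
        "top_k": top_k,
        "confidence_gap": gap,
        "ambiguous": gap is not None and gap < ambiguity_threshold,
    }


def explanation_md(ranked: list[tuple[str, float]]) -> str:
    """Render the same data as a small Markdown table plus a confidence-gap line."""
    data = explanation_dict(ranked)
    lines = ["| rank | tool | score |", "|---|---|---|"]
    for row in data["top_k"]:
        lines.append(f"| {row['rank']} | {row['name']} | {row['score']:.3f} |")
    if data["confidence_gap"] is not None:
        lines.append(f"Confidence gap: {data['confidence_gap']:.3f}")
    if data["ambiguous"]:
        lines.append("Ambiguous: consider asking a clarifying question.")
    return "\n".join(lines)


print(explanation_md([("search", 0.75), ("fetch", 0.5)]))
```

Only names and scores enter the output here, which is one simple way to keep argument schemas and long descriptions out of logs and span attributes.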
Benchmark delta (vs
| size | recall@k (head Δ vs base) | MRR (head Δ vs base) | p99 (ms) |
|---|---|---|---|
| 50 | ✅ 0.5649 (+0.0000) | ✅ 0.4978 (+0.0000) | ✅ 0.442 (base 0.463) |
| 83 | ✅ 0.3825 (+0.0000) | ✅ 0.3242 (+0.0000) | ✅ 0.720 (base 0.876) |
| 1000 | ✅ 0.1475 (+0.0000) | ✅ 0.1456 (+0.0000) | ✅ 33.863 (base 31.897) |
Per-backend × per-size matrix
| backend | size | recall@k (Δ) | MRR (Δ) | p99 (ms) |
|---|---|---|---|---|
| bm25 | 100 | ✅ 0.3825 (+0.0000) | ✅ 0.3399 (+0.0000) | ✅ 5.939 (base 5.642) |
| bm25 | 500 | ✅ 0.2250 (+0.0000) | ✅ 0.2165 (+0.0000) | ✅ 28.543 (base 27.538) |
| bm25 | 1000 | ✅ 0.1575 (+0.0000) | ✅ 0.1525 (+0.0000) | ✅ 85.467 (base 78.368) |
| fuzzy | 100 | ✅ 0.0000 (+0.0000) | ✅ 0.0000 (+0.0000) | ✅ 0.000 (base 0.000) |
| fuzzy | 500 | ✅ 0.0000 (+0.0000) | ✅ 0.0000 (+0.0000) | ✅ 0.000 (base 0.000) |
| fuzzy | 1000 | ✅ 0.0000 (+0.0000) | ✅ 0.0000 (+0.0000) | ✅ 0.000 (base 0.000) |
| tfidf | 100 | ✅ 0.3825 (+0.0000) | ✅ 0.3220 (+0.0000) | ✅ 0.983 (base 0.872) |
| tfidf | 500 | ✅ 0.2325 (+0.0000) | ✅ 0.2314 (+0.0000) | ✅ 9.396 (base 8.660) |
| tfidf | 1000 | ✅ 0.1475 (+0.0000) | ✅ 0.1456 (+0.0000) | ✅ 33.875 (base 30.071) |
Context pipeline (per scenario)
| scenario | tokens | dropped | dedup |
|---|---|---|---|
| large_catalog | 1514 (base 1514, Δ+0) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| long_conversation | 2548 (base 2548, Δ+0) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| short_conversation | 496 (base 496, Δ+0) | 0 (base 0, Δ+0) | 0 (base 0, Δ+0) |
| stress_conversation | 6651 (base 6651, Δ+0) | 7 (base 7, Δ+0) | 4 (base 4, Δ+0) |
Numbers come from make benchmark / make benchmark-matrix.
Latency is hardware-dependent — treat the markers as a rough guide.
See benchmarks/scorecard.md for the full picture.
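For readers unfamiliar with the two quality columns, the sketch below uses the common textbook definitions of recall@k and mean reciprocal rank (MRR). It is an assumption that the benchmark harness uses exactly these definitions; the function names are illustrative and `benchmarks/scorecard.md` remains the reference.

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top-k results.

    Assumes `ranked` contains no duplicate items.
    """
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 when none is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(queries: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (ranked results, relevant set) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


print(recall_at_k(["a", "b", "c"], {"b", "z"}, k=2))
```

Under these definitions a `+0.0000` delta in both columns means the ranking output did not change for the benchmark queries, which is the expected outcome for a PR that adds inspection surfaces without touching scoring.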
…umented enum

Addresses Phase 2 audit finding on PR #231 — `ChoiceCard.kind` was tightened to `Literal["tool", "agent", "skill", "internal"]` so the published `choice_card.schema.json` carries an enum constraint, but `__post_init__` only enforced name/tags size bounds. The `Literal[...]` annotation only constrains mypy: a Python caller (or `ChoiceCard.from_dict` reading an external JSON payload) could still construct `ChoiceCard(kind="bogus")` and produce an object that violates the published schema — a contract leak between the type system and runtime.

This commit adds a runtime check in `__post_init__` that rejects any value not in `CHOICE_CARD_KINDS`, mirroring the existing size-bound enforcement. The check fires on every construction path including `ChoiceCard.from_dict`, matching the PR description's claim of "enforce ... on every code path including from_dict".

Tests:
- test_choice_card_rejects_unknown_kind — direct ChoiceCard(kind="bogus")
- test_choice_card_from_dict_rejects_unknown_kind — via from_dict()
- test_choice_card_accepts_all_documented_kinds — pins the enum

Verification:
- ruff format/lint/mypy clean
- pytest -q: 995 passed, 5 skipped
- make schemas-check / scorecard-check / llms-check all clean
- make example + make demo clean
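The gap between a `Literal[...]` annotation and runtime behaviour can be shown with a simplified sketch. `ChoiceCardSketch` is a hypothetical stand-in with only two fields; deriving `CHOICE_CARD_KINDS` via `typing.get_args` and raising `ValueError` are assumptions, since the commit message names the constant but not how it is built or which exception is raised.

```python
from dataclasses import dataclass
from typing import Literal, get_args

ChoiceCardKind = Literal["tool", "agent", "skill", "internal"]
# Runtime mirror of the Literal above (assumed derivation, see lead-in).
CHOICE_CARD_KINDS: tuple[str, ...] = get_args(ChoiceCardKind)


@dataclass
class ChoiceCardSketch:
    """Hypothetical stand-in for ChoiceCard; the real class has more fields."""

    name: str
    kind: ChoiceCardKind = "tool"

    def __post_init__(self) -> None:
        # Literal[...] is only checked by static type checkers, so validate here.
        if self.kind not in CHOICE_CARD_KINDS:
            raise ValueError(
                f"kind must be one of {CHOICE_CARD_KINDS}, got {self.kind!r}"
            )

    @classmethod
    def from_dict(cls, data: dict) -> "ChoiceCardSketch":
        # Construction goes through __init__, so __post_init__ validation applies too.
        return cls(name=data["name"], kind=data.get("kind", "tool"))


card = ChoiceCardSketch.from_dict({"name": "search", "kind": "agent"})
print(card.kind)
```

Because `from_dict` builds the object through the normal constructor, the same check covers both direct construction and payloads read from external JSON.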
- otel.py: gate PII-prone prompt content behind otel_emit_experimental flag (was dead code — stored but never checked in on_context_built)
- otel.py: add version-guard comment for incubating SemConv import path
- __main__.py: add event-index context to _restore_manager_from_session errors
- _schema_gen.py: document 300-line soft cap exemption in module docstring
- test_otel.py: add test_experimental_flag_gates_prompt_in_span verifying the flag conditionally includes/excludes prompt in span attributes
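The flag gating from the first bullet can be sketched without an OpenTelemetry dependency by assembling the attributes as a plain dict. The function name `build_span_attributes` and the `contextweaver.prompt` key are hypothetical; `gen_ai.system`, `gen_ai.usage.input_tokens`, and the `otel_emit_experimental` flag with its `False` default come from the PR description.

```python
def build_span_attributes(
    prompt: str,
    input_tokens: int,
    *,
    otel_emit_experimental: bool = False,
) -> dict[str, object]:
    """Assemble span attributes, adding prompt text only when explicitly enabled."""
    attributes: dict[str, object] = {
        "gen_ai.system": "contextweaver",
        "gen_ai.usage.input_tokens": input_tokens,
    }
    if otel_emit_experimental:
        # Hypothetical attribute key; the real name in extras/otel.py may differ.
        attributes["contextweaver.prompt"] = prompt
    return attributes


safe = build_span_attributes("user said hello", 4)
verbose = build_span_attributes("user said hello", 4, otel_emit_experimental=True)
print(sorted(safe), sorted(verbose))
```

A test in the spirit of `test_experimental_flag_gates_prompt_in_span` would assert that the prompt key is absent by default and present only when the flag is set.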