
Add reference architecture, benchmarks, and discoverability polish #203

Merged
dgenio merged 6 commits into main from claude/triage-issues-o7Ofs
May 16, 2026
Conversation

@dgenio
Owner

@dgenio dgenio commented May 15, 2026

Summary

Lands the 5-issue "external-facing surface polish" bundle selected by the triage pass. Relates to #181, #197, #198, #199, #200 — none of these are auto-closed by the PR keyword grammar; some are fully addressed by this PR and some are partial with explicit follow-ups (see below).

Three major components:

  1. Slack ops bot reference architecture (relates to #198, partial) — a production-shaped, runnable example of a multi-turn agent fronting 48 ops tools, demonstrating how routing, the context firewall, and persistent facts compose end-to-end. The code-review bot and voice agent architectures are deferred to follow-up issues #204 and #205, so #198 stays open.
  2. Public benchmark scorecard (relates to #197, partial) — a deterministic, committed markdown document (benchmarks/scorecard.md) regenerated from benchmarks/results/latest.json, so users can evaluate routing recall, token savings, and latency without running the harness. The per-backend matrix (tfidf/bm25/fuzzy) and weekly scheduled CI regeneration are deferred to follow-up issues #206 and #207, so #197 stays open.
  3. Discoverability polish (relates to #200, full) — README badges (CI, PyPI, Python versions, license, docs, Discussions), extended keywords, CITATION.cff, updated pyproject.toml metadata, and a new decision-tree landing page (docs/which_pattern.md) that routes users to the smallest piece that fixes their symptom. The social-preview PNG remains deferred per the issue's own implementation notes (it needs a GitHub UI step beyond a code PR).

Also fully landed in this PR: #181 (stress-test benchmark scenario forcing items_dropped > 0 + dedup_removed > 0) and #199 (decision-tree landing page).
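To make the two counters concrete, here is a toy sketch of how a budget-driven selection pass can produce both `dedup_removed > 0` and `items_dropped > 0`. It is not contextweaver's pipeline: `Item`, `select_within_budget`, the whitespace/case-insensitive duplicate key, and the newest-first greedy policy are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    text: str
    tokens: int  # caller-supplied token estimate (assumption, not a real tokenizer)


def select_within_budget(items: list[Item], budget: int) -> tuple[list[Item], int, int]:
    """Return (kept, dedup_removed, items_dropped) for a newest-first greedy pass."""
    seen: set[str] = set()
    unique: list[Item] = []
    dedup_removed = 0
    for item in items:
        # Crude near-duplicate key: lowercase and collapse whitespace.
        key = " ".join(item.text.lower().split())
        if key in seen:
            dedup_removed += 1
            continue
        seen.add(key)
        unique.append(item)

    kept: list[Item] = []
    used = 0
    items_dropped = 0
    for item in reversed(unique):  # prefer the most recent items
        if used + item.tokens <= budget:
            kept.append(item)
            used += item.tokens
        else:
            items_dropped += 1
    kept.reverse()  # restore chronological order
    return kept, dedup_removed, items_dropped
```

A scenario only exercises both counters when it contains repeated content and its total token volume exceeds the budget, which is the shape the stress scenario is described as having.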

Issue status after this PR

| Issue | Status | Notes |
| --- | --- | --- |
| #181 | ✅ Fully addressed | stress_conversation.jsonl triggers drops=7, dedup=4, compaction=3.29× |
| #197 | 🟡 Partial — leave open | Scorecard committed; follow-ups #206, #207 |
| #198 | 🟡 Partial — leave open | Slack ops bot landed; follow-ups #204, #205 |
| #199 | ✅ Fully addressed | docs/which_pattern.md + cross-links + mkdocs nav |
| #200 | ✅ Fully addressed (minus PNG) | Badges, CITATION, pyproject metadata, classifiers |

Changes

Reference Architecture (Slack ops bot)

  • examples/architectures/slack_ops_bot/main.py — runnable example with a 6-turn on-call investigation transcript, demonstrating routing shortlist selection, firewall artifact interception, and persistent fact tracking across turns. The 240-line synthetic log dump and tool-response map are built lazily via functools.cache so import is cheap.
  • examples/architectures/slack_ops_bot/catalog.yaml — 48-tool catalog (logs, deploy, oncall, alerts, tickets, metrics, identity, infra, feature) with namespaces, side-effect flags, and cost hints.
  • examples/architectures/slack_ops_bot/README.md — local documentation explaining the architecture, setup, and key patterns.
  • examples/architectures/slack_ops_bot/OUTPUT.md — captured deterministic run output showing routing decisions, firewall behavior, and fact persistence.
  • tests/test_architectures_slack.py — smoke tests pinning deterministic invariants (catalog size, intent match rate, firewall artifact interception). The firewall assertion checks substring shape rather than the exact character count so cosmetic transcript edits don't churn the test.
  • docs/architectures/slack_ops_bot.md — public documentation linking to the architecture and explaining its shape.
  • docs/architectures/index.md — landing page for reference architectures.
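The lazy-initialisation pattern mentioned for main.py can be sketched as follows. This is a minimal illustration, not the example's actual code: the function names, tool ids, and log text are hypothetical; only the use of `functools.cache` and the 240-line size come from the description above.

```python
import functools

LOG_LINE_COUNT = 240  # matches the 240-line synthetic dump described above


@functools.cache
def synthetic_log_dump() -> str:
    """Build the large log text on first use; later calls reuse the cached string."""
    lines = [
        f"line={i:03d} service=payments-api level=ERROR msg='upstream timeout'"
        for i in range(LOG_LINE_COUNT)
    ]
    return "\n".join(lines)


@functools.cache
def tool_responses() -> dict[str, str]:
    """Map hypothetical tool ids to canned responses, built once on demand."""
    return {
        "logs.search": synthetic_log_dump(),
        "deploy.status": "payments-api deployed 14 minutes ago",
    }
```

Because nothing is built at module level, importing the module does no string work; the first call pays the cost and every later call returns the same object.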

Benchmark Scorecard

  • scripts/render_scorecard.py — deterministic stdlib-only renderer that converts benchmarks/results/latest.json to benchmarks/scorecard.md. No contextweaver imports, validates required fields, derives the answer-phase budget from the payload's context rows, and guarantees byte-identical output for identical input.
  • benchmarks/scorecard.md — committed, auto-generated markdown with routing accuracy (precision@k, recall@k, MRR) and context-pipeline metrics (token budgets, dedup, compaction) at three catalog sizes.
  • benchmarks/results/latest.json — updated with fresh latency measurements (p50/p95/p99 in ms).
  • benchmarks/scenarios/stress_conversation.jsonl — new 30+ turn SEV2 incident-response scenario with 3 large tool results to exercise the firewall, near-duplicate pairs to exercise dedup, and total token volume that pushes past the 6000-token answer budget so items_dropped > 0. The fixture is hand-authored to hit specific firewall/dedup/drop targets; regeneration must satisfy the invariants pinned in #181's acceptance criteria.
  • tests/test_render_scorecard.py — unit tests for determinism, field validation, and table formatting.
  • docs/benchmarks.md — public documentation explaining what is measured and how to reproduce.
  • benchmarks/README.md — harness reference with the per-scenario invariants table.
  • Makefile — new scorecard, scorecard-check, and architectures targets.
  • .github/workflows/ci.yml — new gating scorecard-check step.
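The render-then-check idea behind scorecard and scorecard-check can be sketched with the standard library alone. The following is an illustration under stated assumptions, not the contents of scripts/render_scorecard.py: the `rows` key and the `catalog_size`, `recall_at_k`, and `p50_ms` fields are a hypothetical schema, and the real latest.json layout is not shown in this PR.

```python
import json
from pathlib import Path

REQUIRED_FIELDS = ("catalog_size", "recall_at_k", "p50_ms")  # hypothetical schema


def render_scorecard(payload: dict) -> str:
    """Render rows to a markdown table; the same payload always yields the same bytes."""
    rows = payload.get("rows")
    if not isinstance(rows, list) or not rows:
        raise ValueError("payload must contain a non-empty 'rows' list")
    for row in rows:
        missing = [name for name in REQUIRED_FIELDS if name not in row]
        if missing:
            raise ValueError(f"row is missing required fields: {missing}")
    lines = [
        "| Catalog size | Recall@k | p50 (ms) |",
        "| ---: | ---: | ---: |",
    ]
    # Sort explicitly so the output never depends on input ordering.
    for row in sorted(rows, key=lambda r: r["catalog_size"]):
        # Fixed-precision formatting keeps float rendering stable.
        lines.append(
            f"| {row['catalog_size']} | {row['recall_at_k']:.3f} | {row['p50_ms']:.2f} |"
        )
    return "\n".join(lines) + "\n"


def check_drift(json_path: Path, md_path: Path) -> bool:
    """Return True when the committed markdown matches a fresh render of the JSON."""
    payload = json.loads(json_path.read_text(encoding="utf-8"))
    return md_path.read_text(encoding="utf-8") == render_scorecard(payload)
```

A gating step built this way compares only the committed JSON and markdown pair, so it fails when someone edits one without regenerating the other and is unaffected by timing noise on the CI runner.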

Discoverability & Documentation

  • README.md — added badge row (CI, PyPI version, Python versions, license, docs, Discussions); clarified "context engineering" as a secondary phrase; added decision-tree link and benchmark scorecard link.
  • llms-full.txt and llms.txt — regenerated to pick up the README and pyproject changes.
  • docs/which_pattern.md — new decision tree routing users to the smallest piece that fixes their symptom (routing-only, firewall-only, or full pipeline).
  • docs/architecture.md — added pointer to which_pattern.md for new users.
  • docs/interop.md — added "In a hurry?" link.
  • docs/cookbook.md — added link to reference architectures.
  • CITATION.cff — minimal CFF v1.2 citation metadata for GitHub's "Cite this repository" button. version and date-released are documented as release-process owned (updated at tag-cut time). The identifiers: block is intentionally omitted so the file passes CFF validators without a placeholder DOI; it will be added once a real DOI is minted.
  • pyproject.toml: Documentation / Issues / Discussions URLs, Framework :: AsyncIO + Topic :: Scientific/Engineering :: Artificial Intelligence classifiers, extended keywords.
  • AGENTS.md and docs/agent-context/workflows.md — updated commands tables with the new make targets.

Non-goals

Follow-up issues (deliberately out of scope)

| # | Title | Source issue |
| --- | --- | --- |
| #204 | code_review_bot reference architecture | #198 |
| #205 | voice_agent reference architecture (Pipecat) | #198 |
| #206 | per-backend scorecard matrix (tfidf/bm25/fuzzy at 100/500/1000) | #197 |
| #207 | weekly scheduled scorecard regeneration CI job | #197 |

#198 and #197 remain open after this PR so the follow-up work has a parent issue to track against.

Risks & rollback

Low risk — additive surface (new example, new benchmark fixture, new docs, new optional Makefile targets) plus small metadata edits. Library behavior is unchanged: no src/contextweaver/ pipeline code is touched. scorecard-check is the new gating CI step; if it ever reports drift, rerun make scorecard from a clean checkout to regenerate benchmarks/scorecard.md. Rollback is git revert of this PR's merge commit.

Verification

All 6 make ci targets pass locally on Python 3.12 (ruff format --check, ruff check, mypy src/, pytest — 899 passed / 8 skipped, make example, make demo), plus make scorecard-check (CI-gated, separate from make ci). CI is green across Python 3.10 / 3.11 / 3.12 on the latest head.

…#199, #200)

Lands the five-issue "external-facing surface polish" group selected by the
triage pass. No changes to src/contextweaver/ pipeline code.

#200 — Discoverability polish
  - README badges (CI, PyPI, Python versions, license, docs, Discussions)
  - "context engineering" as a secondary phrase in intro + Problem
  - CITATION.cff at repo root (CFF v1.2)
  - pyproject Documentation/Issues/Discussions URLs, Framework :: AsyncIO
    and Topic :: Scientific/Engineering :: AI classifiers, extended keywords
  - Social-preview PNG deferred (needs GitHub UI step)

#199 — Decision-tree landing page
  - docs/which_pattern.md branches on user symptom -> one concrete next step
  - Linked from README Quickstart and top of architecture.md / interop.md
  - mkdocs nav entry added

#198 — Production reference architectures cookbook (Slack ops bot)
  - examples/architectures/slack_ops_bot/ with 48-tool catalog.yaml,
    6-turn scripted main.py, README, OUTPUT.md
  - Demonstrates bounded-choice pattern (Router narrows 48 -> 3; bot picks
    from shortlist), firewall on 34 KB log dump, persistent facts across turns
  - docs/architectures/index.md + slack_ops_bot.md pages
  - make architectures umbrella target wired into make example
  - Code-review bot + voice agent deferred to follow-up issues

#197 — Benchmark scorecard
  - scripts/render_scorecard.py: stdlib-only deterministic renderer
  - make scorecard / make scorecard-check targets
  - benchmarks/scorecard.md committed, regeneratable in one command
  - docs/benchmarks.md methodology page
  - README "Why Trust" gains scorecard link
  - CI: new gating scorecard-check step (runs before the non-gating benchmark
    step so it validates the committed pair, not the runner's fresh numbers)
  - Per-backend matrix and weekly CI regen tracked as follow-ups

#181 — Stress-test benchmark scenario
  - benchmarks/scenarios/stress_conversation.jsonl: SEV2 incident-response
    transcript, 30 user turns, 3 large tool results (firewall fires), 4
    near-duplicate pairs (dedup fires), total tokens push past the 6000-token
    answer budget so items_dropped > 0
  - benchmarks/results/latest.json regenerated; benchmark README baseline
    metrics table refreshed and corrected the stale "git-ignored" note

Tests
  - tests/test_render_scorecard.py: 12 tests covering renderer determinism,
    routing/context table formatting, missing-field validation, --check mode
  - tests/test_architectures_slack.py: 8 tests pinning catalog size, intent
    match count, firewall trigger, persisted facts, final prompt rendering

Verification
  - ruff format --check src/ tests/ examples/ scripts/ - clean
  - ruff check src/ tests/ examples/ scripts/ - all checks passed
  - mypy src/ - 54 files, no issues
  - pytest - 768 passed, 8 skipped (was 748 baseline; +20 new tests)
  - make example - all scripts including Slack ops bot run cleanly
  - make demo - clean
  - make scorecard-check - deterministic (no drift)
  - make llms-check - up to date

https://claude.ai/code/session_01HeDDydF9mkCZfBx6tYo8Jj
Copilot AI review requested due to automatic review settings May 15, 2026 12:55

Copilot AI left a comment


Pull request overview

This PR bundles three discoverability/reference-material initiatives: a Slack ops bot reference architecture (#198), a deterministic public benchmark scorecard with make scorecard/scorecard-check targets (#197), and README/PyPI metadata polish including CITATION.cff, badges, extended keywords/classifiers, and a symptom-based decision-tree landing page (#200). It also adds a stress-test benchmark scenario, expands make example with an architectures umbrella, and refreshes agent docs to reflect the new commands.

Changes:

  • Add Slack ops bot reference architecture (catalog.yaml, runnable main.py, README/OUTPUT, smoke tests) and link it from cookbook/docs/README.
  • Add deterministic stdlib-only scripts/render_scorecard.py plus benchmarks/scorecard.md, new stress_conversation.jsonl scenario, docs/benchmarks.md, and a gating scorecard-check CI step.
  • Discoverability polish: README badges, "context engineering" phrasing, CITATION.cff, extended pyproject.toml keywords/classifiers/URLs, docs/which_pattern.md decision tree, mkdocs nav additions, and llms.txt/llms-full.txt regeneration.

Reviewed changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated no comments.

Show a summary per file
| File | Description |
| --- | --- |
| examples/architectures/slack_ops_bot/{main.py,catalog.yaml,README.md,OUTPUT.md,__init__.py} | New Slack ops bot reference architecture wiring router, firewall, and FactStore |
| examples/architectures/__init__.py | Architectures package marker |
| tests/test_architectures_slack.py | Smoke tests pinning catalog size, intent match rate, firewall + fact invariants |
| scripts/render_scorecard.py | Deterministic stdlib renderer with --check drift mode |
| tests/test_render_scorecard.py | Determinism, validation, and CLI behavior tests |
| benchmarks/scorecard.md, benchmarks/results/latest.json, benchmarks/scenarios/stress_conversation.jsonl, benchmarks/README.md | Scorecard output, refreshed baseline, new stress scenario, doc refresh |
| docs/{which_pattern.md,benchmarks.md,architectures/index.md,architectures/slack_ops_bot.md,architecture.md,interop.md,cookbook.md,agent-context/workflows.md} | New decision tree, benchmarks doc, architecture pages, cross-links |
| README.md, llms.txt, llms-full.txt, mkdocs.yml, scripts/gen_llms.py | Badges, phrasing, nav additions, regenerated LLM context files |
| pyproject.toml, CITATION.cff | Extended metadata, classifiers, URLs, citation file |
| Makefile, .github/workflows/ci.yml, AGENTS.md, CHANGELOG.md | New scorecard/scorecard-check/architectures targets, gating CI step, docs/changelog updates |

@dgenio
Owner Author

dgenio commented May 15, 2026

Review — REQUEST_CHANGES (Phase 2/3 of review cycle)

Large PR — 33 files, 2,471 additions across 5 functional areas. All 6 make ci targets pass locally (fmt ✓ lint ✓ type ✓ test 899/899 ✓ example ✓ demo ✓).


Must-fix ✅ (resolved in 97c3263 — already on branch)

.github/workflows/ci.yml — missing --retry in the gating Weaver-spec conformance step.

- curl -fsSL "https://raw.githubusercontent.com/dgenio/weaver-spec/main/contracts/json/${s}.schema.json" \
+ curl -fsSL --retry 3 --retry-delay 2 "https://raw.githubusercontent.com/dgenio/weaver-spec/main/contracts/json/${s}.schema.json" \

The original curl -fsSL had no --retry flag. A transient hiccup on raw.githubusercontent.com (CDN, DNS, GitHub incident) would fail all four schema fetches, fail the gating step, and block every open PR on this repo — including unrelated hotfixes. Fixed in 97c3263.
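For readers more at home in Python than in curl flags, the same policy can be sketched as a small helper: `--retry 3` allows up to three retries after the first attempt, and `--retry-delay 2` waits a fixed two seconds between them. The helper name and its parameters are hypothetical, and it retries on any `OSError`, which is broader than curl's notion of a transient error.

```python
import time
from typing import Callable


def fetch_with_retry(
    fetch: Callable[[], bytes],
    retries: int = 3,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch(); on OSError retry up to `retries` more times with a fixed delay."""
    attempt = 0
    while True:
        try:
            return fetch()
        except OSError:
            if attempt >= retries:
                raise  # retries exhausted: surface the last error
            attempt += 1
            sleep(delay_seconds)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. `urllib.error.URLError` is a subclass of `OSError`, so a `fetch` built on `urllib.request.urlopen` would be covered by the same handler.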


Strengths

  • Slack ops bot is a production-quality reference architecture: bounded-choice routing pattern correctly demonstrated (Router narrows → bot selects), firewall artifact interception works, persistent facts carry across all 6 turns, 100% intent match on scripted transcript.
  • Scorecard renderer (scripts/render_scorecard.py) is stdlib-only, deterministic, and byte-stable, with 13 unit tests covering determinism, field validation, --check drift mode, and missing file handling. Correctly follows the gen_llms.py convention.
  • stress_conversation.jsonl fully satisfies #181: 7 dropped, 4 dedup removed, 3.29× avg compaction — all four previously-untested pipeline stages now appear in benchmark output.
  • Discoverability changes complete: Documentation/Issues URLs, Framework :: AsyncIO + AI classifier, extended keywords, CITATION.cff, docs/which_pattern.md (162 lines), linked from README/architecture.md/interop.md/mkdocs.yml.
  • All architectural boundaries maintained. No print() in library code. No business logic in __init__.py. AGENTS.md, CHANGELOG.md, docs/agent-context/workflows.md all updated.

Informational (non-blocking)

Follow-up issues created for deferred scope:

Suggest changing "Closes #198" to "Relates to #198" in the PR description so the issue stays open for #204/#205.

Nit: CITATION.cff is missing date-released and a DOI placeholder (doi: "10.5281/zenodo.XXXXXXX"). CFF 1.2 best practice — neither is required for GitHub's "Cite this repository" button today.

…nit)

Address the CFF 1.2 best-practice nit from the PR #203 review. Adds
`date-released: 2026-05-11` (matches the 0.3.0 release date in CHANGELOG)
and an `identifiers:` entry with a Zenodo DOI placeholder. Both are
optional for GitHub's "Cite this repository" button to render but
recommended by the spec — and trivial to add now so they don't get
forgotten when the project does get a real DOI.

The DOI value is intentionally a placeholder (`10.5281/zenodo.XXXXXXX`)
and is marked as such in the inline description; replace once a real
Zenodo archive of a tagged release exists.

https://claude.ai/code/session_01HeDDydF9mkCZfBx6tYo8Jj
- CITATION.cff: remove DOI placeholder identifier and document
  that version/date-released are owned by the release process.
- tests/test_architectures_slack.py: loosen brittle exact-char-count
  firewall assertion to substring checks for stability.
- examples/architectures/slack_ops_bot/main.py: lazy-init the large
  log dump and tool-response map via @functools.cache so import is cheap.
- scripts/render_scorecard.py: derive answer-phase budget from the
  benchmark payload instead of hard-coding 6000.
- benchmarks/README.md: note that stress_conversation.jsonl is
  hand-authored and point regeneration at the #181 acceptance criteria.
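The loosened firewall assertion can be pictured roughly as below. This is a sketch only: the summary line format, the `artifact:` prefix, and the helper name are hypothetical stand-ins, since the real rendered output lives in OUTPUT.md and is not reproduced in this PR conversation.

```python
import re

# Hypothetical firewall summary line, used only to illustrate the assertion style.
SUMMARY = "[firewall] stored artifact:logs_search_1 (34112 chars), summary passed downstream"


def assert_firewall_shape(rendered: str) -> None:
    """Check that an artifact reference and a size appear, without pinning the size."""
    assert "artifact:" in rendered
    assert re.search(r"\(\d+ chars\)", rendered) is not None


assert_firewall_shape(SUMMARY)
# A cosmetic transcript edit that changes the character count still passes.
assert_firewall_shape(SUMMARY.replace("34112", "33980"))
```

An exact-count assertion would fail on the second call, which is the churn the commit is trying to avoid.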

3 participants