feat(rag): emit corpus manifest as upstream source for presentation surfaces#154

Merged
cipher813 merged 2 commits into main from feat/rag-manifest-emitter on May 5, 2026

@cipher813
Owner

Summary

Adds rag/pipelines/emit_manifest.py — runs at the end of the weekly RAG ingestion (step 6/6) and writes a public-safe corpus snapshot to s3://alpha-engine-research/rag/manifest/{date}.json plus a latest.json pointer.
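
A minimal sketch of the key layout described above, not the actual contents of emit_manifest.py. The bucket and prefix are taken from the S3 path in this summary. The ISO `YYYY-MM-DD` date format, the function names, and the choice to write `latest.json` as a full copy of the dated snapshot are assumptions. `s3_client` is expected to behave like a boto3 S3 client.

```python
import json
from datetime import date

BUCKET = "alpha-engine-research"  # from the S3 path in the summary
PREFIX = "rag/manifest"           # from the S3 path in the summary


def manifest_keys(run_date: date) -> tuple[str, str]:
    """Return (dated_key, latest_key). The ISO date format is an assumption."""
    return f"{PREFIX}/{run_date.isoformat()}.json", f"{PREFIX}/latest.json"


def write_manifest(s3_client, manifest: dict, run_date: date) -> list[str]:
    """Write the same JSON body to the dated key and to latest.json."""
    body = json.dumps(manifest, sort_keys=True).encode("utf-8")
    keys = list(manifest_keys(run_date))
    for key in keys:
        s3_client.put_object(
            Bucket=BUCKET, Key=key, Body=body, ContentType="application/json"
        )
    return keys
```

Writing the dated object first means a failure partway through leaves `latest.json` pointing at the previous good snapshot.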

Why upstream-first

Per Decision 11 of the presentation revamp plan (alpha-engine-docs/private/alpha-engine-presentation-revamp-260503.md), the future Knowledge Base panel on nousergon.ai must be a view of an existing system output, not a new measurement layer in a dashboard loader. Without this manifest, the dashboard would have to query pgvector directly — exactly what Decision 11 forbids.

Manifest contents (public-safe aggregates only)

  • totals — documents · chunks · tickers
  • by_source — per doc_type rollup (10-K · 10-Q · 8-K · earnings · thesis): document count · distinct ticker count · chunk count
  • by_ticker_coverage — tickers_with_any_doc + p25/p50/p75 docs/ticker
  • embedding — model name + dimension (voyage-3-lite · 1024d)
  • ingestion — overall + per-source last_run_ts
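
The rollups listed above could be computed roughly as follows. This is a sketch under assumptions, not the code in this PR: the input row shape `(ticker, doc_type, chunk_count)`, the field names inside each section, and the use of inclusive quartiles are all hypothetical. Only the five top-level section names come from the list above.

```python
from collections import defaultdict
from statistics import quantiles


def build_manifest(rows, model, dimension, ingestion):
    """Aggregate per-document rows into public-safe counts.

    rows: iterable of (ticker, doc_type, chunk_count), one tuple per
    document (hypothetical input shape).
    ingestion: dict of last-run timestamps, passed through unchanged.
    """
    rows = list(rows)
    by_source = defaultdict(lambda: {"documents": 0, "tickers": set(), "chunks": 0})
    docs_per_ticker = defaultdict(int)
    for ticker, doc_type, chunk_count in rows:
        entry = by_source[doc_type]
        entry["documents"] += 1
        entry["tickers"].add(ticker)
        entry["chunks"] += chunk_count
        docs_per_ticker[ticker] += 1

    counts = sorted(docs_per_ticker.values())
    # statistics.quantiles needs at least two data points on older Pythons.
    if len(counts) >= 2:
        p25, p50, p75 = quantiles(counts, n=4, method="inclusive")
    else:
        p25 = p50 = p75 = float(counts[0]) if counts else 0.0

    return {
        "totals": {
            "documents": len(rows),
            "chunks": sum(r[2] for r in rows),
            "tickers": len(docs_per_ticker),
        },
        "by_source": {
            doc_type: {
                "documents": e["documents"],
                "tickers": len(e["tickers"]),  # distinct count only, no names
                "chunks": e["chunks"],
            }
            for doc_type, e in sorted(by_source.items())
        },
        "by_ticker_coverage": {
            "tickers_with_any_doc": len(docs_per_ticker),
            "docs_per_ticker": {"p25": p25, "p50": p50, "p75": p75},
        },
        "embedding": {"model": model, "dimension": dimension},
        "ingestion": ingestion,
    }
```

Ticker symbols are used only to count distinct values and never appear in the returned dict, which keeps the output to aggregates.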

Disclosure boundary

Per ~/Development/CLAUDE.md: per-ticker doc lists, individual document titles, and chunk content are intentionally excluded — those land only on the private dashboard's Workstream 3.5 page under Cloudflare Access during interview screenshare.
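
One way to enforce this boundary mechanically is an allowlist on the manifest's top-level sections. The sketch below is illustrative only: the function name is hypothetical, and the allowlist simply mirrors the five sections named under "Manifest contents".

```python
ALLOWED_SECTIONS = {
    "totals",
    "by_source",
    "by_ticker_coverage",
    "embedding",
    "ingestion",
}


def check_public_safe(manifest: dict) -> None:
    """Raise ValueError if any top-level section is outside the allowlist."""
    unexpected = set(manifest) - ALLOWED_SECTIONS
    if unexpected:
        raise ValueError(f"non-public sections in manifest: {sorted(unexpected)}")
```

An allowlist fails closed: a new section added later is rejected until someone explicitly marks it as public-safe.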

Wiring

Hooked into run_weekly_ingestion.sh as step 6/6 (after filing change detection). Skipped in --dry-run mode like the other S3 writers.
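
The dry-run behaviour could look like the sketch below. It is an assumption about how the flag might be handled inside a Python entry point; the real skip may live in run_weekly_ingestion.sh instead. The `build` and `write` callables are hypothetical stand-ins injected for illustration.

```python
import argparse


def main(argv, build, write):
    """Build the manifest and write it unless --dry-run is set.

    build() returns the manifest dict; write(manifest) performs the upload.
    Returns True when the manifest was written.
    """
    parser = argparse.ArgumentParser(description="Emit RAG corpus manifest")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="build the manifest but skip the S3 write",
    )
    args = parser.parse_args(argv)

    manifest = build()
    if args.dry_run:
        print(f"[dry-run] would write manifest with sections: {sorted(manifest)}")
        return False
    write(manifest)
    return True
```

Building the manifest even in dry-run mode still exercises the aggregation queries, so only the side effect is skipped.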

Test plan

  • pytest tests/test_emit_manifest.py -q — 8/8 pass
  • Full unit suite — 430/430 pass (422 before + 8 new)
  • First Saturday SF run produces s3://alpha-engine-research/rag/manifest/latest.json

Follow-up

Tracked as a P2 in alpha-engine-docs/private/ROADMAP.md: once latest.json exists, ship the public Knowledge Base panel on nousergon.ai reading from it.

Side discovery (not in this PR)

alpha-engine-lib/src/alpha_engine_lib/rag/schema.sql:31 declares embedding vector(512) but embeddings.py documents voyage-3-lite as 1024d. Either the schema is stale or the column is undersized, which is worth a separate audit.
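
A starting point for that audit might be a small check that parses the declared pgvector dimensions out of the schema text and compares them with the dimension the embedding code expects. This is a sketch only: the helper names are hypothetical, and the regex assumes plain `column vector(N)` declarations.

```python
import re

# Matches declarations such as "embedding vector(512)".
_VECTOR_COL = re.compile(r"\b(\w+)\s+vector\((\d+)\)", re.IGNORECASE)


def vector_dimensions(schema_sql: str) -> dict[str, int]:
    """Map column name to the pgvector dimension declared in the SQL text."""
    return {name: int(dim) for name, dim in _VECTOR_COL.findall(schema_sql)}


def dimension_mismatches(schema_sql: str, expected_dim: int) -> dict[str, int]:
    """Return the columns whose declared dimension differs from expected_dim."""
    return {
        col: dim
        for col, dim in vector_dimensions(schema_sql).items()
        if dim != expected_dim
    }
```

Running `dimension_mismatches` over the contents of schema.sql with the documented dimension would flag the column in question. It does not say which side is wrong.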

🤖 Generated with Claude Code

cipher813 and others added 2 commits May 5, 2026 08:00
…urfaces

Adds `rag/pipelines/emit_manifest.py` — runs at the end of the weekly
RAG ingestion (step 6/6) and writes a public-safe corpus snapshot to
`s3://alpha-engine-research/rag/manifest/{date}.json` plus a
`latest.json` pointer.

**Why upstream-first** (per Decision 11 of the presentation revamp plan):
the future Knowledge Base panel on nousergon.ai must be a *view* of an
existing system output, not a new measurement layer in a dashboard
loader. Without this manifest, the dashboard would have to query
pgvector directly — exactly what Decision 11 forbids.

**Manifest contents** (public-safe aggregates only):
- `totals` — documents · chunks · tickers
- `by_source` — per doc_type rollup (10-K · 10-Q · 8-K · earnings · thesis):
  document count · distinct ticker count · chunk count
- `by_ticker_coverage` — tickers_with_any_doc + p25/p50/p75 docs/ticker
- `embedding` — model name + dimension (voyage-3-lite · 1024d)
- `ingestion` — overall + per-source `last_run_ts`

**Disclosure boundary** (per `~/Development/CLAUDE.md`): per-ticker doc
lists, individual document titles, and chunk content are intentionally
excluded — those land only on the private dashboard's Workstream 3.5
page under Cloudflare Access during interview screenshare.

**Wired** into `run_weekly_ingestion.sh` as step 6/6 (after filing
change detection); skipped in `--dry-run` mode like the other S3
writers.

**Tests:** 8 new unit tests mocking `execute_query`. Verifies the
manifest schema, by-source rollup math, JSON-serializability, and the
S3 put-object key pattern. 422 → 430 total tests pass.

**Follow-up** (added as P2 in alpha-engine-docs ROADMAP): once the
first Saturday SF run produces `latest.json`, ship the Knowledge Base
panel on the nousergon.ai home page reading from it.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 merged commit fb95630 into main on May 5, 2026
1 check passed
cipher813 deleted the feat/rag-manifest-emitter branch on May 5, 2026 15:07