feat(rag): emit corpus manifest as upstream source for presentation surfaces #154
Merged
Adds `rag/pipelines/emit_manifest.py` — runs at the end of the weekly
RAG ingestion (step 6/6) and writes a public-safe corpus snapshot to
`s3://alpha-engine-research/rag/manifest/{date}.json` plus a
`latest.json` pointer.
**Why upstream-first** (per Decision 11 of the presentation revamp plan):
the future Knowledge Base panel on nousergon.ai must be a *view* of an
existing system output, not a new measurement layer in a dashboard
loader. Without this manifest, the dashboard would have to query
pgvector directly — exactly what Decision 11 forbids.
**Manifest contents** (public-safe aggregates only):
- `totals` — documents · chunks · tickers
- `by_source` — per doc_type rollup (10-K · 10-Q · 8-K · earnings · thesis):
document count · distinct ticker count · chunk count
- `by_ticker_coverage` — tickers_with_any_doc + p25/p50/p75 docs/ticker
- `embedding` — model name + dimension (voyage-3-lite · 1024d)
- `ingestion` — overall + per-source `last_run_ts`
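The manifest sections listed above could be assembled roughly as follows. This is a minimal sketch, not the code in `emit_manifest.py`: the function name `build_manifest`, the input shapes, and every nested field name other than the five top-level keys and `tickers_with_any_doc` are assumptions made for illustration.

```python
import json
import statistics


def build_manifest(by_source_rows, docs_per_ticker, last_run_ts):
    """Assemble a manifest dict from pre-aggregated inputs (all shapes hypothetical).

    by_source_rows: list of dicts with doc_type, documents, tickers, chunks, last_run_ts
    docs_per_ticker: list of per-ticker document counts (at least two non-zero entries)
    last_run_ts: ISO-8601 string for the overall ingestion run
    """
    by_source = {
        row["doc_type"]: {
            "documents": row["documents"],
            "tickers": row["tickers"],
            "chunks": row["chunks"],
        }
        for row in by_source_rows
    }
    covered = [n for n in docs_per_ticker if n > 0]
    # Quartiles of documents per ticker; "inclusive" treats the data as the full population.
    p25, p50, p75 = statistics.quantiles(covered, n=4, method="inclusive")
    return {
        "totals": {
            "documents": sum(s["documents"] for s in by_source.values()),
            "chunks": sum(s["chunks"] for s in by_source.values()),
            # Distinct tickers cannot be summed across sources, so count them once.
            "tickers": len(covered),
        },
        "by_source": by_source,
        "by_ticker_coverage": {
            "tickers_with_any_doc": len(covered),
            "p25": p25,
            "p50": p50,
            "p75": p75,
        },
        # Model name and dimension as stated in this PR description.
        "embedding": {"model": "voyage-3-lite", "dimension": 1024},
        "ingestion": {
            "last_run_ts": last_run_ts,
            "by_source": {r["doc_type"]: r["last_run_ts"] for r in by_source_rows},
        },
    }
```

Only aggregates appear in the returned dict, so passing it through `json.dumps` yields a snapshot with no per-ticker document lists, titles, or chunk content.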
**Disclosure boundary** (per `~/Development/CLAUDE.md`): per-ticker doc
lists, individual document titles, and chunk content are intentionally
excluded — those land only on the private dashboard's Workstream 3.5
page under Cloudflare Access during interview screenshare.
**Wired** into `run_weekly_ingestion.sh` as step 6/6 (after filing
change detection); skipped in `--dry-run` mode like the other S3
writers.
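The write step and its `--dry-run` behavior might look like the sketch below. The helper names `manifest_keys` and `write_manifest` are hypothetical, the `{date}` format is assumed to be an ISO date string, and `latest.json` is assumed to be a full copy of the dated manifest; the PR description does not say how the pointer is implemented. `put_object` is injected so that a boto3 S3 client's `put_object` method or a test double can be passed in.

```python
import json

BUCKET = "alpha-engine-research"
PREFIX = "rag/manifest"


def manifest_keys(run_date):
    """Return the dated key and the latest.json key (date format is an assumption)."""
    return [f"{PREFIX}/{run_date}.json", f"{PREFIX}/latest.json"]


def write_manifest(manifest, run_date, put_object, dry_run=False):
    """Write the manifest under both keys; return the keys actually written."""
    if dry_run:
        # Mirror the other S3 writers: a dry run performs no writes.
        return []
    body = json.dumps(manifest, sort_keys=True).encode("utf-8")
    keys = manifest_keys(run_date)
    for key in keys:
        put_object(Bucket=BUCKET, Key=key, Body=body, ContentType="application/json")
    return keys
```

Injecting the writer keeps the key pattern testable without network access, which is one way the S3 put-object key check mentioned in the tests could be done.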
**Tests:** 8 new unit tests that mock `execute_query`. They verify the
manifest schema, by-source rollup math, JSON-serializability, and the
S3 put-object key pattern. 422 → 430 total tests pass.
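A test in this style could be sketched as below. It is illustrative only: `by_source_rollup`, the query text, and the row layout are hypothetical stand-ins, since the actual SQL and the signature of `execute_query` are not shown here. The mock is passed in as an argument to keep the example self-contained.

```python
import json
from unittest import mock


def by_source_rollup(execute_query):
    """Turn (doc_type, documents, tickers, chunks) rows into a per-source dict."""
    # Placeholder query; the real SQL is an unknown and is never executed here.
    rows = execute_query("SELECT doc_type, documents, tickers, chunks FROM rollup")
    return {
        doc_type: {"documents": docs, "tickers": tickers, "chunks": chunks}
        for doc_type, docs, tickers, chunks in rows
    }


def test_by_source_rollup():
    fake_query = mock.Mock(return_value=[("10-K", 4, 3, 120), ("8-K", 6, 5, 60)])
    result = by_source_rollup(fake_query)
    fake_query.assert_called_once()
    # Rollup math: per-source counts are preserved and sum as expected.
    assert result["10-K"] == {"documents": 4, "tickers": 3, "chunks": 120}
    assert sum(s["documents"] for s in result.values()) == 10
    # The rollup must survive JSON serialization unchanged.
    assert json.loads(json.dumps(result)) == result
```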
**Follow-up** (added as P2 in alpha-engine-docs ROADMAP): once the
first Saturday SF run produces `latest.json`, ship the Knowledge Base
panel on the nousergon.ai home page reading from it.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds `rag/pipelines/emit_manifest.py`, which runs at the end of the weekly RAG ingestion (step 6/6) and writes a public-safe corpus snapshot to `s3://alpha-engine-research/rag/manifest/{date}.json` plus a `latest.json` pointer.

Why upstream-first
Per Decision 11 of the presentation revamp plan (`alpha-engine-docs/private/alpha-engine-presentation-revamp-260503.md`), the future Knowledge Base panel on nousergon.ai must be a view of an existing system output, not a new measurement layer in a dashboard loader. Without this manifest, the dashboard would have to query pgvector directly, which is exactly what Decision 11 forbids.

Manifest contents (public-safe aggregates only)
- `totals`: documents · chunks · tickers
- `by_source`: per `doc_type` rollup (10-K · 10-Q · 8-K · earnings · thesis): document count · distinct ticker count · chunk count
- `by_ticker_coverage`: `tickers_with_any_doc` + p25/p50/p75 docs/ticker
- `embedding`: model name + dimension (voyage-3-lite · 1024d)
- `ingestion`: overall + per-source `last_run_ts`

Disclosure boundary
Per `~/Development/CLAUDE.md`: per-ticker doc lists, individual document titles, and chunk content are intentionally excluded; those land only on the private dashboard's Workstream 3.5 page under Cloudflare Access during interview screenshare.

Wiring
Hooked into `run_weekly_ingestion.sh` as step 6/6 (after filing change detection). Skipped in `--dry-run` mode like the other S3 writers.

Test plan
- `pytest tests/test_emit_manifest.py -q`: 8/8 pass
- `s3://alpha-engine-research/rag/manifest/latest.json`

Follow-up
Tracked as a P2 in `alpha-engine-docs/private/ROADMAP.md`: once `latest.json` exists, ship the public Knowledge Base panel on nousergon.ai reading from it.

Side discovery (not in this PR)
`alpha-engine-lib/src/alpha_engine_lib/rag/schema.sql:31` declares `embedding vector(512)` but `embeddings.py` documents voyage-3-lite as 1024d. Either schema is stale or the column is undersized; worth a separate audit.

🤖 Generated with Claude Code