feat(rag): Wave 1 PR A.3 — news → RAG ingest pipeline #229
Merged
Wave 1 PR A.3 of the institutional data-revamp arc (plan doc:
~/Development/alpha-engine-docs/private/data-revamp-260513.md).
Indexes aggregated news articles into the existing pgvector RAG
corpus alongside SEC filings. Consumer agents (thesis_update,
sector_quant/qual) retrieve relevant news at inference time via the
same hybrid-retrieval API the qual analyst's query_filings tool
already uses.
Pairs with PRs β/A.1/A.2 — completes the Wave 1 producer-side news
substrate. Consumer-side tool wiring (PR E) and fetch_data
integration (PR F) follow.
New module: rag/pipelines/ingest_news.py
ingest_articles(articles, *, filed_date, ticker_to_sector,
embed_texts_fn, document_exists_fn,
ingest_document_fn, dry_run) -> stats dict
Architecture:
- Mirrors ingest_8k_filings.py pattern (canonical RAG-pipeline shape)
- One document per (ticker, article) pair: multi-ticker articles
are indexed once per ticker, so the ticker-keyed RAG schema surfaces
them when the qual agent queries by any constituent ticker
- Idempotent via document_exists pre-check — re-runs skip the
embedding call entirely (saves vector-API cost)
- Chunking: one chunk per article (title + longest body excerpt).
Polygon/GDELT/Yahoo bodies are short (typically <500 tokens);
multi-chunk splitting can land in a follow-up if Benzinga or
a full-body adapter joins
- Source labeling: RAG `source` field prefixed `news_` so
consumer queries can filter "news only" vs "filings only" by
source-prefix without enumerating vendors
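The chunking and source-labeling rules above can be sketched as follows. This is a minimal illustration, not the module's code: the function names, the `NEWS_PREFIX` constant, the lowercase/strip normalization, and the `sec_10k` label are all assumptions made for the example.

```python
from typing import Iterable, Optional

NEWS_PREFIX = "news_"  # prefix named in this PR; the constant name is hypothetical


def rag_source(vendor: str) -> str:
    """Map a vendor name (e.g. 'polygon') to a RAG source label ('news_polygon').

    Lowercasing and stripping are assumptions, not confirmed behavior.
    """
    return f"{NEWS_PREFIX}{vendor.strip().lower()}"


def chunk_text(title: Optional[str], bodies: Iterable[Optional[str]]) -> str:
    """Build the single chunk for an article: title plus the longest body excerpt."""
    cleaned = [b.strip() for b in bodies if b and b.strip()]
    longest = max(cleaned, key=len, default="")
    parts = [p for p in ((title or "").strip(), longest) if p]
    return "\n\n".join(parts)


def is_news_source(source: str) -> bool:
    """Consumer-side filter: 'news only' without enumerating vendors."""
    return source.startswith(NEWS_PREFIX)
```

With this shape a consumer can separate news from filings with a single prefix test, for example `is_news_source("news_gdelt")` versus `is_news_source("sec_10k")`, where the second label is a made-up filings source.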
Canonical source selection:
Picks the canonical source by alphabetical order across an
article's variants, so re-ingests produce the same document
(deterministic across runs). Composes with the aggregator's
source-provenance preservation in PR β.
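One way to get that determinism is sketched below. It assumes "alphabetically" means taking the first source in sorted order and that variants arrive as plain vendor strings; both are assumptions for illustration, and the function name is hypothetical.

```python
from typing import Iterable


def canonical_source(variant_sources: Iterable[str]) -> str:
    """Pick one source for a deduplicated article, independent of input order."""
    # Sorting a set removes any dependence on fetch or dedup order.
    candidates = sorted({s for s in variant_sources if s})
    if not candidates:
        raise ValueError("article has no source variants")
    return candidates[0]
```

Because the result depends only on the set of variants, two runs that see the same variants in a different order still produce the same document key.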
Failure isolation:
Per-document failures (transient pgvector / embed-API hiccup)
isolated to that document; batch continues. ingest_document
returning None (lib's failure signal) counts as a failure
without crashing.
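The idempotency, dry-run, and failure-isolation behavior described so far can be sketched as one loop. This is a simplified illustration under stated assumptions: the `Article` dataclass stands in for AggregatedNewsArticle, the callable signatures are invented and are not the alpha_engine_lib.rag API, and the stats keys are hypothetical rather than the six pinned keys.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence


@dataclass(frozen=True)
class Article:  # hypothetical stand-in for AggregatedNewsArticle
    fingerprint: str
    title: str
    body: str
    tickers: Sequence[str]
    source: str


def ingest_articles_sketch(
    articles: Sequence[Article],
    *,
    embed_texts_fn: Callable[[List[str]], List[List[float]]],
    document_exists_fn: Callable[[str, str], bool],
    ingest_document_fn: Callable[..., Optional[str]],
    dry_run: bool = False,
) -> Dict[str, int]:
    # Hypothetical counters; the real stats dict has six keys not listed in the PR.
    stats = {"seen": 0, "ingested": 0, "skipped_existing": 0,
             "would_ingest": 0, "failed": 0}
    for article in articles:
        text = f"{article.title}\n\n{article.body}".strip()
        for ticker in article.tickers:  # one document per (ticker, article) pair
            stats["seen"] += 1
            try:
                # Pre-check first so re-runs never pay for an embedding call.
                if document_exists_fn(ticker, article.fingerprint):
                    stats["skipped_existing"] += 1
                    continue
                if dry_run:
                    stats["would_ingest"] += 1
                    continue
                embedding = embed_texts_fn([text])[0]
                doc_id = ingest_document_fn(
                    ticker=ticker,
                    doc_key=article.fingerprint,
                    source=f"news_{article.source}",
                    text=text,
                    embedding=embedding,
                )
                # None is treated as the library's failure signal.
                stats["failed" if doc_id is None else "ingested"] += 1
            except Exception:
                # Isolate the failure to this document; the batch continues.
                stats["failed"] += 1
    return stats
```

Injecting the three callables keeps the loop testable with plain fakes, which is presumably why the real signature takes `embed_texts_fn`, `document_exists_fn`, and `ingest_document_fn` as parameters.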
Composition chain (full Wave 1 dataflow):
PolygonNewsAdapter + GdeltNewsAdapter + YahooRssNewsAdapter (PR β)
                                 │
                                 ▼
          NewsAggregator.fetch() - fan-in + dedup + trust
                                 │
                                 ▼
                    AggregatedNewsArticle list
                                 │
           ┌─────────────────────┼─────────────────────┐
           ▼                     ▼                     ▼
    NewsNLPPipeline     aggregate_and_write     ingest_articles
       (PR A.1)         (PR A.2 - parquet)      (PR A.3 - RAG)
           │                     │                     │
           ▼                     ▼                     ▼
   sentiment+events     per-ticker per-day       pgvector docs
   +entities streams    structured parquet     alongside filings
+18 unit tests:
- _rag_source prefix
- _chunk_text combines title + longest body
- _chunk_text handles missing title / all empty
- _canonical_source deterministic alphabetical / single variant
- Single-ticker happy path: embed + ingest called with right shape
- Idempotency: document_exists short-circuits embed AND ingest
- Empty / too-short bodies skipped (counter increments)
- Multi-ticker: emits one doc per ticker
- Multi-ticker per-ticker existence check (AAPL exists, MSFT new)
- Sector lookup: ticker_to_sector passed through; missing -> None
- dry_run mode skips embed/ingest but counts
- Failure isolation: one bad doc continues batch
- ingest returning None counts as failure not crash
- Stats dict shape pinned to 6 canonical keys
Suite: 929 passing (1 skipped).
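As an illustration of how a check like "document_exists short-circuits embed AND ingest" can be asserted with injected callables, here is a small pytest-style sketch. The `ingest_one` helper is a hypothetical stand-in for the per-document step and is not taken from tests/test_ingest_news.py.

```python
from unittest.mock import Mock


def ingest_one(ticker, doc_key, text, *, embed_texts_fn,
               document_exists_fn, ingest_document_fn):
    """Tiny hypothetical stand-in for the per-document ingest step."""
    if document_exists_fn(ticker, doc_key):
        return "skipped"
    vector = embed_texts_fn([text])[0]
    result = ingest_document_fn(ticker, doc_key, text, vector)
    return "ingested" if result is not None else "failed"


def test_existing_document_short_circuits_embed_and_ingest():
    embed, ingest = Mock(), Mock()
    exists = Mock(return_value=True)

    outcome = ingest_one(
        "AAPL", "f1", "text",
        embed_texts_fn=embed,
        document_exists_fn=exists,
        ingest_document_fn=ingest,
    )

    assert outcome == "skipped"
    exists.assert_called_once_with("AAPL", "f1")
    embed.assert_not_called()   # no vector-API cost on re-runs
    ingest.assert_not_called()
```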
Composes with:
- PR β (#226) — AggregatedNewsArticle input shape
- PR A.1 (#227) — NewsNLPOutput shape (joined to article fingerprints)
- PR A.2 (#228) — structured aggregates writer (parallel write path)
- PR α (lib v0.15.0) — NewsArticle base shape
- alpha_engine_lib.rag (embed_texts + document_exists + ingest_document)
- data-revamp-260513.md plan doc
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Wave 1 PR A.3 of the institutional data-revamp arc. Indexes aggregated news articles into the existing pgvector RAG corpus alongside SEC filings. Consumer agents (thesis_update, sector_quant/qual) retrieve relevant news at inference time via the same hybrid-retrieval API the qual analyst's query_filings tool already uses.
Completes the Wave 1 producer-side news substrate. Consumer-side tool wiring (PR E) and fetch_data integration (PR F) follow.
Composition chain (full Wave 1 dataflow): as diagrammed in the commit message above.
What's in
- New module rag/pipelines/ingest_news.py: ingest_articles(articles, *, filed_date, ticker_to_sector, embed_texts_fn, document_exists_fn, ingest_document_fn, dry_run) returning a stats dict
- Mirrors the ingest_8k_filings.py pattern (consistent RAG-pipeline shape across the repo)
- Idempotent via document_exists pre-check: re-runs skip the embedding call entirely (saves vector-API cost)
- Source labeling: RAG source field prefixed news_ (e.g. news_polygon, news_gdelt) so consumer queries can filter "news only" vs "filings only" by source-prefix without enumerating vendors
- Failure isolation: ingest_document returning None (lib's failure signal) counts as a failure without crashing
Test plan
- 18 unit tests (tests/test_ingest_news.py):
- _rag_source prefix / _chunk_text combines title + longest body / handles missing title / all-empty edge / _canonical_source deterministic alphabetical + single variant
- Idempotency: document_exists short-circuits embed AND ingest
- Sector lookup: ticker_to_sector passed through; missing → None
- dry_run mode skips embed/ingest but counts
- ingest_document returning None counts as failure, not crash
What's remaining in Wave 1
- PR E: consumer-side tool wiring (search_news, search_filings, etc.; wires existing corpus to thesis_update + sector agents)
- PR F: fetch_data integration (supersedes #170's per-ticker pre-fetch)
🤖 Generated with Claude Code