feat(rag): Wave 1 PR A.3 — news → RAG ingest pipeline by cipher813 · Pull Request #229 · cipher813/alpha-engine-data

cipher813 · 2026-05-13T18:12:09Z

Summary

Wave 1 PR A.3 of the institutional data-revamp arc. Indexes aggregated news articles into the existing pgvector RAG corpus alongside SEC filings. Consumer agents (thesis_update, sector_quant/qual) retrieve relevant news at inference time via the same hybrid-retrieval API the qual analyst's query_filings tool already uses.

Completes the Wave 1 producer-side news substrate. Consumer-side tool wiring (PR E) and fetch_data integration (PR F) follow.

Composition chain (full Wave 1 dataflow)

PolygonNewsAdapter + GdeltNewsAdapter + YahooRssNewsAdapter (PR β #226)
                              │
                              ▼
                  NewsAggregator.fetch() — fan-in + dedup + trust
                              │
                              ▼
                    AggregatedNewsArticle list
                              │
            ┌─────────────────┼─────────────────┐
            ▼                 ▼                 ▼
  NewsNLPPipeline       aggregate_and_write    ingest_articles
  (PR A.1 #227)         (PR A.2 #228)          (PR A.3 — THIS PR)
            │                 │                 │
            ▼                 ▼                 ▼
  sentiment+events       per-ticker per-day    pgvector docs
  +entities streams      structured parquet    alongside filings

What's in

New module rag/pipelines/ingest_news.py:

- ingest_articles(articles, *, filed_date, ticker_to_sector, embed_texts_fn, document_exists_fn, ingest_document_fn, dry_run) returning a stats dict
- Mirrors the canonical ingest_8k_filings.py pattern (consistent RAG-pipeline shape across the repo)
- One document per (ticker, article) pair: multi-ticker articles index once per ticker so the ticker-keyed RAG schema surfaces them when the qual agent queries by any constituent
- Idempotent via document_exists pre-check: re-runs skip the embedding call entirely (saves vector-API cost)
- Chunking: one chunk per article (title + longest body excerpt across variants). Polygon/GDELT/Yahoo bodies are short (<500 tokens); multi-chunk splitting can land in a follow-up if Benzinga or a full-body adapter joins
- Source labeling: RAG source field prefixed news_ (e.g. news_polygon, news_gdelt) so consumer queries can filter "news only" vs "filings only" by source-prefix without enumerating vendors
- Canonical source selection: alphabetical across variants so re-ingests produce the same document (deterministic across runs)
- Failure isolation: per-document failures (transient pgvector / embed-API hiccup) are isolated; the batch continues. ingest_document returning None (the lib's failure signal) counts as a failure without crashing
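
The control flow described above (one document per ticker, existence check before embedding, dry-run, per-document failure isolation) can be sketched roughly as below. This is a minimal illustration, not the code in rag/pipelines/ingest_news.py: the `Article` dataclass, the `MIN_CHARS` threshold, the keyword arguments passed to `ingest_document_fn`, the `(ticker, fingerprint)` signature of `document_exists_fn`, and the six stat key names are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Article:
    """Hypothetical stand-in for AggregatedNewsArticle; real fields may differ."""
    fingerprint: str           # assumed stable per-article identifier
    tickers: tuple[str, ...]
    source: str                # vendor name, e.g. "polygon"
    text: str                  # title and body, already combined


MIN_CHARS = 20  # assumed "too short" cutoff; the real threshold is not stated


def ingest_articles_sketch(
    articles: Sequence[Article],
    *,
    filed_date: date,
    ticker_to_sector: dict[str, str],
    embed_texts_fn: Callable[[list[str]], list[list[float]]],
    document_exists_fn: Callable[[str, str], bool],
    ingest_document_fn: Callable[..., Optional[str]],
    dry_run: bool = False,
) -> dict[str, int]:
    # Hypothetical key names: the PR pins six keys but does not list them.
    stats = {"seen": 0, "ingested": 0, "skipped_existing": 0,
             "skipped_short": 0, "failed": 0, "dry_run": 0}
    for article in articles:
        for ticker in article.tickers:  # one document per (ticker, article)
            stats["seen"] += 1
            if len(article.text.strip()) < MIN_CHARS:
                stats["skipped_short"] += 1
                continue
            try:
                # Existence check comes first, so re-runs never reach the embed call.
                if document_exists_fn(ticker, article.fingerprint):
                    stats["skipped_existing"] += 1
                    continue
                if dry_run:
                    stats["dry_run"] += 1
                    continue
                embedding = embed_texts_fn([article.text])[0]
                doc_id = ingest_document_fn(
                    ticker=ticker,
                    doc_key=article.fingerprint,
                    source=f"news_{article.source}",
                    sector=ticker_to_sector.get(ticker),  # missing ticker gives None
                    filed_date=filed_date,
                    text=article.text,
                    embedding=embedding,
                )
                # A None return is treated as the library's failure signal.
                stats["failed" if doc_id is None else "ingested"] += 1
            except Exception:
                # Isolate the failure to this document; the batch continues.
                stats["failed"] += 1
    return stats
```

Passing the three callables in as parameters is what lets the unit tests substitute in-memory fakes and assert that a second run over the same articles skips both the embed and the ingest calls.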

Test plan

+18 unit tests (tests/test_ingest_news.py):
- Helpers: _rag_source prefix / _chunk_text combines title+longest body / handles missing title / all-empty edge / _canonical_source deterministic alphabetical + single variant
- Single-ticker happy path: embed + ingest called with right shape
- Idempotency: document_exists short-circuits embed AND ingest
- Empty / too-short bodies skipped (counter increments)
- Multi-ticker: emits one doc per ticker
- Multi-ticker per-ticker existence check (one exists, one new)
- Sector lookup: ticker_to_sector passed through; missing → None
- dry_run mode skips embed/ingest but counts
- Failure isolation: one bad doc continues batch
- ingest_document returning None counts as failure not crash
- Stats dict shape pinned to 6 canonical keys
Full data suite: 929 passing (1 skipped) in 4s
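
For orientation, the helper behaviours the tests pin down could look something like the sketch below. The names deliberately drop the leading underscore to mark them as stand-ins: the real _rag_source, _chunk_text and _canonical_source signatures are not shown in this PR description, so the parameters and the blank-line separator here are assumptions.

```python
from typing import Iterable, Optional, Sequence


def rag_source(vendor: str) -> str:
    """Prefix the vendor name so consumers can filter on 'news_'."""
    return f"news_{vendor}"


def canonical_source(sources: Iterable[str]) -> Optional[str]:
    """Pick the alphabetically first source so re-ingests are deterministic."""
    cleaned = sorted(s for s in sources if s)
    return cleaned[0] if cleaned else None


def chunk_text(title: Optional[str], bodies: Sequence[Optional[str]]) -> str:
    """Combine the title with the longest body excerpt across variants."""
    longest = max((b.strip() for b in bodies if b), key=len, default="")
    parts = [p for p in ((title or "").strip(), longest) if p]
    # Assumed separator: a blank line between title and body.
    return "\n\n".join(parts)


print(rag_source("polygon"))                               # news_polygon
print(canonical_source(["yahoo_rss", "gdelt", "polygon"]))  # gdelt
print(repr(chunk_text(None, ["", None])))                  # ''
```

Sorting rather than taking the first-seen variant means the chosen source does not depend on the order in which adapters returned the article.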

What's remaining in Wave 1

- PR B: Filings substrate expansion (EDGAR full coverage: Form 4, 13F, 10-K/Q/14A/S-1/13D/G)
- PR C: Analyst substrate (yfinance + FMP adapters + self-derived revisions)
- PR D: Async + S3 cache + per-vendor rate limiters
- PR E: RAG retrieval tools in research (search_news, search_filings, etc.; wires existing corpus to thesis_update + sector agents)
- PR F: Wire substrate into research's fetch_data (supersedes the per-ticker pre-fetch from #170, "fix(infra): drop inline EB-SFN role IAM writes")

🤖 Generated with Claude Code

Wave 1 PR A.3 of the institutional data-revamp arc (plan doc: ~/Development/alpha-engine-docs/private/data-revamp-260513.md). Indexes aggregated news articles into the existing pgvector RAG corpus alongside SEC filings. Consumer agents (thesis_update, sector_quant/qual) retrieve relevant news at inference time via the same hybrid-retrieval API the qual analyst's query_filings tool already uses.

Pairs with PRs β/A.1/A.2; completes the Wave 1 producer-side news substrate. Consumer-side tool wiring (PR E) and fetch_data integration (PR F) follow.

New module: rag/pipelines/ingest_news.py

ingest_articles(articles, *, filed_date, ticker_to_sector, embed_texts_fn, document_exists_fn, ingest_document_fn, dry_run) -> stats dict

Architecture:
- Mirrors ingest_8k_filings.py pattern (canonical RAG-pipeline shape)
- One document per (ticker, article) pair: multi-ticker articles index once per ticker so the ticker-keyed RAG schema surfaces them when the qual agent queries by any constituent
- Idempotent via document_exists pre-check: re-runs skip the embedding call entirely (saves vector-API cost)
- Chunking: one chunk per article (title + longest body excerpt). Polygon/GDELT/Yahoo bodies are short (typically <500 tokens); multi-chunk splitting can land in a follow-up if Benzinga or a full-body adapter joins
- Source labeling: RAG `source` field prefixed `news_` so consumer queries can filter "news only" vs "filings only" by source-prefix without enumerating vendors

Canonical source selection: Picks alphabetically across variants so re-ingests produce the same document (deterministic across runs). Composes with the aggregator's source-provenance preservation in PR β.

Failure isolation: Per-document failures (transient pgvector / embed-API hiccup) isolated to that document; batch continues. ingest_document returning None (lib's failure signal) counts as a failure without crashing.

Composition chain (full Wave 1 dataflow):

PolygonNewsAdapter + GdeltNewsAdapter + YahooRssNewsAdapter (PR β)
                              │
                              ▼
                  NewsAggregator.fetch(): fan-in + dedup + trust
                              │
                              ▼
                    AggregatedNewsArticle list
                              │
            ┌─────────────────┼─────────────────┐
            ▼                 ▼                 ▼
  NewsNLPPipeline       aggregate_and_write    ingest_articles
  (PR A.1)              (PR A.2: parquet)      (PR A.3: RAG)
            │                 │                 │
            ▼                 ▼                 ▼
  sentiment+events       per-ticker per-day    pgvector docs
  +entities streams      structured parquet    alongside filings

+18 unit tests:
- _rag_source prefix
- _chunk_text combines title + longest body
- _chunk_text handles missing title / all empty
- _canonical_source deterministic alphabetical / single variant
- Single-ticker happy path: embed + ingest called with right shape
- Idempotency: document_exists short-circuits embed AND ingest
- Empty / too-short bodies skipped (counter increments)
- Multi-ticker: emits one doc per ticker
- Multi-ticker per-ticker existence check (AAPL exists, MSFT new)
- Sector lookup: ticker_to_sector passed through; missing -> None
- dry_run mode skips embed/ingest but counts
- Failure isolation: one bad doc continues batch
- ingest returning None counts as failure not crash
- Stats dict shape pinned to 6 canonical keys

Suite: 929 passing (1 skipped).

Composes with:
- PR β (#226): AggregatedNewsArticle input shape
- PR A.1 (#227): NewsNLPOutput shape (joined to article fingerprints)
- PR A.2 (#228): structured aggregates writer (parallel write path)
- PR α (lib v0.15.0): NewsArticle base shape
- alpha_engine_lib.rag (embed_texts + document_exists + ingest_document)
- data-revamp-260513.md plan doc

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cipher813 merged commit f3b10ae into main May 13, 2026
1 check passed

cipher813 deleted the feat/wave1-news-rag-ingest branch May 13, 2026 18:19
