feat(rag): Wave 1 PR A.3 — news → RAG ingest pipeline #229

Merged
cipher813 merged 1 commit into main from feat/wave1-news-rag-ingest on May 13, 2026

@cipher813
Owner

Summary

Wave 1 PR A.3 of the institutional data-revamp arc. Indexes aggregated news articles into the existing pgvector RAG corpus alongside SEC filings. Consumer agents (thesis_update, sector_quant/qual) retrieve relevant news at inference time via the same hybrid-retrieval API the qual analyst's query_filings tool already uses.

Completes the Wave 1 producer-side news substrate. Consumer-side tool wiring (PR E) and fetch_data integration (PR F) follow.

Composition chain (full Wave 1 dataflow)

PolygonNewsAdapter + GdeltNewsAdapter + YahooRssNewsAdapter (PR β #226)
                              │
                              ▼
                  NewsAggregator.fetch() — fan-in + dedup + trust
                              │
                              ▼
                    AggregatedNewsArticle list
                              │
            ┌─────────────────┼─────────────────┐
            ▼                 ▼                 ▼
  NewsNLPPipeline       aggregate_and_write    ingest_articles
  (PR A.1 #227)         (PR A.2 #228)          (PR A.3 — THIS PR)
            │                 │                 │
            ▼                 ▼                 ▼
  sentiment+events       per-ticker per-day    pgvector docs
  +entities streams      structured parquet    alongside filings

What's in

New module rag/pipelines/ingest_news.py:

  • ingest_articles(articles, *, filed_date, ticker_to_sector, embed_texts_fn, document_exists_fn, ingest_document_fn, dry_run) returning a stats dict
  • Mirrors the canonical ingest_8k_filings.py pattern (consistent RAG-pipeline shape across the repo)
  • One document per (ticker, article) pair — multi-ticker articles index once per ticker so the ticker-keyed RAG schema surfaces them when the qual agent queries by any constituent
  • Idempotent via document_exists pre-check — re-runs skip the embedding call entirely (saves vector-API cost)
  • Chunking: one chunk per article (title + longest body excerpt across variants). Polygon/GDELT/Yahoo bodies are short (typically <500 tokens); multi-chunk splitting can land in a follow-up if Benzinga or a full-body adapter joins
  • Source labeling: RAG source field prefixed news_ (e.g. news_polygon, news_gdelt) so consumer queries can filter "news only" vs "filings only" by source-prefix without enumerating vendors
  • Canonical source selection: alphabetical across variants so re-ingests produce the same document (deterministic across runs)
  • Failure isolation: per-document failures (transient pgvector / embed-API hiccup) isolated; batch continues. ingest_document returning None (lib's failure signal) counts as a failure without crashing
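The control flow described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in rag/pipelines/ingest_news.py: the `Article` dataclass, the `doc_key` format, the `MIN_BODY_CHARS` threshold, the keyword arguments passed to `ingest_document_fn`, and the six stats key names are all assumptions made for the example. Only the keyword-only signature of `ingest_articles` and the behaviours (one document per ticker, existence pre-check before embedding, `None` treated as failure, per-document isolation) come from this PR.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

MIN_BODY_CHARS = 20  # assumed threshold for "too short"; the real value is not stated


@dataclass
class Article:
    """Hypothetical stand-in for AggregatedNewsArticle; real fields may differ."""
    fingerprint: str
    title: str
    body: str
    tickers: Sequence[str]
    source: str  # canonical vendor name, e.g. "polygon"


def ingest_articles(
    articles: Sequence[Article],
    *,
    filed_date: str,
    ticker_to_sector: dict,
    embed_texts_fn: Callable[[list], list],
    document_exists_fn: Callable[[str], bool],
    ingest_document_fn: Callable[..., Optional[str]],
    dry_run: bool = False,
) -> dict:
    # Key names are illustrative; the PR only pins the dict to six keys.
    stats = {"articles": 0, "documents": 0, "ingested": 0,
             "skipped_existing": 0, "skipped_short": 0, "failed": 0}
    for article in articles:
        stats["articles"] += 1
        if len(article.body.strip()) < MIN_BODY_CHARS:
            stats["skipped_short"] += 1
            continue
        text = f"{article.title}\n\n{article.body}".strip()
        # One document per (ticker, article) pair.
        for ticker in article.tickers:
            stats["documents"] += 1
            doc_key = f"{ticker}:{article.fingerprint}"  # assumed key format
            try:
                # Existence check runs first, so re-runs never reach the embed call.
                if document_exists_fn(doc_key):
                    stats["skipped_existing"] += 1
                    continue
                if dry_run:
                    continue  # counted under "documents", nothing embedded or written
                embedding = embed_texts_fn([text])[0]
                doc_id = ingest_document_fn(
                    doc_key=doc_key,
                    ticker=ticker,
                    sector=ticker_to_sector.get(ticker),  # missing ticker gives None
                    source=f"news_{article.source}",
                    filed_date=filed_date,
                    text=text,
                    embedding=embedding,
                )
                if doc_id is None:  # None is treated as the failure signal
                    stats["failed"] += 1
                else:
                    stats["ingested"] += 1
            except Exception:
                # Isolate the failure to this document; the batch continues.
                stats["failed"] += 1
    return stats
```

Because the embed, existence-check, and ingest callables are passed in as arguments, a test can substitute plain Python fakes and assert on the returned stats without touching pgvector or an embedding API.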

Test plan

  • +18 unit tests (tests/test_ingest_news.py):
    • Helpers: _rag_source applies the news_ prefix; _chunk_text combines title + longest body, handles a missing title, and handles the all-empty edge case; _canonical_source is deterministic (alphabetical) and handles a single variant
    • Single-ticker happy path: embed + ingest called with right shape
    • Idempotency: document_exists short-circuits embed AND ingest
    • Empty / too-short bodies skipped (counter increments)
    • Multi-ticker: emits one doc per ticker
    • Multi-ticker per-ticker existence check (one exists, one new)
    • Sector lookup: ticker_to_sector passed through; missing → None
    • dry_run mode skips embed/ingest but counts
    • Failure isolation: one bad doc continues batch
    • ingest_document returning None counts as failure not crash
    • Stats dict shape pinned to 6 canonical keys
  • Full data suite: 929 passing (1 skipped) in 4s

What's remaining in Wave 1

  • PR B — Filings substrate expansion (EDGAR full coverage — Form 4, 13F, 10-K/Q/14A/S-1/13D/G)
  • PR C — Analyst substrate (yfinance + FMP adapters + self-derived revisions)
  • PR D — Async + S3 cache + per-vendor rate limiters
  • PR E — RAG retrieval tools in research (search_news, search_filings, etc. — wires existing corpus to thesis_update + sector agents)
  • PR F — Wire substrate into research's fetch_data (supersedes the per-ticker pre-fetch from #170, "fix(infra): drop inline EB-SFN role IAM writes")

🤖 Generated with Claude Code

Wave 1 PR A.3 of the institutional data-revamp arc (plan doc:
~/Development/alpha-engine-docs/private/data-revamp-260513.md).

Indexes aggregated news articles into the existing pgvector RAG
corpus alongside SEC filings. Consumer agents (thesis_update,
sector_quant/qual) retrieve relevant news at inference time via the
same hybrid-retrieval API the qual analyst's query_filings tool
already uses.

Pairs with PRs β/A.1/A.2 — completes the Wave 1 producer-side news
substrate. Consumer-side tool wiring (PR E) and fetch_data
integration (PR F) follow.

New module: rag/pipelines/ingest_news.py

  ingest_articles(articles, *, filed_date, ticker_to_sector,
                  embed_texts_fn, document_exists_fn,
                  ingest_document_fn, dry_run) -> stats dict

  Architecture:
    - Mirrors ingest_8k_filings.py pattern (canonical RAG-pipeline shape)
    - One document per (ticker, article) pair — multi-ticker articles
      index once per ticker so the ticker-keyed RAG schema surfaces
      them when the qual agent queries by any constituent
    - Idempotent via document_exists pre-check — re-runs skip the
      embedding call entirely (saves vector-API cost)
    - Chunking: one chunk per article (title + longest body excerpt).
      Polygon/GDELT/Yahoo bodies are short (typically <500 tokens);
      multi-chunk splitting can land in a follow-up if Benzinga or
      a full-body adapter joins
    - Source labeling: RAG `source` field prefixed `news_` so
      consumer queries can filter "news only" vs "filings only" by
      source-prefix without enumerating vendors

  Canonical source selection:
    Picks alphabetically across variants so re-ingests produce the
    same document (deterministic across runs). Composes with the
    aggregator's source-provenance preservation in PR β.

  Failure isolation:
    Per-document failures (transient pgvector / embed-API hiccup)
    isolated to that document; batch continues. ingest_document
    returning None (lib's failure signal) counts as a failure
    without crashing.

Composition chain (full Wave 1 dataflow):

  PolygonNewsAdapter + GdeltNewsAdapter + YahooRssNewsAdapter (PR β)
                              │
                              ▼
                  NewsAggregator.fetch() — fan-in + dedup + trust
                              │
                              ▼
                    AggregatedNewsArticle list
                              │
            ┌─────────────────┼─────────────────┐
            ▼                 ▼                 ▼
  NewsNLPPipeline       aggregate_and_write    ingest_articles
  (PR A.1)              (PR A.2 — parquet)     (PR A.3 — RAG)
            │                 │                 │
            ▼                 ▼                 ▼
  sentiment+events       per-ticker per-day    pgvector docs
  +entities streams      structured parquet    alongside filings

+18 unit tests:
  - _rag_source prefix
  - _chunk_text combines title + longest body
  - _chunk_text handles missing title / all empty
  - _canonical_source deterministic alphabetical / single variant
  - Single-ticker happy path: embed + ingest called with right shape
  - Idempotency: document_exists short-circuits embed AND ingest
  - Empty / too-short bodies skipped (counter increments)
  - Multi-ticker: emits one doc per ticker
  - Multi-ticker per-ticker existence check (AAPL exists, MSFT new)
  - Sector lookup: ticker_to_sector passed through; missing -> None
  - dry_run mode skips embed/ingest but counts
  - Failure isolation: one bad doc continues batch
  - ingest returning None counts as failure not crash
  - Stats dict shape pinned to 6 canonical keys

Suite: 929 passing (1 skipped).

Composes with:
  - PR β (#226) — AggregatedNewsArticle input shape
  - PR A.1 (#227) — NewsNLPOutput shape (joined to article fingerprints)
  - PR A.2 (#228) — structured aggregates writer (parallel write path)
  - PR α (lib v0.15.0) — NewsArticle base shape
  - alpha_engine_lib.rag (embed_texts + document_exists + ingest_document)
  - data-revamp-260513.md plan doc

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 merged commit f3b10ae into main on May 13, 2026
1 check passed
@cipher813 deleted the feat/wave1-news-rag-ingest branch on May 13, 2026 18:19