feat(nlp): Wave 1 PR A.1 — news NLP pipeline (Loughran-McDonald + LLM event extraction)#227
Merged
Merged
Conversation
… event extraction)
Wave 1 PR A.1 of the institutional data-revamp arc (plan doc:
~/Development/alpha-engine-docs/private/data-revamp-260513.md).
Producer-side NLP layer that consumes AggregatedNewsArticle output
from PR β's aggregator and emits three parallel structured streams:
sentiment scores, entity mentions, and event flags. Output is ready
for the PR A.2 parquet writer + PR A.3 RAG ingest pass.
Architecture: each NLP dimension is a Protocol with one or more
concrete implementations. The pipeline orchestrator (NewsNLPPipeline)
composes them without knowing which concrete classes are wired —
upgrade paths drop in as new adapter classes without touching the
orchestrator or downstream consumers.
New modules:
collectors/nlp/protocols.py
SentimentScore, EntityMention, EventFlag Pydantic shapes
(frozen + extra='forbid' so parquet writer has stable column
schema)
SentimentScorer, EntityExtractor, EventExtractor Protocols
(runtime_checkable; structural subtyping)
collectors/nlp/loughran_mcdonald.py
LoughranMcDonaldScorer — finance-domain dictionary sentiment,
the academic gold standard (Loughran & McDonald 2011). Composite
= (positive - negative) / total_tokens clipped to [-clip, +clip].
Uncertainty counted separately from polarity.
load_lm_master_dict() CSV parser — tolerates missing file (logs
warning, returns empty dict so production fails clearly).
collectors/nlp/event_extraction.py
AnthropicEventExtractor — Haiku-tier structured event extraction
via tool_use API. Closed taxonomy of 18 event categories
(DEFAULT_EVENT_CATEGORIES — earnings, M&A, IPO, management
change, regulatory action, FDA action, etc.). Cost-tracked under
agent_id='news_event_extractor'. Tolerates transient failures
(returns empty) + malformed entries (drops single entry, keeps
others). Handles tool_use.input as either dict or JSON string.
collectors/nlp/pipeline.py
NewsNLPPipeline(sentiment_scorers, entity_extractors,
event_extractors).process(articles) →
NewsNLPOutput. Composes any number of components per stream.
Per-component exceptions are isolated — one broken scorer can't
take down a batch. Article text uses canonical_title +
longest body_excerpt across variants.
scripts/download_lm_dict.py
Operator script. Fetches the canonical Loughran-McDonald 2022
Master Dictionary CSV from Notre Dame's research page. Idempotent
(overwrites). Pin via VCS commit if reproducibility matters. URL
is overridable when the source moves.
What's deferred to subsequent sub-PRs:
PR A.1.1 — FinBERT scorer (HF transformer; heavier deps; lives in
collectors/nlp/finbert.py as a drop-in)
PR A.1.2 — spaCy NER entity extractor
PR A.2 — Structured aggregates writer (S3 parquet per (ticker,
date) joining sentiment + entity + event streams)
PR A.3 — RAG ingest path (raw article text → chunked → embedded →
pgvector alongside SEC filings)
+42 unit tests:
- Pydantic shape construction + frozen + extra='forbid' (3 shapes)
- Protocol structural-subtyping (3 protocols, structural matches)
- Tokenization (alphabetic only, lowercase, drops digits)
- _truthy helper (year-stamp / 0 / blank / non-numeric)
- LM scorer: pure positive / pure negative / balanced /
dilution-by-neutral / uncertainty-counted-separately / empty-text /
clipped-to-range / empty-dict-warns-and-yields-zero
- load_lm_master_dict: canonical-CSV / missing-file-warns / blank-rows-skipped
- Anthropic event extractor: tool-spec-includes-default-categories /
happy-path / empty-text-skips-call / transient-LLM-failure /
malformed-entry-dropped / tool_use-input-as-JSON-string /
no-tool-use-block-returns-empty
- Pipeline: empty / sentiment-per-article / multiple-scorers /
scorer-exception-isolated / event-extractor-receives-tickers /
empty-article-text-skipped / uses-longest-excerpt-across-variants
Suite: 890 passing (1 skipped).
Composes with:
- PR α (alpha-engine-lib #46, v0.15.0) — NewsArticle Pydantic shape
- PR β (this repo #226) — AggregatedNewsArticle input
- data-revamp-260513.md plan doc — full 4-wave arc context
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813
added a commit
that referenced
this pull request
May 13, 2026
…-ticker per-day parquet) (#228) Wave 1 PR A.2 of the institutional data-revamp arc (plan doc: ~/Development/alpha-engine-docs/private/data-revamp-260513.md). Joins the 3 NLP-pipeline output streams (sentiment_scores, entity_mentions, event_flags) with the aggregator's source-provenance information and produces one row per (ticker, aggregate_date) in S3 parquet. This is the canonical structured signal that downstream consumers (research's fetch_data) read into input_data_snapshot for thesis_update + sector_quant + sector_qual agents. New module: data/derived/news_aggregates.py NewsTickerDailyAggregate dataclass (frozen) — canonical row shape with schema_version pinned to column. 19 columns covering: - identity: ticker, aggregate_date, schema_version - volume: n_articles, n_articles_trusted_weighted, n_articles_by_source_json - LM sentiment: lm_sentiment_mean/max/min/trusted_mean + LM word counts (positive, negative, uncertainty, total_tokens) - events: event_count, severity_max/mean, categories (sorted comma-joined), top_event_descriptions (top-N by severity desc, " | "-joined) - entities: entity_mentions_count build_news_aggregates_df() — pure aggregation, no I/O. Accepts NewsNLPOutput + AggregatedNewsArticle list + aggregate_date + optional NewsAggregator (for trust-weighted means). write_news_aggregates_parquet() / read_news_aggregates_parquet() — S3 parquet I/O. Idempotent overwrite at ``s3://alpha-engine-research/data/news_aggregates/{date}.parquet``. aggregate_and_write() — orchestrator-friendly end-to-end helper returning (key, df) so callers can log row counts / emit CW metrics without re-reading from S3. Aggregation semantics: - Multi-ticker article emits one row per ticker (each row sees the full article). - Event.tickers gates per-ticker event attachment; empty tickers list means "inherit from article" (extractor didn't gate). - Trust-weighted sentiment mean weights each article by max trust across its source variants. n_articles_trusted_weighted sums trust per VARIANT (so a 3-source dedup counts 3× its trust contribution). - Top-3 event descriptions chosen by severity desc; ties broken by appearance order. Stored as " | "-joined string for parquet column friendliness. Why one parquet per date (vs per ticker): - ~25 held + ~50 universe tickers = ~75 rows/day. Single parquet reads faster than 75 separate files. - Single S3 object per day = cheaper LIST, atomic overwrite. - Schema migration simpler — one file per day, not 75. What's deferred: PR A.3 — RAG ingest path (full article text → chunked → embedded → pgvector alongside SEC filings corpus) PR D — Async + S3 cache + per-vendor rate limiters PR F — Wire substrate into research's fetch_data (supersedes #170's per-ticker pre-fetch) +21 unit tests: - Per-ticker aggregation: one-article-one-ticker / multi-ticker article emits one row per / multiple-articles-per-ticker / empty-articles-produces-empty-df-with-schema - Trust weighting: trusted-mean / n_articles_trusted_weighted / source counts json per variant - Event aggregation: tickers-set filters to those tickers / no-tickers inherits from article / top-N sorted by severity desc / categories sorted+joined / severity_mean computed - S3 round-trip: write-read preserves rows / missing parquet returns empty schema df / overwrite / s3 key format / custom prefix - End-to-end aggregate_and_write helper + accepts datetime - Schema version pinned to int constant + on every row In-memory S3 mock used for round-trip tests (no moto dep added). Suite: 911 passing (1 skipped). Composes with: - PR β (#226) — AggregatedNewsArticle input shape - PR A.1 (#227) — NewsNLPOutput shape - PR α (lib v0.15.0) — NewsArticle base shape - data-revamp-260513.md plan doc Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813
added a commit
that referenced
this pull request
May 13, 2026
Wave 1 PR A.3 of the institutional data-revamp arc (plan doc:
~/Development/alpha-engine-docs/private/data-revamp-260513.md).
Indexes aggregated news articles into the existing pgvector RAG
corpus alongside SEC filings. Consumer agents (thesis_update,
sector_quant/qual) retrieve relevant news at inference time via the
same hybrid-retrieval API the qual analyst's query_filings tool
already uses.
Pairs with PRs β/A.1/A.2 — completes the Wave 1 producer-side news
substrate. Consumer-side tool wiring (PR E) and fetch_data
integration (PR F) follow.
New module: rag/pipelines/ingest_news.py
ingest_articles(articles, *, filed_date, ticker_to_sector,
embed_texts_fn, document_exists_fn,
ingest_document_fn, dry_run) -> stats dict
Architecture:
- Mirrors ingest_8k_filings.py pattern (canonical RAG-pipeline shape)
- One document per (ticker, article) pair — multi-ticker articles
index once per ticker so the ticker-keyed RAG schema surfaces
them when the qual agent queries by any constituent
- Idempotent via document_exists pre-check — re-runs skip the
embedding call entirely (saves vector-API cost)
- Chunking: one chunk per article (title + longest body excerpt).
Polygon/GDELT/Yahoo bodies are short (typically <500 tokens);
multi-chunk splitting can land in a follow-up if Benzinga or
a full-body adapter joins
- Source labeling: RAG `source` field prefixed `news_` so
consumer queries can filter "news only" vs "filings only" by
source-prefix without enumerating vendors
Canonical source selection:
Picks alphabetically across variants so re-ingests produce the
same document (deterministic across runs). Composes with the
aggregator's source-provenance preservation in PR β.
Failure isolation:
Per-document failures (transient pgvector / embed-API hiccup)
isolated to that document; batch continues. ingest_document
returning None (lib's failure signal) counts as a failure
without crashing.
Composition chain (full Wave 1 dataflow):
PolygonNewsAdapter + GdeltNewsAdapter + YahooRssNewsAdapter (PR β)
│
▼
NewsAggregator.fetch() — fan-in + dedup + trust
│
▼
AggregatedNewsArticle list
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
NewsNLPPipeline aggregate_and_write ingest_articles
(PR A.1) (PR A.2 — parquet) (PR A.3 — RAG)
│ │ │
▼ ▼ ▼
sentiment+events per-ticker per-day pgvector docs
+entities streams structured parquet alongside filings
+18 unit tests:
- _rag_source prefix
- _chunk_text combines title + longest body
- _chunk_text handles missing title / all empty
- _canonical_source deterministic alphabetical / single variant
- Single-ticker happy path: embed + ingest called with right shape
- Idempotency: document_exists short-circuits embed AND ingest
- Empty / too-short bodies skipped (counter increments)
- Multi-ticker: emits one doc per ticker
- Multi-ticker per-ticker existence check (AAPL exists, MSFT new)
- Sector lookup: ticker_to_sector passed through; missing -> None
- dry_run mode skips embed/ingest but counts
- Failure isolation: one bad doc continues batch
- ingest returning None counts as failure not crash
- Stats dict shape pinned to 6 canonical keys
Suite: 929 passing (1 skipped).
Composes with:
- PR β (#226) — AggregatedNewsArticle input shape
- PR A.1 (#227) — NewsNLPOutput shape (joined to article fingerprints)
- PR A.2 (#228) — structured aggregates writer (parallel write path)
- PR α (lib v0.15.0) — NewsArticle base shape
- alpha_engine_lib.rag (embed_texts + document_exists + ingest_document)
- data-revamp-260513.md plan doc
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Wave 1 PR A.1 of the institutional data-revamp arc (plan doc:
~/Development/alpha-engine-docs/private/data-revamp-260513.md).Producer-side NLP layer that consumes
AggregatedNewsArticleoutput from PR β's aggregator (#226 MERGED) and emits three parallel structured streams: sentiment scores, entity mentions, event flags. Output ready for PR A.2 parquet writer + PR A.3 RAG ingest pass.Architecture
Each NLP dimension is a Protocol with one or more concrete implementations. Pipeline orchestrator (
NewsNLPPipeline) composes them without knowing which concrete classes are wired — upgrade paths drop in as new adapter classes:SentimentScorerLoughranMcDonaldScorer(LM 2011 finance-domain dict)EntityExtractorEventExtractorAnthropicEventExtractor(Haiku tool_use)What's in
Pydantic shapes (
collectors/nlp/protocols.py)Frozen +
extra='forbid'so PR A.2 parquet writer has stable column schema:SentimentScore— scorer, composite [-1,+1], LM-style positive/negative/uncertainty/total token countsEntityMention— extractor, surface text, label, optional canonical_tickerEventFlag— extractor, category (closed taxonomy), description, tickers, severity [0,1]Loughran-McDonald scorer (
collectors/nlp/loughran_mcdonald.py)Finance-domain dictionary sentiment — academic gold standard (Loughran & McDonald 2011, Journal of Finance). Composite formula:
(positive - negative) / total_tokensclipped to [-clip, +clip]. Uncertainty counted separately from polarity. CSV parser (load_lm_master_dict) tolerates missing file (logs warning, returns empty dict so production fails clearly rather than silently).LLM event extractor (
collectors/nlp/event_extraction.py)Haiku-tier structured event extraction via Anthropic
tool_use. Closed taxonomy of 18 event categories:earnings_release,earnings_guidance,merger_or_acquisition,ipo_or_secondary,spinoff_or_divestiture,management_change,board_change,buyback_or_dividend,regulatory_action,fda_action,product_launch,partnership_or_contract,credit_rating_change,analyst_action,insider_transaction,macro_or_sector,operational_disruption,other. Tolerates transient failures + malformed entries + JSON-string-vs-dict tool input.Pipeline orchestrator (
collectors/nlp/pipeline.py)NewsNLPPipeline(sentiment_scorers, entity_extractors, event_extractors).process(articles)→NewsNLPOutput. Composes any number of components per stream. Per-component exceptions are isolated — one broken scorer can't take down a batch. Article text uses canonical_title + longest body_excerpt across variants.Operator script (
scripts/download_lm_dict.py)Fetches canonical LM 2022 Master Dictionary CSV from Notre Dame's research page. Idempotent (overwrites). Free + academic license.
What's deferred
collectors/nlp/finbert.pyas a drop-in)Test plan
_truthyhelper, LM scorer (pure positive / pure negative / balanced / dilution / uncertainty / empty-text / clipped / empty-dict-warns),load_lm_master_dict(canonical CSV / missing file / blank rows), Anthropic event extractor (tool spec / happy path / empty text skips call / transient failure / malformed entry dropped / JSON-string input / no-tool-use-block), pipeline composition (empty / per-article / multi-scorer / exception isolation / ticker passthrough / longest-excerpt selection).🤖 Generated with Claude Code