feat(nlp): replace AnthropicEventExtractor with RuleBasedEventExtractor #310
Merged
Applies the standing rule per ``[[preference_llm_calls_confined_to_research_module]]``
— LLM calls live in alpha-engine-research. The news pipeline's
Haiku-backed event extractor is removed and replaced with a deterministic
classifier that uses two zero-cost signals already on the wire:
1. **Vendor tags** (``NewsArticle.tags``). Polygon emits keywords,
GDELT emits structured event codes, Benzinga emits Channels. The
``alpha_engine_lib.sources.protocols.NewsArticle.tags`` docstring
explicitly names this as "a soft signal for downstream event-flag
extraction" — we were paying Haiku to re-derive what Polygon /
GDELT already tagged.
2. **Title-keyword regex**. Backstop for sources that don't populate
tags (Yahoo RSS). 17 pattern → category mappings against the
``DEFAULT_EVENT_CATEGORIES`` closed taxonomy.
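The two signals above could be combined roughly as follows. This is a minimal sketch, not the shipped classifier: the category names, the ``TAG_TO_CATEGORY`` map, the three regex patterns, and the ``classify`` helper are all hypothetical stand-ins for the 17 mappings and the real ``DEFAULT_EVENT_CATEGORIES``.

```python
from __future__ import annotations

import re

# Hypothetical subset of the closed taxonomy; the real
# DEFAULT_EVENT_CATEGORIES may use different names and ordering.
CATEGORIES = ("earnings", "m_and_a", "fda")

# Hypothetical vendor-tag lookup (lower-cased tag -> category).
TAG_TO_CATEGORY = {
    "earnings": "earnings",
    "m&a": "m_and_a",
    "fda": "fda",
}

# Hypothetical title patterns, illustrative only.
TITLE_PATTERNS = (
    (re.compile(r"\b(earnings|quarterly results|eps)\b", re.I), "earnings"),
    (re.compile(r"\b(acquires?|acquisition|merger|buyout)\b", re.I), "m_and_a"),
    (re.compile(r"\bfda\b", re.I), "fda"),
)


def classify(title: str, article_tags: tuple[str, ...] = ()) -> list[str]:
    """Return matched categories in taxonomy order (tag + title union)."""
    if not title.strip():
        return []  # empty-text short-circuit
    matched = {
        TAG_TO_CATEGORY[tag.lower()]
        for tag in article_tags
        if tag.lower() in TAG_TO_CATEGORY
    }
    matched.update(cat for pattern, cat in TITLE_PATTERNS if pattern.search(title))
    # Iterating over the taxonomy keeps the output order deterministic.
    return [cat for cat in CATEGORIES if cat in matched]
```

Ordering the result by the taxonomy rather than by match order is what would make repeated runs over the same article produce identical output.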
**Why this is the right answer, not a kill-switch:** code audit found
the Haiku per-article structured output was aggregated to 5 scalar /
list columns (``event_count``, ``event_severity_max/mean``,
``event_categories``, ``top_event_descriptions``) before any research
consumer touched it. The "zero-shot novel-event detection" capability
was mostly wasted — research only sees per-ticker rollups. Tag-based +
keyword-based classification produces equivalent rollups
deterministically.
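To illustrate why the per-article detail is lost before research sees it, here is a hedged sketch of a per-ticker rollup. The ``Flag`` dataclass, its field names, the ``rollup`` helper, the ``top_n`` cutoff, and the empty-case values are assumptions; only the five column names come from this PR.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Flag:
    """Hypothetical stand-in for EventFlag; field names are assumptions."""
    category: str
    severity: float
    description: str


def rollup(flags: list[Flag], top_n: int = 3) -> dict:
    """Collapse one ticker's per-article flags into five aggregate columns."""
    if not flags:
        return {
            "event_count": 0,
            "event_severity_max": 0.0,
            "event_severity_mean": 0.0,
            "event_categories": [],
            "top_event_descriptions": [],
        }
    # sorted() is stable, so equal-severity flags keep their input order.
    ranked = sorted(flags, key=lambda f: f.severity, reverse=True)
    return {
        "event_count": len(flags),
        "event_severity_max": max(f.severity for f in flags),
        "event_severity_mean": mean(f.severity for f in flags),
        "event_categories": sorted({f.category for f in flags}),
        "top_event_descriptions": [f.description for f in ranked[:top_n]],
    }
```

Any extractor that emits a category, a severity, and a description per flag feeds this shape equally well, whichever model or rule set produced the flags.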
**Cost impact:** retires the largest previously-untracked LLM cost
slice in the system per the original Phase 0 audit estimate
($20–60/mo). Actual spend on the deleted call site goes to $0; the
research consumer sees identical EventFlag shape (extractor slug
changes from ``"anthropic_haiku"`` to ``"rule_based"``) and identical
aggregate columns in ``news_aggregates/{date}.parquet``.
**Substrate cleanup:** retires three files added earlier this session:
- ``collectors/nlp/event_extraction.py`` (the Anthropic extractor itself)
- ``rag/pipelines/_cost_telemetry.py`` (Phase 0.2 cost-telemetry buffer,
PR #308 + Phase 4 #1 runaway-cost breaker, PR #309 — both retired
with the LLM call site they instrumented)
- ``tests/test_news_cost_telemetry.py`` (mirrored tests)
``DEFAULT_EVENT_CATEGORIES`` moves into the new
``collectors/nlp/rule_based_event_extraction.py`` so the closed
taxonomy stays accessible to downstream consumers.
**Protocol contract:** ``EventExtractor.extract`` gains an optional
``article_tags: tuple[str, ...] = ()`` kwarg (back-compat default).
The pipeline plumbs the tag union across article variants. Any future
EventExtractor implementation (FinBERT, spaCy, reactivated LLM via
research module) consumes the same shape.
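A sketch of what the extended contract might look like. Only the ``article_tags: tuple[str, ...] = ()`` kwarg is taken from this PR; the ``text`` parameter, the ``EventFlag`` fields, the ``TagOnlyExtractor`` toy class, and the ``union_tags`` helper are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class EventFlag:
    """Hypothetical shape; the real EventFlag fields may differ."""
    category: str
    description: str
    severity: float = 0.5
    extractor: str = "rule_based"


class EventExtractor(Protocol):
    def extract(
        self, text: str, *, article_tags: tuple[str, ...] = ()
    ) -> list[EventFlag]:
        ...


class TagOnlyExtractor:
    """Toy implementation that satisfies the protocol structurally."""

    def extract(
        self, text: str, *, article_tags: tuple[str, ...] = ()
    ) -> list[EventFlag]:
        if not text:
            return []
        return [EventFlag(category=t.lower(), description=text) for t in article_tags]


def union_tags(variants: list[tuple[str, ...]]) -> tuple[str, ...]:
    """Order-preserving union of tags across article variants."""
    return tuple(dict.fromkeys(tag for tags in variants for tag in tags))
```

Because the kwarg defaults to an empty tuple, callers written against the old signature keep working unchanged.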
**Severity convention:** rule-based flags emit ``severity=0.5``
uniformly (the EventFlag protocol's documented default). The Haiku
severity was a free-floating judgment never tuned by any operator
alert. Per-category severity tuning can be added via YAML if a
downstream surface needs it.
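If per-category tuning is ever needed, it could be as small as a lookup with a fallback. In this sketch the ``SEVERITY_OVERRIDES`` dict stands in for a parsed YAML mapping; its keys, its values, and the ``severity_for`` helper are hypothetical and not part of this PR.

```python
from __future__ import annotations

# Uniform default emitted by the rule-based extractor.
DEFAULT_SEVERITY = 0.5

# Stand-in for a parsed YAML mapping; keys and values are hypothetical.
SEVERITY_OVERRIDES = {"fda": 0.8, "m_and_a": 0.7}


def severity_for(category: str, overrides: dict[str, float] | None = None) -> float:
    """Return the configured severity for a category, else the default."""
    if not overrides:
        return DEFAULT_SEVERITY
    return float(overrides.get(category, DEFAULT_SEVERITY))
```

With no overrides configured, every category resolves to 0.5, which matches the current behavior.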
**Tests:** ``TestRuleBasedEventExtractor`` (10 tests) covers
empty-text short-circuit, no-match returns empty, title-keyword
classification per category (earnings / M&A / FDA), tag-based
classification (Polygon/GDELT shape), tag+title union, multi-category
emission, deterministic ordering per ``DEFAULT_EVENT_CATEGORIES``,
zero-LLM-dependency contract, title-as-description shape. Suite
1493 → 1479 net (retired the 9 cost-telemetry tests + 7 Anthropic
extractor tests; added 10 rule-based tests).
**Composes with:**
- ``[[preference_llm_calls_confined_to_research_module]]`` — the
rule this PR enforces
- alpha-engine #212 (executor EOD narrative kill switch) — sibling
application of the same rule. There are two non-research LLM call
sites: this PR retires the data one entirely, while the executor keeps
its kill-switch substrate (default off) because an operator may
reactivate the LLM path.
- Retires the substrate from data #308 + data #309 (Phase 0.2 + Phase
4 #1 cost-telemetry buffer + breaker) — both became dead code with
the LLM call site they instrumented.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Applies the standing rule per `[[preference_llm_calls_confined_to_research_module]]`: LLM calls live in research; data uses existing metadata + rule-based classifiers. The news pipeline's Haiku-backed event extractor is removed and replaced with a deterministic classifier using two zero-cost signals already on the wire: vendor tags (`NewsArticle.tags`) and a title-keyword regex backstop for sources that don't populate tags.
Why this is the right answer, not a kill-switch
Code audit found the Haiku per-article structured output was aggregated to 5 scalar/list columns (`event_count`, `event_severity_max/mean`, `event_categories`, `top_event_descriptions`) before any research consumer touched it. The "zero-shot novel-event detection" capability was mostly wasted — research only sees per-ticker rollups. Tag-based + keyword-based classification produces equivalent rollups deterministically.
Cost impact
Retires the largest previously-untracked LLM cost slice per the Phase 0 audit estimate ($20–60/mo → $0). The research consumer sees identical EventFlag shape (extractor slug changes from `anthropic_haiku` to `rule_based`) and identical aggregate columns in `news_aggregates/{date}.parquet`.
Substrate cleanup
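Retires three files added earlier this session, all dead code now that the LLM call site is gone:
- `collectors/nlp/event_extraction.py` (the Anthropic extractor itself)
- `rag/pipelines/_cost_telemetry.py` (cost-telemetry buffer from PR #308 and runaway-cost breaker from PR #309)
- `tests/test_news_cost_telemetry.py` (mirrored tests)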
`DEFAULT_EVENT_CATEGORIES` moves into the new `rule_based_event_extraction.py` so the closed taxonomy stays accessible.
Protocol contract
`EventExtractor.extract` gains an optional `article_tags: tuple[str, ...] = ()` kwarg (back-compat default). Pipeline plumbs the tag union across article variants. Any future EventExtractor (FinBERT, spaCy, reactivated LLM via research module) consumes the same shape.
Severity convention
Rule-based flags emit `severity=0.5` uniformly (the EventFlag protocol's documented default). Haiku severity was a free-floating judgment never tuned by any operator alert. Per-category severity can be added via YAML if a downstream surface needs it.
🤖 Generated with Claude Code