feat(nlp): replace AnthropicEventExtractor with RuleBasedEventExtractor #310

Merged
cipher813 merged 1 commit into main from feat/rule-based-event-extraction, May 25, 2026

@cipher813 (Owner)

Summary

Applies the standing rule per `[[preference_llm_calls_confined_to_research_module]]` — LLM calls live in research; data uses existing metadata + rule-based classifiers. The news pipeline's Haiku-backed event extractor is removed and replaced with a deterministic classifier using two zero-cost signals already on the wire:

  1. Vendor tags (`NewsArticle.tags`) — Polygon emits keywords, GDELT emits structured event codes, Benzinga emits Channels. We were paying Haiku to re-derive what Polygon / GDELT already tagged.
  2. Title-keyword regex — backstop for sources that don't populate tags (Yahoo RSS). 17 pattern → category mappings against the closed `DEFAULT_EVENT_CATEGORIES` taxonomy.
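
The two signals above can be combined roughly as follows. This is a minimal sketch, not the merged implementation: the three-category taxonomy, the regex patterns, the tag lookup table, and the `classify` helper are all illustrative assumptions (the PR ships 17 patterns against the full `DEFAULT_EVENT_CATEGORIES`).

```python
import re

# Hypothetical subset of the closed taxonomy; the real DEFAULT_EVENT_CATEGORIES
# lives in rule_based_event_extraction.py and is not reproduced here.
EVENT_CATEGORIES = ("earnings", "m_and_a", "fda")

# Illustrative title patterns (assumed, not copied from the PR).
TITLE_PATTERNS = {
    "earnings": re.compile(r"\b(earnings|quarterly results|eps)\b", re.IGNORECASE),
    "m_and_a": re.compile(r"\b(acquires?|acquisition|merger|takeover)\b", re.IGNORECASE),
    "fda": re.compile(r"\bfda\b", re.IGNORECASE),
}

# Illustrative vendor-tag lookup, keyed by lower-cased tag (assumed values).
TAG_TO_CATEGORY = {
    "earnings": "earnings",
    "m&a": "m_and_a",
    "fda approval": "fda",
}


def classify(title: str, article_tags: tuple[str, ...] = ()) -> list[str]:
    """Return matched categories in taxonomy order, unioning tag and title hits."""
    if not title.strip() and not article_tags:
        return []  # nothing to classify
    hits = {
        TAG_TO_CATEGORY[tag.lower()]
        for tag in article_tags
        if tag.lower() in TAG_TO_CATEGORY
    }
    hits |= {cat for cat, pattern in TITLE_PATTERNS.items() if pattern.search(title)}
    # Iterating the taxonomy (not the set) keeps the output order deterministic.
    return [cat for cat in EVENT_CATEGORIES if cat in hits]
```

Emitting categories by walking the taxonomy tuple rather than the hit set is one simple way to get stable ordering regardless of which signal matched first.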

Why this is the right answer, not a kill-switch

Code audit found the Haiku per-article structured output was aggregated to 5 scalar/list columns (`event_count`, `event_severity_max/mean`, `event_categories`, `top_event_descriptions`) before any research consumer touched it. The "zero-shot novel-event detection" capability was mostly wasted — research only sees per-ticker rollups. Tag-based + keyword-based classification produces equivalent rollups deterministically.
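
To make the rollup argument concrete, here is a hedged sketch of a per-ticker aggregation into the five columns named above. The `EventFlag` field names, the `rollup` helper, the `top_n` cutoff, and the split of `event_severity_max/mean` into two keys are assumptions for illustration; the actual aggregation code is not shown in this PR.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class EventFlag:
    # Hypothetical fields; the real EventFlag shape is defined elsewhere in the repo.
    ticker: str
    category: str
    severity: float
    description: str


def rollup(flags: list[EventFlag], top_n: int = 3) -> dict[str, dict]:
    """Collapse per-article flags into one row of aggregate columns per ticker."""
    by_ticker: dict[str, list[EventFlag]] = {}
    for flag in flags:
        by_ticker.setdefault(flag.ticker, []).append(flag)

    rows: dict[str, dict] = {}
    for ticker, group in by_ticker.items():
        severities = [f.severity for f in group]
        # Stable sort: flags with equal severity keep their input order.
        ranked = sorted(group, key=lambda f: f.severity, reverse=True)
        rows[ticker] = {
            "event_count": len(group),
            "event_severity_max": max(severities),
            "event_severity_mean": mean(severities),
            "event_categories": sorted({f.category for f in group}),
            "top_event_descriptions": [f.description for f in ranked[:top_n]],
        }
    return rows
```

Because only these aggregates reach the consumer, any extractor that yields the same categories per article produces the same row, whether the per-article output came from an LLM or from rules.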

Cost impact

Retires the largest previously-untracked LLM cost slice per the Phase 0 audit estimate ($20–60/mo → $0). The research consumer sees identical EventFlag shape (extractor slug changes from `anthropic_haiku` to `rule_based`) and identical aggregate columns in `news_aggregates/{date}.parquet`.

Substrate cleanup

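Retires three files added earlier this session, all dead code now that the LLM call site is gone:

  • `collectors/nlp/event_extraction.py` (the Anthropic extractor itself)
  • `rag/pipelines/_cost_telemetry.py` (Phase 0.2 cost-telemetry buffer, PR #308, plus the Phase 4 #1 runaway-cost breaker, PR #309)
  • `tests/test_news_cost_telemetry.py` (mirrored tests)
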
`DEFAULT_EVENT_CATEGORIES` moves into the new `rule_based_event_extraction.py` so the closed taxonomy stays accessible.

Protocol contract

`EventExtractor.extract` gains an optional `article_tags: tuple[str, ...] = ()` kwarg (back-compat default). Pipeline plumbs the tag union across article variants. Any future EventExtractor (FinBERT, spaCy, reactivated LLM via research module) consumes the same shape.
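
A sketch of what this contract could look like, assuming a `typing.Protocol`. Only the `article_tags: tuple[str, ...] = ()` kwarg comes from this PR; the `text` parameter, the list-of-dict return type, the `EchoExtractor` toy class, and the `union_tags` helper are hypothetical stand-ins.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class EventExtractor(Protocol):
    # Only `article_tags` and its default are taken from the PR text;
    # the rest of this signature is a placeholder.
    def extract(self, text: str, article_tags: tuple[str, ...] = ()) -> list[dict]:
        ...


class EchoExtractor:
    """Toy implementation that reports which tags it was handed."""

    def extract(self, text: str, article_tags: tuple[str, ...] = ()) -> list[dict]:
        if not text:
            return []
        return [{"extractor": "echo", "description": text, "tags": article_tags}]


def union_tags(variants: list[tuple[str, ...]]) -> tuple[str, ...]:
    """Merge tags from several variants of one article, first-seen order, no duplicates."""
    seen: dict[str, None] = {}
    for tags in variants:
        for tag in tags:
            seen.setdefault(tag, None)
    return tuple(seen)
```

Callers written before the change keep working because the default is an empty tuple: `EchoExtractor().extract("some title")` and `EchoExtractor().extract("some title", article_tags=union_tags([...]))` are both valid.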

Severity convention

Rule-based flags emit `severity=0.5` uniformly (the EventFlag protocol's documented default). Haiku severity was a free-floating judgment never tuned by any operator alert. Per-category severity can be added via YAML if a downstream surface needs it.
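
If per-category severity is ever needed, one possible shape is a plain mapping (for example, parsed from a YAML file) consulted with the uniform value as fallback. The `severity_for` helper, the override mapping, and the [0, 1] bound are assumptions; the PR only fixes the 0.5 default.

```python
from typing import Optional

DEFAULT_SEVERITY = 0.5  # the uniform value named in this PR


def severity_for(category: str, overrides: Optional[dict[str, float]] = None) -> float:
    """Look up a per-category severity, falling back to the uniform default."""
    value = (overrides or {}).get(category, DEFAULT_SEVERITY)
    # Assumption: severities are bounded to [0, 1]; the PR does not state a range.
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"severity for {category!r} out of range: {value}")
    return value
```

With no overrides supplied, every category resolves to 0.5, which matches the behaviour described above.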

Test plan

  • `TestRuleBasedEventExtractor` (10 tests) — empty-text short-circuit, no-match, per-category title classification (earnings / M&A / FDA), tag-based classification, tag+title union, multi-category, deterministic ordering, zero-LLM contract, description shape
  • Suite 1493 → 1479 net (retired 9 cost-telemetry + 7 Anthropic extractor tests; added 10 rule-based tests)
  • Sat 5/30 SF: RAGIngestion fires, `news_aggregates/{date}.parquet` shows non-empty event_categories from rule-based path, `event_count` > 0 for tickers with vendor tags
  • Research's substrate snapshot reads the new shape cleanly (no consumer-side changes required)

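Composes with

  • `[[preference_llm_calls_confined_to_research_module]]`: the rule this PR enforces
  • alpha-engine #212 (executor EOD narrative kill switch): sibling application of the same rule. Of the two non-research LLM call sites, this PR retires data's entirely; executor's keeps the kill-switch substrate (default off) since the LLM path may be operator-reactivated.
  • Retires the substrate from data #308 + data #309 (Phase 0.2 + Phase 4 #1 cost-telemetry buffer + breaker), both of which became dead code with the LLM call site they instrumented.
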
🤖 Generated with Claude Code

Applies the standing rule per ``[[preference_llm_calls_confined_to_research_module]]``
— LLM calls live in alpha-engine-research. The news pipeline's
Haiku-backed event extractor is removed and replaced with a deterministic
classifier that uses two zero-cost signals already on the wire:

1. **Vendor tags** (``NewsArticle.tags``). Polygon emits keywords,
   GDELT emits structured event codes, Benzinga emits Channels. The
   ``alpha_engine_lib.sources.protocols.NewsArticle.tags`` docstring
   explicitly names this as "a soft signal for downstream event-flag
   extraction" — we were paying Haiku to re-derive what Polygon /
   GDELT already tagged.
2. **Title-keyword regex**. Backstop for sources that don't populate
   tags (Yahoo RSS). 17 pattern → category mappings against the
   ``DEFAULT_EVENT_CATEGORIES`` closed taxonomy.

**Why this is the right answer, not a kill-switch:** code audit found
the Haiku per-article structured output was aggregated to 5 scalar /
list columns (``event_count``, ``event_severity_max/mean``,
``event_categories``, ``top_event_descriptions``) before any research
consumer touched it. The "zero-shot novel-event detection" capability
was mostly wasted — research only sees per-ticker rollups. Tag-based +
keyword-based classification produces equivalent rollups
deterministically.

**Cost impact:** retires the largest previously-untracked LLM cost
slice in the system per the original Phase 0 audit estimate
($20–60/mo). Actual spend on the deleted call site goes to $0; the
research consumer sees identical EventFlag shape (extractor slug
changes from ``"anthropic_haiku"`` to ``"rule_based"``) and identical
aggregate columns in ``news_aggregates/{date}.parquet``.

**Substrate cleanup:** retires three files added earlier this session:
- ``collectors/nlp/event_extraction.py`` (the Anthropic extractor itself)
- ``rag/pipelines/_cost_telemetry.py`` (Phase 0.2 cost-telemetry buffer,
  PR #308 + Phase 4 #1 runaway-cost breaker, PR #309 — both retired
  with the LLM call site they instrumented)
- ``tests/test_news_cost_telemetry.py`` (mirrored tests)

``DEFAULT_EVENT_CATEGORIES`` moves into the new
``collectors/nlp/rule_based_event_extraction.py`` so the closed
taxonomy stays accessible to downstream consumers.

**Protocol contract:** ``EventExtractor.extract`` gains an optional
``article_tags: tuple[str, ...] = ()`` kwarg (back-compat default).
The pipeline plumbs the tag union across article variants. Any future
EventExtractor implementation (FinBERT, spaCy, reactivated LLM via
research module) consumes the same shape.

**Severity convention:** rule-based flags emit ``severity=0.5``
uniformly (the EventFlag protocol's documented default). The Haiku
severity was a free-floating judgment never tuned by any operator
alert. Per-category severity tuning can be added via YAML if a
downstream surface needs it.

**Tests:** ``TestRuleBasedEventExtractor`` (10 tests) covers
empty-text short-circuit, no-match returns empty, title-keyword
classification per category (earnings / M&A / FDA), tag-based
classification (Polygon/GDELT shape), tag+title union, multi-category
emission, deterministic ordering per ``DEFAULT_EVENT_CATEGORIES``,
zero-LLM-dependency contract, title-as-description shape. Suite
1493 → 1479 net (retired the 9 cost-telemetry tests + 7 Anthropic
extractor tests; added 10 rule-based tests).

**Composes with:**
- ``[[preference_llm_calls_confined_to_research_module]]`` — the
  rule this PR enforces
- alpha-engine #212 (executor EOD narrative kill switch) — sibling
  application of the same rule. Two non-research LLM call sites; this
  PR retires data's entirely, executor's keeps the kill switch
  substrate (default off) since the LLM path may be operator-reactivated.
- Retires the substrate from data #308 + data #309 (Phase 0.2 + Phase
  4 #1 cost-telemetry buffer + breaker) — both became dead code with
  the LLM call site they instrumented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit 7a48427 into main May 25, 2026
1 check passed
@cipher813 cipher813 deleted the feat/rule-based-event-extraction branch May 25, 2026 18:55