feat(nlp): replace AnthropicEventExtractor with RuleBasedEventExtractor by cipher813 · Pull Request #310 · cipher813/alpha-engine-data

cipher813 · 2026-05-25T18:52:31Z

Summary

Applies the standing rule per `[[preference_llm_calls_confined_to_research_module]]` — LLM calls live in research; data uses existing metadata + rule-based classifiers. The news pipeline's Haiku-backed event extractor is removed and replaced with a deterministic classifier using two zero-cost signals already on the wire:

Vendor tags (`NewsArticle.tags`) — Polygon emits keywords, GDELT emits structured event codes, Benzinga emits Channels. We were paying Haiku to re-derive what Polygon / GDELT already tagged.
Title-keyword regex — backstop for sources that don't populate tags (Yahoo RSS). 17 pattern → category mappings against the closed `DEFAULT_EVENT_CATEGORIES` taxonomy.
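
A minimal sketch of how the two signals might combine. The three-entry `CATEGORIES` tuple is a stand-in for `DEFAULT_EVENT_CATEGORIES`, and `classify`, `TAG_TO_CATEGORY`, and `TITLE_PATTERNS` are illustrative names with made-up contents; the real taxonomy and the 17 shipped patterns are not reproduced here.

```python
import re

# Hypothetical stand-in for the closed taxonomy; the real
# DEFAULT_EVENT_CATEGORIES lives in rule_based_event_extraction.py.
CATEGORIES = ("earnings", "m_and_a", "fda")

# Assumed vendor-tag -> category lookup, keyed on lower-cased tags.
TAG_TO_CATEGORY = {
    "earnings": "earnings",
    "mergers and acquisitions": "m_and_a",
    "fda": "fda",
}

# Assumed title-keyword patterns (illustrative, not the shipped set).
TITLE_PATTERNS = (
    (re.compile(r"\b(earnings|quarterly results|eps)\b", re.I), "earnings"),
    (re.compile(r"\b(acquires?|acquisition|merger)\b", re.I), "m_and_a"),
    (re.compile(r"\bfda\b", re.I), "fda"),
)


def classify(title: str, article_tags: tuple[str, ...] = ()) -> list[str]:
    """Return matched categories in taxonomy order (tag hits unioned with title hits)."""
    if not title and not article_tags:
        return []  # empty-input short-circuit
    hits = {
        TAG_TO_CATEGORY[tag.lower()]
        for tag in article_tags
        if tag.lower() in TAG_TO_CATEGORY
    }
    hits |= {cat for pattern, cat in TITLE_PATTERNS if pattern.search(title)}
    # Iterating the taxonomy (not the set) keeps the output order deterministic.
    return [cat for cat in CATEGORIES if cat in hits]
```

Because the result is built by walking the taxonomy rather than the hit set, the same article always yields the same category order regardless of which signal matched first.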

Why this is the right answer, not a kill-switch

Code audit found the Haiku per-article structured output was aggregated to 5 scalar/list columns (`event_count`, `event_severity_max/mean`, `event_categories`, `top_event_descriptions`) before any research consumer touched it. The "zero-shot novel-event detection" capability was mostly wasted — research only sees per-ticker rollups. Tag-based + keyword-based classification produces equivalent rollups deterministically.
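
The rollup described above can be sketched roughly as follows. Only the five output column names come from this PR; the `EventFlag` dataclass fields, the `rollup` helper, and the `top_n=3` cutoff are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class EventFlag:
    # Hypothetical shape; the real EventFlag protocol may differ.
    category: str
    severity: float
    description: str


def rollup(flags: list[EventFlag], top_n: int = 3) -> dict:
    """Collapse one ticker's per-article flags into the five aggregate columns."""
    if not flags:
        return {
            "event_count": 0,
            "event_severity_max": None,
            "event_severity_mean": None,
            "event_categories": [],
            "top_event_descriptions": [],
        }
    severities = [f.severity for f in flags]
    # Sort by severity, highest first; sorted() is stable, so ties keep input order.
    ranked = sorted(flags, key=lambda f: f.severity, reverse=True)
    return {
        "event_count": len(flags),
        "event_severity_max": max(severities),
        "event_severity_mean": mean(severities),
        "event_categories": sorted({f.category for f in flags}),
        "top_event_descriptions": [f.description for f in ranked[:top_n]],
    }
```

Nothing in this rollup depends on how a flag was produced, which is why a deterministic classifier can feed it without consumer-side changes.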

Cost impact

Retires the largest previously-untracked LLM cost slice per the Phase 0 audit estimate ($20–60/mo → $0). The research consumer sees identical EventFlag shape (extractor slug changes from `anthropic_haiku` to `rule_based`) and identical aggregate columns in `news_aggregates/{date}.parquet`.

Substrate cleanup

Retires three files added earlier this session — all dead code now that the LLM call site is gone:

`collectors/nlp/event_extraction.py`: the Anthropic extractor itself
`rag/pipelines/_cost_telemetry.py`: Phase 0.2 buffer (PR #308) + Phase 4 #1 runaway-cost breaker (PR #309)
`tests/test_news_cost_telemetry.py`: mirrored tests

`DEFAULT_EVENT_CATEGORIES` moves into the new `rule_based_event_extraction.py` so the closed taxonomy stays accessible.

Protocol contract

`EventExtractor.extract` gains an optional `article_tags: tuple[str, ...] = ()` kwarg (back-compat default). Pipeline plumbs the tag union across article variants. Any future EventExtractor (FinBERT, spaCy, reactivated LLM via research module) consumes the same shape.
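
A sketch of the widened protocol and the tag union, under stated assumptions: only the `article_tags: tuple[str, ...] = ()` kwarg is taken from this PR. The `ticker` and `text` parameters, the `list[dict]` return type, the `union_tags` helper, and the toy `TagEchoExtractor` are hypothetical.

```python
from typing import Protocol


class EventExtractor(Protocol):
    # Only the article_tags kwarg comes from the PR; the other
    # parameter names and the return type are assumptions.
    def extract(
        self, ticker: str, text: str, article_tags: tuple[str, ...] = ()
    ) -> list[dict]: ...


def union_tags(variants: list[tuple[str, ...]]) -> tuple[str, ...]:
    """Merge tags from every variant of one article, de-duplicated in first-seen order."""
    seen: dict[str, None] = {}
    for tags in variants:
        for tag in tags:
            seen.setdefault(tag, None)
    return tuple(seen)


class TagEchoExtractor:
    """Toy implementation used only to show the call shape."""

    def extract(self, ticker, text, article_tags=()):
        return [{"ticker": ticker, "tag": tag} for tag in article_tags]
```

Callers written before the change can keep omitting `article_tags`, since the default is an empty tuple.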

Severity convention

Rule-based flags emit `severity=0.5` uniformly (the EventFlag protocol's documented default). Haiku severity was a free-floating judgment never tuned by any operator alert. Per-category severity can be added via YAML if a downstream surface needs it.
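
If per-category severity is ever needed, one possible shape is a lookup that falls back to the uniform default. The mapping below stands in for whatever a YAML file would load to; `severity_for`, `SEVERITY_OVERRIDES`, and the override values are hypothetical and not part of this PR.

```python
from typing import Optional

DEFAULT_SEVERITY = 0.5  # the uniform value rule-based flags emit today

# Hypothetical overrides, e.g. the result of loading a YAML mapping.
SEVERITY_OVERRIDES = {"fda": 0.8, "m_and_a": 0.7}


def severity_for(category: str, overrides: Optional[dict[str, float]] = None) -> float:
    """Return the configured severity for a category, else the uniform default."""
    return (overrides or {}).get(category, DEFAULT_SEVERITY)
```

With no overrides supplied, every category resolves to 0.5, which matches the current behavior.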

Test plan

`TestRuleBasedEventExtractor` (10 tests) — empty-text short-circuit, no-match, per-category title classification (earnings / M&A / FDA), tag-based classification, tag+title union, multi-category, deterministic ordering, zero-LLM contract, description shape
Suite 1493 → 1479 net (retired 9 cost-telemetry + 7 Anthropic extractor tests; added 10 rule-based tests)
Sat 5/30 SF: RAGIngestion fires, `news_aggregates/{date}.parquet` shows non-empty event_categories from rule-based path, `event_count` > 0 for tickers with vendor tags
Research's substrate snapshot reads the new shape cleanly (no consumer-side changes required)

Composes with

`[[preference_llm_calls_confined_to_research_module]]`: the rule this PR enforces
alpha-engine #212 (executor EOD narrative kill switch): sibling application of the same rule. There are two non-research LLM call sites: this PR retires data's entirely, while executor's keeps the kill-switch substrate (default off) since the LLM path may be operator-reactivated
Retires the substrate from data #308 + data #309 (Phase 0.2 cost-telemetry buffer + Phase 4 #1 runaway-cost breaker); both became dead code with the LLM call site they instrumented

🤖 Generated with Claude Code

Applies the standing rule per ``[[preference_llm_calls_confined_to_research_module]]``: LLM calls live in alpha-engine-research. The news pipeline's Haiku-backed event extractor is removed and replaced with a deterministic classifier that uses two zero-cost signals already on the wire:

1. **Vendor tags** (``NewsArticle.tags``). Polygon emits keywords, GDELT emits structured event codes, Benzinga emits Channels. The ``alpha_engine_lib.sources.protocols.NewsArticle.tags`` docstring explicitly names this as "a soft signal for downstream event-flag extraction"; we were paying Haiku to re-derive what Polygon / GDELT already tagged.
2. **Title-keyword regex**. Backstop for sources that don't populate tags (Yahoo RSS). 17 pattern → category mappings against the ``DEFAULT_EVENT_CATEGORIES`` closed taxonomy.

**Why this is the right answer, not a kill-switch:** code audit found the Haiku per-article structured output was aggregated to 5 scalar / list columns (``event_count``, ``event_severity_max/mean``, ``event_categories``, ``top_event_descriptions``) before any research consumer touched it. The "zero-shot novel-event detection" capability was mostly wasted; research only sees per-ticker rollups. Tag-based + keyword-based classification produces equivalent rollups deterministically.

**Cost impact:** retires the largest previously-untracked LLM cost slice in the system per the original Phase 0 audit estimate ($20–60/mo). Actual spend on the deleted call site goes to $0; the research consumer sees identical EventFlag shape (extractor slug changes from ``"anthropic_haiku"`` to ``"rule_based"``) and identical aggregate columns in ``news_aggregates/{date}.parquet``.

**Substrate cleanup:** retires three files added earlier this session:

- ``collectors/nlp/event_extraction.py`` (the Anthropic extractor itself)
- ``rag/pipelines/_cost_telemetry.py`` (Phase 0.2 cost-telemetry buffer, PR #308 + Phase 4 #1 runaway-cost breaker, PR #309; both retired with the LLM call site they instrumented)
- ``tests/test_news_cost_telemetry.py`` (mirrored tests)

``DEFAULT_EVENT_CATEGORIES`` moves into the new ``collectors/nlp/rule_based_event_extraction.py`` so the closed taxonomy stays accessible to downstream consumers.

**Protocol contract:** ``EventExtractor.extract`` gains an optional ``article_tags: tuple[str, ...] = ()`` kwarg (back-compat default). The pipeline plumbs the tag union across article variants. Any future EventExtractor implementation (FinBERT, spaCy, reactivated LLM via research module) consumes the same shape.

**Severity convention:** rule-based flags emit ``severity=0.5`` uniformly (the EventFlag protocol's documented default). The Haiku severity was a free-floating judgment never tuned by any operator alert. Per-category severity tuning can be added via YAML if a downstream surface needs it.

**Tests:** ``TestRuleBasedEventExtractor`` (10 tests) covers empty-text short-circuit, no-match returns empty, title-keyword classification per category (earnings / M&A / FDA), tag-based classification (Polygon/GDELT shape), tag+title union, multi-category emission, deterministic ordering per ``DEFAULT_EVENT_CATEGORIES``, zero-LLM-dependency contract, title-as-description shape. Suite 1493 → 1479 net (retired the 9 cost-telemetry tests + 7 Anthropic extractor tests; added 10 rule-based tests).

**Composes with:**

- ``[[preference_llm_calls_confined_to_research_module]]``: the rule this PR enforces
- alpha-engine #212 (executor EOD narrative kill switch): sibling application of the same rule. Two non-research LLM call sites; this PR retires data's entirely, executor's keeps the kill switch substrate (default off) since the LLM path may be operator-reactivated.
- Retires the substrate from data #308 + data #309 (Phase 0.2 + Phase 4 #1 cost-telemetry buffer + breaker); both became dead code with the LLM call site they instrumented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cipher813 merged commit 7a48427 into main May 25, 2026
1 check passed

cipher813 deleted the feat/rule-based-event-extraction branch May 25, 2026 18:55
