Upload alpha-engine-config/data/config.yaml to spot #46
Merged
Smoke test surfaced that weekly_collector.py's load_config() can't find the private data config on a fresh spot: it searches /home/ec2-user/alpha-engine-config/data/config.yaml first. SCP the file from the dispatcher's clone (ae-dashboard's boot-pull already refreshes alpha-engine-config daily) rather than cloning the private repo on the spot, which would require a broader git-auth setup than the lib-only insteadOf rewrite this launcher uses.

drift_detector.py doesn't read config.yaml, so spot_drift_detection.sh needs no equivalent change.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request on May 13, 2026:
…; requires lib v0.15.0) (#226)

Wave 1 PR β of the institutional data-revamp arc (plan doc: ~/Development/alpha-engine-docs/private/data-revamp-260513.md). Producer-side concrete adapter implementations + multi-source aggregator. Pairs with alpha-engine-lib PR #46 (PR α, v0.15.0) which defined NewsSource Protocol + NewsArticle shape.

Architectural pattern: data is the producer; lib defines the contract; research is the consumer (will read producer outputs via S3 + RAG retrieval in future sub-PRs, never imports adapters directly).

New modules:

collectors/news_sources/
- polygon.py: FREE. Uses our existing polygon_client (data repo's copy) for rate-limit reuse. Normalizes /v2/reference/news.
- gdelt.py: FREE (no key). GDELT 2.0 DOC API; academic-grade event-extracted news. Requires ticker→name map for query building.
- yahoo_rss.py: FREE (fallback). Pure feedparser-based; matches existing collectors/alternative.py pattern but normalized into NewsArticle.
- benzinga.py: PAID stub. Raises NotImplementedError on init.
- ravenpack.py: PAID stub.
- bloomberg.py: PAID stub.

collectors/news_aggregator.py
- NewsAggregator(sources, trust_weights): fan-in across enabled NewsSource adapters → dedup (composite fingerprint: normalized title + URL path hash with querystring/fragment stripped) → preserve all source-provenance variants → return AggregatedNewsArticle records sorted by earliest_published_at desc.
- DEFAULT_TRUST_WEIGHTS: paid 0.95-1.0, polygon 0.9, gdelt 0.85, edgar_press 0.95, yahoo_rss 0.5.

Lib pin bumps (lockstep, both must move per the pin-lockstep test):
- requirements.txt v0.12.0 → v0.15.0
- Dockerfile v0.12.0 → v0.15.0

What's deferred (subsequent Wave 1 sub-PRs):
- PR A.1: NLP pipeline (Loughran-McDonald + FinBERT + spaCy NER + LLM event extraction). Heavier deps; separate PR.
- PR A.2: Structured aggregates writer (S3 parquet per ticker per day). Joined onto research's snapshot in PR F.
- PR A.3: RAG ingest path: news → chunked → embedded → indexed in pgvector alongside existing SEC filings corpus.
- PR B: Filings substrate expansion (EDGAR full coverage: 10-K/Q/14A/S-1/13D/G/13F/Form-4).
- PR C: Analyst substrate (yfinance + FMP adapters + self-derived revisions tracking from daily snapshots).
- PR D: Async + S3 cache + per-vendor rate limiters.
- PR E: Wire RAG retrieval tools into research repo's thesis_update + sector agents.
- PR F: Wire new substrate into research's fetch_data (supersedes #170's per-ticker pre-fetch).

+37 unit tests:
- Protocol structural-subtyping for all 3 free adapters
- Polygon: happy + transient-failure-per-ticker + schema-drift-skip
- GDELT: happy + query building (multi-word vs single-word) + failure-skips-ticker + missing-name-map-fallback
- Yahoo RSS: happy + entries-older-than-cutoff-dropped + no-link-skipped + fetch-failure-skips-ticker
- Paid stubs: all 3 raise NotImplementedError on init
- Aggregator: fan-in + URL/title dedup + canonical-title-longest + canonical-url-highest-trust + ticker-union + one-broken-source-isolated + output-sorted-desc + empty-fan-in
- Trust weights: defaults + overrides + unknown-source-defaults-half
- Fingerprint determinism
- Lib shape contract pin (extra='forbid' + frozen)

Suite: 848 passing.

Composes with:
- alpha-engine-lib PR #46 (v0.15.0): required for shapes + Protocols
- alpha-engine-research PR #172 (CLOSED): original mis-located substrate; relocated here per architectural correction

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
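The dedup fingerprint named in the commit above (normalized title plus URL path with querystring/fragment stripped) could look roughly like the following. This is a minimal sketch, not the aggregator's code: the function names, the exact normalization rules, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import re
from urllib.parse import urlsplit


def normalize_title(title: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace (assumed rules)."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", title.lower())
    return " ".join(cleaned.split())


def url_path_key(url: str) -> str:
    """Host + path only; querystring and fragment are discarded."""
    parts = urlsplit(url)
    return f"{parts.netloc.lower()}{parts.path.rstrip('/')}"


def fingerprint(title: str, url: str) -> str:
    """Hypothetical composite fingerprint: hash of normalized title and URL path."""
    payload = f"{normalize_title(title)}|{url_path_key(url)}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = fingerprint("Apple Beats Q2 Estimates!",
                    "https://example.com/news/apple-q2?utm_source=x#top")
    b = fingerprint("apple beats q2 estimates",
                    "https://EXAMPLE.com/news/apple-q2/")
    print(a == b)  # tracking params, fragment, case, and punctuation do not matter
```

Under this reading, two variants collapse only when both the title and the URL path agree; the test name "URL/title dedup" suggests the real aggregator may match more loosely, which this sketch does not attempt.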
cipher813 added a commit that referenced this pull request on May 13, 2026:
… event extraction) (#227)

Wave 1 PR A.1 of the institutional data-revamp arc (plan doc: ~/Development/alpha-engine-docs/private/data-revamp-260513.md). Producer-side NLP layer that consumes AggregatedNewsArticle output from PR β's aggregator and emits three parallel structured streams: sentiment scores, entity mentions, and event flags. Output is ready for the PR A.2 parquet writer + PR A.3 RAG ingest pass.

Architecture: each NLP dimension is a Protocol with one or more concrete implementations. The pipeline orchestrator (NewsNLPPipeline) composes them without knowing which concrete classes are wired: upgrade paths drop in as new adapter classes without touching the orchestrator or downstream consumers.

New modules:

collectors/nlp/protocols.py
- SentimentScore, EntityMention, EventFlag Pydantic shapes (frozen + extra='forbid' so parquet writer has stable column schema)
- SentimentScorer, EntityExtractor, EventExtractor Protocols (runtime_checkable; structural subtyping)

collectors/nlp/loughran_mcdonald.py
- LoughranMcDonaldScorer: finance-domain dictionary sentiment, the academic gold standard (Loughran & McDonald 2011). Composite = (positive - negative) / total_tokens clipped to [-clip, +clip]. Uncertainty counted separately from polarity.
- load_lm_master_dict() CSV parser: tolerates missing file (logs warning, returns empty dict so production fails clearly).

collectors/nlp/event_extraction.py
- AnthropicEventExtractor: Haiku-tier structured event extraction via tool_use API. Closed taxonomy of 18 event categories (DEFAULT_EVENT_CATEGORIES: earnings, M&A, IPO, management change, regulatory action, FDA action, etc.). Cost-tracked under agent_id='news_event_extractor'. Tolerates transient failures (returns empty) + malformed entries (drops single entry, keeps others). Handles tool_use.input as either dict or JSON string.

collectors/nlp/pipeline.py
- NewsNLPPipeline(sentiment_scorers, entity_extractors, event_extractors).process(articles) → NewsNLPOutput. Composes any number of components per stream. Per-component exceptions are isolated: one broken scorer can't take down a batch. Article text uses canonical_title + longest body_excerpt across variants.

scripts/download_lm_dict.py
- Operator script. Fetches the canonical Loughran-McDonald 2022 Master Dictionary CSV from Notre Dame's research page. Idempotent (overwrites). Pin via VCS commit if reproducibility matters. URL is overridable when the source moves.

What's deferred to subsequent sub-PRs:
- PR A.1.1: FinBERT scorer (HF transformer; heavier deps; lives in collectors/nlp/finbert.py as a drop-in)
- PR A.1.2: spaCy NER entity extractor
- PR A.2: Structured aggregates writer (S3 parquet per (ticker, date) joining sentiment + entity + event streams)
- PR A.3: RAG ingest path (raw article text → chunked → embedded → pgvector alongside SEC filings)

+42 unit tests:
- Pydantic shape construction + frozen + extra='forbid' (3 shapes)
- Protocol structural-subtyping (3 protocols, structural matches)
- Tokenization (alphabetic only, lowercase, drops digits)
- _truthy helper (year-stamp / 0 / blank / non-numeric)
- LM scorer: pure positive / pure negative / balanced / dilution-by-neutral / uncertainty-counted-separately / empty-text / clipped-to-range / empty-dict-warns-and-yields-zero
- load_lm_master_dict: canonical-CSV / missing-file-warns / blank-rows-skipped
- Anthropic event extractor: tool-spec-includes-default-categories / happy-path / empty-text-skips-call / transient-LLM-failure / malformed-entry-dropped / tool_use-input-as-JSON-string / no-tool-use-block-returns-empty
- Pipeline: empty / sentiment-per-article / multiple-scorers / scorer-exception-isolated / event-extractor-receives-tickers / empty-article-text-skipped / uses-longest-excerpt-across-variants

Suite: 890 passing (1 skipped).

Composes with:
- PR α (alpha-engine-lib #46, v0.15.0): NewsArticle Pydantic shape
- PR β (this repo #226): AggregatedNewsArticle input
- data-revamp-260513.md plan doc: full 4-wave arc context

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
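The composite formula quoted in the commit above, (positive - negative) / total_tokens clipped to [-clip, +clip], can be worked through with a small sketch. Everything here is illustrative: the function names, the lowercase word sets standing in for the Loughran-McDonald dictionary, the default clip of 1.0, and returning uncertainty as a raw count are assumptions, not LoughranMcDonaldScorer's actual interface.

```python
import re


def tokenize(text: str) -> list[str]:
    """Alphabetic-only, lowercase tokens; digits are dropped."""
    return re.findall(r"[a-z]+", text.lower())


def lm_composite(text, positive, negative, uncertainty, clip=1.0):
    """Hypothetical dictionary scorer following the formula in the commit message."""
    tokens = tokenize(text)
    if not tokens:
        # Empty text: nothing to divide by, so report a neutral score.
        return {"composite": 0.0, "uncertainty_count": 0, "total_tokens": 0}
    pos = sum(t in positive for t in tokens)
    neg = sum(t in negative for t in tokens)
    unc = sum(t in uncertainty for t in tokens)  # tracked apart from polarity
    raw = (pos - neg) / len(tokens)
    composite = max(-clip, min(clip, raw))
    return {"composite": composite, "uncertainty_count": unc,
            "total_tokens": len(tokens)}


if __name__ == "__main__":
    result = lm_composite(
        "Strong gains offset weak guidance",
        positive={"strong", "gains"},
        negative={"weak"},
        uncertainty={"guidance"},
    )
    print(result["composite"])  # (2 - 1) / 5 = 0.2
```

Neutral tokens enlarge the denominator without touching the numerator, which is the "dilution-by-neutral" behaviour listed among the tests.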
Summary
Followup #2 on the spot migration (stacked after #44 + #45). Smoke test of `spot_data_weekly.sh` surfaced `FileNotFoundError: Config not found` on the fresh spot — `weekly_collector.py`'s `load_config()` searches `/home/ec2-user/alpha-engine-config/data/config.yaml` first, which isn't present on a fresh spot clone.
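For readers unfamiliar with the failure mode, a first-match config search of the kind described above might look like this. It is a sketch under assumptions: only the first search path and the `Config not found` message come from this PR; the helper name `find_config`, the candidate-tuple shape, and the error detail are hypothetical, and the real `load_config()` may search additional locations.

```python
from pathlib import Path

# First location the PR says load_config() checks; any later fallbacks are unknown.
DEFAULT_CANDIDATES = (
    Path("/home/ec2-user/alpha-engine-config/data/config.yaml"),
)


def find_config(candidates=DEFAULT_CANDIDATES) -> Path:
    """Return the first candidate path that exists as a file (hypothetical helper)."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    searched = ", ".join(str(c) for c in candidates)
    raise FileNotFoundError(f"Config not found (searched: {searched})")
```

On a fresh spot none of the candidates exist, so the search falls through to the `FileNotFoundError` seen in the smoke test.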
Change
SCP the dispatcher's already-cloned config file (ae-dashboard's `boot-pull.timer` refreshes alpha-engine-config daily) onto the spot at the expected path. Cheaper than cloning the private config repo on the spot, which would require broader git-auth setup than the `alpha-engine-lib`-scoped insteadOf rewrite the launcher already uses.
`spot_drift_detection.sh` unchanged — `drift_detector.py` has no config.yaml refs.
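The copy step described in this section amounts to two commands: create the target directory on the spot, then `scp` the file into it. The launcher itself is a shell script; the sketch below uses Python only to show the shape of those commands. The function name, the `ec2-user` login, the key-file argument, and the example host and local path are assumptions, not values taken from `spot_data_weekly.sh`.

```python
import shlex
from pathlib import PurePosixPath

# Path that load_config() searches first, per the Summary.
REMOTE_CONFIG = "/home/ec2-user/alpha-engine-config/data/config.yaml"


def build_config_upload_commands(local_config, spot_host, ssh_key,
                                 remote_path=REMOTE_CONFIG, user="ec2-user"):
    """Return argv lists for `ssh mkdir -p` followed by `scp` (hypothetical helper)."""
    target = f"{user}@{spot_host}"
    remote_dir = str(PurePosixPath(remote_path).parent)
    # The parent directory does not exist on a fresh spot, so create it first.
    mkdir_cmd = ["ssh", "-i", ssh_key, target,
                 f"mkdir -p {shlex.quote(remote_dir)}"]
    scp_cmd = ["scp", "-i", ssh_key, local_config, f"{target}:{remote_path}"]
    return [mkdir_cmd, scp_cmd]


if __name__ == "__main__":
    # Placeholder host, key, and local clone path for illustration only.
    for cmd in build_config_upload_commands(
        "/path/to/alpha-engine-config/data/config.yaml",
        "203.0.113.10",
        "/path/to/key.pem",
    ):
        print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` so that a failed `mkdir` or copy stops the launch before the collector runs.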
Test plan
🤖 Generated with Claude Code