Upload alpha-engine-config/data/config.yaml to spot #46

Merged
cipher813 merged 1 commit into main from fix/spot-launcher-upload-config on Apr 16, 2026
@cipher813
Owner

Summary

Follow-up #2 on the spot migration (stacked after #44 and #45). A smoke test of `spot_data_weekly.sh` surfaced `FileNotFoundError: Config not found` on the fresh spot: `weekly_collector.py`'s `load_config()` searches `/home/ec2-user/alpha-engine-config/data/config.yaml` first, and that file isn't present on a fresh spot clone.
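
The failure described above can be illustrated with a minimal sketch of a first-match search over candidate paths. Only the `/home/ec2-user/...` path and the error text come from this PR; `find_config`, the `candidates` parameter, and the fallback entry are hypothetical and are not taken from `weekly_collector.py`.

```python
from pathlib import Path

# The first entry is the path named in this PR. The second is a hypothetical
# fallback, included only to show the "first match wins" ordering.
DEFAULT_CANDIDATES = (
    Path("/home/ec2-user/alpha-engine-config/data/config.yaml"),
    Path("config.yaml"),
)


def find_config(candidates=DEFAULT_CANDIDATES) -> Path:
    """Return the first candidate that exists as a file, else raise."""
    for candidate in candidates:
        if Path(candidate).is_file():
            return Path(candidate)
    searched = ", ".join(str(c) for c in candidates)
    raise FileNotFoundError(f"Config not found (searched: {searched})")
```

On a fresh spot none of the candidates exist, so a search of this shape raises before any collection work starts, which matches what the smoke test reported.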

Change

SCP the dispatcher's already-cloned config file (ae-dashboard's `boot-pull.timer` refreshes alpha-engine-config daily) onto the spot at the expected path. This is cheaper than cloning the private config repo on the spot, which would require broader git-auth setup than the `alpha-engine-lib`-scoped insteadOf rewrite the launcher already uses.
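
As a rough illustration of the upload step (the real change lives in the shell launcher, not in Python), the sketch below builds the two commands such a step would need: create the remote directory, then copy the file. The function name, the `host` and `key_file` parameters, and the local path are assumptions; only the remote destination path and the `ec2-user` account appear in this PR.

```python
import shlex
from pathlib import PurePosixPath

# Remote destination expected by load_config(), per the PR description.
REMOTE_CONFIG = PurePosixPath("/home/ec2-user/alpha-engine-config/data/config.yaml")


def build_upload_commands(local_config, host, key_file, user="ec2-user"):
    """Return argv lists: first `ssh ... mkdir -p`, then `scp` of the config."""
    target = f"{user}@{host}"
    remote_dir = shlex.quote(str(REMOTE_CONFIG.parent))
    mkdir_cmd = ["ssh", "-i", key_file, target, f"mkdir -p {remote_dir}"]
    copy_cmd = ["scp", "-i", key_file, local_config, f"{target}:{REMOTE_CONFIG}"]
    return [mkdir_cmd, copy_cmd]
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` so that a failed copy stops the launcher instead of surfacing later as a missing-config error.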

`spot_drift_detection.sh` unchanged — `drift_detector.py` has no config.yaml refs.

Test plan

  • `bash -n` on spot_data_weekly.sh
  • Post-merge: rerun `bash infrastructure/spot_data_weekly.sh --smoke-only` on ae-dashboard — should complete through `weekly_collector.py --phase 1 --dry-run` and terminate cleanly

🤖 Generated with Claude Code

Smoke test surfaced that weekly_collector.py's load_config() can't find
the private data config on a fresh spot: it searches
/home/ec2-user/alpha-engine-config/data/config.yaml first.

SCP from the dispatcher's clone (ae-dashboard's boot-pull already
refreshes alpha-engine-config daily) rather than cloning the private
repo on the spot, which would require a broader git-auth setup than
the lib-only insteadOf rewrite this launcher uses.

drift_detector.py doesn't read config.yaml, so spot_drift_detection.sh
needs no equivalent change.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cipher813 merged commit a8c7d4f into main on Apr 16, 2026
1 check passed
cipher813 deleted the fix/spot-launcher-upload-config branch on April 16, 2026 at 18:16
cipher813 added a commit that referenced this pull request May 13, 2026
…; requires lib v0.15.0) (#226)

Wave 1 PR β of the institutional data-revamp arc (plan doc:
~/Development/alpha-engine-docs/private/data-revamp-260513.md).

Producer-side concrete adapter implementations + multi-source
aggregator. Pairs with alpha-engine-lib PR #46 (PR α, v0.15.0) which
defined NewsSource Protocol + NewsArticle shape.

Architectural pattern: data is the producer; lib defines the contract;
research is the consumer (will read producer outputs via S3 + RAG
retrieval in future sub-PRs, never imports adapters directly).

New modules:

  collectors/news_sources/
    polygon.py     — FREE. Uses our existing polygon_client (data
                     repo's copy) for rate-limit reuse. Normalizes
                     /v2/reference/news.
    gdelt.py       — FREE (no key). GDELT 2.0 DOC API; academic-grade
                     event-extracted news. Requires ticker→name map
                     for query building.
    yahoo_rss.py   — FREE (fallback). Pure feedparser-based; matches
                     existing collectors/alternative.py pattern but
                     normalized into NewsArticle.
    benzinga.py    — PAID stub. Raises NotImplementedError on init.
    ravenpack.py   — PAID stub.
    bloomberg.py   — PAID stub.

  collectors/news_aggregator.py
    NewsAggregator(sources, trust_weights) — fan-in across enabled
    NewsSource adapters → dedup (composite fingerprint: normalized
    title + URL path hash with querystring/fragment stripped) →
    preserve all source-provenance variants → return
    AggregatedNewsArticle records sorted by earliest_published_at desc.
    DEFAULT_TRUST_WEIGHTS: paid 0.95-1.0, polygon 0.9, gdelt 0.85,
    edgar_press 0.95, yahoo_rss 0.5.
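
A minimal sketch of the dedup step described above, under stated assumptions. Articles are plain dicts rather than the lib's NewsArticle shape, the fingerprint hashes host plus path (the commit message says only "URL path hash"), and the function names are hypothetical. The free-source weights and the 0.5 default for unknown sources are taken from this commit message; paid sources are omitted.

```python
import hashlib
import re
from typing import Optional
from urllib.parse import urlsplit

DEFAULT_TRUST_WEIGHTS = {"polygon": 0.9, "gdelt": 0.85, "edgar_press": 0.95, "yahoo_rss": 0.5}
UNKNOWN_SOURCE_WEIGHT = 0.5  # unknown sources default to one half


def normalize_title(title: str) -> str:
    """Lowercase and collapse each non-alphanumeric run to one space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def fingerprint(title: str, url: str) -> str:
    """Hash normalized title plus host and path; query and fragment are ignored."""
    parts = urlsplit(url)
    url_key = f"{parts.netloc.lower()}{parts.path.rstrip('/')}"
    payload = f"{normalize_title(title)}|{url_key}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def dedup(articles: list, trust_weights: Optional[dict] = None) -> list:
    """Group by fingerprint, keep every variant, and pick canonical fields."""
    weights = {**DEFAULT_TRUST_WEIGHTS, **(trust_weights or {})}
    groups = {}
    for article in articles:
        groups.setdefault(fingerprint(article["title"], article["url"]), []).append(article)
    merged = []
    for variants in groups.values():
        most_trusted = max(
            variants, key=lambda a: weights.get(a["source"], UNKNOWN_SOURCE_WEIGHT)
        )
        merged.append({
            "canonical_title": max((a["title"] for a in variants), key=len),
            "canonical_url": most_trusted["url"],
            "tickers": sorted({t for a in variants for t in a["tickers"]}),
            "earliest_published_at": min(a["published_at"] for a in variants),
            "variants": variants,
        })
    return sorted(merged, key=lambda m: m["earliest_published_at"], reverse=True)
```

Keeping `variants` on each merged record is what preserves source provenance: the canonical fields are a view chosen from the variants, not a replacement for them.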

Lib pin bumps (lockstep, both must move per the pin-lockstep test):
  requirements.txt v0.12.0 → v0.15.0
  Dockerfile       v0.12.0 → v0.15.0
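
For context, a pin-lockstep check of the kind mentioned above could look roughly like this. The regex and the assumption that both files reference the lib as `alpha-engine-lib...@vX.Y.Z` are illustrative guesses; the actual pin syntax and the real test are not shown in this commit message.

```python
import re

# Assumed pin shape: "alpha-engine-lib" followed later on the same line by "@vX.Y.Z".
PIN_RE = re.compile(r"alpha-engine-lib.*?@(v\d+\.\d+\.\d+)")


def extract_pins(text: str) -> set:
    """Return every distinct lib version tag found in the text."""
    return set(PIN_RE.findall(text))


def pins_in_lockstep(requirements_text: str, dockerfile_text: str) -> bool:
    """True only when both files pin exactly one version and it is the same one."""
    req_pins = extract_pins(requirements_text)
    docker_pins = extract_pins(dockerfile_text)
    return len(req_pins) == 1 and req_pins == docker_pins
```

Requiring exactly one pin per file also catches a half-finished bump where an old tag lingers next to the new one.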

What's deferred (subsequent Wave 1 sub-PRs):
  PR A.1 — NLP pipeline (Loughran-McDonald + FinBERT + spaCy NER +
           LLM event extraction). Heavier deps; separate PR.
  PR A.2 — Structured aggregates writer (S3 parquet per ticker per
           day). Joined onto research's snapshot in PR F.
  PR A.3 — RAG ingest path: news → chunked → embedded → indexed in
           pgvector alongside existing SEC filings corpus.
  PR B   — Filings substrate expansion (EDGAR full coverage:
           10-K/Q/14A/S-1/13D/G/13F/Form-4).
  PR C   — Analyst substrate (yfinance + FMP adapters + self-derived
           revisions tracking from daily snapshots).
  PR D   — Async + S3 cache + per-vendor rate limiters.
  PR E   — Wire RAG retrieval tools into research repo's
           thesis_update + sector agents.
  PR F   — Wire new substrate into research's fetch_data (supersedes
           #170's per-ticker pre-fetch).

+37 unit tests:
  - Protocol structural-subtyping for all 3 free adapters
  - Polygon: happy + transient-failure-per-ticker + schema-drift-skip
  - GDELT: happy + query building (multi-word vs single-word) +
    failure-skips-ticker + missing-name-map-fallback
  - Yahoo RSS: happy + entries-older-than-cutoff-dropped +
    no-link-skipped + fetch-failure-skips-ticker
  - Paid stubs: all 3 raise NotImplementedError on init
  - Aggregator: fan-in + URL/title dedup + canonical-title-longest +
    canonical-url-highest-trust + ticker-union + one-broken-source-
    isolated + output-sorted-desc + empty-fan-in
  - Trust weights: defaults + overrides + unknown-source-defaults-half
  - Fingerprint determinism
  - Lib shape contract pin (extra='forbid' + frozen)

Suite: 848 passing.

Composes with:
  - alpha-engine-lib PR #46 (v0.15.0) — required for shapes + Protocols
  - alpha-engine-research PR #172 (CLOSED) — original mis-located
    substrate; relocated here per architectural correction

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request May 13, 2026
… event extraction) (#227)

Wave 1 PR A.1 of the institutional data-revamp arc (plan doc:
~/Development/alpha-engine-docs/private/data-revamp-260513.md).

Producer-side NLP layer that consumes AggregatedNewsArticle output
from PR β's aggregator and emits three parallel structured streams:
sentiment scores, entity mentions, and event flags. Output is ready
for the PR A.2 parquet writer + PR A.3 RAG ingest pass.

Architecture: each NLP dimension is a Protocol with one or more
concrete implementations. The pipeline orchestrator (NewsNLPPipeline)
composes them without knowing which concrete classes are wired —
upgrade paths drop in as new adapter classes without touching the
orchestrator or downstream consumers.

New modules:

  collectors/nlp/protocols.py
    SentimentScore, EntityMention, EventFlag Pydantic shapes
    (frozen + extra='forbid' so parquet writer has stable column
    schema)
    SentimentScorer, EntityExtractor, EventExtractor Protocols
    (runtime_checkable; structural subtyping)
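
A small sketch of the structural-subtyping pattern named above. To stay dependency-free it uses a frozen stdlib dataclass as a stand-in for the Pydantic shape, and the field names and the `score` signature are hypothetical; only the class names `SentimentScore` and `SentimentScorer` come from this commit message.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class SentimentScore:
    """Stand-in for the Pydantic shape; these fields are illustrative only."""
    article_id: str
    scorer: str
    composite: float


@runtime_checkable
class SentimentScorer(Protocol):
    def score(self, article_id: str, text: str) -> SentimentScore: ...


class ConstantScorer:
    """Does not inherit from SentimentScorer; it matches by shape alone."""

    def score(self, article_id: str, text: str) -> SentimentScore:
        return SentimentScore(article_id=article_id, scorer="constant", composite=0.0)
```

Note that `isinstance(obj, SentimentScorer)` on a `runtime_checkable` Protocol only checks that a `score` attribute exists; it does not validate the signature or return type, so contract tests still have to call the method.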

  collectors/nlp/loughran_mcdonald.py
    LoughranMcDonaldScorer — finance-domain dictionary sentiment,
    the academic gold standard (Loughran & McDonald 2011). Composite
    = (positive - negative) / total_tokens clipped to [-clip, +clip].
    Uncertainty counted separately from polarity.
    load_lm_master_dict() CSV parser — tolerates missing file (logs
    warning, returns empty dict so production fails clearly).
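
The composite formula above can be sketched as follows. The word sets are passed in directly rather than loaded from the Master Dictionary CSV, the default `clip` value is an assumption, and the returned dict layout is hypothetical; the formula, the alphabetic-only lowercase tokenization, and the separate uncertainty count follow this commit message.

```python
import re

TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(text: str) -> list:
    """Lowercase, keep alphabetic runs only (digits and punctuation are dropped)."""
    return TOKEN_RE.findall(text.lower())


def lm_score(text, positive, negative, uncertainty, clip=1.0) -> dict:
    """Composite = (positive - negative) / total_tokens, clipped to [-clip, +clip]."""
    tokens = tokenize(text)
    total = len(tokens)
    if total == 0:
        # Empty text yields a neutral score rather than a division by zero.
        return {"composite": 0.0, "uncertainty_count": 0, "total_tokens": 0}
    pos = sum(1 for t in tokens if t in positive)
    neg = sum(1 for t in tokens if t in negative)
    unc = sum(1 for t in tokens if t in uncertainty)
    raw = (pos - neg) / total
    return {
        "composite": max(-clip, min(clip, raw)),
        "uncertainty_count": unc,  # tracked separately; never enters the composite
        "total_tokens": total,
    }
```

Because the denominator is the total token count, neutral words dilute the score: the same two positive hits score lower in a long article than in a headline.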

  collectors/nlp/event_extraction.py
    AnthropicEventExtractor — Haiku-tier structured event extraction
    via tool_use API. Closed taxonomy of 18 event categories
    (DEFAULT_EVENT_CATEGORIES — earnings, M&A, IPO, management
    change, regulatory action, FDA action, etc.). Cost-tracked under
    agent_id='news_event_extractor'. Tolerates transient failures
    (returns empty) + malformed entries (drops single entry, keeps
    others). Handles tool_use.input as either dict or JSON string.
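
The tolerant parsing described above might look roughly like this. It does not call the Anthropic SDK; it only shows how a tool input arriving as either a dict or a JSON string could be normalized and how a malformed entry could be dropped without losing the rest. The `events`, `category`, and `ticker` keys are assumptions about the payload shape.

```python
import json
from typing import Any


def coerce_tool_input(raw: Any) -> dict:
    """Accept a dict or a JSON-encoded string; anything else becomes {}."""
    if isinstance(raw, str):
        try:
            raw = json.loads(raw)
        except json.JSONDecodeError:
            return {}
    return raw if isinstance(raw, dict) else {}


def parse_events(raw: Any, allowed_categories: set) -> list:
    """Keep well-formed entries in the closed taxonomy; drop the rest one by one."""
    entries = coerce_tool_input(raw).get("events", [])
    if not isinstance(entries, list):
        return []
    events = []
    for entry in entries:
        if not isinstance(entry, dict):
            continue  # malformed entry: skip it, keep the others
        category = entry.get("category")
        if not isinstance(category, str) or category not in allowed_categories:
            continue  # outside the closed taxonomy
        events.append({"category": category, "ticker": entry.get("ticker")})
    return events
```

Filtering against the allowed set on the consumer side means an off-taxonomy label from the model never reaches the downstream writer, even if the tool schema is loosened later.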

  collectors/nlp/pipeline.py
    NewsNLPPipeline(sentiment_scorers, entity_extractors,
                    event_extractors).process(articles) →
    NewsNLPOutput. Composes any number of components per stream.
    Per-component exceptions are isolated — one broken scorer can't
    take down a batch. Article text uses canonical_title +
    longest body_excerpt across variants.
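
A minimal sketch of the per-component isolation and text selection described above, limited to the sentiment stream. Articles are plain dicts, scorers are plain callables keyed by name, and the result layout is hypothetical; the real NewsNLPPipeline works with the Protocols and Pydantic shapes listed earlier.

```python
import logging
from typing import Callable, Dict, Iterable, List

logger = logging.getLogger(__name__)


def article_text(article: dict) -> str:
    """canonical_title plus the longest body_excerpt found across variants."""
    excerpts = [v.get("body_excerpt") or "" for v in article.get("variants", [])]
    longest = max(excerpts, key=len, default="")
    return f"{article.get('canonical_title', '')}\n{longest}".strip()


def run_scorers(articles: Iterable[dict],
                scorers: Dict[str, Callable[[str], float]]) -> List[dict]:
    """Apply every scorer to every article; a failing scorer loses only its own row."""
    results = []
    for article in articles:
        text = article_text(article)
        if not text:
            continue  # nothing to score
        for name, scorer in scorers.items():
            try:
                score = scorer(text)
            except Exception:
                # Isolation boundary: log and move on to the next scorer.
                logger.exception("scorer %s failed; skipping", name)
                continue
            results.append({"title": article.get("canonical_title"),
                            "scorer": name, "score": score})
    return results
```

Placing the `try` around a single scorer call, rather than around the article or batch loop, is what keeps one broken component from discarding results that other components already produced.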

  scripts/download_lm_dict.py
    Operator script. Fetches the canonical Loughran-McDonald 2022
    Master Dictionary CSV from Notre Dame's research page. Idempotent
    (overwrites). Pin via VCS commit if reproducibility matters. URL
    is overridable when the source moves.

What's deferred to subsequent sub-PRs:
  PR A.1.1 — FinBERT scorer (HF transformer; heavier deps; lives in
             collectors/nlp/finbert.py as a drop-in)
  PR A.1.2 — spaCy NER entity extractor
  PR A.2   — Structured aggregates writer (S3 parquet per (ticker,
             date) joining sentiment + entity + event streams)
  PR A.3   — RAG ingest path (raw article text → chunked → embedded →
             pgvector alongside SEC filings)

+42 unit tests:
  - Pydantic shape construction + frozen + extra='forbid' (3 shapes)
  - Protocol structural-subtyping (3 protocols, structural matches)
  - Tokenization (alphabetic only, lowercase, drops digits)
  - _truthy helper (year-stamp / 0 / blank / non-numeric)
  - LM scorer: pure positive / pure negative / balanced /
    dilution-by-neutral / uncertainty-counted-separately / empty-text /
    clipped-to-range / empty-dict-warns-and-yields-zero
  - load_lm_master_dict: canonical-CSV / missing-file-warns / blank-rows-skipped
  - Anthropic event extractor: tool-spec-includes-default-categories /
    happy-path / empty-text-skips-call / transient-LLM-failure /
    malformed-entry-dropped / tool_use-input-as-JSON-string /
    no-tool-use-block-returns-empty
  - Pipeline: empty / sentiment-per-article / multiple-scorers /
    scorer-exception-isolated / event-extractor-receives-tickers /
    empty-article-text-skipped / uses-longest-excerpt-across-variants

Suite: 890 passing (1 skipped).

Composes with:
  - PR α (alpha-engine-lib #46, v0.15.0) — NewsArticle Pydantic shape
  - PR β (this repo #226) — AggregatedNewsArticle input
  - data-revamp-260513.md plan doc — full 4-wave arc context

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>