feat(daily-closes): stamp source column on every row #159
Merged
The staging/daily_closes/{date}.parquet writer threads `--source`
(polygon_only / yfinance_only / fred / auto) as a CLI arg and knows
definitively which provider each row came from, but doesn't persist
the label. Downstream consumers (predictor inference, dashboard,
backtester) that need to know the canonical source today resort to
heuristics like "is VWAP populated?". That works only because polygon
writes VWAP and yfinance doesn't; it is coincidental, not contractual.
Adds a `source` column to every appended record:
• polygon grouped-daily + per-ticker fallback → "polygon"
• FRED single-value index closes → "fred"
• yfinance EOD batch → "yfinance"
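A minimal sketch of the stamping step listed above, assuming each fetcher hands back a list of per-ticker dicts. `stamp_source`, `VALID_SOURCES`, and the sample rows are hypothetical names used for illustration, not the writer's actual code:

```python
from typing import Any

# Provider labels taken from the list above.
VALID_SOURCES = {"polygon", "fred", "yfinance"}


def stamp_source(records: list[dict[str, Any]], source: str) -> list[dict[str, Any]]:
    """Return copies of `records` with a `source` label on every row (hypothetical helper)."""
    if source not in VALID_SOURCES:
        raise ValueError(f"unknown source label: {source!r}")
    # Copy each record so the fetcher's own output is left untouched.
    return [{**rec, "source": source} for rec in records]


# Illustrative rows only; the real fetchers and their exact fields are assumptions.
polygon_rows = [{"ticker": "SPY", "Close": 500.0, "VWAP": 499.5}]
yfinance_rows = [{"ticker": "QQQ", "Close": 430.0, "VWAP": None}]

records = stamp_source(polygon_rows, "polygon") + stamp_source(yfinance_rows, "yfinance")
```

Stamping at the point where the provider is known keeps the label independent of which numeric columns happen to be populated.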
Schema change is additive (new column only) so backward-compatible
per CLAUDE.md S3 contract rules. Verified downstream consumers
(`builders/daily_append.py`, `features/compute.py`,
`sf_preflight.py`) all column-select by name (`OHLCV_COLS` filter
pattern at daily_append.py:851/854/869/915), so the new column is
silently ignored or carried along but not actively consumed.
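To illustrate why selecting columns by name makes the change additive, here is a small sketch in which plain dicts stand in for parquet rows. `select_ohlcv` is a hypothetical helper, and the real filter in `daily_append.py` may look different:

```python
# Column list as quoted in the PR summary for builders/daily_append.py.
OHLCV_COLS = ["Open", "High", "Low", "Close", "Volume", "VWAP"]


def select_ohlcv(row: dict) -> dict:
    """Keep only the named OHLCV columns that are present; everything else is dropped."""
    return {col: row[col] for col in OHLCV_COLS if col in row}


# Same row before and after the schema bump (values are made up).
old_row = {"Open": 1.0, "High": 2.0, "Low": 0.5, "Close": 1.5, "Volume": 100, "VWAP": 1.4}
new_row = {**old_row, "source": "polygon"}

# A consumer that selects by name sees identical data either way.
same_view = select_ohlcv(new_row) == select_ohlcv(old_row)
```

A consumer that iterated over every column, or selected by position, would not have this property, which is why the check above matters.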
Test plan
- [x] AST parse OK
- [ ] Next morning polygon pass writes `source=polygon` in every row;
next EOD yfinance pass writes `source=yfinance`. Manual smoke:
`python -c "import boto3,io,pandas as pd; df = pd.read_parquet(io.BytesIO(boto3.client('s3').get_object(Bucket='alpha-engine-research', Key='staging/daily_closes/<date>.parquet')['Body'].read())); print(df.source.value_counts())"`
Companion dashboard PR #51 (alpha-engine-dashboard) will read this
column directly with VWAP-presence as a fallback for older parquets
that pre-date the schema bump.
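The read-with-fallback behaviour described for the dashboard could look roughly like the sketch below. `resolve_source` is a hypothetical function, rows are modelled as dicts, and the dashboard's actual implementation is not shown in this PR:

```python
import math
from typing import Any


def resolve_source(row: dict[str, Any]) -> str:
    """Prefer the persisted `source` label; fall back to VWAP presence for older parquets."""
    label = row.get("source")
    if label:
        return label
    # Legacy heuristic: polygon writes VWAP, yfinance does not.
    vwap = row.get("VWAP")
    missing = vwap is None or (isinstance(vwap, float) and math.isnan(vwap))
    return "yfinance" if missing else "polygon"


labeled = resolve_source({"Close": 100.0, "VWAP": None, "source": "fred"})
legacy_with_vwap = resolve_source({"Close": 500.0, "VWAP": 499.5})
legacy_without_vwap = resolve_source({"Close": 430.0, "VWAP": float("nan")})
```

In this sketch the fallback branch can only ever answer "polygon" or "yfinance", so it never yields "fred"; the explicit label removes that blind spot.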
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cipher813 added a commit that referenced this pull request on May 5, 2026
The Phase 2 Lambda deploy job has been failing on:

    ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?

The Lambda Python 3.12 base image (public.ecr.aws/lambda/python:3.12) doesn't ship with git, but the Dockerfile uses `pip install ... @ git+https://...`, which requires it. Surfaced today when PR #159's post-merge Deploy fired against a fresh build (vs the prior in-flight image, which had a cached install layer that masked the gap).

Fix: same one-line microdnf install applied to alpha-engine-research Dockerfile after PR #105's lib-public flip. AL2023 minimal package manager; image-size impact ~25MB.

Out of scope: the FromPlatformFlagConstDisallowed warning (line 1) about `--platform=linux/amd64` is a separate buildkit lint that doesn't block builds; leave for a follow-up.

Test plan
- [x] Diff parity with alpha-engine-research Dockerfile (same line)
- [ ] Deploy workflow re-runs cleanly post-merge

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
The `staging/daily_closes/{date}.parquet` writer threads a `--source` CLI arg (polygon_only / yfinance_only / fred / auto) and knows definitively which provider each row came from, but doesn't persist the label. Downstream consumers that need to know the canonical source resort to heuristics ("is VWAP populated?", coincidentally reliable today because polygon writes VWAP and yfinance doesn't, but not contractual).

This PR stamps a `source` string column on every record:
• `_fetch_polygon_closes` → "polygon"
• `_fetch_polygon_closes_per_ticker` → "polygon"
• `_fetch_fred_index_closes` → "fred"
• `_fetch_yfinance_closes` → "yfinance"

Backward compatibility
Schema change is additive (new column only) per CLAUDE.md S3 contract rules. Verified downstream consumers handle this gracefully:
• `builders/daily_append.py`: explicit column-select via `OHLCV_COLS = ["Open", "High", "Low", "Close", "Volume", "VWAP"]` filter at lines 851/854/869/915. New `source` column is silently ignored.
• `features/compute.py`: reads selected columns by name, doesn't iterate.
• `sf_preflight.py`: only checks SPY existence in the index, doesn't iterate columns.

Companion
Dashboard PR #51 (alpha-engine-dashboard) reads this column directly to populate the per-feature `Source` column on `/Feature_Store`. It currently uses a VWAP-presence heuristic as a fallback for older parquets that pre-date this schema bump; once a few morning/EOD passes write labeled parquets, the heuristic becomes the never-fires fallback path.

Test plan
- `source=polygon` populated for all rows
- `source=yfinance` populated for all rows
- `python -c "import boto3,io,pandas as pd; df = pd.read_parquet(io.BytesIO(boto3.client('s3').get_object(Bucket='alpha-engine-research', Key='staging/daily_closes/<date>.parquet')['Body'].read())); print(df.source.value_counts())"`

🤖 Generated with Claude Code