feat(daily-closes): stamp source column on every row #159

Merged: cipher813 merged 1 commit into main from feat/daily-closes-source-column on May 5, 2026
@cipher813
Owner

Summary

The staging/daily_closes/{date}.parquet writer threads a --source CLI arg (polygon_only / yfinance_only / fred / auto) and knows definitively which provider each row came from, but doesn't persist the label. Downstream consumers that need to know the canonical source resort to heuristics ("is VWAP populated?" — coincidentally reliable today because polygon writes VWAP and yfinance doesn't, but it's not contractual).

This PR stamps a source string column on every record:

| Writer site | `source` value |
| --- | --- |
| polygon grouped-daily (`_fetch_polygon_closes`) | `"polygon"` |
| polygon per-ticker fallback (`_fetch_polygon_closes_per_ticker`) | `"polygon"` |
| FRED single-value index closes (`_fetch_fred_index_closes`) | `"fred"` |
| yfinance EOD batch (`_fetch_yfinance_closes`) | `"yfinance"` |
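
As a rough illustration of the mapping above, stamping could look like the sketch below. It is not the code from this PR: `stamp_source`, `KNOWN_SOURCES`, and the record fields (`ticker`, `Close`, `VWAP` as dict keys) are hypothetical, and it assumes each fetcher returns a list of plain dict records.

```python
from typing import Any

# Hypothetical label set, copied from the writer-site mapping in this PR.
KNOWN_SOURCES = {"polygon", "fred", "yfinance"}


def stamp_source(records: list[dict[str, Any]], source: str) -> list[dict[str, Any]]:
    """Return copies of `records` with a `source` key set on each one."""
    if source not in KNOWN_SOURCES:
        raise ValueError(f"unknown source label: {source!r}")
    # Copy each record so the caller's dicts are left untouched.
    return [{**record, "source": source} for record in records]


# Illustrative records only; the real fetchers and their fields are not shown in this PR.
polygon_rows = stamp_source([{"ticker": "SPY", "Close": 500.0, "VWAP": 499.8}], "polygon")
yfinance_rows = stamp_source([{"ticker": "QQQ", "Close": 430.0}], "yfinance")
combined = polygon_rows + yfinance_rows
```

Rejecting unknown labels at the write site is a design choice of this sketch; it keeps a typo from silently becoming a new provider name in the parquet.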

Backward compatibility

Schema change is additive (new column only) per CLAUDE.md S3 contract rules. Verified downstream consumers handle this gracefully:

  • builders/daily_append.py — explicit column-select via OHLCV_COLS = ["Open", "High", "Low", "Close", "Volume", "VWAP"] filter at lines 851/854/869/915. New source column is silently ignored.
  • features/compute.py — reads selected columns by name, doesn't iterate.
  • sf_preflight.py — only checks SPY existence in the index, doesn't iterate columns.
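
To show why an explicit column-select makes the new column harmless, here is a minimal sketch. The `OHLCV_COLS` list is quoted from `builders/daily_append.py` above; `select_ohlcv` and the sample frame are hypothetical, and the actual filter at the cited lines may be written differently.

```python
import pandas as pd

OHLCV_COLS = ["Open", "High", "Low", "Close", "Volume", "VWAP"]


def select_ohlcv(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the OHLCV columns that are present, in OHLCV_COLS order."""
    return df[[col for col in OHLCV_COLS if col in df.columns]]


# Illustrative staging frame that already carries the new `source` column.
staged = pd.DataFrame(
    {
        "Open": [1.0],
        "High": [2.0],
        "Low": [0.5],
        "Close": [1.5],
        "Volume": [100],
        "VWAP": [1.4],
        "source": ["polygon"],
    },
    index=["SPY"],
)
selected = select_ohlcv(staged)  # `source` is dropped without any special handling
```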

Companion

Dashboard PR #51 (alpha-engine-dashboard) reads this column directly to populate the per-feature Source column on /Feature_Store. It keeps the VWAP-presence heuristic as a fallback for older parquets that pre-date this schema bump; once a few morning/EOD passes have written labeled parquets, that fallback should no longer fire for current data.
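
The read-side logic described here might look roughly like the sketch below. This is an assumption about the shape of the dashboard code, not an excerpt from PR #51; `resolve_source` is a hypothetical name.

```python
import pandas as pd


def resolve_source(df: pd.DataFrame) -> pd.Series:
    """Prefer the persisted `source` column; otherwise infer it from VWAP presence."""
    if "source" in df.columns:
        return df["source"]
    # Legacy parquet: polygon rows carry VWAP, yfinance rows do not.
    if "VWAP" in df.columns:
        has_vwap = df["VWAP"].notna()
    else:
        has_vwap = pd.Series(False, index=df.index)
    return has_vwap.map({True: "polygon", False: "yfinance"}).rename("source")
```

A two-way heuristic like this cannot produce a third label such as `"fred"` for legacy files, which is one more reason to prefer the persisted column once it exists.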

Test plan

  • AST parse OK
  • Next polygon morning pass: source=polygon populated for all rows
  • Next yfinance EOD pass: source=yfinance populated for all rows
  • Smoke after first labeled write: `python -c "import boto3,io,pandas as pd; df = pd.read_parquet(io.BytesIO(boto3.client('s3').get_object(Bucket='alpha-engine-research', Key='staging/daily_closes/<date>.parquet')['Body'].read())); print(df.source.value_counts())"`
  • Verify daily_append + features/compute + sf_preflight don't barf on the new column
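
The "populated for all rows" checks above could also be expressed as a small reusable function instead of eyeballing `value_counts()` output. This is only a sketch: `check_source_column` is a hypothetical helper, and it works on an already-loaded DataFrame so the S3 read stays as in the one-liner.

```python
import pandas as pd


def check_source_column(df: pd.DataFrame, allowed: set[str]) -> dict[str, int]:
    """Fail if `source` is missing, null anywhere, or holds a label outside `allowed`."""
    if "source" not in df.columns:
        raise ValueError("parquet has no 'source' column")
    if df["source"].isna().any():
        raise ValueError("some rows have a null source")
    counts = df["source"].value_counts().to_dict()
    unexpected = set(counts) - allowed
    if unexpected:
        raise ValueError(f"unexpected source labels: {sorted(unexpected)}")
    return counts
```

For a polygon morning pass this would be called as `check_source_column(df, {"polygon"})`, widening `allowed` if a pass legitimately mixes providers.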

🤖 Generated with Claude Code

The staging/daily_closes/{date}.parquet writer threads `--source`
(polygon_only / yfinance_only / fred / auto) as a CLI arg and knows
definitively which provider each row came from, but doesn't persist
the label. Downstream consumers (predictor inference, dashboard,
backtester) that need to know the canonical source today resort to
heuristics like "is VWAP populated?" — works because polygon writes
VWAP and yfinance doesn't, but it's coincidental.

Adds a `source` column to every appended record:
  • polygon grouped-daily + per-ticker fallback → "polygon"
  • FRED single-value index closes → "fred"
  • yfinance EOD batch → "yfinance"

Schema change is additive (new column only) so backward-compatible
per CLAUDE.md S3 contract rules. Verified downstream consumers
(`builders/daily_append.py`, `features/compute.py`,
`sf_preflight.py`) all column-select by name (`OHLCV_COLS` filter
pattern at daily_append.py:851/854/869/915), so the new column is
silently ignored or carried along but not actively consumed.

Test plan
- [x] AST parse OK
- [ ] Next morning polygon pass writes `source=polygon` in every row;
      next EOD yfinance pass writes `source=yfinance`. Manual smoke:
      `python -c "import boto3,io,pandas as pd; df = pd.read_parquet(io.BytesIO(boto3.client('s3').get_object(Bucket='alpha-engine-research', Key='staging/daily_closes/<date>.parquet')['Body'].read())); print(df.source.value_counts())"`

Companion dashboard PR #51 (alpha-engine-dashboard) will read this
column directly with VWAP-presence as a fallback for older parquets
that pre-date the schema bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cipher813 merged commit d702bd2 into main on May 5, 2026
1 check passed
cipher813 deleted the feat/daily-closes-source-column branch on May 5, 2026 18:43
cipher813 added a commit that referenced this pull request on May 5, 2026:

The Phase 2 Lambda deploy job has been failing on:

  ERROR: Cannot find command 'git' - do you have 'git' installed and
  in your PATH?

The Lambda Python 3.12 base image (public.ecr.aws/lambda/python:3.12)
doesn't ship with git, but the Dockerfile uses `pip install ... @
git+https://...` which requires it. Surfaced today when PR #159's
post-merge Deploy fired against a fresh build (vs the prior in-flight
image which had a cached install layer that masked the gap).

Fix: apply the same one-line microdnf install of git that was applied
to the alpha-engine-research Dockerfile after PR #105's lib-public
flip. microdnf is the minimal package manager on the AL2023 base
image; image-size impact is ~25MB.

Out of scope: the FromPlatformFlagConstDisallowed warning (line 1)
about `--platform=linux/amd64` is a separate buildkit lint that
doesn't block builds — leave for a follow-up.

Test plan
- [x] Diff parity with alpha-engine-research Dockerfile (same line)
- [ ] Deploy workflow re-runs cleanly post-merge

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>