ci(simulate) Part 5: wire QR_SKIP_OSAP + add compute/cache/osap to cache paths (closes 4th external-data loop) #241
Merged
dackclup pushed a commit that referenced this pull request on May 24, 2026
…al-data loop)

PR #241's Part 5 (QR_SKIP_OSAP) STILL didn't prevent simulate cancellation: it cancelled at 45m14s on the very same PR with all four prior skip vars active.

ci-triage-engineer session-6 root-cause: `compute/ingest/cross_source.py:fetch_yfinance_market_cap` calls `yf.Ticker(ticker).info` 502 times SERIALLY in `compute/main.py:cross_source_validate_market_cap`. Even though `compute/cache/yfinance_info` IS in both workflows' cache `path:` blocks (cron writes the cache; simulate restores it), the cache has a 24-hour freshness TTL inside `_cache_read`. The weekly cron writes the cache Friday 22:00 UTC; a Sunday simulate's restore gives 39-hour-old entries → fails the freshness check → live `yf.Ticker(ticker).info` fetch for all 502 tickers. Cost: 502 tickers × 2-8 seconds per ticker cold (tenacity 3 attempts × `wait_exponential(min=1, max=10)`) = 17-67 minutes alone, enough to fill the entire 45m budget by itself.

This is the FIFTH external-data loop in compute/main.py. The four prior skip vars (FORM4_FETCH_SKIP + QR_SKIP_TIER2 + QR_SKIP_FUNDAMENTALS + QR_SKIP_OSAP from PR #230 + this PR's Part 5) all correctly skip their respective loops. But they don't touch the cross_source path.

Fix: add QR_SKIP_CROSS_SOURCE=1 escape hatch at the TOP of fetch_yfinance_market_cap:
(a) When set + stale-but-present cache entry exists → read the JSON directly (bypassing the 24h TTL in `_cache_read`)
(b) When set + cache is genuinely empty → return None (skip the cross-source validation entirely; semantically identical to a cold-cache fetch failure, which the call site already handles per the existing graceful-degradation pattern)

Wired in `.github/workflows/pre-merge-prod-sim.yml` env block alongside the four prior skip vars.

CLAUDE.md §Gotchas rewritten: "4-env-var combo" → "5-env-var combo" with full read-site mapping. PHASE_STATUS_INFLIGHT.md entry expanded to record Part 6 alongside Part 5.

Together the FIVE env vars now collectively cover ALL FIVE independent external-data loops in `compute/main.py`:

Loop                    | Skip var
------------------------|----------------------
Form-4                  | FORM4_FETCH_SKIP
Tier-2 (10-K + 8-K)     | QR_SKIP_TIER2
Fundamentals × 2 (SEC)  | QR_SKIP_FUNDAMENTALS
OSAP bulk download      | QR_SKIP_OSAP
Cross-source yfinance   | QR_SKIP_CROSS_SOURCE ← NEW

Cron path UNAFFECTED: weekly compute-rankings.yml doesn't set any QR_SKIP_* var, so full live fetches still run there once a week.

Tests: smoke-verified that QR_SKIP_CROSS_SOURCE check is at line 24 of fetch_yfinance_market_cap (well BEFORE `_cache_read` at line 48). Cold-cache call returns None (no live fetch) when env var is set. ruff clean. YAML parse-verified clean.

Live validation: re-pushing this PR triggers a fresh simulate with all 5 skip vars active + 4 of 5 caches restored from cron. Expected: simulate completes in under 25 minutes. If simulate STILL cancels after this, escalate to incident-commander + consider pragmatic backstop (bump timeout to 90m OR disable simulate path-filter for docs-only PRs).

https://claude.ai/code/session_01JwntEE4PNAXSMkZxRA9BB4
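The two escape-hatch branches (a) and (b) described in this commit can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `fetch_market_cap`, the one-JSON-file-per-ticker layout, the `market_cap` key and the injected `live_fetch` callable are all hypothetical stand-ins for `fetch_yfinance_market_cap` and its cache helpers. Only the env var name and the 24-hour TTL come from the text above.

```python
import json
import os
import time
from pathlib import Path
from typing import Callable, Optional

CACHE_TTL_SECONDS = 24 * 60 * 60  # the 24-hour freshness window described above


def fetch_market_cap(
    ticker: str,
    cache_dir: Path,
    live_fetch: Callable[[str], Optional[float]],
) -> Optional[float]:
    """Hypothetical stand-in for fetch_yfinance_market_cap."""
    cache_file = cache_dir / f"{ticker}.json"  # assumed layout: one JSON per ticker

    # Escape hatch runs first, before any freshness check.
    if os.environ.get("QR_SKIP_CROSS_SOURCE") == "1":
        if cache_file.exists():
            # (a) stale-but-present entry: read it directly, ignoring the TTL
            return json.loads(cache_file.read_text())["market_cap"]
        # (b) nothing cached: skip validation instead of fetching live
        return None

    # Normal path: serve a fresh cache entry, otherwise fetch live and store it.
    if cache_file.exists():
        age = time.time() - cache_file.stat().st_mtime
        if age < CACHE_TTL_SECONDS:
            return json.loads(cache_file.read_text())["market_cap"]
    value = live_fetch(ticker)
    if value is not None:
        cache_file.write_text(json.dumps({"market_cap": value}))
    return value
```

Injecting `live_fetch` keeps the sketch testable offline; with the variable set, the callable is never invoked, which is the property the smoke test in this commit checks.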
dackclup pushed a commit that referenced this pull request on May 24, 2026
Part 6's live-fire on PR #241 STILL cancelled at 45m15s. The 5 skip env vars are all set + read-site verified by direct smoke tests, but the cancellation recurs. Four iterations of "walk the next loop" haven't surfaced a sixth obvious culprit; time for the pragmatic backstop the ci-triage-engineer flagged as Option B in session 6.

Bump `timeout-minutes: 45 → 90` in pre-merge-prod-sim.yml.

Rationale:
- Simulate is informational-only per workflow comment line 24 ("This check is intentionally NOT a required status"); a wider budget doesn't block PR merge
- Cron uses timeout-minutes: 150; simulate at 90 stays well below
- If 90m STILL cancels, we have HARD evidence of one of:
  (a) the cache restore is genuinely empty (cron skipped a week, or the new path additions invalidated the key), so all 5 fall-through-to-live paths in the QR_SKIP_* helpers trigger live fetches anyway
  (b) there's a 6th external-data loop we haven't identified
  (c) the env vars aren't being propagated to the running process
- 50-70m successful completion surfaces the timing breakdown we need to root-cause definitively; cancellation at 45m hides it

This is NOT a permanent fix; it's diagnostic headroom + a backstop while the structural issue is identified. The five skip env vars remain in place (FORM4_FETCH_SKIP + QR_SKIP_TIER2 + QR_SKIP_FUNDAMENTALS + QR_SKIP_OSAP + QR_SKIP_CROSS_SOURCE) so the existing fast-path still kicks in when caches are warm.

Inline workflow comment expanded to document the recurrence history (PRs #230 / #238 / #241 all hit the 45m cap despite each round's fix) + the rationale for the 90m backstop.

Follow-ups (separate PRs):
- ci-triage-engineer session-7 with hard ask: pull the actual job log via raw GitHub API (curl + GITHUB_TOKEN if available) and identify which step is at minute 44 when GitHub force-cancels
- If 90m runs succeed: post-merge audit of where the time actually went, name the 6th loop (if any), wire its skip var
- If 90m runs ALSO cancel: escalate to incident-commander; cancellation at 90m means something is genuinely broken in cache restore OR env propagation, not just "one more loop"

https://claude.ai/code/session_01JwntEE4PNAXSMkZxRA9BB4
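The first follow-up (find the step that is active at minute 44) could be approached along these lines. The sketch assumes a job object shaped like the entries returned by the GitHub REST "list jobs for a workflow run" endpoint, where the job and each step carry ISO 8601 `started_at` / `completed_at` timestamps. Fetching that payload is not shown, and `step_running_at` is a hypothetical helper name.

```python
from datetime import datetime, timedelta
from typing import Optional


def _parse(ts: str) -> datetime:
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def step_running_at(job: dict, minute: float) -> Optional[str]:
    """Return the name of the step active `minute` minutes after the job started."""
    probe = _parse(job["started_at"]) + timedelta(minutes=minute)
    for step in job.get("steps", []):
        if not step.get("started_at"):
            continue  # step never started (for example, skipped after a cancel)
        start = _parse(step["started_at"])
        end = _parse(step["completed_at"]) if step.get("completed_at") else None
        # A step with no completion time is treated as still running.
        if start <= probe and (end is None or probe < end):
            return step["name"]
    return None
```

Calling `step_running_at(job, 44)` on the cancelled run's job would name the step to investigate, which is the timing evidence the 45m cancellation currently hides.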
…che paths (closes 4th external-data loop)

PR #230's Parts 2 + 4 (QR_SKIP_TIER2 + QR_SKIP_FUNDAMENTALS) killed three SEC EDGAR loops but left the 4th external-data fetch (OSAP / openassetpricing.com) running unconditionally on every simulate. PR #238's simulate cancelled at 45m16s despite Parts 2 + 4 being in place (smoking-gun proof that one more loop was unkilled).

ci-triage-engineer session-6 root-cause: `compute/cache/osap/` was ABSENT from BOTH workflow cache `path:` blocks (compute-rankings.yml + pre-merge-prod-sim.yml). The weekly cron writes the OSAP parquet to the runner's LOCAL disk via osap.py:108-110, but actions/cache@v5 never uploaded that path to GitHub's cache store. Every simulate therefore started with NO cached OSAP parquet → `_is_fresh()` returned False → live `openassetpricing.OpenAP().dl_port("op", "pandas")` bulk download from openassetpricing.com.

The host has no SLA (research project, not SEC EDGAR / not PyPI). Tenacity policy: `stop_after_attempt(2)` × `stop_after_delay(30)` allows up to 60s wall-clock per call in best case. A multi-MB CSV covering ~1,188 signals × hundreds of months can saturate that budget under throttling, and on top of any small accumulated cost from the other steps pushes total above the 45m cap.

Why local smoke-test passed but CI failed: local runs had `compute/cache/osap/returns.parquet` on disk from prior compute runs → `_is_fresh()` hit instantly → zero network. CI runners are ephemeral: each gets a fresh disk and restores only the paths in the actions/cache `path:` block, which didn't include osap/. The PR #230 analysis correctly identified the three SEC EDGAR loops but missed OSAP because it's a different external host + single-shot bulk download + wrapped in graceful-degradation try/except that makes failure silent.

Two-part fix:

(a) compute/cache/osap added to BOTH workflow cache `path:` blocks so the weekly cron's parquet becomes part of the cache artifact and simulate restores it warm.

(b) QR_SKIP_OSAP=1 escape hatch in `fetch_osap_returns` (mirrors QR_SKIP_FUNDAMENTALS pattern):
- Added `import os` to osap.py
- Check at TOP of function BEFORE the `_is_fresh` gate
- When set + `force_refresh=False` + cached parquet exists, return it unconditionally without the 31-day freshness check AND without the live download
- Falls through to live fetch if NO cache exists (cold-runner / cache-evicted scenario; graceful-degradation logged at WARNING level)
- Required-columns validation + signals filter + as_of slice apply the same way as the normal cache-hit branch
- Wired in `pre-merge-prod-sim.yml` env block alongside FORM4_FETCH_SKIP + QR_SKIP_TIER2 + QR_SKIP_FUNDAMENTALS

CLAUDE.md §Gotchas: previous FORM4_FETCH_SKIP-only entry rewritten as a "CI escape-hatch env-var combo for simulate" entry listing all four env vars with read sites + the new expected steady-state (8-15 min warm-cache vs the pre-fix 45-min cap breach).

PHASE_STATUS_INFLIGHT.md (PR #237 convention): this PR's in-flight entry lives in the side-file's "In flight (current)" section per the established pattern. No CLAUDE.md §Phase status edit needed; the §Gotcha update IS the substance change that satisfies §Conventions "ship with every PR" lockstep.

Cron path UNAFFECTED: weekly compute-rankings.yml doesn't set any QR_SKIP env var, so full live fetch runs there once a week, populating the warm cache for future simulate restores.

Live validation: re-pushing this PR + landing it triggers a fresh simulate run with all 4 skip vars active + `compute/cache/osap` restored from the cron's artifact. Expected: simulate completes in under 25 minutes.

Tests: smoke-verified that QR_SKIP_OSAP check is at line 18 of `fetch_osap_returns` (well BEFORE the `_is_fresh` at line 54). ruff clean. YAML parse-verified clean. Existing offline OSAP tests (tests/test_features/test_fundamentals.py etc) trivially unaffected.

https://claude.ai/code/session_01JwntEE4PNAXSMkZxRA9BB4
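The root cause in part (a), a cache directory written on disk but missing from the workflow `path:` block, is the kind of drift a small check could flag. The following is a sketch under assumptions: `uncovered_cache_dirs` is a hypothetical helper, and both inputs (the directories the compute code writes, and the entries of a workflow's cache `path:` block) are passed in as plain strings rather than parsed from the YAML.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def uncovered_cache_dirs(written: Iterable[str], cached_paths: Iterable[str]) -> List[str]:
    """Return the written directories that no workflow cache path covers."""
    cached = [PurePosixPath(p.rstrip("/")) for p in cached_paths]
    missing = []
    for d in written:
        target = PurePosixPath(d.rstrip("/"))
        # Covered if listed exactly, or if any ancestor directory is listed.
        covered = any(target == c or c in target.parents for c in cached)
        if not covered:
            missing.append(d)
    return missing
```

With `written=["compute/cache/osap", "compute/cache/yfinance_info"]` and a `path:` block containing only `compute/cache/yfinance_info`, the helper reports `compute/cache/osap`, which mirrors the pre-fix state described in this commit.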
6509a96 to
db00aae
dackclup added a commit that referenced this pull request on May 24, 2026
…bagent count 15 → 18 + INFLIGHT.md housekeeping (#243)

User called out a recurring error from session 7: I repeatedly stated "cron Sun 22:00 UTC" in summaries despite the actual workflow schedule being Mon-Fri only.

ROOT CAUSE: the handoff prompt at session start said "Next prod cron: Sun 2026-05-24 22:00 UTC (cron-#4)" and I echoed that without verifying against `.github/workflows/compute-rankings.yml`. The YAML has read `"0 22 * * 1-5"` from initial commit (verified via `git log --all -p`). Inline comment in the workflow even says "Weekends skipped (no new trading data)".

Stale "Sunday/Sun 22:00 UTC" references across 4 files corrected:
- CLAUDE.md:31 §Stack: "cron Sun 22:00 UTC" → "cron Mon-Fri 22:00 UTC"
- docs/RESEARCH_FINDINGS.md:854: "WEEKLY (GitHub Actions, Sunday 22:00 UTC)" → "WEEKDAY (GitHub Actions, Mon-Fri 22:00 UTC; weekends skipped)"
- docs/ARCHITECTURE.md:7 mermaid diagram: "Sun 22:00 UTC" → "Mon-Fri 22:00 UTC" + edge "run weekly" → "run weekdays"
- docs/stock_ranking_knowledge.md:993: "Weekly Sunday 22:00 UTC" → "Weekday Mon-Fri 22:00 UTC: Main compute cron (weekends skipped — no new trading data)"

Companion stale-info fix found during audit:
- AGENTS.md:1294: "The 15 subagents under `.claude/agents/`" → "The 18 subagents under `.claude/agents/`". The roster expanded to 18 in PR #225 (ci-triage-engineer + vercel-preview-auditor + literature-searcher). AGENTS.md:91 already said 18; line 1294 was unsynced drift.

INFLIGHT.md housekeeping (per PR #237 convention: append-only with periodic move from "In flight" → "Merged" sub-section):
- PR #241 (simulate Parts 5+6+7, `e9d7836`) → moved to Merged with consolidated 4-iteration summary preserving the QR_SKIP_OSAP + QR_SKIP_CROSS_SOURCE + timeout-45→90 history
- PR #242 (light-mode soften + Strong Buy nowrap + StockLogo square, `a30c017`) → moved to Merged, original entry text intact
- Duplicate "## Merged (awaiting housekeeping move to CLAUDE.md)" header at line 257 deleted (was a rebase-artifact duplicate from the PR #239 + PR #240 chronology resolution); file now has exactly ONE Merged section
- Old "PR (this PR) — Simulate Parts 5+6+7" body block excised (lines 201-300 of the post-rebase file), replaced by this PR's clean Doc-staleness sweep entry in the In-flight section

Audit completeness:
- Schema version `0.10.2-phase4.5e`: verified current across CLAUDE.md, PHASE_STATUS.md, AGENTS.md (no stale references)
- Defense layer "32 boolean flags emitted": verified current
- Subagent file count = 19 in `.claude/agents/` (18 agent files + 1 README.md), which corresponds to 18 subagents; AGENTS.md:91 + :1294 now both consistent

No compute / schema / scoring / valuation / Python / TypeScript code change. Doc-only PR. PHASE_STATUS_INFLIGHT.md side-file satisfies §Conventions lockstep.

https://claude.ai/code/session_01JwntEE4PNAXSMkZxRA9BB4

Co-authored-by: Claude <noreply@anthropic.com>
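The schedule mix-up this commit corrects can be checked mechanically from the cron expression itself. Below is a minimal sketch that inspects only the day-of-week field and supports only numeric values, comma lists and ranges (no names, no step syntax); `cron_runs_on` is a hypothetical helper, not part of the repository.

```python
def cron_runs_on(expr: str, weekday: int) -> bool:
    """Check the day-of-week field of a 5-field cron expression.

    `weekday` uses cron numbering: 0 or 7 = Sunday, 1 = Monday, ..., 6 = Saturday.
    """
    dow_field = expr.split()[4]
    if dow_field == "*":
        return True
    target = weekday % 7  # fold 7 (Sunday) onto 0
    for part in dow_field.split(","):
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-"))
            days = range(lo, hi + 1)
        else:
            days = [int(part)]
        if any(d % 7 == target for d in days):
            return True
    return False
```

For the schedule quoted above, `cron_runs_on("0 22 * * 1-5", 0)` is false (no Sunday run) while `cron_runs_on("0 22 * * 1-5", 5)` is true, matching the workflow's "Weekends skipped" comment.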
Pre-merge production simulation
Diff vs main
Main baseline: Top-10 movers (sorted by |Δcomposite_score|)
Tickers in PR only (1):
Summary
Closes the 4th external-data loop missed by PR #230's Parts 2 + 4. PR #238's simulate cancelled at 45m16s despite `QR_SKIP_TIER2` + `QR_SKIP_FUNDAMENTALS` being in place: smoking-gun proof that one more loop was unkilled. `ci-triage-engineer` session-6 deep-dive identified the loop:

- `FORM4_FETCH_SKIP`
- `QR_SKIP_TIER2` (PR #230 Part 2)
- `QR_SKIP_FUNDAMENTALS` (PR #230 Part 4)

Why local smoke-test passed but CI failed
Local runs had `compute/cache/osap/returns.parquet` on disk from prior compute runs → `_is_fresh()` hit instantly → zero network. CI runners are ephemeral: each gets a fresh disk and restores only the paths in the `actions/cache` `path:` block, which didn't include `compute/cache/osap/`. PR #230's analysis correctly identified the three SEC EDGAR loops but missed OSAP because it's a different external host + single-shot bulk download + wrapped in graceful-degradation try/except that makes failure silent.

Two-part fix
(a) `compute/cache/osap` added to BOTH workflow cache `path:` blocks

- `compute-rankings.yml`: `actions/cache@v5` (save+restore)
- `pre-merge-prod-sim.yml`: `actions/cache/restore@v5` (restore-only)

(b) `QR_SKIP_OSAP=1` escape hatch in `fetch_osap_returns` (mirrors QR_SKIP_FUNDAMENTALS pattern)

- `import os` added to `compute/ingest/osap.py`
- Check at TOP of `fetch_osap_returns` BEFORE the `_is_fresh` gate
- When set + `force_refresh=False` + cached parquet exists → return unconditionally without the 31-day freshness check AND without the live download
- Wired in `pre-merge-prod-sim.yml` env block alongside the existing three skip vars

Files touched
- `compute/ingest/osap.py`: `import os` + escape hatch block in `fetch_osap_returns`
- `.github/workflows/pre-merge-prod-sim.yml`: `QR_SKIP_OSAP: "1"` env + `compute/cache/osap` cache path
- `.github/workflows/compute-rankings.yml`: `compute/cache/osap` cache path
- `CLAUDE.md`
- `PHASE_STATUS_INFLIGHT.md`

No schema / scoring / valuation / behavior change. Defense layer flag count UNCHANGED at 32. Cron path UNAFFECTED: weekly cron doesn't set any `QR_SKIP_*` var, so full live fetch runs there.

CLAUDE.md §Gotchas: consolidated 4-env-var combo
Previous `FORM4_FETCH_SKIP=1`-only Gotcha rewritten as:

Skip var               | Read site
-----------------------|---------------------------------------------------------
FORM4_FETCH_SKIP       | compute/main.py:840
QR_SKIP_TIER2          | compute/scoring/tier2.py:162
QR_SKIP_FUNDAMENTALS   | compute/ingest/fundamentals.py (both fetch functions)
QR_SKIP_OSAP           | compute/ingest/osap.py:fetch_osap_returns

Together they cover the four independent external-data loops. Expected post-fix steady-state: 8-15 min on warm cache (vs the pre-fix 45-min cap breach).
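Since the fast path depends on all four variables actually reaching the running process, a small report at startup could make that visible in the job log. This is a sketch only: `skip_var_report` is a hypothetical helper, the mapping simply restates the read sites listed above, and treating `"1"` as the enabling value is an assumption based on the `QR_SKIP_OSAP: "1"` workflow entry (each read site may parse its variable differently).

```python
import os
from typing import Dict, Mapping, Optional

# Skip var -> read site, restated from the Gotchas mapping above.
SKIP_VAR_READ_SITES = {
    "FORM4_FETCH_SKIP": "compute/main.py:840",
    "QR_SKIP_TIER2": "compute/scoring/tier2.py:162",
    "QR_SKIP_FUNDAMENTALS": "compute/ingest/fundamentals.py",
    "QR_SKIP_OSAP": "compute/ingest/osap.py:fetch_osap_returns",
}


def skip_var_report(env: Optional[Mapping[str, str]] = None) -> Dict[str, bool]:
    """Map each skip var to whether it is set to "1" in the given environment."""
    env = os.environ if env is None else env
    # Assumption: "1" is the enabling value at every read site.
    return {name: env.get(name) == "1" for name in SKIP_VAR_READ_SITES}
```

Logging `skip_var_report()` once at the top of a run would show directly whether any variable was dropped between the workflow `env:` block and the Python process.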
Test plan
- `ruff check compute/`: clean
- `QR_SKIP_OSAP` check is at line 18 of `fetch_osap_returns` (BEFORE `_is_fresh` at line 54)
- `git diff --stat` confirms scope (5 files / +131 / −9)
- (… `compute/cache/osap` in restore path)

Live validation expectation
Pre-fix recurrence on simulate workflow:
If simulate STILL cancels on this PR, the next investigation target is the cache-restore step itself (cache key eviction OR a 5th external-data loop that's somehow even more hidden).
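If the cache-restore step becomes the investigation target, one cheap signal is whether the restored directories contain any files before compute starts. The sketch below counts files per cache subdirectory; `cache_file_counts` is a hypothetical helper, and the subdirectory names passed to it (for example `osap` under `compute/cache`) are taken from the paths mentioned in this PR.

```python
from pathlib import Path
from typing import Dict, Iterable


def cache_file_counts(root: Path, subdirs: Iterable[str]) -> Dict[str, int]:
    """Count regular files under each cache subdirectory; 0 if it is missing."""
    counts = {}
    for name in subdirs:
        d = root / name
        if d.is_dir():
            counts[name] = sum(1 for p in d.rglob("*") if p.is_file())
        else:
            counts[name] = 0  # directory absent: nothing was restored
    return counts
```

A zero for any entry right after the restore step would point at cache key eviction or a missing `path:` entry rather than at another hidden fetch loop.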
Out of scope (follow-ups)
- … `QR_SKIP_*` env var (would catch this class of regression structurally)
- … `PHASE_STATUS_INFLIGHT.md` "In flight" → "Merged" section (separate housekeeping PR after merge)

https://claude.ai/code/session_01JwntEE4PNAXSMkZxRA9BB4
Generated by Claude Code