feat(validation): Phase 4.6 task #2e - manipulation_index distribution shift #279
Merged
…n shift

Closes follow-up #2e from ``docs/research/historical-revalidation-harness.md``. Consumes the #2a rankings.json time-series loader (PR #278) and reports the manipulation_index distribution shift across the cron's lifetime: mean / std / quantiles + fire-rate by band (LOW [0,20), MODERATE [20,50), HIGH [50,∞)) per date + window-end deltas.

This answers the honest question: **has the cohort of flagged stocks materially changed across the cron's history?** A universe-mean drift > ~5 pts would signal Phase 4.5e weight recalibration is needed (per Q3 2026-08-19 cohort-audit gate).

## What ships

- `compute/validation/manipulation_distribution.py` (NEW, ~250 LOC):
  - `DistributionSummary` dataclass: per-date snapshot (n / mean / std / median / q25 / q75 / q95 / max + band counts + top-3 tickers)
  - `ShiftReport` dataclass: aggregate across window + first-to-last deltas (mean_delta / std_delta / high_count_delta) + note
  - `compute_manipulation_distribution_shift(start_date, end_date, repo=None)`: main entry; pure function wrapping #2a's loader
  - `format_shift_report(report, max_dates=20)`: human-readable text rendering; truncates long windows with a "..." marker
- `tests/test_validation/test_manipulation_distribution.py` (NEW, 11 tests):
  - Band-boundary constants pin (LOW < 20 ≤ MODERATE < 50 ≤ HIGH)
  - `_summarize_one_date` band partition correctness + empty + top-3 ordering
  - `compute_shift` empty-window / all-null-window / single-date / two-date-with-deltas paths (monkeypatched loader)
  - `format_shift_report` header + delta + cap rendering
  - Live-git smoke against the repo's recent cron commits

## Real-world artifact (live repo, window 2026-05-01 → 2026-05-27)

3 cron dates available on main:

| date | n | mean | std | p75 | p95 | HIGH | top |
|---|---|---|---|---|---|---|---|
| 2026-05-22 | 502 | 4.38 | 9.28 | 5.00 | 25.00 | 2 | SMCI=84.0, WAT=64.0, NVDA=48.0 |
| 2026-05-23 | 502 | 4.38 | 9.28 | 5.00 | 25.00 | 2 | SMCI=84.0, WAT=64.0, NVDA=48.0 |
| 2026-05-26 | 502 | 4.38 | 9.28 | 5.00 | 25.00 | 2 | SMCI=84.0, WAT=64.0, NVDA=48.0 |

Δmean=+0.00, Δstd=+0.00, ΔHIGH=+0 (2 → 2)

Distribution is **stable** across the window (expected for a 5-day horizon, since pillar inputs change slowly). Top-3 invariant: SMCI 84, WAT 64, NVDA 48, which matches the Phase 4.5f production-verified ``ManipulationRiskCard`` fire-rate snapshot. Universe mean 4.38 sits solidly in the LOW band; only 2 tickers are in the HIGH band (Phase 4.5f spec target: 1-3 stocks). No recalibration signal.

A longer window (≥ 90 days) would let this report detect drift; the chain is now ready when cron history accumulates.

## Hard rules preserved

- ✅ Rule 9: no schema change (read-only consumer of rankings.json)
- ✅ Rule 16: N/A (no scoring change)
- ✅ Rule 18: diagnostic surface ships in same PR
- ✅ License: pure stdlib + pandas; no new deps
- ✅ Universe: S&P 500 only

## Verification

- `pytest tests/test_validation/test_manipulation_distribution.py`: 11/11 pass in 0.70s
- `ruff check`: clean (linter trimmed unused imports + sorted)
- Live-git artifact above generated cleanly

## Next in the chain

Per ``docs/research/historical-revalidation-harness.md`` Future-work TODO:

- #2b forward-return computation from `compute/cache/prices/` (0.5d); gitignored cache, CI-only data
- #2c per-pillar IC at historical dates (1d, needs 2a + 2b)
- #2d PBO/DSR re-baseline via PR #275's `universe_provider` kwarg (1d, needs 2c)
- #2f honest-baseline report (0.5d, needs 2d)

#2e (this PR) is independent of 2b/2c; it could ship before, in parallel with, or after them.
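The band partition and top-3 ordering described under "What ships" can be sketched roughly as below. This is a minimal illustration, not the code in `manipulation_distribution.py`: the names `BandSummary`, `band_of`, and `summarize`, the tie-break on ticker, and the tickers `AAA` / `BBB` / `CCC` are assumptions made for the example. Only the band boundaries (LOW [0,20), MODERATE [20,50), HIGH [50,∞)) come from the PR text.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import fmean

LOW_MAX = 20.0       # LOW is [0, 20)
MODERATE_MAX = 50.0  # MODERATE is [20, 50); HIGH is [50, inf)


@dataclass(frozen=True)
class BandSummary:
    """Hypothetical, reduced stand-in for the PR's DistributionSummary."""
    n: int
    mean: float
    low: int
    moderate: int
    high: int
    top3: tuple[tuple[str, float], ...]


def band_of(score: float) -> str:
    """Map one manipulation_index value to its band (half-open intervals)."""
    if score < LOW_MAX:
        return "LOW"
    if score < MODERATE_MAX:
        return "MODERATE"
    return "HIGH"


def summarize(scores: dict[str, float | None]) -> BandSummary | None:
    """Summarize one date; returns None when every score is null."""
    valid = {t: s for t, s in scores.items() if s is not None}
    if not valid:
        return None
    counts = {"LOW": 0, "MODERATE": 0, "HIGH": 0}
    for s in valid.values():
        counts[band_of(s)] += 1
    # Highest score first; ties broken by ticker (an assumption).
    ranked = sorted(valid.items(), key=lambda kv: (-kv[1], kv[0]))
    return BandSummary(
        n=len(valid),
        mean=fmean(valid.values()),
        low=counts["LOW"],
        moderate=counts["MODERATE"],
        high=counts["HIGH"],
        top3=tuple(ranked[:3]),
    )
```

Under these boundaries a score of 48.0 (NVDA in the table above) counts as MODERATE, which is consistent with the artifact reporting NVDA in the top 3 while the HIGH count stays at 2.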
This was referenced May 27, 2026
dackclup added a commit that referenced this pull request on May 27, 2026
…tes (#281)

New `compute/validation/historical_ic.py` orchestrator pairs PR #278's `load_ranking_history` (ranking at T) with PR #280's `compute_forward_returns_batch` (realized return at T + horizon) and computes per-pillar Spearman IC across the historical window. This closes the IC re-baseline half of the Phase 4.6 chain.

API:
- `compute_pillar_ic(scores, returns, *, method, min_tickers)`: pure cross-sectional IC for one (pillar, date) pair
- `compute_historical_ic_report(start, end, *, horizon_months, pillars, ...)`: walks rankings.json snapshots + forward returns cache, aggregates into `HistoricalICReport`
- `format_ic_report(report)`: human-readable text rendering
- `PillarICEntry` / `PillarICSummary` / `HistoricalICReport`: three frozen-dataclass carriers

Spearman computed as Pearson on rank-transformed series (Spearman 1904 + Conover 1999 §5.4) to avoid pulling scipy into the dep set (QuantRank ships without scipy; pandas' `Series.corr(method='spearman')` requires it transitively).

Drops with descriptive notes:
- cross-section < MIN_TICKERS_PER_DATE = 30 (Grinold-Kahn 2000 §4.2)
- None / NaN / inf in either input
- constant inputs (std=0 → correlation undefined)

Aggregates per pillar: mean / std / median / min / max / IC IR / hit-rate. IC IR = mean/std × sqrt(n_dates) (Grinold-Kahn 2000 §4.4). Hit-rate = fraction of dates with strictly positive IC.

Honest-baseline disclaimer per Research Report v1.0:
- IC reported here is NAIVE: no costs / slippage / sector neutralization. Real net-of-cost IC typically 30-50% smaller per McLean-Pontiff 2016 JF post-publication decay
- The historical universe MUST come from PR #274 members_at() to avoid survivorship bias (Hou-Xue-Zhang 2020 RFS); the orchestrator reads the historical universe FROM rankings.json at as-of T, which is correct by construction (the snapshot itself is the historical universe)
- Report is a TIME SERIES + summary, not a single headline number

Tests: 28 new (28 passing). Coverage: module constants, pure IC computation edge cases (perfect ±1.0, constant inputs, NaN drops, below-min cross-section, method validation), summary aggregation math (IC IR formula pinned, hit-rate semantics), orchestrator full-path (one date / multi-date / missing pillar / malformed JSON), text rendering, and a live-git smoke that auto-degrades gracefully when the gitignored price cache is absent.

Schema impact: zero. No new Pydantic / TS / snapshot field.
Production-wiring impact: zero. No compute/main.py import. The orchestrator is purely a validation / re-baseline tool. Downstream PRs (#2d PBO/DSR re-baseline + #2f honest-baseline report) consume the output.

Phase 4.6 chain status: 5 of 6 items now landed (#1/#2 PR #277, #2a PR #278, #2b PR #280, #2c this PR, #2e PR #279; #2d gate kwarg shipped PR #275). #4 PBO/DSR re-baseline needs a warm-CI execution to publish actual numbers; #6 honest-baseline doc closes the chain.

PHASE_STATUS_INFLIGHT.md updated per PR #237 side-file convention. Harness doc TODO list updated: 5 of 6 items now landed.

https://claude.ai/code/session_01AGU8d6pm4u2fQQ5cebg9qa

Co-authored-by: Claude <noreply@anthropic.com>
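The scipy-free Spearman computation and the per-pillar aggregates described in the commit message above could look roughly like this. It is a stdlib-only sketch under stated assumptions, not the contents of `historical_ic.py`: the function names `average_ranks`, `pearson`, `spearman_ic`, `ic_ir`, and `hit_rate` are hypothetical, ties are given average ranks, and the sample standard deviation is used for IC IR (the commit message does not say which estimator is used).

```python
from __future__ import annotations

import math
from statistics import fmean, stdev


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; None when either input is constant."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def spearman_ic(scores, returns, min_tickers=30):
    """Spearman IC as Pearson on ranks, after dropping None / NaN / inf pairs."""
    pairs = [
        (s, r)
        for s, r in zip(scores, returns)
        if s is not None and r is not None and math.isfinite(s) and math.isfinite(r)
    ]
    if len(pairs) < min_tickers:
        return None  # cross-section too small
    xs, ys = zip(*pairs)
    return pearson(average_ranks(xs), average_ranks(ys))


def ic_ir(ics):
    """IC IR = mean / std * sqrt(n_dates); sample std is an assumption."""
    if len(ics) < 2:
        return None
    sd = stdev(ics)
    return None if sd == 0 else fmean(ics) / sd * math.sqrt(len(ics))


def hit_rate(ics):
    """Fraction of dates with strictly positive IC."""
    return None if not ics else sum(ic > 0 for ic in ics) / len(ics)
```

Because the rank transform only preserves order, any strictly increasing relationship between scores and returns yields an IC of 1.0 regardless of its shape, which is the property the "perfect ±1.0" edge-case tests would exercise.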
dackclup added a commit that referenced this pull request on May 27, 2026
…on + CLI (closing the chain) (#282)

Closes the Phase 4.6 honest re-validation harness structurally: 6 of 6 chain items now landed. The only remaining work is a warm-CI execution session that fills the TBD numeric cells.

New artifacts:
- docs/research/honest-baseline-2026-05-27.md: 10-section skeleton with TBD numeric cells in §2 (per-pillar IC) / §3 (PBO/DSR) / §4 (manipulation distribution) / §5 (survivorship-bias delta). Methodology + framing + honest-α ceiling + disclaimer ladder final-form. Citation block: Hou-Xue-Zhang 2020 RFS, McLean-Pontiff 2016 JF, Bailey-Lopez de Prado 2014 JPM, Bailey-Borwein-Lopez de Prado-Zhu 2014 AMS Notices, Grinold-Kahn 2000, Spearman 1904, Conover 1999, Kissell-Glantz 2003.
- scripts/generate_honest_baseline.py: argparse CLI that wires compute_historical_ic_report (PR #281) + compute_manipulation_distribution_shift (PR #279) end-to-end. Text mode emits the disclaimer banner to stderr; JSON mode embeds __banner__ in the payload. Exit codes: 0 (report produced), 1 (input validation), 2 (empty report; useful CI signal).
- tests/test_validation/test_generate_honest_baseline_cli.py: 17 tests covering argparse shape, _parse_date, exit codes, banner emission, JSON payload shape + α ceiling cells + disclaimer string, banner embedding, _report_to_payload with synthetic + populated manip reports, and a constant pin on the banner's 5 mandatory phrases (NAIVE / McLean-Pontiff / 2-5% / Rule 16 / S&P 500).

Honest-baseline disclaimer per Research Report v1.0 autonomous mission:
- IC / PBO / DSR figures NAIVE: no costs / slippage / sector neutralization
- Real net-of-cost IC typically 30-50% smaller per McLean-Pontiff 2016 JF 32% post-publication decay
- Honest net α ceiling: 2-5% per year (hard-coded into doc + JSON)
- Composite formula sacred (Rule 16): never replayed retroactively
- Universe = S&P 500 (502) only
- No trade recommendation of specific tickers; methodological report only

Schema impact: zero. No new Pydantic / TS / snapshot field.
Production-wiring impact: zero. No compute/main.py import.

Smoke run against the real repo's recent rankings.json (no live price cache): orchestrator walks 3 commits, returns n_dates_with_ic=0, and exit code 2 surfaces the missing-cache signal cleanly.

Phase 4.6 chain status: 6 of 6 items structurally landed:
- #1/#2 universe-drift first unit: PR #277
- #2a ranking_history loader: PR #278
- #2b forward_returns loader: PR #280
- #2c per-pillar IC orchestrator: PR #281
- #2d PBO/DSR gate kwarg: PR #275 (warm-CI execution pending)
- #2e manipulation_distribution shift: PR #279
- #2f honest-baseline skeleton + CLI: this PR

Deferred follow-ups (NOT in this PR): warm-CI execution session, --markdown writer mode, --include-pbo-dsr factor-return wiring.

PHASE_STATUS_INFLIGHT.md updated per PR #237 side-file convention. Harness doc TODO list: 6 of 6 items now landed.

https://claude.ai/code/session_01AGU8d6pm4u2fQQ5cebg9qa

Co-authored-by: Claude <noreply@anthropic.com>
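The exit-code contract and banner routing described for the CLI above can be illustrated with a small argparse sketch. Everything below is hypothetical apart from the behaviour stated in the commit message (exit codes 0 / 1 / 2, banner on stderr in text mode, `__banner__` embedded in JSON mode, and the `n_dates_with_ic` field): the flag names `--start`, `--end`, `--format`, the injected `build_report` callable, and the placeholder banner text are assumptions, and the sketch assumes an empty report is still written before exiting with 2.

```python
from __future__ import annotations

import argparse
import json
import sys
from datetime import date

# Placeholder text only; the real banner pins five mandatory phrases.
BANNER = "NAIVE figures: no costs / slippage / sector neutralization."

EXIT_OK, EXIT_BAD_INPUT, EXIT_EMPTY = 0, 1, 2


def main(argv, build_report, out=sys.stdout, err=sys.stderr) -> int:
    """Run the sketch CLI; build_report(start, end) must return a dict."""
    parser = argparse.ArgumentParser(prog="generate_honest_baseline")
    parser.add_argument("--start", required=True)
    parser.add_argument("--end", required=True)
    parser.add_argument("--format", choices=("text", "json"), default="text")
    args = parser.parse_args(argv)

    try:
        start = date.fromisoformat(args.start)
        end = date.fromisoformat(args.end)
    except ValueError:
        print("dates must be YYYY-MM-DD", file=err)
        return EXIT_BAD_INPUT
    if start > end:
        print("--start must not be after --end", file=err)
        return EXIT_BAD_INPUT

    report = build_report(start, end)
    if args.format == "json":
        # JSON mode: the banner travels inside the payload.
        json.dump({"__banner__": BANNER, **report}, out)
        out.write("\n")
    else:
        # Text mode: banner on stderr so stdout stays pipe-friendly.
        print(BANNER, file=err)
        for key, value in report.items():
            print(f"{key}: {value}", file=out)

    # An empty report is a distinct, non-zero signal for CI.
    return EXIT_EMPTY if report.get("n_dates_with_ic", 0) == 0 else EXIT_OK
```

One design point worth noting: argparse itself exits with status 2 on usage errors by default, so a real CLI that reserves 2 for "empty report" may need to handle parser errors explicitly to keep the two cases distinguishable.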
dackclup added a commit that referenced this pull request on May 27, 2026
…clones (#284)

Root cause: `test_compute_shift_live_repo_recent_window` asserts `report.n_dates >= 1` for window [2026-05-22, 2026-05-27], but CI's `actions/checkout@v6` defaults to fetch-depth=1 (shallow). On shallow clones, `git log -- frontend/public/data/rankings.json` returns only the HEAD commit, which (a) didn't touch rankings.json (release commit) and (b) is dated 2026-05-28 (outside the test window) → empty report → AssertionError → CI Python step exit code 1. This is the failure mode triggered on PR #283's main-push CI run (26526262716, Python job 78138619376, failing after 2m 43s). Sibling smoke tests are resilient because they use `HEAD` directly (`git show`) or don't constrain by a date window.

Fix: pytest.skip() when report.n_dates == 0 AND note == "empty window" (the signature of a shallow-clone walk). Full clones still exercise the real assertion. The skip message cites the CI checkout convention so future readers understand why the test is intentionally lenient.

Verified:
- Shallow clone (CI): 10 passed + 1 skipped (graceful)
- Full clone (sandbox / dev): 11 passed (assertion exercised)
- ruff: clean

Alternative considered (not adopted): bump `actions/checkout` fetch-depth to 0 (full history). Rejected because (a) the live-smoke tests are designed to skip when their data substrate isn't available (matches the no-price-cache pattern in `test_forward_returns.py` and `test_historical_ic.py`), (b) full git fetch adds 30-60s to CI cold start with diminishing returns, and (c) the universe of tests that need rankings.json history is small and bounded to validation/.

Phase 4.6 task #2e (PR #279) regression; applies the same pattern the other 4 live-smoke tests already use. No schema / compute / output JSON change.

https://claude.ai/code/session_01AGU8d6pm4u2fQQ5cebg9qa

Co-authored-by: Claude <noreply@anthropic.com>
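The skip-on-shallow-clone guard described in the fix above might be written along these lines. This is a sketch only: `FakeShiftReport`, `looks_like_shallow_clone`, and `assert_live_window` are hypothetical names, and the stand-in dataclass carries just the two fields the commit message mentions (`n_dates` and `note`). It assumes pytest is installed.

```python
from dataclasses import dataclass

import pytest


@dataclass(frozen=True)
class FakeShiftReport:
    """Hypothetical stand-in for ShiftReport with only the fields used here."""
    n_dates: int
    note: str


def looks_like_shallow_clone(report) -> bool:
    """True for the signature the fix keys on: no dates AND an empty-window note."""
    return report.n_dates == 0 and report.note == "empty window"


def assert_live_window(report) -> None:
    """Skip when git history is unavailable; otherwise run the real assertion."""
    if looks_like_shallow_clone(report):
        pytest.skip(
            "no rankings.json history in window; "
            "CI checkout is shallow (fetch-depth=1)"
        )
    assert report.n_dates >= 1
```

Keying the skip on both conditions, rather than on `n_dates == 0` alone, keeps the assertion active for other empty outcomes such as an all-null window, so those would still fail loudly on a full clone.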
Summary
Closes Phase 4.6 task #2e — the manipulation_index distribution shift report. Consumes #2a's rankings.json time-series loader (PR #278) and answers the honest question: has the cohort of flagged stocks materially changed across the cron's history?
A universe-mean drift > ~5 pts would signal Phase 4.5e weight recalibration is needed (per Q3 2026-08-19 cohort-audit gate).
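As a rough illustration of the drift check implied here, the sketch below computes first-to-last deltas over a window and compares the mean delta against a 5-point threshold. The names `DatePoint`, `window_deltas`, and `needs_recalibration` are hypothetical, the exact cutoff behind "~5 pts" is an assumption, and returning `None` for windows with fewer than two dates is a choice made for the example rather than documented behaviour.

```python
from __future__ import annotations

from dataclasses import dataclass

DRIFT_THRESHOLD_PTS = 5.0  # "~5 pts" in the summary; exact cutoff is assumed


@dataclass(frozen=True)
class DatePoint:
    """Hypothetical per-date record: ISO date plus the stats being compared."""
    date: str
    mean: float
    std: float
    high_count: int


def window_deltas(points: list[DatePoint]) -> dict[str, float] | None:
    """Last-minus-first deltas across the window; None if under two dates."""
    if len(points) < 2:
        return None
    ordered = sorted(points, key=lambda p: p.date)  # ISO dates sort correctly
    first, last = ordered[0], ordered[-1]
    return {
        "mean_delta": last.mean - first.mean,
        "std_delta": last.std - first.std,
        "high_count_delta": last.high_count - first.high_count,
    }


def needs_recalibration(deltas: dict[str, float]) -> bool:
    """Flag a universe-mean drift larger than the threshold, in either direction."""
    return abs(deltas["mean_delta"]) > DRIFT_THRESHOLD_PTS
```

A window in which every date reports the same mean, std, and HIGH count produces all-zero deltas and no recalibration flag.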
What ships

- `compute/validation/manipulation_distribution.py`: `DistributionSummary`, `ShiftReport`, `compute_manipulation_distribution_shift()`, `format_shift_report()`
- `tests/test_validation/test_manipulation_distribution.py`

Real-world artifact (live, window 2026-05-01 → 2026-05-27)

Findings:

- `ManipulationRiskCard` fire-rate snapshot

Hard rules preserved

Verification

- `pytest tests/test_validation/test_manipulation_distribution.py`
- `ruff check`

Next in chain (independent of #2b/2c/2d)

- `compute/cache/prices/`

Subscribe-after-open suggestion: same pattern as #271-#278.
Generated by Claude Code