feat(ingest): Qlib scout — pyqlib MIT install + Alpha158 handler smoke + 158-feature manifest#119

Merged
dackclup merged 1 commit into main from claude/resume-quantrank-phase-4.5-Zh0pO
May 19, 2026

@dackclup
Owner

Summary

Phase 4j scout PR: 3rd of 4 factor-library scouts (OSAP ✅ #110, JKP ✅ #114, Qlib this PR, IPCA later). Ships the pyqlib install, a 158-feature manifest, and 6 offline tests. NO production wiring in this PR; the yfinance-to-Qlib BYO adapter and the full Alpha158 feature compute on the 502-ticker universe ship in a follow-on integration PR.

No new veto. Defense layer unchanged at 17. Top-5 rotation unchanged. Schema unchanged at 0.9.1-phase4h.2.

Pre-plan investigation results (verified 2026-05-19)

| # | Finding | Verdict |
|---|---|---|
| 1 | PyPI package | pyqlib 0.9.7 (also 0.9.6). Other candidate names (qlib, microsoft-qlib) return 404. |
| 2 | License | MIT, verified via wheel METADATA inspection (Classifier: License :: OSI Approved :: MIT License). No CC BY-NC complication like JKP. Safe for Phase 6+ commercial roadmap. |
| 3 | Data init | qlib.init(provider_uri=..., region="us"). NO public US data bundle: Qlib's default provider_uri covers CN A-share only; US universe is BYO via local .bin files. |
| 4 | Alpha158 surface | qlib.contrib.data.handler.Alpha158 → 158 columns. Manifest captured at scout time via Alpha158DL.get_feature_config()[1] and hardcoded; offline test 3 locks against upstream drift. |

🚨 Critical scope decision — NO @network test for this scout

Phase 4h scout (PR #110) and Phase 4i scout (PR #114) each had a @pytest.mark.network test that hit a remote CDN. Qlib has no remote CDN; its data flow is local-bin filesystem I/O. The originally planned synthetic-OHLCV → .bin conversion → init_qlib → Alpha158.fetch smoke test was DROPPED post-investigation: pyqlib's PyPI wheel does NOT bundle the scripts/dump_bin.py utility needed for OHLCV → .bin conversion. That scaffolding is integration-PR scope.

Replacement verification surface (test #3 below): the hardcoded ALPHA158_FEATURE_NAMES tuple is asserted against the runtime introspection from Alpha158DL.get_feature_config()[1]. This is a stronger drift detector than the dropped end-to-end test would have been: it fails on the first test run after any pyqlib upgrade that changes the feature set.
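A minimal sketch of that drift check follows. It is not the project's test: the helper names are hypothetical, the import path `qlib.contrib.data.loader` is an assumption (the PR only names `Alpha158DL.get_feature_config()[1]`), and the two tuples are stand-in data so the block runs without pyqlib installed.

```python
from typing import Sequence


def manifest_drift(
    expected: Sequence[str], actual: Sequence[str]
) -> tuple[list[str], list[str]]:
    """Return (missing, added): names dropped upstream and names new upstream."""
    missing = sorted(set(expected) - set(actual))
    added = sorted(set(actual) - set(expected))
    return missing, added


def runtime_alpha158_names() -> list[str]:
    """Read the feature names from an installed pyqlib (needs the [factors] extra)."""
    # Assumed import path; local import keeps this module loadable without pyqlib.
    from qlib.contrib.data.loader import Alpha158DL

    _fields, names = Alpha158DL.get_feature_config()
    return list(names)


# Stand-in data: a five-name manifest and a pretend upstream rename.
HARDCODED = ("KMID", "KLEN", "KMID2", "KUP", "KUP2")
UPSTREAM_AFTER_BUMP = ("KMID", "KLEN", "KMID2", "KUP", "KUP3")

missing, added = manifest_drift(HARDCODED, UPSTREAM_AFTER_BUMP)
print("missing:", missing, "added:", added)
```

In a real test the second argument would come from `runtime_alpha158_names()`, and a strict equality check on the two sequences would additionally catch reordering.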

⚠️ CI install footprint disclosure

pip install pyqlib pulls ~22 transitive deps. Heavy ones NET-NEW to QuantRank's tree:

| Dep | Approx size | Purpose |
|---|---|---|
| mlflow | ~20 MB | ML experiment tracking (not used by scout) |
| lightgbm | ~15 MB | Default Qlib ML model (not used by scout) |
| cvxpy | ~30 MB | Portfolio optimization (not used by scout) |
| pymongo | ~5 MB | Experiment store backend (not used by scout) |
| redis (Python client) | ~1 MB | Cache backend (not used by scout) |
| gym | ~10 MB | RL gym env (not used by scout) |
| jupyter + nbconvert | ~100 MB | Notebook tooling (not used by scout) |

Net CI install footprint bump: ~150-180 MB. None of these heavy deps are consumed by the scout — they come along because pyqlib doesn't expose a [minimal] extra upstream. CI cold-start latency bump is one-time per workflow; pip wheel caching mitigates subsequent runs.

Module-name choice locked

The new module is compute/ingest/qlib_features.py, NOT compute/ingest/qlib.py. Python's import resolution would treat the latter as the qlib package and shadow the actual installed PyPI package, breaking the entire factor-library integration. Distinct module name avoids the namespace collision.
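The shadowing risk can be reproduced in isolation. The sketch below uses hypothetical names (`demo_pkg` stands in for `qlib`, `compute_ingest` for `compute/ingest/`) and assumes the scenario where the directory holding the same-named file ends up ahead of site-packages on `sys.path`, for example when a file inside it is run directly as a script.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical names for illustration only; "demo_pkg" stands in for "qlib".
with tempfile.TemporaryDirectory() as root:
    site = Path(root) / "site-packages"      # plays the installed package location
    ingest = Path(root) / "compute_ingest"   # plays compute/ingest/
    (site / "demo_pkg").mkdir(parents=True)
    (site / "demo_pkg" / "__init__.py").write_text("ORIGIN = 'installed package'\n")
    ingest.mkdir()
    (ingest / "demo_pkg.py").write_text("ORIGIN = 'local module'\n")

    # Put the local directory ahead of the installed one on the search path.
    sys.path[:0] = [str(ingest), str(site)]
    try:
        shadowed = importlib.import_module("demo_pkg").ORIGIN
    finally:
        # Undo every global change so the demo leaves no trace.
        sys.path.remove(str(ingest))
        sys.path.remove(str(site))
        sys.modules.pop("demo_pkg", None)

print(shadowed)  # the first match on sys.path wins: 'local module'
```

Using a distinct file name such as `qlib_features.py` removes the ambiguity regardless of how the search path is ordered.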

Files

| Path | Action | LOC |
|---|---|---|
| compute/ingest/qlib_features.py | NEW: module docstring + ALPHA158_FEATURE_NAMES + init_qlib + fetch_alpha158_features | 186 |
| compute/config.py | Edit: 3 new constants in a # --- Phase 4j scout --- block | +23 |
| tests/test_ingest/test_qlib_features.py | NEW: 6 offline tests | 113 |
| pyproject.toml | Edit: add pyqlib>=0.9.7,<0.10 to [factors] extra | +7 |
| PHASE_STATUS.md | Edit: row 4 sub-bullet for Phase 4j scout | +1 |
| Total | | ~330 LOC |

Within scout-style budget (Phase 4i scout was ~360 LOC; Phase 4j is leaner because no @network test scaffolding).

Tenacity policy NOT applied

Qlib's data flow is local filesystem I/O. No network retry semantics needed. This is the first ingest module in QuantRank that diverges from the canonical compute/ingest/osap.py:52-56 retry decorator (documented explicitly in the module docstring).

Tests (6 offline; NO @network)

| # | Test | Coverage |
|---|---|---|
| 1 | test_alpha158_feature_manifest_has_158_entries | Primary CI signal. Pure cardinality + uniqueness; survives even without [factors] extra. |
| 2 | test_alpha158_feature_manifest_first_5_anchor | K-bar leading features (KMID, KLEN, KMID2, KUP, KUP2) anchored against Qlib v0.9.7. |
| 3 | test_alpha158_feature_manifest_matches_runtime_introspection | The drift detector. Hardcoded tuple must equal Alpha158DL.get_feature_config()[1]. Wrapped in pytest.importorskip("qlib"). |
| 4 | test_qlib_data_cache_constant_under_repo_cache_dir | Config sanity. Locks gitignore coverage via compute/cache/ parent glob (.gitignore:221). |
| 5 | test_init_qlib_passes_us_region_and_provider_uri | Monkeypatch capture; asserts region="us" + path passthrough. |
| 6 | test_init_qlib_defaults_to_config_cache_when_no_uri | Default provider_uri = config.QLIB_DATA_CACHE. |

Verification ladder (8-step; STOP at step 8)

| Step | Command | Result |
|---|---|---|
| 1 | ruff check . | ✅ clean |
| 2 | pytest tests/ -m "not network" | 930 passed (924 prior + 6 new) |
| 3 | pytest -m network --run-network | (unchanged at 20; no new @network) |
| 4 | python -m compute.output.schema_check | ✅ in-sync (no schema delta) |
| 5 | python -c "from compute.ingest.qlib_features import init_qlib, fetch_alpha158_features, ALPHA158_FEATURE_NAMES; print('OK', len(ALPHA158_FEATURE_NAMES))" | OK 158 |
| 6 | git push -u origin claude/resume-quantrank-phase-4.5-Zh0pO | ✅ at 68ed2386 |
| 7 | Open PR as Draft | (this PR) |
| 8 | subscribe_pr_activity + STOP for user audit | ⏳ next |

Ask-first surfaces touched

  • pyproject.toml [factors] — extended with pyqlib>=0.9.7,<0.10 (authorized in advance via plan-mode approval)
  • .github/workflows/ci.yml — UNCHANGED ([dev,factors] install already covers the new dep)
  • .github/workflows/compute-rankings.yml — UNTOUCHED per user hard constraint
  • Schema triple (schemas.py / types.ts / schema-snapshot.json) — UNTOUCHED (no schema delta this scout)

Out of scope (deferred to follow-on integration PR — ~5-commit cluster mirroring Phase 4h shape)

  • yfinance-to-Qlib BYO adapter (~150 LOC + custom S&P 500 instruments universe registration). Converts compute/cache/prices/*.parquet to Qlib .bin format.
  • Full Alpha158 feature compute on the 502-ticker universe → 502 × N_dates × 158 DataFrame.
  • Per-feature cross-validation framework — PBO/DSR doesn't directly apply to per-stock-per-date features. Walk-forward IC scoring per feature is the likely replacement; Phase 5 backtest infra is the canonical version.
  • Schema additions (StockDetail.qlib_features + Metadata.qlib_features_used + IC observability) → schema bump 0.9.1-phase4h.2 → 0.10.0-phase4j.
  • compute/main.py wiring decision — observability-only? blended into composite? Phase-5 ML-meta-learner-only consumer?
  • Top-5 rotation impact analysis (Rule 16 lock applies as for prior factor libraries).

Risks (from plan, with post-implementation resolution)

| # | Risk (planned) | Status post-scout |
|---|---|---|
| 1 | NO real @network test (divergence from Phase 4h/4i pattern) | Documented + replaced with manifest-matches-runtime introspection (stronger drift detector) |
| 2 | Qlib's dump_bin API may have changed across 0.9.6 → 0.9.7 | Defer to integration PR; pyqlib pinned to <0.10 so any future drift surfaces deliberately |
| 3 | pip install pyqlib CI cold-start ~150-180 MB | Documented in PR body; pip wheel caching mitigates subsequent runs |
| 4 | pytest.importorskip("qlib") masks failures when extra isn't installed | Acceptable; matches tests/test_features/test_osap_e2e_integration.py pattern; CI installs [factors] |
| 5 | Alpha158 hardcoded manifest drifts vs upstream | Test #3 catches drift; manifest is hand-updated on deliberate pyqlib bump |
| 6 | qlib.init global state pollutes other tests | Tests use monkeypatch + tmp_path; no actual init called during the 6 scout tests |

Test plan

  • Commit local — all 6 tests green; 930 total offline; ruff clean; schema in-sync
  • CI green on 68ed2386 (Python lint+test + Frontend build + Vercel preview)
  • Audit: install footprint disclosure + no-@network rationale + manifest drift detector design
  • User authorizes Draft → Ready flip
  • (post-merge) Section I post-merge no-op (no UI surface change; preview build sanity-check sufficient)
  • (post-merge) File "Phase 4j.1 — Full integration" follow-up tracking issue with explicit BYO-adapter precondition

🤖 Drafted with Claude Code via the Anthropic SDK.


Generated by Claude Code

…e + 158-feature manifest

Phase 4j scout PR. Mirrors the proven Phase 4i scout pattern (PR
#114) for Microsoft Qlib's Alpha158 feature library. Scope is
install + API surface + manifest verification ONLY; the
yfinance-to-Qlib BYO adapter + full Alpha158 feature compute on the
502-ticker universe ships in a follow-on integration PR.

**Pre-plan access-path discovery** (verified 2026-05-19; full record
in ``compute/ingest/qlib_features.py`` module docstring):

1. **PyPI package**: ``pyqlib`` 0.9.7 (also 0.9.6 available). Other
   candidate names (``qlib``, ``microsoft-qlib``) return 404.
2. **License**: MIT (verified via wheel METADATA inspection —
   ``Classifier: License :: OSI Approved :: MIT License``). **No
   CC BY-NC complication** like JKP. Safe for Phase 6+ commercial
   roadmap.
3. **Data init**: ``qlib.init(provider_uri=..., region=REG_US)``
   where ``REG_US = "us"``. **NO public US data bundle published
   by Qlib** — the ``provider_uri`` defaults to
   ``~/.qlib/qlib_data/cn_data`` (Chinese A-share, irrelevant for
   QuantRank); the US universe is BYO via local ``.bin`` files.
4. **Alpha158 surface**: ``qlib.contrib.data.handler.Alpha158`` →
   ``handler.fetch(col_set="feature")`` returns a DataFrame with
   ``(datetime, instrument)`` MultiIndex × 158 feature columns. The
   158-name manifest is fetched via
   ``Alpha158DL.get_feature_config()[1]`` — captured at scout time
   and hardcoded for stability; offline test 3 below locks it
   against upstream drift.

**Module** (``compute/ingest/qlib_features.py``, 186 LOC including
docstring):

- Module-name choice locked per architectural review: NOT
  ``compute/ingest/qlib.py``. Python's import resolution would
  treat the latter as the ``qlib`` package and shadow the actual
  installed PyPI package, breaking the entire integration. Distinct
  module name avoids the namespace collision.
- ``QLIB_INSTRUMENTS_UNIVERSE = "sp500"`` — custom universe ID;
  integration PR registers this against Qlib's instruments API.
- ``ALPHA158_FEATURE_NAMES: tuple[str, ...]`` — 158-name manifest
  hardcoded from ``Alpha158DL.get_feature_config()[1]`` at scout
  implementation time against pyqlib 0.9.7. Cardinality asserted
  at module load against ``config.ALPHA158_FEATURE_COUNT``.
- ``init_qlib(provider_uri=None)`` — idempotent thin wrapper around
  ``qlib.init(provider_uri=..., region="us")``. Local import so the
  scout module loads even when ``[factors]`` extra isn't installed.
- ``fetch_alpha158_features(*, instruments, start_time, end_time)``
  — forward-compat wrapper around ``Alpha158(...).fetch(col_set=
  "feature")``. NOT exercised end-to-end by the scout (see §"No
  ``@network`` test" below).
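The wrapper shape described for ``init_qlib`` can be sketched as below. This is an illustration under stated assumptions, not the shipped module: ``QLIB_DATA_CACHE`` is a stand-in for the config constant, the idempotency guard is omitted, and a fake ``qlib`` module is placed in ``sys.modules`` by hand where the project's tests use pytest's ``monkeypatch``.

```python
from __future__ import annotations

import sys
import types
from pathlib import Path

# Stand-in for config.QLIB_DATA_CACHE; the real value lives in compute/config.py.
QLIB_DATA_CACHE = Path("compute") / "cache" / "qlib" / "us_data"


def init_qlib(provider_uri: str | Path | None = None) -> None:
    """Initialise Qlib for the US region, defaulting to the repo cache path."""
    import qlib  # local import: this module loads even without the [factors] extra

    uri = Path(provider_uri) if provider_uri is not None else QLIB_DATA_CACHE
    qlib.init(provider_uri=str(uri), region="us")


# Exercise the wrapper against a fake ``qlib`` so no real initialisation runs.
captured: dict[str, str] = {}
fake_qlib = types.ModuleType("qlib")
fake_qlib.init = lambda **kwargs: captured.update(kwargs)

previous = sys.modules.get("qlib")
sys.modules["qlib"] = fake_qlib
try:
    init_qlib()
finally:
    # Restore whatever was registered before the demo.
    if previous is None:
        sys.modules.pop("qlib", None)
    else:
        sys.modules["qlib"] = previous

print(captured)
```

Because the ``import qlib`` statement sits inside the function body, importing the module never touches pyqlib; only calling ``init_qlib`` does.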

**Config** (``compute/config.py``, +23 LOC): new
``# --- Phase 4j scout: Microsoft Qlib (Alpha158) integration ---``
block adds:

- ``QLIB_DATA_CACHE: Path = CACHE_DIR / "qlib" / "us_data"``
  (gitignored — ``compute/cache/`` parent glob at .gitignore:221
  covers it).
- ``QLIB_DATA_MAX_AGE_DAYS: int = 31`` (BYO bundle, monthly refresh).
- ``ALPHA158_FEATURE_COUNT: int = 158``.
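As a rough illustration of how these constants might be consumed, the sketch below mirrors the three values and adds two hypothetical helpers: ``is_under`` for the "cache lives under the repo cache dir" check, and ``is_stale`` for the monthly refresh window. ``CACHE_DIR``'s actual definition and both helper names are assumptions.

```python
from datetime import datetime, timedelta
from pathlib import Path

# Values mirror the constants listed above; CACHE_DIR's real definition is assumed.
CACHE_DIR = Path("compute") / "cache"
QLIB_DATA_CACHE = CACHE_DIR / "qlib" / "us_data"
QLIB_DATA_MAX_AGE_DAYS = 31
ALPHA158_FEATURE_COUNT = 158


def is_under(path: Path, parent: Path) -> bool:
    """True when ``path`` sits inside ``parent`` (pure path check, no disk access)."""
    return parent in path.parents


def is_stale(built_at: datetime, now: datetime,
             max_age_days: int = QLIB_DATA_MAX_AGE_DAYS) -> bool:
    """Hypothetical helper: True when a bundle is older than the refresh window."""
    return now - built_at > timedelta(days=max_age_days)


print(is_under(QLIB_DATA_CACHE, CACHE_DIR))
print(is_stale(datetime(2026, 4, 1), datetime(2026, 5, 19)))
```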

**pyproject.toml**: ``[factors]`` extra extended with
``pyqlib>=0.9.7,<0.10``. The ``<0.10`` cap pins against Qlib 0.10+
which may drift the feature set; offline test 3 will catch any
drift on a deliberate version bump.

**Tests** (``tests/test_ingest/test_qlib_features.py``, 113 LOC, 6
offline — NO ``@network``):

1. ``test_alpha158_feature_manifest_has_158_entries`` — primary CI
   signal. Pure cardinality + uniqueness check; survives even when
   the ``[factors]`` extra isn't installed.
2. ``test_alpha158_feature_manifest_first_5_anchor`` — anchors the
   K-bar leading features (``KMID, KLEN, KMID2, KUP, KUP2``)
   against the canonical Qlib v0.9.7 surface.
3. ``test_alpha158_feature_manifest_matches_runtime_introspection``
   — hardcoded tuple must equal ``Alpha158DL.get_feature_config()
   [1]``. Wrapped in ``pytest.importorskip("qlib")``. The drift
   detector.
4. ``test_qlib_data_cache_constant_under_repo_cache_dir`` — config
   sanity + locks gitignore coverage via the ``compute/cache/``
   parent glob.
5. ``test_init_qlib_passes_us_region_and_provider_uri`` —
   monkeypatch capture; asserts ``region="us"`` + provided
   ``provider_uri`` are passed through.
6. ``test_init_qlib_defaults_to_config_cache_when_no_uri`` —
   default ``provider_uri`` resolves to ``config.QLIB_DATA_CACHE``.
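The checks in tests 1 and 2 can be illustrated with a reconstructed manifest. The composition used here (9 K-bar names, 4 price names, 29 rolling operators over 5 windows) is recalled from Qlib's default Alpha158 configuration and should be treated as an assumption; only the count of 158 and the first five names are confirmed by this commit, and the authoritative list is the hardcoded tuple in ``compute/ingest/qlib_features.py``.

```python
# Assumed composition of the default Alpha158 feature set; illustrative only.
KBAR = ("KMID", "KLEN", "KMID2", "KUP", "KUP2", "KLOW", "KLOW2", "KSFT", "KSFT2")
PRICE = ("OPEN0", "HIGH0", "LOW0", "VWAP0")
ROLLING_OPS = (
    "ROC", "MA", "STD", "BETA", "RSQR", "RESI", "MAX", "MIN", "QTLU", "QTLD",
    "RANK", "RSV", "IMAX", "IMIN", "IMXD", "CORR", "CORD", "CNTP", "CNTN", "CNTD",
    "SUMP", "SUMN", "SUMD", "VMA", "VSTD", "WVMA", "VSUMP", "VSUMN", "VSUMD",
)
WINDOWS = (5, 10, 20, 30, 60)

# 9 + 4 + 29 * 5 = 158 names, grouped per operator then per window.
names = KBAR + PRICE + tuple(f"{op}{w}" for op in ROLLING_OPS for w in WINDOWS)

# Test 1 shape: cardinality and uniqueness.
assert len(names) == 158
assert len(set(names)) == len(names)

# Test 2 shape: anchor on the leading K-bar features.
assert names[:5] == ("KMID", "KLEN", "KMID2", "KUP", "KUP2")
```

Neither check imports pyqlib, which is why they can stay green when the ``[factors]`` extra is absent.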

**Critical scope decision — NO ``@network`` test for this scout**:

Phase 4h scout (PR #110) and Phase 4i scout (PR #114) each had a
``@pytest.mark.network`` test that hit a remote CDN. **Qlib has no
remote CDN** — its data flow is local-bin filesystem I/O, not
download-from-network. The originally planned synthetic-OHLCV →
``.bin`` conversion → ``init_qlib`` → ``Alpha158.fetch`` smoke
test was DROPPED post-investigation: pyqlib's PyPI wheel does NOT
bundle the ``scripts/dump_bin.py`` utility needed for OHLCV →
``.bin`` conversion. That scaffolding is integration-PR scope.

Test #3 (runtime introspection match) is the **replacement
verification surface** — actually a stronger drift detector than
the dropped end-to-end test would have been, because it asserts
the hardcoded manifest matches upstream on every ``pip install``.

**CI install footprint impact**: ~150-180 MB net-new. ``pyqlib``
pulls ~22 transitive deps including ``mlflow`` (~20 MB),
``lightgbm`` (~15 MB), ``cvxpy`` (~30 MB), ``pymongo``, ``redis``
client, ``gym``, ``jupyter``, ``nbconvert``. None of these heavy
deps are actually consumed by the scout — they come along for the
ride because pyqlib doesn't expose a ``[minimal]`` extra. CI cold-
start latency bump is one-time per workflow; pip wheel caching
mitigates subsequent runs.

**Tenacity policy NOT applied**: Qlib's data flow is local
filesystem I/O. No network retry semantics needed. This is the
first ingest module in QuantRank that diverges from the canonical
``compute/ingest/osap.py:52-56`` retry decorator (documented
explicitly in the module docstring).

**Verification ladder** (steps 1-5 complete):

- ``ruff check .`` → clean ✅
- ``pytest tests/ -m "not network"`` → **930 passed** (924 baseline +
  6 new offline) ✅
- ``pytest -m network --run-network`` → 20 (unchanged; NO new
  ``@network``) ✅
- ``python -m compute.output.schema_check`` → in-sync (NO schema
  delta this scout) ✅
- ``python -c "from compute.ingest.qlib_features import init_qlib,
  fetch_alpha158_features, ALPHA158_FEATURE_NAMES; print('OK',
  len(ALPHA158_FEATURE_NAMES))"`` → ``OK 158`` ✅

Steps 6-8: ``git push`` → open Draft PR → ``subscribe_pr_activity``
+ STOP for user audit + Mark-Ready authorization.

**Ask-first surfaces touched**: NONE for the workflow / schema
triple. ``pyproject.toml [factors]`` extra extended in this commit
(authorized in advance via the plan-mode approval).
``.github/workflows/ci.yml`` unchanged (``[dev,factors]`` install
already covers the new pyqlib dep).
``.github/workflows/compute-rankings.yml`` UNTOUCHED per user
hard constraint.

**Defense layer**: unchanged at 17. **Top-5 rotation**: unchanged.
**Schema version**: unchanged at ``0.9.1-phase4h.2`` (no schema
delta this scout).

**Out of scope** (deferred to follow-on full Phase 4j integration
PR, ~5-commit cluster like Phase 4h):

- yfinance-to-Qlib BYO adapter (~150 LOC; ``compute/cache/prices/
  *.parquet`` → Qlib ``.bin`` format conversion)
- Full Alpha158 feature compute on 502-ticker universe
  (502 × N_dates × 158 DataFrame)
- Per-feature cross-validation framework (PBO/DSR doesn't directly
  apply to per-stock-per-date features — walk-forward IC scoring
  per feature is the likely replacement)
- Schema additions (``StockDetail.qlib_features`` +
  ``Metadata.qlib_features_used`` + IC observability) → bump
  ``0.9.1-phase4h.2 → 0.10.0-phase4j``
- ``compute/main.py`` wiring decision (observability-only? blended
  into composite? Phase-5 ML-meta-learner-only consumer?)
- Top-5 rotation impact analysis (Rule 16 lock applies)

https://claude.ai/code/session_01T8FE3MAnmk6hcjvH4SgYNU
@vercel

vercel Bot commented May 19, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
|---|---|---|---|
| quantrank | Ready | Preview, Comment | May 19, 2026 9:33am |

@dackclup dackclup marked this pull request as ready for review May 19, 2026 10:47
@dackclup dackclup merged commit f0ade65 into main May 19, 2026
4 checks passed
@dackclup dackclup deleted the claude/resume-quantrank-phase-4.5-Zh0pO branch May 19, 2026 10:48
dackclup added a commit that referenced this pull request May 19, 2026
…all + InstrumentedPCA 8-method API surface lock + 6 offline tests (#121)

Phase 4k scout — **FINAL of 4 factor-library scouts** (OSAP ✅ #110, JKP ✅ #114, Qlib ✅ #119, IPCA THIS). Ships `ipca` install + 8-method public-API surface lock + 6 offline tests + inline synthetic fixture. NO production wiring; characteristics-matrix construction + universe-wide IPCA fit + composite blend decision are integration-PR scope (Phase 4k.1, tracked as follow-up).

**After this merges → all 4 factor scouts done → eligible for v1.1.0-phase4 tag readiness audit** (gated on 4h.2 Part 2 + 4i.1 + 4j.1 + 4k.1 integration PRs landing — ~6-8w combined effort).

5 pre-plan investigations (verified 2026-05-19, carried verbatim into module docstring):

1. PyPI package: `ipca` 0.6.7 (29 historical versions; last release 2021-04-22 — ~5 years stale). Pin tight `>=0.6.7,<0.7`.
2. License: MIT verbatim from LICENSE.md (Buechner / Bybee 2019). No CC BY-NC complication unlike JKP. Safe for Phase 6+ commercial roadmap.
3. sklearn-compatible API surface — 8 public methods: fit / get_factors / fit_path / predict / predict_panel / predict_portfolio / score / predictOOS. Post-fit attrs: Gamma (L×K) + Factors (K×T) + metad dict + n_factors_eff + has_PSF + PSFcase. NO transform/fit_transform (user brief assumed presence; they don't exist in 0.6.7).
4. Data requirements: MultiIndex (entity, date) DataFrame OR explicit indices array. Min stable shape 10 firms × 20 years × 2 chars (maintainer's test_ipca.py). NaN handling internal. Unbalanced panels supported. 502-ticker scale uses data_type="portfolio" ALS path (integration-PR scope).
5. CI install footprint: ~50-80 MB net-new (numba ~50 MB + llvmlite ~30 MB + small progressbar). Substantially lighter than Qlib's 150-180 MB.

IPCA structural shape — 4th distinct vs prior scouts:
- OSAP (4h): factor returns CSV → proxy/36m regression
- JKP (4i): factor returns CSV → 36m regression
- Qlib (4j): per-stock per-date features → native Alpha158
- IPCA (4k): panel decomposition → Gamma (L×K loadings) + Factors (K×T) latent returns

Critical scope decision — NO @network test (mirrors Phase 4j Qlib rationale):

IPCA is pure local sklearn-style computation. No remote endpoint to network-test. Scout ships 6 offline tests / 0 @network. The synthetic-fixture smoke test exercises the full fit→Gamma/Factors path locally.

Architectural locks:

- Module placement `compute/features/ipca_factors.py` (NOT compute/ingest/) per pre-existing `.claude/skills/phase-4/ipca-factor-fit/PLAN.md:24` + `compute/features/osap_replicate.py` precedent. NO namespace collision (module=`ipca_factors`, package=`ipca`) — Phase 4j's `qlib_features.py` workaround doesn't apply here.
- INSTRUMENTED_PCA_PUBLIC_API 8-method tuple — drift detector; module-load assertion against config.IPCA_PUBLIC_API_METHOD_COUNT.
- IPCA_DEFAULT_N_FACTORS=5, IPCA_DEFAULT_INTERCEPT=True (KPS 2019 baseline) — validated by smoke test, NOT module-load assert.
- Tenacity NOT applied — pure local sklearn-style; no network retry. Second module after Phase 4j that diverges from osap.py:52-56 pattern; documented in module docstring.
- Synthetic fixture inline as @pytest.fixture (NOT committed CSV/parquet) — IPCA inputs are numpy arrays, no roundtrip needed.
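A public-API surface lock of this kind could look roughly like the following. The stand-in class exists only so the sketch runs without `ipca` installed; `public_methods` is a hypothetical helper, and restricting it to methods defined directly on the class (so inherited ones are ignored) is an assumption about how the real check is scoped.

```python
import inspect

EXPECTED_PUBLIC_API = (
    "fit", "get_factors", "fit_path", "predict",
    "predict_panel", "predict_portfolio", "score", "predictOOS",
)


class FakeInstrumentedPCA:
    """Stand-in carrying the eight method names listed above; bodies are empty."""

    def fit(self): ...
    def get_factors(self): ...
    def fit_path(self): ...
    def predict(self): ...
    def predict_panel(self): ...
    def predict_portfolio(self): ...
    def score(self): ...
    def predictOOS(self): ...
    def _fit_ipca(self): ...  # private helper: must not count as public API


def public_methods(cls: type) -> set[str]:
    """Names of functions defined directly on ``cls`` without a leading underscore."""
    return {
        name
        for name, member in vars(cls).items()
        if inspect.isfunction(member) and not name.startswith("_")
    }


surface = public_methods(FakeInstrumentedPCA)
print(sorted(surface))
```

Against the real package the class would be imported inside the test after `pytest.importorskip("ipca")`, and the assertion would compare `surface` with `set(EXPECTED_PUBLIC_API)`.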

Module layer (compute/features/ipca_factors.py, ~190 LOC):
- IPCA_FITTED_ARTIFACTS_CACHE re-export from config
- INSTRUMENTED_PCA_PUBLIC_API 8-tuple + module-load invariants (cardinality + uniqueness)
- IPCA_DEFAULT_N_FACTORS / IPCA_DEFAULT_INTERCEPT constants
- init_ipca(n_factors, intercept, **kwargs) → unfitted InstrumentedPCA
- fit_ipca_panel(estimator, *, X, y, indices, **fit_kwargs) → fitted estimator (returns-self)

Config layer (compute/config.py, +28 LOC):
- IPCA_FITTED_ARTIFACTS_CACHE: Path = CACHE_DIR / "ipca"
- IPCA_FITTED_ARTIFACTS_MAX_AGE_DAYS: int = 31
- IPCA_PUBLIC_API_METHOD_COUNT: int = 8

Tests (6 offline; ~228 LOC):
1. test_ipca_imports_and_exposes_instrumented_pca — primary CI signal (importorskip)
2. test_instrumented_pca_public_api_manifest_locks_8_methods — pure assertion (no ipca runtime)
3. test_instrumented_pca_public_api_matches_runtime_introspection — drift detector
4. test_ipca_fitted_artifacts_cache_under_repo_cache_dir — config sanity
5. test_init_ipca_returns_unfitted_estimator_with_kps_defaults — KPS defaults validation
6. test_fit_ipca_panel_on_synthetic_5x30x10_fixture — smoke fit; asserts Gamma (10,2) + Factors (2,30) + metad N/T/L

pyproject.toml: append `ipca>=0.6.7,<0.7` to [factors] (authorized in advance via plan-mode approval; pin range because 2021-04-22 staleness).

Ask-first surfaces touched:
- pyproject.toml [factors] — extended (authorized via plan-mode)
- ci.yml UNCHANGED ([dev,factors] install already covers new dep)
- compute-rankings.yml UNTOUCHED per user hard constraint
- Schema triple UNTOUCHED (no schema delta this scout)

Verification (local sandbox without [factors] + CI with [dev,factors]):
- ruff check . → clean
- python -m compute.output.schema_check → in-sync
- Import smoke: from compute.features.ipca_factors import init_ipca, fit_ipca_panel, INSTRUMENTED_PCA_PUBLIC_API → OK 8
- pytest tests/ -m "not network" excluding factor-extra files → 864 passed locally
- 2/6 IPCA tests PASS locally; 4/6 SKIP via pytest.importorskip (expected — local lacks [factors])
- CI on 82ade3a with [dev,factors] → both Python+Frontend GREEN; 936 offline expected

Defense layer unchanged at 17. Top-5 rotation unchanged. Schema unchanged at 0.9.1-phase4h.2.

Out of scope (deferred to follow-on Phase 4k.1 integration PR, ~5-commit cluster):
- Characteristics-matrix construction (which features feed X?)
- Full IPCA fit on 502-ticker universe (data_type="portfolio" canonical scaling)
- Walk-forward / rolling-window fit cadence
- Latent-factor composite integration decision
- Schema additions (StockDetail.ipca_loadings + Metadata.ipca_n_factors_eff + ipca_in_sample_r2) → bump 0.9.1-phase4h.2 → 0.10.0-phase4k
- PBO/DSR doesn't apply (loadings ≠ portfolio returns); IC walk-forward observability instead per PLAN.md:36
- Top-5 rotation impact analysis (Rule 16 lock)
- WRDS data backfill consideration

Audit history:
- Plan-audit round 1: 5 investigations verified · MIT lock · heavy-deps disclosure
- Plan-audit round 2: Q1 (public-API surface lock) + Q2 (inline pytest.fixture) design choices applied
- Plan-audit round 3: line citations verified
- Implementation: main session direct (Phase 4j paste-loop precedent — worker session was stuck re-presenting plan)
- CI green on 82ade3a: Python+Frontend both passing · Vercel ✅ READY
- Conditional Mark-Ready authorization given · user confirmed CI green · squash merged

Closes the factor-library scout cluster. Next: v1.1.0-phase4 tag readiness audit gated on 4 integration PRs.

https://claude.ai/code/session_015649aRyi2bvciQYZVNACd2
dackclup added a commit that referenced this pull request May 20, 2026
Part of epic #125 (Item #6 of 6). Pure tooling addition — no
runtime / scoring / schema impact.

Motivation
----------
PR #123 (2026-05-19, closed without merging): a worker session
opened a Phase 4j + 4k scout duplicate on branch
`claude/resume-quantrank-phase-4.5-Zh0pO` while the main session
shipped the same work directly via PRs #119 (Qlib) + #121 (IPCA).
Root cause: the worker session never inspected the `claude/*`
branch list + recent PRs before writing code, producing 100%
wasted effort.

This change ships a preflight check that surfaces in-flight scope
BEFORE any code is written, so the duplicate-PR failure mode is
caught at the handoff-prompt entry rather than at PR review.

Files (2 new, +271 LOC)
------------------------
- tools/check_branch_collisions.py (+149 LOC) — git-only preflight
  script. Lists active `claude/*` branches via `git ls-remote
  origin "refs/heads/claude/*"` and recent main-branch commits
  via `git log --since="48 hours ago" --oneline --no-merges
  origin/main`. Optional keyword args flag case-insensitive
  substring matches. Always exit 0 (informational only).

- .claude/skills/branch-collision-check/SKILL.md (+122 LOC) —
  skill description with YAML frontmatter, trigger conditions
  (handoff prompts, Phase / issue / Item #N mentions, fresh worker
  sessions), skip conditions (doc-only chores, iteration #2+,
  user-authorized parallel work), sample output (clean + warning),
  and output-interpretation guidance pointing the caller to STOP
  + ask the user when any ⚠️ line surfaces.

Design notes
------------
- Git-only data sources — no `gh` CLI / GitHub API auth required.
  Works in the QuantRank Claude Code Web sandbox where `gh` is
  unavailable, and on any contributor machine with bare git.
- 48-hour window — matches typical worker ↔ main session handoff
  cadence; long enough to catch duplicate work, short enough to
  keep the output scannable.
- Pure read-only — no destructive git ops, no branch creation,
  no push, no GitHub API mutation. Always returns exit 0; the
  caller decides whether to proceed.
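The preflight logic described above can be sketched as follows. This is not the shipped `tools/check_branch_collisions.py`: function names and the sample lines are hypothetical, and only the two git invocations and the summary strings are taken from this commit message. `main` is defined but deliberately not called, so the block runs without a git checkout.

```python
import subprocess


def git_lines(*args: str) -> list[str]:
    """Run a read-only git command and return its non-empty output lines."""
    result = subprocess.run(["git", *args], capture_output=True, text=True, check=False)
    return [line for line in result.stdout.splitlines() if line.strip()]


def find_collisions(lines: list[str], keywords: list[str]) -> list[str]:
    """Return the lines containing any keyword as a case-insensitive substring."""
    lowered = [keyword.lower() for keyword in keywords]
    return [line for line in lines if any(k in line.lower() for k in lowered)]


def main(keywords: list[str]) -> int:
    """List in-flight scope and flag keyword matches; never fails the caller."""
    branches = git_lines("ls-remote", "origin", "refs/heads/claude/*")
    commits = git_lines(
        "log", "--since=48 hours ago", "--oneline", "--no-merges", "origin/main"
    )
    hits = find_collisions(branches + commits, keywords)
    for hit in hits:
        print(f"⚠️  {hit}")
    if hits:
        print(f"{len(hits)} potential scope collision(s) found")
    else:
        print("No scope collisions detected")
    return 0  # informational only: the caller decides whether to proceed


# Sample lines standing in for real git output.
SAMPLE = [
    "feat(ingest): Qlib scout, Alpha158 158-feature manifest",
    "docs: refresh PHASE_STATUS",
]
print(find_collisions(SAMPLE, ["alpha158"]))
```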

Verification ladder all green
------------------------------
- ruff check . → All checks passed
- python tools/check_branch_collisions.py → lists 1 active
  claude/* branch + 16 recent commits (last 48h), exit 0
- python tools/check_branch_collisions.py "Alpha158" → fires
  ⚠️  on PR #119 commit "Alpha158 158-feature manifest", summary
  reports "1 potential scope collision(s) found", exit 0
- python tools/check_branch_collisions.py "Phase 99 nonsense" →
  no match, summary reports "No scope collisions detected",
  exit 0
- python tools/check_doc_test_counts.py → exit 0 (Item #2 guard
  still passes; new files don't introduce hardcoded counts)
- python -m compute.output.schema_check → in sync (no schema touch)
- python -m pytest tests/ -m "not network" → 959 passed
  (unchanged; tools/ + .claude/skills/ aren't imported by tests)
- SKILL.md YAML frontmatter parses — confirmed via Claude Code's
  skill registry picking it up at module load

Constraints honored
-------------------
- No touch to compute/ / frontend/ / tests/ — tools/ +
  .claude/skills/ only
- No network calls / no GitHub API auth — git remote ls + git log
- No destructive actions — read-only preflight check
- No push to main; no force-push; no --no-verify
- No workflow_dispatch trigger (compute-rankings.yml untouched)

Epic #125 status after this PR
-------------------------------
Item #1 ✅ Hypothesis property tests (PR #127)
Item #2 ✅ Strip hardcoded test counts + CI guard (PR #128)
Item #4 ✅ Observability-before-wiring pattern (PR #129)
Item #6 ✅ Branch-collision preflight (this PR)
Items #3, #5 remain — separate PRs per epic decomposition.

https://claude.ai/code/session_01T8FE3MAnmk6hcjvH4SgYNU

Co-authored-by: Claude <noreply@anthropic.com>
dackclup added a commit that referenced this pull request May 20, 2026
…imization PR F) (#146)

Sixth PR in the .md optimization sequence (Option D). Audit of 18
QR-origin skill descriptions found all are well-formed (parseable
YAML, TRIGGER + SKIP clauses present, average 888 chars). The
critical YAML bug (#119+#121 plain-scalar bug in branch-collision-
check and pr-quality-gate) was already fixed in PR A. So PR F's
remaining work is light polish, not structural change.

Vendored skills (20) FROZEN per the boundary convention — Anthropic
skills, mattpocock-* (8), karpathy-guidelines, thananon/9arm-skills
(4), karpathy-llm-wiki are all upstream-only edits.

Trim targets (cut redundancy, fix drift, add Thai triggers):

1. pr-quality-gate (1207 → ~1015): cut redundant "ALSO use right
   before flipping Draft→Ready" clause that duplicated the first
   TRIGGER ("before authorizing the Draft→Ready flip"). Tightened
   wrapping.

2. pr-iteration-flow (990 → ~890): cut redundant "ALSO use this
   skill as the default workflow harness any time a PR is open"
   that duplicated the TRIGGER list. Dropped stale "PR-3c → PR-3d
   → PR-20" historical reference. Added Thai trigger phrases
   "เช็ค CI" / "ดู PR" since the user invokes this skill in Thai.

3. phase-status-bump (918 → ~840): dropped two historical examples
   ("PR 3d → tag v0.6.0-phase3d" and "3a→3b, 3c→3d") that anchored
   the description to one shipped phase. Wording now phase-agnostic.

4. verify-production-output (1086 → ~870): compressed the
   "Surfaces..." enumeration of Section A-H content (was 8 detailed
   items; now 8 short items) without losing dispatch specificity.
   Added Thai trigger phrases "ตรวจ output" / "เช็ค production".
   Folded "ALSO use" into first TRIGGER as one phrase.

YAML moved from plain scalar to `description: >` (folded block) on
the 3 plain-scalar descriptions edited (pr-iteration-flow,
phase-status-bump, verify-production-output) — same safety pattern
PR A applied. Prevents the ' #' comment-eating bug from re-emerging
if anyone adds a `#issue` reference later.
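The comment-eating behaviour comes from the YAML rule that, in a plain (unquoted) scalar, a `#` preceded by whitespace starts a comment. The function below is a deliberately simplified model of that single rule, written for illustration; it is not a YAML parser, its name is hypothetical, and the sample description string is made up.

```python
def plain_scalar_value(raw: str) -> str:
    """Model a YAML plain scalar: cut at the first '#' that follows whitespace."""
    for index, char in enumerate(raw):
        if char == "#" and (index == 0 or raw[index - 1] in " \t"):
            return raw[:index].rstrip()
    return raw.strip()


# Made-up description text in the style of the skills discussed above.
description = "Run before opening a PR; see #119 and #121 for the failure mode"

print(plain_scalar_value(description))
print(plain_scalar_value("tracked as issue#119"))
```

The first call drops everything from ` #119` onward, while the second keeps its text because the `#` is not preceded by whitespace. A folded block scalar (`description: >`) sidesteps the rule because block scalar content is not scanned for comments.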

Net token impact: ~-650 chars × ~0.25 tokens/char ≈ -162 tokens
per session-start. Modest but compounds.

Why not aggressive trim:
- Each TRIGGER phrase + SKIP clause IS dispatch-useful — verified
  by sampling. Aggressive 50% cuts would risk dispatch quality.
- Remaining 14 QR-origin skills already at 700-900 chars with
  no redundancy to remove.

CLAUDE.md (181 → 181, lockstep): §Phase status — added PR #145 (E)
to "Recently merged"; replaced "PR E in flight" with "PR F in flight"
note explaining the audit found health.

AGENTS.md (343 → 343, lockstep): §Phase + version state —
optimization sequence tracker updated: PR E ✅, PR F in flight, PR G
remaining.

Next: PR G (PHASE_STATUS.md "Current State" summary at top + chronological
table below).

Co-authored-by: Claude <noreply@anthropic.com>