Phase 4j.1 — Full Qlib Alpha158 integration (BYO adapter + 502-ticker feature compute + per-feature IC validation) #120

@dackclup

Context

Phase 4j scout (PR #119, merged 2026-05-19 at SHA f0ade65b) shipped the pyqlib install + Alpha158 158-feature manifest + 6 offline tests + access-path investigation. This issue tracks the full integration PR that consumes the scout's foundation and wires Qlib's Alpha158 per-stock per-date features into the QuantRank composite layer.

The scout intentionally deferred all production wiring per PR #119 §"Out of scope". This issue is that deferred work.

Access-path foundation (locked by scout)

The integration PR builds on the scout's compute/ingest/qlib_features.py module docstring, which records the 2026-05-19 verification:

  • pyqlib 0.9.7 installs cleanly via [factors] extra
  • ✅ MIT license (no commercial complication unlike JKP's CC BY-NC 4.0)
  • qlib.contrib.data.handler.Alpha158 exposes 158-feature surface
  • ALPHA158_FEATURE_NAMES manifest hardcoded; drift detector test catches upstream changes
  • ✅ Module name qlib_features.py (not qlib.py) avoids Python namespace collision with installed qlib package
  • ❌ NO public US data bundle — Qlib's default provider_uri covers CN A-share only; US universe is BYO via local .bin files
  • ❌ NO PyPI [minimal] extra — pip install pyqlib pulls ~150-180 MB of heavy transitives (mlflow / lightgbm / cvxpy / pymongo / redis / gym / jupyter)

Qlib's distinct shape vs OSAP/JKP

Critical reminder from the scout's PR body — Qlib is structurally different from prior factor libraries:

| Scout | Data shape | Per-stock surface | BYO required |
|---|---|---|---|
| OSAP (4h, PR #112) | factor returns CSV | proxy / 36m regression | no |
| JKP (4i, PR #114) | factor returns CSV | 36m regression | no |
| Qlib (4j, PR #119) | per-stock per-date features (LOCAL compute) | native: Alpha158 emits 158 features per (stock, date) | YES: yfinance OHLCV → Qlib .bin adapter |

This means the OSAP/JKP integration pattern does NOT apply here. There's no factor-returns CSV to proxy/regress against; Alpha158 produces per-stock features directly. The integration PR's shape is therefore substantially different from PR #112 (Phase 4h full integration) and the eventual Phase 4i full integration (#115).

Scope IN (5 deferred work items from PR #119 §"Out of scope")

(a) yfinance-to-Qlib BYO adapter — THE precondition

This is the largest single piece of work and the precondition for everything else. Qlib publishes no public US data bundle; the integration PR must convert QuantRank's existing compute/cache/prices/*.parquet files (yfinance OHLCV) into Qlib's .bin format that qlib.init(provider_uri=...) can read.

Approximate steps:

  • Custom universe registration — Qlib expects an "instruments" file listing stock symbols + active date ranges. Build sp500.txt from the universe-resolution step that already runs in compute/main.py.
  • Per-ticker .bin conversion — for each of 502 tickers, walk the cached yfinance parquet → write OPEN / HIGH / LOW / CLOSE / VWAP / VOLUME columns to .bin files at QLIB_DATA_CACHE/features/<ticker>/.
  • Calendar file — Qlib expects a calendars/day.txt listing all trading dates in the universe.
  • dump_bin scaffolding: pyqlib's PyPI wheel does NOT bundle scripts/dump_bin.py (verified during scout). Three options: vendor a minimal port (~50 LOC) into compute/ingest/qlib_features.py, pin pyqlib to a release that includes it, or write our own .bin writer against Qlib's low-level binary format spec.
  • Refresh cadence — Qlib bundle stays valid for 31 days per config.QLIB_DATA_MAX_AGE_DAYS; weekly cron re-converts only when stale.
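
A minimal sketch of the "write our own .bin writer" option, covering the instruments file, calendar file, and per-ticker .bin steps above. It assumes the layout used by upstream dump_bin.py (one little-endian float32 file per field, whose first element is the series' start index in the calendar) and assumes lowercase parquet column names; both should be verified against the pinned pyqlib release. `write_qlib_bundle` and `FIELDS` are hypothetical names, and VWAP is left out because it would have to be derived or supplied separately.

```python
from pathlib import Path

import numpy as np
import pandas as pd

# Assumed column names in the cached yfinance parquet files (hypothetical).
FIELDS = ("open", "high", "low", "close", "volume")


def write_qlib_bundle(prices: dict[str, pd.DataFrame], root: Path,
                      universe: str = "sp500") -> pd.DatetimeIndex:
    """Write calendars/day.txt, instruments/<universe>.txt and per-field .bin files."""
    # Shared calendar: the sorted union of every ticker's trading dates.
    all_dates = set()
    for df in prices.values():
        all_dates.update(df.index)
    calendar = pd.DatetimeIndex(sorted(all_dates))

    (root / "calendars").mkdir(parents=True, exist_ok=True)
    (root / "instruments").mkdir(parents=True, exist_ok=True)
    (root / "calendars" / "day.txt").write_text(
        "\n".join(d.strftime("%Y-%m-%d") for d in calendar) + "\n")

    instrument_lines = []
    for ticker, df in sorted(prices.items()):
        df = df.sort_index()
        start, end = df.index[0], df.index[-1]
        start_idx = calendar.get_loc(start)
        # Align to the calendar slice so missing days become NaN, not shifted rows.
        aligned = df.reindex(calendar[start_idx:calendar.get_loc(end) + 1])
        out_dir = root / "features" / ticker.lower()
        out_dir.mkdir(parents=True, exist_ok=True)
        for field in FIELDS:
            # Assumed layout: [calendar start index, v0, v1, ...] as float32.
            payload = np.hstack([[start_idx], aligned[field].to_numpy()]).astype("<f4")
            payload.tofile(out_dir / f"{field}.day.bin")
        # One tab-separated line per symbol: SYMBOL, first date, last date.
        instrument_lines.append(f"{ticker.upper()}\t{start:%Y-%m-%d}\t{end:%Y-%m-%d}")

    (root / "instruments" / f"{universe}.txt").write_text(
        "\n".join(instrument_lines) + "\n")
    return calendar
```

Reading a file back with `np.fromfile(path, dtype="<f4")` and checking that element 0 matches the ticker's first calendar index is a cheap round-trip test for the adapter.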

Effort: ~150 LOC + ~80 LOC tests. ~2 days.

(b) Full Alpha158 feature compute on 502-ticker universe

Once the BYO adapter is live:

  • init_qlib(QLIB_DATA_CACHE) — single init per weekly compute run (qlib.init is global state)
  • fetch_alpha158_features(instruments="sp500", start_time=..., end_time=...) — returns the 502 × N_dates × 158 DataFrame
  • Decide whether to materialize for the full S&P 500 universe in production OR sample (e.g., latest single date only — ~502 × 158 = 79 316 floats, lightweight)
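
To make the "latest single date only" option concrete, here is a small sketch that slices one cross-section out of a feature panel. The panel is synthetic: it assumes the feature frame is indexed by a (datetime, instrument) MultiIndex, and the ticker and feature labels are placeholders rather than real Alpha158 names. `latest_cross_section` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def latest_cross_section(features: pd.DataFrame) -> pd.DataFrame:
    """Return the (instrument x feature) slice for the most recent date."""
    latest = features.index.get_level_values("datetime").max()
    return features.xs(latest, level="datetime")


# Synthetic stand-in for the feature panel: 3 dates x 502 tickers x 158 features.
dates = pd.date_range("2026-05-13", periods=3, freq="B")
tickers = [f"T{i:03d}" for i in range(502)]
index = pd.MultiIndex.from_product([dates, tickers], names=["datetime", "instrument"])
columns = [f"F{i:03d}" for i in range(158)]
rng = np.random.default_rng(0)
panel = pd.DataFrame(rng.standard_normal((len(index), 158)),
                     index=index, columns=columns)

snapshot = latest_cross_section(panel)
print(snapshot.shape, snapshot.size)
```

The single-date snapshot holds 502 × 158 values, which is small enough to serialize per run; the full panel grows linearly with the number of dates kept.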

Effort: ~80 LOC + tests. ~1 day.

(c) Per-feature cross-validation framework

PBO/DSR (used by Phase 4h OSAP for long-short factor returns) does NOT directly apply to Alpha158's per-stock per-date features. Bailey 2014 PBO is rank-based across strategy time-series; Alpha158 emits cross-sectional features per date, not strategy returns. The integration PR needs a different validation surface.

Likely replacement: walk-forward IC scoring per feature. For each of the 158 features, compute the Spearman rank correlation between the feature value at t and the forward 1-month or 3-month return, over a rolling window. Surface accepted features (absolute mean IC > 0.02 over the rolling window, per defense-infrastructure/PLAN.md:121) into composite blending; reject the rest.

Walk-forward CV is the canonical approach, but per defense-infrastructure/PLAN.md:270 the full walk-forward + purged + embargoed CV is the stronger version that belongs to the Phase 5 backtest infra. The Phase 4j integration ships with a simpler rolling-12m IC validation as a stopgap.
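
A rough sketch of the rolling-12m IC stopgap, under stated assumptions: features and forward returns share a (date, ticker) MultiIndex with one row per month, "rolling 12m" means the last 12 monthly cross-sections, and the gate is read as the absolute value of the mean IC exceeding 0.02. `daily_rank_ic` and `ic_gate` are hypothetical names. Spearman correlation is computed as Pearson correlation on ranks, which avoids a SciPy dependency.

```python
import pandas as pd

IC_THRESHOLD = 0.02  # acceptance threshold quoted in the issue text


def daily_rank_ic(features: pd.DataFrame, fwd_ret: pd.Series) -> pd.DataFrame:
    """Per-date Spearman IC of every feature against forward returns.

    Both inputs are assumed to share a (date, ticker) MultiIndex.
    Returns a frame with one row per date and one column per feature.
    """
    joined = features.assign(_fwd=fwd_ret)
    rows = {}
    for date, block in joined.groupby(level="date"):
        ranks = block.rank()  # Spearman = Pearson correlation of ranks
        rows[date] = ranks[features.columns].corrwith(ranks["_fwd"])
    return pd.DataFrame(rows).T.sort_index()


def ic_gate(ic: pd.DataFrame, window: int = 12,
            threshold: float = IC_THRESHOLD):
    """Split features into accepted / rejected by |mean IC| over the last `window` rows."""
    mean_ic = ic.tail(window).mean()
    accepted = mean_ic[mean_ic.abs() > threshold].index.tolist()
    # Features with undefined IC (e.g. constant columns) fall through to rejected.
    rejected = [name for name in mean_ic.index if name not in accepted]
    return accepted, rejected, mean_ic
```

The three return values line up with the proposed qlib_features_used, qlib_features_excluded, and qlib_features_ic_12m metadata fields; note that a feature with a strongly negative IC passes an absolute-value gate, so its sign would need handling before any blending.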

Effort: ~120 LOC + tests. ~2 days.

(d) Schema additions

Schema triple lockstep edit (per AGENTS.md:229-231):

  • compute/output/schemas.py — add to StockDetail:
    • qlib_features: dict[str, float] | None = None — per-stock subset of accepted Alpha158 features (curated cross-section at as_of date)
    • qlib_blended_score: float | None = None — optional blended score IF feature blending into composite is wired (see §(e) below)
  • compute/output/schemas.py — add to Metadata:
    • qlib_features_used: list[str] | None = None — Alpha158 features that passed IC gate
    • qlib_features_excluded: list[str] | None = None — features rejected by IC gate
    • qlib_features_ic_12m: dict[str, float] | None = None — per-feature rolling-12m IC (observability)
    • qlib_features_coverage_pct: dict[str, float] | None = None — per-feature S&P 500 coverage %
  • compute/config.py:30, SCHEMA_VERSION: 0.9.1-phase4h.2 → 0.10.0-phase4j (MINOR bump: new phase boundary)
  • frontend/lib/types.ts — mirror Pydantic additions
  • frontend/lib/schema-snapshot.json — regenerate via python -m compute.output.schema_check --update-snapshot

Effort: ~50 LOC + ~30 LOC tests. ~0.5 day.

(e) compute/main.py wiring decision

Open question — defer to integration-PR planning time: how does Alpha158 feed the composite?

Three plausible patterns:

  1. Observability-only — write per-feature IC + per-stock features into metadata + StockDetail but do NOT blend into composite. Rule 16 trivially holds. Lowest risk.
  2. Phase 4h-style Path-b blend — pool accepted Alpha158 features into a single per-stock aggregate, blend OUTSIDE compute_composite() (preserves PHASE3_WEIGHTS sum-to-1.0 invariant at compute/scoring/composite.py:43-45). Mirror the PR #112 pattern.
  3. Phase 5 ML meta-learner consumer only — Alpha158 features feed the LightGBM meta-label classifier (per phase-5/meta-label/PLAN.md) but NOT the Phase 4 composite. Waits for Phase 5 backtest infra (PR 4b: defense-infrastructure (cross-source validator + PBO/DSR gate + IC-decay monitor) #75).

Recommended default: (1) observability-only for the integration PR. Defer (2)/(3) to Phase 5+ once IC evidence accumulates from production diagnostics.

Effort: ~50 LOC + tests. ~0.5 day.

Top-5 rotation impact analysis

Rule 16 lock applies as for all prior factor libraries: Top-5 ranking stays on raw composite_score. qlib_blended_score (if wired in §(e)) is informational only. Confirm entered_top5 / exited_top5 distributions are unchanged when qlib_blended_score is excluded from ranking.
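
One way the rotation check could look, as a sketch only: rank on composite_score alone and confirm that the presence of qlib_blended_score changes neither the Top-5 nor the entered/exited sets. The column names composite_score and qlib_blended_score come from this issue; the `ticker` column, the alphabetical tie-break, and the helper names `top5` and `rotation` are assumptions for illustration.

```python
import pandas as pd


def top5(stocks: pd.DataFrame) -> list[str]:
    """Rank strictly on raw composite_score; every other column is ignored."""
    ranked = stocks.sort_values(["composite_score", "ticker"],
                                ascending=[False, True])  # assumed tie-break
    return ranked["ticker"].head(5).tolist()


def rotation(prev: list[str], curr: list[str]) -> tuple[set[str], set[str]]:
    """Return (entered_top5, exited_top5) between two runs."""
    return set(curr) - set(prev), set(prev) - set(curr)
```

A spot-check would compute `top5` on the same run with and without the qlib_blended_score column and assert that `rotation` returns two empty sets.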

Effort: ~30 LOC test + spot-check. ~0.5 day.

Triggers (open implementation PR when EITHER fires)

  1. Phase 5 backtest infra lands (.claude/skills/phase-5/backtest-infrastructure/PLAN.md) — provides the canonical walk-forward + purged + embargoed CV that replaces this issue's rolling-12m IC stopgap. Recommended trigger.

  2. Analyst / user feedback indicates Alpha158 features are needed for an in-flight analysis use case (forces the BYO-adapter + integration timeline ahead of Phase 5).

Effort estimate

| Sub-item | LOC | Days |
|---|---|---|
| (a) yfinance-to-Qlib BYO adapter | ~150 | 2 |
| (b) Full Alpha158 feature compute on 502-ticker | ~80 | 1 |
| (c) Per-feature walk-forward IC validation | ~120 | 2 |
| (d) Schema additions (triple lockstep) | ~50 | 0.5 |
| (e) compute/main.py wiring (default = observability-only) | ~50 | 0.5 |
| (f) Top-5 rotation impact test | ~30 | 0.5 |
| Tests + docs + module docstring | ~210 | 1 |
| Total | ~690 LOC | ~7.5 days |

At ~690 LOC this is smaller than Phase 4h's ~1160 LOC, but the two are not directly comparable because:

  • Phase 4h had the Path-b blend + PBO/DSR gate which are reusable for Phase 4j only if wiring path (2) is chosen
  • BYO adapter is new ground; no prior precedent in QuantRank
  • Per-feature IC validation is structurally different from per-signal PBO/DSR

Sequencing relative to other Phase 4+ tracks

Tag v1.1.0-phase4 is gated on all 4 factor library scouts (4h ✅ + 4i ✅ + 4j ✅ + 4k pending) and their respective integration PRs (4h ✅ + 4h.2 ✅ + 4i.1 + 4j.1 + 4k.1) all merging. This issue is the gating item for 4j specifically.

Out of scope for this issue

  • CN A-share market integration — Qlib's default region; QuantRank is US-only universe
  • Qlib's built-in model training (LightGBM, MLP, etc.) — Alpha158 is feature engineering only; ML model training is Phase 5 ML meta-learner work
  • Qlib's portfolio optimization (cvxpy-based) — out of scope; QuantRank's Top-5 selection is Rule 16 composite-based
  • scripts/dump_bin.py upstream contribution — if we end up vendoring a port, consider upstreaming to microsoft/qlib as a follow-up community contribution (out of scope for QuantRank itself)
  • Heavy-transitive trimming — mlflow / cvxpy / gym / jupyter are pulled by pip install pyqlib but unused by QuantRank. An upstream [minimal] extra would help; for now we accept the ~150-180 MB install footprint per the PR #119 disclosure

Related

🤖 Filed by Claude Code via the Anthropic SDK after PR #119 (Phase 4j Qlib scout) shipped at f0ade65b.

