Phase 4j.1 — Full Qlib Alpha158 integration (BYO adapter + 502-ticker feature compute + per-feature IC validation)

## Context

Phase 4j scout (PR #119, merged 2026-05-19 at SHA `f0ade65b`) shipped the `pyqlib` install + Alpha158 158-feature manifest + 6 offline tests + access-path investigation. This issue tracks the **full integration PR** that consumes the scout's foundation and wires Qlib's Alpha158 per-stock per-date features into the QuantRank composite layer.

The scout intentionally deferred all production wiring per [PR #119](https://github.com/dackclup/quantrank/pull/119) §"Out of scope". This issue is that deferred work.

## Access-path foundation (locked by scout)

The integration PR builds on the scout's [`compute/ingest/qlib_features.py`](https://github.com/dackclup/quantrank/blob/main/compute/ingest/qlib_features.py) module docstring, which records the 2026-05-19 verification:

- ✅ `pyqlib` 0.9.7 installs cleanly via `[factors]` extra
- ✅ MIT license (no commercial complication unlike JKP's CC BY-NC 4.0)
- ✅ `qlib.contrib.data.handler.Alpha158` exposes 158-feature surface
- ✅ `ALPHA158_FEATURE_NAMES` manifest hardcoded; drift detector test catches upstream changes
- ✅ Module name `qlib_features.py` (not `qlib.py`) avoids Python namespace collision with installed `qlib` package
- ❌ NO public US data bundle — Qlib's default `provider_uri` covers CN A-share only; US universe is BYO via local `.bin` files
- ❌ NO PyPI `[minimal]` extra — `pip install pyqlib` pulls ~150-180 MB of heavy transitives (mlflow / lightgbm / cvxpy / pymongo / redis / gym / jupyter)

## Qlib's distinct shape vs OSAP/JKP

Critical reminder from the scout's PR body — Qlib is structurally different from prior factor libraries:

| Scout | Data shape | Per-stock surface | BYO required |
|---|---|---|---|
| OSAP (4h, PR #112) | factor returns CSV | proxy / 36m regression | no |
| JKP (4i, PR #114) | factor returns CSV | 36m regression | no |
| **Qlib (4j, PR #119)** | **per-stock per-date features (LOCAL compute)** | **native — Alpha158 emits 158 features per (stock, date)** | **YES — yfinance OHLCV → Qlib .bin adapter** |

This means **the OSAP/JKP integration pattern does NOT apply here**. There's no factor-returns CSV to proxy/regress against; Alpha158 produces per-stock features directly. The integration PR's shape is therefore substantially different from PR #112 (Phase 4h full integration) and the eventual Phase 4i full integration (#115).

## Scope IN (5 deferred work items from PR #119 §"Out of scope")

### (a) yfinance-to-Qlib BYO adapter — THE precondition

This is the **largest single piece of work** and the precondition for everything else. Qlib publishes no public US data bundle; the integration PR must convert QuantRank's existing `compute/cache/prices/*.parquet` files (yfinance OHLCV) into Qlib's `.bin` format that `qlib.init(provider_uri=...)` can read.

Approximate steps:

- **Custom universe registration** — Qlib expects an "instruments" file listing stock symbols + active date ranges. Build `sp500.txt` from the universe-resolution step that already runs in `compute/main.py`.
- **Per-ticker `.bin` conversion** — for each of 502 tickers, walk the cached yfinance parquet → write OPEN / HIGH / LOW / CLOSE / VWAP / VOLUME columns to `.bin` files at `QLIB_DATA_CACHE/features/<ticker>/`.
- **Calendar file** — Qlib expects a `calendars/day.txt` listing all trading dates in the universe.
- **dump_bin scaffolding** — `pyqlib`'s PyPI wheel does NOT bundle `scripts/dump_bin.py` (verified during scout). Three options: vendor a minimal port (~50 LOC) into `compute/ingest/qlib_features.py`, pin `pyqlib` to a release that includes the script, or write our own `.bin` writer against `qlib`'s low-level binary format spec.
- **Refresh cadence** — Qlib bundle stays valid for 31 days per `config.QLIB_DATA_MAX_AGE_DAYS`; weekly cron re-converts only when stale.
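
The adapter steps above can be sketched with the standard library alone. This is a minimal sketch, not the vendored port: the helper names (`write_calendar`, `write_instruments`, `write_feature_bin`, `read_feature_bin`) are hypothetical, and the on-disk layout is an assumption based on our reading of upstream `dump_bin.py` (little-endian float32, first value is the calendar index of the first observation, lowercase ticker directory, `<field>.day.bin` file name, tab-separated instruments file). Verify the layout against the pinned `pyqlib` release before relying on it.

```python
import struct
import tempfile
from pathlib import Path


def write_calendar(root: Path, dates: list[str]) -> list[str]:
    """Write calendars/day.txt (one ISO date per line) and return the sorted calendar."""
    calendar = sorted(set(dates))
    path = root / "calendars" / "day.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(calendar) + "\n")
    return calendar


def write_instruments(root: Path, spans: dict[str, tuple[str, str]], name: str = "sp500") -> None:
    """Write instruments/<name>.txt as tab-separated SYMBOL, start date, end date."""
    path = root / "instruments" / f"{name}.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    lines = [f"{sym.upper()}\t{start}\t{end}" for sym, (start, end) in sorted(spans.items())]
    path.write_text("\n".join(lines) + "\n")


def write_feature_bin(root: Path, ticker: str, field: str,
                      calendar: list[str], series: dict[str, float]) -> Path:
    """Write one field for one ticker.

    ASSUMED layout: little-endian float32 values, the first being the calendar
    index of the first observation, then one value per calendar day up to the
    last observation (NaN where the ticker has no data for that day).
    """
    observed = sorted(series)
    start = calendar.index(observed[0])
    end = calendar.index(observed[-1])
    values = [series.get(day, float("nan")) for day in calendar[start:end + 1]]
    path = root / "features" / ticker.lower() / f"{field.lower()}.day.bin"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(struct.pack(f"<{len(values) + 1}f", float(start), *values))
    return path


def read_feature_bin(path: Path) -> tuple[int, list[float]]:
    """Read a file written by write_feature_bin back into (start index, values)."""
    raw = path.read_bytes()
    floats = struct.unpack(f"<{len(raw) // 4}f", raw)
    return int(floats[0]), list(floats[1:])


# Round-trip demo on a throwaway directory with made-up prices.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    calendar = write_calendar(root, ["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"])
    write_instruments(root, {"aapl": ("2024-01-03", "2024-01-05")})
    bin_path = write_feature_bin(
        root, "AAPL", "close", calendar, {"2024-01-03": 10.5, "2024-01-05": 11.25}
    )
    start_index, values = read_feature_bin(bin_path)
    instruments_text = (root / "instruments" / "sp500.txt").read_text()
    relative_bin = bin_path.relative_to(root).as_posix()
```

A real adapter would read each cached parquet file instead of the inline dictionary and loop over all fields per ticker; the round-trip reader is only there to make the writer testable offline.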

Effort: ~150 LOC + ~80 LOC tests. ~2 days.

### (b) Full Alpha158 feature compute on 502-ticker universe

Once the BYO adapter is live:

- `init_qlib(QLIB_DATA_CACHE)` — single init per weekly compute run (`qlib.init` is global state)
- `fetch_alpha158_features(instruments="sp500", start_time=..., end_time=...)` — returns the 502 × N_dates × 158 DataFrame
- Decide whether production materializes the full 502 × N_dates × 158 panel OR only a sample (e.g., the latest single date: 502 × 158 = 79,316 floats, lightweight)
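
To put numbers on the materialize-or-sample decision, the arithmetic can be worked through as below. This is a rough sizing sketch only: the 8-byte float64 storage and the 252-trading-day year are assumptions, and the helper names are hypothetical.

```python
N_TICKERS = 502
N_FEATURES = 158


def feature_cells(n_dates: int) -> int:
    """Number of float values in a tickers x dates x features panel."""
    return N_TICKERS * N_FEATURES * n_dates


def footprint_mb(n_dates: int, bytes_per_value: int = 8) -> float:
    """In-memory size in decimal megabytes, assuming dense storage."""
    return feature_cells(n_dates) * bytes_per_value / 1_000_000


latest_only_cells = feature_cells(1)    # 79,316 values
one_year_cells = feature_cells(252)     # 19,987,632 values (ASSUMES 252 trading days)
latest_only_mb = footprint_mb(1)        # about 0.63 MB as float64
one_year_mb = footprint_mb(252)         # about 160 MB as float64
```

Under these assumptions the latest-date slice is well under 1 MB, while a single year of daily history is on the order of 160 MB before any index overhead.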

Effort: ~80 LOC + tests. ~1 day.

### (c) Per-feature cross-validation framework

PBO/DSR (used by Phase 4h OSAP for long-short factor returns) **does NOT directly apply** to Alpha158's per-stock per-date features. Bailey 2014 PBO is rank-based across strategy time-series; Alpha158 emits cross-sectional features per date, not strategy returns. The integration PR needs a different validation surface.

Likely replacement: **walk-forward IC scoring per feature**. For each of the 158 features, compute the Spearman rank correlation between the feature value at t and the forward 1-month or 3-month return over a rolling window. Surface accepted features (IC > 0.02 absolute mean over the rolling window, per `defense-infrastructure/PLAN.md:121`) into composite blending; reject the rest.

Walk-forward CV is the canonical approach, but per [`defense-infrastructure/PLAN.md:270`](https://github.com/dackclup/quantrank/blob/main/.claude/skills/phase-4/defense-infrastructure/PLAN.md#L270) the full walk-forward + purged + embargoed CV is the stronger version that belongs to **Phase 5 backtest infra**. The Phase 4j integration ships with a simpler rolling-12m IC validation as a stopgap.
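
A minimal sketch of the rolling-IC stopgap for one feature follows. Several choices here are assumptions rather than settled design: the IC is computed cross-sectionally per date and then averaged over the last 12 monthly dates, the gate is read as `abs(mean IC) > 0.02`, and the function names are hypothetical. It uses only the standard library so it runs without `scipy`.

```python
from statistics import mean


def average_ranks(values: list[float]) -> list[float]:
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman correlation as the Pearson correlation of average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def rolling_mean_ic(feature_by_date: dict, fwd_return_by_date: dict, window: int = 12) -> float:
    """Mean of per-date cross-sectional ICs over the last `window` dates."""
    dates = sorted(set(feature_by_date) & set(fwd_return_by_date))[-window:]
    ics = []
    for day in dates:
        feats, rets = feature_by_date[day], fwd_return_by_date[day]
        tickers = sorted(set(feats) & set(rets))
        if len(tickers) < 3:
            continue  # too few names for a meaningful rank correlation
        ic = spearman([feats[t] for t in tickers], [rets[t] for t in tickers])
        if ic == ic:  # skip NaN
            ics.append(ic)
    return mean(ics) if ics else float("nan")


def passes_ic_gate(mean_ic: float, threshold: float = 0.02) -> bool:
    """ASSUMED reading of the gate: absolute value of the mean IC above the threshold."""
    return abs(mean_ic) > threshold  # NaN compares False, so it is rejected


# Toy data: one feature that flips sign between months, one that is stable.
returns = {m: {"A": 0.01, "B": 0.02, "C": 0.03, "D": 0.04} for m in ("2024-01", "2024-02")}
flipping = {"2024-01": {"A": 1, "B": 2, "C": 3, "D": 4}, "2024-02": {"A": 4, "B": 3, "C": 2, "D": 1}}
stable = {m: {"A": 1, "B": 2, "C": 3, "D": 4} for m in ("2024-01", "2024-02")}
flipping_ic = rolling_mean_ic(flipping, returns)
stable_ic = rolling_mean_ic(stable, returns)
```

In the toy data the flipping feature averages to an IC of zero and is rejected, while the stable feature passes; real inputs would be the Alpha158 cross-sections and forward returns aligned so that no return overlaps the feature date.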

Effort: ~120 LOC + tests. ~2 days.

### (d) Schema additions

Schema triple lockstep edit (per [`AGENTS.md:229-231`](https://github.com/dackclup/quantrank/blob/main/AGENTS.md#L229-L231)):

- `compute/output/schemas.py` — add to `StockDetail`:
  - `qlib_features: dict[str, float] | None = None` — per-stock subset of accepted Alpha158 features (curated cross-section at `as_of` date)
  - `qlib_blended_score: float | None = None` — optional blended score IF feature blending into composite is wired (see §(e) below)
- `compute/output/schemas.py` — add to `Metadata`:
  - `qlib_features_used: list[str] | None = None` — Alpha158 features that passed IC gate
  - `qlib_features_excluded: list[str] | None = None` — features rejected by IC gate
  - `qlib_features_ic_12m: dict[str, float] | None = None` — per-feature rolling-12m IC (observability)
  - `qlib_features_coverage_pct: dict[str, float] | None = None` — per-feature S&P 500 coverage %
- `compute/config.py:30` — `SCHEMA_VERSION`: `0.9.1-phase4h.2` → `0.10.0-phase4j` (MINOR bump — new phase boundary)
- `frontend/lib/types.ts` — mirror Pydantic additions
- `frontend/lib/schema-snapshot.json` — regenerate via `python -m compute.output.schema_check --update-snapshot`

Effort: ~50 LOC + ~30 LOC tests. ~0.5 day.

### (e) `compute/main.py` wiring decision

**Open question — defer to integration-PR planning time**: how does Alpha158 feed the composite?

Three plausible patterns:

1. **Observability-only** — write per-feature IC + per-stock features into metadata + StockDetail but do NOT blend into composite. Rule 16 trivially holds. Lowest risk.
2. **Phase 4h-style Path-b blend** — pool accepted Alpha158 features into a single per-stock aggregate, blend OUTSIDE `compute_composite()` (preserves `PHASE3_WEIGHTS` sum-to-1.0 invariant at `compute/scoring/composite.py:43-45`). Mirror PR #112 pattern.
3. **Phase 5 ML meta-learner consumer only** — Alpha158 features feed the LightGBM meta-label classifier (per [`phase-5/meta-label/PLAN.md`](https://github.com/dackclup/quantrank/blob/main/.claude/skills/phase-5/meta-label/PLAN.md)) but NOT the Phase 4 composite. Waits for Phase 5 backtest infra (#75).

Recommended default: **(1) observability-only** for the integration PR. Defer (2)/(3) to Phase 5+ once IC evidence accumulates from production diagnostics.

Effort: ~50 LOC + tests. ~0.5 day.

### (f) Top-5 rotation impact analysis

Rule 16 lock applies as for all prior factor libraries: Top-5 ranking stays on raw `composite_score`. `qlib_blended_score` (if wired in §(e)) is informational only. Confirm `entered_top5` / `exited_top5` distributions are unchanged when `qlib_blended_score` is excluded from ranking.
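
The invariance check could look roughly like the sketch below. Everything in it is illustrative: the `top5` and `with_blend` helpers, the linear blend formula, the 0.1 weight, and the dictionary shape of a stock record are hypothetical and do not mirror the PR #112 implementation.

```python
def top5(stocks: list[dict], key: str = "composite_score") -> list[str]:
    """Tickers of the five highest scores, ties broken alphabetically."""
    ranked = sorted(stocks, key=lambda s: (-s[key], s["ticker"]))
    return [s["ticker"] for s in ranked[:5]]


def with_blend(stocks: list[dict], qlib_aggregate: dict[str, float],
               weight: float = 0.1) -> list[dict]:
    """Return copies carrying an informational qlib_blended_score.

    composite_score is copied through untouched; the blend formula and the
    weight are HYPOTHETICAL placeholders.
    """
    blended_stocks = []
    for stock in stocks:
        aggregate = qlib_aggregate.get(stock["ticker"])
        blended = (
            None
            if aggregate is None
            else (1 - weight) * stock["composite_score"] + weight * aggregate
        )
        blended_stocks.append({**stock, "qlib_blended_score": blended})
    return blended_stocks


# Made-up scores: the Qlib aggregate would promote F over E if it were used for ranking.
baseline = [
    {"ticker": t, "composite_score": s}
    for t, s in [("A", 0.90), ("B", 0.85), ("C", 0.80), ("D", 0.75),
                 ("E", 0.70), ("F", 0.69), ("G", 0.50)]
]
aggregate = {s["ticker"]: s["composite_score"] for s in baseline}
aggregate.update({"E": 0.0, "F": 1.0})

enriched = with_blend(baseline, aggregate)
top5_before = top5(baseline)
top5_after = top5(enriched)                              # still ranked on composite_score
top5_if_blended = top5(enriched, key="qlib_blended_score")
```

The third ranking exists only to show the check is not vacuous: ranking on the blended score would rotate the Top-5, so the test fails loudly if ranking ever switches keys.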

Effort: ~30 LOC test + spot-check. ~0.5 day.

## Triggers (open implementation PR when EITHER fires)

1. **Phase 5 backtest infra lands** (`.claude/skills/phase-5/backtest-infrastructure/PLAN.md`) — provides the canonical walk-forward + purged + embargoed CV that replaces this issue's rolling-12m IC stopgap. Recommended trigger.

2. **Analyst / user feedback** indicates Alpha158 features are needed for an in-flight analysis use case (forces the BYO-adapter + integration timeline ahead of Phase 5).

## Effort estimate

| Sub-item | LOC | Days |
|---|---|---|
| (a) yfinance-to-Qlib BYO adapter | ~150 | 2 |
| (b) Full Alpha158 feature compute on 502-ticker | ~80 | 1 |
| (c) Per-feature walk-forward IC validation | ~120 | 2 |
| (d) Schema additions (triple lockstep) | ~50 | 0.5 |
| (e) `compute/main.py` wiring (default = observability-only) | ~50 | 0.5 |
| (f) Top-5 rotation impact test | ~30 | 0.5 |
| Tests + docs + module docstring | ~210 | 1 |
| **Total** | **~690 LOC** | **~7.5 days** |

Smaller than Phase 4h's ~1160 LOC, though the two are not directly comparable:
- Phase 4h had the Path-b blend + PBO/DSR gate which are reusable for Phase 4j only if wiring path (2) is chosen
- BYO adapter is new ground; no prior precedent in QuantRank
- Per-feature IC validation is structurally different from per-signal PBO/DSR

## Sequencing relative to other Phase 4+ tracks

- **Phase 4h.1** (#113, OSAP full per-stock replication) — parallel; can ship independently
- **Phase 4h.2 Part 2** (#116, OSAP threshold calibration) — gated on ≥1 week of production diagnostic data from PR #118 / commit `2125aea8`
- **Phase 4i.1** (#115, JKP full integration) — gated on license-review checkpoint; runs in parallel with this 4j.1
- **Phase 4k scout** (IPCA) — the final factor scout, ships separately
- **Phase 5 backtest infra** — strongly preferred trigger for this 4j.1 (walk-forward CV supersedes the rolling-12m IC stopgap)

Tag `v1.1.0-phase4` is gated on all 4 factor library scouts (4h ✅ + 4i ✅ + 4j ✅ + 4k pending) and their respective integration PRs (4h ✅ + 4h.2 ✅ + 4i.1 + 4j.1 + 4k.1) all merging. This issue is the gating item for 4j specifically.

## Out of scope for this issue

- **CN A-share market integration** — Qlib's default region; QuantRank is US-only universe
- **Qlib's built-in model training** (LightGBM, MLP, etc.) — Alpha158 is feature engineering only; ML model training is Phase 5 ML meta-learner work
- **Qlib's portfolio optimization** (cvxpy-based) — out of scope; QuantRank's Top-5 selection is Rule 16 composite-based
- **`scripts/dump_bin.py` upstream contribution** — if we end up vendoring a port, consider upstreaming to `microsoft/qlib` as a follow-up community contribution (out of scope for QuantRank itself)
- **Heavy-transitive trimming** — mlflow / cvxpy / gym / jupyter are pulled by `pip install pyqlib` but unused by QuantRank. Upstream `[minimal]` extra would help; for now we accept the ~150-180 MB install footprint per PR #119 disclosure

## Related

- Scout PR: #119 (merged 2026-05-19)
- Phase 4h merge: PR #112 (`fbd1acf4`)
- Phase 4h.2 Part 1: PR #118 (`2125aea8`) — observability follow-up
- Phase 4h.1 (OSAP per-stock): #113
- Phase 4h.2 Part 2 (OSAP calibration): #116
- Phase 4i.1 (JKP integration): #115
- Schema: `0.9.1-phase4h.2` (current) → `0.10.0-phase4j` (this issue's target)
- Plan source-of-truth: `.claude/skills/phase-4/qlib-alpha158-fit/PLAN.md` (if exists; else create stub during integration-PR planning)
- Phase 5 prerequisite: `.claude/skills/phase-5/backtest-infrastructure/PLAN.md`

🤖 Filed by Claude Code via the Anthropic SDK after PR #119 (Phase 4j Qlib scout) shipped at `f0ade65b`.

---
_Generated by [Claude Code](https://claude.ai/code/session_015649aRyi2bvciQYZVNACd2)_
