Chimeric two-pass cascade (opt-in --chimeric): +101% Astral / +11% PXD PSMs vs Java, faster + bounded-mem [DRAFT - TMT blocks gate] by ypriverol · Pull Request #42 · bigbio/msgf-rust

ypriverol · 2026-05-31T12:49:00Z

Summary

Opt-in --chimeric two-pass cascade for msgf-rust: recovers co-isolated second peptides (MaxQuant "second-peptide" model) without the wide-window FDR-inflation cost. --chimeric off (default) is byte-identical to the current engine.

Pass 1: narrow top-1 primary search per scan (the normal fast path).
Pass 2: MS1-gated targeted secondary search — detect co-isolated precursors in the isolation window (averagine-KL), score a few candidates at each co-isolated mass on the residual spectrum, one single-bin GF SpecEValue per secondary, emitted as extra rows.
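
As a rough illustration of the averagine-KL gate in Pass 2, the check could look like the sketch below. The function names, the renormalization step, and the `max_kl` threshold parameter are assumptions for illustration; the PR does not show the actual gate.

```rust
/// Kullback-Leibler divergence D(observed || theoretical) between two
/// isotope envelopes. Both inputs are renormalized to sum to 1 first.
/// Returns None if the lengths differ or either envelope has no intensity.
/// (Hypothetical helper; not the PR's actual function.)
pub fn isotope_kl(observed: &[f64], theoretical: &[f64]) -> Option<f64> {
    if observed.len() != theoretical.len() || observed.is_empty() {
        return None;
    }
    let obs_sum: f64 = observed.iter().sum();
    let theo_sum: f64 = theoretical.iter().sum();
    if obs_sum <= 0.0 || theo_sum <= 0.0 {
        return None;
    }
    // Floor on the theoretical probability to avoid division by zero.
    const EPS: f64 = 1e-12;
    let kl: f64 = observed
        .iter()
        .zip(theoretical)
        .map(|(&o, &t)| {
            let p = o / obs_sum;
            let q = (t / theo_sum).max(EPS);
            // A zero observed peak contributes nothing to the divergence.
            if p > 0.0 { p * (p / q).ln() } else { 0.0 }
        })
        .sum();
    Some(kl)
}

/// Assumed gate: accept an MS1 envelope as a plausible co-isolated
/// precursor when its KL divergence is at or below `max_kl`.
pub fn looks_like_peptide_envelope(observed: &[f64], theoretical: &[f64], max_kl: f64) -> bool {
    matches!(isotope_kl(observed, theoretical), Some(kl) if kl <= max_kl)
}
```

A lower divergence means the observed MS1 peaks resemble a peptide-like envelope more closely, so only those masses would be handed to the secondary search.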

Hardened through two rounds of adversarial review (see below).

Results — same-machine vs Java MS-GF+ (entrapment-validated, FDRBench 1:1)

Dataset	Rust PSMs @1% FDR	Java PSMs @1% FDR	PSM change vs Java	Wall time (Rust vs Java)	Rust maxRSS	Entrapment FDP
Astral (LFQ DDA, HCD)	71,877	35,818	+101%	6:38 vs 6:46 (faster)	10.9 GB (= non-chimeric)	1.04%
PXD001819 (UPS1 yeast)	16,592	14,989	+11%	1:14 vs 1:22 (faster)	2.3 GB	1.13%
TMT (PXD007683, CID)	9,671	10,194	−5%	2:14 vs 3:07 (faster)	7.7 GB	—

Gains are real co-isolated peptides: the reversed-decoy and entrapment rulers agree, with entrapment FDP ~1% on Astral and PXD001819 (no entrapment FDP is reported for TMT). Astral and PXD001819 beat Java on both PSMs and speed; the cascade adds no unbounded memory (chimeric maxRSS == non-chimeric; both index-dominated).

Cascade core

crates/search/src/coisolation.rs (new), run_pass2_coisolation + PreparedParts in match_engine.rs, streaming wiring in the binary, force_push in psm.rs.

Speed/quality optimizations

raw-peak primary match; build candidate index once (PreparedParts); single-bin GF per secondary.

Review fixes (two rounds, in this PR)

Round 1 (5-agent + line review): top_n MGF gating; real secondary features (compute_psm_features on residual); precursor_mz_override for correct ExpMass/dm/absdm (PIN + TSV); lower-isotope self-secondary exclusion; deterministic secondary winner (best-SpecEValue, not arbitrary heap order — this drove a large Astral gain); --chimeric-frag-index defaults off; renamed the prefilter index to FragmentPostingIndex.

Round 2 (adversarial): bounded memory — chimeric path now STREAMS (read_with_ms1_chunked, per-chunk bounded MS1, parser-thread pipeline) instead of buffering the whole file; Pass-2 calibration — search_secondary applies the learned precursor shift before the candidate prefilter; Pass-2 competition — secondaries on one scan now claim peaks sequentially so they can't double-count shared leftovers; real resync — a malformed spectrum is skipped and parsing continues (no silent tail-truncation). Each fix has a regression/unit test.
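
The deterministic secondary-winner rule from Round 1 can be sketched as a total ordering over the surviving candidates. The struct and its field types are assumptions for illustration. The tie-break order (lowest SpecEValue, then highest rank_score, then candidate index) follows the review-fix description, and preferring the lower index on a full tie is an assumption.

```rust
/// Hypothetical stand-in for a scored secondary candidate.
#[derive(Debug, Clone, PartialEq)]
pub struct SecondaryCandidate {
    pub candidate_idx: usize,
    pub spec_e_value: f64,
    pub rank_score: i32,
}

/// Pick the winner independently of input (heap) order.
pub fn pick_secondary_winner(cands: &[SecondaryCandidate]) -> Option<&SecondaryCandidate> {
    cands.iter().min_by(|a, b| {
        a.spec_e_value
            .total_cmp(&b.spec_e_value) // lower SpecEValue wins
            .then_with(|| b.rank_score.cmp(&a.rank_score)) // then higher rank_score
            .then_with(|| a.candidate_idx.cmp(&b.candidate_idx)) // then lower index (assumed)
    })
}
```

Because the comparator is a total order with a unique key at the end, any permutation of the same candidates yields the same winner.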

Do not merge yet — blocked by the merge gate

Rust must beat Java on both PSMs and speed on all 3 datasets. TMT PSMs still trail −5% — TMT is CID/narrow with ~no co-isolation, so the cascade can't help it; the gap is a per-peptide CID node-scoring divergence, deferred to a future iteration (additive Percolator features / per-ion CID trace). Diagnosis: docs/parity-analysis/notes/2026-05-31-tmt-gap-diagnosis-not-gf-bug.md. This PR is a reviewable checkpoint, marked draft.

Known follow-ups

Dead refuted-experiment code (fragment_index.rs, fragment_posting_index.rs, shared_fragment.rs, behind false flags) to be stripped before a real merge.
TMT CID scoring gap (the blocker).
The MS2-only send_chunks path could also adopt resync (currently unchanged; out of chimeric scope).

Reference

docs/parity-analysis/notes/2026-05-31-cascade-optimized-multidataset-summary.md

Phased design to recover co-fragmented peptide IDs via the DDA+ approach (full-isolation-window search FIRST, then MS1 targeted-XIC isotope refinement, then greedy shared-fragment rescoring). Maps each DDA+ step to concrete msgf-rust components:
- Phase 1: parse isolation-window width + widen candidate enumeration to the full window + emit top-N distinct-peptide PSMs/scan. Behind --chimeric (default off => bit-identical). Reuses the existing bucket_index range scan + per-charge multi-queue machinery.
- Phase 2: MS1 targeted-XIC + isotope KL-divergence as an ADDITIVE PIN feature (audit-safe), or an external handoff (lean-Rust option).
- Phase 3: greedy shared-fragment rescoring (additive-first; bench-gated).
- Cross-cutting: candidate explosion ties into the planned fragment-index speed enabler; measure Phase-1 wall first before building it.

Motivated by the PR #40 finding that scoring parity is exhausted, so ID gains now require a new capability rather than parity fixes.

…arch + multi-PSM)

Bite-sized TDD tasks for Phase 1 (option A, existing bucket scan):
1. Spectrum isolation-window offset fields
2. mzML <isolationWindow> parsing (MS:1000828/829)
3. --chimeric + --isolation-halfwidth CLI/SearchParams wiring
4. widen candidate enumeration to the isolation window when chimeric
5. retain + emit top-N distinct-peptide PSMs/scan + SpecId uniqueness
6. workspace gate + VM bench decision point (off=bit-identical, on=PXD/TMT measure)

Every task keeps the chimeric=false path on existing code, guarded by the bit-identical PIN golden test. Task 6 is the ship/defer decision (incl. whether the fragment-index enabler becomes a prerequisite if wall is unacceptable).

Add `isolation_lower_offset: Option<f64>` and `isolation_upper_offset: Option<f64>` to `Spectrum`, initialized to `None` in every constructor/literal across the workspace. Bit-identical to baseline (no scoring path touched).

Add `chimeric: bool` and `chimeric_isolation_halfwidth_da: f64` fields to `SearchParams` (defaulting to false/1.5) and the matching `--chimeric` / `--isolation-halfwidth` CLI flags; wire them into SearchParams assembly. When `--chimeric` is on and `--top-n` is at its default of 1, top-N is automatically raised to 5 with an eprintln notice. No search behavior changes; default off path is bit-identical.

…eric Extract candidate_nominal_bounds(spec, z, params, shift_ppm). When params.chimeric, derive the candidate nominal-mass window from the full isolation window (selected m/z ± isolation offsets, or the configured half-width fallback) converted to neutral nominal mass per charge, with isotope-error widening only (the window already exceeds precursor tol). When chimeric is off, the derivation is byte-identical to the original inline code. Test candidate_nominal_bounds_chimeric_spans_isolation_window asserts the chimeric window strictly contains the standard one and that an off-precursor co-isolated mass is reachable only under chimeric. Bit-identical PIN golden gate green (chimeric defaults off); clippy clean.
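
For orientation, the isolation-window-to-neutral-mass conversion described in this commit can be sketched as below. This is a simplified stand-in, not `candidate_nominal_bounds` itself: it returns neutral-mass bounds in Da rather than nominal-mass bins, omits the isotope-error widening and ppm shift, and the function name, signature, and proton-mass constant are assumptions.

```rust
/// Proton mass in Da (assumed constant; the engine may use its own value).
const PROTON_MASS: f64 = 1.007_276_466_88;

/// Convert an isolation window around `selected_mz` into neutral-mass
/// bounds for one charge state. Per-scan offsets are used when present,
/// otherwise the configured half-width is the fallback on both sides.
pub fn isolation_window_neutral_mass(
    selected_mz: f64,
    lower_offset: Option<f64>,
    upper_offset: Option<f64>,
    fallback_halfwidth: f64,
    charge: u32,
) -> (f64, f64) {
    let lo_mz = selected_mz - lower_offset.unwrap_or(fallback_halfwidth);
    let hi_mz = selected_mz + upper_offset.unwrap_or(fallback_halfwidth);
    let z = f64::from(charge);
    // neutral mass = (m/z - proton) * z
    ((lo_mz - PROTON_MASS) * z, (hi_mz - PROTON_MASS) * z)
}
```

Note that the window width in neutral mass scales with charge: a 3 m/z window spans 6 Da at 2+, which is why the candidate set grows so quickly under a blind wide-window search.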

… loop Branch the per-candidate iso-offset match: under --chimeric use matches_isolation_window (accept anywhere in the isolation window, hoisted window bounds), else the existing matches_precursor tight check. Bit-identical PIN golden gate green (chimeric off path unchanged). Also fixes a clippy field-reassign-default in the isolation-window unit test.

A chimeric scan emits multiple distinct-peptide PSMs that can share a SpecE rank (iter_ranked increments rank only on a distinct spec_e_value), which would collide on the `spec_scan_rank` SpecId. Append the per-row emission index under --chimeric only; the standard SpecId format is unchanged so the bit-identical PIN golden still holds. Test asserts two same-SpecE co-fragmented PSMs get distinct SpecIds.
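
A minimal sketch of the SpecId disambiguation follows, assuming a `<file>_<scan>_<rank>` layout for the standard id; the exact format string and the function name are illustrative and not taken from the PIN writer.

```rust
/// Build a SpecId. Under chimeric mode the per-row emission index is
/// appended so that two PSMs sharing a rank on one scan stay distinct;
/// otherwise the standard format is left untouched.
pub fn spec_id(file_stem: &str, scan: u32, rank: u32, row_idx: usize, chimeric: bool) -> String {
    if chimeric {
        format!("{file_stem}_{scan}_{rank}_{row_idx}")
    } else {
        format!("{file_stem}_{scan}_{rank}")
    }
}
```

Keeping the suffix behind the flag is what lets the bit-identical PIN golden test keep passing on the default path.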

…DR without MS1/shared-fragment refinement

…nement)

Five bite-sized tasks to add the MS1 isotope-envelope check that suppresses the Phase-1 FDR inflation:
1. averagine theoretical isotope envelope (new model::isotope)
2. optional MS1 capture + MS2->MS1 linkage in the mzML reader
3. observed precursor isotope envelope + KL-divergence + SNR features
4. additive PrecursorIsotopeKL + PrecursorSNR PIN columns
5. VM bench decision gate (does the KL feature deflate Astral's +94%? if not, Phase 3 shared-fragment rescoring is required)

v1 uses the single linked MS1 (apex); multi-scan XIC correlation deferred to Phase 2b. Additive PIN columns + --chimeric-off bit-identical throughout.

Add `averagine_isotope_envelope(mass, n_isotopes) -> Vec<f64>` in a new `crates/model/src/isotope.rs` module. Uses the averagine + Poisson model (lambda ≈ mass * 4.76e-4 from 13C natural abundance) to compute normalized precursor isotope peak intensities. Registered as `pub mod isotope` in lib.rs. Three unit tests cover normalization, mass-dependent +1/+0 growth, and edge cases (n=0, n=1). New module is unused by default; zero change to existing pipeline output (bit-identical gate: ok).
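
One way the averagine + Poisson envelope could be written, using the signature and the lambda ≈ mass * 4.76e-4 relation from the commit message. Normalizing the peaks to sum to 1 is an assumption (the commit only says "normalized"), and the body is a sketch rather than the code in `isotope.rs`.

```rust
/// Theoretical isotope envelope under a Poisson approximation:
/// P(k) is proportional to lambda^k / k!, with lambda ~ mass * 4.76e-4.
/// The common e^-lambda factor cancels when normalizing, so it is skipped.
pub fn averagine_isotope_envelope(mass: f64, n_isotopes: usize) -> Vec<f64> {
    if n_isotopes == 0 {
        return Vec::new();
    }
    let lambda = (mass * 4.76e-4).max(0.0);
    let mut probs = Vec::with_capacity(n_isotopes);
    let mut term = 1.0_f64; // lambda^0 / 0!
    for k in 0..n_isotopes {
        if k > 0 {
            term *= lambda / k as f64; // lambda^k / k! from the previous term
        }
        probs.push(term);
    }
    // Normalize so the returned intensities sum to 1 (assumed convention).
    let total: f64 = probs.iter().sum();
    probs.iter().map(|p| p / total).collect()
}
```

Under this model the +1/+0 ratio equals lambda, so it grows linearly with mass, which matches the mass-dependent growth the unit tests are described as checking.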

… fields

…er --chimeric

…oes NOT control FDR (Astral still +97%)

…nt; fragment competition is the real missing discriminator

… merge-gate/TMT limit

…ng corollary

…the remaining gate blocker

…ew (low overlap tentatively challenges fragment-theft premise)

… overlap probe pending Astral

… 38% fracmin>=0.5 vs BSA 13%) Decisive Astral chimeric run (MSGF_CHIMERIC_OVERLAP=1, n=121,423 co-emitting scans): mean fracmin 0.367, 38% >=0.5, bimodal with a near-total-overlap mode at [0.9-1.0)=11.4%. BSA low-overlap pattern does NOT hold on real co-isolated data. Fragment-theft premise validated for a substantial fraction -> Phase 3 (shared-fragment competition) is the relevant fix for those scans; ~28% coincidental tail still needs per-scan FDR.

…ce filter spec Approach A (filter + additive PIN columns, no score modification). Greedy peak-claiming (rank-1 first, protected) drives a hard pre-Percolator filter on UniqueMatchedIons (swept knob, decoy-symmetric) + 3 additive PIN columns. Grounded in the 2026-05-29 Astral overlap result (theft confirmed, bimodal). DoD: Astral canary returns to ~36.7k AND PXD beats Java, --chimeric off bit-identical, wall within 3%. Research toward trustworthy chimeric; not a merge.

…l SpecEValue + Percolator on all features) Drop the hand-tuned --chimeric-min-unique-ions filter. Discrimination is the score itself at two layers: (1) in-engine residual SpecEValue re-score on uniquely-claimed peaks deflates theft/coincidental rank>=2 peptides; (2) Percolator on the full feature vector (re-scored RawScore/lnSpecEValue + additive unique-evidence columns). Rule-2-safe: off bit-identical, rank-1 untouched, only chimeric extra rows change.

… SpecEValue re-score

Two-layer discrimination, no parameter (per design 93178d7):
- Pure competition core (shared_fragment.rs): greedy peak-claiming most-confident first; per rank>=2 peptide compute unique-evidence (UniqueMatchedIons, UniqueExplainedFraction, SharedFracClaimed). 7 unit tests.
- match_engine hook (guarded on --chimeric): walk emitted PSMs best-first, claim peaks, and re-score each rank>=2 PSM's RawScore + GF SpecEValue on the residual (unclaimed) spectrum via rescore_residual_spec_e. A theft/coincidental peptide, stripped of stolen peaks, gets a worse SpecEValue and drops out of the FDR set on its own; no hard filter, no threshold. Decoy-symmetric.
- Additive PIN columns gated on --chimeric (off path byte-identical). Header gating test + existing schema-parity test green.

Smoke (BSA test.mgf): off has no Phase-3 cols, on emits 3; 55 rows shared>0, 37 fully-stolen rows deflated to negative residual RawScore / lnSpecE ~0. Validation gate (Astral canary -> ~36.7k, PXD>Java) pending VM bench.

…ary (+111%, not deflated) off vs on @1% FDR: PXD 14,808->18,306, Astral 36,715->77,444 (canary should be ~flat), TMT 9,605->9,362. Decoy fraction 0.83 on-runs (structural inflation intact). Root cause: a per-PSM score change deflates spurious targets AND decoys symmetrically -> q-value curve unchanged -> aggregate 1% count doesn't move. The broken part is the PSM-level FDR model (Phase-2 requirement #2), not the per-peptide score. Phase 3 refuted as a gate-clearer; impl kept as validated negative result; --chimeric off byte-identical; nothing ships.

Approach A: separate target-decoy FDR for rank-1 vs rank>=2 strata (split PIN by rank -> 2 Percolator runs -> sum @1%). Tests the structural fix after Phase 3 (per-PSM rescore) was refuted. Measure on both non-rescored Phase-1 PIN (model alone) and Phase-3 rescored PIN (composition). Canary: Astral total ~36.7k; win: PXD>14,974. No Rust production change (test-only NO_RESCORE env gate).

…+4,362 real PSMs, FDP 1.54->1.13)

… ms2pip/deeplc) DRAFT design from two feasibility investigations: native Percolator integration (stdin pipe to bundled binary, perfect 3.7.1 parity) + ML rescoring features (spectral-angle/RT-delta additive PIN columns, predictions via self-hostable Koina first, native ONNX/XGBoost embedding as phase 2). Pending user review of the open decisions before any implementation plan.

…ct win on all 3 datasets

coderabbitai · 2026-05-31T12:49:07Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

… TSV precursor override CI fix: the raw-peak-match commit (3556bee) inserted primary_matched_peak_keys between search_secondary's doc and its #[allow(clippy::too_many_arguments)], so the allow landed on the 3-arg helper and the 11-arg search_secondary lost it -> clippy -D warnings failed CI. Reorganized so each fn has its own doc; allow back on search_secondary. Review fix: the TSV writer's Precursor + Da-mode PrecursorError columns used the primary scan's spec.precursor_mz for chimeric secondaries; now use precursor_mz_override (mirrors pin.rs). None for ordinary PSMs -> byte-identical. TSV is not the Percolator path, so validated PIN results are unaffected. Found by: CI (clippy) + the code review reviewers.

ypriverol · 2026-05-31T14:11:36Z

Code review (A/B + E fix commits)

Focused review of the post-open fix commits (c7940916, ffaab1d9) — the cascade core was already reviewed before this PR opened. Found 1 real issue; it plus a CI lint regression are now fixed in c45e9c48.

TSV writer used the primary scan's precursor m/z for chimeric secondaries. crates/output/src/tsv.rs computed the Precursor column and the Da-mode PrecursorError from spec.precursor_mz, which for a Pass-2 secondary is the primary's selected m/z — the same bug class the precursor_mz_override field was added to fix in pin.rs, but the TSV writer was missed. Flagged independently by two reviewers. Fixed by routing both through psm.precursor_mz_override.unwrap_or(spec.precursor_mz) (mirrors pin.rs; None for ordinary PSMs → byte-identical). TSV is not the Percolator path, so the validated PIN results are unaffected.
CI clippy too_many_arguments on search_secondary (-D warnings). The earlier raw-peak-match commit (3556bee8) inserted primary_matched_peak_keys between search_secondary's doc and its #[allow(clippy::too_many_arguments)], so the allow landed on the 3-arg helper and the 11-arg search_secondary lost it. Reorganized so each function carries its own doc and the allow sits on search_secondary.
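
The TSV fix reduces to a single fallback expression. The sketch below shows it on stripped-down stand-in types: only the `precursor_mz_override` and `precursor_mz` field names and the `unwrap_or` expression come from the review, while the struct shapes and the helper name are hypothetical.

```rust
/// Minimal stand-ins for the real types (assumed shapes).
pub struct Spectrum {
    pub precursor_mz: f64,
}

pub struct Psm {
    /// Some(mz) only for Pass-2 secondaries; None for ordinary PSMs.
    pub precursor_mz_override: Option<f64>,
}

/// Precursor m/z to report for a row: the co-isolated precursor for a
/// secondary, otherwise the scan's selected precursor.
pub fn reported_precursor_mz(psm: &Psm, spec: &Spectrum) -> f64 {
    psm.precursor_mz_override.unwrap_or(spec.precursor_mz)
}
```

Since ordinary PSMs carry `None`, they fall through to `spec.precursor_mz` and the output stays byte-identical.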

No CLAUDE.md performance-invariant violations (off-path --chimeric off remains byte-identical; no shared-state mutation in the pre-pass). Field threading for the new precursor_mz_override is complete across all construction sites. The E fix (top_n gating for non-mzML) is correct and mzML-path-identical.

Note: PR is a draft (blocked by the TMT merge gate); this review was run on explicit request.

Review fixes:
- HIGH: detect_coisolated now excludes the selected precursor's isotope peaks in BOTH directions (selected_mz +/- k*ISOTOPE/z). Previously only higher isotopes were skipped, so when the instrument selected M+1/M+2 the true monoisotope at selected_mz - k*ISOTOPE/z could be re-discovered as a fake co-isolated secondary (self-inflation of the primary's own peptide).
- search_secondary picks the winner by score (min spec_e_value, then max rank_score, then candidate idx) instead of the unordered drain_into_vec().next() -> removes nondeterminism when tied secondaries survive the capacity-1 queue.
- --chimeric on non-mzML now sets params.chimeric=false, so the PIN schema/SpecId gates and top-N forcing all stay on the normal path (the warning's 'runs normally' is now literally true).
- --chimeric-frag-index defaults to off: the prefilter index is unreachable in the shipped cascade (cascade_wide is always false), so on/auto only built an unused index and cost memory + startup.

Rename: the inverted fragment-posting prefilter index (Approach B) and its module/file are renamed to FragmentPostingIndex / fragment_posting_index to avoid implying provenance from another engine.

Build + clippy(-D warnings) + search/output tests green.
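
The two-sided isotope exclusion (selected_mz +/- k*ISOTOPE/z) can be illustrated with a small predicate. The function name, the `max_k` and `tol_mz` parameters, and the isotope-spacing constant are assumptions; the engine's own ISOTOPE constant and tolerance handling may differ.

```rust
/// Approximate 13C-12C mass difference in Da (assumed value).
const ISOTOPE_SPACING: f64 = 1.003_354_8;

/// True when `candidate_mz` coincides with the selected precursor or with
/// one of its isotope peaks up to `max_k` steps away in EITHER direction.
pub fn is_selected_or_isotope(
    candidate_mz: f64,
    selected_mz: f64,
    charge: u32,
    max_k: u32,
    tol_mz: f64,
) -> bool {
    let step = ISOTOPE_SPACING / f64::from(charge);
    (0..=max_k).any(|k| {
        let d = f64::from(k) * step;
        // Check the higher isotope AND the lower one; the lower side is
        // what catches the true monoisotope when M+1/M+2 was selected.
        (candidate_mz - (selected_mz + d)).abs() <= tol_mz
            || (candidate_mz - (selected_mz - d)).abs() <= tol_mz
    })
}
```

A candidate that passes this predicate would be skipped rather than searched as a secondary, since it belongs to the primary's own isotope envelope.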

Comment-only cleanup of the chimeric cascade files (coisolation, match_engine cascade additions, binary chimeric wiring, pin/tsv precursor override, psm field doc). ~90 net comment lines removed; kept the essential 'why' notes and all public-item doc comments. Zero logic changes; build + clippy -D warnings green.

…libration, tolerant MS1 read

Finding 1 (high, bounded memory): the chimeric path no longer batch-reads the whole file. New MzMLReader::read_with_ms1_chunked streams MS2 in CHUNK_SIZE batches, each with a bounded per-chunk Ms1Link (only the carry-over MS1 crosses a chunk boundary). The binary scores Pass 1 + Pass 2 per chunk on a parser-thread pipeline and drops peaks immediately, so RSS is bounded to ~CHUNK_SIZE spectra. run_pass2_coisolation now takes an explicit per-chunk link + global offset. Unit test proves chunked MS1 linkage matches the batch read across chunk boundaries + carry-over.

Finding 2 (high, calibration): search_secondary applies the learned precursor_mass_shift_ppm (adjusted_observed_neutral_mass) to the co-isolated neutral mass before the candidate prefilter and mass-error report, matching the main search. Regression test with a non-zero shift added.

Finding 3 (med, tolerant parsing): read_with_ms1 + read_with_ms1_chunked no longer abort on the first malformed spectrum; they deliver the spectra parsed so far and (chunked) report the error count, mirroring the MS2-only streaming path.

clippy -D warnings + search/output/input tests green (incl. 2 new tests).
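
The carry-over idea in Finding 1 can be shown on a toy scan stream. This is a sketch under simplifying assumptions: scans are reduced to ids, chunking is done over all scans rather than MS2 batches, and none of the names below are the reader's real API.

```rust
/// Toy scan record: only the MS level and an id (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub enum Scan {
    Ms1 { id: u32 },
    Ms2 { id: u32 },
}

/// Link each MS2 in `chunk` to the most recent preceding MS1.
/// `carry_over` holds the last MS1 seen so far and is the only state
/// that crosses a chunk boundary.
pub fn link_chunk(chunk: &[Scan], carry_over: &mut Option<u32>) -> Vec<(u32, Option<u32>)> {
    let mut links = Vec::new();
    for scan in chunk {
        match *scan {
            Scan::Ms1 { id } => *carry_over = Some(id),
            Scan::Ms2 { id } => links.push((id, *carry_over)),
        }
    }
    links
}

/// Reference behaviour: link the whole run in one pass.
pub fn link_batch(scans: &[Scan]) -> Vec<(u32, Option<u32>)> {
    link_chunk(scans, &mut None)
}

/// Streaming behaviour: process fixed-size chunks, threading the carry-over.
pub fn link_chunked(scans: &[Scan], chunk_size: usize) -> Vec<(u32, Option<u32>)> {
    let mut carry_over = None;
    let mut links = Vec::new();
    for chunk in scans.chunks(chunk_size.max(1)) {
        links.extend(link_chunk(chunk, &mut carry_over));
    }
    links
}
```

The property the unit test is described as checking corresponds to `link_chunked` agreeing with `link_batch` for any chunk size, including chunks that begin with an MS2 whose MS1 sits in the previous chunk.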

…al resync

Finding A (high): Pass-2 secondaries on the same scan now COMPETE for residual evidence. search_secondary takes a prior_claimed peak-key set and returns the peaks its winner explained; run_pass2_coisolation threads these across the scan's co-isolated precursors so a peak claimed by one secondary is removed before the next is scored (no double-counting of shared leftover peaks on multi-precursor scans). New test asserts a peptide matches fewer ions once its peaks are claimed.

Finding B (high): the chimeric mzML reader now does REAL resync instead of silent tail-truncation. New MzMLReader::resync_to_next_spectrum skips a malformed scan to the next <spectrum> and continues; only an unreadable XML stream stops parsing. Wired into read_with_ms1_chunked + read_with_ms1. New test: a bad spectrum between two good ones is skipped (err_count=1) and BOTH good spectra survive.

clippy -D warnings + input/search tests green (incl. 2 new tests).
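
A toy version of the sequential peak claiming in Finding A follows, assuming peaks are identified by integer keys and that "score" is simply the count of still-unclaimed matched peaks. The real `search_secondary` does full scoring; the names and signatures here are illustrative only.

```rust
use std::collections::HashSet;

/// Hypothetical peak identifier (e.g. a binned m/z index).
pub type PeakKey = u32;

/// Peaks of one candidate that are still available on the residual
/// spectrum, i.e. not claimed by an earlier secondary on the same scan.
pub fn unclaimed_peaks(candidate_peaks: &[PeakKey], prior_claimed: &HashSet<PeakKey>) -> Vec<PeakKey> {
    candidate_peaks
        .iter()
        .copied()
        .filter(|key| !prior_claimed.contains(key))
        .collect()
}

/// Score the scan's secondaries in order; each one claims the peaks it
/// explains before the next is scored. Returns matched-ion counts.
pub fn claim_sequentially(secondaries: &[Vec<PeakKey>]) -> Vec<usize> {
    let mut claimed = HashSet::new();
    let mut counts = Vec::with_capacity(secondaries.len());
    for peaks in secondaries {
        let explained = unclaimed_peaks(peaks, &claimed);
        counts.push(explained.len());
        claimed.extend(explained);
    }
    counts
}
```

With two secondaries sharing peaks 2 and 3, the second one is credited only for the peak the first did not take, which is the "matches fewer ions once its peaks are claimed" behaviour the new test asserts.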

…m comments

CLI help (msgf-rust binary):
- Removed stale internal references (pre-iter39, 'G1 gate / DOCS.md section', 'Fix B').
- Rewrote --chimeric help to describe the actual two-pass cascade (was still the old blind wide-window description).
- Rewrote --precursor-cal help (off/auto/on, learn+tighten); clarified --max-spectra, --ms-level, --isolation-halfwidth.
- Hid the advanced/dead --chimeric-frag-index flag from --help.
- Dropped redundant inline '(default X)': clap already shows [default: X].

README: completed the parameter table (required vs optional with defaults in bold, added --precursor-cal/--chimeric/--decoy-prefix/--ms-level) + a 'Chimeric / co-isolated peptides' section describing the opt-in cascade.

Comments: removed development-history jargon (iterNN labels, R-2/C-4/HIGH-2 change codes, project phase/task labels, dead doc-links, date stamps, commit SHAs) from ~150 comment lines across search/output/scoring/model/binary, preserving all technical rationale and doc contracts.

Zero logic changes; clippy -D warnings + build + tests green.

Removes the refuted blind-wide-window 'Approach B' that was unreachable behind a hardcoded `cascade_wide = false`: the fragment-posting prefilter index, the greedy shared-fragment competition + residual rescore, and the `--chimeric-frag-index` flag / FragIndexMode / frag_index_active. The shipped two-pass cascade (run_pass2_coisolation / coisolation.rs) and the --chimeric-off path are unchanged.
- Delete fragment_index.rs, fragment_posting_index.rs, shared_fragment.rs.
- Strip cascade_wide branches in run_chunk_inner (keep the narrow path), the PreparedSearch/PreparedParts fragment_posting_index field, rescore_residual_spec_e, and candidate_nominal_bounds' chimeric param (narrow-only).
- Remove FragIndexMode + chimeric_frag_index from SearchParams + CLI.
- Drop the 3 always-zero shared-fragment PIN columns (UniqueMatchedIons, UniqueExplainedFraction, SharedFracClaimed); chimeric PIN 42 -> 39 cols, off-path PIN unchanged at 39.

build + clippy -D warnings + search/output/input tests green; removed-symbol grep empty.

P1 (perf, behavior-identical):
- Drop per-spectrum window_cand_indices clone; iterate by reference.
- Avoid isotope_error_range.clone() in the hot scoring loop (start..=end).
- merge_unique_candidate_idxs: O(k^2) Vec::contains -> FxHashSet membership, Vec kept for identical output order.

P2 (robustness / quality):
- score_dist::get_probability returns 0.0 for score > max_score instead of panicking (empty-tail-mass; defensive, since match_engine still guards, so normal results unchanged). New unit test.
- generating_function: release-safe DROPPED_NODES atomic + accessor for the |score|>10000 node drop (was silent).
- Rename misleading test to nearest_peak_full_picks_max_intensity_within_tolerance and assert max-intensity (not nearest-m/z) selection.
- msgf-trace GF lookup uses rank_score (production key) not score; documented the remaining single-bin-graph divergence.
- Refresh stale R-2.x header in match_engine_java_parity.

build + clippy -D warnings + search/output/scoring/input tests green.
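
The merge_unique_candidate_idxs change in P1 follows a common pattern: a hash set answers "seen before?" in O(1) while a Vec preserves first-seen order. The sketch below uses std's `HashSet` instead of `FxHashSet` to stay dependency-free, and the signature is an assumption.

```rust
use std::collections::HashSet;

/// Merge several candidate-index lists, dropping duplicates while keeping
/// the order in which each index was first encountered.
pub fn merge_unique_candidate_idxs(lists: &[&[usize]]) -> Vec<usize> {
    let mut seen = HashSet::new();
    let mut merged = Vec::new();
    for list in lists {
        for &idx in *list {
            // `insert` returns true only the first time an index is seen,
            // replacing the O(k) `Vec::contains` scan per element.
            if seen.insert(idx) {
                merged.push(idx);
            }
        }
    }
    merged
}
```

Because the output Vec is filled in encounter order, the result is identical to the quadratic version, which is what keeps the change behavior-identical.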

… blocker, reverted (n=9)

…ision to stop P0 grind

…F audit) Cascade summary note: add Final state (rev5) section recording the code review, two adversarial rounds, dead-code cleanup, and GF/SpecE parity audit; final numbers (Astral +101%, PXD +11%, TMT -5% blocker), HEAD b46b610, gate status. Fix stale HEAD reference in the header. CLAUDE.md: replace the abandoned fragment-index 'speed v2' next-work section with the current chimeric cascade DRAFT PR #42 state and TMT closing options; record that both fragment-index approaches were refuted this session.

Integrates the DeltaRawScore feature (originally bea5d697 on feat/delta-raw-score) onto feat/chimeric-dda-plus. A direct branch merge conflicts on all touched files and would revert the cascade's golden, so the feature is re-applied at the current anchors instead. Top-1 dominance signal: RawScore(best) − RawScore(2nd-best distinct peptide) per spectrum, captured during candidate scoring (independent of the TopNQueue, so it survives top_n=1) and WITHOUT feeding the GF min_score — no emitted PSM's SpecEValue/RawScore changes. Distinct-peptide keyed by nominal residue mass so a shared peptide (multi-charge / multi-protein) doesn't zero the delta. Emitted on the rank-1 PIN row only (0.0 elsewhere), placed after the chimeric MS1 features (PrecursorIsotopeKL/PrecursorSNR), before Peptide/Proteins. Purely additive: off-path PIN gains one column and is otherwise unchanged; TSV is byte-identical. Golden precursor_cal_off.pin regenerated from this binary (39→40 cols). Schema-parity test updated to Java + 4 additive cols. Prior bench (vs the narrow pre-cascade baseline): +129 PXD / +104 Astral / +12 TMT @1% FDR, zero wall cost, decoy structure unchanged.
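
The dominance signal can be sketched as a single pass that tracks the best score and the best score from a different peptide key. The `(key, score)` tuple input, the `f64` score type, and returning `None` when no second distinct peptide exists are assumptions; the commit keys distinct peptides by nominal residue mass and emits 0.0 on non-rank-1 rows.

```rust
/// Compute RawScore(best) - RawScore(second-best DISTINCT peptide).
/// Each entry is (peptide key, raw score); entries sharing a key (for
/// example the same peptide at several charges) count as one peptide.
pub fn delta_raw_score(scored: &[(i64, f64)]) -> Option<f64> {
    let mut best: Option<(i64, f64)> = None;
    // Best score among peptides whose key differs from the current best.
    let mut second: Option<f64> = None;
    for &(key, score) in scored {
        match best {
            None => best = Some((key, score)),
            Some((best_key, best_score)) if key == best_key => {
                // Same peptide again: only ever improves the best.
                if score > best_score {
                    best = Some((key, score));
                }
            }
            Some((_, best_score)) if score > best_score => {
                // New leader from another peptide; old best is runner-up.
                second = Some(best_score);
                best = Some((key, score));
            }
            Some(_) => {
                if second.map_or(true, |s| score > s) {
                    second = Some(score);
                }
            }
        }
    }
    match (best, second) {
        (Some((_, b)), Some(s)) => Some(b - s),
        _ => None,
    }
}
```

Tracking this during candidate scoring rather than reading it off the top-N queue is what lets the feature survive `top_n=1`, as the commit message notes.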

Add additive DeltaRawScore PIN column to the chimeric cascade

qodo-code-review · 2026-06-01T09:04:24Z

Review Summary by Qodo

Chimeric two-pass cascade: +101% Astral / +11% PXD PSMs with bounded-memory streaming and two-round adversarial hardening

✨ Enhancement 🧪 Tests 🐞 Bug fix

Walkthroughs

Description

• **Chimeric two-pass cascade** (opt-in --chimeric flag): recovers co-isolated second peptides
  without wide-window FDR inflation
  - Pass 1: narrow top-1 primary search per scan (fast path, byte-identical to current engine when
  disabled)
  - Pass 2: MS1-gated targeted secondary search on residual spectrum, detecting co-isolated precursors
  via averagine-KL divergence
• **Empirical results** (same-machine vs Java MS-GF+, entrapment-validated):
  - Astral (LFQ DDA, HCD): +101% PSMs, faster (6:38 vs 6:46), 10.9 GB maxRSS
  - PXD001819 (UPS1 yeast): +11% PSMs, faster (1:14 vs 1:22), 2.3 GB maxRSS
  - TMT (CID): −5% PSMs (narrow window, no co-isolation), faster (2:14 vs 3:07), 7.7 GB maxRSS
• **Core implementation**:
  - New coisolation.rs module: detect_coisolated() and search_secondary() functions
  - New chimeric_features.rs module: precursor isotope-envelope matching (KL divergence, SNR)
  - New isotope.rs module: theoretical averagine isotope envelope computation
  - MS1 capture infrastructure in mzml.rs: Ms1Link struct, read_with_ms1_chunked() streaming
  with bounded memory
  - PreparedParts caching in match_engine.rs to reuse candidate enumeration across calibration and
  main passes
• **Hardened through two rounds of adversarial review**:
  - Round 1: top_n MGF gating, real secondary features on residual, precursor_mz_override for correct
  ExpMass/dm, lower-isotope self-secondary exclusion, deterministic secondary winner (best-SpecEValue)
  - Round 2: bounded memory (streaming instead of buffering), Pass-2 calibration (precursor shift
  before prefilter), sequential peak competition (no double-counting), real resync (skip malformed
  spectra, continue parsing)
• **New PIN columns**: PrecursorIsotopeKL, PrecursorSNR, DeltaRawScore (additive
  chimeric/dominance features)
• **CLI integration**: --chimeric flag, --isolation-halfwidth parameter, forced top_n=1 under
  chimeric mode
• **Bug fixes**: GF score distribution out-of-range guard, msgf-trace GF threshold alignment with
  production
• **Documentation cleanup**: removed iteration-specific comments and references throughout codebase
• **Comprehensive test coverage**: isolation-window parsing, MS1 capture, chunked reading, error
  recovery, co-isolation detection, secondary peptide recovery

Diagram

flowchart LR
  A["Input: mzML<br/>with MS1 data"] -->|"read_with_ms1_chunked<br/>bounded per-chunk"| B["MS1 capture<br/>Ms1Link"]
  B -->|"Pass 1:<br/>narrow top-1"| C["Primary search<br/>best peptide"]
  C -->|"PreparedParts<br/>cached candidates"| D["Pass 2:<br/>MS1-gated targeted"]
  B -->|"averagine-KL<br/>co-isolation detect"| E["Secondary candidates<br/>co-isolated masses"]
  E -->|"residual spectrum<br/>sequential peaks"| F["Secondary search<br/>targeted scoring"]
  F -->|"single-bin GF<br/>SpecEValue"| G["Secondary PSMs<br/>extra rows"]
  C -->|"top-1 only"| H["PIN output<br/>+ isotope/SNR/delta features"]
  G -->|"precursor_mz_override"| H

File Changes

1. crates/search/src/match_engine.rs ✨ Enhancement +540/-80

Chimeric two-pass cascade with MS1 co-isolation detection

• Added N_PRECURSOR_ISOTOPES constant and Ms1Link import for chimeric precursor isotope features
• Introduced PreparedParts struct to cache candidate enumeration across calibration and main
 passes
• Added candidate_nominal_bounds() helper function to extract precursor window calculation logic
• Implemented into_parts(), from_parts(), and with_ms1_link() methods for PreparedSearch to
 support two-pass cascade
• Added DeltaRawScore capture logic tracking best and second-best distinct-peptide RawScores per
 spectrum
• Integrated run_pass2_coisolation() call for chimeric secondary peptide detection
• Added chimeric precursor isotope-envelope feature computation (currently disabled via
 CASCADE_SKIP_MS1_FEATURE)
• Implemented cleavage_credit_for() module-level function for secondary peptide scoring
• Added matched_peak_keys() diagnostic function for fragment-overlap measurement
• Updated compute_psm_features() to initialize new chimeric feature fields
 (precursor_isotope_kl, precursor_snr, delta_raw_score)
• Cleaned up dated commit references and iteration markers from comments

crates/search/src/match_engine.rs

2. crates/input/src/mzml.rs ✨ Enhancement +665/-14

MS1 capture and MS2-to-MS1 linkage for chimeric detection

• Added Ms1Link struct to link MS2 spectra to their preceding MS1 scans and store MS1 peak lists
• Added IsolationWindow parser state and isolation-window offset CV parameters (MS:1000828,
 MS:1000829)
• Extended SpectrumBuilder with isolation_lower_offset and isolation_upper_offset fields
• Refactored spectrum building into build_spectrum() and build_peaks() helper methods for reuse
• Added MS1 capture infrastructure to MzMLReader with capture_ms1 flag and captured_ms1
 storage
• Implemented read_with_ms1() method to drain reader and return MS2 spectra with MS1 linkage
• Implemented read_with_ms1_chunked() streaming method with bounded memory and error tolerance
• Added resync_to_next_spectrum() method to skip malformed spectra and continue parsing
• Added comprehensive tests for isolation-window parsing, MS1 capture, chunked reading, and error
 recovery

crates/input/src/mzml.rs

3. crates/output/src/pin.rs ✨ Enhancement +148/-29

PIN output schema expansion for chimeric and dominance features

• Added three new PIN columns: PrecursorIsotopeKL, PrecursorSNR, and DeltaRawScore (all
 between EdgeScore and Peptide)
• Updated write_header() to include the three new additive chimeric/dominance feature columns
• Modified write_spectrum_rows() to track per-row emission index for unique SpecId generation
 under --chimeric
• Added precursor_mz_override support in write_psm_row() to use co-isolated precursor m/z for
 secondary PSMs
• Implemented chimeric SpecId uniqueness by appending row index when params.chimeric is true
• Added DeltaRawScore gating to rank-1 rows only (mirroring lnDeltaSpecEValue behavior)
• Updated test fixtures to include new fields and added tests for chimeric SpecId uniqueness and
 DeltaRawScore rank gating

crates/output/src/pin.rs

4. crates/search/src/precursor_matching.rs ✨ Enhancement +108/-0

Isolation-window precursor matching for co-isolated peptides

• Added matches_isolation_window() function for wide-window co-isolated peptide matching
• Isolation-window variant clamps mass error to the nearest in-window neutral mass (reports
 near-zero error for in-window peptides)
• Supports isotope offset and precursor tolerance expansion for co-isolation detection
• Added comprehensive unit tests validating in-window acceptance and out-of-window rejection

crates/search/src/precursor_matching.rs

5. crates/output/tests/output_pin_schema_parity.rs 🧪 Tests +32/-20

PIN schema parity test updates for four additive columns

• Updated schema parity test to account for four additive Rust-only PIN columns instead of one
• Modified column-count assertion to expect Java + 4 instead of Java + 1
• Updated position validation to check all four additive columns sit between matchedIonRatio and
 Peptide
• Clarified comments documenting the four new additive features and their purposes

crates/output/tests/output_pin_schema_parity.rs

6. crates/search/src/coisolation.rs ✨ Enhancement +609/-0

Chimeric cascade: co-isolated precursor detection and targeted secondary search

• New module implementing the chimeric two-pass cascade for detecting co-isolated precursors in MS1
 isolation windows
• detect_coisolated function identifies co-isolated precursors by averagine KL divergence
 matching, excluding the selected precursor and its isotopes
• search_secondary function performs targeted residual-spectrum search on co-isolated masses with
 sequential peak competition
• Comprehensive unit tests validating co-isolation detection, secondary peptide recovery, and
 precursor calibration application

crates/search/src/coisolation.rs

7. crates/scoring/src/scoring/scored_spectrum.rs 📝 Documentation +60/-28

Documentation cleanup and test fixture updates for isolation windows

• Removed iteration-specific comments (iter30, iter31, iter33, iter36, iter38) to clean up
 documentation
• Fixed nearest_peak_full test to verify max-intensity selection within tolerance window, not
 closest-by-m/z
• Added isolation_lower_offset and isolation_upper_offset fields to test spectrum fixtures
• Clarified comments around deconvolution ordering and cache behavior

crates/scoring/src/scoring/scored_spectrum.rs

8. crates/msgf-rust/src/bin/msgf-rust.rs ✨ Enhancement +177/-84

Chimeric cascade CLI integration with streaming MS1 linking

• Added --chimeric flag to enable two-pass cascade (requires mzML with MS1 data)
• Added --isolation-halfwidth parameter for fallback isolation window width when mzML lacks
 per-scan offsets
• Implemented dual-path streaming: chimeric path uses read_with_ms1_chunked with bounded per-chunk
 MS1 linking; non-chimeric path unchanged
• Forced top_n = 1 under chimeric mode to prevent multi-emission inflation; Pass 1 emits only best
 primary, Pass 2 emits secondaries
• Reused PreparedParts from the calibration pre-pass to avoid re-enumerating candidates (~15 s saved on
 Astral)
• Cleaned up CLI documentation to remove iteration-specific language

crates/msgf-rust/src/bin/msgf-rust.rs
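
The bounded per-chunk MS1 linking mentioned above can be illustrated with a simplified model. Everything here is an assumption for illustration: `Scan`, `ChunkLink`, and `stream_chunks` are hypothetical stand-ins, not `read_with_ms1_chunked` or `Ms1Link`, and error handling is omitted.

```rust
/// Hypothetical simplified scan stream: MS1 carries peak m/z values,
/// MS2 carries only a scan id.
pub enum Scan {
    Ms1(Vec<f64>),
    Ms2(u32),
}

/// Per-chunk link: indices in `ms2_to_ms1` point into this chunk's `ms1_peaks`.
pub struct ChunkLink {
    pub ms1_peaks: Vec<Vec<f64>>,
    pub ms2_to_ms1: Vec<Option<usize>>,
}

pub fn stream_chunks<I, F>(scans: I, chunk_size: usize, mut on_chunk: F)
where
    I: IntoIterator<Item = Scan>,
    F: FnMut(Vec<u32>, ChunkLink),
{
    assert!(chunk_size > 0, "chunk_size must be positive");
    let mut ms2: Vec<u32> = Vec::new();
    let mut ms1_peaks: Vec<Vec<f64>> = Vec::new();
    let mut ms2_to_ms1: Vec<Option<usize>> = Vec::new();
    for scan in scans {
        match scan {
            Scan::Ms1(peaks) => ms1_peaks.push(peaks),
            Scan::Ms2(id) => {
                ms2.push(id);
                // Link to the most recent MS1 seen in, or carried into, this chunk.
                ms2_to_ms1.push(ms1_peaks.len().checked_sub(1));
                if ms2.len() == chunk_size {
                    // Only the latest MS1 crosses the chunk boundary.
                    let carry: Vec<Vec<f64>> =
                        ms1_peaks.last().cloned().into_iter().collect();
                    let link = ChunkLink {
                        ms1_peaks: std::mem::replace(&mut ms1_peaks, carry),
                        ms2_to_ms1: std::mem::take(&mut ms2_to_ms1),
                    };
                    on_chunk(std::mem::take(&mut ms2), link);
                }
            }
        }
    }
    if !ms2.is_empty() {
        on_chunk(ms2, ChunkLink { ms1_peaks, ms2_to_ms1 });
    }
}
```

Because each chunk hands off its MS1 storage and keeps only one carry-over scan, retained memory scales with the chunk size rather than the file size. This sketch still keeps MS1 scans within a chunk that no MS2 references; a real reader might drop those.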

9. crates/search/src/psm.rs ✨ Enhancement +82/-36

Chimeric features and secondary PSM emission support

• Added PsmFeatures fields: precursor_isotope_kl, precursor_snr (MS1 envelope matching), and
 delta_raw_score (top-1 dominance)
• Added precursor_mz_override field to PsmMatch for chimeric Pass-2 secondaries to report their
 own co-isolated precursor m/z
• Implemented force_push method on TopNQueue to add secondaries without eviction (distinct
 co-isolated peptides, not competitors)
• Removed iteration-specific comments and clarified documentation around rank_score, edge_score, and
 tie handling
• Updated test fixtures to initialize new fields

crates/search/src/psm.rs
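
To show why `force_push` differs from a competitive push, here is a toy top-N queue. The `Psm` and `TopN` types are hypothetical simplifications (a sorted `Vec` instead of a heap) and are not the crate's `PsmMatch` or `TopNQueue`.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Psm {
    pub peptide: String,
    pub spec_evalue: f64,
}

/// Toy bounded queue keeping the `cap` best PSMs (lowest SpecEValue first).
pub struct TopN {
    cap: usize,
    items: Vec<Psm>,
}

impl TopN {
    pub fn new(cap: usize) -> Self {
        Self { cap, items: Vec::new() }
    }

    /// Competitive push: candidates for the same precursor evict each other.
    pub fn push(&mut self, psm: Psm) {
        self.items.push(psm);
        self.items.sort_by(|a, b| a.spec_evalue.total_cmp(&b.spec_evalue));
        self.items.truncate(self.cap);
    }

    /// Non-competitive push: a secondary is a distinct co-isolated peptide,
    /// so it is appended without evicting the primary. Assumed to be called
    /// only after all competitive pushes for the scan are done.
    pub fn force_push(&mut self, psm: Psm) {
        self.items.push(psm);
    }

    pub fn items(&self) -> &[Psm] {
        &self.items
    }
}
```

With `cap = 1`, two competitive pushes leave only the better primary, and a following `force_push` yields two rows for the scan.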

10. crates/search/src/chimeric_features.rs ✨ Enhancement +241/-0

Precursor isotope envelope matching for chimeric MS1 filtering

• New module for precursor isotope-envelope matching against observed MS1 peaks
• precursor_isotope_match computes KL divergence and SNR between theoretical averagine envelope
 and observed MS1 isotope cluster
• Uses max-intensity peak selection within tolerance window for each isotope position
• Comprehensive unit tests covering clean envelopes, missing envelopes, empty peaks, and
 max-intensity selection

crates/search/src/chimeric_features.rs
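
A rough sketch of the envelope comparison described above, covering only the KL part. The function names, the KL direction (theoretical against observed), and the epsilon floor are assumptions; the SNR definition is not given in this summary, so it is left out.

```rust
/// Approximate 13C-12C mass difference in Da.
pub const ISOTOPE_SPACING: f64 = 1.003_355;

/// Highest intensity among peaks within `tol` of `center` (0.0 if none).
/// `peaks` holds (m/z, intensity) pairs.
pub fn max_intensity_in_window(peaks: &[(f64, f64)], center: f64, tol: f64) -> f64 {
    peaks
        .iter()
        .filter(|(mz, _)| (mz - center).abs() <= tol)
        .map(|&(_, intensity)| intensity)
        .fold(0.0, f64::max)
}

/// Hypothetical sketch: KL(theoretical || observed) over the isotope
/// positions of a precursor. Returns `None` when no isotope peak is found.
pub fn envelope_kl(
    peaks: &[(f64, f64)],
    mono_mz: f64,
    charge: u32,
    theoretical: &[f64],
    tol: f64,
) -> Option<f64> {
    let observed: Vec<f64> = (0..theoretical.len())
        .map(|k| {
            let mz = mono_mz + k as f64 * ISOTOPE_SPACING / f64::from(charge);
            max_intensity_in_window(peaks, mz, tol)
        })
        .collect();
    let total: f64 = observed.iter().sum();
    if total <= 0.0 {
        return None;
    }
    // Floor the observed fractions so a missing isotope does not divide by zero.
    const EPS: f64 = 1e-9;
    let kl = theoretical
        .iter()
        .zip(&observed)
        .map(|(&p, &o)| {
            let q = (o / total).max(EPS);
            if p > 0.0 { p * (p / q).ln() } else { 0.0 }
        })
        .sum();
    Some(kl)
}
```

Selecting the most intense peak in each window, rather than the closest by m/z, means a small interfering peak next to the true isotope peak does not distort the observed envelope.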

11. crates/output/src/tsv.rs ✨ Enhancement +13/-5

TSV output support for chimeric secondary precursor m/z override

• Updated TSV writer to use precursor_mz_override when available (chimeric secondaries), falling
 back to spectrum's precursor m/z
• Applied override-aware precursor m/z to mass-error Da conversion calculation
• Added isolation_lower_offset and isolation_upper_offset to test spectrum fixtures
• Updated test PSM fixtures to include new fields

crates/output/src/tsv.rs
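
The override fallback can be sketched in a few lines. The helper names and the exact mass-error formula are assumptions for illustration; only the `precursor_mz_override` concept comes from the change description.

```rust
/// Proton mass in Da (standard approximate value).
pub const PROTON_MASS: f64 = 1.007_276_466_88;

/// Use the co-isolated precursor m/z for secondaries, else the spectrum's own.
pub fn effective_precursor_mz(spectrum_mz: f64, precursor_mz_override: Option<f64>) -> f64 {
    precursor_mz_override.unwrap_or(spectrum_mz)
}

/// Hypothetical mass error in Da: observed neutral mass minus theoretical.
pub fn mass_error_da(precursor_mz: f64, charge: u32, theoretical_mass: f64) -> f64 {
    (precursor_mz - PROTON_MASS) * f64::from(charge) - theoretical_mass
}
```

Feeding the effective m/z into the error calculation is the point of the change: a secondary scored against the selected precursor's m/z would report a mass error as wide as the isolation window.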

12. crates/model/src/isotope.rs ✨ Enhancement +73/-0

Theoretical averagine isotope envelope computation

• New module implementing averagine_isotope_envelope function using Poisson model for 13C
 distribution
• Computes relative intensities of isotope peaks from peptide neutral mass, normalized to sum 1.0
• Unit tests validate normalization, mass-dependent envelope shape, and edge cases (zero/one
 isotope)

crates/model/src/isotope.rs
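
A minimal sketch of a Poisson averagine envelope, assuming commonly quoted averagine constants; the constants and signature used by `averagine_isotope_envelope` in this PR may differ. The Poisson rate is the expected number of 13C atoms at the given mass.

```rust
/// Assumed averagine constants: average residue mass (Da), carbons per
/// residue, and natural 13C abundance. Illustrative values only.
const AVERAGINE_MASS: f64 = 111.1254;
const AVERAGINE_CARBONS: f64 = 4.9384;
const C13_ABUNDANCE: f64 = 0.0107;

/// Relative intensities of the first `n_peaks` isotope peaks, summing to 1.0.
pub fn averagine_envelope(neutral_mass: f64, n_peaks: usize) -> Vec<f64> {
    if n_peaks == 0 {
        return Vec::new();
    }
    // Expected number of 13C atoms in a peptide of this mass.
    let lambda = neutral_mass / AVERAGINE_MASS * AVERAGINE_CARBONS * C13_ABUNDANCE;
    let mut out = Vec::with_capacity(n_peaks);
    // Poisson terms via the recurrence P(k+1) = P(k) * lambda / (k + 1).
    let mut term = (-lambda).exp();
    for k in 0..n_peaks {
        out.push(term);
        term *= lambda / (k as f64 + 1.0);
    }
    // Renormalize the truncated distribution so it sums to 1.0.
    let sum: f64 = out.iter().sum();
    if sum > 0.0 {
        for v in &mut out {
            *v /= sum;
        }
    }
    out
}
```

Under these constants the monoisotopic peak dominates near 1000 Da, while around 4000 Da the first 13C peak overtakes it, which is the mass-dependent shape the unit tests are described as checking.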

13. crates/scoring/src/gf/primitive_graph.rs 📝 Documentation +11/-12

Documentation cleanup for GF primitive graph

• Removed iteration-specific comments (iter36, iter37 P-8) referencing prior per-graph cache removal
• Clarified documentation around spectrum-wide observed_mass_cache de-duplication
• Added isolation_lower_offset and isolation_upper_offset to test spectrum fixtures

crates/scoring/src/gf/primitive_graph.rs

14. crates/scoring/src/scoring/psm_score.rs 📝 Documentation +8/-2

Documentation cleanup and test fixture updates
• Removed iteration-specific comment (iter31 P-2) from env-var caching documentation
• Added isolation_lower_offset and isolation_upper_offset to test spectrum fixtures
crates/scoring/src/scoring/psm_score.rs

15. crates/scoring/src/scoring/rank_scorer.rs 📝 Documentation +3/-5

Documentation cleanup for rank scorer

• Removed iteration-specific references (iter25 fix) from Java-parity edge case documentation
• Simplified comments around prob_peak > 1 NaN/inf handling without changing logic

crates/scoring/src/scoring/rank_scorer.rs

16. crates/search/tests/match_engine_java_parity.rs 📝 Documentation +5/-22

Java parity test documentation simplification

• Simplified test documentation by removing detailed R-2/R-3/C-4/C-5/F-1 parity gap enumeration
• Clarified scope: tests verify spectrum coverage and top-1 peptide identity, not full feature
 distribution parity
• Removed references to specific audit documents and iteration numbers

crates/search/tests/match_engine_java_parity.rs

17. crates/model/src/spectrum.rs ✨ Enhancement +17/-0

Spectrum struct extended with isolation window offsets

• Added isolation_lower_offset and isolation_upper_offset fields to Spectrum struct for
 chimeric mode
• Updated Default implementation and all test fixtures to initialize new fields
• Added unit test validating isolation offsets default to None

crates/model/src/spectrum.rs

18. crates/search/src/search_params.rs ✨ Enhancement +11/-3

Search parameters extended for chimeric cascade configuration

• Added chimeric boolean flag (default false) to enable two-pass cascade
• Added chimeric_isolation_halfwidth_da parameter (default 1.5 Da) for fallback isolation window
 width
• Removed iteration-specific documentation references
• Updated default_tryptic to initialize new fields

crates/search/src/search_params.rs

19. crates/scoring/src/gf/score_dist.rs 🐞 Bug fix +18/-2

Score distribution out-of-range guard and test coverage

• Updated get_probability to return 0.0 for scores above max_score (out-of-range defensive
 guard)
• Added unit test validating out-of-range score returns empty tail mass 0.0
• Clarified documentation around score bounds and out-of-range behavior

crates/scoring/src/gf/score_dist.rs
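
The out-of-range guard can be illustrated with a toy distribution. The `ScoreDist` layout below is hypothetical, and returning 0.0 for scores below the minimum is an assumption of this sketch; the change description only covers scores above `max_score`.

```rust
/// Toy score distribution: `probs[i]` is the probability of score `min_score + i`.
pub struct ScoreDist {
    pub min_score: i32,
    pub probs: Vec<f64>,
}

impl ScoreDist {
    /// Probability at `score`; any score outside the stored range has no mass.
    pub fn get_probability(&self, score: i32) -> f64 {
        // Widen before subtracting so extreme scores cannot overflow.
        let idx = i64::from(score) - i64::from(self.min_score);
        if idx < 0 {
            return 0.0; // assumed behaviour below the range
        }
        // `get` returns None past the end, so an above-range score yields 0.0
        // instead of panicking on an out-of-bounds index.
        self.probs.get(idx as usize).copied().unwrap_or(0.0)
    }
}
```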

20. crates/scoring/src/gf/generating_function.rs ✨ Enhancement +17/-0

GF-DP node drop counter for telemetry

• Added process-global DROPPED_NODES atomic counter to track GF-DP nodes pruned by score-range
 guard
• Implemented dropped_nodes_count() function for release-safe telemetry (relaxed atomic load)
• Incremented counter in compute_inner when nodes fall outside [-10000, 10000] range

crates/scoring/src/gf/generating_function.rs
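
A sketch of the telemetry pattern described above. `DROPPED_NODES` and `dropped_nodes_count()` follow the names in the summary; the `keep_node` helper and the inclusive bounds are assumptions standing in for the check inside `compute_inner`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Process-global count of nodes pruned by the score-range guard.
static DROPPED_NODES: AtomicU64 = AtomicU64::new(0);

const MIN_NODE_SCORE: i32 = -10_000;
const MAX_NODE_SCORE: i32 = 10_000;

/// Relaxed load: the counter is telemetry only and orders no other memory.
pub fn dropped_nodes_count() -> u64 {
    DROPPED_NODES.load(Ordering::Relaxed)
}

/// Hypothetical guard: returns whether a node stays in the DP, counting drops.
pub fn keep_node(score: i32) -> bool {
    if (MIN_NODE_SCORE..=MAX_NODE_SCORE).contains(&score) {
        true
    } else {
        DROPPED_NODES.fetch_add(1, Ordering::Relaxed);
        false
    }
}
```

Relaxed ordering is enough here because readers only need an eventually consistent total, and it keeps the counter cheap on the hot path.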

21. crates/msgf-rust/src/bin/msgf-trace.rs 🐞 Bug fix +21/-1

Msgf-trace GF threshold alignment with production

• Updated GF score threshold to use rank_score (node + cleavage + edge) instead of score (node +
 cleavage)
• Aligns trace dump with production SpecEValue path seeding
• Added detailed comments explaining intentional differences from production (single-bin graph vs
 merged group, hardcoded terminal flags)

crates/msgf-rust/src/bin/msgf-trace.rs

22. crates/scoring/tests/gf_graph_dp.rs 🧪 Tests +3/-0

Test fixture updates for isolation window fields
• Added isolation_lower_offset and isolation_upper_offset to test spectrum fixtures
crates/scoring/tests/gf_graph_dp.rs

23. crates/search/tests/match_engine_smoke.rs 🧪 Tests +2/-0

Test fixture updates for isolation window fields
• Added isolation_lower_offset and isolation_upper_offset to test spectrum fixtures
crates/search/tests/match_engine_smoke.rs

24. crates/search/src/mass_calibrator.rs 🧪 Tests +4/-0

Add chimeric and isolation offset fields to test fixtures
• Added chimeric and chimeric_isolation_halfwidth_da fields to test fixture SearchParams
• Added isolation_lower_offset and isolation_upper_offset fields to test Spectrum struct
crates/search/src/mass_calibrator.rs

25. crates/search/tests/match_engine_specevalue.rs 🧪 Tests +3/-0

Add isolation and override fields to match engine test fixtures
• Added isolation_lower_offset and isolation_upper_offset fields to test Spectrum construction
• Added precursor_mz_override field to test PsmMatch construction
crates/search/tests/match_engine_specevalue.rs

26. crates/search/src/lib.rs ✨ Enhancement +3/-1

Register chimeric cascade modules and re-export Pass-2 driver
• Registered new modules chimeric_features and coisolation for chimeric two-pass cascade
• Re-exported run_pass2_coisolation from match_engine for public API
crates/search/src/lib.rs

27. crates/search/src/precursor_cal.rs 📝 Documentation +4/-4

Clarify precursor calibration documentation and defaults
• Simplified documentation by removing references to "Phase 0–1" and "Phase 3"
• Clarified that Default is Off (opt-in) to match CLI default
crates/search/src/precursor_cal.rs

28. crates/model/src/aa_set.rs Formatting +1/-1

Clean up test comment by removing audit label
• Removed "iter28 audit:" prefix from test comment, keeping the substantive note about GF DP source
 AAs
crates/model/src/aa_set.rs

29. crates/search/tests/mass_calibrator_integration.rs 🧪 Tests +2/-0

Add isolation offset fields to mass calibrator integration test
• Added isolation_lower_offset and isolation_upper_offset fields to test Spectrum construction
crates/search/tests/mass_calibrator_integration.rs

30. crates/model/src/tolerance.rs 📝 Documentation +1/-1

Simplify tolerance documentation reference
• Updated documentation to replace "Phase B's calibrator" with "The precursor calibrator"
crates/model/src/tolerance.rs

31. crates/scoring/tests/primitive_graph_arena_parity.rs 🧪 Tests +2/-0

Add isolation offset fields to scoring test fixture
• Added isolation_lower_offset and isolation_upper_offset fields to empty test Spectrum
crates/scoring/tests/primitive_graph_arena_parity.rs

32. crates/scoring/src/gf/group.rs 🧪 Tests +2/-0

Add isolation offset fields to GF group test fixture
• Added isolation_lower_offset and isolation_upper_offset fields to test Spectrum construction
crates/scoring/src/gf/group.rs

33. crates/search/tests/precursor_matching.rs 🧪 Tests +2/-0

Add isolation offset fields to precursor matching test
• Added isolation_lower_offset and isolation_upper_offset fields to test Spectrum construction
crates/search/tests/precursor_matching.rs

34. crates/input/src/mgf.rs ✨ Enhancement +2/-0

Add isolation offset fields to MGF spectrum parsing
• Added isolation_lower_offset and isolation_upper_offset fields to Spectrum struct
 initialization in MGF reader
crates/input/src/mgf.rs

35. crates/input/src/lib.rs ✨ Enhancement +1/-1

Export Ms1Link for MS1 precursor linking
• Added Ms1Link to public re-exports from mzml module for MS1 linking support
crates/input/src/lib.rs

36. crates/model/src/lib.rs ✨ Enhancement +1/-0

Register isotope module for envelope calculations
• Registered new isotope module for averagine isotope envelope calculations
crates/model/src/lib.rs

37. docs/superpowers/plans/2026-05-29-chimeric-fragment-index-prefilter.md 📝 Documentation +593/-0

Fragment-index prefilter implementation plan (Approach A)
• Comprehensive implementation plan for fragment-index prefilter (Approach A) with 6 tasks covering
 CSR index build, FragmentVoter, wiring, and VM gates
• Includes detailed step-by-step instructions, test templates, and integration points for the
 chimeric search optimization
docs/superpowers/plans/2026-05-29-chimeric-fragment-index-prefilter.md

38. docs/superpowers/plans/2026-05-30-chimeric-sage-style-fragment-index.md 📝 Documentation +410/-0

Sage-style fragment-index implementation plan (Approach B)
• Alternative implementation plan for Sage-style fragment index (Approach B) with mass-sorted
 candidates and m/z-bucketed fragments
• Includes 6 tasks with dual binary-search query, window-bounded scoring, and empirical gates on PXD
 and Astral datasets
docs/superpowers/plans/2026-05-30-chimeric-sage-style-fragment-index.md

39. docs/superpowers/plans/2026-05-30-chimeric-two-pass-cascade.md 📝 Documentation +325/-0

Two-pass cascade implementation plan for chimeric search
• Implementation plan for two-pass cascade (narrow Pass 1 + MS1-gated targeted Pass 2) with 4 tasks
• Covers co-isolated precursor detection, residual-spectrum targeted search, driver wiring, and VM
 gates
docs/superpowers/plans/2026-05-30-chimeric-two-pass-cascade.md

40. docs/2026-05-28-psm-gain-state-and-roadmap.md 📝 Documentation +259/-0

PSM-gain state and roadmap consolidation document

• Consolidated state-of-play document covering PSM-gain progress, empirical rules learned, remaining
 gaps, and ranked action plan
• Defines four levers (mod/param audit, SpecE-shape fix, additive features, top-2 emission) with
 risk/cost analysis and realistic outcomes

docs/2026-05-28-psm-gain-state-and-roadmap.md

41. docs/parity-analysis/notes/2026-05-29-gate-chimeric-norescore-vs-java.md 📝 Documentation +63/-0

Chimeric NO_RESCORE gate results and blocker analysis
• Gate-run results showing chimeric NO_RESCORE achieves +21.6% PXD and +115.8% Astral PSMs but −5%
 TMT and 1.16–2.71× slower wall time
• Identifies speed (fragment-index candidate generator) and TMT (GF SpecEValue shape) as binding
 constraints for merge gate
docs/parity-analysis/notes/2026-05-29-gate-chimeric-norescore-vs-java.md

42. .claude/CLAUDE.md Additional files +16/-1

...

.claude/CLAUDE.md

43. README.md Additional files +35/-20

...

README.md

44. crates/search/Cargo.toml Additional files +1/-1

...

crates/search/Cargo.toml

45. crates/search/src/search_index.rs Additional files +0/-2

...

crates/search/src/search_index.rs

46. docs/parity-analysis/notes/2026-05-28-SESSION-HANDOFF.md Additional files +27/-0

...

docs/parity-analysis/notes/2026-05-28-SESSION-HANDOFF.md

47. docs/parity-analysis/notes/2026-05-28-chimeric-fragment-overlap-diagnostic.md Additional files +97/-0

...

docs/parity-analysis/notes/2026-05-28-chimeric-fragment-overlap-diagnostic.md

48. docs/parity-analysis/notes/2026-05-28-chimeric-phase1-bench.md Additional files +42/-0

...

docs/parity-analysis/notes/2026-05-28-chimeric-phase1-bench.md

49. docs/parity-analysis/notes/2026-05-28-chimeric-phase2-bench.md Additional files +130/-0

...

docs/parity-analysis/notes/2026-05-28-chimeric-phase2-bench.md

50. docs/parity-analysis/notes/2026-05-29-chimeric-full-review-and-rethink.md Additional files +190/-0

...

docs/parity-analysis/notes/2026-05-29-chimeric-full-review-and-rethink.md

51. docs/parity-analysis/notes/2026-05-29-chimeric-phase3-bench-canary-fails.md Additional files +79/-0

...

docs/parity-analysis/notes/2026-05-29-chimeric-phase3-bench-canary-fails.md

52. docs/parity-analysis/notes/2026-05-29-entrapment-fdp-reversal.md Additional files +111/-0

...

docs/parity-analysis/notes/2026-05-29-entrapment-fdp-reversal.md

53. docs/parity-analysis/notes/2026-05-29-rank-stratified-fdr-bench.md Additional files +72/-0

...

docs/parity-analysis/notes/2026-05-29-rank-stratified-fdr-bench.md

54. docs/parity-analysis/notes/2026-05-30-cascade-astral-breakthrough.md Additional files +84/-0

...

docs/parity-analysis/notes/2026-05-30-cascade-astral-breakthrough.md

55. docs/parity-analysis/notes/2026-05-30-chimeric-cost-profile.md Additional files +42/-0

...

docs/parity-analysis/notes/2026-05-30-chimeric-cost-profile.md

56. docs/parity-analysis/notes/2026-05-30-frag-index-pxd-fails-lowres.md Additional files +80/-0

...

docs/parity-analysis/notes/2026-05-30-frag-index-pxd-fails-lowres.md

57. docs/parity-analysis/notes/2026-05-30-sage-index-astral-and-chimeric-speed-conclusion.md Additional files +54/-0

...

docs/parity-analysis/notes/2026-05-30-sage-index-astral-and-chimeric-speed-conclusion.md

58. docs/parity-analysis/notes/2026-05-30-sage-index-pxd-gate.md Additional files +22/-0

...

docs/parity-analysis/notes/2026-05-30-sage-index-pxd-gate.md

59. docs/parity-analysis/notes/2026-05-31-cascade-optimized-multidataset-summary.md Additional files +152/-0

...

docs/parity-analysis/notes/2026-05-31-cascade-optimized-multidataset-summary.md

60. docs/parity-analysis/notes/2026-05-31-tmt-gap-diagnosis-not-gf-bug.md Additional files +127/-0

...

docs/parity-analysis/notes/2026-05-31-tmt-gap-diagnosis-not-gf-bug.md

61. docs/parity-analysis/notes/2026-06-01-p0-parity-audit-bench.md Additional files +44/-0

...

docs/parity-analysis/notes/2026-06-01-p0-parity-audit-bench.md

62. docs/superpowers/plans/2026-05-28-chimeric-dda-plus-phase1-plan.md Additional files +204/-0

...

docs/superpowers/plans/2026-05-28-chimeric-dda-plus-phase1-plan.md

63. docs/superpowers/plans/2026-05-28-chimeric-dda-plus-phase2-plan.md Additional files +120/-0

...

docs/superpowers/plans/2026-05-28-chimeric-dda-plus-phase2-plan.md

64. docs/superpowers/specs/2026-05-28-chimeric-dda-plus-integration-design.md Additional files +184/-0

...

docs/superpowers/specs/2026-05-28-chimeric-dda-plus-integration-design.md

65. docs/superpowers/specs/2026-05-29-chimeric-fragment-index-prefilter-design.md Additional files +120/-0

...

docs/superpowers/specs/2026-05-29-chimeric-fragment-index-prefilter-design.md

66. docs/superpowers/specs/2026-05-29-chimeric-phase3-shared-fragment-design.md Additional files +191/-0

...

docs/superpowers/specs/2026-05-29-chimeric-phase3-shared-fragment-design.md

67. docs/superpowers/specs/2026-05-29-chimeric-rank-stratified-fdr-design.md Additional files +99/-0

...

docs/superpowers/specs/2026-05-29-chimeric-rank-stratified-fdr-design.md

68. docs/superpowers/specs/2026-05-29-ms2rescore-entrapment-fdp-proof.md Additional files +78/-0

...

docs/superpowers/specs/2026-05-29-ms2rescore-entrapment-fdp-proof.md

69. docs/superpowers/specs/2026-05-30-chimeric-sage-style-fragment-index-design.md Additional files +127/-0

...

docs/superpowers/specs/2026-05-30-chimeric-sage-style-fragment-index-design.md

70. docs/superpowers/specs/2026-05-30-chimeric-two-pass-cascade-design.md Additional files +113/-0

...

docs/superpowers/specs/2026-05-30-chimeric-two-pass-cascade-design.md

71. docs/superpowers/specs/2026-05-31-native-rescoring-pipeline-design.md Additional files +164/-0

...

docs/superpowers/specs/2026-05-31-native-rescoring-pipeline-design.md

72. test-fixtures/parity/goldens/precursor_cal_off.pin Additional files +633/-633

...

test-fixtures/parity/goldens/precursor_cal_off.pin

qodo-code-review · 2026-06-01T09:04:25Z

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0)

1. MS level filter widened 🐞 Bug ≡ Correctness

Description

MzMLReader::read_with_ms1/read_with_ms1_chunked force ms_level_min = 1 during MS1 capture,
which broadens the configured [ms_level_min, ms_level_max] filter and can emit/search unintended
MS levels (e.g. MS2) when the CLI requested a single higher MS level (e.g. MS3). This changes search
inputs/results under --chimeric for non-default --ms-level values.

Code

crates/input/src/mzml.rs[R720-788]

Evidence

finish_spectrum drops spectra outside [ms_level_min, ms_level_max]. The MS1-capture helpers
lower ms_level_min to 1, broadening that range. The binary sets a single-level range
with_ms_level_range(mslevel, mslevel) before calling read_with_ms1_chunked, so lowering
ms_level_min causes additional lower ms-level spectra to be emitted and searched.

crates/input/src/mzml.rs[292-298]
crates/input/src/mzml.rs[720-728]
crates/input/src/mzml.rs[784-787]
crates/msgf-rust/src/bin/msgf-rust.rs[841-846]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution.

### Issue description
`read_with_ms1` and `read_with_ms1_chunked` mutate `self.ms_level_min` to `1` to “allow MS1 through”, but MS1 capture already happens before `finish_spectrum()` applies the ms-level filter. Lowering `ms_level_min` therefore broadens the *emitted* spectrum levels: e.g. a requested range `[3,3]` becomes `[1,3]`, and MS2 spectra can be emitted/searched unexpectedly.

### Issue Context
The binary configures a single requested ms level via `with_ms_level_range(mslevel, mslevel)` and then calls `read_with_ms1_chunked` in the chimeric path. Because the reader lowers `ms_level_min`, levels below `mslevel` become eligible for emission.

### Fix Focus Areas
- Remove or redesign the `ms_level_min = 1` mutation so MS1 capture does not change the output ms-level filter.
- Add a regression test for `with_ms_level_range(3,3)` + MS1 capture ensuring MS2 is NOT emitted.

- crates/input/src/mzml.rs[720-728]
- crates/input/src/mzml.rs[784-787]
- crates/input/src/mzml.rs[292-298]
- crates/msgf-rust/src/bin/msgf-rust.rs[841-846]
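
One possible shape for the fix, sketched on a toy model rather than the real `MzMLReader`: MS1 is intercepted before the level filter, so the configured range never needs to be widened. `Scan`, `LevelGate`, and `on_spectrum_end` are hypothetical names, and the sketch assumes MS1 capture really does run ahead of the filter, as the review states.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Scan {
    pub id: u32,
    pub ms_level: u8,
}

/// Toy stand-in for the reader's level filter plus MS1 capture hook.
pub struct LevelGate {
    ms_level_min: u8,
    ms_level_max: u8,
    capture_ms1: bool,
    pub captured_ms1: Vec<u32>,
}

impl LevelGate {
    pub fn new(ms_level_min: u8, ms_level_max: u8, capture_ms1: bool) -> Self {
        Self { ms_level_min, ms_level_max, capture_ms1, captured_ms1: Vec::new() }
    }

    /// Returns the scan if it should be emitted for searching.
    pub fn on_spectrum_end(&mut self, scan: Scan) -> Option<Scan> {
        // Capture MS1 first; it is never emitted while capture is on.
        if self.capture_ms1 && scan.ms_level == 1 {
            self.captured_ms1.push(scan.id);
            return None;
        }
        // The user's range is applied untouched, so [3,3] stays [3,3].
        let keep = (self.ms_level_min..=self.ms_level_max).contains(&scan.ms_level);
        keep.then_some(scan)
    }
}
```

Feeding levels 1, 2, 3, 2, 3 through a `[3,3]` gate with capture on emits only the two MS3 scans and captures the single MS1, which is the regression case requested above.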


2. Chimeric MS1 features always zero 🐞 Bug ≡ Correctness

Description

The new PIN/TSV columns PrecursorIsotopeKL and PrecursorSNR are never populated: match_engine
hard-codes CASCADE_SKIP_MS1_FEATURE=true, and Pass-2 secondary construction never assigns these
fields. As a result, these output columns are constant zeros even under --chimeric, contradicting
the output comments and wasting schema surface area.

Code

crates/search/src/match_engine.rs[R690-706]

Evidence
The only Pass-1 code that assigns precursor_isotope_kl/precursor_snr is guarded by a constant
that is set to true, making the assignment unreachable. Pass-2 search_secondary initializes
features to default and later fills generic MS2 features, but never assigns the precursor-envelope
fields, while the PIN writer still outputs these columns.
crates/search/src/match_engine.rs[690-706]
crates/search/src/coisolation.rs[227-243]
crates/search/src/coisolation.rs[294-298]
crates/output/src/pin.rs[199-205]
crates/output/src/pin.rs[443-448]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution.

### Issue description
Two new output features (`PrecursorIsotopeKL`, `PrecursorSNR`) are added and written to PIN rows, but the search path never assigns non-zero values:
- In Pass 1, the only assignment site is guarded by `if !CASCADE_SKIP_MS1_FEATURE`, but the constant is hard-coded to `true`.
- In Pass 2, secondary PSMs are built with `PsmFeatures::default()` and later `compute_psm_features(...)` is used, but no precursor-envelope fields are set there either.

This makes the columns always `0`, which is misleading for downstream tooling and increases schema complexity without benefit.

### Issue Context
The PIN header and row writer describe these as “0.0 unless --chimeric populated them from a linked MS1”. However, the current implementation ensures that population never happens.

### Fix Focus Areas
Choose one:
1) **Implement population** (recommended):
  - For Pass 2 secondaries, reuse the MS1 peaks already available (`Ms1Link`) and compute KL/SNR cheaply (only a handful per scan).
  - Consider extending `CoIsolated` to carry `(kl, snr)` from `detect_coisolated`, so it’s computed once and threaded into `PsmFeatures`.
  - Optionally add a flag/env var to enable Pass-1 population if desired.
2) **Remove the columns/fields** if intentionally disabled for perf, and update header/docs accordingly.

- crates/search/src/match_engine.rs[690-738]
- crates/search/src/coisolation.rs[227-243]
- crates/search/src/coisolation.rs[294-298]
- crates/output/src/pin.rs[199-205]
- crates/output/src/pin.rs[443-448]
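
Option 1 could look roughly like the following. The field names `precursor_isotope_kl`, `precursor_snr`, and `delta_raw_score` come from the PR summary; the `kl` and `snr` fields on `CoIsolated` and the `secondary_features` helper are the reviewer's suggestion rendered as a hypothetical sketch, with the structs reduced to the relevant fields.

```rust
/// Reduced stand-in for a detected co-isolated precursor, extended with the
/// envelope statistics computed once during detection (assumed fields).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct CoIsolated {
    pub mz: f64,
    pub charge: u8,
    pub kl: f64,
    pub snr: f64,
}

/// Reduced stand-in for the PSM feature struct.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct PsmFeatures {
    pub precursor_isotope_kl: f64,
    pub precursor_snr: f64,
    pub delta_raw_score: f64,
}

/// Thread the detection-time KL/SNR into the features computed on the
/// residual spectrum, leaving the MS2-derived fields untouched.
pub fn secondary_features(co: &CoIsolated, ms2_features: PsmFeatures) -> PsmFeatures {
    PsmFeatures {
        precursor_isotope_kl: co.kl,
        precursor_snr: co.snr,
        ..ms2_features
    }
}
```

Computing the pair once in detection and copying it here avoids a second pass over the MS1 peaks for each secondary.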


qodo-code-review · 2026-06-01T09:14:33Z

+    pub fn read_with_ms1(mut self) -> std::io::Result<(Vec<Spectrum>, Ms1Link)> {
+        // To capture MS1 (level 1) the parser must let level-1 spectra reach
+        // the spectrum-End handler rather than being dropped by the level
+        // filter. We widen the internal min level to 1 ONLY when capturing;
+        // MS1 is intercepted before `finish_spectrum`, so the effective output
+        // is still MS2-only. With capture off, the filter is untouched.
+        if self.capture_ms1 {
+            self.ms_level_min = 1;
+        }
+
+        let mut spectra: Vec<Spectrum> = Vec::new();
+        let mut ms2_to_ms1: Vec<Option<usize>> = Vec::new();
+
+        loop {
+            match self.pump() {
+                Ok(Some(s)) => {
+                    // Each spectrum returned by `pump` here is an emitted MS2
+                    // (MS1 is intercepted inside `pump` and never returned).
+                    // Link it to whatever MS1 most recently preceded it.
+                    ms2_to_ms1.push(self.latest_ms1_idx);
+                    spectra.push(s);
+                }
+                Ok(None) => break,
+                Err(_e) => {
+                    // Resync past the malformed spectrum and keep parsing (skip the
+                    // bad scan, not the rest of the file). Only an unreadable XML
+                    // stream stops us.
+                    match self.resync_to_next_spectrum() {
+                        Ok(true) => continue,
+                        Ok(false) | Err(_) => break,
+                    }
+                }
+            }
+        }
+
+        let link = Ms1Link {
+            ms1_peaks: std::mem::take(&mut self.captured_ms1),
+            ms2_to_ms1,
+        };
+        Ok((spectra, link))
+    }
+
+    /// Streaming, bounded-memory, tolerant variant of [`Self::read_with_ms1`] for
+    /// the chimeric cascade.
+    ///
+    /// Calls `on_chunk(ms2_spectra, ms1_link)` for each batch of up to
+    /// `chunk_size` MS2 spectra, where `ms1_link` covers ONLY that chunk. RSS
+    /// stays bounded by the chunk size: at most the MS1 scans referenced by the
+    /// in-flight chunk are retained, never the whole file (each MS2 links to its
+    /// most-recent preceding MS1, so only that carry-over scan crosses a chunk
+    /// boundary). Stops after `cap` total MS2 (`usize::MAX` = unbounded).
+    ///
+    /// Tolerant: a malformed spectrum does NOT abort the run. The first parse
+    /// error stops streaming and the successfully-parsed spectra so far are still
+    /// delivered (mirroring the MS2-only streaming path); the error count and the
+    /// first few messages are returned for reporting.
+    pub fn read_with_ms1_chunked<F>(
+        mut self,
+        chunk_size: usize,
+        cap: usize,
+        mut on_chunk: F,
+    ) -> (usize, Vec<String>)
+    where
+        F: FnMut(Vec<Spectrum>, Ms1Link),
+    {
+        self.capture_ms1 = true;
+        self.ms_level_min = 1; // let MS1 reach the capture hook; output stays MS2-only
+
+        let mut err_count = 0usize;


1. MS level filter widened 🐞 Bug ≡ Correctness

MzMLReader::read_with_ms1/read_with_ms1_chunked force ms_level_min = 1 during MS1 capture, which broadens the configured [ms_level_min, ms_level_max] filter and can emit/search unintended MS levels (e.g. MS2) when the CLI requested a single higher MS level (e.g. MS3). This changes search inputs/results under --chimeric for non-default --ms-level values.


ypriverol · 2026-06-01T09:44:31Z

DeltaRawScore re-bench on the cascade (post-#43)

After merging the additive DeltaRawScore PIN column (#43), all 3 datasets were re-benched on the same machine (chimeric mode, Percolator 3.7.1, entrapment FDP via FDRBench). Comparison to the rev5 cascade baseline:

dataset	rev5 PSMs (no DeltaRawScore)	PSMs with DeltaRawScore	Δ PSMs	entrapment FDP	vs Java
Astral	71,877	71,855	−22 (noise)	1.03%	+101%
PXD001819	16,592	16,603	+11 (noise)	1.09%	+11%
TMT	9,671	9,788	+117	not measured this run	−4.0% (vs Java 10,194)

Wall time and RSS are essentially unchanged (Astral 6:51 / PXD 1:17 / TMT 2:17; maxRSS 10.9 / 2.3 / 7.7 GB), so DeltaRawScore is effectively zero-cost.

Takeaways:

DeltaRawScore is flat on Astral/PXD (neither is runner-up-limited) and gives a real +117 on TMT, where top-1 dominance is exactly the failure mode. That is roughly 10× its historical +12 vs the narrow baseline.
TMT is now −4.0% vs Java (was −5.1%). The gap has narrowed but not closed: TMT still trails Java on PSM count.

Merge rationale: ship as opt-in --chimeric (off by default; the default engine is byte-identical). This is a decisive, entrapment-validated win on Astral (+101%) and PXD (+11%), and it is faster and bounded-memory on all 3. The TMT PSM gap (−4.0%) is tracked as a separate follow-up (per-ion CID node-scoring trace) and cannot regress anyone, since chimeric is off by default.

Caveats: the +117 on TMT should be confirmed with a repeat run (it is above typical Percolator noise but worth locking in), and TMT entrapment FDP was not measured in this run (additive feature; rev5 was 0.80%).

Replaces the 'Iter2 perf / post-PR-V1 binary / 39% flamegraph' narrative with the durable rationale (FxHashMap over std HashMap because variants_for is a hot lookup and SipHash dominated its cost). The internal milestone references are meaningless to an outside reader. This was the last surviving such comment in the source tree (the chimeric files were cleaned in c5c8ea8).

Per the project's comment-hygiene preference: code comments describe the code as it is, not the development history. Strips iteration/milestone narrative, refuted-experiment write-ups, and perf-regression stories from doc/line comments across search, scoring, output, and the CLI binary, keeping the durable technical rationale (why FxHashMap, why abs-ppm units, why no charge>2 deconv guard, why EdgeScore is a separate column, the Java num_distinct offset semantics). The reverted/negative-result learnings these comments narrated are already recorded in project memory (iter-history), so nothing durable is lost. No code changed; cargo check green.

ypriverol added 30 commits May 28, 2026 15:02

feat(mzml): capture isolation-window lower/upper offsets

b0ab373

feat(search): add matches_isolation_window for chimeric candidate gating

c80a918

docs(chimeric): Phase 1 bench finding — multi-PSM emission inflates F…

c112e2a

…DR without MS1/shared-fragment refinement

feat(mzml): optional MS1 capture + MS2->MS1 linkage (Ms1Link)

df25f77

feat(search): precursor isotope-match (KL + SNR) helper + PsmFeatures fields

19f7c2e

feat(search): plumb Ms1Link + populate precursor isotope features under --chimeric

359af79

feat(output): additive PrecursorIsotopeKL + PrecursorSNR PIN columns

88d3a1c

docs(chimeric): Phase 2 bench finding — additive isotope-KL feature does NOT control FDR (Astral still +97%)

189acb4

docs(chimeric): deeper investigation — hard isotope filter insufficient; fragment competition is the real missing discriminator

e3393e8

docs(chimeric): verify back-end GF is single-precursor-centered; note merge-gate/TMT limit

8cbe34f

docs: consolidated PSM-gain state & roadmap (single entry point)

eb318e9

docs(roadmap): add deltaScore + lnEValue levers; selection-vs-rescoring corollary

533173d

docs(roadmap): 3a benched+kept (+129/+12/+104), 3b discarded; TMT is the remaining gate blocker

0747518

research(chimeric): env-gated fragment-overlap diagnostic + BSA preview (low overlap tentatively challenges fragment-theft premise)

5942118

docs: session handoff (pre-restart) — 3a kept, TMT needs 2a, chimeric overlap probe pending Astral

0974fda

ypriverol added 3 commits May 31, 2026 13:39

docs(chimeric): record A/B revalidation — strict improvement (Astral +4,362 real PSMs, FDP 1.54->1.13)

9299b04

docs(chimeric): confirm TMT A/B 9,706 (+78, no regression) — A/B strict win on all 3 datasets

9e635fa

ypriverol added 3 commits May 31, 2026 15:27

timosachsenberg mentioned this pull request May 31, 2026

Add chimeric search to ProSE OpenMS/OpenMS#9422

Open

ypriverol changed the title ~~Chimeric two-pass cascade (opt-in --chimeric): +63.7% Astral / +17.2% PXD PSMs vs Java [DRAFT - TMT blocks gate]~~ Chimeric two-pass cascade (opt-in --chimeric): +101% Astral / +11% PXD PSMs vs Java, faster + bounded-mem [DRAFT - TMT blocks gate] Jun 1, 2026

ypriverol added 7 commits June 1, 2026 06:53

docs(parity): P0 audit bench — safe P1/P2 shipped; P0.4 regresses TMT blocker, reverted (n=9)

1c70652

docs(parity): P0.6 already implemented (GF_SPECTRA_NO_GROUP log); decision to stop P0 grind

b46b610

ypriverol mentioned this pull request Jun 1, 2026

Add additive DeltaRawScore PIN column to the chimeric cascade #43

Merged

Merge pull request #43 from bigbio/feat/delta-raw-score-cascade

18df366

Add additive DeltaRawScore PIN column to the chimeric cascade

ypriverol marked this pull request as ready for review June 1, 2026 09:00

qodo-code-review Bot reviewed Jun 1, 2026

View reviewed changes

ypriverol added 2 commits June 1, 2026 10:47

ypriverol merged commit 30e4008 into dev Jun 1, 2026
5 checks passed

ypriverol mentioned this pull request Jun 1, 2026

Multiple PR landed with support for chimeric; RAW and the .d native formats #46

Merged

ypriverol merged 99 commits into dev from feat/chimeric-dda-plus

ypriverol commented May 31, 2026

Code review (A/B + E fix commits)

qodo-code-review Bot commented Jun 1, 2026

Review Summary by Qodo

Walkthroughs

File Changes

qodo-code-review Bot commented Jun 1, 2026

Code Review by Qodo

ypriverol commented Jun 1, 2026

DeltaRawScore re-bench on the cascade (post-#43)
