
Chimeric two-pass cascade (opt-in --chimeric): +101% Astral / +11% PXD PSMs vs Java, faster + bounded-mem [DRAFT - TMT blocks gate]#42

Merged
ypriverol merged 99 commits into dev from feat/chimeric-dda-plus on Jun 1, 2026

Conversation

@ypriverol
Member

@ypriverol ypriverol commented May 31, 2026

Summary

Opt-in --chimeric two-pass cascade for msgf-rust: recovers co-isolated second peptides (MaxQuant "second-peptide" model) without the wide-window FDR-inflation cost. --chimeric off (default) is byte-identical to the current engine.

  • Pass 1: narrow top-1 primary search per scan (the normal fast path).
  • Pass 2: MS1-gated targeted secondary search — detect co-isolated precursors in the isolation window (averagine-KL), score a few candidates at each co-isolated mass on the residual spectrum, one single-bin GF SpecEValue per secondary, emitted as extra rows.

Hardened through two rounds of adversarial review (see below).

Results — same-machine vs Java MS-GF+ (entrapment-validated, FDRBench 1:1)

| Dataset | Rust @1% | Java @1% | PSMs vs Java | Speed | maxRSS | Entrapment FDP |
|---|---|---|---|---|---|---|
| Astral (LFQ DDA, HCD) | 71,877 | 35,818 | +101% | 6:38 vs 6:46 (faster) | 10.9 GB (= non-chimeric) | 1.04% |
| PXD001819 (UPS1 yeast) | 16,592 | 14,989 | +11% | 1:14 vs 1:22 (faster) | 2.3 GB | 1.13% |
| TMT (PXD007683, CID) | 9,671 | 10,194 | −5% | 2:14 vs 3:07 (faster) | 7.7 GB | |

Gains are real co-isolated peptides — both the reversed-decoy and entrapment rulers agree (FDP ~1% on all three). Astral and PXD001819 beat Java on both PSMs and speed; the cascade adds no unbounded memory (chimeric maxRSS == non-chimeric; both index-dominated).

Cascade core

crates/search/src/coisolation.rs (new), run_pass2_coisolation + PreparedParts in match_engine.rs, streaming wiring in the binary, force_push in psm.rs.

Speed/quality optimizations

raw-peak primary match; build candidate index once (PreparedParts); single-bin GF per secondary.

Review fixes (two rounds, in this PR)

Round 1 (5-agent + line review): top_n MGF gating; real secondary features (compute_psm_features on residual); precursor_mz_override for correct ExpMass/dm/absdm (PIN + TSV); lower-isotope self-secondary exclusion; deterministic secondary winner (best-SpecEValue, not arbitrary heap order — this drove a large Astral gain); --chimeric-frag-index defaults off; renamed the prefilter index to FragmentPostingIndex.
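
A minimal sketch of the deterministic secondary-winner rule, using the ordering given in the later fix commit (min spec_e_value, then max rank_score, then candidate idx). The `SecondaryHit` struct, the `pick_winner` name, and the `i32` type for rank_score are assumptions for illustration, not the engine's actual types.

```rust
/// Hypothetical stand-in for one scored secondary candidate.
#[derive(Debug, Clone, PartialEq)]
pub struct SecondaryHit {
    pub candidate_idx: usize,
    pub spec_e_value: f64,
    pub rank_score: i32, // assumed integer score; the real type may differ
}

/// Pick the winner with a total order, so the result does not depend on
/// the order in which hits come out of a heap or queue.
pub fn pick_winner(hits: &[SecondaryHit]) -> Option<&SecondaryHit> {
    hits.iter().min_by(|a, b| {
        a.spec_e_value
            .total_cmp(&b.spec_e_value) // smaller SpecEValue is better
            .then_with(|| b.rank_score.cmp(&a.rank_score)) // larger rank_score is better
            .then_with(|| a.candidate_idx.cmp(&b.candidate_idx)) // stable final tie-break
    })
}
```

Because every tie falls through to the candidate index, shuffling the input cannot change which hit is returned.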

Round 2 (adversarial) fixed four issues. Bounded memory: the chimeric path now streams (read_with_ms1_chunked, per-chunk bounded MS1, parser-thread pipeline) instead of buffering the whole file. Pass-2 calibration: search_secondary applies the learned precursor shift before the candidate prefilter. Pass-2 competition: secondaries on one scan now claim peaks sequentially so they can't double-count shared leftovers. Real resync: a malformed spectrum is skipped and parsing continues (no silent tail-truncation). Each fix has a regression/unit test.

Do not merge yet — blocked by the merge gate

Rust must beat Java on both PSMs and speed on all 3 datasets. TMT PSMs still trail −5% — TMT is CID/narrow with ~no co-isolation, so the cascade can't help it; the gap is a per-peptide CID node-scoring divergence, deferred to a future iteration (additive Percolator features / per-ion CID trace). Diagnosis: docs/parity-analysis/notes/2026-05-31-tmt-gap-diagnosis-not-gf-bug.md. This PR is a reviewable checkpoint, marked draft.

Known follow-ups

  • Dead refuted-experiment code (fragment_index.rs, fragment_posting_index.rs, shared_fragment.rs, behind false flags) to be stripped before a real merge.
  • TMT CID scoring gap (the blocker).
  • The MS2-only send_chunks path could also adopt resync (currently unchanged; out of chimeric scope).

Reference

docs/parity-analysis/notes/2026-05-31-cascade-optimized-multidataset-summary.md

ypriverol added 30 commits May 28, 2026 15:02
Phased design to recover co-fragmented peptide IDs via the DDA+ approach
(full-isolation-window search FIRST, then MS1 targeted-XIC isotope
refinement, then greedy shared-fragment rescoring). Maps each DDA+ step to
concrete msgf-rust components:

 - Phase 1: parse isolation-window width + widen candidate enumeration to
   the full window + emit top-N distinct-peptide PSMs/scan. Behind
   --chimeric (default off => bit-identical). Reuses the existing
   bucket_index range scan + per-charge multi-queue machinery.
 - Phase 2: MS1 targeted-XIC + isotope KL-divergence as an ADDITIVE PIN
   feature (audit-safe), or an external handoff (lean-Rust option).
 - Phase 3: greedy shared-fragment rescoring (additive-first; bench-gated).
 - Cross-cutting: candidate explosion ties into the planned fragment-index
   speed enabler; measure Phase-1 wall first before building it.

Motivated by the PR #40 finding that scoring parity is exhausted, so ID
gains now require a new capability rather than parity fixes.
…arch + multi-PSM)

Bite-sized TDD tasks for Phase 1 (option A, existing bucket scan):
 1. Spectrum isolation-window offset fields
 2. mzML <isolationWindow> parsing (MS:1000828/829)
 3. --chimeric + --isolation-halfwidth CLI/SearchParams wiring
 4. widen candidate enumeration to the isolation window when chimeric
 5. retain + emit top-N distinct-peptide PSMs/scan + SpecId uniqueness
 6. workspace gate + VM bench decision point (off=bit-identical, on=PXD/TMT measure)

Every task keeps the chimeric=false path on existing code, guarded by the
bit-identical PIN golden test. Task 6 is the ship/defer decision (incl. whether
the fragment-index enabler becomes a prerequisite if wall is unacceptable).
Add `isolation_lower_offset: Option<f64>` and `isolation_upper_offset: Option<f64>` to `Spectrum`, initialized to `None` in every constructor/literal across the workspace. Bit-identical to baseline (no scoring path touched).
Add `chimeric: bool` and `chimeric_isolation_halfwidth_da: f64` fields to
`SearchParams` (defaulting to false/1.5) and the matching `--chimeric` /
`--isolation-halfwidth` CLI flags; wire them into SearchParams assembly.
When `--chimeric` is on and `--top-n` is at its default of 1, top-N is
automatically raised to 5 with an eprintln notice. No search behavior
changes; default off path is bit-identical.
…eric

Extract candidate_nominal_bounds(spec, z, params, shift_ppm). When
params.chimeric, derive the candidate nominal-mass window from the full
isolation window (selected m/z ± isolation offsets, or the configured
half-width fallback) converted to neutral nominal mass per charge, with
isotope-error widening only (the window already exceeds precursor tol).
When chimeric is off, the derivation is byte-identical to the original
inline code.
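
A rough sketch of the window derivation described above: take the selected m/z, apply the isolation offsets (or the configured half-width when the file has none), and convert both edges to neutral mass for a given charge. The function name and signature are hypothetical, and the sketch deliberately stops at neutral mass; the nominal-mass conversion and isotope-error widening mentioned in the commit are not shown.

```rust
/// Proton mass in Da.
const PROTON_MASS: f64 = 1.007_276_466_88;

/// Neutral-mass bounds of the isolation window for one charge state.
/// `lower_offset` / `upper_offset` mirror the optional per-spectrum offsets;
/// `fallback_halfwidth` is used for whichever offset is missing.
pub fn isolation_neutral_bounds(
    selected_mz: f64,
    lower_offset: Option<f64>,
    upper_offset: Option<f64>,
    fallback_halfwidth: f64,
    charge: u8,
) -> (f64, f64) {
    let lo_mz = selected_mz - lower_offset.unwrap_or(fallback_halfwidth);
    let hi_mz = selected_mz + upper_offset.unwrap_or(fallback_halfwidth);
    let z = f64::from(charge);
    // neutral mass = (m/z - proton) * z
    ((lo_mz - PROTON_MASS) * z, (hi_mz - PROTON_MASS) * z)
}
```

With a 1.5 Da half-width and charge 2, the neutral-mass window is 6 Da wide, which is why the chimeric window strictly contains the ppm-scale precursor-tolerance window.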

Test candidate_nominal_bounds_chimeric_spans_isolation_window asserts the
chimeric window strictly contains the standard one and that an off-precursor
co-isolated mass is reachable only under chimeric. Bit-identical PIN golden
gate green (chimeric defaults off); clippy clean.
… loop

Branch the per-candidate iso-offset match: under --chimeric use
matches_isolation_window (accept anywhere in the isolation window, hoisted
window bounds), else the existing matches_precursor tight check. Bit-identical
PIN golden gate green (chimeric off path unchanged). Also fixes a clippy
field-reassign-default in the isolation-window unit test.
A chimeric scan emits multiple distinct-peptide PSMs that can share a SpecE
rank (iter_ranked increments rank only on a distinct spec_e_value), which
would collide on the `spec_scan_rank` SpecId. Append the per-row emission
index under --chimeric only; the standard SpecId format is unchanged so the
bit-identical PIN golden still holds. Test asserts two same-SpecE
co-fragmented PSMs get distinct SpecIds.
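
The uniqueness fix can be pictured with a small sketch. The exact SpecId layout is an assumption here (a file stem, scan, and rank joined by underscores); only the idea of appending the per-row emission index under --chimeric comes from the commit message.

```rust
/// Build a SpecId. `chimeric_row` is `Some(row_index)` only when the
/// chimeric mode is on; `None` leaves the standard format untouched.
/// The "{stem}_{scan}_{rank}" layout is illustrative, not the real schema.
pub fn spec_id(file_stem: &str, scan: u32, rank: u32, chimeric_row: Option<usize>) -> String {
    let base = format!("{file_stem}_{scan}_{rank}");
    match chimeric_row {
        Some(row) => format!("{base}_{row}"),
        None => base,
    }
}
```

Two PSMs that share a scan and a rank then differ in the trailing row index, while the off path produces exactly the same string as before.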
…nement)

Five bite-sized tasks to add the MS1 isotope-envelope check that suppresses
the Phase-1 FDR inflation:
 1. averagine theoretical isotope envelope (new model::isotope)
 2. optional MS1 capture + MS2->MS1 linkage in the mzML reader
 3. observed precursor isotope envelope + KL-divergence + SNR features
 4. additive PrecursorIsotopeKL + PrecursorSNR PIN columns
 5. VM bench decision gate (does the KL feature deflate Astral's +94%? if not,
    Phase 3 shared-fragment rescoring is required)

v1 uses the single linked MS1 (apex); multi-scan XIC correlation deferred to
Phase 2b. Additive PIN columns + --chimeric-off bit-identical throughout.
Add `averagine_isotope_envelope(mass, n_isotopes) -> Vec<f64>` in a new
`crates/model/src/isotope.rs` module. Uses the averagine + Poisson model
(lambda ≈ mass * 4.76e-4 from 13C natural abundance) to compute normalized
precursor isotope peak intensities. Registered as `pub mod isotope` in lib.rs.
Three unit tests cover normalization, mass-dependent +1/+0 growth, and edge
cases (n=0, n=1). New module is unused by default; zero change to existing
pipeline output (bit-identical gate: ok).
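
As an illustration of the averagine + Poisson model above, here is a minimal sketch. It assumes "normalized" means the returned intensities sum to 1, and it pairs the envelope with a KL-divergence helper whose direction (observed relative to theoretical) and epsilon floor are also assumptions; `kl_divergence` is a hypothetical name.

```rust
/// Poisson rate per dalton, from the commit message (lambda ≈ mass * 4.76e-4).
const LAMBDA_PER_DA: f64 = 4.76e-4;

/// Theoretical isotope envelope: Poisson(lambda) truncated to `n_isotopes`
/// terms and renormalized to sum to 1.
pub fn averagine_isotope_envelope(mass: f64, n_isotopes: usize) -> Vec<f64> {
    let lambda = mass.max(0.0) * LAMBDA_PER_DA;
    let mut env = Vec::with_capacity(n_isotopes);
    // Unnormalized terms lambda^k / k!; the exp(-lambda) factor cancels
    // in the normalization, so it is omitted to avoid underflow.
    let mut term = 1.0_f64;
    for k in 0..n_isotopes {
        env.push(term);
        term *= lambda / (k as f64 + 1.0);
    }
    let total: f64 = env.iter().sum();
    for p in &mut env {
        *p /= total;
    }
    env
}

/// KL divergence of the (renormalized) observed envelope from the
/// theoretical one. Returns None for mismatched or empty inputs.
pub fn kl_divergence(observed: &[f64], theoretical: &[f64]) -> Option<f64> {
    if observed.is_empty() || observed.len() != theoretical.len() {
        return None;
    }
    let total: f64 = observed.iter().sum();
    if total <= 0.0 {
        return None;
    }
    const EPS: f64 = 1e-12; // floor to keep the log finite
    let kl: f64 = observed
        .iter()
        .zip(theoretical)
        .filter(|(o, _)| **o > 0.0)
        .map(|(&o, &t)| {
            let p = o / total;
            p * (p / t.max(EPS)).ln()
        })
        .sum();
    Some(kl)
}
```

In this model the ratio of the +1 peak to the monoisotopic peak equals lambda, so it grows linearly with mass, which is the mass-dependent growth the unit tests in the commit refer to.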
…nt; fragment competition is the real missing discriminator
…ew (low overlap tentatively challenges fragment-theft premise)
… 38% fracmin>=0.5 vs BSA 13%)

Decisive Astral chimeric run (MSGF_CHIMERIC_OVERLAP=1, n=121,423 co-emitting
scans): mean fracmin 0.367, 38% >=0.5, bimodal with a near-total-overlap mode at
[0.9-1.0)=11.4%. BSA low-overlap pattern does NOT hold on real co-isolated data.
Fragment-theft premise validated for a substantial fraction -> Phase 3
(shared-fragment competition) is the relevant fix for those scans; ~28%
coincidental tail still needs per-scan FDR.
…ce filter spec

Approach A (filter + additive PIN columns, no score modification). Greedy
peak-claiming (rank-1 first, protected) drives a hard pre-Percolator filter on
UniqueMatchedIons (swept knob, decoy-symmetric) + 3 additive PIN columns.
Grounded in the 2026-05-29 Astral overlap result (theft confirmed, bimodal).
DoD: Astral canary returns to ~36.7k AND PXD beats Java, --chimeric off
bit-identical, wall within 3%. Research toward trustworthy chimeric; not a merge.
…l SpecEValue + Percolator on all features)

Drop the hand-tuned --chimeric-min-unique-ions filter. Discrimination is the
score itself at two layers: (1) in-engine residual SpecEValue re-score on
uniquely-claimed peaks deflates theft/coincidental rank>=2 peptides; (2)
Percolator on the full feature vector (re-scored RawScore/lnSpecEValue + additive
unique-evidence columns). Rule-2-safe: off bit-identical, rank-1 untouched, only
chimeric extra rows change.
… SpecEValue re-score

Two-layer discrimination, no parameter (per design 93178d7):
- Pure competition core (shared_fragment.rs): greedy peak-claiming most-confident
  first; per rank>=2 peptide compute unique-evidence (UniqueMatchedIons,
  UniqueExplainedFraction, SharedFracClaimed). 7 unit tests.
- match_engine hook (guarded on --chimeric): walk emitted PSMs best-first, claim
  peaks, and re-score each rank>=2 PSM's RawScore + GF SpecEValue on the residual
  (unclaimed) spectrum via rescore_residual_spec_e. A theft/coincidental peptide,
  stripped of stolen peaks, gets a worse SpecEValue and drops out of the FDR set
  on its own; no hard filter, no threshold. Decoy-symmetric.
- Additive PIN columns gated on --chimeric (off path byte-identical). Header
  gating test + existing schema-parity test green.

Smoke (BSA test.mgf): off has no Phase-3 cols, on emits 3; 55 rows shared>0,
37 fully-stolen rows deflated to negative residual RawScore / lnSpecE ~0.
Validation gate (Astral canary -> ~36.7k, PXD>Java) pending VM bench.
…ary (+111%, not deflated)

off vs on @1% FDR: PXD 14,808->18,306, Astral 36,715->77,444 (canary should be
~flat), TMT 9,605->9,362. Decoy fraction 0.83 on-runs (structural inflation
intact). Root cause: a per-PSM score change deflates spurious targets AND decoys
symmetrically -> q-value curve unchanged -> aggregate 1% count doesn't move. The
broken part is the PSM-level FDR model (Phase-2 requirement #2), not the
per-peptide score. Phase 3 refuted as a gate-clearer; impl kept as validated
negative result; --chimeric off byte-identical; nothing ships.
Approach A: separate target-decoy FDR for rank-1 vs rank>=2 strata (split PIN by
rank -> 2 Percolator runs -> sum @1%). Tests the structural fix after Phase 3
(per-PSM rescore) was refuted. Measure on both non-rescored Phase-1 PIN (model
alone) and Phase-3 rescored PIN (composition). Canary: Astral total ~36.7k; win:
PXD>14,974. No Rust production change (test-only NO_RESCORE env gate).
ypriverol added 3 commits May 31, 2026 13:39
… ms2pip/deeplc)

DRAFT design from two feasibility investigations: native Percolator integration
(stdin pipe to bundled binary, perfect 3.7.1 parity) + ML rescoring features
(spectral-angle/RT-delta additive PIN columns, predictions via self-hostable
Koina first, native ONNX/XGBoost embedding as phase 2). Pending user review of
the open decisions before any implementation plan.
@coderabbitai

coderabbitai Bot commented May 31, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b62fd56a-31c9-4ae3-bde6-faba52aac99f


… TSV precursor override

CI fix: the raw-peak-match commit (3556bee) inserted primary_matched_peak_keys
between search_secondary's doc and its #[allow(clippy::too_many_arguments)], so the
allow landed on the 3-arg helper and the 11-arg search_secondary lost it -> clippy
-D warnings failed CI. Reorganized so each fn has its own doc; allow back on
search_secondary.

Review fix: the TSV writer's Precursor + Da-mode PrecursorError columns used the
primary scan's spec.precursor_mz for chimeric secondaries; now use
precursor_mz_override (mirrors pin.rs). None for ordinary PSMs -> byte-identical.
TSV is not the Percolator path, so validated PIN results are unaffected.

Found by: CI (clippy) + the code review reviewers.
@ypriverol
Member Author

Code review (A/B + E fix commits)

Focused review of the post-open fix commits (c7940916, ffaab1d9) — the cascade core was already reviewed before this PR opened. Found 1 real issue; it plus a CI lint regression are now fixed in c45e9c48.

  1. TSV writer used the primary scan's precursor m/z for chimeric secondaries. crates/output/src/tsv.rs computed the Precursor column and the Da-mode PrecursorError from spec.precursor_mz, which for a Pass-2 secondary is the primary's selected m/z — the same bug class the precursor_mz_override field was added to fix in pin.rs, but the TSV writer was missed. Flagged independently by two reviewers. Fixed by routing both through psm.precursor_mz_override.unwrap_or(spec.precursor_mz) (mirrors pin.rs; None for ordinary PSMs → byte-identical). TSV is not the Percolator path, so the validated PIN results are unaffected.

  2. CI clippy too_many_arguments on search_secondary (-D warnings). The earlier raw-peak-match commit (3556bee8) inserted primary_matched_peak_keys between search_secondary's doc and its #[allow(clippy::too_many_arguments)], so the allow landed on the 3-arg helper and the 11-arg search_secondary lost it. Reorganized so each function carries its own doc and the allow sits on search_secondary.
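
The override pattern from item 1 is small enough to sketch directly. The two structs below are reduced stand-ins carrying only the fields named in the review; the helper name `reported_precursor_mz` is hypothetical.

```rust
/// Reduced stand-in for the spectrum: only the field used here.
pub struct Spectrum {
    pub precursor_mz: f64,
}

/// Reduced stand-in for a PSM. `precursor_mz_override` is `Some` only for
/// a Pass-2 secondary, where it holds the co-isolated precursor m/z.
pub struct Psm {
    pub precursor_mz_override: Option<f64>,
}

/// The m/z a writer should report: the override when present, otherwise
/// the scan's selected precursor m/z.
pub fn reported_precursor_mz(psm: &Psm, spec: &Spectrum) -> f64 {
    psm.precursor_mz_override.unwrap_or(spec.precursor_mz)
}
```

Ordinary PSMs carry `None`, so they take the original code path and the output stays byte-identical.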

No CLAUDE.md performance-invariant violations (off-path --chimeric off remains byte-identical; no shared-state mutation in the pre-pass). Field threading for the new precursor_mz_override is complete across all construction sites. The E fix (top_n gating for non-mzML) is correct and mzML-path-identical.

Note: PR is a draft (blocked by the TMT merge gate); this review was run on explicit request.

ypriverol added 3 commits May 31, 2026 15:27
Review fixes:
- HIGH: detect_coisolated now excludes the selected precursor's isotope peaks in
  BOTH directions (selected_mz +/- k*ISOTOPE/z). Previously only higher isotopes
  were skipped, so when the instrument selected M+1/M+2 the true monoisotope at
  selected_mz - k*ISOTOPE/z could be re-discovered as a fake co-isolated secondary
  (self-inflation of the primary's own peptide).
- search_secondary picks the winner by score (min spec_e_value, then max rank_score,
  then candidate idx) instead of the unordered drain_into_vec().next() -> removes
  nondeterminism when tied secondaries survive the capacity-1 queue.
- --chimeric on non-mzML now sets params.chimeric=false, so the PIN schema/SpecId
  gates and top-N forcing all stay on the normal path (the warning's 'runs normally'
  is now literally true).
- --chimeric-frag-index defaults to off: the prefilter index is unreachable in the
  shipped cascade (cascade_wide is always false), so on/auto only built an unused
  index and cost memory + startup.
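
A sketch of the two-sided isotope exclusion from the first fix above: a candidate m/z is rejected if it sits at selected_mz ± k*ISOTOPE/z for any k up to a small limit. The function name, the `max_k` and `tol_mz` parameters, and the inclusion of k = 0 (the selected peak itself) are assumptions for illustration.

```rust
/// Approximate 13C - 12C mass difference in Da.
const ISOTOPE_SPACING: f64 = 1.003_354_84;

/// True if `mz` coincides with the selected precursor or one of its
/// isotope peaks in either direction, within `tol_mz`.
pub fn is_selected_or_isotope(
    mz: f64,
    selected_mz: f64,
    charge: u8,
    max_k: u32,
    tol_mz: f64,
) -> bool {
    if charge == 0 {
        return false; // no meaningful isotope spacing without a charge
    }
    let step = ISOTOPE_SPACING / f64::from(charge);
    (0..=max_k).any(|k| {
        let d = f64::from(k) * step;
        // Check both the higher (+d) and the lower (-d) isotope position.
        (mz - (selected_mz + d)).abs() <= tol_mz || (mz - (selected_mz - d)).abs() <= tol_mz
    })
}
```

The lower branch is the part the fix adds: when the instrument selects M+1, the true monoisotope lies below selected_mz and would otherwise look like a separate co-isolated precursor.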

Rename: the inverted fragment-posting prefilter index (Approach B) and its module/
file are renamed to FragmentPostingIndex / fragment_posting_index to avoid implying
provenance from another engine.

Build + clippy(-D warnings) + search/output tests green.
Comment-only cleanup of the chimeric cascade files (coisolation, match_engine
cascade additions, binary chimeric wiring, pin/tsv precursor override, psm field
doc). ~90 net comment lines removed; kept the essential 'why' notes and all
public-item doc comments. Zero logic changes; build + clippy -D warnings green.
…libration, tolerant MS1 read

Finding 1 (high, bounded memory): the chimeric path no longer batch-reads the whole
file. New MzMLReader::read_with_ms1_chunked streams MS2 in CHUNK_SIZE batches, each
with a bounded per-chunk Ms1Link (only the carry-over MS1 crosses a chunk boundary).
The binary scores Pass 1 + Pass 2 per chunk on a parser-thread pipeline and drops
peaks immediately, so RSS is bounded to ~CHUNK_SIZE spectra. run_pass2_coisolation
now takes an explicit per-chunk link + global offset. Unit test proves chunked MS1
linkage matches the batch read across chunk boundaries + carry-over.
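
To make the carry-over idea concrete, here is a simplified sketch of chunking a scan stream while keeping each MS2 linked to its preceding MS1. All names (`Scan`, `Chunk`, `chunk_with_ms1`) are hypothetical, scans are reduced to ids, and the function collects chunks into a Vec only so the result is easy to inspect; a real streaming reader would hand each chunk to the consumer and drop it.

```rust
/// A scan reduced to its level and id.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Scan {
    Ms1(u32),
    Ms2(u32),
}

/// One chunk: the MS1 scans it needs plus its MS2 scans, each with an
/// index into `ms1_ids` (None if no MS1 preceded it).
#[derive(Debug, Default, PartialEq)]
pub struct Chunk {
    pub ms1_ids: Vec<u32>,
    pub ms2: Vec<(u32, Option<usize>)>,
}

pub fn chunk_with_ms1(scans: impl IntoIterator<Item = Scan>, chunk_size: usize) -> Vec<Chunk> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    let mut chunks = Vec::new();
    let mut cur = Chunk::default();
    let mut last_ms1: Option<u32> = None; // most recent MS1 seen anywhere
    let mut link: Option<usize> = None; // index of that MS1 inside `cur`
    for scan in scans {
        match scan {
            Scan::Ms1(id) => {
                last_ms1 = Some(id);
                cur.ms1_ids.push(id);
                link = Some(cur.ms1_ids.len() - 1);
            }
            Scan::Ms2(id) => {
                // First MS2 of a fresh chunk: carry the previous MS1 over.
                if link.is_none() {
                    if let Some(carry) = last_ms1 {
                        cur.ms1_ids.push(carry);
                        link = Some(cur.ms1_ids.len() - 1);
                    }
                }
                cur.ms2.push((id, link));
                if cur.ms2.len() == chunk_size {
                    chunks.push(std::mem::take(&mut cur));
                    link = None;
                }
            }
        }
    }
    if !cur.ms2.is_empty() {
        chunks.push(cur);
    }
    chunks
}
```

Only one MS1 ever crosses a chunk boundary, so per-chunk memory stays proportional to the chunk size rather than to the file.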

Finding 2 (high, calibration): search_secondary applies the learned
precursor_mass_shift_ppm (adjusted_observed_neutral_mass) to the co-isolated neutral
mass before the candidate prefilter and mass-error report, matching the main search.
Regression test with a non-zero shift added.

Finding 3 (med, tolerant parsing): read_with_ms1 + read_with_ms1_chunked no longer
abort on the first malformed spectrum; they deliver the spectra parsed so far and
(chunked) report the error count, mirroring the MS2-only streaming path.

clippy -D warnings + search/output/input tests green (incl. 2 new tests).
…al resync

Finding A (high): Pass-2 secondaries on the same scan now COMPETE for residual
evidence. search_secondary takes a prior_claimed peak-key set and returns the
peaks its winner explained; run_pass2_coisolation threads these across the scan's
co-isolated precursors so a peak claimed by one secondary is removed before the
next is scored (no double-counting of shared leftover peaks on multi-precursor
scans). New test asserts a peptide matches fewer ions once its peaks are claimed.
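
The sequential claiming in Finding A can be sketched as follows. This is a simplification under stated assumptions: peaks are identified by integer keys, each secondary's matched peaks are given up front (the real code scores on the residual and then reports what the winner explained), and `claim_sequentially` is a hypothetical name.

```rust
use std::collections::HashSet;

/// For each secondary in order, keep only the peaks not already claimed by
/// the primary or by an earlier secondary, then add them to the claimed set.
pub fn claim_sequentially(
    primary_claimed: &HashSet<u32>,
    secondaries: &[Vec<u32>],
) -> Vec<Vec<u32>> {
    let mut claimed = primary_claimed.clone();
    let mut unique_per_secondary = Vec::with_capacity(secondaries.len());
    for matched in secondaries {
        let unique: Vec<u32> = matched
            .iter()
            .copied()
            .filter(|key| !claimed.contains(key))
            .collect();
        // Later secondaries must not see these peaks again.
        claimed.extend(unique.iter().copied());
        unique_per_secondary.push(unique);
    }
    unique_per_secondary
}
```

A peptide scored after its peaks were claimed is left with fewer ions, which is the behavior the new test asserts.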

Finding B (high): the chimeric mzML reader now does REAL resync instead of silent
tail-truncation. New MzMLReader::resync_to_next_spectrum skips a malformed scan to
the next <spectrum> and continues; only an unreadable XML stream stops parsing.
Wired into read_with_ms1_chunked + read_with_ms1. New test: a bad spectrum between
two good ones is skipped (err_count=1) and BOTH good spectra survive.

clippy -D warnings + input/search tests green (incl. 2 new tests).
@ypriverol ypriverol changed the title from "Chimeric two-pass cascade (opt-in --chimeric): +63.7% Astral / +17.2% PXD PSMs vs Java [DRAFT - TMT blocks gate]" to "Chimeric two-pass cascade (opt-in --chimeric): +101% Astral / +11% PXD PSMs vs Java, faster + bounded-mem [DRAFT - TMT blocks gate]" on Jun 1, 2026
ypriverol added 7 commits June 1, 2026 06:53
…m comments

CLI help (msgf-rust binary):
- Removed stale internal references (pre-iter39, 'G1 gate / DOCS.md section', 'Fix B').
- Rewrote --chimeric help to describe the actual two-pass cascade (was still the
  old blind wide-window description).
- Rewrote --precursor-cal help (off/auto/on, learn+tighten); clarified --max-spectra,
  --ms-level, --isolation-halfwidth.
- Hid the advanced/dead --chimeric-frag-index flag from --help.
- Dropped redundant inline '(default X)' — clap already shows [default: X].

README: completed the parameter table (required vs optional with defaults in bold,
added --precursor-cal/--chimeric/--decoy-prefix/--ms-level) + a 'Chimeric /
co-isolated peptides' section describing the opt-in cascade.

Comments: removed development-history jargon (iterNN labels, R-2/C-4/HIGH-2 change
codes, project phase/task labels, dead doc-links, date stamps, commit SHAs) from
~150 comment lines across search/output/scoring/model/binary, preserving all
technical rationale and doc contracts. Zero logic changes; clippy -D warnings +
build + tests green.
Removes the refuted blind-wide-window 'Approach B' that was unreachable behind a
hardcoded `cascade_wide = false`: the fragment-posting prefilter index, the greedy
shared-fragment competition + residual rescore, and the `--chimeric-frag-index`
flag / FragIndexMode / frag_index_active. The shipped two-pass cascade
(run_pass2_coisolation / coisolation.rs) and the --chimeric-off path are unchanged.

- Delete fragment_index.rs, fragment_posting_index.rs, shared_fragment.rs.
- Strip cascade_wide branches in run_chunk_inner (keep the narrow path), the
  PreparedSearch/PreparedParts fragment_posting_index field, rescore_residual_spec_e,
  and candidate_nominal_bounds' chimeric param (narrow-only).
- Remove FragIndexMode + chimeric_frag_index from SearchParams + CLI.
- Drop the 3 always-zero shared-fragment PIN columns (UniqueMatchedIons,
  UniqueExplainedFraction, SharedFracClaimed); chimeric PIN 42 -> 39 cols, off-path
  PIN unchanged at 39.

build + clippy -D warnings + search/output/input tests green; removed-symbol grep empty.
P1 (perf, behavior-identical):
- Drop per-spectrum window_cand_indices clone; iterate by reference.
- Avoid isotope_error_range.clone() in the hot scoring loop (start..=end).
- merge_unique_candidate_idxs: O(k^2) Vec::contains -> FxHashSet membership,
  Vec kept for identical output order.
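
A sketch of the membership change in the last P1 item: a hash set answers "seen before?" in constant time, while the output Vec still records first-seen order. The signature is an assumption, and std's `HashSet` stands in for `FxHashSet` so the example has no external dependency.

```rust
use std::collections::HashSet;

/// Merge candidate index lists, dropping duplicates but preserving the
/// order in which each index first appears.
pub fn merge_unique_candidate_idxs(groups: &[Vec<usize>]) -> Vec<usize> {
    let mut seen = HashSet::new();
    let mut merged = Vec::new();
    for group in groups {
        for &idx in group {
            // `insert` returns true only the first time an index is seen.
            if seen.insert(idx) {
                merged.push(idx);
            }
        }
    }
    merged
}
```

The output is identical to the `Vec::contains` version; only the cost of each membership check changes.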

P2 (robustness / quality):
- score_dist::get_probability returns 0.0 for score > max_score instead of
  panicking (empty-tail-mass; defensive — match_engine still guards, so normal
  results unchanged). New unit test.
- generating_function: release-safe DROPPED_NODES atomic + accessor for the
  |score|>10000 node drop (was silent).
- Rename misleading test to nearest_peak_full_picks_max_intensity_within_tolerance
  and assert max-intensity (not nearest-m/z) selection.
- msgf-trace GF lookup uses rank_score (production key) not score; documented the
  remaining single-bin-graph divergence.
- Refresh stale R-2.x header in match_engine_java_parity.
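
The out-of-range guard in the first P2 item might look like the sketch below. The `ScoreDist` layout (a minimum score plus a dense probability vector) is an assumption, as is returning 0.0 below the minimum; only the "0.0 above max_score instead of panicking" behavior comes from the commit message.

```rust
/// Hypothetical dense score distribution: `probs[i]` is the probability
/// of score `min_score + i`.
pub struct ScoreDist {
    pub min_score: i32,
    pub probs: Vec<f64>,
}

impl ScoreDist {
    /// Probability of an exact score; 0.0 for anything outside the stored
    /// range, so an out-of-range lookup can never index past the Vec.
    pub fn get_probability(&self, score: i32) -> f64 {
        // Widen to i64 so the subtraction cannot overflow.
        let offset = i64::from(score) - i64::from(self.min_score);
        usize::try_from(offset)
            .ok()
            .and_then(|i| self.probs.get(i))
            .copied()
            .unwrap_or(0.0)
    }
}
```

Using `get` rather than indexing makes the function total, so the caller's own guard becomes a second line of defense rather than the only one.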

build + clippy -D warnings + search/output/scoring/input tests green.
…F audit)

Cascade summary note: add Final state (rev5) section recording the code review,
two adversarial rounds, dead-code cleanup, and GF/SpecE parity audit; final numbers
(Astral +101%, PXD +11%, TMT -5% blocker), HEAD b46b610, gate status. Fix stale
HEAD reference in the header.

CLAUDE.md: replace the abandoned fragment-index 'speed v2' next-work section with the
current chimeric cascade DRAFT PR #42 state and TMT closing options; record that both
fragment-index approaches were refuted this session.
Integrates the DeltaRawScore feature (originally bea5d697 on feat/delta-raw-score)
onto feat/chimeric-dda-plus. A direct branch merge conflicts on all touched files
and would revert the cascade's golden, so the feature is re-applied at the current
anchors instead.

Top-1 dominance signal: RawScore(best) − RawScore(2nd-best distinct peptide) per
spectrum, captured during candidate scoring (independent of the TopNQueue, so it
survives top_n=1) and WITHOUT feeding the GF min_score — no emitted PSM's
SpecEValue/RawScore changes. Distinct-peptide keyed by nominal residue mass so a
shared peptide (multi-charge / multi-protein) doesn't zero the delta. Emitted on the
rank-1 PIN row only (0.0 elsewhere), placed after the chimeric MS1 features
(PrecursorIsotopeKL/PrecursorSNR), before Peptide/Proteins.
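
One way to capture the best-minus-second-best signal during candidate scoring is a small running tracker, sketched here. The tracker type, the `i32` score, the `i64` peptide key standing in for nominal residue mass, and the 0.0 result when no second peptide exists are all assumptions for illustration.

```rust
/// Tracks the best RawScore and the best score among peptides with a
/// different key, independently of any top-N queue.
#[derive(Debug, Default)]
pub struct DeltaRawScoreTracker {
    best: Option<(i32, i64)>, // (score, peptide key)
    second: Option<i32>,      // best score from a different key
}

impl DeltaRawScoreTracker {
    pub fn observe(&mut self, raw_score: i32, peptide_key: i64) {
        let Some((best_score, best_key)) = self.best else {
            self.best = Some((raw_score, peptide_key));
            return;
        };
        if peptide_key == best_key {
            // Same peptide (e.g. another charge or protein): never a runner-up.
            if raw_score > best_score {
                self.best = Some((raw_score, best_key));
            }
        } else if raw_score > best_score {
            // New leader; the old leader becomes the runner-up.
            self.second = Some(best_score);
            self.best = Some((raw_score, peptide_key));
        } else {
            self.second = Some(self.second.map_or(raw_score, |s| s.max(raw_score)));
        }
    }

    /// RawScore(best) - RawScore(second-best distinct peptide), or 0.0.
    pub fn delta(&self) -> f64 {
        match (self.best, self.second) {
            (Some((best, _)), Some(second)) => f64::from(best - second),
            _ => 0.0,
        }
    }
}
```

Keying by peptide rather than by candidate is what keeps a shared peptide from collapsing the delta to zero.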

Purely additive: off-path PIN gains one column and is otherwise unchanged; TSV is
byte-identical. Golden precursor_cal_off.pin regenerated from this binary (39→40
cols). Schema-parity test updated to Java + 4 additive cols.

Prior bench (vs the narrow pre-cascade baseline): +129 PXD / +104 Astral / +12 TMT
@1% FDR, zero wall cost, decoy structure unchanged.
Add additive DeltaRawScore PIN column to the chimeric cascade
@ypriverol ypriverol marked this pull request as ready for review June 1, 2026 09:00
@qodo-code-review

Review Summary by Qodo

Chimeric two-pass cascade: +101% Astral / +11% PXD PSMs with bounded-memory streaming and two-round adversarial hardening

✨ Enhancement 🧪 Tests 🐞 Bug fix


Walkthroughs

Description
• **Chimeric two-pass cascade** (opt-in --chimeric flag): recovers co-isolated second peptides
  without wide-window FDR inflation
  - Pass 1: narrow top-1 primary search per scan (fast path, byte-identical to current engine when
  disabled)
  - Pass 2: MS1-gated targeted secondary search on residual spectrum, detecting co-isolated precursors
  via averagine-KL divergence
• **Empirical results** (same-machine vs Java MS-GF+, entrapment-validated):
  - Astral (LFQ DDA, HCD): +101% PSMs, faster (6:38 vs 6:46), 10.9 GB maxRSS
  - PXD001819 (UPS1 yeast): +11% PSMs, faster (1:14 vs 1:22), 2.3 GB maxRSS
  - TMT (CID): −5% PSMs (narrow window, no co-isolation), faster (2:14 vs 3:07), 7.7 GB maxRSS
• **Core implementation**:
  - New coisolation.rs module: detect_coisolated() and search_secondary() functions
  - New chimeric_features.rs module: precursor isotope-envelope matching (KL divergence, SNR)
  - New isotope.rs module: theoretical averagine isotope envelope computation
  - MS1 capture infrastructure in mzml.rs: Ms1Link struct, read_with_ms1_chunked() streaming
  with bounded memory
  - PreparedParts caching in match_engine.rs to reuse candidate enumeration across calibration and
  main passes
• **Hardened through two rounds of adversarial review**:
  - Round 1: top_n MGF gating, real secondary features on residual, precursor_mz_override for correct
  ExpMass/dm, lower-isotope self-secondary exclusion, deterministic secondary winner (best-SpecEValue)
  - Round 2: bounded memory (streaming instead of buffering), Pass-2 calibration (precursor shift
  before prefilter), sequential peak competition (no double-counting), real resync (skip malformed
  spectra, continue parsing)
• **New PIN columns**: PrecursorIsotopeKL, PrecursorSNR, DeltaRawScore (additive
  chimeric/dominance features)
• **CLI integration**: --chimeric flag, --isolation-halfwidth parameter, forced top_n=1 under
  chimeric mode
• **Bug fixes**: GF score distribution out-of-range guard, msgf-trace GF threshold alignment with
  production
• **Documentation cleanup**: removed iteration-specific comments and references throughout codebase
• **Comprehensive test coverage**: isolation-window parsing, MS1 capture, chunked reading, error
  recovery, co-isolation detection, secondary peptide recovery
Diagram
flowchart LR
  A["Input: mzML<br/>with MS1 data"] -->|"read_with_ms1_chunked<br/>bounded per-chunk"| B["MS1 capture<br/>Ms1Link"]
  B -->|"Pass 1:<br/>narrow top-1"| C["Primary search<br/>best peptide"]
  C -->|"PreparedParts<br/>cached candidates"| D["Pass 2:<br/>MS1-gated targeted"]
  B -->|"averagine-KL<br/>co-isolation detect"| E["Secondary candidates<br/>co-isolated masses"]
  E -->|"residual spectrum<br/>sequential peaks"| F["Secondary search<br/>targeted scoring"]
  F -->|"single-bin GF<br/>SpecEValue"| G["Secondary PSMs<br/>extra rows"]
  C -->|"top-1 only"| H["PIN output<br/>+ isotope/SNR/delta features"]
  G -->|"precursor_mz_override"| H


File Changes

1. crates/search/src/match_engine.rs ✨ Enhancement +540/-80

Chimeric two-pass cascade with MS1 co-isolation detection

• Added N_PRECURSOR_ISOTOPES constant and Ms1Link import for chimeric precursor isotope features
• Introduced PreparedParts struct to cache candidate enumeration across calibration and main
 passes
• Added candidate_nominal_bounds() helper function to extract precursor window calculation logic
• Implemented into_parts(), from_parts(), and with_ms1_link() methods for PreparedSearch to
 support two-pass cascade
• Added DeltaRawScore capture logic tracking best and second-best distinct-peptide RawScores per
 spectrum
• Integrated run_pass2_coisolation() call for chimeric secondary peptide detection
• Added chimeric precursor isotope-envelope feature computation (currently disabled via
 CASCADE_SKIP_MS1_FEATURE)
• Implemented cleavage_credit_for() module-level function for secondary peptide scoring
• Added matched_peak_keys() diagnostic function for fragment-overlap measurement
• Updated compute_psm_features() to initialize new chimeric feature fields
 (precursor_isotope_kl, precursor_snr, delta_raw_score)
• Cleaned up dated commit references and iteration markers from comments

crates/search/src/match_engine.rs


2. crates/input/src/mzml.rs ✨ Enhancement +665/-14

MS1 capture and MS2-to-MS1 linkage for chimeric detection

• Added Ms1Link struct to link MS2 spectra to their preceding MS1 scans and store MS1 peak lists
• Added IsolationWindow parser state and isolation-window offset CV parameters (MS:1000828,
 MS:1000829)
• Extended SpectrumBuilder with isolation_lower_offset and isolation_upper_offset fields
• Refactored spectrum building into build_spectrum() and build_peaks() helper methods for reuse
• Added MS1 capture infrastructure to MzMLReader with capture_ms1 flag and captured_ms1
 storage
• Implemented read_with_ms1() method to drain reader and return MS2 spectra with MS1 linkage
• Implemented read_with_ms1_chunked() streaming method with bounded memory and error tolerance
• Added resync_to_next_spectrum() method to skip malformed spectra and continue parsing
• Added comprehensive tests for isolation-window parsing, MS1 capture, chunked reading, and error
 recovery

crates/input/src/mzml.rs


3. crates/output/src/pin.rs ✨ Enhancement +148/-29

PIN output schema expansion for chimeric and dominance features

• Added three new PIN columns: PrecursorIsotopeKL, PrecursorSNR, and DeltaRawScore (all
 between EdgeScore and Peptide)
• Updated write_header() to include the three new additive chimeric/dominance feature columns
• Modified write_spectrum_rows() to track per-row emission index for unique SpecId generation
 under --chimeric
• Added precursor_mz_override support in write_psm_row() to use co-isolated precursor m/z for
 secondary PSMs
• Implemented chimeric SpecId uniqueness by appending row index when params.chimeric is true
• Added DeltaRawScore gating to rank-1 rows only (mirroring lnDeltaSpecEValue behavior)
• Updated test fixtures to include new fields and added tests for chimeric SpecId uniqueness and
 DeltaRawScore rank gating

crates/output/src/pin.rs


4. crates/search/src/precursor_matching.rs ✨ Enhancement +108/-0

Isolation-window precursor matching for co-isolated peptides

• Added matches_isolation_window() function for wide-window co-isolated peptide matching
• Isolation-window variant clamps mass error to the nearest in-window neutral mass (reports
 near-zero error for in-window peptides)
• Supports isotope offset and precursor tolerance expansion for co-isolation detection
• Added comprehensive unit tests validating in-window acceptance and out-of-window rejection

crates/search/src/precursor_matching.rs
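
The clamping behaviour can be illustrated with a small sketch. It assumes a ppm tolerance applied at the window edges and leaves out isotope offsets; `isolation_window_error_ppm` and its signature are hypothetical, not the signature of `matches_isolation_window()`.

```rust
/// Proton mass in Da.
const PROTON: f64 = 1.007_276_466_88;

/// Returns the mass error (ppm) of `candidate_mass` relative to the nearest
/// neutral mass inside the isolation window, or None when the candidate lies
/// outside the window by more than `tol_ppm`. In-window candidates clamp to
/// themselves and therefore report an error of 0.
/// Assumes non-negative offsets, so the lower bound never exceeds the upper.
pub fn isolation_window_error_ppm(
    candidate_mass: f64,
    precursor_mz: f64,
    charge: u8,
    lower_offset: f64,
    upper_offset: f64,
    tol_ppm: f64,
) -> Option<f64> {
    let z = f64::from(charge);
    // Convert the m/z window edges to neutral masses at this charge.
    let lo = (precursor_mz - lower_offset - PROTON) * z;
    let hi = (precursor_mz + upper_offset - PROTON) * z;
    let nearest = candidate_mass.clamp(lo, hi);
    let err_ppm = (candidate_mass - nearest) / nearest * 1e6;
    (err_ppm.abs() <= tol_ppm).then_some(err_ppm)
}
```

Because in-window candidates all report zero error, mass error carries no discriminating signal for them; this is one reason a wide-window search needs other evidence (here, the MS1 gate) to control false matches.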


5. crates/output/tests/output_pin_schema_parity.rs 🧪 Tests +32/-20

PIN schema parity test updates for four additive columns

• Updated schema parity test to account for four additive Rust-only PIN columns instead of one
• Modified column-count assertion to expect Java + 4 instead of Java + 1
• Updated position validation to check all four additive columns sit between matchedIonRatio and
 Peptide
• Clarified comments documenting the four new additive features and their purposes

crates/output/tests/output_pin_schema_parity.rs


6. crates/search/src/coisolation.rs ✨ Enhancement +609/-0

Chimeric cascade: co-isolated precursor detection and targeted secondary search

• New module implementing the chimeric two-pass cascade for detecting co-isolated precursors in MS1
 isolation windows
• detect_coisolated function identifies co-isolated precursors by averagine KL divergence
 matching, excluding the selected precursor and its isotopes
• search_secondary function performs targeted residual-spectrum search on co-isolated masses with
 sequential peak competition
• Comprehensive unit tests validating co-isolation detection, secondary peptide recovery, and
 precursor calibration application

crates/search/src/coisolation.rs
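
As a rough illustration of the averagine-KL gate, the sketch below scores an observed isotope cluster against a theoretical envelope. The direction of the divergence (observed relative to theoretical), the epsilon floor, and the name `envelope_kl` are assumptions; the PR summary does not state how `detect_coisolated` computes or thresholds it.

```rust
/// Kullback-Leibler divergence KL(observed || theoretical) between two
/// isotope envelopes of equal length. Both inputs are normalized to sum 1
/// before comparison. Returns None for mismatched lengths or empty signal.
pub fn envelope_kl(observed: &[f64], theoretical: &[f64]) -> Option<f64> {
    if observed.len() != theoretical.len() {
        return None;
    }
    let obs_sum: f64 = observed.iter().sum();
    let theo_sum: f64 = theoretical.iter().sum();
    if obs_sum <= 0.0 || theo_sum <= 0.0 {
        return None;
    }
    // Floor for theoretical probabilities so the log never sees a zero.
    const EPS: f64 = 1e-12;
    let kl: f64 = observed
        .iter()
        .zip(theoretical)
        .map(|(&o, &t)| {
            let p = o / obs_sum;
            let q = (t / theo_sum).max(EPS);
            if p > 0.0 { p * (p / q).ln() } else { 0.0 }
        })
        .sum();
    Some(kl)
}
```

A low value means the observed cluster looks like a peptide envelope; a candidate co-isolated precursor would be kept only when the value falls under some cutoff, which this sketch leaves to the caller.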


7. crates/scoring/src/scoring/scored_spectrum.rs 📝 Documentation +60/-28

Documentation cleanup and test fixture updates for isolation windows

• Removed iteration-specific comments (iter30, iter31, iter33, iter36, iter38) to clean up
 documentation
• Fixed nearest_peak_full test to verify max-intensity selection within tolerance window, not
 closest-by-m/z
• Added isolation_lower_offset and isolation_upper_offset fields to test spectrum fixtures
• Clarified comments around deconvolution ordering and cache behavior

crates/scoring/src/scoring/scored_spectrum.rs


8. crates/msgf-rust/src/bin/msgf-rust.rs ✨ Enhancement +177/-84

Chimeric cascade CLI integration with streaming MS1 linking

• Added --chimeric flag to enable two-pass cascade (requires mzML with MS1 data)
• Added --isolation-halfwidth parameter for fallback isolation window width when mzML lacks
 per-scan offsets
• Implemented dual-path streaming: chimeric path uses read_with_ms1_chunked with bounded per-chunk
 MS1 linking; non-chimeric path unchanged
• Forced top_n = 1 under chimeric mode to prevent multi-emission inflation; Pass 1 emits only best
 primary, Pass 2 emits secondaries
• Reuse PreparedParts from calibration pre-pass to avoid re-enumerating candidates (~15s saved on
 Astral)
• Cleaned up CLI documentation to remove iteration-specific language

crates/msgf-rust/src/bin/msgf-rust.rs


9. crates/search/src/psm.rs ✨ Enhancement +82/-36

Chimeric features and secondary PSM emission support

• Added PsmFeatures fields: precursor_isotope_kl, precursor_snr (MS1 envelope matching), and
 delta_raw_score (top-1 dominance)
• Added precursor_mz_override field to PsmMatch for chimeric Pass-2 secondaries to report their
 own co-isolated precursor m/z
• Implemented force_push method on TopNQueue to add secondaries without eviction (distinct
 co-isolated peptides, not competitors)
• Removed iteration-specific comments and clarified documentation around rank_score, edge_score, and
 tie handling
• Updated test fixtures to initialize new fields

crates/search/src/psm.rs
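
The difference between competitive insertion and `force_push` can be sketched like this. `TopNQueueSketch` and `PsmSketch` are hypothetical stand-ins; the real `TopNQueue` may store secondaries differently, and keeping them in a separate list is an assumption made here for clarity.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct PsmSketch {
    pub peptide: String,
    pub score: f64,
}

/// Top-N queue with a side list for entries that must never be evicted.
pub struct TopNQueueSketch {
    capacity: usize,
    ranked: Vec<PsmSketch>,
    forced: Vec<PsmSketch>,
}

impl TopNQueueSketch {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, ranked: Vec::new(), forced: Vec::new() }
    }

    /// Competitive insert: keep only the `capacity` best-scoring entries.
    pub fn push(&mut self, psm: PsmSketch) {
        self.ranked.push(psm);
        self.ranked.sort_by(|a, b| b.score.total_cmp(&a.score));
        self.ranked.truncate(self.capacity);
    }

    /// Non-competitive insert: a secondary is a distinct co-isolated peptide,
    /// so it neither evicts a primary nor gets evicted by one.
    pub fn force_push(&mut self, psm: PsmSketch) {
        self.forced.push(psm);
    }

    /// Rows in emission order: ranked primaries first, then secondaries.
    pub fn into_rows(self) -> Vec<PsmSketch> {
        let mut rows = self.ranked;
        rows.extend(self.forced);
        rows
    }
}
```

With `capacity = 1` (the value forced under chimeric mode), Pass 1 contributes exactly one row and every additional row comes from `force_push`.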


10. crates/search/src/chimeric_features.rs ✨ Enhancement +241/-0

Precursor isotope envelope matching for chimeric MS1 filtering

• New module for precursor isotope-envelope matching against observed MS1 peaks
• precursor_isotope_match computes KL divergence and SNR between theoretical averagine envelope
 and observed MS1 isotope cluster
• Uses max-intensity peak selection within tolerance window for each isotope position
• Comprehensive unit tests covering clean envelopes, missing envelopes, empty peaks, and
 max-intensity selection

crates/search/src/chimeric_features.rs
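
Max-intensity selection within a tolerance window, as opposed to nearest-by-m/z, might look like the following. The function name and the absolute Da tolerance are assumptions of this sketch.

```rust
/// Pick the most intense peak within `tol_da` of `target_mz`.
/// Note this is deliberately NOT the peak closest in m/z: a weak noise peak
/// sitting exactly on target loses to a stronger peak elsewhere in the window.
pub fn max_intensity_peak(
    peaks: &[(f64, f32)],
    target_mz: f64,
    tol_da: f64,
) -> Option<(f64, f32)> {
    peaks
        .iter()
        .copied()
        .filter(|&(mz, _)| (mz - target_mz).abs() <= tol_da)
        .max_by(|a, b| a.1.total_cmp(&b.1))
}
```

Applying this once per expected isotope position yields the observed envelope that is then compared with the theoretical one.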


11. crates/output/src/tsv.rs ✨ Enhancement +13/-5

TSV output support for chimeric secondary precursor m/z override

• Updated TSV writer to use precursor_mz_override when available (chimeric secondaries), falling
 back to spectrum's precursor m/z
• Applied override-aware precursor m/z to mass-error Da conversion calculation
• Added isolation_lower_offset and isolation_upper_offset to test spectrum fixtures
• Updated test PSM fixtures to include new fields

crates/output/src/tsv.rs
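
The override-with-fallback rule is small enough to show directly. Both helpers below are hypothetical, and the ppm-to-Da conversion at the reported m/z is an assumption about how the writer derives the Da error.

```rust
/// Precursor m/z to report for a PSM: the co-isolated precursor's m/z for a
/// chimeric secondary, otherwise the spectrum's own selected precursor m/z.
pub fn reported_precursor_mz(spectrum_mz: f64, mz_override: Option<f64>) -> f64 {
    mz_override.unwrap_or(spectrum_mz)
}

/// Convert a ppm error into Da at the given m/z.
pub fn ppm_to_da(error_ppm: f64, mz: f64) -> f64 {
    error_ppm * mz * 1e-6
}
```

Using the overridden m/z for the conversion keeps a secondary's Da error consistent with the precursor it was actually matched against.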


12. crates/model/src/isotope.rs ✨ Enhancement +73/-0

Theoretical averagine isotope envelope computation

• New module implementing averagine_isotope_envelope function using Poisson model for 13C
 distribution
• Computes relative intensities of isotope peaks from peptide neutral mass, normalized to sum 1.0
• Unit tests validate normalization, mass-dependent envelope shape, and edge cases (zero/one
 isotope)

crates/model/src/isotope.rs
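
A Poisson-model averagine envelope can be sketched as below. The constants are assumptions of this sketch (averagine composition of about 4.9384 carbons per 111.1254 Da, 13C abundance of about 1.07%); the crate's `averagine_isotope_envelope` may use different values or a different signature.

```rust
/// Relative intensities of the first `n_isotopes` isotope peaks of a peptide
/// with the given neutral mass, normalized to sum 1.0.
/// Model: the number of 13C atoms is Poisson with mean
/// lambda = (expected carbon count) * (13C natural abundance).
pub fn averagine_envelope_sketch(neutral_mass: f64, n_isotopes: usize) -> Vec<f64> {
    if n_isotopes == 0 {
        return Vec::new();
    }
    let carbons = neutral_mass / 111.1254 * 4.9384;
    let lambda = carbons * 0.0107;
    let mut out = Vec::with_capacity(n_isotopes);
    // term = lambda^k / k!; the common factor e^-lambda cancels on normalizing.
    let mut term = 1.0_f64;
    for k in 0..n_isotopes {
        if k > 0 {
            term *= lambda / k as f64;
        }
        out.push(term);
    }
    let sum: f64 = out.iter().sum();
    out.iter_mut().for_each(|x| *x /= sum);
    out
}
```

Heavier peptides have a larger `lambda`, so the monoisotopic peak's share of the envelope shrinks as mass grows.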


13. crates/scoring/src/gf/primitive_graph.rs 📝 Documentation +11/-12

Documentation cleanup for GF primitive graph

• Removed iteration-specific comments (iter36, iter37 P-8) referencing prior per-graph cache removal
• Clarified documentation around spectrum-wide observed_mass_cache de-duplication
• Added isolation_lower_offset and isolation_upper_offset to test spectrum fixtures

crates/scoring/src/gf/primitive_graph.rs


14. crates/scoring/src/scoring/psm_score.rs 📝 Documentation +8/-2

Documentation cleanup and test fixture updates

• Removed iteration-specific comment (iter31 P-2) from env-var caching documentation
• Added isolation_lower_offset and isolation_upper_offset to test spectrum fixtures

crates/scoring/src/scoring/psm_score.rs


15. crates/scoring/src/scoring/rank_scorer.rs 📝 Documentation +3/-5

Documentation cleanup for rank scorer

• Removed iteration-specific references (iter25 fix) from Java-parity edge case documentation
• Simplified comments around prob_peak > 1 NaN/inf handling without changing logic

crates/scoring/src/scoring/rank_scorer.rs


16. crates/search/tests/match_engine_java_parity.rs 📝 Documentation +5/-22

Java parity test documentation simplification

• Simplified test documentation by removing detailed R-2/R-3/C-4/C-5/F-1 parity gap enumeration
• Clarified scope: tests verify spectrum coverage and top-1 peptide identity, not full feature
 distribution parity
• Removed references to specific audit documents and iteration numbers

crates/search/tests/match_engine_java_parity.rs


17. crates/model/src/spectrum.rs ✨ Enhancement +17/-0

Spectrum struct extended with isolation window offsets

• Added isolation_lower_offset and isolation_upper_offset fields to Spectrum struct for
 chimeric mode
• Updated Default implementation and all test fixtures to initialize new fields
• Added unit test validating isolation offsets default to None

crates/model/src/spectrum.rs


18. crates/search/src/search_params.rs ✨ Enhancement +11/-3

Search parameters extended for chimeric cascade configuration

• Added chimeric boolean flag (default false) to enable two-pass cascade
• Added chimeric_isolation_halfwidth_da parameter (default 1.5 Da) for fallback isolation window
 width
• Removed iteration-specific documentation references
• Updated default_tryptic to initialize new fields

crates/search/src/search_params.rs
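
How the fallback half-width might combine with per-scan offsets is sketched below. Applying the fallback independently to each missing side is an assumption; the PR summary only says the parameter is used when the mzML lacks per-scan offsets.

```rust
/// Isolation-window bounds in m/z. Per-scan offsets win when present;
/// otherwise `fallback_halfwidth` (default 1.5 Da per the PR) is used.
pub fn isolation_bounds(
    precursor_mz: f64,
    lower_offset: Option<f64>,
    upper_offset: Option<f64>,
    fallback_halfwidth: f64,
) -> (f64, f64) {
    let lo = precursor_mz - lower_offset.unwrap_or(fallback_halfwidth);
    let hi = precursor_mz + upper_offset.unwrap_or(fallback_halfwidth);
    (lo, hi)
}
```

Inputs such as MGF carry no isolation-window metadata, so their spectra would always take the fallback path.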


19. crates/scoring/src/gf/score_dist.rs 🐞 Bug fix +18/-2

Score distribution out-of-range guard and test coverage

• Updated get_probability to return 0.0 for scores above max_score (out-of-range defensive
 guard)
• Added unit test validating out-of-range score returns empty tail mass 0.0
• Clarified documentation around score bounds and out-of-range behavior

crates/scoring/src/gf/score_dist.rs
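
The out-of-range guard can be illustrated on a toy distribution. `ScoreDistSketch` and `tail_probability` are hypothetical; whether the real `get_probability` returns a per-score or a tail probability is not stated in this summary, and the tail form is chosen here only because the test description mentions "tail mass".

```rust
/// Toy score distribution: `probs[i]` is the probability of score
/// `min_score + i`.
pub struct ScoreDistSketch {
    min_score: i32,
    probs: Vec<f64>,
}

impl ScoreDistSketch {
    pub fn new(min_score: i32, probs: Vec<f64>) -> Self {
        Self { min_score, probs }
    }

    pub fn max_score(&self) -> i32 {
        self.min_score + self.probs.len() as i32 - 1
    }

    /// Probability mass at scores >= `score`. A score above `max_score` has
    /// an empty tail, so the guard returns 0.0 instead of indexing past the end.
    pub fn tail_probability(&self, score: i32) -> f64 {
        if score > self.max_score() {
            return 0.0;
        }
        let start = (score - self.min_score).max(0) as usize;
        self.probs[start..].iter().sum()
    }
}
```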


20. crates/scoring/src/gf/generating_function.rs ✨ Enhancement +17/-0

GF-DP node drop counter for telemetry

• Added process-global DROPPED_NODES atomic counter to track GF-DP nodes pruned by score-range
 guard
• Implemented dropped_nodes_count() function for release-safe telemetry (relaxed atomic load)
• Incremented counter in compute_inner when nodes fall outside [-10000, 10000] range

crates/scoring/src/gf/generating_function.rs
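
A process-global drop counter of this kind is typically a static atomic with relaxed ordering. The sketch below assumes inclusive bounds and an integer score; `keep_node` is a hypothetical helper, while `DROPPED_NODES` and `dropped_nodes_count()` follow the names in the summary.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Process-global count of nodes pruned by the score-range guard.
static DROPPED_NODES: AtomicU64 = AtomicU64::new(0);

const SCORE_MIN: i32 = -10_000;
const SCORE_MAX: i32 = 10_000;

/// Returns true if the node's score is inside the allowed range; otherwise
/// bumps the counter and returns false so the caller can skip the node.
pub fn keep_node(score: i32) -> bool {
    let in_range = (SCORE_MIN..=SCORE_MAX).contains(&score);
    if !in_range {
        // Relaxed is enough: the counter is telemetry only and does not
        // synchronize any other memory.
        DROPPED_NODES.fetch_add(1, Ordering::Relaxed);
    }
    in_range
}

/// Snapshot of the counter, safe to call from any thread.
pub fn dropped_nodes_count() -> u64 {
    DROPPED_NODES.load(Ordering::Relaxed)
}
```

Because the counter is global, a caller interested in one run would read it before and after and report the difference.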


21. crates/msgf-rust/src/bin/msgf-trace.rs 🐞 Bug fix +21/-1

msgf-trace GF threshold alignment with production

• Updated GF score threshold to use rank_score (node + cleavage + edge) instead of score (node +
 cleavage)
• Aligns trace dump with production SpecEValue path seeding
• Added detailed comments explaining intentional differences from production (single-bin graph vs
 merged group, hardcoded terminal flags)

crates/msgf-rust/src/bin/msgf-trace.rs


22. crates/scoring/tests/gf_graph_dp.rs 🧪 Tests +3/-0

Test fixture updates for isolation window fields

• Added isolation_lower_offset and isolation_upper_offset to test spectrum fixtures

crates/scoring/tests/gf_graph_dp.rs


23. crates/search/tests/match_engine_smoke.rs 🧪 Tests +2/-0

Test fixture updates for isolation window fields

• Added isolation_lower_offset and isolation_upper_offset to test spectrum fixtures

crates/search/tests/match_engine_smoke.rs


24. crates/search/src/mass_calibrator.rs 🧪 Tests +4/-0

Add chimeric and isolation offset fields to test fixtures

• Added chimeric and chimeric_isolation_halfwidth_da fields to test fixture SearchParams
• Added isolation_lower_offset and isolation_upper_offset fields to test Spectrum struct

crates/search/src/mass_calibrator.rs


25. crates/search/tests/match_engine_specevalue.rs 🧪 Tests +3/-0

Add isolation and override fields to match engine test fixtures

• Added isolation_lower_offset and isolation_upper_offset fields to test Spectrum construction
• Added precursor_mz_override field to test PsmMatch construction

crates/search/tests/match_engine_specevalue.rs


26. crates/search/src/lib.rs ✨ Enhancement +3/-1

Register chimeric cascade modules and re-export Pass-2 driver

• Registered new modules chimeric_features and coisolation for chimeric two-pass cascade
• Re-exported run_pass2_coisolation from match_engine for public API

crates/search/src/lib.rs


27. crates/search/src/precursor_cal.rs 📝 Documentation +4/-4

Clarify precursor calibration documentation and defaults

• Simplified documentation by removing references to "Phase 0–1" and "Phase 3"
• Clarified that Default is Off (opt-in) to match CLI default

crates/search/src/precursor_cal.rs


28. crates/model/src/aa_set.rs Formatting +1/-1

Clean up test comment by removing audit label

• Removed "iter28 audit:" prefix from test comment, keeping the substantive note about GF DP source
 AAs

crates/model/src/aa_set.rs


29. crates/search/tests/mass_calibrator_integration.rs 🧪 Tests +2/-0

Add isolation offset fields to mass calibrator integration test

• Added isolation_lower_offset and isolation_upper_offset fields to test Spectrum construction

crates/search/tests/mass_calibrator_integration.rs


30. crates/model/src/tolerance.rs 📝 Documentation +1/-1

Simplify tolerance documentation reference

• Updated documentation to replace "Phase B's calibrator" with "The precursor calibrator"

crates/model/src/tolerance.rs


31. crates/scoring/tests/primitive_graph_arena_parity.rs 🧪 Tests +2/-0

Add isolation offset fields to scoring test fixture

• Added isolation_lower_offset and isolation_upper_offset fields to empty test Spectrum

crates/scoring/tests/primitive_graph_arena_parity.rs


32. crates/scoring/src/gf/group.rs 🧪 Tests +2/-0

Add isolation offset fields to GF group test fixture

• Added isolation_lower_offset and isolation_upper_offset fields to test Spectrum construction

crates/scoring/src/gf/group.rs


33. crates/search/tests/precursor_matching.rs 🧪 Tests +2/-0

Add isolation offset fields to precursor matching test

• Added isolation_lower_offset and isolation_upper_offset fields to test Spectrum construction

crates/search/tests/precursor_matching.rs


34. crates/input/src/mgf.rs ✨ Enhancement +2/-0

Add isolation offset fields to MGF spectrum parsing

• Added isolation_lower_offset and isolation_upper_offset fields to Spectrum struct
 initialization in MGF reader

crates/input/src/mgf.rs


35. crates/input/src/lib.rs ✨ Enhancement +1/-1

Export Ms1Link for MS1 precursor linking

• Added Ms1Link to public re-exports from mzml module for MS1 linking support

crates/input/src/lib.rs


36. crates/model/src/lib.rs ✨ Enhancement +1/-0

Register isotope module for envelope calculations

• Registered new isotope module for averagine isotope envelope calculations

crates/model/src/lib.rs


37. docs/superpowers/plans/2026-05-29-chimeric-fragment-index-prefilter.md 📝 Documentation +593/-0

Fragment-index prefilter implementation plan (Approach A)

• Comprehensive implementation plan for fragment-index prefilter (Approach A) with 6 tasks covering
 CSR index build, FragmentVoter, wiring, and VM gates
• Includes detailed step-by-step instructions, test templates, and integration points for the
 chimeric search optimization

docs/superpowers/plans/2026-05-29-chimeric-fragment-index-prefilter.md


38. docs/superpowers/plans/2026-05-30-chimeric-sage-style-fragment-index.md 📝 Documentation +410/-0

Sage-style fragment-index implementation plan (Approach B)

• Alternative implementation plan for Sage-style fragment index (Approach B) with mass-sorted
 candidates and m/z-bucketed fragments
• Includes 6 tasks with dual binary-search query, window-bounded scoring, and empirical gates on PXD
 and Astral datasets

docs/superpowers/plans/2026-05-30-chimeric-sage-style-fragment-index.md


39. docs/superpowers/plans/2026-05-30-chimeric-two-pass-cascade.md 📝 Documentation +325/-0

Two-pass cascade implementation plan for chimeric search

• Implementation plan for two-pass cascade (narrow Pass 1 + MS1-gated targeted Pass 2) with 4 tasks
• Covers co-isolated precursor detection, residual-spectrum targeted search, driver wiring, and VM
 gates

docs/superpowers/plans/2026-05-30-chimeric-two-pass-cascade.md


40. docs/2026-05-28-psm-gain-state-and-roadmap.md 📝 Documentation +259/-0

PSM-gain state and roadmap consolidation document

• Consolidated state-of-play document covering PSM-gain progress, empirical rules learned, remaining
 gaps, and ranked action plan
• Defines four levers (mod/param audit, SpecE-shape fix, additive features, top-2 emission) with
 risk/cost analysis and realistic outcomes

docs/2026-05-28-psm-gain-state-and-roadmap.md


41. docs/parity-analysis/notes/2026-05-29-gate-chimeric-norescore-vs-java.md 📝 Documentation +63/-0

Chimeric NO_RESCORE gate results and blocker analysis

• Gate-run results showing chimeric NO_RESCORE achieves +21.6% PXD and +115.8% Astral PSMs but −5%
 TMT and 1.16–2.71× slower wall time
• Identifies speed (fragment-index candidate generator) and TMT (GF SpecEValue shape) as binding
 constraints for merge gate

docs/parity-analysis/notes/2026-05-29-gate-chimeric-norescore-vs-java.md


42. .claude/CLAUDE.md Additional files +16/-1

...

.claude/CLAUDE.md


43. README.md Additional files +35/-20

...

README.md


44. crates/search/Cargo.toml Additional files +1/-1

...

crates/search/Cargo.toml


45. crates/search/src/search_index.rs Additional files +0/-2

...

crates/search/src/search_index.rs


46. docs/parity-analysis/notes/2026-05-28-SESSION-HANDOFF.md Additional files +27/-0

...

docs/parity-analysis/notes/2026-05-28-SESSION-HANDOFF.md


47. docs/parity-analysis/notes/2026-05-28-chimeric-fragment-overlap-diagnostic.md Additional files +97/-0

...

docs/parity-analysis/notes/2026-05-28-chimeric-fragment-overlap-diagnostic.md


48. docs/parity-analysis/notes/2026-05-28-chimeric-phase1-bench.md Additional files +42/-0

...

docs/parity-analysis/notes/2026-05-28-chimeric-phase1-bench.md


49. docs/parity-analysis/notes/2026-05-28-chimeric-phase2-bench.md Additional files +130/-0

...

docs/parity-analysis/notes/2026-05-28-chimeric-phase2-bench.md


50. docs/parity-analysis/notes/2026-05-29-chimeric-full-review-and-rethink.md Additional files +190/-0

...

docs/parity-analysis/notes/2026-05-29-chimeric-full-review-and-rethink.md


51. docs/parity-analysis/notes/2026-05-29-chimeric-phase3-bench-canary-fails.md Additional files +79/-0

...

docs/parity-analysis/notes/2026-05-29-chimeric-phase3-bench-canary-fails.md


52. docs/parity-analysis/notes/2026-05-29-entrapment-fdp-reversal.md Additional files +111/-0

...

docs/parity-analysis/notes/2026-05-29-entrapment-fdp-reversal.md


53. docs/parity-analysis/notes/2026-05-29-rank-stratified-fdr-bench.md Additional files +72/-0

...

docs/parity-analysis/notes/2026-05-29-rank-stratified-fdr-bench.md


54. docs/parity-analysis/notes/2026-05-30-cascade-astral-breakthrough.md Additional files +84/-0

...

docs/parity-analysis/notes/2026-05-30-cascade-astral-breakthrough.md


55. docs/parity-analysis/notes/2026-05-30-chimeric-cost-profile.md Additional files +42/-0

...

docs/parity-analysis/notes/2026-05-30-chimeric-cost-profile.md


56. docs/parity-analysis/notes/2026-05-30-frag-index-pxd-fails-lowres.md Additional files +80/-0

...

docs/parity-analysis/notes/2026-05-30-frag-index-pxd-fails-lowres.md


57. docs/parity-analysis/notes/2026-05-30-sage-index-astral-and-chimeric-speed-conclusion.md Additional files +54/-0

...

docs/parity-analysis/notes/2026-05-30-sage-index-astral-and-chimeric-speed-conclusion.md


58. docs/parity-analysis/notes/2026-05-30-sage-index-pxd-gate.md Additional files +22/-0

...

docs/parity-analysis/notes/2026-05-30-sage-index-pxd-gate.md


59. docs/parity-analysis/notes/2026-05-31-cascade-optimized-multidataset-summary.md Additional files +152/-0

...

docs/parity-analysis/notes/2026-05-31-cascade-optimized-multidataset-summary.md


60. docs/parity-analysis/notes/2026-05-31-tmt-gap-diagnosis-not-gf-bug.md Additional files +127/-0

...

docs/parity-analysis/notes/2026-05-31-tmt-gap-diagnosis-not-gf-bug.md


61. docs/parity-analysis/notes/2026-06-01-p0-parity-audit-bench.md Additional files +44/-0

...

docs/parity-analysis/notes/2026-06-01-p0-parity-audit-bench.md


62. docs/superpowers/plans/2026-05-28-chimeric-dda-plus-phase1-plan.md Additional files +204/-0

...

docs/superpowers/plans/2026-05-28-chimeric-dda-plus-phase1-plan.md


63. docs/superpowers/plans/2026-05-28-chimeric-dda-plus-phase2-plan.md Additional files +120/-0

...

docs/superpowers/plans/2026-05-28-chimeric-dda-plus-phase2-plan.md


64. docs/superpowers/specs/2026-05-28-chimeric-dda-plus-integration-design.md Additional files +184/-0

...

docs/superpowers/specs/2026-05-28-chimeric-dda-plus-integration-design.md


65. docs/superpowers/specs/2026-05-29-chimeric-fragment-index-prefilter-design.md Additional files +120/-0

...

docs/superpowers/specs/2026-05-29-chimeric-fragment-index-prefilter-design.md


66. docs/superpowers/specs/2026-05-29-chimeric-phase3-shared-fragment-design.md Additional files +191/-0

...

docs/superpowers/specs/2026-05-29-chimeric-phase3-shared-fragment-design.md


67. docs/superpowers/specs/2026-05-29-chimeric-rank-stratified-fdr-design.md Additional files +99/-0

...

docs/superpowers/specs/2026-05-29-chimeric-rank-stratified-fdr-design.md


68. docs/superpowers/specs/2026-05-29-ms2rescore-entrapment-fdp-proof.md Additional files +78/-0

...

docs/superpowers/specs/2026-05-29-ms2rescore-entrapment-fdp-proof.md


69. docs/superpowers/specs/2026-05-30-chimeric-sage-style-fragment-index-design.md Additional files +127/-0

...

docs/superpowers/specs/2026-05-30-chimeric-sage-style-fragment-index-design.md


70. docs/superpowers/specs/2026-05-30-chimeric-two-pass-cascade-design.md Additional files +113/-0

...

docs/superpowers/specs/2026-05-30-chimeric-two-pass-cascade-design.md


71. docs/superpowers/specs/2026-05-31-native-rescoring-pipeline-design.md Additional files +164/-0

...

docs/superpowers/specs/2026-05-31-native-rescoring-pipeline-design.md


72. test-fixtures/parity/goldens/precursor_cal_off.pin Additional files +633/-633

...

test-fixtures/parity/goldens/precursor_cal_off.pin


@qodo-code-review

qodo-code-review Bot commented Jun 1, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0)



Action required

1. MS level filter widened 🐞 Bug ≡ Correctness
Description
MzMLReader::read_with_ms1/read_with_ms1_chunked force ms_level_min = 1 during MS1 capture,
which broadens the configured [ms_level_min, ms_level_max] filter and can emit/search unintended
MS levels (e.g. MS2) when the CLI requested a single higher MS level (e.g. MS3). This changes search
inputs/results under --chimeric for non-default --ms-level values.
Code

crates/input/src/mzml.rs[R720-788]

Evidence
finish_spectrum drops spectra outside [ms_level_min, ms_level_max]. The MS1-capture helpers
lower ms_level_min to 1, broadening that range. The binary sets a single-level range
with_ms_level_range(mslevel, mslevel) before calling read_with_ms1_chunked, so lowering
ms_level_min causes additional lower ms-level spectra to be emitted and searched.

crates/input/src/mzml.rs[292-298]
crates/input/src/mzml.rs[720-728]
crates/input/src/mzml.rs[784-787]
crates/msgf-rust/src/bin/msgf-rust.rs[841-846]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`read_with_ms1` and `read_with_ms1_chunked` mutate `self.ms_level_min` to `1` to “allow MS1 through”, but MS1 capture already happens before `finish_spectrum()` applies the ms-level filter. Lowering `ms_level_min` therefore broadens the *emitted* spectrum levels: e.g. a requested range `[3,3]` becomes `[1,3]`, and MS2 spectra can be emitted/searched unexpectedly.

### Issue Context
The binary configures a single requested ms level via `with_ms_level_range(mslevel, mslevel)` and then calls `read_with_ms1_chunked` in the chimeric path. Because the reader lowers `ms_level_min`, levels below `mslevel` become eligible for emission.

### Fix Focus Areas
- Remove or redesign the `ms_level_min = 1` mutation so MS1 capture does not change the output ms-level filter.
- Add a regression test for `with_ms_level_range(3,3)` + MS1 capture ensuring MS2 is NOT emitted.

- crates/input/src/mzml.rs[720-728]
- crates/input/src/mzml.rs[784-787]
- crates/input/src/mzml.rs[292-298]
- crates/msgf-rust/src/bin/msgf-rust.rs[841-846]
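
One possible shape for the fix, sketched under the review's premise that MS1 capture can happen before the level filter runs: route each scan through the capture hook first and leave the configured range untouched. `LevelFilter`, `Routed`, and `route` are hypothetical names, not the reader's API.

```rust
/// The caller-configured MS-level range; never mutated by MS1 capture.
pub struct LevelFilter {
    pub min: u8,
    pub max: u8,
}

#[derive(Debug, PartialEq)]
pub enum Routed {
    CapturedMs1,
    Emitted,
    Dropped,
}

/// Decide what happens to a scan. MS1 is intercepted first when capture is
/// on; everything else is judged against the original, unwidened filter.
pub fn route(ms_level: u8, capture_ms1: bool, filter: &LevelFilter) -> Routed {
    if capture_ms1 && ms_level == 1 {
        return Routed::CapturedMs1;
    }
    if (filter.min..=filter.max).contains(&ms_level) {
        Routed::Emitted
    } else {
        Routed::Dropped
    }
}
```

With a `[3, 3]` filter and capture on, MS1 is captured, MS2 is dropped, and only MS3 is emitted, which is the behaviour the suggested regression test would assert.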




Remediation recommended

2. Chimeric MS1 features always zero 🐞 Bug ≡ Correctness
Description
The new PIN/TSV columns PrecursorIsotopeKL and PrecursorSNR are never populated: match_engine
hard-codes CASCADE_SKIP_MS1_FEATURE=true, and Pass-2 secondary construction never assigns these
fields. As a result, these output columns are constant zeros even under --chimeric, contradicting
the output comments and wasting schema surface area.
Code

crates/search/src/match_engine.rs[R690-706]

Evidence
The only Pass-1 code that assigns precursor_isotope_kl/precursor_snr is guarded by a constant
that is set to true, making the assignment unreachable. Pass-2 search_secondary initializes
features to default and later fills generic MS2 features, but never assigns the precursor-envelope
fields, while the PIN writer still outputs these columns.

crates/search/src/match_engine.rs[690-706]
crates/search/src/coisolation.rs[227-243]
crates/search/src/coisolation.rs[294-298]
crates/output/src/pin.rs[199-205]
crates/output/src/pin.rs[443-448]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Two new output features (`PrecursorIsotopeKL`, `PrecursorSNR`) are added and written to PIN rows, but the search path never assigns non-zero values:
- In Pass 1, the only assignment site is guarded by `if !CASCADE_SKIP_MS1_FEATURE`, but the constant is hard-coded to `true`.
- In Pass 2, secondary PSMs are built with `PsmFeatures::default()` and later `compute_psm_features(...)` is used, but no precursor-envelope fields are set there either.

This makes the columns always `0`, which is misleading for downstream tooling and increases schema complexity without benefit.

### Issue Context
The PIN header and row writer describe these as “0.0 unless --chimeric populated them from a linked MS1”. However, the current implementation ensures that population never happens.

### Fix Focus Areas
Choose one:
1) **Implement population** (recommended):
  - For Pass 2 secondaries, reuse the MS1 peaks already available (`Ms1Link`) and compute KL/SNR cheaply (only a handful per scan).
  - Consider extending `CoIsolated` to carry `(kl, snr)` from `detect_coisolated`, so it’s computed once and threaded into `PsmFeatures`.
  - Optionally add a flag/env var to enable Pass-1 population if desired.
2) **Remove the columns/fields** if intentionally disabled for perf, and update header/docs accordingly.

- crates/search/src/match_engine.rs[690-738]
- crates/search/src/coisolation.rs[227-243]
- crates/search/src/coisolation.rs[294-298]
- crates/output/src/pin.rs[199-205]
- crates/output/src/pin.rs[443-448]




Comment thread crates/input/src/mzml.rs
Comment on lines +720 to +788
pub fn read_with_ms1(mut self) -> std::io::Result<(Vec<Spectrum>, Ms1Link)> {
    // To capture MS1 (level 1) the parser must let level-1 spectra reach
    // the spectrum-End handler rather than being dropped by the level
    // filter. We widen the internal min level to 1 ONLY when capturing;
    // MS1 is intercepted before `finish_spectrum`, so the effective output
    // is still MS2-only. With capture off, the filter is untouched.
    if self.capture_ms1 {
        self.ms_level_min = 1;
    }

    let mut spectra: Vec<Spectrum> = Vec::new();
    let mut ms2_to_ms1: Vec<Option<usize>> = Vec::new();

    loop {
        match self.pump() {
            Ok(Some(s)) => {
                // Each spectrum returned by `pump` here is an emitted MS2
                // (MS1 is intercepted inside `pump` and never returned).
                // Link it to whatever MS1 most recently preceded it.
                ms2_to_ms1.push(self.latest_ms1_idx);
                spectra.push(s);
            }
            Ok(None) => break,
            Err(_e) => {
                // Resync past the malformed spectrum and keep parsing (skip the
                // bad scan, not the rest of the file). Only an unreadable XML
                // stream stops us.
                match self.resync_to_next_spectrum() {
                    Ok(true) => continue,
                    Ok(false) | Err(_) => break,
                }
            }
        }
    }

    let link = Ms1Link {
        ms1_peaks: std::mem::take(&mut self.captured_ms1),
        ms2_to_ms1,
    };
    Ok((spectra, link))
}

/// Streaming, bounded-memory, tolerant variant of [`Self::read_with_ms1`] for
/// the chimeric cascade.
///
/// Calls `on_chunk(ms2_spectra, ms1_link)` for each batch of up to
/// `chunk_size` MS2 spectra, where `ms1_link` covers ONLY that chunk. RSS
/// stays bounded by the chunk size: at most the MS1 scans referenced by the
/// in-flight chunk are retained, never the whole file (each MS2 links to its
/// most-recent preceding MS1, so only that carry-over scan crosses a chunk
/// boundary). Stops after `cap` total MS2 (`usize::MAX` = unbounded).
///
/// Tolerant: a malformed spectrum does NOT abort the run. The first parse
/// error stops streaming and the successfully-parsed spectra so far are still
/// delivered (mirroring the MS2-only streaming path); the error count and the
/// first few messages are returned for reporting.
pub fn read_with_ms1_chunked<F>(
    mut self,
    chunk_size: usize,
    cap: usize,
    mut on_chunk: F,
) -> (usize, Vec<String>)
where
    F: FnMut(Vec<Spectrum>, Ms1Link),
{
    self.capture_ms1 = true;
    self.ms_level_min = 1; // let MS1 reach the capture hook; output stays MS2-only

    let mut err_count = 0usize;

Action required

1. MS level filter widened 🐞 Bug ≡ Correctness

MzMLReader::read_with_ms1/read_with_ms1_chunked force ms_level_min = 1 during MS1 capture,
which broadens the configured [ms_level_min, ms_level_max] filter and can emit/search unintended
MS levels (e.g. MS2) when the CLI requested a single higher MS level (e.g. MS3). This changes search
inputs/results under --chimeric for non-default --ms-level values.

@ypriverol
Member Author

DeltaRawScore re-bench on the cascade (post-#43)

After merging the additive DeltaRawScore PIN column (#43), all 3 datasets re-benched on the same machine (chimeric mode, Percolator 3.7.1, entrapment-FDP via FDRBench). Comparison to the rev5 cascade baseline:

| dataset | rev5 (no DRS) | with DeltaRawScore | Δ | entrapment FDP | vs Java |
|---|---|---|---|---|---|
| Astral | 71,877 | 71,855 | −22 (noise) | 1.03% | +101% |
| PXD001819 | 16,592 | 16,603 | +11 (noise) | 1.09% | +11% |
| TMT | 9,671 | 9,788 | +117 | not measured this run | −4.0% (vs Java 10,194) |

Wall/RSS unchanged (Astral 6:51 / PXD 1:17 / TMT 2:17; maxRSS 10.9 / 2.3 / 7.7 GB) — DeltaRawScore is zero-cost.

Takeaways:

  • DeltaRawScore is flat on Astral/PXD (never runner-up-limited) and a real +117 on TMT (top-1 dominance is exactly TMT's failure mode), roughly 10× its historical +12 gain vs the narrow baseline.
  • TMT now −4.0% vs Java (was −5.1%). Narrowed, not closed — TMT still trails Java on PSM count.
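
The "vs Java" percentages follow from the PSM counts by a plain relative difference; a small sketch (the helper name is hypothetical) makes the arithmetic checkable:

```rust
/// Relative difference of `count` against `baseline`, in percent.
pub fn percent_vs_baseline(count: u32, baseline: u32) -> f64 {
    (f64::from(count) - f64::from(baseline)) / f64::from(baseline) * 100.0
}
```

For TMT, 9,788 against Java's 10,194 is about −3.98%, reported as −4.0%; the rev5 count of 9,671 is about −5.13%, reported as −5.1%.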

Merge rationale: shipping as opt-in --chimeric (off by default; default engine byte-identical). A decisive, entrapment-validated win on Astral (+101%) and PXD (+11%), faster + bounded-memory on all 3. The TMT PSM gap (−4.0%) is tracked as a separate follow-up (per-ion CID node-scoring trace) and cannot regress anyone, since chimeric is off by default.

Caveats: the +117 TMT should be confirmed with a repeat (above typical Percolator noise but worth locking in), and TMT entrapment FDP was not measured in this run (additive feature; rev5 was 0.80%).

ypriverol added 2 commits June 1, 2026 10:47
Replaces the 'Iter2 perf / post-PR-V1 binary / 39% flamegraph' narrative with
the durable rationale (FxHashMap over std HashMap because variants_for is a hot
lookup and SipHash dominated its cost). The internal milestone references are
meaningless to an outside reader. Last surviving such comment in the source tree
(the chimeric files were cleaned in c5c8ea8).
Per the project's comment-hygiene preference: code comments describe the code as
it is, not the development history. Strips iteration/milestone narrative, refuted-
experiment write-ups, and perf-regression stories from doc/line comments across
search, scoring, output, and the CLI binary — keeping the durable technical
rationale (why FxHashMap, why abs-ppm units, why no charge>2 deconv guard, why
EdgeScore is a separate column, the Java num_distinct offset semantics).

The reverted/negative-result learnings these comments narrated are already
recorded in project memory (iter-history), so nothing durable is lost. No code
changed; cargo check green.
@ypriverol ypriverol merged commit 30e4008 into dev Jun 1, 2026
5 checks passed