Chimeric two-pass cascade (opt-in --chimeric): +101% Astral / +11% PXD PSMs vs Java, faster + bounded-mem [DRAFT - TMT blocks gate]#42
Conversation
Phased design to recover co-fragmented peptide IDs via the DDA+ approach (full-isolation-window search FIRST, then MS1 targeted-XIC isotope refinement, then greedy shared-fragment rescoring). Maps each DDA+ step to concrete msgf-rust components: - Phase 1: parse isolation-window width + widen candidate enumeration to the full window + emit top-N distinct-peptide PSMs/scan. Behind --chimeric (default off => bit-identical). Reuses the existing bucket_index range scan + per-charge multi-queue machinery. - Phase 2: MS1 targeted-XIC + isotope KL-divergence as an ADDITIVE PIN feature (audit-safe), or an external handoff (lean-Rust option). - Phase 3: greedy shared-fragment rescoring (additive-first; bench-gated). - Cross-cutting: candidate explosion ties into the planned fragment-index speed enabler; measure Phase-1 wall first before building it. Motivated by the PR #40 finding that scoring parity is exhausted, so ID gains now require a new capability rather than parity fixes.
…arch + multi-PSM) Bite-sized TDD tasks for Phase 1 (option A, existing bucket scan): 1. Spectrum isolation-window offset fields 2. mzML <isolationWindow> parsing (MS:1000828/829) 3. --chimeric + --isolation-halfwidth CLI/SearchParams wiring 4. widen candidate enumeration to the isolation window when chimeric 5. retain + emit top-N distinct-peptide PSMs/scan + SpecId uniqueness 6. workspace gate + VM bench decision point (off=bit-identical, on=PXD/TMT measure) Every task keeps the chimeric=false path on existing code, guarded by the bit-identical PIN golden test. Task 6 is the ship/defer decision (incl. whether the fragment-index enabler becomes a prerequisite if wall is unacceptable).
Add `isolation_lower_offset: Option<f64>` and `isolation_upper_offset: Option<f64>` to `Spectrum`, initialized to `None` in every constructor/literal across the workspace. Bit-identical to baseline (no scoring path touched).
Add `chimeric: bool` and `chimeric_isolation_halfwidth_da: f64` fields to `SearchParams` (defaulting to false/1.5) and the matching `--chimeric` / `--isolation-halfwidth` CLI flags; wire them into SearchParams assembly. When `--chimeric` is on and `--top-n` is at its default of 1, top-N is automatically raised to 5 with an eprintln notice. No search behavior changes; default off path is bit-identical.
…eric Extract candidate_nominal_bounds(spec, z, params, shift_ppm). When params.chimeric, derive the candidate nominal-mass window from the full isolation window (selected m/z ± isolation offsets, or the configured half-width fallback) converted to neutral nominal mass per charge, with isotope-error widening only (the window already exceeds precursor tol). When chimeric is off, the derivation is byte-identical to the original inline code. Test candidate_nominal_bounds_chimeric_spans_isolation_window asserts the chimeric window strictly contains the standard one and that an off-precursor co-isolated mass is reachable only under chimeric. Bit-identical PIN golden gate green (chimeric defaults off); clippy clean.
… loop Branch the per-candidate iso-offset match: under --chimeric use matches_isolation_window (accept anywhere in the isolation window, hoisted window bounds), else the existing matches_precursor tight check. Bit-identical PIN golden gate green (chimeric off path unchanged). Also fixes a clippy field-reassign-default in the isolation-window unit test.
A chimeric scan emits multiple distinct-peptide PSMs that can share a SpecE rank (iter_ranked increments rank only on a distinct spec_e_value), which would collide on the `spec_scan_rank` SpecId. Append the per-row emission index under --chimeric only; the standard SpecId format is unchanged so the bit-identical PIN golden still holds. Test asserts two same-SpecE co-fragmented PSMs get distinct SpecIds.
…DR without MS1/shared-fragment refinement
…nement)
Five bite-sized tasks to add the MS1 isotope-envelope check that suppresses
the Phase-1 FDR inflation:
1. averagine theoretical isotope envelope (new model::isotope)
2. optional MS1 capture + MS2->MS1 linkage in the mzML reader
3. observed precursor isotope envelope + KL-divergence + SNR features
4. additive PrecursorIsotopeKL + PrecursorSNR PIN columns
5. VM bench decision gate (does the KL feature deflate Astral's +94%? if not,
Phase 3 shared-fragment rescoring is required)
v1 uses the single linked MS1 (apex); multi-scan XIC correlation deferred to
Phase 2b. Additive PIN columns + --chimeric-off bit-identical throughout.
Add `averagine_isotope_envelope(mass, n_isotopes) -> Vec<f64>` in a new `crates/model/src/isotope.rs` module. Uses the averagine + Poisson model (lambda ≈ mass * 4.76e-4 from 13C natural abundance) to compute normalized precursor isotope peak intensities. Registered as `pub mod isotope` in lib.rs. Three unit tests cover normalization, mass-dependent +1/+0 growth, and edge cases (n=0, n=1). New module is unused by default; zero change to existing pipeline output (bit-identical gate: ok).
…oes NOT control FDR (Astral still +97%)
…nt; fragment competition is the real missing discriminator
… merge-gate/TMT limit
…the remaining gate blocker
…ew (low overlap tentatively challenges fragment-theft premise)
… overlap probe pending Astral
… 38% fracmin>=0.5 vs BSA 13%) Decisive Astral chimeric run (MSGF_CHIMERIC_OVERLAP=1, n=121,423 co-emitting scans): mean fracmin 0.367, 38% >=0.5, bimodal with a near-total-overlap mode at [0.9-1.0)=11.4%. BSA low-overlap pattern does NOT hold on real co-isolated data. Fragment-theft premise validated for a substantial fraction -> Phase 3 (shared-fragment competition) is the relevant fix for those scans; ~28% coincidental tail still needs per-scan FDR.
…ce filter spec Approach A (filter + additive PIN columns, no score modification). Greedy peak-claiming (rank-1 first, protected) drives a hard pre-Percolator filter on UniqueMatchedIons (swept knob, decoy-symmetric) + 3 additive PIN columns. Grounded in the 2026-05-29 Astral overlap result (theft confirmed, bimodal). DoD: Astral canary returns to ~36.7k AND PXD beats Java, --chimeric off bit-identical, wall within 3%. Research toward trustworthy chimeric; not a merge.
…l SpecEValue + Percolator on all features) Drop the hand-tuned --chimeric-min-unique-ions filter. Discrimination is the score itself at two layers: (1) in-engine residual SpecEValue re-score on uniquely-claimed peaks deflates theft/coincidental rank>=2 peptides; (2) Percolator on the full feature vector (re-scored RawScore/lnSpecEValue + additive unique-evidence columns). Rule-2-safe: off bit-identical, rank-1 untouched, only chimeric extra rows change.
… SpecEValue re-score Two-layer discrimination, no parameter (per design 93178d7): - Pure competition core (shared_fragment.rs): greedy peak-claiming most-confident first; per rank>=2 peptide compute unique-evidence (UniqueMatchedIons, UniqueExplainedFraction, SharedFracClaimed). 7 unit tests. - match_engine hook (guarded on --chimeric): walk emitted PSMs best-first, claim peaks, and re-score each rank>=2 PSM's RawScore + GF SpecEValue on the residual (unclaimed) spectrum via rescore_residual_spec_e. A theft/coincidental peptide, stripped of stolen peaks, gets a worse SpecEValue and drops out of the FDR set on its own; no hard filter, no threshold. Decoy-symmetric. - Additive PIN columns gated on --chimeric (off path byte-identical). Header gating test + existing schema-parity test green. Smoke (BSA test.mgf): off has no Phase-3 cols, on emits 3; 55 rows shared>0, 37 fully-stolen rows deflated to negative residual RawScore / lnSpecE ~0. Validation gate (Astral canary -> ~36.7k, PXD>Java) pending VM bench.
…ary (+111%, not deflated) off vs on @1% FDR: PXD 14,808->18,306, Astral 36,715->77,444 (canary should be ~flat), TMT 9,605->9,362. Decoy fraction 0.83 on-runs (structural inflation intact). Root cause: a per-PSM score change deflates spurious targets AND decoys symmetrically -> q-value curve unchanged -> aggregate 1% count doesn't move. The broken part is the PSM-level FDR model (Phase-2 requirement #2), not the per-peptide score. Phase 3 refuted as a gate-clearer; impl kept as validated negative result; --chimeric off byte-identical; nothing ships.
Approach A: separate target-decoy FDR for rank-1 vs rank>=2 strata (split PIN by rank -> 2 Percolator runs -> sum @1%). Tests the structural fix after Phase 3 (per-PSM rescore) was refuted. Measure on both non-rescored Phase-1 PIN (model alone) and Phase-3 rescored PIN (composition). Canary: Astral total ~36.7k; win: PXD>14,974. No Rust production change (test-only NO_RESCORE env gate).
…+4,362 real PSMs, FDP 1.54->1.13)
… ms2pip/deeplc) DRAFT design from two feasibility investigations: native Percolator integration (stdin pipe to bundled binary, perfect 3.7.1 parity) + ML rescoring features (spectral-angle/RT-delta additive PIN columns, predictions via self-hostable Koina first, native ONNX/XGBoost embedding as phase 2). Pending user review of the open decisions before any implementation plan.
…ct win on all 3 datasets
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
… TSV precursor override CI fix: the raw-peak-match commit (3556bee) inserted primary_matched_peak_keys between search_secondary's doc and its #[allow(clippy::too_many_arguments)], so the allow landed on the 3-arg helper and the 11-arg search_secondary lost it -> clippy -D warnings failed CI. Reorganized so each fn has its own doc; allow back on search_secondary. Review fix: the TSV writer's Precursor + Da-mode PrecursorError columns used the primary scan's spec.precursor_mz for chimeric secondaries; now use precursor_mz_override (mirrors pin.rs). None for ordinary PSMs -> byte-identical. TSV is not the Percolator path, so validated PIN results are unaffected. Found by: CI (clippy) + the code review reviewers.
Code review (A/B + E fix commits)Focused review of the post-open fix commits (
No CLAUDE.md performance-invariant violations (off-path Note: PR is a draft (blocked by the TMT merge gate); this review was run on explicit request. |
Review fixes: - HIGH: detect_coisolated now excludes the selected precursor's isotope peaks in BOTH directions (selected_mz +/- k*ISOTOPE/z). Previously only higher isotopes were skipped, so when the instrument selected M+1/M+2 the true monoisotope at selected_mz - k*ISOTOPE/z could be re-discovered as a fake co-isolated secondary (self-inflation of the primary's own peptide). - search_secondary picks the winner by score (min spec_e_value, then max rank_score, then candidate idx) instead of the unordered drain_into_vec().next() -> removes nondeterminism when tied secondaries survive the capacity-1 queue. - --chimeric on non-mzML now sets params.chimeric=false, so the PIN schema/SpecId gates and top-N forcing all stay on the normal path (the warning's 'runs normally' is now literally true). - --chimeric-frag-index defaults to off: the prefilter index is unreachable in the shipped cascade (cascade_wide is always false), so on/auto only built an unused index and cost memory + startup. Rename: the inverted fragment-posting prefilter index (Approach B) and its module/ file are renamed to FragmentPostingIndex / fragment_posting_index to avoid implying provenance from another engine. Build + clippy(-D warnings) + search/output tests green.
Comment-only cleanup of the chimeric cascade files (coisolation, match_engine cascade additions, binary chimeric wiring, pin/tsv precursor override, psm field doc). ~90 net comment lines removed; kept the essential 'why' notes and all public-item doc comments. Zero logic changes; build + clippy -D warnings green.
…libration, tolerant MS1 read Finding 1 (high, bounded memory): the chimeric path no longer batch-reads the whole file. New MzMLReader::read_with_ms1_chunked streams MS2 in CHUNK_SIZE batches, each with a bounded per-chunk Ms1Link (only the carry-over MS1 crosses a chunk boundary). The binary scores Pass 1 + Pass 2 per chunk on a parser-thread pipeline and drops peaks immediately, so RSS is bounded to ~CHUNK_SIZE spectra. run_pass2_coisolation now takes an explicit per-chunk link + global offset. Unit test proves chunked MS1 linkage matches the batch read across chunk boundaries + carry-over. Finding 2 (high, calibration): search_secondary applies the learned precursor_mass_shift_ppm (adjusted_observed_neutral_mass) to the co-isolated neutral mass before the candidate prefilter and mass-error report, matching the main search. Regression test with a non-zero shift added. Finding 3 (med, tolerant parsing): read_with_ms1 + read_with_ms1_chunked no longer abort on the first malformed spectrum; they deliver the spectra parsed so far and (chunked) report the error count, mirroring the MS2-only streaming path. clippy -D warnings + search/output/input tests green (incl. 2 new tests).
…al resync Finding A (high): Pass-2 secondaries on the same scan now COMPETE for residual evidence. search_secondary takes a prior_claimed peak-key set and returns the peaks its winner explained; run_pass2_coisolation threads these across the scan's co-isolated precursors so a peak claimed by one secondary is removed before the next is scored (no double-counting of shared leftover peaks on multi-precursor scans). New test asserts a peptide matches fewer ions once its peaks are claimed. Finding B (high): the chimeric mzML reader now does REAL resync instead of silent tail-truncation. New MzMLReader::resync_to_next_spectrum skips a malformed scan to the next <spectrum> and continues; only an unreadable XML stream stops parsing. Wired into read_with_ms1_chunked + read_with_ms1. New test: a bad spectrum between two good ones is skipped (err_count=1) and BOTH good spectra survive. clippy -D warnings + input/search tests green (incl. 2 new tests).
…m comments CLI help (msgf-rust binary): - Removed stale internal references (pre-iter39, 'G1 gate / DOCS.md section', 'Fix B'). - Rewrote --chimeric help to describe the actual two-pass cascade (was still the old blind wide-window description). - Rewrote --precursor-cal help (off/auto/on, learn+tighten); clarified --max-spectra, --ms-level, --isolation-halfwidth. - Hid the advanced/dead --chimeric-frag-index flag from --help. - Dropped redundant inline '(default X)' — clap already shows [default: X]. README: completed the parameter table (required vs optional with defaults in bold, added --precursor-cal/--chimeric/--decoy-prefix/--ms-level) + a 'Chimeric / co-isolated peptides' section describing the opt-in cascade. Comments: removed development-history jargon (iterNN labels, R-2/C-4/HIGH-2 change codes, project phase/task labels, dead doc-links, date stamps, commit SHAs) from ~150 comment lines across search/output/scoring/model/binary, preserving all technical rationale and doc contracts. Zero logic changes; clippy -D warnings + build + tests green.
Removes the refuted blind-wide-window 'Approach B' that was unreachable behind a hardcoded `cascade_wide = false`: the fragment-posting prefilter index, the greedy shared-fragment competition + residual rescore, and the `--chimeric-frag-index` flag / FragIndexMode / frag_index_active. The shipped two-pass cascade (run_pass2_coisolation / coisolation.rs) and the --chimeric-off path are unchanged. - Delete fragment_index.rs, fragment_posting_index.rs, shared_fragment.rs. - Strip cascade_wide branches in run_chunk_inner (keep the narrow path), the PreparedSearch/PreparedParts fragment_posting_index field, rescore_residual_spec_e, and candidate_nominal_bounds' chimeric param (narrow-only). - Remove FragIndexMode + chimeric_frag_index from SearchParams + CLI. - Drop the 3 always-zero shared-fragment PIN columns (UniqueMatchedIons, UniqueExplainedFraction, SharedFracClaimed); chimeric PIN 42 -> 39 cols, off-path PIN unchanged at 39. build + clippy -D warnings + search/output/input tests green; removed-symbol grep empty.
P1 (perf, behavior-identical): - Drop per-spectrum window_cand_indices clone; iterate by reference. - Avoid isotope_error_range.clone() in the hot scoring loop (start..=end). - merge_unique_candidate_idxs: O(k^2) Vec::contains -> FxHashSet membership, Vec kept for identical output order. P2 (robustness / quality): - score_dist::get_probability returns 0.0 for score > max_score instead of panicking (empty-tail-mass; defensive — match_engine still guards, so normal results unchanged). New unit test. - generating_function: release-safe DROPPED_NODES atomic + accessor for the |score|>10000 node drop (was silent). - Rename misleading test to nearest_peak_full_picks_max_intensity_within_tolerance and assert max-intensity (not nearest-m/z) selection. - msgf-trace GF lookup uses rank_score (production key) not score; documented the remaining single-bin-graph divergence. - Refresh stale R-2.x header in match_engine_java_parity. build + clippy -D warnings + search/output/scoring/input tests green.
… blocker, reverted (n=9)
…ision to stop P0 grind
…F audit) Cascade summary note: add Final state (rev5) section recording the code review, two adversarial rounds, dead-code cleanup, and GF/SpecE parity audit; final numbers (Astral +101%, PXD +11%, TMT -5% blocker), HEAD b46b610, gate status. Fix stale HEAD reference in the header. CLAUDE.md: replace the abandoned fragment-index 'speed v2' next-work section with the current chimeric cascade DRAFT PR #42 state and TMT closing options; record that both fragment-index approaches were refuted this session.
Integrates the DeltaRawScore feature (originally bea5d697 on feat/delta-raw-score) onto feat/chimeric-dda-plus. A direct branch merge conflicts on all touched files and would revert the cascade's golden, so the feature is re-applied at the current anchors instead. Top-1 dominance signal: RawScore(best) − RawScore(2nd-best distinct peptide) per spectrum, captured during candidate scoring (independent of the TopNQueue, so it survives top_n=1) and WITHOUT feeding the GF min_score — no emitted PSM's SpecEValue/RawScore changes. Distinct-peptide keyed by nominal residue mass so a shared peptide (multi-charge / multi-protein) doesn't zero the delta. Emitted on the rank-1 PIN row only (0.0 elsewhere), placed after the chimeric MS1 features (PrecursorIsotopeKL/PrecursorSNR), before Peptide/Proteins. Purely additive: off-path PIN gains one column and is otherwise unchanged; TSV is byte-identical. Golden precursor_cal_off.pin regenerated from this binary (39→40 cols). Schema-parity test updated to Java + 4 additive cols. Prior bench (vs the narrow pre-cascade baseline): +129 PXD / +104 Astral / +12 TMT @1% FDR, zero wall cost, decoy structure unchanged.
Add additive DeltaRawScore PIN column to the chimeric cascade
Review Summary by QodoChimeric two-pass cascade: +101% Astral / +11% PXD PSMs with bounded-memory streaming and two-round adversarial hardening
WalkthroughsDescription• **Chimeric two-pass cascade** (opt-in --chimeric flag): recovers co-isolated second peptides without wide-window FDR inflation - Pass 1: narrow top-1 primary search per scan (fast path, byte-identical to current engine when disabled) - Pass 2: MS1-gated targeted secondary search on residual spectrum, detecting co-isolated precursors via averagine-KL divergence • **Empirical results** (same-machine vs Java MS-GF+, entrapment-validated): - Astral (LFQ DDA, HCD): +101% PSMs, faster (6:38 vs 6:46), 10.9 GB maxRSS - PXD001819 (UPS1 yeast): +11% PSMs, faster (1:14 vs 1:22), 2.3 GB maxRSS - TMT (CID): −5% PSMs (narrow window, no co-isolation), faster (2:14 vs 3:07), 7.7 GB maxRSS • **Core implementation**: - New coisolation.rs module: detect_coisolated() and search_secondary() functions - New chimeric_features.rs module: precursor isotope-envelope matching (KL divergence, SNR) - New isotope.rs module: theoretical averagine isotope envelope computation - MS1 capture infrastructure in mzml.rs: Ms1Link struct, read_with_ms1_chunked() streaming with bounded memory - PreparedParts caching in match_engine.rs to reuse candidate enumeration across calibration and main passes • **Hardened through two rounds of adversarial review**: - Round 1: top_n MGF gating, real secondary features on residual, precursor_mz_override for correct ExpMass/dm, lower-isotope self-secondary exclusion, deterministic secondary winner (best-SpecEValue) - Round 2: bounded memory (streaming instead of buffering), Pass-2 calibration (precursor shift before prefilter), sequential peak competition (no double-counting), real resync (skip malformed spectra, continue parsing) • **New PIN columns**: PrecursorIsotopeKL, PrecursorSNR, DeltaRawScore (additive chimeric/dominance features) • **CLI integration**: --chimeric flag, --isolation-halfwidth parameter, forced top_n=1 under chimeric mode • **Bug fixes**: GF score distribution out-of-range guard, msgf-trace GF threshold alignment with production • **Documentation cleanup**: removed iteration-specific comments and references throughout codebase • **Comprehensive test coverage**: isolation-window parsing, MS1 capture, chunked reading, error recovery, co-isolation detection, secondary peptide recovery Diagramflowchart LR
A["Input: mzML<br/>with MS1 data"] -->|"read_with_ms1_chunked<br/>bounded per-chunk"| B["MS1 capture<br/>Ms1Link"]
B -->|"Pass 1:<br/>narrow top-1"| C["Primary search<br/>best peptide"]
C -->|"PreparedParts<br/>cached candidates"| D["Pass 2:<br/>MS1-gated targeted"]
B -->|"averagine-KL<br/>co-isolation detect"| E["Secondary candidates<br/>co-isolated masses"]
E -->|"residual spectrum<br/>sequential peaks"| F["Secondary search<br/>targeted scoring"]
F -->|"single-bin GF<br/>SpecEValue"| G["Secondary PSMs<br/>extra rows"]
C -->|"top-1 only"| H["PIN output<br/>+ isotope/SNR/delta features"]
G -->|"precursor_mz_override"| H
File Changes1. crates/search/src/match_engine.rs
|
Code Review by Qodo
1. MS level filter widened
|
| pub fn read_with_ms1(mut self) -> std::io::Result<(Vec<Spectrum>, Ms1Link)> { | ||
| // To capture MS1 (level 1) the parser must let level-1 spectra reach | ||
| // the spectrum-End handler rather than being dropped by the level | ||
| // filter. We widen the internal min level to 1 ONLY when capturing; | ||
| // MS1 is intercepted before `finish_spectrum`, so the effective output | ||
| // is still MS2-only. With capture off, the filter is untouched. | ||
| if self.capture_ms1 { | ||
| self.ms_level_min = 1; | ||
| } | ||
|
|
||
| let mut spectra: Vec<Spectrum> = Vec::new(); | ||
| let mut ms2_to_ms1: Vec<Option<usize>> = Vec::new(); | ||
|
|
||
| loop { | ||
| match self.pump() { | ||
| Ok(Some(s)) => { | ||
| // Each spectrum returned by `pump` here is an emitted MS2 | ||
| // (MS1 is intercepted inside `pump` and never returned). | ||
| // Link it to whatever MS1 most recently preceded it. | ||
| ms2_to_ms1.push(self.latest_ms1_idx); | ||
| spectra.push(s); | ||
| } | ||
| Ok(None) => break, | ||
| Err(_e) => { | ||
| // Resync past the malformed spectrum and keep parsing (skip the | ||
| // bad scan, not the rest of the file). Only an unreadable XML | ||
| // stream stops us. | ||
| match self.resync_to_next_spectrum() { | ||
| Ok(true) => continue, | ||
| Ok(false) | Err(_) => break, | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| let link = Ms1Link { | ||
| ms1_peaks: std::mem::take(&mut self.captured_ms1), | ||
| ms2_to_ms1, | ||
| }; | ||
| Ok((spectra, link)) | ||
| } | ||
|
|
||
| /// Streaming, bounded-memory, tolerant variant of [`Self::read_with_ms1`] for | ||
| /// the chimeric cascade. | ||
| /// | ||
| /// Calls `on_chunk(ms2_spectra, ms1_link)` for each batch of up to | ||
| /// `chunk_size` MS2 spectra, where `ms1_link` covers ONLY that chunk. RSS | ||
| /// stays bounded by the chunk size: at most the MS1 scans referenced by the | ||
| /// in-flight chunk are retained, never the whole file (each MS2 links to its | ||
| /// most-recent preceding MS1, so only that carry-over scan crosses a chunk | ||
| /// boundary). Stops after `cap` total MS2 (`usize::MAX` = unbounded). | ||
| /// | ||
| /// Tolerant: a malformed spectrum does NOT abort the run. The first parse | ||
| /// error stops streaming and the successfully-parsed spectra so far are still | ||
| /// delivered (mirroring the MS2-only streaming path); the error count and the | ||
| /// first few messages are returned for reporting. | ||
| pub fn read_with_ms1_chunked<F>( | ||
| mut self, | ||
| chunk_size: usize, | ||
| cap: usize, | ||
| mut on_chunk: F, | ||
| ) -> (usize, Vec<String>) | ||
| where | ||
| F: FnMut(Vec<Spectrum>, Ms1Link), | ||
| { | ||
| self.capture_ms1 = true; | ||
| self.ms_level_min = 1; // let MS1 reach the capture hook; output stays MS2-only | ||
|
|
||
| let mut err_count = 0usize; |
There was a problem hiding this comment.
1. Ms level filter widened 🐞 Bug ≡ Correctness
MzMLReader::read_with_ms1/read_with_ms1_chunked force ms_level_min = 1 during MS1 capture, which broadens the configured [ms_level_min, ms_level_max] filter and can emit/search unintended MS levels (e.g. MS2) when the CLI requested a single higher MS level (e.g. MS3). This changes search inputs/results under --chimeric for non-default --ms-level values.
Agent Prompt
### Issue description
`read_with_ms1` and `read_with_ms1_chunked` mutate `self.ms_level_min` to `1` to “allow MS1 through”, but MS1 capture already happens before `finish_spectrum()` applies the ms-level filter. Lowering `ms_level_min` therefore broadens the *emitted* spectrum levels: e.g. a requested range `[3,3]` becomes `[1,3]`, and MS2 spectra can be emitted/searched unexpectedly.
### Issue Context
The binary configures a single requested ms level via `with_ms_level_range(mslevel, mslevel)` and then calls `read_with_ms1_chunked` in the chimeric path. Because the reader lowers `ms_level_min`, levels below `mslevel` become eligible for emission.
### Fix Focus Areas
- Remove or redesign the `ms_level_min = 1` mutation so MS1 capture does not change the output ms-level filter.
- Add a regression test for `with_ms_level_range(3,3)` + MS1 capture ensuring MS2 is NOT emitted.
- crates/input/src/mzml.rs[720-728]
- crates/input/src/mzml.rs[784-787]
- crates/input/src/mzml.rs[292-298]
- crates/msgf-rust/src/bin/msgf-rust.rs[841-846]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
DeltaRawScore re-bench on the cascade (post-#43)After merging the additive
Wall/RSS unchanged (Astral 6:51 / PXD 1:17 / TMT 2:17; maxRSS 10.9 / 2.3 / 7.7 GB) — DeltaRawScore is zero-cost. Takeaways:
Merge rationale: shipping as opt-in Caveats: the +117 TMT should be confirmed with a repeat (above typical Percolator noise but worth locking in), and TMT entrapment FDP was not measured in this run (additive feature; rev5 was 0.80%). |
Replaces the 'Iter2 perf / post-PR-V1 binary / 39% flamegraph' narrative with the durable rationale (FxHashMap over std HashMap because variants_for is a hot lookup and SipHash dominated its cost). The internal milestone references are meaningless to an outside reader. Last surviving such comment in the source tree (the chimeric files were cleaned in c5c8ea8).
Per the project's comment-hygiene preference: code comments describe the code as it is, not the development history. Strips iteration/milestone narrative, refuted- experiment write-ups, and perf-regression stories from doc/line comments across search, scoring, output, and the CLI binary — keeping the durable technical rationale (why FxHashMap, why abs-ppm units, why no charge>2 deconv guard, why EdgeScore is a separate column, the Java num_distinct offset semantics). The reverted/negative-result learnings these comments narrated are already recorded in project memory (iter-history), so nothing durable is lost. No code changed; cargo check green.
Summary
Opt-in
--chimerictwo-pass cascade formsgf-rust: recovers co-isolated second peptides (MaxQuant "second-peptide" model) without the wide-window FDR-inflation cost.--chimeric off(default) is byte-identical to the current engine.Hardened through two rounds of adversarial review (see below).
Results — same-machine vs Java MS-GF+ (entrapment-validated, FDRBench 1:1)
Gains are real co-isolated peptides — both the reversed-decoy and entrapment rulers agree (FDP ~1% on all three). Astral and PXD001819 beat Java on both PSMs and speed; the cascade adds no unbounded memory (chimeric maxRSS == non-chimeric; both index-dominated).
Cascade core
crates/search/src/coisolation.rs(new),run_pass2_coisolation+PreparedPartsinmatch_engine.rs, streaming wiring in the binary,force_pushinpsm.rs.Speed/quality optimizations
raw-peak primary match; build candidate index once (
PreparedParts); single-bin GF per secondary.Review fixes (two rounds, in this PR)
Round 1 (5-agent + line review): top_n MGF gating; real secondary features (
compute_psm_featureson residual);precursor_mz_overridefor correct ExpMass/dm/absdm (PIN + TSV); lower-isotope self-secondary exclusion; deterministic secondary winner (best-SpecEValue, not arbitrary heap order — this drove a large Astral gain);--chimeric-frag-indexdefaults off; renamed the prefilter index toFragmentPostingIndex.Round 2 (adversarial): bounded memory — chimeric path now STREAMS (
read_with_ms1_chunked, per-chunk bounded MS1, parser-thread pipeline) instead of buffering the whole file; Pass-2 calibration —search_secondaryapplies the learned precursor shift before the candidate prefilter; Pass-2 competition — secondaries on one scan now claim peaks sequentially so they can't double-count shared leftovers; real resync — a malformed spectrum is skipped and parsing continues (no silent tail-truncation). Each fix has a regression/unit test.Do not merge yet — blocked by the merge gate
Rust must beat Java on both PSMs and speed on all 3 datasets. TMT PSMs still trail −5% — TMT is CID/narrow with ~no co-isolation, so the cascade can't help it; the gap is a per-peptide CID node-scoring divergence, deferred to a future iteration (additive Percolator features / per-ion CID trace). Diagnosis:
docs/parity-analysis/notes/2026-05-31-tmt-gap-diagnosis-not-gf-bug.md. This PR is a reviewable checkpoint, marked draft.Known follow-ups
fragment_index.rs,fragment_posting_index.rs,shared_fragment.rs, behindfalseflags) to be stripped before a real merge.send_chunkspath could also adopt resync (currently unchanged; out of chimeric scope).Reference
docs/parity-analysis/notes/2026-05-31-cascade-optimized-multidataset-summary.md