Conversation
Phased design to recover co-fragmented peptide IDs via the DDA+ approach (full-isolation-window search FIRST, then MS1 targeted-XIC isotope refinement, then greedy shared-fragment rescoring). Maps each DDA+ step to concrete msgf-rust components: - Phase 1: parse isolation-window width + widen candidate enumeration to the full window + emit top-N distinct-peptide PSMs/scan. Behind --chimeric (default off => bit-identical). Reuses the existing bucket_index range scan + per-charge multi-queue machinery. - Phase 2: MS1 targeted-XIC + isotope KL-divergence as an ADDITIVE PIN feature (audit-safe), or an external handoff (lean-Rust option). - Phase 3: greedy shared-fragment rescoring (additive-first; bench-gated). - Cross-cutting: candidate explosion ties into the planned fragment-index speed enabler; measure Phase-1 wall first before building it. Motivated by the PR #40 finding that scoring parity is exhausted, so ID gains now require a new capability rather than parity fixes.
…arch + multi-PSM) Bite-sized TDD tasks for Phase 1 (option A, existing bucket scan): 1. Spectrum isolation-window offset fields 2. mzML <isolationWindow> parsing (MS:1000828/829) 3. --chimeric + --isolation-halfwidth CLI/SearchParams wiring 4. widen candidate enumeration to the isolation window when chimeric 5. retain + emit top-N distinct-peptide PSMs/scan + SpecId uniqueness 6. workspace gate + VM bench decision point (off=bit-identical, on=PXD/TMT measure) Every task keeps the chimeric=false path on existing code, guarded by the bit-identical PIN golden test. Task 6 is the ship/defer decision (incl. whether the fragment-index enabler becomes a prerequisite if wall is unacceptable).
Add `isolation_lower_offset: Option<f64>` and `isolation_upper_offset: Option<f64>` to `Spectrum`, initialized to `None` in every constructor/literal across the workspace. Bit-identical to baseline (no scoring path touched).
Add `chimeric: bool` and `chimeric_isolation_halfwidth_da: f64` fields to `SearchParams` (defaulting to false/1.5) and the matching `--chimeric` / `--isolation-halfwidth` CLI flags; wire them into SearchParams assembly. When `--chimeric` is on and `--top-n` is at its default of 1, top-N is automatically raised to 5 with an eprintln notice. No search behavior changes; default off path is bit-identical.
…eric Extract candidate_nominal_bounds(spec, z, params, shift_ppm). When params.chimeric, derive the candidate nominal-mass window from the full isolation window (selected m/z ± isolation offsets, or the configured half-width fallback) converted to neutral nominal mass per charge, with isotope-error widening only (the window already exceeds precursor tol). When chimeric is off, the derivation is byte-identical to the original inline code. Test candidate_nominal_bounds_chimeric_spans_isolation_window asserts the chimeric window strictly contains the standard one and that an off-precursor co-isolated mass is reachable only under chimeric. Bit-identical PIN golden gate green (chimeric defaults off); clippy clean.
… loop Branch the per-candidate iso-offset match: under --chimeric use matches_isolation_window (accept anywhere in the isolation window, hoisted window bounds), else the existing matches_precursor tight check. Bit-identical PIN golden gate green (chimeric off path unchanged). Also fixes a clippy field-reassign-default in the isolation-window unit test.
A chimeric scan emits multiple distinct-peptide PSMs that can share a SpecE rank (iter_ranked increments rank only on a distinct spec_e_value), which would collide on the `spec_scan_rank` SpecId. Append the per-row emission index under --chimeric only; the standard SpecId format is unchanged so the bit-identical PIN golden still holds. Test asserts two same-SpecE co-fragmented PSMs get distinct SpecIds.
…DR without MS1/shared-fragment refinement
…nement)
Five bite-sized tasks to add the MS1 isotope-envelope check that suppresses
the Phase-1 FDR inflation:
1. averagine theoretical isotope envelope (new model::isotope)
2. optional MS1 capture + MS2->MS1 linkage in the mzML reader
3. observed precursor isotope envelope + KL-divergence + SNR features
4. additive PrecursorIsotopeKL + PrecursorSNR PIN columns
5. VM bench decision gate (does the KL feature deflate Astral's +94%? if not,
Phase 3 shared-fragment rescoring is required)
v1 uses the single linked MS1 (apex); multi-scan XIC correlation deferred to
Phase 2b. Additive PIN columns + --chimeric-off bit-identical throughout.
Add `averagine_isotope_envelope(mass, n_isotopes) -> Vec<f64>` in a new `crates/model/src/isotope.rs` module. Uses the averagine + Poisson model (lambda ≈ mass * 4.76e-4 from 13C natural abundance) to compute normalized precursor isotope peak intensities. Registered as `pub mod isotope` in lib.rs. Three unit tests cover normalization, mass-dependent +1/+0 growth, and edge cases (n=0, n=1). New module is unused by default; zero change to existing pipeline output (bit-identical gate: ok).
…oes NOT control FDR (Astral still +97%)
…nt; fragment competition is the real missing discriminator
… merge-gate/TMT limit
…the remaining gate blocker
…ew (low overlap tentatively challenges fragment-theft premise)
… overlap probe pending Astral
… 38% fracmin>=0.5 vs BSA 13%) Decisive Astral chimeric run (MSGF_CHIMERIC_OVERLAP=1, n=121,423 co-emitting scans): mean fracmin 0.367, 38% >=0.5, bimodal with a near-total-overlap mode at [0.9-1.0)=11.4%. BSA low-overlap pattern does NOT hold on real co-isolated data. Fragment-theft premise validated for a substantial fraction -> Phase 3 (shared-fragment competition) is the relevant fix for those scans; ~28% coincidental tail still needs per-scan FDR.
…ce filter spec Approach A (filter + additive PIN columns, no score modification). Greedy peak-claiming (rank-1 first, protected) drives a hard pre-Percolator filter on UniqueMatchedIons (swept knob, decoy-symmetric) + 3 additive PIN columns. Grounded in the 2026-05-29 Astral overlap result (theft confirmed, bimodal). DoD: Astral canary returns to ~36.7k AND PXD beats Java, --chimeric off bit-identical, wall within 3%. Research toward trustworthy chimeric; not a merge.
…l SpecEValue + Percolator on all features) Drop the hand-tuned --chimeric-min-unique-ions filter. Discrimination is the score itself at two layers: (1) in-engine residual SpecEValue re-score on uniquely-claimed peaks deflates theft/coincidental rank>=2 peptides; (2) Percolator on the full feature vector (re-scored RawScore/lnSpecEValue + additive unique-evidence columns). Rule-2-safe: off bit-identical, rank-1 untouched, only chimeric extra rows change.
… SpecEValue re-score Two-layer discrimination, no parameter (per design 93178d7): - Pure competition core (shared_fragment.rs): greedy peak-claiming most-confident first; per rank>=2 peptide compute unique-evidence (UniqueMatchedIons, UniqueExplainedFraction, SharedFracClaimed). 7 unit tests. - match_engine hook (guarded on --chimeric): walk emitted PSMs best-first, claim peaks, and re-score each rank>=2 PSM's RawScore + GF SpecEValue on the residual (unclaimed) spectrum via rescore_residual_spec_e. A theft/coincidental peptide, stripped of stolen peaks, gets a worse SpecEValue and drops out of the FDR set on its own; no hard filter, no threshold. Decoy-symmetric. - Additive PIN columns gated on --chimeric (off path byte-identical). Header gating test + existing schema-parity test green. Smoke (BSA test.mgf): off has no Phase-3 cols, on emits 3; 55 rows shared>0, 37 fully-stolen rows deflated to negative residual RawScore / lnSpecE ~0. Validation gate (Astral canary -> ~36.7k, PXD>Java) pending VM bench.
…ary (+111%, not deflated) off vs on @1% FDR: PXD 14,808->18,306, Astral 36,715->77,444 (canary should be ~flat), TMT 9,605->9,362. Decoy fraction 0.83 on-runs (structural inflation intact). Root cause: a per-PSM score change deflates spurious targets AND decoys symmetrically -> q-value curve unchanged -> aggregate 1% count doesn't move. The broken part is the PSM-level FDR model (Phase-2 requirement #2), not the per-peptide score. Phase 3 refuted as a gate-clearer; impl kept as validated negative result; --chimeric off byte-identical; nothing ships.
Approach A: separate target-decoy FDR for rank-1 vs rank>=2 strata (split PIN by rank -> 2 Percolator runs -> sum @1%). Tests the structural fix after Phase 3 (per-PSM rescore) was refuted. Measure on both non-rescored Phase-1 PIN (model alone) and Phase-3 rescored PIN (composition). Canary: Astral total ~36.7k; win: PXD>14,974. No Rust production change (test-only NO_RESCORE env gate).
…spatch) ThermoRawReader (crates/input/src/thermo.rs, feature 'thermo') wraps the official RawFileReader via thermorawfilereader, yielding the same Spectrum model as the mzML/MGF readers (native_id title, precursor m/z+charge+intensity, RT, centroided peaks; MS2 by default, with_all_ms_levels for chimeric later). Iterates by index (RawFileReader::get) so it owns the handle without a self-referential borrow. CLI dispatch: .raw auto-detected by extension -> ThermoRawReader in the non-chimeric streaming reader (feature-gated; a clear error when built without 'thermo'). is_mgf now distinguishes MGF from .raw for the MGF-only ms-level warning and TSV title synthesis; the MS-level filter message covers .raw too. --precursor-cal falls back to off with a warning on .raw for now (calibration .raw support is a follow-up); --chimeric on .raw degrades to a normal search (MS1 streaming is M2). Default (pure-Rust) build unaffected and green; the thermo path builds on rustc>=1.88 with the .NET 8 runtime at runtime. Validated on real data: reading the 2.4GB Astral .raw yields 127,225 spectra (3,686 MS1 + 123,539 MS2), an exact match to the mzML's spectrumList count.
…lf-contained release M2 — chimeric cascade on .raw: - thermo.rs maps the precursor isolation window (RawFileReader lower/target/upper m/z -> model offsets) so co-isolation detection works on .raw. - ThermoRawReader::read_with_ms1_chunked mirrors MzMLReader's: streams MS2 in bounded chunks each paired with an Ms1Link (per-chunk MS1 peaks + per-MS2 link to the preceding MS1, carried across chunk boundaries). Lets --chimeric consume .raw exactly like mzML. - CLI: --chimeric now enabled for mzML OR .raw (MGF still degrades with a warning); the chimeric parser thread dispatches MzMLReader vs ThermoRawReader. M3 — tests + docs: - crates/input/tests/thermo_raw.rs: feature-gated integration tests (read MS2 + chunked MS1-link consistency), skipped unless MSGF_TEST_RAW points at a .raw (proprietary format, no fixture committed). - README 'Reading Thermo .raw files' section (self-contained release vs build-from-source + .NET 8 install, auto-dispatch, --chimeric, container snippet); CLI table + --chimeric updated; crates/input/THERMO_LICENSE.txt (Thermo license). Release — self-contained cross-platform packaging: - main() auto-points DOTNET_ROOT at a bundled <exe_dir>/dotnet runtime when present (else honors an existing DOTNET_ROOT / system install). - release.yml: linux-x64 / macOS x64+arm64 / windows-x64 built --features thermo (RUSTUP_TOOLCHAIN=stable overrides the repo 1.87 pin for thermo's >=1.88 deps) and bundle the matching .NET 8 runtime in dotnet/ -> users install nothing. linux-arm64 stays a plain no-thermo cross build. Default (pure-Rust) build unaffected and green; M2 thermo build verified on the VM (rustc 1.95). Chimeric .raw-vs-mzml parity validating on Astral.
…ywhere The chimeric .raw reader (read_with_ms1_chunked) now explicitly skips MS3+ scans with a visible count — only MS2 is searched and only MS1 links it, so a TMT SPS-MS3 acquisition can never have its MS3 reporter-quant scans scored. Both the chimeric and non-chimeric paths already default to MS2 (ms_level default = 2) and honor --ms-level as an explicit override (mzML + .raw); the chimeric cascade is always MS2 by construction. Clarified the --ms-level docs + README accordingly. Verified on the VM (5000-spectrum cap): min-peaks filters .raw (45,996 vs 46,011), runs report 'only MS2 spectra', and .raw matches mzML within noise (Δ24/46,011). Charge --min/--max only gates charge-UNKNOWN spectra (determined charges are searched as-is) — identical behavior for .raw and mzML (same engine path).
Charge-unknown spectra now try precursor charges 2..=5 by default (determined charges from the spectrum are still searched as-is). Widens recall for higher charge states common on modern instruments. Applies to all input formats. Side effects (handled): the default PIN gains charge4/charge5 one-hot columns (42 cols on the BSA fixture); the precursor_cal_off golden is regenerated from the new default. The Java-parity schema test sets charge_range=2..=3 explicitly and is unaffected. README charge defaults + column note updated. All tests green.
… auto-route, chimeric MS2-only) Resolves the 4 issues from the code review (PR #44) + Codex adversarial review: 1. Precursor-less MS2 scans are now SKIPPED, not searched. convert() returns Option<Spectrum> and drops any MS2 whose precursor record is missing or has m/z <= 0 (both the iterator and the chunked chimeric reader honor it), matching the mzML reader instead of forwarding precursor_mz=0.0. 2. extract_peaks() now guarantees m/z-ascending order (sorts only when an inversion is present), upholding the Spectrum.peaks contract that coisolation's partition_point binary search relies on. 3. .raw param auto-routing: detect_activation_instrument() peeks the .raw's vendor activation (DissociationMethod) + analyzer (MassAnalyzer) and routes the bundled .param accordingly, so a CID/ETD/ion-trap .raw is no longer silently scored with HCD/QExactive. convert() also stamps each Spectrum's activation_method. Validated on the Astral .raw: auto-detects HCD + QExactive. 4. --chimeric is MS2-only on EVERY format: the mzML chimeric reader is hardcoded to MS2 (was with_ms_level_range(ms_level,...), which let --ms-level 3 widen to 1..=3 and admit MS3); a non-2 --ms-level under --chimeric now warns and is ignored. Closes the TMT SPS-MS3 gap on the mzML path too. Plus: mapping unit tests (map_dissociation/map_analyzer, no .NET needed) and updated the two stale comments the review flagged. Default build + thermo build (rustc 1.95) + all tests green.
Native Thermo .raw input (M1–M3) + self-contained cross-platform release
Add an optional, off-by-default `timstof` cargo feature that reads native Bruker timsTOF .d (DDA-PASEF) data via the pure-Rust timsrust crate (the same reader Sage uses) and produces the engine's model::Spectrum, so the search path is format-agnostic. A .d is a directory (TDF SQLite + binary blob) read natively with no vendor runtime and nothing to bundle. TimsTofReader opens a .d by path, iterates the centroided MS2 spectra by index (timsrust SpectrumReader::new + len/get), and maps each to a Spectrum: 1-based scan number, precursor m/z/charge/intensity, RT in seconds, isolation center+width to symmetric offsets, and ascending-sorted peaks. Precursor-less or non-positive-m/z spectra are skipped, mirroring the mzML/Thermo readers. The dep and all .d code sit behind cfg(feature = timstof); default builds read mzML/MGF only and never pull in timsrust.
Detect a .d input by path extension (works on the .d directory) and add is_d alongside is_mzml; is_mgf now means 'neither mzML nor .d'. In the non-chimeric streaming reader, route .d to TimsTofReader (feature-gated), with a clear error when the binary is built without --features timstof. Chimeric and precursor-calibration on .d are out of scope: the DDA reader exposes no MS1 stream, so --chimeric on .d degrades gracefully to a normal search (like MGF), and --precursor-cal is skipped with a warning. The TSV writer uses its scan-based (non-MGF) path for .d since the reader emits real scan numbers.
Add the design doc (approach, why timsrust, scope/out-of-scope, PXD072598 benchmark dataset, the not-built-locally note), a feature-gated integration test that opens a real .d only when MSGF_TEST_D points at one (no-op otherwise so CI stays green), and a README 'Reading Bruker timsTOF .d files' section plus the --spectrum flag update.
Integrate the timsTOF .d input alongside the now-merged Thermo .raw feature. Resolved the shared dispatch/feature files: both 'thermo' and 'timstof' cargo features coexist; the binary's format dispatch chains is_mzml / is_raw / is_d / MGF, is_mgf excludes all native formats, --chimeric covers mzML+.raw (.d degrades like MGF), --precursor-cal skips .raw and .d, and the README lists all four input formats. Default + --features timstof builds green.
Native Bruker timsTOF .d input (timstof feature)
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
|
Warning Review limit reached
More reviews will be available in 19 minutes and 25 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThis PR implements native Thermo ChangesChimeric cascade and native input integration
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related PRs
Poem
✨ Finishing Touches🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Pull request overview
This PR lands three major opt-in capabilities together: a chimeric two-pass cascade search (--chimeric), native Thermo .raw input (feature thermo + bundled .NET 8), and native Bruker timsTOF .d input (feature timstof via the pure-Rust timsrust crate). It also extends the mzML reader to capture MS1 spectra for chimeric, adds several additive Percolator features (PrecursorIsotopeKL, PrecursorSNR, DeltaRawScore), refreshes the README/CLI tables, drops a large volume of historical dev-jargon comments, and adds many design / parity-analysis notes documenting the cascade investigation.
Changes:
- New
--chimericMS1-gated two-pass cascade (search + coisolation + averagine isotope features), new additive PIN columns, and aforce_pushqueue API for secondary PSMs. - Native
.raw(feature-gated, .NET 8) and.d(feature-gated, pure-Rusttimsrust) input backends with extension-based dispatch and updated release CI bundling a self-contained .NET runtime. - Spectrum model carries isolation-window offsets; mzML reader emits
Ms1Link; PSM model addsprecursor_mz_overrideso secondaries report their own ExpMass/dm.
Reviewed changes
Copilot reviewed 83 out of 85 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| README.md | Documents .raw / .d / --chimeric, restructures CLI flags into required/optional tables. |
| docs/superpowers/specs/, docs/design/, docs/parity-analysis/notes/* | New design specs, plans, and bench notes for chimeric, fragment-index investigation, .raw, .d, native rescoring pipeline. |
| .claude/CLAUDE.md | Updates "next planned work" with cascade PR + abandoned frag-index note. |
| .github/workflows/release.yml | Adds --features thermo builds and bundles .NET 8 runtime per-target. |
| crates/input/Cargo.toml, src/lib.rs | Adds optional thermo / timstof features and re-exports. |
| crates/input/src/timstof.rs (+ test) | New pure-Rust Bruker .d reader via timsrust. |
| crates/input/THERMO_LICENSE.txt, tests/thermo_raw.rs | Thermo license notice and live .raw integration test (env-gated). |
| crates/input/src/mgf.rs | Fills in new isolation-window fields as None. |
| crates/model/src/spectrum.rs | Adds isolation_lower_offset / isolation_upper_offset. |
| crates/model/src/isotope.rs (+ lib.rs) | New averagine isotope envelope. |
| crates/model/src/{tolerance, aa_set}.rs | Comment cleanups. |
| crates/scoring/src/gf/{generating_function, score_dist, primitive_graph, group}.rs | Adds DROPPED_NODES counter, get_probability OOB→0.0 guard, comment cleanup, test fixture updates. |
| crates/scoring/src/scoring/{rank_scorer, psm_score}.rs | Comment cleanup; test fixture additions for new spectrum fields. |
| crates/scoring/tests/* | Fixture updates for the new spectrum fields. |
| crates/search/Cargo.toml | Promotes input from dev-deps to deps (needed by new modules). |
| crates/search/src/lib.rs | Wires coisolation and chimeric_features modules and re-exports run_pass2_coisolation. |
| crates/search/src/chimeric_features.rs | Observed-vs-averagine KL + SNR scoring. |
| crates/search/src/precursor_matching.rs | New matches_isolation_window for chimeric candidate gating. |
| crates/search/src/psm.rs | New PsmFeatures (KL/SNR/DeltaRawScore), precursor_mz_override, force_push, comment cleanup. |
| crates/search/src/{search_params, mass_calibrator, precursor_cal, search_index, candidate_gen}.rs | New chimeric / chimeric_isolation_halfwidth_da params + fixture updates + comment cleanup. |
| crates/search/tests/* | Fixture updates for the new fields. |
| crates/output/src/tsv.rs | Uses precursor_mz_override for ExpMass/dm in chimeric secondaries. |
| crates/output/tests/output_pin_schema_parity.rs | Updated to expect 4 additive Rust-only columns. |
| crates/msgf-rust/Cargo.toml | Adds thermo / timstof features. |
| crates/msgf-rust/src/bin/msgf-trace.rs | Uses rank_score (node+cleavage+edge) for the trace's score threshold to align with the production SpecEValue path. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Actionable comments posted: 7
🧹 Nitpick comments (9)
crates/search/src/coisolation.rs (1)
164-177: ⚖️ Poor tradeoffConsider grouping parameters into a context struct.
search_secondaryhas 12 parameters, which exceeds typical guidelines and hurts readability. The scoring/search context parameters (scorer,aa_set,enzyme,params,search_index,fragment_tolerance_da) could be grouped into aSecondarySearchContextstruct to reduce the parameter count to ~6-7.♻️ Example refactor
pub(crate) struct SecondarySearchContext<'a> { pub scorer: &'a RankScorer, pub aa_set: &'a AminoAcidSet, pub enzyme: Option<Enzyme>, pub params: &'a SearchParams, pub search_index: &'a SearchIndex, pub fragment_tolerance_da: f64, } pub(crate) fn search_secondary( spec: &Spectrum, primary: &Peptide, prior_claimed: &std::collections::HashSet<i64>, co: CoIsolated, candidates: &[Candidate], bucket_index: &BTreeMap<i32, Vec<usize>>, ctx: &SecondarySearchContext, ) -> Option<(PsmMatch, std::collections::HashSet<i64>)> { // Use ctx.scorer, ctx.aa_set, etc. }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/search/src/coisolation.rs` around lines 164 - 177, The function search_secondary currently takes many related scoring/search params; refactor by introducing a SecondarySearchContext struct containing scorer, aa_set, enzyme, params, search_index, and fragment_tolerance_da, change search_secondary's signature to accept ctx: &SecondarySearchContext instead of those individual parameters, update the body of search_secondary to use ctx.scorer, ctx.aa_set, ctx.enzyme, ctx.params, ctx.search_index, and ctx.fragment_tolerance_da, and update all call sites to construct and pass a SecondarySearchContext instance when invoking search_secondary.crates/search/src/chimeric_features.rs (1)
117-133: 💤 Low valueConsider explicit NaN handling in median calculation.
The function uses
partial_cmp(...).unwrap_or(Equal)which treats NaN as equal to any value during sorting, potentially placing NaNs unpredictably. While upstream parsers likely sanitize intensities, an explicit.filter(|&i| !i.is_nan())after line 121 would make this more robust.♻️ Optional defensive fix
fn median_nonzero_intensity(ms1_peaks: &[(f64, f32)]) -> Option<f64> { let mut nz: Vec<f64> = ms1_peaks .iter() .map(|&(_, i)| i as f64) .filter(|&i| i > 0.0) + .filter(|&i| !i.is_nan()) .collect();🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/search/src/chimeric_features.rs` around lines 117 - 133, The median_nonzero_intensity function currently sorts intensities using partial_cmp and can mishandle NaNs; update the data preparation for nz in median_nonzero_intensity to explicitly filter out NaN values (e.g., after mapping intensities, add a .filter(|&i| i > 0.0 && !i.is_nan()) or equivalent) so that NaNs are excluded before sorting and median computation, ensuring stable ordering and correct median results.crates/search/src/precursor_matching.rs (1)
65-110: 💤 Low valueClarify the ppm error computation in the comment.
Lines 74-75 state that
mass_error_*is "measured against the nearest in-window neutral mass (clamped)", but the implementation on line 102 computesmass_error_ppm = mass_error_da / peptide_mass * 1e6usingpeptide_massas the denominator, notnearest.While the Da error is indeed measured against
nearest(line 100:mass_error_da = peptide_mass - nearest), the ppm error is computed relative to the peptide's own mass. This is different from the originalmatches_precursor(line 50) which uses the spectrum neutral mass as the denominator.This appears intentional for chimeric matching where the error metric should be peptide-centric, and the implementation is correct. However, the comment could be more precise.
📝 Suggested comment clarification
-/// `lo_mz` / `hi_mz` are the isolation-window m/z bounds (selected m/z minus the -/// lower offset .. plus the upper offset). The reported `mass_error_*` is -/// measured against the nearest in-window neutral mass (clamped), so a peptide -/// matching anywhere inside the window reports a near-zero error rather than a -/// large error against the selected precursor. +/// `lo_mz` / `hi_mz` are the isolation-window m/z bounds (selected m/z minus the +/// lower offset .. plus the upper offset). The reported `mass_error_da` is +/// measured against the nearest in-window neutral mass (clamped); `mass_error_ppm` +/// is computed relative to the peptide's mass. A peptide matching anywhere inside +/// the window reports a near-zero error rather than a large error against the +/// selected precursor.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/search/src/precursor_matching.rs` around lines 65 - 110, The doc comment for matches_isolation_window is ambiguous about how mass_error_* are computed; update it to state that mass_error_da is measured against the nearest in-window neutral mass (nearest) while mass_error_ppm is computed relative to the peptide mass (peptide_mass) (i.e., mass_error_ppm = mass_error_da / peptide_mass * 1e6), so readers know this function uses a peptide-centric ppm metric instead of the spectrum neutral mass as in matches_precursor.crates/scoring/src/gf/score_dist.rs (1)
79-95: 💤 Low valueDefensive bounds check masks potential caller bugs.
Returning
0.0whenidx >= p.len()prevents panics but silently handles scores outside the allocated range. The comment notes "callers normally guardscore >= max_scorethemselves," which suggests this case shouldn't occur in correct usage. Consider whether adebug_assert!(idx < p.len(), "score {score} out of range [{}, {})", min_score, max_score)would help catch caller bugs during development while keeping the defensive return in release builds.The current defensive approach prioritizes robustness, which is reasonable for production code, but the tradeoff is that it may hide logic errors.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/scoring/src/gf/score_dist.rs` around lines 79 - 95, Add a debug-only assertion to detect out-of-range callers in get_probability while preserving the defensive 0.0 fallback: compute the index as currently done, then call debug_assert!(idx < p.len(), "score {} out of range [{}, {})", score, self.bound.min_score, self.bound.max_score) (referencing get_probability, self.prob_distribution, and self.bound.min_score/max_score) before the existing if idx >= p.len() check so development builds will surface caller bugs but release behavior still returns 0.0.docs/superpowers/plans/2026-05-30-chimeric-sage-style-fragment-index.md (1)
314-340: ⚡ Quick winVerify the neutral-mass window derivation.
Lines 322-326 note that
candidate_nominal_boundsreturns nominal mass, and the code divides byINTEGER_MASS_SCALERto convert to neutral mass. The comment explicitly flags this as an integration subtlety requiring verification. The implementer must confirm the correct conversion or derive the window directly fromprecursor_mzand charge (as suggested in the NOTE at line 342). The Task 4 recall gate will catch a mismatch, but catching it earlier during local testing would be better.Consider adding a small local test in Step 3 that verifies the window derivation matches
window_cand_indiceson a known example before running the full VM gate.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/superpowers/plans/2026-05-30-chimeric-sage-style-fragment-index.md` around lines 314 - 340, The neutral-mass window conversion used when replacing the `cand_iter` arm in `run_chunk_inner` may be incorrect: confirm that `candidate_nominal_bounds(spec, z, params, shift_ppm)` returns nominal (integer) mass and that dividing by `model::mass::INTEGER_MASS_SCALER` yields the same neutral-mass range used by `window_cand_indices`; if not, compute the window directly from `precursor_mz` and charge (or adjust the conversion) so `lo`/`hi` match `window_cand_indices` semantics. Update the `Some(si)` chimeric path (the block where you call `si.query(...)` using `scored_spec_for_charge(z).dump_active_peaks()` and `scorer.param().data_type.instrument.is_high_resolution()` to pick tol) to use the corrected neutral-mass bounds, then add a small unit/local test that compares the derived `(lo,hi)` vs the bounds implied by `window_cand_indices` for a known example to catch mismatches before the VM gate.crates/scoring/src/gf/generating_function.rs (1)
22-33: 💤 Low valueProcess-global counter accumulates across all GF computations.
The
DROPPED_NODEScounter is never reset and accumulates across the entire process lifetime. In long-running processes or test suites that invoke GF computation many times, the count will grow unboundedly. This is documented as intentional ("cumulative count since process start"), but consider whether per-computation counts (returned fromcomputeor tracked perGeneratingFunctioninstance) would be more useful for diagnosing individual problematic graphs.If the cumulative global count is the desired semantic (e.g., for process-level health metrics), this is fine as-is.
crates/search/src/match_engine.rs (2)
956-956: ⚡ Quick winInconsistent HashSet usage: prefer
FxHashSetfor performance.Line 956 uses
std::collections::HashSetwhile the rest of the file usesrustc_hash::FxHashSet(imported at line 31) for fast, non-cryptographic hashing. Theclaimedset is used in a per-scan loop that could run frequently under--chimeric.For consistency and performance, change to
FxHashSet.♻️ Proposed fix
- let mut claimed: std::collections::HashSet<i64> = std::collections::HashSet::new(); + let mut claimed: FxHashSet<i64> = FxHashSet::default();🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/search/src/match_engine.rs` at line 956, The local set named `claimed` (created for the per-scan loop iterating `for co in cos`) is currently a std::collections::HashSet; replace it with rustc_hash::FxHashSet to match the rest of the file and gain faster non-cryptographic hashing: update the type used where `claimed` is declared/initialized to `FxHashSet`, ensure the `FxHashSet` symbol already imported (it is imported earlier in the file), and keep the same element type and usage so the surrounding logic in the loop (`for co in cos`, insertions, lookups, etc.) continues to compile and behave identically.
747-776: ⚖️ Poor tradeoffEnv-gated diagnostic has high complexity; consider optimization or clearer safety note.
The chimeric overlap diagnostic clones the entire queue, builds peptide sequences for the top-2 distinct peptides, and computes set intersections on matched peak keys. While double-gated (
chim_overlapenv var +params.chimeric), this is still expensive if accidentally enabled.Consider either:
- Adding a compile-time
#[cfg(feature = "diagnostics")]gate so it's truly zero-cost in release builds, or- Adding a comment at the diagnostic's definition warning maintainers this is research-only and should not be enabled in production pipelines.
The diagnostic is valuable for research (as documented in the bench notes), but the current env-var-only gate means a mistyped environment variable could degrade production performance.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/search/src/match_engine.rs` around lines 747 - 776, The diagnostic block guarded by chim_overlap and params.chimeric in match_engine.rs performs an expensive clone of queue and heavy set operations (using queue.clone(), building peptide sequences from candidates, and matched_peak_keys/scored_spec_for_charge) and should be made zero-cost in non-research builds; either add a compile-time gate (e.g., wrap the entire block with #[cfg(feature = "diagnostics")] so it is compiled out unless the diagnostics feature is enabled) or, if you prefer to keep it runtime-gated, add a prominent comment above the chim_overlap diagnostic explaining it is research-only, expensive, and must not be enabled in production (mentioning the symbols queue.clone(), matched_peak_keys, scored_spec_for_charge, and params.chimeric so maintainers can find it).docs/parity-analysis/notes/2026-05-28-chimeric-fragment-overlap-diagnostic.md (1)
50-55: 💤 Low valueAdd language specifier to the fenced code block.
The shell command block lacks a language identifier. Add
shellafter the opening backticks for proper syntax highlighting.📝 Proposed fix
-``` +```shell MSGF_CHIMERIC_OVERLAP=1 <astral chimeric run> 2> astral-overlap.log🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/parity-analysis/notes/2026-05-28-chimeric-fragment-overlap-diagnostic.md` around lines 50 - 55, The fenced code block containing the shell snippet (starting with MSGF_CHIMERIC_OVERLAP=1 and referencing astral-overlap.log) lacks a language identifier; update the opening triple backticks to include "shell" (i.e., ```shell) so the block is properly syntax-highlighted, leaving the inner lines unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/input/Cargo.toml`:
- Around line 15-29: The Cargo.toml pins an unavailable thermorawfilereader
version (thermorawfilereader = "0.7.0") which will break resolution; update the
dependency entry for thermorawfilereader to a published version (e.g., match
crates.io latest such as 0.6.0) or vendor the crate, and ensure the thermo
feature (thermo = ["dep:thermorawfilereader"]) still references that updated
dependency; also verify or document any MSRV/rustc requirement for the
thermo/timstof features if the chosen version imposes a minimum compiler
version.
In `@crates/input/src/timstof.rs`:
- Around line 160-168: Validate that raw.mz_values.len() ==
raw.intensities.len() and return an error (or propagate a Result) if they
differ, then build peaks by iterating indices and filter out any pairs where mz
is not finite (f64::is_finite) or intensity is not finite after casting
(f32::is_finite) before collecting into peaks; finally keep the existing check
using peaks.windows(2) and peaks.sort_by(...). Use the existing symbols peaks,
raw.mz_values, raw.intensities, and the sort_by/windows logic so you drop
non-finite points early and fail fast on malformed parallel arrays.
In `@crates/search/src/match_engine.rs`:
- Around line 702-736: The block guarded by the hardcoded constant
CASCADE_SKIP_MS1_FEATURE is dead code; replace the compile-time constant with a
runtime flag (e.g., add a boolean field
params.compute_precursor_isotope_features) and use that flag to conditionally
run the MS1 precursor isotope-envelope computation (the existing logic that
checks ms1_link, ms2_to_ms1, ms1_peaks, computes z/theo_mono_mz/tol_da and calls
precursor_isotope_match and assigns features.precursor_isotope_kl and
features.precursor_snr). Remove the CASCADE_SKIP_MS1_FEATURE constant, thread
the new params flag through callers as needed, and keep the rest of the existing
precursor_isotope_match flow unchanged so the feature can be toggled at runtime
for experiments.
In `@docs/parity-analysis/notes/2026-05-30-chimeric-cost-profile.md`:
- Around line 7-10: The Markdown heading "Self-time (top)" is immediately
followed by a table which triggers MD058; insert a single blank line between the
heading line ("Self-time (top)") and the table header row ("| function | self% |
bucket |") so the heading is separated from the table and the linter passes.
In
`@docs/superpowers/specs/2026-05-29-chimeric-fragment-index-prefilter-design.md`:
- Around line 103-109: The markdown table starting with the header row "| Risk |
Mitigation |" is missing surrounding blank lines and triggers MD058; update the
document (the table block shown in the diff) by inserting a blank line both
immediately before the table header and immediately after the final table row so
the table is separated from adjacent paragraphs/blocks, which will satisfy
markdownlint.
In
`@docs/superpowers/specs/2026-05-30-chimeric-sage-style-fragment-index-design.md`:
- Around line 105-106: Insert a single blank line between the "## Risks &
mitigations" heading and the table that follows so the Markdown table does not
immediately follow the heading (fix the markdownlint MD058 warning); locate the
"## Risks & mitigations" heading in the document and add one empty line before
the pipe-delimited table starting with "| Risk | Mitigation |".
In `@docs/superpowers/specs/2026-05-30-chimeric-two-pass-cascade-design.md`:
- Around line 93-94: Add a blank line between the "## Risks & mitigations"
heading and the table that immediately follows so the Markdown renders correctly
(fix MD058). Locate the heading text "## Risks & mitigations" and insert an
empty line before the table rows starting with "| Risk | Mitigation |".
---
Nitpick comments:
In `@crates/scoring/src/gf/score_dist.rs`:
- Around line 79-95: Add a debug-only assertion to detect out-of-range callers
in get_probability while preserving the defensive 0.0 fallback: compute the
index as currently done, then call debug_assert!(idx < p.len(), "score {} out of
range [{}, {})", score, self.bound.min_score, self.bound.max_score) (referencing
get_probability, self.prob_distribution, and self.bound.min_score/max_score)
before the existing if idx >= p.len() check so development builds will surface
caller bugs but release behavior still returns 0.0.
In `@crates/search/src/chimeric_features.rs`:
- Around line 117-133: The median_nonzero_intensity function currently sorts
intensities using partial_cmp and can mishandle NaNs; update the data
preparation for nz in median_nonzero_intensity to explicitly filter out NaN
values (e.g., after mapping intensities, add a .filter(|&i| i > 0.0 &&
!i.is_nan()) or equivalent) so that NaNs are excluded before sorting and median
computation, ensuring stable ordering and correct median results.
In `@crates/search/src/coisolation.rs`:
- Around line 164-177: The function search_secondary currently takes many
related scoring/search params; refactor by introducing a SecondarySearchContext
struct containing scorer, aa_set, enzyme, params, search_index, and
fragment_tolerance_da, change search_secondary's signature to accept ctx:
&SecondarySearchContext instead of those individual parameters, update the body
of search_secondary to use ctx.scorer, ctx.aa_set, ctx.enzyme, ctx.params,
ctx.search_index, and ctx.fragment_tolerance_da, and update all call sites to
construct and pass a SecondarySearchContext instance when invoking
search_secondary.
In `@crates/search/src/match_engine.rs`:
- Line 956: The local set named `claimed` (created for the per-scan loop
iterating `for co in cos`) is currently a std::collections::HashSet; replace it
with rustc_hash::FxHashSet to match the rest of the file and gain faster
non-cryptographic hashing: update the type used where `claimed` is
declared/initialized to `FxHashSet`, ensure the `FxHashSet` symbol already
imported (it is imported earlier in the file), and keep the same element type
and usage so the surrounding logic in the loop (`for co in cos`, insertions,
lookups, etc.) continues to compile and behave identically.
- Around line 747-776: The diagnostic block guarded by chim_overlap and
params.chimeric in match_engine.rs performs an expensive clone of queue and
heavy set operations (using queue.clone(), building peptide sequences from
candidates, and matched_peak_keys/scored_spec_for_charge) and should be made
zero-cost in non-research builds; either add a compile-time gate (e.g., wrap the
entire block with #[cfg(feature = "diagnostics")] so it is compiled out unless
the diagnostics feature is enabled) or, if you prefer to keep it runtime-gated,
add a prominent comment above the chim_overlap diagnostic explaining it is
research-only, expensive, and must not be enabled in production (mentioning the
symbols queue.clone(), matched_peak_keys, scored_spec_for_charge, and
params.chimeric so maintainers can find it).
In `@crates/search/src/precursor_matching.rs`:
- Around line 65-110: The doc comment for matches_isolation_window is ambiguous
about how mass_error_* are computed; update it to state that mass_error_da is
measured against the nearest in-window neutral mass (nearest) while
mass_error_ppm is computed relative to the peptide mass (peptide_mass) (i.e.,
mass_error_ppm = mass_error_da / peptide_mass * 1e6), so readers know this
function uses a peptide-centric ppm metric instead of the spectrum neutral mass
as in matches_precursor.
In
`@docs/parity-analysis/notes/2026-05-28-chimeric-fragment-overlap-diagnostic.md`:
- Around line 50-55: The fenced code block containing the shell snippet
(starting with MSGF_CHIMERIC_OVERLAP=1 and referencing astral-overlap.log) lacks
a language identifier; update the opening triple backticks to include "shell"
(i.e., ```shell) so the block is properly syntax-highlighted, leaving the inner
lines unchanged.
In `@docs/superpowers/plans/2026-05-30-chimeric-sage-style-fragment-index.md`:
- Around line 314-340: The neutral-mass window conversion used when replacing
the `cand_iter` arm in `run_chunk_inner` may be incorrect: confirm that
`candidate_nominal_bounds(spec, z, params, shift_ppm)` returns nominal (integer)
mass and that dividing by `model::mass::INTEGER_MASS_SCALER` yields the same
neutral-mass range used by `window_cand_indices`; if not, compute the window
directly from `precursor_mz` and charge (or adjust the conversion) so `lo`/`hi`
match `window_cand_indices` semantics. Update the `Some(si)` chimeric path (the
block where you call `si.query(...)` using
`scored_spec_for_charge(z).dump_active_peaks()` and
`scorer.param().data_type.instrument.is_high_resolution()` to pick tol) to use
the corrected neutral-mass bounds, then add a small unit/local test that
compares the derived `(lo,hi)` vs the bounds implied by `window_cand_indices`
for a known example to catch mismatches before the VM gate.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 17d7e6db-037a-43e5-91f2-0138e0ab628c
⛔ Files ignored due to path filters (2)
Cargo.lockis excluded by!**/*.locktest-fixtures/parity/goldens/precursor_cal_off.tsvis excluded by!**/*.tsv
📒 Files selected for processing (83)
.claude/CLAUDE.md.github/workflows/release.ymlREADME.mdcrates/input/Cargo.tomlcrates/input/THERMO_LICENSE.txtcrates/input/src/lib.rscrates/input/src/mgf.rscrates/input/src/mzml.rscrates/input/src/thermo.rscrates/input/src/timstof.rscrates/input/tests/thermo_raw.rscrates/input/tests/timstof_d_loads.rscrates/model/src/aa_set.rscrates/model/src/isotope.rscrates/model/src/lib.rscrates/model/src/spectrum.rscrates/model/src/tolerance.rscrates/msgf-rust/Cargo.tomlcrates/msgf-rust/src/bin/msgf-rust.rscrates/msgf-rust/src/bin/msgf-trace.rscrates/output/src/pin.rscrates/output/src/tsv.rscrates/output/tests/output_pin_schema_parity.rscrates/scoring/src/gf/generating_function.rscrates/scoring/src/gf/group.rscrates/scoring/src/gf/primitive_graph.rscrates/scoring/src/gf/score_dist.rscrates/scoring/src/scoring/psm_score.rscrates/scoring/src/scoring/rank_scorer.rscrates/scoring/src/scoring/scored_spectrum.rscrates/scoring/tests/gf_graph_dp.rscrates/scoring/tests/primitive_graph_arena_parity.rscrates/search/Cargo.tomlcrates/search/src/candidate_gen.rscrates/search/src/chimeric_features.rscrates/search/src/coisolation.rscrates/search/src/lib.rscrates/search/src/mass_calibrator.rscrates/search/src/match_engine.rscrates/search/src/precursor_cal.rscrates/search/src/precursor_matching.rscrates/search/src/psm.rscrates/search/src/search_index.rscrates/search/src/search_params.rscrates/search/tests/mass_calibrator_integration.rscrates/search/tests/match_engine_java_parity.rscrates/search/tests/match_engine_smoke.rscrates/search/tests/match_engine_specevalue.rscrates/search/tests/precursor_matching.rsdocs/2026-05-28-psm-gain-state-and-roadmap.mddocs/design/2026-06-01-thermo-raw-input.mddocs/design/2026-06-01-timstof-d-input.mddocs/parity-analysis/notes/2026-05-28-SESSION-HANDOFF.mddocs/parity-analysis/notes/2026-05-28-chimeric-fragment-overlap-diagnostic.mddocs/parity-analysis/notes/2026-05-28-chimeric-phase1-bench.mddocs/parity-analysis/notes/2026-05-28-chimeric-phase2-bench.mddocs/parity-analysis/notes/2026-05-29-chimeric-full-review-and-rethink.mddocs/parity-analysis/notes/2026-05-29-chimeric-phase3-bench-canary-fails.mddocs/parity-analysis/notes/2026-05-29-entrapment-fdp-reversal.mddocs/parity-analysis/notes/2026-05-29-gate-chimeric-norescore-vs-java.mddocs/parity-analysis/notes/2026-05-29-rank-stratified-fdr-bench.mddocs/parity-analysis/notes/2026-05-30-cascade-astral-breakthrough.mddocs/parity-analysis/notes/2026-05-30-chimeric-cost-profile.mddocs/parity-analysis/notes/2026-05-30-frag-index-pxd-fails-lowres.mddocs/parity-analysis/notes/2026-05-30-sage-index-astral-and-chimeric-speed-conclusion.mddocs/parity-analysis/notes/2026-05-30-sage-index-pxd-gate.mddocs/parity-analysis/notes/2026-05-31-cascade-optimized-multidataset-summary.mddocs/parity-analysis/notes/2026-05-31-tmt-gap-diagnosis-not-gf-bug.mddocs/parity-analysis/notes/2026-06-01-p0-parity-audit-bench.mddocs/superpowers/plans/2026-05-28-chimeric-dda-plus-phase1-plan.mddocs/superpowers/plans/2026-05-28-chimeric-dda-plus-phase2-plan.mddocs/superpowers/plans/2026-05-29-chimeric-fragment-index-prefilter.mddocs/superpowers/plans/2026-05-30-chimeric-sage-style-fragment-index.mddocs/superpowers/plans/2026-05-30-chimeric-two-pass-cascade.mddocs/superpowers/specs/2026-05-28-chimeric-dda-plus-integration-design.mddocs/superpowers/specs/2026-05-29-chimeric-fragment-index-prefilter-design.mddocs/superpowers/specs/2026-05-29-chimeric-phase3-shared-fragment-design.mddocs/superpowers/specs/2026-05-29-chimeric-rank-stratified-fdr-design.mddocs/superpowers/specs/2026-05-29-ms2rescore-entrapment-fdp-proof.mddocs/superpowers/specs/2026-05-30-chimeric-sage-style-fragment-index-design.mddocs/superpowers/specs/2026-05-30-chimeric-two-pass-cascade-design.mddocs/superpowers/specs/2026-05-31-native-rescoring-pipeline-design.mdtest-fixtures/parity/goldens/precursor_cal_off.pin
| thermorawfilereader = { version = "0.7.0", optional = true } | ||
| timsrust = { version = "0.4.2", optional = true } | ||
|
|
||
| [features] | ||
| # Native Thermo .raw reading via the official RawFileReader (hosted .NET 8 | ||
| # runtime, FlatBuffers). Optional + off-by-default at the crate level; the | ||
| # msgf-rust binary turns it on by default. Builds without the .NET SDK | ||
| # (assemblies are vendored); only opening a `.raw` needs the .NET 8 runtime. | ||
| thermo = ["dep:thermorawfilereader"] | ||
| # Native Bruker timsTOF `.d` reading via the pure-Rust `timsrust` crate (the | ||
| # same reader Sage uses). Optional + off-by-default. NO vendor runtime and NO | ||
| # bundling — a `.d` is read natively (TDF SQLite + binary blob). Default builds | ||
| # stay pure and read mzML/MGF only; the optional dep tree only compiles under | ||
| # this feature. | ||
| timstof = ["dep:timsrust"] |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify the pinned versions exist and are not yanked, plus their rust-version (MSRV).
for crate in thermorawfilereader timsrust; do
echo "== $crate =="
curl -s "https://crates.io/api/v1/crates/$crate" | jq -r '.versions[] | "\(.num)\tyanked=\(.yanked)\trust_version=\(.rust_version)"' | head -20
doneRepository: bigbio/msgf-rust
Length of output: 235
🌐 Web query:
crates.io thermorawfilereader 0.7.0 rust_version yanked
💡 Result:
Crate: thermorawfilereader Version: 0.7.0 Yanked: Not confirmed as yanked (no authoritative source found in this research showing that 0.7.0 is yanked). [1] Rust version / MSRV ("rust_version"): Not found in the sources retrieved here. (Docs.rs page shown for 0.7.0 doesn’t expose MSRV/rust_version in the captured snippets.) [1] What I did find: the crate currently shows version history listing 0.6.0 as the latest on crates.io, and a Docs.rs source page indicates there is a 0.7.0 build available on Docs.rs. [2][1]
Citations:
- 1: https://docs.rs/crate/thermorawfilereader/latest/source/
- 2: https://crates.io/crates/thermorawfilereader
🌐 Web query:
crates.io timsrust 0.4.2 rust_version yanked
💡 Result:
Crates.io page for timsrust 0.4.2 does not indicate that version 0.4.2 is yanked; it shows a published release dated 2024-12-19.[1] The most authoritative sources found for this specific version (crates.io and docs.rs) only provide release metadata and do not label 0.4.2 as yanked.[1][2] What I could not confirm from the sources retrieved: an explicit “yanked” flag or yanked reason for timsrust 0.4.2 on crates.io. If you want, I can do a second targeted search specifically for the crates.io “yanked” field/state for that exact version.
Citations:
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check repo Cargo.toml context for any rust-version/MSRV mention
echo "== crates/input/Cargo.toml =="
rg -n "rust-version|package\.rust-version|MSRV" crates/input/Cargo.toml || true
sed -n '1,120p' crates/input/Cargo.toml | nl -ba | sed -n '1,120p' || true
# Query crates.io API for exact versions to get yanked + rust_version
for pair in "thermorawfilereader 0.7.0" "timsrust 0.4.2"; do
crate=$(echo "$pair" | awk '{print $1}')
ver=$(echo "$pair" | awk '{print $2}')
echo "== crates.io api: $crate $ver =="
url="https://crates.io/api/v1/crates/${crate}/${ver}"
# Print the key fields if present
curl -s "$url" | jq -r '
if .version == null then "version: null (unexpected response)" else
"yanked=\(.version.yanked // "null")\nrust_version=\(.version.rust_version // "null")\nnum=\(.version.num // "null")"
end
'
doneRepository: bigbio/msgf-rust
Length of output: 336
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== crates/input/Cargo.toml (first 80 lines with numbers) =="
# nl may not exist; use cat -n
cat -n crates/input/Cargo.toml | sed -n '1,80p'
for pair in "thermorawfilereader 0.7.0" "timsrust 0.4.2"; do
crate=$(echo "$pair" | awk '{print $1}')
ver=$(echo "$pair" | awk '{print $2}')
url="https://crates.io/api/v1/crates/${crate}/${ver}"
echo
echo "== crates.io api status: $crate $ver =="
# Show HTTP status and content-type
curl -sS -L -o /tmp/crates_resp.json -w "HTTP=%{http_code} CT=%{content_type}\n" "$url" || true
echo "== crates.io api preview (first 300 bytes) =="
head -c 300 /tmp/crates_resp.json | cat
echo
echo "== crates.io api parsed fields (if JSON) =="
# Try parsing only if it looks like JSON
if head -c 1 /tmp/crates_resp.json | grep -q '{'; then
jq -r '
"num=\(.version.num // "null")\nyanked=\((.version.yanked|tostring) // "null")\nrust_version=\(.version.rust_version // "null")"
' /tmp/crates_resp.json || true
else
echo "Response not JSON; skipping jq."
fi
doneRepository: bigbio/msgf-rust
Length of output: 2443
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "rust-version|MSRV|1\.88|rustc.*1\.88" Cargo.toml **/Cargo.toml . || trueRepository: bigbio/msgf-rust
Length of output: 50445
🌐 Web query:
docs.rs thermorawfilereader 0.7.0 Minimum Supported Rust Version
💡 Result:
As of June 1, 2026, there is no public version 0.7.0 of the thermorawfilereader crate available on crates.io or docs.rs [1][2]. The latest released version of the thermorawfilereader crate is 0.6.0, published on March 31, 2026 [1]. Because version 0.7.0 does not exist, there is no defined Minimum Supported Rust Version (MSRV) for it [1]. For the existing 0.6.0 version, the crate documentation and metadata do not explicitly specify a formal MSRV [1]. The crate relies on the.NET 8 runtime to function, which is a primary prerequisite for its operation [1][3].
Citations:
- 1: https://crates.io/crates/thermorawfilereader
- 2: https://docs.rs/crate/thermorawfilereader/0.2.7
- 3: https://github.com/mobiusklein/thermorawfilereader.rs
🌐 Web query:
docs.rs timsrust 0.4.2 Minimum Supported Rust Version
💡 Result:
The timsrust crate version 0.4.2 does not explicitly specify a "rust-version" (Minimum Supported Rust Version or MSRV) in its Cargo.toml file [1]. While the crate uses the 2021 Rust edition [1], it does not impose a stricter lower bound on the compiler version via the Cargo manifest. Users can generally verify the current configuration by checking the official repository's Cargo.toml file [1].
Citations:
🌐 Web query:
crates.io thermorawfilereader 0.7.0 yanked
💡 Result:
Result: thermorawfilereader 0.7.0 does not appear to be yanked on crates.io; in fact, the crates.io crate page shows the latest published version as 0.6.0, with no 0.7.0 listed in its version history. [1] Evidence details: - The crates.io entry for thermorawfilereader lists “Latest version: 0.6.0” and shows version history entries including 0.6.0 (2026-03-31), 0.5.5, 0.5.4, etc., but not 0.7.0. [1] - The docs.rs page for “thermorawfilereader 0.7.0” is accessible via a “latest/source” URL, but it doesn’t provide any crates.io “yanked” indicator for version 0.7.0, and (critically) crates.io’s version list for the crate doesn’t show 0.7.0 as a published version to be yanked. [2][1] If you intended a different version (e.g., 0.6.0 or one of the earlier versions), tell me which version string you care about and I’ll re-check its crates.io yanked status.
Citations:
- 1: https://crates.io/crates/thermorawfilereader
- 2: https://docs.rs/crate/thermorawfilereader/latest/source/
🌐 Web query:
crates.io timsrust 0.4.2 yanked
💡 Result:
For timsrust 0.4.2, I could not find any evidence that the version is yanked. The crates.io page for timsrust 0.4.2 loads as a normal published version, listing it as the latest version and showing the usual metadata for that release (e.g., downloads, created/updated dates) rather than indicating a yank status [1]. The Docs.rs entry for the same version also exists (docs build for 0.4.2), which is consistent with the crate/version being available normally, not removed via yanking [2]. If you meant “verify via the crates.io yank endpoint/flag,” crates.io’s UI elements for yanking weren’t exposed in the search snippets I retrieved, so I can’t cite a definitive “not yanked” flag value from crates.io itself—only that the version appears published and accessible via crates.io and docs.rs [1][2].
Citations:
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== root Cargo.toml rust-version =="
sed -n '1,40p' Cargo.toml | nl -ba
echo
echo "== README.md MSRV statements around thermo/timstof =="
rg -n "Build needs rustc|timsrust.*rustc|rustc ≥ 1\.88" README.md
sed -n '140,210p' README.md | nl -ba
echo
echo "== docs/design 2026-06-01-timstof-d-input.md relevant snippet =="
sed -n '110,170p' docs/design/2026-06-01-timstof-d-input.md | nl -ba
echo
echo "== occurrences of thermorawfilereader version pin =="
rg -n "thermorawfilereader\s*=\s*\{[^}]*version\s*=\s*\"[0-9]+\.[0-9]+\.[0-9]+\"" -S .Repository: bigbio/msgf-rust
Length of output: 156
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== root Cargo.toml (first 60 lines) =="
cat -n Cargo.toml | sed -n '1,60p'
echo
echo "== README.md MSRV statements around thermo/timstof (match lines) =="
rg -n "Build needs rustc|timsrust.*rustc|rustc ≥ 1\.88|MSRV" README.md || true
echo
echo "== README.md thermo/timstof sections (approx 130-190) =="
sed -n '130,210p' README.md | cat -n
echo
echo "== docs/design/2026-06-01-timstof-d-input.md relevant snippet (110-170) =="
sed -n '110,170p' docs/design/2026-06-01-timstof-d-input.md | cat -n
echo
echo "== occurrences of thermorawfilereader version pins (only Cargo.toml files) =="
rg -n "thermorawfilereader\s*=\s*\{[^}]*version\s*=\s*\"[0-9]+\.[0-9]+\.[0-9]+\"" --glob "**/Cargo.toml" . || true
echo
echo "== occurrences of timsrust version pins (only Cargo.toml files) =="
rg -n "timsrust\s*=\s*\{[^}]*version\s*=\s*\"[0-9]+\.[0-9]+\.[0-9]+\"" --glob "**/Cargo.toml" . || trueRepository: bigbio/msgf-rust
Length of output: 12304
Fix thermorawfilereader pin: 0.7.0 likely doesn’t exist on crates.io, so the dependency may fail to resolve.
crates/input/Cargo.tomlpinsthermorawfilereader = { version = "0.7.0", optional = true }andtimsrust = { version = "0.4.2", optional = true }.- crates.io’s
thermorawfilereaderpage shows latest0.6.0and the retrieved version history does not list0.7.0→ adjust the pin (or vendor) to an actually published version. timsrust 0.4.2appears published with no retrieved evidence of yanking; however, external MSRV metadata for the exact pinned versions couldn’t be pulled here (crates.io API is blocked with 403), so the repo’s documentedrustc ≥ 1.88requirement for the feature build remains repo-asserted, not evidenced from crate metadata.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/input/Cargo.toml` around lines 15 - 29, The Cargo.toml pins an
unavailable thermorawfilereader version (thermorawfilereader = "0.7.0") which
will break resolution; update the dependency entry for thermorawfilereader to a
published version (e.g., match crates.io latest such as 0.6.0) or vendor the
crate, and ensure the thermo feature (thermo = ["dep:thermorawfilereader"])
still references that updated dependency; also verify or document any MSRV/rustc
requirement for the thermo/timstof features if the chosen version imposes a
minimum compiler version.
Code review (+ adversarial pass)Both a 5-agent review and a Codex adversarial review ran against the dev→master diff. No correctness blocker in the default mzML/MGF engine, but the native-format surface has real gaps worth fixing before master:
msgf-rust/crates/msgf-rust/src/bin/msgf-rust.rs Lines 643 to 666 in 37971fa
msgf-rust/.github/workflows/release.yml Lines 40 to 61 in 37971fa
msgf-rust/crates/input/src/thermo.rs Lines 337 to 352 in 37971fa
msgf-rust/crates/msgf-rust/src/bin/msgf-rust.rs Lines 888 to 893 in 37971fa Also noted (cleanup, non-blocking): |
…he public/master tree
Resolves the 3 adversarial-review findings on PR #46: 1. (high) Bruker .d was scored with the Orbitrap HCD_QExactive model. The param auto-router now routes .d (timsTOF DDA-PASEF = beam-type CID on a TOF analyzer) to (CID, TOF) -> CID_TOF_Tryp.param, instead of falling through to the HCD_QExactive default. Override with --fragmentation/--instrument. 2. (medium) Release binaries could not open .d: the release workflow built only --features thermo. The 4 native targets now build --features "thermo timstof" so prebuilt binaries read both .raw (bundled .NET 8) and .d (pure Rust). The linux-arm64 cross target stays plain (timsrust's zstd C dep complicates cross). 3. (medium) Non-chimeric .raw/.d MS-level: --ms-level 3 could search MS3 (e.g. TMT SPS-MS3 reporter scans) and --ms-level 1 yielded an empty run. The native .raw reader is now hardcoded to MS2; a non-2 --ms-level on .raw/.d warns and is ignored (the chimeric path was already MS2-only). --ms-level still applies to mzML. Also fixes the misleading MGF/.d ms-level warning (it claimed .d coverage but its is_mgf guard excluded .d; .d is now covered by the native warning above). Default + --features timstof build green; param-resolver tests pass.
Review findings addressed (971a345)All three adversarial-review findings are fixed (Copilot generated no comments):
Also fixed the misleading MGF/ |
Moves the dev-internal design specs, the Java<->Rust parity-analysis investigation notes, and the dated psm-gain roadmap out of the public repo (preserved locally in the workspace). The repo's docs now carry only user-facing material: README.md, the DOCS.md CLI/column reference manual, and docs/CLI_MIGRATION.md. Removed 42 files (docs/parity-analysis/, docs/design/, docs/2026-05-28-psm-gain-state-and-roadmap.md); neutralized the one DOCS.md link that pointed into parity-analysis. Follows the earlier removal of the internal specs directory.
…current defaults - §1: new Input-formats table (mzML/MGF/.raw/.d, build + runtime reqs); add --precursor-cal, --chimeric, --isolation-halfwidth; fix --charge-max default (2-5) and --ms-level MS2-only wording for native formats - §3a: PIN header now 38 cols (charge2..charge5 default) - §4: native Thermo .raw (vendor metadata) and Bruker .d (CID_TOF_Tryp) routing - §5: feature-gated build commands for thermo/timstof - §8c: input-format row reflects native .raw/.d
Adds docs/benchmarks/ with the 2026-06-01 comparison of Java MS-GF+, Sage, MSFragger, and msgf-rust on vendor-native data, harmonized parameters, uniform Percolator -Y FDR, results (PSMs + distinct peptides + speed + RAM), findings, caveats, and the per-engine config files (Sage JSON, MSFragger params, mods). Linked from README.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
DOCS.md (1)
345-345:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAlign Rust toolchain requirements; current docs are contradictory.
Line 345 says Rust
1.85+(pinned1.87.0), but Line 357 says Thermo builds requirerustc >= 1.88. Please reconcile these into one authoritative minimum (global or feature-specific) and keep the build commands consistent with that requirement.Also applies to: 357-359
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@DOCS.md` at line 345, DOCS.md contains conflicting Rust version requirements (mentions "Rust 1.85+" and that rust-toolchain.toml pins 1.87.0, while the Thermo section requires rustc >= 1.88); reconcile these by choosing a single authoritative minimum Rust version (e.g., set global minimum to 1.88 if Thermo needs it) and update all mentions to that version, update rust-toolchain.toml pin to match the chosen minimum (or note feature-specific higher minimums for Thermo), and ensure all build commands and the Thermo section reference the same version string so the document is consistent.
🧹 Nitpick comments (1)
docs/benchmarks/configs/msfragger-timstof.params (1)
14-15: 💤 Low valueConsider tighter fragment tolerance for modern timsTOF instruments.
The fragment_mass_tolerance is set to 20 ppm, which is conservative for modern timsTOF instruments (many achieve <10 ppm fragment accuracy). While this may be intentional for broader benchmark comparability, tighter tolerances (e.g., 10-15 ppm) could improve discrimination on high-accuracy timsTOF data.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/benchmarks/configs/msfragger-timstof.params` around lines 14 - 15, Update the fragment_mass_tolerance parameter (fragment_mass_tolerance) from the current 20 ppm to a tighter value appropriate for modern timsTOF data (suggest 10–15 ppm); modify the value in the msfragger-timstof.params config so benchmarks on high-accuracy timsTOF runs use the tighter tolerance while optionally documenting the reason for the change in a nearby comment.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/benchmarks/configs/sage-astral.json`:
- Around line 18-19: The fragment ion tolerance ("fragment_tol") in the
sage-astral.json benchmark is set to ±20 ppm which is likely too permissive for
Thermo Orbitrap Astral searches; update the "fragment_tol" value to ~10 ppm
(e.g., 10) to match common practice or, if ±20 ppm is deliberate for this
benchmark, add a brief justification comment in the config/docs explaining why a
wider tolerance is used (refer to the "fragment_tol" key in
docs/benchmarks/configs/sage-astral.json).
---
Outside diff comments:
In `@DOCS.md`:
- Line 345: DOCS.md contains conflicting Rust version requirements (mentions
"Rust 1.85+" and that rust-toolchain.toml pins 1.87.0, while the Thermo section
requires rustc >= 1.88); reconcile these by choosing a single authoritative
minimum Rust version (e.g., set global minimum to 1.88 if Thermo needs it) and
update all mentions to that version, update rust-toolchain.toml pin to match the
chosen minimum (or note feature-specific higher minimums for Thermo), and ensure
all build commands and the Thermo section reference the same version string so
the document is consistent.
---
Nitpick comments:
In `@docs/benchmarks/configs/msfragger-timstof.params`:
- Around line 14-15: Update the fragment_mass_tolerance parameter
(fragment_mass_tolerance) from the current 20 ppm to a tighter value appropriate
for modern timsTOF data (suggest 10–15 ppm); modify the value in the
msfragger-timstof.params config so benchmarks on high-accuracy timsTOF runs use
the tighter tolerance while optionally documenting the reason for the change in
a nearby comment.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 13cf4fe1-5671-42c3-aba6-c1c742d20dea
⛔ Files ignored due to path filters (5)
docs/parity-analysis/notes/score-psm-trace-artifacts/java-trace-scan-16629.log.gzis excluded by!**/*.gzdocs/parity-analysis/notes/score-psm-trace-artifacts/java-trace-scan-23082.log.gzis excluded by!**/*.gzdocs/parity-analysis/notes/score-psm-trace-artifacts/java-trace-scan-23272.log.gzis excluded by!**/*.gzdocs/parity-analysis/notes/score-psm-trace-artifacts/java-trace-scan-34685.log.gzis excluded by!**/*.gzdocs/parity-analysis/notes/score-psm-trace-artifacts/java-trace-scan-41522.log.gzis excluded by!**/*.gz
📒 Files selected for processing (35)
.github/workflows/release.ymlDOCS.mdREADME.mdcrates/msgf-rust/src/bin/msgf-rust.rsdocs/benchmarks/2026-06-01-4engine-native-format.mddocs/benchmarks/README.mddocs/benchmarks/configs/mods.txtdocs/benchmarks/configs/msfragger-astral.paramsdocs/benchmarks/configs/msfragger-timstof.paramsdocs/benchmarks/configs/sage-astral.jsondocs/benchmarks/configs/sage-timstof.jsondocs/parity-analysis/notes/2026-05-25-precursor-cal-ship-gates.mddocs/parity-analysis/notes/2026-05-25-spece-tail-exploration.mddocs/parity-analysis/notes/2026-05-26-score-psm-trace-findings.mddocs/parity-analysis/notes/2026-05-28-phase2-peak-rank-parity.mddocs/parity-analysis/notes/score-psm-trace-artifacts/aggregate-analysis.txtdocs/parity-analysis/notes/score-psm-trace-artifacts/analyze.pydocs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-16629.jsondocs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-16629.txtdocs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-23082.jsondocs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-23082.txtdocs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-23272.jsondocs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-23272.txtdocs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-34685.jsondocs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-34685.txtdocs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-41522.jsondocs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-41522.txtdocs/parity-analysis/snapshots/cal-shifts-2026-05-25.jsondocs/superpowers/plans/2026-05-26-i5-score-psm-trace-plan.mddocs/superpowers/plans/2026-05-26-quality-cleanup-plan.mddocs/superpowers/plans/2026-05-28-id-rate-pxd001819-tmt-plan.mddocs/superpowers/specs/2026-05-26-i5-score-psm-trace-design.mddocs/superpowers/specs/2026-05-26-pr-v1-design.mddocs/superpowers/specs/2026-05-26-quality-cleanup-design.mddocs/superpowers/specs/2026-05-28-id-rate-pxd001819-tmt-design.md
💤 Files with no reviewable changes (24)
- docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-23082.json
- docs/superpowers/specs/2026-05-28-id-rate-pxd001819-tmt-design.md
- docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-16629.txt
- docs/parity-analysis/notes/2026-05-26-score-psm-trace-findings.md
- docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-23272.json
- docs/parity-analysis/notes/score-psm-trace-artifacts/aggregate-analysis.txt
- docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-16629.json
- docs/superpowers/specs/2026-05-26-i5-score-psm-trace-design.md
- docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-41522.txt
- docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-34685.txt
- docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-41522.json
- docs/parity-analysis/notes/2026-05-25-precursor-cal-ship-gates.md
- docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-23272.txt
- docs/superpowers/specs/2026-05-26-pr-v1-design.md
- docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-23082.txt
- docs/superpowers/plans/2026-05-26-quality-cleanup-plan.md
- docs/superpowers/plans/2026-05-26-i5-score-psm-trace-plan.md
- docs/parity-analysis/snapshots/cal-shifts-2026-05-25.json
- docs/superpowers/plans/2026-05-28-id-rate-pxd001819-tmt-plan.md
- docs/parity-analysis/notes/score-psm-trace-artifacts/rust-trace-scan-34685.json
- docs/parity-analysis/notes/2026-05-28-phase2-peak-rank-parity.md
- docs/parity-analysis/notes/2026-05-25-spece-tail-exploration.md
- docs/parity-analysis/notes/score-psm-trace-artifacts/analyze.py
- docs/superpowers/specs/2026-05-26-quality-cleanup-design.md
✅ Files skipped from review due to trivial changes (5)
- docs/benchmarks/configs/mods.txt
- docs/benchmarks/configs/msfragger-astral.params
- docs/benchmarks/README.md
- docs/benchmarks/2026-06-01-4engine-native-format.md
- README.md
🚧 Files skipped from review as they are similar to previous changes (2)
- .github/workflows/release.yml
- crates/msgf-rust/src/bin/msgf-rust.rs
| "precursor_tol": { "ppm": [-10.0, 10.0] }, | ||
| "fragment_tol": { "ppm": [-20.0, 20.0] }, |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
What is the typical fragment ion tolerance in ppm for Thermo Fisher Astral Orbitrap MS2 data in proteomics searches?
💡 Result:
For proteomics database searches of data acquired on the Thermo Fisher Orbitrap Astral mass spectrometer, a fragment ion mass tolerance of 10 ppm is commonly used [1]. While instrument performance specifications indicate that the Astral analyzer achieves mass accuracy with an RMS drift of less than 5 ppm over 24 hours under external calibration [2][3], practical proteomics search parameters often utilize a slightly wider window—frequently 10 ppm—to ensure high sensitivity and robust identification across varied experimental conditions [1]. Selection of the specific tolerance may also depend on the software, search engine, and whether the data is being integrated or quantified, with some workflows or specific software modules recommending adjustments to account for factors such as space charging or peak apex alignment [1]. Researchers should always verify optimal parameters for their specific sample complexity and software pipeline, as practices can evolve with new instrument versions and algorithmic improvements [4].
Citations:
- 1: https://pmc.ncbi.nlm.nih.gov/articles/PMC10563156/
- 2: https://docs.thermofisher.com/r/Orbitrap-Astral-Operating-Manual/1980040587v1en-US2032197771?contentId=yIvYylBjjYsQQh2ZYxM_Ew
- 3: https://assets.thermofisher.com/TFS-Assets/CMD/Specification-Sheets/ps-001797-ms-orbitrap-astral-ps001797-en.pdf
- 4: https://www.nature.com/articles/s41587-026-03131-2
Revisit fragment_tol for Astral MS2 search window
In docs/benchmarks/configs/sage-astral.json (lines 18-19), fragment_tol is set to ±20 ppm; for Thermo Orbitrap Astral proteomics database searches, ~10 ppm is commonly used for fragment ion tolerance, so ±20 ppm may be overly permissive unless the benchmark has a documented reason—either tighten to ~10 ppm or add justification.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/benchmarks/configs/sage-astral.json` around lines 18 - 19, The fragment
ion tolerance ("fragment_tol") in the sage-astral.json benchmark is set to ±20
ppm which is likely too permissive for Thermo Orbitrap Astral searches; update
the "fragment_tol" value to ~10 ppm (e.g., 10) to match common practice or, if
±20 ppm is deliberate for this benchmark, add a brief justification comment in
the config/docs explaining why a wider tolerance is used (refer to the
"fragment_tol" key in docs/benchmarks/configs/sage-astral.json).
CodeRabbit PR #46 review: - input/timstof.rs: pair mz/intensity over the overlapping prefix only (no read past a malformed length mismatch) and drop non-finite / non-positive-m/z / negative-intensity peaks before they reach the scorer. Adds two regression tests. Mirrors the thermo.rs peak guards. - search/match_engine.rs: remove the dead per-PSM MS1 isotope-feature block (CASCADE_SKIP_MS1_FEATURE was hardcoded true so it never ran) plus its now orphaned const/import/local. precursor_isotope_kl/snr stay at 0.0 defaults (PIN schema unchanged); Pass 2 co-isolation still uses MS1 via coisolation.rs. Behaviour is bit-identical (dead code path).
ProteoBench Astral .raw + HYE FASTA (proteobench.cubimed.rub.de), PRIDE PXD072598 timsTOF .d (ftp.pride.ebi.ac.uk), and the UniProt reviewed Human+Yeast proteomes used for the timsTOF search.
Summary by CodeRabbit
New Features
.rawspectra files (bundled .NET runtime included in releases).ddirectories via pure-Rust implementationDocumentation
Changes