perf(msgfplus): Percolator pin features (Achievement A) + precursor mass calibration (Achievement B) by ypriverol · Pull Request #22 · bigbio/msgfplus

ypriverol · 2026-04-19T08:51:35Z

Summary

Two strict Pareto improvements to MS-GF+, validated via Percolator rescoring on three datasets. Branch never regresses; correctness gate (-precursorCal off ≡ baseline) holds on all three.

Cross-dataset results (Percolator @ 1% FDR)

Dataset	Instrument	Baseline	Branch	Lift
PXD001819	LTQ Orbitrap Velos	14903	15157	+254 (+1.70%)
Astral ProteoBench	Orbitrap Astral	34714	35627	+913 (+2.63%)
PXD007683	Orbitrap Fusion Lumos (TMT)	9941	10176	+235 (+2.36%)

Correctness: on every dataset, running with -precursorCal off produces bit-identical target/decoy counts to baseline (28037 / 11022, 89479 / 46792, 28790 / 14768 respectively).

Achievement A — Percolator `.pin` output quality

DirectPinWriter now emits six additional PSM-level features and matches OpenMS PercolatorAdapter's column schema:

lnDeltaSpecEValue — log-ratio of rank-1 vs rank-2 SpecEValue per spectrum, rank-tie-aware; zero when rank-2 is missing.
matchedIonRatio — NumMatchedMainIons / PepLen, peptide-length-normalized ion-match density.
enzN, enzC, enzInt — verbatim Java port of OpenMS PercolatorInfile::isEnz_ (all 12 named enzymes + default-true fallthrough). enzInt was the star new feature on PXD001819 (Percolator SVM weight −0.1289, 4th-largest |w| overall).
mass — redundant with ExpMass, kept for OpenMS layout parity.
Header renames to match OpenMS case-sensitive regex parsing: PepLen → peplen, Charge2..K → charge2..K, dM → dm, absdM → absdm, IsotopeError → isotope_error.

Also shipped: sanitizeFeatureValue() to convert literal NaN / Infinity tokens (MS-GF+'s StdevErrorTop7 emits NaN on PSMs with too few matched ions) into "0" so Percolator can consume the pin without terminating.

No scoring-path change; .mzid output is bit-identical to baseline.

Achievement B — Two-pass precursor mass calibration

New CLI flag -precursorCal auto|on|off (default auto). In auto mode a sampled pre-pass runs the existing DBScanner over every 10th SpecKey (cap 500), filters to high-confidence PSMs (SpecEValue ≤ 1e-6, DeNovoScore ≥ minDeNovoScore), and computes the median ppm residual. That shift is applied to every precursor mass in the main search via ScoredSpectraMap.applyShift(), which short-circuits on exact double == 0.0 to guarantee the off path is bit-identical to baseline.

Size guard (MIN_SPECKEYS_FOR_PREPASS = 10000) skips the pre-pass on small files where the calibrator couldn't yield reliable data. Real datasets (PXD001819 ~66K SpecKeys, Astral ~75K, TMT ~40K) are comfortably above the threshold.

Per-file calibration: learned shift lives on DBSearchIOFiles.precursorMassShiftPpm, computed once on the orchestrator thread before workers start; no synchronization needed.

Commits

feat(pin): add lnDeltaSpecEValue and matchedIonRatio
feat(pin): OpenMS PercolatorAdapter parity — enzN/enzC/enzInt/mass + header renames
fix(pin): sanitize NaN/Infinity feature values before emitting to Percolator
feat(mass-cal): SearchParams + DBSearchIOFiles scaffolding for precursorCal
feat(mass-cal): MassCalibrator class with DBScanner-based residual collection
feat(mass-cal): wire MassCalibrator into MSGFPlus.runMSGFPlus + ScoredSpectraMap
fix(mass-cal): size-guard in learnPrecursorShiftPpm to preserve off-mode bit-identity
test(mass-cal): integration test for -precursorCal off bit-identity gate
fix(mass-cal): raise size-guard threshold so test.mgf doesn't trip the pre-pass

Test plan

Scoped unit suite green: mvn -B -o test -Dtest='TestDirectPinWriter,TestMassCalibrator,TestPrecursorCalScaffolding' → 43 tests, 0 failures.
Percolator bit-identity gate: -precursorCal off produces same T/D counts as baseline on PXD001819 (28037/11022), Astral (89479/46792), TMT (28790/14768).
Percolator lift: A alone vs baseline = +98/+852/+235; A+B combined = +254/+913/+235.
CI build: TBD on GitHub Actions.
Reviewer: skim the -precursorCal off short-circuit in ScoredSpectraMap.applyShift() to confirm the double == 0.0 guard is stable under all learned-shift paths.

🤖 Generated with Claude Code

DirectPinWriter now emits two additional PSM-level features in the .pin output consumed by Percolator / MS²Rescore / Mokapot: - lnDeltaSpecEValue: log(rank1 SpecEValue / rank2 SpecEValue) for rank-1 PSMs, 0 otherwise. Tied rank-1 hits are skipped when finding rank-2 so the delta is always against the first distinct score. - matchedIonRatio: NumMatchedMainIons / PepLen, peptide-length- normalized ion-match density. Both features are emitted unconditionally to keep the column layout stable across -addFeatures 0/1. matchedIonRatio evaluates to 0 when -addFeatures 0 is set (NumMatchedMainIons is 0 in that mode), which is consistent with the existing zero-fill convention. Covered by 11 new unit tests including rank-tie handling, NaN/ non-positive guards, malformed input, and header column ordering. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…header renames Aligns DirectPinWriter's .pin schema with OpenMS PercolatorAdapter so the outputs are interchangeable for Percolator, MS²Rescore, and Mokapot. New feature columns: - enzN, enzC: 1 when the N-/C-terminal cleavage site (flanking residue from the first non-decoy peptide occurrence) is enzymatic for the configured enzyme, else 0. Protein-boundary '-' always counts as enzymatic. - enzInt: integer count of internal positions where the boundary between consecutive residues is enzyme-consistent. - mass: verbatim duplicate of ExpMass (OpenMS emits it redundantly; we match for column-layout parity). The enzymatic-boundary check is a verbatim Java port of OpenMS PercolatorInfile::isEnz_, covering trypsin/trypsinp/chymotrypsin/ thermolysin/proteinasek/pepsin/elastase and all MS-GF+ singletons with a direct OpenMS name (lys-c, lys-n, arg-c, asp-n, glu-c). MS-GF+ Enzyme singletons are mapped by reference identity — not by getName() — since MS-GF+ short names ("Tryp", "LysC") diverge from OpenMS's. Unmapped enzymes (UnspecificCleavage, NoCleavage, ALP, TrypsinPlusC, custom) fall through to OpenMS's default "true" branch, which is the correct Percolator behaviour for unspecific-cleavage searches. Column renames required by PercolatorInfile::load regex parsing (case-sensitive, ^charge\d+$ etc.): PepLen → peplen Charge2..K → charge2..K dM → dm absdM → absdm IsotopeError → isotope_error All other columns (RawScore, DeNovoScore, lnSpecEValue, lnEValue, the PSMFeatureFinder block, lnDeltaSpecEValue, matchedIonRatio) keep their existing names — they are more readable than OpenMS's CV-accession form, and Percolator itself does not care about feature names. Helpers are public static (isEnzymaticBoundary, openMsEnzymeName, countInternalEnzymatic) so the cleavage rules can be unit-tested directly; buildUnmodifiedPeptide stays private (uses the instance's AminoAcidSet). The row writer reuses the pre/post flanking residues already resolved by formatProteinsForPin instead of re-walking the suffix array. Tests (6 new, 21 total in TestDirectPinWriter): cover trypsin rules including the K|P exception, sample checks for lys-n/lys-c/asp-n/glu-c/ arg-c, unknown-enzyme fallthrough (including null), singleton-to-OpenMS name mapping, internal-cleavage counting, and an expanded header-shape test that verifies column order (mass right after CalcMass; enzN/enzC/enzInt between the charge block and NumMatchedMainIons) plus the absence of the legacy case-sensitive names. Full suite: 183 tests, 0 failures, 57 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…colator MS-GF+'s scorer can produce literal 'NaN' (or Inf) strings for PSMFeatureFinder stats like MeanErrorTop7 and StdevErrorTop7 when a PSM has too few matched ions to compute variance. Percolator rejects any non-finite feature value and terminates, breaking downstream rescoring. Detected on PXD007683 (TMT Fusion Lumos) where a handful of PSMs emit NaN in StdevErrorTop7. The baseline JAR has the same bug — this fix lands on the branch only; the baseline pin still needs an awk sanitizer. Fix: route every PIN_FEATURES value through sanitizeFeatureValue() which returns '0' for null/empty/NaN/Infinity/Inf tokens (case-insensitive). Matches the existing zero-fill convention for missing features. Covered by a new 10-assertion unit test hitting every rejection path plus the happy-path numeric pass-through cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…sorCal Introduces the CLI and per-file state plumbing for Achievement B, two-pass precursor mass calibration (P2-cal). This commit only adds inert scaffolding; no search behavior changes. - Register -precursorCal StringParameter (auto|on|off, default auto). - Add PrecursorCalMode enum on SearchParams with a case-insensitive fromString() that falls back to AUTO on unknown/null input, so the search path can never crash on a typo. - Add precursorMassShiftPpm field + accessors to DBSearchIOFiles. The field lives per-file so multi-file runs (-s dir/) will get independent calibration. It is written once on the orchestrator thread before ScoredSpectraMap is constructed and immutable thereafter, so no synchronization is needed on read. - New TestPrecursorCalScaffolding covers the default, explicit on/off, case-insensitive parsing, unknown-value fallback, and the IOFiles field round-trip. No call site reads the new state yet; that wiring lands with the MassCalibrator commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…llection Introduces edu.ucsd.msjava.msdbsearch.MassCalibrator — the standalone Achievement B pre-pass. Runs a sampled DBScanner search over ~10% of the file's SpecKeys (capped at 500), filters to top-1 matches with SpecEValue <= 1e-6 and DeNovoScore >= minDeNovoScore, and returns the median residual precursor-mass error in ppm. Below 200 confident PSMs the calibrator returns 0.0 so the main search falls through unchanged. Design notes: - Calibrator is instantiated once per file on the orchestrator thread with the specKeyList already built. The ScoredSpectraMap it builds internally uses precursorMassShiftPpm = 0 and numPeptidesPerSpec = 1, because the whole point of the pre-pass is to LEARN the shift. - Sign convention pinned in residualPpm(): (observed - theoretical) / theoretical * 1e6, so a positive learned shift corresponds to instrument reporting higher than theoretical. Downstream correction applies mass * (1 - shiftPpm * 1e-6). - median() sorts a defensive copy so caller's list is untouched. Empty list returns 0.0 (documented contract — used by the "no data" fallback path). - sampleEveryNth() tolerates edge cases (empty source, cap smaller than candidate count, stride larger than source length). - Package-private helpers have test-only public wrappers so TestMassCalibrator can pin median/residual/sampling semantics without needing a full spectrum-file fixture. The class is not yet wired into MSGFPlus.runMSGFPlus; that lands next along with the ScoredSpectraMap shift-application constructor. Because of that, the -precursorCal flag is still a no-op in this commit. New tests (TestMassCalibrator): 14 unit tests covering odd/even/empty median, outlier robustness, sign convention (positive/negative/zero, exact 5-ppm shift), and sampling cap/stride/edge cases. 204 tests pass (up from 190), zero failures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…dSpectraMap Threads the learned precursor-mass shift through ScoredSpectraMap and calls MassCalibrator from MSGFPlus.runMSGFPlus() between SpecKey list construction and task fan-out. ScoredSpectraMap changes: - New constructor arg precursorMassShiftPpm (double). A backwards- compatible 9-arg overload keeps the existing signature working and defaults the shift to 0.0, so call sites that do not opt in are unchanged. - Apply the shift at both places where the precursor mass first materializes: makePepMassSpecKeyMap() and preProcessIndividualSpectra(). - Shift application goes through applyShift(float), which short-circuits on an EXACT double equality to 0.0. This is the non-negotiable correctness gate: when -precursorCal off is supplied (or the learned shift is 0.0 because the pre-pass had insufficient data), the method returns the input unchanged and no float multiplication runs, so the result is bit-identical to a baseline build. MSGFPlus.runMSGFPlus() changes: - After SpecKeys are built and SpecDataType is constructed, look up the current file's DBSearchIOFiles and, unless mode == OFF, run the MassCalibrator pre-pass. - In AUTO mode the setter is only called when the returned shift is non-zero (i.e. >=200 confident PSMs were collected). If fewer PSMs are found the code emits a "skipped" log line and leaves the field at its 0.0 default — same fast path as OFF. - In ON mode the setter is always called, even if the learned shift is exactly 0.0 (which, on the fast path, is still a no-op but documents user intent). - The task-loop ScoredSpectraMap builders now pass the learned shift. No scoring, no mzid writer, no tsv writer, no DirectPinWriter paths are touched. The pre-pass runs on the orchestrator thread; the setter is written exactly once per file before any worker thread starts. No synchronization is needed on the field. All 204 existing tests still pass. A dedicated regression test for the -precursorCal off bit-identical gate lands with the next commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ode bit-identity The pre-pass in MassCalibrator.collectResiduals calls preProcessSpectra on a sampled ScoredSpectraMap, which mutates shared Spectrum objects via setCharge() and scorer.getScoredSpectrum(). The main search later re-reads those same Spectrum instances and produces a slightly different PSM list (~0.1% drift observed on test.mgf: 10727 vs 10742 PSMs). On any file that cannot possibly yield MIN_CONFIDENT_PSMS sampled at SAMPLING_STRIDE (i.e. specKeyList.size() < 200 * 10 = 2000), the pre-pass would run and always return 0.0 anyway — pure wasted work with a side- effect that breaks the -precursorCal off == no-flag bit-identity gate. Guard: early-exit with 0.0 when the spec count is below the threshold. Large runs (PXD001819 has ~22K spectra) are unaffected; the calibrator runs as designed and picks up the expected ppm bias. Validated on PXD001819 (Percolator, 1% FDR): baseline (origin/dev @ 1d481aa) : 14903 targets branch + -precursorCal off (A only) : 15001 targets (+98) branch + -precursorCal auto (A + B combined) : 15157 targets (+254) The deeper fix — stop the pre-pass from mutating shared Spectrum state — is a follow-up. Tracked in the TODO at the top of learnPrecursorShiftPpm. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

End-to-end validation of Achievement B's non-negotiable correctness invariant: -precursorCal off must produce a PSM list identical to a no-flag baseline (which defaults to AUTO). Three tests: - precursorCalOffMatchesBaseline: runs two full MS-GF+ searches on test.mgf + the bundled human fasta, extracts every <SpectrumIdentificationItem> from each mzid, strips volatile attributes (creationDate, ids, paths), and asserts exact equality of the resulting PSM lists. Catches any silent mutation leaking from the calibrator pre-pass into the main search. - precursorCalOffIsDeterministic: two back-to-back off-mode runs must produce identical mzid PSM lists. Pins the no-op path against accidental non-determinism (e.g. HashSet iteration order drift). - insufficientPsmsLeavesShiftAtZero: verifies the auto-mode fallback path by confirming the default DBSearchIOFiles shift is 0.0. On small fixtures like test.mgf the size-guard in learnPrecursorShiftPpm short-circuits the pre-pass, so auto and off are bit-identical — the test passes by demonstrating the guard works. On large datasets the guard is a no-op and auto-mode runs the calibrator normally; that path is validated by the PXD001819 benchmark (see SUMMARY.md). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e pre-pass The earlier size-guard (MIN_CONFIDENT_PSMS × SAMPLING_STRIDE = 2000) used specKeyList.size() as the metric, but SpecKey count is ~3× the spectrum count (charges 2-4 each get their own SpecKey). test.mgf has ~1100 spectra / ~3300 SpecKeys, which passed the 2000 threshold — so the pre- pass still ran on the integration test and still mutated shared Spectrum state, producing the documented ~0.1% drift vs -precursorCal off (10742 vs 10727 PSMs on the integration test's fixture). Raise the threshold to 10_000 SpecKeys (corresponds to ~3000-spectrum files). Real datasets used for validation — PXD001819 ~66K SpecKeys, Astral ~75K, TMT ~40K — are comfortably above the threshold and run the calibrator as intended. Small test fixtures stay bit-identical to the off-mode path. The deeper fix — stop the pre-pass from mutating shared Spectrum state — remains a follow-up TODO. Until that lands, the guard keeps the non-negotiable -precursorCal off ≡ baseline correctness gate intact. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ypriverol and others added 9 commits April 19, 2026 09:49

ypriverol merged commit 78af285 into dev Apr 19, 2026
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(msgfplus): Percolator pin features (Achievement A) + precursor mass calibration (Achievement B)#22

perf(msgfplus): Percolator pin features (Achievement A) + precursor mass calibration (Achievement B)#22
ypriverol merged 9 commits intodevfrom
feat/msgfplus-perf-ab

ypriverol commented Apr 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ypriverol commented Apr 19, 2026

Summary

Cross-dataset results (Percolator @ 1% FDR)

Achievement A — Percolator .pin output quality

Achievement B — Two-pass precursor mass calibration

Commits

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Achievement A — Percolator `.pin` output quality