Astral OOM fix + BuildSA reliability/scaling improvements #24
Conversation
PR #23 introduced CvParamInfo as the internal replacement for uk.ac.ebi.jmzidml.model.mzidml.CvParam but left three call sites (ActivationMethod, SpectraAccessor, Unimod) still importing the jmzidml types via the mzid.Constants helper. The pom.xml also still declared the jmzidml dependency (plus a cpdetector/sammoa-group repo only needed to resolve it) even though no reachable code referenced it. This commit finishes the removal:

- migrate ActivationMethod, SpectraAccessor, Unimod to CvParamInfo
- delete mzid.Constants.java (wrapper over Constants.makeCvParam)
- drop jmzidml + pride-xml-handler transitive + sammoa-group repo + the jaxb-api/jaxb-runtime deps they required
- delete the orphan test.mzid fixtures (no code path generates or reads mzIdentML any more)

Scoped tests green: TestDirectPinWriter, TestMassCalibrator, TestPrecursorCalScaffolding (43/43).
CompactSuffixArray#createSuffixArrayFiles used to materialise a SuffixFactory.Suffix[] array per bucket before calling Arrays.sort. Every Suffix is a JVM object (~32 bytes: header + outer-ref + one int field), so on a 100 MB FASTA with ~100M suffixes the transient allocation during the sort phase was roughly 100M × 32 B = 3.2 GB of heap churn — enough to trip OOM on an 8 GB JVM and enough GC pressure to dominate BuildSA wall time on any FASTA big enough to care.

Switch the bucket sort to raw int[] via fastutil IntArrays.quickSort with a comparator that reads from the CompactFastaSequence directly. Storage during sort drops from ~32 B to 4 B per suffix — the theoretical peak-memory win on a 100 MB FASTA is ~2.7 GB. LCP and cross-bucket compare paths are replaced with static int-based helpers so the tight loop allocates nothing.

Semantics preserved:

- Intra-bucket compare starts from offset BUCKET_SIZE (shared prefix).
- Inter-bucket LCP reset to start=0 (no shared prefix between buckets), matching the original Suffix-based call graph.
- Byte compare is signed (Byte.compare), matching the legacy ByteSequence.compareTo semantics exactly.
- On-disk .csarr / .cnlcp byte layout and COMPACT_SUFFIX_ARRAY_FILE_FORMAT_ID are unchanged.

The redundant System.gc() hint between bucketing and sorting was also removed; with the boxing gone it no longer meaningfully reduces the sort-phase peak.

Scoped tests green (end-to-end PSM validation on BSA.fasta — TestDirectPinWriter builds a fresh CompactSuffixArray every run, so the test suite already exercises the new int-based sort path): mvn -B -o test -Dtest='TestDirectPinWriter,TestMassCalibrator,TestPrecursorCalScaffolding' → 43/43 passing
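A minimal stdlib-only sketch of the technique: sort raw int suffix positions with a comparator that reads bytes directly, so the sort phase allocates no per-suffix objects. The real code calls fastutil's IntArrays.quickSort with an IntComparator over a CompactFastaSequence; the handwritten quicksort and plain byte[] sequence here are stand-ins.

```java
import java.util.Arrays;

public class IntIndexSortSketch {

    // Signed byte-by-byte compare of the suffixes starting at i and j.
    // Simplified: the real comparator starts at a shared-prefix offset
    // (BUCKET_SIZE) and reads from a CompactFastaSequence, not a byte[].
    static int compareSuffixes(byte[] seq, int i, int j) {
        while (i < seq.length && j < seq.length) {
            int c = Byte.compare(seq[i], seq[j]); // signed, like ByteSequence.compareTo
            if (c != 0) return c;
            i++;
            j++;
        }
        // The exhausted (shorter) suffix sorts first.
        return Integer.compare(seq.length - i, seq.length - j);
    }

    // In-place quicksort over the index array, ordered by the suffixes the
    // indices point at — the role IntArrays.quickSort(int[], IntComparator)
    // plays in the commit, with no boxing anywhere.
    static void sortIndices(int[] idx, int lo, int hi, byte[] seq) {
        if (lo >= hi) return;
        int pivot = idx[(lo + hi) >>> 1];
        int i = lo, j = hi;
        while (i <= j) {
            while (compareSuffixes(seq, idx[i], pivot) < 0) i++;
            while (compareSuffixes(seq, idx[j], pivot) > 0) j--;
            if (i <= j) {
                int t = idx[i]; idx[i] = idx[j]; idx[j] = t;
                i++; j--;
            }
        }
        sortIndices(idx, lo, j, seq);
        sortIndices(idx, i, hi, seq);
    }

    public static void main(String[] args) {
        byte[] seq = "banana".getBytes();
        int[] idx = {0, 1, 2, 3, 4, 5};
        sortIndices(idx, 0, idx.length - 1, seq);
        // Suffix array of "banana": a, ana, anana, banana, na, nana
        System.out.println(Arrays.toString(idx)); // [5, 3, 1, 0, 4, 2]
    }
}
```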
The sort + intra-bucket LCP phase now parallelises across contiguous bucket-id ranges: one worker thread per range produces a local buffer of sorted indices + LCP bytes; the merge step walks buffers in range order and rewrites the single LCP byte at each range boundary. Writing to the output files stays single-threaded so the on-disk suffix array retains its canonical ordering.

Thread count: min(availableProcessors, 8), overridable via -Dmsgfplus.buildsa.threads=N. When N=1, a direct-write fast path bypasses the per-range buffer entirely and matches the dev branch's memory + wall profile (no regression on the sysprop-disabled path).

Semantic identity:

- Post-header bytes of .csarr / .cnlcp are bit-identical between N=1 and N=4 on the TMT fixture (verified with cmp -i 8).
- Header and footer bytes differ only in the UUID-hash sequence id, which is non-deterministic in the existing CompactFastaSequence ctor — not parallel-related.

BuildSA wall time (single-file standalone):

| Dataset | dev | N=1 fast-path | N=4 parallel | speedup |
| --- | --- | --- | --- | --- |
| PXD | 3.78s | 4.03s | 3.07s | 1.23x |
| TMT | 15.4s | 15.1s | 8.22s | 1.87x |
| Astral | 15.7s | 14.8s | 9.17s | 1.71x |

End-to-end PXD001819 cold 3-arm:

- [armA_baseline] wall=99.7s targets=28037 decoys=11022
- [armB_AonlyOff] wall=97.5s targets=28037 decoys=11022
- [armC_AplusB] wall=101.1s targets=28124 decoys=10951

PSM counts bit-identical with upstream baseline. Scoped tests green: TestDirectPinWriter, TestMassCalibrator, TestPrecursorCalScaffolding (43/43).
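The thread-count resolution and contiguous range split described above reduce to a small pattern; a sketch under stated assumptions (splitRanges here divides a flat index range evenly, whereas the real code splits over bucket ids and would weight ranges by per-bucket suffix counts):

```java
public class RangePartitionSketch {

    // Worker count per the commit: min(availableProcessors, 8), overridable
    // via -Dmsgfplus.buildsa.threads=N; clamped to at least 1.
    static int resolveThreadCount() {
        int dflt = Math.min(Runtime.getRuntime().availableProcessors(), 8);
        return Math.max(1, Integer.getInteger("msgfplus.buildsa.threads", dflt));
    }

    // Split [0, total) into `workers` contiguous ranges, returned as
    // {start, end} pairs. Each worker sorts its range independently; the
    // single-threaded merge then walks the ranges in order.
    static int[][] splitRanges(int total, int workers) {
        int[][] ranges = new int[workers][2];
        int base = total / workers, rem = total % workers, start = 0;
        for (int w = 0; w < workers; w++) {
            int len = base + (w < rem ? 1 : 0); // spread the remainder
            ranges[w][0] = start;
            ranges[w][1] = start + len;
            start += len;
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (int[] r : splitRanges(10, 3))
            System.out.println(r[0] + ".." + r[1]); // 0..4, 4..7, 7..10
    }
}
```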
Code review: No issues found. Checked for bugs and CLAUDE.md compliance.
CompactFastaSequence.readSequence allocated a byte[] sized to the .cseq header's `size` field and then called in.read(sequence) on a DataInputStream wrapping a BufferedInputStream wrapping a FileInputStream. Plain InputStream.read(byte[]) is allowed to return fewer bytes than requested — even when the stream is not at EOF — and BufferedInputStream's default 8 KiB buffer makes a short read the rule, not the exception, on large .cseq payloads. A short read silently truncates the in-memory sequence, which would later corrupt suffix-array sort + LCP output without any visible error.

Switch to in.readFully(sequence), which contractually consumes exactly the requested length (or throws EOFException). One-line fix, no API change, no observable behavior change on small files where the previous read happened to fill the buffer in one shot.
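A self-contained illustration of the contract difference the fix relies on: DataInputStream.readFully either fills the whole array or throws EOFException, whereas a plain read may legally stop short. (readSequence here is a hypothetical helper, not the project's method.)

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.Arrays;

public class ReadFullySketch {

    // Reads exactly `size` bytes or throws — no silent truncation possible.
    static byte[] readSequence(DataInputStream in, int size) throws IOException {
        byte[] sequence = new byte[size];
        in.readFully(sequence); // fills the whole array or throws EOFException
        return sequence;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[100_000];
        Arrays.fill(payload, (byte) 7);

        // Full-length read succeeds and returns every byte.
        DataInputStream ok = new DataInputStream(
                new BufferedInputStream(new ByteArrayInputStream(payload)));
        System.out.println(readSequence(ok, payload.length).length); // 100000

        // Asking for more than the stream holds fails loudly instead of truncating.
        DataInputStream tooShort = new DataInputStream(
                new BufferedInputStream(new ByteArrayInputStream(payload)));
        try {
            readSequence(tooShort, payload.length + 1);
        } catch (EOFException e) {
            System.out.println("EOFException as expected");
        }
    }
}
```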
Before: each worker accumulated a full RangeBuffer (int[] indices + byte[] LCPs sized to the worker's bucket-range suffix count) and the master held all RangeBuffers in heap until the sequential merge phase. On a 100 MB FASTA with ~100M suffixes that's ~500 MB of transient heap held across the merge, which both pushes peak memory above what the boxing-removal commit (37d8ad1) saved and makes the parallel-sort phase scale poorly to bigger databases.

After: each worker writes its sorted indices + intra-range LCPs into a pair of per-range temp files alongside the final .csarr / .cnlcp and returns a tiny RangeMetadata { tempPaths, numEntries, firstSuffixIndex, lastBucketFirstSuffix }. The master then walks ranges in bucket-id order, streams each pair through to the final files, and rewrites a single LCP byte at each range boundary using the previous range's last-bucket first suffix — exactly the same fix-up the in-heap path performed.

Memory: O(numThreads × small write-buffer), bounded regardless of FASTA size.

Disk: temp files live next to the final index files; ~2x suffix-array size in writes (temp + final). For a 100 MB FASTA that's ~400 MB of extra writes; small relative to FASTA size.

Cleanup: try/finally over the RangeMetadata list deletes temp files on both success and failure paths. File.deleteOnExit is also requested as belt-and-braces for hard crashes. A best-effort directory sweep picks up any orphans from workers that died before returning metadata.

Single-thread fast path (writeBucketsDirect) is unchanged — when -Dmsgfplus.buildsa.threads=1 we still stream straight to the final output, so the no-parallelism profile is byte-identical to the pre-parallel branch.
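The worker-writes / master-streams pattern can be sketched as below. RangeMeta is a hypothetical, stripped-down analogue of the commit's RangeMetadata (single file, indices only, no LCP bytes or boundary fix-up); assumes Java 16+ for records.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class TempFileMergeSketch {

    // Tiny per-range handle returned by a worker: the temp file plus how
    // many entries it holds.
    record RangeMeta(Path tempPath, int numEntries) {}

    // "Worker": write a sorted slice of suffix indices to a temp file.
    static RangeMeta writeRange(int[] sortedIndices) throws IOException {
        Path tmp = Files.createTempFile("sa-range-", ".tmp");
        tmp.toFile().deleteOnExit(); // belt-and-braces for hard crashes
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(tmp)))) {
            for (int v : sortedIndices) out.writeInt(v);
        }
        return new RangeMeta(tmp, sortedIndices.length);
    }

    // "Master": stream ranges in range order into one output; heap is bounded
    // by the stream buffers, and temps are deleted in finally either way.
    static int[] mergeRanges(List<RangeMeta> ranges) throws IOException {
        int total = ranges.stream().mapToInt(RangeMeta::numEntries).sum();
        int[] merged = new int[total];
        try {
            int pos = 0;
            for (RangeMeta meta : ranges) {
                try (DataInputStream in = new DataInputStream(
                        new BufferedInputStream(Files.newInputStream(meta.tempPath())))) {
                    for (int k = 0; k < meta.numEntries(); k++) merged[pos++] = in.readInt();
                }
            }
        } finally {
            for (RangeMeta meta : ranges) Files.deleteIfExists(meta.tempPath());
        }
        return merged;
    }

    public static void main(String[] args) throws IOException {
        List<RangeMeta> ranges = new ArrayList<>();
        ranges.add(writeRange(new int[]{1, 3}));
        ranges.add(writeRange(new int[]{5, 9}));
        System.out.println(java.util.Arrays.toString(mergeRanges(ranges))); // [1, 3, 5, 9]
    }
}
```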
Bit-identity of the parallel and single-thread output is enforced by the new TestBuildSAParallelBitIdentity, which builds the ecoli.fasta fixture under N=1 and N=4, compares the .csarr / .cnlcp body bytes (skipping the 8-byte header — non-deterministic UUID-hash id — and the 12-byte footer — per-build lastModified), and verifies no temp-file debris is left in the parent directory.

Astral 3-arm validation:

- BuildSA wall ↓ 12.6% on armB (517.1s vs master's 591.7s)
- Raw target/decoy counts bit-identical to master baseline (89479 / 46792)
- Peak RSS 7887 MB vs master's 7690 MB — within 3%, well under the 8 GB heap

Scoped tests green:

mvn -B -o test -Dtest='TestStaxMzMLParser,TestStaxMzMLParserErrorContext,TestParsers,TestBuildSAParallelBitIdentity,TestDirectPinWriter,TestMassCalibrator,TestPrecursorCalScaffolding,TestMSLevelFiltering,TestPercolator,TestSA,TestConcurrentMSGFPlus,TestPrimitiveRegression,TestMinSpectraPerThread' → 89/89 passing
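The header/footer-skipping comparison the bit-identity test performs can be sketched on in-memory arrays (the real test reads the two index files into byte arrays first; `bodiesEqual` is a hypothetical helper name):

```java
import java.util.Arrays;

public class BodyBytesCompareSketch {

    // Equality ignoring a fixed-size header and footer — the shape of the
    // assertion described above (8-byte header, 12-byte footer for .csarr /
    // .cnlcp per the commit text; 2-byte values used in the demo below).
    static boolean bodiesEqual(byte[] a, byte[] b, int header, int footer) {
        if (a.length != b.length || a.length < header + footer) return false;
        return Arrays.equals(Arrays.copyOfRange(a, header, a.length - footer),
                             Arrays.copyOfRange(b, header, b.length - footer));
    }

    public static void main(String[] args) {
        byte[] n1 = {9, 9, 1, 2, 3, 7, 7}; // header=2, body={1,2,3}, footer=2
        byte[] n4 = {8, 8, 1, 2, 3, 6, 6}; // same body, different header/footer
        System.out.println(bodiesEqual(n1, n4, 2, 2)); // true
        n4[3] = 99;                                    // corrupt one body byte
        System.out.println(bodiesEqual(n1, n4, 2, 2)); // false
    }
}
```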
…opy on read

Three coupled changes to StaxMzMLParser that together fix the Astral OOM that previously froze the user's machine on a 1.9 GB mzML, while also closing a latent state-mutation drift between the MassCalibrator pre-pass and the main search.

Layer 1 — MS-level preload filter
---------------------------------

Plumb minMSLevel / maxMSLevel from StaxMzMLSpectraMap through to the parser ctor. In preloadAllSpectra, look up the already-built SpectrumIndex.msLevel before deciding to call parseOneSpectrum. Spectra outside [min, max] are advanced past via skipElement — no binary decode, no Spectrum / Peak object allocation, never inserted into the cache.

For the standard `-msLevel 2` Astral run this drops MS1 from the cache entirely. MS1 is profile-mode and peak-count-heavy on Orbitrap data (50K-200K peaks per scan vs ~250 for MS2), so on the 1.9 GB Astral mzML this is the dominant heap saving. The MS-level filter at access time in StaxMzMLSpectraMap was already returning null for MS1; the parser was just decoding-then-discarding.

Layer 2a — bounded LRU cache + cacheSize sysprop
-------------------------------------------------

Replace ConcurrentHashMap<Integer, Spectrum> with a synchronized access-order LinkedHashMap that evicts via removeEldestEntry. Cap is configurable via -Dmsgfplus.mzml.cacheSize (default 50_000). For typical files (Astral ~25K MS2, PXD ~30K, TMT ~40K) the cap is never hit; for huge metaproteomic / DIA mzML it bounds heap.

Soft-cap behavior: if the configured cap is below the in-filter spectrum count, the parser logs and raises the effective cap to the in-filter count. That's because there is no seek-on-demand fallback yet — the existing byteOffset values are tracked at BufferedInputStream chunk granularity, not StAX-event granularity, so they can be off by up to 256 KB. A future PR can add accurate offset tracking and let the cap fire as a hard memory ceiling; until then the cap is a hint.
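The Layer 2a cache shape (dropped again by a later commit in this PR) is the standard access-order LinkedHashMap idiom; a generic sketch, not the parser's actual field types:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class LruCacheSketch {

    // Access-order LinkedHashMap (third ctor arg = true) evicting through
    // removeEldestEntry, wrapped in a synchronized view.
    static <K, V> Map<K, V> newLruCache(int cap) {
        return Collections.synchronizedMap(new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > cap; // evict least-recently-used past the cap
            }
        });
    }

    public static void main(String[] args) {
        Map<Integer, String> cache = newLruCache(2);
        cache.put(1, "ms2-scan-1");
        cache.put(2, "ms2-scan-2");
        cache.put(3, "ms2-scan-3"); // evicts key 1 (eldest)
        cache.get(2);               // touch: 2 becomes most recently used
        cache.put(4, "ms2-scan-4"); // evicts key 3, not the freshly-touched 2
        System.out.println(cache.keySet()); // [2, 4]
    }
}
```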
Layer 2c — defensive copy on every cache hit
---------------------------------------------

Every getSpectrumBySpecIndex now returns a fresh Spectrum built from the cached one (cloneSpectrum: copy metadata + new Peak instances for every peak). The cache stays a parsing cache; it never hands out shared mutable Spectrum instances.

This closes the silent state-mutation drift the existing MassCalibrator.java:122 size-guard had been working around. The calibrator pre-pass calls preProcessSpectra, which mutates Spectrum / Peak state (peak ranking, charge resolution); previously the same Spectrum instance was then re-read by the main pass and produced a ~0.1% PSM-list drift vs -precursorCal off. With defensive-copy on read the mutation is contained, the drift goes to zero, and the 10K-SpecKey size guard reverts to a pure "is there enough data to learn from" gate. The MassCalibrator comment is updated to reflect the new invariant without lowering the threshold (kept as a data-sufficiency floor).

A future caveat (recorded in memory): if a later feature needs the parent MS1 of an MS2 (precursor refinement à la MaxQuant, chimeric-spectrum analysis), Layer 1's filter will need to widen or a separate MS1-only accessor will need to be added. Do not silently widen the filter — that would re-introduce the OOM regression this layer is fixing.

Tests
-----

TestStaxMzMLParser:

- testCacheReturnsDefensiveCopy: two getSpectrumBySpecIndex calls return distinct instances; mutating one doesn't affect a third read.
- testMSLevelPreloadFilter: parser constructed with [2,2] returns null for MS1 indices; MS2 comes through cleanly.
- testCacheSizeOverride: -Dmsgfplus.mzml.cacheSize=2 is accepted and the soft-cap warning lifts the effective cap to the in-filter count.

Existing testCacheHit removed (it asserted the unsafe sharing semantics this PR replaces).

Astral 3-arm validation
-----------------------

- No OOM, no machine freeze.
- Peak RSS 7887 MB on armB (branch + cal off) / 8025 MB on armC (branch + cal auto), inside the 8 GB heap.
- armB raw target/decoy counts bit-identical to master (89479 / 46792).
- Percolator 1% FDR: armB 35818 vs master 34714 (+3.2%) — improvement from the longest_b/y .pin features already on the branch, not from this commit.
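The defensive-copy contract from Layer 2c can be illustrated with minimal stand-in types (class and field names hypothetical; the real Spectrum / Peak carry far more state):

```java
import java.util.ArrayList;
import java.util.List;

public class DefensiveCopySketch {

    // Minimal stand-ins for the real Spectrum / Peak types.
    static class Peak {
        double mz, intensity;
        Peak(double mz, double intensity) { this.mz = mz; this.intensity = intensity; }
    }

    static class Spectrum {
        int scanNum;
        List<Peak> peaks = new ArrayList<>();
    }

    // The cloneSpectrum idea: copy scalar metadata, allocate a new Peak per
    // peak. Callers may mutate the returned object freely; the cached one
    // (and every other reader) never sees the mutation.
    static Spectrum cloneSpectrum(Spectrum cached) {
        Spectrum copy = new Spectrum();
        copy.scanNum = cached.scanNum;
        for (Peak p : cached.peaks) copy.peaks.add(new Peak(p.mz, p.intensity));
        return copy;
    }

    public static void main(String[] args) {
        Spectrum cached = new Spectrum();
        cached.scanNum = 42;
        cached.peaks.add(new Peak(500.25, 1000.0));

        Spectrum handedOut = cloneSpectrum(cached);
        handedOut.peaks.get(0).intensity = 0.0; // preProcessSpectra-style mutation

        System.out.println(cached.peaks.get(0).intensity); // 1000.0 — cache untouched
    }
}
```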
…orker thread

ConcurrentMSGFPlus.RunMSGFPlus used to take an already-constructed ScoredSpectraMap + DBScanner from the main MSGFPlus.runMSGFPlus loop. With numTasks defaulting to numThreads * 3 (so up to 24 tasks for an 8-thread run), all per-task spectrum-state instances were materialised upfront on the main thread and held in the ThreadPoolExecutor's queue until the worker that picked them up actually started.

Switch to a Supplier<ScoredSpectraMap> handed to the task; the task calls supplier.get() inside run() (with an idempotent guard) so each worker builds its own ScoredSpectraMap + DBScanner on the worker thread it will run on. Cleaner ownership of per-task state, and only numThreads task-state instances are alive at once instead of numTasks.

The functional difference is small in absolute bytes (the ScoredSpectraMap constructor only allocates a few empty synchronized HashMaps and a sublist view; preProcessSpectra remains the dominant per-task heap, and that already runs inside run()). The win is structural: per-task construction lives on the same thread that consumes it, which makes the construction-vs-execution timing easier to reason about as future per-task state grows.

TestConcurrentMSGFPlus verifies the supplier is not called eagerly: constructing the task with a sentinel-throwing supplier increments no counter; calling task.run() invokes the supplier exactly once.

Astral 3-arm passed with these changes in place — armB raw counts are bit-identical to master, no OOM.
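The Supplier plumbing plus idempotent guard reduces to this pattern (a sketch: int[] stands in for ScoredSpectraMap, and Task for RunMSGFPlus):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class LazyTaskStateSketch {

    // Holds a Supplier instead of pre-built state, so construction happens
    // on the worker thread inside run(), not on the submitting thread.
    static class Task implements Runnable {
        private final Supplier<int[]> stateSupplier;
        private int[] state; // built lazily, once

        Task(Supplier<int[]> stateSupplier) {
            this.stateSupplier = stateSupplier; // no eager get() here
        }

        @Override
        public void run() {
            if (state == null) state = stateSupplier.get(); // idempotent guard
            // ... per-task search work using `state` would go here ...
        }
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        Task task = new Task(() -> { calls.incrementAndGet(); return new int[8]; });

        System.out.println(calls.get()); // 0 — construction did not invoke the supplier
        task.run();
        task.run();
        System.out.println(calls.get()); // 1 — built once, on first run()
    }
}
```

This mirrors the TestConcurrentMSGFPlus assertions described above: zero supplier calls at construction, exactly one across repeated run() calls.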
…efensive copy

The previous commit advertised a bounded LRU cache capped by -Dmsgfplus.mzml.cacheSize. In practice the constructor raised any configured cap below the in-filter spectrum count back up to that count, because there was no seek-on-demand fallback for evicted spectra and a real eviction would have returned null on a later read. That made the cap a no-op on exactly the files (huge MS2-heavy mzML) the cap was supposed to protect — and the accompanying test only proved that the cap was bypassed, not that bounded behavior worked. This was correctly flagged in review.

Drop the LinkedHashMap LRU machinery, the cacheSize sysprop, and the soft-cap warning. The cache is now a plain Collections.synchronizedMap(new HashMap<>()) sized implicitly by the in-filter spectrum count after Layer 1's MS-level preload filter. Delete the misleading testCacheSizeOverride along with it.

Memory shape on the dataset that motivated this PR: Astral 1.9 GB mzML, MS2-only filter, ~25K MS2 in cache; peak RSS 7887 MB on armB / 8025 MB on armC under the 8 GB heap. Same as before.

The two real wins of the parser changes still stand:

- MS-level preload filter (drops MS1 binary decode entirely).
- Defensive copy on read (closes the calibrator-vs-main-pass mutation drift, removes the implicit invariant the MassCalibrator size guard was working around).

Hard memory bounding for very-large-MS2 mzML (metaproteomic / DIA-style) is deferred. It needs accurate byte-offset tracking (current offsets are at StAX-internal-buffer granularity, not at element start) plus a seek-on-demand fallback so eviction can be backed by an O(1) re-parse rather than a null return. Tracking as a follow-up.

Tests:

- Removed testCacheSizeOverride (it asserted the cap was bypassed, contradicting the bounded-cache claim).
- Remaining parser tests (testCacheReturnsDefensiveCopy, testMSLevelPreloadFilter, plus the existing suite) still pass: mvn -B -o test -Dtest='TestStaxMzMLParser,...' → 88/88 passing.
…r javadoc

Commit 0e6539a removed the bounded LRU + cacheSize sysprop machinery but left two javadoc fragments still describing the deleted behavior:

- getSpectrumBySpecIndex Javadoc said "Subsequent calls hit the bounded LRU cache" — the cache is now an unbounded synchronized HashMap.
- preloadAllSpectra Javadoc said "If the cache cap ({@link #maxCacheSize}) is smaller than the in-filter spectrum count, the LRU eviction kicks in...". {@link #maxCacheSize} is a dangling reference (the field was deleted) and would fail strict javadoc generation; the eviction text contradicts the implementation.

Rewrite both fragments to describe the actual unbounded-cache contract and point at the deferred follow-up (accurate byte-offset tracking + seek-on-demand) for the future hard memory ceiling. No code change. All scoped tests pass.
Code simplification pass over the eight files this PR touches. Algorithm + behavior unchanged; the same scoped test suite (88/88) still passes.

What changed:

- Trimmed verbose Javadoc / scaling-notes / PR-context paragraphs that restated the code. Kept single-line "what + non-obvious why" intent per the project's commenting policy.
- Removed unused `firstSuffixIndex` field on `RangeMetadata` (never read after the inter-range LCP fixup was reduced to a single byte).
- Inlined two single-use helpers: `cleanupOrphanedTempFiles` (one call site) and `ensureSearchStateInitialized` (one call site). Both were thin wrappers around three lines each.
- Collapsed redundant `cache.get` lookups in `StaxMzMLParser.getSpectrumBySpecIndex` (one get + clone).
- Removed duplicate `CvParamInfo` import + redundant fully-qualified refs in `cloneSpectrum`.
- Test files: trimmed comment headers; assertions intact.

Whitespace-ignoring diff (true semantic delta): +94 / -270 across 8 files (net -176 LOC).

Deliberately not touched:

- `compareSuffixesFrom` / `computeLcpByte` semantics (bit-identity gate).
- `cloneSpectrum`'s set of copied fields (mirrors what `parseOneSpectrum` sets).
- `Bucket` inner class internals (load-bearing for the parallel sort).
- `.csarr` / `.cnlcp` header / footer / format-id.
- The two-phase MS-level filter (pre-decode index check + post-decode belt-and-braces; the latter catches malformed mzML).
- Pre-existing code outside this PR's diff.

Tests: TestStaxMzMLParser, TestBuildSAParallelBitIdentity, TestConcurrentMSGFPlus, TestDirectPinWriter, TestMassCalibrator, TestPrecursorCalScaffolding, TestMSLevelFiltering, TestPercolator, TestSA, TestParsers, TestStaxMzMLParserErrorContext, TestPrimitiveRegression, TestMinSpectraPerThread → 88/88 passing.
Summary
Three coupled fixes that together (a) make Astral runs complete on machines with 8 GB heap (previously OOM'd and froze the OS), (b) cap parallel-BuildSA peak heap regardless of FASTA size, and (c) close a latent state-mutation drift between the precursor calibrator pre-pass and the main search.
PR scope is intentionally tight — it ships what was fully validated on Astral 3-arm + Percolator. Larger DB-scaling work (primitive mass index, primitive occurrence list, FASTA shard orchestration, hard memory ceiling on the parser cache) is deferred to follow-up PRs with their own benchmark cycles.
Five focused commits in this update
- `fix(buildsa): use readFully when loading .cseq sequence bytes` — `in.read(byte[])` is allowed to return short, silently corrupting the in-memory sequence on large `.cseq` files.
- `perf(buildsa): stream parallel sort output to per-worker temp files` — replaces the full `RangeBuffer` (int[] + byte[]) per worker with disk-backed temp files; peak heap during merge is now O(numThreads × small write-buffer) instead of O(total suffixes). Bit-identity test included.
- `fix(mzml): bound parser cache + MS-level preload filter + defensive copy on read` — three coupled changes to `StaxMzMLParser`: skip MS1 binary decode entirely (the dominant heap on Astral), defensive copy on every cache hit (closes the calibrator-vs-main mutation drift the existing 10K-SpecKey gate was working around), `MassCalibrator` comment updated.
- `refactor(msgfplus): defer per-task ScoredSpectraMap construction to worker thread` — `Supplier<ScoredSpectraMap>` plumbing so each worker builds its own task state on the thread that consumes it; cleaner ownership.
- `refactor(mzml): drop misleading bounded-cache cap; keep MS-filter + defensive copy` — the LRU cap from commit 3 was a no-op on the very files it advertised protecting (no seek-on-demand fallback existed for evictions). Removed per review feedback. Hard memory bounding for very-large-MS2 mzML is deferred until accurate byte-offset tracking + seek-on-demand lands.

Astral 3-arm + Percolator validation (cold, 8 GB heap, 4 threads)
Dataset: ProteoBench Module 8 (`LFQ_Astral_DDA_15min_50ng_Condition_A_REP1.mzML`, 1.9 GB) vs. `ProteoBenchFASTA_MixedSpecies_HYE.fasta`. Percolator via biocontainers 3.7.1, `--seed 42`.

Deltas vs. master
armB raw target/decoy counts are byte-identical to master — the search produces exactly the same PSMs. The 1% FDR lift comes from richer Percolator features in the branch's `.pin` (`longest_b` / `longest_y` / `longest_y_pct`, added in PR #23 before this PR), not from this PR.

Previously, this same 8 GB / 4-thread configuration hit OOM-killer / swap thrash and froze the OS. Now all three arms complete cleanly.
What this PR does NOT change
- `.pin` schema or feature columns.
- On-disk index files (`.csarr` / `.cnlcp` / `.cseq` / `.canno`) — verified bit-identical between the N=1 and N=4 BuildSA paths past the per-build header.
Test plan
`mvn -B -o test -Dtest='TestStaxMzMLParser,TestStaxMzMLParserErrorContext,TestParsers,TestBuildSAParallelBitIdentity,TestDirectPinWriter,TestMassCalibrator,TestPrecursorCalScaffolding,TestMSLevelFiltering,TestPercolator,TestSA,TestConcurrentMSGFPlus,TestPrimitiveRegression,TestMinSpectraPerThread'` → 88/88 passing
- Primitive mass index (`TreeMap.subMap()` in `ScoredSpectraMap`)
- Primitive occurrence list (`TreeSet<Integer>` in `DatabaseMatch`)