Summary
mapPacBio.sh with intronlen=50 (splice-aware mode) crashes with a Java AssertionError when processing long-read (ONT/PacBio) sub-reads whose spliced genomic extent (greflen) exceeds msa.maxColumns - 80 = 7520. The padding-shrinkage while-loops in TranslateColorspaceRead.realign_new have no lower-bound guard, driving extraPadLeft and extraPadRight to negative values (e.g. -30, -30).
Environment
- BBTools version: 39.26 (from
bbmap/current/ install)
- Java: 1.8.0 (OpenJDK,
-ea assertions enabled)
- Command:
mapPacBio.sh with intronlen=50 maxindel=300000 ...
- Input: Oxford Nanopore DRS reads, ~200–32000 bp, split at 6000 bp before alignment
- Genome: S. cerevisiae (R64), but the bug is geometry-dependent and not organism-specific
Crash output
java.lang.AssertionError: -30, -30
<read_id>
at align2.TranslateColorspaceRead.realign_new(TranslateColorspaceRead.java:407)
at align2.SiteScore.realign(SiteScore.java:263)
at align2.BBMapThreadPacBio.realignAndTest(BBMapThreadPacBio.java:791)
at align2.BBMapThreadPacBio.processRead(BBMapThreadPacBio.java:552)
at align2.BBMapThreadPacBio.run(BBMapThreadPacBio.java:155)
at java.lang.Thread.run(Thread.java:750)
Followed (after thread death) by:
java.lang.AssertionError:
The number of reads out does not add up to the number of reads in.
<stats breakdown> = 316339 != 316340
at align2.AbstractMapper.printOutputStats(AbstractMapper.java:~1970)
Exit code: non-zero (killed by assertion). The process loses exactly 1 read per crashing thread.
Root cause
Trigger condition
In TranslateColorspaceRead.realign_new (splice-aware branch, ss.gaps != null):
// TranslateColorspaceRead.java ~line 394–407
int greflen = Tools.max(bases.length, GapTools.calcGrefLen(minLoc, maxLoc, ss.gaps));
int newlen = (greflen + 1 + extraPadLeft + extraPadRight);
if (newlen >= msa.maxColumns - 80) {
while (newlen >= msa.maxColumns-80 && extraPadLeft > extraPadRight) { newlen--; extraPadLeft--; }
while (newlen >= msa.maxColumns-80 && extraPadLeft < extraPadRight) { newlen--; extraPadRight--; }
while (newlen >= msa.maxColumns-80) { newlen -= 2; extraPadLeft--; extraPadRight--; }
// ← NO lower-bound guard here
}
assert(extraPadLeft >= 0 && extraPadRight >= 0) : extraPadLeft + ", " + extraPadRight + ...;
ALIGN_COLUMNS = 7600 (hardcoded in BBIndexPacBio.java:2474), so the threshold is 7520.
How the pads go negative
When a spliced sub-read has greflen ≈ 7579–7581 (e.g. a ~600 bp exon-split read spanning a 7000 bp intron):
newlen = greflen + 1 + extraPadLeft + extraPadRight
≈ 7580 + 1 + 30 + 30 = 7641 (example initial pads)
excess = 7641 - 7520 = 121
The first two while-loops equalise the pads (or are no-ops if already equal). The third loop fires (121 - 1) / 2 = 60 times, decrementing both pads by 60:
extraPadLeft = 30 - 60 = -30
extraPadRight = 30 - 60 = -30
The assertion fires.
If extraPadLeft == extraPadRight == 0 initially (minimum padding case), excess of 60 above threshold drives both to -30 in 30 iterations. This is reproducible with the same read and same reference; the crash is 100% deterministic for affected reads.
Why removing -ea does not fix the bug
Without -ea, the assertion becomes a no-op, but extraPadLeft = -30 and extraPadRight = -30 are then passed to the MSA array sizing code. This produces ArrayIndexOutOfBoundsException instead — same crash, different exception class.
Why ALIGN_COLUMNS cannot be raised at runtime
BBIndexPacBio.java:2474:
public static final int ALIGN_COLUMNS = 7600;
This is static final and referenced at index-build time to size arrays. It cannot be overridden via command-line parameters.
Suggested fix
Clamp extraPadLeft and extraPadRight to 0 before the assertion:
if (newlen >= msa.maxColumns - 80) {
while (newlen >= msa.maxColumns-80 && extraPadLeft > extraPadRight) { newlen--; extraPadLeft--; }
while (newlen >= msa.maxColumns-80 && extraPadLeft < extraPadRight) { newlen--; extraPadRight--; }
while (newlen >= msa.maxColumns-80) { newlen -= 2; extraPadLeft--; extraPadRight--; }
// ADDED: clamp — alignment will be degraded but not crash
if (extraPadLeft < 0) { newlen -= extraPadLeft; extraPadLeft = 0; }
if (extraPadRight < 0) { newlen -= extraPadRight; extraPadRight = 0; }
}
assert(extraPadLeft >= 0 && extraPadRight >= 0) : ...;
A stricter alternative is to break out of the third while-loop when either pad reaches 0:
while (newlen >= msa.maxColumns-80 && extraPadLeft > 0 && extraPadRight > 0) {
newlen -= 2; extraPadLeft--; extraPadRight--;
}
Either approach prevents the assertion/AIOOBE while producing a slightly compressed alignment window for reads that genuinely exceed ALIGN_COLUMNS. The comment already present in the code acknowledges this case: // TODO: In this case the alignment will probably be wrong.
Reproduction context
Reads involved are sub-reads split from long Nanopore DRS reads (>6000 bp) by an upstream read-splitter. A sub-read covering one exon can still span a very long intron in genomic coordinates, giving greflen >> read_length. The affected reads are processed correctly by minimap2 and other long-read aligners; the crash is specific to mapPacBio.sh with intronlen set.
Any workflow that uses mapPacBio.sh with intronlen>0 on reads from species with long introns (>7000 bp) and sufficiently large sub-reads is at risk.
Impact
One read is silently lost per crashing thread. The printOutputStats assertion then fires, causing mapPacBio to exit non-zero. Any downstream tool checking exit codes will reject the entire output BAM — even though only 1 read out of potentially hundreds of thousands was affected.
Summary
mapPacBio.shwithintronlen=50(splice-aware mode) crashes with a JavaAssertionErrorwhen processing long-read (ONT/PacBio) sub-reads whose spliced genomic extent (greflen) exceedsmsa.maxColumns - 80 = 7520. The padding-shrinkage while-loops inTranslateColorspaceRead.realign_newhave no lower-bound guard, drivingextraPadLeftandextraPadRightto negative values (e.g.-30, -30).Environment
bbmap/current/install)-eaassertions enabled)mapPacBio.shwithintronlen=50 maxindel=300000 ...Crash output
Followed (after thread death) by:
Exit code: non-zero (killed by assertion). The process loses exactly 1 read per crashing thread.
Root cause
Trigger condition
In
TranslateColorspaceRead.realign_new(splice-aware branch,ss.gaps != null):ALIGN_COLUMNS = 7600(hardcoded inBBIndexPacBio.java:2474), so the threshold is 7520.How the pads go negative
When a spliced sub-read has
greflen ≈ 7579–7581(e.g. a ~600 bp exon-split read spanning a 7000 bp intron):The first two while-loops equalise the pads (or are no-ops if already equal). The third loop fires
(121 - 1) / 2 = 60times, decrementing both pads by 60:The assertion fires.
If
extraPadLeft == extraPadRight == 0initially (minimum padding case), excess of 60 above threshold drives both to -30 in 30 iterations. This is reproducible with the same read and same reference; the crash is 100% deterministic for affected reads.Why removing
-eadoes not fix the bugWithout
-ea, the assertion becomes a no-op, butextraPadLeft = -30andextraPadRight = -30are then passed to the MSA array sizing code. This producesArrayIndexOutOfBoundsExceptioninstead — same crash, different exception class.Why
ALIGN_COLUMNScannot be raised at runtimeBBIndexPacBio.java:2474:This is
static finaland referenced at index-build time to size arrays. It cannot be overridden via command-line parameters.Suggested fix
Clamp
extraPadLeftandextraPadRightto 0 before the assertion:A stricter alternative is to break out of the third while-loop when either pad reaches 0:
Either approach prevents the assertion/AIOOBE while producing a slightly compressed alignment window for reads that genuinely exceed
ALIGN_COLUMNS. The comment already present in the code acknowledges this case:// TODO: In this case the alignment will probably be wrong.Reproduction context
Reads involved are sub-reads split from long Nanopore DRS reads (>6000 bp) by an upstream read-splitter. A sub-read covering one exon can still span a very long intron in genomic coordinates, giving
greflen >> read_length. The affected reads are processed correctly by minimap2 and other long-read aligners; the crash is specific tomapPacBio.shwithintronlenset.Any workflow that uses
mapPacBio.shwithintronlen>0on reads from species with long introns (>7000 bp) and sufficiently large sub-reads is at risk.Impact
One read is silently lost per crashing thread. The
printOutputStatsassertion then fires, causing mapPacBio to exit non-zero. Any downstream tool checking exit codes will reject the entire output BAM — even though only 1 read out of potentially hundreds of thousands was affected.