What ArrayIndexOutOfBoundsException means -- and what to do about it
GATK errors normally come with a message that tries to explain what went wrong. If possible, the message suggests a solution or at least links to a helpful documentation article. But sometimes you can get an error called "ArrayIndexOutOfBoundsException", and it doesn't come with a helpful message. The error message is just a cryptic number. What's up with that?
We can't easily put in a message that suggests a fix because this error can happen with different tools and for different reasons. So there's no one-size-fits-all solution! In this post I'm going to try to explain what the error actually means and what you can do about it.
Internally, GATK tools store most of the data they handle in arrays, which are basically a sort of list in computer jargon. The index is a number that specifies a position in the array. Arrays are created with a fixed size based on the number of elements we want (or expect) to store in them. For example, when we create an array to hold the sequence of a read, we create it to match the exact number of bases in the read. The ArrayIndexOutOfBoundsException error happens when a tool tries to access an index position that does not exist, usually because it is equal to or larger than the size of the array. (In Java, the language GATK is written in, positions are counted from 0, so a 12-element array only has positions 0 through 11.) The cryptic number in the error message is the index position the tool tried to access.
For example, let’s say you have 15 eggs, but you only have an egg carton that can fit 12 eggs. You will not be able to store the 13th, 14th and 15th eggs. As a thinking human, you know you can just get another carton, or store the extra eggs in those little egg holders in the fridge door. But if you're a GATK tool (ahem), when you try to store the 13th egg, there's no room in the carton so the egg falls and you throw an ArrayIndexOutOfBoundsException. Or if you tried to pick a 13th egg out of a 12-egg carton, as a human you would just look silly. But as a GATK tool you would freak out and throw an ArrayIndexOutOfBoundsException again.
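To make the egg carton concrete, here is a minimal Java sketch of the same situation. It is only an illustration, not GATK code: the class name `EggCarton` and the helper `describeAccess` are made up for this example. The carton is an array with 12 slots, and because Java counts from 0, the 13th egg would go at index 12, which does not exist.

```java
// Illustrative sketch only; EggCarton and describeAccess are hypothetical names.
class EggCarton {

    // Tries to put an egg in the given slot and reports what happened.
    static String describeAccess(int[] carton, int index) {
        try {
            carton[index] = 1; // 1 means "there is an egg in this slot"
            return "stored egg at index " + index;
        } catch (ArrayIndexOutOfBoundsException e) {
            // The exception's message is generated by Java, not by the tool.
            return "ArrayIndexOutOfBoundsException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        int[] carton = new int[12]; // 12 slots, valid indices are 0 through 11

        // The 12th egg goes in the last valid slot.
        System.out.println(describeAccess(carton, 11));

        // The 13th egg has nowhere to go, so Java throws the exception.
        System.out.println(describeAccess(carton, 12));
    }
}
```

The exact wording of the message depends on the Java version you are running. Older versions typically report only the offending index, which is why the error can look like nothing more than a number.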
Now, let's tie the egg examples into a more realistic sequencing data example, to illustrate why the tool would try to access a nonexistent position in the first place. Let's say we have a read record containing its sequence and corresponding base qualities, which are both stored as arrays of characters. The two arrays are expected to be exactly the same length, so if I'm interested in the 16th base, I can look up its quality score by taking the 16th element in the array of base qualities. But what if the read record is malformed, and only has 12 base qualities despite having a 16-base sequence? You guessed it -- I will get an ArrayIndexOutOfBoundsException error when I try to look for the 16th quality score.
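The malformed read scenario can be sketched the same way. Again, this is a simplified illustration under stated assumptions and not how GATK actually represents reads: the class `ReadCheck`, the method `qualityAt` and the example sequence and quality strings are all hypothetical. The sketch shows both the raw failure and how checking the two lengths up front turns a bare number into a message that points at the real problem.

```java
// Illustrative sketch only; ReadCheck and qualityAt are hypothetical names.
class ReadCheck {

    // Returns the quality character for the base at a zero-based position,
    // after checking that the read has one quality per base.
    static char qualityAt(char[] bases, char[] quals, int position) {
        if (bases.length != quals.length) {
            throw new IllegalStateException("Malformed read: " + bases.length
                    + " bases but " + quals.length + " base qualities");
        }
        return quals[position];
    }

    public static void main(String[] args) {
        char[] bases = "ACGTACGTACGTACGT".toCharArray(); // 16 bases
        char[] quals = "IIIIIIIIIIII".toCharArray();     // only 12 qualities

        // Without a check: the 16th quality is index 15, which does not exist.
        try {
            char q = quals[15];
            System.out.println("quality: " + q);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("Unchecked lookup failed: " + e.getMessage());
        }

        // With a check: the same bad data produces an explanatory message.
        try {
            qualityAt(bases, quals, 15);
        } catch (IllegalStateException e) {
            System.out.println("Checked lookup failed: " + e.getMessage());
        }
    }
}
```

A tool cannot anticipate every way a file might be malformed, which is one reason the unchecked version of this failure still surfaces from time to time.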
That was an example of bad data formatting. These errors can also be caused by miscalculations due to bugs, of course. But if the tool is working for everyone else except you, maybe the problem is with your data! So your first step in troubleshooting this sort of error should be to validate all your data files (Picard ValidateSamFile for BAMs, GATK ValidateVariants for VCFs). If the validations are all okay, then try with the latest version of GATK. If the error still occurs, let us know in the forum and we'll help you figure it out!