TLDR
Large paired fastq files with unequal read lengths can end up improperly paired in memory, because reads are batched using a fixed-size character buffer. Workaround: interleave the input files and run a single pass. Repeat manually for additional passes.
Description of Problem
I was attempting to run BBNorm on paired fastq reads which had been pre-processed with FastP.
Because the forward and reverse reads differed in quality, the two reads of a pair were no longer guaranteed to be the same length after pre-processing.
When attempting to run BBNorm, I received the following error:
Exception in thread "Thread-42" java.lang.AssertionError: List size mismatch: 750 vs 755
at stream.PairStreamer.nextList(PairStreamer.java:86)
at bloom.ReadCounter$CountThread.count(ReadCounter.java:696)
at bloom.ReadCounter$CountThread.run(ReadCounter.java:624)
There were additional errors of the same form.
The number of reads in my forward and reverse read files was identical, and the reads were properly paired.
This error is reproducible with any pair of fastq files where both are larger than 262144 bytes but the number of complete reads in the first 262144 bytes differs between the two files.
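To check whether a pair of files meets this condition, one can count the complete records in the first 262144 bytes of each. The sketch below is an illustration on synthetic data, not BBTools code: `make_fastq` and `complete_reads_in_prefix` are hypothetical helper names, the read lengths (150 and 140) are made up, and it assumes plain 4-line fastq records.

```python
BUFFER_BYTES = 262144  # buffer size quoted in the report above

def make_fastq(num_reads, read_len):
    """Build synthetic fastq bytes; every record uses the same read length."""
    records = []
    for i in range(num_reads):
        records.append(f"@read{i}\n{'A' * read_len}\n+\n{'I' * read_len}\n")
    return "".join(records).encode("ascii")

def complete_reads_in_prefix(data, limit=BUFFER_BYTES):
    """Count whole 4-line records fully contained in the first `limit` bytes."""
    return data[:limit].count(b"\n") // 4

# Same number of reads in both files, but different read lengths (assumed values).
forward = make_fastq(2000, 150)
reverse = make_fastq(2000, 140)

fwd_count = complete_reads_in_prefix(forward)
rev_count = complete_reads_in_prefix(reverse)
print("complete forward reads in first buffer:", fwd_count)
print("complete reverse reads in first buffer:", rev_count)
```

For real files, the same count can be taken from the first 262144 bytes read in binary mode. If the two counts differ, the files satisfy the condition described above.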
Cause of Problem
After carefully tracing the code in jgi/KmerNormalize, I found the issue was "caused" in stream/FastqScanStreamer.
When reading the file, a fixed number of bytes is read into a buffer and then split on newlines into individual reads. The complete reads are put into a list and pushed into the queue. The same process occurs for the reverse reads. The problem is that a fixed number of characters can hold a different number of reads when read lengths do not match between the forward and reverse files.
In my case, the first buffer-sized chunk of my fastq files contained 750 forward reads and 755 reverse reads (as noted in the error message).
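The behaviour described above can be modelled in a few lines. This is a simplified model of my understanding, not the actual FastqScanStreamer code: `chunked_records` and `fastq_bytes` are hypothetical names, and the buffer size and read lengths are shrunk to keep the example small.

```python
import io

def chunked_records(stream, buffer_bytes):
    """Yield one list of complete 4-line fastq records per buffer fill."""
    leftover = b""
    while True:
        block = stream.read(buffer_bytes)
        if not block:
            break
        lines = (leftover + block).split(b"\n")
        partial = lines.pop()  # text after the last newline (may be empty)
        usable = len(lines) - len(lines) % 4  # lines forming whole records
        # Carry the lines of an incomplete record over to the next buffer.
        leftover = b"\n".join(lines[usable:] + [partial])
        yield [tuple(lines[i:i + 4]) for i in range(0, usable, 4)]

def fastq_bytes(lengths):
    """Build synthetic fastq bytes with one record per requested read length."""
    return "".join(
        f"@r{i}\n{'A' * n}\n+\n{'I' * n}\n" for i, n in enumerate(lengths)
    ).encode("ascii")

# 40 properly paired reads, but forward reads are longer than reverse reads.
fwd = fastq_bytes([150] * 40)
rev = fastq_bytes([120] * 40)

fwd_sizes = [len(c) for c in chunked_records(io.BytesIO(fwd), 4096)]
rev_sizes = [len(c) for c in chunked_records(io.BytesIO(rev), 4096)]
print("forward list sizes:", fwd_sizes)
print("reverse list sizes:", rev_sizes)
```

Both streams deliver all 40 reads in total, but the per-buffer list sizes differ, which is the kind of mismatch the assertion reports.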
When the PairStreamer class (the one named in the stack trace) compares the two lists of reads taken from the forward and reverse queues, the assertion fails. If assertions are disabled with the -da option, reads are no longer properly paired: the 5 extra reverse reads are ignored instead of being matched to the first five reads in the next list of forward reads. Alternatively, if the list of forward reads is the larger one, a different error is thrown, because an attempt is made to access a reverse read which doesn't exist.
Workaround
I attempted to work around the problem by interleaving the input files. This does allow the program to run the first pass, but the issue occurs again on the second pass, because intermediate, non-interleaved fastq files are generated. One can manually get a second pass by re-interleaving the output and starting a new BBNorm instance, but without reading the rest of the code, I'm unsure whether that has unexpected side effects.
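For anyone wanting to reproduce the interleaving step without extra tooling, here is a minimal sketch. It is not the method I necessarily used and not part of BBTools: `read_records` and `interleave` are hypothetical names, and it assumes unwrapped 4-line fastq records with reads in the same order in both files.

```python
import io
from itertools import islice

def read_records(handle):
    """Yield fastq records as lists of 4 lines (assumes unwrapped records)."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated fastq record")
        yield record

def interleave(fwd_handle, rev_handle, out_handle):
    """Write forward and reverse records alternately; return the pair count."""
    rev_iter = read_records(rev_handle)
    pairs = 0
    for fwd in read_records(fwd_handle):
        rev = next(rev_iter, None)
        if rev is None:
            raise ValueError("reverse file has fewer reads than forward file")
        out_handle.writelines(fwd)
        out_handle.writelines(rev)
        pairs += 1
    if next(rev_iter, None) is not None:
        raise ValueError("reverse file has more reads than forward file")
    return pairs

# Tiny in-memory example with unequal read lengths within each pair.
fwd = io.StringIO("@a/1\nACGT\n+\nIIII\n@b/1\nAC\n+\nII\n")
rev = io.StringIO("@a/2\nTTG\n+\nIII\n@b/2\nGGCA\n+\nIIII\n")
out = io.StringIO()
n_pairs = interleave(fwd, rev, out)
print(out.getvalue())
```

With real files, the handles would come from `open(path)` or, for gzipped input, `gzip.open(path, "rt")`. Because pairing is done record by record, differing read lengths do not matter here.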
Suggestions for a fix
The forward and reverse streamers are independent, and the threads need to stay independent too, which makes this difficult to fix once an unequal number of reads has been read from the two streams and placed in the queues. Perhaps a fixed number of reads could be added to the queue each time (except for the final chunk of reads)? That would require FastqScanStreamer to hold onto reads between buffer processing steps, with uncertain memory requirements, so it is not ideal.
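To make the fixed-count idea concrete, here is a rough sketch of regrouping variable-sized lists into lists of a fixed number of reads. It is only an illustration of the suggestion, not a patch for FastqScanStreamer: `fixed_size_batches` is a hypothetical name, reads are stood in for by integers, and apart from the 750 and 755 from the error message the chunk sizes are invented.

```python
def fixed_size_batches(chunks, batch_size):
    """Regroup lists of reads into lists of exactly batch_size reads.

    Only the final batch may be shorter. Reads that do not fill a batch
    are held in `pending` until the next chunk arrives.
    """
    pending = []
    for chunk in chunks:
        pending.extend(chunk)
        while len(pending) >= batch_size:
            yield pending[:batch_size]
            pending = pending[batch_size:]
    if pending:
        yield pending

# Integers stand in for reads; each stream carries the same 1600 reads,
# split at different points as a byte-based buffer might split them.
fwd_chunks = [list(range(0, 750)), list(range(750, 1500)), list(range(1500, 1600))]
rev_chunks = [list(range(0, 755)), list(range(755, 1510)), list(range(1510, 1600))]

fwd_batches = list(fixed_size_batches(fwd_chunks, 200))
rev_batches = list(fixed_size_batches(rev_chunks, 200))
print([len(b) for b in fwd_batches])
print([len(b) for b in rev_batches])
```

In this sketch the reads carried over between chunks never exceed `batch_size - 1`, so the extra memory would be bounded by the chosen batch size. Whether that holds inside the real streamer, I can't say without reading more of the code.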
Check for Insight
Is there a reason not to use BBNorm on pre-processed reads in this way? Are there assumptions about equal length with paired reads which make the normalization work unexpectedly?
I could go back to the raw reads, use BBNorm and then pre-process them. I was hoping to avoid this as I only need the digital normalization for the assembly portion of my analysis pipeline, and want to keep all reads for the rest.