Problem parsing fasta file #49

JC-therea · 2022-08-16T10:54:46Z

Dear author of RNA-Bloom,

I am using your software to assemble direct RNA reads for several species; however, I am getting errors for some of them.

Input file and command:

rnabloom -long $READS -stranded -t 8 -outdir $OUTDIR

The output that I get is the following:

RNA-Bloom v2.0.0
args: [-long, Input.fa, -stranded, -t, 8, -outdir, Output]

name:   rnabloom
outdir: Output

Turning on option `-ntcard` to count k-mers

K-mer counting with ntCard...
Running command: `ntcard -t 8 -k 25 -c 65535 -p Output/rnabloom @Output/rnabloom.ntcard.readslist.txt`...
Parsing histogram file `Output/rnabloom_k25.hist`...
Unique k-mers (k=25):     57,234,431
Unique k-mers (k=25,c>1): 11,105,972
K-mer counting completed in 21.059s

Bloom filters          Memory (GB)
====================================
de Bruijn graph:       0.12647936
k-mer counting:        0.19634001
====================================
Total:                 0.32281935

> Stage 1: Construct graph from reads (k=25)
Parsing `Input.fa`...
Parsed 477,740 sequences in 1m 0s
DBG Bloom filter FPR:                 1.06 %
Counting Bloom filter FPR:            1.17 %
> Stage 1 completed in 1m 1s

> Stage 2: Correct long reads for "rnabloom"
Parsing `Input.fa`...
Index -1 out of bounds for length 4
Index -1 out of bounds for length 4
java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 4
	at rnabloom.util.SeqUtils.isLowComplexityLong(SeqUtils.java:619)
	at rnabloom.util.SeqUtils.trimLowComplexityRegions(SeqUtils.java:848)
	at rnabloom.RNABloom$LongReadCorrectionWorker.run(RNABloom.java:3791)
	at java.base/java.lang.Thread.run(Thread.java:834)
java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 4
	at rnabloom.util.SeqUtils.isLowComplexityLong(SeqUtils.java:619)
	at rnabloom.util.SeqUtils.trimLowComplexityRegions(SeqUtils.java:848)
	at rnabloom.RNABloom$LongReadCorrectionWorker.run(RNABloom.java:3791)
	at java.base/java.lang.Thread.run(Thread.java:834)
null
java.lang.ArrayIndexOutOfBoundsException
null
java.lang.ArrayIndexOutOfBoundsException
null
java.lang.ArrayIndexOutOfBoundsException
null
java.lang.ArrayIndexOutOfBoundsException
Index -1 out of bounds for length 4
java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 4
	at rnabloom.util.SeqUtils.isLowComplexityLong(SeqUtils.java:605)
	at rnabloom.util.SeqUtils.trimLowComplexityRegions(SeqUtils.java:848)
	at rnabloom.RNABloom$LongReadCorrectionWorker.run(RNABloom.java:3791)
	at java.base/java.lang.Thread.run(Thread.java:834)
null
java.lang.ArrayIndexOutOfBoundsException
Corrected Read Lengths Sampling Distribution (n=4528)
	min	q1	med	q3	max
	239	776	1112	1635	5315
ERROR: null
java.lang.ArrayIndexOutOfBoundsException

Program version:

RNA-Bloom v2.0.0
openjdk version "11.0.1" 2018-10-16 LTS
OpenJDK Runtime Environment Zulu11.2+3 (build 11.0.1+13-LTS)
OpenJDK 64-Bit Server VM Zulu11.2+3 (build 11.0.1+13-LTS, mixed mode)

Any help that you can provide would be appreciated.

kmnip · 2022-08-16T18:03:25Z

Hi @JC-therea ,

Thanks for reporting this!

Your read file contains N characters in the sequences and RNA-Bloom currently doesn't work with reads containing non-ACGT characters.
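
To confirm which reads are affected before changing anything, a small script can scan the FASTA file for characters other than A, C, G, and T. The following is a minimal Python sketch and is not part of RNA-Bloom. The function names `iter_fasta` and `non_acgt_report` are hypothetical, and the sketch assumes lowercase bases are acceptable (sequences are upper-cased before checking), which may not match how RNA-Bloom treats soft-masked sequence.

```python
import io
from collections import Counter


def iter_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def non_acgt_report(handle):
    """Map read name -> Counter of non-ACGT characters, for offending reads only.

    Assumption: case is ignored, so 'acgt' counts as valid.
    """
    report = {}
    for header, seq in iter_fasta(handle):
        bad = Counter(c for c in seq.upper() if c not in "ACGT")
        if bad:
            fields = header.split()
            name = fields[0] if fields else header
            report[name] = bad
    return report


if __name__ == "__main__":
    # Tiny in-memory example; replace with open("Input.fa") for a real file.
    demo = io.StringIO(">read1\nACGTACGT\n>read2 extra\nACGNNTAC\nGGN\n")
    for name, counts in non_acgt_report(demo).items():
        print(name, dict(counts))
```

An empty report means every read consists only of A, C, G, and T, in which case the exception would have a different cause.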

A temporary solution is to simply cut your reads at Ns. You can do so easily with seqtk, e.g.

seqtk cutN -n 1 input.fa > input.noN.fa

RNA-Bloom should work fine when provided with these reads.
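
Where seqtk is not available, the same idea (splitting each read at every run of N and keeping the N-free pieces) can be sketched in Python. This is an illustrative approximation, not a reimplementation of `seqtk cutN`: the function names `cut_at_n` and `cut_fasta`, the `min_len` parameter, and the `name:start-end` fragment naming are all assumptions made for this example, and the output may differ from seqtk's in naming and filtering.

```python
import io
import re


def cut_at_n(name, seq, min_len=1):
    """Yield (fragment_name, fragment) for each N-free stretch of seq.

    Fragment names use 1-based inclusive coordinates (a convention chosen
    for this sketch, not necessarily what seqtk emits).
    """
    for m in re.finditer(r"[^Nn]+", seq):
        if m.end() - m.start() >= min_len:
            yield f"{name}:{m.start() + 1}-{m.end()}", m.group()


def write_fragments(out_handle, name, seq, min_len):
    """Write every N-free fragment of one read as its own FASTA record."""
    for frag_name, frag in cut_at_n(name, seq, min_len):
        out_handle.write(f">{frag_name}\n{frag}\n")


def cut_fasta(in_handle, out_handle, min_len=1):
    """Read FASTA from in_handle and write N-free fragments to out_handle."""
    name, chunks = None, []
    for line in in_handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                write_fragments(out_handle, name, "".join(chunks), min_len)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None and line:
            chunks.append(line)
    if name is not None:
        write_fragments(out_handle, name, "".join(chunks), min_len)


if __name__ == "__main__":
    # In-memory demo; for real data, pass open file handles instead.
    src = io.StringIO(">r1 desc\nACGTNNAC\nGTNA\n")
    dst = io.StringIO()
    cut_fasta(src, dst)
    print(dst.getvalue(), end="")
```

Raising `min_len` drops very short fragments, which may be worth considering because pieces shorter than the k-mer size (k=25 in the log above) cannot contribute any k-mers.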

I will add support for N-containing reads in a future release.

Hope that helps,
Ka Ming

kmnip · 2022-08-16T18:12:49Z

I am a bit curious about how these N characters arose.
Did you pre-process your raw reads, e.g. by masking bases with low quality scores?

JC-therea · 2022-08-17T14:15:25Z

Thank you very much @kmnip! I did not expect to have Ns in my reads either. The direct RNA reads were merged with Illumina reads (with fmlrc), and after that I applied a second error-correction step with TranscriptClean. The Ns were probably introduced in that step, because some of the genomes that I am using are hard-masked.

Anyway, thank you very much for noticing that; you helped me a lot!

kmnip self-assigned this Aug 16, 2022

kmnip added the "enhancement" (New feature or request) label Aug 16, 2022

JC-therea closed this as completed Aug 17, 2022

kmnip added the "to be released" (An update will be available in the next release) label Sep 8, 2022

kmnip removed the "to be released" (An update will be available in the next release) label Mar 17, 2023
