Problem parsing fasta file #49

Closed
JC-therea opened this issue Aug 16, 2022 · 3 comments

@JC-therea

Dear author of RNA-Bloom,

I am using your software to assemble direct RNA reads from several species, but I am getting errors for some of them.

Input file and command:

rnabloom -long $READS -stranded -t 8 -outdir $OUTDIR

The output that I get is the following:

RNA-Bloom v2.0.0
args: [-long, Input.fa, -stranded, -t, 8, -outdir, Output]

name:   rnabloom
outdir: Output

Turning on option `-ntcard` to count k-mers

K-mer counting with ntCard...
Running command: `ntcard -t 8 -k 25 -c 65535 -p Output/rnabloom @Output/rnabloom.ntcard.readslist.txt`...
Parsing histogram file `Output/rnabloom_k25.hist`...
Unique k-mers (k=25):     57,234,431
Unique k-mers (k=25,c>1): 11,105,972
K-mer counting completed in 21.059s

Bloom filters          Memory (GB)
====================================
de Bruijn graph:       0.12647936
k-mer counting:        0.19634001
====================================
Total:                 0.32281935

> Stage 1: Construct graph from reads (k=25)
Parsing `Input.fa`...
Parsed 477,740 sequences in 1m 0s
DBG Bloom filter FPR:                 1.06 %
Counting Bloom filter FPR:            1.17 %
> Stage 1 completed in 1m 1s

> Stage 2: Correct long reads for "rnabloom"
Parsing `Input.fa`...
Index -1 out of bounds for length 4
Index -1 out of bounds for length 4
java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 4
	at rnabloom.util.SeqUtils.isLowComplexityLong(SeqUtils.java:619)
	at rnabloom.util.SeqUtils.trimLowComplexityRegions(SeqUtils.java:848)
	at rnabloom.RNABloom$LongReadCorrectionWorker.run(RNABloom.java:3791)
	at java.base/java.lang.Thread.run(Thread.java:834)
java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 4
	at rnabloom.util.SeqUtils.isLowComplexityLong(SeqUtils.java:619)
	at rnabloom.util.SeqUtils.trimLowComplexityRegions(SeqUtils.java:848)
	at rnabloom.RNABloom$LongReadCorrectionWorker.run(RNABloom.java:3791)
	at java.base/java.lang.Thread.run(Thread.java:834)
null
java.lang.ArrayIndexOutOfBoundsException
null
java.lang.ArrayIndexOutOfBoundsException
null
java.lang.ArrayIndexOutOfBoundsException
null
java.lang.ArrayIndexOutOfBoundsException
Index -1 out of bounds for length 4
java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 4
	at rnabloom.util.SeqUtils.isLowComplexityLong(SeqUtils.java:605)
	at rnabloom.util.SeqUtils.trimLowComplexityRegions(SeqUtils.java:848)
	at rnabloom.RNABloom$LongReadCorrectionWorker.run(RNABloom.java:3791)
	at java.base/java.lang.Thread.run(Thread.java:834)
null
java.lang.ArrayIndexOutOfBoundsException
Corrected Read Lengths Sampling Distribution (n=4528)
	min	q1	med	q3	max
	239	776	1112	1635	5315
ERROR: null
java.lang.ArrayIndexOutOfBoundsException

Program version:

RNA-Bloom v2.0.0
openjdk version "11.0.1" 2018-10-16 LTS
OpenJDK Runtime Environment Zulu11.2+3 (build 11.0.1+13-LTS)
OpenJDK 64-Bit Server VM Zulu11.2+3 (build 11.0.1+13-LTS, mixed mode)

Any help that you can provide would be appreciated.

@kmnip kmnip self-assigned this Aug 16, 2022
@kmnip kmnip added the enhancement New feature or request label Aug 16, 2022
@kmnip
Collaborator

kmnip commented Aug 16, 2022

Hi @JC-therea ,

Thanks for reporting this!

Your read file contains N characters in the sequences and RNA-Bloom currently doesn't work with reads containing non-ACGT characters.
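To confirm which reads are affected before re-running the assembly, a small check along these lines could help. This is only a minimal sketch and not part of RNA-Bloom; the function names `read_fasta` and `find_non_acgt_reads` are made up for illustration, and it assumes a plain (uncompressed) FASTA file where anything other than A, C, G, or T counts as unsupported.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple

# Assumption: anything other than A, C, G, T (either case) is unsupported.
NON_ACGT = re.compile(r"[^ACGTacgt]")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)  # sequence may span several lines
    if name is not None:
        yield name, "".join(chunks)


def find_non_acgt_reads(lines: Iterable[str]) -> Dict[str, int]:
    """Map each offending read header to its number of non-ACGT characters."""
    hits = {}
    for name, seq in read_fasta(lines):
        count = len(NON_ACGT.findall(seq))
        if count:
            hits[name] = count
    return hits
```

An open file handle can be passed directly, e.g. `find_non_acgt_reads(open("Input.fa"))`; an empty result would suggest that the reads contain only A, C, G, and T.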

A temporary solution is to simply cut your reads at Ns. You can do so easily with seqtk, e.g.

seqtk cutN -n 1 input.fa > input.noN.fa

RNA-Bloom should work fine when provided with these reads.
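If seqtk is not available, the same idea (cutting each read wherever an unsupported character occurs) can be sketched in a few lines of Python. This is an illustration only, not a reimplementation of `seqtk cutN`: the function name `split_at_non_acgt`, the `min_len` parameter, and the `name:start-end` naming of the pieces are assumptions made for this example, and unlike `cutN` it splits at any non-ACGT character rather than only at N.

```python
import re
from typing import List, Tuple

# Maximal runs of supported bases; everything else acts as a cut point.
ACGT_RUN = re.compile(r"[ACGTacgt]+")


def split_at_non_acgt(name: str, seq: str, min_len: int = 1) -> List[Tuple[str, str]]:
    """Split one read into ACGT-only pieces.

    Each piece is named "<name>:<start>-<end>" using 1-based, inclusive
    coordinates on the original read (a naming convention chosen for this
    sketch). Pieces shorter than min_len are dropped.
    """
    pieces = []
    for match in ACGT_RUN.finditer(seq):
        fragment = match.group()
        if len(fragment) >= min_len:
            piece_name = f"{name}:{match.start() + 1}-{match.end()}"
            pieces.append((piece_name, fragment))
    return pieces


def to_fasta(pieces: List[Tuple[str, str]]) -> str:
    """Format (name, sequence) pairs as FASTA text, one line per sequence."""
    return "".join(f">{name}\n{seq}\n" for name, seq in pieces)
```

Raising `min_len` would discard very short fragments left over between closely spaced Ns, which may be worth considering since such fragments are unlikely to contribute much to an assembly.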

I will add support for N-containing reads in a future release.

Hope that helps,
Ka Ming

@kmnip
Collaborator

kmnip commented Aug 16, 2022

I am a bit curious about how these N characters arose.
Did you pre-process your raw reads, e.g., by masking bases with low quality scores?

@JC-therea
Author

Thank you very much @kmnip! I did not expect to have Ns in my reads either. The direct RNA reads were merged with Illumina reads (with fmlrc), and after that I applied a second error correction step with TranscriptClean. The Ns were probably introduced at that step because some of the genomes that I am using are hard-masked.

Anyway, thank you very much for noticing that; you helped me a lot!

@kmnip kmnip added the to be released An update will be available in the next release label Sep 8, 2022
@kmnip kmnip removed the to be released An update will be available in the next release label Mar 17, 2023