STAR fails on very large inputs #1015

Open
dvdesolve opened this issue Aug 28, 2020 · 4 comments
Labels
issue: input issue related to input files

Comments

@dvdesolve

dvdesolve commented Aug 28, 2020

Hardware information:

  • OS used: Windows 10 64 bit / WSL 1
  • RAM: 16 GB
  • CPU: Core i7 4710MQ

We're trying to run STAR on paired FASTQ input files, each containing 87264604 reads (file size is approx. 11.68 GB). After a long run time, we get the following error:

EXITING because of FATAL ERROR in reads input: quality string length is not equal to sequence length
@SRR594393.75724218
ATTGTGCTTGCCACAAGTCAGTGTGCTCCCTGGGTGCCCAGGAGCTAGAA
GG
SOLUTION: fix your fastq file

However, our input file is correct. Here is the offending excerpt from it, obtained with zcat SRR594393/SRR594393_2.fastq.gz | head -n 302896878 | tail:

@SRR594393.75724218 75724218/2
ATTGTGCTTGCCACAAGTCAGTGTGCTCCCTGGGTGCCCAGGAGCTAGAA
+
GGGGFGGFGGGGGGGGGFGGFEFEFGGFFGGGGGFGGGGGGGGGEGGGFG

As you can see, the length of the quality string matches the length of the sequence string. However, the error message shows only two symbols (GG) in place of the quality string, and they are obviously taken from it. It seems the + separator line was ignored, and that caused the error.
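
The length comparison described above can be reproduced independently of STAR. The following is a minimal sketch, assuming strict four-line FASTQ records (header, sequence, `+`, quality); the function name `find_length_mismatches` and the sample records are hypothetical and are not part of STAR.

```python
from typing import Iterable, Iterator, Tuple


def find_length_mismatches(lines: Iterable[str]) -> Iterator[Tuple[str, int, int]]:
    """Yield (read_id, seq_len, qual_len) for records whose lengths differ.

    Assumes strict four-line FASTQ records: header, sequence, '+', quality.
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, seq, _plus, qual = record
            if len(seq) != len(qual):
                # Report only the read ID (first whitespace-separated token).
                yield header.split()[0], len(seq), len(qual)
            record = []


# Two synthetic records: the first is consistent, the second has a short quality string.
sample = [
    "@r1 1/2\n", "ACGT\n", "+\n", "IIII\n",
    "@r2 2/2\n", "ACGT\n", "+\n", "II\n",
]
mismatches = list(find_length_mismatches(sample))
print(mismatches)
```

To scan a real compressed file, the same function could be fed the text stream from `gzip.open(path, "rt")`.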

Could this be related to an integer overflow somewhere in the code because of the large file size?
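
As a rough sanity check on the overflow question, the numbers quoted in this issue can be compared against 32-bit limits. This is only an arithmetic sketch: it assumes "11.68 GB" means 11.68 × 10^9 bytes, and it says nothing about which integer types STAR actually uses internally. The helper `fits_in_bits` is hypothetical.

```python
def fits_in_bits(value: int, bits: int, signed: bool) -> bool:
    """Return True if a non-negative value is representable in the given integer width."""
    upper = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
    return 0 <= value <= upper


line_number = 302_896_878         # line count used in the `head -n` command in this issue
file_size_bytes = 11_680_000_000  # assumption: "approx 11.68 GB" taken as 11.68e9 bytes

# A line or read counter of this size still fits in a signed 32-bit integer.
line_fits_int32 = fits_in_bits(line_number, 32, signed=True)

# A byte offset of this size fits in neither a signed nor an unsigned 32-bit integer.
size_fits_int32 = fits_in_bits(file_size_bytes, 32, signed=True)
size_fits_uint32 = fits_in_bits(file_size_bytes, 32, signed=False)

print(line_fits_int32, size_fits_int32, size_fits_uint32)
```

Under these assumptions, only a 32-bit byte offset (not a 32-bit line or read counter) could overflow on input of this size.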

@alexdobin
Owner

Hi Viktor,

I think the problem could be caused by extra invisible Windows characters (such as \r).
Which version of STAR are you using? 2.7.5c had a fix for this kind of issue.
Please try running
zcat SRR594393/SRR594393_2.fastq.gz | head -n 302896878 | tail -n4 | hexdump -c
to check for these characters, for both read1 and read2.

Cheers
Alex
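
The same check for carriage returns can be scripted. Below is a minimal sketch that reports which lines of a byte string end in `\r`; the function name `lines_with_carriage_return` and the sample records are hypothetical, and the approach is an alternative to the hexdump command above, not something taken from STAR.

```python
from typing import List


def lines_with_carriage_return(raw: bytes) -> List[int]:
    """Return 1-based numbers of lines that end with a carriage return (\\r)."""
    hits = []
    for number, line in enumerate(raw.split(b"\n"), start=1):
        if line.endswith(b"\r"):
            hits.append(number)
    return hits


# Synthetic four-line records with Unix (\n) and Windows (\r\n) line endings.
unix_record = b"@r1\nACGT\n+\nIIII\n"
windows_record = b"@r1\r\nACGT\r\n+\r\nIIII\r\n"

unix_hits = lines_with_carriage_return(unix_record)
windows_hits = lines_with_carriage_return(windows_record)
print(unix_hits, windows_hits)
```

For a real file, the bytes of the suspect record could be read with `gzip.open(path, "rb")` and passed to the function.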

@alexdobin alexdobin added the issue: input issue related to input files label Sep 1, 2020
@dvdesolve
Author

It seems the problem isn't related to this specific file. We sent it to a friend, and he was able to complete a run on it without any issues. FWIW, he has an 8-core CPU and about 64 GB of RAM.

@alexdobin
Owner

Hi Viktor,

When it succeeded, did it run on a "true" Linux, not WSL?
Also, can you try running it with --readMapNumber 1000 to see if it works for a small input?

Cheers
Alex
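
Another way to test a small input, independent of the --readMapNumber option, is to cut the first N records out of each FASTQ file. The sketch below assumes strict four-line records; the function name `first_records` is hypothetical. For paired-end data, the same N would need to be applied to both files so the mates stay in sync.

```python
from itertools import islice
from typing import Iterable, List


def first_records(lines: Iterable[str], n_records: int) -> List[str]:
    """Return the lines of the first n_records FASTQ records (4 lines each)."""
    return list(islice(lines, 4 * n_records))


# Three synthetic records; keep only the first two.
sample = [
    "@r1\n", "ACGT\n", "+\n", "IIII\n",
    "@r2\n", "TTGA\n", "+\n", "IIII\n",
    "@r3\n", "CCAG\n", "+\n", "IIII\n",
]
subset = first_records(sample, 2)
print(len(subset))
```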

@dvdesolve
Author

dvdesolve commented Sep 18, 2020 via email
