
EXITING because of FATAL ERROR in reads input: short read sequence line: 0 #493

Closed
juulluu21 opened this issue Sep 21, 2018 · 19 comments
Labels
issue: code Likely to be an issue with STAR code

Comments

@juulluu21

juulluu21 commented Sep 21, 2018

Hi,

I know this is a very old question. Has it been solved yet?
I am running STAR on paired-end reads:

star --runThreadN 50 --genomeDir index --readFilesIn ASS_1_1.fq.gz ASS__1_2.fq.gz --outFileNamePrefix ASS.mapped.sam --readFilesCommand gzcat

Sep 21 16:20:32 ..... started STAR run
Sep 21 16:20:33 ..... loading genome
Sep 21 16:20:39 ..... started mapping

EXITING because of FATAL ERROR in reads input: short read sequence line: 0
Read Name=@NB501259:103:HNTJLBGX2:1:21310:2440:4630
Read Sequence====
DEF_readNameLengthMax=50000
DEF_readSeqLengthMax=650

Sep 21 16:21:53 ...... FATAL ERROR, exiting
Segmentation fault: 11

I have trimmed my reads with cutadapt. Is there any option in STAR to work around this problem? Or do I have to do something with cutadapt now?

Thanks.

@alexdobin
Owner

Hi @juulluu21

this indicates a problem with the formatting of the fastq files.
Please check that gzcat works as a .gz decompressor on your system - it's usually just zcat.

Cheers
Alex
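A quick way to verify the decompression command before handing it to --readFilesCommand is to run it by hand and confirm it writes the uncompressed fastq to stdout. A minimal sketch with a toy one-read fastq (substitute your own file):

```shell
# Sanity-check the decompressor: it must print the uncompressed fastq
# to stdout. 'gzip -cd' is a portable equivalent of Linux's 'zcat'.
printf '@r1\nACGT\n+\nIIII\n' > tiny.fq   # toy one-read fastq
gzip -c tiny.fq > tiny.fq.gz
gzip -cd tiny.fq.gz | head -4             # should print the 4 lines back
```

If this prints garbage or errors out, STAR will see the same broken stream.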

@alexdobin alexdobin added the issue: code Likely to be an issue with STAR code label Sep 28, 2018
@juulluu21
Author

Hi Alex,

Thanks very much. I messed up during the adapter clipping step, I guess. Now it's working fine. I have also changed zcat to the gunzip -c option. Here is my command:

star --runThreadN 36 --genomeDir ./Genome --readFilesIn R1.fastq.gz R2.fastq.gz --outFileNamePrefix R.sam --readFilesCommand gunzip -c

I have two more questions.

  1. I am mapping the reads against my assembled transcriptome. Do you recommend changing any of the above parameters? The transcriptome file has ~25000 contigs (assembled transcripts).

  2. During sequencing, the library was not ribo-depleted. I am getting ~30% mapping with the parameters (command) mentioned above. Do you think this is normal? Here is my Log.final.out file:

Started job on | Oct 01 12:21:42
Started mapping on | Oct 01 12:21:49
Finished on | Oct 01 12:32:39
Mapping speed, Million of reads per hour | 234.93
Number of input reads | 42417211
Average input read length | 298
UNIQUE READS:
Uniquely mapped reads number | 11755143
Uniquely mapped reads % | 27.71%
Average mapped length | 284.56
Number of splices: Total | 84399
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 29043
Number of splices: GC/AG | 3003
Number of splices: AT/AC | 468
Number of splices: Non-canonical | 51885
Mismatch rate per base, % | 1.20%
Deletion rate per base | 0.05%
Deletion average length | 2.19
Insertion rate per base | 0.03%
Insertion average length | 3.62
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 599926
% of reads mapped to multiple loci | 1.41%
Number of reads mapped to too many loci | 327
% of reads mapped to too many loci | 0.00%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 70.85%
% of reads unmapped: other | 0.02%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%

Thanks a ton again, Alex!

@alexdobin
Owner

Hi @juulluu21

compared to the normal mapping rate of >90% to a genome, you are getting low mappability.
It might indicate an incomplete transcriptome assembly; other possibilities are contamination (BLAST a few unmapped reads to check) or poor sequencing quality (check the quality score distribution).

Cheers
Alex

@juulluu21
Author

Actually, most of the reads come from rRNA. My library prep didn't ribo-deplete the sample.

However, I see several mismatches in the reads (when I view them in IGV). Would you suggest more stringent mapping? I am also not very happy with the transcriptome I am mapping against (I see many mismatches/SNPs when we Sanger-sequence some genes).

Thanks very much Alex!

@alexdobin
Owner

Hi @juulluu21

ribosomal reads should appear as multimappers - if their sequences are present in the genome assembly.
You could try a less stringent mapping - allowing more mismatches and/or shorter mapped length.

Cheers
Alex

@juulluu21
Author

Hi Alex,

Thanks very much for your support!

I am now using these parameters:
--outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3 --outFilterMismatchNmax 20

My mapping has increased by 10%, but there are still ~65% unmapped reads (% of reads unmapped: too short):
Number of input reads | 29216568
Average input read length | 299
UNIQUE READS:
Uniquely mapped reads number | 9662772
Uniquely mapped reads % | 33.07%
Average mapped length | 250.60
Number of splices: Total | 54725
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 29601
Number of splices: GC/AG | 1814
Number of splices: AT/AC | 173
Number of splices: Non-canonical | 23137
Mismatch rate per base, % | 1.64%
Deletion rate per base | 0.07%
Deletion average length | 1.65
Insertion rate per base | 0.04%
Insertion average length | 3.82
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 515764
% of reads mapped to multiple loci | 1.77%
Number of reads mapped to too many loci | 792
% of reads mapped to too many loci | 0.00%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 65.14%
% of reads unmapped: other | 0.02%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%

If I use these parameters
--outFilterScoreMinOverLread 0.0 --outFilterMatchNminOverLread 0.0 --outFilterMismatchNmax 25
I get this mapping
Number of input reads | 29216568
Average input read length | 299
UNIQUE READS:
Uniquely mapped reads number | 21430073
Uniquely mapped reads % | 73.35%
Average mapped length | 142.05
Number of splices: Total | 58962
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 33036
Number of splices: GC/AG | 1837
Number of splices: AT/AC | 175
Number of splices: Non-canonical | 23914
Mismatch rate per base, % | 3.42%
Deletion rate per base | 0.07%
Deletion average length | 2.06
Insertion rate per base | 0.03%
Insertion average length | 3.65
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 7538270
% of reads mapped to multiple loci | 25.80%
Number of reads mapped to too many loci | 235685
% of reads mapped to too many loci | 0.81%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.02%
% of reads unmapped: too short | 0.00%
% of reads unmapped: other | 0.02%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%

If I use these parameters
--outFilterScoreMinOverLread 0.2 --outFilterMatchNminOverLread 0.2 --outFilterMismatchNmax 25

Number of input reads | 29216568
Average input read length | 299
UNIQUE READS:
Uniquely mapped reads number | 11676022
Uniquely mapped reads % | 39.96%
Average mapped length | 223.20
Number of splices: Total | 58467
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 32605
Number of splices: GC/AG | 1837
Number of splices: AT/AC | 175
Number of splices: Non-canonical | 23850
Mismatch rate per base, % | 2.23%
Deletion rate per base | 0.08%
Deletion average length | 1.93
Insertion rate per base | 0.04%
Insertion average length | 3.71
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 729810
% of reads mapped to multiple loci | 2.50%
Number of reads mapped to too many loci | 105807
% of reads mapped to too many loci | 0.36%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 57.16%
% of reads unmapped: other | 0.02%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%

If I use these parameters
--outFilterScoreMinOverLread 0.1 --outFilterMatchNminOverLread 0.1 --outFilterMismatchNmax 25

Number of input reads | 29216568
Average input read length | 299
UNIQUE READS:
Uniquely mapped reads number | 17749123
Uniquely mapped reads % | 60.75%
Average mapped length | 165.07
Number of splices: Total | 58961
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 33035
Number of splices: GC/AG | 1837
Number of splices: AT/AC | 175
Number of splices: Non-canonical | 23914
Mismatch rate per base, % | 3.23%
Deletion rate per base | 0.08%
Deletion average length | 2.06
Insertion rate per base | 0.04%
Insertion average length | 3.66
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 2750055
% of reads mapped to multiple loci | 9.41%
Number of reads mapped to too many loci | 164514
% of reads mapped to too many loci | 0.56%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.01%
% of reads unmapped: too short | 29.25%
% of reads unmapped: other | 0.02%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%

I am mapping the reads against a transcriptome (~25000 small assembled contigs). I do not have a genome. With outFilterScoreMinOverLread 0.1 and outFilterScoreMinOverLread 0.0, I guess I am getting mostly non-specific mapping. So I guess I have to be satisfied with --outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3 --outFilterMismatchNmax 20.

Do you suggest changing the balance between --outFilterScoreMinOverLread 0.3 and --outFilterMatchNminOverLread 0.3?

Thanks a ton, Alex!

@alexdobin
Owner

Hi @juulluu21

as you reduce the mapping stringency (minimum mapped length) you can map more reads; however, only small portions of those reads are mapped, which cannot guarantee a correct mapping location. Effectively, you are trading precision for higher sensitivity. You will have to consider whether your downstream analysis tolerates low precision (i.e. a high false positive rate). In general, I would not recommend going below 0.6 for these parameters.

The best solution to low mappability is to figure out why it happens.
For instance, if your transcriptome assembly is incomplete, a lot of the unmapped reads likely belong to parts of the genome not included in the assembly, and forcing them to map to the limited transcriptome will skew your results - it might be better to leave such reads unmapped.

Cheers
Alex
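To make these thresholds concrete: --outFilterMatchNminOverLread is the minimum number of matched bases normalized to the total read length (the sum of both mates for paired-end data, ~300 bp in this thread), and 0.66 is the STAR default. The following is illustrative arithmetic, not STAR output:

```shell
# Minimum matched bases a ~300 bp read pair must achieve at each
# --outFilterMatchNminOverLread value tried in this thread
# (0.66 is the STAR default). Illustrative arithmetic only.
for f in 0.66 0.3 0.2 0.1 0.0; do
  awk -v f="$f" 'BEGIN { printf "%s -> at least %.0f matched bases\n", f, f * 300 }'
done
```

At 0.1 a pair can "map" on just 30 matched bases out of 300, which is why the average mapped length in the logs above drops so sharply as the threshold goes down.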

@juulluu21
Author

Thanks Alex.
Ok, I will not compromise on stringency; I do not want low-confidence mapping. Actually, ribosomal RNAs were not depleted during library prep. If I add rRNA to my transcriptome file, the % mapping goes up to 70% and more.

I will be satisfied with my ~30% mapping.

Thanks!!

@zillurbmb51

zillurbmb51 commented Feb 21, 2019

Hello there,
I am having the same problem. I have uncompressed my fastq files manually before mapping with STAR.
(base) [zillur@genomics analysis]$ head ../concat/A1001_S20_R1.fq @NS500438:247:HHK2FBGX7:1:11101:18152:1051 1:N:0:GGAATGTC TTTTTTCACTGACCCGGTGAGGCGGGGGGGCGAGCCCCGAGGGGCTCTCGCTTCTGGCGCCAAGCGCCCGGCCGCG + AAAA/EE/AEEEEEE6EEEAEEEEEEEEEEEEEAEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEAEEAEEEEEEE< @NS500438:247:HHK2FBGX7:1:11101:20051:1053 1:N:0:GGAATGTC CCTGGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAAGCATCCCCGTTCCAGTGAGTTCACCCTCTAAATC + AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE @NS500438:247:HHK2FBGX7:1:11101:6358:1054 1:N:0:GGAATGTC TCGGGGGTCGCGTAACTAGTTAGCATGCCAGAGTCTCGTTCGTTATCGGAATTAACCAGACAAATCGCTCCACCAA
STAR is giving me same error whether the fastq files are compressed or uncompressed. Any help?
Best regards
Zillur

@alexdobin
Owner

Hi @zillurbmb51

it looks like the fastqs are not properly formatted.
The "+" has to be on a separate line, i.e.
@NS500438:247:HHK2FBGX7:1:11101:18152:1051 1:N:0:GGAATGTC
TTTTTTCACTGACCCGGTGAGGCGGGGGGGCGAGCCCCGAGGGGCTCTCGCTTCTGGCGCCAAGCGCCCGGCCGCG
+
AAAA/EE/AEEEEEE6EEEAEEEEEEEEEEEEEAEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEAEEAEEEEEEE<

Cheers
Alex
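A structural check along these lines catches merged or truncated records before STAR does. A sketch (check.fq is a toy placeholder; point it at your real file):

```shell
# Quick structural check of a fastq: every record must be 4 lines,
# with line 1 starting '@' and line 3 starting '+'.
# Prints nothing when the file is well-formed.
printf '@r1\nACGT\n+\nIIII\n' > check.fq   # toy input; use your own file
awk 'NR % 4 == 1 && $0 !~ /^@/  { print "bad header at line " NR }
     NR % 4 == 3 && $0 !~ /^\+/ { print "bad + line at line " NR }
     END { if (NR % 4 != 0) print "truncated: " NR " lines" }' check.fq
```

Note this assumes 4-line records (no wrapped sequences), which is what Illumina fastqs and STAR expect.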

@zillurbmb51

Thank you for the quick reply. The "+" is on a separate line; it got merged here after copying (see the attached screenshot)!!
Thanks again.
screen shot 2019-02-21 at 5 05 02 pm

@alexdobin
Owner

These lines look OK.
Please try to cut out a few lines, e.g.
$ head -n1000 ../concat/A1001_S20_R1.fq > smallR1.fastq
and map only those reads.
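One caveat when subsetting paired-end data: take the same number of lines (a multiple of 4) from both mate files so the reads stay paired. A sketch with toy files standing in for the real R1/R2 pair:

```shell
# Toy mate files stand in for the real A1001_S20_R1.fq / R2 pair here.
printf '@a\nACGT\n+\nIIII\n@b\nGGCC\n+\nIIII\n' > R1.fq
printf '@a\nTTAA\n+\nIIII\n@b\nCCAA\n+\nIIII\n' > R2.fq
# Same line count for both mates, and a multiple of 4
# (e.g. head -n1000 keeps the first 250 reads of each):
head -n4 R1.fq > smallR1.fastq
head -n4 R2.fq > smallR2.fastq
```

If the two subsets go out of sync, STAR will be fed mismatched mates and the test tells you nothing about the original error.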

@zehemminger

Hey Alex,

Just a follow-up question related to this post. I am having a similar error: paired-end reads with the same format as zillurbmb51's. I have not tried your suggestion of a smaller subset, but when I ran them as single-end instead of paired-end reads they aligned fine without errors. Have you seen this before? I also had issues with multiple threads: my ulimit is unlimited, but I was only able to run mapping jobs with 20 threads. Anything more and I would get a permissions error when creating files. For reference, I am running STAR 2.7.0f. Also, I just wanted to say that you do an amazing job providing support to us all!

@alexdobin
Owner

Hi @zehemminger

the OP's error was never resolved; I cannot reproduce it with my data. If you can find a small subset of reads that still causes the error, please send it to me along with the Log.out file.
For ulimit, you need to run
$ ulimit -Sn
$ ulimit -Hn
which will give you the soft and hard limits. You can then increase the soft limit up to the hard limit without root privileges.

Cheers
Alex
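In a POSIX shell the check-and-raise looks like this (raising the soft limit only lasts for the current shell session; it must be at least the hard limit's value or lower):

```shell
# Check the open-file limits STAR will run under, then raise the
# soft limit up to the hard limit (no root needed).
ulimit -Sn                   # current soft limit
ulimit -Hn                   # hard limit
ulimit -Sn "$(ulimit -Hn)"   # raise soft limit to the hard limit
ulimit -Sn                   # confirm the new soft limit
```

Run this in the same shell (or job script) that launches STAR, since the change does not persist across sessions.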

@Buuntu

Buuntu commented May 2, 2019

I am getting a similar error. When running on fastq.gz files (which I've confirmed are formatted correctly), I get:

EXITING because of FATAL ERROR in input reads: unknown file format: the read ID should start with @ or > 

I am running the command

STAR --runThreadN 16 --genomeDir star --readFilesIn forward_reads.fastq.gz reverse_reads.fastq.gz --outFileNamePrefix star --readFilesCommand gzip -c

And first few lines of a sample file:

@NB502126:79:HGWMMBGX7:1:11101:7703:1047 1:N:0:TCCGCGAA+GCCTCTAT
CTAGTTCCTTCACCCGAGTTCTCTCAAGCGCCTTGGTATTCTCTACCTGACCNNCTGTNTCGGTTTGGGGTACGAT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE##EEEE#EEEEEEEEEEEEEEEEE
@NB502126:79:HGWMMBGX7:1:11101:20738:1047 1:N:0:TCCGCGAA+GCCTCTAT
CAGCGTTCAAAGGCGGCTTCCAGCGCCTGCTCCGCTTCGGCCAGCTGCGCCANNACTTNCTGCGTACGATCGTGCG
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEEE##EEEE#EEEAEEEEEE/EEEEEE
@NB502126:79:HGWMMBGX7:1:11101:8442:1047 1:N:0:TCCGCGAA+GCCTCTAT
GCACTACATGCTTAGTCAGTCTTTACATTCGCTTGCCAGCTGCGGACAGACANNCCACNAACAAACTAGCCTGATT
+

@Buuntu

Buuntu commented May 2, 2019

Okay, so when I run with zcat or gunzip -c instead, I'm now getting:

STAR --runThreadN 16 --genomeDir output/star/ --readFilesIn output/trimmed_forward_reads.fastq.gz output/trimmed_reverse_reads.fastq.gz --outFileNamePrefix output/star --readFilesCommand gunzip -c
May 02 19:31:17 ..... started STAR run
May 02 19:31:17 ..... loading genome
May 02 19:31:18 ..... started mapping
[1]    12366 segmentation fault (core dumped)  STAR --runThreadN 16 --genomeDir output/star/ --readFilesIn      gunzip -c

@alexdobin
Owner

Hi Buuntu,

please send me the Log.out file.

Cheers
Alex

@Buuntu

Buuntu commented May 3, 2019

So I was able to fix the error by using the option --genomeSAindexNbases 10 when generating the index. My genome is 5.6 MB, so quite small. It takes a while to run with that option, but it doesn't crash anymore. I guess I will have to set this manually depending on my genome size?

Thanks Alex

@alexdobin
Owner

Right - at the moment this parameter needs to be scaled down for small genomes (see Manual section 2.2.5).
Cheers
Alex
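The manual's scaling rule is min(14, log2(GenomeLength)/2 - 1). A quick check for the 5.6 Mb genome discussed above:

```shell
# Recommended --genomeSAindexNbases for a small genome
# (STAR manual, section 2.2.5): min(14, log2(GenomeLength)/2 - 1).
awk -v len=5600000 'BEGIN {
  v = log(len) / log(2) / 2 - 1   # log2(len)/2 - 1
  if (v > 14) v = 14              # cap at the default of 14
  printf "%d\n", v
}'
```

For len=5600000 this yields 10, matching the value that fixed the crash above; for genome-sized inputs the formula exceeds 14 and the default cap applies.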
