Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Less multimapping reads, more uniquely mapped reads #215

Closed
P4sm opened this issue Nov 23, 2016 · 6 comments
Closed

Less multimapping reads, more uniquely mapped reads #215

P4sm opened this issue Nov 23, 2016 · 6 comments
Labels

Comments

@P4sm
Copy link

P4sm commented Nov 23, 2016

Hi,

I have an example of alignment summary here:

                      Number of input reads |	10206806
                  Average input read length |	51
                                UNIQUE READS:
               Uniquely mapped reads number |	8684221
                    Uniquely mapped reads % |	85.08%
                      Average mapped length |	50.55
                   Number of splices: Total |	1576957
        Number of splices: Annotated (sjdb) |	1559346
                   Number of splices: GT/AG |	1561011
                   Number of splices: GC/AG |	12339
                   Number of splices: AT/AC |	1125
           Number of splices: Non-canonical |	2482
                  Mismatch rate per base, % |	0.12%
                     Deletion rate per base |	0.00%
                    Deletion average length |	1.50
                    Insertion rate per base |	0.00%
                   Insertion average length |	1.30
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |	1431443
         % of reads mapped to multiple loci |	14.02%
    Number of reads mapped to too many loci |	31409
         % of reads mapped to too many loci |	0.31%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |	0.00%
             % of reads unmapped: too short |	0.49%
                 % of reads unmapped: other |	0.09%
                              CHIMERIC READS:
                   Number of chimeric reads |	0
                        % of chimeric reads |	0.00%

But I get more uniquely aligned reads with Tophat2:

Reads: Input : 10206806 Mapped : 10058745 (98.5% of input) of these: 347302 ( 3.5%) have multiple alignments (2093 have >20) 98.5% overall read mapping rate.

My STAR parameters are pretty much default. Do you know how I could acquire more uniquely aligned reads with STAR?

@P4sm
Copy link
Author

P4sm commented Nov 23, 2016

Can STARindex have something to do with this:

STAR --runThreadN 16 --runMode genomeGenerate --genomeDir starIndex-1x50bp --genomeFastaFiles WholeGenomeFasta/genome.fa --sjdbOverhang 49 --sjdbGTFfile ../Annotation/Genes/genes.gtf

Is the correct value for sjdbOverhang 51-1=50 in this case?

@alexdobin
Copy link
Owner

Hi @P4sm

overall, STAR finds more alignable reads: 85.08%+14.02%+0.31%=99.41% vs TopHat2's 98.5%.
It looks like TopHat does not find multi-mapping loci for many of these reads.
Could you extract and send me a few reads that are labeled as multimappers by STAR and unique mappers by TopHat?

Cheers
Alex

@P4sm
Copy link
Author

P4sm commented Nov 24, 2016

Hi,

Three examples:

TopHat 2:

HWI-ST1072:208:C5VDPACXX:6:1313:18306:70501 0 chr1 20651946 50 51M * 0 0 CGCCCGAGTAGCTGGGACTACAGGTGCCCACCACCACGCCCAGCTAATTTT B0<FFFFFFFFFFIIIIIIIIIIIFIIIIIIIIIIIIIIIIIIIIIFFIII AS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:1T49 YT:Z:UU XS:A:- NH:i:1

STAR:

HWI-ST1072:208:C5VDPACXX:6:1313:18306:70501 256 chr2 16071802 0 51M * 0 0 CGCCCGAGTAGCTGGGACTACAGGTGCCCACCACCACGCCCAGCTAATTTT B0<FFFFFFFFFFIIIIIIIIIIIFIIIIIIIIIIIIIIIIIIIIIFFIII NH:i:5 HI:i:2 AS:i:48 nM:i:1 HWI-ST1072:208:C5VDPACXX:6:1313:18306:70501 256 chr6 35070610 0 23M231237N28M * 0 0 CGCCCGAGTAGCTGGGACTACAGGTGCCCACCACCACGCCCAGCTAATTTT B0<FFFFFFFFFFIIIIIIIIIIIFIIIIIIIIIIIIIIIIIIIIIFFIII NH:i:5 HI:i:1 AS:i:47 nM:i:0 HWI-ST1072:208:C5VDPACXX:6:1313:18306:70501 256 chr13 19804525 0 23M111099N28M * 0 0 CGCCCGAGTAGCTGGGACTACAGGTGCCCACCACCACGCCCAGCTAATTTT B0<FFFFFFFFFFIIIIIIIIIIIFIIIIIIIIIIIIIIIIIIIIIFFIII NH:i:5 HI:i:3 AS:i:47 nM:i:0 HWI-ST1072:208:C5VDPACXX:6:1313:18306:70501 0 chr19 522182 0 2S49M * 0 0 CGCCCGAGTAGCTGGGACTACAGGTGCCCACCACCACGCCCAGCTAATTTT B0<FFFFFFFFFFIIIIIIIIIIIFIIIIIIIIIIIIIIIIIIIIIFFIII NH:i:5 HI:i:4 AS:i:48 nM:i:0 HWI-ST1072:208:C5VDPACXX:6:1313:18306:70501 256 chr19 563674 0 23M211247N28M * 0 0 CGCCCGAGTAGCTGGGACTACAGGTGCCCACCACCACGCCCAGCTAATTTT B0<FFFFFFFFFFIIIIIIIIIIIFIIIIIIIIIIIIIIIIIIIIIFFIII NH:i:5 HI:i:5 AS:i:47 nM:i:0

TopHat 2:

HWI-ST1072:208:C5VDPACXX:6:1105:13215:66485 0 chr9 125236810 50 51M * 0 0 CCAGCTTTTCTTTATCTCCAATCTGATTCTTTAGAGAATAGGCATAGCTTT BBBFFBFBBB<FFFBBBBFIIFFIIIBFIFFBBFBB7<70<FFIBFFIFB7 AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:UU XS:A:- NH:i:1

STAR:

HWI-ST1072:208:C5VDPACXX:6:1105:13215:66485 272 chr1 38710624 3 51M * 0 0 AAAGCTATGCCTATTCTCTAAAGAATCAGATTGGAGATAAAGAAAAGCTGG 7BFIFFBIFF<07<7BBFBBFFIFBIIIFFIIFBBBBFFF<BBBFBFFBBB NH:i:2 HI:i:2 AS:i:50 nM:i:0 HWI-ST1072:208:C5VDPACXX:6:1105:13215:66485 0 chr9 125236810 3 51M * 0 0 CCAGCTTTTCTTTATCTCCAATCTGATTCTTTAGAGAATAGGCATAGCTTT BBBFFBFBBB<FFFBBBBFIIFFIIIBFIFFBBFBB7<70<FFIBFFIFB7 NH:i:2 HI:i:1 AS:i:50 nM:i:0

TopHat 2:

HWI-ST1072:208:C5VDPACXX:6:2204:20020:14928 16 chr2 56175065 50 51M * 0 0 AAGAACATTCCATGCTCATGGGTAGGAAGAATCAATATCGTGAAAATGGAC IIIIIFFFFFFFIIIIIIIIIIFIIFFIIFIFIIFFFFFFBFFFFFFFB<B AS:i:-9 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:39A9C1 YT:Z:UU XS:A:- NH:i:1

STAR

HWI-ST1072:208:C5VDPACXX:6:2204:20020:14928 256 chr1 124963435 0 51M * 0 0 GTCCATTTTCACGATATTGATTCTTCCTACCCATGAGCATGGAATGTTCTT B<BFFFFFFFBFFFFFFIIFIFIIFFIIFIIIIIIIIIIFFFFFFFIIIII NH:i:6 HI:i:6 AS:i:50 nM:i:0 HWI-ST1072:208:C5VDPACXX:6:2204:20020:14928 256 chr3 111754246 0 51M * 0 0 GTCCATTTTCACGATATTGATTCTTCCTACCCATGAGCATGGAATGTTCTT B<BFFFFFFFBFFFFFFIIFIFIIFFIIFIIIIIIIIIIFFFFFFFIIIII NH:i:6 HI:i:2 AS:i:50 nM:i:0 HWI-ST1072:208:C5VDPACXX:6:2204:20020:14928 256 chr5 118622311 0 51M * 0 0 GTCCATTTTCACGATATTGATTCTTCCTACCCATGAGCATGGAATGTTCTT B<BFFFFFFFBFFFFFFIIFIFIIFFIIFIIIIIIIIIIFFFFFFFIIIII NH:i:6 HI:i:3 AS:i:50 nM:i:0 HWI-ST1072:208:C5VDPACXX:6:2204:20020:14928 256 chr8 90394981 0 51M * 0 0 GTCCATTTTCACGATATTGATTCTTCCTACCCATGAGCATGGAATGTTCTT B<BFFFFFFFBFFFFFFIIFIFIIFFIIFIIIIIIIIIIFFFFFFFIIIII NH:i:6 HI:i:5 AS:i:50 nM:i:0 HWI-ST1072:208:C5VDPACXX:6:2204:20020:14928 0 chr10 52361793 0 51M * 0 0 GTCCATTTTCACGATATTGATTCTTCCTACCCATGAGCATGGAATGTTCTT B<BFFFFFFFBFFFFFFIIFIFIIFFIIFIIIIIIIIIIFFFFFFFIIIII NH:i:6 HI:i:1 AS:i:50 nM:i:0 HWI-ST1072:208:C5VDPACXX:6:2204:20020:14928 272 chr12 30592866 0 51M * 0 0 AAGAACATTCCATGCTCATGGGTAGGAAGAATCAATATCGTGAAAATGGAC IIIIIFFFFFFFIIIIIIIIIIFIIFFIIFIFIIFFFFFFBFFFFFFFB<B NH:i:6 HI:i:4 AS:i:50 nM:i:0

I hope this helps.

@P4sm
Copy link
Author

P4sm commented Nov 24, 2016

And here are the reads in FASTQ format, in case you need that.

@HWI-ST1072:208:C5VDPACXX:6:1313:18306:70501 1:N:0:CGATGT CGCCCGAGTAGCTGGGACTACAGGTGCCCACCACCACGCCCAGCTAATTTT + B0<FFFFFFFFFFIIIIIIIIIIIFIIIIIIIIIIIIIIIIIIIIIFFIII

@HWI-ST1072:208:C5VDPACXX:6:1105:13215:66485 1:N:0:CGATGT CCAGCTTTTCTTTATCTCCAATCTGATTCTTTAGAGAATAGGCATAGCTTT + BBBFFBFBBB<FFFBBBBFIIFFIIIBFIFFBBFBB7<70<FFIBFFIFB7

@HWI-ST1072:208:C5VDPACXX:6:2204:20020:14928 1:N:0:CGATGT GTCCATTTTCACGATATTGATTCTTCCTACCCATGAGCATGGAATGTTCTT + B<BFFFFFFFBFFFFFFIIFIFIIFFIIFIIIIIIIIIIFFFFFFFIIIII

@alexdobin
Copy link
Owner

Hi @P4sm

it looks like in all cases STAR behaves as expected, and these reads are truly multimappers.

  1. HWI-ST1072:208:C5VDPACXX:6:1313:18306:70501
    STAR finds 51M alignment with 1 mismatch, 2S49M (two bases soft-clipped at the end) with not mismatches. Both of them have alignment score of 48 (AS tag). TopHat only reports end-to-end alignment, so it does not pickup the soft-clipped alignment. I think that an alignment with two clipped bases is as good as the end-to-end alignments with one mismatch, however, if you want end-to-end alignments only, you can use --alignEndsType EndToEnd

STAR also finds three spliced alignments with no mismatches, and assigns them a score of 47 (large gaps are penalized). I guess TopHat completely misses these spliced alignments. If you do not like long gaps, there are ways to penalize them more.

  1. HWI-ST1072:208:C5VDPACXX:6:1105:13215:66485
    Here STAR finds two perfectly matched alignments, while TopHat finds only one. BLAT alignments agree with STAR, so I think this is a problem with TopHat.

  2. HWI-ST1072:208:C5VDPACXX:6:2204:20020:14928
    Again, STAR found 5 perfectly matched alignments, while TopHat only reported one.
    Interestingly, BLAT only finds 5 alignments. I have checked the extra one that STAR finds, and it is an exact match as well - I guess BLAT misses one of the alignments as well.

To summarize, in 1 case the differences between STAR and TopHat can be tweaked out. In the other two cases, TopHat simply misses the perfectly good multimapping alignments.

Cheers
Alex

@P4sm
Copy link
Author

P4sm commented Nov 29, 2016

Thanks for your thorough analysis, and I agree with you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants