Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reference genome with or without unplaced contigs #83

Open
sjackman opened this issue Nov 30, 2018 · 10 comments
Open

Reference genome with or without unplaced contigs #83

sjackman opened this issue Nov 30, 2018 · 10 comments

Comments

@sjackman
Copy link
Contributor

sjackman commented Nov 30, 2018

The Drosophila melanogaster reference genome has 5.5 Mbp of unplaced sequence in 1,862 unplaced contigs. Do you recommend running QUAST with only the chromosomes, or with the chromosomes and unplaced contigs (possibly with --fragmented).

n n:500 L50 min N75 N50 N25 E-size max sum name
1870 1870 3 544 23.51e6 25.28e6 27.99e6 25.02e6 32.05e6 142.6e6 fly/fly.all.fa
8 8 3 19524 23.51e6 25.28e6 27.99e6 26.03e6 32.05e6 137.1e6 fly/fly.fa

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/215/GCF_000001215.4_Release_6_plus_ISO1_MT/GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz

@alexeigurevich
Copy link
Contributor

alexeigurevich commented Dec 19, 2018

Hi Shaun,
Sorry for the delay!
I think you can try the full reference (fly.all.fa) and use --fragmented as you suggested. I believe the presence of unplaced sequences should not harm the evaluated assembly quality in this case (I mean no false positive misassemblies should be reported). At the same time you will have more information comparing to the run against just chromosomes as the reference. As a sanity check you may try to assess your assemblies with --fragmented against ONLY unplaced contigs as the reference (I mean exclude 8 chromosomes from fly.all.fa).

In any case, please let us know how it is going. The experiment looks interesting in terms of testing --fragmented capabilities.

@sjackman
Copy link
Contributor Author

    --fragmented                      Reference genome may be fragmented into small pieces (e.g. scaffolded reference) 
    --fragmented-max-indent  <int>    Mark translocation as fake if both alignments are located no further than N bases 
                                      from the ends of the reference fragments [default: 85]
                                      Requires --fragmented option

I'm testing now without the unplaced contigs and comparing to with the unplaced contigs with --fragmented. The largest unplaced contigs are 89 kbp. I suspect that we will see some increase in misassemblies (we'll see if I'm right). One option could be to increase --fragmented-max-indent to 45 kbp. So that everything on an unplaced contig is ignored. Thoughts?

@sjackman
Copy link
Contributor Author

sjackman commented Dec 19, 2018

--- quast-lg -es --fast --large -R fly.chr.fa assembly.fa
+++ quast-lg -es --fast --large --fragmented -R fly.all.fa assembly.fa
 # contigs                    1777           3307               
 Largest contig               27246440       831740             
 Total length                 136770791      132088273          
-Reference length             137567484      137567484          
-Reference GC (%)             42.08          42.08              
+Reference length             143726002      143726002          
+Reference GC (%)             42.01          42.01              
 N50                          20770625       152319             
-NG50                         20770625       147975             
+NG50                         20708867       141243             
 N75                          20652883       64699              
-NG75                         20652883       53807              
+NG75                         20652883       43659              
 L50                          3              245                
-LG50                         3              263                
+LG50                         4              284                
 L75                          5              563                
-LG75                         5              631                
-# misassemblies              453            377                
-# misassembled contigs       209            328                
-Misassembled contigs length  120125748      27027106           
-# local misassemblies        1507           1135               
-# scaffold gap ext. mis.     113            -                  
-# scaffold gap loc. mis.     394            -                  
-# possible TEs               72             50                 
-# unaligned mis. contigs     24             23                 
-# unaligned contigs          85 + 191 part  194 + 323 part     
-Unaligned length             7031973        6529255            
-Genome fraction (%)          89.194         89.173             
+LG75                         5              726                
+# misassemblies              483            397                
+# misassembled contigs       228            341                
+Misassembled contigs length  120228985      27087838           
+# local misassemblies        1516           1149               
+# scaffold gap ext. mis.     116            -                  
+# scaffold gap loc. mis.     398            -                  
+# possible TEs               70             54                 
+# unaligned mis. contigs     13             12                 
+# unaligned contigs          54 + 164 part  157 + 291 part     
+Unaligned length             6661039        6167582            
+Genome fraction (%)          86.037         86.018             
 Duplication ratio            1.061          1.027              
 # N's per 100 kbp            3290.97        0.00               
-# mismatches per 100 kbp     650.82         564.16             
-# indels per 100 kbp         42.51          43.55              
+# mismatches per 100 kbp     647.53         561.09             
+# indels per 100 kbp         42.34          43.36              
 Largest alignment            3619434        734089             
-Total aligned length         125169641      125408010          
+Total aligned length         125526732      125755179          
 NA50                         880995         129615             
-NGA50                        880995         122436             
+NGA50                        839777         111752             
 NA75                         262872         47914              
-NGA75                        253454         38009              
+NGA75                        135651         27347              
 LA50                         43             288                
-LGA50                        43             310                
+LGA50                        47             336                
 LA75                         109            698                
-LGA75                        111            794                
+LGA75                        135            936                

So # misassemblies increased from 453 to 483 and NGA50 decreased from 880995 to 839777. Not terrible, but I'll prefer to use only chromosomes and leave out the unplaced contigs.

@sjackman
Copy link
Contributor Author

sjackman commented Dec 19, 2018

Running quast-lg --fragmented --fragmented-max-indent 50000 produced the exception…

  File "/gsc/btl/linuxbrew/Cellar/python@2/2.7.15_1/lib/python2.7/optparse.py", line 809, in take_action
    self.callback(self, opt, value, parser, *args, **kwargs)
  File "/gsc/btl/linuxbrew/Cellar/quast/5.0.2/quast_libs/options_parser.py", line 87, in set_fragmented_max_indent
    % qconfig.extensive_misassembly_threshold, to_stderr=True, exit_with_code=2)
TypeError: %d format: a number is required, not NoneType
ERROR! exception caught!

It looks as though qconfig.extensive_misassembly_threshold is None but should be the default value of 7000.

@sjackman
Copy link
Contributor Author

Increasing --fragmented-max-indent from the default value of 85 to 7000 helps somewhat.

--- quast-lg -es --fast --large -R fly.chr.fa assembly.fa
+++ quast-lg -es --fast --large --fragmented -x 7000 --fragmented-max-indent 7000 -R fly.all.fa assembly.fa
 # contigs                    1777           3307               
 Largest contig               27246440       831740             
 Total length                 136770791      132088273          
-Reference length             137567484      137567484          
-Reference GC (%)             42.08          42.08              
+Reference length             143726002      143726002          
+Reference GC (%)             42.01          42.01              
 N50                          20770625       152319             
-NG50                         20770625       147975             
+NG50                         20708867       141243             
 N75                          20652883       64699              
-NG75                         20652883       53807              
+NG75                         20652883       43659              
 L50                          3              245                
-LG50                         3              263                
+LG50                         4              284                
 L75                          5              563                
-LG75                         5              631                
-# misassemblies              453            377                
-# misassembled contigs       209            328                
-Misassembled contigs length  120125748      27027106           
-# local misassemblies        1507           1135               
-# scaffold gap ext. mis.     113            -                  
-# scaffold gap loc. mis.     394            -                  
-# possible TEs               72             50                 
-# unaligned mis. contigs     24             23                 
-# unaligned contigs          85 + 191 part  194 + 323 part     
-Unaligned length             7031973        6529255            
-Genome fraction (%)          89.194         89.173             
+LG75                         5              726                
+# misassemblies              476            390                
+# misassembled contigs       223            335                
+Misassembled contigs length  120207681      27064051           
+# local misassemblies        1529           1161               
+# scaffold gap ext. mis.     115            -                  
+# scaffold gap loc. mis.     398            -                  
+# possible TEs               68             52                 
+# unaligned mis. contigs     13             12                 
+# unaligned contigs          54 + 164 part  157 + 291 part     
+Unaligned length             6661084        6167617            
+Genome fraction (%)          86.037         86.020             
 Duplication ratio            1.061          1.027              
 # N's per 100 kbp            3290.97        0.00               
-# mismatches per 100 kbp     650.82         564.16             
-# indels per 100 kbp         42.51          43.55              
+# mismatches per 100 kbp     647.54         561.10             
+# indels per 100 kbp         42.34          43.36              
 Largest alignment            3619434        734089             
-Total aligned length         125169641      125408010          
+Total aligned length         125527295      125755684          
 NA50                         880995         129615             
-NGA50                        880995         122436             
+NGA50                        839777         111752             
 NA75                         262872         47914              
-NGA75                        253454         38009              
+NGA75                        135651         27347              
 LA50                         43             288                
-LGA50                        43             310                
+LGA50                        47             336                
 LA75                         109            698                
-LGA75                        111            794                
+LGA75                        135            936                

# misassemblies increased from 453 to 476 (down from 483 without --fragmented-max-indent 7000) and NGA50 decreased from 880995 to 839777 (the same with or without --fragmented-max-indent 7000).

@sjackman
Copy link
Contributor Author

I feel like the behaviour that I would prefer would be to ignore all alignments to unplaced contigs (reference sequences shorter than 100 kbp) when computing misassemblies and NGA50 stats.

@sjackman
Copy link
Contributor Author

Increasing both --extensive-mis-size 50000 and --fragmented-max-indent 50000 did decrease misassemblies significantly, but perhaps too much so. # misassemblies decreased from 453 to 206 and NGA50 increased from 880995 to 17502836.
Although, from inspecting the dot plot of the alignments of the scaffolds to the reference genome, I know that the scaffolds and reference are perfectly colinear (ignoring small-ish relocations and inversions), so depending on your definition of a correct scaffold, perhaps an NGA50 of 17.5 Mbp is closer to the truth than 881 kbp. Note the scaffold NG50 is 20.8 Mbp.

--- quast-lg -es --fast --large -R fly.chr.fa assembly.fa
+++ quast-lg -es --fast --large --fragmented ---extensive-mis-size 50000 --fragmented-max-indent 50000 -R fly.all.fa assembly.fa
 # contigs                    1777           3307               
 Largest contig               27246440       831740             
 Total length                 136770791      132088273          
-Reference length             137567484      137567484          
-Reference GC (%)             42.08          42.08              
+Reference length             143726002      143726002          
+Reference GC (%)             42.01          42.01              
 N50                          20770625       152319             
-NG50                         20770625       147975             
+NG50                         20708867       141243             
 N75                          20652883       64699              
-NG75                         20652883       53807              
+NG75                         20652883       43659              
 L50                          3              245                
-LG50                         3              263                
+LG50                         4              284                
 L75                          5              563                
-LG75                         5              631                
-# misassemblies              453            377                
-# misassembled contigs       209            328                
-Misassembled contigs length  120125748      27027106           
-# local misassemblies        1507           1135               
-# scaffold gap ext. mis.     113            -                  
-# scaffold gap loc. mis.     394            -                  
-# possible TEs               72             50                 
-# unaligned mis. contigs     24             23                 
-# unaligned contigs          85 + 191 part  194 + 323 part     
-Unaligned length             7031973        6529255            
-Genome fraction (%)          89.194         89.173             
+LG75                         5              726                
+# misassemblies              206            206                
+# misassembled contigs       165            182                
+Misassembled contigs length  89183619       4998964            
+# local misassemblies        1780           1348               
+# scaffold gap ext. mis.     8              -                  
+# scaffold gap loc. mis.     505            -                  
+# possible TEs               90             52                 
+# unaligned mis. contigs     13             12                 
+# unaligned contigs          54 + 162 part  157 + 289 part     
+Unaligned length             6659706        6166247            
+Genome fraction (%)          86.041         86.025             
 Duplication ratio            1.061          1.027              
 # N's per 100 kbp            3290.97        0.00               
-# mismatches per 100 kbp     650.82         564.16             
-# indels per 100 kbp         42.51          43.55              
-Largest alignment            3619434        734089             
-Total aligned length         125169641      125408010          
-NA50                         880995         129615             
-NGA50                        880995         122436             
-NA75                         262872         47914              
-NGA75                        253454         38009              
-LA50                         43             288                
-LGA50                        43             310                
-LA75                         109            698                
-LGA75                        111            794                
+# mismatches per 100 kbp     647.55         561.14             
+# indels per 100 kbp         42.33          43.36              
+Largest alignment            26304422       831628             
+Total aligned length         125528847      125757294          
+NA50                         17502836       144852             
+NGA50                        17502836       128811             
+NA75                         7124996        52395              
+NGA75                        963854         30445              
+LA50                         4              260                
+LGA50                        4              302                
+LA75                         6              625                
+LGA75                        9              840                

@sjackman
Copy link
Contributor Author

sjackman commented Dec 19, 2018

I've thought about an alternative definition of scaffold NGA50:
The N50 of the best possible subsequence (not substring) of colinear alignments from a single scaffold.

I think that's roughly equivalent to saying… remove all the incongruous alignment blocks from a scaffold, and what's the N50 of what remains?

So a translocation scaffold A B (where A and B align to different chromosomes) would be counted as A and B, but an inversion A X B where A and B are colinear alignments but X is inverted would be counted as A B and X.

The goal is to capture the fact the the scaffolds in the assembly are 💯 correct at the large scale. There is one scaffold per reference chromosome, and the scaffolds spans from the start of each chromosome to the end, with some local assembly errors along the way. A reported NGA50 of < 1 Mbp doesn't capture that large scale correctness.

@alexeigurevich
Copy link
Contributor

Wow, so many thoughts and suggestions!
We will investigate the mentioned crash and fix it asap (should go to the upcoming 5.0.3 with few other fixes). As for the "scaffold NGA50", I like the idea but the implementation will take some time, so probably this is for 5.1 (somewhen early next year).

As for

I feel like the behaviour that I would prefer would be to ignore all alignments to unplaced contigs (reference sequences shorter than 100 kbp) when computing misassemblies and NGA50 stats.

it looks like an ad hoc behaviour, so probably should not be included in the release version. Maybe you just need to perform two separate Quast runs (against fly.all.fa and fly.fa) and combine the metrics appropriately (e.g., take misassemblies exclusively from fly.fa run but combined Genome fraction % from both, etc).

Anyway, thanks for sharing all these info!

@sjackman
Copy link
Contributor Author

Wow, so many thoughts and suggestions!

Indeed these are just ramblings and wild thoughts. Don't feel compelled to implement them unless you truly think they're awesome ideas. which is great of course if you do! Happy holidays!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants