Evaluating or Assessing an Assembly
Authored by Jin Choi for EDAMAME
- This tutorial will contribute towards an understanding the assembly of metagenome data
- This tutorial will discuss assembly evaluation
- Understanding methods to evaluate an assembly
- Understanding assembly metrics
- Comparing two different assemblies
So, now we want to take a look at our assembly and see how it "looks". We'll navigate into the folder we named (
megahit_assembly) as the output by typing
cd megahit-assembly and then
ls -lah the contents of the folder. You should see something like this:
drwxr-xr-x 3 user group 7 Jun 4 10:43 . drwxr-xr-x 9 user group 101 Jun 4 09:53 .. -rw-r--r-- 1 user group 147M Jun 4 10:43 final.contigs.fa -rw-r--r-- 1 user group 1.2M Jun 4 10:43 log -rw-r--r-- 1 user group 71 Jun 4 09:53 opts.txt drwxr-xr-x 2 user group 13 Jun 4 10:43 tmp
We're really concerned with two files, the
log of the run and, of course, the assembly
Let's take a look at the
log file first; go ahead and type
tail log. This will give you some of megahit's stats on the assembly. The whole log file will give you a more in depth account of what happened than was printed to the screen.
We're interested in this part:
[STAT] 3669 contigs, total 2977628 bp, min 211 bp, max 37382 bp, avg 812 bp, N50 942 bp
Megahit actually runs numerous iterations of assemblies. Here we are looking at the last iteration and the number of contigs and total length of that iteration of the assembly. Later, we will be using QUAST to calculate all of our assembly stats. There may be some odd terminology in some of the log file, see this wiki for more information on terminology arising from the Celera Assembler (the first assembly program designed for the Human Genome Project)
Calculating summary statistics for our Metagenome assembly, using QUAST.
Quast is a program that will calculate some statistics about our metagenome assembly to give us an idea how well it assembled. Again, we do need to quickly install Quast and one of its dependencies so we can get it running on our machine. This shouldn't take very long. Copy each of the following lines of code one line at a time to install Quast.
cd curl -L http://sourceforge.net/projects/quast/files/quast-3.0.tar.gz/download > quast-3.0.tar.gz tar xvf quast-3.0.tar.gz
Getting the data
Now, let’s create a working directory and download assembled file. We are going to run assessment assembled with full data not subsampled.:
cd ~/metagenome mkdir assessment cd assessment wget https://s3.amazonaws.com/edamame/compare_assembly.tar.gz tar -zxvf compare_assembly.tar.gz
Now we can take a look at our assembly using QUAST. From the ~/metagenomics/assessment run the following line of code. ##Looking at the assembly Run QUAST:
~/quast-3.0/quast.py pe.se.final.contigs.fa -o report
Once QUAST has finished running, change into the quast_output directory and use
ls to take a look at all of the files it created. Use
less to examine the
You should see:
All statistics are based on contigs of size >= 500 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs). Assembly pe.se.final.contigs # contigs (>= 0 bp) 2670 # contigs (>= 1000 bp) 1021 Total length (>= 0 bp) 12958914 Total length (>= 1000 bp) 12177316 # contigs 1537 Largest contig 282362 Total length 12541412 GC (%) 39.44 N50 32048 N75 10401 L50 86 L75 258 # N's per 100 kbp 0.00
N50 could be an important number. L50 (L75, LG50, LG75) is the number of contigs equal to or longer than N50 (N75, NG50, NG75) In other words, L50, for example, is the minimal number of contigs that cover half the assembly.
##Comparing and evaluating assemblies
Let's compare three assembly. The result from last tutorial, which used only paired end
pe.final.contigs.fa, include single end
pe.se.final.contigs.faand assembly from raw read
raw.final.contigs.fa. Those assebmly is done from full data (instead of substituted data). Note, When you compare assembly, all assembly sould be done from same dataset other than that, it dose not make sense.
~/quast-3.0/quast.py pe.final.contigs.fa se.final.contigs.fa pe.se.final.contigs.fa -o report_compare
then, open the result:
You will see.
All statistics are based on contigs of size >= 500 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs). Assembly pe.final.contigs se.final.contigs pe.se.final.contigs # contigs (>= 0 bp) 2707 3162 2670 # contigs (>= 1000 bp) 1064 1375 1021 Total length (>= 0 bp) 12946382 12735336 12958914 Total length (>= 1000 bp) 12168690 11849449 12177316 # contigs 1581 2038 1537 Largest contig 232526 232458 282362 Total length 12532010 12316301 12541412 GC (%) 39.45 39.56 39.44 N50 30749 22507 32048 N75 10062 6464 10401 L50 95 113 86 L75 275 392 258 # N's per 100 kbp 0.00 0.00 0.00
Let's compare asssembly using RAW read.
All statistics are based on contigs of size >= 500 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs). Assembly raw.final.contigs # contigs (>= 0 bp) 2976 # contigs (>= 1000 bp) 1149 Total length (>= 0 bp) 13050315 Total length (>= 1000 bp) 12189601 # contigs 1719 Largest contig 232493 Total length 12590769 GC (%) 39.47 N50 22360 N75 8751 L50 132 L75 357 # N's per 100 kbp 0.00
Another way to evaluate an assembly is using read orientation. Paired end sequencing are two reads that we prepare to be a specific distance apart and in opposite directions. We can look at mapped paired end reads to a reference genome or an assembled metagenome to assess "paired end concordance." But first, you need to know how to map reads...