# Module 6: *De novo* genome assembly and quality control

## Overview

After generating sequencing reads and carrying out quality assessment, the next step is to determine how the reads fit together by looking for overlaps between them; this process is called genome assembly. Here you can see the general steps in a genome assembly workflow. *Steps 1 and 2 have already been completed.*


![steps](images/assem.jpg)

*Taken from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5850084/*

The data from the Illumina machine comes as relatively short stretches (35 - 150 base pairs) of DNA, around 6 billion of them. These individual sequences are called sequencing reads. There is a range of assembly programs that have been specifically designed to assemble genomes from sequence read data. Genome assembly using reads of around 100 bp is complicated for two reasons: genomes contain many repeats that are longer than the read length (for example, IS elements and rRNA operons), and the programs have to handle a massive amount of data. In addition to finding overlaps between sequences, assembly programs can also use the predicted insert size of paired reads to link and position reads in an assembly.
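
To make the idea of "finding overlaps" concrete, here is a minimal sketch of a suffix-prefix overlap between two reads. The function name `longest_overlap`, the `min_length` parameter and the toy reads are illustrative assumptions; real assemblers use far more efficient data structures and tolerate sequencing errors.

```python
def longest_overlap(left: str, right: str, min_length: int = 3) -> int:
    """Return the length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    # Try the longest possible overlap first and shrink until one matches
    for length in range(max_len, min_length - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0  # no overlap of at least `min_length` bases


# Toy reads (made up for illustration)
read_a = "ATGCGTACGTTAG"
read_b = "CGTTAGGCTAACT"

overlap = longest_overlap(read_a, read_b)
merged = read_a + read_b[overlap:]  # join the reads, keeping the shared bases once
print(overlap, merged)
```

If a repeat is longer than the reads, several reads can share the same overlap, and the assembler can no longer tell which join is the correct one. This is why repeats break assemblies into contigs.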

Where a genome is pieced together without any reference sequence to compare it to, or scaffold it against, it is termed a *de novo* assembly. A *de novo* assembly may not produce a complete genome; instead, the result is usually fragmented into multiple contiguous sequences (contigs), the order of which is arbitrary and does not necessarily bear any relation to their real order in the genome.

Where a closely related reference sequence is available, it is possible to improve an assembly by ordering the contigs in comparison to the reference, and also transferring annotation. In this case, nearly all of the genome will be present, i.e. genes and features, but there will be some regions that will contain gaps, or contigs that will not be accurately placed, because they are not present in the reference used. Although technically incomplete, ordered genome assemblies can provide valuable insights into the genetics and biology of an organism.

![denovo](images/denovo.jpg)

*Taken from: https://doi.org/10.1038/nmeth.1935*


### Install condacolab

In [None]:
!pip install -q condacolab
import condacolab
condacolab.install()

### Install software

In [None]:
# Install SPAdes, QUAST and seqtk
!conda install bioconda::spades
!conda install bioconda::quast
!conda install bioconda::seqtk


In [None]:
# Check if SPAdes is installed
!spades.py --help

In [None]:
# Check if QUAST is installed
!quast.py --help

In [None]:
# Check if seqtk is installed
!seqtk seq -h

### Download data

In [None]:
!wget https://zenodo.org/records/13750987/files/Module_6.tar.gz

### Extract the .tar.gz file 

In [None]:
!tar xvf Module_6.tar.gz

## Part 1: Generating a *de novo* assembly for a single strain

In this section we will use [SPAdes](https://github.com/ablab/spades), one of a number of *de novo* assemblers that take short-read sets (e.g. Illumina reads) as input. Its assembly method is based on de Bruijn graphs.
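
As a rough illustration of what a de Bruijn graph is, the sketch below splits a few toy reads into k-mers and links each (k-1)-mer prefix to the (k-1)-mer suffix that follows it. The function name `build_de_bruijn`, the reads and the value of `k` are assumptions made for this example; SPAdes itself is far more elaborate (it combines several k values and corrects errors), so this is not its implementation.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix node to its suffix node
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Toy reads (made up) that overlap each other
toy_reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(toy_reads, k=4)

for node in sorted(graph):
    print(node, "->", sorted(set(graph[node])))
```

Walking the edges from `ATG` onwards spells out a single path through all three reads; an assembler reports such unambiguous paths as contigs.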

We will assemble the genome of *Streptococcus pneumoniae*.

We will navigate to the folder containing the paired FASTQ files ERR1795461_1.fastq.gz and ERR1795461_2.fastq.gz, which correspond to the run accession ERR1795461 from the project [PRJEB3084](https://www.ebi.ac.uk/ena/browser/view/PRJEB3084).

Some important data about the sample:

- Country of origin: Brazil
- Organism: *Streptococcus pneumoniae*
- Instrument Platform: ILLUMINA
- Instrument Model: Illumina MiSeq
- Read Count: 3627822
- Base Count: 453477750
- Center Name: Wellcome Sanger Institute; SC
- Library Layout: PAIRED
- Library strategy: WGS
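
The read and base counts above are enough for a back-of-the-envelope estimate of sequencing depth. The sketch below divides the base count by a genome size; the value of roughly 2.1 Mbp for *S. pneumoniae* is an assumption that is not stated in this module, so treat the result as approximate.

```python
BASE_COUNT = 453_477_750          # from the sample metadata above
READ_COUNT = 3_627_822            # from the sample metadata above
ASSUMED_GENOME_SIZE = 2_100_000   # assumption: approximate S. pneumoniae genome size in bp


def estimated_coverage(total_bases, genome_size):
    """Average number of times each genome position is expected to be sequenced."""
    return total_bases / genome_size


coverage = estimated_coverage(BASE_COUNT, ASSUMED_GENOME_SIZE)
mean_bases_per_read = BASE_COUNT / READ_COUNT

print(f"Estimated coverage: {coverage:.0f}x")
print(f"Mean bases per counted read: {mean_bases_per_read:.1f}")
```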

In [None]:
%cd Module_6

Now we are going to generate a subsample of each sequencing file, given that they are large files and the complete assembly could take a long time. However, it is important to remember that ideally, the analysis should be performed with all the reads.

In [None]:
# Subsample the data
!seqtk sample -s100 ERR1795461_1.fastq.gz 100000 > ERR1795461_1_sub.fastq

In [None]:
# Subsample the data
!seqtk sample -s100 ERR1795461_2.fastq.gz 100000 > ERR1795461_2_sub.fastq

In [None]:
# Compress the subsampled data
!gzip ERR1795461_1_sub.fastq ERR1795461_2_sub.fastq
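
Both files were subsampled with the same seed (`-s100`), which is what keeps the forward and reverse reads paired. A quick sanity check can be sketched as follows; the helper names `read_names` and `pairs_in_sync` are hypothetical, and the code assumes standard four-line FASTQ records.

```python
import gzip
import os


def read_names(path):
    """Return the read names (without a trailing /1 or /2) from a FASTQ or FASTQ.gz file."""
    opener = gzip.open if path.endswith(".gz") else open
    names = []
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # Assumes 4-line records: the header is every fourth line
            if line_number % 4 == 0 and line.strip():
                name = line[1:].split()[0]
                if name.endswith(("/1", "/2")):
                    name = name[:-2]
                names.append(name)
    return names


def pairs_in_sync(forward_path, reverse_path):
    """True when both files list the same reads in the same order."""
    return read_names(forward_path) == read_names(reverse_path)


forward, reverse = "ERR1795461_1_sub.fastq.gz", "ERR1795461_2_sub.fastq.gz"
if os.path.exists(forward) and os.path.exists(reverse):
    print("Read pairs:", len(read_names(forward)))
    print("Pairs in sync:", pairs_in_sync(forward, reverse))
```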

Run the following cell to execute SPAdes:

This step will take a few minutes.

In [None]:
# Run SPAdes
!spades.py -1 ERR1795461_1_sub.fastq.gz -2 ERR1795461_2_sub.fastq.gz --careful --cov-cutoff auto -o spades_assembly

An explanation of this command is as follows:

`spades.py` is the tool

`-1` flag for the input file of forward reads

`-2` flag for the input file of reverse reads

`--careful` minimizes mismatches and short indels

`--cov-cutoff auto` computes the coverage threshold (rather than the default setting, “off”)

`-o` flag for the output directory

Move into the output directory (spades_assembly) and look at the contigs.fasta file.

In [None]:
# Change to the output directory
%cd spades_assembly

In [None]:
# Check the report

## Generating a *de novo* assembly for multiple strains

We can also assemble genomes of multiple strains.  

>**Note**: In this module, we will not run the multiple assembly due to the lack of resources in Colab. However, here is an example of how to do it.

### First, we will create a folder for each pair of compressed FASTQ files, named after the strain ID, using the command:

In [None]:
# Do not run this cell
# !for x in *1.fastq.gz; do mkdir ${x%%_1.fastq.gz} ; mv $x ${x%%_1.fastq.gz}; mv ${x%%1.fastq.gz}2.fastq.gz ${x%%_1.fastq.gz}; done

An explanation of this command is as follows:

`for x in *1.fastq.gz` This starts a for loop that iterates over all files in the current directory that end with "1.fastq.gz".

`do` This starts the code block that will be executed for each file.

`mkdir ${x%%_1.fastq.gz}` This creates a new directory with the same name as the file, but with the "_1.fastq.gz" suffix removed.

`mv $x ${x%%_1.fastq.gz}` This moves the original file into the new directory created in the previous step.

`mv ${x%%1.fastq.gz}2.fastq.gz ${x%%_1.fastq.gz}` This moves the second file of the pair, which has the same prefix as the first file but ends in "2.fastq.gz", into the same directory.

`done` This ends the for loop.

Overall, this script takes paired-end sequencing data stored in two separate files with names that end in **"_1.fastq.gz"** and **"_2.fastq.gz"**, and organizes each pair into a directory named after the prefix of the file name.
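
The same reorganisation can be sketched in Python, which some readers find easier to follow than Bash parameter expansion. The function name `organise_pairs` is an assumption for this example; unlike the Bash loop, this sketch skips any forward file that has no matching reverse file.

```python
from pathlib import Path
import shutil


def organise_pairs(directory):
    """Move each *_1.fastq.gz / *_2.fastq.gz pair into a folder named after the shared prefix."""
    directory = Path(directory)
    moved = []
    # sorted() collects the file list before anything is moved
    for forward in sorted(directory.glob("*_1.fastq.gz")):
        strain_id = forward.name[: -len("_1.fastq.gz")]
        reverse = directory / f"{strain_id}_2.fastq.gz"
        if not reverse.exists():
            print(f"Skipping {strain_id}: no matching _2 file")
            continue
        target = directory / strain_id
        target.mkdir(exist_ok=True)
        shutil.move(str(forward), str(target / forward.name))
        shutil.move(str(reverse), str(target / reverse.name))
        moved.append(strain_id)
    return moved
```

Calling `organise_pairs(".")` in a folder of paired FASTQ files would return the list of strain IDs that were moved.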

### We will then execute SPAdes using the command:

In [None]:
# Do not run this cell
#!for x in *#* ; do spades.py --pe1-1 $x/${x}_1.fastq.gz --pe1-2 $x/${x}_2.fastq.gz --careful --cov-cutoff auto -o $x"_output"; done

An explanation of this command is as follows:

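`for x in *#*` This starts a loop that iterates over every entry in the current working directory whose name contains a `#` character (this pattern suits lane-style strain IDs such as 17150_4#79; adapt it if your folder names do not contain `#`). The loop variable x will be set to each matching directory name in turn.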

`do` This keyword starts the block of commands that will be executed for each iteration of the loop.

`spades.py` This is the command that starts the SPAdes software.

`--pe1-1 $x/${x}_1.fastq.gz` This specifies the path to the first paired-end read file for SPAdes. The variable $x is used to reference the current directory name, and ${x}_1.fastq.gz is appended to create the full file name.

`--pe1-2 $x/${x}_2.fastq.gz` This specifies the path to the second paired-end read file for SPAdes. The same directory name variable $x is used as above, but with ${x}_2.fastq.gz appended to create the full file name for the second read file.

`--careful` This tells SPAdes to try to reduce the number of mismatches and short indels in the final contigs.

`--cov-cutoff auto` This tells SPAdes to automatically determine the coverage cutoff for filtering out low coverage contigs during the assembly.

`-o $x"_output"`  This specifies the output directory for the SPAdes assembly. The directory name is created by appending "_output" to the current directory name $x.

`;` This is the command separator that tells Bash to execute the previous command before moving on to the next one.

`done` This keyword marks the end of the loop block.
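
For readers who prefer Python, the loop above can be sketched as a function that builds one SPAdes command per strain folder. The function name `spades_commands` is an assumption; the flags are the ones used in the Bash command. The sketch only prints the commands (a dry run) and looks at every sub-folder that contains a matching read pair, rather than filtering on `#` in the name.

```python
from pathlib import Path


def spades_commands(directory):
    """Return one SPAdes command (as an argument list) per strain folder holding a read pair."""
    commands = []
    for folder in sorted(p for p in Path(directory).iterdir() if p.is_dir()):
        forward = folder / f"{folder.name}_1.fastq.gz"
        reverse = folder / f"{folder.name}_2.fastq.gz"
        if forward.exists() and reverse.exists():
            commands.append([
                "spades.py",
                "--pe1-1", str(forward),
                "--pe1-2", str(reverse),
                "--careful", "--cov-cutoff", "auto",
                "-o", str(folder.parent / f"{folder.name}_output"),
            ])
    return commands


for command in spades_commands("."):
    # Dry run: print only. Pass `command` to subprocess.run() to actually execute it.
    print(" ".join(command))
```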

___

## Part 2: Quality Assessment for Genome Assemblies

Modern DNA sequencing technologies cannot produce the complete sequence of a chromosome. Instead, they generate large numbers of reads, ranging from dozens to thousands of consecutive bases, sampled from different parts of the genome. Genome assembly software combines the reads into larger regions called contigs. However, current sequencing technologies and software face many complications that impede reconstruction of full chromosomes, including errors in reads and large repeats in the genome.

Different assembly programs use different heuristic approaches to tackle these challenges, resulting in many differences in the contigs they output. This leads to the questions of how to assess the quality of an assembly and how to compare different assemblies.

Further reading: https://academic.oup.com/bioinformatics/article/29/8/1072/228832

## Assessing generated assemblies

We will be using the QUAST tool in this section.

[QUAST](https://github.com/ablab/quast) stands for QUality ASsessment Tool. It evaluates genome/metagenome assemblies by computing various metrics. The current QUAST toolkit includes the general QUAST tool for genome assemblies; MetaQUAST, the extension for metagenomic datasets; QUAST-LG, the extension for large genomes (e.g., mammalian); and Icarus, the interactive visualizer for these tools.

The QUAST package works both with and without reference genomes. However, it is much more informative if at least a close reference genome is provided along with the assemblies. The tool accepts multiple assemblies, thus is suitable for comparison.

We will assess the quality of the **complete** contigs generated from ERR1795461_1.fastq.gz and ERR1795461_2.fastq.gz by the *de novo* assembly described in Part 1. It is important to note that we will not analyze the subsampled assembly, but rather a complete assembly that was performed previously.

For this, move back up to the Module_6 folder, which contains the folder ERR1795461_spades_complete.

In [None]:
%cd ..

We will run the QUAST tool on these contigs using the command:

In [None]:
!quast.py ERR1795461_spades_complete/contigs.fasta -r Reference_sequence_GPSC46.fa -g PROKKA_03052023.gff -1 ERR1795461_1.fastq.gz -2 ERR1795461_2.fastq.gz -o quast_ERR1795461_output

An explanation of this command is as follows:

`quast.py` is the tool

`-r Reference_sequence_GPSC46.fa` specifies the reference sequence

`-g PROKKA_03052023.gff` specifies the genomic features (genes) annotated in the reference genome (Prokka output for Reference_sequence_GPSC46.fa)

`-1 ERR1795461_1.fastq.gz` input file of forward reads

`-2 ERR1795461_2.fastq.gz` input file of reverse reads

`-o quast_ERR1795461_output` specifies the output folder

When QUAST is complete, it will print a completion message in the cell output.

Navigate to the output folder `quast_ERR1795461_output` using the `%cd` command and explore its contents. A description of the output files is as follows:

`report.txt` assessment summary in plain text format

`report.tsv` tab-separated version of the summary, suitable for spreadsheets (Google Docs, Excel, etc)

`report.tex` LaTeX version of the summary

`icarus.html` Icarus main menu with links to interactive viewers.

`report.pdf` all other plots combined with all tables (file is created if matplotlib python library is installed)

`report.html` HTML version of the report with interactive plots inside

`contigs_reports/` (only if a reference genome is provided)

`misassemblies_report` detailed report on misassemblies

`unaligned_report` detailed report on unaligned and partially unaligned contigs

`k_mer_stats` (only if --k-mer-stats option is specified)

`kmers_report` detailed report on k-mer-based metrics

`reads_stats/` (only if reads are provided)

`reads_report` detailed report on mapped reads statistics.
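
If you want to work with the summary programmatically, the tab-separated report can be loaded with the standard library. The sketch below assumes the file is named `report.tsv`, that its first row holds the label "Assembly" followed by the assembly names, and that every other row is a metric name followed by one value per assembly. The function name `parse_quast_tsv` and the numbers in the example are made up purely to show the layout.

```python
import csv
import io


def parse_quast_tsv(handle):
    """Return {assembly_name: {metric: value}} from a QUAST-style tab-separated report."""
    rows = list(csv.reader(handle, delimiter="\t"))
    header, metric_rows = rows[0], rows[1:]
    assemblies = header[1:]  # first cell is the row label, the rest are assembly names
    report = {name: {} for name in assemblies}
    for row in metric_rows:
        if not row:
            continue
        metric, values = row[0], row[1:]
        for name, value in zip(assemblies, values):
            report[name][metric] = value  # values are kept as strings
    return report


# Made-up numbers, only to illustrate the expected layout (not real results)
example = "Assembly\tcontigs\n# contigs\t42\nN50\t100000\n"
report = parse_quast_tsv(io.StringIO(example))
print(report["contigs"]["N50"])
```

With a real run you would open the file instead, for example `with open("quast_ERR1795461_output/report.tsv") as handle: report = parse_quast_tsv(handle)`.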

## Interpretation

### Metric description 

#### 1. Summary report - based on the report.txt file in the output folder

`# contigs` is the total number of contigs in the assembly.

`Largest contig` is the length of the longest contig in the assembly.

`Total length` is the total number of bases in the assembly.

`Reference length` is the total number of bases in the reference genome.

`GC (%)` is the total number of G and C nucleotides in the assembly, divided by the total length of the assembly.

`Reference GC (%)` is the percentage of G and C nucleotides in the reference genome.
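
As a worked example of the GC definition, the sketch below counts G and C bases over a set of toy contigs and divides by their total length. The function name `gc_percent` and the sequences are illustrative assumptions, and the handling of ambiguous bases is simplified compared with a real tool.

```python
def gc_percent(sequences):
    """GC (%) over a collection of sequences: 100 * (G + C) / total length."""
    total = sum(len(seq) for seq in sequences)
    gc = sum(seq.upper().count("G") + seq.upper().count("C") for seq in sequences)
    return 100.0 * gc / total if total else 0.0


# Toy contigs (made up): 4 G/C bases out of 10 bases in total
toy_contigs = ["ATGCGC", "ATAT"]
print(f"{gc_percent(toy_contigs):.1f}")
```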

`N50` is the length for which the collection of all contigs of that length or longer covers at least half an assembly.

`NG50` is the length for which the collection of all contigs of that length or longer covers at least half the reference genome (This metric is computed only if the reference genome is provided.)

`Nx and NGx` (for x between 0 and 100) are defined similarly to N50 but with x % instead of 50 %. The value of  x  is set with --x-for-Nx (90 by default).

`L50 (Lx, LG50, LGx)` is the number of contigs equal to or longer than N50 (Nx, NG50, NGx)

In other words, L50, for example, is the minimal number of contigs that cover half the assembly.
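
The definitions of Nx and Lx translate directly into a few lines of code. This is a minimal sketch with a hypothetical function name (`nx_lx`) and made-up contig lengths; passing a reference length switches the calculation to NGx and LGx.

```python
def nx_lx(contig_lengths, x=50, reference_length=None):
    """Return (Nx, Lx); uses the reference length for NGx/LGx when one is given."""
    target_total = reference_length if reference_length is not None else sum(contig_lengths)
    threshold = target_total * x / 100
    running = 0
    # Add contigs from longest to shortest until x % of the target is covered
    for count, length in enumerate(sorted(contig_lengths, reverse=True), start=1):
        running += length
        if running >= threshold:
            return length, count
    return None, None  # the contigs never reach the threshold (possible for NGx)


# Made-up contig lengths with a total of 300 bases
lengths = [80, 70, 50, 40, 30, 20, 10]

print(nx_lx(lengths))                        # N50 and L50
print(nx_lx(lengths, reference_length=400))  # NG50 and LG50
```

Here the two longest contigs (80 + 70 = 150) cover half of the 300-base assembly, so N50 is 70 and L50 is 2. Against a 400-base reference, three contigs are needed to reach 200 bases, so NG50 drops to 50.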

`# misassemblies` is the number of positions in the contigs (breakpoints) that satisfy one of the following criteria:

- the left flanking sequence aligns over 1 kbp away from the right flanking sequence on the reference;
- flanking sequences overlap on more than 1 kbp;
- flanking sequences align to different strands or different chromosomes;
- flanking sequences align on different reference genomes (MetaQUAST only).

*This metric requires a reference genome.*

`# misassembled contigs` is the number of contigs that contain misassembly events.

`Misassembled contigs length` is the total number of bases in misassembled contigs.

`# local misassemblies` is the number of positions in the contigs (breakpoints) that satisfy the following conditions:

1. The gap or overlap between left and right flanking sequences is less than 1 kbp, and larger than 200 bp (the maximum indel length).
2. The left and right flanking sequences both are on the same strand of the same chromosome of the reference genome.
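
The thresholds in the two definitions above can be summarised in a small decision function. This is a simplified sketch, not QUAST's actual logic: the function name `classify_breakpoint` is hypothetical, the distance is the gap (positive) or overlap (negative) between the flanking alignments on the reference, and treating exactly 1 kbp as an extensive misassembly is an assumption.

```python
EXTENSIVE_THRESHOLD = 1000  # 1 kbp, from the criteria above
MAX_INDEL_LENGTH = 200      # bp, the maximum indel length quoted above


def classify_breakpoint(distance, same_strand=True, same_chromosome=True):
    """Classify a breakpoint from the gap (positive) or overlap (negative) between flanks, in bp."""
    if not (same_strand and same_chromosome):
        return "misassembly"
    size = abs(distance)
    if size >= EXTENSIVE_THRESHOLD:  # assumption: exactly 1 kbp counts as extensive
        return "misassembly"
    if size > MAX_INDEL_LENGTH:
        return "local misassembly"
    return "indel or no event"


for case in [(1500, True, True), (500, True, True), (50, True, True), (50, False, True)]:
    print(case, "->", classify_breakpoint(*case))
```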

`# scaffold gap ext. mis.` is the number of positions in the scaffolds (breakpoints) where the flanking sequences are combined in the scaffold at the wrong distance (sufficient for reporting an extensive misassembly).

`# structural variations` is the number of misassemblies matched with structural variations of genome (if reads or BEDPE file with SV are provided, see --reads1/reads2 and --sv-bedpe).

`# unaligned mis. contigs` is the number of contigs in which more than 50% of the contig length is unaligned and whose aligned fragment contains at least one misassembly event. Such contigs are probably not related to the reference genome, so their misassemblies may not be real errors but differences between the assembled organism and the reference.

`# unaligned contigs` is the number of contigs that have no alignment to the reference sequence. The value "X + Y part" means X totally unaligned contigs plus Y partially unaligned contigs. This metric sums up # unaligned mis. contigs described above.

`Unaligned length` is the total length of all unaligned regions in the assembly (sum of lengths of fully unaligned contigs and unaligned parts of partially unaligned ones). This length does not include uncalled bases (N's) in the assembly.

`Genome fraction (%)` is the percentage of aligned bases in the reference genome. A base in the reference genome is aligned if there is at least one contig with at least one alignment to this base. 

`Duplication ratio` is the total number of aligned bases in the assembly divided by the total number of aligned bases in the reference genome. If the assembly contains many contigs that cover the same regions of the reference, its duplication ratio may be much larger than 1. This may occur due to overestimating repeat multiplicities and due to small overlaps between contigs, among other reasons. 
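
Genome fraction and duplication ratio can both be derived from the positions where contigs align to the reference. The sketch below is a simplification under stated assumptions: alignments are given as made-up `(start, end)` intervals on the reference with the end exclusive, and the number of aligned assembly bases is approximated by the interval lengths (indels are ignored). The function names are hypothetical.

```python
def covered_bases(intervals):
    """Number of reference positions covered by at least one (start, end) interval, end exclusive."""
    covered, current_end = 0, None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            covered += end - start        # new block, no overlap with the previous ones
            current_end = end
        elif end > current_end:
            covered += end - current_end  # count only the part not covered yet
            current_end = end
    return covered


def genome_fraction_and_duplication(intervals, reference_length):
    """Return (genome fraction in %, duplication ratio) for a set of alignments."""
    aligned_reference = covered_bases(intervals)
    aligned_assembly = sum(end - start for start, end in intervals)
    genome_fraction = 100.0 * aligned_reference / reference_length
    duplication_ratio = aligned_assembly / aligned_reference if aligned_reference else 0.0
    return genome_fraction, duplication_ratio


# Made-up alignments on a 1,000 bp reference; the first two overlap by 100 bp
alignments = [(0, 400), (300, 700), (800, 1000)]
fraction, duplication = genome_fraction_and_duplication(alignments, reference_length=1000)
print(f"Genome fraction: {fraction:.1f}%  Duplication ratio: {duplication:.2f}")
```

In this toy case 900 reference bases are covered by 1,000 aligned assembly bases, so the duplication ratio is slightly above 1 because of the 100 bp overlap.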

`# N's per 100 kbp` is the average number of uncalled bases (N's) per 100,000 assembly bases.

`# mismatches per 100 kbp` is the average number of mismatches per 100,000 aligned bases in the assembly. True SNPs and sequencing errors are not distinguished and are counted equally.

`# indels per 100 kbp` is the average number of indels per 100,000 aligned bases in the assembly. Several consecutive single nucleotide indels are counted as one indel.

`# genomic features` is the number of genomic features (genes, CDS, etc) in the assembly (complete and partial), based on a user-provided list of genomic features positions in the reference genome. A feature is 'partially covered' if the assembly contains at least 100 bp of this feature but not the whole one.

`Total aligned length` is the total number of aligned bases in the assembly. This value is usually smaller than the total length because some of the contigs may be unaligned or partially unaligned.

`Largest alignment` is the length of the largest continuous alignment in the assembly. This value can be smaller than the largest contig if the largest contig is misassembled or partially unaligned.

`NA50, NGA50, NAx, NGAx, LA50, LAx, LGA50, LGAx ("A" stands for "aligned")` are similar to the corresponding metrics without "A", but in this case aligned blocks instead of contigs are considered. Aligned blocks are obtained by breaking contigs at misassembly events and removing all unaligned bases.

#### 2. Summary report - plot description (based on icarus.html)

[Icarus](https://quast.sourceforge.net/icarus) generates a contig size viewer and one or more contig alignment viewers (if a reference genome is provided). All of them are located in the output folder "quast_ERR1795461_output/icarus_viewers/". The links to the viewers and other auxiliary information are provided in the Icarus main menu, which is saved in "quast_ERR1795461_output/icarus.html". All Icarus viewers contain a legend describing the color scheme. To move and zoom the interactive window you can use the mouse, the Icarus controls (top panel) or keyboard shortcuts (+, -, ←, →, use Shift to speed up the action).

**Step 1: Open the icarus.html**

**Step 2: Click on contig size viewer**

This type of viewer draws contigs ordered from longest to shortest. This ordering is suitable for comparing only largest contigs or number of contigs longer than a specific threshold. The viewer shows [N50](https://quast.sourceforge.net/docs/manual.html#N50) and [Nx](https://quast.sourceforge.net/docs/manual.html#Nx) (for user-defined x value) with color and textual indication. If the reference genome is available or at least approximate genome length is known, [NG50](https://quast.sourceforge.net/docs/manual.html#NG50) and [NGx](https://quast.sourceforge.net/docs/manual.html#Nx) are also shown. You can also tone down contigs shorter than a specified threshold using Icarus control panel.

**Step 3: Click on QUAST report in the main menu**

**Step 4: Click on contig alignment viewer in the main menu to view contigs aligned to reference sequence** 

This type of viewer is available only if a reference genome is provided. For large genomes (≥ 50 Mbp) each chromosome is displayed in a separate viewer. The viewer places contigs according to their mapping to the reference genome. The viewer can additionally visualize genes, operons, and read coverage distribution along the genome, if any of those were fed to QUAST.

**Further reading**

https://quast.sourceforge.net/docs/manual.html#sec3.4 

https://github.com/ablab/quast

https://academic.oup.com/bioinformatics/article/29/8/1072/228832


## BONUS!

If you are working with Bash on your own computer or on an HPC cluster and have many files to process, loops let you run the same command on every sample, which is very useful for large datasets.

Here's a way to do it. 

Create a new bash script using nano named `assemblies.sh`

In [None]:
# Don't run this cell
# Create a script to run the assemblies
!nano assemblies.sh

Then copy and paste the following script into your new file:

In [None]:
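#!/bin/bash
# Author: Nathalia Portilla

# Loop over every forward-read file and derive the sample name from it
for i in *_1.trimmed.fastq.gz; do
    NAME=$(basename "$i" _1.trimmed.fastq.gz)
    echo "$NAME"
    j="${NAME}_1.trimmed.fastq.gz"   # forward reads
    echo "$j"
    k="${NAME}_2.trimmed.fastq.gz"   # reverse reads
    echo "$k"
    # Assemble this sample with 16 threads; the output folder is named after the sample
    spades.py -1 "$j" -2 "$k" -t 16 --careful --cov-cutoff auto -o "${NAME}"
done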

Save the file.

Finally, you can execute it:

In [None]:
# Don't run this cell
!bash assemblies.sh

*Adapted from:*

- Advanced Bioinformatics Course developed for the GPS and JUNO projects - Wellcome Sanger Institute

*Modified by Luisa Sacristán (Universidad de los Andes-CABANA)*