# Whole Genome Sequencing of Escherichia coli

## Genome assembly using SPAdes

- De novo assembly is the process of merging overlapping sequence reads into contiguous sequences (contigs) without the use of any reference genome as a guide.

- SPAdes—St. Petersburg genome Assembler—was originally developed for de novo assembly of genome sequencing data produced for cultivated microbial isolates and for single-cell genomic DNA sequencing.

- Initially, SPAdes was designed for assembly of bacterial genomes from short Illumina reads, obtained via single-cell MDA or conventional isolate sequencing.

- With time, the functionality of SPAdes was extended to enable assembly of IonTorrent data, as well as hybrid assembly from short and long reads (PacBio and Oxford Nanopore).



(Cited from: Prjibelski, A., Antipov, D., Meleshko, D., Lapidus, A., & Korobeynikov, A. (2020). Using SPAdes de novo assembler. Current Protocols in Bioinformatics, 70, e102. doi:10.1002/cpbi.102)

![](https://bacpathgenomics.files.wordpress.com/2013/04/figure1_velvet.png)

- SPAdes starts its assembly pipeline by constructing a de Bruijn graph from short reads. When using a de Bruijn graph assembler, a number of variables need to be considered in order to produce optimal contigs. 

- The key issue is selecting an appropriate k-mer length for building the de Bruijn graph. Different sequencing platforms produce fragments of differing length and quality, meaning very different ranges of k-mers will be better suited to different types of read sets. 

- A balance must be found between the sensitivity offered by a smaller k-mer against the specificity of a larger one.



(cited from: Edwards and Holt Microbial Informatics and Experimentation 2013, 3:2)
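The k-mer trade-off described above can be illustrated with a toy de Bruijn graph. The sketch below is only an illustration, not SPAdes's implementation: the two reads are invented, reverse complements and sequencing errors are ignored, and the helper names (`build_de_bruijn`, `count_branching_nodes`, `count_start_nodes`) are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return a de Bruijn graph as {(k-1)-mer: set of successor (k-1)-mers}."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its (k-1)-prefix to its (k-1)-suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def count_branching_nodes(graph):
    """Count nodes with more than one outgoing edge (ambiguous paths)."""
    return sum(1 for successors in graph.values() if len(successors) > 1)


def count_start_nodes(graph):
    """Count nodes with no incoming edge (one per disconnected chain here)."""
    targets = {s for successors in graph.values() for s in successors}
    return len(set(graph) - targets)


# Invented toy reads: they overlap by 5 bases (CATGG) and the underlying
# sequence TAGCATGGCACC contains the short repeat GCA twice.
reads = ["TAGCATGG", "CATGGCACC"]

for k in (3, 5, 7):
    g = build_de_bruijn(reads, k)
    print(k, count_branching_nodes(g), count_start_nodes(g))
```

With these toy reads, k = 3 leaves one branching node because the repeated GCA gives the node CA two possible successors, k = 5 removes that ambiguity, and k = 7 is longer than the 5-base overlap between the reads can support, so the graph splits into two separate chains.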

Website: Download and install SPAdes from http://cab.spbu.ru/software/spades/.

Instructional Reference: SPAdes manual (https://cab.spbu.ru/files/release3.15.2/manual.html)

Inputs: forward and reverse read files (FASTQ format)

spades.py

spades.py -o Fika/wgs/sample1/sample1_assembly -1 Fika/wgs/sample1/sample1_1P.fastq -2 Fika/wgs/sample1/sample1_2P.fastq --careful -t 3
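The command above can also be assembled programmatically, which makes each option explicit. This is a minimal sketch: the helper `build_spades_command` is hypothetical, and it only builds and prints the command rather than running it. The flags are the ones used above (`-o` output directory, `-1`/`-2` paired reads, `--careful` to reduce mismatches and short indels, `-t` threads), plus the optional `-k` flag, which takes a comma-separated list of odd k-mer sizes below 128.

```python
import shlex


def build_spades_command(out_dir, forward_reads, reverse_reads,
                         threads=3, careful=True, kmers=None):
    """Build the argument list for a paired-end SPAdes run (hypothetical helper)."""
    cmd = ["spades.py", "-o", out_dir, "-1", forward_reads, "-2", reverse_reads]
    if careful:
        cmd.append("--careful")
    cmd += ["-t", str(threads)]
    if kmers:
        # SPAdes requires odd k-mer sizes smaller than 128, in ascending order.
        invalid = [k for k in kmers if k % 2 == 0 or k >= 128]
        if invalid:
            raise ValueError(f"k-mer sizes must be odd and below 128: {invalid}")
        cmd += ["-k", ",".join(str(k) for k in sorted(kmers))]
    return cmd


sample_dir = "Fika/wgs/sample1"
command = build_spades_command(
    out_dir=f"{sample_dir}/sample1_assembly",
    forward_reads=f"{sample_dir}/sample1_1P.fastq",
    reverse_reads=f"{sample_dir}/sample1_2P.fastq",
)
# Print the command for inspection; pass `command` to subprocess.run to execute it.
print(" ".join(shlex.quote(part) for part in command))
```

When `-k` is omitted, SPAdes chooses k-mer sizes automatically from the read length. After a successful run, the output directory contains `contigs.fasta` and `scaffolds.fasta`.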

## Scaffold Generation using Mauve

### We will skip this step; it is included for reference only

- Once a set of contigs has been assembled from the sequencing reads, the next step is to order those contigs against a suitable reference genome.

- This may seem counter-intuitive at first, since de novo assembly was used to obtain these contigs, but ordering the contigs aids the discovery and comparison process.

- The best reference to use is usually the most closely related bacterium with a ‘finished’ genome, but, as in the case of E. coli, finding the best reference may itself involve trial and error.


![](https://static-content.springer.com/esm/art%3A10.1186%2F2042-5783-3-2/MediaObjects/13309_2013_25_MOESM4_ESM.jpeg)

- Mauve is a Java-based tool for multiple alignment of whole genomes, with a built-in viewer and the option to export comparative genomic information in various forms. 

- Its alignment functions can also be used to order and orient contigs against an existing assembly. Mauve takes as input a set of genome assemblies, and generates a multiple whole-genome alignment.

- It identifies blocks of sequence homology, and assigns each block a unique colour. Each genome can then be visualized as a sequence of these coloured sequence blocks, facilitating visualization of the genome comparisons.

- This makes it easy to identify regions that are conserved among the whole set of input genomes, and regions that are unique to subsets of genomes (islands).



Website: http://darlinglab.org/mauve/mauve.html

Inputs: your newly assembled contigs and a reference genome (a closely related strain with a complete genome).

Manual: http://darlinglab.org/mauve/user-guide/reordering.html

## Genomic analysis of E. coli

- After assembling the contigs for a sample, we can go on to identify its serotype, MLST (multilocus sequence type), virulence genes, and antimicrobial resistance.
- We can use the web-based tools of the Center for Genomic Epidemiology: http://www.genomicepidemiology.org/services/
- Download the contigs file for your sample to your computer with WinSCP on Windows (https://winscp.net/eng/download.php) or with the `scp` command in Terminal on macOS.

## Phylogenetic Analysis

- We can compare our E. coli samples with other common E. coli strains by building a phylogenetic tree.
- We can use the PhaME pipeline (https://phame.readthedocs.io/en/latest/).
- PhaME infers a maximum-likelihood phylogeny with RAxML v7.2.8, using the GTR model of nucleotide substitution, the GAMMA model of rate heterogeneity, and 100 bootstrap replicates.
- The phylogeny can then be midpoint-rooted and drawn with the Interactive Tree Of Life software (iTOL v6) (https://itol.embl.de).
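To make the idea of bootstrap replicates concrete, the sketch below resamples alignment columns with replacement, which is the standard non-parametric bootstrap used to assess branch support. It is an illustration only: RAxML performs this step internally, the three short sequences and their names are invented, and `bootstrap_alignment` is a hypothetical helper.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns resampled with replacement."""
    n_cols = len(next(iter(alignment.values())))
    # Draw as many column indices as the alignment has columns.
    columns = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }


# Invented toy alignment; real input would be a whole-genome SNP alignment.
alignment = {
    "sample1": "ACGTACGTAC",
    "reference_A": "ACGTACGAAC",
    "reference_B": "ACCTACGAAC",
}

rng = random.Random(42)  # fixed seed so the resampling is reproducible
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]

print(len(replicates), "replicates, each with",
      len(replicates[0]["sample1"]), "columns")
```

A tree is inferred from each replicate, and the support for a branch is the fraction of replicate trees that contain it.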