## Notebook 11.2: *de novo* genome assembly

### Learning objectives: 

By the end of this notebook you will:

+ Know where to download fastq data for shotgun *de novo* genome assembly.
+ Have executed code to assemble a genome from Illumina short-reads.
+ Understand ...

### Associated reading: 

+ Single-Phase PacBio De Novo Assembly of the Genome of the Chytrid Fungus *Batrachochytrium dendrobatidis*, a Pathogen of Amphibia.



In [1]:
import toyplot
import numpy as np
import pandas as pd

In [2]:
# conda install bioconda::spades
# conda install bioconda::cutadapt
# conda install bioconda::samtools
# conda install bioconda::blast
# conda install bioconda::pilon
# conda install bioconda::minimap2
# conda install bioconda::jellyfish

### Batrachochytrium dendrobatidis genome assembly

The first draft genome assembly of *B. dendrobatidis* (*Bd* hereafter) was produced by the US DOE Joint Genome Institute in 2011 using Sanger sequencing at 8.74X coverage. The assembly consists of 127 scaffolds (scaffold N50 of 1.48 Mb) and 510 contigs (contig N50 of 318 kb) (https://www.ncbi.nlm.nih.gov/assembly/GCF_000203795.1).
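The N50 values quoted above can be computed from a list of contig or scaffold lengths. The function below is a minimal sketch of the usual definition (the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the assembly size). The function name `n50` and the toy lengths are made up for illustration and are not taken from the *Bd* assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    # sort from longest to shortest
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # the first length that pushes the running total to half the assembly
        if running >= half:
            return length
    raise ValueError("lengths must not be empty")


# toy contig lengths (hypothetical values, for illustration only)
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))
```

A larger N50 means that more of the assembly sits in long sequences, which is why it is commonly reported as a contiguity statistic.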


A recent paper published a new draft genome assembly of *Bd* (https://www.ncbi.nlm.nih.gov/pubmed/30533847) using PacBio and Illumina reads. Our exercise in this notebook will be to re-assemble the genome from their data while reading the accompanying paper.

              total        used        free      shared  buff/cache   available
Mem:           125G         24G        3.1G        787M         98G         99G
Swap:          112G        1.4G        111G


### Question from last time: PacBio 'subreads'
http://seqanswers.com/forums/showthread.php?t=34790


Subreads are the individual sequence reads that are determined in real time from a template on the sequencer. Each one corresponds to the stretch of DNA between two adapter sequences.

CCS reads are the result of consensus base calling from subreads that all come from the same template. If the template is short enough, the polymerase loops all the way around it, sequencing the adapter before it starts on the template again. In this case the same piece of template DNA is sequenced more than once, so the data from each pass can be combined into a single consensus sequence with higher base quality than the raw subreads. The difference in quality distributions between the subreads and the CCS reads is therefore expected.
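To make the idea of a consensus concrete, here is a toy sketch that takes several passes over the same template and calls the most common base in each column. It assumes the passes are already aligned and of equal length, which real subreads are not; actual CCS calling uses alignment and a probabilistic model. The function name `consensus` and the sequences are hypothetical.

```python
from collections import Counter


def consensus(passes):
    """Majority-vote consensus over equal-length, pre-aligned passes."""
    if len({len(p) for p in passes}) != 1:
        raise ValueError("passes must be non-empty and of equal length")
    # zip(*passes) yields one column of bases per template position
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*passes)
    )


# five toy passes over the same template, each with at most one error
toy_passes = [
    "ACGTTGCA",
    "ACGATGCA",  # error at position 3
    "ACGTTGCA",
    "ACGTTCCA",  # error at position 5
    "TCGTTGCA",  # error at position 0
]
print(consensus(toy_passes))
```

Because the errors fall at different positions in different passes, each one is outvoted, and the consensus is more accurate than any single pass.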



### Jellyfish: kmer counting tool
The [jellyfish](https://github.com/gmarcais/Jellyfish/tree/master/doc) software is used to efficiently count kmers from a genome fasta file or fastq sequenced read files. It works much faster than the Python code we wrote last week to find all kmers in a sequence, and has a lot of additional functionality. 


From the Jellyfish paper by Marçais and Kingsford (2011):
    
<blockquote>

Given a string S, we are often interested in counting the number of occurrences in S of every substring of length k. These length-k substrings are called k-mers and the problem of determining the number of their occurrences is called k-mer counting.

Counting the k-mers in a DNA sequence is an important step in many applications. For example, genome assemblers using the overlap-layout-consensus paradigm, such as the Celera (Miller et al., 2008; Myers et al., 2000) and Arachne (Jaffe et al., 2003) assemblers, use k-mers shared by reads as seeds to find overlaps. Statistics on the number of occurrences of each k-mer are first computed and used to filter out which k-mers are used as seeds. Such k-mer count statistics are also used to estimate the genome size: if a large fraction of k-mers occur c times, we can estimate the sequencing coverage to be approximately c and derive an estimate of the genome size from c and the total length of the reads. In addition, in most short-read assembly projects, errors are corrected in the sequencing reads to improve the quality of the final assembly. For example, Kelley et al. (2010) use k-mer frequencies to assess the likelihood that a misalignment between reads is a sequencing error or a genuine difference in sequence. A third application is the detection of repeated sequences, such as transposons, which play an important biological role. De novo repeat annotation techniques find candidate regions based on k-mer frequencies (Campagna et al., 2005; Healy et al., 2003; Kurtz et al., 2008; Lefebvre et al., 2003). The counts of k-mers are also used to seed fast multiple sequence alignment (Edgar, 2004). Finally, k-mer distributions can produce new biological insights directly. Sindi et al. (2008) used k-mers frequencies with large k (20 ≤ k ≤ 100) to study the mechanisms of sequence duplication in genomes.
</blockquote>
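As a reminder of what is being counted, the sketch below counts canonical k-mers in pure Python. It assumes that "canonical" means the lexicographically smaller of a k-mer and its reverse complement, so that both strands are counted together, and it only handles the characters A, C, G and T. The function names are made up for this example, and it will be far slower than jellyfish on real data.

```python
from collections import Counter

# translation table used to complement a DNA string
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the smaller of a kmer and its reverse complement."""
    revcomp = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, revcomp)


def count_kmers(seq, k):
    """Count canonical kmers of length k in a single sequence."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        counts[canonical(seq[i:i + k])] += 1
    return counts


# toy sequence (hypothetical), with k=3
counts = count_kmers("ACGTACGT", 3)
print(counts)
```

In this toy sequence the 3-mers `ACG` and `CGT` are reverse complements of each other, so they are tallied under the single key `ACG`.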


In [14]:
%%bash

# make directories for data and results
mkdir -p data/
mkdir -p jellyfish/

# download a fastq file with sequenced reads 
wget https://eaton-lab.org/data/40578.fastq.gz -q -O ./data/reads.fastq.gz

# call jellyfish on fastq file
# -m kmer size
# -s initial hash size: expected number of distinct kmers (genome size plus error kmers)
# -t number of threads to use
# -C count 'canonical kmers' (a kmer and its reverse complement are counted together)
# -o output location
# <(...) this decompresses the file as it is passed to jellyfish
jellyfish count -m 11 -s 100M -t 10 -C -o jellyfish/11mer.jf <(zcat ./data/reads.fastq.gz)
jellyfish count -m 21 -s 100M -t 10 -C -o jellyfish/21mer.jf <(zcat ./data/reads.fastq.gz)

In [16]:
%%bash

# get histogram of kmer counts for each kmer size
jellyfish histo jellyfish/11mer.jf > jellyfish/11mer.hist.csv
jellyfish histo jellyfish/21mer.jf > jellyfish/21mer.hist.csv

In [35]:
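# each row of the histogram is: (occurrence count, number of distinct kmers with that count)
df = pd.read_csv("jellyfish/11mer.hist.csv", sep=" ", names=["bin", "counts"])
toyplot.bars(
    df.counts,
    width=350, height=300,
    ylabel="number of distinct 11mers",
    xlabel="11mer occurrence count",
);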

### Estimate genome size from kmer distribution
The size parameter (given with `-s`) indicates the number of k-mers that will be stored in the hash. For sequencing reads, this should be the size of the genome plus the number of k-mers generated by sequencing errors. For example, if the error rate is e (for Illumina reads, usually e ≈ 1%), with an estimated genome size of G and a coverage of c, the expected number of k-mers is G + G × c × e × k.
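The formula above is easy to evaluate for a planned run. The sketch below plugs in placeholder values (a 25 Mb genome, 50X coverage, 1% error rate and k=21); these numbers are assumptions chosen for illustration, not measurements from the *Bd* data set, and the function name is hypothetical.

```python
def expected_kmers(genome_size, coverage, error_rate, k):
    """Expected number of k-mers in the hash: G + G * c * e * k."""
    error_kmers = genome_size * coverage * error_rate * k
    return genome_size + error_kmers


# hypothetical values: 25 Mb genome, 50X coverage, 1% error, k=21
size = expected_kmers(25_000_000, 50, 0.01, 21)
print(f"{size / 1e6:.1f}M")
```

With these inputs the error term is more than ten times larger than the genome itself, which shows why the hash size can need to be much bigger than the genome size when coverage is high.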

In [37]:
%%bash

jellyfish stats jellyfish/11mer.jf

Unique:    6574
Distinct:  7086
Total:     7807
Max_count: 10
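The same summary values can be recovered from the histogram table, which also gives a rough genome size estimate of the kind described in the Jellyfish paper excerpt: divide the total number of k-mers by the depth at the main peak. The sketch below assumes a DataFrame with the `bin` and `counts` columns used earlier, and it assumes that k-mers seen only once are mostly sequencing errors (the `min_bin` cutoff). The function names and the toy histogram are made up for illustration.

```python
import pandas as pd


def histo_stats(hist):
    """Summarize a k-mer histogram with columns 'bin' and 'counts'."""
    return {
        # k-mers that occur exactly once
        "unique": int(hist.loc[hist["bin"] == 1, "counts"].sum()),
        # number of different k-mers observed
        "distinct": int(hist["counts"].sum()),
        # total k-mer occurrences, counting multiplicity
        "total": int((hist["bin"] * hist["counts"]).sum()),
        # highest occurrence count observed
        "max_count": int(hist["bin"].max()),
    }


def estimate_genome_size(hist, min_bin=2):
    """Rough estimate: total k-mers divided by the depth of the main peak."""
    # drop low-count bins, assumed here to be dominated by errors
    solid = hist[hist["bin"] >= min_bin]
    peak_depth = int(solid.loc[solid["counts"].idxmax(), "bin"])
    total = int((solid["bin"] * solid["counts"]).sum())
    return total // peak_depth


# toy histogram (hypothetical values) with an error spike at 1 and a peak at 4
toy = pd.DataFrame({
    "bin": [1, 2, 3, 4, 5, 6],
    "counts": [500, 20, 60, 200, 70, 10],
})
print(histo_stats(toy))
print(estimate_genome_size(toy))
```

This estimate is only meaningful when the histogram has a clear coverage peak, which usually requires a larger k (such as 21) and enough sequencing depth.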


### Pilon: Genome assembly improvement software
This description is based on the [Pilon documentation](https://github.com/broadinstitute/pilon/wiki), where you can find more details. Pilon is used to improve genome assemblies based on how short reads map to them. It is most often used when a genome assembly is based on long reads, which have higher error rates than short reads and may introduce spurious structural variants. Pilon takes as input a FASTA genome file and one or more BAM files of reads aligned to the input FASTA genome. Pilon aims to improve the assembly by correcting the following types of variation:

+ Single base differences
+ Small indels
+ Larger indel or block substitution events
+ Gap filling
+ Identification of local misassemblies, including optional opening of new gaps

Pilon then outputs a FASTA file containing an improved representation of the genome from the read data and an optional VCF file detailing variation seen between the read data and the input genome.
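The inputs and outputs described above suggest a three-step polishing workflow: map the short reads to the draft assembly, sort and index the alignments, then run Pilon. The sketch below only builds the command strings so they can be inspected before being pasted into a `%%bash` cell. It assumes paired-end Illumina reads, the `minimap2`, `samtools` and `pilon` executables installed from bioconda, and hypothetical file names; check each tool's help output before relying on these options.

```python
import shlex


def polishing_commands(assembly, reads_1, reads_2, prefix, threads=4):
    """Build shell commands for one round of short-read polishing."""
    # quote paths so that spaces or special characters are safe in a shell
    asm, r1, r2 = (shlex.quote(p) for p in (assembly, reads_1, reads_2))
    bam = shlex.quote(f"{prefix}.sorted.bam")
    out = shlex.quote(f"{prefix}.pilon")
    return [
        # map short reads (-x sr preset), output SAM (-a), and sort into a BAM
        f"minimap2 -ax sr -t {threads} {asm} {r1} {r2} "
        f"| samtools sort -@ {threads} -o {bam} -",
        # index the sorted BAM so it can be read by position
        f"samtools index {bam}",
        # polish; --vcf and --changes request the optional report files
        f"pilon --genome {asm} --frags {bam} --output {out} --vcf --changes",
    ]


# hypothetical file names, for illustration only
cmds = polishing_commands("draft.fasta", "R1.fastq.gz", "R2.fastq.gz", "toy")
for cmd in cmds:
    print(cmd)
```

Keeping the commands as plain strings makes it easy to review them or to run a second round by passing the polished FASTA back in as the `assembly` argument.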