## Notebook 11.2: *de novo* PacBio genome assembly

This week's assignment will take quite a bit longer than most others. This is not necessarily because it is more difficult, but because some of the computational steps take a while to run. So please be patient and start early.  


### Learning objectives: 

By the end of this notebook you will:

+ Experience running PacBio long-read assembly.
+ Compare long-read assembly to the short-read and hybrid assemblies from notebook 1.
+ Learn about alternative assembly software.


### Assigned readings: 

**Read these papers carefully; we will discuss them in class:**

+ Li, Fay-Wei, and Alex Harkess. “A Guide to Sequence Your Favorite Plant Genomes.” Applications in Plant Sciences 6, no. 3 (n.d.): e1030. https://doi.org/10.1002/aps3.1030.



+ Sumpter, Nicholas, Margi Butler, and Russell Poulter. “Single-Phase PacBio De Novo Assembly of the Genome of the Chytrid Fungus Batrachochytrium Dendrobatidis, a Pathogen of Amphibia.” Microbiology Resource Announcements 7, no. 21 (November 2018). https://doi.org/10.1128/MRA.01348-18.



**Please skim the paper below; you do not need to read it in detail:**
+ Giordano, Francesca, Louise Aigrain, Michael A. Quail, Paul Coupland, James K. Bonfield, Robert M. Davies, German Tischler, et al. “De Novo Yeast Genome Assemblies from MinION, PacBio and MiSeq Platforms.” Scientific Reports 7, no. 1 (June 21, 2017): 3935. https://doi.org/10.1038/s41598-017-03996-z.




In [1]:
# conda install bioconda::spades
# conda install bioconda::cutadapt
# conda install bioconda::samtools
# conda install bioconda::blast
# conda install bioconda::pilon
# conda install bioconda::minimap2
# conda install bioconda::miniasm
# conda install bioconda::racon
# conda install bioconda::quast

### Batrachochytrium dendrobatidis genome assembly

In this notebook we will continue to examine the published genome assembly data for *Bd* (https://www.ncbi.nlm.nih.gov/pubmed/30533847), which consist of PacBio and Illumina reads. In the last notebook we assembled a genome from short reads alone and from a hybrid of short reads and PacBio reads. Here we will assemble the genome using only PacBio long reads. Algorithmically, this is quite different from the de Bruijn graph approach used in the first notebook. We will use the assembler `miniasm` as part of an overlap-layout-consensus (OLC) assembly.

### Overlap: get overlap of PacBio reads
Here we align all of the PacBio reads to each other. Remember that this is the type of step we avoid in a *de Bruijn* graph based assembly, which is typically employed on short reads. In an OLC assembly, by contrast, we start with all-by-all comparisons. For very large datasets this can be time consuming. Newer algorithms (e.g., `minimap2`) have *dramatically* improved the speed of this step.
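
To make the cost of all-by-all comparison concrete, here is a minimal Python sketch that checks every ordered pair of reads for an exact suffix-prefix overlap. The function names, the `min_len` threshold, and the toy reads are made up for illustration. `minimap2` does not work this way (it finds approximate overlaps using minimizer-based seeding and tolerates sequencing errors), so treat this only as a picture of why naive pairwise comparison scales quadratically with the number of reads.

```python
from itertools import permutations


def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def all_vs_all(reads: dict[str, str], min_len: int = 3) -> list[tuple[str, str, int]]:
    """Compare every ordered pair of reads: n * (n - 1) comparisons in total."""
    overlaps = []
    for (name_a, seq_a), (name_b, seq_b) in permutations(reads.items(), 2):
        k = suffix_prefix_overlap(seq_a, seq_b, min_len)
        if k:
            overlaps.append((name_a, name_b, k))
    return overlaps


# Hypothetical toy reads, far shorter than real PacBio reads.
toy_reads = {"r1": "ACGTTGCA", "r2": "TGCAGGAT", "r3": "GGATCCAA"}
overlaps = all_vs_all(toy_reads)
print(overlaps)
```

With three reads there are only six ordered pairs to check, but with 50,000 reads the same loop would need roughly 2.5 billion comparisons, which is why fast overlappers matter.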

In [None]:
%%bash

# overlap: get overlap of PacBio reads
minimap2 -x ava-pb -t 2 SRR7825134.50K.fastq SRR7825134.50K.fastq > SRR7825134.paf


Let's take a look at the resulting file, which is in [pairwise mapping format (.paf)](https://github.com/lh3/miniasm/blob/master/PAF.md). This file records the positions where long reads in the data overlap each other.
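
One way to inspect PAF records is to parse the 12 mandatory tab-separated columns in Python. The sketch below is a minimal example under stated assumptions: the `PafRecord` class and `parse_paf_line` function are hypothetical helpers written for this notebook, and the example line is made up rather than taken from `SRR7825134.paf`.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    query: str
    query_len: int
    query_start: int
    query_end: int
    strand: str
    target: str
    target_len: int
    target_start: int
    target_end: int
    matches: int
    block_len: int
    mapq: int


def parse_paf_line(line: str) -> PafRecord:
    """Parse the 12 mandatory PAF columns; optional trailing tags are ignored."""
    f = line.rstrip("\n").split("\t")
    return PafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                     f[5], int(f[6]), int(f[7]), int(f[8]),
                     int(f[9]), int(f[10]), int(f[11]))


# A made-up record for illustration (not real output from this dataset).
example = "readA\t9000\t5000\t9000\t+\treadB\t8000\t0\t4100\t3200\t4100\t0\ttp:A:S"
rec = parse_paf_line(example)
overlap_len = rec.query_end - rec.query_start
print(rec.query, rec.target, overlap_len, round(rec.matches / rec.block_len, 3))
```

To apply this to the real file, you could loop over the lines of `SRR7825134.paf` and call `parse_paf_line` on each one, for example to tabulate how long the overlaps are.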

### Layout: assemble contigs from overlapping reads
The program `miniasm` is an OLC assembler. It takes the raw reads (.fastq) together with the overlap information (.paf) and constructs contigs based on the overlaps among reads. The output is in [graphical fragment assembly format (.gfa)](https://github.com/GFA-spec/GFA-spec). We then use the unix tool `awk` to extract the lines of the .gfa file that contain the contig sequences and write them to a fasta file.
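
If `awk` is unfamiliar, the same extraction can be sketched in Python. In a GFA file, segment lines start with `S`, followed by the segment name and its sequence, all tab-separated. The function name `gfa_segments_to_fasta` and the toy GFA text are assumptions made for illustration; real `miniasm` output contains far longer sequences.

```python
def gfa_segments_to_fasta(gfa_text: str) -> str:
    """Turn every GFA segment (S) line into a two-line FASTA record."""
    records = []
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        # S lines are laid out as: S <segment name> <sequence> [optional tags]
        if len(fields) >= 3 and fields[0] == "S":
            records.append(f">{fields[1]}\n{fields[2]}")
    return "\n".join(records) + ("\n" if records else "")


# Made-up GFA content for illustration only.
toy_gfa = (
    "H\tVN:Z:1.0\n"
    "S\tutg000001l\tACGTACGT\tLN:i:8\n"
    "a\tutg000001l\t0\tread1:1-8\t+\t8\n"
    "S\tutg000002l\tGGCC\tLN:i:4\n"
)
fasta = gfa_segments_to_fasta(toy_gfa)
print(fasta, end="")
```

Only the two `S` lines become FASTA records; the header line and the `a` line (which describes how a read tiles a contig) are skipped.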

In [None]:
%%bash

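# layout: build contigs from the reads and their overlaps
# (use the same reads and the .paf file written by the overlap step above)
miniasm -f SRR7825134.50K.fastq SRR7825134.paf > SRR7825134.gfa

# extract sequence (S) lines from the GFA and write them as fasta
awk '$1 == "S" {print ">"$2"\n"$3}' SRR7825134.gfa > SRR7825134.fasta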


### Consensus: polish the assembled contigs

This step maps the raw reads (.fastq) to the newly assembled contigs (.fasta) and saves the results in [pairwise mapping format (.paf)](https://github.com/lh3/miniasm/blob/master/PAF.md), just like before. The polishing tool `racon` then takes the assembled contigs, the alignment file, and the raw reads together and attempts to fill gaps and correct base calls by comparing the raw reads where they map to the assembled contigs. Because `miniasm` itself performs no consensus step, its contigs carry roughly the same per-base error rate as the raw reads, which is why polishing matters here. This polishing, or consensus, step is common to long-read OLC assembly methods. The mapping step takes only about a minute on this dataset, but polishing takes about 15 minutes.
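
The idea behind consensus polishing can be illustrated with a toy majority vote over reads that are already aligned to the draft, column by column. This is a deliberately simplified sketch, not how `racon` works: `racon` builds its consensus with partial-order alignment over windows of the contig and handles insertions and deletions, which this example ignores. The function name `column_consensus` and the sequences are made up for illustration.

```python
from collections import Counter


def column_consensus(draft: str, pileup: list[str]) -> str:
    """Majority vote per column; '-' marks positions a read does not cover."""
    polished = []
    for i, draft_base in enumerate(draft):
        votes = Counter(read[i] for read in pileup if read[i] != "-")
        if votes:
            best_base, best_count = votes.most_common(1)[0]
            # Only overrule the draft when a strict majority of covering reads agree.
            if best_count * 2 > sum(votes.values()):
                polished.append(best_base)
                continue
        polished.append(draft_base)
    return "".join(polished)


# Hypothetical draft contig with one wrong base (position 4) and three aligned reads.
draft = "ACGTTACA"
pileup = [
    "ACGTAAC-",
    "-CGTAACA",
    "ACGTAACA",
]
polished = column_consensus(draft, pileup)
print(polished)
```

Because all three reads agree on `A` at the fifth position, the draft's `T` is replaced. Real reads disagree with each other far more often, which is why coverage depth strongly affects how well polishing works.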

In [None]:
%%bash

# map raw reads to the long-read assembled contigs (fasta)
minimap2 -t 2 SRR7825134.fasta SRR7825134.fastq > SRR7825134.gfa1.paf

# polish/correct the assembly using the mapped long reads
racon -t 2 SRR7825134.fastq SRR7825134.gfa1.paf SRR7825134.fasta > SRR7825134.racon1.fasta


### Look at your final genome file
This is it. You assembled a genome completely using long-read PacBio data. 
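
One of the statistics QUAST reports is N50: the contig length at which contigs of that length or longer contain at least half of the assembled bases. As a rough sketch of how that number is computed, the Python below calculates it from FASTA text. The helper names `fasta_lengths` and `n50` and the toy sequences are assumptions for illustration; use the QUAST report for your actual answers.

```python
def n50(lengths: list[int]) -> int:
    """Contig length at which the running total (longest first) reaches half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def fasta_lengths(fasta_text: str) -> list[int]:
    """Sequence length of each FASTA record (a sequence may span several lines)."""
    lengths = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            lengths.append(0)
        elif lengths:
            lengths[-1] += len(line.strip())
    return lengths


# Made-up three-contig assembly for illustration.
toy_fasta = ">c1\nACGTACGTAC\n>c2\nACGTAC\nGG\n>c3\nACG\n"
lengths = fasta_lengths(toy_fasta)
print(lengths, sum(lengths), n50(lengths))
```

Here the contigs are 10, 8, and 3 bases long (21 in total). The longest contig alone covers fewer than half the bases, so the N50 is the length of the second contig.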

In [None]:
%%bash

# run quast on the PacBio assembled genome
quast.py -o results_assembly_3 SRR7825134.racon1.fasta

<div class="alert alert-success">
    <b>[7] Question:</b> 
    What are the major differences in terms of the methods that we used versus those Sumpter et al. used to assemble the Bd genome? How do the assembly statistics for our OLC assembly compare to those in the Sumpter et al. paper? Include the statistics you think are important for assessing how they compare in your answer below in Markdown. 
</div>

In [2]:
# answer here

<div class="alert alert-success">
    <b>[8] Question:</b> 
    Now that you've completed short-read, long-read, and hybrid assemblies, which method do you think worked best given the smaller low-coverage data sets that we had available (subsampled data we're using for efficiency reasons)? Answer in Markdown below and justify your answer in terms of assembly statistics. 
</div>