## Notebook 11.2: *de novo* PacBio genome assembly


### Learning objectives: 

By the end of this notebook you will:

+ Experience running PacBio long-read assembly.
+ Compare long-read assembly to short-read and hybrid assemblies from notebook 1.
+ Learn about alternative assembly software.


### Assigned readings: 

**Read these papers carefully; we will discuss them in class:**

+ Li, Fay-Wei, and Alex Harkess. “A Guide to Sequence Your Favorite Plant Genomes.” Applications in Plant Sciences 6, no. 3 (n.d.): e1030. https://doi.org/10.1002/aps3.1030.



+ Sumpter, Nicholas, Margi Butler, and Russell Poulter. “Single-Phase PacBio De Novo Assembly of the Genome of the Chytrid Fungus Batrachochytrium Dendrobatidis, a Pathogen of Amphibia.” Microbiology Resource Announcements 7, no. 21 (November 2018). https://doi.org/10.1128/MRA.01348-18.



**Please skim the paper below; you do not need to read it in detail:**
+ Giordano, Francesca, Louise Aigrain, Michael A. Quail, Paul Coupland, James K. Bonfield, Robert M. Davies, German Tischler, et al. “De Novo Yeast Genome Assemblies from MinION, PacBio and MiSeq Platforms.” Scientific Reports 7, no. 1 (June 21, 2017): 3935. https://doi.org/10.1038/s41598-017-03996-z.




In [None]:
import toyplot
import numpy as np
import pandas as pd

In [None]:
# conda install bioconda::miniasm
# conda install bioconda::minimap2
# conda install bioconda::canu

### Batrachochytrium dendrobatidis genome assembly

In this notebook we will continue to examine the published genome assembly data for *Bd* (https://www.ncbi.nlm.nih.gov/pubmed/30533847) that was recently assembled with PacBio reads. In the last notebook we assembled a related genome using short-read and hybrid short-read + PacBio data. Here we will assemble the genome using only PacBio long reads. Algorithmically, this is quite different from the de Bruijn graph-based approach used in the first notebook. We will try two assemblers: a very fast one called `miniasm`, and a much slower but more accurate one called `canu`. The latter is the method that the authors used in the study. Both are considered overlap-layout-consensus (OLC) assemblers.

## Miniasm assembly

The miniasm assembly is performed in two steps. We first use the tool `minimap2` to calculate the overlaps among all of the PacBio reads in the fastq data file. Then we input this overlap information along with the raw data to `miniasm`, which will infer a graph assembly: the best way to uniquely connect all overlapping sequences. This involves splitting ambiguous loops in the graph and calculating the shortest path through the graph. The result can be converted to a final fasta file representing the assembled contigs.
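
To make the overlap and layout ideas concrete, here is a deliberately tiny sketch in Python. It is not how `minimap2` or `miniasm` work internally: it assumes error-free reads, exact suffix-prefix matches, and a simple greedy merge, whereas real long-read assemblers use approximate overlaps and build and clean a graph. The function names `overlap` and `greedy_assemble` and the toy sequence are made up for illustration.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`.

    Returns 0 if no exact match of at least `min_len` bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest exact overlap."""
    reads = list(reads)
    while len(reads) > 1:
        # Overlap step: score every ordered pair of reads.
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(range(len(reads)), 2):
            n = overlap(reads[a], reads[b], min_len)
            if n > best_len:
                best_len, best_a, best_b = n, a, b
        if best_len == 0:
            break  # nothing left to join; remaining reads stay separate contigs
        # Layout step: join the best pair, keeping the shared bases only once.
        merged = reads[best_a] + reads[best_b][best_len:]
        reads = [r for i, r in enumerate(reads) if i not in (best_a, best_b)]
        reads.append(merged)
    return reads


# Toy "genome" and three error-free reads that overlap by 3 bases each.
genome = "ATGCGTACGTTAGCCATGA"
reads = [genome[0:8], genome[5:13], genome[10:19]]
contigs = greedy_assemble(reads)
print(contigs)
```

With real PacBio data the reads contain many errors, so exact matching like this would find almost no overlaps; that is why a dedicated overlapper is needed.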


### Overlap: get overlap of PacBio reads
Here we align all of the PacBio reads to each other. For very large datasets this can be time consuming. New algorithms (e.g., `minimap2`) have *dramatically* improved the speed of this step recently.

In [None]:
%%bash

#minimap2 -x ava-pb -t 20 SRR7825134.50K.fastq SRR7825134.50K.fastq > SRR7825134.paf


This returns a result in the form of a [pairwise alignment format (.paf)](https://github.com/lh3/miniasm/blob/master/PAF.md) file. So that you do not have to compute this file yourself, I uploaded its first 20 lines to a URL where you can easily look at them. Because the format is tabular, we can load it as a DataFrame with pandas, as shown below. This file contains information about the positions where long reads in the data overlap each other.

In [None]:
url = "https://raw.githubusercontent.com/genomics-course/11-assembly/master/notebooks/SRR7825134.20lines.paf"
pd.read_csv(url, sep="\t", header=None)
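
The table is easier to read with column names. The sketch below labels the 12 mandatory PAF columns described in the format specification linked above and computes the length of each overlap on the query read. To keep it self-contained it parses two made-up records from a string rather than the course URL; the column labels (`query_name`, `mapq`, and so on) are my own shorthand, not official names, and any optional tag columns after the first 12 are ignored.

```python
import io

import pandas as pd

# Shorthand labels for the 12 mandatory PAF columns (the labels are assumptions).
PAF_COLUMNS = [
    "query_name", "query_length", "query_start", "query_end", "strand",
    "target_name", "target_length", "target_start", "target_end",
    "n_matches", "block_length", "mapq",
]

# Two invented records, for illustration only (not real data).
paf_text = (
    "read1\t12000\t200\t9500\t+\tread2\t15000\t5000\t14400\t3100\t9400\t0\n"
    "read3\t8000\t0\t4000\t-\tread1\t12000\t7900\t12000\t1500\t4100\t0\n"
)

paf = pd.read_csv(
    io.StringIO(paf_text),
    sep="\t",
    header=None,
    usecols=list(range(12)),  # keep only the mandatory columns
    names=PAF_COLUMNS,
)

# Length of the overlapping region measured on the query read.
paf["query_overlap"] = paf["query_end"] - paf["query_start"]
print(paf[["query_name", "target_name", "strand", "query_overlap"]])
```

To try this on the real file, replace `io.StringIO(paf_text)` with the `url` variable from the cell above.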

### Layout: assemble contigs from overlapping reads
The program `miniasm` is an OLC assembler. It takes the raw long-read data (.fastq) and the mapping information (.paf) and builds contigs from a graph that connects sequences based on their overlaps. The output is in [graphical fragment assembly format (.gfa)](https://github.com/GFA-spec/GFA-spec), which some software can use to visualize the assembly. It contains the assembled contig sequences as well as the loops and bubbles (ambiguities) in their assembly (the graph).

In [None]:
%%bash

# miniasm -f SRR7825134.fastq SRR7825134.paf > SRR7825134.gfa

If we just want the best estimate of the assembly we can extract the contig sequences from this file to get a fasta file. This can be done using the unix tool `awk` to extract the lines from the `.gfa` file that start with "S" and convert them to fasta format, like below.

In [None]:
%%bash

# awk '$1 == "S" {print ">"$2"\n"$3}' SRR7825134.gfa > assembly_miniasm.fasta
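
For readers less familiar with `awk`, the same extraction can be sketched in Python. The function name `gfa_to_fasta` and the example records are invented for illustration; the only thing taken from the GFA format is that segment lines start with `S`, followed by the segment name and its sequence, separated by tabs.

```python
def gfa_to_fasta(gfa_lines):
    """Convert GFA segment ('S') lines into FASTA-formatted text."""
    records = []
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        # Segment lines look like: S <name> <sequence> [optional tags...]
        if fields[0] == "S" and len(fields) >= 3:
            records.append(f">{fields[1]}\n{fields[2]}")
    return "\n".join(records) + ("\n" if records else "")


# Invented GFA content, for illustration only.
example_gfa = [
    "H\tVN:Z:1.0\n",
    "S\tutg000001l\tACGTACGTAC\tLN:i:10\n",
    "a\tutg000001l\t0\tread1:1-10\t+\t10\n",
    "S\tutg000002l\tGGGTTTCCCA\tLN:i:10\n",
]
fasta = gfa_to_fasta(example_gfa)
print(fasta)
```

To run it on a real file, pass an open file handle instead of the list, e.g. `gfa_to_fasta(open("SRR7825134.gfa"))`.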

### Consensus: polish the assembled contigs by mapping reads again

Strictly speaking, `miniasm` performs only the *layout* step of OLC assembly. It has no *consensus* step: it simply joins stretches of read sequence into contigs based on the overlaps, so the per-base error rate of the assembly fasta file is similar to that of the raw reads. For this reason it is common with long-read assemblies to apply one or more additional *consensus* steps that aim to polish the final base calls. Example programs for final polishing include `racon` and `quiver`, which use PacBio reads to polish the data, or `pilon`, which can use Illumina data to polish it. We will skip this step for now since it is optional and often very time consuming.

Before we analyze this genome file, let's try assembling the genome again from PacBio data using a different program. 


## Canu assembly
The program [Canu](https://canu.readthedocs.io/en/latest/quick-start.html) is considered one of the best long-read assemblers, but it is much slower than the last approach we tried. While the assembly above using miniasm finished in just a few minutes, the canu assembly below took several hours. Once we get to the assembly comparisons you can decide if you think the longer run time was worth it.

One of the reasons canu is much slower is that it performs the full suite of steps we've discussed: pre-cleanup of reads, read overlap calculations, layout graph construction, consensus contig construction, and post-analysis contig polishing. A nice feature of canu is that you can tell it to do all of this with a single command rather than requiring multiple different programs or commands. 

In [None]:
# canu parameters explained 
# -p: prefix name for output files
# -d: output directory name
# genomeSize=: estimate of the genome size, e.g., 30m (written without a leading dash)
# -pacbio-raw: the raw fastq data

In [None]:
%%bash

# canu -p SRR7825124 -d assembly_canu genomeSize=30m -pacbio-raw SRR7825134.fastq
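
A genome size estimate such as `30m` is useful because, together with the read lengths, it tells us roughly how deeply the genome has been sequenced. The sketch below estimates coverage as total sequenced bases divided by genome size. The helper names are hypothetical, the FASTQ parser assumes simple four-line records, the toy reads and toy genome size are invented, and the `k`/`m`/`g` suffix handling reflects my understanding of how canu writes sizes rather than anything taken from its source.

```python
import io


def parse_genome_size(text):
    """Turn a size string such as '30m' into a number of bases (assumed k/m/g suffixes)."""
    units = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}
    text = text.strip().lower()
    if text and text[-1] in units:
        return int(float(text[:-1]) * units[text[-1]])
    return int(text)


def fastq_read_lengths(handle):
    """Return read lengths, assuming each record spans exactly four lines."""
    return [len(line.strip()) for i, line in enumerate(handle) if i % 4 == 1]


# Two invented reads, for illustration only.
toy_fastq = io.StringIO(
    "@read1\nACGTACGTACGT\n+\n!!!!!!!!!!!!\n"
    "@read2\nACGTACGT\n+\n!!!!!!!!\n"
)
lengths = fastq_read_lengths(toy_fastq)

toy_genome_size = 10  # bases; a made-up value so the arithmetic is easy to follow
coverage = sum(lengths) / toy_genome_size
print(f"{len(lengths)} reads, {sum(lengths)} bases, {coverage:.1f}x coverage")
```

For the real data you would open the fastq file instead of `toy_fastq` and divide by `parse_genome_size("30m")`.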

While canu is running it prints a long log to the screen explaining each step of the analysis and how many resources it is consuming, very similar to spades in the last notebook. We'll skip looking at the log for now, though, and move on to comparing the resulting files.

### Compare all four genome assemblies
Again we will use quast to assess the quality of our genome assemblies. The command below inputs the assembled contigs in fasta format of all four assemblies we've completed using four different approaches. It measures summary statistics of assembly quality and also BUSCO scores. You can read the report [here](https://genomics-course.github.io/11-assembly/notebooks/comparison/report.html).

In [None]:
%%bash

# quast.py \
#     assembly_spades/scaffolds.fasta \
#     assembly_spades_hybrid/scaffolds.fasta \
#     assembly_miniasm.fasta \
#     assembly_canu/SRR7825124.fasta \
#     -o comparison \
#     --conserved-genes-finding \
#     --fungus
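
One of the statistics in the quast report is the contig N50: the length of the contig at which, going from the longest contig to the shortest, the running total first reaches half of the total assembly length. A minimal sketch of that calculation follows; the function name `n50` and the contig lengths are invented for illustration and this is not quast's own code.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if it is empty)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running total to half the assembly.
        if running >= half_total:
            return length
    return 0


# Invented contig lengths: total = 300, so half = 150.
# 100 alone is below 150; 100 + 80 = 180 reaches it, so N50 is 80.
toy_lengths = [100, 80, 50, 30, 20, 10, 10]
print(n50(toy_lengths))
```

Because N50 depends only on contig lengths, a few very long contigs can raise it even if the sequence itself contains errors, which is one reason to look at BUSCO scores as well.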

<div class="alert alert-success">
    <b>[8] Question:</b> 
    Which of the four assembly methods yielded the best assembly in your opinion? Which statistics did you base this on? Why do you think the contig N50 was higher in some assemblies than in others? Why do you think the complete BUSCO score was higher in some than in others? Answer in Markdown below.
</div>

In [None]:
# answer here

<div class="alert alert-success">
    <b>[9] Question:</b> 
    What assembly software did Sumpter et al. use in the paper to assemble the <i>Bd</i> genome? How do their assembly statistics compare to ours? Why do you think they might differ?
</div>

In [None]:
# answer here

You just completed several genome assemblies that in total took only a few hours. Are you surprised by how easy or difficult this was? The time and expense involved in obtaining the data is the hardest part, followed by getting sufficient computational resources to assemble the genome. But overall, it's not a super difficult task. The `canu` and `spades` assemblies could each be done with a single command.

<div class="alert alert-success">
    <b>[10] Question:</b> 
    Search on Google Scholar for another published genome from within the last three years. In the Markdown cell below report the citation for the publication, how large the genome is, which technology was used to sequence it, and a summary of the genome assembly (e.g., N50 and/or number of scaffolds/contigs). Try to find an example for an organism that is of interest to you.
</div>

In [None]:
# answer here