## De novo Canu assembly and Nanopolish

###### NOTE: This notebook has high memory requirements

This notebook relies on [Canu](https://github.com/marbl/canu) to produce a draft genome assembly, and on Nanopolish to improve its consensus sequence.

Canu is a popular assembler based on the Celera Assembler that can reliably assemble complete microbial genomes and almost complete eukaryotic chromosomes. Canu has three stages: correction, trimming and assembly. The stages can be executed independently or in series. More information on the process is available in the 2.0_DeNovo_Canu-Miniasm.ipynb notebook.

[Nanopolish](https://github.com/jts/nanopolish) polishes the consensus sequence to improve the accuracy of an assembly. It combines the signal-level ONT data, the basecalled reads, and the draft assembly to generate an improved assembly.

The first step is to get the draft assembly. Although this can be done with any assembly tool for ONT data, the following commands use Canu:

In [None]:
canu -p sample \
     -d data/sample/canu_output \
     genomeSize=2.1m \
     useGrid=false \
     minReadLength=50 \
     minOverlapLength=50 \
     corMemory=2 \
     corThreads=2 \
     -nanopore-raw data/sample/reads.fastq
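
Once Canu finishes, a quick look at the number of contigs and the total number of bases can show whether the draft is roughly in line with the expected genome size. The sketch below assumes the draft was written to data/sample/canu_output/sample.contigs.fasta (the name that follows from the -p and -d options above). The `fasta_stats` helper is hypothetical and not part of Canu.

```shell
# Hypothetical helper: print "<contigs> <total_bases>" for a FASTA file.
fasta_stats() {
    awk '/^>/ { n++; next }
         { gsub(/[ \t\r]/, ""); bases += length($0) }
         END { printf "%d %d\n", n, bases }' "$1"
}

# Assumed location of the Canu draft assembly.
draft=data/sample/canu_output/sample.contigs.fasta
if [ -f "$draft" ]; then
    fasta_stats "$draft"
else
    echo "draft assembly not found: $draft" >&2
fi
```

For a genomeSize of 2.1m, a total far from about 2.1 million bases may be worth investigating before polishing.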

Nanopolish consists of four different modules that complete different tasks. This notebook uses the "variants --consensus" module, which calculates an improved consensus sequence for a draft assembly. The other available Nanopolish modules are:

- nanopolish call-methylation: predict genomic bases that may be methylated
- nanopolish variants: detect point variants and indels with respect to a reference genome
- nanopolish eventalign: align signal-level events to k-mers of a reference genome

Before using Nanopolish, the reads must be aligned to the draft assembly. The [BWA aligner](https://github.com/lh3/bwa) is used to produce the input files that the Nanopolish "variants --consensus" module needs.
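
Because the remaining steps chain several external programs, it can help to confirm that they are all on the PATH before starting. This is a minimal sketch; the `missing_tools` helper is hypothetical, and the list of tool names is an assumption based on the commands used in this notebook.

```shell
# Hypothetical helper: print the name of every listed tool that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Tool names assumed from the commands used in this notebook.
missing=$(missing_tools bwa samtools nanopolish parallel)
if [ -n "$missing" ]; then
    echo "missing tools:" $missing >&2
fi
```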

First, the draft assembly has to be indexed so that the basecalled reads can be aligned against it:

In [None]:
bwa index data/sample/canu_output/sample.contigs.fasta

The reads are then aligned with BWA, and SAMtools is used to sort the alignments and to index the resulting BAM file:

In [None]:
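# Align the reads to the draft and write a coordinate-sorted BAM file
bwa mem -x ont2d -t 2 data/sample/canu_output/sample.contigs.fasta data/sample/reads.fastq | samtools sort -o reads.sorted.bam -T reads.tmp -
# Index the sorted BAM file
samtools index reads.sorted.bam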

After getting the input files, Nanopolish must build an index mapping from basecalled reads to the ONT event data (the directory with the original FAST5 files).

In [None]:
#Data not included in the repository
nanopolish index -d data/sample/fast5/pass \
                    data/sample/reads.fastq

With the following code, Nanopolish improves the draft assembly using the "variants --consensus" module. From version 0.10, "variants --consensus" only outputs a VCF file instead of a FASTA. The VCF file describes the changes that need to be made to turn the draft sequence into the polished assembly. The vcf2fasta subcommand is then used to generate the final polished genome.

Adjust the <font color='blue'>-P</font> (number of parallel jobs) and <font color='blue'>--threads</font> (threads per job, -t) options to suit your machine.

In [None]:
python3 nanopolish_makerange.py data/sample/canu_output/sample.contigs.fasta | parallel --results nanopolish.results -P 2 \
    nanopolish/nanopolish variants --consensus -o polished.{1}.vcf -w {1} -r data/sample/reads.fastq -b reads.sorted.bam -g data/sample/canu_output/sample.contigs.fasta -t 2 --min-candidate-frequency 0.1
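
Since each of the -P parallel jobs runs with its own -t threads, the product of the two is roughly the number of cores in use. The sketch below suggests a -P value for a given thread count. The `jobs_for` helper is hypothetical, and the rule of keeping jobs × threads at or below the core count is an assumption rather than a Nanopolish requirement.

```shell
# Hypothetical helper: jobs_for <threads_per_job> <cores>
# Prints the largest job count with jobs * threads <= cores (at least 1).
jobs_for() {
    n=$(( $2 / $1 ))
    [ "$n" -lt 1 ] && n=1
    echo "$n"
}

# Number of online cores; falls back to 1 if getconf cannot report it.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
echo "suggested -P for -t 2: $(jobs_for 2 "$cores")"
```

Memory use also grows with the number of parallel jobs, so a lower -P may be needed on machines with little RAM.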

In [None]:
nanopolish vcf2fasta -g data/sample/canu_output/sample.contigs.fasta polished.*.vcf > polished_genome.fa
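
Because each VCF record describes one proposed change to the draft, counting the records gives a rough idea of how much polishing altered the assembly. This is a minimal sketch that assumes the polished.*.vcf files are in the current directory. The `count_vcf_records` helper is hypothetical and not part of Nanopolish.

```shell
# Hypothetical helper: count the variant records (non-header, non-empty lines) in a VCF file.
count_vcf_records() {
    awk '!/^#/ && NF { n++ } END { print n + 0 }' "$1"
}

# Report the number of proposed edits per window, skipping the loop if no files match.
for vcf in polished.*.vcf; do
    [ -e "$vcf" ] || continue
    echo "$vcf: $(count_vcf_records "$vcf") proposed edits"
done
```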

### References

[Loman NJ, Quick J and Simpson JT. A complete bacterial genome assembled de novo using only nanopore sequencing data. Nature Methods 2015 12:733–735](https://www.biorxiv.org/content/early/2015/03/11/015552)
