## *De novo* Canu assembly and Nanopolish

###### NOTE: This notebook has high memory requirements

This notebook relies on [Canu](https://github.com/marbl/canu) to produce a draft genome assembly and on Nanopolish to improve its consensus sequence.

Canu is a popular assembler based on the Celera Assembler that can reliably assemble complete microbial genomes and nearly complete eukaryotic chromosomes. Canu has three stages: correction, trimming and assembly. The stages can be executed independently or in series. More information about the process is available in the 2.0_DeNovo_Canu-Miniasm.ipynb notebook.

[Nanopolish](https://github.com/jts/nanopolish) polishes the consensus sequence of a draft assembly to improve its accuracy. It combines the signal-level ONT data, the basecalled reads, and the draft assembly to generate an improved assembly.

The first step is to get the draft assembly. Although this can be done with any assembly tool for ONT data, the following commands use Canu:

In [None]:
canu -p sample \
     -d data/sample/canu_output \
     genomeSize=2.1m \
     useGrid=false \
     minReadLength=50 \
     minOverlapLength=50 \
     corMemory=2 \
     corThreads=2 \
     maxMemory=6 \
     stopOnLowCoverage=1 \
     -nanopore-raw data/sample/reads.fastq
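
Before polishing, it can be useful to check that the draft assembly is roughly the size given in genomeSize. The following is a minimal sketch, not part of Canu: `fasta_stats` is a hypothetical helper, and the demo FASTA is made up so that the snippet runs without any Canu output. On real data it would be pointed at data/sample/canu_output/sample.contigs.fasta.

```shell
# Hypothetical helper: count sequences and total bases in a FASTA file.
fasta_stats() {
    awk '
        /^>/ { n++; next }                                  # header line: one more contig
             { gsub(/[ \t\r]/, ""); total += length($0) }   # sequence line: add its length
        END  { printf "contigs=%d total_bp=%d\n", n, total }
    ' "$1"
}

# Demo on a tiny made-up FASTA (assumption: stands in for the Canu contigs file).
demo_fasta=$(mktemp)
printf '>tig1\nACGTACGT\nACGT\n>tig2\nGGCC\n' > "$demo_fasta"
fasta_stats "$demo_fasta"      # prints: contigs=2 total_bp=16
rm -f "$demo_fasta"
```

A total far from the expected genome size may indicate a fragmented or contaminated assembly that is worth inspecting before running Nanopolish.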

Nanopolish consists of four modules that perform different tasks. The code below uses the "variants --consensus" module, which calculates an improved consensus sequence for a draft assembly. The other available Nanopolish modules are:

- nanopolish call-methylation: predict genomic bases that may be methylated
- nanopolish variants: detect point variants and indels with respect to a reference genome
- nanopolish eventalign: align signal-level events to k-mers of a reference genome

Before using Nanopolish, the reads and the assembly need to be pre-processed. The [BWA aligner](https://github.com/lh3/bwa) is used to align the basecalled reads to the draft assembly, which produces the input files required by the Nanopolish "variants --consensus" module.

First, the draft assembly has to be indexed so that the basecalled reads can be aligned against it:

In [1]:
bwa index data/sample/canu_output/sample.contigs.fasta

[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.06 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.03 sec
[main] Version: 0.7.15-r1140
[main] CMD: bwa index data/sample/canu_output/sample.contigs.fasta
[main] Real time: 0.137 sec; CPU: 0.100 sec
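
`bwa index` writes its index next to the reference as files with the suffixes .amb, .ann, .bwt, .pac and .sa. As a hedged sketch, the check below confirms that all five exist before alignment is attempted; `bwa_index_complete` is a hypothetical helper name, not a BWA command.

```shell
# Hypothetical helper: succeed only if all five BWA index files exist for a reference.
bwa_index_complete() {
    bic_ref=$1
    for bic_ext in amb ann bwt pac sa; do
        if [ ! -e "$bic_ref.$bic_ext" ]; then
            echo "missing: $bic_ref.$bic_ext" >&2
            return 1
        fi
    done
    return 0
}

# Usage with the reference path from this notebook.
ref=data/sample/canu_output/sample.contigs.fasta
if bwa_index_complete "$ref"; then
    echo "index ready"
else
    echo "run bwa index first"
fi
```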


Next, the basecalled reads are aligned to the draft assembly with BWA, and SAMtools is used to sort and index the resulting alignment file:

In [2]:
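# Legacy samtools syntax (sort takes an output prefix); newer releases need: samtools sort -o data/sample/reads.sorted.bam -
bwa mem -x ont2d -t 2 data/sample/canu_output/sample.contigs.fasta data/sample/reads.fastq \
    | samtools view -S -b -u - \
    | samtools sort - data/sample/reads.sorted
# Index only once sorting has finished (piping sort into index would race on the BAM file)
samtools index data/sample/reads.sorted.bam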

[bam_header_read] EOF marker is absent. The input is probably truncated.
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 2650 sequences (20048748 bp)...
[M::process] read 1070 sequences (14793475 bp)...
[M::mem_process_seqs] Processed 2650 reads in 29.624 CPU sec, 15.117 real sec
[samopen] SAM header is present: 47 sequences.
[M::mem_process_seqs] Processed 1070 reads in 10.612 CPU sec, 5.476 real sec
[main] Version: 0.7.15-r1140
[main] CMD: bwa mem -x ont2d -t 2 data/sample/canu_output/sample.contigs.fasta data/sample/reads.fastq
[main] Real time: 20.879 sec; CPU: 40.312 sec


After getting the input files, Nanopolish must build an index mapping from basecalled reads to the ONT event data (the directory with the original FAST5 files).

In [3]:
#Data not included in the repository
nanopolish index -d data/sample/fast5 \
                    data/sample/reads.fastq


[readdb] indexing data/sample/fast5
[readdb] num reads: 3720, num reads with path to fast5: 1639
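
In the output above, fewer than half of the reads were linked to a FAST5 file, and reads without signal data presumably cannot contribute to polishing. The sketch below extracts that proportion from the `[readdb]` summary line; `readdb_coverage` is a hypothetical helper, and it assumes the log line has exactly the format shown above.

```shell
# Hypothetical helper: report the share of reads linked to a FAST5 file,
# parsed from the "[readdb] num reads: ..." line printed by nanopolish index.
readdb_coverage() {
    awk '
        /num reads with path to fast5:/ {
            gsub(/,/, "")                 # drop the comma after the first count
            total = $4; linked = $NF      # 4th field: all reads; last field: linked reads
            if (total > 0)
                printf "%d of %d reads have signal data (%.1f%%)\n",
                       linked, total, 100 * linked / total
        }
    ' "$@"
}

# Demo using the line from the output above.
echo '[readdb] num reads: 3720, num reads with path to fast5: 1639' | readdb_coverage
# prints: 1639 of 3720 reads have signal data (44.1%)
```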


With the following code, Nanopolish improves the draft assembly using the "variants --consensus" module. Since version 0.10, "variants --consensus" outputs only a VCF file instead of a FASTA file. The VCF file describes the changes needed to turn the draft sequence into the polished assembly. The vcf2fasta subcommand is then used to generate the final polished genome.

Change the **-P** (number of parallel jobs) and **--threads** (**-t**, threads per Nanopolish job) options as appropriate for your machine.

In [4]:
mkdir -p data/sample/nanopolish_output
python3 /home/jovyan/software/nanopolish/scripts/nanopolish_makerange.py data/sample/canu_output/sample.contigs.fasta | parallel --results data/sample/nanopolish.results -P 2 \
    nanopolish variants --consensus -o data/sample/nanopolish_output/polished.{1}.vcf -w {1} -r data/sample/reads.fastq -b data/sample/reads.sorted.bam -g data/sample/canu_output/sample.contigs.fasta -t 2 --min-candidate-frequency=0.1

When using programs that use GNU Parallel to process data for publication please cite:

  O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
  ;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; and it won't cost you a cent.
Or you can get GNU Parallel without this requirement by paying 10000 EUR.

To silence this citation notice run 'parallel --bibtex' once or use '--no-notice'.

Number of variants in span (19) would exceed max-haplotypes. Variants may be missed. Consider running with a higher value of max-haplotypes!
Number of variants in span (61) would exceed max-haplotypes. Variants may be missed. Consider running with a higher value of max-haplotypes!
Number of variants in span (10) would exceed max-haplotypes. Variants may be missed. Consider running with a higher value of max-haplotypes!
Number of variants in span (19) would exceed max-haplotypes. Variants may be missed. Consider running with a higher value of max-haplotypes!
Num
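
The warnings above can be numerous, so a quick tally helps decide whether they are worth acting on. This is a minimal sketch that assumes the warnings were saved to a text log with the wording shown above; `haplotype_warnings` is a hypothetical helper, and the demo feeds it two of the lines from the output.

```shell
# Hypothetical helper: count max-haplotypes warnings and report the largest span size.
# Reads the named log files, or standard input if none are given.
haplotype_warnings() {
    awk '
        /would exceed max-haplotypes/ {
            n++
            span = $0
            sub(/^.*span \(/, "", span)   # keep the text after "span ("
            sub(/\).*$/, "", span)        # drop everything from ")" onwards
            if (span + 0 > max) max = span + 0
        }
        END { printf "warnings=%d largest_span=%d\n", n, max }
    ' "$@"
}

# Demo with two warning lines copied from the output above.
haplotype_warnings <<'EOF'
Number of variants in span (19) would exceed max-haplotypes. Variants may be missed. Consider running with a higher value of max-haplotypes!
Number of variants in span (61) would exceed max-haplotypes. Variants may be missed. Consider running with a higher value of max-haplotypes!
EOF
# prints: warnings=2 largest_span=61
```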

In [6]:
nanopolish vcf2fasta -g data/sample/canu_output/sample.contigs.fasta data/sample/nanopolish_output/polished.*.vcf > data/sample/nanopolish_output/polished_genome.fa

[vcf2fasta] rewrote contig tig00000001 with 13 subs, 87 ins, 7 dels (0 skipped)
[vcf2fasta] rewrote contig tig00000003 with 17 subs, 22 ins, 5 dels (0 skipped)
[vcf2fasta] rewrote contig tig00000008 with 28 subs, 48 ins, 8 dels (0 skipped)
[vcf2fasta] rewrote contig tig00000009 with 6 subs, 14 ins, 1 dels (0 skipped)
[vcf2fasta] rewrote contig tig00000015 with 0 subs, 0 ins, 0 dels (0 skipped)
[vcf2fasta] rewrote contig tig00000018 with 24 subs, 41 ins, 1 dels (0 skipped)
[vcf2fasta] rewrote contig tig00000026 with 24 subs, 57 ins, 8 dels (0 skipped)
[vcf2fasta] rewrote contig tig00000028 with 0 subs, 0 ins, 0 dels (0 skipped)
[vcf2fasta] rewrote contig tig00000033 with 0 subs, 1 ins, 0 dels (0 skipped)
[vcf2fasta] rewrote contig tig00000038 with 0 subs, 0 ins, 0 dels (0 skipped)
[vcf2fasta] rewrote contig tig00000049 with 0 subs, 0 ins, 0 dels (0 skipped)
[vcf2fasta] rewrote contig tig00000052 with 19 subs, 35 ins, 1 dels (0 skipped)
[vcf2fasta] rewrote contig tig00000053 with 4 subs,
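
The per-contig lines above can be totalled to see how many corrections polishing applied overall. The sketch below is illustrative only: `vcf2fasta_summary` is a hypothetical helper, it assumes the log format shown above, and it skips truncated lines. On real data the messages would first have to be captured to a file (for example by redirecting the command's standard error, which is an assumption about where they are written).

```shell
# Hypothetical helper: total the substitutions, insertions and deletions
# reported in "[vcf2fasta] rewrote contig ..." lines.
vcf2fasta_summary() {
    awk '
        /^\[vcf2fasta\] rewrote contig/ && NF >= 11 {   # NF check skips truncated lines
            contigs++
            subs += $6; ins += $8; dels += $10
        }
        END { printf "contigs=%d subs=%d ins=%d dels=%d\n", contigs, subs, ins, dels }
    ' "$@"
}

# Demo with the first three lines of the output above.
vcf2fasta_summary <<'EOF'
[vcf2fasta] rewrote contig tig00000001 with 13 subs, 87 ins, 7 dels (0 skipped)
[vcf2fasta] rewrote contig tig00000003 with 17 subs, 22 ins, 5 dels (0 skipped)
[vcf2fasta] rewrote contig tig00000008 with 28 subs, 48 ins, 8 dels (0 skipped)
EOF
# prints: contigs=3 subs=58 ins=157 dels=20
```

In this sample most corrections are insertions, which suggests that the draft consensus was mainly missing bases.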

### References

[1] Loman N.J., Quick J. and Simpson J.T. A complete bacterial genome assembled de novo using only nanopore sequencing data. Nature Methods 2015 12:733–735. DOI: https://doi.org/10.1101/015552 