## De novo Canu and Miniasm assembly

###### NOTE: This notebook has high memory and computational requirements

Obtaining the assembled sequence of a complete genome is a complex, multi-step task. In de novo assembly, the simplest elements of the hierarchy are the reads produced by the sequencer. The next level is the contig: a contiguous sequence built by aligning multiple overlapping reads, whose order relative to other contigs is not yet defined. Finally, the top level joins two or more contigs into a (near) complete structure of the genome under study. Ideally, one expects to obtain a single contig for each chromosome or plasmid present in the genome. In practice, however, most assemblies are incomplete, especially those built from short reads; the gaps can stem from sample preparation or from the limitations of the sequencing technology. In particular, when a repeat region is longer than the reads, it collapses into a single contig with multiple connections in the assembly. ONT provides long reads that can span such repeats, although the per-base quality is still low (approximately 12-15% error rate).

## De novo Canu assembly pipeline

This notebook relies on the popular Canu pipeline (with Racon and Pilon to polish the assembly result) and on Miniasm for de novo assembly of ONT reads.

[Canu](https://github.com/marbl/canu) is a popular assembler based on the Celera Assembler that can reliably assemble complete microbial genomes and almost complete eukaryotic chromosomes. Canu has three stages: correction, trimming and assembly. Each stage can be executed independently or in series. Each of the three stages begins by counting k-mers in the reads, identifying the overlaps between all pairs of input reads, and creating an indexed store of those overlaps. From the input reads, the correction stage generates corrected reads. The trimming stage trims unsupported bases and detects hairpin adapters, chimeric sequences and other anomalies. The assembly stage builds an assembly graph and the contigs. Canu handles repeats probabilistically: it reduces the chance of selecting a repetitive k-mer for overlapping, statistically filters repeat-induced overlaps, and retrospectively inspects the graph for possible errors. In this way, Canu performs multiple rounds of read and overlap error correction.

Canu substantially lowers coverage requirements by supporting low-coverage hierarchical assembly. In that case, polishing the assembly with high-quality short reads is recommended. Canu gives its best results with long-read coverage above 20x.
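The 20x guideline can be checked before launching Canu by dividing the total number of sequenced bases by the genome size. The following is a minimal sketch, assuming a standard FASTQ file with four lines per record; `fastq_total_bases` is a hypothetical helper (not part of Canu), and the tiny demo file and genome size are made up so that the snippet runs on its own.

```shell
# Hypothetical helper: sum the lengths of all reads in a 4-line-per-record FASTQ.
fastq_total_bases() {
    # The sequence is the 2nd line of every 4-line record.
    awk 'NR % 4 == 2 { total += length($0) } END { print total + 0 }' "$1"
}

# Made-up demo file (3 reads, 30 bases) so the sketch runs without real data.
demo_fastq=$(mktemp)
printf '@r1\nACGTACGTAC\n+\nIIIIIIIIII\n'     >  "$demo_fastq"
printf '@r2\nACGTACGTACGT\n+\nIIIIIIIIIIII\n' >> "$demo_fastq"
printf '@r3\nACGTACGT\n+\nIIIIIIII\n'         >> "$demo_fastq"

genome_size_bp=10   # demo value; the 2.1 Mbp genome would be 2100000
total_bases=$(fastq_total_bases "$demo_fastq")
coverage=$(awk -v b="$total_bases" -v g="$genome_size_bp" 'BEGIN { printf "%.1f", b / g }')
echo "Total bases: $total_bases, estimated coverage: ${coverage}x"
rm -f "$demo_fastq"
```

For a gzip-compressed file, the same awk program can read from `gzip -dc reads.fastq.gz` instead of a file name.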

Canu works with either FASTA or FASTQ files (compressed and uncompressed), but FASTQ format is needed to run some of the steps and complete the full pipeline. The help page is called with the "canu -h" command. These are the parameters needed for running Canu with our data:

<font color='blue'>-p</font> and <font color='blue'>-d</font> : Assembly file prefix and output directory. Both parameters can take the same value, and the output directory does not have to exist before execution.

<font color='blue'>genomeSize</font> : The estimated genome size. In our case it is 2.1 Mbp, so we write '2.1m'. The suffixes g (Gbp) and k (kbp) are also accepted.

<font color='blue'>-nanopore-raw</font> : The path to our reads in FASTQ.
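To make the genomeSize suffixes concrete, the sketch below converts a value such as '2.1m' into base pairs. `genome_size_to_bp` is a hypothetical helper written for illustration only; Canu parses the option itself, and treating a suffix-free value as plain base pairs is an assumption of this sketch.

```shell
# Hypothetical helper: convert a size with a k/m/g suffix to base pairs.
genome_size_to_bp() {
    awk -v s="$1" 'BEGIN {
        unit = tolower(substr(s, length(s), 1))   # last character of the value
        mult = 1
        if (unit == "k") mult = 1e3
        else if (unit == "m") mult = 1e6
        else if (unit == "g") mult = 1e9
        if (mult > 1) s = substr(s, 1, length(s) - 1)   # strip the suffix
        printf "%.0f\n", s * mult                        # round to whole bases
    }'
}

genome_size_to_bp 2.1m
```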

Canu auto-detects the available resources and configures job sizes based on those resources and on the size of the genome being assembled. The following code uses advanced options such as 'corMemory' and 'corThreads' to limit resource usage so that Canu can run in a resource-limited environment. These options also prevent Canu from aborting because of insufficient computational resources.

In [None]:
canu -p sample \
     -d data/sample/canu_output \
     genomeSize=2.1m \
     useGrid=false \
     minReadLength=50 \
     minOverlapLength=50 \
     corMemory=2 \
     corThreads=2 \
     -nanopore-raw data/sample/reads.fastq
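Once an assembly exists, a quick sanity check is to count the contigs and compute their total length and N50. The following is a rough sketch using only awk and sort; `fasta_lengths` and `assembly_stats` are hypothetical helpers, and the demo FASTA is made up so the snippet runs without Canu output.

```shell
# Hypothetical helper: print one sequence length per FASTA record
# (sequences wrapped over several lines are handled).
fasta_lengths() {
    awk '/^>/ { if (seen) print len; len = 0; seen = 1; next }
         { len += length($0) }
         END { if (seen) print len }' "$1"
}

# Hypothetical helper: number of contigs, total length and N50.
assembly_stats() {
    fasta_lengths "$1" | sort -nr | awk '
        { lens[NR] = $1; total += $1 }
        END {
            running = 0; n50 = 0
            # N50: length of the contig at which the running sum of the
            # sorted lengths first reaches half of the total.
            for (i = 1; i <= NR; i++) {
                running += lens[i]
                if (running >= total / 2) { n50 = lens[i]; break }
            }
            printf "contigs=%d total=%d N50=%d\n", NR, total, n50
        }'
}

# Made-up demo assembly with contigs of 6, 5 and 4 bases.
demo_fasta=$(mktemp)
printf '>c1\nACGTAC\n>c2\nACG\nTA\n>c3\nACGT\n' > "$demo_fasta"
stats=$(assembly_stats "$demo_fasta")
echo "$stats"
rm -f "$demo_fasta"
```

With the run above, the same function could be pointed at `data/sample/canu_output/sample.contigs.fasta`.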
     

### Racon

[Racon](https://github.com/isovic/racon) is a consensus module to correct raw contigs generated by rapid assembly methods that do not include a consensus step. Canu results in FASTA format and raw reads in FASTQ are used in this step to generate a new FASTA contig file.

Before running Racon, the raw reads must be mapped to the Canu contigs; the resulting overlaps file in PAF format is then passed to the Racon command. This can be done with, for example, minimap2, a very fast aligner for noisy long reads such as ONT.

In [None]:
# Map the raw ONT reads to the Canu contigs (PAF is minimap2's default output)
minimap2 -x map-ont data/sample/canu_output/sample.contigs.fasta \
        data/sample/reads.fastq \
        > sample.paf
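Before handing the PAF file to Racon, it can be useful to inspect it. In PAF, column 1 is the read name, column 10 the number of matching bases and column 11 the alignment block length. The sketch below summarizes a made-up three-line PAF file; note that without base-level alignment, minimap2 reports column 10 as an approximation, so the fraction is only indicative.

```shell
# Made-up demo PAF (tab-separated, 12 mandatory columns).
demo_paf=$(mktemp)
printf 'read1\t1000\t0\t1000\t+\tcontig1\t5000\t100\t1100\t850\t1000\t60\n'     >  "$demo_paf"
printf 'read2\t2000\t100\t1900\t-\tcontig1\t5000\t2000\t3800\t1500\t1800\t60\n' >> "$demo_paf"
printf 'read1\t1000\t0\t200\t+\tcontig2\t3000\t0\t200\t100\t200\t0\n'           >> "$demo_paf"

paf_summary=$(awk -F'\t' '
    !($1 in seen) { seen[$1] = 1; reads++ }   # count each read name once
    { matches += $10; block += $11 }          # accumulate columns 10 and 11
    END {
        printf "mappings=%d reads=%d match_fraction=%.2f\n",
               NR, reads, (block ? matches / block : 0)
    }' "$demo_paf")
echo "$paf_summary"
rm -f "$demo_paf"
```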

The basic Racon parameters are the following:
- <font color='blue'>-t</font> : Number of threads.
- Raw reads (FASTQ).
- Overlaps (PAF).
- Canu output (FASTA).
- Racon output file name (FASTA).

In [None]:
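# Note: recent Racon releases print the consensus to standard output; with those,
# drop the last argument and redirect with "> data/sample_racon.contigs.fasta".
racon -t 48 \
    data/sample/reads.fastq \
    sample.paf \
    data/sample/canu_output/sample.contigs.fasta \
    data/sample_racon.contigs.fasta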

### Pilon (Requires Illumina reads)

[Pilon](https://github.com/broadinstitute/pilon) is a tool that can be used to improve a draft assembly and to find variation among species or strains. Pilon uses Illumina reads aligned to the assembled sequence to correct base errors and small insertions and deletions (indels). It requires as input a FASTA file and a BAM file of reads aligned to that FASTA file. At this point, the inputs are the Racon contigs file and the Illumina reads: the Racon contigs and the BAM file produced by the alignment of the Illumina reads against those contigs generate the final Pilon result.

First, the Illumina reads have to be aligned against the Racon contigs file. BWA is used to index the reference file (Racon contigs) and BWA-MEM is used to align the Illumina reads. To obtain the BAM file required by Pilon, SAMtools converts the BWA output from SAM to sorted BAM format.


In [None]:
bwa index data/sample_racon.contigs.fasta

In [None]:
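# Data not included in repository
# Align the paired Illumina reads and write a coordinate-sorted BAM file.
# Assumes samtools 1.3 or later, where "sort" reads SAM from stdin and takes -o.
bwa mem -t 2 \
        data/sample_racon.contigs.fasta \
        data/sample/short_reads/reads_1.fastq.gz \
        data/sample/short_reads/reads_2.fastq.gz \
        | samtools sort -o data/bwa_aligned_reads.bam -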


Before running Pilon, the BWA alignment is indexed using SAMtools:

In [None]:
samtools index data/bwa_aligned_reads.bam
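It is worth checking how many Illumina reads actually mapped before polishing. `samtools flagstat` reports this directly; the sketch below shows the underlying idea with plain awk on SAM text, decoding the FLAG column (bit 4 = unmapped, 256 = secondary, 2048 = supplementary). The demo SAM records are made up so the snippet runs without samtools.

```shell
# Made-up demo SAM: one header line, two mapped mates, two unmapped mates,
# and one supplementary record that should not be counted.
demo_sam=$(mktemp)
printf '@SQ\tSN:contig1\tLN:5000\n'                                                  >  "$demo_sam"
printf 'r1\t99\tcontig1\t100\t60\t8M\t=\t300\t208\tACGTACGT\tIIIIIIII\n'             >> "$demo_sam"
printf 'r1\t147\tcontig1\t300\t60\t8M\t=\t100\t-208\tACGTACGT\tIIIIIIII\n'           >> "$demo_sam"
printf 'r2\t77\t*\t0\t0\t*\t*\t0\t0\tACGTACGT\tIIIIIIII\n'                           >> "$demo_sam"
printf 'r2\t141\t*\t0\t0\t*\t*\t0\t0\tACGTACGT\tIIIIIIII\n'                          >> "$demo_sam"
printf 'r1\t2147\tcontig1\t900\t60\t4M4S\t=\t300\t0\tACGTACGT\tIIIIIIII\n'           >> "$demo_sam"

mapping_rate=$(awk -F'\t' '
    /^@/ { next }                                        # skip header lines
    int($2 / 256) % 2 || int($2 / 2048) % 2 { next }     # skip secondary and supplementary records
    { primary++; if (int($2 / 4) % 2 == 0) mapped++ }    # bit 4 clear means the read is mapped
    END {
        printf "primary=%d mapped=%d rate=%.1f%%\n",
               primary, mapped, (primary ? 100 * mapped / primary : 0)
    }' "$demo_sam")
echo "$mapping_rate"
rm -f "$demo_sam"
```

On real data, the same awk program could read from `samtools view -h data/bwa_aligned_reads.bam`.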

Pilon is run as a Java .jar executable. JVM options, such as the maximum heap size, can be placed before the .jar file to improve performance. In either case, Pilon runs faster in an environment with more available RAM and threads. The parameters used in this Pilon run are the following:

- <font color='blue'>--threads</font> : Number of threads
- <font color='blue'>--genome</font>: The FASTA input file (Racon contigs)
- <font color='blue'>--bam</font>: BAM file (generated by BWA and SAMtools)
- <font color='blue'>--outdir</font> and <font color='blue'>--output</font> : Output directory and filename 

In [None]:
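# JVM options come before -jar. The CMS collector was removed in JDK 14,
# so -XX:+UseConcMarkSweepGC can be dropped on newer Java versions.
java -Xmx128g -XX:+UseConcMarkSweepGC \
      -XX:-UseGCOverheadLimit \
      -jar /home/jovyan/software/pilon/pilon-1.22.jar \
      --threads 2 \
      --genome data/sample_racon.contigs.fasta \
      --bam data/bwa_aligned_reads.bam \
      --outdir data/sample/pilon_output \
      --output pilon.contigs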

## De novo Miniasm assembly
 
[Miniasm](https://github.com/lh3/miniasm) is a fast Overlap-Layout-Consensus-based option for de novo assembly of noisy long reads. Miniasm builds high-confidence contigs (unitigs) by concatenating pieces of read sequences. It takes all-vs-all read self-mappings as input and outputs an assembly graph. At least for high-coverage bacterial genomes, Miniasm can generate long contigs from raw ONT reads without error correction, so the error rate of the assembly is the same as that of the raw input reads. In this way, Miniasm produces uncorrected contig sequences directly from raw read overlaps. To reduce the effect of artefacts such as untrimmed adapters and chimeras, Miniasm calculates the per-base coverage of each read from good mappings against other reads (longer than 2 kb with at least 100 bp of non-redundant bases on matching minimizers).
Furthermore, Miniasm ignores internal matches, drops contained reads, and adds the remaining overlaps to the assembly graph. To avoid multiple edges, Miniasm keeps only the longest overlap. Once the assembly graph is generated, it removes transitive edges, trims tips (short dead-end unitigs composed of few reads), and pops small bubbles.

Although it cannot produce a high-quality consensus, Miniasm is extremely fast and produces contiguous and structurally accurate assemblies, at least for genomes without excessive repetitive sequences.
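The mapping filter and the removal of contained reads described above can be illustrated on a PAF file. This is a deliberately simplified sketch and not Miniasm's actual implementation: it keeps mappings with an alignment block of at least 2 kb and at least 100 matching bases (the thresholds quoted in the text), and calls a read contained only when the mapping covers it from end to end inside a longer read, whereas a real implementation tolerates small overhangs. The demo records are made up.

```shell
# Made-up demo overlaps (PAF columns: qname qlen qstart qend strand
# tname tlen tstart tend matches block_len mapq).
demo_ava=$(mktemp)
printf 'readA\t5000\t0\t5000\t+\treadB\t9000\t2000\t7000\t3000\t5000\t255\n' >  "$demo_ava"
printf 'readB\t9000\t6000\t9000\t+\treadC\t8000\t0\t3000\t1800\t3000\t255\n' >> "$demo_ava"
printf 'readC\t8000\t100\t900\t+\treadD\t4000\t0\t800\t400\t800\t255\n'      >> "$demo_ava"

overlap_summary=$(awk -F'\t' -v min_len=2000 -v min_match=100 '
    $1 == $6 { next }                          # ignore self-mappings
    $11 >= min_len && $10 >= min_match {       # "good" mapping
        good++
        # Simplified containment test: the whole query lies inside a longer target.
        if ($3 == 0 && $4 == $2 && $7 > $2 && !($1 in contained)) {
            contained[$1] = 1
            n_contained++
        }
    }
    END { printf "good_mappings=%d contained_reads=%d\n", good, n_contained }
' "$demo_ava")
echo "$overlap_summary"
rm -f "$demo_ava"
```

Here readA is contained in readB, the readB/readC overlap is a usable end-to-end overlap, and the 800 bp readC/readD mapping is discarded as too short.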

Minimap2 is first used to compute the all-vs-all mappings of the reads:

In [None]:
# -x ava-ont selects minimap2's preset for all-vs-all overlap of ONT reads
minimap2 -x ava-ont data/sample/reads.fastq data/sample/reads.fastq | gzip -1 > reads.paf.gz

Miniasm takes the resulting mappings file and the reads file and generates the final assembly, written as an assembly graph in GFA format. The algorithm has no consensus step: producing an accurate consensus sequence usually takes several rounds and is a computational bottleneck.

In [None]:
# Use the same reads file that was mapped above; Miniasm writes GFA, not FASTA
miniasm -f data/sample/reads.fastq reads.paf.gz > data/sample/miniasm_assembly.gfa
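Miniasm writes its result as an assembly graph in GFA format, in which each unitig sequence sits on a segment line starting with `S`. When downstream tools expect FASTA, the sequences can be extracted with awk. The sketch below uses a hypothetical helper, `gfa_to_fasta`, and a made-up two-unitig GFA file so that it runs on its own.

```shell
# Hypothetical helper: turn GFA segment (S) lines into FASTA records.
gfa_to_fasta() {
    # Column 2 is the segment name, column 3 its sequence.
    awk -F'\t' '$1 == "S" { print ">" $2; print $3 }' "$1"
}

# Made-up demo graph: two segments plus one non-segment line that is ignored.
demo_gfa=$(mktemp)
printf 'S\tutg000001l\tACGTACGT\tLN:i:8\n'        >  "$demo_gfa"
printf 'a\tutg000001l\t0\tread1:1-8\t+\t8\n'      >> "$demo_gfa"
printf 'S\tutg000002c\tTTGGCC\tLN:i:6\n'          >> "$demo_gfa"

assembly_fasta=$(gfa_to_fasta "$demo_gfa")
echo "$assembly_fasta"
rm -f "$demo_gfa"
```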
