# Assembly with the Canu pipeline

[Canu](https://github.com/marbl/canu) is a popular assembler derived from Celera Assembler and designed for noisy long reads such as Oxford Nanopore (ONT) and PacBio. It consists of a 4-step pipeline that generates a draft assembly without a reference. To get better results, Canu is often combined with tools that polish its output. In this notebook, we will build a popular pipeline using Canu + Racon + Pilon.
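
Since the pipeline chains several external programs, a quick check that they are installed can save a failed run later. The following is a minimal sketch; the helper name `missing_tools` is hypothetical, and the tool list is simply the set of commands invoked in this notebook.

```shell
# Print the name of every tool in the argument list that is not on PATH.
# (missing_tools is a hypothetical helper, not part of any of these tools.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Commands invoked in this notebook (Pilon itself is run through java).
missing=$(missing_tools canu minimap racon bwa samtools java)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All tools found."
fi
```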

Canu accepts either FASTA or FASTQ files (compressed or uncompressed), but the FASTQ format is needed to run the next steps and complete the full pipeline. The help page can be shown with the "canu -h" command. These are the parameters needed to run Canu with our data:

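- -p and -d: Assembly file prefix and output directory. Both can have the same value, and the output directory doesn't have to exist before execution.
- genomeSize: The estimated genome size. In our case it is 2.1 Mbp, so we write '2.1m'. The suffixes 'g' (Gbp) and 'k' (kbp) are also accepted.
- useGrid=false: Run everything on the local machine instead of submitting jobs to a grid scheduler.
- minReadLength and minOverlapLength: Minimum read length and minimum overlap length, in bases. Shorter reads and overlaps are ignored.
- -nanopore-raw: The path to our raw ONT reads (FASTA or FASTQ).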
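
Before launching Canu, it can help to check how much coverage the reads provide for the chosen genome size. The sketch below is only an illustration: the function name `estimate_coverage` is hypothetical, and it assumes an uncompressed FASTQ file with exactly four lines per record.

```shell
# Sum the bases in a 4-line-per-record FASTQ file and divide by the
# expected genome size (in bp) to get a rough coverage estimate.
# Arguments: $1 = FASTQ path, $2 = genome size in bp.
estimate_coverage() {
    awk -v gs="$2" '
        NR % 4 == 2 { bases += length($0); reads++ }   # sequence lines only
        END { printf "reads=%d bases=%d coverage=%.1fx\n", reads, bases, bases / gs }
    ' "$1"
}

# Example call (path taken from this notebook; 2.1 Mbp = 2100000 bp):
# estimate_coverage data/agalactiae/merged-output.fastq 2100000
```

For compressed input, the file would first have to be decompressed (for instance with `gunzip -c`) into a temporary file.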

In [1]:
canu -p agalactiae \
     -d data/agalactiae/canu_output \
     genomeSize=2.1m \
     useGrid=false \
     minReadLength=50 \
     minOverlapLength=50 \
     -nanopore-raw data/agalactiae/merged-output.fasta


-- Canu snapshot v1.7 +137 changes (r8829 73d5caa1b1087b65f7853ecbebc1bb1dcbd1bc14)
--
-- CITATIONS
--
-- Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.
-- Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
-- Genome Res. 2017 May;27(5):722-736.
-- http://doi.org/10.1101/gr.215087.116
-- 
-- Read and contig alignments during correction, consensus and GFA building use:
--   Šošic M, Šikic M.
--   Edlib: a C/C ++ library for fast, exact sequence alignment using edit distance.
--   Bioinformatics. 2017 May 1;33(9):1394-1395.
--   http://doi.org/10.1093/bioinformatics/btw753
-- 
-- Overlaps are generated using:
--   Berlin K, et al.
--   Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
--   Nat Biotechnol. 2015 Jun;33(6):623-30.
--   http://doi.org/10.1038/nbt.3238
-- 
--   Myers EW, et al.
--   A Whole-Genome Assembly of Drosophila.
--   Science. 2000 Mar 24;287(5461):2196-204.
--   http://doi.


### Racon

[Racon](https://github.com/isovic/racon) is a consensus module that corrects raw contigs generated by rapid assembly methods which do not include a consensus step. We will take our Canu contigs in FASTA and the raw reads in FASTQ and generate a new FASTA contig file.

Before running Racon, we must map the raw reads to the Canu contigs and pass the resulting overlaps file (in PAF format) to the Racon command. This can be done with, for example, minimap, a fast mapper for ONT reads.


In [None]:
minimap data/agalactiae/canu_output/agalactiae.contigs.fasta \
        data/agalactiae/merged-output.fastq \
        > agalactiae.paf
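
To get a feel for the mapping before polishing, the PAF file can be summarised per contig. The following is a sketch under the assumption that the file follows the standard tab-separated PAF layout, where column 6 is the target (contig) name and column 11 is the alignment block length; `paf_summary` is a hypothetical helper name.

```shell
# Count mappings and aligned bases per contig in a PAF file.
# Output columns: contig name, number of mappings, summed alignment length.
paf_summary() {
    awk -F'\t' '
        { n[$6]++; aln[$6] += $11 }
        END { for (c in n) printf "%s\t%d\t%d\n", c, n[c], aln[c] }
    ' "$1" | sort
}

# Example call (file produced by the minimap cell above):
# paf_summary agalactiae.paf
```

A contig with very few mappings relative to its length is a hint that Racon will have little evidence to correct it.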

Racon parameters, in order:

- -t: Number of threads.
- Raw reads (FASTQ).
- Overlaps (PAF).
- Canu output (FASTA).
- Racon output file name (FASTA).


In [None]:
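# Older Racon interface: the last argument is the output file.
# Newer releases print the polished contigs to stdout, so redirect with '>' instead.
racon -t 48 \
    data/agalactiae/merged-output.fastq \
    agalactiae.paf \
    data/agalactiae/canu_output/agalactiae.contigs.fasta \
    data/agalactiae_racon.contigs.fasta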

### Pilon (requires Illumina reads)

[Pilon](https://github.com/broadinstitute/pilon) is a tool that can be used to improve a draft assembly, as in our case, and to find variation among strains. It requires as input a FASTA file and a BAM file of reads aligned to that FASTA file. At this point, we have the Racon contigs file and our read files (ONT and Illumina). We will use the Racon contigs file and a BAM file produced by aligning the Illumina reads against those contigs to generate our final result with Pilon.

First, we need to align our Illumina reads against the Racon contigs file. We are using bwa mem, so we first have to index the reference file (the Racon contigs) and then align the Illumina reads. To obtain the BAM file required by Pilon, we use samtools to convert the SAM output of bwa to BAM and sort it.


In [None]:
bwa index data/agalactiae_racon.contigs.fasta

In [None]:
# Illumina data not included in repository
# The 'samtools sort -o FILE' form assumes samtools 1.3 or newer;
# older releases take an output prefix instead.
bwa mem -t 48 \
        data/agalactiae_racon.contigs.fasta \
        data/Data_Ilumina/Raw/WGS_bacterialIsolates_MiSeq_training-33608589/Sagalactiae_HRC-41106565/Sagal_S5_L001_R1_001.fastq.gz \
        data/Data_Ilumina/Raw/WGS_bacterialIsolates_MiSeq_training-33608589/Sagalactiae_HRC-41106565/Sagal_S5_L001_R2_001.fastq.gz \
        | samtools view -b -u - | samtools sort -o data/bwa_aligned_reads.bam -


Finally, we index the BAM file and run the Pilon jar:

In [None]:
samtools index data/bwa_aligned_reads.bam

Pilon arguments:

- --threads: Number of threads.
- --genome: The FASTA input file (Racon contigs).
- --bam: BAM file (generated by bwa and samtools).
- --outdir and --output: Output directory and output file prefix.

In [None]:
java -Xmx128g -XX:+UseConcMarkSweepGC \
      -XX:-UseGCOverheadLimit \
      -jar /home/jovyan/software/pilon/pilon-1.22.jar \
      --threads 2 \
      --genome data/agalactiae_racon.contigs.fasta \
      --bam data/bwa_aligned_reads.bam \
      --outdir data/agalactiae/pilon_output \
      --output pilon.contigs
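
Once Pilon has finished, a simple way to see what each polishing step changed is to compare basic statistics of the contig files. This is a minimal sketch; `fasta_stats` is a hypothetical helper, and the Pilon output path in the example assumes that Pilon writes `<outdir>/<output>.fasta`.

```shell
# Print contig count, total length and longest contig of a FASTA file.
fasta_stats() {
    awk '
        /^>/ { n++; next }              # header line starts a new contig
        { len[n] += length($0) }        # sequence line adds to current contig
        END {
            for (i = 1; i <= n; i++) {
                total += len[i]
                if (len[i] > max) max = len[i]
            }
            printf "contigs=%d total_bp=%d longest_bp=%d\n", n, total, max
        }
    ' "$1"
}

# Example calls (paths from this notebook; the Pilon file name is an assumption):
# fasta_stats data/agalactiae/canu_output/agalactiae.contigs.fasta
# fasta_stats data/agalactiae_racon.contigs.fasta
# fasta_stats data/agalactiae/pilon_output/pilon.contigs.fasta
```

Polishing should leave the number of contigs unchanged and shift the total length only slightly, so large differences between the three files are worth investigating.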