# Assembly with the Canu pipeline

[Canu](https://github.com/marbl/canu) is a popular assembler derived from Celera Assembler and designed for noisy long reads such as ONT reads. It consists of a 4-step pipeline that generates a 'draft assembly' without a reference. To get better results, Canu is often combined with tools that polish its output. In this notebook, we will build a popular pipeline: Canu + Racon + Pilon.

Canu works with either FASTA or FASTQ files (compressed or uncompressed), but the FASTQ format is needed to run the next steps and complete the full pipeline. The help page can be shown with the "canu -h" command. These are the parameters needed to run Canu with our data:

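- -p and -d: Assembly file prefix and output directory. Both can have the same value, and the output directory doesn't have to exist before execution.
- genomeSize: The estimated genome size. In our case it is 2.1 Mbp, so we write '2.1m'. The suffixes 'g' (Gbp) and 'k' (kbp) are accepted as well.
- -nanopore-raw: The path to our raw ONT reads in FASTQ format.
- useGrid=false (optional): Run everything on the local machine instead of submitting jobs to a grid scheduler.
- minReadLength and minOverlapLength (optional): Minimum read length and minimum overlap length, in bp, that Canu will use.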
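
Canu uses genomeSize to estimate the coverage of the input reads, so it can be useful to check that figure by hand before starting a long run. The following is a minimal sketch and not part of the original notebook: `estimate_coverage` is a hypothetical helper name, and it assumes an uncompressed FASTQ with exactly four lines per record.

```shell
# Hypothetical helper: total bases in a FASTQ divided by the genome size in bp.
# Assumes an uncompressed FASTQ with exactly four lines per record.
estimate_coverage() {
    LC_ALL=C awk -v g="$2" '
        NR % 4 == 2 { bases += length($0); reads++ }   # sequence lines only
        END { printf "reads=%.0f bases=%.0f coverage=%.1fx\n", reads, bases, bases / g }
    ' "$1"
}

# Example with the reads used in this notebook (2.1 Mbp = 2100000 bp):
# estimate_coverage data/agalactiae/merged-output-full.fastq 2100000
```

For compressed input, the file would need to be decompressed first (for example with `gunzip -c`) before being passed to the helper.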

In [2]:
canu -p agalactiae \
     -d data/agalactiae/canu_output \
     genomeSize=2.1m \
     useGrid=false \
     minReadLength=50 \
     minOverlapLength=50 \
     -nanopore-raw data/agalactiae/merged-output-full.fastq


-- Canu snapshot v1.6 +14 changes (r8426 14520f819a1e5dd221cc16553cf5b5269227b0a3)
--
-- CITATIONS
--
-- Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.
-- Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
-- Genome Res. 2017 May;27(5):722-736.
-- http://doi.org/10.1101/gr.215087.116
-- 
-- Read and contig alignments during correction, consensus and GFA building use:
--   Šošic M, Šikic M.
--   Edlib: a C/C ++ library for fast, exact sequence alignment using edit distance.
--   Bioinformatics. 2017 May 1;33(9):1394-1395.
--   http://doi.org/10.1093/bioinformatics/btw753
-- 
-- Overlaps are generated using:
--   Berlin K, et al.
--   Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
--   Nat Biotechnol. 2015 Jun;33(6):623-30.
--   http://doi.org/10.1038/nbt.3238
-- 
--   Myers EW, et al.
--   A Whole-Genome Assembly of Drosophila.
--   Science. 2000 Mar 24;287(5461):2196-204.
--   http://doi.o
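
When Canu finishes, the contigs are in the `<prefix>.contigs.fasta` file inside the output directory. A quick way to see how many contigs were produced and how long they are is sketched below. `contig_lengths` is a hypothetical helper name, not a Canu command, and it only assumes a standard FASTA file (sequences may span several lines).

```shell
# Hypothetical helper: print "<contig name> <length>" for each FASTA record.
contig_lengths() {
    awk '
        /^>/ { if (name != "") print name, len      # flush the previous record
               name = substr($1, 2); len = 0; next }
        { len += length($0) }                        # sequence may span several lines
        END { if (name != "") print name, len }
    ' "$1"
}

# Example with the Canu output of this notebook:
# contig_lengths data/agalactiae/canu_output/agalactiae.contigs.fasta
```

For a bacterial genome, a single contig with a length close to the expected genome size is the ideal outcome.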

### Racon

[Racon](https://github.com/isovic/racon) is a consensus module to correct raw contigs generated by rapid assembly methods which do not include a consensus step. We will take our Canu result in FASTA and the raw reads in FASTQ and generate a new FASTA contig file.

Before running Racon, we must map the raw reads to the Canu contigs and pass the resulting overlaps file (in PAF format) to the Racon command. This can be done with, for example, minimap, a fast aligner for ONT reads.


In [3]:
minimap data/agalactiae/canu_output/agalactiae.contigs.fasta \
        data/agalactiae/merged-output-full.fastq \
        > agalactiae.paf

[M::mm_idx_gen::0.213*1.01] collected minimizers
[M::mm_idx_gen::0.277*1.47] sorted minimizers
[M::main::0.277*1.47] loaded/built the index for 1 target sequence(s)
[M::main] max occurrences of a minimizer to consider: 8
[M::main] Version: 0.2-r123
[M::main] CMD: minimap data/EColi/R9/CANU_ALIGNMENT_time/CANU_ALIGNMENT_9.4_2.contigs.fasta data/EColi/R9/Data_1D/ALL_1D/all_1D_arreglado.fastq
[M::main] Real time: 5.376 sec; CPU: 15.536 sec

real	0m5.393s
user	0m14.919s
sys	0m0.632s
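
PAF is a tab-separated text format, so the overlaps can be inspected with standard tools before they are handed to Racon. The sketch below is an illustration only: `paf_summary` is a hypothetical helper name, and the depth it reports is a rough figure (sum of aligned target spans divided by the contig length), not something minimap or Racon computes.

```shell
# Hypothetical helper: per-contig overlap count and approximate read depth from a PAF file.
# PAF columns used: 6 = target name, 7 = target length, 8/9 = target start/end.
paf_summary() {
    LC_ALL=C awk -F '\t' '
        { n[$6]++; len[$6] = $7; covered[$6] += $9 - $8 }
        END {
            for (t in n)
                printf "%s overlaps=%d depth=%.1fx\n", t, n[t], covered[t] / len[t]
        }
    ' "$1"
}

# Example with the overlaps file written above:
# paf_summary agalactiae.paf
```

A contig with very few overlaps or a very low depth will not gain much from the consensus step.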


Racon parameters (the four file arguments are positional, in this order):
- -t: Number of threads.
- Raw reads (FASTQ).
- Overlaps (PAF).
- Canu output (FASTA).
- Racon output file name (FASTA).
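
Because the three input files come from different steps, a mistyped path is easy to miss until Racon is already running. The check below is a small sketch under that assumption; `require_files` is a hypothetical helper name and is not part of Racon.

```shell
# Hypothetical helper: fail early if any input file is missing or empty.
require_files() {
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty input: $f" >&2
            return 1
        fi
    done
}

# Example with the files produced in the previous steps:
# require_files data/agalactiae/merged-output-full.fastq \
#               agalactiae.paf \
#               data/agalactiae/canu_output/agalactiae.contigs.fasta \
#   && echo "inputs look fine"
```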


In [4]:
racon -t 48 \
    data/agalactiae/merged-output-full.fastq \
    agalactiae.paf \
    data/agalactiae/canu_output/agalactiae.contigs.fasta \
    data/agalactiae_racon.contigs.fasta

[09:24:31 main] Using PAF for input alignments. (ecoli_time.paf)
[09:24:31 main] Loading reads.
[09:24:34 main] Hashing qnames.
[09:24:34 main] Parsing the overlaps file.
[09:24:34 main] Unique overlaps will be filtered on the fly.
[09:24:34 main] Overlaps will be fully aligned.
[09:24:34 ConsensusFromOverlaps] Running consensus.
[09:24:34 ConsensusFromOverlaps] Separating overlaps to individual contigs.
[09:24:34 ConsensusFromOverlaps] In total, there are 1 contigs for consensus, each containing:
[09:24:34 ConsensusFromOverlaps] 	[0] tig00000001 len=4661991 reads=8439 covStat=3322.09 gappedBases=no class=contig suggestRepeat=no suggestCircular=yes 29970 alignments, contig len: 4661991

[09:24:34 ConsensusFromOverlaps] Started processing contig 1 / 1 (100.00%): tig00000001 len=4661991 reads=8439 covStat=3322.09 gappedBases=no class=contig suggestRepeat=no suggestCircular=yes
[09:24:34 ConsensusFromOverlaps] (thread_id = 0) Aligning overlaps for contig 1 / 1 (100.00%): tig00000001 len=4

### Pilon (Requires Illumina reads)

[Pilon](https://github.com/broadinstitute/pilon) is a tool that can be used to improve a draft assembly, as in our case, and to find variation among strains. It requires as input a FASTA file and a BAM file of reads aligned to that FASTA file. At this point, we have the Racon contigs file and our reads files (ONT and Illumina). We will use the Racon contigs file and a BAM file produced by aligning the Illumina reads against those contigs to generate our final result with Pilon.

First, we need to align our Illumina reads against the Racon contigs file. We are using bwa-mem, so we first index the reference file (Racon contigs) and then align the Illumina reads. To obtain the BAM file required by Pilon, we use samtools to convert the bwa output from SAM to a sorted BAM file. Note that the "samtools sort" call below uses the legacy samtools 0.1.x form, with the output prefix as the last argument; recent samtools 1.x releases expect "-o" followed by the output file name instead.


In [1]:
bwa index data/agalactiae_racon.contigs.fasta

[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.03 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.01 sec
[main] Version: 0.7.12-r1039
[main] CMD: bwa index pruebafastq/racon.contigs.fasta
[main] Real time: 0.173 sec; CPU: 0.048 sec


In [2]:
bwa mem -t 48 \
        data/agalactiae_racon.contigs.fasta \
        data/Data_Ilumina/Raw/WGS_bacterialIsolates_MiSeq_training-33608589/Sagalactiae_HRC-41106565/Sagal_S5_L001_R1_001.fastq.gz \
        data/Data_Ilumina/Raw/WGS_bacterialIsolates_MiSeq_training-33608589/Sagalactiae_HRC-41106565/Sagal_S5_L001_R2_001.fastq.gz \
        | samtools view -S -b -u - | samtools sort - data/bwa_aligned_reads


[M::bwa_idx_load_from_disk] read 0 ALT contigs
[bam_header_read] EOF marker is absent. The input is probably truncated.
[M::process] read 1962504 sequences (480000296 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (6, 25732, 71, 6)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (323, 444, 616)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 1202)
[M::mem_pestat] mean and std.dev: (480.39, 213.84)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 1495)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75) percentile: (77, 166, 261)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 629)
[M::mem_pestat] mean and std.dev: (134.41, 66.07)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 813)
[M::mem_pestat] skip 
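
Pilon can only correct regions covered by aligned Illumina reads, so it is worth checking how many records actually mapped before going further. The sketch below is an illustration, not part of the original notebook: `count_mapped` is a hypothetical helper name, and it reads SAM text from standard input. It counts alignment records, so secondary or supplementary alignments are included.

```shell
# Hypothetical helper: count mapped and unmapped records in SAM text read from stdin.
# Bit 0x4 of the FLAG field (column 2) marks an unmapped read.
count_mapped() {
    awk -F '\t' '
        /^@/ { next }                                  # skip header lines
        { if (int($2 / 4) % 2 == 1) unmapped++; else mapped++ }
        END { printf "mapped=%d unmapped=%d\n", mapped, unmapped }
    '
}

# Example (assumes samtools is on PATH and the sorted BAM was written as above):
# samtools view data/bwa_aligned_reads.bam | count_mapped
```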

Finally, we index the BAM file and run the Pilon jar:

In [4]:
samtools index data/bwa_aligned_reads.bam

Pilon arguments:
- --threads: Number of threads
- --genome: The FASTA input file (racon contigs)
- --bam: BAM file (generated by bwa and samtools)
- --outdir and --output: Output directory and filename 

In [5]:
java -Xmx128g -XX:+UseConcMarkSweepGC \
      -XX:-UseGCOverheadLimit \
      -jar /home/jovyan/software/pilon/pilon-1.22.jar \
      --threads 2 \
      --genome data/agalactiae_racon.contigs.fasta \
      --bam data/bwa_aligned_reads.bam \
      --outdir data/agalactiae/pilon_output \
      --output pilon.contigs

Pilon version 1.22 Wed Mar 15 16:38:30 2017 -0400
Genome: pruebafastq/racon.contigs.fasta
Fixing snps, indels, gaps, local
Input genome size: 70617
Scanning BAMs
pruebafastq/bwa_aligned_reads.bam: 2975733 reads, 0 filtered, 82987 mapped, 79775 proper, 234 stray, FR 100% 485+/-223, max 1155 frags
Processing Consensus_tig00000006:1-7753
Processing Consensus_tig00000001:1-62864
Consensus_tig00000001:1-62864 log:
frags pruebafastq/bwa_aligned_reads.bam: coverage 0
Total Reads: 260, Coverage: 0, minDepth: 5
Confirmed 59 of 62864 bases (0,09%)
Corrected 0 snps; 0 ambiguous bases; corrected 0 small insertions totaling 0 bases, 0 small deletions totaling 0 bases
Finished processing Consensus_tig00000001:1-62864
Consensus_tig00000006:1-7753 log:
frags pruebafastq/bwa_aligned_reads.bam: coverage 2206
Total Reads: 85653, Coverage: 2206, minDepth: 221
Confirmed 7479 of 7753 bases (96,47%)
Corrected 18 snps; 0 ambiguous bases; corrected 21 small insertions totaling 24 bases, 31 small deletions tota
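
The per-contig "Confirmed X of Y bases" lines give a quick measure of how much of the assembly was supported by the Illumina reads. The sketch below adds them up. It is an illustration only: `pilon_confirmed` is a hypothetical helper name, and it assumes that Pilon's console output was saved to a text file (the file name `pilon.log` in the example is also an assumption).

```shell
# Hypothetical helper: total confirmed bases across contigs from a saved Pilon log.
pilon_confirmed() {
    LC_ALL=C awk '
        /^Confirmed [0-9]+ of [0-9]+ bases/ { ok += $2; total += $4 }
        END {
            if (total > 0)
                printf "confirmed=%d total=%d (%.2f%%)\n", ok, total, 100 * ok / total
        }
    ' "$1"
}

# Example, assuming the Pilon output was redirected to pilon.log:
# pilon_confirmed pilon.log
```

For the two contigs visible in the log above, this would report 7538 of 70617 bases confirmed (10.67%), which reflects the fact that one of the two contigs has essentially no Illumina coverage.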