## Canu Assembly

Bin's friend is working with MinION and recommended Canu (a fork of the Celera Assembler) as the best assembler he has used.

https://github.com/marbl/canu

In [None]:
cd /data1/share/scratch/minion-canu

/opt/bioinformatics-software/minion-software/canu/Linux-amd64/bin/canu \
-p flu-11-9-canu \
-d flu-11-9-canu-out \
genomeSize=32k \
-nanopore-raw flu-11-09.fastq

No contigs were produced, but the run did perform some error correction with falcon-sense (which is a super pain to install), so I can use those corrected reads to see if they help with anything.

First I'm going to separate out the FluB reads. I assembled the entire 11-9 run, which is probably a bad idea. Using the FluDB BLAST results, I will grab the reads that map to FluB and start by assembling the full genome; if that doesn't work, I will separate them out by segment.
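
The "grab the reads that map to FluB" step can be sketched as a small awk filter over the BLAST XML. This is only a sketch under assumptions: `top_hit_query_ids` is a hypothetical helper name, it assumes one XML element per line, and it takes the read ID to be the first whitespace-delimited token of `Iteration_query-def`. It prints a query ID when the first `Hit_def` of that iteration matches the pattern, which is what a `grep -B6` on the top hit effectively selects.

```shell
# Hypothetical helper: print the query ID of every BLAST iteration whose
# FIRST hit definition matches the given regular expression.
# Assumes one XML element per line, as in standard BLAST XML output.
top_hit_query_ids() {
    local pattern=$1 xml=$2
    awk -v pat="$pattern" '
        /<Iteration_query-def>/ {
            line = $0
            sub(/^.*<Iteration_query-def>/, "", line)     # strip opening tag
            sub(/<\/Iteration_query-def>.*$/, "", line)   # strip closing tag
            split(line, parts, " ")
            query = parts[1]                               # assumed: ID is the first token
            seen_hit = 0
        }
        /<Hit_def>/ && !seen_hit {
            seen_hit = 1                                   # only the top hit is considered
            if ($0 ~ pat) print query
        }
    ' "$xml"
}

# Example (not run here):
#   top_hit_query_ids 'Influenza B' flu-11-9.2d.fludb.xml > flub-read-ids.txt
```

Tracking the current query explicitly avoids depending on the hit definition sitting exactly six lines below the query definition.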

## Single genome assembly

We'll start with the 11-9 FluB genome and see how that goes. Note that the PB1 segment will look wonky due to the lab strain spike-in.

In [None]:
cd /data1/share/scratch/minion-canu/flub-only

grep -B6 'Influenza B' flu-11-9.2d.fludb.xml | grep 'Iteration_query-def' | perl -pe 's/^.+>(.+) FAS.+/$1/g' | grep --no-group-separator -A3 -F -f - flu-11-9.2d.fastq  > flu-11-9.2d.flub-reads.fastq

/opt/bioinformatics-software/minion-software/canu/Linux-amd64/bin/canu -p flu-11-9-flub-only-canu -d flu-11-9-flub-only-canu-out genomeSize=12k -nanopore-raw flu-11-9.2d.flub-reads.fastq

This resulted in 62 corrected reads and 61 unassembled reads.
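
One caveat with the `grep -A3 -F -f -` extraction above: a fixed-string match can hit an ID that is a prefix of another ID, or a quality line that happens to contain the same characters. A stricter alternative, as a sketch only, is to match the header's ID exactly with awk. `extract_fastq_by_id` is a hypothetical helper; it assumes four-line FASTQ records (as `-A3` already does) and that the ID is the first whitespace-delimited token after `@`.

```shell
# Hypothetical helper: print the FASTQ records whose read ID is listed
# (one per line) in an ID file. Assumes strict four-line FASTQ records.
extract_fastq_by_id() {
    local ids_file=$1 fastq=$2
    awk '
        FILENAME == ARGV[1] { wanted[$1] = 1; next }  # first file: load the wanted IDs
        FNR % 4 == 1 {                                 # header line of a record
            id = substr($1, 2)                         # drop the leading "@"
            keep = (id in wanted)                      # exact match, not substring
        }
        keep                                           # print all four lines of kept records
    ' "$ids_file" "$fastq"
}

# Example (not run here):
#   extract_fastq_by_id flub-read-ids.txt flu-11-9.2d.fastq > flu-11-9.2d.flub-reads.fastq
```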

An actual assembly seems unlikely, since Canu will try to stitch together a fragmented (segmented) genome, though the error correction step may be what we should focus on.

Also, the Canu authors suggest running nanopolish prior to assembling, even though Canu implements falcon-sense, so I can give that a shot.

They also provide documentation on assembling low-coverage genomes, which I will try since this dataset is low coverage, especially after the error correction step (we're dealing with 61 reads!).
http://canu.readthedocs.org/en/stable/quick-start.html#assembling-low-coverage-datasets

I'll start with the defaults for now.

In [None]:
cd /data1/share/scratch/minion-canu/flub-only

/opt/bioinformatics-software/minion-software/canu/Linux-amd64/bin/canu -p flu-11-9-flub-only-low-cov-canu \
-d flu-11-9-flub-only-low-cov-canu-out genomeSize=12k corMhapSensitivity=high corMinCoverage=2 errorRate=0.035 \
minOverlapLength=499 corMaxEvidenceErate=0.3 -nanopore-raw flu-11-9.2d.flub-reads.fastq

This did not seem to work either. There are more error-corrected reads, which isn't what I expected.

## Single Segment Assembly

Before jumping to combining all FluB runs I want to see if the results differ when trying to assemble one specific segment. We'll start with the HA.

In [None]:
cd /data1/share/scratch/minion-canu/flub-only

grep -P -B6 'Influenza B.+Segment:4' flu-11-9.2d.fludb.xml | grep 'Iteration_query-def' | \
perl -pe 's/^.+>(.+) FAS.+/$1/g' | grep --no-group-separator -A3 -F -f - flu-11-9.2d.fastq  > flu-11-9.2d.flub-HA-reads.fastq

/opt/bioinformatics-software/minion-software/canu/Linux-amd64/bin/canu -p flu-11-9-flub-HA-only-canu \
-d flu-11-9-flub-HA-only-canu-out genomeSize=12k -nanopore-raw flu-11-9.2d.flub-HA-reads.fastq

This didn't work either. I find it very odd that assembling the whole FluB genome and assembling only the HA segment yield the same number of unitigs. It looks like some odd partitioning is going on, so I need to find out which parameters could be causing it.

## Single Segment Low Coverage Assembly

Canu has parameters intended specifically for low-coverage datasets. They essentially lower the required overlap length and relax the error stringency so that more reads survive error correction. This seemed to work, though it produced two different contigs that are fairly similar. They were also reported as unassembled, which makes sense since these should just be consensus sequences. I'm not sure why there are two, but mapping the MiSeq data against them should help decide which (if either) is correct.

### HA 

In [None]:
cd /data1/share/scratch/minion-canu/flub-only

canu -p flu -d flu-11-9-flub-HA-only-low-cov-canu-out genomeSize=2k corMhapSensitivity=high corMinCoverage=2 \
errorRate=0.035 minOverlapLength=499 corMaxEvidenceErate=0.3 -nanopore-raw flu-11-9.2d.flub-HA-reads.fastq

cd flu-11-9-flub-HA-only-low-cov-canu-out/

grep -c '>' *fasta

# align to each other to see how different they are
nucmer --maxmatch flu.unassembled.fasta flu.unassembled.fasta
show-coords -lcT out.delta 

S1 | E1 | S2 | E2 | LEN 1 | LEN 2 | % IDY | LEN R | LEN Q | COV R | COV Q | TAG (ref) | TAG (query)
---|----|----|----|-------|-------|-------|-------|-------|-------|-------|-----------|------------
1 | 1874 | 1 | 1874 | 1874 | 1874 | 100.00 | 1874 | 1874 | 100.00 | 100.00 | tig00000000 | tig00000000
2 | 1819 | 1874 | 48 | 1818 | 1827 | 97.72 | 1819 | 1874 | 99.95 | 97.49 | tig00000001 | tig00000000
1 | 1819 | 1 | 1819 | 1819 | 1819 | 100.00 | 1819 | 1819 | 100.00 | 100.00 | tig00000001 | tig00000001
48 | 1874 | 1819 | 2 | 1827 | 1818 | 97.72 | 1874 | 1819 | 97.49 | 99.95 | tig00000000 | tig00000001

They're 97.72% identical and in opposite orientations, so maybe these are the template and complement reads? Those should theoretically be merged into one contig... I guess the best way to assess this would be to assemble other segments to see if this is a trend.
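
To read identity and relative orientation off the `show-coords -lcT` output without eyeballing the table, a small awk filter can help. This is a sketch: `summarize_coords` is a hypothetical helper name, and it assumes the 13 tab-separated columns produced by the `-l -c -T` flags (coordinates first, identity in column 7, the two sequence tags last).

```shell
# Hypothetical helper: for every alignment between two DIFFERENT sequences,
# print reference tag, query tag, percent identity, and relative orientation.
# Header lines are skipped because their first field is not numeric.
summarize_coords() {
    awk -F'\t' '
        $1 ~ /^[0-9]+$/ && $12 != $13 {
            # opposite coordinate directions on ref and query mean a reverse-strand match
            strand = (($2 - $1) * ($4 - $3) < 0) ? "reverse" : "forward"
            printf "%s\t%s\t%s\t%s\n", $12, $13, $7, strand
        }
    ' "$@"
}

# Example (not run here):
#   show-coords -lcT out.delta | summarize_coords
```

Applied to the table above, only the two cross-contig rows remain, both at 97.72% identity in reverse orientation.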

In [None]:
cd /data1/share/scratch/minion-canu/flub-only/flu-11-9-flub-HA-only-low-cov-canu-out

ln -s ../../../miseq-assembly/flub/flub-11-9-blast-hits-assembly/flub.trimmed.r1.fastq
ln -s ../../../miseq-assembly/flub/flub-11-9-blast-hits-assembly/flub.trimmed.r2.fastq

bwa index flu.unassembled.fasta
bwa mem -t 30 flu.unassembled.fasta flub.trimmed.r1.fastq flub.trimmed.r2.fastq > flub.trimmed.sam && \
samtools view -b -o flub.trimmed.bam flub.trimmed.sam && samtools sort -o flub.trimmed.sort.bam flub.trimmed.bam && \
samtools index flub.trimmed.sort.bam && rm flub.trimmed.sam flub.trimmed.bam

Contig0 definitely seems to be correct, though the contig (used here as the mapping reference) has some deletions that the MiSeq data don't support.
![title](docs/minion-flub-ha-denovo-contig0.png)
![title](docs/minion-flub-ha-denovo-contig0-deletions.png)

Contig1 seems very erroneous, though since it has a fair amount of coverage we can't throw it out as garbage just yet.
![title](docs/minion-flub-ha-denovo-contig1.png)

So before we get ahead of ourselves, let's try to do an assembly with a different segment.

### PB2

In [None]:
cd /data1/share/scratch/minion-canu/flub-only

grep -P -B6 'Influenza B.+Segment:1' flu-11-9.2d.fludb.xml | grep 'Iteration_query-def' | \
perl -pe 's/^.+>(.+) FAS.+/$1/g' | grep --no-group-separator -A3 -F -f - flu-11-9.2d.fastq  > flu-11-9.2d.flub-PB2-reads.fastq

canu -p flu -d flu-11-9-flub-PB2-only-low-cov-canu-out genomeSize=3k corMhapSensitivity=high corMinCoverage=2 \
errorRate=0.035 minOverlapLength=499 corMaxEvidenceErate=0.3 -nanopore-raw flu-11-9.2d.flub-PB2-reads.fastq

cd flu-11-9-flub-PB2-only-low-cov-canu-out/

grep -c '>' *fasta

# align to each other to see how different they are
nucmer --maxmatch flu.unassembled.fasta flu.unassembled.fasta
show-coords -lcT out.delta 

### MP

In [None]:
cd /data1/share/scratch/minion-canu/flub-only

grep -P -B6 'Influenza B.+Segment:7' flu-11-9.2d.fludb.xml | grep 'Iteration_query-def' | \
perl -pe 's/^.+>(.+) FAS.+/$1/g' | grep --no-group-separator -A3 -F -f - flu-11-9.2d.fastq  > flu-11-9.2d.flub-MP-reads.fastq

canu -p flu -d flu-11-9-flub-MP-only-low-cov-canu-out genomeSize=1k corMhapSensitivity=high corMinCoverage=2 \
errorRate=0.1 minOverlapLength=499 corMaxEvidenceErate=0.3 -nanopore-raw flu-11-9.2d.flub-MP-reads.fastq

cd flu-11-9-flub-MP-only-low-cov-canu-out/

grep -c '>' *fasta

# align to each other to see how different they are
nucmer --maxmatch flu.unassembled.fasta flu.unassembled.fasta
show-coords -lcT out.delta 

### PA

In [None]:
cd /data1/share/scratch/minion-canu/flub-only

seg="PA"
segNum=3
segLen=2.2k

grep -P -B6 'Influenza B.+Segment:'$segNum flu-11-9.2d.fludb.xml | grep 'Iteration_query-def' | \
perl -pe 's/^.+>(.+) FAS.+/$1/g' | grep --no-group-separator -A3 -F -f - flu-11-9.2d.fastq  > flu-11-9.2d.flub-$seg-reads.fastq

canu -p flu -d flu-11-9-flub-$seg-only-low-cov-canu-out genomeSize=$segLen corMhapSensitivity=high corMinCoverage=2 \
errorRate=0.5 minOverlapLength=499 corMaxEvidenceErate=0.4 -nanopore-raw flu-11-9.2d.flub-$seg-reads.fastq

cd flu-11-9-flub-$seg-only-low-cov-canu-out/

grep -c '>' *fasta

# align to each other to see how different they are
nucmer --maxmatch flu.unassembled.fasta flu.unassembled.fasta
show-coords -lcT out.delta 
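
Since the per-segment cells differ only in segment name, `genomeSize`, and a couple of error parameters, the Canu call can be generated rather than copied. The sketch below only prints the commands (a dry run) so they can be reviewed before running; `canu_low_cov_cmd` is a hypothetical helper, and the defaults of 0.035 and 0.3 are simply the values used in most of the cells above.

```shell
# Hypothetical helper: print (not run) the low-coverage Canu command for one segment.
# Arguments: segment name, genomeSize, optional errorRate, optional corMaxEvidenceErate.
canu_low_cov_cmd() {
    local seg=$1 seg_len=$2 error_rate=${3:-0.035} max_evidence_erate=${4:-0.3}
    printf 'canu -p flu -d flu-11-9-flub-%s-only-low-cov-canu-out genomeSize=%s corMhapSensitivity=high corMinCoverage=2 errorRate=%s minOverlapLength=499 corMaxEvidenceErate=%s -nanopore-raw flu-11-9.2d.flub-%s-reads.fastq\n' \
        "$seg" "$seg_len" "$error_rate" "$max_evidence_erate" "$seg"
}

# Segment name and genomeSize pairs taken from the cells above.
while read -r seg seg_len; do
    canu_low_cov_cmd "$seg" "$seg_len"
done <<'EOF'
HA 2k
PB2 3k
MP 1k
PA 2.2k
EOF
```

Passing the error parameters explicitly, for example `canu_low_cov_cmd PA 2.2k 0.5 0.4`, reproduces the PA settings.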

### NS

In [None]:
cd /data1/share/scratch/minion-canu/flub-only

# NS is segment 8; genomeSize=1k is an assumed approximate segment length
grep -P -B6 'Influenza B.+Segment:8' flu-11-9.2d.fludb.xml | grep 'Iteration_query-def' | \
perl -pe 's/^.+>(.+) FAS.+/$1/g' | grep --no-group-separator -A3 -F -f - flu-11-9.2d.fastq  > flu-11-9.2d.flub-NS-reads.fastq

canu -p flu -d flu-11-9-flub-NS-only-low-cov-canu-out genomeSize=1k corMhapSensitivity=high corMinCoverage=2 \
errorRate=0.035 minOverlapLength=499 corMaxEvidenceErate=0.3 -nanopore-raw flu-11-9.2d.flub-NS-reads.fastq

cd flu-11-9-flub-NS-only-low-cov-canu-out/

grep -c '>' *fasta

# align to each other to see how different they are
nucmer --maxmatch flu.unassembled.fasta flu.unassembled.fasta
show-coords -lcT out.delta 

### Nanopolish with Canu

Nanopolish requires an alignment of some sort, so I'm not sure this is a good idea: the reference base bias that I uncovered may have an adverse effect on the data. I will try it with the top BLAST segment hits first, which will let me see whether this can resolve the reference base bias.

In [None]:
cd /data1/share/scratch/minion-canu/flub-only/nanopolish

ln -s ~/projects/MinION-notebook/clinical-analysis/consensus-comparison/flub/flu-11-9.2d.11-9-top-seg-hits.sort.bam
ln -s ~/projects/MinION-notebook/clinical-analysis/consensus-comparison/flub/flu-11-9.2d.11-9-top-seg-hits.sort.bam.bai
ln -s ~/projects/MinION-notebook/clinical-analysis/consensus-comparison/flub/11-9-top-seg-hits.fasta
ln -s ../flu-11-9.2d.fastq

samtools view -F 4 flu-11-9.2d.11-9-top-seg-hits.sort.bam | cut -f1 | grep --no-group-separator \
-A3 -F -f - flu-11-9.2d.fastq > flu-11-9.2d.11-9-top-seg-hits.flub-only.fastq

perl ~/custom-scripts/fq2fa.pl flu-11-9.2d.11-9-top-seg-hits.flub-only.fastq > flu-11-9.2d.11-9-top-seg-hits.flub-only.fasta

#had to change the makefile to be compatible with the new version of samtools

make -f /opt/bioinformatics-software/minion-software/nanopolish/scripts/consensus.make \
READS=flu-11-9.2d.11-9-top-seg-hits.flub-only.fasta ASSEMBLY=11-9-top-seg-hits.fasta
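
The FASTQ-to-FASTA conversion above relies on a custom script (`fq2fa.pl`) whose contents aren't shown here. As a rough stand-in, assuming strict four-line FASTQ records and that the header text should be carried over unchanged, the same step can be sketched in awk. `fastq_to_fasta` is a hypothetical name, and nanopolish may expect specific header fields that this does not add.

```shell
# Hypothetical stand-in for a FASTQ-to-FASTA converter.
# Assumes four-line FASTQ records; keeps the full header text after "@".
fastq_to_fasta() {
    awk '
        NR % 4 == 1 { print ">" substr($0, 2) }   # header: swap "@" for ">"
        NR % 4 == 2 { print }                      # sequence line
    ' "$1"
}

# Example (not run here):
#   fastq_to_fasta flu-11-9.2d.11-9-top-seg-hits.flub-only.fastq > flu-11-9.2d.11-9-top-seg-hits.flub-only.fasta
```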



### Canu Falcon Sense Corrected Reads