# Alternative Aligners
```
pi:ababaian
start: 2016 12 1st ish
complete : 2016 12 29
```
## Introduction

Previously, for alignment of reads to the 'hgr' genome, I used the following command (sample NA19240):
```
# Bowtie2 Alignment to hgr genome
bowtie2 -x ~/Crown/resources/hgr/hgr -1 SRR794330_1.filt.fastq.gz -2 SRR794330_2.filt.fastq.gz --very-sensitive | samtools view -bS - > NA19240_hgr.bam
```
Yielding flagstat:
```
71087208 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplimentary
0 + 0 duplicates
596582 + 0 mapped (0.84%:-nan%)
71087208 + 0 paired in sequencing
35543604 + 0 read1
35543604 + 0 read2
322054 + 0 properly paired (0.45%:-nan%)
347818 + 0 with itself and mate mapped
248764 + 0 singletons (0.35%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
```
These are the stats to beat. Bowtie2 runs in a few hours and yields quite a good alignment. I've become a bit more skeptical about whether this is the best method for alignment, especially if there are alternative haplotypes with multiple variations in a cluster.
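To make the "stats to beat" easy to compare across aligners, the mapped-read count can be pulled straight out of a flagstat file. This is a minimal sketch, assuming the flagstat layout shown above; `flagstat_mapped` is a hypothetical helper name, and the demo file holds only three lines copied from that output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<mapped> <total> <percent>" from a samtools flagstat file.
flagstat_mapped() {
    awk '
        $4 == "in" && $5 == "total" { total = $1 }   # "N + 0 in total (...)"
        $4 == "mapped"              { mapped = $1 }  # "N + 0 mapped (...)" only
        END {
            if (total > 0) printf "%d %d %.2f\n", mapped, total, 100 * mapped / total
        }
    ' "$1"
}

# Demo on three lines copied from the Bowtie2 flagstat above
demo=$(mktemp)
printf '%s\n' \
    '71087208 + 0 in total (QC-passed reads + QC-failed reads)' \
    '596582 + 0 mapped (0.84%:-nan%)' \
    '347818 + 0 with itself and mate mapped' > "$demo"
flagstat_mapped "$demo"
rm -f "$demo"
```

Matching on the fourth field rather than on the word "mapped" anywhere in the line keeps the "with itself and mate mapped" and "mate mapped to a different chr" lines out of the count.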


## Objective

* Test alternative aligners and determine how well they perform relative to bowtie2.

### Hypothesis
    
* Using more sensitive (try-hard) short read aligners will place more divergent reads on the hgr genome
* More divergent reads being aligned will allow for more sensitive nucleotide variant calling over rDNA
* More divergent reads being aligned will allow for better structural variant calling over rDNA
    
    
### Aligners to test

* [Bowtie2](http://www.nature.com/nmeth/journal/v9/n4/full/nmeth.1923.html): My work-horse aligner; it's fast and does a good job. So far I've used this.
* [deBGA](http://bioinformatics.oxfordjournals.org/content/32/21/3224): Uses de Bruijn graphs for read alignment. Theoretically, a good dBG aligner with a genome containing known variants should out-perform seed-extend aligners.
* [Stampy](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3106326/): Hybrid aligner which excels at indel detection. 


## Methods
### Bowtie2

In [1]:
# I won't re-run bowtie2 alignment
bowtie2 --version


/usr/bin/bowtie2-align-s version 2.2.6
64-bit
Built on lgw01-12
Mon Dec 28 11:09:46 UTC 2015
Compiler: gcc version 5.3.1 20151219 (Ubuntu 5.3.1-4ubuntu1) 
Options: -O3 -m64 -msse2  -funroll-loops -g3 -Wl,-Bsymbolic-functions -Wl,-z,relro -DPOPCNT_CAPABILITY -DWITH_TBB
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}


### deBGA

In [None]:
# These operations were run outside of the notebook in a separate terminal

git clone https://github.com/hitbc/deBGA.git

cd deBGA
make

deBGA index -k 20 hgr.fa hgr
# Genome index is 2 Gb

# 1.57 seconds to allocate memory
# 1238.4 seconds to map reads


deBGA aln ~/Crown/resources/hgr/hgr SRR794330_1.filt.fastq.gz SRR794330_2.filt.fastq.gz NA19240.deBGA.sam

# Pipe these together in the future
samtools view -b NA19240.deBGA.sam > NA19240.deBGA.bam

samtools sort NA19240.deBGA.bam -o NA19240.deBGA.sort.bam

# rm NA19240.deBGA.bam NA19240.deBGA.sam

samtools flagstat NA19240_hgr.bam > NA19240_bt2.flagstat
# bowtie2: 596582 mapped reads. 0.84%

samtools flagstat NA19240.deBGA.sort.bam > NA19240_bga.flagstat
# deBGA: 3910 mapped reads. 0.01%

mv NA19240.deBGA.sort.bam NA19240.deBGA.bam
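The "pipe these together" note above could look something like the following. This is only a sketch: `sam_to_sorted_bam` is a hypothetical helper name, it assumes `samtools` (1.x) is on the PATH, and it was not part of the original run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a SAM file to a coordinate-sorted BAM in one pass,
# skipping the intermediate unsorted BAM and the final mv.
sam_to_sorted_bam() {
    local sam=$1 out_bam=$2
    samtools view -b "$sam" | samtools sort -o "$out_bam" -
}

# Usage (not run here):
#   sam_to_sorted_bam NA19240.deBGA.sam NA19240.deBGA.bam
```

This keeps only the SAM and the final sorted BAM on disk, instead of the SAM plus two BAM files.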


### Stampy

In [None]:
# Download stampy
wget http://www.well.ox.ac.uk/bioinformatics/Software/Stampy-latest.tgz

tar -xvf Stampy-latest.tgz

cd stampy-1.0.30

make
# python scripts made here now

# Build the HGR genome file (hgr.stidx)
./stampy.py -G hgr /home/artem/Crown/resources/hgr/hgr.fa

# Build the hash table for the genome (hgr.sthash)
./stampy.py -g hgr -H hgr

# Move both index files next to the other hgr resources
mv hgr.stidx hgr.sthash ~/Crown/resources/hgr/

cd ~/Crown/data/1kgenomes/

# First Run ----------------------------------
### ~/Desktop/stampy-1.0.30/stampy.py -g ~/Crown/resources/hgr/hgr -h ~/Crown/resources/hgr/hgr -M SRR794330_1.filt.fastq.gz SRR794330_2.filt.fastq.gz | samtools view -b - > NA19240.stampy.bam

# Note the command with piping above took 3 days to complete. Try again without
# to see if it increases efficiency

### samtools sort NA19240.stampy.bam -o NA19240.stampy.sort.bam
### mv NA19240.stampy.sort.bam NA19240.stampy.bam

# 7442219 + 0 mapped (13.00% : N/A) !! fuck
# File corrupted after moving it... !! double fuck
# Re-run

# Second Run ---------------------------------

~/Desktop/stampy-1.0.30/stampy.py -g ~/Crown/resources/hgr/hgr -h ~/Crown/resources/hgr/hgr -M SRR794330_1.filt.fastq.gz SRR794330_2.filt.fastq.gz > NA19240.stampy.sam

# With a few suspend interruptions took ~ 2 days to align

samtools view -bh NA19240.stampy.sam | samtools sort - -o NA19240.stampy.bam

# Start 161202-080200
# End  161204-192600
#stampy: # Nucleotides (all/1/2):	1376544600	688272300	688272300
#stampy: # Variants:             	117847472	54546261	63301211
#stampy: # Fraction:             	0.0856	0.0793	0.0920
#stampy: # Paired-end insert size: 91.2 +/- 44.5  (5167630 pairs)
#stampy: Done

samtools flagstat NA19240.stampy.bam > NA19240_stampy.flagstat

In [3]:
cd ~/Crown/data/1kgenomes/

# Flagstats from each aligner
echo Bowtie2 ---------------------------------------------
cat NA19240_bt2.flagstat
echo deBGA -----------------------------------------------
cat NA19240_bga.flagstat
echo Stampy ----------------------------------------------
cat NA19240_stampy.flagstat
echo -----------------------------------------------------


Bowtie2 ---------------------------------------------
71087208 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
596582 + 0 mapped (0.84% : N/A)
71087208 + 0 paired in sequencing
35543604 + 0 read1
35543604 + 0 read2
322054 + 0 properly paired (0.45% : N/A)
347818 + 0 with itself and mate mapped
248764 + 0 singletons (0.35% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
deBGA -----------------------------------------------
71087208 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
3910 + 0 mapped (0.01% : N/A)
71087208 + 0 paired in sequencing
35543604 + 0 read1
35543604 + 0 read2
238 + 0 properly paired (0.00% : N/A)
830 + 0 with itself and mate mapped
3080 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
Stampy -----------------

#### Notes:
I was quite surprised by how many reads get aligned by Stampy; ~7.4 million vs. ~600k by bowtie2 (596,582 in the flagstat above) is a huge discrepancy. But this is also saying that ~10% of the genome is rDNA, which probably isn't a fair assessment. Since the aligner is good at placing 'divergent' reads, a lot of genomic sequence is probably being forced onto the only reference sequence that is available.
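As a quick arithmetic check on those numbers, here is a small sketch using the counts recorded in this notebook. It assumes the Stampy first-run count (7,442,219) is out of the same 71,087,208 total reads as the Bowtie2 run; `pct` and `fold` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Counts copied from the flagstat output and the Stampy first-run note above.
total=71087208          # reads in SRR794330 (both mates)
bt2_mapped=596582       # Bowtie2 mapped reads
stampy_mapped=7442219   # Stampy mapped reads (first run)

# Hypothetical helper: percentage of part in whole, two decimals.
pct() { awk -v p="$1" -v w="$2" 'BEGIN { printf "%.2f\n", 100 * p / w }'; }
# Hypothetical helper: fold difference a / b, one decimal.
fold() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'; }

echo "Bowtie2 mapped:   $(pct "$bt2_mapped" "$total")%"
echo "Stampy mapped:    $(pct "$stampy_mapped" "$total")%"
echo "Stampy / Bowtie2: $(fold "$stampy_mapped" "$bt2_mapped")x"
```

Under that assumption Stampy places roughly 10.5% of all reads on hgr, about 12.5 times as many as Bowtie2.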

The largest discrepancy between Stampy and Bowtie2 is in how the algorithms treat extension from a strong seed sequence match. At chr13:1013900 there is a 30 bp poly-T tract followed by (TTC) repeats. This looks like it makes a great seed match for many sequences across the genome, and Stampy puts LOTS of reads there.

![Poly-T Alignment](../figure/20161209_polyT_align.png)

But looking at the broader picture of the entire 18S, Stampy is better at aligning reads with sequence mismatches. Allele cut-off set to 0.05 here:

![Alignment comparison: 18S](../figure/20161209_align_compare_18S.png)

deBGA obviously didn't work and needs to be optimized; kind of not stoked to do this. Also, I'm not sure how much there is to gain from a dBG aligner before I have a 'population of genomes' to align to. Something to keep in the back pocket for later.


In [4]:
# Let's compare the difference in alignment over exclusively the 18S DNA region
# chr13:1,003,499-1,005,599  (excludes (AG) simple repeats)
echo -e "chr13\t1003499\t1005599" > 18S.tmp
echo Bowtie2: 18S -------------------------------------------
samtools view -bh -L 18S.tmp NA19240_hgr.bam | samtools flagstat -
echo ''
echo Stampy: 18S -------------------------------------------
samtools view -bh -L 18S.tmp NA19240.stampy.bam | samtools flagstat -
echo ''
echo deBGA: 18S -------------------------------------------
samtools view -bh -L 18S.tmp NA19240.deBGA.bam| samtools flagstat -
echo ''

Bowtie2: 18S -------------------------------------------
13281 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
13145 + 0 mapped (98.98% : N/A)
13281 + 0 paired in sequencing
6640 + 0 read1
6641 + 0 read2
12383 + 0 properly paired (93.24% : N/A)
13009 + 0 with itself and mate mapped
136 + 0 singletons (1.02% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Stampy: 18S -------------------------------------------
13616 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
13494 + 0 mapped (99.10% : N/A)
13616 + 0 paired in sequencing
6808 + 0 read1
6808 + 0 read2
11161 + 0 properly paired (81.97% : N/A)
13369 + 0 with itself and mate mapped
125 + 0 singletons (0.92% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

deBGA: 18S ------------------------------
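One detail about the region file used in the cell above: `samtools view -L` reads a BED file, and BED coordinates are 0-based and half-open. Writing 1003499 as the BED start therefore begins the window at 1-based position 1,003,500, one base later than the region named in the comment. That is unlikely to matter for these counts, but a small converter avoids the off-by-one. This is a sketch; `region_to_bed` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a 1-based inclusive region string such as
# "chr13:1,003,499-1,005,599" into a 0-based half-open BED line.
region_to_bed() {
    local region=${1//,/}        # drop thousands separators
    local chrom=${region%%:*}    # text before the colon
    local range=${region#*:}     # "start-end"
    local start=${range%-*} end=${range#*-}
    printf '%s\t%d\t%d\n' "$chrom" "$((start - 1))" "$end"
}

# Print the BED line for the 18S window; redirect to 18S.tmp to use it with -L
region_to_bed "chr13:1,003,499-1,005,599"
```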

Over 18S there is a much smaller difference between Bowtie2 and Stampy. Looking at the alignments, the difference seems to stem from how the two methods align over indels. At chr13:1004360 there is a deletion in some sequences which exemplifies this. The position highlighted with the black dotted line is chr13:1004382.

![Stampy alignment at 4360](../figure/20161209_del4360_stampy.png)
```
STAMPY ======================
Deletion
chr13:1,004,365
Total count: 484
A      : 1  (0%,     1+,   0- )
C      : 3  (1%,     0+,   3- )
G      : 480  (99%,     232+,   248- )
T      : 0
N      : 0
---------------
DEL: 18
INS: 1

Highlighted position
chr13:1,004,382
Total count: 521
A      : 10  (2%,     5+,   5- )
C      : 491  (94%,     245+,   246- )
G      : 15  (3%,     9+,   6- )
T      : 5  (1%,     3+,   2- )
N      : 0
---------------
DEL: 1
INS: 0
```

Compared to the bowtie2 alignment at the same position

![Bowtie2 alignment at 4360](../figure/20161209_del4360_bt2.png)
```
BOWTIE2  =====================
Deletion
chr13:1,004,365
Total count: 484
A      : 1  (0%,     0+,   1- )
C      : 3  (1%,     0+,   3- )
G      : 480  (99%,     231+,   249- )
T      : 0
N      : 0
---------------
DEL: 2
INS: 0

Highlighted Position
chr13:1,004,382
Total count: 503
A      : 1  (0%,     0+,   1- )
C      : 496  (99%,     247+,   249- )
G      : 2  (0%,     2+,   0- )
T      : 4  (1%,     2+,   2- )
N      : 0
---------------
```

### del4362-4372

It's variants like this that are the 'trade-off' between bowtie2 and Stampy. The deletion is present in 3.52% of the reads (18 / 511). It's highly unlikely that it's simply sequencing error: the rate is too high, and the deletion is called over mostly the same 10 bp window (+/- 1 bp).

Biologically, what does this mean? It's tempting to write it off as a 'pseudo-retrogene' degraded elsewhere in the genome. At 3.5%, with an estimate of 100 - 450 copies of rDNA, that puts this at 3.5 - 15.75 copies in the genome. There also look to be at least 5 distinct haplotypes of the variant, which would mean each of them is present at somewhere between under 1 and about 3 copies. A simple solution here is that they are single-copy variants which are related to one another. But that still doesn't answer why there are multiple copies of this deletion-containing copy.
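The copy-number estimate can be reproduced from the raw counts. This is a sketch using the 18 / 511 read counts quoted above and the assumed 100 - 450 rDNA copy range; `variant_copies` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Counts quoted above: 18 deletion-supporting reads out of 511.
del_reads=18
depth=511

# Hypothetical helper: expected number of variant copies when the variant is seen
# in del/depth of the reads and the genome carries `copies` rDNA copies in total.
variant_copies() {
    awk -v d="$1" -v n="$2" -v copies="$3" 'BEGIN { printf "%.2f\n", d / n * copies }'
}

for rdna_copies in 100 450; do
    est=$(variant_copies "$del_reads" "$depth" "$rdna_copies")
    echo "$rdna_copies rDNA copies -> ~$est copies carrying the deletion"
done
```

Using the unrounded frequency gives roughly 3.5 to 15.9 copies, in line with the 3.5 - 15.75 range obtained from the rounded 3.5%.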

**Deletion 4362 Haplotype from NA19240 using Stampy**
![del4362 Haplotype](../figure/20161209_del4362_hap.png)


#### Possible Sources of Error

* mt rDNA
* Contaminating parasite / organism in sample
* PCR Artifact (unlikely)
* pseudo rRNA
* an rRNA derived gene (not true 18S)



## Conclusions

* Stampy is a more sensitive aligner; I think there is little doubt about that, especially over indels, as the paper claims.
* Stampy takes significantly longer to complete an alignment than bowtie2 (days instead of hours). May not be feasible for aligning dozens or hundreds of genomes.
* A hybrid approach of using Stampy for a smaller number of genomes to detect indels and making 'variant' rDNA copies combined with the speed and accuracy of bowtie2 is probably the best approach.


* Use Stampy to align a pilot set of ~10 genomes and describe the larger structural variants which are present (along with something like lumpy). Using that information, align a larger number of genomes with bowtie2.

qed