# Bowtie2 Optimization
```
Lead     : Charles
Issue    : #35 - Hard Optimize `bowtie2` alignment parameters
Start    : 2020 04 19
Complete : YYYY MM DD
Files    : ~/serratus/notebook/200419
```

## Introduction
We are currently running `bowtie2 --very-sensitive-local ...` for detecting homologous CoV sequences. We need a method (script) to test an array of `bowtie2` parameters for time, sensitivity and specificity of alignment.

The current preset expands to `-D 20 -R 3 -N 0 -L 20 -i S,1,0.50`. As a starting point, we would like to test the space `-D 5-25 -R 1-4 -N 0-1 -L 15-30`.
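
The search space above can be enumerated up front so that each combination becomes one line of a job list. The following is a minimal sketch, assuming a full Cartesian grid with a step of 1 over each range and `-i` held at its current value; the function name `param_grid` is hypothetical and not part of the existing scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one bowtie2 parameter set per line for the
# grid -D 5-25, -R 1-4, -N 0-1, -L 15-30 (step of 1 is an assumption).
param_grid() {
    local d r n l
    for d in {5..25}; do
        for r in {1..4}; do
            for n in 0 1; do
                for l in {15..30}; do
                    echo "-D $d -R $r -N $n -L $l -i S,1,0.50"
                done
            done
        done
    done
}
```

`param_grid | wc -l` reports 2688 combinations (21 x 4 x 2 x 16), so a full grid over the 11 read sets would need 29568 alignments. Sweeping one parameter at a time is far cheaper.
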
### Objectives
Determine optimal `bowtie2` alignment parameters by outputting the following:

1. Wall-clock / CPU time / User time for each setting.
2. TP / FP / TN / FN
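
Once the four counts are available, sensitivity and specificity follow directly as TP / (TP + FN) and TN / (TN + FP). The sketch below shows that arithmetic only. How each read is assigned to a class is not specified here and is left to the caller, and the function name `confusion_metrics` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive sensitivity and specificity from raw counts.
# Usage: confusion_metrics TP FP TN FN
confusion_metrics() {
    awk -v tp="$1" -v fp="$2" -v tn="$3" -v fn="$4" 'BEGIN {
        # Guard against empty classes to avoid division by zero
        sens = (tp + fn > 0) ? tp / (tp + fn) : 0
        spec = (tn + fp > 0) ? tn / (tn + fp) : 0
        printf "sensitivity\t%.4f\nspecificity\t%.4f\n", sens, spec
    }'
}
```

For example, `confusion_metrics 90 5 95 10` prints a sensitivity of `0.9000` and a specificity of `0.9500`.
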



## Materials and Methods

- [bowtie2 manual](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml)

Reads are simulated from divergent genomes of SARS-CoV-2, created by mutating the SARS-CoV-2 reference sequence with random substitutions at different rates (0.1% - 40%).

Reads can be accessed via Amazon S3: `s3://serratus-public/notebook/200411/`



In [None]:
# Need to install AWS CLI:
# https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
mkdir aws-install
./aws/install -i aws-install

# download simulated reads from S3
# remember to configure aws with Access Key ID and Secret Access Key first
# https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html#cli-quick-configuration
mkdir fq
mkdir fa
aws s3 cp --recursive s3://serratus-public/notebook/200411/fq/ ./fq
aws s3 cp --recursive s3://serratus-public/notebook/200411/fa/ ./fa

# Load tools before their first use
module load bowtie2
module load samtools

# Index unmutated sequence
gzip -dc fa/index.fa.gz > ./cov.index.fa
bowtie2-build cov.index.fa cov.index

# Run bowtie2 align for each divergence set and clock runtimes
mkdir -p bam
mkdir -p runtimes

MU=(0 30 300 1500 3000 4500 6000 7500 9000 10500 12000)

for mu in "${MU[@]}"
do
    FQ1=fq/sim.cov."$mu"_1.fq.gz
    FQ2=fq/sim.cov."$mu"_2.fq.gz

    # stdout carries the BAM; stderr carries the bowtie2 summary and the `time` report
    ( time bowtie2 --very-sensitive-local \
      -x cov.index -1 "$FQ1" -2 "$FQ2" \
      | samtools view -b -G 12 - ) \
      1> sim.cov."$mu".bam \
      2> "$mu".runtime

    mv sim.cov."$mu".bam ./bam/
    mv "$mu".runtime ./runtimes/

done
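
Because a failed or partial S3 download would only surface midway through the alignment loop, it can help to confirm that both mates exist for every mutational load first. This is a minimal sketch assuming the `sim.cov.<mu>_<mate>.fq.gz` naming used above; the function name `missing_inputs` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list any paired FASTQ files that are missing or empty.
# Usage: missing_inputs DIR MU...
missing_inputs() {
    local dir=$1 mu mate f
    shift
    for mu in "$@"; do
        for mate in 1 2; do
            f="$dir/sim.cov.${mu}_${mate}.fq.gz"
            # -s is true only for files that exist and are non-empty
            [ -s "$f" ] || echo "$f"
        done
    done
}
```

Calling `missing_inputs fq "${MU[@]}"` prints nothing when the read sets are complete.
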



In [None]:
# Print out alignment rate for each mutational load
cd runtimes

MU=(0 30 300 1500 3000 4500 6000 7500 9000 10500 12000)

for mu in ${MU[@]}
do
  echo $mu \
  $(sed -n '15p' "$mu".runtime | cut -f1 -d' ' -) \
  $(sed -n '17p' "$mu".runtime | cut -f2 -) \
  $(sed -n '18p' "$mu".runtime | cut -f2 -) \
  $(sed -n '19p' "$mu".runtime | cut -f2 -)
done
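
The cell above addresses the `.runtime` file by fixed line numbers, which holds only while the bowtie2 summary keeps its current length. A pattern-based variant is sketched below. It assumes the default output format of the bash `time` keyword (for example `1m2.500s`) and converts it to seconds; the function name `summarise_runtime` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarise one .runtime file by pattern, not line number.
# Prints: overall alignment rate, then real/user/sys in seconds (tab-separated).
summarise_runtime() {
    awk '
        /overall alignment rate/ { rate = $1 }
        $1 == "real" || $1 == "user" || $1 == "sys" {
            # bash `time` prints e.g. 1m2.500s; convert to seconds
            split($2, t, "m")
            sub(/s$/, "", t[2])
            secs[$1] = t[1] * 60 + t[2]
        }
        END {
            printf "%s\t%.3f\t%.3f\t%.3f\n", rate, secs["real"], secs["user"], secs["sys"]
        }
    ' "$1"
}
```

Reporting seconds rather than the `XmY.YYYs` strings makes the timings directly comparable across parameter settings.
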

Vary `-D` between `5-25`

 `-D <int>           give up extending after <int> failed extends in a row (15)`

Default for `-D` in `--very-sensitive-local` is `20`


In [None]:
# Run bowtie2 align for each divergence set and clock runtimes
module load bowtie2
module load samtools

D=({5..25})
MU=(0 30 300 1500 3000 4500 6000 7500 9000 10500 12000)

for d in "${D[@]}"
do
    # One pair of output directories per -D value
    mkdir -p D"$d"_bam D"$d"_runtimes

    for mu in "${MU[@]}"
    do
        FQ1=fq/sim.cov."$mu"_1.fq.gz
        FQ2=fq/sim.cov."$mu"_2.fq.gz

        # Explicit flags do not switch on local mode, so pass --local to stay
        # comparable with the --very-sensitive-local baseline
        ( time bowtie2 --local -D "$d" -R 3 -N 0 -L 20 -i S,1,0.50 \
          -x cov.index -1 "$FQ1" -2 "$FQ2" \
          | samtools view -b -G 12 - ) \
          1> D"$d"_bam/sim.cov."$mu".bam \
          2> D"$d"_runtimes/"$mu".runtime

    done

done



Vary `-R` between `1-4`

 `-R <int>           for reads w/ repetitive seeds, try <int> sets of seeds (2)`

Default for `-R` in `--very-sensitive-local` is `3`


In [None]:
# Run bowtie2 align for each divergence set and clock runtimes
module load bowtie2
module load samtools

R=({1..4})
MU=(0 30 300 1500 3000 4500 6000 7500 9000 10500 12000)

for r in "${R[@]}"
do
    # One pair of output directories per -R value
    mkdir -p R"$r"_bam R"$r"_runtimes

    for mu in "${MU[@]}"
    do
        FQ1=fq/sim.cov."$mu"_1.fq.gz
        FQ2=fq/sim.cov."$mu"_2.fq.gz

        # Explicit flags do not switch on local mode, so pass --local to stay
        # comparable with the --very-sensitive-local baseline
        ( time bowtie2 --local -D 20 -R "$r" -N 0 -L 20 -i S,1,0.50 \
          -x cov.index -1 "$FQ1" -2 "$FQ2" \
          | samtools view -b -G 12 - ) \
          1> R"$r"_bam/sim.cov."$mu".bam \
          2> R"$r"_runtimes/"$mu".runtime

    done

done
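
The `-D` and `-R` sweeps differ only in which flag is varied, so the option string can be built by one helper rather than copied per cell. This is a sketch under stated assumptions: the defaults are the settings quoted in the Introduction, `--local` is added on the assumption that the sweep should match the local-mode preset, and the function name `bowtie2_opts` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the bowtie2 option string with one flag overridden.
# Defaults mirror the settings quoted in the Introduction.
# Usage: bowtie2_opts FLAG VALUE   (FLAG is one of D, R, N, L)
bowtie2_opts() {
    local D=20 R=3 N=0 L=20
    case $1 in
        D) D=$2 ;;
        R) R=$2 ;;
        N) N=$2 ;;
        L) L=$2 ;;
        *) echo "unknown flag: $1" >&2; return 1 ;;
    esac
    echo "--local -D $D -R $R -N $N -L $L -i S,1,0.50"
}
```

Inside a sweep loop the call becomes `bowtie2 $(bowtie2_opts R "$r") -x cov.index -1 "$FQ1" -2 "$FQ2"`, which also makes the `-N` and `-L` sweeps a one-line change.
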



In [None]:
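# Print out alignment rate for each mutational load.
# RUNDIR defaults to the baseline run; point it at one of the
# per-parameter *_runtimes directories to summarise a sweep.
RUNDIR=${RUNDIR:-runtimes}
cd "$RUNDIR"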

MU=(0 30 300 1500 3000 4500 6000 7500 9000 10500 12000)

for mu in ${MU[@]}
do
  echo $mu \
  $(sed -n '15p' "$mu".runtime | cut -f1 -d' ' -) \
  $(sed -n '17p' "$mu".runtime | cut -f2 -) \
  $(sed -n '18p' "$mu".runtime | cut -f2 -) \
  $(sed -n '19p' "$mu".runtime | cut -f2 -)
done
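
To compare settings side by side, the per-directory summaries can be gathered into one table. The sketch below assumes sweep output directories named `<FLAG><VALUE>_runtimes` (for example `D5_runtimes`) that hold `<mu>.runtime` files; both the naming scheme and the function name `collect_rates` are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tabulate overall alignment rate across sweep directories.
# Assumes directories named <FLAG><VALUE>_runtimes holding <mu>.runtime files.
# Usage: collect_rates DIR...   -> flag, value, mu, rate (tab-separated)
collect_rates() {
    local dir name flag value f mu rate
    for dir in "$@"; do
        name=$(basename "$dir")
        name=${name%_runtimes}
        flag=${name:0:1}      # first character, e.g. D
        value=${name:1}       # remainder, e.g. 5
        for f in "$dir"/*.runtime; do
            [ -e "$f" ] || continue   # skip directories without runtime files
            mu=$(basename "$f" .runtime)
            rate=$(awk '/overall alignment rate/ { print $1 }' "$f")
            printf '%s\t%s\t%s\t%s\n' "$flag" "$value" "$mu" "$rate"
        done
    done
}
```

A call such as `collect_rates D*_runtimes > rates.tsv` yields one row per setting and mutational load, ready for plotting.
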

## Results & Discussion
