# CoV Divergent Read Simulations
```
Lead     : ababaian
Issue    : na
start    : 2020 04 11
complete : 2020 04 12
files    : ~/serratus/notebook/200411/
```

## Introduction
The key objective of serratus is to discover new species of CoV, not just find more libraries with known CoV. This experiment takes the SARS-CoV-2 reference sequence and mutates it with random substitutions at different rates (0.1% - 40%) to create 'divergent genomes'. Illumina reads are then simulated for each divergent genome.

These simulated divergent reads are then mapped back to the index (unmutated) sequence to measure how the alignment rate falls off with divergence.

### Objectives
- Create simulated SARS-CoV-2 Divergent sequences and simulated Illumina reads based on those sequences
- Benchmark sensitivity vs. divergence for standard bowtie2 (`--very-sensitive-local`)

## Materials and Methods



In [None]:
# EC2 Instance Commands:
# Build/Run `serratus-align` container for indexing
sudo yum install -y docker
sudo yum install -y git
sudo yum install -y less
sudo service docker start

git clone https://github.com/ababaian/serratus.git; cd serratus
sudo docker build -t serratus-base:0 -t serratus-base:latest -f docker/Dockerfile .
sudo docker build -t serratus-align:0 -t serratus-align:latest -f docker/serratus-align/Dockerfile .

sudo docker run --rm --entrypoint /bin/bash -it serratus-align:0


In [None]:
# local ART install
wget https://www.niehs.nih.gov/research/resources/assets/docs/artsrcmountrainier2016.06.05linux.tgz
tar -xvf artsrcmountrainier2016.06.05linux.tgz
cd art_src_MountRainier_Linux/

sudo yum install gcc-c++ gsl gsl-devel
./configure && make && make install
cp art_illumina /usr/bin/
cd ..

# EMBOSS Tools
wget ftp://emboss.open-bio.org/pub/EMBOSS/EMBOSS-6.6.0.tar.gz
tar -xvf EMBOSS-6.6.0.tar.gz
cd EMBOSS-6.6.0/
./configure --without-x && make && make install
cp emboss/msbar /usr/bin/

# Wuhan SARS-CoV-2 genome
wget ftp://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/chromosomes/NC_045512v2.fa.gz
gzip -d NC_045512v2.fa.gz
samtools faidx NC_045512v2.fa

In [None]:
mkdir sim; cd sim

cp ../NC_045512v2.fa index.fa

# Mutation rates (substitutions per base):
# 0, 0.001, 0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4
# expressed as substitution counts (msbar -count) for a ~30 kb genome
MU=(0 30 300 1500 3000 4500 6000 7500 9000 10500 12000)

for mu in ${MU[@]}
do
  # Mutate input sequence at mu-rate
  msbar -point 4 -block 0 -codon 0 \
    -count $mu \
    -sequence index.fa \
    -outseq sim.cov."$mu".fa
  
  # Change header with mutation rate 
    sed -i "s/>.*/>mu_$mu/g" sim.cov."$mu".fa
    
  # Simulate reads based on each mutation rate
  art_illumina \
    --seqSys HS20 --paired \
    --in sim.cov."$mu".fa \
    --len 100 --mflen 300 --sdev 1 \
    --fcov 50 \
    --rndSeed 666 \
    --out sim.cov."$mu"_ --noALN \
    > log.tmp
    
  rm log.tmp
done
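
The `MU` counts above are the nominal substitution rate multiplied by a rounded 30 kb genome length. The sketch below shows that conversion, plus one possible way to check how far a mutated sequence actually ended up from the original. Both helper names (`rate_to_count`, `realized_divergence`) are hypothetical and not part of serratus, msbar, or ART. The check assumes single-record FASTA files of equal length, which holds for substitution-only mutation.

```shell
# Nominal msbar -count for a per-base substitution rate.
# genome_len defaults to the rounded 30 kb used for the MU array above;
# the NC_045512v2 record itself is slightly shorter.
rate_to_count() {
  local rate=$1 genome_len=${2:-30000}
  awk -v r="$rate" -v n="$genome_len" 'BEGIN { printf "%d\n", r * n + 0.5 }'
}

# Fraction of positions that differ between two single-record FASTA files
# of equal length. Prints NA and returns 1 if the lengths do not match.
realized_divergence() {
  awk '
    FNR == 1 { file++ }                       # count input files
    /^>/     { next }                         # skip FASTA headers
             { seq[file] = seq[file] toupper($0) }
    END {
      n = length(seq[1])
      if (n == 0 || n != length(seq[2])) { print "NA"; exit 1 }
      for (i = 1; i <= n; i++)
        if (substr(seq[1], i, 1) != substr(seq[2], i, 1)) diff++
      printf "%.4f\n", diff / n
    }
  ' "$1" "$2"
}

# Toy demonstration on hypothetical 10-base sequences (not real data).
tmp=$(mktemp -d)
printf '>ref\nACGTACGTAC\n' > "$tmp/ref.fa"
printf '>mut\nACGTTCGTAA\n' > "$tmp/mut.fa"
rate_to_count 0.05                                # prints 1500
realized_divergence "$tmp/ref.fa" "$tmp/mut.fa"   # prints 0.2000
```

On the real files this would be run as `realized_divergence index.fa sim.cov.3000.fa` inside `sim/`, before the FASTA files are moved and compressed.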


In [None]:
# Organize and upload 
mkdir fa; mv *.fa fa/; cd fa; gzip *; cd ..
mkdir fq; mv *.fq fq/; cd fq; gzip *; cd ..

# ../200411/
aws s3 cp --recursive fa s3://serratus-public/notebook/200411/fa/
aws s3 cp --recursive fq s3://serratus-public/notebook/200411/fq/


In [None]:
# On C4.large EC2 Instance
mkdir ~/seq; cd ~/seq
aws s3 sync s3://serratus-public/notebook/200411/ ./

# Index unmutated sequence
gzip -dc fa/index.fa.gz > ./cov.index.fa 
bowtie2-build cov.index.fa cov.index

In [None]:
# Run bowtie2 align for each divergence set
mkdir -p bam
mkdir -p runtimes

MU=(0 30 300 1500 3000 4500 6000 7500 9000 10500 12000)

for mu in ${MU[@]}
do
    FQ1=fq/sim.cov."$mu"_1.fq.gz
    FQ2=fq/sim.cov."$mu"_2.fq.gz
    
    ( time bowtie2 --very-sensitive-local \
      -x cov.index -1 $FQ1 -2 $FQ2 \
      | samtools view -b -G 12 - ) \
      1> sim.cov."$mu".bam \
      2> "$mu".runtime
    
    mv sim.cov."$mu".bam ./bam/
    mv "$mu".runtime ./runtimes/
    
done



In [None]:
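# Print, for each mutational load: msbar count, overall alignment rate, real/user/sys time
# (line 15 of each log is the bowtie2 overall alignment rate; lines 17-19 come from `time`)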
cd runtimes

MU=(0 30 300 1500 3000 4500 6000 7500 9000 10500 12000)

for mu in ${MU[@]}
do
  echo $mu \
  $(sed -n '15p' "$mu".runtime | cut -f1 -d' ' -) \
  $(sed -n '17p' "$mu".runtime | cut -f2 -) \
  $(sed -n '18p' "$mu".runtime | cut -f2 -) \
  $(sed -n '19p' "$mu".runtime | cut -f2 -)
done

```
0 100.00% 0m2.644s 0m2.852s 0m0.043s
30 100.00% 0m2.682s 0m2.908s 0m0.025s
300 99.99% 0m2.986s 0m3.197s 0m0.051s
1500 98.03% 0m3.148s 0m3.373s 0m0.044s
3000 79.36% 0m2.142s 0m2.340s 0m0.023s
4500 54.05% 0m1.224s 0m1.360s 0m0.027s
6000 34.64% 0m0.739s 0m0.840s 0m0.008s
7500 20.37% 0m0.479s 0m0.536s 0m0.014s
9000 9.71% 0m0.316s 0m0.321s 0m0.028s
10500 3.63% 0m0.236s 0m0.237s 0m0.019s
12000 2.05% 0m0.216s 0m0.213s 0m0.016s
```
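
The `sed -n '15p'` extraction above depends on the exact line layout of each log. As a possible alternative, the sketch below pulls the same four fields by matching on their labels. `parse_runtime` is a hypothetical helper, and the mock log contains only the lines the parser needs, filled with the values from the 1500 row above. A real log also holds the rest of the bowtie2 summary.

```shell
# Extract alignment rate and timings from one bowtie2 + `time` log by pattern.
parse_runtime() {
  awk '
    /overall alignment rate/ { rate = $1 }
    $1 == "real"             { real = $2 }
    $1 == "user"             { user = $2 }
    $1 == "sys"              { sys  = $2 }
    END { print rate, real, user, sys }
  ' "$1"
}

# Mock log (assumption: reduced to the relevant lines only).
mock_log=$(mktemp)
printf '%s\n' '98.03% overall alignment rate' '' > "$mock_log"
printf 'real\t0m3.148s\nuser\t0m3.373s\nsys\t0m0.044s\n' >> "$mock_log"

echo 1500 $(parse_runtime "$mock_log")
```

In the loop above, the four `sed`/`cut` substitutions would collapse to `echo $mu $(parse_runtime "$mu".runtime)`.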

In [None]:
aws s3 sync bam s3://serratus-public/notebook/200411/bam/
aws s3 sync runtimes s3://serratus-public/notebook/200411/runtimes/


## Results & Discussion

![Divergence vs. Alignment Rate](200411/div_v_align_plot1.png)

The closest known species related to SARS-CoV-2 is 96% similar (0.04 divergence). Alignment was 98.03% at 0.05 divergence, so we would pick up ~98% or more of reads from that species.

Species-level differences are around 90% similarity (0.1 divergence), which would give us ~80% sensitivity to pick up reads. Even at 70% similarity (0.3 divergence) we'd pick up ~10% of reads.

In the positive-control library with active infection we had 2M reads map to cov1r, which means that if that virus were 70% similar we could still expect roughly 200K reads (~10% of 2M) to map. The FP rate off the transcriptome was 10-100 reads (0.01%). Looks promising so far.
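
The read-count estimate can be reproduced from the alignment-rate table. The sketch below is rough arithmetic: it converts each msbar count to a divergence using the rounded 30 kb genome length, then scales the 2M control reads by the simulated alignment rate. It assumes the simulated rate carries over unchanged to a real library, which is an assumption rather than a measured result. `summarize` is a hypothetical helper name.

```shell
# "<msbar count> <overall alignment %>" copied from the results table above.
RESULTS='0 100.00
30 100.00
300 99.99
1500 98.03
3000 79.36
4500 54.05
6000 34.64
7500 20.37
9000 9.71
10500 3.63
12000 2.05'

GENOME_LEN=30000        # rounded genome length used for the MU array
CONTROL_READS=2000000   # reads mapped to cov1r in the positive-control library

# Print divergence, alignment %, and expected mapped reads for each input row.
summarize() {
  awk -v n="$GENOME_LEN" -v reads="$CONTROL_READS" '
    { printf "%.3f\t%.2f\t%d\n", $1 / n, $2, reads * $2 / 100 + 0.5 }'
}

printf 'divergence\talign_pct\texpected_reads\n'
printf '%s\n' "$RESULTS" | summarize
```

At 0.3 divergence this gives 194200 expected reads, and at 0.1 divergence 1587200.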