## cov2r: pan-coronavirus genome
```
Lead     : ababaian
Issue    : #41
start    : 2020 41 08
complete : 2020 04 08
files    : s3://serratus-public/seq/cov2r/
```

## Introduction
An initial run of ~9000 vertebrate libraries against `cov1r` yielded a series of recurrent false positives matching plasmid DNA. In addition, QC work has identified several accessions which will be pruned.

Another major refinement in this version of the pan-genome is to prune highly homologous sequences (99% identity) using `usearch`. The aligner can still align roughly 99% of reads at 99% sequence identity, so this step reduces the alignment search space.

In addition, all poly-NT tracts of 10 or more nucleotides will be masked to "N", as these are low-information sequences prone to spurious false positives.

### Objectives
- Creation of a refined `cov2` pan-genome
- Create reverse control sequences `r`
- Generate bt2-index for `cov2r`



## Materials and Methods


In [None]:
# EC2 Instance Commands:
# Build/Run `serratus-align` container for indexing
sudo yum install -y docker
sudo yum install -y git
sudo service docker start

git clone https://github.com/ababaian/serratus.git; cd serratus
sudo docker build -t serratus-base:0 -t serratus-base:latest -f docker/Dockerfile .
sudo docker build -t serratus-align:0 -t serratus-align:latest -f docker/serratus-align/Dockerfile .

sudo docker run --rm --entrypoint /bin/bash -it serratus-align:0


In [None]:
# local bedtools install
wget https://github.com/arq5x/bedtools2/releases/download/v2.29.2/bedtools.static.binary
mv bedtools.static.binary bedtools
chmod 755 bedtools; mv bedtools /usr/bin/

In [None]:
# Local usearch install
# The clustered database was made with usearch:
wget https://drive5.com/downloads/usearch11.0.667_i86linux32.gz
gzip -dc usearch11.0.667_i86linux32.gz > usearch
chmod 755 usearch; mv usearch /usr/bin/usearch


In [None]:
# bash inside `serratus-align`
mkdir cov2; cd cov2

# Start from cov0 sequence, all NCBI entries for CoV
aws s3 cp s3://serratus-public/seq/cov0/cov0.fa ./

# Create a header file to store original cov0 headers
grep "^>" cov0.fa > cov.full.headers
gzip cov.full.headers


In [None]:
# Remove duplicate and short sequences
# 8429 duplicates removed
seqkit rmdup -s -i -D cov0.duplicates cov0.fa > cov0.rmdup.fa

# Remove Accessions shorter than 200 nt
seqkit seq -m 200 cov0.rmdup.fa > cov0.gt200.rmdup.fa

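The length filter above can be cross-checked without `seqkit`. The following is a minimal sketch, assuming plain FASTA input; `count_min_length` and the toy file are hypothetical names introduced here for illustration, not part of the original workflow.

```bash
# Hypothetical helper: count FASTA records with at least $2 nucleotides.
# Sequences wrapped over several lines are summed per record.
count_min_length() {
  awk -v min="$2" '
    function tally() { if (seen && len >= min) kept++ }
    /^>/ { tally(); seen = 1; len = 0; next }
    { len += length($0) }
    END { tally(); print kept + 0 }
  ' "$1"
}

# Toy input: one 4-nt record and one 8-nt record split across two lines
toy=$(mktemp)
printf '>short\nACGT\n>long\nACGT\nACGT\n' > "$toy"
count_min_length "$toy" 5   # only the 8-nt record passes
```

Run as `count_min_length cov0.rmdup.fa 200`, the result should agree with the record count of `cov0.gt200.rmdup.fa`, provided both tools measure sequence length the same way.
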
In [None]:
# Prune 99% identity sequences
# Sort
usearch -sortbylength cov0.gt200.rmdup.fa \
   -minseqlength 200 \
   -fastaout cov0.sort.gt200.rmdup.fa

# Prune
usearch -cluster_smallmem cov0.sort.gt200.rmdup.fa \
   -id 0.99 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc cov0.id99.uc \
   -centroids cov0.id99.fa

```
02:46:42 210Mb   100.0% 7850 clusters, max size 402, avg 3.0
02:46:42 210Mb   100.0% Writing centroids to cov0.id99.fa   
                                                         
      Seqs  23759 (23.8k)
  Clusters  7850
  Max size  402
  Avg size  3.0
  Min size  1
Singletons  5325, 22.4% of seqs, 67.8% of clusters
   Max mem  683Mb
      Time  02:46:42
Throughput  2.4 seqs/sec.
```
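
The cluster summary above can also be recovered from the `.uc` file itself. This is a minimal sketch, assuming the tab-separated UC layout in which `C` records carry the cluster size in column 3 and the centroid label in column 9; `top_clusters`, `singleton_count` and the toy records are hypothetical and only illustrate the idea.

```bash
# Hypothetical helper: print the N largest clusters as "size<TAB>centroid".
top_clusters() {
  awk -F'\t' '$1 == "C" { print $3 "\t" $9 }' "$1" | sort -k1,1nr | head -n "${2:-5}"
}

# Hypothetical helper: count clusters that contain a single sequence.
singleton_count() {
  awk -F'\t' '$1 == "C" && $3 == 1 { n++ } END { print n + 0 }' "$1"
}

# Toy .uc file: cluster 0 holds seqA + seqB, cluster 1 holds seqC only
uc=$(mktemp)
{
  printf 'S\t0\t300\t*\t.\t*\t*\t*\tseqA\t*\n'
  printf 'H\t0\t290\t99.3\t+\t0\t0\t290M\tseqB\tseqA\n'
  printf 'S\t1\t250\t*\t.\t*\t*\t*\tseqC\t*\n'
  printf 'C\t0\t2\t*\t*\t*\t*\t*\tseqA\t*\n'
  printf 'C\t1\t1\t*\t*\t*\t*\t*\tseqC\t*\n'
} > "$uc"

top_clusters "$uc" 1
singleton_count "$uc"
```

Applied to `cov0.id99.uc`, the largest cluster size and the singleton count should be comparable with the `Max size` and `Singletons` lines in the log above.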

In [None]:
# Remove Accessions on blacklist
# Strips headers to accessions only
seqkit grep cov0.id99.fa -i -r -v \
  -p KC786228 -p AX191447 -p AX191449 \
  -p FB764528 -p HV449436 -p CS382036 \
  > cov0.id99.bl.fa

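To make the effect of the blacklist filter explicit, here is a minimal sketch of the same idea in `awk`: drop every record whose header matches one of the listed accessions, ignoring case. `drop_accessions` and the toy accession `XX000001.1` are hypothetical, and this is not how `seqkit grep` is implemented.

```bash
# Hypothetical helper: drop_accessions in.fa ACC [ACC...]
# Removes records whose header contains any of the given accessions.
drop_accessions() {
  fa=$1; shift
  pattern=$(printf '%s|' "$@"); pattern=${pattern%|}   # ACC1|ACC2|...
  awk -v pat="$pattern" '
    /^>/ { keep = (toupper($0) !~ toupper(pat)) }      # decide once per record
    keep                                               # print header and sequence lines
  ' "$fa"
}

# Toy input: one blacklisted accession and one hypothetical accession
toyfa=$(mktemp)
printf '>KC786228.1 toy\nACGT\n>XX000001.1 toy\nGGCC\n' > "$toyfa"
drop_accessions "$toyfa" kc786228 AX191447
```

Only the `XX000001.1` record is printed for the toy input.
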
In [None]:
# Create polyNT masks (10-X seed)
seqkit locate --bed -i -m 0 -p 'AAAAAAAAAA' cov0.id99.bl.fa > poly10.bed
  bedtools sort -chrThenSizeA -i poly10.bed > poly10.sort.bed
  bedtools merge -s -i poly10.sort.bed > polyAT.mask.bed

seqkit locate --bed -i -m 0 -p 'GGGGGGGGGG' cov0.id99.bl.fa > poly10.bed
  bedtools sort -chrThenSizeA -i poly10.bed > poly10.sort.bed
  bedtools merge -s -i poly10.sort.bed > polyGC.mask.bed

cat polyAT.mask.bed polyGC.mask.bed > \
  polyNT.bed

rm polyAT.mask.bed polyGC.mask.bed poly10.bed poly10.sort.bed

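The seed-and-merge step above amounts to finding every homopolymer run of at least 10 nt. The following is a minimal sketch of that idea in plain `awk`, emitting 0-based, half-open BED intervals; `homopolymer_bed` and the toy sequences are hypothetical, and the output is not guaranteed to be identical to the `seqkit locate` plus `bedtools merge` result.

```bash
# Hypothetical helper: homopolymer_bed in.fa MINLEN
# Reports runs of A, C, G or T of at least MINLEN nt as BED (name, start, end).
homopolymer_bed() {
  awk -v min="$2" '
    function report(   n, i, start) {
      n = length(seq); start = 1
      for (i = 2; i <= n + 1; i++) {
        # a run ends at the sequence end or when the base changes
        if (i > n || substr(seq, i, 1) != substr(seq, start, 1)) {
          if (i - start >= min && index("ACGT", substr(seq, start, 1)) > 0)
            printf "%s\t%d\t%d\n", name, start - 1, i - 1
          start = i
        }
      }
    }
    /^>/ { if (name != "") report(); name = substr($1, 2); seq = ""; next }
    { seq = seq toupper($0) }          # join wrapped lines, ignore case
    END { if (name != "") report() }
  ' "$1"
}

# Toy input: a 12-nt poly-A tract in seq1 and an 11-nt poly-G tract in seq2
polyfa=$(mktemp)
printf '>seq1\nACGT%sCGT\n>seq2\nTT%sCA\n' AAAAAAAAAAAA GGGGGGGGGGG > "$polyfa"
homopolymer_bed "$polyfa" 10
```

For the toy input this reports `seq1 4 16` and `seq2 2 13`, tab-separated.
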
In [None]:
# Manually set blacklisted regions
echo -e "JB181528.1\t3111\t3307" >> blacklist.bed
echo -e "CS460762.1\t37177\t37211" >> blacklist.bed
echo -e "CS460762.1\t30166\t30243" >> blacklist.bed
echo -e "CS480537.1\t37170\t37220" >> blacklist.bed
echo -e "CS480537.1\t30166\t30241" >> blacklist.bed
echo -e "MK562374.1\t474\t542" >> blacklist.bed
echo -e "DL231478.1\t43\t2296" >> blacklist.bed

cat polyNT.bed blacklist.bed > mask.regions.bed

rm polyNT.bed blacklist.bed

In [None]:
# cov2 pan-genome

# Soft-masked pan-genome
bedtools maskfasta -fi cov0.id99.bl.fa \
  -bed mask.regions.bed -fo cov2.fa -soft
 
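# Hard-masked pan-genome
# NOTE: the input here is cov0.id99.fa, i.e. before blacklist removal, so
# cov2.masked.fa (and cov2r.fa built from it) retains the 4 sequences
# dropped by the blacklist step (7850 vs 7846 entries in the counts below).
bedtools maskfasta -fi cov0.id99.fa \
  -bed mask.regions.bed -fo cov2.masked.fa -mc N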

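The difference between the two masking modes can be illustrated on a toy sequence. This is a minimal sketch, assuming one sequence line per FASTA record and a non-empty BED file; `mask_fasta` is a hypothetical helper that mimics the effect of `-soft` (lower-case) and `-mc N` (replace with `N`), not the `bedtools` implementation.

```bash
# Hypothetical helper: mask_fasta in.fa regions.bed [soft|hard]
mask_fasta() {
  awk -v mode="${3:-hard}" '
    # First file (BED): remember 0-based, half-open intervals per sequence
    NR == FNR { n[$1]++; s[$1, n[$1]] = $2; e[$1, n[$1]] = $3; next }
    /^>/ { name = substr($1, 2); print; next }
    {
      seq = $0
      for (k = 1; k <= n[name]; k++) {
        mid = substr(seq, s[name, k] + 1, e[name, k] - s[name, k])
        if (mode == "soft") mid = tolower(mid); else gsub(/./, "N", mid)
        seq = substr(seq, 1, s[name, k]) mid substr(seq, e[name, k] + 1)
      }
      print seq
    }' "$2" "$1"
}

# Toy input: mask positions 2..6 (0-based, end-exclusive) of s1
maskfa=$(mktemp); maskbed=$(mktemp)
printf '>s1\nACGTACGTAC\n' > "$maskfa"
printf 's1\t2\t6\n' > "$maskbed"
mask_fasta "$maskfa" "$maskbed" soft
mask_fasta "$maskfa" "$maskbed" hard
```

The soft-masked toy sequence reads `ACgtacGTAC` and the hard-masked one `ACNNNNGTAC`; sequence length is unchanged in both cases, so alignment coordinates remain comparable.
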

In [None]:
# Create reverse control sequences
seqkit seq -r cov2.masked.fa |\
  sed 's/>/>REVERSE_/g' - > rev.tmp
  cat cov2.masked.fa rev.tmp > cov2r.fa

rm rev.tmp

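Note that `seqkit seq -r` reverses each sequence without complementing it, so the control has the same base composition as the original but should not match real reads on either strand. A minimal sketch of the same transformation, assuming one sequence line per record; `reverse_controls` is a hypothetical helper name.

```bash
# Hypothetical helper: reverse (not complement) each sequence line
# and prefix the header with REVERSE_.
reverse_controls() {
  awk '
    /^>/ { print ">REVERSE_" substr($0, 2); next }
    {
      out = ""
      for (i = length($0); i >= 1; i--) out = out substr($0, i, 1)
      print out
    }
  ' "$1"
}

# Toy input: the masked N stays in the sequence, only the order changes
revfa=$(mktemp)
printf '>s1 toy\nACGTN\n' > "$revfa"
reverse_controls "$revfa"
```

For the toy record this prints the header `>REVERSE_s1 toy` followed by `NTGCA`.
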
In [None]:
# Count the number of accessions in each step of the processing
for FA in *.fa
do
  samtools faidx "$FA"
  wc -l "$FA.fai"
done

rm *.fai

```
33296 cov0.fa.fai
24865 cov0.rmdup.fa.fai
23759 cov0.gt200.rmdup.fa.fai
23759 cov0.sort.gt200.rmdup.fa.fai
7846 cov0.id99.bl.fa.fai
7850 cov0.id99.fa.fai
7846 cov2.fa.fai
7850 cov2.masked.fa.fai
15700 cov2r.fa.fai
```

In [None]:
# Remove intermediates
rm cov0.fa cov0.rmdup.fa \
  cov0.gt200.rmdup.fa cov0.sort.gt200.rmdup.fa \
  cov0.id99.bl.fa

# gzip usearch output for re-use
gzip cov0.id99.fa cov0.duplicates

# gzip non final fasta files
gzip cov2.fa cov2.masked.fa
# The intermediate poly-NT mask files were already removed above;
# mask.regions.bed is kept uncompressed.

In [None]:
# Build bowtie2 + faidx index for cov2r.fa
bowtie2-build --threads 4 --seed 666 cov2r.fa cov2r
samtools faidx cov2r.fa

```
bash-4.2# ls -alh
total 180M
drwxr-xr-x 2 root     root     4.0K Apr 21 21:43 .
drwx------ 1 serratus serratus   18 Apr 21 17:07 ..
-rw-r--r-- 1 root     root     322K Apr 21 17:07 cov.full.headers.gz
-rw-r--r-- 1 root     root      39K Apr 21 17:09 cov0.duplicates.gz
-rw-r--r-- 1 root     root     8.1M Apr 21 19:57 cov0.id99.fa.gz
-rw-r--r-- 1 root     root     5.4M Apr 21 19:57 cov0.id99.uc
-rw-r--r-- 1 root     root     8.0M Apr 21 21:37 cov2.fa.gz
-rw-r--r-- 1 root     root     7.9M Apr 21 21:37 cov2.masked.fa.gz
-rw-r--r-- 1 root     root      24M Apr 21 21:42 cov2r.1.bt2
-rw-r--r-- 1 root     root      15M Apr 21 21:42 cov2r.2.bt2
-rw-r--r-- 1 root     root     182K Apr 21 21:42 cov2r.3.bt2
-rw-r--r-- 1 root     root      15M Apr 21 21:42 cov2r.4.bt2
-rw-r--r-- 1 root     root      59M Apr 21 21:37 cov2r.fa
-rw-r--r-- 1 root     root     529K Apr 21 21:43 cov2r.fa.fai
-rw-r--r-- 1 root     root      24M Apr 21 21:43 cov2r.rev.1.bt2
-rw-r--r-- 1 root     root      15M Apr 21 21:43 cov2r.rev.2.bt2
-rw-r--r-- 1 root     root      11K Apr 21 21:36 mask.regions.bed
```

In [None]:
# Upload to s3 public access area
aws s3 sync ./ s3://serratus-public/seq/cov2r/

## Results & Discussion

The `cov2r` pan-genome and its respective `bowtie2` index are prepared.

Note that seqkit destroys the original headers, so each 'chromosome' or input sequence is now referred to by its accession ID only. As a remedy, the `cov.full.headers.gz` text file is available.

#### Downloading cov2 sequences

`aws s3 cp --recursive s3://serratus-public/seq/cov2r/ ./`

