# cov1r: pan-coronavirus genome
```
Lead     : ababaian
Issue    : n/a
start    : 2020 04 08
complete : 2020 04 08
files    : s3://serratus-public/seq/cov1r/
```

## Introduction
Initial quality control checks on `cov0` (see: 200407_cov0_test_align) identified three accessions that are not CoV sequences and should be removed: `KC786228.1`, `AX191447.1` and `AX191449.1`.

In addition, all poly-A tracts greater than 15 nucleotides will be masked with "N".

### Objectives
- Create a refined `cov1` pan-genome
- Create control reverse sequences and bt2-index for `cov1r`

### Addendum
- On `200411`, added an `rmdup` (duplicate removal) step to `cov1r`.


## Materials and Methods


In [None]:
# EC2 Instance Commands:
# Build/Run `serratus-align` container for indexing
sudo yum install -y docker
sudo yum install -y git
sudo service docker start

git clone https://github.com/ababaian/serratus.git; cd serratus
sudo docker build -t serratus-base:0 -t serratus-base:latest -f docker/Dockerfile .
sudo docker build -t serratus-align:0 -t serratus-align:latest -f docker/serratus-align/Dockerfile .

sudo docker run --rm --entrypoint /bin/bash -it serratus-align:0


In [None]:
# local bedtools install
wget https://github.com/arq5x/bedtools2/releases/download/v2.29.2/bedtools.static.binary
mv bedtools.static.binary bedtools
chmod 755 bedtools

In [None]:
# bash inside `serratus-align`
mkdir cov1; cd cov1
aws s3 cp s3://serratus-public/seq/cov0/cov0.fa ./

# Create a header file to store original cov0 headers
grep "^>" cov0.fa > cov.full.headers
gzip cov.full.headers

# Remove Accessions on blacklist
# `KC786228.1`, `AX191447.1` and `AX191449.1`.
seqkit grep cov0.fa -r -v \
  -p KC786228 -p AX191447 -p AX191449 \
  > cov0.del.fa
  
# Remove duplicate sequences
seqkit rmdup -s -i -D cov0.dup cov0.del.fa > cov0.rmdup.del.fa

# Create polyA Mask (10-A seed) on the de-duplicated sequences
seqkit locate --bed -i -m 0 -p 'AAAAAAAAAA' cov0.rmdup.del.fa > polyA.10.bed
./bedtools sort -chrThenSizeA -i polyA.10.bed > polyA.sort.bed
./bedtools merge -s -i polyA.sort.bed > polyA.mask.bed

# Soft-masked pan-genome
./bedtools maskfasta -fi cov0.rmdup.del.fa -bed polyA.mask.bed -fo cov1.fa -soft

# Hard-masked pan-genome
./bedtools maskfasta -fi cov0.rmdup.del.fa -bed polyA.mask.bed -fo cov1.pA.masked.fa -mc N

# Clean-up cov0 and intermediates
rm polyA.sort.bed polyA.10.bed
rm cov0.fa cov0.del.fa cov0.rmdup.del.fa
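
To illustrate what the hard mask does to a sequence, here is a minimal sketch on a toy record. It uses plain `awk` instead of the `seqkit`/`bedtools` commands above, so it is an illustration and not the method used to build `cov1`. The helper name `mask_polyA` and the toy record are hypothetical, and the sketch assumes one sequence line per record and only looks at forward-strand `A` runs.

```shell
# Toy illustration only, not the serratus pipeline: hard-mask every run of
# at least min_len consecutive A's with N. mask_polyA is a hypothetical
# helper and assumes one sequence line per record.
mask_polyA() {
  awk -v n="${1:-10}" '
    function mask(run,    k, out) {
      if (length(run) < n) return run      # short runs pass through unchanged
      out = ""
      for (k = 1; k <= length(run); k++) out = out "N"
      return out
    }
    /^>/ { print; next }                   # headers are not touched
    {
      res = ""; run = ""
      for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        if (c == "A" || c == "a") { run = run c; continue }
        res = res mask(run) c              # a non-A base ends the current run
        run = ""
      }
      print res mask(run)                  # flush a run at the end of the line
    }'
}

# The long A run is replaced by N, the short 4-nt run is kept as-is.
printf '>toy1\nGCAAAAAAAAAAAATTAAAAGC\n' | mask_polyA 10
```

The minimum run length is an argument, so the same sketch can be used to compare a 10-nt and a 15-nt threshold on a test sequence.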

In [None]:
# Create reverse (not reverse-complement) control sequences
seqkit seq -r cov1.pA.masked.fa |\
  sed 's/^>/>REVERSE_/' > rev.tmp
cat cov1.pA.masked.fa rev.tmp > cov1r.fa

rm rev.tmp

# gzip all intermediate (non-final) files
gzip cov1.fa
gzip cov1.pA.masked.fa
gzip polyA.mask.bed

# Final clean-up
rm bedtools
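
As a sketch of what the reverse-control step produces, the following toy example reverses each sequence (without complementing it) and prefixes the header with `REVERSE_`. It is written in plain `awk` for illustration; the helper name `make_reverse_controls` and the toy record are hypothetical, and single-line sequences are assumed.

```shell
# Toy illustration: reverse each sequence line (no complement) and tag the
# header with REVERSE_. make_reverse_controls is a hypothetical helper.
make_reverse_controls() {
  awk '
    /^>/ { print ">REVERSE_" substr($0, 2); next }   # rename the record
    {
      r = ""
      for (i = length($0); i >= 1; i--) r = r substr($0, i, 1)
      print r                                        # sequence read back to front
    }'
}

printf '>toy1\nATTAAAGG\n' | make_reverse_controls
# >REVERSE_toy1
# GGAAATTA
```

A reversed sequence keeps the length and base composition of the original record, which is presumably why it is a convenient control for background alignment.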

In [None]:
# Build bowtie2 + faidx index for cov1r.fa
bowtie2-build --threads 2 --seed 666 cov1r.fa cov1r
samtools faidx cov1r.fa

```
bash-4.2# ls -alh
total 678M
drwxr-xr-x 2 root     root     4.0K Apr 13 20:12 .
drwx------ 1 serratus serratus 4.0K Apr 13 19:26 ..
-rw-r--r-- 1 root     root     322K Apr 13 19:25 cov.full.headers.gz
-rw-r--r-- 1 root     root     138K Apr 13 19:25 cov0.dup
-rw-r--r-- 1 root     root      22M Apr 13 19:25 cov1.fa.gz
-rw-r--r-- 1 root     root      22M Apr 13 19:25 cov1.pA.masked.fa.gz
-rw-r--r-- 1 root     root      91M Apr 13 19:44 cov1r.1.bt2
-rw-r--r-- 1 root     root      64M Apr 13 19:44 cov1r.2.bt2
-rw-r--r-- 1 root     root     706K Apr 13 19:26 cov1r.3.bt2
-rw-r--r-- 1 root     root      64M Apr 13 19:26 cov1r.4.bt2
-rw-r--r-- 1 root     root     260M Apr 13 19:26 cov1r.fa
-rw-r--r-- 1 root     root     2.3M Apr 13 20:11 cov1r.fa.fai
-rw-r--r-- 1 root     root      91M Apr 13 20:03 cov1r.rev.1.bt2
-rw-r--r-- 1 root     root      64M Apr 13 20:03 cov1r.rev.2.bt2
-rw-r--r-- 1 root     root     8.6K Apr 13 19:25 polyA.mask.bed.gz

```
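
One possible sanity check on the result, sketched under assumptions: since `cov1r.fa` is the forward set concatenated with its reversed copy, the `.fai` index should list as many `REVERSE_` records as forward records. The helper name `count_fai_entries` and the toy index below are hypothetical; on the real data the argument would be `cov1r.fa.fai`.

```shell
# count_fai_entries is a hypothetical helper: print "<forward> <reverse>"
# sequence counts from a samtools .fai index (sequence name is column 1).
count_fai_entries() {
  awk -F'\t' '$1 ~ /^REVERSE_/ { rev++; next } { fwd++ }
              END { print fwd + 0, rev + 0 }' "$1"
}

# Toy index with one forward record and its reversed control.
printf 'toy1\t8\t6\t8\t9\nREVERSE_toy1\t8\t29\t8\t9\n' > toy.fa.fai
count_fai_entries toy.fa.fai
# 1 1
```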

In [None]:
# Upload to s3 public access area
aws s3 sync ./ s3://serratus-public/seq/cov1r/

## Results & Discussion

The `cov1r` pan-genome and its `bowtie2` index are prepared. Removing the non-CoV accessions and masking poly-A tracts should improve specificity by reducing spurious alignments of reads from non-viral sources.

Note that `seqkit` destroys the original headers, so each 'chromosome' (input sequence) is now referred to by its accession ID only. As a remedy, the full original headers are available in the `cov.full.headers.gz` text file.
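
A minimal sketch of recovering the full header for a given accession from the gzipped header file. The helper name `lookup_header` and the toy headers are hypothetical, and the sketch assumes headers of the form `>ACCESSION description`, with the accession as the first whitespace-delimited token.

```shell
# lookup_header is a hypothetical helper: print the original header line(s)
# whose accession (first token, without the leading ">") matches exactly.
lookup_header() {
  gzip -dc "$2" | awk -v acc="$1" 'substr($1, 2) == acc'
}

# Toy header file standing in for cov.full.headers.gz.
printf '>toyA.1 Example virus strain X, complete genome\n>toyB.1 Example virus strain Y, partial cds\n' \
  | gzip -c > toy.headers.gz

lookup_header toyB.1 toy.headers.gz
```

An exact match on the first token avoids the partial hits that a plain `grep` on an accession prefix could return.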

#### Fetch cov1r sequences

`aws s3 cp --recursive s3://serratus-public/seq/cov1r/ ./`

