## Genome: cov3ma reference -- Antibiotic Resistance Data (AMR)
```
Lead     : ababaian / RCE / JJ
Issue    : 135
start    : 2020 05 29
complete : 2020 05 29
files    : s3://serratus-public/seq/cov3ma/
```

# Introduction

`JJ:`
>
> Rationale: environmental antimicrobial resistance (AMR) genes are currently poorly characterized because the potential bacterial pool is too large to sample. If we find homologs of human AMR genes in the wild (aka in environmental bacterial species) using existing SRA data, that would give some insight into where to begin characterizing environmental AMR, and might allow lateral transfer of these AMR genes from the environment into the clinic to be delineated.

> Attached here is a collection of ~3000 AMR genes mostly found in human pathogens. card3.09_nucleotide_homolog.modified.fasta.zip

> How the fasta file was generated:
>
>    The fasta was built from an existing AMR gene database called "Comprehensive Antibiotic Resistance Database" (CARD; v3.0.9) available: [CARD Database](https://card.mcmaster.ca/download/0/broadstreet-v3.0.9.tar.bz2)
>
>    The original FASTA file is "nucleotide_fasta_protein_homolog_model.fasta"
>
>    The headers are modified using a C# script attached below to follow format ">Accession ID,Gene_Name,Bacteria_Species"

>[script.zip](https://github.com/ababaian/serratus/files/4703917/script.zip)
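
The header convention described above can be unpacked with standard tools. This is a minimal sketch, assuming the three comma-separated fields appear in the stated order; the record is made up for illustration and is not taken from CARD.

```shell
# Hypothetical header in the ">Accession ID,Gene_Name,Bacteria_Species" layout
header='>XX000001.1,exampleGene-1,Example bacterium'

# Drop the leading ">" and split on commas
accession=$(printf '%s\n' "$header" | sed 's/^>//' | cut -d',' -f1)
gene=$(printf '%s\n' "$header" | cut -d',' -f2)
species=$(printf '%s\n' "$header" | cut -d',' -f3)

printf 'accession=%s\ngene=%s\nspecies=%s\n' "$accession" "$gene" "$species"
```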


In [None]:
# EC2 Instance Commands:
# Build/Run `serratus-align` container for indexing
sudo yum install -y docker
sudo yum install -y git
sudo service docker start

#export DOCKERHUB_USER='serratusbio'
#sudo docker login
#git clone https://github.com/ababaian/serratus.git; cd serratus/containers

sudo docker run --rm --entrypoint /bin/bash -it serratusbio/serratus-align:latest

In [None]:
# Dev tools
yum install -y wget tar gzip less vim unzip

In [None]:
# Pre-compiled binary
BOWTIEVERSION='2.4.1'
wget --quiet https://downloads.sourceforge.net/project/bowtie-bio/bowtie2/"$BOWTIEVERSION"/bowtie2-"$BOWTIEVERSION"-linux-x86_64.zip &&\
  unzip bowtie2-"$BOWTIEVERSION"-linux-x86_64.zip &&\
  rm    bowtie2-"$BOWTIEVERSION"-linux-x86_64.zip

mv bowtie2-"$BOWTIEVERSION"*/bowtie2* /usr/local/bin/ &&\
  rm -rf bowtie2-"$BOWTIEVERSION"*

In [None]:
# Python3
yum install -y python3 python3-devel
curl -O https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py
rm get-pip.py
pip3 install biopython

In [None]:
# SeqKit Install
wget https://github.com/shenwei356/seqkit/releases/download/v0.12.0/seqkit_linux_amd64.tar.gz &&\
  tar -xvf seqkit* && mv seqkit /usr/local/bin/ &&\
  rm seqkit_linux*

In [None]:
# local bedtools install
wget https://github.com/arq5x/bedtools2/releases/download/v2.29.2/bedtools.static.binary
mv bedtools.static.binary bedtools
chmod 755 bedtools; mv bedtools /usr/bin/

In [None]:
# Local usearch install
#The clustered database was made with usearch:
wget https://drive5.com/downloads/usearch11.0.667_i86linux32.gz
gzip -dc usearch11.0.667_i86linux32.gz > usearch
chmod 755 usearch; mv usearch /usr/bin/usearch


In [None]:
# Local Dustmasker install
cd /home/serratus/
# Pinned release directory; LATEST/ only carries the newest BLAST+ version
wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.10.0/ncbi-blast-2.10.0+-x64-linux.tar.gz
tar -xvf ncbi-blast-2.10.0+-x64-linux.tar.gz
cp ncbi-blast-2.10.0+/bin/* /usr/bin/


In [None]:
## Install EDirect
## Dependency Hell
#yum install -y cpanminus expat-devel
#sudo cpanm --force IO::Socket::SSL
#sudo cpanm --force LWP::Protocol::https
#sudo cpanm --force JSON::PP
#sudo cpanm --force HTML::Entities
#sudo cpanm --force XML::Simple
#
#perl -MNet::FTP -e \
#  '$$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1);
#  $$ftp->login; $$ftp->binary;
#  $$ftp->get("/entrez/entrezdirect/edirect.tar.gz");'
#  
#tar -xvf edirect.tar.gz; rm edirect.tar.gz
#export PATH=${PATH}:/home/serratus/edirect
#yes y | ./edirect/setup.sh

## AMR Build Script

In [None]:
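# AMR sequences: raw CARD release, Justin's reformatted FASTA, and RCE's processed set
mkdir -p /home/serratus/amr0; cd /home/serratus/amr0

# Raw sequences from McMaster CARD
# (no trailing "./": wget would treat it as a second URL and report an error)
wget https://card.mcmaster.ca/download/0/broadstreet-v3.0.9.tar.bz2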

# Sequences processed by Justin
wget https://github.com/ababaian/serratus/files/4699397/card3.09_nucleotide_homolog.modified.fasta.zip
unzip card3.09_nucleotide_homolog.modified.fasta.zip; rm *zip
mv card3.09_nucleotide_homolog.modified.fasta amr0.fa 

# AMR sequences processed from Justin and then RCE
wget https://drive5.com/tmp/amr95.fa.gz
gzip -d amr95.fa.gz
mv amr95.fa amr0.95.fa 

md5sum *

```
5636fbcc8a1535e9b00a0010750a4260  amr0.95.fa
28403ee52454f135c7559b48cfa2d63b  amr0.fa
a3ae777254a747189ebb0fcad024af56  broadstreet-v3.0.9.tar.bz2
```

In [None]:
# Create a header file to store original amr0 file
NAME='amr0'
grep "^>" $NAME.fa > $NAME.headers
gzip $NAME.headers

# Create a header file to store original amr.95 file
NAME='amr0.95'
grep "^>" $NAME.fa > $NAME.headers
gzip $NAME.headers

# Strip headers to Accession_GenoStart
cut -f1 -d',' $NAME.fa | tr ' ' '_' > $NAME.tmp
mv $NAME.tmp $NAME.fa
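
To see what the header-stripping step does, here is a small sketch on a toy FASTA. The record is hypothetical; the pipeline is the same `cut | tr` used above. Sequence lines contain no comma, so `cut` passes them through unchanged.

```shell
# Toy FASTA with a made-up record (not real CARD data)
fa=$(mktemp)
printf '>XX000001.1 extra,exampleGene-1,Example bacterium\nACGTACGT\n' > "$fa"

# Keep the text before the first comma, then turn spaces into underscores
stripped=$(cut -f1 -d',' "$fa" | tr ' ' '_')
printf '%s\n' "$stripped"

rm -f "$fa"
```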


In [None]:
# Manual Blacklisting of bad entries (if they exist)
INPUTFA='amr0.95.fa'
NAME="amr0.95"

# Generate blacklist via Bed Format
#aws s3 cp s3://serratus-public/seq/cov0/cov0.fa.fai ./

# echo "AF209745" >> blacklist.tmp

# grep -f blacklist.tmp cov0.fa.fai \
#   | cut -f1,2 - \
#   | sed 's/\t/\t1\t/g' - > blacklist.bed

# seqkit grep $INPUTFA -i -r -v \
#   -p AF209745 \
#   > $NAME.bl.fa
#
# rm cov0.fa.fai *tmp

# Manually set blacklisted regions
# echo -e "CS460762.1\t30166\t30243" >> blacklist.bed

# First run -- no blacklist
touch blacklist.bed

In [None]:
# SimpleRepeat Mask Annotation
INPUTFA='amr0.95.fa'
NAME="amr0.95"

# Short Window Dust Masking ---------------------
# Soft mask low complexity regions via dustmasker
dustmasker -in $INPUTFA \
  -window 30 -outfmt interval \
  -out $NAME.dust30

# Convert interval dust file to bed file
while read -r line; do
  if [ $(echo $line | head -c 1) = ">" ]; then
    header=$( echo $line | sed 's/>//g' - )
  else
    start=$(echo $line | cut -f1 -d' ' -)
    end=$(echo $line | cut -f3 -d' ' - )
    echo -e "$header\t$start\t$end" >> dust30.bed
  fi
done < $NAME.dust30


# Long Window Dust Masking ---------------------
# Soft mask low complexity regions via dustmasker
dustmasker -in $INPUTFA \
  -window 64 -outfmt interval \
  -out $NAME.dust64

# Convert interval dust file to bed file
while read -r line; do
  if [ $(echo $line | head -c 1) = ">" ]; then
    header=$( echo $line | sed 's/>//g' - )
  else
    start=$(echo $line | cut -f1 -d' ' -)
    end=$(echo $line | cut -f3 -d' ' - )
    echo -e "$header\t$start\t$end" >> dust64.bed
  fi
done < $NAME.dust64

echo ''
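
The two `while read` loops above do the same conversion and append to their output with `>>`, so re-running the cell duplicates rows. A possible alternative is a single `awk` pass, sketched here on made-up intervals. The function name `interval_to_bed` is hypothetical, and like the loops above it copies the start and end values through without adjusting them.

```shell
# Convert dustmasker "interval" output into three tab-separated columns.
interval_to_bed() {
  awk '
    /^>/    { name = substr($0, 2); next }                  # remember current sequence name
    NF >= 3 { printf "%s\t%s\t%s\n", name, $1, $3 }         # lines look like "start - end"
  ' "$@"
}

# Toy input in the ">name" / "start - end" layout (coordinates are made up)
dust_demo=$(printf '>seqA\n12 - 40\n95 - 130\n>seqB\n0 - 18\n' | interval_to_bed)
printf '%s\n' "$dust_demo"
```

With real data this would be called as `interval_to_bed $NAME.dust30 > dust30.bed`, which overwrites rather than appends.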

In [None]:
# Poly-NT Mask Annotation
INPUTFA='amr0.95.fa'
NAME="amr0.95"

# Create polyNT masks (10-X seed)
# FLOM seems to be missing any NT tract >7nt
seqkit locate --bed -i -m 0 -p 'AAAAAAAAAA' $INPUTFA > poly10.bed
  bedtools sort -chrThenSizeA -i poly10.bed > poly10.sort.bed
  bedtools merge -s -i poly10.sort.bed > polyAT.mask.bed

seqkit locate --bed -i -m 0 -p 'GGGGGGGGGG' $INPUTFA > poly10.bed
  bedtools sort -chrThenSizeA -i poly10.bed > poly10.sort.bed
  bedtools merge -s -i poly10.sort.bed > polyGC.mask.bed

cat polyAT.mask.bed polyGC.mask.bed > \
  polyNT.bed

rm polyAT.mask.bed polyGC.mask.bed poly10.bed poly10.sort.bed
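
For readers without `seqkit`, the idea behind the poly-NT mask can be sketched in plain `awk`. This is a swapped-in technique, not the method used above: it reports a run of any single repeated base of at least a given length, assumes one sequence line per record, and the function name `poly_runs` is hypothetical.

```shell
# Report homopolymer runs of at least MIN bases as BED (0-based start, end-exclusive).
# Assumes unwrapped FASTA (one sequence line per record).
poly_runs() {
  awk -v min="$1" '
    /^>/ { name = substr($1, 2); next }
    {
      seq = toupper($0); n = length(seq); run = 1
      for (i = 2; i <= n + 1; i++) {
        # extend the current run while the base repeats
        if (i <= n && substr(seq, i, 1) == substr(seq, i - 1, 1)) { run++; continue }
        # run ended at position i-1 (1-based); emit it if long enough
        if (run >= min) printf "%s\t%d\t%d\n", name, i - 1 - run, i - 1
        run = 1
      }
    }'
}

# Toy records: toy1 carries a 10-base poly-A tract, toy2 has none
poly_demo=$(printf '>toy1\nCCAAAAAAAAAAGG\n>toy2\nACGTACGT\n' | poly_runs 10)
printf '%s\n' "$poly_demo"
```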

In [None]:
# Combine blacklist, nt mask and dustmask

# Merge the bed files
cat polyNT.bed blacklist.bed dust30.bed dust64.bed > mask.tmp

# Sort the cat bed file
sort -k1,1 -k2,2n mask.tmp > mask.sort.tmp
  
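# Clean up some bugs: strip stray spaces and clamp a start coordinate of -1 to 0
# (anchored on the tab-delimited field so "-1" inside names is left alone)
sed 's/ //g' mask.sort.tmp \
  | sed 's/\t-1\t/\t0\t/' - \
  > mask.sort.clean.tmp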

# Merge BED file
bedtools merge -i mask.sort.clean.tmp > mask.regions.tmp

# Repeat the clean-up on the merged regions (stray spaces, start of -1)
sed 's/ //g' mask.regions.tmp \
  | sed 's/\t-1\t/\t0\t/' - \
  > mask.regions.bed

rm *tmp

wc -l *.bed

rm polyNT.bed blacklist.bed dust30.bed dust64.bed

```
    0 blacklist.bed
  552 dust30.bed
  615 dust64.bed
  615 mask.regions.bed
    0 polyNT.bed
 1782 total
```
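
The sort, clean and merge steps above can be illustrated without `bedtools`. The following is a rough sketch of interval merging in `awk`, assuming three-column BED input; `merge_bed` is a hypothetical helper and the intervals are made up. Overlapping and directly adjacent intervals are joined, and negative starts are clamped to 0.

```shell
# Sort BED rows by name then start, clamp negative starts, merge overlaps.
merge_bed() {
  sort -k1,1 -k2,2n "$@" | awk 'BEGIN { OFS = "\t" }
    { if ($2 < 0) $2 = 0 }                                   # guard against -1 starts
    $1 == chr && $2 <= end { if ($3 > end) end = $3; next }  # extend current interval
    chr != "" { print chr, start, end }                      # flush finished interval
    { chr = $1; start = $2; end = $3 }                       # open a new interval
    END { if (chr != "") print chr, start, end }'
}

merge_demo=$(printf 'seqA\t5\t10\nseqA\t8\t20\nseqA\t-1\t3\nseqB\t0\t4\nseqA\t30\t40\n' | merge_bed)
printf '%s\n' "$merge_demo"
```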

In [None]:
# Hard and Soft Mask the Genome
INPUTFA='amr0.95.fa'
NAME="amr0.95"

# Had to manually remove line 8447 which started with "-1"
# There's a bug in there somewhere, likely a 1-base / 0-base

# amr0.95 reference
# Soft-masked reference (masked bases in lower case)
bedtools maskfasta -fi $INPUTFA \
  -bed mask.regions.bed -fo $NAME.softmasked.fa -soft
 
# Hard-masked reference (masked bases replaced with N)
bedtools maskfasta -fi $INPUTFA \
  -bed mask.regions.bed -fo $NAME.hardmasked.fa
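
The difference between the two outputs is easiest to see on a toy record. The sketch below imitates soft and hard masking in `awk`; it is an illustration only, not how `bedtools maskfasta` is implemented. It assumes single-line sequences and regions that lie inside the sequence, and `mask_fasta` is a hypothetical name.

```shell
# usage: mask_fasta soft|hard regions.bed in.fa
mask_fasta() {
  awk -v mode="$1" '
    # first file: collect BED regions per sequence name
    FILENAME == ARGV[1] { k = ++n[$1]; s[$1, k] = $2; e[$1, k] = $3; next }
    /^>/ { name = substr($1, 2); print; next }
    {
      seq = $0
      for (k = 1; k <= n[name]; k++) {
        start = s[name, k]; len = e[name, k] - start
        mid = substr(seq, start + 1, len)
        if (mode == "soft") {
          mid = tolower(mid)          # soft mask: lower case
        } else {
          gsub(/./, "N", mid)         # hard mask: replace with N
        }
        seq = substr(seq, 1, start) mid substr(seq, start + len + 1)
      }
      print seq
    }' "$2" "$3"
}

work=$(mktemp -d)
printf '>toy1\nACGTACGTAC\n' > "$work/toy.fa"
printf 'toy1\t2\t6\n' > "$work/toy.bed"
soft_demo=$(mask_fasta soft "$work/toy.bed" "$work/toy.fa")
hard_demo=$(mask_fasta hard "$work/toy.bed" "$work/toy.fa")
printf '%s\n%s\n' "$soft_demo" "$hard_demo"
rm -rf "$work"
```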

In [None]:
# Confirm masking worked as expected manually 
NAME="amr0.95"

diff $NAME.fa $NAME.softmasked.fa  | head -n20 -
diff $NAME.fa $NAME.hardmasked.fa  | head -n20 -

cp $NAME.fa $NAME.unmasked.fa
mv $NAME.hardmasked.fa $NAME.fa

# Count each fasta file
wc -l *.fa 

```
  15530 amr0.95.fa
  15530 amr0.95.softmasked.fa
  15530 amr0.95.unmasked.fa
   5264 amr0.fa
  51854 total
```

In [None]:
# Remove intermediates and non deployment files
# -f: some of these patterns may match nothing on a fresh run
rm -f *.dust* *.gb *.fai 

# Compress stuff we don't need immediately
gzip $NAME.softmasked.fa $NAME.unmasked.fa \
     mask.regions.bed

In [None]:
# Build faidx index for amr0.95.fa (bowtie2 index is built on the merged cov3ma reference)
#bowtie2-build --threads 4 --seed 1337 $NAME.fa $NAME
samtools faidx $NAME.fa

In [None]:
# Make Sumzer file for amr0.95.fa
# 1. accession
# 2. length
# 3. name
# 4. family
# 5. offset of fragment vs. full-length genome, or 0
# 6. Pan-genome length

# Offset : 0
# Family : AMR
# Pan    : 1000

# Acc and Length
cut -f1,2 amr0.95.fa.fai > acclen.tmp

# Description
gzip -dc amr0.95.headers.gz | cut -f2-5 -d',' \
  > desc.tmp

# Family Offset and Pangenome len
yes AMR~0~1000 \
  | head -n $(wc -l acclen.tmp | cut -f1 -d' ') \
  | tr '~' '\t' \
  > famoffpan.tmp

wc -l *tmp

paste acclen.tmp desc.tmp famoffpan.tmp > amr0.95.sumzer

rm *tmp
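
Because `paste` joins files row by row, the three intermediate files must have the same number of lines in the same order. A small sketch of the same assembly on toy data, with an explicit row-count check added; all accessions, lengths and names below are made up.

```shell
work=$(mktemp -d)

# Toy .fai (name, length, offset, linebases, linewidth) and matching header list
printf 'XX000001.1\t900\t12\t60\t61\nXX000002.1\t1200\t940\t60\t61\n' > "$work/toy.fa.fai"
printf '>XX000001.1,geneA,Example bacterium\n>XX000002.1,geneB,Example bacterium\n' > "$work/toy.headers"

cut -f1,2 "$work/toy.fa.fai"          > "$work/acclen.tmp"   # accession, length
cut -f2- -d',' "$work/toy.headers"    > "$work/desc.tmp"     # description

# Refuse to paste if the row counts differ
n=$(wc -l < "$work/acclen.tmp")
[ "$n" -eq "$(wc -l < "$work/desc.tmp")" ] || { echo "row count mismatch" >&2; exit 1; }

# Constant family / offset / pan-genome length columns
awk -v n="$n" 'BEGIN { for (i = 0; i < n; i++) print "AMR\t0\t1000" }' > "$work/famoffpan.tmp"

sumzer_demo=$(paste "$work/acclen.tmp" "$work/desc.tmp" "$work/famoffpan.tmp")
printf '%s\n' "$sumzer_demo"
rm -rf "$work"
```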

In [None]:
md5sum *
md5sum * > amr0.md5sum

```
6285212a929bd2ef50030bc6ecc37825  amr0.95.fa
150542c5ece1d04fec84fbf2d0c86a40  amr0.95.fa.fai
ceb14d62997ae661754309e6460202f0  amr0.95.headers.gz
1fc5958113d52ae4414550e83e0b7717  amr0.95.softmasked.fa.gz
72be781f06abf917852755de7f2e0610  amr0.95.sumzer
cb59997a72f1879b794ced98887f169c  amr0.95.unmasked.fa.gz
28403ee52454f135c7559b48cfa2d63b  amr0.fa
2ec057a40ee0470bbfb7af1a2978678e  amr0.headers.gz
a3ae777254a747189ebb0fcad024af56  broadstreet-v3.0.9.tar.bz2
cd46f1db5ae4abd0baec5dd907dbd4ca  mask.regions.bed.gz
```
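
A checksum file written this way can later be used to verify a download with `md5sum -c`. A minimal sketch on a throwaway file; the file names are placeholders.

```shell
work=$(mktemp -d)
printf 'ACGT\n' > "$work/toy.fa"
(cd "$work" && md5sum toy.fa > toy.md5sum)

# Intact file: md5sum -c exits 0
if (cd "$work" && md5sum -c toy.md5sum > /dev/null 2>&1); then intact=ok; else intact=mismatch; fi

# Simulate a corrupted transfer: md5sum -c now exits non-zero
printf 'ACGA\n' > "$work/toy.fa"
if (cd "$work" && md5sum -c toy.md5sum > /dev/null 2>&1); then altered=ok; else altered=mismatch; fi

printf 'intact=%s altered=%s\n' "$intact" "$altered"
rm -rf "$work"
```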

In [None]:
aws s3 sync ./ s3://serratus-public/seq/amr0/

# cov3ma: cov3m + amr0



In [None]:
NAME='cov3ma'
mkdir -p /home/serratus/$NAME; cd /home/serratus/$NAME

# Download RefSeq (FLOM) Sequences
aws s3 cp s3://serratus-public/seq/flom2/flom2.fa ./
aws s3 cp s3://serratus-public/seq/flom2/flom2.sumzer.tsv ./

# Download CovRef from RCE
aws s3 cp s3://serratus-public/seq/covref3/covref3.fa ./
aws s3 cp s3://serratus-public/seq/covref3/covref3.sumzer.tsv ./

# Download AMR
aws s3 cp s3://serratus-public/seq/amr0/amr0.95.fa ./
aws s3 cp s3://serratus-public/seq/amr0/amr0.95.sumzer ./amr0.95.sumzer.tsv

md5sum *

```
6285212a929bd2ef50030bc6ecc37825  amr0.95.fa
72be781f06abf917852755de7f2e0610  amr0.95.sumzer.tsv
289083fefaf1eef20417d01f2096e545  covref3.fa
cb6bebcc97aecc8b9b9ab1c7eafeb054  covref3.sumzer.tsv
631c8ccb1aa0cb396f04646b979251cb  flom2.fa
d47dbf3bc78ae4dc72c7f45f9a0aa7e2  flom2.sumzer.tsv
```

In [None]:
# Merge covref3 and flom2 and amr0 for cov3ma
cat covref3.fa flom2.fa amr0.95.fa > $NAME.fa
cat covref3.sumzer.tsv flom2.sumzer.tsv amr0.95.sumzer.tsv > $NAME.sumzer.tsv
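
When several references are concatenated, a sequence name that appears in more than one source would end up duplicated in the merged FASTA. This check is not part of the original build; it is an optional sanity step, sketched here on toy files with made-up names.

```shell
work=$(mktemp -d)
printf '>a1\nACGT\n>b1\nGGCC\n' > "$work/x.fa"
printf '>c1\nTTAA\n>a1\nACGT\n' > "$work/y.fa"
cat "$work/x.fa" "$work/y.fa" > "$work/merged.fa"

# List any sequence name that occurs more than once (empty output means none)
dups=$(grep '^>' "$work/merged.fa" | cut -d' ' -f1 | sort | uniq -d)
printf 'duplicated names: %s\n' "${dups:-none}"
rm -rf "$work"
```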

In [None]:
# Build bowtie2 + faidx index
bowtie2-build --threads 4 --seed 1337 $NAME.fa $NAME
samtools faidx $NAME.fa

In [None]:
# Mask Regions
aws s3 cp s3://serratus-public/seq/flom2/mask.regions.bed.gz ./flom2.mask.bed.gz
aws s3 cp s3://serratus-public/seq/covref3/mask.regions.bed.gz ./cov3.mask.bed.gz
aws s3 cp s3://serratus-public/seq/amr0/mask.regions.bed.gz ./amr0.mask.bed.gz

gzip -d flom2.mask.bed.gz
gzip -d cov3.mask.bed.gz
gzip -d amr0.mask.bed.gz

cat *bed > $NAME.mask.tmp
sort -k1,1 -k2,2n $NAME.mask.tmp > $NAME.mask.bed
gzip $NAME.mask.bed


In [None]:
# Clean-up and checksum
# Note: a bare "cov3*" glob would also match the cov3ma.* outputs
rm covref3* cov3.mask.bed flom2* amr0*

md5sum *
md5sum * > $NAME.md5sum
aws s3 sync ./ s3://serratus-public/seq/cov3ma/

```
8023bcdc66e41d6d0504022770749fa6  cov3ma.1.bt2
b3fc27fddd350bfdbbbb1089ae50ebd1  cov3ma.2.bt2
a71f53e22ceb8c8524a87353f2d2d075  cov3ma.3.bt2
dc695278da8730619d818cc334c99a12  cov3ma.4.bt2
3f9d8ebf75d39a0c97193cec07b586db  cov3ma.fa
9957189fd4058799aa79d68b84b27835  cov3ma.fa.fai
d45fbda751026c6e8327a67d091cb0d4  cov3ma.mask.bed.gz
20e9ebc4f2a12f5dc0b93a4a977f2743  cov3ma.mask.tmp
4515729ef57435f40c96f35432858623  cov3ma.md5sum
09dfd7ed776bbe92dd91663e766f84d3  cov3ma.rev.1.bt2
338ae133d20cd8de709414ff1ec4fe7b  cov3ma.rev.2.bt2
991b6a7ce27dcb6fa0b5f1dc674c71fd  cov3ma.sumzer.tsv
```