# DIAMOND + NR EC2 instance
```
Lead     : ababaian
Issue    : <github issue #>
start    : 2021 01 21
complete : 2021 XX XX
files    : NA
s3 files : s3://serratus-public/notebook/200121_nr/
```

## Introduction

As we retrieve new sequences, we will need to search each one against the NCBI non-redundant (NR) protein database. Probably the fastest approach is to download the NR database as FASTA, index it for `diamond`, and search our sequences against it.

#### Links

- [BLAST in the cloud](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=CloudBlast)
- [BLAST database information](https://github.com/ncbi/blast_plus_docs#blast-databases)

## EC2 Set-up script

- OS: `Amazon Linux 2 AMI (HVM) x86`
- ami: `ami-0be2609ba883822ec`
- instance: `c5.xlarge` // `r5d.4xlarge`
- description: `"c5.xlarge (- ECUs, 4 vCPUs, 3.4 GHz, -, 8 GiB memory, EBS only)"`
- description: `"r5d.4xlarge (16 vCPU, 128 GB, 2 x 300 NVMe SSD)"`
- storage: `450 GiB SSD (gp2)`
- encryption: `false`
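
The settings above map onto an AWS CLI launch call roughly as follows. This is a dry-run sketch that only prints the command. The key-pair name `my-key`, the helper name `launch_cmd`, and the root device name `/dev/xvda` are assumptions for illustration and do not come from this notebook.

```shell
#!/bin/bash
# Sketch: print (not run) an `aws ec2 run-instances` call for the settings above.
# KEY_NAME and the root device name /dev/xvda are assumptions for illustration.
AMI='ami-0be2609ba883822ec'
KEY_NAME='my-key'   # hypothetical key pair name

launch_cmd () {
  # $1 = instance type (c5.xlarge or r5d.4xlarge)
  echo aws ec2 run-instances \
    --image-id "$AMI" \
    --instance-type "$1" \
    --key-name "$KEY_NAME" \
    --block-device-mappings \
    "DeviceName=/dev/xvda,Ebs={VolumeSize=450,VolumeType=gp2,Encrypted=false}"
}

launch_cmd c5.xlarge
```

Printing the command first makes it easy to review the storage and encryption settings before anything is launched.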

In [1]:
# date and version
date

Thu Jan 21 23:50:34 PST 2021


In [None]:
# INSTALL DIAMOND
# From base amazon linux 2
sudo yum install -y docker git

# From `serratus-align` container
mkdir diamond; cd diamond

# Install diamond2 
# Libraries for building diamond2
sudo yum -y install git gcc gcc-c++ glibc-devel \
  cmake patch automake zlib-devel make

# grab latest with fix from Benjamin
git clone https://github.com/bbuchfink/diamond.git
cd diamond

mkdir bin; cd bin
cmake ..
make -j4
sudo cp ./diamond /usr/bin/diamond
sudo chmod 755 /usr/bin/diamond

# stable copy to S3 servers
# OLD: curl https://serratus-public.s3.amazonaws.com/bin/diamond > /usr/bin/diamond2; chmod 755 /usr/bin/diamond2
# curl https://serratus-public.s3.amazonaws.com/bin/diamond > /usr/bin/diamond; chmod 755 /usr/bin/diamond
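
Before moving on, it can help to confirm that the tools used in later cells are on the `PATH`. The following is a minimal sketch: the helper name `require_cmd` is hypothetical, and the tool list is simply the set of commands that appear in this notebook.

```shell
#!/bin/bash
# require_cmd: return 0 if every named command is on PATH, 1 otherwise.
# Missing commands are reported on stderr.
require_cmd () {
  local missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" > /dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return $missing
}

# Tools used in this notebook
require_cmd diamond wget pigz unzip aws seqkit \
  || echo "install the tools listed above first"
```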


In [None]:
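# DOWNLOAD BLAST DB - NR
# Requires pigz (not in the yum install list above)
mkdir -p ~/nr; cd ~/nr
# Stream the compressed FASTA to stdout (-O -) and decompress on the fly
wget -O - ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz \
 | pigz -d - \
 > nr.fa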
 
# And taxonomy data
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip
unzip taxdmp.zip
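
A quick sanity check after the download is to confirm that the files `diamond makedb` expects are present and non-empty, and to count the FASTA records in `nr.fa`. A sketch only; `count_fasta` is a hypothetical helper name.

```shell
#!/bin/bash
# count_fasta: print the number of records (lines starting with '>') in a FASTA file
count_fasta () {
  grep -c '^>' "$1"
}

# Report on the inputs used by `diamond makedb`
for f in nr.fa nodes.dmp names.dmp prot.accession2taxid.gz; do
  if [ -s "$f" ]; then
    echo "ok: $f"
  else
    echo "missing or empty: $f"
  fi
done

if [ -s nr.fa ]; then
  echo "nr.fa records: $(count_fasta nr.fa)"
fi
```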

In [None]:
# Switch to r5d.4xlarge instance with 450 GB block storage
# Make diamond nr db
# Database hash = f0ef2411c9661667e19bf85d06ff9fab

diamond makedb -p 14 --in nr.fa \
  --taxonmap prot.accession2taxid.gz \
  --taxonnodes nodes.dmp \
  --taxonnames names.dmp \
  -d nr

In [None]:
# Test data (rVert unitigs)
aws s3 cp s3://serratus-public/rce/tmp/rvert_otu_analysis.tar.gz ./
tar -xvf rvert*

In [None]:
#!/bin/bash
# run_diamond.sh
#
# Standard Serratus DIAMOND search (blastp or blastx)
# against the NR database
# Usage:   ./run_diamond.sh <p|x> <input.fa> <output_prefix>
# Example: ./run_diamond.sh p rdrp_trim.fa rdrp_nr
#

PX=$1
INPUT=$2
OUTPUT=$3


if [ "$PX" = "p" ]; then
    # Diamond blastp alignment
    time diamond blastp \
      -q  $INPUT \
      -d ~/nr/nr.dmnd \
      --masking 0 \
      --unal 1 \
      --mid-sensitive -l 1 \
      -p14 -k1 \
      -f 6 qseqid  qstart qend qlen qstrand \
           sseqid  sstart send slen \
           pident evalue \
           full_qseq \
      > "$OUTPUT".pro
elif [ "$PX" = "x" ]; then
    # Diamond blastx alignment
    time diamond blastx \
      -q  $INPUT \
      -d ~/nr/nr.dmnd \
      --masking 0 \
      --unal 1 \
      --mid-sensitive -l 1 \
      -p14 -k1 \
      -f 6 qseqid  qstart qend qlen qstrand \
           sseqid  sstart send slen \
           pident evalue \
           qseq_translated \
      > "$OUTPUT".x.pro
else
    echo "Please specify diamond 'x' or 'p'" >&2
    exit 1
fi

#real    189m26.120s
#user    2306m8.703s
#sys     5m57.681s
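
Because the script runs with `--unal 1`, queries without a hit are also written to the output table. The sketch below counts aligned versus unaligned queries. It assumes the 12-column layout requested by `-f 6` in `run_diamond.sh` (so `sseqid` is column 6) and assumes that DIAMOND writes `*` as the `sseqid` of an unaligned query; check this against real output before relying on it. `summarize_hits` is a hypothetical helper name.

```shell
#!/bin/bash
# summarize_hits: count aligned and unaligned queries in a run_diamond.sh table.
# Column 6 is sseqid in the 12-column layout used by run_diamond.sh.
# Assumption: unaligned queries (--unal 1) carry '*' as sseqid.
summarize_hits () {
  awk -F'\t' '
    $6 == "*" { unal++; next }
              { aln++ }
    END { printf "aligned\t%d\nunaligned\t%d\n", aln, unal }
  ' "$1"
}

# Example:
# summarize_hits s1.rdrp_nr.pro
```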


## Analyzing Serratus Trim Sequences

> RCE:
>
> (Thread) updated novel species FASTA
>
> s3://serratus-public/rce/tmp/novel_id90.fa
>
> 128,481 species after adding the missed knowns in NR.

In [None]:
# Download trimmed HICON sequences from serratus assemblage
aws s3 cp s3://serratus-public/rce/tmp/novel_id90.fa ./serratus.rdrp.trim.v1.fa

# Metadata for serratus v1 total assemblage (600 585 sequences)
# Columns: 1. pctid of trimmed sOTU, 2. SRA with NODE... label from CS, 3. SRA of species
aws s3 cp s3://serratus-public/rce/tmp/sra_species_table.tsv ./sv1_table.tsv

In [None]:
# Align against NR database
screen
./run_diamond.sh p serratus.rdrp.trim.v1.fa s1.rdrp_nr

In [None]:
INPUT='novel_sp.hicon.trim.fa'
OUTPUT='novel_sp.vOTU'

# Diamond blastp alignment
time diamond blastp \
  -q  $INPUT \
  -d vOTU_210122.dmnd \
  --masking 0 \
  --unal 1 \
  --ultra-sensitive -l 1 \
  -p4 -k1 -b2 \
  -f 6 qseqid  qstart qend qlen qstrand \
       sseqid  sstart send slen \
       pident evalue \
       full_qseq \
  > "$OUTPUT".pro
  
# Extract nidovirales
INPUT="$OUTPUT.pro"
OUTPUT='novel_sp.vOTU'

grep -w "[Ff]1357" $INPUT >  $OUTPUT.nido.pro
grep -w "[Ff]1601" $INPUT >> $OUTPUT.nido.pro
grep -w "[Ff]1219" $INPUT >> $OUTPUT.nido.pro
grep -w "[Ff]162"  $INPUT >> $OUTPUT.nido.pro
grep -w "[Ff]2095" $INPUT >> $OUTPUT.nido.pro
grep -w "[Ff]21"   $INPUT >> $OUTPUT.nido.pro
grep -w "[Ff]1854" $INPUT >> $OUTPUT.nido.pro
grep -w "[Ff]232"  $INPUT >> $OUTPUT.nido.pro
grep -w "[Ff]1408" $INPUT >> $OUTPUT.nido.pro
grep -w "[Ff]1409" $INPUT >> $OUTPUT.nido.pro
grep -w "[Ff]2519" $INPUT >> $OUTPUT.nido.pro
grep -w "[Ff]1640" $INPUT >> $OUTPUT.nido.pro
grep -w "[Ff]1944" $INPUT >> $OUTPUT.nido.pro
grep -w "[Ff]2763" $INPUT >> $OUTPUT.nido.pro
grep -w "[Ff]2554" $INPUT >> $OUTPUT.nido.pro

sed "s/^/>/g" $OUTPUT.nido.pro \
  | sed "s/\t.*\t/\n/g" - \
  > $OUTPUT.nido.fa
  
  
# Extract vOTU unique sequences
extract_family () {
  seqkit grep -r -n -i -p "$2" $1 | seqkit grep -r -n -p "S" -
}

extract_family uniques.fa F1601  > nido.votu.fa
extract_family uniques.fa F1219  >> nido.votu.fa
extract_family uniques.fa F162   >> nido.votu.fa
extract_family uniques.fa F2095  >> nido.votu.fa
extract_family uniques.fa F21    >> nido.votu.fa
extract_family uniques.fa F1854  >> nido.votu.fa
extract_family uniques.fa F232   >> nido.votu.fa
extract_family uniques.fa F1408  >> nido.votu.fa
extract_family uniques.fa F1409  >> nido.votu.fa
extract_family uniques.fa F2519  >> nido.votu.fa
extract_family uniques.fa F1640  >> nido.votu.fa
extract_family uniques.fa F1944  >> nido.votu.fa
extract_family uniques.fa F2763  >> nido.votu.fa
extract_family uniques.fa F2554  >> nido.votu.fa
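
The repeated `grep` calls and the `sed` conversion to FASTA can also be written as a loop over the family IDs. This is a sketch rather than the method used for the results here: the helper names `extract_families_tab` and `tab_to_fasta` are hypothetical, and the ID list is copied from the `grep` lines above. Like the original, it emits a row once per matching family.

```shell
#!/bin/bash
# Family IDs copied from the grep lines above
FAMILIES='1357 1601 1219 162 2095 21 1854 232 1408 1409 2519 1640 1944 2763 2554'

# extract_families_tab: print table rows that contain any listed family ID
# as a whole word (same matching as `grep -w "[Ff]<id>"`).
extract_families_tab () {
  local table=$1 id
  for id in $FAMILIES; do
    grep -w "[Ff]$id" "$table"
  done
}

# tab_to_fasta: first column becomes the header, last column the sequence
tab_to_fasta () {
  awk -F'\t' '{ print ">" $1 "\n" $NF }'
}

# Example:
# extract_families_tab novel_sp.vOTU.pro | tab_to_fasta > novel_sp.vOTU.nido.fa
```

Taking the first and last columns explicitly in `awk` avoids depending on the greedy `\t.*\t` match in `sed`, which behaves the same only when the sequence is the final field.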


