# RdRp0 pan-proteome
```
Lead     : ababaian
Issue    : 
start    : 2020 12 10
complete : 2020 12 13
revision : 2020 12 15
files    : ~/serratus/notebook/201210_rdrp0/
s3 files : s3://serratus-public/notebook/201210_rdrp0/
```

## Introduction

We've been considering other ways to 'dive' into the SRA to yield meaningful, interpretable results. A recurring idea is to focus on a gene family/domain that we would like to characterize exhaustively.

The prime candidate is the viral RNA-dependent RNA polymerase, or `RdRp`. It is a slowly evolving gene and the central reference gene for the identification and classification of RNA viruses.

It is a daunting task to isolate all known RdRp and categorize them into a meaningful system. This is a first approximation of that goal, putting together the components to do so.

The ideal end-goal is to create a hierarchically/taxonomically nested set of RdRp protein sequences at various cut-off thresholds.

- rdrp100: all unique RdRp sequences
- rdrp97:  all rdrp sequences clustered at 97% identity
- rdrp90:  Species-Approximate. 90% identity clusters
- rdrp75:  Genus-Approximate. 75% identity clusters
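
One possible reading of "nested" is that each set is clustered from the centroids of the set above it. The dry run below only prints the commands that such a plan would involve. It assumes a hypothetical `rdrp100.fa` that is already sorted by length, and it reuses the `usearch` options that appear later in this notebook; the chaining itself is an assumption, not a decision recorded here.

```shell
#!/usr/bin/env bash
# Dry run: print (do not execute) one usearch command per identity threshold.
# Assumptions: rdrp100.fa exists and is length-sorted; each level is clustered
# from the centroids of the level above it.
plan_nested_clusters() {
  local prev="rdrp100"
  local pct out
  for pct in 97 90 75; do
    out="rdrp${pct}"
    printf 'usearch -cluster_smallmem %s.fa -id 0.%s -centroids %s.fa -uc %s.uc\n' \
      "$prev" "$pct" "$out" "$out"
    prev="$out"
  done
}

plan_nested_clusters
```

Printing the plan first makes it easy to check file naming before committing compute time.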

### Key Literature

- [Wolf20: Doubling RNA viruses](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7508674/)
- [Wolf18: RdRp evo/origin](https://pubmed.ncbi.nlm.nih.gov/30482837/)
- [Venk18: RdRp evo/origin](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5850383/)
- [Zhange: Expanding RNA virome (review)](https://pubmed.ncbi.nlm.nih.gov/31100994/)

### Objectives
- Compile the materials necessary for a comprehensive RdRp-ome
- Create the `rdrp0.fa` reference pan-proteome for a pilot Serratus run, to see what the results would look like

In [6]:
# Serratus commit version
SERRATUS="/home/artem/serratus"
cd $SERRATUS

# Create local run directory
WORK="$SERRATUS/notebook/201210_rdrp0"
mkdir -p $WORK; cd $WORK

# S3 notebook path
S3_WORK='s3://serratus-public/notebook/201210_rdrp0/'

# date and version
date
git rev-parse HEAD # commit version

Sun Dec 13 11:06:21 PST 2020
6ac78a036910813c0f5fb2e7ef0b88599e683959


## GenBank Virome

The master corpus for all viral sequences

### Nucleotide Sequences
- Query: `txid10239[Organism:exp]` # all viruses
- Date: `201205`
- Results: `3 535 357` sequences
- File : `ntViro_gb201205.fa`

### CDS Sequences
- Query: `txid2552587[Organism:exp]` # all RNA virus CDS
- Date: `201205` # error in this version
- Results: `2 825 230` sequences
- File : `cdsViro_gb201205.fa`

### Protein Sequences
- Query: `txid10239[Organism:exp]` # all viral protein sequences
- Date: `201212` # error in this version
- Results: `2 825 230` sequences
- File : `aaViro_gb201212.fa`
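
The CDS and protein entries above are flagged as errors and report the same sequence count. A quick way to spot that kind of problem is to look for files that share a checksum. The helper below is a minimal sketch; `find_identical_files` is a hypothetical name, and it assumes GNU `md5sum` and file names without spaces.

```shell
#!/usr/bin/env bash
# Report groups of files whose MD5 checksums are identical.
# Usage: find_identical_files FILE...
# Assumes GNU md5sum output ("<hash>  <file>") and file names without spaces.
find_identical_files() {
  md5sum "$@" | awk '
    {
      sum = $1; name = $2
      files[sum] = (sum in files) ? files[sum] " " name : name
      count[sum]++
    }
    END {
      for (s in count)
        if (count[s] > 1) print s "\t" files[s]
    }'
}

# Example (file names as used in this notebook):
# find_identical_files ntViro_gb201205.fa cdsViro_gb201205.fa aaViro_gb201212.fa
```

Any line printed by the helper lists files with byte-identical content, which for two supposedly different downloads is a sign that one of them needs to be fetched again.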

In [2]:
cd $WORK
NT='ntViro_gb201205.fa'
grep ">" $NT | wc -l
md5sum $NT
md5sum $NT > $NT.md5

CDS='cdsViro_gb201205.fa'
grep ">" $CDS | wc -l
md5sum $CDS
md5sum $CDS > $CDS.md5

AA='aaViro_gb201212.fa'
grep ">" $AA | wc -l
md5sum $AA
md5sum $AA > $AA.md5

# GB RdRp sequences (see below WOLF18)
GB='gbRdRp_201212.fa'
grep ">" $GB | wc -l
md5sum $GB
md5sum $GB > $GB.md5

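# YA RdRp sequences (see below WOLF20)
# NOTE: the recorded output below shows gbRdRp_201212.fa twice because
# YA was pointed at the GB file when this cell was run
YA='yaRdRp_201212.fa'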
grep ">" $YA | wc -l
md5sum $YA
md5sum $YA > $YA.md5

3535357
9102eceda85185cfb124023dbf129621  ntViro_gb201205.fa
2825230
81946a20fc18b24eb6b49541d41b8dd0  cdsViro_gb201205.fa
2825230
81946a20fc18b24eb6b49541d41b8dd0  aaViro_gb201212.fa
13870
671fb5b3f02fd41457be4d6c7a31a417  gbRdRp_201212.fa
13870
671fb5b3f02fd41457be4d6c7a31a417  gbRdRp_201212.fa


## WOLF18 RdRp

FTP Access: `ftp://ftp.ncbi.nlm.nih.gov/pub/wolf/_suppl/rnavir18/`

Sequence data: `ftp://ftp.ncbi.nlm.nih.gov/pub/wolf/_suppl/rnavir18/RNAvirome.S2.afa`

Saved as: `gb_rdrp.afa`

Date Accessed: `201212`

![Figure 1](/home/artem/serratus/notebook/201210_rdrp0/wolf18/figure1.png)

### Level 1 - Supergroup / Branches

The RdRp can be broadly classified into 5 branches, which form the first level of the hierarchy: `rdrp1`, `rdrp2`, `rdrp3`, `rdrp4`, `rdrp5`.

>Branch 1 consists of leviviruses and their eukaryotic relatives, namely, “mitoviruses,” “narnaviruses,” and “ourmiaviruses” (the latter three terms are placed in quotation marks as our analysis contradicts the current ICTV framework, which classifies mitoviruses and narnaviruses as members of one family, Narnaviridae, and ourmiaviruses as members of a free-floating genus, Ourmiavirus).

> Branch 2 (“picornavirus supergroup”) consists of a large assemblage of +RNA viruses of eukaryotes, in particular, those of the orders Picornavirales and Nidovirales; the families Caliciviridae, Potyviridae, Astroviridae, and Solemoviridae, a lineage of dsRNA viruses that includes partitiviruses and picobirnaviruses; and several other, smaller groups of +RNA and dsRNA viruses.

> Branch 3 consists of a distinct subset of +RNA viruses, including the “alphavirus supergroup” along with the “flavivirus supergroup,” nodaviruses, and tombusviruses; the “statovirus,” “wèivirus,” “yànvirus,” and “zhàovirus” groups; and several additional, smaller groups.

> Branch 4 consists of dsRNA viruses, including cystoviruses, reoviruses, and totiviruses and several additional families.

> Branch 5 consists of −RNA viruses.

Boundary definitions of the branches with respect to RdRp are taken from the paper.

Based on: [Supplementary Data 4](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6282212/bin/mbo006184203sd4.txt)
Saved as: `rdrp_representative_branches.tree`

### Level 2 - Viral Family

Based on: `https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6282212/bin/mbo006184203sd1.xls`
Saved as: `wolf18_vlist.xlsx`

Spreadsheet includes the fields

- RdRp num ID: Ordinal numbering for RdRp
- RdRp GenBank Acc: Protein accession ID
- NCBI Tax ID: taxid from NCBI
- virus name: virus name
- taxonomy: taxonomic tree

The taxonomy field was parsed to retrieve the entry with the "*dae*" suffix as the "Family", falling back to a relatively appropriate family-level name, or to "unclassified" when none was available. Monkey work.
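
The automatic part of that parsing could look roughly like the sketch below. It assumes the taxonomy field is a semicolon-separated lineage (the spreadsheet layout is not shown here, so that is a guess), and `family_from_taxonomy` is a hypothetical helper name. The manual "relatively appropriate family-name" step is not covered.

```shell
#!/usr/bin/env bash
# Pick the deepest lineage entry ending in "dae"; fall back to "unclassified".
# Assumption: the lineage is a single string with ";" between ranks.
family_from_taxonomy() {
  local family
  family=$(printf '%s\n' "$1" \
    | tr ';' '\n' \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
    | grep -E '^[A-Za-z]+dae$' \
    | tail -n 1)
  printf '%s\n' "${family:-unclassified}"
}

# Illustrative lineage strings (not taken from the spreadsheet):
family_from_taxonomy 'Viruses; ssRNA viruses; Nidovirales; Coronaviridae; Coronavirinae'
family_from_taxonomy 'Viruses; unclassified RNA viruses'
```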

### Level 3 - Sequence/Species

Based on: `wolf18_vlist.xlsx`

- Virus name field (most shallow taxonomic classification) taken for each record.
- GenBank accession taken from each record.

### Example RdRp

```
rdrp5.Hantaviridae.Bowe_virus:AGW23849.1
rdrp5.Bunyaviridae.Azagny_virus:AEA42011.1
...
rdrp2.Coronaviridae.Night_heron_coronavirus_HKU19:YP_005352862.1
rdrp2.Coronaviridae.Munia_coronavirus_HKU13_3514:YP_002308505.1
rdrp2.Coronaviridae.Wigeon_coronavirus_HKU20:YP_005352870.1
rdrp2.Coronaviridae.Feline_infectious_peritonitis_virus:AGZ84535.1
rdrp2.Coronaviridae.Lucheng_Rn_rat_coronavirus:YP_009336483.1
rdrp2.Coronaviridae.Hipposideros_bat_coronavirus_HKU10:AFU92121.1
rdrp2.Coronaviridae.BtMs_AlphaCoV_GS2013:AIA62270.1
rdrp2.Coronaviridae.Chaerephon_bat_coronavirus_Kenya_KY41_2006:ADX59465.1
rdrp2.Coronaviridae.Porcine_epidemic_diarrhea_virus:AID56804.1
rdrp2.Coronaviridae.Bat_coronavirus_CDPHE15_USA_2006:YP_008439200.1
rdrp2.Coronaviridae.Anlong_Ms_bat_coronavirus:AID16674.1
rdrp2.Coronaviridae.Scotophilus_bat_coronavirus_512:YP_001351683.1
rdrp2.Coronaviridae.BtNv_AlphaCoV_SC2013:YP_009201729.1
rdrp2.Coronaviridae.Bat_coronavirus_1B:ACA52156.1
rdrp2.Coronaviridae.NL63_related_bat_coronavirus:YP_009328933.1
rdrp2.Coronaviridae.NL63_related_bat_coronavirus:APD51489.1
rdrp2.Coronaviridae.229E_related_bat_coronavirus:ALK43115.1
rdrp2.Coronaviridae.Rhinolophus_bat_coronavirus_HKU2:ABQ57223.1
rdrp2.Coronaviridae.Wencheng_Sm_shrew_coronavirus:AID16677.1
rdrp2.Coronaviridae.Human_coronavirus_HKU1:ABD75543.1
rdrp2.Coronaviridae.Betacoronavirus_Erinaceus_VMC_DEU_2012:YP_008719930.1
rdrp2.Coronaviridae.Pipistrellus_bat_coronavirus_HKU5:YP_001039961.1
rdrp2.Coronaviridae.Rousettus_bat_coronavirus:AOG30811.1
rdrp2.Coronaviridae.Rousettus_bat_coronavirus:YP_009273004.1
rdrp2.Coronaviridae.Bat_CoV_279_2005:P0C6V9.1
rdrp2.Coronaviridae.Bat_Hp_betacoronavirus_Zhejiang2013:YP_009072438.1
rdrp2.Coronaviridae.Bottlenose_dolphin_coronavirus_HKU22:AHB63494.1
rdrp2.Coronaviridae.Duck_coronavirus:AKF17722.1
rdrp2.Coronaviridae.Avian_infectious_bronchitis_virus_partridge_GD_S14_2003:AAT70770.1
rdrp2.Coronaviridae.Infectious_bronchitis_virus:ADA83556.1
...
rdrp1.unclassified.Wenzhou_levi_like_virus_3:APG77299.1
rdrp1.Leviviridae.Pseudomonas_phage_PP7:NP_042307.1

```

Saved as: `gb_assign_group.txt`
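
For downstream parsing, the `branch.family.virus_name:accession` layout shown in the example can be split with plain parameter expansion. This is a minimal sketch; `parse_rdrp_header` is a hypothetical helper, and it assumes the accession is everything after the last `:`.

```shell
#!/usr/bin/env bash
# Split "branch.family.virus_name:accession" into four tab-separated fields.
# The virus name may itself contain "." characters; the accession is taken
# as the text after the last ":".
parse_rdrp_header() {
  local header="${1#>}"       # drop a leading ">" if present
  local acc="${header##*:}"   # text after the last ":"
  local left="${header%:*}"   # text before the last ":"
  local branch="${left%%.*}"  # up to the first "."
  local rest="${left#*.}"
  local family="${rest%%.*}"  # up to the second "."
  local name="${rest#*.}"     # remainder of the left-hand side
  printf '%s\t%s\t%s\t%s\n' "$branch" "$family" "$name" "$acc"
}

parse_rdrp_header 'rdrp5.Hantaviridae.Bowe_virus:AGW23849.1'
```

Anchoring on the first two periods and the last colon keeps the split stable even when a virus name contains extra periods.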

In [7]:
# 201213 REPEAT; accidentally duplicated rdrp0.fa output here
cd $WORK/wolf18

In [8]:
grep ">" gb_rdrp.afa | tail -
head gb_assign_group.txt

>AMN92168.1|Bourbon_virus
>YP_009352882.1|Dhori_virus
>YP_145794.1|Thogoto_virus
>AHB34055.1|Upolu_virus
>ABF68025.1|Infectious_salmon_anemia_virus
>AQM37684.1|Steelhead_trout_orthomyxovirus_1
>APG77864.1|Beihai_orthomyxo_like_virus_2
>APG77905.1|Hubei_orthomyxo_like_virus_5
>YP_009246481.1|Tilapia_lake_virus
>YP_009337891.1|Changping_earthworm_virus_2
>rdrp5.unclassified.Wuhan_Insect_virus_3:AJG39263.1
>rdrp5.unclassified.Mucorales_RNA_virus_1:AMK47917.1
>rdrp5.unclassified.Wenling_crustacean_virus_9:YP_009329879.1
>rdrp5.Bunyaviridae.Groundnut_chlorotic_fan_spot_virus:AJT59689.1
>rdrp5.Tospoviridae.Soybean_vein_necrosis_virus:ADX01591.1
>rdrp5.Tospoviridae.Bean_necrotic_mosaic_virus:YP_006468898.1
>rdrp5.Tospoviridae.Polygonum_ringspot_tospovirus:AHZ45965.1
>rdrp5.Tospoviridae.Pepper_chlorotic_spot_virus:AQX77525.1
>rdrp5.Tospoviridae.Melon_yellow_spot_virus:BAG82842.1
>rdrp5.Tospoviridae.Capsicum_chlorosis_virus:AGS78403.1


In [14]:
# Rename the header to remove virus name
# remove gaps from sequence (unaligned)
sed 's/|.*//g' gb_rdrp.afa \
  | sed 's/-//g' - \
  > rdrp_1.tmp

# One sequence "Pseudomonas_phage_phiYY" has no accession
# YP_009618381.1
sed -i 's/^>$/>YP_009618381.1/g' rdrp_1.tmp

tail -n3 rdrp_1.tmp

KAPDSAARESLDRASEIMTGKSYNAVHTGDLSKLPNQGESPLRIVDSDLYSERSCCWVIEKEGRVVCKSTTLTRGMTGLLNTTRCSSPSELICKVLTVESLSEKIGDTSVEELLSHGRYFKCALRDQERGKPKSRAIFLSHPFFRLLSSVVETHARSVLSKVSAVYTATASAEQRAMMAAQVVESRKHVLNGDCTKYNEAIDADTLLKVWDAIGMGSIGVMLAYMVRRKCVLIKDTLVECPGGMLMGMFNATATLALQGTTDRFLSFSDDFITSFNSPAELREIEDLLFASCHNLSLKKSYISVASLEINSCTLTRDGDLATGLGCTAGVPFRGPLVTLKQTAAMLSGAVDSGVMPFHSAERLFQIKQQECAYRYNNPTYTTRNEDFLPTCLGGKTVISFQSLLTWDCHPFWYQVHPDGPDTIDQKVLSVLASK
>YP_009337891.1
WDDQDQSMFLRPKNRTGYGPLIFNTMKRISDMSPTRARELSEVFSVTEKERSISVLASGGTKFVPARGTSVPASTAFWDYQDQMRPIFEHYNIKYTDNSWWHIVICANIFGEYFEILPPTWDRSTLTKLFVEIFSAGLAVKQTEHNRSEGRNIVTMSISLQNFQNFVEEVAKIVNRMTGSHGTDLSSLEKRDLLRKVGLAASIELDTFLASLDKTKWNQLLQISTAMLLLAASYPNDASERRFVLLVGQIWREKCLYFPSKHSYYTGGMKTPKTIDELSRMNDEQLLNDNIRDDLMMVLRHYRKKRVIPQYIKCDLIMLMGMFNHSSTTLHIWPAYANHLDDNQTVSKIIDFCASSDDSMVRAKKILGMSALESYRTISSLWKSMGLNDSEDKSIIHDRLVKVEYNSNVFSMGQLIPNLSRDVAGTKVLYENPEKDLETMKNQLFVYINEGTLSTQDAAIILSDKYLTSLDIHDMLPFQKRHPIFLNNLTSAGLIPQCIPIWCGGTNHIPPELWGTMDDKMYWYHHHKDTGKTNLYLEFLASISTPPDV

In [15]:
# Iterate through each line / fasta name.
# to swap out headers

while read -r line; do
  # Find headers
  if [[ "$line" = ">"* ]]; then
    acc="${line#>}"
    # fixed-string match on ":accession" so "." is not treated as a wildcard;
    # -m1 keeps a single header even if several lines match
    newheader=$(grep -m1 -F ":$acc" gb_assign_group.txt)

    if [[ -z "$newheader" ]]; then
      echo ">NA_$acc" >> rdrp_2.tmp
    else
      echo "$newheader" >> rdrp_2.tmp
    fi

  else
    # print aa sequence
    echo "$line" >> rdrp_2.tmp
  fi
done < rdrp_1.tmp


# Manually remove first ten sequences (Group II introns outgroup)
tail -n +21 rdrp_2.tmp > rdrp_3.tmp

mv rdrp_3.tmp ../gbRdRp_201212.fa
rm *.tmp

## WOLF20 RdRp

FTP Access: `ftp://ftp.ncbi.nlm.nih.gov/pub/wolf/_suppl/yangshan/`

Sequence data: `gb_rdrp.afa`

Saved as: `gb_rdrp.fa`

Date Accessed: `201212`

![Figure 2](/home/artem/serratus/notebook/201210_rdrp0/wolf20/wolf20_figure2.jpg)

>RNA virome analysis performed using complementary DNA derived from approximately 10 l of samples from Yangshan Deep-Water Harbour yielded 4,593 nearly full-length RNA virus RdRPs that formed 2,192 clusters at 75% amino acid identity which represents virus diversity at a level between species and genus. Among the RdRP sequences from GenBank (October 2018), 2,021 comparable clusters were detected. Thus, the 10 l water sample analysed here more than doubles the known diversity of RNA viruses.

There are two sets of data treated independently here: the GenBank RdRp and the Yangshan RdRp. Only Yangshan sequences will be considered, as the GenBank records are not comprehensive. The data are clustered globally at `75%` aa identity.

### Level 1/2 - Supergroup / Branches and Clusters

The "clade id" field of [Supplementary Table 1](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7508674/bin/41564_2020_755_MOESM2_ESM.xlsx) was parsed to retrieve the branch numbering and relate it to the OV.x clusters. A few sequences are unassigned / deep, so they will form `rdrp0`.

The `OV.x` identifier was renamed to `yaOVx` to remove the period, so `OV.1` became `rdrp2.yaOV1`.

Saved as: `ovx.branches.txt`

### Level 3 - ORF Identifier
Sequence data is from `rdrp.ya.fa`

Header parsing:
`>ya20_JAAOEH010000011_1 JAAOEH010000011.1 5194-3782 OV.1 NODE_11_truseq orf.65`
will become
`>rdrp2.yaOV1.orf65`
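
The renaming rule above can be written as a small function. This sketch takes the branch as an argument instead of looking it up, because the layout of `ovx.branches.txt` is not shown here; `ya_header` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Build ">branch.yaOVx.orfN" from an original Wolf20 FASTA header.
# Assumptions: space-separated header, field 4 is the OV.x cluster,
# the last field is the ORF id, and the branch is supplied by the caller.
ya_header() {
  local header="$1" branch="$2"
  local ov orf
  ov=$(printf '%s\n' "$header" | cut -d ' ' -f 4 | tr -d '.')        # OV.1   -> OV1
  orf=$(printf '%s\n' "$header" | awk '{ print $NF }' | tr -d '.')   # orf.65 -> orf65
  printf '>%s.ya%s.%s\n' "$branch" "$ov" "$orf"
}

ya_header '>ya20_JAAOEH010000011_1 JAAOEH010000011.1 5194-3782 OV.1 NODE_11_truseq orf.65' 'rdrp2'
```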


In [17]:
cd $WORK/wolf20

# Iterate through each line / fasta name.
# to swap out headers

while read -r line; do
  # Find headers
  if [[ "$line" = ">"* ]]; then
    acc=$(echo $line | cut -f 4 -d ' ' - | sed 's/\.//g' -)
    branch=$(grep "$acc\." ovx.branches.txt)
    orf=$(echo $line | sed 's/.* //g' - | sed 's/\.//g' - )
    
    if [[ "$branch" = "" ]]; then
      echo ">NA" >> ya_2.tmp
    else
      echo ">$branch""$orf" >> ya_2.tmp
    fi
    
  else
    # print aa sequence
    echo $line >> ya_2.tmp 
  fi
done < rdrp.ya.fa

mv ya_2.tmp ../yaRdRp_201212.fa


## rdrp0 - pilot panproteome
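
Before concatenating, it may be worth checking for repeated sequence names, since the Wolf20 renaming loop writes a bare `>NA` header for any record without a branch match. The check below is a minimal sketch; `duplicate_headers` is a hypothetical helper name, and it compares names up to the first space.

```shell
#!/usr/bin/env bash
# List FASTA header names that occur more than once across the given files.
# Names are compared up to the first space.
duplicate_headers() {
  grep -h '^>' "$@" | cut -d ' ' -f 1 | sort | uniq -d
}

# Example (file names as used in this notebook):
# duplicate_headers gbRdRp_201212.fa yaRdRp_201212.fa
```

No output means every name is unique across the input files.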


In [16]:
cd $WORK

# Make rdrp0
cat gbRdRp_201212.fa yaRdRp_201212.fa > rdrp0.fa
samtools faidx rdrp0.fa

md5sum rdrp0.fa
md5sum rdrp0.fa > rdrp0.fa.md5

8479a3347bbe73224cb2eac0c2138a92  rdrp0.fa


## Upload to S3

In [17]:
cd $WORK
ls -alh

total 17G
drwxrwxr-x  4 artem artem 4.0K Dec 13 13:34 .
drwxr-xr-x 40 artem artem 4.0K Dec 13 13:33 ..
-rw-r--r--  1 artem artem 4.0G Dec 12 21:28 aaViro_gb201212.fa
-rw-rw-r--  1 artem artem   53 Dec 12 22:35 aaViro_gb201212.fa.md5
-rw-r--r--  1 artem artem 4.0G Dec 12 13:33 cdsViro_gb201205.fa
-rw-rw-r--  1 artem artem   54 Dec 12 22:35 cdsViro_gb201205.fa.md5
-rw-rw-r--  1 artem artem 2.5M Dec 13 13:04 gbRdRp_201212.fa
-rw-rw-r--  1 artem artem 8.7G Dec  5 18:20 ntViro_gb201205.fa
-rw-rw-r--  1 artem artem   53 Dec 12 22:34 ntViro_gb201205.fa.md5
-rw-rw-r--  1 artem artem 4.6M Dec 13 13:34 rdrp0.fa
-rw-rw-r--  1 artem artem 534K Dec 13 13:34 rdrp0.fa.fai
-rw-rw-r--  1 artem artem   43 Dec 13 13:34 rdrp0.fa.md5
drwxrwxr-x  2 artem artem 4.0K Dec 13 13:33 wolf18
drwxrwxr-x  5 artem artem 192K Dec 12 22:20 wolf20
-rw-rw-r--  1 artem artem 2.1M Dec 12 22:14 yaRdRp_201212.fa


In [18]:
aws s3 sync ./ $S3_WORK # done

upload: ./rdrp0.fa.md5 to s3://serratus-public/notebook/201210_rdrp0/rdrp0.fa.md5
upload: ./rdrp0.fa.fai to s3://serratus-public/notebook/201210_rdrp0/rdrp0.fa.fai
upload: wolf18/rdrp_1.tmp to s3://serratus-public/notebook/201210_rdrp0/wolf18/rdrp_1.tmp
upload: wolf18/rdrp_2.tmp to s3://serratus-public/notebook/201210_rdrp0/wolf18/rdrp_2.tmp
upload: ./gbRdRp_201212.fa to s3://serratus-public/notebook/201210_rdrp0/gbRdRp_201212.fa
upload: ./rdrp0.fa to s3://serratus-public/notebook/201210_rdrp0/rdrp0.fa


## Create rdrp0 database

In [None]:
# Load serratus-align container on EC2
# From base amazon linux 2
sudo yum install -y docker
sudo yum install -y git
sudo service docker start

git clone --branch diamond-dev https://github.com/ababaian/serratus.git; cd serratus

# If you want to upload containers to your repository, include this.
export DOCKERHUB_USER='serratusbio' # optional
sudo docker login # optional

# Build all containers and upload them docker hub repo (if available)
cd containers
./build_containers.sh   # run this in the folder 'serratus/containers'

sudo docker run --rm --entrypoint /bin/bash \
  -it serratus-align:latest

In [None]:
# rdrp0 v201213 for pilot run
mkdir rdrp0; cd rdrp0
GENOME='rdrp0'

# Download rdrp0
aws s3 cp s3://serratus-public/notebook/201210_rdrp0/rdrp0.fa ./

# Make diamond index for rdrp0
diamond makedb --in $GENOME.fa -d $GENOME

# Make fasta index for rdrp0
samtools faidx $GENOME.fa
mv $GENOME.fa.fai $GENOME.sumzer.tsv

md5sum * > $GENOME.md5

# use protref5 msa as place-holder
aws s3 cp s3://serratus-public/seq/protref5/protref5.msa ./rdrp0.msa

# 657d302bd62a9e0b588668101f581e4c  rdrp0.dmnd
# 8479a3347bbe73224cb2eac0c2138a92  rdrp0.fa
# 6bf2ffa27bb2b08bf0d6056b675fa348  rdrp0.sumzer.tsv
# e094fc7db19c07ffcedf8bc42963ab80  rdrp0.msa

# Upload to S3
aws s3 sync ./ s3://serratus-public/seq/$GENOME/


### Revision 0 - Clean-up "Virus Name" field

In parsing the "Virus Name" from `rdrp0`, the "." character turned up in quite a few virus names, which makes for icky downstream parsing. This is a manual removal of those periods from all rdrp0 names to make a 'clean' version.
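
The manual substitutions in this revision follow a pattern that can be generalized: drop a period that precedes an underscore or ends the name, and turn any other period into an underscore. The filter below is a sketch of that generalization, not the procedure actually used here. `clean_virus_names` is a hypothetical helper; it leaves the branch, family, and accession untouched and passes through headers that have no `:accession` part.

```shell
#!/usr/bin/env bash
# Remove "." from the virus-name part of ">branch.family.name:accession" headers.
# Rules (a generalization of the manual edits in this revision):
#   "._" -> "_"   trailing "." -> ""   any other "." -> "_"
clean_virus_names() {
  awk '
    /^>/ {
      colon = match($0, /:[^:]*$/)          # start of ":accession", 0 if absent
      if (colon == 0) { print; next }       # e.g. yaOV headers: leave as-is
      acc  = substr($0, colon)
      left = substr($0, 1, colon - 1)
      n = split(left, part, /\./)
      name = part[3]
      for (i = 4; i <= n; i++) name = name "." part[i]
      gsub(/\._/, "_", name)
      sub(/\.$/, "", name)
      gsub(/\./, "_", name)
      print part[1] "." part[2] "." name acc
      next
    }
    { print }'
}

# Usage: clean_virus_names < gbRdRp_201212.fa > gbRdRp_201212.clean.fa
```

Writing to a new file rather than editing in place makes it easy to diff the result against the manually cleaned version.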


In [None]:
# revision 0 folder
mkdir rev0; cd rev0

# Wolf18 Genbank Sequences
cp ../gbRdRp_201212.fa  ./
# Wolf20 Yangshan Sequences
cp ../yaRdRp_201212.fa  ./ 

In [None]:
# Update Sequence names (to make uniform)
# This is from ongoing work from

## sp. --> sp
grep "sp\." gbRdRp_201212.fa | less -NS -
sed -i 's/sp\./sp/g' gbRdRp_201212.fa

#      1 >rdrp1.Narnaviridae.Rhizophagus_sp._HR1_mitovirus_like_ssRNA:BAN85985.1
#      2 >rdrp1.Narnaviridae.Rhizophagus_sp._RF1_mitovirus:BAJ23143.2
#      3 >rdrp2.Picobirnaviridae.Picobirnavirus_sp.:AQS16638.1
#      4 >rdrp2.Picobirnaviridae.Picobirnavirus_sp.:AOW41971.1
#      5 >rdrp2.Picobirnaviridae.Picobirnavirus_sp.:AOW41973.1
#      6 >rdrp2.Iflaviridae.Iflavirus_sp.:APB88805.1
#      7 >rdrp2.unclassified.Chaetoceros_sp._RNA_virus_2:BAK40203.1
#      8 >rdrp2.unclassified.Posavirus_sp.:APQ44560.1
#      9 >rdrp2.unclassified.Posavirus_sp.:APQ44553.1
#     10 >rdrp2.unclassified.Posavirus_sp.:APQ44556.1
#     11 >rdrp2.unclassified.Posavirus_sp.:APQ44559.1
#     12 >rdrp2.unclassified.Posavirus_sp.:APQ44517.1
#     13 >rdrp2.unclassified.Basavirus_sp.:APQ44489.1
#     14 >rdrp2.unclassified.Basavirus_sp.:APQ44495.1
#     15 >rdrp2.unclassified.Posavirus_sp.:APQ44558.1
#     16 >rdrp2.unclassified.Basavirus_sp.:APQ44499.1
#     17 >rdrp2.unclassified.Basavirus_sp.:APQ44492.1
#     18 >rdrp2.unclassified.Basavirus_sp.:APQ44496.1
#     19 >rdrp2.unclassified.Basavirus_sp.:APQ44502.1
#     20 >rdrp2.unclassified.Posavirus_sp.:APQ44531.1
#     21 >rdrp2.unclassified.Posavirus_sp.:YP_009333148.1
#     22 >rdrp2.unclassified.Posavirus_sp.:APQ44537.1
#     23 >rdrp2.unclassified.Rasavirus_sp.:APQ44507.1
#     24 >rdrp2.unclassified.Husavirus_sp.:APQ44514.1
#     25 >rdrp2.unclassified.Rasavirus_sp.:APQ44506.1
#     26 >rdrp2.unclassified.Rasavirus_sp.:YP_009333305.1
#     27 >rdrp2.unclassified.Posavirus_sp.:APQ44547.1
#     28 >rdrp2.unclassified.Basavirus_sp.:APQ44500.1
#     29 >rdrp2.Picornaviridae.Sicinivirus_sp.:APR73491.1
#     30 >rdrp2.Picornaviridae.Picornaviridae_sp._rodent_Ee_PicoV_NX2015:APA29022.1
#     31 >rdrp2.Picornaviridae.Picornaviridae_sp._rodent_Mc_PicoV_Tibet2015:APA29023.1
#     32 >rdrp2.Picornaviridae.Enterovirus_sp.:AHY21610.1
#     33 >rdrp2.Picornaviridae.Picornaviridae_sp._rodent_CK_PicoV_Tibet2014:APA29019.1
#     34 >rdrp2.Picornaviridae.Picornaviridae_sp._rodent_Rn_PicoV_SX2015_1:APA29018.1
#     35 >rdrp2.Picornaviridae.Picornaviridae_sp._rodent_Ds_PicoV_IM2014:APA29017.1
#     36 >rdrp3.Picobirnaviridae.Picobirnavirus_sp.:AOW41972.1
#     37 >rdrp4.unclassified.unclassified_Rhizophagus_sp._RF1_medium_virus:BAJ23141.1
#     38 >rdrp4.Chrysoviridae.Fusarium_oxysporum_f._sp._dianthi_mycovirus_1:YP_009158913.1
#     39 >rdrp5.Bunyaviridae.Bunyavirus_sp.:AOY18806.1


In [None]:
## Manual interventions

grep ">" gbRdRp_201212.fa | cut -f3- -d'.' | cut -f1 -d':' | grep "\." - | less -NS

#      1 Chaetoceros_socialis_f._radians_RNA_virus_01
sed -i 's/f\./f/g' gbRdRp_201212.fa
#      2 Norovirus_dog_GVI.1_HKU_Ca026F_2007_HKG
sed -i 's/GVI\.1/GVI_1/g' gbRdRp_201212.fa
#      3 Norovirus_cat_GIV.2_CU081210E_USA_2010
sed -i 's/GIV\.2/GIV_2/g' gbRdRp_201212.fa
#      4 Norovirus_Hu_GII.12_CGMH42_2010_TW
#      5 Norovirus_Hu_GII.12_CGMH40_2010_TW
#      6 Norovirus_GII.17
sed -i 's/GII\.12/GII_12/g' gbRdRp_201212.fa
sed -i 's/GII\.17/GII_17/g' gbRdRp_201212.fa
#      7 Sapovirus_Hu_GI.2_BR_DF01_BRA_2009
sed -i 's/GI\.2/GI_2/g' gbRdRp_201212.fa
#      8 Sapovirus_GII.8
sed -i 's/GII\.8/GII_8/g' gbRdRp_201212.fa
#      9 St._Louis_encephalitis_virus
#     10 St._Louis_encephalitis_virus
sed -i 's/St\._L/St_L/g' gbRdRp_201212.fa
#     11 Fusarium_oxysporum_f._sp_dianthi_mycovirus_1
##

# Accession field does not contain a version number
# Do not change accessions at this point
grep ">" gbRdRp_201212.fa |  grep -v "\.[0-9]$" -
#>rdrp1.Leviviridae.Escherichia_virus_Qbeta:4R71
#>rdrp3.Flaviviridae.Douroucouli_hepatitis_GB_virus_A:T08841
#>rdrp3.Flaviviridae.Marmoset_hepatitis_GB_virus_A:T08839
#>rdrp3.Alphaflexiviridae.Potato_aucuba_mosaic_virus:2012194A
#>rdrp3.Virgaviridae.Barley_stripe_mosaic_virus:2211403A
#>rdrp4.Reoviridae.Simian_rotavirus:2R7Q
#>rdrp4.unclassified.White_button_mushroom_virus_1:T00494
#>rdrp5.Peribunyaviridae.La_Crosse_virus:5AMR_A
#>rdrp5.Orthomyxoviridae.Influenza_B_virus_(B_Memphis_13_2003):4WRT_B
#>rdrp5.Orthomyxoviridae.Influenza_A_virus_(A_little_yellow_shouldered_bat_Guatemala_060_2010(H17N10)):4WSB_B


md5sum *
# 7d2c1858d5d842ada3e8103c43f29e27  gbRdRp_201212.fa
# 7e6781ed5902ea6034393c0dae521c62  yaRdRp_201212.fa

# Overwrite previous gb versions
mv gbRdRp_201212.fa ../
mv yaRdRp_201212.fa ../

## Revision 1 - Cluster known + unclassified sequences

Some lessons learned so far... there are A LOT of new viruses to be discovered! From the ~9.5K datasets run with the `rdrp0` pilot, there are ~4,700 high-score (>50), high-divergence (55-85% aa id) hits at the family level. Because the datasets were sampled randomly from vertebrate/virome SRA queries, overlap is likely to be minimal here, but a conservative estimate would be 1,000 distinct RdRp, about 25% of the known biodiversity of sequences.

### Objectives

- A large number of "unclassified" sequences sit within each branch. These should be grouped relative to one another so that two unclassified sequences in the same branch do not "collide" in the read summaries.
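
One way to avoid such collisions is to give every unclassified cluster its own zero-padded label. The filter below is a minimal sketch of that numbering step. It assumes one family name per input line, and `number_unclassified` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Replace each "unclassified" line with a unique label: unc0001, unc0002, ...
# Other family names pass through unchanged. Input: one family name per line.
number_unclassified() {
  awk '
    $1 == "unclassified" { n++; printf "unc%04d\n", n; next }
    { print $1 }'
}

printf '%s\n' Coronaviridae unclassified Leviviridae unclassified | number_unclassified
```

Four digits are enough for up to 9,999 unclassified groups.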



In [1]:
# Serratus commit version
SERRATUS="/home/artem/serratus"
cd $SERRATUS

# Create local run directory
WORK="$SERRATUS/notebook/201210_rdrp0"
mkdir -p $WORK; cd $WORK

# S3 notebook path
S3_WORK='s3://serratus-public/notebook/201210_rdrp0/'

# date and version
date
git rev-parse HEAD # commit version

Tue Dec 15 20:57:00 PST 2020
6ac78a036910813c0f5fb2e7ef0b88599e683959


In [None]:
# Work on EC2 Instance

# Local usearch install
#The clustered database was made with usearch:
wget https://drive5.com/downloads/usearch11.0.667_i86linux32.gz
gzip -dc usearch11.0.667_i86linux32.gz > usearch
chmod 755 usearch; mv usearch /usr/bin/usearch

# Install seqkit
wget https://github.com/shenwei356/seqkit/releases/download/v0.12.0/seqkit_linux_amd64.tar.gz
tar -xvf seqkit*
sudo mv seqkit /usr/local/bin/
rm seqkit_linux*


In [None]:
# Initialize workspace

# Download rdrp working directory
mkdir rdrp0; cd rdrp0
aws s3 sync s3://serratus-public/notebook/201210_rdrp0/ ./

# revision 1 folder
mkdir rev1; cd rev1

# Wolf18 Genbank Sequences
cp ../gbRdRp_201212.fa  ./
# Wolf20 Yangshan Sequences
cp ../yaRdRp_201212.fa  ./ 


In [None]:
# in gbRdRp; separate out taxonomic and unclassified sequences
grep ">" gbRdRp_201212.fa | wc -l
# 4617

# -v inverse match
seqkit grep -v -r -p 'unclassified' gbRdRp_201212.fa > gb_tax.fa
grep ">" gb_tax.fa | wc -l
# 2863

seqkit grep -r -p 'unclassified' gbRdRp_201212.fa > gb_unc.fa
grep ">" gb_unc.fa | wc -l
# 1754


In [None]:
# Sort and Uclust unclassified sequences

usearch -sortbylength gb_tax.fa \
   -fastaout gb_tax.sort.fa

usearch -sortbylength gb_unc.fa \
   -fastaout gb_unc.sort.fa

cat gb_tax.sort.fa gb_unc.sort.fa > gb_cat.sort.fa


In [None]:
# Cluster sequences at 75% identity

# Prune TAX
usearch -cluster_smallmem gb_tax.sort.fa \
   -id 0.75 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc gb_tax.id75.uc \
   -centroids gb_tax.id75.fa

#      Seqs  2863
#  Clusters  1799
#  Max size  29
#  Avg size  1.6
#  Min size  1
# Singletons  1353, 47.3% of seqs, 75.2% of clusters


# Prune UNC
usearch -cluster_smallmem gb_unc.sort.fa \
   -id 0.75 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc gb_unc.id75.uc \
   -centroids gb_unc.id75.fa

#  Seqs      1754
#  Clusters  1660
#  Max size  6
#  Avg size  1.1
#  Min size  1
# Singletons  1581, 90.1% of seqs, 95.2% of clusters


# Prune CAT
usearch -cluster_smallmem gb_cat.sort.fa \
   -sortedby other \
   -id 0.75 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc gb_cat.id75.uc \
   -centroids gb_cat.id75.fa
   
#      Seqs  4617
#  Clusters  3433
#  Max size  29
#  Avg size  1.3
#  Min size  1
# Singletons  2892, 62.6% of seqs, 84.2% of clusters


## At 75%, no "unclassified" sequence groups with a taxonomic identifier

mkdir id75
mv *id75.* id75/

In [None]:
# Repeat process at 55%; base of diamond detection

# Prune TAX
usearch -cluster_smallmem gb_tax.sort.fa \
   -id 0.55 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc gb_tax.id55.uc \
   -centroids gb_tax.id55.fa

#      Seqs  2863
#  Clusters  828
#  Max size  115
#  Avg size  3.5
#  Min size  1
#  Singletons  466, 16.3% of seqs, 56.3% of clusters


# Prune UNC
usearch -cluster_smallmem gb_unc.sort.fa \
   -id 0.55 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc gb_unc.id55.uc \
   -centroids gb_unc.id55.fa

#      Seqs  1754
#  Clusters  1395
#  Max size  10
#  Avg size  1.3
#  Min size  1
#  Singletons  1171, 66.8% of seqs, 83.9% of clusters


# Prune CAT
usearch -cluster_smallmem gb_cat.sort.fa \
   -sortedby other \
   -id 0.55 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc gb_cat.id55.uc \
   -centroids gb_cat.id55.fa

#      Seqs  4617
#  Clusters  2150
#  Max size  115
#  Avg size  2.1
#  Min size  1
#  Singletons  1552, 33.6% of seqs, 72.2% of clusters

## At 55%, no "unclassified" sequence groups with a taxonomic identifier

mkdir id55
mv *.id55.* id55/

In [None]:
# Repeat process at 45%; base of diamond detection

# Prune TAX
usearch -cluster_smallmem gb_tax.sort.fa \
   -id 0.45 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc gb_tax.id45.uc \
   -centroids gb_tax.id45.fa

#      Seqs  2863
#  Clusters  598
#  Max size  155
#  Avg size  4.8
#  Min size  1
#  Singletons  302, 10.5% of seqs, 50.5% of clusters


# Prune UNC
usearch -cluster_smallmem gb_unc.sort.fa \
   -id 0.45 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc gb_unc.id45.uc \
   -centroids gb_unc.id45.fa

#      Seqs  1754
#  Clusters  1158
#  Max size  17
#  Avg size  1.5
#  Min size  1
#  Singletons  875, 49.9% of seqs, 75.6% of clusters


# Prune CAT
usearch -cluster_smallmem gb_cat.sort.fa \
   -sortedby other \
   -id 0.45 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc gb_cat.id45.uc \
   -centroids gb_cat.id45.fa

#      Seqs  4617
#  Clusters  1653
#  Max size  155
#  Avg size  2.8
#  Min size  1
#  Singletons  1068, 23.1% of seqs, 64.6% of clusters


## At 45%, no "unclassified" sequence groups with a taxonomic identifier.

mkdir id45
mv *.id45.* id45/

In [None]:
# Repeat process at 35%; base of diamond detection

# Prune TAX
usearch -cluster_smallmem gb_tax.sort.fa \
   -id 0.35 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc gb_tax.id35.uc \
   -centroids gb_tax.id35.fa

#      Seqs  2863
#  Clusters  548
#  Max size  156
#  Avg size  5.2
#  Min size  1
#  Singletons  259, 9.0% of seqs, 47.3% of clusters


# Prune UNC
usearch -cluster_smallmem gb_unc.sort.fa \
   -id 0.35 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc gb_unc.id35.uc \
   -centroids gb_unc.id35.fa

#      Seqs  1754
#  Clusters  1097
#  Max size  17
#  Avg size  1.6
#  Min size  1
#  Singletons  801, 45.7% of seqs, 73.0% of clusters


# Prune CAT
usearch -cluster_smallmem gb_cat.sort.fa \
   -sortedby other \
   -id 0.35 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc gb_cat.id35.uc \
   -centroids gb_cat.id35.fa

#      Seqs  4617
#  Clusters  1528
#  Max size  156
#  Avg size  3.0
#  Min size  1
#  Singletons  941, 20.4% of seqs, 61.6% of clusters


## At 35%, no "unclassified" sequence groups with a taxonomic identifier.

mkdir id35
mv *.id35.* id35/

In [None]:
mkdir -p all
mv *.fa all/

#### Conclusions

The clustering here is actually not as extensive as I originally had thought it would be. It looks like 

#### Assigning Sequence Clusters
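
The cluster table (centroid followed by its members, one cluster per row) can also be built from a `.uc` file in a single pass. The sketch below assumes the tab-separated UC layout in which field 1 is the record type (`S` for a centroid, `H` for a hit), field 2 is the cluster number, and field 9 is the query label. `uc_to_tsv` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Convert a UC cluster file to one row per cluster: CENTROID<TAB>hit1<TAB>hit2...
# Assumes S records (centroids) precede the H records (hits) of their cluster.
uc_to_tsv() {
  awk -F '\t' '
    $1 == "S" { order[++n] = $2; members[$2] = $9 }    # new cluster, centroid label
    $1 == "H" { members[$2] = members[$2] "\t" $9 }    # hit assigned to cluster $2
    END { for (i = 1; i <= n; i++) print members[order[i]] }' "$1"
}

# Usage: uc_to_tsv id45/gb_cat.id45.uc > cluster45.tsv
```

Unlike the per-centroid `grep` approach, this lists each centroid once per row, because the `C` summary records are ignored.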


In [None]:
# Identity to use for clustering
ID='45'

# Extract centroid sequence names
grep ">" id$ID/gb_cat.id$ID.fa \
  | sed 's/>//g' \
  > centroid$ID.name.tmp


# Convert uc format to TSV
# CENTROID hit1 hit2 hitN ...
rm -f clust.tmp; touch clust.tmp

# Read through centroid names
while read -r line; do
  # parse to just accession number
  acc=$(echo $line | sed 's/.*://g')
  
  # Search accession in uclust file
  # extract seqname and make it one line
  grep "$acc" id$ID/gb_cat.id$ID.uc \
    | cut -f 9 \
    | tr '\n' '\t' \
    >> clust.tmp
    
  echo -e "\n" >> clust.tmp

done < centroid$ID.name.tmp

# remove empty rows
cat clust.tmp | sed '/^$/d' > cluster$ID.tsv


In [None]:
# Assign "family_id" to sequences (unclassified)

# Field 1: Centroid Name
# Field 2+: Sequence name
cut -f 1 cluster$ID.tsv \
  | sed 's/\./\t/g' - \
  | sed 's/:/\t/g' - \
  > centroid$ID.info

# Centroid Information tsv
# 1: Branch
# 2: Family
# 3: Rep. Viral name
# 4: Rep. Viral accession
# 5: Rep. Viral accession version
# 6: Re-annotated Viral Family Name

# Iterate through centroid file
# assign each "unclassified family" a unique ordinal number
# 1055 --> use 4 leading zeros

# ID
N=1
rm -f family2.tmp

while read -r line; do
  # Read Family name
  #echo $line
  
  family=$(echo $line | cut -f 2 -d' ' -)  

  if [[ "$family" = "unclassified" ]]; then

    uncN=$(printf "%04d" $N)
    echo "unc$uncN" >> family2.tmp
    
    # increment up
    N=$((N+1))
    
  else
    echo $family >> family2.tmp
  
  fi

done < centroid$ID.info

# Prepend the new family names as the first column (read back with `cut -f 1`)
cp centroid$ID.info centroid.tmp
paste family2.tmp centroid.tmp  > centroid$ID.info
rm *.tmp

In [None]:
## Assign Centroid Family-Name to Each Sequence
cat all/gbRdRp_201212.fa all/yaRdRp_201212.fa > rdrp0_r1.fa

#
cut -f 1  centroid$ID.info > fam.name.tmp
cut -f 2- cluster$ID.tsv   > cluster.members.tmp
paste fam.name.tmp cluster.members.tmp > assign.family.tmp
# col 1 == new name
# col N == old sequence name

while read -r line; do
  newfam=$(echo $line | cut -f1 -d' ' -)
  echo $newfam
  
  echo $line \
    | cut -f 2- -d' ' - \
    > members.tmp
    
  cat members.tmp \
    | tr " " "\n" \
    > members2.tmp
    
    while read -r line2; do
      # rdrp5.Sunviridae.Sunshine_Coast_virus:YP_009094051.1
      branch=$(echo $line2 | cut -f1 -d'.')
      oldfam=$(echo $line2 | cut -f2 -d'.')
      virname=$(echo $line2 | cut -f3 -d'.' | cut -f1 -d':')
      acc=$(echo $line2 | cut -f2 -d':')
      
      echo $branch $oldfam $virname $acc
      echo $branch $newfam $virname $acc
      
      # Inline rename
      matchline=$( echo $(grep -n $acc rdrp0_r1.fa  | cut -f1 -d':' -)s)
      
      sed -i "$matchline/.*/>$branch.$newfam.$virname:$acc/" rdrp0_r1.fa
      
    done < members2.tmp
    
    echo ''

done < assign.family.tmp

grep ">" rdrp0_r1.fa | sed 's/>//g' > rdrp0_r1.fai

rm *.tmp
rm gbRdRp.fa yaRdRp.fa

## Revision 2 - RdRp from GenBank
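
This revision turns tab-separated alignment hits into FASTA files by column position, which is easy to get wrong when the output field list changes. The sketch below selects the sequence column by name instead. `tab_to_fasta` is a hypothetical helper, and it assumes the caller passes the same field names, in the same order, that were given to the aligner.

```shell
#!/usr/bin/env bash
# Convert tab-separated hits (stdin) to FASTA, choosing the sequence column by name.
# Usage: tab_to_fasta WANTED_FIELD FIELD1 FIELD2 ... < hits.tsv
# The first column is used as the FASTA header; rows with an unexpected
# number of fields are skipped.
tab_to_fasta() {
  local want="$1"; shift
  awk -F '\t' -v want="$want" -v cols="$*" '
    BEGIN {
      n = split(cols, name, " ")
      for (i = 1; i <= n; i++) if (name[i] == want) col = i
      if (!col) { print "column not found: " want > "/dev/stderr"; exit 1 }
    }
    NF == n { print ">" $1 "\n" $col }'
}

# Example with a hypothetical field list:
# FIELDS="qseqid qstart qend qseq qseq_translated"
# tab_to_fasta qseq_translated $FIELDS < hits.tsv > hits.aa.fa
```

Naming the column means the extraction keeps working when fields are added to or removed from the output format, as long as the field list passed in is kept in sync.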

In [None]:
mkdir rev2; cd rev2
cp ../rdrp0.fa ./

In [None]:
# Install diamond
# DIAMONDVERSION='0.9.35'
# wget --quiet https://github.com/bbuchfink/diamond/releases/download/v"$DIAMONDVERSION"/diamond-linux64.tar.gz
# tar -xvf diamond-linux64.tar.gz
# rm diamond-linux64.tar.gz
# sudo mv    diamond /usr/local/bin/

In [None]:
# Install diamond
DIAMONDVERSION='2.0.6-dev'
cd ~/

# Libraries for diamond
yum -y install gcc gcc-c++ glibc-devel \
  cmake  patch automake zlib-devel

# grab latest with fix from Benjamin
git clone https://github.com/bbuchfink/diamond.git
cd diamond

mkdir bin; cd bin
cmake ..
make -j4
sudo make install

# build
# cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_BUILD_MARCH=nehalem ..
# make && make install

In [None]:
GENOME='rdrp0'
cp ../rev1/rdrp0_r1.fa ./rdrp0.fa

diamond makedb --in $GENOME.fa -d $GENOME

In [None]:
# Link to all GenBank viral nucleotide sequences
ln -s ../ntViro_gb201205.fa ./

# Input file
IN='ntViro_gb201205.fa'
#IN='tmp.fa'

GENOME='rdrp0'

# Output name
OUTNAME='gbViro_rdrp'

# Diamond blastx alignment
time cat $IN |\
diamond blastx \
  -d "$GENOME".dmnd \
  --unal 0 \
  -k 1 \
  -p 4 \
  -b 1 \
  -f 6 qseqid qstart qend qlen qstrand \
       sseqid sstart send slen \
       pident evalue cigar \
       qseq qseq_translated \
  > "$OUTNAME".pro
  
  
# real    227m38.435s
# user    899m1.421s
# sys     0m17.285s


# qseq
# qseq_translated 
# full_qseq_mate

# tmp.fa timing tests:
# tail -n +20000000 ntViro_gb201205.fa | head -n 500000 > tmp.fa

# default
# real    5m22.409s
# user    21m6.984s
# sys     0m0.410s

# --sensitive
# real    11m19.287s
# user    44m18.393s
# sys     0m0.595s

# output size is equal

# NOTE: sseq returns protein sequence from the database, not the query
# Changed field 15 sseq to qseq
# use qseq instead

In [None]:
# Columns follow the -f 6 list above: 13 = qseq, 14 = qseq_translated
# CDS of results
cut -f1,13 $OUTNAME.pro |  sed 's/^/>/g' | sed 's/\t/\n/g' > $OUTNAME.cds_hit.fa

# AA of results
cut -f1,14 $OUTNAME.pro |  sed 's/^/>/g' | sed 's/\t/\n/g' > $OUTNAME.aa_hit.fa

In [None]:
# UCLUST
INPUT="$OUTNAME.aa_hit.fa"
OUTPUT="$OUTNAME.aa_hit.id75.fa"

usearch -sortbylength $INPUT \
   -fastaout tmp.sort.fa


# Cluster hits at 75% identity
usearch -cluster_smallmem tmp.sort.fa \
   -id 0.75 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc $OUTPUT.uc \
   -centroids $OUTPUT

#      Seqs  318071 (318.1k)
#  Clusters  4081
#  Max size  75551 (75.6k)
#  Avg size  77.9
#  Min size  1

# UCLUST
OUTPUT="$OUTNAME.aa_hit.id85.fa"

# Cluster hits at 85% identity
usearch -cluster_smallmem tmp.sort.fa \
   -id 0.85 \
   -maxaccepts 4 \
   -maxrejects 64 \
   -maxhits 1 \
   -uc $OUTPUT.uc \
   -centroids $OUTPUT

#      Seqs  318071 (318.1k)
#  Clusters  4748
#  Max size  75528 (75.5k)
#  Avg size  67.0
#  Min size  1


In [None]:
# From base amazon linux 2
sudo yum install -y docker
sudo yum install -y git
sudo service docker start
sudo docker run --rm --entrypoint /bin/bash -it serratus-align:latest

In [None]:
# HMMER
wget http://eddylab.org/software/hmmer/hmmer-3.3.1.tar.gz