# Run: rdrp0 pilot experiments

```
Lead     : ababaian
Issue    : 
Version  : v0.3.5-dev : diamond-dev branch
start    : 2020 12 13
complete : 2020 12 xx
files    : ~/serratus/notebook/201213_ab/
s3_files : s3://serratus-public/notebook/201213_ab/
output   : s3://serratus-public/out/201213_rdrp0/
```

### Intro/Objectives

- Pilot run for the `rdrp0`-based reference built [in 201210_RdRp_panproteome_v1](201210_RdRp_panproteome_v1.ipynb)
- Run 1000 viromes to get a 'global estimate' of how much novelty we would be looking at with an RdRp-based search across large datasets


In [None]:
# Fire up EC2 Instance
sudo yum install -y docker
sudo yum install -y git
sudo service docker start

# Download latest serratus repo
git clone -b diamond-dev https://github.com/ababaian/serratus.git; cd serratus/containers

# If you want to upload containers to your repository, include this.
export DOCKERHUB_USER='serratusbio' # optional
sudo docker login # optional

# Build all containers and upload them to the Docker Hub repo (if available)
./build_containers.sh

## Virome + Metatranscriptome

Query: `"VIRAL METAGENOME" OR "VIROME" OR "VIROMIC" OR "VIRAL RNA" OR "METATRANSCRIPTOMIC" NOT "METAGENOMIC" NOT amplicon[All Fields] AND "platform illumina"[Properties] AND cluster_public[prop]`

Date: `2000904`
Return: `60327`

Sub-sample to 1000 datasets (use same sub-sample file as pmito5 pilot search on 201012)

Saved to "$WORK"

### Initialize local workspace

In [1]:
# Serratus commit version
SERRATUS="/home/artem/serratus"
cd $SERRATUS

# Create local run directory
WORK="$SERRATUS/notebook/201213_ab"
mkdir -p $WORK; cd $WORK

# S3 notebook path
S3_WORK='s3://serratus-public/notebook/201213_ab/'

# date and version
date
git rev-parse HEAD # commit version

Sun Dec 13 17:15:03 PST 2020
6ac78a036910813c0f5fb2e7ef0b88599e683959


In [2]:
cd $WORK

# Copy from pmito5 pilot test
cp ../201012_ab/viro1k_SraRunInfo.csv ./

ls -alh

total 500K
drwxrwxr-x  2 artem artem 4.0K Dec 13 17:16 .
drwxr-xr-x 41 artem artem  12K Dec 13 17:15 ..
-rw-rw-r--  1 artem artem 482K Dec 13 17:16 viro1k_SraRunInfo.csv


In [3]:
aws s3 sync ./ $S3_WORK

upload: ./viro1k_SraRunInfo.csv to s3://serratus-public/notebook/201213_ab/viro1k_SraRunInfo.csv


In [13]:
## Initial run very successful; add another 10k sequences
cd $WORK
aws s3 cp s3://lovelywater2/sra/vert_SraRunInfo.csv.gz ./
aws s3 cp s3://lovelywater2/sra/viro_SraRunInfo.csv.gz ./

gzip -d vert_SraRunInfo.csv.gz
gzip -d viro_SraRunInfo.csv.gz

download: s3://lovelywater2/sra/vert_SraRunInfo.csv.gz to ./vert_SraRunInfo.csv.gz
download: s3://lovelywater2/sra/viro_SraRunInfo.csv.gz to ./viro_SraRunInfo.csv.gz


In [14]:
# Randomly select 4000 viro and 5000 vert samples
head -n1 viro_SraRunInfo.csv > sra.header

# Shuffle each run table and keep the first N rows
shuf viro_SraRunInfo.csv | head -n 4000 > viro4k.tmp
shuf vert_SraRunInfo.csv | head -n 5000 > vert5k.tmp

cat sra.header viro4k.tmp \
  > viro4k_SraRunInfo.csv

cat sra.header vert5k.tmp \
  > vert5k_SraRunInfo.csv
  
rm *.tmp
wc -l *.csv
md5sum *.csv

shuf: write error: Broken pipe
shuf: write error
shuf: write error: Broken pipe
shuf: write error
    5001 vert5k_SraRunInfo.csv
   94909 vert_SraRunInfo.csv
    1001 viro1k_SraRunInfo.csv
    4001 viro4k_SraRunInfo.csv
   22252 viro_SraRunInfo.csv
  127164 total
8c176eb45f4362e4356b4e5fe4595e98  vert5k_SraRunInfo.csv
e39b50b78465f7e12676ef18d179de5f  vert_SraRunInfo.csv
c16a1b2da03eaf1088933a1de329ce2f  viro1k_SraRunInfo.csv
8cd67b3ae51008b389c87878c0ad23f7  viro4k_SraRunInfo.csv
e9222b54cee8a65bc3781589f5cbf642  viro_SraRunInfo.csv
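
The `shuf: write error: Broken pipe` messages in the output above appear because `head` closes the pipe before `shuf` has finished writing; the sampled files are still complete. A minimal sketch of an equivalent sub-sampling step that avoids the message and keeps the header row out of the shuffle is given here. The function name `subsample_runinfo` is hypothetical and not part of Serratus.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Serratus): sub-sample N records from an
# SraRunInfo CSV, keeping the header as the first line of the output.
subsample_runinfo() {
  local infile="$1" n="$2" outfile="$3"
  # Write the header first so it is never part of the shuffle.
  head -n 1 "$infile" > "$outfile"
  # `shuf -n` stops after N lines on its own, so no downstream `head`
  # closes the pipe early.
  tail -n +2 "$infile" | shuf -n "$n" >> "$outfile"
}

# Example (file names as used in this notebook):
# subsample_runinfo viro_SraRunInfo.csv 4000 viro4k_SraRunInfo.csv
```

The output has N + 1 lines (header plus N records), matching the `wc -l` counts shown above.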


### Terraform Initialize

In [6]:
# For rapid batching; copy out serratus folder
# PROTEIN / DNA MUST BE SET IN CONFIG FILE
# LINE 153
#   options            = "-k ${module.work_bucket.name} -a bowtie2"
#   options            = "-k ${module.work_bucket.name} -a diamond"

TF=$SERRATUS/terraform/main
cd $TF
git diff main.tf
terraform init

# Launch Terraform Cluster
# Initialize the serratus cluster with minimal nodes
terraform apply -auto-approve

diff --git a/terraform/main/main.tf b/terraform/main/main.tf
index de2d00d..0cbcb16 100644
--- a/terraform/main/main.tf
+++ b/terraform/main/main.tf
@@ -12,14 +12,14 @@ variable "aws_region" {
 }
 
 variable "dl_size" {
-  type    = number
-  default = 0
+  type        = number
+  default     = 0
   description = "Default number of downloader nodes (ASG)"
 }
 
 variable "align_size" {
-  type    = number
-  default = 0
+  type        = number
+  default     = 0
   description = "Default number of aligner nodes (ASG)"
 }
 
@@ -38,14 +38,22 @@ variable "dockerhub_account" {
 }
 
 variable "scheduler_port" {
-  type  = number
+  type    = number
   default = 8000
 }
 
+variable "output_bucket" {
+  type = string
+}
+

 }
 
 resource "local_file" "upload_sra" {
-  filename = "${path.module}/uploadSRA.sh"
+  filename        = "${path.module}/uploadSRA.sh"
   file_permission = 0777
-  content = <<-EOF
+  content         = <<-EOF
     #!/bin/bash
     # Serratus - uploadSRA.sh
@@ -289,7 +297,6 @@ resource "local_file" "upload_sra" {
   EOF
 }
 
-
 // OUTPUT ##############################
 output "help" {
   value = <<-EOF
Initializing modules...

Initializing the backend...

Initializing provider plugins...

The following providers do not have any version constraints in configuration,
so the latest version was installed.

To prevent automatic upgrades to new major versions that may contain breaking
changes, it is recommended to add version = "..." constraints to the
corresponding provider blocks in configuration, with the constraint strings
suggested below.

* pro

module.monitoring.aws_iam_role_policy.cloudwatch: Creating...
module.align.aws_iam_role_policy.ec2Terminate: Creation complete after 1s [id=SerratusIamRole-serratus-align:TerminateEC2Instances-serratus-align]
module.merge.module.iam_role.aws_iam_role_policy_attachment.attachment["arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"]: Creating...
aws_security_group.internal: Creation complete after 5s [id=sg-0c88c3a7b5ddfad30]
module.download.module.iam_role.aws_iam_instance_profile.profile: Creating...
module.monitoring.aws_iam_role_policy_attachment.attachment: Creation complete after 1s [id=SerratusIamRole-monitor-20201214044943302600000005]
module.download.module.iam_role.aws_iam_role_policy_attachment.attachment["arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"]: Creating...
module.monitoring.module.ecs_cluster.aws_iam_instance_profile.p: Creation complete after 2s [id=

module.monitoring.aws_ecs_service.monitor: Creation complete after 1s [id=arn:aws:ecs:us-east-1:797308887321:service/serratus-monitor/serratus-monitor]
module.monitoring.aws_eip.monitor: Creation complete after 3s [id=eipalloc-059dfb592ca78c964]
module.scheduler.aws_eip.sch: Creation complete after 3s [id=eipalloc-0bb73d9008f3d6d60]
local_file.create_tunnel: Creating...
local_file.hosts: Creating...
local_file.create_tunnel: Creation complete after 0s [id=810d3df69a4daaf46df2be4084b6bd1fd04a1a3f]
module.align.aws_launch_configuration.worker: Creating...
module.merge.aws_launch_configuration.worker: Creating...
module.download.aws_launch_configuration.worker: Creating...
local_file.hosts: Creation complete after 0s [id=c74fd21bb49020537074098a5e423db163c4a4de]
module.align.aws_launch_configuration.worker: Creation complet

Serratus Config backup 
```
{
  "ALIGN_ARGS": "--very-sensitive-local",
  "ALIGN_MAX_INCREASE": 25,
  "ALIGN_SCALING_CONSTANT": 0.0215,
  "ALIGN_SCALING_ENABLE": true,
  "ALIGN_SCALING_MAX": 20,
  "CLEAR_INTERVAL": 999999,
  "DL_ARGS": "",
  "DL_MAX_INCREASE": 10,
  "DL_SCALING_CONSTANT": 0.1,
  "DL_SCALING_ENABLE": true,
  "DL_SCALING_MAX": 10,
  "GENOME": "protref5b",
  "MERGE_ARGS": "protein",
  "MERGE_MAX_INCREASE": 25,
  "MERGE_SCALING_CONSTANT": 0.1,
  "MERGE_SCALING_ENABLE": true,
  "MERGE_SCALING_MAX": 20,
  "SCALING_INTERVAL": 120,
  "VIRTUAL_SCALING_INTERVAL": 35
}
```

In [7]:
cd $TF

# Open SSH tunnels to the monitor
./create_tunnels.sh

# If you get an "Address already in use" error on a port,
# find and kill the stale SSH tunnel:
# ps aux | grep ssh
# sudo kill <PID of SSH>

bind [127.0.0.1]:5432: Address already in use
Tunnels created:
    localhost:3000 = grafana
    localhost:9090 = prometheus
    localhost:5432 = postgres
    localhost:8000 = scheduler


In [9]:
cd $WORK
BATCH='viro1k_SraRunInfo.csv'
wc -l $WORK/$BATCH

1001 /home/artem/serratus/notebook/201213_ab/viro1k_SraRunInfo.csv


In [10]:
# Upload SraRunInfo.csv into Serratus
cd $TF
./uploadSRA.sh $WORK/$BATCH

Loading SRARunInfo into scheduler 
  File: /home/artem/serratus/notebook/201213_ab/viro1k_SraRunInfo.csv
  date: Sun Dec 13 20:57:15 PST 2020
  wc  : 1001 /home/artem/serratus/notebook/201213_ab/viro1k_SraRunInfo.csv
  md5 : c16a1b2da03eaf1088933a1de329ce2f  /home/artem/serratus/notebook/201213_ab/viro1k_SraRunInfo.csv


--------------------------
tmp.chunk00
1001 tmp.chunk00_sraRunInfo.csv
50092fc1d758ec832bfef82c0f033de8  tmp.chunk00_sraRunInfo.csv
{"inserted_rows":1000,"total_rows":1000}


 uploadSRA complete.


In [15]:
cd $WORK
BATCH='viro4k_SraRunInfo.csv'
wc -l $WORK/$BATCH

# Upload SraRunInfo.csv into Serratus
cd $TF
./uploadSRA.sh $WORK/$BATCH

4001 /home/artem/serratus/notebook/201213_ab/viro4k_SraRunInfo.csv
Loading SRARunInfo into scheduler 
  File: /home/artem/serratus/notebook/201213_ab/viro4k_SraRunInfo.csv
  date: Sun Dec 13 22:04:49 PST 2020
  wc  : 4001 /home/artem/serratus/notebook/201213_ab/viro4k_SraRunInfo.csv
  md5 : 8cd67b3ae51008b389c87878c0ad23f7  /home/artem/serratus/notebook/201213_ab/viro4k_SraRunInfo.csv


--------------------------
tmp.chunk00
4001 tmp.chunk00_sraRunInfo.csv
3c6a9640316799584a412a05d123fb27  tmp.chunk00_sraRunInfo.csv
{"inserted_rows":4000,"total_rows":5000}


 uploadSRA complete.


In [16]:
cd $WORK
BATCH='vert5k_SraRunInfo.csv'
wc -l $WORK/$BATCH

# Upload SraRunInfo.csv into Serratus
cd $TF
./uploadSRA.sh $WORK/$BATCH

5001 /home/artem/serratus/notebook/201213_ab/vert5k_SraRunInfo.csv
Loading SRARunInfo into scheduler 
  File: /home/artem/serratus/notebook/201213_ab/vert5k_SraRunInfo.csv
  date: Sun Dec 13 22:05:10 PST 2020
  wc  : 5001 /home/artem/serratus/notebook/201213_ab/vert5k_SraRunInfo.csv
  md5 : 8c176eb45f4362e4356b4e5fe4595e98  /home/artem/serratus/notebook/201213_ab/vert5k_SraRunInfo.csv


--------------------------
tmp.chunk00
5001 tmp.chunk00_sraRunInfo.csv
cfafa5950295c6a6cbc42dd120c8f83a  tmp.chunk00_sraRunInfo.csv
{"inserted_rows":5000,"total_rows":10000}


 uploadSRA complete.
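
Each upload above ends with a JSON response from the scheduler. One sanity check is that `inserted_rows` equals the number of data rows in the CSV (the `wc -l` count minus the header line). A minimal sketch of that check follows; `check_inserted_rows` is a hypothetical helper, and it assumes the response is the compact JSON shown above and that the CSV ends with a newline.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of uploadSRA.sh): compare the scheduler's
# reported "inserted_rows" with the number of data rows in the uploaded CSV.
check_inserted_rows() {
  local csv="$1" response="$2"
  local expected inserted
  # Data rows = total lines minus the header line.
  expected=$(( $(wc -l < "$csv") - 1 ))
  # Pull the integer that follows "inserted_rows": out of the response text.
  inserted=$(printf '%s\n' "$response" \
    | sed -n 's/.*"inserted_rows": *\([0-9][0-9]*\).*/\1/p')
  if [ "$inserted" = "$expected" ]; then
    echo "OK: $inserted rows inserted"
  else
    echo "MISMATCH: expected $expected, scheduler reported ${inserted:-none}"
    return 1
  fi
}

# Example, using the response printed for the first batch:
# check_inserted_rows "$WORK/viro1k_SraRunInfo.csv" '{"inserted_rows":1000,"total_rows":1000}'
```

Note that `total_rows` is cumulative across batches in the output above (1000, 5000, 10000), so only `inserted_rows` is compared per file.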


## Run Serratus

In [17]:
# Set Cluster Parameters =============================
## get Config File (if it doesn't exist)
# curl localhost:8000/config | jq > serratus-config.json

cd $TF
# Make local changes to config file
echo "  Cluster Config File: "
cat serratus-config.json
echo ""
echo ""
# Re-upload config file
curl -T serratus-config.json localhost:8000/config

  Cluster Config File: 
{
  "ALIGN_ARGS": "--very-sensitive-local",
  "ALIGN_MAX_INCREASE": 25,
  "ALIGN_SCALING_CONSTANT": 0.1,
  "ALIGN_SCALING_ENABLE": true,
  "ALIGN_SCALING_MAX": 400,
  "CLEAR_INTERVAL": 999999,
  "DL_ARGS": "",
  "DL_MAX_INCREASE": 10,
  "DL_SCALING_CONSTANT": 0.1,
  "DL_SCALING_ENABLE": true,
  "DL_SCALING_MAX": 100,
  "GENOME": "rdrp0",
  "MERGE_ARGS": "protein",
  "MERGE_MAX_INCREASE": 25,
  "MERGE_SCALING_CONSTANT": 0.1,
  "MERGE_SCALING_ENABLE": true,
  "MERGE_SCALING_MAX": 20,
  "SCALING_INTERVAL": 120,
  "VIRTUAL_SCALING_INTERVAL": 35
}

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
{"ALIGN_ARGS":"--very-sensitive-local","ALIGN_MAX_INCREASE":25,"ALIGN_SCALING_CONSTANT":0.1,"ALIGN_SCALING_ENABLE":true,"ALIGN_SCALING_MAX":400,"CLEAR_INTERVAL":999999,"DL_ARGS":"","DL_MAX_I
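
Relative to the backup config recorded earlier in this notebook, the file uploaded here changes `ALIGN_SCALING_CONSTANT`, `ALIGN_SCALING_MAX`, `DL_SCALING_MAX` and `GENOME`. A minimal sketch for listing such changes before re-uploading is shown below. The function `config_changes` and the backup file name are hypothetical, and the sketch assumes both files are pretty-printed with one key per line, as `jq` produces.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the settings in NEW that differ from OLD.
# Assumes one "KEY": value pair per line in both files.
config_changes() {
  local old="$1" new="$2"
  # `diff` marks lines that exist only in the second file with '>';
  # strip that marker and the indentation, and print only those lines.
  diff "$old" "$new" | sed -n 's/^> *//p'
}

# Example (the backup file name is an assumption):
# config_changes serratus-config.backup.json serratus-config.json
```

Reviewing this list before `curl -T` makes it harder to launch a run with a stale `GENOME` or scaling limit.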

### Error handling

In [None]:
## Stop postgres if it's running 
# systemctl stop postgresql

## Connect to postgres
# psql -h localhost postgres postgres

#  psql -h localhost postgres postgres -c "DELETE FROM blocks WHERE state = 'done';"

### ACCESSION OPERATIONS
## Reset SPLITTING accessions to NEW
# UPDATE acc SET state = 'new' WHERE state = 'splitting';

## Reset SPLIT_ERR accessions to NEW
## (repeated failures can be missing SRA data)
# UPDATE acc SET state = 'new' WHERE state = 'split_err';

## Reset MERGE_ERR accessions to SPLIT_DONE
# UPDATE acc SET state = 'split_done' WHERE state = 'merge_err';

## Clear DONE Accessions (ONLY ON COMPLETION)
# DELETE FROM acc WHERE state = 'merge_done';

### BLOCK OPERATIONS

##  Reset FAIL blocks to NEW
# UPDATE blocks SET state = 'new' WHERE state = 'fail';

# Reset ALIGNING blocks to NEW
# UPDATE blocks SET state = 'new' WHERE state = 'aligning';

# Clear Done
# DELETE FROM blocks WHERE state = 'done';

# RESET STATE
# DELETE FROM blocks WHERE state = 'done';
# DELETE FROM blocks WHERE state = 'fail';
#
#
# DELETE FROM acc WHERE state = 'split_err';
# DELETE FROM acc WHERE state = 'merging';
# DELETE FROM acc WHERE state = 'merge_err';
# DELETE FROM acc WHERE state = 'split_done';
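
The reset statements above all follow the same `UPDATE <table> SET state = ... WHERE state = ...` pattern. A minimal sketch that generates one of them is given below, so the statement can be read before it is run. `reset_state_sql` is a hypothetical helper; it only prints SQL, and the table names `acc` and `blocks` are the ones used in the comments above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build one state-reset statement for the scheduler DB.
# It only prints SQL; executing it is left to the operator.
reset_state_sql() {
  local table="$1" from="$2" to="$3"
  # Only the two tables referenced in this notebook are accepted.
  case "$table" in
    acc|blocks) ;;
    *) echo "unknown table: $table" >&2; return 1 ;;
  esac
  printf "UPDATE %s SET state = '%s' WHERE state = '%s';\n" "$table" "$to" "$from"
}

# Example: reset failed blocks to new, sending the statement through the
# SSH tunnel (connection arguments copied from the psql command above):
# reset_state_sql blocks fail new | psql -h localhost postgres postgres
```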


In [None]:
# Nuke Shutdown
cd $TF

aws ec2 describe-instances \
  --filters Name=tag:Name,Values=serratus-align-instance \
  > align_instances.json

jq '.Reservations[].Instances[].InstanceId' -r align_instances.json \
  | pv -l \
  | xargs -n10 -P10 aws ec2 terminate-instances --instance-ids

In [None]:
cd $WORK

aws s3 cp hutest_SraRunInfo.csv s3://serratus-public/out/201025_pmito5/
aws s3 cp viro1k_SraRunInfo.csv s3://serratus-public/out/201025_pmito5/

psummary: `https://serratus-public.s3.amazonaws.com/out/201213_rdrp0/psummary/<SRA>.psummary`

SRA site: `https://www.ncbi.nlm.nih.gov/sra/?term=<SRA>`
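
A minimal sketch for building these two URLs from an accession is shown below. The function names are hypothetical; the URL templates are the ones given above.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: expand the URL templates above for one accession.
psummary_url() {
  printf 'https://serratus-public.s3.amazonaws.com/out/201213_rdrp0/psummary/%s.psummary\n' "$1"
}

sra_url() {
  printf 'https://www.ncbi.nlm.nih.gov/sra/?term=%s\n' "$1"
}

# Example: fetch one psummary (requires network access):
# curl -s "$(psummary_url SRR8203524)"
```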

### Possible False Negatives

- [SRR8203524](https://www.ncbi.nlm.nih.gov/sra/?term=SRR8203524) is a rotavirus/reovirus library, yet there were no significant hits in the diamond alignments. This could be because it is a miRNA library.
> Name: rtv_SSCRTV_64272_20170228_SISPA160_92xSSCRTV_miseq_nr_trim_BC134_3N

```
sra=SRR8203524;SUMZER_COMMENT=sra=SRR8203524,genome=rdrp0,date=201214-05:08,type=protein;totalalns=0;readlength=0;truncated=no;
```


### Clear single-virus libraries

- [SRR11780077](https://www.ncbi.nlm.nih.gov/sra/?term=SRR11780077) clean SARS-CoV-2 sequencing
> Name: RNA-Seq of Severe acute respiratory syndrome coronavirus 2

```
sra=SRR11780077;SUMZER_COMMENT=sra=SRR11780077,genome=rdrp0,date=201214-05:07,type=protein;totalalns=18349;readlength=72;truncated=no;
sra=SRR11780077;famcvg=WWWAAAAAAAAAWAAAAAAAAOOOA;fam=rdrp2;score=100;pctid=98;alns=18349;avgcols=23;
sra=SRR11780077;gencvg=WWWAAAAAAAAAWAAAAAAAAOOOA;gen=rdrp2.Coronaviridae;score=100;pctid=98;alns=18349;avgcols=23;
sra=SRR11780077;seqcvg=WWWAAWAAAAAAWAAAAAAAAOOOA;seq=rdrp2.Coronaviridae.Bat_CoV_279_2005:P0C6V9.1;score=100;pctid=98;alns=17764;avgcols=23;
sra=SRR11780077;seqcvg=____oU__________u_.______;seq=rdrp2.Coronaviridae.Bat_Hp_betacoronavirus_Zhejiang2013:YP_009072438.1;score=10;pctid=96;alns=273;avgcols=24;
sra=SRR11780077;seqcvg=_____u_____________.____w;seq=rdrp2.Coronaviridae.Betacoronavirus_Erinaceus_VMC_DEU_2012:YP_008719930.1;score=6;pctid=94;alns=21;avgcols=18;
```
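
Each psummary record is a `;`-separated list of `key=value` pairs. A minimal sketch for pulling a single field out of such records is given below. `psummary_field` is a hypothetical helper, not part of Serratus, and it assumes one record per line as in the blocks above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: for each psummary record on stdin, print the value
# of the requested key (records without that key print nothing).
psummary_field() {
  local key="$1"
  awk -v k="$key" -F';' '{
    for (i = 1; i <= NF; i++) {
      # Split each field at its first "=" so values may contain "=" too.
      n = index($i, "=")
      if (n > 0 && substr($i, 1, n - 1) == k) {
        print substr($i, n + 1)
        break
      }
    }
  }'
}

# Example: list the pctid of every record in a downloaded psummary file
# (the file name is an assumption):
# psummary_field pctid < SRR11780077.psummary
```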


### Complex Viromes

- [SRR8475477](https://www.ncbi.nlm.nih.gov/sra/?term=SRR8475477)

> Name: Metatranscriptome of Pinus contorta roots inoculated with Suillus spp. at Duke University, North Carolina, United States - S125(MN)xP.contorta(WY)_rep1

Pine roots inoculated with a mushroom (*Suillus* spp.), with two clean hits to novel viruses. The Partitiviridae hit (69% id) is a shoo-in for an assemblable virus.

```
sra=SRR8475477;SUMZER_COMMENT=sra=SRR8475477,genome=rdrp0,date=201214-05:26,type=protein;totalalns=983;readlength=151;truncated=no;
sra=SRR8475477;famcvg=___u._.uwwu_uwuwu_wwwuw:.;fam=rdrp2;score=60;pctid=71;alns=118;avgcols=48;
sra=SRR8475477;famcvg=____.::__..__uAo__uu_:___;fam=rdrp3;score=31;pctid=54;alns=864;avgcols=47;
sra=SRR8475477;famcvg=_______._________________;fam=rdrp1;score=2;pctid=50;alns=1;avgcols=38;
sra=SRR8475477;gencvg=___u__._:u__uwuwu_:w._u:.;gen=rdrp2.Partitiviridae;score=35;pctid=69;alns=71;avgcols=49;
sra=SRR8475477;gencvg=____.::______.Wo__uu_____;gen=rdrp3.unclassified;score=23;pctid=60;alns=554;avgcols=48;
sra=SRR8475477;gencvg=___..__uuuu_______u_wuu__;gen=rdrp2.unclassified;score=18;pctid=74;alns=47;avgcols=48;
sra=SRR8475477;seqcvg=____.::_______Wo__uu_____;seq=rdrp3.unclassified.Rhizoctonia_solani_positive_strand_RNA_virus_1:AMM45289.1;score=22;pctid=60;alns=552;avgcols=48;
sra=SRR8475477;seqcvg=___.__._:u__uwu_______::.;seq=rdrp2.Partitiviridae.Fragaria_chiloensis_cryptic_virus:YP_001274391.1;score=18;pctid=74;alns=38;avgcols=49;
sra=SRR8475477;seqcvg=___..__uuuu_______u_wuu__;seq=rdrp2.unclassified.Arhar_cryptic_virus_I:YP_009026407.1;score=18;pctid=74;alns=47;avgcols=48;
sra=SRR8475477;seqcvg=___:______________:w.____;seq=rdrp2.Partitiviridae.Raphanus_sativus_cryptic_virus_2:YP_001686783.1;score=13;pctid=53;alns=16;avgcols=48;
sra=SRR8475477;seqcvg=_____________:U__________;seq=rdrp3.Gammaflexiviridae.Botrytis_virus_F:NP_068549.1;score=...
```


- [ERR2029738](https://www.ncbi.nlm.nih.gov/sra/?term=ERR2029738)

> Name: Microbiome dynamics and adaptation of expression signatures during methane production failure and process recovery


```
sra=ERR2029738;SUMZER_COMMENT=sra=ERR2029738,genome=rdrp0,date=201214-05:20,type=protein;totalalns=24196;readlength=140;truncated=no;
sra=ERR2029738;famcvg=maomUoOAUmmmUUAUmoUUWWUUa;fam=rdrp2;score=100;pctid=66;alns=5699;avgcols=41;
sra=ERR2029738;famcvg=WUmmMOOAOmWOOUAAmUWAUAmUW;fam=rdrp3;score=100;pctid=96;alns=18252;avgcols=43;
sra=ERR2029738;famcvg=._._____au:uuu:wu.w_..___;fam=rdrp1;score=46;pctid=52;alns=71;avgcols=40;
sra=ERR2029738;famcvg=wu_.___:awu___:__ww_.u:__;fam=rdrp4;score=43;pctid=65;alns=77;avgcols=43;
sra=ERR2029738;famcvg=___a.___u_auu_u_u__:___w:;fam=NA_YP_009618381;score=30;pctid=67;alns=92;avgcols=43;
sra=ERR2029738;famcvg=_________.______:___.____;fam=rdrp5;score=5;pctid=61;alns=5;avgcols=40;
sra=ERR2029738;gencvg=WUoaMOOAOmWOOUWAmUUWUAmUU;gen=rdrp3.Virgaviridae;score=100;pctid=96;alns=17319;avgcols=43;
sra=ERR2029738;gencvg=auu_waWWm:oaooAmaammWUow.;gen=rdrp2.unclassified;score=100;pctid=63;alns=2666;avgcols=42;
sra=ERR2029738;gencvg=____ow:uo:_ommau_amau.uUw;gen=rdrp2.yaOV238;score=99;pctid=56;alns=701;avgcols=43;
sra=ERR2029738;gencvg=o_omUw:wwma.aoomoaaoaomo_;gen=rdrp2.Potyviridae;score=90;pctid=95;alns=948;avgcols=42;
sra=ERR2029738;gencvg=_u__::AW:_.:::_:w_u::uu__;gen=rdrp2.Picobirnaviridae;score=46;pctid=56;alns=1296;avgcols=37;
sra=ERR2029738;gencvg=_uwu___.._________..w:u.:;gen=rdrp2.Sobemovirus;score=19;pctid=97;alns=45;avgcols=36;
...
sra=ERR2029738;seqcvg=____ow:uo:_ommau_amau.uUw;seq=rdrp2.yaOV238.orf50098;score=99;pctid=56;alns=701;avgcols=43;
sra=ERR2029738;seqcvg=uu___aUUm:o_.:Wu_ammWUow.;seq=rdrp2.unclassified.Hubei_picobirna_like_virus_4:APG78304.1;score=92;pctid=64;alns=1917;avgcols=42;

```

- [SRR8794615](https://www.ncbi.nlm.nih.gov/sra/?term=SRR8794615) 

> Name: DYEatom_Cruise_2013

Metatranscriptome sample; only the rdrp2 results are shown. It looks like a divergent hit, but the coverage is low, so I'm skeptical that it will assemble.

```
sra=SRR8794615;SUMZER_COMMENT=sra=SRR8794615,genome=rdrp0,date=201214-05:12,type=protein;totalalns=100;readlength=100;truncated=no;
sra=SRR8794615;famcvg=___.___w:_:u:.uww.w_ww:._;fam=rdrp2;score=55;pctid=61;alns=85;avgcols=31;
...
sra=SRR8794615;gencvg=_______w._:u:__ww_u_uw:__;gen=rdrp2.Partitiviridae;score=38;pctid=60;alns=61;avgcols=31;
sra=SRR8794615;gencvg=________:____.u__.u_u:.._;gen=rdrp2.unclassified;score=15;pctid=61;alns=21;avgcols=31;
...
sra=SRR8794615;seqcvg=_____________.:__.u_u:___;seq=rdrp2.unclassified.Phytophthora_infestans_RNA_virus_1:YP_003193667.1;score=10;pctid=59;alns=16;avgcols=31;
sra=SRR8794615;seqcvg=________________:___:w:__;seq=rdrp2.Partitiviridae.Citrullus_lanatus_cryptic_virus:APT68925.1;score=9;pctid=74;alns=17;avgcols=31;
sra=SRR8794615;seqcvg=____________:__w_________;seq=rdrp2.Partitiviridae.Beet_cryptic_virus_3:AAB27624.1;score=9;pctid=55;alns=10;avgcols=31;
sra=SRR8794615;seqcvg=_______w.________________;seq=rdrp2.Partitiviridae.Raphanus_sativus_cryptic_virus_3:YP_002364401.1;score=8;pctid=61;alns=11;avgcols=32;
sra=SRR8794615;seqcvg=__________:u_____________;seq=rdrp2.Partitiviridae.Pittosporum_cryptic_virus_1:CEJ95596.2;score=4;pctid=50;alns=8;avgcols=31;
sra=SRR8794615;seqcvg=_______________:u________;seq=rdrp2.Partitiviridae.Pepper_cryptic_virus_1:AEJ07890.1;score=3;pctid=61;alns=8;avgcols=32;
sra=SRR8794615;seqcvg=__________________u______;seq=rdrp2.Partitiviridae.Pinus_sylvestris_partitivirus_NL_2005:AAY51483.1;score=2;pctid=50;alns=6;avgcols=32;
```


- [SRR5839347](https://www.ncbi.nlm.nih.gov/sra/?term=SRR5839347)

> Name:  Freshwater microbial communities from Lake Simoncouche, Canada - S_130109_E_mt

```
SUMZER_COMMENT=sra=SRR5839347,genome=rdrp0,date=201214-05:41,type=protein;totalalns=41690;readlength=151;truncated=no;
famcvg=UWWAAAOAAAAAAAAAWAOAAAAAW;fam=rdrp2;score=100;pctid=70;alns=16830;avgcols=47;
famcvg=_uuwauwawaaawaa:waa:_:___;fam=rdrp4;score=100;pctid=54;alns=264;avgcols=45;
famcvg=_.u_.uoawwwwwauuwwaw:uuu_;fam=rdrp5;score=100;pctid=54;alns=239;avgcols=45;
famcvg=UWWUWWWWWWWWWAWWAAWWWWWWo;fam=rdrp3;score=100;pctid=63;alns=9427;avgcols=46;
famcvg=_a.w::::.:uuooooaumomoaoo;fam=rdrp0;score=100;pctid=55;alns=685;avgcols=45;
famcvg=oWAAAAAAWAAWWWAAAAAAWUUOu;fam=rdrp1;score=100;pctid=66;alns=14245;avgcols=45;
gencvg=:.waoowaammawaomoommoaaa.;gen=rdrp2.yaOV13;score=100;pctid=56;alns=826;avgcols=47;
gencvg=:.u:w::wuauwuwwawawuawuu.;gen=rdrp2.yaOV18;score=100;pctid=54;alns=223;avgcols=47;
gencvg=_awuuwoaauowaaaaaamoammo_;gen=rdrp1.yaOV14;score=100;pctid=59;alns=677;avgcols=44;
gencvg=oooaaoomoaaooooooooaoaomw;gen=rdrp3.yaOV11;score=100;pctid=66;alns=1121;avgcols=46;
gencvg=wwoa:aaawawaau__oaawaaua:;gen=rdrp3.Nodaviridae;score=100;pctid=63;alns=445;avgcols=46;
gencvg=.wowwuUWmomoUmUUmUoowao:_;gen=rdrp1.unclassified;score=100;pctid=60;alns=2033;avgcols=43;
gencvg=woaamommommommommmommmaaw;gen=rdrp3.unclassified;score=100;pctid=62;alns=1587;avgcols=46;
gencvg=wwwWWWAUUuoWUUWm:UAUUUWWU;gen=rdrp2.yaOV98;score=100;pctid=83;alns=6001;avgcols=47;
gencvg=ommoommmomoowoomomommmomw;gen=rdrp2.yaOV1;score=100;pctid=71;alns=1489;avgcols=47;
gencvg=:_wuwwaaaawwawww:wwuwaaw:;gen=rdrp2.yaOV9;score=100;pctid=57;alns=314;avgcols=45;
gencvg=_.u_.uawwwuwwwuuwwaw::.u_;gen=rdrp5.unclassified;score=100;pctid=54;alns=203;avgcols=45;
gencvg=_aawwwaaawaawaaaawaawuaw:;gen=rdrp2.yaOV12;score=100;pctid=62;alns=420;avgcols=47;
gencvg=.WWAAAAmoUAmwwwUWWAAUmuO_;gen=rdrp1.yaOV64;score=100;pctid=71;alns=7979;avgcols=46;
gencvg=oomUUmmmUmmUmUmmUUUmUmoUw;gen=rdrp2.unclassified;score=100;pctid=64;alns=2684;avgcols=46;
gencvg=_aaoomoaooaomUmaoooaoomw.;gen=rdrp2.yaOV107;score=100;pctid=54;alns=1174;avgcols=44;
gencvg=uwwwwaaaawawwwaaawaaawwu.;gen=rdrp3.yaOV8;score=100;pctid=58;alns=353;avgcols=44;
gencvg=:aawwuwwaaawaawwwawauwaau;gen=rdrp1.yaOV10;score=100;pctid=58;alns=411;avgcols=45;
gencvg=moawommmomomommmommmmmoww;gen=rdrp3.yaOV2;score=100;pctid=65;alns=1537;avgcols=45;
gencvg=aoaoooooaaaaaaaaaooaowawu;gen=rdrp3.yaOV3;score=100;pctid=63;alns=676;avgcols=46;
gencvg=uaaoaaomooooooooooaaawao:;gen=rdrp3.yaOV4;score=100;pctid=58;alns=887;avgcols=45;
gencvg=aaooooaoooammmmmmmoaooomu;gen=rdrp3.yaOV5;score=100;pctid=66;alns=1321;avgcols=46;
gencvg=aawuwwwwawawaoaoooaoawaw:;gen=rdrp1.yaOV6;score=100;pctid=64;alns=501;avgcols=45;
gencvg=_wwwuaaomwwmooUmomooawoa_;gen=rdrp1.yaOV7;score=100;pctid=61;alns=971;avgcols=46;
gencvg=amwooumoummwoooooaooomo._;gen=rdrp2.yaOV72;score=100;pctid=65;alns=1131;avgcols=47;
```
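
With this many genus-level records, it helps to rank them before reading the coverage strings. A minimal sketch that tabulates `gen` records by alignment count is shown below. `rank_genera` is a hypothetical helper; it assumes the `key=value;` layout shown above and ignores `fam` and `seq` records.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read psummary records on stdin and print
# "<alns> <pctid> <gen>" (tab-separated) for genus-level records,
# sorted by alignment count, largest first.
rank_genera() {
  awk -F';' '
    {
      gen = ""; alns = ""; pctid = ""
      for (i = 1; i <= NF; i++) {
        n = index($i, "=")
        key = substr($i, 1, n - 1)
        val = substr($i, n + 1)
        if (key == "gen") gen = val
        else if (key == "alns") alns = val
        else if (key == "pctid") pctid = val
      }
      # Only records that carry a gen= field are genus-level summaries.
      if (gen != "") printf "%s\t%s\t%s\n", alns, pctid, gen
    }' | sort -k1,1nr
}

# Example (the file name is an assumption):
# rank_genera < SRR5839347.psummary | head
```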

## Chasing reads

### Hepeviridae 1

Example hit psummary:
```
sra=SRR10873904;gencvg=______amUmm_mU.__________;gen=rdrp3.Hepeviridae;score=58;pctid=50;alns=735;avgcols=57;
sra=SRR10873904;seqcvg=______wo.ma______________;seq=rdrp3.Hepeviridae.Orthohepevirus_D:YP_009350098.1;score=34;pctid=50;alns=171;avgcols=60;
```

Example read:
```
SRR10873904.1856        rdrp3.Hepeviridae.Orthohepevirus_D:YP_009350098.1       11      157     249     136     182     429     41.2    4.9e-05 5VI3SP2MLNCFAMLIF2FWIF3QEQKARILTV1TNILSPIANGQ-R-N-P-TWLF2-D-L2RENA1VI   35M4I4M2D6M     +       GTATAAAGATAAAGTAGGACAAGGGGTTAGTGCATGGAGTAAAACTATGAACTTCATGATTGGGCCTTTTATAAGGGCGTTCCAACAAGCAATAACCAGCACAATTTCTATCAATCAAAGAAACCCAACCCTATATTGCTACAATCGAAATGATGTTAAATTCGCCGAGTTTTTCACGCTAACGGATAAAATGGAAGGGGAACATGTATCAGCAGACGTGAGTGAGATGGATAGCGTCCATAGTCTGTC       KVGQGISAWPKTLCALFGPWFRAFEKRLVSNLPAGWFYCDLYNEADI
```
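
To chase a read with blastx, its nucleotide sequence has to be lifted out of the alignment record. A minimal sketch that converts such records to FASTA is given below. `reads_to_fasta` is a hypothetical helper, and it assumes the layout of the example above: whitespace-separated columns, one record per line, read ID in the first column, nucleotide sequence in the second-to-last column and the translated sequence in the last.

```bash
#!/usr/bin/env bash
# Hypothetical helper: convert alignment records (one per line on stdin)
# to FASTA. Assumes the read ID is column 1 and the nucleotide read is the
# second-to-last column, as in the example record above.
reads_to_fasta() {
  awk 'NF >= 3 { printf ">%s\n%s\n", $1, $(NF - 1) }'
}

# Example (the input file name is an assumption):
# reads_to_fasta < SRR10873904.hits.txt > SRR10873904.hits.fa
```

The resulting FASTA can be submitted to blastx as done in the sections below.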

top blastx hits:
```
Select seq gb|QJI53774.1|	replicase [Hepeviridae sp.]	Hepeviridae sp.	174	174	97%	2e-48	100.00%	2089	QJI53774.1
Select seq ref|YP_009553584.1|	replicative protein [Elicom virus 1]	Elicom virus 1	91.7	91.7	97%	2e-19	55.56%	1787	YP_009553584.1
Select seq gb|QJI53776.1|	replicase [Hepeviridae sp.]	Hepeviridae sp.	49.3	49.3	92%	2e-04	38.46%	2190	QJI53776.1
```

This is in fact an uncharacterized virus, but one that is already in GenBank and is being published (record below). The next-closest hit is at 55.56% identity, so the search flagged it correctly. The reference sequences need to be updated to GenBank 2020 to minimize hits like this!

>replicase [Hepeviridae sp.]
>GenBank: QJI53774.1
>  AUTHORS   Zhou,R., Shan,T., Yang,S. and Zhang,W.
>  TITLE     Viral genomes from wild and zoo birds in China 2020
>  JOURNAL   Unpublished


### Hepeviridae 2

Hit:
```
sra=SRR5415522;gencvg=::_______:_w:.w.u___u____;gen=rdrp3.Hepeviridae;score=20;pctid=78;alns=42;avgcols=30;
sra=SRR5415522;seqcvg=___________.._:_____u____;seq=rdrp3.Hepeviridae.Hepatitis_E_virus:BAT70058.1;score=5;pctid=78;alns=8;avgcols=32;
```

Read:
```
SRR5415522.38593        rdrp3.Hepeviridae.Hepatitis_E_virus:BAV83005.1  100     11      101     199     228     428     83.3    3.4e-10 10DNFY2SG6ER5DE2        30M     -       ACAAGGTAACTAACCAATCAGGGGCGCCCGCCTCCTCCATCAATAGACACTCCAGGCTCAATGAGAAGTCATTCTGGGTCGAATCAAACTCGGAAAAATCA   DFSEFDSTQNNYSLGLECLLMREAGAPEWL
SRR5415522.1278537      rdrp3.Hepeviridae.Hepatitis_E_virus:BAV83005.1  100     20      101     215     241     428     77.8    1.1e-08 5ER5DE2VWTR4IL2SA3      27M     -       GACCATCTAGCGACCTCCACAGCACCCAACTAGACCTAATTAGATGGTACAAGGTAACTAACCAATCAGGGGCGCCCGCCTCCTCCATCAATAGACACTCC   ECLLMREAGAPEWLWRLYHLLRSAWVL
```

blastx:
```
Select seq gb|ATY47660.1|	non-structural polyprotein [Hepevirus sp.]	Hepevirus sp.	75.1	75.1	98%	2e-14	100.00%	1695	ATY47660.1
Select seq gb|AST08175.1|	RNA-dependent RNA-polymerase [Hepatitis E virus]	Hepatitis E virus	58.9	58.9	98%	5e-10	78.79%	94	AST08175.1
Select seq dbj|BAX03497.1|	polyprotein [Hepatitis E virus]	Hepatitis E virus	61.6	61.6	89%	6e-10	86.67%	253	BAX03497.1
```

### Hepeviridae 3

Hit:
```
sra=SRR5995681;gencvg=____________.m_:oUo______;gen=rdrp3.Hepeviridae;score=36;pctid=50;alns=356;avgcols=63;
sra=SRR5995681;seqcvg=_____________a_:aou______;seq=rdrp3.Hepeviridae.Hepatitis_E_virus:BAO31621.1;score=28;pctid=50;alns=100;avgcols=66;
```

Read:
```
SRR5995681.278933       rdrp3.Hepeviridae.Hepatitis_E_virus:ANN23868.1  49      249     251     197     262     432     37.3    1.1e-07 3MFES1YF1TSST1SNENIFTSIL1NL1ICKVWILMREREMC1CMNPEQLWFL1ARMLFYKHKLHV1TSQA2ALNQYA1GKY-VE1MLRK1YFAW7FG2     49M1I17M        +       GTTTCCCAGTCACGATAATAGTAGAAAAATGTTAAATATAATTTCATTGAGAACGACATGGAAGAATATGATACGTCACAAAGTGAAATTACAATTGGGAATGAAATTAAATGGTTACGTAGGATGGGTTGTAATGAGTTGTTCATTGCTATGTTTAAGAAACATAGAACACAGTGGACTGCTAATTATCCAGGTTATGTATCCATGAGAGGCTATGCTAAGAAACATTCAGGTGAACCATTTACTTTGGG     ENDFSEFDSTQNNFSLGLECVIMEECGMPQWLIRLYHLVRSAWTLQAPKESLKGFWKKHSGEPGTL

SRR5995681.283115       rdrp3.Hepeviridae.Hepelivirus:AFR11847.1        72      209     251     204     249     326     43.5    7.6e-06 QHDKCASG2IA1YF1VL4FLYF2LV2VAAVCV1LH1NG1RDIYAR1DK1SHGMVN1YAFRDEES2       46M     +       GTTTCCCAGTCACGATAGCGCAGTTGGTGTGAAAGCTTGTCTTGATTTGGGTTTTAAAACCAAGTTAAGAACAGGATTGTAGCGAATTTATTGGATATATTGTTACACCGTATGGTTTTTATCCTGATTTATTGCGTGTTGCTTGTAAGTTATTAAATAAGAGAATTGCTGATGACCAAAGCGGTGTTGAATATTTTGATGAATTAAAGATCGGAAGAGCGTCGGGTAGGGAAAGAGTGTAGGAAGGTGTG     HKAGEFAGFILTPYGLFPDVLRAVVKHLGKDYRDKQHMNEARESLK
```

blastx:
```
Select seq dbj|BBE36496.1|	nonstructural polyprotein [Hepatitis E virus type 4]	Hepatitis E virus type 4	53.9	53.9	83%	5e-06	37.14%	1706	BBE36496.1
Select seq dbj|BAN63764.1|	nonstructural protein [Hepatitis E virus]	Hepatitis E virus	53.5	53.5	80%	7e-06	37.31%	1705	BAN63764.1
Select seq dbj|BAN63773.1|	nonstructural protein [Hepatitis E virus]	Hepatitis E virus	53.5	53.5	80%	7e-06	37.31%	1715	BAN63773.1
```

Wooo first real hit, 37.3% to Hep E. The SRA isn't mined out :)

### Paramyxoviridae 1

Hit:
```
sra=ERR1301463;gencvg=__uu:uwu:__:uuuuuwu:uu:::;gen=rdrp5.Paramyxoviridae;score=34;pctid=80;alns=100;avgcols=59;
sra=ERR1301463;seqcvg=______::___:___u:w_:uu:::;seq=rdrp5.Paramyxoviridae.Miniopterus_schreibersii_paramyxovirus:AGU69458.1;score=19;pctid=81;alns=44;avgcols=62;
```

Read:
```
ERR1301463.579021       rdrp5.Paramyxoviridae.Miniopterus_schreibersii_paramyxovirus:AGU69458.1 249     1       250     535     617     634     77.1    8.3e-37 1ML1NR2SN3TS1SA1LV9LM6VL10LI2VI4QK4ND1NS5IM2QE2DNCS5TA5VI       83M     -       TAACGAATAGGGATCACTGGTCCAATCAAGATAATCACAGTCTCCTGTTTGCTGATGTATAATTTTTTGAATAATATTTTCATTCAATAGTTTCACTTGGATCATTCTTTTGACATCTGCCAGTGATGCCGTCACTGGGTCTCCAATATTCCTAACGTAGAGTCGGCACATGTTAAGATAATTAAACCCACCTAATTGTGAAGGGAGTATACTTGCTGTGATAAGCCAACTTTGATTGTTTATCATAGGG      PLIRNQNWLISAAIVPSQLGGFNYMNMCRLYLRNIGDPVTASIADIKRMIKVKLLDESIIQKIMHQETGNSDYLDWASDPYSI
```

blastx:
```
Select seq gb|AGU69458.1|	large protein [Miniopterus schreibersii paramyxovirus]	Miniopterus schreibersii paramyxovirus	149	149	99%	1e-39	77.11%	1765	AGU69458.1
Select seq gb|AYM47538.1|	large protein [Bat paramyxovirus]	Bat paramyxovirus	147	147	99%	4e-39	78.31%	2164	AYM47538.1
Select seq gb|AXR70620.1|	large protein [Bat paramyxovirus]	Bat paramyxovirus	147	147	99%	4e-39	78.31%	2195	AXR70620.1
```


### Paramyxoviridae 2

hit:
```
sra=ERR1303045;gencvg=u::wuuu.u_:awwwwwawuawwuu;gen=rdrp5.Paramyxoviridae;score=74;pctid=81;alns=208;avgcols=61;
sra=ERR1303045;seqcvg=:_____.___:w___uuaw:awwuu;seq=rdrp5.Paramyxoviridae.Miniopterus_schreibersii_paramyxovirus:AGU69458.1;score=39;pctid=81;alns=95;avgcols=59;
```

read:
```
ERR1303045.1588 rdrp5.Paramyxoviridae.Miniopterus_schreibersii_paramyxovirus:AGU69458.1 164     3       250     486     540     634     61.8    2.7e-11 2YS1VATINSPK1RI1TQLGLF1*RFW-I2CSIL6EQ1ILIV15TVNE1ML1NR2 17M1D37M        -       CTTTGATTGTTTATCATAGGGTTAGTTATATCATCAGTCATACTTGGGTTGATTGTAAATTTTAACGATATTATAATTTGCTCTAACACTTTAAGAATATTTATACAGTAACCAAATCAATTGAGCAGGGTTTCTCTAGATGGATTGGTTACTGTATAAATATTCTTAAAGTGTTAGAGCAAATTATAATATCGTTAAAATTTAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCTGGCT      NISTAISKSIEQGFNRWIGYSLNILKVLQQLVISLKFTINPSMTDDIVEPLIRNQ
```

blastx:
```
Select seq gb|QID58724.1|	RNA-dependent RNA polymerase [Paramyxovirus PREDICT_PMV-66]	Paramyxovirus PREDICT_PMV-66	77.8	142	80%	7e-16	100.00%	164	QID58724.1
Select seq gb|QID58716.1|	RNA-dependent RNA polymerase [Paramyxovirus PREDICT_PMV-66]	Paramyxovirus PREDICT_PMV-66	77.4	142	80%	7e-16	100.00%	144	QID58716.1
```

This virus was already identified in April 2020; again, it is missing from our DB.

### Paramyxoviridae 3

hit:
```
sra=SRR038590;gencvg=:_.:u:uuwuwwuuww:::.::::_;gen=rdrp5.Paramyxoviridae;score=48;pctid=78;alns=98;avgcols=76;
sra=SRR038590;seqcvg=:_.:u::uwuwwuuuw:::.::::_;seq=rdrp5.Paramyxoviridae.Measles_morbillivirus:ABB71671.1;score=44;pctid=79;alns=95;avgcols=76;
sra=SRR038590;seqcvg=_______________._________;seq=rdrp5.Paramyxoviridae.Miniopterus_schreibersii_paramyxovirus:AGU69458.1;score=2;pctid=50;alns=1;avgcols=100;
```

read: (note the high identity of the virus here)
```
SRR038590.6     rdrp5.Paramyxoviridae.Measles_morbillivirus:ABB71671.1  149     6       289     263     310     636     100.0   1.6e-23 48      48M     -       TTGGGAAATGAGGGCAATCCGTAAATCTCATTTAGCCTCTGTGCAAACAAGCTGATGGTCTCATATCTCCAATTAAGGCAGTACTTCTTGAGATCAGTCGTGATAAATGCACTGACTGTCTCGTAAGCTTCCATATTCTCCGGATGATCAGCTATTATGTTTGGACACAGGTTAGCGGCGCACCCTGTACGATTGATTTACCAGCAAGCCCAACACCTGTAATTTCCCAATTGGCAACCTCGCTTGAACCTCGGCTGAGACTGCCAAGGCACACAGGGGATAGGNNNNN       DHPENMEAYETVSAFITTDLKKYCLNWRYETISLFAQRLNEIYGLPSF
SRR038590.405   rdrp5.Paramyxoviridae.Measles_morbillivirus:ABB71671.1  151     47      290     263     297     636     100.0   5.5e-16 35      35M     -       TTGGGAAAATGAGGGCAATCCGTAAATCTCATTTAGCCTCTGTGCAAAACAAGCTGATGGTCTCATATCTCCAATTAAGGCAGTACTTCTTGAGATCAGTCGTGATAAATGCACTGACTGTCTCGTAAGCTTCCATATTCTCCGGATGATCAGCTATTATGTTTGGACACAGGTTAGCGGCGCACCCTGTACGATTGATTTACCAGCAAGCCCAACACCTGTAATTTCCCAATTGGCAACCTCGCTTGAACCTCGGCTGAGACTGCCAAGGCACACAGGGGATAGGNNNN      DHPENMEAYETVSAFITTDLKKYCLNWRYETISLF
```