# Run: Diamond Pilot Study

```
Lead     : ababaian
Issue    : 
Version  : v0.3.4 dev-diamond branch
start    : 2020 06 27
complete : 2020 06 27
files    : ~/serratus/notebook/200627_ab/
s3_files : s3://serratus-public/notebook/200627_ab/
output   : s3://serratus-public/out/200627_dmnd0/
```

### Intro/Objectives

- In ongoing discussions, we have considered how to go 'deeper' into the virome and identify even more distant viruses.
- The blastx algorithm, which searches for similarity in protein rather than nucleotide sequence, should accomplish this. We have chosen `diamond` as a very fast implementation of the algorithm.
- This is a pilot study from the development branch `dev-diamond` on a few random viromes, to see what the output looks like and to measure runtime performance.
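
For orientation, a translated search with `diamond` usually has two steps: build a protein database, then run `blastx` with nucleotide reads as the query. The sketch below is not the command used on the `dev-diamond` branch. The file names (`protref.fa`, `reads.fq`, `hits.tsv`), the thread count, and the `run` dry-run helper are all assumptions for illustration.

```shell
# Sketch only: the file names below are hypothetical placeholders.
REF_FASTA="protref.fa"   # protein reference sequences (FASTA)
DB="protref"             # diamond database prefix
READS="reads.fq"         # nucleotide reads to translate and search
OUT="hits.tsv"           # tabular alignment output

# Print commands instead of running them unless DRY_RUN=0 is set,
# so the sketch can be read and tried without diamond installed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# 1. Build a protein database from the reference FASTA.
run diamond makedb --in "$REF_FASTA" --db "$DB"

# 2. Translated search of the reads against the protein database,
#    written in BLAST tabular format (outfmt 6).
run diamond blastx --db "$DB" --query "$READS" --out "$OUT" --outfmt 6 --threads 4
```

Setting `DRY_RUN=0` runs the two commands for real, provided `diamond` is on the `PATH`.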

### Initialize local workspace

In [1]:
# Serratus commit version
SERRATUS="/home/artem/serratus"
cd $SERRATUS

# Create local run directory
WORK="$SERRATUS/notebook/200627_ab"
mkdir -p $WORK; cd $WORK

# S3 notebook path
S3_WORK='s3://serratus-public/notebook/200627_ab/'

# date and version
date
git rev-parse HEAD # commit version

Sat Jun 27 15:21:38 PDT 2020
074c3321d1da8e1248b966fe717ca131d1dda7f4


### SRA Accession Initialization

- 50 Random bat samples
- 50 Random virome samples
- CoV known spike-in (from virome)


In [5]:
cd $WORK
BATCH='propilot_SraRunInfo.csv'

ln -s ../200620_ab/bat_SraRunInfo.csv ./
ln -s ../200528_ab/viro_SraRunInfo.csv ./

head -n1  bat_SraRunInfo.csv  > $BATCH
tail -n+2 bat_SraRunInfo.csv  | shuf - | head -n 50 - >> $BATCH
tail -n+2 viro_SraRunInfo.csv | shuf - | head -n 50 - >> $BATCH
grep 'ERR27567' bat_SraRunInfo.csv >> $BATCH

wc -l  $BATCH
md5sum $BATCH

shuf: write error: Broken pipe
shuf: write error
shuf: write error: Broken pipe
shuf: write error
118 propilot_SraRunInfo.csv
388b005633fdd2db058f575939c7e413  propilot_SraRunInfo.csv
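
The `shuf: write error: Broken pipe` messages above most likely come from `head` closing the pipe while `shuf` is still writing. The batch file is still valid. A quieter variant, sketched below, lets `shuf -n` stop by itself after the requested number of lines. The accessions and column names in the stand-in CSV are made up.

```shell
# Build a small stand-in for an SraRunInfo.csv (hypothetical accessions).
TMP="$(mktemp -d)"
{
  echo "Run,ReleaseDate,spots"
  for i in $(seq 1 200); do
    echo "SRR$((1000000 + i)),2020-01-01,$i"
  done
} > "$TMP/in.csv"

N=50

# Keep the header, then sample N data rows without replacement.
# `shuf -n` emits N lines and exits, so no downstream `head`
# closes the pipe early.
head -n 1 "$TMP/in.csv" > "$TMP/batch.csv"
tail -n +2 "$TMP/in.csv" | shuf -n "$N" >> "$TMP/batch.csv"

# Header plus N sampled rows
wc -l < "$TMP/batch.csv"
```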


In [16]:
cd $WORK
BATCH='propilot_v2_SraRunInfo.csv'

head -n1  bat_SraRunInfo.csv  > $BATCH
tail -n+2 viro_SraRunInfo.csv | shuf - | head -n 200 - >> $BATCH

wc -l  $BATCH
md5sum $BATCH

shuf: write error: Broken pipe
shuf: write error
201 propilot_v2_SraRunInfo.csv
9142c84255be13af0e0616290cc6c73c  propilot_v2_SraRunInfo.csv


### Terraform Initialize

In [7]:
# For rapid batching; copy out serratus folder
TF=$SERRATUS/terraform/main
cd $TF
git diff main.tf
terraform init

# Launch Terraform Cluster
# Initialize the serratus cluster with minimal nodes
terraform apply -auto-approve

diff --git a/terraform/main/main.tf b/terraform/main/main.tf
index de2d00d..2cd53e9 100644
--- a/terraform/main/main.tf
+++ b/terraform/main/main.tf
@@ -92,7 +92,7 @@ module "scheduler" {
   
   security_group_ids = [aws_security_group.internal.id]
   key_name           = var.key_name
-  instance_type      = "c5.large"
+  instance_type      = "r5.xlarge"
   dockerhub_account  = var.dockerhub_account
   scheduler_port     = var.scheduler_port
 }
@@ -105,7 +105,7 @@ module "monitoring" {
   key_name           = var.key_name
   scheduler_ip       = module.scheduler.private_ip
   dockerhub_account  = var.dockerhub_account
-  instance_type      = "r5.large"
+  instance_type      = "r5.xlarge"
 }
 
 // Serratus-dl
@@ -113,13 +113,13 @@ module "download" {
   source             = "../worker"
 
   desired_size       = 0
-  max_size           = 200
+  max_size           = 5000
 
   dev_cidrs          = var.dev_cidrs
   security_group_ids = [aws_security_group.inter

In [13]:
cd $TF

# Open SSH tunnels to the monitor
./create_tunnels.sh

# If a port is already in use, find the stale SSH process
# and kill it:
# ps aux | grep ssh
# sudo kill <PID of SSH>

Tunnels created:
    localhost:3000 = grafana
    localhost:9090 = prometheus
    localhost:5432 = postgres
    localhost:8000 = scheduler


In [9]:
wc -l $WORK/$BATCH

118 /home/artem/serratus/notebook/200627_ab/propilot_SraRunInfo.csv


In [11]:
# Upload SraRunInfo.csv into Serratus
cd $TF
./uploadSRA.sh $WORK/$BATCH

Loading SRARunInfo into scheduler 
  File: /home/artem/serratus/notebook/200627_ab/propilot_SraRunInfo.csv
  date: Sat Jun 27 16:13:07 PDT 2020
  wc  : 118 /home/artem/serratus/notebook/200627_ab/propilot_SraRunInfo.csv
  md5 : 388b005633fdd2db058f575939c7e413  /home/artem/serratus/notebook/200627_ab/propilot_SraRunInfo.csv


--------------------------
tmp.chunk00
118 tmp.chunk00_sraRunInfo.csv
f972feef901df476f832dce17d8322c8  tmp.chunk00_sraRunInfo.csv
{"inserted_rows":117,"total_rows":117}


 uploadSRA complete.
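
The file has 118 lines but the scheduler reports 117 inserted rows; the difference is the CSV header. A minimal sanity check along those lines is sketched below. It assumes the response is the single-line JSON shown above and parses it with `sed` so that `jq` is not needed.

```shell
# Response string copied from the upload output above.
RESPONSE='{"inserted_rows":117,"total_rows":117}'
FILE_LINES=118   # `wc -l` of the uploaded CSV, header included

# Pull the integer that follows "inserted_rows":
inserted="$(printf '%s' "$RESPONSE" \
  | sed -E 's/.*"inserted_rows":([0-9]+).*/\1/')"

# Every line except the header should become one row.
expected=$((FILE_LINES - 1))

if [ "$inserted" -eq "$expected" ]; then
  echo "OK: $inserted rows inserted"
else
  echo "MISMATCH: expected $expected, scheduler reported $inserted"
fi
```

The later `v3_bat_SraRunInfo.csv` upload in this notebook reports 2756 lines and 2754 inserted rows, so the same check would flag a difference of one row there.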


In [17]:
wc -l $WORK/$BATCH
# Upload SraRunInfo.csv into Serratus
cd $TF
./uploadSRA.sh $WORK/$BATCH

201 /home/artem/serratus/notebook/200627_ab/propilot_v2_SraRunInfo.csv
Loading SRARunInfo into scheduler 
  File: /home/artem/serratus/notebook/200627_ab/propilot_v2_SraRunInfo.csv
  date: Sat Jun 27 18:17:49 PDT 2020
  wc  : 201 /home/artem/serratus/notebook/200627_ab/propilot_v2_SraRunInfo.csv
  md5 : 9142c84255be13af0e0616290cc6c73c  /home/artem/serratus/notebook/200627_ab/propilot_v2_SraRunInfo.csv


--------------------------
tmp.chunk00
201 tmp.chunk00_sraRunInfo.csv
9d1536846e5b5239fe18848c812e576f  tmp.chunk00_sraRunInfo.csv
{"inserted_rows":200,"total_rows":317}


 uploadSRA complete.


### Terraform Initialize 2

In [23]:
# For rapid batching; copy out serratus folder
TF=$SERRATUS/terraform/main
cd $TF
git diff main.tf
terraform init

# Launch Terraform Cluster
# Initialize the serratus cluster with minimal nodes
terraform apply -auto-approve

Connection to ec2-35-170-10-104.compute-1.amazonaws.com closed by remote host.
Connection to ec2-107-23-183-13.compute-1.amazonaws.com closed by remote host.
diff --git a/terraform/main/main.tf b/terraform/main/main.tf
index de2d00d..2cd53e9 100644
--- a/terraform/main/main.tf
+++ b/terraform/main/main.tf
@@ -92,7 +92,7 @@ module "scheduler" {
   
   security_group_ids = [aws_security_group.internal.id]
   key_name           = var.key_name
-  instance_type      = "c5.large"
+  instance_type      = "r5.xlarge"
   dockerhub_account  = var.dockerhub_account
   scheduler_port     = var.scheduler_port
 }
@@ -105,7 +105,7 @@ module "monitoring" {
   key_name           = var.key_name
   scheduler_ip       = module.scheduler.private_ip
   dockerhub_account  = var.dockerhub_account
-  instance_type      = "r5.large"
+  instance_type      = "r5.xlarge"
 }
 
 // Serratus-dl
@@ -113,13 +113,13 @@ module "download" {
   source             = "../worker"
 
   desired_size

In [24]:
cd $TF

# Open SSH tunnels to the monitor
./create_tunnels.sh

# If a port is already in use, find the stale SSH process
# and kill it:
# ps aux | grep ssh
# sudo kill <PID of SSH>

Tunnels created:
    localhost:3000 = grafana
    localhost:9090 = prometheus
    localhost:5432 = postgres
    localhost:8000 = scheduler


In [26]:
cd $WORK
cut -f1 -d',' propilot_SraRunInfo.csv > v2.sra
cut -f1 -d',' propilot_v2_SraRunInfo.csv >> v2.sra

sed -i 's/$/,/g' v2.sra

head -n1 bat_SraRunInfo.csv          > v3_bat_SraRunInfo.csv

grep -vif v2.sra bat_SraRunInfo.csv >> v3_bat_SraRunInfo.csv
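
The cell above drops already-submitted accessions by turning each one into an `ACCESSION,` pattern for `grep -v -f`. An alternative is to compare the first CSV column exactly with `awk`, which avoids accidental matches in other columns. This is a sketch, not what was run, and the accessions in the stand-in files are made up.

```shell
TMP="$(mktemp -d)"

# Hypothetical accessions standing in for the real run tables.
printf '%s\n' "Run,spots" "SRR1,10" "SRR10,20" "SRR2,30" > "$TMP/all.csv"
printf '%s\n' "Run,spots" "SRR1,10"                      > "$TMP/done.csv"

# First file (NR==FNR): remember every accession already submitted.
# Second file: always keep the header line, then keep only rows
# whose first field was not seen in the first file.
awk -F',' 'NR==FNR { seen[$1]; next } FNR==1 || !($1 in seen)' \
  "$TMP/done.csv" "$TMP/all.csv" > "$TMP/remaining.csv"

cat "$TMP/remaining.csv"
```

Here `SRR1` is removed while `SRR10` is kept, because the comparison is on the whole field rather than a substring.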



In [28]:
BATCH=$WORK/v3_bat_SraRunInfo.csv
# Upload SraRunInfo.csv into Serratus
cd $TF
./uploadSRA.sh $WORK/v3_bat_SraRunInfo.csv

Loading SRARunInfo into scheduler 
  File: /home/artem/serratus/notebook/200627_ab/v3_bat_SraRunInfo.csv
  date: Sat Jun 27 23:27:40 PDT 2020
  wc  : 2756 /home/artem/serratus/notebook/200627_ab/v3_bat_SraRunInfo.csv
  md5 : cc2c4a67fe8b87ce904d83dde8bda5df  /home/artem/serratus/notebook/200627_ab/v3_bat_SraRunInfo.csv


--------------------------
tmp.chunk00
2756 tmp.chunk00_sraRunInfo.csv
f9b0f84b7b64ab699d292db76d08e3c9  tmp.chunk00_sraRunInfo.csv
{"inserted_rows":2754,"total_rows":2754}


 uploadSRA complete.


## Run Serratus

In [29]:
# Set Cluster Parameters =============================
## get Config File (if it doesn't exist)
# curl localhost:8000/config | jq > serratus-config.json
#
cd $TF
# Make local changes to config file
echo "  Cluster Config File: "
cat serratus-config.json
echo ""
echo ""
# Re-upload config file
curl -T serratus-config.json localhost:8000/config

  Cluster Config File: 
{
  "ALIGN_ARGS": "--very-sensitive-local",
  "ALIGN_MAX_INCREASE": 50,
  "ALIGN_SCALING_CONSTANT": 0.25,
  "ALIGN_SCALING_ENABLE": true,
  "ALIGN_SCALING_MAX": 300,
  "CLEAR_INTERVAL": 600000,
  "DL_ARGS": "",
  "DL_MAX_INCREASE": 15,
  "DL_SCALING_CONSTANT": 0.1,
  "DL_SCALING_ENABLE": true,
  "DL_SCALING_MAX": 50,
  "GENOME": "protref1",
  "MERGE_ARGS": "protein",
  "MERGE_MAX_INCREASE": 5,
  "MERGE_SCALING_CONSTANT": 0.1,
  "MERGE_SCALING_ENABLE": true,
  "MERGE_SCALING_MAX": 10,
  "SCALING_INTERVAL": 120,
  "VIRTUAL_SCALING_INTERVAL": 35
}

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0   550    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
{"ALIGN_ARGS":"--very-sensitive-local","ALIGN_MAX_INCREASE":50,"ALIGN_SCALING_CONST
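
The cell mentions making local changes to the config file but does not show one. The sketch below shows one way to change a single numeric value before re-uploading. The `set_config` helper is hypothetical, the value `100` is arbitrary, and only two keys from the config are reproduced. For anything beyond a simple numeric edit, a JSON-aware tool such as `jq` would be safer than `sed`.

```shell
TMP="$(mktemp -d)"

# Two keys copied from the config above; the rest are omitted here.
cat > "$TMP/serratus-config.json" <<'EOF'
{
  "ALIGN_SCALING_MAX": 300,
  "DL_SCALING_MAX": 50
}
EOF

# set_config KEY VALUE FILE
# Rewrite the numeric value of one key, leaving other lines untouched.
set_config() {
  sed -E "s/(\"$1\": *)[0-9.]+/\1$2/" "$3" > "$3.tmp" && mv "$3.tmp" "$3"
}

# 100 is an arbitrary example value.
set_config ALIGN_SCALING_MAX 100 "$TMP/serratus-config.json"
cat "$TMP/serratus-config.json"

# Re-upload as in the cell above (left commented out in this sketch):
# curl -T serratus-config.json localhost:8000/config
```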

### Error handling

In [None]:
## Stop postgres if it's running 
# systemctl stop postgresql

## Connect to postgres
# psql -h localhost postgres postgres

#  psql -h localhost postgres postgres -c "DELETE FROM blocks WHERE state = 'done';"

### ACCESSION OPERATIONS
## Reset SPLITTING accessions to NEW
# UPDATE acc SET state = 'new' WHERE state = 'splitting';

## Reset SPLIT_ERR accessions to NEW
## (repeated failures can be missing SRA data)
# UPDATE acc SET state = 'new' WHERE state = 'split_err';

## Reset MERGE_ERR accessions to SPLIT_DONE
# UPDATE acc SET state = 'split_done' WHERE state = 'merge_err';

## Clear DONE Accessions (ONLY ON COMPLETION)
# DELETE FROM acc WHERE state = 'merge_done';

### BLOCK OPERATIONS

##  Reset FAIL blocks to NEW
# UPDATE blocks SET state = 'new' WHERE state = 'fail';

## Reset ALIGNING blocks to NEW
# UPDATE blocks SET state = 'new' WHERE state = 'aligning';

## Clear DONE blocks
# DELETE FROM blocks WHERE state = 'done';

### RESET STATE
# DELETE FROM blocks WHERE state = 'done';
# DELETE FROM blocks WHERE state = 'fail';
#
# DELETE FROM acc WHERE state = 'split_err';
# DELETE FROM acc WHERE state = 'merging';
# DELETE FROM acc WHERE state = 'merge_err';
# DELETE FROM acc WHERE state = 'split_done';
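
The reset statements above all follow the same pattern: move rows of one table from one state to another. A small helper can generate them so they are not retyped by hand. The `reset_sql` function is hypothetical; the table names and states are the ones listed in the cell above.

```shell
# reset_sql TABLE FROM_STATE TO_STATE
# Print the UPDATE statement for one state transition.
reset_sql() {
  printf "UPDATE %s SET state = '%s' WHERE state = '%s';\n" "$1" "$3" "$2"
}

# Transitions taken from the commented statements above.
reset_sql acc    splitting new
reset_sql acc    split_err new
reset_sql acc    merge_err split_done
reset_sql blocks fail      new
reset_sql blocks aligning  new

# To apply one through the SSH tunnel, pipe it to psql
# (left commented out in this sketch):
# reset_sql blocks fail new | psql -h localhost postgres postgres
```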


In [None]:
# Nuke Shutdown
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=serratus-align-instance" \
  > align_instances.json

jq '.Reservations[].Instances[].InstanceId' -r align_instances.json \
  | pv -l \
  | xargs -n10 -P10 aws ec2 terminate-instances --instance-ids
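
Before terminating anything, it can help to see how `xargs -n10` groups the instance IDs into calls. The sketch below is a dry run: it puts `echo` in front of the command so each call is printed rather than executed. The 25 instance IDs are made up, and `-P10` is left out so the printed order is deterministic.

```shell
# 25 made-up instance IDs stand in for the jq output.
# `echo` makes xargs print each call instead of running it.
BATCHES="$(seq 1 25 \
  | sed 's/^/i-example/' \
  | xargs -n 10 echo aws ec2 terminate-instances --instance-ids)"

# One line per aws call: two with 10 IDs, one with the remaining 5.
printf '%s\n' "$BATCHES"
```

The AWS CLI also accepts `--dry-run` on `terminate-instances`, which checks permissions without terminating the instances.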