# Run: Phase 1 Clean-up - QC virome

```
Lead     : ababaian
Issue    : 
Version  : v0.3.3
start    : 2020 06 12
complete : 2020 06 12
files    : ~/serratus/notebook/200612_ab/
s3_files : s3://serratus-public/notebook/200612_ab/
output   : s3://serratus-public/out/200612_qc/
```

### Objectives
- Use the improved virome query (Virome 2 addendum)
- Clean up: process entries missing from the first batch


### Initialize local workspace

In [1]:
# Serratus commit version
SERRATUS="/home/artem/serratus"
cd $SERRATUS

# Create local run directory
WORK="$SERRATUS/notebook/200612_ab"
mkdir -p $WORK; cd $WORK

# S3 notebook path
S3_WORK='s3://serratus-public/notebook/200612_ab/'

# date and version
date
git rev-parse HEAD # commit version

Fri Jun 12 20:10:29 PDT 2020
f6fff24f9e18a6104d024e1394788f28154fb63f


## SRA Accession Initialization



### Virome 2 Addendum

SRA Accessed: 2020/06/12

Search Term: 
```
"virome" OR "viral metagenome" OR "viral metagenomics" AND cluster_public[prop]
```

Results: `21858` Accessions saved in `viro2_SraRunInfo.csv`


### Remove previously completed accessions

In [2]:
# Create a list of all completed runs to date
cd $WORK
RUNINFO=$WORK/viro2_SraRunInfo.csv
BATCH='v2'
BATCH_SRA=$WORK/"$BATCH"_SraRunInfo.csv
S3_PATH="s3://lovelywater2/index.tsv"

# Download phase-1 index
aws s3 cp $S3_PATH $BATCH.complete
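# Strip the leading fixed-width columns and the file extension,
# leaving one SRA accession per line
cat $BATCH.complete \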
  | sed 's/^...............................//g' - \
  | cut -f1 -d'.' - > $BATCH.sra.complete

# Count
wc -l $RUNINFO
wc -l $BATCH.sra.complete

# Inverse look-up completed SRA
# into a new SraRunInfo file
grep -vif $BATCH.sra.complete $RUNINFO > $BATCH_SRA

# QC on new batch
wc -l  $BATCH_SRA
md5sum $BATCH_SRA


In [None]:
aws s3 cp $BATCH_SRA $S3_WORK
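
The `grep -vif` look-up above treats every completed accession as a case-insensitive substring pattern, so a completed `SRR123` would also drop `SRR1234`, and a blank line in the pattern file would drop every row. A stricter variant is sketched below. It assumes the accession is the first comma-separated column of the SraRunInfo file (the `Run` column in the standard NCBI export) and compares whole accessions only. The function name `filter_new_runs` and the demo accessions are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: keep only SraRunInfo rows whose accession (column 1) is absent
# from a list of completed accessions. Exact match; CSV header is kept.
filter_new_runs() {
  local complete="$1" runinfo="$2"
  awk -F',' '
    # First file: remember each non-blank completed accession
    FILENAME == ARGV[1] { if ($0 != "") completed[$0] = 1; next }
    # Second file: always keep the header line
    FNR == 1            { print; next }
    # Second file: keep rows that are not in the completed list
    !($1 in completed)
  ' "$complete" "$runinfo"
}

# Demo on throwaway files (hypothetical accessions)
tmp=$(mktemp -d)
printf 'SRR123\n\n' > "$tmp/done.txt"
printf 'Run,spots\nSRR123,10\nSRR1234,20\nERR999,30\n' > "$tmp/runinfo.csv"
filter_new_runs "$tmp/done.txt" "$tmp/runinfo.csv"
```

In this notebook the call would look like `filter_new_runs $BATCH.sra.complete $RUNINFO > $BATCH_SRA`. The resulting line count may differ from the `grep` result if any accession is a prefix of another.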

### Terraform Initialize

In [3]:
# Terraform customization
git diff $SERRATUS/terraform/main/main.tf

diff --git a/terraform/main/main.tf b/terraform/main/main.tf
index c030eb5..a0a9134 100644
--- a/terraform/main/main.tf
+++ b/terraform/main/main.tf
@@ -89,10 +89,10 @@ module "scheduler" {
   
   security_group_ids = [aws_security_group.internal.id]
   key_name           = var.key_name
-  instance_type      = "c5.2xlarge"
+  instance_type      = "c5.4xlarge"
   dockerhub_account  = var.dockerhub_account
   scheduler_port     = var.scheduler_port
-  flask_workers      = 17 # (2*CPU)+1, according to https://medium.com/building-the-system/gunicorn-3-means-of-concurrency-efbb547674b7
+  flask_workers      = 31 # (2*CPU)+1, according to https://medium.com/building-the-system/gunicorn-3-means-of-concurrency-efbb547674b7
 }
 
 // Cluster monitor
@@ -117,7 +117,7 @@ module "download" {
   security_group_ids = [aws_security_group.internal.id]
 
   instance_type      = "r5.xlarge" // Mitigate the memory leak in fastq-dump
-  volume_size        = 200 // Mitigate the storage 
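
The `flask_workers` comment in the diff cites the `(2*CPU)+1` rule of thumb. As a quick check, the sketch below evaluates that rule for a given vCPU count. AWS lists c5.2xlarge with 8 vCPUs and c5.4xlarge with 16, so the rule gives 17 and 33 respectively, while the diff sets 31 for the larger instance. The function name `gunicorn_workers` is hypothetical.

```shell
# Rule of thumb from the comment in main.tf: workers = (2 * CPU) + 1
gunicorn_workers() {
  local cpus="$1"
  echo $(( 2 * cpus + 1 ))
}

gunicorn_workers 8    # c5.2xlarge (8 vCPU)
gunicorn_workers 16   # c5.4xlarge (16 vCPU)
```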

In [4]:
# Initialize terraform
TF=$SERRATUS/terraform/main
cd $TF
terraform init

# Launch Terraform Cluster
# Initialize the serratus cluster with minimal nodes
terraform apply -auto-approve

[0m[1mInitializing modules...[0m

[0m[1mInitializing the backend...[0m

[0m[1mInitializing provider plugins...[0m

The following providers do not have any version constraints in configuration,
so the latest version was installed.

To prevent automatic upgrades to new major versions that may contain breaking
changes, it is recommended to add version = "..." constraints to the
corresponding provider blocks in configuration, with the constraint strings
suggested below.

* provider.random: version = "~> 2.2"

[0m[1m[32mTerraform has been successfully initialized![0m[32m[0m
[0m[32m
You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so

### Running Serratus

In [5]:
cd $TF

# Open SSH tunnels to the monitor
./create_tunnels.sh

# If a port is already in use, find the stale SSH
# process and kill it:
# ps aux | grep ssh
# sudo kill <PID of SSH>

Tunnels created:
    localhost:3000 = grafana
    localhost:9090 = prometheus
    localhost:5432 = postgres
    localhost:8000 = scheduler
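
Before opening the tunnels it can help to see which of the four local ports are already taken, since a stale SSH process holding a port is the usual cause of the port error mentioned in the cell above. The sketch below relies on bash's built-in `/dev/tcp` redirection (a bash feature, not a real device file) and only probes for a listener. The function name `port_in_use` is hypothetical; the port list is copied from the tunnel output.

```shell
#!/usr/bin/env bash
# Return 0 if something accepts TCP connections on localhost:<port>.
port_in_use() {
  local port="$1"
  # Try to open fd 3 on the port inside a subshell; the fd closes when
  # the subshell exits, and connection errors are silenced.
  (exec 3<>"/dev/tcp/127.0.0.1/$port") 2>/dev/null
}

for port in 3000 9090 5432 8000; do
  if port_in_use "$port"; then
    echo "localhost:$port is already in use"
  else
    echo "localhost:$port is free"
  fi
done
```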


In [7]:
# Confirm the upload file
BATCH_SRA=$WORK/"$BATCH"_SraRunInfo.csv
echo  $BATCH_SRA
wc -l $BATCH_SRA

/home/artem/serratus/notebook/200612_ab/v2_SraRunInfo.csv
13392 /home/artem/serratus/notebook/200612_ab/v2_SraRunInfo.csv


In [8]:
# Upload SraRunInfo.csv into Serratus
cd $TF
./uploadSRA.sh $BATCH_SRA

Loading SRARunInfo into scheduler 
  File: /home/artem/serratus/notebook/200612_ab/v2_SraRunInfo.csv
  date: Fri Jun 12 21:47:19 PDT 2020
  wc  : 13392 /home/artem/serratus/notebook/200612_ab/v2_SraRunInfo.csv
  md5 : 72ac566f40f9796c38fed4a90904af49  /home/artem/serratus/notebook/200612_ab/v2_SraRunInfo.csv


--------------------------
tmp.chunk00
10001 tmp.chunk00_sraRunInfo.csv
c4157f921d0687be5450f6d047b14fa2  tmp.chunk00_sraRunInfo.csv
{"inserted_rows":10000,"total_rows":10000}
--------------------------
tmp.chunk01
3392 tmp.chunk01_sraRunInfo.csv
ead2a6ed3eb455138038f4714c618e9e  tmp.chunk01_sraRunInfo.csv
{"inserted_rows":3390,"total_rows":13390}


 uploadSRA complete.
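
The upload log reports line counts from `wc -l`, which include header lines, whereas the scheduler reports inserted rows. To compare the two directly, a sketch like the following counts only data rows, skipping blank lines and any line that starts with the `Run,` header (header lines can recur when SraRunInfo exports are concatenated). The function name `count_data_rows` and the demo file are hypothetical; this is a cross-check, not part of `uploadSRA.sh`.

```shell
# Count lines that are neither blank nor a "Run,..." header line.
count_data_rows() {
  awk 'NF > 0 && $0 !~ /^Run,/ { n++ } END { print n + 0 }' "$1"
}

# Demo on a throwaway file with a repeated header and a blank line
tmp_csv=$(mktemp)
printf 'Run,spots\nSRR1,10\nSRR2,20\n\nRun,spots\nSRR3,30\n' > "$tmp_csv"
count_data_rows "$tmp_csv"
rm -f "$tmp_csv"
```

Run against `$BATCH_SRA`, the result can be set beside the scheduler's `total_rows` value from the log above.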


## Run Serratus

In [11]:
# Set Cluster Parameters =============================
## get Config File (if it doesn't exist)
# curl localhost:8000/config | jq > serratus-config.json
#
cd $TF
# Make local changes to config file
echo "  Cluster Config File: "
cat serratus-config.json
echo ""
echo ""
# Re-upload config file
curl -T serratus-config.json localhost:8000/config

  Cluster Config File: 
{
  "ALIGN_ARGS": "--very-sensitive-local",
  "ALIGN_MAX_INCREASE": 25,
  "ALIGN_SCALING_CONSTANT": 0.0215,
  "ALIGN_SCALING_ENABLE": true,
  "ALIGN_SCALING_MAX": 750,
  "CLEAR_INTERVAL": 600,
  "DL_ARGS": "",
  "DL_MAX_INCREASE": 10,
  "DL_SCALING_CONSTANT": 0.1,
  "DL_SCALING_ENABLE": true,
  "DL_SCALING_MAX": 200,
  "GENOME": "cov3ma",
  "MERGE_ARGS": "",
  "MERGE_MAX_INCREASE": 10,
  "MERGE_SCALING_CONSTANT": 0.2,
  "MERGE_SCALING_ENABLE": true,
  "MERGE_SCALING_MAX": 75,
  "SCALING_INTERVAL": 120,
  "VIRTUAL_SCALING_INTERVAL": 45
}

{"ALIGN_ARGS":"--very-sensitive-local","ALIGN_MAX_INCREASE":25,"ALIGN_SCALING_CONSTANT":0.0215,"ALIGN_SCALING_ENABLE":true,"ALIGN_SCALING_MAX":750,"CLEAR_INTERVAL":600,"D

### Error handling

In [None]:
## Stop the local postgres service if it's running (frees port 5432 for the tunnel)
# systemctl stop postgresql

## Connect to postgres
# psql -h localhost postgres postgres

### ACCESSION OPERATIONS
## Reset SPLITTING accessions to NEW
# UPDATE acc SET state = 'new' WHERE state = 'splitting';

## Reset SPLIT_ERR accessions to NEW
## (repeated failures can be missing SRA data)
# UPDATE acc SET state = 'new' WHERE state = 'split_err';

## Reset MERGE_ERR accessions to SPLIT_DONE
# UPDATE acc SET state = 'split_done' WHERE state = 'merge_err';

## Clear DONE Accessions (ONLY ON COMPLETION)
# DELETE FROM acc WHERE state = 'merge_done';

### BLOCK OPERATIONS

## Reset FAIL blocks to NEW
# UPDATE blocks SET state = 'new' WHERE state = 'fail';

## Reset ALIGNING blocks to NEW
# UPDATE blocks SET state = 'new' WHERE state = 'aligning';

## Clear DONE blocks
# DELETE FROM blocks WHERE state = 'done';
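
The reset statements above all share one shape, so they can be generated rather than retyped. The sketch below only prints SQL and does not connect to the database. Table and state names come from the commented statements; the function name `reset_state_sql` is hypothetical.

```shell
# Print an UPDATE that moves every row of <table> from one state to another.
reset_state_sql() {
  local table="$1" from="$2" to="$3"
  printf "UPDATE %s SET state = '%s' WHERE state = '%s';\n" "$table" "$to" "$from"
}

reset_state_sql acc splitting new
reset_state_sql acc merge_err split_done
reset_state_sql blocks fail new
```

With the tunnel open, the output could be piped into the connection from the cell above, for example `reset_state_sql blocks aligning new | psql -h localhost postgres postgres`. Reviewing the printed statement first is the safer habit.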
