# Run: Phase 1 Clean-up - QC hu / mamm / vert

```
Lead     : ababaian
Issue    : 
Version  : v0.3.3
start    : 2020 06 13
complete : 2020 06 13
files    : ~/serratus/notebook/200612_ab/
s3_files : s3://serratus-public/notebook/200612_ab/
output   : s3://serratus-public/out/200612_qc/
```

### Objectives
- Re-run the hu / mamm / vert accessions from the first batch that have not yet completed


### Initialize local workspace

In [1]:
# Serratus repository path
SERRATUS="/home/artem/serratus"
cd $SERRATUS

# Create local run directory
WORK="$SERRATUS/notebook/200612_ab"
mkdir -p $WORK; cd $WORK

# S3 notebook path
S3_WORK='s3://serratus-public/notebook/200612_ab/'

# date and version
date
git rev-parse HEAD # commit version

Sat Jun 13 08:53:31 PDT 2020
f6fff24f9e18a6104d024e1394788f28154fb63f


## SRA Accession Initialization



### Remove previously completed accessions

In [4]:
cd $WORK

# get previous SraRunInfo
aws s3 sync s3://lovelywater2/sra/ ./

gzip -d *.gz

download: s3://lovelywater2/sra/README.md to ./README.md


In [6]:
# Create a list of all completed runs to date
cd $WORK
INDEX="v2.sra.complete" # from last notebook

wc -l *.csv

echo " -------------------- "

for SRA in *.csv
do
  # Inverse look-up: drop every row that matches a completed
  # accession and write the rest to a new SraRunInfo file
  grep -vif "$INDEX" "$SRA" > "qc_$SRA"

  wc -l "$SRA"
  wc -l "qc_$SRA"
  echo " -------------------- "

done

wc -l qc*csv

   672657 hu_SraRunInfo.csv
    36104 hu_meta_SraRunInfo.csv
   100799 mamm_SraRunInfo.csv
   890747 mu_SraRunInfo.csv
    94909 vert_SraRunInfo.csv
    22252 viro2_SraRunInfo.csv
  1817468 total
 -------------------- 
672657 hu_SraRunInfo.csv
18740 qc_hu_SraRunInfo.csv
 -------------------- 
36104 hu_meta_SraRunInfo.csv
395 qc_hu_meta_SraRunInfo.csv
 -------------------- 
100799 mamm_SraRunInfo.csv
6046 qc_mamm_SraRunInfo.csv
 -------------------- 
890747 mu_SraRunInfo.csv
617567 qc_mu_SraRunInfo.csv
 -------------------- 
94909 vert_SraRunInfo.csv
447 qc_vert_SraRunInfo.csv
 -------------------- 
22252 viro2_SraRunInfo.csv
13392 qc_viro2_SraRunInfo.csv
 -------------------- 
    18740 qc_hu_SraRunInfo.csv
      395 qc_hu_meta_SraRunInfo.csv
     6046 qc_mamm_SraRunInfo.csv
   617567 qc_mu_SraRunInfo.csv
      447 qc_vert_SraRunInfo.csv
    13392 qc_viro2_SraRunInfo.csv
   656587 total
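
One caveat with the inverse look-up above: `grep -f` treats each line of the index as a pattern that may match anywhere in a row, so a completed accession can also remove a longer accession that contains it. The sketch below illustrates this on made-up data (the file names and accessions are hypothetical, not taken from this run) and compares it with a stricter variant using `-w` and `-F`. Whether the stricter form is appropriate depends on what else `v2.sra.complete` contains.

```shell
# Toy data only: file names and accessions are invented for illustration
tmp=$(mktemp -d)
printf 'Run,spots\nSRR100,10\nSRR1000,20\nSRR200,30\n' > "$tmp/toy_SraRunInfo.csv"
printf 'SRR100\n' > "$tmp/toy.complete"

# Substring, case-insensitive matching (same flags as the loop above):
# the pattern SRR100 also removes the row for SRR1000
grep -vif "$tmp/toy.complete" "$tmp/toy_SraRunInfo.csv" > "$tmp/qc_substring.csv"

# Whole-word, fixed-string matching: only the exact accession is removed
grep -vwFf "$tmp/toy.complete" "$tmp/toy_SraRunInfo.csv" > "$tmp/qc_exact.csv"

n_substring=$(wc -l < "$tmp/qc_substring.csv" | tr -d ' ')
n_exact=$(wc -l < "$tmp/qc_exact.csv" | tr -d ' ')
echo "substring match keeps $n_substring lines; exact match keeps $n_exact lines"
```

In both cases the header row survives, because it does not contain any accession.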


In [None]:
# The viro2 QC is already complete; ~320 libraries
# remained unfinished

In [9]:
# Merge all unfinished libraries into one QC file
# Exclude mouse (mu); those are reserved for the 1M push

head -n1  qc_hu_SraRunInfo.csv      >  sra.header.tmp

tail -n+2 qc_hu_SraRunInfo.csv      >  qc_SraRunInfo.tmp
tail -n+2 qc_hu_meta_SraRunInfo.csv >> qc_SraRunInfo.tmp
tail -n+2 qc_mamm_SraRunInfo.csv    >> qc_SraRunInfo.tmp
tail -n+2 qc_vert_SraRunInfo.csv    >> qc_SraRunInfo.tmp

# Shuffle the rows, then prepend the header
shuf qc_SraRunInfo.tmp | cat sra.header.tmp - > qc_SraRunInfo.csv

# summary
wc -l  qc_SraRunInfo.csv
md5sum qc_SraRunInfo.csv


mv qc_SraRunInfo.csv tmp_SraRunInfo.csv
rm qc_*
mv tmp_SraRunInfo.csv qc_SraRunInfo.csv

25625 qc_SraRunInfo.csv
6ee3bf51717de86475aa9c6e6be61fd7  qc_SraRunInfo.csv
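
The merged line count can be cross-checked against the per-file counts reported earlier: each of the four input files contributes its rows minus one header line, and a single header is added back. A minimal sketch of that arithmetic, with the counts copied by hand from the `wc -l` output above:

```shell
# Line counts for qc_hu, qc_hu_meta, qc_mamm, qc_vert (each includes a header)
counts="18740 395 6046 447"

rows=0
for c in $counts; do
  rows=$((rows + c - 1))   # drop one header line per input file
done

expected=$((rows + 1))     # add back the single merged header
echo "expected lines in qc_SraRunInfo.csv: $expected"
```

The result agrees with the `25625` lines reported for `qc_SraRunInfo.csv`.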


### Terraform Initialize

In [10]:
# Terraform customization
git diff $SERRATUS/terraform/main/main.tf

diff --git a/terraform/main/main.tf b/terraform/main/main.tf
index c030eb5..a0a9134 100644
--- a/terraform/main/main.tf
+++ b/terraform/main/main.tf
@@ -89,10 +89,10 @@ module "scheduler" {
   
   security_group_ids = [aws_security_group.internal.id]
   key_name           = var.key_name
-  instance_type      = "c5.2xlarge"
+  instance_type      = "c5.4xlarge"
   dockerhub_account  = var.dockerhub_account
   scheduler_port     = var.scheduler_port
-  flask_workers      = 17 # (2*CPU)+1, according to https://medium.com/building-the-system/gunicorn-3-means-of-concurrency-efbb547674b7
+  flask_workers      = 31 # (2*CPU)+1, according to https://medium.com/building-the-system/gunicorn-3-means-of-concurrency-efbb547674b7
 }
 
 // Cluster monitor
@@ -117,7 +117,7 @@ module "download" {
   security_group_ids = [aws_security_group.internal.id]
 
   instance_type      = "r5.xlarge" // Mitigate the memory leak in fastq-dump
-  volume_size        = 200 // Mitigate the storage 

In [11]:
# Initialize terraform
TF=$SERRATUS/terraform/main
cd $TF
terraform init

# Launch Terraform Cluster
# Initialize the serratus cluster with minimal nodes
terraform apply -auto-approve

[0m[1mInitializing modules...[0m

[0m[1mInitializing the backend...[0m

[0m[1mInitializing provider plugins...[0m

The following providers do not have any version constraints in configuration,
so the latest version was installed.

To prevent automatic upgrades to new major versions that may contain breaking
changes, it is recommended to add version = "..." constraints to the
corresponding provider blocks in configuration, with the constraint strings
suggested below.

* provider.random: version = "~> 2.2"

[0m[1m[32mTerraform has been successfully initialized![0m[32m[0m
[0m[32m
You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so

### Running Serratus

In [12]:
cd $TF

# Open SSH tunnels to the monitor
./create_tunnels.sh

# If a port is reported as already in use,
# find and kill the stale SSH process:
# ps aux | grep ssh
# sudo kill <PID of SSH>

Tunnels created:
    localhost:3000 = grafana
    localhost:9090 = prometheus
    localhost:5432 = postgres
    localhost:8000 = scheduler


In [13]:
# Confirm the upload file
BATCH_SRA=$WORK/qc_SraRunInfo.csv
echo  $BATCH_SRA
wc -l $BATCH_SRA

/home/artem/serratus/notebook/200612_ab/qc_SraRunInfo.csv
25625 /home/artem/serratus/notebook/200612_ab/qc_SraRunInfo.csv


In [14]:
# Upload SraRunInfo.csv into Serratus
cd $TF
./uploadSRA.sh $BATCH_SRA

Loading SRARunInfo into scheduler 
  File: /home/artem/serratus/notebook/200612_ab/qc_SraRunInfo.csv
  date: Sat Jun 13 09:58:34 PDT 2020
  wc  : 25625 /home/artem/serratus/notebook/200612_ab/qc_SraRunInfo.csv
  md5 : 6ee3bf51717de86475aa9c6e6be61fd7  /home/artem/serratus/notebook/200612_ab/qc_SraRunInfo.csv


--------------------------
tmp.chunk00
10001 tmp.chunk00_sraRunInfo.csv
ac43c0e732e3fd11dfab862a2555538e  tmp.chunk00_sraRunInfo.csv
{"inserted_rows":9998,"total_rows":9998}
--------------------------
tmp.chunk01
10001 tmp.chunk01_sraRunInfo.csv
556cb74cc6b5c024879e4a63298c0767  tmp.chunk01_sraRunInfo.csv
{"inserted_rows":9999,"total_rows":19997}
--------------------------
tmp.chunk02
5625 tmp.chunk02_sraRunInfo.csv
4c9d3fb6bdc335898c7f214ad2812b7e  tmp.chunk02_sraRunInfo.csv
{"inserted_rows":5623,"total_rows":25620}


 uploadSRA complete.
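
The chunk sizes in the log (10001 lines, that is one header plus 10000 rows) suggest that the file is split into fixed-size pieces with the header repeated on each. The internals of `uploadSRA.sh` are not shown in this notebook, so the following is only a sketch of how such chunking could be done, on toy data with a hypothetical chunk size of 10 rows. It does not contact the scheduler.

```shell
# Toy input: one header plus 25 made-up rows
tmp=$(mktemp -d)
{ echo "Run,spots"; seq 1 25 | sed 's/^/SRR/; s/$/,1/'; } > "$tmp/toy_SraRunInfo.csv"

# Separate the header, then split the rows into pieces of 10
head -n1  "$tmp/toy_SraRunInfo.csv" > "$tmp/header"
tail -n+2 "$tmp/toy_SraRunInfo.csv" | split -l 10 - "$tmp/chunk"

# Re-attach the header to every piece (split names them chunkaa, chunkab, ...)
for f in "$tmp"/chunk??; do
  cat "$tmp/header" "$f" > "$f.csv"
  wc -l "$f.csv"
done

n_chunks=$(ls "$tmp"/chunk??.csv | wc -l | tr -d ' ')
last_lines=$(wc -l < "$tmp/chunkac.csv" | tr -d ' ')
echo "$n_chunks chunks; last chunk has $last_lines lines"
```

With this layout every chunk is a valid SraRunInfo file on its own, which is consistent with the per-chunk line counts shown in the log.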


## Run Serratus

In [17]:
# Set Cluster Parameters =============================
## get Config File (if it doesn't exist)
# curl localhost:8000/config | jq > serratus-config.json
#
cd $TF
# Make local changes to config file
echo "  Cluster Config File: "
cat serratus-config.json
echo ""
echo ""
# Re-upload config file
curl -T serratus-config.json localhost:8000/config

  Cluster Config File: 
{
  "ALIGN_ARGS": "--very-sensitive-local",
  "ALIGN_MAX_INCREASE": 25,
  "ALIGN_SCALING_CONSTANT": 0.0215,
  "ALIGN_SCALING_ENABLE": true,
  "ALIGN_SCALING_MAX": 1400,
  "CLEAR_INTERVAL": 600,
  "DL_ARGS": "",
  "DL_MAX_INCREASE": 10,
  "DL_SCALING_CONSTANT": 0.1,
  "DL_SCALING_ENABLE": true,
  "DL_SCALING_MAX": 400,
  "GENOME": "cov3ma",
  "MERGE_ARGS": "",
  "MERGE_MAX_INCREASE": 10,
  "MERGE_SCALING_CONSTANT": 0.1,
  "MERGE_SCALING_ENABLE": true,
  "MERGE_SCALING_MAX": 75,
  "SCALING_INTERVAL": 120,
  "VIRTUAL_SCALING_INTERVAL": 45
}

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
{"ALIGN_ARGS":"--very-sensitive-local","ALIGN_MAX_INCREASE":25,"ALIGN_SCALING_CONSTANT":0.0215,"ALIGN_SCALING_ENABLE":true,"ALIGN_SCALING_MAX":1400,"CLEAR_INTERVAL":600,

### Error handling

In [None]:
## Stop postgres if it's running 
# systemctl stop postgresql

## Connect to postgres
# psql -h localhost postgres postgres

### ACCESSION OPERATIONS
## Reset SPLITTING accessions to NEW
# UPDATE acc SET state = 'new' WHERE state = 'splitting';

## Reset SPLIT_ERR accessions to NEW
## (repeated failures can be missing SRA data)
# UPDATE acc SET state = 'new' WHERE state = 'split_err';

## Reset MERGE_ERR accessions to SPLIT_DONE
# UPDATE acc SET state = 'split_done' WHERE state = 'merge_err';

## Clear DONE Accessions (ONLY ON COMPLETION)
# DELETE FROM acc WHERE state = 'merge_done';

### BLOCK OPERATIONS

## Reset FAIL blocks to NEW
# UPDATE blocks SET state = 'new' WHERE state = 'fail';

## Reset ALIGNING blocks to NEW
# UPDATE blocks SET state = 'new' WHERE state = 'aligning';

## Clear DONE blocks
# DELETE FROM blocks WHERE state = 'done';


## Output

`19508` are "persistent errors" or unavailable files. Completed