# Run: Mammals + Push for 1M samples/day


## Introduction

```
Lead     : ababaian / jtaylor
Issue    : 
Version  : v0.3.3
start    : 2020 06 06
complete : YYYY MM DD
files    : ~/serratus/notebook/200530_ab/
s3_files : s3://serratus-public/notebook/200606_hu2/
output   : s3://serratus-public/out/200606_hu2/
```

### Objectives
- Search all mammalian RNA/metagenome/metatranscriptome samples (~80K)
- Search remaining human RNA-seq (900K+)
- Search mouse metagenome/metatranscriptome + RNAseq for total of 1.25M libraries

All output is initially stored in `200606_hu2`, which holds the bulk of the data, and will be partitioned after the run is complete.


### Initialize local workspace

In [1]:
date

Sat Jun  6 11:19:54 PDT 2020


In [2]:
# Serratus commit version
SERRATUS="/home/artem/serratus"
cd $SERRATUS
git rev-parse HEAD # commit version

5616277fa95a5dec7071382477c620d2292d4e91


In [3]:
# Create local run directory
WORK="$SERRATUS/notebook/200606_ab"
mkdir -p $WORK; cd $WORK

# SRA RunInfo Table base for run
# RUNINFO="$WORK/hu1_meta_rand_SraRunInfo.csv"
#
#md5sum $RUNINFO
#aws s3 cp $RUNINFO s3://serratus-public/out/200530_hu1/



## SRA Accession Initialization



### Mammalian Sequences

SRA Accessed: 2020/06/06
Search Term: 
```
("Mammalia"[Organism] NOT "Homo sapiens"[Organism] NOT "Mus musculus"[orgn]) AND ("type_rnaseq"[Filter] OR "metagenomic"[Filter] OR "metatranscriptomic"[Filter]) AND "platform illumina"[Properties]
```

Results: `82367` Accessions saved in `mamm_SraRunInfo.csv`

### Public human RNA-seq 

SRA Accessed: 2020/05/30
Search Term: 
```
"txid9606"[Organism:exp] AND ("type_rnaseq"[Filter]) AND cluster_public[prop] AND "platform illumina"[Properties]
```

Results: `572215` Accessions saved in `hu0_SraRunInfo.csv`

100K of these were selected randomly; all completed libraries are filtered out and the remainder is defined as the `hu2` set.

### Mouse Sequences

SRA Accessed: 2020/06/06
Search Term: 
```
("Mus musculus"[orgn]) AND ("type_rnaseq"[Filter] OR "metagenomic"[Filter] OR "metatranscriptomic"[Filter]) AND "platform illumina"[Properties]
```

Results: `594949` Accessions saved in `mu0_SraRunInfo.csv`



### Remove completed human accessions

In [5]:
# Create a list of all completed runs to date
cd $WORK
RUNINFO=$WORK/hu0_SraRunInfo.csv
BATCH='hu2'
S3_PATH="s3://serratus-public/out/200530_hu1/summary/"


aws s3 ls $S3_PATH > $BATCH.complete
cat $BATCH.complete | sed 's/^...............................//g' - | cut -f1 -d'.' - > $BATCH.sra.complete

cd $WORK

wc -l $RUNINFO
wc -l $BATCH.sra.complete

grep -vif $BATCH.sra.complete $RUNINFO > "$BATCH"_sraRunInfo.csv

wc -l  "$BATCH"_sraRunInfo.csv
md5sum "$BATCH"_sraRunInfo.csv

672657 /home/artem/serratus/notebook/200606_ab/hu0_SraRunInfo.csv
132672 hu2.sra.complete
573852 hu2_sraRunInfo.csv
d33559d55a8f3374f985ba4ef506d7a0  hu2_sraRunInfo.csv
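
The `grep -vif` step above treats every completed accession as a case-insensitive pattern that can match anywhere in a line, so a short accession could also drop longer accessions that contain it. A minimal sketch of an exact-match alternative follows. It assumes the `Run` accession is the first comma-separated field of the RunInfo CSV and that summary files are named `<accession>.<suffix>`; the function names are hypothetical and not part of Serratus.

```shell
# Sketch only: exact-match filtering of completed accessions.
# Assumes `aws s3 ls` lines look like: <date> <time> <size> <key>

# Read `aws s3 ls` lines on stdin, print unique accession IDs.
ls_to_accessions() {
  awk 'NF >= 4 { print $4 }' | cut -f1 -d'.' | sort -u
}

# Print the CSV header plus every row whose first field is not listed.
# Usage: filter_completed <completed_accessions.txt> <SraRunInfo.csv>
filter_completed() {
  awk -F',' '
    FILENAME == ARGV[1] { done[$1] = 1; next }   # load completed IDs
    FNR == 1 || !($1 in done)                    # header or unfinished row
  ' "$1" "$2"
}

# Example (not executed here):
# aws s3 ls "$S3_PATH" | ls_to_accessions > "$BATCH.sra.complete"
# filter_completed "$BATCH.sra.complete" "$RUNINFO" > "${BATCH}_sraRunInfo.csv"
```

Matching on the whole first field also avoids loading 100K+ regular expressions into `grep`, which may be noticeably slower than a single hash lookup per row.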


### Summary and MD5sum for all SRA files

In [6]:
cd $WORK

wc -l *
echo ''
md5sum *
echo ''

md5sum * > 1m_SraRunInfo.md5sum

aws s3 sync ./ s3://serratus-public/out/200606_hu2/

    672657 hu0_SraRunInfo.csv
    573852 hu2_sraRunInfo.csv
    100799 mamm_SraRunInfo.csv
    890747 mu0_SraRunInfo.csv
   2238055 total

2d2998b585f6b5035b051b0960692c96  hu0_SraRunInfo.csv
d33559d55a8f3374f985ba4ef506d7a0  hu2_sraRunInfo.csv
499fa3d5a1fa8cf86efce1925c7e27fd  mamm_SraRunInfo.csv
a9e14f6043f70e485ebebeb81ace8da7  mu0_SraRunInfo.csv

upload: ./1m_SraRunInfo.md5sum to s3://serratus-public/out/200606_hu2/1m_SraRunInfo.md5sum
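
A checksum manifest like the one written above can also be re-checked later, for example after syncing the files back from S3. The following is a minimal sketch assuming GNU coreutils `md5sum`; the function names and the glob pattern are illustrative choices, not part of the run.

```shell
# Sketch: write an md5 manifest for the RunInfo files, then verify it.
# Usage: write_manifest <dir> <manifest_name>
write_manifest() {
  ( cd "$1" && md5sum -- *_[sS]raRunInfo.csv > "$2" )
}

# Exit status is non-zero if any listed file is missing or has changed.
# Usage: verify_manifest <dir> <manifest_name>
verify_manifest() {
  ( cd "$1" && md5sum --quiet -c "$2" )
}

# Example (not executed here):
# write_manifest  "$WORK" 1m_SraRunInfo.md5sum
# verify_manifest "$WORK" 1m_SraRunInfo.md5sum
```

Restricting the glob to `*_[sS]raRunInfo.csv` keeps the manifest file itself and temporary files out of the checksum list.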

### Running Serratus

In [None]:
# Set Cluster Parameters =============================
## get Config File (if it doesn't exist)
# curl localhost:8000/config | jq > serratus-config.json
#
cd $TF
# Make local changes to config file
echo "  Cluster Config File: "
cat serratus-config.json
echo ""
echo ""
# Re-upload config file
curl -T serratus-config.json localhost:8000/config

In [11]:
#!/bin/bash
# upSerratus.sh <sraRunInfo.csv>
# SRA uploader script
# 
set -eu

# input SRA file
CURRENT_BATCH=$1

# Load SRA Run Info into scheduler ===================
# Scheduler DNS: 
echo "Loading SRARunInfo into scheduler "
echo "  File: $CURRENT_BATCH"
echo "  md5 : $(md5sum $CURRENT_BATCH)"
echo "  date: $(date)"
echo ""

head -n1 $CURRENT_BATCH > sra.header.tmp

tail -n+2 $CURRENT_BATCH | split -d -l 10000 - upBatch

for FILE in $(ls upBatch*); do

  cat sra.header.tmp $FILE > "$FILE"_sraRunInfo.csv

  wc -l "$FILE"_sraRunInfo.csv
  md5sum "$FILE"_sraRunInfo.csv
  
  curl -s -X POST -T "$FILE"_sraRunInfo.csv localhost:8000/jobs/add_sra_run_info/
  
done

rm upBatch* *tmp



```
{
  "ALIGN_ARGS": "--very-sensitive-local",
  "ALIGN_MAX_INCREASE": 100,
  "ALIGN_SCALING_CONSTANT": 0.020,
  "ALIGN_SCALING_ENABLE": true,
  "ALIGN_SCALING_MAX": 3000,
  "CLEAR_INTERVAL": 600,
  "DL_ARGS": "",
  "DL_MAX_INCREASE": 10,
  "DL_SCALING_CONSTANT": 0.1,
  "DL_SCALING_ENABLE": true,
  "DL_SCALING_MAX": 400,
  "GENOME": "cov3ma",
  "MERGE_ARGS": "",
  "MERGE_MAX_INCREASE": 10,
  "MERGE_SCALING_CONSTANT": 0.02,
  "MERGE_SCALING_ENABLE": true,
  "MERGE_SCALING_MAX": 100,
  "SCALING_INTERVAL": 300,
  "VIRTUAL_SCALING_INTERVAL": 60
}
```

## Error Rate Logging

```
Hour  Merge_done  split_err
16:00 0 0
17:00 5680 1405
18:00 16150 1800
19:00 38000 2050
20:00 62700 2250
21:00 102000 3440
22:00 137000 6000
23:00 137000 11200
24:00 146000 14000
```
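
One way to read the log above is as a running error fraction. The sketch below computes `split_err / (merge_done + split_err)` for each row; that definition is an assumption made for illustration, and `error_fraction` is a hypothetical helper.

```shell
# Sketch: per-row split error fraction from "hour merge_done split_err" rows.
# The definition err / (done + err) is an assumption for illustration.
# Header rows and all-zero rows are skipped.
error_fraction() {
  awk '$2 + $3 > 0 { printf "%s %.2f%%\n", $1, 100 * $3 / ($2 + $3) }'
}

printf '%s\n' '17:00 5680 1405' '24:00 146000 14000' | error_fraction
# 17:00 19.83%
# 24:00 8.75%
```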

At ~19:00 we hit the EBS limit of 300 TiB of gp2 storage allowed per region. The cluster was at 1360:3495:36 instances (19,492 vCPU) at that point. :)

Shifted to an aligner-heavy ratio, 1302:4257:62 (22,360 vCPU) --> 100 GB/s of alignment throughput
vs

compression ratio 14:84
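
The vCPU totals quoted for the instance ratios can be reproduced with a short calculation. This sketch assumes the ratio is ordered download:align:merge and that nodes have 4, 4 and 2 vCPU respectively. The download (`r5.xlarge`) and merge (`c5.large`) types appear in the Terraform diff in this notebook; the 4-vCPU align node is inferred from the arithmetic and is not stated here.

```shell
# Sketch: total vCPU for a "download:align:merge" instance ratio.
# Assumed vCPU per node: download = 4, align = 4 (inferred), merge = 2.
vcpu_total() {
  echo "$1" | awk -F':' '{ print $1 * 4 + $2 * 4 + $3 * 2 }'
}

vcpu_total 1360:3495:36   # 19492
vcpu_total 1302:4257:62   # 22360
```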


# Run Attempt 2 -- 200607 hu3


In [1]:
# Serratus commit version
SERRATUS="/home/artem/serratus"
cd $SERRATUS
git rev-parse HEAD # commit version

0d57946dbb980bb4c8fa714d155a6c54fb9b8525


In [2]:
# Create local run directory
WORK="$SERRATUS/notebook/200606_ab"
mkdir -p $WORK; cd $WORK



In [6]:
# Create a list of all completed runs to date
cd $WORK
BATCH='hu3'
S3_PATH="s3://serratus-public/out/200606_hu2/summary/"

aws s3 ls $S3_PATH > $BATCH.complete
cat $BATCH.complete | sed 's/^...............................//g' - | cut -f1 -d'.' - > $BATCH.sra.complete


cd $WORK
RUNINFO=$WORK/hu2_sraRunInfo.csv

wc -l $RUNINFO
wc -l $BATCH.sra.complete

grep -vif $BATCH.sra.complete $RUNINFO > "$BATCH"_sraRunInfo.csv

wc -l  "$BATCH"_sraRunInfo.csv
md5sum "$BATCH"_sraRunInfo.csv

CURRENT_SRA="hu3_sraRunInfo.csv"

573852 /home/artem/serratus/notebook/200606_ab/hu2_sraRunInfo.csv
231897 hu3.sra.complete
435440 hu3_sraRunInfo.csv
9537493f6723ce90a4fa35d093dcc786  hu3_sraRunInfo.csv


In [7]:
aws s3 cp hu3_sraRunInfo.csv s3://serratus-public/out/200606_hu2/


### Terraform Initialize

In [15]:
# Terraform customization
git diff $SERRATUS/terraform/main/main.tf

diff --git a/terraform/main/main.tf b/terraform/main/main.tf
index c030eb5..2ccb3bf 100644
--- a/terraform/main/main.tf
+++ b/terraform/main/main.tf
@@ -117,7 +117,7 @@ module "download" {
   security_group_ids = [aws_security_group.internal.id]
 
   instance_type      = "r5.xlarge" // Mitigate the memory leak in fastq-dump
-  volume_size        = 200 // Mitigate the storage leak in fastq-dump
+  volume_size        = 150 // Mitigate the storage leak in fastq-dump
   spot_price         = 0.10
 
   s3_bucket          = module.work_bucket.name
@@ -159,7 +159,7 @@ module "merge" {
   dev_cidrs          = var.dev_cidrs
   security_group_ids = [aws_security_group.internal.id]
   instance_type      = "c5.large"
-  volume_size        = 200 // prevent disk overflow via samtools sort
+  volume_size        = 100 // prevent disk overflow via samtools sort
   spot_price         = 0.05
   s3_bucket          = module.work_bucket.name
   s3_delete_prefix   = "bam-blocks"
@@ -171,

In [16]:
# Initialize terraform
TF=$SERRATUS/terraform/main
cd $TF
terraform init

# Launch Terraform Cluster
# Initialize the serratus cluster with minimal nodes
terraform apply -auto-approve

[0m[1mInitializing modules...[0m

[0m[1mInitializing the backend...[0m

[0m[1mInitializing provider plugins...[0m

The following providers do not have any version constraints in configuration,
so the latest version was installed.

To prevent automatic upgrades to new major versions that may contain breaking
changes, it is recommended to add version = "..." constraints to the
corresponding provider blocks in configuration, with the constraint strings
suggested below.

* provider.random: version = "~> 2.2"

[0m[1m[32mTerraform has been successfully initialized![0m[32m[0m
[0m[32m
You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so

In [17]:
cd $TF

# Open SSH tunnels to the monitor
./create_tunnels.sh

# If you get an error on port
# run:
# ps aux | grep ssh
# sudo kill <PID of SSH>
#

Tunnels created:
    localhost:3000 = grafana
    localhost:9090 = prometheus
    localhost:5432 = postgres
    localhost:8000 = scheduler


### Serratus Initialize

In [18]:
# #!/bin/bash
# upSerratus.sh <sraRunInfo.csv>
# SRA uploader script
# 
# set -eu
cd $WORK

# input SRA file
CURRENT_BATCH="hu3_sraRunInfo.csv"

# Load SRA Run Info into scheduler ===================
# Scheduler DNS: 
echo "Loading SRARunInfo into scheduler "
echo "  File: $CURRENT_BATCH"
echo "  md5 : $(md5sum $CURRENT_BATCH)"
echo "  date: $(date)"
echo ""

head -n1 $CURRENT_BATCH > sra.header.tmp

tail -n+2 $CURRENT_BATCH | split -d -l 10000 - upBatch

for FILE in $(ls upBatch*); do

  cat  sra.header.tmp > "$FILE"_sraRunInfo.csv
  shuf $FILE >> "$FILE"_sraRunInfo.csv

  wc -l "$FILE"_sraRunInfo.csv
  md5sum "$FILE"_sraRunInfo.csv
  
  curl -s -X POST -T "$FILE"_sraRunInfo.csv localhost:8000/jobs/add_sra_run_info/
  
done

rm upBatch* *tmp

Loading SRARunInfo into scheduler 
  File: hu3_sraRunInfo.csv
  md5 : 9537493f6723ce90a4fa35d093dcc786  hu3_sraRunInfo.csv
  date: Sun Jun  7 19:40:49 PDT 2020

10001 upBatch00_sraRunInfo.csv
d7a10a30671db8e803f7e5bbdd7564d4  upBatch00_sraRunInfo.csv
{"inserted_rows":10000,"total_rows":10000}
10001 upBatch01_sraRunInfo.csv
63a64ea5985c8f05e7bb6ed3d591d007  upBatch01_sraRunInfo.csv
{"inserted_rows":10000,"total_rows":20000}
10001 upBatch02_sraRunInfo.csv
ef40111a079bf54a802ca17beeaa2c1c  upBatch02_sraRunInfo.csv
{"inserted_rows":10000,"total_rows":30000}
10001 upBatch03_sraRunInfo.csv
6ffda2de659ff24ff77f0d0c5dc1ff07  upBatch03_sraRunInfo.csv
{"inserted_rows":10000,"total_rows":40000}
10001 upBatch04_sraRunInfo.csv
3c9c6f6d723398fd3713395b18ab60d4  upBatch04_sraRunInfo.csv
{"inserted_rows":10000,"total_rows":50000}
10001 upBatch05_sraRunInfo.csv
8e294d4e0e5507fae9c7199f6b3d899d  upBatch05_sraRunInfo.csv
{"inserted_rows":10000,"total_rows":60000}
10001 upBatch06_sr

In [22]:
# #!/bin/bash
# upSerratus.sh <sraRunInfo.csv>
# SRA uploader script
# 
# set -eu
cd $WORK

# input SRA file
CURRENT_BATCH="mu0_SraRunInfo.csv"

# Load SRA Run Info into scheduler ===================
# Scheduler DNS: 
echo "Loading SRARunInfo into scheduler "
echo "  File: $CURRENT_BATCH"
echo "  md5 : $(md5sum $CURRENT_BATCH)"
echo "  date: $(date)"
echo ""

head -n1 $CURRENT_BATCH > sra.header.tmp

tail -n+2 $CURRENT_BATCH | split -d -l 10000 - upBatch

for FILE in $(ls upBatch*); do

  cat  sra.header.tmp > "$FILE"_sraRunInfo.csv
  shuf $FILE >> "$FILE"_sraRunInfo.csv

  wc -l "$FILE"_sraRunInfo.csv
  md5sum "$FILE"_sraRunInfo.csv
  
  curl -s -X POST -T "$FILE"_sraRunInfo.csv localhost:8000/jobs/add_sra_run_info/
  
done

rm upBatch* *tmp

Loading SRARunInfo into scheduler 
  File: mu0_SraRunInfo.csv
  md5 : a9e14f6043f70e485ebebeb81ace8da7  mu0_SraRunInfo.csv
  date: Sun Jun  7 20:12:31 PDT 2020

10001 upBatch00_sraRunInfo.csv
becc457fefc49edca52589270ccf48dc  upBatch00_sraRunInfo.csv
{"inserted_rows":10000,"total_rows":445438}
10001 upBatch01_sraRunInfo.csv
3d7411235bfbbd273bf85502056258d5  upBatch01_sraRunInfo.csv
{"inserted_rows":10000,"total_rows":455438}
10001 upBatch02_sraRunInfo.csv
91f557c53915452fb667d23de79e3055  upBatch02_sraRunInfo.csv
{"inserted_rows":10000,"total_rows":465438}
10001 upBatch03_sraRunInfo.csv
295d55eb7f999f1fca1e3336724b5a31  upBatch03_sraRunInfo.csv
{"inserted_rows":10000,"total_rows":475438}
10001 upBatch04_sraRunInfo.csv
c32b0c9181c0efd6d87a5d545ddc059e  upBatch04_sraRunInfo.csv
{"inserted_rows":10000,"total_rows":485438}
10001 upBatch05_sraRunInfo.csv
bfe6babe1653d5348dd9e18200afedaa  upBatch05_sraRunInfo.csv
{"inserted_rows":10000,"total_rows":495438}
10001 upBatc
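
Each upload returns a small JSON body with `inserted_rows` and `total_rows`. A possible sanity check, sketched below, compares `inserted_rows` with the number of data rows in the chunk that was sent. `check_inserted` is a hypothetical helper and the response is parsed with `sed` only to keep the sketch dependency-free.

```shell
# Sketch: compare the scheduler's reported inserted_rows with the chunk size.
# Usage: check_inserted <response_json> <chunk_csv>
check_inserted() {
  inserted=$(printf '%s\n' "$1" | sed -n 's/.*"inserted_rows":\([0-9][0-9]*\).*/\1/p')
  expected=$(( $(wc -l < "$2") - 1 ))      # data rows, header excluded
  if [ "$inserted" = "$expected" ]; then
    echo "ok: $inserted rows inserted"
  else
    echo "mismatch: inserted=${inserted:-none} expected=$expected" >&2
    return 1
  fi
}

# Example (not executed here):
# RESPONSE=$(curl -s -X POST -T "$FILE"_sraRunInfo.csv localhost:8000/jobs/add_sra_run_info/)
# check_inserted "$RESPONSE" "$FILE"_sraRunInfo.csv
```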

## Run Serratus

In [4]:
# Set Cluster Parameters =============================
## get Config File (if it doesn't exist)
# curl localhost:8000/config | jq > serratus-config.json
#
cd $TF
# Make local changes to config file
echo "  Cluster Config File: "
cat serratus-config.json
echo ""
echo ""
# Re-upload config file
curl -T serratus-config.json localhost:8000/config

  Cluster Config File: 
{
  "ALIGN_ARGS": "--very-sensitive-local",
  "ALIGN_MAX_INCREASE": 25,
  "ALIGN_SCALING_CONSTANT": 0.0215,
  "ALIGN_SCALING_ENABLE": true,
  "ALIGN_SCALING_MAX": 1200,
  "CLEAR_INTERVAL": 600,
  "DL_ARGS": "",
  "DL_MAX_INCREASE": 10,
  "DL_SCALING_CONSTANT": 0.1,
  "DL_SCALING_ENABLE": true,
  "DL_SCALING_MAX": 450,
  "GENOME": "cov3ma",
  "MERGE_ARGS": "",
  "MERGE_MAX_INCREASE": 10,
  "MERGE_SCALING_CONSTANT": 0.01,
  "MERGE_SCALING_ENABLE": true,
  "MERGE_SCALING_MAX": 75,
  "SCALING_INTERVAL": 120,
  "VIRTUAL_SCALING_INTERVAL": 45
}

{"ALIGN_ARGS":"--very-sensitive-local","ALIGN_MAX_INCREASE":25,"ALIGN_SCALING_CONSTANT":0.0215,"ALIGN_SCALING_ENABLE":true,"ALIGN_SCALING_MAX":1200,"CLEAR_INTERVAL":600

### Notes



Run Start: `19:43`

```
19:43 0 0
20:43 7000 74
21:43 40000 144
22:43 122000 163
23:43 208430 328
```
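
The hourly counts above can be turned into a throughput estimate against the 1M samples/day target. The sketch below takes the difference between consecutive rows and multiplies by 24; it assumes one row per hour and that the second column is the cumulative completed count. `hourly_rate` is a hypothetical helper.

```shell
# Sketch: accessions completed per interval, extrapolated to a 24 h rate.
# Input rows: "<time> <cumulative_done> <errors>", one row per hour.
hourly_rate() {
  awk 'NR > 1 { d = $2 - prev; printf "%s %d/hour %d/day\n", $1, d, d * 24 }
       { prev = $2 }'
}

printf '%s\n' '22:43 122000 163' '23:43 208430 328' | hourly_rate
# 23:43 86430/hour 2074320/day
```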


Ratios:
1350:4140:150 = 22,260 vCPU  --> Aligner hot
1460:4124:90  = 22,516 vCPU  --> Aligner hot

`Stable Config -- Weekend`

```
{
  "ALIGN_ARGS": "--very-sensitive-local",
  "ALIGN_MAX_INCREASE": 25,
  "ALIGN_SCALING_CONSTANT": 0.0215,
  "ALIGN_SCALING_ENABLE": true,
  "ALIGN_SCALING_MAX": 3600,
  "CLEAR_INTERVAL": 600,
  "DL_ARGS": "",
  "DL_MAX_INCREASE": 10,
  "DL_SCALING_CONSTANT": 0.1,
  "DL_SCALING_ENABLE": true,
  "DL_SCALING_MAX": 1350,
  "GENOME": "cov3ma",
  "MERGE_ARGS": "",
  "MERGE_MAX_INCREASE": 10,
  "MERGE_SCALING_CONSTANT": 0.01,
  "MERGE_SCALING_ENABLE": true,
  "MERGE_SCALING_MAX": 100,
  "SCALING_INTERVAL": 120,
  "VIRTUAL_SCALING_INTERVAL": 45
}
```

In [None]:
## Stop postgres if it's running 
# systemctl stop postgresql

## Connect to postgres
# psql -h localhost postgres postgres

### ACCESSION OPERATIONS
## Reset SPLITTING accessions to NEW
# UPDATE acc SET state = 'new' WHERE state = 'splitting';

## Reset SPLIT_ERR accessions to NEW
## (repeated failures can be missing SRA data)
# UPDATE acc SET state = 'new' WHERE state = 'split_err';

## Reset MERGE_ERR accessions to SPLIT_DONE
# UPDATE acc SET state = 'split_done' WHERE state = 'merge_err';

## Clear DONE Accessions (ONLY ON COMPLETION)
# DELETE FROM acc WHERE state = 'merge_done';

### BLOCK OPERATIONS

##  Reset FAIL blocks to NEW
# UPDATE blocks SET state = 'new' WHERE state = 'fail';

# Reset ALIGNING blocks to NEW
# UPDATE blocks SET state = 'new' WHERE state = 'aligning';

# Clear DONE blocks
# DELETE FROM blocks WHERE state = 'done';
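
The reset statements above all follow one pattern, so they could be generated rather than retyped. This is a sketch under the assumption that the `acc` and `blocks` tables and their `state` column are as shown in the comments above; `reset_sql` and `count_sql` are hypothetical helpers that only print SQL and do not touch the database.

```shell
# Sketch: build one of the state-reset statements listed above.
# Usage: reset_sql <table> <from_state> <to_state>
reset_sql() {
  printf "UPDATE %s SET state = '%s' WHERE state = '%s';\n" "$1" "$3" "$2"
}

# Count rows per state, useful to review before and after a reset.
# Usage: count_sql <table>
count_sql() {
  printf 'SELECT state, COUNT(*) FROM %s GROUP BY state;\n' "$1"
}

# Example (not executed here): send the statement over the postgres tunnel.
# count_sql acc                  | psql -h localhost postgres postgres
# reset_sql acc split_err new    | psql -h localhost postgres postgres
```

Printing the statement first makes it easy to review what will run before it is piped to `psql`.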


### Placeholder script (unfinished)

```
#!/bin/bash
# =====================================
# Serratus - uploadSRA.sh
# =====================================
#
# Usage: 
# uploadSRA.sh <sraRunInfo.csv>
#
# script for uploading sraRunInfo.csv
# files into Serratus in chunks and with
# randomization of input to normalize load.
# 
set -eu

# Config parameters -----------------------------
# Input SRA file
INPUT_SRA=${1:-}  # default to empty so the usage check below still runs under 'set -u'

# Chunk size for uploading
SIZE=10000
# -----------------------------------------------

# Check that sraRunInfo was provided

if [ -z "$INPUT_SRA" ]; then
    echo "Usage:"
    echo "  uploadSRA.sh <sraRunInfo.csv>"
    exit 1
fi

# Script =============================================

# Descriptive parsing --------------------------------
# Scheduler DNS: 
echo "Loading SRARunInfo into scheduler "
echo "  File: $INPUT_SRA"
echo "  md5 : $(md5sum $INPUT_SRA)"
echo "  date: $(date)"
echo ""
echo ""

# Extract header from csv input
head -n1 $INPUT_SRA > sra.header.tmp

# Split the input csv file into $SIZE chunks
tail -n+2 $INPUT_SRA | split -d -l $SIZE - tmp.chunk

# Re-header an sraRunInfo file for each chunk
# with randomization of the data order
# and upload to Serratus
for CHUNK in $(ls tmp.chunk*); do

  cat  sra.header.tmp > "$CHUNK"_sraRunInfo.csv
  shuf $CHUNK >> "$CHUNK"_sraRunInfo.csv

  echo '--------------------------'
  echo $CHUNK
  wc -l "$CHUNK"_sraRunInfo.csv
  md5sum "$CHUNK"_sraRunInfo.csv
  
  # Upload to Serratus
  # via curl (localhost:8000)
  curl -s -X POST -T "$CHUNK"_sraRunInfo.csv \
    localhost:8000/jobs/add_sra_run_info/
  
  # Clean-up
  rm $CHUNK "$CHUNK"_sraRunInfo.csv
done

rm sra.header.tmp

echo ""
echo ""
echo " uploadSRA complete."
```

```
# #!/bin/bash
# upSerratus.sh <sraRunInfo.csv>
# SRA uploader script
# breaks up big sraRunInfo.csv into chunks
# 
#set -eu
cd $WORK

# input SRA file
#CURRENT_SRA=$1
CURRENT_SRA="hu3_sraRunInfo.csv"

# Total size
SRA_SIZE=$(wc -l $CURRENT_SRA | cut -f1 -d' ' - )

# Size of each upload file
SIZE=10000

# Load SRA Run Info into scheduler ===================
# Scheduler DNS: 
echo "Loading sraRunInfo into scheduler "
echo "  File: $CURRENT_SRA"
echo "  wc  : $SRA_SIZE"
echo "  md5 : $(md5sum $CURRENT_SRA)"
echo "  date: $(date)"
echo ""

head -n1 $CURRENT_SRA > sra.header.tmp

# File coordinates to upload
START=2
((END=$START + $SIZE - 1))  # inclusive range of exactly $SIZE rows

while [ $END -lt $SRA_SIZE ]; do

  BATCH=batch_"$START"_"$END"_sraRunInfo.csv
  
  cat sra.header.tmp > $BATCH  # header only; data rows are appended below
  sed -n "$START","$END"p $CURRENT_SRA | shuf - >> $BATCH
  
  echo "  uploading -- $BATCH"
  wc -l  $BATCH
  md5sum $BATCH
  
  curl -s -X POST -T $BATCH localhost:8000/jobs/add_sra_run_info/
  rm $BATCH
  
  ((START=$END + 1))
  ((END=$END + $SIZE))
done

# Last iteration
BATCH=batch_"$START"_"$END"_sraRunInfo.csv
  
cat sra.header.tmp > $BATCH  # header only; data rows are appended below
sed -n "$START","$END"p $CURRENT_SRA >> $BATCH

echo "  uploading -- $BATCH"
wc -l  $BATCH
md5sum $BATCH

curl -s -X POST -T $BATCH localhost:8000/jobs/add_sra_run_info/
rm $BATCH

```

# Run Attempt 3 -- 200609 hu4


In [40]:
# Serratus commit version
SERRATUS="/home/artem/serratus"
cd $SERRATUS

# Terraform Directory
TF=$SERRATUS/terraform/main

# Create local run directory
WORK="$SERRATUS/notebook/200606_ab"
mkdir -p $WORK; cd $WORK

date
git rev-parse HEAD # commit version

Tue Jun  9 11:29:38 PDT 2020
e06988ab1c25124b50b01d59404c8182495f3687


In [2]:
# Notebook restart
# Serratus commit version
SERRATUS="/home/artem/serratus"
cd $SERRATUS

# Terraform Directory
TF=$SERRATUS/terraform/main

# Create local run directory
WORK="$SERRATUS/notebook/200606_ab"
mkdir -p $WORK; cd $WORK

date
git rev-parse HEAD # commit version

Wed Jun 10 15:47:48 PDT 2020
30a199d9dcd5e2550581666e19ecfd7a3dc1f447


### Initialize SRA RunInfo

In [44]:
# Create a list of all completed runs to date
cd $WORK
BATCH='hu4'
S3_PATH_1="s3://serratus-public/out/200530_hu1/summary/"
S3_PATH_2="s3://serratus-public/out/200606_hu2/summary/"
S3_PATH_3="s3://serratus-public/out/200607_hu3/summary/"

aws s3 ls $S3_PATH_1 > hu1.complete
aws s3 ls $S3_PATH_2 > hu2.complete
aws s3 ls $S3_PATH_3 > hu3.complete

cat *.complete | sed 's/^...............................//g' - \
  | cut -f1 -d'.' - \
  > $BATCH.sra.complete

cd $WORK
# Original SRA File
RUNINFO=$WORK/hu0_SraRunInfo.csv

wc -l $RUNINFO
wc -l *.complete

grep -vif $BATCH.sra.complete $RUNINFO > "$BATCH"_sraRunInfo.csv

wc -l  "$BATCH"_sraRunInfo.csv
md5sum "$BATCH"_sraRunInfo.csv

CURRENT_SRA="$BATCH"_sraRunInfo.csv

CURRENT_BATCH="$WORK/$CURRENT_SRA"

672657 /home/artem/serratus/notebook/200606_ab/hu0_SraRunInfo.csv
  132672 hu1.complete
  231897 hu2.complete
  277038 hu3.complete
  641607 hu4.sra.complete
 1283214 total
422652 hu4_sraRunInfo.csv
86c6ef38406eb87106b90f63186cab44  hu4_sraRunInfo.csv


In [49]:
# Looks like the hu3 data is almost entirely mouse;
# the uploader or something else must have affected run order.
# Luckily data is not duplicated!

# Once all hu/mu data is complete I'll have to re-organize.
wc -l mu0_SraRunInfo.csv

cat hu3.complete | sed 's/^...............................//g' - \
  | cut -f1 -d'.' - \
  > hu3.tmp

grep -f hu3.tmp mu0_SraRunInfo.csv | wc -l

rm hu3.tmp

890747 mu0_SraRunInfo.csv
264368
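
The overlap count above relies on `grep -f`, which matches each accession as a pattern anywhere in the line. A sketch of an exact-match count is given below, assuming the `Run` accession is the first comma-separated field of the RunInfo CSV; `count_overlap` is a hypothetical helper.

```shell
# Sketch: count accessions present in both an accession list and a RunInfo CSV.
# Usage: count_overlap <accessions.txt> <SraRunInfo.csv>
count_overlap() {
  awk -F',' '
    FILENAME == ARGV[1] { want[$1] = 1; next }   # load accession list
    FNR > 1 && ($1 in want) { n++ }              # skip header, count exact hits
    END { print n + 0 }
  ' "$1" "$2"
}

# Example (not executed here):
# count_overlap hu3.tmp mu0_SraRunInfo.csv
```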


In [16]:
# hu5 addendum
# Create a list of all completed runs to date
cd $WORK
BATCH='hu5'
S3_PATH_3="s3://serratus-public/out/200609_hu4/summary/"

aws s3 ls $S3_PATH_3 > hu4.complete

# Append hu4 to other hu* complete
cat *.complete | sed 's/^...............................//g' - \
  | cut -f1 -d'.' - \
  > $BATCH.sra.complete

cd $WORK
# Original SRA File
RUNINFO=$WORK/hu0_SraRunInfo.csv

wc -l $RUNINFO
wc -l *.complete

grep -vif $BATCH.sra.complete $RUNINFO > "$BATCH"_sraRunInfo.csv

wc -l  "$BATCH"_sraRunInfo.csv
md5sum "$BATCH"_sraRunInfo.csv

CURRENT_SRA="$BATCH"_sraRunInfo.csv

CURRENT_BATCH="$WORK/$CURRENT_SRA"

672657 /home/artem/serratus/notebook/200606_ab/hu0_SraRunInfo.csv
  132672 hu1.complete
  231897 hu2.complete
  277038 hu3.complete
  388928 hu4.complete
  641607 hu4.sra.complete
 1672142 hu5.sra.complete
 3344284 total
33648 hu5_sraRunInfo.csv
ca2b41a7ce11125b08a1a40c3fe2ce17  hu5_sraRunInfo.csv


### Serratus Initialize

In [17]:
# Initialize terraform workspace
cd $TF

# Terraform run parameters
# (changes from git commit)
git diff $SERRATUS/terraform/main/main.tf

diff --git a/terraform/main/main.tf b/terraform/main/main.tf
index c030eb5..86fb4c6 100644
--- a/terraform/main/main.tf
+++ b/terraform/main/main.tf
@@ -89,10 +89,10 @@ module "scheduler" {
   
   security_group_ids = [aws_security_group.internal.id]
   key_name           = var.key_name
-  instance_type      = "c5.2xlarge"
+  instance_type      = "c5.4xlarge"
   dockerhub_account  = var.dockerhub_account
   scheduler_port     = var.scheduler_port
-  flask_workers      = 17 # (2*CPU)+1, according to https://medium.com/building-the-system/gunicorn-3-means-of-concurrency-efbb547674b7
+  flask_workers      = 31 # (2*CPU)+1, according to https://medium.com/building-the-system/gunicorn-3-means-of-concurrency-efbb547674b7
 }
 
 // Cluster monitor
@@ -159,9 +159,11 @@ module "merge" {
   dev_cidrs          = var.dev_cidrs
   security_group_ids = [aws_security_group.internal.id]
   instance_type      = "c5.large"
-  volume_size        = 200 // prevent disk overflow via samt

In [18]:
cd $TF

# Check terraform configuration files
terraform init

# Launch Terraform Cluster
terraform apply -auto-approve

[0m[1mInitializing modules...[0m

[0m[1mInitializing the backend...[0m

[0m[1mInitializing provider plugins...[0m

The following providers do not have any version constraints in configuration,
so the latest version was installed.

To prevent automatic upgrades to new major versions that may contain breaking
changes, it is recommended to add version = "..." constraints to the
corresponding provider blocks in configuration, with the constraint strings
suggested below.

* provider.random: version = "~> 2.2"

[0m[1m[32mTerraform has been successfully initialized![0m[32m[0m
[0m[32m
You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so

In [19]:
cd $TF

# Open SSH tunnels to the monitor
./create_tunnels.sh

# If you get an error on port
# run:
# ps aux | grep ssh
# sudo kill <PID of SSH>
#

Tunnels created:
    localhost:3000 = grafana
    localhost:9090 = prometheus
    localhost:5432 = postgres
    localhost:8000 = scheduler


In [21]:
# Upload data
CURRENT_BATCH="$WORK/$CURRENT_SRA"

echo $CURRENT_BATCH
wc -l $CURRENT_BATCH
md5sum $CURRENT_BATCH
md5sum $CURRENT_BATCH > "$CURRENT_BATCH".md5

aws s3 cp $CURRENT_BATCH       s3://serratus-public/out/200609_hu4/
aws s3 cp "$CURRENT_BATCH".md5 s3://serratus-public/out/200609_hu4/

/home/artem/serratus/notebook/200606_ab/hu5_sraRunInfo.csv
33648 /home/artem/serratus/notebook/200606_ab/hu5_sraRunInfo.csv
ca2b41a7ce11125b08a1a40c3fe2ce17  /home/artem/serratus/notebook/200606_ab/hu5_sraRunInfo.csv

In [59]:
# hu4
cd $TF
./uploadSRA.sh $CURRENT_BATCH

Loading SRARunInfo into scheduler 
  File: /home/artem/serratus/notebook/200606_ab/hu4_sraRunInfo.csv
  date: Tue Jun  9 13:46:09 PDT 2020
  wc  : 422652 /home/artem/serratus/notebook/200606_ab/hu4_sraRunInfo.csv
  md5 : 86c6ef38406eb87106b90f63186cab44  /home/artem/serratus/notebook/200606_ab/hu4_sraRunInfo.csv


--------------------------
tmp.chunk00
10001 tmp.chunk00_sraRunInfo.csv
1821f141202ff3dbb77101945ba2d641  tmp.chunk00_sraRunInfo.csv
{"inserted_rows":10000,"total_rows":10000}
--------------------------
tmp.chunk01
10001 tmp.chunk01_sraRunInfo.csv
059dfb8acfb7489182382028723339f2  tmp.chunk01_sraRunInfo.csv
{"inserted_rows":10000,"total_rows":20000}
--------------------------
tmp.chunk02
10001 tmp.chunk02_sraRunInfo.csv
c592ebc726ba39e789e7980f0e6b0c8c  tmp.chunk02_sraRunInfo.csv
{"inserted_rows":10000,"total_rows":30000}
--------------------------
tmp.chunk03
10001 tmp.chunk03_sraRunInfo.csv
eefe67acb63cdc8473b25db15f57a23b  tmp.chunk03_sraRunInfo.cs

In [22]:
# hu5
cd $TF
./uploadSRA.sh $CURRENT_BATCH

Loading SRARunInfo into scheduler 
  File: /home/artem/serratus/notebook/200606_ab/hu5_sraRunInfo.csv
  date: Thu Jun 11 11:06:11 PDT 2020
  wc  : 33648 /home/artem/serratus/notebook/200606_ab/hu5_sraRunInfo.csv
  md5 : ca2b41a7ce11125b08a1a40c3fe2ce17  /home/artem/serratus/notebook/200606_ab/hu5_sraRunInfo.csv


--------------------------
tmp.chunk00
10001 tmp.chunk00_sraRunInfo.csv
25ad85b8639022bcec0bcb16a2de774a  tmp.chunk00_sraRunInfo.csv
{"inserted_rows":10000,"total_rows":10000}
--------------------------
tmp.chunk01
10001 tmp.chunk01_sraRunInfo.csv
79f0b278df29790435ca64a6527bffd6  tmp.chunk01_sraRunInfo.csv
{"inserted_rows":10000,"total_rows":20000}
--------------------------
tmp.chunk02
10001 tmp.chunk02_sraRunInfo.csv
8dfc99552161acc720974ab5d0a7db6a  tmp.chunk02_sraRunInfo.csv
{"inserted_rows":10000,"total_rows":30000}
--------------------------
tmp.chunk03
3648 tmp.chunk03_sraRunInfo.csv
208c5d8fe8e90346ae4a4224da129776  tmp.chunk03_sraRunInfo.csv

### Run Serratus

In [24]:
# Set Cluster Parameters =============================
cd $TF

## get Config File (if it doesn't exist)
# curl localhost:8000/config | jq > serratus-config.json
#
# Make local changes to config file
echo "  Cluster Config File: "
cat serratus-config.json
echo ""
echo ""
# Re-upload config file
curl -T serratus-config.json localhost:8000/config

  Cluster Config File: 
{
  "ALIGN_ARGS": "--very-sensitive-local",
  "ALIGN_MAX_INCREASE": 25,
  "ALIGN_SCALING_CONSTANT": 0.0215,
  "ALIGN_SCALING_ENABLE": true,
  "ALIGN_SCALING_MAX": 1600,
  "CLEAR_INTERVAL": 600,
  "DL_ARGS": "",
  "DL_MAX_INCREASE": 10,
  "DL_SCALING_CONSTANT": 0.4,
  "DL_SCALING_ENABLE": true,
  "DL_SCALING_MAX": 200,
  "GENOME": "cov3ma",
  "MERGE_ARGS": "",
  "MERGE_MAX_INCREASE": 10,
  "MERGE_SCALING_CONSTANT": 0.01,
  "MERGE_SCALING_ENABLE": true,
  "MERGE_SCALING_MAX": 50,
  "SCALING_INTERVAL": 120,
  "VIRTUAL_SCALING_INTERVAL": 45
}

{"ALIGN_ARGS":"--very-sensitive-local","ALIGN_MAX_INCREASE":25,"ALIGN_SCALING_CONSTANT":0.0215,"ALIGN_SCALING_ENABLE":true,"ALIGN_SCALING_MAX":1600,"CLEAR_INTERVAL":600

### Error handling

In [None]:
## Stop postgres if it's running 
# systemctl stop postgresql

## Connect to postgres
# psql -h localhost postgres postgres

### ACCESSION OPERATIONS
## Reset SPLITTING accessions to NEW
# UPDATE acc SET state = 'new' WHERE state = 'splitting';

## Reset SPLIT_ERR accessions to NEW
## (repeated failures can be missing SRA data)
# UPDATE acc SET state = 'new' WHERE state = 'split_err';

## Reset MERGE_ERR accessions to SPLIT_DONE
# UPDATE acc SET state = 'split_done' WHERE state = 'merge_err';

## Clear DONE Accessions (ONLY ON COMPLETION)
# DELETE FROM acc WHERE state = 'merge_done';

### BLOCK OPERATIONS

##  Reset FAIL blocks to NEW
# UPDATE blocks SET state = 'new' WHERE state = 'fail';

# Reset ALIGNING blocks to NEW
# UPDATE blocks SET state = 'new' WHERE state = 'aligning';

# Clear DONE blocks
# DELETE FROM blocks WHERE state = 'done';