# Run: Push for 16S pilot

```
Lead     : ababaian / rce
Issue    : 
Version  : v0.3.4
start    : 2020 06 23
complete : 2020 06 23
files    : ~/serratus/notebook/200623_ab/
s3_files : s3://serratus-public/notebook/200623_ab/
output   : s3://serratus-public/out/200623_ssu/
```

### Objectives
- Robert provided an SSU test file, `v4.75.fa`, used here as `ssu0.fa`
- Test it against all known bat samples (`bat_SraRunInfo.csv`)


### Initialize SSU reference


In [None]:
# on EC2
sudo yum install -y docker
sudo yum install -y git
sudo service docker start

sudo docker run --rm --entrypoint /bin/bash \
  -it serratusbio/serratus-align:latest


In [None]:
yum install -y unzip wget python3
BOWTIEVERSION='2.4.1'

wget --quiet https://downloads.sourceforge.net/project/bowtie-bio/bowtie2/"$BOWTIEVERSION"/bowtie2-"$BOWTIEVERSION"-linux-x86_64.zip &&\
  unzip bowtie2-"$BOWTIEVERSION"-linux-x86_64.zip &&\
  rm    bowtie2-"$BOWTIEVERSION"-linux-x86_64.zip

mv bowtie2-"$BOWTIEVERSION"*/bowtie2* /usr/local/bin/ &&\
  rm -rf bowtie2-"$BOWTIEVERSION"*

In [None]:
# SeqKit Install
yum install -y wget tar gzip less vim unzip
wget https://github.com/shenwei356/seqkit/releases/download/v0.12.0/seqkit_linux_amd64.tar.gz &&\
  tar -xvf seqkit* && mv seqkit /usr/local/bin/ &&\
  rm seqkit_linux*

In [None]:
# Set-up reference genome
aws s3 cp s3://serratus-public/seq/ssu0/ssu0.fa ./
seqkit rmdup ssu0.fa > ssu0.rmdup.fa
mv ssu0.rmdup.fa ssu0.fa

# Build index
samtools faidx ssu0.fa
bowtie2-build ssu0.fa ssu0


# Make sumzer (summarizer) file
# Columns: accession length name family offset length
# 110360 = number of sequences in ssu0.fa (rows in ssu0.fa.fai)
cut -f 1,2 ssu0.fa.fai > acc.tmp
yes "NA ssu 0 75" \
  | sed 's/ /\t/g' - \
  | head -n 110360 \
  > stats.tmp
  
paste acc.tmp stats.tmp > ssu0.sumzer.tsv
rm *.tmp
md5sum * > ssu0.md5sum

aws s3 sync ./ s3://serratus-public/seq/ssu0/
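
The sumzer table above pads each accession with constant columns and relies on a hard-coded row count. As a minimal sketch of the same idea, the constant columns could be appended per row with `awk`, so the count never has to be known in advance. The file names and accessions below are toy stand-ins, not the real `ssu0` files.

```shell
# Toy stand-in for ssu0.fa.fai (hypothetical accessions; a .fai has 5 columns)
printf 'acc1\t1500\t6\t60\t61\nacc2\t1432\t1540\t60\t61\n' > toy.fa.fai

# Keep accession and length, then append the constant columns
# name=NA, family=ssu, offset=0, length=75 on every row
awk 'BEGIN{OFS="\t"} {print $1, $2, "NA", "ssu", 0, 75}' toy.fa.fai > toy.sumzer.tsv

cat toy.sumzer.tsv
```

Because each output row is generated from its own input row, the table always has exactly as many rows as the index.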

```
054fa9e8c2aa3fe0166bef547f5656dd  ssu0.1.bt2
d740bc842e3b46e19c68bbbc82c63010  ssu0.2.bt2
70c9da6a2dac965deb571f878d7437e2  ssu0.3.bt2
b1626f1c62d13336a03bb854baa85091  ssu0.4.bt2
7c7a11c2d38004bdd726713a07ffbefe  ssu0.fa
19402ce91e927f9413e528d57f9bb7ea  ssu0.fa.fai
dece953f2af26ae8a0d0a0b2483d44f6  ssu0.rev.1.bt2
368007ea8220ab39bb2a9405eb075494  ssu0.rev.2.bt2
b02436b973b2830444d6ab7f513c35ff  ssu0.sumzer.tsv
```
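
A checksum list like the one above can be used to verify a later download. The following is a small sketch using `md5sum -c` on toy files; the `md5demo` directory and its contents are made up for illustration.

```shell
# Toy files standing in for the reference files
mkdir -p md5demo
printf 'ACGT\n' > md5demo/a.fa
printf 'TTTT\n' > md5demo/b.fa

# Record checksums, as done for ssu0.md5sum
md5sum md5demo/a.fa md5demo/b.fa > md5demo/demo.md5sum

# Verify: prints one "<file>: OK" line per file, non-zero exit on mismatch
md5sum -c md5demo/demo.md5sum
```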

### Initialize local workspace

In [1]:
# Serratus repository path
SERRATUS="/home/artem/serratus"
cd $SERRATUS

# Create local run directory
WORK="$SERRATUS/notebook/200623_ab"
mkdir -p $WORK; cd $WORK

# S3 notebook path
S3_WORK='s3://serratus-public/notebook/200623_ab/'

# date and version
date
git rev-parse HEAD # commit version

Tue Jun 23 22:44:37 PDT 2020
fbe06ae227741b28885d419f570b66634b347ec6


In [3]:
cd $WORK
# Download all bat SRA
aws s3 cp s3://lovelywater2/sra/bat_SraRunInfo.csv.gz ./
gzip -d *gz

download: s3://lovelywater2/sra/bat_SraRunInfo.csv.gz to ./bat_SraRunInfo.csv.gz


In [17]:
cd $WORK
aws s3 cp bat_SraRunInfo.csv s3://serratus-public/out/200623_ssu/

upload: ./bat_SraRunInfo.csv to s3://serratus-public/out/200623_ssu/bat_SraRunInfo.csv


In [4]:
wc -l *
md5sum *

2823 bat_SraRunInfo.csv
1108e9cda3e07b55b19ece9ee8ac4dca  bat_SraRunInfo.csv


### Clean-up sweep

In [21]:
aws s3 ls s3://serratus-public/out/200623_ssu/summary/  > v1.sra
cat v1.sra \
  | sed 's/^...............................//g' - \
  | cut -f1 -d'.' - > v1.sra.complete
  
wc -l *sra.complete

head -n1 bat_SraRunInfo.csv > sra.header
grep -vif v1.sra.complete bat_SraRunInfo.csv | tail -n+2 - > v2.tmp

cat sra.header v2.tmp > v2_bat_SraRunInfo.csv

rm sra.header v1.sra v2.tmp

2459 v1.sra.complete


In [23]:
wc -l *

   2823 bat_SraRunInfo.csv
   2459 v1.sra.complete
    364 v2_bat_SraRunInfo.csv
   5646 total
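
The sweep above trims the `aws s3 ls` listing by a fixed character width and filters with `grep -vif`, which matches patterns anywhere in a line. A possible alternative, sketched here on toy data, is to take the last whitespace-separated field of each listing line and filter on an exact match of the accession column. The `.summary` suffix, the accessions, and the assumption that the run accession is the first CSV column are illustrative only.

```shell
# Toy `aws s3 ls` listing (date, time, size, key); names are hypothetical
cat > toy.ls <<'EOF'
2020-06-24 01:02:03       1234 SRR0000001.summary
2020-06-24 01:02:04       5678 SRR0000003.summary
EOF

# Toy run table: header plus three runs, accession assumed in column 1
cat > toy_SraRunInfo.csv <<'EOF'
Run,ReleaseDate,spots
SRR0000001,2020-01-01,10
SRR0000002,2020-01-02,20
SRR0000003,2020-01-03,30
EOF

# Completed accessions: last field of each line, text before the first '.'
awk '{split($NF, a, "."); print a[1]}' toy.ls > toy.complete

# Keep the header and every run whose first column is not in toy.complete
awk -F',' 'NR==FNR {done[$1]; next} FNR==1 || !($1 in done)' \
  toy.complete toy_SraRunInfo.csv > toy_v2_SraRunInfo.csv

cat toy_v2_SraRunInfo.csv
```

Only the header and the `SRR0000002` row remain, since the other two runs appear in the completed list.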


### Terraform Initialize

In [24]:
# Terraform customization
git diff $SERRATUS/terraform/main/main.tf

diff --git a/terraform/main/main.tf b/terraform/main/main.tf
index de2d00d..8c7f922 100644
--- a/terraform/main/main.tf
+++ b/terraform/main/main.tf
@@ -92,7 +92,7 @@ module "scheduler" {
   
   security_group_ids = [aws_security_group.internal.id]
   key_name           = var.key_name
-  instance_type      = "c5.large"
+  instance_type      = "r5.xlarge"
   dockerhub_account  = var.dockerhub_account
   scheduler_port     = var.scheduler_port
 }
@@ -105,7 +105,7 @@ module "monitoring" {
   key_name           = var.key_name
   scheduler_ip       = module.scheduler.private_ip
   dockerhub_account  = var.dockerhub_account
-  instance_type      = "r5.large"
+  instance_type      = "r5.2xlarge"
 }
 
 // Serratus-dl
@@ -113,13 +113,13 @@ module "download" {
   source             = "../worker"
 
   desired_size       = 0
-  max_size           = 200
+  max_size           = 5000
 
   dev_cidrs          = var.dev_cidrs
   security_group_ids = [aws_security_group.inte

In [26]:
# Initialize terraform
TF=$SERRATUS/terraform/main
cd $TF
terraform init

# Launch Terraform Cluster
# Initialize the serratus cluster with minimal nodes
terraform apply -auto-approve

Initializing modules...

Initializing the backend...

Initializing provider plugins...

The following providers do not have any version constraints in configuration,
so the latest version was installed.

To prevent automatic upgrades to new major versions that may contain breaking
changes, it is recommended to add version = "..." constraints to the
corresponding provider blocks in configuration, with the constraint strings
suggested below.

* provider.random: version = "~> 2.2"

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so

### Initialize Serratus

In [27]:
cd $TF

# Open SSH tunnels to the monitor
./create_tunnels.sh

# If a tunnel port is already in use, find and kill the stale SSH process:
# ps aux | grep ssh
# sudo kill <PID of SSH>

Tunnels created:
    localhost:3000 = grafana
    localhost:9090 = prometheus
    localhost:5432 = postgres
    localhost:8000 = scheduler


In [8]:
# Confirm the upload file -- Bat
BATCH_SRA=$WORK/bat_SraRunInfo.csv
echo  $BATCH_SRA
wc -l $BATCH_SRA

/home/artem/serratus/notebook/200623_ab/bat_SraRunInfo.csv
2823 /home/artem/serratus/notebook/200623_ab/bat_SraRunInfo.csv


In [10]:
# Upload SraRunInfo.csv into Serratus
cd $TF
./uploadSRA.sh $BATCH_SRA

Loading SRARunInfo into scheduler 
  File: /home/artem/serratus/notebook/200623_ab/bat_SraRunInfo.csv
  date: Tue Jun 23 22:53:41 PDT 2020
  wc  : 2823 /home/artem/serratus/notebook/200623_ab/bat_SraRunInfo.csv
  md5 : 1108e9cda3e07b55b19ece9ee8ac4dca  /home/artem/serratus/notebook/200623_ab/bat_SraRunInfo.csv


--------------------------
tmp.chunk00
2823 tmp.chunk00_sraRunInfo.csv
1fb31014a8b828b4a4ff08ea27b78fe7  tmp.chunk00_sraRunInfo.csv
{"inserted_rows":2821,"total_rows":2821}


 uploadSRA complete.


In [28]:
# Confirm the upload file -- V2
BATCH_SRA=$WORK/v2_bat_SraRunInfo.csv
echo  $BATCH_SRA
wc -l $BATCH_SRA

/home/artem/serratus/notebook/200623_ab/v2_bat_SraRunInfo.csv
364 /home/artem/serratus/notebook/200623_ab/v2_bat_SraRunInfo.csv


In [30]:
# Upload SraRunInfo.csv into Serratus
cd $TF
./uploadSRA.sh $BATCH_SRA

Loading SRARunInfo into scheduler 
  File: /home/artem/serratus/notebook/200623_ab/v2_bat_SraRunInfo.csv
  date: Wed Jun 24 12:03:22 PDT 2020
  wc  : 364 /home/artem/serratus/notebook/200623_ab/v2_bat_SraRunInfo.csv
  md5 : 1d19e0476b3fa69f3948ca11501d66d9  /home/artem/serratus/notebook/200623_ab/v2_bat_SraRunInfo.csv


--------------------------
tmp.chunk00
364 tmp.chunk00_sraRunInfo.csv
64ded563ca86269e6723cd7dc713028e  tmp.chunk00_sraRunInfo.csv
{"inserted_rows":362,"total_rows":362}


 uploadSRA complete.
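
In both uploads above, `inserted_rows` is two lower than the `wc -l` line count. One of those lines is the CSV header; the log does not say what accounts for the other. A rough sketch of a sanity check, comparing non-empty data rows against the reported count, might look like this. The toy CSV and the response string are made up.

```shell
# Toy run table (with a trailing blank line) and a scheduler-style response
printf 'Run,spots\nSRR1,10\nSRR2,20\n\n' > toy_runinfo.csv
response='{"inserted_rows":2,"total_rows":2}'

# Non-empty data rows, excluding the header line
expected=$(tail -n +2 toy_runinfo.csv | grep -c .)

# Pull inserted_rows out of the JSON text without extra tools
inserted=$(printf '%s' "$response" | grep -o '"inserted_rows":[0-9]*' | cut -d: -f2)

if [ "$expected" -eq "$inserted" ]; then
  echo "OK: $inserted rows inserted"
else
  echo "MISMATCH: expected $expected, inserted $inserted"
fi
```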


## Run Serratus

In [31]:
# Set Cluster Parameters =============================
## get Config File (if it doesn't exist)
# curl localhost:8000/config | jq > serratus-config.json
#
cd $TF
# Make local changes to config file
echo "  Cluster Config File: "
cat serratus-config.json
echo ""
echo ""
# Re-upload config file
curl -T serratus-config.json localhost:8000/config

  Cluster Config File: 
{
  "ALIGN_ARGS": "--very-sensitive-local",
  "ALIGN_MAX_INCREASE": 50,
  "ALIGN_SCALING_CONSTANT": 0.1,
  "ALIGN_SCALING_ENABLE": true,
  "ALIGN_SCALING_MAX": 0,
  "CLEAR_INTERVAL": 600000,
  "DL_ARGS": "",
  "DL_MAX_INCREASE": 15,
  "DL_SCALING_CONSTANT": 0.1,
  "DL_SCALING_ENABLE": true,
  "DL_SCALING_MAX": 0,
  "GENOME": "ssu0",
  "MERGE_ARGS": "",
  "MERGE_MAX_INCREASE": 15,
  "MERGE_SCALING_CONSTANT": 0.1,
  "MERGE_SCALING_ENABLE": true,
  "MERGE_SCALING_MAX": 0,
  "SCALING_INTERVAL": 120,
  "VIRTUAL_SCALING_INTERVAL": 35
}

{"ALIGN_ARGS":"--very-sensitive-local","ALIGN_MAX_INCREASE":50,"ALIGN_SCALING_CONSTANT":0.1,"ALIGN_SCALING_ENABLE":true,"ALIGN_SCALING_MAX":0,"CLEAR_INTERVAL":600000,"DL_ARGS":"
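
The cell above says to make local changes to the config file before re-uploading it. As a small sketch of one such change, the `GENOME` value can be switched in place with `sed`. The file `toy-config.json` and the starting value `old_genome` are placeholders; the real file is `serratus-config.json`.

```shell
# Toy config with a placeholder genome name
cat > toy-config.json <<'EOF'
{
  "ALIGN_ARGS": "--very-sensitive-local",
  "GENOME": "old_genome"
}
EOF

# Point the cluster at the ssu0 reference (keeps a .bak copy of the original)
sed -i.bak 's/"GENOME": *"[^"]*"/"GENOME": "ssu0"/' toy-config.json

grep '"GENOME"' toy-config.json

# Re-upload would then follow the notebook's own command:
# curl -T serratus-config.json localhost:8000/config
```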

### Error handling

In [None]:
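## Stop local postgres if it is running (so localhost:5432 is free for the tunnel)
# systemctl stop postgresql

## Connect to postgres
# psql -h localhost postgres postgres

## One-liner: clear DONE blocks without an interactive session
# psql -h localhost postgres postgres -c "DELETE FROM blocks WHERE state = 'done';"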

### ACCESSION OPERATIONS
## Reset SPLITTING accessions to NEW
# UPDATE acc SET state = 'new' WHERE state = 'splitting';

## Reset SPLIT_ERR accessions to NEW
## (repeated failures can be missing SRA data)
# UPDATE acc SET state = 'new' WHERE state = 'split_err';

## Reset MERGE_ERR accessions to SPLIT_DONE
# UPDATE acc SET state = 'split_done' WHERE state = 'merge_err';

## Clear DONE Accessions (ONLY ON COMPLETION)
# DELETE FROM acc WHERE state = 'merge_done';

### BLOCK OPERATIONS

## Reset FAIL blocks to NEW
# UPDATE blocks SET state = 'new' WHERE state = 'fail';

## Reset ALIGNING blocks to NEW
# UPDATE blocks SET state = 'new' WHERE state = 'aligning';

## Clear DONE blocks
# DELETE FROM blocks WHERE state = 'done';

### RESET STATE
# DELETE FROM blocks WHERE state = 'done';
# DELETE FROM blocks WHERE state = 'fail';
#
#
# DELETE FROM acc WHERE state = 'split_err';
# DELETE FROM acc WHERE state = 'merging';
# DELETE FROM acc WHERE state = 'merge_err';
# DELETE FROM acc WHERE state = 'split_done';
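
The reset statements above all follow one pattern: move rows in a table from one state to another. A hypothetical helper, `reset_state`, could generate them. It only prints SQL, so it can be reviewed before being piped to `psql`.

```shell
# Hypothetical helper: print the SQL that moves rows from one state to another
# usage: reset_state <table> <from_state> <to_state>
reset_state() {
  printf "UPDATE %s SET state = '%s' WHERE state = '%s';\n" "$1" "$3" "$2"
}

# Dry run: show the statements without touching the database
reset_state acc splitting new
reset_state blocks fail new

# To apply one, pipe it to the connection used in the notebook:
# reset_state acc splitting new | psql -h localhost postgres postgres
```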


## Output
