# RUN: COV2R Pilot Run

```
Lead     : ababaian
Issue    : n/a
Version  : af64c2da1fdb3815dcabb0db3aafd2443c0c470d <git rev-parse HEAD> 
start    : 2020 04 23
complete : 2020 04 23
files    : ~/serratus/notebook/200423_ab/
s3)files : n/a
output   : s3://serratus-public/out/200423_ab_cov2r/
```

### Objectives
- Create a re-usable template for running `serratus`
- Run the 49 SRA test datasets with the current standard `serratus` against the `cov2r` pan-genome.
- Compare `cov2r` alignment statistics to `cov0r` alignments


## Serratus Initialization
Prerequisites for running Serratus


### Initialize local workspace

In [2]:
# Serratus commit version
SERRATUS="/home/artem/serratus"
cd $SERRATUS
git rev-parse HEAD

af64c2da1fdb3815dcabb0db3aafd2443c0c470d


In [11]:
# Create local run directory
WORK="$SERRATUS/notebook/200423_ab"
mkdir -p $WORK; cd $WORK



In [10]:
# SRA RunInfo Table for run
aws s3 cp s3://serratus-public/sra/testing_SraRunInfo.csv ./
RUNINFO="$WORK/testing_SraRunInfo.csv"
cat $RUNINFO

Completed 23.1 KiB/23.1 KiB with 1 file(s) remaining
download: s3://serratus-public/sra/testing_SraRunInfo.csv to ./testing_SraRunInfo.csv
Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
SRR11454614,2020-04-02 00:08:41,2020-04-01 00:45:40,5758629,1736681196,5758629,301,634,,https://sra-download.ncbi.nlm.nih.gov/traces/sra60/SRR/011186/SRR11454614,SRX8032203,HBCDC-HB-01/2019,RNA-Seq,RANDOM PCR,TRANSCRIPTOMIC,PAIRED,0,0,ILLUMINA,Illumina MiSeq,SRP254688,PRJNA616446,,616446,SRS6404538,SAMN14479128,simple,2697049,Severe acut
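Before loading the table into the scheduler, it can help to confirm how many accessions it holds. The sketch below is not part of `serratus`; `count_runs` and `list_runs` are hypothetical helper names, and the code assumes the accession is the first CSV column (`Run`), as in the header above. In this run the scheduler later reports 49 inserted rows, so the count can be compared against that.

```shell
# Hypothetical helpers (not part of serratus).
# Assumption: the accession is the first column ("Run") of the RunInfo CSV.

# Count data rows, skipping header lines and blank lines.
count_runs() {
  awk -F, '$1 != "" && $1 != "Run" { n++ } END { print n + 0 }' "$1"
}

# Print one accession per line.
list_runs() {
  awk -F, '$1 != "" && $1 != "Run" { print $1 }' "$1"
}

# Usage, with RUNINFO as set in the cell above:
#   count_runs "$RUNINFO"
#   list_runs "$RUNINFO" | head
```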

### Packer / AMI Initialization
This step does not need to be run each time if you already have access to the AMI.

Current Build: `us-east-1: ami-046baafb2ee438b69`

In [13]:
cd $SERRATUS/packer
packer build docker-ami.json

amazon-ebs: output will be in this color.

==> amazon-ebs: Prevalidating any provided VPC information
==> amazon-ebs: Prevalidating AMI Name: packer-amazon-linux-2-docker-005
    amazon-ebs: Found Image ID: ami-0323c3dd2da7fb37d
==> amazon-ebs: Creating temporary keypair: packer_5ea203b1-f549-bf3e-1007-c202f84bf7a6
==> amazon-ebs: Creating temporary security group for this instance: packer_5ea203b4-b2cc-7d4b-e1cb-191ce719da7b
==> amazon-ebs: Authorizing access to port 22 from [0.0.0.0/0] in the temporary security groups...
==> amazon-ebs: Launching a source AWS instance...
==> amazon-ebs: Adding tags to source instance
    amazon-ebs: Adding tag: "Name": "Packer Builder"
    amazon-ebs: Instance ID: i-00f674bf01a53eacd
==> amazon-ebs: Waiting for instance (i-00f674bf01a53eacd) to become ready...
==> amazon-ebs: Using ssh communicator to

### Build Serratus containers (optional)
Serratus containers are available on the `serratusbio` dockerhub. If you wish to deploy your own containers, you will have to build them from the `serratus` repository and upload them to your own dockerhub account.

This can be done with the `build.sh` script

In [None]:
cd $SERRATUS

# If you want to upload containers to your repository
# include this.
export DOCKERHUB_USER='serratusbio' # optional
sudo docker login # optional

# Build all containers and upload them to the Docker Hub repo
# (if available)
./build.sh

NOTE: The genome version is currently hard-set in `scheduler/flask_app/jobs.py` (line 172). For this run, the line
```
    response['genome'] = "cov1r"
```
was changed to
```
    response['genome'] = "cov2r"
```

and the containers were rebuilt. This variable needs to be moved to Terraform to allow control of genome versions.
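Until that variable moves to Terraform, the manual edit could be scripted. This is a minimal sketch, not something in the repository: `set_genome` is a hypothetical helper, and it matches on the assignment pattern rather than the line number, since line 172 may drift between commits.

```shell
# Hypothetical helper (not part of serratus): rewrite the hard-set genome.
# Assumption: the assignment has the exact form  response['genome'] = "<name>"
set_genome() {
  # $1 = path to jobs.py, $2 = new genome name
  sed "s/\(response\['genome'\] = \)\"[^\"]*\"/\1\"$2\"/" "$1" > "$1.tmp" &&
    mv "$1.tmp" "$1"
  # Show the resulting line so the change can be checked by eye.
  grep -n "response\['genome'\]" "$1"
}

# Usage, from the repository root, before running ./build.sh:
#   set_genome scheduler/flask_app/jobs.py cov2r
```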


### Terraform Initialization
The Terraform global variables file must be modified for your system.

File: `$SERRATUS/terraform/main/terraform.tfvars`

Currently, this step must be done manually in a text editor.

In [None]:
# Your public IP followed by "/32"
LOCALIP="75.155.242.67/32" #dev_cidrs
# Your AWS key name
KEYNAME="serratus"         #key_name
# Dockerhub account containing serratus containers
DOCKERHUB_USER='serratusbio'    #dockerhub_account (optional)
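Since the file is edited by hand, a quick format check on the address before pasting it in can catch a missing `/32`. The helper below is a hypothetical sketch; it only checks the shape of the string (four dot-separated numbers followed by `/32`) and does not validate that each octet is in range.

```shell
# Hypothetical helper: succeed only if the argument looks like "a.b.c.d/32".
# This is a shape check; it does not verify that each octet is <= 255.
is_host_cidr() {
  printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}/32$'
}

# Usage, with LOCALIP as set in the cell above:
#   is_host_cidr "$LOCALIP" || echo "LOCALIP should be an IPv4 address followed by /32"
```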

In [15]:
# Initialize terraform
cd $SERRATUS/terraform/main
terraform init

Initializing modules...

Initializing the backend...

Initializing provider plugins...

The following providers do not have any version constraints in configuration,
so the latest version was installed.

To prevent automatic upgrades to new major versions that may contain breaking
changes, it is recommended to add version = "..." constraints to the
corresponding provider blocks in configuration, with the constraint strings
suggested below.

* provider.local: version = "~> 1.4"

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so 

In [16]:
# Launch Terraform Cluster
# Initialize the serratus cluster with minimal nodes
terraform apply -auto-approve

module.align.data.aws_ami.amazon_linux_2: Refreshing state...
module.monitoring.data.aws_ami.ecs: Refreshing state...
module.align.data.aws_availability_zones.all: Refreshing state...
module.scheduler.data.aws_ami.amazon_linux_2: Refreshing state...
module.download.data.aws_availability_zones.all: Refreshing state...
module.download.data.aws_region.current: Refreshing state...
module.scheduler.data.aws_region.current: Refreshing state...
module.merge.data.aws_availability_zones.all: Refreshing state...
module.merge.data.aws_region.current: Refreshing state...
module.download.data.aws_ami.amazon_linux_2: Refreshing state...
module.align.data.aws_region.current: Refreshing state...
module.merge.data.aws_ami.amazon_linux_2: Refreshing state...
module.scheduler.aws_cloudwatch_log_group.scheduler: Creating...
mod

## Running Serratus 
Upload the run data, scale out the cluster, and monitor performance.


In [23]:
# Terraform will have created four scripts to control
# serratus
ls -alh *.sh

-rwxrwxr-x 1 artem artem 194 Apr 23 15:11 align_set_capacity.sh
-rwxrwxr-x 1 artem artem 405 Apr 23 15:11 create_tunnels.sh
-rwxrwxr-x 1 artem artem 191 Apr 23 15:11 dl_set_capacity.sh
-rwxrwxr-x 1 artem artem 194 Apr 23 15:11 merge_set_capacity.sh


### Run Monitors & Upload table

Open SSH tunnels to the monitor node, then open the monitors in a browser:

- [Scheduler Table](http://localhost:8000/jobs/)
- [Cluster Monitor: Grafana](http://localhost:3000/?orgId=1)
- [Cluster Monitor: Prometheus](http://localhost:9090)


#### Empty Scheduler Table (localhost:8000/jobs/)
![Empty Table Load Screen](200423_ab/empty_scheduler.png)

#### Ready Scheduler Table (localhost:8000/jobs/)
![Ready Table Load Screen](200423_ab/ready_scheduler.png)

In [17]:
cd $SERRATUS/terraform/main

# Open SSH tunnels to the monitor
./create_tunnels.sh

Tunnels created:
    localhost:3000 -- grafana
    localhost:9090 -- prometheus
    localhost:8000 -- scheduler
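Before loading data, it may be worth confirming that each tunnel answers. The sketch below is an assumption about how one might check this, not part of `create_tunnels.sh`; `check_monitors` is a hypothetical helper that probes the three ports listed above. Without `-f`, `curl` exits 0 for any HTTP response, so "up" here only means that something is listening on the port.

```shell
# Hypothetical helper: probe the three tunnelled ports listed above.
# "up" means curl got an HTTP response of any status; DOWN means it could not connect.
check_monitors() {
  for port in 3000 9090 8000; do
    if curl -s -o /dev/null --max-time 5 "http://localhost:$port/"; then
      echo "localhost:$port up"
    else
      echo "localhost:$port DOWN"
    fi
  done
}

# Usage, after ./create_tunnels.sh:
#   check_monitors
```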


In [22]:
# Load SRA Run Info into scheduler (READY)
curl -s -X POST -T $RUNINFO localhost:8000/jobs/add_sra_run_info/

{
  "inserted_rows": 49, 
  "total_rows": 49
}


### Scale up the cluster
This sets up 10 download, 10 align, and 2 merge nodes to process the data.


In [24]:
./dl_set_capacity.sh 10
./align_set_capacity.sh 10
./merge_set_capacity.sh 2

+ export AWS_REGION=us-east-1
+ AWS_REGION=us-east-1
+ aws autoscaling set-desired-capacity --auto-scaling-group-name tf-asg-tf-serratus-dl-20200423221112630800000009 --desired-capacity 10
+ export AWS_REGION=us-east-1
+ AWS_REGION=us-east-1
+ aws autoscaling set-desired-capacity --auto-scaling-group-name tf-asg-tf-serratus-align-20200423221112402100000007 --desired-capacity 10
+ export AWS_REGION=us-east-1
+ AWS_REGION=us-east-1
+ aws autoscaling set-desired-capacity --auto-scaling-group-name tf-asg-tf-serratus-merge-20200423221112408900000008 --desired-capacity 2
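For a re-usable template, the three calls can be wrapped so that a whole cluster shape is set in one line. This is a minimal sketch; `scale_cluster` is a hypothetical name, and it assumes the Terraform-generated scripts are in the current directory, as they are in `$SERRATUS/terraform/main` above.

```shell
# Hypothetical wrapper around the Terraform-generated capacity scripts.
# Arguments: <download nodes> <align nodes> <merge nodes>
# Stops at the first script that fails.
scale_cluster() {
  ./dl_set_capacity.sh "$1" &&
    ./align_set_capacity.sh "$2" &&
    ./merge_set_capacity.sh "$3"
}

# Usage: the scale-up used in this run, then a full scale-in.
#   scale_cluster 10 10 2
#   scale_cluster 0 0 0
```

Scaling everything to zero at once is only appropriate when all stages have finished; during a run, the stages are scaled in one at a time as each completes.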


You can track the progress of accessions in the scheduler:

![Running Scheduler](200423_ab/running_scheduler.png)

And monitor the performance of the cluster in the monitor:

![Running Monitor](200423_ab/running_monitor.png)

In [25]:
# When all downloading/splitting is done,
# scale-in the downloaders
./dl_set_capacity.sh 0

+ export AWS_REGION=us-east-1
+ AWS_REGION=us-east-1
+ aws autoscaling set-desired-capacity --auto-scaling-group-name tf-asg-tf-serratus-dl-20200423221112630800000009 --desired-capacity 0


In [26]:
# When all alignment is done,
# scale-in the aligners
./align_set_capacity.sh 0

# When all merging is done,
# scale in the mergers
./merge_set_capacity.sh 0

+ export AWS_REGION=us-east-1
+ AWS_REGION=us-east-1
+ aws autoscaling set-desired-capacity --auto-scaling-group-name tf-asg-tf-serratus-align-20200423221112402100000007 --desired-capacity 0
+ export AWS_REGION=us-east-1
+ AWS_REGION=us-east-1
+ aws autoscaling set-desired-capacity --auto-scaling-group-name tf-asg-tf-serratus-merge-20200423221112408900000008 --desired-capacity 0


In [27]:
# Dump the Scheduler SQLITE table to a local file
curl localhost:8000/db > \
  $SERRATUS/notebook/200423_ab/schedDump.sqlite

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  188k  100  188k    0     0   291k      0 --:--:-- --:--:-- --:--:--  291k
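Because the cluster is destroyed later, it is worth checking that the dump really is a database and not an error page saved under that name. SQLite database files begin with the header string `SQLite format 3`, which the hypothetical helper below tests for; it does not check the schema or contents.

```shell
# Hypothetical helper: succeed if the file starts with the SQLite header string.
# This only checks the first 15 bytes, not the integrity of the database.
is_sqlite() {
  [ "$(head -c 15 "$1" 2>/dev/null)" = "SQLite format 3" ]
}

# Usage, with the path from the cell above:
#   is_sqlite "$SERRATUS/notebook/200423_ab/schedDump.sqlite" || echo "dump is not a SQLite file"
```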


## Shutting down procedures

Closing up shop.


### Save output of runs

output directory: `s3://serratus-public/out/200423_ab_cov2r/`


In [None]:
# Output files are in two folders:
# Bam and Bai files
aws s3 ls s3://tf-serratus-work-20200423221042780000000001/out/bam/
# Flagstat and RefCount files
aws s3 ls s3://tf-serratus-work-20200423221042780000000001/out/flagstat/


In [30]:
# Copy output to a permanent bucket
# TODO: automatically transfer final outputs
# to the permanent bucket
aws s3 sync \
  s3://tf-serratus-work-20200423221042780000000001/out \
  s3://serratus-public/out/200423_ab_cov2r/


copy: s3://tf-serratus-work-20200423221042780000000001/out/bam/ERR2906839.bam to s3://serratus-public/out/200423_ab_cov2r/bam/ERR2906839.bam
copy: s3://tf-serratus-work-20200423221042780000000001/out/bam/ERR2906838.bam to s3://serratus-public/out/200423_ab_cov2r/bam/ERR2906838.bam
copy: s3://tf-serratus-work-20200423221042780000000001/out/bam/ERR2906843.bam to s3://serratus-public/out/200423_ab_cov2r/bam/ERR2906843.bam
copy: s3://tf-serratus-work-2020042322104278
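Since destroying the cluster also removes the work bucket, a rough check that the copy is complete is useful before moving on. The sketch below compares object counts under the two prefixes using `aws s3 ls --recursive`; `count_objects` and `same_object_count` are hypothetical helpers. Matching counts are a sanity check only, not proof that every file is intact, and the counts will differ if the destination prefix already held other objects.

```shell
# Hypothetical helpers: compare the number of objects under two S3 prefixes.
count_objects() {
  aws s3 ls --recursive "$1" | wc -l | tr -d ' '
}

same_object_count() {
  src=$(count_objects "$1")
  dst=$(count_objects "$2")
  echo "source=$src destination=$dst"
  [ "$src" -eq "$dst" ]
}

# Usage, with the buckets from this run:
#   same_object_count \
#     s3://tf-serratus-work-20200423221042780000000001/out/ \
#     s3://serratus-public/out/200423_ab_cov2r/
```

Re-running the same `aws s3 sync` command with `--dryrun` is another way to check, since it lists only what would still be copied.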

## Destroy Cluster

Close out all resources with terraform (will take a few minutes).


In [31]:
# WARNING: this also deletes the standard output bucket and its data.
# Save data before running destroy.
terraform destroy -auto-approve

module.download.aws_cloudwatch_log_group.g: Refreshing state... [id=serratus-dl]
module.merge.aws_cloudwatch_log_group.g: Refreshing state... [id=serratus-merge]
module.align.module.iam_role.aws_iam_role.role: Refreshing state... [id=SerratusIamRole-serratus-align]
module.scheduler.data.aws_ami.amazon_linux_2: Refreshing state...
module.align.data.aws_ami.amazon_linux_2: Refreshing state...
module.merge.data.aws_ami.amazon_linux_2: Refreshing state...
module.scheduler.aws_cloudwatch_log_group.scheduler: Refreshing state... [id=scheduler]
module.work_bucket.aws_s3_bucket.work: Refreshing state... [id=tf-serratus-work-20200423221042780000000001]
module.scheduler.data.aws_region.current: Refreshing state...
module.merge.data.aws_region.current: Refreshing state...
module.align.data.aws_region.current: Refreshing state...
module.download.dat

# Run Notes

## Errors

Accessions `SRR6639047` - `SRR6639058` all failed with `split_err` (download fault).

Example error:

```
+ fastq-dump --split-e SRR9658359
Rejected 3658747 READS because of filtering out non-biological READS
Read 3658747 spots for SRR9658358
Written 3658747 spots for SRR9658358
```
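To follow up on the failed accessions, the range can be expanded into a list. This is a sketch under the assumption that `SRR6639047` - `SRR6639058` denotes a contiguous numeric range; `accession_range` is a hypothetical helper, and not every number in the range is necessarily one of the 49 test datasets.

```shell
# Hypothetical helper: expand <prefix> <first> <last> into one accession per line.
# Assumption: the failed accessions form a contiguous numeric range.
accession_range() {
  awk -v p="$1" -v a="$2" -v b="$3" \
    'BEGIN { for (i = a; i <= b; i++) printf "%s%d\n", p, i }'
}

# The range quoted above expands to 12 accessions.
accession_range SRR 6639047 6639058
```

The list can then be matched against the `Run` column of the RunInfo table or the scheduler dump to find which entries need to be re-run.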