# RUN: Mega-genome pilot

## Introduction

```
Lead     : ababaian /rce
Issue    : 
Version  : 
start    : 2020 05 06
complete : YYYY MM DD
files    : ~/serratus/notebook/200506_ab/
s3_files : s3://serratus-public/notebook/200506_ab/
output   : s3://serratus-public/out/200506_zoonotic/
```

### Objectives
- RCE has been developing a minimal 'mega-genome' for all viruses that infect mammals
- Should add minimal overhead to the search and broaden our viral discovery


## Generating Mega-genome (mega0)

Initial sequences were retrieved by searching the NCBI Nucleotide database:

Query:
```
"Viruses"[Organism] AND srcdb_refseq[PROP] NOT wgs[PROP] NOT "cellular
organisms"[Organism] NOT AC_000001[PACC] : AC_999999[PACC]
```
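
One way to run this query from the command line is with NCBI Entrez Direct. This is only a sketch: it assumes EDirect (`esearch`, `efetch`) is installed, the helper name `fetch_refseq_virus` and the output file name are hypothetical, and it is not necessarily how the original sequences were retrieved.

```shell
# Entrez query from above, joined onto a single line
QUERY='"Viruses"[Organism] AND srcdb_refseq[PROP] NOT wgs[PROP] NOT "cellular organisms"[Organism] NOT AC_000001[PACC] : AC_999999[PACC]'

# Hypothetical helper: fetch all matching nucleotide records as FASTA.
# Needs network access and NCBI EDirect on the PATH.
fetch_refseq_virus() {
  esearch -db nuccore -query "$QUERY" | efetch -format fasta > "$1"
}

# Usage (not run here): fetch_refseq_virus refseq_virus.fa
```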

### Creating mega-genome

RCE to fill out code on how it's made.

```
Let's call the expanded reference all virus families of interest the
"mega"-genome, to distinguish from single-family pan-genome.

Directory is here:

s3://serratus-public/rce/mega

The mega-genome contains:

1. Cov full-length genomes clustered at 97% identity (421 sequences).
SARS-Cov-2 is included (MT121215.1).

2. Representative complete genome and CDS sequences from all virus
families with vertebrate hosts, including human (354 sequences). DNA
viruses are included -- why not? they get transcribed too!

"Representative sequences" are defined by NCBI. They divide viral
genomes and CDSs into "representatives" and "neighbors". I didn't find
documentation for how they do that, presumably it's similar to our
clustering at 99% or 97%. To give a sense of the coverage, these
included 9 Cov's including SARS-Cov-2, so these sequences alone would
have been fine for the zoonotic reservoir search.

This reference is roughly 50% Cov, so it's heavily Cov-focused, probably
unnecessarily so, and should be able to find a wide range of other
viruses as well.

Total size is 15Mb FASTA. Smaller than the Cov2 pan-genome!

I screened against a couple of Cov-negative datasets. Got some hits, but
these were a tiny fraction of the reads so nothing blew up & looked
tolerable to me. Example SAM here:

s3://serratus-public/rce/mega/sam/bowtie2.SRR11454614.mega_hv_covu.sam

The hits I checked were short virus fragments which matched human
genome. If we were doing human only then we could mask them, but
pointless for this search because we're not going to check all mammals
in advance.

In FASTA format:

mega/fa/mega_hv_covu.fa
mega/fa/mega_hv_covu_hardmasked.fa
mega/fa/mega_hv_covu_softmasked.fa

Bowtie2 index of hardmasked FASTA:

mega/bowtie2_index/

R
```




In [None]:
# had to get copy via email due to s3 permissions error
# Copy mega-genome reference into the seq folder for serratus
aws s3 cp s3://serratus-public/rce/mega/fa/mega_hv_covu_hardmasked.fa \
  s3://serratus-public/seq/mega0/mega0.fa


# Copy over bt2 index for mega0
aws s3 cp s3://serratus-public/rce/mega/bowtie2_index/mega_hv_covu_hardmasked.1.bt2 \
  s3://serratus-public/seq/mega0/mega0.1.bt2
aws s3 cp s3://serratus-public/rce/mega/bowtie2_index/mega_hv_covu_hardmasked.2.bt2 \
  s3://serratus-public/seq/mega0/mega0.2.bt2
aws s3 cp s3://serratus-public/rce/mega/bowtie2_index/mega_hv_covu_hardmasked.3.bt2 \
  s3://serratus-public/seq/mega0/mega0.3.bt2
aws s3 cp s3://serratus-public/rce/mega/bowtie2_index/mega_hv_covu_hardmasked.4.bt2 \
  s3://serratus-public/seq/mega0/mega0.4.bt2
  
aws s3 cp s3://serratus-public/rce/mega/bowtie2_index/mega_hv_covu_hardmasked.rev.1.bt2 \
  s3://serratus-public/seq/mega0/mega0.rev.1.bt2
aws s3 cp s3://serratus-public/rce/mega/bowtie2_index/mega_hv_covu_hardmasked.rev.2.bt2 \
  s3://serratus-public/seq/mega0/mega0.rev.2.bt2
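
The six index copies above follow one naming pattern, so they could also be generated in a loop. A minimal sketch, assuming the same source and destination prefixes; `DRY_RUN` and `copy_bt2_index` are names introduced here for illustration, and by default the commands are only printed.

```shell
SRC="s3://serratus-public/rce/mega/bowtie2_index/mega_hv_covu_hardmasked"
DST="s3://serratus-public/seq/mega0/mega0"
DRY_RUN="${DRY_RUN:-1}"   # set DRY_RUN=0 to actually copy

copy_bt2_index() {
  # Bowtie2 index suffixes: four forward files and two reverse files
  for suffix in 1 2 3 4 rev.1 rev.2; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "aws s3 cp $SRC.$suffix.bt2 $DST.$suffix.bt2"
    else
      aws s3 cp "$SRC.$suffix.bt2" "$DST.$suffix.bt2"
    fi
  done
}

copy_bt2_index
```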



In [1]:
date

Thu May  7 07:51:35 PDT 2020


### Initialize local workspace

In [3]:
# Serratus commit version
SERRATUS="/home/artem/serratus"
cd $SERRATUS
git rev-parse HEAD # commit version

# Create local run directory
WORK="$SERRATUS/notebook/200506_ab"
mkdir -p $WORK; cd $WORK

# SRA RunInfo Table for run -- use first 500 from Zoonotic pilot
RUNINFO="$SERRATUS/notebook/200505_ab/zoonotic_SraRunInfo.csv"

head -n 500 $RUNINFO > pilot_mega.csv
RUNINFO="$WORK/pilot_mega.csv"

#head $RUNINFO

fc6c49f882332b0644c7b2660294e7e0c27ac928
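
Note that `head -n 500` counts the CSV header line, so the pilot table holds 499 accessions rather than 500. The sketch below shows this on a throwaway table; the file contents (`FAKE...` accessions) are made up purely for the demonstration.

```shell
# Build a throwaway RunInfo-like table: 1 header line + 1000 fake accessions
TMP="$(mktemp)"
{ echo "Run,spots"; seq 1 1000 | sed 's/^/FAKE/; s/$/,0/'; } > "$TMP"

# Same subsetting as above, then count data rows (everything after the header)
N_ROWS="$(head -n 500 "$TMP" | tail -n +2 | wc -l | tr -d ' ')"
echo "$N_ROWS"

rm -f "$TMP"
```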


### Terraform Initialization
The Global Variables for Terraform file must be modified to initialize for your system.

File: `$SERRATUS/terraform/main/terraform.tfvars`

This step must be done manually in a text editor currently.

In [4]:
# Terraform customization
git diff $SERRATUS/terraform/main/main.tf

diff --git a/terraform/main/main.tf b/terraform/main/main.tf
index a52496e..281017a 100644
--- a/terraform/main/main.tf
+++ b/terraform/main/main.tf
@@ -109,7 +109,7 @@ module "download" {
   source             = "../worker"
 
   desired_size       = 0
-  max_size           = 256
+  max_size           = 200
 
   dev_cidrs          = var.dev_cidrs
   security_group_ids = [aws_security_group.internal.id]
@@ -134,7 +134,7 @@ module "align" {
   source             = "../worker"
 
   desired_size       = 0
-  max_size           = 256
+  max_size           = 500
   dev_cidrs          = var.dev_cidrs
   security_group_ids = [aws_security_group.internal.id]
   instance_type      = "c5.large" # c5.large
@@ -170,7 +170,7 @@ module "merge" {
   // TODO: the credentials are not properly set-up to
   //       upload to serratus-public, requires a *Object policy
   //       on the bucket.
-  options            = "-k ${module.work_bucket.name} -b s3://serratus-public/out/200

In [5]:
# Initialize terraform
TF=$SERRATUS/terraform/main
cd $TF
terraform init

[0m[1mInitializing modules...[0m

[0m[1mInitializing the backend...[0m

[0m[1mInitializing provider plugins...[0m

[0m[1m[32mTerraform has been successfully initialized![0m[32m[0m
[0m[32m
You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.[0m


In [6]:
cd $TF
# Launch Terraform Cluster
# Initialize the serratus cluster with minimal nodes
terraform apply -auto-approve

[0m[1mmodule.align.data.aws_region.current: Refreshing state...[0m
[0m[1mmodule.align.data.aws_availability_zones.all: Refreshing state...[0m
[0m[1mmodule.merge.data.aws_availability_zones.all: Refreshing state...[0m
[0m[1mmodule.merge.data.aws_region.current: Refreshing state...[0m
[0m[1mmodule.download.data.aws_region.current: Refreshing state...[0m
[0m[1mmodule.scheduler.data.aws_ami.amazon_linux_2: Refreshing state...[0m
[0m[1mmodule.monitoring.data.aws_ami.ecs: Refreshing state...[0m
[0m[1mmodule.merge.data.aws_ami.amazon_linux_2: Refreshing state...[0m
[0m[1mmodule.download.data.aws_availability_zones.all: Refreshing state...[0m
[0m[1mmodule.align.data.aws_ami.amazon_linux_2: Refreshing state...[0m
[0m[1mmodule.download.data.aws_ami.amazon_linux_2: Refreshing state...[0m
[0m[1mmodule.scheduler.data.aws_region.current: Refreshing state...[0m
[0m[1mmodule.merge.aws_cloudwatch_log_group.g: Creating...[0m[0m
[0m[1mmodule.merge.mo

## Running Serratus 
Upload the run data, scale-out the cluster, monitor performance.


### Run Monitors & Upload table
Open SSH tunnels to the monitor node, then open the monitors in a browser.


In [7]:
cd $TF

# Open SSH tunnels to the monitor
./create_tunnels.sh

# Download Scheduler config file
#curl localhost:8000/config > serratus-config.json

Tunnels created:
    localhost:3000 -- grafana
    localhost:9090 -- prometheus
    localhost:8000 -- scheduler


In [12]:
# Make local changes to config file
cat serratus-config.json
echo '--------'
# Re-upload config file
curl -T serratus-config.json localhost:8000/config

{
"ALIGN_ARGS":"--very-sensitive-local",
"ALIGN_SCALING_CONSTANT":0.1,
"ALIGN_SCALING_ENABLE":true,
"ALIGN_SCALING_MAX":0,
"CLEAR_INTERVAL":600,
"DL_ARGS":"",
"DL_SCALING_CONSTANT":0.1,
"DL_SCALING_ENABLE":true,
"DL_SCALING_MAX":0
"GENOME":"mega0",
"MERGE_ARGS":"",
"MERGE_SCALING_CONSTANT":0.1,
"MERGE_SCALING_ENABLE":true,
"MERGE_SCALING_MAX":1,
"SCALING_INTERVAL":30
}
--------
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0   372    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>500 Internal Server Error</title>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the applica
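
The config printed above is missing a comma after `"DL_SCALING_MAX":0`, which makes it invalid JSON. That may be what triggered the 500 response, though the server-side cause was not confirmed here. A minimal sketch of checking the file before upload, assuming `python3` is available locally; `is_valid_json` and `upload_config` are hypothetical helper names.

```shell
# Return 0 only if the file parses as JSON (uses Python's stdlib json.tool)
is_valid_json() {
  python3 -m json.tool "$1" > /dev/null 2>&1
}

# Hypothetical helper: upload to the scheduler only when the file is valid JSON
upload_config() {
  if is_valid_json "$1"; then
    curl -T "$1" localhost:8000/config
  else
    echo "refusing to upload $1: not valid JSON" >&2
    return 1
  fi
}
```

Usage would be `upload_config serratus-config.json` in place of the bare `curl -T` call.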

In [9]:
# Load SRA Run Info into scheduler (READY)
curl -s -X POST -T $RUNINFO localhost:8000/jobs/add_sra_run_info/

{"inserted_rows":499,"total_rows":499}


{% for row in accs %}
          {% if row.state !== 'merge_done'%}
          <tr>
            <td>{{ row.acc_id }}</td>
            <td>{{ row.sra_run_info["Run"] }} </td>
            <td>{{ row.state }}</td>
            <td>{{ row.split_start_time }}</td>
            <td>{{ row.split_end_time }}</td>
            <td>{{ row.split_worker }}</td>
            <td>{{ row.merge_start_time }}</td>
            <td>{{ row.merge_end_time }}</td>
            <td>{{ row.merge_worker }}</td>
          </tr>
          {% endif %}
        {% endfor %}
...

### Scale up the cluster

Cluster scale-in and scale-out are automated. It should be "set it and forget it".


In [None]:
# Error fixes (manually help along)
curl -X POST "localhost:8000/jobs/split/601?state=new&N_paired=0&N_unpaired=0"


## Shutting down procedures

Closing up shop.

In [18]:
# Dump the Scheduler SQLITE table to a local file
curl localhost:8000/db > \
  $WORK/zoonotic_pilot.sqlite

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  308k  100  308k    0     0   565k      0 --:--:-- --:--:-- --:--:--  565k


## Destroy Cluster

Close out all resources with terraform (will take a few minutes).


In [19]:
terraform destroy -auto-approve
# WARNING this will also delete the standard output bucket/data
# Save data prior to destroy

[0m[1mmodule.download.data.aws_ami.amazon_linux_2: Refreshing state...[0m
[0m[1mmodule.align.module.iam_role.aws_iam_role.role: Refreshing state... [id=SerratusIamRole-serratus-align][0m
[0m[1mmodule.merge.data.aws_availability_zones.all: Refreshing state...[0m
[0m[1mmodule.align.aws_cloudwatch_log_group.g: Refreshing state... [id=serratus-align][0m
[0m[1mmodule.monitoring.aws_iam_role.instance_role: Refreshing state... [id=SerratusEcsInstanceRole][0m
[0m[1mmodule.scheduler.data.aws_ami.amazon_linux_2: Refreshing state...[0m
[0m[1mmodule.align.data.aws_availability_zones.all: Refreshing state...[0m
[0m[1mmodule.download.module.iam_role.aws_iam_role.role: Refreshing state... [id=SerratusIamRole-serratus-dl][0m
[0m[1mmodule.merge.aws_cloudwatch_log_group.g: Refreshing state... [id=serratus-merge][0m
[0m[1mmodule.scheduler.aws_cloudwatch_log_group.scheduler: Refreshing state... [id=scheduler][0m
[0m[1mmodule.merge.module.iam_role.aws_iam_role.role

### Run Notes

There were some minor bug-fixes with `run_merge.sh`, but the entire pilot data made it through. Time to go to scale boys!


# Batch 1

Process up to sample 1000.

## Serratus Initialization


In [1]:
# Serratus commit version
SERRATUS="/home/artem/serratus"
cd $SERRATUS
git rev-parse HEAD # commit version

# Create local run directory
WORK="$SERRATUS/notebook/200505_ab"
mkdir -p $WORK; cd $WORK

c1f438ca1c4eb4f1fcf4f24079ed09558f20e7d5


In [3]:
# SRA RunInfo Table for run -- PILOT
RUNINFO="$WORK/zoonotic_SraRunInfo.csv"

head -n 1000 $RUNINFO > batch1_zoonotic.csv
sed -i '2,50d' batch1_zoonotic.csv
RUNINFO="$WORK/batch1_zoonotic.csv"

head -n 5 $RUNINFO

Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
ERR3568637,2020-04-18 18:18:52,2020-04-21 03:38:04,12326423,616321150,0,50,204,,https://sra-download.ncbi.nlm.nih.gov/traces/era19/ERR/ERR3568/ERR3568637,ERX3567005,Sample 31_s,RNA-Seq,RANDOM,GENOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 4000,ERP117619,PRJEB34680,,626196,ERS3789009,SAMEA5986188,simple,9940,Ovis aries,E-MTAB-8396:Sample 31,,,,,female,,no,,,,,Marcella Ma,ERA2154894,,public,6614621801329D7D2CB845B25B7C4555,CF2B263E1122C3876B423D519299C99A
ERR3568638,2020-04-18 18

In [4]:
# Terraform customization
git diff $SERRATUS/terraform/main/main.tf

diff --git a/terraform/main/main.tf b/terraform/main/main.tf
index a52496e..84dd768 100644
--- a/terraform/main/main.tf
+++ b/terraform/main/main.tf
@@ -109,7 +109,7 @@ module "download" {
   source             = "../worker"
 
   desired_size       = 0
-  max_size           = 256
+  max_size           = 200
 
   dev_cidrs          = var.dev_cidrs
   security_group_ids = [aws_security_group.internal.id]
@@ -134,7 +134,7 @@ module "align" {
   source             = "../worker"
 
   desired_size       = 0
-  max_size           = 256
+  max_size           = 500
   dev_cidrs          = var.dev_cidrs
   security_group_ids = [aws_security_group.internal.id]
   instance_type      = "c5.large" # c5.large


In [6]:
# Initialize terraform
cd $SERRATUS/terraform/main
terraform init

# Launch Terraform Cluster
# Initialize the serratus cluster with minimal nodes
terraform apply -auto-approve

[0m[1mInitializing modules...[0m

[0m[1mInitializing the backend...[0m

[0m[1mInitializing provider plugins...[0m

[0m[1m[32mTerraform has been successfully initialized![0m[32m[0m
[0m[32m
You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.[0m
[0m[1mmodule.merge.data.aws_region.current: Refreshing state...[0m
[0m[1mmodule.scheduler.data.aws_region.current: Refreshing state...[0m
[0m[1mmodule.scheduler.module.iam_role.aws_iam_role.role: Refreshing state... [id=SerratusIamRole-scheduler][0m
[0m[1mmodule.merge.module.iam_role.aws_iam_role.role: Refreshing state... [id=SerratusIamRole-serratus-merge][0m
[0m[1

## Running Serratus


In [7]:
cd $SERRATUS/terraform/main


# Open SSH tunnels to the monitor
./create_tunnels.sh

# Download Scheduler config file
# curl localhost:8000/config > serratus-config.json

Tunnels created:
    localhost:3000 -- grafana
    localhost:9090 -- prometheus
    localhost:8000 -- scheduler


Settings: `serratus-config.json`

```
{
"ALIGN_ARGS":"--very-sensitive-local",
"ALIGN_SCALING_CONSTANT":0.1,
"ALIGN_SCALING_ENABLE":true,
"ALIGN_SCALING_MAX":20,
"CLEAR_INTERVAL":600,
"DL_ARGS":"",
"DL_SCALING_CONSTANT":0.1,
"DL_SCALING_ENABLE":true,
"DL_SCALING_MAX":10,
"GENOME":"cov2r",
"MERGE_ARGS":"",
"MERGE_SCALING_CONSTANT":0.9,
"MERGE_SCALING_ENABLE":true,
"MERGE_SCALING_MAX":1,
"SCALING_INTERVAL":600
}
```


In [13]:
# Dump the Scheduler SQLITE table to a local file
curl localhost:8000/db > \
  $WORK/mega0_pilot.sqlite

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 1044k  100 1044k    0     0   969k      0  0:00:01  0:00:01 --:--:--  969k


In [27]:
# Cleaning up Run Table via SSH into scheduler
# apt-get install sqlite3



# Clear DONE Accessions
# sqlite3 instance/scheduler.sqlite 'DELETE FROM acc WHERE state = "merge_done";'

# Clear DONE blocks
# sqlite3 instance/scheduler.sqlite 'DELETE FROM blocks WHERE state = "aligning";'



In [14]:
cd $SERRATUS/terraform/main

terraform destroy -auto-approve

[0m[1mmodule.merge.aws_cloudwatch_log_group.g: Refreshing state... [id=serratus-merge][0m
[0m[1mmodule.merge.data.aws_availability_zones.all: Refreshing state...[0m
[0m[1mmodule.align.module.iam_role.aws_iam_role.role: Refreshing state... [id=SerratusIamRole-serratus-align][0m
[0m[1mmodule.merge.module.iam_role.aws_iam_role.role: Refreshing state... [id=SerratusIamRole-serratus-merge][0m
[0m[1mmodule.align.aws_cloudwatch_log_group.g: Refreshing state... [id=serratus-align][0m
[0m[1mmodule.monitoring.data.aws_ami.ecs: Refreshing state...[0m
[0m[1mmodule.merge.data.aws_ami.amazon_linux_2: Refreshing state...[0m
[0m[1mmodule.download.data.aws_region.current: Refreshing state...[0m
[0m[1mmodule.download.data.aws_ami.amazon_linux_2: Refreshing state...[0m
[0m[1mmodule.align.data.aws_region.current: Refreshing state...[0m
[0m[1mmodule.monitoring.aws_iam_role.task_role: Refreshing state... [id=SerratusIamRole-monitor][0m
[0m[1mmodule.work_bucket.a

### Run Notes

#### Out of Memory

Switching the downloader back to C5 instances keeps producing out-of-memory errors, and eventually the worker script cannot start up. Go back to R5 and retry.

These instances exit the docker/worker script but do not respond to ASG shut-down, because scale-in protection was still turned on when the script exited.


```
2020-05-07T16:17:17.706Z  Running -- run_dl-sra.sh --
2020-05-07T16:17:17.706Z  /home/serratus/run_dl-sra.sh ERR3294540
2020-05-07T16:17:27.717Z  parallel: Warning: A record was longer than 104857600. Increasing to --blocksize 136314881.
2020-05-07T16:17:27.775Z  parallel: Warning: A record was longer than 104857600. Increasing to --blocksize 136314881.
2020-05-07T16:17:30.694Z  parallel: Warning: A record was longer than 136314881. Increasing to --blocksize 177209347.
2020-05-07T16:17:30.856Z  parallel: Warning: A record was longer than 136314881. Increasing to --blocksize 177209347.
2020-05-07T16:17:34.939Z  parallel: Warning: A record was longer than 177209347. Increasing to --blocksize 230372153.
2020-05-07T16:17:35.045Z  parallel: Warning: A record was longer than 177209347. Increasing to --blocksize 230372153.
2020-05-07T16:17:40.020Z  parallel: Warning: A record was longer than 230372153. Increasing to --blocksize 299483800.
2020-05-07T16:17:40.200Z  parallel: Warning: A record was longer than 230372153. Increasing to --blocksize 299483800.
2020-05-07T16:17:48.997Z  parallel: Warning: A record was longer than 299483800. Increasing to --blocksize 389328941.
2020-05-07T16:17:49.422Z  parallel: Warning: A record was longer than 299483800. Increasing to --blocksize 389328941.
2020-05-07T16:18:41.413Z  ./worker.sh: fork: Cannot allocate memory
```
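
Workers stuck in this state could in principle have their scale-in protection removed by hand so that the ASG can terminate them. A sketch using the AWS CLI; the group name and instance ID below are placeholders rather than values from this run, `unprotect_instance` is a hypothetical helper, and the command is only printed unless `DRY_RUN=0`.

```shell
DRY_RUN="${DRY_RUN:-1}"   # set DRY_RUN=0 to actually call AWS

# Hypothetical helper: drop scale-in protection for one instance in an ASG
unprotect_instance() {
  asg_name="$1"; instance_id="$2"
  # Rebuild the positional parameters as the command to run
  set -- aws autoscaling set-instance-protection \
    --auto-scaling-group-name "$asg_name" \
    --instance-ids "$instance_id" \
    --no-protected-from-scale-in
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

unprotect_instance "example-asg" "i-0123456789abcdef0"
```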


### Critical Error
In one instance an out-of-memory error caused fastq-dump to quit prematurely with Error 13 (Seg-Fault). The fq-blocks generated up to that point still made it to the next stage, and the library reached 'merge_done' without a complete SRA download.
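
One guard against this failure mode is to check the downloader's exit status before handing anything to the next stage. The sketch below is generic shell, not the actual `run_dl-sra.sh` logic; `run_checked` is a hypothetical wrapper, and `sh -c 'exit 139'` merely stands in for a crashing download.

```shell
# Hypothetical wrapper: run a command and report success only on exit status 0
run_checked() {
  status=0
  "$@" || status=$?
  if [ "$status" -ne 0 ]; then
    echo "command failed with exit status $status; not marking as done" >&2
    return "$status"
  fi
  echo "ok: safe to pass output to the next stage"
}

run_checked true                                   # succeeds
run_checked sh -c 'exit 139' || echo "caught failure"
```

If the download feeds a pipe (an assumption here), bash's `set -o pipefail` would also be needed so that the pipeline reports the upstream failure instead of the exit status of its last command.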

