# RUN: SARS-CoV-2 Zoonotic Reservoir FLOM1 genome: Take 2

```
Lead     : ababaian / rce
Issue    : 
Version  : 
start    : 2020 05 18
complete : 2020 05 18
files    : ~/serratus/notebook/200517_ab/
s3_files : s3://serratus-public/notebook/200505_ab/
output   : s3://serratus-public/out/200517_zoo2/
```

Re-analysis from the series of `200505_Run_Zoonotic_Reservoir.ipynb`

Uses the `flom1` genome which is defined in [FLOM1 notebook](200517_flom1_full_length_only_mega-genome.ipynb)

This has been updated

### Notes from last run:

>rce:
> First-pass comments on zoo2: Good news, it works. Bad news, it works very well -- there are viruses everywhere! Who knew? Virologists, probably. This makes me a bit worried about blowing up the BAM storage and downstream analysis. I'm thinking about what to do. One option is to try to talk @Artem out of storing BAMs for everything, but I'm not optimistic. A more Artem-friendly option is to have the summarizer filter out all non-Cov alignments so that we capture BAM for Cov and summaries only for non-Cov families. This should be workable if the summarizer generates a clean Cov-specific score, which it does not as yet but should be straightforward to implement. Dust-masking got dropped again; this explains many of the black regions. The last step in prepping a new reference should be 1. align non-dust-masked sequences to known black regions, 2. dustmask non-black-masked sequences, and finally 3. mask with the union of regions found in 1 and 2. The reason for doing it this way is that low-complexity regions and other black regions may overlap, in which case partial masking may reduce the sensitivity of the search, so both searches should be done on unmasked sequence.

---

>ab:
>1. I did dustmask via

```
# Soft mask low complexity regions via dustmasker
 dustmasker -in INPUTFA \
  -window 30 -outfmt interval \
  -out flom1.dust
```
> ab: 
> 2. Told you there's going to be a lot of hits :stuck_out_tongue:
> 3. All we need is a coronavirus=1 / coronavirus=0 flag per acc= line and a coronavirus pan-genome first-line summary from the summarizer, and we leave the rest as an exercise for the reader.
> 4. I did forget to do the 'search back' method, but we didn't have blacklisted regions for this genome to do this with.

> rce:
> I want to improve and test the summarizer without having to run all the alignments from scratch. I can do this locally if we add a few known Cov's to zoo2, e.g. Frank & Ginger, to be sure I don't screw up Cov detection.

> ab:
> I can expedite that

`flom1` was originally processed using `dustmasker -window 30`. That pass was repeated for this run, and an additional pass with the default `dustmasker -window 64` was added.
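If the two passes are meant to be combined, the union rule from rce's note (mask with the union of all regions found) can be sketched with `sort` and `awk`. This is a minimal sketch, not what was run here: it assumes the `interval` output of each pass has already been converted to a hypothetical three-column TSV (sequence, start, end), and the file names and coordinates are made up.

```bash
#!/usr/bin/env bash
set -euo pipefail
export LC_ALL=C

# Toy inputs (made-up coordinates): sequence, start, end, tab-separated.
tmp=$(mktemp -d)
printf 'seqA\t10\t40\nseqA\t100\t120\nseqB\t5\t9\n'   > "$tmp/dust_w30.tsv"
printf 'seqA\t35\t60\nseqA\t121\t130\nseqB\t50\t70\n' > "$tmp/dust_w64.tsv"

# Union of closed intervals: sort by sequence then start, and fold any
# interval that overlaps or directly abuts the one being built.
merge_intervals() {
  sort -k1,1 -k2,2n "$@" | awk 'BEGIN { FS = OFS = "\t" }
    $1 != name || $2 + 0 > last + 1 {
      if (name != "") print name, first, last
      name = $1; first = $2 + 0; last = $3 + 0
      next
    }
    $3 + 0 > last { last = $3 + 0 }
    END { if (name != "") print name, first, last }'
}

union=$(merge_intervals "$tmp/dust_w30.tsv" "$tmp/dust_w64.tsv")
printf '%s\n' "$union"
rm -rf "$tmp"
```

Both inputs are read unmasked and merged only at the end, which matches the reasoning in the note: neither search is run on sequence that the other has already partially masked.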

In [1]:
date

Mon May 18 19:53:54 PDT 2020


### Initialize local workspace

In [2]:
# Serratus commit version
SERRATUS="/home/artem/serratus"
cd $SERRATUS

## Serratus was updated, genome remains the same
git rev-parse HEAD # commit version

# Create local run directory
WORK="$SERRATUS/notebook/200518_ab"
mkdir -p $WORK; cd $WORK

# SRA RunInfo Table base for run
RUNINFO="$SERRATUS/notebook/200505_ab/zoonotic_SraRunInfo.csv"

5d34d6e20399136e439345f99eefa1240f4613d4


# Zoo3 Run

In [3]:
# Create a list of all completed runs to date
cd $WORK

#head -n1 $RUNINFO > zoo2_pilot2.csv
#shuf -n1000 $RUNINFO >> zoo2_pilot2.csv
#CURRENT_BATCH="zoo2_pilot2.csv"

cp "$SERRATUS/notebook/200505_ab/zoo2_pilot2.csv" \
    $WORK/zoo3_sraRunInfo.csv

CURRENT_BATCH="zoo3_sraRunInfo.csv"

# Add known CoV Spike-in
# high PEDV in pig
grep "SRR1082995" $RUNINFO >> $CURRENT_BATCH

# low IBV in pig
grep "SRR109516" $RUNINFO >> $CURRENT_BATCH

# Frank + co
grep "ERR275678" $RUNINFO >> $CURRENT_BATCH
# Ginger + co
grep "SRR728711" $RUNINFO >> $CURRENT_BATCH
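
Note that `grep` matches substrings, so a pattern such as `SRR109516` also selects any row whose accession merely begins with that string. A hedged alternative is to match the accession column exactly. The sketch below assumes the accession is the first comma-separated column (`Run`) of the RunInfo table; the toy table is invented for illustration and `add_spike_in` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
set -euo pipefail

tmp=$(mktemp -d)
# Toy RunInfo table; rows are invented for illustration only.
cat > "$tmp/runinfo.csv" <<'EOF'
Run,spots,ScientificName
SRR109516,100,toy
SRR1095160,200,toy
SRR1095161,300,toy
ERR275678,400,toy
EOF
head -n 1 "$tmp/runinfo.csv" > "$tmp/batch.csv"

# Append only rows whose first column equals the accession exactly.
add_spike_in() {  # usage: add_spike_in ACCESSION RUNINFO BATCH
  awk -F, -v acc="$1" '$1 == acc' "$2" >> "$3"
}

substring_hits=$(grep -c "SRR109516" "$tmp/runinfo.csv")
add_spike_in "SRR109516" "$tmp/runinfo.csv" "$tmp/batch.csv"
add_spike_in "ERR275678" "$tmp/runinfo.csv" "$tmp/batch.csv"
batch_rows=$(awk 'END { print NR - 1 }' "$tmp/batch.csv")

echo "substring matches: $substring_hits, rows added by exact match: $batch_rows"
rm -rf "$tmp"
```

Whether the extra substring matches are wanted depends on the run; the exact match simply makes the size of the spike-in predictable.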





## Running Serratus 
Upload the run data, scale-out the cluster, monitor performance.

### Terraform Initialization



In [4]:
# Terraform customization
# Make scheduler/monitor beefier for more nodes
git diff $SERRATUS/terraform/main/main.tf

diff --git a/terraform/main/main.tf b/terraform/main/main.tf
index 80b3c5f..cbe2ce8 100644
--- a/terraform/main/main.tf
+++ b/terraform/main/main.tf
@@ -170,7 +170,7 @@ module "merge" {
   // TODO: the credentials are not properly set-up to
   //       upload to serratus-public, requires a *Object policy
   //       on the bucket.
-  options            = "-k ${module.work_bucket.name} -b s3://serratus-public/out/200505_zoonotic"
+  options            = "-k ${module.work_bucket.name} -b s3://serratus-public/out/200518_zoo3"
 }
 
 // RESOURCES ##############################


In [5]:
# Initialize terraform
TF=$SERRATUS/terraform/main
cd $TF
terraform init

# Launch Terraform Cluster
# Initialize the serratus cluster with minimal nodes
terraform apply -auto-approve

Initializing modules...

Initializing the backend...

Initializing provider plugins...

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.
module.align.data.aws_availability_zones.all: Refreshing state...
module.download.data.aws_ami.amazon_linux_2: Refreshing state...
module.merge.data.aws_region.current: Refreshing state...
module.download.data.aws_region.current: Refreshing state...
module.scheduler.data.aws_region.current: Refreshing state...

### Run Monitors & Upload table
Open SSH tunnels to the monitor node, then open the monitors in a browser.


In [6]:
cd $TF

# Open SSH tunnels to the monitor
./create_tunnels.sh

# If you get an error on port
# run:
# ps aux | grep ssh
# sudo kill <PID of SSH>
#

Tunnels created:
    localhost:3000 -- grafana
    localhost:9090 -- prometheus
    localhost:8000 -- scheduler


### Zoo3 with FLOM1


In [7]:
# Load SRA Run Info into scheduler ===================
# Scheduler DNS: 
echo "Loading SRARunInfo into scheduler "
echo "  File: $CURRENT_BATCH"
echo "  md5 : $(md5sum $WORK/$CURRENT_BATCH)"
echo "  date: $(date)"

curl -s -X POST -T $WORK/$CURRENT_BATCH localhost:8000/jobs/add_sra_run_info/

Loading SRARunInfo into scheduler 
  File: zoo3_sraRunInfo.csv
  md5 : bfc49bcc15f5413035bdf548ace78fdc  /home/artem/serratus/notebook/200518_ab/zoo3_sraRunInfo.csv
  date: Mon May 18 19:59:54 PDT 2020
{"inserted_rows":1074,"total_rows":1074}
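
A quick consistency check is to compare the scheduler's `inserted_rows` with the number of data rows in the uploaded file. The sketch below parses the response shown above with `sed` and counts rows in a toy CSV. It assumes a single header line and that the scheduler counts every data row, neither of which is confirmed here; `json_int` and `data_rows` are hypothetical helper names.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Response copied from the scheduler output above.
response='{"inserted_rows":1074,"total_rows":1074}'

# Pull one integer field out of the flat JSON response.
json_int() {  # usage: json_int KEY JSON
  printf '%s\n' "$2" | sed -E "s/.*\"$1\":([0-9]+).*/\1/"
}

# Count non-empty data rows, assuming exactly one header line.
data_rows() {  # usage: data_rows FILE
  awk 'NR > 1 && NF { n++ } END { print n + 0 }' "$1"
}

inserted=$(json_int inserted_rows "$response")
total=$(json_int total_rows "$response")

# Toy stand-in for $WORK/$CURRENT_BATCH (made-up rows).
tmp=$(mktemp)
printf 'Run,spots\nSRR0000001,10\nSRR0000002,20\n' > "$tmp"
toy_rows=$(data_rows "$tmp")
rm -f "$tmp"

echo "inserted=$inserted total=$total toy_rows=$toy_rows"
```

In a real run, `data_rows "$WORK/$CURRENT_BATCH"` would replace the toy file, and a mismatch with `inserted` would be a prompt to look for duplicate or malformed rows.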


In [9]:
# Set Cluster Parameters =============================
cd $TF
# Make local changes to config file
echo "  Cluster Config File: "
cat serratus-config.json
echo ""
echo ""
# Re-upload config file
curl -T serratus-config.json localhost:8000/config

  Cluster Config File: 
{
"ALIGN_ARGS":"--very-sensitive-local",
"ALIGN_SCALING_CONSTANT":0.1,
"ALIGN_SCALING_ENABLE":true,
"ALIGN_SCALING_MAX":450,
"CLEAR_INTERVAL":777,
"DL_ARGS":"",
"DL_SCALING_CONSTANT":0.1,
"DL_SCALING_ENABLE":true,
"DL_SCALING_MAX":150,
"GENOME":"flom1",
"MERGE_ARGS":"",
"MERGE_SCALING_CONSTANT":0.1,
"MERGE_SCALING_ENABLE":true,
"MERGE_SCALING_MAX":3,
"SCALING_INTERVAL":305,
"VIRTUAL_ASG_MAX_INCREASE":10,
"VIRTUAL_SCALING_INTERVAL":60
}


{"ALIGN_ARGS":"--very-sensitive-local","ALIGN_SCALING_CONSTANT":0.1,"ALIGN_SCALING_ENABLE":true,"ALIGN_SCALING_MAX":450,"CLEAR_INTERVAL":777,"DL_ARGS":"","DL_SCALING_CONSTANT":0.1,"DL_SCALING_ENABLE":true,"DL_SCALING_MAX":150,"GENOME":"flom1","MERGE_ARGS":"","MERGE_SCALING
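
The "local changes" step can be scripted instead of edited by hand. The sketch below rewrites one numeric setting in a toy copy of the config with `sed`. It assumes the one-key-per-line layout printed above, and `set_numeric` is a hypothetical helper, not part of Serratus.

```bash
#!/usr/bin/env bash
set -euo pipefail

tmp=$(mktemp -d)
cfg="$tmp/serratus-config.json"
# Toy subset of the config shown above, same one-key-per-line layout.
cat > "$cfg" <<'EOF'
{
"ALIGN_SCALING_MAX":450,
"DL_SCALING_MAX":150,
"GENOME":"flom1"
}
EOF

# Replace the numeric value of KEY, keeping any trailing comma.
set_numeric() {  # usage: set_numeric KEY VALUE FILE
  sed -E -i.bak "s/^(\"$1\":)[0-9.]+(,?)\$/\\1$2\\2/" "$3"
}

set_numeric ALIGN_SCALING_MAX 300 "$cfg"
new_line=$(grep '"ALIGN_SCALING_MAX"' "$cfg")
untouched=$(grep '"DL_SCALING_MAX"' "$cfg")
echo "$new_line"
rm -rf "$tmp"
```

The edited file would then be re-uploaded with the same `curl -T serratus-config.json localhost:8000/config` call used in the cell above.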

### Error Handling


In [None]:
# Error fixes (manually help along)

# ssh <scheduler IPv4>
# sudo docker ps
# sudo docker exec -it <container> bash
# apt install sqlite3 awscli

### ACCESSION OPERATIONS

# Reset SPLITTING accessions to NEW
# sqlite3 instance/scheduler.sqlite 'UPDATE acc SET state = "new" WHERE state = "splitting";'

# Reset SPLIT_ERR accessions to NEW
# (repeated failures can be missing SRA data)
# sqlite3 instance/scheduler.sqlite 'UPDATE acc SET state = "new" WHERE state = "split_err";'

# Reset MERGE_ERR accessions to MERGE_WAIT
# sqlite3 instance/scheduler.sqlite 'UPDATE acc SET state = "merge_wait" WHERE state = "merge_err";'

# Clear DONE Accessions (ONLY ON COMPLETION)
# sqlite3 instance/scheduler.sqlite 'DELETE FROM acc WHERE state = "merge_done";'

### BLOCK OPERATIONS

# Reset FAIL blocks to NEW
# sqlite3 instance/scheduler.sqlite 'UPDATE blocks SET state = "new" WHERE state = "fail";'

# Reset ALIGNING blocks to NEW
# sqlite3 instance/scheduler.sqlite 'UPDATE blocks SET state = "new" WHERE state = "aligning";'
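
The resets above all follow one pattern, so they can be generated rather than retyped. The sketch below only prints the statement and does not touch a database; `reset_sql` is a hypothetical helper. It writes string values in single quotes, which is the standard SQL form (SQLite reads double-quoted text as an identifier first and falls back to a string literal only for compatibility).

```bash
#!/usr/bin/env bash
set -euo pipefail

# Build "move every row in state FROM to state TO" for a scheduler table.
reset_sql() {  # usage: reset_sql TABLE FROM_STATE TO_STATE
  printf "UPDATE %s SET state = '%s' WHERE state = '%s';\n" "$1" "$3" "$2"
}

acc_sql=$(reset_sql acc split_err new)
blocks_sql=$(reset_sql blocks fail new)
echo "$acc_sql"
echo "$blocks_sql"

# On the scheduler container the statement would be passed to sqlite3, e.g.
#   sqlite3 instance/scheduler.sqlite "$acc_sql"
```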


## Shutdown procedures

Closing up shop.

In [10]:
# Dump the Scheduler SQLITE table to a local file
date
curl localhost:8000/db > \
  $WORK/zoo3_complete.sqlite

Mon May 18 23:38:30 PDT 2020
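
The two `date` stamps in this notebook bracket the run: the RunInfo upload at 19:59:54 and the database dump at 23:38:30, both on May 18. A small sketch of the arithmetic, assuming both stamps fall on the same day (`to_seconds` is a hypothetical helper):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Convert HH:MM:SS to seconds; 10# forces base 10 so "08" is not read as octal.
to_seconds() {
  local h m s
  IFS=: read -r h m s <<< "$1"
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

start="19:59:54"   # RunInfo loaded into the scheduler
stop="23:38:30"    # scheduler database dumped
elapsed=$(( $(to_seconds "$stop") - $(to_seconds "$start") ))

printf -v pretty '%dh %02dm %02ds' \
  $(( elapsed / 3600 )) $(( elapsed % 3600 / 60 )) $(( elapsed % 60 ))
echo "$pretty"
```

This prints `3h 38m 36s`, which is the wall-clock time between the two cells and therefore only an upper bound on how long the cluster was actually busy.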


In [11]:
terraform destroy -auto-approve
# WARNING this will also delete the standard output bucket/data
# Save data prior to destroy

module.download.aws_cloudwatch_log_group.g: Refreshing state... [id=serratus-dl]
module.download.data.aws_availability_zones.all: Refreshing state...
module.merge.data.aws_region.current: Refreshing state...
module.merge.data.aws_ami.amazon_linux_2: Refreshing state...
module.download.data.aws_ami.amazon_linux_2: Refreshing state...
module.merge.aws_cloudwatch_log_group.g: Refreshing state... [id=serratus-merge]
module.align.data.aws_availability_zones.all: Refreshing state...
module.monitoring.aws_ecs_cluster.monitor: Refreshing state... [id=arn:aws:ecs:us-east-1:797308887321:cluster/serratus-monitor]
module.monitoring.data.aws_ami.ecs: Refreshing state...
module.download.data.aws_region.current: Refreshing state...
module.scheduler.data.aws_ami.amazon_linux_2: Refreshing state...
module.align.aws_cloudwatch_log_group.g: Refreshing stat

## Destroy Cluster

Close out all resources with terraform (will take a few minutes).
