# EC2 Array Launch
```
pi:ababaian
files: ~/Crown/scripts/1kg_hgr0/
start: 2017 02 22
complete : 2017 03 03
```
## Introduction

The largest time-sink for me at the moment when running 1000 Genomes data is launching individual instances and pasting in the command to run the pipeline with a given set of parameters.

I'd like to automate this process to launch an 'array' of EC2 machines, each of which knows what it needs to do.

### Set-up / Initialization

Initialization to be done on a local machine.

In [None]:
# Create Key Pair for this
# Do this privately, don't share
aws ec2 create-key-pair --key-name XXXXXX --query 'KeyMaterial' --output text > XXXXXX.pem
chmod 400 XXXXXX.pem

# Create Security Group
aws ec2 create-security-group --group-name YYYYYY --description 'Crown Project sec-group for ec2 arrays'

# Then, in the EC2 control panel, edit the group's 'inbound' rules
# add: SSH, TCP, 22, Anywhere

# Test launch of a single instance with the new key pair and security group
aws ec2 run-instances --image-id ami-66129306 --count 1 \
--instance-type t2.micro --key-name XXXXXX --security-groups YYYYYY

Example output of `run-instances` (truncated)
```
{
    "OwnerId": "XXXXXXXX", 
    "ReservationId": "r-XXXXXXXX", 
    "Groups": [], 
    "Instances": [
        {
            "Monitoring": {
                "State": "disabled"
            }, 
            "PublicDnsName": "", 
            "RootDeviceType": "ebs", 
            "State": {
                "Code": 0, 
                "Name": "pending"
            }, 
            "EbsOptimized": false, 
            "LaunchTime": "2017-02-23T04:29:09.000Z", 
            "PrivateIpAddress": "172.31.27.171", 
            "ProductCodes": [], 
            "VpcId": "vpc-f418f593", 
            "StateTransitionReason": "", 
            "InstanceId": "i-XXXXXXXXXXXXXXXX", 
            "ImageId": "ami-66129306", 
            "PrivateDnsName": "ip-172-31-27-171.us-west-2.compute.internal", 
            "KeyName": "XXXXXXXX", 
            "SecurityGroups": [
                {
                    "GroupName": "XXXXXXXX", 
                    "GroupId": "XXXXXXXX"
                }
            ], 
            "ClientToken": "", 
            "SubnetId": "subnet-27030843", 
            "InstanceType": "t2.micro", 
            "NetworkInterfaces": [
                {
                    "Status": "in-use", 
                    "MacAddress": "02:fc:2d:42:61:c1", 
                    "SourceDestCheck": true, 
                    "VpcId": "vpc-XXXXXXXX", 
                    "Description": "", 
                    "NetworkInterfaceId": "eni-732bc300", 
                    "PrivateIpAddresses": [
                        {
                            "PrivateDnsName": "ip-172-31-27-171.us-west-2.compute.internal", 
                            "Primary": true, 
                            "PrivateIpAddress": "172.31.27.171"
                        }
                    ], 
                    "PrivateDnsName": "ip-172-31-27-171.us-west-2.compute.internal", 
                    "Attachment": {
                        "Status": "attaching", 
                        "DeviceIndex": 0, 
                        "DeleteOnTermination": true, 
                        "AttachmentId": "eni-attach-8e1101ec", 
                        "AttachTime": "2017-02-23T04:29:09.000Z"
                    }, 
                    "Groups": [
                        {
                            "GroupName": "XXXXXXXX", 
                            "GroupId": "XXXXXXXX"
                        }
                    ], 
                    "SubnetId": "subnet-XXXXXXXX", 
                    "OwnerId": "XXXXXXXX", 
                    "PrivateIpAddress": "XXXXXXXX"
                }
            ], 
            "SourceDestCheck": true, 
            "Placement": {
                "Tenancy": "default", 
                "GroupName": "", 
                "AvailabilityZone": "us-west-2b"
            }, 
            "Hypervisor": "xen", 
            "BlockDeviceMappings": [], 
            "Architecture": "x86_64", 
            "StateReason": {
```
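
Individual fields such as `InstanceId` are what the launch script needs from this output. The following is a minimal sketch of pulling one field out with `sed`, assuming the pretty-printed layout shown above (one `"Key": "value"` pair per line). The sample file, its instance ID, and the `json_field` helper are hypothetical stand-ins so the snippet runs without AWS access; it is plain text matching, not a real JSON parser.

```bash
#!/bin/bash
# Hypothetical sample: a trimmed-down stand-in for the run-instances output.
# The instance ID is a made-up placeholder, not a real value.
sample_json=$(mktemp)
cat > "$sample_json" <<'EOF'
{
    "ReservationId": "r-XXXXXXXX",
    "Instances": [
        {
            "PublicDnsName": "",
            "InstanceId": "i-0123456789abcdef0",
            "ImageId": "ami-66129306"
        }
    ]
}
EOF

# json_field FILE KEY: print the first string value stored under KEY.
# (hypothetical helper; relies on one "Key": "value" pair per line)
json_field() {
  sed -n 's/.*"'"$2"'": *"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

instance_id=$(json_field "$sample_json" InstanceId)
image_id=$(json_field "$sample_json" ImageId)

echo "Instance ID: $instance_id"
echo "AMI Image: $image_id"

rm -f "$sample_json"
```

In the real script the same value can probably be obtained directly from the CLI by adding `--query 'Instances[0].InstanceId' --output text` to `run-instances`, in the same way the key-pair command above uses `--query`.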

### queenB Launch Script

This is a 3-script set-up.

queenB.sh - Runs locally and controls EC2 launch and parameterization

droneB.sh - Script which executes commands on the newly launched EC2 instance

gather.sh - Optional, any script which is executed in a screen on the EC2 machine by droneB. This is the 'main pipeline' script.

In [None]:
#!/bin/bash
# queenB.sh
#
# EC2 Launch / Control Script
#

# Control Panel =========================
# EC2 Run Script - script for droneB to execute
TASK="s3://crownproject/scripts/gather.sh"

# Parameter file: each line holds the arguments one droneB
# passes to the TASK script (gather.sh)
PARAMETERS="pollen.coord"

# EC2 Set-up
instanceTYPE='t2.micro'
imageID='ami-66129306' #AMI

devNAME='/dev/sda1' # /dev/sda1 for Crown-AMI
volSIZE='25' # in Gb

# Number of instances to launch
#COUNT=2 # predetermined number
COUNT=$(wc -l < "$PARAMETERS") # one instance per line of the parameter file

# Security
keyNAME='XXXXXX'
keyPATH="XXXXXX"
secGROUP='XXXXXXX'

# =======================================

for ITER in $(seq 1 $COUNT)
do

  # Extract Parameters/Arguments ----------

  # Take line number $ITER of the parameter file; turn tabs into spaces
  ARGS=$(sed -n "${ITER}p" "$PARAMETERS" | tr '\t' ' ')

  echo "Launch instance # $ITER"
  echo "Instance Type: $instanceTYPE"
  echo "AMI Image: $imageID"
  echo "Run Script: $TASK"
  echo "Parameters: $ARGS"

  # Launch an instance --------------------
  # NOTE: --count 1, so each iteration of the for loop
  # launches exactly one instance
  aws ec2 run-instances --image-id $imageID --count 1 \
   --instance-type $instanceTYPE --key-name $keyNAME \
   --block-device-mappings "DeviceName=$devNAME,Ebs={VolumeSize=$volSIZE}" \
   --security-groups $secGROUP > launch.tmp

  # Another alternative is to use --user-data droneB.sh 
  # which will run at instance boot-up
  # passing arguments to it may be challenging

  # Retrieve instance ID
  instanceID=$(cat launch.tmp | \
    egrep -o -e 'InstanceId[":/A-Za-z0-9_ \\-]*' - |\
    cut -f2 -d' ' - | xargs)

  echo "Instance ID: $instanceID"


  # Add a few minute wait here to allow for Public DNS to be assigned
  # otherwise ssh doesn't work
  sleep 120s

  # Retrieve public DNS
  aws ec2 describe-instances --instance-ids $instanceID > launch2.tmp

  pubDNS=$(cat launch2.tmp | \
    egrep -o -m 1 -e 'PublicDnsName[.":/A-Za-z0-9_ \\-]*' - |\
    cut -f2 -d' ' - | xargs)

  echo "Public DNS: $pubDNS"

  # Access the instance -------------------

  LOGIN="ubuntu@$pubDNS" 

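  # droneB.sh is fed to the remote 'bash -s' on stdin; the words after it
  # ($TASK and the split $ARGS) become the remote script's $1, $2, ...
  ssh -i $keyPATH \
    -o StrictHostKeyChecking=no \
    $LOGIN 'bash -s' < droneB.sh $TASK $ARGS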

  # Cleanup
  rm -f launch.tmp launch2.tmp

  echo ''
  echo ''

done

# end of script
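
The fixed `sleep 120s` in queenB.sh could be swapped for a polling loop that stops as soon as the instance is reachable. Below is a minimal sketch of such a loop, not what the script currently does. The `wait_for` helper is hypothetical, and the readiness check is a local stand-in (a marker file that appears after one second) so that the snippet runs without AWS access.

```bash
#!/bin/bash
# wait_for TRIES DELAY CMD...: run CMD until it succeeds, at most TRIES times,
# sleeping DELAY seconds after each failed attempt.
# Returns 0 on success, 1 if every attempt failed.
# (hypothetical helper, not part of queenB.sh)
wait_for() {
  local tries=$1 delay=$2 attempt
  shift 2
  for attempt in $(seq 1 "$tries"); do
    if "$@"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Stand-in for a readiness check: a background job creates the marker file
# after one second, the way an instance becomes reachable after boot.
marker=$(mktemp -u)
( sleep 1; touch "$marker" ) &

if wait_for 10 1 test -e "$marker"; then
  status="ready"
else
  status="timeout"
fi
echo "Instance status: $status"

wait            # reap the background job
rm -f "$marker"
```

In queenB.sh the check passed to `wait_for` might be a short ssh attempt against `$LOGIN`. The AWS CLI also provides `aws ec2 wait instance-running --instance-ids $instanceID`, although an instance that has reached the running state may not accept ssh connections yet, so some retry is likely still needed.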

Parameterization of ssh: variables can also be passed as assignments placed before the remote command
```
ssh user@host ARG1=$ARG1 ARG2=$ARG2 'bash -s' <<'ENDSSH'
  # commands to run on remote host
  echo $ARG1 $ARG2
ENDSSH
```
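
The ssh call in queenB.sh takes the other route and passes positional arguments after `bash -s`. The following sketch shows that mechanism locally, without ssh: the script arrives on stdin and the trailing words become `$1`, `$2`, and so on. The stand-in script body is hypothetical; the `TASK` and `ARGS` values are taken from the examples in this notebook.

```bash
#!/bin/bash
# Local stand-in for droneB.sh (hypothetical content): report what it receives.
remote_script=$(mktemp)
cat > "$remote_script" <<'EOF'
echo "task=$1"
echo "nargs=$#"
echo "last=${!#}"
EOF

TASK="s3://crownproject/scripts/gather.sh"
ARGS="10MWEST 50M PINK FLOWR POLLEN! YUM"

# Same shape as the ssh call in queenB.sh, minus ssh itself.
# $ARGS is left unquoted on purpose so it splits into separate words.
received=$(bash -s "$TASK" $ARGS < "$remote_script")
echo "$received"

rm -f "$remote_script"
```

Over ssh the words are joined into one command string and parsed again by the remote shell, so arguments containing spaces or shell metacharacters would need extra quoting. The parameter lines used here (sample names and FTP URLs) avoid that problem.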

In [None]:
#!/bin/bash
# droneB.sh
#

# This script-layer is necessary to launch a screen session
# on each ec2-machine. The pipeline is run within that session
# and the output is logged. This allows 'looking in' on sessions
# as they are running.

# Commands to run on server-side
# ===============================================================

SCRIPTPATH=$1

SCRIPT=$(basename $1)

shift # drop first (TASK or SCRIPT variable)

# Download pipeline / droneB's function
  aws s3 cp $SCRIPTPATH ./

  chmod 777 *.sh

# open screen; run gather.sh function. -L logged
  screen -Ldmt sh ~/$SCRIPT $@


# ===============================================================
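
The argument handling in droneB.sh (first argument is the S3 path, the rest are pipeline parameters) can be checked locally without S3 or screen. The sketch below wraps the same `basename` and `shift` steps in a hypothetical function, `drone_args`, that only prints what would be downloaded and run.

```bash
#!/bin/bash
# drone_args S3PATH ARGS...: mimic droneB.sh's argument handling and print
# the planned actions instead of performing them.
# (hypothetical wrapper for illustration; nothing is downloaded or executed)
drone_args() {
  local scriptpath=$1
  local script
  script=$(basename "$scriptpath")   # file name of the pipeline script
  shift                              # drop the script path, keep parameters
  echo "download: $scriptpath"
  echo "run: ~/$script $*"
}

plan=$(drone_args s3://crownproject/scripts/gather.sh 10MWEST 50M PINK)
echo "$plan"
```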


In [None]:
#!/bin/bash
# gather.sh
#
# Server-side pipeline script
# to be executed on ec2-machine
# within a screen

echo HELLO WORLD!!!

echo $1 $2 $3 $4 $5 $6

touch hello.world

`pollen.coord`
Input file containing the arguments for gather.sh, one line per instance. Space-delimited is preferred, but tab-delimited works as well because queenB.sh converts tabs to spaces.

Example below

In [None]:
10MWEST	50M	PINK	FLOWR	POLLEN!	YUM
54MEAST	20M	HOUSE	HIVE	ENEMY!	STING
NA19017	NA19017	SRR1295544	LWK	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/004/SRR1295544/SRR1295544_1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/004/SRR1295544/SRR1295544_2.fastq.gz
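
As an alternative to counting lines and extracting each one with `sed`, a file like this can be read directly with a `while read` loop. This is a minimal sketch and not how queenB.sh currently works. The temporary file reuses the first two example lines, one written with tabs and one with spaces, and the variable names `f1` to `f6` are placeholders.

```bash
#!/bin/bash
# Hypothetical two-line parameter file: one tab-delimited, one space-delimited.
params=$(mktemp)
printf '10MWEST\t50M\tPINK\tFLOWR\tPOLLEN!\tYUM\n' > "$params"
printf '54MEAST 20M HOUSE HIVE ENEMY! STING\n' >> "$params"

n=0
summary=""
# With the default IFS, read splits fields on both spaces and tabs.
# The '|| [ -n "$f1" ]' part keeps a final line that lacks a trailing newline.
while read -r f1 f2 f3 f4 f5 f6 || [ -n "$f1" ]; do
  n=$((n + 1))
  echo "instance $n: $f1 $f2 $f3 $f4 $f5 $f6"
  summary="$summary$f1:$f6 "
done < "$params"

rm -f "$params"
```

Because `read` treats tabs and spaces alike, no separate tab-to-space conversion is needed in this form.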

### Run queenB for two libraries

1kg_runs_1.txt
```
NA20845	NA20845	SRR1295542	GIH	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/002/SRR1295542/SRR1295542_1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/002/SRR1295542/SRR1295542_2.fastq.gz
NA18525	NA18525	SRR1295539	CHB	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/009/SRR1295539/SRR1295539_1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/009/SRR1295539/SRR1295539_2.fastq.gz
```

queenB.sh Controls
```
# Control Panel =========================
# EC2 Run Script - script for droneB to execute
TASK="s3://crownproject/scripts/1kg_align_v0.sh"

# Parameter file: each line holds the arguments one droneB
# passes to the TASK script
PARAMETERS="1kg_runs_1.txt"

# EC2 Set-up
instanceTYPE='c4.2xlarge'
imageID='ami-66129306' #AMI

devNAME='/dev/sda1' # /dev/sda1 for Crown-AMI
volSIZE='200' # in Gb
```

In [4]:
cd ~/Crown/data/tmp
ls; echo ''

sh queenB.sh

1kg_runs_1.txt	droneB.sh  queenB.sh

Launch instance # 1
Instance Type: c4.2xlarge
AMI Image: ami-66129306
Run Script: s3://crownproject/scripts/1kg_align_v0.sh
Parameters: NA20845 NA20845 SRR1295542 GIH ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/002/SRR1295542/SRR1295542_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/002/SRR1295542/SRR1295542_2.fastq.gz
Instance ID: i-062f505bd1cbdf5b5
Public DNS: ec2-52-11-70-27.us-west-2.compute.amazonaws.com
download: s3://crownproject/scripts/1kg_align_v0.sh to ./1kg_align_v0.sh


Launch instance # 2
Instance Type: c4.2xlarge
AMI Image: ami-66129306
Run Script: s3://crownproject/scripts/1kg_align_v0.sh
Parameters: NA18525 NA18525 SRR1295539 CHB ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/009/SRR1295539/SRR1295539_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/009/SRR1295539/SRR1295539_2.fastq.gz
Instance ID: i-07d5fb0675cc1b8f8
Public DNS: ec2-52-40-200-91.us-west-2.compute.amazonaws.com
download: s3://crownproject/scri

 - This 'worked' in the sense that the launch array did exactly what it was supposed to. However, the pipeline script `1kg_align_v0.sh` was the older 'A' version, not the 'B' version, so the entire .bam files were copied to S3 (and then manually moved to the subfolder Crown/1kg_align/fullBam/* )
 
 - I copied the correct 'B' version of the pipeline script and am re-running the command.

In [1]:
cd ~/Crown/data/tmp
ls; echo ''

sh queenB.sh

1kg_align_v0.sh  1kg_runs_1.txt  droneB.sh  queenB.sh

Launch instance # 1
Instance Type: c4.2xlarge
AMI Image: ami-66129306
Run Script: s3://crownproject/scripts/1kg_align_v0.sh
Parameters: NA20845 NA20845 SRR1295542 GIH ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/002/SRR1295542/SRR1295542_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/002/SRR1295542/SRR1295542_2.fastq.gz
Instance ID: i-05792a3db89b404d8
Public DNS: ec2-52-39-60-213.us-west-2.compute.amazonaws.com
download: s3://crownproject/scripts/1kg_align_v0.sh to ./1kg_align_v0.sh


Launch instance # 2
Instance Type: c4.2xlarge
AMI Image: ami-66129306
Run Script: s3://crownproject/scripts/1kg_align_v0.sh
Parameters: NA18525 NA18525 SRR1295539 CHB ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/009/SRR1295539/SRR1295539_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/009/SRR1295539/SRR1295539_2.fastq.gz
Instance ID: i-03ffa18bebb970057
Public DNS: ec2-52-89-89-251.us-west-2.compute.amazonaws.com
download: s3:

## Results + Discussion

The EC2 launch array is operational. I can feed it a single tab-separated file of pipeline parameters, and each line is run through v0 of the pipeline on its own instance. The next step is to give it lots of genomes and generate some data!

This also means future iterations can be done much faster once refinements are made.


### Finish Pilot Run

Run the last four genomes from the pilot side of the study. Update to 1kg_runs.txt.

Note: Using a c4.xlarge instance and dropping from 7 processing threads to 3. Some rough back-of-the-envelope calculations suggest this is more cost-effective.

```
HG00268	HG00268	SRR1291262	FIN	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/002/SRR1291262/SRR1291262_1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/002/SRR1291262/SRR1291262_2.fastq.gz
HG00731_pcr	HG00731	ERR903028	PUR	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR903/ERR903028/ERR903028_1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR903/ERR903028/ERR903028_2.fastq.gz
HG00096	HG00096	SRR1291026	GBR	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/006/SRR1291026/SRR1291026_1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/006/SRR1291026/SRR1291026_2.fastq.gz
NA19017	NA19017	SRR1295544	LWK	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/004/SRR1295544/SRR1295544_1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/004/SRR1295544/SRR1295544_2.fastq.gz
```
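
The cost comparison behind the instance-type note above comes down to hourly price times wall-clock hours per genome. The sketch below shows that arithmetic only. All four input numbers are placeholders chosen for illustration; they are not measured runtimes or actual EC2 prices and should be replaced with current prices and observed run times. The `cost_per_genome` helper is hypothetical.

```bash
#!/bin/bash
# cost_per_genome PRICE_PER_HOUR HOURS: instance cost of one run, in dollars.
# (hypothetical helper)
cost_per_genome() {
  awk -v p="$1" -v h="$2" 'BEGIN { printf "%.2f", p * h }'
}

# PLACEHOLDER inputs: not real prices, not measured runtimes.
big_price=0.40;   big_hours=5     # stands in for c4.2xlarge with 7 threads
small_price=0.20; small_hours=8   # stands in for c4.xlarge with 3 threads

big_cost=$(cost_per_genome "$big_price" "$big_hours")
small_cost=$(cost_per_genome "$small_price" "$small_hours")

# Runtime at which the smaller instance costs the same as the larger one
break_even=$(awk -v bp="$big_price" -v bh="$big_hours" -v sp="$small_price" \
  'BEGIN { printf "%.1f", bp * bh / sp }')

echo "larger instance:  \$$big_cost per genome"
echo "smaller instance: \$$small_cost per genome"
echo "smaller instance is cheaper if it finishes in under $break_even hours"
```

The break-even line is the useful part: the smaller instance only wins if its runtime grows by less than the ratio of the two hourly prices.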


In [1]:
cd ~/Crown/data/tmp
ls; echo ''

sh queenB.sh

1kg_align_v0.sh  1kg_runs_1.txt  droneB.sh  queenB.sh

Launch instance # 1
Instance Type: c4.xlarge
AMI Image: ami-66129306
Run Script: s3://crownproject/scripts/1kg_align_v0.sh
Parameters: HG00268 HG00268 SRR1291262 FIN ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/002/SRR1291262/SRR1291262_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/002/SRR1291262/SRR1291262_2.fastq.gz
Instance ID: i-0181dbb681d960a20
Public DNS: ec2-52-42-108-212.us-west-2.compute.amazonaws.com
download: s3://crownproject/scripts/1kg_align_v0.sh to ./1kg_align_v0.sh


Launch instance # 2
Instance Type: c4.xlarge
AMI Image: ami-66129306
Run Script: s3://crownproject/scripts/1kg_align_v0.sh
Parameters: HG00731_pcr HG00731 ERR903028 PUR ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR903/ERR903028/ERR903028_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR903/ERR903028/ERR903028_2.fastq.gz
Instance ID: i-071261922488bd3ec
Public DNS: ec2-52-39-230-115.us-west-2.compute.amazonaws.com
download: s3://crownpr