## [Cromwell on AWS](https://docs.opendata.aws/genomics-workflows/)

[Cromwell](https://github.com/broadinstitute/cromwell) is a Workflow Management System geared towards scientific workflows. Cromwell is open sourced under the [BSD 3-Clause license](https://github.com/broadinstitute/cromwell/blob/develop/LICENSE.txt).

![Image of Cromwell](https://docs.opendata.aws/genomics-workflows/cromwell/images/cromwell-on-aws_infrastructure.png)

### Initialize Notebook Environment

In [None]:
import boto3
import sys
import os
import json
import base64
import project_path # path to helper methods
import pprint

from lib import workshop
from botocore.exceptions import ClientError

cfn = boto3.client('cloudformation')

session = boto3.session.Session()
region = session.region_name

key_name = 'cromwell'

### [Create S3 Bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html)

We will create an S3 bucket that will be used throughout the workshop for storing our data.

[s3.create_bucket](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.create_bucket) boto3 documentation

In [None]:
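bucket = workshop.create_bucket_name('genomics-')
# us-east-1 rejects an explicit LocationConstraint, so only set it for other regions.
bucket_config = {} if region == 'us-east-1' else {'CreateBucketConfiguration': {'LocationConstraint': region}}
session.resource('s3').create_bucket(Bucket=bucket, **bucket_config)
print(bucket)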

### [Create VPC](https://aws.amazon.com/vpc/)

Amazon Virtual Private Cloud (Amazon VPC) lets you provision a logically isolated section of the AWS Cloud where you can launch AWS resources in a virtual network that you define. You have complete control over your virtual networking environment, including selection of your own IP address range, creation of subnets, and configuration of route tables and network gateways. You can use both IPv4 and IPv6 in your VPC for secure and easy access to resources and applications.

In [None]:
vpc, subnet, subnet2 = workshop.create_and_configure_vpc()
vpc_id = vpc.id
subnet_id = subnet.id
subnet2_id = subnet2.id
print(vpc_id)
print(subnet_id)
print(subnet2_id)

### [Create EC2 Keypair](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html)

Amazon EC2 uses public–key cryptography to encrypt and decrypt login information. Public–key cryptography uses a public key to encrypt a piece of data, such as a password, then the recipient uses the private key to decrypt the data. The public and private keys are known as a key pair.

[ec2.create_key_pair](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ec2.html#EC2.ServiceResource.create_key_pair) boto3 documentation

In [None]:
# EC2 resource and client used here and in the cleanup section.
ec2 = boto3.resource('ec2')
ec2_client = boto3.client('ec2')

try:
    # Raises ClientError with InvalidKeyPair.NotFound if the key pair does not exist.
    ec2_client.describe_key_pairs(KeyNames=[key_name])
    print('Keypair: %s already exists' % key_name)
except ClientError as e:
    if e.response['Error']['Code'] != 'InvalidKeyPair.NotFound':
        raise
    print('Creating keypair: %s' % key_name)
    # Create an SSH key to use when logging into instances.
    key_pair = ec2.create_key_pair(KeyName=key_name)
    with open(key_name + '.pem', 'w') as outfile:
        outfile.write(str(key_pair.key_material))
    # File modes are octal: 0o400 is owner read-only (decimal 400 is a different mode).
    os.chmod(key_name + '.pem', 0o400)
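
File modes passed to `os.chmod` are octal, so owner read-only is `0o400` rather than decimal `400`. The following is a minimal sketch of saving key material so that the file is private from the moment it is created. It assumes a POSIX filesystem, and `save_private_key` is a hypothetical helper, not part of boto3 or the workshop library.

```python
import os
import stat
import tempfile


def save_private_key(path, key_material):
    """Write key material to `path`, readable only by the owner (hypothetical helper)."""
    # O_EXCL refuses to overwrite an existing file; 0o600 keeps the file private from creation.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, 'w') as handle:
        handle.write(key_material)
    # Tighten to owner read-only once the content is written.
    os.chmod(path, 0o400)
    # Return the permission bits so the caller can verify them.
    return stat.S_IMODE(os.stat(path).st_mode)


# Demonstration with placeholder content in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.pem')
demo_mode = save_private_key(demo_path, 'not-a-real-key')
print(oct(demo_mode))
```

In the notebook this could be called with `key_name + '.pem'` and the `key_material` returned by `create_key_pair`.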

### [Create a custom AMI for Cromwell](https://docs.opendata.aws/genomics-workflows/aws-batch/create-custom-ami/)

In all cases, you will need an AMI ID for the AWS Batch Compute Resource AMI that you created using the ["Create a Custom AMI"](https://docs.opendata.aws/genomics-workflows/aws-batch/create-custom-ami/) guide. We do not provide a default value because most genomics workloads need more storage than the default AWS Batch AMI provides. We will launch a [CloudFormation](https://aws.amazon.com/cloudformation/) template to generate the custom AMI for use with Cromwell.

In [None]:
print("https://console.aws.amazon.com/cloudformation/home?#/stacks/new?stackName=GenomicsWorkflow-AMI&templateURL=https://s3.amazonaws.com/aws-genomics-workflows/templates/create-genomics-ami/create-custom-ami-existing-vpc.yaml")

In [None]:
!wget https://s3.amazonaws.com/aws-genomics-workflows/templates/create-genomics-ami/create-custom-ami-existing-vpc.yaml

In [None]:
!cat create-custom-ami-existing-vpc.yaml

### [Launching the CloudFormation stacks](https://docs.opendata.aws/genomics-workflows/aws-batch/configure-aws-batch-cfn/)

The link below provides a CloudFormation template to deploy a base AWS Batch environment for genomics workflows. The `Full Stack` template is self-contained and will create all of the AWS resources, including VPC network, security groups, etc. The template defaults to using two Availability Zones for deploying instances. If you need more than this, leverage the next template.

In [None]:
print("https://console.aws.amazon.com/cloudformation/home?#/stacks/new?stackName=GenomicsEnv-Batch&templateURL=https://s3.amazonaws.com/aws-genomics-workflows/templates/aws-genomics-root-novpc.template.yaml")

In [None]:
!wget https://s3.amazonaws.com/aws-genomics-workflows/templates/aws-genomics-root-novpc.template.yaml

In [None]:
!cat aws-genomics-root-novpc.template.yaml

### Get Status of the CloudFormation template

We want to check the status of the CloudFormation stack while it is being created. Its outputs become available once creation completes.

[cfn.describe_stacks](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/cloudformation.html#CloudFormation.Client.describe_stacks)

In [None]:
import pandas as pd

response = cfn.describe_stacks(StackName='GenomicsEnv-Batch')

print(response["Stacks"][0]["StackStatus"] +'\n')
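
Stack creation takes several minutes, so the status cell usually has to be re-run. A minimal polling sketch is shown below. `wait_for_stack` is a hypothetical helper that accepts any callable with the same keyword signature as `cfn.describe_stacks`; a stand-in is used here so the example runs without AWS credentials.

```python
import time


def wait_for_stack(describe_stacks, stack_name, delay=30, max_attempts=120, sleep=time.sleep):
    """Poll until the stack leaves every *_IN_PROGRESS state; return the final status."""
    for _ in range(max_attempts):
        response = describe_stacks(StackName=stack_name)
        status = response["Stacks"][0]["StackStatus"]
        if not status.endswith("_IN_PROGRESS"):
            return status
        sleep(delay)
    raise TimeoutError("%s still in progress after %d checks" % (stack_name, max_attempts))


# Stand-in for cfn.describe_stacks that reports two in-progress checks, then completion.
_fake_statuses = iter(["CREATE_IN_PROGRESS", "CREATE_IN_PROGRESS", "CREATE_COMPLETE"])


def fake_describe_stacks(StackName):
    return {"Stacks": [{"StackName": StackName, "StackStatus": next(_fake_statuses)}]}


final_status = wait_for_stack(fake_describe_stacks, "GenomicsEnv-Batch", sleep=lambda seconds: None)
print(final_status)
```

In the notebook, pass `cfn.describe_stacks` as the first argument. boto3 also ships a built-in waiter, `cfn.get_waiter('stack_create_complete')`, which covers the creation case.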

### Get Outputs from CloudFormation

In [None]:
outputs = response["Stacks"][0]["Outputs"]
pd.set_option('display.max_colwidth', None)  # None disables truncation (pandas >= 1.0; -1 is no longer accepted)
pd.DataFrame(outputs, columns=["OutputKey", "OutputValue"])
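
The DataFrame is convenient for reading, but later steps need individual values. A small sketch, assuming the `Outputs` list has the shape returned by `describe_stacks`; `outputs_to_dict` is a hypothetical helper and the sample values are placeholders.

```python
def outputs_to_dict(outputs):
    """Map each stack output's OutputKey to its OutputValue."""
    return {item["OutputKey"]: item["OutputValue"] for item in outputs}


# Placeholder values; real ones come from response["Stacks"][0]["Outputs"].
sample_outputs = [
    {"OutputKey": "GenomicsEnvS3Bucket", "OutputValue": "example-bucket"},
    {
        "OutputKey": "GenomicsEnvDefaultJobQueueArn",
        "OutputValue": "example-queue-arn",
        "Description": "placeholder description",
    },
]

stack_outputs = outputs_to_dict(sample_outputs)
print(stack_outputs["GenomicsEnvS3Bucket"])
```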

### [Launch the Cromwell CloudFormation stack](https://docs.opendata.aws/genomics-workflows/cromwell/cromwell-aws-batch/)

#### Cromwell Server
For the highest level of security and robustness with long-running workflows, we recommend using an EC2 instance as your Cromwell server for submitting workflows to AWS Batch.

A couple of things to note:

* This server does not need to be permanent. In fact, when you are not running workflows, you should stop or terminate the instance so that you are not paying for resources you are not using.

* You can launch a Cromwell server just for yourself and exactly when you need it.

* This server does not need to be in the same VPC as the one that Batch will launch instances in.

#### Parameters
When launching the CloudFormation template, copy the `GenomicsEnvS3Bucket` value into the `S3BucketName` parameter and the `GenomicsEnvDefaultJobQueueArn` value into the `BatchQueue` parameter under the `Cromwell Configuration` section.
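
The same mapping can be expressed in code instead of copied by hand. The sketch below builds a list in the `ParameterKey`/`ParameterValue` shape that boto3's `create_stack` accepts for its `Parameters` argument. `cromwell_stack_parameters` is a hypothetical helper, the sample values are placeholders, and the template may require further parameters that are not covered here.

```python
def cromwell_stack_parameters(batch_outputs):
    """Build a CloudFormation Parameters list from the Batch stack outputs."""
    # Cromwell template parameter -> output key of the GenomicsEnv-Batch stack.
    mapping = {
        "S3BucketName": "GenomicsEnvS3Bucket",
        "BatchQueue": "GenomicsEnvDefaultJobQueueArn",
    }
    missing = [key for key in mapping.values() if key not in batch_outputs]
    if missing:
        raise KeyError("missing stack outputs: %s" % ", ".join(missing))
    return [
        {"ParameterKey": parameter, "ParameterValue": batch_outputs[output]}
        for parameter, output in mapping.items()
    ]


# Placeholder output values for illustration.
parameters = cromwell_stack_parameters({
    "GenomicsEnvS3Bucket": "example-bucket",
    "GenomicsEnvDefaultJobQueueArn": "example-queue-arn",
})
print(parameters)
```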

In [None]:
print("https://console.aws.amazon.com/cloudformation/home?#/stacks/new?stackName=CromwellServer&templateURL=https://s3.amazonaws.com/aws-genomics-workflows/templates/cromwell/cromwell-server.template.yaml")

In [None]:
!wget https://s3.amazonaws.com/aws-genomics-workflows/templates/cromwell/cromwell-server.template.yaml

In [None]:
!cat cromwell-server.template.yaml

### Get Status of Cromwell CloudFormation stack 

In [None]:
response = cfn.describe_stacks(StackName='CromwellServer')

print(response["Stacks"][0]["StackStatus"] +'\n')

### Get Outputs from CloudFormation

In [None]:
outputs = response["Stacks"][0]["Outputs"]
pd.set_option('display.max_colwidth', None)  # None disables truncation (pandas >= 1.0; -1 is no longer accepted)
pd.DataFrame(outputs, columns=["OutputKey", "OutputValue"])

In [None]:
%%writefile simple-hello.wdl

task echoHello{
    command {
        echo "Hello AWS!"
    }
    runtime {
        docker: "ubuntu:latest"
    }

}

workflow printHelloAndGoodbye {
    call echoHello
}

In [None]:
!curl -X POST "http://{{cromwell server}}/api/workflows/v1" \
    -H "accept: application/json" \
    -F "workflowSource=@simple-hello.wdl"

### Real world example

In [None]:
%%writefile HaplotypeCaller.aws.wdl

## Copyright Broad Institute, 2017
##
## This WDL workflow runs HaplotypeCaller from GATK4 in GVCF mode on a single sample
## according to the GATK Best Practices (June 2016), scattered across intervals.
##
## Requirements/expectations :
## - One analysis-ready BAM file for a single sample (as identified in RG:SM)
## - Set of variant calling intervals lists for the scatter, provided in a file
##
## Outputs :
## - One GVCF file and its index
##
## Cromwell version support
## - Successfully tested on v29
## - Does not work on versions < v23 due to output syntax
##
## IMPORTANT NOTE: HaplotypeCaller in GATK4 is still in evaluation phase and should not
## be used in production until it has been fully vetted. In the meantime, use the GATK3
## version for any production needs.
##
## Runtime parameters are optimized for Broad's Google Cloud Platform implementation.
##
## LICENSING :
## This script is released under the WDL source code license (BSD-3) (see LICENSE in
## https://github.com/broadinstitute/wdl). Note however that the programs it calls may
## be subject to different licenses. Users are responsible for checking that they are
## authorized to run all programs before running this script. Please see the dockers
## for detailed licensing information pertaining to the included programs.

# WORKFLOW DEFINITION
workflow HaplotypeCallerGvcf_GATK4 {
  File input_bam
  File input_bam_index
  File ref_dict
  File ref_fasta
  File ref_fasta_index
  File scattered_calling_intervals_list

  String gatk_docker

  String gatk_path

  Array[File] scattered_calling_intervals = read_lines(scattered_calling_intervals_list)

  String sample_basename = basename(input_bam, ".bam")

  String gvcf_name = sample_basename + ".g.vcf.gz"
  String gvcf_index = sample_basename + ".g.vcf.gz.tbi"

  # Call variants in parallel over grouped calling intervals
  scatter (interval_file in scattered_calling_intervals) {

    # Generate GVCF by interval
    call HaplotypeCaller {
      input:
        input_bam = input_bam,
        input_bam_index = input_bam_index,
        interval_list = interval_file,
        gvcf_name = gvcf_name,
        ref_dict = ref_dict,
        ref_fasta = ref_fasta,
        ref_fasta_index = ref_fasta_index,
        docker_image = gatk_docker,
        gatk_path = gatk_path
    }
  }

  # Merge per-interval GVCFs
  call MergeGVCFs {
    input:
      input_vcfs = HaplotypeCaller.output_gvcf,
      vcf_name = gvcf_name,
      vcf_index = gvcf_index,
      docker_image = gatk_docker,
      gatk_path = gatk_path
  }

  # Outputs that will be retained when execution is complete
  output {
    File output_merged_gvcf = MergeGVCFs.output_vcf
    File output_merged_gvcf_index = MergeGVCFs.output_vcf_index
  }
}

# TASK DEFINITIONS

# HaplotypeCaller per-sample in GVCF mode
task HaplotypeCaller {
  File input_bam
  File input_bam_index
  String gvcf_name
  File ref_dict
  File ref_fasta
  File ref_fasta_index
  File interval_list
  Int? interval_padding
  Float? contamination
  Int? max_alt_alleles

  Int preemptible_tries
  Int disk_size
  String mem_size

  String docker_image
  String gatk_path
  String java_opt

  command {
    ${gatk_path} --java-options ${java_opt} \
      HaplotypeCaller \
      -R ${ref_fasta} \
      -I ${input_bam} \
      -O ${gvcf_name} \
      -L ${interval_list} \
      -ip ${default=100 interval_padding} \
      -contamination ${default=0 contamination} \
      --max-alternate-alleles ${default=3 max_alt_alleles} \
      -ERC GVCF
  }

  runtime {
    docker: docker_image
    memory: mem_size
    cpu: 1
    disks: "local-disk"
  }

  output {
    File output_gvcf = "${gvcf_name}"
  }
}

# Merge GVCFs generated per-interval for the same sample
task MergeGVCFs {
  Array[File] input_vcfs
  String vcf_name
  String vcf_index

  Int preemptible_tries
  Int disk_size
  String mem_size

  String docker_image
  String gatk_path
  String java_opt

  command {
    ${gatk_path} --java-options ${java_opt} \
      MergeVcfs \
      --INPUT=${sep=' --INPUT=' input_vcfs} \
      --OUTPUT=${vcf_name}
  }

  runtime {
    docker: docker_image
    memory: mem_size
    cpu: 1
    disks: "local-disk"
  }

  output {
    File output_vcf = "${vcf_name}"
    File output_vcf_index = "${vcf_index}"
  }
}

### Input parameters

In [None]:
%%writefile HaplotypeCaller.aws.json

{
  "##_COMMENT1": "INPUT BAM",
  "HaplotypeCallerGvcf_GATK4.input_bam": "s3://gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_small.hg38.bam",
  "HaplotypeCallerGvcf_GATK4.input_bam_index": "s3://gatk-test-data/wgs_bam/NA12878_24RG_hg38/NA12878_24RG_small.hg38.bai",

  "##_COMMENT2": "REFERENCE FILES",
  "HaplotypeCallerGvcf_GATK4.ref_dict": "s3://broad-references/hg38/v0/Homo_sapiens_assembly38.dict",
  "HaplotypeCallerGvcf_GATK4.ref_fasta": "s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta",
  "HaplotypeCallerGvcf_GATK4.ref_fasta_index": "s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai",

  "##_COMMENT3": "INTERVALS",
  "HaplotypeCallerGvcf_GATK4.scattered_calling_intervals_list": "s3://gatk-test-data/intervals/hg38_wgs_scattered_calling_intervals.txt",
  "HaplotypeCallerGvcf_GATK4.HaplotypeCaller.interval_padding": 100,

  "##_COMMENT4": "DOCKERS",
  "HaplotypeCallerGvcf_GATK4.gatk_docker": "broadinstitute/gatk:4.0.0.0",

  "##_COMMENT5": "PATHS",
  "HaplotypeCallerGvcf_GATK4.gatk_path": "/gatk/gatk",

  "##_COMMENT6": "JAVA OPTIONS",
  "HaplotypeCallerGvcf_GATK4.HaplotypeCaller.java_opt": "-Xms8000m",
  "HaplotypeCallerGvcf_GATK4.MergeGVCFs.java_opt": "-Xms8000m",

  "##_COMMENT7": "MEMORY ALLOCATION",
  "HaplotypeCallerGvcf_GATK4.HaplotypeCaller.mem_size": "10 GB",
  "HaplotypeCallerGvcf_GATK4.MergeGVCFs.mem_size": "30 GB",

  "##_COMMENT8": "DISK SIZE ALLOCATION",
  "HaplotypeCallerGvcf_GATK4.HaplotypeCaller.disk_size": 100,
  "HaplotypeCallerGvcf_GATK4.MergeGVCFs.disk_size": 100,

  "##_COMMENT9": "PREEMPTION",
  "HaplotypeCallerGvcf_GATK4.HaplotypeCaller.preemptible_tries": 3,
  "HaplotypeCallerGvcf_GATK4.MergeGVCFs.preemptible_tries": 3
}
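
Before submitting, it can help to confirm that the inputs file covers every workflow-level input declared in the WDL. The following is a rough sketch: `check_inputs` is a hypothetical helper, the required names are taken from the `HaplotypeCallerGvcf_GATK4` workflow declarations above, and the sample JSON is a deliberately incomplete placeholder.

```python
import json

# Workflow-level inputs declared without a default in HaplotypeCallerGvcf_GATK4.
REQUIRED_WORKFLOW_INPUTS = [
    "input_bam", "input_bam_index", "ref_dict", "ref_fasta", "ref_fasta_index",
    "scattered_calling_intervals_list", "gatk_docker", "gatk_path",
]


def check_inputs(inputs_json, workflow="HaplotypeCallerGvcf_GATK4"):
    """Return (inputs without '##' comment keys, names of missing workflow inputs)."""
    raw = json.loads(inputs_json)
    inputs = {key: value for key, value in raw.items() if not key.startswith("##")}
    missing = [
        name for name in REQUIRED_WORKFLOW_INPUTS
        if "%s.%s" % (workflow, name) not in inputs
    ]
    return inputs, missing


# Incomplete placeholder inputs to show what the check reports.
sample_inputs_json = json.dumps({
    "##_COMMENT1": "INPUT BAM",
    "HaplotypeCallerGvcf_GATK4.input_bam": "s3://example-bucket/sample.bam",
    "HaplotypeCallerGvcf_GATK4.gatk_path": "/gatk/gatk",
})

inputs, missing = check_inputs(sample_inputs_json)
print(missing)
```

To check the real file, pass `open('HaplotypeCaller.aws.json').read()` instead of the sample string.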

### Submit job to Cromwell server

In [None]:
!curl -X POST "http://{{cromwell server}}/api/workflows/v1" \
    -H "accept: application/json" \
    -F "workflowSource=@HaplotypeCaller.aws.wdl" \
    -F "workflowInputs=@HaplotypeCaller.aws.json"
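
The submission returns a JSON body containing the workflow `id`, which can then be used to query the workflow's status. Below is a minimal sketch of the bookkeeping, assuming the response shapes documented for the Cromwell REST API. `status_url` and `is_finished` are hypothetical helpers, and the host name and response bodies are placeholders rather than real server output.

```python
import json

# Workflow states after which Cromwell does no further work.
TERMINAL_STATES = {"Succeeded", "Failed", "Aborted"}


def status_url(server, workflow_id):
    """URL of the status endpoint for one workflow."""
    return "http://%s/api/workflows/v1/%s/status" % (server, workflow_id)


def is_finished(status_body):
    """True if a status response body reports a terminal state."""
    return json.loads(status_body)["status"] in TERMINAL_STATES


# Placeholder response in the shape returned by the submission POST.
submit_response = '{"id": "00000000-0000-0000-0000-000000000000", "status": "Submitted"}'
workflow_id = json.loads(submit_response)["id"]

url = status_url("cromwell.example.com", workflow_id)
print(url)
print(is_finished(submit_response))
```

Fetching `url` with `curl` returns a body of the same shape, which can be passed to `is_finished` to decide whether to keep polling.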

## Cleanup

In [None]:
response = cfn.delete_stack(StackName='CromwellServer')

In [None]:
response = cfn.delete_stack(StackName='GenomicsEnv-Batch')

In [None]:
response = cfn.delete_stack(StackName='GenomicsWorkflow-AMI')

In [None]:
response = boto3.client('ec2').delete_key_pair(KeyName=key_name)

In [None]:
workshop.vpc_cleanup(vpc_id)