In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import os
import json
import cromwell_tools as cwt
from google.cloud import storage
from IPython import display

## Basic Cromwell Tools Functionality

This notebook walks through the basic functionality of `cromwell_tools`. A typical session involves defining WDL workflows and running them on a Cromwell server.

This server can either be running locally, in which case you access it from a localhost URL, or be a web service with its own URL. For the purposes of this demo notebook, we'll assume you're setting up a local server. We also assume you've set up and authenticated a version of the google-cloud-sdk and can make a successful call to `google.cloud.storage.Client()`.

## Set up clients & confirm they're running

In [4]:
# no username and password for localhost. 
google_project = 'broad-dsde-mint-dev'
cromwell_url = 'http://localhost:6361'

local_config = {'cromwell_url': cromwell_url}
cromwell = cwt.Cromwell(**local_config)

The constructor for `Cromwell` checks that the URL you've specified points to a valid, running server. You can re-run this check at any time with `Cromwell.server_is_running()`, which also helps debug rare cases where your server shuts down mid-workflow.

In [5]:
cromwell.server_is_running()

True

In [6]:
# confirm client is properly authenticated by listing the buckets
client = storage.Client(project=google_project)
buckets = list(client.list_buckets())
print(buckets[:2]) # just list the first two

[<Bucket: artifacts.broad-dsde-mint-dev.appspot.com>, <Bucket: broad-dsde-mint-dev>]


## Define an example workflow

Cromwell is set up to accept local files, Google Storage endpoints, and http or https endpoints for its inputs. This demo uses the test files from a local checkout of the `cromwell_tools` git repository. The WDL file runs a testing workflow that spins up an inexpensive Google instance and sleeps for 15 seconds. We will attach a monitoring script to it so we can see how memory and disk usage fluctuate across the run.

In [125]:
ls ${HOME}/projects/cromwell_tools/src/cromwell_tools/test/data/

10x_count.wdl                example_monitoring.log
10x_count_inputs.json        options.json
10x_count_inputs_1e4.json    secrets.json
10x_count_inputs_1e5.json    testing.wdl
10x_count_inputs_1e6.json    testing_example_inputs.json
10x_count_inputs_1e7.json    utilizations/
example_metadata.json


In [14]:
data_dir = os.path.expanduser('~/projects/cromwell_tools/src/cromwell_tools/test/data/')
wdl = data_dir + 'testing.wdl'
inputs = data_dir + 'testing_example_inputs.json'
options = data_dir + 'options.json'

We can print each of these files because they're very simple. The inputs file provides a single input, the amount of time to sleep in seconds. The options file provides the monitoring script and turns off call caching so that the submission triggers a fresh run. Finally, the WDL defines a task that takes the time input and sleeps for that long.

In [19]:
!cat $inputs

{
  "Sleep.time": 15
}


In [20]:
!cat $options

{
  "monitoring_script": "gs://broad-dsde-mint-dev-teststorage/10x/benchmark/scripts/monitor.sh",
  "read_from_cache":false,
  "write_to_cache":false
}
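
If you don't have a checkout of the test data, equivalent inputs and options files can be written from Python. This is a minimal sketch: the dictionary contents are copied from the files printed above, while `scratch_dir` and the output paths are hypothetical names chosen for illustration.

```python
import json
import os
import tempfile

# Values copied from the inputs and options files printed above.
inputs_dict = {"Sleep.time": 15}
options_dict = {
    "monitoring_script": "gs://broad-dsde-mint-dev-teststorage/10x/benchmark/scripts/monitor.sh",
    "read_from_cache": False,
    "write_to_cache": False,
}

# Hypothetical scratch directory; any writable location works.
scratch_dir = tempfile.mkdtemp()
inputs_path = os.path.join(scratch_dir, "testing_example_inputs.json")
options_path = os.path.join(scratch_dir, "options.json")

# Serialize each dictionary to its own JSON file.
for path, content in [(inputs_path, inputs_dict), (options_path, options_dict)]:
    with open(path, "w") as f:
        json.dump(content, f, indent=2)
```

Note that the input key is the workflow name followed by the input name (`Sleep.time`), matching the `workflow Sleep` block in the WDL.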

In [21]:
!cat $wdl


task SleepAWhile {
  Int time

  command {
    lsblk
    df -k
    sleep ${time}
    echo "something"
  }

  runtime {
    cpu: "1"
    docker: "ubuntu:zesty"
    memory: "1 GB"
    disks: "local-disk 10 HDD"
  }
}

workflow Sleep {
  Int time

  call SleepAWhile {
    input:
      time = time
  }
}


## Submit and explore the workflow

The Cromwell Tools package defines two main classes: `Cromwell` and `Workflow`. When a `Cromwell` object is created, it checks that it points to a valid, authenticated, active Cromwell instance. It exposes the REST API methods supported by Cromwell. In contrast, a `Workflow` instance represents a single workflow that the Cromwell server is aware of. It therefore has two constructors: one that submits a new workflow, and one that builds the object from an existing run. We will explore both below.

First, we'll use the secondary constructor to submit a new workflow. Later we'll query `Cromwell` and use the discovered run_id to demonstrate the primary constructor. 
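
The two-constructor pattern is commonly written in Python as an `__init__` plus a `classmethod`. The sketch below illustrates that general pattern only; `ExampleWorkflow` and `fake_submit` are hypothetical stand-ins and are not the `cromwell_tools` implementation.

```python
class ExampleWorkflow:
    """Hypothetical stand-in showing a primary and a secondary constructor."""

    def __init__(self, run_id):
        # Primary constructor: wrap a run that already exists on the server.
        self.run_id = run_id

    @classmethod
    def from_submission(cls, submit):
        # Secondary constructor: submit first, then wrap the returned id.
        # `submit` is any callable returning a dict with an 'id' key.
        response = submit()
        return cls(run_id=response["id"])


def fake_submit():
    # Stand-in for a real submission request; returns a canned response.
    return {"id": "cbd7daf8-0a75-4366-9a64-4d3c004ed458", "status": "Submitted"}


submitted = ExampleWorkflow.from_submission(fake_submit)
existing = ExampleWorkflow(run_id=submitted.run_id)
```

Both objects end up referring to the same run, which is the relationship between the two `Workflow` constructors explored in this notebook.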

One useful capability of Cromwell that this package exposes is the ability to add custom labels to runs. This will help us find the workflow we're submitting when we query for it later.

In [42]:
custom_labels = {'type': 'basicfunctionalitytest'}

In [83]:
# also accessible with ?cwt.Workflow.from_submission
print(cwt.Workflow.from_submission.__doc__)

Submit a new workflow, returning a Workflow object.

        :param str wdl: wdl that defines this workflow
        :param str inputs_json: inputs to this wdl
        :param Cromwell cromwell_server: an authenticated cromwell server

        :param str workflow_dependencies:
        :param dict custom_labels:
        :param str options_json: options file for the workflow
        :param bool wait: if True, wait until workflow recognizes as submitted (default: True)
        :param int timeout: maximum time to wait
        :param int delay: time between status queries
        :param bool verbose: if True, print request results
        :param args: additional positional args to pass to requests.post
        :param kwargs: additional keyword args to pass to request.post

        :return dict: Cromwell submission result
        


In [44]:
test_workflow = cwt.Workflow.from_submission(wdl, inputs, cromwell, custom_labels=custom_labels, options_json=options)

The Cromwell REST API exposes a number of useful endpoints that we can use to interact with a running workflow and evaluate its outcome. For any command to a `Cromwell` instance, specifying `verbose=True` will print the response in addition to storing the output, and specifying `open_browser=True` for any GET request will display the JSON response in your browser.

Below, we describe two ways to get the status of a workflow. In the latter case, we both open the browser window and print the request with `verbose`.  

In [39]:
# version 1
test_workflow.status

{'id': '827c9e8e-cabc-4e5c-9d46-5254cad52bf4', 'status': 'Submitted'}

In [45]:
# version 2
cromwell.status(test_workflow.run_id, open_browser=True, verbose=True)

GET Request: http://localhost:6361/api/workflows/v1/cbd7daf8-0a75-4366-9a64-4d3c004ed458/status
Response: 200
Response Content:
{
  "status": "Submitted",
  "id": "cbd7daf8-0a75-4366-9a64-4d3c004ed458"
}


<Response [200]>
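
The request URL printed above follows a regular pattern: the server URL, then `/api/workflows/v1/`, the run id, and the endpoint name. A small sketch of how such URLs could be assembled is shown here; `workflow_endpoint` is a hypothetical helper written for illustration, not a function from `cromwell_tools`.

```python
def workflow_endpoint(cromwell_url, run_id, endpoint, api_version="v1"):
    """Build the URL of a per-workflow endpoint such as 'status' or 'logs'.

    Hypothetical helper; the URL layout is inferred from the request
    printed by the verbose call above.
    """
    return "{}/api/workflows/{}/{}/{}".format(
        cromwell_url.rstrip("/"), api_version, run_id, endpoint
    )


status_url = workflow_endpoint(
    "http://localhost:6361", "cbd7daf8-0a75-4366-9a64-4d3c004ed458", "status"
)
print(status_url)
```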

We can also get a run's metadata, which we read in as a traversable dictionary:

In [62]:
test_workflow.metadata.keys()

dict_keys(['workflowName', 'submittedFiles', 'calls', 'outputs', 'workflowRoot', 'id', 'inputs', 'labels', 'submission', 'status', 'end', 'start'])

We can also print the whole dictionary:

In [63]:
test_workflow.metadata

{'calls': {'Sleep.SleepAWhile': [{'attempt': 1,
    'backend': 'JES',
    'backendLabels': {'cromwell-workflow-id': 'cromwell-cbd7daf8-0a75-4366-9a64-4d3c004ed458',
     'type': 'basicfunctionalitytest',
     'wdl-task-name': 'sleepawhile'},
    'backendLogs': {'log': 'gs://broad-dsde-mint-dev-cromwell-execution/Sleep/cbd7daf8-0a75-4366-9a64-4d3c004ed458/call-SleepAWhile/SleepAWhile.log'},
    'backendStatus': 'Success',
    'callCaching': {'allowResultReuse': False,
     'effectiveCallCachingMode': 'CallCachingOff'},
    'callRoot': 'gs://broad-dsde-mint-dev-cromwell-execution/Sleep/cbd7daf8-0a75-4366-9a64-4d3c004ed458/call-SleepAWhile',
    'dockerImageUsed': 'ubuntu@sha256:da2fd4e2e10e0ab991f251353a2d3e32d38c75a83a917dbca0a307efd8730f49',
    'end': '2017-10-06T13:22:44.658-07:00',
    'executionEvents': [{'description': 'start',
      'endTime': '2017-10-06T20:21:52.914155355Z',
      'startTime': '2017-10-06T20:21:52.914094914Z'},
     {'description': 'pulling-image',
      'endTi
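
Because the metadata is a plain nested dictionary, individual fields can be pulled out by key. The sketch below works on a small excerpt copied from the output above (`metadata_excerpt` is a hypothetical name) so that it runs without a server; with a live workflow you would presumably traverse `test_workflow.metadata` the same way.

```python
# A few fields copied from the metadata printed above.
metadata_excerpt = {
    "calls": {
        "Sleep.SleepAWhile": [
            {
                "attempt": 1,
                "backend": "JES",
                "backendStatus": "Success",
                "end": "2017-10-06T13:22:44.658-07:00",
            }
        ]
    }
}

# Each call name maps to a list of attempts; collect one row per attempt.
call_summary = [
    (call_name, attempt["attempt"], attempt["backend"], attempt["backendStatus"])
    for call_name, attempts in metadata_excerpt["calls"].items()
    for attempt in attempts
]

for row in call_summary:
    print(row)
```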

## Explore Workflow results and resource utilization

After the workflow completes, we can automatically parse information on the tasks that were run. In this case, we ran a monitoring script and can find out how much memory and disk space the task used. It isn't necessary to do this, but we'll first look at the raw output of the monitoring script.

In [51]:
# can take up to two minutes, considering overhead required to spin up the instance
test_workflow.wait_until_complete(timeout=120, delay=5)
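
`wait_until_complete` takes a `timeout` and a `delay`. A polling loop of that general shape might look like the following sketch. It is not the library's implementation: `wait_for_terminal_status` and `get_status` are hypothetical names, and the statuses are fed from a canned list so the example runs without a server.

```python
import time

# Workflow states after which Cromwell will not change the status again.
TERMINAL_STATUSES = {"Succeeded", "Failed", "Aborted"}


def wait_for_terminal_status(get_status, timeout=120, delay=5):
    """Call get_status() every `delay` seconds until a terminal status appears.

    Raises TimeoutError if no terminal status is seen within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        # Give up if the next poll would fall past the deadline.
        if time.monotonic() + delay >= deadline:
            raise TimeoutError("workflow still {!r} after {} s".format(status, timeout))
        time.sleep(delay)


# Canned statuses standing in for repeated status requests.
fake_statuses = iter(["Submitted", "Running", "Succeeded"])
final_status = wait_for_terminal_status(lambda: next(fake_statuses), timeout=1, delay=0)
```

With a live workflow, `get_status` could be something like `lambda: test_workflow.status['status']`, given the dictionary that `test_workflow.status` returned earlier.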

We can display the logs for the successful run. Unfortunately, Cromwell doesn't consider the output of our monitoring script a log, so it isn't listed here; to find it, we need the workflow root.

In [55]:
cromwell.logs(test_workflow.run_id, verbose=True)

GET Request: http://localhost:6361/api/workflows/v1/cbd7daf8-0a75-4366-9a64-4d3c004ed458/logs
Response: 200
Response Content:
{
  "calls": {
    "Sleep.SleepAWhile": [
      {
        "stdout": "gs://broad-dsde-mint-dev-cromwell-execution/Sleep/cbd7daf8-0a75-4366-9a64-4d3c004ed458/call-SleepAWhile/SleepAWhile-stdout.log",
        "shardIndex": -1,
        "stderr": "gs://broad-dsde-mint-dev-cromwell-execution/Sleep/cbd7daf8-0a75-4366-9a64-4d3c004ed458/call-SleepAWhile/SleepAWhile-stderr.log",
        "attempt": 1,
        "backendLogs": {
          "log": "gs://broad-dsde-mint-dev-cromwell-execution/Sleep/cbd7daf8-0a75-4366-9a64-4d3c004ed458/call-SleepAWhile/SleepAWhile.log"
        }
      }
    ]
  },
  "id": "cbd7daf8-0a75-4366-9a64-4d3c004ed458"
}


<Response [200]>

In [72]:
# our call was called SleepAWhile; we can get the file from google storage. 
log_filename = test_workflow.metadata['workflowRoot'] + 'call-SleepAWhile/monitoring.log'
bucket_name, blob_name = cwt.task.split_google_storage_path(log_filename)
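
For readers curious about what splitting a Google Storage path involves, here is a minimal sketch. `split_gs_path` is a hypothetical helper written for this example and may differ from `cwt.task.split_google_storage_path`; the example path is the call root from the metadata above with `monitoring.log` appended.

```python
def split_gs_path(path):
    """Split 'gs://bucket/some/blob' into ('bucket', 'some/blob')."""
    prefix = "gs://"
    if not path.startswith(prefix):
        raise ValueError("not a Google Storage path: {!r}".format(path))
    # Everything up to the first '/' is the bucket; the rest is the blob name.
    bucket, _, blob = path[len(prefix):].partition("/")
    return bucket, blob


example_path = (
    "gs://broad-dsde-mint-dev-cromwell-execution/Sleep/"
    "cbd7daf8-0a75-4366-9a64-4d3c004ed458/call-SleepAWhile/monitoring.log"
)
example_bucket, example_blob = split_gs_path(example_path)
print(example_bucket)
print(example_blob)
```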

In [75]:
bucket = client.bucket(bucket_name)
blob = bucket.blob(blob_name)

In [87]:
# here's what the monitoring log looks like
print(blob.download_as_string().decode())

--- General Information ---
#CPU: 1
Total Memory (MB): 1700
Total Disk space (KB): 10190136

--- Runtime Information ---
* Memory usage (%): 7.41%
* Memory usage (MB): 126
* Disk usage (%): 0.23%
* Disk usage (KB): 23044
* Memory usage (%): 7.41%
* Memory usage (MB): 126
* Disk usage (%): 0.23%
* Disk usage (KB): 23044
* Memory usage (%): 7.35%
* Memory usage (MB): 125
* Disk usage (%): 0.23%
* Disk usage (KB): 23044
* Memory usage (%): 7.41%
* Memory usage (MB): 126
* Disk usage (%): 0.23%
* Disk usage (KB): 23048



This information is automatically parsed by the `Task` object and stored in a `ResourceUtilization` object, which is created when you call `tasks()` on a workflow. 

In [95]:
for name, task in test_workflow.tasks().items():
    print(name)
    print(task.resource_utilization)

Sleep.SleepAWhile
SleepAWhile Monitoring Summary:
Max Memory Usage (MB): 126
Available Memory (MB): 1700
Max disk usage   (KB): 23048
Available disk   (KB): 10190136
Disk Utilized     (%): 0.002
Memory Utilized   (%): 0.074
Robust Estimate?     : True
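
The summary values can be reproduced by hand from the monitoring log. The sketch below parses the log text shown earlier with regular expressions; `summarize_monitoring_log` is a hypothetical function for illustration and is not how `Task` or `ResourceUtilization` are necessarily implemented.

```python
import re

# Monitoring log copied from the output printed earlier in this notebook.
monitoring_log = """--- General Information ---
#CPU: 1
Total Memory (MB): 1700
Total Disk space (KB): 10190136

--- Runtime Information ---
* Memory usage (%): 7.41%
* Memory usage (MB): 126
* Disk usage (%): 0.23%
* Disk usage (KB): 23044
* Memory usage (%): 7.41%
* Memory usage (MB): 126
* Disk usage (%): 0.23%
* Disk usage (KB): 23044
* Memory usage (%): 7.35%
* Memory usage (MB): 125
* Disk usage (%): 0.23%
* Disk usage (KB): 23044
* Memory usage (%): 7.41%
* Memory usage (MB): 126
* Disk usage (%): 0.23%
* Disk usage (KB): 23048
"""


def summarize_monitoring_log(text):
    """Return peak and available memory/disk parsed from a monitoring log."""

    def values(pattern):
        # Collect every integer captured by the pattern.
        return [int(v) for v in re.findall(pattern, text)]

    total_memory = values(r"Total Memory \(MB\): (\d+)")[0]
    total_disk = values(r"Total Disk space \(KB\): (\d+)")[0]
    max_memory = max(values(r"\* Memory usage \(MB\): (\d+)"))
    max_disk = max(values(r"\* Disk usage \(KB\): (\d+)"))
    return {
        "max_memory_mb": max_memory,
        "available_memory_mb": total_memory,
        "max_disk_kb": max_disk,
        "available_disk_kb": total_disk,
        "memory_fraction": max_memory / total_memory,
        "disk_fraction": max_disk / total_disk,
    }


summary = summarize_monitoring_log(monitoring_log)
for key, value in summary.items():
    print(key, value)
```

Rounded to three decimals, the two fractions agree with the `Memory Utilized` and `Disk Utilized` lines in the summary, which suggests those figures are fractions of the available resource rather than percentages.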



This information can be saved to a file for later analysis:

In [None]:
resource_utilization_filename = 'test_resource_utilization.txt'
test_workflow.save_resource_utilization(resource_utilization_filename)

## Interact with previously completed workflows

As mentioned earlier, there are two `Workflow` constructors. Let's use some of the other Cromwell functionality to show how the second one works. First, let's find our workflow using Cromwell's query syntax.

In [108]:
cromwell.query(status=['Succeeded'], names=['Sleep'], verbose=True)

GET Request: http://localhost:6361/api/workflows/v1/query?name=Sleep&status=Succeeded
Response: 200
Response Content:
{
  "results": [
    {
      "name": "Sleep",
      "id": "da9dbc0a-1361-41d3-9ba3-7d98650e554b",
      "status": "Succeeded",
      "end": "2017-10-05T08:24:04.304-07:00",
      "start": "2017-10-05T08:22:31.431-07:00"
    },
    {
      "name": "Sleep",
      "id": "be986809-4fa0-44ff-ac63-bf8a13e33c1c",
      "status": "Succeeded",
      "end": "2017-10-05T08:25:24.072-07:00",
      "start": "2017-10-05T08:23:11.461-07:00"
    },
    {
      "name": "Sleep",
      "id": "22933ae9-08a1-47b2-babf-188aee37d4b6",
      "status": "Succeeded",
      "end": "2017-10-05T08:25:35.174-07:00",
      "start": "2017-10-05T08:24:11.522-07:00"
    },
    {
      "name": "Sleep",
      "id": "adc9a556-74dd-41f8-bc24-69e54bd161e3",
      "status": "Succeeded",
      "end": "2017-10-05T08:26:52.304-07:00",
      "start": "2017-10-05T08:24:31.542-07:00"
    },
    {
      "name": "Slee

<Response [200]>

Several runs are listed here; the last one, which was run today, is the one we're looking for.

Below we use the other constructor to create a `Workflow` from a `run_id`.

In [114]:
run_id = cromwell.query(status=['Succeeded'], names=['Sleep']).json()['results'][-1]['id']
duplicate_workflow = cwt.Workflow(run_id=run_id, cromwell_server=cromwell, storage_client=client)
duplicate_workflow.status  # same as above. 

{'id': 'cbd7daf8-0a75-4366-9a64-4d3c004ed458', 'status': 'Succeeded'}
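
Taking the last element of `results` assumes the server returns runs in chronological order. If you would rather not rely on that, one option is to pick the run with the latest `start` timestamp explicitly. The sketch below uses the first four results from the query output above; `most_recent` is a hypothetical helper, and `datetime.fromisoformat` requires Python 3.7 or later.

```python
from datetime import datetime

# First four results copied from the query output above.
results = [
    {"name": "Sleep", "id": "da9dbc0a-1361-41d3-9ba3-7d98650e554b",
     "status": "Succeeded",
     "end": "2017-10-05T08:24:04.304-07:00",
     "start": "2017-10-05T08:22:31.431-07:00"},
    {"name": "Sleep", "id": "be986809-4fa0-44ff-ac63-bf8a13e33c1c",
     "status": "Succeeded",
     "end": "2017-10-05T08:25:24.072-07:00",
     "start": "2017-10-05T08:23:11.461-07:00"},
    {"name": "Sleep", "id": "22933ae9-08a1-47b2-babf-188aee37d4b6",
     "status": "Succeeded",
     "end": "2017-10-05T08:25:35.174-07:00",
     "start": "2017-10-05T08:24:11.522-07:00"},
    {"name": "Sleep", "id": "adc9a556-74dd-41f8-bc24-69e54bd161e3",
     "status": "Succeeded",
     "end": "2017-10-05T08:26:52.304-07:00",
     "start": "2017-10-05T08:24:31.542-07:00"},
]


def most_recent(results, key="start"):
    """Return the result whose timestamp under `key` is the latest."""
    return max(results, key=lambda r: datetime.fromisoformat(r[key]))


latest = most_recent(results)
print(latest["id"])
```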

We can also look at the timing diagram, which is not very interesting for this single-task workflow (it will open in another window).

In [116]:
duplicate_workflow.timing()

## Other miscellaneous functionality

Display cromwell backends:

In [120]:
cromwell.backends(verbose=True)

GET Request: http://localhost:6361/api/workflows/v1/backends
Response: 200
Response Content:
{
  "supportedBackends": [
    "JES",
    "Local",
    "SGE"
  ],
  "defaultBackend": "JES"
}


<Response [200]>

Display run outputs (note: our task doesn't have any!)

In [122]:
cromwell.outputs(test_workflow.run_id, verbose=True)

GET Request: http://localhost:6361/api/workflows/v1/cbd7daf8-0a75-4366-9a64-4d3c004ed458/outputs
Response: 200
Response Content:
{
  "outputs": {},
  "id": "cbd7daf8-0a75-4366-9a64-4d3c004ed458"
}


<Response [200]>

Abort a workflow (this will fail, since our workflow is already complete!)

In [123]:
test_workflow.abort()

{'message': "Couldn't abort cbd7daf8-0a75-4366-9a64-4d3c004ed458 because no workflow with that ID is in progress",
 'status': 'error'}

Finally, open the swagger API for your instance:

In [124]:
cromwell.swagger()