## Running a Workflow on a Seven Bridges WES server
I'm setting out to use the Seven Bridges WES client to run samtools stats on a CRAM file. The instructions at https://docs.cancergenomicscloud.org/docs/run-a-workflow are the starting point for how to do this.


In [1]:
from fasp.workflow import sbWESClient
cl = sbWESClient('cgc','forei/gecco','~/.keys/sbcgc_key.json')

The above instantiates a client for the Seven Bridges Cancer Genomics Cloud (CGC).

### Checking a previous run
For reference, we'll first use the client to get the details of a task that was run from the CGC user interface.
The getTaskStatus function below is simply a wrapper around https://cgc-ga4gh-api.sbgenomics.com/ga4gh/wes/v1/runs/{run_id} that handles authentication, sending the request and retrieving the response. The response gives some clues about how to fill out a request to submit the same task via WES instead of the UI.
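
As a rough illustration of what such a wrapper does, here is a minimal sketch using only the standard library. The function names are hypothetical, and the `X-SBG-Auth-Token` header is an assumption about how the token is passed; the actual fasp client may differ.

```python
import json
import urllib.request

# Base URL of the CGC WES endpoint, as given in the text above
WES_BASE = "https://cgc-ga4gh-api.sbgenomics.com/ga4gh/wes/v1"


def run_url(run_id):
    """Build the URL that returns the log for a single WES run."""
    return f"{WES_BASE}/runs/{run_id}"


def get_run(run_id, token):
    """Fetch a run log and return it as a dict (hypothetical helper)."""
    # Assumption: the token is sent in the X-SBG-Auth-Token header
    request = urllib.request.Request(
        run_url(run_id), headers={"X-SBG-Auth-Token": token}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# The run state would then be get_run(run_id, token)["state"]
```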

It's worth noting that, although DRS was not used at all to create the task within the UI, the file paths in the WES response use DRS notation.

In [2]:
cl.getTaskStatus('0a528553-1292-493c-8db6-db1c3ce7831b', verbose=True)

Get request sent to: https://cgc-ga4gh-api.sbgenomics.com/ga4gh/wes/v1/runs/0a528553-1292-493c-8db6-db1c3ce7831b
{
  "request": {
    "tags": {},
    "workflow_params": {
      "name": "SAMtools Stats 1.8 run - 01-09-21 17:44:31",
      "project": "forei/gecco",
      "inputs": {
        "total_memory_GB": null,
        "coverage_limit": null,
        "include_only_read_group": null,
        "remove_duplicates": null,
        "max_insert_size": null,
        "reference_file": {
          "path": "drs://cgc-ga4gh-api.sbgenomics.com/5bad6c83e4b0abc138917143",
          "name": "references-hs37d5-hs37d5.fasta",
          "class": "File"
        },
        "alignment_input_file": {
          "path": "drs://cgc-ga4gh-api.sbgenomics.com/5ba9223ee4b0abc138883360",
          "name": "117438.recal.cram",
          "class": "File"
        }
      }
    },
    "workflow_type": "CWL",
    "workflow_engine_params": {}
  },
  "state": "COMPLETE",
  "outputs": {
    "statistics": {
      "path": "drs

'COMPLETE'

Looking at that response gives some clues about how to edit the example provided in the documentation.

How the task looks in the UI is also helpful.
![alt text](SAMToolsTask.png "samtools task as shown in SevenBridges CGC UI")

## Running the same compute via the WES API
When filling out the body of a WES request to run the same thing, the project information is easy to work out. The inputs also seem straightforward. Even though it's not present in the status above, workflow_url is clearly the URI for the samtools stats app in my gecco project. The only tricky field was workflow_type_version. The log for the task run via the UI gives a clue for that: job.json contains "cwlVersion" : "sbg:draft-2".

With all that we come up with the following body for the request.

In [3]:
body = {
  "workflow_params": {
    "project": "forei/gecco",
    "inputs": {
      "alignment_input_file":
        {
          "path": "drs://cgc-ga4gh-api.sbgenomics.com/5ba9223ee4b0abc138883360",
          "name": "117438.recal.cram",
          "class": "File"
        },
      "reference_file": {
          "path": "drs://cgc-ga4gh-api.sbgenomics.com/5bad6c83e4b0abc138917143",
          "name": "references-hs37d5-hs37d5.fasta",
        "class": "File"
      }
    }
  },
  "workflow_type": "CWL",
  "workflow_type_version": "sbg:draft-2",
  "workflow_url": "sbg://forei/gecco/samtools-stats-1-8/10"
}
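
Since the two file inputs share the same shape, the body can also be assembled with small helpers. This is only a sketch: `drs_file` and `build_body` are hypothetical names, not part of the fasp client.

```python
def drs_file(drs_uri, name):
    """Describe a CWL File input located by a DRS URI."""
    return {"path": drs_uri, "name": name, "class": "File"}


def build_body(project, workflow_url, inputs, type_version="sbg:draft-2"):
    """Assemble a WES run request body as a plain dict."""
    return {
        "workflow_params": {"project": project, "inputs": inputs},
        "workflow_type": "CWL",
        "workflow_type_version": type_version,
        "workflow_url": workflow_url,
    }


example_body = build_body(
    "forei/gecco",
    "sbg://forei/gecco/samtools-stats-1-8/10",
    {
        "alignment_input_file": drs_file(
            "drs://cgc-ga4gh-api.sbgenomics.com/5ba9223ee4b0abc138883360",
            "117438.recal.cram",
        ),
        "reference_file": drs_file(
            "drs://cgc-ga4gh-api.sbgenomics.com/5bad6c83e4b0abc138917143",
            "references-hs37d5-hs37d5.fasta",
        ),
    },
)
```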

## Calling WES from Python
The WES request has to be passed to the WES server as multipart/form-data. The following approach proved necessary to structure the request body so that it works with the way the Python requests module constructs multipart forms. The way the requests module does so is quite compact but somewhat obscure. It might be better to deal with that in the WES client. For now we'll do it outside the client to maintain transparency and for the purpose of learning how to do this.

In [4]:
import json

params = {
    "project": "forei/gecco",
    "inputs": {
      "alignment_input_file":
        {
          "path": "drs://cgc-ga4gh-api.sbgenomics.com/5ba9223ee4b0abc138883360",
          "name": "117438.recal.cram",
          "class": "File"
        },
      "reference_file": {
          "path": "drs://cgc-ga4gh-api.sbgenomics.com/5bad6c83e4b0abc138917143",
          "name": "references-hs37d5-hs37d5.fasta",
        "class": "File"
      }
    }
  }
body = {
  "workflow_params": (None, json.dumps(params), 'application/json'),
  "workflow_type": "CWL",
  "workflow_type_version": "sbg:draft-2",
  "workflow_url": "sbg://forei/gecco/samtools-stats-1-8/10"
}
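
To see what the tuple form does, the requests module can prepare a request without sending it, so the encoded form can be inspected. This is a sketch under assumptions: the URL is a placeholder, every field is written as a `(filename, content[, content_type])` tuple with no filename, and the fields are passed via `files=`. Whether the WES client sends the body this way has not been checked here.

```python
import json
import requests

# Inputs left empty for brevity; the real request carries the file inputs
example_params = {"project": "forei/gecco", "inputs": {}}

fields = {
    # (filename, content, content_type): None means "no filename"
    "workflow_params": (None, json.dumps(example_params), "application/json"),
    "workflow_type": (None, "CWL"),
    "workflow_type_version": (None, "sbg:draft-2"),
    "workflow_url": (None, "sbg://forei/gecco/samtools-stats-1-8/10"),
}

# Placeholder URL; prepare() encodes the request but sends nothing
prepared = requests.Request(
    "POST", "https://example.org/ga4gh/wes/v1/runs", files=fields
).prepare()

print(prepared.headers["Content-Type"])  # multipart/form-data plus a boundary
print(prepared.body.decode())            # one part per form field
```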

Now that the body is in a form that can be passed as multipart/form-data, we can run it.

In [5]:
run_id = cl.runGenericWorkflow(body)

In [9]:
cl.getTaskStatus(run_id)

'COMPLETE'
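
Rather than re-running the status cell by hand, a small polling loop can wait for the run to finish. This is a minimal sketch: `wait_for_run` is a hypothetical helper, it assumes the status function returns the state as a string (as getTaskStatus appears to above), and the terminal states listed are the ones I understand the WES specification to define.

```python
import time

# WES states after which a run will not change any further
TERMINAL_STATES = {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}


def wait_for_run(get_status, run_id, interval=30, timeout=3600, sleep=time.sleep):
    """Poll get_status(run_id) until a terminal state or the timeout is reached."""
    waited = 0
    while True:
        state = get_status(run_id)
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout:
            raise TimeoutError(f"run {run_id} still {state} after {waited} seconds")
        sleep(interval)
        waited += interval


# Possible usage with the client above: wait_for_run(cl.getTaskStatus, run_id)
```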

## Getting the results - via DRS
Once the run is complete, further steps can use DRS to obtain the file output from the workflow.

In [10]:
runLog = cl.GetRunLog(run_id)
runLog['outputs']

{'statistics': {'path': 'drs://cgc-ga4gh-api.sbgenomics.com/60034a99e4b09cae722a78d0',
  'name': '_11_117438.recal.cram.stats.txt',
  'class': 'File'}}

In [11]:
resultsDRSID = runLog['outputs']['statistics']['path']
resultsDRSID

'drs://cgc-ga4gh-api.sbgenomics.com/60034a99e4b09cae722a78d0'

We'll pass over the question of how one would determine which DRS server that URI needs to be sent to because
* In this case it's fairly obvious - it's the CGC DRS Server
* We want to get something up and working
* There are other things we should consider when dealing with metaresolvers

Add to to-do list: a notebook on Metaresolvers

In [12]:
from fasp.loc import sbcgcDRSClient
drsClient = sbcgcDRSClient('~/.keys/sevenbridges_keys.json', 's3')

### DRS GetObject
Here's how we then get details of the file. Note that only the id portion of the DRS URI is passed here. It is the job of a metaresolver to look at the URI and determine where to send the id. As noted, we are passing up the opportunity to use a metaresolver and entering the id manually.
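
Instead of copying the id by hand, the host and id can be split out of a hostname-based DRS URI with the standard library. A small sketch follows; `split_drs_uri` is a hypothetical helper and does not attempt to handle compact identifiers.

```python
from urllib.parse import urlparse


def split_drs_uri(uri):
    """Split a hostname-based DRS URI into (host, object_id)."""
    parsed = urlparse(uri)
    object_id = parsed.path.lstrip("/")
    if parsed.scheme != "drs" or not parsed.netloc or not object_id:
        raise ValueError(f"not a hostname-based DRS URI: {uri}")
    return parsed.netloc, object_id


host, object_id = split_drs_uri(
    "drs://cgc-ga4gh-api.sbgenomics.com/60034a99e4b09cae722a78d0"
)
print(host)       # cgc-ga4gh-api.sbgenomics.com
print(object_id)  # 60034a99e4b09cae722a78d0
```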

In [13]:
fileDetails = drsClient.getObject('5ffe65dee4b0eeecd99a2b39')
fileDetails

{'id': '5ffe65dee4b0eeecd99a2b39',
 'name': '_3_117438.recal.cram.stats.txt',
 'size': 111394,
 'checksums': [{'type': 'etag',
   'checksum': '347d17ba60392492bff1689cae4355b5-1'}],
 'self_uri': 'drs://cgc-ga4gh-api.sbgenomics.com/5ffe65dee4b0eeecd99a2b39',
 'created_time': '2021-01-13T03:15:42Z',
 'updated_time': '2021-01-13T03:15:42Z',
 'mime_type': 'application/json',
 'access_methods': [{'type': 's3',
   'region': 'us-east-1',
   'access_id': 'aws-us-east-1'}]}

In [14]:
url = drsClient.getAccessURL('5ffe65dee4b0eeecd99a2b39','s3')

### Downloading the file
Now we can use the url obtained to download the file. We'll create a small function to encapsulate the download.

In [15]:
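import os
import requests

def download(url, file_path):
    # Fetch first so that a failed request does not leave an empty file behind
    response = requests.get(url)
    response.raise_for_status()
    with open(os.path.expanduser(file_path), "wb") as file:
        file.write(response.content)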

In [16]:
fullPath = '~/Downloads/' + fileDetails['name']
download(url, fullPath)
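
A quick sanity check on the download is to compare the file size on disk with the `size` reported by DRS. This is only a sketch with a hypothetical helper name. It checks the size rather than the checksum because the `etag` value above ends in `-1`, which suggests an S3 multipart ETag, and those are not a plain MD5 of the file contents.

```python
import os


def size_matches(file_path, drs_object):
    """Return True if the file on disk has the size reported in the DRS object."""
    actual = os.path.getsize(os.path.expanduser(file_path))
    return actual == drs_object["size"]


# Possible usage with the names above: size_matches(fullPath, fileDetails)
```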

## Why we need a Metaresolver
To show why we need a metaresolver to use DRS URIs, here's what happens when we ask the DRS server to resolve the full DRS URI.

In [None]:
drsClient.getObject('drs://cgc-ga4gh-api.sbgenomics.com/5ffe65dee4b0eeecd99a2b39')

That we get an error (404) might seem weird or obtuse behavior for at least a couple of reasons:
* The DRS server clearly knows that this is the URI for that file. It tells us so in the self_uri attribute.
* The WES server from the same organization was quite happy with the full URI.

However, this behavior is correct according to the spec (to be double-checked). A DRS server resolves only the identifiers that are local to it.

A metaresolver would be needed for resolving compact URIs too.
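
To make the idea concrete, here is a toy metaresolver: a registry that maps a DRS hostname to a function that fetches an object by its local id. Everything in it is hypothetical and for illustration only. It handles hostname-based URIs and leaves compact identifiers out entirely.

```python
from urllib.parse import urlparse

# Registry: DRS hostname -> function taking a local object id
DRS_RESOLVERS = {}


def register(host, get_object):
    """Associate a DRS hostname with a function that fetches objects by id."""
    DRS_RESOLVERS[host] = get_object


def resolve(uri):
    """Send the id part of a hostname-based DRS URI to the matching server."""
    parsed = urlparse(uri)
    host, object_id = parsed.netloc, parsed.path.lstrip("/")
    if host not in DRS_RESOLVERS:
        raise LookupError(f"no DRS client registered for {host}")
    return DRS_RESOLVERS[host](object_id)


# Possible usage with the client above:
# register("cgc-ga4gh-api.sbgenomics.com", drsClient.getObject)
# resolve("drs://cgc-ga4gh-api.sbgenomics.com/5ffe65dee4b0eeecd99a2b39")
```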