## Running the VUS cooccurrence workflow via WES on a Seven Bridges server
This notebook does nothing more complicated than run the README example from the BRCA Challenge federated_analysis project at https://github.com/BRCAChallenge/federated-analysis.

The files used in the example were uploaded to the Seven Bridges Cancer Genomics Cloud server (part of the CRDC Driver project).

A container was created from the Docker image provided, and the federated_analysis project was cloned into it. This container was added to the Docker repository on the Seven Bridges CGC server.



In [43]:
from fasp.workflow import sbcgcWESClient

cl = sbcgcWESClient('forei/fasp-vus', debug=True)

## Setting up the WES run

In [52]:
params = {
    "project": "forei/fasp-vus",
    "inputs": {
        "gnomad_file": {
            "path": "drs://cgc-ga4gh-api.sbgenomics.com/616c15aa5dc80361090d275c",
            "name": "gnomad_chr13_brca2.vcf",
            "class": "File"
        },
        "pathogenicity_file": {
            "path": "drs://cgc-ga4gh-api.sbgenomics.com/616c15ac5dc80361090d2764",
            "name": "clinvar_brca2.tsv",
            "class": "File"
        },
        "pathology_file": {
            "path": "drs://cgc-ga4gh-api.sbgenomics.com/616c15ac58aa9505ff6f91a4",
            "name": "brca2-pathology.tsv",
            "class": "File"
        },

        "vcf_file": {
            "path": "drs://cgc-ga4gh-api.sbgenomics.com/616c15ad5dc80361090d276c",
            "name": "brca2.vcf",
            "class": "File"
        }
    }
}
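The four inputs all have the same shape, so the `inputs` block could also be generated from a small table of ids and file names. The following is only a sketch: `build_inputs` is a hypothetical helper, not part of `fasp`, and the default host is simply copied from the DRS URIs above.

```python
def build_inputs(files, host="cgc-ga4gh-api.sbgenomics.com"):
    """Build the 'inputs' mapping from {input_name: (drs_id, file_name)}.

    Hypothetical helper for illustration; it is not part of the fasp package.
    """
    return {
        input_name: {
            "path": f"drs://{host}/{drs_id}",
            "name": file_name,
            "class": "File",
        }
        for input_name, (drs_id, file_name) in files.items()
    }


# Two of the inputs from the cell above, rebuilt with the helper
example_inputs = build_inputs({
    "gnomad_file": ("616c15aa5dc80361090d275c", "gnomad_chr13_brca2.vcf"),
    "vcf_file": ("616c15ad5dc80361090d276c", "brca2.vcf"),
})
```

The result could then be used as `params["inputs"]`.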

With the request body assembled, it can be serialized to JSON and passed to the client's `runGenericWorkflow` function as follows.

In [53]:
import json
run_id = cl.runGenericWorkflow(
    workflow_url="sbg://forei/fasp-vus/cooccurence2/4",
    workflow_params=json.dumps(params),
    workflow_type="CWL",
    workflow_type_version="sbg:draft-2",
    verbose=False
)
run_id

'a4690a05-83cf-437b-9746-5cc3e90a7e7b'

In [118]:
cl.getTaskStatus(run_id)

'COMPLETE'
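The status was checked by hand above. A run could instead be polled until it finishes. The sketch below is one possible way to do that and makes some assumptions: `wait_for_run` is a hypothetical helper, and only `'COMPLETE'` is taken from this notebook; the other terminal state names follow the GA4GH WES state list and may not match what this server returns.

```python
import time


def wait_for_run(get_status, run_id,
                 # 'COMPLETE' is seen in this notebook; the rest are assumed
                 terminal_states=("COMPLETE", "EXECUTOR_ERROR",
                                  "SYSTEM_ERROR", "CANCELED"),
                 poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call get_status(run_id) until a terminal state is returned.

    Raises TimeoutError if no terminal state is seen within max_polls calls.
    """
    status = None
    for _ in range(max_polls):
        status = get_status(run_id)
        if status in terminal_states:
            return status
        sleep(poll_seconds)
    raise TimeoutError(
        f"run {run_id} still in state {status!r} after {max_polls} polls"
    )
```

With the client above this would be called as `wait_for_run(cl.getTaskStatus, run_id)`.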

## Getting the results - via DRS
Once the run is complete, further steps can use DRS to obtain the file output from the workflow.

In [2]:
run_id = 'd16a014f-6723-4013-8b6d-e5731c1205e4'
runLog = cl.GetRunLog(run_id)
runLog['outputs']

{'vpi_file': {'path': 'drs://cgc-ga4gh-api.sbgenomics.com/61703ab5ae5e936aba7228e6',
  'name': '_6_BRCA2-vpi.json',
  'class': 'File'},
 'pathology_output': None,
 'all_file': {'path': 'drs://cgc-ga4gh-api.sbgenomics.com/61703ab5ae5e936aba7228ec',
  'name': '_6_BRCA2-all.json',
  'class': 'File'},
 'out_file': {'path': 'drs://cgc-ga4gh-api.sbgenomics.com/61703ab5ae5e936aba7228e4',
  'name': '_6_BRCA2-cooccurrences.json',
  'class': 'File'},
 'ipv_file': {'path': 'drs://cgc-ga4gh-api.sbgenomics.com/61703ab5ae5e936aba7228e5',
  'name': '_6_BRCA2-ipv.json',
  'class': 'File'},
 'tout_file': {'path': 'drs://cgc-ga4gh-api.sbgenomics.com/61703ab5ae5e936aba7228ea',
  'name': '_6_BRCA2-tout.json',
  'class': 'File'}}
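Note that an output can be `None` (here `pathology_output`), so code that walks the outputs needs to skip such entries. A minimal sketch, assuming the output structure shown above; `output_drs_uris` is a hypothetical helper, not a `fasp` function.

```python
def output_drs_uris(outputs):
    """Return {output_name: DRS URI} for every output that produced a file."""
    return {
        name: entry["path"]
        for name, entry in outputs.items()
        if entry is not None  # e.g. 'pathology_output' is None in this run
    }


# A cut-down copy of the outputs shown above
sample_outputs = {
    "out_file": {
        "path": "drs://cgc-ga4gh-api.sbgenomics.com/61703ab5ae5e936aba7228e4",
        "name": "_6_BRCA2-cooccurrences.json",
        "class": "File",
    },
    "pathology_output": None,
}
uris = output_drs_uris(sample_outputs)
```

In the notebook this would be called as `output_drs_uris(runLog['outputs'])`.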

In [7]:
resultsDRSID = runLog['outputs']['out_file']['path']
resultsDRSID

Use the CGC DRS server to retrieve the result files.

In [28]:
from fasp.loc import sbcgcDRSClient
drsClient = sbcgcDRSClient('~/.keys/sevenbridges_keys.json', 's3')

### DRS GetObject
Here's how we then get details of the file. Note that only the id portion of the DRS URI is passed. Normally it would be the job of a metaresolver to look at the URI and determine which DRS server to send the id to. Here we pass up the opportunity to use a metaresolver and extract the id manually.

In [9]:
sbDRSID = resultsDRSID.split('/')[-1]
fileDetails = drsClient.getObject(sbDRSID)
fileDetails

{'id': '61703ab5ae5e936aba7228e4',
 'name': '_6_BRCA2-cooccurrences.json',
 'size': 950,
 'checksums': [{'type': 'etag',
   'checksum': '70b9acfd3fe40187a8b5689f97003313-1'}],
 'self_uri': 'drs://cgc-ga4gh-api.sbgenomics.com/61703ab5ae5e936aba7228e4',
 'created_time': '2021-10-20T15:50:13Z',
 'updated_time': '2021-10-20T15:50:13Z',
 'mime_type': 'application/json',
 'access_methods': [{'type': 's3',
   'region': 'us-east-1',
   'access_id': 'aws-us-east-1'}]}
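The first step a metaresolver would take is to separate the host from the object id. The sketch below shows that step only, using the standard library; `split_drs_uri` is a hypothetical helper and it handles only hostname-based URIs of the form `drs://host/id` seen in this notebook, not compact identifier forms.

```python
from urllib.parse import urlparse


def split_drs_uri(drs_uri):
    """Split a hostname-based DRS URI into (host, object_id)."""
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs" or not parsed.netloc:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    # the path starts with '/', which is not part of the object id
    return parsed.netloc, parsed.path.lstrip("/")


host, object_id = split_drs_uri(
    "drs://cgc-ga4gh-api.sbgenomics.com/61703ab5ae5e936aba7228e4"
)
```

The `host` value is what a resolver would use to pick a DRS client, and `object_id` is the value passed to `getObject` above.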

In [10]:
url = drsClient.getAccessURL(sbDRSID,'s3')

### Downloading the file
Now we can use the url obtained to download the file. We'll create a small function to encapsulate the download.

In [31]:
import requests
import os
def download(url, file_path):
    # fetch first, so a failed request does not leave an empty file behind
    response = requests.get(url)
    response.raise_for_status()
    with open(os.path.expanduser(file_path), "wb") as file:
        file.write(response.content)

In [38]:
fullPath = fileDetails['name']
download(url, fullPath)
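The DRS object details include a `size`, which gives a cheap sanity check on the download. A minimal sketch; `check_size` is a hypothetical helper and compares byte counts only, it does not verify the checksum.

```python
import os


def check_size(file_path, expected_size):
    """Raise if the file on disk does not have the size reported by DRS."""
    actual_size = os.path.getsize(os.path.expanduser(file_path))
    if actual_size != expected_size:
        raise IOError(
            f"{file_path}: expected {expected_size} bytes, found {actual_size}"
        )
    return actual_size
```

In this notebook it would be called as `check_size(fullPath, fileDetails['size'])`.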


In [40]:
with open(fullPath) as json_file:
    data = json.load(json_file)
# delete the local copy of the file
os.remove(fullPath)
# and look at the contents
data

{'cooccurring vus': {"(13, 32355250, 'T', 'C')": {'likelihood data': {'p1': 0.375,
    'p2': 0.001,
    'n': 2,
    'k': 1,
    'likelihood': 0.0042624},
   'allele frequencies': {'maxPop': 'eas',
    'maxPopFreq': '0.977087',
    'cohortFreq': 0.5},
   'pathogenic variants': [[13, 32316508, 'GAC', 'G']]},
  "(13, 32353519, 'A', 'G')": {'likelihood data': {'p1': 0.375,
    'p2': 0.001,
    'n': 1,
    'k': 1,
    'likelihood': 0.0026666666666666666},
   'allele frequencies': {'maxPop': 'afr',
    'maxPopFreq': '0.00385267',
    'cohortFreq': 0.25},
   'pathogenic variants': [[13, 32338749, 'AATTAC', 'A']]},
  "(13, 32353470, 'A', 'C')": {'likelihood data': {'p1': 0.375,
    'p2': 0.001,
    'n': 1,
    'k': 1,
    'likelihood': 0.0026666666666666666},
   'allele frequencies': {'maxPop': 'eas',
    'maxPopFreq': '0.383654',
    'cohortFreq': 0.25},
   'pathogenic variants': [[13, 32340836, 'GACAA', 'G']]}},
 'homozygous vus': {"(13, 32355250, 'T', 'C')": {'count': 1,
   'maxPop': 'eas',
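The `likelihood` values shown above are numerically consistent with a ratio of two binomial likelihoods, one using `p2` and one using `p1`, for `k` events in `n` trials. This is an inference from the three values in the output, not a statement of how the federated_analysis code computes them. The sketch below simply reproduces the numbers under that assumption.

```python
def likelihood_ratio(p1, p2, n, k):
    """Binomial likelihood under p2 divided by that under p1.

    Assumed form, inferred from the output values; the binomial
    coefficients cancel, so they are omitted.
    """
    numerator = p2 ** k * (1 - p2) ** (n - k)
    denominator = p1 ** k * (1 - p1) ** (n - k)
    return numerator / denominator


ratio_n2 = likelihood_ratio(0.375, 0.001, n=2, k=1)  # compare with 0.0042624
ratio_n1 = likelihood_ratio(0.375, 0.001, n=1, k=1)  # compare with 0.002666...
```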

In [37]:
# flatten cooccurrence output
flat_vus = []
for k, v in data['cooccurring vus'].items():
    pathogenic_count = len(v['pathogenic variants'])
    # merge the nested dicts into one flat record using dict unpacking (**)
    z = {"vus": k, **v['likelihood data'], **v['allele frequencies'], "no_pathogenic_coocurrs": pathogenic_count}
    flat_vus.append(z)

# turn the list of dicts into a data frame
import pandas as pd
flat_df = pd.DataFrame(flat_vus)
flat_df

| | vus | p1 | p2 | n | k | likelihood | maxPop | maxPopFreq | cohortFreq | no_pathogenic_coocurrs |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (13, 32355250, 'T', 'C') | 0.375 | 0.001 | 2 | 1 | 0.004262 | eas | 0.977087 | 0.5 | 1 |
| 1 | (13, 32353519, 'A', 'G') | 0.375 | 0.001 | 1 | 1 | 0.002667 | afr | 0.00385267 | 0.25 | 1 |
| 2 | (13, 32353470, 'A', 'C') | 0.375 | 0.001 | 1 | 1 | 0.002667 | eas | 0.383654 | 0.25 | 1 |
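Each variant in the `vus` column is a string that looks like a Python tuple, which is awkward to filter or sort on. One way to recover the parts is `ast.literal_eval`. The sketch below assumes every key has the four-part form seen in this output; `parse_variant_key` is a hypothetical helper.

```python
import ast


def parse_variant_key(key):
    """Turn "(13, 32355250, 'T', 'C')" into a (chrom, pos, ref, alt) tuple."""
    # literal_eval only evaluates Python literals, so it is safe on file content
    chrom, pos, ref, alt = ast.literal_eval(key)
    return chrom, pos, ref, alt


variant = parse_variant_key("(13, 32355250, 'T', 'C')")
```

Applied to the data frame this would be `flat_df['vus'].map(parse_variant_key)`. Note also that `maxPopFreq` arrives as a string in the JSON, so it would need converting (for example with `pd.to_numeric`) before any numeric comparison.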


## To do
- Submit the pathogenicity file from the local system
- Either access the gnomAD file from gnomAD, or supply it from the local system

## Done
- Make the container available to other WES servers by adding the Docker container to Docker Hub instead of the Seven Bridges docker repository
