## Running the VUS cooccurrence workflow via WES on Cavatica
This notebook does nothing more complicated than run the README example from the BRCA Challenge federated-analysis project at https://github.com/BRCAChallenge/federated-analysis.

The input files used in the example were uploaded to Cavatica beforehand; each is identified below by a DRS ID and a file name.

In [1]:
input_files = {
    "gnomad_file": {
        "drs_id": "616c15aa5dc80361090d275c",
        "name": "gnomad_chr13_brca2.vcf"
    },
    "pathogenicity_file": {
        "drs_id": "616c15ac5dc80361090d2764",
        "name": "clinvar_brca2.tsv"
    },
    "vcf_file": {
        "drs_id": "616c15ad5dc80361090d276c",
        "name": "brca2.vcf"
    }
}


In [2]:
from fasp.workflow import cavaticaWESClient

cl = cavaticaWESClient('forei/fasp-vus', debug=True)



## Setting up the WES run

In [6]:
params = {
    "project": "forei/fasp-vus",
    "inputs": {
        "gnomad_file": {
            "path": "drs://cavatica-ga4gh-api.sbgenomics.com/6172de7c37cb3c07e42e4e7f",
            "name": "gnomad_chr13_brca2.vcf",
            "class": "File"
        },
        "vcf_file": {
            "path": "drs://cavatica-ga4gh-api.sbgenomics.com/6172de7f37cb3c07e42e4e91",
            "name": "brca2.vcf",
            "class": "File"
        },
        "pathogenicity_file": {
            "path": "drs://cavatica-ga4gh-api.sbgenomics.com/6172de7e37cb3c07e42e4e87",
            "name": "clinvar_brca2.tsv",
            "class": "File"
        }
    }
}
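
The `inputs` entries above all follow one pattern: a `drs://` path built from a host and an object ID, a file name, and `'class': 'File'`. As a minimal sketch, a mapping shaped like `input_files` could be converted into that structure with a small helper. The helper name `build_wes_inputs` and the `DRS_HOST` constant are hypothetical and not part of the fasp library. Note also that the IDs in `input_files` at the top of the notebook differ from those used in `params`, so the mapping passed in must hold the IDs you actually intend to run with.

```python
# Assumed DRS host, copied from the paths used in the params cell above
DRS_HOST = "cavatica-ga4gh-api.sbgenomics.com"


def build_wes_inputs(files, host=DRS_HOST):
    """Turn {input_name: {"drs_id": ..., "name": ...}} into CWL File inputs."""
    return {
        input_name: {
            "path": f"drs://{host}/{spec['drs_id']}",
            "name": spec["name"],
            "class": "File",
        }
        for input_name, spec in files.items()
    }


# Illustration using one of the object IDs from the params cell
example = build_wes_inputs(
    {"vcf_file": {"drs_id": "6172de7f37cb3c07e42e4e91", "name": "brca2.vcf"}}
)
print(example)
```

The result could then be used as the value of `"inputs"` in `params`, which avoids typing each path by hand.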



The request body is now in a form that can be passed to the client function, as follows.

In [7]:
import json
run_id = cl.runGenericWorkflow(
    workflow_url="sbg://forei/fasp-vus/cooccurrence/1",
    workflow_params=json.dumps(params),
    workflow_type="CWL",
    workflow_type_version="sbg:draft-2",
    verbose=False
)
run_id

'bd9fb271-19fb-472d-aa6f-95a0d946c9a0'

In [10]:
cl.getTaskStatus(run_id)

'COMPLETE'
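
The status cell above has to be re-run until the run finishes. A minimal polling sketch is shown here as one way to automate that. It takes any status function as an argument, so in this notebook `cl.getTaskStatus` could be passed in. The function name `wait_for_run`, the polling interval, and the timeout are illustrative assumptions. Only `'COMPLETE'` is confirmed by the output above; the other terminal states are taken from the GA4GH WES state enumeration, and it is an assumption that Cavatica reports them under the same names.

```python
import time

# Terminal states from the GA4GH WES State enumeration (assumed to apply here)
TERMINAL_STATES = {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}


def wait_for_run(get_status, run_id, poll_seconds=30, timeout_seconds=3600):
    """Poll get_status(run_id) until a terminal state or the timeout is reached."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_status(run_id)
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run {run_id} still {state} after {timeout_seconds}s")
        time.sleep(poll_seconds)


# Stand-in status function for illustration only
_states = iter(["QUEUED", "RUNNING", "COMPLETE"])
final_state = wait_for_run(lambda _run_id: next(_states), "example-run", poll_seconds=0)
print(final_state)
```

In the notebook the call would look like `wait_for_run(cl.getTaskStatus, run_id)`.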

In [11]:
cl.GetRunLog(run_id)

{'request': {'tags': {},
  'workflow_params': {'name': 'cooccurrence run - 11-01-21 01:33:16',
   'project': 'forei/fasp-vus',
   'inputs': {'p2': None,
    'save_files': None,
    'gene': None,
    'chromosome': None,
    'data_directory': None,
    'ensembl_release': None,
    'pathology_file': None,
    'gnomad_file': {'path': 'drs://cavatica-ga4gh-api.sbgenomics.com/6172de7c37cb3c07e42e4e7f',
     'name': 'gnomad_chr13_brca2.vcf',
     'class': 'File'},
    'hg_version': None,
    'vcf_file': {'path': 'drs://cavatica-ga4gh-api.sbgenomics.com/6172de7f37cb3c07e42e4e91',
     'name': 'brca2.vcf',
     'class': 'File'},
    'phased': None,
    'pathogenicity_file': {'path': 'drs://cavatica-ga4gh-api.sbgenomics.com/6172de7e37cb3c07e42e4e87',
     'name': 'clinvar_brca2.tsv',
     'class': 'File'}}},
  'workflow_type': 'CWL',
  'workflow_engine_params': {},
  'workflow_url': 'sbg://forei/fasp-vus/cooccurrence/1'},
 'state': 'COMPLETE',
 'outputs': {'vpi_file': {'path': 'drs://cavatica-ga

## Getting the results - via DRS
Once the run is complete, further steps can use DRS to obtain the file output from the workflow.

In [12]:
runLog = cl.GetRunLog(run_id)
runLog['outputs']

{'vpi_file': {'path': 'drs://cavatica-ga4gh-api.sbgenomics.com/617f44ead2f88b031e8fa6ad',
  'name': '_2_BRCA2-vpi.json',
  'class': 'File'},
 'pathology_output': None,
 'all_file': {'path': 'drs://cavatica-ga4gh-api.sbgenomics.com/617f44ead2f88b031e8fa6ab',
  'name': '_2_BRCA2-all.json',
  'class': 'File'},
 'out_file': {'path': 'drs://cavatica-ga4gh-api.sbgenomics.com/617f44ead2f88b031e8fa6a9',
  'name': '_2_BRCA2-cooccurrences.json',
  'class': 'File'},
 'ipv_file': {'path': 'drs://cavatica-ga4gh-api.sbgenomics.com/617f44ead2f88b031e8fa6af',
  'name': '_2_BRCA2-ipv.json',
  'class': 'File'},
 'tout_file': {'path': 'drs://cavatica-ga4gh-api.sbgenomics.com/617f44ead2f88b031e8fa6b0',
  'name': '_2_BRCA2-tout.json',
  'class': 'File'}}
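
Some outputs, such as `pathology_output` above, are `None`, so code that walks over the outputs needs to skip them. A small sketch of collecting the DRS URI for every output that was actually produced follows; the helper name `output_uris` is hypothetical, and the sample dictionary is a two-entry excerpt of the output shown above.

```python
def output_uris(outputs):
    """Map each output name to its DRS URI, skipping outputs that are None."""
    return {name: spec["path"] for name, spec in outputs.items() if spec is not None}


# Two entries copied from the run log outputs shown above
sample_outputs = {
    "out_file": {
        "path": "drs://cavatica-ga4gh-api.sbgenomics.com/617f44ead2f88b031e8fa6a9",
        "name": "_2_BRCA2-cooccurrences.json",
        "class": "File",
    },
    "pathology_output": None,
}
uris = output_uris(sample_outputs)
print(uris)
```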

In [13]:
resultsDRSID = runLog['outputs']['out_file']['path']
resultsDRSID

'drs://cavatica-ga4gh-api.sbgenomics.com/617f44ead2f88b031e8fa6a9'

Use the Cavatica DRS server to retrieve the results files.

In [14]:
from fasp.loc import cavaticaDRSClient
drsClient = cavaticaDRSClient('~/.keys/sevenbridges_keys.json', 's3')

### DRS GetObject

In [15]:
sbDRSID = resultsDRSID.split('/')[-1]
fileDetails = drsClient.getObject(sbDRSID)
fileDetails

{'id': '617f44ead2f88b031e8fa6a9',
 'name': '_2_BRCA2-cooccurrences.json',
 'size': 950,
 'checksums': [{'type': 'etag',
   'checksum': '70b9acfd3fe40187a8b5689f97003313-1'}],
 'self_uri': 'drs://cavatica-ga4gh-api.sbgenomics.com/617f44ead2f88b031e8fa6a9',
 'created_time': '2021-11-01T01:37:46Z',
 'updated_time': '2021-11-01T01:37:46Z',
 'mime_type': 'application/json',
 'access_methods': [{'type': 's3',
   'region': 'us-east-1',
   'access_id': 'aws-us-east-1'}]}
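
The `split('/')[-1]` step above works because the object ID is the last path segment of the DRS URI. As a slightly more defensive sketch, the standard library's `urlparse` can separate the host from the object ID and reject strings that are not DRS URIs. The function name `parse_drs_uri` is hypothetical, and the sketch assumes hostname-based DRS URIs of the form seen in this notebook.

```python
from urllib.parse import urlparse


def parse_drs_uri(uri):
    """Split a hostname-based DRS URI into (host, object_id)."""
    parsed = urlparse(uri)
    if parsed.scheme != "drs":
        raise ValueError(f"not a DRS URI: {uri!r}")
    object_id = parsed.path.lstrip("/")
    if not parsed.netloc or not object_id:
        raise ValueError(f"DRS URI is missing a host or object ID: {uri!r}")
    return parsed.netloc, object_id


host, object_id = parse_drs_uri(
    "drs://cavatica-ga4gh-api.sbgenomics.com/617f44ead2f88b031e8fa6a9"
)
print(host, object_id)
```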

In [16]:
url = drsClient.getAccessURL(sbDRSID,'s3')

### Downloading the file
Now we can use the URL obtained above to download the file. We'll create a small function to encapsulate the download.

In [17]:
import requests
import os
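def download(url, file_path):
    # Fetch first, so that a failed request does not leave an empty file behind
    response = requests.get(url)
    response.raise_for_status()
    with open(os.path.expanduser(file_path), "wb") as file:
        file.write(response.content)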

In [19]:
fullPath = fileDetails['name']
download(url, fullPath)
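
The DRS object record returned earlier includes a `size` field (950 bytes for this file), which gives a cheap sanity check on the download. The sketch below compares the size on disk with an expected value. The helper name `check_size` is hypothetical, and a temporary file stands in for the downloaded file so that the example runs without network access.

```python
import os
import tempfile


def check_size(file_path, expected_size):
    """Raise if the file on disk does not have the size the DRS object reports."""
    actual = os.path.getsize(os.path.expanduser(file_path))
    if actual != expected_size:
        raise IOError(f"{file_path}: expected {expected_size} bytes, found {actual}")
    return actual


# Illustration with a temporary file standing in for the download
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 950)
checked = check_size(tmp.name, 950)
os.remove(tmp.name)
print(checked)
```

In the notebook the equivalent call would be `check_size(fullPath, fileDetails['size'])`.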


In [20]:
with open(fullPath) as json_file:
    data = json.load(json_file)
# delete the local copy of the file
os.remove(fullPath)
# and look at the contents
data

{'cooccurring vus': {"(13, 32355250, 'T', 'C')": {'likelihood data': {'p1': 0.375,
    'p2': 0.001,
    'n': 2,
    'k': 1,
    'likelihood': 0.0042624},
   'allele frequencies': {'maxPop': 'eas',
    'maxPopFreq': '0.977087',
    'cohortFreq': 0.5},
   'pathogenic variants': [[13, 32316508, 'GAC', 'G']]},
  "(13, 32353519, 'A', 'G')": {'likelihood data': {'p1': 0.375,
    'p2': 0.001,
    'n': 1,
    'k': 1,
    'likelihood': 0.0026666666666666666},
   'allele frequencies': {'maxPop': 'afr',
    'maxPopFreq': '0.00385267',
    'cohortFreq': 0.25},
   'pathogenic variants': [[13, 32338749, 'AATTAC', 'A']]},
  "(13, 32353470, 'A', 'C')": {'likelihood data': {'p1': 0.375,
    'p2': 0.001,
    'n': 1,
    'k': 1,
    'likelihood': 0.0026666666666666666},
   'allele frequencies': {'maxPop': 'eas',
    'maxPopFreq': '0.383654',
    'cohortFreq': 0.25},
   'pathogenic variants': [[13, 32340836, 'GACAA', 'G']]}},
 'homozygous vus': {"(13, 32355250, 'T', 'C')": {'count': 1,
   'maxPop': 'eas',
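
The `likelihood` values in the output above are numerically consistent with a binomial likelihood ratio of the form (p2/p1)^k * ((1-p2)/(1-p1))^(n-k). This formula is inferred from the numbers shown here, not taken from the federated-analysis source, so treat it as an assumption. The sketch below recomputes the two distinct values from their `p1`, `p2`, `n`, and `k` fields.

```python
import math


def likelihood_ratio(p1, p2, n, k):
    """Inferred formula: (p2/p1)**k * ((1-p2)/(1-p1))**(n-k)."""
    return (p2 / p1) ** k * ((1 - p2) / (1 - p1)) ** (n - k)


# Parameters copied from the 'likelihood data' entries in the output above
first = likelihood_ratio(p1=0.375, p2=0.001, n=2, k=1)
second = likelihood_ratio(p1=0.375, p2=0.001, n=1, k=1)
print(first, second)
print(math.isclose(first, 0.0042624), math.isclose(second, 0.0026666666666666666))
```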

In [21]:
# flatten cooccurrence output
flat_vus = []
for k, v in data['cooccurring vus'].items():
    pathogenic_count = len(v['pathogenic variants'])
    # merge the nested dicts into one flat record using dict unpacking
    z = {"vus": k, **v['likelihood data'], **v['allele frequencies'], "no_pathogenic_coocurrs": pathogenic_count}
    flat_vus.append(z)

# turn the array of dicts into a data frame    
import pandas as pd
flat_df = pd.DataFrame(flat_vus)
flat_df

| | vus | p1 | p2 | n | k | likelihood | maxPop | maxPopFreq | cohortFreq | no_pathogenic_coocurrs |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (13, 32355250, 'T', 'C') | 0.375 | 0.001 | 2 | 1 | 0.004262 | eas | 0.977087 | 0.5 | 1 |
| 1 | (13, 32353519, 'A', 'G') | 0.375 | 0.001 | 1 | 1 | 0.002667 | afr | 0.00385267 | 0.25 | 1 |
| 2 | (13, 32353470, 'A', 'C') | 0.375 | 0.001 | 1 | 1 | 0.002667 | eas | 0.383654 | 0.25 | 1 |
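
The `vus` column holds each variant as the string form of a tuple, which is awkward to filter or join on. As a minimal sketch, `ast.literal_eval` from the standard library can turn such a string back into its parts safely. The helper name `parse_variant_key` and the field names `chrom`, `pos`, `ref`, and `alt` are assumptions chosen for illustration; the source does not name the tuple positions.

```python
import ast


def parse_variant_key(key):
    """Convert a key such as "(13, 32355250, 'T', 'C')" into a dict of fields."""
    chrom, pos, ref, alt = ast.literal_eval(key)
    return {"chrom": chrom, "pos": pos, "ref": ref, "alt": alt}


variant = parse_variant_key("(13, 32355250, 'T', 'C')")
print(variant)
```

Applied to the data frame, something like `flat_df['vus'].map(parse_variant_key)` would give one dict per row, which could then be expanded into separate columns.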


## To do
- Submit the pathogenicity file from the local system
- Either access the gnomAD file from gnomAD, or supply it from the local system

### Done
- Make the container available to other WES servers by adding the Docker container to Docker Hub instead of the Seven Bridges docker repository
