## Running the VUS cooccurrence workflow via WES on TOPMed COPDGene data
Apart from the COPDGene VCF file, which was the target of the analysis, the files used in this example were uploaded to the BioDataCatalyst Seven Bridges server.

Before the workflow was run, COPDGene_phs000951_TOPMed_WGS_freeze.8.chr13.hg38.c1.vcf was filtered down to the BRCA2 region using BCFtools.
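
The filtering command itself is not shown in this notebook. The sketch below illustrates how such a step could be prepared from Python, assuming `bcftools` is installed. The helper name `bcftools_region_cmd` and the region coordinates are illustrative assumptions, not the values actually used; check the BRCA2 coordinates and the chromosome naming for your reference build.

```python
import shlex

def bcftools_region_cmd(in_vcf, out_vcf, region):
    """Build a `bcftools view` command that keeps only records inside `region`.

    --targets streams through the file, so no index is needed;
    for an indexed, bgzipped VCF, --regions would be faster.
    """
    return ["bcftools", "view", "--targets", region, "--output", out_vcf, in_vcf]

# Assumption: an approximate hg38 span around BRCA2, for illustration only.
cmd = bcftools_region_cmd(
    "COPDGene_phs000951_TOPMed_WGS_freeze.8.chr13.hg38.c1.vcf",
    "COPDGene_phs000951_TOPMed_WGS_freeze.8.chr13.hg38.c1.filtered.vcf",
    "chr13:32315000-32401000",
)
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```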

In [1]:
from fasp.workflow import sbWESClient
cl = sbWESClient('https://ga4gh-api.sb.biodatacatalyst.nhlbi.nih.gov/ga4gh/wes/v1', 'forei/fasp-vus',
                     '~/.keys/sbbdc_key.json', debug=True)



## Setting up the WES run

In [2]:
params = {
    "project": "forei/fasp-vus",
    "inputs": {
        'vcf_file': {'path': 'drs://ga4gh-api.sb.biodatacatalyst.nhlbi.nih.gov/617c77bce6261a31b6d12f0a',
                     'name': 'COPDGene_phs000951_TOPMed_WGS_freeze.8.chr13.hg38.c1.filtered.vcf',
                     'class': 'File'},

        'pathogenicity_file': {'path': 'drs://ga4gh-api.sb.biodatacatalyst.nhlbi.nih.gov/619d2c405d45c457d555d398',
                               'name': 'clinvar_20211120_13.vcf',
                               'class': 'File'}
        }
    }
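
Both entries under `inputs` have the same shape: a DRS URI, a file name and the class `File`. A small helper can reduce the repetition. This is only a sketch; `drs_file_input` is a hypothetical name and is not part of the fasp package.

```python
import json

def drs_file_input(drs_uri, name):
    """Return a CWL-style File input that points at a DRS object."""
    if not drs_uri.startswith("drs://"):
        raise ValueError(f"not a DRS URI: {drs_uri}")
    return {"path": drs_uri, "name": name, "class": "File"}

example_params = {
    "project": "forei/fasp-vus",
    "inputs": {
        "vcf_file": drs_file_input(
            "drs://ga4gh-api.sb.biodatacatalyst.nhlbi.nih.gov/617c77bce6261a31b6d12f0a",
            "COPDGene_phs000951_TOPMed_WGS_freeze.8.chr13.hg38.c1.filtered.vcf",
        ),
        "pathogenicity_file": drs_file_input(
            "drs://ga4gh-api.sb.biodatacatalyst.nhlbi.nih.gov/619d2c405d45c457d555d398",
            "clinvar_20211120_13.vcf",
        ),
    },
}

# Serialise the body as a JSON string.
body = json.dumps(example_params)
```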



The request body is now in a form that can be passed to the client function, as follows.

In [3]:
import json
run_id = cl.runGenericWorkflow(
    workflow_url="sbg://forei/fasp-vus/cooccurrence/5",
    workflow_params=json.dumps(params),
    workflow_type="CWL",
    workflow_type_version="sbg:draft-2"
)
run_id

'19f53635-f367-4dc3-b103-b38f980c42f5'

In [5]:
import dateutil.parser
print(cl.getTaskStatus(run_id))
log = cl.GetRunLog(run_id)
run_log = log['run_log']
# end_time may be empty while the run is still in progress, so check both timestamps
if run_log['start_time'] and run_log['end_time']:
    start = dateutil.parser.isoparse(run_log['start_time'])
    end = dateutil.parser.isoparse(run_log['end_time'])
    duration = end - start
    print(str(duration))

COMPLETE
0:13:47
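
The run above took almost 14 minutes, so in a script it can be convenient to poll until the run reaches a terminal state. The following is a minimal sketch, not part of the fasp client: `wait_for_run` is a hypothetical helper, and it assumes the status function returns the WES state as a string.

```python
import time

# Terminal states as defined in the GA4GH WES 1.0 specification.
TERMINAL_STATES = {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}

def wait_for_run(get_status, run_id, poll_seconds=30, timeout_seconds=3600,
                 sleep=time.sleep):
    """Poll get_status(run_id) until a terminal state is returned.

    Raises TimeoutError if no terminal state is seen within timeout_seconds.
    """
    waited = 0
    while True:
        state = get_status(run_id)
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f"run {run_id} still in state {state}")
        sleep(poll_seconds)
        waited += poll_seconds
```

With the client from this notebook the call would look like `wait_for_run(cl.getTaskStatus, run_id)`, assuming `getTaskStatus` returns the bare state string.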


In [6]:
cl.GetRunLog(run_id)

{'request': {'tags': {},
  'workflow_params': {'name': 'cooccurrence run - 11-23-21 19:34:55',
   'project': 'forei/fasp-vus',
   'inputs': {'p2': None,
    'save_files': None,
    'gene': None,
    'chromosome': None,
    'data_directory': None,
    'ensembl_release': None,
    'pathology_file': None,
    'gnomad_file': None,
    'hg_version': None,
    'vcf_file': {'path': 'drs://ga4gh-api.sb.biodatacatalyst.nhlbi.nih.gov/617f3fc9fe2ce002310283ba',
     'name': '_1_COPDGene_phs000951_TOPMed_WGS_freeze.8.chr13.hg38.c1.filtered.vcf',
     'class': 'File'},
    'phased': None,
    'pathogenicity_file': {'path': 'drs://ga4gh-api.sb.biodatacatalyst.nhlbi.nih.gov/619d425f35d6fb33b51ec7ca',
     'name': 'clinvar_20211120_13.vcf',
     'class': 'File'}}},
  'workflow_type': 'CWL',
  'workflow_engine_params': {},
  'workflow_url': 'sbg://forei/fasp-vus/cooccurrence/5'},
 'state': 'COMPLETE',
 'outputs': {'vpi_file': {'path': 'drs://ga4gh-api.sb.biodatacatalyst.nhlbi.nih.gov/619d459b35d6fb33b5

## Getting the results - via DRS
Once the run is complete, further steps can use DRS to obtain the file output from the workflow.

In [7]:
runLog = cl.GetRunLog(run_id)
runLog['outputs']

{'vpi_file': {'path': 'drs://ga4gh-api.sb.biodatacatalyst.nhlbi.nih.gov/619d459b35d6fb33b51ec7e3',
  'name': '_8_BRCA2-vpi.json',
  'class': 'File'},
 'pathology_output': None,
 'all_file': {'path': 'drs://ga4gh-api.sb.biodatacatalyst.nhlbi.nih.gov/619d459b35d6fb33b51ec7df',
  'name': '_8_BRCA2-all.json',
  'class': 'File'},
 'out_file': {'path': 'drs://ga4gh-api.sb.biodatacatalyst.nhlbi.nih.gov/619d459b35d6fb33b51ec7e7',
  'name': '_10_BRCA2-cooccurrences.json',
  'class': 'File'},
 'ipv_file': {'path': 'drs://ga4gh-api.sb.biodatacatalyst.nhlbi.nih.gov/619d459b35d6fb33b51ec7e4',
  'name': '_8_BRCA2-ipv.json',
  'class': 'File'},
 'tout_file': {'path': 'drs://ga4gh-api.sb.biodatacatalyst.nhlbi.nih.gov/619d459b35d6fb33b51ec7e1',
  'name': '_8_BRCA2-tout.json',
  'class': 'File'}}

In [8]:
resultsDRSID = runLog['outputs']['out_file']['path']
resultsDRSID

'drs://ga4gh-api.sb.biodatacatalyst.nhlbi.nih.gov/619d459b35d6fb33b51ec7e7'

Use the BioDataCatalyst SB DRS server to retrieve the results file.

In [9]:
from fasp.loc import sbbdcDRSClient
drsClient = sbbdcDRSClient('~/.keys/sevenbridges_keys.json', 's3')

### DRS GetObject
Here's how we get the details of the file. Note that only the id portion of the DRS URI is passed here. Normally it is the job of a metaresolver to look at the full URI and determine which server to send the id to. In this example we skip the metaresolver and extract the id manually.

In [10]:
sbDRSID = resultsDRSID.split('/')[-1]
fileDetails = drsClient.getObject(sbDRSID)
fileDetails

{'id': '619d459b35d6fb33b51ec7e7',
 'name': '_10_BRCA2-cooccurrences.json',
 'size': 19069,
 'checksums': [{'type': 'etag',
   'checksum': '746014d5b3bc71a68d40ebac0e101c6b-1'}],
 'self_uri': 'drs://ga4gh-api.sb.biodatacatalyst.nhlbi.nih.gov/619d459b35d6fb33b51ec7e7',
 'created_time': '2021-11-23T19:48:43Z',
 'updated_time': '2021-11-23T19:48:43Z',
 'mime_type': 'application/json',
 'access_methods': [{'type': 's3',
   'region': 'us-east-1',
   'access_id': 'aws-us-east-1'}]}
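
A DRS URI has two parts that matter here: the host, which a metaresolver would use to choose the server, and the object id, which is what the client call takes. As a sketch, the standard library can split the two more explicitly than `split('/')`; `split_drs_uri` is a hypothetical helper name.

```python
from urllib.parse import urlparse

def split_drs_uri(drs_uri):
    """Split a hostname-based DRS URI into (host, object_id)."""
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs" or not parsed.netloc:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri}")
    return parsed.netloc, parsed.path.lstrip("/")

host, object_id = split_drs_uri(
    "drs://ga4gh-api.sb.biodatacatalyst.nhlbi.nih.gov/619d459b35d6fb33b51ec7e7"
)
print(host)       # ga4gh-api.sb.biodatacatalyst.nhlbi.nih.gov
print(object_id)  # 619d459b35d6fb33b51ec7e7
```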

In [11]:
url = drsClient.getAccessURL(sbDRSID,'s3')

### Downloading the file
Now we can use the url obtained to download the file. We'll create a small function to encapsulate the download.

In [12]:
import requests
import os
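def download(url, file_path):
    # fetch first, so that a failed request does not leave an empty file behind
    response = requests.get(url)
    response.raise_for_status()
    with open(os.path.expanduser(file_path), "wb") as file:
        file.write(response.content)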

In [13]:
fullPath = fileDetails['name']
download(url, fullPath)
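
The DRS object details include the expected file size (`fileDetails['size']`), which gives a cheap sanity check on the download. A minimal sketch follows; `check_size` is a hypothetical helper, and a size match does not replace a checksum comparison.

```python
import os

def check_size(file_path, expected_size):
    """Raise IOError if the file on disk does not have the expected size in bytes."""
    actual = os.path.getsize(os.path.expanduser(file_path))
    if actual != expected_size:
        raise IOError(
            f"{file_path}: expected {expected_size} bytes, found {actual}"
        )
    return actual
```

In this notebook it would be called as `check_size(fullPath, fileDetails['size'])`.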


In [14]:
with open(fullPath) as json_file:
    data = json.load(json_file)
# delete the local copy of the file
os.remove(fullPath)
# and look at the contents
# data

In [15]:
# flatten cooccurrence output
flat_vus = []
for k, v in data['cooccurring vus'].items():
    pathogenic_count = len(v['pathogenic variants'])
    # merge the dicts into one flat record using dict unpacking
    z = {**{"vus":k}, **v['likelihood data'], **v['allele frequencies'], **{"no_pathogenic_coocurrs":pathogenic_count}}
    flat_vus.append(z)

# turn the list of dicts into a data frame
import pandas as pd
flat_df = pd.DataFrame(flat_vus)
flat_df

| | vus | p1 | p2 | n | k | likelihood | maxPop | maxPopFreq | cohortFreq | no_pathogenic_coocurrs |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ('13', 32342270, 'CA', 'C') | 0.001077 | 0.001 | 5000 | 1 | 1.363553 | | | 0.489428 | 1 |
| 1 | ('13', 32344166, 'GAA', 'G') | 0.001077 | 0.001 | 4192 | 1 | 1.281487 | | | 0.410337 | 1 |
| 2 | ('13', 32365109, 'C', 'CT') | 0.001077 | 0.001 | 3444 | 1 | 1.209925 | | | 0.337118 | 1 |
| 3 | ('13', 32368001, 'CTTTTTTTTTT', 'C') | 0.001077 | 0.001 | 6704 | 2 | 1.443371 | | | 0.656226 | 2 |
| 4 | ('13', 32392589, 'CAA', 'C') | 0.001077 | 0.001 | 5432 | 2 | 1.309001 | | | 0.531715 | 2 |
| 5 | ('13', 32399786, 'C', 'CT') | 0.001077 | 0.001 | 4495 | 1 | 1.311667 | | | 0.439996 | 1 |
| 6 | ('13', 32399885, 'C', 'A') | 0.001077 | 0.001 | 10216 | 28 | 0.275916 | | | 1.0 | 22 |
| 7 | ('13', 32400151, 'T', 'A') | 0.001077 | 0.001 | 9620 | 16 | 0.640669 | | | 0.94166 | 13 |
| 8 | ('13', 32349814, 'C', 'CA') | 0.001077 | 0.001 | 6475 | 1 | 1.527155 | | | 0.63381 | 1 |
| 9 | ('13', 32359222, 'GA', 'G') | 0.001077 | 0.001 | 7084 | 1 | 1.6003 | | | 0.693422 | 1 |
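
The `vus` values are strings that look like Python tuples, which makes them awkward to filter or sort on. As a sketch, they can be parsed safely with `ast.literal_eval`. Reading the four fields as chromosome, position, reference allele and alternate allele is an assumption based on their appearance; `parse_vus_key` is a hypothetical helper.

```python
import ast

def parse_vus_key(key):
    """Parse a key such as "('13', 32342270, 'CA', 'C')" into named fields.

    Assumption: the tuple is ordered (chromosome, position, ref, alt).
    """
    chrom, pos, ref, alt = ast.literal_eval(key)
    return {"chrom": chrom, "pos": pos, "ref": ref, "alt": alt}

parsed = parse_vus_key("('13', 32342270, 'CA', 'C')")
print(parsed)
```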


In [16]:
# homozygous vus output
homozygous_vus = []
for k, v in data['homozygous vus'].items():
    # merge the dicts into one flat record using dict unpacking
    z = {**{"vus":k}, **v}
    homozygous_vus.append(z)

# turn the list of dicts into a data frame
import pandas as pd
hz_df = pd.DataFrame(homozygous_vus)
hz_df

| | vus | count | cohortFreq |
|---|---|---|---|
| 0 | ('13', 32344166, 'GAA', 'G') | 838 | 0.082028 |
| 1 | ('13', 32380534, 'C', 'CT') | 689 | 0.067443 |
| 2 | ('13', 32399786, 'C', 'CT') | 431 | 0.042189 |
| 3 | ('13', 32399885, 'C', 'A') | 10199 | 0.998336 |
| 4 | ('13', 32400151, 'T', 'A') | 6099 | 0.597005 |
| ... | ... | ... | ... |
| 142 | ('13', 32329935, 'C', 'T') | 1 | 0.000098 |
| 143 | ('13', 32349878, 'G', 'A') | 1 | 0.000098 |
| 144 | ('13', 32365109, 'C', 'CTT') | 1 | 0.000098 |
| 145 | ('13', 32368001, 'C', 'CT') | 1 | 0.000098 |
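
The reported `cohortFreq` values appear consistent with `count` divided by a cohort size of 10216. That figure is an inference from the data shown in this notebook (the cooccurring VUS with `cohortFreq` 1.0 has `n` = 10216), not something the workflow output states. The sketch below checks the visible rows against that assumption.

```python
# Assumption: cohort size inferred from the outputs shown, not reported by the workflow.
COHORT_SIZE = 10216

# (count, reported cohortFreq) pairs copied from the homozygous VUS output
rows = [
    (838, 0.082028),
    (689, 0.067443),
    (431, 0.042189),
    (10199, 0.998336),
    (6099, 0.597005),
    (1, 0.000098),
]

# Largest gap between count / COHORT_SIZE and the reported six-decimal frequency
max_error = max(abs(count / COHORT_SIZE - reported) for count, reported in rows)
print(f"largest deviation: {max_error:.2e}")
```

A deviation below 1e-6 is what rounding to six decimal places would produce.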


## To do
- Submit the pathogenicity file from the local system
- Either access the gnomAD file from gnomAD, or supply it from the local system

## Done
- Make the container available to other WES servers by adding the Docker container to Docker Hub instead of the Seven Bridges docker repository
