The FASPRunner class manages logging and settings/preferences for us.

In [1]:
from fasp.runner import FASPRunner

faspRunner = FASPRunner()
settings = faspRunner.settings

Running <ipython-input-1-bec338364b20>


The following sets up two clients to search for data. At the moment one of these is a placeholder that searches a local file. That file contains file ids downloaded as a manifest from the Gen3 Anvil portal; the list of files in the manifest had already been filtered to the relevant samples. Note that the DRS ids are written as CURIEs, with a prefix (crdc for the Cancer Research Data Commons and anv for Anvil). The prefix indicates which namespace an id comes from and allows the referenced file to be retrieved from the correct DRS server. In the case of the gtex manifest file, the anv: prefix was added in an edited version of the file.
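
As a minimal sketch of how such a prefixed id can be split into its namespace and local id: the helper name `split_curie` and the `KNOWN_PREFIXES` set are assumptions made for illustration and are not part of fasp.

```python
KNOWN_PREFIXES = {"crdc", "anv"}  # the two namespaces used in this notebook


def split_curie(prefixed_id):
    """Split 'prefix:local_id' into (prefix, local_id).

    Hypothetical helper for illustration, not a fasp function.
    """
    prefix, sep, local_id = prefixed_id.partition(":")
    if not sep or prefix not in KNOWN_PREFIXES:
        raise ValueError("no known CURIE prefix in {!r}".format(prefixed_id))
    return prefix, local_id


print(split_curie("anv:dg.ANV0/76bb893d-12da-41ca-8828-ff89551d3e15"))
print(split_curie("crdc:030e5e74-6461-4f05-a399-de8e470bc056"))
```

Rejecting unknown prefixes early is one possible design choice; it surfaces a malformed manifest entry before any DRS request is made.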

In [2]:
from fasp.search import BigQuerySearchClient, Gen3ManifestClient

# Step 1 - Discovery
# query for relevant DRS objects
discoveryClients = {
    "crdc": BigQuerySearchClient(),
    "anv": Gen3ManifestClient('../fasp/data/gtex/gtex-cram-manifest_wCuries.json')
}


# TCGA Query - CRDC
crdcquery = """
    SELECT 'case_'||associated_entities__case_gdc_id , 'crdc:'||file_id
    FROM `isb-cgc.GDC_metadata.rel24_fileData_active` 
    where data_format = 'BAM' 
    and project_disease_type = 'Breast Invasive Carcinoma'
    limit 3"""


# Run both queries and aggregate results
results = discoveryClients['anv'].runQuery(3)  # Send the query for the first three items
results += discoveryClients['crdc'].runQuery(crdcquery) 
results


[['GTEX-1GTWX-0001-SM-7J3A5.cram',
  'anv:dg.ANV0/76bb893d-12da-41ca-8828-ff89551d3e15'],
 ['GTEX-14PQA-0003-SM-7DLH4.cram',
  'anv:dg.ANV0/66352de8-4b50-4cae-881d-b76d03df5ac8'],
 ['GTEX-1B98T-0004-SM-7J38T.cram',
  'anv:dg.ANV0/ed9ac9ae-02da-4e97-93da-ad86aa77d227'],
 Row(('case_1b703058-e596-45bc-80fe-8b98d545c2e2', 'crdc:030e5e74-6461-4f05-a399-de8e470bc056'), {'f0_': 0, 'f1_': 1}),
 Row(('case_a6edb6ca-ae9f-4da7-8ebe-92d83d2987fb', 'crdc:0329fa7e-d768-4bbe-940e-36f0b9829d7c'), {'f0_': 0, 'f1_': 1}),
 Row(('case_a947a945-4721-45cc-bc45-13b8ea41c10e', 'crdc:04c68898-ddac-4e15-9f9a-5bf278d55e4a'), {'f0_': 0, 'f1_': 1})]
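
The aggregated list above mixes plain lists (from the manifest client) with BigQuery Row objects. The loop later in this notebook only reads positions 0 and 1 of each row, so both kinds work as they are. If a uniform structure is preferred, the results could be normalized along the lines of this sketch. `normalize_results` is a hypothetical helper, and a tuple stands in for a BigQuery Row so that the example runs without credentials.

```python
def normalize_results(rows):
    """Return [label, prefixed_drs_id] lists for any indexable row type.

    Hypothetical helper for illustration, not a fasp function.
    """
    return [[row[0], row[1]] for row in rows]


mixed = [
    ["GTEX-1GTWX-0001-SM-7J3A5.cram",
     "anv:dg.ANV0/76bb893d-12da-41ca-8828-ff89551d3e15"],
    # A tuple standing in for a BigQuery Row (assumption for this sketch)
    ("case_1b703058-e596-45bc-80fe-8b98d545c2e2",
     "crdc:030e5e74-6461-4f05-a399-de8e470bc056"),
]

normalized = normalize_results(mixed)
for label, drs_id in normalized:
    print(label, drs_id)
```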

The next step sets up two DRS clients to handle DRS requests based on the CURIE prefix of each individual DRS id. Note that the crdc data will be accessed on Google Cloud (gs) and the Anvil data on Amazon Web Services (s3).

In [5]:
from fasp.loc import crdcDRSClient, anvilDRSClient

drsClients = {
    "crdc": crdcDRSClient('~/.keys/crdc_credentials.json', access_id='gs'),
    "anv": anvilDRSClient('~/.keys/anvil_credentials.json', access_id='s3')
}

Similarly, we set up two WES clients. Note that for the data in Google Cloud we are using GCPLSsamtools, a fasp class which accesses Google Cloud's Life Sciences Pipelines API. The plan is to replace that with the DNA Stack WES server when that is updated.

In [6]:
from fasp.workflow import GCPLSsamtools, sbWESClient

gcplocation = 'projects/{}/locations/{}'.format(settings['GCPProject'], settings['GCPPipelineRegion'])
wesClients = {
    "crdc": GCPLSsamtools(gcplocation, settings['GCPOutputBucket']),
    "anv": sbWESClient(settings['SevenBridgesInstance'], settings['SevenBridgesProject'],
                    '~/.keys/sbcgc_key.json')
}
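
To make the location string passed to GCPLSsamtools concrete, here is a small sketch of how it is assembled from the two settings. The project and region values are placeholders invented for this example; the real values come from the FASPRunner settings.

```python
# Placeholder values (assumptions); real values come from faspRunner.settings
example_settings = {
    "GCPProject": "my-project",
    "GCPPipelineRegion": "us-central1",
}

example_location = 'projects/{}/locations/{}'.format(
    example_settings['GCPProject'], example_settings['GCPPipelineRegion'])
print(example_location)
```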

The following loops through each result of the query. For each drs_id it retrieves an authorized URL to access the file and submits it for analysis to the appropriate server.

In non-federated scenarios, where all requests go to the same WES server, FASPRunner deals with running this loop. This example shows the process in detail and handles our federated use case.

In [1]:
import datetime

# repeat steps 2 and 3 for each row of the query
for row in results:

    print("file={}, drsID={}".format(row[0], row[1]))
    resRow = [row[0], row[1]]
    # Step 2 - Use DRS to get the URL
    # This is a local solution to resolve prefixed DRS ids; a DRS Metaresolver would be better
    # get the prefix
    prefix, drsid = row[1].split(":", 1)
    drsClient = drsClients[prefix]
    print('Sending id {} to {}'.format(drsid, drsClient.__class__.__name__))

    url = drsClient.getAccessURL(drsid)
    objInfo = drsClient.getObject(drsid)
    fileSize = objInfo['size']

    # Step 3 - Run a pipeline on the file at the drs url
    if url is not None:
        outfile = "{}.txt".format(row[0])
        via = 'sh'
        note = 'GTEx and TCGA - federated analysis'
        time = datetime.datetime.now().strftime("%m/%d/%Y, %H:%M:%S")
        wesClient = wesClients[prefix]
        run_id = wesClient.runWorkflow(url, outfile)
        print('Submitted {}'.format(run_id))
        searchClient = discoveryClients[prefix]
        faspRunner.logRun(time, via, note,  run_id, outfile, fileSize,
            searchClient, drsClient, wesClient)
        resRow.append('OK')
    else:
        print('could not get DRS url')
        resRow.append('unauthorized')
    print('_________________________________________________________________________')



NameError: name 'results' is not defined
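
The NameError above suggests this cell was run in a fresh kernel, before the earlier cells had defined `results`. To follow the routing logic without credentials or live servers, the following is a minimal sketch using stand-in clients. `StubDRSClient` and `StubWESClient` are hypothetical classes that only mimic the `getAccessURL` and `runWorkflow` calls used in the loop above; the URLs and run ids they return are made up.

```python
class StubDRSClient:
    """Stand-in for a DRS client; returns a fake URL instead of a signed one."""

    def __init__(self, scheme):
        self.scheme = scheme

    def getAccessURL(self, drsid):
        return "{}://example-bucket/{}".format(self.scheme, drsid)


class StubWESClient:
    """Stand-in for a WES client; records submissions and returns a fake run id."""

    def __init__(self, name):
        self.name = name
        self.submitted = []

    def runWorkflow(self, url, outfile):
        self.submitted.append((url, outfile))
        return "{}-run-{}".format(self.name, len(self.submitted))


# One stub per CURIE prefix, mirroring the drsClients / wesClients dictionaries
drs_stubs = {"crdc": StubDRSClient("gs"), "anv": StubDRSClient("s3")}
wes_stubs = {"crdc": StubWESClient("gcp"), "anv": StubWESClient("sb")}

sample_results = [
    ["GTEX-1GTWX-0001-SM-7J3A5.cram",
     "anv:dg.ANV0/76bb893d-12da-41ca-8828-ff89551d3e15"],
    ["case_1b703058-e596-45bc-80fe-8b98d545c2e2",
     "crdc:030e5e74-6461-4f05-a399-de8e470bc056"],
]

summary = []
for label, prefixed_id in sample_results:
    # The CURIE prefix selects both the DRS client and the WES client
    prefix, drsid = prefixed_id.split(":", 1)
    url = drs_stubs[prefix].getAccessURL(drsid)
    run_id = wes_stubs[prefix].runWorkflow(url, "{}.txt".format(label))
    summary.append([label, prefixed_id, run_id])

for entry in summary:
    print(entry)
```

Each file is submitted to the workflow client that shares its prefix, which is the essence of the federated case: the data and the compute for a given namespace stay on the same cloud.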

In [2]:
from fasp.search import DiscoverySearchClient
cl = DiscoverySearchClient('https://ga4gh-search-adapter-presto-public.staging.dnastack.com')
cl.listTableInfo('search_cloud.ncbi_sra.january2021')


{'name': 'search_cloud.ncbi_sra.january2021',
 'description': 'Automatically generated schema',
 'data_model': {'$id': 'https://ga4gh-search-adapter-presto-public.staging.dnastack.com/table/search_cloud.ncbi_sra.january2021/info',
  'description': 'Automatically generated schema',
  '$schema': 'http://json-schema.org/draft-07/schema#',
  'properties': {'drsid': {'format': 'varchar',
    'type': 'string',
    '$comment': 'varchar'},
   'version': {'format': 'varchar', 'type': 'string', '$comment': 'varchar'},
   'self_uri': {'format': 'varchar', 'type': 'string', '$comment': 'varchar'},
   'access_methods': {'items': {'type': 'object',
     '$comment': 'row(string,string,object)',
     'properties': {'region': {'format': 'varchar',
       'type': 'string',
       '$comment': 'varchar'},
      'type': {'format': 'varchar', 'type': 'string', '$comment': 'varchar'},
      'access_url': {'type': 'object',
       '$comment': 'row(string)',
       'properties': {'url': {'format': 'varchar',
 