<img src="../fasp/runner/credits/images/FASPNotebook10.jpg" style="float: right;">

## Obtain Thousand Genomes files from SRA DRS and submit to Seven Bridges WES
This notebook explores use of the SRA DRS server. It is derived from FASPScript14.py but has been adapted to use a Seven Bridges WES service. 

The mapping of DRS ids to SRA accessions may be done in different ways and the process to do so is in flux.

The approach taken below uses the mapping available through the subject and specimen data exposed by the Search API. In this case the SRR accession is shown only for information. The query is formulated in terms of a particular population and the requirement that the files be mapped BAM files, which gives us DRS ids directly. Alternatively, a list of SRR accessions could be used.

In [1]:
from fasp.search import DiscoverySearchClient

# Step 1 - Discovery
# query for relevant DRS objects
searchClient = DiscoverySearchClient('https://ga4gh-search-adapter-presto-public.prod.dnastack.com/')

query = '''SELECT f.sample_name, drs_id bam_drs_id, acc
FROM thousand_genomes.onek_genomes.ssd_drs s 
join thousand_genomes.onek_genomes.sra_drs_files f on f.sample_name = s.su_submitter_id 
where filetype = 'bam' and mapped = 'mapped' 
and sequencing_type ='exome' and  population = 'JPT' LIMIT 3'''

resultRows = searchClient.runQuery(query, returnType='dataframe')
resultRows

_Retrieving the query_
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________
____Page6_______________


Unnamed: 0,sample_name,bam_drs_id,acc
0,NA18948,fb1cfb04d3ef99d07c21f9dbf87ccc68,SRR1601121
1,NA18945,9327fb44eb81b49a41e38c8d86eb3b3a,SRR1601115
2,NA18943,9f38253b281c7e9c99e4bdbececd8e2f,SRR1606910


The call to the Search client above returns a dataframe, which is convenient for many purposes, including listing the results as above. By default, runQuery returns a list of lists.

In [8]:
results = searchClient.runQuery(query)
results

_Retrieving the query_
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________
____Page6_______________


[['NA18948', 'fb1cfb04d3ef99d07c21f9dbf87ccc68', 'SRR1601121'],
 ['NA18945', '9327fb44eb81b49a41e38c8d86eb3b3a', 'SRR1601115'],
 ['NA18943', '9f38253b281c7e9c99e4bdbececd8e2f', 'SRR1606910']]
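
With the list-of-lists form, values are addressed by position (for example `row[1]` for the DRS id), which is easy to get wrong if the query changes. A minimal sketch of pairing each value with a column name follows. It assumes the column order matches the SELECT clause of the query; the `columns` and `records` names are illustrative and are not part of the fasp API.

```python
# Rows copied from the output above; column order follows the SELECT clause.
results = [
    ['NA18948', 'fb1cfb04d3ef99d07c21f9dbf87ccc68', 'SRR1601121'],
    ['NA18945', '9327fb44eb81b49a41e38c8d86eb3b3a', 'SRR1601115'],
    ['NA18943', '9f38253b281c7e9c99e4bdbececd8e2f', 'SRR1606910'],
]

# Assumed column names, taken from the query text (runQuery does not return them here)
columns = ['sample_name', 'bam_drs_id', 'acc']

# Pair each value with its column name so later code can use keys instead of positions
records = [dict(zip(columns, row)) for row in results]

for rec in records:
    print(rec['sample_name'], rec['bam_drs_id'])
```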

The following shows how the SRA DRS server can be used to determine where a file can be obtained from, using the first DRS id from the query results.

In [33]:
from fasp.loc import DRSClient

#drsClient = DRSMetaResolver()
drsClient = DRSClient('https://locate.be-md.ncbi.nlm.nih.gov', public=True)
test_id = results[0][1]
print(test_id)
objInfo = drsClient.getObject(test_id)
objInfo

fb1cfb04d3ef99d07c21f9dbf87ccc68


{'access_methods': [{'access_id': 'b5f46aadbcb48d7141104db0440feb63cd4e61c8',
   'region': 's3.us-east-1',
   'type': 'https'},
  {'access_id': '1bc0bc010f0edf4ef18af594acdba5db864db67e',
   'region': 'gs.US',
   'type': 'https'},
  {'access_id': '722d3466edf7ad5f6797f9774e21b368c45ad5b1', 'type': 'https'}],
 'checksums': [{'checksum': 'fb1cfb04d3ef99d07c21f9dbf87ccc68',
   'type': 'md5'}],
 'created_time': '2013-02-25T23:24:10Z',
 'id': 'fb1cfb04d3ef99d07c21f9dbf87ccc68',
 'name': 'NA18948.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam',
 'self_url': 'drs://locate.md-be.ncbi.nlm.nih.gov/fb1cfb04d3ef99d07c21f9dbf87ccc68',
 'size': 8752606127}

A second DRS call can be used to obtain a url to access the file from one of the above locations.

In [38]:
access_id = objInfo['access_methods'][0]['access_id']
print('access_id:{}'.format(access_id))
url = drsClient.getAccessURL(test_id, access_id=access_id)
print('url:{}'.format(url))

access_id:b5f46aadbcb48d7141104db0440feb63cd4e61c8
url:https://1000genomes.s3.amazonaws.com/phase3/data/NA18948/exome_alignment/NA18948.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam
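
For orientation, the two DRS calls correspond to two endpoints in the GA4GH DRS v1 specification: one for the object metadata and one for an access URL. The sketch below only builds the request URLs and makes no network calls. The helper names are hypothetical, and it is an assumption that the SRA server exposes the standard `/ga4gh/drs/v1` prefix and that `DRSClient` builds its requests this way.

```python
from urllib.parse import quote

# Path prefix defined by the GA4GH DRS v1 specification
DRS_PREFIX = '/ga4gh/drs/v1/objects'

def drs_object_url(base_url, object_id):
    """URL for the first call: metadata, including the access methods."""
    return '{}{}/{}'.format(base_url.rstrip('/'), DRS_PREFIX, quote(object_id, safe=''))

def drs_access_url(base_url, object_id, access_id):
    """URL for the second call: resolves one access_id to a fetchable URL."""
    return '{}/access/{}'.format(drs_object_url(base_url, object_id),
                                 quote(access_id, safe=''))

base = 'https://locate.be-md.ncbi.nlm.nih.gov'
drs_id = 'fb1cfb04d3ef99d07c21f9dbf87ccc68'
print(drs_object_url(base, drs_id))
print(drs_access_url(base, drs_id, 'b5f46aadbcb48d7141104db0440feb63cd4e61c8'))
```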


Generally, FASPRunner encapsulates the steps of obtaining DRS ids and submitting a WES task. FASPRunner currently relies on a convention in which the access_id itself carries meaning, e.g. s3 or gs.us. That appears to be a convention rather than part of the spec. It might be a useful convention, but it is probably inadequate: it is necessary to specify to DRS the exact region where we want to access the file, not just a cloud system. In fact gs.us is valid as a region; s3 is not. As of today the Seven Bridges DRS server uses the full region name as access_id. I don't believe that has always been the case. Something else in flux.

A useful convention might be for access_id to be the region name, since that makes things convenient.

The SRA DRS server does not follow this convention, and the DRS specification does not require it to do so. Instead the SRA DRS server returns what is, to us as external users, an arbitrary id. In this case we must iterate over the access methods to find the one that serves data from the desired region, and take its access_id.
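
That lookup can be sketched as a small helper. The function name `find_access_id` is hypothetical and not part of fasp; the sample dictionary is trimmed from the `getObject` output shown earlier. Note that one of the access methods has no `region` key, so the code must not assume it is present.

```python
def find_access_id(obj_info, region):
    """Return the access_id of the first access method in `region`, or None."""
    for method in obj_info.get('access_methods', []):
        # Some access methods carry no 'region' key, so use .get()
        if method.get('region') == region:
            return method.get('access_id')
    return None

# Trimmed copy of the DRS object returned for the first query result
obj_info = {
    'access_methods': [
        {'access_id': 'b5f46aadbcb48d7141104db0440feb63cd4e61c8',
         'region': 's3.us-east-1', 'type': 'https'},
        {'access_id': '1bc0bc010f0edf4ef18af594acdba5db864db67e',
         'region': 'gs.US', 'type': 'https'},
        {'access_id': '722d3466edf7ad5f6797f9774e21b368c45ad5b1', 'type': 'https'},
    ],
}

print(find_access_id(obj_info, 's3.us-east-1'))
print(find_access_id(obj_info, 's3.eu-west-1'))  # no such region for this object
```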

Continuing with our example, we still create an instance of FASPRunner, but only for logging purposes, and to access local settings.

In [2]:
# The program value is used simply to log which script or notebook submitted WES requests via FASPRunner
from fasp.runner import FASPRunner
faspRunner = FASPRunner(program='SRAExample')
settings = faspRunner.settings

from fasp.workflow import sbWESClient

# Settings for which instance and project to use are stored in settings, or you may enter values directly
wesClient = sbWESClient(settings['SevenBridgesInstance'], settings['SevenBridgesProject'],
                        '~/.keys/sbcgc_key.json')

This is the loop that would normally be within FASPRunner, but because of the different approach to access_id we will write a custom version.

In [39]:
import datetime

# set the region we want to access data from
region = 's3.us-east-1'
# and some metadata for logging purposes
via = 'SB WES'
note = 'SRA DRS Thousand Genomes'
        
# collect one status row per query result
resRows = []

# repeat steps 2 and 3 for each row of the query
for row in results:

    print("subject={}, drsID={}".format(row[0], row[1]))
    drs_id = row[1]
    resRow = [row[0], drs_id]

    # Step 2 - Use DRS to get the URL
    objInfo = drsClient.getObject(drs_id)
    # Extract the access method for the region where we want to work with the data
    # Not sure whether to be in awe of the power of Python, or to fear its obscurity!
    # This would probably be better encapsulated (hidden away) in a DRS client
    acc_method = [d for d in objInfo['access_methods'] if 'region' in d and d['region'] == region]
    # Guard against the file not being available in the requested region
    url = None
    if acc_method:
        access_id = acc_method[0]['access_id']
        url = drsClient.getAccessURL(drs_id, access_id=access_id)
    fileSize = objInfo['size']

    # Step 3 - Run a pipeline on the file at the drs url
    if url is not None:
        outfile = "{}.txt".format(row[0])
        time = datetime.datetime.now().strftime("%m/%d/%Y, %H:%M:%S")
        run_id = wesClient.runWorkflow(url, outfile)
        print('Submitted run {} to {}'.format(run_id, wesClient.__class__.__name__))
        faspRunner.logRun(time, via, note,  run_id, outfile, fileSize,
            searchClient, drsClient, wesClient)
        resRow.append('OK')
    else:
        print('could not get DRS url')
        resRow.append('unauthorized')
    resRows.append(resRow)
    print('_________________________________________________________________________')


subject=NA18948, drsID=fb1cfb04d3ef99d07c21f9dbf87ccc68
Submitted run 966cc432-3b77-448a-8762-24993a4decfe to sbWESClient
_________________________________________________________________________
subject=NA18945, drsID=9327fb44eb81b49a41e38c8d86eb3b3a
Submitted run f898f288-6b87-4ed9-a1b6-cd871ef4e6bb to sbWESClient
_________________________________________________________________________
subject=NA18943, drsID=9f38253b281c7e9c99e4bdbececd8e2f
Submitted run a9f92fad-a515-40c7-a5a5-51a536e142f5 to sbWESClient
_________________________________________________________________________


QED, but this does highlight some considerations about whether any conventions about access_id should be a) hardened and b) made part of the specification.
* Hardening likely would mean the convention should be to use region as the access id
* That might be sufficient to make the convention part of the specification.

It would certainly seem useful to have the capability to ask DRS for the URL that gives access to the data in a specified region. On the other hand, our wishes about which region to use are irrelevant if a particular file is not available in that region. In that case we have to look at where it is available and work with that. For small files that may just consist of downloading the file to wherever it is convenient to work with it. For large files (BAMs, CRAMs, large images) we likely want to use the access information to determine where to run the workflow.

For applicability across a broad range of DRS services it's hard to see how to avoid iterating over access methods and having the code respond to what it finds. Changing the specification to adopt the convention doesn't benefit this scenario. In more constrained use cases, accessing fewer DRS services with known behavior, the convenience of getAccessURL(drsid, region) would be of benefit.
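
One way to make code "respond to what it finds" is to pass an ordered list of acceptable regions and fall back to whatever the server offers when none of them is available. This is only a sketch: `choose_access_method` is a hypothetical helper, not part of fasp or the DRS specification, and the fallback policy (take the first listed method) is an arbitrary choice.

```python
def choose_access_method(obj_info, preferred_regions):
    """Pick an access method by region preference, else the first one offered."""
    methods = obj_info.get('access_methods', [])
    for region in preferred_regions:
        for method in methods:
            if method.get('region') == region:
                return method
    # None of the preferred regions is offered: work with what is available
    return methods[0] if methods else None

# Access methods trimmed from the SRA DRS response shown earlier
obj_info = {
    'access_methods': [
        {'access_id': 'b5f46aadbcb48d7141104db0440feb63cd4e61c8',
         'region': 's3.us-east-1', 'type': 'https'},
        {'access_id': '1bc0bc010f0edf4ef18af594acdba5db864db67e',
         'region': 'gs.US', 'type': 'https'},
    ],
}

chosen = choose_access_method(obj_info, ['gs.EU', 'gs.US'])
print(chosen['region'], chosen['access_id'])
```

The returned method includes its region, so the caller can use it to decide where to run a workflow on a large file rather than only where to download from.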

## Getting the results

In [3]:
runLog = wesClient.GetRunLog('a9f92fad-a515-40c7-a5a5-51a536e142f5')
runLog

{'request': {'tags': {},
  'workflow_params': {'name': 'SAMtools Stats 1.8 run - 01-20-21 15:15:06',
   'project': 'forei/gecco',
   'inputs': {'total_memory_GB': None,
    'coverage_limit': None,
    'include_only_read_group': None,
    'remove_duplicates': None,
    'max_insert_size': None,
    'reference_file': {'path': 'drs://cgc-ga4gh-api.sbgenomics.com/5bad6c83e4b0abc138917143',
     'name': 'references-hs37d5-hs37d5.fasta',
     'class': 'File'},
    'output_file_path': 'NA18943.txt',
    'alignment_file_url': 'https://1000genomes.s3.amazonaws.com/phase3/data/NA18943/exome_alignment/NA18943.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam'}},
  'workflow_type': 'CWL',
  'workflow_engine_params': {},
  'workflow_url': 'sbg://forei/gecco/samtools-stats-1-8-url'},
 'state': 'COMPLETE',
 'outputs': {'statistics': {'path': 'drs://cgc-ga4gh-api.sbgenomics.com/60084900e4b09cae7234aa83',
   'name': '_5_NA18943.txt',
   'class': 'File'}},
 'run_id': 'a9f92fad-a515-40c7-a5a5-51a536e142f5',
 'ru

Use the Seven Bridges CGC DRS service to retrieve the output file

In [11]:
from fasp.loc import sbcgcDRSClient
resultsDRS = sbcgcDRSClient('~/.keys/sevenbridges_keys.json', 's3')
resultsDRSID = '60084900e4b09cae7234aa83'
url = resultsDRS.getAccessURL(resultsDRSID)
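
The DRS id used above is the last component of `outputs.statistics.path` in the run log. Rather than copying it by hand, it could be extracted from the `drs://` URI. The sketch below assumes the hostname-based URI form `drs://<host>/<id>` seen in the run log; `split_drs_uri` is a hypothetical helper and `run_outputs` is copied from the log output rather than fetched.

```python
from urllib.parse import urlparse

def split_drs_uri(drs_uri):
    """Split a hostname-based DRS URI into (host, object_id)."""
    parsed = urlparse(drs_uri)
    if parsed.scheme != 'drs':
        raise ValueError('not a drs:// URI: {}'.format(drs_uri))
    return parsed.netloc, parsed.path.lstrip('/')

# The 'outputs' section of the run log shown above
run_outputs = {
    'statistics': {
        'path': 'drs://cgc-ga4gh-api.sbgenomics.com/60084900e4b09cae7234aa83',
        'name': '_5_NA18943.txt',
        'class': 'File',
    },
}

host, object_id = split_drs_uri(run_outputs['statistics']['path'])
print(host, object_id)
```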

In [9]:
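import requests
import os

def download(url, file_path):
    """Download url to file_path, raising on an HTTP error status."""
    response = requests.get(url)
    # Fail before creating the file if the URL was rejected or has expired
    response.raise_for_status()
    with open(os.path.expanduser(file_path), "wb") as file:
        file.write(response.content)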

fileDetails = resultsDRS.getObject(resultsDRSID)
fullPath = './' + fileDetails['name']
download(url, fullPath)       