## Accessing SRA Controlled Access data
This notebook explores using GA4GH DRS to access data stored in the cloud for a controlled access dbGaP project - [phs001554](https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001554.v1.p1) "Detection of Colorectal Cancer Susceptibility Loci Using Genome-Wide Sequencing".

A query through the GA4GH Search API, filtering on subject and file attributes, identifies a set of DRS ids.

>Note: The subject and sample data searched through the initial query are on a scrambled copy of the data so that no controlled access records remain intact. Please do not attempt to draw meaningful associations between the subject and sample attributes with the genomic data; this would lead to erroneous conclusions. These examples serve only to illustrate future possibilities, in the hope of identifying ways to make those possibilities real.
>
>The sequence files used in this notebook are also under controlled access. You will not be able to access those files unless you have been granted access to the phs001554 study through dbGaP. If you have that access, you will also have access to the non-scrambled subject and sample data. In that case, we would be pleased to hear of your interest and to explore collaborating to ensure the GA4GH tools described here enable your studies. Please see the [form to register interest](https://docs.google.com/forms/d/e/1FAIpQLSfmmc3VKd6ANdzaVMyelT3c9gIWuoS4ZwT0vsqD-o2ZRxJf7A/viewform).

In [1]:
from fasp.search import DataConnectClient
cl = DataConnectClient('https://ga4gh-search-adapter-presto-public.prod.dnastack.com/')
query = '''select acc, sa.sample_id, sra_drs_id, sex, age
from search_cloud.cshcodeathon.gecco_sra_drs_index i
join dbgap_demo.scr_gecco_susceptibility.sample_multi sa on sa.sample_id = i.sample_id
join dbgap_demo.scr_gecco_susceptibility.subject_phenotypes_multi su on su.dbgap_subject_id = sa.dbgap_subject_id
where age between 50 and 55
and affection_status = 'Case'
and file_type = 'cram' limit 10'''
df = cl.runQuery(query, returnType='dataframe')
df



_Retrieving the query_
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________
____Page6_______________
____Page7_______________
____Page8_______________


| | acc | sample_id | sra_drs_id | sex | age |
|---|---|---|---|---|---|
| 0 | SRR7271789 | 117486 | 2b070565053062762fed8a1f61fd5c91 | Male | 55 |
| 1 | SRR7271937 | 117674 | 8e9c797b4355262ffc196c599feaa9a1 | Female | 52 |
| 2 | SRR7271969 | 117712 | f0c0c37b8b86dabbe0e91a7f6992190d | Male | 55 |
| 3 | SRR7271990 | 117734 | c15f6394b7587f5ce555477b6a01c192 | Male | 55 |
| 4 | SRR7272010 | 117757 | 8509c4f03ffc2887eb2ff8e43a08600d | Female | 55 |
| 5 | SRR7272065 | 117839 | 29776be9ed54a784436b85eb3dd83707 | Female | 54 |
| 6 | SRR7272074 | 117851 | 8c1ff55db5e9939dc8edeacdf5bbacdc | Male | 55 |
| 7 | SRR7272075 | 117852 | b0353ec6fa92d9512ac8212c947f935b | Male | 50 |
| 8 | SRR7272080 | 117861 | 4b39f0ae4d472f7a53c59de2a055ced5 | Female | 53 |
| 9 | SRR7272088 | 117876 | 81ee4a7ff13793ed914cdd913ca10b45 | Male | 52 |


In [2]:
from fasp.loc import SRADRSClient
drsClient = SRADRSClient('https://locate.be-md.ncbi.nlm.nih.gov', public=True)
# extract an example id from the results above
example_id = df.at[3,'sra_drs_id']
# Use DRS to find locations for the file
res_list = drsClient.getObject(example_id)
res_list

{'access_methods': [{'access_id': 'd8978ce9600021f666e6d69c0c019b61729e919dcd8595961d7af407efad25a7',
   'region': 'gs.US',
   'type': 'https'},
  {'access_id': '86cc5b67877fa4ced634f19b5acb292f5d2213cadb58c46465a1bdbfe12674b9',
   'region': 's3.us-east-1',
   'type': 'https'}],
 'checksums': [{'checksum': 'c15f6394b7587f5ce555477b6a01c192',
   'type': 'md5'}],
 'created_time': '2018-06-11T10:29:04Z',
 'id': 'c15f6394b7587f5ce555477b6a01c192',
 'name': '117734.recal.cram',
 'self_url': 'drs://locate.be-md.ncbi.nlm.nih.gov/c15f6394b7587f5ce555477b6a01c192',
 'size': 37405850269}
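
The `self_url` in the response above is a hostname-based DRS URI. Under the GA4GH DRS v1 specification, a URI of the form `drs://<hostname>/<id>` corresponds to the HTTPS endpoint `https://<hostname>/ga4gh/drs/v1/objects/<id>`. The following is a minimal sketch of that mapping. The helper name `drs_uri_to_https` is hypothetical and not part of `fasp`, and whether this particular server answers at exactly that path is an assumption.

```python
from urllib.parse import urlparse


def drs_uri_to_https(drs_uri: str) -> str:
    """Map a hostname-based DRS URI to its DRS v1 object endpoint.

    Hypothetical helper for illustration; not part of the fasp package.
    """
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs" or not parsed.netloc:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri}")
    object_id = parsed.path.lstrip("/")
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{object_id}"


# The self_url reported by the DRS response above
example_uri = "drs://locate.be-md.ncbi.nlm.nih.gov/c15f6394b7587f5ce555477b6a01c192"
example_endpoint = drs_uri_to_https(example_uri)
print(example_endpoint)
```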

Use a custom client method to get a URL for the file in a specified region.

In [3]:
drsClient.getAccessURLRegion(example_id, 's3.us-east-1')


Unauthorized for that DRS id


In [4]:
drsClient.getAccessURL(example_id,'d8978ce9600021f666e6d69c0c019b61729e919dcd8595961d7af407efad25a7')

Unauthorized for that DRS id
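
The two calls above target the same object: the first names a region, the second passes the `access_id` listed for `gs.US` directly. A region-based lookup presumably resolves the region to an `access_id` from the `access_methods` list before requesting a URL. That is an assumption about how `SRADRSClient` works, not something the output confirms. A minimal sketch of such a lookup, using a hypothetical helper `find_access_id` and the `access_methods` returned earlier:

```python
from typing import Optional


def find_access_id(drs_object: dict, region: str) -> Optional[str]:
    """Return the access_id of the first access method in the given region.

    Hypothetical helper for illustration; returns None when no method matches.
    """
    for method in drs_object.get("access_methods", []):
        if method.get("region") == region:
            return method.get("access_id")
    return None


# access_methods copied from the getObject response shown earlier
drs_object = {
    "access_methods": [
        {"access_id": "d8978ce9600021f666e6d69c0c019b61729e919dcd8595961d7af407efad25a7",
         "region": "gs.US", "type": "https"},
        {"access_id": "86cc5b67877fa4ced634f19b5acb292f5d2213cadb58c46465a1bdbfe12674b9",
         "region": "s3.us-east-1", "type": "https"},
    ]
}

s3_access_id = find_access_id(drs_object, "s3.us-east-1")
print(s3_access_id)
```

Either route still requires authorization for the study; without it the server responds as shown above.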


### For interest: Compare the same file on Seven Bridges

In [3]:
query2 = "select * from dbgap_demo.scr_gecco_susceptibility.sb_drs_index where sample_id = '117497' and file_type = 'cram'"
cl.runQuery(query2)

_Retrieving the query_
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________


[['117497', 'cram', '5ba922a0e4b0db63859cd973']]

In [6]:
from fasp.loc import sbcgcDRSClient
drsClient = sbcgcDRSClient('/Users/forei/.keys/sevenbridges_keys.json', 's3')
drsClient.getObject('5ba922a0e4b0db63859cd973')

{'id': '5ba922a0e4b0db63859cd973',
 'name': '117497.recal.cram',
 'size': 36933985367,
 'checksums': [{'type': 'etag',
   'checksum': '3e4e93345c7b74540706416f51959092-4403'}],
 'self_uri': 'drs://cgc-ga4gh-api.sbgenomics.com/5ba922a0e4b0db63859cd973',
 'created_time': '2018-09-24T17:45:04Z',
 'updated_time': '2018-11-09T15:56:37Z',
 'mime_type': 'application/json',
 'access_methods': [{'type': 's3',
   'region': 'us-east-1',
   'access_id': 'aws-us-east-1'}]}
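
The two DRS responses in this notebook differ in shape: the SRA record reports `self_url` and an `md5` checksum, while the Seven Bridges record reports `self_uri` and an `etag` checksum. Note also that the records shown are for different samples (117734 and 117497), so their sizes are not expected to match. The sketch below puts both records into a common summary so the differences are easy to see. The helper names `summarize` and `shared_checksum_types` are hypothetical, and the dictionaries are trimmed copies of the outputs above.

```python
def summarize(drs_object: dict) -> dict:
    """Reduce a DRS object to fields that can be compared across servers."""
    return {
        "name": drs_object.get("name"),
        "size": drs_object.get("size"),
        # The two servers in this notebook use different keys for the self reference
        "self_uri": drs_object.get("self_uri") or drs_object.get("self_url"),
        "checksum_types": sorted(c["type"] for c in drs_object.get("checksums", [])),
        "regions": sorted(m.get("region", "") for m in drs_object.get("access_methods", [])),
    }


def shared_checksum_types(a: dict, b: dict) -> list:
    """Checksum types present in both objects, usable for a direct comparison."""
    return sorted(set(summarize(a)["checksum_types"]) & set(summarize(b)["checksum_types"]))


# Trimmed copies of the two responses shown in this notebook
sra_obj = {
    "name": "117734.recal.cram",
    "size": 37405850269,
    "self_url": "drs://locate.be-md.ncbi.nlm.nih.gov/c15f6394b7587f5ce555477b6a01c192",
    "checksums": [{"checksum": "c15f6394b7587f5ce555477b6a01c192", "type": "md5"}],
    "access_methods": [{"region": "gs.US", "type": "https"},
                       {"region": "s3.us-east-1", "type": "https"}],
}
sb_obj = {
    "name": "117497.recal.cram",
    "size": 36933985367,
    "self_uri": "drs://cgc-ga4gh-api.sbgenomics.com/5ba922a0e4b0db63859cd973",
    "checksums": [{"type": "etag", "checksum": "3e4e93345c7b74540706416f51959092-4403"}],
    "access_methods": [{"type": "s3", "region": "us-east-1"}],
}

sra_summary = summarize(sra_obj)
sb_summary = summarize(sb_obj)
shared = shared_checksum_types(sra_obj, sb_obj)
print(sra_summary)
print(sb_summary)
print("shared checksum types:", shared)
```

With these two records no checksum type is shared, so even for the same file an integrity comparison across the two servers would need an additional checksum computed locally.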

### For interest: Compare the same file on NCI Data Commons Framework (Gen3)
To do.