## Comparing approaches: SRA DRS and submitting to Seven Bridges WES
This notebook explores two approaches:

- Extracting a specific file of interest from a bundle
- Identifying the specific file of interest from the data



For context, another notebook shows how the files identified via the approaches here can be submitted for compute via a WES service.

The data is from the Thousand Genomes project. The following query shows how the DRS ids for mapped BAM files from whole exome sequencing, for subjects from a particular population, can be obtained in a single step.

In [27]:
from fasp.search import DiscoverySearchClient

# Step 1 - Discovery
# query for relevant DRS objects
searchClient = DiscoverySearchClient('https://ga4gh-search-adapter-presto-public.prod.dnastack.com/')

query = '''SELECT f.sample_name, drs_id bam_drs_id, acc
FROM thousand_genomes.onek_genomes.ssd_drs s 
join thousand_genomes.onek_genomes.sra_drs_files f on f.sample_name = s.su_submitter_id 
where filetype = 'bam' and mapped = 'mapped' 
and sequencing_type ='exome' and  population = 'JPT' '''

resultRows = searchClient.runQuery(query, returnType='dataframe')
resultRows

_Retrieving the query_
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________
____Page6_______________


Unnamed: 0,sample_name,bam_drs_id,acc
0,NA18948,fb1cfb04d3ef99d07c21f9dbf87ccc68,SRR1601121
1,NA18945,9327fb44eb81b49a41e38c8d86eb3b3a,SRR1601115
2,NA18943,9f38253b281c7e9c99e4bdbececd8e2f,SRR1606910
3,NA18944,5aff9cee759c930666e94e65dbb0af94,SRR1601113
4,NA18940,333a651b55970c9402db51ebb5e55d09,SRR1607212
...,...,...,...
99,NA19074,0805baa0849485a2a63ea41429b9b37c,SRR1604135
100,NA19081,cb072733f15565af2790a90efe60b0e1,SRR1598082
101,NA19080,6f9f1fc52166530ed0568d61451b032f,SRR1598080
102,NA19087,b5f9609124241ade815fe49e2eb38c4f,SRR1603951
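
As a minimal sketch of what the join and filters in the query do, the following runs the same query shape against toy in-memory tables using Python's `sqlite3`. Which column belongs to which table is an assumption (the original query leaves most columns unqualified), as is the `'bai'` filetype value. The toy rows reuse ids that appear elsewhere in this notebook.

```python
import sqlite3

# Toy stand-ins for the two tables; the column-to-table assignment is assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ssd_drs (su_submitter_id TEXT, population TEXT);
CREATE TABLE sra_drs_files (
    sample_name TEXT, drs_id TEXT, acc TEXT,
    filetype TEXT, mapped TEXT, sequencing_type TEXT
);
""")
conn.executemany(
    "INSERT INTO ssd_drs VALUES (?, ?)",
    [("NA18948", "JPT"), ("HG00096", "GBR")],
)
conn.executemany(
    "INSERT INTO sra_drs_files VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("NA18948", "fb1cfb04d3ef99d07c21f9dbf87ccc68", "SRR1601121", "bam", "mapped", "exome"),
        ("NA18948", "519de9933298caa8bdf551351426d120", "SRR1601121", "bam", "unmapped", "exome"),
        ("NA18948", "a027e7c2a917cba582a9684244ad339d", "SRR1601121", "bai", "mapped", "exome"),
        ("HG00096", "5d4ae7a46d470036d99429c363498965", "SRR1596638", "bam", "mapped", "exome"),
    ],
)

# Same shape as the discovery query: join subjects to files, then filter
# down to the mapped exome BAM for one population.
rows = conn.execute("""
SELECT f.sample_name, f.drs_id AS bam_drs_id, f.acc
FROM ssd_drs s
JOIN sra_drs_files f ON f.sample_name = s.su_submitter_id
WHERE f.filetype = 'bam' AND f.mapped = 'mapped'
  AND f.sequencing_type = 'exome' AND s.population = 'JPT'
""").fetchall()
print(rows)
```

Only the mapped BAM for the JPT subject survives the filters; the unmapped BAM, the index file and the GBR subject are all excluded.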


In [28]:
searchClient.query2Frame("select count(*) rowCount from thousand_genomes.onek_genomes.ssd_drs")

_Retrieving the query_
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________


Unnamed: 0,rowCount
0,2504


In [29]:
searchClient.query2Frame("select count(*) rowCount from thousand_genomes.onek_genomes.sra_drs_files")

_Retrieving the query_
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________


Unnamed: 0,rowCount
0,12313


The SRA DRS server can be used to determine where the files can be obtained from. The following shows this for the first DRS id from the query results.

In [None]:
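from fasp.loc import DRSClient

#drsClient = DRSMetaResolver()
drsClient = DRSClient('https://locate.be-md.ncbi.nlm.nih.gov', public=True)
# resultRows is the dataframe returned by the discovery query above;
# take the DRS id of the mapped BAM file from its first row
test_id = resultRows.iloc[0]['bam_drs_id']
print(test_id)
# objDetails is used by the next cell to pick an access method
objDetails = drsClient.getObject(test_id)
objDetails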

A second DRS call can be used to obtain a url to access the file from one of the above locations.

In [None]:
access_id = objDetails['access_methods'][0]['access_id']
print('access_id:{}'.format(access_id))
url = drsClient.getAccessURL(test_id, access_id=access_id)
print('url:{}'.format(url))
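
To make the two calls concrete, here is a hedged sketch of the HTTP endpoints they correspond to. The path prefix comes from the GA4GH DRS v1 specification. That the SRA server serves DRS under exactly this prefix, and that `DRSClient` builds its requests this way, are assumptions. The helper names are hypothetical and no network request is made.

```python
from urllib.parse import quote

# Path prefix from the GA4GH DRS v1 specification. That the SRA server
# uses exactly this prefix is an assumption.
DRS_PREFIX = "/ga4gh/drs/v1"


def object_url(base_url, object_id):
    """URL for the first call: returns the DrsObject metadata."""
    return "{}{}/objects/{}".format(
        base_url.rstrip("/"), DRS_PREFIX, quote(object_id, safe="")
    )


def access_url(base_url, object_id, access_id):
    """URL for the second call: returns an access URL for one access method."""
    return "{}/access/{}".format(
        object_url(base_url, object_id), quote(access_id, safe="")
    )


base = "https://locate.be-md.ncbi.nlm.nih.gov"
obj = object_url(base, "fb1cfb04d3ef99d07c21f9dbf87ccc68")
acc = access_url(base, "fb1cfb04d3ef99d07c21f9dbf87ccc68",
                 "b5f46aadbcb48d7141104db0440feb63cd4e61c8")
print(obj)
print(acc)
```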

The approach above is the one used in the full FASP example that runs the compute. Refer to that notebook for details.

Here we'll continue with the second approach to working with SRA ids.

It addresses one aspect of bundling.

Use the Seven Bridges CGC DRS service to retrieve the output file

## The SRA IDX and DRS services
Can we take an id from above and see what it looks like through IDX, and how that carries through to DRS? We'll start with a run accession, an SRR.

In [1]:
import json
from fasp.loc import SRADRSClient
drsClient = SRADRSClient('https://locate.be-md.ncbi.nlm.nih.gov', public=True)

# a useful id SRR1601905
accession = 'SRR1601121'
idx = drsClient.acc2drs(accession)
print(json.dumps(idx, indent=3))
drsId = idx['response'][accession]['drs']
print (drsId)



{
   "drs-base": "drs://locate.ncbi.nlm.nih.gov",
   "response": {
      "SRR1601121": {
         "drs": "9466d7c1ec8fde019ce630c9bd88582e",
         "status_code": 200
      }
   }
}
9466d7c1ec8fde019ce630c9bd88582e
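
The response shape above can be unpacked a little more defensively. The following is a sketch only: `drs_uri_from_idx` is a hypothetical helper, not part of `fasp`, and joining `drs-base` with the id to form a DRS URI is an assumption about how the two fields are meant to be combined.

```python
def drs_uri_from_idx(idx_response, accession):
    """Return (drs_id, drs_uri) for an accession in an IDX-style response."""
    entry = idx_response["response"].get(accession)
    # Treat a missing entry or a non-200 status as "not resolved".
    if entry is None or entry.get("status_code") != 200:
        raise LookupError("no DRS id resolved for {}".format(accession))
    drs_id = entry["drs"]
    drs_uri = "{}/{}".format(idx_response["drs-base"].rstrip("/"), drs_id)
    return drs_id, drs_uri


# The response printed above, copied as a literal.
sample = {
    "drs-base": "drs://locate.ncbi.nlm.nih.gov",
    "response": {
        "SRR1601121": {
            "drs": "9466d7c1ec8fde019ce630c9bd88582e",
            "status_code": 200,
        }
    },
}
resolved = drs_uri_from_idx(sample, "SRR1601121")
print(resolved)
```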


Now use the DRS service with that id.

Note: the base URI returned in the result above suggests the DRS service could be accessed at https://locate.ncbi.nlm.nih.gov. At present, for performance reasons, the SRA DRS service should be accessed at https://locate.be-md.ncbi.nlm.nih.gov, as in the example above.

In [8]:
drsClient.getObject(drsId)

{'checksums': [{'checksum': '9466d7c1ec8fde019ce630c9bd88582e',
   'type': 'md5'}],
 'contents': [{'id': '519de9933298caa8bdf551351426d120',
   'name': 'NA18948.unmapped.ILLUMINA.bwa.JPT.exome.20121211.bam'},
  {'id': 'a027e7c2a917cba582a9684244ad339d',
   'name': 'NA18948.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam.bai'},
  {'id': 'fb1cfb04d3ef99d07c21f9dbf87ccc68',
   'name': 'NA18948.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam'}],
 'created_time': '2013-02-25T23:24:10Z',
 'id': '9466d7c1ec8fde019ce630c9bd88582e',
 'name': 'SRR1601121',
 'self_url': 'drs://locate.md-be.ncbi.nlm.nih.gov/9466d7c1ec8fde019ce630c9bd88582e',
 'size': 8763581919}

Work with the mapped BAM file, using its id from the `contents` list above.

In [9]:
drsClient.getObject('fb1cfb04d3ef99d07c21f9dbf87ccc68')

{'access_methods': [{'access_id': 'b5f46aadbcb48d7141104db0440feb63cd4e61c8',
   'region': 's3.us-east-1',
   'type': 'https'},
  {'access_id': '1bc0bc010f0edf4ef18af594acdba5db864db67e',
   'region': 'gs.US',
   'type': 'https'},
  {'access_id': '722d3466edf7ad5f6797f9774e21b368c45ad5b1', 'type': 'https'}],
 'checksums': [{'checksum': 'fb1cfb04d3ef99d07c21f9dbf87ccc68',
   'type': 'md5'}],
 'created_time': '2013-02-25T23:24:10Z',
 'id': 'fb1cfb04d3ef99d07c21f9dbf87ccc68',
 'name': 'NA18948.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam',
 'self_url': 'drs://locate.md-be.ncbi.nlm.nih.gov/fb1cfb04d3ef99d07c21f9dbf87ccc68',
 'size': 8752606127}
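
Rather than copying the id by hand, the file of interest can be picked out of the bundle by name. This is a minimal sketch, assuming the file naming seen in the output above (`.mapped.` in the name, `.bam` suffix) holds for other runs. `find_in_bundle` is a hypothetical helper.

```python
def find_in_bundle(bundle, must_contain=(), suffix=""):
    """Return contents entries whose name has every substring and the suffix."""
    return [
        entry
        for entry in bundle.get("contents", [])
        if entry["name"].endswith(suffix)
        and all(part in entry["name"] for part in must_contain)
    ]


# The 'contents' of the run bundle shown earlier, copied as a literal.
run_bundle = {
    "name": "SRR1601121",
    "contents": [
        {"id": "519de9933298caa8bdf551351426d120",
         "name": "NA18948.unmapped.ILLUMINA.bwa.JPT.exome.20121211.bam"},
        {"id": "a027e7c2a917cba582a9684244ad339d",
         "name": "NA18948.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam.bai"},
        {"id": "fb1cfb04d3ef99d07c21f9dbf87ccc68",
         "name": "NA18948.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam"},
    ],
}

# '.mapped.' (with the leading dot) does not match 'unmapped',
# and the '.bam' suffix excludes the '.bam.bai' index.
matches = find_in_bundle(run_bundle, must_contain=(".mapped.",), suffix=".bam")
print(matches[0]["id"])
```

Matching on file names is fragile, which is part of the point of this notebook: the bundle itself carries no metadata saying which member is the mapped BAM.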

## An SRP accession

- SRP - Project
- SRS - Sample
- SRX - Experiment
- SRR - Run
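
As a small illustration of the prefixes listed above, the level of an accession can be read off its first three letters. `accession_level` is a hypothetical helper, and the pattern (prefix followed by digits only) is an assumption based on the accessions used in this notebook.

```python
import re

# Prefix-to-level mapping, as listed above.
LEVELS = {"SRP": "project", "SRS": "sample", "SRX": "experiment", "SRR": "run"}


def accession_level(accession):
    """Return the logical level an SRA accession refers to."""
    match = re.fullmatch(r"(SR[PSXR])\d+", accession)
    if match is None:
        raise ValueError("not a recognised SRA accession: {!r}".format(accession))
    return LEVELS[match.group(1)]


for acc in ("SRP048601", "SRX719843", "SRR1596638"):
    print(acc, accession_level(acc))
```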

In [21]:
%%time
srp = 'SRP048601'
res = drsClient.acc2drs(srp, verbose=True)

#SRS000157
drsId = res['response'][srp]['drs']
print(drsId)

https://locate.be-md.ncbi.nlm.nih.gov/idx/v1/SRP048601
<Response [200]>
5d8b77dd974e1b7c9de4040cbf9a24c7
CPU times: user 15.3 ms, sys: 5.51 ms, total: 20.8 ms
Wall time: 45 s


The response took 45 seconds.

The DRS id for the project can now be sent to the DRS server.

Note the response time: although the server needs to return a list of 5070 ids, it does so in less than a second.

In [26]:
%%time
drsRes = drsClient.getObject(drsId)
print(len(drsRes['contents']))

5070
CPU times: user 24.2 ms, sys: 6.76 ms, total: 31 ms
Wall time: 526 ms


In [12]:
print(res)

{'drs-base': 'drs://locate.ncbi.nlm.nih.gov', 'response': {'SRP048601': {'drs': '5d8b77dd974e1b7c9de4040cbf9a24c7', 'status_code': 200}}}


In [23]:
%%time
from fasp.search import BigQuerySearchClient
bqClient = BigQuerySearchClient()
query = "SELECT experiment FROM `nih-sra-datastore.sra.metadata` where  sra_study = 'SRP048601'"
bqClient.runQuery(query)

CPU times: user 72.6 ms, sys: 19.7 ms, total: 92.2 ms
Wall time: 3.02 s


[Row(('SRX720195',), {'experiment': 0}),
 Row(('SRX720978',), {'experiment': 0}),
 Row(('SRX720992',), {'experiment': 0}),
 Row(('SRX721568',), {'experiment': 0}),
 Row(('SRX721490',), {'experiment': 0}),
 Row(('SRX721448',), {'experiment': 0}),
 Row(('SRX721464',), {'experiment': 0}),
 Row(('SRX721520',), {'experiment': 0}),
 Row(('SRX725764',), {'experiment': 0}),
 Row(('SRX728228',), {'experiment': 0}),
 Row(('SRX721412',), {'experiment': 0}),
 Row(('SRX721417',), {'experiment': 0}),
 Row(('SRX721167',), {'experiment': 0}),
 Row(('SRX727667',), {'experiment': 0}),
 Row(('SRX721686',), {'experiment': 0}),
 Row(('SRX725632',), {'experiment': 0}),
 Row(('SRX721226',), {'experiment': 0}),
 Row(('SRX725866',), {'experiment': 0}),
 Row(('SRX721730',), {'experiment': 0}),
 Row(('SRX725072',), {'experiment': 0}),
 Row(('SRX725069',), {'experiment': 0}),
 Row(('SRX724290',), {'experiment': 0}),
 Row(('SRX724258',), {'experiment': 0}),
 Row(('SRX723507',), {'experiment': 0}),
 Row(('SRX723540

One of the main conclusions here is to question the value of DRS ids for logical level constructs. This illustrates the problem for the SRA use case, but is likely to have more general applicability. For example, bundling has been talked about 

In [30]:
drsRes

{'checksums': [{'checksum': '5d8b77dd974e1b7c9de4040cbf9a24c7',
   'type': 'md5'}],
 'contents': [{'id': 'f2b7f3f7c123a38eb904c5412ce48757', 'name': 'SRX719457'},
  {'id': '16139c5b6f36034eb09768c17a90fd23', 'name': 'SRX719843'},
  {'id': '8fa664d99d3cc9fb701d15e026e14950', 'name': 'SRX719844'},
  {'id': 'a4165df1fcea2234c42128bcb1d26cc0', 'name': 'SRX719845'},
  {'id': '287a5d73a2ba5abf10d6bbcdb0b4ed42', 'name': 'SRX719846'},
  {'id': '4b995cc57ff3d4ebeac9684f2b9f7f7f', 'name': 'SRX719847'},
  {'id': 'b488ab01ce3fa83addea057153ec449c', 'name': 'SRX719848'},
  {'id': 'b3dd0d947f7e901bedf9f5789565ed07', 'name': 'SRX719849'},
  {'id': 'a3bfebcf770157458454986092aeda62', 'name': 'SRX719850'},
  {'id': '8165dc2b262ba94fdfd9a14bc7919fd4', 'name': 'SRX719851'},
  {'id': 'c5bb56de080acce3dea77220ee692df8', 'name': 'SRX719852'},
  {'id': 'cd076d9298259a33f27a9f0a86eb83a7', 'name': 'SRX719853'},
  {'id': '2b02a4df0fa40f716ea5cd63916e9443', 'name': 'SRX719854'},
  {'id': '35ae9b8ddaa72b43fa2b250

The objects returned are still not physical files (i.e. a set of bytes) but ids for another logical concept: the 'experiment'.

Calling DRS with the DRS id for the experiment.

In [31]:
drsClient.getObject('16139c5b6f36034eb09768c17a90fd23')

{'checksums': [{'checksum': '16139c5b6f36034eb09768c17a90fd23',
   'type': 'md5'}],
 'contents': [{'id': 'fd074040842ce8c2e114b4eed7accee0',
   'name': 'SRR1596638'}],
 'created_time': '2012-11-19T15:20:25Z',
 'id': '16139c5b6f36034eb09768c17a90fd23',
 'name': 'SRX719843',
 'self_url': 'drs://locate.md-be.ncbi.nlm.nih.gov/16139c5b6f36034eb09768c17a90fd23',
 'size': 9205789476}

We are now down to a level where we get a single drs id, though it is still an id for a logical level construct.

*Strictly, what the ids identify is the particular binary content of the file set corresponding to that logical id on a particular date. Essentially it's a version. Unlike GitHub, this doesn't have the characteristics of a versioning system, but that probably isn't needed: the need is to say "give me the same set of bytes that I, or someone else, got for the same id previously", and that need is fulfilled. What it does not support is "give me the fileset for this thing as it existed on such and such a date". It would not keep the file in sync with the state of the related data at the same point in time.

In [32]:
drsClient.getObject('fd074040842ce8c2e114b4eed7accee0')

{'checksums': [{'checksum': 'fd074040842ce8c2e114b4eed7accee0',
   'type': 'md5'}],
 'contents': [{'id': '37f0c2a65cc4b89d497d965332fa530b',
   'name': 'HG00096.unmapped.ILLUMINA.bwa.GBR.exome.20120522.bam'},
  {'id': '5d4ae7a46d470036d99429c363498965',
   'name': 'HG00096.mapped.ILLUMINA.bwa.GBR.exome.20120522.bam'}],
 'created_time': '2012-11-19T15:20:25Z',
 'id': 'fd074040842ce8c2e114b4eed7accee0',
 'name': 'SRR1596638',
 'self_url': 'drs://locate.md-be.ncbi.nlm.nih.gov/fd074040842ce8c2e114b4eed7accee0',
 'size': 9205789476}

Note that sizes and checksums are provided at the higher levels.

Is it clear whether I could use the checksum for its originally intended purpose? How would I know what to run MD5 against to get a value to compare with the reported checksum? (Perhaps that explains what the LifeBit team encountered.)
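
To make the question above concrete, here is a hedged sketch of the two candidate calculations. For a single file, MD5 is run over the file's bytes. For a bundle, the GA4GH DRS specification describes a checksum taken over the sorted, concatenated checksums of the bundle's top-level contents. Whether the SRA server computes its higher-level checksums that way is not established in this notebook, so `bundle_md5` is an illustration of the specification's rule, not a statement about SRA.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """MD5 of a file's bytes, read in chunks so large BAMs fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bundle_md5(content_checksums):
    """MD5 over the sorted, concatenated hex checksums of a bundle's contents.

    This follows the rule described in the DRS specification; it is an
    assumption that any given server applies it.
    """
    joined = "".join(sorted(content_checksums))
    return hashlib.md5(joined.encode("ascii")).hexdigest()


# Demonstrate the file case on a small temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
file_md5 = md5_of_file(tmp.name)
os.remove(tmp.name)
print(file_md5)

# The bundle case is order-independent because the checksums are sorted first.
members = ["37f0c2a65cc4b89d497d965332fa530b", "5d4ae7a46d470036d99429c363498965"]
print(bundle_md5(members) == bundle_md5(list(reversed(members))))
```

For a downloaded file, `md5_of_file` gives the value to compare against the `checksums` entry of the file-level DRS object. For the run, experiment and project levels there is no single byte stream to hash, which is the difficulty raised above.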

In [35]:
runDRS = drsClient.getObject('37f0c2a65cc4b89d497d965332fa530b')
runDRS

{'access_methods': [{'access_id': '32d4fc6b0f98ffe8711871d4c58b30359987ef47',
   'region': 's3.us-east-1',
   'type': 'https'},
  {'access_id': '7fb9cd8b87cb926ead5f1f4ddb3a2b13b19ecd02',
   'region': 'gs.US',
   'type': 'https'},
  {'access_id': 'ebb5fafbf402dd6b4e55557d2c6e1e494a682826', 'type': 'https'}],
 'checksums': [{'checksum': '37f0c2a65cc4b89d497d965332fa530b',
   'type': 'md5'}],
 'created_time': '2012-11-19T15:20:25Z',
 'id': '37f0c2a65cc4b89d497d965332fa530b',
 'name': 'HG00096.unmapped.ILLUMINA.bwa.GBR.exome.20120522.bam',
 'self_url': 'drs://locate.md-be.ncbi.nlm.nih.gov/37f0c2a65cc4b89d497d965332fa530b',
 'size': 8838568}
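
The step-by-step descent above (experiment, then run, then file) can be expressed as one recursive walk. This is a sketch under stated assumptions: an object is treated as a physical file if it has an `access_methods` key, and `get_object` is any callable that maps a DRS id to an object dictionary. An in-memory dictionary stands in for the server here, using ids and names from the outputs above. The access methods of the mapped BAM are not shown in this notebook, so they are left empty.

```python
def leaf_files(get_object, drs_id):
    """Yield every object under drs_id that has access_methods (a physical file)."""
    obj = get_object(drs_id)
    if "access_methods" in obj:
        yield obj
        return
    # Otherwise this is a logical-level object: descend into its contents.
    for entry in obj.get("contents", []):
        yield from leaf_files(get_object, entry["id"])


# In-memory stand-in for the DRS server (a subset of the fields shown above).
FAKE_SERVER = {
    "16139c5b6f36034eb09768c17a90fd23": {  # experiment
        "name": "SRX719843",
        "contents": [{"id": "fd074040842ce8c2e114b4eed7accee0",
                      "name": "SRR1596638"}],
    },
    "fd074040842ce8c2e114b4eed7accee0": {  # run
        "name": "SRR1596638",
        "contents": [
            {"id": "37f0c2a65cc4b89d497d965332fa530b",
             "name": "HG00096.unmapped.ILLUMINA.bwa.GBR.exome.20120522.bam"},
            {"id": "5d4ae7a46d470036d99429c363498965",
             "name": "HG00096.mapped.ILLUMINA.bwa.GBR.exome.20120522.bam"},
        ],
    },
    "37f0c2a65cc4b89d497d965332fa530b": {  # file
        "name": "HG00096.unmapped.ILLUMINA.bwa.GBR.exome.20120522.bam",
        "access_methods": [
            {"access_id": "32d4fc6b0f98ffe8711871d4c58b30359987ef47",
             "region": "s3.us-east-1", "type": "https"},
        ],
    },
    "5d4ae7a46d470036d99429c363498965": {  # file; access methods not shown above
        "name": "HG00096.mapped.ILLUMINA.bwa.GBR.exome.20120522.bam",
        "access_methods": [],
    },
}

names = [obj["name"]
         for obj in leaf_files(FAKE_SERVER.__getitem__,
                               "16139c5b6f36034eb09768c17a90fd23")]
print(names)
```

Against the real service one would presumably pass `drsClient.getObject` as `get_object`. Note that walking a project this way means one request per experiment, run and file, so for the 5070 experiments above it would be many thousands of calls.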

In [39]:
for am in runDRS['access_methods']:
    url = drsClient.getAccessURL('37f0c2a65cc4b89d497d965332fa530b',am['access_id'])
    print('{}\n'.format(url))

https://1000genomes.s3.amazonaws.com/phase3/data/HG00096/exome_alignment/HG00096.unmapped.ILLUMINA.bwa.GBR.exome.20120522.bam

https://storage.googleapis.com/genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase3/data/HG00096/exome_alignment/HG00096.unmapped.ILLUMINA.bwa.GBR.exome.20120522.bam

https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase3/data/HG00096/exome_alignment/HG00096.unmapped.ILLUMINA.bwa.GBR.exome.20120522.bam
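
When compute runs in a particular cloud, it is usually preferable to request only the access method for that cloud rather than all of them. The following is a minimal sketch, assuming the `region` values seen above (`s3.us-east-1`, `gs.US`) and treating an entry with no `region` as a generic fallback. `pick_access_id` is a hypothetical helper.

```python
def pick_access_id(drs_object, preferred_regions=("s3.us-east-1", "gs.US")):
    """Return the access_id for the first preferred region that is available."""
    methods = drs_object.get("access_methods", [])
    for region in preferred_regions:
        for method in methods:
            if method.get("region") == region:
                return method["access_id"]
    # Fall back to a method with no region, then to any method at all.
    for method in methods:
        if "region" not in method:
            return method["access_id"]
    return methods[0]["access_id"] if methods else None


# The access methods of the unmapped BAM shown above, copied as a literal.
run_file = {
    "access_methods": [
        {"access_id": "32d4fc6b0f98ffe8711871d4c58b30359987ef47",
         "region": "s3.us-east-1", "type": "https"},
        {"access_id": "7fb9cd8b87cb926ead5f1f4ddb3a2b13b19ecd02",
         "region": "gs.US", "type": "https"},
        {"access_id": "ebb5fafbf402dd6b4e55557d2c6e1e494a682826",
         "type": "https"},
    ]
}

print(pick_access_id(run_file, preferred_regions=("gs.US",)))
```

The returned `access_id` can then be passed to `drsClient.getAccessURL` as in the loop above.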



## A GECCO accession

The GECCO dataset has SRA data in the cloud.

In [9]:
accession = 'SRR7271748'
idx = drsClient.acc2drs(accession)
print(json.dumps(idx, indent=3))
run_drsId = idx['response'][accession]['drs']
print (run_drsId)

{
   "drs-base": "drs://locate.ncbi.nlm.nih.gov",
   "response": {
      "SRR7271748": {
         "drs": "490a4101f4a3c217d95a8176354c6de2",
         "status_code": 200
      }
   }
}
490a4101f4a3c217d95a8176354c6de2


In [10]:
drsClient.getObject(run_drsId)

{'checksums': [{'checksum': '490a4101f4a3c217d95a8176354c6de2',
   'type': 'md5'}],
 'contents': [{'id': '2d1f55fd4e684ad14dc85105a094eb04',
   'name': '117438.recal.cram.crai'},
  {'id': '43bf3a1cd9568bba066b048450cc2302', 'name': '117438.recal.cram'}],
 'created_time': '2018-06-10T11:08:39Z',
 'id': '490a4101f4a3c217d95a8176354c6de2',
 'name': 'SRR7271748',
 'self_url': 'drs://locate.md-be.ncbi.nlm.nih.gov/490a4101f4a3c217d95a8176354c6de2',
 'size': 37004772517}

In [11]:
drsClient.getObject('43bf3a1cd9568bba066b048450cc2302')

{'access_methods': [{'access_id': 'c4d4a126aff587fa02cd021924edebe2b05c74c0',
   'region': 's3.us-east-1',
   'type': 'https'},
  {'access_id': 'dbd568530e2f24a6925432c027820e3685c5558d',
   'region': 'gs.US',
   'type': 'https'}],
 'checksums': [{'checksum': '43bf3a1cd9568bba066b048450cc2302',
   'type': 'md5'}],
 'created_time': '2018-06-10T11:08:39Z',
 'id': '43bf3a1cd9568bba066b048450cc2302',
 'name': '117438.recal.cram',
 'self_url': 'drs://locate.md-be.ncbi.nlm.nih.gov/43bf3a1cd9568bba066b048450cc2302',
 'size': 37003127498}

In [12]:
drsClient.getAccessURL('43bf3a1cd9568bba066b048450cc2302','c4d4a126aff587fa02cd021924edebe2b05c74c0')

Unauthorized for that DRS id
