## Comparing approaches: Identifying DRS objects via query or unpacking DRS Bundles
This notebook explores two approaches to getting to specific objects or files via DRS.
* Identifying the specific file of interest from attributes of subjects and specimens (sometimes called metadata)
* Extracting a specific file of interest from a bundle




For context, another notebook shows how the files identified via the approaches here can be submitted for compute via a WES service.

The data and files used are from the Thousand Genomes project. The following query using GA4GH Search shows how, in a single step, we can obtain the DRS ids for mapped BAM files from whole exome sequencing of subjects from a particular population.

This example was worked through in the January 2021 FASP Hackathon.

In [1]:
from fasp.search import DataConnectClient

# Step 1 - Discovery
# query for relevant DRS objects
searchClient = DataConnectClient('https://ga4gh-search-adapter-presto-public.prod.dnastack.com/')

query = '''SELECT f.sample_name, drs_id bam_drs_id, acc
FROM thousand_genomes.onek_genomes.ssd_drs s 
join thousand_genomes.onek_genomes.sra_drs_files f on f.sample_name = s.su_submitter_id 
where filetype = 'bam' and mapped = 'mapped' 
and sequencing_type ='exome' and  population = 'JPT' '''

resultRows = searchClient.runQuery(query, returnType='dataframe')
resultRows



_Retrieving the query_
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________
____Page6_______________
____Page7_______________


Unnamed: 0,sample_name,bam_drs_id,acc
0,NA18945,9327fb44eb81b49a41e38c8d86eb3b3a,SRR1601115
1,NA18943,9f38253b281c7e9c99e4bdbececd8e2f,SRR1606910
2,NA18944,5aff9cee759c930666e94e65dbb0af94,SRR1601113
3,NA18940,333a651b55970c9402db51ebb5e55d09,SRR1607212
4,NA18952,ac972e5bb3737622e5d0328cee59d724,SRR1604558
...,...,...,...
99,NA19001,f23f984cdfdb257a058faedfe9c0d10a,SRR1601170
100,NA19084,4d8075a1b7115b7fcd242d9d25f8de25,SRR1598088
101,NA19090,e7a65db7f7b4e7caf65193aa7986e584,SRR1603949
102,NA19087,b5f9609124241ade815fe49e2eb38c4f,SRR1603951


The SRA DRS server can be used to determine where the files can be obtained from. The following shows this for the first DRS id from the query results.

In [5]:
from fasp.loc import SRADRSClient

# Set up a client to access NCBI's  DRS Server for the Sequence Read Archive (SRA)
drsClient = SRADRSClient('https://locate.be-md.ncbi.nlm.nih.gov', public=True)
# Get the DRS id from the query results above
test_id = resultRows.iloc[0]['bam_drs_id']
print(test_id)
# Use the DRS GetObject function to find out where the file is available for access
objInfo = drsClient.getObject(test_id)
objInfo

9327fb44eb81b49a41e38c8d86eb3b3a


{'access_methods': [{'access_id': 'd3f48734dd64671ce17675652a4d1d926d49ed0a8bd09b611d06c06416a59b02',
   'region': 'gs.US',
   'type': 'https'},
  {'access_id': '20faac3996cf30269c05837053b49098e098d974e5d8c6cf83b2c1d19587efe7',
   'type': 'https'},
  {'access_id': '9309d041f3ec70cc9a376c0b5884d9528c943a5606cae7424246260110b4ab80',
   'region': 's3.us-east-1',
   'type': 'https'}],
 'checksums': [{'checksum': '9327fb44eb81b49a41e38c8d86eb3b3a',
   'type': 'md5'}],
 'created_time': '2013-02-25T23:13:15Z',
 'id': '9327fb44eb81b49a41e38c8d86eb3b3a',
 'name': 'NA18945.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam',
 'self_url': 'drs://locate.be-md.ncbi.nlm.nih.gov/9327fb44eb81b49a41e38c8d86eb3b3a',
 'size': 10606854428}

A second DRS call can be used to obtain a url to access the file from one of the above locations.

Note that, unlike other DRS servers, the SRA DRS server uses arbitrary access_ids (consistent with the spec), so our SRA DRS client function to obtain a URL takes the region we want to use rather than the access_id.

See issue to resolve practices for access_id https://github.com/ga4gh/data-repository-service-schemas/issues/341
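
As an illustration of working with region rather than access_id, the following is a minimal sketch of how a region could be mapped back to an access_id from a GetObject response. The helper name `access_id_for_region` is hypothetical and is not part of the fasp library; how `SRADRSClient` resolves the region internally is not shown in this notebook. The example data is an abbreviated copy of the response above.

```python
def access_id_for_region(drs_object, region):
    """Return the access_id of the first access method in the given region, or None.

    Hypothetical helper for illustration only; not part of fasp.
    """
    for method in drs_object.get('access_methods', []):
        # Some access methods in the SRA response carry no 'region' key at all
        if method.get('region') == region:
            return method['access_id']
    return None


# Abbreviated copy of the GetObject response shown above
example_object = {
    'access_methods': [
        {'access_id': 'd3f48734dd64671ce17675652a4d1d926d49ed0a8bd09b611d06c06416a59b02',
         'region': 'gs.US', 'type': 'https'},
        {'access_id': '20faac3996cf30269c05837053b49098e098d974e5d8c6cf83b2c1d19587efe7',
         'type': 'https'},
        {'access_id': '9309d041f3ec70cc9a376c0b5884d9528c943a5606cae7424246260110b4ab80',
         'region': 's3.us-east-1', 'type': 'https'},
    ]
}

print(access_id_for_region(example_object, 's3.us-east-1'))
```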

In [7]:
access_id = objInfo['access_methods'][0]['access_id']
print('access_id:{}'.format(access_id))
#url = drsClient.getAccessURL(test_id, access_id=access_id)
url = drsClient.getAccessURL(test_id, region='gs.US')
print('url:{}'.format(url))

access_id:d3f48734dd64671ce17675652a4d1d926d49ed0a8bd09b611d06c06416a59b02
url:https://storage.googleapis.com/genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase3/data/NA18945/exome_alignment/NA18945.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam


The approach above is the one used in the full FASP example that runs the compute. Refer to that notebook.

### Second approach - unpacking DRS Bundles provided by SRA DRS Server
Here we'll continue with the second approach to working with SRA ids.

It addresses one aspect of bundling in DRS - namely when the bundle contains a collection of different files related to the provided DRS id. 

First a query against the data we used above for comparison. 

In [8]:
searchClient.query2Frame('''select filetype, mapped, count(*) rowCount from thousand_genomes.onek_genomes.sra_drs_files 
group by filetype, mapped  ''')

_Retrieving the query_
____Page1_______________
____Page2_______________
____Page3_______________
____Page4_______________
____Page5_______________
____Page6_______________


Unnamed: 0,filetype,mapped,rowCount
0,bam,mapped,5070
1,bam,unmapped,5070
2,bai,mapped,1889
3,bai,unmapped,284


## The SRA Identity Exchange and DRS services
Can we take an SRA accession number from above and see what it looks like through the SRA IDentity eXchange service (IDX), and how that works through in DRS? We'll start with a run accession, an SRR.

The SRADRSClient has an additional function to access the IDX service with an SRA accession number.

In [9]:
import json
from fasp.loc import SRADRSClient
drsClient = SRADRSClient('https://locate.be-md.ncbi.nlm.nih.gov', public=True)

accession = 'SRR1601121'
idx = drsClient.acc2drs(accession)
print(json.dumps(idx, indent=3))
drsId = idx['response'][accession]['drs']
print (drsId)



{
   "drs-base": "drs://locate.be-md.ncbi.nlm.nih.gov",
   "response": {
      "SRR1601121": {
         "drs": "9466d7c1ec8fde019ce630c9bd88582e",
         "status_code": 200
      }
   }
}
9466d7c1ec8fde019ce630c9bd88582e
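
The IDX response above can be unpacked a little more defensively than by direct indexing. The following is a sketch only: the helper names are hypothetical, and treating any `status_code` other than 200 as a failure is an assumption, since the notebook shows only a successful response.

```python
def drs_id_from_idx(idx_response, accession):
    """Extract the DRS id for an accession from an IDX response.

    Hypothetical helper. Assumes (not confirmed by the notebook) that a
    missing entry or a status_code other than 200 means no id is available.
    """
    entry = idx_response['response'].get(accession)
    if entry is None or entry.get('status_code') != 200:
        raise LookupError('No DRS id returned for {}'.format(accession))
    return entry['drs']


def drs_uri(idx_response, accession):
    """Join the 'drs-base' value and the DRS id into a drs:// URI."""
    return '{}/{}'.format(idx_response['drs-base'],
                          drs_id_from_idx(idx_response, accession))


# Copy of the IDX response shown above
idx_example = {
    'drs-base': 'drs://locate.be-md.ncbi.nlm.nih.gov',
    'response': {
        'SRR1601121': {'drs': '9466d7c1ec8fde019ce630c9bd88582e',
                       'status_code': 200}
    }
}

print(drs_uri(idx_example, 'SRR1601121'))
```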


<del>Note: the base URI returned in the result above suggests the DRS service could be accessed at the URL https://locate.ncbi.nlm.nih.gov . At present, for performance purposes the SRA DRS service should be accessed at https://locate.be-md.ncbi.nlm.nih.gov. See the example above</del>

Now use the DRS service with that id.

In [10]:
drsClient.getObject(drsId)

{'checksums': [{'checksum': '9466d7c1ec8fde019ce630c9bd88582e',
   'type': 'md5'}],
 'contents': [{'id': '519de9933298caa8bdf551351426d120',
   'name': 'NA18948.unmapped.ILLUMINA.bwa.JPT.exome.20121211.bam'},
  {'id': 'a027e7c2a917cba582a9684244ad339d',
   'name': 'NA18948.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam.bai'},
  {'id': 'fb1cfb04d3ef99d07c21f9dbf87ccc68',
   'name': 'NA18948.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam'}],
 'created_time': '2013-02-25T23:24:10Z',
 'id': '9466d7c1ec8fde019ce630c9bd88582e',
 'name': 'SRR1601121',
 'self_url': 'drs://locate.be-md.ncbi.nlm.nih.gov/9466d7c1ec8fde019ce630c9bd88582e',
 'size': 8763581919}

Our intent, as in the first approach, is to work with the mapped bam file. We can see visually, from the filenames, which file and therefore which DRS id we need.
#### An issue
This highlights the first issue with this approach. The information we need to identify the file we need is in the file name. That would be fine for low throughput situations carried out by human eye. It does not scale to larger, machine-actionable use cases.
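
To make the issue concrete, here is a sketch of what automating the "by eye" step would involve. The function `find_mapped_bam` is hypothetical, and the rule it applies (the name contains `.mapped.` and ends in `.bam`) is an assumption read off the file names above, not something DRS guarantees. That dependence on a naming convention is exactly the fragility described.

```python
def find_mapped_bam(contents):
    """Pick the mapped BAM from a bundle's contents using only the file name.

    Hypothetical helper. The '.mapped.' / '.bam' rule is a naming convention
    inferred from the examples, so it may break on other bundles.
    """
    matches = [c for c in contents
               if '.mapped.' in c['name'] and c['name'].endswith('.bam')]
    if len(matches) != 1:
        raise ValueError('Expected one mapped BAM, found {}'.format(len(matches)))
    return matches[0]['id']


# Contents of the SRR1601121 bundle, copied from the response above
contents = [
    {'id': '519de9933298caa8bdf551351426d120',
     'name': 'NA18948.unmapped.ILLUMINA.bwa.JPT.exome.20121211.bam'},
    {'id': 'a027e7c2a917cba582a9684244ad339d',
     'name': 'NA18948.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam.bai'},
    {'id': 'fb1cfb04d3ef99d07c21f9dbf87ccc68',
     'name': 'NA18948.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam'},
]

print(find_mapped_bam(contents))
```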

#### Moving on
We pass the manually identified id to DRS to find out how we may get the file of interest. This is identical to how we did this under the first approach.

In [11]:
drsClient.getObject('fb1cfb04d3ef99d07c21f9dbf87ccc68')

{'access_methods': [{'access_id': '1e4846c05c81a49f684e7f940ffbd3a98e5f0e335f019ee4d32d85c72096b743',
   'region': 'gs.US',
   'type': 'https'},
  {'access_id': 'b14572d74b5aafe87a0fcc873050d6c3993f27338cdd088b5883aed4b118f0c8',
   'type': 'https'},
  {'access_id': '0623f9350999297e5fa3a77a05c08b8cf1fbd10ef4e392c0d52dde9a4e469a85',
   'region': 's3.us-east-1',
   'type': 'https'}],
 'checksums': [{'checksum': 'fb1cfb04d3ef99d07c21f9dbf87ccc68',
   'type': 'md5'}],
 'created_time': '2013-02-25T23:24:10Z',
 'id': 'fb1cfb04d3ef99d07c21f9dbf87ccc68',
 'name': 'NA18948.mapped.ILLUMINA.bwa.JPT.exome.20121211.bam',
 'self_url': 'drs://locate.be-md.ncbi.nlm.nih.gov/fb1cfb04d3ef99d07c21f9dbf87ccc68',
 'size': 8752606127}

In the first approach we were able to identify the DRS ids in the same step as querying for the other sample and subject attributes of interest.

The first approach is common in many other cases. See fasp-scripts notebooks for a range of examples where it is possible to get DRS ids from a query against some external database.

## Bundling at a higher level. An SRP accession

In another application of bundling, a bundle is simply a list of other DRS ids.

SRA's data model has a number of levels above the Run. In descending order they are:
* SRP - Project, a project in which sequencing has been done
* SRS - Sample, a physical sample from the project. What it represents depends on the scientific investigation in the Project.
* SRX - Experiment, the application of a particular sequencing technology to some Sample
* SRR - Run, the run, on a sequencer, of material from the Experiment
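
The levels listed above can be recognised from the accession prefix. The following sketch covers only the four prefixes in the list; the names `SRA_LEVELS` and `sra_level` are hypothetical, and any other prefix is simply reported as unknown.

```python
# Prefix to level mapping, taken from the list above
SRA_LEVELS = {
    'SRP': 'Project',
    'SRS': 'Sample',
    'SRX': 'Experiment',
    'SRR': 'Run',
}


def sra_level(accession):
    """Return the SRA level for an accession, or None if the prefix is not listed."""
    return SRA_LEVELS.get(accession[:3].upper())


for acc in ['SRP048601', 'SRX719843', 'SRR1596638']:
    print(acc, sra_level(acc))
```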

In the following example, the IDentity eXchange service (IDX) is called to get the DRS id which corresponds to the project. A feature in IPython is used to time how long the response takes.

Please be patient while this step completes. It can take 45-90 seconds.

In [12]:
%%time
srp = 'SRP048601'
res = drsClient.acc2drs(srp, verbose=True)

#SRS000157
drsId = res['response'][srp]['drs']
print(drsId)

https://locate.be-md.ncbi.nlm.nih.gov/idx/v1/SRP048601
<Response [200]>
5d8b77dd974e1b7c9de4040cbf9a24c7
CPU times: user 18.8 ms, sys: 4.41 ms, total: 23.2 ms
Wall time: 37.4 s


The DRS id for the project can now be sent to the DRS server. 

The response will consist of a bundle of DRS ids for experiments within the project.

Note the response time. Though it needs to return a list of 5070 ids, it does so in less than a second.

In [13]:
%%time
drsRes = drsClient.getObject(drsId)
print(len(drsRes['contents']))

5070
CPU times: user 27.1 ms, sys: 27 ms, total: 54.1 ms
Wall time: 781 ms


The full bundle is not printed here. The following is a truncated example.

```python
{'checksums': [{'checksum': '5d8b77dd974e1b7c9de4040cbf9a24c7',
   'type': 'md5'}],
 'contents': [{'id': 'f2b7f3f7c123a38eb904c5412ce48757', 'name': 'SRX719457'},
  {'id': '16139c5b6f36034eb09768c17a90fd23', 'name': 'SRX719843'},
  {'id': '8fa664d99d3cc9fb701d15e026e14950', 'name': 'SRX719844'},
  {'id': 'a4165df1fcea2234c42128bcb1d26cc0', 'name': 'SRX719845'},
  {'id': '287a5d73a2ba5abf10d6bbcdb0b4ed42', 'name': 'SRX719846'},
  {'id': '4b995cc57ff3d4ebeac9684f2b9f7f7f', 'name': 'SRX719847'},
  {'id': 'b488ab01ce3fa83addea057153ec449c', 'name': 'SRX719848'},
  {'id': 'b3dd0d947f7e901bedf9f5789565ed07', 'name': 'SRX719849'},
  {'id': 'a3bfebcf770157458454986092aeda62', 'name': 'SRX719850'},
  {'id': '8165dc2b262ba94fdfd9a14bc7919fd4', 'name': 'SRX719851'},
 
  ...],
 'created_time': '2012-11-15T14:00:55Z',
 'id': '5d8b77dd974e1b7c9de4040cbf9a24c7',
 'name': 'SRP048601',
 'self_url': 'drs://locate.md-be.ncbi.nlm.nih.gov/5d8b77dd974e1b7c9de4040cbf9a24c7',
 'size': 87447929899239}
```

The objects returned are still not physical files (i.e. a set of bytes) but ids for another logical concept, the 'experiment'.

Next we call DRS with the DRS id for one of the experiments.

In [14]:
drsClient.getObject('16139c5b6f36034eb09768c17a90fd23')

{'checksums': [{'checksum': '16139c5b6f36034eb09768c17a90fd23',
   'type': 'md5'}],
 'contents': [{'id': 'fd074040842ce8c2e114b4eed7accee0',
   'name': 'SRR1596638'}],
 'created_time': '2012-11-19T15:20:25Z',
 'id': '16139c5b6f36034eb09768c17a90fd23',
 'name': 'SRX719843',
 'self_url': 'drs://locate.be-md.ncbi.nlm.nih.gov/16139c5b6f36034eb09768c17a90fd23',
 'size': 9205789476}

We are now down to a level where we get a single DRS id, though it is still an id for a logical level entity - the Run.

*Strictly, what the DRS ids identify is the particular binary content of the file set corresponding to that logical id on a particular date. Essentially it's a version. But unlike GitHub, this doesn't have the characteristics of a versioning system. Those characteristics probably aren't needed - the need supported by DRS is "give me the same set of bytes that I, or someone else, got for the same id previously", and that basic need is fulfilled. QED. What it does not support is "give me the fileset for this thing that existed on such and such a date". It would not keep the file in sync with the state of the related data at the same point in time. That is an additional requirement not discussed further here, other than to comment that it could be addressed in a higher level model that deals with versioning as a real world (logical) construct.

Returning to the DRS id for the Run. This is familiar from the previous two examples.

In [15]:
drsClient.getObject('fd074040842ce8c2e114b4eed7accee0')

{'checksums': [{'checksum': 'fd074040842ce8c2e114b4eed7accee0',
   'type': 'md5'}],
 'contents': [{'id': '37f0c2a65cc4b89d497d965332fa530b',
   'name': 'HG00096.unmapped.ILLUMINA.bwa.GBR.exome.20120522.bam'},
  {'id': '5d4ae7a46d470036d99429c363498965',
   'name': 'HG00096.mapped.ILLUMINA.bwa.GBR.exome.20120522.bam'}],
 'created_time': '2012-11-19T15:20:25Z',
 'id': 'fd074040842ce8c2e114b4eed7accee0',
 'name': 'SRR1596638',
 'self_url': 'drs://locate.be-md.ncbi.nlm.nih.gov/fd074040842ce8c2e114b4eed7accee0',
 'size': 9205789476}

Things would proceed as with the previous example from here.
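
The step-by-step descent above (Experiment, then Run, then files) could be automated with a recursive walk. This is a sketch under stated assumptions: `leaf_objects` is a hypothetical helper, and `get_object` is any callable that behaves like `drsClient.getObject`. Here an in-memory dictionary stands in for the server so the example runs on its own; the two file-level entries are abbreviated stand-ins containing only the id and name shown above.

```python
def leaf_objects(drs_id, get_object):
    """Yield (id, name) for every object under drs_id that has no 'contents'."""
    obj = get_object(drs_id)
    contents = obj.get('contents')
    if not contents:
        # No contents: treat this as a file-level object
        yield obj['id'], obj.get('name')
        return
    for item in contents:
        # One GetObject call per nested id
        yield from leaf_objects(item['id'], get_object)


# Stand-in for the DRS server, built from the responses shown above
fake_store = {
    '16139c5b6f36034eb09768c17a90fd23': {
        'id': '16139c5b6f36034eb09768c17a90fd23', 'name': 'SRX719843',
        'contents': [{'id': 'fd074040842ce8c2e114b4eed7accee0', 'name': 'SRR1596638'}]},
    'fd074040842ce8c2e114b4eed7accee0': {
        'id': 'fd074040842ce8c2e114b4eed7accee0', 'name': 'SRR1596638',
        'contents': [
            {'id': '37f0c2a65cc4b89d497d965332fa530b',
             'name': 'HG00096.unmapped.ILLUMINA.bwa.GBR.exome.20120522.bam'},
            {'id': '5d4ae7a46d470036d99429c363498965',
             'name': 'HG00096.mapped.ILLUMINA.bwa.GBR.exome.20120522.bam'}]},
    '37f0c2a65cc4b89d497d965332fa530b': {
        'id': '37f0c2a65cc4b89d497d965332fa530b',
        'name': 'HG00096.unmapped.ILLUMINA.bwa.GBR.exome.20120522.bam'},
    '5d4ae7a46d470036d99429c363498965': {
        'id': '5d4ae7a46d470036d99429c363498965',
        'name': 'HG00096.mapped.ILLUMINA.bwa.GBR.exome.20120522.bam'},
}

files = list(leaf_objects('16139c5b6f36034eb09768c17a90fd23', fake_store.__getitem__))
for file_id, name in files:
    print(file_id, name)
```

Note that the walk makes one call per object, so starting from the project-level id would mean thousands of requests before any file name could be inspected.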

However, one other observation raises a question.

### Comparability of checksums for higher level objects
Note that sizes and checksums are provided at the higher (logical) levels above the file (physical) level, i.e. Run (SRR), Experiment (SRX) and Project (SRP).

Is it clear whether the checksum can be used for its originally intended purpose, checking sums? How would I know what to run MD5 against to get a value to compare with the values reported in the responses? (Perhaps that explains what the LifeBit team encountered.)
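
One candidate answer comes from the DRS specification itself. As I read it, the checksum of a bundle is computed over the alphabetically sorted concatenation of the checksums of its top-level contents. The sketch below implements that reading; `bundle_md5` is a hypothetical name, and it is an assumption that each content `id` in the SRA responses is also that object's md5 (as it is for the objects shown above). Whether the SRA server actually derives its bundle checksums this way is not established here, so the comparison is printed rather than asserted.

```python
import hashlib


def bundle_md5(content_checksums):
    """MD5 over the sorted, concatenated hex checksums of a bundle's contents.

    Follows one reading of the DRS specification; not confirmed for SRA.
    """
    joined = ''.join(sorted(content_checksums))
    return hashlib.md5(joined.encode('ascii')).hexdigest()


# Content ids of the SRR1596638 bundle, copied from the response above
content_ids = ['37f0c2a65cc4b89d497d965332fa530b',
               '5d4ae7a46d470036d99429c363498965']
reported = 'fd074040842ce8c2e114b4eed7accee0'

computed = bundle_md5(content_ids)
print('computed:', computed)
print('matches reported checksum:', computed == reported)
```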

### In conclusion
One of the main conclusions here is to question the value of DRS ids for logical level constructs. The examples above illustrate the problem for the SRA use case, but it is likely to have more general applicability. For example, bundling has been talked about for DICOM and pathology imaging.

It is suggested that the higher, application level, questions should be dealt with using schemas and models specific to the domain being supported. That follows much existing practice amongst GA4GH participants. 

This does not exclude that the higher level schemas might be referenced by or even included within DRS bundles. However. 

Nor does this exclude the appropriate use of bulk operations within DRS and the pagination that would support that. These too should only address the fundamental level, and not be used to handle application/logical level concepts.