## Expanding DRS Bundles - SRA Example
The schema of a bundle is covered in the SRA_IDs_and_bundling example. This example looks at a separate question: the capability in the DRS request to expand a bundle.

To ask for a bundle to be expanded, a parameter is added to the request. For example, for the SRA server:
```
https://locate.be-md.ncbi.nlm.nih.gov/idx/v1/objects/<id>?expand=true
```
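
As a rough sketch, the same URL can be assembled with the Python standard library alone. The helper name `expand_url` is hypothetical and is not part of DRSClient; the `/idx/v1/objects/` path is copied from the URL above.

```python
from urllib.parse import quote, urlencode

def expand_url(base_url, object_id, expand=True):
    """Build a DRS object URL, optionally asking for bundle expansion.

    Hypothetical helper for illustration only; not part of DRSClient.
    """
    url = "{}/idx/v1/objects/{}".format(base_url.rstrip("/"), quote(object_id, safe=""))
    if expand:
        # The expand flag travels as an ordinary query parameter.
        url += "?" + urlencode({"expand": "true"})
    return url

print(expand_url("https://locate.be-md.ncbi.nlm.nih.gov",
                 "16139c5b6f36034eb09768c17a90fd23"))
```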

DRSClient provides a Python binding for that capability and is used in the examples that follow.

### Bundling at the experiment level. An SRX accession

SRA's data model has a number of levels. In descending order they are:
* SRP - Project, a project in which sequencing has been done
* SRS - Sample, a physical sample from the project. What it represents depends on the scientific investigation in the Project.
* SRX - Experiment, the application of a particular sequencing technology to some Sample
* SRR - Run, the run, on a sequencer, of material from the Experiment
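
Because the level is encoded in the accession prefix, a small lookup is enough to tell which level an accession belongs to. This is a minimal sketch covering only the four prefixes listed above; `SRA_LEVELS` and `sra_level` are illustrative names, not part of any library.

```python
# Accession prefixes and the SRA level each one denotes, highest level first.
SRA_LEVELS = {
    "SRP": "Project",
    "SRS": "Sample",
    "SRX": "Experiment",
    "SRR": "Run",
}

def sra_level(accession):
    """Return the SRA level for an accession, or None if the prefix is not listed."""
    return SRA_LEVELS.get(accession[:3].upper())

print(sra_level("SRX719843"))
```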

In the following example, the SRADRSClient `acc2drs` method calls the IDentity eXchange service (IDX) to get the DRS id that corresponds to a sequencing experiment (SRX). A DRS getObject call with expand=True then expands the Run (SRR) nested within the Experiment.

In [1]:
from fasp.loc import SRADRSClient
import json

# Set up a client to access NCBI's DRS Server for the Sequence Read Archive (SRA)
drsClient = SRADRSClient('https://locate.be-md.ncbi.nlm.nih.gov', public=True)
accession = 'SRX719843'
res = drsClient.acc2drs(accession)
print("Response from IDX for accession {}".format(accession))
print(res)
drs_id = res['response'][accession]['drs']
print("\nDRS ID corresponding to accession {}".format(accession))
print(drs_id)
drs_response = drsClient.getObject(drs_id, expand=True)
print("\nSRA DRS response for {}".format(drs_id))
print(json.dumps(drs_response, indent=3))

Response from IDX for accession SRX719843
{'drs-base': 'drs://locate.be-md.ncbi.nlm.nih.gov', 'response': {'SRX719843': {'drs': '16139c5b6f36034eb09768c17a90fd23', 'status_code': 200}}}

DRS ID corresponding to accession SRX719843
16139c5b6f36034eb09768c17a90fd23

SRA DRS response for 16139c5b6f36034eb09768c17a90fd23
{
   "checksums": [
      {
         "checksum": "16139c5b6f36034eb09768c17a90fd23",
         "type": "md5"
      }
   ],
   "contents": [
      {
         "contents": [
            {
               "id": "37f0c2a65cc4b89d497d965332fa530b",
               "name": "HG00096.unmapped.ILLUMINA.bwa.GBR.exome.20120522.bam"
            },
            {
               "id": "5d4ae7a46d470036d99429c363498965",
               "name": "HG00096.mapped.ILLUMINA.bwa.GBR.exome.20120522.bam"
            }
         ],
         "id": "fd074040842ce8c2e114b4eed7accee0",
         "name": "SRR1596638"
      }
   ],
   "created_time": "2012-11-19T15:20:25Z",
   "id": "16139c5b6f36034eb09768c17a

As a reminder, the unexpanded call looks like this. Only the DRS id and name are provided for the Run.

In [2]:
drsClient.getObject(drs_id)

{'checksums': [{'checksum': '16139c5b6f36034eb09768c17a90fd23',
   'type': 'md5'}],
 'contents': [{'id': 'fd074040842ce8c2e114b4eed7accee0',
   'name': 'SRR1596638'}],
 'created_time': '2012-11-19T15:20:25Z',
 'id': '16139c5b6f36034eb09768c17a90fd23',
 'name': 'SRX719843',
 'self_url': 'drs://locate.be-md.ncbi.nlm.nih.gov/16139c5b6f36034eb09768c17a90fd23',
 'size': 9205789476}

### What is the correct expansion?
Note that the SRA DRS Server does not expand the bundle all the way down to the actual files. An additional DRS call with each of the file DRS ids would be needed.  
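
The follow-up calls could be driven by a small recursive walk that collects the ids at the bottom of the expanded response. The sketch below is illustrative only: `leaf_ids` is a hypothetical helper, and the dictionary is an abbreviated copy of the expanded SRX719843 response shown earlier.

```python
def leaf_ids(node):
    """Yield the DRS id of every entry that has no (non-empty) 'contents'."""
    children = node.get("contents")
    if children:
        for child in children:
            yield from leaf_ids(child)
    else:
        yield node["id"]

# Abbreviated copy of the expanded SRX719843 response.
expanded = {
    "id": "16139c5b6f36034eb09768c17a90fd23",
    "name": "SRX719843",
    "contents": [
        {
            "id": "fd074040842ce8c2e114b4eed7accee0",
            "name": "SRR1596638",
            "contents": [
                {"id": "37f0c2a65cc4b89d497d965332fa530b",
                 "name": "HG00096.unmapped.ILLUMINA.bwa.GBR.exome.20120522.bam"},
                {"id": "5d4ae7a46d470036d99429c363498965",
                 "name": "HG00096.mapped.ILLUMINA.bwa.GBR.exome.20120522.bam"},
            ],
        }
    ],
}

for file_id in leaf_ids(expanded):
    # Each id could then be passed to drsClient.getObject(file_id).
    print(file_id)
```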

The [DRS 1.1 spec](https://ga4gh.github.io/data-repository-service-schemas/preview/release/drs-1.1.0/docs/) does seem to suggest that one would expect expansion all the way down to the actual objects. That would make sense given the intent of ?expand=true. However, it is possible to see how some readings of the spec might conclude that expansion is only intended to go to the bundle level.

An issue has been created to address this: https://github.com/ga4gh/data-repository-service-schemas/issues/382

## Bundling at a higher level. An SRP accession

In the SRA_IDs_and_bundling example we saw how to get the DRS id for SRP048601. When that notebook was run, the DRS id was 5d8b77dd974e1b7c9de4040cbf9a24c7.

Expanding an SRA project bundle explores the challenges of scaling expansion. 

Here is the unexpanded version. We should not rely on the previous DRS id for that accession still being valid, so we will call IDX again to get it.

In [3]:
%%time
accession = 'SRP048601'
res = drsClient.acc2drs(accession)
print("Response from IDX for accession {}".format(accession))
print(res)

Response from IDX for accession SRP048601
{'drs-base': 'drs://locate.be-md.ncbi.nlm.nih.gov', 'response': {'SRP048601': {'drs': '5d8b77dd974e1b7c9de4040cbf9a24c7', 'status_code': 200}}}
CPU times: user 16.4 ms, sys: 3.65 ms, total: 20.1 ms
Wall time: 22.7 s


In [4]:
%%time
drs_id = res['response'][accession]['drs']
drsRes = drsClient.getObject(drs_id)
print("No of items at top level of bundle: {}".format(len(drsRes['contents'])))

No of items at top level of bundle: 5070
CPU times: user 24.8 ms, sys: 4.96 ms, total: 29.7 ms
Wall time: 1.51 s


The full bundle is not printed here. The following is a truncated example.

```python
{'checksums': [{'checksum': '5d8b77dd974e1b7c9de4040cbf9a24c7',
   'type': 'md5'}],
 'contents': [{'id': 'f2b7f3f7c123a38eb904c5412ce48757', 'name': 'SRX719457'},
  {'id': '16139c5b6f36034eb09768c17a90fd23', 'name': 'SRX719843'},
  {'id': '8fa664d99d3cc9fb701d15e026e14950', 'name': 'SRX719844'},
  {'id': 'a4165df1fcea2234c42128bcb1d26cc0', 'name': 'SRX719845'},
  {'id': '287a5d73a2ba5abf10d6bbcdb0b4ed42', 'name': 'SRX719846'},
  {'id': '4b995cc57ff3d4ebeac9684f2b9f7f7f', 'name': 'SRX719847'},
  {'id': 'b488ab01ce3fa83addea057153ec449c', 'name': 'SRX719848'},
  {'id': 'b3dd0d947f7e901bedf9f5789565ed07', 'name': 'SRX719849'},
  {'id': 'a3bfebcf770157458454986092aeda62', 'name': 'SRX719850'},
  {'id': '8165dc2b262ba94fdfd9a14bc7919fd4', 'name': 'SRX719851'},
 
  ...],
 'created_time': '2012-11-15T14:00:55Z',
 'id': '5d8b77dd974e1b7c9de4040cbf9a24c7',
 'name': 'SRP048601',
 'self_url': 'drs://locate.md-be.ncbi.nlm.nih.gov/5d8b77dd974e1b7c9de4040cbf9a24c7',
 'size': 87447929899239}
```

### The expanded version of an SRA Project bundle
The first thing to note in the following is that it took about a minute and a half (wall time) to expand the bundle, compared with about 1.5 s without expansion.

In [5]:
%%time
drsRes = drsClient.getObject(drs_id, expand=True)

CPU times: user 67.1 ms, sys: 12 ms, total: 79 ms
Wall time: 1min 30s


The result contains the same 5070 sequence experiments and is too verbose to list in full. The following excerpt is sufficient to illustrate. Note the following:
* Three levels of hierarchy within the expanded bundle
* In this SRA project there is only one run per experiment (see code in next step)
* The content of a run varies, i.e. the data model/schema differs between runs. Some runs include a .bai index file alongside the .bam files and others do not.


```json
{
   "checksums": [
      {
         "checksum": "5d8b77dd974e1b7c9de4040cbf9a24c7",
         "type": "md5"
      }
   ],
   "contents": [
      {
         "contents": [
            {
               "contents": [
                  {
                     "id": "60a098a596e5a0155043d4eb42833460",
                     "name": "NA20362.mapped.ILLUMINA.bwa.ASW.low_coverage.20130415.bam"
                  },
                  {
                     "id": "81c5d083909a6fe8e23fc55edb9e0d5a",
                     "name": "NA20362.unmapped.ILLUMINA.bwa.ASW.low_coverage.20130415.bam"
                  }
               ],
               "id": "662aecc9370a4efa7af7c926ed411a06",
               "name": "SRR1596219"
            }
         ],
         "id": "f2b7f3f7c123a38eb904c5412ce48757",
         "name": "SRX719457"
      },
       ...
      {
         "contents": [
            {
               "contents": [
                  {
                     "id": "26b441c4bd1909e4303ba409cc6397e3",
                     "name": "HG01170.unmapped.ILLUMINA.bwa.PUR.exome.20120522.bam"
                  },
                  {
                     "id": "4aa3aef815edb4aa27f1a3ef4ba7499a",
                     "name": "HG01170.mapped.ILLUMINA.bwa.PUR.exome.20120522.bam.bai"
                  },
                  {
                     "id": "6b31d85cf5416c28bca6bb2f4870b5c8",
                     "name": "HG01170.mapped.ILLUMINA.bwa.PUR.exome.20120522.bam"
                  }
               ],
               "id": "673cfcfcefe0e55078efd1408a1eb9d8",
               "name": "SRR1597062"
            }
         ],
         "id": "61205605d1213c6587b310840886f74b",
         "name": "SRX720267"
      },
       ...
      {
         "contents": [
            {
               "contents": [
                  {
                     "id": "27efc8168c68f1cb121e42e857900524",
                     "name": "HG01171.unmapped.ILLUMINA.bwa.PUR.exome.20120522.bam"
                  },
                  {
                     "id": "eef85d6d50ca6ad75678ee32167628af",
                     "name": "HG01171.mapped.ILLUMINA.bwa.PUR.exome.20120522.bam"
                  }
               ],
               "id": "601e6c573db750028d189b7429d02dd8",
               "name": "SRR1597064"
            }
         ],
         "id": "16059c570cc89ccfb69f6a482671863b",
         "name": "SRX720269"
      }
   ],
   "created_time": "2012-11-15T14:00:55Z",
   "id": "5d8b77dd974e1b7c9de4040cbf9a24c7",
   "name": "SRP048601",
   "self_url": "drs://locate.md-be.ncbi.nlm.nih.gov/5d8b77dd974e1b7c9de4040cbf9a24c7",
   "size": 87447929899239
}
```




Though it's too verbose to list the whole output, we can summarize the fully expanded bundle as follows:

In [6]:
from collections import Counter

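class bundleExpander:
    """Walk an expanded DRS bundle and count the kinds of node it contains."""

    def __init__(self):
        self.typeCounter = Counter()

    def expand(self, node):
        if 'contents' in node:
            # a bundle: count it by accession prefix (SRP, SRX, SRR) and recurse
            self.typeCounter[node['name'][:3]] += 1
            for subnode in node['contents']:
                self.expand(subnode)
        else:
            # the bottom of the hierarchy is a file, so count by the last
            # three characters of the file name (bam, bai)
            self.typeCounter[node['name'][-3:]] += 1

        return self.typeCounter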

e = bundleExpander()
counter = e.expand(drsRes)

for k,v in counter.items():
    print(k,v)

SRP 1
SRX 5070
SRR 5070
bam 10140
bai 2173


We have a bundle with 1 Project, 5070 Experiments, 5070 Runs, 10140 BAM files and 2173 BAM indexes.

In [7]:
expts_with_multiple_runs = 0
for sr_experiment in drsRes['contents']:
    # find any experiments with more than one run
    if len(sr_experiment['contents']) > 1:
        expts_with_multiple_runs += 1
        print(sr_experiment)
print("No of experiments with more than one run: {}".format(expts_with_multiple_runs))

No of experiments with more than one run: 0
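
The variation in run content noted earlier can be summarized in a similar way by grouping runs according to the file suffixes they contain. The following is only a sketch: `run_file_profiles` is a hypothetical helper, and the sample data is a cut-down copy of two experiments from the expanded excerpt above. In the notebook it could be applied to `drsRes` directly.

```python
from collections import Counter

def run_file_profiles(project):
    """Count runs by the sorted tuple of file suffixes found in each run."""
    profiles = Counter()
    for experiment in project.get("contents", []):
        for run in experiment.get("contents", []):
            suffixes = sorted(
                f["name"].rsplit(".", 1)[-1] for f in run.get("contents", [])
            )
            profiles[tuple(suffixes)] += 1
    return profiles

# Cut-down copy of two experiments from the expanded SRP048601 excerpt.
sample = {
    "name": "SRP048601",
    "contents": [
        {"name": "SRX719457", "contents": [
            {"name": "SRR1596219", "contents": [
                {"name": "NA20362.mapped.ILLUMINA.bwa.ASW.low_coverage.20130415.bam"},
                {"name": "NA20362.unmapped.ILLUMINA.bwa.ASW.low_coverage.20130415.bam"},
            ]},
        ]},
        {"name": "SRX720267", "contents": [
            {"name": "SRR1597062", "contents": [
                {"name": "HG01170.unmapped.ILLUMINA.bwa.PUR.exome.20120522.bam"},
                {"name": "HG01170.mapped.ILLUMINA.bwa.PUR.exome.20120522.bam.bai"},
                {"name": "HG01170.mapped.ILLUMINA.bwa.PUR.exome.20120522.bam"},
            ]},
        ]},
    ],
}

for profile, n_runs in run_file_profiles(sample).items():
    print(profile, n_runs)
```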


### In conclusion

* If bundles are to be used (not confirmed), then bundle expansion is a useful capability
* The value of that capability would be greatest if expansion were to go to the file (binary object) level. It would help to confirm that was the intent of the spec. 
* Expanding bundles may not scale well. The SRA server is helpful in providing a working example. It has been suggested that the option should be provided for a server to respond indicating that expansion is not permitted for a given id.
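
If a server were to decline expansion for a given id, one possible fallback would be for the client to expand the bundle itself, one level at a time and only as deep as it needs. The sketch below assumes nothing about the server beyond the unexpanded getObject response shown earlier. `expand_client_side` is a hypothetical helper, `fetch` stands in for something like `drsClient.getObject`, and the ids in the fake store are made up. It costs one request per child, so it would still be slow for a project with thousands of experiments.

```python
def expand_client_side(fetch, drs_id, max_depth=1):
    """Expand a bundle by fetching its children individually, down to max_depth.

    `fetch` is any callable mapping a DRS id to an unexpanded DRS object.
    Hypothetical helper for illustration; not part of fasp or the DRS spec.
    """
    node = dict(fetch(drs_id))  # copy so the fetched object is not modified
    if max_depth > 0 and "contents" in node:
        node["contents"] = [
            expand_client_side(fetch, child["id"], max_depth - 1)
            for child in node["contents"]
        ]
    return node

# A fake in-memory "server" with made-up ids, used in place of real DRS calls.
store = {
    "p": {"id": "p", "name": "SRP0", "contents": [{"id": "x", "name": "SRX0"}]},
    "x": {"id": "x", "name": "SRX0", "contents": [{"id": "r", "name": "SRR0"}]},
    "r": {"id": "r", "name": "SRR0", "contents": [{"id": "f", "name": "a.bam"}]},
    "f": {"id": "f", "name": "a.bam"},
}

one_level = expand_client_side(store.__getitem__, "p", max_depth=1)
print(one_level)
```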
