This notebook explores the DRS implementation published by the Human Microbiome Project (HMP).
See https://github.com/ihmpdcc/drs-server

Cloning the repository was straightforward, and a running DRS server is easily started:

```
cd docker
docker-compose up -d
```
#### Sidebar
The first attempt wouldn't start because I already had a mysql database running on my host on the default mysql port, which the DRS server also uses.
The workaround was to shut down my mysql server for the duration of this work.
The ability to configure the port would be good. It would have to be changed in both the mysql config and the DRS server config (in this case the DRS server is the client of the mysql server).
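
Before starting the containers it can help to check whether the port is already taken. The following is a minimal sketch, assuming the conflict is on MySQL's standard default port 3306 on the local host; the function name `port_in_use` is my own and not part of the DRS server or its tooling.

```python
import socket

MYSQL_DEFAULT_PORT = 3306  # MySQL's standard default port


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something is already accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


# Check before running docker-compose; True means the port is already occupied
print(port_in_use(MYSQL_DEFAULT_PORT))
```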

#### Data in the packaged Docker container
Exploration of the mysql server shows that it starts up with the following DRS ids available.
```
blob_a
blob_b
blob_c
blob_d
blob_e
blob_f.1
blob_f.2
blob_f.3
bundle_1
bundle_2
bundle_2.1
```

Set up a DRS client to call the local server.
We set two flags on the client:
* debug - prints the URL the client is calling.
* public - means the client understands that no authentication or authorization is needed for the calls to the DRS server.
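
The URLs that the debug flag prints in this notebook follow a simple pattern. As a rough sketch of that pattern (this is not the DRSClient's own code; the helper names `object_url` and `access_url` are hypothetical), the URLs can be built like this:

```python
from urllib.parse import quote

# Path prefix as it appears in the URLs printed in this notebook
DRS_PREFIX = "/ga4gh/drs/v1"


def object_url(base_url, drs_id, expand=False):
    """Build the URL for fetching a DRS object, optionally with expand=true."""
    url = f"{base_url.rstrip('/')}{DRS_PREFIX}/objects/{quote(drs_id, safe='')}"
    return url + "?expand=true" if expand else url


def access_url(base_url, drs_id, access_id):
    """Build the URL for resolving one access method of a DRS object."""
    return f"{object_url(base_url, drs_id)}/access/{quote(access_id, safe='')}"


print(object_url("http://localhost:9999", "blob_a"))
print(access_url("http://localhost:9999", "blob_a", "5ed5534c92-3"))
```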

In [17]:
from fasp.loc import DRSClient
import json

cl = DRSClient("http://localhost:9999", debug=True, public=True)

We can now call the client with one of the DRS ids above

In [26]:
a = cl.get_object("blob_a")
print(json.dumps(a, indent=3))

http://localhost:9999/ga4gh/drs/v1/objects/blob_a
{
   "id": "blob_a",
   "created_time": "2021-09-13T00:00:00.000Z",
   "drs_id": "blob_a",
   "checksums": [
      {
         "checksum": "7fc56270e7a70fa81a5935b72eacbe29",
         "type": "md5"
      }
   ],
   "self_uri": "drs://localhost/blob_a",
   "size": 43271233281,
   "description": "The first blob",
   "name": "Blob A",
   "version": "1",
   "access_methods": [
      {
         "access_id": "5ed5534c92-1",
         "access_url": {
            "url": "https://servera/path/to/blob_a"
         },
         "type": "https",
         "headers": [
            {
               "header": "Authorization",
               "value": "whatever"
            },
            {
               "header": "AnotherHeader",
               "value": "foobar"
            }
         ]
      },
      {
         "access_id": "5ed5534c92-2",
         "access_url": {
            "url": "s3://hmpdacc/blob_a"
         },
         "type": "s3",
         "region

#### Implementation Issue
Headers should be within the access_url property.
Additionally, the syntax given is not consistent with any version of the DRS spec.

#### DRS issue
That said, the DRS specs themselves are not consistent on this point.

The specification for AccessURL in DRS [version 1.0](https://ga4gh.github.io/data-repository-service-schemas/preview/release/drs-1.0.0/docs/#_accessurl) and [version 1.1](https://ga4gh.github.io/data-repository-service-schemas/preview/release/drs-1.1.0/docs/#_accessurl) specifies the headers property as an array of strings. However, the example provided is not consistent with that; it shows a JSON object:
```
{
  "Authorization": "Basic Z2E0Z2g6ZHJz"
}
```

The [version 1.2](https://ga4gh.github.io/data-repository-service-schemas/preview/release/drs-1.2.0/docs/#tag/AccessURLModel) spec partially corrects this. The example entry is now shown as a string, though as a single string rather than an array of strings:

```
{
  "url": "string",
  "headers": "Authorization: Basic Z2E0Z2g6ZHJz"
}
```
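
For illustration, the header objects returned by this server could be rewritten into the form discussed above. This is a sketch only, assuming that "array of strings" means `Name: value` strings like the one in the version 1.2 example; the function name is hypothetical and the sample is the first access method from the blob_a response.

```python
import copy


def move_headers_into_access_url(access_method):
    """Return a copy with header objects rewritten as 'Name: value' strings under access_url."""
    fixed = copy.deepcopy(access_method)
    header_objects = fixed.pop("headers", None)
    if header_objects:
        # Assumption: each string has the form "Name: value"
        fixed.setdefault("access_url", {})["headers"] = [
            f"{h['header']}: {h['value']}" for h in header_objects
        ]
    return fixed


# First access method of blob_a, as returned by this server
sample = {
    "access_id": "5ed5534c92-1",
    "access_url": {"url": "https://servera/path/to/blob_a"},
    "type": "https",
    "headers": [
        {"header": "Authorization", "value": "whatever"},
        {"header": "AnotherHeader", "value": "foobar"},
    ],
}
print(move_headers_into_access_url(sample))
```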

Note: all this may be moot, as no known DRS server uses this method of access control. It is questionable whether any implementer would do so, as it may not provide sufficient security. Rather than fixing this, perhaps the solution would be to remove the 'headers' property.

The method to get an access URL can be called with a DRS id and an access_id:

In [19]:
cl.get_access_url("blob_a", "5ed5534c92-3")

http://localhost:9999/ga4gh/drs/v1/objects/blob_a/access/5ed5534c92-3
<Response [200]>


'gs://hmpdacc/data/blob_a'

On most DRS servers this method returns a URL which can be used to access the file. In this case it returns the same URL as was available via the /objects endpoint i.e. from the get_object method.

The advantage of the approach most DRS servers take is that the caller does not need to deal with the specifics of IAM on different cloud providers. There are at least two aspects where this is beneficial.
* The client does not need to use provider specific APIs
* It is not necessary for user accounts and authorizations to be provided on the cloud provider.
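
To make the first point concrete: a URL with an http or https scheme can be fetched with any HTTP client, whereas s3:// and gs:// URLs like the ones this server returns need a provider-specific client and credentials. A small sketch of that distinction follows; the function name is my own.

```python
from urllib.parse import urlparse


def fetchable_over_http(url):
    """Return True if the URL can be fetched with a plain HTTP(S) client."""
    return urlparse(url).scheme in ("http", "https")


# URLs taken from the responses shown in this notebook
for url in (
    "https://servera/path/to/blob_a",
    "s3://hmpdacc/blob_a",
    "gs://hmpdacc/data/blob_a",
):
    how = "plain HTTP(S) GET" if fetchable_over_http(url) else "provider-specific client and credentials"
    print(f"{url}: {how}")
```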


### A bundle

In [25]:
b_drs = cl.get_object("bundle_1")
print(json.dumps(b_drs, indent=3))

http://localhost:9999/ga4gh/drs/v1/objects/bundle_1
{
   "id": "bundle_1",
   "created_time": "2021-09-13T13:57:51.000Z",
   "drs_id": "bundle_1",
   "checksums": [
      {
         "checksum": "79a58ab10b666b30ec664097e06bb110",
         "type": "md5"
      }
   ],
   "self_uri": "drs://localhost/bundle_1",
   "size": 5212685564,
   "name": "Bundle 1",
   "contents": [
      {
         "drs_uri": [
            "drs://localhost/blob_a"
         ],
         "id": "drs://localhost/blob_a",
         "name": "Blob A"
      },
      {
         "drs_uri": [
            "drs://localhost/blob_b"
         ],
         "id": "drs://localhost/blob_b",
         "name": "Blob B"
      },
      {
         "drs_uri": [
            "drs://localhost/blob_c"
         ],
         "id": "drs://localhost/blob_c",
         "name": "Blob C"
      }
   ]
}


#### Issue 1
Note that the ids for each object in the contents are given as a full DRS URI.
Other DRS servers give only the id within the DRS server. The latter is consistent with the [version 1.0](https://ga4gh.github.io/data-repository-service-schemas/preview/release/drs-1.0.0/docs/#_contentsobject) spec for ContentsObject.

Additionally, using the full DRS URI for the 'id' is redundant with the drs_uri property, discussed below.

#### drs_uri
The drs_uri syntax is correct with regard to the spec. However, the use case the spec intended here is not clear, specifically why an array of drs_uris is needed rather than a single id.

Spec issue.

### Calling get_object for each member of a bundle
First define a function to call get_object and print the result 

In [32]:
def print_bundle(bundle):
    for o in bundle["contents"]:
        # due to issue 1 above, extract the actual DRS id
        drs_id = o['id'].split('/')[-1]
        print(f"Calling DRS for id {drs_id}")
        o_drs = cl.get_object(drs_id)
        print(json.dumps(o_drs, indent=3))
        print("_"*80)
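
The split on '/' above is enough for this server. A slightly more defensive variant, which accepts either a bare id or a hostname-based DRS URI, might look like the sketch below. The function name `local_drs_id` is hypothetical, and compact-identifier style DRS URIs are not handled.

```python
from urllib.parse import urlparse


def local_drs_id(value):
    """Return the server-local id from either a bare id or a drs://host/id URI."""
    parsed = urlparse(value)
    if parsed.scheme == "drs":
        # For drs://localhost/blob_a the path is "/blob_a"
        return parsed.path.lstrip("/")
    return value


print(local_drs_id("drs://localhost/blob_a"))
print(local_drs_id("blob_a"))
```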


Now call our function with the response we got earlier

In [33]:
print_bundle(b_drs)

Calling DRS for id blob_a
http://localhost:9999/ga4gh/drs/v1/objects/blob_a
{
   "id": "blob_a",
   "created_time": "2021-09-13T00:00:00.000Z",
   "drs_id": "blob_a",
   "checksums": [
      {
         "checksum": "7fc56270e7a70fa81a5935b72eacbe29",
         "type": "md5"
      }
   ],
   "self_uri": "drs://localhost/blob_a",
   "size": 43271233281,
   "description": "The first blob",
   "name": "Blob A",
   "version": "1",
   "access_methods": [
      {
         "access_id": "5ed5534c92-1",
         "access_url": {
            "url": "https://servera/path/to/blob_a"
         },
         "type": "https",
         "headers": [
            {
               "header": "Authorization",
               "value": "whatever"
            },
            {
               "header": "AnotherHeader",
               "value": "foobar"
            }
         ]
      },
      {
         "access_id": "5ed5534c92-2",
         "access_url": {
            "url": "s3://hmpdacc/blob_a"
         },
         "typ

Other than the issues already identified with headers, all this looks valid in terms of DRS.

### Another bundle


In [29]:
b2_drs = cl.get_object("bundle_2")


http://localhost:9999/ga4gh/drs/v1/objects/bundle_2


In [34]:
print(json.dumps(b2_drs, indent=3))

{
   "id": "bundle_2",
   "created_time": "2021-09-13T14:04:09.000Z",
   "drs_id": "bundle_2",
   "checksums": [
      {
         "checksum": "5affb6879844a996adc728025f00aa1f",
         "type": "md5"
      }
   ],
   "self_uri": "drs://localhost/bundle_2",
   "size": 33584098,
   "name": "Bundle 2",
   "contents": [
      {
         "drs_uri": [
            "drs://localhost/blob_e"
         ],
         "id": "drs://localhost/blob_e",
         "name": "Blob E"
      },
      {
         "drs_uri": [
            "drs://localhost/bundle_2.1"
         ],
         "id": "drs://localhost/bundle_2.1",
         "name": "Bundle 2.1",
         "contents": []
      }
   ]
}


The presence of the contents property is misleading. Rather than including the contents attribute with an empty array, the server should omit it when expand is not requested.

From the spec:
"If this ContentsObject describes a nested bundle and the caller specified "?expand=true" on the request, then this contents array must be present and describe the objects within the nested bundle."

The implication is that if expand is not true then the contents property should not be present.

The inference that there is empty content is what is misleading. Without expand=true the intent is simply to list the constituent objects.
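
A client can detect this case itself. The sketch below lists the entries of a response whose contents key is present but empty; the function name is hypothetical and the sample is the bundle_2 response shown above, trimmed to the relevant fields.

```python
def empty_nested_contents(drs_object):
    """Return names of ContentsObjects whose 'contents' key is present but empty."""
    flagged = []
    for entry in drs_object.get("contents", []):
        if "contents" in entry and not entry["contents"]:
            flagged.append(entry.get("name"))
    return flagged


# bundle_2 as returned without expand=true (trimmed)
bundle_2 = {
    "id": "bundle_2",
    "contents": [
        {"drs_uri": ["drs://localhost/blob_e"], "id": "drs://localhost/blob_e", "name": "Blob E"},
        {
            "drs_uri": ["drs://localhost/bundle_2.1"],
            "id": "drs://localhost/bundle_2.1",
            "name": "Bundle 2.1",
            "contents": [],
        },
    ],
}
print(empty_nested_contents(bundle_2))
```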

### Bundle expansion
The following shows the same bundle requested with expand=True.

In [36]:
expansion = cl.get_object("bundle_2", expand=True)
print(json.dumps(expansion, indent=3))

http://localhost:9999/ga4gh/drs/v1/objects/bundle_2?expand=true
{
   "id": "bundle_2",
   "created_time": "2021-09-13T14:04:09.000Z",
   "drs_id": "bundle_2",
   "checksums": [
      {
         "checksum": "5affb6879844a996adc728025f00aa1f",
         "type": "md5"
      }
   ],
   "self_uri": "drs://localhost/bundle_2",
   "size": 33584098,
   "name": "Bundle 2",
   "contents": [
      {
         "drs_uri": [
            "drs://localhost/blob_e"
         ],
         "id": "drs://localhost/blob_e",
         "name": "Blob E"
      },
      {
         "drs_uri": [
            "drs://localhost/bundle_2.1"
         ],
         "id": "drs://localhost/bundle_2.1",
         "name": "Bundle 2.1",
         "contents": [
            {
               "drs_uri": [
                  "drs://localhost/blob_f.1"
               ],
               "id": "drs://localhost/blob_f.1",
               "name": "Blob F.1"
            },
            {
               "drs_uri": [
                  "drs://localhost/

In [35]:
cl.get_object("blob_e")

http://localhost:9999/ga4gh/drs/v1/objects/blob_e


KeyboardInterrupt: 

The above example fails.

Taking a look at the server we see the internal error listed below. There are two issues with this:
* The server error itself
* The fact that the server leaves the client waiting for a response rather than returning an error. A 500 error is probably the most appropriate in this case.
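
Until the server returns an error status, a caller can at least avoid hanging by setting a timeout. The sketch below uses the standard library's urllib rather than the DRSClient used in this notebook; the function name `get_json_or_none` and the 5 second default are my own assumptions.

```python
import json
import urllib.error
import urllib.request


def get_json_or_none(url, timeout=5.0):
    """Fetch JSON from url; return None instead of hanging or raising on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 500
        print(f"server returned HTTP {err.code}")
    except OSError as err:
        # Covers timeouts, refused connections and connections closed without a response
        print(f"no usable response: {err!r}")
    return None
```

With the server in the state shown above, a call such as `get_json_or_none("http://localhost:9999/ga4gh/drs/v1/objects/blob_e")` should return None once the timeout expires or the connection is dropped, rather than waiting indefinitely.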

From the server Docker container:
~~~
172.19.0.1 - - [25/Oct/2022:20:46:05 +0000] "GET /ga4gh/drs/v1/objects/blob_c HTTP/1.1" 200 344 "-" "python-requests/2.28.0"
[2022-10-25T20:46:05.113] [DEBUG] app - In build_drs_object.
[2022-10-25T20:46:05.113] [DEBUG] app - Building a blob (file).
[2022-10-25T21:12:38.800] [DEBUG] app - In get_object: blob_e
[2022-10-25T21:12:38.800] [DEBUG] app - Expanding results? false
[2022-10-25T21:12:38.800] [DEBUG] app - In build_object: blob_e
[2022-10-25T21:12:38.800] [DEBUG] app - Expanding results? false
[2022-10-25T21:12:38.801] [DEBUG] app - In query_data: blob_e.
[2022-10-25T21:12:38.803] [DEBUG] app - In collapse_data.
[2022-10-25T21:12:38.804] [DEBUG] app - In build_drs_object.
[2022-10-25T21:12:38.804] [DEBUG] app - Building a blob (file).
[2022-10-25T21:12:38.811] [ERROR] app - Caught exception: Error: Must provide a url.
[2022-10-25T21:12:38.827] [ERROR] app - Error: Must provide a url.
    at new AccessMethod (/src/lib/access-method.js:10:19)
    at /src/lib/object-retrieve.js:424:26
    at arrayEach (/src/node_modules/lodash/lodash.js:530:11)
    at Function.forEach (/src/node_modules/lodash/lodash.js:9410:14)
    at build_drs_object (/src/lib/object-retrieve.js:419:11)
    at /src/lib/object-retrieve.js:100:17
    at Query.<anonymous> (/src/lib/object-retrieve.js:348:13)
    at Query.<anonymous> (/src/node_modules/mysql/lib/Connection.js:526:10)
    at Query._callback (/src/node_modules/mysql/lib/Connection.js:488:16)
    at Query.Sequence.end (/src/node_modules/mysql/lib/protocol/sequences/Sequence.js:83:24)
Check log file for stack trace. Caught exception: Error: Must provide a url.
172.19.0.1 - - [25/Oct/2022:21:13:12 +0000] "GET /ga4gh/drs/v1/objects/blob_e HTTP/1.1" - - "-" "python-requests/2.28.0"
~~~

In [7]:
print_bundle(b2_drs)

http://localhost:9999/ga4gh/drs/v1/objects/blob_e


ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

In [9]:
cl.get_object("bundle_2")

http://localhost:9999/ga4gh/drs/v1/objects/bundle_2


{'id': 'bundle_2',
 'created_time': '2021-09-13T14:04:09.000Z',
 'drs_id': 'bundle_2',
 'checksums': [{'checksum': '5affb6879844a996adc728025f00aa1f',
   'type': 'md5'}],
 'self_uri': 'drs://localhost/bundle_2',
 'size': 33584098,
 'name': 'Bundle 2',
 'contents': [{'drs_uri': ['drs://localhost/blob_e'],
   'id': 'drs://localhost/blob_e',
   'name': 'Blob E'},
  {'drs_uri': ['drs://localhost/bundle_2.1'],
   'id': 'drs://localhost/bundle_2.1',
   'name': 'Bundle 2.1',
   'contents': []}]}

In [10]:
cl.get_object("bundle_2.1")

http://localhost:9999/ga4gh/drs/v1/objects/bundle_2.1


{'id': 'bundle_2.1',
 'created_time': '2021-09-22T11:00:53.000Z',
 'drs_id': 'bundle_2.1',
 'checksums': [{'checksum': '0d599f0ec05c3bda8c3b8a68c32a1b47',
   'type': 'md5'}],
 'self_uri': 'drs://localhost/bundle_2.1',
 'size': 2856688870,
 'description': 'A sub bundle.',
 'name': 'Bundle 2.1',
 'contents': [{'drs_uri': ['drs://localhost/blob_f.1'],
   'id': 'drs://localhost/blob_f.1',
   'name': 'Blob F.1'},
  {'drs_uri': ['drs://localhost/blob_f.2'],
   'id': 'drs://localhost/blob_f.2',
   'name': 'Blob F.2'},
  {'drs_uri': ['drs://localhost/blob_f.3'],
   'id': 'drs://localhost/blob_f.3',
   'name': 'Blob F.3'}]}

In [11]:
cl.get_object("blob_f.1")

http://localhost:9999/ga4gh/drs/v1/objects/blob_f.1


KeyboardInterrupt: 

In [12]:
cl.get_object("blob_f.2")

http://localhost:9999/ga4gh/drs/v1/objects/blob_f.2


KeyboardInterrupt: 

In [13]:
cl.get_object("blob_f.3")

http://localhost:9999/ga4gh/drs/v1/objects/blob_f.3


KeyboardInterrupt: 

In [14]:
cl.get_object("bundle_1", expand=True)

http://localhost:9999/ga4gh/drs/v1/objects/bundle_1?expand=true


{'id': 'bundle_1',
 'created_time': '2021-09-13T13:57:51.000Z',
 'drs_id': 'bundle_1',
 'checksums': [{'checksum': '79a58ab10b666b30ec664097e06bb110',
   'type': 'md5'}],
 'self_uri': 'drs://localhost/bundle_1',
 'size': 5212685564,
 'name': 'Bundle 1',
 'contents': [{'drs_uri': ['drs://localhost/blob_a'],
   'id': 'drs://localhost/blob_a',
   'name': 'Blob A'},
  {'drs_uri': ['drs://localhost/blob_b'],
   'id': 'drs://localhost/blob_b',
   'name': 'Blob B'},
  {'drs_uri': ['drs://localhost/blob_c'],
   'id': 'drs://localhost/blob_c',
   'name': 'Blob C'}]}


### Implementation notes

Bundles are likely to be deprecated. There are many instances where simple objects are retrieved directly, using subject, specimen and file attributes to precisely identify the objects of interest, rather than retrieving a bundle and then having to parse it. Note that there is no structure to a DRS bundle. Parsing a bundle will always be a custom activity depending on the study, project or dataset.
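
As an example of the kind of parsing involved, the sketch below flattens an expanded bundle into its leaf entries. The function name is hypothetical, and the sample is assembled from the bundle_2 and bundle_2.1 responses shown earlier, trimmed to name, id and contents. Anything study-specific would still have to be layered on top.

```python
def iter_leaves(contents, path=()):
    """Yield (path_of_names, id) for every entry that has no nested contents."""
    for entry in contents:
        here = path + (entry["name"],)
        children = entry.get("contents")
        if children:
            yield from iter_leaves(children, here)
        else:
            # Note: an unexpanded nested bundle ("contents": []) also lands here
            yield here, entry["id"]


# bundle_2 with bundle_2.1 expanded (trimmed)
expanded = {
    "id": "bundle_2",
    "contents": [
        {"id": "drs://localhost/blob_e", "name": "Blob E"},
        {
            "id": "drs://localhost/bundle_2.1",
            "name": "Bundle 2.1",
            "contents": [
                {"id": "drs://localhost/blob_f.1", "name": "Blob F.1"},
                {"id": "drs://localhost/blob_f.2", "name": "Blob F.2"},
                {"id": "drs://localhost/blob_f.3", "name": "Blob F.3"},
            ],
        },
    ],
}
for names, drs_id in iter_leaves(expanded["contents"]):
    print(" / ".join(names), drs_id)
```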

The DRS ids used here are legal DRS ids; any string is a legal DRS id.

However, ids containing meaning, such as 'bundle_1', always run the risk of that meaning being relied upon in the field. Unintended consequences frequently develop, e.g. a coder might develop the practice of querying a database with "where id like 'bundle_%'" as a way of identifying all bundle objects.
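
A safer habit is to decide from the structure of the response rather than from the id. In the responses shown in this notebook, bundles carry a contents property and blobs do not, so a check along these lines avoids depending on the id prefix. This is a sketch only; the function name is my own and the samples are trimmed versions of the responses above.

```python
def is_bundle(drs_object):
    """Treat an object as a bundle if it carries a 'contents' property."""
    return "contents" in drs_object


# Trimmed versions of responses shown earlier in this notebook
blob_a = {"id": "blob_a", "access_methods": [{"type": "https"}]}
bundle_1 = {"id": "bundle_1", "contents": [{"id": "drs://localhost/blob_a", "name": "Blob A"}]}

print(is_bundle(blob_a), is_bundle(bundle_1))
```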