# Accessing DSS via DOS

Data in the HCA DSS is replicated across cloud stores, so it can be downloaded from the "nearest" location to avoid egress fees.

The data in the DSS is available through the Data Object Service (DOS) schemas, which provide an interoperable way to expose replicated and versioned data over a simple HTTP API.

## Using the requests module

To access services over HTTP, we use the `requests` module.

In [6]:
import requests
SERVICE_URL = "https://spbnq0bc10.execute-api.us-west-2.amazonaws.com/api"

The `ListDataObjects` method has not been implemented yet. However, the DSS's bundle-oriented index can be accessed using `ListDataBundles`.

### Listing Data Bundles

In [5]:
BASE_URL = "ga4gh/dos/v1"
LIST_DATA_BUNDLES_URL = "{}/{}/{}".format(SERVICE_URL, BASE_URL, "databundles/list")
data_bundles = requests.post(LIST_DATA_BUNDLES_URL).json()['data_bundles']
print(data_bundles)

[{u'version': u'2018-01-31T081714', u'id': u'06c4bd47-c8e2-5045-8bae-bfad24633c87'}, {u'version': u'2018-01-31T081724', u'id': u'0d6371a8-fc4f-5232-9660-e655903b17ea'}, {u'version': u'2018-01-31T082416', u'id': u'0e727062-7fc9-5e46-b1e3-24537426ca4c'}, {u'version': u'2018-01-31T093034', u'id': u'2277b3fc-5a75-5782-86a0-c29f13844e7d'}, {u'version': u'2018-01-31T142518', u'id': u'139f30ba-62d3-50fb-9177-ab3d370e29f8'}, {u'version': u'2018-01-31T142526', u'id': u'1ecf1c35-9e1e-55ef-8f42-71102c3abc33'}, {u'version': u'2018-01-31T152805', u'id': u'44a8837b-4456-5709-b56b-54e23000f13a'}, {u'version': u'2018-01-31T084107', u'id': u'108c3839-a48e-53d8-a765-e7bfa5da6c81'}, {u'version': u'2018-01-31T094039', u'id': u'233bc61e-e9e8-5f75-a8d9-189cfced36fe'}, {u'version': u'2018-01-31T095409', u'id': u'28bebda7-14b1-5c47-b9b7-52540f091866'}]
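The URL construction and the list call above can be wrapped in two small helpers. This is a minimal sketch written for Python 3, not part of the DOS tooling: `dos_url` and `list_data_bundles` are hypothetical names, and the HTTP layer is passed in as `http` (the `requests` module or a `requests.Session`) so the same function can be exercised against a stub.

```python
def dos_url(service_url, *parts, base_path="ga4gh/dos/v1"):
    """Join the service root, the DOS base path, and path segments into one URL."""
    pieces = [service_url, base_path] + list(parts)
    # Strip stray slashes so a trailing "/" on the service root does not double up.
    return "/".join(str(piece).strip("/") for piece in pieces)


def list_data_bundles(http, service_url, timeout=30):
    """POST to databundles/list and return the 'data_bundles' list.

    `http` is anything with a requests-style `post` method (hypothetical
    parameter; pass the `requests` module or a `requests.Session`).
    """
    response = http.post(dos_url(service_url, "databundles/list"), timeout=timeout)
    # Surface HTTP errors here rather than as a KeyError on the JSON body.
    response.raise_for_status()
    return response.json().get("data_bundles", [])
```

With `requests` imported, `list_data_bundles(requests, SERVICE_URL)` would issue the same request as the cell above, with an added timeout and status check.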


### Getting Data Bundle details

Now that we have some Data Bundle identifiers, we can use `GetDataBundle` to retrieve more information about a bundle.

In [15]:
DATA_BUNDLE_URL = "{}/{}/databundles/{}".format(SERVICE_URL, BASE_URL, data_bundles[0]['id'])
data_bundle = requests.get(DATA_BUNDLE_URL).json()['data_bundle']
print(data_bundle['data_object_ids'])

[u'1311414e-9f12-4596-99bd-6b06cac69025', u'40628e8c-6456-4d72-8600-91691cb1833d', u'8f3784cd-3e3f-4450-8861-e5a02c7ba554']


### Getting Data Object Details

Using the data object identifiers from the Data Bundle, we can now look up Data Objects for download. Both HTTPS URLs (which redirect to signed URLs) and cloud-native URLs (`s3://`, `gs://`) are available.

In [18]:
data_object_id = data_bundle['data_object_ids'][0]
DATA_OBJECT_URL = "{}/{}/dataobjects/{}".format(SERVICE_URL, BASE_URL, data_object_id)
data_object = requests.get(DATA_OBJECT_URL).json()['data_object']

The Data Object contains a list of URLs and checksums that can be used to download and access the file.

In [26]:
print("-----------URLS------------")
for url in data_object['urls']:
    print(url)
print("-----------checksums-----------")
for checksum in data_object['checksums']:
    print(checksum)

-----------URLS------------
{u'url': u'https://commons-dss.ucsc-cgp-dev.org/v1/files/1311414e-9f12-4596-99bd-6b06cac69025?replica=aws'}
{u'url': u'https://commons-dss.ucsc-cgp-dev.org/v1/files/1311414e-9f12-4596-99bd-6b06cac69025?replica=azure'}
{u'url': u'https://commons-dss.ucsc-cgp-dev.org/v1/files/1311414e-9f12-4596-99bd-6b06cac69025?replica=gcp'}
{u'url': u's3://commons-dss-commons/blobs/19e5620579898ace0db2135e0434daba4f48edf72c5dbf82bfc1ad173161ff71.e1137f4a813e4d18387a799a5f88bb5d300c2cd6.aa81284302982b0f755d1238c3349762-336.feb986d1'}
{u'url': u'gs://commons-dss-commons/blobs/19e5620579898ace0db2135e0434daba4f48edf72c5dbf82bfc1ad173161ff71.e1137f4a813e4d18387a799a5f88bb5d300c2cd6.aa81284302982b0f755d1238c3349762-336.feb986d1'}
-----------checksums-----------
{u'checksum': u'19e5620579898ace0db2135e0434daba4f48edf72c5dbf82bfc1ad173161ff71', u'type': u'sha256'}
{u'checksum': u'aa81284302982b0f755d1238c3349762-336', u'type': u'etag'}
{u'checksum': u'e1137f4a813e4d18387a799a5f88bb

These files can now be fetched with an HTTP, S3, or Google Cloud Storage client.
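Before downloading, it can help to group the URLs by scheme and to pick out the checksum to verify against. The sketch below assumes the dictionary shapes printed above (`{'url': ...}` and `{'checksum': ..., 'type': ...}`); the helper names are hypothetical. Only the `sha256` entry is compared here, since it is a plain hex digest of the file contents.

```python
import hashlib
from urllib.parse import urlparse


def urls_by_scheme(urls):
    """Group a Data Object's URL entries by scheme (for example https, s3, gs)."""
    grouped = {}
    for entry in urls:
        scheme = urlparse(entry["url"]).scheme
        grouped.setdefault(scheme, []).append(entry["url"])
    return grouped


def expected_checksum(checksums, checksum_type="sha256"):
    """Return the checksum value of the given type, or None if it is absent."""
    for entry in checksums:
        if entry["type"] == checksum_type:
            return entry["checksum"]
    return None


def sha256_matches(path, expected, chunk_size=1 << 20):
    """Hash a local file in chunks and compare it with the expected hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected
```

For example, `urls_by_scheme(data_object['urls'])['s3']` would give the S3 locations, and `sha256_matches(local_path, expected_checksum(data_object['checksums']))` checks a finished download, where `local_path` is wherever the file was saved.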

## Using the DOS Python Client

The same steps can be performed with the DOS Python client.

In [38]:
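from ga4gh.dos.client import Client
# Point the client at the same service root used in the requests example.
client = Client("https://spbnq0bc10.execute-api.us-west-2.amazonaws.com/api")
# client.client carries the service operations (ListDataBundles, GetDataBundle, ...).
lc = local_client = client.client
# client.models is used later to build request objects.
models = client.models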

List the data bundles offered by the service.

In [39]:
data_bundles = lc.ListDataBundles(body={}).result().data_bundles

In [40]:
print(data_bundles[0].id)
data_bundle = lc.GetDataBundle(data_bundle_id=data_bundles[0].id).result().data_bundle
print(data_bundle.data_object_ids)

06c4bd47-c8e2-5045-8bae-bfad24633c87
[u'1311414e-9f12-4596-99bd-6b06cac69025', u'40628e8c-6456-4d72-8600-91691cb1833d', u'8f3784cd-3e3f-4450-8861-e5a02c7ba554']


### Getting a Page of Results

In [44]:
ListDataBundles = models.get_model('ga4ghListDataBundlesRequest')
request = ListDataBundles(page_size=10)
response = lc.ListDataBundles(body=request).result()

In [45]:
len(response.data_bundles)
response.next_page_token

u'DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAY56Fm9aRm80VWE1VDJTYkVTQnFNVVlndGcAAAAAAAGOexZvWkZvNFVhNVQyU2JFU0JxTVVZZ3RnAAAAAAABjn4Wb1pGbzRVYTVUMlNiRVNCcU1VWWd0ZwAAAAAAAY59Fm9aRm80VWE1VDJTYkVTQnFNVVlndGcAAAAAAAGOfBZvWkZvNFVhNVQyU2JFU0JxTVVZZ3Rn'

### Getting the Next Page of Results

In [46]:
request = ListDataBundles(page_size=10, page_token=response.next_page_token)
page_2 = lc.ListDataBundles(body=request).result()

In [51]:
print('page 1')
print("\n".join(["{}.{}".format(x.id, x.version) for x in response.data_bundles]))
print('page 2')
print("\n".join(["{}.{}".format(x.id, x.version) for x in page_2.data_bundles]))

page 1
06c4bd47-c8e2-5045-8bae-bfad24633c87.2018-01-31T081714
0d6371a8-fc4f-5232-9660-e655903b17ea.2018-01-31T081724
0e727062-7fc9-5e46-b1e3-24537426ca4c.2018-01-31T082416
2277b3fc-5a75-5782-86a0-c29f13844e7d.2018-01-31T093034
139f30ba-62d3-50fb-9177-ab3d370e29f8.2018-01-31T142518
1ecf1c35-9e1e-55ef-8f42-71102c3abc33.2018-01-31T142526
44a8837b-4456-5709-b56b-54e23000f13a.2018-01-31T152805
108c3839-a48e-53d8-a765-e7bfa5da6c81.2018-01-31T084107
233bc61e-e9e8-5f75-a8d9-189cfced36fe.2018-01-31T094039
28bebda7-14b1-5c47-b9b7-52540f091866.2018-01-31T095409
page 2
108c3839-a48e-53d8-a765-e7bfa5da6c81.2018-01-31T142513
46e29f86-2983-5658-9f93-5f8aea24a4a2.2018-01-31T155013
492054ee-31e5-5516-ae96-fbba12fbc73d.2018-01-31T160008
4a51ff38-f4ea-5599-b752-8e65724864db.2018-01-31T160635
014a9de5-cb88-5e37-a196-b6e3ab30fff6.2018-01-31T081707
0583d98e-b079-51ae-affc-1c2d6200c84d.2018-01-31T081711
1111ec7b-675d-5c00-8aa4-7eea28f2b846.2018-01-31T084755
0a5f13d7-a1f5-55f6-994f-48f252ac61c7.2018-01-31T142
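The two-page walkthrough generalizes to a loop that keeps requesting pages until no token comes back. The following is a sketch under assumptions: `iter_data_bundles` and `fetch_page` are hypothetical names, and it assumes the service signals the last page with an empty or missing `next_page_token`, which the example above does not confirm.

```python
def iter_data_bundles(fetch_page, page_size=10):
    """Yield every bundle, following next_page_token until none is returned.

    `fetch_page(page_size, page_token)` is a hypothetical callable that
    returns one response object with `data_bundles` and `next_page_token`.
    """
    page_token = None
    seen_tokens = set()
    while True:
        response = fetch_page(page_size, page_token)
        for bundle in response.data_bundles:
            yield bundle
        page_token = getattr(response, "next_page_token", None)
        # Stop on an empty token, or on a repeated one to avoid looping forever.
        if not page_token or page_token in seen_tokens:
            break
        seen_tokens.add(page_token)


# Possible wiring to the client from this notebook (not executed here):
# def fetch_page(page_size, page_token):
#     kwargs = {"page_size": page_size}
#     if page_token:
#         kwargs["page_token"] = page_token
#     return lc.ListDataBundles(body=ListDataBundles(**kwargs)).result()
```

Because it is a generator, the caller can stop early (for instance after the first 25 bundles) without fetching the remaining pages.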

## Inspect a Data Bundle

In [52]:
data_bundle = lc.GetDataBundle(data_bundle_id=data_bundles[0].id).result().data_bundle

In [53]:
print(data_bundles[0].id)
print(data_bundle.id)

06c4bd47-c8e2-5045-8bae-bfad24633c87
06c4bd47-c8e2-5045-8bae-bfad24633c87


In [54]:
print("\n".join(data_bundle.data_object_ids))

1311414e-9f12-4596-99bd-6b06cac69025
40628e8c-6456-4d72-8600-91691cb1833d
8f3784cd-3e3f-4450-8861-e5a02c7ba554


## Download a Data Object

In [56]:
data_object = lc.GetDataObject(
    data_object_id=data_bundle.data_object_ids[2]).result().data_object

In [65]:
url = data_object.urls[0].url
print(url)

https://commons-dss.ucsc-cgp-dev.org/v1/files/8f3784cd-3e3f-4450-8861-e5a02c7ba554?replica=aws


In [72]:
!wget $url -O $data_object.id

--2018-02-21 16:23:43--  https://commons-dss.ucsc-cgp-dev.org/v1/files/8f3784cd-3e3f-4450-8861-e5a02c7ba554?replica=aws
Resolving commons-dss.ucsc-cgp-dev.org (commons-dss.ucsc-cgp-dev.org)... 54.192.117.96, 54.192.117.180, 54.192.117.49, ...
Connecting to commons-dss.ucsc-cgp-dev.org (commons-dss.ucsc-cgp-dev.org)|54.192.117.96|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://commons-dss-commons.s3.amazonaws.com/blobs/628d68afd08ac7f5225ccf2cc5ee6ad97d7fe54182c63610739b5f56010767a3.3dd9e1f914d3bfd639e66673bb7abe0de6f6fcf7.90b714a05568197fbad5b43be308f3e8.815ab816?AWSAccessKeyId=ASIAIFCPMRV3NVDDGLVA&Signature=bFPGpzm37KNTbc1Qc0R%2BdJW79fw%3D&x-amz-security-token=FQoDYXdzEDcaDN2i7ksn5opvXEjvxCLkAYa6wedOM%2FxhKAILEunahqXqGRE%2BQ0UWC9az11Fym4hPl02wH6tXW7RyXNWAz5Kq0Z7QqPHodERosMU2LNP7ZAFFBfi3nBoLUZrow8AO1zIIfYp5y2DyL4IVmTHmDnRf9qEWQvwEwb%2FnDr0eRSb93IOtTer0J2rLAONr5JOPGWqciIuKGDuNNhkueJhCXZ0DylAerCebYIZ24lzDVKK%2B1IFoDMrpNJEWzBvt9g0hET2B2EGq4ktFlPKHYCR
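The same download can be done from Python instead of `wget`. This is a minimal sketch: `download_to_file` is a hypothetical helper, and `http` stands for the `requests` module or a `requests.Session`. By default, `requests` follows the 302 redirect to the signed URL for GET requests, so no extra handling is needed for it.

```python
def download_to_file(http, url, path, chunk_size=1 << 20, timeout=60):
    """Stream `url` to `path` and return the number of bytes written.

    `http` is anything with a requests-style `get` method (hypothetical
    parameter; pass the `requests` module or a `requests.Session`).
    """
    written = 0
    # stream=True avoids loading the whole file into memory at once.
    response = http.get(url, stream=True, timeout=timeout)
    try:
        response.raise_for_status()
        with open(path, "wb") as handle:
            for chunk in response.iter_content(chunk_size=chunk_size):
                handle.write(chunk)
                written += len(chunk)
    finally:
        # Release the connection even if writing fails partway through.
        response.close()
    return written
```

With `requests` imported, `download_to_file(requests, url, data_object.id)` would save the file under the data object's identifier, mirroring the `wget` call.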

In [73]:
!head $data_object.id

{
    "center_name": "NYGC",
    "donor_uuid": "b8284a5b-429d-5652-8247-0257f1e2f61d",
    "program": "TOPMed",
    "project": "HapMap",
    "schema_version": "0.0.3",
    "specimen": [
        {
            "samples": [
                {
