# GA4GH Data Access Example

For this example, metadata have been loaded into a test data registry so that they can be accessed using GA4GH methods (see `python gdc_dos.py`).

## Import the client and models

This will import a Python client and models for accessing data as defined in the schemas.

In [1]:
from ga4gh.dos.client import Client
local_client = Client('http://localhost:8080/')
client = local_client.client
models = local_client.models

## Listing Data Objects

To list the existing Data Objects, we send a `ListDataObjectsRequest` to the `ListDataObjects` method.

In [2]:
ListDataObjectsRequest = models.get_model('ga4ghListDataObjectsRequest')
list_request = client.ListDataObjects(body=ListDataObjectsRequest(page_size=10000000))
list_response = list_request.result()
print("Number of Data Objects: {} ".format(len(list_response.data_objects)))

Number of Data Objects: 0 
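
The request above asks for everything in one very large page. As an alternative, here is a minimal sketch of paging through a listing in smaller requests. The `iter_pages` helper, the `fetch_page` callable, and the fake in-memory server are hypothetical and exist only for illustration; the sketch assumes the service returns a token for the next page and an empty token on the last page.

```python
def iter_pages(fetch_page, page_size=100):
    """Yield items from a paginated listing, one page at a time.

    fetch_page(page_size, page_token) is a hypothetical callable returning an
    (items, next_page_token) pair; an empty or missing token ends the loop.
    """
    page_token = None
    while True:
        items, page_token = fetch_page(page_size, page_token)
        for item in items:
            yield item
        if not page_token:
            break


# Stand-in for a server: serves 7 fake records, using the offset as the token.
FAKE_RECORDS = ["object-{}".format(i) for i in range(7)]


def fake_fetch(page_size, page_token):
    start = int(page_token) if page_token else 0
    end = start + page_size
    next_token = str(end) if end < len(FAKE_RECORDS) else ""
    return FAKE_RECORDS[start:end], next_token


all_records = list(iter_pages(fake_fetch, page_size=3))
print("Number of records: {}".format(len(all_records)))
```

With the client used in this notebook, `fetch_page` could wrap `client.ListDataObjects` and return the response's `data_objects` together with its next-page token, assuming the deployed schema exposes such a token.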


These Data Object messages are for testing purposes only, but they should contain enough information to retrieve their contents from GDC servers.

In [40]:
data_objects = list_response.data_objects
data_object = data_objects[11]
print('url: {}, file_size (B): {}'.format(data_object.urls[0].url, data_object.size))

url: https://api.gdc.cancer.gov/data/8ecfa039-cd6a-4c1a-822c-a1fca0763c3f, file_size (B): 519261


## Filter Public Data

We want to use this service to eventually download data, but first we must find data we have access to.

In [41]:
# A list comprehension (rather than filter) so len() and indexing work on Python 2 and 3.
public_data_objects = [
    x for x in data_objects
    if x['urls'][0]['system_metadata']['access'] == 'open']
print('Number of public Data Objects: {}'.format(len(public_data_objects)))

public_data_object = public_data_objects[0]

Number of public Data Objects: 3045
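
The filter above assumes every Data Object has at least one URL carrying an `access` entry in its `system_metadata`. Below is a hedged sketch of a more defensive check, written against plain dictionaries rather than the client's model classes. The `is_open_access` helper and the sample records are hypothetical.

```python
def is_open_access(data_object):
    """Return True if the first URL of a record is marked as open access.

    Records are plain dictionaries here; missing keys count as not open.
    """
    urls = data_object.get("urls") or []
    if not urls:
        return False
    metadata = urls[0].get("system_metadata") or {}
    return metadata.get("access") == "open"


# Hypothetical sample records covering the cases the check has to survive.
sample_objects = [
    {"id": "a", "urls": [{"url": "https://example.org/a",
                          "system_metadata": {"access": "open"}}]},
    {"id": "b", "urls": [{"url": "https://example.org/b",
                          "system_metadata": {"access": "controlled"}}]},
    {"id": "c", "urls": []},
    {"id": "d"},
]
open_objects = [x for x in sample_objects if is_open_access(x)]
print([x["id"] for x in open_objects])
```

Only the first record passes; the records with no URLs or no `urls` key are skipped instead of raising an error.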


## Download a file

We can then download one of these public files and save it under a local filename.

In [42]:
import requests

# https://stackoverflow.com/questions/16694907/how-to-download-large-file-in-python-with-requests-py
def download_file(url, filename):
    # stream=True avoids loading the whole response into memory at once
    r = requests.get(url, stream=True)
    r.raise_for_status()  # fail early on HTTP errors instead of saving an error page
    with open(filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
    return filename

In [None]:
download_file(public_data_object.urls[0].url, data_object.name)

## Verify the checksum

Data Object messages contain checksums of the underlying files, so we can validate the downloaded file against them.

In [13]:
print(public_data_object.checksums)

[ga4ghChecksum(checksum=u'e88715863824b6a714a90d7ca340916a', type=u'md5')]


In [21]:
given_checksum = public_data_object.checksums[0].checksum

# https://stackoverflow.com/questions/3431825/generating-an-md5-checksum-of-a-file
import hashlib
def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

print(md5(data_object.name))
print(given_checksum)
print(given_checksum == md5(data_object.name))

732d394e3b58fcf9ae5531b9f8d8ad43
670b61d2b86b2090060eff38872a45e9
False
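
The download and checksum steps can also be combined so that the digest is computed while the bytes are written, which avoids reading the file a second time. The following is a minimal sketch, not the notebook's method. The `save_and_hash` helper is hypothetical, and an in-memory buffer stands in for both the HTTP response and the output file.

```python
import hashlib
import io


def save_and_hash(chunks, out_file):
    """Write each non-empty chunk to out_file and return the MD5 hex digest."""
    md5 = hashlib.md5()
    for chunk in chunks:
        if chunk:  # skip empty keep-alive chunks
            out_file.write(chunk)
            md5.update(chunk)
    return md5.hexdigest()


# Fake "download": three chunks, one of them empty.
chunks = [b"hello ", b"", b"world"]
buffer = io.BytesIO()
digest = save_and_hash(chunks, buffer)

# The streamed digest should equal the digest of the complete content.
expected = hashlib.md5(b"hello world").hexdigest()
print(digest == expected)
```

With `requests`, `chunks` could be `r.iter_content(chunk_size=1024)` and `out_file` a file opened in `'wb'` mode; the returned digest can then be compared with the checksum in the Data Object.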


## Visualizing the contents of the registry

Here, we look at the file sizes of the contents of the registry. This is a histogram in which each bin counts the number of files with a size in that range. We plot the counts on a log axis because the large number of very small files would otherwise dominate a linear scale.

In [3]:
import matplotlib.pyplot as plt
# `size` is the attribute used when printing a Data Object earlier in this notebook.
file_sizes = [float(x.size) for x in data_objects]
total_gb = sum(file_sizes) / 1000000000.0
mean_gb = (sum(file_sizes) / len(file_sizes)) / 1000000000.0
plt.hist(file_sizes, bins=96)
plt.title("n = {}, {} GB total, mean {} GB".format(len(file_sizes), total_gb, mean_gb))
plt.yscale('symlog')
plt.show()

NameError: name 'data_objects' is not defined

Observe that most of the non-tiny files are around 2 GB and that a few files are very large.
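
When a plot is not available, a rough text summary of the same distribution can be obtained by counting files per order of magnitude. This is a sketch using only the standard library. The `count_by_magnitude` helper and the sample sizes are hypothetical, and sizes are assumed to be non-negative integers in bytes.

```python
from collections import Counter


def count_by_magnitude(sizes):
    """Count sizes by power-of-ten bucket.

    Bucket k holds sizes with k + 1 decimal digits, so bucket 3 covers
    1000 to 9999 bytes. Sizes 0 to 9 all land in bucket 0.
    """
    counts = Counter()
    for size in sizes:
        bucket = len(str(int(size))) - 1
        counts[bucket] += 1
    return counts


# Hypothetical sample: a few small files and two files around 2 GB.
sample_sizes = [12, 950, 1024, 519261, 2000000000, 2100000000]
counts = count_by_magnitude(sample_sizes)
for bucket, n in sorted(counts.items()):
    print("10^{} bytes: {} file(s)".format(bucket, n))
```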

## Make a Data Bundle of some Data Objects

We can now organize some of the Data Objects into a bundle so that they can be shared together.

For example, we can bundle a few publicly available items. First, we gather the list of Data Objects and compute the hash of their concatenated checksums.

### Calculate the hash for our Objects

In [24]:
public_data_object_ids = [x.id for x in public_data_objects]
print(public_data_object_ids[0])
hashes = [x.checksums[0].checksum for x in public_data_objects]
print(hashes[0])
bundle_md5 = hashlib.md5()
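# Encode to bytes so the call also works on Python 3; the digest is unchanged on Python 2.
bundle_md5.update(''.join(hashes[0:10]).encode('utf-8'))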
bundle_digest = bundle_md5.hexdigest()
print(bundle_digest)

93e75792-4e71-43be-91d1-c69c461b567f
670b61d2b86b2090060eff38872a45e9
d64983aa044c72e05b8e0e61e3f1b64c
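
Because the bundle digest is taken over the concatenated member checksums, it depends on the order of the members. The sketch below wraps the calculation in a hypothetical `bundle_checksum` helper and shows that reversing the order changes the result. Whether members should be sorted before hashing is a design choice that this example does not settle.

```python
import hashlib


def bundle_checksum(checksums):
    """Return the MD5 hex digest of the member checksums joined in order."""
    joined = "".join(checksums).encode("utf-8")
    return hashlib.md5(joined).hexdigest()


# Two checksum strings taken from the outputs in this notebook.
member_checksums = [
    "670b61d2b86b2090060eff38872a45e9",
    "e88715863824b6a714a90d7ca340916a",
]
forward = bundle_checksum(member_checksums)
backward = bundle_checksum(list(reversed(member_checksums)))
print(forward != backward)
```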


### Create a new Data Bundle

In [30]:
CreateDataBundleRequest = models.get_model('ga4ghCreateDataBundleRequest')
DataBundle = models.get_model('ga4ghDataBundle')
Checksum = models.get_model('ga4ghChecksum')
my_bundle = DataBundle(
    name="My Bundle",
    checksums=[Checksum(checksum=bundle_digest, type='md5')],
    data_object_ids=public_data_object_ids[0:10],
    aliases=["bundle-alias", "access:public"])
create_request = CreateDataBundleRequest(data_bundle=my_bundle)
create_response = client.CreateDataBundle(body=create_request).result()
print(create_response.data_bundle_id)

7121dc9f-c8a4-436a-b4c4-c3c7fa58b5dc


Let's now verify the Data Bundle appears as expected:

In [36]:
get_bundle_response = client.GetDataBundle(data_bundle_id=create_response.data_bundle_id).result()
print(get_bundle_response.data_bundle.data_object_ids)

[u'93e75792-4e71-43be-91d1-c69c461b567f', u'0723954e-37b4-4a04-bd58-c685962b8a78', u'cfc60287-61ff-4631-a318-b0ad49a06c11', u'7d8c060a-f616-4a7c-963a-5aa237b040a0', u'6c57b1e9-a5b2-49ac-96a8-c5d090dd9cb1', u'6e5774ac-10aa-40b9-b901-b0776c3a1b33', u'76abc4c9-7977-447e-9bc3-8b06a1417682', u'07758321-c528-4760-aec0-269470c5be35', u'33942f1a-cf23-4545-8110-f767c623c721', u'8176bc53-925e-47ff-8eb7-ccd1e5612d60']
