# Data Object Service Demo

This notebook shows how to use the demonstration server and client to build a simple Data Object Service (DOS) that exposes data from several different sources.

## Installing the Python package

First, we'll install the Data Object Service Schemas package from PyPI. It includes a Python client and a demonstration server.



In [None]:
!pip install ga4gh-dos-schemas

## Running the server

Once you've installed the package from PyPI, you can start the demonstration server with `ga4gh_dos_server`. Run it in a separate terminal.

You should see something like:

```
$  ga4gh_dos_server
 * Serving Flask app "ga4gh.dos.server" (lazy loading)
 * Environment: production
   WARNING: Do not use the development server in a production environment.
   Use a production WSGI server instead.
 * Debug mode: on
 * Running on http://0.0.0.0:8080/ (Press CTRL+C to quit)
 * Restarting with stat
 * Debugger is active!
 * Debugger PIN: 192-487-366
```

Your DOS is now ready to accept requests to Create, Get, and List Data Objects!

## Using the Client to Access the Demo Server

We can now use the Python client to create a simple Data Object. The same could be done using cURL or wget.

In [2]:
from ga4gh.dos.client import Client
client = Client("http://localhost:8080/ga4gh/dos/v1")
c = client.client
models = client.models

At first, the service will not present any Data Objects.

In [4]:
c.ListDataObjects().result()

ListDataObjectsResponse(data_objects=[], next_page_token=None)
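
The response carries a `next_page_token`, so a larger service may return its Data Objects in pages. Here is a minimal sketch of following that token until it is empty. The helper `iter_data_objects` and the stand-in pages are hypothetical, and the `page_token` keyword mentioned in the comment is an assumption about the client, not something shown in this notebook.

```python
from types import SimpleNamespace


def iter_data_objects(list_page):
    """Yield every Data Object, following next_page_token until it is empty.

    `list_page` is any callable that takes a page token (or None for the
    first page) and returns an object with `data_objects` and
    `next_page_token` attributes.
    """
    token = None
    while True:
        response = list_page(token)
        for data_object in response.data_objects:
            yield data_object
        token = response.next_page_token
        if not token:
            break


# Stand-in pages so the sketch runs without a server (hypothetical data).
_PAGES = {
    None: SimpleNamespace(data_objects=["a", "b"], next_page_token="1"),
    "1": SimpleNamespace(data_objects=["c"], next_page_token=None),
}
all_objects = list(iter_data_objects(_PAGES.get))

# Against a live service this might look like the following
# (assumes ListDataObjects accepts a `page_token` keyword):
# iter_data_objects(lambda t: c.ListDataObjects(page_token=t).result())
```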

We can now create a simple Data Object representing a file.

In [6]:
!echo "Hello DOS" > dos.txt
!md5sum dos.txt

976feb684cfdb4b2337530699e1d0fbd  dos.txt


In [7]:
DataObject = models.get_model('DataObject')
Checksum = models.get_model('Checksum')
URL = models.get_model('URL')
hello_object = DataObject()

In [8]:
# Set the Data Object metadata
hello_object.id = 'test'
hello_object.checksums = [Checksum(checksum="976feb684cfdb4b2337530699e1d0fbd", type="md5")]
hello_object.urls = [URL(url="file://dos.txt")]
hello_object.name = 'dos.txt'

In [10]:
# Post the Data Object to the service
c.CreateDataObject(body={'data_object': hello_object}).result()

CreateDataObjectResponse(data_object_id=u'test')
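
For readers who would rather use cURL, wget, or plain Python than the client, the same creation request can be sketched as a JSON POST. The `/dataobjects` path and the helper `build_create_request` are assumptions made for illustration; check the service's OpenAPI description for the actual route. The request is only built here, not sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/ga4gh/dos/v1"  # same base URL as the client


def build_create_request(data_object, base_url=BASE_URL):
    """Build (but do not send) a POST request that creates a Data Object."""
    body = json.dumps({"data_object": data_object}).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/dataobjects",  # assumed path, not taken from this notebook
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_create_request({
    "id": "test",
    "name": "dos.txt",
    "checksums": [{"checksum": "976feb684cfdb4b2337530699e1d0fbd", "type": "md5"}],
    "urls": [{"url": "file://dos.txt"}],
})
# To send it to a running server: urllib.request.urlopen(request)
```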

In [12]:
# Get the resulting created object
c.GetDataObject(data_object_id='test').result()

GetDataObjectResponse(data_object=DataObject(aliases=None, checksums=[Checksum(checksum=u'976feb684cfdb4b2337530699e1d0fbd', type=u'md5')], created=datetime.datetime(2018, 5, 31, 9, 47, 9, 729521, tzinfo=tzutc()), description=None, id=u'test', mime_type=None, name=u'dos.txt', size=None, updated=datetime.datetime(2018, 5, 31, 9, 47, 9, 729536, tzinfo=tzutc()), urls=[URL(system_metadata=None, url=u'file://dos.txt', user_metadata=None)], version=u'2018-05-31T09:47:09.729541Z'))

## Using DOS With Reference FASTAs

A useful Data Object Service might present a list of available reference FASTAs for performing downstream alignment and analysis.

We'll index the UCSC human reference FASTAs into DOS as an example.

In [15]:
!wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz
!md5sum chr22.fa.gz

--2018-05-31 09:50:36--  http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz
Resolving hgdownload.cse.ucsc.edu (hgdownload.cse.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.cse.ucsc.edu (hgdownload.cse.ucsc.edu)|128.114.119.163|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12255678 (12M) [application/x-gzip]
Saving to: ‘chr22.fa.gz’


2018-05-31 09:50:42 (2.03 MB/s) - ‘chr22.fa.gz’ saved [12255678/12255678]

41b47ce1cc21b558409c19b892e1c0d1  chr22.fa.gz


In [16]:
# Adding a second URL because FTP is preferred
chr22 = DataObject()
chr22.id = 'hg38-chr22'
chr22.name = 'chr22.fa.gz'
chr22.urls = [
    URL(url='http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz'),
    URL(url='ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz')]
chr22.checksums = [Checksum(checksum='41b47ce1cc21b558409c19b892e1c0d1', type='md5')]
chr22.aliases = ['NC_000022', 'CM000684']
chr22.size = '12255678'

In [17]:
# Add the chr22 Data Object to the service
c.CreateDataObject(body={'data_object': chr22}).result()

CreateDataObjectResponse(data_object_id=u'hg38-chr22')

In [18]:
c.GetDataObject(data_object_id='hg38-chr22').result()

GetDataObjectResponse(data_object=DataObject(aliases=[u'NC_000022', u'CM000684'], checksums=[Checksum(checksum=u'41b47ce1cc21b558409c19b892e1c0d1', type=u'md5')], created=datetime.datetime(2018, 5, 31, 9, 54, 56, 385181, tzinfo=tzutc()), description=None, id=u'hg38-chr22', mime_type=None, name=u'chr22.fa.gz', size=12255678L, updated=datetime.datetime(2018, 5, 31, 9, 54, 56, 385193, tzinfo=tzutc()), urls=[URL(system_metadata=None, url=u'http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz', user_metadata=None), URL(system_metadata=None, url=u'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz', user_metadata=None)], version=u'2018-05-31T09:54:56.385197Z'))
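
Since a Data Object can list several URLs, a downloader has to choose one. A minimal sketch of choosing by scheme preference follows. The helper `pick_url` and its default ordering are assumptions for illustration; DOS itself does not rank URLs.

```python
from urllib.parse import urlparse


def pick_url(urls, preferred_schemes=("ftp", "http")):
    """Return the first URL whose scheme appears earliest in `preferred_schemes`.

    Falls back to the first URL when no scheme matches, or None for an empty list.
    """
    for scheme in preferred_schemes:
        for url in urls:
            if urlparse(url).scheme == scheme:
                return url
    return urls[0] if urls else None


chr22_urls = [
    "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz",
    "ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz",
]
chosen = pick_url(chr22_urls)
```

With objects returned by the client, the input list would be `[u.url for u in data_object.urls]`.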

## Using DOS with htsget

Data Objects are meant to represent versioned artifacts and can represent an API resource. For example, we could use DOS as a way of exposing htsget resources.

The [htsget Quickstart documentation](https://htsget.readthedocs.io/en/stable/quickstart.html) links to the following snippet, which streams the BAM results to the client.

In [19]:
!htsget http://htsnexus.rnd.dnanex.us/v1/reads/BroadHiSeqX_b37/NA12878 \
    --reference-name=2 --start=1000 --end=20000 -O NA12878_2.bam

In [23]:
!md5sum NA12878_2.bam
!ls -al NA12878_2.bam

eaf80af5e9e54db5936578bed06ffcdc  NA12878_2.bam
-rw-r--r-- 1 david david 555749 May 31 10:00 NA12878_2.bam


In [26]:
na12878_2 = DataObject()
na12878_2.id = 'na12878_2'
na12878_2.name = 'NA12878_2.bam'
na12878_2.checksums = [Checksum(checksum='eaf80af5e9e54db5936578bed06ffcdc', type='md5')]
na12878_2.urls = [
    URL(
        url="http://htsnexus.rnd.dnanex.us/v1/reads/BroadHiSeqX_b37/NA12878", 
        system_metadata={'reference_name': 2, 'start': 1000, 'end': 20000})]
na12878_2.aliases = ['NA12878 chr 2 subset']
na12878_2.size = '555749'

In [27]:
c.CreateDataObject(body={'data_object': na12878_2}).result()

CreateDataObjectResponse(data_object_id=u'na12878_2')

In [29]:
c.GetDataObject(data_object_id='na12878_2').result()

GetDataObjectResponse(data_object=DataObject(aliases=[u'NA12878 chr 2 subset'], checksums=[Checksum(checksum=u'eaf80af5e9e54db5936578bed06ffcdc', type=u'md5')], created=datetime.datetime(2018, 5, 31, 10, 5, 7, 748572, tzinfo=tzutc()), description=None, id=u'na12878_2', mime_type=None, name=u'NA12878_2.bam', size=555749L, updated=datetime.datetime(2018, 5, 31, 10, 5, 7, 748583, tzinfo=tzutc()), urls=[URL(system_metadata=SystemMetadata(end=20000, reference_name=2, start=1000), url=u'http://htsnexus.rnd.dnanex.us/v1/reads/BroadHiSeqX_b37/NA12878', user_metadata=None)], version=u'2018-05-31T10:05:07.748588Z'))
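
A consumer of this Data Object needs to turn the URL and its `system_metadata` back into an htsget request. The sketch below assumes the snake_case keys stored above map onto the htsget query parameters `referenceName`, `start`, and `end`. The helper `htsget_request_url` is hypothetical, and the URL is only built, not fetched.

```python
from urllib.parse import urlencode

# Assumed mapping from the metadata keys stored above to htsget query parameters.
PARAM_NAMES = {"reference_name": "referenceName", "start": "start", "end": "end"}


def htsget_request_url(url, system_metadata):
    """Append the region stored in `system_metadata` to `url` as a query string."""
    query = {
        PARAM_NAMES[key]: value
        for key, value in system_metadata.items()
        if key in PARAM_NAMES
    }
    if not query:
        return url
    return url + "?" + urlencode(query)


request_url = htsget_request_url(
    "http://htsnexus.rnd.dnanex.us/v1/reads/BroadHiSeqX_b37/NA12878",
    {"reference_name": 2, "start": 1000, "end": 20000},
)
```

The response to such a request is a ticket of further URLs to fetch, so a tool like the `htsget` command used above is still the simplest way to retrieve the data itself.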

## Using DOS with S3

One of the original intentions of DOS is to create an interoperability layer over the various object stores. We can create Data Objects that point to items in S3 so that subsequent downloaders can find them.

A DOS hosting the 1000 Genomes S3 data, built with [dos_connect](https://github.com/ohsu-comp-bio/dos_connect), is available.

In [31]:
client_1kg = Client('http://ec2-52-26-45-130.us-west-2.compute.amazonaws.com:8080/ga4gh/dos/v1/')
c1kg = client_1kg.client

In [34]:
do_1kg = c1kg.ListDataObjects().result().data_objects[0]

In [38]:
print(do_1kg.urls[0].url)
print(do_1kg.checksums[0])
print(do_1kg.id)

s3://1000genomes/phase3/data/HG02885/alignment/HG02885.mapped.ILLUMINA.bwa.GWD.low_coverage.20121211.bam.cram.crai
Checksum(checksum=u'ddc4d0aea91b82a1c202a0cd1219e520', type=u'md5')
b3549308-9dd0-4fdb-92b2-5a2697521354
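
Most S3 tooling wants a bucket name and an object key rather than an `s3://` URL. A minimal sketch of splitting the URL printed above, using a hypothetical helper `split_s3_url`:

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split an s3:// URL into (bucket, key)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3":
        raise ValueError("not an s3:// URL: %r" % url)
    # The bucket is the network location; the key is the path without its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_url(
    "s3://1000genomes/phase3/data/HG02885/alignment/"
    "HG02885.mapped.ILLUMINA.bwa.GWD.low_coverage.20121211.bam.cram.crai"
)
```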


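We can now use an S3 downloader to retrieve the file and confirm the checksum. The command expects AWS credentials in the `aws_access_key_id` and `aws_secret_access_key` variables; in the run captured below they appear to have been empty, so the downloader printed its usage message instead of fetching the file.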

In [40]:
!dos-downloader http://ec2-52-26-45-130.us-west-2.compute.amazonaws.com:8080/ga4gh/dos/v1/ b3549308-9dd0-4fdb-92b2-5a2697521354 --aws_secret_key $aws_secret_access_key --aws_access_key $aws_access_key_id

usage: dos-downloader [-h] [--aws_access_key AWS_ACCESS_KEY]
                      [--aws_secret_key AWS_SECRET_KEY] [--path PATH]
                      url id
dos-downloader: error: argument --aws_secret_key: expected one argument


## DOS GDC Data

Another demonstration in this repository asks you to create a DOS for the NCI GDC data. That process has been automated as a lambda, dos-gdc-lambda.

In [43]:
cgdc = Client("https://dos-gdc.ucsc-cgp-dev.org/")

In [48]:
gdc_do = cgdc.client.ListDataObjects().result().data_objects[0]

In [52]:
print(gdc_do.name)
print(gdc_do.size)
print(gdc_do.urls[0].url)

4803fc06-e2de-44aa-b76e-f8fe9308c18d.bam
19098711404
https://api.gdc.cancer.gov/data/4803fc06-e2de-44aa-b76e-f8fe9308c18d


## DAS DOS

The UCSC Genome Browser provides a DAS service for fetching sequence by region from a named assembly, and TogoWS offers a similar API. Using either with DOS is easy: each is just another URL on the Data Object.

Both of these APIs allow further range queries against the result.

https://genome.ucsc.edu/FAQ/FAQdownloads.html#download23

In [54]:
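# Add two range-query URLs for the same region to the existing chr22 Data Object.
# Note that the DAS URL requests hg19, while this Data Object describes hg38.
chr22.urls.append(URL(url='http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr22:15000,16000'))
chr22.urls.append(URL(url='http://togows.org/api/ucsc/hg38/chr22:15000-16000.fasta'))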

In [57]:
c.UpdateDataObject(body={'data_object': chr22}, data_object_id=chr22.id).result()

UpdateDataObjectResponse(data_object_id=u'hg38-chr22')

In [84]:
response_chr22 = c.GetDataObject(data_object_id=chr22.id).result().data_object
# Note the change in version; in DOS, versions are just arbitrary strings
print(response_chr22.version, chr22.version)
url_1 = response_chr22.urls[2].url
url_2 = response_chr22.urls[3].url

(u'2018-05-31T11:33:04.156318Z', None)


In [89]:
!wget $url_1

--2018-05-31 11:40:17--  http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr22:15000,16000
Resolving genome.ucsc.edu (genome.ucsc.edu)... 128.114.119.131, 128.114.119.132, 128.114.119.133, ...
Connecting to genome.ucsc.edu (genome.ucsc.edu)|128.114.119.131|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/xml]
Saving to: ‘dna?segment=chr22:15000,16000.2’

dna?segment=chr22:1     [ <=>                ]   1.22K  --.-KB/s    in 0s      

2018-05-31 11:40:17 (62.9 MB/s) - ‘dna?segment=chr22:15000,16000.2’ saved [1246]



In [90]:
!head dna?segment=chr22:15000,16000

<?xml version="1.0" standalone="no"?>
<!DOCTYPE DASDNA SYSTEM "http://www.biodas.org/dtd/dasdna.dtd">
<DASDNA>
<SEQUENCE id="chr22" start="15000" stop="16000" version="1.00">
<DNA length="1001">
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
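
The DAS response is XML, so the sequence has to be pulled out of the `DNA` element before use. A minimal sketch with the standard library follows. The helper `parse_das_dna` and the shortened sample document are hypothetical, modelled on the element names visible in the output above.

```python
import xml.etree.ElementTree as ET

# Shortened, made-up document using the same element names as the response above.
SAMPLE_DAS = """<?xml version="1.0" standalone="no"?>
<!DOCTYPE DASDNA SYSTEM "http://www.biodas.org/dtd/dasdna.dtd">
<DASDNA>
<SEQUENCE id="chr22" start="15000" stop="15007" version="1.00">
<DNA length="8">
nnnn
acgt
</DNA>
</SEQUENCE>
</DASDNA>
"""


def parse_das_dna(xml_text):
    """Extract the region and the sequence from a DASDNA document."""
    root = ET.fromstring(xml_text)
    sequence_el = root.find("SEQUENCE")
    dna_el = sequence_el.find("DNA")
    return {
        "id": sequence_el.get("id"),
        "start": int(sequence_el.get("start")),
        "stop": int(sequence_el.get("stop")),
        # Join the wrapped lines into one string with no whitespace.
        "sequence": "".join(dna_el.text.split()),
    }


das_record = parse_das_dna(SAMPLE_DAS)
```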


In [91]:
!wget $url_2

--2018-05-31 11:40:20--  http://togows.org/api/ucsc/hg38/chr22:15000-16000.fasta
Resolving togows.org (togows.org)... 133.39.78.80
Connecting to togows.org (togows.org)|133.39.78.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘chr22:15000-16000.fasta.1’

chr22:15000-16000.f     [ <=>                ]   1.02K  --.-KB/s    in 0s      

2018-05-31 11:40:21 (49.6 MB/s) - ‘chr22:15000-16000.fasta.1’ saved [1042]



In [92]:
!head chr22:15000-16000.fasta

>hg38:chr22:15000-16000
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
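
The TogoWS response is plain FASTA, which needs only a few lines to read. The helper `parse_fasta` below is a hypothetical, minimal sketch run on a shortened sample with the same header as the output above; for real work a library such as Biopython is the more robust choice.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Shortened sample; the real response wraps 1001 bases over many lines.
SAMPLE_FASTA = ">hg38:chr22:15000-16000\nNNNN\nNNNN\n"
fasta_records = parse_fasta(SAMPLE_FASTA)
```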
