# Querying DATS

This notebook demonstrates how to query datasets described with the [DATS model](https://datatagsuite.github.io/docs/html/), which is serialized as [JSON-LD](https://json-ld.org/), using [SPARQL](https://www.w3.org/TR/sparql11-query/).

Let's consider the following AGR dataset provided as a BDBag and expressed in DATS. The DATS JSON-LD file is available [here](https://github.com/datatagsuite/examples/blob/master/BDbag-AGR-example.json).


In [12]:
dats_agr="""
{
    "@type": "Dataset",
    "@id": "http://identifiers.org/minid:b9j69h",
    "@context": "https://w3id.org/dats/context/sdo/dataset_sdo_context.jsonld",
    "identifier": {
        "@type": "Identifier",
        "@context": "https://w3id.org/dats/context/sdo/identifier_info_sdo_context.jsonld",
        "identifier": "http://identifiers.org/minid:b9j69h",
        "identifierSource": "minid"
    },
    "title": "AGR Data set with identifier-based references to data in cloud storage",
    "description": "AGR Data set with identifier-based references to data in cloud storage",
    "dates": [{
        "date": "2018-03-19 17:43:57.073822",
        "type": {
            "value": "creation",
            "valueIRI": ""
        }
    }],
    "creators": [{
        "@type": "Person",
        "@context": "https://w3id.org/dats/context/sdo/person_sdo_context.jsonld",
        "@id": "http://orcid.org/0000-0003-2280-917X",
        "identifier": {
            "identifier": "http://orcid.org/0000-0003-2280-917X",
            "identifierSource": "orcid"
        },
        "affiliations": [{
            "@type": "Organization",
            "@context": "https://w3id.org/dats/context/sdo/organization_sdo_context.jsonld",
            "name": "University of Southern California / Information Science"
        }],
        "firstName": "Michel",
        "fullName": "Mike d'Arcy",
        "lastName": "d'Arcy"
    }],
    "types": [{"information": {"value": "model organism data"}}],
    "hasPart": [
        {
            "@type": "Dataset",
            "@context": "https://w3id.org/dats/context/sdo/dataset_sdo_context.jsonld",
            "@id": "https://identifiers.org/minid:b9n39d",
            "identifier": {
                "identifier": "minid:b9n39d",
                "identifierSource": "minid"
            },
            "title": "A list of disease ontology terms obtained from the Disease Ontology website.",
            "types": [{"information": {"value": "ontology terms"}}],
            "creators": [ {} ],
            "distributions": [{
                "@type": "DatasetDistribution",
                "@context": "https://w3id.org/dats/context/sdo/dataset_distribution_sdo_context.jsonld",
                "identifier": {
                    "identifier": "minid:b9n39d",
                    "identifierSource": ""
                },
                "access": {
                    "@type": "Access",
                    "@context": "https://w3id.org/dats/context/sdo/access_sdo_context.jsonld",
                    "accessURL": "https://s3.amazonaws.com/mod-datadumps/DO/do_1.0.obo",
                    "landingPage": "https://identifiers.org/minid:b9n39d"
                },
                "conformsTo": [{
                    "name": "obo format",
                    "type": {
                        "value": "text/plain",
                        "valueIRI": ""
                    }
                }],
                "size": 4784295,
                "unit": {
                    "value": "byte",
                    "valueIRI": "http://purl.obolibrary.org/obo/UO_0000233"
                },
                "version": "Release 2.6.2018"
            }]
        },
        {
            "@type": "Dataset",
            "@id": "http://identifiers.org/minid:b9hd64",
            "@context": "https://w3id.org/dats/context/sdo/dataset_sdo_context.jsonld",
            "identifier": {
                "identifier": "minid:b9hd64",
                "identifierSource": "minid"
            },
            "title": "A list of gene ontology terms obtained from the Gene Ontology Consortium.",
            "types": [{"information": {"value": "ontology terms"}}],
            "creators": [ { } ],
            "distributions": [{
                "@type": "DatasetDistribution",
                "@context": "https://w3id.org/dats/context/sdo/dataset_distribution_sdo_context.jsonld",
                "identifier": {
                    "identifier": "minid:b9hd64",
                    "identifierSource": ""
                },
                "access": {
                    "@type": "Access",
                    "@context": "https://w3id.org/dats/context/sdo/access_sdo_context.jsonld",
                    "accessURL": "https://s3.amazonaws.com/mod-datadumps/GO/go_1.0.obo",
                    "landingPage": "http://identifiers.org/minid:b9hd64"
                },
                "conformsTo": [{
                    "name": "obo format",
                    "type": {
                        "value": "text/plain",
                        "valueIRI": ""
                    }
                }],
                "size": 36520029,
                "unit": {
                    "value": "byte",
                    "valueIRI": "http://purl.obolibrary.org/obo/UO_0000233"
                },
                "version": "Release 2.6.2018"
            }]
        },
        {
            "@type": "Dataset",
            "@id": "http://identifiers.org/minid:b9px1z",
            "@context": "https://w3id.org/dats/context/sdo/dataset_sdo_context.jsonld",
            "identifier": {
                "identifier": "minid:b9px1z",
                "identifierSource": "minid"
            },
            "title": "A list of sequence ontology terms obtained from the Sequence Ontology website.",
            "types": [{"information": {"value": "ontology terms"}}],
            "creators": [ {} ],
            "dates": [{
                "date": "2.6.2018",
                "type": {
                    "value": "creation",
                    "valueIRI": ""
                }
            }],
            "distributions": [{
                "@type": "DatasetDistribution",
                "@context": "https://w3id.org/dats/context/sdo/dataset_distribution_sdo_context.jsonld",
                "identifier": {
                    "identifier": "minid:b9px1z",
                    "identifierSource": ""
                },
                "access": {
                    "@type": "Access",
                    "@context": "https://w3id.org/dats/context/sdo/access_sdo_context.jsonld",
                    "accessURL": "https://s3.amazonaws.com/mod-datadumps/SO/so_1.0.obo",
                    "landingPage": "http://identifiers.org/minid:b9px1z"
                },
                "conformsTo": [{
                    "name": "obo format",
                    "type": {
                        "value": "text/plain",
                        "valueIRI": ""
                    }
                }],
                "size": 902733,
                "unit": {
                    "value": "byte",
                    "valueIRI": "http://purl.obolibrary.org/obo/UO_0000233"
                },
                "version": "Release 11.24.2015"
            }]
        },
        {
            "@type": "Dataset",
            "@id": "http://identifiers.org/minid:b9dm68",
            "@context": "https://w3id.org/dats/context/sdo/dataset_sdo_context.jsonld",
            "identifier": {
                "identifier": "minid:b9dm68",
                "identifierSource": "minid"
            },
            "title": "Flybase MOD data",
            "types": [{"information": {"value": "MOD data"}}],
            "creators": [ {} ],
            "distributions": [{
                "@type": "DatasetDistribution",
                "@context": "https://w3id.org/dats/context/sdo/dataset_distribution_sdo_context.jsonld",
                "identifier": {
                    "identifier": "minid:b9dm68",
                    "identifierSource": ""
                },
                "access": {
                    "@type": "Access",
                    "@context": "https://w3id.org/dats/context/sdo/access_sdo_context.jsonld",
                    "accessURL": "https://s3.amazonaws.com/mod-datadumps/FB_1.0.4_4.tar.gz",
                    "landingPage": "http://identifiers.org/minid:b9dm68"
                },
                "conformsTo": [{
                    "name": "tar.gz",
                    "type": {
                        "value": "application/x-compressed",
                        "valueIRI": ""
                    }
                }],
                "size": 7361930,
                "unit": {
                    "value": "byte",
                    "valueIRI": "http://purl.obolibrary.org/obo/UO_0000233"
                },
                "version": "1.0.4_4"
            }]
        },
        {
            "@type": "Dataset",
            "@id": "http://identifiers.org/minid:b9cm3t",
            "@context": "https://w3id.org/dats/context/sdo/dataset_sdo_context.jsonld",
            "identifier": {
                "identifier": "minid:b9cm3t",
                "identifierSource": "minid"
            },
            "title": "A list of gene ontology associations for Drosophila obtained from the Gene Ontology Consortium.",
            "types": [{"information": {"value": "gene association data"}}],
            "creators": [ {} ],
            "distributions": [{
                "@type": "DatasetDistribution",
                "@context": "https://w3id.org/dats/context/sdo/dataset_distribution_sdo_context.jsonld",
                "identifier": {
                    "identifier": "minid:b9cm3t",
                    "identifierSource": ""
                },
                "access": {
                    "@type": "Access",
                    "@context": "https://w3id.org/dats/context/sdo/access_sdo_context.jsonld",
                    "accessURL": "https://s3.amazonaws.com/mod-datadumps/GO/ANNOT/gene_association.fb.gz",
                    "landingPage": "http://identifiers.org/minid:b9cm3t"
                },
                "conformsTo": [{
                    "name": "tar.gz",
                    "type": {
                        "value": "application/x-compressed",
                        "valueIRI": ""
                    }
                }],
                "size": 2731033,
                "unit": {
                    "value": "byte",
                    "valueIRI": "http://purl.obolibrary.org/obo/UO_0000233"
                },
                "version": "Last updated 2.6.2018"
            }]
        },
        {
            "@type": "Dataset",
            "@id":  "http://identifiers.org/minid:b9m393",
            "@context": "https://w3id.org/dats/context/sdo/dataset_sdo_context.jsonld",
            "identifier": {
                "identifier": "http://identifiers.org/minid:b9m393",
                "identifierSource": "minid"
            },
            "title": "JSON files containing orthology derived from DIOPT v6.2 http://www.flyrnai.org/cgi-bin/DRSC_orthologs.pl",
            "types": [{"information": {"value": "orthology data"}}],
            "creators": [ {} ],
            "distributions": [{
                "@type": "DatasetDistribution",
                "@context": "https://w3id.org/dats/context/sdo/dataset_distribution_sdo_context.jsonld",
                "identifier": {
                    "identifier": "http://identifiers.org/minid:b9m393",
                    "identifierSource": "minid"
                },
                "access": {
                    "@type": "Access",
                    "@context": "https://w3id.org/dats/context/sdo/access_sdo_context.jsonld",
                    "accessURL": "https://s3.amazonaws.com/mod-datadumps/ORTHO/orthology_FlyBase_1.0.0_2.json.tar.gz",
                    "landingPage": "http://identifiers.org/minid:b9m393"
                },
                "conformsTo": [
                    {
                        "name": "tar.gz",
                        "type": {
                            "value": "application/x-compressed",
                            "valueIRI": ""
                        }
                    },
                    {
                        "name": "json",
                        "type": {
                            "value": "application/json",
                            "valueIRI": ""
                        }
                    }
                ],
                "size": 2614596,
                "unit": {
                    "value": "byte",
                    "valueIRI": "http://purl.obolibrary.org/obo/UO_0000233"
                },
                "version": "DIOPT v6.2"
            }]
        }
    ],
    "distributions": [{
        "@type": "DatasetDistribution",
        "@context": "https://w3id.org/dats/context/sdo/dataset_distribution_sdo_context.jsonld",
        "identifier": {
            "identifier": "http://identifiers.org/minid:b9j69h",
            "identifierSource": "minid"
        },
        "access": {
            "@type": "Access",
            "@context": "https://w3id.org/dats/context/sdo/access_sdo_context.jsonld",
            "landingPage": "http://identifiers.org/minid/b9j69h",
            "accessURL": "https://nih-commons.s3.amazonaws.com/misc/agr-example.tgz"
        },
        "conformsTo": [{
            "name": "tar.gz",
            "type": {
                "value": "application/x-compressed",
                "valueIRI": ""
            }
        }],
        "size": -1,
        "unit": {
            "value": "byte",
            "valueIRI": "http://purl.obolibrary.org/obo/UO_0000233"
        },
        "version": ""
    }],
    "extraProperties": [
        {
            "category": "checksum",
            "categoryIRI": "http://purl.obolibrary.org/obo/NCIT_C43522",
            "values": [{
                "value": "6484968f81afac84857d02b573b0d589fb2f9582a2b920572830dc5781e0a53c",
                "valueIRI": ""
            }]
        },
        {
            "category": "checksum algorithm",
            "categoryIRI": "http://purl.obolibrary.org/obo/NCIT_C16275",
            "values": [{
                "value": "MD5",
                "valueIRI": ""
            }]
        }
    ]
}
"""

Now, let's use the [**rdflib**](https://github.com/RDFLib/rdflib) library:

In [13]:
from rdflib import Graph

We can parse the JSON-LD string into a `Graph`:

In [14]:
g = Graph().parse(data=dats_agr, format='json-ld')

We can inspect the graph by serializing it to Notation3:

In [15]:
# With rdflib < 6, serialize() returns bytes; append .decode('utf-8') before printing.
print(g.serialize(format='n3', indent=4))

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sdo: <https://schema.org/> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://identifiers.org/minid:b9j69h> a sdo:Dataset ;
    sdo:creator <http://orcid.org/0000-0003-2280-917X> ;
    sdo:description "AGR Data set with identifier-based references to data in cloud storage"^^sdo:Text ;
    sdo:distribution [ a sdo:DataDownload ;
            sdo:accessMode [ a sdo:Thing ;
                    sdo:contentUrl "https://nih-commons.s3.amazonaws.com/misc/agr-example.tgz"^^sdo:URL ;
                    sdo:url "http://identifiers.org/minid/b9j69h"^^sdo:URL ] ;
            sdo:identifier [ sdo:identifier "http://identifiers.org/minid:b9j69h"^^sdo:Text ] ;
            sdo:version "" ] ;
    sdo:hasPart <http://identifiers.org/minid:b9cm3t>,
        <http://identifiers.org/minid:b9dm68>,
        <http://identif

Let's now query the graph with SPARQL to find out all the files associated with the dataset and all the sub-datasets.

In [16]:
qres = g.query(
            """
            SELECT DISTINCT ?dataset ?file
            WHERE {
                ?dataset a sdo:Dataset .
                ?dataset sdo:distribution ?distribution .
                ?distribution a sdo:DataDownload .
                ?distribution sdo:accessMode ?access .
                ?access sdo:contentUrl ?file .
            }
            """)

Let's check the query results:


In [17]:
print("Dataset \t \t File")
for row in qres:
    print("%s \t %s" % row)

Dataset 	 	 File
http://identifiers.org/minid:b9dm68 	 https://s3.amazonaws.com/mod-datadumps/FB_1.0.4_4.tar.gz
http://identifiers.org/minid:b9hd64 	 https://s3.amazonaws.com/mod-datadumps/GO/go_1.0.obo
http://identifiers.org/minid:b9cm3t 	 https://s3.amazonaws.com/mod-datadumps/GO/ANNOT/gene_association.fb.gz
http://identifiers.org/minid:b9m393 	 https://s3.amazonaws.com/mod-datadumps/ORTHO/orthology_FlyBase_1.0.0_2.json.tar.gz
https://identifiers.org/minid:b9n39d 	 https://s3.amazonaws.com/mod-datadumps/DO/do_1.0.obo
http://identifiers.org/minid:b9px1z 	 https://s3.amazonaws.com/mod-datadumps/SO/so_1.0.obo
http://identifiers.org/minid:b9j69h 	 https://nih-commons.s3.amazonaws.com/misc/agr-example.tgz
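
As a cross-check that does not involve RDF at all, the same dataset-to-file pairs can be collected by walking the DATS JSON directly. The sketch below is Python 3 and uses only the standard library. The `sample` document is a trimmed stand-in for `dats_agr` (one parent and one sub-dataset), and the helper name `dataset_files` is an assumption made for illustration, not part of DATS or rdflib.

```python
import json

# Trimmed stand-in for dats_agr: the parent dataset plus one sub-dataset.
sample = json.loads("""
{
    "@id": "http://identifiers.org/minid:b9j69h",
    "distributions": [
        {"access": {"accessURL": "https://nih-commons.s3.amazonaws.com/misc/agr-example.tgz"}}
    ],
    "hasPart": [
        {
            "@id": "http://identifiers.org/minid:b9px1z",
            "distributions": [
                {"access": {"accessURL": "https://s3.amazonaws.com/mod-datadumps/SO/so_1.0.obo"}}
            ]
        }
    ]
}
""")


def dataset_files(dataset):
    """Yield (dataset @id, accessURL) pairs for a dataset and, recursively, its parts."""
    for distribution in dataset.get("distributions", []):
        url = distribution.get("access", {}).get("accessURL")
        if url:
            yield dataset.get("@id"), url
    for part in dataset.get("hasPart", []):
        yield from dataset_files(part)


pairs = list(dataset_files(sample))
for dataset_id, url in pairs:
    print("%s \t %s" % (dataset_id, url))
```

Passing `json.loads(dats_agr)` instead of `sample` should give the same pairs as the SPARQL query, although possibly in a different order, since SPARQL results are unordered unless the query uses `ORDER BY`.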


Let's now query for the creators of the dataset and sub-datasets:

In [18]:
qres2 = g.query(
            """            
            SELECT DISTINCT ?dataset ?creator
               WHERE {
                  ?dataset a sdo:Dataset .
                  ?dataset sdo:creator ?creator                  
               }""")


And let's inspect the results:

In [19]:
for row in qres2:
    print("%s created by %s" % row)


http://identifiers.org/minid:b9hd64 created by N7abb4973a8e44fb8826edf5627b14067
http://identifiers.org/minid:b9j69h created by http://orcid.org/0000-0003-2280-917X
http://identifiers.org/minid:b9dm68 created by Na645a2df74e149c4b6cc2b90048e6927
https://identifiers.org/minid:b9n39d created by N3034ce5eff884ccf9f76d9b1fe7aa16d
http://identifiers.org/minid:b9cm3t created by N849e8702edb54a398813ca19382fb123
http://identifiers.org/minid:b9m393 created by Na06b87cc30754c0582883f9a6ba80155
http://identifiers.org/minid:b9px1z created by N93334829a15449279c6924de3e20e5b2


The sub-datasets show blank-node identifiers because their creators were not specified in the DATS representation (each has an empty `creators` entry). For any questions, contact [@agbeltran](https://github.com/agbeltran).