## GA4GH Data Connect - Schemas with semantic references

This notebook illustrates what is possible when you implement Data Connect and your schema or data model references a semantic standard for your data, one that resides in a repository of standards and can be referred to by an id.

It is possible that your entire data source can be defined by reference to an external model. We are working on examples for this. To start with, we will look at the situation where individual properties reference a data element.

The following schema lists a subset of the attributes for the CRDC Cancer Data Service Metadata submission template. Three attributes of a subject in that schema are described by Common Data Element identifiers.

Note the "$cde" key is not standard JSON Schema. The point here is to illustrate the data model (metamodel) required in biomedical use cases, where semantic resources such as common data elements are widely used.

In [2]:
from fasp.search import DataConnectClient
cl = DataConnectClient('http://localhost:8089/')
cds_schema = cl.listTableInfo('bigquery.cds.subject', verbose=True)

_Schema for tablebigquery.cds.subject_
{
   "name": "bigquery.cds.subject",
   "description": "Cancer Data Service (CDS) submission metadata",
   "data_model": {
      "$id": "",
      "description": "Cancer Data Service (CDS) submission metadata",
      "$schema": "http://json-schema.org/draft-07/schema",
      "properties": {
         "sample_age_at_collection": {
            "type": "integer",
            "description": "The number of days from the index date to the date a sample was collected for a specific study or project.",
            "$unit": "days"
         },
         "gender": {
            "allOf": [
               {
                  "description": "Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal roles. [Explanatory Comment 1: Identification of gender is based upon self-report and may come from a form, questionnaire, interview, etc.]"
               },
               {
     

We can extract the CDE id and property description as follows.

In [3]:
prprty = cds_schema.schema['data_model']['properties']['ethnicity']
allOf = prprty['allOf']
our_cde = [x['$cde'] for x in allOf if "$cde" in x][0]
prop_desc = [x['description'] for x in allOf if "description" in x][0]
our_cde

'cadsr:2192217'

One change made from the CDS Excel schema was to turn the raw CDE public id into a CURIE by adding the cadsr prefix. We registered that prefix a while back.

The use of CURIEs has two benefits.
* Their compactness makes the schema less bulky and easier to read and edit.
* Our identifiers are independent of specific hosts.

A metaresolver takes care of sending the request to the right location.
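As a minimal sketch of how a CURIE maps to a metaresolver request, the helper below splits the CURIE into prefix and local id and rebuilds the URL. The function name `curie_to_url` is hypothetical and not part of fasp or any resolver library; the resolver base is the one used in this notebook.

```python
def curie_to_url(curie, resolver="http://identifiers.org"):
    """Split a CURIE into prefix and local id, then build a metaresolver URL."""
    prefix, sep, local_id = curie.partition(":")
    # A CURIE needs both a prefix and a local id separated by a colon
    if not sep or not prefix or not local_id:
        raise ValueError(f"Not a CURIE: {curie!r}")
    return f"{resolver.rstrip('/')}/{prefix}:{local_id}"


print(curie_to_url("cadsr:2192217"))
```

Because the host appears only in the `resolver` argument, the same schema keeps working if the data element moves to a different server.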

caDSR provides its CDE details as XML, so we use ElementTree to parse the response into a Python object.

In [15]:
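import requests
import xml.etree.ElementTree as ET

# Use a metaresolver to send the CURIE to the right host
mr = "http://identifiers.org"
url = mr + "/" + our_cde
response = requests.get(url)
response.raise_for_status()  # stop here if the lookup failed
# caDSR returns XML, so parse the body into an element tree
root = ET.fromstring(response.content)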

We can then use ElementTree to access specific attributes of the CDE.

Note that the CDS schema has a different description than that provided within the CDE.

In [16]:
print("Property definition: {}".format(prop_desc))
for c in root.findall("./queryResponse/class"):
    print('_'*80)

    # Each <class> element describes one version of the CDE
    cde_version = c.find("./field[@name='version']")
    print("Version: {}".format(cde_version.text))
    preferredDefinition = c.find("./field[@name='preferredDefinition']")
    print("CDE, Preferred definition: {}".format(preferredDefinition.text))
    # The value domain is given as an xlink:href to a further caDSR query
    val_domain = c.find("./field[@name='valueDomain']")
    val_domain_link = val_domain.attrib['{http://www.w3.org/1999/xlink}href']
    print("Value domain: {}".format(val_domain_link))

Property definition: An individual's self-described social and cultural grouping, specifically whether an individual describes themselves as Hispanic or Latino. The provided values are based on the categories defined by the U.S. Office of Management and Business and used by the U.S. Census Bureau.
________________________________________________________________________________
Version: 1.0
CDE, Preferred definition: The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories.
Value domain: https://cadsrapi.nci.nih.gov/cadsrapi41/GetXML?query=ValueDomain&DataElement[@id=E75F33C5-A502-7433-E034-0003BA3F9857]&roleName=valueDomain
________________________________________________________________________________
Version: 2.0
CDE, Preferred definition: The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories.
Value domain: https://cadsrapi.nci.nih.gov/cadsrapi41/GetXML?query=ValueDomai

We could of course then use the Value Domain URLs to get those details. Pausing for now to review next steps:
* For illustration, it is probably best to show only the latest version of the CDE.
* Production versions would have to state which version of a CDE a dataset uses.
* Generating the schema for the full CDS template schema and model.
  * This will likely include examples where the unit is part of the CDE and does not need to be recorded in the JSON schema.
* Comparing how we get the study attributes.
* Identifying CDEs that have an enumerated value domain, such as the country list seen in other examples.
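The first of these steps, showing only the latest version of a CDE, could be sketched as follows. The XML here is a hypothetical, heavily trimmed stand-in for the caDSR response (the `response` root element and the definition text are invented); only the `queryResponse/class/field` layout is taken from the cells above. It also assumes versions are simple decimals such as 1.0 and 2.0.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: keeps only the element and attribute names used above
SAMPLE = """
<response>
  <queryResponse>
    <class>
      <field name="version">1.0</field>
      <field name="preferredDefinition">Older definition</field>
    </class>
    <class>
      <field name="version">2.0</field>
      <field name="preferredDefinition">Newer definition</field>
    </class>
  </queryResponse>
</response>
"""


def latest_cde(root):
    """Return the <class> element with the highest numeric version."""
    classes = root.findall("./queryResponse/class")
    return max(classes, key=lambda c: float(c.find("field[@name='version']").text))


latest = latest_cde(ET.fromstring(SAMPLE.strip()))
print(latest.find("field[@name='version']").text)
```

Comparing versions as floats would misorder values like 2.10 and 2.9, so a real implementation would need to check how caDSR numbers its versions.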

### SPARQL Metadata Registry (MDR)

In [6]:
mdr_prefix = 'http://cbiit.nci.nih.gov/caDSR#'
iso_mdr_prefix = 'http://www.iso.org/11179/MDR#'

endpoint = 'https://si-shared-dev.nci.nih.gov/sparql'

sparql = '''prefix mdr: <http://cbiit.nci.nih.gov/caDSR#>
prefix isomdr: <http://www.iso.org/11179/MDR#>
select ?subject ?value from <http://cbiit.nci.nih.gov/caDSR>
where {
  values ?subject { mdr:DE2192217 }
  ?subject isomdr:permitted_value ?o .
  ?o isomdr:value ?value .
 }
'''

payload = {'query': sparql}

#print(sparql)
#url = f'{endpoint}query={sparql}'
#print(url)
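The query above hard-codes one data element. A small sketch of how it might be parameterised is shown below. The function name `permitted_values_query` is hypothetical, and the assumption that every data element is named `DE` followed by its public id is inferred from this single example rather than confirmed against the repository.

```python
def permitted_values_query(public_id):
    """Build the permitted-values query for one caDSR data element."""
    public_id = str(public_id)
    # Only digits are accepted, so nothing else can be spliced into the query
    if not public_id.isdigit():
        raise ValueError(f"Expected a numeric public id, got {public_id!r}")
    return f"""prefix mdr: <http://cbiit.nci.nih.gov/caDSR#>
prefix isomdr: <http://www.iso.org/11179/MDR#>
select ?subject ?value from <http://cbiit.nci.nih.gov/caDSR>
where {{
  values ?subject {{ mdr:DE{public_id} }}
  ?subject isomdr:permitted_value ?o .
  ?o isomdr:value ?value .
}}
"""


# Strip the prefix from the CURIE used earlier to recover the public id
query = permitted_values_query("cadsr:2192217".split(":", 1)[1])
print(query)
```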

In [10]:
import json
querystring="prefix mdr: <http://cbiit.nci.nih.gov/caDSR#> prefix isomdr: <http://www.iso.org/11179/MDR#> select ?subject ?value from <http://cbiit.nci.nih.gov/caDSR> where { values ?subject { mdr:DE2192217 } ?subject isomdr:permitted_value ?o . ?o isomdr:value ?value . }"

#print(querystring)
payload = {'query': querystring}
headers = {"Accept":"application/json"}
response = requests.get(endpoint, params=payload, headers=headers)
#print(response.content)

result = response.json()
#print (json.dumps(result, indent=3))
for hit in result["results"]["bindings"]:
    # We want the "value" attribute of the "value" field
    print(hit["value"]["value"])

Hispanic or Latino
Not Hispanic or Latino
Not reported
Unknown
Not allowed to collect
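The loop above reaches into the SPARQL JSON result by hand. As a sketch, the same access can be wrapped in a helper that returns the values as a list, which is more convenient when the permitted values are to be reused, for example as an enumeration. The name `binding_values` is hypothetical, and the sample is a trimmed copy of the response shape printed later in this notebook.

```python
def binding_values(result, var):
    """Collect the plain values bound to one variable in a SPARQL JSON result."""
    # Unbound variables are simply absent from a row, so skip those rows
    return [row[var]["value"] for row in result["results"]["bindings"] if var in row]


# Trimmed sample in the same shape as the endpoint's response
sample = {
    "head": {"vars": ["subject", "value"]},
    "results": {
        "bindings": [
            {
                "subject": {"type": "uri", "value": "http://cbiit.nci.nih.gov/caDSR#DE2192217"},
                "value": {"type": "literal", "value": "Hispanic or Latino"},
            },
            {
                "subject": {"type": "uri", "value": "http://cbiit.nci.nih.gov/caDSR#DE2192217"},
                "value": {"type": "literal", "value": "Not Hispanic or Latino"},
            },
        ]
    },
}

print(binding_values(sample, "value"))
```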


### Using SPARQLWrapper

In [11]:
from SPARQLWrapper import SPARQLWrapper, JSON

# Specify the endpoint
sparql = SPARQLWrapper(endpoint)

# Query 
sparql.setQuery("""prefix mdr: <http://cbiit.nci.nih.gov/caDSR#> prefix isomdr: <http://www.iso.org/11179/MDR#> select ?subject ?value from <http://cbiit.nci.nih.gov/caDSR> where { values ?subject { mdr:DE2192217 } ?subject isomdr:permitted_value ?o . ?o isomdr:value ?value . }""")

# Convert results to JSON format
sparql.setReturnFormat(JSON)
result2 = sparql.query().convert()

# The return data contains "bindings" (a list of dictionaries)
for hit in result2["results"]["bindings"]:
    # We want the "value" attribute of the "value" field
    print(hit["value"]["value"])

Hispanic or Latino
Not Hispanic or Latino
Not reported
Unknown
Not allowed to collect


In [12]:
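# Work in progress: the per-property extraction below is still commented out,
# so this prints the CDE found in In [3] once for every property in the schema.
for prprty in cds_schema.schema['data_model']['properties']:
    print()
    #allOf = prprty['allOf']
    #our_cde = [x['$cde'] for x in allOf if "$cde" in x][0]
    print(our_cde)
    #prop_desc = [x['description'] for x in allOf if "description" in x][0]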


cadsr:2192217

cadsr:2192217

cadsr:2192217

cadsr:2192217
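The cell above repeats one CDE because iterating over the properties dictionary yields only the keys. One possible way to finish it is sketched below, under the assumption that a "$cde" always sits inside an "allOf" entry as it does for ethnicity. The helper name `cde_for` and the two-property sample dictionary are hypothetical stand-ins for `cds_schema.schema['data_model']['properties']`.

```python
def cde_for(prop):
    """Return the "$cde" value of a property definition, or None if it has none."""
    for part in prop.get("allOf", []):
        if "$cde" in part:
            return part["$cde"]
    return None


# Hypothetical stand-in for cds_schema.schema['data_model']['properties']
properties = {
    "sample_age_at_collection": {"type": "integer", "$unit": "days"},
    "ethnicity": {
        "allOf": [
            {"description": "placeholder description"},
            {"$cde": "cadsr:2192217"},
        ]
    },
}

# Iterate over items() so both the property name and its definition are available
for name, prop in properties.items():
    print(name, cde_for(prop))
```

Returning None for properties without a CDE avoids the IndexError that the `[0]` lookup in In [3] would raise on a property such as sample_age_at_collection.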


In [14]:
print(json.dumps(result2, indent=3))

{
   "head": {
      "link": [],
      "vars": [
         "subject",
         "value"
      ]
   },
   "results": {
      "distinct": false,
      "ordered": true,
      "bindings": [
         {
            "subject": {
               "type": "uri",
               "value": "http://cbiit.nci.nih.gov/caDSR#DE2192217"
            },
            "value": {
               "type": "literal",
               "value": "Hispanic or Latino"
            }
         },
         {
            "subject": {
               "type": "uri",
               "value": "http://cbiit.nci.nih.gov/caDSR#DE2192217"
            },
            "value": {
               "type": "literal",
               "value": "Not Hispanic or Latino"
            }
         },
         {
            "subject": {
               "type": "uri",
               "value": "http://cbiit.nci.nih.gov/caDSR#DE2192217"
            },
            "value": {
               "type": "literal",
               "value": "Not reported"
            }
  