## A notebook to query the HRA Knowledge Graph (KG)

In [17]:
# install packages
%pip install requests pandas

# import external packages
import requests
import pandas as pd

# import built-in packages
import json
from pprint import pprint
from io import StringIO




## Example 1: Get an HRA Digital Object (DO) as a JSON file

In this example, we retrieve an anatomical structures (AS), cell types (CT), plus biomarkers (B) table (ASCT+B, see [this paper](https://www.nature.com/articles/s41556-021-00788-6)) as a JSON file. As part of the HRA Digital Object (DO) processing pipeline, we perform normalization and enrichment. During normalization, we convert every DO into a standard format that follows schemas defined in [LinkML](https://linkml.io/). The default file format for normalized data is [YAML](https://yaml.org/), which can readily be converted to JSON.

This would almost be enough as is, but the current models were not fully designed for end users; they were created as a way to generate RDF graphs. As a result, the models contain some artifacts that may make them hard to understand. With some clean-up, however, the HRA data is accessible and useful for end users with varying degrees of programming and analysis experience. LinkML can also generate a JSON-LD Context and a [JSON Schema](https://json-schema.org/), which we attach to the JSON files. This makes the files validatable and easy to convert to RDF, provides autocomplete and inline documentation in code editors like VS Code, and supports generating online documentation.


In [18]:
# Each HRA DO has a persistent URL (PURL), where the user is served HRA DO data after server-side content negotiation
# This is the PURL for the ASCT+B table for the kidney
url = "https://purl.humanatlas.io/asct-b/kidney"

# set headers
headers = {
  "Accept" : "application/json"
}

# make request and stop early if the server reports an HTTP error
response = requests.get(url, headers=headers)
response.raise_for_status()

# parse the JSON body of the response
kidney_json = response.json()
pprint(kidney_json)

{'$schema': 'https://cdn.humanatlas.io/digital-objects/schema/asct-b/latest/assets/schema.json',
 '@context': 'https://cdn.humanatlas.io/digital-objects/schema/asct-b/latest/assets/schema.context.jsonld',
 '@type': 'Container',
 'data': {'anatomical_structures': [{'ccf_asctb_type': 'AS',
                                     'ccf_is_provisional': False,
                                     'ccf_pref_label': 'UBERON anatomical '
                                                       'structure',
                                     'conforms_to': 'AnatomicalStructure',
                                     'id': 'UBERON:0001062',
                                     'parent_class': 'ccf:AnatomicalStructure'},
                                    {'ccf_asctb_type': 'AS',
                                     'ccf_is_provisional': False,
                                     'ccf_pref_label': 'FMA anatomical '
                                                       'structure',
                

In [20]:
# The resulting JSON file has these keys:
kidney_json.keys()

dict_keys(['$schema', '@context', '@type', 'iri', 'metadata', 'data'])

In [24]:
# Now we iterate over all the rows in ASCT+B table
for row in kidney_json['data']['asctb_record']:
  pprint(row)

{'anatomical_structure_list': [{'ccf_pref_label': 'kidney',
                                'id': 'https://purl.humanatlas.io/asct-b/kidney/v1.5#R1-AS1',
                                'label': 'kidney (Table kidney, Record 1, '
                                         'Column AS/1)',
                                'order_number': 1,
                                'record_number': 1,
                                'source_concept': 'UBERON:0002113',
                                'type_of': ['AnatomicalStructureRecord']},
                               {'ccf_pref_label': 'kidney capsule',
                                'id': 'https://purl.humanatlas.io/asct-b/kidney/v1.5#R1-AS2',
                                'label': 'kidney capsule (Table kidney, Record '
                                         '1, Column AS/2)',
                                'order_number': 2,
                                'record_number': 1,
                                'source_concept': 'UBERON:000
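
Each record nests its anatomical structures under `anatomical_structure_list` and its cell types under `cell_type_list` (the latter key is used later in this notebook). The following is a minimal sketch of how one record could be reduced to its labels and IDs. The `sample_record` is a hand-made, shortened stand-in modeled on the output above, not actual table content, and `summarize_record` is a hypothetical helper that is not part of any HRA library.

```python
# Hand-made stand-in for one entry of kidney_json['data']['asctb_record'];
# only the fields used below are included.
sample_record = {
    "anatomical_structure_list": [
        {"ccf_pref_label": "kidney cortex", "source_concept": "UBERON:0001225", "order_number": 2},
        {"ccf_pref_label": "kidney", "source_concept": "UBERON:0002113", "order_number": 1},
    ],
    "cell_type_list": [
        {"ccf_pref_label": "Podocyte", "source_concept": "CL:0000653", "order_number": 1},
    ],
}


def summarize_record(record):
    """Return (anatomical structure path, cell types) for one ASCT+B record."""
    # Sort by order_number so the path runs from the organ down to its parts
    structures = sorted(
        record.get("anatomical_structure_list", []),
        key=lambda item: item["order_number"],
    )
    as_path = [(s["ccf_pref_label"], s["source_concept"]) for s in structures]
    cell_types = [
        (c["ccf_pref_label"], c["source_concept"])
        for c in record.get("cell_type_list", [])
    ]
    return as_path, cell_types


as_path, cell_types = summarize_record(sample_record)
print(" > ".join(label for label, _ in as_path))
print(cell_types)
```

Applied to the real data, the same function could be called once per entry of `kidney_json['data']['asctb_record']`.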

## Example 2: Run a SPARQL query via `grlc` with `requests`


`grlc` retrieves SPARQL queries from the repository at [https://github.com/hubmapconsortium/ccf-grlc](https://github.com/hubmapconsortium/ccf-grlc), then lets the user run these queries like RESTful API endpoints. In this example, we retrieve all the anatomical structures (AS), cell types (CT), and biomarkers (B) in the ASCT+B table for the lymph node ([https://purl.humanatlas.io/asct-b/lymph-node](https://purl.humanatlas.io/asct-b/lymph-node)). Our desired format is CSV. The query is documented [here](https://apps.humanatlas.io/api/grlc/hra.html#get-/asctb-in-table).

In [25]:
# set desired file format (named file_format to avoid shadowing the built-in format())
file_format = "csv"

# set url
grlc_url = f"https://grlc.io/api-git/hubmapconsortium/ccf-grlc/subdir/hra/asctb-in-table.{file_format}?asctb=https%3A%2F%2Fpurl.humanatlas.io%2Fasct-b%2Flymph-node"

# set header
headers = {
  "Accept": "text/csv"
}

# make request (pass the headers defined above)
response = requests.get(grlc_url, headers=headers)

# convert text to file-like object
csv_data = StringIO(response.text)

# convert to DataFrame
df = pd.read_csv(csv_data)
df

Unnamed: 0,as_label,ct_label,bm_label,as,ct,bm,bmType
0,Afferent lymphatic vessel,Endothelial Cell of Lymphatic Vessel,TNFRSF9,http://purl.obolibrary.org/obo/UBERON_0010396,http://purl.obolibrary.org/obo/CL_0002138,http://identifiers.org/hgnc/11924,protein
1,Afferent lymphatic vessel,Smooth Muscle Cell,ACTA2,http://purl.obolibrary.org/obo/UBERON_0010396,http://purl.obolibrary.org/obo/CL_0019017,http://identifiers.org/hgnc/130,gene
2,Afferent lymphatic vessel,Endothelial Cell of Lymphatic Vessel,Lyve1,http://purl.obolibrary.org/obo/UBERON_0010396,http://purl.obolibrary.org/obo/CL_0002138,http://identifiers.org/hgnc/14687,protein
3,Afferent lymphatic vessel,Endothelial Cell of Lymphatic Vessel,CAV1,http://purl.obolibrary.org/obo/UBERON_0010396,http://purl.obolibrary.org/obo/CL_0002138,http://identifiers.org/hgnc/1527,gene
4,Afferent lymphatic vessel,Endothelial Cell of Lymphatic Vessel,CLEC4M,http://purl.obolibrary.org/obo/UBERON_0010396,http://purl.obolibrary.org/obo/CL_0002138,http://identifiers.org/hgnc/13523,gene
...,...,...,...,...,...,...,...
2325,lymph node,Lymphatic Endothelial Cell-Subcapsular Sinus F...,NT5E,http://purl.obolibrary.org/obo/UBERON_0000029,http://purl.obolibrary.org/obo/CL_0009108,http://identifiers.org/hgnc/8021,gene
2326,lymph node,Lymphatic Endothelial Cell-Subcapsular Sinus F...,Claudin11,http://purl.obolibrary.org/obo/UBERON_0000029,http://purl.obolibrary.org/obo/CL_0009108,http://identifiers.org/hgnc/8514,protein
2327,lymph node,Blood Vessel Endothelial Cell,CD123,http://purl.obolibrary.org/obo/UBERON_0000029,http://purl.obolibrary.org/obo/CL_0000071,http://identifiers.org/hgnc/6012,protein
2328,lymph node,Lymphatic Endothelial Cell-Subcapsular Sinus F...,MFAP4,http://purl.obolibrary.org/obo/UBERON_0000029,http://purl.obolibrary.org/obo/CL_0009108,http://identifiers.org/hgnc/7035,protein
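
The query string in the grlc URL above is percent-encoded by hand. As a minimal sketch, the same URL can be assembled with the standard library, which makes it easier to swap in the PURL of another ASCT+B table. The helper name `build_grlc_url` and the constant `GRLC_BASE` are introduced here for illustration only.

```python
from urllib.parse import urlencode

# Base address of the grlc endpoint, copied from the cell above
GRLC_BASE = "https://grlc.io/api-git/hubmapconsortium/ccf-grlc/subdir/hra/asctb-in-table"


def build_grlc_url(asctb_purl, file_format="csv"):
    """Build the grlc URL for one ASCT+B table (hypothetical helper)."""
    # urlencode percent-encodes the ':' and '/' characters in the PURL
    query_string = urlencode({"asctb": asctb_purl})
    return f"{GRLC_BASE}.{file_format}?{query_string}"


lymph_node_url = build_grlc_url("https://purl.humanatlas.io/asct-b/lymph-node")
print(lymph_node_url)
```

The resulting string is identical to the hand-written `grlc_url` used in this example.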


## Example 3: Run your own SPARQL query with `requests` 

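In this example, we do not use the hra-api client. Instead, we send the query directly to the SPARQL endpoint with `requests`, ask for `text/csv`, and parse the CSV response.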

In [6]:
# In this example, we write our own SPARQL query inline. Alternatively, you could load a query saved in an RQ file.
# This simple query returns the first 10 subject, predicate, object triples in the queried graph.
query = "SELECT * WHERE { ?sub ?pred ?obj . } LIMIT 10"

# define endpoint
url = "https://lod.humanatlas.io/sparql"

# define parameters
params = {
    "query": query,
}

# set header
headers = {
  "Accept": "text/csv"
}

# Send the GET request
response = requests.get(url, headers=headers, params=params)

# convert text to file-like object
csv_data = StringIO(response.text)  

# convert to DataFrame
df = pd.read_csv(csv_data)
df

Unnamed: 0,sub,pred,obj
0,http://protege.stanford.edu/plugins/owl/proteg...,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://www.w3.org/2002/07/owl#AnnotationProperty
1,http://purl.obolibrary.org/obo/BFO_0000002,http://purl.obolibrary.org/obo/IAO_0000115,An entity that exists in full at any time in w...
2,http://purl.obolibrary.org/obo/BFO_0000002,http://purl.obolibrary.org/obo/IAO_0000116,BFO 2 Reference: Continuant entities are entit...
3,http://purl.obolibrary.org/obo/BFO_0000002,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://www.w3.org/2002/07/owl#Class
4,http://purl.obolibrary.org/obo/BFO_0000002,http://www.w3.org/2000/01/rdf-schema#label,continuant
5,http://purl.obolibrary.org/obo/BFO_0000002,http://www.w3.org/2000/01/rdf-schema#subClassOf,http://purl.obolibrary.org/obo/BFO_0000001
6,http://purl.obolibrary.org/obo/BFO_0000002,http://www.w3.org/2000/01/rdf-schema#subClassOf,_:t4732
7,http://purl.obolibrary.org/obo/BFO_0000002,http://www.w3.org/2002/07/owl#disjointWith,http://purl.obolibrary.org/obo/BFO_0000003
8,http://purl.obolibrary.org/obo/BFO_0000002,http://www.w3.org/2002/07/owl#disjointWith,_:t4733
9,http://purl.obolibrary.org/obo/BFO_0000002,http://www.w3.org/2002/07/owl#disjointWith,_:t20743
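
The CSV response can also be parsed with Python's built-in `csv` module when pandas is not needed. The following is a minimal sketch; `sample_csv` is a shortened stand-in for `response.text`, assembled from two of the rows shown above.

```python
import csv
from io import StringIO

# Shortened stand-in for response.text, modeled on the rows shown above
sample_csv = (
    "sub,pred,obj\n"
    "http://purl.obolibrary.org/obo/BFO_0000002,"
    "http://www.w3.org/2000/01/rdf-schema#label,continuant\n"
    "http://purl.obolibrary.org/obo/BFO_0000002,"
    "http://www.w3.org/2000/01/rdf-schema#subClassOf,"
    "http://purl.obolibrary.org/obo/BFO_0000001\n"
)

# DictReader uses the header row (the SPARQL variable names) as keys
rows = list(csv.DictReader(StringIO(sample_csv)))

for row in rows:
    print(row["sub"], row["pred"], row["obj"])
```

Each row is a plain dictionary keyed by the variables of the `SELECT` clause, so no DataFrame is required for simple lookups.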


## Demonstrate how to use the HRA KG either declaratively (via SPARQL) or imperatively (via JSON)

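Here we juxtapose querying JSON with Python against querying with SPARQL, which is faster and cleaner and needs no `for` or `foreach` loops.

- For Python, we use the kidney JSON retrieved in Example 1 (`kidney_json`).

- For SPARQL, we use the code example from "Cells to expect in my anatomical region of interest":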
```
#+ summary: All cells that make up the kidney cortex


PREFIX ccf: <http://purl.org/ccf/>
PREFIX kidney_cortex: <http://purl.obolibrary.org/obo/UBERON_0001225>


SELECT DISTINCT ?cell_label ?cell_ontology_id
FROM <https://purl.humanatlas.io/asct-b/kidney>
WHERE {
  kidney_cortex: ^ccf:ccf_part_of* ?kidney_cortex_parts .
  ?cell_ontology_id ccf:ccf_located_in ?kidney_cortex_parts .
  ?cell_ontology_id ccf:ccf_pref_label ?cell_label .
}

```

In this section, we juxtapose two ways of using data from the HRA KG to retrieve all the cells that make up the kidney cortex: 
1. imperatively via opening a data product (JSON) in Python 
2. declaratively via a SPARQL query

While both options have advantages, we will show that the SPARQL query needs fewer lines of code and adds the ability to easily query more than one graph.

### Declaratively (via SPARQL)

In [7]:
query = """
PREFIX ccf: <http://purl.org/ccf/>
PREFIX kidney_cortex: <http://purl.obolibrary.org/obo/UBERON_0001225>


SELECT DISTINCT ?cell_label ?cell_ontology_id
FROM <https://purl.humanatlas.io/asct-b/kidney>
WHERE {
  kidney_cortex: ^ccf:ccf_part_of* ?kidney_cortex_parts .
  ?cell_ontology_id ccf:ccf_located_in ?kidney_cortex_parts .
  ?cell_ontology_id ccf:ccf_pref_label ?cell_label .
}
"""

# define endpoint
url = "https://lod.humanatlas.io/sparql"

# define parameters
params = {
    "query": query,
}

# set header
headers = {
  "Accept": "text/csv"
}

# Send the GET request
response = requests.get(url, headers=headers, params=params)

# convert text to file-like object
csv_data = StringIO(response.text)  

# convert to DataFrame
df = pd.read_csv(csv_data)
df

Unnamed: 0,cell_label,cell_ontology_id
0,Podocyte,http://purl.obolibrary.org/obo/CL_0000653
1,Parietal Epithelial Cell,http://purl.obolibrary.org/obo/CL_1000452
2,Cortical Collecting Duct Principal Cell,http://purl.obolibrary.org/obo/CL_1000714
3,Cortical Collecting Duct Intercalated Cell Type A,http://purl.obolibrary.org/obo/CL_1000715


### Imperatively (via JSON)

In [8]:
# To achieve this in Python, we first need to get the JSON representation of the ASCT+B table for the kidney, then we need to iterate through all rows with the UBERON ID for the kidney cortex, and finally, we need to collect all cell types in a DataFrame.
# get ASCT+B table (note that this is the same procedure as Example 1)
pprint(kidney_json)

{'$schema': 'https://cdn.humanatlas.io/digital-objects/schema/asct-b/latest/assets/schema.json',
 '@context': 'https://cdn.humanatlas.io/digital-objects/schema/asct-b/latest/assets/schema.context.jsonld',
 '@type': 'Container',
 'data': {'anatomical_structures': [{'ccf_asctb_type': 'AS',
                                     'ccf_is_provisional': False,
                                     'ccf_pref_label': 'UBERON anatomical '
                                                       'structure',
                                     'conforms_to': 'AnatomicalStructure',
                                     'id': 'UBERON:0001062',
                                     'parent_class': 'ccf:AnatomicalStructure'},
                                    {'ccf_asctb_type': 'AS',
                                     'ccf_is_provisional': False,
                                     'ccf_pref_label': 'FMA anatomical '
                                                       'structure',
                

In [9]:
# iterate over rows in ASCT+B table 
for row in kidney_json['data']['asctb_record']:
  pprint(row)

{'anatomical_structure_list': [{'ccf_pref_label': 'kidney',
                                'id': 'https://purl.humanatlas.io/asct-b/kidney/v1.5#R1-AS1',
                                'label': 'kidney (Table kidney, Record 1, '
                                         'Column AS/1)',
                                'order_number': 1,
                                'record_number': 1,
                                'source_concept': 'UBERON:0002113',
                                'type_of': ['AnatomicalStructureRecord']},
                               {'ccf_pref_label': 'kidney capsule',
                                'id': 'https://purl.humanatlas.io/asct-b/kidney/v1.5#R1-AS2',
                                'label': 'kidney capsule (Table kidney, Record '
                                         '1, Column AS/2)',
                                'order_number': 2,
                                'record_number': 1,
                                'source_concept': 'UBERON:000

In [10]:
# get all rows with kidney cortex
cortex_id = "UBERON:0001225"

In [11]:
# find all AS IDs with a connection to the cortex
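def find_connected_ids(kidney_data, cortex_id, collected_ids=None):
  """Recursively collect the IDs of all anatomical structures that are part of a given structure.

  Args:
      kidney_data (dict): ASCT+B table as parsed JSON
      cortex_id (str): ID of the AS whose parts are collected
      collected_ids (set, optional): IDs collected so far. Defaults to None.

  Returns:
      set: IDs of all AS that are directly or indirectly part of cortex_id
  """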
  if collected_ids is None:
      collected_ids = set()

  new_ids = set()

  # Loop through each anatomical structure to find connections to cortex_id
  for structure in kidney_data['data']['anatomical_structures']:
      try:
          # If the cortex_id is in the 'ccf_part_of' list, add the structure's ID
          if cortex_id in structure['ccf_part_of'] and structure['id'] not in collected_ids:
              new_ids.add(structure['id'])
      except KeyError:
          # Skip if 'ccf_part_of' key doesn't exist
          pass

  # If no new IDs are found, return the collected set
  if not new_ids:
      return collected_ids

  # Add newly found IDs to collected_ids
  collected_ids.update(new_ids)

  # Recursively find connections for each new ID found
  for new_id in new_ids:
      find_connected_ids(kidney_data, new_id, collected_ids)

  return collected_ids


# Example usage
ids_with_connection_to_cortex = find_connected_ids(kidney_json, cortex_id)
pprint(ids_with_connection_to_cortex)

{'UBERON:0001284',
 'UBERON:0002189',
 'UBERON:0004188',
 'UBERON:0004203',
 'UBERON:0005271',
 'UBERON:0005750',
 'UBERON:0005751',
 'UBERON:0009883'}
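
The recursive function above rescans the full list of anatomical structures for every newly found ID. As an alternative sketch (not the notebook's original method), the `ccf_part_of` links can be indexed once and then traversed breadth-first. The `sample_structures` list is a hand-made stand-in with placeholder IDs, and `find_descendant_ids` is a hypothetical helper.

```python
from collections import defaultdict, deque

# Hand-made stand-in for kidney_json['data']['anatomical_structures'];
# the IDs are placeholders, not real UBERON terms.
sample_structures = [
    {"id": "EX:cortex"},
    {"id": "EX:corpuscle", "ccf_part_of": ["EX:cortex"]},
    {"id": "EX:glomerulus", "ccf_part_of": ["EX:corpuscle"]},
    {"id": "EX:medulla", "ccf_part_of": ["EX:kidney"]},
]


def find_descendant_ids(structures, root_id):
    """Collect all IDs that are directly or transitively part of root_id."""
    # Index children by parent once, so each lookup is a dict access
    children = defaultdict(set)
    for structure in structures:
        for parent_id in structure.get("ccf_part_of", []):
            children[parent_id].add(structure["id"])

    found = set()
    queue = deque([root_id])
    while queue:
        current = queue.popleft()
        # Visit only children that have not been collected yet
        for child_id in children[current] - found:
            found.add(child_id)
            queue.append(child_id)
    return found


print(sorted(find_descendant_ids(sample_structures, "EX:cortex")))
```

With the real data, the call would be `find_descendant_ids(kidney_json['data']['anatomical_structures'], cortex_id)`.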


In [12]:
# check connections to cortex
def check_relationships(structure_id: str):
  """Checks whether the provided ID is directly part of the cortex
  Args:
      structure_id (str): an AS ID (UBERON or FMA)
  Returns:
      check (bool): True if the structure lists the cortex in 'ccf_part_of'
  """

  check = False
  for entry in kidney_json['data']['anatomical_structures']:
    if entry['id'] == structure_id:
      try:
        if cortex_id in entry['ccf_part_of']:
          check = True
      except KeyError:
        # skip entries without a 'ccf_part_of' key
        pass

  return check

In [13]:
# initialize dict for result
result = {
  "cell_label": [],
  "cell_ontology_id": []
}

# initialize list for all asctb_records with cortex
relevant_records = []

for record in kidney_json['data']['asctb_record']:
  for row in record['anatomical_structure_list']:
    if (cortex_id in row['source_concept'] or cortex_id in row['type_of'] or check_relationships(row['source_concept']) or row['source_concept'] in ids_with_connection_to_cortex):
      relevant_records.append(record['cell_type_list'])
    
# use list comprehension to flatten the list of cell types
list_flattened = [record for sublist in relevant_records for record in sublist]
pprint(list_flattened[:3])

[{'ccf_pref_label': 'Podocyte',
  'id': 'https://purl.humanatlas.io/asct-b/kidney/v1.5#R2-CT1',
  'label': 'Podocyte (Table kidney, Record 2, Column CT/1)',
  'order_number': 1,
  'record_number': 2,
  'source_concept': 'CL:0000653',
  'type_of': ['CellTypeRecord']},
 {'ccf_pref_label': 'Podocyte',
  'id': 'https://purl.humanatlas.io/asct-b/kidney/v1.5#R2-CT1',
  'label': 'Podocyte (Table kidney, Record 2, Column CT/1)',
  'order_number': 1,
  'record_number': 2,
  'source_concept': 'CL:0000653',
  'type_of': ['CellTypeRecord']},
 {'ccf_pref_label': 'Podocyte',
  'id': 'https://purl.humanatlas.io/asct-b/kidney/v1.5#R2-CT1',
  'label': 'Podocyte (Table kidney, Record 2, Column CT/1)',
  'order_number': 1,
  'record_number': 2,
  'source_concept': 'CL:0000653',
  'type_of': ['CellTypeRecord']}]


In [14]:
# prefix for OBO PURL
prefix = "http://purl.obolibrary.org/obo/"

# capture result in dict
for cell_type in list_flattened:
  result['cell_label'].append(cell_type['ccf_pref_label'])
  # OBO PURLs use an underscore where the CURIE has a colon (CL:0000653 -> CL_0000653)
  result['cell_ontology_id'].append(prefix+cell_type['source_concept'].replace(":", "_"))
  
# convert to dataframe
df = pd.DataFrame(result).drop_duplicates(subset="cell_ontology_id")
df

Unnamed: 0,cell_label,cell_ontology_id
0,Podocyte,http://purl.obolibrary.org/obo/CL_0000653
8,Cortical Collecting Duct Principal Cell,http://purl.obolibrary.org/obo/CL_1000714
15,Parietal Epithelial Cell,http://purl.obolibrary.org/obo/CL_1000452
20,Cortical Collecting Duct Intercalated Cell Type A,http://purl.obolibrary.org/obo/CL_1000715
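
The JSON stores cell types as CURIEs in `source_concept` (for example `CL:0000653`), while the SPARQL result uses full OBO PURLs (for example `http://purl.obolibrary.org/obo/CL_0000653`). As a minimal sketch, the following check expands the CURIEs and confirms that both approaches return the same four cell types. The IDs are copied from the outputs in this section, and `curie_to_obo_purl` is a hypothetical helper.

```python
OBO_PREFIX = "http://purl.obolibrary.org/obo/"


def curie_to_obo_purl(curie):
    """Expand a CURIE such as 'CL:0000653' into an OBO PURL."""
    # OBO PURLs join the ontology prefix and the local ID with an underscore
    ontology, local_id = curie.split(":", 1)
    return f"{OBO_PREFIX}{ontology}_{local_id}"


# Cell type IDs returned by the SPARQL query
sparql_ids = {
    "http://purl.obolibrary.org/obo/CL_0000653",
    "http://purl.obolibrary.org/obo/CL_1000452",
    "http://purl.obolibrary.org/obo/CL_1000714",
    "http://purl.obolibrary.org/obo/CL_1000715",
}

# CURIEs found in the JSON via the imperative approach
json_curies = ["CL:0000653", "CL:1000714", "CL:1000452", "CL:1000715"]

json_ids = {curie_to_obo_purl(curie) for curie in json_curies}
print(json_ids == sparql_ids)
```

Comparing the results as sets ignores row order, which differs between the two tables.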
