<p align="center">
  <img width='325' src="https://user-images.githubusercontent.com/8030363/161553611-51a40cf9-e348-4eff-91bd-ab03eac41dd3.png" />
</p>

***
***

# Tutorial: Knowledge Graph Entity Search

***

**Author:** [TJCallahan](http://tiffanycallahan.com/)  
**GitHub Repository:** [PheKnowLator](https://github.com/callahantiff/PheKnowLator/wiki)  
**Wiki Page:** [OWL-NETS-2.0](https://github.com/callahantiff/PheKnowLator/wiki/OWL-NETS-2.0)  
**Tutorial Slides:** [PheKnowLator_Tutorial_EntitySearch](https://docs.google.com/presentation/d/19oc41CFILLlf0HjJfFJlUER-4wt1YV-BItkqWbI4NwU/edit?usp=sharing)   
**Release:** **[v3.0.0](https://github.com/callahantiff/PheKnowLator/wiki/v3.0.0)**  
  
<br> 

## Purpose  
The goal of this notebook is to explore ways to examine relationships between different types of entities in a PheKnowLator knowledge graph. We will explore two types of relationships:  
  1. **Positive Relationships:** An existing, well-known, and direct relationship between two entities. In this notebook, we will examine:  
    - `lisinopril dihydrate` ([CHEBI_6503](http://purl.obolibrary.org/obo/CHEBI_6503)) and `myocardial infarction` ([MONDO_0005068](http://purl.obolibrary.org/obo/MONDO_0005068)) 
  
  2. **Negative Relationships:** No direct or suspected relationship exists between two entities. In this notebook, we will examine:   
    - `lisinopril dihydrate` ([CHEBI_6503](http://purl.obolibrary.org/obo/CHEBI_6503)) and `contact dermatitis` ([MONDO_0005480](http://purl.obolibrary.org/obo/MONDO_0005480)) 


### Notebook Organization  
- [Set-Up Environment](#set-environment)  
- [Knowledge Graph Data](#kg-data)  
- [Knowledge-based Characterization](#kg-characterization)  
    - [Node-Level](#node-level)  
    - [Path-Level](#path-level) 

***
***

<br>

<br>

***  
## Set-Up Environment  <a class="anchor" id="set-environment"></a> 
***  
___
  

In [None]:
# # uncomment and run to install any required modules from notebooks/requirements.txt
# import sys
# !{sys.executable} -m pip install -r ../../notebooks/requirements.txt

In [None]:
# run only if running a local version (i.e., forked from GitHub) of pkt_kg
import sys
sys.path.append('../../../')

⚠️ **Install graphviz** ⚠️  
`graphviz` can be tricky to install, which is why it is not installed automatically via `notebooks/requirements.txt`. To install it, I recommend following the last comment in this [StackOverflow](https://stackoverflow.com/questions/69377104/unable-to-execute-dot-command-after-installation-of-graphviz-there-is-no-layou) post, but make sure you understand what the commands do before running any of them. `graphviz` only affects the rendering of network visualizations, so simply skip the visualization code chunks if you are unable to install it.

In [None]:
# import needed libraries
import glob
import json
import networkx as nx
import os
import pandas as pd
import pickle
import random
import re

from entity_search import *
from rdflib import Graph, Namespace, URIRef, BNode, Literal
from rdflib.namespace import RDFS
from tqdm.notebook import tqdm
from typing import Callable, Dict, List, Optional, Union

# temporary workaround to avoid a logging error: if the first import raises
# FileNotFoundError, retry the same import once
try:
    from pkt_kg.utils import *
except FileNotFoundError:
    from pkt_kg.utils import *

# create namespace for OBO ontologies
obo = Namespace('http://purl.obolibrary.org/obo/')

<br>

***

## Knowledge Graph Data  <a class="anchor" id="kg-data"></a>
***

#### *How to access PheKnowLator Data*
___

This notebook was built using a `v3.0.2` OWL-NETS-abstracted subclass-based build with inverse relations, which is publicly available and can be downloaded using the following links:  
- [PheKnowLator_v3.0.2_full_subclass_inverseRelations_OWLNETS_NetworkxMultiDiGraph.gpickle](https://storage.googleapis.com/pheknowlator/current_build/knowledge_graphs/subclass_builds/inverse_relations/owlnets/PheKnowLator_v3.0.2_full_subclass_inverseRelations_OWLNETS_NetworkxMultiDiGraph.gpickle)  
- [PheKnowLator_v3.0.2_full_subclass_inverseRelations_OWLNETS_NodeLabels.txt](https://storage.googleapis.com/pheknowlator/current_build/knowledge_graphs/subclass_builds/inverse_relations/owlnets/PheKnowLator_v3.0.2_full_subclass_inverseRelations_OWLNETS_NodeLabels.txt)  


### Download Data  
***

The knowledge graph data is publicly available and downloaded from the PheKnowLator project's Google Cloud Storage Bucket: https://console.cloud.google.com/storage/browser/pheknowlator/. Data will be downloaded to the `tutorials/entity_search/data` directory.

In [None]:
# notebook will create a temporary directory and will download data to it
write_location = 'data/'
if not os.path.exists(write_location): os.mkdir(write_location)

In [None]:
# download data to the data directory
data_urls = [
    'https://storage.googleapis.com/pheknowlator/current_build/knowledge_graphs/subclass_builds/relations_only/owlnets/PheKnowLator_v3.0.2_full_subclass_relationsOnly_OWLNETS_NetworkxMultiDiGraph.gpickle',
    'https://storage.googleapis.com/pheknowlator/current_build/knowledge_graphs/subclass_builds/relations_only/owlnets/PheKnowLator_v3.0.2_full_subclass_relationsOnly_OWLNETS_NodeLabels.txt'
]

for url in data_urls:
    file_name = url.split('/')[-1] if 'metadata' not in url else re.sub(r'\?.*', '', url.split('/')[-1])
    if not os.path.exists(write_location + file_name): data_downloader(url, write_location, file_name)
        

### Loading Data
***

The knowledge graph will be loaded as a `networkx` MultiDiGraph object and the node labels will be read in and converted to a dictionary to enable easy access to node labels and other relevant metadata.
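
As a rough illustration of the structure the later cells rely on, the sketch below builds a toy `MultiDiGraph` in which each edge key plays the role of a predicate; this mirrors how the notebook reads relation types with `kg.get_edge_data(u, v).keys()`. The node and predicate names are made up for illustration and are not taken from the PheKnowLator build.

```python
import networkx as nx

# toy graph: edge keys stand in for predicates (hypothetical names)
toy_kg = nx.MultiDiGraph()
toy_kg.add_edge('drug_A', 'disease_B', key='treats')
toy_kg.add_edge('drug_A', 'disease_B', key='interacts_with')
toy_kg.add_edge('drug_A', 'chemical', key='subClassOf')

# all predicates linking one ordered pair of nodes
predicates = sorted(toy_kg.get_edge_data('drug_A', 'disease_B').keys())
print(predicates)

# parallel edges are counted separately, so edges can outnumber node pairs
print(toy_kg.number_of_nodes(), toy_kg.number_of_edges())
```

Because parallel edges are stored per key, the same pair of nodes can appear in several triples, one for each predicate.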

#### Knowledge Graph

Note that this file is large and can take up to 2 minutes to load on a laptop with 16 GB of RAM.

In [None]:
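# load the knowledge graph (nx.read_gpickle was removed in networkx 3.0;
# a gpickle file is a standard pickle, so pickle.load reads it directly)
with open(write_location + data_urls[0].split('/')[-1], 'rb') as kg_file:
    kg = pickle.load(kg_file)
print('The knowledge graph contains {} nodes and {} edges'.format(len(kg.nodes()), len(kg.edges())))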

#### Load Node Metadata

In [None]:
# read in node metadata
data_loc = write_location + data_urls[1].split('/')[-1]
node_data = pd.read_csv(data_loc, header=0, sep=r"\t", encoding="utf8", engine='python', quoting=3)
node_data['entity_uri'] = node_data['entity_uri'].str.strip('<>')  # remove angle brackets
node_data.head()

In [None]:
# convert node data to dictionary
node_data_dict = dict()
for idx, row in tqdm(node_data.iterrows(), total=node_data.shape[0]):
    node_data_dict[row['entity_uri']] = {
        'label': row['label'],
        'description': row['description/definition']
    }
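
The conversion above can be tried out on a toy table before running it on the full node label file. The sketch below is only an illustration: the two example URIs and their labels are invented, and the vectorized `set_index(...).to_dict(orient='index')` route is an alternative to the row-by-row loop, not the notebook's own method. Only the column names (`entity_uri`, `label`, `description/definition`) come from the cells above.

```python
import io
import pandas as pd

# toy tab-separated table using the same column names as the node label file
toy_tsv = (
    "entity_uri\tlabel\tdescription/definition\n"
    "<http://example.org/A>\tnode A\tfirst toy node\n"
    "<http://example.org/B>\tnode B\tsecond toy node\n"
)
toy_nodes = pd.read_csv(io.StringIO(toy_tsv), sep='\t', header=0, quoting=3)
toy_nodes['entity_uri'] = toy_nodes['entity_uri'].str.strip('<>')  # remove angle brackets

# build the URI -> metadata lookup without iterating row by row
toy_lookup = (
    toy_nodes.set_index('entity_uri')
    .rename(columns={'description/definition': 'description'})
    [['label', 'description']]
    .to_dict(orient='index')
)
print(toy_lookup['http://example.org/A']['label'])
```

If two rows share the same `entity_uri`, the row-by-row loop keeps the last one, whereas `to_dict(orient='index')` requires a unique index, so duplicates would need to be dropped first.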

<br>

***

## Knowledge-based Characterization  <a class="anchor" id="kg-characterization"></a>
***
____

The goal is to use the knowledge graph to explore what is known about specific concepts as well as what can be said about pairs of concepts. The two levels of analysis are summarized below:

#### [Node-Level](#node-level)
 - <u>Node Ancestry</u>: Identifies all ancestors of a node, up to the root.
 - <u>Node Neighborhood</u>: Returns all nodes reachable from a node of interest via a single directed edge.   
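
The two node-level ideas can be sketched on a toy graph. This is a minimal illustration under stated assumptions, not the `nx_ancestor_search` helper from `entity_search`: the function `ancestor_levels`, the node names, and the string predicate `'subClassOf'` (standing in for `RDFS.subClassOf`) are all hypothetical.

```python
import networkx as nx

# toy hierarchy: child -> parent edges keyed by a hypothetical 'subClassOf' predicate
g = nx.MultiDiGraph()
g.add_edge('child', 'parent', key='subClassOf')
g.add_edge('parent', 'grandparent', key='subClassOf')
g.add_edge('grandparent', 'root', key='subClassOf')
g.add_edge('child', 'other', key='treats')


def ancestor_levels(graph, start, predicate='subClassOf'):
    """Walks `predicate` out-edges level by level until no new parents remain."""
    levels, frontier, seen = [], [start], {start}
    while frontier:
        parents = []
        for node in frontier:
            for _, parent, key in graph.out_edges(node, keys=True):
                if key == predicate and parent not in seen:
                    seen.add(parent)
                    parents.append(parent)
        if parents:
            levels.append(parents)
        frontier = parents
    return levels


# ancestry: one list of parents per level, nearest first
print(ancestor_levels(g, 'child'))

# neighborhood: every node one directed edge away, regardless of predicate
print(sorted(g.successors('child')))
```

Ancestry follows only the hierarchy predicate, while the neighborhood includes every outgoing relation, which is why `other` appears in the second result but not the first.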


#### [Path-Level](#path-level)
  - <u>Shortest Path</u>: Returns a single shortest path between two nodes.  
  - <u>All Shortest Paths</u>: Returns every shortest simple path between two nodes; if several paths share the minimum length, all of them are returned.
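
The difference between the two path-level searches can be seen on a toy graph. The sketch below uses only `nx.shortest_path` and `nx.all_shortest_paths`; the node names are placeholders and the graph is unweighted, which is an assumption about how the searches are run.

```python
import networkx as nx

# toy directed graph with two equally short routes and one longer route from 's' to 't'
g = nx.MultiDiGraph()
g.add_edges_from([('s', 'a'), ('a', 't'),
                  ('s', 'b'), ('b', 't'),
                  ('s', 'c'), ('c', 'd'), ('d', 't')])

# a single shortest path (which of the tied paths is returned is not guaranteed)
one_path = nx.shortest_path(g, source='s', target='t')

# every path that shares the minimum length
all_paths = sorted(nx.all_shortest_paths(g, source='s', target='t'))

print(len(one_path) - 1)  # number of edges on a shortest path
print(all_paths)
```

The longer route through `c` and `d` is never returned, since both functions consider only paths of minimum length.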

---

<br>

### Node-Level  <a class="anchor" id="node-level"></a>
***

#### Nodes  
- **Drugs**
  - [lisinopril dihydrate (`CHEBI_6503`)](#chebi_6503)  
- **Outcomes (diseases)**
  - [myocardial infarction (`MONDO_0005068`)](#mondo_0005068)  
  - [contact dermatitis (`MONDO_0005480`)](#mondo_0005480)    

***

#### lisinopril dihydrate (`CHEBI_6503`) <a class="anchor" id="chebi_6503"></a>

In [None]:
node = [obo.CHEBI_6503]
degree = str(kg.degree(node[0]))
in_edges = str(len(kg.in_edges(node[0])))
out_edges = str(len(kg.out_edges(node[0])))
print('This node has degree {}, which consists of {} in-edges and {} out-edges'.format(degree, in_edges, out_edges))

*Ancestors*

In [None]:
# examine the node's ancestors
prefix = 'CHEBI'
chebi6503_anc_dict = processes_ancestor_path_list(nx_ancestor_search(kg, node.copy(), prefix))
chebi6503_ancestors = format_path_ancestors(chebi6503_anc_dict, node_data_dict)

In [None]:
# visualize hierarchy
visualize_ancestor_tree(chebi6503_ancestors)

In [None]:
# print results -- nodes are ordered descending by seniority (lower numbers indicate closer to root)
print('Ancestors of {}\n'.format(node[0]))
for level in range(len(chebi6503_ancestors)):
    print('Level: {}'.format(str(level)))
    for v in chebi6503_ancestors[::-1][level]:
        v_uri = URIRef(v.split('(')[-1].strip(')'))
        descs = len([a[1] for b in [[[i, n[0]] for j in [kg.get_edge_data(*(n[0], v_uri)).keys()]
                                     for i in j] for n in list(kg.in_edges(v_uri))] for a in b
                     if a[0] == RDFS.subClassOf])
        uri_strip = re.sub('http://purl.obolibrary.org/obo/', '', v)
        print('\t- {} - class has {} descendants'.format(uri_strip, descs + 1))
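
The nested comprehension above counts the `rdfs:subClassOf` in-edges of each ancestor. A more compact way to count direct subclasses is sketched below on a toy graph. The function name and node labels are hypothetical, and the string `'subClassOf'` stands in for `RDFS.subClassOf`. This version visits each parallel edge exactly once, so its totals may differ from a loop that re-reads all edge keys for every in-edge.

```python
import networkx as nx

# toy graph: two subclasses of 'parent', one of which has a second, parallel relation
toy = nx.MultiDiGraph()
toy.add_edge('child_1', 'parent', key='subClassOf')
toy.add_edge('child_1', 'parent', key='treats')
toy.add_edge('child_2', 'parent', key='subClassOf')
toy.add_edge('drug', 'parent', key='treats')


def count_direct_subclasses(graph, node, predicate='subClassOf'):
    """Counts in-edges of `node` whose edge key equals `predicate`."""
    return sum(1 for _, _, key in graph.in_edges(node, keys=True) if key == predicate)


print(count_direct_subclasses(toy, 'parent'))
```

Note that this counts direct subclasses only; counting every descendant would require walking the hierarchy downward level by level.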

*Neighborhood*

In [None]:
# examine the node's neighborhood (out-edges)
nodes = list(kg.neighbors(node[0]))
neighbors = [a for b in [[[i, n] for j in [kg.get_edge_data(*(node[0], n)).keys()]
                          for i in j] for n in nodes] for a in b]
chebi6503_sorted_neigbors = sorted(neighbors, key=lambda x: (str(x[1]).split('/')[-1], x[0]))

# report the number of out-edge neighbors rather than the total (in + out) degree
print('{} has {} neighbors'.format(str(node[0]).split('/')[-1], len(nodes)))

In [None]:
# print node's neighborhood (out-edges)
formats_node_information(node, chebi6503_sorted_neigbors, node_data_dict, verbose=False)

***
***

#### myocardial infarction (`MONDO_0005068`) <a class="anchor" id="mondo_0005068"></a>

In [None]:
node = [obo.MONDO_0005068]
degree = str(kg.degree(node[0]))
in_edges = str(len(kg.in_edges(node[0])))
out_edges = str(len(kg.out_edges(node[0])))
print('This node has degree {}, which consists of {} in-edges and {} out-edges'.format(degree, in_edges, out_edges))

*Ancestors*

In [None]:
# examine the node's ancestors
prefix = 'MONDO'
mondo0005068_anc_dict = processes_ancestor_path_list(nx_ancestor_search(kg, node.copy(), prefix))
mondo0005068_ancestors = format_path_ancestors(mondo0005068_anc_dict, node_data_dict)

In [None]:
# visualize hierarchy
visualize_ancestor_tree(mondo0005068_ancestors)

In [None]:
# print results -- nodes are ordered by seniority (lower numbers indicate closer to root)
print('Ancestors of {}\n'.format(node[0]))
for level in range(len(mondo0005068_ancestors)):
    print('Level: {}'.format(str(level)))
    for v in mondo0005068_ancestors[::-1][level]:
        v_uri = URIRef(v.split('(')[-1].strip(')'))
        descs = len([a[1] for b in [[[i, n[0]] for j in [kg.get_edge_data(*(n[0], v_uri)).keys()]
                                     for i in j] for n in list(kg.in_edges(v_uri))] for a in b
                     if a[0] == RDFS.subClassOf])
        print('\t- {} - class has {} descendants'.format(re.sub('http://purl.obolibrary.org/obo/', '', v), descs + 1))

*Neighborhood*

In [None]:
# examine the node's neighborhood (out-edges)
nodes = list(kg.neighbors(node[0]))
neighbors = [a for b in [[[i, n] for j in [kg.get_edge_data(*(node[0], n)).keys()]
                          for i in j] for n in nodes] for a in b]
mondo0005068_sorted_neigbors = sorted(neighbors, key=lambda x: (str(x[1]).split('/')[-1], x[0]))

# report the number of out-edge neighbors rather than the total (in + out) degree
print('{} has {} neighbors'.format(str(node[0]).split('/')[-1], len(nodes)))

In [None]:
# print node's neighborhood (out-edges)
formats_node_information(node, mondo0005068_sorted_neigbors, node_data_dict, verbose=False)


***

#### contact dermatitis (`MONDO_0005480`) <a class="anchor" id="mondo_0005480"></a>

In [None]:
node = [obo.MONDO_0005480]
degree = str(kg.degree(node[0]))
in_edges = str(len(kg.in_edges(node[0])))
out_edges = str(len(kg.out_edges(node[0])))
print('This node has degree {}, which consists of {} in-edges and {} out-edges'.format(degree, in_edges, out_edges))

*Ancestors*

In [None]:
# examine the node's ancestors
prefix = 'MONDO'
mondo0005480_anc_dict = processes_ancestor_path_list(nx_ancestor_search(kg, node.copy(), prefix))
mondo0005480_ancestors = format_path_ancestors(mondo0005480_anc_dict, node_data_dict)

In [None]:
# visualize hierarchy
visualize_ancestor_tree(mondo0005480_ancestors)

In [None]:
# print results -- nodes are ordered by seniority (lower numbers indicate closer to root)
print('Ancestors of {}\n'.format(node[0]))
for level in range(len(mondo0005480_ancestors)):
    print('Level: {}'.format(str(level)))
    for v in mondo0005480_ancestors[::-1][level]:
        v_uri = URIRef(v.split('(')[-1].strip(')'))
        descs = len([a[1] for b in [[[i, n[0]] for j in [kg.get_edge_data(*(n[0], v_uri)).keys()]
                                     for i in j] for n in list(kg.in_edges(v_uri))] for a in b
                     if a[0] == RDFS.subClassOf])
        print('\t- {} - class has {} descendants'.format(re.sub('http://purl.obolibrary.org/obo/', '', v), descs + 1))

*Neighborhood*

In [None]:
# examine the node's neighborhood (out-edges)
nodes = list(kg.neighbors(node[0]))
neighbors = [a for b in [[[i, n] for j in [kg.get_edge_data(*(node[0], n)).keys()]
                          for i in j] for n in nodes] for a in b]
mondo0005480_sorted_neigbors = sorted(neighbors, key=lambda x: (str(x[1]).split('/')[-1], x[0]))

# report the number of out-edge neighbors rather than the total (in + out) degree
print('{} has {} neighbors'.format(str(node[0]).split('/')[-1], len(nodes)))

In [None]:
# print node's neighborhood (out-edges)
formats_node_information(node, mondo0005480_sorted_neigbors, node_data_dict, verbose=False)

<br>

### Path-Level Characterization  <a class="anchor" id="path-level"></a>

***

- **Positive Relationship:** [`lisinopril dihydrate` ([CHEBI_6503](http://purl.obolibrary.org/obo/CHEBI_6503)) and `myocardial infarction` ([MONDO_0005068](http://purl.obolibrary.org/obo/MONDO_0005068))](#path1)  
   
- **Negative Relationship:** [`lisinopril dihydrate` ([CHEBI_6503](http://purl.obolibrary.org/obo/CHEBI_6503)) and `contact dermatitis` ([MONDO_0005480](http://purl.obolibrary.org/obo/MONDO_0005480))](#path2)       


***

#### lisinopril dihydrate (`CHEBI_6503`) and myocardial infarction (`MONDO_0005068`) <a class="anchor" id="path1"></a> 

In [None]:
s = obo.CHEBI_6503; t = obo.MONDO_0005068
kg_orientation = kg; ind = 'directed'
spl_d = nx.shortest_path_length(kg, source=s, target=t)
print('The shortest path length is: {}'.format(str(spl_d)))

In [None]:
# collect every shortest path and report how many were found
shortest_paths1 = list(nx.all_shortest_paths(kg_orientation, s, t))
p_len = nx.shortest_path_length(kg_orientation, source=s, target=t)
n_paths = len(shortest_paths1)
verb, noun = ('is', 'path') if n_paths == 1 else ('are', 'paths')
print('There {} {} shortest {} of length {}'.format(verb, n_paths, noun, p_len))

In [None]:
formats_path_information(kg=kg_orientation,
                         paths=shortest_paths1,
                         path_type='shortest',
                         metadata_func=metadata_formatter,
                         metadata_dict=None,
                         node_metadata=node_data_dict,
                         verbose=False,
                         rand=True,
                         sample_size=10)

***
#### lisinopril dihydrate (`CHEBI_6503`) and contact dermatitis (`MONDO_0005480`)  <a class="anchor" id="path2"></a>


In [None]:
s = obo.CHEBI_6503; t = obo.MONDO_0005480
kg_orientation = kg; ind = 'directed'
spl_d = nx.shortest_path_length(kg, source=s, target=t)
print('The shortest path length is: {}'.format(str(spl_d)))

In [None]:
# collect every shortest path and report how many were found
shortest_paths2 = list(nx.all_shortest_paths(kg_orientation, s, t))
p_len = nx.shortest_path_length(kg_orientation, source=s, target=t)
n_paths = len(shortest_paths2)
verb, noun = ('is', 'path') if n_paths == 1 else ('are', 'paths')
print('There {} {} shortest {} of length {}'.format(verb, n_paths, noun, p_len))

In [None]:
formats_path_information(kg=kg_orientation,
                         paths=shortest_paths2,
                         path_type='shortest',
                         metadata_func=metadata_formatter,
                         metadata_dict=None,
                         node_metadata=node_data_dict,
                         verbose=False,
                         rand=True,
                         sample_size=10)
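
The path cells above search the directed graph (`kg_orientation = kg`). When no directed path exists between two nodes, `networkx` raises `NetworkXNoPath`. One possible way to handle that, sketched below on a toy graph, is to fall back to an undirected search. The helper name `shortest_paths_with_fallback` and the toy nodes are hypothetical, and whether an undirected path is meaningful for a given pair is a modeling decision this sketch does not settle.

```python
import networkx as nx


def shortest_paths_with_fallback(graph, source, target):
    """Returns (orientation, paths); tries a directed search first, then undirected."""
    try:
        return 'directed', list(nx.all_shortest_paths(graph, source, target))
    except nx.NetworkXNoPath:
        pass
    try:
        undirected = graph.to_undirected()  # copies the graph, so this can be costly
        return 'undirected', list(nx.all_shortest_paths(undirected, source, target))
    except nx.NetworkXNoPath:
        return 'none', []


# toy graph: 'drug' and 'disease' both point at 'gene'; 'isolated' has no edges
toy = nx.MultiDiGraph()
toy.add_edges_from([('drug', 'gene'), ('disease', 'gene')])
toy.add_node('isolated')

print(shortest_paths_with_fallback(toy, 'drug', 'gene'))
print(shortest_paths_with_fallback(toy, 'drug', 'disease'))
print(shortest_paths_with_fallback(toy, 'drug', 'isolated'))
```

The sketch does not catch `NodeNotFound`, which `networkx` raises when the source node is missing from the graph, and `to_undirected()` builds a full copy, which may be expensive on a graph the size of the one loaded earlier.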


<br>

***
***

This Notebook is part of the [**PheKnowLator Ecosystem**](https://zenodo.org/communities/pheknowlator-ecosystem/edit/)

```
@misc{callahan_tj_2019_3401437,
  author       = {Callahan, TJ},
  title        = {PheKnowLator},
  month        = mar,
  year         = 2019,
  doi          = {10.5281/zenodo.3401437},
  url          = {https://doi.org/10.5281/zenodo.3401437}
}
```

***