
# PheKnowLator  


***
***

**Author:** [TJCallahan](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=callahantiff@gmail.com)  
**GitHub Repository:** [PheKnowLator](https://github.com/callahantiff/PheKnowLator/wiki)  

**Purpose:** This notebook serves as a `main` file for the PheKnowLator project. It walks through the program step by step and generates the knowledge graph shown below. Several steps must be completed before this notebook can be run successfully:
- Make sure that the [master resource file](https://www.dropbox.com/s/4qu4ev96h5q6bdx/resource_info.txt?dl=0) is complete.  
- Make sure that the files specifying the [ontologies](https://www.dropbox.com/s/bmmaavyd499d7px/ontology_source_list.txt?dl=0), [class](https://www.dropbox.com/s/cpxrj1to55syhzi/class_source_list.txt?dl=0), and [instance](https://www.dropbox.com/s/71b07b1g86roz3d/instance_source_list.txt?dl=0) data sources are complete.  
- Run `.scripts/NCBO_rest_api.py` to obtain mappings between ontology identifiers.
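
Before running the cells below, it can help to confirm that the prerequisite files are in place. The following is a minimal sketch, not part of the PheKnowLator code base: the helper `find_missing_files` is hypothetical, and the paths are the ones used by later cells in this notebook.

```python
from pathlib import Path

# Paths as they appear in later cells of this notebook; adjust if your layout differs.
REQUIRED_FILES = [
    'resources/resource_info.txt',
    'resources/ontology_source_list.txt',
    'resources/class_source_list.txt',
    'resources/instance_source_list.txt',
]


def find_missing_files(paths, root='.'):
    """Return the paths that are missing or empty under `root` (hypothetical helper)."""
    missing = []
    for path in paths:
        candidate = Path(root) / path
        if not candidate.is_file() or candidate.stat().st_size == 0:
            missing.append(path)
    return missing


missing = find_missing_files(REQUIRED_FILES)
if missing:
    print('Missing or empty prerequisite files:', ', '.join(missing))
else:
    print('All prerequisite files found.')
```

This only checks that the files exist and are non-empty; it does not validate their contents.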


***

### Table of Contents
* [Data Sources](#data-source)  
* [Create Edge Lists](#create-edges)  
* [Build Knowledge Graph](#build-kg)  
* [Generate Mechanism Embeddings](#generate-embeddings)  
* [t-SNE Plot](#tsne-plot)  

***
***

<img src="https://user-images.githubusercontent.com/8030363/65094681-3383ef00-d97b-11e9-989d-94b1c3ad73b8.png" 
width="900" height="400">


**_NOTE._** _There is also a script version of this file (`./main.py`). Please see the [README](https://github.com/callahantiff/PheKnowLator/blob/master/README.md) for more information._


**Install Dependencies**

In [1]:
# import needed libraries
import glob
import json

import matplotlib.patches as mpatches
import numpy as np
import pandas as pd

from rdflib import Graph
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# import scripts
import scripts.python.DataSources
import scripts.python.EdgeDictionary
from scripts.python.KnowledgeGraph import *
from scripts.python.KnowledgeGraphEmbedder import *
from scripts.python.KGEmbeddingVisualizer import *


***
***

## Download Data Sources <a class="anchor" id="data-source"></a>

First, we need to download all sources that will be used to construct our knowledge graph. This portion of the script has three steps:  
1. Download Ontology Data  
2. Download Class Data  
3. Download Instance Data


**Note.** When running the cells below for the class and instance data sources, you will be prompted to enter a file name for each source. Please use the correct formatting as described in the [Dependencies](https://github.com/callahantiff/PheKnowLator/wiki/Dependencies) section of the project Wiki.   


***
**Step 1: Download Ontology Data**


In [None]:
ont = scripts.python.DataSources.OntData('resources/ontology_source_list.txt')
ont.parses_resource_file()
ont.downloads_data_from_url('imports')
    

In [None]:
ont.generates_source_metadata()
ont.writes_source_metadata_locally()


***
**Step 2: Download Class Data**


In [None]:
cls = scripts.python.DataSources.Data('resources/class_source_list.txt')
cls.parses_resource_file()
cls.downloads_data_from_url('')
    

In [None]:
cls.generates_source_metadata()
cls.writes_source_metadata_locally()


***
**Step 3: Download Instance Data**


In [None]:
inst = scripts.python.DataSources.Data('resources/instance_source_list.txt')
inst.parses_resource_file()
inst.downloads_data_from_url('')
    

In [None]:
inst.generates_source_metadata()
inst.writes_source_metadata_locally()


***
***

## Create Edge Lists <a class="anchor" id="create-edges"></a>

To create the edge lists, you will need to do the following (assuming you don't want to use the [data from the current release](https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0)):
 - Run the `python/NCBO_rest_api.py` script.  
   - Note that this script requires you to create an account with [BioPortal](https://bioportal.bioontology.org/) and place your API key in `resources/bioportal_api_key.txt`. 
   - When run from the command line, you will be asked to enter two ontologies (`source1=MESH`, `source2=CHEBI`).
   - This will generate a text file that contains mappings between identifiers from the two ontologies specified and write the results to `resources/data_maps/source1_source2_map.txt`.  
   
<br>   
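
Once the mapping file exists, it can be loaded into a dictionary for lookups. The sketch below is illustrative only: it assumes the file holds two tab-delimited columns (source identifier, target identifier), which is not confirmed here, and `load_identifier_map` is a hypothetical helper rather than a PheKnowLator function.

```python
import csv
import io


def load_identifier_map(handle, delimiter='\t'):
    """Read two-column rows into a dict (the two-column layout is an assumption)."""
    mapping = {}
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        source_id, target_id = row[0].strip(), row[1].strip()
        mapping[source_id] = target_id
    return mapping


# Made-up rows standing in for resources/data_maps/source1_source2_map.txt
sample = io.StringIO('MESH_A\tCHEBI_1\nMESH_B\tCHEBI_2\n\n')
id_map = load_identifier_map(sample)
print(id_map)
```

To read the real file, pass an open file handle instead of the `io.StringIO` sample.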

The code below takes the dictionaries of processed data described above and uses them to create an edge list for each of the edge types specified in [`resource_info.txt`](https://github.com/callahantiff/PheKnowLator/blob/development/resources/resource_info.txt). Each edge list is appended to a nested dictionary (see details below).

<br>

**Master Edge Dictionary**
Below is an example of what the `Master Edge Dictionary` contains for each processed resource:  
```python

master_edges = {'chemical-disease'  :
                {'source_labels'    : ';MESH_;',
                 'data_type'        : 'class-class',
                 'edge_relation'    : 'RO_0002606',
                 'uri'              : ('http://purl.obolibrary.org/obo/',
                                       'http://purl.obolibrary.org/obo/'),
                 'row_splitter'     : '#',
                 'col_splitter'     : 't',
                 'column_indicies'  : '1;4',
                 'identifier_maps'  : '0:./MESH_CHEBI_MAP.txt;1:disease-dbxref-map',
                 'evidence_criteria': "5;!=;' ",
                 'filter_criteria'  : 'None',
                 'edge_list'        : []}}
```
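
To make the role of the splitter and column fields more concrete, here is a small sketch of how a single data row could be turned into an edge. It rests on assumptions that are not confirmed by the project code: that `column_indicies` holds two zero-based column positions separated by `;`, and that the `col_splitter` value `'t'` stands for a tab. The function `extract_edge` is hypothetical.

```python
def extract_edge(row, column_indices, col_splitter='\t'):
    """Return (subject, object) taken from the two requested columns of a row."""
    # '1;4' -> positions 1 and 4 (assumed to be zero-based)
    first, second = (int(i) for i in column_indices.split(';'))
    columns = row.rstrip('\n').split(col_splitter)
    return columns[first], columns[second]


# Made-up row with five tab-separated columns
row = 'c0\tMESH_X\tc2\tc3\tMESH_Y\n'
edge = extract_edge(row, '1;4')
print(edge)
```

In the real pipeline the extracted identifiers would presumably then be passed through the maps listed in `identifier_maps` before being added to `edge_list`.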


In [None]:
# combine data sources
combined_edges = dict(dict(cls.data_files, **inst.data_files), **ont.data_files)


# initialize edge dictionary class
master_edges = scripts.python.EdgeDictionary.EdgeList(combined_edges, './resources/resource_info.txt')

master_edges.creates_knowledge_graph_edges()


In [2]:
# save nested edges locally
# with open('./resources/kg_master_edge_dictionary.json', 'w') as filepath:
#     json.dump(master_edges.source_info, filepath)

# load existing master_edge dictionary
with open('./resources/kg_master_edge_dictionary.json', 'r') as filepath:
    master_edges = json.load(filepath)


In [3]:
# print basic stats on each resource
edge_data = [
    [key,
     ', '.join(master_edges[key]['edge_list'][0]),
     len(master_edges[key]['edge_list'])]
    
    for key in master_edges.keys()]

# convert dict to pandas df for nice printing
df = pd.DataFrame(edge_data, columns = ['edge', 'example_edge', 'edge_list_count']) 
df
                  

| | edge | example_edge | edge_list_count |
|---|---|---|---|
| 0 | chemical-gene | CHEBI_81395, 596 | 400288 |
| 1 | chemical-go | CHEBI_81395, GO_0006309 | 41604 |
| 2 | chemical-pathway | CHEBI_10033, R-HSA-1430728 | 28582 |
| 3 | chemical-disease | CHEBI_81395, DOID_13677 | 2328410 |
| 4 | disease-gobp | DOID_9667, GO_0009257 | 1223624 |
| 5 | disease-gomf | DOID_3021, GO_0033885 | 138427 |
| 6 | disease-gocc | DOID_12849, GO_1990393 | 85166 |
| 7 | disease-phenotype | DOID_0110720, HP_0000006 | 120556 |
| 8 | gene-gene | 381, 6712 | 411868 |
| 9 | gene-gobp | 728945, GO_0000413 | 157344 |



***
***

## Build Knowledge Graph  <a class="anchor" id="build-kg"></a>

Once the edge lists have been created, we can start building our knowledge graph. Since this process is somewhat time consuming, we break it into the following steps:  

1. Merge Ontologies   
2. Create Class-Instance Edges    
3. Create Instance-Instance and Class-Class Edges    
4. Remove Disjointness Axioms   
   - [Disjointness axioms](https://go-protege-tutorial.readthedocs.io/en/latest/Disjointness.html) state that two classes share no members, so that no individual can belong to both. We remove these axioms from our graph before closing our ontology because they often cause unexpected errors and inconsistencies during reasoning.    
5. Deductively Close Graph using [ELK](https://protegewiki.stanford.edu/wiki/ELK) Reasoner   
   - To run this from the command line and not through this program, enter the following from project directory root:   

  ```bash
./resources/lib/owltools ./resources/knowledge_graphs/[input_graph_filename.owl] --reasoner elk --run-reasoner --assert-implied -o resources/knowledge_graphs/[output_graph_name]_Closed_ELK.owl
```
6. Save Edge List    
   - Two versions of the knowledge graph's edges are saved as lists of triples: (1) one with node labels and (2) one with integer labels (the input format required by the embedding algorithms).
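
The integer-labeled edge list in step 6 can be pictured with the sketch below. It is not the project's `maps_str_to_int` implementation; it is a generic approach in which each distinct label receives the next unused integer, and both the function name `map_triples_to_ints` and the relation labels are made up for illustration.

```python
def map_triples_to_ints(triples):
    """Give each distinct label an integer in order of first appearance."""
    label_to_int = {}
    int_triples = []
    for triple in triples:
        encoded = []
        for label in triple:
            if label not in label_to_int:
                label_to_int[label] = len(label_to_int)
            encoded.append(label_to_int[label])
        int_triples.append(tuple(encoded))
    return int_triples, label_to_int


# Illustrative triples only; 'rel_a' and 'rel_b' are placeholder relations
triples = [('CHEBI_81395', 'rel_a', 'DOID_13677'),
           ('DOID_13677', 'rel_b', 'HP_0000006')]
int_triples, label_map = map_triples_to_ints(triples)
print(int_triples)
print(label_map)
```

Keeping `label_map` alongside the integer triples is what later allows embedded nodes to be translated back to their original identifiers.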

<br>

**‼ IMPORTANT NOTE:** The file containing the merged ontologies is quite large and can take up to 30 minutes to read in.  This is not a limitation of the code directly, but rather a function of the [`RDFLib Library`](https://github.com/RDFLib). While there are other ways to read in this data, we maintain reliance on this library as it is the most user-friendly for non-RDF users.  

***


***
### Biological Knowledge Graph<a class="anchor" id="bio-kg"></a>



**Merge Ontologies**


In [2]:
# set-up vars for file manipulation
ont_files = './resources/ontologies/'
merged_onts = ont_files + 'merged_ontologies/'
ont_kg = './resources/knowledge_graphs/'


In [None]:
# create list of ontologies to merge
ontology_list = [
    [ont_files + 'go_with_imports.owl', ont_files + 'hp_with_imports.owl', merged_onts + 'hp_go_merged.owl'],
    [merged_onts + 'hp_go_merged.owl', ont_files + 'chebi_lite.owl', merged_onts + 'hp_go_chebi_merged.owl'],
    [merged_onts + 'hp_go_chebi_merged.owl', ont_files + 'vo_with_imports.owl', merged_onts +
     'PheKnowLator_v2_MergedOntologies_BioKG.owl']
]

# merge ontologies
merges_ontologies(ontology_list)
  

In [None]:
# read in file and count edges (n=2277644 edges)
print(len(Graph().parse(merged_onts + 'hp_go_merged.owl')))


In [None]:
# read in file and count edges (n=3524109 edges)
print(len(Graph().parse(merged_onts + 'hp_go_chebi_merged.owl')))

In [None]:
# read in file and count edges (n=3606052 edges)
print(len(Graph().parse(merged_onts + 'PheKnowLator_v2_MergedOntologies_BioKG.owl')))



**Create Edge Lists**


In [3]:
# separate edge lists by data type
# master_edges is an EdgeList object when built above, or a plain dict when loaded from JSON
edge_info = master_edges.source_info if hasattr(master_edges, 'source_info') else master_edges
master_edges = edge_info.copy()
class_edges = {}
other_edges = {}

for edge in master_edges.keys():
    if master_edges[edge]['data_type'] in ('class-instance', 'instance-class'):
        class_edges[edge] = master_edges[edge]
    else:
        other_edges[edge] = master_edges[edge]


_Create Class-Instance Edges_   


In [None]:
# create class-instance edges
class_kg = creates_knowledge_graph_edges(class_edges,
                                         'class',
                                          Graph().parse(merged_onts + 'PheKnowLator_v2_MergedOntologies_BioKG.owl'),
                                          ont_kg + 'PheKnowLator_v2_ClassInstancesOnly_BioKG.owl',
                                          kg_class_iri_map={})


_Create Instance-Instance and Class-Class Edges_

In [None]:
# create instance-instance and class-class edges (in 00:26 minutes (4.88 s/it))
class_inst_kg = creates_knowledge_graph_edges(other_edges,
                                              'other',
                                              class_kg,
                                              ont_kg + 'PheKnowLator_v2_Full_BioKG.owl')


_Remove Disjointness Axioms_

In [None]:
# identified and removed 333 disjointness axioms
removes_disointness_axioms(class_inst_kg, ont_kg + 'PheKnowLator_v2_Full_BioKG_NoDisjointness.owl')


_Deductively Close Graph_

In [None]:
# close graph with ELK reasoner
closes_knowledge_graph(ont_kg + 'PheKnowLator_v2_Full_BioKG_NoDisjointness.owl',
                       'elk',
                       ont_kg + 'PheKnowLator_v2_Full_BioKG_NoDisjointness_Closed_ELK.owl')


_Remove Metadata Nodes_

In [None]:
# removes nodes that are not biologically meaningful
removes_metadata_nodes(Graph().parse(ont_kg + 'PheKnowLator_v2_Full_BioKG_NoDisjointness_Closed_ELK.owl'),
                       ont_kg + 'PheKnowLator_v2_Full_BioKG_NoDisjointness_ELK_Closed_NoMetadataNodes.owl',
                       ont_kg + 'PheKnowLator_v2_ClassInstancesOnly_BioKG_ClassInstanceMap.json')


_Save Edge List_

In [None]:
# convert triples to ints
maps_str_to_int(Graph().parse(ont_kg + 'PheKnowLator_v2_Full_BioKG_NoDisjointness_ELK_Closed_NoMetadataNodes.owl'),
                ont_kg + 'PheKnowLator_v2_Full_BioKG_NoDisjointness_Closed_ELK_Triples_Integers.txt',
                ont_kg + 'PheKnowLator_v2_Full_BioKG_Triples_Integer_Labels_Map.json')
    


***
***

## Generate Mechanism Embeddings <a class="anchor" id="generate-embeddings"></a>  

To create estimates of molecular mechanisms, we embed knowledge graph information using [DeepWalk](https://github.com/phanein/deepwalk). This repository contains code to run two different versions of the [original method](http://www.perozzi.net/publications/14_kdd_deepwalk.pdf) developed by [Bryan Perozzi](https://github.com/phanein):    

***

 -  [`DeepWalk algorithm-C`](https://github.com/xgfs/deepwalk-c): an implementation of the original algorithm in C++ (with some improvements that speed up initialization of the hierarchical softmax tree), developed by [Anton Tsitsulin](https://github.com/xgfs).  
 
     To run this from the command line and not through this program, enter the following from `./deepwalk-c-master/src`:   

    ```bash
    ./deepwalk -input ../PheKnowLator_v2_Full_BioKG_NoDisjointness_Full_Triples_Integers_.bcsr -output ../PheKnowLator_v2_Full_BioKG_NotClosed_128_100_20_10_100_003.out -threads 32 -dim 128 -nwalks 100 -walklen 20 -window 10 -nprwalks 100 -lr 0.03 -verbose 2
    ```  

***   
    
 - [`DeepWalk-RDF`](https://github.com/bio-ontology-research-group/walking-rdf-and-owl): an extension of the original algorithm that also embeds graph edges; developed by [the Bio-Ontology Research Group](https://github.com/bio-ontology-research-group/walking-rdf-and-owl).  
   - ‼ **Note:** This library depends on the [C++ Boost library](https://www.pyimagesearch.com/2015/04/27/installing-boost-and-boost-python-on-osx-with-homebrew/) and the [Boost Threadpool Header Files](http://threadpool.sourceforge.net/). For the headers, use the sub-directory called `Boost` at the top level of the `walking-rdf-and-owl-master` directory. To compile and run `Deepwalk-RDF` on OSX, a few important changes need to be made:  
      - Change `TIME_UTC` to `TIME_UTC_` in the `boost/threadpool/task_adaptors.hpp`.  
      - Change the `-lboost_thread` argument to `-lboost_thread-mt` in the `walking-rdf-and-owl-master/Makefile` 
      - To troubleshoot incompatibility issues between Deepwalk and Gensim, run the following in this order:  
        - `pip uninstall gensim`  
        - `pip uninstall deepwalk`  
        - `pip install gensim==0.10.2` 
        - `pip install deepwalk`  

    To run this from the command line and not through this program, enter the following from `./walking-rdf-and-owl-master/`:   

    ```bash
    deepwalk --workers 64 --representation-size 256 --format edgelist --input     applications/inputs/PheKnowLator_v2_Full_BioKG_NoDisjointness_Closed_ELK_Triples_Integers.txt  --output     applications/outputs/PheKnowLator_v2_Full_BioKG_NoDisjointness_Closed_ELK_Embedddings_256_10_50_40.txt --window-size 10 --number-walks 50 --walk-length 40
    ```

***

**NOTE.** It's both faster and less taxing on your computer to run this code from the command line rather than from here.

_The current analysis used DeepWalk-C because its runtime was nearly 10 times faster than that of DeepWalk-RDF._
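
For readers unfamiliar with DeepWalk, its first stage samples truncated random walks from the graph; the walks are then treated like sentences and fed to a skip-gram model. The pure-Python sketch below illustrates only the walk-sampling stage. It is not the code of either implementation above: the function names are hypothetical, the graph is treated as undirected (an assumption), and the `nwalks` and `walklen` parameter names simply mirror the options used in this notebook.

```python
import random
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency list from (source, target) pairs."""
    adjacency = defaultdict(list)
    for source, target in edges:
        adjacency[source].append(target)
        adjacency[target].append(source)
    return adjacency


def generate_walks(adjacency, nwalks, walklen, seed=1):
    """Sample `nwalks` uniform random walks of up to `walklen` nodes from every node."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    walks = []
    for _ in range(nwalks):
        rng.shuffle(nodes)  # visit start nodes in a new random order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walklen:
                neighbors = adjacency[walk[-1]]
                if not neighbors:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Toy graph: a four-node cycle with integer node labels
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
adjacency = build_adjacency(edges)
walks = generate_walks(adjacency, nwalks=2, walklen=5)
print(len(walks), walks[0])
```

With the settings used in this notebook (`nwalks=100`, `walklen=20`), the number of walks grows linearly with the number of nodes, which is one reason the compiled implementations are preferred for the full graph.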


In [None]:
# set file path
embed_path = './resources/embeddings/'

runs_deepwalk(input_file=ont_kg + 'PheKnowLator_v2_Full_BioKG_NoDisjointness_Closed_ELK_Triples_Integers.txt',
              output_file=embed_path + 'PheKnowLator_v2_Full_BioKG_DeepWalk_Embeddings_128_10_50_20.txt',
              threads=100,
              dim=128,
              nwalks=100,
              walklen=20,
              window=10,
              nprwalks=100,
              lr=0.01)


**Process and Format Embeddings for Analysis**  
Read in embeddings and replace integer node labels with their more meaningful biological knowledge graph identifiers.
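
The relabeling idea can be sketched as follows. This is not the project's `processes_embedded_nodes` function; it assumes that each embedding line starts with an integer node label followed by the vector values, and that the JSON map goes from identifier to integer. Both the layout and the helper name `relabel_embeddings` are assumptions.

```python
import json


def relabel_embeddings(lines, label_to_int):
    """Swap each leading integer label for its original identifier."""
    # invert the map; keys are strings because they are compared with text tokens
    int_to_label = {str(value): key for key, value in label_to_int.items()}
    relabeled = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        node, vector = parts[0], [float(x) for x in parts[1:]]
        relabeled.append((int_to_label.get(node, node), vector))
    return relabeled


# Toy stand-ins for the JSON map and the embedding file (first line is a header)
label_to_int = json.loads('{"http://purl.obolibrary.org/obo/GO_0002537": 0, '
                          '"http://purl.obolibrary.org/obo/DOID_1588": 1}')
lines = ['2 3\n', '0 0.1 0.2 0.3\n', '1 0.4 0.5 0.6\n']
embeddings = relabel_embeddings(lines[1:], label_to_int)
print(embeddings[0][0])
```

Labels that are missing from the map are left unchanged, so unmapped nodes remain visible in the output instead of being dropped silently.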


_Not Deductively Closed Knowledge Graph_

In [None]:
# not closed graphs
processes_embedded_nodes(glob.glob('./resources/embeddings/*_NotClosed_*.out'),
                         glob.glob('./resources/knowledge_graphs/kg_not_closed/*Triples_Integers.txt')[0],
                         glob.glob('./resources/knowledge_graphs/kg_not_closed/*.json')[0])
    

_Deductively Closed Knowledge Graph_

In [None]:
# closed graphs
processes_embedded_nodes(glob.glob('./resources/embeddings/*_Closed_*.out'),
                         glob.glob('./resources/knowledge_graphs/kg_closed/*Triples_Integers.txt')[0],
                         glob.glob('./resources/knowledge_graphs/kg_closed/*.json')[0])


***

**Preprocess Embedding Data to Prepare for Analysis**

In [6]:
# read in embedding files and re-generate node labels
embed_df_out = './resources/embeddings/PheKnowLator_v2_Full_BioKG2_Embedding_128_100_20_10_100_003_dataframe.csv'
input_embeddings = glob.glob('./resources/embeddings/*_formatted.txt')
node_list = ['HP', 'CHEBI', 'VO', 'DOID', 'R-HSA', 'GO', 'geneid']

# NOTE: every file is written to embed_df_out, so only the last file processed is kept
for file in input_embeddings:

    # read in the embeddings, skipping the first line of the file
    with open(file) as f:
        embedding_file = f.readlines()[1:]

    # convert embeddings to df
    embeddings = processes_integer_labeled_embeddings(embedding_file, node_list)
    embedding_data = pd.DataFrame(embeddings, columns=['node_type', 'node_id', 'embedding'])

    # add a human-readable category for each node type
    embedding_data['node_category'] = embedding_data['node_type'].map({'HP': 'Phenotypes',
                                                                       'DOID': 'Diseases',
                                                                       'VO': 'Vaccines',
                                                                       'CHEBI': 'Chemicals',
                                                                       'GO': 'Gene Ontology',
                                                                       'geneid': 'Genes',
                                                                       'R-HSA': 'Pathways'})

    # save locally
    embedding_data[['node_type', 'node_category', 'node_id', 'embedding']].to_csv(embed_df_out, index=False, header=True)


In [7]:
# preview data
embedding_data.head(n=10)


| | node_type | node_id | embedding | node_category |
|---|---|---|---|---|
| 0 | GO | http://purl.obolibrary.org/obo/GO_0002537 | [-0.12789536, 0.026679337, 0.035235696, -0.104... | Gene Ontology |
| 1 | DOID | http://purl.obolibrary.org/obo/DOID_1588 | [-0.11544607, 0.0658144, -0.0057145194, -0.102... | Diseases |
| 2 | GO | http://purl.obolibrary.org/obo/GO_0070269 | [-0.14727426, 0.053809088, 0.04400345, -0.1364... | Gene Ontology |
| 3 | GO | http://purl.obolibrary.org/obo/GO_0035377 | [-0.1670176, 0.097878315, -0.06523119, -0.0694... | Gene Ontology |
| 4 | GO | http://purl.obolibrary.org/obo/GO_0006833 | [-0.08418379, 0.037013043, -0.0058968603, -0.1... | Gene Ontology |
| 5 | CHEBI | http://purl.obolibrary.org/obo/CHEBI_35455 | [-0.08914226, 0.031103585, -0.03510634, -0.129... | Chemicals |
| 6 | DOID | http://purl.obolibrary.org/obo/DOID_2799 | [-0.12489569, 0.080540515, 0.010497774, -0.100... | Diseases |
| 7 | CHEBI | http://purl.obolibrary.org/obo/CHEBI_15930 | [-0.2005802, -0.012602089, -0.04713281, -0.052... | Chemicals |
| 8 | geneid | http://purl.uniprot.org/geneid/10640 | [-0.33007488, -0.084811226, -0.026324524, -0.2... | Genes |
| 9 | GO | http://purl.obolibrary.org/obo/GO_0046982 | [-0.11928611, 0.03227473, -0.061795298, -0.122... | Gene Ontology |



***
***

## t-SNE Plot <a class="anchor" id="tsne-plot"></a>
To visualize the relationships between the embedded nodes, we first need to reduce the dimensions of the molecular mechanism embeddings. To do this, we use t-SNE. Once reduced, the results can be visualized in a scatter plot.


**Biological Knowledge Graph**

_Process Embeddings_

_Dimensionality Reduction with t-SNE_

In [None]:
x_reduced = TruncatedSVD(n_components=50, random_state=1).fit_transform(list(embedding_data['embedding']))
x_embedded = TSNE(n_components=2, random_state=1, verbose=True, n_iter=1000, learning_rate=20, perplexity=50.0)\
    .fit_transform(x_reduced)
np.save('./resources/embeddings/PheKnowLator_v2_Full_BioKG_NoDisjointness_Closed_ELK_Embeddings_128_10_50_40_tsne', x_embedded)


In [None]:
# set-up plot arguments
# set up colors and legend labels
colors = {'Diseases': '#009EFA',
          'Chemicals': 'indigo',
          'GO Concepts': '#F79862',
          'Genes': '#4fb783',
          'Pathways': 'orchid',
          'Phenotypes': '#A3B14B'}

names = {key: key for key in colors.keys()}

# create data frame to use for plotting data by node type
df = pd.DataFrame(dict(x=x_embedded[:, 0], y=x_embedded[:, 1], group=list(embedding_data['node_type'])))
groups = df.groupby('group')

# create legend arguments
dis = mpatches.Patch(color='#009EFA', label='Diseases')
drg = mpatches.Patch(color='indigo', label='Drugs')
go = mpatches.Patch(color='#F79862', label='GO Concepts')
ge = mpatches.Patch(color='#4fb783', label='Genes')
pat = mpatches.Patch(color='orchid', label='Pathways')
phe = mpatches.Patch(color='#A3B14B', label='Phenotypes')

legend_args = [[dis, drg, go, ge, pat, phe], 14, 'lower center', 3]
title = 't-SNE: Biological Knowledge Graph'

plots_embeddings(colors, names, groups, legend_args, 16, 100, title, 20)


<br>
***
***

```
@misc{callahan_tj_2019_3401437,
  author       = {Callahan, TJ},
  title        = {PheKnowLator},
  month        = mar,
  year         = 2019,
  doi          = {10.5281/zenodo.3401437},
  url          = {https://doi.org/10.5281/zenodo.3401437}
}
```