***
# PheKnowLator - Ontology Cleaning
***
***

**Author:** [TJCallahan](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=callahantiff@gmail.com)  
**GitHub Repository:** [PheKnowLator](https://github.com/callahantiff/PheKnowLator/wiki)  
**Release:** **[v2.0.0](https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0)**
  
<br>  
  
**Purpose:** This notebook serves as a script to help prepare ontologies prior to being ingested into the knowledge graph build algorithm. This script focuses on preparing ontologies for ingestion by performing the following steps:  
1. [Clean Ontologies](#clean-ontologies)  
2. [Merge Ontologies](#merge-ontologies)  
3. [Normalize Classes](#normalize-classes)  


<br>

**Assumptions:**   
- Steps 1-2 (i.e. data downloading and master edge list creation) have already been performed  
- Directory of Imported Ontologies ➞ `./resources/ontologies`    
- Processed data write location ➞ `./resources/ontologies`  

<br>

**Dependencies:**   
- This notebook utilizes several helper functions, which are stored in the [`kg_utils.py`](https://github.com/callahantiff/PheKnowLator/blob/master/pkt_kg/utils/kg_utils.py) script. Hyperlinks to all downloaded and generated data sources are provided on the [Data Sources](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources) Wiki page as well as within each source subsection of this notebook. All generated data is freely available for download from DropBox. 
- [`OWLTools`](https://github.com/owlcollab/owltools)

_____
***

### PheKnowLator Build V2.0.0 Ontology Cleaning Summary  
**Date:** `06/09/2020`  

The table below is meant to provide a high-level overview of the modifications that were applied to each individual ontology as well as to the merged ontology file. The specific details and code for performing each of the tasks in the table are described throughout the rest of this document.  

**Error Types:**   
1. Value Errors  
2. Punning Errors  
3. Obsolete Entities/Unused Entities    
4. Erroneously Defined Entities  
5. Concept Connectivity  
6. Identifier Resolution and Alignment


Ontology | 1 | 2 | 3 | 4 | 5 | 6
:---: | :---: | :---: | :---: | :---: | :---: | :---:
Cell Line Ontology | 1 | 8 | 7 | X | X | X


### Set-Up Environment
_____

In [1]:
# import needed libraries
import glob
import json
import pickle

from owlready2 import *
from rdflib import Graph, Namespace, URIRef, BNode, Literal
from rdflib.namespace import OWL
from rdflib.plugins.sparql import prepareQuery
from tqdm import tqdm

# import script containing helper functions
from pkt_kg.utils import * 


In [2]:
# set up environment variables
write_location = './resources/knowledge_graphs'
merged_ontology_file = '/PheKnowLator_MergedOntologies.owl'
ontology_repository = glob.glob('*/ontologies/*_with_imports.owl')


<br><br>

***
### Clean Ontologies <a class="anchor" id="clean-ontologies"></a>

**Purpose:** In this step, we read in the ontologies using the [`owlready2`](https://pypi.org/project/Owlready2/) library and use it to indicate the presence of errors in the ontology files. We use this tool because it has strict filters.

Errors were found in the following ontologies:  

[`Vaccine Ontology`](http://www.violinet.org/vaccineontology/) - GitHub [issue #4](https://github.com/vaccineontology/VO/issues/4)

[`PRotein Ontology`](https://proconsortium.org/pro.shtml) - GitHub [issue #176](https://github.com/PROconsortium/PRoteinOntology/issues/176)

[`Cell Line Ontology`](http://www.clo-ontology.org/) - GitHub [issue #42](https://github.com/CLO-ontology/CLO/issues/42), [issue #52](https://github.com/CLO-ontology/CLO/issues/52)  
- `8` anatomical entities entered as individuals and classes (i.e. digestive tract, lymph node, mesoderm, circulatory system, central nervous system, organ, respiratory system, and musculature of body) were repaired. The problem was resolved by removing the individuals and keeping the classes     
- `7` NCBITaxon identifiers listed as individuals, but never used in the ontology (i.e. `NCBITaxon_110815`, `NCBITaxon_147099`, `NCBITaxon_41324`, `NCBITaxon_6040`, `NCBITaxon_6073`, `NCBITaxon_6157`, and `NCBITaxon_8570`)   were removed


**Import Ontology Check**

In [None]:
for ont in tqdm(ontology_repository):
    load_onto = get_ontology(ont).load()
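
The loop above stops at the first ontology that fails to load. A minimal sketch of a variant that records each failure and keeps going is shown here. `check_ontologies` and its `loader` argument are hypothetical names introduced for illustration; in this notebook the loader could be something like `lambda p: get_ontology(p).load()`, and the toy loader below only stands in for it.

```python
from typing import Callable, Dict, Iterable


def check_ontologies(paths: Iterable[str], loader: Callable[[str], object]) -> Dict[str, str]:
    """Try to load every file and return {path: error message} for the failures."""
    failures = {}
    for path in paths:
        try:
            loader(path)
        except Exception as err:  # any exception is treated as a failed load
            failures[path] = '{}: {}'.format(type(err).__name__, err)
    return failures


# toy loader standing in for the real parser (assumption, for illustration only)
def fake_loader(path):
    if 'clo' in path:
        raise ValueError('invalid literal for int() with base 10')
    return path


report = check_ontologies(['go_with_imports.owl', 'clo_with_imports.owl'], fake_loader)
print(report)
```

Collecting the messages in a dictionary makes it easier to file one issue per ontology rather than re-running the loop after every fix.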


The [Cell Line Ontology](http://www.clo-ontology.org/) yields the following error message:

```python

ValueError: invalid literal for int() with base 10: '永生的乳腺衍生细胞系细胞'

...

OwlReadyOntologyParsingError: RDF/XML parsing error in file ./resources/knowledge_graphs/PheKnowLator_MergedOntologies.owl, line 2363344, column 99.
```

This tells us that we need to repair the triple containing the Literal '永生的乳腺衍生细胞系细胞' by removing it and re-adding it typed as a `string` rather than as an `int`, which is how it is currently defined. 

<br>

This is currently noted as an issue in the [Cell Line Ontology's](http://www.clo-ontology.org/) GitHub repo ([issue #45](https://github.com/CLO-ontology/CLO/issues/48)). 
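
The `ValueError` arises because the literal is typed as an integer while its lexical form is not numeric. The sketch below reproduces that check in plain Python. `choose_datatype` is a hypothetical helper, and the declared datatype is assumed to be `xsd:integer` purely for illustration.

```python
XSD = 'http://www.w3.org/2001/XMLSchema#'


def choose_datatype(lexical_form: str, declared: str) -> str:
    """Keep the declared datatype if the value parses, otherwise fall back to xsd:string."""
    if declared == XSD + 'integer':
        try:
            int(lexical_form)
        except ValueError:
            return XSD + 'string'
    return declared


print(choose_datatype('42', XSD + 'integer'))
print(choose_datatype('永生的乳腺衍生细胞系细胞', XSD + 'integer'))
```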

In [None]:
for edge in tqdm(graph):
    if '永生的乳腺衍生细胞系细胞' in str(edge[0]) or '永生的乳腺衍生细胞系细胞' in str(edge[2]):
        
        # repair broken triple
        graph.add((edge[0], edge[1], Literal(str(edge[2]), datatype=URIRef('http://www.w3.org/2001/XMLSchema#string'))))
        graph.remove(edge)
        break

# save cleaned up ontology
graph.serialize(destination='./resources/ontologies/clo_with_imports', format='xml')


In [None]:
# try reading in the cleaned ontology again
merged_onto = get_ontology('./resources/ontologies/clo_with_imports').load()


The next errors that are generated are related to punning, specifically that the following OWL object properties had been incorrectly redeclared as OWL annotation properties:

```bash
2020-03-24 16:48:25,458 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0002091 in punning not allowed [Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0002091>)), Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0002091>))]
2020-03-24 16:48:25,460 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/BFO_0000062 in punning not allowed [Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/BFO_0000062>)), Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/BFO_0000062>))]
2020-03-24 16:48:25,460 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/BFO_0000063 in punning not allowed [Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/BFO_0000063>)), Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/BFO_0000063>))]
2020-03-24 16:48:25,460 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0002222 in punning not allowed [Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0002222>)), Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0002222>))]
2020-03-24 16:48:25,460 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0000087 in punning not allowed [Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0000087>)), Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0000087>))]
2020-03-24 16:48:25,460 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0002161 in punning not allowed [Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0002161>)), Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0002161>))]
```

From these messages, we can see that the `AnnotationProperty` redeclarations need to be removed for the following object properties:  
- RO_0002091  
- BFO_0000062  
- BFO_0000063  
- RO_0002222  
- RO_0000087  
- RO_0002161  

<br>

- This is another error caused by the [Cell Line Ontology](http://www.clo-ontology.org/) and has been posted to GitHub ([issue #43](https://github.com/CLO-ontology/CLO/issues/43))  
- We also removed RO_0002161 (the Annotation Property) from GO, UBERON, and HPO.
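
When the log is long, the punned identifiers can be pulled out programmatically instead of being read off by eye. The following is a minimal sketch that assumes the log lines follow the format shown above; `punned_entities` and both regular expressions are illustrative and are not part of `pkt_kg` or OWLTools.

```python
import re

PUNNING = re.compile(r'reuse of entity (\S+) in punning not allowed \[(.*)\]')
DECLARATION = re.compile(r'Declaration\((\w+)\(')


def punned_entities(log_lines):
    """Map each punned IRI to the sorted entity types it was declared as."""
    found = {}
    for line in log_lines:
        match = PUNNING.search(line)
        if match:
            iri, declarations = match.groups()
            found[iri] = sorted(DECLARATION.findall(declarations))
    return found


log = [
    '2020-03-24 16:48:25,458 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: '
    'reuse of entity http://purl.obolibrary.org/obo/RO_0002091 in punning not allowed '
    '[Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0002091>)), '
    'Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0002091>))]',
]
result = punned_entities(log)
print(result)
```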

**Class Identifier Check**  
Check class identifiers to ensure consistency in identifier prefixes. Running this check revealed that two [PRotein Ontology](https://proconsortium.org/) identifiers were mislabeled in the [Vaccine Ontology](http://www.violinet.org/vaccineontology/) (see [this](https://github.com/vaccineontology/VO/issues) GitHub issue).

In [None]:
# find all classes in graph
kg_classes = graph.query(
    """SELECT DISTINCT ?c
           WHERE {?c rdf:type owl:Class . }
           """, initNs={'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
                        'owl': 'http://www.w3.org/2002/07/owl#'}
)
        

In [None]:
# convert results to list of classes and only keep OBO identifiers
class_list = [res[0] for res in tqdm(kg_classes) if isinstance(res[0], URIRef) and 'obo' in str(res[0])]


In [None]:
class_types = []

for cls in class_list:
    class_types.append(cls.split('/')[-1].split('_')[0])
    
set(class_types)
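
A set only shows which prefixes exist. Counting them also shows how often each occurs, which makes rare (and therefore suspicious) prefixes stand out. This is a minimal sketch using made-up URIs; `prefix_counts` is a hypothetical helper and the `PRO_` identifier is an invented example of an inconsistent prefix, not one taken from the Vaccine Ontology.

```python
from collections import Counter


def prefix_counts(class_uris):
    """Count identifier prefixes, e.g. 'GO' for '.../obo/GO_0008150'."""
    return Counter(str(uri).split('/')[-1].split('_')[0] for uri in class_uris)


# toy URIs; the third one is a made-up example of an inconsistent prefix
uris = [
    'http://purl.obolibrary.org/obo/GO_0008150',
    'http://purl.obolibrary.org/obo/PR_000000001',
    'http://purl.obolibrary.org/obo/PRO_000000001',
]
counts = prefix_counts(uris)
print(counts.most_common())
```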


**Remove Obsolete Ontology Classes**  
To make sure that the ontology only contains current information, all obsolete classes and any triples that they participate in are removed from the ontologies.

For the current build, we removed the following number of entities related to or containing a deprecated ontology class:    

Ontology | Classes | Axioms
:--: | :--: | :--:
ChEBI Lite | 18,443 | 73,831  
CLO | 16 | 153  
DOID | 2,431 | 17,760  
GO | 5,610 | 42,840  
HPO | 293 | 1,636 
PW | 42 | 728  
PRO - Human | 8 | 75
SO | 336 | 2,091  
UBERON | 1,558 | 11,072  
VO | 10 | 87  
RO | 2 | 48



_NOTE._ In addition to running the code below, it may also be necessary to check for classes that are a sub-class of `oboInOwl:ObsoleteClass`, as well as any obsolete or deprecated annotations, individuals, or `owl:ObjectProperty`. Please note that `8` genes (i.e. [`HGNC:26619`](https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/26619), [`HGNC:13392`](https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/13392), [`HGNC:31424`](https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/31424), [`HGNC:8103`](https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/8103), 
[`HGNC:25943`](https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/25943), [`HGNC:16957`](https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/16957), [`HGNC:23418`](https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/23418), [`HGNC:32021`](https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/32021)), which are currently obsolete, were removed from the PRO (more details below, but the numbers reflect those changes). Finally, we recommend verifying each ontology using an ontology debugger (i.e. by running a reasoner) to ensure that your changes and edits have not introduced unexpected errors.


In [None]:
# prepare query
deprecated_class_query = prepareQuery(
    """SELECT DISTINCT ?c
       WHERE {?c owl:deprecated true . }
    """, initNs={'owl': OWL})


In [None]:
# remove triples containing deprecated classes
for ont in tqdm(ontology_repository):
    print('\nLoading: {}'.format(ont))
    graph = Graph()
    graph.parse(ont)
    triples_before = len(graph)  # used to report how many triples were removed

    print('Obtaining a list of deprecated classes')
    results = graph.query(deprecated_class_query)

    dep_cls = [res[0] for res in tqdm(results) if isinstance(res[0], URIRef)]   # convert to node list

    print('Removing triples containing deprecated classes')
    for node in tqdm(dep_cls):
        graph.remove((node, None, None))  # remove all triples about node
        graph.remove((None, None, node))  # remove all triples pointing to node

    print('Serializing cleaned ontology to: {}'.format(ont[:-4] + '_clean_test.owl'))
    graph.serialize(destination=ont[:-4] + '_clean_test.owl', format='xml')

    print('Removed {} deprecated classes and {} triples.'.format(len(dep_cls), triples_before - len(graph)))
    del graph
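
The number of deprecated classes and the number of triples removed are different quantities, which is why the summary table reports two columns. The sketch below illustrates the removal step on a toy graph held as a plain set of `(subject, predicate, object)` tuples; the identifiers are made up and `remove_nodes` is a hypothetical helper, not an rdflib function.

```python
def remove_nodes(triples, nodes):
    """Return the triples that do not mention any of `nodes` as subject or object."""
    nodes = set(nodes)
    return {t for t in triples if t[0] not in nodes and t[2] not in nodes}


# toy graph: (subject, predicate, object) tuples with made-up identifiers
triples = {
    ('ex:A', 'rdfs:subClassOf', 'ex:B'),
    ('ex:B', 'owl:deprecated', 'true'),
    ('ex:C', 'rdfs:subClassOf', 'ex:A'),
}
deprecated = {s for s, p, o in triples if p == 'owl:deprecated' and o == 'true'}
cleaned = remove_nodes(triples, deprecated)
print(len(deprecated), 'classes,', len(triples) - len(cleaned), 'triples removed')
```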


<br><br>

***
### Merge Ontologies <a class="anchor" id="merge-ontologies"></a>

**Purpose:** In this step, the `OWLTools` library is used to merge a directory of ontology files into a single ontology file. This merged ontology file is required as input to the knowledge graph build algorithm.  

**Inputs:** A directory of ontology files (`.owl`)  
**Outputs:** [`PheKnowLator_MergedOntologies.owl`](https://www.dropbox.com/s/1lhh4hdwbjzds74/PheKnowLator_MergedOntologiesGeneID_Normalized_Cleaned.owl?dl=1)

<br>

*Merged Ontology File Punning Update:* Consistent with the solution described [here](https://github.com/oborel/obo-relations/issues/130), we removed all `AnnotationProperty` declarations from the merged ontology file. The Annotation Properties for each of the Object Properties listed above were removed using Protégé.

In [None]:
# verify there are ontology files in ontology repo
for ont in ontology_repository:
    print(ont)


In [None]:
# merge ontologies
if write_location + merged_ontology_file in glob.glob(write_location + '/*.owl'):
    graph = Graph()
    graph.parse(write_location + merged_ontology_file)
    gets_ontology_statistics(write_location + merged_ontology_file)
else:
    merges_ontologies(ontology_repository, write_location, merged_ontology_file)


<br><br>

***

### Normalize Classes <a class="anchor" id="normalize-classes"></a>

**Purpose:** The goal of this section is to check the cleaned merged ontology file to ensure that the existing classes are consistent with one another. To do this, we check the following two things:  
- <u>Aligning Existing Ontology Classes</u>: For this check, we want to make sure that all classes that represent the same entity are connected to each other. For example, consider the following:  
    - Ontologies: [Sequence Ontology](http://www.sequenceontology.org/), [ChEBI](https://www.ebi.ac.uk/chebi), and [PRotein Ontology](https://proconsortium.org/) all include terms for protein, but none of these classes are connected to each other. 
    
    
- <u>Aligning Ontology Classes and New Edge Data</u>: For this check, we want to make sure that any of the existing ontology classes can be aligned with any of the new data entities that we want to add to the knowledge graph. For example:  
  - Gene Classes: there are several gene classes that use [HGNC](https://www.genenames.org/) identifiers. We also want to add genes, but prefer to use [Entrez gene](https://www.ncbi.nlm.nih.gov/gene) identifiers. In order to be used with our data, we must first normalize all of the HGNC gene classes to Entrez gene identifiers.
  
<br>

**Dependencies:** The Merged Gene, RNA, Protein Map ([`Merged_gene_rna_protein_identifiers.pkl`](https://www.dropbox.com/s/6idnt7b3i322hlh/Merged_gene_rna_protein_identifiers.pkl?dl=1)) we generated in order to map genomic identifier data sources.

<br>

**Aligning Existing Ontology Classes**

The following classes occur across the ontologies used in the current build and have to be resolved:  

- Gene: [VO](http://purl.obolibrary.org/obo/OGG_0000000002)  
  - <u>Solution</u>: Make the VO imported OGG class a subclass of the SO gene term  

- Protein: [SO](http://purl.obolibrary.org/obo/SO_0000104), [PRO](http://purl.obolibrary.org/obo/PR_000000001), [ChEBI](http://purl.obolibrary.org/obo/CHEBI_36080) 
  - <u>Solution</u>: Make the ChEBI and PRO classes a subclass of the SO protein term  
  
- Disorder: [VO](http://purl.obolibrary.org/obo/OGMS_0000045)  
  - <u>Solution</u>: Make the VO imported OGMS class a subclass of the DOID disease term  

- Antigen: [VO](http://purl.obolibrary.org/obo/OBI_1110034)  
  - <u>Solution</u>: Make the VO imported OBI class a subclass of the CHEBI antigen term  

- Gelatin: [VO](http://purl.obolibrary.org/obo/VO_0003030) 
  - <u>Solution</u>: Make the VO class a subclass of the CHEBI gelatin term 

- Hormone: [VO](http://purl.obolibrary.org/obo/FMA_12278) 
  - <u>Solution</u>: Make the VO imported FMA class a subclass of the CHEBI hormone term
  

In [None]:
# fix gene class inconsistencies
graph.add((URIRef('http://purl.obolibrary.org/obo/OGG_0000000002'),
           URIRef('http://www.w3.org/2000/01/rdf-schema#subClassOf'),
           URIRef('http://purl.obolibrary.org/obo/SO_0000704')))

# fix protein class inconsistencies
graph.add((URIRef('http://purl.obolibrary.org/obo/PR_000000001'),
           URIRef('http://www.w3.org/2000/01/rdf-schema#subClassOf'),
           URIRef('http://purl.obolibrary.org/obo/SO_0000104')))

graph.add((URIRef('http://purl.obolibrary.org/obo/CHEBI_36080'),
           URIRef('http://www.w3.org/2000/01/rdf-schema#subClassOf'),
           URIRef('http://purl.obolibrary.org/obo/SO_0000104')))

# fix disorder class inconsistencies
graph.add((URIRef('http://purl.obolibrary.org/obo/OGMS_0000045'),
           URIRef('http://www.w3.org/2000/01/rdf-schema#subClassOf'),
           URIRef('http://purl.obolibrary.org/obo/DOID_4')))

# fix antigen class inconsistencies
graph.add((URIRef('http://purl.obolibrary.org/obo/OBI_1110034'),
           URIRef('http://www.w3.org/2000/01/rdf-schema#subClassOf'),
           URIRef('http://purl.obolibrary.org/obo/CHEBI_59132')))

# fix gelatin class inconsistencies
graph.add((URIRef('http://purl.obolibrary.org/obo/VO_0003030'),
           URIRef('http://www.w3.org/2000/01/rdf-schema#subClassOf'),
           URIRef('http://purl.obolibrary.org/obo/CHEBI_5291')))

# fix hormone class inconsistencies
graph.add((URIRef('http://purl.obolibrary.org/obo/FMA_12278'),
           URIRef('http://www.w3.org/2000/01/rdf-schema#subClassOf'),
           URIRef('http://purl.obolibrary.org/obo/CHEBI_24621')))
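
The alignments can also be kept as data, so that the list of pairs is easy to review and extend. Below is a minimal sketch in plain Python: the identifier pairs are copied from the cell above, while `ALIGNMENTS` and `alignment_triples` are names introduced here for illustration.

```python
OBO = 'http://purl.obolibrary.org/obo/'
SUBCLASS_OF = 'http://www.w3.org/2000/01/rdf-schema#subClassOf'

# (child, parent) identifier pairs copied from the cell above
ALIGNMENTS = [
    ('OGG_0000000002', 'SO_0000704'),   # gene
    ('PR_000000001', 'SO_0000104'),     # protein
    ('CHEBI_36080', 'SO_0000104'),      # protein
    ('OGMS_0000045', 'DOID_4'),         # disorder
    ('OBI_1110034', 'CHEBI_59132'),     # antigen
    ('VO_0003030', 'CHEBI_5291'),       # gelatin
    ('FMA_12278', 'CHEBI_24621'),       # hormone
]


def alignment_triples(pairs):
    """Expand identifier pairs into full (subject, predicate, object) IRI triples."""
    return [(OBO + child, SUBCLASS_OF, OBO + parent) for child, parent in pairs]


new_triples = alignment_triples(ALIGNMENTS)
for triple in new_triples:
    print(triple)
```

With rdflib, each element of a tuple would be wrapped in `URIRef` before the triple is passed to `graph.add`.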
                        

<br>

**Aligning Ontology Classes and New Edge Data**  
The first step to normalizing ontology classes with multiple identifiers is to query the ontology and obtain all classes that are not part of the [Open Biomedical Ontology](http://www.obofoundry.org/) namespace.

For the current build, the primary focus of this task is to convert all classes that reference an HGNC gene (`n=19,820`) to an Entrez identifier. To do this, we will utilize the genomic identifier mapping information ([`Merged_gene_rna_protein_identifiers.pkl`](https://www.dropbox.com/s/6idnt7b3i322hlh/Merged_gene_rna_protein_identifiers.pkl?dl=1)) we constructed in the [`Data_Preparation.ipynb`](https://github.com/callahantiff/PheKnowLator/blob/master/Data_Preparation.ipynb) Jupyter notebook. Note that we are only updating identifiers and not verifying labels or other metadata.

In [None]:
# find all classes in graph
kg_classes = graph.query(
    """SELECT DISTINCT ?c
           WHERE {?c rdf:type owl:Class . }
           """, initNs={'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
                        'owl': 'http://www.w3.org/2002/07/owl#'}
)

# convert results to list of classes and only keep hgnc identifiers
class_list_gene = [res[0] for res in tqdm(kg_classes) if isinstance(res[0], URIRef) and 'hgnc' in str(res[0])]


In [None]:
# load genomic identifier mapping dictionary
genomic_id_map = pickle.load(open('resources/processed_data/Merged_gene_rna_protein_identifiers.pkl', 'rb'), encoding='bytes')


In [None]:
# loop over each gene class and get entrez gene id equivalent
matches, not_matched = {}, []
gene_url = 'https://www.ncbi.nlm.nih.gov/gene/'

for gene_class in tqdm(class_list_gene):
    key = 'hgnc_id_' + str(gene_class).split('=')[-1]
    
    if key in genomic_id_map.keys() and any(x for x in genomic_id_map[key] if x.startswith('entrez_id')):        
        matches[str(gene_class)] = [gene_url + x.split('_')[-1] for x in genomic_id_map[key] if 'entrez_id' in x]
    else:
        not_matched.append(gene_class)
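
To make the lookup easier to follow, the same logic is sketched here on toy data. The key format (`hgnc_id_…`, `entrez_id_…`) follows the cell above; the identifiers in `toy_map`, the `symbol_ABC` entry, and the function name `map_hgnc_classes` are made up for illustration and do not come from the real mapping file.

```python
def map_hgnc_classes(class_uris, id_map, gene_url='https://www.ncbi.nlm.nih.gov/gene/'):
    """Split HGNC class URIs into those with Entrez matches and those without."""
    matches, not_matched = {}, []
    for uri in class_uris:
        key = 'hgnc_id_' + uri.split('=')[-1]
        entrez = [x.split('_')[-1] for x in id_map.get(key, []) if x.startswith('entrez_id')]
        if entrez:
            matches[uri] = [gene_url + e for e in entrez]
        else:
            not_matched.append(uri)
    return matches, not_matched


# toy mapping with made-up identifiers (the real dictionary is loaded from the pickle file)
toy_map = {'hgnc_id_1': {'entrez_id_100', 'symbol_ABC'}}
base = 'http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id='
toy_matches, toy_not_matched = map_hgnc_classes([base + '1', base + '2'], toy_map)
print(toy_matches)
print(toy_not_matched)
```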
        

In [None]:
# print non-matching gene uris
not_matched


_Investigate UnMatched Genes_  
Only 3 of the HGNC genes were not found in our dictionary ([`HGNC:24033`](http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=24033), [`HGNC:31447`](http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=31447), [`HGNC:33870`](http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=33870)). Investigating these revealed that HGNC made these identifiers obsolete and replaced them with new identifiers. Until these terms are updated in the PRO ontology, we have to fix them manually. 

Additionally, there were 8 HGNC ids that have all been withdrawn (i.e. [`HGNC:26619`](https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/26619), [`HGNC:13392`](https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/13392), [`HGNC:31424`](https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/31424), [`HGNC:8103`](https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/8103), 
[`HGNC:25943`](https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/25943), [`HGNC:16957`](https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/16957), [`HGNC:23418`](https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/23418), [`HGNC:32021`](https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/32021)) and for our purposes, will be removed. _Note_. Please verify the output file to ensure that no errors were added when removing these genes (as mentioned in a prior cell, these `8` genes were each deleted from the PRO ontology prior to merging).

This issue has been reported to [PRotein Ontology](https://proconsortium.org/pro.shtml) (see [this](https://github.com/PROconsortium/PRoteinOntology/issues/176) GitHub issue).

In [None]:
# investigate HGNC genes with no mappings to Entrez
not_matched

# update mapping dictionary
gene_url = 'https://www.ncbi.nlm.nih.gov/gene/'
matches['http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=24033'] =  [gene_url + x.split('_')[-1] for x in genomic_id_map['hgnc_id_26545'] if 'entrez_id' in x]
matches['http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=31447'] =  [gene_url + x.split('_')[-1] for x in genomic_id_map['hgnc_id_16932'] if 'entrez_id' in x]
matches['http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=33870'] =  [gene_url + x.split('_')[-1] for x in genomic_id_map['hgnc_id_20667'] if 'entrez_id' in x]


In [None]:
# update gene identifiers in graph (iterate over a snapshot so the graph can be modified safely)
for s, p, o in tqdm(list(graph)):
    if str(s) in matches or str(o) in matches:
        # fall back to the original node when it has no mapped identifiers
        new_subjects = [URIRef(x) for x in matches.get(str(s), [])] or [s]
        new_objects = [URIRef(x) for x in matches.get(str(o), [])] or [o]

        graph.remove((s, p, o))
        for new_s in new_subjects:
            for new_o in new_objects:
                graph.add((new_s, p, new_o))
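
Rewriting identifiers is easier to reason about when a new collection is built instead of changing the one being iterated. The sketch below shows this on a toy set of tuples with made-up identifiers; `remap_triples` is a hypothetical helper. Note that a one-to-many mapping produces one triple per mapped identifier.

```python
def remap_triples(triples, mapping):
    """Replace mapped subjects/objects; one-to-many mappings yield several triples."""
    updated = set()
    for s, p, o in triples:
        for new_s in mapping.get(s, [s]):
            for new_o in mapping.get(o, [o]):
                updated.add((new_s, p, new_o))
    return updated


# toy data with made-up identifiers
toy_triples = {('hgnc:1', 'rdf:type', 'owl:Class'), ('ex:P', 'ex:encoded_by', 'hgnc:1')}
toy_mapping = {'hgnc:1': ['entrez:100', 'entrez:200']}
result = remap_triples(toy_triples, toy_mapping)
print(sorted(result))
```

The same idea applies to an rdflib graph by iterating over `list(graph)`, which takes a snapshot of the triples before any are added or removed.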
    

**Check URIs for Overlapping Ontology and Non-Ontology Class Entities**  
To make sure that any new edges that we add don't inadvertently result in duplicate nodes, we need to verify the URIs used in the merged ontologies with the URIs we plan to use when constructing a PheKnowLator knowledge graph.

From looking into the knowledge graph we identified `20` classes that existed in the new edge list and in the merged ontologies, but that had differing URIs (`http://www.ncbi.nlm.nih.gov/` vs. `https://www.ncbi.nlm.nih.gov/`). Those classes were:  
- `http://www.ncbi.nlm.nih.gov/gene/100129307`
- `http://www.ncbi.nlm.nih.gov/gene/100131107`
- `http://www.ncbi.nlm.nih.gov/gene/101927789`
- `http://www.ncbi.nlm.nih.gov/gene/102723383`
- `http://www.ncbi.nlm.nih.gov/gene/105373297`
- `http://www.ncbi.nlm.nih.gov/gene/107987235`
- `http://www.ncbi.nlm.nih.gov/gene/140606`
- `http://www.ncbi.nlm.nih.gov/gene/157285`
- `http://www.ncbi.nlm.nih.gov/gene/163404`
- `http://www.ncbi.nlm.nih.gov/gene/390928`
- `http://www.ncbi.nlm.nih.gov/gene/392490`
- `http://www.ncbi.nlm.nih.gov/gene/50810`
- `http://www.ncbi.nlm.nih.gov/gene/51714`
- `http://www.ncbi.nlm.nih.gov/gene/54886`
- `http://www.ncbi.nlm.nih.gov/gene/58515`
- `http://www.ncbi.nlm.nih.gov/gene/64748`
- `http://www.ncbi.nlm.nih.gov/gene/79948`
- `http://www.ncbi.nlm.nih.gov/gene/83642`
- `http://www.ncbi.nlm.nih.gov/gene/84717`
- `http://www.ncbi.nlm.nih.gov/gene/9890`
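
A check of this kind can be sketched by comparing URIs with the scheme removed. The example below is a minimal illustration on toy lists; `strip_scheme` and `scheme_mismatches` are hypothetical helpers, and the inputs are not the real edge list.

```python
def strip_scheme(uri):
    """Drop a leading 'http://' or 'https://' so URIs can be compared without it."""
    for scheme in ('https://', 'http://'):
        if uri.startswith(scheme):
            return uri[len(scheme):]
    return uri


def scheme_mismatches(ontology_uris, edge_uris):
    """Return ontology URIs that match an edge-list URI except for the scheme."""
    edge_by_rest = {strip_scheme(u): u for u in edge_uris}
    return sorted(u for u in ontology_uris
                  if strip_scheme(u) in edge_by_rest and edge_by_rest[strip_scheme(u)] != u)


onto = ['http://www.ncbi.nlm.nih.gov/gene/9890', 'http://purl.obolibrary.org/obo/GO_0008150']
edges = ['https://www.ncbi.nlm.nih.gov/gene/9890', 'https://www.ncbi.nlm.nih.gov/gene/1']
mismatched = scheme_mismatches(onto, edges)
print(mismatched)
```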

*Read in Master Edge List*  
The [`master edge list`](https://www.dropbox.com/s/t8sgzd847t1rof4/Master_Edge_List_Dict.json?dl=1) that is created in Step 2 of the `pkt_kg` algorithm is read in and processed into a dictionary.

In [None]:
# read in master edge list
edge_data = json.load(open('./resources/Master_Edge_List_Dict.json', 'r'))

# convert to dictionary
edge_dict = dict()

# iterate over master edges to map each node to the URI prefix(es) it is used with
for k in tqdm(edge_data):
    rel = edge_data[k]['uri']
    for edge in edge_data[k]['edge_list']:
        for x in edge:
            if x in edge_dict.keys():
                edge_dict[x] |= {rel[edge.index(x)]}
            else:
                edge_dict[x] = {rel[edge.index(x)]}
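
The dictionary construction can be written more compactly with `defaultdict` and `zip`. This sketch assumes, based on the loop above, that each entry holds a two-element `uri` list (subject prefix, object prefix) and an `edge_list` of `[subject, object]` pairs. `build_edge_dict` and the toy input are illustrative only.

```python
from collections import defaultdict


def build_edge_dict(edge_data):
    """Map every node identifier to the set of URI prefixes it is paired with."""
    edge_dict = defaultdict(set)
    for entry in edge_data.values():
        for edge in entry['edge_list']:
            # zip pairs each node with the prefix at the same position
            for node, prefix in zip(edge, entry['uri']):
                edge_dict[node].add(prefix)
    return dict(edge_dict)


# toy input with made-up identifiers, shaped like the assumed JSON structure
toy = {'gene-disease': {'uri': ['https://www.ncbi.nlm.nih.gov/gene/', 'http://purl.obolibrary.org/obo/'],
                        'edge_list': [['1', 'DOID_4'], ['2', 'DOID_4']]}}
toy_edge_dict = build_edge_dict(toy)
print(toy_edge_dict)
```

Pairing by position also handles an edge whose two nodes share the same identifier, a case where `edge.index(x)` would return the first position for both.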


*Process Merged Ontologies and Verify New Edge Relations*

In [None]:
old_prefix, new_prefix = 'http://www.ncbi.nlm.nih.gov/gene/', 'https://www.ncbi.nlm.nih.gov/gene/'

# iterate over a snapshot so the graph can be modified safely
for s, p, o in tqdm(list(graph)):
    new_s = URIRef(str(s).replace(old_prefix, new_prefix, 1)) if str(s).startswith(old_prefix) else s
    new_o = URIRef(str(o).replace(old_prefix, new_prefix, 1)) if str(o).startswith(old_prefix) else o

    # update subject and object together so no triple keeps a stale http:// URI
    if (new_s, new_o) != (s, o):
        graph.remove((s, p, o))
        graph.add((new_s, p, new_o))
            

In [None]:
# save normalized ontology
graph.serialize(destination=write_location + merged_ontology_file[:-4] + 'GeneID_Normalized_Cleaned.owl', format='xml')

# apply OWL API formatting to file
ontology_file_formatter(write_location, merged_ontology_file[:-4] + 'GeneID_Normalized_Cleaned.owl')

# get ontology stats
gets_ontology_statistics(write_location + merged_ontology_file[:-4] + 'GeneID_Normalized_Cleaned.owl')
