***
# PheKnowLator - Ontology Cleaning
***
***

**Author:** [TJCallahan](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=callahantiff@gmail.com)  
**GitHub Repository:** [PheKnowLator](https://github.com/callahantiff/PheKnowLator/wiki)  
**Release:** **[v2.0.0](https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0)**
  

## Purpose

This notebook serves as a script to help prepare ontologies prior to being ingested into the knowledge graph build algorithm. This script performs the following steps:  
1. [Clean Ontologies](#clean-ontologies)  
2. [Merge Ontologies](#merge-ontologies)  
3. [Normalize Classes](#normalize-classes)

## Assumptions and Dependencies  
  
**Assumptions:**   
- Knowledge Graph Build Steps 1-2 (i.e. data downloading and master edge list creation) have already been performed  
- Directory of Imported Ontologies ➞ `./resources/ontologies`    
- Processed data write location ➞ `./resources/ontologies`  

**Dependencies:**   
- This notebook utilizes several helper functions, which are stored in the [`kg_utils.py`](https://github.com/callahantiff/PheKnowLator/blob/master/pkt_kg/utils/kg_utils.py) script. Hyperlinks to all downloaded and generated data sources are provided on the [Data Sources](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources) Wiki page as well as within each source subsection of this notebook. All generated data is freely available for download from DropBox. 
- [`OWLTools`](https://github.com/owlcollab/owltools)
 

<br>

_____

### PheKnowLator Build V2.0.0 Ontology Cleaning Summary  
**Date:** `12/01/2020` 
***

The table below is meant to provide a high-level overview of the modifications that we applied to each individual ontology as well as to the merged ontology file for the latest PheKnowLator build.

Ontology | Value Errors | Punning Errors | Class Identifier Errors | Obsolete/Deprecated Classes | Normalization Errors  
:---:    | :---:        | :---:          | :---:                   | :---:                       | :---:  
ChEBI    | 0            | 0              | 0                       | 14/18498                    | X    
CLO      | 1            | 8              | 0                       | 14/2                        | X    
MONDO    | 0            | 0              | 0                       | 1945/2196                   | X    
GO       | 0            | 0              | 0                       | 3121/6302                   | X    
HP       | 0            | 0              | 0                       | 297/348                     | X  
PW       | 0            | 0              | 0                       | 6/42                        | X    
PRO - Human | 0         | 0              | 0                       | 0/0                         | X  
SO       | 0            | 0              | 0                       | 0/0                         | X    
UBERON   | 0            | 0              | 0                       | 1062/1564                   | X   
VO       | 0            | 0              | 2                       | 10/0                        | X    
RO       | 0            | 0              | 0                       | 0/0                         | X  

<br>

***  
## Set-Up Environment
***  

In [1]:
# import needed libraries
import glob
import json
import pickle

from owlready2 import *
from rdflib import Graph, Namespace, URIRef, BNode, Literal
from rdflib.namespace import OWL, RDF, RDFS 
from rdflib.plugins.sparql import prepareQuery

# import script containing helper functions
from pkt_kg.utils import * 

In [26]:
# set up environment variables
ontologies = ['chebi', 'clo', 'ext', 'go', 'hp', 'mondo', 'pro', 'pw', 'ro', 'so', 'vo']
write_location = 'resources/ontologies'
merged_ontology_file = '/PheKnowLator_MergedOntologies.owl'
ontology_repository = glob.glob('*/ontologies/*.owl')

***
## Clean Ontologies <a class="anchor" id="clean-ontologies"></a>
***

**Purpose:** In this step, we read in the ontologies using the [`owlready2`](https://pypi.org/project/Owlready2/) library and use it to indicate the presence of errors in the ontology files. We use this tool because it has strict filters. Using this tool we performed the following checks to clean the ontologies:

* [Value Errors](#value-error)  
* [Punning Errors](#punning-error)  
* [Class Identifier Check](#identifier-check)  
* [Obsolete/Deprecated Classes](#obsolete-classes)  

***

<br>

### Value Errors <a class="anchor" id="value-error"></a>
***

This check utilizes the [`owlready2`](https://pypi.org/project/Owlready2/) library to read in each of the ontologies. This library is strict and will catch a wide variety of value errors. 

#### Findings  
Of the 11 ontologies utilized in the PheKnowLator `v2.0` build, only 1 had a value error. The [Cell Line Ontology](http://www.clo-ontology.org/) yielded the following error message:

```python
ValueError: invalid literal for int() with base 10: '永生的乳腺衍生细胞系细胞'
...
OwlReadyOntologyParsingError: RDF/XML parsing error in file ./resources/knowledge_graphs/PheKnowLator_MergedOntologies.owl, line 2363344, column 99.
```

**Interpretation:** This tells us that we need to repair the triple containing the Literal '永生的乳腺衍生细胞系细胞' by removing it and re-adding it typed as a `string` rather than as an `int`, which is how it is currently defined. 

**Solution:** Retype the edge correctly. This is currently noted as an issue in the [Cell Line Ontology's](http://www.clo-ontology.org/) GitHub repo ([issue #48](https://github.com/CLO-ontology/CLO/issues/48)). 
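
As a rough illustration of the check behind this error, the sketch below flags literals whose declared datatype is an XML Schema integer type but whose lexical form cannot be parsed as an integer. It works on plain `(lexical form, datatype)` pairs rather than on an `rdflib` graph, and the function name, the set of integer datatypes, and the sample data are all assumptions made for the example. Python's `int()` is more permissive than the XSD lexical rules, so this is only an approximation of what a strict parser rejects.

```python
XSD = 'http://www.w3.org/2001/XMLSchema#'
# assumption: the datatypes treated as "integer-like" for this sketch
INTEGER_TYPES = {XSD + 'int', XSD + 'integer', XSD + 'long'}


def find_ill_typed_integers(literals):
    """Return the (lexical form, datatype) pairs typed as integers that int() cannot parse."""
    bad = []
    for lexical, datatype in literals:
        if datatype in INTEGER_TYPES:
            try:
                int(lexical)
            except ValueError:
                bad.append((lexical, datatype))
    return bad


# hypothetical sample: the second entry mirrors the CLO label typed as an integer
sample = [('42', XSD + 'int'),
          ('永生的乳腺衍生细胞系细胞', XSD + 'int'),
          ('plain label', XSD + 'string')]

for lexical, datatype in find_ill_typed_integers(sample):
    print('Ill-typed literal: {!r} declared as {}'.format(lexical, datatype))
```

Each flagged pair is a candidate for retyping as `xsd:string`, which is the repair applied to the Cell Line Ontology in this section.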

In [8]:
for ont in ontology_repository:
    print('Loading: {}'.format(ont))
    load_onto = get_ontology(ont).load()


Loading: resources/ontologies/chebi_lite_with_imports.owl
Loading: resources/ontologies/clo_with_imports.owl
Loading: resources/ontologies/ext_with_imports.owl
Loading: resources/ontologies/go_with_imports.owl
Loading: resources/ontologies/hp_with_imports.owl
Loading: resources/ontologies/human_pro_closed.owl
Loading: resources/ontologies/mondo_with_imports.owl
Loading: resources/ontologies/pw_with_imports.owl
Loading: resources/ontologies/ro_with_imports.owl
Loading: resources/ontologies/so_with_imports.owl
Loading: resources/ontologies/vo_with_imports.owl


**Value Error Repairs**  
Code to fix the Cell Line Ontology *Value Error* is provided below.

In [4]:
# load the graph
graph = Graph().parse(write_location + '/clo_with_imports.owl')

# fix the error
for edge in graph:
    if '永生的乳腺衍生细胞系细胞' in str(edge[0]) or '永生的乳腺衍生细胞系细胞' in str(edge[2]):
        
        # repair broken triple
        graph.add((edge[0], edge[1], Literal(str(edge[2]), datatype=URIRef('http://www.w3.org/2001/XMLSchema#string'))))
        graph.remove(edge)
        break

# save cleaned up ontology
graph.serialize(destination=write_location + '/clo_with_imports.owl', format='xml')
ontology_file_formatter(write_location, '/clo_with_imports.owl', './pkt_kg/libs/owltools')


*** Applying OWL API Formatting to Knowledge Graph OWL File ***


In [24]:
# try reading in the cleaned ontology again -- there should be no errors this time!
clo_onto = get_ontology(write_location + '/clo_with_imports.owl').load()

<br>

### Punning Errors <a class="anchor" id="punning-error"></a>
***

[Punning](https://www.w3.org/2007/OWL/wiki/Punning) or redeclaration errors occur for a few different reasons, but the most prevalent cause observed in the ontologies used in `PheKnowLator` is an `owl:ObjectProperty` being incorrectly redeclared as an `owl:AnnotationProperty`, or an `owl:Class` also being defined as an `owl:ObjectProperty`. To detect these types of errors we currently rely on [OWLTools](https://github.com/owlcollab/owltools) and the [`owlready2`](https://pypi.org/project/Owlready2/) Python library. The easiest way to check for these types of errors is to merge the ontologies together, during which an error message is generated if any errors are found.

#### Findings  
The [Cell Line Ontology](http://www.clo-ontology.org/) had 6 object properties that were illegally redeclared and triggered punning errors. More details regarding these errors are shown below. 

```bash
2020-12-03 20:57:15,616 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0002091 in punning not allowed [Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0002091>)), Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0002091>))]
2020-12-03 20:57:15,619 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/BFO_0000062 in punning not allowed [Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/BFO_0000062>)), Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/BFO_0000062>))]
2020-12-03 20:57:15,620 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/BFO_0000063 in punning not allowed [Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/BFO_0000063>)), Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/BFO_0000063>))]
2020-12-03 20:57:15,620 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0002222 in punning not allowed [Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0002222>)), Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0002222>))]
2020-12-03 20:57:15,620 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0000087 in punning not allowed [Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0000087>)), Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0000087>))]
2020-12-03 20:57:15,620 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0002161 in punning not allowed [Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0002161>)), Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0002161>))]
```

From this message, we can see that the following object properties were also declared as `owl:AnnotationProperty` and need that redeclaration removed: `RO_0002091`, `BFO_0000062`, `BFO_0000063`, `RO_0002222`, `RO_0000087`, `RO_0002161`. There were also 2 entities (i.e. `CLO_0054407` and `CLO_0054409`) declared as both an `owl:Class` and an `owl:ObjectProperty`.

<br>

**Solution:** Consistent with the solution described [here](https://github.com/oborel/obo-relations/issues/130), we removed the `owl:AnnotationProperty` declaration from each redeclared object property. For the entities declared as both a class and an object property, we removed the `owl:Class` declaration. This is currently noted as an issue in the Cell Line Ontology's GitHub repo ([issue #43](https://github.com/CLO-ontology/CLO/issues/43)).

In [7]:
for ont in ontology_repository:
    print('\n\n\nProcessing Ontology: {}'.format(ont))
    graph = Graph().parse(ont)

    # collect declarations up front so the graph is not modified while it is being iterated
    classes = set(graph.subjects(RDF.type, OWL.Class))
    obj_props = set(graph.subjects(RDF.type, OWL.ObjectProperty))
    annot_props = set(graph.subjects(RDF.type, OWL.AnnotationProperty))

    # check for entities defined as classes and object properties
    for s in classes & obj_props:
        print('Punning Error: {} defined as an owl:Class and owl:ObjectProperty'.format(str(s)))
        graph.remove((s, RDF.type, OWL.Class))

    # check for entities defined as object properties and annotation properties
    for s in obj_props & annot_props:
        print('Punning Error: {} defined as an owl:ObjectProperty and owl:AnnotationProperty'.format(str(s)))
        graph.remove((s, RDF.type, OWL.AnnotationProperty))

    # save cleaned up ontology
    graph.serialize(destination=ont, format='xml')
    ontology_file_formatter(write_location, '/' + ont.split('/')[-1], './pkt_kg/libs/owltools')




Processing Ontology: resources/ontologies/chebi_lite_with_imports.owl

*** Applying OWL API Formatting to Knowledge Graph OWL File ***



Processing Ontology: resources/ontologies/clo_with_imports.owl
Punning Error: http://purl.obolibrary.org/obo/CLO_0054407 defined as an owl:Class and owl:ObjectProperty
Punning Error: http://purl.obolibrary.org/obo/BFO_0000062 defined as an owl:ObjectProperty and owl:AnnotationProperty
Punning Error: http://purl.obolibrary.org/obo/CLO_0054409 defined as an owl:Class and owl:ObjectProperty
Punning Error: http://purl.obolibrary.org/obo/BFO_0000063 defined as an owl:ObjectProperty and owl:AnnotationProperty
Punning Error: http://purl.obolibrary.org/obo/RO_0002161 defined as an owl:ObjectProperty and owl:AnnotationProperty
Punning Error: http://purl.obolibrary.org/obo/RO_0002222 defined as an owl:ObjectProperty and owl:AnnotationProperty
Punning Error: http://purl.obolibrary.org/obo/RO_0000087 defined as an owl:ObjectProperty and owl:AnnotationProperty

<br>

### Class Identifier Check  <a class="anchor" id="identifier-check"></a>
***

Check class identifiers to ensure consistency in identifier prefixes. For example, we want to catch incorrectly formatted identifiers, such as occurrences of `PRO_XXXXXXX`, which should be `PR_XXXXXXX`.

#### Findings  
Running this check revealed mislabeling of `18` [pROtein Ontology](https://proconsortium.org/) identifiers in the [Vaccine Ontology](http://www.violinet.org/vaccineontology/) (see [this](https://github.com/vaccineontology/VO/issues/4) GitHub issue).


**Solution:** Reformat the incorrectly formatted class identifiers.
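
The prefix check and the repair can be sketched on plain URI strings, without loading an ontology. The helper names below are hypothetical; the prefix rule (text before the first underscore in the last path segment) follows the expression used in the cell further down, and the two sample URIs are the `PRO_`/`PR_` forms discussed above.

```python
from collections import Counter

OBO = 'http://purl.obolibrary.org/obo/'


def identifier_prefix(uri):
    """Return the text before the first underscore in the last path segment of a URI."""
    return uri.rsplit('/', 1)[-1].split('_')[0]


def prefix_counts(uris):
    """Count how many URIs use each identifier prefix."""
    return Counter(identifier_prefix(u) for u in uris)


def repair_prefix(uri, bad=OBO + 'PRO_', good=OBO + 'PR_'):
    """Rewrite a badly prefixed URI, keeping the underscore and the numeric part."""
    return uri.replace(bad, good)


uris = [OBO + 'PRO_000000001', OBO + 'PR_000000001', OBO + 'VO_0003030']
print(dict(prefix_counts(uris)))
print(repair_prefix(OBO + 'PRO_000000001'))
```

Keeping the underscore in both the `bad` and `good` prefixes matters: replacing `PRO_` with `PR` alone would produce identifiers of the form `PRXXXXXXX`.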


In [9]:
for ont in ontologies:
    print('\n\nProcessing Ontology: {}'.format(ont))
    ont_data = [x for x in ontology_repository if ont.lower() in x][0]
    graph = Graph().parse(ont_data)
    
    # get classes
    kg_classes = set([x for x in graph.subjects(RDF.type, OWL.Class)])
    
    # convert results to list of classes and only keep OBO identifiers
    class_list = [res for res in kg_classes if isinstance(res, URIRef) and 'obo/' in str(res)]

    # print unique identifier types for all classes in each ontology
    print('Unique Identifier Types: {}'.format(', '.join(sorted(set([x.split('/')[-1].split('_')[0] for x in class_list])))))



Processing Ontology: chebi
Unique Identifier Types: CHEBI


Processing Ontology: clo
Unique Identifier Types: BFO, CARO, CHEBI, CL, CLO, CP, DOID, GO, HGNC, IAO, NCBITaxon, NCIT, OBI, OGG, PATO, PR, SO, UBERON, VO


Processing Ontology: ext
Unique Identifier Types: BFO, BSPOTEMP, CARO, CHEBI, CL, CP, D96882F1-8709-49AB-BCA9-772A67EA6C33, EHDAA2, EMAPA, ENVO, FBbt, FMA, GO, NBO, NCBITaxon, PATO, PR, RO, RnorDv, UBERON, UBERON#, UBERONTEMP, ZFA


Processing Ontology: go
Unique Identifier Types: GO


Processing Ontology: hp
Unique Identifier Types: BFO, CARO, CHEBI, CL, CP, ENVO, GO, HP, HsapDv, MOD, MPATH, NBO, NCBITaxon, OBI, PATO, PR, RO, SO, UBERON


Processing Ontology: mondo
Unique Identifier Types: BFO, CARO, CHEBI, CL, CP, ECTO, ENVO, ExO, FOODON, GO, HP, IAO, MF, MFOEM, MFOMD, MONDO, NBO, NCBITaxon, NCIT, OBA, OBI, OGMS, PATO, PCO, PO, PR, RO, SO, UBERON, UMLS, UPHENO


Processing Ontology: pro
Unique Identifier Types: BFO, CHEBI, CL, GO, MOD, NCBITaxon, OBI, PR, SO


Processin

In [15]:
# replace badly formatted identifiers with the correct ones
graph = Graph().parse('resources/ontologies/vo_with_imports.owl')
bad_prefix = 'http://purl.obolibrary.org/obo/PRO_'
good_prefix = 'http://purl.obolibrary.org/obo/PR_'
bad_classes = set()

# iterate over a copy of the triples so the graph can be modified safely
for s, p, o in list(graph):
    new_s, new_o = s, o
    if bad_prefix in str(s):
        new_s = URIRef(str(s).replace(bad_prefix, good_prefix))
        bad_classes.add(str(s))
    if bad_prefix in str(o):
        new_o = URIRef(str(o).replace(bad_prefix, good_prefix))
        bad_classes.add(str(o))
    if (new_s, new_o) != (s, o):
        graph.remove((s, p, o))
        graph.add((new_s, p, new_o))

print('The following classes were updated:\n{}'.format('\n'.join(bad_classes)))

# save cleaned up ontology
graph.serialize(destination='resources/ontologies/vo_with_imports.owl', format='xml') 
ontology_file_formatter(write_location, '/' + 'vo_with_imports.owl', './pkt_kg/libs/owltools')

The following classes were updated:
http://purl.obolibrary.org/obo/PRO_000000001
http://purl.obolibrary.org/obo/PRO_000015399


<br>

### Remove Obsolete and/or Deprecated Classes    <a class="anchor" id="obsolete-classes"></a>
***

To make sure that the ontology only contains current information, all obsolete classes and any triples that they participate in are removed from the ontologies. For build `V2.0`, we removed the following number of entities related to or containing a `deprecated` or `obsolete` ontology class:    

Ontology | Obsolete | Deprecated
:--: | :--: | :--:
ChEBI Lite | 14 | 18498   
CLO | 14 | 2    
MONDO | 1945 | 2196    
GO | 3121 | 6302    
HPO | 297 | 348   
PW | 6 | 42   
PRO - Human | 0 | 0   
SO | 0 | 0   
UBERON | 1062 | 1564    
VO | 10 | 0   
RO | 0 | 0   


_NOTE._ In addition to running the code below, it may also be necessary to check for classes that are a sub-class of `oboInOwl:ObsoleteClass`, as well as any obsolete or deprecated annotations, individuals, or `owl:ObjectProperty`. Finally, we recommend verifying each ontology using an ontology debugger (i.e. by running a reasoner) to ensure that your changes and edits have not introduced unexpected errors.
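
The cell below picks each ontology file with `[x for x in ontology_repository if ont.lower() in x][0]`, which is a substring test against the whole path. Short keys can therefore match text outside the intended file name: `so` occurs in the directory name `resources`, and `ro` occurs in `human_pro_closed.owl`. The sketch below shows one stricter way to pair keys with files by file name only. The function name and the `overrides` parameter are hypothetical and are not part of `pkt_kg`; the file names are taken from the loading output shown earlier.

```python
import os


def match_ontology_file(key, paths, overrides=None):
    """Return the single path whose file name is `<key>_...`, or the override for `key`.

    `overrides` (hypothetical) maps a key to an exact file name for files that
    do not follow the `<key>_...` naming pattern.
    """
    wanted = (overrides or {}).get(key)

    def is_match(path):
        name = os.path.basename(path)
        return name == wanted if wanted else name.startswith(key + '_')

    hits = [p for p in paths if is_match(p)]
    if len(hits) != 1:
        raise ValueError('expected one file for {!r}, found {}'.format(key, hits))
    return hits[0]


paths = ['resources/ontologies/chebi_lite_with_imports.owl',
         'resources/ontologies/human_pro_closed.owl',
         'resources/ontologies/ro_with_imports.owl',
         'resources/ontologies/so_with_imports.owl']

# substring test: "so" matches the directory name, so the first file wins
print([p for p in paths if 'so' in p][0])
# file-name test: only the intended file matches
print(match_ontology_file('so', paths))
print(match_ontology_file('pro', paths, overrides={'pro': 'human_pro_closed.owl'}))
```

Raising an error when zero or several files match makes a bad pairing visible immediately instead of silently processing the wrong ontology.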


In [16]:
# remove triples containing deprecated classes
for ont in ontologies:    
    print('\nLoading: {}'.format(ont))
    ont_switch = ont if ont != 'ext' else 'uberon'
    ont_prefix = 'http://purl.obolibrary.org/obo/' + ont_switch.upper()
    ont_data = [x for x in ontology_repository if ont.lower() in x][0]
    graph = Graph().parse(ont_data)

    # get deprecated classes and triples
    dep_cls = [x[0] for x in list(graph.triples((None, OWL.deprecated, Literal('true', datatype=URIRef('http://www.w3.org/2001/XMLSchema#boolean')))))]
    dep_triples = [(s, p, o) for s, p, o in graph if 'deprecated' in ', '.join([str(s).lower(), str(p).lower(), str(o).lower()]) and str(s).startswith(ont_prefix)]
    oth_dep_triple_classes = [x[0] for x in dep_triples]
    deprecated_classes = set(dep_cls + oth_dep_triple_classes)
    # get obsolete classes and triples
    obs_cls = [x[0] for x in list(graph.triples((None, RDFS.subClassOf, URIRef('http://www.geneontology.org/formats/oboInOwl#ObsoleteClass'))))]
    obs_triples = [(s, p, o) for s, p, o in graph if 'obsolete' in ', '.join([str(s).lower(), str(p).lower(), str(o).lower()]) and str(s).startswith(ont_prefix)]
    oth_obs_triple_classes = [x[0] for x in obs_triples]
    obsolete_classes = set(obs_cls + oth_obs_triple_classes)
    
    # remove deprecated/obsolete classes
    for node in list(deprecated_classes) + list(obsolete_classes):        
        graph.remove((node, None, None))  # remove all triples about node
        graph.remove((None, None, node))  # remove all triples pointing to node
        
    # remove deprecated/obsolete triples
    for triple in dep_triples + obs_triples:        
        graph.remove(triple)

    print('Removed {} obsolete classes and {} deprecated classes\n'.format(len(obsolete_classes), len(deprecated_classes)))
    
    # serialize graph
    graph.serialize(destination=ont_data, format='xml')
    ontology_file_formatter(write_location, '/' + ont_data.split('/')[-1], './pkt_kg/libs/owltools')
    


Loading: chebi
Removed 14 obsolete classes and 18498 deprecated classes


*** Applying OWL API Formatting to Knowledge Graph OWL File ***

Loading: clo
Removed 14 obsolete classes and 2 deprecated classes


*** Applying OWL API Formatting to Knowledge Graph OWL File ***

Loading: ext
Removed 1062 obsolete classes and 1564 deprecated classes


*** Applying OWL API Formatting to Knowledge Graph OWL File ***
None

Loading: go
Removed 3121 obsolete classes and 6302 deprecated classes


*** Applying OWL API Formatting to Knowledge Graph OWL File ***

Loading: hp
Removed 297 obsolete classes and 348 deprecated classes


*** Applying OWL API Formatting to Knowledge Graph OWL File ***

Loading: mondo
Removed 1945 obsolete classes and 2196 deprecated classes


*** Applying OWL API Formatting to Knowledge Graph OWL File ***
None

Loading: pro
Removed 0 obsolete classes and 0 deprecated classes


*** Applying OWL API Formatting to Knowledge Graph OWL File ***

Loading: pw
Removed 6 obsolete clas

<br>

***
## Merge Ontologies <a class="anchor" id="merge-ontologies"></a>
***

**Purpose:** In this step, we use the [`OWLTools`](https://github.com/owlcollab/owltools) library to merge a directory of ontology files into a single ontology file. This merged ontology file is required as input to the knowledge graph build algorithm.  

**Inputs:** A directory of ontology files (`.owl`)  
**Outputs:** [`PheKnowLator_MergedOntologies.owl`](https://www.dropbox.com/s/1lhh4hdwbjzds74/PheKnowLator_MergedOntologiesGeneID_Normalized_Cleaned.owl?dl=1)

#### Findings

The merged set of ontologies contained `366,828` classes, `3,841,408` axioms, `818` object properties, and `151` individuals.

In [28]:
# merge ontologies
if write_location + merged_ontology_file in glob.glob(write_location + '/*.owl'):
    graph = Graph().parse(write_location + merged_ontology_file)
    gets_ontology_statistics(write_location + merged_ontology_file)
else:
    merges_ontologies(ontology_repository, write_location, merged_ontology_file)
    gets_ontology_statistics(write_location + merged_ontology_file)


Merging Ontologies: vo_with_imports.owl, so_with_imports.owl

Merging Ontologies: ro_with_imports.owl, PheKnowLator_MergedOntologies.owl

Merging Ontologies: pw_with_imports.owl, PheKnowLator_MergedOntologies.owl

Merging Ontologies: mondo_with_imports.owl, PheKnowLator_MergedOntologies.owl
None

Merging Ontologies: human_pro_closed.owl, PheKnowLator_MergedOntologies.owl

Merging Ontologies: hp_with_imports.owl, PheKnowLator_MergedOntologies.owl

Merging Ontologies: go_with_imports.owl, PheKnowLator_MergedOntologies.owl

Merging Ontologies: ext_with_imports.owl, PheKnowLator_MergedOntologies.owl
None

Merging Ontologies: clo_with_imports.owl, PheKnowLator_MergedOntologies.owl

Merging Ontologies: chebi_lite_with_imports.owl, PheKnowLator_MergedOntologies.owl

The knowledge graph contains 366828 classes, 3841408 axioms, 818 object properties, and 151 individuals



In [30]:
# use owlready2 to read in merged ontology file
merged_onts = get_ontology(write_location + merged_ontology_file).load()

<br>

***
## Normalize Classes <a class="anchor" id="normalize-classes"></a>
***

**Purpose:** The goal of this section is to check the cleaned, merged ontology file to ensure that there is consistency between the existing classes. To do this, we check two things: (1) [Aligning Existing Ontology Classes](#aligning-existing-ontologies); and (2) [Aligning Ontology Classes and New Edge Data](#aligning-new-data). More details for each type of check are provided below.
  

**Dependencies:** The Merged Gene, RNA, Protein Map ([`Merged_gene_rna_protein_identifiers.pkl`](https://www.dropbox.com/s/6idnt7b3i322hlh/Merged_gene_rna_protein_identifiers.pkl?dl=1)) we generated in order to map genomic identifier data sources.

<br>

### Aligning Existing Ontology Classes <a class="anchor" id="aligning-existing-ontologies"></a>
***

**Purpose:** For this check, we want to make sure that all classes that represent the same entity are connected to each other. For example, consider the following:  
- Ontologies: [Sequence Ontology](http://www.sequenceontology.org/), [ChEBI](https://www.ebi.ac.uk/chebi), and [PRotein Ontology](https://proconsortium.org/) all include terms for protein, but none of these classes are connected to each other. 

#### Findings  
*Normalize Duplicate Ontology Concepts*  
The following classes occur in several of the ontologies used in the current build and have to be normalized so that there are not multiple versions of the same concept:  

- Gene: [VO](http://purl.obolibrary.org/obo/OGG_0000000002)  
  - <u>Solution</u>: Make the `VO` imported `OGG` class a subclass of the `SO` gene term  

- Protein: [SO](http://purl.obolibrary.org/obo/SO_0000104), [PRO](http://purl.obolibrary.org/obo/PR_000000001), [ChEBI](http://purl.obolibrary.org/obo/CHEBI_36080) 
  - <u>Solution</u>: Make the `CHEBI` and `PRO` classes a subclass of the `SO` protein term  
  
- Disorder: [VO](http://purl.obolibrary.org/obo/OGMS_0000045)  
  - <u>Solution</u>: Make the `VO` imported `OGMS` class a subclass of the `MONDO` disease term  

- Antigen: [VO](http://purl.obolibrary.org/obo/OBI_1110034)  
  - <u>Solution</u>: Make the `VO` imported OBI class a subclass of the `CHEBI` antigen term  

- Gelatin: [VO](http://purl.obolibrary.org/obo/VO_0003030) 
  - <u>Solution</u>: Make the `VO` class a subclass of the `CHEBI` gelatin term 

- Hormone: [VO](http://purl.obolibrary.org/obo/FMA_12278) 
  - <u>Solution</u>: Make the `VO` imported `FMA` class a subclass of the `CHEBI` hormone term

*Normalize Existing Ontology Classes*  
`19,820` `HGNC` identifiers were found and need to be converted to `Entrez` gene identifiers, which is the identifier type for genes utilized by the majority of ontologies.


**Solution:**   
*Normalize Duplicate Ontology Concepts*  
Choose a primary concept for each duplicate scenario and make every duplicate concept an `rdfs:subClassOf` the primary concept.

*Normalize Existing Ontology Classes*   
Convert all classes that reference an HGNC gene (`n=19,820`) to an Entrez identifier. To do this, we will utilize the genomic identifier mapping information ([`Merged_gene_rna_protein_identifiers.pkl`](https://www.dropbox.com/s/6idnt7b3i322hlh/Merged_gene_rna_protein_identifiers.pkl?dl=1)) we constructed in the [`Data_Preparation.ipynb`](https://github.com/callahantiff/PheKnowLator/blob/master/Data_Preparation.ipynb) Jupyter notebook. Note that we are only updating identifiers and not verifying labels or other metadata.
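
The lookup behind this conversion can be sketched with a toy mapping. The key format (`hgnc_id_<n>`), the value format (`entrez_id_<n>`), and the two URL patterns follow the cells below; the function name and the identifiers in `toy_map` are made up for illustration and do not come from the real mapping file.

```python
GENE_URL = 'https://www.ncbi.nlm.nih.gov/gene/'
HGNC_URL = 'http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id='


def hgnc_to_entrez(hgnc_uri, id_map):
    """Return the Entrez gene URLs mapped to an HGNC class URI (empty list if none)."""
    key = 'hgnc_id_' + hgnc_uri.split('=')[-1]
    return [GENE_URL + x.split('_')[-1] for x in id_map.get(key, []) if x.startswith('entrez_id')]


# hypothetical mapping entries (not real identifiers)
toy_map = {'hgnc_id_0001': ['entrez_id_9001', 'symbol_FAKE1'],
           'hgnc_id_0002': ['symbol_FAKE2']}

for hgnc_id in ['0001', '0002', '0003']:
    print(hgnc_id, hgnc_to_entrez(HGNC_URL + hgnc_id, toy_map))
```

An empty result covers both cases that need manual follow-up: an HGNC identifier missing from the map, and one present without any Entrez entry.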
  

#### *Load Merged Ontology Data*

In [31]:
# read in merged data
merged_onts = Graph().parse(write_location + merged_ontology_file)

# the cells below edit the merged graph through the name `graph`
graph = merged_onts

**Normalize Duplicate Ontology Concepts**

In [None]:
# fix gene class inconsistencies
graph.add((URIRef('http://purl.obolibrary.org/obo/OGG_0000000002'),
           URIRef('http://www.w3.org/2000/01/rdf-schema#subClassOf'),
           URIRef('http://purl.obolibrary.org/obo/SO_0000704')))

# fix protein class inconsistencies
graph.add((URIRef('http://purl.obolibrary.org/obo/PR_000000001'),
           URIRef('http://www.w3.org/2000/01/rdf-schema#subClassOf'),
           URIRef('http://purl.obolibrary.org/obo/SO_0000104')))

graph.add((URIRef('http://purl.obolibrary.org/obo/CHEBI_36080'),
           URIRef('http://www.w3.org/2000/01/rdf-schema#subClassOf'),
           URIRef('http://purl.obolibrary.org/obo/SO_0000104')))

# fix disorder class inconsistencies
graph.add((URIRef('http://purl.obolibrary.org/obo/OGMS_0000045'),
           URIRef('http://www.w3.org/2000/01/rdf-schema#subClassOf'),
           URIRef('http://purl.obolibrary.org/obo/DOID_4')))

# fix antigen class inconsistencies
graph.add((URIRef('http://purl.obolibrary.org/obo/OBI_1110034'),
           URIRef('http://www.w3.org/2000/01/rdf-schema#subClassOf'),
           URIRef('http://purl.obolibrary.org/obo/CHEBI_59132')))

# fix gelatin class inconsistencies
graph.add((URIRef('http://purl.obolibrary.org/obo/VO_0003030'),
           URIRef('http://www.w3.org/2000/01/rdf-schema#subClassOf'),
           URIRef('http://purl.obolibrary.org/obo/CHEBI_5291')))

# fix hormone class inconsistencies
graph.add((URIRef('http://purl.obolibrary.org/obo/FMA_12278'),
           URIRef('http://www.w3.org/2000/01/rdf-schema#subClassOf'),
           URIRef('http://purl.obolibrary.org/obo/CHEBI_24621')))
                        

**Normalize Existing Ontology Classes**

In [None]:
# find all classes in graph
kg_classes = merged_onts.query(
    """SELECT DISTINCT ?c
           WHERE {?c rdf:type owl:Class . }
           """, initNs={'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
                        'owl': 'http://www.w3.org/2002/07/owl#'}
)

# convert results to list of classes and only keep hgnc identifiers
class_list_gene = [res[0] for res in kg_classes if isinstance(res[0], URIRef) and 'hgnc' in str(res[0])]


In [None]:
# load genomic identifier mapping dictionary
genomic_id_map = pickle.load(open('resources/processed_data/Merged_gene_rna_protein_identifiers.pkl', 'rb'), encoding='bytes')


In [None]:
# loop over each gene class and get entrez gene id equivalent
matches, not_matched = {}, []
gene_url = 'https://www.ncbi.nlm.nih.gov/gene/'

for gene_class in class_list_gene:
    key = 'hgnc_id_' + str(gene_class).split('=')[-1]
    
    if key in genomic_id_map.keys() and any(x for x in genomic_id_map[key] if x.startswith('entrez_id')):        
        matches[str(gene_class)] = [gene_url + x.split('_')[-1] for x in genomic_id_map[key] if 'entrez_id' in x]
    else:
        not_matched.append(gene_class)
        

In [None]:
# print non-matching gene uris
not_matched


**Investigate UnMatched Genes**  
Only 3 of the HGNC genes were not found in our dictionary ([`HGNC:24033`](http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=24033), [`HGNC:31447`](http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=31447), [`HGNC:33870`](http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=33870)). Investigating these revealed that HGNC made these identifiers obsolete and replaced them with new identifiers. Until these terms are updated in the PRO ontology, we have to make the same fix manually. 

Additionally, there were 8 HGNC ids that have all been withdrawn (i.e. [`HGNC:26619`](https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/26619), [`HGNC:13392`](https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/13392), [`HGNC:31424`](https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/31424), [`HGNC:8103`](https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/8103), 
[`HGNC:25943`](https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/25943), [`HGNC:16957`](https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/16957), [`HGNC:23418`](https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/23418), [`HGNC:32021`](https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/32021)) and for our purposes, will be removed. _Note_. Please verify the output file to ensure that no errors were added when removing these genes (as mentioned in a prior cell, these `8` genes were each deleted from the PRO ontology prior to merging).

This issue has been reported to [PRotein Ontology](https://proconsortium.org/pro.shtml) (see [this](https://github.com/PROconsortium/PRoteinOntology/issues/176) GitHub issue).

In [None]:
# investigate HGNC genes with no mappings to Entrez
not_matched

# update mapping dictionary
gene_url = 'https://www.ncbi.nlm.nih.gov/gene/'
matches['http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=24033'] =  [gene_url + x.split('_')[-1] for x in genomic_id_map['hgnc_id_26545'] if 'entrez_id' in x]
matches['http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=31447'] =  [gene_url + x.split('_')[-1] for x in genomic_id_map['hgnc_id_16932'] if 'entrez_id' in x]
matches['http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=33870'] =  [gene_url + x.split('_')[-1] for x in genomic_id_map['hgnc_id_20667'] if 'entrez_id' in x]


In [None]:
# update gene identifiers in graph (iterate over a copy so the graph can be modified safely)
for s, p, o in list(graph):
    new_subjs = [URIRef(x) for x in matches.get(str(s), [])] or [s]
    new_objs = [URIRef(x) for x in matches.get(str(o), [])] or [o]
    if new_subjs != [s] or new_objs != [o]:
        graph.remove((s, p, o))
        # add one triple per mapped identifier on either side
        for subj in new_subjs:
            for obj in new_objs:
                graph.add((subj, p, obj))
    

### Aligning Existing Ontology Classes and New Edge Data <a class="anchor" id="aligning-new-data"></a>
***

**Purpose:** For this check, we want to make sure that any of the existing ontology classes can be aligned with any of the new data entities that we want to add to the knowledge graph. For example:  
  - Gene Classes: there are several gene classes that use [HGNC](https://www.genenames.org/) identifiers. We also want to add genes, but prefer to use [Entrez gene](https://www.ncbi.nlm.nih.gov/gene) identifiers. In order to be used with our data, we must first normalize all of the HPO gene classes to Entrez gene identifiers.

#### Findings  
From looking into the knowledge graph we identified `20` classes that existed in the new edge list and in the merged ontologies, but that had differing URIs (`http://www.ncbi.nlm.nih.gov/` vs. `https://www.ncbi.nlm.nih.gov/`). Those classes were:  
- `http://www.ncbi.nlm.nih.gov/gene/100129307`
- `http://www.ncbi.nlm.nih.gov/gene/100131107`
- `http://www.ncbi.nlm.nih.gov/gene/101927789`
- `http://www.ncbi.nlm.nih.gov/gene/102723383`
- `http://www.ncbi.nlm.nih.gov/gene/105373297`
- `http://www.ncbi.nlm.nih.gov/gene/107987235`
- `http://www.ncbi.nlm.nih.gov/gene/140606`
- `http://www.ncbi.nlm.nih.gov/gene/157285`
- `http://www.ncbi.nlm.nih.gov/gene/163404`
- `http://www.ncbi.nlm.nih.gov/gene/390928`
- `http://www.ncbi.nlm.nih.gov/gene/392490`
- `http://www.ncbi.nlm.nih.gov/gene/50810`
- `http://www.ncbi.nlm.nih.gov/gene/51714`
- `http://www.ncbi.nlm.nih.gov/gene/54886`
- `http://www.ncbi.nlm.nih.gov/gene/58515`
- `http://www.ncbi.nlm.nih.gov/gene/64748`
- `http://www.ncbi.nlm.nih.gov/gene/79948`
- `http://www.ncbi.nlm.nih.gov/gene/83642`
- `http://www.ncbi.nlm.nih.gov/gene/84717`
- `http://www.ncbi.nlm.nih.gov/gene/9890`

**Solution:** Replace the occurrences of `HTTP` URLs with `HTTPS`.  
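
One way such scheme-only mismatches could be found (not necessarily how the list above was produced) is to compare the two sets of URIs after stripping the scheme. The function names and the small sample sets below are assumptions made for the example.

```python
def strip_scheme(uri):
    """Drop a leading 'http://' or 'https://' so URIs can be compared without the scheme."""
    for scheme in ('https://', 'http://'):
        if uri.startswith(scheme):
            return uri[len(scheme):]
    return uri


def scheme_mismatches(ontology_uris, edge_list_uris):
    """Return sorted (ontology URI, edge list URI) pairs that differ only in scheme."""
    by_stripped = {strip_scheme(u): u for u in edge_list_uris}
    pairs = []
    for uri in ontology_uris:
        other = by_stripped.get(strip_scheme(uri))
        if other is not None and other != uri:
            pairs.append((uri, other))
    return sorted(pairs)


# hypothetical samples: one shared gene with different schemes, two unrelated URIs
ontology_uris = {'http://www.ncbi.nlm.nih.gov/gene/9890',
                 'http://purl.obolibrary.org/obo/SO_0000704'}
edge_list_uris = {'https://www.ncbi.nlm.nih.gov/gene/9890',
                  'https://www.ncbi.nlm.nih.gov/gene/51714'}

for old, new in scheme_mismatches(ontology_uris, edge_list_uris):
    print('{} -> {}'.format(old, new))
```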

**Read in Master Edge List**  
The [`master edge list`](https://www.dropbox.com/s/t8sgzd847t1rof4/Master_Edge_List_Dict.json?dl=1) that is created in Step 2 of the `pkt_kg` algorithm is read in and processed into a dictionary.

In [None]:
# read in master edge list
edge_data = json.load(open('./resources/Master_Edge_List_Dict.json', 'r'))

# convert to dictionary
edge_dict = dict()

# iterate over master edges, mapping each node to the set of URIs it is paired with
for k in edge_data:
    rel = edge_data[k]['uri']
    for edge in edge_data[k]['edge_list']:
        # use the position within the edge so repeated nodes get the right URI
        for i, x in enumerate(edge):
            edge_dict.setdefault(x, set()).add(rel[i])


*Process merged Ontologies and Verify New Edge Relations*

In [None]:
old_url, new_url = 'http://www.ncbi.nlm.nih.gov/gene/', 'https://www.ncbi.nlm.nih.gov/gene/'

# iterate over a copy of the triples so the graph can be modified safely
for s, p, o in list(graph):
    new_s = URIRef(new_url + str(s)[len(old_url):]) if str(s).startswith(old_url) else s
    new_o = URIRef(new_url + str(o)[len(old_url):]) if str(o).startswith(old_url) else o
    if (new_s, new_o) != (s, o):
        graph.remove((s, p, o))
        graph.add((new_s, p, new_o))
            

In [None]:
# save normalized ontology
graph.serialize(destination=write_location + merged_ontology_file[:-4] + 'GeneID_Normalized_Cleaned.owl', format='xml')

# apply OWL API formatting to file
ontology_file_formatter(write_location, merged_ontology_file[:-4] + 'GeneID_Normalized_Cleaned.owl')

# get ontology stats
gets_ontology_statistics(write_location + merged_ontology_file[:-4] + 'GeneID_Normalized_Cleaned.owl')
