***
# PheKnowLator - Ontology Cleaning
***
***

**Author:** [TJCallahan](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=callahantiff@gmail.com)  
**GitHub Repository:** [PheKnowLator](https://github.com/callahantiff/PheKnowLator/wiki)  
**Release:** **[v2.0.0](https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0)**
  

## Purpose

This notebook serves as a script to help prepare ontologies prior to being ingested into the knowledge graph build algorithm. This script performs the following steps:  
1. [Clean Ontologies](#clean-ontologies)  
2. [Merge Ontologies](#merge-ontologies)  
3. [Normalize Classes](#normalize-classes)

## Assumptions and Dependencies  
  
**Assumptions:**   
- Knowledge Graph Build Steps 1-2 (i.e. data downloading and master edge list creation) have already been performed  
- Directory of Imported Ontologies ➞ `./resources/ontologies`    
- Processed data write location ➞ `./resources/ontologies`  

**Dependencies:**   
- <u>Scripts</u>: This notebook utilizes several helper functions, which are stored in the [`kg_utils.py`](https://github.com/callahantiff/PheKnowLator/blob/master/pkt_kg/utils/kg_utils.py) script.
- <u>Software</u>: [`OWLTools`](https://github.com/owlcollab/owltools)  
- <u>Data</u>: Details on the data utilized in this script can be found on the [Data Sources](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources) Wiki. Data can be downloaded from [this](https://console.cloud.google.com/storage/browser/pheknowlator/release_v2.0.0?project=pheknowlator) dedicated Google Cloud Storage Bucket. Please note that all build data are freely available and organized by release and build date. 
 

<br>

***  
## Set-Up Environment
***  

In [None]:
# import needed libraries
import glob
import os
import pickle
import re

from owlready2 import *
from rdflib import Graph, Namespace, URIRef, BNode, Literal
from rdflib.namespace import OWL, RDF, RDFS 
from rdflib.plugins.sparql import prepareQuery
from tqdm import tqdm

# import script containing helper functions
from pkt_kg.utils import * 

In [None]:
# set up environment variables
write_location = 'resources/ontologies'
merged_ontology_file = '/PheKnowLator_MergedOntologies.owl'
ontology_repository = glob.glob('*/ontologies/*.owl')
processed_data_location = 'resources/processed_data/'

# set global namespaces
schema = Namespace('http://www.w3.org/2001/XMLSchema#')
obo = Namespace('http://purl.obolibrary.org/obo/')
oboinowl = Namespace('http://www.geneontology.org/formats/oboInOwl#')

***
## Clean Ontologies <a class="anchor" id="clean-ontologies"></a>
***

**Purpose:** In this step, we read in the ontologies using the [`owlready2`](https://pypi.org/project/Owlready2/) library and use it to indicate the presence of errors in the ontology files. We use this tool because it has strict filters. Using this tool we performed the following checks to clean the ontologies:

* [Value Errors](#value-error)  
* [Punning Errors](#punning-error)  
* [Double-Typed Classes](#double-typed)  
* [Class Identifier Check](#identifier-check)  
* [Obsolete/Deprecated Classes](#obsolete-classes)  

***

<br>

### Value Errors <a class="anchor" id="value-error"></a>
***

This check utilizes the [`owlready2`](https://pypi.org/project/Owlready2/) library to read in each of the ontologies. This library is strict and will catch a wide variety of value errors. Should any of these errors arise, we retype the edge correctly. 

#### Example Findings  
The [Cell Line Ontology](http://www.clo-ontology.org/) yields the following error message:

```python
ValueError: invalid literal for int() with base 10: '永生的乳腺衍生细胞系细胞'
...
OwlReadyOntologyParsingError: RDF/XML parsing error in file clo_with_imports.owl, line 10970, column 99.
```

This tells us that we need to repair the triple containing the Literal '永生的乳腺衍生细胞系细胞' by removing it and re-adding it typed as a `string` rather than as an `int`, which is how it is currently defined. This is currently noted as an issue in the [Cell Line Ontology's](http://www.clo-ontology.org/) GitHub repo ([issue #48](https://github.com/CLO-ontology/CLO/issues/48)). 

In [None]:
errors = {x: {} for x in ontology_repository}
for ont in ontology_repository:
    print('Loading: {}'.format(ont))
    try: load_onto = get_ontology(ont).load()
    except OwlReadyOntologyParsingError as e: errors[ont]['OwlReadyOntologyParsingError'] = str(e)
    except KeyError as e: errors[ont]['KeyError'] = str(e)
    except TypeError as e: errors[ont]['PunningError'] = str(e)

**Value Error Repairs**  
Code to fix the Cell Line Ontology *Value Error* is provided below.

In [None]:
for key in errors.keys():
    if 'OwlReadyOntologyParsingError' in errors[key].keys():
        error_key = 'OwlReadyOntologyParsingError'
        line_num = int(re.findall(r'(?<=line\s).*(?=,)', str(errors[key][error_key]))[0]) - 1
        raw_data, graph = open(key).readlines(), Graph().parse(key)
       
        # obtain bad string and triple -- assuming for now the errors are mistyped string errors
        bad_content = re.findall(r'(?<=\>).*(?=\<)', raw_data[line_num])[0]      
        bad_triple = [x for x in graph if bad_content in str(x[0]) or bad_content in str(x[2])]
        for e in bad_triple:
            graph.add((e[0], e[1], Literal(str(e[2]), datatype=schema.string)))
            graph.remove(e)
        
        # save cleaned up ontology
        filename = '/' + key.split('/')[-1]
        graph.serialize(destination=key, format='xml')
        ontology_file_formatter(write_location, filename, './pkt_kg/libs/owltools')
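
The repair above depends on recovering the failing line number from the parser's error message. A minimal sketch of that lookup in isolation is shown below, using only the standard library; the helper name `parse_error_location` is hypothetical and is not part of `pkt_kg` or `owlready2`, and the message format is assumed to match the example shown earlier.

```python
import re

def parse_error_location(message):
    """Return (line, column) parsed from a parser error message, or None if absent."""
    # Require digits explicitly so unrelated commas in the message cannot be captured.
    match = re.search(r'line\s+(\d+),\s*column\s+(\d+)', message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# message text copied from the Cell Line Ontology example above
msg = 'RDF/XML parsing error in file clo_with_imports.owl, line 10970, column 99.'
print(parse_error_location(msg))
```

Matching the digits directly avoids depending on how many commas the full message contains.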

<br>

***
## Merge Ontologies <a class="anchor" id="merge-ontologies"></a>
***

**Purpose:** In this step, the [`OWLTools`](https://github.com/owlcollab/owltools) library is used to merge a directory of ontology files into a single ontology file. This merged ontology file is required as input to the knowledge graph build algorithm.  

**Inputs:** A directory of ontology files (`.owl`)

**Outputs:** [`PheKnowLator_MergedOntologies.owl`](https://www.dropbox.com/s/1lhh4hdwbjzds74/PheKnowLator_MergedOntologiesGeneID_Normalized_Cleaned.owl?dl=1)


In [None]:
# merge ontologies
if write_location + merged_ontology_file in glob.glob(write_location + '/*.owl'):
    graph = Graph().parse(write_location + merged_ontology_file)
    gets_ontology_statistics(write_location + merged_ontology_file)
else:
    merges_ontologies(ontology_repository, write_location, merged_ontology_file)
    gets_ontology_statistics(write_location + merged_ontology_file)

**Load Merged Ontology Data**

In [None]:
# read in merged data
merged_onts = Graph().parse(write_location + merged_ontology_file)

<br>

### Punning Errors <a class="anchor" id="punning-error"></a>
***

[Punning](https://www.w3.org/2007/OWL/wiki/Punning) or redeclaration errors occur for a few different reasons, but the most prevalent cause observed in the ontologies used in `PheKnowLator` is an `owl:ObjectProperty` being incorrectly redeclared as an `owl:AnnotationProperty`, or an `owl:Class` also being defined as an `owl:ObjectProperty`. Consistent with the solution described [here](https://github.com/oborel/obo-relations/issues/130), for `owl:ObjectProperty` redeclarations we remove all `owl:AnnotationProperty` declarations. For all `owl:Class` redeclarations, we remove all `owl:ObjectProperty` declarations.

#### Example Findings  
The [Cell Line Ontology](http://www.clo-ontology.org/) had 7 object properties that were illegally redeclared and triggered punning errors. More details regarding these errors are shown below. 

```bash
2020-12-03 20:57:15,616 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0002091 in punning not allowed [Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0002091>)), Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0002091>))]
2020-12-03 20:57:15,619 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/BFO_0000062 in punning not allowed [Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/BFO_0000062>)), Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/BFO_0000062>))]
2020-12-03 20:57:15,620 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/BFO_0000063 in punning not allowed [Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/BFO_0000063>)), Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/BFO_0000063>))]
2020-12-03 20:57:15,620 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0002222 in punning not allowed [Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0002222>)), Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0002222>))]
2020-12-03 20:57:15,620 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0000087 in punning not allowed [Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0000087>)), Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0000087>))]
2020-12-03 20:57:15,620 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0002161 in punning not allowed [Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0002161>)), Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0002161>))]
```

From this message, we can see that we need to remove the `owl:AnnotationProperty` redeclaration for each of the following `owl:ObjectProperty` entities: `RO_0002091`, `BFO_0000062`, `BFO_0000063`, `RO_0002222`, `RO_0000087`, `RO_0002161`. There were also 2 classes (i.e. `CLO_0054407` and `CLO_0054409`) defined as both an `owl:Class` and an `owl:ObjectProperty`. This is currently noted as an issue in the Cell Line Ontology's GitHub repo ([issue #43](https://github.com/CLO-ontology/CLO/issues/43)).

In [None]:
bad_classes = set()
# identify and remove punning errors (iterate over a snapshot because the graph is modified in the loop)
for s, p, o in tqdm(list(merged_onts)):
    triples = list(merged_onts.triples((s, None, None)))
    # check for objects defined as classes and object properties
    class_prop, obj_prop = (s, RDF.type, OWL.Class), (s, RDF.type, OWL.ObjectProperty)
    if (class_prop in triples and obj_prop in triples) and str(s) not in bad_classes:
        bad_classes.add(str(s))
        print('Punning Error: {} defined as an owl:Class and owl:ObjectProperty'.format(str(s)))
        merged_onts.remove(obj_prop)  # keep the owl:Class typing and drop the owl:ObjectProperty redeclaration
    # check for objects defined as object properties and annotation properties
    if o == OWL.ObjectProperty:
        obj_prop, annot_prop = (s, RDF.type, OWL.ObjectProperty), (s, RDF.type, OWL.AnnotationProperty)
        if obj_prop in triples and annot_prop in triples:
            print('Punning Error: {} defined as an owl:ObjectProperty and owl:AnnotationProperty'.format(str(s)))
            merged_onts.remove(annot_prop)
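
The affected entities can also be read directly from log lines like those shown above. The sketch below parses one such line with the standard library; `parse_punning_line` is a hypothetical helper, and the pattern assumes the log keeps the exact wording shown in the example output.

```python
import re

# Matches the "reuse of entity <IRI> in punning not allowed [...]" part of a log line.
PUNNING_PATTERN = re.compile(r'reuse of entity (\S+) in punning not allowed \[(.*)\]')

def parse_punning_line(line):
    """Return (local identifier, sorted declaration kinds), or None if the line does not match."""
    match = PUNNING_PATTERN.search(line)
    if match is None:
        return None
    iri, declarations = match.groups()
    kinds = sorted(set(re.findall(r'Declaration\((\w+)\(', declarations)))
    return iri.rsplit('/', 1)[-1], kinds

# log line copied from the example findings above
log_line = ('2020-12-03 20:57:15,616 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: '
            'reuse of entity http://purl.obolibrary.org/obo/RO_0002091 in punning not allowed '
            '[Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0002091>)), '
            'Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0002091>))]')
print(parse_punning_line(log_line))
```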

<br>

### Double-Typed Classes <a class="anchor" id="double-typed"></a>
***


Similar to resolving punning errors, we also need to identify classes that have been typed as both `owl:Class` and `owl:NamedIndividual` and remove the `owl:NamedIndividual` axiom.

#### Example Findings  
The `UBERON` Ontology contains the following re-typing errors:
- UBERON_0001009-Class/UBERON_0001009-NamedIndividual
- UBERON_0001004-Class/UBERON_0001004-NamedIndividual
- UBERON_0001555-Class/UBERON_0001555-NamedIndividual
- UBERON_0000383-NamedIndividual/UBERON_0000383-Class
- UBERON_0000029-NamedIndividual/UBERON_0000029-Class
- UBERON_0001017-Class/UBERON_0001017-NamedIndividual
- UBERON_0000062-Class/UBERON_0000062-NamedIndividual
- UBERON_0000926-Class/UBERON_0000926-NamedIndividual

In [None]:
bad_cls = []
kg_classes = gets_ontology_classes(merged_onts)
for cls in tqdm(kg_classes):
    class_types = list(merged_onts.triples((cls, RDF.type, None)))
    if len(class_types) > 1:
        bad_cls += [', '.join([str(x[0]).split('/')[-1] + '-' + str(x[2]).split('#')[-1] for x in class_types])]
        to_remove = list(merged_onts.triples((cls, RDF.type, OWL.NamedIndividual)))
        for edge in to_remove:
            merged_onts.remove(edge)
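
To make the rule concrete without loading an ontology, the sketch below applies it to plain tuples standing in for graph triples. It is an illustration under stated assumptions rather than the notebook's method: `drop_individual_typing` is a hypothetical helper, the `EX_0000001` identifier is invented, and unlike the cell above it only acts when both typings are present.

```python
from collections import defaultdict

OWL = 'http://www.w3.org/2002/07/owl#'
OBO = 'http://purl.obolibrary.org/obo/'
RDF_TYPE = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'

def drop_individual_typing(triples):
    """Remove owl:NamedIndividual typings from subjects that are also typed owl:Class."""
    types = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)
    both = {OWL + 'Class', OWL + 'NamedIndividual'}
    double_typed = {s for s, found in types.items() if both <= found}
    cleaned = [(s, p, o) for s, p, o in triples
               if not (s in double_typed and p == RDF_TYPE and o == OWL + 'NamedIndividual')]
    return cleaned, double_typed

triples = [
    (OBO + 'UBERON_0001009', RDF_TYPE, OWL + 'Class'),
    (OBO + 'UBERON_0001009', RDF_TYPE, OWL + 'NamedIndividual'),
    (OBO + 'EX_0000001', RDF_TYPE, OWL + 'NamedIndividual'),  # invented: individual only, left untouched
]
cleaned, double_typed = drop_individual_typing(triples)
print(sorted(double_typed), len(cleaned))
```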

<br>

### Class Identifier Check  <a class="anchor" id="identifier-check"></a>
***

Check class identifiers to ensure consistency in identifier prefixes. For example, we want to catch incorrectly formatted identifiers, such as occurrences of `PRO_XXXXXXX`, which should be `PR_XXXXXXX`. For all detected errors, we reformat the incorrectly formatted class identifiers. This is a tricky task to automate and should be revisited whenever new ontologies are added to the `PheKnowLator` build. Currently, the code below checks and logs any hits, but only fixes the following known error: Vaccine Ontology identifiers prefixed with `PRO`, which should be `PR`.

#### Example Findings  
Running this check revealed mislabeling of `2` [PRotein Ontology](https://proconsortium.org/) identifiers in the [Vaccine Ontology](http://www.violinet.org/vaccineontology/) (see [this](https://github.com/vaccineontology/VO/issues/4) GitHub issue).


**Solution:** 


In [None]:
# collect all classes and keep only those with OBO (obo/) identifiers
kg_classes = set([x for x in merged_onts.subjects(RDF.type, OWL.Class)])
class_list = [res for res in kg_classes if isinstance(res, URIRef) and 'obo/' in str(res)]

# print unique identifier types for all classes in each ontology
print('Unique Identifier Types: {}'.format(', '.join(sorted(set([x.split('/')[-1].split('_')[0] for x in class_list])))))

In [None]:
# replace badly formatted identifiers with the correct ones
bad_prefix, good_prefix = 'http://purl.obolibrary.org/obo/PRO_', 'http://purl.obolibrary.org/obo/PR_'
bad_classes = set()
for edge in tqdm(list(merged_onts)):  # iterate over a snapshot because the graph is modified in the loop
    subj, obj = edge[0], edge[2]
    if bad_prefix in str(edge[0]):
        subj = URIRef(str(edge[0]).replace(bad_prefix, good_prefix))
        bad_classes.add(str(edge[0]))
    if bad_prefix in str(edge[2]):
        obj = URIRef(str(edge[2]).replace(bad_prefix, good_prefix))
        bad_classes.add(str(edge[2]))
    # rewrite the triple once, even when both the subject and the object needed repair
    if (subj, obj) != (edge[0], edge[2]):
        merged_onts.add((subj, edge[1], obj))
        merged_onts.remove(edge)

print('The following classes were updated:\n{}'.format('\n'.join(bad_classes)))
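
The prefix survey and the `PRO_` to `PR_` repair can be tried on plain strings before touching the graph. The following is a small standard-library sketch; `prefix_counts` and `fix_prefix` are hypothetical helper names, and the sample IRIs are illustrative.

```python
from collections import Counter

def prefix_counts(iris):
    """Count identifier prefixes (text before the first underscore of the last path segment)."""
    return Counter(iri.rsplit('/', 1)[-1].split('_')[0] for iri in iris)

def fix_prefix(iri, bad='PRO_', good='PR_'):
    """Rewrite the local identifier only when it starts with the bad prefix."""
    base, _, local = iri.rpartition('/')
    if local.startswith(bad):
        return base + '/' + good + local[len(bad):]
    return iri

iris = [
    'http://purl.obolibrary.org/obo/PRO_000000001',  # badly formatted
    'http://purl.obolibrary.org/obo/PR_000000001',   # already correct
    'http://purl.obolibrary.org/obo/VO_0003030',
]
print(prefix_counts(iris))
print([fix_prefix(x) for x in iris])
```

Anchoring the check to the start of the local identifier keeps the rewrite from touching IRIs that merely contain `PRO_` somewhere else.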

<br>

### Remove Obsolete and/or Deprecated Classes    <a class="anchor" id="obsolete-classes"></a>
***

To make sure that the ontology only contains current information, all obsolete classes and any triples that they participate in are removed from the ontologies. In addition to running the code below, it may also be necessary to check for classes that are a sub-class of `oboInOwl:ObsoleteClass`, as well as any obsolete or deprecated annotations, individuals, or `owl:ObjectProperty`.

In [None]:
# get deprecated classes and triples
dep_cls = [x[0] for x in list(merged_onts.triples((None, OWL.deprecated, Literal('true', datatype=schema.boolean))))]
dep_triples = [(i, j, k) for i, j, k in merged_onts
               if 'deprecated' in ', '.join([str(i).lower(), str(j).lower(), str(k).lower()])
               and len(list(merged_onts.triples((i, RDF.type, OWL.Class)))) == 1]
deprecated_classes = set(dep_cls + [x[0] for x in dep_triples])

# get obsolete classes and triples
obs_cls = [x[0] for x in list(merged_onts.triples((None, RDFS.subClassOf, oboinowl.ObsoleteClass)))]
obs_triples = [(i, j, k) for i, j, k in merged_onts
               if 'obsolete' in ', '.join([str(i).lower(), str(j).lower(), str(k).lower()])
               and len(list(merged_onts.triples((i, RDF.type, OWL.Class)))) == 1 and '#' not in str(i)]
obsolete_classes = set(obs_cls + [x[0] for x in obs_triples])

# remove deprecated/obsolete classes and every triple they participate in
for node in list(deprecated_classes) + list(obsolete_classes):
    merged_onts.remove((node, None, None))
    merged_onts.remove((None, None, node))

print('Removed {} obsolete classes and {} deprecated classes\n'.format(len(obsolete_classes), len(deprecated_classes)))
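
Removing "any triples that they participate in" means dropping a flagged class wherever it appears, as subject or as object. A toy sketch on plain tuples is given below; the `ex:` identifiers and the `drop_nodes` helper are invented for illustration and do not come from the build.

```python
def drop_nodes(triples, nodes):
    """Return the triples that mention none of `nodes` as subject or object."""
    nodes = set(nodes)
    return [(s, p, o) for s, p, o in triples if s not in nodes and o not in nodes]

triples = [
    ('ex:A', 'owl:deprecated', 'true'),
    ('ex:A', 'rdfs:subClassOf', 'ex:B'),
    ('ex:C', 'rdfs:subClassOf', 'ex:A'),  # points at the deprecated class
    ('ex:C', 'rdfs:subClassOf', 'ex:B'),
]
# collect subjects flagged as deprecated, then drop every triple that touches them
deprecated = {s for s, p, o in triples if p == 'owl:deprecated' and o == 'true'}
remaining = drop_nodes(triples, deprecated)
print(remaining)
```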

<br>

***
## Normalize Classes <a class="anchor" id="normalize-classes"></a>
***

**Purpose:** The goal of this section is to check the cleaned, merged ontology file to ensure that there is consistency between the existing classes. To do this, we perform two checks: (1) [Aligning Existing Ontology Classes](#aligning-existing-ontologies); and (2) [Aligning Ontology Classes and New Edge Data](#aligning-new-data). More details for each type of check are provided below.

<br>

### Aligning Existing Ontology Classes <a class="anchor" id="aligning-existing-ontologies"></a>
***

**Purpose:** Two types of checks are performed here:  

*Normalize Duplicate Ontology Concepts*  
We want to make sure that all classes that represent the same entity are connected to each other. For example, the [Sequence Ontology](http://www.sequenceontology.org/), [ChEBI](https://www.ebi.ac.uk/chebi), and [PRotein Ontology](https://proconsortium.org/) all include terms for protein, but none of these classes are connected to each other. The solution is to choose a primary concept for each set of duplicates and make each duplicate concept an `rdfs:subClassOf` the primary concept.

*Normalize Existing Ontology Classes*  
Checks for inconsistencies in ontology classes that overlap with non-ontology entity identifiers (e.g. if HP includes `HGNC` identifiers, but PheKnowLator utilizes `Entrez` identifiers). While there are other types of identifiers, we focus primarily on resolving the genomic types, since we have a master dictionary we can use to help with this ([`Merged_gene_rna_protein_identifiers.pkl`](https://storage.googleapis.com/pheknowlator/release_v2.0.0/build_31DEC2020/data/processed_data/Merged_gene_rna_protein_identifiers.pkl)). This can be updated in future iterations to include other types of identifiers, but given our detailed examination of the `v2.0.0` ontologies, these were the identifier types that needed repair.

**Dependencies:** [`Merged_gene_rna_protein_identifiers.pkl`](https://storage.googleapis.com/pheknowlator/release_v2.0.0/build_31DEC2020/data/processed_data/Merged_gene_rna_protein_identifiers.pkl)  

#### Sample Findings  
*Normalize Duplicate Ontology Concepts*  
The following classes occur in all of the ontologies used in the current build and have to be normalized so that there are not multiple versions of the same concept:  

- Gene: [VO](http://purl.obolibrary.org/obo/OGG_0000000002)  
  - <u>Solution</u>: Make the `VO` imported `OGG` class a subclass of the `SO` gene term  

- Protein: [SO](http://purl.obolibrary.org/obo/SO_0000104), [PRO](http://purl.obolibrary.org/obo/PR_000000001), [ChEBI](http://purl.obolibrary.org/obo/CHEBI_36080) 
  - <u>Solution</u>: Make the `CHEBI` and `PRO` classes a subclass of the `SO` protein term  
  
- Disorder: [VO](http://purl.obolibrary.org/obo/OGMS_0000045)  
  - <u>Solution</u>: Make the `VO` imported `OGMS` class a subclass of the `MONDO` disease term  

- Antigen: [VO](http://purl.obolibrary.org/obo/OBI_1110034)  
  - <u>Solution</u>: Make the `VO` imported OBI class a subclass of the `CHEBI` antigen term  

- Gelatin: [VO](http://purl.obolibrary.org/obo/VO_0003030) 
  - <u>Solution</u>: Make the `VO` class a subclass of the `CHEBI` gelatin term 

- Hormone: [VO](http://purl.obolibrary.org/obo/FMA_12278) 
  - <u>Solution</u>: Make the `VO` imported `FMA` class a subclass of the `CHEBI` hormone term

***  
**Normalize Duplicate Ontology Concepts**

In [None]:
merged_onts.add((obo.OGG_0000000002, RDFS.subClassOf, obo.SO_0000704))  # fix gene class inconsistencies
merged_onts.add((obo.PR_000000001, RDFS.subClassOf, obo.SO_0000104))  # fix protein class inconsistencies
merged_onts.add((obo.CHEBI_36080, RDFS.subClassOf, obo.SO_0000104))  # fix protein class inconsistencies
merged_onts.add((obo.OGMS_0000045, RDFS.subClassOf, obo.MONDO_0000001))  # fix disorder class inconsistencies
merged_onts.add((obo.OBI_1110034, RDFS.subClassOf, obo.CHEBI_59132))  # fix antigen class inconsistencies
merged_onts.add((obo.VO_0003030, RDFS.subClassOf, obo.CHEBI_5291))  # fix gelatin class inconsistencies
merged_onts.add((obo.FMA_12278, RDFS.subClassOf, obo.CHEBI_24621))  # fix hormone class inconsistencies                   
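
The same duplicate-to-primary assignments can be kept in one table, which makes them easier to review when ontologies change. The sketch below builds the triples as plain strings; the identifier pairs are copied from the cell above, while `PRIMARY_CONCEPTS` and `subclass_triples` are hypothetical names, not part of `pkt_kg`.

```python
OBO = 'http://purl.obolibrary.org/obo/'
SUBCLASS_OF = 'http://www.w3.org/2000/01/rdf-schema#subClassOf'

# duplicate concept -> primary concept (pairs copied from the cell above)
PRIMARY_CONCEPTS = {
    'OGG_0000000002': 'SO_0000704',   # gene
    'PR_000000001': 'SO_0000104',     # protein
    'CHEBI_36080': 'SO_0000104',      # protein
    'OGMS_0000045': 'MONDO_0000001',  # disorder
    'OBI_1110034': 'CHEBI_59132',     # antigen
    'VO_0003030': 'CHEBI_5291',       # gelatin
    'FMA_12278': 'CHEBI_24621',       # hormone
}

def subclass_triples(mapping):
    """Build (duplicate, rdfs:subClassOf, primary) triples as IRI strings."""
    return [(OBO + dup, SUBCLASS_OF, OBO + primary) for dup, primary in mapping.items()]

new_triples = subclass_triples(PRIMARY_CONCEPTS)
print(len(new_triples))
```

With `rdflib`, each string would still need to be wrapped in `URIRef` before being added to the graph.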

***
**Normalize Existing Ontology Classes**

In [None]:
# download required dependencies
url = 'https://storage.googleapis.com/pheknowlator/release_v2.0.0/build_31DEC2020/data/processed_data/Merged_gene_rna_protein_identifiers.pkl'
if not os.path.exists(processed_data_location + 'Merged_gene_rna_protein_identifiers.pkl'):
    data_downloader(url, processed_data_location)

gene_ids = pickle.load(open(processed_data_location + 'Merged_gene_rna_protein_identifiers.pkl', 'rb'))


In [None]:
# get all non-OBO classes in the merged graph and replace HGNC identifiers with Entrez gene identifiers
non_ont = set([x for x in gets_ontology_classes(merged_onts) if not str(x).startswith(str(obo))])
hgnc, url = set([x for x in non_ont if 'hgnc' in str(x)]), 'http://www.ncbi.nlm.nih.gov/gene/'

for node in tqdm(hgnc):
    trips = list(merged_onts.triples((node, None, None))) + list(merged_onts.triples((None, None, node)))
    node_str = 'hgnc_id_' + str(node).split('=')[-1]
    if node_str in gene_ids.keys():
        ent_maps = [URIRef(url + x) for x in gene_ids[node_str] if x.startswith('entrez_id_')]
        for edge in trips:
            if node == edge[0]:
                for i in ent_maps:
                    merged_onts.add((i, edge[1], edge[2]))
            if node == edge[2]:
                for i in ent_maps:
                    merged_onts.add((edge[0], edge[1], i))
            merged_onts.remove(edge)
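
The lookup step in the loop above can be sketched on its own with a toy dictionary. The layout assumed here (keys prefixed with `hgnc_id_`, values listing entries such as `entrez_id_...`) is inferred from the cell above rather than from the pickle file itself, and the IRI, identifiers, and `entrez_entries` helper are invented for illustration.

```python
def entrez_entries(hgnc_iri, gene_ids):
    """Return the 'entrez_id_*' entries mapped to the HGNC identifier embedded in `hgnc_iri`."""
    # The notebook takes the text after the last '=' as the HGNC number.
    key = 'hgnc_id_' + hgnc_iri.split('=')[-1]
    return [x for x in gene_ids.get(key, []) if x.startswith('entrez_id_')]

# toy stand-in for the merged identifier dictionary (values are invented)
toy_gene_ids = {'hgnc_id_0001': ['entrez_id_111', 'ensembl_gene_id_EXAMPLE']}

print(entrez_entries('http://example.org/gene?hgnc_id=0001', toy_gene_ids))
print(entrez_entries('http://example.org/gene?hgnc_id=9999', toy_gene_ids))
```

Identifiers with no mapping return an empty list; the notebook loop leaves their triples in place.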

*** 
**Save Cleaned Merged Ontologies**

In [None]:
merged_onts.serialize(write_location + merged_ontology_file, format='xml')
ontology_file_formatter(write_location, merged_ontology_file, './pkt_kg/libs/owltools')


<br>

***
***

```
@misc{callahan_tj_2019_3401437,
  author       = {Callahan, TJ},
  title        = {PheKnowLator},
  month        = mar,
  year         = 2019,
  doi          = {10.5281/zenodo.3401437},
  url          = {https://doi.org/10.5281/zenodo.3401437}
}
```