***
# PheKnowLator - Ontology Cleaning
***
***

**Author:** [TJCallahan](https://mail.google.com/mail/u/0/?view=cm&fs=1&tf=1&to=callahantiff@gmail.com)  
**GitHub Repository:** [PheKnowLator](https://github.com/callahantiff/PheKnowLator/wiki)  
**Release:** **[v2.0.0](https://github.com/callahantiff/PheKnowLator/wiki/v2.0.0)**
  
<br>  
  
**Purpose:** This notebook serves as a script to help prepare ontologies prior to being ingested into the knowledge graph build algorithm. It prepares ontologies for ingestion by performing the following steps:  
1. [Clean Ontologies](#clean-ontologies)  
2. [Merge Ontologies](#merge-ontologies)  
3. [Normalize Classes](#normalize-classes)  


<br>

**Assumptions:**   
- Directory of Imported Ontologies ➞ `./resources/ontologies`    
- Processed data write location ➞ `./resources/ontologies`  

<br>

**Dependencies:**   
- This notebook utilizes several helper functions, which are stored in the [`kg_utils.py`](https://github.com/callahantiff/PheKnowLator/blob/master/pkt_kg/utils/kg_utils.py) script. Hyperlinks to all downloaded and generated data sources are provided on the [Data Sources](https://github.com/callahantiff/PheKnowLator/wiki/v2-Data-Sources) Wiki page as well as within each source subsection of this notebook. All generated data is freely available for download from DropBox. 
- [`OWL Tools`](https://github.com/owlcollab/owltools)

_____
***

### Set-Up Environment
_____

In [5]:
# import needed libraries
import glob
import pickle

from owlready2 import *
from rdflib import Graph, Namespace, URIRef, BNode, Literal
from tqdm import tqdm

# import script containing helper functions
from pkt_kg.utils import * 

In [None]:
# set up environment variables
write_location = './resources/knowledge_graphs'
merged_ontology_file = '/PheKnowLator_MergedOntologies.owl'
ontology_repository = glob.glob('*/ontologies/*.owl')


<br><br>

***
### Clean Ontologies <a class="anchor" id="clean-ontologies"></a>

**Purpose:** In this step, we read in the ontologies using the [`owlready2`](https://pypi.org/project/Owlready2/) library and rely on its strict parser to surface errors in the ontology files: a file that fails to load points to a problem that needs to be repaired.

Errors were found in the following ontologies:  
- [Vaccine Ontology](http://www.violinet.org/vaccineontology/): GitHub [issue #4](https://github.com/vaccineontology/VO/issues/4)
- [Cell Line Ontology](http://www.clo-ontology.org/): GitHub [issue #42](https://github.com/CLO-ontology/CLO/issues/42), [issue #45](https://github.com/CLO-ontology/CLO/issues/45)  
- [PRotein Ontology](https://proconsortium.org/pro.shtml): GitHub [issue #176](https://github.com/PROconsortium/PRoteinOntology/issues/176)


**Import Ontology Check**

In [None]:
for ont in tqdm(ontology_repository):
    load_onto = get_ontology(ont).load()
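
The loop above stops at the first file that fails to load. As a minimal sketch (not part of the original workflow), the helper below tries every file and records the error raised for each failure, so all problem ontologies can be listed in one pass. The function name `collect_load_errors` and the `loader` argument are hypothetical; the loader is passed in so the helper does not depend on any particular parsing library.

```python
from typing import Callable, Dict, Iterable


def collect_load_errors(paths: Iterable[str], loader: Callable[[str], object]) -> Dict[str, str]:
    """Try to load every file and return {path: error message} for the failures."""
    errors = {}
    for path in paths:
        try:
            loader(path)
        except Exception as exc:  # parsers can raise several different exception types
            errors[path] = f'{type(exc).__name__}: {exc}'
    return errors
```

In this notebook, it could be called as `collect_load_errors(ontology_repository, lambda p: get_ontology(p).load())`, assuming `owlready2` has been imported as in the set-up cell.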


The [Cell Line Ontology](http://www.clo-ontology.org/) yields the following error message:

```python

ValueError: invalid literal for int() with base 10: '永生的乳腺衍生细胞系细胞'

...

OwlReadyOntologyParsingError: RDF/XML parsing error in file ./resources/knowledge_graphs/PheKnowLator_MergedOntologies.owl, line 2363344, column 99.
```

This tells us that we need to repair the triple containing the Literal '永生的乳腺衍生细胞系细胞' by removing it and re-adding it typed as a `string` rather than as an `int`, which is how it is currently defined. 

<br>

This is currently noted as an issue in the [Cell Line Ontology's](http://www.clo-ontology.org/) GitHub repo ([issue #45](https://github.com/CLO-ontology/CLO/issues/48)). 

In [None]:
for edge in tqdm(graph):
    if '永生的乳腺衍生细胞系细胞' in str(edge[0]) or '永生的乳腺衍生细胞系细胞' in str(edge[2]):
        
        # repair broken triple
        graph.add((edge[0], edge[1], Literal(str(edge[2]), datatype=URIRef('http://www.w3.org/2001/XMLSchema#string'))))
        graph.remove(edge)
        break

# save cleaned up ontology
graph.serialize(destination='./resources/ontologies/clo_with_imports', format='xml')


In [None]:
# try reading in the cleaned ontology again
merged_onto = get_ontology('./resources/ontologies/clo_with_imports').load()


The next errors that are generated are related to punning, specifically that the following OWL object properties had been incorrectly redeclared as OWL annotation properties:

```bash
2020-03-24 16:48:25,458 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0002091 in punning not allowed [Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0002091>)), Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0002091>))]
2020-03-24 16:48:25,460 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/BFO_0000062 in punning not allowed [Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/BFO_0000062>)), Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/BFO_0000062>))]
2020-03-24 16:48:25,460 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/BFO_0000063 in punning not allowed [Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/BFO_0000063>)), Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/BFO_0000063>))]
2020-03-24 16:48:25,460 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0002222 in punning not allowed [Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0002222>)), Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0002222>))]
2020-03-24 16:48:25,460 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0000087 in punning not allowed [Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0000087>)), Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0000087>))]
2020-03-24 16:48:25,460 ERROR (OWLOntologyManagerImpl:1138) Illegal redeclarations of entities: reuse of entity http://purl.obolibrary.org/obo/RO_0002161 in punning not allowed [Declaration(ObjectProperty(<http://purl.obolibrary.org/obo/RO_0002161>)), Declaration(AnnotationProperty(<http://purl.obolibrary.org/obo/RO_0002161>))]
```

From these messages, we can see that we need to remove the Annotation Property redeclarations for the following object properties:  
- RO_0002091  
- BFO_0000062  
- BFO_0000063  
- RO_0002222  
- RO_0000087  
- RO_0002161  

<br>

This is another error caused by the [Cell Line Ontology](http://www.clo-ontology.org/) and has been posted to GitHub ([issue #42](https://github.com/CLO-ontology/CLO/issues/42)).

We also removed RO_0002161 (the Annotation Property) from GO, UBERON, and HPO.

<br>

Consistent with the solution described [here](https://github.com/oborel/obo-relations/issues/130), we removed all `AnnotationProperty` declarations from the merged ontology file. The Annotation Properties for each of the Object Properties listed above were removed using Protégé.

**Class Identifier Check**  
Check class identifiers to ensure consistency in identifier prefixes. Running this check revealed mislabeling of two [PRotein Ontology](https://proconsortium.org/) identifiers in the [Vaccine Ontology](http://www.violinet.org/vaccineontology/) (see [this](https://github.com/vaccineontology/VO/issues) GitHub issue).

In [63]:
# find all classes in graph
kg_classes = graph.query(
    """SELECT DISTINCT ?c
           WHERE {?c rdf:type owl:Class . }
           """, initNs={'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
                        'owl': 'http://www.w3.org/2002/07/owl#'}
)
        

In [None]:
# convert results to list of classes and only keep OBO identifiers
class_list = [res[0] for res in tqdm(kg_classes) if isinstance(res[0], URIRef) and 'obo' in str(res[0])]


In [None]:
class_types = []

for cls in class_list:
    class_types.append(cls.split('/')[-1].split('_')[0])
    
set(class_types)
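
The set above shows which prefixes exist, but not how often each one occurs. A small sketch using `collections.Counter` makes rare prefixes stand out, since a mislabeled identifier typically shows up as a prefix with a very low count. The helper name `prefix_counts` is hypothetical, and the last IRI in the example list is a made-up lowercase variant included only to show how an inconsistent prefix surfaces.

```python
from collections import Counter


def prefix_counts(class_iris):
    """Count identifier prefixes, e.g. 'PR' for '.../obo/PR_000000001'."""
    return Counter(str(iri).split('/')[-1].split('_')[0] for iri in class_iris)


example = [
    'http://purl.obolibrary.org/obo/PR_000000001',
    'http://purl.obolibrary.org/obo/SO_0000104',
    'http://purl.obolibrary.org/obo/CHEBI_36080',
    'http://purl.obolibrary.org/obo/pr_000000001',  # made-up inconsistent prefix
]

# list prefixes from least to most frequent so that outliers come first
for prefix, count in sorted(prefix_counts(example).items(), key=lambda kv: kv[1]):
    print(prefix, count)
```

In the notebook, `prefix_counts(class_list)` would play the same role as the `class_types` loop.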


<br><br>

***
### Merge Ontologies <a class="anchor" id="merge-ontologies"></a>

**Purpose:** In this step, the `OWL Tools` library is used to merge a directory of ontology files into a single ontology file. This merged ontology file is required as input to the knowledge graph build algorithm.  

**Inputs:** A directory of ontology files (`.owl`)  
**Outputs:** [`PheKnowLator_MergedOntologies.owl`](https://www.dropbox.com/s/6e7gf8r229nbu67/PheKnowLator_MergedOntologies.owl?dl=1)

In [49]:
# verify there are ontology files in ontology repo
for ont in ontology_repository:
    print(ont)


resources/ontologies/chebi_lite_with_imports.owl
resources/ontologies/clo_with_imports.owl
resources/ontologies/doid_with_imports.owl
resources/ontologies/ext_with_imports.owl
resources/ontologies/go_with_imports.owl
resources/ontologies/hp_with_imports.owl
resources/ontologies/human_pro_closed_with_imports.owl
resources/ontologies/pw_with_imports.owl
resources/ontologies/so_with_imports.owl
resources/ontologies/vo_with_imports.owl


In [50]:
# merge ontologies
if write_location + merged_ontology_file in glob.glob(write_location + '/*.owl'):
    graph = Graph()
    graph.parse(write_location + merged_ontology_file)
    gets_ontology_statistics(write_location + merged_ontology_file)
else:
    merges_ontologies(ontology_repository, write_location, merged_ontology_file)



The knowledge graph contains 395290 classes, 4168504 axioms, 558 object properties, and 137 individuals



<br><br>

***

### Normalize Classes <a class="anchor" id="normalize-classes"></a>

**Purpose:** The goal of this section is to check the cleaned, merged ontology file to ensure that the existing classes are consistent. To do this, we check the following two things:  
- <u>Connectivity Between Existing Classes</u>: For this check, we want to make sure that all classes that represent the same entity are connected to each other. For example, consider the following:  
    - Ontologies: [Sequence Ontology](http://www.sequenceontology.org/), [ChEBI](https://www.ebi.ac.uk/chebi), and [PRotein Ontology](https://proconsortium.org/) all include terms for protein, but none of these classes are connected to each other. 
    
    
- <u>Consistency Between Ontology Classes and New Edge Data</u>: For this check, we want to make sure that any of the existing ontology classes can be aligned with any of the new data entities that we want to add to the knowledge graph. For example:  
  - Gene Classes: there are several gene classes that use [HGNC](https://www.genenames.org/) identifiers. We also want to add genes, but prefer to use [Entrez gene](https://www.ncbi.nlm.nih.gov/gene) identifiers. In order to be used with our data, we must first normalize all of the HGNC gene classes to Entrez gene identifiers.
  
<br>

**Dependencies:** The Merged Gene, RNA, Protein Map ([`Merged_gene_rna_protein_identifiers.pkl`](https://www.dropbox.com/s/9zlysbqvpdtfq62/Merged_gene_rna_protein_identifiers.pkl?dl=1)) we generated in order to map genomic identifier data sources.

<br>

**Connectivity Between Existing Classes**

The following classes occur in more than one of the ontologies used in the current build and have to be resolved:  
- Protein: [SO](http://purl.obolibrary.org/obo/SO_0000104), [PRO](http://purl.obolibrary.org/obo/PR_000000001), [ChEBI](http://purl.obolibrary.org/obo/CHEBI_36080)  
  - <u>Solution</u>: Make the ChEBI and PRO classes subclasses of the SO term  
    ```python
    PR_000000001, rdfs:subClassOf, SO_0000104   
    CHEBI_36080, rdfs:subClassOf, SO_0000104 
    ```

In [51]:
# fix protein class inconsistencies
graph.add((URIRef('http://purl.obolibrary.org/obo/PR_000000001'),
           URIRef('http://www.w3.org/2000/01/rdf-schema#subClassOf'),
           URIRef('http://purl.obolibrary.org/obo/SO_0000104')))

graph.add((URIRef('http://purl.obolibrary.org/obo/CHEBI_36080'),
           URIRef('http://www.w3.org/2000/01/rdf-schema#subClassOf'),
           URIRef('http://purl.obolibrary.org/obo/SO_0000104')))

# save cleaned up ontology
graph.serialize(destination=write_location + merged_ontology_file, format='xml')


<br>

**Consistency Between Ontology Classes and New Edge Data**  
The first step to normalizing ontology classes with multiple identifiers is to query the ontology and obtain all classes that are not part of the [Open Biomedical Ontology](http://www.obofoundry.org/) namespace.

For the current build, the primary focus of this task is to convert all classes that reference an HGNC gene (`n=19,820`) to an Entrez identifier. To do this, we will utilize the genomic identifier mapping information ([`Merged_gene_rna_protein_identifiers.pkl`](https://www.dropbox.com/s/9zlysbqvpdtfq62/Merged_gene_rna_protein_identifiers.pkl?dl=1)) we constructed in the [`Data_Preparation.ipynb`](https://github.com/callahantiff/PheKnowLator/blob/master/Data_Preparation.ipynb) Jupyter notebook. Note that we are only updating identifiers and not verifying labels or other metadata.

In [None]:
# convert results to list of classes and only keep hgnc identifiers
class_list_gene = [res[0] for res in tqdm(kg_classes) if isinstance(res[0], URIRef) and 'hgnc' in str(res[0])]
  

In [66]:
# load genomic identifier mapping dictionary
with open('resources/processed_data/Merged_gene_rna_protein_identifiers.pkl', 'rb') as pkl_file:
    genomic_id_map = pickle.load(pkl_file, encoding='bytes')


In [None]:
# loop over each gene class and get entrez gene id equivalent
matches, not_matched = {}, []
gene_url = 'https://www.ncbi.nlm.nih.gov/gene/'

for gene_class in tqdm(class_list_gene):
    key = 'hgnc_id_' + str(gene_class).split('=')[-1]
    
    if key in genomic_id_map.keys():
        matches[str(gene_class)] = [gene_url + x.split('_')[-1] for x in genomic_id_map[key] if 'entrez_id' in x]
    else:
        not_matched.append(gene_class)
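
To make the lookup in the loop above easier to follow, here is a toy illustration of the key construction. It assumes, based on how the loop reads the dictionary, that keys look like `hgnc_id_<number>` and that each value is a collection of strings such as `entrez_id_<number>`. The function name, the toy dictionary, and all identifier values in it are made up for illustration.

```python
def hgnc_class_to_entrez_urls(class_iri, id_map, gene_url='https://www.ncbi.nlm.nih.gov/gene/'):
    """Map an HGNC class IRI (ending in '...hgnc_id=<number>') to Entrez gene URLs."""
    key = 'hgnc_id_' + str(class_iri).split('=')[-1]
    # keep only the Entrez entries and strip the 'entrez_id_' prefix
    return [gene_url + x.split('_')[-1] for x in id_map.get(key, []) if 'entrez_id' in x]


# toy mapping with made-up identifiers
toy_map = {'hgnc_id_12345': ['entrez_id_999', 'symbol_FAKE1']}

matched = hgnc_class_to_entrez_urls(
    'http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=12345', toy_map)
unmatched = hgnc_class_to_entrez_urls(
    'http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=54321', toy_map)
```

An empty list, as in `unmatched`, corresponds to a class that would end up in `not_matched` in the loop above.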
        

_Investigate UnMatched Genes_  
Only 1 of the HGNC genes was not found in our dictionary ([HGNC_24033](http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=24033)). Investigating this issue revealed that HGNC made this identifier obsolete and replaced it with [`HGNC:26545`](https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/24033). Until this term is updated in the ontology, we have to manually fix it. 

This issue has been reported to [PRotein Ontology](https://proconsortium.org/pro.shtml) (see [this](https://github.com/PROconsortium/PRoteinOntology/issues/176) GitHub issue).

In [85]:
# investigate HGNC genes with no mappings to Entrez
not_matched

# update mapping dictionary
gene_url = 'https://www.ncbi.nlm.nih.gov/gene/'
matches[str(not_matched[0])] =  [gene_url + x.split('_')[-1] for x in genomic_id_map['hgnc_id_26545'] if 'entrez_id' in x]


In [None]:
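# update gene identifiers in graph
# iterate over a snapshot (list) so triples can be added/removed safely inside the loop
for edge in tqdm(list(graph)):
    if str(edge[0]) in matches:
        # subject is an HGNC class: re-add the triple with each mapped Entrez URL
        for mapped_id in matches[str(edge[0])]:
            graph.add((URIRef(mapped_id), edge[1], edge[2]))
            graph.remove(edge)
    elif str(edge[2]) in matches:
        # object is an HGNC class: re-add the triple with each mapped Entrez URL
        for mapped_id in matches[str(edge[2])]:
            graph.add((edge[0], edge[1], URIRef(mapped_id)))
            graph.remove(edge)
    else:
        continue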
    

In [None]:
# save normalized ontology
graph.serialize(destination=write_location + merged_ontology_file[:-4] + 'GeneID_Normalized.owl', format='xml')


In [40]:
# code to create pairwise ontology merges -- for OWL reasoner challenge
# import itertools

# # get all pairwise merges of ontologies
# ontology_merge_list_2 = list(itertools.combinations(ontology_repository, 2))
# ontology_merge_list_3 = list(itertools.combinations(ontology_repository, 3))
# ontology_merge_list_4 = list(itertools.combinations(ontology_repository, 4))
# ontology_merge_list_5 = list(itertools.combinations(ontology_repository, 5))
# ontology_merge_list_6 = list(itertools.combinations(ontology_repository, 6))
# ontology_merge_list_7 = list(itertools.combinations(ontology_repository, 7))
# ontology_merge_list_8 = list(itertools.combinations(ontology_repository, 8))
# ontology_merge_list_9 = list(itertools.combinations(ontology_repository, 9))

# # iterate
# Ontology_list = ontology_merge_list_3
# write_location = './resources/ontologies/reasoner_challenge/'

# for ont_list in tqdm(Ontology_list):
#     merged_file_name = '_'.join(['_'.join(x.split('/')[-1].split('_')[:-2]) for x in ont_list]) + '_merged.owl'
#     merges_ontologies(list(ont_list), './resources/ontologies/reasoner_challenge/', merged_file_name)
