# mOWL Tutorial

This tutorial shows how to use machine learning with ontologies. It covers the different approaches for generating embeddings of OWL ontologies, and methods for using them. We rely on the mOWL library, which aims to implement all embedding methods for Semantic Web (OWL) ontologies.

Most libraries for processing OWL ontologies are written in Java, while most machine learning libraries are written in Python. We therefore first need to access Java libraries from Python so that we can process ontologies and perform reasoning. For this purpose we rely on the JPype library, which makes Java classes available in Python. We also have to set the amount of memory available to the Java Virtual Machine.

In [None]:
import mowl
import tempfile
mowl.init_jvm("40g") # the amount of memory to assign to the JVM

We can now access classes from the OWLAPI (the main reference implementation for processing Semantic Web ontologies) through their Python interfaces, just as we would in Java. The following code loads an ontology and classifies it using the ELK reasoner. We then query for all subclasses of the Human Phenotype Ontology (HPO) class "Mode of inheritance".

In [None]:
from org.semanticweb.elk.owlapi import ElkReasonerFactory
from org.semanticweb.owlapi.apibinding import OWLManager
from org.semanticweb.owlapi.model import IRI

manager = OWLManager.createOWLOntologyManager()
fac = manager.getOWLDataFactory()
ont = manager.loadOntologyFromOntologyDocument(IRI.create("file:merged-phenomenet.owl"))
print("Number of classes: ", ont.getClassesInSignature(True).size())

reasoner_factory = ElkReasonerFactory()
reasoner = reasoner_factory.createReasoner(ont)

for i in reasoner.getSubClasses(fac.getOWLClass(IRI.create("http://purl.obolibrary.org/obo/HP_0000005")), False).getFlattened():
    print(i)

mOWL wraps some functionality that is commonly used for generating ontology embeddings in the MOWLReasoner class, which can be used to compute a limited form of the deductive closure of an ontology.
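
To make the idea of a (limited) deductive closure concrete, here is a minimal pure-Python sketch that is independent of mOWL and the OWLAPI. It assumes a toy ontology given as asserted `(subclass, superclass)` pairs and adds every pair that follows by transitivity. The class names and the `subclass_closure` helper are hypothetical and are not part of `MOWLReasoner`.

```python
def subclass_closure(asserted):
    """Return all (sub, super) pairs entailed by transitivity of SubClassOf."""
    closure = set(asserted)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # a SubClassOf b and b SubClassOf d entail a SubClassOf d
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Hypothetical toy hierarchy (illustrative names, not actual HPO identifiers)
asserted = {
    ("AutosomalDominant", "AutosomalInheritance"),
    ("AutosomalInheritance", "MendelianInheritance"),
    ("MendelianInheritance", "ModeOfInheritance"),
}
closure = subclass_closure(asserted)
inferred = sorted(closure - asserted)
for sub, sup in inferred:
    print(f"{sub} SubClassOf {sup}")
```

A reasoner such as ELK handles far more than transitivity; the sketch only illustrates why the closure contains more axioms than were asserted.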

# Embedding ontologies

mOWL implements several different ontology embeddings. The overall recipe of embedding ontologies is:
* generate a Dataset for the ontology
* project the OWL ontology into a form suitable for an embedding (for example, a graph)
* apply the embedding model
* infer axioms using an inference model
* (optional) evaluate the embeddings using an evaluation set

## Datasets

mOWL operates on OWL axioms, and every dataset consists of a set of OWL axioms (here, also called an ontology). mOWL also provides several datasets for testing purposes, and we will use a small dataset here first, the PPI Yeast Slim Dataset.

PPIYeastSlimDataset consists of axioms from the Gene Ontology (GO), in particular the "yeast slim" of the GO, a set of yeast proteins, and an association between proteins and GO classes. The GO is natively available in OWL, but the associations are commonly available only as "annotation" files from various websites. This dataset makes a particular ontological commitment and represents all proteins as OWL classes. Given a protein $P$ and GO class $G$ that is an annotation of $P$, the following axiom is in the PPIYeastSlimDataset: $P \sqsubseteq \exists hasFunction.G$. PPIYeastSlimDataset further adds protein--protein interactions to the ontology; if protein $P_1$ interacts with $P_2$, the axioms $P_1 \sqsubseteq \exists interactsWith.P_2$ and $P_2 \sqsubseteq \exists interactsWith.P_1$ are added.
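
The construction described above can be sketched in plain Python. This is only an illustration of the two axiom patterns, written as Manchester-style strings; the protein and GO identifiers and the `build_axioms` helper are hypothetical, and this is not how mOWL builds the dataset internally.

```python
def build_axioms(annotations, interactions):
    """Turn annotations and interactions into axiom strings."""
    axioms = []
    for protein, go_class in annotations:
        # P SubClassOf hasFunction some G
        axioms.append(f"{protein} SubClassOf hasFunction some {go_class}")
    for p1, p2 in interactions:
        # Interactions are added in both directions
        axioms.append(f"{p1} SubClassOf interactsWith some {p2}")
        axioms.append(f"{p2} SubClassOf interactsWith some {p1}")
    return axioms


# Hypothetical placeholder identifiers
annotations = [("P1", "GO_A"), ("P2", "GO_B")]
interactions = [("P1", "P2")]

toy_axioms = build_axioms(annotations, interactions)
for axiom in toy_axioms:
    print(axiom)
```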

We can print the axioms in the ontology underlying `PPIYeastSlimDataset`:

In [None]:
from mowl.datasets.ppi_yeast import PPIYeastSlimDataset

dataset = PPIYeastSlimDataset()
dataset.get_labels()
count = 0
for i in dataset.ontology.getAxioms(True):
    if count < 100:
        print(i)
        count += 1


A Dataset may additionally have validation and testing data. Both validation and testing are again sets of axioms (ontologies). For the `PPIYeastSlimDataset`, both validation and testing are done only on interactions. We can investigate the axioms used for testing:

In [None]:
count = 0
for i in dataset.testing.getAxioms(True):
    if count < 100:
        print(i)
        count += 1
print(dataset.testing.getAxioms(True).size())

## Graph generation

We generate a graph by projecting axioms in the ontology onto edges in a heterogeneous graph (or knowledge graph). There are several methods available for this operation; here we rely on the DL2Vec method, which generates edges from axioms based on a set of patterns.

In [None]:
from mowl.projection.dl2vec.model import DL2VecProjector

projector = DL2VecProjector(True)
edges = projector.project(dataset.ontology)
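
To illustrate what pattern-based projection means, here is a minimal sketch in plain Python. It assumes axioms are given as Manchester-style strings and handles only two patterns, so it is much simpler than `DL2VecProjector` and should not be read as its actual rule set. The `project_axioms` helper and the toy identifiers are hypothetical.

```python
import re

# Two illustrative patterns: "A SubClassOf r some B" and "A SubClassOf B"
EXISTENTIAL = re.compile(r"^(\S+) SubClassOf (\S+) some (\S+)$")
SUBCLASS = re.compile(r"^(\S+) SubClassOf (\S+)$")


def project_axioms(axioms):
    """Map each axiom that matches a pattern to a (source, relation, target) edge."""
    edges = []
    for axiom in axioms:
        match = EXISTENTIAL.match(axiom)
        if match:
            sub, rel, filler = match.groups()
            edges.append((sub, rel, filler))
            continue
        match = SUBCLASS.match(axiom)
        if match:
            sub, sup = match.groups()
            edges.append((sub, "subClassOf", sup))
        # Axioms that match no pattern are silently dropped
    return edges


toy_axioms = [
    "P1 SubClassOf hasFunction some GO_A",
    "GO_A SubClassOf GO_B",
    "GO_A DisjointWith GO_C",
]
toy_edges = project_axioms(toy_axioms)
print(toy_edges)
```

Note that the disjointness axiom produces no edge in this sketch: any axiom outside the supported patterns is lost during projection.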


We can now visualize the graph generated using the `networkx` package (it's a bit large, so we just visualize parts of the graph):

In [None]:
import networkx as nx
import matplotlib.pyplot as plt

elist = []
count = 1000
for i in edges:
    if count > 0:
        elist.append( (i.src(), i.dst()) )
    count -= 1
    
G=nx.from_edgelist(elist)
nx.draw(G, node_size=10)
plt.show()

Now that we have generated a graph from the OWL axioms, we can embed the graph using any (heterogeneous) graph embedding method. The reason we need a method to embed "heterogeneous" graphs is that the projection operations we use consider the relation types, and they should be treated differently in the embedding. Fortunately, there are *many* methods to generate [Knowledge Graph Embeddings](https://persagen.com/files/misc/Wang2017Knowledge.pdf) and mOWL provides access to most of them either by directly implementing them or through the [PyKEEN library](https://github.com/pykeen/pykeen).

Let's start by using embeddings based on random walks over the graph followed by Word2Vec. This method applies a repeated random walk starting from each node to generate a "corpus", followed by a word embedding that captures co-occurrence relations in this corpus. We have to set some parameters: the number of walks from each node; the length/depth of the random walk; a restart probability; and a file to write these walks to:

In [None]:
from mowl.walking.deepwalk.model import DeepWalk
from mowl.walking.node2vec.model import Node2Vec

tmp = tempfile.NamedTemporaryFile()
walker = Node2Vec(
    100,  # number of walks
    10,  # length of each walk
    0,  # probability of restart
    workers=8,  # number of usable CPUs
    outfile=tmp.name,
    q=0.9,
)

walks = walker.walk(edges)
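
As a rough illustration of how such a corpus comes about, the following pure-Python sketch generates random walks with restart over a small edge list and prints one walk per line. It is a simplified stand-in, not the mOWL `Node2Vec` or `DeepWalk` implementation: the `random_walks` helper, the way a restart is recorded in the walk, and the toy edges are all assumptions made for this example.

```python
import random


def random_walks(edges, num_walks, walk_length, restart_prob, seed=0):
    """Generate walks as token lists of the form node, relation, node, ..."""
    rng = random.Random(seed)
    adjacency = {}
    for src, rel, dst in edges:
        adjacency.setdefault(src, []).append((rel, dst))

    walks = []
    for start in sorted(adjacency):
        for _ in range(num_walks):
            walk, current = [start], start
            for _step in range(walk_length):
                if rng.random() < restart_prob:
                    # Restart: jump back to the start node
                    current = start
                    walk.append(start)
                    continue
                neighbours = adjacency.get(current, [])
                if not neighbours:
                    break  # dead end, stop this walk early
                rel, current = rng.choice(neighbours)
                walk.extend([rel, current])
            walks.append(walk)
    return walks


toy_edges = [
    ("P1", "interactsWith", "P2"),
    ("P2", "interactsWith", "P1"),
    ("P1", "hasFunction", "GO_A"),
]
toy_walks = random_walks(toy_edges, num_walks=3, walk_length=4, restart_prob=0.1)
for walk in toy_walks:
    print(" ".join(walk))  # one "sentence" per line, as a word embedding expects
```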


We can now embed the corpus using a language model like Word2Vec. Word2Vec captures co-occurrence relations within a window. We just use a standard Word2Vec implementation here. The parameters we have to set are the embedding method (Skipgram or Continuous Bag Of Words), the minimum occurrence count of a word (should be set to `1` as otherwise some embeddings may be missing), the embedding size, the window size (within which co-occurrence is evaluated), and the number of epochs.

In [None]:
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

corpus = LineSentence(tmp.name)
w2v_model = Word2Vec(
       corpus,
       sg=1,
       min_count=1,
       vector_size=50,
       window = 10,
       epochs = 2,
       workers = 16)

vectors = w2v_model.wv
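
To see what "co-occurrence within a window" means, the sketch below lists the (center, context) pairs that a Skipgram model would be trained on for a single walk. It is a simplified illustration with a hypothetical `skipgram_pairs` helper; actual implementations such as gensim add refinements (for example negative sampling) that are omitted here.

```python
def skipgram_pairs(sentence, window):
    """List (center, context) pairs for all tokens within `window` positions."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs


# A hypothetical walk over the projected graph
toy_walk = ["P1", "interactsWith", "P2", "hasFunction", "GO_A"]

narrow = skipgram_pairs(toy_walk, window=1)
wide = skipgram_pairs(toy_walk, window=2)
print(len(narrow), len(wide))
print(("P1", "P2") in narrow, ("P1", "P2") in wide)
```

With a window of 1, a node only co-occurs with the adjacent relation tokens; a larger window is needed before two nodes of the same walk are ever paired.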


The result of the embedding is a set of vectors representing each word in the corpus, and therefore one vector for each entity that was included in the graph generated from the ontology axioms. We can visualize these embeddings using t-SNE.

In [None]:
# The mOWL TSNE class is a wrapper around the t-SNE implementation in sklearn.manifold.
from mowl.visualization.base import TSNE

labels = dataset.get_labels()  # PPIYeast and PPIYeastSlim datasets contain EC number labels, as a dictionary of the form entity_name -> label_name

tsne = TSNE(vectors, labels)
tsne.generate_points(5000, workers = 14, verbose = 1)
tsne.show()

# The t-SNE plot can be saved using the following line:
# tsne.savefig("path_to_image.jpg")


As we can see, the embeddings cluster, to some extent, according to their Enzyme Commission (EC) classification. We may then use a similarity function between the embedding vectors to generate "meaningful" relations. In mOWL, meaningful relations between OWL entities are expressed in the form of OWL axioms. To obtain axioms from the embeddings, we need an inference method that uses similarity to determine axioms. Here, we rely on cosine similarity between (proteins) $X$ and $Y$ in order to predict axioms of the form $X \sqsubseteq \exists interactsWith.Y$.

## Inference

Visualizing the embeddings is nice, but we may want some more specific answers from these embeddings. In particular, we can use them to infer OWL axioms that should hold true. For performing inference, we need two ingredients. First, we need an axiom (type) that we would like to infer; here we want to infer axioms of the type "X SubClassOf: interacts-with some Y" (because these are the axioms in our test set). The way mOWL implements the inference is that it computes a score for query axioms, possibly iterating through sets of classes; the scoring function can be chosen depending on the type of embedding that is used; here, we just use cosine similarity between the embeddings of the class "X" and "Y" to compute the score of "X SubClassOf: interacts-with some Y".

In [None]:
from mowl.inference.cosine import CosineSimilarityInfer

cosine_infer = CosineSimilarityInfer(vectors, "http://interacts_with")
# Raw string so that the regex escape "\." is passed through unchanged
preds = cosine_infer.score(r"c?.*?4932\.(Q).*? SubClassOf http://interacts_with some  c?.*?4932.*?")
len(preds)
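
The scoring idea can be sketched without mOWL. The example below assumes a small dictionary of hand-picked 2-dimensional vectors and scores every candidate axiom "X SubClassOf interactsWith some Y" by the cosine similarity of X and Y. The `score_interactions` helper and the vectors are hypothetical and do not reflect how `CosineSimilarityInfer` is implemented.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def score_interactions(query, vectors):
    """Score 'query SubClassOf interactsWith some Y' for every other entity Y."""
    scores = {}
    for candidate, vec in vectors.items():
        if candidate != query:
            axiom = f"{query} SubClassOf interactsWith some {candidate}"
            scores[axiom] = cosine(vectors[query], vec)
    # Highest-scoring axioms first
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


# Hand-picked toy vectors, not trained embeddings
toy_vectors = {"P1": [1.0, 0.0], "P2": [0.8, 0.6], "P3": [0.0, 1.0]}
toy_scores = score_interactions("P1", toy_vectors)
for axiom, score in toy_scores.items():
    print(f"{score:.2f}  {axiom}")
```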


Now we can look at some of the scores of these axioms; based on the scores, we can also compute metrics with respect to a test set of axioms, including recall at certain ranks, ROC AUC, etc.

In [None]:
preds_it = iter(preds.items())
for i in range(10):
    print(next(preds_it))
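
As an illustration of rank-based metrics, the following sketch computes the rank of the true interaction partner among scored candidates, then the fraction of test cases ranked within the top k and the mean rank. The data and helper names are hypothetical, and the sketch does not use mOWL's own evaluation code.

```python
def rank_of(target, scores):
    """1-based rank of `target` when candidates are sorted by descending score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked.index(target) + 1


def hits_at_k(ranks, k):
    """Fraction of test cases whose true answer is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Each test case: (true interaction partner, scores of all candidates)
test_cases = [
    ("P3", {"P2": 0.9, "P3": 0.4, "P4": 0.1}),
    ("P3", {"P1": 0.2, "P3": 0.7, "P4": 0.5}),
]

ranks = [rank_of(target, scores) for target, scores in test_cases]
mean_rank = sum(ranks) / len(ranks)
print("ranks:", ranks)
print("hits@1:", hits_at_k(ranks, 1))
print("mean rank:", mean_rank)
```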

## Translating embeddings

Random walks are a form of embedding of graphs that relies on adjacency. However, other knowledge graph embeddings are more explicit about the kind of graph properties they preserve. For example, [TransE](https://paperswithcode.com/method/transe) generates embeddings $e$ for nodes and edge types such that $e(h) + e(r) \approx e(t)$ if $r(h,t)$ is an edge in the graph. But there are hundreds of similar embedding methods available, and we rely on the PyKEEN library for accessing these kinds of embeddings. Here is an example of using TransE to generate an embedding of the projected graph:

In [None]:
from mowl.embeddings.translational.model import TranslationalOnt
import torch
cuda0 = torch.device('cuda:0')


trans_model = TranslationalOnt(
     edges,
     trans_method = "transE",
     embedding_dim = 50,
     epochs = 10,
     batch_size = 1024,
#     device = cuda0,
     model_filepath = "/tmp/trans_model.th"
 )

trans_model.train()
cls_embeddings, rel_embeddings = trans_model.get_embeddings()
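
The TransE condition $e(h) + e(r) \approx e(t)$ can be checked directly with a distance function. The sketch below uses hand-picked 2-dimensional vectors rather than trained embeddings, and the `transe_distance` helper is hypothetical; it only shows that an edge that fits the model has a small distance while an arbitrary triple does not.

```python
import math


def transe_distance(head, relation, tail):
    """Euclidean norm of e(h) + e(r) - e(t); small values mean a plausible edge."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Hand-picked toy embeddings (not the output of trans_model)
node_emb = {"P1": [0.0, 0.0], "P2": [1.0, 1.0], "P3": [3.0, -2.0]}
rel_emb = {"interactsWith": [1.0, 1.0]}

d_plausible = transe_distance(node_emb["P1"], rel_emb["interactsWith"], node_emb["P2"])
d_implausible = transe_distance(node_emb["P1"], rel_emb["interactsWith"], node_emb["P3"])
print(d_plausible, d_implausible)
```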

We can now visualize these embeddings as before, coloring the vectors by their EC number (useful to see how well the embeddings work for a classification task):

In [None]:
from mowl.visualization.base import TSNE

labels = dataset.get_labels()  # PPIYeast and PPIYeastSlim datasets contain EC number labels, as a dictionary of the form entity_name -> label_name
tsne = TSNE(cls_embeddings, labels)
tsne.generate_points(5000, workers = 16, verbose = 1)
tsne.show()


As before, we can apply a function to score axioms. Here, we use a different function that relies on how the TransE model performs link prediction:

In [None]:
from mowl.inference.el import GCI2Score
gci2_scorer = GCI2Score(trans_model.score_method_point, list(cls_embeddings.keys()), list(rel_embeddings.keys()))  # in this case we need to input the lists of classes and relations
print(f"Accepted pattern: {gci2_scorer.patterns}")

preds = gci2_scorer.score("c?.*?4932.Q0110.*? SubClassOf p?.*?int.*? some c?.*?4932.Q0.*?")
len(preds)

In [None]:
preds_it = iter(preds.items())
for i in range(10):
    print(next(preds_it))

## Embeddings without graph projection

So far, we have just reused knowledge graph embedding methods, and first projected ontologies onto graphs. This works great, in particular as there are so many knowledge graph embedding methods available. However, there are some disadvantages; in particular, projecting axioms onto graphs will almost always lose some information. For example, almost no graph projection method will adequately deal with disjointness between classes, or they may not consider axioms of a certain complexity. mOWL implements a number of embedding methods that are based directly on axioms. We can try one of the simplest methods, Onto2Vec, which simply applies a language model to the ontology axioms directly. We first extract the axioms as a "corpus" (set of sentences), and then we embed this using Word2Vec:

In [None]:
from mowl.corpus.base import extract_and_save_axiom_corpus, extract_annotation_corpus
from gensim.models.word2vec import LineSentence
from gensim.models import Word2Vec

tmp = tempfile.NamedTemporaryFile()
extract_and_save_axiom_corpus(dataset.ontology, tmp.name)
extract_annotation_corpus(dataset.ontology, tmp.name)

sentences = LineSentence(tmp.name)

model = Word2Vec(
         sentences,
         sg = 1,
         min_count = 1,
         vector_size = 20,
         window = 5,
         epochs = 20,
         workers = 8
     )

vectors = model.wv
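
To illustrate what "axioms as sentences" means, the sketch below turns axioms written in functional-style syntax into token lists that a language model could consume. The exact rendering that `extract_and_save_axiom_corpus` writes is not shown in this tutorial, so the input strings and the `axiom_to_tokens` helper are assumptions made for this example.

```python
def axiom_to_tokens(axiom):
    """Split an axiom string into tokens, treating parentheses as separators."""
    return axiom.replace("(", " ").replace(")", " ").split()


# Hypothetical axioms in a functional-style rendering
toy_axioms = [
    "SubClassOf(P1 ObjectSomeValuesFrom(hasFunction GO_A))",
    "DisjointClasses(GO_A GO_B)",
]

toy_sentences = [axiom_to_tokens(a) for a in toy_axioms]
vocabulary = sorted({token for sentence in toy_sentences for token in sentence})

for sentence in toy_sentences:
    print(" ".join(sentence))
print(vocabulary)
```

Unlike the graph projection, the disjointness axiom is kept: its constructor and both classes appear as tokens in the corpus.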


As before, we can now visualize the generated embeddings:

In [None]:
from mowl.visualization.base import TSNE

labels = dataset.get_labels()  # PPIYeast and PPIYeastSlim datasets contain EC number labels, as a dictionary of the form entity_name -> label_name
tsne = TSNE(vectors, labels)
tsne.generate_points(5000, workers = 16, verbose = 1)
tsne.show()


## EL Embeddings

The embedding methods so far did not really exploit any of the semantics of the OWL language. The last embedding model here is EL Embeddings; EL Embeddings generate a model *as* the embedding, i.e., with respect to a particular set of operations, the resulting embeddings generate (or approximate) a model.

In [None]:
from mowl.embeddings.elembeddings.model import ELEmbeddings
import torch

device = torch.device('cpu')  # use torch.device('cuda:0') to train on a GPU

model = ELEmbeddings(
    dataset,
    epochs = 10,
    margin = 0.1,
    model_filepath = "model.th",
    device = device
)

model.train()
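
The geometric intuition can be sketched for one axiom type, $C \sqsubseteq \exists R.D$. The sketch assumes the formulation in which every class is an n-ball (a center and a radius) and every relation is a translation vector, and it penalizes the translated ball of $C$ for not lying inside the ball of $D$. Regularization terms are omitted, the `toy_gci2_loss` helper and all numbers are hypothetical, and the result may differ from what `model.gci2_loss` computes.

```python
import math


def toy_gci2_loss(c_center, c_radius, rel, d_center, d_radius, margin=0.1):
    """Hinge loss for C SubClassOf R some D with classes modelled as n-balls.

    The loss is zero when the ball of C, translated by the relation vector,
    lies inside the ball of D (up to the margin).
    """
    dist = math.sqrt(sum((c + r - d) ** 2 for c, r, d in zip(c_center, rel, d_center)))
    return max(0.0, dist + c_radius - d_radius - margin)


# Translating C by (1, 0) places it exactly at the center of D: axiom satisfied
loss_inside = toy_gci2_loss([0.0, 0.0], 0.5, [1.0, 0.0], [1.0, 0.0], 1.0)

# D is far away from the translated C: axiom violated, positive loss
loss_outside = toy_gci2_loss([0.0, 0.0], 0.5, [1.0, 0.0], [4.0, 0.0], 1.0)

print(loss_inside, loss_outside)
```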


In [None]:
dataset.get_evaluation_classes() # returns all classes in the axioms in testing

As before, we can perform inference on these classes, using the specific scoring function of EL Embeddings:

In [None]:
from mowl.inference.el import GCI2Score
elem_cls_embs, elem_rel_embs = model.get_embeddings()
gci2_score_elem = GCI2Score(model.gci2_loss, list(elem_cls_embs.keys()), list(elem_rel_embs.keys()))  # in this case we need to input the lists of classes and relations
print(f"Accepted pattern: {gci2_score_elem.patterns}")

preds_elem = gci2_score_elem.score("c?.*?4932.Q0110.*? SubClassOf p?.*?int.*? some c?.*?4932.Q0.*?")
len(preds_elem)

In [None]:
preds_it = iter(preds_elem.items())
for i in range(10):
    print(next(preds_it))