---
# Clustering and Classification using Knowledge Graph Embeddings
---

In this tutorial we explore how to use knowledge graph embeddings, learned from a graph of international football matches played since the 19th century, in clustering and classification tasks. Knowledge graph embeddings are typically used for missing link prediction and knowledge discovery, but they can also be used for entity clustering, entity disambiguation, and other downstream tasks. The embeddings are a form of representation learning that makes it possible to apply linear algebra and machine learning to knowledge graphs, which would otherwise be difficult.


This tutorial covers:

1. Creating the knowledge graph (i.e. triples) from a tabular dataset of football matches
2. Training the ComplEx embedding model on those triples
3. Evaluating the quality of the embeddings on a validation set
4. Clustering the embeddings, comparing to the natural clusters formed by the geographical continents
5. Applying the embeddings as features in a classification task, to predict match results
6. Evaluating the predictive model on an out-of-time test set, comparing to a simple baseline

We will show that clusters of knowledge graph embeddings capture implicit geographical information from the graph, and that the embeddings can be a useful source of features for a downstream classification task, significantly improving accuracy over the baseline.
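Step 1 above, turning a table into triples, can be sketched as follows. This is a minimal illustration rather than the tutorial's actual preprocessing: the rows, the column names (`match_id`, `home_team`, and so on) and the predicate names are all hypothetical.

```python
# Hypothetical rows standing in for a tabular dataset of matches.
matches = [
    {"match_id": "Match1", "home_team": "Scotland", "away_team": "England",
     "tournament": "Friendly", "city": "Glasgow"},
    {"match_id": "Match2", "home_team": "England", "away_team": "Scotland",
     "tournament": "Friendly", "city": "London"},
]

# Assumed mapping from a column to the predicate linking a match to that value.
PREDICATES = {
    "home_team": "hasHomeTeam",
    "away_team": "hasAwayTeam",
    "tournament": "inTournament",
    "city": "inCity",
}


def rows_to_triples(rows, predicates, id_column="match_id"):
    """Emit one (subject, predicate, object) triple per non-empty cell."""
    triples = []
    for row in rows:
        subject = row[id_column]
        for column, predicate in predicates.items():
            value = row.get(column)
            if value is None or value == "":
                continue  # skip missing cells instead of creating empty entities
            triples.append((subject, predicate, str(value)))
    return triples


triples = rows_to_triples(matches, PREDICATES)
for triple in triples:
    print(triple)
```

Each row becomes an entity of its own (the match), and every other column becomes an edge from that entity, which is one common way to express a table as a graph.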

---

## Requirements

This tutorial requires a Python environment with the AmpliGraph library installed. Please follow the [install guide](http://docs.ampligraph.org/en/latest/install.html).

Some sanity checks:

In [2]:
!pip install "tensorflow-gpu>=1.15.2,<2.0" ampligraph



In [3]:
import numpy as np
import pandas as pd
import ampligraph
import ampligraph.datasets

ampligraph.__version__

'1.4.0'

In [4]:
import tensorflow as tf

tf.test.is_gpu_available()

True

In [5]:
anigraph = ampligraph.datasets.load_from_rdf("/content/sample_data", "test.owl", rdf_format="n3")

In [5]:
import rdflib
url = 'sample_data/test.owl'
g = rdflib.Graph()
result = g.parse(url, format='turtle')

In [7]:
knows_query = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX ani: <http://ani.me#>
SELECT ?character
WHERE
{
  ?character a ani:Character
}  
"""

# Collect the IRI of every ani:Character instance as a plain string.
qres = g.query(knows_query)
characters = []
for row in qres:
    characters.append(str(row.character))
print(characters)


In [8]:
knows_query = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX ani: <http://ani.me#>
SELECT ?op
WHERE
{
  ?op a owl:ObjectProperty
}  
"""

# Keep only the object properties that encode character-to-character relations.
qres = g.query(knows_query)
relations = []
for row in qres:
    if 'op_relation_' in str(row.op):
        relations.append(str(row.op))
print(relations)


In [8]:
from ampligraph.evaluation import train_test_split_no_unseen 

X_train, X_valid = train_test_split_no_unseen(anigraph, test_size=16000)

In [9]:
print('Train set size: ', X_train.shape)
print('Validation set size: ', X_valid.shape)

Train set size:  (184050, 3)
Validation set size:  (16000, 3)
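The idea behind a "no unseen" split is that every entity and relation in the held-out set must still occur in the training set, otherwise the model has no embedding for it. The sketch below illustrates that constraint in plain Python. It is not AmpliGraph's implementation; `split_no_unseen` and the toy triples are hypothetical.

```python
import random
from collections import Counter


def split_no_unseen(triples, test_size, seed=0):
    """Hold out `test_size` triples without creating unseen entities or relations."""
    rng = random.Random(seed)
    order = list(range(len(triples)))
    rng.shuffle(order)

    # How often each entity and relation occurs in what is currently the training set.
    entity_counts = Counter()
    relation_counts = Counter()
    for s, p, o in triples:
        entity_counts.update([s, o])
        relation_counts[p] += 1

    held_out = set()
    for i in order:
        if len(held_out) == test_size:
            break
        s, p, o = triples[i]
        lost = Counter([s, o])  # occurrences the training set would lose
        entities_survive = all(entity_counts[e] > n for e, n in lost.items())
        if entities_survive and relation_counts[p] > 1:
            held_out.add(i)
            entity_counts.subtract(lost)
            relation_counts[p] -= 1

    if len(held_out) < test_size:
        raise ValueError("cannot hold out that many triples without unseen terms")

    train = [t for i, t in enumerate(triples) if i not in held_out]
    test = [t for i, t in enumerate(triples) if i in held_out]
    return train, test


toy = [
    ("a", "knows", "b"), ("b", "knows", "c"), ("c", "knows", "a"),
    ("a", "likes", "c"), ("b", "likes", "a"), ("d", "knows", "a"),
]
train, test = split_no_unseen(toy, test_size=2)
print(train)
print(test)
```

In the toy data, entity `d` occurs only once, so its triple can never be moved to the held-out set.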


In [10]:
from ampligraph.latent_features import ComplEx

model = ComplEx(batches_count=50,
                epochs=300,
                k=100,
                eta=20,
                optimizer='adam', 
                optimizer_params={'lr':1e-4},
                loss='multiclass_nll',
                regularizer='LP', 
                regularizer_params={'p':3, 'lambda':1e-5}, 
                seed=0, 
                verbose=True)
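For intuition, ComplEx scores a triple as the real part of a trilinear product between the subject embedding, the relation embedding, and the complex conjugate of the object embedding. The sketch below computes that score with Python's built-in complex numbers; the two-dimensional vectors are made up for illustration and are unrelated to the trained model.

```python
def complex_score(e_s, w_p, e_o):
    """ComplEx score: Re( sum_i e_s[i] * w_p[i] * conj(e_o[i]) )."""
    total = sum(s * r * o.conjugate() for s, r, o in zip(e_s, w_p, e_o))
    return total.real


# Made-up embeddings with two complex dimensions each.
subject = [1 + 1j, 2 + 0j]
relation = [0 + 1j, 1 + 0j]
obj = [1 - 1j, 0.5 + 0j]

forward = complex_score(subject, relation, obj)
backward = complex_score(obj, relation, subject)
print(forward, backward)
```

Because of the conjugate, swapping subject and object generally changes the score, which is what lets the model represent asymmetric relations.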

In [11]:
import tensorflow as tf

# Silence TensorFlow 1.x messages below ERROR level before training.
tf.logging.set_verbosity(tf.logging.ERROR)

model.fit(X_train)

Average ComplEx Loss:   0.362071: 100%|██████████| 300/300 [13:15<00:00,  2.65s/epoch]


In [13]:
import ampligraph.utils as au
au.save_model(model, model_name_path='aaa.model')

In [None]:
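from ampligraph.evaluation import evaluate_performance

# Known true triples (train + validation) are filtered out of the corruptions,
# so that a true statement is never counted as a negative.
filter_triples = np.concatenate((X_train, X_valid))
ranks = evaluate_performance(X_valid,
                             model=model, 
                             filter_triples=filter_triples,
                             use_default_protocol=True,
                             verbose=True)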
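The role of `filter_triples` can be illustrated with a small, library-free sketch of filtered ranking: the true triple is ranked against its corruptions, but corruptions that are themselves known to be true are skipped. The helper `filtered_rank`, the toy scores, and the choice to count only strictly higher scores are assumptions made for this example, not AmpliGraph's code.

```python
def filtered_rank(triple, entities, score, known_triples, corrupt="o"):
    """Rank of `triple` among its corruptions, skipping known true triples."""
    s, p, o = triple
    true_score = score(triple)
    rank = 1
    for e in entities:
        candidate = (s, p, e) if corrupt == "o" else (e, p, o)
        if candidate == triple or candidate in known_triples:
            continue  # not a real negative, so it must not hurt the rank
        if score(candidate) > true_score:
            rank += 1
    return rank


# Toy scores standing in for a trained model.
toy_scores = {
    ("a", "knows", "a"): 0.1,
    ("a", "knows", "b"): 0.5,  # the triple being evaluated
    ("a", "knows", "c"): 0.9,  # also true, already present in the training set
    ("a", "knows", "d"): 0.7,
}


def toy_score(t):
    return toy_scores.get(t, 0.0)


entities = ["a", "b", "c", "d"]
known = {("a", "knows", "c")}

raw = filtered_rank(("a", "knows", "b"), entities, toy_score, known_triples=set())
filtered = filtered_rank(("a", "knows", "b"), entities, toy_score, known_triples=known)
print(raw, filtered)
```

Without the filter, the true statement `("a", "knows", "c")` would push the evaluated triple one rank lower even though the model did nothing wrong.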

In [14]:
from ampligraph.evaluation import mr_score, mrr_score, hits_at_n_score

mr = mr_score(ranks)
mrr = mrr_score(ranks)

print("MRR: %.2f" % (mrr))
print("MR: %.2f" % (mr))

hits_10 = hits_at_n_score(ranks, n=10)
print("Hits@10: %.2f" % (hits_10))
hits_3 = hits_at_n_score(ranks, n=3)
print("Hits@3: %.2f" % (hits_3))
hits_1 = hits_at_n_score(ranks, n=1)
print("Hits@1: %.2f" % (hits_1))

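The three metrics reported above are simple functions of the ranks. As a sketch, assuming a flat list of integer ranks where 1 is best, they can be computed by hand as follows; the function names and the example ranks are hypothetical and are not the AmpliGraph functions.

```python
def mean_rank(ranks):
    """MR: average rank; lower is better."""
    return sum(ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    """MRR: average of 1/rank; higher is better, at most 1."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_n(ranks, n):
    """Hits@N: fraction of ranks that are at most n."""
    return sum(1 for r in ranks if r <= n) / len(ranks)


example_ranks = [1, 2, 4, 10, 50]
print("MR: %.2f" % mean_rank(example_ranks))
print("MRR: %.3f" % mean_reciprocal_rank(example_ranks))
print("Hits@10: %.2f" % hits_at_n(example_ranks, 10))
print("Hits@3: %.2f" % hits_at_n(example_ranks, 3))
print("Hits@1: %.2f" % hits_at_n(example_ranks, 1))
```

MR is sensitive to a few very bad ranks (the single rank of 50 dominates here), while MRR and Hits@N mostly reward triples ranked near the top.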

In [1]:
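from scipy.special import expit

# Score a single candidate statement (subject, predicate, object).
statements = np.array([
    ['http://ani.me#character_anidbch35717',
     'http://ani.me#op_relation_is_the_pilot_of',
     'http://ani.me#character_anidbch38630']
])
scores = model.predict(statements)
print(scores)

# Squash the raw score into the (0, 1) range with the logistic sigmoid.
probs = expit(scores)
probs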

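`expit` is the logistic sigmoid, which maps any real-valued score into the interval (0, 1). A minimal stdlib equivalent for a single float is sketched below; the name `sigmoid` is hypothetical. The result is best read as a squashed score: without a separate calibration step it is an assumption, not a guarantee, that it behaves like a probability.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x)), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # safe: x is negative, so exp(x) lies in [0, 1)
    return z / (1.0 + z)


for raw_score in (-5.0, 0.0, 5.0):
    print(raw_score, sigmoid(raw_score))
```

The two branches are algebraically identical; the split only keeps `math.exp` from being called with a large positive argument.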

In [6]:
import json

rels = relations
chars = characters
statements = []
triplets = []

# Enumerate every (who, relation, whom) combination as a candidate statement.
for who in chars:
    for relation in rels:
        for whom in chars:
            statements.append([who, relation, whom])
            triplets.append({"relation": relation, "prob": None, "who": who, "whom": whom})

# Convert to a NumPy array of shape (n, 3) before scoring.
scores = model.predict(np.array(statements))
probs = expit(scores)

# Attach each score to its statement; both sequences share the same order.
for triplet, prob in zip(triplets, probs):
    triplet['prob'] = float(prob)

with open('triplets.json', "w") as f:
    json.dump(triplets, f)
pd.DataFrame(triplets)

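The nested loops above build every combination in memory, which grows as the number of characters squared times the number of relations. A hedged alternative, sketched below with hypothetical helper names, generates the same candidates lazily with `itertools.product` and groups them into batches that could be scored one at a time. Skipping self-pairs is an assumption of this sketch; the loops above keep them.

```python
from itertools import islice, product


def candidate_statements(entities, relations, allow_self=False):
    """Lazily yield (who, relation, whom) in the same order as the nested loops."""
    for who, relation, whom in product(entities, relations, entities):
        if not allow_self and who == whom:
            continue
        yield (who, relation, whom)


def batched(iterable, size):
    """Yield lists of up to `size` items until the iterable is exhausted."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


toy_entities = ["a", "b", "c"]
toy_relations = ["r1", "r2"]

all_pairs = list(candidate_statements(toy_entities, toy_relations, allow_self=True))
no_self = list(candidate_statements(toy_entities, toy_relations))
batch_sizes = [len(b) for b in batched(all_pairs, 5)]
print(len(all_pairs), len(no_self), batch_sizes)
```

Each batch could then be converted to an array and passed to the model, so that only one batch of statements and scores is held in memory at a time.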

In [None]:
from google.colab import drive
drive.mount('/content/drive')