# Using PyKEEN Metrics for an Unsupervised Evaluation of a Knowledge Graph

This notebook presents several metrics from the PyKEEN library for evaluating the robustness of a set of triplets, specifically the triplets extracted to form our knowledge graph. For each metric, we provide a definition and a usage example to showcase its strengths and weaknesses.

In [None]:
%%capture
!pip install pykeen

In [None]:
import pykeen
from pykeen.triples import TriplesFactory
from pykeen.pipeline import pipeline
import pandas as pd
import numpy as np

INFO:pykeen.utils:Using opt_einsum


# I. Mean Rank (MR)

- **Definition**: The arithmetic mean of all individual ranks.
- **Ranking**: For each test triple, the head or tail entity is removed and the model ranks all candidate entities in the knowledge graph by their likelihood of being the correct one. The rank is the position of the true entity in this sorted list.
- **Interpretation**: A lower MR indicates better performance: it implies that the model ranks the true triples higher. MR is sensitive to the size of the dataset and the number of entities, because more candidate entities allow larger ranks.
- **Limitation**: MR can be heavily influenced by a few very poorly ranked true triples, which skew the overall mean. It also depends on the model used and on which side (head or tail) is being predicted.
- **Conclusions**: Relying only on MR, as computed by a single model, might not be the most effective way to evaluate the correctness of a set of triplets.
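The ranking step and the outlier sensitivity described above can be checked by hand, independently of PyKEEN. The following is a minimal sketch: the scores, the ranks, and the helper `rank_of` are hypothetical values and names chosen for illustration, not output from any trained model.

```python
from statistics import mean


def rank_of(true_entity, scores):
    """Return the 1-based position of true_entity when candidates
    are sorted by descending score (higher score = more plausible)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(true_entity) + 1


# Hypothetical scores for the query (A, likes, ?).
scores = {"Coffee": 0.9, "Tea": 0.7, "B": 0.2, "A": 0.1}
coffee_rank = rank_of("Coffee", scores)
tea_rank = rank_of("Tea", scores)

# Hypothetical ranks of five true triples, without and with one bad outlier.
ranks = [1, 2, 1, 3, 2]
ranks_with_outlier = [1, 2, 1, 3, 200]

mr = mean(ranks)
mr_outlier = mean(ranks_with_outlier)

print(f"rank of Coffee: {coffee_rank}, rank of Tea: {tea_rank}")
print(f"MR without outlier: {mr:.1f}")
print(f"MR with outlier:    {mr_outlier:.1f}")
```

Four of the five ranks are identical in both lists, yet the single rank of 200 moves the mean from 1.8 to 41.4, which is the skew mentioned in the limitation.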

In [None]:
triplets = np.array([
    ['A', 'knows', 'B'],
    ['A', 'likes', 'Coffee'],
    ['B', 'likes', 'Tea'],
])

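# Map entity and relation labels to integer IDs.
tf = TriplesFactory.from_labeled_triples(triplets)

# Train a small TransE model. The same triples are used for training and
# testing, so the ranks are measured on triples the model has already seen.
results = pipeline(
    model='TransE',
    training=tf,
    testing=tf,
    training_loop='slcwa',  # negative sampling under the stochastic local closed-world assumption
    training_kwargs=dict(num_epochs=5)
)

# Read the mean rank from the rank-based evaluation results.
mr_score = results.metric_results.get_metric('mean_rank')
print(f"Mean Rank (MR): {mr_score:.2f}")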

INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/5 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=3.


Evaluating on cuda:0:   0%|          | 0.00/3.00 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 0.11s seconds


Mean Rank (MR): 3.00


In [None]:
modified_triplets = np.array([
    ['A', 'knows', 'B'],
    ['B', 'knows', 'C'],
    ['C', 'knows', 'A'],
    ['A', 'likes', 'Coffee'],
    ['B', 'likes', 'Tea'],
    ['C', 'likes', 'Juice'],
    ['A', 'visits', 'Paris'],
    ['B', 'visits', 'London'],
    ['C', 'visits', 'Berlin'],
    ['Paris', 'located_in', 'France'],
    ['London', 'located_in', 'England'],
    ['Berlin', 'located_in', 'Germany'],
])

modified_tf = TriplesFactory.from_labeled_triples(modified_triplets)

modified_results = pipeline(
    model='TransE',
    training=modified_tf,
    testing=modified_tf,
    training_loop='slcwa',
    training_kwargs=dict(num_epochs=5)
)

modified_mr_score = modified_results.metric_results.get_metric('mean_rank')
print(f"Modified Mean Rank (MR): {modified_mr_score:.2f}")


INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/5 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=12.


Evaluating on cuda:0:   0%|          | 0.00/12.0 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 0.16s seconds


Modified Mean Rank (MR): 6.21


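- **Inference**: MR is likely to be lower (better) for a KG with correct triples, because the model can learn accurate relationships and rank the true triples higher. Conversely, for a KG containing incorrect triples, MR is expected to be higher (worse), reflecting the model's difficulty in ranking triples after learning from misleading information. Note that the two toy graphs above differ in size (4 versus 12 entities), so their MR values (3.00 versus 6.21) are not directly comparable: a larger graph has more candidate entities and therefore allows larger ranks.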
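To get a feeling for how much of an MR value is explained by graph size alone, it can help to compare it with the mean rank of a purely random ranking. The sketch below assumes every entity is a candidate and that a random ranker places the true entity uniformly at random, which gives an expected rank of (n + 1) / 2 for n candidates. The helper names are hypothetical, and PyKEEN's filtered evaluation may use fewer candidates, so this is only a rough reference point.

```python
def count_entities(triples):
    """Count distinct entities appearing as head or tail."""
    return len({t[0] for t in triples} | {t[2] for t in triples})


def expected_random_mean_rank(num_candidates):
    """Expected rank when the true entity is placed uniformly at random."""
    return (num_candidates + 1) / 2


small_graph = [
    ("A", "knows", "B"),
    ("A", "likes", "Coffee"),
    ("B", "likes", "Tea"),
]
large_graph = small_graph + [
    ("B", "knows", "C"), ("C", "knows", "A"), ("C", "likes", "Juice"),
    ("A", "visits", "Paris"), ("B", "visits", "London"), ("C", "visits", "Berlin"),
    ("Paris", "located_in", "France"), ("London", "located_in", "England"),
    ("Berlin", "located_in", "Germany"),
]

n_small = count_entities(small_graph)
n_large = count_entities(large_graph)
baseline_small = expected_random_mean_rank(n_small)
baseline_large = expected_random_mean_rank(n_large)

print(f"{n_small} entities -> random-ranking MR of about {baseline_small}")
print(f"{n_large} entities -> random-ranking MR of about {baseline_large}")
```

Under these assumptions the baselines are 2.5 and 6.5, so the reported values of 3.00 and 6.21 are both close to what random ranking would produce after only five epochs.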

# II. Mean Reciprocal Rank (MRR)

- **Definition**: The arithmetic mean of the reciprocal ranks.
- **Ranking**: Each rank is replaced by its reciprocal (1/rank), so placing the correct answer at the top position (rank 1) has a large impact, while answers further down the list contribute progressively less.
- **Interpretation**: A high MRR value is good. It means that the model usually puts the correct answer very close to the top of its list of guesses.
- **Limitation**: MRR is largely insensitive to differences among low ranks: whether the correct answer sits at rank 10 or at rank 100 changes its reciprocal rank by less than 0.1.
- **Conclusions**: Using MRR to measure the correctness of a set of triples from a knowledge graph is uncertain, since it focuses on the top results and largely ignores lower-ranked triplets. MRR won't tell us whether a triple is correct or not; it only tells us about the relative ranking of correct triples, which could still be of interest to us.
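The asymmetry between top ranks and low ranks can be seen with a few hand-picked numbers. This is a minimal sketch with hypothetical rank lists and a hypothetical helper `mean_reciprocal_rank`; it does not call PyKEEN.

```python
from statistics import mean


def mean_reciprocal_rank(ranks):
    """Arithmetic mean of 1/rank over all ranked triples."""
    return mean([1.0 / r for r in ranks])


base = [1, 2, 10]        # reference ranks
worse_tail = [1, 2, 100]  # the worst rank degrades from 10 to 100
worse_top = [2, 2, 10]    # the best rank degrades from 1 to 2

mrr_base = mean_reciprocal_rank(base)
mrr_tail = mean_reciprocal_rank(worse_tail)
mrr_top = mean_reciprocal_rank(worse_top)

print(f"base:       MRR={mrr_base:.3f}  MR={mean(base):.2f}")
print(f"worse tail: MRR={mrr_tail:.3f}  MR={mean(worse_tail):.2f}")
print(f"worse top:  MRR={mrr_top:.3f}  MR={mean(worse_top):.2f}")
```

Moving one answer from rank 10 to rank 100 lowers MRR only from about 0.533 to 0.503, while moving one answer from rank 1 to rank 2 lowers it to about 0.367. MR reacts in the opposite way, which is why the two metrics are usually read together.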

In [None]:
triplets = np.array([
    ['A', 'knows', 'B'],
    ['A', 'likes', 'Coffee'],
    ['B', 'likes', 'Tea'],
])

tf = TriplesFactory.from_labeled_triples(triplets)

results = pipeline(
    model='TransE',
    training=tf,
    testing=tf,
    training_loop='slcwa',
    training_kwargs=dict(num_epochs=5)
)

mrr_score = results.metric_results.get_metric('mean_reciprocal_rank')
print(f"Mean Reciprocal Rank (MRR) Score: {mrr_score:.3f}")

INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/5 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=3.


Evaluating on cuda:0:   0%|          | 0.00/3.00 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 0.16s seconds


Mean Reciprocal Rank (MRR) Score: 0.375


In [None]:
modified_triplets = np.array([
    ['A', 'knows', 'B'],
    ['B', 'knows', 'C'],
    ['C', 'knows', 'A'],
    ['A', 'likes', 'Coffee'],
    ['B', 'likes', 'Tea'],
    ['C', 'likes', 'Juice'],
    ['A', 'visits', 'Paris'],
    ['B', 'visits', 'London'],
    ['C', 'visits', 'Berlin'],
    ['Paris', 'located_in', 'France'],
    ['London', 'located_in', 'England'],
    ['Berlin', 'located_in', 'Germany'],
])

modified_tf = TriplesFactory.from_labeled_triples(modified_triplets)

modified_results = pipeline(
    model='TransE',
    training=modified_tf,
    testing=modified_tf,
    training_loop='slcwa',
    training_kwargs=dict(num_epochs=5)
)

modified_mrr_score = modified_results.metric_results.get_metric('mean_reciprocal_rank')
print(f"Modified Mean Reciprocal Rank (MRR) Score: {modified_mrr_score:.3f}")

INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/5 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=12.


Evaluating on cuda:0:   0%|          | 0.00/12.0 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 0.19s seconds


Modified Mean Reciprocal Rank (MRR) Score: 0.195


- **Inference**: MRR is a good metric when the first result is significantly more valuable than the rest. However, this is also a limitation when we need to evaluate the model's performance across the entire ranked list.

# III. Hits at K (Hits@K)

- **Definition**: The fraction of test cases in which the correct answer appears among the model's top K predictions.
- **Interpretation**: Hits@K allows us to evaluate the model's performance at different levels of strictness. A high Hits@K value means that the model frequently ranks the correct triples within the top K positions.
- **Limitation**: Hits@K depends strongly on the value of K and doesn't distinguish where within the top K the correct answer lies. For answers outside the top K, it also ignores how many incorrect answers are ranked above the correct one.
- **Conclusions**: Hits@K would be useful when we care about whether the model can provide a set of top K predictions that includes the correct answer, rather than its exact rank. In that case, however, it tells us little about the correctness of our set of triplets.
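The definition and the limitation above can be illustrated with a short computation on hand-picked ranks. The function `hits_at_k` and all rank values below are hypothetical and only serve as a sketch of the idea; PyKEEN computes the metric internally.

```python
def hits_at_k(ranks, k):
    """Fraction of ranks that fall within the top k positions."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of five true triples.
ranks = [1, 3, 3, 4, 12]
hits1 = hits_at_k(ranks, 1)
hits3 = hits_at_k(ranks, 3)
hits10 = hits_at_k(ranks, 10)

# Two very different rank lists that Hits@3 cannot tell apart.
always_first = hits_at_k([1, 1, 1], 3)
always_third = hits_at_k([3, 3, 3], 3)

print(f"Hits@1={hits1:.1f}  Hits@3={hits3:.1f}  Hits@10={hits10:.1f}")
print(f"Hits@3 for ranks [1, 1, 1]: {always_first:.1f}")
print(f"Hits@3 for ranks [3, 3, 3]: {always_third:.1f}")
```

The same ranks yield 0.2, 0.6, and 0.8 depending on K, and a model that always ranks the truth first scores the same Hits@3 as one that always ranks it third.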

In [None]:
triplets = np.array([
    ['A', 'knows', 'B'],
    ['A', 'likes', 'Coffee'],
    ['B', 'likes', 'Tea'],
    ['B', 'knows', 'C'],
])

tf = TriplesFactory.from_labeled_triples(triplets)

results = pipeline(
    model='TransE',
    training=tf,
    testing=tf,
    training_loop='slcwa',
    training_kwargs=dict(num_epochs=5)
)

# very strict
hits_at_1_score = results.metric_results.get_metric('hits@1')
print(f"Hits@1 Score: {hits_at_1_score:.3f}")

# moderately strict
hits_at_3_score = results.metric_results.get_metric('hits@3')
print(f"Hits@3 Score: {hits_at_3_score:.3f}")

# lenient
hits_at_10_score = results.metric_results.get_metric('hits@10')
print(f"Hits@10 Score: {hits_at_10_score:.3f}")

INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/5 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=4.


Evaluating on cuda:0:   0%|          | 0.00/4.00 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 0.08s seconds


Hits@1 Score: 0.000
Hits@3 Score: 0.625
Hits@10 Score: 1.000


In [None]:
modified_triplets = np.array([
    ['A', 'knows', 'B'],
    ['B', 'knows', 'C'],
    ['C', 'knows', 'A'],
    ['A', 'likes', 'Coffee'],
    ['B', 'likes', 'Tea'],
    ['C', 'likes', 'Juice'],
    ['A', 'visits', 'Paris'],
    ['B', 'visits', 'London'],
    ['C', 'visits', 'Berlin'],
    ['Paris', 'located_in', 'France'],
    ['London', 'located_in', 'England'],
    ['Berlin', 'located_in', 'Germany'],
])

modified_tf = TriplesFactory.from_labeled_triples(modified_triplets)

modified_results = pipeline(
    model='TransE',
    training=modified_tf,
    testing=modified_tf,
    training_loop='slcwa',
    training_kwargs=dict(num_epochs=5)
)

modified_hits_at_1_score = modified_results.metric_results.get_metric('hits@1')
print(f"Modified Dataset Hits@1 Score: {modified_hits_at_1_score:.3f}")

modified_hits_at_3_score = modified_results.metric_results.get_metric('hits@3')
print(f"Modified Dataset Hits@3 Score: {modified_hits_at_3_score:.3f}")

modified_hits_at_10_score = modified_results.metric_results.get_metric('hits@10')
print(f"Modified Dataset Hits@10 Score: {modified_hits_at_10_score:.3f}")

INFO:pykeen.pipeline.api:Using device: None


Training epochs on cuda:0:   0%|          | 0/5 [00:00<?, ?epoch/s]

Training batches on cuda:0:   0%|          | 0/1 [00:00<?, ?batch/s]

INFO:pykeen.evaluation.evaluator:Starting batch_size search for evaluation now...
INFO:pykeen.evaluation.evaluator:Concluded batch_size search with batch_size=12.


Evaluating on cuda:0:   0%|          | 0.00/12.0 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 0.08s seconds


Modified Dataset Hits@1 Score: 0.000
Modified Dataset Hits@3 Score: 0.208
Modified Dataset Hits@10 Score: 0.917


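- **Inference**: When evaluating a knowledge graph with Hits@K, the choice of K should be considered relative to the size of the dataset. For example, the first graph above contains only 5 entities, so every rank is at most 5 and Hits@10 is trivially 1.000. Hits@K measures whether the model ranks the true triplet within the top K, not whether the triplets themselves are correct.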
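A quick sanity check before reporting Hits@K is to compare K with the number of entities, since a rank can never exceed the number of candidates. The sketch below uses the four-triple graph from this section; the helper name and the set of K values are illustrative assumptions.

```python
def count_entities(triples):
    """Count distinct entities appearing as head or tail."""
    return len({t[0] for t in triples} | {t[2] for t in triples})


triples = [
    ("A", "knows", "B"),
    ("A", "likes", "Coffee"),
    ("B", "likes", "Tea"),
    ("B", "knows", "C"),
]

num_entities = count_entities(triples)

# Any K at or above the number of entities gives Hits@K = 1 by construction.
candidate_ks = (1, 3, 10)
informative_ks = [k for k in candidate_ks if k < num_entities]
trivial_ks = [k for k in candidate_ks if k >= num_entities]

print(f"entities: {num_entities}")
print(f"informative K values: {informative_ks}")
print(f"trivially satisfied K values: {trivial_ks}")
```

For this graph only K=1 and K=3 carry information, which matches the Hits@10 score of 1.000 reported for it.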