<a href="https://colab.research.google.com/github/IrinaProkofieva/KnowledgeGrpahCourse/blob/patch-1/Practice/2023/IPKN/Prokofieva_Gritsai/Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
# Clustering and Classification using Knowledge Graph Embeddings
---

In this tutorial we will explore how to use the knowledge embeddings generated by a graph of international football matches (since the 19th century) in clustering and classification tasks. Knowledge graph embeddings are typically used for missing link prediction and knowledge discovery, but they can also be used for entity clustering, entity disambiguation, and other downstream tasks. The embeddings are a form of representation learning that allows linear algebra and machine learning to be applied to knowledge graphs, which would otherwise be difficult.


We will cover in this tutorial:

1. Creating the knowledge graph (i.e. triples) from a tabular dataset of football matches
2. Training the ComplEx embedding model on those triples
3. Evaluating the quality of the embeddings on a validation set
4. Clustering the embeddings, comparing to the natural clusters formed by the geographical continents
5. Applying the embeddings as features in a classification task, to predict match results
6. Evaluating the predictive model on an out-of-time test set, comparing to a simple baseline

We will show that knowledge embedding clusters manage to capture implicit geographical information from the graph and that they can be a useful feature source for a downstream machine learning classification task, significantly increasing accuracy from the baseline.

---

## Requirements

A Python environment with the AmpliGraph library installed. Please follow the [install guide](http://docs.ampligraph.org/en/latest/install.html).

A quick sanity check of the environment:

In [None]:
!pip install tensorflow==2.9.0 ampligraph==2.0.1

In [None]:
import numpy as np
import pandas as pd
import ampligraph

from scipy.special import expit

In [None]:
import tensorflow as tf

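# tf.test.is_gpu_available() is deprecated in TensorFlow 2.x;
# this evaluates to True when at least one GPU is visible.
len(tf.config.list_physical_devices('GPU')) > 0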

True

## Graph

In [None]:
!pip install rdflib
!pip install urllib3



In [None]:
from rdflib import Graph

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
g = Graph()

g.parse('/content/drive/MyDrive/Colab Notebooks/Графы знаний/ontology-with-individuals.owl', format="turtle")
print(g.serialize(format='n3'))

In [None]:
triples_df = pd.DataFrame(g, columns=['s', 'p', 'o'])
# Strip the namespace prefix (everything up to and including the last '#')
for column in triples_df:
    triples_df[column] = triples_df[column].str.replace(r'.*#', '', regex=True)
# Strip the leading numeric film/book id (e.g. '43239_')
for column in triples_df:
    triples_df[column] = triples_df[column].str.replace(r'^\d+_', '', regex=True)

triples_df.head(5)

Unnamed: 0,s,p,o
0,Преступление_и_наказание,type,Фильм
1,Вердень_Альфонсина_Карловна,описан_в,Подросток
2,Ресслих_Гертруда_Карловна,описан_в,Преступление_и_наказание
3,Владимир_Артемов,type,Реальный_человек
4,Лыжин_Павел_Петрович,род_деятельности,Присяжный поверенный
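
To make the two regular expressions above concrete, here is a minimal sketch that applies the same patterns with the standard `re` module instead of pandas. The helper name `local_name` is hypothetical; the sample strings are taken from values that appear elsewhere in this notebook.

```python
import re


def local_name(term: str) -> str:
    """Strip the namespace prefix, then a leading numeric id such as '43239_'."""
    without_prefix = re.sub(r'.*#', '', term)    # greedy: cuts up to the last '#'
    return re.sub(r'^\d+_', '', without_prefix)  # drops 'digits_' at the start only


film = ('http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#'
        '43239_Преступление_и_наказание')
book = ('http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#'
        'Преступление_и_наказание')

print(local_name(film))
print(local_name(book))
print(local_name('1866'))  # a literal with no '#' and no 'digits_' prefix is unchanged
```

One consequence worth keeping in mind: once the ids are removed, a film and the book it adapts can end up with the same label, so `triples_df` is convenient for reading but no longer distinguishes those entities.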


In [None]:
# triples_df.to_csv('triples.csv')

## Training knowledge graph embeddings

We split our training dataset further into training and validation sets, where the new training set will be used for training the knowledge graph embeddings and the validation set will be used for their evaluation. The test set will be used to evaluate the performance of the classification algorithm built on top of the embeddings.

Unlike the standard method of randomly sampling N points to make up a validation set, our data points are pairs of entities linked by a relation, so we must make sure that every entity and relation in the validation set also appears in at least one training triple. Otherwise the model would be asked to score entities for which it has learned no embedding.

To accomplish this, AmpliGraph provides the [`train_test_split_no_unseen`](https://docs.ampligraph.org/en/latest/generated/ampligraph.evaluation.train_test_split_no_unseen.html#train-test-split-no-unseen) function.
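
The idea behind such a split can be sketched in plain Python. This is a simplified illustration under our own assumptions, not AmpliGraph's implementation: the function name `split_no_unseen` and the toy triples are hypothetical, and a triple is moved to the validation set only if each of its entities and its relation still occur in the remaining training triples.

```python
import random
from collections import Counter


def split_no_unseen(triples, test_size, seed=0):
    """Hold out up to `test_size` triples without removing the last
    occurrence of any entity or relation from the training set."""
    rng = random.Random(seed)
    order = list(range(len(triples)))
    rng.shuffle(order)

    # Count how often each entity and relation occurs in the training set.
    ent_count, rel_count = Counter(), Counter()
    for s, p, o in triples:
        ent_count[s] += 1
        ent_count[o] += 1
        rel_count[p] += 1

    held_out = set()
    for i in order:
        if len(held_out) == test_size:
            break
        s, p, o = triples[i]
        needed = Counter([s, o])  # counts 2 when subject and object coincide
        if rel_count[p] > 1 and all(ent_count[e] > n for e, n in needed.items()):
            held_out.add(i)
            ent_count[s] -= 1
            ent_count[o] -= 1
            rel_count[p] -= 1

    train = [t for i, t in enumerate(triples) if i not in held_out]
    valid = [triples[i] for i in sorted(held_out)]
    return train, valid


triples = [('a', 'r', 'b'), ('b', 'r', 'c'), ('a', 'r', 'c'),
           ('c', 'r', 'd'), ('a', 'q', 'd'), ('b', 'q', 'd')]
train, valid = split_no_unseen(triples, test_size=2)
train_entities = {e for s, _, o in train for e in (s, o)}
train_relations = {p for _, p, _ in train}
print(len(train), len(valid))
```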

In [None]:
from ampligraph.evaluation import train_test_split_no_unseen

X_train, X_valid = train_test_split_no_unseen(np.array(g), test_size=1000)

In [None]:
print('Train set size: ', X_train.shape)
print('Validation set size: ', X_valid.shape)

Train set size:  (6034, 3)
Validation set size:  (1000, 3)


In [None]:
from ampligraph.latent_features import ScoringBasedEmbeddingModel
from ampligraph.latent_features.loss_functions import get as get_loss
from ampligraph.latent_features.regularizers import get as get_regularizer
from ampligraph.evaluation import mr_score, mrr_score, hits_at_n_score
from ampligraph.utils import save_model, restore_model

# Build embeddings

## ComplEx

AmpliGraph implements [several Knowledge Graph Embedding models](https://docs.ampligraph.org/en/latest/ampligraph.latent_features.html#knowledge-graph-embedding-models) (TransE, ComplEx, DistMult, HolE). We begin with the [ComplEx](https://docs.ampligraph.org/en/latest/generated/ampligraph.latent_features.ComplEx.html#ampligraph.latent_features.ComplEx) model, which is known for strong link-prediction performance on standard benchmarks, and then train the other models for comparison.
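
For intuition, the ComplEx scoring function can be written in a few lines with Python's built-in complex numbers: the score of a triple is the real part of the sum over dimensions of subject embedding times relation embedding times the complex conjugate of the object embedding. The sketch below illustrates that formula only; it is not how AmpliGraph stores or computes embeddings, and the vectors are made-up toy values.

```python
def complex_score(e_s, w_p, e_o):
    """ComplEx score: Re(sum_i e_s[i] * w_p[i] * conj(e_o[i]))."""
    return sum(s * w * o.conjugate() for s, w, o in zip(e_s, w_p, e_o)).real


# Toy 2-dimensional complex embeddings (hypothetical values).
subject = [1 + 1j, 2 + 0j]
relation = [0 + 1j, 1 + 0j]
obj = [1 - 1j, 0 + 1j]

forward = complex_score(subject, relation, obj)
backward = complex_score(obj, relation, subject)
print(forward, backward)
```

Swapping subject and object changes the score whenever the relation embedding has a non-zero imaginary part, which is what lets ComplEx model asymmetric relations.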

The hyper-parameter choice was based on the [best results](https://docs.ampligraph.org/en/latest/experiments.html) we have found so far for the ComplEx model applied to some benchmark datasets used in the knowledge graph embeddings community. This tutorial does not cover [hyper-parameter tuning](https://docs.ampligraph.org/en/latest/examples.html#model-selection).


In [None]:
# Initialize a ComplEx neural embedding model
model = ScoringBasedEmbeddingModel(k=100,
                                   eta=20,
                                   scoring_type='ComplEx')


# Optimizer, loss and regularizer definition
optim = tf.keras.optimizers.Adam(learning_rate=1e-4)
loss = get_loss('multiclass_nll')
regularizer = get_regularizer('LP', {'p': 3, 'lambda': 1e-5})

# Compilation of the model
model.compile(optimizer=optim,
              loss=loss,
              entity_relation_regularizer=regularizer)

Let's go through the parameters used in the cell above and in the `fit` call below to understand what's going on:

- **`k`** : the dimensionality of the embedding space (100 here).
- **`eta`** ($\eta$) : the number of negative, or false, triples generated at training time for each positive, or true, triple (20 here).
- **`scoring_type`** : the scoring function of the model, here `'ComplEx'`.
- **`optimizer`** : the Adam optimizer with a learning rate of 1e-4, created with `tf.keras.optimizers.Adam`.
- **`loss`** : the multiclass negative log-likelihood loss, obtained with `get_loss('multiclass_nll')`.
- **`regularizer`** : $L_p$ regularization with $p=3$, i.e. L3 regularization, and $\lambda$ = 1e-5, obtained with `get_regularizer` and passed to `compile` as `entity_relation_regularizer`.
- **`batch_size`** : the number of triples in each training batch, here one fiftieth of the training set. If you run into low-memory issues, lowering this value may help.
- **`epochs`** : the number of epochs to train the model for.
- **`seed`** : random seed, used for reproducibility; it is not set explicitly in this notebook.
- **`verbose`** : displays a progress bar.

Training should take around 10 minutes on a modern GPU:

In [None]:
# Fit the model
model.fit(X_train,
          batch_size=int(X_train.shape[0] / 50),
          epochs=5000,  # Number of training epochs
          verbose=True  # Enable stdout messages
          )
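
As a quick check on what `batch_size=int(X_train.shape[0] / 50)` means for the 6034 training triples reported earlier, the arithmetic below works out the batch size and the resulting number of batches per epoch. It assumes the final, smaller batch is kept rather than dropped, which is our assumption about the training loop.

```python
import math

n_train = 6034                           # rows in X_train, from the split above
batch_size = int(n_train / 50)           # truncates 120.68 down to 120
n_batches = math.ceil(n_train / batch_size)
last_batch = n_train - (n_batches - 1) * batch_size

print(batch_size, n_batches, last_batch)
```

Because of the truncation, an epoch under this assumption has one more batch than the 50 the expression suggests, with the last batch holding only the leftover triples.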

### Evaluating knowledge embeddings

AmpliGraph aims to follow scikit-learn's ease-of-use design philosophy and simplify everything down to **`fit`**, **`evaluate`**, and **`predict`** functions.

However, there are some knowledge graph specific steps we must take to ensure our model can be trained and evaluated correctly. The first of these is defining the filter that will be used to ensure that no negative statements generated by the corruption procedure are actually positives. This is simply done by concatenating our training and validation sets. Now when negative triples are generated by the corruption strategy, we can check that they aren't actually true statements.
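
The filtering step described above can be sketched without any library. The function name `filtered_corruptions`, the entity labels, and the relation `author_of` are toy placeholders: we corrupt the subject and then the object of a triple with every other entity, and drop any corruption that is itself a known true triple.

```python
def filtered_corruptions(triple, entities, known_true):
    """Corrupt subject and object in turn, then remove known positives."""
    s, p, o = triple
    candidates = [(e, p, o) for e in entities if e != s]   # subject side
    candidates += [(s, p, e) for e in entities if e != o]  # object side
    return [c for c in candidates if c not in known_true]


entities = ['Dostoevsky', 'Idiot', 'Demons', 'Tolstoy']
known_true = {
    ('Dostoevsky', 'author_of', 'Idiot'),
    ('Dostoevsky', 'author_of', 'Demons'),
}

negatives = filtered_corruptions(('Dostoevsky', 'author_of', 'Idiot'),
                                 entities, known_true)
for triple in negatives:
    print(triple)
```

Without the filter, `('Dostoevsky', 'author_of', 'Demons')` would be counted as a negative even though it is true, which would unfairly push the rank of the evaluated triple down.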

In [None]:
filter_triples = {'test': np.concatenate((X_train, X_valid))}

For evaluation we call the model's `evaluate` method:

- **`X_valid`** - the data to evaluate on. We use our validation set.
- **`use_filter`** - the dictionary of known true triples, used to filter out the false negatives generated by the corruption strategy.
- **`corrupt_side`** - which side of each triple is corrupted during evaluation. It is not set in the next cell; the link-prediction cells later in this notebook pass `'s+o'`.
- **`verbose`** - displays a progress bar.

In [None]:
ranks = model.evaluate(X_valid,
                       use_filter=filter_triples,
                       verbose=True)
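
Conceptually, each entry of `ranks` is the position of a true triple among its corruptions when all of them are sorted by score. A minimal sketch, assuming that ties are resolved in favour of the true triple (libraries differ on tie handling) and using made-up scores:

```python
def rank_of(true_score, corruption_scores):
    """1 + the number of corruptions that score strictly higher than the true triple."""
    return 1 + sum(1 for s in corruption_scores if s > true_score)


# Hypothetical scores: two corruptions beat the true triple, so its rank is 3.
rank = rank_of(7.65, [11.4, 10.9, 0.7, -2.9])
print(rank)
```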



We're going to use the `mr_score` (mean rank), `mrr_score` (mean reciprocal rank) and `hits_at_n_score` functions.

- **mr_score**: computes the mean of a vector of rankings `ranks`; lower is better.
- **mrr_score**: computes the mean of the reciprocals of the elements of `ranks`; higher is better.
- **hits_at_n_score**: computes the fraction of elements of `ranks` that make it into the top n positions; higher is better.
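
The metrics are simple enough to compute by hand, which can help when reading the numbers that follow. The functions below are plain-Python stand-ins with hypothetical names, not AmpliGraph's implementations, applied to a made-up vector of ranks.

```python
def mean_rank(ranks):
    return sum(ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    return sum(1 / r for r in ranks) / len(ranks)


def hits_at_n(ranks, n):
    return sum(1 for r in ranks if r <= n) / len(ranks)


toy_ranks = [1, 2, 4, 10]
print("MR: %.2f" % mean_rank(toy_ranks))
print("MRR: %.4f" % mean_reciprocal_rank(toy_ranks))
print("Hits@3: %.2f" % hits_at_n(toy_ranks, 3))
print("Hits@10: %.2f" % hits_at_n(toy_ranks, 10))
```

MRR rewards ranks near the top much more than MR does: a single rank of 1 contributes 1.0 to the sum, while a rank of 10 contributes only 0.1.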

In [None]:
mr = mr_score(ranks)
mrr = mrr_score(ranks)

print("MRR: %.2f" % (mrr))
print("MR: %.2f" % (mr))

hits_10 = hits_at_n_score(ranks, n=10)
print("Hits@10: %.2f" % (hits_10))
hits_3 = hits_at_n_score(ranks, n=3)
print("Hits@3: %.2f" % (hits_3))
hits_1 = hits_at_n_score(ranks, n=1)
print("Hits@1: %.2f" % (hits_1))

MRR: 0.28
MR: 334.41
Hits@10: 0.42
Hits@3: 0.33
Hits@1: 0.19


In [None]:
save_model(model, model_name_path='/content/drive/MyDrive/Colab Notebooks/Графы знаний/ComplEx.pkl')



In [None]:
model = restore_model(model_name_path='ComplEx.pkl')

Saved model does not include a db file. Skipping.


## TransE

In [None]:
model = ScoringBasedEmbeddingModel(k=100,
                                   eta=20,
                                   scoring_type='TransE')


# Optimizer, loss and regularizer definition
optim = tf.keras.optimizers.Adam(learning_rate=1e-4)
loss = get_loss('multiclass_nll')
regularizer = get_regularizer('LP', {'p': 3, 'lambda': 1e-5})

# Compilation of the model
model.compile(optimizer=optim,
              loss=loss,
              entity_relation_regularizer=regularizer)

model.fit(X_train,
          batch_size=int(X_train.shape[0] / 50),
          epochs=5000,  # Number of training epochs
          verbose=True  # Enable stdout messages
          )

In [None]:
filter_triples = {'test': np.concatenate((X_train, X_valid))}

# from ampligraph.evaluation import evaluate_performance

ranks = model.evaluate(X_valid,
                      #  model=model,
                        use_filter=filter_triples,
                      #  use_default_protocol=True,
                        verbose=True)

mr = mr_score(ranks)
mrr = mrr_score(ranks)

print("MRR: %.2f" % (mrr))
print("MR: %.2f" % (mr))

hits_10 = hits_at_n_score(ranks, n=10)
print("Hits@10: %.2f" % (hits_10))
hits_3 = hits_at_n_score(ranks, n=3)
print("Hits@3: %.2f" % (hits_3))
hits_1 = hits_at_n_score(ranks, n=1)
print("Hits@1: %.2f" % (hits_1))

MRR: 0.44
MR: 123.12
Hits@10: 0.61
Hits@3: 0.52
Hits@1: 0.32


In [None]:
save_model(model, model_name_path='TransE.pkl')



In [None]:
model_1 = restore_model(model_name_path='TransE.pkl')

Saved model does not include a db file. Skipping.


## Other models

In [None]:
model = ScoringBasedEmbeddingModel(k=100,
                                   eta=20,
                                   scoring_type='DistMult')


# Optimizer, loss and regularizer definition
optim = tf.keras.optimizers.Adam(learning_rate=1e-4)
loss = get_loss('multiclass_nll')
regularizer = get_regularizer('LP', {'p': 3, 'lambda': 1e-5})

# Compilation of the model
model.compile(optimizer=optim,
              loss=loss,
              entity_relation_regularizer=regularizer)

model.fit(X_train,
          batch_size=int(X_train.shape[0] / 50),
          epochs=500,  # Number of training epochs
          verbose=True  # Enable stdout messages
          )

In [None]:
filter_triples = {'test': np.concatenate((X_train, X_valid))}

# from ampligraph.evaluation import evaluate_performance

ranks = model.evaluate(X_valid,
                      #  model=model,
                        use_filter=filter_triples,
                      #  use_default_protocol=True,
                        verbose=True)

mr = mr_score(ranks)
mrr = mrr_score(ranks)

print("MRR: %.2f" % (mrr))
print("MR: %.2f" % (mr))

hits_10 = hits_at_n_score(ranks, n=10)
print("Hits@10: %.2f" % (hits_10))
hits_3 = hits_at_n_score(ranks, n=3)
print("Hits@3: %.2f" % (hits_3))
hits_1 = hits_at_n_score(ranks, n=1)
print("Hits@1: %.2f" % (hits_1))

MRR: 0.53
MR: 141.76
Hits@10: 0.62
Hits@3: 0.55
Hits@1: 0.47


In [None]:
model = ScoringBasedEmbeddingModel(k=100,
                                   eta=20,
                                   scoring_type='HolE')


# Optimizer, loss and regularizer definition
optim = tf.keras.optimizers.Adam(learning_rate=1e-4)
loss = get_loss('multiclass_nll')
regularizer = get_regularizer('LP', {'p': 3, 'lambda': 1e-5})

# Compilation of the model
model.compile(optimizer=optim,
              loss=loss,
              entity_relation_regularizer=regularizer)

model.fit(X_train,
          batch_size=int(X_train.shape[0] / 50),
          epochs=500,  # Number of training epochs
          verbose=True  # Enable stdout messages
          )

In [None]:
filter_triples = {'test': np.concatenate((X_train, X_valid))}

# from ampligraph.evaluation import evaluate_performance

ranks = model.evaluate(X_valid,
                      #  model=model,
                        use_filter=filter_triples,
                      #  use_default_protocol=True,
                        verbose=True)

mr = mr_score(ranks)
mrr = mrr_score(ranks)

print("MRR: %.2f" % (mrr))
print("MR: %.2f" % (mr))

hits_10 = hits_at_n_score(ranks, n=10)
print("Hits@10: %.2f" % (hits_10))
hits_3 = hits_at_n_score(ranks, n=3)
print("Hits@3: %.2f" % (hits_3))
hits_1 = hits_at_n_score(ranks, n=1)
print("Hits@1: %.2f" % (hits_1))

MRR: 0.53
MR: 164.61
Hits@10: 0.64
Hits@3: 0.55
Hits@1: 0.47


# Link prediction


## TransE

In [None]:
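# `dfFiltered`, `statements` and `statements_filter` are defined in the
# ComplEx section further down; run those cells before this one.
# Caution: this call ranks the rows of `dfFiltered`, not the candidate
# `statements`. Pass `statements` as the first argument to rank the candidates.
ranks_statements = model.evaluate(dfFiltered,
                                  use_filter=statements_filter,
                                  corrupt_side='s+o',
                                  verbose=True)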


2 triples containing invalid keys skipped!


In [None]:
scores = model.predict(statements)
scores


2 triples containing invalid keys skipped!


array([-12.313822, -12.43676 , -17.577484], dtype=float32)

In [None]:
from scipy.special import expit
probs = expit(scores)

pd.DataFrame(list(zip([' '.join(x) for x in statements],
                      ranks_statements,
                      np.squeeze(scores),
                      np.squeeze(probs))),
             columns=['statement', 'rank', 'score', 'prob']).sort_values("prob")

Unnamed: 0,statement,rank,score,prob
2,http://www.semanticweb.org/irina/ontologies/20...,[1],-17.577484,2.323783e-08
1,http://www.semanticweb.org/irina/ontologies/20...,[1],-12.43676,3.969923e-06
0,http://www.semanticweb.org/irina/ontologies/20...,[1],-12.313822,4.489244e-06


## ComplEx

Link prediction allows us to infer missing links in a graph.

In our case, we score candidate `является_автором` ("is the author of") statements.
We start by choosing a subject that exists in the training set.

In [None]:
model = restore_model(model_name_path='/content/drive/MyDrive/Графы знаний/ComplEx.pkl')

Saved model does not include a db file. Skipping.


In [None]:
df = pd.DataFrame(X_train,columns = ['subject','predicate','object'])

matchSubject = "http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#Достоевский_Федор_Михайлович"
df[df.subject==matchSubject]

Unnamed: 0,subject,predicate,object
1070,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,мужской
1196,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...
2024,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...
3254,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,1881
3643,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,1821
3789,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...
3949,http://www.semanticweb.org/irina/ontologies/20...,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://www.semanticweb.org/irina/ontologies/20...
4483,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...
4881,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...
4968,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...


Remove the `является_автором` triples for this subject from the training dataframe.

In [None]:
dfFiltered = np.array(df[(df.subject!=matchSubject) | ((df.subject==matchSubject) & ~df.predicate.isin(["http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#является_автором"]))])

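Ideally the model would now be refit on the triples without this data. The cells below do not refit; they reuse the restored model, so the held-out triples may already have been seen during training.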
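
The boolean mask used to build `dfFiltered` reads more easily once simplified: "keep rows with a different subject, or rows with this subject whose predicate is not held out" is the same as "drop rows that have this subject and a held-out predicate". The sketch below checks the equivalence on made-up rows in plain Python; in pandas the compact form would be `df[~((df.subject == matchSubject) & df.predicate.isin(held_out))]`, where `held_out` is a hypothetical name for the list of predicates.

```python
rows = [
    ('author', 'wrote', 'novel_a'),
    ('author', 'born_in', '1821'),
    ('actor', 'wrote', 'memoir'),
]
subject, held_out = 'author', {'wrote'}

# Form used in the notebook cell above.
verbose_form = [r for r in rows
                if r[0] != subject or (r[0] == subject and r[1] not in held_out)]

# Equivalent compact form: drop exactly the held-out rows.
compact_form = [r for r in rows
                if not (r[0] == subject and r[1] in held_out)]

print(compact_form)
```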

Next, we build a few candidate statements to score.

In [None]:
statements = np.array([
    [f'{matchSubject}', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#является_автором', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#Братья_Карамазовы'],
    [f'{matchSubject}', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#является_автором', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#Преступление_и_наказание'],
    [f'{matchSubject}', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#является_автором', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#Идиот'],
    [f'{matchSubject}', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#является_автором', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#Бесы'],
    [f'{matchSubject}', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#является_автором', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#322063_Братья_Карамазовы'], # Russian adaptation
    [f'{matchSubject}', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#является_автором', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#43239_Преступление_и_наказание'], # Russian adaptation
    [f'{matchSubject}', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#является_автором', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#718242_Бесы'], # Russian adaptation
    [f'{matchSubject}', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#является_автором', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#77051_Идиот'], # Russian adaptation
    [f'{matchSubject}', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#является_автором', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#439635_Идиот'], # Indian adaptation
    [f'{matchSubject}', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#является_автором', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#427353_Идиот'] # French adaptation
])
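
Since every candidate statement shares the same namespace, subject and predicate, the list can also be generated rather than typed out. The helper `make_statements` and the constant `NS` below are hypothetical names introduced for illustration; the namespace string is the one used throughout this notebook.

```python
NS = 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#'


def make_statements(subject, predicate, objects):
    """Build [subject, predicate, object] rows with full URIs."""
    return [[NS + subject, NS + predicate, NS + obj] for obj in objects]


candidates = make_statements(
    'Достоевский_Федор_Михайлович',
    'является_автором',
    ['Братья_Карамазовы', 'Идиот', '77051_Идиот'],
)
for row in candidates:
    print(row[2])
```

Wrapping the result in `np.array(...)` gives the same kind of array as the hand-written `statements` above.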

Combine the graph triples with the candidate statements, dropping duplicates, to form the evaluation filter.
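
The deduplication relies on converting each row to a tuple so that it can be placed in a set. A minimal sketch of the same idea on made-up rows, in plain Python rather than NumPy:

```python
graph_rows = [['a', 'r', 'b'], ['b', 'r', 'c']]
candidate_rows = [['a', 'r', 'b'], ['a', 'r', 'c']]  # first row duplicates a graph row

# Lists are unhashable, so rows become tuples before going into the set.
unique_rows = sorted(set(map(tuple, graph_rows + candidate_rows)))
print(unique_rows)
```

A set does not preserve row order; that does not matter for a filter, which is only used for membership checks. The `sorted` call here just makes the printed result deterministic.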

In [None]:
statements_filter = {'test': np.array(list({tuple(i) for i in np.vstack((dfFiltered, statements))}))}
statements_filter

{'test': array([['http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#Айно_Сеппо',
         'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#снимался_в',
         'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#197691_Преступление_и_наказание'],
        ['http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#Жюли_Дельпи',
         'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#род_деятельности',
         'актер'],
        ['http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#Шатова_Мария',
         'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#пол',
         'женский'],
        ...,
        ['http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#Юрий_Колокольников',
         'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#играл_персонажа',
         'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#Ставрогин_Николай_Всеволодович'],
        ['h

In [None]:
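# Caution: this call ranks the rows of `dfFiltered`, so the ranks it returns
# belong to training triples, not to the candidate `statements`.
# Pass `statements` as the first argument to rank the candidates themselves.
ranks_statements = model.evaluate(dfFiltered,
                                  use_filter=statements_filter,
                                  corrupt_side='s+o',
                                  verbose=True)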


23 triples containing invalid keys skipped!

23 triples containing invalid keys skipped!


In [None]:
scores = model.predict(statements)
scores

array([11.386696  , 10.912289  ,  7.650546  , 10.952097  , -2.8545794 ,
        0.72903633, -0.5548361 , -0.18547091, -0.16761619, -0.1300585 ],
      dtype=float32)

Collect the ranks, the raw scores, and the sigmoid-transformed scores in one table.

In [None]:
probs = expit(scores)

pd.DataFrame(list(zip([' '.join(x) for x in statements],
                      ranks_statements,
                      np.squeeze(scores),
                      np.squeeze(probs))),
             columns=['statement', 'rank', 'score', 'prob']).sort_values("prob")

Unnamed: 0,statement,rank,score,prob
4,http://www.semanticweb.org/irina/ontologies/20...,[1],-2.854579,0.054445
6,http://www.semanticweb.org/irina/ontologies/20...,[15],-0.554836,0.364743
7,http://www.semanticweb.org/irina/ontologies/20...,[1],-0.185471,0.453765
8,http://www.semanticweb.org/irina/ontologies/20...,[1],-0.167616,0.458194
9,http://www.semanticweb.org/irina/ontologies/20...,[57],-0.130058,0.467531
5,http://www.semanticweb.org/irina/ontologies/20...,[1],0.729036,0.674594
2,http://www.semanticweb.org/irina/ontologies/20...,[139],7.650546,0.999524
1,http://www.semanticweb.org/irina/ontologies/20...,[161],10.912289,0.999982
3,http://www.semanticweb.org/irina/ontologies/20...,[64],10.952097,0.999982
0,http://www.semanticweb.org/irina/ontologies/20...,[7],11.386696,0.999989
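
The `prob` column is obtained by passing each raw score through the logistic sigmoid, which is what `scipy.special.expit` computes. The sketch below reproduces three of the values from the table with the standard library only; the function name `sigmoid` is ours. These values are best read as a monotone rescaling of the scores into (0, 1) rather than as calibrated probabilities.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Raw scores copied from the `model.predict(statements)` output above.
for score in (-2.8545794, 0.72903633, 11.386696):
    print("%.6f -> %.6f" % (score, sigmoid(score)))
```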


In [None]:
df = pd.DataFrame(X_train,columns = ['subject','predicate','object'])

matchSubject = "http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#Преступление_и_наказание"
df[df.subject==matchSubject]

Unnamed: 0,subject,predicate,object
2126,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,1866
2583,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,Преступление и наказание
5542,http://www.semanticweb.org/irina/ontologies/20...,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://www.semanticweb.org/irina/ontologies/20...


In [None]:
dfFiltered = np.array(df[(df.subject!=matchSubject) | ((df.subject==matchSubject) & ~df.predicate.isin(["http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#является_прототипом", "http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#снят_по_сюжету"]))])

In [None]:
statements = np.array([
    [f'{matchSubject}', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#является_прототипом', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#327456_Преступление_и_наказание'],
    [f'{matchSubject}', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#является_прототипом', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#13005_Преступление_и_наказание'],
    [f'{matchSubject}', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#является_прототипом', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#43239_Преступление_и_наказание'],
    [f'{matchSubject}', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#является_прототипом', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#197691_Преступление_и_наказание']
])

In [None]:
statements_filter = {'test': np.array(list({tuple(i) for i in np.vstack((dfFiltered, statements))}))}

In [None]:
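# Caution: this call ranks the rows of `dfFiltered`, not the candidate
# `statements`. Pass `statements` as the first argument to rank the candidates.
ranks_statements = model.evaluate(dfFiltered,
                                  use_filter=statements_filter,
                                  corrupt_side='s+o',
                                  verbose=True)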


21 triples containing invalid keys skipped!

21 triples containing invalid keys skipped!


In [None]:
scores = model.predict(statements)
scores

array([ 2.3702922 ,  2.4293036 ,  0.23860142, -0.5993578 ], dtype=float32)

In [None]:
probs = expit(scores)

pd.DataFrame(list(zip([' '.join(x) for x in statements],
                      ranks_statements,
                      np.squeeze(scores),
                      np.squeeze(probs))),
             columns=['statement', 'rank', 'score', 'prob']).sort_values("prob")

Unnamed: 0,statement,rank,score,prob
3,http://www.semanticweb.org/irina/ontologies/20...,[143],-0.599358,0.354491
2,http://www.semanticweb.org/irina/ontologies/20...,[2],0.238601,0.559369
0,http://www.semanticweb.org/irina/ontologies/20...,[1],2.370292,0.914534
1,http://www.semanticweb.org/irina/ontologies/20...,[133],2.429304,0.919035


In [None]:
df = pd.DataFrame(X_train,columns = ['subject','predicate','object'])

matchSubject = "http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#327456_Преступление_и_наказание"
df[df.subject==matchSubject]

Unnamed: 0,subject,predicate,object
345,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...
758,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,криминал
988,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,7.7
1071,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,2002
1108,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,драма
1132,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,Преступление и наказание
1376,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...
4814,http://www.semanticweb.org/irina/ontologies/20...,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://www.semanticweb.org/irina/ontologies/20...


In [None]:
dfFiltered = np.array(df[(df.subject!=matchSubject) | ((df.subject==matchSubject) & ~df.predicate.isin(["http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#снят_по_сюжету"]))])

In [None]:
statements = np.array([
    [f'{matchSubject}', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#снят_по_сюжету', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#Преступление_и_наказание'],
    [f'{matchSubject}', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#снят_по_сюжету', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#Преступление_и_наказание']
])

In [None]:
statements_filter = {'test': np.array(list({tuple(i) for i in np.vstack((dfFiltered, statements))}))}

In [None]:
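# Caution: this call ranks the rows of `dfFiltered`, not the candidate
# `statements`. Pass `statements` as the first argument to rank the candidates.
ranks_statements = model.evaluate(dfFiltered,
                                  use_filter=statements_filter,
                                  corrupt_side='s+o',
                                  verbose=True)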


21 triples containing invalid keys skipped!

21 triples containing invalid keys skipped!


In [None]:
scores = model.predict(statements)
scores

array([12.391866, 12.391866], dtype=float32)

In [None]:
probs = expit(scores)

pd.DataFrame(list(zip([' '.join(x) for x in statements],
                      ranks_statements,
                      np.squeeze(scores),
                      np.squeeze(probs))),
             columns=['statement', 'rank', 'score', 'prob']).sort_values("prob")

Unnamed: 0,statement,rank,score,prob
0,http://www.semanticweb.org/irina/ontologies/20...,[1],12.391866,0.999996
1,http://www.semanticweb.org/irina/ontologies/20...,[133],12.391866,0.999996


In [None]:
df = pd.DataFrame(X_train,columns = ['subject','predicate','object'])

matchSubject = "http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#Суслова_Аполлинария_Прокофьевна"
df[df.subject==matchSubject]

Unnamed: 0,subject,predicate,object
722,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...
1100,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,1918
1611,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...
3927,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,Писательница
4010,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...
4151,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,1839
5821,http://www.semanticweb.org/irina/ontologies/20...,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://www.semanticweb.org/irina/ontologies/20...


In [None]:
dfFiltered = np.array(df[(df.subject!=matchSubject) | ((df.subject==matchSubject) & ~df.predicate.isin(["http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#является_прототипом"]))])

In [None]:
statements = np.array([
    [f'{matchSubject}', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#является_прототипом', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#Раскольникова_Авдотья_Романовна'],
    [f'{matchSubject}', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#является_прототипом', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#Барашкова_Настасья_Филипповна'],
    [f'{matchSubject}', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#является_прототипом', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#Тушина_Лизавета_Николаевна'],
    [f'{matchSubject}', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#является_прототипом', 'http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#Иволгина_Нина_Александровна']
])

In [None]:
statements_filter = {'test': np.array(list({tuple(i) for i in np.vstack((dfFiltered, statements))}))}

In [None]:
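# Caution: this call ranks the rows of `dfFiltered`, not the candidate
# `statements`. Pass `statements` as the first argument to rank the candidates.
ranks_statements = model.evaluate(dfFiltered,
                                  use_filter=statements_filter,
                                  corrupt_side='s+o',
                                  verbose=True)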


21 triples containing invalid keys skipped!

21 triples containing invalid keys skipped!


In [None]:
scores = model.predict(statements)
scores

array([13.162555 , 14.095688 ,  6.6072354,  4.8869   ], dtype=float32)

In [None]:
probs = expit(scores)

pd.DataFrame(list(zip([' '.join(x) for x in statements],
                      ranks_statements,
                      np.squeeze(scores),
                      np.squeeze(probs))),
             columns=['statement', 'rank', 'score', 'prob']).sort_values("prob")

Unnamed: 0,statement,rank,score,prob
3,http://www.semanticweb.org/irina/ontologies/20...,[143],4.8869,0.992512
2,http://www.semanticweb.org/irina/ontologies/20...,[2],6.607235,0.998651
0,http://www.semanticweb.org/irina/ontologies/20...,[1],13.162555,0.999998
1,http://www.semanticweb.org/irina/ontologies/20...,[133],14.095688,0.999999
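
The `prob` column is the logistic sigmoid (`expit`) of the raw model score. As a minimal sketch using only the standard library, the values in the table above can be reproduced from the printed scores; the helper name `sigmoid` is an assumption introduced here for illustration.

```python
import math

def sigmoid(score: float) -> float:
    """Logistic sigmoid, the same function scipy.special.expit computes."""
    return 1.0 / (1.0 + math.exp(-score))

# Raw scores copied from the model.predict output above.
scores = [13.162555, 14.095688, 6.6072354, 4.8869]
probs = [sigmoid(s) for s in scores]

# Print the pairs in ascending order of probability, like the table above.
for s, p in sorted(zip(scores, probs), key=lambda pair: pair[1]):
    print(f"score={s:10.6f}  prob={p:.6f}")
```

Because the sigmoid saturates quickly, scores of 13.16 and 14.10 both map to values within about 2e-6 of 1, so the raw score is the more informative column when comparing high-scoring candidates. The result is better read as a monotone squashing of the score than as a calibrated probability.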


In [None]:
df = pd.DataFrame(X_train,columns = ['subject','predicate','object'])

matchSubject = "http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#Верховенский_Петр_Степанович"
df[df.subject==matchSubject]

Unnamed: 0,subject,predicate,object
288,http://www.semanticweb.org/irina/ontologies/20...,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://www.semanticweb.org/irina/ontologies/20...
2587,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,Заменить собою Христа
4109,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,Руководитель тайной организации
5259,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...
5943,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...


In [None]:
# Exclude matchSubject's "совершил" ("committed") triples; keep all other training triples.
ns = "http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#"
committed = ns + "совершил"
dfFiltered = np.array(df[(df.subject != matchSubject)
                         | ((df.subject == matchSubject) & ~df.predicate.isin([committed]))])

In [None]:
# Candidate "совершил" triples to score for matchSubject.
statements = np.array([
    [matchSubject, committed, ns + 'Убийство_Шатова'],
    [matchSubject, committed, ns + 'Подделка_акций_Тамбово-Козловской_железной_дороги_1874'],
    [matchSubject, committed, ns + 'Убийство_студента_Иванова'],
])
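
The mask used to build `dfFiltered` keeps every row except those where the subject is `matchSubject` and the predicate is the one being tested. By De Morgan's laws, `(s != m) | ((s == m) & ~p.isin(preds))` is equivalent to `~((s == m) & p.isin(preds))`. A small sketch on toy rows (all values hypothetical, plain Python instead of pandas) checks the equivalence exhaustively:

```python
from itertools import product

def keep_original(subject, predicate, match, preds):
    # Same shape as the pandas mask used for dfFiltered.
    return (subject != match) or ((subject == match) and (predicate not in preds))

def keep_simplified(subject, predicate, match, preds):
    # Drop a row only when it is about `match` AND uses a held-out predicate.
    return not (subject == match and predicate in preds)

match, preds = "m", {"committed"}
rows = list(product(["m", "other"], ["committed", "type"]))

kept = [row for row in rows if keep_simplified(*row, match, preds)]
print(kept)
```

Only the `("m", "committed")` combination is removed; rows about other subjects are kept even when they use the same predicate.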

In [None]:
statements_filter = {'test': np.array(list({tuple(i) for i in np.vstack((dfFiltered, statements))}))}

In [None]:
# Note: the ranks returned here are for the rows of dfFiltered; to rank the candidate
# triples themselves, evaluate `statements` instead.
ranks_statements = model.evaluate(dfFiltered,
                                  use_filter=statements_filter,
                                  corrupt_side='s+o',
                                  verbose=True)


21 triples containing invalid keys skipped!

21 triples containing invalid keys skipped!


In [None]:
scores = model.predict(statements)
scores

array([6.6538363, 1.2656591, 1.8687721], dtype=float32)

In [None]:
probs = expit(scores)

pd.DataFrame(list(zip([' '.join(x) for x in statements],
                      ranks_statements,
                      np.squeeze(scores),
                      np.squeeze(probs))),
             columns=['statement', 'rank', 'score', 'prob']).sort_values("prob")

Unnamed: 0,statement,rank,score,prob
1,http://www.semanticweb.org/irina/ontologies/20...,[133],1.265659,0.779999
2,http://www.semanticweb.org/irina/ontologies/20...,[2],1.868772,0.866316
0,http://www.semanticweb.org/irina/ontologies/20...,[1],6.653836,0.998713
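
The `rank` column comes from `model.evaluate`, which compares a triple against its corruptions. As a rough sketch of the underlying idea (not AmpliGraph's actual implementation), the filtered rank of a triple can be taken as one plus the number of corruptions that score strictly higher, after discarding corruptions that are themselves known true triples. The function name, entities, and scores below are hypothetical.

```python
def filtered_rank(true_score, corruption_scores, known_true):
    """Rank of a triple among its corruptions, ignoring known true corruptions.

    corruption_scores maps each corrupted triple to its score;
    known_true is the set of triples that must not count as negatives.
    """
    better = sum(
        1
        for triple, score in corruption_scores.items()
        if triple not in known_true and score > true_score
    )
    return better + 1

# Toy example with hypothetical entities and scores.
corruptions = {
    ("a", "committed", "x"): 7.0,   # known true, so it is filtered out
    ("a", "committed", "y"): 5.5,   # genuine negative scoring higher
    ("a", "committed", "z"): 1.0,   # genuine negative scoring lower
}
known = {("a", "committed", "x")}

print(filtered_rank(4.0, corruptions, known))   # filtered rank
print(filtered_rank(4.0, corruptions, set()))   # raw (unfiltered) rank
```

Filtering matters here because a corruption that happens to be a true triple would otherwise push the rank of the tested triple down unfairly.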


In [None]:
df = pd.DataFrame(X_train,columns = ['subject','predicate','object'])

matchSubject = "http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#Нечаев_Сергей_Геннадьевич"
df[df.subject==matchSubject]

Unnamed: 0,subject,predicate,object
695,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,1882
1830,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...
2096,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...
3234,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...
3735,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,Революционер
5163,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...
5980,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,1847


In [None]:
# Exclude matchSubject's "совершил" ("committed") triples; keep all other training triples.
ns = "http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#"
committed = ns + "совершил"
dfFiltered = np.array(df[(df.subject != matchSubject)
                         | ((df.subject == matchSubject) & ~df.predicate.isin([committed]))])

In [None]:
# Candidate "совершил" triples to score for matchSubject.
statements = np.array([
    [matchSubject, committed, ns + 'Убийство_студента_Иванова'],
    [matchSubject, committed, ns + 'Убийство_Шатова'],
])

In [None]:
statements_filter = {'test': np.array(list({tuple(i) for i in np.vstack((dfFiltered, statements))}))}
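
The filter is built by stacking the remaining training triples with the candidate statements and passing the rows through a set of tuples, which drops exact duplicates. A minimal standard-library sketch of the same idea on toy triples (all values hypothetical):

```python
train = [
    ("a", "type", "Person"),
    ("a", "committed", "x"),
]
candidates = [
    ("a", "committed", "x"),   # already present in train
    ("a", "committed", "y"),
]

# Stack both lists, then deduplicate by hashing each row as a tuple.
unique_triples = {tuple(row) for row in train + candidates}
filter_triples = sorted(unique_triples)  # sorted only to make the output stable

print(len(train) + len(candidates), "rows ->", len(filter_triples), "unique")
```

A set does not preserve row order, which is harmless for a filter because it is only used for membership checks.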

In [None]:
# Note: the ranks returned here are for the rows of dfFiltered; to rank the candidate
# triples themselves, evaluate `statements` instead.
ranks_statements = model.evaluate(dfFiltered,
                                  use_filter=statements_filter,
                                  corrupt_side='s+o',
                                  verbose=True)


21 triples containing invalid keys skipped!

21 triples containing invalid keys skipped!


In [None]:
scores = model.predict(statements)
scores

array([12.073622 ,  1.5247501], dtype=float32)

In [None]:
probs = expit(scores)

pd.DataFrame(list(zip([' '.join(x) for x in statements],
                      ranks_statements,
                      np.squeeze(scores),
                      np.squeeze(probs))),
             columns=['statement', 'rank', 'score', 'prob']).sort_values("prob")

Unnamed: 0,statement,rank,score,prob
1,http://www.semanticweb.org/irina/ontologies/20...,[133],1.52475,0.821237
0,http://www.semanticweb.org/irina/ontologies/20...,[1],12.073622,0.999994
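
The results table is assembled with `zip`, which stops at its shortest argument. A small sketch with made-up values shows the consequence: when the rank list has more entries than there are statements, only its leading entries are paired up, whether or not they belong to those statements. Since `model.evaluate` is called on `dfFiltered` in the cells above, it may be worth checking that the rank array has one row per candidate statement before zipping. The helper `check_aligned` is a hypothetical name introduced for this sketch.

```python
# Hypothetical values, for illustration only.
statements = ["s1 p o1", "s1 p o2"]
scores = [12.1, 1.5]
ranks = [[1], [133], [2], [143]]  # more rows than statements

# zip silently truncates to the shortest input.
rows = list(zip(statements, ranks, scores))
print(rows)

def check_aligned(*columns):
    """Return True only if all columns have the same length."""
    lengths = {len(c) for c in columns}
    return len(lengths) == 1

print(check_aligned(statements, ranks, scores))
```

On Python 3.10 and later, `zip(..., strict=True)` raises a `ValueError` on a length mismatch instead of truncating.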


In [None]:
df = pd.DataFrame(X_train,columns = ['subject','predicate','object'])

matchSubject = "http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#Писарев_Дмитрий_Иванович"
df[df.subject==matchSubject]

Unnamed: 0,subject,predicate,object
892,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,1840
1026,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,мужской
1183,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...
1498,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,Критик
3178,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...
4542,http://www.semanticweb.org/irina/ontologies/20...,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://www.semanticweb.org/irina/ontologies/20...
5952,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,1868


In [None]:
# Exclude matchSubject's "совершил" ("committed") triples; keep all other training triples.
ns = "http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#"
committed = ns + "совершил"
dfFiltered = np.array(df[(df.subject != matchSubject)
                         | ((df.subject == matchSubject) & ~df.predicate.isin([committed]))])

In [None]:
# Candidate "совершил" triples to score for matchSubject.
statements = np.array([
    [matchSubject, committed, ns + 'Убийство_студента_Иванова'],
    [matchSubject, committed, ns + 'Убийство_Шатова'],
])

In [None]:
statements_filter = {'test': np.array(list({tuple(i) for i in np.vstack((dfFiltered, statements))}))}

In [None]:
# Note: the ranks returned here are for the rows of dfFiltered; to rank the candidate
# triples themselves, evaluate `statements` instead.
ranks_statements = model.evaluate(dfFiltered,
                                  use_filter=statements_filter,
                                  corrupt_side='s+o',
                                  verbose=True)


21 triples containing invalid keys skipped!

21 triples containing invalid keys skipped!


In [None]:
scores = model.predict(statements)
scores

array([5.4293184 , 0.40834635], dtype=float32)

In [None]:
probs = expit(scores)

pd.DataFrame(list(zip([' '.join(x) for x in statements],
                      ranks_statements,
                      np.squeeze(scores),
                      np.squeeze(probs))),
             columns=['statement', 'rank', 'score', 'prob']).sort_values("prob")

Unnamed: 0,statement,rank,score,prob
1,http://www.semanticweb.org/irina/ontologies/20...,[133],0.408346,0.600691
0,http://www.semanticweb.org/irina/ontologies/20...,[1],5.429318,0.995633


In [None]:
df = pd.DataFrame(X_train,columns = ['subject','predicate','object'])

matchSubject = "http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#Антон_Шагин"
df[df.subject==matchSubject]

Unnamed: 0,subject,predicate,object
113,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,актер
507,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...
1842,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...,http://www.semanticweb.org/irina/ontologies/20...
4111,http://www.semanticweb.org/irina/ontologies/20...,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://www.semanticweb.org/irina/ontologies/20...


In [None]:
# Exclude matchSubject's "совершил" ("committed") triples; keep all other training triples.
ns = "http://www.semanticweb.org/irina/ontologies/2023/10/dostoevskiy#"
committed = ns + "совершил"
dfFiltered = np.array(df[(df.subject != matchSubject)
                         | ((df.subject == matchSubject) & ~df.predicate.isin([committed]))])

In [None]:
# Candidate "совершил" triples to score for matchSubject.
statements = np.array([
    [matchSubject, committed, ns + 'Убийство_студента_Иванова'],
    [matchSubject, committed, ns + 'Убийство_Шатова'],
])

In [None]:
statements_filter = {'test': np.array(list({tuple(i) for i in np.vstack((dfFiltered, statements))}))}

In [None]:
# Note: the ranks returned here are for the rows of dfFiltered; to rank the candidate
# triples themselves, evaluate `statements` instead.
ranks_statements = model.evaluate(dfFiltered,
                                  use_filter=statements_filter,
                                  corrupt_side='s+o',
                                  verbose=True)


21 triples containing invalid keys skipped!

21 triples containing invalid keys skipped!


In [None]:
scores = model.predict(statements)
scores

array([-0.26462725, -1.9177855 ], dtype=float32)

In [None]:
probs = expit(scores)

pd.DataFrame(list(zip([' '.join(x) for x in statements],
                      ranks_statements,
                      np.squeeze(scores),
                      np.squeeze(probs))),
             columns=['statement', 'rank', 'score', 'prob']).sort_values("prob")

Unnamed: 0,statement,rank,score,prob
1,http://www.semanticweb.org/irina/ontologies/20...,[133],-1.917786,0.128109
0,http://www.semanticweb.org/irina/ontologies/20...,[1],-0.264627,0.434227
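
Taken together, the four `совершил` ("committed") experiments score the same two events, `Убийство_Шатова` ("murder of Shatov") and `Убийство_студента_Иванова` ("murder of the student Ivanov"), for different subjects. As a sketch that uses only the scores printed in the cells above (the short subject labels are abbreviations introduced here), the candidates can be lined up per event to see which subject the model prefers:

```python
# Scores copied from the model.predict outputs above, keyed by (subject, event).
scores = {
    ("Верховенский", "Убийство_Шатова"): 6.6538363,
    ("Верховенский", "Убийство_студента_Иванова"): 1.8687721,
    ("Нечаев", "Убийство_студента_Иванова"): 12.073622,
    ("Нечаев", "Убийство_Шатова"): 1.5247501,
    ("Писарев", "Убийство_студента_Иванова"): 5.4293184,
    ("Писарев", "Убийство_Шатова"): 0.40834635,
    ("Антон_Шагин", "Убийство_студента_Иванова"): -0.26462725,
    ("Антон_Шагин", "Убийство_Шатова"): -1.9177855,
}

def best_subject(event):
    """Return the subject with the highest score for the given event."""
    candidates = {s: v for (s, e), v in scores.items() if e == event}
    return max(candidates, key=candidates.get)

for event in ("Убийство_Шатова", "Убийство_студента_Иванова"):
    print(event, "->", best_subject(event))
```

Comparing raw scores across different subjects is only a heuristic, since the scores are not calibrated against each other.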
