# KG Recommender with PyKEEN (TransE)

This notebook:
1) Loads your triples (`movie_kg_triples.tsv`)
2) Trains **TransE** with PyKEEN
3) Extracts **entity** and **relation** embeddings
4) Builds a **user vector** from your ratings/likes
5) Composes and **scores a new movie** in embedding space

> Update paths as needed for your project layout.
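
For orientation, TransE treats a triple (head, relation, tail) as a translation, so a plausible triple roughly satisfies `h + r ≈ t`. The following is a minimal NumPy sketch of that scoring idea. The `transe_score` helper, the toy vectors, and the choice of `p=2` are assumptions for illustration and do not reproduce PyKEEN's internals.

```python
import numpy as np

def transe_score(h, r, t, p=2):
    """Negative L_p distance between h + r and t; higher means more plausible."""
    return -float(np.linalg.norm(h + r - t, ord=p))

# Toy 3-dimensional embeddings (made up for illustration)
h = np.array([1.0, 0.0, 0.0])
r = np.array([0.0, 1.0, 0.0])
t_good = np.array([1.0, 1.0, 0.0])   # exactly h + r
t_bad = np.array([-1.0, 0.0, 2.0])   # far from h + r

print(transe_score(h, r, t_good))
print(transe_score(h, r, t_bad))
```

The tail that lies at `h + r` receives the higher score, which is the property the later cells rely on when they rearrange the relation to `movie ≈ neighbor - r`.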

In [1]:
import pandas as pd
import numpy as np
import torch
from pykeen.pipeline import pipeline
from pykeen.triples import TriplesFactory
from pathlib import Path

pd.set_option("display.max_colwidth", 200)
print("PyTorch:", torch.__version__)

PyTorch: 2.8.0


In [2]:
# Path to your triples file. Adjust to your project layout if needed.
# The default is a path relative to this notebook;
# the second path is a fallback to an uploaded copy of the file for quick testing.
default_path = Path('../data/kg/movie_kg_triples.tsv')
uploaded_path = Path('/mnt/data/movie_kg_triples.tsv')

triples_path = default_path if default_path.exists() else uploaded_path
print("Using triples from:", triples_path)

Using triples from: ../data/kg/movie_kg_triples.tsv


In [3]:
# Load triples into a PyKEEN TriplesFactory
tf = TriplesFactory.from_path(str(triples_path))
print(f"Loaded {tf.num_triples} triples. Entities: {tf.num_entities}, Relations: {tf.num_relations}")

Loaded 41139 triples. Entities: 19880, Relations: 21
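
The triples file is read as tab-separated rows of head, relation, and tail without a header line. As a rough picture of that layout, the sketch below parses a few rows with pandas. The predicate names come from the later cells of this notebook, while the entity IDs (`movie1`, `person1`) and the row values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical sample rows: tab-separated head, relation, tail, no header line
sample_tsv = (
    "movie1\tschema:name\tExample Movie\n"
    "movie1\tschema:actor\tperson1\n"
    "movie1\tschema:review\tpersonalVote_8.5\n"
    "movie1\tex:liked\tliked_yes\n"
)

sample_df = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    header=None,
    names=["head", "relation", "tail"],
)
print(sample_df)
print(sample_df["relation"].value_counts())
```

Counting relations this way on the real file is a quick check that the predicates used later (`schema:review`, `ex:liked`, `schema:name`) are actually present.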


In [4]:
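# Split and train TransE
# Note: random_seed below seeds the pipeline, not this split. Without random_state,
# PyKEEN assigns a random split seed (logged in the output), so the split differs between runs.
train_tf, test_tf = tf.split([0.8, 0.2])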
result = pipeline(
    model='TransE',
    training=train_tf,
    testing=test_tf,
    training_kwargs=dict(num_epochs=100),
    random_seed=42,
    device='cuda' if torch.cuda.is_available() else 'cpu',
)

using automatically assigned random_state=1554936718


Training epochs on cpu:   0%|          | 0/100 [00:00<?, ?epoch/s]

Training batches on cpu:   0%|          | 0.00/129 [00:00<?, ?batch/s]



Evaluating on cpu:   0%|          | 0.00/8.23k [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 38.40s seconds
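
The evaluation step ranks each test triple against corrupted candidates and summarizes the resulting ranks. As a rough illustration of what such rank-based summaries mean, the sketch below computes mean rank, mean reciprocal rank (MRR), and hits@k from a list of 1-based ranks. The `rank_metrics` helper and the toy ranks are hypothetical; the values PyKEEN computed for this run are stored on `result.metric_results`.

```python
import numpy as np

def rank_metrics(ranks, k=10):
    """Compute mean rank, MRR, and hits@k from 1-based ranks."""
    ranks = np.asarray(ranks, dtype=float)
    return {
        "mean_rank": float(ranks.mean()),
        "mrr": float((1.0 / ranks).mean()),
        f"hits@{k}": float((ranks <= k).mean()),
    }

# Made-up ranks of the true entity for four test triples
toy_ranks = [1, 2, 4, 20]
metrics = rank_metrics(toy_ranks, k=10)
print(metrics)
```

Lower mean rank and higher MRR or hits@k indicate that true triples are ranked closer to the top.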


In [5]:
# Extract embeddings
entity_rep = result.model.entity_representations[0]
relation_rep = result.model.relation_representations[0]

# indices=None returns all embeddings and avoids a CPU/GPU index mismatch
entity_embeds = entity_rep(indices=None).detach().cpu().numpy()
relation_embeds = relation_rep(indices=None).detach().cpu().numpy()

# Order labels by their integer id so that row i belongs to entity/relation i
entity_labels = sorted(tf.entity_to_id, key=tf.entity_to_id.get)
relation_labels = sorted(tf.relation_to_id, key=tf.relation_to_id.get)
entity_df = pd.DataFrame(entity_embeds, index=entity_labels)
relation_df = pd.DataFrame(relation_embeds, index=relation_labels)

# Create the output directory if it does not exist yet
embeddings_dir = Path("../data/kg/embeddings")
embeddings_dir.mkdir(parents=True, exist_ok=True)
entity_df.to_csv(embeddings_dir / "entity_embeddings.csv")
relation_df.to_csv(embeddings_dir / "relation_embeddings.csv")

print("Entity embedding shape:", entity_df.shape)
print("Relation embedding shape:", relation_df.shape)

Entity embedding shape: (19880, 50)
Relation embedding shape: (21, 50)
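
One way to sanity-check the extracted embeddings is to look up the nearest neighbors of a known entity by cosine similarity. The sketch below does this on a toy DataFrame shaped like `entity_df` (labels as index, one row per entity). The `nearest_entities` helper and the toy values are hypothetical, and the helper assumes the index labels are unique.

```python
import numpy as np
import pandas as pd

def nearest_entities(embedding_df, label, top_k=3):
    """Return the top_k labels most cosine-similar to `label` (excluding itself)."""
    mat = embedding_df.to_numpy(dtype=float)
    # L2-normalize rows so that a dot product equals cosine similarity
    unit = mat / (np.linalg.norm(mat, axis=1, keepdims=True) + 1e-12)
    sims = unit @ unit[embedding_df.index.get_loc(label)]
    ranked = pd.Series(sims, index=embedding_df.index).drop(label)
    return ranked.sort_values(ascending=False).head(top_k)

# Toy stand-in for entity_df (labels and values are made up)
toy_df = pd.DataFrame(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]],
    index=["a", "b", "c", "d"],
)
neighbors = nearest_entities(toy_df, "a", top_k=2)
print(neighbors)
```

On the real data, the same call with `entity_df` and a movie ID should surface entities that share cast, crew, or genre if training went reasonably well.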


In [6]:
# Parse ratings, likes, names from triples file for profile building
triples_df = pd.read_csv(triples_path, sep='\t', header=None, names=['head', 'relation', 'tail'])

import re

def parse_rating(v):
    # Extract the numeric part of tails such as 'personalVote_8.5'
    m = re.match(r'personalVote_([0-9.]+)', str(v))
    return float(m.group(1)) if m else None

ratings = triples_df[triples_df['relation'] == 'schema:review']
likes = triples_df[triples_df['relation'] == 'ex:liked']
names = triples_df[triples_df['relation'] == 'schema:name']

movie2rating = dict(zip(ratings['head'], ratings['tail'].map(parse_rating)))
movie2liked = dict(zip(likes['head'], likes['tail'].map(lambda x: str(x).lower() == 'liked_yes')))
id2name = dict(zip(names['head'], names['tail']))

# Unparsable ratings may surface as NaN rather than None after Series.map, so filter with pd.notna
valid_ratings = [v for v in movie2rating.values() if pd.notna(v)]
rmin, rmax = (min(valid_ratings), max(valid_ratings)) if valid_ratings else (0.0, 1.0)
denom = (rmax - rmin) if rmax > rmin else 1.0

def weight(movie, like_bonus=0.15):
    # Min-max scaled rating in [0, 1], plus a small bonus for liked movies
    r = movie2rating.get(movie)
    if r is None or pd.isna(r):
        return 0.0
    w = (r - rmin) / denom
    if movie2liked.get(movie, False):
        w += like_bonus
    return max(0.0, w)

weighted_vecs = []
for movie in entity_df.index:
    # Only movies that carry a rating contribute to the profile
    if movie in movie2rating:
        w = weight(movie)
        if w > 0:
            weighted_vecs.append(w * entity_df.loc[movie].values)

if not weighted_vecs:
    raise RuntimeError("No weighted movies found. Check that your triples contain schema:review values.")
user_vec = np.mean(weighted_vecs, axis=0)
user_vec = user_vec / (np.linalg.norm(user_vec) + 1e-12)
print("User vector created. Norm:", np.linalg.norm(user_vec))

User vector created. Norm: 1.0
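
To make the weighting concrete, the sketch below repeats the same recipe on three hypothetical movies: min-max scale the ratings, add the like bonus, average the weighted vectors, and L2-normalize. All names and values (`toy_ratings`, `toy_liked`, `toy_vecs`) are made up for illustration.

```python
import numpy as np

# Hypothetical ratings, like flags, and 2-dimensional embeddings for three movies
toy_ratings = {"m1": 4.0, "m2": 7.0, "m3": 10.0}
toy_liked = {"m3": True}
toy_vecs = {
    "m1": np.array([1.0, 0.0]),
    "m2": np.array([0.0, 1.0]),
    "m3": np.array([1.0, 1.0]),
}

lo, hi = min(toy_ratings.values()), max(toy_ratings.values())
span = (hi - lo) if hi > lo else 1.0

def toy_weight(movie, like_bonus=0.15):
    """Min-max scaled rating plus an optional like bonus."""
    w = (toy_ratings[movie] - lo) / span
    if toy_liked.get(movie, False):
        w += like_bonus
    return max(0.0, w)

weights = {m: toy_weight(m) for m in toy_ratings}
# Mean of the weighted vectors, then L2-normalize
stacked = [weights[m] * toy_vecs[m] for m in toy_ratings if weights[m] > 0]
profile = np.mean(stacked, axis=0)
profile = profile / (np.linalg.norm(profile) + 1e-12)
print(weights)
print(profile)
```

Note that min-max scaling gives the lowest-rated movie a weight of 0, so it drops out of the profile entirely unless it is also liked.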


In [14]:
# Compose a new (unseen) movie embedding via TransE translations
def translate_neighbor_to_movie(neighbor_id, relation_label):
    if neighbor_id not in entity_df.index or relation_label not in relation_df.index:
        return None
    # TransE: movie + r ≈ neighbor  => movie ≈ neighbor - r
    return entity_df.loc[neighbor_id].values - relation_df.loc[relation_label].values

# Example inputs (replace with your own test)
actors = ["Joaquin Phoenix", "Charles Dance"]
directors = ["James Cameron"]
genres = ["Action"]      # ensure your KG has an entity labeled 'Action'; else leave empty or adjust
language = ["en"]        # ensure language token exists as an entity if you want to include it

# Resolve names -> ids
name2id = {v: k for k, v in id2name.items()}
actor_ids = [name2id.get(n) for n in actors if name2id.get(n) in entity_df.index]
director_ids = [name2id.get(n) for n in directors if name2id.get(n) in entity_df.index]
genre_ids = [n for n in genres if n in entity_df.index]
lang_ids = [n for n in language if n in entity_df.index]

# Translate every known neighbor into an estimate of the movie's position
parts = []
for ids, relation_label in [
    (actor_ids, "schema:actor"),
    (director_ids, "schema:director"),
    (genre_ids, "schema:genre"),
    (lang_ids, "ex:originalLanguage"),
]:
    for neighbor_id in ids:
        v = translate_neighbor_to_movie(neighbor_id, relation_label)
        if v is not None:
            parts.append(v)

if not parts:
    print("⚠️ No components found for the new movie. Check that the chosen names/labels exist in your KG.")
else:
    new_movie_vec = np.mean(parts, axis=0)
    new_movie_vec = new_movie_vec / (np.linalg.norm(new_movie_vec) + 1e-12)
    cosine = float(np.dot(user_vec, new_movie_vec))
    print(f"New movie similarity score: {cosine:.4f}")

New movie similarity score: 0.8937
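
The composition step can be followed by hand on a toy case. In the sketch below, each neighbor "votes" for the movie's position via `neighbor - r`, the votes are averaged and normalized, and the result is compared with a user vector by cosine similarity. All vectors are made up and chosen so that both votes agree.

```python
import numpy as np

# Toy embeddings (made up): one actor, one genre, and their relation vectors
actor_vec = np.array([2.0, 1.0])
genre_vec = np.array([1.0, 3.0])
rel_actor = np.array([1.0, 0.0])
rel_genre = np.array([0.0, 2.0])

# TransE: movie + r ≈ neighbor, so each neighbor suggests movie ≈ neighbor - r
estimates = [actor_vec - rel_actor, genre_vec - rel_genre]
movie_vec = np.mean(estimates, axis=0)
movie_vec = movie_vec / (np.linalg.norm(movie_vec) + 1e-12)

# Unit-length toy user vector pointing in the same direction
toy_user = np.array([1.0, 1.0]) / np.sqrt(2.0)
toy_cosine = float(np.dot(toy_user, movie_vec))
print(toy_cosine)
```

Because both vectors are unit length, the dot product is the cosine similarity and lies in [-1, 1]. With real embeddings the per-neighbor estimates will disagree, and averaging them is only a heuristic.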


In [7]:
# 🔍 Automatically extract favorites from the KG by counting how often each entity
# appears among rated movies with weight > 0 (plain counts, not scaled by rating)

from collections import Counter

# Load full triples if not already loaded
if 'triples_df' not in locals():
    triples_df = pd.read_csv(triples_path, sep='\t', header=None, names=['head', 'relation', 'tail'])

# --- QUICK FIX: enrich id2name with labels from the KG (covers genres) ---
label_preds = {"ex:name", "schema:name", "rdfs:label"}
label_map = (
    triples_df[triples_df['relation'].isin(label_preds)]
    .dropna(subset=['head','tail'])
    .drop_duplicates(subset=['head'])
    .set_index('head')['tail']
    .to_dict()
)
# Merge into existing id2name if present, otherwise create it
try:
    id2name
    id2name = {**label_map, **id2name}  # your custom names (right) win
except NameError:
    id2name = label_map
# --- end quick fix ---

# Get rated/liked movies with weight > 0
relevant_movies = {m for m in movie2rating if weight(m) > 0.0}

# Helper: count tail values linked to these movies by a relation
def top_related_entities(relation, top_k=10):
    related = triples_df[
        (triples_df['relation'] == relation) &
        (triples_df['head'].isin(relevant_movies))
        ]['tail']
    return Counter(related).most_common(top_k)

# Top actors, directors, genres, and original languages
top_actors = top_related_entities("schema:actor", top_k=10)
top_directors = top_related_entities("schema:director", top_k=10)
top_genres = top_related_entities("schema:genre", top_k=5)
top_langs = top_related_entities("ex:originalLanguage", top_k=3)

# Map IDs to readable names
def display_top(counter_list):
    return [(id2name.get(eid, eid), count) for eid, count in counter_list]

print("🎭 Top actors:")
print(display_top(top_actors))
print("\n🎬 Top directors:")
print(display_top(top_directors))
print("\n🏷️ Top genres:")
print(display_top(top_genres))
print("\n🌍 Top languages:")
print(display_top(top_langs))

🎭 Top actors:
[('Willem Dafoe', 10), ('Bill Murray', 9), ('Keanu Reeves', 7), ('Margot Robbie', 7), ('Jason Schwartzman', 7), ('J.K. Simmons', 7), ('Laurence Fishburne', 6), ('Sigourney Weaver', 6), ('Woody Harrelson', 6), ('Jeffrey Wright', 6)]

🎬 Top directors:
[('Wes Anderson', 11), ('David Lynch', 5), ('Quentin Tarantino', 5), ('Zack Snyder', 5), ('George Miller', 4), ('Hayao Miyazaki', 4), ('Gore Verbinski', 4), ('Bo Burnham', 4), ('Paul W. S. Anderson', 3), ('Dan Trachtenberg', 3)]

🏷️ Top genres:
[('Drama', 115), ('Comedy', 112), ('Adventure', 91), ('Action', 86), ('Science Fiction', 74)]

🌍 Top languages:
[('English', 270), ('Japanese', 8), ('German', 6)]
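
The counts above are plain frequencies over movies with a positive weight. If a rating-weighted ranking is wanted instead, one possible variant sums each movie's weight into the entities attached to it. The sketch below shows the idea; the `weighted_top` helper and the toy pairs and weights are hypothetical.

```python
from collections import Counter

def weighted_top(pairs, movie_weight, top_k=3):
    """Sum each movie's weight into the entities attached to it."""
    totals = Counter()
    for movie, entity in pairs:
        totals[entity] += movie_weight.get(movie, 0.0)
    return totals.most_common(top_k)

# Hypothetical (movie, actor) pairs and per-movie weights
toy_pairs = [("m1", "actorA"), ("m2", "actorA"), ("m3", "actorB"), ("m1", "actorC")]
toy_weights = {"m1": 0.2, "m2": 0.3, "m3": 1.0}
top = weighted_top(toy_pairs, toy_weights)
print(top)
```

Here `actorB` ranks first despite appearing only once, because the single movie carries a high weight. In the notebook, the pairs could come from the filtered `triples_df` rows and the weights from `weight(movie)`.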


In [None]:
'''
The code in the previous cells was in large part generated by AI (the free and paid versions of ChatGPT) and afterwards heavily adapted by me. Since it is not possible to say accurately which parts were originally AI-generated by which prompt, I have included all prompts that were used for this file here.
The following prompts were used:


    "what i originally wanted to do was create new movie recommendations based on these data. can I do this, for example deriving info on liked actors/directors and applying this info to a new list on movies that is currently not in the data yet?"


    "Lets start from scratch - having only my movie_kg_triples.tsv file, can you generate a script that uses PyKEEN for exactly the things your script did before? only now I want a Jupyter notebook file (.ipynb) that I can include in my project."

    "# 🔍 Automatically extract favorites from KG using rating-weighted frequency

        from collections import Counter

        # Load full triples if not already loaded
        if 'triples_df' not in locals():
            triples_df = pd.read_csv(triples_path, sep='\t', header=None, names=['head', 'relation', 'tail'])

        # Get rated/liked movies with weight > 0
        relevant_movies = {m for m in movie2rating if weight(m) > 0.0}

        # Helper: count tail values linked to these movies by a relation
        def top_related_entities(relation, top_k=10):
            related = triples_df[
                (triples_df['relation'] == relation) &
                (triples_df['head'].isin(relevant_movies))
                ]['tail']
            return Counter(related).most_common(top_k)

        # Top actors, directors, genres, and original languages
        top_actors = top_related_entities("schema:actor", top_k=10)
        top_directors = top_related_entities("schema:director", top_k=10)
        top_genres = top_related_entities("schema:genre", top_k=5)
        top_langs = top_related_entities("ex:originalLanguage", top_k=3)

        # Map IDs to readable names
        def display_top(counter_list):
            return [(id2name.get(eid, eid), count) for eid, count in counter_list]

        print("🎭 Top actors:")
        print(display_top(top_actors))
        print("\n🎬 Top directors:")
        print(display_top(top_directors))
        print("\n🏷️ Top genres:")
        print(display_top(top_genres))
        print("\n🌍 Top languages:")
        print(display_top(top_langs))

        this code works on this file: movie_kg_triples.tsv

        🎭 Top actors:
        [('Willem Dafoe', 10), ('Bill Murray', 9), ('Keanu Reeves', 7), ('Margot Robbie', 7), ('Jason Schwartzman', 7), ('J.K. Simmons', 7), ('Sigourney Weaver', 6), ('Woody Harrelson', 6), ('Jeffrey Wright', 6), ('Stanley Tucci', 6)]

        🎬 Top directors:
        [('Wes Anderson', 11), ('David Lynch', 5), ('Quentin Tarantino', 5), ('Zack Snyder', 5), ('George Miller', 4), ('Hayao Miyazaki', 4), ('Gore Verbinski', 4), ('Bo Burnham', 4), ('Paul W. S. Anderson', 3), ('Dan Trachtenberg', 3)]

        🏷️ Top genres:
        [('genre18', 115), ('genre35', 112), ('genre12', 91), ('genre28', 86), ('genre878', 74)]

        🌍 Top languages:
        [('English', 270), ('Japanese', 8), ('German', 6)]

        the output like this looks good, but i want the genres also connected to their names, which can be found in the triples."

    "# 🔍 Automatically extract favorites from KG using rating-weighted frequency

        from collections import Counter

        # Load full triples if not already loaded
        if 'triples_df' not in locals():
            triples_df = pd.read_csv(triples_path, sep='\t', header=None, names=['head', 'relation', 'tail'])

        # Get rated/liked movies with weight > 0
        relevant_movies = {m for m in movie2rating if weight(m) > 0.0}

        # Helper: count tail values linked to these movies by a relation
        def top_related_entities(relation, top_k=10):
            related = triples_df[
                (triples_df['relation'] == relation) &
                (triples_df['head'].isin(relevant_movies))
                ]['tail']
            return Counter(related).most_common(top_k)

        # Top actors, directors, genres, and original languages
        top_actors = top_related_entities("schema:actor", top_k=10)
        top_directors = top_related_entities("schema:director", top_k=10)
        top_genres = top_related_entities("schema:genre", top_k=5)
        top_langs = top_related_entities("ex:originalLanguage", top_k=3)

        # Map IDs to readable names
        def display_top(counter_list):
            return [(id2name.get(eid, eid), count) for eid, count in counter_list]

        print("🎭 Top actors:")
        print(display_top(top_actors))
        print("\n🎬 Top directors:")
        print(display_top(top_directors))
        print("\n🏷️ Top genres:")
        print(display_top(top_genres))
        print("\n🌍 Top languages:")
        print(display_top(top_langs))

        can you just include your quick fix in this code please"


'''