# Inductive Hetero-GNN **with KGE embeddings as features** (Movies/Persons/Users) + edge embeddings (optional)

**Goal:** Generate predictions for new movies that have no `schema:review` / `ex:liked` edges, and export them as triples.

**What this notebook does**
1. Loads `movie_kg_triples.tsv` (head, rel, tail).
2. Loads **entity_embeddings.csv** (path: `../data/kg/embeddings/`; fallback: an uploaded copy under `/mnt/data/`).
3. (Optional) Loads **relation_embeddings.csv** and attaches it as `edge_attr` per edge type.
4. Builds a **heterogeneous graph** (User/Movie/Person) in **PyTorch Geometric**.
5. Creates node features: **KGE embedding** (if available) **⊕** **metadata fallback** (year/runtime/popularity/language).
6. Trains **GraphSAGE** + an edge head (**regression**: rating in [0,5] *or* **classification**: liked).
7. Inference: scores **new** movies (those without a personal review edge).
8. Export: appends the top predictions to the KG as `ex:predictedReview` triples, rounded to 0.5 steps (in classification mode the probability is scaled to 0..5). A timestamped backup of the triples file is written first.
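
As a minimal sketch of the input format from step 1, the tab-separated triples can be read with the standard library alone. The three sample rows are taken from the preview shown in section 2; `read_triples` is a hypothetical helper for illustration, not part of the notebook (which uses `pandas.read_csv`).

```python
import csv
import io

# Three sample rows in the head<TAB>rel<TAB>tail layout of movie_kg_triples.tsv
SAMPLE_TSV = (
    "movie452522\trdf:type\tschema:Movie\n"
    "movie452522\tschema:name\tTwin Peaks\n"
    "movie452522\tschema:review\tpersonalVote_5.0\n"
)

def read_triples(text):
    """Parse tab-separated text into a list of (head, rel, tail) tuples."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    # Keep only well-formed rows with exactly three fields
    return [tuple(row) for row in reader if len(row) == 3]

triples_demo = read_triples(SAMPLE_TSV)
print(triples_demo[1])
```

Literal values such as the title and the rating token live in the `tail` column, which is why later cells parse them with regular expressions.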


In [52]:
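# Optional: installation (when running locally, in Colab, or in a virtualenv)
# The PyG wheel index below is built for torch 2.2.0 (CPU), so torch is pinned to match.
!pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cpu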
!pip install torch-geometric torch-scatter torch-sparse torch-cluster torch-spline-conv -f https://data.pyg.org/whl/torch-2.2.0+cpu.html
!pip install pandas numpy scikit-learn tqdm python-dotenv

Looking in indexes: https://download.pytorch.org/whl/cpu
Looking in links: https://data.pyg.org/whl/torch-2.2.0+cpu.html


## 1) Imports, paths & configuration

In [53]:
import os, re, math
from pathlib import Path
from collections import Counter

import numpy as np
import pandas as pd

import torch
from torch import nn
from sklearn.model_selection import train_test_split

try:
    from torch_geometric.data import HeteroData
    from torch_geometric.nn import HeteroConv, SAGEConv, Linear
    pyg_available = True
    print("PyTorch Geometric available.")
except Exception as e:
    pyg_available = False
    print("PyG not available in this env. You can still inspect/export this notebook; run it locally with PyG installed.")

# ---- Configuration ----
# Paths (adjust if necessary)
KG_PATH = Path("../data/kg/triples/movie_kg_triples.tsv")
EMB_DIR = Path("../data/kg/embeddings")
ENTITY_EMB_PATHS = [EMB_DIR / "entity_embeddings.csv", Path("/mnt/data/entity_embeddings.csv")]
REL_EMB_PATHS    = [EMB_DIR / "relation_embeddings.csv", Path("/mnt/data/relation_embeddings.csv")]

# Training mode for the edge head: 'regression' (rating 0..5) or 'classification' (liked)
HEAD_MODE = 'regression'   # 'regression' | 'classification'
LIKED_THRESHOLD = 4.0      # classification only: label = 1 if personal5 >= LIKED_THRESHOLD

# Inference/export
TOPK_EXPORT = 100          # number of predictions to export
EXPORT_MODE = 'predictedReview'  # 'predictedReview' | 'liked' (not read by the export cell, which always writes ex:predictedReview)

PyTorch Geometric available.


## 2) Load data: triples & embeddings

In [54]:
# Load the KG (fail early with a clear message if the file is missing)
if not KG_PATH.exists():
    raise FileNotFoundError(f"Triples file not found: {KG_PATH}")

triples = pd.read_csv(KG_PATH, sep="\t", header=None, names=["head","rel","tail"])
print("Triples loaded:", len(triples))
display(triples.head(5))

def first_existing(paths):
    for p in paths:
        if Path(p).exists():
            return Path(p)
    return None

ent_path = first_existing(ENTITY_EMB_PATHS)
rel_path = first_existing(REL_EMB_PATHS)

print("Entity embeddings:", ent_path if ent_path else "NOT FOUND")
print("Relation embeddings:", rel_path if rel_path else "NOT FOUND")

# Read entity embeddings (flexible schema: 'entity' or 'id' column + numeric columns)
entity_emb = None
ent_id_col = None
if ent_path:
    tmp = pd.read_csv(ent_path)
    if "entity" in tmp.columns:
        ent_id_col = "entity"
    elif "id" in tmp.columns:
        ent_id_col = "id"
    else:
        ent_id_col = tmp.columns[0]
    # use only the numeric columns as the vector
    vec_cols = [c for c in tmp.columns if c != ent_id_col]
    for c in vec_cols: tmp[c] = pd.to_numeric(tmp[c], errors="coerce")
    vec_cols = [c for c in vec_cols if tmp[c].notna().any()]
    entity_emb = tmp[[ent_id_col] + vec_cols].dropna()
    print(f"Loaded entity embeddings: {entity_emb.shape[0]} entities, dim={len(vec_cols)}")
else:
    print("No entity embeddings available; will use metadata-only features for all nodes.")

# Relation embeddings (optional)
relation_emb = None
rel_key_col = None
if rel_path:
    tmp = pd.read_csv(rel_path)
    if "relation" in tmp.columns:
        rel_key_col = "relation"
    elif "rel" in tmp.columns:
        rel_key_col = "rel"
    else:
        rel_key_col = tmp.columns[0]
    rvec_cols = [c for c in tmp.columns if c != rel_key_col]
    for c in rvec_cols: tmp[c] = pd.to_numeric(tmp[c], errors="coerce")
    rvec_cols = [c for c in rvec_cols if tmp[c].notna().any()]
    relation_emb = tmp[[rel_key_col] + rvec_cols].dropna()
    print(f"Loaded relation embeddings: {relation_emb.shape[0]} relations, dim={len(rvec_cols)}")
else:
    print("No relation embeddings; edge_attr will be empty (the model still works).")


Triples loaded: 58616


Unnamed: 0,head,rel,tail
0,movie452522,rdf:type,schema:Movie
1,movie452522,schema:name,Twin Peaks
2,movie452522,schema:datePublished,published_1989
3,movie452522,schema:aggregateRating,avgVote_8.4
4,movie452522,schema:review,personalVote_5.0


Entity embeddings: ../data/kg/embeddings/entity_embeddings.csv
Relation embeddings: ../data/kg/embeddings/relation_embeddings.csv
Loaded entity embeddings: 19834 entities, dim=50
Loaded relation embeddings: 21 relations, dim=50
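
The flexible schema handling above (key column named `entity` or `id`, otherwise the first column; everything else treated as vector components) can be illustrated without pandas. This is a simplified sketch: `load_embedding_table` is a hypothetical helper, the two-dimensional values are made up, and unlike the notebook it skips a whole row as soon as one value is not numeric.

```python
import csv
import io

def load_embedding_table(text, preferred_keys=("entity", "id")):
    """Return {key: [floats]} from CSV text. The key column is the first
    preferred name found in the header, else the first column."""
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    key_col = next((k for k in preferred_keys if k in columns), columns[0])
    table = {}
    for row in reader:
        try:
            vec = [float(row[c]) for c in columns if c != key_col]
        except (TypeError, ValueError):
            continue  # skip rows with missing or non-numeric values
        table[row[key_col]] = vec
    return table

# Made-up two-dimensional embeddings, for illustration only
demo_csv = "id,d0,d1\nmovie496,0.1,-0.2\nmovie22970,0.3,oops\n"
demo = load_embedding_table(demo_csv)
print(demo)
```

The resulting dictionary plays the same role as `entity_lookup` in section 5: a direct map from node ID to its embedding vector.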


## 3) Derive nodes/attributes & feature fallbacks

In [55]:
# Parser
def extract_year(s):
    if pd.isna(s): return None
    m = re.search(r"(19|20)\d{2}", str(s))
    return int(m.group(0)) if m else None

def extract_float_token(s, key_prefix):
    if s is None or (isinstance(s, float) and pd.isna(s)): return None
    m = re.search(rf"{re.escape(key_prefix)}[_:\s]*([0-9]+(?:[.,][0-9]+)?)", str(s), flags=re.I)
    if m: return float(m.group(1).replace(",", "."))
    return None

def extract_personal_review(s):  # 0..5
    return extract_float_token(s, "personalVote")

def extract_avg_vote(s):         # 0..10
    return extract_float_token(s, "avgVote")

# Movies
#movie_ids = set(triples.loc[(triples.rel=="rdf:type") & (triples.tail=="schema:Movie"), "head"].astype(str))

# 1) Normalization
triples["rel_norm"]  = triples["rel"].str.strip().str.lower()
triples["tail_norm"] = triples["tail"].str.strip().str.lower()

# 2) Detect type triples in several spelling variants
type_aliases = {"rdf:type", "a", "type", "schema:type", "@type"}
movie_aliases = {
    "schema:movie", "movie",
    "http://schema.org/movie", "https://schema.org/movie", "schema.org/movie"
}

type_mask  = triples["rel_norm"].isin(type_aliases)
movie_mask = triples["tail_norm"].isin(movie_aliases)
movie_ids_1 = set(triples.loc[type_mask & movie_mask, "head"])

# 3) Fallback: nodes that carry typical movie attributes
movie_like_rels = {
    "schema:director","schema:actor","schema:aggregaterating","schema:datepublished",
    "schema:duration","schema:genre","ex:originallanguage","ex:popularity", # "schema:name"
}
movie_ids_2 = set(triples.loc[triples["rel_norm"].isin(movie_like_rels), "head"])

# 4) Optional: prefix heuristic (if you use tmdb IDs)
movie_ids_3 = set(triples.loc[triples["head"].str.startswith("tmdbmovie", na=False), "head"])

movie_ids = movie_ids_1 | movie_ids_2 | movie_ids_3
print("Movies detected:", len(movie_ids))

name_map = triples[triples.rel=="schema:name"].set_index("head")["tail"].to_dict()
year_map = triples[triples.rel=="schema:datePublished"].set_index("head")["tail"].to_dict()
avg_map  = triples[triples.rel=="schema:aggregateRating"].set_index("head")["tail"].to_dict()
rev_map  = triples[triples.rel=="schema:review"].set_index("head")["tail"].to_dict()
dur_map  = triples[triples.rel=="schema:duration"].set_index("head")["tail"].to_dict()
lang_map = triples[triples.rel=="ex:originalLanguage"].set_index("head")["tail"].to_dict()
pop_map  = triples[triples.rel=="ex:popularity"].set_index("head")["tail"].to_dict()

rows = []
for mid in movie_ids:
    title = name_map.get(mid)
    year = extract_year(year_map.get(mid))
    personal5 = extract_personal_review(rev_map.get(mid))
    avg10 = extract_avg_vote(avg_map.get(mid))
    runtime = None
    if mid in dur_map:
        m = re.search(r"(\d+)", str(dur_map[mid]))
        runtime = int(m.group(1)) if m else None
    lang = lang_map.get(mid)
    pop = None
    if mid in pop_map:
        m = re.search(r"([0-9]+(?:[.,][0-9]+)?)", str(pop_map[mid]))
        if m: pop = float(m.group(1).replace(",","."))
    rows.append({"movie_id": mid, "title": title, "year": year, "personal5": personal5, "avg10": avg10,
                 "runtime": runtime, "language": lang, "pop": pop})
movie_tbl = pd.DataFrame(rows)

# Directors / Actors
dir_edges = triples[triples.rel=="schema:director"][["head","tail"]].astype(str).values.tolist()
act_edges = triples[triples.rel=="schema:actor"][["head","tail"]].astype(str).values.tolist()
person_ids = set([t for _,t in dir_edges+act_edges])

# Users (heads starting with "user"; fall back to one implicit user if there are none)
user_ids = set(triples.loc[triples["head"].str.startswith("user", na=False), "head"].astype(str))
if not user_ids:
    user_ids = {"user0"}

# Ratings as (User -> Movie) edges with label 0..5
default_user = sorted(user_ids)[0]   # first (or only) user, deterministic and consistent with the sorted ID maps
rated_src, rated_dst, rated_y = [], [], []
for _, row in movie_tbl.dropna(subset=["personal5"]).iterrows():
    rated_src.append(default_user)
    rated_dst.append(row["movie_id"])
    rated_y.append(float(row["personal5"]))

print("Movies:", len(movie_tbl), "| Persons:", len(person_ids), "| Users:", len(user_ids), "| Rated edges:", len(rated_y))
display(movie_tbl.head(5))

Movies detected: 703
Movies: 703 | Persons: 4611 | Users: 1 | Rated edges: 297


Unnamed: 0,movie_id,title,year,personal5,avg10,runtime,language,pop
0,movie496,Borat: Cultural Learnings of America for Make ...,2006,,6.782,84.0,en,7.069
1,movie22970,The Cabin in the Woods,2011,,6.639,95.0,en,8.1935
2,movie391713,Lady Bird,2017,4.5,7.3,94.0,en,8.4172
3,movie537116,"tick, tick... BOOM!",2021,3.5,7.615,115.0,en,4.9097
4,movie354912,Coco,2017,3.5,8.2,105.0,en,23.6656
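
The attribute values in this KG are encoded as string tokens (`personalVote_5.0`, `avgVote_8.4`, `published_1989`), which the parsers above turn back into numbers. A minimal standalone sketch of that idea, using the same regular expressions but hypothetical function names and no pandas:

```python
import re

def parse_token(value, prefix):
    """Extract the number that follows `prefix` in tokens such as 'avgVote_8.4'."""
    if value is None:
        return None
    pattern = rf"{re.escape(prefix)}[_:\s]*([0-9]+(?:[.,][0-9]+)?)"
    m = re.search(pattern, str(value), flags=re.I)
    # Accept a decimal comma as well as a decimal point
    return float(m.group(1).replace(",", ".")) if m else None

def parse_year(value):
    """Extract a four-digit year (1900-2099) from tokens such as 'published_1989'."""
    m = re.search(r"(?:19|20)\d{2}", str(value)) if value is not None else None
    return int(m.group(0)) if m else None

print(parse_token("personalVote_5.0", "personalVote"))
print(parse_token("avgVote_8.4", "avgVote"))
print(parse_year("published_1989"))
```

Tokens that do not match return `None`, which is what later becomes `NaN` in `movie_tbl` and marks a movie as unrated.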


## 4) ID mappings & relational edges (edge_attr via relation_embeddings)

In [56]:
def build_idmap(ids):
    ids = sorted(list(ids))
    return {k:i for i,k in enumerate(ids)}, ids

movie_id2idx, movie_idx2id = build_idmap(movie_ids)
person_id2idx, person_idx2id = build_idmap(person_ids)
user_id2idx, user_idx2id = build_idmap(user_ids)

def edge_index_from_pairs(pairs, src_map, dst_map):
    idx = [[src_map[h], dst_map[t]] for h,t in pairs if h in src_map and t in dst_map]
    if len(idx)==0:
        return np.zeros((2,0), dtype=int)
    return np.array(idx, dtype=int).T

dir_edge_index = edge_index_from_pairs(dir_edges, movie_id2idx, person_id2idx)
act_edge_index = edge_index_from_pairs(act_edges, movie_id2idx, person_id2idx)
rated_edge_index = edge_index_from_pairs(list(zip(rated_src, rated_dst)), user_id2idx, movie_id2idx)
rated_y = np.array(rated_y, dtype=float)

print("edge_index shapes -> dir:", dir_edge_index.shape, "| act:", act_edge_index.shape, "| rated:", rated_edge_index.shape)

# Edge attributes per relation type (the same vector for every edge of that relation)
edge_attr = {}
if relation_emb is not None:
    rkey = relation_emb.columns[0]
    rvec_cols = relation_emb.columns[1:]
    rel2vec = {row[rkey]: row[rvec_cols].values.astype(np.float32) for _,row in relation_emb.iterrows()}
    def make_edge_attr(num_edges, rel_name):
        if rel_name in rel2vec and num_edges>0:
            vec = rel2vec[rel_name]
            return np.tile(vec, (num_edges,1))
        return None
    edge_attr[('Movie','hasDirector','Person')] = make_edge_attr(dir_edge_index.shape[1], 'schema:director')
    edge_attr[('Movie','hasActor','Person')]    = make_edge_attr(act_edge_index.shape[1], 'schema:actor')
    edge_attr[('User','rated','Movie')]         = make_edge_attr(rated_edge_index.shape[1], 'schema:review')

edge_index shapes -> dir: (2, 803) | act: (2, 6745) | rated: (2, 297)
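
The mapping step above assigns each string ID a contiguous integer and then turns (head, tail) pairs into a `2 x E` index structure, silently dropping pairs whose endpoints are unknown. A small pure-Python sketch of the same logic, with hypothetical IDs and lists standing in for the NumPy arrays:

```python
def build_idmap(ids):
    """Map each ID to a contiguous index, in sorted order."""
    ordered = sorted(ids)
    return {k: i for i, k in enumerate(ordered)}, ordered

def edge_index_from_pairs(pairs, src_map, dst_map):
    """Return [sources, targets] index lists, dropping pairs with unknown IDs."""
    kept = [(src_map[h], dst_map[t]) for h, t in pairs if h in src_map and t in dst_map]
    return [[s for s, _ in kept], [d for _, d in kept]]

# Hypothetical IDs, for illustration only
movie_map, movie_order = build_idmap({"movieB", "movieA"})
person_map, _ = build_idmap({"personX", "personY"})
pairs = [("movieA", "personY"), ("movieB", "personX"), ("movieC", "personX")]
demo_edges = edge_index_from_pairs(pairs, movie_map, person_map)
print(demo_edges)
```

Sorting the IDs before enumerating them keeps the index assignment reproducible across runs, which matters when predictions are mapped back to movie IDs at export time.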


## 5) Node features = **KGE entity embedding** ⊕ **metadata fallback**

In [57]:
# 5.1 Metadata vectors (fallback)
def norm_col(x):
    x = x.astype(float)
    mn, mx = np.nanmin(x), np.nanmax(x)
    if not np.isfinite(mn) or not np.isfinite(mx) or mx==mn:
        return np.zeros_like(x, dtype=float)
    y = (x - mn) / (mx - mn)
    y[np.isnan(y)] = 0.0
    return y

mt = movie_tbl.set_index("movie_id").reindex(movie_idx2id)
year_feat    = norm_col(mt["year"].fillna(mt["year"].median()).values)
runtime_feat = norm_col(mt["runtime"].fillna(mt["runtime"].median()).values)
pop_feat     = norm_col(mt["pop"].fillna(mt["pop"].median()).values)

lang_series = mt["language"].fillna("unknown").astype(str)
top_langs = [l for l,_ in Counter(lang_series).most_common(8)]
lang_feat = np.stack([ (lang_series==L).astype(float).values for L in top_langs ], axis=1) if len(top_langs)>0 else np.zeros((len(mt),0))

meta_movie = np.stack([year_feat, runtime_feat, pop_feat], axis=1)
if lang_feat.shape[1] > 0:
    meta_movie = np.concatenate([meta_movie, lang_feat], axis=1)

meta_person = np.zeros((len(person_idx2id), max(4, meta_movie.shape[1]//2)), dtype=float)
meta_user   = np.zeros((len(user_idx2id),   max(4, meta_movie.shape[1]//2)), dtype=float)

# 5.2 Map KGE entity embeddings (if available)
def build_entity_lookup(df, id_col):
    return {str(row[id_col]): row.drop(labels=[id_col]).to_numpy(dtype=np.float32) for _,row in df.iterrows()}

entity_lookup = build_entity_lookup(entity_emb, entity_emb.columns[0]) if entity_emb is not None else {}

def stack_features(ids_list, meta_fallback):
    # Note: a node without a KGE vector contributes only its metadata, which is
    # zero-padded on the right. Its metadata then sits in the leading columns,
    # not in the columns that hold the metadata of nodes with an embedding.
    X = []
    for i, node_id in enumerate(ids_list):
        vec = entity_lookup.get(str(node_id))
        if vec is not None:
            # concat: [entity_emb ⊕ meta_fallback[i]]
            v = np.concatenate([vec, meta_fallback[i]], axis=0)
        else:
            v = meta_fallback[i]
        X.append(v.astype(np.float32))
    # pad to a common dimension
    maxd = max(x.shape[0] for x in X) if X else 0
    Xp = np.zeros((len(X), maxd), dtype=np.float32)
    for i,x in enumerate(X):
        Xp[i,:x.shape[0]] = x
    return Xp

movie_x  = stack_features(movie_idx2id,  meta_movie)
person_x = stack_features(person_idx2id, meta_person)
user_x   = stack_features(user_idx2id,   meta_user)

print("movie_x:", movie_x.shape, "| person_x:", person_x.shape, "| user_x:", user_x.shape)


movie_x: (703, 61) | person_x: (4611, 55) | user_x: (1, 5)
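
One consequence of padding on the right is that a node without a KGE vector places its metadata where other nodes carry embedding values. A possible alternative, shown here only as a sketch and not what the notebook runs, is to reserve a zero block of the embedding width for such nodes so that metadata columns stay aligned. `stack_aligned` and all values below are hypothetical; adopting it would also change the feature width of nodes that currently have no embedding.

```python
def stack_aligned(ids, lookup, meta, emb_dim):
    """Build rows of length emb_dim + len(meta[i]). Nodes missing from
    `lookup` get a zero embedding block, so metadata columns stay aligned."""
    rows = []
    for i, node_id in enumerate(ids):
        emb = lookup.get(node_id, [0.0] * emb_dim)
        rows.append(list(emb) + list(meta[i]))
    return rows

# Hypothetical 2-d embeddings and 1-d metadata, for illustration only
lookup = {"movieA": [0.5, -0.5]}
meta = [[0.25], [0.75]]
demo_x = stack_aligned(["movieA", "movieB"], lookup, meta, emb_dim=2)
print(demo_x)
```

With this layout the last column always holds the metadata value, regardless of whether an embedding was found.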


## 6) HeteroData in PyG

In [58]:
if not pyg_available:
    raise RuntimeError("PyG is not available. Please run locally with torch_geometric installed.")

data = HeteroData()
data['Movie'].x  = torch.tensor(movie_x, dtype=torch.float)
data['Person'].x = torch.tensor(person_x, dtype=torch.float)
data['User'].x   = torch.tensor(user_x, dtype=torch.float)

# Edge indices
data[('Movie','hasDirector','Person')].edge_index = torch.tensor(dir_edge_index, dtype=torch.long)
data[('Movie','hasActor','Person')].edge_index    = torch.tensor(act_edge_index, dtype=torch.long)
data[('User','rated','Movie')].edge_index         = torch.tensor(rated_edge_index, dtype=torch.long)
if rated_edge_index.shape[1] > 0:
    data[('User','rated','Movie')].edge_label = torch.tensor(rated_y, dtype=torch.float)

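# Edge attributes (relation embeddings), if available. They are stored on the graph,
# but the model in section 8 only receives edge_index_dict and does not read them.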
for key, ea in edge_attr.items():
    if ea is not None:
        data[key].edge_attr = torch.tensor(ea, dtype=torch.float)

print(data)

HeteroData(
  Movie={ x=[703, 61] },
  Person={ x=[4611, 55] },
  User={ x=[1, 5] },
  (Movie, hasDirector, Person)={
    edge_index=[2, 803],
    edge_attr=[803, 50],
  },
  (Movie, hasActor, Person)={
    edge_index=[2, 6745],
    edge_attr=[6745, 50],
  },
  (User, rated, Movie)={
    edge_index=[2, 297],
    edge_label=[297],
    edge_attr=[297, 50],
  }
)


## 7) Train/val/test split over the existing `rated` edges

In [59]:
if data[('User','rated','Movie')].edge_index.numel() > 0:
    E = data[('User','rated','Movie')].edge_index.shape[1]
    idx = np.arange(E)
    train_idx, test_idx = train_test_split(idx, test_size=0.2, random_state=42)
    train_idx, val_idx  = train_test_split(train_idx, test_size=0.2, random_state=42)

    for split, ids in [('train',train_idx),('val',val_idx),('test',test_idx)]:
        mask = torch.zeros(E, dtype=torch.bool)
        mask[torch.tensor(ids, dtype=torch.long)] = True
        data[('User','rated','Movie')][f'{split}_mask'] = mask
    print({k:int(v.sum()) for k,v in {
        'train': data[('User','rated','Movie')]['train_mask'],
        'val':   data[('User','rated','Movie')]['val_mask'],
        'test':  data[('User','rated','Movie')]['test_mask']
    }.items()})
else:
    print("No rated edges found; regression/classification cannot be trained.")


{'train': 189, 'val': 48, 'test': 60}
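
The split sizes printed above follow from applying a 20% hold-out twice. The arithmetic can be reproduced as below, assuming that `train_test_split` rounds the test share up to the next whole edge and gives the remainder to the training side; `split_sizes` is a hypothetical helper for this check.

```python
import math

def split_sizes(n, test_frac):
    """Sizes (n_train, n_test) when the test part is rounded up."""
    n_test = math.ceil(test_frac * n)
    return n - n_test, n_test

n_trainval, n_test = split_sizes(297, 0.2)      # first split: train+val vs. test
n_train, n_val = split_sizes(n_trainval, 0.2)   # second split: train vs. val
print({"train": n_train, "val": n_val, "test": n_test})
```

In relative terms this is roughly a 64/16/20 split of the 297 rated edges, so the validation metric in section 9 rests on only 48 edges.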


## 8) Model: Hetero GraphSAGE + edge head

In [60]:
from torch_geometric.nn import HeteroConv, SAGEConv, Linear

class HeteroSAGE(torch.nn.Module):
    def __init__(self, metadata, hidden_channels=64, out_channels=64, num_layers=2):
        super().__init__()
        self.convs = torch.nn.ModuleList()
        for _ in range(num_layers):
            conv = HeteroConv({
                ('Movie','hasDirector','Person'): SAGEConv((-1, -1), hidden_channels),
                ('Movie','hasActor','Person'):    SAGEConv((-1, -1), hidden_channels),
                ('User','rated','Movie'):         SAGEConv((-1, -1), hidden_channels),
                # Reverse edges (created on the fly in forward)
                ('Person','rev_hasDirector','Movie'): SAGEConv((-1, -1), hidden_channels),
                ('Person','rev_hasActor','Movie'):    SAGEConv((-1, -1), hidden_channels),
                ('Movie','rev_rated','User'):         SAGEConv((-1, -1), hidden_channels),
            }, aggr='sum')
            self.convs.append(conv)
        self.lin_dict = torch.nn.ModuleDict({nt: Linear(-1, out_channels) for nt in metadata[0]})

    def forward(self, x_dict, edge_index_dict):
        edge_index_dict = dict(edge_index_dict)  # shallow copy, so the caller's dict is not mutated
        def add_rev(edge_index): return edge_index.flip(0)
        if ('Movie','hasDirector','Person') in edge_index_dict:
            edge_index_dict.setdefault(('Person','rev_hasDirector','Movie'), add_rev(edge_index_dict[('Movie','hasDirector','Person')]))
        if ('Movie','hasActor','Person') in edge_index_dict:
            edge_index_dict.setdefault(('Person','rev_hasActor','Movie'), add_rev(edge_index_dict[('Movie','hasActor','Person')]))
        if ('User','rated','Movie') in edge_index_dict:
            edge_index_dict.setdefault(('Movie','rev_rated','User'), add_rev(edge_index_dict[('User','rated','Movie')]))

        for conv in self.convs:
            x_dict = conv(x_dict, edge_index_dict)
            x_dict = {k: torch.relu(v) for k,v in x_dict.items()}
        out = {k: self.lin_dict[k](v) for k,v in x_dict.items()}
        return out

class EdgeHead(nn.Module):
    def __init__(self, in_dim, mode='regression'):
        super().__init__()
        self.mode = mode
        self.mlp = nn.Sequential(
            nn.Linear(in_dim*3, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )
        self.out_act = nn.Identity() if mode=='regression' else nn.Sigmoid()

    def forward(self, u, m):
        x = torch.cat([u, m, u*m], dim=-1)
        y = self.mlp(x)
        return self.out_act(y).squeeze(-1)
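
The edge head scores a (user, movie) pair from the concatenation `[u, m, u*m]`, which is why its first linear layer expects `in_dim*3` inputs. A tiny pure-Python sketch of that feature construction, with hypothetical 3-dimensional embeddings and lists in place of tensors:

```python
def pair_features(u, m):
    """Concatenate u, m and their element-wise product, as in [u, m, u*m]."""
    if len(u) != len(m):
        raise ValueError("user and movie embeddings must have the same length")
    return list(u) + list(m) + [a * b for a, b in zip(u, m)]

# Hypothetical 3-d embeddings, for illustration only
demo_pair = pair_features([1.0, 2.0, 0.0], [0.5, -1.0, 4.0])
print(demo_pair, len(demo_pair))
```

The element-wise product gives the MLP a direct interaction term between the two embeddings, in addition to the raw vectors.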


## 9) Training & Evaluation

In [61]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
data = data.to(device)

hidden = 64
outdim = 64
model = HeteroSAGE(data.metadata(), hidden_channels=hidden, out_channels=outdim, num_layers=2).to(device)
head = EdgeHead(outdim, mode=HEAD_MODE).to(device)

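# Materialize the lazily sized layers (in_channels=-1) with one dry run before building the optimizer
with torch.no_grad():
    model(data.x_dict, data.edge_index_dict)

params = list(model.parameters()) + list(head.parameters())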
opt = torch.optim.Adam(params, lr=1e-3, weight_decay=1e-4)
loss_fn = nn.MSELoss() if HEAD_MODE=='regression' else nn.BCELoss()

def get_edge_indices(split):
    mask = data[('User','rated','Movie')][f'{split}_mask']
    ei = data[('User','rated','Movie')].edge_index[:, mask]
    ys = data[('User','rated','Movie')].edge_label[mask]
    return ei, ys

def train_epoch():
    model.train(); head.train()
    opt.zero_grad()
    out = model(data.x_dict, data.edge_index_dict)
    ei, ys = get_edge_indices('train')
    u_emb = out['User'][ei[0]]
    m_emb = out['Movie'][ei[1]]
    target = ys if HEAD_MODE=='regression' else (ys >= LIKED_THRESHOLD).float()
    pred = head(u_emb, m_emb)
    loss = loss_fn(pred, target)
    loss.backward()
    opt.step()
    return float(loss.item())

@torch.no_grad()
def eval_split(split):
    model.eval(); head.eval()
    out = model(data.x_dict, data.edge_index_dict)
    ei, ys = get_edge_indices(split)
    u_emb = out['User'][ei[0]]
    m_emb = out['Movie'][ei[1]]
    target = ys if HEAD_MODE=='regression' else (ys >= LIKED_THRESHOLD).float()
    pred = head(u_emb, m_emb)
    if HEAD_MODE=='regression':
        rmse = torch.sqrt(torch.mean((pred-ys)**2)).item()
        return rmse
    else:
        prob = pred.detach().cpu().numpy()
        ytrue = target.detach().cpu().numpy()
        acc = ((prob>=0.5)==(ytrue>=0.5)).mean()
        return acc

if data[('User','rated','Movie')].edge_index.numel() > 0:
    for epoch in range(1, 51):
        tr_loss = train_epoch()
        if epoch%5==0:
            metric = eval_split('val')
            print(f"Epoch {epoch:03d} | train_loss={tr_loss:.4f} | val_{'RMSE' if HEAD_MODE=='regression' else 'ACC'}={metric:.4f}")
else:
    print("Skip training: no rated edges.")


Epoch 005 | train_loss=13.9891 | val_RMSE=3.7830
Epoch 010 | train_loss=12.2530 | val_RMSE=3.4693
Epoch 015 | train_loss=6.7613 | val_RMSE=2.3008
Epoch 020 | train_loss=2.5395 | val_RMSE=1.8093
Epoch 025 | train_loss=1.2347 | val_RMSE=1.0952
Epoch 030 | train_loss=2.0615 | val_RMSE=1.4733
Epoch 035 | train_loss=1.3304 | val_RMSE=1.0770
Epoch 040 | train_loss=1.4147 | val_RMSE=1.1062
Epoch 045 | train_loss=1.1055 | val_RMSE=1.0295
Epoch 050 | train_loss=1.1898 | val_RMSE=1.1005
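
A validation RMSE is easier to judge next to a trivial reference, such as always predicting the mean training rating. The sketch below shows how such a baseline could be computed; the ratings are made up for illustration and are not the notebook's data, and `mean_baseline_rmse` is a hypothetical helper.

```python
import math

def rmse(pred, target):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target))

def mean_baseline_rmse(train_y, eval_y):
    """RMSE of always predicting the mean of the training labels."""
    mean = sum(train_y) / len(train_y)
    return rmse([mean] * len(eval_y), eval_y)

# Made-up ratings, for illustration only (not the notebook's data)
train_y = [3.0, 4.0, 5.0]
eval_y = [3.0, 5.0]
print(mean_baseline_rmse(train_y, eval_y))
```

If the model's validation RMSE is not clearly below this kind of baseline on the real labels, the graph features are contributing little beyond the average rating.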


## 10) Inference for **new movies** (without a personal review)

In [62]:
@torch.no_grad()
def predict_for_user_on_movies(user_id):
    model.eval(); head.eval()
    out = model(data.x_dict, data.edge_index_dict)
    u_idx = torch.tensor([user_id2idx[user_id]], device=out['User'].device)
    u_emb = out['User'][u_idx].repeat(len(movie_idx2id), 1)
    m_emb = out['Movie']
    pred = head(u_emb, m_emb).detach().cpu().numpy()
    return pred  # [num_movies]

# Identify movies without a personal rating
rated_movies_set = set(rated_dst)
unrated_movie_mask = [mid not in rated_movies_set for mid in movie_idx2id]
unrated_indices = np.where(unrated_movie_mask)[0]

if len(unrated_indices) == 0:
    print("All movies already have personal reviews in the KG (nothing to recommend).")
else:
    uid0 = next(iter(user_id2idx.keys()))
    scores_all = predict_for_user_on_movies(uid0)
    unrated_scores = scores_all[unrated_indices]
    pred_df = pd.DataFrame({
        "movie_id": [movie_idx2id[i] for i in unrated_indices],
        "pred_score": unrated_scores
    }).sort_values("pred_score", ascending=False)
    display(pred_df.head(10))

Unnamed: 0,movie_id,pred_score
285,movie539617,3.768758
152,movie284053,3.711945
36,movie12445,3.664667
132,movie259316,3.630157
234,movie425,3.616824
69,movie162,3.610948
195,movie364,3.600406
232,movie424121,3.593978
323,movie675,3.582785
315,movie64688,3.578642
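
The selection logic of this cell (score every movie, drop the ones that already have a rating, sort the rest in descending order) can be sketched independently of the model. The scores and IDs below are hypothetical, and `top_unrated` is an illustrative helper rather than a function from the notebook.

```python
def top_unrated(scores, rated, k):
    """Return the k best (movie_id, score) pairs among movies not in `rated`."""
    candidates = [(mid, s) for mid, s in scores.items() if mid not in rated]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical scores, for illustration only
scores = {"movieA": 3.2, "movieB": 4.1, "movieC": 3.9, "movieD": 2.0}
demo_top = top_unrated(scores, rated={"movieB"}, k=2)
print(demo_top)
```

Here `movieB` has the highest score but is excluded because it is already rated, which mirrors how `unrated_indices` filters `scores_all` above.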


## 11) Export as new triples (a timestamped backup is written first)

In [63]:
# === EXPORT: one triple per prediction, in the format:
# movieID    ex:predictedReview    personalVote_4.00
# rounded to 0.5 steps, with an automatic backup ===

import os, csv, math, shutil
from datetime import datetime

def round_to_half(x: float) -> float:
    # Round to the nearest 0.5 (3.66 -> 3.5, 3.99 -> 4.0)
    return math.floor(x*2 + 0.5) / 2.0

def sanitize_cell(x: str) -> str:
    if x is None:
        return ""
    s = str(x)
    return s.replace("\t", " ").replace("\r", " ").replace("\n", " ").strip()

def append_triples_safely(new_triples_df: pd.DataFrame, path: str):
    df = new_triples_df.rename(columns={0:"head",1:"rel",2:"tail"})[["head","rel","tail"]].copy()
    for c in ["head","rel","tail"]:
        df[c] = df[c].map(sanitize_cell)

    # make sure a non-empty file ends with a newline before appending
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, "rb") as f:
            f.seek(-1, os.SEEK_END)
            needs_newline = f.read(1) != b"\n"
        if needs_newline:
            with open(path, "ab") as g:
                g.write(b"\n")

    df.to_csv(
        path,
        mode="a",
        header=False,
        index=False,
        sep="\t",
        lineterminator="\n",
        encoding="utf-8",
        quoting=csv.QUOTE_MINIMAL,
        escapechar="\\",
    )

# --- build triples from pred_df ---
if 'pred_df' in locals() and len(pred_df) > 0:
    top = pred_df.head(TOPK_EXPORT).copy()

    rows = []
    for _, r in top.iterrows():
        mid = r['movie_id']
        score = float(r['pred_score'])

        if HEAD_MODE == 'classification':
            # in classification mode, interpret the score as a probability and map it to 0..5
            score = 5.0 * score

        rounded = round_to_half(score)
        # clamp to 0..5 (just to be safe)
        rounded = max(0.0, min(5.0, rounded))

        tail_value = f"personalVote_{rounded:.2f}"
        rows.append([mid, "ex:predictedReview", tail_value])

    export_df = pd.DataFrame(rows, columns=["head","rel","tail"])

    # Optional: skip triples that already exist in the KG
    existing = pd.read_csv(KG_PATH, sep="\t", header=None, names=["head","rel","tail"], dtype=str, keep_default_na=False, engine="python")
    merged = export_df.merge(existing.drop_duplicates(), on=["head","rel","tail"], how="left", indicator=True)
    unique_new = merged[merged["_merge"] == "left_only"][["head","rel","tail"]]

    print(f"New triples (after deduplication): {len(unique_new)}")
    display(unique_new.head(10))

    # --- Backup before appending ---
    # create the backup directory if it does not exist
    BACKUP_DIR = KG_PATH.parent / "backups"
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)

    # backup file name with timestamp
    timestamp   = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup_path = BACKUP_DIR / f"movie_kg_triples_backup_{timestamp}.tsv"

    shutil.copy2(KG_PATH, backup_path)
    print("📂 Backup saved to:", backup_path)

    # --- append to the KG ---
    append_triples_safely(unique_new, str(KG_PATH))
    print("✅ Triples appended to:", KG_PATH)

else:
    print("No predictions to export.")

New triples (after deduplication): 100


Unnamed: 0,head,rel,tail
0,movie539617,ex:predictedReview,personalVote_4.00
1,movie284053,ex:predictedReview,personalVote_3.50
2,movie12445,ex:predictedReview,personalVote_3.50
3,movie259316,ex:predictedReview,personalVote_3.50
4,movie425,ex:predictedReview,personalVote_3.50
5,movie162,ex:predictedReview,personalVote_3.50
6,movie364,ex:predictedReview,personalVote_3.50
7,movie424121,ex:predictedReview,personalVote_3.50
8,movie675,ex:predictedReview,personalVote_3.50
9,movie64688,ex:predictedReview,personalVote_3.50


📂 Backup saved to: ../data/kg/triples/movie_kg_triples_backup_20250915_005750.tsv
✅ Triples appended to: ../data/kg/triples/movie_kg_triples.tsv
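
The rounding step explains why a raw score of 3.768758 is exported as `personalVote_4.00` while 3.711945 becomes `personalVote_3.50`. The sketch below reproduces that mapping on its own; `round_to_half` follows the export cell, while `to_tail` is a hypothetical wrapper that combines rounding, clamping and formatting.

```python
import math

def round_to_half(x):
    """Round to the nearest 0.5, with exact ties rounded up."""
    return math.floor(x * 2 + 0.5) / 2.0

def to_tail(score):
    """Format a score as a personalVote_* tail, clamped to the range 0..5."""
    clamped = max(0.0, min(5.0, round_to_half(score)))
    return f"personalVote_{clamped:.2f}"

print(to_tail(3.768758))
print(to_tail(3.711945))
print(to_tail(7.3))
```

Because the tail only keeps half-star resolution, many neighbouring predictions collapse to the same value, as the exported table above shows.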
