# Film KG (RDF): scalable pipeline with TSV append

This notebook:
1) Canonicalizes `schema:character` values, mints character URIs `ex:char/<slug>`, and links films to them via `ex:featuresCharacter`.
2) Derives SAME_UNIVERSE edges (predicate `ex:sameUniverse`) via a self-join over shared characters.
3) Builds CREATIVE_PAIR edges (predicate `ex:creativePair`) for director/actor pairs that share at least two films.
4) **Appends** the derived SAME_UNIVERSE and CREATIVE_PAIR edges **as TSV** to your existing `movie_kg_triples.tsv`, **after a backup** has been made.

## 0) Setup & configuration

In [16]:
!python -c "import rdflib" 2>/dev/null || pip -q install rdflib==7.0.0
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF, RDFS, XSD
from pathlib import Path
from collections import defaultdict
from itertools import combinations
import shutil, re
from datetime import datetime

DATA_PATH = Path("../data/kg/triples/movie_kg_triples.tsv")  # existing KG file (TSV)
OUT_DIR = Path("../data/kg/triples"); OUT_DIR.mkdir(parents=True, exist_ok=True)
BATCH_SIZE = 50000  # triples per output file

SCHEMA = Namespace("http://schema.org/")
EX     = Namespace("http://example.org/")
CHAR_NS = Namespace(str(EX) + "char/")

print("Data:", DATA_PATH.resolve())
print("Output:", OUT_DIR.resolve())


Data: /Users/tschaffel/PycharmProjects/letterboxd-KG/data/kg/triples/movie_kg_triples.tsv
Output: /Users/tschaffel/PycharmProjects/letterboxd-KG/data/kg/triples


## 1) Load data (robust TSV parser)

In [17]:
g = Graph(); g.bind("schema", SCHEMA); g.bind("ex", EX); g.bind("rdf", RDF)
prefix_map = {"schema": str(SCHEMA), "rdf": str(RDF), "rdfs": str(RDFS), "xsd": str(XSD), "ex": str(EX)}

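def parse_term(term: str):
    """Map one TSV cell to an rdflib term."""
    term = term.strip()
    # "..." -> plain literal (outer quotes stripped, no unescaping)
    if len(term) >= 2 and term[0] == '"' and term[-1] == '"':
        return Literal(term[1:-1])
    # Absolute IRI
    if term.startswith(("http://", "https://")):
        return URIRef(term)
    # Prefixed name with a known prefix, e.g. schema:actor
    if ":" in term:
        pfx, local = term.split(":", 1)
        if pfx in prefix_map:
            return URIRef(prefix_map[pfx] + local)
    # Everything else, including bare IDs such as movie590223, becomes a literal
    return Literal(term)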

count = 0
with DATA_PATH.open("r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        s, p, o = map(parse_term, parts)
        g.add((s, p, o))
        count += 1
print("Triples loaded:", count)
print("Sample triples:")
for i, (s,p,o) in enumerate(g):
    print("-", s, p, o)
    if i >= 4: break


Triples loaded: 58601
Sample triples:
- company77208 http://schema.org/name Gama Entertainment Partners
- movie590223 http://schema.org/actor person1576672
- movie120 http://example.org/timesWatched timesWatched_nan
- person2969804 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://schema.org/Person
- company195260 http://example.org/country GB
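
The sample output prints bare identifiers such as `movie590223` without a namespace: they match none of the first three branches of `parse_term` and therefore fall through to the literal case. The sketch below mirrors that branching with the standard library only. The `classify` helper, its tagged-tuple return values, and the reduced `PREFIXES` map are illustrative assumptions, not part of the notebook.

```python
# Illustrative stand-in for parse_term: same branching, but it returns
# ("literal" | "uri", value) tuples instead of rdflib terms.
PREFIXES = {"schema": "http://schema.org/", "ex": "http://example.org/"}

def classify(term: str):
    term = term.strip()
    # Quoted cell: strip the outer quotes
    if len(term) >= 2 and term[0] == '"' and term[-1] == '"':
        return ("literal", term[1:-1])
    # Absolute IRI
    if term.startswith(("http://", "https://")):
        return ("uri", term)
    # Prefixed name, but only for known prefixes
    if ":" in term:
        pfx, local = term.split(":", 1)
        if pfx in PREFIXES:
            return ("uri", PREFIXES[pfx] + local)
    # Fallback: bare IDs and unknown prefixes end up here
    return ("literal", term)

for raw in ['"Inception"', "schema:actor", "http://schema.org/Person", "movie590223", "12:30"]:
    print(raw, "->", classify(raw))
```

If bare identifiers should become IRIs instead, the fallback branch is the place to change; that would alter the graph and the outputs shown in this notebook.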


## 2) Backup & TSV append helpers

In [18]:
BACKUP_DIR = OUT_DIR / "backups"; BACKUP_DIR.mkdir(parents=True, exist_ok=True)
BACKUP_PATH = BACKUP_DIR / f"movie_kg_triples_backup_{datetime.now().strftime('%Y%m%d_%H%M%S')}.tsv"

# Create the backup
shutil.copy2(DATA_PATH, BACKUP_PATH)
print("Backup saved:", BACKUP_PATH)

def term_to_str(term):
    if isinstance(term, URIRef):
        # Try the prefixed short form first
        for pfx, ns in prefix_map.items():
            if str(term).startswith(ns):
                return f"{pfx}:{str(term)[len(ns):]}"
        return str(term)
    elif isinstance(term, Literal):
        # Tabs and newlines would break the three-column layout, so collapse them.
        # Inner double quotes stay as-is: parse_term only strips the outer pair.
        s = re.sub(r"[\t\r\n]+", " ", str(term))
        return f'"{s}"'
    else:
        return str(term)

def append_triples_tsv(triples):
    with DATA_PATH.open("a", encoding="utf-8") as f:
        for (s,p,o) in triples:
            f.write(term_to_str(s) + "\t" + term_to_str(p) + "\t" + term_to_str(o) + "\n")


Backup saved: ../data/kg/triples/backups/movie_kg_triples_backup_20250915_024129.tsv
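
Opening the TSV in append mode assumes that the existing file ends with a newline; otherwise the first appended triple is glued onto the last existing line. A small guard could look like the sketch below. The helper name `ensure_trailing_newline` is hypothetical, and a temporary file stands in for `DATA_PATH`.

```python
import tempfile
from pathlib import Path

def ensure_trailing_newline(path: Path) -> bool:
    """Append "\\n" if the file is non-empty and does not end with one.

    Returns True if the file was modified.
    """
    if not path.exists() or path.stat().st_size == 0:
        return False
    with path.open("rb+") as fh:
        fh.seek(-1, 2)              # position on the last byte
        if fh.read(1) == b"\n":
            return False
        fh.seek(0, 2)               # explicit seek to EOF before writing
        fh.write(b"\n")
        return True

# Demo on a temporary file instead of the real KG file
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.tsv"
    demo.write_bytes(b"ex:a\tex:p\tex:b")          # no trailing newline
    changed = ensure_trailing_newline(demo)
    with demo.open("a", encoding="utf-8") as fh:
        fh.write("ex:c\tex:p\tex:d\n")
    lines = demo.read_text(encoding="utf-8").splitlines()

print(changed, lines)
```

Calling such a guard once before the first append would be enough, since every appended line ends with a newline.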


## 3) Preprocessing: character nodes & `ex:featuresCharacter`

In [19]:
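# Drop parenthesized suffixes such as "(voice)" together with leading whitespace
canon_re = re.compile(r"\s*\([^)]*\)")
def canonize(text: str) -> str:
    s = canon_re.sub("", text)
    return s.strip().lower()

def slugify(text: str) -> str:
    # Expects lowercased input; every run of other characters (including non-ASCII) becomes "-"
    t = re.sub(r"[^a-z0-9]+", "-", text)
    t = re.sub(r"-+", "-", t).strip("-")
    return t or "x"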

char_uri_by_canon = {}
added_nodes = 0; added_edges = 0

new_feature_triples = []  # collected for the optional TSV append below

for f, _, ch in g.triples((None, SCHEMA.character, None)):
    c = canonize(str(ch))
    if not c:
        continue
    uri = char_uri_by_canon.get(c)
    if uri is None:
        uri = URIRef(CHAR_NS + slugify(c))
        char_uri_by_canon[c] = uri
        if (uri, RDF.type, EX.Character) not in g:
            g.add((uri, RDF.type, EX.Character))
            g.add((uri, EX.canonName, Literal(c)))
            new_feature_triples.append((uri, RDF.type, EX.Character))
            new_feature_triples.append((uri, EX.canonName, Literal(c)))
            added_nodes += 1
    if (f, EX.featuresCharacter, uri) not in g:
        g.add((f, EX.featuresCharacter, uri))
        new_feature_triples.append((f, EX.featuresCharacter, uri))
        added_edges += 1

print("New character nodes:", added_nodes)
print("New featuresCharacter edges:", added_edges)

# Optional: append these new preprocessing triples directly to the TSV
#if new_feature_triples:
#    append_triples_tsv(new_feature_triples)
#    print("Preprocessing triples appended to TSV:", len(new_feature_triples))


New character nodes: 5392
New featuresCharacter edges: 6645
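
To make the canonicalization concrete, the following stand-alone snippet repeats `canonize` and `slugify` as defined above and runs them on a few made-up character names. The sample names are illustrative and not taken from the dataset.

```python
import re

canon_re = re.compile(r"\s*\([^)]*\)")

def canonize(text: str) -> str:
    s = canon_re.sub("", text)
    return s.strip().lower()

def slugify(text: str) -> str:
    t = re.sub(r"[^a-z0-9]+", "-", text)
    t = re.sub(r"-+", "-", t).strip("-")
    return t or "x"

samples = [
    "Bruce Wayne (voice)",   # parenthesized suffix is dropped
    "Dr. Who",               # punctuation collapses to "-"
    "Dr Who",                # different canonical name, same slug as above
    "Amélie Poulain",        # non-ASCII letters are replaced by "-"
    "(uncredited)",          # canonizes to "", which the loop above skips
]
for name in samples:
    c = canonize(name)
    print(repr(name), "->", repr(c), "->", slugify(c) if c else None)
```

Two things are worth noticing: distinct canonical names can share one slug and therefore one character URI, and non-ASCII letters are lost in the slug. Whether either matters depends on the character names in the data.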


## 4) Derive SAME_UNIVERSE (self-join over characters): TSV append

In [20]:
from itertools import combinations

# Self-join: group films by the character they feature
films_by_char = defaultdict(list)
for f, _, c in g.triples((None, EX.featuresCharacter, None)):
    films_by_char[c].append(f)

new_su_triples = []
seen = set()  # film pairs already emitted via another shared character

for char_uri, films in films_by_char.items():
    if len(films) < 2:
        continue
    # Sorting fixes the direction of each pair, so (f1, f2) is emitted only once
    films_sorted = sorted(set(films), key=str)
    for f1, f2 in combinations(films_sorted, 2):
        key = (str(f1), str(f2))
        if key in seen:
            continue
        seen.add(key)
        new_su_triples.append((f1, EX.sameUniverse, f2))

append_triples_tsv(new_su_triples)
print("SAME_UNIVERSE triples appended:", len(new_su_triples))

SAME_UNIVERSE triples appended: 1576
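
The self-join can be traced on a toy example that uses plain strings instead of rdflib terms. All film and character identifiers below are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Toy (film, character) pairs standing in for ex:featuresCharacter edges
features = [
    ("movie1", "char/batman"), ("movie2", "char/batman"), ("movie3", "char/batman"),
    ("movie2", "char/joker"), ("movie3", "char/joker"),
    ("movie4", "char/amelie"),
]

# Group films by character
films_by_char = defaultdict(set)
for film, char in features:
    films_by_char[char].add(film)

# Every unordered pair of films sharing a character, deduplicated across characters
pairs = set()
for films in films_by_char.values():
    pairs.update(combinations(sorted(films), 2))

for f1, f2 in sorted(pairs):
    print(f1, "ex:sameUniverse", f2)
```

`movie2` and `movie3` share two characters but yield a single edge, and `movie4` yields none. Each edge is written in one direction only, so queries have to treat `ex:sameUniverse` as symmetric.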


## 5) CREATIVE_PAIR (director × actor, ≥2 shared films): TSV append

In [21]:
# Count shared films per (director, actor) pair
pair_counts = defaultdict(int)
for f, _, d in g.triples((None, SCHEMA.director, None)):
    for _, _, a in g.triples((f, SCHEMA.actor, None)):
        pair_counts[(d, a)] += 1

new_cp_triples = []
for (d, a), n in pair_counts.items():
    if n >= 2:
        new_cp_triples.append((d, EX.creativePair, a))
        # Note: roles and count hang off the director node, not off the pair,
        # so a director with several partners carries several counts
        new_cp_triples.append((d, EX.creativePairRoles, Literal("Director,Actor")))
        new_cp_triples.append((d, EX.creativePairCount, Literal(n)))

append_triples_tsv(new_cp_triples)
print("CREATIVE_PAIR triples appended:", len(new_cp_triples))

CREATIVE_PAIR triples appended: 1644
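
The counting step can be reproduced on toy data with `collections.Counter`. The `credits` mapping and all names in it are made up; the threshold of two shared films is the one used in the cell above.

```python
from collections import Counter

# Toy credits: film -> (directors, actors)
credits = {
    "movie1": (["dirA"], ["act1", "act2"]),
    "movie2": (["dirA"], ["act1"]),
    "movie3": (["dirB"], ["act1", "act2"]),
}

# One count per (director, actor) pair and film in which both appear
pair_counts = Counter(
    (d, a)
    for directors, actors in credits.values()
    for d in directors
    for a in actors
)

# Keep only pairs with at least two shared films
creative_pairs = {pair: n for pair, n in pair_counts.items() if n >= 2}
print(creative_pairs)
```

Only `dirA` and `act1` worked together twice, so this is the only pair that would receive an `ex:creativePair` edge.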


## 6) Export: base graph (with featuresCharacter)

In [22]:
merged_path = OUT_DIR / "graph_with_features.ttl"
g.serialize(destination=str(merged_path), format="turtle")
print("Base graph (incl. featuresCharacter) saved:", merged_path)


Base graph (incl. featuresCharacter) saved: ../data/kg/triples/graph_with_features.ttl


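### Notes
- The backup is created once, in section 2, before any TSV append (folder: `../data/kg/triples/backups`).
- N-Triples (`.nt`) batches are suitable for bulk loaders (Fuseki/GraphDB/Stardog/Blazegraph); the cells above only append to the TSV and export Turtle, and they do not use `BATCH_SIZE`.
- The TSV append uses short prefixes where possible and wraps literals in double quotes.
- Adjust `BATCH_SIZE` as needed (RAM/IO).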
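
A minimal sketch of writing triples in fixed-size batches, using the standard library only. The helper `write_batches` and the file-name pattern are hypothetical, and the input is assumed to be lines that are already serialized as N-Triples; in the notebook, `batch_size` would be `BATCH_SIZE` and the target directory `OUT_DIR`.

```python
import tempfile
from itertools import islice
from pathlib import Path

def write_batches(lines, out_dir: Path, stem: str, batch_size: int):
    """Write N-Triples lines into numbered files of at most batch_size lines each."""
    it = iter(lines)
    paths = []
    while True:
        chunk = list(islice(it, batch_size))   # next batch; empty once exhausted
        if not chunk:
            break
        path = out_dir / f"{stem}_{len(paths) + 1:04d}.nt"
        path.write_text("".join(line + "\n" for line in chunk), encoding="utf-8")
        paths.append(path)
    return paths

# Demo with five made-up triples and a batch size of 2
demo_lines = [
    f"<http://example.org/movie{i}> <http://example.org/sameUniverse> <http://example.org/movie{i + 1}> ."
    for i in range(5)
]
with tempfile.TemporaryDirectory() as tmp:
    written = write_batches(demo_lines, Path(tmp), "same_universe", batch_size=2)
    names = [p.name for p in written]
    sizes = [len(p.read_text(encoding="utf-8").splitlines()) for p in written]

print(names, sizes)
```

Because the helper consumes an iterator, only one batch is held in memory at a time, which is the reason to tune the batch size against RAM and IO.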