# Film-KG (RDF) — scalable pipeline

This notebook implements a scalable pipeline in three steps:

1) **Preprocessing**: canonicalizes `schema:character` values, mints **character nodes** under `ex:char/<slug>`, and links films to them via `ex:featuresCharacter`.
2) **SAME_UNIVERSE**: derived as a self-join over **the same character node** (no regex matching, no global join) and written with **batch serialization** (50,000 triples per file).
3) **CREATIVE_PAIR (Director×Actor)**: counts shared films in Python and materializes pairs with at least 2 shared films, also with **batch serialization**.

Optionally, you can load the resulting `.nt`/`.ttl` files into a triplestore (Fuseki, Blazegraph, GraphDB, Stardog).


## 0) Setup & configuration

In [3]:
!python -c "import rdflib" 2>/dev/null || pip -q install rdflib==7.0.0
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF, RDFS, XSD
from pathlib import Path
from collections import defaultdict
from itertools import combinations
import re, math

DATA_PATH = Path("../data/kg/triples/movie_kg_triples.tsv")  # adjust the path if needed
OUT_DIR = Path("outputs"); OUT_DIR.mkdir(parents=True, exist_ok=True)
BATCH_SIZE = 50000  # triples per output file

SCHEMA = Namespace("http://schema.org/")
EX     = Namespace("http://example.org/")
CHAR_NS = Namespace(str(EX) + "char/")

print("Data:", DATA_PATH.resolve())
print("Output:", OUT_DIR.resolve())


Data: /Users/tschaffel/PycharmProjects/letterboxd-KG/data/kg/triples/movie_kg_triples.tsv
Output: /Users/tschaffel/PycharmProjects/letterboxd-KG/data/kg/outputs


## 1) Load the data (robust parser logic)

In [6]:
g = Graph(); g.bind("schema", SCHEMA); g.bind("ex", EX); g.bind("rdf", RDF)
prefix_map = {"schema": str(SCHEMA), "rdf": str(RDF), "rdfs": str(RDFS), "xsd": str(XSD)}

def parse_term(term: str):
    term = term.strip()
    # Explicit literal (wrapped in double quotes)
    if len(term) >= 2 and term[0] == '"' and term[-1] == '"':
        return Literal(term[1:-1])
    # Absolute URI
    if term.startswith("http://") or term.startswith("https://"):
        return URIRef(term)
    # QName with a prefix known to prefix_map
    if ":" in term:
        pfx, local = term.split(":", 1)
        if pfx in prefix_map:
            return URIRef(prefix_map[pfx] + local)
    # Otherwise fall back to a literal (e.g. names without quotes)
    return Literal(term)

count = 0
with DATA_PATH.open("r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        s, p, o = map(parse_term, parts)
        g.add((s, p, o))
        count += 1
print("Triples loaded:", count)
print("Sample triples:")
for i, (s,p,o) in enumerate(g):
    print("-", s, p, o)
    if i >= 4: break


Triples loaded: 58601
Sample triples:
- movie10527 http://schema.org/genre genre12
- company114130 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://schema.org/Company
- movie8852 http://schema.org/aggregateRating avgVote_6.8
- movie50357 http://schema.org/productionCompany company7405
- movie315162 ex:timesWatched timesWatched_1
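
The sample output shows bare identifiers such as `movie10527` and the unexpanded QName `ex:timesWatched`. `prefix_map` has no `ex` entry, and `parse_term` falls back to `Literal` for anything it does not recognize, so these terms end up as literals even in subject or predicate position, which N-Triples does not allow. The following is a minimal, stdlib-only sketch of a stricter classification. It assumes (the source data does not confirm this) that bare identifiers are meant to be resources under `http://example.org/`; `classify_term`, `PREFIXES` and `BARE_ID` are hypothetical names, not part of the notebook.

```python
import re

# Assumed prefix table; the "ex" entry is the addition relative to prefix_map.
PREFIXES = {
    "schema": "http://schema.org/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "ex": "http://example.org/",
}
# Assumption: a single token like "movie10527" names a resource, not a string.
BARE_ID = re.compile(r"^[A-Za-z][A-Za-z0-9_.-]*$")


def classify_term(term: str, position: str) -> tuple[str, str]:
    """Return ("iri", value) or ("literal", value) for one TSV field.

    position is "s", "p" or "o"; subjects and predicates must be IRIs.
    """
    term = term.strip()
    if len(term) >= 2 and term[0] == '"' and term[-1] == '"':
        kind, value = "literal", term[1:-1]
    elif term.startswith(("http://", "https://")):
        kind, value = "iri", term
    elif ":" in term and term.split(":", 1)[0] in PREFIXES:
        pfx, local = term.split(":", 1)
        kind, value = "iri", PREFIXES[pfx] + local
    elif BARE_ID.match(term):
        kind, value = "iri", PREFIXES["ex"] + term
    else:
        kind, value = "literal", term
    if position in ("s", "p") and kind != "iri":
        raise ValueError(f"{position!r} position needs an IRI, got {term!r}")
    return kind, value


row = "movie315162\tex:timesWatched\ttimesWatched_1".split("\t")
parsed = [classify_term(t, pos) for t, pos in zip(row, "spo")]
for kind, value in parsed:
    print(kind, value)
```

One caveat of this heuristic: an unquoted single-word name in object position (for example a one-word character name) would also be treated as an IRI, so quoting literals in the TSV is the safer convention.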


## 2) Preprocessing: character nodes & `ex:featuresCharacter` edges
We canonicalize character names (strip parenthesized suffixes, trim, lower-case) and mint URIs of the form `ex:char/<slug>`.

In [7]:
canon_re = re.compile(r"\s*\([^)]*\)")
def canonize(text: str) -> str:
    s = canon_re.sub("", text)
    return s.strip().lower()

def slugify(text: str) -> str:
    t = re.sub(r"[^a-z0-9]+", "-", text)
    t = re.sub(r"-+", "-", t).strip("-")
    return t or "x"

char_uri_by_canon = {}
added_nodes = 0; added_edges = 0

for f, _, ch in g.triples((None, SCHEMA.character, None)):
    # Literals and URIRefs are treated alike: canonicalize the string form.
    c = canonize(str(ch))
    if not c:
        continue
    uri = char_uri_by_canon.get(c)
    if uri is None:
        uri = URIRef(CHAR_NS + slugify(c))
        char_uri_by_canon[c] = uri
        g.add((uri, RDF.type, EX.Character))
        g.add((uri, EX.canonName, Literal(c)))
        added_nodes += 1
    if (f, EX.featuresCharacter, uri) not in g:
        g.add((f, EX.featuresCharacter, uri))
        added_edges += 1

print("New character nodes:", added_nodes)
print("New featuresCharacter edges:", added_edges)


New character nodes: 5395
New featuresCharacter edges: 6645
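
`slugify` keeps only `a-z0-9`, so accented letters are dropped (`émile` becomes `mile`), and two different canonical names can map to the same slug and therefore silently share one URI. The sketch below shows one possible way to handle both cases: fold accents to ASCII with `unicodedata` before slugging, and add a numeric suffix when a slug is already taken by a different canonical name. `slugify_ascii`, `mint_slug` and `taken` are hypothetical names; `canonize` repeats the rule from the cell above so the block runs on its own.

```python
import re
import unicodedata

_PAREN = re.compile(r"\s*\([^)]*\)")


def canonize(text: str) -> str:
    """Drop parenthesized suffixes, trim, lower-case (same rule as the notebook)."""
    return _PAREN.sub("", text).strip().lower()


def slugify_ascii(text: str) -> str:
    """Fold accents to ASCII, then replace runs of non-alphanumerics with '-'."""
    folded = unicodedata.normalize("NFKD", text)
    folded = folded.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", folded.lower()).strip("-") or "x"


def mint_slug(canon: str, taken: dict[str, str]) -> str:
    """Return a slug owned by this canonical name; add a suffix on collision."""
    base = slugify_ascii(canon)
    slug, n = base, 2
    while slug in taken and taken[slug] != canon:
        slug = f"{base}-{n}"
        n += 1
    taken[slug] = canon
    return slug


taken: dict[str, str] = {}
for raw in ["Émile (voice)", "Bruce Wayne / Batman", "bruce wayne batman"]:
    c = canonize(raw)
    print(repr(c), "->", mint_slug(c, taken))
```

The suffix depends on processing order, so slugs are only stable across runs if the input order is. Whether colliding names should instead be merged into one character is a modeling decision that this sketch does not make.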


## 3) Derive SAME_UNIVERSE (self-join over character), written in **batches**

In [9]:
# === Batching: write SAME_UNIVERSE manually as N-Triples ===
from itertools import combinations

films_by_char = defaultdict(list)
for f, _, c in g.triples((None, EX.featuresCharacter, None)):
    films_by_char[c].append(f)

su_out_prefix = OUT_DIR / "same_universe_part"
batch = []
file_index = 1
created = 0

def nt_line(s, p, o) -> str:
    # Serialize rdflib terms via n3() into one N-Triples line
    return f"{s.n3()} {p.n3()} {o.n3()} .\n"

def flush_batch():
    global batch, file_index
    if not batch:
        return
    path = Path(f"{su_out_prefix}_{file_index:04d}.nt")
    with path.open("w", encoding="utf-8", newline="\n") as fh:
        for (s, p, o) in batch:
            fh.write(nt_line(s, p, o))
    print("written:", path.name, "triples:", len(batch))
    batch = []
    file_index += 1

# Generate SAME_UNIVERSE edges
seen = set()
for char_uri, films in films_by_char.items():
    if len(films) < 2:
        continue
    films_sorted = sorted(set(films), key=str)
    for f1, f2 in combinations(films_sorted, 2):
        key = (str(f1), str(f2))
        if key in seen:
            continue
        seen.add(key)
        batch.append((f1, EX.SAME_UNIVERSE, f2))
        created += 1
        if len(batch) >= BATCH_SIZE:
            flush_batch()

flush_batch()
print("SAME_UNIVERSE created (total):", created)

written: same_universe_part_0001.nt triples: 1576
SAME_UNIVERSE created (total): 1576
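
To make the self-join concrete, here is a toy version on invented data (the film and character ids are made up for illustration). It groups films by character, emits each unordered film pair once, and also computes how many candidate pairs are examined: the sum of k·(k−1)/2 over characters with k films, rather than a comparison of all films against all films.

```python
from collections import defaultdict
from itertools import combinations

# Invented toy edges standing in for (film, ex:featuresCharacter, character).
toy_features = [
    ("film_a", "char/batman"), ("film_b", "char/batman"),
    ("film_c", "char/batman"), ("film_b", "char/joker"),
    ("film_c", "char/joker"), ("film_d", "char/ripley"),
]

toy_films_by_char = defaultdict(set)
for film, char in toy_features:
    toy_films_by_char[char].add(film)

toy_pairs = set()
for char, films in toy_films_by_char.items():
    # Sorting first makes every pair come out as (smaller, larger),
    # so the set removes pairs that share more than one character.
    toy_pairs.update(combinations(sorted(films), 2))

# Work done: k*(k-1)/2 candidate pairs per character with k films.
examined = sum(len(f) * (len(f) - 1) // 2 for f in toy_films_by_char.values())

print(sorted(toy_pairs))
print("candidate pairs examined:", examined)
```

As in the notebook cell, only one direction per pair is emitted. If queries should treat `SAME_UNIVERSE` as symmetric, they need to match both directions, or the reverse edges have to be materialized as well.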


## 4) CREATIVE_PAIR (Director×Actor ≥2), written in **batches**

In [10]:
# === Batching: write CREATIVE_PAIR manually as N-Triples ===
pair_counts = defaultdict(int)
for f, _, d in g.triples((None, SCHEMA.director, None)):
    for _, _, a in g.triples((f, SCHEMA.actor, None)):
        pair_counts[(d, a)] += 1

cp_out_prefix = OUT_DIR / "creative_pair_part"
batch = []
file_index = 1
created = 0

def flush_cp():
    global batch, file_index
    if not batch:
        return
    path = Path(f"{cp_out_prefix}_{file_index:04d}.nt")
    with path.open("w", encoding="utf-8", newline="\n") as fh:
        for (s, p, o) in batch:
            fh.write(nt_line(s, p, o))
    print("written:", path.name, "triples:", len(batch))
    batch = []
    file_index += 1

for (d, a), n in pair_counts.items():
    if n >= 2:
        batch.append((d, EX.CREATIVE_PAIR, a))
        batch.append((d, EX.creativePairRoles, Literal("Director,Actor")))
        batch.append((d, EX.creativePairCount, Literal(n)))
        created += 3
        if len(batch) >= BATCH_SIZE:
            flush_cp()

flush_cp()
print("CREATIVE_PAIR triples created (total):", created)

written: creative_pair_part_0001.nt triples: 1644
CREATIVE_PAIR triples created (total): 1644
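
In the cell above, `ex:creativePairRoles` and `ex:creativePairCount` are attached to the director node. A director with several qualifying actors therefore carries several counts, and the graph no longer says which count belongs to which actor. One possible alternative, sketched here with plain strings and invented toy credits, is to mint a node per pair and hang the count on that node. The `ex:pair/...` URI scheme and the predicates `ex:pairDirector` and `ex:pairActor` are hypothetical and not part of the notebook's vocabulary.

```python
from collections import Counter

EX_BASE = "http://example.org/"
XSD_INT = "http://www.w3.org/2001/XMLSchema#integer"

# Invented toy credits: film -> (directors, actors).
toy_credits = {
    "film_a": (["dir_nolan"], ["act_bale", "act_caine"]),
    "film_b": (["dir_nolan"], ["act_bale", "act_caine"]),
    "film_c": (["dir_nolan"], ["act_bale"]),
    "film_d": (["dir_scott"], ["act_weaver"]),
}

# Count shared films per (director, actor); set() counts an actor once per film.
toy_counts = Counter(
    (d, a)
    for directors, actors in toy_credits.values()
    for d in directors
    for a in set(actors)
)


def pair_triples(director: str, actor: str, n: int) -> list[str]:
    """N-Triples lines for one pair node (hypothetical ex:pair/ scheme)."""
    pair = f"<{EX_BASE}pair/{director}--{actor}>"
    return [
        f"{pair} <{EX_BASE}pairDirector> <{EX_BASE}{director}> .",
        f"{pair} <{EX_BASE}pairActor> <{EX_BASE}{actor}> .",
        f'{pair} <{EX_BASE}creativePairCount> "{n}"^^<{XSD_INT}> .',
    ]


pair_lines = [
    line
    for (d, a), n in sorted(toy_counts.items())
    if n >= 2  # same threshold as the notebook cell
    for line in pair_triples(d, a, n)
]
print("\n".join(pair_lines))
```

This still emits three triples per pair, so file sizes stay comparable, but the count can be read unambiguously per director and actor combination. The direct `ex:CREATIVE_PAIR` edge could be kept alongside it for simple traversals.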


## 5) Export: save the base graph (with featuresCharacter) & tips for triplestores

In [11]:
merged_path = OUT_DIR / "graph_with_features.ttl"
g.serialize(destination=str(merged_path), format="turtle")
print("Base graph (incl. featuresCharacter) saved:", merged_path)
print("\nLoading into triplestores:")
print("- Fuseki (TDB2): tdb2.tdbloader --loc DB graph_with_features.ttl same_universe_part_*.nt creative_pair_part_*.nt")
print("- GraphDB/Blazegraph/Stardog: UI/CLI bulk loader; .nt batches work well.")


Base graph (incl. featuresCharacter) saved: ../data/kg/outputs/graph_with_features.ttl

Loading into triplestores:
- Fuseki (TDB2): tdb2.tdbloader --loc DB graph_with_features.ttl same_universe_part_*.nt creative_pair_part_*.nt
- GraphDB/Blazegraph/Stardog: UI/CLI bulk loader; .nt batches work well.
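
The two flush functions in sections 3 and 4 differ only in their output prefix and both rely on module-level globals. A possible refactoring, shown as a sketch rather than the notebook's actual code, is a single generator-based helper that slices any stream of already serialized N-Triples lines into numbered files. `write_batches` is a hypothetical name; the demo writes into a temporary directory so it leaves no files behind.

```python
from itertools import islice
from pathlib import Path
from tempfile import TemporaryDirectory
from typing import Iterable, Iterator


def write_batches(lines: Iterable[str], prefix: Path, batch_size: int) -> Iterator[Path]:
    """Write lines to <prefix>_0001.nt, <prefix>_0002.nt, ...; yield each path."""
    it = iter(lines)
    index = 1
    while True:
        chunk = list(islice(it, batch_size))  # at most batch_size lines in memory
        if not chunk:
            return
        path = prefix.parent / f"{prefix.name}_{index:04d}.nt"
        with path.open("w", encoding="utf-8", newline="\n") as fh:
            fh.writelines(chunk)
        yield path
        index += 1


with TemporaryDirectory() as tmp:
    demo_lines = (
        f"<http://example.org/s{i}> <http://example.org/p> <http://example.org/o{i}> .\n"
        for i in range(5)
    )
    written = list(write_batches(demo_lines, Path(tmp) / "demo_part", batch_size=2))
    names = [p.name for p in written]
    sizes = [len(p.read_text(encoding="utf-8").splitlines()) for p in written]

print(names, sizes)
```

Because the helper consumes an iterator, the triples never have to be collected in one list first; in the notebook it could be fed with `nt_line(s, p, o)` for each derived triple and `BATCH_SIZE` as the batch size.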


### Notes
- **Self-join**: `SAME_UNIVERSE` is derived per shared character node, which avoids the blow-up of a regex-based or global join.
- **Batching**: the files `same_universe_part_*.nt` and `creative_pair_part_*.nt` (at most 50,000 triples each) are sized for bulk loaders.
- **Extensible**: further rules (Actor×Actor etc.) can be generated and batched in the same way.
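
As an illustration of the extensibility note, an Actor×Actor rule could follow the same count-then-threshold pattern. The sketch below uses invented cast lists and assumes the same threshold of 2 shared films as `CREATIVE_PAIR`; the names `toy_cast_by_film`, `MIN_SHARED` and `co_counts` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Invented toy cast lists standing in for schema:actor edges per film.
toy_cast_by_film = {
    "film_a": {"act_bale", "act_caine", "act_oldman"},
    "film_b": {"act_bale", "act_caine"},
    "film_c": {"act_oldman", "act_weaver"},
}

MIN_SHARED = 2  # assumption: reuse the CREATIVE_PAIR threshold

co_counts = Counter()
for cast in toy_cast_by_film.values():
    # Sorting makes (a, b) canonical, so each unordered pair
    # is counted exactly once per film.
    co_counts.update(combinations(sorted(cast), 2))

actor_pairs = sorted(pair for pair, n in co_counts.items() if n >= MIN_SHARED)
print(actor_pairs)
```

The number of pairs per film grows quadratically with cast size, so for large casts this rule produces far more candidates than Director×Actor; writing the result through the same batch files keeps the output manageable.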
