
# LO2: Logic Rule — Exclude Already-Watched Movies from Recommendations

This notebook applies a simple **logical rule** to your recommendation candidates:

> **Rule (formal):** `recommendedFor(u, m) :- candidateFor(u, m) ∧ ¬ watched(u, m)`

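In other words, a candidate movie `m` is recommended to user `u` only if it does not already appear in your `watched.csv`.  
Movies are matched on the pair **(Name, Year)** rather than on TMDB IDs, because the Letterboxd export contains no TMDB IDs. Titles are compared after trimming whitespace and lowercasing.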
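
The rule can be read as a plain filter over (name, year) keys. The following is a minimal sketch with made-up titles; `recommended_for` is a hypothetical helper for illustration and is not part of the notebook's code:

```python
def recommended_for(candidates, watched):
    """Keep each candidate key that is not in the watched set.

    Mirrors: recommendedFor(u, m) :- candidateFor(u, m) AND NOT watched(u, m),
    for a single, implicit user u.
    """
    watched = set(watched)  # set membership tests are O(1) on average
    return [movie for movie in candidates if movie not in watched]


# Made-up example data: (normalized title, year) pairs.
example_candidates = [("alien", "1979"), ("heat", "1995"), ("ran", "1985")]
example_watched = {("heat", "1995")}

print(recommended_for(example_candidates, example_watched))
```

The order of the candidates is preserved, so an existing ranking survives the filter.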

Sources used from your project:
- `data/letterboxd_export/watched.csv` (columns: `Name`, `Year`)
- `data/kg/tmdb_rerank_with_embedding_results_movies_only.csv` (columns used: `candidate_title`, `year`)

Outputs:
- `data/kg/rerank_filtered_by_LO2.csv` — filtered recommendations
- (Optional) `data/kg/recommended_materialized.ttl` — RDF materialization of `:recommendedFor` triples


In [1]:

import pandas as pd
from pathlib import Path

# Resolve project root automatically
here = Path.cwd()
candidate = here
while candidate != candidate.parent and not (candidate / "data").exists():
    candidate = candidate.parent
project_root = candidate if (candidate / "data").exists() else Path(".")

print("Detected project_root:", project_root.resolve())

# File paths
watched_path = project_root / "data" / "letterboxd_export" / "watched.csv"
candidates_path = project_root / "data" / "kg" / "tmdb_rerank_with_embedding_results_movies_only.csv"
output_csv = project_root / "data" / "kg" / "rerank_filtered_by_LO2.csv"


Detected project_root: /Users/tschaffel/PycharmProjects/letterboxd-KG


## Load Data

In [3]:

watched = pd.read_csv(watched_path)
recs = pd.read_csv(candidates_path)

print("watched.csv columns:", list(watched.columns))
print("candidates.csv columns:", list(recs.columns))

# Normalize column names (lowercase)
watched.columns = [c.lower() for c in watched.columns]
recs.columns = [c.lower() for c in recs.columns]

# Ensure the title and year columns exist: 'name'/'year' in watched, 'candidate_title'/'year' in candidates
assert "name" in watched.columns, "watched.csv must have a 'name' column"
assert "year" in watched.columns, "watched.csv must have a 'year' column"
assert "candidate_title" in recs.columns, "candidates CSV must have a 'candidate_title' column"
assert "year" in recs.columns, "candidates CSV must have a 'year' column"

# Build set of watched (name, year)
watched_pairs = set(zip(watched["name"].astype(str).str.strip().str.lower(),
                        watched["year"].astype(str)))
print("Unique watched (name, year) pairs:", len(watched_pairs))


watched.csv columns: ['Date', 'Name', 'Year', 'Letterboxd URI']
candidates.csv columns: ['candidate_id', 'candidate_title', 'year', 'cos', 'meta', 'final', 'seed', 'comp_genres', 'comp_keywords', 'comp_cast', 'comp_director', 'comp_runtime', 'comp_language', 'comp_popularity', 'comp_vote']
Unique watched (name, year) pairs: 754
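
One caveat when building string keys: `astype(str)` renders a year according to the column's dtype, so the same year becomes `1978` from an integer column but `1978.0` from a float column (pandas reads a numeric column as float when it contains missing values, for example). If the two files differ in this respect, identical movies get different keys. A minimal sketch of a dtype-independent key, assuming all years are whole numbers; `year_key` is a hypothetical helper, not part of the notebook:

```python
import pandas as pd


def year_key(series: pd.Series) -> pd.Series:
    """Render years as plain integer strings regardless of the column dtype."""
    numeric = pd.to_numeric(series, errors="coerce")  # unparseable values become NaN
    # Nullable integers keep missing values missing instead of forcing floats.
    return numeric.astype("Int64").astype(str)


as_int = pd.Series([1978, 1990])
as_float = pd.Series([1978.0, 1990.0])

print(list(as_int.astype(str)))    # integer column, plain astype(str)
print(list(as_float.astype(str)))  # float column, plain astype(str)
print(list(year_key(as_float)))    # float column, normalized
```

Applying the same helper to the year column of both DataFrames before zipping the pairs would make the keys comparable.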


## Apply Logical Rule Filter

In [4]:

# Prepare candidate pairs
recs["name_norm"] = recs["candidate_title"].astype(str).str.strip().str.lower()
recs["year_str"] = recs["year"].astype(str)

candidate_pairs = list(zip(recs["name_norm"], recs["year_str"]))

# Apply rule: keep only candidates not in watched_pairs
mask_not_watched = [pair not in watched_pairs for pair in candidate_pairs]
filtered_recs = recs.loc[mask_not_watched].copy()

print("Filtered recommendations:", len(filtered_recs), " / original:", len(recs))
filtered_recs.head(10)


Filtered recommendations: 15  / original: 15


Unnamed: 0,candidate_id,candidate_title,year,cos,meta,final,seed,comp_genres,comp_keywords,comp_cast,comp_director,comp_runtime,comp_language,comp_popularity,comp_vote,name_norm,year_str
0,1924,Superman,1978.0,0.4339,0.4698,0.4483,Black Panther,1.0,0.1304,0.0,0.0,0.956,1.0,0.9719,0.9457,superman,1978.0
1,1498,Teenage Mutant Ninja Turtles,1990.0,0.3336,0.4414,0.3767,Teenage Mutant Ninja Turtles,0.8,0.3182,0.0,0.0,0.9651,1.0,0.7857,0.8055,teenage mutant ninja turtles,1990.0
2,11868,Dracula,1958.0,0.3113,0.4735,0.3762,Dracula,1.0,0.1364,0.0,0.0,0.9651,1.0,0.99,0.9695,dracula,1958.0
3,11797,Fright Night,1985.0,0.288,0.46,0.3568,Fright Night,1.0,0.0769,0.0278,0.0,0.9994,1.0,0.9916,0.7908,fright night,1985.0
4,262097,Trio,1997.0,0.3501,0.363,0.3553,Seven Psychopaths,1.0,0.0,0.0,0.0,0.9731,0.0,0.9183,0.3692,trio,1997.0
5,11122,India,1993.0,0.2222,0.4412,0.3098,Summer Storm,1.0,0.037,0.0,0.0,0.9651,1.0,0.9614,0.75,india,1993.0
6,10889,Gloria,1980.0,0.1594,0.4452,0.2737,Good Time,1.0,0.0526,0.0,0.0,0.7827,1.0,0.9956,0.9152,gloria,1980.0
7,2661,Batman,1966.0,0.1428,0.4507,0.266,21 Jump Street,1.0,0.0606,0.0,0.0,0.9912,1.0,0.8997,0.8805,batman,1966.0
8,1227770,Taylor Tomlinson: Have It All,2024.0,0.0,0.6468,0.2587,Hannah Gadsby: Douglas,1.0,1.0,0.0,0.0,0.9802,1.0,0.9857,0.9695,taylor tomlinson: have it all,2024.0
9,671652,Taylor Tomlinson: Quarter-Life Crisis,2020.0,0.0,0.6461,0.2584,Hannah Gadsby: Douglas,1.0,1.0,0.0,0.0,0.935,1.0,0.9966,0.9895,taylor tomlinson: quarter-life crisis,2020.0
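
The rule removed none of the 15 candidates. That may be correct, but the preview shows `year_str` values such as `1978.0`; if the watched years were stringified as `1978`, no pair could match even for a movie that has been watched. The output shown here does not settle which case applies. One way to check is to look for candidates whose title is in the watched set under a different year string. The sketch below uses made-up pairs, and `title_only_matches` is a hypothetical helper:

```python
def title_only_matches(candidate_pairs, watched_pairs):
    """Return candidates whose title is watched, but only under another year string."""
    years_by_title = {}
    for title, year in watched_pairs:
        years_by_title.setdefault(title, set()).add(year)

    near_misses = []
    for title, year in candidate_pairs:
        watched_years = years_by_title.get(title)
        if watched_years is not None and year not in watched_years:
            near_misses.append((title, year, sorted(watched_years)))
    return near_misses


# Made-up data illustrating a float-vs-int year mismatch.
demo_candidates = [("alien", "1979.0"), ("ran", "1985.0")]
demo_watched = {("alien", "1979"), ("heat", "1995")}

print(title_only_matches(demo_candidates, demo_watched))
```

In the notebook, calling `title_only_matches(candidate_pairs, watched_pairs)` would list any such near-misses. An empty result suggests the candidates are unwatched by title as well; a non-empty one still needs a manual look, since remakes share titles across different years.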


## Save Filtered Recommendations

In [5]:

output_csv.parent.mkdir(parents=True, exist_ok=True)
filtered_recs.to_csv(output_csv, index=False)
print("Saved:", output_csv.resolve())


Saved: /Users/tschaffel/PycharmProjects/letterboxd-KG/data/kg/rerank_filtered_by_LO2.csv


## (Optional) Materialize `:recommendedFor` Triples (RDF)

In [8]:

try:
    from rdflib import Graph, Namespace, URIRef, Literal, RDF

    EX = Namespace("http://example.org/")
    g = Graph()
    g.bind("ex", EX)

    user = EX.user_tobias

    for _, row in filtered_recs.iterrows():
        movie_uri = URIRef(f"http://example.org/movie/{row['name_norm']}_{row['year_str']}")
        g.add((movie_uri, RDF.type, EX.Movie))
        g.add((movie_uri, EX.recommendedFor, user))
        g.add((movie_uri, EX.label, Literal(f"{row['candidate_title']} ({row['year_str']})")))

    ttl_out = project_root / "data" / "kg" / "recommended_materialized.ttl"
    g.serialize(destination=str(ttl_out), format="turtle")
    print("Wrote TTL:", ttl_out.resolve())
except Exception as e:
    print("Skipping RDF materialization:", e)


http://example.org/movie/teenage mutant ninja turtles_1990.0 does not look like a valid URI, trying to serialize this will break.
http://example.org/movie/fright night_1985.0 does not look like a valid URI, trying to serialize this will break.
http://example.org/movie/taylor tomlinson: have it all_2024.0 does not look like a valid URI, trying to serialize this will break.
http://example.org/movie/taylor tomlinson: quarter-life crisis_2020.0 does not look like a valid URI, trying to serialize this will break.
http://example.org/movie/george carlin: it's bad for ya!_2008.0 does not look like a valid URI, trying to serialize this will break.
http://example.org/movie/dave chappelle: deep in the heart of texas_2017.0 does not look like a valid URI, trying to serialize this will break.
http://example.org/movie/jim gaffigan: quality time_2019.0 does not look like a valid URI, trying to serialize this will break.
http://example.org/movie/james acaster: make a new tomorrow_2021.0 does not look 

Skipping RDF materialization: "http://example.org/movie/dave chappelle: deep in the heart of texas_2017.0" does not look like a valid URI, I cannot serialize this as N3/Turtle. Perhaps you wanted to urlencode it?
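
The serialization fails because the movie URIs contain raw spaces (and other characters such as `:` and `'` end up in the path unescaped). As the error message suggests, percent-encoding the local part is one way to avoid this. A minimal sketch using only the standard library; `movie_uri` is a hypothetical helper, and the base URI is the one already used in the cell above:

```python
from urllib.parse import quote


def movie_uri(name_norm: str, year_str: str,
              base: str = "http://example.org/movie/") -> str:
    """Build a movie URI whose local part is percent-encoded."""
    local = f"{name_norm}_{year_str}"
    # safe="" also encodes "/" so a title can never add path segments.
    return base + quote(local, safe="")


print(movie_uri("fright night", "1985.0"))
print(movie_uri("george carlin: it's bad for ya!", "2008.0"))
```

The returned string could then be passed to `URIRef(...)` in place of the f-string in the loop. Whether the trailing `.0` of the float year should be kept in the identifier is a separate decision that this sketch leaves unchanged.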


## Before/After Summary

In [9]:

summary = {
    "candidates_total": len(recs),
    "watched_total": len(watched),
    "filtered_total": len(filtered_recs),
    "removed_by_rule": int(len(recs) - len(filtered_recs)),
}
summary


{'candidates_total': 15,
 'watched_total': 754,
 'filtered_total': 15,
 'removed_by_rule': 0}