
# LO2: Logic Rule — Exclude Already-Watched Movies from Recommendations

This notebook applies a simple **logical rule** to your recommendation candidates:

> **Rule (formal):** `recommendedFor(u, m) :- candidateFor(u, m) ∧ ¬ watched(u, m)`

It removes every candidate movie that already appears in your `watched.csv`.  
Inputs read from your project:
- `data/letterboxd_export/watched.csv`
- `data/kg/tmdb_rerank_with_embedding_results_movies_only.csv`

Outputs:
- `data/kg/rerank_filtered_by_LO2.csv` — filtered recommendations
- (Optional) `data/kg/recommended_materialized.ttl` — RDF materialization of `:recommendedFor` triples
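
Read operationally, the rule is a set difference: keep each candidate that is not in the watched set. A minimal sketch in plain Python, assuming movies are identified by integer ids (the function name `recommended_for` and the ids are illustrative only, not part of the project):

```python
def recommended_for(candidates, watched):
    """Return the candidates, in their original order, that are not watched."""
    watched = set(watched)  # set membership tests are O(1) on average
    return [m for m in candidates if m not in watched]


# Toy ids for illustration only
print(recommended_for([603, 27205, 155], {27205}))
```

Ranking order is preserved, which matters because the candidates file is already reranked.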


In [1]:

import pandas as pd
from pathlib import Path

# Resolve project root automatically if notebook is moved inside the repo
# Assumes this notebook lives somewhere under the repo root.
here = Path.cwd()
# Try to detect the repo root by looking for the 'data' directory:
candidate = here
while candidate != candidate.parent and not (candidate / "data").exists():
    candidate = candidate.parent
project_root = candidate if (candidate / "data").exists() else Path("../logical")

print("Detected project_root:", project_root.resolve())

# File paths (relative to project root)
watched_path = project_root / "data" / "letterboxd_export" / "watched.csv"
candidates_path = project_root / "data" / "kg" / "tmdb_rerank_with_embedding_results_movies_only.csv"
output_csv = project_root / "data" / "kg" / "rerank_filtered_by_LO2.csv"


Detected project_root: /Users/tschaffel/PycharmProjects/letterboxd-KG


## Load Data

In [2]:

watched = pd.read_csv(watched_path)
recs = pd.read_csv(candidates_path)

print("watched.csv columns:", list(watched.columns))
print("candidates.csv columns:", list(recs.columns))

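# Try to infer the TMDB/movie id column names (case-insensitive, exact match).
# "candidate_id" is included because the candidates CSV names its id column that way.
ID_COLUMN_NAMES = {"tmdbid", "tmdb_id", "movieid", "movie_id", "candidate_id", "id"}

def _find_id_col(df):
    # Return the first column whose lowercased name is a known id name, else None
    return next((c for c in df.columns if c.lower() in ID_COLUMN_NAMES), None)

watched_id_col = _find_id_col(watched)
recs_id_col = _find_id_col(recs)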

assert watched_id_col is not None, "Could not find a movie id column in watched.csv"
assert recs_id_col is not None, "Could not find a movie id column in candidates CSV"

print("Using watched id column:", watched_id_col)
print("Using recommendations id column:", recs_id_col)


watched.csv columns: ['Date', 'Name', 'Year', 'Letterboxd URI']
candidates.csv columns: ['candidate_id', 'candidate_title', 'year', 'cos', 'meta', 'final', 'seed', 'comp_genres', 'comp_keywords', 'comp_cast', 'comp_director', 'comp_runtime', 'comp_language', 'comp_popularity', 'comp_vote']


AssertionError: Could not find a movie id column in watched.csv
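
The assertion fires because this `watched.csv` lists films by `Name` and `Year` (plus a Letterboxd URI) and has no numeric id column. One possible workaround, sketched here under assumptions, is to match on a normalized title and year instead. The helper names and the `title|year` key format are hypothetical, and the match is only approximate: titles may differ between Letterboxd and the candidates file (translations, punctuation), so some watched films could slip through.

```python
import pandas as pd


def title_year_key(titles, years):
    """Build a normalized 'title|year' key; a missing title or year yields <NA>."""
    t = titles.astype("string").str.strip().str.casefold()
    y = pd.to_numeric(years, errors="coerce").astype("Int64").astype("string")
    return t + "|" + y


def exclude_watched_by_title_year(recs, watched):
    """Drop candidate rows whose (title, year) also appears in the watched frame.

    Assumes the column names printed above: Name/Year in watched,
    candidate_title/year in the candidates frame.
    """
    watched_keys = set(title_year_key(watched["Name"], watched["Year"]).dropna())
    rec_keys = title_year_key(recs["candidate_title"], recs["year"])
    # Rows with a missing key cannot be matched and are therefore kept
    return recs.loc[~rec_keys.isin(watched_keys)]
```

Including the year in the key keeps remakes and same-titled films apart.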

## Normalize IDs

In [None]:

def _to_int_series(s):
    # robust parsing: drop NAs, handle floats-as-ids (e.g., '123.0')
    return pd.to_numeric(s, errors="coerce").dropna().astype(int)

watched_ids = set(_to_int_series(watched[watched_id_col]))
candidate_ids = _to_int_series(recs[recs_id_col])

print(f"Unique watched ids: {len(watched_ids)}")
print(f"Candidate rows: {len(candidate_ids)} (unique: {len(set(candidate_ids))})")
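
To make the normalization concrete, here is a small illustration of the same `pd.to_numeric(..., errors="coerce").dropna().astype(int)` chain on toy values (the values are made up for the example):

```python
import pandas as pd

raw = pd.Series(["603", "123.0", "not-an-id", None, 27205])

# Unparseable entries become NaN, are dropped, and the rest are cast to int
ids = pd.to_numeric(raw, errors="coerce").dropna().astype(int)

print(ids.tolist())      # [603, 123, 27205]
print(list(ids.index))   # [0, 1, 4]
```

Note that the result keeps the original index labels, so dropped rows leave gaps. Any boolean mask built from it has to be realigned to the full frame before it is used for row selection.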


## Apply Logical Rule Filter

In [None]:

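# Logical rule:
# recommendedFor(u, m) :- candidateFor(u, m) ∧ ¬ watched(u, m)
# candidate_ids drops rows whose id could not be parsed, so realign the mask to
# recs.index; rows without a parseable id cannot match a watched id and are kept.
is_watched = candidate_ids.isin(watched_ids).reindex(recs.index, fill_value=False)
mask_not_watched = ~is_watched
filtered_recs = recs.loc[mask_not_watched].copy()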

print("Filtered recommendations:", len(filtered_recs), " / original:", len(recs))
filtered_recs.head(10)
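
The same negation can also be written as an anti-join, which some readers find closer to the logical form. A sketch using `DataFrame.merge` with `indicator=True`; the helper name `anti_join` is hypothetical and the approach is an alternative to the mask above, not what this notebook runs:

```python
import pandas as pd


def anti_join(recs, watched_ids, id_col):
    """Keep rows of recs whose id_col value has no match in watched_ids."""
    watched_df = pd.DataFrame({id_col: pd.Series(sorted(watched_ids), dtype="int64")})
    merged = recs.merge(watched_df, on=id_col, how="left", indicator=True)
    # "left_only" marks candidates with no watched counterpart
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
```

Unlike the boolean mask, the merge returns a frame with a fresh index, so use it only when the original row labels are not needed afterwards.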


## Save Filtered Recommendations

In [None]:

output_csv.parent.mkdir(parents=True, exist_ok=True)
filtered_recs.to_csv(output_csv, index=False)
print("Saved:", output_csv.resolve())



## (Optional) Materialize `:recommendedFor` Triples (RDF)

If you want to **demonstrate LO2 inside the KG**, you can materialize a triple for each filtered recommendation:

```
:movie_<ID> :recommendedFor :user_tobias .
```

Run the cell below to create `data/kg/recommended_materialized.ttl`.


In [None]:

try:
    from rdflib import Graph, Namespace, URIRef, RDF
    
    EX = Namespace("http://example.org/")
    g = Graph()
    g.bind("ex", EX)
    
    user = EX.user_tobias
    
    # materialize recommendedFor only for filtered_recs
    id_series = _to_int_series(filtered_recs[recs_id_col]).astype(str)
    for mid in id_series:
        m = URIRef(f"http://example.org/movie_{mid}")
        g.add((m, RDF.type, EX.Movie))
        g.add((m, EX.recommendedFor, user))
    
    ttl_out = project_root / "data" / "kg" / "recommended_materialized.ttl"
    g.serialize(destination=str(ttl_out), format="turtle")
    print("Wrote TTL:", ttl_out.resolve())
except Exception as e:
    print("Skipping RDF materialization (rdflib not available or error):", e)
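
If rdflib is not installed, the same triples can be written as plain Turtle text. The sketch below is a dependency-free fallback under the same assumptions as the cell above (the `http://example.org/` namespace and the `user_tobias` individual); the function name `to_turtle` is hypothetical.

```python
def to_turtle(movie_ids, user="user_tobias", base="http://example.org/"):
    """Render one ex:Movie typing triple and one ex:recommendedFor triple per id."""
    lines = [f"@prefix ex: <{base}> .", ""]
    for mid in sorted(set(movie_ids)):  # de-duplicate and keep the output stable
        lines.append(f"ex:movie_{mid} a ex:Movie ;")
        lines.append(f"    ex:recommendedFor ex:{user} .")
    return "\n".join(lines) + "\n"


# Toy ids for illustration only
print(to_turtle([603, 155]))
```

The returned string can be written to `recommended_materialized.ttl` with `Path.write_text`; rdflib remains the safer choice when ids or names might contain characters that need escaping.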


## Before/After Summary

In [None]:

summary = {
    "candidates_total": len(recs),
    "candidates_unique": int(recs[recs_id_col].nunique()),
    "watched_total": len(watched),
    "watched_unique_ids": len(watched_ids),
    "filtered_total": len(filtered_recs),
    "filtered_unique": int(filtered_recs[recs_id_col].nunique()),
    "removed_by_rule": int(len(recs) - len(filtered_recs)),
}
summary
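
Beyond the counts, the rule implies a few invariants that can be checked directly. A minimal sketch, with the hypothetical helper `check_lo2` taking three collections of ids:

```python
def check_lo2(candidate_ids, filtered_ids, watched_ids):
    """Report whether the filtered set is exactly candidates minus watched."""
    candidates, filtered, watched = map(set, (candidate_ids, filtered_ids, watched_ids))
    return {
        # No recommended movie has been watched
        "no_watched_left": filtered.isdisjoint(watched),
        # Every recommended movie was a candidate
        "nothing_invented": filtered <= candidates,
        # Every unwatched candidate is still recommended
        "nothing_lost": (candidates - watched) <= filtered,
    }


# Toy ids for illustration only
print(check_lo2([1, 2, 3], [1, 3], {2, 9}))
```

All three values should be `True` when the filter implements the rule correctly.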
