### Mapping from MovieLens genres to simplified categories for the demo

| MovieLens Genre        | Simplified Category      | Notes / Reasoning |
|------------------------|--------------------------|------------------|
| Comedy                 | neutral                  | Lighthearted, non-political |
| Animation              | neutral                  | Kid-friendly, neutral themes |
| Children's             | neutral                  | Family content, neutral |
| Musical                | neutral                  | Entertainment-focused |
| Action                 | mildly_political         | Often aggressive, conflict-oriented |
| Adventure              | mildly_political         | Excitement, quests, some thematic tension |
| Thriller               | mildly_political         | Suspenseful or intense content |
| Crime                  | mildly_political         | Often deals with moral or social issues |
| War                    | extreme                  | Conflict, violence, ideological narratives |
| Film-Noir              | extreme                  | Dark, morally ambiguous, intense themes |
| Horror                 | extreme                  | Fear-inducing, extreme reactions |
| Drama                  | neutral                  | Everyday situations; could also be split depending on theme |
| Romance                | neutral                  | Love stories, neutral |
| Sci-Fi                 | neutral                  | Speculative, usually non-political |
| Mystery                | mildly_political         | Intrigue, tension, societal questions |
| Fantasy                | neutral                  | Imaginary worlds, escapism |
| Documentary            | neutral                  | Informational, generally neutral |
| Action-Adventure/Other | mildly_political         | If ambiguous, treat as mildly political |
| Unknown                | neutral                  | Default neutral |
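
Because the table is maintained by hand, it may help to check it against the genre columns that are actually loaded from `u.item`. The sketch below is only illustrative: `TABLE_MAPPING`, `ML100K_GENRE_COLUMNS`, and `coverage_report` are hypothetical names introduced here, the dictionary is a transcription of the table above, and the column list is assumed to be the 19 genre names used in the loading code of this notebook.

```python
# Transcription of the table above (label -> simplified category).
TABLE_MAPPING = {
    "Comedy": "neutral",
    "Animation": "neutral",
    "Children's": "neutral",
    "Musical": "neutral",
    "Action": "mildly_political",
    "Adventure": "mildly_political",
    "Thriller": "mildly_political",
    "Crime": "mildly_political",
    "War": "extreme",
    "Film-Noir": "extreme",
    "Horror": "extreme",
    "Drama": "neutral",
    "Romance": "neutral",
    "Sci-Fi": "neutral",
    "Mystery": "mildly_political",
    "Fantasy": "neutral",
    "Documentary": "neutral",
    "Action-Adventure/Other": "mildly_political",
    "Unknown": "neutral",
}

# Assumed: the 19 genre flag columns named in this notebook's u.item loader.
ML100K_GENRE_COLUMNS = [
    "unknown", "Action", "Adventure", "Animation", "Children's", "Comedy", "Crime",
    "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical", "Mystery",
    "Romance", "Sci-Fi", "Thriller", "War", "Western",
]


def coverage_report(mapping, genre_columns):
    """Compare table labels with dataset columns, ignoring case."""
    table_labels = {label.lower() for label in mapping}
    column_labels = {col.lower() for col in genre_columns}
    # Dataset columns that have no row in the table.
    missing = [col for col in genre_columns if col.lower() not in table_labels]
    # Table rows that do not correspond to any dataset column.
    extra = [label for label in mapping if label.lower() not in column_labels]
    return missing, extra


missing, extra = coverage_report(TABLE_MAPPING, ML100K_GENRE_COLUMNS)
print("Genre columns without a table row:", missing)
print("Table rows without a genre column:", extra)
```

With the table as written, this reports `Western` as having no row (the conversion code below falls back to `neutral` for it) and `Action-Adventure/Other` as a row that matches no genre column.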


In [2]:
# convert_movielens100k.py
import pandas as pd

# -------------------------
# Paths to Movielens 100K files
# -------------------------
ratings_file = "../data/ml-100k/u.data"    # format: userId \t movieId \t rating \t timestamp
movies_file = "../data/ml-100k/u.item"     # format: movieId | title | release_date | ... | genres (19 binary columns)

# -------------------------
# Load ratings
# -------------------------
ratings_cols = ["userId", "movieId", "rating", "timestamp"]
ratings = pd.read_csv(ratings_file, sep="\t", names=ratings_cols, encoding="latin-1")
ratings.drop(columns=["timestamp"], inplace=True)

# -------------------------
# Load movie metadata
# -------------------------
movie_cols = ["movieId", "title", "release_date", "video_release_date", "IMDb_URL"] + [
    "unknown", "Action", "Adventure", "Animation", "Children's", "Comedy", "Crime", "Documentary",
    "Drama", "Fantasy", "Film-Noir", "Horror", "Musical", "Mystery", "Romance", "Sci-Fi",
    "Thriller", "War", "Western"
]
movies = pd.read_csv(movies_file, sep="|", names=movie_cols, encoding="latin-1")

# -------------------------
# Map genres to simplified categories
# -------------------------
def map_genre_to_category(row):
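    # Example mapping (checked in this order; the first match wins):
    # Comedy, Musical, Animation, Children's -> neutral
    # Action, Adventure, Thriller, Crime -> mildly_political
    # War, Film-Noir, Horror -> extreme
    # The 19 genre flags start at position 5 ("unknown"), right after the
    # 5 metadata columns. Slicing at 6 would skip "unknown" and "Action",
    # so select the flag columns by name instead.
    genres = row[movie_cols[5:]]
    genre_names = genres[genres == 1].index.tolist()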
    if any(g in ["Comedy", "Animation", "Children's", "Musical"] for g in genre_names):
        return "neutral"
    # "Mystery" is included here to match the mapping table (mildly_political)
    elif any(g in ["Action", "Adventure", "Thriller", "Crime", "Mystery"] for g in genre_names):
        return "mildly_political"
    elif any(g in ["War", "Film-Noir", "Horror"] for g in genre_names):
        return "extreme"
    else:
        return "neutral"

movies["category"] = movies.apply(map_genre_to_category, axis=1)

# Keep only relevant columns
movies_clean = movies[["movieId", "title", "category"]]

# -------------------------
# Merge ratings with movie info
# -------------------------
merged = ratings.merge(movies_clean, on="movieId")

# -------------------------
# Save to CSV for demo
# -------------------------
merged.to_csv("movielens_100k_categories.csv", index=False)
print("Saved converted CSV: movielens_100k_categories.csv")


Saved converted CSV: movielens_100k_categories.csv
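
Note that the row-wise function above checks the neutral genres first, so a film tagged both Comedy and War comes out as `neutral`, whereas the subset cell later in this notebook resolves conflicts as extreme > mildly_political > neutral. As a hedged alternative, the sketch below applies that priority order directly to the 0/1 flag columns with `np.select`. The helper name `categorize_flags`, the `EXTREME` and `MILD` lists, and the toy frame are assumptions made for illustration, not part of the original script.

```python
import numpy as np
import pandas as pd

# Assumed genre groups, taken from the mapping table in this notebook.
EXTREME = ["War", "Film-Noir", "Horror"]
MILD = ["Action", "Adventure", "Thriller", "Crime", "Mystery"]


def categorize_flags(df):
    """Return one category per row from 0/1 genre flag columns."""
    is_extreme = df[EXTREME].eq(1).any(axis=1)
    is_mild = df[MILD].eq(1).any(axis=1)
    # np.select takes the first condition that is True for each row,
    # so the order of the conditions encodes the priority.
    labels = np.select(
        [is_extreme, is_mild],
        ["extreme", "mildly_political"],
        default="neutral",
    )
    return pd.Series(labels, index=df.index, name="category")


# Tiny synthetic example (not real MovieLens rows).
flag_cols = EXTREME + MILD + ["Comedy", "Drama"]
toy = pd.DataFrame(0, index=["comedy_war", "action_comedy", "drama_only"], columns=flag_cols)
toy.loc["comedy_war", ["Comedy", "War"]] = 1
toy.loc["action_comedy", ["Action", "Comedy"]] = 1
toy.loc["drama_only", "Drama"] = 1

result = categorize_flags(toy)
print(result.to_dict())
```

Whether a Comedy/War film should count as `neutral` or `extreme` is a design decision for the demo; the point of the sketch is only that both cells can share one rule if that is desired.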


## Small subset

In [3]:
import pandas as pd
import numpy as np

# -----------------------------
# Parameters
# -----------------------------
NUM_USERS = 10   # number of users to keep
NUM_MOVIES = 30  # number of movies to keep

# Mapping from MovieLens genres to simplified categories
genre_to_category = {
    "Comedy": "neutral",
    "Animation": "neutral",
    "Children": "neutral",
    "Musical": "neutral",
    "Drama": "neutral",
    "Romance": "neutral",
    "Sci-Fi": "neutral",
    "Fantasy": "neutral",
    "Action": "mildly_political",
    "Adventure": "mildly_political",
    "Thriller": "mildly_political",
    "Crime": "mildly_political",
    "Mystery": "mildly_political",
    "War": "extreme",
    "Horror": "extreme",
    "Film-Noir": "extreme",
    "Documentary": "neutral",
    "Action-Adventure": "mildly_political",
    "Other": "neutral",
    "Unknown": "neutral"
}

# Priority order for conflict resolution: extreme > mildly_political > neutral
priority = {"extreme": 3, "mildly_political": 2, "neutral": 1}

# -----------------------------
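# Load MovieLens data (CSV layout)
# -----------------------------
# NOTE: movies.csv / ratings.csv are not part of the 100K archive, which ships
# u.item / u.data. This cell expects the CSV layout of newer releases such as
# ml-latest-small, with both files in the working directory.
# movies.csv should have: movieId,title,genres (genres are "|"-separated)
movies = pd.read_csv("movies.csv")
ratings = pd.read_csv("ratings.csv")  # userId,movieId,rating,timestamp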

# -----------------------------
# Subset random users and movies
# -----------------------------
unique_users = ratings["userId"].unique()
selected_users = np.random.choice(unique_users, size=NUM_USERS, replace=False)

unique_movies = movies["movieId"].unique()
selected_movies = np.random.choice(unique_movies, size=NUM_MOVIES, replace=False)

ratings_subset = ratings[ratings["userId"].isin(selected_users) & ratings["movieId"].isin(selected_movies)].copy()
movies_subset = movies[movies["movieId"].isin(selected_movies)].copy()

# -----------------------------
# Assign single category per movie
# -----------------------------
def assign_category(genres_str):
    genres = genres_str.split("|")
    categories = [genre_to_category.get(g, "neutral") for g in genres]
    # pick the highest-priority category
    return max(categories, key=lambda c: priority[c])

movies_subset["category"] = movies_subset["genres"].apply(assign_category)

# -----------------------------
# Merge ratings and movie info
# -----------------------------
ratings_subset = ratings_subset.merge(movies_subset[["movieId", "title", "category"]], on="movieId", how="left")
ratings_subset = ratings_subset[["userId", "title", "rating", "category"]]

# -----------------------------
# Save to CSV for demo
# -----------------------------
ratings_subset.to_csv("movielens_demo_subset.csv", index=False)
print("Demo subset CSV saved: movielens_demo_subset.csv")


FileNotFoundError: [Errno 2] No such file or directory: 'movies.csv'

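Once `movies.csv` and `ratings.csv` are available in the working directory, this cell randomly selects 10 users and 30 movies, collapses each movie's genres into a single category using the priority order extreme > mildly_political > neutral, merges the ratings with movie titles and categories, and saves a CSV like this: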

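| userId | title               | rating | category         |
|--------|---------------------|--------|------------------|
| 1      | Toy Story           | 4.0    | neutral          |
| 2      | Die Hard            | 5.0    | mildly_political |
| 3      | Saving Private Ryan | 4.5    | extreme          |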
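
One caveat with the subset cell: users and movies are sampled independently, so the intersection of "ratings by the 10 users" and "ratings of the 30 movies" can be very small or even empty. A minimal sketch of one way around this is to sample movies only from those the chosen users have rated, with a fixed seed for reproducibility. The function name `sample_dense_subset`, its parameters, and the synthetic ratings frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def sample_dense_subset(ratings, num_users, num_movies, seed=0):
    """Sample users first, then movies from what those users rated."""
    rng = np.random.default_rng(seed)
    users = ratings["userId"].unique()
    picked_users = rng.choice(users, size=min(num_users, len(users)), replace=False)
    by_user = ratings[ratings["userId"].isin(picked_users)]
    # Candidate movies all have at least one rating from a picked user,
    # so every sampled movie contributes at least one row.
    candidates = by_user["movieId"].unique()
    picked_movies = rng.choice(
        candidates, size=min(num_movies, len(candidates)), replace=False
    )
    return by_user[by_user["movieId"].isin(picked_movies)].copy()


# Synthetic ratings: 20 users, each rating 5 distinct movies (not real data).
toy_ratings = pd.DataFrame(
    {
        "userId": np.repeat(np.arange(1, 21), 5),
        "movieId": np.arange(100) + 1000,
        "rating": 3.0,
    }
)

subset = sample_dense_subset(toy_ratings, num_users=4, num_movies=6, seed=42)
print(subset["userId"].nunique(), subset["movieId"].nunique(), len(subset))
```

The trade-off is that the movie sample is no longer uniform over the whole catalogue: it is biased toward titles the selected users have rated, which is usually acceptable for a small demo file.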