# Movie Recommendation Demo (Mini "Netflix"-Style Recommender)

In this notebook we will build a tiny, **end-to-end recommendation pipeline** for movies.

We will walk through the same conceptual steps that a large system (like Netflix) uses, but on a *very* small and simple dataset so we can understand what's going on:

1. **Create a tiny dataset**
   - Movies with basic metadata (title, genres)
   - Users with ratings

2. **Baseline: Popularity-based recommendations**
   - Recommend movies that are globally popular.

3. **Content-based filtering**
   - Represent movies by their genres.
   - Find movies similar to the ones a user already likes.

4. **Collaborative filtering (matrix factorization)**
   - Learn user and movie "embeddings" from the ratings matrix.
   - Predict how much a user would like an unseen movie.

5. **Simple hybrid recommender**
   - Combine content-based and collaborative signals.
   - Recommend top movies for a target user.

Along the way, we'll connect the math to the concepts used in real-world recommender systems.


In [None]:
# Basic imports we will use throughout the notebook

import numpy as np
import pandas as pd

from sklearn.metrics.pairwise import cosine_similarity


## 1. Create a Tiny Demo Dataset

In the next cell, we define a **small, hard-coded dataset** so we don't depend on any external files.

We will create:
- A `movies` table with:
  - `movie_id`
  - `title`
  - `genres` (comma-separated)
- A `ratings` table with:
  - `user_id`
  - `movie_id`
  - `rating` (on a 1–5 scale)

This will act as our mini "Netflix catalog".


In [None]:
# Define a small movie catalog
movies = pd.DataFrame([
    {"movie_id": 1, "title": "The Matrix",          "genres": "Action,Sci-Fi"},
    {"movie_id": 2, "title": "The Matrix Reloaded", "genres": "Action,Sci-Fi"},
    {"movie_id": 3, "title": "Inception",           "genres": "Action,Sci-Fi,Thriller"},
    {"movie_id": 4, "title": "Interstellar",        "genres": "Sci-Fi,Drama"},
    {"movie_id": 5, "title": "The Notebook",        "genres": "Romance,Drama"},
    {"movie_id": 6, "title": "Crazy, Stupid, Love", "genres": "Romance,Comedy"},
    {"movie_id": 7, "title": "Toy Story",           "genres": "Animation,Family,Comedy"},
    {"movie_id": 8, "title": "Finding Nemo",        "genres": "Animation,Family,Adventure"},
])

# Define a tiny set of user ratings
ratings = pd.DataFrame([
    # user 1 likes sci-fi and action
    {"user_id": 1, "movie_id": 1, "rating": 5},
    {"user_id": 1, "movie_id": 3, "rating": 4},
    {"user_id": 1, "movie_id": 4, "rating": 4},

    # user 2 likes romance & drama
    {"user_id": 2, "movie_id": 5, "rating": 5},
    {"user_id": 2, "movie_id": 6, "rating": 4},
    {"user_id": 2, "movie_id": 4, "rating": 3},

    # user 3 likes family/animation
    {"user_id": 3, "movie_id": 7, "rating": 5},
    {"user_id": 3, "movie_id": 8, "rating": 4},
    {"user_id": 3, "movie_id": 6, "rating": 3},

    # user 4 is a mix: likes sci-fi and animation
    {"user_id": 4, "movie_id": 1, "rating": 4},
    {"user_id": 4, "movie_id": 3, "rating": 5},
    {"user_id": 4, "movie_id": 7, "rating": 4},

    # user 5 likes romance and some sci-fi
    {"user_id": 5, "movie_id": 5, "rating": 4},
    {"user_id": 5, "movie_id": 6, "rating": 5},
    {"user_id": 5, "movie_id": 3, "rating": 3},
])

movies, ratings.head()


## 2. Build the User–Item Ratings Matrix

Recommender systems often start from a **user–item matrix**, where each row is a user, each column is a movie, and each cell is a rating.

In the next cell we:
- Pivot the `ratings` table into a matrix.
- Leave `NaN` wherever a user has not rated a movie.

This matrix is the raw data that collaborative filtering methods work with.


In [None]:
# Create the user–item matrix
user_item_matrix = ratings.pivot_table(
    index="user_id",
    columns="movie_id",
    values="rating"
)

user_item_matrix
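
To get a feel for how sparse such a matrix is, the sketch below builds a small pivot and measures the share of filled cells. The `demo_ratings` table is an illustrative stand-in, not the notebook's full data.

```python
import pandas as pd

# Illustrative stand-in for a ratings table (not the notebook's full data)
demo_ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3],
    "movie_id": [1, 3, 3, 5, 5],
    "rating":   [5, 4, 3, 5, 4],
})

demo_matrix = demo_ratings.pivot_table(
    index="user_id", columns="movie_id", values="rating"
)

# notna() marks observed ratings; observed cells / all cells is the density
observed = int(demo_matrix.notna().sum().sum())
density = observed / demo_matrix.size
print(f"{observed} of {demo_matrix.size} cells observed (density = {density:.2f})")
```

Applied to `user_item_matrix` above, the same calculation gives 15 of 35 cells, because movie 2 has no ratings and therefore gets no column.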


## 3. Baseline: Popularity-Based Recommendations

Before using any "smart" model, a simple baseline is to:
- Compute the **average rating** for each movie.
- Recommend movies with the highest average rating (or most ratings).

This mimics a very naive recommender: "show what's popular".

In the code below we:
1. Compute the average rating and count of ratings per movie.
2. Join this with the `movies` table.
3. Show the most popular movies.


In [None]:
movie_stats = ratings.groupby("movie_id")["rating"].agg(["mean", "count"]).reset_index()
movie_stats = movie_stats.merge(movies, on="movie_id")

movie_stats.sort_values(by="mean", ascending=False)
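
With only a handful of ratings per movie, a raw mean can be dominated by a single enthusiastic rating. One common adjustment, sketched here under assumptions, is a damped mean that pulls low-count movies toward the global average. The `demo_ratings` data and the `damping` value are hypothetical and chosen only to make the effect visible.

```python
import pandas as pd

# Hypothetical ratings: movie 1 has four ratings, movie 2 only one
demo_ratings = pd.DataFrame({
    "movie_id": [1, 1, 1, 1, 2, 3, 3],
    "rating":   [5, 5, 5, 4, 5, 2, 3],
})

stats = demo_ratings.groupby("movie_id")["rating"].agg(["mean", "count"])
global_mean = demo_ratings["rating"].mean()

# Assumed strength: behaves like `damping` extra ratings at the global mean
damping = 2
stats["damped_mean"] = (
    stats["mean"] * stats["count"] + global_mean * damping
) / (stats["count"] + damping)

print(stats.sort_values("damped_mean", ascending=False))
```

In this example, movie 2 has the highest raw mean from its single rating, while movie 1 ranks first once the damping is applied.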


## 4. Content-Based Filtering (Using Genres)

Now we move to **content-based filtering**.

Idea:
- Represent each movie by its **features** (here: genres).
- If a user likes a movie, recommend other movies with **similar features**.

Steps in the next cells:
1. Turn genres (text) into a **one-hot encoded** feature matrix.
2. Compute **cosine similarity** between movies based on these features.
3. For a given movie, find the most similar movies.


In [None]:
# One-hot encode the genres
all_genres = sorted(
    {g for genre_str in movies["genres"] for g in genre_str.split(",")}
)

for g in all_genres:
    movies[g] = movies["genres"].apply(lambda s: 1 if g in s.split(",") else 0)

genre_features = movies[all_genres].values  # shape: (num_movies, num_genres)
genre_features
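
As an alternative to the explicit loop, pandas can build the same kind of indicator columns in one call with `Series.str.get_dummies`. The sketch below uses a small stand-in table (`demo_movies`) rather than the notebook's `movies` frame.

```python
import pandas as pd

# Small stand-in catalog with comma-separated genres
demo_movies = pd.DataFrame({
    "title":  ["The Matrix", "Interstellar", "The Notebook"],
    "genres": ["Action,Sci-Fi", "Sci-Fi,Drama", "Romance,Drama"],
})

# Split each string on "," and create one 0/1 column per distinct genre
genre_dummies = demo_movies["genres"].str.get_dummies(sep=",")
print(genre_dummies)
```

The loop version has the advantage of making every step visible, which may suit a teaching notebook better.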


### Compute Movie–Movie Similarity

Using the one-hot genre vectors, we compute a **cosine similarity matrix** between movies.

Cosine similarity measures the angle between two vectors:
- 1.0 = pointing in the same direction (identical genre sets)
- 0.0 = orthogonal (no genres in common)

For one-hot genre vectors, this is the number of shared genres divided by the square root of the product of the two movies' genre counts.
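
As a worked example, here is the calculation by hand for The Matrix and Inception. The three-genre ordering is an assumption made for this sketch; genres that neither movie has contribute nothing to the result.

```python
import numpy as np

# Genre order assumed for this example: [Action, Sci-Fi, Thriller]
matrix_vec = np.array([1, 1, 0])     # The Matrix: Action, Sci-Fi
inception_vec = np.array([1, 1, 1])  # Inception: Action, Sci-Fi, Thriller

shared = matrix_vec @ inception_vec  # dot product = number of shared genres
cosine = shared / (np.linalg.norm(matrix_vec) * np.linalg.norm(inception_vec))

print(round(float(cosine), 3))  # 2 / (sqrt(2) * sqrt(3)) ≈ 0.816
```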

In the code below:
- `movie_sim_matrix[i, j]` is "how similar movie *i* is to movie *j* in terms of genres".


In [None]:
# Compute cosine similarity between movies based on genres
movie_sim_matrix = cosine_similarity(genre_features, genre_features)

# Put into a DataFrame for readability
movie_sim_df = pd.DataFrame(
    movie_sim_matrix,
    index=movies["movie_id"],
    columns=movies["movie_id"]
)

movie_sim_df.round(2)


### Helper Function: Content-Based Similar Movies

Now we create a helper function that:
- Takes a `movie_id`,
- Looks up its row in the similarity matrix,
- Returns the top-N most similar movies (excluding itself).

We'll use this later in the hybrid recommender.


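In [None]:
def get_similar_movies_content_based(movie_id, top_n=5):
    if movie_id not in movie_sim_df.index:
        raise ValueError(f"Unknown movie_id: {movie_id}")

    sims = movie_sim_df.loc[movie_id].drop(movie_id)  # drop self
    top_sims = sims.sort_values(ascending=False).head(top_n)

    # Look rows up in ranked order so each similarity lines up with its movie
    # (filtering with isin() would return rows in catalog order instead)
    result = movies.set_index("movie_id").loc[top_sims.index]
    return result.assign(similarity=top_sims.values).reset_index()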

# Example: movies similar to "The Matrix"
get_similar_movies_content_based(movie_id=1, top_n=3)


### Visualizing Movie Similarity (Heatmap)

Before moving on, let's visualize the **movie–movie similarity matrix** we computed using genre vectors.

A heatmap helps us see patterns:
- Clusters of similar movies
- Clear groupings like:
  - Sci-Fi movies
  - Romance/Drama movies
  - Animation/Family movies  

In the next cell, we create a heatmap using seaborn.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Prepare a readable matrix with titles instead of IDs
sim_matrix_titles = movie_sim_df.copy()
sim_matrix_titles.index = movies.set_index("movie_id").loc[sim_matrix_titles.index, "title"]
sim_matrix_titles.columns = movies.set_index("movie_id").loc[sim_matrix_titles.columns, "title"]

plt.figure(figsize=(10, 8))
sns.heatmap(sim_matrix_titles, annot=False, cmap="viridis")
plt.title("Movie Similarity Heatmap (Content-Based, Genre Features)")
plt.tight_layout()
plt.show()


## 5. Collaborative Filtering with Matrix Factorization

Now we move closer to what large-scale systems like Netflix use: **collaborative filtering**.

Idea:
- The **pattern of ratings** contains hidden structure.
- We assume each user and each movie can be represented in a **latent space**:
  - Each user has a latent vector `U[user]`.
  - Each movie has a latent vector `V[movie]`.
- The predicted rating is the **dot product**:
  $$
  \hat{r}_{u,i} = U_u \cdot V_i
  $$

This is basically a tiny **neural network with embeddings**:
- One embedding table for users, one for movies.
- Multiply and sum: that's a dot-product layer.
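
To make the dot-product prediction concrete, here is a sketch with hand-picked 2-factor vectors. The numbers are hypothetical, not learned: the user leans toward factor 0, so the movie that is strong on factor 0 scores higher.

```python
import numpy as np

# Hypothetical 2-factor embeddings, chosen by hand for illustration
user_vec = np.array([1.5, 0.5])  # leans toward factor 0
movie_a = np.array([2.0, 1.0])   # strong on factor 0
movie_b = np.array([0.5, 2.0])   # strong on factor 1

score_a = user_vec @ movie_a     # 1.5*2.0 + 0.5*1.0 = 3.5
score_b = user_vec @ movie_b     # 1.5*0.5 + 0.5*2.0 = 1.75
print(score_a, score_b)
```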

In the code below we:
1. Initialize random user and movie latent factors.
2. Run **stochastic gradient descent (SGD)** on the observed ratings, updating the factors after each rating.
3. Learn `U` and `V` so that their dot products approximate the actual ratings.


In [None]:
# Prepare data for matrix factorization
user_ids = sorted(ratings["user_id"].unique())
movie_ids = sorted(ratings["movie_id"].unique())

user_id_to_index = {u: i for i, u in enumerate(user_ids)}
movie_id_to_index = {m: i for i, m in enumerate(movie_ids)}

num_users = len(user_ids)
num_movies = len(movie_ids)

# Convert ratings to index-based triples for training
train_data = [
    (user_id_to_index[row.user_id], movie_id_to_index[row.movie_id], row.rating)
    for row in ratings.itertuples()
]

# Hyperparameters
k = 3              # number of latent factors (very small for demo)
lr = 0.01          # learning rate
reg = 0.1          # L2 regularization
num_epochs = 500   # training iterations

# Initialize user and movie latent factors
np.random.seed(42)
U = 0.1 * np.random.randn(num_users, k)
V = 0.1 * np.random.randn(num_movies, k)

# Stochastic gradient descent: one update per observed rating
for epoch in range(num_epochs):
    np.random.shuffle(train_data)
    total_loss = 0.0

    for u_idx, m_idx, r in train_data:
        # Predict rating
        pred = np.dot(U[u_idx], V[m_idx])
        error = r - pred

        # Compute gradients (with L2 regularization)
        dU = -2 * error * V[m_idx] + 2 * reg * U[u_idx]
        dV = -2 * error * U[u_idx] + 2 * reg * V[m_idx]

        # Update
        U[u_idx] -= lr * dU
        V[m_idx] -= lr * dV

        # Accumulate squared error
        total_loss += error**2

    if (epoch + 1) % 100 == 0:
        rmse = np.sqrt(total_loss / len(train_data))
        print(f"Epoch {epoch+1}/{num_epochs} - RMSE: {rmse:.3f}")
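
To see what a single update inside the loop does, the sketch below applies one step to one rating. The starting vectors `u` and `v` are hypothetical; `lr` and `reg` match the hyperparameters above.

```python
import numpy as np

lr, reg = 0.01, 0.1        # same values as the hyperparameters above
u = np.array([0.5, 0.5])   # hypothetical user factors
v = np.array([1.0, 1.0])   # hypothetical movie factors
r = 4.0                    # observed rating

pred_before = u @ v        # 1.0
error = r - pred_before    # 3.0

# Gradients of error**2 + reg * (||u||^2 + ||v||^2), both computed before updating
grad_u = -2 * error * v + 2 * reg * u
grad_v = -2 * error * u + 2 * reg * v
u, v = u - lr * grad_u, v - lr * grad_v

pred_after = u @ v
print(f"before: {pred_before:.3f}, after: {pred_after:.3f}")
```

The prediction moves a small step from 1.0 toward the observed 4.0; repeating this over all ratings for many epochs is what drives the RMSE down.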


### Predict Ratings with the Learned Latent Factors

Now that we have trained latent factors:
- `U[user]` = user embedding
- `V[movie]` = movie embedding

We can:
- Predict a rating for any user–movie pair in which both the user and the movie appear in the training data
- Even if that user never rated that movie before

In the next cell, we define a helper function `predict_rating_cf` that:
- Takes a `user_id` and `movie_id`
- Looks up their indices
- Returns the predicted rating (dot product of embeddings)


In [None]:
def predict_rating_cf(user_id, movie_id):
    if user_id not in user_id_to_index or movie_id not in movie_id_to_index:
        return np.nan
    u_idx = user_id_to_index[user_id]
    m_idx = movie_id_to_index[movie_id]
    return float(np.dot(U[u_idx], V[m_idx]))

# Example: predict how user 1 would rate each movie that has at least one rating
preds_for_user1 = []
for mid in movie_ids:
    preds_for_user1.append({
        "movie_id": mid,
        "title": movies.loc[movies["movie_id"] == mid, "title"].values[0],
        "pred_rating": predict_rating_cf(1, mid)
    })

pd.DataFrame(preds_for_user1).sort_values(by="pred_rating", ascending=False)
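
All predictions can also be computed at once as a matrix product, and because a dot product is unbounded, it can help to clip the scores to the rating scale before displaying them. The factor matrices below (`U_demo`, `V_demo`) are hypothetical values, not the ones learned above.

```python
import numpy as np

# Hypothetical factors: 2 users x 2 factors, 3 movies x 2 factors
U_demo = np.array([[2.0, 0.5],
                   [0.2, 0.1]])
V_demo = np.array([[2.5, 1.0],
                   [1.0, 1.0],
                   [0.5, 2.0]])

pred_matrix = U_demo @ V_demo.T                 # shape: (2 users, 3 movies)
pred_clipped = np.clip(pred_matrix, 1.0, 5.0)   # keep scores inside the 1-5 scale
print(pred_clipped)
```

Clipping is a presentation choice in this sketch; the unclipped values are still what the training loss is computed on.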


## 6. A Simple Hybrid Recommender

Production recommenders such as Netflix's typically combine many signals:
- Collaborative filtering (what users with similar tastes watched)
- Content-based similarity (items similar to those you liked)
- Popularity, freshness, etc.

We will build a **very simple hybrid** model:
1. Choose a target `user_id`.
2. Find movies they have **not** rated yet.
3. For each candidate movie:
   - Collaborative score: predicted rating from matrix factorization.
   - Content score: similarity to the movies this user rated highly (from genres).
4. Combine these scores with a weighted average.

The result is a ranked list of **personalized movie recommendations** for that user.
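
The weighted average in step 4 can be sketched in isolation. The helper name `hybrid_score`, its `rating_max` parameter, and the input values are hypothetical and used only to show the arithmetic.

```python
def hybrid_score(cf_score, content_sim, alpha=0.7, rating_max=5.0):
    """Blend a CF rating prediction with a [0, 1] content similarity."""
    content_scaled = content_sim * rating_max  # bring similarity onto the rating scale
    return alpha * cf_score + (1 - alpha) * content_scaled

# Hypothetical inputs: CF predicts 3.2, average genre similarity is 0.5
blended = hybrid_score(3.2, 0.5)                  # 0.7*3.2 + 0.3*2.5 = 2.99
content_only = hybrid_score(3.2, 0.5, alpha=0.0)  # 0.0*3.2 + 1.0*2.5 = 2.5
print(round(blended, 2), round(content_only, 2))
```

Setting `alpha` closer to 1.0 lets the collaborative prediction dominate, while values near 0.0 rank almost entirely by genre similarity.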


In [None]:
def get_user_profile_high_rated_movies(user_id, min_rating=4.0):
    """Return the set of movie_ids that the user rated >= min_rating."""
    user_ratings = ratings[ratings["user_id"] == user_id]
    return set(user_ratings[user_ratings["rating"] >= min_rating]["movie_id"].values)

def content_score_for_movie(user_id, candidate_movie_id):
    """Compute content-based score for a candidate movie for a given user.
    
    We define it as:
    - Average genre similarity between the candidate movie and
      all movies that the user rated highly.
    """
    liked_movies = get_user_profile_high_rated_movies(user_id)
    if not liked_movies:
        return 0.0
    
    sims = []
    for m in liked_movies:
        sims.append(movie_sim_df.loc[candidate_movie_id, m])
    return float(np.mean(sims))

def recommend_movies_hybrid(user_id, top_n=5, alpha=0.7):
    """
    Hybrid recommender:
    final_score = alpha * CF_score + (1 - alpha) * content_score
    
    alpha close to 1.0 = mostly collaborative filtering
    alpha close to 0.0 = mostly content-based
    """
    # Movies user has already rated
    user_rated = set(ratings[ratings["user_id"] == user_id]["movie_id"].values)
    
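    # Note: movie_ids only contains movies with at least one rating,
    # so titles nobody has rated yet are never candidates here.
    candidates = [m for m in movie_ids if m not in user_rated]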
    
    results = []
    for m in candidates:
        cf = predict_rating_cf(user_id, m)
        cb = content_score_for_movie(user_id, m)
        
        # Normalize to similar scale (roughly 0-5)
        # Here we assume:
        # - CF is already on rating scale (0-5-ish)
        # - Content score is in [0, 1], so scale it up
        content_scaled = cb * 5.0
        
        final_score = alpha * cf + (1 - alpha) * content_scaled
        
        results.append({
            "movie_id": m,
            "title": movies.loc[movies["movie_id"] == m, "title"].values[0],
            "cf_score": cf,
            "content_score": cb,
            "final_score": final_score,
        })
    
    results_df = pd.DataFrame(results)
    # Keep only the top_n highest-scoring candidates
    return results_df.sort_values(by="final_score", ascending=False).head(top_n).reset_index(drop=True)

# Example: recommend for user 1
recommend_movies_hybrid(user_id=1, top_n=5, alpha=0.7)


## 7. Connecting This to "Real" Netflix-Like Systems

In this notebook we have a toy pipeline that mirrors the **big ideas** behind a system like Netflix:

1. **Data layer**
   - We built a ratings matrix (`user_item_matrix`).
   - We defined movie metadata (genres).

2. **Candidate generation**
   - Collaborative filtering via matrix factorization (user and movie latent factors).
   - Content-based similarity using genres and cosine similarity.

3. **Ranking**
   - We combined multiple signals (CF prediction + content similarity) into a single score.

4. **Personalization**
   - Recommendations are different for each user based on their rating history.

In real production systems:
- The dataset is **massive** (millions of users, thousands of movies or more).
- Models are **bigger** (deep neural networks, sequence models, etc.).
- There are many other signals: time of day, device, thumbnail artwork, etc.
- Training and serving happen on **distributed infrastructure**.

But conceptually, what you saw here is the core idea:
> **Represent users and items as vectors, then use similarity and learned scores to rank items for each user.**

You can now integrate this notebook into your AI workshop to:
- Show participants how recommendation works step by step.
- Connect the math (vectors, dot products, optimization) to real-world AI use cases.
