# 🧠 Collaborative Filtering - User-Based Version

This notebook implements a user-based collaborative filtering system using the MovieLens 1M dataset. The goal is to recommend movies to a target user based on the preferences of similar users, computed via cosine similarity.

## ⚙️ 1. Setup

We set up the environment by importing necessary libraries and adjusting the Python path so that we can import our custom module located in the `src` directory.

We choose to modify the path dynamically inside the notebook to ensure portability and avoid depending on an installed package or preconfigured environment variables (like `PYTHONPATH`).

In [1]:
# Add parent directory to path so we can import local modules from the 'src' folder.
# This is necessary for notebook reproducibility when not using an installed package.
# Alternatively, setting PYTHONPATH or using a package structure would eliminate the need,
# but this approach keeps things simple and portable for this project.
import sys
sys.path.append("..")

import pandas as pd

from src.collaborative_filtering import CollaborativeFilteringRecommender

## 📥 2. Load Ratings Data

We load the MovieLens 1M ratings dataset, which contains over 1 million user ratings for different movies. This data will be used to build the user-item interaction matrix.

- Each row in the dataset represents a rating a user gave to a movie.
- This is the foundation for inferring user similarity and recommendations.

In [2]:
recommender = CollaborativeFilteringRecommender("../data/ml-1m/ratings.csv", "../data/processed/enriched_movies.csv")
recommender.load_data()

✅ Loaded 988380 ratings (filtered from 1000209) based on enriched metadata.


As a final revision, we restrict the ratings to movies that appear in the enriched dataset. This lets us show additional metadata for every recommendation and slightly reduces the original dataset (from 1,000,209 to 988,380 ratings).
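
The filtering step can be sketched in a few lines of pandas. This is a minimal illustration, not the code inside `load_data()`: the column names (`userId`, `movieId`, `rating`, `title`) and the tiny in-memory CSVs are assumptions made for the example.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for ratings.csv and enriched_movies.csv.
# Column names and values are assumptions for illustration only.
ratings_csv = io.StringIO(
    "userId,movieId,rating\n"
    "1,10,5\n"
    "1,20,3\n"
    "2,10,4\n"
    "2,30,2\n"
)
movies_csv = io.StringIO(
    "movieId,title\n"
    "10,Movie A\n"
    "20,Movie B\n"
)

ratings = pd.read_csv(ratings_csv)
movies = pd.read_csv(movies_csv)

# Keep only the ratings whose movie has enriched metadata.
filtered = ratings[ratings["movieId"].isin(movies["movieId"])]

print(f"Loaded {len(filtered)} ratings (filtered from {len(ratings)})")
# Loaded 3 ratings (filtered from 4)
```

Here the rating for movie 30 is dropped because that movie has no row in the metadata table.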

## 🧱 3. Build User-Item Matrix

We pivot the ratings data to construct a matrix where:

- Rows represent users
- Columns represent movies
- Cells contain the rating given by a user to a movie (0 if not rated)

In [3]:
recommender.build_user_item_matrix()

✅ Built user-item matrix with shape (6040, 3590)


We create a sparse matrix of shape $(n_{users}, m_{movies})$ in which each cell holds the rating that a user gave to a movie. Empty cells are treated as 0 (not rated).
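
As a rough sketch of the pivot described above, assuming a ratings table with `userId`, `movieId`, and `rating` columns (hypothetical names, the real method may differ), `pivot_table` with `fill_value=0` produces the same kind of matrix:

```python
import pandas as pd

# Toy ratings table; column names and values are assumptions for illustration.
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating": [5, 3, 4, 2],
})

# Rows become users, columns become movies, missing ratings become 0.
user_item = ratings.pivot_table(
    index="userId", columns="movieId", values="rating", fill_value=0
)

print(user_item.shape)  # (3, 3)
```

Note that a 0 here means "not rated", not "rated zero stars", which matters when interpreting the similarity scores computed later.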

## 🧮 4. Compute User-User Similarity

Using the sparse user-item matrix, we calculate cosine similarity between all pairs of users.

- Cosine similarity measures the angle between two rating vectors.
- A value close to 1 means the users have rated movies in a similar pattern.
- This results in a similarity matrix of shape `(n_users, n_users)`.

We will use this similarity matrix to find the most similar users for a given user.

In [4]:
recommender.compute_user_similarity()

✅ Computed user-user similarity matrix
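
For two users with rating vectors $r_u$ and $r_v$, cosine similarity is $\text{sim}(u, v) = \frac{r_u \cdot r_v}{\lVert r_u \rVert \, \lVert r_v \rVert}$. The sketch below computes it for every pair of rows with plain NumPy. It is an illustration under assumed toy data, not necessarily how `compute_user_similarity()` is implemented.

```python
import numpy as np

# Toy user-item matrix (3 users x 3 movies); 0 means "not rated".
R = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 0.0],
    [0.0, 0.0, 2.0],
])

# Normalise each row to unit length, guarding against all-zero rows.
norms = np.linalg.norm(R, axis=1, keepdims=True)
norms[norms == 0] = 1.0
unit = R / norms

# The dot product of unit vectors is the cosine of the angle between them.
sim = unit @ unit.T

print(sim.shape)  # (3, 3)
```

Users 0 and 2 share no rated movies, so their similarity is exactly 0, while every user has similarity 1 with themselves.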


## 🔍 5. Inspect Similar Users

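We select a target user (in this case, user ID = 1) and retrieve their top N = 5 most similar users. Despite its name, the `get_user_recommendations()` call in this step returns the IDs of those similar users, not movie recommendations.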

- This is done by sorting the similarity scores in descending order.
- We exclude the user themselves from the ranking.
- These similar users will serve as the basis for generating recommendations.

In [5]:
recommender.get_user_recommendations(user_id=1, top_n=5)

🔍 Top 5 similar users to user 1: [5343, 5190, 1481, 1283, 5705]


[5343, 5190, 1481, 1283, 5705]
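
The ranking logic described above can be sketched as follows. The helper `top_similar_users`, the `user_ids` array, and the similarity values are all hypothetical and only serve to illustrate the sort-and-exclude step.

```python
import numpy as np

# Hypothetical user IDs and a small symmetric similarity matrix.
user_ids = np.array([1, 2, 3, 4])
sim = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])


def top_similar_users(target_id, top_n):
    """Return the IDs of the top_n users most similar to target_id."""
    row = int(np.where(user_ids == target_id)[0][0])
    scores = sim[row].copy()
    scores[row] = -np.inf  # exclude the target user from the ranking
    order = np.argsort(scores)[::-1][:top_n]  # descending similarity
    return user_ids[order].tolist()


print(top_similar_users(1, 2))  # [3, 4]
```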

## 🎯 6. Recommend Movies

Once we have computed the user-user similarity matrix and identified similar users, we want to recommend movies that those users have rated highly **but the target user hasn’t seen yet**.

This is the final step in the collaborative filtering process:

- We call `recommend_movies_for_user()`, which:
  - Uses the cosine similarity matrix to find similar users.
  - Aggregates their ratings (weighted by similarity scores).
  - Filters out the movies the target user has already rated.
  - Returns the top-N highest scoring movies.
- We then **load the enriched movie dataset** to retrieve metadata (title, genres, etc.) for the recommended movie IDs.
- Finally, we **display** the recommended movie titles and genres.
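
The aggregation steps listed above can be sketched with NumPy. This example assumes a plain similarity-weighted sum of the neighbours' ratings; the actual `recommend_movies_for_user()` may normalise or weight the scores differently. All IDs and values are made up.

```python
import numpy as np

# Hypothetical movie IDs and a toy user-item matrix; row 0 is the target user.
movie_ids = np.array([10, 20, 30, 40])
R = np.array([
    [5.0, 0.0, 0.0, 0.0],  # target user has only rated movie 10
    [4.0, 5.0, 0.0, 2.0],  # neighbour A
    [5.0, 0.0, 4.0, 3.0],  # neighbour B
])
sim_to_target = np.array([1.0, 0.8, 0.5])  # similarity of each row to row 0

neighbours = [1, 2]
weights = sim_to_target[neighbours]

# Similarity-weighted sum of the neighbours' ratings, one score per movie.
scores = weights @ R[neighbours]

# Filter out movies the target user has already rated.
scores[R[0] > 0] = -np.inf

# Take the two highest-scoring movies.
top = np.argsort(scores)[::-1][:2]
recommended = movie_ids[top].tolist()

print(recommended)  # [20, 40]
```

Movie 10 has the highest raw score but is excluded because the target user already rated it.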


In [6]:
# Load the enriched movie metadata (titles, genres, etc.)
enriched_df = pd.read_csv(recommender.enriched_movies_path)

# Get the top-5 recommended movie IDs for user 1
recommended_ids = recommender.recommend_movies_for_user(user_id=1, top_n=5)

# Keep only the metadata rows for the recommended IDs
recommended_movies = enriched_df[enriched_df["movieId"].isin(recommended_ids)]

# Show result
print("🎬 Recommended Movies:")
print(recommended_movies[["movieId", "title", "genres"]])

🎯 Recommended movies for user 1: [2858, 1196, 1198, 593, 1210]
🎬 Recommended Movies:
      movieId                                              title  \
579       593                   Silence of the Lambs, The (1991)   
1148     1196  Star Wars: Episode V - The Empire Strikes Back...   
1150     1198                     Raiders of the Lost Ark (1981)   
1162     1210  Star Wars: Episode VI - Return of the Jedi (1983)   
2708     2858                             American Beauty (1999)   

                                                 genres  
579                               ['Drama', 'Thriller']  
1148  ['Action', 'Adventure', 'Drama', 'Sci-Fi', 'War']  
1150                            ['Action', 'Adventure']  
1162  ['Action', 'Adventure', 'Romance', 'Sci-Fi', '...  
2708                                ['Comedy', 'Drama']  


📦 `recommend_movies_for_user` is a method of `CollaborativeFilteringRecommender`, defined in `src/collaborative_filtering.py`.

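🔎 We only display recommended movies that also exist in the enriched dataset, ensuring we can show genres and titles. Because `isin` keeps rows in the enriched dataset's own order, the table is sorted by position in that file rather than by recommendation rank.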
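
If the ranking order matters, one option is to merge from the ranked ID list instead of filtering with `isin`. The sketch below uses a small stand-in for the enriched dataframe; the ID 9999 is a hypothetical movie with no metadata, included to show that it is dropped.

```python
import pandas as pd

# Small stand-in for the enriched dataframe (titles taken from the output above).
enriched_df = pd.DataFrame({
    "movieId": [593, 1198, 2858],
    "title": [
        "Silence of the Lambs, The (1991)",
        "Raiders of the Lost Ark (1981)",
        "American Beauty (1999)",
    ],
})

# Ranked IDs, best first; 9999 is a hypothetical ID missing from the metadata.
recommended_ids = [2858, 1198, 593, 9999]

# An inner merge preserves the order of the left keys, so the ranking survives.
ranked = pd.DataFrame({"movieId": recommended_ids}).merge(
    enriched_df, on="movieId", how="inner"
)

print(ranked["movieId"].tolist())  # [2858, 1198, 593]
```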

### ✅ Output

We display the final recommendations for `user_id = 1`, showing which movies the user might enjoy next based on similar users' preferences.