# <u> Memory-based Collaborative Filtering</u>

This notebook explores **memory-based collaborative filtering** as a first baseline for building a movie recommendation system. The approach relies solely on the raw user–item rating matrix, without the need for training or any machine learning framework.

The general idea is to build recommendations based on similarities, which can be computed in two main ways:

- **User–User Collaborative Filtering**: recommends movies liked by users who have similar preferences and rating behavior to us.
- **Item–Item Collaborative Filtering**: recommends new movies that are similar to the ones we have already seen and liked.

<br>

To define similarities between users or items, we commonly use two metrics:

- **Cosine Similarity**: a fast and popular measure that computes the cosine of the angle between two vectors (the rating vectors of two users or two items). The formula is:

<br>

$$
\text{sim}_{\text{cosine}}(x, y) = \frac{x \cdot y}{\|x\| \|y\|}
$$

<br>

- **Pearson Correlation**: more computationally demanding, but it measures the linear relationship between co-rated items, correcting for each user’s individual rating bias. The formula is:

<br>

$$
\text{sim}_{\text{pearson}}(x, y) = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2} \sqrt{\sum_{i}(y_i - \bar{y})^2}}
$$

<br>

Due to its ability to correct for individual rating biases, Pearson correlation is especially useful in user–user collaborative filtering, where users may have very different rating scales. In item–item collaborative filtering, this adjustment is less critical, and both Pearson and cosine similarity often produce similar results, particularly in sparse datasets.
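
The difference between the two metrics can be illustrated with a minimal sketch. The function names and the three toy users below are hypothetical and not part of the dataset; the example also assumes every user has rated all four movies, whereas in practice the sums run over co-rated items only.

```python
import numpy as np

def cosine_sim(x, y):
    """Cosine of the angle between two rating vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def pearson_sim(x, y):
    """Pearson correlation, computed as the cosine similarity of mean-centered vectors."""
    return cosine_sim(x - x.mean(), y - y.mean())

# Hypothetical users: user_b has the same taste as user_a but rates
# every movie exactly 2 stars lower (a pure rating-scale bias).
user_a = np.array([5.0, 4.0, 5.0, 3.0])
user_b = np.array([3.0, 2.0, 3.0, 1.0])
# user_c ranks the movies in the opposite order to user_a.
user_c = np.array([3.0, 4.0, 3.0, 5.0])

print(f"a vs b  cosine: {cosine_sim(user_a, user_b):.3f}  pearson: {pearson_sim(user_a, user_b):.3f}")
print(f"a vs c  cosine: {cosine_sim(user_a, user_c):.3f}  pearson: {pearson_sim(user_a, user_c):.3f}")
```

In this toy case Pearson treats `user_a` and `user_b` as perfectly correlated despite the 2-star offset, and it flags `user_c` as having opposite taste, while cosine similarity on the raw ratings stays high for both pairs because all ratings are positive.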

## <u>0. Setting:</u>

In [14]:
# Import necessary libraries
import os
import sys

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.sparse import csr_matrix
from sklearn.preprocessing import LabelEncoder

# Set the working directory
current_dir = os.getcwd()

project_root = os.path.abspath(os.path.join(current_dir, ".."))
if project_root not in sys.path:
    sys.path.append(project_root)

# Import module for data processing
from modules.data_analysis import *


In [15]:
# Import cleaned dataframe
file_path = '../data/processed/ratings_enriched.parquet'
rating_enriched = pd.read_parquet(file_path, engine="pyarrow")
rating_enriched.head(3)

|  | userId | movieId | rating | timestamp | movie_bayes_avg | log_count_review | release_year | user_avg_rating | user_avg_bayes | title | ... | Film-Noir | Horror | IMAX | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 3.5 | 2005-04-02 23:53:47 | 3.211998 | 10.009828 | 1995 | 3.742857 | 3.630483 | Jumanji (1995) | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 29 | 3.5 | 2005-04-02 23:31:16 | 3.950552 | 9.050289 | 1995 | 3.742857 | 3.630483 | City of Lost Children, The (Cité des enfants p... | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 1 | 32 | 3.5 | 2005-04-02 23:33:39 | 3.89776 | 10.713995 | 1995 | 3.742857 | 3.630483 | Twelve Monkeys (a.k.a. 12 Monkeys) (1995) | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |


In [6]:
print(f"Number of unique userId: {rating_enriched['userId'].nunique()}")
print(f"Number of unique movieId: {rating_enriched['movieId'].nunique()}")

Number of unique userId: 138493
Number of unique movieId: 13130


## <u> 1. User-User Collaborative Filtering </u>

As previously explained, the main idea behind the **user–user collaborative filtering** approach is to recommend items to a target user by finding other users with similar tastes, then suggesting items they liked but the target user hasn't seen yet.

To compute similarities across users, we use **Pearson correlation** on the **user–item rating matrix**, which is structured as follows:
- Rows represent users (`userId`)
- Columns represent movies (`movieId`)
- Values are the ratings users assigned to each movie

This results in a very sparse matrix, as the dataset contains 138,493 users and 13,130 movies, and most users rate only a small fraction of all available movies.
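
As a rough illustration of this layout, the sketch below pivots a handful of made-up ratings into a small user–item matrix and measures how empty it is. The `toy` dataframe and its values are hypothetical; only the column names mirror the dataset. At full scale the same matrix would have roughly 1.8 billion cells, so a dense pivot like this one is only practical for toy data.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as the dataset
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3, 3],
    "movieId": [10, 20, 10, 20, 30, 40],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

# Rows = users, columns = movies, values = ratings (NaN where unrated)
user_item = toy.pivot(index="userId", columns="movieId", values="rating")

# Density = share of user-movie cells that actually hold a rating
n_users, n_movies = user_item.shape
density = len(toy) / (n_users * n_movies)

print(user_item)
print(f"density: {density:.2f}, sparsity: {1 - density:.2f}")
```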

The prediction for how much a user $u$ will like an unseen item $i$ starts from the user’s own average rating and adds a **similarity-weighted average** of how far the most similar users who rated item $i$ deviated from their own averages:

$$
\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v \in N(u)} \text{sim}(u,v) \cdot (r_{v,i} - \bar{r}_v)}{\sum_{v \in N(u)} |\text{sim}(u,v)|}
$$

Where:
- $\hat{r}_{u,i}$ is the predicted rating for user $u$ on item $i$
- $\bar{r}_u$ is the average rating of user $u$
- $N(u)$ is the set of top-$K$ most similar users to $u$ who rated item $i$
- $\text{sim}(u,v)$ is the Pearson correlation between users $u$ and $v$
- $r_{v,i}$ is the rating that user $v$ gave to item $i$
- $\bar{r}_v$ is the average rating of user $v$
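
The formula above can be sketched on a tiny dense matrix. This is a minimal illustration under stated assumptions, not the implementation used later in the notebook: `predict_rating`, `R`, `sim` and `k` are hypothetical names, unrated cells are marked with `NaN`, and the similarity values are made up for the example instead of being computed from the ratings.

```python
import numpy as np

def predict_rating(R, sim, u, i, k=2):
    """Predict user u's rating of item i.

    R   : dense user-item matrix, NaN where a user has not rated an item
    sim : precomputed user-user similarity matrix
    k   : maximum number of neighbours to use
    """
    user_means = np.nanmean(R, axis=1)
    # Candidate neighbours: every other user who rated item i
    candidates = [v for v in range(R.shape[0]) if v != u and not np.isnan(R[v, i])]
    # N(u): the k candidates most similar to u
    neighbours = sorted(candidates, key=lambda v: sim[u, v], reverse=True)[:k]
    denom = sum(abs(sim[u, v]) for v in neighbours)
    if denom == 0:
        # No usable neighbours: fall back to the user's own average
        return float(user_means[u])
    num = sum(sim[u, v] * (R[v, i] - user_means[v]) for v in neighbours)
    return float(user_means[u] + num / denom)

# Toy data: 3 users x 3 items, user 0 has not rated item 1
R = np.array([
    [4.0, np.nan, 5.0],
    [5.0, 4.0,    3.0],
    [2.0, 1.0,    3.0],
])
# Made-up similarities (assumption, not derived from R)
sim = np.array([
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.1],
    [0.2, 0.1, 1.0],
])

pred = predict_rating(R, sim, u=0, i=1)
print(f"predicted rating: {pred:.2f}")
```

Here user 1 rated item 1 exactly at their own average and user 2 rated it one star below theirs, so the prediction lands slightly under user 0's average of 4.5.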



### <u> 1.1 Build user-item rating matrix:</u> 

Given the large size of the user–item matrix, we use `csr_matrix` to store it efficiently in a sparse format.
This lets us store only the ratings that actually exist, together with their positions, instead of allocating memory for the full (mostly empty) matrix. Because a sparse matrix addresses entries by integer row and column positions, we use `LabelEncoder` to map each `userId` and `movieId` to a contiguous 0-based index; the fitted encoders can then translate matrix indices back to the original IDs.

In [13]:
# Keep only the columns needed for collaborative filtering
df_u_u = rating_enriched[['userId','movieId','rating', 'title']].copy()

# Encode userId and movieId to ensure compatible indexing with csr_matrix
user_encoder = LabelEncoder()
movie_encoder = LabelEncoder()
df_u_u['user_idx'] = user_encoder.fit_transform(df_u_u['userId'])
df_u_u['item_idx'] = movie_encoder.fit_transform(df_u_u['movieId'])

# Build sparse matrix with efficient allocation of memory
user_item_sparse = csr_matrix(
    (df_u_u['rating'], (df_u_u['user_idx'], df_u_u['item_idx']))
)
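
To show how the encoders translate between matrix positions and real IDs, here is a small self-contained sketch on made-up data. The `toy` dataframe, the variable names and the explicit `shape` argument are assumptions for illustration; the pattern mirrors the cell above but is not a replacement for it.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.preprocessing import LabelEncoder

# Hypothetical ratings with non-contiguous IDs
toy = pd.DataFrame({
    "userId":  [7, 7, 42, 99],
    "movieId": [110, 500, 110, 260],
    "rating":  [4.0, 3.0, 5.0, 2.5],
})

# Map raw IDs to contiguous 0-based indices (sorted by ID)
user_enc, movie_enc = LabelEncoder(), LabelEncoder()
rows = user_enc.fit_transform(toy["userId"])
cols = movie_enc.fit_transform(toy["movieId"])

# Passing the shape explicitly guards against a silently smaller matrix
toy_sparse = csr_matrix(
    (toy["rating"].to_numpy(), (rows, cols)),
    shape=(len(user_enc.classes_), len(movie_enc.classes_)),
)

# Translate matrix position (row 1, column 0) back to the original IDs
orig_user = user_enc.inverse_transform([1])[0]
orig_movie = movie_enc.inverse_transform([0])[0]
print(f"userId {orig_user} rated movieId {orig_movie}: {toy_sparse[1, 0]}")
```

Only the four stored ratings occupy memory here (`toy_sparse.nnz`), even though the matrix has nine cells.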