# Collaborative Filtering Recommendation System

In this project we trained a recommendation system on the MovieLens 20 Million dataset using user-user and item-item collaborative filtering. Both approaches predict a rating as a weighted average over the top k most similar users or items.
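
As a minimal sketch of that weighted average (toy numbers, not taken from the dataset), the prediction for one user and one movie adds the user's mean rating to a similarity-weighted average of the neighbors' mean-centered ratings:

```python
import numpy as np

# Toy values, assumed for illustration only.
user_mean = 3.5                               # mean rating of the target user
neighbor_sims = np.array([0.6, 0.5, 0.4])     # similarity to each of the k neighbors
neighbor_dev = np.array([0.5, -1.0, 1.0])     # neighbors' ratings minus their own means

# Similarity-weighted average of the deviations, added back to the user's mean.
weighted_dev = np.dot(neighbor_sims, neighbor_dev) / np.abs(neighbor_sims).sum()
prediction = user_mean + weighted_dev
print(round(prediction, 3))
```

The item-item variant has the same shape, with the item's mean and the item's neighbors taking the place of the user's.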

While user-user collaborative filtering can give more personalized recommendations, item-item collaborative filtering is preferred because it handles cold-start issues better and generally gives a better RMSE, since users can be sporadic in their reviews.

We obtained an overall RMSE of 0.886 for user-user and 0.853 for item-item. Test RMSE deviated from these figures for both approaches; item-item collaborative filtering reached a test RMSE of 0.956, which was better than user-user collaborative filtering. We will improve test RMSE in a later notebook.

In [1]:
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split

In [3]:
df_ratings = pd.read_csv('rating.csv')

In [9]:
df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,2005-04-02 23:53:47
1,1,29,3.5,2005-04-02 23:31:16
2,1,32,3.5,2005-04-02 23:33:39
3,1,47,3.5,2005-04-02 23:32:07
4,1,50,3.5,2005-04-02 23:29:40


In [10]:
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000263 entries, 0 to 20000262
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  object 
dtypes: float64(1), int64(2), object(1)
memory usage: 610.4+ MB


In [11]:
df_ratings.describe()

Unnamed: 0,userId,movieId,rating
count,20000260.0,20000260.0,20000260.0
mean,69045.87,9041.567,3.525529
std,40038.63,19789.48,1.051989
min,1.0,1.0,0.5
25%,34395.0,902.0,3.0
50%,69141.0,2167.0,3.5
75%,103637.0,4770.0,4.0
max,138493.0,131262.0,5.0


In [12]:
df_ratings.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

In [13]:
df_ratings['rating'].sort_values()

3056357     0.5
3056358     0.5
18375889    0.5
4354958     0.5
3056291     0.5
           ... 
20000230    5.0
20000217    5.0
20000209    5.0
20000202    5.0
131         5.0
Name: rating, Length: 20000263, dtype: float64

In [14]:
df_ratings.loc[df_ratings['rating'] == 0,]

Unnamed: 0,userId,movieId,rating,timestamp


In [15]:
len(set(df_ratings.userId))

138493

In [16]:
len(set(df_ratings.movieId))

26744

In [17]:
df_ratings[['movieId','rating']].groupby('movieId').size().value_counts().head(5)

1    3972
2    2043
3    1355
4    1029
5     826
Name: count, dtype: int64

In [18]:
df_ratings[['movieId','rating']].groupby('movieId').mean().sort_values(by = 'rating', ascending = False).head(10)

movieId,rating
54326,5.0
26718,5.0
105846,5.0
81117,5.0
130996,5.0
105841,5.0
129478,5.0
129530,5.0
129526,5.0
103871,5.0


In [4]:
df_movie = pd.read_csv('movie.csv')
df_movie.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
df_ratings = df_ratings.merge(df_movie, on = 'movieId', how = 'left')

In [21]:
df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,2,3.5,2005-04-02 23:53:47,Jumanji (1995),Adventure|Children|Fantasy
1,1,29,3.5,2005-04-02 23:31:16,"City of Lost Children, The (Cité des enfants p...",Adventure|Drama|Fantasy|Mystery|Sci-Fi
2,1,32,3.5,2005-04-02 23:33:39,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller
3,1,47,3.5,2005-04-02 23:32:07,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,3.5,2005-04-02 23:29:40,"Usual Suspects, The (1995)",Crime|Mystery|Thriller


In [22]:
df_ratings.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
title        0
genres       0
dtype: int64

In [23]:
np.unique(df_ratings.userId)

array([     1,      2,      3, ..., 138491, 138492, 138493])

In [24]:
ind_movie_dict = dict(zip(np.unique(df_ratings.movieId), range(len(np.unique(df_ratings.movieId)))))
ind_users_dict = dict(zip(np.unique(df_ratings.userId), range(len(np.unique(df_ratings.userId)))))
movie_new_ind = df_ratings['movieId'].map(ind_movie_dict)
users_new_ind = df_ratings['userId'].map(ind_users_dict)
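
The dictionary mapping above turns raw IDs, which have gaps, into contiguous 0-based indices. As a hedged aside, `np.unique` with `return_inverse=True` produces the same indices in one call; the toy `raw_ids` array below is an assumption standing in for `userId` or `movieId`:

```python
import numpy as np

# Toy raw IDs with gaps, standing in for userId or movieId.
raw_ids = np.array([10, 50, 10, 99, 50])

# unique_ids is sorted; inverse holds each entry's position in unique_ids,
# i.e. a contiguous 0-based index.
unique_ids, inverse = np.unique(raw_ids, return_inverse=True)

# The dictionary-based mapping used in the notebook, for comparison.
id_to_index = dict(zip(unique_ids, range(len(unique_ids))))
mapped = np.array([id_to_index[i] for i in raw_ids])

print(np.array_equal(mapped, inverse))
```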

In [227]:
(users_new_ind, movie_new_ind)

(0                0
 1                0
 2                0
 3                0
 4                0
              ...  
 20000258    138492
 20000259    138492
 20000260    138492
 20000261    138492
 20000262    138492
 Name: userId, Length: 20000263, dtype: int64,
 0               1
 1              28
 2              31
 3              46
 4              49
             ...  
 20000258    13754
 20000259    13862
 20000260    13875
 20000261    13993
 20000262    14277
 Name: movieId, Length: 20000263, dtype: int64)

In [25]:
sparse_20m = csr_matrix((df_ratings['rating'], (users_new_ind, movie_new_ind)), shape = (len(np.unique(df_ratings.userId)), len(np.unique(df_ratings.movieId))))

In [26]:
sparse_20m.shape

(138493, 26744)
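
The centering code in the next cell reads rows directly from the CSR arrays, so a small sketch of that layout may help. The triplets below are toy values assumed for illustration; the point is that row `i`'s stored ratings sit in `data[indptr[i]:indptr[i + 1]]`:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy triplets (rating, user index, movie index); assumed values.
ratings = np.array([4.0, 3.0, 5.0, 2.0])
user_idx = np.array([0, 0, 1, 2])
movie_idx = np.array([0, 2, 1, 2])

toy = csr_matrix((ratings, (user_idx, movie_idx)), shape=(3, 3))

# Row i's stored values live in toy.data[toy.indptr[i]:toy.indptr[i + 1]],
# with the matching column indices in toy.indices over the same slice.
for i in range(toy.shape[0]):
    start, end = toy.indptr[i], toy.indptr[i + 1]
    print(i, toy.indices[start:end], toy.data[start:end])
```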

In [28]:
# Mean-center each user's ratings (rows of the CSR matrix)

def row_mean_center_csr(X):
    X = X.copy()  # Don't overwrite original
    # One mean per row; rows without ratings keep 0 so mean_list[i] always belongs to row i
    mean_list = np.zeros(X.shape[0])
    for i in range(X.shape[0]):
        start, end = X.indptr[i], X.indptr[i+1]
        if end > start:
            row_data = X.data[start:end]
            row_mean = row_data.mean()
            mean_list[i] = row_mean
            X.data[start:end] -= row_mean
    return X, mean_list

# Usage:
sparse_20m_centered, mean_list = row_mean_center_csr(sparse_20m)
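
The Python loop above runs once per user. A vectorized variant is sketched below under the assumption that the input is a CSR matrix of floats; `row_mean_center_vectorized` is a hypothetical helper name, not part of the notebook:

```python
import numpy as np
from scipy.sparse import csr_matrix

def row_mean_center_vectorized(X):
    """Subtract each row's mean of stored values (hypothetical helper)."""
    X = csr_matrix(X, dtype=np.float64, copy=True)
    counts = np.diff(X.indptr)                    # stored entries per row
    sums = np.asarray(X.sum(axis=1)).ravel()      # row sums over stored entries
    # Rows with no stored entries get a mean of 0 instead of dividing by zero.
    means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    X.data -= np.repeat(means, counts)            # expand per-row means to per-entry
    return X, means

toy = csr_matrix(np.array([[4.0, 0.0, 2.0],
                           [0.0, 0.0, 0.0],
                           [5.0, 5.0, 0.0]]))
centered, means = row_mean_center_vectorized(toy)
print(means, centered.data)
```

In the toy matrix the last user gave every movie the same rating, so the centered values are stored zeros. Checks such as `rating != 0` or `nonzero()` treat those entries as unrated, which is worth keeping in mind when reading the prediction and evaluation cells.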

In [33]:
K = 25  # Number of neighbors

# Use cosine metric (equivalent to Pearson for centered data)
nn = NearestNeighbors(n_neighbors=K+1,  # +1 because the closest is the user itself
                      metric='cosine', 
                      algorithm='brute',  # brute is best for high-dimensional sparse data
                      n_jobs=-1)          # use all CPUs

nn.fit(sparse_20m_centered)
distances, indices = nn.kneighbors(sparse_20m_centered, return_distance=True)
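
The equivalence mentioned in the comment holds exactly when both vectors are fully observed. In the sparse matrix, unrated movies contribute zeros and each row is centered on its own rated entries, so the cosine value approximates the Pearson correlation over co-rated movies rather than matching it. A small check on two dense toy vectors (assumed values):

```python
import numpy as np

# Two fully observed toy rating vectors (assumed values).
a = np.array([5.0, 3.0, 4.0, 1.0])
b = np.array([4.0, 2.0, 5.0, 2.0])

# Cosine similarity of the mean-centered vectors.
a_c, b_c = a - a.mean(), b - b.mean()
cosine_centered = a_c @ b_c / (np.linalg.norm(a_c) * np.linalg.norm(b_c))

# Pearson correlation of the raw vectors.
pearson = np.corrcoef(a, b)[0, 1]

print(np.isclose(cosine_centered, pearson))
```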

In [34]:
top_k_indices = indices[:, 1:]      # shape: (n_users, K)
top_k_similarities = 1 - distances[:, 1:]

In [35]:
top_k_indices.shape

(138493, 25)

In [36]:
def predict_rating(u, j, user_means, user_item_matrix, top_k_indices, top_k_sims):
    # u: user index
    # j: movie index
    # user_means: array of user mean ratings
    # user_item_matrix: csr_matrix (users x movies)
    # top_k_indices: (n_users, K) array of neighbor indices
    # top_k_sims: (n_users, K) array of similarities

    neighbors = top_k_indices[u]
    sims = top_k_sims[u]
    numerator = 0.0
    denominator = 0.0
    for v, sim in zip(neighbors, sims):
        rating = user_item_matrix[v, j]
        if rating != 0:  # neighbor has rated movie j
            numerator += sim * (rating)
            denominator += abs(sim)
    if denominator == 0:
        return user_means[u]

    return user_means[u] + (numerator/denominator)
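
In formula form, and assuming the function is called with the mean-centered matrix as in the cells that follow, the prediction can be written as below. The symbols are notation introduced here for illustration: $N_K(u)$ is the set of the $K$ nearest neighbors of user $u$, $s_{uv}$ is the cosine similarity between users $u$ and $v$, and $\mu_u$ is the mean rating of user $u$.

$$
\hat{r}_{uj} = \mu_u + \frac{\sum_{v \in N_K(u),\ r_{vj}\ \text{observed}} s_{uv}\,(r_{vj} - \mu_v)}{\sum_{v \in N_K(u),\ r_{vj}\ \text{observed}} |s_{uv}|}
$$

When no neighbor has rated movie $j$, the function falls back to $\hat{r}_{uj} = \mu_u$.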


In [37]:
top_k_similarities

array([[0.59387156, 0.50665132, 0.50483195, ..., 0.37414174, 0.37338117,
        0.37028256],
       [0.78235919, 0.76590333, 0.71747735, ..., 0.6500008 , 0.64956131,
        0.64784104],
       [0.29632148, 0.28249597, 0.27613281, ..., 0.21615092, 0.21460945,
        0.21299418],
       ...,
       [0.27884603, 0.27568338, 0.27138094, ..., 0.21635697, 0.21498448,
        0.21491789],
       [0.17519377, 0.17078533, 0.1640129 , ..., 0.13052048, 0.13024127,
        0.12942247],
       [0.25176914, 0.24103995, 0.23593337, ..., 0.2126441 , 0.21156199,
        0.21155698]])

In [38]:
predict_rating(0, 0, mean_list, sparse_20m_centered, top_k_indices,top_k_similarities)

np.float64(3.3550264847681586)

In [43]:
sparse_20m_centered.indices

array([    1,    28,    31, ..., 13875, 13993, 14277], dtype=int32)

In [44]:
rows, cols = sparse_20m_centered.nonzero()

In [58]:
predicted = []
actual = []
for u, j in zip(rows, cols):
    pred = predict_rating(u, j, mean_list, sparse_20m_centered, top_k_indices, top_k_similarities)
    true_rating = sparse_20m[u, j]  # original (uncentered) rating for this user/movie pair
    predicted.append(pred)
    actual.append(true_rating)

mse = np.mean((np.array(predicted) - np.array(actual)) ** 2)
print("MSE:", mse)

MSE: 0.7851064821227035


In [59]:
print(f'RMSE is {np.sqrt(mse)}')

RMSE is 0.8860623466340862


For our user-user collaborative filtering model, the Root Mean Square Error over all observed ratings is 0.886. This figure is computed on the same ratings that were used to find the neighbors, so it is not a held-out estimate.

Now we will apply item-item collaborative filtering, which follows a similar approach but focuses on movies (items).

In [7]:
user_ids = np.unique(df_ratings['userId'])
movie_ids = np.unique(df_ratings['movieId'])
user2idx = {uid: idx for idx, uid in enumerate(user_ids)}
movie2idx = {mid: idx for idx, mid in enumerate(movie_ids)}

df_ratings['user_idx'] = df_ratings['userId'].map(user2idx)
df_ratings['movie_idx'] = df_ratings['movieId'].map(movie2idx)


In [8]:
train_df, test_df = train_test_split(df_ratings, test_size=0.2, random_state=42)

n_users = len(user_ids)
n_movies = len(movie_ids)

train_matrix = csr_matrix(
    (train_df['rating'], (train_df['user_idx'], train_df['movie_idx'])),
    shape=(n_users, n_movies)
)
test_matrix = csr_matrix(
    (test_df['rating'], (test_df['user_idx'], test_df['movie_idx'])),
    shape=(n_users, n_movies)
)
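
A random split can leave some movies with no ratings in the training matrix (the counts earlier show 3972 movies with a single rating). For such movies the centering function below stores a mean of 0 and the neighbor search has nothing to work with. A minimal sketch for counting them, using toy matrices with assumed values in place of `train_matrix` and `test_matrix`:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy train/test matrices (users x movies); values assumed for illustration.
toy_train = csr_matrix(np.array([[4.0, 0.0, 0.0],
                                 [3.0, 5.0, 0.0]]))
toy_test = csr_matrix(np.array([[0.0, 2.0, 1.0],
                                [0.0, 0.0, 4.0]]))

train_counts = toy_train.getnnz(axis=0)   # stored ratings per movie in train
test_counts = toy_test.getnnz(axis=0)     # stored ratings per movie in test

# Movies that appear in test but have no training ratings at all.
cold_items = np.flatnonzero((train_counts == 0) & (test_counts > 0))
print(cold_items)
```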

In [64]:
test_df

Unnamed: 0,userId,movieId,rating,timestamp,title,genres,user_idx,movie_idx
17679788,122270,8360,3.5,2012-04-22 01:07:04,Shrek 2 (2004),Adventure|Animation|Children|Comedy|Musical|Ro...,122269,7761
7106385,49018,32,2.0,2001-09-11 07:50:36,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller,49017,31
12970708,89527,109374,3.5,2015-01-06 09:26:40,"Grand Budapest Hotel, The (2014)",Comedy|Drama,89526,22764
15426752,106704,1060,3.0,2000-01-22 21:27:57,Swingers (1996),Comedy|Drama,106703,1040
6934678,47791,1732,2.0,2006-01-19 15:48:23,"Big Lebowski, The (1998)",Comedy|Crime,47790,1672
...,...,...,...,...,...,...,...,...
13643587,94260,36,4.0,2001-12-16 06:33:22,Dead Man Walking (1995),Crime|Drama,94259,35
13464658,93021,289,3.0,1997-01-23 10:27:43,Only You (1994),Comedy|Romance,93020,286
2091376,14151,41,5.0,1996-06-04 13:14:29,Richard III (1995),Drama|War,14150,40
11800879,81453,2671,4.5,2010-11-28 02:38:19,Notting Hill (1999),Comedy|Romance,81452,2585


In [89]:
from scipy.sparse import csc_matrix

def col_mean_center_csc(X):
    X = X.copy().tocsc()  # Convert to CSC for efficient column ops
    n_rows, n_cols = X.shape
    col_means = np.zeros(n_cols)
    for j in range(n_cols):
        start, end = X.indptr[j], X.indptr[j+1]
        col_data = X.data[start:end]
        if len(col_data) > 0:
            col_mean = col_data.mean()
            col_means[j] = col_mean
            X.data[start:end] -= col_mean
        else:
            col_means[j] = 0
    return X, col_means

In [90]:
train_centered, train_mean_ratings = col_mean_center_csc(train_matrix)

In [93]:
test_centered, test_mean_ratings = col_mean_center_csc(test_matrix)

In [14]:
from sklearn.metrics.pairwise import cosine_similarity

# Each column is an item vector (users as features)
item_similarity = cosine_similarity(train_centered.T, dense_output=False)

In [21]:
K = 20  # Number of neighbors

# Use cosine metric (equivalent to Pearson for centered data)
nn = NearestNeighbors(n_neighbors=K+1,  # +1 because the closest is the item itself
                      metric='cosine', 
                      algorithm='brute',  # brute is best for high-dimensional sparse data
                      n_jobs=-1)          # use all CPUs

nn.fit(train_centered.T)
distances, indices = nn.kneighbors(train_centered.T, return_distance=True)

In [22]:
top_k_indices = indices[:, 1:]      # shape: (n_movies, K)
top_k_similarities = 1 - distances[:, 1:]

In [24]:
top_k_similarities.shape

(26744, 20)

In [26]:
def predict_rating(u, j, item_means, user_item_matrix, top_k_indices, top_k_sims):
    # u: user index
    # j: movie index
    # item_means: array of item (movie) mean ratings
    # user_item_matrix: sparse matrix (users x movies), mean-centered per movie
    # top_k_indices: (n_movies, K) array of neighbor item indices
    # top_k_sims: (n_movies, K) array of similarities

    neighbors = top_k_indices[j]
    sims = top_k_sims[j]
    numerator = 0.0
    denominator = 0.0
    for v, sim in zip(neighbors, sims):
        rating = user_item_matrix[u, v]
        if rating != 0:  
            numerator += sim * (rating)
            denominator += abs(sim)
    if denominator == 0:
        return item_means[j]

    return item_means[j] + (numerator/denominator)
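
The loop above performs one scalar lookup in the sparse matrix per neighbor. As a sketch of an alternative, a single sparse slice can fetch all of a user's ratings for the neighbor items at once. `predict_item_based` is a hypothetical helper, and the toy inputs are assumed values rather than notebook data:

```python
import numpy as np
from scipy.sparse import csr_matrix

def predict_item_based(u, j, item_means, centered, top_k_indices, top_k_sims):
    """Vectorized variant of the item-item loop (hypothetical helper)."""
    neighbors = top_k_indices[j]
    sims = top_k_sims[j]
    # One sparse slice fetches user u's centered ratings for all neighbor items.
    devs = centered[u, neighbors].toarray().ravel()
    rated = devs != 0                  # same "non-zero means rated" rule as the loop
    denom = np.abs(sims[rated]).sum()
    if denom == 0:
        return item_means[j]
    return item_means[j] + (sims[rated] @ devs[rated]) / denom

# Toy inputs (assumed): 2 users x 3 movies, already centered per movie.
centered = csr_matrix(np.array([[0.5, -1.0, 0.0],
                                [-0.5, 1.0, 2.0]]))
item_means = np.array([3.5, 3.0, 2.0])
top_k_indices = np.array([[1, 2], [0, 2], [0, 1]])
top_k_sims = np.array([[0.8, 0.2], [0.8, 0.1], [0.2, 0.1]])

pred = predict_item_based(0, 0, item_means, centered, top_k_indices, top_k_sims)
print(pred)
```

Here user 0 has rated only one of movie 0's two neighbors, so the prediction is the item mean plus that single centered rating.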


In [81]:
train_matrix[0].nonzero()

(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32),
 array([   1,   46,   49,  110,  149,  220,  250,  257,  290,  293,  315,
         333,  363,  537,  583,  645,  902,  907,  990, 1057, 1058, 1067,
        1068, 1075, 1113, 1171, 1173, 1175, 1176, 1182, 1188, 1189, 1191,
        1193, 1196, 1215, 1218, 1221, 1230, 1231, 1233, 1234, 1238, 1250,
        1263, 1275, 1292, 1304, 1318, 1320, 1328, 1339, 1343, 1356, 1478,
        1532, 1686, 1765, 1836, 1913, 1937, 2034, 2054, 2056, 2059, 2089,
        2090, 2109, 2110, 2168, 2203, 2206, 245

In [94]:
predict_rating(1, 0, train_mean_ratings, train_centered, top_k_indices, top_k_similarities)

np.float64(4.834565199943208)

In [108]:
# For all observed ratings in train and test
def get_observed_indices(sparse_matrix):
    rows, cols = sparse_matrix.nonzero()
    return list(zip(rows, cols))

# Train
train_indices = get_observed_indices(train_matrix)
train_actual = [train_matrix[u, i] for u, i in train_indices]
train_pred = [predict_rating(u, i, train_mean_ratings, train_centered, top_k_indices, top_k_similarities) for u, i in train_indices]

# Test (neighbor lists come from train; item means and centered ratings come from the test matrix)
test_indices = get_observed_indices(test_matrix)
test_actual = [test_matrix[u, i] for u, i in test_indices]
test_pred = [predict_rating(u, i, test_mean_ratings, test_centered, top_k_indices, top_k_similarities) for u, i in test_indices]

from sklearn.metrics import mean_squared_error
print("Train MSE:", mean_squared_error(train_actual, train_pred))
print("Test MSE:", mean_squared_error(test_actual, test_pred))

Train MSE: 0.7269377915019403
Test MSE: 0.9137911871488545


In [111]:
print(f'RMSE Train: {np.sqrt(mean_squared_error(train_actual, train_pred))}')
print(f'RMSE Test: {np.sqrt(mean_squared_error(test_actual, test_pred))}')

RMSE Train: 0.8526064693057052
RMSE Test: 0.9559242580606764


## Conclusion

We used tried-and-tested methods to build the recommendation system using only ratings, user IDs, and movie IDs. It is not a statistical model, a machine learning model, or any advanced AI model, but it works quite well and is logically sound. We can still improve on it: in the next notebook we will go through Matrix Factorization, AutoRec, and MF-Bayesian Popularity Ranking for a  