<a href="https://www.kaggle.com/code/gpreda/collaborative-filtering-model-using-surprise?scriptVersionId=128758448" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction

This notebook uses the Surprise library to build a simple collaborative filtering model on the MovieLens 100K dataset.

# Analysis preparation

In [1]:
import numpy as np
import pandas as pd
import re

from surprise import accuracy
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate
from surprise import NormalPredictor, SVD


## Read users data

In [2]:
user_columns = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users_df = pd.read_csv('../input/movielens-100k-dataset/ml-100k/u.user', sep='|', names=user_columns) 
users_df.head(2)

Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043


## Read ratings data

In [3]:
ratings_columns = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings_df = pd.read_csv('../input/movielens-100k-dataset/ml-100k/u.data', sep='\t', names=ratings_columns)
ratings_df.drop( "unix_timestamp", inplace = True, axis = 1 ) 
ratings_df.head(2)

Unnamed: 0,user_id,movie_id,rating
0,196,242,3
1,186,302,3


## Read and clean movie data

In [4]:
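def clean_title(title):
    # Remove any text in parentheses or square brackets, e.g. the "(1995)" year.
    return re.sub(r"[\(\[].*?[\)\]]", "", title)

def process_genre(series):
    # Candidate genre columns, selected by position. For rows of movies_df as
    # loaded below, this slice spans 'Action' through 'Thriller'
    # ('unknown', 'War' and 'Western' fall outside it).
    genres = series.index[6:-2]

    text = []
    for i in genres:
        if series[i] == 1:
            text.append(i)
            # Stop at the first flagged genre: one genre is kept per movie.
            break
    return ", ".join(text)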

### Read genres

We read the list of movie genre names from the **u.genre** file.

In [5]:
genre_df = pd.read_csv('../input/movielens-100k-dataset/ml-100k/u.genre', sep='|', encoding='latin-1')
genre_columns = ["unknown"] + list(genre_df[genre_df.columns[0]].values)
print(genre_columns)

['unknown', 'Action', 'Adventure', 'Animation', "Children's", 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']


In [6]:
movie_columns = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
movies_df = pd.read_csv('../input/movielens-100k-dataset/ml-100k/u.item', sep='|', names=movie_columns+genre_columns,
                     encoding='latin-1')

In [7]:
movies_df.head(5)

Unnamed: 0,movie_id,title,release_date,video_release_date,imdb_url,unknown,Action,Adventure,Animation,Children's,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


Let's remove the release year from the movie title. We also add a `genre` column that holds the first genre flagged for each movie, and then create a reduced copy of the movie dataset without the separate genre columns.
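
As a small illustration of the title cleaning step, the sketch below applies the same regular expression to two sample strings. Note that it removes every parenthesized or bracketed group, not only the year. `strip_year` is a hypothetical alternative (not part of this notebook) that removes only a trailing four-digit year; here the whitespace trimming is folded into `clean_title` for brevity.

```python
import re

def clean_title(title):
    # Same pattern as in the notebook: drop every "(...)" or "[...]" group,
    # then trim the leftover whitespace.
    return re.sub(r"[\(\[].*?[\)\]]", "", title).strip()

def strip_year(title):
    # Hypothetical alternative: remove only a trailing "(YYYY)" group.
    return re.sub(r"\s*\(\d{4}\)\s*$", "", title)

# Illustrative strings used only to show the difference between the two helpers.
samples = [
    "Toy Story (1995)",
    "Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)",
]

for s in samples:
    print(repr(clean_title(s)), "|", repr(strip_year(s)))
```

Which variant is preferable depends on whether alternate titles in parentheses should be kept.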

In [8]:
movies_df['title'] = movies_df['title'].apply(clean_title)
movies_df['title'] = movies_df['title'].str.strip()

movies_df['genre'] = movies_df.apply(process_genre,axis=1)

In [9]:
movies_df.head(2)

Unnamed: 0,movie_id,title,release_date,video_release_date,imdb_url,unknown,Action,Adventure,Animation,Children's,...,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,genre
0,1,Toy Story,01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,Animation
1,2,GoldenEye,01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,1,0,0,Action


In [10]:
movies_red_df = movies_df.copy()
movies_red_df.drop(movies_red_df.columns[[2,3,4]], inplace = True, axis = 1)
movies_red_df.drop(genre_columns,axis=1,inplace=True)

movies_red_df.head(5)

Unnamed: 0,movie_id,title,genre
0,1,Toy Story,Animation
1,2,GoldenEye,Action
2,3,Four Rooms,Thriller
3,4,Get Shorty,Action
4,5,Copycat,Crime
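
The `genre` column shown above holds a single label per movie, because `process_genre` stops at the first flagged genre column. If every flagged genre is wanted instead, a variant could look like the sketch below. The helper name `all_genres` and the one-hot row are hypothetical; a plain dict stands in for a DataFrame row, since both support `row[name]` lookups.

```python
def all_genres(row, genre_names):
    # Collect every genre whose flag equals 1, in column order.
    return ", ".join(name for name in genre_names if row[name] == 1)

genre_names = ["Action", "Comedy", "Crime", "Drama", "Thriller"]

# Hypothetical one-hot row, not taken from the dataset.
row = {"Action": 0, "Comedy": 0, "Crime": 1, "Drama": 1, "Thriller": 1}

print(all_genres(row, genre_names))
```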


# Prepare data in expected format for Surprise

In [11]:
ratings_df.head()

Unnamed: 0,user_id,movie_id,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1


In [12]:
print(f"Users: {ratings_df.user_id.nunique()}")

Users: 943


In [13]:
reader = Reader(rating_scale=(1, 5))

In [14]:
data = Dataset.load_from_df(ratings_df, reader=reader)

## Baseline using NormalPredictor

We start by creating a baseline with `NormalPredictor`, which predicts random ratings drawn from a normal distribution fitted to the training ratings. This gives us a reference point for measuring how much the SVD model improves on our data.
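
To make the idea of a random baseline concrete, here is a minimal sketch of such a predictor in plain Python. It assumes the baseline estimates a mean and standard deviation from the training ratings, samples from the corresponding normal distribution, and clips to the rating scale. The function names and toy ratings are illustrative; Surprise's own implementation may differ in details.

```python
import random
import statistics

def fit_normal(train_ratings):
    # Estimate the mean and (population) standard deviation of the ratings.
    mu = statistics.fmean(train_ratings)
    sigma = statistics.pstdev(train_ratings)
    return mu, sigma

def predict_random(mu, sigma, rng, low=1.0, high=5.0):
    # Draw from N(mu, sigma^2) and clip the draw to the rating scale.
    return min(high, max(low, rng.gauss(mu, sigma)))

train = [3, 3, 1, 2, 1, 5, 4, 4]  # toy ratings, not the real data
mu, sigma = fit_normal(train)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
preds = [predict_random(mu, sigma, rng) for _ in range(5)]
print(mu, sigma, preds)
```

Because the predictions ignore both the user and the item, any model that learns from the data should beat this baseline.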

In [15]:
algo_np = NormalPredictor()

We run 5-fold cross-validation and measure three metrics:
* RMSE - Root Mean Squared Error - smaller is better
* MAE - Mean Absolute Error - smaller is better
* FCP - Fraction of Concordant Pairs - higher is better (see: https://www.ijcai.org/Proceedings/13/Papers/449.pdf for reference)
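
The three metrics can be sketched in plain Python as follows. Each prediction is a `(user, true_rating, estimate)` tuple. The FCP function is a simplified, pooled variant that counts concordant and discordant pairs within each user and ignores pairs with tied true ratings; Surprise aggregates per user and may differ in details. All names and toy values are illustrative.

```python
import math
from collections import defaultdict

def rmse(preds):
    # Root of the mean squared difference between true and estimated ratings.
    return math.sqrt(sum((t - e) ** 2 for _, t, e in preds) / len(preds))

def mae(preds):
    # Mean absolute difference between true and estimated ratings.
    return sum(abs(t - e) for _, t, e in preds) / len(preds)

def fcp(preds):
    # Group predictions by user, then compare every pair of items per user.
    by_user = defaultdict(list)
    for user, true_r, est in preds:
        by_user[user].append((true_r, est))
    concordant = discordant = 0
    for items in by_user.values():
        for a in range(len(items)):
            for b in range(a + 1, len(items)):
                (ta, ea), (tb, eb) = items[a], items[b]
                if ta == tb:
                    continue  # tied true ratings carry no ordering
                if (ta - tb) * (ea - eb) > 0:
                    concordant += 1  # estimates rank the pair correctly
                else:
                    discordant += 1
    total = concordant + discordant
    return concordant / total if total else 0.0

# Toy predictions: (user, true rating, estimated rating).
preds = [
    ("u1", 4, 3.5), ("u1", 2, 3.0), ("u1", 5, 4.0),
    ("u2", 3, 4.0), ("u2", 1, 4.5),
]
print(rmse(preds), mae(preds), fcp(preds))
```

RMSE and MAE measure how close the estimates are to the true ratings, while FCP only measures whether items are ranked in the right order for each user.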

In [16]:
cross_validate(algo_np, data, measures = ["RMSE", "MAE", "FCP"], cv=5, verbose=True)

Evaluating RMSE, MAE, FCP of algorithm NormalPredictor on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.5124  1.5193  1.5173  1.5241  1.5238  1.5194  0.0044  
MAE (testset)     1.2135  1.2225  1.2224  1.2271  1.2248  1.2221  0.0046  
FCP (testset)     0.4969  0.4925  0.4929  0.4989  0.4858  0.4934  0.0045  
Fit time          0.14    0.16    0.16    0.17    0.19    0.16    0.02    
Test time         0.28    0.15    0.15    0.27    0.15    0.20    0.06    


{'test_rmse': array([1.51236874, 1.51925514, 1.51730089, 1.52413816, 1.52382212]),
 'test_mae': array([1.21346976, 1.22251733, 1.22237508, 1.22714034, 1.22475444]),
 'test_fcp': array([0.49693499, 0.4925232 , 0.49290548, 0.49885622, 0.4858258 ]),
 'fit_time': (0.137587308883667,
  0.1588761806488037,
  0.16019272804260254,
  0.1650855541229248,
  0.18841075897216797),
 'test_time': (0.27613210678100586,
  0.1474018096923828,
  0.1464388370513916,
  0.27301025390625,
  0.14977288246154785)}

## Model with SVD

We now evaluate the SVD model with the same 5-fold cross-validation.  
In Surprise, `SVD` is a matrix factorization model; here we set the number of latent factors to 40 instead of the default of 100.
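
For intuition, the sketch below trains a tiny biased matrix factorization model with stochastic gradient descent, predicting a rating as the global mean plus a user bias, an item bias, and the dot product of user and item factor vectors. This is a simplified illustration, not Surprise's implementation; the function name `train_mf`, the toy ratings, and the hyperparameter values are assumptions chosen for the example.

```python
import math
import random

def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    bu = {u: 0.0 for u in users}  # user biases
    bi = {i: 0.0 for i in items}  # item biases
    # Small random initial factors for users (p) and items (q).
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        return mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Regularized gradient steps on biases and factors.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return mu, predict

# Toy (user, item, rating) triples, not the real data.
ratings = [("a", "x", 5), ("a", "y", 3), ("b", "x", 4),
           ("b", "z", 1), ("c", "y", 2), ("c", "z", 1)]
mu, predict = train_mf(ratings)

def train_rmse(estimate):
    return math.sqrt(sum((r - estimate(u, i)) ** 2 for u, i, r in ratings) / len(ratings))

baseline_rmse = train_rmse(lambda u, i: mu)  # always predict the global mean
fitted_rmse = train_rmse(predict)
print(baseline_rmse, fitted_rmse)
```

The number of factors controls the size of the user and item vectors; more factors give the model more capacity but also more room to overfit.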

In [17]:
algo_svd = SVD(n_factors=40)

In [18]:
cross_validate(algo_svd, data, measures = ["RMSE", "MAE", "FCP"], cv=5, verbose=True)

Evaluating RMSE, MAE, FCP of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9347  0.9301  0.9382  0.9309  0.9329  0.9334  0.0029  
MAE (testset)     0.7360  0.7352  0.7382  0.7370  0.7349  0.7363  0.0012  
FCP (testset)     0.6999  0.7042  0.6939  0.7033  0.7055  0.7014  0.0042  
Fit time          3.05    3.10    3.09    3.07    3.13    3.09    0.03    
Test time         0.15    0.28    0.15    0.14    0.15    0.17    0.05    


{'test_rmse': array([0.93473246, 0.93014697, 0.93815803, 0.93090972, 0.93286119]),
 'test_mae': array([0.73595842, 0.7352312 , 0.73819545, 0.73699584, 0.73494796]),
 'test_fcp': array([0.69992859, 0.70421099, 0.69391508, 0.70333346, 0.70553694]),
 'fit_time': (3.050109624862671,
  3.1021015644073486,
  3.0880095958709717,
  3.070892333984375,
  3.1302261352539062),
 'test_time': (0.14779257774353027,
  0.27581119537353516,
  0.14616942405700684,
  0.14300203323364258,
  0.1465301513671875)}

# Final remarks

With SVD, all three metrics improved over the random baseline (`NormalPredictor`): mean RMSE dropped from 1.5194 to 0.9334, mean MAE dropped from 1.2221 to 0.7363, and mean FCP rose from 0.4934 to 0.7014.
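
The size of the improvement can be expressed as a relative change, using the mean values from the two cross-validation tables above. The helper name `relative_change` is illustrative.

```python
# Mean cross-validation scores copied from the two result tables.
baseline = {"RMSE": 1.5194, "MAE": 1.2221, "FCP": 0.4934}
svd = {"RMSE": 0.9334, "MAE": 0.7363, "FCP": 0.7014}

def relative_change(old, new):
    # Signed change relative to the old value.
    return (new - old) / old

for name in baseline:
    change = relative_change(baseline[name], svd[name])
    print(f"{name}: {baseline[name]:.4f} -> {svd[name]:.4f} ({change:+.1%})")
```

For RMSE and MAE a negative change is an improvement, while for FCP a positive change is.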