# Collaborative Filtering

In collaborative filtering, we predict what a user might enjoy from the preferences of users with similar tastes. Because no user rates every movie, the task is to estimate the rating a user would give to a movie they have not rated yet, using their own past ratings and those of similar users. We do this by training a machine learning model on a dataset of (user ID, movie ID, rating) records.

Once trained, the model takes a user ID and a movie ID and returns a predicted rating. If the predicted rating is high, we recommend the movie to that user. To implement this, we load the ratings, build a training set from them, and fit the model on it. The result is a set of recommendations tailored to each user's tastes.
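The "similar users" idea can be sketched in a few lines of plain Python. This is an illustration only: the user names, movie labels, and ratings are made up, and the similarity-weighted average shown here is a simpler neighbor-based technique, not the SVD model trained later in this notebook.

```python
from math import sqrt

# Toy ratings: user -> {movie: rating}. All values are invented for illustration.
ratings = {
    "ann": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob": {"A": 4.5, "B": 3.5, "C": 4.0, "D": 5.0},
    "cat": {"A": 1.0, "B": 5.0, "D": 2.0},
}

def cosine(u, v):
    """Cosine similarity computed over the movies both users have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][m] * ratings[v][m] for m in common)
    norm_u = sqrt(sum(ratings[u][m] ** 2 for m in common))
    norm_v = sqrt(sum(ratings[v][m] ** 2 for m in common))
    return dot / (norm_u * norm_v)

def predict(user, movie):
    """Similarity-weighted average of other users' ratings for the movie."""
    pairs = [(cosine(user, other), r[movie])
             for other, r in ratings.items()
             if other != user and movie in r]
    total = sum(sim for sim, _ in pairs)
    if total == 0:
        return None  # nobody comparable has rated this movie
    return sum(sim * rating for sim, rating in pairs) / total

# ann has not rated "D"; bob (similar to ann) liked it, cat (less similar) did not.
estimate = predict("ann", "D")
print(round(estimate, 2))
```

Because ann's ratings agree more closely with bob's than with cat's, the estimate is pulled toward bob's rating of "D".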

## Load data

In [11]:
import pandas as pd

# Load the ratings data from the 'ratings.csv' file into a DataFrame
ratings_df = pd.read_csv('./assets/data/ratings.csv')

# Selecting only the columns 'userId', 'movieId', and 'rating' from the DataFrame
ratings_df = ratings_df.loc[:, ['userId', 'movieId', 'rating']]

# Display the first few rows of the ratings DataFrame to get an overview of the data
ratings_df.head()

|   | userId | movieId | rating |
|---|--------|---------|--------|
| 0 | 1      | 31      | 2.5    |
| 1 | 1      | 1029    | 3.0    |
| 2 | 1      | 1061    | 3.0    |
| 3 | 1      | 1129    | 2.0    |
| 4 | 1      | 1172    | 4.0    |


## Create the dataset

In [12]:
import surprise

In [13]:
from surprise import Dataset, Reader

# Create a Reader with the rating scale, i.e. the minimum and maximum possible ratings
# Surprise clips predicted ratings to this range, so it should match the data
# Here the scale is set to 1 through 5; check it against ratings_df['rating'].min() and .max()
reader = Reader(rating_scale=(1, 5))

# Load the dataset from the DataFrame with Surprise's load_from_df()
# load_from_df() expects exactly three columns in the order user, item, rating,
# which is why ratings_df was reduced to 'userId', 'movieId', and 'rating' above
# The Reader object supplies the rating scale
dataset = Dataset.load_from_df(ratings_df, reader)

## Build the trainset

In [14]:
# Build the training set from the full dataset
# build_full_trainset() returns a Trainset object containing every rating in the dataset
# Trainset is the data structure that Surprise algorithms are fitted on
# For simplicity no ratings are held out here; in practice, a separate test set is used for evaluation
trainset = dataset.build_full_trainset()

In [15]:
# Materialize the training ratings as a list and show the first 10
# all_ratings() returns a generator of (inner user id, inner item id, rating) tuples
# Inner ids are Surprise's own zero-based indices, not the raw userId/movieId values from the CSV
# Slicing with [:10] keeps the first 10 tuples
list(trainset.all_ratings())[:10]

[(0, 0, 2.5),
 (0, 1, 3.0),
 (0, 2, 3.0),
 (0, 3, 2.0),
 (0, 4, 4.0),
 (0, 5, 2.0),
 (0, 6, 2.0),
 (0, 7, 2.0),
 (0, 8, 3.5),
 (0, 9, 2.0)]
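The tuples above start at `(0, 0, ...)` even though the first CSV row is user 1, movie 31, because the training set re-indexes raw ids into contiguous inner ids. The following is a conceptual sketch of that kind of mapping, assuming ids are numbered in order of first appearance; `build_id_maps` is a hypothetical helper written for this example, not a Surprise function.

```python
# (userId, movieId, rating) rows copied from the ratings_df.head() output above
rows = [(1, 31, 2.5), (1, 1029, 3.0), (1, 1061, 3.0), (1, 1129, 2.0), (1, 1172, 4.0)]

def build_id_maps(rows):
    """Map raw ids to zero-based inner ids in order of first appearance."""
    user_to_inner, item_to_inner = {}, {}
    inner_rows = []
    for raw_uid, raw_iid, rating in rows:
        # setdefault stores the next free index the first time an id is seen
        u = user_to_inner.setdefault(raw_uid, len(user_to_inner))
        i = item_to_inner.setdefault(raw_iid, len(item_to_inner))
        inner_rows.append((u, i, rating))
    return user_to_inner, item_to_inner, inner_rows

user_to_inner, item_to_inner, inner_rows = build_id_maps(rows)
print(inner_rows)
```

On a real `Trainset`, the conversions are available through `to_inner_uid()`, `to_inner_iid()`, `to_raw_uid()`, and `to_raw_iid()`.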

## Train the Model

In [16]:
# Importing the SVD (Singular Value Decomposition) model from Surprise
from surprise import SVD

# Creating an instance of the SVD model
svd = SVD()

# Fitting the SVD model to the training set
# The fit() method trains the model using the data in the training set
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x2980f096ba0>

In [17]:
# Predict the rating that user 15 would give to item 1956
# predict() takes raw ids, i.e. the userId and movieId values as they appear in ratings_df
prediction = svd.predict(15, 1956)

prediction

Prediction(uid=15, iid=1956, r_ui=None, est=2.9839661613723827, details={'was_impossible': False})
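To give some intuition for where an estimate like `est` comes from, here is a minimal pure-Python sketch of a biased matrix-factorization model of the kind Surprise's `SVD` is based on: the prediction is the global mean plus a user bias, an item bias, and the dot product of user and item factor vectors, all learned by stochastic gradient descent. The toy ratings, the `train_mf` helper, and the hyperparameter values are assumptions chosen for this example; they are not the library's implementation or its defaults.

```python
import random

def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit r_hat = mu + b_u + b_i + p_u . q_i with stochastic gradient descent."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    b_u = {u: 0.0 for u in users}                      # user biases
    b_i = {i: 0.0 for i in items}                      # item biases
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        dot = sum(pf * qf for pf, qf in zip(p[u], q[i]))
        return mu + b_u[u] + b_i[i] + dot

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Move each parameter along the error, shrunk by L2 regularization
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for f in range(n_factors):
                pf, qf = p[u][f], q[i][f]
                p[u][f] += lr * (err * qf - reg * pf)
                q[i][f] += lr * (err * pf - reg * qf)
    return predict, mu

def rmse(predict, ratings):
    """Root mean squared error of predict() over (user, item, rating) rows."""
    return (sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5

# Made-up ratings: (user, item, rating)
toy = [("u1", "A", 5.0), ("u1", "B", 3.0), ("u2", "A", 4.0), ("u2", "C", 2.0),
       ("u3", "B", 1.0), ("u3", "C", 1.5), ("u3", "A", 4.5)]

predict, mu = train_mf(toy)
baseline_rmse = rmse(lambda u, i: mu, toy)  # always predicting the global mean
trained_rmse = rmse(predict, toy)
print(baseline_rmse, trained_rmse)
```

The trained model fits the toy ratings better than the constant global-mean baseline, which is the same effect `svd.fit(trainset)` aims for at a much larger scale.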

In [18]:
# Retrieve the estimated rating from the prediction object
# The 'est' attribute of the prediction object contains the estimated rating
estimation = prediction.est

estimation

2.9839661613723827
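Turning single predictions into recommendations means scoring the movies a user has not rated and keeping the highest estimates. The sketch below assumes a hypothetical `top_n` helper and a stand-in predictor backed by a lookup table; every estimate in the table is invented for illustration except the 2.98 for item 1956, which echoes the prediction above. The `min_rating` threshold of 3.5 is likewise an arbitrary choice.

```python
def top_n(predict_fn, user_id, candidate_items, seen_items, n=3, min_rating=3.5):
    """Return up to n (item, estimate) pairs for unseen items, best first."""
    unseen = [i for i in candidate_items if i not in seen_items]
    scored = [(i, predict_fn(user_id, i)) for i in unseen]
    good = [(i, est) for i, est in scored if est >= min_rating]
    return sorted(good, key=lambda pair: pair[1], reverse=True)[:n]

# Stand-in for a trained model: item id -> made-up estimated rating
fake_estimates = {1956: 2.98, 31: 4.2, 1029: 3.9, 1061: 3.1, 1129: 4.6}

def fake_predict(user_id, item_id):
    return fake_estimates[item_id]

# User 15 has already rated item 31, so it is excluded from the candidates.
recs = top_n(fake_predict, 15, fake_estimates.keys(), seen_items={31}, n=2)
print(recs)
```

With the trained model from this notebook, the predictor could be `lambda u, i: svd.predict(u, i).est`, with the candidate items taken from `ratings_df['movieId'].unique()`.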

## Validation

In [19]:
from surprise import model_selection

# Cross-validate the SVD algorithm and report RMSE and MAE
# cross_validate() refits 'svd' on each training split, so the earlier fit() call does not affect these scores
# 'dataset' is the full Dataset object; it is split into folds (5 by default)
# 'measures' lists the metrics to compute: Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE)
# For each fold, the model is trained on the remaining folds and tested on the held-out fold,
# which gives an overall estimate of performance on unseen ratings
accuracy = model_selection.cross_validate(svd, dataset, measures=['RMSE', 'MAE'])

accuracy

{'test_rmse': array([0.89903294, 0.89181357, 0.90462553, 0.89514806, 0.89499599]),
 'test_mae': array([0.69257265, 0.68999856, 0.69428364, 0.68700953, 0.68975902]),
 'fit_time': (1.3340437412261963,
  1.1105327606201172,
  1.0830504894256592,
  1.0682580471038818,
  1.0740516185760498),
 'test_time': (0.23699951171875,
  0.2040109634399414,
  0.12100386619567871,
  0.21500015258789062,
  0.11005663871765137)}
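To make the numbers above easier to read, here is a small pure-Python sketch of what RMSE and MAE measure and of how a k-fold split partitions the rows. The helper names (`rmse`, `mae`, `k_fold_indices`) and the three example pairs are assumptions for illustration, not Surprise's implementation; the `fold_rmse` values are copied from the `test_rmse` array above.

```python
from math import sqrt
import random

def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    return sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(t - e) for t, e in pairs) / len(pairs)

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    for f in range(k):
        train = [j for g in range(k) if g != f for j in folds[g]]
        yield train, folds[f]

# Made-up (true, estimated) pairs
pairs = [(4.0, 3.5), (2.0, 3.0), (5.0, 4.5)]
print(rmse(pairs), mae(pairs))

# Each row lands in exactly one test fold
splits = list(k_fold_indices(10, k=5))

# Average the per-fold RMSE values reported by cross_validate above
fold_rmse = [0.89903294, 0.89181357, 0.90462553, 0.89514806, 0.89499599]
mean_rmse = sum(fold_rmse) / len(fold_rmse)
print(mean_rmse)
```

Averaged over the five folds, the RMSE is roughly 0.90, so on held-out ratings the model's estimates are typically off by a little under one star.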