# Building a Recommender System with SURPRISE & Comparing Performance (RMSE) of Various Algorithms

In this project, we build and compare various recommendation system algorithms using the `scikit-surprise` library. Our goal is to evaluate the performance of popular collaborative filtering algorithms using **cross-validation**, and identify which algorithm performs best on our dataset based on RMSE (Root Mean Squared Error).

## Algorithms Covered

We will implement and evaluate the following algorithms:

- `KNNBasic` — K-Nearest Neighbors based collaborative filtering (user/item-based)
- `SVD` — Singular Value Decomposition for matrix factorization
- `NMF` — Non-negative Matrix Factorization
- `SlopeOne` — Simple and efficient baseline recommender
- `CoClustering` — Clustering-based matrix approximation method

## Evaluation Method

- **Metric**: RMSE (Root Mean Squared Error)
- **Method**: 5-fold Cross-Validation using `surprise.model_selection.cross_validate`
- We will also use `GridSearchCV` where applicable to tune hyperparameters and find the best performing configuration for each algorithm.
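
For reference, RMSE is the square root of the mean squared difference between true and predicted ratings, and k-fold cross-validation averages that score over k held-out folds. The sketch below illustrates both ideas in plain Python. The helper names `rmse` and `kfold_indices` are hypothetical and exist only for illustration; Surprise performs the splitting and scoring internally.

```python
import math
import random

def rmse(true_ratings, predicted_ratings):
    """Root Mean Squared Error between two equal-length sequences."""
    if len(true_ratings) != len(predicted_ratings) or not true_ratings:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(true_ratings, predicted_ratings)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def kfold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is held out exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i in range(n_splits):
        test_idx = folds[i]
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx

# Errors of 0.5, 0.0 and -1.0 give sqrt((0.25 + 0 + 1) / 3)
print(rmse([4.0, 3.0, 2.0], [3.5, 3.0, 3.0]))

splits = list(kfold_indices(10, n_splits=5))
print(len(splits), [len(test) for _, test in splits])
```

Because every rating lands in a test fold exactly once, the mean RMSE across folds uses all of the data for evaluation without ever scoring a model on ratings it was trained on.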

## Output

- Mean RMSE for each algorithm
- Comparison table of performance
- Recommendation on which algorithm works best

---

Let’s get started! 🚀


In [None]:
# !pip install "numpy<2.0"
# !pip install scikit-surprise

In [None]:
import pandas as pd

In [None]:
from surprise import Dataset, Reader, SVD, NMF, KNNBasic, SlopeOne, CoClustering
from surprise.model_selection import cross_validate


In [None]:
movies_df = pd.read_csv("ml-latest-small/movies.csv")
ratings_df = pd.read_csv("ml-latest-small/ratings.csv")

In [None]:
movies_df.head()

In [None]:
ratings_df.head()

In [None]:
movies_ratings_df = pd.merge(movies_df, ratings_df, on="movieId")
movies_ratings_df.head()

In [None]:
movies_ratings_df

In [None]:
data = movies_ratings_df[['userId', 'title', 'rating']]
data.head()

#### Populate Surprise Dataset

In [None]:
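# Surprise expects the columns in the order: user, item, rating
ratings_input = data[['userId', 'title', 'rating']]
# MovieLens ratings run from 0.5 to 5.0 in half-star steps
reader = Reader(rating_scale=(0.5, 5.0))
surprise_data = Dataset.load_from_df(ratings_input, reader)
# Note: this test set is built from the training data itself, so scores on it
# are optimistic; the cross-validated RMSE is the figure to compare.
train = surprise_data.build_full_trainset()
test = train.build_testset()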

#### Compare KNNBasic, SVD, NMF, SlopeOne, and CoClustering

#### Singular Value Decomposition (SVD)
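
As background, SVD-style recommenders predict a rating as a global mean plus a user bias, an item bias, and the dot product of a user factor vector and an item factor vector, with all of these learned by stochastic gradient descent. The following is a minimal sketch of that idea on made-up toy data. The variable names, learning rate, regularization, and epoch count are assumptions chosen for illustration and are not Surprise's internals or defaults.

```python
import random

# Toy (user, item, rating) triples. Made-up values for illustration only.
triples = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
n_users, n_items, n_factors = 3, 3, 2
lr, reg, n_epochs = 0.01, 0.02, 200  # assumed values for this sketch

rng = random.Random(0)
mu = sum(r for _, _, r in triples) / len(triples)  # global mean rating
bu = [0.0] * n_users                               # user biases
bi = [0.0] * n_items                               # item biases
p = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]

def predict(u, i):
    """Global mean + user bias + item bias + factor dot product."""
    return mu + bu[u] + bi[i] + sum(pf * qf for pf, qf in zip(p[u], q[i]))

def train_rmse():
    """RMSE on the training triples (not a held-out estimate)."""
    return (sum((r - predict(u, i)) ** 2 for u, i, r in triples) / len(triples)) ** 0.5

rmse_before = train_rmse()
for _ in range(n_epochs):
    for u, i, r in triples:
        err = r - predict(u, i)
        bu[u] += lr * (err - reg * bu[u])
        bi[i] += lr * (err - reg * bi[i])
        for f in range(n_factors):
            pu_f, qi_f = p[u][f], q[i][f]  # keep old values for a simultaneous update
            p[u][f] += lr * (err * qi_f - reg * pu_f)
            q[i][f] += lr * (err * pu_f - reg * qi_f)
rmse_after = train_rmse()

print(rmse_before, rmse_after)
```

The `n_factors` and `n_epochs` knobs in this sketch play the same role as the parameters of the same name tuned in the grid search for `SVD`.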

In [None]:
svd = SVD()
svd.fit(train)
prediction = svd.test(test)


In [None]:
prediction[:5]

In [None]:
cross_val_results = cross_validate(svd, surprise_data, measures=['RMSE'], cv=5)

In [None]:
print(cross_val_results)

In [None]:
from surprise.model_selection import GridSearchCV
param_grid = {
    'n_factors': [2, 5, 10],
    'n_epochs': [20, 30, 50],
}
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=5)
gs.fit(surprise_data)

print(gs.best_score['rmse'])
print(gs.best_params['rmse'])
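
To make the cost of this search concrete, a grid search evaluates every combination of the listed values, and each combination is cross-validated once per fold. The sketch below expands the same grid in plain Python and counts the resulting model fits. `expand_grid` is a hypothetical helper written for illustration, not a Surprise function.

```python
from itertools import product

param_grid = {
    'n_factors': [2, 5, 10],
    'n_epochs': [20, 30, 50],
}

def expand_grid(grid):
    """Return every combination of the grid's values as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]

combinations = expand_grid(param_grid)
n_folds = 5

print(len(combinations))            # candidate configurations
print(len(combinations) * n_folds)  # total model fits across all folds
print(combinations[0])
```

Adding a third parameter with three values would triple the number of fits, which is worth keeping in mind before widening a grid on a larger dataset.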

In [None]:
comparison_list = []
comparison_list.append(['SVD', gs.best_score['rmse'], gs.best_params['rmse']])
comparison_list

#### KNNBasic
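
As background, a basic neighborhood model predicts a user's rating for an item as the similarity-weighted average of the ratings given to that item by the `k` most similar users. The sketch below shows this on made-up toy data. It assumes user-based neighbors and cosine similarity over co-rated items; Surprise's `KNNBasic` lets you configure the similarity measure through `sim_options`, and its defaults may differ from what is shown here.

```python
import math

# Toy ratings: user -> {item: rating}. Made-up values for illustration only.
ratings = {
    'alice': {'A': 5.0, 'B': 3.0, 'C': 4.0},
    'bob':   {'A': 4.0, 'B': 2.0, 'C': 5.0, 'D': 4.0},
    'carol': {'A': 1.0, 'B': 5.0, 'D': 2.0},
    'dave':  {'B': 3.0, 'C': 4.0, 'D': 5.0},
}

def cosine_sim(u, v):
    """Cosine similarity computed over the items both users rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    norm_u = math.sqrt(sum(ratings[u][i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict_knn(user, item, k=2):
    """Similarity-weighted mean of the k most similar users' ratings for item."""
    neighbors = []
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_sim(user, other)
        if sim > 0:
            neighbors.append((sim, other_ratings[item]))
    top_k = sorted(neighbors, reverse=True)[:k]
    total_sim = sum(sim for sim, _ in top_k)
    if total_sim == 0:
        return None  # no usable neighbors for this (user, item) pair
    return sum(sim * rating for sim, rating in top_k) / total_sim

print(predict_knn('alice', 'D', k=2))
```

A larger `k` averages over more neighbors, which smooths predictions but can pull in less similar users; that trade-off is what the grid over `k` explores.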

In [None]:
param_grid = {
    'k': [10, 20, 30, 40]
}
gs_knn = GridSearchCV(KNNBasic, param_grid, measures=['rmse'], cv=5)
gs_knn.fit(surprise_data)

print(gs_knn.best_score['rmse'])
print(gs_knn.best_params['rmse'])

In [None]:
comparison_list.append(['KNNBasic', gs_knn.best_score['rmse'], gs_knn.best_params['rmse']])
comparison_list

#### Non-negative Matrix Factorization (NMF)

In [None]:
param_grid = {
    'n_factors': [10, 20, 30],
    'n_epochs': [20, 50],
}
gs_nmf = GridSearchCV(NMF, param_grid, measures=['rmse'], cv=5)
gs_nmf.fit(surprise_data)

print(gs_nmf.best_score['rmse'])
print(gs_nmf.best_params['rmse'])

In [None]:
comparison_list.append(['Non-negative Matrix Factorization', gs_nmf.best_score['rmse'], gs_nmf.best_params['rmse']])
comparison_list

#### SlopeOne
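
As background, Slope One predicts a rating from average pairwise differences between items: if item C is rated one point higher than item B on average, a user who gave B a 4 is estimated to give C a 5. The sketch below implements the basic, unweighted scheme on made-up toy data. The function names are hypothetical, and Surprise's `SlopeOne` may differ in details such as how the user's mean rating enters the formula.

```python
# Toy ratings: user -> {item: rating}. Made-up values for illustration only.
ratings = {
    'alice': {'A': 5.0, 'B': 3.0, 'C': 2.0},
    'bob':   {'A': 3.0, 'B': 4.0},
    'carol': {'B': 2.0, 'C': 5.0},
}

def average_deviation(item_i, item_j):
    """Mean of (rating_i - rating_j) over users who rated both items."""
    diffs = [r[item_i] - r[item_j] for r in ratings.values()
             if item_i in r and item_j in r]
    return sum(diffs) / len(diffs) if diffs else None

def predict_slope_one(user, item):
    """Average, over the user's rated items j, of rating_j + dev(item, j)."""
    estimates = []
    for other_item, rating in ratings[user].items():
        if other_item == item:
            continue
        dev = average_deviation(item, other_item)
        if dev is not None:
            estimates.append(rating + dev)
    return sum(estimates) / len(estimates) if estimates else None

print(predict_slope_one('bob', 'C'))
```

The method has no learning rate, factor count, or neighborhood size to set, which is why it is cross-validated directly rather than tuned.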

In [None]:
# SlopeOne has no hyperparameters to tune, so cross-validate it directly
slope_one = SlopeOne()
cv_results = cross_validate(slope_one, surprise_data, measures=['RMSE'], cv=5, verbose=True)

print("Mean RMSE:", cv_results['test_rmse'].mean())

In [None]:
comparison_list.append(['SlopeOne', cv_results['test_rmse'].mean(), 'No hyperparameters'])
comparison_list

#### CoClustering
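
As background, co-clustering assigns users and items to clusters and predicts a rating as the average of the matching co-cluster, adjusted by how far the user's mean sits from the user cluster's mean and how far the item's mean sits from the item cluster's mean. The sketch below shows only that prediction step on made-up toy data. The cluster assignments are fixed by hand as an assumption; a real implementation learns them iteratively, and this sketch does not handle empty clusters.

```python
# Toy ratings: (user, item) -> rating. Made-up values for illustration only.
ratings = {
    ('u1', 'A'): 5.0, ('u1', 'B'): 4.0,
    ('u2', 'A'): 4.0, ('u2', 'C'): 1.0,
    ('u3', 'B'): 2.0, ('u3', 'C'): 5.0,
}
# Hand-picked cluster assignments (assumption); normally these are learned.
user_cluster = {'u1': 0, 'u2': 0, 'u3': 1}
item_cluster = {'A': 0, 'B': 0, 'C': 1}

def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)

def predict_cocluster(user, item):
    """Co-cluster mean + (user mean - user cluster mean) + (item mean - item cluster mean)."""
    uc, ic = user_cluster[user], item_cluster[item]
    user_mean = mean(r for (u, _), r in ratings.items() if u == user)
    item_mean = mean(r for (_, i), r in ratings.items() if i == item)
    user_cluster_mean = mean(r for (u, _), r in ratings.items()
                             if user_cluster[u] == uc)
    item_cluster_mean = mean(r for (_, i), r in ratings.items()
                             if item_cluster[i] == ic)
    cocluster_mean = mean(r for (u, i), r in ratings.items()
                          if user_cluster[u] == uc and item_cluster[i] == ic)
    return (cocluster_mean
            + (user_mean - user_cluster_mean)
            + (item_mean - item_cluster_mean))

print(predict_cocluster('u1', 'C'))
```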

In [None]:
param_grid = {
    'n_epochs': [20, 50],
}
gs_co = GridSearchCV(CoClustering, param_grid, measures=['rmse'], cv=5)
gs_co.fit(surprise_data)

print(gs_co.best_score['rmse'])
print(gs_co.best_params['rmse'])

In [None]:
comparison_list.append(['CoClustering', gs_co.best_score['rmse'], gs_co.best_params['rmse']])
comparison_list

In [None]:
performance_comparison_dataframe = pd.DataFrame(comparison_list, columns=['Algorithm', 'Mean RMSE', 'Best Params'])
# Lower RMSE is better, so sort ascending to list the best algorithm first
performance_comparison_dataframe.sort_values(by='Mean RMSE', ascending=True)