# Building a Recommender System with SURPRISE & Comparing Performance (RMSE) of Various Algorithms

In this project, we build and compare several recommendation algorithms using the `scikit-surprise` library. The goal is to evaluate popular collaborative filtering algorithms with **cross-validation** and identify which one achieves the lowest RMSE (Root Mean Squared Error) on our dataset.

## Algorithms Covered

We will implement and evaluate the following algorithms:

- `KNNBasic` — K-Nearest Neighbors based collaborative filtering (user/item-based)
- `SVD` — Singular Value Decomposition for matrix factorization
- `NMF` — Non-negative Matrix Factorization
- `SlopeOne` — Simple and efficient baseline recommender
- `CoClustering` — Clustering-based matrix approximation method

## Evaluation Method

- **Metric**: RMSE (Root Mean Squared Error)
- **Method**: 5-fold Cross-Validation using `surprise.model_selection.cross_validate`
- We will also use `GridSearchCV` where applicable to tune hyperparameters and find the best performing configuration for each algorithm.
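
As a concrete illustration of the metric, the following minimal sketch computes RMSE from pairs of true and predicted ratings in plain Python. The helper name `rmse` and the sample ratings are hypothetical and used only for illustration; in the notebook itself Surprise computes this value for each fold.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, predicted_rating) pairs."""
    if not pairs:
        raise ValueError("need at least one prediction")
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical (true, predicted) ratings for illustration only.
sample = [(4.0, 3.5), (5.0, 4.5), (3.0, 3.5), (2.0, 2.5)]
print(rmse(sample))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.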

## Output

- Mean RMSE for each algorithm
- Comparison table of performance
- Recommendation on which algorithm performs best

---

Let’s get started! 🚀


In [2]:
# !pip install "numpy<2.0"
# !pip install scikit-surprise

In [3]:
import pandas as pd

In [4]:
from surprise import Dataset, Reader, SVD, NMF, KNNBasic, SlopeOne, CoClustering
from surprise.model_selection import cross_validate


In [5]:
movies_df = pd.read_csv("ml-latest-small/movies.csv")
ratings_df = pd.read_csv("ml-latest-small/ratings.csv")

In [6]:
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [7]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [8]:
movies_ratings_df = pd.merge(movies_df, ratings_df, on="movieId")
movies_ratings_df.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,1106635946
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5,1510577970
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5,1305696483


In [9]:
movies_ratings_df

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,1106635946
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5,1510577970
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5,1305696483
...,...,...,...,...,...,...
100831,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,184,4.0,1537109082
100832,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,184,3.5,1537109545
100833,193585,Flint (2017),Drama,184,3.5,1537109805
100834,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,184,3.5,1537110021


In [10]:
data = movies_ratings_df[['userId', 'title', 'rating']]
data.head()

Unnamed: 0,userId,title,rating
0,1,Toy Story (1995),4.0
1,5,Toy Story (1995),4.0
2,7,Toy Story (1995),4.5
3,15,Toy Story (1995),2.5
4,17,Toy Story (1995),4.5


#### Populate Surprise Dataset

In [12]:
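# Surprise expects exactly three columns, in the order: user, item, rating
ratings_input = data[['userId', 'title', 'rating']]
reader = Reader(rating_scale=(0, 5.0))
surprise_data = Dataset.load_from_df(ratings_input, reader)
# Train on every rating, then rebuild a test set from those same ratings
train = surprise_data.build_full_trainset()
test = train.build_testset()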
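
Cross-validation, used later in this notebook, avoids scoring a model on ratings it was trained on by holding out each fold in turn. The following is a minimal plain-Python sketch of that idea. The helper `k_fold_splits` and the toy ratings are hypothetical, and this is not Surprise's implementation.

```python
import random

def k_fold_splits(ratings, n_splits=5, seed=0):
    """Yield (train, test) lists so that every rating is held out exactly once."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    for fold in range(n_splits):
        # Every n_splits-th rating, starting at `fold`, forms the test part.
        test = shuffled[fold::n_splits]
        train = [r for i, r in enumerate(shuffled) if i % n_splits != fold]
        yield train, test

# Hypothetical (userId, title, rating) triples for illustration only.
toy_ratings = [(u, f"Movie {m}", 3.0 + (u + m) % 3)
               for u in range(1, 5) for m in range(1, 6)]

splits = list(k_fold_splits(toy_ratings, n_splits=5))
for train_part, test_part in splits:
    # No rating appears on both sides of a split.
    assert not set(train_part) & set(test_part)
```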

#### Compare KNNBasic, SVD, NMF, SlopeOne, and CoClustering

#### Singular Value Decomposition (SVD)

In [15]:
svd = SVD()
svd.fit(train)
prediction = svd.test(test)


In [16]:
prediction[:5]

[Prediction(uid=1, iid='Toy Story (1995)', r_ui=4.0, est=4.530303663933725, details={'was_impossible': False}),
 Prediction(uid=1, iid='Grumpier Old Men (1995)', r_ui=4.0, est=3.9839800853841707, details={'was_impossible': False}),
 Prediction(uid=1, iid='Heat (1995)', r_ui=4.0, est=4.6251070223289235, details={'was_impossible': False}),
 Prediction(uid=1, iid='Seven (a.k.a. Se7en) (1995)', r_ui=5.0, est=4.8254972507514164, details={'was_impossible': False}),
 Prediction(uid=1, iid='Usual Suspects, The (1995)', r_ui=5.0, est=5.0, details={'was_impossible': False})]
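
The `est` value of exactly 5.0 in the last prediction is consistent with Surprise clipping estimates to the `rating_scale` given to the `Reader`. As a rough sketch, the SVD model's estimate is the global mean plus user and item biases plus the dot product of the user and item factor vectors. All numbers below are hypothetical, not taken from the fitted model, and `predict_svd` is an illustrative helper rather than a Surprise function.

```python
def predict_svd(global_mean, user_bias, item_bias, user_factors, item_factors,
                rating_scale=(0.0, 5.0)):
    """Biased matrix-factorization estimate, clipped to the rating scale."""
    dot = sum(p * q for p, q in zip(user_factors, item_factors))
    raw = global_mean + user_bias + item_bias + dot
    low, high = rating_scale
    return min(high, max(low, raw))

# Hypothetical parameters with two latent factors.
# The raw estimate exceeds the top of the scale, so it is clipped.
est = predict_svd(3.5, 0.6, 0.7, [0.4, -0.1], [0.9, 0.2])
print(est)
```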

In [17]:
from surprise.model_selection import cross_validate

In [18]:
cross_val_results = cross_validate(svd, surprise_data, measures=['RMSE'])

In [19]:
print(cross_val_results)

{'test_rmse': array([0.87527401, 0.86829205, 0.87438906, 0.86870607, 0.87874656]), 'fit_time': (1.713771104812622, 1.9158456325531006, 2.0125248432159424, 2.1009159088134766, 2.029358386993408), 'test_time': (0.17709708213806152, 0.36453819274902344, 0.20955395698547363, 0.24730849266052246, 0.3731870651245117)}
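
The dictionary holds one RMSE per fold, and the number usually reported is their mean. A small sketch that summarises the five values printed above, using only the standard library (the variable names are illustrative):

```python
import statistics

# Per-fold test RMSE values copied from the cross_validate output above.
fold_rmse = [0.87527401, 0.86829205, 0.87438906, 0.86870607, 0.87874656]

mean_rmse = statistics.mean(fold_rmse)
# Population standard deviation, which is what NumPy's std computes by default.
std_rmse = statistics.pstdev(fold_rmse)

print(f"mean RMSE = {mean_rmse:.4f}, std = {std_rmse:.4f}")
```

A small spread across folds suggests the score does not depend much on which ratings happened to be held out.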


In [20]:
from surprise.model_selection import GridSearchCV
param_grid = {
    'n_factors': [2, 5, 10],
    'n_epochs': [20, 30, 50],
}
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=5)
gs.fit(surprise_data)

print(gs.best_score['rmse'])
print(gs.best_params['rmse'])

0.8677197395978485
{'n_factors': 2, 'n_epochs': 30}
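
To make the cost of this search explicit: an exhaustive grid search evaluates every combination of the listed values, each with 5-fold cross-validation. The sketch below only enumerates the combinations in plain Python and does not call Surprise.

```python
from itertools import product

param_grid = {
    'n_factors': [2, 5, 10],
    'n_epochs': [20, 30, 50],
}

# Every combination of the listed values is one candidate configuration.
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

n_folds = 5
total_fits = len(combinations) * n_folds
print(len(combinations), "configurations,", total_fits, "model fits")
```

The number of fits grows multiplicatively with each added parameter, so small grids keep the search affordable.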


In [21]:
comparison_list = []
comparison_list.append(['SVD', gs.best_score['rmse'], gs.best_params['rmse']])
comparison_list

[['SVD', 0.8677197395978485, {'n_factors': 2, 'n_epochs': 30}]]

#### KNNBasic

In [23]:
param_grid = {
    'k': [10, 20, 30, 40]
}
gs_knn = GridSearchCV(KNNBasic, param_grid, measures=['rmse'], cv=5)
gs_knn.fit(surprise_data)

print(gs_knn.best_score['rmse'])
print(gs_knn.best_params['rmse'])

Computing the msd similarity matrix...
Done computing similarity matrix.
...

In [24]:
comparison_list.append(['KNNBasic', gs_knn.best_score['rmse'], gs_knn.best_params['rmse']])
comparison_list

[['SVD', 0.8677197395978485, {'n_factors': 2, 'n_epochs': 30}],
 ['KNNBasic', 0.9411547816377727, {'k': 10}]]
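
For intuition on what `k` controls: a basic KNN estimate is a similarity-weighted average of the ratings given by the `k` most similar neighbours. The sketch below assumes the user-based setting and uses hypothetical similarities and ratings. `knn_estimate` is an illustrative helper, not part of Surprise.

```python
import heapq

def knn_estimate(neighbours, k):
    """Similarity-weighted mean rating over the k most similar neighbours.

    `neighbours` is a list of (similarity, rating) pairs from users who
    rated the target item.
    """
    top_k = heapq.nlargest(k, neighbours, key=lambda pair: pair[0])
    total_sim = sum(sim for sim, _ in top_k)
    if total_sim == 0:
        raise ValueError("no neighbour with positive similarity")
    return sum(sim * rating for sim, rating in top_k) / total_sim

# Hypothetical (similarity, rating) pairs for one user-item pair.
neighbours = [(0.9, 5.0), (0.6, 4.0), (0.3, 2.0), (0.1, 1.0)]
est_k2 = knn_estimate(neighbours, k=2)  # only the two closest users
est_k4 = knn_estimate(neighbours, k=4)  # all four users
print(est_k2, est_k4)
```

A smaller `k` relies on fewer but more similar neighbours, which is one possible reason the smallest value in the grid scored best here.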

#### Collaborative Filtering - Non-negative Matrix Factorization (NMF)

In [26]:
param_grid = {
    'n_factors': [10, 20, 30],
    'n_epochs': [20, 50],
}
gs_nmf = GridSearchCV(NMF, param_grid, measures=['rmse'], cv=5)
gs_nmf.fit(surprise_data)

print(gs_nmf.best_score['rmse'])
print(gs_nmf.best_params['rmse'])

0.9234529946245835
{'n_factors': 20, 'n_epochs': 50}


In [27]:
comparison_list.append(['Non-negative Matrix Factorization', gs_nmf.best_score['rmse'], gs_nmf.best_params['rmse']])
comparison_list

[['SVD', 0.8677197395978485, {'n_factors': 2, 'n_epochs': 30}],
 ['KNNBasic', 0.9411547816377727, {'k': 10}],
 ['Non-negative Matrix Factorization',
  0.9234529946245835,
  {'n_factors': 20, 'n_epochs': 50}]]

#### SlopeOne

In [29]:
# SlopeOne has no hyperparameters to tune, so plain cross-validation is enough
slope_one = SlopeOne()
cv_results = cross_validate(slope_one, surprise_data, measures=['RMSE'], cv=5, verbose=True)

print("Mean RMSE:", cv_results['test_rmse'].mean())

Evaluating RMSE of algorithm SlopeOne on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8999  0.8989  0.9162  0.8976  0.8988  0.9023  0.0070  
Fit time          5.16    5.26    5.19    5.50    5.52    5.33    0.15    
Test time         9.43    9.65    10.14   9.36    9.37    9.59    0.29    
Mean RMSE: 0.9022906213771383
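
Slope One has nothing to tune because its estimate is built entirely from average rating differences between item pairs. The sketch below follows the mean-centred form described in the Surprise documentation (the user's mean rating plus the average deviation over items the user has rated). The helper name and toy ratings are hypothetical, this is not the library's code, and no clipping to the rating scale is applied.

```python
def slope_one_estimate(ratings, user, item):
    """Estimate `user`'s rating of `item` from average pairwise deviations.

    `ratings` maps each user to a dict of {item: rating}.
    """
    user_ratings = ratings[user]
    user_mean = sum(user_ratings.values()) / len(user_ratings)
    deviations = []
    for other_item in user_ratings:
        # Differences r(item) - r(other_item) over users who rated both.
        diffs = [r[item] - r[other_item] for r in ratings.values()
                 if item in r and other_item in r]
        if diffs:
            deviations.append(sum(diffs) / len(diffs))
    if not deviations:
        # No co-rated items: fall back to the user's mean rating.
        return user_mean
    return user_mean + sum(deviations) / len(deviations)

# Hypothetical ratings for illustration only.
toy = {
    'alice': {'A': 5.0, 'B': 3.0, 'C': 2.0},
    'bob':   {'A': 3.0, 'B': 4.0},
    'carol': {'B': 2.0, 'C': 5.0},
}
est = slope_one_estimate(toy, 'bob', 'C')
print(est)
```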


In [30]:
comparison_list.append(['Slope One', cv_results['test_rmse'].mean(), 'No Hyper parameters'])
comparison_list

[['SVD', 0.8677197395978485, {'n_factors': 2, 'n_epochs': 30}],
 ['KNNBasic', 0.9411547816377727, {'k': 10}],
 ['Non-negative Matrix Factorization',
  0.9234529946245835,
  {'n_factors': 20, 'n_epochs': 50}],
 ['Slope One', 0.9022906213771383, 'No Hyper parameters']]

#### CoClustering

In [32]:
param_grid = {
    'n_epochs': [20, 50],
}
gs_co = GridSearchCV(CoClustering, param_grid, measures=['rmse'], cv=5)
gs_co.fit(surprise_data)

print(gs_co.best_score['rmse'])
print(gs_co.best_params['rmse'])

0.9456640652984112
{'n_epochs': 20}


In [33]:
comparison_list.append(['CoClustering', gs_co.best_score['rmse'], gs_co.best_params['rmse']])
comparison_list

[['SVD', 0.8677197395978485, {'n_factors': 2, 'n_epochs': 30}],
 ['KNNBasic', 0.9411547816377727, {'k': 10}],
 ['Non-negative Matrix Factorization',
  0.9234529946245835,
  {'n_factors': 20, 'n_epochs': 50}],
 ['Slope One', 0.9022906213771383, 'No Hyper parameters'],
 ['CoClustering', 0.9456640652984112, {'n_epochs': 20}]]

In [34]:
performance_comparison_dataframe = pd.DataFrame(comparison_list)
performance_comparison_dataframe.sort_values(by=performance_comparison_dataframe.columns[1], ascending=False)

Unnamed: 0,0,1,2
4,CoClustering,0.945664,{'n_epochs': 20}
1,KNNBasic,0.941155,{'k': 10}
2,Non-negative Matrix Factorization,0.923453,"{'n_factors': 20, 'n_epochs': 50}"
3,Slope One,0.902291,No Hyper parameters
0,SVD,0.86772,"{'n_factors': 2, 'n_epochs': 30}"
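
The table above is sorted in descending order, so the worst model appears first. Since lower RMSE is better, an ascending sort puts the best model at the top. The following plain-Python sketch reuses the scores already computed above; the variable names are illustrative.

```python
# (algorithm, mean RMSE) pairs copied from the comparison list above.
results = [
    ('SVD', 0.8677197395978485),
    ('KNNBasic', 0.9411547816377727),
    ('Non-negative Matrix Factorization', 0.9234529946245835),
    ('Slope One', 0.9022906213771383),
    ('CoClustering', 0.9456640652984112),
]

# Lower RMSE is better, so sort ascending to put the best model first.
ranked = sorted(results, key=lambda row: row[1])
best_name, best_rmse = ranked[0]
worst_name, worst_rmse = ranked[-1]

# Relative RMSE reduction of the best model over the worst one.
relative_gain = (worst_rmse - best_rmse) / worst_rmse

for rank, (name, score) in enumerate(ranked, start=1):
    print(f"{rank}. {name}: {score:.4f}")
print(f"{best_name} improves on {worst_name} by {relative_gain:.1%}")
```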


## 📊 Model Performance Comparison (RMSE)

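We evaluated five collaborative filtering algorithms using 5-fold cross-validation on the MovieLens `ml-latest-small` ratings (`ratings.csv`), merged with `movies.csv` to obtain movie titles. Each RMSE below is the mean over the five folds; for the tuned algorithms it is the score of the best configuration found by `GridSearchCV`.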

| Rank | Algorithm                       | RMSE     | Best Hyperparameters                         |
|------|----------------------------------|----------|----------------------------------------------|
| 1    | **SVD**                          | 0.867720 | `{'n_factors': 2, 'n_epochs': 30}`           |
| 2    | **Slope One**                   | 0.902291 | No hyperparameters                           |
| 3    | **NMF (Non-negative MF)**       | 0.923453 | `{'n_factors': 20, 'n_epochs': 50}`          |
| 4    | **KNNBasic**                    | 0.941155 | `{'k': 10}`                                  |
| 5    | **CoClustering**                | 0.945664 | `{'n_epochs': 20}`                           |

### ✅ Best Performing Model: `SVD`
- Achieved the lowest RMSE (0.867720)
- Tuned with `n_factors=2`, `n_epochs=30`
- Recommended for generating final predictions in this project

---

