### Required Discussion 19:1: Building a Recommender System with SURPRISE

This discussion focuses on exploring additional algorithms in the `Surprise` library to generate recommendations. Your goal is to identify the best algorithm by minimizing the prediction error (RMSE, with MAE as a second measure) under cross-validation. You will also select a dataset from the [grouplens](https://grouplens.org/datasets/movielens/) example datasets.

To begin, head over to [grouplens](https://grouplens.org/datasets/movielens/) and examine the different datasets available. Choose one that makes it easy to build the data in the form `Surprise` expects, with user, item, and rating columns. Then compare the performance of at least the `KNNBasic`, `SVD`, `NMF`, `SlopeOne`, and `CoClustering` algorithms to build your recommendations. For more information on the algorithms, see the documentation for the algorithm package [here](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html).

Share the results of your investigation with your peers, including your cross-validation results and a basic description of your dataset.


In [None]:
# Install SURPRISE (run this only once)
# !pip install scikit-surprise


In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from surprise import Dataset, Reader, SVD, NMF, KNNBasic, SlopeOne, CoClustering
from surprise.model_selection import cross_validate, train_test_split, GridSearchCV
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# Load the ratings.csv file
ratings_full_df = pd.read_csv('data/ratings.csv')

# Sample a smaller subset (e.g., 10% of ratings)
ratings_df = ratings_full_df.sample(frac=0.1, random_state=42)

# Display the first few rows to verify the data
print(ratings_df.head())


          userId  movieId  rating   timestamp
10685861   66954      781     5.0   850944577
1552723     9877      574     4.0   945495614
6145184    38348     1088     2.0   999974867
16268584  101952     2706     1.0  1203077565
22418634  140400   275079     3.5  1653782463


In [4]:
# Summary statistics for the ratings
print("\nRatings summary:")
print(ratings_df['rating'].describe())


Ratings summary:
count    3.200020e+06
mean     3.540462e+00
std      1.058937e+00
min      5.000000e-01
25%      3.000000e+00
50%      3.500000e+00
75%      4.000000e+00
max      5.000000e+00
Name: rating, dtype: float64


In [5]:
# Count unique users and movies
n_users = ratings_df['userId'].nunique()
n_movies = ratings_df['movieId'].nunique()
n_ratings = len(ratings_df)
print(f"\nDataset has {n_users} users, {n_movies} movies, and {n_ratings} ratings")


Dataset has 197270 users, 42809 movies, and 3200020 ratings
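
The counts above also give the density of the user-item matrix, which is a compact way to describe the dataset. The following is a minimal sketch that hard-codes the three numbers printed above; `density` and `sparsity` are names chosen here for illustration.

```python
# Counts copied from the printed output above
n_users = 197_270
n_movies = 42_809
n_ratings = 3_200_020

# Fraction of all possible user-movie pairs that actually have a rating
density = n_ratings / (n_users * n_movies)
sparsity = 1 - density

print(f"Density:  {density:.4%}")
print(f"Sparsity: {sparsity:.4%}")
print(f"Average ratings per user:  {n_ratings / n_users:.1f}")
print(f"Average ratings per movie: {n_ratings / n_movies:.1f}")
```

With well under 0.1% of the matrix filled, neighborhood methods will often find few co-rated movies between any two users, which is worth keeping in mind when comparing algorithms.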


In [9]:
# Create a Reader object
reader = Reader(rating_scale=(0.5, 5.0))

# Load the data into SURPRISE format
data = Dataset.load_from_df(ratings_df[['userId', 'movieId', 'rating']], reader)

# Define the algorithms to compare (I ran them one at a time because of my laptop's limited performance)
algorithms = {
    'SVD': SVD(),
    # 'NMF': NMF(),
    # 'KNNBasic': KNNBasic(),
    # 'SlopeOne': SlopeOne(),
    # 'CoClustering': CoClustering()
}

# Dictionary to store the results
results = {}

# Perform cross-validation for each algorithm
for name, algorithm in algorithms.items():
    print(f"\nEvaluating {name}...")
    cv_results = cross_validate(algorithm, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
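    # np.mean works for both the ndarray scores and the tuple timings
    # (fit_time and test_time are returned as tuples, which have no .mean())
    results[name] = {
        'test_rmse': np.mean(cv_results['test_rmse']),
        'test_mae': np.mean(cv_results['test_mae']),
        'fit_time': np.mean(cv_results['fit_time']),
        'test_time': np.mean(cv_results['test_time'])
    }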


Evaluating KNNBasic...
Computing the msd similarity matrix...


MemoryError: Unable to allocate 281. GiB for an array with shape (194281, 194281) and data type float64
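
The size in the error message follows directly from the shape: a dense float64 matrix needs 8 bytes per entry. The sketch below reproduces that arithmetic; `similarity_matrix_gib` is a hypothetical helper written for this notebook, not part of `Surprise`, and the item-based figure uses the movie count of the whole sample as a rough upper bound.

```python
def similarity_matrix_gib(n, bytes_per_value=8):
    """Memory in GiB for a dense n x n matrix (float64 uses 8 bytes per value)."""
    return n * n * bytes_per_value / 2**30

# User-based similarity: one row and column per user in the training fold
print(f"User-based: {similarity_matrix_gib(194_281):.0f} GiB")

# Item-based similarity: one row and column per movie (42809 in this sample)
print(f"Item-based: {similarity_matrix_gib(42_809):.1f} GiB")
```

`KNNBasic` is user-based by default; passing `sim_options={'user_based': False}` switches it to item-item similarities, which is much smaller here but may still be too large for a laptop.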

# SVD Algorithm Results

| Metric | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Mean | Std |
|--------|--------|--------|--------|--------|--------|------|------|
| RMSE (testset) | 0.8978 | 0.8973 | 0.8950 | 0.8964 | 0.8962 | 0.8966 | 0.0010 |
| MAE (testset) | 0.6855 | 0.6857 | 0.6843 | 0.6849 | 0.6855 | 0.6852 | 0.0005 |
| Fit time | 40.59 | 41.71 | 44.94 | 42.58 | 43.14 | 42.59 | 1.46 |
| Test time | 4.71 | 4.54 | 4.55 | 4.45 | 4.77 | 4.60 | 0.12 |

These are solid results for SVD. An RMSE of about 0.897 means the typical prediction error is roughly 0.9 stars, with large misses weighted more heavily than small ones; the MAE shows that the average absolute error is about 0.69 stars. The consistency across folds (a standard deviation of 0.001 for RMSE) suggests the model is stable.
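
To make the difference between the two metrics concrete, here is a small worked example on made-up ratings (the four values below are illustrative, not taken from the dataset). It computes both measures by hand, without `Surprise`.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: squaring penalizes large misses more."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error: every miss counts in proportion to its size."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Toy ratings: three close predictions and one miss of two stars
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.5, 3.0, 2.0]

toy_rmse = rmse(actual, predicted)
toy_mae = mae(actual, predicted)
print(f"RMSE: {toy_rmse:.4f}")
print(f"MAE:  {toy_mae:.4f}")
```

The single two-star miss pulls the RMSE above the MAE, which is the same pattern seen in the SVD table (0.8966 versus 0.6852).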

In [8]:
# Perform cross-validation for NMF algorithm
name = 'NMF'
algorithm = NMF()
print(f"\nEvaluating {name}...")
cv_results = cross_validate(algorithm, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

# Let's check the type and structure of cv_results
print("Type of cv_results:", type(cv_results))
print("Keys in cv_results:", cv_results.keys())

# Try accessing the results differently
for key in cv_results:
    print(f"Key: {key}, Type: {type(cv_results[key])}")
    if hasattr(cv_results[key], 'mean'):
        print(f"Mean {key}:", cv_results[key].mean())
    else:
        print(f"Value {key}:", cv_results[key])


Evaluating NMF...
Evaluating RMSE, MAE of algorithm NMF on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9430  0.9459  0.9428  0.9461  0.9472  0.9450  0.0018  
MAE (testset)     0.7173  0.7187  0.7170  0.7188  0.7199  0.7183  0.0011  
Fit time          58.66   58.54   61.70   75.01   64.71   63.72   6.08    
Test time         3.88    3.57    4.45    4.37    3.72    4.00    0.35    
Type of cv_results: <class 'dict'>
Keys in cv_results: dict_keys(['test_rmse', 'test_mae', 'fit_time', 'test_time'])
Key: test_rmse, Type: <class 'numpy.ndarray'>
Mean test_rmse: 0.9450117363388463
Key: test_mae, Type: <class 'numpy.ndarray'>
Mean test_mae: 0.7183499573462457
Key: fit_time, Type: <class 'tuple'>
Value fit_time: (58.65761923789978, 58.538286447525024, 61.69825291633606, 75.01252555847168, 64.7076563835144)
Key: test_time, Type: <class 'tuple'>
Value test_time: (3.8787639141082764, 3.565955877304077, 4.453284978866577, 4.3660924434
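
The output above shows why calling `.mean()` on every entry fails: the scores are NumPy arrays, but `fit_time` and `test_time` are plain tuples. One possible way to handle both is a small helper built on `statistics.fmean`, which accepts any iterable of numbers. `summarize_cv` is a hypothetical name, and the fold values are copied from the NMF output above.

```python
from statistics import fmean

def summarize_cv(cv_results):
    """Return the mean of every entry in a cross_validate-style dict."""
    return {key: fmean(values) for key, values in cv_results.items()}

# Fold values copied from the NMF output above
nmf_cv = {
    'test_rmse': [0.9430, 0.9459, 0.9428, 0.9461, 0.9472],
    'fit_time': (58.65761923789978, 58.538286447525024, 61.69825291633606,
                 75.01252555847168, 64.7076563835144),
}

nmf_summary = summarize_cv(nmf_cv)
print(nmf_summary)
```

Storing the summary with `results['NMF'] = summarize_cv(cv_results)` after each separate run would let the final results cell include every algorithm that finished.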

In [11]:
# Perform cross-validation for KNNBasic algorithm
name = 'KNNBasic'
algorithm = KNNBasic()
print(f"\nEvaluating {name}...")
cv_results = cross_validate(algorithm, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

# Let's check the type and structure of cv_results
print("Type of cv_results:", type(cv_results))
print("Keys in cv_results:", cv_results.keys())

# Try accessing the results differently
for key in cv_results:
    print(f"Key: {key}, Type: {type(cv_results[key])}")
    if hasattr(cv_results[key], 'mean'):
        print(f"Mean {key}:", cv_results[key].mean())
    else:
        print(f"Value {key}:", cv_results[key])


Evaluating KNNBasic...
Computing the msd similarity matrix...


MemoryError: Unable to allocate 281. GiB for an array with shape (194260, 194260) and data type float64
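
One possible workaround for the memory error is to run `KNNBasic` on a smaller, denser subset, keeping only users and movies with enough ratings. The sketch below uses pandas only; `filter_dense` and its thresholds are hypothetical and would need tuning for the real data. It is demonstrated on a toy frame with the same column names as `ratings_df`.

```python
import pandas as pd

def filter_dense(ratings, min_user_ratings, min_movie_ratings):
    """Keep rows whose user and movie both meet a minimum rating count."""
    user_counts = ratings['userId'].value_counts()
    movie_counts = ratings['movieId'].value_counts()
    active_users = user_counts[user_counts >= min_user_ratings].index
    popular_movies = movie_counts[movie_counts >= min_movie_ratings].index
    mask = (ratings['userId'].isin(active_users)
            & ratings['movieId'].isin(popular_movies))
    return ratings[mask]

# Toy data: user 3 and movie 30 each appear only once
toy = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2, 3],
    'movieId': [10, 20, 30, 10, 20, 10],
    'rating':  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

small = filter_dense(toy, min_user_ratings=2, min_movie_ratings=2)
print(small)
```

Note that results on a filtered subset are not directly comparable with the SVD and NMF numbers above unless those algorithms are rerun on the same subset.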

In [None]:
# Display the results
results_df = pd.DataFrame(results).T
print("\nResults:")
print(results_df.sort_values('test_rmse'))

# Plot the RMSE for each algorithm
plt.figure(figsize=(10, 6))
sns.barplot(x=results_df.index, y='test_rmse', data=results_df)
plt.title('RMSE by Algorithm')
plt.ylabel('RMSE (lower is better)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
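
Because the algorithms were run in separate cells, `results` may hold only what the loop itself produced. As a sketch, the comparison can also be assembled from the mean values reported in the SVD and NMF outputs above; `reported` and `comparison_df` are names chosen here, and the other three algorithms are left out because this notebook has no completed runs for them.

```python
import pandas as pd

# Mean values copied from the SVD table and the NMF output above
reported = {
    'SVD': {'test_rmse': 0.8966, 'test_mae': 0.6852,
            'fit_time': 42.59, 'test_time': 4.60},
    'NMF': {'test_rmse': 0.9450, 'test_mae': 0.7183,
            'fit_time': 63.72, 'test_time': 4.00},
}

# One row per algorithm, best (lowest) RMSE first
comparison_df = pd.DataFrame(reported).T.sort_values('test_rmse')
print(comparison_df)
```

On these numbers SVD has both the lower error and the shorter fit time, while NMF predicts slightly faster at test time.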