### Building a Recommender System with Surprise

This try-it explores additional algorithms in the `Surprise` library for generating recommendations. Your goal is to identify the best-performing algorithm by comparing mean squared error (MSE) under cross-validation. You will also choose a dataset from the [GroupLens](https://grouplens.org/datasets/movielens/) MovieLens example datasets.
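
For reference, the two error measures computed later in this notebook can be written as follows. The notation is chosen here for illustration: $r_{ui}$ is the observed rating of user $u$ for item $i$, $\hat{r}_{ui}$ is the predicted rating, and $T$ is the held-out test set of one fold. Cross-validation averages the per-fold values over the folds.

$$
\mathrm{MSE} = \frac{1}{|T|} \sum_{(u,i) \in T} \left(r_{ui} - \hat{r}_{ui}\right)^2,
\qquad
\mathrm{MAE} = \frac{1}{|T|} \sum_{(u,i) \in T} \left|r_{ui} - \hat{r}_{ui}\right|
$$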

To begin, head over to GroupLens and examine the available datasets. Choose one that maps easily onto the format `Surprise` expects: one user, one item, and one rating per row. Then compare the performance of at least the `KNNBasic`, `SVD`, `NMF`, `SlopeOne`, and `CoClustering` algorithms. For more information on the algorithms, see the [prediction algorithms package documentation](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html).

Share your findings with your peers, including the cross-validation results and a basic description of your dataset.



In [17]:
from surprise import Dataset, Reader, SVD, NMF, KNNBasic, SlopeOne, CoClustering
from surprise.model_selection import cross_validate

import pandas as pd

In [18]:
# Load the ratings and the movie metadata
ratings_df = pd.read_csv('data/ml-latest-small/ratings.csv')
movies_df = pd.read_csv('data/ml-latest-small/movies.csv')

# Attach movie titles to each rating (used later for the descriptive analysis)
df = pd.merge(ratings_df, movies_df, on='movieId')

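# Create a Surprise dataset; ml-latest-small ratings run from 0.5 to 5.0 in
# half-star steps, so the scale must start at 0.5 rather than 1
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(df[['userId', 'movieId', 'rating']], reader)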
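
The `rating_scale` passed to `Reader` should cover the full range of ratings in the data, so it is worth checking that range before choosing it. The following is a minimal sketch of such a check, written with the standard library so that it runs without the dataset. The helper `rating_bounds` and the sample rows are hypothetical and only mimic the column layout of `ratings.csv`.

```python
import csv
import io


def rating_bounds(csv_text, column="rating"):
    """Return (min, max) of a numeric column in CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[column]) for row in reader]
    if not values:
        raise ValueError("no ratings found")
    return min(values), max(values)


# Hypothetical sample in the layout of ratings.csv (values invented for illustration)
sample = """userId,movieId,rating,timestamp
1,10,4.0,1000
1,20,0.5,1001
2,10,5.0,1002
"""

low, high = rating_bounds(sample)
print(low, high)
```

On the DataFrame loaded in this notebook, the same check is `ratings_df['rating'].min()` and `ratings_df['rating'].max()`, and the resulting pair can be passed as `Reader(rating_scale=(low, high))`.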

In [19]:
# Define the algorithms to compare
algorithms = [
    KNNBasic(),
    SVD(),
    NMF(),
    SlopeOne(),
    CoClustering()
]


In [20]:
from statistics import mean

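# Run 5-fold cross-validation for each algorithm and keep the mean of each metric
results = {}
for a in algorithms:
    cv_results = cross_validate(a, data, measures=['MSE', 'MAE'], cv=5, verbose=False)
    results[a.__class__.__name__] = {
        # test_mse and test_mae are NumPy arrays with one value per fold
        'MSE': cv_results['test_mse'].mean(),
        'MAE': cv_results['test_mae'].mean(),
        # fit_time and test_time are tuples of per-fold times in seconds
        'Fit time': mean(cv_results['fit_time']),
        'Test time': mean(cv_results['test_time'])
    }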

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
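
The `cross_validate` call above splits the ratings into `cv=5` folds, fits on four of them, scores on the fifth, and repeats so that each fold is used once for testing. The sketch below illustrates that procedure in plain Python. It is not Surprise's implementation: the function names are hypothetical, the ratings are invented, and the model is replaced by a baseline that always predicts the mean training rating.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and split it into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_mse(ratings, k=5, seed=0):
    """Per-fold MSE of a baseline that predicts the mean training rating."""
    scores = []
    for test_idx in k_fold_indices(len(ratings), k, seed):
        held_out = set(test_idx)
        train = [r for j, r in enumerate(ratings) if j not in held_out]
        prediction = sum(train) / len(train)  # "fit" on the other k-1 folds
        errors = [(ratings[j] - prediction) ** 2 for j in test_idx]
        scores.append(sum(errors) / len(errors))  # score on the held-out fold
    return scores


# Hypothetical ratings, invented for illustration
ratings = [4.0, 3.5, 5.0, 2.0, 4.5, 3.0, 1.0, 4.0, 3.5, 5.0]
fold_mse = cross_validated_mse(ratings, k=5)
print(sum(fold_mse) / len(fold_mse))  # mean MSE across the folds
```

The value reported per algorithm in this notebook corresponds to the final line: the mean of the per-fold scores.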


In [21]:
results_df = pd.DataFrame(results).T
results_df

                   MSE       MAE  Fit time  Test time
KNNBasic      0.896997  0.726086  0.056438   1.487114
SVD           0.763921  0.671784  0.677356   0.876277
NMF           0.847899  0.705788  1.492872   0.857606
SlopeOne      0.811470  0.688301  2.920656   4.375912
CoClustering  0.888312  0.729843  1.501372   0.879219
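
MSE is expressed in squared rating units, which makes the gaps between algorithms hard to judge. As a rough aid, the sketch below ranks the algorithms by the MSE values reported in the table and converts each to RMSE, which is on the same scale as the ratings. Note that this is the square root of the mean MSE, which is close to but not identical to the mean of the per-fold RMSE values.

```python
import math

# Mean cross-validated MSE per algorithm, copied from the results table
mse = {
    "KNNBasic": 0.896997,
    "SVD": 0.763921,
    "NMF": 0.847899,
    "SlopeOne": 0.81147,
    "CoClustering": 0.888312,
}

# Sort ascending by MSE, then report RMSE and the relative gap to the best score
ranking = sorted(mse.items(), key=lambda item: item[1])
best_mse = ranking[0][1]
for name, value in ranking:
    rmse = math.sqrt(value)
    gap = 100 * (value - best_mse) / best_mse
    print(f"{name:<13} MSE={value:.4f}  RMSE={rmse:.4f}  +{gap:.1f}% vs best")
```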


In [24]:
# Determine the best algorithm based on MSE
best_algo = min(results, key=lambda x: results[x]['MSE'])
print(f"The best performing algorithm based on MSE is: {best_algo}")
print(f"with an MSE of {results[best_algo]['MSE']:.4f}")

# Additional analysis: the 10 movies with the most ratings
top_movies = df.groupby('title').size().sort_values(ascending=False).head(10)
print("\nTop 10 most rated movies:")
print(top_movies)

# Additional analysis: distribution of per-movie average ratings
avg_ratings = df.groupby('title')['rating'].mean().sort_values(ascending=False)
print("\nRating distribution:")
print(avg_ratings.describe())


The best performing algorithm based on MSE is: SVD
with an MSE of 0.7639

Top 10 most rated movies:
title
Forrest Gump (1994)                          329
Shawshank Redemption, The (1994)             317
Pulp Fiction (1994)                          307
Silence of the Lambs, The (1991)             279
Matrix, The (1999)                           278
Star Wars: Episode IV - A New Hope (1977)    251
Jurassic Park (1993)                         238
Braveheart (1995)                            237
Terminator 2: Judgment Day (1991)            224
Schindler's List (1993)                      220
dtype: int64

Rating distribution:
count    9719.000000
mean        3.262388
std         0.870004
min         0.500000
25%         2.800000
50%         3.416667
75%         3.910357
max         5.000000
Name: rating, dtype: float64
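
The summary above averages over every title, including titles that may have only a handful of ratings, so extreme averages such as 0.5 or 5.0 can come from very few votes. A minimal sketch of one way to guard against that is to require a minimum number of ratings before averaging. The helper `average_ratings`, the threshold, and the sample rows are hypothetical.

```python
from collections import defaultdict


def average_ratings(rows, min_count=1):
    """Mean rating per title, keeping only titles with at least min_count ratings."""
    by_title = defaultdict(list)
    for title, rating in rows:
        by_title[title].append(rating)
    return {
        title: sum(values) / len(values)
        for title, values in by_title.items()
        if len(values) >= min_count
    }


# Hypothetical (title, rating) pairs, invented for illustration
rows = [
    ("Movie A", 5.0),
    ("Movie B", 4.0), ("Movie B", 3.0), ("Movie B", 5.0),
    ("Movie C", 0.5),
]
print(average_ratings(rows))               # every title, including single-rating ones
print(average_ratings(rows, min_count=3))  # only titles with three or more ratings
```

With pandas, a comparable filter can be built from `df.groupby('title')['rating'].agg(['mean', 'count'])` by keeping the rows whose `count` meets the threshold.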


In [25]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
avg_ratings.hist(bins=20)
plt.title('Distribution of Average Movie Ratings')
plt.xlabel('Average Rating')
plt.ylabel('Number of Movies')
plt.savefig('rating_distribution.png')
plt.close()

print("\nRating distribution histogram saved as 'rating_distribution.png'")


Rating distribution histogram saved as 'rating_distribution.png'
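
The notebook stops at comparing algorithms, while the stated aim is to generate recommendations. The sketch below shows one possible last step, which is not part of the original notebook: turning predicted ratings into a top-N list per user. It works on plain `(user, item, estimate)` triples that are invented here. With Surprise, such triples would typically come from fitting the chosen algorithm on `data.build_full_trainset()` and calling `algo.test(trainset.build_anti_testset())`, whose predictions expose `uid`, `iid`, and `est` fields. The helper name `top_n` is hypothetical.

```python
from collections import defaultdict


def top_n(predictions, n=10):
    """Group (user, item, estimate) triples by user and keep the n highest estimates."""
    per_user = defaultdict(list)
    for user, item, estimate in predictions:
        per_user[user].append((item, estimate))
    return {
        user: sorted(items, key=lambda pair: pair[1], reverse=True)[:n]
        for user, items in per_user.items()
    }


# Hypothetical predictions, invented for illustration
predictions = [
    (1, "m1", 3.2), (1, "m2", 4.7), (1, "m3", 4.1),
    (2, "m1", 2.5), (2, "m3", 3.9),
]
print(top_n(predictions, n=2))
```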
