### Building a Recommender System with Surprise

This try-it explores additional algorithms in the `Surprise` library for generating recommendations. Your goal is to identify the best algorithm by minimizing the cross-validated error, reported below as RMSE (the square root of the mean squared error). You will also select a dataset from the [grouplens](https://grouplens.org/datasets/movielens/) example datasets.

To begin, head over to grouplens and examine the different datasets available. Choose one that is easy to shape into the user, item, and rating columns that `Surprise` expects. Then, compare the performance of at least the `KNNBasic`, `SVD`, `NMF`, `SlopeOne`, and `CoClustering` algorithms to build your recommendations. For more information on the algorithms, see the [documentation for the prediction algorithms package](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html).

Share the results of your investigation with your peers, including your cross-validation results and a basic description of your dataset.



In [10]:
from surprise import Dataset, Reader, SVD, NMF, KNNBasic, SlopeOne, CoClustering
from surprise.model_selection import cross_validate

import pandas as pd

In [53]:

ratings_df = pd.read_csv('data/ratings.csv')
movies_df = pd.read_csv('data/movies.csv')

# Join the dataframes
df = pd.merge(ratings_df, movies_df, on='movieId')


df

index,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
...,...,...,...,...,...,...
100831,610,160341,2.5,1479545749,Bloodmoon (1997),Action|Thriller
100832,610,160527,4.5,1479544998,Sympathy for the Underdog (1971),Action|Crime|Drama
100833,610,160836,3.0,1493844794,Hazard (2005),Action|Drama|Thriller
100834,610,163937,3.5,1493848789,Blair Witch (2016),Horror|Thriller


In [54]:
# Create a Surprise dataset
# MovieLens ratings run from 0.5 to 5 in half-star steps
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(df[['userId', 'title', 'rating']], reader)
print(data)

<surprise.dataset.DatasetAutoFolds object at 0x7f32f5913dc0>
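
The `rating_scale` passed to `Reader` matters beyond loading: by default, Surprise clips each estimate to that range, so a scale that does not match the data can distort predictions at the extremes. The sketch below illustrates that clipping in plain Python, assuming a 0.5 to 5 scale. `clip_estimate` and the raw estimates are hypothetical, written only for this illustration, and are not part of the Surprise API.

```python
def clip_estimate(estimate, rating_scale=(0.5, 5.0)):
    """Bound a raw estimate to the rating scale (hypothetical helper)."""
    lower, upper = rating_scale
    return min(max(estimate, lower), upper)


# Made-up raw estimates: one above the scale, one inside, one below
raw_estimates = [5.37, 4.69, 0.12]
clipped = [clip_estimate(e) for e in raw_estimates]
print(clipped)
```

With a scale of `(1, 5)`, an estimate for a movie the user would rate 0.5 can never go below 1, which is one reason to match the scale to the dataset.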


In [55]:
algorithms = [
    KNNBasic(),
    SVD(),
    NMF(),
    SlopeOne(),
    CoClustering()
]

In [56]:
from statistics import mean

# Run 5-fold cross validation for each algorithm and keep the fold averages
results = {}
for a in algorithms:
    cv_results = cross_validate(a, data, measures=['RMSE', 'MAE'], cv=5, verbose=False)
    results[a.__class__.__name__] = {
        'RMSE': cv_results['test_rmse'].mean(),
        'MAE': cv_results['test_mae'].mean(),
        # fit_time and test_time are tuples (one entry per fold), hence statistics.mean
        'Fit time': mean(cv_results['fit_time']),
        'Test time': mean(cv_results['test_time'])
    }

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
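
The similarity matrix is computed five times because `cv=5` fits `KNNBasic` once per fold: each fold trains on four fifths of the ratings and tests on the remaining fifth. The sketch below shows that partitioning idea in plain Python. `k_fold_indices` is a hypothetical helper for illustration; Surprise's own splitter may shuffle and size the folds differently.

```python
import random


def k_fold_indices(n_ratings, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each rating is tested exactly once."""
    indices = list(range(n_ratings))
    random.Random(seed).shuffle(indices)
    for fold in range(n_splits):
        # Take every n_splits-th shuffled index as this fold's test set
        test_idx = indices[fold::n_splits]
        test_set = set(test_idx)
        train_idx = [i for i in indices if i not in test_set]
        yield train_idx, test_idx


folds = list(k_fold_indices(n_ratings=10, n_splits=5))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

Averaging the per-fold errors, as the loop above does with `.mean()`, gives a steadier estimate than a single train/test split.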


In [57]:
results_df = pd.DataFrame(results).T
results_df

Algorithm,RMSE,MAE,Fit time (s),Test time (s)
KNNBasic,0.946383,0.725679,0.047559,0.635909
SVD,0.873204,0.670851,0.727737,0.119598
NMF,0.923155,0.707298,1.061581,0.087091
SlopeOne,0.90084,0.688293,1.975979,2.996339
CoClustering,0.9441,0.731114,1.10197,0.056912
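
For reference, the two error columns can be computed from pairs of true and estimated ratings as sketched below. The pairs are made-up toy values, and `rmse` and `mae` are hypothetical helpers rather than Surprise functions. Because RMSE is the square root of MSE, ranking algorithms by RMSE gives the same order as ranking them by MSE.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(t - e) for t, e in pairs) / len(pairs)


# Toy (true rating, estimated rating) pairs, not taken from the dataset
pairs = [(4.0, 3.5), (5.0, 4.0), (3.0, 3.5), (2.0, 3.0)]
print(f"RMSE: {rmse(pairs):.4f}, MAE: {mae(pairs):.4f}")
```

RMSE penalizes large misses more heavily than MAE, so it is never smaller than MAE on the same pairs.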


In [58]:
# Determine the best algorithm based on RMSE
best_algo = min(results, key=lambda x: results[x]['RMSE'])
print(f"The best performing algorithm based on RMSE is: {best_algo}")
print(f"with an RMSE of {results[best_algo]['RMSE']:.4f}")

# Additional analysis: top 10 most rated movies (by number of ratings)
top_movies = df.groupby('title').size().sort_values(ascending=False).head(10)
print("\nTop 10 most rated movies:")
print(top_movies)

# Additional analysis: distribution of per-movie average ratings
avg_ratings = df.groupby('title')['rating'].mean().sort_values(ascending=False)
print("\nRating distribution:")
print(avg_ratings.describe())

The best performing algorithm based on RMSE is: SVD
with an RMSE of 0.8732

Top 10 most rated movies:
title
Forrest Gump (1994)                          329
Shawshank Redemption, The (1994)             317
Pulp Fiction (1994)                          307
Silence of the Lambs, The (1991)             279
Matrix, The (1999)                           278
Star Wars: Episode IV - A New Hope (1977)    251
Jurassic Park (1993)                         238
Braveheart (1995)                            237
Terminator 2: Judgment Day (1991)            224
Schindler's List (1993)                      220
dtype: int64

Rating distribution:
count    9719.000000
mean        3.262388
std         0.870004
min         0.500000
25%         2.800000
50%         3.416667
75%         3.910357
max         5.000000
Name: rating, dtype: float64


In [59]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
avg_ratings.hist(bins=20)
plt.title('Distribution of Average Movie Ratings')
plt.xlabel('Average Rating')
plt.ylabel('Number of Movies')
plt.savefig('rating_distribution.png')
plt.close()

print("\nRating distribution histogram saved as 'rating_distribution.png'")


Rating distribution histogram saved as 'rating_distribution.png'


In [60]:
# Fit SVD on the full dataset, then inspect its predictions on the ratings it was trained on

In [61]:
train = data.build_full_trainset()
print(type(train))

<class 'surprise.trainset.Trainset'>


In [62]:
# Fit a small SVD model (2 latent factors) on the full training set
model = SVD(n_factors=2)
model.fit(train)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f32fc7210a0>

In [63]:
### GRADED
test = ''
predictions_list = ''

    
### BEGIN SOLUTION
# build_testset() returns the ratings already in the trainset, so these are in-sample predictions
test = train.build_testset()
predictions_list = model.test(test)
### END SOLUTION

### ANSWER CHECK
predictions_list[:10]

[Prediction(uid=1, iid='Toy Story (1995)', r_ui=4.0, est=4.6898858290068866, details={'was_impossible': False}),
 Prediction(uid=1, iid='Grumpier Old Men (1995)', r_ui=4.0, est=4.06917281924559, details={'was_impossible': False}),
 Prediction(uid=1, iid='Heat (1995)', r_ui=4.0, est=4.7318166299007425, details={'was_impossible': False}),
 Prediction(uid=1, iid='Seven (a.k.a. Se7en) (1995)', r_ui=5.0, est=4.787494813060812, details={'was_impossible': False}),
 Prediction(uid=1, iid='Usual Suspects, The (1995)', r_ui=5.0, est=5, details={'was_impossible': False}),
 Prediction(uid=1, iid='From Dusk Till Dawn (1996)', r_ui=3.0, est=4.368908035501934, details={'was_impossible': False}),
 Prediction(uid=1, iid='Bottle Rocket (1996)', r_ui=5.0, est=4.724177534688534, details={'was_impossible': False}),
 Prediction(uid=1, iid='Braveheart (1995)', r_ui=4.0, est=4.8026296665961, details={'was_impossible': False}),
 Prediction(uid=1, iid='Rob Roy (1995)', r_ui=5.0, est=4.370921293087549, details={

In [67]:
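# Collect the predictions into columns (named so the Surprise `data` object is not overwritten)
prediction_columns = {'user_id': [p.uid for p in predictions_list],
                      'title': [p.iid for p in predictions_list],
                      'user_rating': [p.r_ui for p in predictions_list],
                      'svd_rating': [p.est for p in predictions_list]}

hybrid_df = pd.DataFrame(prediction_columns)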
hybrid_df

index,user_id,title,user_rating,svd_rating
0,1,Toy Story (1995),4.0,4.689886
1,1,Grumpier Old Men (1995),4.0,4.069173
2,1,Heat (1995),4.0,4.731817
3,1,Seven (a.k.a. Se7en) (1995),5.0,4.787495
4,1,"Usual Suspects, The (1995)",5.0,5.000000
...,...,...,...,...
100831,578,"Young Victoria, The (2009)",4.5,4.041760
100832,578,Cold Creek Manor (2003),2.5,3.785572
100833,578,Cheaper by the Dozen (1950),4.0,3.790436
100834,578,My Blueberry Nights (2007),4.0,3.928128
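
To turn a list of predictions like this into recommendations, one common approach is to keep each user's top-N items by estimated rating. The sketch below does this in plain Python. The `Prediction` namedtuple is a stand-in that mirrors the field names shown in the output above, and `top_n_per_user` and the toy predictions are hypothetical. Note that the predictions in this notebook cover movies each user has already rated; for real recommendations you would score unrated (user, movie) pairs instead, for example those returned by the trainset's `build_anti_testset()`.

```python
from collections import defaultdict, namedtuple

# Stand-in with the same field names as the Prediction objects printed above
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])


def top_n_per_user(predictions, n=2):
    """Map each user id to its n highest-estimated (item, estimate) pairs."""
    by_user = defaultdict(list)
    for p in predictions:
        by_user[p.uid].append((p.iid, p.est))
    return {
        uid: sorted(items, key=lambda pair: pair[1], reverse=True)[:n]
        for uid, items in by_user.items()
    }


# Toy predictions, not taken from the dataset
toy_predictions = [
    Prediction(1, "Movie A", 4.0, 4.7, {}),
    Prediction(1, "Movie B", 4.0, 4.1, {}),
    Prediction(1, "Movie C", 5.0, 4.8, {}),
    Prediction(2, "Movie A", 3.0, 3.2, {}),
]
top_n = top_n_per_user(toy_predictions, n=2)
print(top_n)
```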
