<a href="https://colab.research.google.com/github/Vikram310/100daysofdatascience/blob/main/Suprise_postread.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Surprise Library

**Surprise** is a Python scikit for building and analyzing recommender systems that deal with explicit rating data.

### Installation:

You can install Surprise with pip using the following command:

```pip install scikit-surprise```

Surprise is a Python module for building and testing rating prediction systems. Its API was designed to closely resemble scikit-learn's, so users familiar with the Python machine learning ecosystem should feel at home.

Surprise includes a set of estimators (or prediction algorithms) for predicting ratings. Classic techniques, such as the main similarity-based (neighborhood) algorithms, as well as matrix factorization algorithms like SVD, are implemented.

It also includes tools for model evaluation and selection, modeled on scikit-learn: cross-validation iterators, built-in accuracy metrics, and grid search and randomized search for automatic hyper-parameter tuning.

In [None]:
# Loading the necessary libraries

In [None]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Reading the data

In [None]:
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
users = pd.read_csv('users.csv')

In [None]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [None]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,16,4.0,1217897793
1,1,24,1.5,1217895807
2,1,32,4.0,1217896246
3,1,47,4.0,1217896556
4,1,50,4.0,1217896523


In [None]:
users.head()

Unnamed: 0,userId,age,time_spent_per_day
0,1,16,3.976315
1,2,24,1.891303
2,3,20,4.521478
3,4,23,2.095284
4,5,35,1.75986


In [None]:
## Data Preprocessing

In [None]:
## Merging the dataframes

In [None]:
df_1 = pd.merge(movies, ratings, how='inner', on='movieId')
df_1.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,2,5.0,859046895
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,1303501039
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,8,5.0,858610933
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,11,4.0,850815810
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,14,4.0,851766286


In [None]:
df_2 = pd.merge(df_1, users, how='inner', on='userId')
df_2.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp,age,time_spent_per_day
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,2,5.0,859046895,24,1.891303
1,3,Grumpier Old Men (1995),Comedy|Romance,2,2.0,859046959,24,1.891303
2,5,Father of the Bride Part II (1995),Comedy,2,3.0,859046959,24,1.891303
3,14,Nixon (1995),Drama,2,4.0,859047091,24,1.891303
4,17,Sense and Sensibility (1995),Drama|Romance,2,5.0,859046896,24,1.891303


In [None]:
data = df_2[['userId','movieId','rating']] #Considering only the userid, itemid and ratings
data.head()

Unnamed: 0,userId,movieId,rating
0,2,1,5.0
1,2,3,2.0
2,2,5,3.0
3,2,14,4.0
4,2,17,5.0


In [None]:
data.isna().sum() #Checking for null values and no null values are present in the data

userId     0
movieId    0
rating     0
dtype: int64

In [None]:
data.duplicated().sum() #Checking for the duplicates and no duplicate values present in the data

0

In [None]:
#pip install scikit-surprise

In [None]:
## Loading the necessary packages from the surprise library

In [None]:
from surprise import KNNWithMeans 
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise import Reader

In [None]:
## The Reader class is used to parse ratings. Here it only needs the rating scale; the data itself is passed as (user id, item id, rating) triples.
reader = Reader(rating_scale=(0.5 , 5))
# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(data[['userId','movieId','rating']], reader) # loading the data as per the format

In [None]:
anti_set = data.build_full_trainset().build_anti_testset()

An anti-testset is the set of user-item pairs for which no rating exists in the original dataset. These are the pairs whose ratings we want to predict.

For example, userId 2 has not rated movieId 63, so the pair (2, 63) belongs to the anti-testset.

Because these pairs have no true rating, Surprise fills the rating field with a default value, the global mean of all known ratings. We'll use our model to estimate a rating for each pair in this set.
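
To make the idea concrete, here is a minimal pure-Python sketch of how such a set could be built. The toy ratings and the helper name `build_anti_pairs` are made up for illustration; this is not Surprise's implementation, only the same idea on a tiny example.

```python
# Toy ratings as (user, item, rating) triples; the values are made up.
toy_ratings = [
    (1, "A", 4.0),
    (1, "B", 2.0),
    (2, "A", 5.0),
    (3, "C", 3.0),
]

def build_anti_pairs(ratings):
    """Return (user, item, fill) for every user-item pair with no rating."""
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    rated = {(u, i) for u, i, _ in ratings}
    # Pairs without a rating are filled with the mean of the known ratings.
    global_mean = sum(r for _, _, r in ratings) / len(ratings)
    return [(u, i, global_mean)
            for u in users for i in items
            if (u, i) not in rated]

anti_pairs = build_anti_pairs(toy_ratings)
print(len(anti_pairs))  # 3 users x 3 items = 9 pairs, minus 4 rated = 5
print(anti_pairs[0])    # (1, 'C', 3.5)
```

The size of the set grows with users times items, which is why the anti-testset for the full movie data is so large.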

In [None]:
trainset, testset = train_test_split(data, test_size=.15) # Splitting the data

### User - based collaborative filtering

#### Use user_based true/false to switch between user-based or item-based collaborative filtering
#### Using cosine similarity

#### ```KNNWithMeans``` is a basic collaborative filtering algorithm that takes into account the mean rating of each user.
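
As a rough illustration of the similarity measure, the sketch below computes cosine similarity between two users over the items both have rated. The function name `cosine_sim` and the sample ratings are assumptions made for this example; Surprise computes the full similarity matrix internally.

```python
from math import sqrt

def cosine_sim(ratings_u, ratings_v):
    """Cosine similarity over the items both users have rated.

    ratings_u and ratings_v map item id -> rating.
    """
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0  # no shared items: treat the users as unrelated
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = sqrt(sum(ratings_v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

# Made-up ratings: on the shared items A and B, v's ratings are half of u's.
user_u = {"A": 4.0, "B": 2.0}
user_v = {"A": 2.0, "B": 1.0, "C": 5.0}
sim_uv = cosine_sim(user_u, user_v)
print(round(sim_uv, 6))
```

Because cosine similarity ignores scale, two users whose ratings are proportional on their shared items get a similarity close to 1 even when they share only a few items.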

In [None]:
algo = KNNWithMeans(k = 50, sim_options={'name': 'cosine', 'user_based': True}) 

# k is the (max) number of neighbors to take into account for aggregation. For example, with k = 50 each prediction uses at most the 50 most similar users.
# Several similarity measures are available for comparing neighbors. Here, we use cosine similarity.
# When user_based = True, the algorithm performs user-based collaborative filtering.

algo.fit(trainset) #fitting the train dataset

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7f1f4987d910>

In [None]:
# run the trained model against the testset
test_pred = algo.test(testset)

In [None]:
test_pred[0]

Prediction(uid=308, iid=7438, r_ui=4.0, est=3.9420970637417803, details={'actual_k': 50, 'was_impossible': False})

uid – The (raw) user id. 

iid – The (raw) item id. 

r_ui (float) – The true rating .

est (float) – The estimated rating. For user-based collaborative filtering, it is the user's mean rating plus a similarity-weighted average of how far the neighbors' ratings of this item deviate from their own means.

details (dict) – Stores additional details about the prediction.

Within details, was_impossible indicates whether the algorithm was able to compute a prediction:
-  if was_impossible: False - the estimate was computed by the model.
-  if was_impossible: True - the prediction could not be computed (for example, the user or item is unknown), and est falls back to a default value, the global mean rating.
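
The estimate of a mean-centered k-NN model such as KNNWithMeans can be sketched in a few lines. The function name `knn_with_means_estimate` and the neighbor values are hypothetical; the sketch leaves out details of the library (such as the minimum number of neighbors) and is only meant to show the arithmetic.

```python
def knn_with_means_estimate(user_mean, neighbors, k=50):
    """Mean-centered k-NN estimate for one (user, item) pair.

    neighbors is a list of (similarity, neighbor_rating, neighbor_mean)
    for the other users who rated the item.
    """
    # Keep the k most similar neighbors that have positive similarity.
    top = sorted((n for n in neighbors if n[0] > 0),
                 key=lambda n: n[0], reverse=True)[:k]
    if not top:
        return user_mean  # no usable neighbors: fall back to the user's mean
    weighted = sum(sim * (rating - mean) for sim, rating, mean in top)
    total_sim = sum(sim for sim, _, _ in top)
    return user_mean + weighted / total_sim

# Made-up neighbors: (similarity, rating of the item, neighbor's mean rating).
toy_neighbors = [(0.9, 5.0, 4.0), (0.5, 2.0, 3.0), (-0.2, 1.0, 2.0)]
knn_est = knn_with_means_estimate(3.5, toy_neighbors)
print(round(knn_est, 4))  # 3.5 + (0.9*1.0 + 0.5*(-1.0)) / 1.4
```

The number of neighbors actually used corresponds to the `actual_k` value reported in `details`.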

In [None]:
# get RMSE on test set
print("User-based Model : Test Set")
accuracy.rmse(test_pred, verbose=True)

User-based Model : Test Set
RMSE: 0.8896


0.8895620551576681
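
For reference, RMSE and the related MAE are simple to compute by hand. The sketch below uses made-up (true rating, estimate) pairs and hypothetical helper names; `accuracy.rmse` reads the same two quantities from the `r_ui` and `est` fields of each prediction.

```python
from math import sqrt

def rmse_of(pairs):
    """Root mean squared error over (true_rating, estimate) pairs."""
    return sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))

def mae_of(pairs):
    """Mean absolute error over (true_rating, estimate) pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)

toy_pairs = [(4.0, 3.5), (2.0, 3.0), (5.0, 4.5)]  # made-up values
print(round(rmse_of(toy_pairs), 4))  # sqrt((0.25 + 1.0 + 0.25) / 3)
print(round(mae_of(toy_pairs), 4))   # (0.5 + 1.0 + 0.5) / 3
```

RMSE penalizes large errors more heavily than MAE, so it is always at least as large as MAE on the same predictions.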

In [None]:
# we can query for specific predictions
uid = str(196)  # raw user id
iid = str(302)  # raw item id

In [None]:
# get a prediction for specific users and items.
pred = algo.predict(uid, iid, verbose=True)

user: 196        item: 302        r_ui = None   est = 3.52   {'was_impossible': True, 'reason': 'User and/or item is unknown.'}


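For user ```196``` and movie ```302``` the true rating is None because none was supplied to ```predict```. The prediction was also flagged as impossible: the ids were passed as strings, while the dataset stores them as integers, so Surprise treats both as unknown. The estimate ```3.52``` is therefore just the global mean rating; passing the ids as integers (```uid = 196```) lets the model look them up.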
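
The "unknown" status reported above comes down to the type of the raw id. The sketch below uses a plain dictionary as a hypothetical stand-in for a mapping from raw ids to inner ids; it is not Surprise's internal structure, only an illustration of why an integer id and its string form do not match.

```python
# Hypothetical stand-in for a mapping from raw user ids to inner ids.
# The ids come from an integer column, so the keys are ints.
raw_to_inner_uid = {196: 0, 308: 1}

def lookup(raw_uid):
    """Return (known, inner_id); a str id does not match an int key."""
    if raw_uid in raw_to_inner_uid:
        return True, raw_to_inner_uid[raw_uid]
    return False, None

print(lookup(196))       # (True, 0)
print(lookup(str(196)))  # (False, None)
```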

In [None]:
anti_pre = algo.test(anti_set)
pred_df = pd.DataFrame(anti_pre).merge(movies , left_on = ['iid'], right_on = ['movieId'])
pred_df = pd.DataFrame(pred_df).merge(users , left_on = ['uid'], right_on = ['userId'])

In [None]:
pred_df.head()

Unnamed: 0,uid,iid,r_ui,est,details,movieId,title,genres,userId,age,time_spent_per_day
0,2,63,3.51685,3.284361,"{'actual_k': 16, 'was_impossible': False}",63,Don't Be a Menace to South Central While Drink...,Comedy|Crime,2,24,1.891303
1,2,110,3.51685,4.112406,"{'actual_k': 50, 'was_impossible': False}",110,Braveheart (1995),Action|Drama|War,2,24,1.891303
2,2,170,3.51685,3.329485,"{'actual_k': 29, 'was_impossible': False}",170,Hackers (1995),Action|Adventure|Crime|Thriller,2,24,1.891303
3,2,175,3.51685,3.69987,"{'actual_k': 23, 'was_impossible': False}",175,Kids (1995),Drama,2,24,1.891303
4,2,231,3.51685,3.394821,"{'actual_k': 50, 'was_impossible': False}",231,Dumb & Dumber (Dumb and Dumber) (1994),Adventure|Comedy,2,24,1.891303


Using pred_df, we can recommend to each user the movies whose estimated rating is 5.0, the top of the rating scale.

For example, we can generate recommendations for user ```200```:

In [None]:
pred_df[(pred_df['est']== 5.0)&(pred_df['userId']== 200)]

Unnamed: 0,uid,iid,r_ui,est,details,movieId,title,genres,userId,age,time_spent_per_day
6173681,200,49817,3.51685,5.0,"{'actual_k': 1, 'was_impossible': False}",49817,"Plague Dogs, The (1982)",Adventure|Animation|Drama,200,13,1.961149
6175135,200,418,3.51685,5.0,"{'actual_k': 1, 'was_impossible': False}",418,Being Human (1993),Drama,200,13,1.961149
6175138,200,649,3.51685,5.0,"{'actual_k': 1, 'was_impossible': False}",649,Cold Fever (Á köldum klaka) (1995),Comedy|Drama,200,13,1.961149
6179115,200,52767,3.51685,5.0,"{'actual_k': 1, 'was_impossible': False}",52767,21 Up (1977),Documentary,200,13,1.961149
6179505,200,1546,3.51685,5.0,"{'actual_k': 1, 'was_impossible': False}",1546,Schizopolis (1996),Comedy,200,13,1.961149
6179799,200,5304,3.51685,5.0,"{'actual_k': 1, 'was_impossible': False}",5304,"Rome, Open City (a.k.a. Open City) (Roma, citt...",Drama|War,200,13,1.961149
6180221,200,25961,3.51685,5.0,"{'actual_k': 1, 'was_impossible': False}",25961,"Gunfighter, The (1950)",Action|Western,200,13,1.961149
6180952,200,80969,3.51685,5.0,"{'actual_k': 1, 'was_impossible': False}",80969,Never Let Me Go (2010),Drama|Romance|Sci-Fi,200,13,1.961149
6181261,200,101862,3.51685,5.0,"{'actual_k': 1, 'was_impossible': False}",101862,50 Children: The Rescue Mission of Mr. And Mrs...,Documentary,200,13,1.961149
6181389,200,116136,3.51685,5.0,"{'actual_k': 1, 'was_impossible': False}",116136,Olive Kitteridge (2014),Drama,200,13,1.961149
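
Filtering on an exact estimate of 5.0 can return no rows for some users. A common alternative is to keep the N highest estimates per user. The sketch below shows that idea on made-up (uid, iid, est) triples; the helper name `top_n` is an assumption for this example.

```python
from collections import defaultdict

def top_n(predictions, n=3):
    """Group (uid, iid, est) triples by user and keep the n highest estimates."""
    by_user = defaultdict(list)
    for uid, iid, est in predictions:
        by_user[uid].append((iid, est))
    return {uid: sorted(items, key=lambda pair: pair[1], reverse=True)[:n]
            for uid, items in by_user.items()}

# Made-up estimates for two users.
toy_predictions = [
    (200, 418, 5.0), (200, 649, 4.2), (200, 63, 3.1), (200, 110, 4.8),
    (2, 63, 3.3), (2, 110, 4.1),
]
recommendations = top_n(toy_predictions, n=2)
print(recommendations[200])  # [(418, 5.0), (110, 4.8)]
print(recommendations[2])    # [(110, 4.1), (63, 3.3)]
```

With the predictions from `algo.test(anti_set)`, the same grouping could be applied to the `uid`, `iid`, and `est` fields of each prediction.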


### Item - based collaborative filtering

In [None]:
# k is the (max) number of neighbors to take into account for aggregation. For example, with k = 50 each prediction uses at most the 50 most similar items.
# Several similarity measures are available for comparing neighbors. Here, we use cosine similarity.
# When user_based = False, the algorithm performs item-based collaborative filtering.

algo_i = KNNWithMeans(k=50, sim_options={'name': 'cosine', 'user_based': False})
algo_i.fit(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7f1f4987d730>

In [None]:
# run the trained model against the testset
test_pred = algo_i.test(testset)

In [None]:
test_pred[0]

Prediction(uid=308, iid=7438, r_ui=4.0, est=3.8166317964999177, details={'actual_k': 50, 'was_impossible': False})

uid – The (raw) user id. 

iid – The (raw) item id. 

r_ui (float) – The true rating .

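est (float) – The estimated rating. For item-based collaborative filtering, it is the item's mean rating plus a similarity-weighted average of how far this user's ratings of the neighboring items deviate from those items' means.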

details (dict) – Stores additional details about the prediction.

Within details, was_impossible indicates whether the algorithm was able to compute a prediction:
-  if was_impossible: False - the estimate was computed by the model.
-  if was_impossible: True - the prediction could not be computed (for example, the user or item is unknown), and est falls back to a default value, the global mean rating.

In [None]:
# get RMSE on test set
print("Item-based Model : Test Set")
accuracy.rmse(test_pred, verbose=True)

Item-based Model : Test Set
RMSE: 0.8982


0.8982040421726045

In [None]:
# we can query for specific predictions
uid = str(196)  # raw user id
iid = str(303)  # raw item id

In [None]:
# get a prediction for specific users and items.
pred = algo_i.predict(uid, iid, verbose=True)

user: 196        item: 303        r_ui = None   est = 3.52   {'was_impossible': True, 'reason': 'User and/or item is unknown.'}


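For user ```196``` and movie ```303``` the true rating is None because none was supplied to ```predict```, and the estimate ```3.52``` is the global mean fallback: the string ids do not match the integer ids in the dataset, so the user and item are treated as unknown.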

Finding the movies that are closest to the movieId 1 (Toy Story) based on our training set for algo_i model.

In [None]:
tsr_inner_id = algo_i.trainset.to_inner_iid(1) #Considering the movieId 1

tsr_neighbors = algo_i.get_neighbors(tsr_inner_id, k=5) #Getting the 5 nearest neighbors for movieId 1

movies[movies.movieId.isin([algo_i.trainset.to_raw_iid(inner_id)
                       for inner_id in tsr_neighbors])] #Displaying the 5 nearest neighbors of Toy Story, using the same model (algo_i) for the id conversion

Unnamed: 0,movieId,title,genres
3089,3919,Hellraiser III: Hell on Earth (1992),Horror
4311,5668,White Oleander (2002),Drama
4635,6270,Akira Kurosawa's Dreams (Dreams) (1990),Drama|Fantasy
5865,8933,"Decline of the American Empire, The (Déclin de...",Comedy|Drama
9436,96606,Samsara (2011),Documentary


### Matrix Factorization

In [None]:
from surprise import SVD
from surprise.model_selection import cross_validate

In [None]:
svd = SVD() # Surprise's SVD learns the matrix factorization with stochastic gradient descent, whereas some other libraries use ALS
cross_validate(svd, data, measures=['rmse','mae'], cv = 5 , return_train_measures=True,verbose=True)
## The data is split into 5 folds; each fold serves once as the test set while the model is trained on the other four, and RMSE and MAE are computed for each split

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8633  0.8646  0.8688  0.8748  0.8710  0.8685  0.0042  
MAE (testset)     0.6631  0.6695  0.6705  0.6732  0.6717  0.6696  0.0035  
RMSE (trainset)   0.6385  0.6399  0.6364  0.6380  0.6354  0.6376  0.0016  
MAE (trainset)    0.4974  0.4980  0.4950  0.4973  0.4953  0.4966  0.0012  
Fit time          2.11    1.52    1.50    1.53    1.56    1.65    0.23    
Test time         2.09    0.18    0.23    0.16    0.16    0.56    0.76    


{'test_rmse': array([0.86334692, 0.86461463, 0.86879976, 0.87479712, 0.87096511]),
 'train_rmse': array([0.63854306, 0.63985388, 0.63643077, 0.63795128, 0.63539412]),
 'test_mae': array([0.66308555, 0.66950198, 0.67050782, 0.67321333, 0.67170409]),
 'train_mae': array([0.49735743, 0.49804085, 0.49499609, 0.49728045, 0.49533605]),
 'fit_time': (2.1121771335601807,
  1.5230817794799805,
  1.502671241760254,
  1.5310337543487549,
  1.5575673580169678),
 'test_time': (2.0912363529205322,
  0.18477225303649902,
  0.23265624046325684,
  0.1557314395904541,
  0.16026067733764648)}

The above data gives the RMSE and MAE values for each fold as well as average value and standard deviation value.

- ```test_rmse``` represents the RMSE values on the testsets.

- ```train_rmse``` represents the RMSE values on the trainsets.

- similarly, ```test_mae``` and ```train_mae``` represent the MAE values on the testsets and trainsets.

- ```fit_time``` represents the time taken to fit the model on the trainsets.

- ```test_time``` represents the time taken to compute predictions on the testsets.
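
The Mean and Std columns of the printed table can be reproduced from the per-fold arrays. The sketch below does this for `test_rmse` with the standard library, using the fold values shown in the output above; the population standard deviation is the one that matches the table.

```python
from statistics import mean, pstdev

# Per-fold test RMSE values copied from the cross_validate output above.
test_rmse = [0.86334692, 0.86461463, 0.86879976, 0.87479712, 0.87096511]

rmse_mean = mean(test_rmse)
rmse_std = pstdev(test_rmse)  # population standard deviation
print(f"{rmse_mean:.4f}")  # 0.8685
print(f"{rmse_std:.4f}")   # 0.0042
```

The gap between the test RMSE (about 0.87) and the train RMSE (about 0.64) in the table suggests the model fits the training folds noticeably better than unseen data.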

In [None]:
trainset = data.build_full_trainset()
svd.fit(trainset) ##Fitting the trainset with the help of svd

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f1f4a789e20>

In [None]:
svd.pu.shape , svd.qi.shape #pu gives the embeddings of Users and qi gives the embeddings of Items.

((668, 100), (10325, 100))
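
These embeddings enter the prediction through a dot product. As a minimal sketch, assuming the usual biased matrix factorization form (global mean plus user bias plus item bias plus the dot product of the two factor vectors), an estimate could be computed as below. The function name `svd_estimate` and all numbers are made up; in a fitted model the factors correspond to rows of `svd.pu` and `svd.qi`.

```python
def svd_estimate(global_mean, b_u=None, b_i=None, p_u=None, q_i=None,
                 rating_scale=(0.5, 5.0)):
    """Biased matrix factorization estimate for one (user, item) pair.

    Pass None for the terms of an unknown user or item; they are skipped.
    """
    est = global_mean
    if b_u is not None:
        est += b_u
    if b_i is not None:
        est += b_i
    if p_u is not None and q_i is not None:
        est += sum(p * q for p, q in zip(p_u, q_i))
    low, high = rating_scale
    return min(high, max(low, est))  # keep the estimate inside the scale

# Known user and item (made-up biases and 2-dimensional factors).
known = svd_estimate(3.5, b_u=0.3, b_i=-0.1, p_u=[0.2, -0.1], q_i=[0.5, 0.4])
# Unknown item: only the global mean and the user bias remain.
unknown_item = svd_estimate(3.5, b_u=0.3)
print(round(known, 2), round(unknown_item, 2))
```

When the item is unknown, every such estimate for a given user is the same value, which is a useful signal that the item ids did not match the training data.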

In [None]:
#Storing all the movie titles in items
items = movies['title'].unique()
##Considering the user '662' 
test = [[662, iid, 4] for iid in items]
##Finding the user predictions(ratings) for all the movies
predictions = svd.test(test)
pred = pd.DataFrame(predictions)

In [None]:
a = pred.sort_values(by='est', ascending=False) ##Sorting the values based on the estimated predictions

In [None]:
a[0:10] ##TOP 10

Unnamed: 0,uid,iid,r_ui,est,details
0,662,Toy Story (1995),4,4.122779,{'was_impossible': False}
6859,662,"Organization, The (1971)",4,4.122779,{'was_impossible': False}
6881,662,Lies My Father Told Me (1975),4,4.122779,{'was_impossible': False}
6882,662,We All Loved Each Other So Much (C'eravamo tan...,4,4.122779,{'was_impossible': False}
6883,662,Lady Vengeance (Sympathy for Lady Vengeance) (...,4,4.122779,{'was_impossible': False}
6884,662,49th Parallel (1941),4,4.122779,{'was_impossible': False}
6885,662,Ted Bundy (2002),4,4.122779,{'was_impossible': False}
6886,662,District 13 (Banlieue 13) (2004),4,4.122779,{'was_impossible': False}
6887,662,BloodRayne (2005),4,4.122779,{'was_impossible': False}
6888,662,Hostel (2005),4,4.122779,{'was_impossible': False}

Every estimate above is identical (4.122779) because the test list passes movie titles as item ids, while the model was trained on ```movieId``` values. SVD treats each title as an unknown item, so the estimate reduces to the global mean plus the bias of user 662. Building the list from ```movies['movieId']``` instead of ```movies['title']``` gives a distinct estimate for each movie.


In [None]:
testset = trainset.build_anti_testset()
predictions_svd = svd.test(testset) #Predicting for the test set

In [None]:
print('SVD - RMSE:', accuracy.rmse(predictions_svd, verbose=False))
print('SVD - MAE:', accuracy.mae(predictions_svd, verbose=False))

SVD - RMSE: 0.4692648491936571
SVD - MAE: 0.36674177423813115

These two values are not accuracy measures. The anti-testset contains only pairs without a true rating, and ```build_anti_testset``` fills ```r_ui``` with the global mean, so the RMSE and MAE here only measure how far the SVD estimates lie from the global mean. The cross-validation results above are the meaningful estimate of accuracy.


**For complete documentation on the Surprise library, refer to the link below:**

<a href="https://surprise.readthedocs.io/en/stable/">Surprise Documentation</a>