# Recommendation System

Building a recommendation system at scale using scikit-surprise (the Surprise library)

[Recommender systems](https://en.wikipedia.org/wiki/Recommender_system) are one of the most commonly used and easily understood applications of data science. Much work has been done on this topic, yet interest and demand in this area remain very high because of the rapid growth of the internet and the information overload problem. It has become necessary for online businesses to help users deal with information overload by providing personalized recommendations, content, and services.

Two of the most popular ways to approach recommender systems are [collaborative filtering](https://en.wikipedia.org/wiki/Collaborative_filtering) and [content-based recommendations](https://www.analyticsvidhya.com/blog/2015/08/beginners-guide-learn-content-based-recommender-systems/). In this post, we will focus on the **collaborative filtering** approach: the user is recommended items that people with similar tastes and preferences liked in the past. In other words, this method predicts unknown ratings by using the similarities between users.
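
To make the idea concrete, here is a minimal sketch of user-based collaborative filtering on a made-up rating matrix. The matrix values, the helper names (`cosine_on_common`, `predict_rating`) and the choice of cosine similarity are assumptions for illustration only; they are not what Surprise does internally.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: items); 0 marks "not rated".
# All values are made up for illustration.
toy_ratings = np.array([
    [5.0, 4.0, 1.0, 0.0],   # target user: the last item is unrated
    [5.0, 5.0, 1.0, 5.0],   # similar taste to the target user
    [1.0, 2.0, 5.0, 1.0],   # rather different taste
])

def cosine_on_common(u, v):
    """Cosine similarity restricted to the items both users have rated."""
    common = (u > 0) & (v > 0)
    if not common.any():
        return 0.0
    num = u[common] @ v[common]
    den = np.linalg.norm(u[common]) * np.linalg.norm(v[common])
    return float(num / den)

def predict_rating(matrix, user, item):
    """Similarity-weighted average of the ratings other users gave to `item`."""
    sims, vals = [], []
    for other in range(matrix.shape[0]):
        if other != user and matrix[other, item] > 0:
            sims.append(cosine_on_common(matrix[user], matrix[other]))
            vals.append(matrix[other, item])
    sims, vals = np.array(sims), np.array(vals)
    if sims.sum() == 0:
        return float(matrix[matrix > 0].mean())  # fall back to the global mean
    return float(sims @ vals / sims.sum())

pred = predict_rating(toy_ratings, user=0, item=3)
print(round(pred, 2))
```

The prediction leans towards the rating given by the more similar user, which is the behaviour the paragraph above describes.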

## Importing required libraries

In [1]:
import numpy as np
import pandas as pd
from surprise import Reader
from surprise import Dataset
from surprise.model_selection import cross_validate
from surprise import NormalPredictor
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise import KNNWithZScore
from surprise import KNNBaseline
from surprise import SVD
from surprise import BaselineOnly
from surprise import SVDpp
from surprise import NMF
from surprise import SlopeOne
from surprise import CoClustering
from surprise.accuracy import rmse
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
from tqdm import tqdm
import logging

## Importing data

In [4]:
data = Dataset.load_builtin("ml-1m")

## Surprise

To load a dataset from a pandas dataframe, we use the load_from_df() method together with a Reader object, whose rating_scale parameter must be specified. The dataframe must have three columns, corresponding to the user ids, the item ids, and the ratings, in this order. Each row thus corresponds to a given rating.


#### NormalPredictor

* The NormalPredictor algorithm predicts a random rating drawn from the distribution of the training set, which is assumed to be normal. It is one of the most basic algorithms and does very little work, which makes it useful mainly as a reference point.
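
The idea can be sketched in a few lines of NumPy: estimate the mean and standard deviation of the training ratings, then draw predictions from the corresponding normal distribution and clip them to the rating scale. The ratings below are made up, and `normal_predict` is a hypothetical helper, not part of Surprise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up training ratings on a 1-5 scale.
train_ratings = np.array([5, 3, 3, 4, 5, 1, 4, 2], dtype=float)

# Maximum-likelihood estimates of the normal distribution's parameters.
mu_hat = train_ratings.mean()
sigma_hat = train_ratings.std()  # population standard deviation (ddof=0)

def normal_predict(n, low=1.0, high=5.0):
    """Draw n random ratings from N(mu_hat, sigma_hat^2), clipped to the scale."""
    return np.clip(rng.normal(mu_hat, sigma_hat, size=n), low, high)

random_preds = normal_predict(5)
print(random_preds)
```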

#### BaselineOnly

* The BaselineOnly algorithm predicts the baseline estimate for a given user and item.
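
A baseline estimate has the form b_ui = mu + b_u + b_i, where mu is the global mean rating and b_u, b_i are user and item biases. The sketch below fits those biases with simple alternating updates on made-up data. The regularization value and number of passes are illustrative assumptions, not Surprise's defaults, and the helper `baseline` is hypothetical.

```python
import numpy as np

# Made-up (user, item, rating) triples.
triples = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 2.0), (2, 1, 1.0), (2, 2, 2.0)]
n_users, n_items = 3, 3

mu = np.mean([r for _, _, r in triples])  # global mean rating
b_u = np.zeros(n_users)                   # user biases
b_i = np.zeros(n_items)                   # item biases
reg = 1.0                                 # illustrative regularization strength

for _ in range(10):
    # Update item biases with the user biases held fixed.
    for i in range(n_items):
        resid = [r - mu - b_u[u] for u, j, r in triples if j == i]
        b_i[i] = sum(resid) / (reg + len(resid))
    # Update user biases with the item biases held fixed.
    for u in range(n_users):
        resid = [r - mu - b_i[i] for v, i, r in triples if v == u]
        b_u[u] = sum(resid) / (reg + len(resid))

def baseline(u, i):
    """Baseline estimate: global mean plus user and item biases."""
    return mu + b_u[u] + b_i[i]

print(round(baseline(0, 0), 2), round(baseline(2, 2), 2))
```

A generous user paired with a well-liked item ends up above the global mean, and a harsh user paired with a poorly rated item ends up below it.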

### k-NN algorithms

#### KNNBasic

* KNNBasic is a basic collaborative filtering algorithm.

#### KNNWithMeans

* KNNWithMeans is a basic collaborative filtering algorithm, taking into account the mean ratings of each user.

#### KNNWithZScore

* KNNWithZScore is a basic collaborative filtering algorithm, taking into account the z-score normalization of each user.

#### KNNBaseline

* KNNBaseline is a basic collaborative filtering algorithm taking into account a baseline rating.
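
For reference, the four k-NN variants above differ only in how each neighbour's rating is normalized before averaging. The user-based prediction rules, written in the notation of the Surprise documentation, are shown below. Here $N_i^k(u)$ is the set of the k users most similar to $u$ who rated item $i$, $\mu_u$ and $\sigma_u$ are the mean and standard deviation of user $u$'s ratings, and $b_{ui}$ is the baseline estimate.

```latex
\begin{align*}
\text{KNNBasic:} \quad
  \hat{r}_{ui} &= \frac{\sum_{v \in N_i^k(u)} \mathrm{sim}(u, v)\, r_{vi}}
                       {\sum_{v \in N_i^k(u)} \mathrm{sim}(u, v)} \\[4pt]
\text{KNNWithMeans:} \quad
  \hat{r}_{ui} &= \mu_u + \frac{\sum_{v \in N_i^k(u)} \mathrm{sim}(u, v)\,(r_{vi} - \mu_v)}
                               {\sum_{v \in N_i^k(u)} \mathrm{sim}(u, v)} \\[4pt]
\text{KNNWithZScore:} \quad
  \hat{r}_{ui} &= \mu_u + \sigma_u \frac{\sum_{v \in N_i^k(u)} \mathrm{sim}(u, v)\,(r_{vi} - \mu_v)/\sigma_v}
                                        {\sum_{v \in N_i^k(u)} \mathrm{sim}(u, v)} \\[4pt]
\text{KNNBaseline:} \quad
  \hat{r}_{ui} &= b_{ui} + \frac{\sum_{v \in N_i^k(u)} \mathrm{sim}(u, v)\,(r_{vi} - b_{vi})}
                                {\sum_{v \in N_i^k(u)} \mathrm{sim}(u, v)}
\end{align*}
```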

### Matrix Factorization-based algorithms

#### SVD

* The SVD algorithm, when baselines are not used, is equivalent to Probabilistic Matrix Factorization (http://papers.nips.cc/paper/3208-probabilistic-matrix-factorization.pdf)
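
In this family of models a rating is predicted as mu + b_u + b_i + q_i . p_u, with the biases and latent factors learned by stochastic gradient descent on the regularized squared error. The following is a toy sketch of that training loop on made-up data; the learning rate, regularization, factor count and epoch count are illustrative assumptions, and this is not Surprise's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up (user, item, rating) triples.
triples = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 2.0), (2, 1, 1.0), (2, 2, 2.0)]
n_users, n_items, n_factors = 3, 3, 2
lr, reg, n_epochs = 0.01, 0.02, 200   # illustrative hyperparameters

mu = np.mean([r for _, _, r in triples])
b_u = np.zeros(n_users)
b_i = np.zeros(n_items)
P = rng.normal(0, 0.1, (n_users, n_factors))  # user factors p_u
Q = rng.normal(0, 0.1, (n_items, n_factors))  # item factors q_i

def svd_predict(u, i):
    return mu + b_u[u] + b_i[i] + Q[i] @ P[u]

def train_rmse():
    return float(np.sqrt(np.mean([(r - svd_predict(u, i)) ** 2 for u, i, r in triples])))

rmse_before = train_rmse()
for _ in range(n_epochs):
    for u, i, r in triples:
        err = r - svd_predict(u, i)
        # Gradient steps on the regularized squared error.
        b_u[u] += lr * (err - reg * b_u[u])
        b_i[i] += lr * (err - reg * b_i[i])
        p_old = P[u].copy()
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * p_old - reg * Q[i])
rmse_after = train_rmse()

print(rmse_before, rmse_after)
```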

#### SVDpp

* The SVDpp algorithm is an extension of SVD that takes into account implicit ratings.
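
Concretely, the Surprise documentation gives the SVD++ prediction as the SVD rule with an extra term built from the set $I_u$ of items the user has rated, regardless of the rating value. Each item $j$ gets an additional factor vector $y_j$ that captures this implicit signal:

```latex
\hat{r}_{ui} = \mu + b_u + b_i
  + q_i^{\top} \Bigl( p_u + |I_u|^{-\frac{1}{2}} \sum_{j \in I_u} y_j \Bigr)
```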

#### NMF

* NMF is a collaborative filtering algorithm based on Non-negative Matrix Factorization. It is very similar to SVD.

### Slope One

* Slope One is a straightforward implementation of the SlopeOne algorithm. (https://arxiv.org/abs/cs/0702144)
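
As a rough sketch of the idea, Slope One predicts a rating as the user's mean rating plus the average deviation between the target item and the items the user has already rated. The ratings and helper names below (`deviations`, `slope_one_predict`) are made up for illustration and follow the mean-plus-average-deviation form used in the Surprise documentation rather than any particular implementation.

```python
from collections import defaultdict

# Made-up ratings: user -> {item: rating}.
so_ratings = {
    "alice": {"A": 5.0, "B": 3.0, "C": 2.0},
    "bob":   {"A": 3.0, "B": 4.0},
    "carol": {"B": 2.0, "C": 5.0},
}

def deviations(ratings):
    """dev[i][j]: average of (r_ui - r_uj) over users who rated both i and j."""
    total = defaultdict(lambda: defaultdict(float))
    count = defaultdict(lambda: defaultdict(int))
    for user_ratings in ratings.values():
        for i, r_i in user_ratings.items():
            for j, r_j in user_ratings.items():
                if i != j:
                    total[i][j] += r_i - r_j
                    count[i][j] += 1
    return {i: {j: total[i][j] / count[i][j] for j in total[i]} for i in total}

def slope_one_predict(ratings, dev, user, item):
    """User mean plus the average deviation of `item` from the user's rated items."""
    user_ratings = ratings[user]
    mean_u = sum(user_ratings.values()) / len(user_ratings)
    relevant = [j for j in user_ratings if j in dev.get(item, {})]
    if not relevant:
        return mean_u
    return mean_u + sum(dev[item][j] for j in relevant) / len(relevant)

so_dev = deviations(so_ratings)
so_pred = slope_one_predict(so_ratings, so_dev, "bob", "C")
print(so_pred)
```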

### Co-clustering

* Co-clustering is a collaborative filtering algorithm based on co-clustering (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.113.6458&rep=rep1&type=pdf)


We use RMSE (root mean squared error) as our accuracy metric for the predictions; lower values mean better predictions.
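
For readers unfamiliar with the metric, RMSE is the square root of the mean squared difference between true and predicted ratings. A minimal NumPy sketch follows; the helper is named `rmse_score` (a hypothetical name) to avoid shadowing the `rmse` function imported from Surprise above, and the example ratings are made up.

```python
import numpy as np

def rmse_score(y_true, y_pred):
    """Root mean squared error between two equally long rating sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors are 1, 0 and 2, so the mean squared error is 5/3.
score = rmse_score([5, 3, 4], [4, 3, 2])
print(score)
```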

In [5]:
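# raw_ratings holds (user_id, item_id, rating, timestamp) tuples
data_df = pd.DataFrame(data.raw_ratings, columns=['user_id','item_id','rating','timestamp'])
# Drop the timestamp column, which is not needed for the benchmark
data_df = data_df.iloc[:,:-1].copy()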

In [6]:
data_df

Unnamed: 0,user_id,item_id,rating
0,1,1193,5.0
1,1,661,3.0
2,1,914,3.0
3,1,3408,4.0
4,1,2355,5.0
...,...,...,...
1000204,6040,1091,1.0
1000205,6040,1094,5.0
1000206,6040,562,5.0
1000207,6040,1096,4.0


In [7]:
# ### Check whether the user has already seen the movie or not:
# 
# n_user = [i for i in data_df.user_id.unique()]
# n_movie = [i for i in data_df.item_id.unique()]
# num = []
# for i in n_user:
#   num.append(data_df[data_df['user_id']==str(i)].item_id.nunique())
# num

In [None]:

# Iterate over all algorithms
algorithms = [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]

print("Attempting: ", str(algorithms), '\n\n\n')

# Initialize an empty dictionary to accumulate results
results_dict = {}

# Benchmark on progressively larger subsets: the first 100k, 200k, ..., 1M ratings
for i in tqdm(range(1, 11, 1)):
    inused_data = data_df.iloc[:i * (10**5), :].copy()

    # A reader is still needed but only the rating_scale param is required.
    reader = Reader(rating_scale=(1, 5))

    # The columns must correspond to user id, item id, and ratings (in that order).
    data = Dataset.load_from_df(inused_data[["user_id", "item_id", "rating"]], reader)

    # Dictionaries to store results for this iteration
    test_rmse = {}
    test_time = {}
    fit_time = {}

    for algorithm in algorithms:
        print("Starting: ", str(algorithm))
        # Perform cross validation
        results = cross_validate(algorithm, data, measures=['RMSE'], cv=10, verbose=False)

        # Store results for this algorithm in the dictionary
        # (no trailing commas: they would silently wrap each value in a 1-tuple)
        algorithm_name = type(algorithm).__name__
        test_rmse[algorithm_name] = np.mean(results['test_rmse'])
        test_time[algorithm_name] = np.mean(results['test_time'])
        fit_time[algorithm_name] = np.mean(results['fit_time'])

    # Store the iteration results in the main dictionary
    results_dict[f'Iteration_{i}'] = {
        'test_rmse':test_rmse,
        'test_time':test_time,
        'fit_time':fit_time
    }

print('\n\tDONE\n')

# Convert the accumulated data into a DataFrame

surprise_results = pd.DataFrame.from_dict({(i, j): results_dict[i][j] for i in results_dict.keys() for j in results_dict[i].keys()}, orient='index')


Attempting:  [<surprise.prediction_algorithms.matrix_factorization.SVD object at 0x7fbff5e26620>, <surprise.prediction_algorithms.matrix_factorization.SVDpp object at 0x7fbff5e265c0>, <surprise.prediction_algorithms.slope_one.SlopeOne object at 0x7fbff5e26590>, <surprise.prediction_algorithms.matrix_factorization.NMF object at 0x7fbff5e249a0>, <surprise.prediction_algorithms.random_pred.NormalPredictor object at 0x7fbff5e24e80>, <surprise.prediction_algorithms.knns.KNNBaseline object at 0x7fbff5e24ee0>, <surprise.prediction_algorithms.knns.KNNBasic object at 0x7fbff5e24f10>, <surprise.prediction_algorithms.knns.KNNWithMeans object at 0x7fbff5e24fa0>, <surprise.prediction_algorithms.knns.KNNWithZScore object at 0x7fbff5e25000>, <surprise.prediction_algorithms.baseline_only.BaselineOnly object at 0x7fbff5e25060>, <surprise.prediction_algorithms.co_clustering.CoClustering object at 0x7fbff5e250c0>] 


  0%|          | 0/10 [00:00<?, ?it/s]

Starting:  <surprise.prediction_algorithms.matrix_factorization.SVD object at 0x7fbff5e26620>
Starting:  <surprise.prediction_algorithms.matrix_factorization.SVDpp object at 0x7fbff5e265c0>
Starting:  <surprise.prediction_algorithms.slope_one.SlopeOne object at 0x7fbff5e26590>
Starting:  <surprise.prediction_algorithms.matrix_factorization.NMF object at 0x7fbff5e249a0>
Starting:  <surprise.prediction_algorithms.random_pred.NormalPredictor object at 0x7fbff5e24e80>
Starting:  <surprise.prediction_algorithms.knns.KNNBaseline object at 0x7fbff5e24ee0>
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als..

 10%|█         | 1/10 [09:28<1:25:16, 568.53s/it]

Starting:  <surprise.prediction_algorithms.matrix_factorization.SVD object at 0x7fbff5e26620>
Starting:  <surprise.prediction_algorithms.matrix_factorization.SVDpp object at 0x7fbff5e265c0>
Starting:  <surprise.prediction_algorithms.slope_one.SlopeOne object at 0x7fbff5e26590>
Starting:  <surprise.prediction_algorithms.matrix_factorization.NMF object at 0x7fbff5e249a0>
Starting:  <surprise.prediction_algorithms.random_pred.NormalPredictor object at 0x7fbff5e24e80>
Starting:  <surprise.prediction_algorithms.knns.KNNBaseline object at 0x7fbff5e24ee0>
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als..

 20%|██        | 2/10 [31:58<2:17:04, 1028.03s/it]

Starting:  <surprise.prediction_algorithms.matrix_factorization.SVD object at 0x7fbff5e26620>
Starting:  <surprise.prediction_algorithms.matrix_factorization.SVDpp object at 0x7fbff5e265c0>
Starting:  <surprise.prediction_algorithms.slope_one.SlopeOne object at 0x7fbff5e26590>
Starting:  <surprise.prediction_algorithms.matrix_factorization.NMF object at 0x7fbff5e249a0>
Starting:  <surprise.prediction_algorithms.random_pred.NormalPredictor object at 0x7fbff5e24e80>
Starting:  <surprise.prediction_algorithms.knns.KNNBaseline object at 0x7fbff5e24ee0>
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als..

 30%|███       | 3/10 [1:09:59<3:06:40, 1600.12s/it]

Starting:  <surprise.prediction_algorithms.matrix_factorization.SVD object at 0x7fbff5e26620>
Starting:  <surprise.prediction_algorithms.matrix_factorization.SVDpp object at 0x7fbff5e265c0>
Starting:  <surprise.prediction_algorithms.slope_one.SlopeOne object at 0x7fbff5e26590>
Starting:  <surprise.prediction_algorithms.matrix_factorization.NMF object at 0x7fbff5e249a0>
Starting:  <surprise.prediction_algorithms.random_pred.NormalPredictor object at 0x7fbff5e24e80>
Starting:  <surprise.prediction_algorithms.knns.KNNBaseline object at 0x7fbff5e24ee0>
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als..

 40%|████      | 4/10 [2:07:34<3:53:15, 2332.60s/it]

Starting:  <surprise.prediction_algorithms.matrix_factorization.SVD object at 0x7fbff5e26620>
Starting:  <surprise.prediction_algorithms.matrix_factorization.SVDpp object at 0x7fbff5e265c0>
Starting:  <surprise.prediction_algorithms.slope_one.SlopeOne object at 0x7fbff5e26590>
Starting:  <surprise.prediction_algorithms.matrix_factorization.NMF object at 0x7fbff5e249a0>
Starting:  <surprise.prediction_algorithms.random_pred.NormalPredictor object at 0x7fbff5e24e80>
Starting:  <surprise.prediction_algorithms.knns.KNNBaseline object at 0x7fbff5e24ee0>
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als..

In [None]:
surprise_results.reset_index()

In [None]:
surprise = surprise_results.reset_index().copy()
surprise

In [None]:
cols = [c for c in surprise.columns if c not in ("level_0", "level_1")]
cols

In [None]:
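n_rows, _ = surprise.shape

# Some cells may hold 1-tuples instead of plain numbers; unwrap them
for c in cols:
    for i in range(n_rows):
        d = surprise.loc[i, c]
        if isinstance(d, tuple):
            surprise.loc[i, c] = d[0]

# Make the metric columns numeric so they can be plotted
surprise[cols] = surprise[cols].astype(float)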

In [None]:
test_rmse = surprise[surprise['level_1']=='test_rmse']
test_time = surprise[surprise['level_1']=='test_time']
fit_time = surprise[surprise['level_1']=='fit_time']

In [None]:
type(surprise.iloc[1,4]) 

In [None]:
data_df.shape

In [None]:
fit_time

In [None]:
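# Plot only the per-algorithm metric columns; the x-axis is the iteration index
sns.lineplot(data=test_rmse[cols].astype(float).reset_index(drop=True))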

In [None]:
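# Plot only the per-algorithm metric columns; the x-axis is the iteration index
sns.lineplot(data=fit_time[cols].astype(float).reset_index(drop=True))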

In [None]:
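# Plot only the per-algorithm metric columns; the x-axis is the iteration index
sns.lineplot(data=test_time[cols].astype(float).reset_index(drop=True))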