# Part 3: Recommendation system for retail using memory-based collaborative filtering

# In this notebook we focus on the Surprise library, which makes it easier to build recommendation systems
I used the following articles while creating this notebook:  
* https://gist.github.com/pankajti/e631e8f6ce067fc76dfacedd9e4923ca#file-surprise_knn_recommendation-ipynb  
* https://towardsdatascience.com/how-to-build-a-memory-based-recommendation-system-using-python-surprise-55f3257b2cf4  
* https://pankaj-tiwari2.medium.com/neighborhood-based-collaborative-filtering-in-python-using-surprise-fe9d5700cb58

# Summary of the notebook

I. Import useful libraries and the Python file containing our functions

II. Retrieving our data from a previous notebook  

* A. Ratings_table, the basic table

III. Preparing our different sets of data

* A. Load our full ratings_table as data
* B. Create the anti_set
* C. Create the nice_set
* D. Information about the full data  

IV. Train a model  

* A. Choosing the best algorithm
* B. Training the best algorithm

V. Prediction  

* A. Prediction on a pair (a userId and an ItemId)
* B. Prediction on a user_id
* C. Checking metrics
* D. Some serendipity?


# I. Import useful libraries and the Python file containing our functions

In [3]:
import pandas as pd
import time
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise import KNNBasic,  KNNWithMeans, KNNBaseline, KNNWithZScore
from surprise.model_selection import KFold
from surprise import Reader
from surprise import NormalPredictor
from surprise.model_selection import cross_validate
import matplotlib.pyplot as plt
import seaborn as sns
from surprise.model_selection import GridSearchCV



# II. Retrieving our data from a previous notebook

## A. Ratings_table, the basic table

We use ratings_table as our base table: it contains all the information needed for this notebook.  
* userId
* rating
* ItemId
* Item_Name

Moreover, this data frame contains only customers (cust_id) with more than 3 transactions.

In [203]:
ratings_table = pd.read_csv('rating_table.csv')
ratings_table = ratings_table.rename(columns={"cust_id": "userId", "object_id": "ItemId", "rank": "rating", "object_name": "Item_Name"})
ratings_table = ratings_table.groupby(['userId', 'ItemId', 'Item_Name'])[['rating']].mean().reset_index()
ratings_table.head()

Unnamed: 0,userId,ItemId,Item_Name,rating
0,266783,1_4,Clothing_Mens,2.0
1,266783,2_1,Footwear_Mens,4.0
2,266783,5_10,Books_Non-Fiction,2.0
3,266784,3_4,Electronics_Mobiles,2.0
4,266784,5_10,Books_Non-Fiction,3.0


So, after averaging the ratings per (userId, ItemId) pair, we have 16,880 ratings and 4,031 unique customers.

In [204]:
print('shape of the ratings_table:', ratings_table.shape)
print('numbers of uniques customer:', len(ratings_table.userId.unique()))

shape of the ratings_table: (16880, 4)
numbers of uniques customer: 4031


# III. Preparing our different sets of data

## A. Load our full ratings_table as data

In [205]:
reader = Reader(rating_scale=(1, 5))
# The columns must correspond to user_id, item_id and ratings (in that order).
data = Dataset.load_from_df(ratings_table[['userId', 'ItemId', 'rating']], reader)

## B. Create the anti_set

An anti-set is the set of user and item pairs for which no rating exists in the original dataset.  
This is the set for which we are trying to predict ratings.  
For example, userId 270384 has not rated ItemId 5_3, 5_12...  
Surprise builds the set of all such combinations and fills the rating field with a default value, the global mean rating.  
We will compute an estimated rating for this set using our model.
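To make the idea concrete, here is a minimal pure-Python sketch of what an anti-set contains. It only illustrates the concept and is not Surprise's implementation; the toy ratings and the helper name `build_anti_set` are invented for this example.

```python
from itertools import product

# Toy ratings, invented for this illustration: (user, item) -> rating
toy_ratings = {
    ("u1", "1_4"): 2.0,
    ("u1", "2_1"): 4.0,
    ("u2", "1_4"): 3.0,
}


def build_anti_set(ratings):
    """Return (user, item, global_mean) for every pair that has no rating."""
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    global_mean = sum(ratings.values()) / len(ratings)
    return [
        (u, i, global_mean)
        for u, i in product(users, items)
        if (u, i) not in ratings
    ]


toy_anti_set = build_anti_set(toy_ratings)
print(toy_anti_set)
```

The only missing pair here is (`u2`, `2_1`), and its rating field holds the mean of the three known ratings, which mirrors the constant fill value visible in the real anti_set.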

In [206]:
anti_set = data.build_full_trainset().build_anti_testset()
anti_set[:4]

[(266783, '3_4', 2.999397709320695),
 (266783, '5_7', 2.999397709320695),
 (266783, '2_4', 2.999397709320695),
 (266783, '4_1', 2.999397709320695)]

## C. Create the nice_set

We also create a nice_set.  
The nice_set only contains the combinations of userId and ItemId that are already rated.  
Here we can see the real value of each rating.

In [207]:
# here we create the nice_set (the pairs that already have a rating)
nice_set = data.build_full_trainset().build_testset()
nice_set[:4]

[(266783, '1_4', 2.0),
 (266783, '2_1', 4.0),
 (266783, '5_10', 2.0),
 (266784, '3_4', 2.0)]

## D. Information about the full data

In [208]:
items = ratings_table[['ItemId' , 'Item_Name']].drop_duplicates(['ItemId' , 'Item_Name'])
users = ratings_table[['userId']].drop_duplicates(['userId'])
display(items.head(2))
display(users.head(2))

Unnamed: 0,ItemId,Item_Name
0,1_4,Clothing_Mens
1,2_1,Footwear_Mens


Unnamed: 0,userId
0,266783
3,266784


In [209]:
trainsetfull = data.build_full_trainset()
print('Number of users: ', trainsetfull.n_users, '\n')
print('Number of items: ', trainsetfull.n_items, '\n')

Number of users:  4031 

Number of items:  23 



# IV. Train a model

## A. Choosing the best algorithm

We use cross-validation to find the best algorithm:

* cross-validate a number of model types with different parameters,
* select the configuration with the lowest average test RMSE,
* train that model on the whole dataset,
* use it for predictions.
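The selection procedure can be sketched without Surprise. The example below is only an illustration under stated assumptions: the two candidate "models" (`global_mean_model`, `item_mean_model`), the fold scheme and the toy ratings are all hypothetical, and stand in for the KNN variants compared in this notebook.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true, estimated) pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))


def global_mean_model(train):
    """Hypothetical baseline: always predict the mean training rating."""
    mean = sum(r for _, _, r in train) / len(train)
    return lambda user, item: mean


def item_mean_model(train):
    """Hypothetical baseline: predict the item's mean, else the global mean."""
    mean = sum(r for _, _, r in train) / len(train)
    by_item = {}
    for _, item, r in train:
        by_item.setdefault(item, []).append(r)
    return lambda user, item: (
        sum(by_item[item]) / len(by_item[item]) if item in by_item else mean
    )


def cross_validated_rmse(make_model, ratings, n_folds=3):
    """Mean test RMSE over n_folds folds (fold f tests on every n-th row)."""
    scores = []
    for fold in range(n_folds):
        test = ratings[fold::n_folds]
        train = [row for idx, row in enumerate(ratings) if idx % n_folds != fold]
        predict = make_model(train)
        scores.append(rmse([(r, predict(u, i)) for u, i, r in test]))
    return sum(scores) / len(scores)


# Toy (user, item, rating) rows, invented for this illustration
toy = [
    ("u1", "a", 5.0), ("u1", "b", 1.0), ("u2", "a", 4.0),
    ("u2", "b", 2.0), ("u3", "a", 5.0), ("u3", "b", 1.0),
]
candidates = {"global_mean": global_mean_model, "item_mean": item_mean_model}
scores = {name: cross_validated_rmse(m, toy) for name, m in candidates.items()}
best_name = min(scores, key=scores.get)
print(scores, best_name)
```

The winning candidate would then be refitted on all the rows before being used for predictions, which is what the rest of this section does with the Surprise algorithms.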

In the next cell, we can change several parameters:
* k : the maximum number of neighbours (similar items or users) taken into account for a prediction
* min_k : the minimum number of neighbours required; if there are not enough neighbours, the neighbourhood estimate is dropped and a default value such as the global average is used
* sim_options :  
    * name : the formula used for the similarity function (pearson, cosine, MSD)
    * user_based : whether similarities are computed between users (True) or between items (False)
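The following simplified sketch shows how k and min_k could interact in an item-based neighbourhood prediction with an MSD similarity. It is an assumption-laden illustration, not Surprise's code: the helper names `msd_similarity` and `knn_predict` and the toy ratings are invented here.

```python
def msd_similarity(ratings_a, ratings_b):
    """MSD similarity on co-rated keys: 1 / (mean squared difference + 1)."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    msd = sum((ratings_a[c] - ratings_b[c]) ** 2 for c in common) / len(common)
    return 1.0 / (msd + 1.0)


def knn_predict(user, item, by_item, global_mean, k=2, min_k=1):
    """Item-based estimate: similarity-weighted mean of the user's own
    ratings on the (at most) k items most similar to `item`."""
    neighbours = [
        (msd_similarity(by_item[item], by_item[other]), by_item[other][user])
        for other in by_item
        if other != item and user in by_item[other]
    ]
    # Keep the k most similar neighbours that have a positive similarity
    neighbours = sorted((n for n in neighbours if n[0] > 0), reverse=True)[:k]
    if len(neighbours) < min_k:
        return global_mean  # not enough neighbours: fall back to the global mean
    return sum(s * r for s, r in neighbours) / sum(s for s, _ in neighbours)


# Toy data, invented for this illustration: item -> {user: rating}
by_item = {
    "a": {"u1": 4.0, "u2": 2.0},
    "b": {"u1": 4.0, "u2": 2.0, "u3": 5.0},
    "c": {"u1": 1.0, "u3": 3.0},
}
all_ratings = [r for users in by_item.values() for r in users.values()]
global_mean = sum(all_ratings) / len(all_ratings)

est_k1 = knn_predict("u3", "a", by_item, global_mean, k=1, min_k=1)
est_k2 = knn_predict("u3", "a", by_item, global_mean, k=2, min_k=1)
est_fallback = knn_predict("u3", "a", by_item, global_mean, k=2, min_k=3)
print(est_k1, est_k2, est_fallback)
```

With k = 1 only the most similar item is used, with k = 2 the less similar item pulls the estimate down slightly, and with min_k = 3 the two available neighbours are not enough, so the global mean is returned.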

We can also change the model (KNNWithMeans, KNNBasic, KNNWithZScore, KNNBaseline)

We create a data frame named DF_result that gathers all our model tests, so that we can choose the configuration with the best result.

In [210]:

# Columns of the results data frame (one row per tested configuration)
result_columns = ['algor', 'my_k', 'my_min_k', 'my_sim_option_name', 'my_sim_option_user_based', 'results']
DF_result = pd.DataFrame(columns = result_columns)

my_min_k = 2


def compute_result(algor, my_k, my_min_k, my_sim_option, data = data):
    # Build the algorithm with the requested neighbourhood settings
    algo = algor(
        k = my_k, min_k = my_min_k, 
        sim_options = my_sim_option, verbose = True
        )

    # 5-fold cross-validation; only the mean test RMSE is kept
    results = cross_validate(
        algo = algo, data = data, measures=['RMSE'], 
        cv=5, return_train_measures=True
        )

    return algo, algor, my_k, my_min_k, results['test_rmse'].mean()


t1 = time.time()

rows = []
for algor in [KNNWithMeans, KNNBasic, KNNWithZScore, KNNBaseline]:
    for metric in ['pearson', 'cosine', 'MSD']:
        for user_based in [False]:   # full grid: [True, False]
            for my_k in [5, 7]:      # full grid: [4, 7, 14]
                my_sim_option = {'name': metric, 'user_based': user_based}
                _, _, _, _, mean_rmse = compute_result(algor, my_k, my_min_k, my_sim_option, data = data)
                rows.append([algor.__name__, my_k, my_min_k, metric, user_based, mean_rmse])

                # DataFrame.append was removed in pandas 2.0, so we rebuild the frame
                # from the rows; partial results survive an interrupted run
                DF_result = pd.DataFrame(rows, columns = result_columns)


DF_result.sort_values(by=['results'], ascending=True, inplace=True)
t2 = time.time()

# it takes about 23 min to complete with the full grid given in the comments
print('took in sec :', (t2 - t1))
DF_result

Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Co

KeyboardInterrupt: 

We select the best result from the results data frame below.  
The previous cell takes a while to execute (around 23 min), so I save the result in a CSV file.

In [211]:
#DF_result.to_csv('df_result_metrics.csv', index = False)
DF_result.head(20)

Unnamed: 0,algor,my_k,my_min_k,my_sim_option_name,my_sim_option_user_based,results
0,KNNWithMeans,5,2,pearson,False,1.552837
1,KNNWithMeans,7,2,pearson,False,1.55888
2,KNNWithMeans,5,2,cosine,False,1.597842
3,KNNWithMeans,7,2,cosine,False,1.595564
4,KNNWithMeans,5,2,MSD,False,1.589469


So, in the full run, the best configuration is KNNBaseline with k = 14, min_k = 2, cosine similarity and user_based = True (the table above only shows the partial results of the interrupted run).  
This gives a mean test RMSE of 1.46.

In [212]:
algo, _, _, _, _ =  compute_result(algor = KNNBaseline, my_k = 7, my_min_k = 2, my_sim_option = {'name': 'cosine', 'user_based': True}, data = data)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.


## B. Training the best algorithm

Now that we have found a good algorithm, we need to train it on the full dataset in order to make predictions.

In [213]:
algo.fit(trainsetfull)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x7f96346ff130>

# V. Prediction

## A. Prediction on a pair (a userId and an ItemId)

In [214]:
algo.predict(uid = 270384, iid = '5_12')

Prediction(uid=270384, iid='5_12', r_ui=None, est=2.389500534574194, details={'actual_k': 7, 'was_impossible': False})

In this case, the predicted rating for cust_id 270384 and item 5_12 is about 2.39.

## B. Prediction on a user_id

With anti_pred_df, we predict all the unknown ratings, using the anti_set.
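Turning such predictions into recommendations for one user amounts to filtering on the user, keeping the estimates above a threshold and sorting them. Here is a minimal pure-Python sketch of that step; the helper name `top_n_for_user` and the toy predictions are invented for this illustration and are not part of Surprise.

```python
def top_n_for_user(predictions, user_id, threshold=3.0, n=2):
    """Return up to n (item, estimate) pairs for user_id, best estimate first."""
    candidates = [
        (iid, est)
        for uid, iid, est in predictions
        if uid == user_id and est > threshold
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:n]


# Toy (uid, iid, est) predictions, invented for this illustration
toy_predictions = [
    (1, "1_3", 4.2),
    (1, "4_4", 3.8),
    (1, "5_3", 2.1),  # below the threshold, never recommended
    (1, "2_4", 3.6),
    (2, "1_3", 4.9),  # another user, ignored for user 1
]

top_items = top_n_for_user(toy_predictions, user_id=1)
print(top_items)
```

The cells of this section apply the same filter and sort on the data frame of predictions, without the cut-off at n items.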

In [215]:
anti_pre = algo.test(anti_set)
anti_pred_df = pd.DataFrame(anti_pre).merge(items , left_on = ['iid'], right_on = ['ItemId'])
anti_pred_df = pd.DataFrame(anti_pred_df).merge(users , left_on = ['uid'], right_on = ['userId'])

In [216]:
user_id_to_predict = 266784
threshold = 3.0
anti_pred_df[(anti_pred_df['est']>threshold)&(anti_pred_df['userId']==user_id_to_predict)].sort_values(by=['est'], ascending=False)

Unnamed: 0,uid,iid,r_ui,est,details,ItemId,Item_Name,userId
73400,266784,1_3,2.999398,4.228812,"{'actual_k': 7, 'was_impossible': False}",1_3,Clothing_Kids,266784
73410,266784,6_11,2.999398,3.918306,"{'actual_k': 7, 'was_impossible': False}",6_11,Home and kitchen_Bath,266784
73404,266784,4_4,2.999398,3.833014,"{'actual_k': 7, 'was_impossible': False}",4_4,Bags_Women,266784
73397,266784,4_1,2.999398,3.660677,"{'actual_k': 7, 'was_impossible': False}",4_1,Bags_Mens,266784
73396,266784,2_4,2.999398,3.597158,"{'actual_k': 7, 'was_impossible': False}",2_4,Footwear_Kids,266784
73406,266784,3_5,2.999398,3.512738,"{'actual_k': 7, 'was_impossible': False}",3_5,Electronics_Computers,266784
73411,266784,1_1,2.999398,3.459833,"{'actual_k': 7, 'was_impossible': False}",1_1,Clothing_Women,266784
73412,266784,5_3,2.999398,3.386538,"{'actual_k': 7, 'was_impossible': False}",5_3,Books_Comics,266784
73409,266784,3_9,2.999398,3.211833,"{'actual_k': 7, 'was_impossible': False}",3_9,Electronics_Cameras,266784
73415,266784,2_1,2.999398,3.18194,"{'actual_k': 7, 'was_impossible': False}",2_1,Footwear_Mens,266784


With nice_pred_df, we predict all the known ratings to see how the algorithm performs.

In [217]:
nice_pre = algo.test(nice_set)
nice_pred_df = pd.DataFrame(nice_pre).merge(items , left_on = ['iid'], right_on = ['ItemId'])
nice_pred_df = pd.DataFrame(nice_pred_df).merge(users , left_on = ['uid'], right_on = ['userId'])

In [218]:
user_id_to_predict = 266785
threshold = 3.0
nice_pred_df[(nice_pred_df['est']>threshold)&(nice_pred_df['userId']==user_id_to_predict)].sort_values(by=['est'], ascending=False)

Unnamed: 0,uid,iid,r_ui,est,details,ItemId,Item_Name,userId
3194,266785,2_1,4.0,4.063014,"{'actual_k': 7, 'was_impossible': False}",2_1,Footwear_Mens,266785
3197,266785,5_11,5.0,3.775466,"{'actual_k': 7, 'was_impossible': False}",5_11,Books_Children,266785
3195,266785,2_4,3.5,3.670563,"{'actual_k': 7, 'was_impossible': False}",2_4,Footwear_Kids,266785
3196,266785,4_1,3.0,3.293609,"{'actual_k': 7, 'was_impossible': False}",4_1,Bags_Mens,266785
3198,266785,6_10,5.0,3.220526,"{'actual_k': 7, 'was_impossible': False}",6_10,Home and kitchen_Kitchen,266785


## C. Checking metrics

Here we check how the model performs using the RMSE metric.

Computing the metric on the anti_set would be meaningless:  
we would compare all our estimates to the mean value filled in by default.

However, with the nice_set we can compare the ratings predicted by the algorithm with the real ratings given by customers.

In [219]:
#on nice_set
predictions = algo.test(nice_set)
print('On the nice_set, the RMSE score is :', accuracy.rmse(predictions))

RMSE: 1.3812
On the nice_set, the RMSE score is : 1.3811797192355084


So, the RMSE is 1.38, meaning that the estimated ratings are typically about 1.38 points higher or lower than the actual ratings.  
Here, our scale is from 1 to 5, so this is not a good result at all.
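As a reminder of what the metric measures, the sketch below computes the RMSE by hand on the first five (r_ui, est) pairs of the nice_set predictions displayed in this section. The MAE is added only for comparison and is not used elsewhere in this notebook; five rows are of course not representative of the full set.

```python
import math

# First five (r_ui, est) pairs from the nice_set predictions of this notebook
pairs = [
    (2.0, 2.969048),
    (4.0, 2.596845),
    (2.0, 2.810934),
    (2.0, 3.657985),
    (3.0, 3.182158),
]

# Signed error of each estimate
errors = [est - r_ui for r_ui, est in pairs]

# RMSE: square the errors, average them, take the square root
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# MAE: average absolute error, less sensitive to large misses than the RMSE
mae = sum(abs(e) for e in errors) / len(errors)

print(f"RMSE: {rmse:.4f}  MAE: {mae:.4f}")
```

Because the errors are squared before averaging, a few large misses weigh more heavily in the RMSE than in the MAE, so the RMSE is never smaller than the MAE.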

Below, we display the full data frame of nice_set predictions to compare them and see how poorly the model performs...

In [220]:
pd.DataFrame(predictions)

Unnamed: 0,uid,iid,r_ui,est,details
0,266783,1_4,2.0,2.969048,"{'actual_k': 7, 'was_impossible': False}"
1,266783,2_1,4.0,2.596845,"{'actual_k': 7, 'was_impossible': False}"
2,266783,5_10,2.0,2.810934,"{'actual_k': 7, 'was_impossible': False}"
3,266784,3_4,2.0,3.657985,"{'actual_k': 7, 'was_impossible': False}"
4,266784,5_10,3.0,3.182158,"{'actual_k': 7, 'was_impossible': False}"
...,...,...,...,...,...
16875,275261,5_11,5.0,3.193583,"{'actual_k': 7, 'was_impossible': False}"
16876,275261,5_3,3.0,3.358162,"{'actual_k': 7, 'was_impossible': False}"
16877,275265,4_1,1.0,2.000597,"{'actual_k': 7, 'was_impossible': False}"
16878,275265,5_12,3.0,3.010975,"{'actual_k': 7, 'was_impossible': False}"


## D. Some serendipity?

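Purely out of curiosity, I tried another algorithm to check whether a poor RMSE is always the norm.  
I found a configuration with an RMSE of 0.54 on the nice_set, which looks quite promising. Keep in mind that the nice_set is the data the model was fitted on, so this is a training error rather than a cross-validated score.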

In [221]:
algo, _, _, _, _ =  compute_result(algor = KNNBasic, my_k = 5, my_min_k = 2, my_sim_option = {'name': 'pearson', 'user_based': False})

Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.


In [222]:
algo.fit(trainsetfull)

Computing the pearson similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x7f96007540a0>

In [223]:
#on nice_set
predictions = algo.test(nice_set)
print('On the nice_set, the RMSE score is :', accuracy.rmse(predictions))

RMSE: 0.5442
On the nice_set, the RMSE score is : 0.5441917914334575


In [224]:
pd.DataFrame(predictions)

Unnamed: 0,uid,iid,r_ui,est,details
0,266783,1_4,2.0,2.000000,"{'actual_k': 2, 'was_impossible': False}"
1,266783,2_1,4.0,2.999398,"{'was_impossible': True, 'reason': 'Not enough..."
2,266783,5_10,2.0,2.000000,"{'actual_k': 2, 'was_impossible': False}"
3,266784,3_4,2.0,2.999398,"{'was_impossible': True, 'reason': 'Not enough..."
4,266784,5_10,3.0,2.999398,"{'was_impossible': True, 'reason': 'Not enough..."
...,...,...,...,...,...
16875,275261,5_11,5.0,4.488831,"{'actual_k': 2, 'was_impossible': False}"
16876,275261,5_3,3.0,2.999398,"{'was_impossible': True, 'reason': 'Not enough..."
16877,275265,4_1,1.0,1.057290,"{'actual_k': 2, 'was_impossible': False}"
16878,275265,5_12,3.0,2.942710,"{'actual_k': 2, 'was_impossible': False}"


# Thank you for finishing part 3, let's move on to the fourth part now