# Part 3: Recommendation system for retail using memory-based collaborative filtering

# In this notebook we focus on the Surprise library, which makes it easier to build recommendations
I used the following articles while writing this notebook:  
* https://gist.github.com/pankajti/e631e8f6ce067fc76dfacedd9e4923ca#file-surprise_knn_recommendation-ipynb  
* https://towardsdatascience.com/how-to-build-a-memory-based-recommendation-system-using-python-surprise-55f3257b2cf4  
* https://pankaj-tiwari2.medium.com/neighborhood-based-collaborative-filtering-in-python-using-surprise-fe9d5700cb58
* https://towardsdatascience.com/building-and-testing-recommender-systems-with-surprise-step-by-step-d4ba702ef80b

This notebook will mainly focus on the collaborative filtering approach.  
A user will be recommended items that people with similar tastes and preferences liked in the past.  
So, this method predicts unknown ratings by using the similarities between users.
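
To make this idea concrete, here is the usual form of a user-based neighbourhood prediction, as used by the KNNBasic algorithm that we end up selecting later. The notation is ours and only meant as an illustration: $\hat{r}_{ui}$ is the estimated rating of user $u$ for item $i$, $\mathrm{sim}(u, v)$ is the similarity between users $u$ and $v$, and $N_i^k(u)$ is the set of at most $k$ users most similar to $u$ who have rated item $i$.

$$
\hat{r}_{ui} = \frac{\sum_{v \in N_i^k(u)} \mathrm{sim}(u, v)\, r_{vi}}{\sum_{v \in N_i^k(u)} \mathrm{sim}(u, v)}
$$

In words, the estimate is the average of the neighbours' ratings for the item, weighted by how similar each neighbour is to the user.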

# Outline of the notebook

I. Import useful libraries and Python file containing our functions

II. Retrieving our data from a previous notebook  

* A. Ratings_table, the basic table

III. Preparing our different sets of data

* A. Load our full ratings_table as data
* B. Create the anti_set
* C. Create the nice_set
* D. Information about the full data  

IV. Train a model  

* A. Choosing the best algorithm
* B. Training the best algorithm

V. Prediction  

* A. Prediction on a pair (userid and an itemid)
* B. Prediction on a user_id
* C. Checking metrics
* D. Display Result


# I. Import useful libraries and Python file containing our functions

In [4]:
import pandas as pd
import time
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise import KNNBasic,  KNNWithMeans, KNNBaseline, KNNWithZScore
from surprise.model_selection import KFold
from surprise import Reader
from surprise import NormalPredictor
from surprise.model_selection import cross_validate
import matplotlib.pyplot as plt
import seaborn as sns
from surprise.model_selection import GridSearchCV

For more information, see the Surprise library website: http://surpriselib.com

# II. Retrieving our data from a previous notebook

## A. Ratings_table, the basic table

We use ratings_table as our basic table, meaning that this table contains all the information needed for this notebook:  
* userId
* rating
* ItemId
* Item_Name

Moreover, this data frame contains only customers (cust_id) with more than 3 transactions.

In [5]:
ratings_table = pd.read_csv('output/rating_table.csv')
ratings_table = ratings_table.rename(columns={"cust_id": "userId", "object_id": "ItemId", "rank": "rating", "object_name": "Item_Name"})
ratings_table = ratings_table.groupby(['userId', 'ItemId', 'Item_Name'])[['rating']].mean().reset_index()
ratings_table.head()

Unnamed: 0,userId,ItemId,Item_Name,rating
0,266783,1_4,Clothing_Mens,2.0
1,266783,2_1,Footwear_Mens,4.0
2,266783,5_10,Books_Non-Fiction,2.0
3,266784,3_4,Electronics_Mobiles,2.0
4,266784,5_10,Books_Non-Fiction,3.0


So, we have 16,880 ratings (one row per user and item pair) and 4,031 unique customers

In [6]:
print('shape of the ratings_table:', ratings_table.shape)
print('numbers of uniques customer:', len(ratings_table.userId.unique()))

shape of the ratings_table: (16880, 4)
numbers of uniques customer: 4031


# III. Preparing our different sets of data

## A. Load our full ratings_table as data

In [7]:
reader = Reader(rating_scale=(1, 5))
# The columns must correspond to user_id, item_id and ratings (in that order).
data = Dataset.load_from_df(ratings_table[['userId', 'ItemId', 'rating']], reader)

## B. Create the anti_set

An anti_set is the set of (user, item) pairs for which no rating exists in the original dataset.  
This is the set for which we are trying to predict ratings.  
For example, userId 270384 has not rated ItemId 5_3, 5_12...  
Surprise builds this set of combinations and fills in the global mean rating as a default value.  
We will calculate an estimated rating for each pair in this set using our model.
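
To show what such a set looks like, here is a minimal plain-Python sketch of the idea. It is not Surprise's implementation: the function name `build_anti_set` and the toy ratings are made up for illustration.

```python
from itertools import product

# Toy ratings as (user, item, rating); hypothetical values, not from rating_table.csv.
ratings = [
    ("u1", "1_4", 2.0),
    ("u1", "2_1", 4.0),
    ("u2", "1_4", 3.0),
    ("u2", "5_10", 5.0),
]

def build_anti_set(ratings):
    """Return every (user, item) pair without a rating, filled with the global mean."""
    global_mean = sum(r for _, _, r in ratings) / len(ratings)
    rated = {(u, i) for u, i, _ in ratings}
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    return [
        (u, i, global_mean)
        for u, i in product(users, items)
        if (u, i) not in rated
    ]

toy_anti_set = build_anti_set(ratings)
print(toy_anti_set)
```

Each user is paired with every item they have not rated, and the third element is only a placeholder, not a real rating.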

In [8]:
# we create here an anti_set
anti_set = data.build_full_trainset().build_anti_testset()
anti_set[:4]

[(266783, '3_4', 2.999397709320695),
 (266783, '5_7', 2.999397709320695),
 (266783, '2_4', 2.999397709320695),
 (266783, '4_1', 2.999397709320695)]

For the anti_set, the global mean rating is assigned as a default value before prediction; here it is about 2.999.

## C. Create the nice_set

We also create a nice_set.  
The nice_set only contains the combinations of userId and ItemId that are already rated.  
We can see here the real value of each rating.

In [9]:
# we create here a nice_set
nice_set = data.build_full_trainset().build_testset()
nice_set[:4]

[(266783, '1_4', 2.0),
 (266783, '2_1', 4.0),
 (266783, '5_10', 2.0),
 (266784, '3_4', 2.0)]

Here, we retrieve the real rating for each (user, item) pair.

## D. Information about the full data

In [10]:
items = ratings_table[['ItemId' , 'Item_Name']].drop_duplicates(['ItemId' , 'Item_Name'])
users = ratings_table[['userId']].drop_duplicates(['userId'])
display(items.head(2))
display(users.head(2))

Unnamed: 0,ItemId,Item_Name
0,1_4,Clothing_Mens
1,2_1,Footwear_Mens


Unnamed: 0,userId
0,266783
3,266784


In [11]:
trainsetfull = data.build_full_trainset()
print('Number of users: ', trainsetfull.n_users, '\n')
print('Number of items: ', trainsetfull.n_items, '\n')

Number of users:  4031 

Number of items:  23 



# IV. Train a model

## A. Choosing the best algorithm

We use cross-validation to estimate the best algorithm. In short, we:

* cross-validated a number of model types with their default parameters,
* selected the configuration with the lowest average test RMSE score,
* trained that model on the whole dataset,
* used it for predictions.
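
The selection procedure above can be sketched in plain Python. This is only an illustration of k-fold model selection, not Surprise's `cross_validate`: the helper names (`k_fold_rmse`, `fit_global_mean`, `fit_constant_three`) and the toy data are assumptions made for the example, and the two "models" are deliberately trivial.

```python
import math
import random

def rmse(pairs):
    """Root mean squared error over (true, estimated) pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))

def k_fold_rmse(ratings, fit, n_splits=3, seed=0):
    """Average test RMSE of a model over n_splits folds.

    fit(train) must return a function predict(user, item) -> float.
    """
    shuffled = ratings[:]
    random.Random(seed).shuffle(shuffled)
    scores = []
    for fold in range(n_splits):
        # Every n_splits-th rating goes to the test fold, the rest to training.
        test = shuffled[fold::n_splits]
        train = [r for idx, r in enumerate(shuffled) if idx % n_splits != fold]
        predict = fit(train)
        scores.append(rmse([(r, predict(u, i)) for u, i, r in test]))
    return sum(scores) / len(scores)

def fit_global_mean(train):
    """Trivial model: always predict the mean rating of the training fold."""
    mean = sum(r for _, _, r in train) / len(train)
    return lambda user, item: mean

def fit_constant_three(train):
    """Trivial model: always predict 3.0."""
    return lambda user, item: 3.0

# Toy ratings as (user, item, rating); hypothetical values.
toy = [("u%d" % (n % 4), "i%d" % (n % 3), float(1 + n % 5)) for n in range(30)]

candidates = {"global_mean": fit_global_mean, "constant_3": fit_constant_three}
scores = {name: k_fold_rmse(toy, fit) for name, fit in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

The important point is that each score is computed on ratings the model did not see during fitting, which is what makes the comparison between algorithms fair.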

In the next cell, we can change the list of models and rerun the cross-validation.  

We try SVD, KNNBaseline, KNNBasic, KNNWithMeans and KNNWithZScore.

The results of all the tested models are collected in a data frame named result, so that we can choose the model with the best score.

In [64]:
benchmark = []

# Iterate over all algorithms
for algorithm in [SVD(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore()]:
    
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=False)
    
    # Average the metrics over the folds and append the algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    # Series.append was removed in pandas 2.0, so we use pd.concat instead
    tmp = pd.concat([tmp, pd.Series([algorithm.__class__.__name__], index=['Algorithm'])])
    benchmark.append(tmp)
    
result = pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')   
result

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.


Unnamed: 0_level_0,test_rmse,fit_time,test_time
Algorithm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
KNNBasic,1.410192,0.494234,1.405025
KNNBaseline,1.412596,0.605503,1.520622
SVD,1.442522,0.57195,0.033864
KNNWithMeans,1.590514,0.557834,1.395574
KNNWithZScore,1.60306,0.631478,1.538227


We select the best model using the result data frame above.

So, we can see that the best algorithm is KNNBasic,
with a mean cross-validated RMSE of about 1.41.

## B. Training the best algorithm

Now that we have found a good algorithm, we need to train it on the full dataset in order to make predictions.

In [65]:
algo = KNNBasic()
algo.fit(trainsetfull)

Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x7fb77be23820>

We can see information about the chosen algorithm : 

In [66]:
print('algo: {0}, k = {1}, min_k = {2}, sim = {3}'.format(algo.__class__.__name__, algo.k, algo.min_k, algo.sim_options))

algo: KNNBasic, k = 40, min_k = 1, sim = {'user_based': True}
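
To give an intuition of what these settings mean, here is a minimal plain-Python sketch of a user-based KNN prediction with the MSD (mean squared difference) similarity reported in the training log. It is an illustration under our own assumptions, not Surprise's code: the function names and toy ratings are made up, and the handling of the case with too few neighbours is simplified.

```python
# Toy data as {user: {item: rating}}; hypothetical values.
ratings = {
    "a": {"x": 4.0, "y": 2.0},
    "b": {"x": 4.0, "y": 2.0, "z": 5.0},
    "c": {"x": 2.0, "y": 4.0, "z": 1.0},
}

def msd_similarity(ru, rv):
    """Return 1 / (MSD + 1), where MSD is the mean squared difference on common items."""
    common = ru.keys() & rv.keys()
    if not common:
        return 0.0
    msd = sum((ru[i] - rv[i]) ** 2 for i in common) / len(common)
    return 1.0 / (msd + 1.0)

def knn_basic_estimate(ratings, user, item, k=40, min_k=1):
    """Similarity-weighted mean of the ratings given to item by the k most similar users."""
    neighbours = sorted(
        (
            (msd_similarity(ratings[user], r_other), r_other[item])
            for other, r_other in ratings.items()
            if other != user and item in r_other
        ),
        reverse=True,
    )[:k]
    # Keep only neighbours with a positive similarity.
    neighbours = [(s, r) for s, r in neighbours if s > 0]
    if len(neighbours) < min_k:
        return None  # simplification: this sketch gives up when neighbours are missing
    return sum(s * r for s, r in neighbours) / sum(s for s, _ in neighbours)

estimate = knn_basic_estimate(ratings, "a", "z")
print(estimate)
```

In this toy example, user "b" agrees perfectly with user "a" on their common items, so the rating of "b" dominates the estimate for item "z".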


# V. Prediction

## A. Prediction on a pair (userid and an itemid)

In [67]:
algo.predict(uid = 270384, iid = '5_12')

Prediction(uid=270384, iid='5_12', r_ui=None, est=2.725, details={'actual_k': 40, 'was_impossible': False})

In this case, the predicted rating for customer 270384 and item 5_12 is about 2.73.

## B. Prediction on a user_id

With anti_pred_df, we predict every unknown rating in the anti_set.

In [68]:
anti_pre = algo.test(anti_set)
anti_pred_df = pd.DataFrame(anti_pre).merge(items , left_on = ['iid'], right_on = ['ItemId'])
anti_pred_df = pd.DataFrame(anti_pred_df).merge(users , left_on = ['uid'], right_on = ['userId'])

In the data frame below, r_ui stands for the rating given by the user to the item.  
For the anti_set, its value is always 2.999398, the default mean, which means that the customer has not rated this item yet.  
est is the estimated rating for the (user, item) pair.
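
The filtering done in the next cell boils down to keeping, for one user, the items whose estimate is above a threshold and sorting them from best to worst. Here is a plain-Python sketch of that logic; the function name `recommend`, the `top_n` parameter and the toy predictions are hypothetical.

```python
# Toy predictions as (uid, iid, r_ui, est); hypothetical values.
toy_predictions = [
    (1, "2_4", 3.0, 3.36),
    (1, "4_1", 3.0, 3.29),
    (1, "5_7", 3.0, 2.10),
    (2, "2_4", 3.0, 4.50),
]

def recommend(predictions, user_id, threshold=3.0, top_n=10):
    """Return (item, estimate) pairs for user_id with estimate > threshold, best first."""
    kept = [
        (iid, est)
        for uid, iid, _, est in predictions
        if uid == user_id and est > threshold
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:top_n]

top_items = recommend(toy_predictions, user_id=1)
print(top_items)
```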

In [69]:
user_id_to_predict = 266784
threshold = 3.0
anti_pred_df[(anti_pred_df['est']>threshold)&(anti_pred_df['userId']==user_id_to_predict)].sort_values(by=['est'], ascending=False)

Unnamed: 0,uid,iid,r_ui,est,details,ItemId,Item_Name,userId
73396,266784,2_4,2.999398,3.3625,"{'actual_k': 40, 'was_impossible': False}",2_4,Footwear_Kids,266784
73404,266784,4_4,2.999398,3.3625,"{'actual_k': 40, 'was_impossible': False}",4_4,Bags_Women,266784
73397,266784,4_1,2.999398,3.2875,"{'actual_k': 40, 'was_impossible': False}",4_1,Bags_Mens,266784
73403,266784,3_8,2.999398,3.2125,"{'actual_k': 40, 'was_impossible': False}",3_8,Electronics_Personal Appliances,266784
73409,266784,3_9,2.999398,3.2125,"{'actual_k': 40, 'was_impossible': False}",3_9,Electronics_Cameras,266784
73402,266784,3_10,2.999398,3.1875,"{'actual_k': 40, 'was_impossible': False}",3_10,Electronics_Audio and video,266784
73406,266784,3_5,2.999398,3.1875,"{'actual_k': 40, 'was_impossible': False}",3_5,Electronics_Computers,266784
73412,266784,5_3,2.999398,3.1125,"{'actual_k': 40, 'was_impossible': False}",5_3,Books_Comics,266784
73415,266784,2_1,2.999398,3.1,"{'actual_k': 40, 'was_impossible': False}",2_1,Footwear_Mens,266784
73398,266784,5_11,2.999398,3.075,"{'actual_k': 40, 'was_impossible': False}",5_11,Books_Children,266784


With nice_pred_df, we predict all the known ratings to see how the algorithm performs.

In [70]:
nice_pre = algo.test(nice_set)
nice_pred_df = pd.DataFrame(nice_pre).merge(items , left_on = ['iid'], right_on = ['ItemId'])
nice_pred_df = pd.DataFrame(nice_pred_df).merge(users , left_on = ['uid'], right_on = ['userId'])

In the data frame below, r_ui also stands for the rating given by the user to the item, but in the nice_set the user has already interacted with the item, so r_ui gives us the real rating.

In [71]:
user_id_to_predict = 266785
threshold = 3.0
nice_pred_df[(nice_pred_df['est']>threshold)&(nice_pred_df['userId']==user_id_to_predict)].sort_values(by=['est'], ascending=False)

Unnamed: 0,uid,iid,r_ui,est,details,ItemId,Item_Name,userId
3197,266785,5_11,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}",5_11,Books_Children,266785
3198,266785,6_10,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}",6_10,Home and kitchen_Kitchen,266785
3194,266785,2_1,4.0,4.0,"{'actual_k': 40, 'was_impossible': False}",2_1,Footwear_Mens,266785
3195,266785,2_4,3.5,3.416968,"{'actual_k': 40, 'was_impossible': False}",2_4,Footwear_Kids,266785


## C. Checking metrics

Here we check how the model performs using the RMSE metric.

Computing metrics on the anti_set is not meaningful:  
we would be comparing all our estimates to a mean value set by default.

With the nice_set, however, we can compare the ratings predicted by the algorithm with the real ratings given by customers.
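
As a reminder, and using our own notation, the RMSE over a set of predictions $\hat{R}$ is the square root of the mean squared difference between the real rating $r_{ui}$ and the estimate $\hat{r}_{ui}$ (the est column):

$$
\mathrm{RMSE} = \sqrt{\frac{1}{|\hat{R}|} \sum_{\hat{r}_{ui} \in \hat{R}} \left(r_{ui} - \hat{r}_{ui}\right)^2}
$$

The lower the value, the closer the estimates are to the real ratings, and the result is expressed in the same unit as the ratings.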

In [72]:
#on nice_set
predictions = algo.test(nice_set)
print('On the nice_set, the RMSE score is :', accuracy.rmse(predictions))

RMSE: 0.0206
On the nice_set, the RMSE score is : 0.020617845744627092


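So, the RMSE score is about 0.02, which means that the estimated ratings are on average about 0.02 higher or lower than the actual ratings.  
Our scale goes from 1 to 5, so this looks like a very good result. Keep in mind, however, that the nice_set contains the same ratings the model was trained on, so this is an in-sample score; the cross-validated RMSE of about 1.41 from section IV.A is the better guide to performance on unseen ratings.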

Below, we display the full data frame of nice_set predictions to compare estimates with real ratings and see how the model performs.

In [74]:
pd.DataFrame(predictions)

Unnamed: 0,uid,iid,r_ui,est,details
0,266783,1_4,2.0,2.0,"{'actual_k': 40, 'was_impossible': False}"
1,266783,2_1,4.0,4.0,"{'actual_k': 40, 'was_impossible': False}"
2,266783,5_10,2.0,2.0,"{'actual_k': 40, 'was_impossible': False}"
3,266784,3_4,2.0,2.0,"{'actual_k': 40, 'was_impossible': False}"
4,266784,5_10,3.0,3.0,"{'actual_k': 40, 'was_impossible': False}"
...,...,...,...,...,...
16875,275261,5_11,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}"
16876,275261,5_3,3.0,3.0,"{'actual_k': 40, 'was_impossible': False}"
16877,275265,4_1,1.0,1.0,"{'actual_k': 40, 'was_impossible': False}"
16878,275265,5_12,3.0,3.0,"{'actual_k': 40, 'was_impossible': False}"


## D. Display Result

In [75]:
#from https://nbviewer.jupyter.org/github/NicolasHug/Surprise/blob/master/examples/notebooks/KNNBasic_analysis.ipynb

def get_Iu(uid):
    """ return the number of items rated by given user
    args: 
      uid: the id of the user
    returns: 
      the number of items rated by the user
    """
    try:
        return len(trainsetfull.ur[trainsetfull.to_inner_uid(uid)])
    except ValueError: # user was not part of the trainset
        return 0
    
def get_Ui(iid):
    """ return number of users that have rated given item
    args:
      iid: the raw id of the item
    returns:
      the number of users that have rated the item.
    """
    try: 
        return len(trainsetfull.ir[trainsetfull.to_inner_iid(iid)])
    except ValueError:
        return 0
    
df = pd.DataFrame(predictions, columns=['uid', 'iid', 'rui', 'est', 'details'])
df['Iu'] = df.uid.apply(get_Iu)
df['Ui'] = df.iid.apply(get_Ui)
df['err'] = abs(df.est - df.rui)
best_predictions = df.sort_values(by='err')[:10]
worst_predictions = df.sort_values(by='err')[-10:]

In [76]:
#display best prediction
best_predictions

Unnamed: 0,uid,iid,rui,est,details,Iu,Ui,err
0,266783,1_4,2.0,2.0,"{'actual_k': 40, 'was_impossible': False}",3,703,0.0
11103,272357,3_4,3.0,3.0,"{'actual_k': 40, 'was_impossible': False}",6,767,0.0
11104,272357,3_5,2.0,2.0,"{'actual_k': 40, 'was_impossible': False}",6,709,0.0
11105,272357,5_10,2.0,2.0,"{'actual_k': 40, 'was_impossible': False}",6,729,0.0
11106,272357,5_7,1.0,1.0,"{'actual_k': 40, 'was_impossible': False}",6,771,0.0
11107,272358,2_4,3.0,3.0,"{'actual_k': 40, 'was_impossible': False}",4,728,0.0
11109,272358,3_9,4.0,4.0,"{'actual_k': 40, 'was_impossible': False}",4,721,0.0
11111,272359,3_9,2.0,2.0,"{'actual_k': 40, 'was_impossible': False}",3,721,0.0
11112,272359,5_10,5.0,5.0,"{'actual_k': 40, 'was_impossible': False}",3,729,0.0
11113,272359,6_10,2.0,2.0,"{'actual_k': 40, 'was_impossible': False}",3,734,0.0


These results are very good: with more than 699 ratings for each item (Ui), we manage to find 40 neighbours  
and get many perfect scores where the estimate (est) equals the real rating (rui).

In [77]:
worst_predictions

Unnamed: 0,uid,iid,rui,est,details,Iu,Ui,err
5011,269320,3_8,3.333333,3.038459,"{'actual_k': 40, 'was_impossible': False}",6,708,0.294874
7824,270706,3_10,3.333333,3.035698,"{'actual_k': 40, 'was_impossible': False}",6,709,0.297636
2066,267888,2_1,1.666667,1.964395,"{'actual_k': 40, 'was_impossible': False}",4,702,0.297729
4779,269225,3_4,2.666667,2.964516,"{'actual_k': 40, 'was_impossible': False}",7,767,0.29785
14924,274306,5_11,3.666667,3.964531,"{'actual_k': 40, 'was_impossible': False}",6,759,0.297864
6510,270049,6_2,1.666667,1.964547,"{'actual_k': 40, 'was_impossible': False}",5,756,0.29788
13728,273682,5_7,3.666667,3.964653,"{'actual_k': 40, 'was_impossible': False}",4,771,0.297986
16176,274923,5_7,4.333333,4.035286,"{'actual_k': 40, 'was_impossible': False}",7,771,0.298048
3890,268819,5_10,2.333333,2.022662,"{'actual_k': 40, 'was_impossible': False}",8,729,0.310672
13738,273684,6_11,2.333333,2.022142,"{'actual_k': 40, 'was_impossible': False}",6,763,0.311191


Even the worst predictions seem very accurate to me.

These very good results mainly tell us that we found neighbours that are very similar to the customers we make recommendations for.

## Thank you for finishing part 3. This model built with the Surprise library gives us really accurate results.
## The fourth part also uses a collaborative filtering method, but its results are not as good as these.