# Recommendation System Individual Project
Prepared by: Wajih Arfaoui

In [33]:
import pandas as pd 
import numpy as np 
from IESEGRecSys import eval
from IESEGRecSys.model import ContentBased
from surprise import KNNBasic, Reader, Dataset, SVD, CoClustering, BaselineOnly, accuracy
from surprise.model_selection import GridSearchCV, cross_validate, KFold
from sklearn.decomposition import PCA
# NLP packages
import nltk # pip install nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import SnowballStemmer

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\warfaoui\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\warfaoui\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Data Preparation

After importing the libraries needed throughout the process, we load our three datasets:  
- **Meta** dataset: contains additional item data (title, description, image_url)  
- **Train** dataset: contains user-item ratings, including review text and additional user data  
- **Test** dataset: contains the user-item pairs whose ratings we need to predict  

Once this was done, I checked the cleanliness of our train data by:  
- Checking the range of the ratings, which need to be integers in the range [1,5].  
- Checking for missing values, especially in the columns `userID`, `asin` and `overall`.  
- Checking the data types per column.  

Finally, we create a **Reader** object with the `rating_scale` attribute, a tuple holding the lowest and highest possible ratings. It's important to get this parameter right, otherwise parts of your data will be ignored. In our case, the minimum rating is 1.0 and the maximum rating is 5.0.  

Next, I transform our train dataset into the **Surprise** format, a sparse matrix in which the **users / items** are the **rows / columns** and the **ratings** are the entries. 
Since I am going to use cross-validation, I don't need a separate hold-out split: the whole dataset is used for training, and each cross-validation fold is held out in turn for testing.



In [34]:
# Read datasets 
meta = pd.read_csv("metadata.csv")
train = pd.read_csv("train.csv")
test = pd.read_csv("test_students.csv")

In [35]:
meta["main_cat"].value_counts()

Pet Supplies                   2510
Amazon Home                      29
Tools & Home Improvement          7
Health & Personal Care            5
Grocery                           5
Sports & Outdoors                 5
Industrial &amp; Scientific       4
Automotive                        3
Cell Phones & Accessories         2
Industrial & Scientific           2
Toys & Games                      2
Sports &amp; Outdoors             2
Baby                              1
Name: main_cat, dtype: int64

In [36]:
# check values for the ratings 
train["overall"].value_counts()

5.0    107620
4.0     23909
3.0     14029
1.0      8474
2.0      7721
Name: overall, dtype: int64

In [37]:
# check missing values
train.isna().sum()

userID             0
overall            0
asin               0
vote          145992
reviewText         2
summary            1
style          29641
image         157207
dtype: int64

In [38]:
# check columns types 
train.dtypes

userID          int64
overall       float64
asin           object
vote           object
reviewText     object
summary        object
style          object
image          object
dtype: object

In [39]:
test["overall"]=0.0
test.head()

Unnamed: 0,ID,userID,asin,overall
0,21069B00BFK2B24,21069,B00BFK2B24,0.0
1,3506B00ZK0Y7R2,3506,B00ZK0Y7R2,0.0
2,21907B0002AQPA2,21907,B0002AQPA2,0.0
3,14092B0002DHXX2,14092,B0002DHXX2,0.0
4,3085B0006VB3SQ,3085,B0006VB3SQ,0.0


In [40]:
reader = Reader(rating_scale=(1,5))
data = Dataset.load_from_df(train[["userID","asin","overall"]], reader)
df_train = data.build_full_trainset()
df_test = list(test[["userID","asin","overall"]].itertuples(index = False , name = None))

## Collaborative filtering approach

In the first part of this project, I apply **collaborative filtering** models to our dataset, with the aim of finding similarities between **items / users** through commonly **rated** items.  
Under this method, I opt for two main approaches:  
- **Memory-based** models, which calculate the similarities between **users / items** based on **user-item rating pairs** (I will use `KNNBasic`).  
- **Model-based** models, which use machine learning algorithms to estimate the ratings (I will use `SVD`, `ALS` and `CoClustering`).  
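
To make the memory-based idea concrete, here is a minimal sketch of the two similarity measures tuned later on, computed between two users over the items they have both rated. The function names and the toy ratings are hypothetical, and this is an illustration rather than the Surprise implementation.

```python
from math import sqrt

def cosine_sim(ratings_a, ratings_b):
    """Cosine similarity between two users over their co-rated items."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in common)
    norm_a = sqrt(sum(ratings_a[i] ** 2 for i in common))
    norm_b = sqrt(sum(ratings_b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def pearson_sim(ratings_a, ratings_b):
    """Pearson correlation between two users over their co-rated items."""
    common = set(ratings_a) & set(ratings_b)
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[i] for i in common) / len(common)
    mean_b = sum(ratings_b[i] for i in common) / len(common)
    num = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    den = sqrt(sum((ratings_a[i] - mean_a) ** 2 for i in common)) * \
          sqrt(sum((ratings_b[i] - mean_b) ** 2 for i in common))
    return num / den if den else 0.0

# hypothetical toy ratings: {item: rating}
user_a = {"i1": 5, "i2": 3, "i3": 4}
user_b = {"i1": 4, "i2": 2, "i3": 3, "i4": 5}

print(cosine_sim(user_a, user_b))
print(pearson_sim(user_a, user_b))
```

User B rates every common item exactly one point lower than user A, so the Pearson measure treats the two users as perfectly correlated, while the cosine measure is slightly below 1 because it does not centre the ratings.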

To benchmark these models and identify the best performer, I opt for **Grid Search**.  

`GridSearchCV`, imported here from `surprise.model_selection`, automatically searches for the best parameters of a particular model, a process known as **hyperparameter tuning**.  

To implement this algorithm, I started by creating a **dictionary** of all the parameters and the corresponding sets of values that I want to test for best performance. 
Once the parameter dictionary is created, the next step is to create the `GridSearchCV` object for our model. I passed **the algorithm class** as the first argument, the **param_grid** dictionary as the second, **RMSE** as the performance metric through the `measures` parameter, and finally the number of cross-validation folds through the `cv` parameter, which is 5 in this case.

After fitting the model, I checked which parameters gave the lowest RMSE. Where the best combination picked the largest value tested for a parameter, I tried larger values for that parameter to see whether performance improved further. 
After checking this, I printed the RMSE corresponding to the best combination of parameters, which I use later on to compare the models with each other. 
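
The search procedure described above can be sketched in plain Python: expand the parameter dictionary into every combination, score each one with RMSE, and keep the lowest. The toy "model" (`toy_predict`, a shrunk mean) and the data are hypothetical stand-ins chosen only to keep the example self-contained, and a single validation set is used instead of cross-validation folds.

```python
from itertools import product
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def expand_grid(param_grid):
    """Yield one dict per combination of the values listed in param_grid."""
    names = list(param_grid)
    for values in product(*(param_grid[n] for n in names)):
        yield dict(zip(names, values))

def toy_predict(train_ratings, shrink, prior):
    """Hypothetical model: the training mean shrunk toward a prior value."""
    return (sum(train_ratings) + shrink * prior) / (len(train_ratings) + shrink)

train_ratings = [5, 4, 5, 3, 5]
valid_ratings = [5, 4, 4]
param_grid = {"shrink": [0, 5], "prior": [3.0, 4.0]}

results = []
for params in expand_grid(param_grid):
    estimate = toy_predict(train_ratings, **params)
    score = rmse(valid_ratings, [estimate] * len(valid_ratings))
    results.append((score, params))

# the best combination is the one with the lowest RMSE
best_score, best_params = min(results, key=lambda r: r[0])
print(best_params, round(best_score, 4))
```

The grid has 2 x 2 = 4 combinations here; the number of fits grows multiplicatively with each added parameter, which is why the grids used in this project are kept small.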

#### Memory-based models

In [None]:
# User model hyperparameter tuning 
param_grid = {'k': [10,15,20,25,30],
              'sim_options': {'name': ["pearson", 'cosine'],
                              'user_based': [True]}
              }
knnbasic_gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse'], cv=5)
knnbasic_gs.fit(data)

In [None]:
# display the best parameter 
print(knnbasic_gs.best_params)

{'rmse': {'k': 30, 'sim_options': {'name': 'pearson', 'user_based': True}}}


In [None]:
# display the best parameters rmse 
print(knnbasic_gs.best_score)

In [None]:
# item model hyperparameter tuning 
param_grid = {'k': [5,10,15,20,25],
              'sim_options': {'name': ["pearson", 'cosine'],
                              'user_based': [False]}
              }
knnbasic_gs_i = GridSearchCV(KNNBasic, param_grid, measures=['rmse'], cv=5)
knnbasic_gs_i.fit(data)

In [52]:
# display the best parameter 
print(knnbasic_gs_i.best_params)

{'rmse': {'k': 20, 'sim_options': {'name': 'pearson', 'user_based': False}}}


In [53]:
# display the best parameters rmse 
print(knnbasic_gs_i.best_score)

{'rmse': 1.1824484369990143}


#### Model-based models

In this part of the project, I use another subgroup of collaborative filtering models, the **model-based** ones. Unlike **memory-based** models, these use machine learning algorithms.  
In the upcoming steps, I concentrate on the `SVD`, `ALS` and `CoClustering` methods.

**Singular value decomposition (SVD)** is a matrix factorisation technique that reduces the dimensionality of a dataset from N dimensions to K dimensions (where K < N). In our context, it recommends Amazon products to users based on the latent features of the user-item matrix. The code below uses the SVD latent factor model for matrix factorization. 
The hyperparameters I considered for this method are:  

1. **n_factors**: this parameter determines how many latent factors the model will try to find.  
2. **n_epochs**: this parameter determines how many times the gradient descent calculations are repeated.  
3. **lr_all**: the learning rate for all of the parameters, i.e. the step size the model uses to minimise the cost function.  
4. **reg_all**: the regularisation factor for all of the parameters.  
5. **biased**: whether to use the biased or unbiased version of the algorithm.
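
To show how these hyperparameters interact, here is a minimal sketch of the biased latent factor prediction (global mean + user bias + item bias + dot product of the factor vectors) and of one stochastic gradient descent update for a single rating. The function names and starting values are hypothetical; `lr` and `reg` play the roles of `lr_all` and `reg_all`, the length of the factor vectors corresponds to `n_factors`, and repeating the update over all ratings `n_epochs` times gives the training loop.

```python
def predict(mu, b_u, b_i, p_u, q_i):
    """Biased prediction: global mean + user bias + item bias + factor dot product."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))

def sgd_step(rating, mu, b_u, b_i, p_u, q_i, lr=0.02, reg=0.02):
    """One regularised gradient step on a single (user, item, rating) triple."""
    err = rating - predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i

# hypothetical starting point with two latent factors
mu, b_u, b_i = 4.0, 0.0, 0.0
p_u, q_i = [0.1, 0.1], [0.1, 0.1]
rating = 5.0

before = predict(mu, b_u, b_i, p_u, q_i)
b_u, b_i, p_u, q_i = sgd_step(rating, mu, b_u, b_i, p_u, q_i)
after = predict(mu, b_u, b_i, p_u, q_i)
print(before, after)
```

After one step the prediction moves from the initial estimate toward the observed rating of 5; a larger `lr` takes bigger steps, while a larger `reg` pulls the biases and factors back toward zero.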


In [54]:
train[["userID","asin","overall"]].shape

(161753, 3)

In [10]:
# SVD model hyperparameter tuning
param_grid = {'n_factors':[300],'n_epochs': [200], 'lr_all':[0.02],'biased':[True],
              'reg_all': [0.02]}
svd_gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=300,refit=True)
svd_gs.fit(data)

In [55]:
# display the best parameter 
print(svd_gs.best_params)

{'rmse': {'n_factors': 300, 'n_epochs': 200, 'lr_all': 0.02, 'biased': True, 'reg_all': 0.02}}


In [56]:
# display the best parameters rmse 
print(svd_gs.best_score)

{'rmse': 1.0371928726286255}


**Alternating least squares (ALS)** is a matrix factorization algorithm that uses Alternating Least Squares with Weighted-Lambda-Regularization (ALS-WR). It factors the user-to-item **matrix A** into the user-to-feature **matrix U** and the item-to-feature **matrix M**. It runs the ALS algorithm in a parallel fashion and tries to find the factor weights that minimize the squared error between predicted and actual ratings.  
Since ALS uses baselines in the minimization objective function, I use the `BaselineOnly` algorithm from the Surprise library, configured with these parameters:  
1. **reg_i**: the regularization parameter for products.  
2. **reg_u**: the regularization parameter for users.  
3. **n_epochs**: the number of iterations of the ALS procedure.  
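
As an illustration of what these three parameters control, the sketch below estimates user and item baselines by alternating between the two: item biases are refit with user biases held fixed, then the reverse. The function name `als_baselines` and the toy ratings are hypothetical, and the code is a simplified reading of the procedure rather than the library's code; the default values mirror the best parameters reported by the grid search in this notebook.

```python
def als_baselines(ratings, reg_i=10, reg_u=5, n_epochs=40):
    """Estimate baselines from a list of (user, item, rating) tuples."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_u = {u: 0.0 for u, _, _ in ratings}
    b_i = {i: 0.0 for _, i, _ in ratings}
    for _ in range(n_epochs):
        # refit item biases with user biases held fixed
        for item in b_i:
            devs = [r - mu - b_u[u] for u, i, r in ratings if i == item]
            b_i[item] = sum(devs) / (reg_i + len(devs))
        # refit user biases with item biases held fixed
        for user in b_u:
            devs = [r - mu - b_i[i] for u, i, r in ratings if u == user]
            b_u[user] = sum(devs) / (reg_u + len(devs))
    return mu, b_u, b_i

# hypothetical toy ratings
ratings = [("u1", "A", 5), ("u1", "B", 4), ("u2", "A", 4), ("u2", "B", 2), ("u3", "A", 5)]
mu, b_u, b_i = als_baselines(ratings)

# baseline estimate for a (user, item) pair
print(mu + b_u["u2"] + b_i["A"])
```

The regularization terms sit in the denominators, so users or items with few ratings get biases shrunk strongly toward zero and their predictions stay close to the global mean.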

In [None]:
# ALS model hyperparameter tuning
param_grid = {'bsl_options': 
                {'reg_i':[5,10,15], 'reg_u':[5,10], 'n_epochs': [30,40]}}
als_gs = GridSearchCV(BaselineOnly, param_grid, measures=['rmse'], cv=5)
als_gs.fit(data)

In [58]:
# display the best parameter 
print(als_gs.best_params)

{'rmse': {'bsl_options': {'reg_i': 10, 'reg_u': 5, 'n_epochs': 40}}}


In [59]:
# display the best parameters rmse 
print(als_gs.best_score)

{'rmse': 1.0728335831007654}


**Co-clustering** is a special case of clustering in which the rows and columns of the matrix are clustered simultaneously. It co-groups users and items based on the similarity of their **pairwise interactions**. To configure this model, I chose to tune the following parameters:  
1. **n_cltr_u**: the number of user clusters  
2. **n_cltr_i**: the number of product clusters  
3. **n_epochs**: the number of iterations of the optimization loop.
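
Once users and items are assigned to clusters, a rating can be estimated from cluster averages. The sketch below follows the usual co-clustering prediction rule (co-cluster mean, corrected by how far the user and the item sit from their own cluster means); the function name and the numbers are hypothetical and the cluster assignment step itself is not shown.

```python
def cocluster_predict(cocluster_mean, user_mean, user_cluster_mean,
                      item_mean, item_cluster_mean):
    """Co-cluster average adjusted by the user's and item's offsets from their clusters."""
    return (cocluster_mean
            + (user_mean - user_cluster_mean)
            + (item_mean - item_cluster_mean))

# hypothetical averages for one (user, item) pair
estimate = cocluster_predict(
    cocluster_mean=4.0,     # mean rating in the (user cluster, item cluster) block
    user_mean=4.5,          # this user's mean rating
    user_cluster_mean=4.2,  # mean rating of the user's cluster
    item_mean=3.9,          # this item's mean rating
    item_cluster_mean=4.1,  # mean rating of the item's cluster
)
print(round(estimate, 2))
```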

In [60]:
# CoClustering model hyperparameter tuning
param_grid = {'n_cltr_u':[10,15], 'n_cltr_i':[10,15],'n_epochs': [10,20]}
clust_gs = GridSearchCV(CoClustering, param_grid, measures=['rmse'], cv=5)
clust_gs.fit(data)

In [61]:
# display the best parameter 
print(clust_gs.best_params)

{'rmse': {'n_cltr_u': 10, 'n_cltr_i': 10, 'n_epochs': 10}}


In [62]:
# display the best parameters rmse 
print(clust_gs.best_score)

{'rmse': 1.2639709573502595}


## Content-based approach

This recommendation process is based on the similarity between products. **Similarity** is measured on the content of those products; in our case, it is based on the `description` column. 
First, I tokenize the `description` column into a list of separate words, keeping only alphabetic ones. Then I remove stopwords, which carry little information for NLP, and stem the remaining tokens. The preprocessing ends by applying the `TfidfVectorizer` from scikit-learn to convert the collection of raw documents into a matrix of **TF-IDF** features. 
I set the hyperparameter `min_df` to 5, which means that `TfidfVectorizer` ignores terms whose document frequency is strictly lower than this threshold.
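
To illustrate what the vectorizer does with `min_df`, here is a small pure-Python sketch of TF-IDF weighting: terms that appear in fewer than `min_df` documents are dropped, the remaining terms are weighted by a smoothed inverse document frequency, and each document vector is scaled to unit length. The function name `tfidf` and the toy documents are hypothetical, and the sketch is a simplification meant for intuition, not a replacement for `TfidfVectorizer`.

```python
from collections import Counter
from math import log, sqrt

def tfidf(docs, min_df=1):
    """docs: list of token lists. Returns (vocabulary, list of {term: weight})."""
    n_docs = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(t for t, c in df.items() if c >= min_df)
    # smoothed inverse document frequency
    idf = {t: log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(t for t in doc if t in idf)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vocab, rows

# hypothetical, already tokenized descriptions
docs = [["dog", "food", "chicken"], ["cat", "food"], ["dog", "toy", "rope"]]
vocab, rows = tfidf(docs, min_df=2)
print(vocab)
print(rows)
```

With `min_df=2`, only "dog" and "food" survive because every other term occurs in a single document; a higher threshold shrinks the vocabulary, which is how the matrix above ends up with 2874 columns rather than one per distinct stem.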

Once the document-term matrix is ready and transformed into a DataFrame, we convert the ratings into the Surprise dataset format and move on to fitting the content-based model.  
The model I use is **ContentBased()**, imported from the `IESEGRecSys` library, which has a single hyperparameter:  
1. **NN**: when this parameter is set, the model filters the matrix down to the *k* nearest neighbors with non-negative similarity.  
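
The role of the neighbor count can be illustrated with a small sketch: compute the cosine similarity between the target item and the items a user has already rated, keep the `nn` most similar ones with non-negative similarity, and predict the similarity-weighted mean of their ratings. This is an assumption about how such a model can work, written for intuition only; it is not taken from the `IESEGRecSys` source, and all names and values are hypothetical.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def predict_rating(target, user_ratings, item_vectors, nn=2):
    """Similarity-weighted mean of the user's ratings on the nn nearest items."""
    sims = [(cosine(item_vectors[target], item_vectors[item]), rating)
            for item, rating in user_ratings.items() if item != target]
    # keep neighbors with non-negative similarity, strongest first
    neighbours = sorted((s for s in sims if s[0] >= 0), reverse=True)[:nn]
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return None  # no informative neighbor
    return sum(sim * rating for sim, rating in neighbours) / total

# hypothetical two-feature item vectors and one user's ratings
item_vectors = {"A": [1, 0], "B": [1, 0], "C": [0, 1], "D": [1, 1]}
user_ratings = {"B": 5, "C": 1}

print(predict_rating("A", user_ratings, item_vectors))
print(predict_rating("D", user_ratings, item_vectors))
```

Item A has the same content as item B, so its estimate follows the rating given to B, whereas item D is equally similar to B and C and lands halfway between the two ratings.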

To fit the model, I use the **K-fold** cross-validator, which splits the train dataset into *k* consecutive folds; each fold is used once for validation while the *k-1* remaining folds form the training set on which the model is fit. 
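
The splitting logic can be sketched as follows, assuming plain consecutive folds of near-equal size; shuffling, which a library implementation may apply before splitting, is left out. The helper name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs for consecutive folds of near-equal size."""
    base, extra = divmod(n_samples, n_splits)
    # the first `extra` folds receive one additional sample
    fold_sizes = [base + (1 if i < extra else 0) for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print(test_idx, "held out;", len(train_idx), "used for training")
```

Every sample is held out exactly once, so averaging the per-fold RMSE uses the whole dataset for evaluation without ever testing on data the model was fit on.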



In [65]:
train=train[["userID","asin","overall"]]

In [66]:
meta= meta.drop_duplicates(subset=['asin'])
meta= meta[["asin",'description']]

In [67]:
# Tokenize, case conversion & only alphabetic
tokens = meta['description'].apply(lambda txt: [word.lower() for word in word_tokenize(str(txt)) if word.isalpha()])

In [68]:
# setup stop words list
stop_words = stopwords.words('english')
stop_words.append('nan')

stemmer = SnowballStemmer("english")

# remove stopwords and tokens of 2 characters or fewer, then stem
token_stem = tokens.apply(lambda lst_token: [stemmer.stem(tok) for tok in lst_token if tok not in stop_words and len(tok) > 2])

In [69]:
# TFIDF vectorizer
tfidf = TfidfVectorizer(min_df=5)

# apply tf-idf vectorizer -> document-term-matrix in sparse format
dtm = tfidf.fit_transform([" ".join(x) for x in token_stem])


df_dtm = pd.DataFrame(dtm.toarray(), columns=tfidf.get_feature_names_out(), index=meta["asin"])
df_dtm.shape

(2307, 2874)

In [70]:
train = train[train["asin"].isin (df_dtm.index.values)]

In [71]:
# convert into a surprise dataframe
reader = Reader(rating_scale = (1,5))
dataset = Dataset.load_from_df(train, reader)

In [72]:
# initiate the model
cb = ContentBased(NN=50)

In [None]:
# apply cross validation on the cb model

kf = KFold(n_splits=100)
rmse=[]
for trainset, testset in kf.split(dataset):

    # train and test algorithm.
    cb.fit(df_dtm)
    cb.fit_ratings(trainset)
    predictions = cb.test(testset)

    # Compute and print Root Mean Squared Error
    rmse.append(accuracy.rmse(predictions, verbose=True))

In [75]:
cb_rmse = pd.DataFrame(rmse,columns=["CB"]).mean()

## Benchmarking

After hyperparameter tuning and fitting all the models, I compared them on their cross-validated RMSE. As the table below shows, the **SVD** model performs best, with an RMSE of 1.037 using these hyperparameters: **'n_factors': 300, 'n_epochs': 200, 'lr_all': 0.02, 'biased': True, 'reg_all': 0.02 and 300 cross-validation folds** 

In [97]:
#Benchmark different models
models = {"IB_10":knnbasic_gs_i, "SVD_300":svd_gs,"Clust_10":clust_gs, "ALS_40":als_gs}
bench = pd.concat([pd.DataFrame.from_dict(mod.best_score,orient='index') for mod in models.values()], axis=1,)
bench.columns = list(models.keys())
bench["CB_50"] = cb_rmse.iloc[0]  # positional access; cb_rmse is indexed by the label "CB"
bench

Unnamed: 0,IB_10,SVD_300,Clust_10,ALS_40,CB_50
rmse,1.182448,1.037193,1.263971,1.072834,1.129827
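
The ranking implied by this table can be read off with a few lines of code. The values below are copied from the output above; the variable names are only illustrative.

```python
# RMSE per model, copied from the benchmark table
benchmark = {
    "IB_10": 1.182448,
    "SVD_300": 1.037193,
    "Clust_10": 1.263971,
    "ALS_40": 1.072834,
    "CB_50": 1.129827,
}

# lower RMSE is better, so sort ascending
ranking = sorted(benchmark.items(), key=lambda kv: kv[1])
best_model, best_rmse = ranking[0]

for name, score in ranking:
    print(f"{name:<9} {score:.6f}")
```

Sorted this way, the SVD model comes first, followed by ALS, the content-based model, the item-based KNN and co-clustering.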


In [13]:
# get rating estimations
test["overall"] = pd.DataFrame(svd_gs.test(df_test))["est"]

In [14]:
# save the test ratings estimation
results = test[["ID","overall"]]
results.to_csv('sample_sumbission_7.csv', index=False)