We want to build a book recommendation system based on the data we cleaned previously.
There are several types of recommender systems. The two considered here are:
* Collaborative filtering: This system matches people with similar interests and provides recommendations based on this matching. It needs the historical activity of the users, which we have. It can be user-based or item-based. In the first, we recommend items to a user that similar users have also liked. In the second, we recommend items based on the user's past ratings of similar items.

* Content-based systems: They suggest items similar to a particular item, using item metadata. The general idea behind these recommender systems is that if a person liked a particular item, he or she will also like an item that is similar to it. The only metadata we have about the books in our dataset is the title, author and country, which is very little. Good metadata would be the genres or the summary, which sadly we do not have. This method is also computationally expensive, which is another reason we are not using it.
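
The item-based variant of collaborative filtering described above can be illustrated with a minimal sketch. The toy ratings, the helper names (`item_vector`, `cosine`, `predict`) and the choice of cosine similarity over co-rating users are all assumptions made for illustration; this is not the implementation used by the surprise library later in this notebook.

```python
from math import sqrt

# Toy ratings: user -> {book: rating}. All names and values are made up.
ratings = {
    "u1": {"A": 8, "B": 7},
    "u2": {"A": 9, "B": 8, "C": 3},
    "u3": {"A": 2, "C": 9},
}

def item_vector(item):
    """Ratings given to `item`, keyed by user."""
    return {u: r[item] for u, r in ratings.items() if item in r}

def cosine(item_a, item_b):
    """Cosine similarity over users who rated both items (0.0 if there are none)."""
    va, vb = item_vector(item_a), item_vector(item_b)
    common = va.keys() & vb.keys()
    if not common:
        return 0.0
    dot = sum(va[u] * vb[u] for u in common)
    norm_a = sqrt(sum(va[u] ** 2 for u in common))
    norm_b = sqrt(sum(vb[u] ** 2 for u in common))
    return dot / (norm_a * norm_b)

def predict(user, item):
    """Similarity-weighted average of the user's ratings on the other items."""
    pairs = [(cosine(item, other), r)
             for other, r in ratings[user].items() if other != item]
    total = sum(s for s, _ in pairs)
    return sum(s * r for s, r in pairs) / total if total else None

sim_ab = cosine("A", "B")   # books A and B are rated alike by u1 and u2
sim_ac = cosine("A", "C")   # books A and C are rated very differently
pred = predict("u1", "C")   # u1 has not rated C yet
print(sim_ab, sim_ac, pred)
```

In this toy example, books A and B come out as more similar than A and C, and the predicted rating of u1 for C is a weighted average of u1's own ratings on A and B.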

This recommender system was built mainly with the help of the book "Hands-On Recommendation Systems with Python" by Rounak Banik:
https://learning.oreilly.com/library/view/hands-on-recommendation-systems/9781788993753/5f1269b2-2007-42a8-a900-bf255b86e64c.xhtml


We want to predict the rating a user would give to a book, based on the books that user has already read and on the ratings of similar users, i.e. users who liked the same books.

# Loading the data previously cleaned

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
path="D:/Workspace_Python/MLProject/MachineLearningDSTIProject/dataset/"
file="final_dataset.csv"
users_csv="cleaned_users.csv"
books_csv="cleaned_books.csv"
ratings_csv="cleaned_ratings.csv"

In [3]:
df = pd.read_csv(path+file, sep=";",on_bad_lines='warn', encoding="latin-1")

# Recommender system

## Item-based or user-based collaborative filtering?

In [4]:
users = pd.read_csv(path+users_csv, sep=";",on_bad_lines='warn', encoding="latin-1")
users.shape

(278858, 5)

In [5]:
books = pd.read_csv(path+books_csv, sep=";",on_bad_lines='warn', encoding="latin-1")
books.bookId.max()

248251

In [6]:
books.loc[books.bookId==-1]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,bookId


As we have more users than books, and a huge disparity in the number of ratings between users as seen in the EDA, we choose item-based collaborative filtering, which should also be more efficient.

## Item-based collaborative filtering

We load the data into the data structure of the surprise library

In [7]:
# Loads Pandas dataframe
from surprise import Dataset
from surprise import Reader
# We reduce the size of the dataset by keeping only users who rated more than 10 books
# and books that have been rated more than 100 times
df=df[df.userID.isin(df.userID.value_counts()[df.userID.value_counts()>10].index)][['userID','bookId','bookRating']]
df=df[df.bookId.isin(df.bookId.value_counts()[df.bookId.value_counts()>100].index)][['userID','bookId','bookRating']]
data = Dataset.load_from_df(df[['userID','bookId','bookRating']], Reader(line_format='user item rating', sep=";",rating_scale=(1, 10)))
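
The two-step filtering above can be sketched in plain Python on toy data. The rows, the helper `filter_by_count` and the lowered thresholds are hypothetical and only serve to show one property of filtering sequentially: after the book filter, some users may end up with fewer ratings than the user threshold required.

```python
from collections import Counter

# Hypothetical (user, book, rating) rows; thresholds are lowered for the toy data.
rows = [
    ("u1", "A", 8), ("u1", "B", 7), ("u1", "C", 9),
    ("u2", "A", 6), ("u2", "B", 5), ("u2", "D", 7),
    ("u3", "A", 9),
]

def filter_by_count(rows, key_index, min_count):
    """Keep rows whose key (column `key_index`) occurs more than `min_count` times."""
    counts = Counter(row[key_index] for row in rows)
    return [row for row in rows if counts[row[key_index]] > min_count]

step1 = filter_by_count(rows, key_index=0, min_count=2)   # users with > 2 ratings
step2 = filter_by_count(step1, key_index=1, min_count=1)  # books with > 1 rating

# Ratings left per user after both filters
remaining_per_user = Counter(user for user, _, _ in step2)
print(step2)
print(remaining_per_user)
```

Here u1 and u2 pass the user filter with 3 ratings each, but only 2 of their ratings survive the book filter. If a strict minimum per user matters, the two filters could be repeated until the row count stops changing.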

In [8]:
# Inspect the raw ratings stored in the surprise Dataset
d=pd.DataFrame(data.__dict__['raw_ratings'], columns=['user_id','item_id','rating','timestamp'])
d.head()

,user_id,item_id,rating,timestamp
0,242,180,8.0,
1,243,305,7.0,
2,243,822,9.0,
3,243,512,10.0,
4,243,4675,9.0,


We benchmark the different models using 5-fold cross-validation, after setting the seed of the RNG to make the experiments reproducible. We use RMSE to measure performance.
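
As a reminder of what the metric measures, here is a minimal sketch of RMSE (root mean squared error) on made-up ratings. The function name `rmse` and the sample values are illustrative only; the benchmark itself relies on the metric computed by surprise.

```python
from math import sqrt

def rmse(true_ratings, predicted_ratings):
    """Square root of the mean squared difference between true and predicted ratings."""
    if len(true_ratings) != len(predicted_ratings):
        raise ValueError("both lists must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(true_ratings, predicted_ratings)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Made-up ratings on a 1 to 10 scale
true_ratings = [8, 7, 10, 5]
predicted_ratings = [7, 7, 8, 6]

score = rmse(true_ratings, predicted_ratings)
print(score)
```

Because the errors are squared before averaging, a single prediction that is off by 2 points weighs as much as four predictions that are off by 1 point.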

In [9]:
import random
import numpy as np

my_seed = 42
random.seed(my_seed)
np.random.seed(my_seed)

In [10]:
from surprise import SVD
from surprise import SVDpp
from surprise import SlopeOne
from surprise import NMF
from surprise import NormalPredictor
from surprise import KNNBaseline
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise import KNNWithZScore
from surprise import BaselineOnly
from surprise import CoClustering
from surprise.model_selection import cross_validate
benchmark = []
# Iterate over all algorithms
algorithms = [
    SVD(verbose=False), SVDpp(verbose=False), SlopeOne(), NMF(), NormalPredictor(),
    KNNBaseline(verbose=False), KNNBasic(verbose=False), KNNWithMeans(verbose=False),
    KNNWithZScore(verbose=False), BaselineOnly(verbose=False), CoClustering(verbose=False),
]
for algorithm in algorithms:
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=5, verbose=False)
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = pd.concat([tmp,pd.Series([str(algorithm).split(' ')[0].split('.')[-1]], index=['Algorithm'])])
    benchmark.append(tmp)
    
pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')    

Algorithm,test_rmse,fit_time,test_time
BaselineOnly,1.626885,0.014229,0.009009
SVD,1.634637,0.104982,0.013174
KNNBaseline,1.654903,0.235163,0.260968
SVDpp,1.680577,0.127168,0.03688
KNNBasic,1.737566,0.210171,0.196236
KNNWithMeans,1.830618,0.266791,0.209236
SlopeOne,1.843097,0.051604,0.018613
CoClustering,1.851605,0.489725,0.009844
KNNWithZScore,1.875192,0.404463,0.221646
NormalPredictor,2.447643,0.009133,0.009405


BaselineOnly and SVD give us the best RMSE, and BaselineOnly is also the fastest, so we should use one of these two. As KNNBaseline can be configured with a similarity measure, we also test it further with both user-based and item-based configurations.

## Hyperparameter tuning of BaselineOnly

In [23]:
from surprise.model_selection import GridSearchCV
bsl_options = {"method": ["als"], "n_epochs": [10,15,20,25 ], "reg_u": [4,5,10,15,20], "reg_i": [5,10,15,20]}
param_grid={"bsl_options":bsl_options}
gs = GridSearchCV(BaselineOnly, param_grid, measures=["rmse", "mae"], cv=5,n_jobs =2, joblib_verbose=0)
gs.fit(data)

print(gs.best_score["rmse"])
print(gs.best_params["rmse"])

1.5972535959837304
{'bsl_options': {'method': 'als', 'n_epochs': 25, 'reg_u': 4, 'reg_i': 5}}
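
To get a feel for the cost of this search, the grid can be enumerated by hand. The sketch below only counts combinations with `itertools.product`. It does not reproduce how `GridSearchCV` works internally, and the variable names are illustrative.

```python
from itertools import product

# Same values as the bsl_options grid searched above
bsl_grid = {
    "method": ["als"],
    "n_epochs": [10, 15, 20, 25],
    "reg_u": [4, 5, 10, 15, 20],
    "reg_i": [5, 10, 15, 20],
}

names = list(bsl_grid)
combinations = [dict(zip(names, values)) for values in product(*bsl_grid.values())]

n_folds = 5
n_fits = len(combinations) * n_folds  # one fit per combination and per fold
print(len(combinations), n_fits)

# The best combination reported above sits on the edge of the grid
best = {"method": "als", "n_epochs": 25, "reg_u": 4, "reg_i": 5}
on_boundary = {
    name: best[name] in (min(bsl_grid[name]), max(bsl_grid[name]))
    for name in ("n_epochs", "reg_u", "reg_i")
}
print(on_boundary)
```

Since the best values of `n_epochs`, `reg_u` and `reg_i` all lie on a boundary of the searched ranges, extending the grid beyond those boundaries might yield a slightly better score.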


## Hyperparameter tuning of SVD

In [16]:
param_grid = {"n_epochs": [15,20,25,30],"lr_all": [0.004,0.008,0.01,0.02],"reg_all": [0,0.02,0.4,0.5, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=5,n_jobs =2, joblib_verbose=0)
gs.fit(data)

print(gs.best_score["rmse"])
print(gs.best_params["rmse"])

1.6153179196047005
{'n_epochs': 20, 'lr_all': 0.02, 'reg_all': 0.02}


## Hyperparameter tuning of KNNBaseline

In [19]:
bsl_options = {"method": ["als"], "n_epochs": [15,20,25 ], "reg_u": [2,5,8,10], "reg_i": [2,4,6]}
sim_options = {"name": ["cosine"],"user_based": [True,False]}
param_grid = {"bsl_options":bsl_options, "sim_options":sim_options, "k":[20,40,60]}
gs = GridSearchCV(KNNBaseline, param_grid, measures=["rmse", "mae"], cv=5,n_jobs =2, joblib_verbose=0)
gs.fit(data)

print(gs.best_score["rmse"])
print(gs.best_params["rmse"])

1.6303375645029043
{'bsl_options': {'method': 'als', 'n_epochs': 25, 'reg_u': 2, 'reg_i': 4}, 'sim_options': {'name': 'cosine', 'user_based': True}, 'k': 60}


# Training of the model

The benchmark and the tuning show that the best model is BaselineOnly with the Alternating Least Squares (ALS) procedure, configured with 25 iterations, reg_u=4 and reg_i=5.
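
To make the selected model more concrete, here is a minimal pure-Python sketch of a baseline estimate of the form mu + b_u + b_i, fitted by alternating regularized updates of the item and user biases. It is written in the spirit of the ALS procedure and is not the surprise implementation itself. The function names, the toy ratings and the clipping to the 1 to 10 scale are assumptions.

```python
from collections import defaultdict

def fit_baseline_als(ratings, n_epochs=25, reg_u=4, reg_i=5):
    """Estimate the global mean and the user/item biases from (user, item, rating) triples."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    b_u = {u: 0.0 for u in by_user}
    b_i = {i: 0.0 for i in by_item}
    for _ in range(n_epochs):
        # Update item biases with user biases held fixed
        for i, rated in by_item.items():
            b_i[i] = sum(r - mu - b_u[u] for u, r in rated) / (reg_i + len(rated))
        # Update user biases with item biases held fixed
        for u, rated in by_user.items():
            b_u[u] = sum(r - mu - b_i[i] for i, r in rated) / (reg_u + len(rated))
    return mu, b_u, b_i

def predict(mu, b_u, b_i, user, item, low=1, high=10):
    """Baseline estimate, clipped to the rating scale; unknown users/items get a zero bias."""
    est = mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)
    return min(high, max(low, est))

# Made-up ratings: u1 rates generously, u2 harshly, book A is better liked than B
ratings = [("u1", "A", 9), ("u1", "B", 8), ("u2", "A", 6), ("u2", "B", 4), ("u3", "A", 8)]
mu, b_u, b_i = fit_baseline_als(ratings)
est = predict(mu, b_u, b_i, "u3", "B")
print(mu, b_u, b_i, est)
```

The regularization terms `reg_u` and `reg_i` shrink the biases of users and items with few ratings towards zero, so their predictions stay close to the global mean.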

In [27]:
from surprise.model_selection import train_test_split
from surprise import accuracy
bsl_options= {'method': 'als', 'n_epochs': 25, 'reg_u': 4, 'reg_i': 5}
trainset, testset = train_test_split(data, test_size=0.25)
algo = BaselineOnly(bsl_options=bsl_options,verbose =False)
predictions = algo.fit(trainset).test(testset)

Finally, we measure the accuracy of the predicted ratings.

In [28]:
accuracy.rmse(predictions)
accuracy.mse(predictions)
accuracy.mae(predictions)

RMSE: 1.5899
MSE: 2.5278
MAE:  1.2344


1.234437785004282

Taking an RMSE below 1 as the target for a good recommender system, we can say that ours is average: with an RMSE of 1.59 on a 1 to 10 rating scale, the metrics are not too far from the target value.

In [32]:
from collections import defaultdict
from surprise.model_selection import KFold

def precision_recall_at_k(predictions, k=10, threshold=3.5):
    """Return precision and recall at k metrics for each user"""
    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():
        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)
        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)
        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])
        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(
            ((true_r >= threshold) and (est >= threshold))
            for (est, true_r) in user_ratings[:k]
        )
        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set it to 0.
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0
        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set it to 0.
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0
    return precisions, recalls

In [33]:
kf = KFold(n_splits=5)
bsl_options= {'method': 'als', 'n_epochs': 25, 'reg_u': 4, 'reg_i': 5}
algo = BaselineOnly(bsl_options=bsl_options,verbose =False)

for trainset, testset in kf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    precisions, recalls = precision_recall_at_k(predictions, k=5, threshold=4)
    print("--------------------------------------")
    # Precision and recall can then be averaged over all users
    print("precision = ",sum(prec for prec in precisions.values()) / len(precisions))
    print("recall = ",sum(rec for rec in recalls.values()) / len(recalls))

--------------------------------------
precision =  0.9673444976076555
recall =  0.9714247740563531
--------------------------------------
precision =  0.968816348195329
recall =  0.9703802699423718
--------------------------------------
precision =  0.9724203133441384
recall =  0.9747451360093174
--------------------------------------
precision =  0.9729950900163666
recall =  0.9744271685761048
--------------------------------------
precision =  0.9676317162232654
recall =  0.970016394664282


We can now make some predictions:

In [1]:
#from neo4j import GraphDatabase
#driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "DSTI2023!!"))