## Content-based recommendation system based on user preferences

The premise of this project is to recommend a list of books that a user is likely to be interested in, by identifying their preferences from their ratings.

The recommendations are generated by computing a weighted average for each book, based on the weight of each category for that user.
A category's weight is the sum of the ratings the user gave to books in that category.
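
As a minimal sketch of this arithmetic, the snippet below uses made-up books, categories, and ratings (none of these values come from the dataset). It assumes a book's score is the sum of the weights of its categories divided by the total weight across all categories, which is how the recommendation function later in this notebook computes it.

```python
# Hypothetical toy data: one user's ratings and each book's categories.
ratings = {"book_a": 4, "book_b": 2}
book_categories = {
    "book_a": ["Computers"],
    "book_b": ["Computers", "Philosophy"],
    "book_c": ["Philosophy"],
    "book_d": ["Travel"],
}

# Category weight = sum of the user's ratings for books in that category.
weights = {}
for book, rate in ratings.items():
    for category in book_categories[book]:
        weights[category] = weights.get(category, 0) + rate

total = sum(weights.values())

# Book score = summed weights of the book's categories / total weight.
scores = {
    book: round(sum(weights.get(c, 0) for c in cats) / total, 3)
    for book, cats in book_categories.items()
}

print(weights)  # {'Computers': 6, 'Philosophy': 2}
print(scores)   # {'book_a': 0.75, 'book_b': 1.0, 'book_c': 0.25, 'book_d': 0.0}
```

A book in a category the user never rated (here `book_d`) scores 0, while a book covering every category the user rated scores 1.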

## 01 - Data Preprocessing

In [1]:
import pandas as pd
import numpy as np

### About the dataset

This dataset provides information about users' interactions (ratings) with books, alongside book metadata (categories).

Books Dataset:
- isbn: Universal book identifier
- category: Book's category

Ratings Dataset:
- user_id: User's identifier
- isbn: Universal book identifier
- rate: Rating given by the user

In [17]:
# Read Books data
# (the np.str alias was removed from recent NumPy releases; the built-in str is equivalent)
books_df = pd.read_csv(
    "dataset/books.csv",
    sep=";",
    usecols=["isbn", "category"],
    dtype={"isbn": str, "category": str}
).drop_duplicates()

books_df.info(memory_usage="deep")
books_df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9972 entries, 0 to 11619
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   isbn      9972 non-null   object
 1   category  9972 non-null   object
dtypes: object(2)
memory usage: 1.3 MB


,isbn,category
0,782128726,Computers
2,789711427,Computers
3,691097186,Ancient
4,691097186,Philosophy
5,789719037,Computers


In [18]:
# Read Ratings data
# (the np.str alias was removed from recent NumPy releases; the built-in str is equivalent)
ratings_df = pd.read_csv(
    "dataset/ratings.csv",
    dtype={"user_id": np.int32, "isbn": str, "rate": np.int8}
)

print(ratings_df.info(memory_usage="deep"))
ratings_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 256626 entries, 0 to 256625
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   user_id  256626 non-null  int32 
 1   isbn     256626 non-null  object
 2   rate     256626 non-null  int8  
dtypes: int32(1), int8(1), object(1)
memory usage: 17.6 MB
None


,user_id,isbn,rate
0,1,1565924649,4
1,1,596000480,4
2,1,596000480,4
3,1,596000480,4
4,1,596000480,4


## 02 - Building the recommendation system

Keeping the dataframes as they are is not the ideal format for a content-based recommendation system.

We will build two new dataframes:

The first converts each book's categories into a vector of binary values using one-hot encoding: if the book belongs to a category, that column's value is 1; otherwise it is 0.

The second contains each user's preference weights, calculated as the dot product of the user's ratings and the books' metadata (categories).
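
To make the dot product concrete, here is a small sketch on toy data (the ISBNs, users, and ratings are invented for illustration). It builds a users x books rating matrix and multiplies it by the books x categories one-hot matrix. It assumes every rated book also appears in the books table, which may not hold for the real dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset.
books = pd.DataFrame({
    "isbn": ["b1", "b2", "b2", "b3"],
    "category": ["Computers", "Computers", "Philosophy", "Travel"],
})
ratings = pd.DataFrame({
    "user_id": [1, 1, 2],
    "isbn": ["b1", "b2", "b3"],
    "rate": [4, 2, 5],
})

# One row per book, one 0/1 column per category.
one_hot = pd.get_dummies(books.set_index("isbn")["category"], dtype=int)
one_hot = one_hot.groupby(level="isbn").sum()

# One row per user, one column per book, holding the rating (0 if unrated).
user_book = ratings.pivot_table(
    index="user_id", columns="isbn", values="rate", fill_value=0, aggfunc="sum"
)

# Dot product: (users x books) . (books x categories) = users x categories.
user_profiles = user_book.dot(one_hot.loc[user_book.columns])

print(user_profiles)
```

For user 1, the `Computers` weight is 4 + 2 = 6 and the `Philosophy` weight is 2. A merge followed by `pivot_table` with a sum aggregation should give the same category totals without materialising the users x books matrix.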

In [19]:
# Building Books/Categories One Hot Encoding dataframe

# get_dummies converts the 'category' feature into a vector of binary values
# since we want to identify each vector by its book's isbn, we set the index to the 'isbn' feature

books_categories_pivot = pd.get_dummies(books_df.set_index("isbn")["category"]).groupby("isbn").sum()
books_categories_pivot.info(memory_usage="deep")
books_categories_pivot.head()

<class 'pandas.core.frame.DataFrame'>
Index: 9399 entries, 0002251760 to 950491036X
Columns: 419 entries, Abduction to Young Adult Fiction
dtypes: uint8(419)
memory usage: 4.4 MB


isbn,Abduction,Aboriginal Australians,Abusive Men,Accidents,Actors,Adolescence,Adulteresses,Adultery,Adventure And Adventurers,Adventure Stories,...,Travel,Trojan War,True Crime,Unix (Computer File),Vampires,Voyages Around The World,Web Sites,West,Xml (Document Markup Language),Young Adult Fiction
0002251760,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
000648302X,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0006543545,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0007106572,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0007154615,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
# Building Users/Categories weight dataframe (User profile)

# Join the ratings and books dataframe to get the users ratings for each category
users_categories = pd.merge(ratings_df, books_df[["isbn", "category"]], on='isbn', how="left")

# To avoid a blind spot, collect every category that does not appear in the ratings dataframe,
#      i.e. categories whose books have no ratings at all
categories_not_rated = books_df[~books_df["category"].isin(users_categories["category"])]['category']
categories_not_rated = dict.fromkeys(categories_not_rated, 0)

# pivot_table generates a dataframe containing the weight of each category per user,
# computed as the sum of the user's ratings for books in that category
# assign appends the unrated categories as new all-zero columns
users_categories_pivot = users_categories.pivot_table(
    index="user_id",
    columns="category",
    values="rate",
    fill_value=0,
    aggfunc="sum"
).assign(**categories_not_rated)

print(users_categories_pivot.info(memory_usage="deep"))
users_categories_pivot

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41399 entries, 1 to 278854
Columns: 419 entries, Abduction to Ms-Dos (Computer File)
dtypes: int64(419)
memory usage: 132.7 MB
None


user_id,Abduction,Aboriginal Australians,Abusive Men,Accidents,Actors,Adolescence,Adulteresses,Adultery,Adventure And Adventurers,Adventure Stories,...,Trojan War,True Crime,Unix (Computer File),Vampires,Voyages Around The World,Web Sites,West,Xml (Document Markup Language),Young Adult Fiction,Ms-Dos (Computer File)
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,12,35,0,9,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
16,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
278838,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
278843,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
278849,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
278851,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## 03 - Books Recommendation

With the user profiles built (the weight of every book category for each user),
we can now write a function that takes a user id and the number of books to recommend as arguments
and returns a list of recommended books based on the weighted average of each book.

In [21]:
# Building the function
def get_user_recommendations(user_id, nbr_books=10):
    
    # get the user profile from Users/Categories weight dataframe
    user_profile = users_categories_pivot.loc[user_id]

    # Compute weighted average for each book based on user's profile
    user_books = round((books_categories_pivot * user_profile).sum(axis=1) / user_profile.sum(), 3)

    # Recommend top N books
    top_n_books = user_books.sort_values(ascending=False).head(nbr_books)
    
    return top_n_books
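
The function above assumes the user exists in the profile matrix and that their weights do not sum to zero. The sketch below is a hypothetical variant (the name `recommend`, its extra parameters, and the toy matrices are illustrative assumptions, not part of the original notebook) that guards both cases and can optionally skip books the user has already rated.

```python
import pandas as pd


def recommend(user_id, books_categories, users_categories, rated_isbns=(), nbr_books=10):
    """Hypothetical variant of get_user_recommendations with basic guards."""
    # Unknown user: there is no profile to score against.
    if user_id not in users_categories.index:
        return pd.Series(dtype="float64")

    user_profile = users_categories.loc[user_id]
    total_weight = user_profile.sum()

    # All-zero profile: dividing by zero would only produce NaN scores.
    if total_weight == 0:
        return pd.Series(dtype="float64")

    # Same weighted average as in the original function.
    scores = (books_categories * user_profile).sum(axis=1) / total_weight

    # Optionally skip books the user has already rated.
    scores = scores.drop(index=list(rated_isbns), errors="ignore")

    return scores.round(3).sort_values(ascending=False).head(nbr_books)


# Toy matrices for illustration only.
books_categories = pd.DataFrame(
    {"Computers": [1, 1, 0], "Travel": [0, 0, 1]},
    index=pd.Index(["b1", "b2", "b3"], name="isbn"),
)
users_categories = pd.DataFrame(
    {"Computers": [6, 0], "Travel": [2, 0]},
    index=pd.Index([1, 2], name="user_id"),
)

print(recommend(1, books_categories, users_categories, rated_isbns=["b1"]))
print(recommend(2, books_categories, users_categories))   # all-zero profile
print(recommend(99, books_categories, users_categories))  # unknown user
```

Returning an empty Series is only one possible fallback; a real service might instead return globally popular books for users without a usable profile.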

In [24]:
# Simulate the recommendation process

# Example user id
user_id = 1

# Get the list of recommended books
recommendation_list = get_user_recommendations(user_id)

recommendation_list

isbn
0735710902    0.564
0970747926    0.490
0596000855    0.469
0596000480    0.463
0672319942    0.461
1565924649    0.438
0130211192    0.346
078972376X    0.195
0789724499    0.195
0789724243    0.195
dtype: float64

## 04 - Into production

Keeping the process as it is will not serve us efficiently at production scale.
Running all of these steps, which take about 20s on average, for every user request could quickly bring the server down. One
solution is to save the results of the heaviest tasks and reload them when needed instead of re-executing the tasks.

To keep the saved files up to date with the database, we can set up a cron job that regenerates them at a specific time.

In our recommendation system, the most expensive tasks are generating the one-hot encoded books/categories matrix and the user profile matrix.
We can save these two matrices either in an in-memory store such as Redis or on the file system as binary files.
For the sake of simplicity, we will save them as Parquet files (you can read this article about Pandas file benchmarking).

In [25]:
# Saving the Books/Categories dataframe
books_categories_pivot.to_parquet("books_categories.parquet")

# Saving the Users/Categories dataframe
users_categories_pivot.to_parquet("users_categories.parquet")

Update get_user_recommendations to read the data from the saved files:

In [26]:
def get_user_recommendations(user_id, nbr_books=10):
    books_categories_pivot = pd.read_parquet("books_categories.parquet")
    users_categories_pivot = pd.read_parquet("users_categories.parquet")
    # get the user profile from Users/Categories weight dataframe
    user_profile = users_categories_pivot.loc[user_id]

    # Compute weighted average for each book based on user's profile
    user_books = round((books_categories_pivot * user_profile).sum(axis=1) / user_profile.sum(), 3)

    # Recommend top N books
    top_n_books = user_books.sort_values(ascending=False).head(nbr_books)
    
    return top_n_books
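
The version above re-reads both Parquet files on every call. One possible refinement, sketched here under the assumption that the service runs as a long-lived Python process, is to cache the loaded matrices in memory with `functools.lru_cache`. The names `load_matrix` and `get_user_recommendations_cached` are hypothetical, and the `reader` parameter exists only so the loader can be swapped out.

```python
from functools import lru_cache

import pandas as pd


@lru_cache(maxsize=None)
def load_matrix(path, reader=pd.read_parquet):
    """Read a saved matrix once per process and reuse it on later calls."""
    return reader(path)


def get_user_recommendations_cached(user_id, nbr_books=10):
    # Hypothetical variant: same scoring as above, but each file is only
    # read from disk the first time its path is requested.
    books_categories_pivot = load_matrix("books_categories.parquet")
    users_categories_pivot = load_matrix("users_categories.parquet")

    user_profile = users_categories_pivot.loc[user_id]
    user_books = (books_categories_pivot * user_profile).sum(axis=1) / user_profile.sum()

    return user_books.round(3).sort_values(ascending=False).head(nbr_books)
```

With this approach the cached dataframes are shared between calls, so they should be treated as read-only. After the cron job regenerates the files, `load_matrix.cache_clear()` (or a process restart) would be needed for the new data to be picked up.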

In [27]:
# Simulate the recommendation process

# Example user id
user_id = 1

# Get the list of recommended books
recommendation_list = get_user_recommendations(user_id)

recommendation_list

isbn
0735710902    0.564
0970747926    0.490
0596000855    0.469
0596000480    0.463
0672319942    0.461
1565924649    0.438
0130211192    0.346
078972376X    0.195
0789724499    0.195
0789724243    0.195
dtype: float64