## Content-based recommendation system using the cosine similarity method

Given a book's ISBN, this system generates a list of books that the user might be interested in.

The recommendations are generated by vectorizing the book's features (genres, authors, ...) with word frequencies or TF-IDF, then computing the cosine similarity between the resulting vectors.
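To make the idea concrete, here is a minimal from-scratch sketch of word-frequency vectors and cosine similarity using only the standard library. The helper names `term_counts` and `cosine` and the three feature strings are illustrative assumptions, not part of the notebook; the notebook itself relies on scikit-learn for both steps.

```python
import math
from collections import Counter


def term_counts(text):
    """Lowercase, split on whitespace, and count each token."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counter objects)."""
    # Only tokens present in both vectors contribute to the dot product
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical feature strings (publisher + category + author)
book_a = term_counts("sybex computers brian smith")
book_b = term_counts("sybex computers doug toombs")
book_c = term_counts("flamingo fiction penelope fitzgerald")

print(cosine(book_a, book_b))  # 2 shared tokens, 4 tokens each -> 0.5
print(cosine(book_a, book_c))  # no shared tokens -> 0.0
```

Two books that share a publisher and a category score 0.5 here, while books with no token in common score 0.0. A score of 1.0 means the two feature strings contain the same tokens in the same proportions.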

## 01 - Data Preprocessing

In [2]:
# Import libraries
import numpy as np
import pandas as pd

# CountVectorizer vectorizes a document by counting how often each word occurs
from sklearn.feature_extraction.text import CountVectorizer

# cosine_similarity computes the similarity between two vectors
from sklearn.metrics.pairwise import cosine_similarity

# The nltk stopwords corpus helps ignore non-contextual words such as (the, or, she)
# during document vectorization
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords", quiet=True)

True

### About the dataset

This dataset provides book metadata (pages, category, ...).

Columns:
- isbn: International Standard Book Number, the book's unique identifier
- title: Book's title
- pages: Number of pages
- category: Book's category
- author: Book's author
- publisher: Book's publisher

In [3]:
# Step 01: Read Books Data
df = pd.read_csv(
    "dataset/books.csv",
    sep=";",
    usecols=["isbn", "title", "publisher", "category", "author"],
    dtype={"isbn": str}  # np.str was removed from NumPy; reading as str keeps leading zeros
)

df.info(memory_usage="deep")
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11622 entries, 0 to 11621
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   isbn       11622 non-null  object
 1   title      11622 non-null  object
 2   category   11622 non-null  object
 3   author     11622 non-null  object
 4   publisher  11622 non-null  object
dtypes: object(5)
memory usage: 3.9 MB


|   | isbn | title | category | author | publisher |
|---|---|---|---|---|---|
| 0 | 782128726 | Mastering Windows 2000 Server | Computers | brian m. smith | Sybex Inc |
| 1 | 782128726 | Mastering Windows 2000 Server | Computers | doug toombs | Sybex Inc |
| 2 | 789711427 | Using Microsoft Backoffice | Computers | don benage | Macmillan Computer Pub |
| 3 | 691097186 | The Collected Dialogues Of Plato, Including Th... | Ancient | plato | Princeton University Press |
| 4 | 691097186 | The Collected Dialogues Of Plato, Including Th... | Philosophy | plato | Princeton University Press |


In [4]:
# Features considered in content filtering
features = ["isbn", "publisher", "category", "author"]

In [5]:
# Collapse the rows of each ISBN into one, keeping the distinct values of every feature
books_df = df[features].groupby("isbn").agg(lambda cell_val: ' '.join(set(cell_val)))
# Join all the features into one column to build a bag of words
books_df["features"] = books_df[features[1:]].apply(lambda row: " ".join(row.fillna("")), axis=1)

# Delete the individual feature columns, which are no longer needed
books_df.drop(features[1:], axis=1, inplace=True)

books_df.info(memory_usage="deep")
books_df.head()

<class 'pandas.core.frame.DataFrame'>
Index: 9399 entries, 0002251760 to 950491036X
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   features  9399 non-null   object
dtypes: object(1)
memory usage: 1.5 MB


| isbn | features |
|---|---|
| 0002251760 | Harpercollins Fiction nick bantock |
| 000648302X | Harpercollins (Uk) End Of The World matthew th... |
| 0006543545 | Flamingo Booksellers And Bookselling penelope ... |
| 0007106572 | Harpercollins Domestic Fiction sue welfare |
| 0007154615 | Perennial Fiction carol shields |


## 02 - Building the recommendation system

To build our recommender, we first vectorize the book features (the bag of words), then compute the similarity score between those vectors with the cosine similarity method.

There are different techniques for vectorizing documents, such as term frequency or TF-IDF.

For this lab, we use the term frequency technique with the CountVectorizer class from the scikit-learn library.
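For comparison, the following is a rough standard-library sketch of the TF-IDF alternative, which the notebook mentions but does not use. The function name `tfidf_vectors` is hypothetical, and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization is an assumption of this sketch (it matches the default behavior of scikit-learn's `TfidfVectorizer`, which would be the library counterpart). The three documents are lowercased feature strings from the `books_df.head()` output.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Term frequency times smoothed inverse document frequency
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


docs = [
    "harpercollins fiction nick bantock",
    "harpercollins domestic fiction sue welfare",
    "perennial fiction carol shields",
]
vectors = tfidf_vectors(docs)

# "fiction" occurs in every document, so it weighs less than a term that
# occurs in only one document, such as an author's name.
print(vectors[0]["fiction"] < vectors[0]["bantock"])  # True
```

With plain term frequency, "fiction" and "bantock" would count equally. TF-IDF down-weights terms shared by many books, so a common category contributes less to the similarity score than a rare author or publisher.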

In [6]:
# Define the words to ignore during vectorization: stop words such as "the", "a", "et", "she", ...
stopwords_list = stopwords.words('english') + stopwords.words('french')

In [8]:
# Instantiate a CountVectorizer object
count = CountVectorizer(stop_words=stopwords_list)


# Construct the sparse Count matrix by fitting and transforming the bag of words
count_matrix = count.fit_transform(books_df['features'])

In [9]:
# Compute the cosine similarity matrix
cosine_matrix = cosine_similarity(count_matrix, count_matrix)

## 03 - Books Recommendation

After generating the cosine similarity matrix, we can retrieve a book's recommendation list by looking up its position index in the main dataframe "books_df".

Alternatively, we can build a dataframe on top of the cosine similarity matrix, which lets us retrieve a book's recommendation list directly by its ISBN.
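The position-based lookup can be sketched without pandas as follows. The function `similar_by_position` and the `position_of` mapping are hypothetical names for illustration; the ISBNs and scores are the top-left 5x5 corner of the `cosine_sim_df.head()` output in this notebook, standing in for the full matrix.

```python
# First five ISBNs of books_df.index and the matching corner of the similarity matrix
isbns = ["0002251760", "000648302X", "0006543545", "0007106572", "0007154615"]
matrix = [
    [1.0, 0.204124, 0.0, 0.447214, 0.25],
    [0.204124, 1.0, 0.0, 0.182574, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.447214, 0.182574, 0.0, 1.0, 0.223607],
    [0.25, 0.0, 0.0, 0.223607, 1.0],
]

# Map each ISBN to its row/column position in the matrix
position_of = {isbn: i for i, isbn in enumerate(isbns)}


def similar_by_position(isbn, n=3):
    """Return the n most similar (isbn, score) pairs, excluding the book itself."""
    position = position_of[isbn]
    row = matrix[position]
    ranked = sorted(
        (i for i in range(len(row)) if i != position),
        key=lambda i: row[i],
        reverse=True,
    )
    return [(isbns[i], row[i]) for i in ranked[:n]]


print(similar_by_position("0002251760"))
# [('0007106572', 0.447214), ('0007154615', 0.25), ('000648302X', 0.204124)]
```

Wrapping the matrix in a dataframe indexed by ISBN removes the need to maintain the `position_of` mapping by hand, which is the convenience the notebook opts for.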

In [11]:
# Convert the cosine similarity matrix to a dataframe
# so that books can be looked up by ISBN
cosine_sim_df = pd.DataFrame(cosine_matrix, columns=books_df.index, index=books_df.index)
cosine_sim_df.head()

| isbn | 0002251760 | 000648302X | 0006543545 | 0007106572 | 0007154615 | 000716226X | 0020198906 | 0020360754 | 0020418809 | 0020768702 | ... | 8807812576 | 8807813025 | 8817125539 | 8838918600 | 8845247414 | 8845407039 | 8878188212 | 9004121390 | 902470068X | 950491036X |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0002251760 | 1.0 | 0.204124 | 0.0 | 0.447214 | 0.25 | 0.223607 | 0.0 | 0.25 | 0.0 | 0.188982 | ... | 0.176777 | 0.223607 | 0.204124 | 0.25 | 0.204124 | 0.204124 | 0.188982 | 0.0 | 0.0 | 0.223607 |
| 000648302X | 0.204124 | 1.0 | 0.0 | 0.182574 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 0006543545 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 0007106572 | 0.447214 | 0.182574 | 0.0 | 1.0 | 0.223607 | 0.2 | 0.0 | 0.223607 | 0.0 | 0.169031 | ... | 0.158114 | 0.2 | 0.182574 | 0.223607 | 0.182574 | 0.182574 | 0.169031 | 0.0 | 0.0 | 0.2 |
| 0007154615 | 0.25 | 0.0 | 0.0 | 0.223607 | 1.0 | 0.223607 | 0.0 | 0.25 | 0.204124 | 0.188982 | ... | 0.176777 | 0.223607 | 0.204124 | 0.25 | 0.204124 | 0.204124 | 0.188982 | 0.0 | 0.0 | 0.223607 |


In [17]:
# Simulate the recommendation process

# Example book identifier
book_isbn = "0789711427"

# Get the book's similarity scores against all other books
books_score = cosine_sim_df.loc[book_isbn]

# Drop the query book itself, then sort the scores and keep the top 10.
# Dropping by label is safer than skipping the first sorted entry: a book
# with identical features also scores 1.0 and could be sorted first.
recommended_books = books_score.drop(book_isbn).sort_values(ascending=False).head(10)

recommended_books

isbn
1575213168    0.676123
1562056417    0.617213
0789706814    0.617213
0672306204    0.617213
1562057154    0.617213
1562056484    0.617213
0672306670    0.617213
1562055089    0.617213
0672308002    0.617213
0789705672    0.617213
Name: 0789711427, dtype: float64

## 04 - Into production

Repeating these steps every time a user requests recommendations will not scale.

Instead, we can treat the "02 - Building the recommendation system" section as a model training step: rather than repeating the training for each request, we save the cosine similarity data once and query that dataframe on every request.

To persist the dataframe, we can keep it in memory (for example in a Redis database) or write it to the file system as a binary file.
For the sake of simplicity, we will save it as a Parquet file (you can read this article about Pandas file benchmarking).
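Before choosing a storage option, it helps to estimate how large the saved data is. The back-of-the-envelope calculation below uses the 9399 books reported by `books_df.info()` and the `float64` dtype shown in the recommendation output. The second estimate, which keeps only the top 10 scores per book, is a hypothetical alternative layout and not something the notebook implements; its per-entry size is a rough assumption.

```python
n_books = 9399        # rows in books_df, from the info() output
bytes_per_score = 8   # one float64 similarity score

# The full matrix stores one score for every pair of books
full_matrix_bytes = n_books * n_books * bytes_per_score
print(f"full matrix: {full_matrix_bytes / 1e6:.0f} MB")  # full matrix: 707 MB

# Assumption: store only the 10 best (isbn, score) pairs per book,
# counting 10 bytes for the ISBN string and 8 bytes for the score
top_n = 10
top_n_bytes = n_books * top_n * (10 + bytes_per_score)
print(f"top-10 only: {top_n_bytes / 1e6:.1f} MB")  # top-10 only: 1.7 MB
```

The uncompressed matrix grows with the square of the catalogue size, so it is worth checking this figure against available memory or disk space before settling on Redis or a file. The Parquet file on disk will usually be smaller than the raw estimate because of compression.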

In [20]:
# Save the cosine similarity dataframe
cosine_sim_df.to_parquet("books_cosine_similarity.parquet")

In [21]:
# Simulate the recommendation process

# Read the cosine similarity dataframe file
books_cos_df = pd.read_parquet("books_cosine_similarity.parquet")

# Example book identifier
book_isbn = "0789711427"

# Get the book's similarity scores against all other books
books_score = books_cos_df.loc[book_isbn]

# Drop the query book itself, then sort the scores and keep the top 10
recommended_books = books_score.drop(book_isbn).sort_values(ascending=False).head(10)

recommended_books

isbn
1575213168    0.676123
1562056417    0.617213
0789706814    0.617213
0672306204    0.617213
1562057154    0.617213
1562056484    0.617213
0672306670    0.617213
1562055089    0.617213
0672308002    0.617213
0789705672    0.617213
Name: 0789711427, dtype: float64