Recommendation algorithms commonly have the following architecture:

Candidate generation. This first step produces a subset of candidates to recommend to the user. For example, from the full catalog of books, we select a smaller subset of relevant books from which the recommendations will be drawn.

Scoring. Another model is responsible for scoring and ranking the candidates in order to choose the set of products to present to the user.

Re-ranking. Finally, the system takes additional constraints into account to produce the final ranking. For example, if the user disliked a certain product we remove it, or if a product is more recent we increase its score.
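
The three stages above can be sketched as three small functions chained together. This is only a toy illustration: the catalog, the genre weights, the `boost` value and the function names are all hypothetical, and a real system would use learned models at each stage.

```python
def generate_candidates(catalog, liked_genres):
    """Stage 1: narrow the full catalog down to a plausible subset."""
    return [book for book in catalog if book["genre"] in liked_genres]

def score(candidates, genre_weights):
    """Stage 2: attach a relevance score to each candidate."""
    return [(book, genre_weights[book["genre"]]) for book in candidates]

def rerank(scored, disliked_titles, recent_year, boost=0.5):
    """Stage 3: apply extra constraints, then sort by final score."""
    adjusted = []
    for book, s in scored:
        if book["title"] in disliked_titles:
            continue  # drop items the user explicitly disliked
        if book["year"] >= recent_year:
            s += boost  # favour recent items
        adjusted.append((book["title"], s))
    return sorted(adjusted, key=lambda pair: pair[1], reverse=True)

# Hypothetical catalog and user profile, for illustration only
catalog = [
    {"title": "Book A", "genre": "fantasy", "year": 1995},
    {"title": "Book B", "genre": "fantasy", "year": 2022},
    {"title": "Book C", "genre": "romance", "year": 2010},
    {"title": "Book D", "genre": "cooking", "year": 2021},
]
genre_weights = {"fantasy": 1.0, "romance": 0.8}

candidates = generate_candidates(catalog, liked_genres=genre_weights.keys())
ranked = rerank(score(candidates, genre_weights),
                disliked_titles={"Book C"}, recent_year=2020)
print(ranked)
```

Here "Book D" never reaches the scoring stage, "Book C" is removed during re-ranking, and the recent "Book B" ends up ahead of "Book A".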

# The different approaches
We can distinguish several approaches. In this module we will mainly focus on: <br>
1- Content-based methods <br>
2- Collaborative filtering methods (memory approach and model approach) <br>
3- Hybrid methods <br>
Content-based filtering methods: <br>
In content-based filtering, we use known information about users' interests as a link for potential recommendations.

Suppose we ask Alice what types of books she likes, in order to determine whether she would be interested in The Hobbit. With this information, we start by describing both the users and the items (in this case, books) with known variables, for example adventure and romance.

If Alice doesn't like adventure but likes romance, we can represent her preferences as the vector (0, 4), assuming a rating scale from 0 to 5.

We can do the same for books: The Hobbit contains no romance but a lot of adventure, which gives, say, the vector (5, 0).
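
Once users and books live in the same feature space, we can compare them numerically. The sketch below uses cosine similarity on the two vectors from the text. The third vector, `romance_novel`, is a hypothetical book added only for comparison.

```python
import numpy as np

# Feature order: (adventure, romance), each rated on a 0-5 scale
alice = np.array([0, 4])
the_hobbit = np.array([5, 0])
# Hypothetical second book, added only for comparison
romance_novel = np.array([1, 5])

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(alice, the_hobbit))     # the vectors are orthogonal
print(cosine(alice, romance_novel))  # the vectors point in similar directions
```

With these values, The Hobbit gets a similarity of 0 with Alice's profile, so a content-based system would rank the romance-heavy book well ahead of it.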

In [None]:
# dataset link
# https://www.kaggle.com/datasets/thedevastator/comprehensive-overview-of-52478-goodreads-best-b


# other dataset for exploration:
# https://www.kaggle.com/datasets/jealousleopard/goodreadsbooks
# https://cseweb.ucsd.edu/~jmcauley/datasets/goodreads.html#datasets




Initially we decide to keep only the 'book_id', 'title', 'language_code' and 'description' columns. We keep the language code so that we can filter the books and keep those that are in English only (predominant category in the database). As for the description, we keep it so that we can extract the characteristics of each book and thus make a recommendation based on their content.

We also notice that some books have missing values in the language_code column. To filter more precisely, we also use the description column to identify books in English. Python libraries such as langdetect provide functions (for example detect) which, given a text, identify the language in which it is written.
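
A minimal sketch of this two-step filter is shown below. It is not the notebook's actual pre-processing code. The helper `keep_english`, the toy DataFrame and the stand-in detector `fake_detect` are hypothetical, and the sketch assumes that English language codes start with "en" (for example 'eng' or 'en-US'). In practice, `detect_fn` could be `langdetect.detect`, which returns codes such as 'en'.

```python
import pandas as pd

def keep_english(df, detect_fn):
    """Keep rows tagged as English, or detected as English when the tag is missing."""
    def is_english(row):
        code = row["language_code"]
        if pd.isna(code):
            # Fall back on the description only when the code is missing
            return detect_fn(row["description"]) == "en"
        return str(code).startswith("en")  # assumption: 'eng', 'en-US', ...
    mask = df.apply(is_english, axis=1)
    return df[mask]

# Toy data, for illustration only
toy = pd.DataFrame({
    "title": ["T1", "T2", "T3", "T4"],
    "language_code": ["eng", "spa", None, None],
    "description": ["A wizard's tale.", "Una novela.",
                    "A chess prodigy.", "Une histoire d'amour."],
})

# Stand-in detector so the sketch runs without extra dependencies
def fake_detect(text):
    return "en" if text.startswith("A ") else "other"

english = keep_english(toy, fake_detect)
print(english["title"].tolist())
```

Language detectors can fail on empty or very short texts, so a real pipeline would likely wrap the call to `detect_fn` in error handling.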

Once the pre-processing is done, we only keep the 'book_id', 'title' and 'description' columns for our recommendation system. The file 'goodreads_descr' contains the cleaned database with information on just over 9000 books.

In [None]:
import pandas as pd

# Importing the dataset
df = pd.read_csv('goodreads_descr.csv')

# Displaying the first 10 lines
df.head(10)

When we read the summary of a book, we find important elements which allow us to define whether it could interest us or not. If we like the fantasy genre, we might be more attracted to descriptions containing words such as 'magic', 'dragons' or 'elf'.

For our recommendation system we will use the 'description' column to find these words and thus identify books that cover similar subjects.

To do this, we will use tokenization and vectorization (a subject covered in depth in the Text Mining module). For now, all you need to know is that these are fundamental processes in natural language processing (NLP) which convert textual data into a format that Machine Learning models can understand and use.

Tokenization is the process of dividing a text into individual units (words or subwords) called "tokens".

Vectorization consists of representing tokens, and ultimately whole documents, in numerical form, generally as a vector of numbers. This is necessary because most machine learning models require numerical input.

TfidfVectorizer is a scikit-learn class that performs both tokenization and TF-IDF (Term Frequency-Inverse Document Frequency) vectorization to create numerical representations of documents.

The stop_words parameter removes stop words, which are very frequent and therefore carry little meaning in a text. Removing them does not cause a significant loss of information.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TfidfVectorizer and remove stop words
tfidf = TfidfVectorizer(stop_words='english')

# Fit the vectorizer and transform the descriptions into a TF-IDF matrix
tfidf_matrix = tfidf.fit_transform(df['description'])

# Show tfidf matrix shape
print(tfidf_matrix.shape)
display(tfidf_matrix)

The number of rows (9676) corresponds to the number of books in our DataFrame. The number of columns (53433) corresponds to the size of the vocabulary, that is, the number of distinct terms found in the descriptions after removing stop words.

Then we will calculate the similarity between each book.

The cosine_similarity and euclidean_distances functions from the sklearn.metrics.pairwise module calculate the cosine similarity and the Euclidean distance, respectively.
Note: With the Euclidean measure, the shorter the distance, the higher the similarity. To ensure that higher values correspond to greater similarity (a common convention in recommendation systems), we apply the transformation euclidean_sim = 1 / (1 + euclidean_distances(tfidf_matrix)). Also note that the Euclidean distance is generally not recommended in very high dimensions, where distances between points tend to become less discriminative.
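
One point worth checking: with its default settings (norm='l2'), TfidfVectorizer scales every non-empty row to unit length. For unit-length vectors, the squared Euclidean distance equals 2 - 2 × cosine similarity, so both measures should order the books in the same way even though the score values differ. The sketch below verifies this identity on a hypothetical toy corpus. It assumes default vectorizer settings and no empty documents.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Hypothetical corpus, for illustration only
docs = [
    "dragon magic ring quest",
    "elf magic forest quest",
    "detective murder city",
    "dragon elf magic war",
]
X = TfidfVectorizer().fit_transform(docs)

cos = cosine_similarity(X)
dist = euclidean_distances(X)

# For unit-length rows: distance^2 = 2 - 2 * cosine
identity_holds = bool(np.allclose(dist ** 2, 2 - 2 * cos, atol=1e-6))
print(identity_holds)

# Same ordering of neighbours for the first document under both measures
order_cos = np.argsort(-cos[0])
order_euc = np.argsort(-(1 / (1 + dist[0])))
print(order_cos, order_euc)
```

If the vectorizer were created with norm=None, the identity would no longer hold and the two measures could produce different rankings.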

In [None]:
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# We calculate the cosine similarity
sim_cosine = cosine_similarity(tfidf_matrix, tfidf_matrix)

# We calculate the Euclidean similarity
sim_euclidean = 1 / (1 + euclidean_distances(tfidf_matrix))
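
Each of these dense matrices has 9676 × 9676 entries, which is roughly 0.75 GB in 64-bit floats. When memory is tight, one possible alternative is to compute the similarities for a single book at query time. The sketch below illustrates this on a hypothetical toy corpus. The helper name `top_k_similar` is an assumption, not part of the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_k_similar(matrix, idx, k=10):
    """Return (row index, score) pairs for the k rows most similar to row idx."""
    # Compare one row against all rows instead of building the full matrix
    scores = cosine_similarity(matrix[[idx]], matrix).ravel()
    scores[idx] = -np.inf  # exclude the query row itself
    best = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in best]

# Hypothetical corpus, for illustration only
docs = [
    "dragon magic ring quest",
    "elf magic forest quest",
    "detective murder city",
    "dragon elf magic war",
]
toy_matrix = TfidfVectorizer().fit_transform(docs)

result = top_k_similar(toy_matrix, idx=0, k=2)
print(result)
```

The trade-off is that similarities are recomputed on every query, whereas the precomputed matrix makes each lookup immediate.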

Create a pandas Series named indices using the 'title' column of the DataFrame as its index. This series maps each book title to its row position, which lets us look up a book in the similarity matrix when recommending.

In [None]:
# Create a series of row positions using the 'title' column as the index
indices = pd.Series(range(len(df)), index=df['title'])

Create a function recommendations which takes as input:
title: the title of the book for which we want recommendations.
mat_sim: the similarity matrix calculated previously.
num_recommendations: the number of recommendations to return (default: 10).
The function must store in a variable named idx the index associated with the title, retrieved from the series indices.

It must then build a list of the similarity scores between the target book and every book, pairing each score with its index using the enumerate function and the similarity matrix mat_sim.

It must sort the similarity scores, keep the highest ones and retrieve their indices.

Finally, based on these indices, it returns the titles of the most similar books.

In [None]:
from tabulate import tabulate
def recommendations(title, mat_sim, num_recommendations = 10):
    # Retrieve the row index of the book in the similarity matrix
    idx = indices[title]

    # Pair each book index with its similarity score to the given book
    similarity_scores = list(enumerate(mat_sim[idx]))

    # Sort books by decreasing similarity score
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)

    # Keep the num_recommendations most similar books, skipping the first entry (normally the book itself)
    top_similar = similarity_scores[1:num_recommendations + 1]

    # Map each book index back to its title
    res = [(indices.index[i], score) for i, score in top_similar]

    # Return the titles and scores of the most similar books as a formatted table
    return tabulate(res, headers=["Title", "Similarity score"], tablefmt="pretty")

 Using the recommendations function, find recommendations for the books 'Batman Detective Comics #39' and 'The Queen's Gambit' using Euclidean and cosine similarities.

In [None]:
print("Recommendations for 'Batman Detective Comics #39' Cosine similarity: \n", recommendations('Batman Detective Comics #39', sim_cosine))
print("\nRecommendations for 'Batman Detective Comics #39' Euclidean similarity: \n", recommendations('Batman Detective Comics #39', sim_euclidean))

print("\nRecommendations for 'The Queen\'s Gambit' cosine similarity: \n", recommendations('The Queen\'s Gambit', sim_cosine))
print("\nRecommendations for 'The Queen\'s Gambit' Euclidean similarity: \n", recommendations('The Queen\'s Gambit', sim_euclidean))

We find that we have good recommendations but we also have some recommendations that don't have much to do with the target books: 'Pugs in Public' doesn't have much to do with 'Batman Detective Comics #39' and 'The Mammoth Book of Tasteless Jokes' has little to do with 'The Queen's Gambit'.

Limitation: Content-based filtering suffers from over-specialization, limited discovery, and reliance on product features. It can only use the characteristics and information present in the products themselves. If a product lacks relevant information for analysis (for example, an empty description), it may not be recommended correctly.

# Collaborative filtering
Sometimes we can't explain why we like certain things; we just like them. With this in mind, the main idea of collaborative filtering is that a person is likely to enjoy what people with similar interests have liked.

We have two different approaches when we talk about collaborative filtering:

Memory approach. We work directly on the rating matrix, which records how users have rated the products. We will explore two methods:

a. User-user: If Alice and Bob like similar books, we recommend to Alice the books that Bob liked, and vice versa.

b. Item-item: If Alice likes Pride and Prejudice, and this book is similar to Persuasion and Emma, then the latter two will be recommended to her.
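
Both memory-based methods can be sketched on a small rating matrix. The ratings below are hypothetical, and treating an unrated book as 0 inside the cosine computation is a simplification made for brevity.

```python
import numpy as np

# Hypothetical rating matrix: rows = users, columns = books, 0 = not rated
users = ["Alice", "Bob", "Carol", "Dave"]
books = ["Pride and Prejudice", "Persuasion", "Emma", "Dracula"]
ratings = np.array([
    [5, 0, 0, 0],   # Alice
    [5, 4, 5, 1],   # Bob
    [4, 5, 4, 0],   # Carol
    [1, 0, 2, 5],   # Dave
], dtype=float)

def cosine_matrix(m):
    """Pairwise cosine similarity between the rows of m."""
    unit = m / np.linalg.norm(m, axis=1, keepdims=True)
    return unit @ unit.T

user_sim = cosine_matrix(ratings)     # user-user: compare rows
item_sim = cosine_matrix(ratings.T)   # item-item: compare columns

# Item-item recommendation for Alice: rank her unread books
# by their similarity to the book she liked
liked = books.index("Pride and Prejudice")
unread = [j for j in range(len(books)) if ratings[0, j] == 0]
ranked = sorted(unread, key=lambda j: item_sim[liked, j], reverse=True)
recommended = [books[j] for j in ranked]
print(recommended)
```

With these toy ratings, Emma and Persuasion come out ahead of Dracula, because the users who rated Pride and Prejudice highly also rated those two books highly.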

Model approach. We use Machine Learning models to try to predict the rating that a user will give to a product. The main method is Matrix Factorization.

Here, we use the matrix of ratings that users have given to products, even if it contains missing values. From it we try to reconstruct two factor matrices (one for users, one for products) whose product recovers not only the ratings we already knew, but also an estimate of the missing ones.

It is important to note that, unlike the feature vectors of content-based filtering, we do not know exactly which characteristics the entries of these factor matrices correspond to (for example, we cannot tell whether a given entry measures adventure or romance). We simply call them latent variables because they emerge from patterns in the data. This notion is the main difference between content-based filtering and collaborative filtering.
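
The idea can be illustrated with a minimal matrix factorization trained by gradient descent. This is a sketch under stated assumptions rather than a production method: the ratings, the number of latent variables, the learning rate and the regularization strength are all arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ratings; np.nan marks missing values
R = np.array([
    [5, 4, np.nan, 1],
    [4, np.nan, 5, 1],
    [1, 1, np.nan, 5],
    [np.nan, 1, 2, 4],
])
known = ~np.isnan(R)
n_users, n_items = R.shape
n_latent = 2  # number of latent variables (arbitrary choice)

# Factor matrices: one row of latent variables per user / per item
U = rng.normal(scale=0.1, size=(n_users, n_latent))
V = rng.normal(scale=0.1, size=(n_items, n_latent))

def rmse(U, V):
    """Root mean squared error on the known ratings only."""
    error = (R - U @ V.T)[known]
    return float(np.sqrt(np.mean(error ** 2)))

initial_rmse = rmse(U, V)
learning_rate, reg = 0.01, 0.02
for _ in range(5000):
    error = np.where(known, R - U @ V.T, 0.0)  # ignore missing entries
    U_step = error @ V - reg * U
    V_step = error.T @ U - reg * V
    U += learning_rate * U_step
    V += learning_rate * V_step
final_rmse = rmse(U, V)

predictions = U @ V.T  # also provides estimates for the missing entries
print(initial_rmse, final_rmse)
print(np.round(predictions, 1))
```

The columns of `U` and `V` are the latent variables: nothing in the training procedure says what they mean, they are simply the values that best reproduce the known ratings.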

Limitation: collaborative filtering suffers from the cold start problem: it is difficult to make recommendations to new users or for new products, because no ratings exist for them yet.

# What to remember
There are different approaches when we talk about recommendation systems: in this notebook we focused on content-based filtering. In future notebooks we will explore collaborative filtering in more detail.

A similarity measure is a function s that takes a pair of embeddings (x, y) and returns a scalar value measuring their similarity.

Content-based filtering makes recommendations based on the intrinsic information of products or users, and is therefore limited to cases where this information is available beforehand.