**Group Project: building a recommendation model for the book industry**

---


In [None]:
#Library
import pandas as pd
import numpy as np
import nltk
import string
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import csr_matrix

In [None]:
#Open datasets
df_Books = pd.read_csv("Books.csv",engine="python", sep=',')
df_Ratings = pd.read_csv("Ratings.csv",engine="python", sep=',')
df_Users = pd.read_csv("Users.csv",engine="python", sep=',')

# Preprocessing our data

First, we clean our datasets by removing missing values and outliers.

In [None]:
# Missing values: there is only one missing value in the "Book-Author" column and two in the "Publisher" column.
# We chose to remove these rows:
df_Books = df_Books.dropna(subset=['Book-Author'])
df_Books = df_Books.dropna(subset=['Publisher'])
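
The missing-value counts quoted in the comment above can be checked before dropping anything. The sketch below uses a small made-up dataframe (`toy_books` and its rows are illustrative assumptions, not taken from Books.csv) to show one way of counting missing values per column and dropping the affected rows in a single call.

```python
import pandas as pd

# Toy stand-in for df_Books; the values are illustrative only
toy_books = pd.DataFrame({
    "ISBN": ["A1", "B2", "C3", "D4"],
    "Book-Author": ["Author One", None, "Author Three", "Author Four"],
    "Publisher": ["Pub A", "Pub B", None, "Pub D"],
})

# Count missing values in each column
missing_per_column = toy_books.isna().sum()
print(missing_per_column)

# Drop rows where either column is missing (one call instead of two)
toy_clean = toy_books.dropna(subset=["Book-Author", "Publisher"])
print(len(toy_clean))
```

Passing both column names to `subset` is equivalent to the two successive `dropna` calls used in this notebook.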

In [None]:
# Year of publication: we filter the "Year-Of-Publication" column of the Books dataset to eliminate outliers (such as year 0 or year 2050)
# Note: the filtered rows are stored in filtered_Books, while the following cells keep working on df_Books
df_Books['Year-Of-Publication'] = pd.to_numeric(df_Books['Year-Of-Publication'], errors='coerce').astype('Int64')
filtered_Books = df_Books[(df_Books['Year-Of-Publication'] >= 1900) & (df_Books['Year-Of-Publication'] <= 2024)]
filtered_Books

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...
...,...,...,...,...,...,...,...,...
271355,0440400988,There's a Bat in Bunk Five,Paula Danziger,1988,Random House Childrens Pub (Mm),http://images.amazon.com/images/P/0440400988.0...,http://images.amazon.com/images/P/0440400988.0...,http://images.amazon.com/images/P/0440400988.0...
271356,0525447644,From One to One Hundred,Teri Sloat,1991,Dutton Books,http://images.amazon.com/images/P/0525447644.0...,http://images.amazon.com/images/P/0525447644.0...,http://images.amazon.com/images/P/0525447644.0...
271357,006008667X,Lily Dale : The True Story of the Town that Ta...,Christine Wicker,2004,HarperSanFrancisco,http://images.amazon.com/images/P/006008667X.0...,http://images.amazon.com/images/P/006008667X.0...,http://images.amazon.com/images/P/006008667X.0...
271358,0192126040,Republic (World's Classics),Plato,1996,Oxford University Press,http://images.amazon.com/images/P/0192126040.0...,http://images.amazon.com/images/P/0192126040.0...,http://images.amazon.com/images/P/0192126040.0...


In [None]:
# Transform Ratings: we keep only ratings of at least 1, since a rating of 0 represents an implicit evaluation by the reader
filtered_Ratings = df_Ratings[(df_Ratings['Book-Rating'] >= 1)]
filtered_Ratings

Unnamed: 0,User-ID,ISBN,Book-Rating
1,276726,0155061224,5
3,276729,052165615X,3
4,276729,0521795028,6
6,276736,3257224281,8
7,276737,0600570967,6
...,...,...,...
1149773,276704,0806917695,5
1149775,276704,1563526298,9
1149777,276709,0515107662,10
1149778,276721,0590442449,10


Secondly, we enrich our datasets to obtain new features.

On the one hand, to build a recommendation model, we need the average rating of each book and the number of ratings it has received.

On the other hand, we need to transform the Year of Publication column to use it as a feature in our model: the content-based model below compares books through the words of their text features, so a raw year would not match between books published a few years apart. We chose to split the books into 3 categories:


*   Classics of literature: most of the readers in our dataset were not yet born when these books were published.
*   Old books: well-known books from the end of the 20th century.
*   Recent books: they are not classics yet, but they have been read and rated.
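
The three-way split described above can also be expressed with `pd.cut`. This is only a sketch on made-up years; the cut points (1980 and 2000) mirror the ones used in the labelling cell of this notebook, and the names `years` and `periods` are introduced here for illustration.

```python
import pandas as pd

# Toy publication years; illustrative values only
years = pd.Series([1965, 1980, 1995, 2000, 2001, 2003])

# Cut points are an assumption mirroring the three periods described above:
# before 1980 -> Classic, 1980 to 2000 -> Old, after 2000 -> Recent
periods = pd.cut(
    years,
    bins=[-float("inf"), 1979, 2000, float("inf")],
    labels=["Classic", "Old", "Recent"],
)
print(periods.tolist())
```

Each bin includes its right edge by default, so 2000 falls in "Old" and 1980 is the first "Old" year.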


In [None]:
# We compute the average rating and number of ratings of each book
# We group the dataset by ISBN (instead of user ID), then compute the mean rating and count how many ratings each book has received:
filtered_grouped_Ratings = filtered_Ratings.groupby('ISBN')['Book-Rating'].agg(['mean', 'count']).reset_index()

filtered_grouped_Ratings

Unnamed: 0,ISBN,mean,count
0,0330299891,6.0,1
1,0375404120,3.0,1
2,9022906116,7.0,1
3,#6612432,5.0,1
4,'9607092910',10.0,1
...,...,...,...
185968,"\8888809228\""""",5.0,1
185969,"\9170010242\""""",10.0,1
185970,ooo7156103,7.0,1
185971,´3499128624,8.0,1


In [None]:
# Analysis of the age of the readers
df_Users = df_Users.dropna(subset=['Age']) # remove missing values
filtered_ages = df_Users[df_Users['Age'] < 100]['Age'].dropna() # remove outliers

# Compute the third quartile of the "Age" variable
# (computed on all non-missing ages; filtered_ages is not used here)
third_quartile = df_Users['Age'].quantile(0.75)

# Compute the breaking point for "Classics of literature" (75% of readers were born in or after this year)
breaking_point = 2024 - third_quartile
breaking_point

1980.0

In [None]:
# We add a new column "Period_Of_Publication":
# Books are labelled as 'Classic' if they were published before 1980,
# 'Old' if they were published between 1980 and 2000 (inclusive), and 'Recent' if they were published after 2000.
df_Books['Period_Of_Publication'] = df_Books['Year-Of-Publication'].apply(
    lambda year: 'Classic' if year < 1980 else 'Old' if year <= 2000 else 'Recent'
)

In [None]:
# We merge Rating and Books datasets with an inner join, so we only keep books
# - for which we know the information ;
# - that have been rated.
df_all = pd.merge(filtered_grouped_Ratings, df_Books, on='ISBN', how='inner')

print(len(df_all)) #check the length of our merged dataframe
df_all #check our merged dataframe

149832


Unnamed: 0,ISBN,mean,count,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L,Period_Of_Publication
0,0000913154,8.0,1,The Way Things Work: An Illustrated Encycloped...,C. van Amerongen (translator),1967,Simon & Schuster,http://images.amazon.com/images/P/0000913154.0...,http://images.amazon.com/images/P/0000913154.0...,http://images.amazon.com/images/P/0000913154.0...,Classic
1,0001046438,9.0,1,Liar,Stephen Fry,0,Harpercollins Uk,http://images.amazon.com/images/P/0001046438.0...,http://images.amazon.com/images/P/0001046438.0...,http://images.amazon.com/images/P/0001046438.0...,Classic
2,000104687X,6.0,1,"T.S. Eliot Reading \The Wasteland\"" and Other ...",T.S. Eliot,1993,HarperCollins Publishers,http://images.amazon.com/images/P/000104687X.0...,http://images.amazon.com/images/P/000104687X.0...,http://images.amazon.com/images/P/000104687X.0...,Old
3,0001047213,9.0,1,The Fighting Man,Gerald Seymour,1993,HarperCollins Publishers,http://images.amazon.com/images/P/0001047213.0...,http://images.amazon.com/images/P/0001047213.0...,http://images.amazon.com/images/P/0001047213.0...,Old
4,0001047973,9.0,2,Brave New World,Aldous Huxley,1999,Trafalgar Square Publishing,http://images.amazon.com/images/P/0001047973.0...,http://images.amazon.com/images/P/0001047973.0...,http://images.amazon.com/images/P/0001047973.0...,Old
...,...,...,...,...,...,...,...,...,...,...,...
149827,B0001FZGPI,7.0,1,The Bonesetter's Daughter,Amy Tan,2001,Putnam Pub Group,http://images.amazon.com/images/P/B0001FZGPI.0...,http://images.amazon.com/images/P/B0001FZGPI.0...,http://images.amazon.com/images/P/B0001FZGPI.0...,Recent
149828,B0001FZGRQ,9.0,1,The Clan of the Cave Bear,Jean M. Auel,2001,Crown Publishing Group,http://images.amazon.com/images/P/B0001FZGRQ.0...,http://images.amazon.com/images/P/B0001FZGRQ.0...,http://images.amazon.com/images/P/B0001FZGRQ.0...,Recent
149829,B0001GMSV2,8.0,2,Find Me,Rosie O'Donnell,2002,Warner Books,http://images.amazon.com/images/P/B0001GMSV2.0...,http://images.amazon.com/images/P/B0001GMSV2.0...,http://images.amazon.com/images/P/B0001GMSV2.0...,Recent
149830,B0001I1KOG,10.0,1,New York Public Library Literature Companion,New York Public Library,2001,Free Press,http://images.amazon.com/images/P/B0001I1KOG.0...,http://images.amazon.com/images/P/B0001I1KOG.0...,http://images.amazon.com/images/P/B0001I1KOG.0...,Recent


# Recommendation Model : IMDB

We don't want to recommend books that have been rated only once, nor books that have a poor rating.
Therefore, we use IMDB's weighted rating formula to score how much each book is appreciated, and we keep the books with the highest scores.
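
To see why the weighted rating helps, here is a small worked example of the formula WR = (v / (v + m)) * R + (m / (v + m)) * C, where R is a book's average rating, v its number of ratings, m the minimum number of ratings and C the mean rating across all books. The numbers and the helper name `weighted_rating_value` are hypothetical and chosen only for illustration.

```python
def weighted_rating_value(R, v, m, C):
    """IMDB-style weighted rating: blend a book's mean R with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical numbers chosen for illustration
m, C = 5, 7.0
one_vote = weighted_rating_value(R=10.0, v=1, m=m, C=C)      # a single perfect vote
many_votes = weighted_rating_value(R=8.0, v=100, m=m, C=C)   # 100 votes averaging 8

print(round(one_vote, 2), round(many_votes, 2))
```

A book with a single 10 is pulled strongly towards the global mean, while a book with 100 ratings keeps a score close to its own average, so the second book ends up ranked higher.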



In [None]:
# Define a function to calculate the weighted rating
def weighted_rating(x, m, C):
    """
    Calculate the weighted rating of a book based on IMDB's formula:
    WR = (v / (v + m)) * R + (m / (v + m)) * C

    where:
    R = average rating for the book
    v = number of votes for the book
    m = minimum votes required to be listed
    C = mean vote across all books
    """
    v = x['count'] # take the number of votes for book x
    R = x['mean'] # take the average of votes for book x
    return (v / (v + m) * R) + (m / (v + m) * C) # return weighted formula

In [None]:
# Calculate the mean vote average (C)
C = df_all['mean'].mean()

# We only keep books that have received at least 5 ratings (m)
m = 5
# By choosing 5, we keep 13,787 of the 149,832 merged books, about 9% (choosing a higher m would have removed too much of our dataset).

# Filter the books that qualify for the chart
qualified_books = df_all[df_all['count'] >= m]

In [None]:
len(qualified_books)

13787

In [None]:
# Calculate the weighted rating for each qualified book
# (we work on an explicit copy so that pandas does not raise a SettingWithCopyWarning)
qualified_books = qualified_books.copy()
qualified_books['score'] = qualified_books.apply(weighted_rating, axis=1, m=m, C=C)


In [None]:
# Sort books by score (ascending=True puts the lowest scores first)
recommended_books = qualified_books.sort_values('score', ascending=True)

# Display the first 10 rows, i.e. the 10 lowest-scoring qualified books
print(recommended_books[['ISBN', 'mean', 'count', 'score']].head(10))

              ISBN      mean  count     score
117744  0971880107  4.390706    581  4.417470
132310  1880985055  3.000000      9  4.616966
142071  349222539X  3.666667      6  5.421593
45340   0425182908  5.338028     71  5.482073
19219   0312313616  3.600000      5  5.563752
86838   0743242203  4.000000      6  5.603411
95858   080213825X  5.444444     54  5.620975
105277  0843952180  3.800000      5  5.663752
140882  3446153950  3.800000      5  5.663752
29237   037325055X  4.500000      8  5.664424


In [None]:
# We filter our dataset to keep books whose score is above 5
df = recommended_books[recommended_books['score'] > 5]

# Recommendation Model : Cosine Similarity

We will build a recommendation model based on book features: the title, the author, the publisher and the period of publication.
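
As a minimal illustration of the idea, the sketch below turns three made-up feature strings into word-count vectors and compares them with cosine similarity. The strings and the names `toy_features`, `toy_counts` and `toy_sim` are assumptions for the example, not rows of our dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three made-up feature strings (title + author + publisher + period)
toy_features = [
    "Dune Frank Herbert Ace Old",
    "Dune Messiah Frank Herbert Ace Classic",
    "Cooking Basics Jane Doe Kitchen Press Recent",
]

# Each row of the count matrix is a bag-of-words vector for one book
toy_counts = CountVectorizer().fit_transform(toy_features)

# toy_sim[i, j] is the cosine of the angle between the vectors of books i and j
toy_sim = cosine_similarity(toy_counts)
print(toy_sim.round(2))
```

The first two strings share four words (title word, author names and publisher), so their similarity is high, while the third string shares no word with them and gets a similarity of 0.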

In [None]:
# We need to reset the indexes of our dataframe since we filtered it. (The index will be necessary in our model)
df = df.reset_index(drop=True)

In [None]:
#We select four columns to build the recommendation
features = ['Book-Title', 'Book-Author', 'Publisher', 'Period_Of_Publication']

In [None]:
# Define a function that concatenates the values of the 4 features of each row into a single string
def combined_features(row):
    return row['Book-Title']+" "+row['Book-Author']+" "+row['Publisher']+" "+row['Period_Of_Publication'] # We add spaces so that words from different columns are not glued together

df["combined_features"] = df.apply(combined_features, axis=1) # We create a new column using the previous function

df["combined_features"]

Unnamed: 0,combined_features
0,Die Lüge im Bett. Gaby Hauptmann Piper Old
1,Isle of Dogs Patricia Cornwell Berkley Publish...
2,Confessions of a Sociopathic Social Climber: T...
3,The Boy on the Bus : A Novel Deborah Schupack ...
4,Four Blondes Candace Bushnell Grove Press Recent
...,...
13780,Dilbert: A Book of Postcards Scott Adams Andre...
13781,"The Return of the King (The Lord of the Rings,..."
13782,The Giving Tree Shel Silverstein HarperCollins...
13783,"The Two Towers (The Lord of the Rings, Part 2)..."


In [None]:
# We clean the combined features we have just created

nltk.download('stopwords') # Download stopwords
stop_words_english = set(stopwords.words('english')) # Take english stopwords

# We create a function to clean each line
def clean_text(text):
    # Remove stopwords
    text = " ".join([word for word in text.split() if word.lower() not in stop_words_english])

    # Remove numbers
    text = re.sub(r'\d+', '', text)

    return text

# Apply that function on "combined_features"
df["cleaned_combined_features"] = df["combined_features"].apply(clean_text)

# Show result
print(df["cleaned_combined_features"].head())

[nltk_data] Downloading package stopwords to /root/nltk_data...


0           Die Lüge im Bett. Gaby Hauptmann Piper Old
1    Isle Dogs Patricia Cornwell Berkley Publishing...
2    Confessions Sociopathic Social Climber: Katya ...
3    Boy Bus : Novel Deborah Schupack Free Press Re...
4     Four Blondes Candace Bushnell Grove Press Recent
Name: cleaned_combined_features, dtype: object


[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# We compute the similarity between books using the cosine similarity method
cv = CountVectorizer()
count_matrix = cv.fit_transform(df["cleaned_combined_features"]) # build a vocabulary of words and count how many times each word appears in each book's features

In [None]:
# Apply cosine_similarity function to the count matrix. This will compute the similarity between each book
cosine_sim = cosine_similarity(count_matrix)
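
The full similarity matrix holds one value per pair of books, so its size grows quadratically with the number of books. When only one query book matters, a possible alternative (not what this notebook does) is to compare that single row against the whole count matrix. The sketch below checks on toy data that both approaches give the same row; all names and strings in it are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up feature strings, for illustration only
toy_features = [
    "Dune Frank Herbert Ace Old",
    "Dune Messiah Frank Herbert Ace Classic",
    "Cooking Basics Jane Doe Kitchen Press Recent",
]
toy_counts = CountVectorizer().fit_transform(toy_features)

query_index = 0
# Slicing with a range keeps the row two-dimensional, as cosine_similarity expects
query_row = toy_counts[query_index:query_index + 1]
row_sim = cosine_similarity(query_row, toy_counts).ravel()

# Reference: the same row taken from the full pairwise matrix
full_sim = cosine_similarity(toy_counts)
print(row_sim.round(2))
print(full_sim[query_index].round(2))
```

The single-row version only needs memory proportional to the number of books, at the cost of recomputing the row for each new query.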

In [None]:
print(df['Book-Title'].head(5)) # Show the first 5 books of the dataframe as an example

0                                    Die Lüge im Bett.
1                                         Isle of Dogs
2    Confessions of a Sociopathic Social Climber: T...
3                         The Boy on the Bus : A Novel
4                                         Four Blondes
Name: Book-Title, dtype: object


In [None]:
# Choose the name of a book
book_user_likes = "Dune"

# This function will return the index of the chosen book in our dataset
def get_index_from_title(title):
    return df[df["Book-Title"] == title].index.values[0]

# Apply the function to the chosen book
book_index = int(get_index_from_title(book_user_likes))

In [None]:
similar_books = list(enumerate(cosine_sim[book_index])) # Enumerate the index of other books in the dataset (same index as in the matrix)

sorted_similar_books = sorted(similar_books, key=lambda x:x[1], reverse=True) # Sort books by their similarity with the chosen book

In [None]:
# Let's print the chosen book followed by its 15 most similar books

def get_title_from_index(index):
    return df.loc[index, "Book-Title"] # return the title of the book stored at the given index

# The first entry is the chosen book itself (highest similarity), so we take 16 entries to get 15 recommendations
for index, _ in sorted_similar_books[:16]:
    print(get_title_from_index(index)) # print the title of each recommended book

Dune
Dune (Dune Chronicles (Berkley Paperback))
N or M?
Firebird
Abduction
The Choir
Ssn
The Mask
Invasion
Mutation
The Funhouse
Into the Darkness
Toxin
Shattered
Greygallows
Holiday in Death
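
The lookup, sorting and printing steps above could be gathered into one reusable helper. The function below is a sketch under stated assumptions: `recommend` is a hypothetical name, it expects a dataframe with a default 0..n-1 index aligned with the similarity matrix and a "Book-Title" column, and it is demonstrated on made-up data rather than on our dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend(title, books, sim_matrix, n=15):
    """Return up to n titles most similar to `title`, excluding the book itself.

    Assumes `books` has a default 0..len-1 index aligned with `sim_matrix`
    and a "Book-Title" column. Returns an empty list for an unknown title.
    """
    matches = books.index[books["Book-Title"] == title]
    if len(matches) == 0:
        return []
    book_index = int(matches[0])
    # Pair every book index with its similarity to the chosen book
    scores = list(enumerate(sim_matrix[book_index]))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    # Skip the chosen book itself, then keep the n best candidates
    top = [idx for idx, _ in scores if idx != book_index][:n]
    return books.loc[top, "Book-Title"].tolist()

# Toy data, made up for illustration
toy = pd.DataFrame({
    "Book-Title": ["Dune", "Dune Messiah", "Cooking Basics"],
    "features": [
        "Dune Frank Herbert Ace Old",
        "Dune Messiah Frank Herbert Ace Classic",
        "Cooking Basics Jane Doe Kitchen Press Recent",
    ],
})
toy_sim = cosine_similarity(CountVectorizer().fit_transform(toy["features"]))

print(recommend("Dune", toy, toy_sim, n=2))
print(recommend("Unknown Title", toy, toy_sim))
```

Returning an empty list for an unknown title avoids the `IndexError` that an unguarded `.index.values[0]` lookup raises when the title is not in the dataframe.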
