# Generate Personalized Reading Recommendations from a Book Dataset

By Adil Said
- email: adilsaid64@gmail.com
- github: https://github.com/adilsaid64
- linkedin: https://www.linkedin.com/in/adil-s64/

**Task:** You have been given a dataset containing information about various books,
including the title, author, genre, and a brief summary. Your task is to develop a simple
but creative recommendation system using Python to provide personalized reading
recommendations.

## Overview of what this project contains

This project involves building a book recommendation system that incorporates both a search engine and personalized recommendations using collaborative filtering. 

The system allows users to search for books by titles and author names. Users can then rate the books they have read, and create user profiles. 

These user profiles are then used to generate personalized book recommendations using collaborative filtering techniques. The system aims to provide users with relevant book suggestions based on their preferences and the ratings of similar users. 

The system was then deployed onto a web application using Flask. You can run the app.py file and access it by typing http://localhost:5000/ into a web browser.


## Data
Data was from kaggle [goodbooks-10k](https://www.kaggle.com/datasets/zygmunt/goodbooks-10k?select=books.csv).

Contains data on 10k books and 10k users with almost a million rows of user interactions.

# Methods

First a search engine was built. Then we used user-based colaborative filtering to make recommendations. The steps taken is described bellow with code.

## Building a search engine

The aim of this section is to develop a book search engine that enables users to search for books based on titles and author names. 

The search engine will serve as the foundation for creating user profiles, allowing individuals to search for books they like and rate them. 

These user profiles will then be utilized to generate personalized book recommendations using collaborative filtering techniques.

### Steps taken to build search engine

**1. Preprocess Text Data**

Cleaned the text data by removing special characters, large spaces, and converting to all text to lower case.  

**2. Term Frequency Matrix (TF):**

Constructed a term frequency matrix to capture the frequency of each term within each book title and author columns.
    
    
**3. Inverse Document Frequency Matrix (IDF):**

Constructed an inverse document frequency matrix to minimize the impact of common words such as "the" or "and" that appear frequently across books.

**4. TF-IDF Calculation:**

Combined the term frequency matrix and the inverse document frequency matrix to obtain a TF-IDF matrix.

**5. Searching with TF-IDF Vectors:**

Enabled users to enter search queries (book titles or author names) and converted them into TF-IDF vectors using the same term frequency and inverse document frequency calculations.

Determined the similarity between the TF-IDF vector of the search query and the TF-IDF vectors of the books in the collection using cosine simularity.

Ranked the books based on their cosine similarity to the search query and presented the results to the user.

**6. Create a jupyter notebook widget to search for books**

Make a notebook widget to allow users to search for books they want.

**How to use:**
Just enter the name of an author or a book title. You dont need to know the whole auther name or even the entire book name. You can enter partial names and partial titles. Or even a combination of both. Names and titles have to be spelled correctly.

## Memory-based collaborative filtering 


Memory-based collaborative filtering relies on the user-item interaction data to make recommendations. 

For this coding task user-item filtering was used. A user-item filtering will take a particular user, find users that are similar to that user based on similarity of ratings, and recommend items that those similar users liked.

You can say: “Users who are similar to you also liked …”

### Steps taken to implement collaborative filtering

**1. User Filtering based on Books Read:**

Filter out users who have read the same books as us and have given high ratings to the books we rated highly. This step aims to identify users with similar preferences and reduce the dimensions of the user-item matrix.

**2. User-Item Matrix Creation:**

Construct a user-item matrix based on the filtered set of users. This matrix represents the ratings given by users to different books, enabling comparisons and analysis.

**3. Similar User Identification:**

Calculate the similarity between users based on their rating patterns. Used cosine similarity to identify users who have similar preferences.

**4. Book Ratings by Similar Users:**

Extract the books that similar users have rated. Determine the count of each book and calculate the average rating given by the similar users.

**5. Adjusted Count for Personalization:**

Focus on books that are highly rated by similar users but may not be popular among all users. Adjust the count of user ratings by considering the proportion of similar users who have rated each book.

**6. Scoring and Sorting of Recommendations:**

Score the book recommendations by multiplying the adjusted count by the mean rating given by similar users. Sort the recommendations based on this score to present personalized book recommendations to the user.

**7. Evaluation:**

To evaluate the peformance of the system a simple manual check is done to see if recomendations are appropriate.



## Limitations and how they could be improved

**1. Experiment with different filtering methods**

It would have been better to explore different filtering methods, such as content-based filtering and other types of collaborative filtering like item-item collaborative filtering.

It would have also been interesting to explore other techniques like model-based collaborative filtering, which is based on matrix factorization. A common matrix factorization method is singular value decomposition, which is what the famous winning team at the Netflix Prize competition used to make recommendations.

**2. Better System Evaluation**

A better and more systematic evaluation method should have been utilized to measure the performance of the system. To gain a better understanding of how the system performs, the data could have been split into training and testing sets. It is common in research to use metrics like mean squared error (MSE) or root mean squared error (RMSE) to evaluate the performance of recommendation systems.

**3. Cold Start Problems**

The system faces a cold start problem when new users or users without ratings join. The current solution of using a search engine for users to rate books helps, but the problem persists for users who haven't rated anything or for new books. To tackle this, a hybrid approach combining content-based and user-based filtering could be implemented. Content-based filtering considers book attributes, which helps with new users, while user-based filtering relies on existing user data. This hybrid approach improves recommendations even in cold start situations.

**4. Dataset**

In the future, I would like to use a better dataset. This dataset only contains 10k books. Some dataset contain millions of books. That would have been much more intresting to work with. However I used this smaller dataset because of RAM limitations. 


I am sure there are many more limitations and ways to improve on this, these are just some of the ways I could think off right now. I think the biggest limitation is the peformance evaluation. Because no quantitative measure was used to evaluate the peformance of the system.

**What I have Learned**
- The basics of recommender systems and some of their algorithms.
- First time to use Flask and learned some basic HTML.

## Building the search engine

In [1]:
import pandas as pd
import numpy as np
from scipy.sparse import coo_matrix
import re
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
books = pd.read_csv("books.csv")
ratings = pd.read_csv("ratings.csv")

In [3]:
##############################
##### text preprocessing #####
##############################

# combining title and authors into one column
books["tags"] = books["title"] + " " + books["authors"]

# Cleaning tags

# remove unwanted characters
books["tags"] = books["tags"].str.replace("[^a-zA-Z0-9 ]", "",regex = True)

# lower case
books["tags"] = books["tags"].str.lower()

# remove 2+ more spaces and replace with 1 space
books["tags"] = books["tags"].str.replace("\s+", " ", regex =  True)

# removing rows that contain no titles.
books = books[books["tags"].str.len()>0]

#################################
###### cleaning title names #####
#################################

# remove unwanted characters
books["title_clean"] = books["title"].str.replace("[^a-zA-Z0-9 ]", "",regex = True)

# lower case
books["title_clean"] = books["title_clean"].str.lower()

# remove 2+ more spaces and replace with 1 space
books["title_clean"] = books["title_clean"].str.replace("\s+", " ", regex =  True)

# removing rows that contain no titles.
books = books[books["title_clean"].str.len()>0]

In [4]:
#################################
#### steps 2-4 TF-IDF matrix ####
#################################

from sklearn.feature_extraction.text import TfidfVectorizer

vecto = TfidfVectorizer()

tfidf  = vecto.fit_transform(books["tags"]) 

In [5]:
##############################
#### create search engine ####
##############################

# show book front covers
def show_image(val):
    return '<a href="{}"><img src="{}" width=50></img></a>'.format(val, val)

# search function that takes in a user query
def search(q, vecto):
    
    # clean the query
    q_clean = re.sub("[^a-zA-Z0-9 ]", "", q.lower())

    # convert to a vector
    q_vec = vecto.transform([q_clean])

    # consine simularity between the query and our tfidf matrix
    simularity = cosine_similarity(q_vec, tfidf).flatten()

    # need the indicies of the books that are simular to our query. selecting the 10 heights
    indices = np.argpartition(simularity, -10)[-10:]

    # using the indicies to get the book titles
    results = books.iloc[indices]
    
    results["simularity"]= simularity[indices]

    # sort the results. Only show results that have a simularity score of greater than 0.2
    results = results[results["simularity"]>=0.2].sort_values("simularity", ascending = False)
    
    return results[["authors","original_title", "average_rating","image_url", "simularity", "id"]].style.format({'image_url': show_image, 'small_image_url': show_image})

In [6]:
# to hide warnigs for a neater notebook :D
import warnings
warnings.filterwarnings("ignore")

In [7]:
search("mockingbird", vecto)

Unnamed: 0,authors,original_title,average_rating,image_url,simularity,id
4933,Kathryn Erskine,Mockingbird,4.18,,0.602677,4934
3,Harper Lee,To Kill a Mockingbird,4.25,,0.586581,4


In [8]:
#########################################################
#### building an interactive jupyter notebook widget ####
#########################################################

import ipywidgets as widgets
from IPython.display import display

search_input = widgets.Text(description = "Search",
                           disabled = False)


# attaching our search widget to an output widget

output = widgets.Output()

def on_search(data):
    with output:
        output.clear_output()
        title = data["new"]
        
        if len(title) >= 1:
            display(search(title, vecto))
        
search_input.observe(on_search, names = "value")

display(search_input, output)

Text(value='', description='Search')

Output()

## Implementing the coloborative filter

In [53]:
########################################################
#### example of what a user profile would look like ####
########################################################


### the column "id" in books corresponds to the column "book_id" in ratings

# creating a  dataset of some books that you like

# enter the ids the books you like from the search
my_book_ids = [342, 399, 695, 253, 1286]
my_ratings = [4, 5, 5, 5, 5]

# create a dictionary with book IDs and ratings
data = {
    'user_id': [-1] * len(my_book_ids),  # Set my_id as -1 for all ratings
    'book_id': my_book_ids,
    'rating': my_ratings
}

# create the my_books dataframe
my_books = pd.DataFrame(data)

my_books

Unnamed: 0,user_id,book_id,rating
0,-1,342,4
1,-1,399,5
2,-1,695,5
3,-1,253,5
4,-1,1286,5


In [54]:
######################
#### filter users ####
######################

def filter_users(ratings, mybooks):
    # We want to find users, that rate the books we rated highly also highly. "Common Users"
    # This is to reduce the number of rows in our ratings dataset

    # set the threshold for high ratings
    high_rating_threshold = 3

    # get the book IDs of the books you rate highly
    highly_rated_books = my_books.loc[my_books['rating'] >= high_rating_threshold, 'book_id']

    # filter the ratings dataset to select users who rate the highly rated books
    filtered_ratings = ratings[ratings['book_id'].isin(highly_rated_books)]

    # get the unique user IDs from the filtered ratings
    selected_users = filtered_ratings['user_id'].unique()

    # filter the ratings dataset to keep only the selected users
    filtered_ratings = ratings[ratings['user_id'].isin(selected_users)]
    
    # we add our ratings to our the set of filtered_ratings
    filtered_ratings = pd.concat([filtered_ratings, my_books], ignore_index = True)

    return filtered_ratings

In [55]:
#################################
#### create user item matrix ####
#################################

def user_item_matrix(filtered_ratings):
    
    # preprocessing our filtered ratings matrix before converting it to a user_item matrix
    filtered_ratings["user_id"] = filtered_ratings["user_id"].astype(int)
    filtered_ratings["user_index"] = filtered_ratings["user_id"].astype("category").cat.codes
    filtered_ratings["book_index"] = filtered_ratings["book_id"].astype("category").cat.codes
    

    # we will use a sparse matrix rather than a dense matrix to save memory
    ratings_mat_coo = coo_matrix((filtered_ratings["rating"], 
                                  (filtered_ratings["user_index"], filtered_ratings["book_index"])))
    
    # covnerting our matrix to csr format
    ratings_mat = ratings_mat_coo.tocsr()
    
    # finding row position
    my_index = filtered_ratings[filtered_ratings["user_id"]==-1]["user_index"].values[1]
    
    return ratings_mat

In [56]:
#####################################
#### calculate cosine simularity ####
#####################################

def simularity(ratings_mat, my_index):
    user_similarity = cosine_similarity(ratings_mat[my_index,:], ratings_mat).flatten()
    return user_similarity

In [57]:
#####################################
#### book recommendations system ####
#####################################

def book_rec(ratings, mybooks):
    # drop duplicates
    ratings = ratings.drop_duplicates()

    # re ordering
    ratings = ratings[["user_id", "book_id", "rating"]]
    
    # filter the users using the filter users function
    filtered_ratings = filter_users(ratings, mybooks)
    
    # create a user item matrix
    ratings_mat = user_item_matrix(filtered_ratings)
    
    # finding row position
    my_index = filtered_ratings[filtered_ratings["user_id"]==-1]["user_index"].values[1]
    
    # calculate cosine simularity
    user_similarity = simularity(ratings_mat, my_index)
    
    #### find simular users based of cosine simularity ####
    
    # find the indicies of the users that are most simular to us and take the first 15
    indices = np.argpartition(user_similarity, -15)[15:]

    # we need to find the user ids
    simular_users = filtered_ratings[filtered_ratings["user_index"].isin(indices)].copy()

    # lets remove ourself from this list
    simular_users = simular_users[simular_users["user_id"] != -1]

    # this dataframe now contains book potential recomendations of users that are most simular to us
    # simular_users
    
    ####
    
    
    ##### finding out how many times each book appears in this list ####

    book_recs = simular_users.groupby("book_id").rating.agg(["count", "mean"])

    # adding the ttitle
    book_recs = book_recs.merge(books[["id", "title", "authors", "ratings_count", "image_url"]], left_index=True, right_on="id")

    # this contains the amount of times each book appears in our rec list and the mean ratings
    #book_recs
    
    ####
    
    #### ranking recomendations ####
    #
    # we need to rank our reccomendations
    #
    # we want to have books that are popular amonst simular users to us. We do not want books that are popular in general
    #
    # so we want to get the books everyone likes, we want the books that only people simular to us liked,
    #
    # and general people didnt like or pay much attention too
    #
    # for example if everyone rates harry potter, and people simular to us rate harry potter. we dont know if 
    #
    # people simular to us like harry potter because of a simular trait or because everyone likes harry potter.


    # creating an adjusted count
    book_recs["relative_count"] = book_recs["count"]/book_recs["ratings_count"]

    # creating a score
    book_recs["score"] = book_recs["mean"] * book_recs["relative_count"]


    # take out books that we have already read
    book_recs =  book_recs[~book_recs["id"].isin(my_books["book_id"])]
    
    
    ####
    
    # only display books that have a mean rating over grater than 4 by simular users
    top_books = book_recs[book_recs["mean"]>=4]
    top_books = top_books.sort_values(by ="score", ascending = False)[:10]
    
    
    return top_books.style.format({'image_url': show_image})

In [58]:
###############################
#### final recommendations ####
###############################

# this function takes a list of books that a certain user likes and rated 
#
# and returns back personilsed recomendations

book_rec(ratings, my_books)

Unnamed: 0,count,mean,id,title,authors,ratings_count,image_url,relative_count,score
1064,60,4.166667,1065,"Career of Evil (Cormoran Strike, #3)","Robert Galbraith, J.K. Rowling",66979,,0.000896,0.003733
9807,5,5.0,9808,The Center Cannot Hold: My Journey Through Madness,Elyn R. Saks,7533,,0.000664,0.003319
8700,6,5.0,8701,"Sin City: Una Dura Despedida, #1 de 3",Frank Miller,9115,,0.000658,0.003291
5991,8,5.0,5992,Ball Four,Jim Bouton,12805,,0.000625,0.003124
4961,9,4.888889,4962,The Zombie Room,R.D. Ronald,14302,,0.000629,0.003076
5010,11,4.454545,5011,Delicious!,Ruth Reichl,17657,,0.000623,0.002775
9849,5,5.0,9850,The Egg,Andy Weir,10238,,0.000488,0.002442
4653,9,4.0,4654,The Book of Strange New Things,Michel Faber,15189,,0.000593,0.00237
6728,8,4.125,6729,Ordinary People,Judith Guest,15207,,0.000526,0.00217
5655,9,4.111111,5656,"W is for Wasted (Kinsey Millhone, #23)",Sue Grafton,17357,,0.000519,0.002132
