<h1 align=center><font size = 6>Book Recommendation with Collaborative Filtering</font></h1>

<br>

<img src="https://img-cdn.inc.com/image/upload/w_1920,h_1080,c_fill/images/panoramic/GettyImages-577674005_492115_zfpgiw.jpg" height=520 width=1000 alt="GitHub">

<br>

<small>Picture Source: <a href="https://www.inc.com/jessica-stillman/books-reading-intelligence-tyler-cowen.html">Jessica Stillman</a></small>

<br>

<h2>Context</h2>

Over the last few decades, with the rise of YouTube, Amazon, Netflix and many similar web services, recommender systems have become an increasingly large part of our lives. From e-commerce (suggesting articles that may interest buyers) to online advertising (showing users content that matches their preferences), recommender systems are now hard to avoid in our daily online journeys. In general terms, recommender systems are algorithms that suggest relevant items to users (movies to watch, texts to read, products to buy, or anything else, depending on the industry).

<br>

Recommender systems are critical in some industries: when they work well they can generate a large amount of income, and they can also help a company stand out from its competitors. As evidence of their importance, a few years ago Netflix organised a challenge (the “Netflix Prize”) whose goal was to produce a recommender system that performed better than its own algorithm, with a prize of 1 million dollars.

<br>

Using this simple dataset and its related tasks and notebooks, we will work step by step through different paradigms of recommender algorithms. For each of them, we will present how they work, describe their theoretical basis, and discuss their strengths and weaknesses. For extra information, please check <a href="https://www.kaggle.com/arashnic/book-recommendation-dataset">Kaggle Möbius</a>.

<br>
<br>

<h2>Data Set</h2>



Dataset link from Kaggle: [Book Recommendation Dataset](https://www.kaggle.com/arashnic/book-recommendation-dataset)

<br>

<h2>Objective:</h2>
<ul>
  <li>Understand the dataset.</li>
  <li>Compute the Pearson correlation between users.</li>
  <li>Make recommendations.</li>
</ul>

<br>
<h2>Keywords</h2>
<ul>
  <li>Computer Science</li>
  <li>Collaborative Filtering</li>
  <li>Pearson Correlation</li>
  <li>Recommendation Systems</li>
  <li>Book Recommendation</li>
</ul>
<br>

<h2>Content</h2>

<div class="alert alert-block alert-info" style="margin-top: 20px">

<ul>
<li><a href="https://#">Importing Libraries</a></li>
<li><a href="https://#">Data Pre-processing</a></li>
<li><a href="https://#">Pearson Correlation for the Recommendation</a></li>
</ul>
<br>

<p></p>
Estimated Time Needed: <strong>25 min</strong>
</div>

## 1. Importing Libraries

We set the stage by importing essential libraries that will empower our exploration into personalized book recommendations. Each library plays a unique role in the analytical and mathematical aspects.

<br>

- **pandas (as pd):** A powerhouse for data manipulation and analysis, pandas will be our go-to tool for handling datasets with finesse.

- **sqrt from math:** The square root function from the math library will prove handy for certain calculations, especially when dealing with similarity metrics in collaborative filtering.

- **numpy (as np):** A fundamental library for numerical operations, numpy will be instrumental in array manipulations and mathematical computations.

- **warnings:** We're using the warnings library to suppress any distracting warning messages that might pop up during our analysis. This ensures a cleaner and more focused exploration.

In [None]:
import pandas as pd
from math import sqrt
import numpy as np
import warnings
warnings.filterwarnings("ignore")

## 2. Data Pre-processing

### 2.1. Separate Datasets

We initiate the exploration by loading and organizing the datasets that form the backbone of our book recommendation analysis.

<br>

- **books_df:** We start by loading the '**Books.csv**' dataset using pandas. This dataset encapsulates information about various books, including details such as ISBN (International Standard Book Number), book title, author, and year of publication. To streamline our analysis, we selectively choose relevant columns, including ISBN, Book-Title, Book-Author, and Year-Of-Publication.

<br>

- **ratings_df:** Simultaneously, we import the '**Ratings.csv**' dataset, a crucial component for collaborative filtering. This dataset contains user ratings for various books, forming the basis for understanding user preferences and generating personalized recommendations.

In [None]:
books_df = pd.read_csv('Books.csv')

In [None]:
books_df = books_df[['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication']]
ratings_df = pd.read_csv('Ratings.csv')

In [None]:
books_df.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication
0,195153448,Classical Mythology,Mark P. O. Morford,2002
1,2005018,Clara Callan,Richard Bruce Wright,2001
2,60973129,Decision in Normandy,Carlo D'Este,1991
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999
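ISBNs are identifiers rather than numbers, and several of the values printed above are shorter than the ten characters of an ISBN-10. The following is a minimal sketch of how a digits-only ISBN column can lose its leading zeros when its type is inferred, and how passing `dtype` to `pd.read_csv` keeps it as text. The in-memory CSV and its two rows are illustrative stand-ins, not the real 'Books.csv'.

```python
import io

import pandas as pd

# Illustrative two-row CSV standing in for Books.csv (values are assumptions)
csv_text = (
    "ISBN,Book-Title\n"
    "0195153448,Classical Mythology\n"
    "0002005018,Clara Callan\n"
)

# Default type inference: a digits-only column is parsed as integers
inferred = pd.read_csv(io.StringIO(csv_text))

# Forcing the column to str keeps every character, including leading zeros
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'ISBN': str})

print(inferred['ISBN'].tolist())
print(as_text['ISBN'].tolist())
```

Keeping ISBNs as strings in both the books and ratings tables also means the later `isin` and `merge` steps compare like with like.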


### 2.2. Looking for Duplicated Book Titles

We conduct a critical examination of the '**Book-Title**' column in the '**books_df**' dataset to identify and handle any instances of duplicated book titles.

In [None]:
books_df['Book-Title'].duplicated().sum()

29225

In [None]:
books_df.drop_duplicates(subset='Book-Title', keep="last", inplace=True)

In [None]:
books_df['Book-Title'].duplicated().sum()

0
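To see what `duplicated()` and `drop_duplicates(keep="last")` do, here is a small sketch on a made-up three-row table (the ISBN values are placeholders). It also shows a side effect worth keeping in mind: only one ISBN per title survives.

```python
import pandas as pd

# Toy table: two editions share the same title (ISBNs are placeholders)
toy_books = pd.DataFrame({
    'ISBN': ['111', '222', '333'],
    'Book-Title': ['Lightning', 'Lightning', 'Jane Doe'],
})

# duplicated() flags every repeat after the first occurrence
n_dupes = toy_books['Book-Title'].duplicated().sum()

# keep='last' retains the final row for each title and drops the earlier ones
deduped = toy_books.drop_duplicates(subset='Book-Title', keep='last')

print(n_dupes)
print(deduped)
```

Because ratings are keyed by ISBN, ratings given to a dropped edition will no longer match any row in the deduplicated table.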

In [None]:
ratings_df.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,5
1,276726,155061224,2
2,276727,446520802,9
3,276729,052165615X,6
4,276729,521795028,7


### 2.3. User Input for Making Recommendations

Next, we define the books the user has already rated. Collaborative filtering needs this input: without the user's own ratings there is nothing to compare against other users.

In [None]:
userInput = [
            {'Book-Title': 'Lightning', 'Book-Rating': 9},
            {'Book-Title': 'Manhattan Hunt Club', 'Book-Rating': 9},
            {'Book-Title': 'Clara Callan', 'Book-Rating': 7},
            {'Book-Title': "Jane Doe", 'Book-Rating': 2},
            {'Book-Title': 'Wild Animus', 'Book-Rating': 5}
         ]
inputBooks= pd.DataFrame(userInput)
inputBooks

Unnamed: 0,Book-Title,Book-Rating
0,Lightning,9
1,Manhattan Hunt Club,9
2,Clara Callan,7
3,Jane Doe,2
4,Wild Animus,5


### 2.4. Creating User Subset for Collaborative Filtering

The focus is on preparing the input data for the collaborative filtering analysis. The process begins by identifying the subset of the '**books_df**' dataset that corresponds to the book titles specified in the user's input. This is achieved through the use of the `.isin()` method, allowing the extraction of relevant rows based on the 'Book-Title' column from the user's input. Subsequently, a merging operation takes place to obtain the necessary details, including '**ISBN**', for the identified books. This merging step is crucial for aligning the user's input with the broader dataset.

Following the merge, unnecessary information is trimmed down by dropping the '**Year-Of-Publication**' column from the '**inputBooks**' dataframe. The final result is a refined input dataframe that encapsulates the essential details of the user-specified books. It's worth noting that if a book specified by the user is not present in this final input dataframe, it may be due to variations in spelling or capitalization, warranting a careful check for data consistency. This meticulous preparation ensures that the input data is well-structured and ready for integration into the collaborative filtering model, laying the foundation for generating personalized book recommendations based on user preferences.

In [None]:
inputId = books_df[books_df['Book-Title'].isin(inputBooks['Book-Title'].tolist())]
inputBooks = pd.merge(inputId, inputBooks)
inputBooks = inputBooks.drop(columns='Year-Of-Publication')
inputBooks

Unnamed: 0,ISBN,Book-Title,Book-Author,Book-Rating
0,2005018,Clara Callan,Richard Bruce Wright,7
1,971880107,Wild Animus,Rich Shapero,5
2,449006522,Manhattan Hunt Club,JOHN SAUL,9
3,1551665107,Jane Doe,R.J. Kaiser,2
4,553290703,Lightning,Patricia Potter,9
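Since an exact-match lookup silently drops titles that differ in spelling, capitalization, or stray spaces, one option is to match on a normalized key and report what did not match. The sketch below uses toy data; the `normalize` helper and the `title_key` column are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Toy catalogue and user input (placeholder ISBNs)
toy_books = pd.DataFrame({
    'ISBN': ['111', '222'],
    'Book-Title': ['Manhattan Hunt Club', 'Wild Animus'],
})
user_input = pd.DataFrame({
    'Book-Title': ['manhattan hunt club ', 'Wild Animus', 'Unknown Book'],
    'Book-Rating': [9, 5, 7],
})

def normalize(titles):
    """Hypothetical helper: trim whitespace and lower-case a Series of titles."""
    return titles.str.strip().str.lower()

toy_books['title_key'] = normalize(toy_books['Book-Title'])
user_input['title_key'] = normalize(user_input['Book-Title'])

# Join on the normalized key so case and spacing differences do not matter
matched = toy_books.merge(
    user_input[['title_key', 'Book-Rating']], on='title_key', how='inner'
)

# Titles from the input that found no partner in the catalogue
unmatched = user_input.loc[
    ~user_input['title_key'].isin(toy_books['title_key']), 'Book-Title'
].tolist()

print(matched[['ISBN', 'Book-Title', 'Book-Rating']])
print(unmatched)
```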


The user's rated books are now merged with their ISBNs and authors from '**books_df**'.

In [None]:
# Keep only the ratings of users who have rated at least one of the input books
userSubset = ratings_df[ratings_df['ISBN'].isin(inputBooks['ISBN'].tolist())]
userSubset.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
86676,18401,1551665107,3
128287,29806,1551665107,5
162695,35859,1551665107,1
166997,36609,1551665107,5
201177,45114,1551665107,5


The focus shifts to creating a user subset for collaborative filtering analysis. The user subset is generated by grouping the '**userSubset**' dataframe based on the '**User-ID**' column. A specific user, in this case, '**User-ID**' 18401, is isolated for closer inspection using the `get_group()` method.

In [None]:
userSubsetGroup = userSubset.groupby('User-ID')

In [None]:
userSubsetGroup.get_group(18401)

Unnamed: 0,User-ID,ISBN,Book-Rating
86676,18401,1551665107,3


The user subset is then organized by the number of entries per user in descending order, achieved through sorting the '**userSubsetGroup**' using a lambda function. This arrangement allows prioritizing users with a higher number of interactions, contributing to a more robust collaborative filtering model.

In [None]:
userSubsetGroup = sorted(userSubsetGroup,  key=lambda x: len(x[1]), reverse=True)

In [None]:
userSubsetGroup[0:3]

[(18401,
         User-ID        ISBN  Book-Rating
  86676    18401  1551665107            3),
 (29806,
          User-ID        ISBN  Book-Rating
  128287    29806  1551665107            5),
 (35859,
          User-ID        ISBN  Book-Rating
  162695    35859  1551665107            1)]
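The grouping and ordering steps are easier to follow on a toy table where users share different numbers of books with the input. This sketch uses made-up IDs and ratings; `groups.size()` is shown as an alternative way to obtain the same ordering without building a list of tuples.

```python
import pandas as pd

# Toy subset: user 3 shares three books, user 2 two, user 1 one
toy_subset = pd.DataFrame({
    'User-ID': [1, 2, 2, 3, 3, 3],
    'ISBN': ['a', 'a', 'b', 'a', 'b', 'c'],
    'Book-Rating': [5, 7, 8, 3, 9, 6],
})

groups = toy_subset.groupby('User-ID')

# All rows belonging to one user
user_2 = groups.get_group(2)

# Iterating a groupby yields (key, DataFrame) pairs; sort by group length
ordered = sorted(groups, key=lambda pair: len(pair[1]), reverse=True)
ordered_ids = [user_id for user_id, _ in ordered]

# Same ordering from the group sizes alone
overlap_counts = groups.size().sort_values(ascending=False)

print(ordered_ids)
print(overlap_counts)
```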

To limit computation, the user subset is narrowed to at most the 100 users who share the most books with the input. This selection, stored in '**userSubsetGroup**', forms the basis for the collaborative filtering analysis.

In [None]:
userSubsetGroup = userSubsetGroup[0:100]

## 3. Pearson Correlation for the Recommendation

The Pearson correlation coefficient (ρ) is a statistical measure that quantifies the linear relationship between two variables, X and Y. The formula for calculating Pearson correlation is as follows:

<br>

$$ \rho = \frac{\sum{(X_i - \bar{X})(Y_i - \bar{Y})}}{\sqrt{\sum{(X_i - \bar{X})^2} \sum{(Y_i - \bar{Y})^2}}} $$

<br>

Here's a breakdown of the terms in the formula:

- $ \rho $: Pearson correlation coefficient.
- $ X_i $ and $ Y_i $: Individual data points in the datasets X and Y.
- $ \bar{X} $ and $ \bar{Y} $: Mean (average) of the respective datasets X and Y.

The numerator represents the sum of the product of the differences between each data point and the mean of its respective dataset. The denominator involves the square root of the product of the sums of squared differences from the mean for both datasets.

<br>

The resulting Pearson correlation coefficient ranges from -1 to 1:

- $ \rho = 1 $: Perfect positive correlation.
- $ \rho = -1 $: Perfect negative correlation.
- $ \rho = 0 $: No linear correlation.

<br>

In collaborative filtering for book recommendations, Pearson correlation is commonly used to measure the similarity between user preferences based on their ratings. A positive correlation suggests similar tastes, while a negative correlation implies dissimilar preferences.
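As a quick check of the formula, the sketch below computes the coefficient directly from the definition with NumPy and compares it against `np.corrcoef`. The two rating lists are invented for illustration, and the `pearson` function name is hypothetical. Returning 0 when the denominator is zero is a convention, mirroring the fallback used in the loop that follows.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation from the definition; returns 0.0 if a variance term is zero."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of X
    dy = y - y.mean()  # deviations from the mean of Y
    denominator = np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
    if denominator == 0:
        return 0.0
    return float((dx * dy).sum() / denominator)

# Invented ratings of two users for the same five books
input_user = [9, 9, 7, 2, 5]
other_user = [8, 9, 6, 1, 4]

rho = pearson(input_user, other_user)
reference = np.corrcoef(input_user, other_user)[0, 1]

print(rho, reference)
```

With a single shared rating, as in `pearson([5], [3])`, both deviation vectors are zero and the function returns 0.0.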


In [None]:
# Initialize an empty dictionary to store Pearson correlation coefficients
pearsonCorrelationDict = {}

# For every user group in our subset
for name, group in userSubsetGroup:
    # Sort the input and current user group by ISBN for consistency
    group = group.sort_values(by='ISBN')
    inputBooks = inputBooks.sort_values(by='ISBN')

    # Get the number of ratings (N) for the formula
    nRatings = len(group)

    # Get the review scores for the books they both have in common
    temp_df = inputBooks[inputBooks['ISBN'].isin(group['ISBN'].tolist())]

    # Store review scores in temporary lists for future calculations
    tempRatingList = temp_df['Book-Rating'].tolist()
    tempGroupList = group['Book-Rating'].tolist()

    # Calculate the components of the Pearson correlation formula
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)
    Sxy = sum(i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)

    # If the denominator is different than zero, then calculate Pearson correlation, else, set correlation to 0
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0

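The resulting dictionary '**pearsonCorrelationDict**' maps each user in the subset to the Pearson correlation coefficient between that user's ratings and the input user's ratings, which measures how similar their preferences are. A coefficient of 0 is also stored whenever either variance term is zero; that is the case for every user in the output below, because each of them shares only one rated book with the input.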
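The loop does not subtract the means explicitly. It relies on the standard sum-of-squares identities, which give the same quantities as the formula at the start of this section, with $n$ the number of shared ratings (`nRatings` in the code):

$$ S_{xx} = \sum{X_i^2} - \frac{\left(\sum{X_i}\right)^2}{n} = \sum{(X_i - \bar{X})^2}, \qquad S_{yy} = \sum{Y_i^2} - \frac{\left(\sum{Y_i}\right)^2}{n} = \sum{(Y_i - \bar{Y})^2} $$

$$ S_{xy} = \sum{X_i Y_i} - \frac{\sum{X_i} \sum{Y_i}}{n} = \sum{(X_i - \bar{X})(Y_i - \bar{Y})}, \qquad \rho = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} $$

For $n = 1$ the first identity reduces to $S_{xx} = X_1^2 - X_1^2 = 0$, so the denominator vanishes and the code falls back to a correlation of 0.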

In [None]:
pearsonCorrelationDict.items()

dict_items([(18401, 0), (29806, 0), (35859, 0), (36609, 0), (45114, 0), (72238, 0), (73394, 0), (78553, 0), (79903, 0), (93518, 0), (135265, 0), (135321, 0), (143411, 0), (158295, 0), (166596, 0), (167349, 0), (175003, 0), (201451, 0), (221908, 0), (230522, 0), (242824, 0)])

### 3.1. Pearson Correlation Dictionary to DataFrame

The code below transforms the calculated Pearson correlation coefficients into a structured DataFrame ('**pearsonDF**'), where each row corresponds to a user in the subset and the columns hold the similarity index and the user ID. This DataFrame is the starting point for generating recommendations.

In [None]:
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['userId'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
pearsonDF.head()

Unnamed: 0,similarityIndex,userId
0,0,18401
1,0,29806
2,0,35859
3,0,36609
4,0,45114
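The same dictionary-to-table conversion can be written as a single chain. This is only an alternative sketch on invented correlation values, not the notebook's method; it goes through a `pd.Series` and uses `rename_axis` and `reset_index` to produce the same two columns.

```python
import pandas as pd

# Invented user IDs and correlation values
toy_correlations = {101: 0.9, 202: -0.3, 303: 0.0}

toy_pearson_df = (
    pd.Series(toy_correlations, name='similarityIndex')  # values become the column
      .rename_axis('userId')                             # name the index (the dict keys)
      .reset_index()                                     # move the keys into a column
)[['similarityIndex', 'userId']]                         # match the column order used here

print(toy_pearson_df)
```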


### 3.2. Extracting Top Similar Users

We sort '**pearsonDF**' by similarity index in descending order, keep at most the 50 most similar users, and display the first few rows.

In [None]:
topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]
topUsers.head()

Unnamed: 0,similarityIndex,userId
0,0,18401
11,0,135321
19,0,230522
18,0,221908
17,0,201451


### 3.3. Merging Top Similar Users with Ratings Data

We merge '**topUsers**' with '**ratings_df**' on the user ID, so that every rating given by a top user is paired with that user's similarity index, and display the first few rows of the result.

In [None]:
topUsersRating=topUsers.merge(ratings_df, left_on='userId', right_on='User-ID', how='inner')
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,User-ID,ISBN,Book-Rating
0,0,18401,18401,60009241,8
1,0,18401,18401,60085444,9
2,0,18401,18401,60092149,5
3,0,18401,18401,60502177,6
4,0,18401,18401,61012513,7
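The merge leaves two identical key columns ('userId' and 'User-ID') in the result. A small sketch on invented data shows the same inner merge with the redundant column dropped; `validate='one_to_many'` is an optional pandas check that each user appears only once on the left side.

```python
import pandas as pd

# Invented top users and ratings
top = pd.DataFrame({'similarityIndex': [0.9, 0.4], 'userId': [101, 202]})
ratings = pd.DataFrame({
    'User-ID': [101, 101, 202, 303],
    'ISBN': ['a', 'b', 'a', 'c'],
    'Book-Rating': [8, 6, 10, 7],
})

merged = (
    top.merge(ratings, left_on='userId', right_on='User-ID',
              how='inner', validate='one_to_many')
       .drop(columns='User-ID')  # same information as 'userId'
)

print(merged)
```

User 303 does not appear in the result because an inner merge keeps only users present in both tables.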


### 3.4. Calculating Weighted Ratings for Top Similar Users

We compute a weighted rating for each row by multiplying the user's similarity index by the rating that user gave the book.

In [None]:
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['Book-Rating']
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,User-ID,ISBN,Book-Rating,weightedRating
0,0,18401,18401,60009241,8,0
1,0,18401,18401,60085444,9,0
2,0,18401,18401,60092149,5,0
3,0,18401,18401,60502177,6,0
4,0,18401,18401,61012513,7,0


### 3.5. Aggregating Weighted Ratings for Books

For each book, we sum the similarity indices and the weighted ratings across all rows of the '**topUsersRating**' DataFrame.

In [None]:
tempTopUsersRating = topUsersRating.groupby('ISBN')[['similarityIndex','weightedRating']].sum()
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()

ISBN,sum_similarityIndex,sum_weightedRating
000104799X,0,0
002026478X,0,0
002045211X,0,0
002089130X,0,0
006000147X,0,0


### 3.6. Generating Weighted Average Recommendation Scores

We compute each book's recommendation score as a weighted average: the sum of its weighted ratings divided by the sum of the similarity indices of the users who rated it.

In [None]:
recommendation_df = pd.DataFrame()
# Now we take the weighted average
recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_df['ISBN'] = tempTopUsersRating.index
recommendation_df.head()

ISBN,weighted average recommendation score,ISBN
000104799X,,000104799X
002026478X,,002026478X
002045211X,,002045211X
002089130X,,002089130X
006000147X,,006000147X
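To show what the weighted, aggregated, and averaged scores look like when similarities are non-zero, here is a sketch of sections 3.4 to 3.6 on invented data. It also includes one book rated only by a user with similarity 0, whose score is undefined (0 divided by 0) and is dropped. Replacing zero denominators with NaN before dividing is a choice made for this sketch, not part of the original code.

```python
import numpy as np
import pandas as pd

# Invented merged table: similarity of each user next to each of their ratings
toy = pd.DataFrame({
    'similarityIndex': [0.9, 0.9, 0.4, 0.0],
    'userId': [101, 101, 202, 303],
    'ISBN': ['a', 'b', 'a', 'c'],
    'Book-Rating': [8, 6, 10, 7],
})

# 3.4: weight each rating by the rater's similarity
toy['weightedRating'] = toy['similarityIndex'] * toy['Book-Rating']

# 3.5: sum similarities and weighted ratings per book
sums = toy.groupby('ISBN')[['similarityIndex', 'weightedRating']].sum()
sums.columns = ['sum_similarityIndex', 'sum_weightedRating']

# 3.6: weighted average; a zero denominator becomes NaN and is dropped
denominator = sums['sum_similarityIndex'].replace(0, np.nan)
scores = (sums['sum_weightedRating'] / denominator).dropna()
scores = scores.sort_values(ascending=False)

print(scores)
```

Book 'a' scores (0.9 × 8 + 0.4 × 10) / (0.9 + 0.4), book 'b' scores 6, and book 'c' is excluded.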


### 3.7. Sorting Books by Weighted Average Recommendation Scores

We sort the '**recommendation_df**' DataFrame by the weighted average recommendation score in descending order.

In [None]:
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
recommendation_df.head()

ISBN,weighted average recommendation score,ISBN
000104799X,,000104799X
002026478X,,002026478X
002045211X,,002045211X
002089130X,,002089130X
006000147X,,006000147X


### 3.8. Retrieving Book Details for Top Recommendations

Finally, we retrieve the details of the top recommended books by matching their ISBN values against the '**books_df**' DataFrame.

In [None]:
books_df.loc[books_df['ISBN'].isin(recommendation_df.head(10)['ISBN'].tolist())]

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication
3701,002026478X,AGE OF INNOCENCE (MOVIE TIE-IN),Edith Wharton,1993
47683,006000181X,With Her Last Breath,Cait London,2003
49133,006008460X,Cheaper by the Dozen (Perennial Classics),Frank B. Gilbreth,2002
51433,000104799X,Monk's-hood,Ellis Peters,1994
81494,006000553X,Victoria and the Rogue (An Avon True Romance),Meg Cabot,2003
127878,006000147X,Cherokee Warriors: The Loner,Genell Dellin,2003
225224,006008197X,Once Upon a Town : The Miracle of the North Pl...,Bob Greene,2003
245421,002045211X,Salazar Blinks (Collier Fiction),David R. Slavitt,1990
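Filtering '**books_df**' with `isin` returns rows in the order of '**books_df**', not in ranking order. If the ranking should be preserved, one option is a left merge from the ranked table, sketched below on invented data (the `score` column name is an assumption for this example).

```python
import pandas as pd

# Invented catalogue and ranked scores; ISBN 'z' is not in the catalogue
toy_books = pd.DataFrame({
    'ISBN': ['a', 'b', 'c'],
    'Book-Title': ['Alpha', 'Beta', 'Gamma'],
})
toy_scores = pd.DataFrame({'ISBN': ['c', 'a', 'z'], 'score': [9.1, 7.5, 6.0]})

top_n = toy_scores.sort_values('score', ascending=False).head(10)

# A left merge keeps the row order of top_n and marks missing books with NaN
details = top_n.merge(toy_books, on='ISBN', how='left')

print(details)
```

Rows whose title is NaN make it visible when a recommended ISBN has no entry in the catalogue.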


The code looks up the ISBNs of the first 10 rows of '**recommendation_df**' in '**books_df**' and displays their titles, authors, and publication years. Fewer than 10 rows can appear, as here, when an ISBN is missing from '**books_df**' (for example, because its title was removed as a duplicate). Note that in this run every recommendation score is NaN, since all similarity indices are 0, so the rows shown do not reflect a meaningful ranking.

<h1>Contact Me</h1>
<p>If you have any questions or feedback, please contact me:</p>

<ul>
  <li>Twitter: <a href="https://twitter.com/Doguilmak">Doguilmak</a></li>
  <li>Mail address: doguilmak@gmail.com</li>
</ul>

In [1]:
from datetime import datetime
print(f"Changes have been made to the project on {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

Changes have been made to the project on 2023-12-06 06:34:02
