# KNN Book recommendation

The objective of this notebook is to explore the data sets and then build a model that recommends which books a bookstore should stock, based on the bookstore's location and the age demographic of that location.

## Setup Environment

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.decomposition import TruncatedSVD

In [None]:
# Cleaned data lives in ../data/cleaned relative to the notebook's working directory
path = os.path.join(os.path.normpath(os.path.join(os.getcwd(), os.pardir)), "data", "cleaned")

In [None]:
# Parent of the current working directory (the project root)
cwd = os.path.normpath(os.path.join(os.getcwd(), os.pardir))

In [None]:
os.listdir(path)

#### Read in clean data

In [None]:
ratings = pd.read_csv(path + '/BX-Ratings.csv')
print(ratings.dtypes)
ratings.head()

In [None]:
users = pd.read_csv(path + '/BX-Users.csv')
print(users.dtypes)
users.head()

In [None]:
plt.figure(figsize=(8, 6))
plt.hist(users['User-Age'], bins=30, color='skyblue', edgecolor='black')
plt.title('Distribution of User Ages')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.xlim(0,80)
plt.grid(True)
plt.show()

In [None]:
books = pd.read_csv(path + '/BX-Books.csv')
print(books.dtypes)
books.head()

We should only use ratings for books that appear in our books data set

In [None]:
ratings_new = ratings[ratings.ISBN.isin(books.ISBN)]
ratings.shape,ratings_new.shape

Both `ratings` and `ratings_new` have the same shape, so all the ratings are of books in our books data set.<br>
That's great!

Now our ratings need to be from users in our user data set

In [None]:
print("Shape of dataset before dropping",ratings_new.shape)
ratings_new = ratings_new[ratings_new['User-ID'].isin(users['User-ID'])]
print("shape of dataset after dropping",ratings_new.shape)

Okay, so all the ratings we have also have their corresponding user data, nice!

#### Analyse rating distribution

In [None]:
ratings_counts = ratings['Book-Rating'].value_counts(sort=False)

ratings_counts_sorted = ratings_counts.sort_index()

plt.rc("font", size=15)
ratings_counts_sorted.plot(kind='bar')
plt.title('Rating Distribution\n')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()

It seems that the ratings are heavily negatively skewed: most ratings sit at the high end of the scale. This suggests that people review books they like more often than books they don't like.
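The skew can also be checked numerically rather than only by eye. The sketch below uses pandas' `Series.skew()` on a small made-up sample (the values are illustrative assumptions, not taken from `BX-Ratings.csv`); a negative result indicates a longer tail towards low ratings.

```python
import pandas as pd

# Toy ratings (hypothetical values, not from the BX data)
toy_ratings = pd.Series([10, 9, 9, 8, 8, 8, 7, 7, 5, 2], name="Book-Rating")

# Series.skew() returns the sample skewness;
# a negative value means the tail points towards low ratings
skewness = toy_ratings.skew()
print(f"skewness = {skewness:.2f}")
```

On the real data the same call would be `ratings['Book-Rating'].skew()`.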

Let's take a look at the top 5 books that have been rated the most by users

In [None]:
rating_count = pd.DataFrame(ratings.groupby('ISBN')['Book-Rating'].count())
rating_count.sort_values('Book-Rating', ascending=False).head()

In [None]:
most_rated_books = pd.DataFrame(['0316666343', '0971880107', '0385504209', '0312195516', '0060928336'], index=np.arange(5), columns = ['ISBN'])
most_rated_books_summary = pd.merge(most_rated_books, books, on='ISBN')
most_rated_books_summary

Just for intuition let's see the distribution of ratings for the top book: "The Lovely Bones: A Novel"

In [None]:
isbn_0316666343_df = ratings[ratings['ISBN'] == '0316666343']

isbn_0316666343_df_counts = isbn_0316666343_df['Book-Rating'].value_counts(sort=False)

isbn_0316666343_df_sorted = isbn_0316666343_df_counts.sort_index()

plt.rc("font", size=15)
isbn_0316666343_df_sorted.plot(kind='bar')
plt.title('Rating Distribution\n')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()

It seems that the same trend still holds, that's great!

Now let's create a new column in our ratings df for the average rating of each book

In [None]:
# Create column with the average rating of each book
ratings['Avg_Rating']=ratings.groupby('ISBN')['Book-Rating'].transform('mean')
# Create column with the number of times each book has been rated (a count, not a sum)
ratings['Times been Rated']=ratings.groupby('ISBN')['Book-Rating'].transform('count')

In [None]:
ratings.head()

Okay now let's merge the data sets together for further analysis

In [None]:
Final_Dataset=users.copy()
Final_Dataset=pd.merge(Final_Dataset,ratings,on='User-ID')
Final_Dataset=pd.merge(Final_Dataset,books,on='ISBN')

In [None]:
Final_Dataset.head()

In [None]:
missing_values_count = Final_Dataset.isna().sum()
total_values = len(Final_Dataset)

# Calculate the percentage of missing values for each column
missing_values_percentage = (missing_values_count / total_values) * 100

# Creating a new DataFrame to store the missing values count and percentage
missing_values_df = pd.DataFrame({'Column': missing_values_count.index, 
                                  'Missing Values': missing_values_count.values,
                                  'Missing Values (%)': missing_values_percentage.values})

missing_values_df

It seems that there are a lot of missing values in the location columns. Since we are trying to train our model to give recommendations based on location, I think that we should omit these rows.

In [None]:
Final_Dataset.dropna(inplace=True)
missing_values_count = Final_Dataset.isna().sum()
total_values = len(Final_Dataset)

# Calculate the percentage of missing values for each column
missing_values_percentage = (missing_values_count / total_values) * 100

# Creating a new DataFrame to store the missing values count and percentage
missing_values_df = pd.DataFrame({'Column': missing_values_count.index, 
                                  'Missing Values': missing_values_count.values,
                                  'Missing Values (%)': missing_values_percentage.values})

missing_values_df
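The cell above drops every row that has a missing value in any column. If only the location columns matter, a narrower drop is possible with the `subset` argument of `DataFrame.dropna`. This is a sketch on a toy frame; the column names follow the notebook, but the rows are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical rows) with gaps in different columns
toy = pd.DataFrame({
    "User-ID": [1, 2, 3],
    "User-City": ["melbourne", np.nan, "dhaka"],
    "User-Country": ["australia", "usa", "bangladesh"],
    "Book-Author": ["A", "B", np.nan],  # missing, but unrelated to location
})

# Only require the location columns to be present
location_cols = ["User-City", "User-Country"]
kept = toy.dropna(subset=location_cols)

print(len(toy.dropna()), "row(s) survive a blanket dropna")
print(len(kept), "row(s) survive a location-only dropna")
```

The narrower drop keeps rows whose only gap is in a column the model never uses.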

### Categorising by age and location

If we want to help bookstore owners pick new books to purchase for their store, we need to find out what books the people in that area like. We will build artificial users, each one an aggregate of the real users in a given area and age group. We then feed these into a model that tells us what kind of books to recommend to that demographic.

#### Location categorisation

Let's see how the locations are distributed to get a feel for how to categorise location

In [None]:
country_counts = Final_Dataset['User-Country'].value_counts()

# Plotting the results
country_counts.plot(kind='bar')
plt.title('Number of People in Each Country')
plt.xlabel('Country')
plt.ylabel('Count')
plt.show()

While this doesn't tell us much, it's clear that a few countries have far more users than the rest. It wouldn't be appropriate to group all of those users together.
So we will break these countries down into states or cities. Let's first find out which countries have the most and the fewest users.

In [None]:
country_counts.head(10)

In [None]:
print(len(country_counts))
country_counts.tail(10)

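Firstly, there are a lot of countries with only a handful of entries, which is not enough to learn what kind of books that country/demographic enjoys. We will drop countries with 5 or fewer entries from the training data. Note that `country_counts` counts rows of `Final_Dataset`, so each count is a number of ratings rather than a number of distinct users.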

In [None]:
# Count rows (one per rating) for each country
country_counts_df = Final_Dataset.groupby('User-Country').size()

# Drop countries with 5 or fewer rows; build the mask from the same Series we index into
countries_to_drop = country_counts_df[country_counts_df <= 5].index
rows_to_drop = Final_Dataset[Final_Dataset['User-Country'].isin(countries_to_drop)].index
Final_Dataset = Final_Dataset.drop(rows_to_drop)

#Recount
country_counts = Final_Dataset['User-Country'].value_counts()
print(len(country_counts[(country_counts > 5) & (country_counts < 30)]))
country_counts.tail(10)

We will categorise as follows:<br>
Countries with fewer than 30 users are grouped by country.<br>
Countries with 30 to 199 users are grouped by state.<br>
Countries with 200 to 499 users are grouped by city.<br>
Countries with 500 or more users are grouped by city, then by age category (young, middle-aged, old).<br>

The objective is to get roughly the same number of users and reviews in each section such that the range of books is not too wide
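The rule above can be written as a single function that maps a country's count to a grouping level. This is a minimal sketch: the function name `grouping_level`, the level labels, and the toy counts are illustrative assumptions, and the boundaries are taken as half-open so that every count lands in exactly one band.

```python
import pandas as pd

def grouping_level(n: int) -> str:
    """Map a country's count to the granularity used for its demographic key."""
    if n < 30:
        return "country"
    if n < 200:
        return "state"
    if n < 500:
        return "city"
    return "city+age"

# Toy counts (hypothetical numbers, not the real BX country counts)
toy_counts = pd.Series({"usa": 12000, "australia": 450, "portugal": 120, "bangladesh": 8})
levels = toy_counts.map(grouping_level)
print(levels.to_dict())
```

Keeping the thresholds in one place makes it harder for a boundary value such as exactly 500 to fall between two filters.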

In [None]:
countries_lt_30_list = country_counts[country_counts < 30].index
countries_30_200_list = country_counts[(country_counts >= 30) & (country_counts < 200)].index
countries_200_500_list = country_counts[(country_counts >= 200) & (country_counts < 500)].index
# Use >= so a country with exactly 500 rows is not silently left out of every group
countries_gt_500_list = country_counts[country_counts >= 500].index

In [None]:
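# .copy() makes each subset an independent frame, so the column assignments
# in the following cells do not trigger SettingWithCopyWarning
df_30 = Final_Dataset[Final_Dataset['User-Country'].isin(countries_lt_30_list)].copy()
df_30_200 = Final_Dataset[Final_Dataset['User-Country'].isin(countries_30_200_list)].copy()
df_200_500 = Final_Dataset[Final_Dataset['User-Country'].isin(countries_200_500_list)].copy()
df_500 = Final_Dataset[Final_Dataset['User-Country'].isin(countries_gt_500_list)].copy()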

In [None]:
df_30

In [None]:
countries = df_30['User-Country']
new_ids_30 = countries + '-Na-Na-Na'
df_30['User-ID'] = new_ids_30

df_30


In [None]:
countries = df_30_200['User-Country']
states = df_30_200['User-State']
new_ids_30_200 = countries + '-'+ states +'-Na-Na'
df_30_200['User-ID'] = new_ids_30_200

df_30_200

In [None]:
countries = df_200_500['User-Country']
states = df_200_500['User-State']
cities = df_200_500['User-City']
new_ids_200_500 = countries + '-'+ states +'-' + cities + '-Na'
df_200_500['User-ID'] = new_ids_200_500

df_200_500

Now for age grouping we need a way to separate the different ages. Let's do some quick analysis of the ages.

In [None]:
# The cut points are the 18th and 82nd percentiles, so name and label them accordingly
percentile_18 = np.percentile(Final_Dataset['User-Age'], 18)

percentile_82 = np.percentile(Final_Dataset['User-Age'], 82)

print("18th percentile of age:", percentile_18)
print("82nd percentile of age:", percentile_82)

In [None]:
plt.figure(figsize=(8, 6))
plt.hist(Final_Dataset['User-Age'], bins=30, color='skyblue', edgecolor='black')
plt.title('Distribution of User Ages')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

The age distribution is approximately normal, so let's split it into three parts. The bins in the next cell are left-closed (`right=False`), so each boundary age belongs to the older group:<br>

1. Young: 0 - 27
2. Middle-aged: 28 - 42
3. Old: 43 - &infin;

In [None]:
age_bins = [0, 28, 43, np.inf]
age_labels = ['Young', 'Middle-aged', 'Old']
df_500['Age-Category'] = pd.cut(df_500['User-Age'], bins=age_bins, labels=age_labels, right=False)
df_500['Age-Category'] = df_500['Age-Category'].astype('object')

df_500
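To see which side of each boundary an age falls on with `right=False`, a quick check on a handful of boundary ages can help. The ages below are made-up test values; the bins and labels are the ones used in the cell above.

```python
import numpy as np
import pandas as pd

# Boundary ages chosen to probe the bin edges (illustrative values)
ages = pd.Series([27, 28, 42, 43, 90])

# right=False makes the bins left-closed: [0, 28), [28, 43), [43, inf)
cats = pd.cut(
    ages,
    bins=[0, 28, 43, np.inf],
    labels=["Young", "Middle-aged", "Old"],
    right=False,
)
print(dict(zip(ages, cats)))
```

With left-closed bins, 28 is the first "Middle-aged" age and 43 is the first "Old" age.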

In [None]:
countries = df_500['User-Country']
states = df_500['User-State']
cities = df_500['User-City']
age = df_500['Age-Category']
new_ids_500 = countries + '-'+ states +'-' + cities + '-' + age
df_500['User-ID'] = new_ids_500

df_500.head()

In [None]:
demographic_final_data = pd.concat([df_30, df_30_200, df_200_500, df_500], ignore_index=True)
demographic_final_data.rename(columns={'User-ID': 'demographic'}, inplace=True)

demographic_final_data.head()

In [None]:
user_id_counts = demographic_final_data.groupby('demographic').size()
print(len(user_id_counts))

Nice, so now we have the same dataframe, but keyed by demographic rather than by individual user, which will be more useful for bookstores and booksellers.

## Book recommendation system

In this section we transform the data so that it is better suited to KNN.


In [None]:
ratings.head()

### Implementing KNN

Let's make a matrix of users and books to see which books get ratings from similar users

In [None]:
ratings_matrix = ratings.pivot(index='User-ID', columns='ISBN', values='Book-Rating')
userID = ratings_matrix.index
ISBN = ratings_matrix.columns
print(ratings_matrix.shape)
ratings_matrix.head()

In [None]:
n_users = ratings_matrix.shape[0]
n_books = ratings_matrix.shape[1]
print(n_users, n_books)

Let's fill in those NaNs

In [None]:
ratings_matrix.fillna(0, inplace = True)
ratings_matrix = ratings_matrix.astype(np.int32)

In [None]:
ratings_matrix

In [None]:
# Sparsity = 1 - (number of ratings) / (number of users * number of books)
sparsity = 1.0 - len(ratings) / float(n_users * n_books)
print('Matrix Sparsity: ' + str(sparsity * 100) + '%')

This is a very high sparsity, and something we need to fix for our KNN model to work better.<br> To fix this we should omit books with few ratings.
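For intuition, sparsity is the fraction of user-book cells that hold no rating. The sketch below computes it on a tiny made-up matrix, under the assumption that a 0 means "no rating" (as it does after the `fillna(0)` step).

```python
import numpy as np

# Toy user x book matrix (hypothetical ratings); 0 is treated as "not rated"
toy_matrix = np.array([
    [5, 0, 0, 0],
    [0, 0, 8, 0],
    [0, 7, 0, 0],
])

n_users_toy, n_books_toy = toy_matrix.shape
n_ratings_toy = np.count_nonzero(toy_matrix)

# 3 ratings out of 12 cells leaves 75% of the matrix empty
sparsity_toy = 1.0 - n_ratings_toy / (n_users_toy * n_books_toy)
print(f"sparsity = {sparsity_toy:.2%}")
```

Note that the denominator is users times books, not ratings times books.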

In [None]:
raw_ratings = pd.read_csv(path + '/BX-Ratings.csv')


combine_book_rating = pd.merge(raw_ratings, books, on = 'ISBN')
combine_book_rating

We only really care about the book ratings and book titles here, so let's remove some distracting columns

In [None]:
cols_to_drop = ['Book-Author', 'Book-Info', 'Year-Of-Publication', 'Book-Publisher', 'Book-Vector', 'Year-Of-Publication-Group', 'Year-Of-Publication-Group-Encoded']
combine_book_rating = combine_book_rating.drop(cols_to_drop, axis = 1)
combine_book_rating.head()

Let's group it by titles and find out how many times each title was reviewed

In [None]:
grouped_ratings = combine_book_rating.groupby(by=['Book-Title'])['Book-Rating'].count()
reset_index = grouped_ratings.reset_index()
renamed_columns = reset_index.rename(columns={'Book-Rating': 'Total-Rating-Count'})
book_ratingcount = renamed_columns[['Book-Title', 'Total-Rating-Count']]

In [None]:
sorted_count = book_ratingcount.sort_values(by='Total-Rating-Count', ascending=False)
sorted_count.head()

In [None]:
merged_counts = combine_book_rating.merge(book_ratingcount, left_on = 'Book-Title', right_on = 'Book-Title', how = 'inner' )

merged_counts.head()

Okay, now let's analyse our distribution and keep the books that have a significant number of ratings

In [None]:
print(book_ratingcount['Total-Rating-Count'].describe())

In [None]:
print(book_ratingcount['Total-Rating-Count'].quantile(np.arange(.9,1,.01)))

It seems that only a few books have a significant number of ratings. Because we have so many books, let's take the top 4% of books, that is, the books that received 47 or more ratings.
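Rather than reading the cut-off from the printed quantiles and hard-coding it, the threshold can be derived from the quantile directly. This sketch uses invented counts and an 80th-percentile cut purely for illustration; on the real data the quantile level would be chosen from the distribution printed above.

```python
import pandas as pd

# Toy rating counts per title (hypothetical numbers)
counts = pd.Series([1, 1, 1, 2, 2, 3, 5, 8, 40, 60], name="Total-Rating-Count")

# Derive the popularity threshold from a quantile instead of hard-coding it
threshold = counts.quantile(0.8)
popular = counts[counts >= threshold]

print("threshold:", threshold)
print("popular counts:", popular.tolist())
```

Deriving the threshold this way keeps it in step with the data if the cleaned files change.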

In [None]:
popularity_threshold = 47
rating_popular_book = merged_counts[merged_counts['Total-Rating-Count'] >= popularity_threshold]

In [None]:
ratings_removed = len(merged_counts) - len(rating_popular_book)
print('ratings_removed =', ratings_removed)

rating_popular_book.tail()

Now let's check and remove any duplicate rows

In [None]:
duplicate_rows = rating_popular_book.duplicated(['User-ID', 'Book-Title'])

# Count the number of duplicate rows
num_duplicates = duplicate_rows.sum()

print("Number of duplicate rows:", num_duplicates)
print(rating_popular_book.shape)

In [None]:
# drop_duplicates is a no-op when there are no duplicates, so no guard is needed
rating_popular_book = rating_popular_book.drop_duplicates(['User-ID', 'Book-Title'])
print(rating_popular_book.shape)

#### And Now we have our Matrix!

In [None]:
user_rating_matrix = rating_popular_book.pivot(index = 'Book-Title',columns = 'User-ID', values = 'Book-Rating').fillna(0)
user_rating_matrix

In [None]:
total_elements = user_rating_matrix.size
num_zero_elements = np.count_nonzero(user_rating_matrix == 0)
sparsity = num_zero_elements / total_elements
sparsity

This sparsity is barely acceptable. We could increase our threshold, but that would come at the cost of ignoring less popular books and therefore shrinking our book set.

### K nearest neighbours

In [None]:
model_knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute')
model_knn.fit(user_rating_matrix)
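The fitted `model_knn` is not queried in the cells that follow, where recommendations come from the SVD correlations instead. As a minimal sketch of how a fitted `NearestNeighbors` model could be queried with `kneighbors`, the example below uses a toy book-by-user matrix; the titles and ratings are invented for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy book x user matrix (hypothetical ratings): one row per book
titles = ["Book A", "Book B", "Book C", "Book D"]
X_toy = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

knn = NearestNeighbors(metric="cosine", algorithm="brute")
knn.fit(X_toy)

# Ask for 3 neighbours of "Book A"; the nearest one is the book itself
distances, indices = knn.kneighbors(X_toy[[0]], n_neighbors=3)

# Skip the first hit (the query book) and map row indices back to titles
neighbours = [titles[i] for i in indices[0][1:]]
print(neighbours)
```

In the notebook the equivalent query would pass a row of `user_rating_matrix` and look titles up in `user_rating_matrix.index`.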

Training the model: reduce the matrix with truncated SVD, then correlate the reduced book vectors

In [None]:
X = user_rating_matrix.values
X.shape

In [None]:
SVD = TruncatedSVD(n_components=12, random_state=123)
matrix = SVD.fit_transform(X)
print(matrix.shape)
matrix

In [None]:
corr = np.corrcoef(matrix)
corr.shape
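To make the two steps above concrete, here is the same pipeline on a toy matrix: `TruncatedSVD` compresses each book's rating row into a few components, and `np.corrcoef` then gives a book-by-book correlation matrix. The ratings and the choice of 3 components are assumptions made for the sketch.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Toy book x user matrix (hypothetical ratings): 4 books, 6 users
X_toy = np.array([
    [5, 4, 0, 0, 1, 0],
    [4, 5, 0, 0, 0, 2],
    [0, 0, 5, 4, 0, 1],
    [0, 1, 4, 5, 2, 0],
], dtype=float)

# Compress each book's ratings into 3 latent components
svd = TruncatedSVD(n_components=3, random_state=123)
reduced = svd.fit_transform(X_toy)

# Rows are books, so corrcoef returns one correlation per pair of books
corr_toy = np.corrcoef(reduced)
print(reduced.shape, corr_toy.shape)
```

Each correlation is computed across the components only, so with very few components the values become coarse; with exactly two components every pair would correlate at exactly plus or minus one.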

Testing the model

In [None]:
book_titles = user_rating_matrix.index
book_list = list(book_titles)
matrix_index = book_list.index("Harry Potter and the Prisoner of Azkaban (Book 3)")
book_row  = corr[matrix_index]
list(book_titles[(book_row<1.0) & (book_row>0.9)])

Nice, similar books are being recommended.

## Implement Recommendation System

So now we have a model that can recommend books based on a title. What we need to do next is group by demographic and make a list of books to recommend. This list will include all the positively reviewed books of people in that area, as well as the similar books the recommender returns when those reviews are fed into it.

In [None]:
demographic_final_data.head()

Since we only want recommended books, let's only consider ratings that are at or above the book's average rating

In [None]:
filtered_demographic = demographic_final_data[demographic_final_data['Book-Rating'] >= demographic_final_data['Avg_Rating']]
filtered_demographic

In [None]:
filtered_demographic.groupby('demographic').size()

In [None]:
def book_recommendations(demographic_data, corr_matrix, book_titles, country="Na", state="Na", city="Na", agegroup="Na") -> list:
    recommended_books = []
    book_list = list(book_titles)
    # Demographic keys follow the 'country-state-city-age' pattern built earlier
    target = f"{country}-{state}-{city}-{agegroup}"

    target_reviews = demographic_data[demographic_data['demographic'] == target]
    for index, row in target_reviews.iterrows():
        # Always include the book this demographic rated well
        recommended_books.append(row['Book-Title'])
        if row['Book-Title'] in book_list:
            matrix_index = book_list.index(row['Book-Title'])
            book_row = corr_matrix[matrix_index]
            # Strongly correlated titles, excluding exact 1.0 matches such as the book itself
            related_books = list(book_titles[(book_row < 1.0) & (book_row > 0.9)])
            # Extend once: looping over related_books here added the whole list once per element
            recommended_books.extend(related_books)

    return recommended_books


In [None]:
recommended_books = book_recommendations(filtered_demographic,corr,user_rating_matrix.index,country = 'bangladesh')

In [None]:
print(len(recommended_books))
recommended_books
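The returned list can contain the same title several times, since a book may be rated by several people in the demographic and also be related to several other books. One possible post-processing step, sketched here on an invented list, is to de-duplicate while keeping order, or to rank titles by how often they were suggested.

```python
from collections import Counter

# Toy recommendation list with repeats (hypothetical titles)
recommended = ["Book A", "Book B", "Book A", "Book C", "Book A"]

# dict.fromkeys keeps the first occurrence of each title, in order
unique_recommended = list(dict.fromkeys(recommended))

# Counter ranks titles by how many times they were suggested
ranked = Counter(recommended).most_common()

print(unique_recommended)
print(ranked[0])
```

Ranking by frequency gives a bookstore a natural ordering, under the assumption that titles suggested more often are the safer ones to stock.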

## Accuracy Testing via LOOCV

To test the precision of our model, we will use a leave-one-out style check: we iterate through our users with more than 50 ratings, hold out one target book, and see whether the user's other books lead us to recommend it.

In [None]:
def test_recommendation(merged_data, corr_matrix, book_titles, user_ID):
    """Return 1 if the held-out book is related to any of the user's other books, else 0."""
    book_list = list(book_titles)
    user_reviews = merged_data[merged_data['User-ID'] == user_ID]
    user_books = user_reviews['Book-Title'].tolist()
    if not user_books:
        return 0  # no ratings for this user, so there is nothing to hold out

    target_book = user_books[0]  # hold out the first book; the choice is arbitrary
    test_books = user_books[1:]

    for book in test_books:
        if book in book_list:
            matrix_index = book_list.index(book)
            book_row = corr_matrix[matrix_index]
            related_books = list(book_titles[(book_row < 1.0) & (book_row > 0.7)])
            if target_book in related_books:
                return 1

    return 0
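The function above holds out a single book per user. A fuller leave-one-out variant would hold out each of the user's books in turn and report the fraction recovered. The sketch below is one way to do that under stated assumptions: the function name `leave_one_out_hit_rate`, the toy correlation matrix, and the titles are all hypothetical, and the 0.7 threshold simply mirrors the one used above.

```python
import numpy as np

def leave_one_out_hit_rate(user_books, corr_matrix, titles, threshold=0.7):
    """Fraction of a user's books that are related to at least one of their other books."""
    index_of = {title: i for i, title in enumerate(titles)}
    # Only books present in the correlation matrix can be tested
    known = [b for b in user_books if b in index_of]
    if len(known) < 2:
        return float("nan")

    hits = 0
    for target in known:
        others = [b for b in known if b != target]
        t = index_of[target]
        # Hit if any other book correlates with the held-out one above the threshold
        if any(threshold < corr_matrix[index_of[o], t] < 1.0 for o in others):
            hits += 1
    return hits / len(known)

# Toy correlation matrix (hypothetical values) for three titles
toy_titles = ["A", "B", "C"]
toy_corr = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])
rate = leave_one_out_hit_rate(["A", "B", "C"], toy_corr, toy_titles)
print(rate)
```

Averaging over every held-out book removes the dependence on which book happens to come first in a user's rows.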

In [None]:
Final_Dataset.head()

In [None]:
user_ratings_count = Final_Dataset.groupby('User-ID').size()
users_with_more_than_50_ratings = user_ratings_count[user_ratings_count > 50].index
test_targets = Final_Dataset[Final_Dataset['User-ID'].isin(users_with_more_than_50_ratings)]
test_targets.sort_values(by='User-ID')


In [None]:
test_ids = test_targets['User-ID'].tolist()

user_id_set = set(test_ids)
user_id_set

In [None]:
test_recommendation(Final_Dataset, corr, user_rating_matrix.index, 274301)

In [None]:
# test_recommendation already returns 0 or 1, so collect the results directly
results = [
    test_recommendation(Final_Dataset, corr, user_rating_matrix.index, user_id)
    for user_id in user_id_set
]



In [None]:
results_array = np.array(results)
results_array.mean()


In [None]:
# Minimum-rating thresholds to sweep: 10, 20, ..., 300
review_count_index = list(range(10, 301, 10))
accuracy = []

# The per-user rating counts do not change between thresholds, so compute them once
user_ratings_count = Final_Dataset.groupby('User-ID').size()

for min_ratings in review_count_index:
    # Users with more than min_ratings ratings
    user_id_set = set(user_ratings_count[user_ratings_count > min_ratings].index)
    results = [
        test_recommendation(Final_Dataset, corr, user_rating_matrix.index, user_id)
        for user_id in user_id_set
    ]
    accuracy.append(np.array(results).mean())

In [None]:
plt.plot(review_count_index, accuracy, marker='o', linestyle='-')
plt.title('Accuracy vs. Review Count')
plt.xlabel('Minimum Ratings per User')
plt.ylabel('Accuracy')
plt.grid(True)
plt.show()