<h1 style="text-align:center;">CREATING A BOOK RECOMMENDATION SYSTEM</h1>


<h2 style="text-align:center;">BUSINESS UNDERSTANDING  </h2>

<h3 style="text-align:left;">Project Overview  </h3>

Recommendation systems use machine learning algorithms to suggest items that are useful to users, based on their behaviour, habits, or other user data.
A Book Recommendation System is a machine learning-based solution designed to suggest books to users based on their preferences and behavior. Recommendation systems enhance user engagement, drive sales, and improve customer satisfaction by providing personalized suggestions. This project aims to use advanced machine learning tools to develop a book recommendation system tailored to customers' individual needs and preferences, helping them choose what to read from the large number of books available electronically. The data used is from [Kaggle](https://www.kaggle.com/datasets/somnambwl/bookcrossing-dataset/data).

There are two primary approaches to recommendation systems:

1. Collaborative Filtering – Uses user behavior and preferences to make recommendations.

2. Content-Based Filtering – Suggests books based on the features and attributes of previously liked books.

<h3 style="text-align:left;">Business Problem</h3>

As mentioned earlier, the growth of online and e-commerce services has made it harder for customers to find books that match their preferences. The goal is to develop a Book Recommendation System that offers personalized book suggestions, improving user experience and engagement.

<h3 style="text-align:left;">Project Objectives</h3>

Develop a recommendation system that provides tailored book suggestions.

Increase book sales by recommending books users are most likely to purchase.

Enhance customer retention by offering relevant recommendations.

Improve user engagement by making book discovery easier.

<h3 style="text-align:left;">Key Analysis Questions:</h3>

Which authors consistently receive high ratings?

Does the year of publication influence book ratings?

How accurate are the collaborative filtering recommendations?

How does class imbalance in ratings affect recommendation performance?

<h3 style="text-align:left;">Data Source</h3>

The dataset is obtained from Kaggle and consists of three files:

Books.csv – Includes details such as ISBN, title, author, year, and publisher.

Ratings.csv – Contains user ratings (0 to 10) linked to books via ISBN and user ID.

Users.csv – Provides user demographics, including age.

<h3 style="text-align:left;">Stakeholders</h3>

Customers – Expect personalized and accurate book recommendations.

Marketing Team – Uses insights for targeted promotions and advertising.

Data Scientists – Focus on optimizing the accuracy and scalability of the model.

Book Authors – Gain insights into audience preferences and book popularity.

Executives (CEO) – Assess the business impact of the recommendation system on revenue and customer retention.

<h3 style="text-align:left;">Methodology</h3>

The project follows the CRISP-DM framework:

Business Understanding – Define project goals and objectives.

Data Understanding – Explore and assess data quality.

Data Preparation – Clean and preprocess data.

Modeling – Implement recommendation models.

Evaluation – Assess model performance using relevant metrics.

<h3 style="text-align:left;">Data Understanding</h3>

A thorough exploration of the dataset ensures high-quality inputs for model training.

<h4 style="text-align:left;">A. Dataset Overview</h4>

Books Dataset – Provides ISBN, title, author, publisher, and publication year.

Users Dataset – Contains user demographics, including age.

Ratings Dataset – Includes user ratings for books (0-10 scale).

<h4 style="text-align:left;">B. Data Merging</h4>

Merged Ratings.csv with Users.csv using User-ID.

Further merged with Books.csv using ISBN to form a comprehensive dataset.
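
The two merge steps can be sketched with pandas as follows. This is a minimal illustration on tiny made-up frames, not the project's actual merge code: the sample rows, the `Title` and `Author` values, and the choice of a left join (`how="left"`) are all assumptions.

```python
import pandas as pd

# Tiny hypothetical stand-ins for Ratings.csv, Users.csv and Books.csv.
ratings = pd.DataFrame({
    "User-ID": [1, 1, 2],
    "ISBN": ["0155061224", "052165615X", "0155061224"],
    "Rating": [5, 3, 8],
})
users = pd.DataFrame({"User-ID": [1, 2], "Age": [18.0, None]})
books = pd.DataFrame({
    "ISBN": ["0155061224", "052165615X"],
    "Title": ["Book A", "Book B"],
    "Author": ["Author A", "Author B"],
})

# Step 1: attach user demographics to each rating via User-ID.
# A left join (an assumption) keeps every rating even if the user is unknown.
ratings_users = ratings.merge(users, on="User-ID", how="left")

# Step 2: attach book details via ISBN to form the combined dataset.
merged = ratings_users.merge(books, on="ISBN", how="left")

print(merged.shape)
```

A left join keeps every rating row; an inner join would instead silently discard ratings whose ISBN or User-ID has no match, which may or may not be what the analysis wants.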

<h4 style="text-align:left;">C. Key Insights</h4>

Outliers – Unreasonable age values need capping or removal.

Missing Values –

Age column has ~27% missing values; will be imputed using the median.

Book-Author and Publisher have a negligible number of missing values; the affected rows will be dropped.

Image URLs will be dropped as they are irrelevant to analysis.
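
The cleaning steps listed above could look roughly like the sketch below. It runs on a small made-up frame; the plausible age range (5 to 100), the column name `Image-URL-S`, and the choice to treat out-of-range ages as missing before imputing are assumptions made for illustration, not decisions stated in this project.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the kinds of problems described above.
df = pd.DataFrame({
    "Age": [18.0, np.nan, 244.0, 35.0, 2.0, np.nan],
    "Author": ["A", "B", None, "C", "D", "E"],
    "Publisher": ["P1", "P2", "P3", "P4", "P5", "P6"],
    "Image-URL-S": ["u1", "u2", "u3", "u4", "u5", "u6"],
})

# Outliers: treat ages outside an assumed plausible range as missing.
MIN_AGE, MAX_AGE = 5, 100
df.loc[(df["Age"] < MIN_AGE) | (df["Age"] > MAX_AGE), "Age"] = np.nan

# Missing ages: impute with the median of the remaining valid ages.
median_age = df["Age"].median()
df["Age"] = df["Age"].fillna(median_age)

# Missing author/publisher: drop the few affected rows.
df = df.dropna(subset=["Author", "Publisher"])

# Image URL columns carry no analytical value, so drop them.
df = df.drop(columns=[c for c in df.columns if c.startswith("Image-URL")])

print(df)
```

The median is used rather than the mean because it is less sensitive to any extreme ages that survive the range check.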

<h4 style="text-align:left;">D. Data Analysis</h4>

i) Univariate Analysis

Book Ratings – Understand rating distribution and trends.

User Ages – Identify major reading demographics.

ii) Bivariate Analysis

Age vs. Ratings – Understand rating preferences across age groups.

Authors vs. Ratings – Identify consistently high-rated authors.

Publisher vs. Ratings – Analyze publisher influence on book ratings.

Publication Year vs. Ratings – Determine if newer books receive better ratings.
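
As a rough sketch of the bivariate analysis, the author and publication-year comparisons can be expressed as grouped aggregations. The sample data and the minimum-ratings threshold `MIN_RATINGS` are hypothetical; the threshold is there so that an author with a single high rating is not counted as "consistently" high-rated.

```python
import pandas as pd

# Hypothetical merged dataset with one row per rating.
df = pd.DataFrame({
    "Author": ["A", "A", "A", "B", "B", "C"],
    "Year": [1999, 1999, 2001, 2001, 2002, 2002],
    "Rating": [8, 9, 10, 5, 7, 10],
})

# Authors vs. Ratings: mean rating and number of ratings per author.
MIN_RATINGS = 2  # assumed cut-off for "enough" ratings
author_stats = df.groupby("Author")["Rating"].agg(
    mean_rating="mean", n_ratings="count"
)
author_stats = author_stats[author_stats["n_ratings"] >= MIN_RATINGS]
author_stats = author_stats.sort_values("mean_rating", ascending=False)

# Publication Year vs. Ratings: mean rating per publication year.
year_stats = df.groupby("Year")["Rating"].mean()

print(author_stats)
print(year_stats)
```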


<h3 style="text-align:left;">Modeling</h3>

We will develop and evaluate two types of collaborative filtering models:

User-User Filtering – Recommends books based on user similarity.

Item-Item Filtering – Suggests books similar to previously liked books.

Techniques Used: SVD, Cosine Similarity.

Challenge:

Struggles with new users or books lacking sufficient interaction history (cold start problem).
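
To make item-item filtering with cosine similarity concrete, here is a minimal sketch on a toy user-by-book matrix. It uses plain NumPy so it stays self-contained; the matrix, the book names, and the helper functions `item_cosine_similarity` and `most_similar` are hypothetical and are not the project's implementation.

```python
import numpy as np

# Rows are users, columns are books; 0 means "no rating" in this toy matrix.
ratings = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 0.0, 5.0],
])
books = ["Book A", "Book B", "Book C"]


def item_cosine_similarity(matrix):
    """Return the book-by-book cosine similarity of the rating columns."""
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for unrated books
    normalized = matrix / norms
    return normalized.T @ normalized


sim = item_cosine_similarity(ratings)


def most_similar(book, k=1):
    """Return the k books most similar to `book`, excluding itself."""
    idx = books.index(book)
    order = np.argsort(-sim[idx])
    return [books[j] for j in order if j != idx][:k]


print(most_similar("Book A"))
```

The cold start problem is visible even here: a book that nobody has rated has an all-zero column, so its similarity to every other book is 0 and it can never be recommended by this method.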




<h3 style="text-align:left;">Modeling Evaluation</h3>

For evaluation, we use:

Precision – Measures how many recommended books are relevant.

Recall – Measures how many of the relevant books are actually recommended.

F1-Score – Balances precision and recall.

RMSE & MAE – Measure the error between predicted and actual ratings (lower is better).
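
The metrics above can be illustrated with a small, dependency-free sketch. The function names and sample values are hypothetical; in practice the notebook's imported `surprise.accuracy` and `sklearn.metrics` helpers would normally be used instead. Here precision, recall, and F1 are computed over one user's recommendation list, which is one common way (an assumption here) to apply them to recommenders.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted ratings."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def mae(actual, predicted):
    """Mean absolute error between actual and predicted ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def precision_recall_f1(recommended, relevant):
    """Precision, recall and F1 for one recommendation list."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical ratings and recommendations for a single user.
actual = [8, 6, 10, 4]
predicted = [7, 6, 8, 5]
recommended = ["b1", "b2", "b3", "b4"]
relevant = ["b1", "b3", "b5"]

print(rmse(actual, predicted), mae(actual, predicted))
print(precision_recall_f1(recommended, relevant))
```

In this example two of the four recommended books are relevant, so precision is 0.5, while two of the three relevant books were found, so recall is 2/3.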





<h3 style="text-align:left;">Expected Results</h3>

The primary focus is on precision, ensuring highly relevant recommendations that improve user satisfaction. The goal is to achieve a precision score of at least 75%, ensuring accurate and personalized book suggestions.

<h2 style="text-align:center;"> DATA PREPARATION </h2>

In this section, the data is prepared for analysis: it is loaded and inspected, visualized, cleaned, and enriched through feature engineering. All libraries relevant to the project are also imported at this point.

We start by importing the relevant libraries.

In [14]:
#Import relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
from surprise import accuracy
from surprise.model_selection import GridSearchCV
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity



We then load the three datasets.

In [28]:
#Load 'Books.csv' dataset
Books = pd.read_csv('Books.csv', delimiter=";", low_memory=False)
Books.head()

Unnamed: 0,ISBN,Title,Author,Year,Publisher
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


In [29]:
#Load 'Ratings.csv' dataset
Ratings = pd.read_csv('Ratings.csv', delimiter=";", low_memory=False)
Ratings.head()



Unnamed: 0,User-ID,ISBN,Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [33]:
#Load 'Users.csv' dataset
Users = pd.read_csv('Users.csv', delimiter=";", low_memory=False)
Users.head()

Unnamed: 0,User-ID,Age
0,1,
1,2,18.0
2,3,
3,4,17.0
4,5,
