# Books recommender system | Collaborative based

<h1 style="text-align:center;background-color:powderblue;"> I. Introduction </h1>

<h2 style="color:blue;">A. Definition of Recommender Systems</h2>

Recommender systems, often referred to as recommendation systems, are a class of software applications and algorithms designed to provide personalized suggestions and recommendations to users. These suggestions are based on the user's historical interactions, preferences, or behavior within a specific platform or system. Recommender systems have become an integral part of our digital lives, shaping the way we discover products, content, and services.

Recommender systems employ a variety of techniques, including collaborative filtering, content-based filtering, and hybrid methods. These algorithms analyze vast amounts of data to predict what items or content a user may be interested in, thereby enhancing the user experience and driving engagement.


<h2 style="color:blue;">B. Importance of Recommender Systems</h2>

The significance of recommender systems in today's digital landscape cannot be overstated. They play a pivotal role in a wide range of industries and applications, offering several key benefits:

1. Enhanced User Experience
Recommender systems enhance user experiences by providing tailored recommendations that match individual preferences. Whether it's suggesting movies on a streaming platform, products on an e-commerce site, or connections on social media, these systems make users' interactions more enjoyable and relevant.
2. Increased Engagement
By surfacing content or products that align with users' interests, recommender systems keep users engaged and coming back for more. Higher engagement often leads to increased usage and loyalty.
3. Boosted Sales and Conversions
E-commerce platforms rely on recommender systems to boost sales and conversions. By recommending products similar to what users have viewed or purchased, online retailers can increase cross-selling and upselling opportunities.
4. Content Discovery
In the realm of media and content consumption, recommender systems help users discover new music, movies, articles, or books. These systems introduce users to content they might not have found otherwise, broadening their horizons.
5. Personalization
Personalization is at the core of recommender systems. Users receive recommendations that reflect their tastes, preferences, and behaviors. This level of personalization not only benefits users but also drives business results.


<h2 style="color:blue;">C. Common Applications</h2>

Recommender systems are versatile and have found applications in various domains. Some of the most common applications include:

1. E-commerce
Online retailers like Amazon use recommender systems to suggest products related to users' browsing and purchase history, encouraging additional sales.

2. Streaming Services
Platforms such as Netflix and Spotify use recommender systems to recommend movies, TV shows, or music tracks based on users' viewing or listening history.

3. Social Media
Social networks employ recommendation algorithms to connect users with potential friends, groups, and content, enhancing the overall social experience.

4. News and Content Aggregation
News websites and content aggregators recommend articles, news stories, or blog posts that match a user's reading history or interests.

5. Dating Apps
Dating apps like Tinder use recommender systems to suggest potential matches based on user profiles and interactions.

6. Job Search and Recruitment
Job search websites recommend job listings that match a user's skills, experience, and job preferences.

Recommender systems have become an indispensable part of modern technology, powering everything from online shopping to content discovery. They continue to evolve and adapt to meet the changing needs and expectations of users in the digital age.

<h1 style="text-align:center;background-color:powderblue;"> II. Types of Recommender Systems </h1>

Recommender systems are a diverse field, and different techniques are used to provide personalized recommendations to users. In this section, we will explore the primary types of recommender systems and their applications.


<h2 style="color:blue;">A. Content-Based Recommender Systems</h2>

<h3 style="color:brown;">1. Explanation of Content-Based Filtering</h3>

Content-based recommender systems make recommendations based on the content attributes of items and a user profile. These systems analyze item characteristics and match them with a user's profile to generate recommendations. For example, in a movie recommendation system, content-based filtering might consider movie genres, actors, directors, and user preferences.
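The matching step described above can be sketched as follows. This is a minimal illustration rather than a production method: the genre flags, the `liked` list, and the choice of cosine similarity between a mean "user profile" vector and each item are all assumptions made for the example.

```python
import numpy as np

# Made-up item attributes: one row per movie, columns = [action, comedy, drama].
item_features = np.array([
    [1, 0, 0],  # item 0: action
    [1, 1, 0],  # item 1: action-comedy
    [0, 0, 1],  # item 2: drama
    [0, 1, 0],  # item 3: comedy
], dtype=float)

# Hypothetical user history: the user liked items 0 and 1.
liked = [0, 1]

# A simple user profile: the average attribute vector of the liked items.
profile = item_features[liked].mean(axis=0)


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


# Score every item against the profile, then hide items the user already liked.
scores = np.array([cosine(profile, features) for features in item_features])
scores[liked] = -np.inf
best_item = int(np.argmax(scores))
print(best_item)
```

Because the profile leans toward action and comedy, the unseen comedy scores higher than the drama.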


<h3 style="color:brown;">2. Pros and Cons</h3>

**Pros:**

* Recommendations are personalized and reflect a user's interests.
* Effective for new or less popular items, since recommendations don't depend on other users' behavior.
* Less susceptible to the "cold start" problem for new users.


**Cons:**

* Limited in discovering novel or unexpected items.
* Recommendations are based on existing preferences and may not encourage exploration.

<h2 style="color:blue;">B. Collaborative Filtering</h2>

<h3 style="color:brown;">1. Explanation of Collaborative Filtering</h3>
Collaborative filtering is based on the idea that users who have agreed on certain issues in the past will agree on other issues in the future. It's one of the most popular recommendation techniques.

<h3 style="color:brown;">2. User-Based Collaborative Filtering</h3>

In user-based collaborative filtering, recommendations are made by identifying users with similar preferences. If User A and User B have rated or interacted with similar items in the past, recommendations for User A are based on what User B has liked.

<h3 style="color:brown;">3. Item-Based Collaborative Filtering</h3>

Item-based collaborative filtering focuses on the relationships between items. If a user has shown a preference for one item, similar items can be recommended based on the preferences of other users who liked the first item.
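The difference between the user-based and item-based views can be sketched on a toy rating matrix. The ratings below are invented, and cosine similarity is one common choice of similarity measure assumed here for illustration. The same function serves both views: similarities between rows compare users, and similarities between columns compare items.

```python
import numpy as np

# Toy user-item matrix (rows = users, columns = items); 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
], dtype=float)


def cosine_similarity_matrix(m):
    """Pairwise cosine similarity between the rows of m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty rows
    unit = m / norms
    return unit @ unit.T


def nearest_other(sim, i):
    """Index of the row most similar to row i, excluding i itself."""
    row = sim[i].copy()
    row[i] = -np.inf
    return int(np.argmax(row))


user_sim = cosine_similarity_matrix(ratings)    # user-based view
item_sim = cosine_similarity_matrix(ratings.T)  # item-based view

print(nearest_other(user_sim, 0))  # user most similar to user 0
print(nearest_other(item_sim, 0))  # item most similar to item 0
```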

<h3 style="color:brown;">4. Pros and Cons</h3>

**Pros:**

* Effective in identifying unexpected recommendations.
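* Requires no item metadata, since it works from interaction data alone. Very sparse interaction data, however, makes similarity estimates less reliable.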

**Cons:**

* Vulnerable to the "cold start" problem for new users and new items.
* May produce biased results if the user's historical interactions are limited.

<h2 style="color:blue;">C. Hybrid Recommender Systems</h2>

<h3 style="color:brown;">1. Explanation of Hybrid Recommender Systems</h3>

Hybrid recommender systems combine multiple recommendation techniques to improve recommendation accuracy and mitigate the limitations of individual methods.

<h3 style="color:brown;">2. Combining Content-Based and Collaborative Filtering</h3>

In hybrid systems, content-based and collaborative filtering methods are often combined. For example, content-based filtering can provide initial recommendations, and collaborative filtering can then refine them as user interactions accumulate.

<h3 style="color:brown;">3. Pros and Cons</h3>

**Pros:**

* Improved recommendation accuracy and robustness.
* Can address cold start problems more effectively.

**Cons:**

* Complex to implement and maintain.

<h2 style="color:blue;">D. Other Recommender System Techniques</h2>

<h3 style="color:brown;">1. Matrix Factorization</h3>

Matrix factorization techniques decompose the user-item interaction matrix into lower-dimensional matrices, making it possible to uncover latent factors that influence user preferences.
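As a rough illustration of the decomposition, the sketch below factorizes a toy matrix with a truncated SVD from NumPy. This is a simplification: plain SVD treats the zeros as real ratings, whereas practical recommenders usually fit the factors to observed entries only. The matrix values and the choice of `k = 2` latent factors are assumptions for the example.

```python
import numpy as np

# Toy user-item matrix (rows = users, columns = items); values are made up.
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
], dtype=float)

k = 2  # number of latent factors to keep (illustrative choice)

# Full SVD, then keep only the k largest singular values.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
user_factors = U[:, :k] * s[:k]  # shape (n_users, k)
item_factors = Vt[:k, :].T       # shape (n_items, k)

# Low-rank reconstruction: each entry is a dot product of two factor vectors.
R_hat = user_factors @ item_factors.T
print(np.round(R_hat, 2))
```

Each user and each item is now described by `k` numbers, and a predicted score is the dot product of the corresponding factor vectors.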

<h3 style="color:brown;">2. Deep Learning-Based Recommender Systems</h3>

Deep learning models, such as neural collaborative filtering, leverage neural networks to capture complex patterns in user behavior and item characteristics.

<h3 style="color:brown;">3. Context-Aware Recommender Systems</h3>

Context-aware recommender systems take into account contextual information, such as user location, time, and device, to make recommendations more relevant.

<h3 style="color:brown;">4. Knowledge-Based Recommender Systems</h3>

Knowledge-based systems use explicit knowledge about items, users, and preferences to make recommendations. These systems are often used in areas where domain knowledge is critical, such as healthcare and education.




<h1 style="text-align:center;background-color:powderblue;"> III. How Recommender Systems Work</h1>

Recommender systems are complex algorithms that involve several stages to provide personalized recommendations. In this section, we'll explore the initial steps of how recommender systems work, from data collection to user-item interaction representation.

<h2 style="color:blue;">A. Data Collection and Preprocessing</h2>

<h3 style="color:brown;">1. Data Collection</h3>

The first crucial step in building a recommender system is data collection. Data sources may include:

* **User interactions:** These can be in the form of user ratings, likes, clicks, or purchases. For example, an e-commerce platform collects data on products users have viewed and bought.

* **User profiles:** Information about users, such as demographics, preferences, and historical behaviors, is valuable for understanding user preferences. Social media platforms, for instance, gather user profile data.

* **Item attributes:** Information about items, such as product descriptions, genres, or tags, is necessary for content-based filtering. Movie recommendation systems, for example, require details about the movies, including genre and director.

<h3 style="color:brown;">2. Data Preprocessing</h3>

Once the data is collected, it undergoes preprocessing to ensure it's in a suitable format for recommendation algorithms. This preprocessing includes:

* **Data cleaning:** Handling missing values, outliers, and errors in the dataset to improve data quality.

* **Data transformation:** Converting and encoding data into a usable format. Categorical variables may be converted into numerical representations.

* **Data reduction:** Reducing dimensionality to make computations more efficient. Techniques like principal component analysis (PCA) may be applied.

<h2 style="color:blue;">B. User-Item Interaction Representation</h2>

Recommender systems rely on a matrix that represents the interactions between users and items. This matrix typically has users as rows and items as columns, and its entries represent user interactions with items. Here are key elements of this representation:

* **User-Item Matrix:** This matrix is often sparse because users don't interact with all available items. Each cell of the matrix represents a user's interaction with an item, such as a rating or purchase.

* **Sparse Data Handling:** Since most entries in the matrix are missing (indicating no interaction), recommender systems need techniques to handle sparse data efficiently.

* **User and Item Indices:** The matrix includes user and item indices that facilitate mapping between users, items, and matrix indices.

* **Normalization:** Data normalization may be applied to the matrix to account for user biases and make recommendations more accurate.

This user-item interaction matrix serves as the foundation for various recommendation algorithms, including collaborative filtering and content-based filtering.
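The elements listed above can be sketched with pandas on a tiny interaction log. The column names and values are invented for the example, and subtracting each user's mean rating is only one possible normalization.

```python
import pandas as pd

# Hypothetical interaction log: one row per (user, item, rating) event.
interactions = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "item_id": ["a", "b", "a", "c", "b"],
    "rating":  [5, 3, 4, 2, 1],
})

# User-item matrix: users as rows, items as columns, NaN where no interaction exists.
user_item = interactions.pivot_table(index="user_id", columns="item_id", values="rating")

# Sparsity: the share of cells that have no interaction.
sparsity = user_item.isna().to_numpy().mean()

# Normalization: subtract each user's mean rating to offset individual rating bias.
centered = user_item.sub(user_item.mean(axis=1), axis=0)

print(user_item)
print(f"sparsity = {sparsity:.2f}")
print(centered)
```

The row and column labels of `user_item` act as the user and item indices, so a matrix position can always be mapped back to a user and an item.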


<h1 style="text-align:center;background-color:powderblue;"> IV. Challenges and Considerations</h1>

Recommender systems are powerful tools for enhancing user experiences, but they also come with a set of challenges and considerations that need to be addressed. In this section, we'll explore some of the key challenges and considerations in building and deploying recommender systems.

<h2 style="color:blue;">A. Cold Start Problem</h2>

The "cold start" problem refers to the challenge of making recommendations for new users or items. New users have limited or no interaction history, making it difficult to provide personalized recommendations. Similarly, new items lack historical data, making it challenging to understand their attributes and how they relate to user preferences.

**Solutions:**

* For new users, content-based recommendations can be a starting point, relying on item attributes and user demographics.

* For new items, hybrid recommender systems can combine content-based and collaborative filtering to provide initial recommendations.

<h2 style="color:blue;">B. Scalability</h2>

Scalability is a significant concern for recommender systems, especially in platforms with a large user base and extensive item catalog. As the number of users and items grows, the computational demands of recommendation algorithms increase significantly.

**Solutions:**

* Efficient algorithms and data structures can be implemented to speed up recommendation generation.

* Distributed computing and cloud-based solutions can be used to handle large datasets and user loads.

<h2 style="color:blue;">C. Data Privacy and Security</h2>

Recommender systems require access to user data and preferences, raising concerns about data privacy and security. Users may be apprehensive about sharing personal information, and data breaches can have severe consequences.

**Solutions:**

* Anonymization and data aggregation can be used to protect user identities while still providing relevant recommendations.

* Compliance with data protection regulations, such as GDPR, is essential to ensure data privacy.

<h2 style="color:blue;">D. Diversity and Serendipity</h2>

Recommender systems often face the challenge of providing diverse and serendipitous recommendations. Over-recommending popular items or content can lead to user homogenization and lack of exploration.

**Solutions:**

* Algorithms can incorporate diversity-aware components that encourage the recommendation of less popular or related but unexpected items.

* Serendipity can be promoted by incorporating randomization or novelty factors into recommendations.

<h2 style="color:blue;">E. Explainability</h2>

The lack of transparency in some recommendation algorithms can be a concern. Users may want to understand why a particular item was recommended to them, especially in critical domains like healthcare or finance.

**Solutions:**

* Explainable AI (XAI) techniques can be integrated into recommender systems to provide explanations for recommendations.

* Offering users control over recommendation parameters and filters can enhance transparency.

<h1 style="text-align:center;background-color:powderblue;"> V. Future Trends and Developments</h1>

The field of recommender systems is dynamic and continuously evolving to meet the changing needs of users and businesses. In this section, we'll explore some of the emerging trends and developments that are shaping the future of recommender systems.

<h2 style="color:blue;">A. Personalization</h2>

Personalization remains a key focus in the future of recommender systems. As user expectations for tailored recommendations continue to rise, recommender systems are expected to become even more sophisticated in understanding and predicting individual preferences. Key trends in personalization include:

* **Hyper-Personalization:** Recommender systems will move beyond general preferences to consider users' current contexts and needs, such as location, time of day, and device.

* **Privacy-Preserving Personalization:** With growing concerns about data privacy, recommender systems will adopt privacy-preserving techniques that enable personalization while respecting user privacy.

<h2 style="color:blue;">B. AI and Deep Learning</h2>

Artificial intelligence (AI) and deep learning are expected to play an increasingly significant role in the future of recommender systems. These technologies offer the potential to model complex user behaviors and item characteristics more accurately. Future trends in this area include:

* **Neural Recommender Systems:** The use of neural networks and deep learning will become more prevalent in capturing intricate user-item interactions.

* **Explainable AI (XAI):** As AI-driven recommender systems become more complex, the need for transparent and explainable AI will grow. XAI techniques will be integrated to provide users with insight into recommendation rationale.

<h2 style="color:blue;">C. Ethics and Fairness</h2>

Ethical considerations in recommender systems are gaining prominence. Future developments will prioritize fairness, transparency, and accountability. Key trends include:

* **Algorithmic Fairness:** To mitigate biases in recommendations, recommender systems will incorporate fairness-aware algorithms and perform regular audits for bias detection.

* **User Control:** Users will have more control over their recommendations, enabling them to customize and fine-tune their preferences.

<h2 style="color:blue;">D. Cross-Domain Recommendations</h2>

Cross-domain recommendations involve extending recommendations beyond a single platform or domain. This trend will enable users to receive personalized recommendations that span different areas of interest. Future developments include:

* **Cross-Platform Recommendations:** Recommender systems will provide recommendations that bridge various digital platforms, allowing users to seamlessly discover content or products across different services.

* **Multimodal Recommendations:** Systems will leverage different types of data, such as text, images, and audio, to provide richer and more diverse recommendations.

<h2 style="color:blue;">E. Reinforcement Learning in Recommender Systems</h2>

Reinforcement learning, a subset of machine learning, will play a more significant role in improving the recommendations made by these systems. Future trends include:

* **Exploration-Exploitation Balance:** Recommender systems will employ reinforcement learning to strike a balance between exploring new items and exploiting known preferences.

* **Dynamic Personalization:** These systems will adapt in real-time to users' evolving preferences and changing contexts.
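The exploration-exploitation balance can be illustrated with an epsilon-greedy rule, a simple bandit-style strategy that is far short of a full reinforcement learning system. The function name, the score list, and the value of `epsilon` are assumptions chosen for the sketch.

```python
import random


def epsilon_greedy(estimated_scores, epsilon, rng):
    """Pick an item index: explore at random with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimated_scores))  # explore: any item
    # exploit: the item with the highest estimated score
    return max(range(len(estimated_scores)), key=estimated_scores.__getitem__)


rng = random.Random(0)    # seeded for repeatability
scores = [0.1, 0.7, 0.3]  # made-up estimates of how much the user likes each item

picks = [epsilon_greedy(scores, epsilon=0.2, rng=rng) for _ in range(1000)]
share_best = picks.count(1) / len(picks)
print(f"share of picks going to the best-scored item: {share_best:.2f}")
```

With `epsilon=0.2`, most recommendations go to the best-scored item while the rest are spread across the catalog, which gives less-known items a chance to collect feedback.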


In [226]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
import pickle 
import os

## Loading and examining `Books` dataset

In [227]:
# Load the Books dataset
'''
This dataset has some issues, as it's not separated with commas but semicolons.
We use "sep=" parameter to handle this issue. We also skip lines with errors using "on_bad_lines".
'''
books = pd.read_csv('data/BX-Books.csv', sep=";", on_bad_lines='skip', encoding='latin-1')



In [228]:
books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [229]:
# Display the shape of the Books dataset
books.shape

(271360, 8)

In [230]:
# Display the columns of the Books dataset
books.columns

Index(['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher',
       'Image-URL-S', 'Image-URL-M', 'Image-URL-L'],
      dtype='object')

In [231]:
# Remove unnecessary columns 'Image-URL-S' and 'Image-URL-M'
columns_to_drop = ['Image-URL-S', 'Image-URL-M']
books = books.drop(columns=columns_to_drop)

In [232]:
# Rename columns for clarity
books.rename(columns={
    "Book-Title": "title",
    "Book-Author": "author",
    "Year-Of-Publication": "year",
    "Publisher": "publisher",
    "Image-URL-L": "img_url"
}, inplace=True)

In [233]:
books.head(2)

Unnamed: 0,ISBN,title,author,year,publisher,img_url
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...


## Loading and examining `Users` dataset

In [234]:
# Load the Users dataset
users = pd.read_csv('data/BX-Users.csv', sep=";", on_bad_lines='skip', encoding='latin-1')

In [235]:
users.head(2)

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0


In [236]:
# Display the shape of the Users dataset
users.shape  # Note: the row count differs from that of the `books` dataset

(278858, 3)

In [237]:
# Rename columns for clarity
users.rename(columns={
    "User-ID": "user_id",
    "Location": "location",
    "Age": "age"
}, inplace=True)

In [238]:
# Display the first two rows of the Users dataset after renaming
users.head(2)

Unnamed: 0,user_id,location,age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0


## Loading and examining `Book Ratings` dataset

In [239]:
# Load the Book Ratings dataset
ratings = pd.read_csv('data/BX-Book-Ratings.csv', sep=";", on_bad_lines='skip', encoding='latin-1')

In [240]:
# Display the first two rows of the Book Ratings dataset
ratings.head(2)

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5


In [241]:
ratings.shape  # The three DataFrames have different shapes (as expected)

(1149780, 3)

In [242]:
# Rename columns for clarity
ratings.rename(columns={
    "User-ID": "user_id",
    "Book-Rating": "rating"
}, inplace=True)

In [243]:
# Display the first two rows of the Book Ratings dataset after renaming
ratings.head(2)

Unnamed: 0,user_id,ISBN,rating
0,276725,034545104X,0
1,276726,0155061224,5


In [244]:
# Count the number of ratings for each user
ratings['user_id'].value_counts()

user_id
11676     13602
198711     7550
153662     6109
98391      5891
35859      5850
          ...  
116180        1
116166        1
116154        1
116137        1
276723        1
Name: count, Length: 105283, dtype: int64

In [245]:
# Count the number of unique users
ratings['user_id'].unique().shape

(105283,)

## Creating the `Final` dataframe

In [246]:
# Flag users with more than 200 ratings (these are the users to keep, not to drop)
x = ratings['user_id'].value_counts() > 200

In [247]:
# Count the number of users with more than 200 ratings
x.sum()

899

In [248]:
# Get the user IDs with more than 200 ratings
y = x[x].index

In [249]:
# Filter the ratings dataset to include only users with more than 200 ratings
ratings = ratings[ratings['user_id'].isin(y)]

In [250]:
# Display the first two rows of the filtered ratings dataset
ratings.head(2)

Unnamed: 0,user_id,ISBN,rating
1456,277427,002542730X,10
1457,277427,0026217457,0


In [251]:
# Display the shape of the filtered ratings dataset
ratings.shape

(526356, 3)

In [252]:
# Merge the filtered ratings dataset with the Books dataset using the ISBN as the key
ratings_with_books = ratings.merge(books, on="ISBN")

In [253]:
# Display the first few rows of the merged dataset
ratings_with_books.head()

Unnamed: 0,user_id,ISBN,rating,title,author,year,publisher,img_url
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...
2,11676,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...
3,12538,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...
4,13552,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...


In [254]:
ratings_with_books.shape

(487671, 8)
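The merged frame has fewer rows than the filtered ratings (487671 versus 526356) because `merge` defaults to an inner join, which drops ratings whose ISBN does not appear in the books table. One way to inspect such losses is pandas' `indicator=True` option. The sketch below uses small made-up frames (`toy_ratings`, `toy_books`) rather than the real data.

```python
import pandas as pd

# Made-up stand-ins for the ratings and books tables.
toy_ratings = pd.DataFrame({"ISBN": ["A", "B", "C"], "rating": [5, 0, 8]})
toy_books = pd.DataFrame({"ISBN": ["A", "B"], "title": ["Title A", "Title B"]})

# A left join keeps every rating; the _merge column records where each row matched.
checked = toy_ratings.merge(toy_books, on="ISBN", how="left", indicator=True)
unmatched = int((checked["_merge"] == "left_only").sum())

# The default inner join silently drops the unmatched rating.
inner = toy_ratings.merge(toy_books, on="ISBN")

print(f"ratings without a matching book: {unmatched}")
print(f"rows kept by the inner join: {len(inner)} of {len(toy_ratings)}")
```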

In [255]:
# Count the number of ratings for each book
num_ratings = ratings_with_books.groupby('title')['rating'].count().reset_index()

In [256]:
# Rename the 'rating' column to 'num_of_rating'
num_ratings.rename(columns={"rating": "num_of_rating"}, inplace=True)

In [257]:
num_ratings

Unnamed: 0,title,num_of_rating
0,A Light in the Storm: The Civil War Diary of ...,2
1,Always Have Popsicles,1
2,Apple Magic (The Collector's series),1
3,Beyond IBM: Leadership Marketing and Finance ...,1
4,Clifford Visita El Hospital (Clifford El Gran...,1
...,...,...
160264,Ã?Â?ber die Pflicht zum Ungehorsam gegen den S...,3
160265,Ã?Â?lpiraten.,1
160266,Ã?Â?rger mit Produkt X. Roman.,1
160267,Ã?Â?stlich der Berge.,1


In [258]:
# Merge the merged dataset with the number of ratings dataset using the book title as the key
final_df = ratings_with_books.merge(num_ratings, on="title")

In [259]:
# Display the first two rows of the final dataset
final_df.head(2)

Unnamed: 0,user_id,ISBN,rating,title,author,year,publisher,img_url,num_of_rating
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...,82
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...,82


In [260]:
# Display the shape and information of the final dataset
final_df.shape

(487671, 9)

In [261]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 487671 entries, 0 to 487670
Data columns (total 9 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   user_id        487671 non-null  int64 
 1   ISBN           487671 non-null  object
 2   rating         487671 non-null  int64 
 3   title          487671 non-null  object
 4   author         487670 non-null  object
 5   year           487671 non-null  object
 6   publisher      487669 non-null  object
 7   img_url        487668 non-null  object
 8   num_of_rating  487671 non-null  int64 
dtypes: int64(3), object(6)
memory usage: 33.5+ MB


In [262]:
# Keep only books with at least 50 ratings
final_df = final_df[final_df['num_of_rating'] >= 50]

In [263]:
# Display a random sample of 5 rows from the filtered dataset
final_df.sample(n=5)

Unnamed: 0,user_id,ISBN,rating,title,author,year,publisher,img_url,num_of_rating
55595,142524,451167317,0,The Dark Half,Stephen King,1994,Signet Book,http://images.amazon.com/images/P/0451167317.0...,91
63917,106225,440213290,0,The Copper Beech,Maeve Binchy,1993,Dell,http://images.amazon.com/images/P/0440213290.0...,68
67858,200226,60096195,0,The Boy Next Door,Meggin Cabot,2002,Avon Trade,http://images.amazon.com/images/P/0060096195.0...,55
75949,52584,743410181,0,Temptation,Jude Deveraux,2001,Pocket Books,http://images.amazon.com/images/P/0743410181.0...,61
166260,31315,515130966,0,Riptide,Catherine Coulter,2001,Jove Books,http://images.amazon.com/images/P/0515130966.0...,87


In [264]:
# Count the number of duplicates between user_id and title
final_df[final_df.duplicated(subset=['user_id', 'title'])].count()

user_id          2003
ISBN             2003
rating           2003
title            2003
author           2003
year             2003
publisher        2003
img_url          2003
num_of_rating    2003
dtype: int64

In [265]:
# Remove duplicates based on user_id and title
# (reassign instead of using inplace=True, because final_df is a filtered slice)
final_df = final_df.drop_duplicates(subset=['user_id', 'title'])

In [266]:
# Display the shape of the dataset after removing duplicates
final_df.shape

(59850, 9)

In [303]:
# Create a pivot table to represent user-book interactions
book_pivot = final_df.pivot_table(columns='user_id', index='title', values='rating')

In [304]:
# Unrated books appear as NaN in the pivot table; count them before filling with 0
book_pivot.isna().sum().sum() 

599046

In [305]:
book_pivot.head(2)

user_id,254,2276,2766,2977,3363,3757,4017,4385,6242,6251,...,274004,274061,274301,274308,274808,275970,277427,277478,277639,278418
title
1984,9.0,,,,,,,,,,...,,,,,,0.0,,,,
1st to Die: A Novel,,,,,,,,,,,...,,,,,,,,,,


In [306]:
# Fill missing values with 0 in the pivot table
book_pivot.fillna(0, inplace=True)

In [307]:
book_pivot.head(2)

user_id,254,2276,2766,2977,3363,3757,4017,4385,6242,6251,...,274004,274061,274301,274308,274808,275970,277427,277478,277639,278418
title
1984,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1st to Die: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## MODELING

In [269]:
'''
The utilization of a Compressed Sparse Row (CSR) matrix is crucial in this pivot table due to the prevalence of zero values. 
This matrix efficiently stores only the non-zero values, resulting in memory savings and faster computation.
'''

# Create a Compressed Sparse Row (CSR) matrix to store non-zero values efficiently
book_sparse = csr_matrix(book_pivot)

In [308]:
'''
Output Explanation: The matrix has 742 rows (books) and 888 columns (users) and is stored in Compressed Sparse Row format. 
It holds 14942 stored (non-zero) elements, each of data type 'numpy.float64'.
'''

book_sparse

<742x888 sparse matrix of type '<class 'numpy.float64'>'
	with 14942 stored elements in Compressed Sparse Row format>
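To see how much the sparse format saves, the density of a matrix can be computed from its `nnz` attribute. The sketch below does this on a small made-up array, then repeats the arithmetic with the figures reported in the output above (14942 stored elements in a 742x888 matrix).

```python
import numpy as np
from scipy.sparse import csr_matrix

# Small made-up dense matrix in which most entries are zero.
dense = np.array([
    [9.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 7.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])

sparse = csr_matrix(dense)

# nnz is the number of stored elements; density is their share of all cells.
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print(f"toy matrix: {sparse.nnz} stored values, density = {density:.3f}")

# Same arithmetic with the numbers reported for book_sparse above.
notebook_density = 14942 / (742 * 888)
print(f"book_sparse density = {notebook_density:.4f}")
```

Only about 2.3% of the cells in the book matrix hold a non-zero rating, which is why storing only the non-zero values pays off.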

In [270]:
'''
Algorithm Choice - "brute": When the "brute" algorithm is employed, the Nearest Neighbors algorithm performs 
a direct calculation of distances between each data point in the dataset to identify the nearest neighbors. 
This approach, while conceptually straightforward, can become computationally demanding, especially with larger datasets. 
It involves a comprehensive examination of all data points and distance calculations, 
lacking specific data structures or optimizations to expedite the process.

'''

# Create a nearest neighbors model using the 'brute' algorithm
model = NearestNeighbors(algorithm='brute')

In [316]:
'''
---------------------------------------------------------------------
## distance = array([[0.0, 68.78953409, 69.5413546, 72.64296249, 76.83098333, 77.28518616]]):

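* distance is an array of distances between the selected book (row 237 of book_pivot) and its 6 nearest neighbors. With the default settings of NearestNeighbors, these are Euclidean distances.
* The first value, 0.0, is the distance from the book to itself, so only 5 of the 6 neighbors are other books.
* The remaining values are the distances to the other nearest books, in ascending order.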
----------------------------------------------------------------------

## suggestion = array([[237, 240, 238, 241, 184, 536]], dtype=int64):

* suggestion holds the row positions in book_pivot of those same 6 neighbors, in the same order as distance.
* For example, 237 is the selected book itself and 240 is the closest other book.
----------------------------------------------------------------------

## About (1,-1):

* The code book_pivot.iloc[237, :].values.reshape(1,-1) prepares the data of the book at index 237 for input into the model.kneighbors() method.
* book_pivot.iloc[237, :] selects the relevant row from the book_pivot DataFrame.
* .values transforms the row into a NumPy array.
* .reshape(1, -1) reshapes the array into a single row with a dynamically calculated number of columns based on the original shape.
'''


# Fit the model using the sparse matrix
model.fit(book_sparse)

# Find the 6 nearest neighbors for a specific book
distance, suggestion = model.kneighbors(book_pivot.iloc[237, :].values.reshape(1, -1), n_neighbors=6)

print(f'Distances:   {distance}')
print(f'Suggestions: {suggestion}')

Distances:   [[ 0.         68.78953409 69.5413546  72.64296249 76.83098333 77.28518616]]
Suggestions: [[237 240 238 241 184 536]]
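
The reshape step can be checked in isolation. This small sketch uses a made-up five-element vector in place of a pivot-table row to show how `.reshape(1, -1)` turns a 1-D array into the single-row 2-D array that `kneighbors()` expects (one row per query, one column per feature).

```python
import numpy as np

# Stand-in for book_pivot.iloc[237, :].values: one book's ratings as a 1-D array
row = np.array([0.0, 9.0, 0.0, 0.0, 7.0])
print(row.shape)        # (5,)

# 1 row; -1 lets NumPy infer the column count from the array's size
query = row.reshape(1, -1)
print(query.shape)      # (1, 5)
```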


In [317]:
# Print the book titles of the suggested books
for i in range(len(suggestion)):
    print(book_pivot.index[suggestion[i]])

Index(['Harry Potter and the Chamber of Secrets (Book 2)',
       'Harry Potter and the Prisoner of Azkaban (Book 3)',
       'Harry Potter and the Goblet of Fire (Book 4)',
       'Harry Potter and the Sorcerer's Stone (Book 1)', 'Exclusive',
       'The Cradle Will Fall'],
      dtype='object', name='title')


In [276]:
# Store the book titles in a variable
book_names = book_pivot.index

In [277]:
final_df.head(2)

,user_id,ISBN,rating,title,author,year,publisher,img_url,num_of_rating
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...,82
1,3363,002542730X,0,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc,http://images.amazon.com/images/P/002542730X.0...,82


In [278]:
# Display the first two rows of the pivot table
book_pivot.head(2)

user_id,254,2276,2766,2977,3363,3757,4017,4385,6242,6251,...,274004,274061,274301,274308,274808,275970,277427,277478,277639,278418
title,,,,,,,,,,,,,,,,,,,,,
1984,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1st to Die: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [279]:
# Create the 'artifacts' directory if it doesn't exist
os.makedirs('artifacts', exist_ok=True)

In [280]:
# Save the model, book names, final dataset, and pivot table to pickle files.
# The `with` block closes each file handle once the object is written.
artifacts = {'model': model, 'book_names': book_names, 'final_df': final_df, 'book_pivot': book_pivot}
for name, obj in artifacts.items():
    with open(f'artifacts/{name}.pkl', 'wb') as f:
        pickle.dump(obj, f)
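
For completeness, a saved artifact can be read back with `pickle.load`. The sketch below round-trips a stand-in list of titles through a temporary directory so that it runs on its own; in the notebook the path would be one of the `artifacts/*.pkl` files. Only unpickle files you trust, since loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Stand-in for book_names; the real object is a pandas Index of titles
book_names_demo = ["1984", "1st to Die: A Novel", "2nd Chance"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "book_names.pkl")

    # Write the object; the context manager closes the file handle
    with open(path, "wb") as f:
        pickle.dump(book_names_demo, f)

    # Read it back
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == book_names_demo)  # True
```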

In [281]:
# Define a function to recommend books based on a book's title
def recommend_book(book_name):
    # Row position of the requested title in the pivot table
    book_id = np.where(book_pivot.index == book_name)[0][0]
    # Row positions of the 6 nearest books (the first is normally the book itself)
    _, suggestion = model.kneighbors(book_pivot.iloc[book_id, :].values.reshape(1, -1), n_neighbors=6)

    # suggestion has shape (1, 6), so look up the titles for its single row
    for title in book_pivot.index[suggestion[0]]:
        print(title)

"""
book_id = np.where(book_pivot.index == 'The Cradle Will Fall')
book_id
output: (array([536], dtype=int64),)
#####################################

book_id[0]
output: array([536], dtype=int64)
###################################

book_id[0][0]
output: 536

"""

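One caveat with `np.where(...)[0][0]` is that it raises an `IndexError` when the title is not in the index, because the array of matches is empty. A hedged sketch of a safer lookup follows; `find_book_position` and the toy title array are hypothetical and stand in for `book_pivot.index`.

```python
import numpy as np


def find_book_position(titles, book_name):
    """Return the row position of `book_name` in `titles`, or None if absent."""
    matches = np.where(titles == book_name)[0]   # array of matching positions
    if matches.size == 0:
        return None                              # title not in the index
    return int(matches[0])                       # first (normally only) match


# Toy stand-in for book_pivot.index
titles = np.array(["1984", "1st to Die: A Novel", "2nd Chance"])

print(find_book_position(titles, "2nd Chance"))    # 2
print(find_book_position(titles, "Missing Book"))  # None
```
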
In [327]:
#### This cell is meant to help understand what the previous cell does. ####

def book_tryout(book_name):
    book_id = np.where(book_pivot.index == book_name)[0][0]
    distance, suggestion = model.kneighbors(book_pivot.iloc[book_id, :].values.reshape(1,-1), n_neighbors=6)
    
    return suggestion


neighbors = book_tryout("2nd Chance")
print(neighbors)
for row in neighbors:
    print(row)
    for j in row:
        print(j)
        
#########################################################################################

[[  2 609 184 562 697 311]]
[  2 609 184 562 697 311]
2
609
184
562
697
311


In [284]:
df1 = pd.DataFrame(book_pivot.iloc[100, :].values.reshape(1,-1))
df1 

,0,1,2,3,4,5,6,7,8,9,...,878,879,880,881,882,883,884,885,886,887
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [328]:
'''
The input book appears in its own recommendations because of how the model ranks results.
Neighbors are ordered by distance, and a book's distance to itself is 0,
so the input book normally takes the first position (index 0) of the results.
'''

book_name = "A Bend in the Road"
recommend_book(book_name=book_name)

A Bend in the Road
Exclusive
The Cradle Will Fall
No Safe Place
Family Album
Last Man Standing
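
If the input title should not appear among its own recommendations, one option is to filter it out of the neighbor indices before printing. The sketch below is an assumption about how that could look, not the notebook's method; `drop_query` is a hypothetical helper, and the index array has the same shape as the one returned by `kneighbors()`.

```python
import numpy as np


def drop_query(suggestion, book_id):
    """Remove the query book's own index from a kneighbors() index array."""
    neighbors = suggestion[0]                 # kneighbors returns shape (1, n_neighbors)
    return neighbors[neighbors != book_id]    # keep every index except the query


# Index array in the same shape as the notebook's output for book 237
suggestion = np.array([[237, 240, 238, 241, 184, 536]])
print(drop_query(suggestion, 237))  # [240 238 241 184 536]
```

Filtering by index rather than slicing off the first position is a little more robust: when two books have identical rating rows, the query may not be listed first. Requesting one extra neighbor (`n_neighbors=7`) would then leave six recommendations after filtering.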
