# Audible Insights: Book Recommendation System

## Recommendation System Development

In [19]:
# Import necessary libraries
import pandas as pd
import numpy as np
import pickle
import os

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings("ignore")


Imports the libraries required for:
* Data handling (pandas, numpy)
* NLP and modeling (TF-IDF, KMeans, LinearRegression)
* Similarity and evaluation (cosine_similarity, mean_squared_error)
* File handling (os, pickle)
* Suppressing warnings for cleaner output

In [20]:
# Load the data set
df = pd.read_csv("books_clusters.csv")

# Check the first few rows
df.head()


Unnamed: 0,Book Name,Author,Rating,Number of Reviews,Price,Listening Time (Minutes),Rank,Genre,Processed_Description,cluster
0,Think Like a Monk: The Secret of How to Harnes...,Jay Shetty,4.9,342.0,10080.0,654.0,1,Society & Culture (Books),"Over past three year , Jay Shetty become one w...",1
1,Ikigai: The Japanese Secret to a Long and Happ...,Héctor García,4.6,3670.0,615.0,203.0,2,Personal Success,Brought Penguin .,2
2,The Subtle Art of Not Giving a F*ck: A Counter...,Mark Manson,4.4,20240.0,10378.0,317.0,3,Personal Development & Self-Help,"In generation-defining self-help guide , super...",1
3,Atomic Habits: An Easy and Proven Way to Build...,James Clear,4.6,4646.0,888.0,335.0,5,Personal Success,Brought Penguin .,2
4,Life's Amazing Secrets: How to Find Balance an...,Gaur Gopal Das,4.6,4305.0,1005.0,385.0,6,Spiritualism,"Stop going life , Start growing life !",1


Loads the **preprocessed dataset** (books_clusters.csv) and displays the first 5 rows to inspect the structure.

In [21]:
print(df.columns)


Index(['Book Name', 'Author', 'Rating', 'Number of Reviews', 'Price',
       'Listening Time (Minutes)', 'Rank', 'Genre', 'Processed_Description',
       'cluster'],
      dtype='object')


Prints the column names of the DataFrame `df`, which is useful for confirming feature names and understanding the dataset structure.

### Building recommendation models

This section builds the **recommendation models**, covering both **content-based** and **clustering-based** approaches.

#### Content-Based Recommendations

Introduces the **Content-Based Filtering** method, where books are recommended based on the similarity of content features like genre, author, and description.

In [37]:
def content_similarity(df):
    """
    Computes the cosine similarity matrix using genres, authors, and processed descriptions.
    Returns the matrix; the caller saves it as a pickle file after the function call.
    """

    # Fill NaN values with empty strings in the relevant columns
    df['Genre'] = df['Genre'].fillna('')
    df['Author'] = df['Author'].fillna('')
    df['Processed_Description'] = df['Processed_Description'].fillna('')
    
    # Combine features for similarity calculation
    df['combined_features'] = df['Genre'].astype(str) + ' ' + df['Author'] + ' ' + df['Processed_Description']
    
    # Convert text into numerical representation using TF-IDF
    tfidf = TfidfVectorizer(stop_words='english')
    tfidf_matrix = tfidf.fit_transform(df['combined_features'])
    
    # Compute cosine similarity
    content_sim_matrix = cosine_similarity(tfidf_matrix)
    
    return content_sim_matrix


content_sim_matrix = content_similarity(df)  # Call the function

with open("C:/Users/user/Desktop/DS_Capstone_Projects/DS_Audible_Insights/models/content_similarity_matrix.pkl", "wb") as file:
    pickle.dump(content_sim_matrix, file)

Defines a function that computes the cosine similarity between books from their combined features (`Genre` + `Author` + `Processed_Description`) using **TF-IDF vectorization**.
The resulting similarity matrix is then saved as a `.pkl` file for use in the Streamlit app.
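
To make the TF-IDF and cosine similarity step concrete, here is a minimal sketch on three made-up feature strings. The strings and the `toy_` variable names are assumptions for illustration and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three made-up "combined feature" strings (illustrative only, not from the dataset)
toy_features = [
    "habits productivity motivation",
    "habits productivity morning",
    "physics quantum relativity",
]

tfidf = TfidfVectorizer(stop_words="english")
toy_matrix = tfidf.fit_transform(toy_features)  # sparse matrix: 3 documents x vocabulary size
toy_sim = cosine_similarity(toy_matrix)         # dense array of shape (3, 3)

# Documents 0 and 1 share two terms, so their similarity is above zero.
# Document 2 shares no terms with the others, so its similarity to them is zero.
print(toy_sim.round(2))
```

Each row of the result plays the same role as a row of `content_sim_matrix`: it scores one book against every other book.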

In [23]:
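# Recommends books based on a given book index using the cosine similarity matrix.
def recommend_books_by_index(book_index, content_sim_matrix, df, top_n=5):
    # Pair every book position with its similarity to the query book, skipping the query itself
    similar_books = [
        (i, score) for i, score in enumerate(content_sim_matrix[book_index]) if i != book_index
    ]
    # Sort by similarity score, highest first, and keep the top_n entries
    sorted_books = sorted(similar_books, key=lambda x: x[1], reverse=True)[:top_n]
    recommendations = [df.iloc[i]['Book Name'] for i, _ in sorted_books]
    return recommendations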


This function takes a book index and returns the `top_n` most similar books using the precomputed **cosine similarity matrix**. It is the core of the **content-based recommendation logic**.
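
Because the recommender works on row positions, an app that lets users pick a title needs to map that title to an index first. The sketch below shows one possible way to do this; `find_book_index` is a hypothetical helper that does not appear in the notebook, and it assumes exact title matches.

```python
import numpy as np
import pandas as pd


def find_book_index(df, title):
    """Hypothetical helper: return the row position of the first exact title match."""
    positions = np.flatnonzero((df["Book Name"] == title).to_numpy())
    if positions.size == 0:
        raise KeyError(f"Title not found: {title!r}")
    return int(positions[0])


# Tiny stand-in DataFrame (illustrative only)
toy_df = pd.DataFrame({"Book Name": ["Book A", "Book B", "Book C"]})
toy_index = find_book_index(toy_df, "Book B")
print(toy_index)  # 1
```

The returned position can then be passed as `book_index` to `recommend_books_by_index`.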

#### Clustering-Based Recommendations

Introduces **clustering-based recommendations**, where books are recommended based on the cluster they belong to.

In [38]:
def kmeans_clustering(df, n_clusters=5):
        
    # Select numerical features for clustering
    X = df[['Rating', 'Number of Reviews', 'Price', 'Listening Time (Minutes)', 'Rank']]
    
    # Apply K-Means clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    df['cluster'] = kmeans.fit_predict(X)

    # Ensure the directory exists before saving
    save_folder = "C:/Users/user/Desktop/DS_Capstone_Projects/DS_Audible_Insights/models"
    os.makedirs(save_folder, exist_ok=True)

    # Save the clustering model using pickle
    model_path = os.path.join(save_folder, "kmeans_clustering_model.pkl")
    with open(model_path, 'wb') as file:
        pickle.dump(kmeans, file)

    return kmeans
kmeans = kmeans_clustering(df)


Defines and applies **KMeans clustering** on numeric book features. It also saves the trained model as a `.pkl` file for later use in the Streamlit app.
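
K-Means relies on Euclidean distance, so a column with a large numeric range (such as a review count) can dominate a column with a small range (such as a rating). The notebook clusters the raw columns. As an optional variation, and not the notebook's method, the sketch below standardizes features before clustering. The synthetic data and all `toy_` names are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-ins for two columns on very different scales
# (for example a rating and a review count); not real data.
toy_X = np.column_stack([
    rng.uniform(1, 5, size=60),
    rng.uniform(0, 20000, size=60),
])

# Standardize each column to zero mean and unit variance, then cluster
scaled_kmeans = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=42),
)
toy_labels = scaled_kmeans.fit_predict(toy_X)
print(toy_labels[:10])
```

If this variation were adopted, the whole pipeline (scaler plus model) would need to be pickled so the app applies the same scaling at prediction time.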

In [25]:
# Recommends books from the same cluster based on book index

def recommend_books_from_cluster(book_index, df, top_n=5):
   
    cluster_label = df.iloc[book_index]['cluster']

    cluster_books = df[df['cluster'] == cluster_label].head(top_n)['Book Name'].tolist()
    return cluster_books


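Recommends books from the same cluster as a given book using its **cluster label**, implementing **clustering-based recommendations**. Because it takes the first `top_n` rows of the cluster without excluding the query book, the query book itself can appear in the results.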

#### Hybrid Approaches

Introduces a **hybrid recommendation system** that combines content-based and clustering-based results.

In [26]:
# Combines content-based and clustering recommendations to suggest books.
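def get_hybrid_recommendations(book_index, similarity_matrix, df, top_n=5):
    # Use the matrix passed in as an argument rather than the global content_sim_matrix
    content_based_recs = recommend_books_by_index(book_index, similarity_matrix, df, top_n=top_n)
    cluster_based_recs = recommend_books_from_cluster(book_index, df, top_n=top_n)
    # Union of both lists; set() removes duplicates but does not preserve ranking order
    return list(set(content_based_recs + cluster_based_recs))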


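Implements a **hybrid recommendation system** by merging the results of the content-based and clustering-based recommenders. Duplicates are removed with `set`, so the returned list can contain up to `2 * top_n` titles and its order is not a ranking.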
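
The hybrid function merges its two lists through a `set`, which drops duplicates but also discards the order of the recommendations. If a stable order matters, one possible alternative is sketched below. `merge_preserving_order` is a hypothetical helper, not part of the notebook.

```python
def merge_preserving_order(*recommendation_lists):
    """Hypothetical helper: concatenate lists, keeping the first occurrence of each title."""
    # dict.fromkeys keeps insertion order (guaranteed since Python 3.7)
    return list(dict.fromkeys(
        title for recs in recommendation_lists for title in recs
    ))


merged = merge_preserving_order(["Book A", "Book B"], ["Book B", "Book C"])
print(merged)  # ['Book A', 'Book B', 'Book C']
```

With this approach, content-based results would come first and cluster-based results would fill in after them, which is a design choice rather than a requirement.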

In [27]:
# Precision and Recall calculation
def calculate_precision_recall(recommended, actual):
    recommended_set = set(recommended)
    actual_set = set(actual)

    true_positives = len(recommended_set.intersection(actual_set))
    precision = true_positives / len(recommended_set) if recommended_set else 0
    recall = true_positives / len(actual_set) if actual_set else 0

    return precision, recall
   

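Defines a function to compute **precision and recall** for evaluating the recommendation engine. Precision is the share of recommended books that are in the actual (relevant) set, and recall is the share of the actual set that was recommended.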

In [39]:
# RMSE Calculation using Linear Regression for rating prediction
def calculate_rmse(df):
    # Train a simple Linear Regression model for rating prediction
    X = df[['Rank', 'Number of Reviews', 'Price', 'Listening Time (Minutes)']]
    y = df['Rating']

    model = LinearRegression()
    model.fit(X, y)
    
    # Make predictions
    predicted_ratings = model.predict(X)

    # Calculate Mean Squared Error (MSE)
    mse = mean_squared_error(y, predicted_ratings)
    
    # Take square root of MSE to get RMSE
    rmse = np.sqrt(mse)

    # Ensure the directory exists before saving
    save_folder = "C:/Users/user/Desktop/DS_Capstone_Projects/DS_Audible_Insights/models"
    os.makedirs(save_folder, exist_ok=True)
        
    # Save the linear regression model
    model_path = os.path.join(save_folder, "linear_regression_model.pkl")
    with open(model_path, 'wb') as file:
        pickle.dump(model, file)
    
    
    return rmse
rmse = calculate_rmse(df)

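Trains a **Linear Regression model** to predict book ratings and calculates the **Root Mean Squared Error (RMSE)**. The model is fitted and evaluated on the same rows, so this is an in-sample (training) error rather than an estimate of performance on unseen books. The trained model is also saved for deployment.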
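
The function above fits and scores the model on the same data. A common alternative is to hold out part of the data and report RMSE on it. The sketch below illustrates that idea on synthetic data; the data, the 80/20 split, and all variable names are assumptions, and it is not the notebook's evaluation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features and target (illustrative only, not the Audible data)
toy_X = rng.normal(size=(200, 4))
toy_y = toy_X @ np.array([0.5, -0.2, 0.1, 0.3]) + rng.normal(scale=0.1, size=200)

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# RMSE on rows the model did not see during fitting
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"Held-out RMSE: {test_rmse:.3f}")
```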

In [29]:
# Calculate similarity for content-based filtering
cosine_sim = content_similarity(df)

Recalculates the **cosine similarity matrix** using the previously defined `content_similarity` function.

In [30]:
cosine_sim

array([[1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.0270161 , ..., 0.16178998, 0.04422082,
        0.        ],
       [0.        , 0.0270161 , 1.        , ..., 0.        , 0.14499006,
        0.        ],
       ...,
       [0.        , 0.16178998, 0.        , ..., 1.        , 0.        ,
        0.0260833 ],
       [0.        , 0.04422082, 0.14499006, ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.0260833 , 0.        ,
        1.        ]])

Displays the cosine similarity matrix for verification. The diagonal entries are 1 because each book is identical to itself, and the matrix is symmetric.

In [31]:
# Perform clustering
kmeans_clustering(df)

Re-applies KMeans clustering to the current DataFrame `df`, overwriting the `cluster` column and re-saving the model.

In [32]:
# Get hybrid recommendations for a given book index (example index 0)
book_index = 0
hybrid_recommendations = get_hybrid_recommendations(book_index, content_sim_matrix, df, top_n=5)
print(f'Hybrid Recommendations:\n{hybrid_recommendations}')


Hybrid Recommendations:
["A Room of One's Own: Penguin Classics", 'The Book of Why: The New Science of Cause and Effect', 'The 5AM Club: Own Your Morning. Elevate Your Life.', 'The Intelligent Investor Rev Ed.', 'Das Think Like a Monk-Prinzip: Finde innere Ruhe und Kraft für ein erfülltes und sinnvolles Leben', 'Everything Is F*cked: A Book About Hope', 'The Facebook Effect: The Inside Story of the Company That Is Connecting the World', 'Think Like a Monk: The Secret of How to Harness the Power of Positivity and Be Happy Now', 'Influence: The Psychology of Persuasion', 'The Order of Time: Narrated by Benedict Cumberbatch']


Generates and prints **hybrid recommendations** (content-based plus cluster-based) for the book at index 0.

In [33]:
df['Book Name'].unique()

array(['Think Like a Monk: The Secret of How to Harness the Power of Positivity and Be Happy Now',
       'Ikigai: The Japanese Secret to a Long and Happy Life',
       'The Subtle Art of Not Giving a F*ck: A Counterintuitive Approach to Living a Good Life',
       ..., 'Terra Incognita: 100 Maps to Survive the Next 100 Years',
       'Universal Mind Power: New Silva Method Techniques for Developing Your Ideal Self',
       "Dr. Bernstein's Diabetes Solution: The Complete Guide to Achieving Normal Blood Sugars"],
      dtype=object)

Displays all **unique book titles** in the dataset, which is useful for identifying available options or validating recommendations.

In [34]:
# Books liked by the user (replace with actual data)
user_favorite_books = [
    "Everything Is F*cked: A Book About Hope",
    "Ikigai: The Japanese Secret to a Long and Happy Life",
    "Universal Mind Power: New Silva Method Techniques for Developing Your Ideal Self",
    "The 5AM Club: Own Your Morning. Elevate Your Life.",
    "The Order of Time: Narrated by Benedict Cumberbatch"
]

# Compute Precision & Recall
precision, recall = calculate_precision_recall(hybrid_recommendations, user_favorite_books)

# Print Results
print(f'Precision: {precision:.2f}, Recall: {recall:.2f}')


Precision: 0.30, Recall: 0.60


Simulates a **user profile** by listing favorite books, then evaluates the hybrid recommendations with precision and recall. Three of the ten recommended titles appear in the favorites list, giving a precision of 3/10 = 0.30 and a recall of 3/5 = 0.60.

In [35]:
# Calculate RMSE
rmse = calculate_rmse(df)
print(f'RMSE: {rmse}')

RMSE: 1.625641464505686


Computes and displays the **Root Mean Squared Error (RMSE)** of the linear regression model trained to predict ratings. RMSE is expressed in the same units as the rating, so lower values mean predictions closer to the actual ratings.

#### Summary

This notebook successfully implemented and evaluated a book recommendation system using:

- **Content-Based Filtering** using TF-IDF and cosine similarity
- **Clustering-Based Recommendations** using KMeans
- **Hybrid Recommendation Strategy** that merges both approaches
- **Evaluation Metrics** such as Precision, Recall, and RMSE

The models and data were saved as `.pkl` and `.csv` files, ready to be deployed in a Streamlit app for an interactive user experience.
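
For deployment, the app would need to read the pickled artifacts back into memory. The round trip below is a minimal sketch of that save-and-load pattern using a temporary directory. The file name and the stand-in matrix are hypothetical, and the notebook's actual paths are not reproduced here.

```python
import os
import pickle
import tempfile

import numpy as np

# A small stand-in for a saved artifact such as the similarity matrix
toy_matrix = np.eye(3)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this example
    artifact_path = os.path.join(tmp_dir, "toy_similarity_matrix.pkl")

    # Save the artifact in binary mode
    with open(artifact_path, "wb") as file:
        pickle.dump(toy_matrix, file)

    # Load it back, as an app would at startup
    with open(artifact_path, "rb") as file:
        loaded_matrix = pickle.load(file)

print(np.array_equal(toy_matrix, loaded_matrix))  # True
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.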
