# 📘 Final Project - Recommender Systems

### 📌 Submitted by:
#### 👤 1. Ebenezer Daniel  
#### 👤 2. Raja Prabhakaran  
#### 👤 3. Nitheesh Samiappan  

## 📚 **Dataset Details: Goodreads Book Datasets (10M)**
### 📌 Source:
🔗 **[Goodreads Book Datasets (10M) on Kaggle](https://www.kaggle.com/datasets/bahramjannesarr/goodreads-book-datasets-10m/data)**

### 📊 **Overview**
This dataset contains **book information and user ratings** from **Goodreads**, a popular online book review platform. It includes metadata on **millions of books**, user-generated reviews, and rating distributions.

### 📂 **Files Included**
- **Books Data**: Contains metadata such as title, author, publisher, publication year, and ratings.
- **User Ratings Data**: One row per user-book rating, with a user ID, the book title, and a text rating label (e.g., "it was amazing").

### 🔍 **Dataset Attributes**
| Column Name         | Description |
|---------------------|------------|
| **Id**             | Unique book identifier |
| **Name**           | Title of the book |
| **Authors**        | Author(s) of the book |
| **ISBN**           | International Standard Book Number (if available) |
| **Publisher**      | Name of the publishing house |
| **PagesNumber**    | Number of pages in the book |
| **PublishYear**    | Year of publication |
| **PublishMonth**   | Month of publication |
| **PublishDay**     | Day of publication |
| **Language**       | Language of the book |
| **CountsOfReview** | Total number of user reviews |
| **Rating**         | Average user rating (1-5) |
| **RatingDist1**    | Number of 1-star ratings |
| **RatingDist2**    | Number of 2-star ratings |
| **RatingDist3**    | Number of 3-star ratings |
| **RatingDist4**    | Number of 4-star ratings |
| **RatingDist5**    | Number of 5-star ratings |
| **RatingDistTotal** | Total number of ratings |

### 📈 **Size of the Dataset**
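- **Number of books:** ~10 million nominally (per the dataset title); the `book*.csv` files loaded in this notebook contain 1,850,310 rows  
- **Number of user ratings:** the `user*.csv` files loaded in this notebook contain 362,596 rows  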

### 🛠 **Preprocessing Done**
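- **Handling Missing Values:** Filled missing publishers with "Unknown" and missing languages with the most frequent language, generated missing descriptions from other descriptions by the same author, and dropped the ISBN column.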
- **Standardizing Column Names:** Ensured consistency across multiple files.
- **Data Cleaning:** Removed duplicates, merged similar columns (e.g., `PagesNumber` and `pagesNumber`).

---

### 📌 **Why This Dataset?**
- Large-scale book rating data enables **recommendation system development**.
- Rich metadata for **book analysis, trend discovery, and user preferences**.
- Real-world **collaborative filtering & machine learning applications**.

---

📢 *This dataset serves as the foundation for our book recommender system and data analysis!* 🚀


In [10]:
import importlib
import subprocess
import sys

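# Map each pip package name to the module name used in `import`
# (the two differ for scikit-learn and scikit-surprise)
required_libraries = {
    "pandas": "pandas", "numpy": "numpy", "matplotlib": "matplotlib",
    "seaborn": "seaborn", "scipy": "scipy", "nltk": "nltk",
    "scikit-learn": "sklearn", "scikit-surprise": "surprise",
    "tabulate": "tabulate", "rich": "rich", "gensim": "gensim", "summa": "summa",
}

# Check each library, install it if missing, and return the packages that could not be installed
def check_and_install(libraries):
    failed = []
    for package, module in libraries.items():
        try:
            importlib.import_module(module)
        except ImportError:
            print(f"Installing {package}...")
            try:
                subprocess.check_call([sys.executable, "-m", "pip", "install", package])
            except subprocess.CalledProcessError:
                print(f"Failed to install {package}. Trying Conda (if applicable)...")
                try:
                    subprocess.check_call(["conda", "install", "-c", "conda-forge", package, "-y"])
                except (subprocess.CalledProcessError, FileNotFoundError):
                    print(f"Could not install {package}. Please install it manually.")
                    failed.append(package)
    return failed

# Check and install missing libraries
failed_libraries = check_and_install(required_libraries)

if failed_libraries:
    print(f"Missing libraries: {', '.join(failed_libraries)}")
else:
    print("All required libraries are installed.")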


Installing scikit-learn...
Installing scikit-surprise...
All required libraries are installed.


In [155]:
# Jupyter magic command for inline plots
%matplotlib inline

# Import libraries
import pandas as pd
import glob
import os
import re
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tabulate import tabulate
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.neighbors import NearestNeighbors
from difflib import get_close_matches
from sklearn.decomposition import TruncatedSVD
from rich.console import Console
from rich.table import Table
from gensim.models import Word2Vec
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

import warnings
warnings.simplefilter('ignore')

print("All required libraries are successfully imported.")

All required libraries are successfully imported.


In [12]:
# Define the folder where your CSV files are stored
csv_folder = "Datasets/" 

# Use glob to collect the CSV files whose names start with 'book' and 'user'
csv_book_files = glob.glob(os.path.join(csv_folder, "book*.csv"))
csv_user_files = glob.glob(os.path.join(csv_folder, "user*.csv"))

# Read each CSV file, lower-case its column names, and concatenate everything into one DataFrame
def load_and_standardize_csv(file_list):
    df_list = []
    for file in file_list:
        df = pd.read_csv(file)
        df.columns = df.columns.str.lower()
        df_list.append(df)
    return pd.concat(df_list, ignore_index=True)

# Load and combine "book" and "user" CSV files
df_books = load_and_standardize_csv(csv_book_files)
df_users = load_and_standardize_csv(csv_user_files)

# Rename specific mismatched columns for consistency
df_books.rename(columns={'pagesnumber': 'pagesNumber'}, inplace=True)

print(f"Combined 'book' DataFrame shape: {df_books.shape}")
print(f"Combined 'user' DataFrame shape: {df_users.shape}")

Combined 'book' DataFrame shape: (1850310, 20)
Combined 'user' DataFrame shape: (362596, 3)


In [13]:
# Store the merged data in CSV files for both books and users
df_books.to_csv('combined_books_data.csv', index=False)  # Note: writing the full books DataFrame caused a memory error
df_users.to_csv('combined_users_data.csv', index=False)

In [14]:
#copy the data frame and keep the original data for further use
df_books_original = df_books.copy()
df_users_original = df_users.copy()
print('Original data copied for future use. DONE!!!')

Original data copied for future use. DONE!!!


### Preprocessing & Data Cleaning

In [16]:
# find the unique ID and ISBN
unique_ids = df_books['id'].nunique()
unique_isbns = df_books['isbn'].nunique()
print(f"Unique IDs: {unique_ids}")
print(f"Unique ISBNs: {unique_isbns}")

Unique IDs: 1850115
Unique ISBNs: 1844192


In [17]:
# find duplicates across all rows
duplicate_rows = df_books.duplicated().sum()
print(f"Duplicate Rows: {duplicate_rows}")

Duplicate Rows: 112


In [18]:
# find missing ISBNs & ID
missing_isbns = df_books['isbn'].isna().sum()
print(f"Missing ISBNs: {missing_isbns}")

missing_id = df_books['id'].isna().sum()
print(f"Missing Id: {missing_id}")

Missing ISBNs: 5923
Missing Id: 0


In [19]:
# Drop duplicate rows
df_books_cleaned = df_books.drop_duplicates()
print(f"After removing fully duplicated rows: {df_books_cleaned.shape[0]} records remaining.")

# Keep only the first occurrence of each ID
df_books_cleaned = df_books_cleaned.drop_duplicates(subset=['id'], keep='first')
print(f"After ensuring unique IDs: {df_books_cleaned.shape[0]} records remaining.")

After removing fully duplicated rows: 1850198 records remaining.
After ensuring unique IDs: 1850115 records remaining.


In [20]:
# Check for missing (NaN) values across all columns
missing_values = df_books_cleaned.isna().sum()
missing_values

id                             0
name                           0
ratingdist1                    0
pagesNumber                    0
ratingdist4                    0
ratingdisttotal                0
publishmonth                   0
publishday                     0
publisher                  17821
countsofreview                 0
publishyear                    0
language                 1598369
authors                        0
rating                         0
ratingdist2                    0
ratingdist5                    0
isbn                        5922
ratingdist3                    0
description               678927
count of text reviews    1440418
dtype: int64

In [21]:
# Drop the columns that are not needed
df_books_cleaned.drop(columns=['count of text reviews', 'isbn'], inplace=True)
df_books_cleaned.head(5)

Unnamed: 0,id,name,ratingdist1,pagesNumber,ratingdist4,ratingdisttotal,publishmonth,publishday,publisher,countsofreview,publishyear,language,authors,rating,ratingdist2,ratingdist5,ratingdist3,description
0,1,Harry Potter and the Half-Blood Prince (Harry ...,1:9896,652,4:556485,total:2298124,16,9,Scholastic Inc.,28062,2006,eng,J.K. Rowling,4.57,2:25317,5:1546466,3:159960,
1,2,Harry Potter and the Order of the Phoenix (Har...,1:12455,870,4:604283,total:2358637,1,9,Scholastic Inc.,29770,2004,eng,J.K. Rowling,4.5,2:37005,5:1493113,3:211781,
2,3,Harry Potter and the Sorcerer's Stone (Harry P...,1:108202,309,4:1513191,total:6587388,1,11,Scholastic Inc,75911,2003,eng,J.K. Rowling,4.47,2:130310,5:4268227,3:567458,
3,4,Harry Potter and the Chamber of Secrets (Harry...,1:11896,352,4:706082,total:2560657,1,11,Scholastic,244,2003,eng,J.K. Rowling,4.42,2:49353,5:1504505,3:288821,
4,5,Harry Potter and the Prisoner of Azkaban (Harr...,1:10128,435,4:630534,total:2610317,1,5,Scholastic Inc.,37093,2004,eng,J.K. Rowling,4.57,2:24849,5:1749958,3:194848,


In [22]:
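# Fill missing publishers with a placeholder and missing languages with the most common language
# (assigning the result back avoids pandas' chained-assignment warning for inplace=True on a column)
df_books_cleaned['publisher'] = df_books_cleaned['publisher'].fillna("Unknown")
df_books_cleaned['language'] = df_books_cleaned['language'].fillna(df_books_cleaned['language'].mode()[0])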

In [23]:
df_books_cleaned.head(2)

Unnamed: 0,id,name,ratingdist1,pagesNumber,ratingdist4,ratingdisttotal,publishmonth,publishday,publisher,countsofreview,publishyear,language,authors,rating,ratingdist2,ratingdist5,ratingdist3,description
0,1,Harry Potter and the Half-Blood Prince (Harry ...,1:9896,652,4:556485,total:2298124,16,9,Scholastic Inc.,28062,2006,eng,J.K. Rowling,4.57,2:25317,5:1546466,3:159960,
1,2,Harry Potter and the Order of the Phoenix (Har...,1:12455,870,4:604283,total:2358637,1,9,Scholastic Inc.,29770,2004,eng,J.K. Rowling,4.5,2:37005,5:1493113,3:211781,


In [25]:
# Note: this cell takes 10 to 30 minutes to run
from summa import summarizer

# Step 1: Create a mapping of authors to non-null descriptions
author_descriptions = df_books_cleaned.groupby('authors')['description'].apply(lambda x: " ".join(x.dropna().values[:5]))

# Step 2: Fill missing descriptions in bulk
def generate_summary(author, desc):
    if pd.isna(desc) and author in author_descriptions:
        return summarizer.summarize(author_descriptions[author], words=50)
    return desc

# Apply row by row (DataFrame.apply with axis=1 is a Python-level loop, not a vectorized operation)
df_books_cleaned['description'] = df_books_cleaned.apply(lambda row: generate_summary(row['authors'], row['description']), axis=1)
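
Because the generated summary depends only on the author, one possible speed-up is to summarize once per author and then map the result onto the rows whose description is missing. The sketch below runs on a toy DataFrame; `shorten` is a hypothetical stand-in for `summarizer.summarize`, used only so the example runs without `summa`.

```python
import pandas as pd

# Toy data: author A has one description, author B has none
df = pd.DataFrame({
    "authors": ["A", "A", "B", "B"],
    "description": ["First book by A.", None, None, None],
})

def shorten(text):
    # Hypothetical placeholder for summarizer.summarize(text, words=50)
    return text[:50]

# Step 1: join up to five non-null descriptions per author
author_text = df.groupby("authors")["description"].apply(
    lambda s: " ".join(s.dropna().values[:5])
)

# Step 2: summarize once per author instead of once per row
author_summary = author_text.map(shorten)

# Step 3: fill only the missing descriptions, aligned by row
filled = df["description"].fillna(df["authors"].map(author_summary))
print(filled.tolist())
```

Authors with no descriptions at all end up with an empty string rather than NaN, which matches what the row-wise version produces for them.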

In [27]:
# Find the distinct rating labels in the rating column
ratings = df_users['rating'].unique()
print(ratings)

['it was amazing' 'really liked it' 'liked it' 'did not like it'
 'it was ok' "This user doesn't have any rating"]


In [28]:
rating_mapping = {
    "it was amazing": 5,
    "really liked it": 4,
    "liked it": 3,
    "it was ok": 2,
    "did not like it": 1,
    "This user doesn't have any rating": 0  # Placeholder for "no rating"; treat as missing (NaN) or drop before modeling
}

# Create a numeric rating column
df_users['rating in numbers'] = df_users['rating'].map(rating_mapping)
df_users.head(5)

Unnamed: 0,id,name,rating,rating in numbers
0,1,Agile Web Development with Rails: A Pragmatic ...,it was amazing,5
1,1,The Restaurant at the End of the Universe (Hit...,it was amazing,5
2,1,Siddhartha,it was amazing,5
3,1,The Clock of the Long Now: Time and Responsibi...,really liked it,4
4,1,"Ready Player One (Ready Player One, #1)",really liked it,4
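
The `0` assigned to "This user doesn't have any rating" is a placeholder, not a real score, so it can pull averages down if it is left in. A minimal sketch of the alternative, assuming the label is simply left out of the mapping so that pandas turns it into NaN (the toy data and the names `label_to_score` and `rated_only` are illustrative only):

```python
import pandas as pd

ratings = pd.DataFrame({
    "user_id": [1, 1, 2],
    "rating": ["it was amazing", "This user doesn't have any rating", "liked it"],
})

# The "no rating" label is intentionally absent from the mapping
label_to_score = {
    "it was amazing": 5,
    "really liked it": 4,
    "liked it": 3,
    "it was ok": 2,
    "did not like it": 1,
}

# Series.map returns NaN for labels that are not in the dictionary
ratings["score"] = ratings["rating"].map(label_to_score)

# Keep only the rows that carry an actual rating
rated_only = ratings.dropna(subset=["score"])

mean_with_placeholder = ratings["score"].fillna(0).mean()  # treats "no rating" as 0
mean_rated = rated_only["score"].mean()                    # ignores "no rating"
print(mean_with_placeholder, mean_rated)
```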


In [29]:
# copying the data frame
df_users_copy = df_users.copy()
df_books_copy = df_books_cleaned.copy() 

In [30]:
# renaming the columns in both the data frame
df_books_copy.rename(columns={'id': 'book_id'}, inplace=True)
df_users_copy.rename(columns={'id': 'user_id'}, inplace=True)

In [31]:
df_users_copy.head()

Unnamed: 0,user_id,name,rating,rating in numbers
0,1,Agile Web Development with Rails: A Pragmatic ...,it was amazing,5
1,1,The Restaurant at the End of the Universe (Hit...,it was amazing,5
2,1,Siddhartha,it was amazing,5
3,1,The Clock of the Long Now: Time and Responsibi...,really liked it,4
4,1,"Ready Player One (Ready Player One, #1)",really liked it,4


In [32]:
df_books_copy.head(5)

Unnamed: 0,book_id,name,ratingdist1,pagesNumber,ratingdist4,ratingdisttotal,publishmonth,publishday,publisher,countsofreview,publishyear,language,authors,rating,ratingdist2,ratingdist5,ratingdist3,description
0,1,Harry Potter and the Half-Blood Prince (Harry ...,1:9896,652,4:556485,total:2298124,16,9,Scholastic Inc.,28062,2006,eng,J.K. Rowling,4.57,2:25317,5:1546466,3:159960,Harry kan dan ook niet wachten tot hij terug m...
1,2,Harry Potter and the Order of the Phoenix (Har...,1:12455,870,4:604283,total:2358637,1,9,Scholastic Inc.,29770,2004,eng,J.K. Rowling,4.5,2:37005,5:1493113,3:211781,Harry kan dan ook niet wachten tot hij terug m...
2,3,Harry Potter and the Sorcerer's Stone (Harry P...,1:108202,309,4:1513191,total:6587388,1,11,Scholastic Inc,75911,2003,eng,J.K. Rowling,4.47,2:130310,5:4268227,3:567458,Harry kan dan ook niet wachten tot hij terug m...
3,4,Harry Potter and the Chamber of Secrets (Harry...,1:11896,352,4:706082,total:2560657,1,11,Scholastic,244,2003,eng,J.K. Rowling,4.42,2:49353,5:1504505,3:288821,Harry kan dan ook niet wachten tot hij terug m...
4,5,Harry Potter and the Prisoner of Azkaban (Harr...,1:10128,435,4:630534,total:2610317,1,5,Scholastic Inc.,37093,2004,eng,J.K. Rowling,4.57,2:24849,5:1749958,3:194848,Harry kan dan ook niet wachten tot hij terug m...


In [33]:
def clean_text(text):
    text = text.lower().strip()
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    return text

df_users_copy['clean_name'] = df_users_copy['name'].apply(clean_text)
df_books_copy['clean_name'] = df_books_copy['name'].apply(clean_text)

In [34]:
# Map each cleaned title to its book_id (if a cleaned title occurs more than once, the last book_id wins)
name_to_book_id_map = df_books_copy.set_index('clean_name')['book_id'].to_dict()
df_users_copy['book_id'] = df_users_copy['clean_name'].map(name_to_book_id_map)
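
Several books can share the same cleaned title (different editions, for example), and a dictionary holds only one value per key. The toy sketch below shows which `book_id` survives in that case and one way to pick the first occurrence instead; the data is made up for illustration.

```python
import pandas as pd

books = pd.DataFrame({
    "book_id": [10, 20, 30],
    "clean_name": ["emma", "dune", "emma"],  # "emma" appears twice
})

# Building the dict directly: later rows overwrite earlier ones
last_wins = books.set_index("clean_name")["book_id"].to_dict()

# Dropping duplicate titles first makes the choice explicit
first_wins = (
    books.drop_duplicates(subset="clean_name", keep="first")
         .set_index("clean_name")["book_id"]
         .to_dict()
)

print(last_wins)
print(first_wins)
```

Either choice is defensible; the point is that the matched `book_id` for a repeated title depends on row order unless duplicates are handled explicitly.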

In [35]:
# Count the user ratings whose cleaned title matched a book
matched_count = df_users_copy['book_id'].notna().sum()
print(f"Directly matched count: {matched_count}")

Directly matched count: 234734


In [36]:
# Count the user ratings whose cleaned title did not match any book
unmatched_count = df_users_copy['book_id'].isna().sum()
print(f"Directly not matched count: {unmatched_count}")

Directly not matched count: 127862


In [37]:
# First, map directly where names match exactly (FAST)
df_users_copy['book_id'] = df_users_copy['clean_name'].map(name_to_book_id_map)

# Count directly matched entries
matched_count = df_users_copy['book_id'].notna().sum()
print(f"Directly matched count: {matched_count}")


Directly matched count: 234734


In [38]:
df_users_copy.head(5)

Unnamed: 0,user_id,name,rating,rating in numbers,clean_name,book_id
0,1,Agile Web Development with Rails: A Pragmatic ...,it was amazing,5,agile web development with rails a pragmatic g...,45.0
1,1,The Restaurant at the End of the Universe (Hit...,it was amazing,5,the restaurant at the end of the universe hitc...,862825.0
2,1,Siddhartha,it was amazing,5,siddhartha,828548.0
3,1,The Clock of the Long Now: Time and Responsibi...,really liked it,4,the clock of the long now time and responsibility,1788479.0
4,1,"Ready Player One (Ready Player One, #1)",really liked it,4,ready player one ready player one 1,


In [39]:
# Save the unmatched rows to a CSV file
unmatched_rows = df_users_copy[df_users_copy['book_id'].isna()]
unmatched_rows.to_csv("unmatched_books.csv", index=False)

In [40]:
# Save the matched rows to a CSV file
matched_rows = df_users_copy[df_users_copy['book_id'].notna()]
matched_rows.to_csv("matched_books.csv", index=False)

In [41]:
unique_book_ids = df_users_copy['book_id'].nunique()
print(f"Total unique book IDs: {unique_book_ids}")

Total unique book IDs: 51336


In [42]:
missing_values = matched_rows.isna().sum()
missing_values

user_id              0
name                 0
rating               0
rating in numbers    0
clean_name           0
book_id              0
dtype: int64

In [43]:
# Check for missing (NaN) values across all columns
missing_values = df_books_copy.isna().sum()
missing_values

book_id            0
name               0
ratingdist1        0
pagesNumber        0
ratingdist4        0
ratingdisttotal    0
publishmonth       0
publishday         0
publisher          0
countsofreview     0
publishyear        0
language           0
authors            0
rating             0
ratingdist2        0
ratingdist5        0
ratingdist3        0
description        0
clean_name         0
dtype: int64

### Simple Recommender Systems

#### For the books rated by users (about 51k unique book IDs)

In [369]:
# Step 1: Extract unique book IDs from users_df
unique_book_ids = matched_rows['book_id'].unique()

# Step 2: Filter book_df to keep only books that exist in users_df
filtered_books_df = df_books_copy[df_books_copy['book_id'].isin(unique_book_ids)].copy()

# Step 3: Ensure only unique book entries remain
filtered_books_df = filtered_books_df.drop_duplicates(subset=['book_id'])

In [371]:
def compute_weighted_recommendations(books_df, quantile_threshold=0.90):
    # Compute the mean rating across all books
    mean_rating = books_df['rating'].mean()

    # Determine the minimum number of reviews required for a book to be considered
    min_no_of_reviews = books_df['countsofreview'].quantile(quantile_threshold)

    # Filter books with at least `min_no_of_reviews` reviews
    popular_books = books_df[books_df['countsofreview'] >= min_no_of_reviews].copy()

    # Define the IMDB Weighted Rating function
    def weighted_rating(x, m=min_no_of_reviews, C=mean_rating):
        v = x['countsofreview']  # Number of reviews
        R = x['rating']          # Average rating
        return (v / (v + m) * R) + (m / (v + m) * C)

    # Apply the weighted rating formula
    popular_books['score'] = popular_books.apply(weighted_rating, axis=1)

    # Sort books based on score
    recommended_books = popular_books.sort_values('score', ascending=False)

    return recommended_books
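
The score above is WR = v/(v+m) * R + m/(v+m) * C, where `v` is the book's review count (`countsofreview`), `R` its average rating, `m` the review-count threshold (the 90th percentile by default), and `C` the mean rating over all books. A small worked example with made-up numbers shows how the formula pulls books with few reviews toward the global mean:

```python
def weighted_rating(R, v, m, C):
    """Blend a book's own rating R with the global mean C, weighted by review count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

C = 4.0   # illustrative global mean rating
m = 100   # illustrative review-count threshold

# A perfect rating backed by few reviews is shrunk halfway toward C
niche = weighted_rating(R=5.0, v=100, m=m, C=C)

# A slightly lower rating backed by many reviews keeps most of its value
popular = weighted_rating(R=4.6, v=900, m=m, C=C)

print(round(niche, 2), round(popular, 2))
```

With these numbers the heavily reviewed book ranks above the one with the perfect rating, which is the intended effect of the weighting.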

In [373]:
def display_recommended_books(recommended_books, top_n=10, use_rich=True):
    # Select relevant columns & handle missing values
    top_books = recommended_books[['name', 'authors', 'rating', 'countsofreview', 'score']].copy()
    top_books = top_books.fillna("N/A")  # Replace NaN values

    # Sort by score (highest first)
    top_books = top_books.sort_values(by="score", ascending=False).head(top_n)

    # Convert numeric columns to formatted strings (2 decimal places for rating, 5 for score)
    top_books["rating"] = top_books["rating"].apply(lambda x: f"{x:.2f}" if isinstance(x, (int, float)) else x)
    top_books["score"] = top_books["score"].apply(lambda x: f"{x:.5f}" if isinstance(x, (int, float)) else x)

    # Convert DataFrame to list of lists
    book_list = top_books.values.tolist()

    # Define headers
    headers = ["📚 Book Title", "✍️ Author(s)", "⭐ Rating", "🗳️ Review Count", "🏆 Score"]

    # Use Rich for Beautiful Console Output
    if use_rich:
        console = Console()
        table = Table(title="🎉📚 **Top Recommended Books** 📚🎉", show_lines=True)

        # Add headers with styling
        table.add_column("📚 Book Title", justify="left", style="black")
        table.add_column("✍️ Author(s)", justify="left", style="black")
        table.add_column("⭐ Rating", justify="center", style="black")
        table.add_column("🗳️ Review Count", justify="right", style="black")
        table.add_column("🏆 Score", justify="right", style="black")

        # Add rows dynamically
        for book in book_list:
            table.add_row(book[0], book[1], book[2], str(book[3]), book[4])

        console.print(table)

    # Use Tabulate for Grid-Style Tables (No Color)
    else:
        table_str = tabulate(
            book_list,
            headers=headers,
            tablefmt="fancy_grid",
            showindex=False
        )
        print("\n🎉📚 **Top Recommended Books** 📚🎉\n")
        print(table_str)

In [375]:
recommended_books_wr = compute_weighted_recommendations(filtered_books_df)

In [377]:
display_recommended_books(recommended_books_wr, use_rich=True, top_n=10)  # Show Top 10 Books

#### For the entire books DataFrame (more than 1 million rows)

In [380]:
recommended_books_whole_wr = compute_weighted_recommendations(df_books_copy)

In [381]:
display_recommended_books(recommended_books_whole_wr, use_rich=True, top_n=10)  # Show Top 10 Books

#### Content Based Recommender

In [450]:
def content_recommender(df_book, n_neighbors=10):
    
    # Convert descriptions into TF-IDF Vectors
    tfidf = TfidfVectorizer(stop_words='english', max_features=5000)  # Limiting to 5000 features for efficiency
    tfidf_matrix = tfidf.fit_transform(df_book['description'])

    # Nearest Neighbors Model (Efficient for Large Datasets)
    nn_model = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=n_neighbors + 1, n_jobs=-1)
    nn_model.fit(tfidf_matrix)

    # Create a Reverse Mapping (Book Title to Index)
    df_book = df_book.reset_index(drop=True)  # Ensure indices match the dataset
    indices = pd.Series(df_book.index, index=df_book['name'])
    indices = indices[~indices.index.duplicated(keep='first')]  # Keep one row position per title

    #  Recommendation Function
    def get_recommendations(title, df=df_book, model=nn_model, tfidf_matrix=tfidf_matrix, n=n_neighbors):
        
        # Check if the title exists, else find the closest match
        if title not in indices:
            book_titles = df['name'].tolist()
            matched_title = get_close_matches(title, book_titles, n=1, cutoff=0.6)
            if matched_title:
                title = matched_title[0]
            else:
                print("No close matches found. Please check the title and try again.")
                return None

        #  Get the index of the book that matches the title
        idx = indices[title]

        # Find the most similar books
        distances, indices_nn = model.kneighbors(tfidf_matrix[idx], n_neighbors=n + 1)

        # Extract book indices (excluding the first one, which is the input book itself)
        book_indices = indices_nn[0][1:]

        # Return the top recommended books
        return df[['book_id', 'name', 'authors', 'publisher']].iloc[book_indices]

    return get_recommendations
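
To make the mechanics concrete, here is a toy version of the same idea: descriptions become TF-IDF vectors, and a brute-force nearest-neighbor search with cosine distance returns the most similar rows. The four short descriptions are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

descriptions = [
    "young wizard attends magic school",   # 0
    "wizard school of magic and spells",   # 1
    "cooking recipes for quick dinners",   # 2
    "quick dinner recipes for families",   # 3
]

# Turn each description into a sparse TF-IDF vector
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(descriptions)

# Cosine distance = 1 - cosine similarity; brute force works with sparse input
nn_model = NearestNeighbors(metric="cosine", algorithm="brute")
nn_model.fit(tfidf_matrix)

# Ask for 2 neighbors of row 0: the row itself plus its closest other row
distances, neighbors = nn_model.kneighbors(tfidf_matrix[0], n_neighbors=2)
print(neighbors[0], distances[0])
```

The query row comes back first with a distance of about 0, which is why the recommender requests `n + 1` neighbors and skips the first one. That assumption holds only when no other book has an identical description vector.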

In [452]:
def display_recommendations_in_table(recommended_books, use_rich=True):

    headers = ["📖 Book ID", "📚 Book Title", "✍️ Author(s)", "🏢 Publisher"]

    # Convert DataFrame to list of lists
    book_list = recommended_books.values.tolist()

    # Use Rich for Beautiful Console Output
    if use_rich:
        console = Console()
        table = Table(title="🎉📚 **Top Recommended Books** 📚🎉", show_lines=True)

        # Add headers
        table.add_column("📖 Book ID", justify="center", style="black", no_wrap=True)
        table.add_column("📚 Book Title", justify="left", style="black")
        table.add_column("✍️ Author(s)", justify="left", style="black")
        table.add_column("🏢 Publisher", justify="left", style="black")

        # Add rows
        for book in book_list:
            table.add_row(str(book[0]), book[1], book[2], book[3])

        console.print(table)

    # Use Tabulate for Grid-Style Tables (No Color)
    else:
        table_str = tabulate(
            book_list,
            headers=headers,
            tablefmt="fancy_grid",
            showindex=False
        )
        print("\n🎉📚 **Top Recommended Books** 📚🎉\n")
        print(table_str)

In [454]:
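# Use a separate name for the returned function so the cell can be re-run without shadowing content_recommender
get_content_recommendations = content_recommender(filtered_books_df)
recommended_books_cr = get_content_recommendations('Harry Potter Series Box Set (Harry Potter, #1-7)')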
display_recommendations_in_table(recommended_books_cr, use_rich=True)

In [446]:
def metadata_recommender(df_book, feature_columns, n_neighbors=10):
    # Combine all metadata features into a single column
    # (note: this adds a 'metadata' column to the DataFrame that was passed in)
    df_book['metadata'] = df_book[feature_columns].astype(str).apply(lambda x: ' '.join(x), axis=1)

    #Convert metadata into TF-IDF vectors
    tfidf = TfidfVectorizer(stop_words='english', max_features=5000)  # Limits memory usage
    tfidf_matrix = tfidf.fit_transform(df_book['metadata'])

    # Use Nearest Neighbors for Efficient Similarity Search
    nn_model = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=n_neighbors + 1, n_jobs=-1)
    nn_model.fit(tfidf_matrix)

    # Create a Reverse Mapping (Book Title to Index)
    df_book = df_book.reset_index(drop=True)  # Ensure indices are aligned
    indices = pd.Series(df_book.index, index=df_book['name'])
    indices = indices[~indices.index.duplicated(keep='first')]  # Keep one row position per title

    # Metadata-Based Recommendation Function
    def get_recommendations(title, df=df_book, model=nn_model, tfidf_matrix=tfidf_matrix, n=n_neighbors):
        # Ensure title exists, else find closest match
        if title not in indices:
            matched_title = get_close_matches(title, df['name'].tolist(), n=1, cutoff=0.6)
            if matched_title:
                title = matched_title[0]
            else:
                print("No close matches found. Please check the title and try again.")
                return None

        # Get index of the book
        idx = indices[title]

        # Find similar books using Nearest Neighbors
        distances, indices_nn = model.kneighbors(tfidf_matrix[idx], n_neighbors=n + 1)

        # Extract book indices (excluding the input book itself)
        book_indices = indices_nn[0][1:]

        # Return the top recommended books
        return df[['book_id', 'name', 'authors', 'publisher']].iloc[book_indices]

    return get_recommendations
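
The recommenders look up a row position from a book title, so the lookup needs exactly one position per title even when titles repeat (for example, different editions of the same book). The toy sketch below contrasts dropping duplicate values with dropping duplicate index labels; the titles are made up.

```python
import pandas as pd

names = pd.Series(["Dune", "Emma", "Dune"])       # "Dune" appears twice
indices = pd.Series(names.index, index=names)     # title -> row position

# Series.drop_duplicates compares the VALUES (row positions), which are
# already unique, so repeated titles are all kept
by_value = indices.drop_duplicates()

# Index.duplicated compares the LABELS (titles), so each title is kept once
by_label = indices[~indices.index.duplicated(keep="first")]

print(len(by_value), len(by_label))
print(by_label["Dune"])   # a single row position
```

With repeated labels, `indices["Dune"]` returns a Series of positions rather than a single integer, which then selects several rows from the feature matrix instead of one.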

In [448]:
feature_columns = ['authors', 'clean_name', 'publisher']
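# Use a separate name for the returned function so the cell can be re-run without shadowing metadata_recommender
get_metadata_recommendations = metadata_recommender(filtered_books_df, feature_columns, n_neighbors=10)
recommended_books_md_r = get_metadata_recommendations('Harry Potter Series Box Set (Harry Potter, #1-7)')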
display_recommendations_in_table(recommended_books_md_r)

In [392]:
def hybrid_metadata_recommender(df_book, feature_columns, vector_size=100, n_neighbors=10):
    df_book = df_book.reset_index(drop=True)
    df_book[feature_columns] = df_book[feature_columns].fillna('')
    
    # Combine all metadata features into a single column
    df_book['metadata'] = df_book[feature_columns].astype(str).apply(lambda x: ' '.join(x), axis=1)

    # Applying TF-IDF Vectorization
    tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
    tfidf_matrix = tfidf.fit_transform(df_book['metadata'])

    # Applying LSA (Truncated SVD)
    svd = TruncatedSVD(n_components=200)  # Reduce dimensions to 200 topics
    lsa_matrix = svd.fit_transform(tfidf_matrix)

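    # Train Word2Vec on metadata
    # (workers must be a positive thread count; with workers=-1 gensim starts no worker threads and the vectors stay untrained)
    sentences = [row.split() for row in df_book['metadata']]
    w2v_model = Word2Vec(sentences, vector_size=vector_size, window=5, min_count=1, workers=os.cpu_count() or 1)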

    # Compute Word2Vec embeddings for each book
    def get_embedding(text):
        words = text.split()
        word_vectors = [w2v_model.wv[word] for word in words if word in w2v_model.wv]
        return np.mean(word_vectors, axis=0) if word_vectors else np.zeros(vector_size)

    df_book['w2v_embedding'] = df_book['metadata'].apply(get_embedding)
    word2vec_matrix = np.vstack(df_book['w2v_embedding'].values)

    # Combine LSA and Word2Vec embeddings
    combined_matrix = np.hstack((lsa_matrix, word2vec_matrix))

    # Using Nearest Neighbors for efficient similarity search
    nn_model = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=n_neighbors + 1, n_jobs=-1)
    nn_model.fit(combined_matrix)

    # Create a Reverse Mapping (Book Title to Index)
    indices = pd.Series(df_book.index, index=df_book['name'])
    indices = indices[~indices.index.duplicated(keep='first')]  # Keep one row position per title

    # Hybrid Recommendation Function
    def get_recommendations(title, df=df_book, model=nn_model, combined_matrix=combined_matrix, n=n_neighbors):
        # Ensure title exists, else find closest match
        if title not in indices:
            matched_title = get_close_matches(title, df['name'].tolist(), n=1, cutoff=0.6)
            if matched_title:
                title = matched_title[0]
            else:
                print("No close matches found. Please check the title and try again.")
                return None

        # Get index of the book
        idx = indices[title]

        # Find similar books using Nearest Neighbors
        distances, indices_nn = model.kneighbors(combined_matrix[idx].reshape(1, -1), n_neighbors=n + 1)

        # Extract book indices (excluding the input book itself)
        book_indices = indices_nn[0][1:]

        # Return the top recommended books
        return df[['book_id', 'name', 'authors', 'publisher']].iloc[book_indices]

    return get_recommendations

In [440]:
feature_columns = ['authors', 'publisher', 'clean_name']
hybrid_recommender = hybrid_metadata_recommender(filtered_books_df, feature_columns)
recommended_books_hy_r = hybrid_recommender('Harry Potter Series Box Set (Harry Potter, #1-7)')
display_recommendations_in_table(recommended_books_hy_r)

In [401]:
def feature_based_recommender(df_book, feature_columns, n_neighbors=10):
    # Create a copy of the DataFrame to prevent modifying the original
    df_copy = df_book.copy()

    # Preserve the original publisher column before encoding
    df_copy['publisher_original'] = df_copy['publisher']

    # Encode categorical features (on the copied DataFrame)
    label_encoders = {}  # Store encoders in case they are needed later
    for col in feature_columns:
        if df_copy[col].dtype == "object":
            le = LabelEncoder()
            df_copy[col] = le.fit_transform(df_copy[col])
            label_encoders[col] = le  # encoder for debugging if needed

    # Scale features
    scaled_features = MinMaxScaler().fit_transform(df_copy[feature_columns])

    # Train Nearest Neighbors model
    nn_model = NearestNeighbors(metric='cosine', n_neighbors=n_neighbors + 1, n_jobs=-1)
    nn_model.fit(scaled_features)

    # Create reverse mapping for book titles
    df_copy = df_copy.reset_index(drop=True)
    indices = pd.Series(df_copy.index, index=df_copy['name'])
    indices = indices[~indices.index.duplicated(keep='first')]  # Keep one row position per title

    # Returns top similar books based on features
    def get_recommendations(title, n_neighbors=n_neighbors):
        if title not in indices:
            match = get_close_matches(title, indices.index, n=1, cutoff=0.6)
            if match:
                title = match[0]
            else:
                print("No close matches found.")
                return None

        idx = indices[title]
        distances, neighbors = nn_model.kneighbors([scaled_features[idx]], n_neighbors=n_neighbors + 1)

        # Return recommended books (excluding the input book itself)
        return df_copy[['book_id', 'name', 'authors', 'publisher_original']].iloc[neighbors[0][1:]].rename(columns={'publisher_original': 'publisher'})

    return get_recommendations
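
One thing worth keeping in mind when reading the results of this recommender: `LabelEncoder` assigns integer codes in sorted (alphabetical) order, so after scaling, two publishers look "close" when their names are close alphabetically, not when they publish similar books. A small sketch with three example publisher names:

```python
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

publishers = ["Scholastic", "Bloomsbury", "Penguin"]

# Codes follow the sorted class order: Bloomsbury=0, Penguin=1, Scholastic=2
codes = LabelEncoder().fit_transform(publishers)

# Min-max scaling maps the codes linearly onto [0, 1]
scaled = MinMaxScaler().fit_transform(codes.reshape(-1, 1).astype(float)).ravel()

print(codes.tolist(), scaled.tolist())
```

Here "Penguin" lands exactly halfway between the other two purely because of its spelling. One-hot encoding is a common alternative when the category has no natural order, at the cost of many more columns.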

In [438]:
feature_columns = ['pagesNumber', 'publisher', 'rating', 'countsofreview', 'publishyear']
feature_recommender_system = feature_based_recommender(filtered_books_df, feature_columns)
recommended_books_fr_r = feature_recommender_system("Harry Potter Series Box Set (Harry Potter, #1-7)", n_neighbors=10)
display_recommendations_in_table(recommended_books_fr_r)

In [422]:
# filtered_books_df.loc[filtered_books_df['name'] == "J.R.R. Tolkien 4-Book Boxed Set: The Hobbit and The Lord of the Rings"]

#### Read the book descriptions to confirm that the recommendations make sense

In [436]:
# Identify the correct column name for book titles
actual_col_name = [col for col in filtered_books_df.columns if "name" in col.lower()]
book_title_column = actual_col_name[0] if actual_col_name else "name"  # Default to 'name'

# Get the description for the specific book
book_title = "Harry Potter Series Box Set (Harry Potter, #1-7)"
book_description = filtered_books_df.loc[filtered_books_df[book_title_column] == book_title, 'description']

# Display the result
print(book_description.iloc[0] if not book_description.empty else "Book not found in dataset.")

Over 4000 pages of Harry Potter and his world, including all 7 books.<br /><br />All seven eBooks in the multi-award winning, internationally bestselling Harry Potter series, available as one download with stunning cover art by Olly Moss. Enjoy the stories that have captured the imagination of millions worldwide.
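
The printed description still contains raw `<br />` tags. If cleaner text is wanted for reading or for later text features, one possible approach is sketched below using only the standard library. `clean_description` is a hypothetical helper, and the regex-based tag stripping assumes the descriptions contain only simple inline HTML.

```python
import html
import re

def clean_description(text):
    """Turn <br> tags into newlines, drop other tags, and unescape entities."""
    text = re.sub(r"<br\s*/?>", "\n", text, flags=re.IGNORECASE)
    text = re.sub(r"<[^>]+>", "", text)  # remove any remaining simple tags
    text = html.unescape(text)           # e.g. &amp; -> &
    # Collapse runs of blank lines left behind by consecutive <br> tags.
    return re.sub(r"\n{2,}", "\n\n", text).strip()

# First two sentences of the description printed above
raw = (
    "Over 4000 pages of Harry Potter and his world, including all 7 books."
    "<br /><br />All seven eBooks in the multi-award winning, internationally "
    "bestselling Harry Potter series, available as one download with stunning "
    "cover art by Olly Moss."
)
cleaned = clean_description(raw)
print(cleaned)
```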


In [468]:
# Identify the correct column name for book titles
book_title_column = "name"  # Based on the dataset structure

# Find all books related to "Harry Potter" in the dataset
harry_potter_books = filtered_books_df[filtered_books_df[book_title_column].str.contains("Harry Potter", case=False, na=False)]
harry_potter_books

| Unnamed: 0 | book_id | name | ratingdist1 | pagesNumber | ratingdist4 | ratingdisttotal | publishmonth | publishday | publisher | countsofreview | publishyear | language | authors | rating | ratingdist2 | ratingdist5 | ratingdist3 | description | clean_name | metadata |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 8 | Harry Potter Boxed Set, Books 1-5 (Harry Potte... | 1:402 | 2690 | 4:4650 | total:43968 | 13 | 9 | Scholastic | 166 | 2004 | eng | J.K. Rowling | 4.78 | 2:283 | 5:37432 | 3:1201 | Harry kan dan ook niet wachten tot hij terug m... | harry potter boxed set books 15 harry potter 15 | J.K. Rowling harry potter boxed set books 15 h... |
| 8 | 10 | Harry Potter Collection (Harry Potter, #1-6) | 1:257 | 3342 | 4:4358 | total:30313 | 12 | 9 | Scholastic | 809 | 2005 | eng | J.K. Rowling | 4.73 | 2:218 | 5:24406 | 3:1074 | Harry kan dan ook niet wachten tot hij terug m... | harry potter collection harry potter 16 | J.K. Rowling harry potter collection harry pot... |
| 1293 | 2002 | Harry Potter Schoolbooks Box Set: Two Classic ... | 1:124 | 240 | 4:2847 | total:12706 | 1 | 11 | Arthur A. Levine | 140 | 2001 | eng | J.K. Rowling | 4.4 | 2:332 | 5:7751 | 3:1652 | Harry kan dan ook niet wachten tot hij terug m... | harry potter schoolbooks box set two classic b... | J.K. Rowling harry potter schoolbooks box set ... |
| 9579 | 15867 | Mugglenet.Com's What Will Happen in Harry Pott... | 1:244 | 216 | 4:1759 | total:9125 | 19 | 10 | Ulysses Press | 112 | 2006 | en-GB | Ben Schoen | 4.24 | 2:432 | 5:5223 | 3:1467 |  | mugglenetcoms what will happen in harry potter... | Ben Schoen mugglenetcoms what will happen in h... |
| 9581 | 15872 | Harry Potter y el misterio del príncipe (Harry... | 1:9903 | 602 | 4:556583 | total:2298697 | 28 | 2 | Salamandra | 398 | 2006 | spa | J.K. Rowling | 4.57 | 2:25321 | 5:1546901 | 3:159989 | Harry kan dan ook niet wachten tot hij terug m... | harry potter y el misterio del príncipe harry ... | J.K. Rowling harry potter y el misterio del pr... |
| 29739 | 49871 | Harry Potter aur Azkaban ka Qaidi (Harry Potte... | 1:10195 | 372 | 4:636788 | total:2635020 | 8 | 7 | Oxford University Press, USA | 12 | 2004 | eng | J.K. Rowling | 4.56 | 2:25043 | 5:1766209 | 3:196785 | Harry kan dan ook niet wachten tot hij terug m... | harry potter aur azkaban ka qaidi harry potter 3 | J.K. Rowling harry potter aur azkaban ka qaidi... |
| 40727 | 70365 | Harry Potter und der Orden des Phoenix | 1:12474 | 1024 | 4:604751 | total:2361179 | 6 | 11 | Carlsen | 0 | 2003 | ger | J.K. Rowling | 4.5 | 2:37026 | 5:1494969 | 3:211959 | Harry kan dan ook niet wachten tot hij terug m... | harry potter und der orden des phoenix | J.K. Rowling harry potter und der orden des ph... |
| 57890 | 99298 | The Harry Potter Collection 1-4 (Harry Potter,... | 1:363 | 1500 | 4:8808 | total:50741 | 1 | 11 | Scholastic, Inc. | 284 | 1999 | eng | J.K. Rowling | 4.67 | 2:450 | 5:38627 | 3:2493 | Harry kan dan ook niet wachten tot hij terug m... | the harry potter collection 14 harry potter 14 | J.K. Rowling the harry potter collection 14 ha... |
| 105999 | 113271 | Ultimate Unofficial Guide to the Mysteries of ... | 1:3 | 260 | 4:29 | total:102 | 1 | 7 | Wizarding World Press | 3 | 2005 | eng | Galadriel Waters | 4.07 | 2:4 | 5:45 | 3:21 |  | ultimate unofficial guide to the mysteries of ... | Galadriel Waters ultimate unofficial guide to ... |
| 122595 | 142294 | Harry Potter Y El Prisionero De Azkaban (Harry... | 1:10274 | 359 | 4:639222 | total:2648183 | 1 | 6 | Turtleback Books | 20 | 2015 | spa | J.K. Rowling | 4.57 | 2:25115 | 5:1776247 | 3:197325 | Harry kan dan ook niet wachten tot hij terug m... | harry potter y el prisionero de azkaban harry ... | J.K. Rowling harry potter y el prisionero de a... |
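
The `ratingdist*` columns in the output above are stored as `label:count` strings (for example `5:37432` or `total:43968`), so they need parsing before they can be used as numeric features. A minimal sketch, assuming every cell follows that pattern; `parse_rating_count` is a hypothetical helper, and the values are copied from the first row shown (book_id 8).

```python
def parse_rating_count(cell):
    """Extract the integer count from a 'label:count' string such as '5:37432'."""
    _, _, count = str(cell).rpartition(":")
    return int(count)

# Values copied from the first row of the output above (book_id 8)
row = {
    "ratingdist1": "1:402",
    "ratingdist2": "2:283",
    "ratingdist3": "3:1201",
    "ratingdist4": "4:4650",
    "ratingdist5": "5:37432",
    "ratingdisttotal": "total:43968",
}
counts = {col: parse_rating_count(val) for col, val in row.items()}

# Sanity check: rebuild the average rating from the star distribution
weighted = sum(star * counts[f"ratingdist{star}"] for star in range(1, 6))
mean_rating = weighted / counts["ratingdisttotal"]
print(counts["ratingdisttotal"], round(mean_rating, 2))
```

For this row the five star counts sum to the stated total, and the recomputed mean rounds to the `rating` value shown (4.78). On the full DataFrame, a vectorized equivalent would likely be `df[col].str.split(':').str[-1].astype(int)`, provided no cells are missing or malformed.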
