
Content-Based Book Recommendation System

Module E: AI Applications – Individual Open Project

1. Problem Definition & Objective

a. Selected Project Track

AI Application Track: Recommendation System
Technique Category: Natural Language Processing (NLP)

b. Clear Problem Statement

With the rapid increase in digital content, readers face difficulty in discovering books that match their interests due to the overwhelming number of available options. Traditional keyword-based search systems require users to explicitly search for book titles or authors, which limits personalization and exploration.

The problem addressed in this project is to design an AI-based content recommendation system that automatically suggests similar books based on the textual content of books a user likes. The system uses Natural Language Processing techniques to analyze book descriptions and recommend relevant books.



c. Real-World Relevance & Motivation

Book recommendation systems are widely used in online bookstores, digital libraries, and e-learning platforms to enhance user engagement and personalization. Content-based recommendation helps users discover books without relying on ratings or historical user data.

This project demonstrates the practical use of TF-IDF vectorization and cosine similarity to solve a real-world personalization problem in an interpretable and ethical manner.

2. Data Understanding & Preparation

a. Dataset Source

Dataset Type: Synthetic dataset (created for educational purposes)

Reason: Avoids privacy issues and allows full control over features

Dataset Attributes:

Book Title

Genre

Book Description

b. Data Loading & Exploration

In [1]:
import pandas as pd
import numpy as np

data = {
    "title": [
        "The Alchemist",
        "Harry Potter",
        "The Hobbit",
        "Atomic Habits",
        "Rich Dad Poor Dad",
        "Think and Grow Rich",
        "The Power of Habit",
        "Game of Thrones",
        "Lord of the Rings",
        "Psychology of Money"
    ],
    "genre": [
        "Fiction",
        "Fantasy",
        "Fantasy",
        "Self Help",
        "Finance",
        "Finance",
        "Self Help",
        "Fantasy",
        "Fantasy",
        "Finance"
    ],
    "description": [
        "A philosophical journey of self discovery destiny and purpose",
        "A young wizard battles dark magic friendship and courage",
        "A fantasy adventure involving dragons magic and heroic quests",
        "Building good habits breaking bad habits and self improvement",
        "Personal finance investing assets and financial independence",
        "Success mindset wealth creation and personal growth",
        "Science behind habits behavior change and productivity",
        "Epic fantasy political power wars dragons and kingdoms",
        "Mythical fantasy journey with rings magic and dark forces",
        "Behavioral finance money mindset and emotional investing"
    ]
}

df = pd.DataFrame(data)
df.head()


|   | title | genre | description |
|---|---|---|---|
| 0 | The Alchemist | Fiction | A philosophical journey of self discovery dest... |
| 1 | Harry Potter | Fantasy | A young wizard battles dark magic friendship a... |
| 2 | The Hobbit | Fantasy | A fantasy adventure involving dragons magic an... |
| 3 | Atomic Habits | Self Help | Building good habits breaking bad habits and s... |
| 4 | Rich Dad Poor Dad | Finance | Personal finance investing assets and financia... |


c. Cleaning, Preprocessing & Feature Engineering

*   Converted text to lowercase
*   Removed punctuation, digits, and special characters (everything except the letters a-z and whitespace)
*   Combined genre + description so that the genre label also contributes to the similarity score


d. Handling Missing Values or Noise

The dataset does not contain missing or noisy values.
If missing values were present, text imputation or row removal would be applied.
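
Since the dataset here is complete, the following is only a minimal sketch of how the fallback described above might look. The sample frame, and the choice to drop rows without a description while filling a missing genre with an empty string, are assumptions made for illustration and are not part of the project code.

```python
import pandas as pd

# Hypothetical sample with gaps; the project dataset itself has none
sample = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "genre": ["Fantasy", None, "Finance"],
    "description": ["Dragons and magic", "Habits and focus", None],
})

# Row removal: a book without a description has no text to vectorize
cleaned = sample.dropna(subset=["description"]).reset_index(drop=True)

# Text imputation: a missing genre becomes an empty string so that
# string concatenation with the description still works
cleaned["genre"] = cleaned["genre"].fillna("")

print(cleaned.isna().sum())
print(cleaned)
```

Dropping rows is the safer default for the description column because an imputed description would create artificial similarity between unrelated books.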

In [2]:
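import re

def clean_text(text):
    """Lowercase the text and keep only the letters a-z separated by single spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)  # drop punctuation, digits, and other symbols
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()

df["clean_description"] = df["description"].apply(clean_text)
# Genre is left as-is here; TfidfVectorizer lowercases its input by default
df["combined_features"] = df["genre"] + " " + df["clean_description"]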


3. Model / System Design

a. AI Technique Used

Content-Based Recommendation System

Natural Language Processing (NLP)

TF-IDF Vectorization

Cosine Similarity

b. Architecture / Pipeline Explanation

Input book title

Text preprocessing and feature combination

TF-IDF vectorization

Similarity computation using cosine similarity

Ranking similar books

Recommendation output

c. Justification of Design Choices

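TF-IDF up-weights terms that are distinctive for a book and down-weights terms that occur in most descriptions

Cosine similarity compares the direction of two TF-IDF vectors, so description length has little effect on the score

Content-based filtering avoids the user cold-start problem because it needs no ratings or interaction history

The model is interpretable, since every score traces back to shared terms; the full pairwise similarity matrix grows quadratically with the number of books, so larger catalogues would need top-k or approximate search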
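
To make the first two points concrete, the sketch below computes TF-IDF weights and cosine similarity by hand for three descriptions from the dataset. It is a simplified illustration, not the project's implementation: it uses unigrams only, a tiny hand-picked stop-word set (an assumption), and the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 with L2 normalization, which is the default in scikit-learn's `TfidfVectorizer`.

```python
import numpy as np

docs = [
    "a fantasy adventure involving dragons magic and heroic quests",
    "epic fantasy political power wars dragons and kingdoms",
    "personal finance investing assets and financial independence",
]

# Tiny illustrative stop-word set (assumption; real lists are much longer)
stop_words = {"a", "and"}
tokens = [[t for t in d.split() if t not in stop_words] for d in docs]

vocab = sorted({t for doc in tokens for t in doc})
col = {term: j for j, term in enumerate(vocab)}

# Term-frequency matrix: rows are documents, columns are vocabulary terms
tf = np.zeros((len(docs), len(vocab)))
for i, doc in enumerate(tokens):
    for t in doc:
        tf[i, col[t]] += 1

# Smoothed inverse document frequency: terms found in fewer documents weigh more
n_docs = len(docs)
doc_freq = (tf > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight, then L2-normalize each row so a dot product equals cosine similarity
tfidf = tf * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)

similarity = tfidf @ tfidf.T
print(np.round(similarity, 3))
```

The two fantasy descriptions share the terms "fantasy" and "dragons", so their similarity is positive, while the finance description shares no terms with either and scores zero against both.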

4. Core Implementation

a. Model Inference Logic

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer(
    stop_words="english",
    ngram_range=(1, 2),
    max_features=300
)

tfidf_matrix = vectorizer.fit_transform(df["combined_features"])
cosine_sim_matrix = cosine_similarity(tfidf_matrix)


c. Recommendation Pipeline

In [4]:
def recommend_books(book_title, top_n=5):
    # Guard against titles that are not in the dataset
    if book_title not in df["title"].values:
        return "Book not found in dataset"

    # Row position of the query book, which matches the row order of cosine_sim_matrix
    idx = int(np.flatnonzero(df["title"].values == book_title)[0])

    # Exclude the query book explicitly instead of assuming it sorts first
    similarity_scores = [
        (i, score) for i, score in enumerate(cosine_sim_matrix[idx]) if i != idx
    ]
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)

    recommendations = []
    for i, score in similarity_scores[:top_n]:
        recommendations.append({
            "Book Title": df.iloc[i]["title"],
            "Genre": df.iloc[i]["genre"],
            "Similarity Score": round(float(score), 3)
        })

    return pd.DataFrame(recommendations)


d. End-to-End Execution

In [5]:
recommend_books("Harry Potter", top_n=4)


|   | Book Title | Genre | Similarity Score |
|---|---|---|---|
| 0 | Lord of the Rings | Fantasy | 0.157 |
| 1 | The Hobbit | Fantasy | 0.103 |
| 2 | Game of Thrones | Fantasy | 0.062 |
| 3 | The Alchemist | Fiction | 0.0 |
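
The last row of this output has a similarity of 0.0, which means it shares no terms with the query and only fills the requested `top_n`. A minimal sketch of one possible post-filter follows; the helper `drop_unrelated` and its `min_score` parameter are hypothetical and not part of the notebook.

```python
import pandas as pd

# The output shown above, reproduced as a DataFrame
results = pd.DataFrame({
    "Book Title": ["Lord of the Rings", "The Hobbit", "Game of Thrones", "The Alchemist"],
    "Genre": ["Fantasy", "Fantasy", "Fantasy", "Fiction"],
    "Similarity Score": [0.157, 0.103, 0.062, 0.0],
})

def drop_unrelated(recommendations, min_score=0.0):
    """Keep only rows whose similarity is strictly above min_score (hypothetical helper)."""
    keep = recommendations["Similarity Score"] > min_score
    return recommendations[keep].reset_index(drop=True)

filtered = drop_unrelated(results)
print(filtered)
```

With this filter a query may return fewer than `top_n` rows, which is arguably more honest than padding the list with unrelated titles.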


5. Evaluation & Analysis

a. Metrics Used

Qualitative evaluation

Similarity score ranking

Manual relevance verification
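
The manual check could be complemented by a rough quantitative proxy: the share of top-k recommendations that have the same genre as the query book. The sketch below is an illustration under stated assumptions, not a result from this project. It treats the genre label as a stand-in for relevance, uses a five-book subset with unigram TF-IDF, and the function `genre_precision_at_k` is hypothetical. Because genre is also part of the input features, the score is optimistic by construction.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A subset of the project's books (assumption: five rows are enough to illustrate)
books = pd.DataFrame({
    "title": ["Harry Potter", "The Hobbit", "Lord of the Rings",
              "Rich Dad Poor Dad", "Psychology of Money"],
    "genre": ["Fantasy", "Fantasy", "Fantasy", "Finance", "Finance"],
    "description": [
        "a young wizard battles dark magic friendship and courage",
        "a fantasy adventure involving dragons magic and heroic quests",
        "mythical fantasy journey with rings magic and dark forces",
        "personal finance investing assets and financial independence",
        "behavioral finance money mindset and emotional investing",
    ],
})
combined = books["genre"] + " " + books["description"]
sim = cosine_similarity(TfidfVectorizer(stop_words="english").fit_transform(combined))

def genre_precision_at_k(query_pos, k=2):
    """Fraction of the top-k neighbours (query excluded) that share the query's genre."""
    scores = sim[query_pos].copy()
    scores[query_pos] = -np.inf  # never count the query book itself
    top = np.argsort(scores)[::-1][:k]
    same = books["genre"].to_numpy()[top] == books["genre"].iloc[query_pos]
    return float(same.mean())

per_book = [genre_precision_at_k(i) for i in range(len(books))]
print(dict(zip(books["title"], per_book)))
print("mean precision@2:", float(np.mean(per_book)))
```

A genre with fewer than k + 1 books can never reach a perfect score, so k should be chosen with the smallest genre in mind.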

b. Sample Output

Input: Harry Potter

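Output: The three other Fantasy titles rank first (scores 0.157, 0.103, and 0.062). The fourth result, The Alchemist, has a score of 0.0 and only fills the requested top_n.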

c. Performance Analysis & Limitations

Strengths

Fast and interpretable

No user data required

Suitable for small datasets

Limitations

No user-level personalization: recommendations depend only on the query book

Limited dataset size

Cannot learn user preferences over time

6. Ethical Considerations & Responsible AI

a. Bias & Fairness

Recommendations depend on dataset balance

Genre dominance may influence results
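
A quick way to see how balanced the dataset is would be to look at the share of each genre. The sketch below uses the ten genre labels from the dataset; the one-third threshold is a hypothetical rule of thumb chosen for illustration, not an established standard.

```python
import pandas as pd

# Genre labels of the ten books in the dataset
genres = pd.Series([
    "Fiction", "Fantasy", "Fantasy", "Self Help", "Finance",
    "Finance", "Self Help", "Fantasy", "Fantasy", "Finance",
], name="genre")

# Relative frequency of each genre
share = genres.value_counts(normalize=True)
print(share)

# Hypothetical rule of thumb: flag genres holding more than a third of the catalogue
dominant = share[share > 1 / 3].index.tolist()
print("Dominant genres:", dominant)
```

A genre that dominates the catalogue has more candidates to offer for any query, so it is more likely to appear in recommendation lists even when the match is weak.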

b. Dataset Limitations

Synthetic dataset

Limited genre diversity

c. Responsible Use of AI

No personal data used

Transparent recommendation logic

Educational use only

7. Conclusion & Future Scope

a. Summary of Results

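The project implements a content-based book recommendation system using TF-IDF vectorization and cosine similarity. On the ten-book synthetic dataset, the sample query for Harry Potter ranks the three other Fantasy titles first, which supports the approach qualitatively. The similarity scores are low (at most 0.157) because the descriptions are short, and the evaluation is qualitative only.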

b. Possible Improvements & Extensions

Hybrid recommendation system

User ratings integration

Deep learning embeddings (BERT)

Web-based deployment