# **AI/Machine Learning Intern Challenge: Simple Content-Based Recommendation**

**Public Book Dataset for Content Recommendation**

This dataset contains a collection of books with descriptions, genres, and other metadata, making it suitable for building a content-based recommendation system using TF-IDF and cosine similarity.

**Content-Based Book Recommendation System**

This content-based book recommendation system suggests books based on their textual descriptions and genres. By leveraging TF-IDF (Term Frequency-Inverse Document Frequency) vectorization and cosine similarity, the system matches books to a user’s input description, identifying those with the most relevant themes and content.

**How It Works:**

**Dataset Preparation:** Each book has a title, description, and genres stored in a dataset.

**Text Processing:** The system combines book descriptions and genres into a single text feature.

**TF-IDF Transformation:** Converts the text into numerical vectors, highlighting important words.
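
As a minimal sketch of this step, the snippet below fits scikit-learn's `TfidfVectorizer` on three made-up blurbs. The `toy_texts` list is purely illustrative and is not drawn from the dataset. It shows that a term found in fewer texts receives a larger inverse-document-frequency weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; the real system uses the combined description + genre text.
toy_texts = [
    "a detective solves a murder mystery",
    "a wizard school fantasy adventure",
    "a psychological mystery thriller",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(toy_texts)  # sparse matrix, one row per text

vocab = vectorizer.vocabulary_  # maps each term to its column index
idf = vectorizer.idf_           # learned inverse document frequency per column

# "mystery" occurs in two texts and "wizard" in one, so "wizard" has the larger IDF.
print(tfidf_matrix.shape[0])
print(idf[vocab["wizard"]] > idf[vocab["mystery"]])
```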

**User Input Analysis:** The user provides a brief description of what they want to read.

**Cosine Similarity Computation:** Measures the similarity between the user’s input and all book descriptions in the dataset.
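
To make the measure concrete: cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on direction rather than magnitude. The sketch below uses small hand-written vectors (hypothetical stand-ins for TF-IDF rows) and checks a manual calculation against scikit-learn's `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical 3-term vectors standing in for TF-IDF rows.
user_vec = np.array([[1.0, 1.0, 0.0]])
book_vecs = np.array([
    [1.0, 1.0, 0.0],  # same direction as the query
    [1.0, 0.0, 0.0],  # shares one term with the query
    [0.0, 0.0, 1.0],  # shares no terms with the query
])

# Dot product divided by the product of the vector norms.
manual = (book_vecs @ user_vec.T).ravel() / (
    np.linalg.norm(book_vecs, axis=1) * np.linalg.norm(user_vec)
)
library = cosine_similarity(user_vec, book_vecs).flatten()

print(manual)                        # approximately [1.0, 0.707, 0.0]
print(np.allclose(manual, library))  # the two computations agree
```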

**Recommendation Output:** Returns the top N books that best match the user’s preferences.
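
Selecting the top N matches comes down to sorting the similarity scores in descending order and keeping the first N positions. A minimal sketch, assuming a hypothetical `similarities` array with one score per book:

```python
import numpy as np

# Hypothetical similarity scores for five books (position = row in the dataset).
similarities = np.array([0.10, 0.72, 0.05, 0.58, 0.33])
top_n = 3

# argsort returns ascending order, so reverse it and keep the first top_n positions.
top_indices = np.argsort(similarities)[::-1][:top_n]
top_scores = similarities[top_indices]

print(top_indices.tolist())  # → [1, 3, 4]
```

The resulting positions can then be passed to `DataFrame.iloc` to look up the matching titles.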

In [4]:
import pandas as pd
import numpy as np

# Load the full dataset
books = pd.read_csv('books_enriched.csv')
rating = pd.read_csv('ratings.csv')

# Merge the datasets
merged_books = pd.merge(books, rating, on='book_id')

# Create bins for ratings to use in stratification
merged_books['rating_bin'] = pd.cut(merged_books['average_rating'], bins=5)

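# Perform stratified sampling: each rating bin contributes rows in proportion to its size.
# int() truncates each bin's share, so a nominal target of 502 yields 500 rows here.
sampled_books = merged_books.groupby('rating_bin', group_keys=False).apply(lambda x: x.sample(n=min(len(x), int(502*len(x)/len(merged_books)))))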

# Reset the index of the sampled dataset
sampled_books = sampled_books.reset_index(drop=True)

# Verify the number of rows in the sampled dataset
print(f"Number of rows in sampled dataset: {len(sampled_books)}")


Number of rows in sampled dataset: 500



In [5]:
sampled_books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 33 columns):
 #   Column                     Non-Null Count  Dtype   
---  ------                     --------------  -----   
 0   Unnamed: 0                 500 non-null    int64   
 1   index                      500 non-null    int64   
 2   authors                    500 non-null    object  
 3   average_rating             500 non-null    float64 
 4   best_book_id               500 non-null    int64   
 5   book_id                    500 non-null    int64   
 6   books_count                500 non-null    int64   
 7   description                495 non-null    object  
 8   genres                     500 non-null    object  
 9   goodreads_book_id          500 non-null    int64   
 10  image_url                  500 non-null    object  
 11  isbn                       476 non-null    object  
 12  isbn13                     477 non-null    float64 
 13  language_code              500 non-

In [50]:
df = pd.DataFrame(sampled_books)

# Save as CSV
df.to_csv('dataset.csv', index=False)

In [65]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re

def escape_special_chars(text):
    return re.escape(text)

def extract_genre(description, genre_list):
    extracted_genres = [genre for genre in genre_list if re.search(rf'\b{escape_special_chars(genre)}\b', description, re.IGNORECASE)]
    return ' '.join(extracted_genres) if extracted_genres else ''

def recommend_books(user_description, dataset_path, top_n=5):
    # Load dataset
    df = pd.read_csv(dataset_path)

    # Fill missing values
    df['processed_description'] = df['processed_description'].fillna('')
    df['processed_genres'] = df['processed_genres'].fillna('')

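    # Extract genre from user's description.
    # Assumes processed_genres holds whitespace-separated genre tokens; split them
    # so each genre is matched on its own rather than as one long string.
    all_genres = set(df['processed_genres'].str.split().explode().dropna().unique())
    user_genre = extract_genre(user_description, all_genres)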

    # Combine description and genre into one text feature
    df['combined_text'] = df['processed_description'] + " " + df['processed_genres']

    # TF-IDF Vectorization
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(df['combined_text'])

    # Transform user input
    user_input = user_description + " " + user_genre
    user_vector = vectorizer.transform([user_input])

    # Compute cosine similarity
    similarities = cosine_similarity(user_vector, tfidf_matrix).flatten()

    # Rank every book by similarity, drop duplicate rows (the merged dataset can
    # repeat a book), then keep the top N so duplicates do not shrink the result
    ranked_indices = similarities.argsort()[::-1]
    recommendations = (
        df.iloc[ranked_indices][['title', 'description', 'genres']]
        .drop_duplicates()
        .head(top_n)
    )

    # Display recommendations properly
    if recommendations.empty:
        print("No recommendations found.")
    else:
        print("Top Recommended Books:\n")
        for index, row in recommendations.iterrows():
            print(f"Title: {row['title']}\nDescription: {row['description']}\nGenres: {row['genres']}\n{'-'*50}")

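    return recommendations


# Example usage
if __name__ == "__main__":
    user_desc = input("Enter a description: ")
    dataset_path = "/content/dataset.csv"
    # recommend_books already prints the results, so the returned DataFrame is not printed again
    recommendations = recommend_books(user_desc, dataset_path)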


Enter a description: A thrilling mystery novel with deep psychological twists.
Top Recommended Books:

Title: second life
Description: The sensational new psychological thriller from the bestselling author of Before I Go To Sleep.She loves her husband. She's obsessed by a stranger.She's a devoted mother. She's prepared to lose everything.She knows what she's doing. She's out of control.She's innocent. She's guilty as sin.She's living two lives. She might lose both . . .
Genres: ['thriller', 'fiction', 'mystery', 'suspense', 'crime', 'contemporary', 'books']
--------------------------------------------------
Title: twilight: the graphic novel, vol. 1 (twilight: the graphic novel, #1)
Description: When Isabella Swan moves to the gloomy town of Forks and meets the mysterious, alluring Edward Cullen, her life takes a thrilling and terrifying turn. With his porcelain skin, golden eyes, mesmerizing voice, and supernatural gifts, Edward is both irresistible and impenetrable. Up until now, he 