# Book Recommendation System - ML Analysis

This notebook demonstrates the machine learning approach for building a content-based book recommendation system.

## Approach
1. **Data Collection**: Load and explore book dataset
2. **Feature Engineering**: Create TF-IDF vectors from book descriptions
3. **Similarity Calculation**: Use cosine similarity to find similar books
4. **Model Evaluation**: Assess recommendation quality
5. **Visualization**: Show insights and results

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
import pickle
import os

# Set plotting style
plt.style.use('default')
sns.set_palette('husl')

print("Libraries imported successfully!")

## 1. Data Loading and Exploration

In [None]:
# Load the dataset
books_df = pd.read_csv('../data/books.csv')

print(f"Dataset shape: {books_df.shape}")
print(f"Columns: {list(books_df.columns)}")

# Display first few rows
books_df.head()
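The rest of the notebook reads the `title`, `author`, `genre`, `description`, `rating`, and `year` columns, so it can help to check for them right after loading. The following is a minimal sketch; the helper name `missing_columns` and the toy frame are illustrative and not part of the dataset.

```python
import pandas as pd

# Columns that later cells in this notebook read from books_df
REQUIRED_COLUMNS = ["title", "author", "genre", "description", "rating", "year"]

def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df, in order."""
    return [col for col in required if col not in df.columns]

# Hypothetical toy frame that lacks two of the expected columns
toy_df = pd.DataFrame(
    {"title": ["A"], "author": ["B"], "genre": ["C"], "rating": [4.2]}
)
missing = missing_columns(toy_df)
print(f"Missing columns: {missing}")
```

Running `missing_columns(books_df)` before the plotting cells turns a later `KeyError` into an explicit message about the schema.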

In [None]:
# Basic statistics
print("Dataset Info:")
books_df.info()  # info() prints its report directly and returns None
print("\nMissing values:")
print(books_df.isnull().sum())
print("\nBasic statistics:")
print(books_df.describe())

In [None]:
# Visualize distribution of ratings and publication years
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Rating distribution
axes[0].hist(books_df['rating'], bins=20, alpha=0.7, color='skyblue')
axes[0].set_title('Distribution of Book Ratings')
axes[0].set_xlabel('Rating')
axes[0].set_ylabel('Frequency')

# Year distribution
axes[1].hist(books_df['year'], bins=20, alpha=0.7, color='lightgreen')
axes[1].set_title('Distribution of Publication Years')
axes[1].set_xlabel('Year')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

In [None]:
# Genre distribution
genre_counts = books_df['genre'].value_counts()

plt.figure(figsize=(10, 6))
genre_counts.plot(kind='bar', color='coral')
plt.title('Distribution of Book Genres')
plt.xlabel('Genre')
plt.ylabel('Number of Books')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## 2. Feature Engineering - TF-IDF Vectorization

In [None]:
# Combine relevant text features for content-based filtering
books_df['combined_features'] = (
    books_df['title'].fillna('') + ' ' +
    books_df['author'].fillna('') + ' ' +
    books_df['genre'].fillna('') + ' ' +
    books_df['description'].fillna('')
)

print("Sample combined features:")
for i in range(min(3, len(books_df))):  # guard against datasets with fewer than 3 rows
    print(f"Book {i+1}: {books_df['combined_features'].iloc[i][:100]}...")
    print()

In [None]:
# Create TF-IDF vectorizer
tfidf = TfidfVectorizer(
    max_features=1000,  # Limit features for demonstration
    stop_words='english',
    ngram_range=(1, 2),  # Use unigrams and bigrams
    min_df=1,
    max_df=0.8
)

# Fit and transform the combined features
tfidf_matrix = tfidf.fit_transform(books_df['combined_features'])

print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")
print(f"Number of features: {len(tfidf.get_feature_names_out())}")
sparsity = 1 - tfidf_matrix.nnz / (tfidf_matrix.shape[0] * tfidf_matrix.shape[1])
print(f"Matrix sparsity: {sparsity * 100:.2f}%")

In [None]:
# Show top TF-IDF features
feature_names = tfidf.get_feature_names_out()
mean_scores = np.array(tfidf_matrix.mean(axis=0)).flatten()
top_features = [(feature_names[i], mean_scores[i]) for i in mean_scores.argsort()[::-1][:20]]

print("Top 20 TF-IDF features:")
for feature, score in top_features:
    print(f"{feature}: {score:.4f}")

## 3. Similarity Calculation

In [None]:
# Calculate cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

print(f"Similarity matrix shape: {cosine_sim.shape}")
print(f"Average similarity score (including the diagonal): {cosine_sim.mean():.4f}")
print(f"Similarity score std: {cosine_sim.std():.4f}")
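Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below checks that definition by hand against scikit-learn on made-up vectors, and shows that for unit-length rows (which `TfidfVectorizer` produces by default) a plain dot product via `linear_kernel` gives the same matrix. The vectors and the helper `cosine_by_hand` are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical feature vectors for three items
vectors = np.array([
    [3.0, 4.0, 0.0],
    [4.0, 3.0, 0.0],
    [0.0, 0.0, 5.0],
])

def cosine_by_hand(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|)"""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

by_hand = cosine_by_hand(vectors[0], vectors[1])  # 24 / (5 * 5)
sk_sim = cosine_similarity(vectors)

# After scaling every row to unit length, the dot product equals cosine similarity
unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
dot_sim = linear_kernel(unit_vectors)

print(f"By hand: {by_hand:.2f}, scikit-learn: {sk_sim[0, 1]:.2f}")
print(f"Dot product matches cosine similarity: {np.allclose(sk_sim, dot_sim)}")
```

For large catalogs, `linear_kernel(tfidf_matrix, tfidf_matrix)` may be a slightly cheaper way to obtain the same result, since the normalization step has already been done.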

In [None]:
# Visualize similarity matrix
plt.figure(figsize=(10, 8))
sns.heatmap(cosine_sim, cmap='viridis', square=True, cbar_kws={'label': 'Cosine Similarity'})
plt.title('Book Similarity Matrix')
plt.xlabel('Book Index')
plt.ylabel('Book Index')
plt.show()

In [None]:
# Distribution of similarity scores
plt.figure(figsize=(10, 6))

# Get upper triangle of similarity matrix (excluding diagonal)
upper_triangle = cosine_sim[np.triu_indices_from(cosine_sim, k=1)]

plt.hist(upper_triangle, bins=50, alpha=0.7, color='lightblue', edgecolor='black')
plt.title('Distribution of Cosine Similarity Scores')
plt.xlabel('Cosine Similarity')
plt.ylabel('Frequency')
plt.axvline(upper_triangle.mean(), color='red', linestyle='--', label=f'Mean: {upper_triangle.mean():.3f}')
plt.legend()
plt.show()

print(f"Min similarity: {upper_triangle.min():.4f}")
print(f"Max similarity: {upper_triangle.max():.4f}")
print(f"Mean similarity: {upper_triangle.mean():.4f}")

## 4. Recommendation Function

In [None]:
def get_recommendations(book_title, cosine_sim=cosine_sim, df=books_df, n_recommendations=5):
    """Return the n books most similar to `book_title`, ranked by cosine similarity."""
    columns = ['title', 'author', 'genre', 'rating', 'similarity_score']

    # Find the row position (not the index label) of the first matching title
    matches = np.flatnonzero((df['title'] == book_title).to_numpy())
    if len(matches) == 0:
        print(f"Book '{book_title}' not found in dataset")
        # Return an empty DataFrame so callers can always check `.empty`
        return pd.DataFrame(columns=columns)
    idx = matches[0]

    # Pair each other book's position with its score, skipping the query book itself
    sim_scores = [(i, score) for i, score in enumerate(cosine_sim[idx]) if i != idx]

    # Sort by similarity (highest first) and keep the top n
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[:n_recommendations]

    # Split into row positions and similarity scores
    book_indices = [i for i, _ in sim_scores]
    similarity_scores = [score for _, score in sim_scores]

    # Return the most similar books with their scores
    recommendations = df.iloc[book_indices].copy()
    recommendations['similarity_score'] = similarity_scores

    return recommendations[columns]
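Dropping the first entry of a sorted similarity list removes the query book only when nothing else ties with it at the top, which can fail when the dataset contains duplicate records. The sketch below masks the query row explicitly and uses `np.argsort` for the ranking; the helper name `top_k_similar` and the toy matrix are illustrative, not part of the notebook.

```python
import numpy as np

def top_k_similar(sim_matrix, idx, k=5):
    """Return (positions, scores) of the k rows most similar to row idx, excluding idx."""
    scores = np.asarray(sim_matrix[idx], dtype=float).copy()
    scores[idx] = -np.inf                 # the query item can never be selected
    k = min(k, len(scores) - 1)           # cannot return more items than exist
    top = np.argsort(-scores, kind="stable")[:k]
    return top, scores[top]

# Hypothetical similarity matrix: items 0 and 1 are exact duplicates (similarity 1.0)
toy_sim = np.array([
    [1.0, 1.0, 0.2, 0.5],
    [1.0, 1.0, 0.1, 0.4],
    [0.2, 0.1, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
])

positions, scores = top_k_similar(toy_sim, idx=1, k=2)
print(f"Positions: {positions.tolist()}, scores: {scores.tolist()}")
```

For item 1, a stable sort that simply discards the first result would drop item 0 and keep item 1 itself; masking avoids that.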

In [None]:
# Test recommendations for a sample book
sample_book = books_df['title'].iloc[0]
print(f"Getting recommendations for: '{sample_book}'")
print()

recommendations = get_recommendations(sample_book)
print("Recommended books:")
print(recommendations.to_string(index=False))

## 5. Model Evaluation

In [None]:
# Analyze recommendation quality by genre similarity
def evaluate_genre_consistency():
    """Evaluate how often recommendations share the same genre as the target book"""
    
    genre_matches = []
    
    for i, book_title in enumerate(books_df['title']):
        target_genre = books_df.iloc[i]['genre']
        recs = get_recommendations(book_title, n_recommendations=3)
        
        if len(recs) > 0:  # skip books for which no recommendations were returned
            genre_match_count = sum(recs['genre'] == target_genre)
            genre_match_ratio = genre_match_count / len(recs)
            genre_matches.append(genre_match_ratio)
    
    return genre_matches

genre_consistency = evaluate_genre_consistency()
avg_genre_consistency = np.mean(genre_consistency)

print(f"Average genre consistency: {avg_genre_consistency:.2f}")
print(f"This means {avg_genre_consistency*100:.1f}% of recommendations share the same genre as the target book")
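A genre consistency score is easier to interpret next to a chance baseline: the match rate expected if recommendations were drawn at random from the other books. A minimal sketch, assuming uniform random sampling without replacement; the function name and the toy genre list are hypothetical.

```python
from collections import Counter

def chance_genre_consistency(genres):
    """Expected genre match rate when recommending a random *other* book.

    For a target in a genre with c books out of n, a random other book
    matches with probability (c - 1) / (n - 1). Averaging over all
    targets gives sum(c * (c - 1)) / (n * (n - 1)).
    """
    counts = Counter(genres)
    n = len(genres)
    if n < 2:
        return 0.0
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

# Hypothetical genre labels
toy_genres = ["Fantasy"] * 3 + ["Mystery"] * 2 + ["Sci-Fi"]
baseline = chance_genre_consistency(toy_genres)
print(f"Chance-level genre consistency: {baseline:.3f}")
```

Calling `chance_genre_consistency(books_df['genre'].tolist())` gives a reference value for `avg_genre_consistency`. Note that genre is also part of `combined_features`, so some agreement above chance is expected by construction.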

In [None]:
# Visualize genre consistency
plt.figure(figsize=(10, 6))
plt.hist(genre_consistency, bins=10, alpha=0.7, color='lightcoral', edgecolor='black')
plt.title('Distribution of Genre Consistency in Recommendations')
plt.xlabel('Genre Match Ratio')
plt.ylabel('Number of Books')
plt.axvline(avg_genre_consistency, color='red', linestyle='--', 
            label=f'Average: {avg_genre_consistency:.2f}')
plt.legend()
plt.show()

## 6. Dimensionality Reduction Visualization

In [None]:
# Use PCA to visualize books in 2D space
pca = PCA(n_components=2, random_state=42)
book_embeddings_2d = pca.fit_transform(tfidf_matrix.toarray())

print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total explained variance: {pca.explained_variance_ratio_.sum():.3f}")
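Calling `.toarray()` materializes the full dense matrix, which is fine for a small demo but can become expensive for large vocabularies. One possible alternative is `TruncatedSVD`, which accepts sparse input directly. It does not center the data, so its components are not identical to PCA's. The sketch below uses a randomly generated sparse matrix as a stand-in for `tfidf_matrix`.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical sparse matrix: 50 "documents" x 200 "terms", about 5% non-zero
rng = np.random.default_rng(0)
dense = rng.random((50, 200))
dense[dense < 0.95] = 0.0
toy_sparse = csr_matrix(dense)

# TruncatedSVD works on the sparse matrix without densifying it
svd = TruncatedSVD(n_components=2, random_state=42)
toy_2d = svd.fit_transform(toy_sparse)

print(f"Reduced shape: {toy_2d.shape}")
print(f"Total explained variance: {svd.explained_variance_ratio_.sum():.3f}")
```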

In [None]:
# Create a scatter plot colored by genre
plt.figure(figsize=(12, 8))

# Get unique genres for color mapping
unique_genres = books_df['genre'].unique()
colors = plt.cm.Set3(np.linspace(0, 1, len(unique_genres)))

for i, genre in enumerate(unique_genres):
    mask = books_df['genre'] == genre
    plt.scatter(book_embeddings_2d[mask, 0], book_embeddings_2d[mask, 1], 
               c=[colors[i]], label=genre, alpha=0.7, s=100)

plt.title('Book Embeddings in 2D Space (PCA)', fontsize=16)
plt.xlabel(f'First Principal Component ({pca.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'Second Principal Component ({pca.explained_variance_ratio_[1]:.1%} variance)')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 7. Save Model Components

In [None]:
# Create models directory
os.makedirs('../models', exist_ok=True)

# Save TF-IDF matrix
with open('../models/tfidf_matrix.pkl', 'wb') as f:
    pickle.dump(tfidf_matrix, f)

# Save TF-IDF vectorizer
with open('../models/tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(tfidf, f)

# Save similarity matrix
with open('../models/cosine_similarity.pkl', 'wb') as f:
    pickle.dump(cosine_sim, f)

print("Model components saved successfully!")
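The saved components can be loaded back and used to score a new text query against the stored matrix. The sketch below runs the round trip in a temporary directory on a made-up corpus; the documents, the query, and the variable names are assumptions, and the file names only mirror the ones used above.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for the book descriptions
toy_docs = [
    "space opera with rebel pilots",
    "space station mystery",
    "cozy village mystery",
]
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(toy_docs)

with tempfile.TemporaryDirectory() as tmp_dir:
    vec_path = os.path.join(tmp_dir, "tfidf_vectorizer.pkl")
    mat_path = os.path.join(tmp_dir, "tfidf_matrix.pkl")

    # Save both components
    with open(vec_path, "wb") as f:
        pickle.dump(vectorizer, f)
    with open(mat_path, "wb") as f:
        pickle.dump(matrix, f)

    # Load them back, as a separate serving process would
    with open(vec_path, "rb") as f:
        loaded_vectorizer = pickle.load(f)
    with open(mat_path, "rb") as f:
        loaded_matrix = pickle.load(f)

# Score a free-text query with the loaded vectorizer (transform, not fit_transform)
query_vec = loaded_vectorizer.transform(["a mystery set on a space station"])
query_scores = cosine_similarity(query_vec, loaded_matrix).ravel()
best_match = int(query_scores.argmax())
print(f"Best match: {toy_docs[best_match]!r}")
```

Pickle files can execute arbitrary code when loaded, so only load files from a trusted source, and expect that a pickled vectorizer may not load cleanly under a different scikit-learn version.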

## 8. Summary and Next Steps

### What we accomplished:
1. ✅ **Data Analysis**: Explored book dataset and visualized distributions
2. ✅ **Feature Engineering**: Created TF-IDF vectors from book text content
3. ✅ **Similarity Calculation**: Used cosine similarity for content-based filtering
4. ✅ **Model Evaluation**: Assessed recommendation quality using genre consistency
5. ✅ **Visualization**: Used PCA to visualize book embeddings in 2D space

### Key Findings:
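- Genre consistency (Section 5) measures how often the top-3 recommendations share the target book's genre. Because genre is itself part of `combined_features`, treat this score as a sanity check rather than an independent measure of quality.
- TF-IDF captures lexical overlap between books (shared words and two-word phrases); it does not capture synonyms or meaning beyond shared terms.
- The PCA plot (Section 6) shows how far genres separate along the first two components; check the total explained variance before drawing conclusions from it.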

### Potential Improvements:
1. **Collaborative Filtering**: Add user rating data for collaborative recommendations
2. **Deep Learning**: Use embeddings from pre-trained language models
3. **Hybrid Approach**: Combine content-based and collaborative filtering
4. **Evaluation Metrics**: Implement precision@K, recall@K, and NDCG
5. **A/B Testing**: Test different algorithms in production
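The ranking metrics mentioned in item 4 can be sketched as follows, assuming binary relevance (a book is either relevant to the user or not). The function names, book IDs, and relevance set are hypothetical; the notebook has no user-level relevance data yet.

```python
import math

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for item in recommended[:k] if item in relevant) / k

def recall_at_k(recommended, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for item in recommended[:k] if item in relevant) / len(relevant)

def ndcg_at_k(recommended, relevant, k):
    """Normalized discounted cumulative gain with binary relevance."""
    dcg = sum(
        1.0 / math.log2(rank + 2)          # rank is 0-based, so position 1 -> log2(2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)     # best case: all top slots are relevant
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical ranked recommendations and ground-truth relevant set
recommended = ["b1", "b2", "b3", "b4"]
relevant = {"b1", "b3", "b9"}

p_at_3 = precision_at_k(recommended, relevant, 3)
r_at_3 = recall_at_k(recommended, relevant, 3)
ndcg_at_3 = ndcg_at_k(recommended, relevant, 3)
print(f"precision@3={p_at_3:.3f}, recall@3={r_at_3:.3f}, NDCG@3={ndcg_at_3:.3f}")
```

Unlike genre consistency, these metrics need held-out relevance labels (for example, books a user actually rated highly), which is why they depend on the user data described in item 1.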

This notebook provides a baseline content-based recommendation system built with traditional ML techniques, which the improvements above can extend.