# Recommendation System - Complete Tutorial

**Author:** Anik Tahabilder  
**Project:** 12 of 22 - Kaggle ML Portfolio  
**Dataset:** MovieLens  
**Difficulty:** 6/10 | **Learning Value:** 8/10

---

## What Will You Learn?

This tutorial teaches **Recommendation Systems from fundamentals to implementation**.

| Topic | What You'll Understand |
|-------|------------------------|
| **Types of Recommenders** | Content-based vs Collaborative Filtering |
| **Similarity Metrics** | Cosine, Pearson, Jaccard - when to use each |
| **User-based CF** | "Users like you also liked..." |
| **Item-based CF** | "Because you liked X, try Y..." |
| **Matrix Factorization** | SVD for latent features |
| **Cold Start Problem** | Handling new users/items |
| **Evaluation Metrics** | RMSE, MAE, Precision@K, Recall@K |

---

## The Recommendation Problem

```
USER-ITEM INTERACTION MATRIX

              Movie1  Movie2  Movie3  Movie4  Movie5
           ┌────────────────────────────────────────┐
   User1   │   5       3       ?       1       ?   │
   User2   │   4       ?       ?       1       ?   │
   User3   │   1       1       ?       5       4   │
   User4   │   ?       ?       5       4       ?   │
           └────────────────────────────────────────┘
                        ↓
           GOAL: Predict the "?" values!
```
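
The diagram above can be written down directly in NumPy. The sketch below is illustrative only: it marks the unknown `?` entries with `np.nan` (the later cells of this tutorial encode them as 0 instead), and the variable names are assumptions rather than part of the tutorial's pipeline.

```python
import numpy as np

# Toy user-item matrix from the diagram; np.nan marks the unknown "?" entries
R = np.array([
    [5, 3, np.nan, 1, np.nan],        # User1
    [4, np.nan, np.nan, 1, np.nan],   # User2
    [1, 1, np.nan, 5, 4],             # User3
    [np.nan, np.nan, 5, 4, np.nan],   # User4
])

known = ~np.isnan(R)                  # True where a rating exists
n_known = int(known.sum())
sparsity = 1 - n_known / R.size       # fraction of missing entries

# (user, movie) index pairs a recommender would need to predict
to_predict = [(int(u), int(i)) for u, i in np.argwhere(~known)]

print(f"Known ratings: {n_known} of {R.size}")
print(f"Sparsity: {sparsity:.0%}")
print(f"Cells to predict: {len(to_predict)}")
```

Even this tiny matrix is almost half empty; real rating matrices are far sparser.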

---

## Table of Contents

1. [Part 1: Types of Recommendation Systems](#part1)
2. [Part 2: Similarity Metrics](#part2)
3. [Part 3: Collaborative Filtering Theory](#part3)
4. [Part 4: Matrix Factorization (SVD)](#part4)
5. [Part 5: Dataset Loading & EDA](#part5)
6. [Part 6: User-Based Collaborative Filtering](#part6)
7. [Part 7: Item-Based Collaborative Filtering](#part7)
8. [Part 8: SVD-Based Recommendations](#part8)
9. [Part 9: Evaluation & Comparison](#part9)
10. [Part 10: Summary & Key Takeaways](#part10)

---

<a id='part1'></a>
# Part 1: Types of Recommendation Systems

---

## 1.1 The Three Main Approaches

| Approach | How It Works | Example | When to Use |
|----------|--------------|--------|-------------|
| **Content-Based** | Recommend items similar to what user liked | "You liked Action movies, here's more Action" | Have good item features, new items |
| **Collaborative Filtering** | Recommend based on similar users/items | "Users like you also watched..." | Have user-item interactions |
| **Hybrid** | Combine both approaches | Netflix, Spotify, Amazon | Production systems |

---

## 1.2 Content-Based Filtering

```
USER PROFILE                    ITEM FEATURES                RECOMMENDATION
┌─────────────┐                ┌─────────────┐               ┌─────────────┐
│ Likes:      │                │ Movie X:    │               │             │
│ - Action    │   MATCH        │ - Action    │   HIGH        │ Recommend   │
│ - Sci-Fi    │ ──────────>    │ - Sci-Fi    │ ──────────>   │  Movie X!   │
│ - 2hr+      │                │ - 2hr 15min │   SCORE       │             │
└─────────────┘                └─────────────┘               └─────────────┘
```
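
The matching step in the diagram can be sketched as a similarity score between a user profile vector and each item's feature vector. This is a minimal illustration, assuming hypothetical binary features (`Action`, `Sci-Fi`, `Romance`, `Long runtime`) and made-up titles; real systems typically derive features from metadata or text.

```python
import numpy as np

# Hypothetical binary item features: [Action, Sci-Fi, Romance, Long runtime]
item_features = {
    "Movie X": np.array([1, 1, 0, 1]),
    "Movie Y": np.array([0, 0, 1, 0]),
    "Movie Z": np.array([1, 0, 0, 0]),
}

# User profile built from stated preferences: likes Action, Sci-Fi, 2hr+ films
user_profile = np.array([1, 1, 0, 1])

def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / norm) if norm > 0 else 0.0

# Score every item against the profile and rank from best to worst match
scores = {title: cosine(user_profile, feats) for title, feats in item_features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for title in ranked:
    print(f"{title}: {scores[title]:.3f}")
```

No other users' ratings are involved, which is why this approach still works for items that nobody has rated yet.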

### When to Choose Content-Based?

| Scenario | Why Content-Based Works |
|----------|------------------------|
| **New items (no ratings yet)** | Can recommend based on features alone |
| **Rich item metadata** | News articles, job postings have good descriptions |
| **User privacy important** | Doesn't need other users' data |
| **Transparent recommendations** | Can explain "because it has genre X" |

| Pros | Cons |
|------|------|
| No cold start for items | Limited diversity (filter bubble) |
| Transparent recommendations | Needs good item features |
| User independence | Can't discover unexpected preferences |

---

## 1.3 Collaborative Filtering

```
USER-BASED CF                              ITEM-BASED CF
"Users similar to you liked..."            "Items similar to what you liked..."

┌─────┐                                    ┌─────┐
│You  │──likes──> Movie A                  │You  │──likes──> Movie A
└─────┘                                    └─────┘              │
   │ similar                                                    │ similar
   v                                                            v
┌─────┐                                                    ┌─────────┐
│Bob  │──likes──> Movie A, Movie B                         │ Movie B │
└─────┘              │                                     └─────────┘
                     v                                          │
            Recommend Movie B!                      Recommend Movie B!
```

### When to Choose Collaborative Filtering?

| Scenario | Why CF Works |
|----------|-------------|
| **Enough user-item interactions** | Can find meaningful patterns |
| **Items lack good features** | Music, movies hard to describe |
| **Want to discover hidden patterns** | Users might like unexpected things |
| **E-commerce, streaming** | Behavior matters more than content |

---

## 1.4 User-Based vs Item-Based: Decision Guide

| Criteria | User-Based | Item-Based | Better Choice |
|----------|------------|------------|---------------|
| **# Users >> # Items** | Slow (O(m²)) | Fast | Item-Based |
| **# Items >> # Users** | Fast | Slow (O(n²)) | User-Based |
| **User behavior changes often** | Recalculate often | Stable | Item-Based |
| **New items frequently** | Handles better | Cold start | User-Based |
| **Explainability** | "Users like you..." | "Because you liked X..." | Both good |
| **Industry standard** | Academic | Amazon, Netflix | Item-Based |

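**Key insight**: Item-Based CF is often preferred in production because:
1. Item-item similarities change more slowly than individual users' tastes
2. Precomputed similarities can be cached and reused across requests
3. Explanations can point to a concrete item ("Because you liked X...")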

---

## 1.5 Challenges and Solutions

| Challenge | Description | Solution |
|-----------|-------------|----------|
| **Cold Start** | New user/item with no history | Content-based fallback, ask preferences, popularity-based |
| **Sparsity** | 95%+ of matrix is empty | Matrix factorization (SVD) captures patterns |
| **Scalability** | Millions of users/items | Approximate nearest neighbors, clustering |
| **Diversity** | Avoid "filter bubbles" | Exploration-exploitation, diversity re-ranking |
| **Implicit feedback** | No explicit ratings | Use clicks, views, time spent |
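
As one possible form of the popularity-based fallback for cold-start users, the sketch below ranks movies by a damped mean rating. The helper `popularity_scores` and its `damping` parameter are hypothetical names introduced for illustration, and the toy data is made up.

```python
import pandas as pd

# Toy ratings; column names mirror those used elsewhere in the tutorial
toy = pd.DataFrame({
    "movie_id": [1, 1, 1, 1, 2, 3, 3],
    "rating":   [5, 5, 4, 5, 5, 2, 3],
})

def popularity_scores(df, damping=2):
    """Damped mean rating per movie (hypothetical helper).

    Adds `damping` pseudo-ratings at the global mean, so movies with very
    few ratings are pulled toward the average instead of topping the list.
    """
    global_mean = df["rating"].mean()
    stats = df.groupby("movie_id")["rating"].agg(["sum", "count"])
    damped = (stats["sum"] + damping * global_mean) / (stats["count"] + damping)
    return damped.sort_values(ascending=False)

top_for_new_user = popularity_scores(toy)
print(top_for_new_user)
```

With a plain mean, movie 2 would rank first on the strength of a single 5-star rating; the damping term moves the consistently well-rated movie 1 ahead of it.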

In [None]:
# ============================================================
# SETUP AND IMPORTS
# ============================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore')

# Sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse.linalg import svds
from scipy import sparse

# Display settings
plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

print("="*70)
print("RECOMMENDATION SYSTEM - TUTORIAL")
print("="*70)
print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")
print("\nAll libraries loaded!")

---

<a id='part2'></a>
# Part 2: Similarity Metrics

---

## 2.1 Why Similarity Matters

Collaborative filtering relies on finding **similar users** or **similar items**.

The choice of similarity metric significantly affects recommendations!

---

## 2.2 Common Similarity Metrics

### Cosine Similarity

$$\text{cosine}(A, B) = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum a_i b_i}{\sqrt{\sum a_i^2} \sqrt{\sum b_i^2}}$$

| Property | Description |
|----------|-------------|
| **Range** | -1 to 1 (0 to 1 for positive values) |
| **Measures** | Angle between vectors |
| **Best for** | When magnitude doesn't matter |
| **Ignores** | Missing values (use only co-rated items) |

### Pearson Correlation

$$\text{pearson}(A, B) = \frac{\sum (a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum (a_i - \bar{a})^2} \sqrt{\sum (b_i - \bar{b})^2}}$$

| Property | Description |
|----------|-------------|
| **Range** | -1 to 1 |
| **Measures** | Linear correlation |
| **Best for** | When users have different rating scales |
| **Handles** | Mean-centering automatically |

### Jaccard Similarity

$$\text{jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

| Property | Description |
|----------|-------------|
| **Range** | 0 to 1 |
| **Measures** | Set overlap |
| **Best for** | Binary data (liked/not liked) |
| **Ignores** | Rating values |

---

## 2.3 Which Metric to Choose?

| Scenario | Best Metric | Why |
|----------|-------------|-----|
| **Explicit ratings (1-5 stars)** | Pearson | Handles different rating scales |
| **Implicit feedback (clicks, views)** | Cosine | Interaction counts act as per-item weights (engagement strength) |
| **Binary data (liked/not liked)** | Jaccard | Designed for sets |
| **Sparse data** | Cosine or Pearson | Both handle missing values |
| **User-User similarity** | Pearson | Users have different biases |
| **Item-Item similarity** | Cosine | Items don't have "bias" |

### Why Pearson for User-Based CF?

Users have different rating behaviors:
- User A: Rates everything 4-5 stars (generous)
- User B: Rates everything 2-3 stars (harsh)

Both might have the **same preferences** but different scales. Pearson correlation **normalizes** by subtracting the mean, capturing the relative pattern, not absolute values.
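
A small numeric check makes this concrete. The ratings below are hypothetical: both users rank the same five movies in the same relative order, but one rates two stars lower across the board.

```python
import numpy as np

# Hypothetical ratings for the same five movies, in the same order
generous = np.array([5.0, 4.0, 5.0, 4.0, 5.0])   # User A: everything 4-5 stars
harsh = np.array([3.0, 2.0, 3.0, 2.0, 3.0])      # User B: everything 2-3 stars

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson(a, b):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    return cosine(a - a.mean(), b - b.mean())

print(f"Mean rating A: {generous.mean():.1f}, B: {harsh.mean():.1f}")
print(f"Cosine:  {cosine(generous, harsh):.4f}")
print(f"Pearson: {pearson(generous, harsh):.4f}")
```

After mean-centering, the two vectors are identical, so Pearson reports a perfect match, while raw cosine is slightly below 1 because of the scale offset.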

---

## 2.4 Comparison Table

| Metric | Range | Handles Bias | Uses Values | Sparse Data | Time Complexity |
|--------|-------|--------------|-------------|-------------|-----------------|
| **Cosine** | -1 to 1 (0 to 1 for non-negative ratings) | No | Yes | Good | O(d) |
| **Pearson** | -1 to 1 | Yes | Yes | Good | O(d) |
| **Jaccard** | 0 to 1 | N/A | No | Excellent | O(d) |
| **Euclidean** | 0 to ∞ | No | Yes | Poor | O(d) |

*d = number of co-rated items*
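
The Euclidean row is a distance rather than a similarity. A minimal sketch of how it could be computed over co-rated items and mapped to a similarity score is shown below; the `1 / (1 + d)` conversion is one common convention, not something prescribed by this tutorial, and both function names are hypothetical.

```python
import numpy as np

def euclidean_distance(a, b):
    """Euclidean distance over co-rated items (0 = not rated)."""
    mask = (a != 0) & (b != 0)
    if not mask.any():
        return float("inf")   # no overlap: treat as maximally distant
    return float(np.linalg.norm(a[mask] - b[mask]))

def euclidean_similarity(a, b):
    """Map a distance in [0, inf] to a similarity in [0, 1]."""
    d = euclidean_distance(a, b)
    return 0.0 if np.isinf(d) else 1.0 / (1.0 + d)

# Same toy users as in the similarity examples (0 = not rated)
user_a = np.array([5, 4, 0, 1, 2])
user_b = np.array([4, 5, 0, 2, 1])   # similar to A
user_c = np.array([1, 2, 0, 5, 4])   # opposite taste to A

print(f"A vs B: distance={euclidean_distance(user_a, user_b):.3f}, "
      f"similarity={euclidean_similarity(user_a, user_b):.3f}")
print(f"A vs C: distance={euclidean_distance(user_a, user_c):.3f}, "
      f"similarity={euclidean_similarity(user_a, user_c):.3f}")
```

Because the distance grows with the number of co-rated items and with rating-scale offsets, it is sensitive to both sparsity and user bias, which matches the "No" and "Poor" entries in the table.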

In [None]:
# ============================================================
# SIMILARITY METRICS IMPLEMENTATION
# ============================================================
print("="*70)
print("SIMILARITY METRICS")
print("="*70)

def cosine_sim(a, b):
    """Compute cosine similarity between two vectors."""
    # Only use co-rated items (both non-zero)
    mask = (a != 0) & (b != 0)
    if mask.sum() == 0:
        return 0
    a_masked, b_masked = a[mask], b[mask]
    dot = np.dot(a_masked, b_masked)
    norm = np.linalg.norm(a_masked) * np.linalg.norm(b_masked)
    return dot / norm if norm > 0 else 0

def pearson_sim(a, b):
    """Compute Pearson correlation between two vectors."""
    # Only use co-rated items
    mask = (a != 0) & (b != 0)
    if mask.sum() < 2:  # Need at least 2 co-rated items
        return 0
    a_masked, b_masked = a[mask], b[mask]
    # Mean-center
    a_centered = a_masked - np.mean(a_masked)
    b_centered = b_masked - np.mean(b_masked)
    # Correlation
    num = np.dot(a_centered, b_centered)
    denom = np.linalg.norm(a_centered) * np.linalg.norm(b_centered)
    return num / denom if denom > 0 else 0

def jaccard_sim(a, b):
    """Compute Jaccard similarity (for binary/implicit feedback)."""
    a_set = set(np.where(a > 0)[0])
    b_set = set(np.where(b > 0)[0])
    intersection = len(a_set & b_set)
    union = len(a_set | b_set)
    return intersection / union if union > 0 else 0

# Example
print("\nExample: Comparing Two Users' Ratings")
print("-" * 50)

# User ratings for 5 movies (0 = not rated)
user_a = np.array([5, 4, 0, 1, 2])  # Likes movies 1,2 | Dislikes 4,5
user_b = np.array([4, 5, 0, 2, 1])  # Similar to A
user_c = np.array([1, 2, 0, 5, 4])  # Opposite taste to A

print(f"User A ratings: {user_a}")
print(f"User B ratings: {user_b}")
print(f"User C ratings: {user_c}")

print(f"\nSimilarities (A vs B - Similar tastes):")
print(f"  Cosine:  {cosine_sim(user_a, user_b):.4f}")
print(f"  Pearson: {pearson_sim(user_a, user_b):.4f}")
print(f"  Jaccard: {jaccard_sim(user_a, user_b):.4f}")

print(f"\nSimilarities (A vs C - Opposite tastes):")
print(f"  Cosine:  {cosine_sim(user_a, user_c):.4f}")
print(f"  Pearson: {pearson_sim(user_a, user_c):.4f}")
print(f"  Jaccard: {jaccard_sim(user_a, user_c):.4f}")

print("\nObservation:")
print("  - Pearson captures the negative correlation for opposite tastes")
print("  - Cosine stays positive for non-negative ratings (doesn't catch opposite tastes well)")
print("  - Jaccard only looks at what they rated, not how")

---

<a id='part3'></a>
# Part 3: Collaborative Filtering Theory

---

## 3.1 User-Based Collaborative Filtering

### Algorithm:

1. **Build user-item matrix** (rows=users, cols=items)
2. **Find similar users** to target user
3. **Predict rating** as weighted average of similar users' ratings

### Prediction Formula:

$$\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v \in N(u)} \text{sim}(u,v) \cdot (r_{v,i} - \bar{r}_v)}{\sum_{v \in N(u)} |\text{sim}(u,v)|}$$

Where:
- $\hat{r}_{u,i}$ = predicted rating of user u for item i
- $\bar{r}_u$ = average rating of user u
- $N(u)$ = neighbors (similar users) of u who rated item i
- $\text{sim}(u,v)$ = similarity between users u and v
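
To see the formula in action, here is a worked example with made-up numbers. The function name `predict_user_based` and all input values are assumptions chosen for illustration.

```python
import numpy as np

def predict_user_based(user_mean, sims, neighbor_ratings, neighbor_means):
    """Mean-centered weighted average, following the prediction formula."""
    sims = np.asarray(sims, dtype=float)
    deviations = (np.asarray(neighbor_ratings, dtype=float)
                  - np.asarray(neighbor_means, dtype=float))
    denom = np.abs(sims).sum()
    if denom == 0:
        return float(user_mean)   # no usable neighbors: fall back to the user's mean
    return float(user_mean + sims @ deviations / denom)

# Hypothetical case: the target user averages 3.5; three neighbors rated item i
pred = predict_user_based(
    user_mean=3.5,
    sims=[0.9, 0.5, -0.4],            # sim(u, v) for each neighbor v
    neighbor_ratings=[5, 4, 2],       # r_{v,i}
    neighbor_means=[4.0, 3.0, 3.0],   # each neighbor's average rating
)
print(f"Predicted rating: {pred:.3f}")
```

The third neighbor has negative similarity and rated the item below their own average, so their contribution pushes the prediction up rather than down.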

---

## 3.2 Item-Based Collaborative Filtering

### Algorithm:

1. **Build item-item similarity matrix**
2. **For prediction**: Find items similar to those user liked
3. **Predict rating** as weighted average based on item similarities

### Prediction Formula:

$$\hat{r}_{u,i} = \frac{\sum_{j \in N(i)} \text{sim}(i,j) \cdot r_{u,j}}{\sum_{j \in N(i)} |\text{sim}(i,j)|}$$

Where:
- $N(i)$ = items similar to item i that user u has rated
- $\text{sim}(i,j)$ = similarity between items i and j
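
The item-based formula can be checked the same way with a small worked example. The function name `predict_item_based` and the numbers are hypothetical.

```python
import numpy as np

def predict_item_based(sims, user_ratings):
    """Similarity-weighted average of the user's own ratings on neighbor items."""
    sims = np.asarray(sims, dtype=float)
    user_ratings = np.asarray(user_ratings, dtype=float)
    denom = np.abs(sims).sum()
    if denom == 0:
        return None               # no similar rated items: caller must pick a fallback
    return float(sims @ user_ratings / denom)

# Hypothetical case: user u rated three items j that are similar to target item i
pred = predict_item_based(
    sims=[0.8, 0.6, 0.1],         # sim(i, j)
    user_ratings=[5, 4, 1],       # r_{u,j}
)
print(f"Predicted rating: {pred:.3f}")
```

The weakly similar item (similarity 0.1) barely moves the result, so the prediction stays close to the ratings of the two most similar items.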

---

## 3.3 User-Based vs Item-Based Comparison

| Aspect | User-Based | Item-Based |
|--------|------------|------------|
| **Precompute** | User similarities | Item similarities |
| **Scales better with** | Fewer users than items | Fewer items than users |
| **Stability** | Changes often (user behavior) | More stable |
| **Explainability** | "Users like you..." | "Because you liked X..." |
| **Cold Start** | New user problem | New item problem |

---

## 3.4 Neighborhood Selection (K)

| K Value | Effect |
|---------|--------|
| **Small K (5-10)** | More personalized, may be noisy |
| **Large K (50+)** | Smoother, less personalized |
| **Optimal** | Usually 20-50, tune via cross-validation |
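
The effect of K can be illustrated on a toy neighborhood. The similarities and ratings below are invented, and `predict_top_k` is a hypothetical helper; choosing K in practice still requires held-out data, which this sketch does not attempt.

```python
import numpy as np

# Hypothetical neighbors: similarity to the target and the rating each contributes
sims = np.array([0.95, 0.90, 0.40, 0.30, 0.10, 0.05])
ratings = np.array([5.0, 5.0, 3.0, 2.0, 3.0, 1.0])

def predict_top_k(sims, ratings, k):
    """Weighted average over the k most similar neighbors."""
    top = np.argsort(sims)[::-1][:k]   # indices of the k largest similarities
    return float(sims[top] @ ratings[top] / np.abs(sims[top]).sum())

preds = {k: predict_top_k(sims, ratings, k) for k in (1, 2, 4, 6)}
for k, p in preds.items():
    print(f"K={k}: predicted rating {p:.3f}")
```

With small K the prediction follows the closest neighbors exactly; as K grows, weaker neighbors pull it toward a more average value.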

---

<a id='part4'></a>
# Part 4: Matrix Factorization (SVD)

---

## 4.1 The Idea

Decompose the user-item matrix into latent factors:

```
USER-ITEM MATRIX              =    USER FACTORS    ×    ITEM FACTORS
     (m × n)                         (m × k)              (k × n)

┌─────────────────┐           ┌─────────┐         ┌─────────────────┐
│ 5  3  ?  1  ?   │           │ u1 u2   │         │ i1 i2 i3 i4 i5 │
│ 4  ?  ?  1  ?   │     =     │ u1 u2   │    ×    │ i1 i2 i3 i4 i5 │
│ 1  1  ?  5  4   │           │ u1 u2   │         └─────────────────┘
│ ?  ?  5  4  ?   │           │ u1 u2   │              k factors
└─────────────────┘           └─────────┘
                               k factors
```

## 4.2 SVD (Singular Value Decomposition)

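$$R \approx U \Sigma V^T$$

Keeping only the top $k$ singular values gives a rank-$k$ approximation of $R$, which is why the shapes below use $k$ rather than $\min(m, n)$.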

| Matrix | Shape | Meaning |
|--------|-------|--------|
| **R** | m × n | Original ratings matrix |
| **U** | m × k | User latent factors |
| **Σ** | k × k | Singular values (importance) |
| **V^T** | k × n | Item latent factors |
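
The shapes in the table can be verified on a toy matrix. This sketch uses `np.linalg.svd` and truncates manually; it fills missing ratings with 0, which is a simplification assumed here for illustration and treats unrated movies as if they were rated 0.

```python
import numpy as np

# Toy ratings matrix (m=4 users, n=5 movies); 0 stands in for missing ratings
R = np.array([
    [5, 3, 0, 1, 0],
    [4, 0, 0, 1, 0],
    [1, 1, 0, 5, 4],
    [0, 0, 5, 4, 0],
], dtype=float)

k = 2  # number of latent factors to keep

# Full (thin) SVD, then keep the k largest singular values and their vectors
U_full, s_full, Vt_full = np.linalg.svd(R, full_matrices=False)
U = U_full[:, :k]            # m x k user factors
Sigma = np.diag(s_full[:k])  # k x k singular values
Vt = Vt_full[:k, :]          # k x n item factors

R_hat = U @ Sigma @ Vt       # rank-k reconstruction of R

print("Shapes:", U.shape, Sigma.shape, Vt.shape)
print("Reconstruction:\n", np.round(R_hat, 2))
print(f"Frobenius error: {np.linalg.norm(R - R_hat):.3f}")
```

The reconstruction error equals the root of the sum of the squared discarded singular values, so increasing `k` can only reduce it.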

## 4.3 What Are Latent Factors?

Hidden features that explain preferences:

| Factor | Could Represent |
|--------|----------------|
| Factor 1 | Action vs Drama preference |
| Factor 2 | Old vs New movies |
| Factor 3 | Mainstream vs Indie |
| ... | Other hidden patterns |

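## 4.4 Prediction with SVD

In recommender systems, "SVD" often refers to a biased matrix factorization model fitted on the observed ratings only (for example with SGD), rather than the exact decomposition above. Its prediction rule is:

$$\hat{r}_{u,i} = \mu + b_u + b_i + q_i^T p_u$$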

Where:
- $\mu$ = global average rating
- $b_u$ = user bias (some users rate higher)
- $b_i$ = item bias (some movies are rated higher)
- $q_i$ = item latent vector
- $p_u$ = user latent vector
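
Plugging numbers into the formula shows how each term contributes. All values below are hypothetical, and `predict_rating` is a name introduced only for this example.

```python
import numpy as np

def predict_rating(mu, b_u, b_i, q_i, p_u):
    """Biased matrix factorization prediction: mu + b_u + b_i + q_i . p_u"""
    return float(mu + b_u + b_i + np.dot(q_i, p_u))

mu = 3.5                       # global average rating
b_u = 0.3                      # this user rates 0.3 above average
b_i = -0.2                     # this movie is rated 0.2 below average
p_u = np.array([0.8, -0.1])    # user latent vector (k = 2)
q_i = np.array([0.5, 0.4])     # item latent vector (k = 2)

r_hat = predict_rating(mu, b_u, b_i, q_i, p_u)
print(f"Baseline (mu + b_u + b_i): {mu + b_u + b_i:.2f}")
print(f"Interaction (q_i . p_u):   {np.dot(q_i, p_u):.2f}")
print(f"Predicted rating:          {r_hat:.2f}")
```

The bias terms capture effects that do not depend on the pairing, while the dot product captures how well this particular user matches this particular movie.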

## 4.5 Key Parameters

| Parameter | Description | Typical Value |
|-----------|-------------|---------------|
| **k (n_factors)** | Number of latent factors | 20-100 |
| **Learning rate** | For SGD optimization | 0.005-0.01 |
| **Regularization** | Prevent overfitting | 0.02-0.1 |
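
To show where these three parameters enter, here is a minimal sketch of an SGD training loop for the biased model, run on the observed entries of the toy matrix from "The Recommendation Problem". The function names, default values, and epoch count are assumptions for illustration, not a tuned or production implementation.

```python
import numpy as np

def train_biased_mf(triples, n_users, n_items, k=2, lr=0.01, reg=0.02,
                    n_epochs=200, seed=0):
    """Fit mu, b_u, b_i, P, Q by SGD on observed (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    mu = float(np.mean([r for _, _, r in triples]))
    b_u, b_i = np.zeros(n_users), np.zeros(n_items)
    P = rng.normal(0, 0.1, (n_users, k))   # user latent vectors p_u
    Q = rng.normal(0, 0.1, (n_items, k))   # item latent vectors q_i

    for _ in range(n_epochs):
        for idx in rng.permutation(len(triples)):
            u, i, r = triples[idx]
            err = r - (mu + b_u[u] + b_i[i] + Q[i] @ P[u])
            # Gradient step on squared error with L2 regularization
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            p_old = P[u].copy()            # use the pre-update p_u for q_i's step
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return mu, b_u, b_i, P, Q

def rmse(triples, mu, b_u, b_i, P, Q):
    """Root mean squared error of the model on the given triples."""
    errs = [r - (mu + b_u[u] + b_i[i] + Q[i] @ P[u]) for u, i, r in triples]
    return float(np.sqrt(np.mean(np.square(errs))))

# Observed (user_idx, item_idx, rating) entries of the toy 4 x 5 matrix
triples = [(0, 0, 5), (0, 1, 3), (0, 3, 1), (1, 0, 4), (1, 3, 1), (2, 0, 1),
           (2, 1, 1), (2, 3, 5), (2, 4, 4), (3, 2, 5), (3, 3, 4)]

params = train_biased_mf(triples, n_users=4, n_items=5)
global_mean = float(np.mean([r for _, _, r in triples]))
baseline = rmse(triples, global_mean, np.zeros(4), np.zeros(5),
                np.zeros((4, 2)), np.zeros((5, 2)))
fitted = rmse(triples, *params)

print(f"Training RMSE, global mean only: {baseline:.3f}")
print(f"Training RMSE, fitted model:     {fitted:.3f}")
```

This reports training error only. A larger `k` or a smaller `reg` fits the observed ratings more closely but raises the risk of overfitting, which is why these values are normally chosen on held-out data.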

---

<a id='part5'></a>
# Part 5: Dataset Loading & EDA

---

## 5.1 Dataset Information

| Attribute | Value |
|-----------|-------|
| **Kaggle Dataset** | `movie-recommendation-system` |
| **Kaggle Path** | `/kaggle/input/movie-recommendation-system` |

---

## 5.2 Interview Question: Why Use This Dataset?

| Reason | Explanation |
|--------|-------------|
| **Real-world data** | Actual user ratings, not synthetic |
| **Benchmark standard** | MovieLens is industry standard for RecSys |
| **Good sparsity** | ~95% sparse, realistic challenge |
| **Multiple evaluation** | Can compare methods fairly |

In [None]:
# ============================================================
# LOAD DATASET FROM KAGGLE
# ============================================================
print("="*70)
print("LOADING MOVIE RECOMMENDATION DATASET")
print("="*70)

# ============================================================
# KAGGLE PATH CONFIGURATION
# ============================================================
# Dataset: /kaggle/input/movie-recommendation-system

USE_KAGGLE = os.path.exists('/kaggle/input')

# Possible paths for the dataset
POSSIBLE_PATHS = [
    # Primary: movie-recommendation-system dataset
    '/kaggle/input/movie-recommendation-system/ratings.csv',
    '/kaggle/input/movie-recommendation-system/rating.csv',
    '/kaggle/input/movie-recommendation-system',
    # Alternative MovieLens paths
    '/kaggle/input/movielens-100k-dataset/ml-100k/u.data',
    '/kaggle/input/movielens-small-latest-dataset/ratings.csv',
]

ratings = None
movies = None

if USE_KAGGLE:
    # First, let's check what files exist in the dataset
    base_path = '/kaggle/input/movie-recommendation-system'
    if os.path.exists(base_path):
        print(f"Found dataset at: {base_path}")
        print(f"Files in dataset: {os.listdir(base_path)}")
        
        # Try to find ratings file
        for filename in os.listdir(base_path):
            filepath = os.path.join(base_path, filename)
            if 'rating' in filename.lower() and filename.endswith('.csv'):
                try:
                    ratings = pd.read_csv(filepath)
                    print(f"Loaded ratings from: {filename}")
                    break
                except Exception as e:
                    print(f"Error loading {filename}: {e}")
        
        # Try to find movies file for titles
        for filename in os.listdir(base_path):
            filepath = os.path.join(base_path, filename)
            if 'movie' in filename.lower() and filename.endswith('.csv') and 'rating' not in filename.lower():
                try:
                    movies = pd.read_csv(filepath)
                    print(f"Loaded movies from: {filename}")
                    break
                except Exception as e:
                    print(f"Error loading {filename}: {e}")
    
    # If still no ratings, try alternative paths
    if ratings is None:
        for path in POSSIBLE_PATHS:
            # Skip missing paths and directories (only files can be read)
            if not os.path.isfile(path):
                continue
            try:
                if path.endswith('.csv'):
                    ratings = pd.read_csv(path)
                elif path.endswith('u.data'):
                    ratings = pd.read_csv(path, sep='\t', 
                                          names=['user_id', 'movie_id', 'rating', 'timestamp'])
                else:
                    continue
                print(f"Loaded ratings from: {path}")
                break
            except Exception as e:
                print(f"Error loading {path}: {e}")

# Standardize column names
if ratings is not None:
    ratings.columns = [c.lower().replace(' ', '_') for c in ratings.columns]
    # Handle various column naming conventions
    col_mapping = {
        'userid': 'user_id', 'user': 'user_id',
        'movieid': 'movie_id', 'movie': 'movie_id', 'item_id': 'movie_id',
        'rate': 'rating', 'score': 'rating'
    }
    ratings.rename(columns=col_mapping, inplace=True)

# Fallback: Create synthetic dataset
if ratings is None:
    print("\nNo Kaggle dataset found. Creating synthetic MovieLens-like data...")
    print("(Add 'movie-recommendation-system' dataset in Kaggle for real data)")
    
    np.random.seed(42)
    n_users = 500
    n_movies = 200
    n_ratings = 50000
    
    # Create user preferences (latent factors)
    user_preferences = np.random.randn(n_users, 5)
    movie_features = np.random.randn(n_movies, 5)
    
    ratings_list = []
    for _ in range(n_ratings):
        u = np.random.randint(1, n_users + 1)
        m = np.random.randint(1, n_movies + 1)
        base_rating = np.dot(user_preferences[u-1], movie_features[m-1])
        rating = np.clip(base_rating + 3 + np.random.randn() * 0.5, 1, 5)
        rating = round(rating * 2) / 2
        ratings_list.append([u, m, rating, 0])
    
    ratings = pd.DataFrame(ratings_list, 
                          columns=['user_id', 'movie_id', 'rating', 'timestamp'])
    ratings = ratings.drop_duplicates(subset=['user_id', 'movie_id'], keep='last')

print("\n" + "="*50)
print("DATASET SUMMARY")
print("="*50)
print(f"Columns: {list(ratings.columns)}")
print(f"Total ratings: {len(ratings):,}")
print(f"Unique users: {ratings['user_id'].nunique():,}")
print(f"Unique movies: {ratings['movie_id'].nunique():,}")
print(f"Rating range: {ratings['rating'].min()} - {ratings['rating'].max()}")
print(f"Average rating: {ratings['rating'].mean():.2f}")
print(f"\nSparsity: {100 * (1 - len(ratings) / (ratings['user_id'].nunique() * ratings['movie_id'].nunique())):.2f}%")

print("\nFirst few ratings:")
print(ratings.head(10))

In [None]:
# ============================================================
# EXPLORATORY DATA ANALYSIS
# ============================================================
print("="*70)
print("EXPLORATORY DATA ANALYSIS")
print("="*70)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Rating distribution
ax1 = axes[0, 0]
ratings['rating'].value_counts().sort_index().plot(kind='bar', ax=ax1, color='steelblue', edgecolor='black')
ax1.set_xlabel('Rating')
ax1.set_ylabel('Count')
ax1.set_title('Rating Distribution', fontweight='bold')
ax1.tick_params(axis='x', rotation=0)

# 2. Ratings per user
ax2 = axes[0, 1]
ratings_per_user = ratings.groupby('user_id').size()
ax2.hist(ratings_per_user, bins=50, color='coral', edgecolor='black', alpha=0.7)
ax2.axvline(ratings_per_user.median(), color='red', linestyle='--', label=f'Median: {ratings_per_user.median():.0f}')
ax2.set_xlabel('Number of Ratings')
ax2.set_ylabel('Number of Users')
ax2.set_title('Ratings per User', fontweight='bold')
ax2.legend()

# 3. Ratings per movie
ax3 = axes[1, 0]
ratings_per_movie = ratings.groupby('movie_id').size()
ax3.hist(ratings_per_movie, bins=50, color='green', edgecolor='black', alpha=0.7)
ax3.axvline(ratings_per_movie.median(), color='red', linestyle='--', label=f'Median: {ratings_per_movie.median():.0f}')
ax3.set_xlabel('Number of Ratings')
ax3.set_ylabel('Number of Movies')
ax3.set_title('Ratings per Movie', fontweight='bold')
ax3.legend()

# 4. Average rating per user
ax4 = axes[1, 1]
avg_rating_per_user = ratings.groupby('user_id')['rating'].mean()
ax4.hist(avg_rating_per_user, bins=30, color='purple', edgecolor='black', alpha=0.7)
ax4.axvline(avg_rating_per_user.mean(), color='red', linestyle='--', label=f'Mean: {avg_rating_per_user.mean():.2f}')
ax4.set_xlabel('Average Rating')
ax4.set_ylabel('Number of Users')
ax4.set_title('User Rating Bias', fontweight='bold')
ax4.legend()

plt.suptitle('MovieLens Dataset Analysis', fontweight='bold', fontsize=14)
plt.tight_layout()
plt.show()

print("\nKey Observations:")
print(f"  - The middle 50% of users rated {int(ratings_per_user.quantile(0.25))}-{int(ratings_per_user.quantile(0.75))} movies")
print(f"  - Some movies are very popular, most have few ratings (long tail)")
print(f"  - Users have different rating biases (some rate high, some low)")

In [None]:
# ============================================================
# PREPARE DATA FOR RECOMMENDATIONS
# ============================================================
print("="*70)
print("PREPARING DATA")
print("="*70)

# Filter users and movies with minimum ratings (reduces sparsity)
min_user_ratings = 10
min_movie_ratings = 10

# Filter users
user_counts = ratings['user_id'].value_counts()
valid_users = user_counts[user_counts >= min_user_ratings].index

# Filter movies
movie_counts = ratings['movie_id'].value_counts()
valid_movies = movie_counts[movie_counts >= min_movie_ratings].index

# Apply filters
ratings_filtered = ratings[
    (ratings['user_id'].isin(valid_users)) & 
    (ratings['movie_id'].isin(valid_movies))
].copy()

print(f"\nAfter filtering (min {min_user_ratings} ratings):")
print(f"  Ratings: {len(ratings_filtered):,}")
print(f"  Users: {ratings_filtered['user_id'].nunique():,}")
print(f"  Movies: {ratings_filtered['movie_id'].nunique():,}")

# Create train/test split
train_data, test_data = train_test_split(ratings_filtered, test_size=0.2, random_state=42)

print(f"\nTrain/Test split:")
print(f"  Train: {len(train_data):,} ratings")
print(f"  Test:  {len(test_data):,} ratings")

# Create user-item matrix
print("\nCreating user-item matrix...")

# Get unique users and movies
# NOTE: this rebinds `movies` (the titles DataFrame loaded earlier, if any) to a list of movie IDs
users = sorted(train_data['user_id'].unique())
movies = sorted(train_data['movie_id'].unique())

# Create mappings
user_to_idx = {u: i for i, u in enumerate(users)}
idx_to_user = {i: u for u, i in user_to_idx.items()}
movie_to_idx = {m: i for i, m in enumerate(movies)}
idx_to_movie = {i: m for m, i in movie_to_idx.items()}

n_users = len(users)
n_movies = len(movies)

# Create matrix (vectorized: map IDs to row/column indices, then assign all ratings at once)
user_item_matrix = np.zeros((n_users, n_movies))
u_idx = train_data['user_id'].map(user_to_idx).to_numpy()
m_idx = train_data['movie_id'].map(movie_to_idx).to_numpy()
user_item_matrix[u_idx, m_idx] = train_data['rating'].to_numpy()

print(f"\nUser-Item Matrix shape: {user_item_matrix.shape}")
print(f"Sparsity: {100 * (1 - np.count_nonzero(user_item_matrix) / user_item_matrix.size):.2f}%")

---

<a id='part6'></a>
# Part 6: User-Based Collaborative Filtering

---

In [None]:
# ============================================================
# USER-BASED COLLABORATIVE FILTERING
# ============================================================
print("="*70)
print("USER-BASED COLLABORATIVE FILTERING")
print("="*70)

print("""
Algorithm:
==========
1. Compute similarity between all user pairs
2. For prediction: Find K most similar users who rated the item
3. Predict = weighted average of neighbors' ratings

Key Parameters:
  - K (n_neighbors): Number of similar users to consider
  - Similarity metric: Pearson (recommended) or Cosine
""")

class UserBasedCF:
    """
    User-Based Collaborative Filtering from scratch.
    """
    
    def __init__(self, n_neighbors=20, similarity='pearson'):
        """
        Parameters:
        - n_neighbors: Number of similar users to use
        - similarity: 'pearson' or 'cosine'
        """
        self.n_neighbors = n_neighbors
        self.similarity = similarity
        self.user_sim_matrix = None
        self.user_means = None
        self.matrix = None
    
    def fit(self, user_item_matrix):
        """
        Compute user-user similarity matrix.
        """
        self.matrix = user_item_matrix.copy()
        n_users = self.matrix.shape[0]
        
        # Compute user means (for non-zero ratings only)
        self.user_means = np.zeros(n_users)
        for i in range(n_users):
            rated = self.matrix[i] > 0
            if rated.sum() > 0:
                self.user_means[i] = self.matrix[i, rated].mean()
        
        # Compute similarity matrix
        print(f"Computing {n_users}x{n_users} user similarity matrix...")
        self.user_sim_matrix = np.zeros((n_users, n_users))
        
        for i in range(n_users):
            for j in range(i, n_users):
                if i == j:
                    self.user_sim_matrix[i, j] = 1.0
                else:
                    if self.similarity == 'pearson':
                        sim = pearson_sim(self.matrix[i], self.matrix[j])
                    else:
                        sim = cosine_sim(self.matrix[i], self.matrix[j])
                    self.user_sim_matrix[i, j] = sim
                    self.user_sim_matrix[j, i] = sim
        
        print("Done!")
        return self
    
    def predict(self, user_idx, item_idx):
        """
        Predict rating for a user-item pair.
        """
        # Get users who rated this item
        item_ratings = self.matrix[:, item_idx]
        rated_mask = item_ratings > 0
        
        if not rated_mask.any():
            return self.user_means[user_idx]  # Fallback to user mean
        
        # Get similarities to users who rated this item
        similarities = self.user_sim_matrix[user_idx, rated_mask]
        ratings = item_ratings[rated_mask]
        means = self.user_means[rated_mask]
        
        # Get top K neighbors
        if len(similarities) > self.n_neighbors:
            top_k_idx = np.argsort(similarities)[-self.n_neighbors:]
            similarities = similarities[top_k_idx]
            ratings = ratings[top_k_idx]
            means = means[top_k_idx]
        
        # Compute weighted average (with mean centering)
        sim_sum = np.abs(similarities).sum()
        if sim_sum == 0:
            return self.user_means[user_idx]
        
        # Prediction = user_mean + weighted sum of deviations
        deviation = ratings - means
        pred = self.user_means[user_idx] + np.dot(similarities, deviation) / sim_sum
        
        # Clip to valid rating range
        return np.clip(pred, 1, 5)
    
    def recommend(self, user_idx, n_recommendations=10):
        """
        Get top N recommendations for a user.
        """
        # Get items user hasn't rated
        unrated_mask = self.matrix[user_idx] == 0
        unrated_items = np.where(unrated_mask)[0]
        
        # Predict ratings for all unrated items
        predictions = []
        for item_idx in unrated_items:
            pred = self.predict(user_idx, item_idx)
            predictions.append((item_idx, pred))
        
        # Sort by predicted rating
        predictions.sort(key=lambda x: x[1], reverse=True)
        
        return predictions[:n_recommendations]

# Train User-Based CF
print("\nTraining User-Based CF...")
print(f"  n_neighbors: 20")
print(f"  Similarity: Pearson")

user_cf = UserBasedCF(n_neighbors=20, similarity='pearson')
user_cf.fit(user_item_matrix)

print("\nUser-Based CF ready!")

In [None]:
# Visualize user similarity matrix
print("="*70)
print("USER SIMILARITY MATRIX (Sample)")
print("="*70)

# Show subset of similarity matrix
n_show = min(30, n_users)

fig, ax = plt.subplots(figsize=(10, 8))

sns.heatmap(user_cf.user_sim_matrix[:n_show, :n_show], 
            cmap='RdYlBu_r', center=0, ax=ax,
            xticklabels=5, yticklabels=5)
ax.set_title(f'User-User Similarity Matrix (First {n_show} Users)', fontweight='bold')
ax.set_xlabel('User Index')
ax.set_ylabel('User Index')

plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("  - Red = Similar users (positive correlation)")
print("  - Blue = Dissimilar users (negative correlation)")
print("  - Diagonal = Self-similarity (always 1.0)")

---

<a id='part7'></a>
# Part 7: Item-Based Collaborative Filtering

---

In [None]:
# ============================================================
# ITEM-BASED COLLABORATIVE FILTERING
# ============================================================
print("="*70)
print("ITEM-BASED COLLABORATIVE FILTERING")
print("="*70)

print("""
Algorithm:
==========
1. Compute similarity between all item pairs
2. For prediction: Find items similar to those user rated
3. Predict = weighted average based on item similarities

Why Item-Based is Often Preferred:
  - Items are more stable than users
  - Precomputed similarities can be reused
  - Better explainability ("Because you liked X...")
""")

class ItemBasedCF:
    """
    Item-Based Collaborative Filtering from scratch.
    """
    
    def __init__(self, n_neighbors=20, similarity='cosine'):
        """
        Parameters:
        - n_neighbors: Number of similar items to use
        - similarity: 'cosine' or 'pearson'
        """
        self.n_neighbors = n_neighbors
        self.similarity = similarity
        self.item_sim_matrix = None
        self.matrix = None
    
    def fit(self, user_item_matrix):
        """
        Compute item-item similarity matrix.
        """
        self.matrix = user_item_matrix.copy()
        n_items = self.matrix.shape[1]
        
        # Transpose for item-based operations
        item_matrix = self.matrix.T  # Now rows are items
        
        # Compute similarity matrix
        print(f"Computing {n_items}x{n_items} item similarity matrix...")
        self.item_sim_matrix = np.zeros((n_items, n_items))
        
        for i in range(n_items):
            for j in range(i, n_items):
                if i == j:
                    self.item_sim_matrix[i, j] = 1.0
                else:
                    if self.similarity == 'pearson':
                        sim = pearson_sim(item_matrix[i], item_matrix[j])
                    else:
                        sim = cosine_sim(item_matrix[i], item_matrix[j])
                    self.item_sim_matrix[i, j] = sim
                    self.item_sim_matrix[j, i] = sim
        
        print("Done!")
        return self
    
    def predict(self, user_idx, item_idx):
        """
        Predict rating for a user-item pair.
        """
        # Get items this user has rated
        user_ratings = self.matrix[user_idx]
        rated_mask = user_ratings > 0
        
        if not rated_mask.any():
            return 3.0  # Fallback to neutral rating
        
        # Get similarities to items user has rated
        similarities = self.item_sim_matrix[item_idx, rated_mask]
        ratings = user_ratings[rated_mask]
        
        # Get top K similar items
        if len(similarities) > self.n_neighbors:
            top_k_idx = np.argsort(similarities)[-self.n_neighbors:]
            similarities = similarities[top_k_idx]
            ratings = ratings[top_k_idx]
        
        # Only use positive similarities
        pos_mask = similarities > 0
        if not pos_mask.any():
            return ratings.mean() if len(ratings) > 0 else 3.0
        
        similarities = similarities[pos_mask]
        ratings = ratings[pos_mask]
        
        # Weighted average
        pred = np.dot(similarities, ratings) / similarities.sum()
        
        return np.clip(pred, 1, 5)
    
    def recommend(self, user_idx, n_recommendations=10):
        """
        Get top N recommendations for a user.
        """
        # Get items user hasn't rated
        unrated_mask = self.matrix[user_idx] == 0
        unrated_items = np.where(unrated_mask)[0]
        
        # Predict ratings for all unrated items
        predictions = []
        for item_idx in unrated_items:
            pred = self.predict(user_idx, item_idx)
            predictions.append((item_idx, pred))
        
        # Sort by predicted rating
        predictions.sort(key=lambda x: x[1], reverse=True)
        
        return predictions[:n_recommendations]
    
    def get_similar_items(self, item_idx, n=5):
        """
        Get most similar items to a given item.
        """
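        # Copy the row: indexing returns a view, and the assignment below
        # would otherwise overwrite the stored self-similarity of 1.0
        similarities = self.item_sim_matrix[item_idx].copy()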
        # Exclude self
        similarities[item_idx] = -1
        top_idx = np.argsort(similarities)[-n:][::-1]
        return [(idx, similarities[idx]) for idx in top_idx]

# Train Item-Based CF
print("\nTraining Item-Based CF...")
print("  n_neighbors: 20")
print("  Similarity: Cosine")

item_cf = ItemBasedCF(n_neighbors=20, similarity='cosine')
item_cf.fit(user_item_matrix)

print("\nItem-Based CF ready!")

In [None]:
# Visualize item similarity matrix
print("="*70)
print("ITEM SIMILARITY MATRIX (Sample)")
print("="*70)

# Show subset of similarity matrix
n_show = min(30, n_movies)

fig, ax = plt.subplots(figsize=(10, 8))

sns.heatmap(item_cf.item_sim_matrix[:n_show, :n_show], 
            cmap='YlOrRd', ax=ax,
            xticklabels=5, yticklabels=5)
ax.set_title(f'Item-Item Similarity Matrix (First {n_show} Movies)', fontweight='bold')
ax.set_xlabel('Movie Index')
ax.set_ylabel('Movie Index')

plt.tight_layout()
plt.show()

# Show example of similar items
print("\nExample: Most similar movies to Movie 0:")
similar = item_cf.get_similar_items(0, n=5)
for idx, sim in similar:
    print(f"  Movie {idx}: similarity = {sim:.4f}")
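
The double loop in `ItemBasedCF.fit` calls the similarity function once per item pair, which gets slow as the catalog grows. For cosine similarity the whole matrix can be computed with a single matrix product. The sketch below is an illustration, not the tutorial's implementation: the function name `cosine_item_similarity` and the toy matrix are hypothetical, and it treats unrated entries as plain zeros, which may differ from `cosine_sim` if that helper restricts itself to co-rated entries.

```python
import numpy as np

def cosine_item_similarity(user_item_matrix):
    """Item-item cosine similarity computed with one matrix product.

    Assumption: 0 means "not rated" and is treated as a plain zero.
    """
    R = np.asarray(user_item_matrix, dtype=float)
    norms = np.linalg.norm(R, axis=0)      # one norm per item (column)
    norms[norms == 0] = 1.0                # avoid division by zero for unrated items
    sim = (R.T @ R) / np.outer(norms, norms)
    np.fill_diagonal(sim, 1.0)             # same self-similarity as the loop version
    return sim

# Toy matrix: rows are users, columns are items
toy = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)
toy_sim = cosine_item_similarity(toy)
print(np.round(toy_sim, 3))
```

The result is symmetric with ones on the diagonal, so it could stand in for `item_sim_matrix` when the zero-as-missing assumption is acceptable.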

---

<a id='part8'></a>
# Part 8: SVD-Based Recommendations

---

In [None]:
# ============================================================
# SVD-BASED RECOMMENDATIONS
# ============================================================
print("="*70)
print("SVD-BASED RECOMMENDATIONS (Matrix Factorization)")
print("="*70)

print("""
SVD Approach:
=============
1. Fill missing values (e.g., with mean)
2. Apply SVD: R = U × Σ × V^T
3. Keep top k singular values (dimensionality reduction)
4. Reconstruct matrix for predictions

Key Parameter: k (number of latent factors)
  - Small k: Simple model, may underfit
  - Large k: Complex model, may overfit
  - Typical: 20-100 factors
""")

class SVDRecommender:
    """
    SVD-based recommendation system.
    """
    
    def __init__(self, n_factors=50):
        """
        Parameters:
        - n_factors: Number of latent factors to keep
        """
        self.n_factors = n_factors
        self.user_factors = None
        self.item_factors = None
        self.user_means = None
        self.global_mean = None
        self.predictions = None
    
    def fit(self, user_item_matrix):
        """
        Fit SVD model.
        """
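        # Float copy, so filling with (fractional) user means is not truncated
        # if the input matrix has an integer dtype
        matrix = user_item_matrix.astype(float)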
        
        # Compute means
        self.global_mean = matrix[matrix > 0].mean()
        self.user_means = np.zeros(matrix.shape[0])
        for i in range(matrix.shape[0]):
            rated = matrix[i] > 0
            if rated.sum() > 0:
                self.user_means[i] = matrix[i, rated].mean()
            else:
                self.user_means[i] = self.global_mean
        
        # Fill missing values with user mean
        matrix_filled = matrix.copy()
        for i in range(matrix.shape[0]):
            matrix_filled[i, matrix[i] == 0] = self.user_means[i]
        
        # Mean-center the matrix
        matrix_centered = matrix_filled - self.user_means.reshape(-1, 1)
        
        # Apply SVD
        print(f"Applying SVD with k={self.n_factors} factors...")
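        U, sigma, Vt = svds(sparse.csr_matrix(matrix_centered), k=self.n_factors)
        
        # svds does not return singular values in descending order;
        # sort so that factor index 0 is the strongest one
        order = np.argsort(sigma)[::-1]
        U, sigma, Vt = U[:, order], sigma[order], Vt[order, :]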
        
        # Store factors
        sigma = np.diag(sigma)
        self.user_factors = U
        self.sigma = sigma
        self.item_factors = Vt
        
        # Reconstruct matrix for predictions
        self.predictions = U @ sigma @ Vt + self.user_means.reshape(-1, 1)
        
        # Clip predictions to valid range
        self.predictions = np.clip(self.predictions, 1, 5)
        
        print("Done!")
        return self
    
    def predict(self, user_idx, item_idx):
        """
        Predict rating for a user-item pair.
        """
        return self.predictions[user_idx, item_idx]
    
    def recommend(self, user_idx, original_matrix, n_recommendations=10):
        """
        Get top N recommendations for a user.
        """
        # Get items user hasn't rated
        unrated_mask = original_matrix[user_idx] == 0
        
        # Get predictions for unrated items
        user_predictions = self.predictions[user_idx].copy()
        user_predictions[~unrated_mask] = -np.inf  # Exclude rated items
        
        # Get top N
        top_items = np.argsort(user_predictions)[-n_recommendations:][::-1]
        
        return [(item, self.predictions[user_idx, item]) for item in top_items]

# Train SVD Recommender
print("\nTraining SVD Recommender...")
print("  n_factors: 50")

svd_rec = SVDRecommender(n_factors=50)
svd_rec.fit(user_item_matrix)

print("\nSVD Recommender ready!")

In [None]:
# Visualize latent factors
print("="*70)
print("SVD LATENT FACTORS")
print("="*70)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# User factors (first 2 dimensions)
ax1 = axes[0]
ax1.scatter(svd_rec.user_factors[:, 0], svd_rec.user_factors[:, 1], 
           alpha=0.5, c='steelblue', s=20)
ax1.set_xlabel('Latent Factor 1')
ax1.set_ylabel('Latent Factor 2')
ax1.set_title('Users in Latent Space', fontweight='bold')
ax1.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
ax1.axvline(x=0, color='gray', linestyle='--', alpha=0.5)

# Item factors (first 2 dimensions)
ax2 = axes[1]
ax2.scatter(svd_rec.item_factors[0, :], svd_rec.item_factors[1, :], 
           alpha=0.5, c='coral', s=20)
ax2.set_xlabel('Latent Factor 1')
ax2.set_ylabel('Latent Factor 2')
ax2.set_title('Items in Latent Space', fontweight='bold')
ax2.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
ax2.axvline(x=0, color='gray', linestyle='--', alpha=0.5)

plt.suptitle('SVD Latent Factor Visualization (First 2 Factors)', fontweight='bold')
plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("  - Each point represents a user/item in latent space")
print("  - Close points have similar tastes/characteristics")
print("  - Factors could represent genres, eras, popularity, etc.")

---

<a id='part9'></a>
# Part 9: Evaluation & Comparison

---

In [None]:
# ============================================================
# EVALUATION METRICS
# ============================================================
print("="*70)
print("EVALUATION METRICS")
print("="*70)

print("""
Rating Prediction Metrics:
==========================

| Metric | Formula | Interpretation |
|--------|---------|----------------|
| RMSE   | √(Σ(r - r̂)²/n) | Lower is better, penalizes large errors |
| MAE    | Σ|r - r̂|/n     | Lower is better, average error |

Ranking Metrics:
================

| Metric | Description |
|--------|-------------|
| Precision@K | Share of the top-K recommendations that are relevant |
| Recall@K    | Share of all relevant items that appear in the top-K |
| NDCG@K      | Normalized discounted cumulative gain |
""")

def evaluate_model(model, test_data, user_to_idx, movie_to_idx, model_name, is_svd=False):
    """
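    Evaluate a recommendation model on test data using RMSE and MAE.

    Test rows whose user or movie is missing from the index mappings are skipped.
    Note: is_svd is currently unused; every model is scored through model.predict().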
    """
    predictions = []
    actuals = []
    
    for _, row in test_data.iterrows():
        u_idx = user_to_idx.get(row['user_id'])
        m_idx = movie_to_idx.get(row['movie_id'])
        
        if u_idx is not None and m_idx is not None:
            pred = model.predict(u_idx, m_idx)
            predictions.append(pred)
            actuals.append(row['rating'])
    
    predictions = np.array(predictions)
    actuals = np.array(actuals)
    
    rmse = np.sqrt(mean_squared_error(actuals, predictions))
    mae = mean_absolute_error(actuals, predictions)
    
    return {
        'Model': model_name,
        'RMSE': rmse,
        'MAE': mae,
        'n_predictions': len(predictions)
    }

# Evaluate all models
print("\nEvaluating models on test set...")
print("(This may take a minute)\n")

# Sample test data for faster evaluation
test_sample = test_data.sample(min(5000, len(test_data)), random_state=42)

results = []

# User-Based CF
print("Evaluating User-Based CF...")
results.append(evaluate_model(user_cf, test_sample, user_to_idx, movie_to_idx, "User-Based CF"))

# Item-Based CF
print("Evaluating Item-Based CF...")
results.append(evaluate_model(item_cf, test_sample, user_to_idx, movie_to_idx, "Item-Based CF"))

# SVD
print("Evaluating SVD...")
results.append(evaluate_model(svd_rec, test_sample, user_to_idx, movie_to_idx, "SVD (k=50)"))

# Add baseline (global mean)
global_mean = train_data['rating'].mean()
baseline_rmse = np.sqrt(mean_squared_error(test_sample['rating'], [global_mean] * len(test_sample)))
baseline_mae = mean_absolute_error(test_sample['rating'], [global_mean] * len(test_sample))
results.append({'Model': 'Baseline (Mean)', 'RMSE': baseline_rmse, 'MAE': baseline_mae, 'n_predictions': len(test_sample)})

# Create results DataFrame
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('RMSE')

print("\n" + "="*50)
print("RESULTS COMPARISON")
print("="*50)
print(results_df.to_string(index=False))
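
The cell above reports only RMSE and MAE, while the metrics table also lists Precision@K and Recall@K. A minimal sketch of both for a single user is shown here. The function name, the example item ids, and the idea of treating a test rating of 4 or more as "relevant" are assumptions for illustration.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@K and Recall@K for one user.

    recommended: item ids ordered best-first
    relevant: set of item ids the user actually liked (e.g. test rating >= 4)
    """
    top_k = list(recommended)[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

recommended = [10, 42, 7, 3, 99]   # hypothetical top-5 list
relevant = {42, 3, 5}              # hypothetical liked items in the test set
p_at_5, r_at_5 = precision_recall_at_k(recommended, relevant, k=5)
print(f"Precision@5 = {p_at_5:.2f}, Recall@5 = {r_at_5:.2f}")
```

Two of the five recommendations are relevant (precision 0.40), and two of the three relevant items were found (recall 0.67). Averaging these values over all test users gives the dataset-level metric.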

In [None]:
# Visualize results
print("="*70)
print("RESULTS VISUALIZATION")
print("="*70)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

models = results_df['Model']
colors = ['steelblue', 'coral', 'green', 'gray']

# RMSE comparison
ax1 = axes[0]
bars1 = ax1.barh(models, results_df['RMSE'], color=colors, edgecolor='black')
ax1.set_xlabel('RMSE (Lower is Better)')
ax1.set_title('RMSE Comparison', fontweight='bold')
ax1.axvline(x=baseline_rmse, color='red', linestyle='--', alpha=0.5, label='Baseline')
ax1.legend()  # without this call the 'Baseline' label is never shown
for bar, val in zip(bars1, results_df['RMSE']):
    ax1.text(val + 0.02, bar.get_y() + bar.get_height()/2, f'{val:.3f}', va='center')

# MAE comparison
ax2 = axes[1]
bars2 = ax2.barh(models, results_df['MAE'], color=colors, edgecolor='black')
ax2.set_xlabel('MAE (Lower is Better)')
ax2.set_title('MAE Comparison', fontweight='bold')
ax2.axvline(x=baseline_mae, color='red', linestyle='--', alpha=0.5, label='Baseline')
ax2.legend()  # without this call the 'Baseline' label is never shown
for bar, val in zip(bars2, results_df['MAE']):
    ax2.text(val + 0.02, bar.get_y() + bar.get_height()/2, f'{val:.3f}', va='center')

plt.suptitle('Recommendation Model Comparison', fontweight='bold', fontsize=14)
plt.tight_layout()
plt.show()

best_model = results_df.iloc[0]['Model']
best_rmse = results_df.iloc[0]['RMSE']
improvement = ((baseline_rmse - best_rmse) / baseline_rmse) * 100

print(f"\nBest Model: {best_model}")
print(f"RMSE Improvement over Baseline: {improvement:.1f}%")

In [None]:
# Example recommendations
print("="*70)
print("EXAMPLE RECOMMENDATIONS")
print("="*70)

# Use the first user (index 0) as a fixed, reproducible example
example_user = 0

print(f"\nRecommendations for User {example_user}:")
print("-" * 50)

# Get user's highly rated items
user_ratings = user_item_matrix[example_user]
high_rated = np.where(user_ratings >= 4)[0][:5]
print("\nItems the user rated highly (4+):")
for item_idx in high_rated:
    print(f"  Movie {idx_to_movie[item_idx]}: {user_ratings[item_idx]:.1f} stars")

# Get recommendations from each model
print(f"\nTop 5 Recommendations:")

print("\n[User-Based CF]")
user_recs = user_cf.recommend(example_user, n_recommendations=5)
for item_idx, pred in user_recs:
    print(f"  Movie {idx_to_movie[item_idx]}: predicted {pred:.2f} stars")

print("\n[Item-Based CF]")
item_recs = item_cf.recommend(example_user, n_recommendations=5)
for item_idx, pred in item_recs:
    print(f"  Movie {idx_to_movie[item_idx]}: predicted {pred:.2f} stars")

print("\n[SVD]")
svd_recs = svd_rec.recommend(example_user, user_item_matrix, n_recommendations=5)
for item_idx, pred in svd_recs:
    print(f"  Movie {idx_to_movie[item_idx]}: predicted {pred:.2f} stars")

---

<a id='part10'></a>
# Part 10: Summary & Key Takeaways

---

In [None]:
# Final summary
print("="*70)
print("RECOMMENDATION SYSTEM - SUMMARY")
print("="*70)

print("""
WHAT WE LEARNED:
================

1. TYPES OF RECOMMENDATION SYSTEMS:
   ┌─────────────────┬──────────────────────────────────────┐
   │ Type            │ How It Works                         │
   ├─────────────────┼──────────────────────────────────────┤
   │ Content-Based   │ Match item features to user profile  │
   │ Collaborative   │ Use similar users/items              │
   │ Hybrid          │ Combine both approaches              │
   └─────────────────┴──────────────────────────────────────┘

2. COLLABORATIVE FILTERING:
   ┌─────────────────┬─────────────┬─────────────────────────┐
   │ Method          │ Similarity  │ Best For                │
   ├─────────────────┼─────────────┼─────────────────────────┤
   │ User-Based CF   │ User-User   │ Few users, many items   │
   │ Item-Based CF   │ Item-Item   │ Many users, stable      │
   │ SVD             │ Latent      │ Large, sparse matrices  │
   └─────────────────┴─────────────┴─────────────────────────┘

3. SIMILARITY METRICS:
   - Cosine: Good for sparse data, ignores magnitude
   - Pearson: Handles user bias, captures correlation
   - Jaccard: For binary data (liked/not liked)

4. KEY PARAMETERS TO TUNE:
   - K (neighbors): 20-50 typically works well
   - n_factors (SVD): 20-100 latent dimensions
   - Similarity metric: Pearson for users, Cosine for items

5. EVALUATION METRICS:
   - RMSE: Root Mean Squared Error (penalizes large errors)
   - MAE: Mean Absolute Error (average error)
   - Precision@K, Recall@K: For ranking quality
""")

print("\nMODEL COMPARISON SUMMARY:")
print(results_df.to_string(index=False))

print("\n" + "="*70)

## Algorithm Taxonomy & Model Selection Guide

### When to Use Which Model?

| Scenario | Recommended Model | Why | Alternatives |
|----------|-------------------|-----|--------------|
| **Small dataset (<100K ratings)** | User-Based CF | Fast enough, captures user nuance | Item-Based CF |
| **Large dataset (>1M ratings)** | SVD or Item-Based CF | Scales better | ALS, Neural CF |
| **Very sparse (>99% empty)** | SVD | Handles sparsity via latent factors | ALS |
| **Real-time recommendations** | Item-Based CF (precomputed) | Fast lookup | Content-based |
| **New items frequently** | Content-Based or Hybrid | Item features avoid the new-item cold start | Editorial curation |
| **Implicit feedback (clicks)** | ALS, BPR | Designed for implicit | Item-Based with Jaccard |
| **Need explainability** | Item-Based CF | "Because you liked X..." | Content-based |

---

## Parameter Tuning Guide

### User-Based & Item-Based CF

| Parameter | Range | Effect | How to Tune |
|-----------|-------|--------|-------------|
| **K (neighbors)** | 5-100 | More K = smoother, less personalized | Grid search, start with 20-50 |
| **Similarity metric** | Pearson/Cosine | Pearson handles bias | Pearson for users, Cosine for items |
| **Min co-rated items** | 1-10 | Higher = more reliable similarity | Set based on sparsity |
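
The "Min co-rated items" parameter can be sketched as a guard inside the similarity function: if two rating vectors share too few rated positions, the similarity is set to zero instead of being trusted. The function name `cosine_sim_min_overlap` and the `min_common` default are hypothetical, and zeros are assumed to mean "not rated".

```python
import numpy as np

def cosine_sim_min_overlap(a, b, min_common=3):
    """Cosine similarity on co-rated entries, or 0.0 if the overlap is too small."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    common = (a > 0) & (b > 0)             # positions rated in both vectors
    if common.sum() < min_common:
        return 0.0                         # too little evidence to trust
    a_c, b_c = a[common], b[common]
    denom = np.linalg.norm(a_c) * np.linalg.norm(b_c)
    return float(a_c @ b_c / denom) if denom > 0 else 0.0

u1 = [5, 3, 0, 1]
u2 = [4, 0, 0, 1]
strict = cosine_sim_min_overlap(u1, u2, min_common=3)   # only 2 co-rated items
loose = cosine_sim_min_overlap(u1, u2, min_common=2)
print(strict, round(loose, 4))
```

With only two co-rated items, the stricter threshold discards a similarity that the looser one reports as nearly 1.0, which is the kind of overconfident neighbor this parameter is meant to filter out.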

### SVD / Matrix Factorization

| Parameter | Range | Effect | How to Tune |
|-----------|-------|--------|-------------|
| **n_factors (k)** | 10-200 | More = captures more patterns, risk overfitting | Cross-validation, typically 50-100 |
| **Learning rate** | 0.001-0.1 | Higher = faster but unstable | Start 0.005, reduce if diverges |
| **Regularization (λ)** | 0.01-0.5 | Higher = prevents overfitting | Increase if train >> test error |
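| **Epochs** | 10-100 | More = better fit, slower | Early stopping |

Only `n_factors` applies to the truncated SVD (`svds`) used in Part 8. Learning rate, regularization, and epochs apply to matrix factorization trained by gradient descent (for example, the `SVD` algorithm in Surprise).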
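
A minimal sketch of how learning rate, regularization, and epochs enter a matrix factorization trained with stochastic gradient descent. This is an illustration on a toy rating list, not the `SVDRecommender` from Part 8; the function name `train_mf_sgd`, the initialization scale, and the default values are assumptions.

```python
import numpy as np

def train_mf_sgd(ratings, n_users, n_items, n_factors=2,
                 lr=0.02, reg=0.02, n_epochs=200, seed=0):
    """Factorize observed ratings as mu + P[u] . Q[i] using plain SGD."""
    rng = np.random.default_rng(seed)
    P = rng.normal(0, 0.1, (n_users, n_factors))   # user factors
    Q = rng.normal(0, 0.1, (n_items, n_factors))   # item factors
    mu = np.mean([r for _, _, r in ratings])       # global mean rating
    history = []                                   # training RMSE per epoch
    for _ in range(n_epochs):
        sq_err = 0.0
        for u, i, r in ratings:
            err = r - (mu + P[u] @ Q[i])
            sq_err += err ** 2
            p_u = P[u].copy()                      # keep old value for the Q update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_u - reg * Q[i])
        history.append(np.sqrt(sq_err / len(ratings)))
    return mu, P, Q, history

# (user_idx, item_idx, rating) triples for observed ratings only
toy_ratings = [(0, 0, 5), (0, 1, 3), (0, 3, 1), (1, 0, 4), (1, 3, 1),
               (2, 0, 1), (2, 1, 1), (2, 3, 5), (2, 4, 4), (3, 2, 5), (3, 3, 4)]
mu, P, Q, history = train_mf_sgd(toy_ratings, n_users=4, n_items=5)
print(f"Train RMSE: first epoch {history[0]:.3f}, last epoch {history[-1]:.3f}")
```

Unlike the truncated SVD in Part 8, this variant only ever touches observed ratings, so no mean imputation is needed. Training error alone says nothing about overfitting; a held-out set is still required to choose `reg` and `n_epochs`.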

---

## Deep Dive: Common Problems & Solutions

### Cold Start Problem

Cold start = new users or items with no interaction history.

| Type | Problem | Solutions |
|------|---------|-----------|
| **New User** | No ratings to find similar users | Ask for preferences, use demographics, popularity-based |
| **New Item** | No one has rated it yet | Content-based features, editorial curation |

**Common approach**: Use a hybrid system: content-based recommendations for new items, popularity-based recommendations for new users, and a transition to collaborative filtering once enough interaction data has accumulated.
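
The routing logic for new users can be sketched in a few lines. Everything named here is hypothetical: `rating_counts`, `cf_recommend`, `popular_items`, and the `min_ratings` threshold are placeholders for whatever the surrounding system provides.

```python
def recommend_with_fallback(user_id, rating_counts, cf_recommend, popular_items,
                            min_ratings=5, n=5):
    """Route a user to collaborative filtering or to a popularity fallback.

    rating_counts: dict user_id -> number of ratings in the training data
    cf_recommend:  callable (user_id, n) -> list of item ids from a trained CF model
    popular_items: item ids sorted by popularity, most popular first
    min_ratings:   assumed threshold below which CF is considered unreliable
    """
    if rating_counts.get(user_id, 0) < min_ratings:
        return popular_items[:n], "popularity"
    return cf_recommend(user_id, n), "collaborative"

rating_counts = {"alice": 40, "bob": 2}
popular_items = [50, 100, 181, 258, 286, 1]

def fake_cf(user_id, n):
    """Stand-in for a trained model's recommend method."""
    return [7, 8, 9][:n]

new_user_recs = recommend_with_fallback("carol", rating_counts, fake_cf, popular_items, n=3)
known_user_recs = recommend_with_fallback("alice", rating_counts, fake_cf, popular_items, n=3)
print(new_user_recs)
print(known_user_recs)
```

Returning the source alongside the items makes it easy to log how often the fallback is used and to tune `min_ratings` accordingly.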

---

### Why SVD Works for Sparse Data

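1. **Dimensionality reduction**: SVD compresses the rating matrix into a small number of dense latent factors per user and per item
2. **Generalization**: Each factor is estimated from many users' ratings, so patterns learned from observed entries carry over to missing ones
3. **Implicit matrix completion**: Multiplying the factors back together yields a predicted rating for every user-item pair, including unrated ones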

```
Sparse Matrix (95% empty)    →    SVD    →    Dense Factors
                                                    ↓
                                            Reconstruct full matrix
```
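
The pipeline in the diagram can be traced on the toy matrix from the introduction. This sketch uses `np.linalg.svd` on a dense array rather than `svds`, and `k = 2` is an arbitrary choice for illustration; it follows the same idea as Part 8 (center by user mean, treat unrated entries as zero after centering, truncate, reconstruct).

```python
import numpy as np

# Rows are users, columns are movies, 0 = not rated
R = np.array([
    [5, 3, 0, 1, 0],
    [4, 0, 0, 1, 0],
    [1, 1, 0, 5, 4],
    [0, 0, 5, 4, 0],
], dtype=float)

rated = R > 0
user_means = R.sum(axis=1) / rated.sum(axis=1)             # mean over rated entries only
centered = np.where(rated, R - user_means[:, None], 0.0)   # unrated -> 0 after centering

# np.linalg.svd returns singular values in descending order
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

k = 2                                                      # latent factors to keep
approx = (U[:, :k] * s[:k]) @ Vt[:k, :] + user_means[:, None]
approx = np.clip(approx, 1, 5)

print("Singular values:", np.round(s, 3))
print(np.round(approx, 2))
```

Every cell of `approx` now holds a value, including the positions that were 0 in `R`. Those filled-in cells are the predictions for unrated movies.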

---

### Scaling to Millions of Users

| Strategy | Description |
|----------|-------------|
| **Precomputation** | Compute similarities offline, store in cache |
| **Approximate NN** | Use LSH, Annoy, FAISS for fast similarity search |
| **Clustering** | Group similar users, recommend within clusters |
| **Model compression** | Reduce latent factors, quantization |
| **Distributed computing** | Spark MLlib for ALS at scale |
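
The precomputation strategy can be sketched as follows: instead of keeping a full similarity matrix in memory at serving time, store only each item's top-K neighbors. The function name `precompute_top_k` and the example matrix are hypothetical; a production system would typically write the result to a cache or key-value store.

```python
import numpy as np

def precompute_top_k(sim_matrix, k):
    """Return (neighbor indices, neighbor similarities) per row, best first."""
    sim = np.array(sim_matrix, dtype=float)      # np.array copies, input stays intact
    np.fill_diagonal(sim, -np.inf)               # never return an item as its own neighbor
    top_idx = np.argsort(-sim, axis=1)[:, :k]    # sort each row in descending order
    top_sim = np.take_along_axis(sim, top_idx, axis=1)
    return top_idx, top_sim

sim_example = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.5],
    [0.9, 0.5, 1.0],
])
top_idx, top_sim = precompute_top_k(sim_example, k=1)
print(top_idx.ravel(), top_sim.ravel())
```

Storage drops from n × n values to n × K, and a lookup at request time is a single row read. For very large catalogs, `np.argpartition` or an approximate nearest neighbor index avoids the full sort.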

---

### Evaluation Metrics Deep Dive

| Metric | Type | What It Measures | Use When |
|--------|------|------------------|----------|
| **RMSE** | Rating prediction | Prediction accuracy | Explicit ratings |
| **MAE** | Rating prediction | Average absolute error | Large errors should not dominate the score |
| **Precision@K** | Ranking | % of top-K that are relevant | Implicit feedback |
| **Recall@K** | Ranking | % of all relevant items that appear in top-K | Missing relevant items is costly |
| **NDCG** | Ranking | Position-aware quality | Order matters |
| **Coverage** | Diversity | % of catalog recommended | Avoid popularity bias |
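
NDCG and coverage from the table can be sketched in plain Python. The function names are hypothetical, and the ideal ranking is computed from the same recommended list, which is a common simplification when relevance is only known for the items that were recommended.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: later positions contribute less."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG@K for one ranked list of graded relevances, given in recommended order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def catalog_coverage(recommendation_lists, n_items):
    """Fraction of the catalog that appears in at least one user's list."""
    recommended = set()
    for recs in recommendation_lists:
        recommended.update(recs)
    return len(recommended) / n_items

ndcg_example = ndcg_at_k([0, 0, 1], k=3)     # the only relevant item is ranked third
coverage_example = catalog_coverage([[1, 2, 3], [2, 3, 4]], n_items=10)
print(f"NDCG@3 = {ndcg_example:.2f}, coverage = {coverage_example:.2f}")
```

The single relevant item sits in position 3 instead of position 1, so NDCG@3 is 0.50. Four distinct items out of a catalog of ten gives a coverage of 0.40.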

---

### Handling Popularity Bias

Popularity bias = popular items get recommended more, reinforcing their popularity.

| Solution | Description |
|----------|-------------|
| **Inverse propensity weighting** | Weight less popular items higher |
| **Calibration** | Match recommendation distribution to user history |
| **MMR (Maximal Marginal Relevance)** | Balance relevance with diversity |
| **Exploration-exploitation** | ε-greedy, Thompson sampling |

---

## Complete Algorithm Comparison

| Algorithm | Type | Complexity | Scalability | Cold Start | Explainable | Best For |
|-----------|------|------------|-------------|------------|-------------|----------|
| **User-Based CF** | Memory | O(m²n) | Poor | User problem | Yes | Small datasets |
| **Item-Based CF** | Memory | O(mn²) | Medium | Item problem | Yes | E-commerce |
| **SVD** | Model | O(mnk) | Good | Both problems | No | Sparse data |
| **ALS** | Model | O(mnk) | Excellent | Both problems | No | Implicit feedback |
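| **Content-Based** | Feature | O(nd) | Good | New users only | Yes | New items |
| **Neural CF** | Deep Learning | Varies | Excellent | Both problems | No | Large data |
| **Hybrid** | Combined | Varies | Varies | Reduced | Partial | Production |

Complexity notation (approximate training cost): m = number of users, n = number of items, k = number of latent factors, d = number of item features.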

---

## Key Libraries

| Library | What It's For | Key Algorithms |
|---------|---------------|----------------|
| **Surprise** | Research, prototyping | SVD, KNN, NMF |
| **LightFM** | Hybrid recommendations | Hybrid matrix factorization (BPR, WARP losses) |
| **Implicit** | Implicit feedback | ALS, BPR |
| **TensorFlow Recommenders** | Deep learning | Neural CF, Two-Tower |
| **Spark MLlib** | Distributed computing | ALS at scale |

---

## Checklist

- [x] Understood Content-Based vs Collaborative Filtering trade-offs
- [x] Know when to use User-Based vs Item-Based CF
- [x] Understand similarity metrics (Cosine, Pearson, Jaccard) and when to use each
- [x] Can explain SVD and why it handles sparsity
- [x] Know how to handle cold start problem
- [x] Understand evaluation metrics (RMSE, MAE, Precision@K, NDCG)
- [x] Can discuss scalability solutions
- [x] Know parameter tuning strategies

---

**End of Recommendation System Tutorial**