https://youtu.be/IMvunY3LrQI

# Understanding NDCG (Normalized Discounted Cumulative Gain)
## The Right Way to Evaluate Recommendation System Performance

## What is NDCG?

NDCG measures **how well your recommendation system ranks relevant items**. It's the gold standard metric for evaluating ranking quality in search engines, recommendation systems, and information retrieval.

**Key Idea:** Relevant items should appear at the top of your recommendations. Items ranked higher contribute more to your score.

---

## How NDCG is Calculated (Simple Explanation)

### Step 1: Calculate DCG (Discounted Cumulative Gain)

DCG measures the usefulness of recommendations based on their position:

$$\text{DCG@K} = \sum_{i=1}^{K} \frac{\text{rel}_i}{\log_2(i + 1)}$$

Where:
- $\text{rel}_i$ = relevance of item at position i (1 if relevant, 0 if not)
- $\log_2(i + 1)$ = position discount (items lower in the list count less)
- K = how many top items to evaluate (e.g., top 5)

**Example:** User bought [milk, eggs, bread]  
Your model ranks: [eggs, cheese, milk, bread, ...]

$$\text{DCG@3} = \frac{1}{\log_2(2)} + \frac{0}{\log_2(3)} + \frac{1}{\log_2(4)} = \frac{1}{1.0} + \frac{0}{1.585} + \frac{1}{2.0} = 1.0 + 0.0 + 0.5 = 1.5$$
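The arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch, not part of the notebook's code cells; the helper name `dcg_at_k` is an assumption introduced for illustration, and it uses the binary relevance from the milk/eggs/bread example.

```python
import math

def dcg_at_k(relevances, k):
    """Sum rel_i / log2(i + 1) over the first k positions (i starts at 1)."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

bought = {"milk", "eggs", "bread"}
ranking = ["eggs", "cheese", "milk", "bread"]

# Binary relevance: 1 if the user bought the item, else 0
relevances = [1 if item in bought else 0 for item in ranking]

print(relevances)                 # [1, 0, 1, 1]
print(dcg_at_k(relevances, k=3))  # 1.5
```

Only the first three positions count for DCG@3, so the bread at position 4 contributes nothing here.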

### Step 2: Calculate IDCG (Ideal DCG)

IDCG is the **best possible DCG** if all relevant items were perfectly ranked:

Perfect ranking: [milk, eggs, bread, ...]

$$\text{IDCG@3} = \frac{1}{\log_2(2)} + \frac{1}{\log_2(3)} + \frac{1}{\log_2(4)} = 1.0 + 0.631 + 0.5 = 2.131$$

### Step 3: Calculate NDCG (Normalize)

$$\text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}}$$

In our example:

$$\text{NDCG@3} = \frac{1.5}{2.131} = 0.704 \quad (70.4\%\ \text{of perfect})$$

**Interpretation:** Your model achieved 70% of the perfect ranking score.
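Putting the three steps together, a from-scratch version might look like the sketch below. The function names are hypothetical helpers (the notebook itself relies on `sklearn.metrics.ndcg_score` later), and the input is assumed to be the relevance of every candidate item, listed in the model's ranked order.

```python
import math

def dcg_at_k(relevances, k):
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """relevances: relevance of each candidate item, in the model's ranked order."""
    ideal = sorted(relevances, reverse=True)  # best possible ordering
    idcg = dcg_at_k(ideal, k)
    # Guard against users with no relevant items at all
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# Model ranking [eggs, cheese, milk, bread] against purchases {milk, eggs, bread}
relevances = [1, 0, 1, 1]
print(f"{ndcg_at_k(relevances, k=3):.3f}")  # 0.704
```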

---

## Why Logarithmic Discounting?

Position matters, but with **diminishing returns**:

| Position | Weight | Discount | Importance |
|----------|--------|----------|------------|
| 1 | 1.000 | 0% | ⭐⭐⭐ Most important! |
| 2 | 0.631 | 37% | ⭐⭐⭐ Very important |
| 3 | 0.500 | 50% | ⭐⭐ Important |
| 5 | 0.387 | 61% | ⭐ Moderate |
| 10 | 0.289 | 71% | Less important |

**Why?** Users heavily favor top results. The difference between #1 and #2 is huge, but between #8 and #9 is small.
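The weights in the table come straight from the discount term $1 / \log_2(i + 1)$. A short sketch to reproduce them (the helper name `position_weight` is an assumption for illustration):

```python
import math

def position_weight(position):
    """Weight applied to an item at a 1-based position: 1 / log2(position + 1)."""
    return 1 / math.log2(position + 1)

for position in [1, 2, 3, 5, 10]:
    weight = position_weight(position)
    print(f"position {position:>2}: weight = {weight:.3f}, discount = {1 - weight:.0%}")

# The drop between neighbouring positions shrinks quickly
gap_top = position_weight(1) - position_weight(2)
gap_low = position_weight(8) - position_weight(9)
print(f"#1 vs #2: {gap_top:.3f}   #8 vs #9: {gap_low:.3f}")
```

The last line makes the diminishing returns concrete: the gap between positions 1 and 2 is roughly 0.37, while the gap between positions 8 and 9 is under 0.02.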

---

## NDCG Properties

**Range:** 0 to 1 (1 = perfect ranking)  
**Position-aware:** Order matters!  
**Normalized:** Comparable across different queries/users  
**Works with graded relevance:** Can use 0/1/2/3 instead of just 0/1  
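To illustrate graded relevance, the sketch below applies the same DCG formula to grades 0-3 instead of 0/1. It assumes linear gain (the grade itself in the numerator), matching the formula used in this notebook; some formulations use $2^{\text{rel}} - 1$ as the gain instead, which would give different numbers. The grades are made up for illustration.

```python
import math

def ndcg_at_k(relevances, k):
    """NDCG@k with linear gain; relevances are listed in ranked order."""
    def dcg(rels):
        return sum(r / math.log2(i + 1) for i, r in enumerate(rels[:k], start=1))
    idcg = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / idcg if idcg > 0 else 0.0

# Hypothetical grades: 3 = bought repeatedly, 2 = bought once, 1 = viewed, 0 = ignored
graded = [3, 0, 2, 1]
score = ndcg_at_k(graded, k=4)
print(f"{score:.3f}")  # 0.930
```

The score stays between 0 and 1 because the same gain function is used for both DCG and IDCG.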

---

## NDCG@K Interpretation

- **NDCG@1:** How good is your #1 recommendation?
- **NDCG@3:** Quality of your top 3 (critical for mobile)
- **NDCG@5:** Quality of your top 5 (first screen)
- **NDCG@10:** Overall top 10 quality

**Rule:** Always check multiple K values! A model that gets NDCG@1 = 0.80 but NDCG@5 = 0.30 is inconsistent.
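A small sketch of why a single K can mislead. The ranked relevance list below is invented for illustration: the model nails the first slot but buries the other two relevant items at positions 7 and 8.

```python
import math

def ndcg_at_k(relevances, k):
    """NDCG@k for relevances listed in the model's ranked order."""
    def dcg(rels):
        return sum(r / math.log2(i + 1) for i, r in enumerate(rels[:k], start=1))
    idcg = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / idcg if idcg > 0 else 0.0

ranked_relevances = [1, 0, 0, 0, 0, 0, 1, 1]
scores_by_k = {k: ndcg_at_k(ranked_relevances, k) for k in [1, 3, 5, 10]}

for k, value in scores_by_k.items():
    print(f"NDCG@{k} = {value:.3f}")
```

Here NDCG@1 is 1.000 while NDCG@3 and NDCG@5 drop to about 0.469, which is the kind of inconsistency the rule above is meant to catch.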

---

## Realistic NDCG Expectations

**With sparse e-commerce data (5-10% purchase rate):**

| Model Type | NDCG@5 | vs Random | Status |
|------------|--------|-----------|--------|
| Random Baseline | 0.05-0.10 | 1x | Baseline |
| Good Model | 0.40-0.60 | 5-8x | Deploy! |
| Excellent Model | 0.70-0.85 | 10-15x | Top tier! |

**Why so "low"?**
- When users buy 3 items from 50 products (6% rate)
- Random gets ~0.06 NDCG
- Your 0.50 NDCG = 8x improvement = **EXCELLENT!**
- This translates to 20-30% revenue increase

**Don't chase 0.90+** - That requires 40-50% relevance (unrealistic for e-commerce)
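The random-baseline figure can be derived rather than guessed. The sketch below assumes binary relevance and a uniformly random ranking with no ties: every position then holds a relevant item with probability `n_relevant / n_items`, and because IDCG is fixed for a given user, the expected NDCG is the expected DCG divided by IDCG. The function name is a hypothetical helper.

```python
import math

def expected_random_ndcg(n_items, n_relevant, k):
    """Expected NDCG@k of a uniformly random ranking with binary relevance."""
    discounts = [1 / math.log2(i + 1) for i in range(1, k + 1)]
    # Each position holds a relevant item with probability n_relevant / n_items
    expected_dcg = (n_relevant / n_items) * sum(discounts)
    # The ideal ranking fills the first min(k, n_relevant) slots with relevant items
    idcg = sum(discounts[:min(k, n_relevant)])
    return expected_dcg / idcg

for k in [1, 3, 5, 10]:
    print(f"k={k:>2}: expected random NDCG = {expected_random_ndcg(50, 3, k):.3f}")
```

For 3 purchases out of 50 products, this gives 0.060 whenever K is at most 3 (the purchase rate itself) and about 0.083 at K = 5, so the baseline rises somewhat as K grows past the number of relevant items.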

---

## Key Takeaways

1. **Position is everything:** Same items, wrong order = low NDCG
2. **Compare to random baseline:** Your improvement ratio matters most
3. **Check multiple K values:** Consistency is important
4. **NDCG 0.50 can be excellent:** Depends on your data sparsity

Now, let's see this in action!

## Import Required Libraries

Setting up our environment for NDCG calculations.



In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import ndcg_score

# For reproducibility
rng = np.random.default_rng(42)



## Simulated, Yet Realistic Product Catalog

**50 products** across 8 categories - realistic online grocery store:
- Dairy (8 items)
- Bakery (6 items)  
- Produce (10 items)
- Meat & Protein (6 items)
- Pantry (10 items)
- Beverages (5 items)
- Snacks (3 items)
- Frozen (2 items)


In [None]:
products = [
    # Dairy (8 items)
    "milk", "eggs", "butter", "yogurt", "cheese", "cream", "sour_cream", "cottage_cheese",
    # Bakery (6 items)
    "bread", "bagels", "muffins", "croissants", "tortillas", "pita_bread",
    # Produce (10 items)
    "bananas", "apples", "oranges", "lettuce", "tomatoes", "onions",
    "carrots", "potatoes", "peppers", "cucumbers",
    # Meat & Protein (6 items)
    "chicken", "beef", "pork", "salmon", "tuna", "turkey",
    # Pantry (10 items)
    "pasta", "rice", "beans", "flour", "sugar", "salt",
    "olive_oil", "cereal", "oatmeal", "coffee",
    # Beverages (5 items)
    "orange_juice", "apple_juice", "soda", "tea", "water_bottles",
    # Snacks (3 items)
    "chips", "crackers", "cookies",
    # Frozen (2 items)
    "ice_cream", "frozen_pizza"
]

print(f"Product catalog: {len(products)} items")
print(f"Categories: Dairy, Bakery, Produce, Meat, Pantry, Beverages, Snacks, Frozen")

Product catalog: 50 items
Categories: Dairy, Bakery, Produce, Meat, Pantry, Beverages, Snacks, Frozen


## Realistic Purchase Patterns

**10 users**, each buying 2-4 items from 50 products (4-8% purchase rate).

**This sparsity is REALISTIC:**
- Most people don't buy 50 items per shopping trip
- Average basket: 2-4 items for quick shopping  
- Makes NDCG more challenging but mirrors reality

**Hand-crafted patterns:**
- User 1: Dairy + Produce (breakfast shopper)
- User 2: Bread + Meat + Pasta (dinner prep)
- User 3: Dairy + Beverages (coffee run)
- User 4: Produce + Fish (health conscious)
- And so on...
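Typing 50-column rows of zeros and ones by hand is easy to get wrong. As an alternative sketch, a basket of product names can be converted into a binary row. `basket_to_row` is a hypothetical helper that does not appear in the notebook, and the catalog is shortened here for illustration.

```python
# Shortened catalog for illustration; the notebook's catalog has 50 products
products = ["milk", "eggs", "butter", "bread", "bananas", "apples"]

def basket_to_row(basket, catalog):
    """Return a 0/1 list aligned with `catalog`; fail loudly on unknown names."""
    unknown = set(basket) - set(catalog)
    if unknown:
        raise ValueError(f"Not in catalog: {sorted(unknown)}")
    return [1 if item in basket else 0 for item in catalog]

user_1 = basket_to_row({"milk", "eggs", "bananas", "apples"}, products)
print(user_1)  # [1, 1, 0, 0, 1, 1]
```

Rows built this way can be stacked into the same kind of users-by-products array that the next cell writes out by hand.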


In [None]:
# Hand-craft realistic purchase patterns
# Users tend to buy from 2-3 categories per trip
true_purchases = np.array([
    # User 1: Dairy + Produce basics (4 items)
    [1, 1, 0, 0, 0, 0, 0, 0,  # Dairy: milk, eggs
     0, 0, 0, 0, 0, 0,        # Bakery: none
     1, 1, 0, 0, 0, 0, 0, 0, 0, 0,  # Produce: bananas, apples
     0, 0, 0, 0, 0, 0,        # Meat: none
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # Pantry: none
     0, 0, 0, 0, 0,           # Beverages: none
     0, 0, 0,                 # Snacks: none
     0, 0],                   # Frozen: none

    # User 2: Bakery + Meat + Pantry (3 items)
    [0, 0, 0, 0, 0, 0, 0, 0,  # Dairy: none
     1, 0, 0, 0, 0, 0,        # Bakery: bread
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # Produce: none
     1, 0, 0, 0, 0, 0,        # Meat: chicken
     1, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # Pantry: pasta
     0, 0, 0, 0, 0,           # Beverages: none
     0, 0, 0,                 # Snacks: none
     0, 0],                   # Frozen: none

    # User 3: Dairy + Pantry + Beverages (4 items)
    [1, 0, 0, 1, 0, 0, 0, 0,  # Dairy: milk, yogurt
     0, 0, 0, 0, 0, 0,        # Bakery: none
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # Produce: none
     0, 0, 0, 0, 0, 0,        # Meat: none
     0, 0, 0, 0, 0, 0, 0, 0, 0, 1,  # Pantry: coffee
     1, 0, 0, 0, 0,           # Beverages: OJ
     0, 0, 0,                 # Snacks: none
     0, 0],                   # Frozen: none

    # User 4: Produce heavy + Meat (4 items)
    [0, 0, 0, 0, 0, 0, 0, 0,  # Dairy: none
     0, 0, 0, 0, 0, 0,        # Bakery: none
     0, 0, 0, 1, 1, 1, 0, 0, 0, 0,  # Produce: lettuce, tomatoes, onions
     0, 0, 0, 1, 0, 0,        # Meat: salmon
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # Pantry: none
     0, 0, 0, 0, 0,           # Beverages: none
     0, 0, 0,                 # Snacks: none
     0, 0],                   # Frozen: none

    # User 5: Pantry staples (3 items)
    [0, 0, 0, 0, 0, 0, 0, 0,  # Dairy: none
     0, 0, 0, 0, 0, 0,        # Bakery: none
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # Produce: none
     0, 0, 0, 0, 0, 0,        # Meat: none
     0, 1, 1, 0, 0, 0, 1, 0, 0, 0,  # Pantry: rice, beans, olive_oil
     0, 0, 0, 0, 0,           # Beverages: none
     0, 0, 0,                 # Snacks: none
     0, 0],                   # Frozen: none

    # User 6: Dairy + Bakery (3 items)
    [0, 1, 1, 0, 0, 0, 0, 0,  # Dairy: eggs, butter
     1, 0, 0, 0, 0, 0,        # Bakery: bread
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # Produce: none
     0, 0, 0, 0, 0, 0,        # Meat: none
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # Pantry: none
     0, 0, 0, 0, 0,           # Beverages: none
     0, 0, 0,                 # Snacks: none
     0, 0],                   # Frozen: none

    # User 7: Quick dinner (3 items)
    [0, 0, 0, 0, 0, 0, 0, 0,  # Dairy: none
     0, 0, 0, 0, 0, 0,        # Bakery: none
     0, 0, 0, 0, 1, 0, 0, 0, 0, 0,  # Produce: tomatoes
     0, 0, 0, 0, 0, 0,        # Meat: none
     1, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # Pantry: pasta
     0, 0, 0, 0, 0,           # Beverages: none
     0, 0, 0,                 # Snacks: none
     0, 1],                   # Frozen: frozen_pizza

    # User 8: Breakfast items (4 items)
    [1, 1, 0, 0, 0, 0, 0, 0,  # Dairy: milk, eggs
     0, 1, 0, 0, 0, 0,        # Bakery: bagels
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # Produce: none
     0, 0, 0, 0, 0, 0,        # Meat: none
     0, 0, 0, 0, 0, 0, 0, 0, 0, 1,  # Pantry: coffee
     0, 0, 0, 0, 0,           # Beverages: none
     0, 0, 0,                 # Snacks: none
     0, 0],                   # Frozen: none

    # User 9: Snack run (2 items)
    [0, 0, 0, 0, 0, 0, 0, 0,  # Dairy: none
     0, 0, 0, 0, 0, 0,        # Bakery: none
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # Produce: none
     0, 0, 0, 0, 0, 0,        # Meat: none
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # Pantry: none
     0, 0, 1, 0, 0,           # Beverages: soda
     1, 0, 0,                 # Snacks: chips
     0, 0],                   # Frozen: none

    # User 10: Healthy basket (4 items)
    [0, 0, 0, 1, 0, 0, 0, 0,  # Dairy: yogurt
     0, 0, 0, 0, 0, 0,        # Bakery: none
     0, 1, 0, 1, 0, 0, 0, 0, 0, 0,  # Produce: apples, lettuce
     0, 0, 0, 1, 0, 0,        # Meat: salmon
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  # Pantry: none
     0, 0, 0, 0, 0,           # Beverages: none
     0, 0, 0,                 # Snacks: none
     0, 0],                   # Frozen: none
])

df_true = pd.DataFrame(true_purchases, columns=products)
df_true.index = [f"user_{i+1}" for i in range(len(df_true))]

print("Ground Truth (Actual Purchases):")
print(df_true.sum(axis=1))
print(f"\nTotal items bought per user: {df_true.sum(axis=1).values}")
print(f"Average purchase rate: {df_true.mean().mean():.1%}")
print("This sparse data (~7%) is REALISTIC for e-commerce!")

Ground Truth (Actual Purchases):
user_1     4
user_2     3
user_3     4
user_4     4
user_5     3
user_6     3
user_7     3
user_8     4
user_9     2
user_10    4
dtype: int64

Total items bought per user: [4 3 4 4 3 3 3 4 2 4]
Average purchase rate: 6.8%
This sparse data (~7%) is REALISTIC for e-commerce!


## Scenario 1 - Perfect Ranking

Perfect model = Ground truth as prediction scores.

**Expected:** NDCG = 1.0000 (theoretical maximum)

**Important:** You'll NEVER achieve this in production with sparse data!

This is just a reference point to understand the metric.

*sklearn.metrics.ndcg_score(y_true, y_score, k=None, sample_weight=None, ignore_ties=False)*

- **y_true:** array-like of shape (n_samples, n_labels). True targets of multilabel classification, or true scores of entities to be ranked. Negative values in y_true may result in an output that is not between 0 and 1.
- **y_score:** array-like of shape (n_samples, n_labels). Target scores, which can be probability estimates, confidence values, or non-thresholded measures of decisions (as returned by "decision_function" on some classifiers).
- **k:** int, default=None. Only consider the highest k scores in the ranking. If None, use all outputs.

In [None]:
perfect_scores = true_purchases.astype(float)
df_perfect = pd.DataFrame(perfect_scores, columns=products, index=df_true.index)

print("="*60)
print("PERFECT MODEL (Theoretical Maximum)")
print("="*60)

for k in [1, 3, 5, 10]:
    val = ndcg_score(df_true.values, df_perfect, k=k)
    print(f"NDCG@{k} = {val:.4f}")

PERFECT MODEL (Theoretical Maximum)
NDCG@1 = 1.0000
NDCG@3 = 1.0000
NDCG@5 = 1.0000
NDCG@10 = 1.0000


## Scenario 2 - Inverted Ranking (Systematically Wrong Model)

In this scenario the ground truth stays the same; only the model's predicted scores are inverted, so irrelevant items are ranked above relevant ones. This is the worst possible ordering, which is why NDCG collapses to ~0.

Items the user will buy → assigned the lowest scores (ranked last)

Items the user won’t buy → assigned the highest scores (ranked first)

**As a result**

All irrelevant items dominate the top of the ranking

All relevant items are pushed to the bottom

DCG ≈ 0 ⇒ NDCG ≈ 0.000

**Key Learning**

- NDCG does not merely evaluate order, it evaluates whether relevant items are ranked above irrelevant ones.

- Having the right items somewhere in the list is not enough

- If irrelevant items are ranked ahead of relevant ones, ranking quality collapses

**A confidently wrong model is worse than a noisy or random one**
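The effect can be seen on a toy example before running it on the full catalog. This is a pure-Python sketch with a hypothetical `ndcg_at_k(y_true, y_score, k)` helper; it breaks ties by original item order, which is a simplifying assumption and not necessarily how a library implementation treats tied scores.

```python
import math

def ndcg_at_k(y_true, y_score, k):
    """Rank items by descending score (ties keep original order), then NDCG@k."""
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    ranked = [y_true[i] for i in order]
    ideal = sorted(y_true, reverse=True)
    dcg = sum(r / math.log2(p + 1) for p, r in enumerate(ranked[:k], start=1))
    idcg = sum(r / math.log2(p + 1) for p, r in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg > 0 else 0.0

y_true = [1, 1, 0, 0, 0, 0]              # 2 purchased items out of 6
perfect = [float(t) for t in y_true]     # scores equal to the ground truth
inverted = [1.0 - t for t in y_true]     # purchased items get the lowest scores

print(ndcg_at_k(y_true, perfect, k=2))   # 1.0
print(ndcg_at_k(y_true, inverted, k=2))  # 0.0
```

The score is exactly zero only while K is no larger than the number of irrelevant items; with `k=6` the purchased items re-enter the evaluated list at the bottom and the inverted score becomes positive again.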

In [None]:
inverted_scores = 1.0 - true_purchases.astype(float)
df_inverted = pd.DataFrame(inverted_scores, columns=products, index=df_true.index)

print("="*60)
print("INVERTED MODEL (Right items, wrong order)")
print("="*60)

for k in [1, 3, 5, 10]:
    val = ndcg_score(df_true.values, df_inverted, k=k)
    print(f"NDCG@{k} = {val:.4f}")

print("\n⚠️  Same items, reversed order → Near-zero NDCG!")

INVERTED MODEL (Right items, wrong order)
NDCG@1 = 0.0000
NDCG@3 = 0.0000
NDCG@5 = 0.0000
NDCG@10 = 0.0000

⚠️  Same items, reversed order → Near-zero NDCG!


## Scenario 3 - Excellent Production Model

**Target NDCG@5: 0.50-0.70** (realistic top-tier performance)

**Scenario summary:**

This scenario represents an excellent but imperfect ranking model where relevant items are usually ranked above irrelevant ones, with occasional realistic mistakes.

**How it’s simulated:** Purchased items are assigned random high scores (≈0.60-0.90) while non-purchased items receive lower scores (≈0.50-0.70), so the two ranges overlap only at the low end of the relevant range. To reflect real-world noise, up to 2 non-relevant items per user are randomly boosted into the high-score range (≈0.65-0.85), degrading but not breaking the ranking.

**Why is ~0.60 considered "excellent"?**
- Random baseline ≈ 0.10
- Excellent model ≈ 0.60
- That's about 6x better than random!
- Translates to 20-30% revenue increase

**This is hand-crafted** to guarantee consistent NDCG for teaching.
In production, you'd train ML models to achieve these results.

In [None]:
# Relevant items score higher on average than irrelevant ones
excellent_scores = np.where(
    true_purchases == 1,
    rng.uniform(0.60, 0.90, true_purchases.shape),  # Relevant: high scores
    rng.uniform(0.50, 0.70, true_purchases.shape)   # Irrelevant: lower scores (overlap at 0.60-0.70)
)

# Create some realistic errors (up to 2 per user)
for i in range(len(excellent_scores)):
    wrong_indices = np.where(true_purchases[i] == 0)[0]
    # Boost up to 2 irrelevant items into the high-score range
    boost_count = min(2, len(wrong_indices))
    boost_items = rng.choice(wrong_indices, size=boost_count, replace=False)
    for idx in boost_items:
        excellent_scores[i, idx] = rng.uniform(0.65, 0.85)

df_excellent = pd.DataFrame(excellent_scores, columns=products, index=df_true.index)

print("="*60)
print("EXCELLENT MODEL (Top-Tier Production)")
print("="*60)

for k in [1, 3, 5, 10]:
    val = ndcg_score(df_true.values, df_excellent, k=k)
    print(f"NDCG@{k} = {val:.4f}")

print("\n✅ NDCG@5 ~0.50-0.70 is EXCELLENT for sparse e-commerce data!")
print(f"   Separation gap: ~0.35 (strong signal)")

EXCELLENT MODEL (Top-Tier Production)
NDCG@1 = 0.6000
NDCG@3 = 0.5675
NDCG@5 = 0.6595
NDCG@10 = 0.6725

✅ NDCG@5 ~0.50-0.70 is EXCELLENT for sparse e-commerce data!
   Separation gap: ~0.35 (strong signal)


## Scenario 4 - Good Production Model

**Target NDCG@5: 0.35-0.50** (solid, deployable)

**Strategy:**
- More ranking errors than "Excellent"
- 2-3 wrong items in top 5 per user
- Some relevant items ranked in positions 6-10

**Why is ~0.40 "good"?**
- It's roughly 3-4x better than random (≈0.10)
- Deploy with confidence!
- Most real-world systems operate here

**Real-world note:** If you achieve NDCG 0.28-0.32 in production, you're doing well!

In [None]:
good_scores = np.where(
    true_purchases == 1,
    rng.uniform(0.4, 0.65, true_purchases.shape),  # Relevant: moderate-high
    rng.uniform(0.25, 0.5, true_purchases.shape)   # Irrelevant: low-moderate
)

# Create moderate errors (2-3 per user)
for i in range(len(good_scores)):
    wrong_indices = np.where(true_purchases[i] == 0)[0]
    # Boost up to 3 randomly chosen irrelevant items; rng is seeded above, so a fresh top-to-bottom run reproduces these results
    boost_items = rng.choice(wrong_indices, size=min(3, len(wrong_indices)), replace=False)
    for idx in boost_items:
        good_scores[i, idx] = rng.uniform(0.55, 0.75)

df_good = pd.DataFrame(good_scores, columns=products, index=df_true.index)

print("="*60)
print("GOOD MODEL (Solid Production)")
print("="*60)

for k in [1, 3, 5, 10]:
    val = ndcg_score(df_true.values, df_good, k=k)
    print(f"NDCG@{k} = {val:.4f}")

print("\n  NDCG@5 ~0.35-0.50 is GOOD - deploy with confidence!")
print(f"   Separation gap: ~0.25 (moderate signal)")

GOOD MODEL (Solid Production)
NDCG@1 = 0.0000
NDCG@3 = 0.1152
NDCG@5 = 0.3920
NDCG@10 = 0.4956

  NDCG@5 ~0.35-0.50 is GOOD - deploy with confidence!
   Separation gap: ~0.25 (moderate signal)


## Scenario 5 - Mediocre Model

**Target NDCG@5: 0.20-0.30** (needs improvement)

**Strategy:**
- Heavy overlap in scores
- Relevant items only slightly favored
- Many wrong items ranked high

**Why is 0.20 "mediocre"?**
- It's only 2x better than random
- Shows some signal but not enough
- Needs more features, data, or better algorithm

**Decision:** Improve this model before deploying to production.

In [None]:
mediocre_scores = np.where(
    true_purchases == 1,
    rng.uniform(0.4, 0.70, true_purchases.shape),  # Relevant: mid-range
    rng.uniform(0.25, 0.65, true_purchases.shape)   # Irrelevant: overlapping range
)

df_mediocre = pd.DataFrame(mediocre_scores, columns=products, index=df_true.index)

print("="*60)
print("MEDIOCRE MODEL (Needs Work)")
print("="*60)

for k in [1, 3, 5, 10]:
    val = ndcg_score(df_true.values, df_mediocre, k=k)
    print(f"NDCG@{k} = {val:.4f}")

print("\n⚠️  NDCG@5 ~0.20-0.30 needs improvement before deployment")
print(f"   ⚠️  If NDCG@1 > NDCG@5, model is inconsistent!")

MEDIOCRE MODEL (Needs Work)
NDCG@1 = 0.4000
NDCG@3 = 0.2469
NDCG@5 = 0.2868
NDCG@10 = 0.3421

⚠️  NDCG@5 ~0.20-0.30 needs improvement before deployment
   ⚠️  If NDCG@1 > NDCG@5, model is inconsistent!


## Scenario 6 - Poor Model  

**Target NDCG: 0.10-0.15** (barely beats random)

**Strategy:**
- Almost no separation between relevant and irrelevant scores
- Nearly identical score ranges for both!
- Model hasn't learned anything useful

**Why is 0.12 "poor"?**
- It's only 1.2x better than random
- Barely distinguishable from random
- Don't waste resources deploying this

**Red flag:** If your model performs like this, you need to rethink your approach.

In [None]:
poor_scores = np.where(
    true_purchases == 1,
    rng.uniform(0.37, 0.65, true_purchases.shape),  # Relevant
    rng.uniform(0.35, 0.65, true_purchases.shape)   # Irrelevant (nearly the SAME range!)
)

df_poor = pd.DataFrame(poor_scores, columns=products, index=df_true.index)

print("="*60)
print("POOR MODEL (Don't Deploy)")
print("="*60)

for k in [1, 3, 5, 10]:
    val = ndcg_score(df_true.values, df_poor, k=k)
    print(f"NDCG@{k} = {val:.4f}")

print("\n❌ NDCG ~0.11-0.13 is barely better than random - don't deploy!")

POOR MODEL (Don't Deploy)
NDCG@1 = 0.0000
NDCG@3 = 0.0603
NDCG@5 = 0.0755
NDCG@10 = 0.1033

❌ NDCG ~0.11-0.13 is barely better than random - don't deploy!


## Scenario 7 - Random Baseline

**Expected NDCG: ~0.10** (with 5% purchase rate)

**How it works:**
- Scores are uniformly random (0 to 1)
- No relationship to what user will buy
- Pure luck

**Why ~0.10?**
- With 5% relevant items, random gets ~10% of perfect score
- This is your baseline - ANY model must beat this!

**Critical:** Always measure your improvement over random:
- Model = 0.30, Random = 0.10 → 3x improvement
- Model = 0.12, Random = 0.10 → 1.2x improvement

In [None]:
random_scores = rng.uniform(0, 1, true_purchases.shape)
df_random = pd.DataFrame(random_scores, columns=products, index=df_true.index)

print("="*60)
print("RANDOM BASELINE (No Model)")
print("="*60)

for k in [1, 3, 5, 10]:
    val = ndcg_score(df_true.values, df_random, k=k)
    print(f"NDCG@{k} = {val:.4f}")

print("\n⚫ NDCG ~0.10 with 5% purchase rate - this is your baseline")

RANDOM BASELINE (No Model)
NDCG@1 = 0.1000
NDCG@3 = 0.1152
NDCG@5 = 0.1175
NDCG@10 = 0.1470

⚫ NDCG ~0.10 with 5% purchase rate - this is your baseline


## Comprehensive Comparison



In [None]:
scenarios = {
    'Perfect': df_perfect,
    'Inverted': df_inverted,
    'Excellent': df_excellent,
    'Good': df_good,
    'Mediocre': df_mediocre,
    'Poor': df_poor,
    'Random': df_random
}

comparison_results = []

for scenario_name, scores in scenarios.items():
    row = {'Scenario': scenario_name}
    for k in [1, 3, 5, 10]:
        ndcg_val = ndcg_score(df_true.values, scores.values, k=k)
        row[f'NDCG@{k}'] = ndcg_val
    comparison_results.append(row)

df_comparison = pd.DataFrame(comparison_results)
df_comparison = df_comparison.set_index('Scenario')

print("\n" + "="*70)
print("COMPREHENSIVE COMPARISON - REALISTIC NDCG RANGES")
print("="*70)
print(df_comparison.round(4))




COMPREHENSIVE COMPARISON - REALISTIC NDCG RANGES
           NDCG@1  NDCG@3  NDCG@5  NDCG@10
Scenario                                  
Perfect       1.0  1.0000  1.0000   1.0000
Inverted      0.0  0.0000  0.0000   0.0000
Excellent     0.6  0.5675  0.6595   0.6725
Good          0.0  0.1152  0.3920   0.4956
Mediocre      0.4  0.2469  0.2868   0.3421
Poor          0.0  0.0603  0.0755   0.1033
Random        0.1  0.1152  0.1175   0.1470
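Since the improvement ratio over random is the number to watch, it helps to compute it from the table. The sketch below copies the NDCG@5 column from the outputs shown in this notebook; the variable names are assumptions for illustration.

```python
# NDCG@5 values copied from the scenario outputs in this notebook
ndcg_at_5 = {"Excellent": 0.6595, "Good": 0.3920, "Mediocre": 0.2868, "Poor": 0.0755}
random_ndcg_at_5 = 0.1175

# Lift = model NDCG divided by the random baseline's NDCG at the same K
lift = {name: value / random_ndcg_at_5 for name, value in ndcg_at_5.items()}

for name, ratio in lift.items():
    print(f"{name:<10} {ratio:.1f}x random")
```

In this particular run the "Poor" model lands below 1x, which is plausible with only 10 users: both it and the random baseline are noisy estimates, so small gaps near the baseline should not be over-interpreted.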


## Single User Deep Dive

This scenario provides a deep dive into a single user’s recommendation results, showing which products they actually bought versus the top-ranked items predicted by different models.

It highlights how model quality affects ranking: top models surface most of the purchased items in the top 5, while mediocre or random models fail to prioritize the user’s true interests.


We analyze User 1 who bought: milk, eggs, bananas, apples (4 items from 50)

**Look for:**

1. **Excellent Model Top 5:**
   - Contains 3-4 of the 4 items user bought
   - Wrong items are random boosts in this simulation; in a real model they often have clear business logic (related products)

2. **Good Model Top 5:**
   - Contains 2-3 of the 4 items
   - Mix of right and wrong recommendations

3. **Mediocre Model Top 5:**
   - Contains 1-2 of the 4 items
   - Many irrelevant recommendations

4. **Random Model Top 5:**
   - Contains 0-1 items (pure luck)
   - Completely useless for recommendations

**Key Observation:** Even "Excellent" isn't perfect - it misses 1 item. That's realistic!
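Counting how many purchased items land in the top K is a useful sanity check alongside NDCG. A minimal sketch, using illustrative scores rather than the notebook's actual arrays; `hits_in_top_k` is a hypothetical helper.

```python
def hits_in_top_k(scores, bought, k):
    """Return the purchased items that appear among the k highest-scored products."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return [item for item in top_k if item in bought]

# Illustrative scores for one user (not taken from the notebook's arrays)
scores = {"milk": 0.86, "tomatoes": 0.81, "bananas": 0.80, "butter": 0.79,
          "eggs": 0.73, "turkey": 0.69, "apples": 0.62}
bought = {"milk", "eggs", "bananas", "apples"}

hits = hits_in_top_k(scores, bought, k=5)
print(f"Top 5 contains {len(hits)}/{len(bought)} purchased items: {hits}")
```

Unlike NDCG, this count ignores where inside the top K each hit sits, so the two numbers are best read together.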

In [None]:
user_idx = 0
user_name = f"user_{user_idx + 1}"

print(f"\n{'='*70}")
print(f"DETAILED ANALYSIS FOR {user_name.upper()}")
print(f"{'='*70}\n")

# Show what user actually bought
bought_items = [products[i] for i in range(len(products)) if df_true.iloc[user_idx, i] == 1]
print(f"User 1 bought: {', '.join(bought_items)}")
print(f"That's {len(bought_items)} items from {len(products)} products ({len(bought_items)/len(products):.1%})\n")

# Create comparison
user_comparison = pd.DataFrame({
    'Product': products,
    'Bought': df_true.iloc[user_idx].values,
    'Excellent': df_excellent.iloc[user_idx].values,
    'Good': df_good.iloc[user_idx].values,
    'Mediocre': df_mediocre.iloc[user_idx].values,
    'Random': df_random.iloc[user_idx].values
})

print("-"*70)
print("TOP 10 RECOMMENDATIONS - EXCELLENT MODEL")
print("-"*70)
top_excellent = user_comparison.sort_values('Excellent', ascending=False).head(10)
print(top_excellent[['Product', 'Bought', 'Excellent']].to_string(index=False))
print(f"\n✓ Top 5 contains {top_excellent.head(5)['Bought'].sum():.0f}/4 items user bought")

print("\n" + "-"*70)
print("TOP 10 RECOMMENDATIONS - GOOD MODEL")
print("-"*70)
top_good = user_comparison.sort_values('Good', ascending=False).head(10)
print(top_good[['Product', 'Bought', 'Good']].to_string(index=False))
print(f"\n✓ Top 5 contains {top_good.head(5)['Bought'].sum():.0f}/4 items user bought")

print("\n" + "-"*70)
print("TOP 10 RECOMMENDATIONS - MEDIOCRE MODEL")
print("-"*70)
top_mediocre = user_comparison.sort_values('Mediocre', ascending=False).head(10)
print(top_mediocre[['Product', 'Bought', 'Mediocre']].to_string(index=False))
print(f"\n⚠️  Top 5 contains {top_mediocre.head(5)['Bought'].sum():.0f}/4 items user bought")

print("\n" + "-"*70)
print("TOP 10 RECOMMENDATIONS - RANDOM MODEL")
print("-"*70)
top_random = user_comparison.sort_values('Random', ascending=False).head(10)
print(top_random[['Product', 'Bought', 'Random']].to_string(index=False))
print(f"\n❌ Top 5 contains {top_random.head(5)['Bought'].sum():.0f}/4 items user bought")


DETAILED ANALYSIS FOR USER_1

User 1 bought: milk, eggs, bananas, apples
That's 4 items from 50 products (8.0%)

----------------------------------------------------------------------
TOP 10 RECOMMENDATIONS - EXCELLENT MODEL
----------------------------------------------------------------------
     Product  Bought  Excellent
        milk       1   0.857859
    tomatoes       0   0.813153
     bananas       1   0.802444
      butter       0   0.794712
        eggs       1   0.734476
      turkey       0   0.690133
        tuna       0   0.688501
      bagels       0   0.685236
        beef       0   0.682664
frozen_pizza       0   0.680430

✓ Top 5 contains 3/4 items user bought

----------------------------------------------------------------------
TOP 10 RECOMMENDATIONS - GOOD MODEL
----------------------------------------------------------------------
      Product  Bought     Good
   sour_cream       0 0.733065
        flour       0 0.653716
    ice_cream       0 0.614324
        