# 🎬 Introduction to Recommendation Systems - PART 2
## Item-Based CF, Comparisons, and Broader AI Connections

**CCO 460 - Inteligencia Artificial**  
**Universidad del Sagrado Corazón**  
**Lesson 2 of 2**

---

## 📚 Today's Learning Objectives

By the end of this lesson, students will understand:

1. **How** item-based collaborative filtering differs from user-based
2. **When** to choose user-based vs item-based approaches
3. **How** recommendation systems connect to broader AI (neural networks, deep learning)
4. **What** statistical foundations underpin collaborative filtering
5. **How** to evaluate recommendation system performance

---

## 🗂️ Lesson 2 Outline

1. [Quick Review of Part 1](#1.-Quick-Review)
2. [Item-Based Collaborative Filtering](#2.-Item-Based-CF)
3. [Comparison: User-Based vs Item-Based](#3.-Comparison)
4. [Connections to Broader AI](#4.-AI-Connections)
5. [Statistical Foundations](#5.-Statistical-Foundations)
6. [Complete Glossary](#6.-Glossary)
7. [Summary and Wrap-Up](#7.-Summary)

---

# 1. Quick Review of Part 1

## 📝 What We Learned

### User-Based Collaborative Filtering: "People like you also liked..."

**The Algorithm:**
1. **Find similar users** → Compare target user with all others using cosine similarity
2. **Get their preferences** → Look at movies similar users rated highly
3. **Make recommendations** → Recommend top-rated movies (weighted by similarity)

**Key Concepts:**
- **User-Item Matrix:** Rows = users, Columns = movies, Values = ratings
- **Sparsity:** Most values are missing (users rate few movies)
- **Cosine Similarity:** Measures angle between user rating vectors

**Formula:**
```
cos(θ) = (A · B) / (||A|| × ||B||)
```

**Intuition:** Users with similar rating patterns = similar taste
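
As a quick refresher, the formula can be checked by hand on a few made-up rating vectors. This is a minimal sketch: the three users, their ratings, and the helper name `cosine_sim` are hypothetical and chosen only for illustration.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two rating vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Convention assumed here: return 0.0 when either vector is all zeros
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical ratings for four movies (0 = not rated)
alice = [5, 4, 0, 1]
bob = [4, 5, 0, 2]
carol = [1, 0, 5, 4]

sim_ab = cosine_sim(alice, bob)
sim_ac = cosine_sim(alice, carol)
print(f"alice vs bob:   {sim_ab:.3f}")
print(f"alice vs carol: {sim_ac:.3f}")
```

Alice and Bob rate the same movies highly, so their vectors point in nearly the same direction; Alice and Carol disagree, so their similarity is much lower.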

---

## 🤔 Discussion Questions (5 minutes)

**For the Class:**
1. What are the main advantages of user-based CF?
2. What challenges might arise with millions of users?
3. What happens when a new user joins (cold start problem)?

**Teaching Note:** Use this as a transition to item-based CF, which addresses some of these challenges.

---

# 2. Item-Based Collaborative Filtering

## 🎯 Core Idea: "Because you watched..."

**User-Based:** Find similar **people**  
**Item-Based:** Find similar **items (movies)**

### The Algorithm:

**Step 1: Calculate Item Similarity**
- Compare every movie with every other movie
- Based on how users rated them
- Movies rated similarly by many users → similar movies

**Step 2: Find User's Favorites**
- Look at movies the target user rated highly

**Step 3: Recommend Similar Items**
- Find movies similar to user's favorites
- Recommend those the user hasn't seen yet

---

## 🤝 Conceptual Example

**Jane's Ratings:**
- The Godfather: ⭐⭐⭐⭐⭐ (5.0)
- Star Wars: ⭐⭐⭐⭐ (4.0)

**Item Similarity (pre-computed):**

| Movie | Similar To Godfather | Similar To Star Wars |
|-------|---------------------|---------------------|
| Godfather Part II | 0.95 | 0.45 |
| Scarface | 0.88 | 0.32 |
| Empire Strikes Back | 0.42 | 0.97 |
| Frozen | 0.15 | 0.20 |

**Recommendations:**
1. **Godfather Part II** (very similar to Godfather, which Jane loved)
2. **Empire Strikes Back** (very similar to Star Wars, which Jane liked)
3. **Scarface** (similar to Godfather)

**Teaching Note:** Emphasize that we're comparing ITEMS, not USERS. The similarity is based on co-ratings.
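
The ranking above can be reproduced with a few lines of pandas. This sketch assumes one simple scoring rule (sum of similarity × Jane's rating over the movies she rated); other aggregation rules are possible and may rank items differently.

```python
import pandas as pd

# Jane's ratings and the pre-computed similarities from the table above
jane = {"The Godfather": 5.0, "Star Wars": 4.0}
similarity = pd.DataFrame(
    {
        "The Godfather": [0.95, 0.88, 0.42, 0.15],
        "Star Wars": [0.45, 0.32, 0.97, 0.20],
    },
    index=["Godfather Part II", "Scarface", "Empire Strikes Back", "Frozen"],
)

# Score each candidate: sum over Jane's movies of (similarity × her rating)
scores = sum(similarity[movie] * rating for movie, rating in jane.items())
ranking = scores.sort_values(ascending=False)
print(ranking)
```

Under this rule, Godfather Part II scores 0.95 × 5 + 0.45 × 4 = 6.55, ahead of Empire Strikes Back (5.98) and Scarface (5.68), while Frozen trails far behind.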

---

In [None]:
# Setup: Import libraries and load data from Part 1
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ Libraries imported!")
print("\nNote: Make sure you've completed Part 1 or have the data files ready.")

In [None]:
# Load MovieLens data (same as Part 1)
try:
    movies = pd.read_csv('movies.csv')
    ratings = pd.read_csv('ratings.csv')
    data = pd.merge(movies, ratings, on='movieId')
    
    user_item_matrix = data.pivot_table(
        index='userId',
        columns='title',
        values='rating'
    )
    
    user_item_matrix_filled = user_item_matrix.fillna(0)
    
    print("✅ Data loaded and prepared!")
    print(f"   - Matrix shape: {user_item_matrix.shape}")
    
except FileNotFoundError:
    print("❌ Error: Data files not found")
    print("   Please download MovieLens from: https://grouplens.org/datasets/movielens/")

## 🔄 The Key Difference: Matrix Transpose

**User-Based:**
- Calculate similarity between **rows** (users)
- Matrix: users × movies

**Item-Based:**
- Calculate similarity between **columns** (movies)
- Matrix: movies × users (transposed!)

### Visualization:

```
USER-BASED (users × movies):
          Movie1  Movie2  Movie3
User1       5       4       1
User2       4       5       2    ← Compare ROWS
User3       2       1       5

ITEM-BASED (movies × users):
          User1  User2  User3
Movie1      5      4      2
Movie2      4      5      1    ← Compare ROWS (which are movies!)
Movie3      1      2      5
```

**Teaching Note:** This is a critical concept. Draw on the board to show the transpose operation.
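
To see the transpose in action, here is a small sketch that applies the same row-wise similarity function twice: once to the matrix as-is (users) and once to its transpose (movies). The 4 × 3 ratings matrix and the helper `pairwise_cosine` are hypothetical; the helper is a plain NumPy stand-in for a library routine.

```python
import numpy as np

def pairwise_cosine(matrix):
    """Cosine similarity between every pair of ROWS of a 2-D array."""
    m = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = m / norms
    return unit @ unit.T

# Hypothetical ratings: rows = 4 users, columns = 3 movies
ratings = np.array([
    [5, 4, 1],
    [4, 5, 2],
    [1, 2, 5],
    [2, 1, 4],
])

user_sim = pairwise_cosine(ratings)    # compares rows → users (4 × 4)
item_sim = pairwise_cosine(ratings.T)  # transpose first → rows are movies (3 × 3)

print("user-user shape:", user_sim.shape)
print("item-item shape:", item_sim.shape)
print(np.round(item_sim, 3))
```

The only change between the two calls is `.T`. The output shapes differ because the first result has one row per user and the second has one row per movie.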

---

In [None]:
# Calculate ITEM-ITEM similarity matrix
print("🔄 Calculating item-item similarities...")
print("   This compares every movie with every other movie")
n_movies = user_item_matrix.shape[1]
print(f"   {n_movies:,} movies → {n_movies * (n_movies - 1) // 2:,} unique pairs to compare!")

# CRITICAL: Transpose the matrix so movies are rows
item_similarity = cosine_similarity(user_item_matrix_filled.T)
item_similarity_df = pd.DataFrame(
    item_similarity,
    index=user_item_matrix.columns,  # Movie titles
    columns=user_item_matrix.columns  # Movie titles
)

print(f"\n✅ Item similarity matrix created: {item_similarity_df.shape}")
print("   Each cell shows how similar two movies are (0 to 1)")

# Show sample similarities for a popular movie
sample_movie = "Godfather, The (1972)"

if sample_movie not in item_similarity_df.index:
    print(f"\nNote: '{sample_movie}' not found, using first movie as example")
    sample_movie = item_similarity_df.index[0]

print(f"\n📋 Movies most similar to '{sample_movie}':")
similar = item_similarity_df[sample_movie].sort_values(ascending=False)
for i, (movie, sim) in enumerate(similar.head(10).items(), 1):
    # Label the movie itself by name; with ties at 1.0 it is not guaranteed to be rank 1
    note = " (itself)" if movie == sample_movie else ""
    print(f"{i:2d}. {movie[:60]:<60} {sim:.3f}{note}")

In [None]:
# Visualize item similarity distribution
plt.figure(figsize=(12, 5))

# Get upper triangle to avoid counting each pair twice
similarities = item_similarity_df.values[np.triu_indices_from(item_similarity_df.values, k=1)]

plt.hist(similarities, bins=50, edgecolor='black', alpha=0.7, color='steelblue')
plt.xlabel('Cosine Similarity', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of Item-Item Similarities', fontsize=14, fontweight='bold')
plt.axvline(similarities.mean(), color='red', linestyle='--', label=f'Mean: {similarities.mean():.3f}')
plt.legend()
plt.grid(axis='y', alpha=0.3)
plt.show()

print(f"📊 Item Similarity Statistics:")
print(f"   - Mean: {similarities.mean():.3f}")
print(f"   - Median: {np.median(similarities):.3f}")
print(f"   - Std Dev: {similarities.std():.3f}")
print(f"\n💡 Most movies have low similarity (expected - diverse catalog)")

## 💻 Implementation: Item-Based Recommendation Functions

We'll implement two functions:
1. **Find similar movies** to a given movie
2. **Recommend movies for a user** based on items they liked

---

In [None]:
def recommend_similar_movies(movie_title, similarity_matrix, n_recommendations=10):
    """
    Find movies similar to a given movie.
    
    This is the core of item-based CF: finding similar items.
    
    Args:
        movie_title: Title of the reference movie
        similarity_matrix: Item-item similarity scores (movies × movies)
        n_recommendations: Number of similar movies to return
        
    Returns:
        pandas Series with similar movie titles and similarity scores
    """
    # Validation
    if movie_title not in similarity_matrix.index:
        raise ValueError(f"Movie '{movie_title}' not found in dataset")
    
    # Get similarity scores for this movie
    movie_similarities = similarity_matrix[movie_title]
    
    # Sort by similarity (descending)
    similar_movies = movie_similarities.sort_values(ascending=False)
    
    # Remove the movie itself (similarity = 1.0)
    similar_movies = similar_movies[similar_movies.index != movie_title]
    
    # Return top N
    return similar_movies.head(n_recommendations)

print("✅ Function 1: recommend_similar_movies() defined")

In [None]:
def recommend_movies_item_based(user_id, user_item_matrix, similarity_matrix, 
                                n_recommendations=10, min_rating=4.0):
    """
    Generate movie recommendations for a user using Item-Based CF.
    
    Algorithm:
    1. Find movies the user rated highly (>= min_rating)
    2. For each highly-rated movie, find similar movies
    3. Aggregate scores (weighted by user's rating and similarity)
    4. Recommend highest-scoring unwatched movies
    
    Args:
        user_id: Target user ID
        user_item_matrix: User-item rating matrix (filled with 0s)
        similarity_matrix: Item-item similarity scores
        n_recommendations: Number of movies to recommend
        min_rating: Threshold for "highly rated" (default: 4.0)
        
    Returns:
        pandas Series with movie titles and predicted scores
    """
    # Validation
    if user_id not in user_item_matrix.index:
        raise ValueError(f"User {user_id} not found in dataset")
    
    # Step 1: Get user's ratings
    user_ratings = user_item_matrix.loc[user_id]
    
    # Step 2: Find highly-rated movies
    highly_rated = user_ratings[user_ratings >= min_rating]
    
    if len(highly_rated) == 0:
        # User hasn't rated anything highly
        return pd.Series(dtype=float)
    
    # Step 3: Aggregate similarity scores
    recommendation_scores = pd.Series(0.0, index=user_item_matrix.columns)
    
    for movie, rating in highly_rated.items():
        # Get movies similar to this one
        similar_movies = similarity_matrix[movie]
        
        # Weight by user's rating (movies they loved contribute more)
        weighted_similarities = similar_movies * rating
        
        # Add to total scores
        recommendation_scores += weighted_similarities
    
    # Step 4: Remove movies already rated by user
    rated_movies = user_ratings[user_ratings > 0].index
    recommendations = recommendation_scores.drop(rated_movies, errors='ignore')
    
    # Step 5: Return top N
    return recommendations.sort_values(ascending=False).head(n_recommendations)

print("✅ Function 2: recommend_movies_item_based() defined")
print("\nBoth functions ready to use!")

In [None]:
# Test: Find similar movies
test_movie = "Star Wars (1977)"

# Check if movie exists
if test_movie not in item_similarity_df.index:
    # Try alternative title
    alternatives = [m for m in item_similarity_df.index if 'Star Wars' in m]
    if alternatives:
        test_movie = alternatives[0]
    else:
        test_movie = item_similarity_df.index[10]  # Use any movie

print(f"🎬 Finding movies similar to '{test_movie}'...\n")

similar_movies = recommend_similar_movies(test_movie, item_similarity_df, n_recommendations=10)

print(f"Top 10 movies similar to '{test_movie}':")
print("="*80)
for i, (movie, similarity) in enumerate(similar_movies.items(), 1):
    print(f"{i:2d}. {movie[:60]:<60} Similarity: {similarity:.3f}")

print("\n💡 Notice: Similar movies often share genre, director, or era!")

In [None]:
# Test: Generate recommendations for a user
test_user = 1

print(f"🎬 Generating item-based recommendations for User {test_user}...\n")

# First, show what this user has rated highly
user_ratings = user_item_matrix.loc[test_user]
highly_rated = user_ratings[user_ratings >= 4.0].sort_values(ascending=False)

print(f"📋 User {test_user}'s Highly-Rated Movies (>= 4.0):")
print("="*80)
for i, (movie, rating) in enumerate(highly_rated.head(10).items(), 1):
    stars = '⭐' * int(rating)
    print(f"{i:2d}. {movie[:55]:<55} {stars} ({rating})")
if len(highly_rated) > 10:
    print(f"    ... and {len(highly_rated)-10} more movies")

# Generate recommendations
item_recommendations = recommend_movies_item_based(
    test_user,
    user_item_matrix_filled,
    item_similarity_df,
    n_recommendations=10,
    min_rating=4.0
)

print(f"\n🎯 Item-Based Recommendations for User {test_user}:")
print("="*80)
for i, (movie, score) in enumerate(item_recommendations.items(), 1):
    print(f"{i:2d}. {movie[:60]:<60} Score: {score:.3f}")

print("\n💡 These are based on movies similar to what the user already loved!")

---

# 3. Comparison: User-Based vs Item-Based

## 📊 Side-by-Side Comparison

Let's compare both methods for the same user:

---

In [None]:
# Calculate user-based recommendations (from Part 1)
def recommend_movies_user_based(user_id, matrix, similarity_matrix, n=10, n_neighbors=10):
    """
    User-Based CF (from Part 1).
    """
    if user_id not in similarity_matrix.index:
        raise ValueError(f"User {user_id} not found")
    
    # Exclude the target user explicitly instead of assuming they sort first
    user_similarities = similarity_matrix.loc[user_id].drop(user_id)
    similar_users_indices = user_similarities.sort_values(ascending=False).index[:n_neighbors]
    similar_users_ratings = matrix.loc[similar_users_indices]
    similarity_weights = user_similarities.loc[similar_users_indices]
    
    weighted_sum = similar_users_ratings.T.dot(similarity_weights)
    total_weight = similarity_weights.sum()
    
    if total_weight > 0:
        predicted_ratings = weighted_sum / total_weight
    else:
        return pd.Series(dtype=float)
    
    user_rated_movies = matrix.loc[user_id]
    user_rated_movies = user_rated_movies[user_rated_movies > 0].index
    recommendations = predicted_ratings.drop(user_rated_movies, errors='ignore')
    
    return recommendations.sort_values(ascending=False).head(n)

# Calculate user similarity matrix
user_similarity = cosine_similarity(user_item_matrix_filled)
user_similarity_df = pd.DataFrame(
    user_similarity,
    index=user_item_matrix.index,
    columns=user_item_matrix.index
)

print("✅ User-based function ready for comparison")

In [None]:
# Generate both types of recommendations
comparison_user = 1

user_based_recs = recommend_movies_user_based(
    comparison_user,
    user_item_matrix_filled,
    user_similarity_df,
    n=10
)

item_based_recs = recommend_movies_item_based(
    comparison_user,
    user_item_matrix_filled,
    item_similarity_df,
    n_recommendations=10
)

print(f"🔍 COMPARISON FOR USER {comparison_user}")
print("="*80)

print("\n👥 USER-BASED CF (People like you also liked...):")
print("-"*80)
for i, (movie, score) in enumerate(user_based_recs.head(10).items(), 1):
    print(f"{i:2d}. {movie[:65]:<65} {score:.3f}")

print("\n🎬 ITEM-BASED CF (Because you watched...):")
print("-"*80)
for i, (movie, score) in enumerate(item_based_recs.head(10).items(), 1):
    print(f"{i:2d}. {movie[:65]:<65} {score:.3f}")

# Find overlap
overlap = set(user_based_recs.index) & set(item_based_recs.index)
print(f"\n📌 Overlap: {len(overlap)} movies appear in both lists")
if overlap:
    print("   Movies in both:", list(overlap)[:3])

## 📋 When to Use Which?

### User-Based CF: "People like you also liked..."

**Best for:**
- Small to medium number of users
- User preferences change frequently
- Want serendipitous recommendations (discovery)
- Social platforms ("friends also liked")

**Advantages:**
- ✅ Better for discovery (finds unexpected matches)
- ✅ Captures evolving trends (user taste changes reflected)
- ✅ Can explain: "Users like you enjoyed this"

**Challenges:**
- ❌ Computationally expensive (O(users²))
- ❌ Doesn't scale to millions of users
- ❌ Similarity matrix changes frequently
- ❌ Cold start: new users have no history

---

### Item-Based CF: "Because you watched..."

**Best for:**
- Large number of users
- Stable item catalog
- E-commerce (product recommendations)
- Streaming services (Netflix, Spotify)

**Advantages:**
- ✅ Scales better (O(items²), usually items << users)
- ✅ Item similarities are stable (can pre-compute)
- ✅ Faster recommendations (lookup vs computation)
- ✅ Can explain: "Similar to movies you loved"

**Challenges:**
- ❌ Less serendipitous (stays in same genre/style)
- ❌ Filter bubble (reinforces existing preferences)
- ❌ Cold start: new items have no ratings
- ❌ Popularity bias (popular items dominate)
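
To make the scaling argument concrete, the sketch below counts how many distinct pairs each approach has to compare. The platform sizes are hypothetical round numbers, not figures from any real service.

```python
def pair_count(n):
    """Number of distinct pairs among n objects: n(n-1)/2."""
    return n * (n - 1) // 2

# Hypothetical platform: many more users than items
n_users, n_items = 1_000_000, 20_000

user_pairs = pair_count(n_users)
item_pairs = pair_count(n_items)

print(f"user-user pairs: {user_pairs:,}")
print(f"item-item pairs: {item_pairs:,}")
print(f"ratio: {user_pairs / item_pairs:,.0f}x")
```

With these sizes the user-user matrix needs roughly 2,500 times as many comparisons as the item-item matrix, which is why item-based CF is usually the more practical choice when users far outnumber items.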

---

## 🏢 Real-World Usage

| Company | Primary Approach | Why? |
|---------|-----------------|------|
| **Amazon** | Item-Based | Hundreds of millions of users, relatively stable products |
| **Netflix** | Hybrid (both) | Started item-based, now uses neural networks |
| **Spotify** | Hybrid + Deep Learning | Combines CF with audio features |
| **YouTube** | Item-Based + Neural Nets | Billions of users, billions of videos |
| **Facebook** | User-Based (social graph) | Social connections are key |

**Teaching Note:** Discuss why each company chose their approach. What are their constraints?

---

## 💡 Hybrid Approaches

Most production systems combine both:

1. **Weighted Hybrid:** Average scores from both methods
2. **Switching Hybrid:** Switch between methods depending on context, such as how much rating history a user has
3. **Cascade Hybrid:** Use user-based first, item-based for tie-breaking
4. **Feature Augmentation:** Use CF outputs as features for ML model

**See AI Explorer:** `/recommendation-systems` → Hybrid Systems for more details
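
A weighted hybrid can be sketched in a few lines. One practical detail: the two methods in this notebook produce scores on different scales (predicted ratings vs. summed weighted similarities), so this sketch rescales each list to [0, 1] before blending. The function name `weighted_hybrid`, the `alpha` parameter, and the min-max scaling are assumptions made for illustration, not a standard API.

```python
import pandas as pd

def weighted_hybrid(user_scores, item_scores, alpha=0.5, n=10):
    """Blend two score Series after min-max scaling each to [0, 1]."""
    def scale(s):
        span = s.max() - s.min()
        return (s - s.min()) / span if span > 0 else s * 0.0

    # Outer-align the two lists; a movie missing from one list scores 0 there
    u, i = scale(user_scores).align(scale(item_scores), join="outer", fill_value=0.0)
    blended = alpha * u + (1 - alpha) * i
    return blended.sort_values(ascending=False).head(n)

# Hypothetical outputs of a user-based and an item-based recommender
user_scores = pd.Series({"A": 4.5, "B": 3.0, "C": 4.0})
item_scores = pd.Series({"B": 12.0, "C": 20.0, "D": 8.0})

hybrid = weighted_hybrid(user_scores, item_scores, alpha=0.5)
print(hybrid)
```

Movie C ranks first because both methods score it well. Raising `alpha` shifts the blend toward the user-based list; lowering it favors the item-based list.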

---

# 4. Connections to Broader AI

## 🧠 How Recommendation Systems Fit in AI

Collaborative filtering is part of a larger AI ecosystem:

---

## 🔄 Evolution of Recommendation Systems

### **1990s: Rule-Based & Content Filtering**
- Manual rules: "If user liked action, recommend action"
- Content-based: Match item features
- **Limitation:** Doesn't discover unexpected preferences

### **2000s: Collaborative Filtering (What We Learned!)**
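- User-based and item-based CF (user-based ideas date back to 1990s systems such as GroupLens; item-based CF was popularized in the early 2000s)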
- Matrix factorization (SVD, ALS)
- **Breakthrough:** Netflix Prize (2006-2009)
- **Limitation:** Cold start, sparsity

### **2010s: Neural Networks & Deep Learning**
- Deep neural collaborative filtering
- Autoencoders for recommendations
- Sequence models (RNNs, LSTMs)
- **Advantage:** Learn complex patterns

### **2020s: Transformers & Large Models**
- Transformer-based recommenders
- Multi-modal (text, images, audio)
- Personalized language models
- **Example:** YouTube, TikTok, Instagram feeds

---

## 🔗 Connection 1: Matrix Factorization

**Core Idea:** Decompose the user-item matrix into latent factors

```
Rating Matrix (users × items) ≈ User Features × Item Features
    1000 × 5000              ≈    1000 × 50    ×   50 × 5000
```

**Latent Factors** might represent:
- Genre preferences (action, drama, comedy)
- Time period (classic, modern)
- Intensity (cerebral vs. action-packed)
- Visual style (dark vs. bright)

**Techniques:**
- **SVD (Singular Value Decomposition)**
- **ALS (Alternating Least Squares)**
- **NMF (Non-negative Matrix Factorization)**

**Connection to AI:** This is unsupervised learning - finding hidden structure in data!
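
The idea of learning latent factors can be illustrated with a tiny factorization trained only on the observed ratings. This is a sketch using plain stochastic gradient descent (not SVD, ALS, or NMF specifically); the ratings matrix, the number of factors `k`, the learning rate, and the regularization strength are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4 users × 5 movies; 0 marks a missing rating
R = np.array([
    [5, 4, 0, 1, 0],
    [4, 0, 0, 1, 1],
    [1, 1, 0, 5, 4],
    [0, 1, 5, 4, 0],
], dtype=float)
observed = R > 0

k = 2                                             # number of latent factors
P = rng.normal(scale=0.1, size=(R.shape[0], k))   # user factors (users × k)
Q = rng.normal(scale=0.1, size=(R.shape[1], k))   # item factors (items × k)

def rmse(P, Q):
    """Root mean square error on the observed ratings only."""
    err = (R - P @ Q.T)[observed]
    return float(np.sqrt(np.mean(err ** 2)))

rmse_before = rmse(P, Q)
lr, reg = 0.01, 0.02                              # learning rate, L2 penalty
for _ in range(2000):
    for u, i in zip(*np.nonzero(observed)):
        e = R[u, i] - P[u] @ Q[i]                 # error on one known rating
        p_old = P[u].copy()
        P[u] += lr * (e * Q[i] - reg * P[u])
        Q[i] += lr * (e * p_old - reg * Q[i])
rmse_after = rmse(P, Q)

print(f"RMSE before: {rmse_before:.3f}, after: {rmse_after:.3f}")
print(np.round(P @ Q.T, 2))   # the filled-in matrix, including missing cells
```

The product `P @ Q.T` has a value in every cell, so the entries that were 0 (missing) in `R` become predictions.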

---

## 🔗 Connection 2: Neural Collaborative Filtering

**Evolution:** Replace dot product with neural network

**Traditional CF:**
```
rating = user_vector · item_vector
```

**Neural CF:**
```
rating = NeuralNetwork(user_embedding, item_embedding)
```

**Architecture:**
```
User ID → Embedding Layer → \
                             → Concatenate → Hidden Layers → Output (rating)
Item ID → Embedding Layer → /
```

**Advantages:**
- Learns non-linear relationships
- Can incorporate side information (user age, item genre)
- Flexible architecture

**See AI Explorer:** `/recommendation-systems` → Neural Networks for implementation examples
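
The architecture diagram can be traced with a forward pass written in plain NumPy. This is only a sketch of the data flow: the layer sizes are hypothetical, the weights are random rather than trained, and a real model would be built and trained in a framework such as PyTorch or TensorFlow.

```python
import numpy as np

rng = np.random.default_rng(42)
n_users, n_items, emb_dim, hidden = 100, 500, 8, 16   # hypothetical sizes

# Embedding tables: one vector per user ID and per item ID
user_emb = rng.normal(scale=0.1, size=(n_users, emb_dim))
item_emb = rng.normal(scale=0.1, size=(n_items, emb_dim))

# One hidden layer and a scalar output layer (untrained, random weights)
W1 = rng.normal(scale=0.1, size=(2 * emb_dim, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(hidden, 1))
b2 = np.zeros(1)

def predict(user_ids, item_ids):
    """Embedding lookup → concatenate → hidden layer (ReLU) → output."""
    x = np.concatenate([user_emb[user_ids], item_emb[item_ids]], axis=1)
    h = np.maximum(0.0, x @ W1 + b1)
    return (h @ W2 + b2).ravel()

preds = predict(np.array([0, 1, 2]), np.array([10, 20, 30]))
print(preds.shape, preds)
```

Because nothing has been trained, the printed numbers are meaningless as ratings; the point is the shape of the computation, where the dot product is replaced by learned layers.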

---

## 🔗 Connection 3: Deep Learning Evolution

### **Autoencoders for Recommendations**
- Compress user-item interactions
- Denoise sparse ratings
- Generate recommendations from reconstructed matrix

### **Recurrent Neural Networks (RNNs)**
- Model sequential behavior: "What will user watch next?"
- Capture temporal patterns
- Used in: YouTube, Netflix (session-based)

### **Transformers**
- Self-attention mechanism
- Capture long-range dependencies
- Multi-modal: combine text, images, metadata
- Used in: Modern recommender systems (2020+)

### **Reinforcement Learning**
- Maximize long-term engagement (not just next click)
- Exploration vs. exploitation
- Used in: TikTok, Instagram feeds

---

## 🧩 How It All Connects

```
Classical ML Foundation:
├── k-Nearest Neighbors (k-NN) → User/Item similarity
├── Linear Algebra → Matrix operations, SVD
├── Optimization → Gradient descent, ALS
└── Statistics → Correlation, regression

                    ↓

Collaborative Filtering (What We Learned!):
├── User-Based CF
├── Item-Based CF
└── Matrix Factorization

                    ↓

Neural Networks:
├── Neural Collaborative Filtering
├── Autoencoders
├── RNNs / LSTMs
└── Transformers

                    ↓

Modern AI Systems:
├── Hybrid models (CF + Deep Learning)
├── Multi-modal recommendations
├── Reinforcement learning
└── Personalized LLMs
```

**Key Insight:** Collaborative filtering is not obsolete! It's a building block:
- Used in **ensemble models** with neural networks
- Provides **interpretable baselines**
- Needs **less data and compute** than deep models, which makes it a practical starting point (CF itself still struggles with true cold start)
- **Computationally efficient** for many use cases

**Teaching Note:** Emphasize that students are learning foundational concepts that scale to modern AI.

---

# 5. Statistical Foundations

## 📊 The Math Behind Collaborative Filtering

CF is built on classical statistics and machine learning concepts:

---

## 1️⃣ Correlation and Similarity

### **Cosine Similarity (What We Used)**
```
cos(θ) = (A · B) / (||A|| × ||B||)
```
- Measures angle between vectors
- Range: 0 to 1 (for ratings)
- **Advantage:** Ignores vector length, so it compares rating patterns rather than totals (it does not remove a rater's harsh/lenient offset; Pearson does)

### **Pearson Correlation**
```
r = Σ(xᵢ - x̄)(yᵢ - ȳ) / √[Σ(xᵢ - x̄)² × Σ(yᵢ - ȳ)²]
```
- Measures linear relationship
- Range: -1 to 1
- **Advantage:** Centers data (removes rating bias)

### **Jaccard Similarity**
```
J(A,B) = |A ∩ B| / |A ∪ B|
```
- Measures set overlap
- Range: 0 to 1
- **Advantage:** Works with binary data (liked/not liked)

### **Euclidean Distance**
```
d = √[Σ(xᵢ - yᵢ)²]
Similarity = 1 / (1 + d)
```
- Geometric distance
- **Advantage:** Intuitive, works well with dense data

**Connection:** These are all distance/similarity metrics from **clustering** and **classification**!
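
The four metrics above can be compared side by side on the same data. In this sketch the two raters are hypothetical: they share the same rating pattern, but one is consistently two stars harsher. The helper names are chosen here for illustration.

```python
import numpy as np

def cosine(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def pearson(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()           # center each rater
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    return float((xc * yc).sum() / denom) if denom > 0 else 0.0

def jaccard(a, b):
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def euclidean_similarity(x, y):
    d = np.linalg.norm(np.asarray(x, float) - np.asarray(y, float))
    return float(1.0 / (1.0 + d))

lenient = [5, 4, 5, 3]   # hypothetical ratings on the same four movies
harsh = [3, 2, 3, 1]     # same pattern, shifted down by 2 stars

cos_val = cosine(lenient, harsh)
pearson_val = pearson(lenient, harsh)
euclid_val = euclidean_similarity(lenient, harsh)
jaccard_val = jaccard({"A", "B", "C"}, {"B", "C", "D"})  # sets of liked movies

print(f"cosine:    {cos_val:.3f}")
print(f"pearson:   {pearson_val:.3f}")
print(f"euclidean: {euclid_val:.3f}")
print(f"jaccard:   {jaccard_val:.3f}")
```

Pearson treats the two raters as identical because it centers each one on their own mean. Cosine stays high but below 1, and the Euclidean-based score drops to 0.2 because it reacts directly to the 2-star gap.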

---

## 2️⃣ k-Nearest Neighbors (k-NN)

**CF is essentially k-NN applied to ratings:**

**User-Based CF:**
1. Find k most similar users → k-nearest neighbors in user space
2. Aggregate their ratings → weighted average
3. Predict target user's rating → classification/regression

**Item-Based CF:**
1. Find k most similar items → k-nearest neighbors in item space
2. Aggregate user's ratings on similar items → weighted average
3. Predict rating for new item → regression

**Parameters:**
- `k` = number of neighbors (we used 10)
- Distance metric = cosine similarity
- Weighting = by similarity score

**Connection:** This is supervised learning (predict ratings) using unsupervised learning (find neighbors)!

---

## 3️⃣ Regression and Prediction

**CF is a regression problem:**

**Goal:** Predict missing ratings
```
r̂ᵤᵢ = predicted rating for user u on item i
```

**User-Based Prediction:**
```
r̂ᵤᵢ = (Σ sim(u,v) × rᵥᵢ) / Σ sim(u,v)
```
Where:
- `sim(u,v)` = similarity between users u and v
- `rᵥᵢ` = rating by user v on item i
- Sum over k most similar users

**Item-Based Prediction:**
```
r̂ᵤᵢ = (Σ sim(i,j) × rᵤⱼ) / Σ sim(i,j)
```
Where:
- `sim(i,j)` = similarity between items i and j
- `rᵤⱼ` = rating by user u on item j
- Sum over k most similar items

**This is weighted average regression!**
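
The item-based formula can be worked through on a single prediction. This is a minimal sketch: the ratings, the similarity values, and the helper name `predict_item_based` are hypothetical, and returning `None` when no usable neighbor exists is just one possible convention.

```python
import numpy as np

def predict_item_based(user_ratings, item_sims, k=2):
    """
    Predict user u's rating for a target item i.

    user_ratings: u's ratings for every item j (0 = unrated)
    item_sims:    sim(i, j) between the target item i and every item j
    """
    user_ratings = np.asarray(user_ratings, dtype=float)
    item_sims = np.asarray(item_sims, dtype=float)

    rated = np.nonzero(user_ratings > 0)[0]
    # Keep the k rated items most similar to the target item
    top = rated[np.argsort(item_sims[rated])[::-1][:k]]
    weights = item_sims[top]
    if weights.sum() <= 0:
        return None
    # Weighted average: Σ sim(i,j) × r_uj / Σ sim(i,j)
    return float(weights @ user_ratings[top] / weights.sum())

ratings_u = [5.0, 0.0, 3.0, 4.0]   # the user has not rated item 1
sims_i = [0.9, 0.8, 0.1, 0.6]      # similarity of each item to the target

pred = predict_item_based(ratings_u, sims_i, k=2)
print(f"predicted rating: {pred:.2f}")
```

With k = 2 the neighbors are items 0 and 3, so the prediction is (0.9 × 5 + 0.6 × 4) / (0.9 + 0.6) = 4.6. Item 1 is ignored despite its high similarity because the user never rated it.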

---

## 4️⃣ Bayesian Interpretation

**CF can be viewed through Bayesian lens:**

**Question:** What's the probability user u will like item i?

**Bayes' Theorem:**
```
P(like i | user u) = P(user u | like i) × P(like i) / P(user u)
```

**User-Based CF as Bayes:**
- **Prior:** What do similar users like?
- **Likelihood:** How similar are they?
- **Posterior:** Predicted rating

**Connection to Probabilistic Models:**
- Bayesian networks
- Latent factor models
- Probabilistic matrix factorization

---

## 5️⃣ Matrix Decomposition (SVD)

**Singular Value Decomposition:**

```
R = U × Σ × Vᵀ
```

Where:
- `R` = rating matrix (users × items)
- `U` = user feature matrix (users × latent factors)
- `Σ` = singular values (importance of factors)
- `V` = item feature matrix (items × latent factors)

**Interpretation:**
- Each user represented by k latent features
- Each item represented by k latent features
- Rating = dot product of feature vectors

**Dimensionality Reduction:**
- Reduce from 1000s of items to ~50 latent factors
- A small fraction of the dimensions often captures most of the variance (the exact share depends on the dataset)
- Removes noise from sparse data

**Connection:** Same technique used in:
- Principal Component Analysis (PCA)
- Image compression
- Natural language processing (LSA)
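
The decomposition and the rank-k approximation can be tried directly with `np.linalg.svd`. This sketch uses a small, fully filled hypothetical matrix; plain SVD needs a value in every cell, so real sparse rating data would first have to be imputed or handled with a method that skips missing entries.

```python
import numpy as np

# Hypothetical dense ratings: 4 users × 5 movies, two visible taste groups
R = np.array([
    [5, 4, 4, 1, 1],
    [4, 5, 5, 1, 2],
    [1, 1, 2, 5, 4],
    [2, 1, 1, 4, 5],
], dtype=float)

# Thin SVD: U is users × factors, s holds singular values, Vt is factors × items
U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 2
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k reconstruction
energy = (s[:k] ** 2).sum() / (s ** 2).sum()   # share of squared singular values kept

print("singular values:", np.round(s, 3))
print(f"share retained by top {k} factors: {energy:.3f}")
print(np.round(R_k, 2))
```

Comparing `R_k` with `R` shows how much of the matrix two latent factors can reproduce. Increasing `k` up to 4 recovers `R` exactly.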

---

## 🔄 How Statistics Connect to Modern AI

**Classical Statistics → CF → Neural Networks:**

1. **Correlation** → Cosine similarity → Attention mechanisms (transformers)
2. **k-NN** → User/Item CF → Graph neural networks
3. **Regression** → Rating prediction → Neural regression
4. **Bayesian** → Probabilistic CF → Variational autoencoders
5. **SVD** → Matrix factorization → Embedding layers

**Teaching Note:** Show how classical statistics are foundational to modern deep learning.

---

# 6. Complete Glossary

## 📖 Comprehensive Reference

---

## 🎯 Core Concepts

**Collaborative Filtering (CF)**  
Making predictions about user preferences by leveraging collective patterns from many users.

**User-Item Matrix**  
A matrix where rows represent users, columns represent items, and values are ratings. Most cells are empty (sparse).

**Sparsity**  
The proportion of missing values in the user-item matrix. Real-world datasets are 95-99% sparse.

**Cold Start Problem**  
Difficulty making recommendations for new users (no history) or new items (no ratings).

---

## 🔍 Similarity Metrics

**Cosine Similarity**  
Measures the angle between two vectors. Range: 0 to 1 (for ratings). Formula: `cos(θ) = (A·B) / (||A|| × ||B||)`

**Pearson Correlation**  
Measures linear correlation, accounting for rating bias. Range: -1 to 1. Centers data around mean.

**Jaccard Similarity**  
Measures set overlap. Best for binary data (liked/not liked). Formula: `|A∩B| / |A∪B|`

**Euclidean Distance**  
Straight-line distance in multi-dimensional space. Convert to similarity: `1/(1+distance)`

**Manhattan Distance**  
Sum of absolute differences. Also called L1 distance or taxicab distance.

---

## 📊 Evaluation Metrics

**Mean Absolute Error (MAE)**  
Average absolute difference between predicted and actual ratings. Lower is better. Formula: `Σ|predicted - actual| / n`

**Root Mean Square Error (RMSE)**  
Square root of average squared errors. Penalizes large errors more. Formula: `√(Σ(predicted - actual)² / n)`

**Precision**  
Proportion of recommended items that user actually liked. Formula: `|relevant ∩ recommended| / |recommended|`

**Recall**  
Proportion of liked items that were recommended. Formula: `|relevant ∩ recommended| / |relevant|`

**F1 Score**  
Harmonic mean of precision and recall. Formula: `2 × (precision × recall) / (precision + recall)`

**Mean Reciprocal Rank (MRR)**  
Average of reciprocal ranks of first relevant item. Rewards getting relevant items at top of list.

**Normalized Discounted Cumulative Gain (NDCG)**  
Measures ranking quality, accounting for position. Higher-ranked relevant items contribute more.

**Coverage**  
Percentage of items that can be recommended. Measures how well system uses full catalog.

**Diversity**  
Variety in recommendations. Avoids recommending only popular or similar items.

**Serendipity**  
Ability to recommend unexpected but relevant items. Balances accuracy with discovery.
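
Several of these metrics take only a few lines to compute. The sketch below covers MAE, RMSE, precision and recall at k, F1, and the reciprocal rank for a single user. All data and function names are hypothetical; in practice the predictions would come from held-out ratings that the recommender never saw.

```python
import numpy as np

def mae(pred, actual):
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(actual))))

def rmse(pred, actual):
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(actual)) ** 2)))

def precision_recall_at_k(recommended, relevant, k):
    """recommended: ranked list of items; relevant: set of items the user liked."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def f1(precision, recall):
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0

def reciprocal_rank(recommended, relevant):
    """1 / rank of the first relevant item (0 if none appears)."""
    for rank, item in enumerate(recommended, 1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

# Rating-prediction accuracy on hypothetical held-out ratings
predicted = [4.0, 3.5, 2.0, 5.0]
actual = [5.0, 3.0, 2.0, 4.0]
mae_val, rmse_val = mae(predicted, actual), rmse(predicted, actual)

# Ranking quality for one hypothetical user
recommended = ["A", "B", "C", "D", "E"]
relevant = {"B", "D", "F"}
p, r = precision_recall_at_k(recommended, relevant, k=5)
f1_val = f1(p, r)
rr = reciprocal_rank(recommended, relevant)

print(f"MAE={mae_val:.3f}  RMSE={rmse_val:.3f}")
print(f"P@5={p:.3f}  R@5={r:.3f}  F1={f1_val:.3f}  RR={rr:.3f}")
```

Averaging the reciprocal rank over many users gives MRR. RMSE comes out higher than MAE here because it penalizes the two 1-star errors more heavily than the half-star error.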

---

## 🛠️ Techniques and Algorithms

**User-Based CF**  
Find similar users, recommend what they liked. Formula: `rating = Σ(sim × rating) / Σsim`

**Item-Based CF**  
Find similar items, recommend based on user's favorites. More scalable than user-based.

**Matrix Factorization**  
Decompose rating matrix into user and item feature matrices. Examples: SVD, ALS, NMF.

**Singular Value Decomposition (SVD)**  
Linear algebra technique that finds latent factors. `R = U × Σ × Vᵀ`

**Alternating Least Squares (ALS)**  
Iterative optimization for matrix factorization. Alternates between fixing user/item factors.

**k-Nearest Neighbors (k-NN)**  
Find k most similar users/items. CF is essentially k-NN applied to ratings.

**Content-Based Filtering**  
Recommend based on item features (genre, director, etc.). Doesn't require other users' data.

**Hybrid Methods**  
Combine multiple approaches (CF + content-based + demographic). Used by most production systems.

---

## 🧠 Advanced Concepts

**Neural Collaborative Filtering**  
Replace dot product with neural network. Learns non-linear relationships.

**Autoencoders**  
Neural networks that compress and reconstruct user-item matrix. Denoises sparse data.

**Recurrent Neural Networks (RNN)**  
Model sequential behavior. Captures temporal patterns: "What will user watch next?"

**Transformers**  
Self-attention mechanism. Captures long-range dependencies. State-of-the-art for many tasks.

**Reinforcement Learning**  
Maximize long-term engagement, not just immediate clicks. Exploration vs. exploitation.

**Embedding**  
Dense vector representation of users/items. Learned from data, captures latent features.

**Attention Mechanism**  
Weights different inputs by importance. Core component of transformers.

---

## ⚠️ Challenges and Issues

**Cold Start**  
New users/items have no history. Solutions: content-based, demographic, hybrid.

**Sparsity**  
Most users rate few items. Solutions: matrix factorization, implicit feedback.

**Scalability**  
Computing similarity for millions of users/items. Solutions: sampling, approximate methods, item-based CF.

**Filter Bubble**  
Recommendations reinforce existing preferences. Users stuck in narrow content. Solutions: diversity, serendipity metrics.

**Popularity Bias**  
System over-recommends popular items. Long-tail items ignored. Solutions: regularization, re-ranking.

**Shilling Attacks**  
Malicious users create fake ratings to boost/demote items. Solutions: anomaly detection, robust similarity.

**Privacy**  
Recommendations reveal user preferences. Solutions: differential privacy, federated learning.

**Fairness**  
System may discriminate against certain groups/items. Solutions: fairness-aware algorithms.

---

## 📚 Related Concepts from Other AI Areas

**Clustering**  
Group similar users/items. Related to CF's similarity computation.

**Classification**  
Predict categories (liked/disliked). Binary recommendation problem.

**Regression**  
Predict continuous values (ratings). Core of CF prediction.

**Dimensionality Reduction**  
Reduce features while preserving information. PCA, SVD, autoencoders.

**Natural Language Processing (NLP)**  
Process text features (reviews, descriptions). Word embeddings similar to user/item embeddings.

**Computer Vision**  
Process visual features (movie posters, product images). CNNs for visual recommendations.

**Graph Neural Networks**  
Model user-item interactions as graph. Captures multi-hop relationships.

**Transfer Learning**  
Use knowledge from one domain in another. Cross-domain recommendations.

---

# 7. Summary and Wrap-Up

## 🎯 What We Learned (Both Parts)

### **Part 1: User-Based Collaborative Filtering**
- ✅ Understanding the recommendation problem
- ✅ Creating and analyzing user-item matrices
- ✅ Calculating user-user similarity with cosine similarity
- ✅ Implementing user-based recommendations
- ✅ Working with the MovieLens dataset

### **Part 2: Item-Based CF and Beyond**
- ✅ Item-based collaborative filtering
- ✅ Comparing user-based vs item-based approaches
- ✅ Connections to broader AI (neural networks, transformers)
- ✅ Statistical foundations (k-NN, regression, Bayesian)
- ✅ Complete glossary of metrics and techniques

---

## 💡 Key Takeaways

### 1. **Two Main Approaches**
- **User-Based:** "People like you also liked..." → Better discovery
- **Item-Based:** "Because you watched..." → More scalable

### 2. **Real Systems Use Hybrids**
- Combine CF with content-based filtering
- Layer neural networks on top of CF
- Balance accuracy, diversity, serendipity
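
**One simple way to combine signals:** A weighted blend of a CF score and a content-based score is about the simplest hybrid. The sketch below assumes both scores are already scaled to the same 0-1 range; the function name `hybrid_scores`, the weight `alpha`, and the example numbers are all hypothetical, and production hybrids are usually more elaborate.

```python
import numpy as np

def hybrid_scores(cf_scores, content_scores, alpha=0.7):
    """Linear blend: alpha weights the CF score, (1 - alpha) the content score."""
    cf = np.asarray(cf_scores, dtype=float)
    content = np.asarray(content_scores, dtype=float)
    return alpha * cf + (1.0 - alpha) * content

# Hypothetical scores for three candidate movies.
# The third movie is brand new, so CF has no signal for it (score 0.0).
cf_scores = [0.9, 0.2, 0.0]
content_scores = [0.3, 0.4, 0.8]

blended = hybrid_scores(cf_scores, content_scores, alpha=0.5)

# Item indices ordered from highest to lowest blended score.
ranking = np.argsort(-blended)

print("Blended scores:", blended)
print("Ranking (best first):", ranking.tolist())
```

With equal weights, the new movie moves ahead of the second one even though CF alone would rank it last, which is how a content signal can soften the cold start problem.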

### 3. **CF is Foundational**
- Connects to k-NN, regression, clustering
- Evolved into neural collaborative filtering
- Still used in production systems today

### 4. **Challenges Remain**
- Cold start problem
- Sparsity and scalability
- Filter bubbles and fairness
- Privacy concerns

---

## 🚀 Next Steps for Students

### **Practice:**
1. Implement both methods from scratch (Assignment 7)
2. Experiment with different similarity metrics
3. Try matrix factorization (SVD)
4. Build a hybrid recommender
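
**Starting point for item 2:** Pearson correlation is a common alternative to cosine similarity. It is cosine similarity computed after subtracting each user's mean rating. The sketch below assumes two hypothetical users (`alice`, `bob`) who have both rated the same three movies; handling movies that only one of them rated is left out.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson_similarity(a, b):
    """Cosine similarity after removing each user's mean rating."""
    return cosine_similarity(a - a.mean(), b - b.mean())

# Hypothetical ratings on the same three movies.
# Bob ranks the movies in the same order as Alice but rates everything 2 points lower.
alice = np.array([5.0, 4.0, 3.0])
bob = np.array([3.0, 2.0, 1.0])

cos_sim = cosine_similarity(alice, bob)
pearson_sim = pearson_similarity(alice, bob)

print(f"Cosine:  {cos_sim:.4f}")
print(f"Pearson: {pearson_sim:.4f}")
```

Pearson comes out at 1.0 (up to floating-point rounding) because mean-centering removes Bob's habit of rating low, while cosine is slightly below 1. Comparing the two on real data is one way to approach the "different similarity metrics" experiment.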

### **Explore:**
1. **AI Explorer** → `/recommendation-systems` for interactive demos
2. Try the Surprise library (Python) for advanced CF
3. Read the Netflix Prize papers (groundbreaking work)
4. Explore modern neural recommenders (PyTorch, TensorFlow)

### **Real-World Projects:**
1. Build a movie recommender with web interface
2. Create a music playlist generator
3. Implement product recommendations for e-commerce
4. Build a book/article recommendation system

---

## 📚 Additional Resources

### **Datasets:**
- MovieLens: https://grouplens.org/datasets/movielens/
- Amazon Reviews: https://cseweb.ucsd.edu/~jmcauley/datasets.html
- Last.fm Music: http://ocelma.net/MusicRecommendationDataset/
- Book-Crossing: http://www2.informatik.uni-freiburg.de/~cziegler/BX/

### **Libraries:**
- Surprise: http://surpriselib.com/ (collaborative filtering)
- LightFM: https://github.com/lyst/lightfm (hybrid recommenders)
- RecBole: https://recbole.io/ (comprehensive toolkit)
- Implicit: https://github.com/benfred/implicit (fast CF)

### **Papers to Read:**
1. "Item-Based Collaborative Filtering Recommendation Algorithms" (Sarwar et al., 2001)
2. "Matrix Factorization Techniques for Recommender Systems" (Koren et al., 2009)
3. "Neural Collaborative Filtering" (He et al., 2017)
4. "Deep Learning Based Recommender System: A Survey and New Perspectives" (Zhang et al., 2019)

### **AI Explorer Pages:**
- `/recommendation-systems` → Complete guide with interactive examples
- `/neural-networks` → Foundation for advanced recommenders
- `/machine-learning` → Related ML concepts

---

## 🤔 Final Discussion Questions

**For the Class (15 minutes):**

1. **Ethics:** Should recommendation systems prioritize engagement or user well-being? (e.g., YouTube recommending conspiracy theories)

2. **Privacy:** How much should systems know about you to make good recommendations? Where's the line?

3. **Filter Bubbles:** Do recommendations trap us in echo chambers? How can we design for diversity?

4. **Fairness:** Should recommendations be equally good for all users? What about niche interests?

5. **Future:** Where are recommendation systems headed? Will transformers replace CF entirely?

**Teaching Note:** These questions have no right answers. Encourage debate and critical thinking.

---

## ✅ Assignment Preview

**Assignment 7: Collaborative Filtering Implementation**

You will:
1. Implement user-based CF from scratch (30 pts)
2. Implement item-based CF from scratch (30 pts)
3. Compare both methods (25 pts)
4. Analyze limitations and propose improvements (15 pts)

**Due:** [See Canvas]

**Resources:**
- This notebook (reference implementation)
- Corrected code file (bug-free version)
- AI Explorer (conceptual help)
- Office hours (debugging support)

---

## 🎬 Closing Thoughts

**Recommendation systems are everywhere:**
- 🎬 What you watch (Netflix, YouTube, TikTok)
- 🎵 What you listen to (Spotify, Apple Music)
- 📚 What you read (Amazon, Goodreads)
- 🛍️ What you buy (Amazon, eBay)
- 👥 Who you connect with (Facebook, LinkedIn)
- 📰 What news you see (Google, Twitter)

**You now understand how they work!**

**From simple collaborative filtering to advanced neural networks, the core ideas remain:**
- Learn from collective behavior
- Find patterns in data
- Predict what people will like
- Balance accuracy, diversity, and fairness

**As AI engineers, you can:**
- Build better recommendation systems
- Understand their limitations
- Design for ethics and fairness
- Push the field forward

---

## 🙏 Thank You!

**Questions? Comments? Ideas for projects?**

**See you in the next lesson!**

---

**CCO 460 - Inteligencia Artificial**  
**Universidad del Sagrado Corazón**  
**End of Lesson 2**

---