# 🎬 Introduction to Recommendation Systems - PART 1
## User-Based Collaborative Filtering with MovieLens

**CCO 460 - Inteligencia Artificial**  
**Universidad del Sagrado Corazón**  
**Lesson 1 of 2**

---

## 📚 Today's Learning Objectives

By the end of this lesson, students will understand:

1. **What** recommendation systems are and their real-world impact
2. **How** to structure data in a user-item matrix
3. **How** user-based collaborative filtering works (conceptually and mathematically)
4. **Why** cosine similarity is effective for finding similar users
5. **How** to implement user-based CF with Python and scikit-learn

---

## 🎯 What We'll Build Today

A **user-based recommendation system** that finds users with similar taste and recommends movies they enjoyed.

**Example:**
- **User Jane** loves: "The Godfather", "Star Wars", "Jaws"
- **User Bob** has similar taste and also loved "Pulp Fiction"
- **Recommendation:** Suggest "Pulp Fiction" to Jane!

---

## 🗂️ Lesson 1 Outline

1. [Introduction: The Recommendation Problem](#1.-Introduction:-The-Recommendation-Problem)
2. [Setup: Loading MovieLens Data](#2.-Setup:-Loading-MovieLens-Data)
3. [Understanding the User-Item Matrix](#3.-Understanding-the-User-Item-Matrix)
4. [User-Based Collaborative Filtering](#4.-User-Based-Collaborative-Filtering)
5. [Summary & Next Steps](#5.-Summary-&-Next-Steps)

---

# 1. Introduction: The Recommendation Problem

## 🤔 The Problem

Imagine you run a video streaming service with **10,000 movies**. A new user signs up and watches just 5 movies. **How do you recommend the 6th movie?**

### Challenges:
- Too many choices → users feel overwhelmed
- Users don't know what they want → need discovery
- Each user is unique → need personalization

## 💡 The Solution: Collaborative Filtering

**Core Idea:** "People who agreed in the past tend to agree in the future"

### User-Based Approach: "People like you also liked..."

1. Find users with similar taste to yours
2. Look at movies they enjoyed
3. Recommend those movies to you

**Example:**
- Amazon: "Customers who bought this also bought..."
- Spotify: "Fans of this artist also listen to..."

## 🌟 Real-World Impact

| Platform | Commonly cited impact |
|----------|--------|
| **Netflix** | About 75% of viewing from recommendations |
| **Amazon** | About 35% of revenue driven by recommendations |
| **YouTube** | About 70% of watch time from recommendations |
| **Spotify** | About 40% of discoveries through personalized playlists |

**Teaching Note:** Ask students which recommendation systems they use daily. Discuss how these affect their choices.

---

# 2. Setup: Loading MovieLens Data

## 📦 About MovieLens Dataset

MovieLens is a **standard benchmark dataset** for learning and researching recommendation systems:

- **Created by:** GroupLens Research (University of Minnesota)
- **Contains:** Real movie ratings from real users
- **Files:** 
  - `movies.csv` → movie titles and genres
  - `ratings.csv` → user ratings (userId, movieId, rating, timestamp)

### Download:
- https://grouplens.org/datasets/movielens/
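- Use the **MovieLens Latest Small** (`ml-latest-small`) dataset: it has about 100,000 ratings and ships `movies.csv` and `ratings.csv`. The classic **MovieLens 100K** and **MovieLens 1M** releases use other file formats (`u.data`, `.dat`) and need different loading code.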

**Teaching Note:** Have students download the dataset before class. Walk through the directory structure.

---

In [None]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ Libraries imported successfully!")
print("\nLibraries we're using:")
print("  • pandas: Data manipulation")
print("  • numpy: Numerical operations")
print("  • sklearn: Cosine similarity calculation")
print("  • matplotlib/seaborn: Visualization")

In [None]:
# Load the MovieLens dataset
# Make sure movies.csv and ratings.csv are in the same directory as this notebook

try:
    movies = pd.read_csv('movies.csv')
    ratings = pd.read_csv('ratings.csv')
    print("✅ Data loaded successfully!")
    print(f"\n📊 Dataset Statistics:")
    print(f"   - Total movies: {len(movies):,}")
    print(f"   - Total ratings: {len(ratings):,}")
    print(f"   - Unique users: {ratings['userId'].nunique():,}")
    print(f"   - Ratings per user (avg): {len(ratings) / ratings['userId'].nunique():.1f}")
except FileNotFoundError as e:
    print(f"❌ Error: {e}")
    print("\n📥 Please download MovieLens dataset from:")
    print("   https://grouplens.org/datasets/movielens/")
    print("\n💡 Place movies.csv and ratings.csv in the same folder as this notebook")

## 🔍 Exploring the Data

**Teaching Note:** Walk through the data structure. Ask students:
- What does each row represent?
- What are the data types?
- What patterns do you notice?

---

In [None]:
# Examine the movies data
print("🎬 MOVIES DATA:")
print("=" * 80)
display(movies.head(10))

print("\nColumn information:")
movies.info()  # info() prints its report directly and returns None

In [None]:
# Examine the ratings data
print("⭐ RATINGS DATA:")
print("=" * 80)
display(ratings.head(10))

print("\nRating distribution:")
print(ratings['rating'].value_counts().sort_index())

# Visualize rating distribution
plt.figure(figsize=(10, 5))
ratings['rating'].hist(bins=10, edgecolor='black')
plt.xlabel('Rating', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of Movie Ratings', fontsize=14, fontweight='bold')
plt.grid(axis='y', alpha=0.3)
plt.show()

print(f"\n💡 Insight: The middle 50% of ratings fall between {ratings['rating'].quantile(0.25)} and {ratings['rating'].quantile(0.75)}")

In [None]:
# Merge the datasets
data = pd.merge(movies, ratings, on='movieId')

print(f"✅ Merged dataset created: {len(data):,} rows")
print("\nSample merged data:")
display(data.head())

print("\nNow we have: userId, movieId, title, genres, rating, timestamp in one table!")

---

# 3. Understanding the User-Item Matrix

## 📐 The Core Data Structure

The **user-item matrix** is the foundation of collaborative filtering:

```
          The Godfather | Star Wars | Jaws | Titanic | ...
User 1         5.0      |    4.5    |  3.0 |   ?     | ...
User 2         4.0      |    5.0    |  4.5 |   4.0   | ...
User 3         ?        |    4.0    |  ?   |   5.0   | ...
...
```

**Key Points:**
- Each **row** = one user's ratings across all movies
- Each **column** = one movie's ratings from all users
- **Missing values (?)** = user hasn't rated that movie
- **Goal:** Predict the missing values!

**Teaching Note:** Draw this on the board. Emphasize that most values are missing (sparse matrix).
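
To make the structure concrete, here is a minimal sketch that builds a tiny user-item matrix from a hand-made ratings table and measures how sparse it is. The three users, the titles, and the ratings are invented for illustration (they mirror the diagram above) and are not taken from MovieLens.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair.
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "title": ["The Godfather", "Star Wars", "Jaws",
              "The Godfather", "Star Wars", "Jaws", "Titanic",
              "Star Wars", "Titanic"],
    "rating": [5.0, 4.5, 3.0, 4.0, 5.0, 4.5, 4.0, 4.0, 5.0],
})

# Pivot to wide format: rows = users, columns = movies, NaN = not rated.
toy_matrix = toy_ratings.pivot_table(index="userId", columns="title", values="rating")
print(toy_matrix)

# Sparsity = share of user-movie cells that have no rating.
total_cells = toy_matrix.shape[0] * toy_matrix.shape[1]
filled_cells = int(toy_matrix.notna().sum().sum())
toy_sparsity = 1 - filled_cells / total_cells
print(f"Sparsity: {toy_sparsity:.2%}")
```

Even this toy matrix has empty cells; on a real catalog with thousands of movies, the empty share is far larger.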

---

In [None]:
# Create the user-item matrix
# pivot_table: rows=users, columns=movies, values=ratings
user_item_matrix = data.pivot_table(
    index='userId',
    columns='title',
    values='rating'
)

print(f"📊 User-Item Matrix Shape: {user_item_matrix.shape}")
print(f"   → {user_item_matrix.shape[0]:,} users (rows)")
print(f"   → {user_item_matrix.shape[1]:,} movies (columns)")
print(f"   → {user_item_matrix.shape[0] * user_item_matrix.shape[1]:,} total cells")

print("\n📋 Sample of the matrix:")
display(user_item_matrix.iloc[:5, :5])

In [None]:
# Calculate and visualize sparsity
total_cells = user_item_matrix.shape[0] * user_item_matrix.shape[1]
filled_cells = user_item_matrix.notna().sum().sum()
sparsity = 1 - (filled_cells / total_cells)

print(f"🔍 Matrix Sparsity Analysis:")
print(f"   - Total possible ratings: {total_cells:,}")
print(f"   - Actual ratings: {filled_cells:,}")
print(f"   - Missing ratings: {total_cells - filled_cells:,}")
print(f"   - Sparsity: {sparsity*100:.2f}%")
print(f"\n💡 This means users have rated only {(1-sparsity)*100:.2f}% of movies!")

# Visualize sparsity
plt.figure(figsize=(12, 8))
sample = user_item_matrix.iloc[:50, :50].notna().astype(int)
sns.heatmap(sample, cmap='YlOrRd', cbar=True, yticklabels=False, xticklabels=False)
plt.title('User-Item Matrix Sparsity Visualization\n(First 50 users × 50 movies)\nDark red = Has Rating, Light yellow = Missing', 
          fontsize=14, fontweight='bold')
plt.xlabel('Movies (sample of 50)', fontsize=12)
plt.ylabel('Users (sample of 50)', fontsize=12)
plt.tight_layout()
plt.show()

print("\n📌 Teaching Point: Sparsity is a MAJOR challenge in recommendation systems!")

In [None]:
# Fill missing values with 0 for computation
# Note: 0 means "not rated" not "rated as 0"
user_item_matrix_filled = user_item_matrix.fillna(0)

print("✅ Matrix prepared for computation (NaN → 0)")
print("\n⚠️  Important: 0 means 'not rated', not 'rated as zero'")
print("   We use 0 for mathematical operations only.")

---

# 4. User-Based Collaborative Filtering

## 🎯 Core Algorithm: "People like you also liked..."

### The 3-Step Process:

**Step 1: Find Similar Users**
- Compare target user with all other users
- Use similarity metric (cosine, Pearson, etc.)
- Identify users with similar taste

**Step 2: Get Their Preferences**
- Look at movies similar users rated highly
- Filter out movies target user already saw
- Weight by similarity score

**Step 3: Make Recommendations**
- Rank movies by weighted score
- Recommend top N movies

---

## 🤝 Conceptual Example: Recommending for User Jane

**Jane's Ratings:**
- The Godfather: ⭐⭐⭐⭐⭐ (5.0)
- Star Wars: ⭐⭐⭐⭐ (4.0)
- Jaws: ⭐⭐⭐⭐ (4.5)

**Step 1: Find Similar Users**

| User | Similarity | Their Ratings |
|------|-----------|---------------|
| Bob | 0.92 | Godfather(5), Star Wars(4.5), Jaws(4), **Pulp Fiction(5)** |
| Alice | 0.88 | Godfather(4.5), Star Wars(5), Jaws(4), **Shawshank(5)** |
| Charlie | 0.12 | Frozen(5), Titanic(4.5), Romance movies... |

**Step 2: Get Recommendations**
- Bob (0.92 similar) loved Pulp Fiction (5) → Score: 0.92 × 5 = 4.6
- Alice (0.88 similar) loved Shawshank (5) → Score: 0.88 × 5 = 4.4
- Ignore Charlie (0.12 = not similar)

**Step 3: Recommend**
1. Pulp Fiction (score: 4.6)
2. The Shawshank Redemption (score: 4.4)

**Teaching Note:** Walk through this example step by step. Ask students to identify the intuition.
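
The arithmetic in this example can be reproduced in a few lines of plain Python. This is only a sketch of the scoring idea: the similarity values and ratings are the made-up numbers from the table above, and the `min_similarity` cutoff of 0.5 is an assumed threshold used here to express "ignore Charlie".

```python
# Similarity of each neighbor to Jane (values from the table above).
similarity_to_jane = {"Bob": 0.92, "Alice": 0.88, "Charlie": 0.12}

# One movie Jane has not seen per neighbor, with that neighbor's rating.
unseen_favorites = {
    "Bob": ("Pulp Fiction", 5.0),
    "Alice": ("The Shawshank Redemption", 5.0),
    "Charlie": ("Frozen", 5.0),
}

min_similarity = 0.5  # assumed cutoff for "similar enough"; not part of the lesson

scores = {}
for neighbor, (title, rating) in unseen_favorites.items():
    sim = similarity_to_jane[neighbor]
    if sim < min_similarity:
        continue  # skip dissimilar users such as Charlie
    scores[title] = sim * rating  # score = similarity × rating

# Rank candidate movies from highest to lowest score.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for position, (title, score) in enumerate(ranked, start=1):
    print(f"{position}. {title} (score: {score:.1f})")
```

In the full implementation later in this lesson, several neighbors can contribute to the same movie, so their weighted ratings are summed and normalized rather than taken one at a time.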

---

## 📐 The Math: Cosine Similarity

**Formula:**
```
cos(θ) = (A · B) / (||A|| × ||B||)
```

Where:
- `A · B` = dot product of vectors A and B
- `||A||` = magnitude (length) of vector A
- `||B||` = magnitude (length) of vector B

### Intuitive Explanation:

**Imagine each user as a vector (arrow) in "movie rating space":**
- Similar users → arrows point in similar directions → small angle → high cosine
- Different users → arrows point in different directions → large angle → low cosine

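**Range:** -1 to 1 in general
- 1 = identical taste (same direction)
- 0 = no similarity (perpendicular)
- -1 = opposite taste (opposite directions)
- With non-negative ratings (as in our matrix, where unrated = 0), values always fall between 0 and 1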

**Why cosine?**
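- Ignores vector length: users whose ratings differ only by a constant scale factor still point in the same direction (a constant offset, such as always rating one star lower, is handled by mean-centering, as in Pearson correlation)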
- Works well with sparse data
- Computationally efficient

**See AI Explorer:** `/recommendation-systems` → Similarity Metrics → Cosine Similarity for detailed examples

**Teaching Note:** Draw vectors on board. Show how angle represents similarity.
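
As a check on the formula, the sketch below computes cosine similarity directly with NumPy. The rating vectors are invented for illustration. It also shows a point worth stating carefully: cosine is unchanged when one user's ratings are a scaled copy of another's, but a constant offset (a rater who is uniformly one star harsher) changes it slightly.

```python
import numpy as np

def cosine(a, b):
    """cos(θ) = (A · B) / (||A|| × ||B||) for two 1-D rating vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

user_a = np.array([5.0, 4.0, 1.0, 2.0])   # hypothetical ratings for 4 movies
user_b = user_a * 0.5                      # same pattern, proportionally lower
user_c = user_a - 1.0                      # same pattern, shifted down one star
user_d = np.array([1.0, 2.0, 5.0, 4.0])   # reversed preferences

print(f"a vs b (scaled):   {cosine(user_a, user_b):.3f}")
print(f"a vs c (shifted):  {cosine(user_a, user_c):.3f}")
print(f"a vs d (reversed): {cosine(user_a, user_d):.3f}")
```

Note that even the "reversed" pair gets a positive score, because all ratings are non-negative. scikit-learn's `cosine_similarity` computes this same quantity for every pair of rows in a matrix at once.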

---

In [None]:
# Calculate user-user similarity matrix
print("🔄 Calculating user similarities...")
print("   This compares every user with every other user")
print(f"   {user_item_matrix.shape[0]:,} users → {user_item_matrix.shape[0]**2:,} comparisons!")

user_similarity = cosine_similarity(user_item_matrix_filled)
user_similarity_df = pd.DataFrame(
    user_similarity,
    index=user_item_matrix.index,
    columns=user_item_matrix.index
)

print(f"\n✅ Similarity matrix created: {user_similarity_df.shape}")
print("\n📋 Sample similarities (User 1 with others):")
print(user_similarity_df.iloc[0].sort_values(ascending=False).head(10))

print("\n💡 Notice: User 1 has similarity of 1.0 with themselves (perfect match!)")

In [None]:
# Visualize similarity distribution
plt.figure(figsize=(12, 5))

# Get upper triangle to avoid counting each pair twice
similarities = user_similarity_df.values[np.triu_indices_from(user_similarity_df.values, k=1)]

plt.hist(similarities, bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Cosine Similarity', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of User-User Similarities', fontsize=14, fontweight='bold')
plt.axvline(similarities.mean(), color='red', linestyle='--', label=f'Mean: {similarities.mean():.3f}')
plt.legend()
plt.grid(axis='y', alpha=0.3)
plt.show()

print(f"📊 Similarity Statistics:")
print(f"   - Mean: {similarities.mean():.3f}")
print(f"   - Median: {np.median(similarities):.3f}")
print(f"   - Std Dev: {similarities.std():.3f}")
print(f"\n💡 Most users have low-to-moderate similarity (typical for sparse data)")

## 💻 Implementation: User-Based Recommendation Function

Now let's implement the complete algorithm:

---

In [None]:
def recommend_movies_user_based(user_id, matrix, similarity_matrix, n_recommendations=5, n_neighbors=10):
    """
    Generate movie recommendations using User-Based Collaborative Filtering.
    
    Algorithm:
    1. Find users most similar to target user
    2. Get movies they rated highly that target user hasn't seen
    3. Rank by weighted average (similarity × rating)
    
    Args:
        user_id: Target user ID
        matrix: User-item rating matrix (filled with 0s)
        similarity_matrix: User-user similarity scores
        n_recommendations: Number of movies to recommend
        n_neighbors: Number of similar users to consider
        
    Returns:
        pandas Series with movie titles and predicted scores
    """
    # Validation
    if user_id not in similarity_matrix.index:
        raise ValueError(f"User {user_id} not found in dataset")
    
    # Step 1: Get similarity scores for this user
    user_similarities = similarity_matrix.loc[user_id]
    
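    # Step 2: Find most similar users (excluding the user themselves)
    # Drop the target user by label, not by position: if another user also has
    # similarity 1.0, the target is not guaranteed to sort first.
    similar_users = user_similarities.drop(user_id).nlargest(n_neighbors).index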
    
    # Step 3: Get ratings from similar users
    similar_users_ratings = matrix.loc[similar_users]
    
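    # Step 4: Calculate weighted average (similarity × rating)
    # Note: unrated movies count as 0 here, so a movie scores high only when
    # several close neighbors rated it highly. Treat the result as a ranking
    # score, not as a predicted rating.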
    weights = user_similarities.loc[similar_users]
    weighted_sum = similar_users_ratings.T.dot(weights)
    
    # Normalize by sum of weights
    total_weight = weights.sum()
    if total_weight > 0:
        predicted_ratings = weighted_sum / total_weight
    else:
        return pd.Series(dtype=float)
    
    # Step 5: Remove movies already rated by target user
    user_rated_movies = matrix.loc[user_id]
    user_rated_movies = user_rated_movies[user_rated_movies > 0].index
    
    recommendations = predicted_ratings.drop(user_rated_movies, errors='ignore')
    
    # Step 6: Return top N recommendations
    return recommendations.sort_values(ascending=False).head(n_recommendations)

print("✅ Recommendation function defined!")
print("\nFunction signature:")
print("  recommend_movies_user_based(user_id, matrix, similarity_matrix, n_recommendations=5, n_neighbors=10)")

In [None]:
# Test the recommendation function
test_user = 1

print(f"🎬 Generating recommendations for User {test_user}...\n")

# First, show what this user has already rated
user_ratings = user_item_matrix.loc[test_user]
rated_movies = user_ratings[user_ratings.notna()].sort_values(ascending=False)

print(f"📋 User {test_user}'s Rating History ({len(rated_movies)} movies):")
print("="*80)
for i, (movie, rating) in enumerate(rated_movies.head(10).items(), 1):
    stars = '⭐' * int(rating)
    print(f"{i:2d}. {movie[:55]:<55} {stars} ({rating})")
if len(rated_movies) > 10:
    print(f"    ... and {len(rated_movies)-10} more movies")

# Generate recommendations
recommendations = recommend_movies_user_based(
    test_user,
    user_item_matrix_filled,
    user_similarity_df,
    n_recommendations=10,
    n_neighbors=10
)

print(f"\n\n🎯 Top 10 Recommendations for User {test_user}:")
print("="*80)
for i, (movie, score) in enumerate(recommendations.items(), 1):
    stars = '⭐' * int(score)
    print(f"{i:2d}. {movie[:55]:<55} {stars} (Score: {score:.3f})")

print("\n💡 These recommendations are based on users with similar taste!")

## 🎓 Understanding the Results

**Discussion Questions for Students:**

1. **Do the recommendations make sense?**
   - Look at what the user rated highly
   - Are the recommendations in similar genres?
   - Would you watch these movies?

2. **How does the number of neighbors (k) affect recommendations?**
   - Try k=5 vs k=20 vs k=50
   - What changes?

3. **What happens for users with few ratings?**
   - Try a user who rated only 2-3 movies
   - Are recommendations still good?
   - This is the "cold start problem"

**Teaching Activity:** Have students pick different users and analyze their recommendations.

---

In [None]:
# Interactive exploration: Try different users
def explore_recommendations(user_id, n_recs=5):
    """Helper function for exploring different users"""
    print(f"\n{'='*80}")
    print(f"USER {user_id} ANALYSIS")
    print(f"{'='*80}\n")
    
    # User's ratings
    user_ratings = user_item_matrix.loc[user_id]
    rated = user_ratings[user_ratings.notna()]
    
    print(f"📊 Profile: {len(rated)} movies rated, avg rating: {rated.mean():.2f}")
    print(f"\nTop 5 favorites:")
    for movie, rating in rated.sort_values(ascending=False).head(5).items():
        print(f"  • {movie[:50]} ({rating})")
    
    # Recommendations
    recs = recommend_movies_user_based(user_id, user_item_matrix_filled, 
                                       user_similarity_df, n_recommendations=n_recs)
    print(f"\n🎯 Recommendations:")
    for i, (movie, score) in enumerate(recs.items(), 1):
        print(f"  {i}. {movie[:50]} (score: {score:.3f})")

# Try a few different users
for user in [1, 50, 100]:
    try:
        explore_recommendations(user, n_recs=5)
    except (KeyError, ValueError):
        print(f"\nUser {user} not found in dataset")

---

# 5. Summary & Next Steps

## ✅ What We Learned Today

**Concepts:**
- Recommendation systems solve the problem of choice overload
- Collaborative filtering uses patterns in user behavior
- User-based CF: "People like you also liked..."

**Technical Skills:**
- Created a user-item matrix from rating data
- Calculated user-user similarity with cosine similarity
- Implemented complete user-based recommendation algorithm
- Generated personalized recommendations

**Key Insights:**
- Sparsity is a major challenge (most users rate only a small fraction of the catalog, as the sparsity figure we computed shows)
- Similarity metrics capture "similar taste"
- Weighting neighbors' ratings by similarity lets the closest matches count the most

---

## 🚀 Next Lesson Preview: Item-Based Collaborative Filtering

**Part 2 will cover:**

1. **Item-Based CF:** "Because you watched..."
   - Finding similar movies instead of similar users
   - Why it's faster and more scalable
   - When to use item-based vs user-based

2. **Comparison:**
   - User-based vs Item-based performance
   - Strengths and weaknesses of each
   - Hybrid approaches

3. **Advanced Topics:**
   - Connection to neural networks
   - Matrix factorization
   - Deep learning recommendations

4. **Evaluation:**
   - How do we measure if recommendations are good?
   - RMSE, MAE, Precision, Recall
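
As a small preview of the evaluation step, RMSE and MAE both compare predicted ratings with held-out true ratings, and RMSE penalizes large errors more heavily. The sketch below uses invented numbers and plain Python; it is an illustration of the two formulas, not the evaluation code for Part 2.

```python
import math

def mae(actual, predicted):
    """Mean absolute error between two equal-length rating lists."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating lists."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))

actual_ratings = [5.0, 3.0, 4.0, 1.0]      # hypothetical held-out ratings
predicted_ratings = [4.5, 3.5, 4.0, 3.0]   # hypothetical model predictions

print(f"MAE:  {mae(actual_ratings, predicted_ratings):.3f}")
print(f"RMSE: {rmse(actual_ratings, predicted_ratings):.3f}")
```

The single two-star miss on the last movie pulls RMSE above MAE, which is why RMSE is often preferred when large mistakes are especially costly.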

---

## 📝 Assignment for Next Class

**Complete the Canvas homework assignment:**
- Implement user-based CF from scratch
- Experiment with different similarity metrics
- Compare your results with today's implementation
- Answer reflection questions about the algorithm

**Preparation:**
- Review AI Explorer: `/recommendation-systems`
- Read about Pearson correlation and Jaccard similarity
- Think about: What are the limitations of user-based CF?

---

## 💬 Discussion Questions

Before we end, let's discuss:

1. **Ethics:** Can recommendation systems create "filter bubbles"? Should they?

2. **Business:** Why might a large service like Netflix prefer item-based over user-based?

3. **Technical:** What happens when a new user signs up (cold start)?

4. **Future:** How might AI/neural networks improve recommendations?

---

**See you in Part 2! 🎬**