# Collaborative Filtering Exercises

**AI Explorer - Recommendation Systems Module**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/avunque/ai-explorer-notebooks/blob/main/collaborative_filtering_exercises.ipynb)

---

## Welcome to the Collaborative Filtering Practice Exercises!

This notebook contains hands-on exercises to help you master both **user-based** and **item-based collaborative filtering**. You'll work with the MovieLens dataset to build, evaluate, and improve recommendation systems.

### Learning Objectives:
- Implement similarity metrics from scratch
- Build and compare user-based and item-based recommenders
- Handle the cold start problem
- Optimize recommendation quality
- Evaluate system performance

### Instructions:
1. Complete each exercise in order
2. Test your code on the provided examples
3. Compare your results with the expected outputs
4. Experiment with different parameters

💡 **Tip**: Try solving each exercise yourself before looking at hints!

## Setup: Load Data and Libraries

In [None]:
# Install required libraries
!pip install pandas numpy scikit-learn matplotlib seaborn -q

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✓ Libraries loaded!")

In [None]:
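# Download and load MovieLens dataset
# -nc skips the download if the zip already exists; -o overwrites without
# prompting, so the cell can be re-run safely.
!wget -q -nc http://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip -q -o ml-100k.zip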

ratings = pd.read_csv('ml-100k/u.data', sep='\t', 
                      names=['user_id', 'movie_id', 'rating', 'timestamp'])
movies = pd.read_csv('ml-100k/u.item', sep='|', encoding='latin-1',
                     usecols=[0, 1], names=['movie_id', 'title'])

user_item_matrix = ratings.pivot(index='user_id', columns='movie_id', values='rating')

print("✓ Dataset loaded!")
print(f"  Ratings: {len(ratings):,}")
print(f"  Users: {ratings['user_id'].nunique():,}")
print(f"  Movies: {ratings['movie_id'].nunique():,}")

---

# Part 1: Similarity Metrics

## Exercise 1.1: Implement Cosine Similarity

**Task**: Implement cosine similarity from scratch without using sklearn.

**Formula**:
$$\text{cosine}(u_1, u_2) = \frac{u_1 \cdot u_2}{\|u_1\| \|u_2\|}$$

**Requirements**:
- Only consider items both users have rated
- Handle edge cases (no common ratings, zero vectors)
- Return a value between -1 and 1

In [None]:
def cosine_similarity_manual(user1_ratings, user2_ratings):
    """
    Calculate cosine similarity between two users.
    
    Args:
        user1_ratings: Series of ratings for user 1
        user2_ratings: Series of ratings for user 2
    
    Returns:
        Cosine similarity score (float between -1 and 1)
    """
    # YOUR CODE HERE
    # Hint: Find common rated items, compute dot product and magnitudes
    pass

# Test your implementation
user1 = user_item_matrix.loc[1]
user2 = user_item_matrix.loc[2]
similarity = cosine_similarity_manual(user1, user2)
print(f"Cosine similarity between User 1 and User 2: {similarity:.4f}")

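# Expected: a value close to 1 (often above 0.9). All ratings are positive and
# only co-rated items are compared, so raw cosine rarely drops much lower.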

<details>
<summary>💡 Click for Solution</summary>

```python
def cosine_similarity_manual(user1_ratings, user2_ratings):
    # Find commonly rated items
    common_items = user1_ratings.dropna().index.intersection(user2_ratings.dropna().index)
    
    if len(common_items) == 0:
        return 0.0
    
    # Get ratings for common items
    u1_common = user1_ratings[common_items].values
    u2_common = user2_ratings[common_items].values
    
    # Calculate dot product
    dot_product = np.dot(u1_common, u2_common)
    
    # Calculate magnitudes
    magnitude1 = np.sqrt(np.sum(u1_common ** 2))
    magnitude2 = np.sqrt(np.sum(u2_common ** 2))
    
    if magnitude1 == 0 or magnitude2 == 0:
        return 0.0
    
    return dot_product / (magnitude1 * magnitude2)
```
</details>

## Exercise 1.2: Implement Pearson Correlation

**Task**: Implement Pearson correlation coefficient from scratch.

**Formula**:
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2} \sqrt{\sum (y_i - \bar{y})^2}}$$

**Why Pearson?** It mean-centers each user's ratings before comparing them, so it corrects for users who use the rating scale differently (one user rates mostly 4-5, another mostly 1-2).
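
To make the difference concrete, here is a small sketch on made-up ratings (the two users and their values are hypothetical). The two users disagree on every movie, yet raw cosine still reports them as nearly identical because all ratings are positive. It uses `np.corrcoef` as a reference value, so it does not replace the from-scratch implementation the exercise asks for.

```python
import numpy as np

# Hypothetical ratings from two users on the same four movies.
# They disagree on every movie: what one prefers, the other ranks lower.
user_a = np.array([5.0, 4.0, 5.0, 4.0])
user_b = np.array([4.0, 5.0, 4.0, 5.0])

# Raw cosine similarity: all ratings are positive, so the angle is small.
cosine = user_a @ user_b / (np.linalg.norm(user_a) * np.linalg.norm(user_b))

# Pearson correlation via NumPy (mean-centers each vector first).
pearson = np.corrcoef(user_a, user_b)[0, 1]

print(f"cosine:  {cosine:.4f}")   # cosine:  0.9756
print(f"pearson: {pearson:.4f}")  # pearson: -1.0000
```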

In [None]:
def pearson_correlation(user1_ratings, user2_ratings):
    """
    Calculate Pearson correlation between two users.
    
    Args:
        user1_ratings: Series of ratings for user 1
        user2_ratings: Series of ratings for user 2
    
    Returns:
        Pearson correlation coefficient (float between -1 and 1)
    """
    # YOUR CODE HERE
    # Hint: Center the ratings by subtracting means
    pass

# Test your implementation
correlation = pearson_correlation(user1, user2)
print(f"Pearson correlation between User 1 and User 2: {correlation:.4f}")

---

# Part 2: User-Based Collaborative Filtering

## Exercise 2.1: Find K-Nearest Neighbors

**Task**: Given a user, find their k most similar users.

**Requirements**:
- Use cosine similarity
- Return k users (excluding the target user themselves)
- Sort by similarity (highest first)

In [None]:
def find_k_similar_users(target_user_id, user_item_matrix, k=10):
    """
    Find k most similar users to target user.
    
    Args:
        target_user_id: ID of the target user
        user_item_matrix: User-item ratings matrix
        k: Number of similar users to return
    
    Returns:
        DataFrame with columns ['user_id', 'similarity']
    """
    # YOUR CODE HERE
    pass

# Test
similar_users = find_k_similar_users(1, user_item_matrix, k=5)
print("Top 5 similar users to User 1:")
print(similar_users)
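
If you get stuck, the following is one possible shape for the neighbor search, shown on a tiny made-up matrix rather than MovieLens. The helper names `cosine_on_common` and `top_k_neighbors` are illustrative, not part of the notebook, and this is a sketch rather than a reference solution.

```python
import numpy as np
import pandas as pd

def cosine_on_common(a, b):
    """Cosine similarity over items both users rated (0.0 if none)."""
    mask = a.notna() & b.notna()
    if not mask.any():
        return 0.0
    x, y = a[mask].to_numpy(), b[mask].to_numpy()
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom else 0.0

def top_k_neighbors(target_id, matrix, k=2):
    """Rank all other users by similarity to target_id and keep the top k."""
    target = matrix.loc[target_id]
    # Dropping the target row guarantees the user is never their own neighbor.
    sims = matrix.drop(index=target_id).apply(
        lambda row: cosine_on_common(target, row), axis=1
    )
    top = sims.sort_values(ascending=False).head(k)
    return pd.DataFrame({"user_id": top.index, "similarity": top.values})

# Toy matrix: rows are users, columns are movies, NaN means "not rated".
toy = pd.DataFrame(
    [[5, 4, np.nan, 1],
     [5, 5, 2, np.nan],
     [1, np.nan, 5, 5],
     [np.nan, 4, 1, 1]],
    index=[1, 2, 3, 4], columns=[10, 20, 30, 40], dtype=float,
)

neighbors = top_k_neighbors(1, toy, k=2)
print(neighbors)
```

Note that user 4 ranks first here with only two co-rated movies. On real data you may want to require a minimum overlap before trusting a similarity score.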

## Exercise 2.2: Predict User Rating

**Task**: Predict how a user would rate a specific movie using user-based CF.

**Formula**:
$$\hat{r}_{u,i} = \frac{\sum_{v \in N(u)} \text{sim}(u,v) \cdot r_{v,i}}{\sum_{v \in N(u)} |\text{sim}(u,v)|}$$

**Requirements**:
- Only use neighbors who have rated the item
- Weight ratings by similarity
- Clip result to valid range [1, 5]
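
As a quick check of the arithmetic, the formula can be worked through by hand for three made-up neighbors (the similarities and ratings below are hypothetical):

```python
import numpy as np

# Hypothetical neighbors of user u who have all rated movie i.
similarities = np.array([0.9, 0.6, 0.3])      # sim(u, v)
neighbor_ratings = np.array([5.0, 3.0, 4.0])  # r_{v,i}

# Weighted average: sum(sim * rating) / sum(|sim|)
numerator = np.sum(similarities * neighbor_ratings)  # 4.5 + 1.8 + 1.2 = 7.5
denominator = np.sum(np.abs(similarities))           # 1.8

# Guard against an empty or all-zero neighborhood, then clip to [1, 5].
prediction = None
if denominator > 0:
    prediction = float(np.clip(numerator / denominator, 1.0, 5.0))

print(prediction)  # about 4.1667
```

The most similar neighbor pulls the prediction toward their rating of 5, while the weaker neighbors temper it.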

In [None]:
def predict_user_based(user_id, movie_id, user_item_matrix, k=30):
    """
    Predict user's rating for a movie using user-based CF.
    
    Args:
        user_id: Target user ID
        movie_id: Target movie ID
        user_item_matrix: User-item ratings matrix
        k: Number of neighbors to consider
    
    Returns:
        Predicted rating (float) or None
    """
    # YOUR CODE HERE
    pass

# Test
pred = predict_user_based(1, 50, user_item_matrix)
actual = user_item_matrix.loc[1, 50]
print(f"Predicted: {pred:.2f}" if pred is not None else "Predicted: N/A")
print(f"Actual: {actual:.0f}" if not pd.isna(actual) else "Actual: Not rated")

---

# Part 3: Item-Based Collaborative Filtering

## Exercise 3.1: Calculate Adjusted Cosine Similarity

**Task**: Implement adjusted cosine similarity for items.

**Why adjusted?** Different users have different rating scales. We subtract each user's mean rating before computing similarity.

**Formula**:
$$\text{sim}(i, j) = \frac{\sum_{u} (r_{u,i} - \bar{r}_u)(r_{u,j} - \bar{r}_u)}{\sqrt{\sum_{u} (r_{u,i} - \bar{r}_u)^2} \sqrt{\sum_{u} (r_{u,j} - \bar{r}_u)^2}}$$
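
The sketch below applies this formula to a small made-up matrix so the effect of subtracting user means is visible. The data and the function name `adjusted_cosine` are hypothetical, and the signature differs from the exercise stub on purpose.

```python
import numpy as np
import pandas as pd

# Toy user-item matrix (hypothetical data); NaN means "not rated".
toy = pd.DataFrame(
    {"A": [5, 2, 4, np.nan], "B": [4, 1, 5, 3], "C": [1, 2, np.nan, 5]},
    index=["u1", "u2", "u3", "u4"], dtype=float,
)

# Each user's mean over the items they actually rated.
toy_user_means = toy.mean(axis=1)

def adjusted_cosine(item_i, item_j, matrix, means):
    # Keep only users who rated both items.
    both = matrix[item_i].notna() & matrix[item_j].notna()
    if not both.any():
        return 0.0
    # Subtract each user's own mean from their two ratings.
    di = (matrix.loc[both, item_i] - means[both]).to_numpy()
    dj = (matrix.loc[both, item_j] - means[both]).to_numpy()
    denom = np.linalg.norm(di) * np.linalg.norm(dj)
    return float(di @ dj / denom) if denom else 0.0

sim_ab = adjusted_cosine("A", "B", toy, toy_user_means)
sim_ac = adjusted_cosine("A", "C", toy, toy_user_means)
print(f"A vs B: {sim_ab:.3f}")  # positive: users tend to agree on A and B
print(f"A vs C: {sim_ac:.3f}")  # negative: users who like A rate C below their mean
```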

In [None]:
def adjusted_cosine_similarity(item1_ratings, item2_ratings, user_means):
    """
    Calculate adjusted cosine similarity between two items.
    
    Args:
        item1_ratings: Series of ratings for item 1 (indexed by user)
        item2_ratings: Series of ratings for item 2 (indexed by user)
        user_means: Series of mean ratings for each user
    
    Returns:
        Adjusted cosine similarity (float)
    """
    # YOUR CODE HERE
    # Hint: Find common users, subtract their means, then compute cosine
    pass

# Test
user_means = user_item_matrix.mean(axis=1)
movie1 = user_item_matrix[50]  # Star Wars
movie2 = user_item_matrix[172]  # Empire Strikes Back

sim = adjusted_cosine_similarity(movie1, movie2, user_means)
print(f"Adjusted cosine similarity: {sim:.4f}")
# Expected: High similarity (> 0.5) as they're related movies

## Exercise 3.2: Predict Rating (Item-Based)

**Task**: Predict a user's rating using item-based CF.

**Approach**:
1. Find items similar to the target item
2. Among those, find items the user has rated
3. Weight by similarity to predict rating

**Formula**:
$$\hat{r}_{u,i} = \frac{\sum_{j \in N(i,u)} \text{sim}(i,j) \cdot r_{u,j}}{\sum_{j \in N(i,u)} |\text{sim}(i,j)|}$$
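
Here $N(i,u)$ is the set of the k items most similar to $i$ among those user $u$ has rated. A minimal sketch of that selection step and the weighted average, using made-up similarities and ratings (all values are hypothetical):

```python
import pandas as pd

# Hypothetical similarities between the target item and four other items.
sim_to_target = pd.Series({"a": 0.8, "b": 0.5, "c": 0.4, "d": -0.1})

# The user's known ratings; item "b" is unrated, so it is absent here.
user_ratings = pd.Series({"a": 4.0, "c": 2.0, "d": 5.0})

k = 2
# N(i, u): the k items most similar to the target among those the user rated.
rated = sim_to_target[sim_to_target.index.intersection(user_ratings.index)]
neighbors = rated.sort_values(ascending=False).head(k)

# Similarity-weighted average of the user's ratings on those neighbors.
weights = neighbors.abs().sum()
prediction = None
if weights > 0:
    raw = (neighbors * user_ratings[neighbors.index]).sum() / weights
    prediction = float(min(max(raw, 1.0), 5.0))  # clip to the rating scale

print(list(neighbors.index), prediction)
```

Item "b" is more similar to the target than "c", but it cannot contribute because the user never rated it.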

In [None]:
def predict_item_based(user_id, movie_id, user_item_matrix, item_similarity_matrix, k=30):
    """
    Predict user's rating for a movie using item-based CF.
    
    Args:
        user_id: Target user ID
        movie_id: Target movie ID
        user_item_matrix: User-item ratings matrix
        item_similarity_matrix: Pre-computed item similarity matrix
        k: Number of similar items to consider
    
    Returns:
        Predicted rating (float) or None
    """
    # YOUR CODE HERE
    pass

# First, compute item similarity matrix
user_means = user_item_matrix.mean(axis=1)
user_item_centered = user_item_matrix.sub(user_means, axis=0).fillna(0)
item_sim_matrix = pd.DataFrame(
    cosine_similarity(user_item_centered.T),
    index=user_item_matrix.columns,
    columns=user_item_matrix.columns
)

# Test
pred = predict_item_based(1, 100, user_item_matrix, item_sim_matrix)
print(f"Item-based prediction: {pred:.2f}" if pred is not None else "Item-based prediction: N/A")

---

# Part 4: Evaluation and Comparison

## Exercise 4.1: Calculate RMSE and MAE

**Task**: Implement evaluation metrics to compare prediction accuracy.

**Metrics**:
- **RMSE** (Root Mean Square Error): Penalizes large errors
- **MAE** (Mean Absolute Error): Average error magnitude

**Formulas**:
$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{r}_i - r_i)^2}$$
$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|\hat{r}_i - r_i|$$
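
To check your implementation, the two formulas can be evaluated directly with NumPy on the same five sample values used in the test cell for this exercise:

```python
import numpy as np

predicted = np.array([4.2, 3.8, 5.0, 2.5, 3.9])
actual = np.array([4.0, 4.0, 5.0, 3.0, 4.0])

errors = predicted - actual           # [0.2, -0.2, 0.0, -0.5, -0.1]
rmse = np.sqrt(np.mean(errors ** 2))  # sqrt(0.34 / 5)
mae = np.mean(np.abs(errors))         # 1.0 / 5

print(f"RMSE: {rmse:.4f}")  # RMSE: 0.2608
print(f"MAE:  {mae:.4f}")   # MAE:  0.2000
```

RMSE exceeds MAE here because the single error of 0.5 is squared before averaging.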

In [None]:
def evaluate_predictions(predictions, actuals):
    """
    Calculate RMSE and MAE for predictions.
    
    Args:
        predictions: List/array of predicted ratings
        actuals: List/array of actual ratings
    
    Returns:
        Dictionary with 'rmse' and 'mae'
    """
    # YOUR CODE HERE
    pass

# Test with sample data
sample_preds = [4.2, 3.8, 5.0, 2.5, 3.9]
sample_actual = [4.0, 4.0, 5.0, 3.0, 4.0]

metrics = evaluate_predictions(sample_preds, sample_actual)
print(f"RMSE: {metrics['rmse']:.4f}")
print(f"MAE: {metrics['mae']:.4f}")

## Exercise 4.2: Compare User-Based vs Item-Based

**Task**: Evaluate both approaches on a test set and compare their performance.

**Steps**:
1. Split data into train (80%) and test (20%)
2. Build models on training data
3. Make predictions on test set
4. Calculate RMSE, MAE, and coverage for both methods
5. Compare results
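
Coverage is not defined elsewhere in this notebook. The sketch below assumes it means the fraction of test pairs for which a method returns a prediction at all (that is, not `None`). The function name `summarize` and the sample values are hypothetical.

```python
import numpy as np

def summarize(predictions, actuals):
    """RMSE, MAE, and coverage for predictions that may contain None."""
    # Keep only the pairs where the model produced a prediction.
    pairs = [(p, a) for p, a in zip(predictions, actuals) if p is not None]
    coverage = len(pairs) / len(actuals) if len(actuals) else 0.0
    if not pairs:
        return {"rmse": None, "mae": None, "coverage": coverage}
    preds, acts = map(np.array, zip(*pairs))
    err = preds - acts
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "coverage": coverage,
    }

# Hypothetical outputs: None marks a pair the model could not predict.
result = summarize([4.0, None, 3.0, 5.0], [5.0, 2.0, 3.0, 4.0])
print(result)
```

Reporting coverage alongside the error metrics matters because a method can look accurate simply by declining to predict the hard cases.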

In [None]:
# Split data
train_data, test_data = train_test_split(ratings, test_size=0.2, random_state=42)
train_matrix = train_data.pivot(index='user_id', columns='movie_id', values='rating')

# Build item similarity on training data
train_user_means = train_matrix.mean(axis=1)
train_centered = train_matrix.sub(train_user_means, axis=0).fillna(0)
train_item_sim = pd.DataFrame(
    cosine_similarity(train_centered.T),
    index=train_matrix.columns,
    columns=train_matrix.columns
)

# YOUR CODE HERE
# Make predictions on test set using both methods
# Calculate metrics for both
# Compare and visualize results

print("\n📊 Performance Comparison:")
print("="*50)
# Print your comparison results here

---

# Part 5: Advanced Challenges

## Exercise 5.1: Handle Cold Start Problem

**Task**: Implement a hybrid approach for new users with < 5 ratings.

**Strategy**:
1. If user has < 5 ratings, use popularity-based recommendations
2. Otherwise, use collaborative filtering

**Popularity**: Recommend the movies with the highest average rating among those with at least 50 ratings.
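
One way to build the popularity fallback is a `groupby` with a minimum-count filter. The sketch below uses a tiny made-up table, and the function name `popular_movies` is illustrative. The threshold is lowered to 3 only so the toy data has something to show.

```python
import pandas as pd

# Hypothetical ratings table with the same column names as `ratings`.
toy_ratings = pd.DataFrame({
    "movie_id": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "rating":   [5, 4, 5, 5, 5, 3, 4, 4, 3],
})

def popular_movies(ratings_df, n=10, min_count=50):
    """Top-n movies by mean rating among those with at least min_count ratings."""
    stats = ratings_df.groupby("movie_id")["rating"].agg(["mean", "count"])
    eligible = stats[stats["count"] >= min_count]
    return eligible.sort_values("mean", ascending=False).head(n)

# min_count is lowered to 3 only because the toy table is tiny.
top = popular_movies(toy_ratings, n=2, min_count=3)
print(top)
```

Movie 2 has a perfect average but only two ratings, so the count filter excludes it. That is the purpose of the minimum: it keeps a handful of enthusiastic ratings from dominating the list.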

In [None]:
def recommend_with_cold_start(user_id, user_item_matrix, item_similarity_matrix, 
                              movies_df, n=10, min_ratings=5):
    """
    Recommend movies handling cold start problem.
    
    Args:
        user_id: Target user ID
        user_item_matrix: User-item matrix
        item_similarity_matrix: Item similarity matrix
        movies_df: Movie information
        n: Number of recommendations
        min_ratings: Threshold for cold start
    
    Returns:
        DataFrame with recommendations
    """
    # YOUR CODE HERE
    # Check if user is "cold" (few ratings)
    # If cold: return popular movies
    # If not cold: use collaborative filtering
    pass

# Test with a new user (simulate by using only first 3 ratings)
# YOUR TEST CODE HERE

## Exercise 5.2: Optimize K Parameter

**Task**: Find the optimal value of k (number of neighbors) for your recommendation system.

**Approach**:
1. Try different values of k (e.g., 5, 10, 20, 30, 50, 100)
2. Evaluate RMSE for each k on validation set
3. Plot k vs RMSE
4. Choose k with lowest RMSE
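
The sweep itself is a small loop. The sketch below separates it from the evaluation logic by taking a callable. `sweep_k` and `fake_rmse` are hypothetical names, and the curve produced by `fake_rmse` is made up purely to exercise the loop. In the exercise, the callable would compute validation RMSE for a given k.

```python
def sweep_k(evaluate_rmse, k_values=(5, 10, 20, 30, 50, 100)):
    """Evaluate each k with the supplied callable and return (results, best_k)."""
    results = {k: evaluate_rmse(k) for k in k_values}
    best_k = min(results, key=results.get)  # k with the lowest RMSE
    return results, best_k

# Stand-in evaluator with a made-up U-shaped curve, for illustration only.
def fake_rmse(k):
    return 1.0 + ((k - 30) / 100) ** 2

results, best_k = sweep_k(fake_rmse)
print(best_k)  # 30
```

To plot the curve, `plt.plot(list(results), list(results.values()), marker="o")` is enough. Choose k on the validation set only, and keep the test set untouched until the final evaluation.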

In [None]:
def find_optimal_k(train_matrix, valid_data, k_values=(5, 10, 20, 30, 50, 100)):
    """
    Find optimal k by evaluating on validation set.
    
    Args:
        train_matrix: Training user-item matrix
        valid_data: Validation DataFrame (user_id, movie_id, rating)
        k_values: Sequence of k values to try
    
    Returns:
        Dictionary mapping k to RMSE
    """
    # YOUR CODE HERE
    pass

# Split data into train, validation, test
train_val, test = train_test_split(ratings, test_size=0.2, random_state=42)
train, valid = train_test_split(train_val, test_size=0.2, random_state=42)

# YOUR CODE HERE
# Find optimal k and visualize results

## Exercise 5.3: Diversity Metric

**Task**: Implement a diversity metric to measure how varied your recommendations are.

**Diversity Formula**:
$$\text{Diversity} = \frac{1}{|R|(|R|-1)/2} \sum_{i,j \in R,\, i < j} (1 - \text{sim}(i,j))$$

Where R is the set of recommended items.

**Goal**: Balance accuracy with diversity (avoid filter bubble)
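
A worked example on three made-up items may help. Each unordered pair is counted once, which is what the normalization by $|R|(|R|-1)/2$ assumes. The similarity values and the function name `intra_list_diversity` are hypothetical.

```python
from itertools import combinations
import pandas as pd

# Hypothetical item-item similarity matrix for three recommended items.
items = ["a", "b", "c"]
sim = pd.DataFrame(
    [[1.0, 0.8, 0.2],
     [0.8, 1.0, 0.5],
     [0.2, 0.5, 1.0]],
    index=items, columns=items,
)

def intra_list_diversity(recommended, similarity):
    """Mean of (1 - sim) over all unordered pairs of recommended items."""
    pairs = list(combinations(recommended, 2))
    if not pairs:
        return 0.0  # fewer than two items: no pairs to compare
    total = sum(1.0 - similarity.loc[i, j] for i, j in pairs)
    return float(total / len(pairs))

diversity = intra_list_diversity(items, sim)
print(round(diversity, 4))  # 0.5, i.e. (0.2 + 0.8 + 0.5) / 3
```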

In [None]:
def calculate_diversity(recommended_items, item_similarity_matrix):
    """
    Calculate diversity of recommendations.
    
    Args:
        recommended_items: List of recommended movie IDs
        item_similarity_matrix: Item similarity matrix
    
    Returns:
        Diversity score (0 = every pair has similarity 1; higher = more varied)
    """
    # YOUR CODE HERE
    # Calculate average dissimilarity between all pairs
    pass

# Test
# Generate recommendations and measure diversity
# Compare diversity of user-based vs item-based methods

---

# Bonus Challenge: Build a Complete Recommender System

## Exercise 6: Complete Recommendation Pipeline

**Task**: Build a complete recommendation system class that:
1. Trains on a dataset
2. Handles both user-based and item-based methods
3. Manages cold start
4. Provides explanations for recommendations
5. Evaluates performance

**Bonus**: Add methods to:
- Update with new ratings (online learning)
- Generate explanations ("Because you liked X")
- Compute confidence scores

In [None]:
class CollaborativeFilteringRecommender:
    """
    Complete collaborative filtering recommendation system.
    """
    
    def __init__(self, method='item-based', k=30):
        """
        Initialize recommender.
        
        Args:
            method: 'user-based' or 'item-based'
            k: Number of neighbors
        """
        self.method = method
        self.k = k
        # YOUR CODE HERE - Add necessary attributes
    
    def fit(self, ratings_df):
        """
        Train the recommender on ratings data.
        """
        # YOUR CODE HERE
        pass
    
    def predict(self, user_id, item_id):
        """
        Predict rating for user-item pair.
        """
        # YOUR CODE HERE
        pass
    
    def recommend(self, user_id, n=10, with_explanations=False):
        """
        Generate top-N recommendations.
        """
        # YOUR CODE HERE
        pass
    
    def explain_recommendation(self, user_id, item_id):
        """
        Explain why an item was recommended.
        """
        # YOUR CODE HERE
        pass
    
    def evaluate(self, test_data):
        """
        Evaluate on test set.
        """
        # YOUR CODE HERE
        pass

# Test your recommender
recommender = CollaborativeFilteringRecommender(method='item-based', k=30)
recommender.fit(train_data)
recommendations = recommender.recommend(user_id=1, n=10, with_explanations=True)
print(recommendations)

---

# Reflection Questions

After completing the exercises, consider:

1. **Performance**: Which method (user-based vs item-based) worked better? Why?

2. **Scalability**: How would these approaches scale to millions of users/items?

3. **Cold Start**: What strategies did you find most effective for new users?

4. **Diversity vs Accuracy**: Did you notice a trade-off between accuracy and diversity?

5. **Real-World Application**: How would you adapt these techniques for:
   - E-commerce product recommendations?
   - Music streaming playlists?
   - Social media friend suggestions?

6. **Improvements**: What additional features would make recommendations better?
   - Temporal dynamics (recent preferences matter more)?
   - Content features (genre, actors, directors)?
   - Contextual information (time of day, device, location)?

---

## 🎉 Congratulations!

You've completed the collaborative filtering exercises. You now understand:
- ✅ Similarity metrics and their implementations
- ✅ User-based and item-based collaborative filtering
- ✅ Evaluation techniques and metrics
- ✅ Handling real-world challenges (cold start, sparsity)
- ✅ Building complete recommendation systems

### Next Steps:
1. **Matrix Factorization**: Learn SVD and ALS for scalable recommendations
2. **Deep Learning**: Explore neural collaborative filtering
3. **Production Systems**: Study how to deploy recommenders at scale
4. **A/B Testing**: Learn how to evaluate recommendations with real users

Keep learning and building! 🚀