### Collaborative Filtering: Overview and Motivation

**Collaborative filtering** is a recommendation approach that leverages the **collective preferences and behaviors of users** to make personalized recommendations. Unlike content-based filtering, which focuses on item features, collaborative filtering relies on patterns of user-item interactions.

---

#### Why Choose Collaborative Filtering?

- **Discovers latent patterns**: Can detect preferences and similarities that aren't captured in explicit item features.
- **No content analysis required**: Works without needing detailed feature engineering of items.
- **Serendipity**: Can recommend unexpected but relevant items beyond a user's typical content profile.
- **Community wisdom**: Leverages the collective intelligence of the user base.
- **Domain agnostic**: Can work across different types of content without domain-specific feature engineering.

---

#### Approaches to Collaborative Filtering

1. **Memory-Based Approaches**
   - **User-User**: Recommends items liked by users similar to the target user.
   - **Item-Item**: Recommends items similar to those the user already liked.

2. **Model-Based Approaches**
   - **Matrix Factorization**: Decomposes user-item interaction matrix into latent factors.
   - **Neural Networks**: Uses deep learning to model complex user-item interactions.

---

#### Limitations (to be addressed in hybrid systems)

- **Cold-start problem**: Difficult to recommend to new users or items with few interactions.
- **Sparsity issues**: Most users interact with only a small fraction of available items.
- **Popularity bias**: Tendency to recommend popular items over niche content.

---

In summary, collaborative filtering excels when:
- large interaction datasets are available,
- understanding the social/community dynamics of consumption is important,
- and discovering non-obvious recommendations is valued.


In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import json
from tqdm import tqdm
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import ndcg_score
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

DATA_PATH = 'data_final_project/KuaiRec 2.0/data/'


### Load Data

### Dataset Description

The recommendation system is based on several CSV files that represent user interactions and video metadata:

---

#### `big_matrix.csv`
- Contains historical **user-video interactions**.
- Used for **training** user profiles.
- Includes fields like: `user_id`, `video_id`, `watch_ratio`, `play_duration`, etc.

#### `small_matrix.csv`
- Contains **test interactions**, recorded after the training period.
- Used to **evaluate** recommendation quality.

---

#### `item_daily_features.csv`
- Provides **daily aggregated statistics per video**.
- Used for additional analysis but not directly in collaborative filtering.

#### `item_categories.csv` and `kuairec_caption_category.csv`
- Contain metadata that may be used for hybrid approaches but is not central to pure collaborative filtering.

---

For collaborative filtering, we primarily focus on the **interaction matrices** (`big_matrix.csv` and `small_matrix.csv`), as the approach relies on patterns of user-item interactions rather than content features.

In [2]:
print("Loading datasets...")
		
# Load interaction data
interactions_train = pd.read_csv(os.path.join(DATA_PATH, "big_matrix.csv"))
interactions_test = pd.read_csv(os.path.join(DATA_PATH, "small_matrix.csv"))

# Load video metadata (may be used in hybrid approaches or analysis)
item_daily_features = pd.read_csv(os.path.join(DATA_PATH, "item_daily_features.csv"))

print(f"Training interactions: {interactions_train.shape}")
print(f"Test interactions: {interactions_test.shape}")

Loading datasets...
Training interactions: (12530806, 8)
Test interactions: (4676570, 8)


### Data Preprocessing

In [3]:
# Show differences before and after dropna and drop_duplicates

def preprocess(df, name):
    """Remove missing values and duplicates, and report cleaning stats"""
    before = df.shape[0]
    # Clean once and reuse the result for both the report and the return value
    cleaned = df.dropna().drop_duplicates()
    after = cleaned.shape[0]
    print(f"{name}: {before} rows -> {after} rows ({before - after} removed)")
    return cleaned

interactions_train = preprocess(interactions_train, "interactions_train")
interactions_test = preprocess(interactions_test, "interactions_test")
item_daily_features = preprocess(item_daily_features, "item_daily_features")

interactions_train: 12530806 rows -> 11564987 rows (965819 removed)
interactions_test: 4676570 rows -> 4494578 rows (181992 removed)
item_daily_features: 343341 rows -> 239968 rows (103373 removed)


In [4]:
# Clamp watch_ratio to the range [0, 2.0]
# A watch_ratio > 1 can indicate replays, so values above 1 are kept but capped at 2.0 to limit extreme values
interactions_train['watch_ratio_clamped'] = np.clip(interactions_train['watch_ratio'], 0, 2.0)
interactions_test['watch_ratio_clamped'] = np.clip(interactions_test['watch_ratio'], 0, 2.0)

# Threshold on the clamped watch_ratio above which an interaction counts as positive (chosen from data analysis)
POSITIVE_THRESHOLD = 0.7
interactions_train['positive_interaction'] = (interactions_train['watch_ratio_clamped'] >= POSITIVE_THRESHOLD).astype(int)
interactions_test['positive_interaction'] = (interactions_test['watch_ratio_clamped'] >= POSITIVE_THRESHOLD).astype(int)

print(f"Positive interactions in train set: {interactions_train['positive_interaction'].sum()} / {len(interactions_train)} ({interactions_train['positive_interaction'].mean()*100:.2f}%)")
print(f"Positive interactions in test set: {interactions_test['positive_interaction'].sum()} / {len(interactions_test)} ({interactions_test['positive_interaction'].mean()*100:.2f}%)")

print("Data loaded successfully!")

Positive interactions in train set: 5966291 / 11564987 (51.59%)
Positive interactions in test set: 2536880 / 4494578 (56.44%)
Data loaded successfully!


### Building User-Item Interaction Matrices

For collaborative filtering, we need to construct **user-item matrices** that represent user interactions with videos. We'll create two types of matrices:

1. **Binary Matrix**: Entries are 1 if the user had a positive interaction with the video (watch_ratio ≥ 0.7) and 0 otherwise.
2. **Rating Matrix**: Entries are the clamped watch_ratio values (between 0 and 2.0), representing the strength of interest.

These matrices will be used for user-user similarity, item-item similarity, and matrix factorization approaches.
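
As a small illustration of the construction described above, here is a minimal sketch on made-up data. The frame `toy` and its values are hypothetical, and collapsing repeated (user, video) pairs with the maximum `watch_ratio` is an assumption made for this example, not something the notebook does. The sketch also shows that the `csr_matrix` constructor sums duplicate coordinates, which matters whenever a user has watched the same video more than once.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy interaction log (hypothetical values); user 0 watched video 1 twice.
toy = pd.DataFrame({
    "user_id":     [0, 0, 0, 1, 1, 2],
    "video_id":    [0, 1, 1, 1, 2, 0],
    "watch_ratio": [0.9, 0.4, 1.2, 0.8, 0.1, 1.5],
})

# Collapse repeated (user, video) pairs first. Taking the maximum is one
# possible choice (an assumption of this sketch); the mean would also work.
deduped = toy.groupby(["user_id", "video_id"], as_index=False)["watch_ratio"].max()
rows = deduped["user_id"].to_numpy()
cols = deduped["video_id"].to_numpy()
ratings = deduped["watch_ratio"].to_numpy()

# Rating matrix: one stored value per (user, video) pair.
toy_rating = csr_matrix((ratings, (rows, cols)), shape=(3, 3))

# Binary matrix: store only the positive pairs, so nnz counts positives.
positive = ratings >= 0.7
toy_binary = csr_matrix(
    (np.ones(int(positive.sum()), dtype=np.int8), (rows[positive], cols[positive])),
    shape=(3, 3),
)

# For comparison: without deduplication the two (0, 1) entries are summed.
summed = csr_matrix(
    (toy["watch_ratio"].to_numpy(),
     (toy["user_id"].to_numpy(), toy["video_id"].to_numpy())),
    shape=(3, 3),
)

print(toy_rating.toarray())
print(toy_binary.toarray())
print("Summed duplicate entry:", summed[0, 1])
```

Storing only the positive pairs keeps `nnz` equal to the number of positive interactions, whereas passing explicit zeros to the constructor would count them as stored entries.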

In [5]:
def create_interaction_matrices():
    """Create user-item interaction matrices for collaborative filtering"""
    print("Creating user-item interaction matrices...")
    
    # Create unique user and item mappings
    unique_users = np.union1d(interactions_train['user_id'].unique(), interactions_test['user_id'].unique())
    unique_videos = np.union1d(interactions_train['video_id'].unique(), interactions_test['video_id'].unique())
    
    user_to_idx = {user_id: idx for idx, user_id in enumerate(unique_users)}
    video_to_idx = {video_id: idx for idx, video_id in enumerate(unique_videos)}
    idx_to_user = {idx: user_id for user_id, idx in user_to_idx.items()}
    idx_to_video = {idx: video_id for video_id, idx in video_to_idx.items()}
    
    # Calculate matrix dimensions
    n_users = len(unique_users)
    n_videos = len(unique_videos)
    print(f"Matrix dimensions: {n_users} users x {n_videos} videos")
    
    # Create sparse matrices for training data
    train_rows = interactions_train['user_id'].map(user_to_idx).values
    train_cols = interactions_train['video_id'].map(video_to_idx).values
    
    # Binary matrix (positive interactions)
    # Note: csr_matrix sums the values of repeated (user, video) pairs, so a pair that
    # appears several times in interactions_train can end up with a value above 1 here
    binary_data = interactions_train['positive_interaction'].values
    binary_matrix = csr_matrix((binary_data, (train_rows, train_cols)), shape=(n_users, n_videos))
    
    # Rating matrix (watch_ratio values); repeated pairs are summed here as well
    rating_data = interactions_train['watch_ratio_clamped'].values
    rating_matrix = csr_matrix((rating_data, (train_rows, train_cols)), shape=(n_users, n_videos))
    
    # nnz counts stored entries, including explicit zeros, so both matrices report the same count
    print(f"Binary matrix: {binary_matrix.nnz} non-zero entries (density: {binary_matrix.nnz / (n_users * n_videos):.6f})")
    print(f"Rating matrix: {rating_matrix.nnz} non-zero entries (density: {rating_matrix.nnz / (n_users * n_videos):.6f})")
    
    return binary_matrix, rating_matrix, user_to_idx, video_to_idx, idx_to_user, idx_to_video

binary_matrix, rating_matrix, user_to_idx, video_to_idx, idx_to_user, idx_to_video = create_interaction_matrices()

Creating user-item interaction matrices...
Matrix dimensions: 7176 users x 10728 videos
Binary matrix: 10300969 non-zero entries (density: 0.133806)
Rating matrix: 10300969 non-zero entries (density: 0.133806)


### User-User Collaborative Filtering

User-User collaborative filtering works by finding users with similar preferences and recommending items that these similar users have enjoyed.

Steps:
1. Calculate similarity between users based on their interaction patterns
2. For each user, identify the most similar users
3. Recommend items that similar users have liked but the target user hasn't seen yet

This approach is intuitive but can be computationally expensive with large user bases.
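
The three steps above can be sketched on a toy matrix. This is a minimal illustration, not the notebook's implementation: `toy_binary` and `toy_user_user` are hypothetical names, and the sketch computes all similarities in a single `cosine_similarity` call instead of comparing users one at a time.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Toy binary matrix (hypothetical): 4 users x 5 videos.
toy_binary = csr_matrix(np.array([
    [1, 1, 0, 0, 0],   # target user
    [1, 1, 1, 0, 0],   # very similar to the target
    [1, 0, 0, 1, 0],   # somewhat similar
    [0, 0, 0, 0, 1],   # no overlap with the target
]))

def toy_user_user(matrix, user_idx, n_neighbors=2, top_n=3):
    # Step 1: similarity of the target user to every user, in one call.
    sims = cosine_similarity(matrix[user_idx], matrix).ravel()
    sims[user_idx] = 0.0  # the target is never their own neighbor

    # Step 2: keep the most similar users that have positive similarity.
    neighbors = np.argsort(sims)[::-1][:n_neighbors]
    neighbors = neighbors[sims[neighbors] > 0]

    # Step 3: similarity-weighted vote over the neighbors' videos,
    # ignoring videos the target user has already interacted with.
    scores = np.asarray(matrix[neighbors].T @ sims[neighbors]).ravel()
    seen = matrix[user_idx].toarray().ravel() > 0
    scores[seen] = 0.0

    ranked = np.argsort(scores)[::-1][:top_n]
    return [(int(v), float(scores[v])) for v in ranked if scores[v] > 0]

toy_recs = toy_user_user(toy_binary, user_idx=0)
print(toy_recs)
```

Here video 2 ranks first because it comes from the closest neighbor, and video 3 follows with the lower weight of the second neighbor.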

In [6]:
def user_user_collaborative_filtering(user_id, n_neighbors=50, top_n=10):
    """Generate recommendations using user-user collaborative filtering"""
    if user_id not in user_to_idx:
        print(f"User {user_id} not found in the dataset")
        return []
    
    user_idx = user_to_idx[user_id]
    user_profile = binary_matrix[user_idx].toarray().flatten()
    
    # Find videos this user has already interacted with
    user_videos = set(np.where(user_profile > 0)[0])
    if len(user_videos) == 0:
        print(f"User {user_id} has no interactions in the training set")
        return []
    
    # Calculate similarity with other users who have at least one interaction
    active_users = np.where(binary_matrix.getnnz(axis=1) > 0)[0]
    similarities = []
    
    # Compare the target user with each active user, one row at a time (simple but slow for large matrices)
    for u_idx in active_users:
        if u_idx == user_idx:
            continue
        other_profile = binary_matrix[u_idx].toarray().flatten()
        sim = cosine_similarity(user_profile.reshape(1, -1), other_profile.reshape(1, -1))[0][0]
        if sim > 0:  # Only consider positive similarity
            similarities.append((u_idx, sim))
    
    # Sort by similarity and get top neighbors
    similarities.sort(key=lambda x: x[1], reverse=True)
    neighbors = similarities[:n_neighbors]
    
    # Calculate score for each video based on similar users' interactions
    video_scores = {}
    for neighbor_idx, similarity in neighbors:
        neighbor_videos = set(np.where(binary_matrix[neighbor_idx].toarray().flatten() > 0)[0])
        for vid_idx in neighbor_videos:
            if vid_idx not in user_videos:  # Only recommend videos the user hasn't seen
                if vid_idx not in video_scores:
                    video_scores[vid_idx] = 0
                # Weight the video by the similarity score
                video_scores[vid_idx] += similarity
    
    # Sort videos by score and get top_n recommendations
    sorted_videos = sorted(video_scores.items(), key=lambda x: x[1], reverse=True)[:top_n]
    recommendations = [(idx_to_video[vid_idx], score) for vid_idx, score in sorted_videos]
    
    return recommendations

# Example usage
test_user_id = interactions_test['user_id'].iloc[0]
user_user_recommendations = user_user_collaborative_filtering(test_user_id, n_neighbors=50, top_n=10)
print(f"User-User CF recommendations for user {test_user_id}:")
for video_id, score in user_user_recommendations:
    print(f"Video {video_id}: Score {score:.4f}")

User-User CF recommendations for user 14:
Video 3282: Score 11.7344
Video 10021: Score 11.1498
Video 8247: Score 11.1180
Video 6951: Score 11.0990
Video 4694: Score 10.8222
Video 5789: Score 10.5506
Video 2845: Score 10.2379
Video 8328: Score 9.9632
Video 9717: Score 9.9502
Video 7779: Score 9.6383


### Item-Item Collaborative Filtering

Item-Item collaborative filtering recommends items similar to those the user has already liked. It's often more scalable than user-user CF since:
- The item catalog changes less frequently than user preferences
- Item similarity matrices can be pre-computed

Steps:
1. Calculate similarity between items based on user interaction patterns
2. For each item a user has liked, find similar items
3. Aggregate these similar items to create personalized recommendations

This approach is widely used in production systems due to its efficiency and accuracy.
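
A minimal sketch of these steps on a toy matrix follows. The names `toy_binary`, `toy_item_sim`, and `toy_item_item` are hypothetical, and unlike the notebook code the sketch computes similarities for every item instead of only the most popular ones.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Toy binary matrix (hypothetical): 4 users x 4 items.
toy_binary = csr_matrix(np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]))

# Step 1: item-item similarity, computed from the item columns.
toy_item_sim = cosine_similarity(toy_binary.T)
np.fill_diagonal(toy_item_sim, 0.0)  # an item should not recommend itself

def toy_item_item(matrix, item_sim, user_idx, top_n=2):
    liked = matrix[user_idx].toarray().ravel() > 0
    # Steps 2 and 3: sum the similarity rows of every liked item.
    scores = item_sim[liked].sum(axis=0)
    scores[liked] = 0.0  # do not recommend items the user already has
    ranked = np.argsort(scores)[::-1][:top_n]
    return [(int(i), float(scores[i])) for i in ranked if scores[i] > 0]

toy_recs = toy_item_item(toy_binary, toy_item_sim, user_idx=0)
print(toy_recs)
```

Because the similarity matrix depends only on the interaction matrix, it can be computed once and reused for every user, which is where the scalability advantage comes from.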

In [7]:
def calculate_item_similarities(n_items=5000):
    """Calculate similarity between items based on user interaction patterns"""
    print("Calculating item similarities...")
    
    # For efficiency, we'll only calculate similarities for the most interacted-with items
    item_interaction_counts = binary_matrix.sum(axis=0).A1
    top_items_idx = np.argsort(item_interaction_counts)[-n_items:]
    
    # Create a map of video indices to their positions in the similarity matrix
    item_pos_map = {idx: pos for pos, idx in enumerate(top_items_idx)}
    
    # Extract submatrix with only top items
    top_items_matrix = binary_matrix[:, top_items_idx].transpose()
    
    print(f"Computing similarities for {n_items} most popular items...")
    item_similarity = cosine_similarity(top_items_matrix)
    
    # Set self-similarity to 0 to avoid recommending the same item
    np.fill_diagonal(item_similarity, 0)
    
    print(f"Item similarity matrix shape: {item_similarity.shape}")
    return item_similarity, item_pos_map, top_items_idx

# Calculate similarities for the most popular items (adjust n_items as needed)
item_similarity, item_pos_map, top_items_idx = calculate_item_similarities(n_items=5000)

Calculating item similarities...
Computing similarities for 5000 most popular items...
Item similarity matrix shape: (5000, 5000)


In [8]:
def item_item_collaborative_filtering(user_id, top_n=10):
    """Generate recommendations using item-item collaborative filtering"""
    if user_id not in user_to_idx:
        print(f"User {user_id} not found in the dataset")
        return []
    
    user_idx = user_to_idx[user_id]
    user_profile = binary_matrix[user_idx].toarray().flatten()
    
    # Find items this user has already interacted with
    user_items = np.where(user_profile > 0)[0]
    if len(user_items) == 0:
        print(f"User {user_id} has no interactions in the training set")
        return []
    
    # Calculate recommendation scores for all items
    scores = np.zeros(binary_matrix.shape[1])
    for item_idx in user_items:
        # Skip if item is not in the top items for which we calculated similarity
        if item_idx not in item_pos_map:
            continue

        # Add this item's positive similarities to the scores of all top items at once
        similar_items = item_similarity[item_pos_map[item_idx]]
        scores[top_items_idx] += np.where(similar_items > 0, similar_items, 0)
    
    # Zero out items the user has already interacted with
    scores[user_items] = 0
    
    # Get top N recommendations
    top_item_indices = np.argsort(scores)[-top_n:][::-1]
    recommendations = [(idx_to_video[idx], scores[idx]) for idx in top_item_indices if scores[idx] > 0]
    
    return recommendations

# Example usage
item_item_recommendations = item_item_collaborative_filtering(test_user_id, top_n=10)
print(f"Item-Item CF recommendations for user {test_user_id}:")
for video_id, score in item_item_recommendations:
    print(f"Video {video_id}: Score {score:.4f}")

Item-Item CF recommendations for user 14:
Video 4694: Score 62.5350
Video 3282: Score 62.3537
Video 8328: Score 59.6561
Video 6951: Score 58.4308
Video 8247: Score 58.4220
Video 10021: Score 58.0466
Video 2123: Score 57.4027
Video 8435: Score 56.9225
Video 4751: Score 56.4862
Video 2638: Score 56.4165


### Matrix Factorization Approach

Matrix Factorization is a model-based collaborative filtering approach that decomposes the user-item interaction matrix into lower-dimensional latent factor matrices. These latent factors can capture hidden patterns in user preferences and item characteristics.

Key benefits:
- Can handle sparsity better than memory-based approaches
- Often provides better prediction accuracy
- Reduces dimensionality for computational efficiency
- Can capture latent factors not evident in raw data

We'll use truncated Singular Value Decomposition (SVD), via `scipy.sparse.linalg.svds`, to factorize the rating matrix.
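
To make the idea of latent factors concrete, here is a minimal sketch on synthetic data. The matrix `toy_ratings` is generated from two hypothetical latent factors, so a truncated SVD with `k=2` should reconstruct it almost exactly. The sketch passes a sparse matrix directly to `svds`, which treats unobserved entries as zeros; that is an assumption of this example and differs from imputing values first.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)

# Synthetic rank-2 "ratings": 6 users x 5 items built from 2 latent factors.
user_factors = rng.random((6, 2))
item_factors = rng.random((2, 5))
toy_ratings = user_factors @ item_factors

# Truncated SVD with k=2 factors (k must be smaller than both dimensions).
U, sigma, Vt = svds(csr_matrix(toy_ratings), k=2)

# Predicted ratings are the product of the three factors.
reconstructed = U @ np.diag(sigma) @ Vt
max_error = float(np.abs(reconstructed - toy_ratings).max())

print("U:", U.shape, "sigma:", sigma.shape, "Vt:", Vt.shape)
print("Max reconstruction error:", max_error)
```

On real interaction data the matrix is not exactly low-rank, so the reconstruction is only an approximation and the number of factors becomes a tuning parameter.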

In [9]:
def matrix_factorization(n_factors=50):
    """Perform matrix factorization using Singular Value Decomposition (SVD)"""
    print(f"Performing matrix factorization with {n_factors} latent factors...")
    
    # Convert sparse matrix to dense (this could be memory-intensive for large matrices)
    # For production systems, consider using libraries like surprise or implicit that handle sparse matrices directly
    matrix = rating_matrix.toarray()
    
    # Fill zero entries with the column (item) mean
    # Note: the dense matrix holds 0 (not NaN) for unobserved pairs, so each mean is taken over
    # all users, zeros included, and a genuine watch_ratio of 0 is also treated as missing
    # In production, more sophisticated imputation methods could be used
    item_means = np.nanmean(matrix, axis=0)
    item_means[np.isnan(item_means)] = 0  # Safety net; no NaN is expected here
    
    # item_means broadcasts across rows, replacing every zero with its column mean
    matrix_filled = np.where(matrix == 0, item_means, matrix)
    
    # Perform SVD
    U, sigma, Vt = svds(matrix_filled, k=n_factors)
    sigma = np.diag(sigma)
    
    # Predict ratings: U × sigma × Vt
    predicted_ratings = np.dot(np.dot(U, sigma), Vt)
    
    print("Matrix factorization completed")
    return predicted_ratings

# Generate the predicted ratings matrix
predicted_ratings = matrix_factorization(n_factors=50)

Performing matrix factorization with 50 latent factors...
Matrix factorization completed


In [10]:
def matrix_factorization_recommendations(user_id, top_n=10):
    """Generate recommendations using matrix factorization results"""
    if user_id not in user_to_idx:
        print(f"User {user_id} not found in the dataset")
        return []
    
    user_idx = user_to_idx[user_id]
    user_interactions = binary_matrix[user_idx].toarray().flatten()
    # Copy the row so that masking watched items below does not overwrite predicted_ratings
    user_predictions = predicted_ratings[user_idx].copy()
    
    # Set already watched items to negative infinity to exclude them
    watched_items = np.where(user_interactions > 0)[0]
    user_predictions[watched_items] = float('-inf')
    
    # Get indices of top N predictions
    top_indices = np.argsort(user_predictions)[-top_n:][::-1]
    
    # Create recommendations list with scores
    recommendations = [(idx_to_video[idx], user_predictions[idx]) for idx in top_indices]
    return recommendations

# Example usage
mf_recommendations = matrix_factorization_recommendations(test_user_id, top_n=10)
print(f"Matrix Factorization recommendations for user {test_user_id}:")
for video_id, score in mf_recommendations:
    print(f"Video {video_id}: Score {score:.4f}")

Matrix Factorization recommendations for user 14:
Video 3400: Score 1.5975
Video 4751: Score 1.4310
Video 4694: Score 1.3057
Video 8328: Score 1.2772
Video 8435: Score 1.2699
Video 6951: Score 1.2257
Video 2845: Score 1.2134
Video 3871: Score 1.1290
Video 9758: Score 1.1099
Video 3827: Score 1.1028


### Evaluation Framework

We'll evaluate our collaborative filtering approaches using the same metrics as in the content-based notebook:
- Precision@k: Proportion of recommended items that were actually relevant
- Recall@k: Proportion of relevant items that were successfully recommended
- NDCG@k: Normalized Discounted Cumulative Gain, which measures ranking quality

This will allow direct comparison between collaborative and content-based approaches.
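
As a worked example of the three metrics, here is a minimal sketch for a single user with binary relevance. The function name `precision_recall_ndcg_at_k` and the item ids are hypothetical. Note one assumption: the ideal DCG here is built from all relevant items (capped at k), which is the textbook definition and may give different values from an NDCG that only reorders the recommended list.

```python
import numpy as np

def precision_recall_ndcg_at_k(recommended, relevant, k):
    """Compute Precision@k, Recall@k and NDCG@k for one user (binary relevance)."""
    top_k = recommended[:k]
    hits = np.array([1.0 if item in relevant else 0.0 for item in top_k])

    precision = hits.sum() / k
    recall = hits.sum() / len(relevant) if relevant else 0.0

    # DCG: each hit is discounted by log2(rank + 1), with ranks starting at 1.
    discounts = 1.0 / np.log2(np.arange(2, len(top_k) + 2))
    dcg = float((hits * discounts).sum())

    # Ideal DCG: all relevant items (at most k) placed at the top of the list.
    n_ideal = min(len(relevant), k)
    idcg = float((1.0 / np.log2(np.arange(2, n_ideal + 2))).sum())

    ndcg = dcg / idcg if idcg > 0 else 0.0
    return float(precision), float(recall), ndcg

# Four recommendations, two of which are among the three relevant items.
p, r, n = precision_recall_ndcg_at_k([10, 20, 30, 40], {20, 40, 50}, k=4)
print(f"Precision@4: {p:.4f}, Recall@4: {r:.4f}, NDCG@4: {n:.4f}")
```

The hits sit at ranks 2 and 4, so NDCG is well below 1 even though half of the list is relevant; moving them to ranks 1 and 2 would raise NDCG without changing precision or recall.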

In [11]:
def evaluate_recommender(recommendation_func, k=10, positive_threshold=0.7):
    """Evaluate a recommender function on test data"""
    print(f"Evaluating recommender on test videos... (top-{k})")
    
    # Users to evaluate: those present in small_matrix that also have an index in user_to_idx
    test_users = interactions_test['user_id'].unique()
    test_users = [u for u in test_users if u in user_to_idx]
    
    precision_list = []
    recall_list = []
    ndcg_list = []
    skipped = 0
    
    for user_id in tqdm(test_users[:100], desc="Evaluating users"):  # Limit to 100 users for speed
        # Videos this user has seen in small_matrix
        user_test_data = interactions_test[
            (interactions_test['user_id'] == user_id) &
            (interactions_test['video_id'].isin(video_to_idx))
        ]
        
        if user_test_data.empty:
            skipped += 1
            continue
        
        # Ground truth: liked videos in test set
        positive_videos = set(
            user_test_data[user_test_data['watch_ratio_clamped'] >= positive_threshold]['video_id']
        )
        
        if len(positive_videos) == 0:
            skipped += 1
            continue
        
        # Generate recommendations
        recommendations = recommendation_func(user_id, top_n=k)
        if not recommendations:
            skipped += 1
            continue
            
        recommended_videos = [video_id for video_id, _ in recommendations]
        
        # Evaluation
        recommended_set = set(recommended_videos)
        intersection = positive_videos & recommended_set
        
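        precision = len(intersection) / k
        recall = len(intersection) / len(positive_videos)
        relevance = [1 if vid in positive_videos else 0 for vid in recommended_videos]
        # Rank scores must match the number of returned items, which can be fewer than k
        # Note: the ideal ranking is built from the recommended items only, not from all positives
        rank_scores = list(range(len(recommended_videos), 0, -1))
        ndcg = ndcg_score([relevance], [rank_scores])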
        
        precision_list.append(precision)
        recall_list.append(recall)
        ndcg_list.append(ndcg)
    
    print(f"Users evaluated: {len(precision_list)} / {len(test_users[:100])}")
    print(f"Users skipped: {skipped}")
    
    return {
        'precision@k': np.mean(precision_list) if precision_list else 0,
        'recall@k': np.mean(recall_list) if recall_list else 0,
        'ndcg@k': np.mean(ndcg_list) if ndcg_list else 0,
        'users_evaluated': len(precision_list)
    }

In [13]:
for k in [5, 10, 20]:
    # Evaluate Item-Item CF
    ii_results = evaluate_recommender(item_item_collaborative_filtering, k=k)
    print(f"\nItem-Item Collaborative Filtering Results (k={k}):")
    print(f"Precision@{k}: {ii_results['precision@k']:.4f}")
    print(f"Recall@{k}:    {ii_results['recall@k']:.4f}")
    print(f"NDCG@{k}:      {ii_results['ndcg@k']:.4f}")

    # Evaluate Matrix Factorization
    mf_results = evaluate_recommender(matrix_factorization_recommendations, k=k)
    print(f"\nMatrix Factorization Results (k={k}):")
    print(f"Precision@{k}: {mf_results['precision@k']:.4f}")
    print(f"Recall@{k}:    {mf_results['recall@k']:.4f}")
    print(f"NDCG@{k}:      {mf_results['ndcg@k']:.4f}")

Evaluating recommender on test videos... (top-5)


Evaluating users: 100%|██████████| 100/100 [00:29<00:00,  3.34it/s]


Users evaluated: 100 / 100
Users skipped: 0

Item-Item Collaborative Filtering Results (k=5):
Precision@5: 0.0660
Recall@5:    0.0002
NDCG@5:      0.0984
Evaluating recommender on test videos... (top-5)


Evaluating users: 100%|██████████| 100/100 [00:02<00:00, 48.71it/s]


Users evaluated: 100 / 100
Users skipped: 0

Matrix Factorization Results (k=5):
Precision@5: 0.2800
Recall@5:    0.0008
NDCG@5:      0.4603
Evaluating recommender on test videos... (top-10)


Evaluating users: 100%|██████████| 100/100 [00:30<00:00,  3.33it/s]


Users evaluated: 100 / 100
Users skipped: 0

Item-Item Collaborative Filtering Results (k=10):
Precision@10: 0.1720
Recall@10:    0.0009
NDCG@10:      0.3221
Evaluating recommender on test videos... (top-10)


Evaluating users: 100%|██████████| 100/100 [00:02<00:00, 49.13it/s]


Users evaluated: 100 / 100
Users skipped: 0

Matrix Factorization Results (k=10):
Precision@10: 0.3390
Recall@10:    0.0020
NDCG@10:      0.5479
Evaluating recommender on test videos... (top-20)


Evaluating users: 100%|██████████| 100/100 [00:30<00:00,  3.31it/s]


Users evaluated: 100 / 100
Users skipped: 0

Item-Item Collaborative Filtering Results (k=20):
Precision@20: 0.3225
Recall@20:    0.0035
NDCG@20:      0.4860
Evaluating recommender on test videos... (top-20)


Evaluating users: 100%|██████████| 100/100 [00:02<00:00, 48.21it/s]

Users evaluated: 100 / 100
Users skipped: 0

Matrix Factorization Results (k=20):
Precision@20: 0.4205
Recall@20:    0.0048
NDCG@20:      0.6273



