# Matrix Factorization with Alternating Least Squares (ALS)

This notebook implements ALS matrix factorization for collaborative filtering using the `implicit` library.

## Overview
- **Matrix Factorization**: Decompose user-item interaction matrix into user and item latent factors
- **ALS Algorithm**: Alternating Least Squares optimization for implicit feedback
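- **Confidence Weighting**: Weight each observed interaction by a confidence `C = 1 + alpha * R`, so repeated clicks count for more than a single click
- **GPU Acceleration**: Train on the GPU via `use_gpu=True`, which requires a CUDA-enabled build of `implicit`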
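
To make the alternating step concrete, here is a minimal NumPy sketch of one user update in the standard implicit-feedback formulation: with item factors `Y` held fixed, the user vector solves `(Y^T C_u Y + lambda * I) x_u = Y^T C_u p_u`. The function name `als_user_update` and the toy data are illustrative assumptions, and the direct solve is for exposition only; it is not meant to mirror how `implicit` computes the update internally.

```python
import numpy as np

def als_user_update(item_factors, confidence_row, regularization):
    """Solve for one user's factors with the item factors held fixed.

    confidence_row holds c_ui for every item: 1.0 where there was no
    interaction and 1 + alpha * r_ui where there was one.
    """
    n_factors = item_factors.shape[1]
    # Binary preference p_ui: 1 for observed interactions, 0 otherwise
    preference = (confidence_row > 1.0).astype(item_factors.dtype)
    # Left-hand side: Y^T C_u Y + lambda * I
    weighted = item_factors * confidence_row[:, None]
    lhs = item_factors.T @ weighted + regularization * np.eye(n_factors)
    # Right-hand side: Y^T C_u p_u
    rhs = weighted.T @ preference
    return np.linalg.solve(lhs, rhs)

rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 3))                      # 6 items, 3 latent factors
r_u = np.array([1.0, 0.0, 0.0, 2.0, 0.0, 0.0])   # toy click counts for one user
c_u = 1.0 + 40.0 * r_u                           # confidence with alpha = 40
x_u = als_user_update(Y, c_u, regularization=0.01)
scores = Y @ x_u                                 # predicted preference per item
```

The item update has the same form with the roles of users and items swapped; ALS alternates between the two until the factors stabilize.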

## 1. Setup and Imports

In [None]:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from typing import Dict, List, Tuple, Any
from implicit.als import AlternatingLeastSquares
import pickle

# Set random seed for reproducibility
np.random.seed(42)

## 2. Helper Functions

In [None]:
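def load_data(path: str) -> pd.DataFrame:
    """
    Load user-item interaction data from disk.

    Args:
        path: Path to a parquet or CSV file.

    Returns:
        DataFrame with user-item interaction data.
    """
    # Dispatch on the file extension: the sample data used below is a .csv file,
    # which pd.read_parquet cannot read.
    if path.lower().endswith('.csv'):
        return pd.read_csv(path)
    return pd.read_parquet(path, engine="pyarrow")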


def build_interaction_matrix(df: pd.DataFrame) -> Tuple[csr_matrix, pd.Index, pd.Index]:
    """
    Build a sparse interaction matrix from a DataFrame of user-item interactions.
    
    Args:
        df: DataFrame with columns 'user_id' and 'click_article_id'.
        
    Returns:
        mat: csr_matrix of shape (n_users, n_items); each entry is the number of
            clicks for that (user, item) pair, since duplicate rows are summed.
        user_index: Index of unique user_id values.
        item_index: Index of unique click_article_id values.
    """
    users = df['user_id'].astype('category')
    items = df['click_article_id'].astype('category')
    user_codes = users.cat.codes.values
    item_codes = items.cat.codes.values
    n_users = users.cat.categories.size
    n_items = items.cat.categories.size

    mat = csr_matrix(
        (np.ones(len(df), dtype=np.float32), (user_codes, item_codes)),
        shape=(n_users, n_items)
    )
    return mat, users.cat.categories, items.cat.categories
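
The toy example below is a sketch of what the construction above produces when a user clicks the same article more than once. The data is made up for illustration; the point is that SciPy sums duplicate `(row, col)` pairs when building a CSR matrix, so entries are click counts rather than 0/1 flags.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

toy = pd.DataFrame({
    'user_id': [10, 10, 10, 20],
    'click_article_id': [7, 7, 9, 7],   # user 10 clicked article 7 twice
})
users = toy['user_id'].astype('category')
items = toy['click_article_id'].astype('category')

mat = csr_matrix(
    (np.ones(len(toy), dtype=np.float32),
     (users.cat.codes.values, items.cat.codes.values)),
    shape=(users.cat.categories.size, items.cat.categories.size),
)
dense = mat.toarray()   # rows: users 10, 20; columns: articles 7, 9

# Optional: collapse counts to binary clicks if counts are not wanted
binary = mat.copy()
binary.data[:] = 1.0
```

Whether to keep counts or binarize is a modeling choice: counts feed directly into the confidence `1 + alpha * R`, so heavy repeat clicks get proportionally more weight.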

## 3. ALS Training Function

In [None]:
def train_als(
    train_sparse: csr_matrix,
    factors: int = 50,
    regularization: float = 0.01,
    iterations: int = 15,
    alpha: float = 40.0
) -> AlternatingLeastSquares:
    """
    Trains an ALS model on implicit feedback data.
    
    Args:
        train_sparse: User-item interaction matrix (users x items)
        factors: Number of latent factors
        regularization: L2 regularization strength
        iterations: Number of ALS iterations
        alpha: Confidence weighting for implicit feedback
        
    Returns:
        Trained ALS model
    """
    print(f"Training ALS with matrix shape {train_sparse.shape}")
    print(f"Parameters: factors={factors}, reg={regularization}, iter={iterations}, alpha={alpha}")
    
    # Create confidence matrix: C = 1 + alpha * R
    confidence = train_sparse.copy()
    confidence = confidence.multiply(alpha)
    confidence.data += 1.0
    
    # Create ALS model
    model = AlternatingLeastSquares(
        factors=factors,
        regularization=regularization,
        iterations=iterations,
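        use_gpu=True,  # Requires a CUDA-enabled build of implicit; set to False on CPU-only machines
        use_native=False  # Skip the native (compiled) CPU solver; this flag does not control matrix orientation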
    )
    
    # Fit model
    model.fit(confidence)
    
    print(f"Model trained successfully!")
    print(f"User factors shape: {model.user_factors.shape}")
    print(f"Item factors shape: {model.item_factors.shape}")
    
    return model

## 4. Recommendation Generation

In [None]:
def recommend_als(
    model: AlternatingLeastSquares,
    train_sparse: csr_matrix,
    user_index: pd.Index,
    item_index: pd.Index,
    top_k: int = 10
) -> Dict[int, List[int]]:
    """
    Generate top-k ALS recommendations for all users.
    
    Args:
        model: Trained ALS model
        train_sparse: Training interaction matrix
        user_index: Mapping from user codes to original user IDs
        item_index: Mapping from item codes to original item IDs
        top_k: Number of recommendations per user
        
    Returns:
        Dictionary mapping user_id to list of recommended item_ids
    """
    recs = {}
    n_users = min(train_sparse.shape[0], model.user_factors.shape[0])
    
    print(f"Generating recommendations for {n_users} users...")
    
    for u in range(n_users):
        user_id = int(user_index[u])
        
        try:
            # Get recommendations from model
            recommended = model.recommend(
                u, train_sparse[u], N=top_k, filter_already_liked_items=True
            )
            
            # Extract item codes
            if isinstance(recommended, tuple) and len(recommended) == 2:
                item_codes = recommended[0]
            else:
                try:
                    item_codes = [rec[0] for rec in recommended]
                except (IndexError, TypeError):
                    item_codes = recommended
            
            # Filter valid indices and convert to original item IDs
            valid_item_codes = [icode for icode in item_codes if 0 <= icode < len(item_index)]
            recs[user_id] = [int(item_index[icode]) for icode in valid_item_codes]
            
        except Exception as e:
            print(f"Error for user {u}: {e}")
            recs[user_id] = []  # Empty recommendations on error
    
    print(f"Generated recommendations for {len(recs)} users")
    return recs

## 5. Evaluation Function

In [None]:
def evaluate_recommendations(
    recommendations: Dict[int, List[int]],
    ground_truth_df: pd.DataFrame,
    k_list: List[int] = [5, 10, 20]
) -> Dict[str, float]:
    """
    Evaluate recommendations against ground truth.
    
    Args:
        recommendations: Dictionary mapping user_ids to recommended item_ids.
        ground_truth_df: DataFrame with held-out interactions.
        k_list: List of k values for precision@k and recall@k.
        
    Returns:
        Dictionary with evaluation metrics.
    """
    # Build held-out map: user_id → set of held-out click_article_id
    held_out_map: Dict[int, set] = {}
    for _, row in ground_truth_df.iterrows():
        u = int(row['user_id'])
        i = int(row['click_article_id'])
        held_out_map.setdefault(u, set()).add(i)

    # Initialize metrics
    sum_prec = {k: 0.0 for k in k_list}
    sum_rec = {k: 0.0 for k in k_list}
    sum_rank = 0.0
    total_users = 0
    total_held_items = 0

    for u, rec_list in recommendations.items():
        if u not in held_out_map:
            continue
        gt_items = held_out_map[u]
        if not gt_items:
            continue

        total_users += 1
        total_held_items += len(gt_items)

        # Compute Precision@k and Recall@k
        for k in k_list:
            topk = set(rec_list[:k])
            n_hit = len(topk & gt_items)
            sum_prec[k] += n_hit / float(k)
            sum_rec[k] += n_hit / float(len(gt_items))

        # Compute Mean Rank
        for item in gt_items:
            if item in rec_list:
                r = rec_list.index(item) + 1  # 1-based rank
            else:
                r = len(rec_list) + 1
            sum_rank += r

    # Aggregate results
    results: Dict[str, float] = {}
    for k in k_list:
        results[f'precision@{k}'] = sum_prec[k] / total_users if total_users > 0 else 0.0
        results[f'recall@{k}'] = sum_rec[k] / total_users if total_users > 0 else 0.0
    results['mean_rank'] = sum_rank / total_held_items if total_held_items > 0 else float('inf')
    results['total_users'] = total_users

    return results
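
As a sanity check on the metric definitions, here is a small worked example for a single user, computed by hand with the same rules as the function above. The item IDs are invented for illustration.

```python
recommended = [101, 205, 309, 412, 518]   # ranked list for one user
held_out = {205, 999}                     # items the user clicked later

k = 5
hits = len(set(recommended[:k]) & held_out)
precision_at_k = hits / k                 # 1 hit out of 5 recommended
recall_at_k = hits / len(held_out)        # 1 hit out of 2 held-out items

# Rank of each held-out item; items missing from the list get len(list) + 1
ranks = [
    recommended.index(item) + 1 if item in recommended else len(recommended) + 1
    for item in sorted(held_out)
]
mean_rank = sum(ranks) / len(ranks)       # (2 + 6) / 2
```

Because a miss is assigned rank `len(rec_list) + 1`, the mean rank is bounded by the length of the recommendation list, so values are only comparable between runs that use the same `top_k`.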

## 6. Data Loading and Preprocessing

In [None]:
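# Load datasets (use sample data for demo).
# NOTE: all three splits read the same sample file, so on this path the
# validation and test sets are not held out from the training data.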
try:
    train_df = load_data('../data/sample/sample_interactions.csv')
    valid_df = load_data('../data/sample/sample_interactions.csv')
    test_df = load_data('../data/sample/sample_interactions.csv')
except FileNotFoundError:
    print("Sample data not found. Please ensure sample datasets are available.")
    print("Creating synthetic data for demonstration...")
    
    # Create synthetic data
    n_users, n_items, n_interactions = 1000, 500, 5000
    
    synthetic_data = pd.DataFrame({
        'user_id': np.random.randint(0, n_users, n_interactions),
        'click_article_id': np.random.randint(0, n_items, n_interactions),
        'click_timestamp': pd.date_range('2024-01-01', periods=n_interactions, freq='h')  # 'H' is deprecated in recent pandas
    })
    
    # Split into train/valid/test
    train_df = synthetic_data.iloc[:3000].copy()
    valid_df = synthetic_data.iloc[3000:4000].copy()
    test_df = synthetic_data.iloc[4000:].copy()

print(f"Dataset shapes - Train: {train_df.shape}, Valid: {valid_df.shape}, Test: {test_df.shape}")

## 7. Build Interaction Matrices

In [None]:
# Build interaction matrices
print("Building interaction matrices...")

train_mat, user_index, item_index = build_interaction_matrix(train_df)
valid_mat, _, _ = build_interaction_matrix(valid_df)
test_mat, _, _ = build_interaction_matrix(test_df)

print(f"Training matrix shape: {train_mat.shape}")
print(f"Training matrix density: {train_mat.nnz / (train_mat.shape[0] * train_mat.shape[1]):.6f}")
print(f"Number of users: {len(user_index)}")
print(f"Number of items: {len(item_index)}")

## 8. Train ALS Model

In [None]:
# Train ALS model with different configurations
print("Training ALS model...")

# Hyperparameters
als_model = train_als(
    train_mat, 
    factors=64, 
    regularization=0.01, 
    iterations=20, 
    alpha=40.0
)

print("ALS training completed!")

## 9. Generate Recommendations

In [None]:
# Generate recommendations
print("Generating recommendations...")

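# Request 20 items per user so that precision@20 and recall@20 are computed on
# full-length lists. The model is trained on train_mat only, so the same
# recommendations are scored against both the validation and the test split.
als_recs_valid = recommend_als(als_model, train_mat, user_index, item_index, top_k=20)
als_recs_test = als_recs_valid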

print(f"Generated recommendations for {len(als_recs_valid)} users")

# Show sample recommendations
sample_user = next(iter(als_recs_valid.keys()))
print(f"\nSample recommendations for user {sample_user}:")
print(als_recs_valid[sample_user][:5])

## 10. Evaluation

In [None]:
# Evaluate on validation set
print("Evaluating on validation set...")
metrics_valid = evaluate_recommendations(als_recs_valid, valid_df, k_list=[5, 10, 20])

print("\nValidation Metrics (ALS):")
for metric, value in metrics_valid.items():
    if metric != 'total_users':
        print(f"  {metric}: {value:.4f}")
    else:
        print(f"  {metric}: {int(value)}")

In [None]:
# Evaluate on test set
print("Evaluating on test set...")
metrics_test = evaluate_recommendations(als_recs_test, test_df, k_list=[5, 10, 20])

print("\nTest Metrics (ALS):")
for metric, value in metrics_test.items():
    if metric != 'total_users':
        print(f"  {metric}: {value:.4f}")
    else:
        print(f"  {metric}: {int(value)}")

## 11. Hyperparameter Analysis

In [None]:
# Compare different hyperparameter settings
print("\nHyperparameter Analysis:")
print("Testing different factor sizes and alpha values...")

hyperparams = [
    {'factors': 32, 'alpha': 20.0},
    {'factors': 64, 'alpha': 40.0},
    {'factors': 128, 'alpha': 40.0}
]

results_comparison = []

for params in hyperparams:
    print(f"\nTesting: factors={params['factors']}, alpha={params['alpha']}")
    
    # Train model
    model = train_als(
        train_mat,
        factors=params['factors'],
        regularization=0.01,
        iterations=15,
        alpha=params['alpha']
    )
    
    # Generate recommendations
    recs = recommend_als(model, train_mat, user_index, item_index, top_k=10)
    
    # Evaluate
    metrics = evaluate_recommendations(recs, valid_df, k_list=[5, 10])
    
    # Store results
    result = params.copy()
    result.update({
        'precision@5': metrics['precision@5'],
        'recall@10': metrics['recall@10'],
        'mean_rank': metrics['mean_rank']
    })
    results_comparison.append(result)
    
    print(f"  Precision@5: {metrics['precision@5']:.4f}")
    print(f"  Recall@10: {metrics['recall@10']:.4f}")
    print(f"  Mean Rank: {metrics['mean_rank']:.2f}")

# Summary
print("\nHyperparameter Comparison Summary:")
comparison_df = pd.DataFrame(results_comparison)
print(comparison_df.round(4))

## 12. Model Inspection

In [None]:
# Inspect learned factors
print("Model Factor Analysis:")
print(f"User factors shape: {als_model.user_factors.shape}")
print(f"Item factors shape: {als_model.item_factors.shape}")

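# Analyze factor distributions
# GPU-trained models keep factors on the device; copy them to NumPy arrays first
user_factors = als_model.user_factors
item_factors = als_model.item_factors
if hasattr(user_factors, 'to_numpy'):
    user_factors = user_factors.to_numpy()
if hasattr(item_factors, 'to_numpy'):
    item_factors = item_factors.to_numpy()

user_factor_norms = np.linalg.norm(user_factors, axis=1)
item_factor_norms = np.linalg.norm(item_factors, axis=1)

print(f"\nUser factor norms - Mean: {user_factor_norms.mean():.3f}, Std: {user_factor_norms.std():.3f}")
print(f"Item factor norms - Mean: {item_factor_norms.mean():.3f}, Std: {item_factor_norms.std():.3f}")

# Find most similar items to a sample item (raw dot-product similarity)
if len(item_index) > 0:
    sample_item = 0
    item_similarities = np.dot(item_factors, item_factors[sample_item])
    # Rank by similarity and drop the sample item itself; with dot products it
    # is not guaranteed to rank first
    ranked = np.argsort(item_similarities)[::-1]
    most_similar = ranked[ranked != sample_item][:5]  # Top 5 similar items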
    
    print(f"\nMost similar items to item {item_index[sample_item]}:")
    for i, similar_idx in enumerate(most_similar):
        similarity = item_similarities[similar_idx]
        print(f"  {i+1}. Item {item_index[similar_idx]} (similarity: {similarity:.3f})")

## 13. Conclusions

This notebook demonstrates:

1. **ALS Matrix Factorization**: Implemented using the `implicit` library for efficient training
2. **Confidence Weighting**: Applied to handle implicit feedback appropriately
3. **Hyperparameter Tuning**: Compared different factor sizes and alpha values
4. **Comprehensive Evaluation**: Used multiple metrics to assess performance

### Key Insights:
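- **Latent Factors**: ALS represents each user and item as a dense vector; the dot product of a user vector and an item vector scores how strongly that user is expected to prefer that item
- **Scalability**: Training works directly on the sparse interaction matrix, which keeps it practical for large datasets with sparse interactions
- **Hyperparameters**: Factor size and confidence weighting (`alpha`) can change ranking quality noticeably, so compare settings on the validation set as in Section 11
- **Cold Start**: Users and items with no training interactions have no learned factors, so the model cannot produce recommendations for those users or ever recommend those items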
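
To gauge how much the cold-start limitation affects an evaluation, one option is to measure how many held-out users and interactions fall outside the training vocabulary. The sketch below uses made-up toy frames with the same column names as this notebook; the variable names are illustrative assumptions.

```python
import pandas as pd

train = pd.DataFrame({'user_id': [1, 1, 2], 'click_article_id': [10, 11, 10]})
held_out = pd.DataFrame({'user_id': [1, 3], 'click_article_id': [12, 10]})

known_users = set(train['user_id'])
known_items = set(train['click_article_id'])

# Held-out users the model never saw: they receive no recommendations at all
cold_users = set(held_out['user_id']) - known_users
cold_user_share = len(cold_users) / held_out['user_id'].nunique()

# Held-out interactions on unseen items: these can never be counted as hits
cold_item_rows = ~held_out['click_article_id'].isin(known_items)
cold_item_share = cold_item_rows.mean()
```

In this notebook, `evaluate_recommendations` only iterates over users that have recommendations, so cold users are silently excluded from the averages; reporting these shares alongside the metrics makes that exclusion visible.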

### Next Steps:
- Experiment with different regularization values
- Implement item-based ALS for comparison
- Combine ALS with other algorithms in an ensemble
- Add content-based features to handle cold-start users and items