# SageMaker Factorization Machines Exercise

This notebook demonstrates Amazon SageMaker's **Factorization Machines** algorithm for classification and regression with sparse data.

## What You'll Learn
1. How to prepare sparse data for Factorization Machines
2. How to train a model for recommendation-style problems
3. How to interpret predictions

## What are Factorization Machines?

A Factorization Machine is a **supervised** model that:
- Captures feature interactions efficiently
- Works well with high-dimensional sparse data
- Combines linear regression with factorized interaction terms

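**Key Formula:**
```
ŷ = w₀ + Σᵢ wᵢxᵢ + Σᵢ Σⱼ>ᵢ <vᵢ, vⱼ>xᵢxⱼ
```
- w₀: Global bias
- wᵢ: Linear weight of feature i
- <vᵢ, vⱼ>: Dot product of the latent vectors of features i and j, used as the weight of their pairwise interaction
- The double sum counts each pair of distinct features once (j > i)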
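The pairwise term looks quadratic in the number of features, but it can be rewritten so that it costs time linear in the number of features. The sketch below evaluates the formula both ways in NumPy and checks that they agree. The function names and the random toy parameters are illustrative only; this is not SageMaker's implementation.

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """Direct evaluation: loop over every pair i < j."""
    n = len(x)
    y = w0 + w @ x
    for i in range(n):
        for j in range(i + 1, n):
            y += (V[i] @ V[j]) * x[i] * x[j]
    return y

def fm_predict_fast(x, w0, w, V):
    """Same value in O(k * n) using the sum-of-squares identity."""
    xv = V.T @ x                    # shape (k,): sum_i v_if * x_i
    x2v2 = (V ** 2).T @ (x ** 2)    # shape (k,): sum_i v_if^2 * x_i^2
    return w0 + w @ x + 0.5 * np.sum(xv ** 2 - x2v2)

# Toy parameters: n features, k latent factors per feature
rng = np.random.default_rng(0)
n, k = 6, 3
x = rng.normal(size=n)
w0, w, V = 0.1, rng.normal(size=n), rng.normal(size=(n, k))

print(np.isclose(fm_predict_naive(x, w0, w, V), fm_predict_fast(x, w0, w, V)))
```

For a one-hot row with a single user and a single item active, the prediction reduces to the bias, the two linear weights, and one dot product between the user's and the item's latent vectors.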

## Use Cases

| Application | Description |
|-------------|-------------|
| Recommendation | Predict user-item interactions |
| Click prediction | Ad click-through rate |
| Rating prediction | Movie/product ratings |
| Sparse classification | High-dimensional categorical data |

---

## Step 1: Setup and Imports

In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.image_uris import retrieve
from sagemaker.estimator import Estimator
import pandas as pd
import numpy as np
import json
import os
from datetime import datetime
from dotenv import load_dotenv
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report

# Load environment variables from .env file
load_dotenv()

# Configure AWS session from environment variables
aws_profile = os.getenv('AWS_PROFILE')
aws_region = os.getenv('AWS_REGION', 'us-west-2')
sagemaker_role = os.getenv('SAGEMAKER_ROLE_ARN')

if aws_profile:
    boto3.setup_default_session(profile_name=aws_profile, region_name=aws_region)
else:
    boto3.setup_default_session(region_name=aws_region)

# SageMaker session and role
sagemaker_session = sagemaker.Session()

if sagemaker_role:
    role = sagemaker_role
else:
    role = get_execution_role()

region = sagemaker_session.boto_region_name

print(f"AWS Profile: {aws_profile or 'default'}")
print(f"SageMaker Role: {role}")
print(f"Region: {region}")
print(f"SageMaker SDK Version: {sagemaker.__version__}")

In [None]:
# Configuration
BUCKET_NAME = sagemaker_session.default_bucket()
PREFIX = "factorization-machines"

# Dataset parameters
NUM_USERS = 500
NUM_ITEMS = 200
NUM_INTERACTIONS = 10000
RANDOM_STATE = 42

print(f"S3 Bucket: {BUCKET_NAME}")
print(f"S3 Prefix: {PREFIX}")

## Step 2: Generate Synthetic Recommendation Data

In [None]:
def generate_recommendation_data(num_users=500, num_items=200, num_interactions=10000, seed=42):
    """
    Generate synthetic user-item interaction data.
    
    Creates a binary classification problem (click/no-click).
    """
    np.random.seed(seed)
    
    # Generate latent factors for users and items
    num_factors = 10
    user_factors = np.random.randn(num_users, num_factors) * 0.5
    item_factors = np.random.randn(num_items, num_factors) * 0.5
    
    # Generate biases
    user_bias = np.random.randn(num_users) * 0.2
    item_bias = np.random.randn(num_items) * 0.2
    
    interactions = []
    
    for _ in range(num_interactions):
        user_id = np.random.randint(0, num_users)
        item_id = np.random.randint(0, num_items)
        
        # Compute interaction score
        score = (
            user_bias[user_id] + 
            item_bias[item_id] + 
            np.dot(user_factors[user_id], item_factors[item_id])
        )
        
        # Convert to probability and sample
        prob = 1 / (1 + np.exp(-score))
        label = 1 if np.random.random() < prob else 0
        
        interactions.append({
            'user_id': user_id,
            'item_id': item_id,
            'label': label
        })
    
    return pd.DataFrame(interactions)

# Generate data
df = generate_recommendation_data(NUM_USERS, NUM_ITEMS, NUM_INTERACTIONS, RANDOM_STATE)

print(f"Dataset shape: {df.shape}")
print(f"\nLabel distribution:")
print(df['label'].value_counts())
print(f"\nSample data:")
print(df.head(10))

In [None]:
# Split data
np.random.seed(RANDOM_STATE)
indices = np.random.permutation(len(df))

train_size = int(0.8 * len(df))
train_idx = indices[:train_size]
test_idx = indices[train_size:]

train_df = df.iloc[train_idx].reset_index(drop=True)
test_df = df.iloc[test_idx].reset_index(drop=True)

print(f"Training samples: {len(train_df)}")
print(f"Test samples: {len(test_df)}")

## Step 3: Prepare Data for Factorization Machines

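Factorization Machines expect a **sparse** feature representation. Here each interaction is one-hot encoded as one user column plus one item column, so exactly 2 of the 700 features (500 users + 200 items) are nonzero in every row.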
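Because only two entries per row are nonzero, the same encoding can be held as coordinate triplets (row, column, value) instead of a full array. The sketch below builds those triplets with NumPy; `one_hot_coo` is a hypothetical helper, not part of the notebook, and the toy IDs are made up for illustration.

```python
import numpy as np

def one_hot_coo(user_ids, item_ids, num_users, num_items):
    """Return COO triplets (rows, cols, vals) and the matrix shape."""
    user_ids = np.asarray(user_ids)
    item_ids = np.asarray(item_ids)
    n = len(user_ids)
    rows = np.repeat(np.arange(n), 2)                            # each row appears twice
    cols = np.column_stack([user_ids, num_users + item_ids]).ravel()
    vals = np.ones(2 * n, dtype=np.float32)
    return rows, cols, vals, (n, num_users + num_items)

rows, cols, vals, shape = one_hot_coo([0, 2, 1], [1, 0, 1], num_users=3, num_items=2)

# Densify only to inspect this toy example
dense = np.zeros(shape, dtype=np.float32)
dense[rows, cols] = vals
print(dense)
```

Triplets in this form can be passed to a sparse-matrix library, which keeps memory proportional to the number of interactions rather than to users times items.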

In [None]:
def create_sparse_features(df, num_users, num_items):
    """
    Create sparse feature matrix with one-hot encoded users and items.
    
    Feature space: [user_0, ..., user_N, item_0, ..., item_M]
    """
    num_samples = len(df)
    num_features = num_users + num_items
    
    # Dense array that is mostly zeros: exactly two nonzeros per row
    X = np.zeros((num_samples, num_features), dtype=np.float32)
    y = df['label'].values.astype(np.float32)
    
    # Vectorized one-hot fill; uses positional rows, so it does not depend on df's index
    rows = np.arange(num_samples)
    X[rows, df['user_id'].to_numpy()] = 1.0
    X[rows, num_users + df['item_id'].to_numpy()] = 1.0
    
    return X, y

# Create features
X_train, y_train = create_sparse_features(train_df, NUM_USERS, NUM_ITEMS)
X_test, y_test = create_sparse_features(test_df, NUM_USERS, NUM_ITEMS)

print(f"Training features shape: {X_train.shape}")
print(f"Test features shape: {X_test.shape}")
print(f"Feature dimensionality: {X_train.shape[1]}")
print(f"Sparsity: {1 - np.count_nonzero(X_train) / X_train.size:.4f}")

In [None]:
# Save as recordIO-protobuf: the built-in algorithm does not accept CSV for training
from scipy.sparse import csr_matrix
from sagemaker.amazon.common import write_spmatrix_to_sparse_tensor

os.makedirs('data/fm', exist_ok=True)

for split, X, y in [('train', X_train, y_train), ('test', X_test, y_test)]:
    with open(f'data/fm/{split}.protobuf', 'wb') as f:
        # Float32 sparse tensors with the label attached to each record
        write_spmatrix_to_sparse_tensor(f, csr_matrix(X), labels=y)

print("Data files created:")
for f in sorted(os.listdir('data/fm')):
    size = os.path.getsize(f'data/fm/{f}') / 1024 / 1024
    print(f"  data/fm/{f} ({size:.1f} MB)")

In [None]:
# Upload to S3
s3_client = boto3.client('s3')

for split in ['train', 'test']:
    s3_key = f"{PREFIX}/{split}/{split}.protobuf"
    s3_client.upload_file(f'data/fm/{split}.protobuf', BUCKET_NAME, s3_key)
    print(f"Uploaded: s3://{BUCKET_NAME}/{s3_key}")

train_uri = f"s3://{BUCKET_NAME}/{PREFIX}/train"
test_uri = f"s3://{BUCKET_NAME}/{PREFIX}/test"

## Step 4: Train Factorization Machines Model

### Key Hyperparameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `num_factors` | Dimension of factorized interaction | Required (64 is a typical starting point) |
| `predictor_type` | `binary_classifier` or `regressor` | Required |
| `feature_dim` | Number of features | Required |
| `epochs` | Training epochs | 1 |
| `mini_batch_size` | Batch size | 1000 |
| `bias_lr` | Learning rate for bias | 0.1 |
| `linear_lr` | Learning rate for linear terms | 0.001 |
| `factors_lr` | Learning rate for factor terms | 0.0001 |
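To see what `num_factors` and `feature_dim` imply for model size, the sketch below counts parameters directly from the formula: one bias, one linear weight per feature, and one latent vector of length `num_factors` per feature. The helper names are illustrative, and the counts describe the mathematical model rather than how SageMaker stores it.

```python
def fm_parameter_count(feature_dim, num_factors):
    """1 bias + feature_dim linear weights + feature_dim * num_factors factor entries."""
    return 1 + feature_dim + feature_dim * num_factors

def explicit_pairwise_count(feature_dim):
    """Number of weights if every pair i < j had its own coefficient."""
    return feature_dim * (feature_dim - 1) // 2

feature_dim = 500 + 200  # users + items in this exercise
for k in (8, 64):
    print(f"num_factors={k}: {fm_parameter_count(feature_dim, k)} parameters")
print(f"explicit pairwise weights: {explicit_pairwise_count(feature_dim)}")
```

With 700 features, even `num_factors=64` needs far fewer parameters than one weight per feature pair, which is why the factorized form copes with sparse data where most pairs are rarely observed together.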

In [None]:
# Get Factorization Machines container image
fm_image = retrieve(
    framework='factorization-machines',
    region=region,
    version='1'
)

print(f"Factorization Machines Image URI: {fm_image}")

In [None]:
# Create Factorization Machines estimator
fm_estimator = Estimator(
    image_uri=fm_image,
    role=role,
    instance_count=1,
    instance_type='ml.c5.xlarge',
    output_path=f's3://{BUCKET_NAME}/{PREFIX}/output',
    sagemaker_session=sagemaker_session,
    base_job_name='factorization-machines'
)

In [None]:
# Set hyperparameters
hyperparameters = {
    "num_factors": 64,
    "predictor_type": "binary_classifier",
    "feature_dim": NUM_USERS + NUM_ITEMS,
    "epochs": 20,
    "mini_batch_size": 200,
    "bias_lr": 0.1,
    "linear_lr": 0.01,
    "factors_lr": 0.001,
    # With the "normal" init method the spread is set by *_init_sigma;
    # *_init_scale only takes effect with the "uniform" method
    "bias_init_method": "normal",
    "bias_init_sigma": 0.1,
    "linear_init_method": "normal",
    "linear_init_sigma": 0.1,
    "factors_init_method": "normal",
    "factors_init_sigma": 0.1,
}

fm_estimator.set_hyperparameters(**hyperparameters)

print("Factorization Machines hyperparameters:")
for k, v in hyperparameters.items():
    print(f"  {k}: {v}")

In [None]:
# Start training
print("Starting Factorization Machines training job...")
print("This will take approximately 5-10 minutes.\n")

fm_estimator.fit(
    {
        'train': train_uri,
        'test': test_uri
    },
    wait=True,
    logs=True
)

In [None]:
# Get training job info
job_name = fm_estimator.latest_training_job.name
print(f"Training job completed: {job_name}")
print(f"Model artifacts: {fm_estimator.model_data}")

## Step 5: Deploy and Test Model

In [None]:
# Deploy the model
print("Deploying Factorization Machines model...")
print("This will take approximately 5-7 minutes.\n")

fm_predictor = fm_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    endpoint_name=f'fm-{datetime.now().strftime("%Y%m%d%H%M")}'
)

print(f"\nEndpoint deployed: {fm_predictor.endpoint_name}")

In [None]:
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Configure predictor: the endpoint accepts JSON or recordIO-protobuf requests, not CSV
fm_predictor.serializer = JSONSerializer()
fm_predictor.deserializer = JSONDeserializer()

def predict(data, predictor, batch_size=500):
    """
    Score rows in batches and return (scores, predicted_labels) as arrays.
    """
    scores = []
    labels = []
    
    for i in range(0, len(data), batch_size):
        batch = data[i:i+batch_size]
        # One {"features": [...]} entry per row
        payload = {"instances": [{"features": row.tolist()} for row in batch]}
        response = predictor.predict(payload)
        
        for pred in response['predictions']:
            scores.append(pred['score'])
            labels.append(pred['predicted_label'])
    
    return np.array(scores), np.array(labels)
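For reference, the sketch below shows two JSON request bodies for a batch of feature rows: a dense form that lists every feature value, and a sparse form that lists only the nonzero positions. The layout follows my understanding of SageMaker's documented JSON request format for built-in algorithms, so treat the exact keys as an assumption to check against the current documentation. `dense_payload` and `sparse_payload` are hypothetical helpers.

```python
import json
import numpy as np

def dense_payload(batch):
    """One {"features": [...]} entry per row."""
    return {"instances": [{"features": row.tolist()} for row in np.asarray(batch)]}

def sparse_payload(batch):
    """Send only nonzero positions: keys (indices), shape, and values per row."""
    instances = []
    for row in np.asarray(batch):
        keys = np.flatnonzero(row)
        instances.append({"data": {"features": {
            "keys": keys.tolist(),
            "shape": [int(row.shape[0])],
            "values": row[keys].tolist(),
        }}})
    return {"instances": instances}

# Toy row: user 0 and item 0 active in a 3-user, 2-item feature space
toy = np.array([[1.0, 0.0, 0.0, 1.0, 0.0]], dtype=np.float32)
print(json.dumps(dense_payload(toy)))
print(json.dumps(sparse_payload(toy)))
```

The sparse form keeps request size proportional to the number of active features, which matters once the user and item vocabularies grow well beyond this exercise's 700 columns.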

In [None]:
# Get predictions
print("Getting predictions...")
scores, y_pred = predict(X_test, fm_predictor)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, scores)

print("\n" + "=" * 50)
print("CLASSIFICATION RESULTS")
print("=" * 50)
print(f"\nAccuracy: {accuracy:.4f}")
print(f"AUC-ROC:  {auc:.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred))

In [None]:
# Score distribution
fig, ax = plt.subplots(figsize=(10, 5))

ax.hist(scores[y_test == 0], bins=50, alpha=0.5, label='Negative', color='blue')
ax.hist(scores[y_test == 1], bins=50, alpha=0.5, label='Positive', color='red')
ax.axvline(x=0.5, color='black', linestyle='--', label='Threshold')
ax.set_xlabel('Prediction Score')
ax.set_ylabel('Count')
ax.set_title('Score Distribution by True Label')
ax.legend()
plt.show()

## Step 6: Clean Up Resources

In [None]:
# Delete the endpoint
print(f"Deleting endpoint: {fm_predictor.endpoint_name}")
fm_predictor.delete_endpoint()
print("Endpoint deleted successfully!")

---

## Summary

In this exercise, you learned:

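1. **Data Format**: Sparse one-hot features with a label per record; the built-in algorithm trains on recordIO-protobuf (Float32) and serves JSON or recordIO-protobuf requests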

2. **Key Hyperparameters**:
   - `num_factors`: Dimensionality of factorization
   - `predictor_type`: binary_classifier or regressor
   - Learning rates for bias, linear, and factor terms

3. **Output**:
   - Classification: score and predicted_label
   - Regression: score (predicted value)
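As a small illustration of the classification output, the sketch below parses a response shaped like the one the notebook's `predict` helper reads and applies a custom threshold to the scores. The response values are invented for the example and do not come from a real endpoint.

```python
import numpy as np

# Hypothetical response in the shape read by the notebook's predict() helper
response = {
    "predictions": [
        {"score": 0.82, "predicted_label": 1},
        {"score": 0.31, "predicted_label": 0},
        {"score": 0.55, "predicted_label": 1},
    ]
}

scores = np.array([p["score"] for p in response["predictions"]])
labels = np.array([p["predicted_label"] for p in response["predictions"]])

# Re-threshold the scores locally, for example to trade recall for precision
custom_labels = (scores >= 0.6).astype(int)
print(labels, custom_labels)
```

Keeping the raw scores is what makes threshold-free metrics such as AUC-ROC possible, as in the evaluation step of this exercise.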

### Factorization Machines vs Other Algorithms

| Aspect | FM | Logistic Regression | Neural Network |
|--------|----|--------------------|----------------|
| Feature interactions | Automatic | Manual | Learned |
| Sparse data | Excellent | Good | OK |
| Training speed | Fast | Very fast | Slower |
| Interpretability | Medium | High | Low |

### Instance Recommendations

| Task | Instance Types |
|------|----------------|
| Training | ml.c5.xlarge, ml.m5.large (CPU recommended) |
| Inference | ml.c5.large, ml.m5.large |

### Best Practices

- Use one-hot encoding for categorical features
- Normalize continuous features
- Start with `num_factors=64`
- Use separate learning rates for different terms
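The first two practices can be sketched as follows: a categorical column is one-hot encoded and a continuous column is standardized before the two are combined. The features here (device type and age) are hypothetical additions, not part of the exercise's dataset, and in practice the mean and standard deviation should be computed on the training split only.

```python
import numpy as np

def standardize(col):
    """Zero-mean, unit-variance scaling; constant columns map to zeros."""
    col = np.asarray(col, dtype=np.float32)
    std = col.std()
    return (col - col.mean()) / std if std > 0 else np.zeros_like(col)

def one_hot(ids, num_categories):
    """One row per sample with a single 1.0 in the column of its category."""
    ids = np.asarray(ids)
    out = np.zeros((len(ids), num_categories), dtype=np.float32)
    out[np.arange(len(ids)), ids] = 1.0
    return out

# Hypothetical extra features: a 3-category device type and the user's age
device = one_hot([0, 2, 1], num_categories=3)
age = standardize([20, 30, 40])
features = np.column_stack([device, age])
print(features.shape)
```

If columns like these were appended to the user and item one-hot block, `feature_dim` would need to grow by the same number of columns.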

### Next Steps

- Try regression for rating prediction
- Add additional features (user demographics, item attributes)
- Compare with other recommendation algorithms
- Use for click-through rate prediction