# Ranking: Feature Engineering

**Phase 1: Exploration**

This notebook creates features for the GBDT ranking model.

## Strategy

- **Simple aggregation features first**: User/movie statistics from training data
- **Prevent data leakage**: Only use training data for aggregations
- **Handle cold-start**: Provide defaults for new users/movies (few are expected, since the splits are filtered)
- **Prepare for GBDT**: XGBoost/LightGBM-friendly features

## Feature Categories (38 columns total: 35 model features + 3 target/ID columns)

1. **User aggregations** (5): rating count, avg, std, min, max
2. **Movie aggregations** (5): rating count, avg, std, min, max
3. **User-movie interactions** (4): rating diff, normalized counts, activity product
4. **Demographics** (3): gender, age, occupation
5. **Genres** (18): multi-hot encoding of 18 movie genres
6. **Target + IDs** (3): rating, user_id, movie_id

## 1. Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import json

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Paths
DATA_DIR = Path('../../data/splits')
FEATURES_DIR = Path('.')  # Save features in the same directory as notebook

## 2. Load Data

In [None]:
# Load splits
train_ratings = pd.read_parquet(DATA_DIR / 'train_ratings.parquet')
val_ratings = pd.read_parquet(DATA_DIR / 'val_ratings.parquet')
test_ratings = pd.read_parquet(DATA_DIR / 'test_ratings.parquet')

# Load metadata
movies = pd.read_parquet(DATA_DIR / 'movies.parquet')
users = pd.read_parquet(DATA_DIR / 'users.parquet')

print("Data loaded:")
print(f"Train: {len(train_ratings):,} ratings")
print(f"Val:   {len(val_ratings):,} ratings")
print(f"Test:  {len(test_ratings):,} ratings")
print(f"Movies: {len(movies):,}")
print(f"Users:  {len(users):,}")

## 3. Feature Schema Definition

Define exactly which features we'll create and their interpretations.

In [None]:
# Feature schema documentation
FEATURE_SCHEMA = {
    "user_aggregations": {
        "user_rating_count": "Number of ratings user has given (activity level)",
        "user_avg_rating": "User's average rating (generosity/strictness)",
        "user_rating_std": "Std dev of user's ratings (consistency)",
        "user_rating_min": "Minimum rating given by user",
        "user_rating_max": "Maximum rating given by user"
    },
    "movie_aggregations": {
        "movie_rating_count": "Number of ratings movie received (popularity)",
        "movie_avg_rating": "Movie's average rating (perceived quality)",
        "movie_rating_std": "Std dev of movie ratings (controversy/polarization)",
        "movie_rating_min": "Minimum rating received by movie",
        "movie_rating_max": "Maximum rating received by movie"
    },
    "interactions": {
        "user_movie_rating_diff": "Difference: user_avg - movie_avg",
        "user_rating_count_norm": "Log-transformed user rating count",
        "movie_rating_count_norm": "Log-transformed movie rating count",
        "user_movie_activity_product": "Product of normalized counts"
    },
    "demographics": {
        "gender": "User gender (M=1, F=0)",
        "age_group": "User age group (1, 18, 25, 35, 45, 50, 56)",
        "occupation": "User occupation code (0-20)"
    },
    "genres": "18 binary features for movie genres"
}

print("Feature Schema:")
for category, features in FEATURE_SCHEMA.items():
    if isinstance(features, dict):
        print(f"\n{category}:")
        for name, desc in features.items():
            print(f"  {name}: {desc}")
    else:
        print(f"\n{category}: {features}")
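
One way the schema could be put to work is as a checklist against a built feature frame. The sketch below assumes a hypothetical helper, `missing_schema_columns`, and toy inputs that are not part of this notebook; it only inspects dict-valued groups, since the `genres` entry is a plain description string.

```python
import pandas as pd

def missing_schema_columns(df, schema):
    """Return schema feature names (from dict-valued groups) absent from df."""
    expected = [
        name
        for group in schema.values()
        if isinstance(group, dict)      # skip string-only entries such as "genres"
        for name in group
    ]
    return [name for name in expected if name not in df.columns]

# Toy schema and frame, invented for illustration
toy_schema = {
    "user_aggregations": {
        "user_rating_count": "Number of ratings",
        "user_avg_rating": "Average rating",
    },
    "genres": "18 binary features for movie genres",
}
toy_df = pd.DataFrame({"user_rating_count": [3]})

missing_cols = missing_schema_columns(toy_df, toy_schema)
print(missing_cols)   # ['user_avg_rating']
```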

## 4. User Aggregation Features

**Critical**: Computed ONLY from training data to prevent data leakage.
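
A minimal sketch of the train-only rule on toy data (the frames and values are invented for illustration): statistics are aggregated from training rows and then left-joined onto validation rows, so validation ratings never influence the features.

```python
import pandas as pd

# Toy data, invented for illustration only
toy_train = pd.DataFrame({
    'user_id': [1, 1, 2],
    'rating':  [4, 2, 5],
})
toy_val = pd.DataFrame({
    'user_id': [1, 2],
    'rating':  [1, 1],   # deliberately low: must not affect the features
})

# Aggregate on train only
toy_user_stats = (
    toy_train.groupby('user_id')['rating']
    .agg(user_rating_count='count', user_avg_rating='mean')
    .reset_index()
)

# Join the frozen train statistics onto validation rows
toy_val_features = toy_val.merge(toy_user_stats, on='user_id', how='left')
print(toy_val_features)
```

The validation ratings of 1 leave `user_avg_rating` untouched: user 1 keeps the train mean of 3.0 and user 2 keeps 5.0.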

In [None]:
# Compute user statistics from TRAIN ONLY
user_stats = train_ratings.groupby('user_id')['rating'].agg([
    ('user_rating_count', 'count'),
    ('user_avg_rating', 'mean'),
    ('user_rating_std', 'std'),
    ('user_rating_min', 'min'),
    ('user_rating_max', 'max')
]).reset_index()

# Fill NaN std (happens when user has only 1 rating)
user_stats['user_rating_std'] = user_stats['user_rating_std'].fillna(0)

print(f"User statistics computed for {len(user_stats):,} users")
print("\nUser features summary:")
print(user_stats.describe())

# Visualize distributions
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for idx, col in enumerate(['user_rating_count', 'user_avg_rating', 'user_rating_std', 
                           'user_rating_min', 'user_rating_max']):
    ax = axes[idx // 3, idx % 3]
    user_stats[col].hist(bins=50, ax=ax)
    ax.set_title(col)
    ax.set_xlabel('Value')
    ax.set_ylabel('Count')
axes[1, 2].axis('off')
plt.tight_layout()
plt.show()

## 5. Movie Aggregation Features

**Critical**: Computed ONLY from training data.

In [None]:
# Compute movie statistics from TRAIN ONLY
movie_stats = train_ratings.groupby('movie_id')['rating'].agg([
    ('movie_rating_count', 'count'),
    ('movie_avg_rating', 'mean'),
    ('movie_rating_std', 'std'),
    ('movie_rating_min', 'min'),
    ('movie_rating_max', 'max')
]).reset_index()

# Fill NaN std (happens when movie has only 1 rating)
movie_stats['movie_rating_std'] = movie_stats['movie_rating_std'].fillna(0)

print(f"Movie statistics computed for {len(movie_stats):,} movies")
print("\nMovie features summary:")
print(movie_stats.describe())

# Visualize distributions
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for idx, col in enumerate(['movie_rating_count', 'movie_avg_rating', 'movie_rating_std',
                           'movie_rating_min', 'movie_rating_max']):
    ax = axes[idx // 3, idx % 3]
    movie_stats[col].hist(bins=50, ax=ax)
    ax.set_title(col)
    ax.set_xlabel('Value')
    ax.set_ylabel('Count')
axes[1, 2].axis('off')
plt.tight_layout()
plt.show()
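
The `fillna(0)` step in the user and movie cells is needed because pandas computes the sample standard deviation (`ddof=1`) by default, which is undefined for a single observation. A small illustration on made-up values:

```python
import pandas as pd

single = pd.Series([4.0])        # an entity with exactly one rating
several = pd.Series([4.0, 2.0])  # an entity with two ratings

print(single.std())              # NaN: sample std (ddof=1) needs at least 2 values
print(single.std(ddof=0))        # 0.0: population std is defined for one value
print(several.std())             # sqrt(2), about 1.414

# Same treatment as in the cells above: replace the undefined std with 0
filled = pd.Series([single.std(), several.std()]).fillna(0)
print(filled.tolist())
```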

## 6. Cold-Start Defaults

Define defaults for users/movies not seen in training data.
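
As a rough sketch of how such defaults take effect (toy frames and the 3.5 stand-in for a global mean are assumptions for illustration): a left join leaves `NaN` for unseen users, and a dict passed to `fillna` fills each listed column with its own default.

```python
import pandas as pd

# Toy frames, invented for illustration
toy_stats = pd.DataFrame({
    'user_id': [1],
    'user_rating_count': [3],
    'user_avg_rating': [4.0],
})
toy_ratings = pd.DataFrame({'user_id': [1, 99]})   # user 99 is unseen in training

# 3.5 stands in for a global mean rating
toy_defaults = {'user_rating_count': 0, 'user_avg_rating': 3.5}

merged = toy_ratings.merge(toy_stats, on='user_id', how='left')
merged = merged.fillna(value=toy_defaults)   # dict keys select which columns to fill
print(merged)
```

A count of 0 doubles as a cold-start flag that a tree model can split on.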

In [None]:
# Compute global defaults from training data
global_mean_rating = train_ratings['rating'].mean()
global_std_rating = train_ratings['rating'].std()
global_min_rating = train_ratings['rating'].min()
global_max_rating = train_ratings['rating'].max()

cold_start_defaults = {
    'user_rating_count': 0,  # Flag as new user
    'user_avg_rating': global_mean_rating,
    'user_rating_std': global_std_rating,
    'user_rating_min': global_min_rating,
    'user_rating_max': global_max_rating,
    'movie_rating_count': 0,  # Flag as new movie
    'movie_avg_rating': global_mean_rating,
    'movie_rating_std': global_std_rating,
    'movie_rating_min': global_min_rating,
    'movie_rating_max': global_max_rating
}

print("Cold-start defaults:")
for key, value in cold_start_defaults.items():
    print(f"  {key}: {value:.3f}")

## 7. Genre Features

Multi-hot encoding of 18 movie genres.
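
For reference, a compact alternative sketch of multi-hot encoding using pandas' `Series.str.get_dummies`, shown on two invented rows. It splits on the `|` separator, so each genre is matched as a whole token. The column-name cleaning mirrors the convention used in this notebook.

```python
import pandas as pd

# Two toy rows, invented for illustration
toy_movies = pd.DataFrame({
    'movie_id': [1, 2],
    'genres': ["Animation|Children's|Comedy", 'Sci-Fi|Thriller'],
})

# One 0/1 column per distinct token between the '|' separators
multi_hot = toy_movies['genres'].str.get_dummies(sep='|')
multi_hot.columns = [
    'genre_' + name.lower().replace('-', '_').replace("'", '')
    for name in multi_hot.columns
]
print(multi_hot)
```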

In [None]:
# Extract all unique genres
all_genres = set()
for genres_str in movies['genres']:
    all_genres.update(genres_str.split('|'))

all_genres = sorted(all_genres)
print(f"Found {len(all_genres)} unique genres: {all_genres}")

# Create binary features for each genre
movies_with_genres = movies.copy()
genre_lists = movies_with_genres['genres'].str.split('|')
for genre in all_genres:
    # Clean genre name for column name (e.g. "Sci-Fi" -> "sci_fi")
    genre_clean = genre.lower().replace('-', '_').replace("'", '')
    col_name = f'genre_{genre_clean}'
    # Match whole genre tokens so one genre name cannot match inside another
    movies_with_genres[col_name] = genre_lists.apply(lambda gs: genre in gs).astype(int)

# Get list of genre columns
genre_cols = [col for col in movies_with_genres.columns if col.startswith('genre_')]
print(f"\nCreated {len(genre_cols)} genre features: {genre_cols}")

# Show genre distribution
genre_counts = movies_with_genres[genre_cols].sum().sort_values(ascending=False)
print("\nGenre distribution:")
print(genre_counts)

# Visualize
plt.figure(figsize=(12, 6))
genre_counts.plot(kind='barh')
plt.title('Movie Count by Genre')
plt.xlabel('Number of Movies')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

## 8. Demographic Features

Encode user demographics for cold-start and diversity.

In [None]:
# Encode gender: M=1, F=0
users_encoded = users.copy()
users_encoded['gender'] = (users_encoded['gender'] == 'M').astype(int)

# age and occupation are already numeric, keep as-is
# Rename age to age_group for clarity
users_encoded = users_encoded.rename(columns={'age': 'age_group'})

# Demographic columns (names as they exist after the rename)
demographic_cols = ['gender', 'age_group', 'occupation']

print("Demographic features:")
print(f"  gender: binary (M=1, F=0)")
print(f"  age_group: {sorted(users_encoded['age_group'].unique())}")
print(f"  occupation: {sorted(users_encoded['occupation'].unique())}")

print("\nDemographic distributions:")
print(users_encoded[['gender', 'age_group', 'occupation']].describe())

## 9. Feature Materialization Function

Build features for any ratings dataframe (train/val/test).

In [None]:
def build_features(ratings_df, user_stats, movie_stats, movies_df, users_df, defaults):
    """
    Build feature matrix for a ratings dataframe.
    
    Args:
        ratings_df: Ratings to build features for
        user_stats: Pre-computed user statistics (from train only)
        movie_stats: Pre-computed movie statistics (from train only)
        movies_df: Movie metadata with genre features
        users_df: User demographics
        defaults: Cold-start default values
    
    Returns:
        DataFrame with all features
    """
    print(f"Building features for {len(ratings_df):,} ratings...")
    
    # Start with ratings
    features = ratings_df.copy()
    
    # 1. Merge user features (left join to handle cold-start)
    features = features.merge(user_stats, on='user_id', how='left')
    
    # 2. Merge movie features (left join)
    features = features.merge(movie_stats, on='movie_id', how='left')
    
    # 3. Fill NaN with cold-start defaults
    for col, default_val in defaults.items():
        if col in features.columns:
            features[col] = features[col].fillna(default_val)
    
    # 4. Merge demographics
    features = features.merge(
        users_df[['user_id', 'gender', 'age_group', 'occupation']], 
        on='user_id', 
        how='left'
    )
    
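    # 5. Merge genres (keep only genre columns + movie_id for merge)
    # Derive the genre columns from movies_df instead of relying on a notebook global
    genre_feature_cols = [col for col in movies_df.columns if col.startswith('genre_')]
    genre_features = movies_df[['movie_id'] + genre_feature_cols]
    features = features.merge(genre_features, on='movie_id', how='left')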
    
    # 6. Compute interaction features
    features['user_movie_rating_diff'] = features['user_avg_rating'] - features['movie_avg_rating']
    features['user_rating_count_norm'] = np.log1p(features['user_rating_count'])
    features['movie_rating_count_norm'] = np.log1p(features['movie_rating_count'])
    features['user_movie_activity_product'] = features['user_rating_count_norm'] * features['movie_rating_count_norm']
    
    # 7. Report any remaining missing values, then fill them
    missing = features.isnull().sum()
    if missing.sum() > 0:
        print("\nWarning: Missing values found:")
        print(missing[missing > 0])
        # Fill any remaining NaN with 0 (e.g., for genre features of new movies)
        features = features.fillna(0)
    
    print(f"Features built: {features.shape}")
    print(f"Columns: {len(features.columns)}")
    
    return features
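
To make step 6 concrete, here is a small worked example of the interaction features on invented values. The second row mimics a cold-start pair with zero counts, where `log1p(0) = 0` keeps the normalized counts and their product at zero.

```python
import numpy as np
import pandas as pd

# Toy rows, invented for illustration
toy = pd.DataFrame({
    'user_avg_rating':    [4.0, 3.5],
    'movie_avg_rating':   [3.0, 3.5],
    'user_rating_count':  [99, 0],   # second row: cold-start user
    'movie_rating_count': [9, 0],    # second row: cold-start movie
})

toy['user_movie_rating_diff'] = toy['user_avg_rating'] - toy['movie_avg_rating']
toy['user_rating_count_norm'] = np.log1p(toy['user_rating_count'])    # log(1 + count)
toy['movie_rating_count_norm'] = np.log1p(toy['movie_rating_count'])
toy['user_movie_activity_product'] = (
    toy['user_rating_count_norm'] * toy['movie_rating_count_norm']
)
print(toy.round(3))
```

For the first row, `log1p(99) = ln(100)`, about 4.605, and `log1p(9) = ln(10)`, about 2.303, giving an activity product of about 10.604.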

## 10. Materialize Features for All Splits

In [None]:
# Build features for train/val/test
train_features = build_features(
    train_ratings, user_stats, movie_stats, 
    movies_with_genres, users_encoded, cold_start_defaults
)

val_features = build_features(
    val_ratings, user_stats, movie_stats,
    movies_with_genres, users_encoded, cold_start_defaults
)

test_features = build_features(
    test_ratings, user_stats, movie_stats,
    movies_with_genres, users_encoded, cold_start_defaults
)

print("\n" + "="*60)
print("Feature Materialization Complete")
print("="*60)
print(f"Train: {train_features.shape}")
print(f"Val:   {val_features.shape}")
print(f"Test:  {test_features.shape}")

## 11. Data Leakage Validation

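Verify no future information leaked into features: aggregates must be identical across splits (frozen from train), cold-start rows must carry the defaults, and no missing values may remain.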

In [None]:
# Check 1: User/movie stats should be identical across splits (frozen from train)
print("Leakage Check 1: User stats consistency")
sample_user = train_features['user_id'].iloc[0]
train_user_avg = train_features[train_features['user_id'] == sample_user]['user_avg_rating'].iloc[0]

if sample_user in val_features['user_id'].values:
    val_user_avg = val_features[val_features['user_id'] == sample_user]['user_avg_rating'].iloc[0]
    print(f"Sample user {sample_user}: train_avg={train_user_avg:.3f}, val_avg={val_user_avg:.3f}")
    assert abs(train_user_avg - val_user_avg) < 0.001, "User stats differ between splits!"
    print("✓ User stats are consistent (frozen from train)")
else:
    print(f"Sample user {sample_user} not in val; consistency check skipped")

# Check 2: Cold-start users should have default values
print("\nLeakage Check 2: Cold-start handling")
# Each row is a rating, so count distinct users rather than rows
cold_start_rows = val_features[val_features['user_rating_count'] == 0]
print(f"Cold-start users in val: {cold_start_rows['user_id'].nunique()} ({len(cold_start_rows)} ratings)")
if len(cold_start_rows) > 0:
    print("Sample cold-start user features:")
    print(cold_start_rows[['user_avg_rating', 'user_rating_std', 'user_rating_count']].head())
else:
    print("✓ No cold-start users in val (as expected with filtered splits)")

# Check 3: No NaN values
print("\nLeakage Check 3: No missing values")
for name, df in [("Train", train_features), ("Val", val_features), ("Test", test_features)]:
    missing = df.isnull().sum().sum()
    print(f"{name}: {missing} missing values")
    assert missing == 0, f"Found missing values in {name}"
print("✓ No missing values in any split")

print("\n" + "="*60)
print("✓ All leakage checks passed!")
print("="*60)
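
The first check above inspects a single sampled user. A possible extension, sketched here with a hypothetical helper `stats_match_train` and toy frames, compares every row against the train-derived statistics and skips cold-start keys through an inner join.

```python
import numpy as np
import pandas as pd

def stats_match_train(features, train_stats, key, cols):
    """Return True if every row's `cols` equal the train-derived stats for its `key`.

    Rows whose key is absent from `train_stats` (cold-start) are skipped.
    """
    merged = features[[key] + cols].merge(
        train_stats[[key] + cols], on=key, how='inner', suffixes=('', '_train')
    )
    return all(np.allclose(merged[c], merged[c + '_train']) for c in cols)

# Toy data, invented for illustration; user 7 is a cold-start user
toy_stats = pd.DataFrame({'user_id': [1, 2], 'user_avg_rating': [3.0, 5.0]})
toy_val = pd.DataFrame({'user_id': [1, 2, 7], 'user_avg_rating': [3.0, 5.0, 4.2]})
toy_bad = pd.DataFrame({'user_id': [1, 2], 'user_avg_rating': [3.1, 5.0]})

print(stats_match_train(toy_val, toy_stats, 'user_id', ['user_avg_rating']))  # True
print(stats_match_train(toy_bad, toy_stats, 'user_id', ['user_avg_rating']))  # False
```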

## 12. Feature Quality Checks

In [None]:
# Check 1: Feature distributions across splits
print("Feature Distribution Comparison")
print("="*60)

key_features = ['user_rating_count', 'user_avg_rating', 'movie_rating_count', 'movie_avg_rating']

for feature in key_features:
    print(f"\n{feature}:")
    print(f"  Train: mean={train_features[feature].mean():.2f}, std={train_features[feature].std():.2f}")
    print(f"  Val:   mean={val_features[feature].mean():.2f}, std={val_features[feature].std():.2f}")
    print(f"  Test:  mean={test_features[feature].mean():.2f}, std={test_features[feature].std():.2f}")

In [None]:
# Check 2: Correlation with target
print("\nFeature Correlation with Target (rating)")
print("="*60)

# Compute correlations on train set
feature_cols = [col for col in train_features.columns 
                if col not in ['user_id', 'movie_id', 'rating', 'timestamp', 'title', 'genres', 'zip_code']]

correlations = train_features[feature_cols + ['rating']].corr()['rating'].drop('rating').sort_values(ascending=False)

print("\nTop 10 positive correlations:")
print(correlations.head(10))

print("\nTop 10 negative correlations:")
print(correlations.tail(10))

# Visualize
plt.figure(figsize=(10, 8))
correlations.sort_values().plot(kind='barh')
plt.title('Feature Correlation with Rating')
plt.xlabel('Correlation')
plt.axvline(x=0, color='red', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

In [None]:
# Check 3: Feature correlation matrix (identify redundancy)
print("\nFeature Correlation Matrix (top features)")

# Select subset of features for visualization
top_features = ['user_rating_count', 'user_avg_rating', 'user_rating_std',
                'movie_rating_count', 'movie_avg_rating', 'movie_rating_std',
                'user_movie_rating_diff', 'user_rating_count_norm', 'movie_rating_count_norm',
                'gender', 'age_group', 'occupation']

corr_matrix = train_features[top_features].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0, 
            square=True, linewidths=1)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

# Identify highly correlated pairs (>0.9)
high_corr = []
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        if abs(corr_matrix.iloc[i, j]) > 0.9:
            high_corr.append((corr_matrix.columns[i], corr_matrix.columns[j], corr_matrix.iloc[i, j]))

if high_corr:
    print("\nHighly correlated feature pairs (|r| > 0.9):")
    for feat1, feat2, corr in high_corr:
        print(f"  {feat1} <-> {feat2}: {corr:.3f}")
else:
    print("\n✓ No highly correlated features (|r| > 0.9)")
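
The nested loop above can also be written with a boolean upper-triangle mask. The sketch below uses a hypothetical helper `high_corr_pairs` on a toy frame, where `count_log` is a monotone transform of `count` and therefore strongly correlated with it.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(corr, threshold=0.9):
    """List (feature_a, feature_b, r) for pairs with |r| above the threshold."""
    # Upper triangle only (k=1 excludes the diagonal), so each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    mask = upper & (corr.abs().to_numpy() > threshold)
    rows, cols = np.where(mask)
    return [
        (corr.index[i], corr.columns[j], float(corr.iat[i, j]))
        for i, j in zip(rows, cols)
    ]

# Toy frame, invented for illustration
toy = pd.DataFrame({
    'count': [1, 2, 3, 4, 5, 6],
    'noise': [3, 1, 3, 1, 3, 1],
})
toy['count_log'] = np.log1p(toy['count'])

pairs = high_corr_pairs(toy.corr())
print(pairs)
```

In the notebook's own feature set, raw counts and their `log1p` versions are the likeliest candidates for such a pair; tree-based models generally tolerate this redundancy, so flagged pairs are informational.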

## 13. Save Features

In [None]:
# Save as parquet
print("Saving features...")

train_features.to_parquet(FEATURES_DIR / 'train_features.parquet', index=False)
val_features.to_parquet(FEATURES_DIR / 'val_features.parquet', index=False)
test_features.to_parquet(FEATURES_DIR / 'test_features.parquet', index=False)

print("\nSaved files:")
for file in sorted(FEATURES_DIR.glob('*.parquet')):
    size_mb = file.stat().st_size / (1024 * 1024)
    print(f"  {file.name:30} ({size_mb:6.2f} MB)")

In [None]:
# Save feature metadata
metadata = {
    "version": "1.0",
    "created_at": pd.Timestamp.now().isoformat(),
    "train_timestamp_max": int(train_ratings['timestamp'].max()),
    "num_features": len(feature_cols),
    "feature_groups": {
        "user_agg": list(FEATURE_SCHEMA["user_aggregations"].keys()),
        "movie_agg": list(FEATURE_SCHEMA["movie_aggregations"].keys()),
        "interaction": list(FEATURE_SCHEMA["interactions"].keys()),
        "demographic": list(FEATURE_SCHEMA["demographics"].keys()),
        "genre": genre_cols
    },
    "cold_start_defaults": {k: float(v) for k, v in cold_start_defaults.items()},
    "split_sizes": {
        "train": len(train_features),
        "val": len(val_features),
        "test": len(test_features)
    }
}

with open(FEATURES_DIR / 'feature_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)

print("\n✓ Metadata saved")

In [None]:
# Save README
readme_content = f"""# Ranking Model Features

## Feature Set v1.0

Created: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}

## Files

- `train_features.parquet`: Training features ({len(train_features):,} rows)
- `val_features.parquet`: Validation features ({len(val_features):,} rows)
- `test_features.parquet`: Test features ({len(test_features):,} rows)
- `feature_metadata.json`: Feature schema and metadata

## Feature Categories ({len(feature_cols)} features total)

### User Aggregation Features (5)
Computed from training data only:
- user_rating_count: Number of ratings given
- user_avg_rating: Mean rating given
- user_rating_std: Std dev of ratings
- user_rating_min: Minimum rating
- user_rating_max: Maximum rating

### Movie Aggregation Features (5)
Computed from training data only:
- movie_rating_count: Number of ratings received (popularity)
- movie_avg_rating: Mean rating received (quality)
- movie_rating_std: Std dev (polarization)
- movie_rating_min: Minimum rating
- movie_rating_max: Maximum rating

### User-Movie Interaction Features (4)
- user_movie_rating_diff: user_avg - movie_avg
- user_rating_count_norm: log1p(user_rating_count)
- movie_rating_count_norm: log1p(movie_rating_count)
- user_movie_activity_product: user_norm * movie_norm

### Demographic Features (3)
- gender: Binary (M=1, F=0)
- age_group: 7 age bins (1, 18, 25, 35, 45, 50, 56)
- occupation: 21 occupation codes (0-20)

### Genre Features ({len(genre_cols)})
Multi-hot encoding of movie genres:
{', '.join(genre_cols)}

## Data Leakage Prevention

- User/movie aggregations computed **only from training data**
- Same statistics applied to val/test (frozen at train time)
- Cold-start users/movies receive global defaults
- No future information used in feature computation

## Usage

```python
import pandas as pd

# Load features
train = pd.read_parquet('train_features.parquet')
val = pd.read_parquet('val_features.parquet')
test = pd.read_parquet('test_features.parquet')

# Separate features and target
feature_cols = [col for col in train.columns if col not in ['user_id', 'movie_id', 'rating', 'timestamp', 'title', 'genres', 'zip_code']]
X_train = train[feature_cols]
y_train = train['rating']
```
"""

with open(FEATURES_DIR / 'README.md', 'w') as f:
    f.write(readme_content)

print("✓ README saved")

## 14. Summary

Features successfully created and saved!

In [None]:
print("="*60)
print("FEATURE ENGINEERING COMPLETE")
print("="*60)
print(f"\nTotal features: {len(feature_cols)}")
print(f"\nFeature breakdown:")
print(f"  User aggregations:     5")
print(f"  Movie aggregations:    5")
print(f"  Interactions:          4")
print(f"  Demographics:          3")
print(f"  Genres:               {len(genre_cols)}")
print(f"\nDatasets:")
print(f"  Train: {train_features.shape}")
print(f"  Val:   {val_features.shape}")
print(f"  Test:  {test_features.shape}")
print(f"\nFiles saved to: {FEATURES_DIR}")
print(f"\nNext step: Train GBDT ranking model using these features!")