# Feature Engineering for Player Churn Dataset

This notebook prepares the `player_churn.csv` dataset for churn modeling. Feature engineering is a critical step in the machine learning pipeline: choosing relevant features and putting them in a model-ready form can improve both model performance and interpretability.

In this notebook, we'll focus on:
1. Loading the player churn dataset
2. Selecting the most important features based on domain knowledge
3. Handling missing values and encoding categorical features
4. Exploring the selected features
5. Balancing the classes and converting data types
6. Storing the processed dataset in SageMaker Feature Store for model training

In [1]:
%pip install scikit-learn "pandas>=2.0.0" s3fs==0.4.2 sagemaker xgboost mlflow==2.13.2 sagemaker-mlflow==0.1.0 seaborn

Looking in indexes: https://pypi.org/simple, https://plugin.us-east-1.prod.workshops.aws
Note: you may need to restart the kernel to use updated packages.


## Setup Environment

First, we'll import the necessary libraries for data manipulation, analysis, and visualization.

In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.utils import resample
import boto3
import time
from datetime import datetime

# Set display options
pd.set_option('display.max_columns', None)

## Load the Dataset

We'll load the player_churn.csv file which contains player behavior data and churn information. This dataset includes various metrics about player sessions, engagement patterns, and whether they churned (stopped playing).

In [None]:
# Load the dataset
df = pd.read_csv('data/player_churn.csv')

# Display basic information
print(f"Dataset shape: {df.shape}")
df.head()

## Feature Engineering

In this step, we'll select only the most relevant features for our churn prediction model. These features were likely identified through domain expertise or previous analysis as being the most predictive of player churn.

The selected features include:
- Player identification features (`player_id`, `cohort_id`, `player_type`)
- Temporal features (`cohort_day_of_week`)
- Engagement metrics (`player_lifetime`, `session_count`)
- Session timing patterns (various time-of-day metrics)
- Target variable (`player_churn`)

By focusing on these specific features, we can build a more efficient and interpretable model.

In [None]:
# Define columns to keep
cols = ['cohort_day_of_week',
       'begin_session_time_of_day_std_last_week_1',
       'player_lifetime',
       'begin_session_time_of_day_mean_last_day_1',
       'end_session_time_of_day_mean_last_week_1',
       'begin_session_time_of_day_mean_last_week_1',
       'cohort_id',
       'player_type',
       'begin_session_time_of_day_std_last_day_1',
       'end_session_time_of_day_mean_last_day_1',
       'begin_session_time_of_day_mean_last_day_2',
       'end_session_time_of_day_std_last_week_1',
       'end_session_time_of_day_std_last_day_1',
       'session_count',
       'end_session_time_of_day_mean_last_day_3',
       'begin_session_time_of_day_mean_last_day_3',
       'end_session_time_of_day_mean_last_day_2',
       'player_churn',
       'player_id']

# Check if all columns exist in the dataset
missing_cols = [col for col in cols if col not in df.columns]
if missing_cols:
    print(f"Warning: The following columns are not in the dataset: {missing_cols}")
    # Keep only columns that exist in the dataset
    cols = [col for col in cols if col in df.columns]

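# Select only the specified columns
# (.copy() makes an independent frame, so later assignments don't trigger SettingWithCopyWarning)
df_selected = df[cols].copy()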

# Display the resulting dataframe
print(f"Selected dataset shape: {df_selected.shape}")
df_selected.head()

## Handle Missing Values

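For the time-of-day standard deviation features, we'll fill missing values with 0. A missing standard deviation most likely means there were too few sessions in the window to compute one (for example, a sample standard deviation is undefined for a single session), so treating it as zero variation is a reasonable default. Note that consistent session times would already yield a standard deviation of 0 rather than a missing value.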

In [None]:
# List of columns to fill with 0
fill_zero_cols = [
    'begin_session_time_of_day_std_last_week_1',
    'begin_session_time_of_day_std_last_day_1',
    'end_session_time_of_day_std_last_week_1',
    'end_session_time_of_day_std_last_day_1'
]

# Fill missing values with 0 for specified columns
for col in fill_zero_cols:
    if col in df_selected.columns:
        # Count missing values before filling
        missing_count = df_selected[col].isnull().sum()
        if missing_count > 0:
            print(f"Filling {missing_count} missing values with 0 in column: {col}")
            # df_selected[col] = df_selected[col].fillna(0)
            df_selected.loc[:, col] = df_selected[col].fillna(0)

# Verify the missing values were filled
missing_after = {col: df_selected[col].isnull().sum() for col in fill_zero_cols if col in df_selected.columns}
print("\nRemaining missing values after filling:")
for col, count in missing_after.items():
    print(f"{col}: {count}")
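
As a quick check of the rationale above, the following sketch uses made-up session times (not values from the dataset) to show that pandas' default sample standard deviation is undefined for a single observation. Whether the dataset's upstream pipeline computed these columns the same way is an assumption.

```python
import pandas as pd

# Hypothetical session start times (hour of day) for two players
one_session = pd.Series([14.5])
same_time_sessions = pd.Series([14.5, 14.5, 14.5])

# Sample standard deviation (ddof=1) is undefined for one observation
std_one = one_session.std()
# Identical session times give a defined standard deviation of 0
std_same = same_time_sessions.std()

print(f"single session: {std_one}")
print(f"identical times: {std_same}")

# Filling the undefined value with 0 treats "too few sessions" as "no variation"
filled = pd.Series([std_one, std_same]).fillna(0)
print(filled.tolist())
```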

## One-Hot Encoding

We'll apply one-hot encoding to categorical variables to convert them into a format that can be provided to machine learning algorithms. This process creates binary columns for each category in the original categorical columns.

In [None]:
# Check unique values in categorical columns before encoding
if 'player_type' in df_selected.columns:
    print(f"Unique values in player_type: {df_selected['player_type'].nunique()}")
    print(df_selected['player_type'].value_counts())
    
if 'cohort_id' in df_selected.columns:
    print(f"\nUnique values in cohort_id: {df_selected['cohort_id'].nunique()}")
    print(df_selected['cohort_id'].value_counts().head())  # Show only top values if many

In [None]:
# Apply one-hot encoding
# Create dummy variables for player_type and cohort_id
cols_to_encode = ['player_type', 'cohort_id']
encoded_cols = [col for col in cols_to_encode if col in df_selected.columns]

if encoded_cols:
    # Get one-hot encoding
    df_encoded = pd.get_dummies(df_selected, columns=encoded_cols, prefix=encoded_cols, dtype=int)
    
    # Display information about the encoded dataset
    print(f"Shape before encoding: {df_selected.shape}")
    print(f"Shape after encoding: {df_encoded.shape}")
    print(f"New columns added: {df_encoded.shape[1] - df_selected.shape[1]}")
    
    # Update our working dataframe
    df_selected = df_encoded
    
    # Show a sample of the encoded columns
    encoded_column_names = [col for col in df_selected.columns if any(col.startswith(prefix + '_') for prefix in encoded_cols)]
    print("\nSample of encoded columns:")
    print(encoded_column_names[:10])  # Show first 10 encoded columns
    
    # Display the first few rows of the encoded dataframe
    # (display() is needed because a bare expression inside an if block is not rendered)
    display(df_selected.head())
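
To make the effect of one-hot encoding concrete, here is a minimal sketch on a toy frame. The category values (`casual`, `hardcore`) are made up for illustration and may not match the real `player_type` values.

```python
import pandas as pd

# Toy frame; the category values here are made up for illustration
toy = pd.DataFrame({
    "player_type": ["casual", "hardcore", "casual"],
    "session_count": [3, 12, 5],
})

# Each distinct category becomes its own 0/1 column, prefixed with the source column name
toy_encoded = pd.get_dummies(toy, columns=["player_type"], prefix=["player_type"], dtype=int)
print(toy_encoded)
```

Each distinct value adds one column, so a high-cardinality column such as `cohort_id` can widen the dataset considerably.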

## Data Exploration After Feature Selection

Now that we've selected our features, handled missing values, and encoded categorical variables, let's explore the dataset to better understand the data we'll be working with.

### Missing Value Analysis

Let's check if there are any remaining missing values in our selected features. Missing values can significantly impact model performance and need to be handled appropriately.

In [None]:
# Check for missing values
missing_values = df_selected.isnull().sum()
print("Missing values per column:")
print(missing_values[missing_values > 0])

### Statistical Summary

Let's examine the basic statistics of our numerical features to understand their distributions, ranges, and potential outliers.

In [None]:
# Basic statistics of numerical columns
df_selected.describe()

### Target Variable Analysis

Understanding the distribution of our target variable (player_churn) is crucial for model development. An imbalanced distribution might require special handling techniques during model training.

In [None]:
# Distribution of target variable
if 'player_churn' in df_selected.columns:
    plt.figure(figsize=(8, 6))
    sns.countplot(x='player_churn', data=df_selected)
    plt.title('Distribution of Player Churn')
    plt.xlabel('Player Churn (0 = No, 1 = Yes)')
    plt.ylabel('Count')
    plt.show()
    
    # Calculate churn rate
    churn_rate = df_selected['player_churn'].mean() * 100
    print(f"Churn rate: {churn_rate:.2f}%")
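
One common alternative to resampling is to reweight the classes during training. The sketch below computes the negative-to-positive ratio on made-up labels; XGBoost's `scale_pos_weight` parameter is typically set to a value like this. It is shown for comparison only and is not what this notebook does.

```python
import pandas as pd

# Made-up labels: 8 retained players, 2 churned
labels = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], name="player_churn")

counts = labels.value_counts()
n_negative = int(counts.get(0, 0))
n_positive = int(counts.get(1, 0))

# Ratio of negative to positive examples
imbalance_ratio = n_negative / n_positive
print(f"negatives={n_negative}, positives={n_positive}, ratio={imbalance_ratio:.1f}")
```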

## Balance the Dataset

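To improve model performance on the minority class, we'll balance the dataset using random oversampling. This technique randomly duplicates samples from the minority class (churned players) until the classes reach a 1:1 ratio. Because the duplicates are exact copies, they should end up only in training data: splitting into training and test sets after oversampling can place copies of the same row in both sets and inflate evaluation scores.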

In [None]:
# Separate majority and minority classes
df_majority = df_selected[df_selected['player_churn'] == 0]
df_minority = df_selected[df_selected['player_churn'] == 1]

print(f"Before oversampling:\n"
      f"Number of non-churned players (majority): {len(df_majority)}\n"
      f"Number of churned players (minority): {len(df_minority)}")

# Oversample minority class
df_minority_oversampled = resample(df_minority, 
                                   replace=True,     # sample with replacement
                                   n_samples=len(df_majority),    # match majority class
                                   random_state=42)  # reproducible results

# Combine majority class with oversampled minority class
df_balanced = pd.concat([df_majority, df_minority_oversampled])

# Display new class distribution
print(f"\nAfter oversampling:\n"
      f"Number of non-churned players: {len(df_balanced[df_balanced['player_churn'] == 0])}\n"
      f"Number of churned players: {len(df_balanced[df_balanced['player_churn'] == 1])}")

# Shuffle the balanced dataset
df_balanced = df_balanced.sample(frac=1, random_state=42).reset_index(drop=True)

# Visualize the balanced distribution
plt.figure(figsize=(8, 6))
sns.countplot(x='player_churn', data=df_balanced)
plt.title('Distribution of Player Churn After Balancing')
plt.xlabel('Player Churn (0 = No, 1 = Yes)')
plt.ylabel('Count')
plt.show()
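
A minimal sketch of a leakage-safe ordering, on a made-up frame: split first, then oversample only the training portion. The column values, the 75/25 split, and the variable names are assumptions for illustration, not part of this notebook's pipeline.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Made-up data: 16 retained players, 4 churned
toy = pd.DataFrame({
    "player_id": [f"p{i}" for i in range(20)],
    "session_count": range(20),
    "player_churn": [0] * 16 + [1] * 4,
})

# 1. Split first, stratified so both sets keep the original class ratio
train_df, test_df = train_test_split(
    toy, test_size=0.25, stratify=toy["player_churn"], random_state=42
)

# 2. Oversample the minority class in the training set only
train_major = train_df[train_df["player_churn"] == 0]
train_minor = train_df[train_df["player_churn"] == 1]
train_minor_up = resample(
    train_minor, replace=True, n_samples=len(train_major), random_state=42
)
train_balanced = pd.concat([train_major, train_minor_up]).sample(frac=1, random_state=42)

# The test set is untouched, so no duplicated player appears on both sides
overlap = set(train_balanced["player_id"]) & set(test_df["player_id"])
print(train_balanced["player_churn"].value_counts().to_dict())
print(f"players in both sets: {len(overlap)}")
```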

## Convert Data Types

Some machine learning algorithms require specific data types. Here we'll convert the target variable from boolean to long integer format.

In [None]:
# Check current data type of player_churn
print(f"Current data type of player_churn: {df_balanced['player_churn'].dtype}")

# Convert player_churn from boolean to long (int64)
df_balanced['player_churn'] = df_balanced['player_churn'].astype('int64')

# Verify the conversion
print(f"New data type of player_churn: {df_balanced['player_churn'].dtype}")

# Display a sample of the data to confirm
print("\nSample values after conversion:")
print(df_balanced['player_churn'].head())
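
Feature Store supports three feature types: Integral, Fractional, and String. The hypothetical helper below (`feature_type_for` is not part of the SageMaker SDK) sketches how pandas dtypes might line up with them, which is one more reason to cast boolean columns to integers before storing them.

```python
import pandas as pd
from pandas.api import types as ptypes

def feature_type_for(series: pd.Series) -> str:
    """Illustrative mapping from a pandas dtype to a Feature Store type name."""
    if ptypes.is_bool_dtype(series):
        # There is no Boolean feature type; cast to int64 before ingesting
        return "needs cast to int64"
    if ptypes.is_integer_dtype(series):
        return "Integral"
    if ptypes.is_float_dtype(series):
        return "Fractional"
    return "String"

# Made-up rows using column names from this notebook
toy = pd.DataFrame({
    "player_churn": [True, False],
    "session_count": [3, 12],
    "player_lifetime": [1.5, 30.0],
    "player_id": ["a1", "b2"],
})
toy["player_churn"] = toy["player_churn"].astype("int64")

for name in toy.columns:
    print(name, "->", feature_type_for(toy[name]))
```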

## Save to SageMaker Feature Store

Now we'll save our processed dataset to Amazon SageMaker Feature Store for use in machine learning workflows. Feature Store provides a centralized repository for features, making it easier to share and reuse features across teams and projects.

### What is SageMaker Feature Store?

Amazon SageMaker Feature Store is a purpose-built repository where you can store and access features so it's much easier to name, organize, and reuse them across teams. Key benefits include:

- **Feature Reuse**: Store features once and reuse them for multiple models
- **Consistency**: Ensure consistent feature transformations between training and inference
- **Discoverability**: Make features discoverable and shareable across your organization
- **Real-time Access**: Access features with low latency for online inference
- **Historical Access**: Retrieve point-in-time feature values for training and backtesting

In [None]:
# Import SageMaker Feature Store modules
import sagemaker
from sagemaker.session import Session
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.feature_store.feature_definition import FeatureDefinition, FeatureTypeEnum

# Initialize SageMaker session
session = sagemaker.Session()
region = session.boto_region_name
s3_bucket_name = session.default_bucket()
prefix = "player-churn-feature-store"
role = sagemaker.get_execution_role()

print(f"SageMaker session initialized in region: {region}")
print(f"Using S3 bucket: {s3_bucket_name}")

### Prepare Data for Feature Store

SageMaker Feature Store requires two special columns:
1. A **record identifier** column that identifies each record (we'll use `player_id`, which repeats for rows duplicated by oversampling)
2. An **event time** column that indicates when the feature values were generated

We'll add the event time column to our dataset.

In [13]:
from datetime import datetime, timezone

def generate_event_timestamp():
    """Return the current UTC time as an ISO-8601 string with millisecond precision."""
    # Timezone-aware current time in UTC
    utc_dt = datetime.now(timezone.utc)
    # ISO-8601 with milliseconds, using the 'Z' suffix for UTC
    event_time = utc_dt.isoformat(timespec='milliseconds')
    return event_time.replace('+00:00', 'Z')
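
To see what the resulting string looks like, the sketch below formats a fixed, made-up instant and checks it against the millisecond-precision UTC pattern. The helper name `to_event_time` and the regular expression are illustrative assumptions, not part of any SDK.

```python
import re
from datetime import datetime, timezone

# Pattern for the millisecond-precision UTC form: yyyy-MM-ddTHH:mm:ss.SSSZ
ISO_MILLIS_UTC = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z$")

def to_event_time(moment: datetime) -> str:
    """Format a timezone-aware datetime as an ISO-8601 UTC string with milliseconds."""
    utc_moment = moment.astimezone(timezone.utc)
    return utc_moment.isoformat(timespec="milliseconds").replace("+00:00", "Z")

# Fixed, made-up instant so the result is reproducible
sample = datetime(2024, 5, 1, 12, 30, 15, 123456, tzinfo=timezone.utc)
formatted = to_event_time(sample)
print(formatted)  # 2024-05-01T12:30:15.123Z
assert ISO_MILLIS_UTC.match(formatted)
```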



In [None]:
# Add an 'event_time' column holding the current timestamp; Feature Store uses it as the event time.
dt = generate_event_timestamp()
df_balanced['event_time'] = dt

# Define feature group name
feature_group_name = "player-churn-features-" + datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

# Create Feature Group
player_churn_feature_group = FeatureGroup(name=feature_group_name, sagemaker_session=session)

# Infer feature definitions (names and types) from the dataframe; no data is ingested yet
player_churn_feature_group.load_feature_definitions(data_frame=df_balanced)
print(f"Feature group name: {feature_group_name}")
print("Feature definitions loaded from dataframe")
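
Because the oversampled frame repeats rows, `player_id` is no longer unique and every row carries the same `event_time`. The online store keeps only the latest record per record identifier, so duplicated rows collapse to one there, while the offline store appends every ingested record. The sketch below, on a made-up frame, shows how to count the duplicates and approximates what a per-player lookup could return.

```python
import pandas as pd

# Made-up oversampled frame: player "p2" was duplicated by resampling
toy = pd.DataFrame({
    "player_id": ["p1", "p2", "p2", "p3"],
    "player_churn": [0, 1, 1, 0],
    "event_time": ["2024-05-01T12:00:00.000Z"] * 4,
})

n_rows = len(toy)
n_unique_ids = toy["player_id"].nunique()
print(f"rows: {n_rows}, unique player_id values: {n_unique_ids}")

# One row per identifier, roughly what an online lookup by player_id can return
one_per_player = toy.drop_duplicates(subset="player_id", keep="last")
print(one_per_player)
```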

### Create Feature Group

Now we'll create the feature group in SageMaker Feature Store. We'll configure it with:
- An S3 location for offline storage
- The record identifier column (`player_id`)
- The event time column (`event_time`)
- Online store enabled for low-latency access

In [None]:
# Create the feature group
player_churn_feature_group.create(
    s3_uri=f"s3://{s3_bucket_name}/{prefix}",
    record_identifier_name="player_id",
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=True
)

# Creation is asynchronous: create() returns before the feature group is ready,
# so its status must be polled before any data is ingested.

In [None]:
import time
def wait_for_feature_group_creation_complete(feature_group):
    """Helper function to wait for feature group creation to complete"""
    response = feature_group.describe()
    status = response.get("FeatureGroupStatus")
    while status == "Creating":
        print("Waiting for Feature Group Creation")
        time.sleep(5)
        response = feature_group.describe()
        status = response.get("FeatureGroupStatus")

    if status != "Created":
        print(f"Failed to create feature group, response: {response}")
        failure_reason = response.get("FailureReason", "")
        # Raise a regular exception so the failure surfaces with a normal traceback
        raise RuntimeError(
            f"Failed to create feature group {feature_group.name}, status: {status}, reason: {failure_reason}"
        )
    print(f"FeatureGroup {feature_group.name} successfully created.")

wait_for_feature_group_creation_complete(feature_group=player_churn_feature_group)
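
The polling loop above waits indefinitely while the status stays `Creating`. A variant with a timeout might look like the sketch below. The function name `wait_for_status`, its parameters, and the 600-second default are assumptions for illustration; the stand-in `describe` callable replaces a real `feature_group.describe` so the example runs without AWS access.

```python
import time

def wait_for_status(describe, target="Created", pending="Creating",
                    poll_seconds=5.0, timeout_seconds=600.0):
    """Poll describe() until the status leaves `pending`, or raise on timeout/failure."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = describe()
        status = response.get("FeatureGroupStatus")
        if status == target:
            return response
        if status != pending:
            reason = response.get("FailureReason", "")
            raise RuntimeError(f"Unexpected status {status!r}: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {pending!r} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)

# Stand-in for feature_group.describe(): reports "Creating" twice, then "Created"
_responses = iter([
    {"FeatureGroupStatus": "Creating"},
    {"FeatureGroupStatus": "Creating"},
    {"FeatureGroupStatus": "Created"},
])
result = wait_for_status(lambda: next(_responses), poll_seconds=0.01)
print(result["FeatureGroupStatus"])
```

In the notebook, the equivalent call would pass the bound method, for example `wait_for_status(player_churn_feature_group.describe)`.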

### Ingest Data into Feature Store

Finally, we'll ingest our processed data into the feature group. This makes the features available for:
- Training new models
- Real-time inference
- Feature exploration and analysis
- Sharing with other teams

In [None]:
# Ingest data into the feature group
player_churn_feature_group.ingest(data_frame=df_balanced, max_workers=3, wait=True)
print(f"Ingested {len(df_balanced)} records into feature group {feature_group_name}")

# Describe the feature group to verify
feature_group_details = player_churn_feature_group.describe()
print(f"\nFeature Group Details:\n{feature_group_details}")
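
Once ingestion finishes, a single record can be read back from the online store with the `get_record` operation of the `sagemaker-featurestore-runtime` client. The sketch below wraps that call in a hypothetical helper, `fetch_online_record`, and exercises it with a fake client so it runs without AWS access; the feature group name and player ID are made up.

```python
def fetch_online_record(runtime_client, feature_group_name, record_id):
    """Look up one record in the online store and return it as a {name: value} dict."""
    response = runtime_client.get_record(
        FeatureGroupName=feature_group_name,
        RecordIdentifierValueAsString=str(record_id),
    )
    # "Record" is absent when no record exists for this identifier
    return {
        item["FeatureName"]: item["ValueAsString"]
        for item in response.get("Record", [])
    }

class FakeRuntimeClient:
    """Stand-in for boto3.client("sagemaker-featurestore-runtime"), for illustration only."""
    def get_record(self, FeatureGroupName, RecordIdentifierValueAsString):
        return {"Record": [
            {"FeatureName": "player_id", "ValueAsString": RecordIdentifierValueAsString},
            {"FeatureName": "player_churn", "ValueAsString": "1"},
        ]}

record = fetch_online_record(FakeRuntimeClient(), "player-churn-features-example", "p42")
print(record)
```

With real credentials, the fake client would be replaced by `boto3.client("sagemaker-featurestore-runtime", region_name=region)` and the name by `feature_group_name`. All values come back as strings.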

## Conclusion

In this notebook, we've performed comprehensive feature engineering on the player churn dataset:

1. Selected the most relevant features
2. Handled missing values
3. Applied one-hot encoding to categorical variables
4. Balanced the dataset using oversampling
5. Converted data types for compatibility with ML algorithms
6. Stored the features in SageMaker Feature Store for reuse in ML workflows

The processed data is now ready for model training and evaluation.