# Predicting Player Engagement for Indie Developers
**Authors: Michael Alva & Cameron Matthews**

## Project Introduction
This project aims to predict **Player Engagement** for video games using Machine Learning regression models. Our motivation is to understand why small indie games like *Hades*, *Hollow Knight*, and *Stardew Valley* become huge hits. By looking at factors like game rating, genre, and release year, we want to figure out what drives player interest.

Our goal is a **regression task**: predicting a single continuous *Player Engagement Score*. We compare a simple **Baseline Model** (Multivariate Linear Regression) against two improved models, **Ridge ($L_2$)** and **Lasso ($L_1$) Regression**. We measure success using standard metrics: RMSE, MAE, and $R^2$ Score.

The data comes from the **Popular Video Games 1980 - 2023 🎮** dataset on Kaggle.

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

## Data & Cleaning
This section handles loading, cleaning, and creating the target score.
### Data Load and Initial Check
We check the initial structure of the dataset to identify non-numeric columns like `Times Listed` (which uses 'K' for thousands) and text fields like `Summary`.

In [2]:
file_path = 'data/games.csv'

# 1. Load the data into a pandas DataFrame
df = pd.read_csv(file_path)

print(f"DataFrame Shape: {df.shape[0]} Rows, {df.shape[1]} Columns")

# 2. Display columns, types, and sample data to confirm structure
print("\n--- DataFrame Columns and Types ---")
df.info()

print("\n--- First 5 Rows ---")
print(df.head())

DataFrame Shape: 1512 Rows, 14 Columns

--- DataFrame Columns and Types ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1512 entries, 0 to 1511
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         1512 non-null   int64  
 1   Title              1512 non-null   object 
 2   Release Date       1512 non-null   object 
 3   Team               1511 non-null   object 
 4   Rating             1499 non-null   float64
 5   Times Listed       1512 non-null   object 
 6   Number of Reviews  1512 non-null   object 
 7   Genres             1512 non-null   object 
 8   Summary            1511 non-null   object 
 9   Reviews            1512 non-null   object 
 10  Plays              1512 non-null   object 
 11  Playing            1512 non-null   object 
 12  Backlogs           1512 non-null   object 
 13  Wishlist           1512 non-null   object 
dtypes: float64(1), int64(1), object(12)
memory u

### Data Preparation
We clean the date column, remove incomplete rows where the Rating is missing, and prepare text columns.

In [3]:
# Convert 'Release Date' to datetime format
df['Release Date'] = pd.to_datetime(df['Release Date'], errors='coerce')
# Drop rows where critical numerical data (like Rating) is missing
df.dropna(subset=['Rating'], inplace=True)
# Fill missing text fields with an empty string for later TF-IDF processing
df['Summary'] = df['Summary'].fillna('')
df['Reviews'] = df['Reviews'].fillna('')

print(f"Data shape after initial cleaning: {df.shape}")

Data shape after initial cleaning: (1499, 14)


The Player Engagement Score is a target variable that represents overall player interaction with a game. It is calculated from Times Listed, Number of Reviews, Plays, and Playing. Each metric is normalized to a 0–1 range with Min-Max scaling. The final score is the average of the normalized values. This will produce a regression-ready measure of active player engagement.
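
Written out, the construction described above is as follows. This is a sketch: the symbols $x_{ij}$, $\tilde{x}_{ij}$, and $s_i$ are notation introduced here for illustration and do not appear in the notebook.

$$\tilde{x}_{ij} = \frac{x_{ij} - \min_k x_{kj}}{\max_k x_{kj} - \min_k x_{kj}}, \qquad s_i = \frac{1}{4} \sum_{j=1}^{4} \tilde{x}_{ij}$$

Here $x_{ij}$ is the raw value of metric $j$ (Times Listed, Number of Reviews, Plays, or Playing) for game $i$, $\tilde{x}_{ij}$ is its Min-Max scaled value, and $s_i$ is the resulting Player Engagement Score, which therefore lies between 0 and 1.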

In [4]:
from sklearn.preprocessing import MinMaxScaler

def clean_and_convert_to_numeric(series):
    # Converts strings such as '3.9K' to 3900.0; unparseable values become NaN
    def convert_k(x):
        try:
            if isinstance(x, str) and 'K' in x:
                return float(x.replace('K', '')) * 1000
            return float(x)
        except (TypeError, ValueError):
            # Catch only conversion failures instead of using a bare except
            return np.nan
    return series.apply(convert_k)

# Player engagement-related columns
engagement_features_all = [
    'Times Listed',
    'Number of Reviews',
    'Plays',
    'Playing',
    'Backlogs',
    'Wishlist'
]

# Convert to numeric
for col in engagement_features_all:
    df[col] = clean_and_convert_to_numeric(df[col]).fillna(0)

# Drop rows where the critical feature (Rating) is missing
df.dropna(subset=['Rating'], inplace=True)
print(f"Data shape after cleaning: {df.shape}")

# Fill missing text fields with empty strings (important for later TF-IDF)
text_cols = ['Summary', 'Reviews', 'Team', 'Genres']
for col in text_cols:
    df[col] = df[col].fillna('')

# Active engagement features used for the target score
active_engagement_features = ['Times Listed', 'Number of Reviews', 'Plays', 'Playing']

# Normalize using Min-Max Scaling
scaler = MinMaxScaler()
scaled_values = scaler.fit_transform(df[active_engagement_features])

# Create scaled feature DataFrame
scaled_df = pd.DataFrame(
    scaled_values,
    columns=[f"{col}_scaled" for col in active_engagement_features],
    index=df.index
)

# Merge scaled values back into the main DataFrame
df = pd.concat([df.drop(columns=[col for col in scaled_df.columns if col in df.columns]), scaled_df], axis=1)

# Compute final Player Engagement Score (average of the four scaled metrics)
df['Player_Engagement_Score'] = df[scaled_df.columns].mean(axis=1)

# Verify result
df['Player_Engagement_Score'].describe()

Data shape after cleaning: (1499, 14)


count    1499.000000
mean        0.154833
std         0.139608
min         0.000255
25%         0.055925
50%         0.112480
75%         0.207739
max         0.891746
Name: Player_Engagement_Score, dtype: float64
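
To sanity-check the 'K' suffix handling in isolation, a small standalone sketch can help. The function name `parse_count` and the sample values are hypothetical and are not taken from the dataset; the logic mirrors the conversion idea used in the cell above.

```python
import math

def parse_count(value):
    """Convert strings such as '1.5K' to a float; return NaN if unparseable."""
    try:
        if isinstance(value, str) and value.strip().upper().endswith("K"):
            return float(value.strip()[:-1]) * 1000
        return float(value)
    except (TypeError, ValueError):
        return math.nan

# Hypothetical inputs covering the suffix, plain numbers, and bad values
samples = ["1.5K", "800", 42, None, "n/a"]
parsed = [parse_count(v) for v in samples]
print(parsed)  # [1500.0, 800.0, 42.0, nan, nan]
```

Note that the notebook fills unparseable values with 0 afterwards, which treats a malformed entry the same as a game with no engagement; whether that is appropriate depends on how often such entries occur.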

## Feature Engineering
This process converts non-numeric data into a feature matrix $X$.
### Creating Numerical Features
* **TF-IDF for `Summary` (Task: Michael Alva)**: We use **Term Frequency-Inverse Document Frequency (TF-IDF)** to turn game descriptions into numerical scores, highlighting important words. We limit this to the top 200 keywords.
    > **TF-IDF Resource**: [Scikit-learn documentation on Text Feature Extraction](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).
* **Date Extraction**: The `Release Date` is simplified to a single `Release_Year` feature.
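* **One-Hot Encoding (OHE) for `Genres` (Task: Cameron Matthews)**: We use OHE to represent genres without implying an order (e.g., 'Action' is not numerically higher than 'RPG'). We take each game's first listed genre, keep the 5 most frequent genres as separate categories, and group all remaining genres under `Genre_Other`.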
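
To make the TF-IDF weighting concrete, here is a minimal NumPy sketch on three made-up game summaries. It assumes the smoothed IDF formula and L2 row normalization that scikit-learn documents as its defaults; the documents and variable names are illustrative only.

```python
import numpy as np

# Made-up summaries, not taken from the dataset
docs = [
    "explore dungeon fight",
    "farm explore village",
    "fight boss dungeon dungeon",
]

vocab = sorted({word for doc in docs for word in doc.split()})
# Raw term counts: one row per document, one column per vocabulary word
counts = np.array(
    [[doc.split().count(word) for word in vocab] for doc in docs], dtype=float
)

n_docs = len(docs)
doc_freq = (counts > 0).sum(axis=0)               # documents containing each word
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed inverse document frequency
tfidf = counts * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each row

# Weights for the third summary
for word, weight in zip(vocab, tfidf[2]):
    print(f"{word:>8}: {weight:.3f}")
```

Words that appear in fewer summaries (such as "boss") receive a larger IDF than words shared across summaries, and repeated words within one summary (such as "dungeon" in the third) receive a larger weight in that row.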

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

# 1. TF-IDF for Summary
tfidf = TfidfVectorizer(max_features=200, stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['Summary'])
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(), 
    columns=[f'Summary_TFIDF_{f}' for f in tfidf.get_feature_names_out()],
    index=df.index
)
df = pd.concat([df.reset_index(drop=True), tfidf_df.reset_index(drop=True)], axis=1)

# 2. Extract and Prepare Release Year
df['Release Date'] = pd.to_datetime(df['Release Date'], errors='coerce')
median_year = int(df['Release Date'].dt.year.median(skipna=True))
df['Release_Year'] = df['Release Date'].dt.year.fillna(median_year)

# 3. One-Hot Encode Top 5 Genres
def extract_primary_genre(x):
    x_list = x.strip("['']").split("', '")
    return x_list[0] if x_list and x_list[0] else 'Other'

df['Primary_Genre'] = df['Genres'].apply(extract_primary_genre)

top_5_genres = df['Primary_Genre'].value_counts().nlargest(5).index.tolist()

def is_top_genre(genre):
    return genre if genre in top_5_genres else 'Genre_Other'
df['Genre_OHE'] = df['Primary_Genre'].apply(is_top_genre)

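# Perform One-Hot Encoding
# Note: `sparse` is the argument name in scikit-learn releases before 1.2; newer releases use `sparse_output`
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')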
ohe_matrix = ohe.fit_transform(df[['Genre_OHE']])
ohe_df = pd.DataFrame(ohe_matrix, columns=ohe.get_feature_names_out(['Genre_OHE']), index=df.index)
df = pd.concat([df.reset_index(drop=True), ohe_df.reset_index(drop=True)], axis=1)

# 4. Separate features (X_raw) and target (y)
# Drop columns that would cause target leakage or are redundant.
features_to_drop = engagement_features_all + [f'{col}_scaled' for col in active_engagement_features] + \
                         ['Player_Engagement_Score', 'Unnamed: 0', 'Release Date', 'Summary', 'Title', \
                          'Reviews', 'Team', 'Genres', 'Primary_Genre', 'Genre_OHE']
X_raw = df.drop(columns=features_to_drop, errors='ignore')

y = df['Player_Engagement_Score'].values
feature_names = X_raw.columns.tolist()

# 5. Standardization: Apply to continuous features ('Rating', 'Release_Year')
scaling_cols = ['Rating', 'Release_Year']
scaler = StandardScaler()
X_raw[scaling_cols] = scaler.fit_transform(X_raw[scaling_cols])

X = X_raw.values

print(f"Final feature matrix X shape: {X.shape}")
print(f"Final target vector y shape: {y.shape}")

Final feature matrix X shape: (1499, 208)
Final target vector y shape: (1499,)
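
The genre extraction above relies on stripping brackets and quotes from the raw string. A possibly more robust sketch parses the string with `ast.literal_eval` instead. This assumes the `Genres` column stores list-like strings such as `"['Adventure', 'RPG']"`, which is inferred from the parsing code rather than confirmed from the data; the sample values and the function name `primary_genre` are hypothetical.

```python
import ast
from collections import Counter

# Hypothetical raw values in the assumed list-like string format
raw_genres = [
    "['Adventure', 'RPG']",
    "['Platform']",
    "[]",
    "['Adventure', 'Indie']",
    "['Shooter']",
    "['Platform', 'Indie']",
]

def primary_genre(text):
    """Return the first genre in a list-like string, or 'Other' if none."""
    try:
        genres = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return "Other"
    return genres[0] if isinstance(genres, list) and genres else "Other"

primary = [primary_genre(g) for g in raw_genres]

# Keep the 2 most frequent genres here (the notebook keeps 5) and group the rest
top = {genre for genre, _ in Counter(primary).most_common(2)}
grouped = [g if g in top else "Genre_Other" for g in primary]
print(grouped)
```

The grouped labels can then be passed to a one-hot encoder exactly as in the cell above.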


### Scaling Justification
We use **StandardScaler** (Z-score normalization) for the continuous features (`Rating` and `Release_Year`).

* **Mitigate Feature Imbalance**: Scaling ensures that features with large numerical ranges do not unfairly dominate the objective function simply because of their magnitude.
* **Optimal Regularization**: Ridge and Lasso penalize the magnitude of feature coefficients ($w_i$). Scaling all continuous features to a common distribution ensures this penalty is applied fairly across all features, which is essential for effective regularization.

In [6]:
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")

Training set shape: (1199, 208)
Test set shape: (300, 208)
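
One caveat: in the cells above, the `StandardScaler` is fit on all rows before the split, so test-set statistics influence the scaled features. A common alternative is to compute the scaling statistics on the training rows only and reuse them for the test rows. The sketch below illustrates that idea with NumPy on synthetic stand-in data; the array `X_cont` and its values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the two continuous columns (Rating, Release_Year)
X_cont = np.column_stack(
    [rng.uniform(1, 5, 100), rng.integers(1980, 2024, 100)]
).astype(float)

# Simple 80/20 split by shuffled index
idx = rng.permutation(len(X_cont))
train_idx, test_idx = idx[:80], idx[80:]

# Scaling statistics come from the training rows only
mu = X_cont[train_idx].mean(axis=0)
sigma = X_cont[train_idx].std(axis=0)

X_train_scaled = (X_cont[train_idx] - mu) / sigma
X_test_scaled = (X_cont[test_idx] - mu) / sigma  # reuse the training statistics

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

The same reasoning would apply to the TF-IDF vocabulary, which is also fit on all rows here. With scikit-learn, wrapping the transformers and the model in a `Pipeline` is one way to get this behavior inside cross-validation as well.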


## Modeling and Baseline
We define a helper function to calculate performance metrics, followed by the **Baseline Model**.

In [7]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
def calculate_metrics(y_true, y_pred, model_name):
    """Calculates and returns RMSE, MAE, and R^2 for a model's predictions.
       Source: Derived from module03_04_multivariate_linear_regression.ipynb"""
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return pd.DataFrame({
        'Model': [model_name],
        'RMSE': [rmse],
        'MAE': [mae],
        'R^2': [r2]
    })

# --- Baseline Model (Unregularized MLR) ---
# This establishes the minimum performance benchmark (Task: Cameron Matthews)

# Instantiate and fit the model
baseline_model = LinearRegression()
baseline_model.fit(X_train, y_train)

# Predict on the test set
y_baseline_pred = baseline_model.predict(X_test)

# Calculate results for final comparison
baseline_results = calculate_metrics(y_test, y_baseline_pred, 'Baseline (MLR)')

print("Baseline Model Performance (Test Set):")
print("| Model | RMSE | MAE | R^2 |")
print("|:---|:---:|:---:|:---:|")
print(f"| {baseline_results['Model'].iloc[0]} | {baseline_results['RMSE'].iloc[0]:.4f} | {baseline_results['MAE'].iloc[0]:.4f} | {baseline_results['R^2'].iloc[0]:.4f} |")

Baseline Model Performance (Test Set):
| Model | RMSE | MAE | R^2 |
|:---|:---:|:---:|:---:|
| Baseline (MLR) | 0.1108 | 0.0876 | 0.2542 |
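
For reference, the three metrics reported above can be computed by hand. The sketch below uses a tiny made-up example (not the notebook's predictions) to show the arithmetic behind RMSE, MAE, and $R^2$.

```python
import numpy as np

# Made-up targets and predictions for illustration
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.5])

residuals = y_true - y_pred
rmse = np.sqrt(np.mean(residuals ** 2))          # root of the mean squared error
mae = np.mean(np.abs(residuals))                 # mean absolute error
ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

print(f"RMSE={rmse:.4f}  MAE={mae:.4f}  R^2={r2:.4f}")
# RMSE=0.4330  MAE=0.3750  R^2=0.8500
```

Read this way, the baseline's $R^2$ of 0.2542 means the model accounts for roughly a quarter of the variance of the test targets.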


## Advanced Models
The baseline MLR is prone to **overfitting** due to the high dimensionality of the feature matrix (208 features, 200 of which are TF-IDF terms). We implement **Regularized Linear Regression** to address this.

We use **Grid Search with Cross-Validation ($\text{GridSearchCV}$)** to find the optimal regularization parameter ($\alpha$) for both Ridge and Lasso models.
### Ridge Regression ($L_2$)
Ridge regression adds an $L_2$ penalty (the square of the magnitude of coefficients) to the loss function, which keeps feature weights small to prevent **overfitting**. The term added to the loss function is: $$\alpha \,\|\mathbf{w}\|_2^2 = \alpha \sum_{i=1}^d w_i^2$$
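
To illustrate both the shrinkage effect and how a cross-validated search picks $\alpha$, here is a minimal NumPy sketch on synthetic data. It uses the closed-form ridge solution without an intercept and contiguous, unshuffled folds, so it approximates the idea behind the grid search rather than reproducing scikit-learn's implementation. The helper names `ridge_fit` and `cv_mae` are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form weights minimizing ||y - Xw||^2 + alpha * ||w||^2 (no intercept)."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def cv_mae(X, y, alpha, k=5):
    """Mean validation MAE over k contiguous, unshuffled folds."""
    scores = []
    for val_idx in np.array_split(np.arange(len(y)), k):
        train = np.ones(len(y), dtype=bool)
        train[val_idx] = False
        w = ridge_fit(X[train], y[train], alpha)
        scores.append(np.mean(np.abs(y[val_idx] - X[val_idx] @ w)))
    return float(np.mean(scores))

# Synthetic data (not the games dataset): 60 rows, 20 features, 3 informative
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 20))
true_w = np.zeros(20)
true_w[:3] = [1.0, -2.0, 0.5]
y = X @ true_w + rng.normal(scale=0.5, size=60)

alphas = np.logspace(-3, 3, 7)  # same grid as the Ridge cell
norms = [float(np.linalg.norm(ridge_fit(X, y, a))) for a in alphas]
cv_scores = [cv_mae(X, y, a) for a in alphas]
best_alpha = float(alphas[int(np.argmin(cv_scores))])

for a, n, s in zip(alphas, norms, cv_scores):
    print(f"alpha={a:8.3f}  ||w||={n:.3f}  CV MAE={s:.3f}")
print(f"Best alpha by 5-fold MAE: {best_alpha}")
```

The coefficient norm falls steadily as $\alpha$ grows, while the validation error is smallest at an intermediate value: too little regularization fits noise, and too much shrinks the informative weights toward zero.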

In [8]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
# Implement Ridge model and alpha search

# Define the model and the parameter grid for alpha (log-spaced powers of 10)
ridge_model = Ridge(random_state=42)
param_grid = {'alpha': np.logspace(-3, 3, 7)}

# Use GridSearchCV (CV=5) to find the best alpha, minimizing Mean Absolute Error (MAE)
grid_search_ridge = GridSearchCV(ridge_model, param_grid, cv=5, scoring='neg_mean_absolute_error', n_jobs=-1)
grid_search_ridge.fit(X_train, y_train)

best_ridge = grid_search_ridge.best_estimator_

print(f"Best Ridge alpha: {best_ridge.alpha}")

# Predict on the test set with the best model
y_ridge_pred = best_ridge.predict(X_test)
print(f"Ridge Test MAE: {mean_absolute_error(y_test, y_ridge_pred)}")

# Calculate results for final comparison
ridge_results = calculate_metrics(y_test, y_ridge_pred, 'Ridge ($L_2$)')
results = pd.concat([baseline_results, ridge_results], ignore_index=True)


ValueError: 
All the 35 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
35 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/python-env/py39/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/python-env/py39/lib/python3.9/site-packages/sklearn/linear_model/_ridge.py", line 1130, in fit
    return super().fit(X, y, sample_weight=sample_weight)
  File "/usr/local/python-env/py39/lib/python3.9/site-packages/sklearn/linear_model/_ridge.py", line 889, in fit
    self.coef_, self.n_iter_ = _ridge_regression(
  File "/usr/local/python-env/py39/lib/python3.9/site-packages/sklearn/linear_model/_ridge.py", line 699, in _ridge_regression
    coef = _solve_cholesky(X, y, alpha)
  File "/usr/local/python-env/py39/lib/python3.9/site-packages/sklearn/linear_model/_ridge.py", line 212, in _solve_cholesky
    return linalg.solve(A, Xy, sym_pos=True, overwrite_a=True).T
TypeError: solve() got an unexpected keyword argument 'sym_pos'
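
The last frame of the traceback shows scikit-learn's Ridge solver passing `sym_pos` to `scipy.linalg.solve`, which SciPy rejects. That pattern is consistent with an older scikit-learn installed alongside a newer SciPy that no longer accepts the keyword, although this has not been verified for this environment. A quick way to check is to print the installed versions; the sketch below uses only the standard library, and the helper name `installed_version` is hypothetical.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

versions = {
    name: installed_version(name) for name in ("scikit-learn", "scipy", "numpy")
}
for name, version in versions.items():
    print(f"{name}: {version or 'not installed'}")
```

If the versions confirm a mismatch, upgrading scikit-learn (or pinning SciPy to an older release) is the likely fix. Passing a non-Cholesky solver such as `Ridge(solver='svd')` may also avoid the failing call, though that is untested here.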


### Lasso Regression ($L_1$)
The Lasso model adds an $L_1$ penalty (the sum of the absolute values of the coefficients). The primary benefit of $L_1$ regularization is **feature selection**, as it forces the coefficients of irrelevant features to be exactly zero. The term added to the loss function is: $$\alpha \sum_{i=1}^d |w_i|$$
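
The reason an $L_1$ penalty produces exact zeros can be seen in the soft-thresholding operation that appears in Lasso solutions (for example, in the special case of orthonormal features). The sketch below contrasts it with ridge-style proportional shrinkage on made-up coefficients; the function name `soft_threshold` and all values are illustrative.

```python
import numpy as np

def soft_threshold(z, threshold):
    """Shrink each value toward zero by `threshold`; small values become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

# Made-up unpenalized coefficients
w_unpenalized = np.array([-3.0, -0.5, 0.2, 2.0])

w_lasso_like = soft_threshold(w_unpenalized, 1.0)  # L1-style: subtract, then clip at 0
w_ridge_like = w_unpenalized / (1.0 + 1.0)         # L2-style: scale everything down

print("L1-style:", w_lasso_like)
print("L2-style:", w_ridge_like)
print("Non-zero after L1-style shrinkage:", np.count_nonzero(w_lasso_like))
```

The two small coefficients are removed entirely under the $L_1$-style update, whereas the $L_2$-style update only makes every coefficient smaller without eliminating any.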

In [None]:
from sklearn.linear_model import Lasso
# Implement Lasso Regression (L1) model

# Define the model and the parameter grid for alpha
lasso_model = Lasso(random_state=42, max_iter=10000) # Increased max_iter for convergence
param_grid = {'alpha': np.logspace(-4, 0, 5)}

# Use GridSearchCV (CV=5) to find the best alpha
grid_search_lasso = GridSearchCV(lasso_model, param_grid, cv=5, scoring='neg_mean_absolute_error', n_jobs=-1)
grid_search_lasso.fit(X_train, y_train)

best_lasso = grid_search_lasso.best_estimator_

print(f"Best Lasso alpha: {best_lasso.alpha}")

# Predict on the test set with the best model
y_lasso_pred = best_lasso.predict(X_test)
print(f"Lasso Test MAE: {mean_absolute_error(y_test, y_lasso_pred)}")

# Calculate results for final comparison
lasso_results = calculate_metrics(y_test, y_lasso_pred, 'Lasso ($L_1$)')
results = pd.concat([results, lasso_results], ignore_index=True)

## Results and Comparison
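We summarize and compare the final performance of the Baseline, Ridge, and Lasso models on the test set. The Ridge grid search above failed with a library error and the remaining cells have not been run, so only the Baseline result is available so far. Whether regularization improves on the Baseline for this high-dimensional dataset still needs to be confirmed once those cells execute.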
### 1. Naive Baseline Comparison
We validate our models against a **Naive Baseline** that simply predicts the mean of the training target variable for every test case. This establishes the minimum performance benchmark that any useful model must exceed.

In [None]:
# Calculate Naive Baseline MAE
naive_pred = np.full_like(y_test, np.mean(y_train))
naive_mae = mean_absolute_error(y_test, naive_pred)

print(f"Naive Baseline MAE: {naive_mae:.4f}")
print(f"Best Model MAE: {results['MAE'].min():.4f}")

### 2. Final Performance Metrics
The table below consolidates the final performance of all three models on the test set.

In [None]:
# Print final comparison table
print(results.to_markdown(index=False, floatfmt=".4f"))

## Conclusion
(explain)

## Team Contributions
* **Michael Alva:**
* **Cameron Matthews:**