# Predicting Player Engagement for Indie Developers
**Authors: Michael Alva & Cameron Matthews**

## Project Introduction
(intro)

In [1]:
import pandas as pd
import numpy as np

## Data & Cleaning
This section loads the data, fixes missing values, prepares dates, and creates the single target score we need to predict.
### Data Load and Initial Check
We are using the **Popular Video Games 1980 - 2023 🎮** dataset, available on Kaggle ([link](https://www.kaggle.com/datasets/arnabchaki/popular-video-games-1980-2023)). We load the data, check its size, and look at the types of information in the columns.

In [2]:
# Load the dataset
file_path = 'data/games.csv'
try:
    df = pd.read_csv(file_path)
    print("Dataset loaded successfully.")
    print(f"DataFrame Shape: {df.shape[0]} Rows, {df.shape[1]} Columns")
except FileNotFoundError:
    print(f"Error: File not found at {file_path}.")
    raise  # later cells need df, so stop here instead of continuing without it

# Optional: inspect column types and sample values
# print("\nColumn Data Types:")
# print(df.dtypes)
# print("\nFirst 5 Rows:")
# print(df.head())

Dataset loaded successfully.
DataFrame Shape: 1512 Rows, 14 Columns


### Data Preparation
We parse the Release Date column as datetimes, drop rows where the Rating is missing, and fill missing text fields with empty strings.

In [3]:
# Convert 'Release Date' to datetime format
df['Release Date'] = pd.to_datetime(df['Release Date'], errors='coerce')
# Drop rows where critical numerical data (like Rating) is missing
df.dropna(subset=['Rating'], inplace=True)
# Fill missing text fields with an empty string for later TF-IDF processing
df['Summary'] = df['Summary'].fillna('')
df['Reviews'] = df['Reviews'].fillna('')

print(f"Data shape after initial cleaning: {df.shape}")

Data shape after initial cleaning: (1499, 14)


The Player Engagement Score is the target variable and summarizes overall player interaction with a game. It is calculated from four columns: Times Listed, Number of Reviews, Plays, and Playing. Each metric is normalized to a 0–1 range with Min–Max scaling, and the final score is the unweighted average of the four normalized values. This produces a single, regression-ready measure of player engagement.
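
Written out, the calculation looks roughly as follows. The symbols are introduced here only for illustration: $x_{ij}$ is the raw value of metric $j$ for game $i$, $\tilde{x}_{ij}$ is its Min–Max scaled value, and $\text{Score}_i$ is the resulting engagement score.

$$
\tilde{x}_{ij} = \frac{x_{ij} - \min_k x_{kj}}{\max_k x_{kj} - \min_k x_{kj}},
\qquad
\text{Score}_i = \frac{1}{4} \sum_{j=1}^{4} \tilde{x}_{ij}
$$

Because every $\tilde{x}_{ij}$ lies between 0 and 1, the score does too, and each of the four metrics contributes equally.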

In [4]:
# Cameron: Define and calculate the final Player Engagement Score

from sklearn.preprocessing import MinMaxScaler

# Player engagement-related columns
engagement_features = [
    'Times Listed',
    'Number of Reviews',
    'Plays',
    'Playing'
]

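# Convert to numeric.
# Caution: errors='coerce' turns every non-numeric string into NaN. If the raw
# columns hold abbreviated counts (for example '3.9K'), those values become NaN
# here and 0 in the next step, so inspect df[engagement_features].head() first.
for col in engagement_features:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Fill missing engagement values with 0
df[engagement_features] = df[engagement_features].fillna(0)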

# Normalize using Min-Max Scaling
scaler = MinMaxScaler()
scaled_values = scaler.fit_transform(df[engagement_features])

# Create scaled feature DataFrame
scaled_df = pd.DataFrame(
    scaled_values,
    columns=[f"{col}_scaled" for col in engagement_features]
)

# Merge scaled values back into the main DataFrame
df = pd.concat([df.reset_index(drop=True), scaled_df], axis=1)

# Compute final Player Engagement Score
df['Player_Engagement_Score'] = df[[
    'Times Listed_scaled',
    'Number of Reviews_scaled',
    'Plays_scaled',
    'Playing_scaled'
]].mean(axis=1)

# Verify result
df['Player_Engagement_Score'].describe()

count    1499.000000
mean        0.225641
std         0.154465
min         0.000000
25%         0.099799
50%         0.204735
75%         0.335126
max         0.693940
Name: Player_Engagement_Score, dtype: float64
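
If the engagement columns turn out to store abbreviated counts as text, a small parser could be applied before `pd.to_numeric`. The sketch below is only an illustration: `parse_count` is a hypothetical helper, the `K`/`M` suffix convention is an assumption about the raw data, and the sample values are made up.

```python
import numpy as np
import pandas as pd


def parse_count(value):
    """Convert strings such as '3.9K' or '953' to a float (hypothetical helper)."""
    if pd.isna(value):
        return np.nan
    text = str(value).strip().upper().replace(",", "")
    multipliers = {"K": 1_000, "M": 1_000_000}  # assumed suffix convention
    if text and text[-1] in multipliers:
        number, factor = text[:-1], multipliers[text[-1]]
    else:
        number, factor = text, 1
    try:
        return float(number) * factor
    except ValueError:
        return np.nan  # unparseable values stay missing


# Toy values, not taken from the dataset
raw = pd.Series(["3.9K", "953", "17K", None, "n/a"])
parsed = raw.map(parse_count)
print(parsed.tolist())
```

In the notebook this would be applied per column, for example `df[col] = df[col].map(parse_count)`, before the scaling step. The score statistics above would need to be recomputed afterwards.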

## Feature Engineering
Feature engineering converts the raw columns into numeric features the models can use.
### Text Features (TF-IDF)
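We use TF-IDF to convert the game summaries into numerical features. TF-IDF weights each term by how often it appears in a summary and how rare it is across all summaries.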

In [41]:
# Michael: Create TF-IDF features from text data
from sklearn.feature_extraction.text import TfidfVectorizer
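
A minimal sketch of this step, run on toy summaries rather than `df['Summary']`. The `stop_words` and `max_features` settings are illustrative assumptions, not tuned choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy summaries standing in for df['Summary'] (illustrative text only)
summaries = [
    "Explore a vast open world and fight dragons",
    "A puzzle platformer about a lonely robot",
    "Fight zombies with friends in an open world",
]

# stop_words and max_features are illustrative settings, not tuned values
vectorizer = TfidfVectorizer(stop_words="english", max_features=500)
tfidf_matrix = vectorizer.fit_transform(summaries)  # sparse, one row per summary

print(tfidf_matrix.shape)
print(sorted(vectorizer.vocabulary_)[:5])
```

On the real data the vectorizer should be fit on the training split only and then used to transform the test split, so that vocabulary and document frequencies do not leak from test to train.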

### Encoding and Scaling

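One-hot encoding turns each categorical column into binary indicator columns that a linear model can consume. Standardization puts the numeric features on a common scale, so the regularization penalties used later treat every coefficient comparably.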

In [42]:
# Cameron: Categorical encoding and numerical scaling
from sklearn.preprocessing import StandardScaler
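
One possible way to combine both steps is a `ColumnTransformer`. The sketch below uses a toy frame; `Genre` is a hypothetical categorical column name, and only `Rating` is taken from the notebook above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame; 'Genre' is a hypothetical categorical column
toy = pd.DataFrame({
    "Genre": ["RPG", "Puzzle", "RPG", "Shooter"],
    "Rating": [4.5, 3.0, 4.0, 3.5],
})

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["Rating"]),                       # zero mean, unit variance
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Genre"]),  # one column per category
])

features = preprocessor.fit_transform(toy)
# The result can be sparse depending on its density; densify for inspection
if hasattr(features, "toarray"):
    features = features.toarray()

print(features.shape)
```

`handle_unknown="ignore"` encodes a category that was not seen during fitting as all zeros instead of raising an error, which matters once the transformer is fit on a training split and applied to a test split.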

## Modeling and Baseline
We build our first simple model (the **Baseline**) to establish the minimum performance we need to beat.
### Baseline Model (Unregularized MLR)
We train the simplest model (standard multiple linear regression with no regularization) and calculate its performance scores (RMSE, MAE, $\text{R}^2$).

In [43]:
# Cameron/Michael: Baseline Model and Metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
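
A minimal sketch of the baseline workflow, assuming an 80/20 train/test split. The data here is synthetic and stands in for the engineered feature matrix and the engagement score, so the printed metrics say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features and the engagement score
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
true_weights = np.array([0.5, -0.3, 0.2, 0.0, 0.1])
y = X @ true_weights + rng.normal(scale=0.05, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

baseline = LinearRegression().fit(X_train, y_train)
y_pred = baseline.predict(X_test)

# RMSE is computed as the square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.4f}  MAE: {mae:.4f}  R^2: {r2:.4f}")
```

Fixing `random_state` keeps the split reproducible, which makes the later Ridge and Lasso results comparable to this baseline.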

## Advanced Models
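We extend the baseline with two regularized linear models, Ridge and Lasso, which add a penalty on the size of the coefficients.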
### Ridge Regression ($L_2$)
We use the Ridge model, which adds an $L_2$ penalty on the coefficients and helps prevent **overfitting** by keeping the feature weights small. We will tune the regularization strength alpha ($\alpha$).

In [44]:
# Michael: Implement Ridge model and alpha search
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
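
A sketch of how the alpha search could look with `GridSearchCV`. The alpha grid, the 5-fold setting, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data: two informative features out of twenty
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate regularization strengths (illustrative grid)
alpha_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(
    Ridge(), alpha_grid, cv=5, scoring="neg_root_mean_squared_error"
)
search.fit(X_train, y_train)

best_alpha = search.best_params_["alpha"]
test_r2 = search.best_estimator_.score(X_test, y_test)  # R^2 on held-out data
print(f"best alpha: {best_alpha}, test R^2: {test_r2:.3f}")
```

The search only sees the training split, so the held-out score remains an honest estimate of how the tuned model generalizes.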


### Lasso Regression ($L_1$)
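We use the Lasso model, which adds an $L_1$ penalty on the coefficients. This penalty can shrink some coefficients exactly to zero, so Lasso also acts as a form of feature selection.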

In [None]:
# Cameron: Implement Lasso Regression (L1) model
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
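
The sketch below shows one way to tune Lasso and to see its sparsity effect. The data is synthetic, with only the first two features informative, and the alpha values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic data: only features 0 and 1 influence the target
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Cross-validated search over an illustrative alpha grid
alpha_grid = {"alpha": [0.001, 0.01, 0.1, 1.0]}
search = GridSearchCV(
    Lasso(max_iter=10_000), alpha_grid, cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
print("best alpha:", search.best_params_["alpha"])

# A deliberately strong penalty, used only to make the sparsity visible
sparse_lasso = Lasso(alpha=0.5).fit(X, y)
kept = np.flatnonzero(sparse_lasso.coef_)  # indices of non-zero coefficients
print("features kept at alpha=0.5:", kept.tolist())
```

The cross-validated alpha minimizes prediction error and may keep more features than the strong-penalty model; inspecting which coefficients remain non-zero is what supports the feature selection discussion.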

## Results and Comparison
We summarize and compare the final performance of the Baseline, Ridge, and Lasso models.
### Final Analysis
* **Task:** Present results in clear tables showing RMSE, MAE, and $\text{R}^2$.
* **Task:** Create plots to visually compare performance.
* **Task:** Interpret the coefficients of the best Lasso model (feature selection).

In [None]:
# Placeholder for comparison tables and plots
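
One way the comparison table could be assembled is sketched below. `evaluate` is a hypothetical helper, the data is synthetic, and the alpha values are arbitrary placeholders rather than tuned results, so the numbers it prints are not project results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_test, y_test):
    """Return RMSE, MAE, and R^2 for a fitted model (hypothetical helper)."""
    pred = model.predict(X_test)
    return {
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
        "MAE": float(mean_absolute_error(y_test, pred)),
        "R2": float(r2_score(y_test, pred)),
    }


# Synthetic data standing in for the engineered features and target
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 8))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)

# Alpha values are arbitrary placeholders, not tuned results
models = {
    "Baseline": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.01),
}
rows = {
    name: evaluate(model.fit(X_train, y_train), X_test, y_test)
    for name, model in models.items()
}
results = pd.DataFrame.from_dict(rows, orient="index")  # one row per model
print(results.round(4))
```

Evaluating every model on the same held-out split keeps the rows directly comparable.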

## Conclusion
(explain)

## Team Contributions
* **Michael Alva:**
* **Cameron Matthews:**