# Title: Using Player Data to Predict Subscription Status

## Introduction:
Many online games offer premium subscriptions that provide extra content or features. Understanding which players are most likely to subscribe helps developers improve marketing and design decisions. To support a research group at UBC studying video game player behaviour, we will use data analysis to answer the predictive question outlined below:

**Can we predict whether a player will subscribe to a premium service based on age and time played?**

To explore this, we applied the K-Nearest Neighbours (KNN) classification algorithm to predict subscription status using player data collected by the research group. The player.csv dataset used for this project contains 196 observations (one per player) and 9 variables:

- experience: categorical level of gameplay experience (e.g., Amateur, Regular, Veteran, Pro)

- subscribe: Boolean value indicating whether the player subscribed to the premium service

- hashedEmail: Identifier for the players (categorical variable)

- played_hours: total number of hours played (numerical variable)

- name: Player's name (categorical variable)

- gender: self-reported gender identity (categorical variable)

- age: player’s age (numeric variable)

- individualId: Individual player ID (all missing values)

- organizationName: Player organization name (all missing values)

We used subscription status as the target variable (y) and explored whether a player’s age and total hours played could predict whether they would subscribe.

## Methods & Results:

In [1]:
# Run this cell first to import the required libraries.
import pandas as pd
import numpy as np
import altair as alt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

1. Loading the Data

The analysis began by importing the dataset from a Google Drive link using pandas.read_csv(). The dataset contained demographic and gameplay information for 196 players.

In [2]:
url = 'https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz'
df = pd.read_csv(url)

2. Data Wrangling and Cleaning

Several identifying columns were removed because they were not useful for prediction.
Only the columns relevant to this analysis were kept (age, played_hours, and subscribe).
Rows with missing values in relevant columns were dropped.

Because log-scaling was required later for visualization, played_hours was shifted by +1 so no values were zero.

In [3]:
df = df.drop(columns=["experience", "gender","individualId", "organizationName", "hashedEmail", "name"])
df = df.dropna(subset=[ "age", "played_hours", "subscribe"])
df["played_hours"] = df["played_hours"] + 1 #starts at 1 not 0.
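A side note on the +1 shift: because StandardScaler subtracts each column's mean, adding a constant to played_hours should not change the standardized values that the KNN model sees later, so the shift only matters for the log-scaled axis. The sketch below illustrates both points; the `hours` array is made up for illustration and is not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical playtimes, including a zero (not taken from the dataset)
hours = np.array([[0.0], [0.5], [2.0], [40.0]])
shifted = hours + 1  # same shift as applied to played_hours above

# Standardizing removes the mean, so a constant shift has no effect
same_after_scaling = np.allclose(
    StandardScaler().fit_transform(hours),
    StandardScaler().fit_transform(shifted),
)

# A zero cannot be placed on a log axis, but the shifted value maps to 0
log_of_shifted_zero = np.log10(shifted[0, 0])

print(same_after_scaling, log_of_shifted_zero)
```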

3. Preparing Data for Modelling

The data was split into training and testing sets using an 80/20 split, stratified on the target variable (subscribe) so that the training and testing sets have roughly the same proportion of each class label. The target variable, subscribe, and the predictor variables, age and played_hours, were specified for both sets. A random seed is set once at the beginning of the analysis to make it reproducible.

In [4]:
# Set seed
np.random.seed(1)

# Split training and testing set
train_df, test_df = train_test_split(
    df, train_size=0.80, stratify=df["subscribe"], random_state=1
)

# Specify predictor and target variables
train_X = train_df[["age", "played_hours"]]
train_y = train_df["subscribe"]
test_X = test_df[["age", "played_hours"]]
test_y = test_df["subscribe"]
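One way to confirm that stratification behaves as described is to compare the class shares in the two splits. The sketch below does this on a small made-up table (`toy` and its 70/30 label mix are assumptions for illustration, not the player data); the same check could be run on train_y and test_y.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 subscribers, 30 non-subscribers
toy = pd.DataFrame({
    "x": range(100),
    "subscribe": [True] * 70 + [False] * 30,
})

toy_train, toy_test = train_test_split(
    toy, train_size=0.80, stratify=toy["subscribe"], random_state=1
)

# The mean of a Boolean column is the share of True labels
train_share = toy_train["subscribe"].mean()
test_share = toy_test["subscribe"].mean()
print(len(toy_train), len(toy_test), train_share, test_share)
```

Both shares should sit close to the overall 0.70, which is the property the stratified split is meant to preserve.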

4. Exploratory Data Analysis

To explore whether age and gameplay activity are related to subscription behaviour, we created a scatterplot to visualize the relationship between age and hours played, coloured by subscription status.

## Figure 1. Age vs Played Hours Coloured by Subscription Status

In [5]:
scatter = alt.Chart(train_df).mark_circle(opacity=0.7).encode(
    x=alt.X("age", title="Player Age"),
    y=alt.Y("played_hours", title="Total Hours Played").scale(type="log"),
    color=alt.Color("subscribe:N", title="Subscribed")
        
).properties(
    title="Figure 1. Age vs Played Hours by Subscription Status",
)
scatter

Figure 1 shows that most subscribers are around age 15 and have under 9 hours of playing time. There are 5 subscribers between ages 30 and 100, 8 subscribers with 10-100 playing hours, and 3 players with 100-300 playing hours. All non-subscribers have under 10 playing hours. Both variables therefore span a wide range of values, with a few extreme outliers.

Although weak patterns in subscription behaviour are visible, the data does not show any clear trend or shape. This makes KNN classification a reasonable choice, because the model makes few assumptions about the shape or pattern of the data. There is also a clear class imbalance, with significantly more True labels than False labels. This may make the model less accurate, because the KNN algorithm will tend to assign new observations to the more common class label, which is True in this case.

5. Preparing Data for Modeling (continued)
   
The predictor variables, age and played_hours, are on very different scales. Because played_hours has a much wider range than age, it would likely dominate the distance calculation that determines which neighbours are selected to make the predictions. To put both predictors on a comparable scale, they were standardized using StandardScaler() within a preprocessing pipeline.

In [6]:
# Preprocessing (standardize age + played_hours)
preprocess = make_column_transformer(
    (StandardScaler(), ["age", "played_hours"]),
)
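To make the effect of scale concrete, the sketch below compares Euclidean distances before and after standardization. The four (age, played_hours) rows are hypothetical values chosen for illustration, and `distances_from_first` is a helper introduced here, not part of the analysis.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical players as (age, played_hours); not taken from the dataset
players = np.array([
    [20.0, 10.0],   # the player whose neighbours we want
    [45.0, 15.0],   # very different age, similar playtime
    [21.0, 60.0],   # similar age, different playtime
    [30.0, 200.0],  # heavy player that stretches the playtime range
])

def distances_from_first(points):
    """Euclidean distance from the first row to every other row."""
    return np.linalg.norm(points[1:] - points[0], axis=1)

raw = distances_from_first(players)
scaled = distances_from_first(StandardScaler().fit_transform(players))

print("raw:", raw.round(2))
print("scaled:", scaled.round(2))
```

In raw units the 50-hour gap outweighs the 25-year age gap, so the 45-year-old is the nearest neighbour; after standardization the similarly aged player becomes the nearest one. Under these assumed values, the choice of neighbours depends on whether the predictors are scaled.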

6. Model Building and Preprocessor

A K-Nearest Neighbours (KNN) classifier was used to model subscription behaviour.
A grid search tested various odd values of k between 1 and 49 using 5-fold cross-validation to find the best-performing model.

In [7]:
# Build pipeline
knn = KNeighborsClassifier()
pipe = make_pipeline(preprocess, knn)

# Parameter grid for tuning k
param_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 51, 2)
}

# Grid search (5-fold CV)
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(train_X, train_y)


### Results
Best Hyperparameter (k = 21)

In [8]:
# Best k
print("Best k:", grid.best_params_)

Best k: {'kneighborsclassifier__n_neighbors': 21}


In [9]:
# CV accuracy estimate for K values
accuracy_grid = (
    pd.DataFrame(grid.cv_results_)
    .rename(columns={"param_kneighborsclassifier__n_neighbors": "n_neighbors"})
)

# Plot for accuracy and neighbours
accuracy_plot = alt.Chart(accuracy_grid).mark_line(point=True).encode(
    x=alt.X("n_neighbors").title("Number of Neighbors"),
    y=alt.Y("mean_test_score")
        .scale(zero=False)
        .title("Accuracy estimate")
).properties(
    title="Figure 2. Estimated Accuracy vs Number of Neighbors",
)
accuracy_plot

The best_params_ attribute of the fitted GridSearchCV object shows that the K value with the highest estimated accuracy is 21. Figure 2 shows that K = 21 to 25 share the highest estimated accuracy (75%). The estimated accuracies for K values from 9 to 19 and from 27 to 49 are slightly lower, but still close to 74%. We therefore chose K = 21 for the classifier: it has the highest estimated accuracy, and changing K to a nearby value affects the accuracy only slightly.
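When several K values differ by about one percentage point, it can help to look at how much the accuracy varies across folds. The sketch below shows one way to read the mean and standard deviation from cv_results_. It runs on synthetic data from make_classification, and `demo_grid` and `summary` are names introduced for illustration; it does not reproduce the results above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature, imbalanced data standing in for the player data
X, y = make_classification(
    n_samples=150, n_features=2, n_informative=2, n_redundant=0,
    weights=[0.3, 0.7], random_state=1,
)

demo_grid = GridSearchCV(
    make_pipeline(StandardScaler(), KNeighborsClassifier()),
    param_grid={"kneighborsclassifier__n_neighbors": range(1, 51, 2)},
    cv=5,
)
demo_grid.fit(X, y)

# Mean accuracy and its fold-to-fold spread for each k, best first
columns = [
    "param_kneighborsclassifier__n_neighbors",
    "mean_test_score",
    "std_test_score",
]
summary = (
    pd.DataFrame(demo_grid.cv_results_)[columns]
    .sort_values("mean_test_score", ascending=False)
)
print(summary.head())
```

If the spread in std_test_score is larger than the gap between neighbouring K values, those K values are hard to tell apart from the cross-validation estimate alone.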

Model Performance

Predictions were made on the test dataset. Accuracy, precision, recall, and a confusion matrix were calculated to assess the quality of the model.

In [10]:
# Predictions on test data
test_pred = grid.predict(test_X)

# Evaluation metrics
print("Accuracy:", accuracy_score(test_y, test_pred))
print("Precision:", precision_score(test_y, test_pred))
print("Recall:", recall_score(test_y, test_pred))

Accuracy: 0.75
Precision: 0.7435897435897436
Recall: 1.0


### Confusion Matrix (Figure 3)

In [17]:
# Confusion Matrix
print("\nFigure 3: Confusion Matrix")
player_mat = pd.crosstab(test_y, test_pred, rownames=["Actual"], colnames=["Predicted"])
player_mat


Figure 3: Confusion Matrix


| Actual \ Predicted | False | True |
|---|---|---|
| False | 1 | 10 |
| True | 0 | 29 |


The confusion matrix shows that 29 players were correctly predicted as subscribers, and 1 player was correctly predicted as a non-subscriber. 10 players were incorrectly classified as subscribers, while no players were incorrectly classified as non-subscribers.
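As a cross-check, the three reported metrics can be recomputed by hand from the four counts in Figure 3, treating "subscribed" as the positive class. The variable names below are introduced for illustration.

```python
# Counts read from Figure 3 (positive class = subscribed)
tp, fp, fn, tn = 29, 10, 0, 1

accuracy = (tp + tn) / (tp + fp + fn + tn)   # 30 / 40
precision = tp / (tp + fp)                   # 29 / 39
recall = tp / (tp + fn)                      # 29 / 29

print(round(accuracy, 2), round(precision, 4), round(recall, 2))
# prints: 0.75 0.7436 1.0
```

These values agree with the accuracy_score, precision_score, and recall_score output above.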

## Discussion: 
The model performed best with k = 21, meaning it made the most accurate predictions when taking a majority vote among the 21 nearest players. This makes sense for a noisy real-world dataset, since a larger k produces smoother and more stable predictions. The overall accuracy was 0.75, indicating the model correctly classified 75% of players. Precision was 0.74, showing that when the model predicted a player would subscribe, it was right about three-quarters of the time. The strongest result was the recall of 1.0, meaning the model successfully identified every actual subscriber and made no false-negative errors.

However, the classifier made many mistakes on non-subscribers: 10 non-subscribers were classified as subscribers, and only 1 was predicted correctly. The model therefore tends to classify non-subscribers as subscribers, which suggests that class imbalance affected its performance as expected, because the dataset does not contain enough non-subscriber data to make accurate predictions. With few non-subscribers and a relatively large k of 21, the majority of the neighbours were usually subscribers, who dominated the predictions.
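The effect of the imbalance can be quantified from the same Figure 3 counts. The sketch below compares the model with a naive rule that labels every player a subscriber, and computes specificity and balanced accuracy; these extra metrics are not part of the original analysis and are offered only as one possible way to read the results.

```python
# Counts read from Figure 3 (positive class = subscribed)
tp, fp, fn, tn = 29, 10, 0, 1
total = tp + fp + fn + tn

# Accuracy of a rule that labels every player a subscriber
baseline_accuracy = (tp + fn) / total        # 29 / 40

# Share of actual non-subscribers that the model identified
specificity = tn / (tn + fp)                 # 1 / 11

# Average of the per-class recalls, which is less flattering under imbalance
recall = tp / (tp + fn)
balanced_accuracy = (recall + specificity) / 2

print(baseline_accuracy, round(specificity, 3), round(balanced_accuracy, 3))
# prints: 0.725 0.091 0.545
```

On this test set the model's 0.75 accuracy is only slightly above the 0.725 reached by always predicting "subscribed", which is consistent with the difficulty in identifying non-subscribers described above.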

These findings were partly expected. It makes sense that hours played would be linked to subscription behaviour, since players who spend more time in the game may be more willing to pay for premium features. Age may also play a role, as many players in the 18-28 range tend to be highly active gamers.

The impact of these findings is that game developers could potentially use simple behavioural metrics like age and playtime to identify highly engaged players who might be receptive to premium offers. However, the model’s difficulty in correctly identifying non-subscribers shows that relying on these predictors alone may not be sufficient for precise targeting. 

These results lead to several future questions. Would the model perform better with more balanced data or additional features, such as in-game purchases or session frequency? It could also be worth examining how game design elements (such as difficulty, rewards, or progression speed) influence a player’s likelihood of subscribing. Additionally, we could explore whether players at specific experience levels (Amateur vs. Veteran) respond differently to premium features.