Prediction of Game-related Newsletter Subscription Using Experience Level and Playtime Via K-Nearest Neighbors
-

Introduction
-

The UBC Minecraft Research server is an ongoing research project that studies how people play video games. As part of this project, player demographics and in-game activity data are collected so that the research group can ensure it has enough resources to handle the number of current and incoming players. An important challenge for the research group is understanding which types of players are most likely to remain engaged, specifically through actions such as subscribing to the server's newsletter. Newsletter subscribers are generally shown to be more connected to the game, and are often more likely to participate in other studies.

The question our group will answer is: "Can we predict whether a player subscribes to the newsletter using their experience level and playtime?"

Of the two available datasets, we will use players.csv to conduct our analysis. It has 196 observations and 9 variables that describe each player who has logged onto the UBC Minecraft Research server. It was loaded from a Google Drive link to ensure full reproducibility within the Jupyter environment.

In [1]:
import pandas as pd
players = pd.read_csv("https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz")
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


Figure 1: Read in of Working Data Set

Each row contains information about one player, covering their identity (such as age and gender) and their server statistics (such as hours played).

In [2]:
players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196 entries, 0 to 195
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   experience        196 non-null    object 
 1   subscribe         196 non-null    bool   
 2   hashedEmail       196 non-null    object 
 3   played_hours      196 non-null    float64
 4   name              196 non-null    object 
 5   gender            196 non-null    object 
 6   age               196 non-null    int64  
 7   individualId      0 non-null      float64
 8   organizationName  0 non-null      float64
dtypes: bool(1), float64(3), int64(1), object(4)
memory usage: 12.6+ KB


Figure 2: Information about the Data Frame

In [3]:
variable_table = pd.DataFrame({
    "Variable": [
        "experience", "subscribe", "hashedEmail", "played_hours",
        "name", "gender", "age", "individualId", "organizationName"
    ],
    "Type": [
        "Categorical", "Boolean", "String", "Numeric",
        "String", "Categorical", "Integer", "NaN", "NaN"
    ],
    "Description": [
        "The reported experience of the player",
        "If the player subscribed to the newsletter",
        "The hashed email of the player",
        "The playtime of the player",
        "The username of the player",
        "The reported gender of the player",
        "The age of the player",
        "Empty",
        "Empty"
    ]
})

variable_table

Unnamed: 0,Variable,Type,Description
0,experience,Categorical,The reported experience of the player
1,subscribe,Boolean,If the player subscribed to the newsletter
2,hashedEmail,String,The hashed email of the player
3,played_hours,Numeric,The playtime of the player
4,name,String,The username of the player
5,gender,Categorical,The reported gender of the player
6,age,Integer,The age of the player
7,individualId,,Empty
8,organizationName,,Empty


Figure 3: Table of Variables, Types, and Corresponding Descriptions

The variables individualId and organizationName are empty and need to be dropped. The other variables are mostly usable, but one potential issue is that all of the self-reported fields may be inconsistent or biased. This applies especially to the experience field, whose categories (for example Amateur, Regular, and Veteran) seem rather vague and open to interpretation. Even with these issues, the dataset is good enough for making player-level predictions without needing session-level data.

In [4]:
players_tidy = players.drop(columns=["individualId", "organizationName"])
players_tidy

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21
...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17


Figure 4: Tidied Data Frame

Methods and Results
-

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier

X = players_tidy[["experience", "played_hours"]]
y = players_tidy["subscribe"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=1234,
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["played_hours"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["experience"])]
)

preprocessor

This preprocessor readies the dataset for modeling by handling the mixed data types it contains. First, it standardizes the numerical column played_hours so that it has a mean of 0 and a standard deviation of 1, which keeps its scale from dominating the distance calculations in K-nearest neighbors. Then it one-hot encodes the categorical variable experience, turning each category into its own 0/1 column so that the algorithm can interpret and use it. The result is model-ready data that can be passed directly into a predictive model.

Once the data had been split into training and testing sets and the preprocessor had been created, we employed 5-fold cross-validation (a standard choice that suits the amount of data we have) on the training set, testing 29 different k values (k = 1 to 29). Each k value was assigned an estimated accuracy, namely the mean accuracy across the five folds, and these estimates informed which k value to apply to the test set.
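As a rough illustration of what a single point in that search involves, the sketch below runs 5-fold cross-validation for one k value on synthetic data. `X_demo` and `y_demo` are randomly generated stand-ins, not our player data, and k = 5 is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: two numeric features, boolean label)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=100)) > 0

# cv=5 splits the data into five folds; each fold is held out once
# while the model is fit on the other four
fold_scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=5),
    X_demo,
    y_demo,
    cv=5,
)

# The estimated accuracy for this k is the mean of the five fold accuracies
mean_score = fold_scores.mean()
print(fold_scores)
print(mean_score)
```

A grid search repeats this procedure for every candidate k and records the per-fold and mean scores, which is what the split0_test_score to split4_test_score and mean_test_score columns report.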

In [6]:
# import packages
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# define parameter grid + define pipeline
param_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 30, 1),
}
players_tune_pipe = make_pipeline(preprocessor, KNeighborsClassifier())

# tune model
knn_tune_grid = GridSearchCV(
    estimator=players_tune_pipe,
    param_grid=param_grid,
    cv=5
)
knn_tune_grid

# fit the tuned model
knn_model_grid = knn_tune_grid.fit(X_train, y_train)
knn_model_grid

accuracies_grid = pd.DataFrame(knn_model_grid.cv_results_)
accuracies_grid.head(17)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_kneighborsclassifier__n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.005251,0.001062,0.004508,0.000256,1,{'kneighborsclassifier__n_neighbors': 1},0.625,0.645161,0.645161,0.709677,0.709677,0.666935,0.035667,27
1,0.00421,3.7e-05,0.004525,0.000392,2,{'kneighborsclassifier__n_neighbors': 2},0.46875,0.483871,0.580645,0.483871,0.580645,0.519556,0.050183,29
2,0.00417,4.6e-05,0.004309,4e-05,3,{'kneighborsclassifier__n_neighbors': 3},0.71875,0.645161,0.709677,0.645161,0.709677,0.685685,0.033253,24
3,0.004218,8.2e-05,0.004283,2.6e-05,4,{'kneighborsclassifier__n_neighbors': 4},0.625,0.483871,0.677419,0.612903,0.677419,0.615323,0.070839,28
4,0.004178,5.7e-05,0.004255,2.5e-05,5,{'kneighborsclassifier__n_neighbors': 5},0.75,0.709677,0.741935,0.741935,0.741935,0.737097,0.014061,11
5,0.004574,0.000451,0.004299,5.1e-05,6,{'kneighborsclassifier__n_neighbors': 6},0.75,0.677419,0.645161,0.612903,0.709677,0.679032,0.047955,26
6,0.004327,0.000336,0.004286,4.6e-05,7,{'kneighborsclassifier__n_neighbors': 7},0.75,0.741935,0.741935,0.741935,0.741935,0.743548,0.003226,1
7,0.004138,1.9e-05,0.004278,2.8e-05,8,{'kneighborsclassifier__n_neighbors': 8},0.65625,0.741935,0.741935,0.741935,0.741935,0.724798,0.034274,12
8,0.004146,4.6e-05,0.004271,1.7e-05,9,{'kneighborsclassifier__n_neighbors': 9},0.75,0.741935,0.741935,0.741935,0.741935,0.743548,0.003226,1
9,0.004172,7.4e-05,0.004277,2.5e-05,10,{'kneighborsclassifier__n_neighbors': 10},0.65625,0.612903,0.741935,0.645161,0.741935,0.679637,0.052823,25


Figure 5: Data Frame of Accuracies Grid with Multiple K Values (First 17 Entries)

In [7]:
# import package
import altair as alt

# visualize the accuracies grid
accuracies_grid_plot = alt.Chart(accuracies_grid, title="Estimated Accuracy vs. K Value").mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors")
        .title("K Value"),
    y=alt.Y("mean_test_score")
        .title("Mean Test Score")
)
accuracies_grid_plot

Figure 6: Estimated Accuracy Scores Against K value

A plot of the accuracy scores against the corresponding k-values was used to better contextualize the model's performance.

In [8]:
# find the k value with highest estimated accuracy
knn_tune_grid.best_params_

{'kneighborsclassifier__n_neighbors': 7}

Based on .best_params_, the K value with the highest estimated accuracy (mean test score) is 7 (74.35%). However, the accuracy fluctuates considerably for K values just above or below 7. K = 9 has the same accuracy as K = 7, but the accuracies of its neighbouring K values fluctuate in the same way.

K = 17 was chosen as the final K value because the accuracy around it is more stable, while it retains the same estimated accuracy as K = 7 and K = 9. This stability is preferable because it makes the model more robust to noisy data points.
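One way to put a number on this kind of stability is to look at how much the mean test score varies across neighbouring K values. The sketch below is an assumption about how that could be done, not the procedure we used: it takes the mean test scores for K = 1 to 10 from the displayed rows of Figure 5 (rows beyond K = 10 are not shown there, so K = 17 is not included) and computes a rolling mean and standard deviation over each K and its two neighbours. The column names `neighbour_mean` and `neighbour_std` are made up for this example.

```python
import pandas as pd

# Mean test scores for K = 1..10, copied from the displayed rows of Figure 5
scores = pd.DataFrame({
    "k": list(range(1, 11)),
    "mean_test_score": [
        0.666935, 0.519556, 0.685685, 0.615323, 0.737097,
        0.679032, 0.743548, 0.724798, 0.743548, 0.679637,
    ],
})

# Centred window of 3: each K together with the K just below and just above it
window = scores["mean_test_score"].rolling(window=3, center=True)
scores["neighbour_mean"] = window.mean()
scores["neighbour_std"] = window.std()

# A low neighbour_std with a high neighbour_mean marks a stable, accurate region
print(scores.sort_values("neighbour_std"))
```

Applied to the full accuracies grid, the same calculation would allow the regions around K = 7, K = 9, and K = 17 to be compared directly.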

In [9]:
# 1. Define the final model pipeline using the chosen k value of k=17
final_model_pipe = make_pipeline(preprocessor, KNeighborsClassifier(n_neighbors=17))

# 2. Train the final model on the training data only, so the test set stays unseen
final_model_pipe.fit(X_train, y_train)

# 3. Generate predictions for the held-out test set
all_predictions_test = final_model_pipe.predict(X_test)

# 4. Create a DataFrame for test results only (using X_test)
test_results_df = X_test.copy()
test_results_df["True_Subscribe"] = y_test
test_results_df["Predicted_Subscribe"] = all_predictions_test

output_filename = "players_final_predictions_unchanged_code.csv"
test_results_df.to_csv(output_filename, index=False)

print(f" Success Final predictions saved to '{output_filename}'.")
print("\nFirst 5 rows of the final output:")
print(test_results_df.head(5))

 Success Final predictions saved to 'players_final_predictions_unchanged_code.csv'.

First 5 rows of the final output:
    experience  played_hours  True_Subscribe  Predicted_Subscribe
101    Amateur           0.0            True                 True
51     Regular         218.1            True                 True
146        Pro           0.0            True                 True
153   Beginner           0.1            True                 True
106    Regular           0.0           False                 True


Figure 7: First 5 Rows of Output from Data Frame with Predictions and True Labels

Discussion
-

In [10]:
true_plot = alt.Chart(test_results_df, title="True Classification of Test Set").mark_point(opacity=0.4).encode(
    x=alt.X("experience"),
    y=alt.Y("played_hours"),
    color="True_Subscribe"
)

prediction_plot = alt.Chart(test_results_df, title="Classification by Predictive Model").mark_point(opacity=0.4).encode(
    x=alt.X("experience"),
    y=alt.Y("played_hours"),
    color = alt.Color("Predicted_Subscribe", title="Newsletter Subscription Status")
)
prediction_plot

true_plot | prediction_plot

Figure 8: Side-by-Side Presentation of True Classifications vs. Predicted Classifications on Test Data

In [11]:
accuracy = final_model_pipe.score(X_test, y_test)
accuracy

0.7

The accuracy of the predictive model on the test set, obtained with the .score method, was 70%.

In [12]:
players_crosstab = pd.crosstab(
    test_results_df["True_Subscribe"],
    test_results_df["Predicted_Subscribe"]
)
players_crosstab

Predicted_Subscribe,True
True_Subscribe,Unnamed: 1_level_1
False,12
True,28


Figure 9: Confusion Matrix

From the confusion matrix, we can see that 12 entries were mislabelled, all of which had a true classification of "False". Because the model never predicted "False", the matrix has only a single "True" column.
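Because pd.crosstab only shows categories that actually occur, the empty "False" prediction column is missing from Figure 9. A minimal sketch of one way to display the full 2x2 matrix follows; it rebuilds the labels from the counts in Figure 9 (12 true "False", 28 true "True", every prediction "True") rather than from the original data frame, and the table labels are chosen just for this example.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Labels rebuilt from the counts reported in Figure 9
y_true = [False] * 12 + [True] * 28
y_pred = [True] * 40

# Passing labels explicitly keeps a row and column for every class,
# even one that was never predicted
cm = confusion_matrix(y_true, y_pred, labels=[False, True])

cm_table = pd.DataFrame(
    cm,
    index=pd.Index([False, True], name="True_Subscribe"),
    columns=pd.Index([False, True], name="Predicted_Subscribe"),
)
print(cm_table)
```

The "False" prediction column is all zeros, which makes it immediately visible that the model assigned every test observation to the majority class.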

In [13]:
all_predictions_test

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True])

None of the model's predictions on the testing set were "False", which was not expected.

In [14]:
players["subscribe"].value_counts()

subscribe
True     144
False     52
Name: count, dtype: int64

Far more players were subscribed to the newsletter (144) than were not (52), which could have thrown off the classification model: with roughly 73% of players subscribed, the majority of any point's nearest neighbours will often be subscribers. It is possible that this could be mitigated by upsampling the minority class in the training data, with the aim of improving the model's predictive accuracy.
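A minimal sketch of what such upsampling could look like is given below. It is not part of our analysis: the `train` data frame is a hypothetical stand-in for a training split, and sampling the minority class with replacement via sklearn.utils.resample is only one of several possible approaches. Upsampling should be applied to the training data only, so that the test set still reflects the real class balance.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical training split with an imbalanced label (6 True vs. 2 False)
train = pd.DataFrame({
    "played_hours": [30.3, 3.8, 0.7, 0.1, 0.0, 1.5, 0.0, 0.3],
    "subscribe": [True, True, True, True, True, True, False, False],
})

majority = train[train["subscribe"]]
minority = train[~train["subscribe"]]

# Draw minority rows with replacement until both classes are the same size
minority_upsampled = resample(
    minority,
    replace=True,
    n_samples=len(majority),
    random_state=1234,
)

balanced = pd.concat([majority, minority_upsampled])
print(balanced["subscribe"].value_counts())
```

The balanced frame would then replace the original training data when fitting the pipeline, while evaluation would still use the untouched test set.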

In [15]:
evaluation_table = pd.DataFrame({
    "Precision": [0.7, 0],
    "Recall": [1.0, 0]}, 
    index=["true_as_pos", "false_as_pos"]
)
evaluation_table

Unnamed: 0,Precision,Recall
true_as_pos,0.7,1.0
false_as_pos,0.0,0.0


Figure 10: Tabulated Precision and Recall with "True" or "False" as the Positive Label Respectively

Judging from the precision and recall values, the model would not be reliable for predictions on new data, especially not for predicting which individuals are not subscribed to a games-related newsletter: with "False" as the positive label, it identifies none of them.
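The values in Figure 10 were entered by hand. As a sketch of how they could instead be computed, the snippet below uses scikit-learn's precision_score and recall_score on labels rebuilt from the counts in Figure 9. Note that with "False" as the positive label the precision is technically undefined (no "False" predictions were made); `zero_division=0` is the setting that reports it as 0, matching the table.

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score

# Labels rebuilt from the counts in Figure 9: 12 true False, 28 true True,
# and every prediction True
y_true = [False] * 12 + [True] * 28
y_pred = [True] * 40

rows = {}
for pos_label, row_name in [(True, "true_as_pos"), (False, "false_as_pos")]:
    rows[row_name] = {
        # zero_division=0 reports an undefined ratio (0/0) as 0
        "Precision": precision_score(
            y_true, y_pred, pos_label=pos_label, zero_division=0
        ),
        "Recall": recall_score(
            y_true, y_pred, pos_label=pos_label, zero_division=0
        ),
    }

computed_table = pd.DataFrame.from_dict(rows, orient="index")
print(computed_table)
```

Computing the metrics this way keeps the table in sync with the predictions if the model or the split changes.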

Taking all of these factors together, the analysis suggests that experience and playtime alone do not fully explain player subscription, and that other variables covering intent and motivation would need to be measured to better explain a player's likelihood of subscribing. Had the predictions been more reliable, this model could have been used to inform the marketing tactics and outreach initiatives of various gaming companies.

In the future, it would be interesting to see what the split is between people who are subscribed to a games-related newsletter and those who are not within a wider population. It would also be interesting to see how applicable this type of predictive model would be for a much larger set of people (especially for those who would not volunteer to play on the PLAIcraft server).