## Title

### Introduction

Understanding player behavior is important for managing online gaming platforms effectively. The Pacific Laboratory for Artificial Intelligence (PLAI), a computer science research group at UBC, set up plaicraft.ai, a Minecraft server designed to collect gameplay data. Our project focuses on predicting which types of players are likely to log many hours of playtime, in support of PLAI's research. By identifying which players are likely to play the most, PLAI can recruit participants and plan for the team’s needs more effectively.

We aim to answer the question: **What kinds of players are most likely to contribute a significant amount of playtime, and how can this be predicted using a K-Nearest Neighbors (KNN) regression model?**

KNN regression is a model primarily used for predicting continuous numerical values, and is fundamentally based on distance calculations. Like its classification counterpart, it requires few assumptions about the data and can capture non-linear relationships between predictor and response variables. It is also simple and intuitive to work with.
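To make the distance-based idea concrete, the following is a minimal sketch of KNN regression for a single numeric predictor, written in plain Python. The function name `knn_predict` and the toy data are hypothetical and are not part of the analysis; scikit-learn's `KNeighborsRegressor`, used later, is a far more general implementation.

```python
# Minimal KNN regression sketch (illustrative only; not the scikit-learn implementation).
def knn_predict(train_x, train_y, query, k):
    """Predict a response as the mean of the k nearest training responses."""
    # Rank training points by absolute distance to the query (1-D Euclidean distance).
    ranked = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query))
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / k


# Hypothetical toy data: an encoded experience level and hours played.
toy_x = [0, 1, 2, 3, 4]
toy_y = [1.0, 2.0, 9.0, 4.0, 5.0]

prediction = knn_predict(toy_x, toy_y, query=2, k=3)
print(prediction)
```

With `k=3`, the prediction for level 2 averages the responses at levels 1, 2, and 3, so a larger `k` smooths the prediction while a smaller `k` follows individual points more closely.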

The players.csv dataset contains information about participants. It has 9 variables, listed below, and 196 observations, one per participant. The data frame appears tidy.
The sessions.csv dataset contains information about participants' session times. It has four variables and 1535 observations. The data frame is possibly not tidy.

In [42]:
import pandas as pd
data = {
    "Variable Name": ["experience", "subscribe", "Hashed Email", "Played Hours", "name", "gender", "age", "Individual Id", "Organization Name"],
    "Type": ["String", "Boolean", "String", "Numeric", "String", "String", "Numeric", "Undefined", "Undefined"],
    "Description": [
        "Participant’s skill and mastery throughout gaming sessions.",
        "Participant’s subscription status.",
        "Participant’s email, redacted for privacy reasons.",
        "Number of hours played by each participant.",
        "Participant’s name.",
        "Participant’s gender identity.",
        "Participant’s age.",
        "Participant’s ID used throughout the sessions.",
        "Unique identifier for the organization."
    ],
    "Value classes": [
        "Beginner, amateur, regular, pro, veteran",
        "True or False",
        "Random encryption",
        "Floats",
        "First names",
        "Male, female, nonbinary, agender, prefer not to say, etc.",
        "Integers",
        "NaN",
        "NaN"
    ]
}

# Create the DataFrame
df = pd.DataFrame(data)
df

Unnamed: 0,Variable Name,Type,Description,Value classes
0,experience,String,Participant’s skill and mastery throughout gam...,"Beginner, amateur, regular, pro, veteran"
1,subscribe,Boolean,Participant’s subscription status.,True or False
2,Hashed Email,String,"Participant’s email, redacted for privacy reas...",Random encryption
3,Played Hours,Numeric,Number of hours played by each participant.,Floats
4,name,String,Participant’s name.,First names
5,gender,String,Participant’s gender identity.,"Male, female, nonbinary, agender, prefer not t..."
6,age,Numeric,Participant’s age.,Integers
7,Individual Id,Undefined,Participant’s ID used throughout the sessions.,
8,Organization Name,Undefined,Unique identifier for the organization.,


### Methods & Results

To perform the analysis, we first loaded the required packages with import statements: pandas, altair, numpy, and the relevant scikit-learn components (via from sklearn ... import). We then set an initial seed value, which carries through the downstream code.

In [72]:
#importing packages
import altair as alt
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import set_config
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

np.random.seed(10)

The “players.csv” dataset, which is the one relevant to our question, was read using the “pd.read_csv()” function.

In [73]:
players = pd.read_csv("data/players.csv")
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


Next, as part of data wrangling, we removed the “individualId” and “organizationName” columns. They contain no defined values and are therefore irrelevant to our analysis.

In [74]:
players_filtered = players.drop(columns = ["individualId", "organizationName"])
players_filtered

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21
...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17


We then used the info() method on the new “players_filtered” data frame to check for missing values, to identify other areas that might need wrangling, and to summarize its structure, including the non-null count and data type of each column.

In [75]:
players_filtered.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196 entries, 0 to 195
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   experience    196 non-null    object 
 1   subscribe     196 non-null    bool   
 2   hashedEmail   196 non-null    object 
 3   played_hours  196 non-null    float64
 4   name          196 non-null    object 
 5   gender        196 non-null    object 
 6   age           196 non-null    int64  
dtypes: bool(1), float64(1), int64(1), object(4)
memory usage: 9.5+ KB


We then explored how two candidate predictors, experience and subscribe, relate to played hours. We visualized each relationship separately with a scatter plot and a bar plot, created with Altair's mark_point() and mark_bar() methods.

In [76]:
plot_1 = alt.Chart(
    players_filtered,
    title= "Exploratory visualization of Played hours versus Experience"
).mark_point(size = 20).encode(
    x=alt.X("experience")
        .title("Participant's experience level"),
    y=alt.Y("played_hours")
        .title("Participants played hours")
).properties(width = 300)
plot_1

In [77]:
plot_2 = alt.Chart(
    players_filtered,
    title= "Exploratory visualization of Played hours versus Experience"
).mark_bar(size = 20).encode(
    x=alt.X("experience")
        .title("Participant's experience level"),
    y=alt.Y("played_hours")
        .title("Participants played hours")
).properties(width = 300)
plot_2

In [78]:
plot_3 = alt.Chart(
    players_filtered,
    title= "Exploratory visualization of Played hours versus Subscription"
).mark_point(size = 20).encode(
    x=alt.X("subscribe")
        .title("Participant's subscription status"),
    y=alt.Y("played_hours")
        .title("Participants played hours")
).properties(width = 300)
plot_3

In [79]:
plot_4 = alt.Chart(
    players_filtered,
    title= "Exploratory visualization of Played hours versus Subscription"
).mark_bar(size = 20).encode(
    x=alt.X("subscribe")
        .title("Participant's subscription status"),
    y=alt.Y("played_hours")
        .title("Participants played hours")
).properties(width = 300)
plot_4

Our exploratory data analysis therefore supported considering the experience and subscribe variables as potential predictors when building the model.

Next, we examined the values of the experience column with the unique() function; it contains the categories “Beginner”, “Amateur”, “Regular”, “Pro”, and “Veteran”. Having decided to use KNN regression, we mapped the experience variable to numerical values ranging from 0 to 4. This preliminary wrangling step is needed because KNN regression relies on distance calculations, which require numeric predictors.

In [80]:
# Map the experience categories to integers so KNN can compute distances.
experience_levels = {
    "Beginner": 0,
    "Amateur": 1,
    "Regular": 2,
    "Veteran": 3,
    "Pro": 4,
}
players_filtered["experience"] = players_filtered["experience"].map(experience_levels)
players_filtered



Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age
0,4,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9
1,3,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17
2,3,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17
3,1,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21
4,2,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21
...,...,...,...,...,...,...,...
191,1,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17
192,3,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22
193,1,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17
194,1,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17


We then checked that the “players_filtered” dataset was ready for modelling. Note that we excluded the subscribe column from the analysis: mapping it produces a binary variable (0 or 1), which we judged too coarse to contribute meaningfully to KNN distance calculations.

For the analysis itself, we first split the players_filtered data into training and test sets of 70% and 30%, respectively, so that the model can be fitted on one portion and its ability to generalize assessed on the other.

In [99]:
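# np.random.seed(10) returns None, so the original random_state argument was None;
# the split is reproducible only because the global NumPy seed is reset first.
np.random.seed(10)
players_filtered_train, players_filtered_test = train_test_split(
    players_filtered, train_size=0.70, random_state=None
)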
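As a rough illustration of what a seeded 70/30 split does, here is a minimal sketch using only the standard library. The helper `split_rows` is hypothetical, and it will not reproduce the exact rows chosen by `train_test_split`; it only shows the shuffle-then-cut idea and why a fixed seed makes the split repeatable.

```python
import random


def split_rows(rows, train_fraction, seed):
    """Shuffle rows with a seeded generator, then cut into train and test parts."""
    rng = random.Random(seed)      # a local generator keeps the seed explicit
    shuffled = list(rows)          # copy so the caller's data is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(196))            # stand-in for the 196 player rows
train, test = split_rows(rows, train_fraction=0.70, seed=10)
print(len(train), len(test))
```

Calling `split_rows` again with the same seed returns the same partition, which is the property the seed in the analysis is meant to provide.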

Next, we instantiated our KNeighborsRegressor model; the seed set earlier keeps the downstream steps reproducible.

In [100]:
knn = KNeighborsRegressor()

We then created a preprocessor to standardize the experience predictor. Scaling is not strictly required here, since experience is the only predictor we used.

In [101]:
preprocessor = make_column_transformer(
    (StandardScaler(), ["experience"]),
    remainder="passthrough",
    verbose_feature_names_out = False
)
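For readers unfamiliar with standardization, the sketch below shows the calculation that `StandardScaler` applies to a column: subtract the mean and divide by the population standard deviation. The helper `standardize` and the input values are hypothetical, chosen only to illustrate the arithmetic.

```python
import statistics


def standardize(values):
    """Return z-scores: subtract the mean, divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)   # population SD (ddof = 0), as StandardScaler uses
    return [(v - mean) / std for v in values]


encoded_experience = [0, 1, 2, 3, 4]  # hypothetical: one player per experience level
z_scores = standardize(encoded_experience)
print(z_scores)
```

After scaling, the values have mean 0 and standard deviation 1. With a single predictor this does not change which neighbours are nearest, which is why the step is optional in this analysis.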

We then built a pipeline to chain the preprocessor and the KNN regressor, along with a parameter grid for n_neighbors covering the odd values from 3 to 19 (range(3, 20, 2)), kept small because of the small training set.

In [102]:
pipeline = make_pipeline(preprocessor, knn)
param_grid = {
    "kneighborsregressor__n_neighbors": range(3, 20, 2),
}

We used a 5-fold cross-validated grid search, passing our pipeline as the estimator together with the parameter grid, and scoring with neg_root_mean_squared_error (the negative of the RMSPE) to choose the best number of neighbours.

In [103]:
gridsearch = GridSearchCV(
    estimator = pipeline,
    param_grid = param_grid,
    cv = 5,
    scoring = "neg_root_mean_squared_error"
)
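The sketch below illustrates how 5-fold cross-validation partitions the training rows: each fold serves once as the validation set while the remaining rows are used for fitting. The generator `k_fold_indices` is a hypothetical helper, and it assumes contiguous, unshuffled folds, which is our understanding of what an integer `cv` means for a regressor in scikit-learn.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, validation
        start += size


# 137 is the number of training rows that a 70% split of 196 players gives.
folds = list(k_fold_indices(n_samples=137, n_folds=5))
print([len(validation) for _, validation in folds])
```

Each candidate value of `n_neighbors` is fitted and scored once per fold, so the grid search reports five scores per value plus their mean and standard deviation.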

We then fitted the grid search to the training data and retrieved the cross-validation scores stored in cv_results_ by creating a data frame called results.

In [104]:
gridsearch.fit(
    players_filtered_train[["experience"]],
    players_filtered_train["played_hours"]
)

In addition, we created a standard error column by dividing the standard deviation of the fold scores in the results data frame by the square root of the number of folds.

We also converted the RMSPE to a positive value, since scikit-learn reports it in negated form. We then reduced the results data frame to the relevant columns and used the rename function to shorten the name of the neighbours column.

In [105]:
results = pd.DataFrame(gridsearch.cv_results_)
results

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_kneighborsregressor__n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.003394,0.000247,0.002109,0.000168,3,{'kneighborsregressor__n_neighbors': 3},-55.376084,-51.343204,-30.248203,-59.924272,-28.228177,-45.023988,13.187418,9
1,0.003026,0.000142,0.001861,8e-06,5,{'kneighborsregressor__n_neighbors': 5},-39.618894,-45.14624,-19.735736,-56.574939,-17.82156,-35.779474,14.931959,8
2,0.003092,0.000137,0.001918,7.2e-05,7,{'kneighborsregressor__n_neighbors': 7},-39.933847,-39.613123,-26.329342,-56.05727,-14.803235,-35.347363,13.939288,7
3,0.00302,0.000103,0.001877,6e-06,9,{'kneighborsregressor__n_neighbors': 9},-35.852219,-39.303776,-21.455277,-50.125911,-15.485642,-32.444565,12.487277,4
4,0.004367,0.00285,0.001858,1.6e-05,11,{'kneighborsregressor__n_neighbors': 11},-33.673376,-39.297866,-18.519617,-51.092237,-20.11729,-32.540077,12.18223,5
5,0.002959,7.1e-05,0.001859,1.2e-05,13,{'kneighborsregressor__n_neighbors': 13},-32.630034,-39.879128,-16.76213,-51.77498,-17.207794,-31.650813,13.445593,2
6,0.002957,7e-05,0.001853,1.5e-05,15,{'kneighborsregressor__n_neighbors': 15},-31.793014,-40.402255,-19.079729,-52.150354,-20.437451,-32.772561,12.44443,6
7,0.002915,2.7e-05,0.001843,8e-06,17,{'kneighborsregressor__n_neighbors': 17},-31.248299,-39.017599,-17.543247,-52.459255,-18.152097,-31.684099,13.180713,3
8,0.002957,7.2e-05,0.001847,2e-05,19,{'kneighborsregressor__n_neighbors': 19},-30.404144,-38.990474,-16.603969,-52.733597,-16.377652,-31.021967,13.839135,1


In [106]:
results["sem_test_score"] = results["std_test_score"]/5**(1/2)
results["mean_test_score"] = -results["mean_test_score"]
results = (
    results[["param_kneighborsregressor__n_neighbors",
             "mean_test_score",
             "sem_test_score"
            ]]
             .rename(columns = {"param_kneighborsregressor__n_neighbors" : "n_neighbors"})
)
             
results

Unnamed: 0,n_neighbors,mean_test_score,sem_test_score
0,3,45.023988,5.897593
1,5,35.779474,6.677775
2,7,35.347363,6.233839
3,9,32.444565,5.58448
4,11,32.540077,5.448059
5,13,31.650813,6.013052
6,15,32.772561,5.565318
7,17,31.684099,5.894594
8,19,31.021967,6.189049
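As a check on the standard error calculation, the sketch below recomputes the mean and standard error for n_neighbors = 19 from the five per-fold scores shown in the cv_results_ output. It uses only the standard library and assumes that std_test_score is a population standard deviation (ddof = 0).

```python
import math
import statistics

# Per-fold RMSE values for n_neighbors = 19, copied from the cv_results_ output
# (signs flipped to positive).
fold_rmse = [30.404144, 38.990474, 16.603969, 52.733597, 16.377652]

mean_rmse = statistics.fmean(fold_rmse)
std_rmse = statistics.pstdev(fold_rmse)          # population SD, matching std_test_score
sem_rmse = std_rmse / math.sqrt(len(fold_rmse))  # standard error of the mean

print(round(mean_rmse, 3), round(sem_rmse, 3))
```

The recomputed values agree with the mean_test_score and sem_test_score reported for that row of the results table.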


Furthermore, we obtained the best number of neighbours, k = 19, with an RMSPE of approximately 31.02 and a standard error of about 6.19, using best_params_ and the nsmallest function. We paired this with a visualization of the results data frame to understand how the model's error changes with k.

In [107]:
results.nsmallest(1, "mean_test_score")

Unnamed: 0,n_neighbors,mean_test_score,sem_test_score
8,19,31.021967,6.189049


In [108]:
gridsearch.best_params_

{'kneighborsregressor__n_neighbors': 19}
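The selection rule behind best_params_ can be illustrated directly from the results table: the chosen k is the one with the smallest mean cross-validation RMSE. The dictionary below is transcribed from that table; the variable names are our own.

```python
# Mean cross-validation RMSE per n_neighbors, copied from the results table.
mean_rmse_by_k = {
    3: 45.023988, 5: 35.779474, 7: 35.347363, 9: 32.444565, 11: 32.540077,
    13: 31.650813, 15: 32.772561, 17: 31.684099, 19: 31.021967,
}

# The best k is the one with the smallest mean RMSE.
best_k = min(mean_rmse_by_k, key=mean_rmse_by_k.get)
print(best_k, mean_rmse_by_k[best_k])
```

Note that 19 is the largest value in the grid, so a wider grid might find a lower RMSE; that possibility is not explored here.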

Thirdly, we retrained the model with n_neighbors set to 19 and fitted it to the same training set, using a new pipeline that contains the same preprocessor.

In [112]:
# Refit with the best k found by the grid search (19).
knn1 = KNeighborsRegressor(n_neighbors = 19)
X = players_filtered_train[["experience"]]
y = players_filtered_train["played_hours"]

pipeline1 = make_pipeline(preprocessor, knn1)
pipeline1.fit(X, y)
pipeline1

Afterwards, we evaluated the model's performance on the separate test set using the predict function.

In [113]:
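# gridsearch refits the best pipeline (k = 19) on the training set, so predicting
# with it evaluates the tuned model; only the predictor column is passed in.
players_filtered_test = players_filtered_test.assign(
    predicted=gridsearch.predict(players_filtered_test[["experience"]])
)
RMSPE = mean_squared_error(
    y_true = players_filtered_test["played_hours"],
    y_pred = players_filtered_test["predicted"]
)**(1/2)
RMSPE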

np.float64(17.83682431142111)
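For reference, the RMSPE is the square root of the mean squared difference between observed and predicted values. The sketch below spells out that calculation in plain Python; the function `rmspe` and the three example values are hypothetical and are not taken from the data set.

```python
import math


def rmspe(y_true, y_pred):
    """Root mean squared prediction error, in the units of the response."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical played hours and predictions (not taken from the data set).
actual = [3.0, 0.0, 1.0]
predicted = [1.0, 2.0, 1.0]
print(rmspe(actual, predicted))
```

Because the errors are squared before averaging, a few players with very large playtime can dominate the RMSPE.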

This gave an RMSPE of approximately 17.84 on the test data, computed as the square root of mean_squared_error applied to the true (y_true) and predicted (y_pred) response values.

Finally, we also visualized the model's predictions against the actual test data to assess its performance.


### Discussion

The grid search selected k = 19, which had the lowest cross-validation RMSE (approximately 31.02). Of the values tried, this is the setting that best balances overfitting and underfitting on new observations. The RMSPE on the test data is approximately 17.84, measured in the same unit as the target variable, played hours. This means that, on average, the model's predictions are off by about 18 hours, so it may only be useful for understanding general trends. However, since the test RMSPE is not larger than the cross-validation estimate, there is no sign that the model performs worse on unseen data.

The model predicts that experience levels outside the range between ‘amateur’ and ‘veteran’ show no relationship with playtime, and that players at those levels will have consistently low playtime, up to a maximum of about 5 hours.

The model predicts that as experience increases beyond ‘amateur’, playtime increases linearly until it reaches a maximum of about 35 hours for ‘regulars’. Playtime then decreases linearly as experience increases further, until the ‘veteran’ level is reached.

**These findings aligned with our expectations: given the lack of numerical predictors and their weak relationships with played hours, we expected a simple model that may not predict especially well.**

While preliminary exploration of the data suggested that played hours were unlikely to scale linearly with experience, we expected that some underlying relationship would still be present. For example, players with very low and very high levels of experience might both exhibit high playtime: the former because they feel more inclined to explore and learn the game’s systems, the latter because they might use the opportunity to show off their expertise. However, this is not what we observed. Instead, we found that, barring one category, no relationship exists between playtime and experience.


This result matters because it lets us address the question, “What ‘kinds’ of players contribute the most data?”. Based on our KNN regression model, we predict that players who self-identify as ‘Regulars’ in terms of experience are the most likely to contribute the most data, since they exhibit the most playtime.

Beyond that, we predict no relationship between played hours and experience, so researchers should not focus their recruiting efforts on the other experience levels. However, this model could be improved if ‘experience’ were better quantified. In the future, rather than asking participants to self-identify their level of experience, researchers could screen them with a quiz to determine this value empirically.


It is also possible that other variables correlate more strongly with playtime than experience does. In further research, we would investigate whether age, occupation or amount of free time, geographical location, economic stability, level of interest in gaming, and mental health status have a relationship (or lack thereof) with playtime.