# Analysis of the Attributes of Players on the PlaiCraft Server that are Associated with High Playtimes

## Introduction

PlaiCraft is an online Minecraft server provided by a research group from UBC computer science. Players who join the server provide a small amount of information about themselves before they begin to play the game. The server is set up to collect data about each player’s playtime, which is logged in a .csv file titled players.csv. The time played is tabulated with other data such as the gender of each player, their age, their previous experience with the game, and their participation in the emailing list. The goal of our research group has been to answer the question: what kinds of players are most likely to contribute a large amount of data to the dataset? We further investigate whether the experience of a player can be predicted with a model based on factors such as play time and other variables offered in the dataset. We would also like to know how this information can be used during recruitment to target players who are likely to contribute more data. This is important as the researchers maintaining the PlaiCraft server need to know how much to expand the capacity of the server to accommodate more players. The researchers would also be able to gain more players for their efforts by targeting advertisements at the demographics that appear to contribute the most data.

### The Data

The chosen dataset, players.csv, contains **nine** variables describing **196** PlaiCraft players. These nine variables are:
- **experience** - The self-reported level of experience players listed as having when signing up. Experience levels include Beginner, Amateur, Regular, Veteran, and Pro,
- **subscribe** - Players' subscription status to email updates about when other players are online on the server,
- **hashedEmail** - The players' hashed email address,
- **played_hours** - Each player's total playtime in hours,
- **name** - The name chosen by the player,
- **gender** - The self-reported gender of the player. Players were given the options: Male, Female, Agender, Non-Binary, Two-Spirited, Other, and 'Prefer not to say',
- **age** - The self-reported age in years of the player,
- **individualId** - The individual ID of the user,
- **organizationName** - The name of the organization the player is a part of.

To allow for data analysis, this dataset is loaded in using pandas.

## Methods for Analysis

Before any analysis can begin, all necessary packages must be loaded in. These packages include altair, numpy, pandas, and sklearn.

In [144]:
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

To ensure our methodology and findings are reproducible, the players.csv dataset is loaded directly from the web. The dataset is saved to the variable "players".

In [145]:
url = 'https://drive.google.com/file/d/1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz/edit'
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
players = pd.read_csv(path)
players.head(3)

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
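
The link handling in the cell above can be sketched in isolation. This is a minimal illustration, assuming the share link has the form `.../file/d/<FILE_ID>/<suffix>`; the helper name `drive_download_url` is hypothetical and not part of the original notebook.

```python
def drive_download_url(share_url: str) -> str:
    """Turn a Google Drive share link into a direct-download link.

    Assumes the file ID is the second-to-last path segment of the
    share link, as in the notebook cell above.
    """
    file_id = share_url.split("/")[-2]
    return "https://drive.google.com/uc?export=download&id=" + file_id


share_url = "https://drive.google.com/file/d/1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz/edit"
download_url = drive_download_url(share_url)
print(download_url)
```

The resulting string is what gets passed to `pd.read_csv`. If the share link had a different shape, the `[-2]` index would pick up the wrong segment.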


Loading this dataset, it is clear that, although the data is tidy, many of the variables are irrelevant to the proposed questions. To keep the analysis accurate and easy to interpret, the columns for hashedEmail, individual ID, and organization name are dropped: the hashed email does not provide insight into the "kind" of player someone is, and the individual ID and organization name columns do not contain any data. The name column is likewise uninformative; it remains in the table below but is not used in any of the analysis.

This is done using pandas' drop function, and the new dataset is saved to "players_clean".

In [146]:
players_clean = players.drop(columns=["hashedEmail", "individualId", "organizationName"])
players_clean.head(3)

Unnamed: 0,experience,subscribe,played_hours,name,gender,age
0,Pro,True,30.3,Morgan,Male,9
1,Veteran,True,3.8,Christian,Male,17
2,Veteran,False,0.0,Blake,Male,17
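
The claim that some columns contain no data can be checked programmatically before dropping them. The sketch below uses a small made-up table rather than players.csv, so the values are illustrative assumptions only.

```python
import pandas as pd

# Toy stand-in for players.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Veteran"],
    "played_hours": [30.3, 3.8, 0.0],
    "individualId": [None, None, None],
    "organizationName": [None, None, None],
})

# A column counts as empty when every one of its entries is missing.
empty_columns = [col for col in toy.columns if toy[col].isna().all()]

# Drop the empty columns, leaving only columns that hold data.
toy_clean = toy.drop(columns=empty_columns)
print(empty_columns)
```

Running the same check on the real table would confirm which columns are safe to remove without losing information.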


Now that the dataset only includes the data relevant to answering the question "Which 'kinds' of players are most likely to contribute a large amount of data?", analysis to identify these "kinds" of players can be conducted. The best way to do this is to determine how different variables influence the variable "played_hours", as the groups with the greatest play time will contribute the largest amount of data and should thus be targeted for recruiting efforts. 

Having said that, problems do arise due to individuals misreporting their age. For example, the 196th datapoint has an unrealistic age of 91 years. Including such data points would distort the average playtime of the affected ages and lead to inaccurate conclusions about which age group contributes the most playtime. For these reasons, only data for ages 15-30 (inclusive) will be used for analysis of the relationship between age and playtime.
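
As a minimal sketch of the age filter described above, the example below applies the same inclusive 15-30 range to a made-up table. `Series.between` is used here as an equivalent shorthand for combining `>=` and `<=`; the toy values are assumptions, not rows from players.csv.

```python
import pandas as pd

# Made-up ages and playtimes for illustration only.
toy = pd.DataFrame({
    "age": [9, 15, 17, 30, 31, 91],
    "played_hours": [30.3, 1.0, 3.8, 0.5, 2.0, 0.1],
})

# between() is inclusive on both ends by default, so 15 and 30 are kept.
age_mask = toy["age"].between(15, 30)
in_range = toy[age_mask]
excluded = toy[~age_mask]

print(in_range["age"].tolist())
print(len(excluded), "rows fall outside the 15-30 range")
```

Inspecting the excluded rows is a useful check, because the filter removes every player outside the range, not only those with implausible ages.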

In addition to the relationship between average age and playtime, playtime will be compared with experience, gender, and subscription status. These comparisons will be done using bar plots.

The first plot we examined plotted experience against average hours played. Using the alt.Chart function we encoded experience onto the X-axis and average hours played on the Y-axis and used mark_bar to produce a bar graph. 

In [147]:
#Plot Based on Experience
player_data_experience_mean = (players_clean.groupby('experience')
                        .mean(numeric_only = True) #Taking the mean playtime for individuals of different experience levels.
                        .reset_index()
                              )

player_data_experience_bar = alt.Chart(player_data_experience_mean).mark_bar().encode(
    x=alt.X('experience', title = 'Experience Level'),
    y=alt.Y('played_hours', title = "Average Hours Played (Hours)")
).properties(
    title= 'Figure 1. Average Number of Hours Played by Different Experience Levels'
)
player_data_experience_bar

From the plot above, it is evident that Amateur and Regular players show higher-than-average gameplay hours.
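
A group average can be dominated by a handful of players when the group is small. The sketch below, on made-up data, shows one way to report the number of players next to each mean; the column names `mean_hours` and `n_players` are hypothetical labels chosen for this example.

```python
import pandas as pd

# Made-up data: one "Regular" player with a very large playtime.
toy = pd.DataFrame({
    "experience": ["Amateur", "Amateur", "Pro", "Regular", "Regular", "Regular"],
    "played_hours": [1.0, 3.0, 2.0, 0.0, 0.5, 50.0],
})

# Named aggregation: mean playtime and group size per experience level.
summary = (
    toy.groupby("experience")["played_hours"]
    .agg(mean_hours="mean", n_players="size")
    .reset_index()
)
print(summary)
```

In this toy table the "Regular" mean is driven almost entirely by a single player, which the group size alone would not reveal but helps put in context.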

The second plot we made graphs age against average hours played. We then encoded age onto the X-axis and average hours played on the Y-axis. 

In [160]:
#Plot Based on Age
player_data_age_mean = (players_clean[(players_clean["age"] >= 15) & (players_clean["age"] <= 30)]
                        .groupby('age')
                        .mean(numeric_only = True) #Taking the mean playtime of players with different ages between 15 and 30
                        .reset_index()
                       )


player_data_age_bar = alt.Chart(player_data_age_mean).mark_bar().encode(
    x=alt.X('age', title = 'Age of Players (Years)'),
    y=alt.Y('played_hours', title = "Average Hours Played (Hours)")
).properties(
    title= 'Figure 2. Average Number of Hours Played by Players of Different Ages'
)

In [161]:
display(player_data_age_bar)

The above plot highlights the peaks in playtime among players in late teenage years and early twenties, with notably high playtime among players aged 16, 19, and 20.

Our next focus was the gender identity of each player and how it correlated with hours played. Gender identity was plotted on the X-axis while hours played remained on the Y-axis. 

In [162]:
#Plot Based on Gender
player_data_gender_mean = (players_clean.groupby('gender')
                        .mean(numeric_only = True) #Taking the mean playtime of players of different genders 
                        .reset_index()
                       )

player_data_gender_bar = alt.Chart(player_data_gender_mean).mark_bar().encode(
    x=alt.X('gender', title = 'Reported Gender of Players'),
    y=alt.Y('played_hours', title = 'Average Hours Played (Hours)')
).properties(
    title = 'Figure 3. Average Number of Hours Played by Players of Different Genders'
)

In [163]:
display(player_data_gender_bar)

From the plot, non-binary players have the highest average playtime, followed by female, agender, and then male players. The remaining genders show very low average playtimes.
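
Because the mean is sensitive to a few very high playtimes, it can help to compare it with the median for each group. The following is a sketch on made-up data with hypothetical group labels, not a result from players.csv.

```python
import pandas as pd

# Made-up data: "Group A" has one heavy player, "Group B" is evenly spread.
toy = pd.DataFrame({
    "gender": ["Group A", "Group A", "Group A", "Group B", "Group B", "Group B"],
    "played_hours": [0.0, 0.0, 90.0, 5.0, 6.0, 7.0],
})

# Mean and median playtime side by side for each group.
stats = toy.groupby("gender")["played_hours"].agg(["mean", "median"])
print(stats)
```

Here "Group A" has the higher mean but the lower median, so a ranking based on averages alone would not reflect the typical player in that group.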

Next, subscription status and hours played were plotted against each other in our final bar plot with subscriptions on the X-axis and hours on the Y-axis. 

In [164]:
#Plot Based on Subscription Status
player_data_subscribe_mean = (players_clean.groupby('subscribe')
                        .mean(numeric_only = True) #taking the mean playtime of players of different subscription statuses
                        .reset_index()
                          )

player_data_subscribe_bar = alt.Chart(player_data_subscribe_mean).mark_bar().encode(
    x=alt.X('subscribe', title = 'Subscription Status of Players'),
    y=alt.Y('played_hours', title = 'Average Hours Played (Hours)')
).properties(
    title = 'Figure 4. Average Number of Hours Played by Players of Different Subscription Status'
)

In [165]:
display(player_data_subscribe_bar)

The graph shows that subscribed players have a notably higher average playtime than non-subscribed players.

## Results and Discussion

The analysis aimed to answer the question "Which 'kinds' of players are most likely to contribute a large amount of data so that they can be targeted in recruiting efforts?" To answer it, we set out to identify which players are most likely to contribute the most playtime to the server. The four bar plots we created provided clear insights into this question. The first bar plot, which compared player experience with average hours played, revealed that "regular" players contributed the most playtime, followed by "amateur" players, while "pro" players were, surprisingly, among the lowest. This indicates that while it may have been assumed that more experienced players generally invest more time, inexperienced players cannot be overlooked in their potential to contribute. The second plot showed that younger players, especially those aged 16, had the highest playtime, with a noticeable drop-off after age 20. The plots examining gender and subscription status highlighted that non-binary players recorded the highest average hours, while email subscribers were more likely to invest significant playtime, aligning with their demonstrated interest in the game.

These findings largely aligned with expectations, though there were some surprises. It was anticipated that players identifying as "veterans" or "pros" would spend the most time on the server, as experience often correlates with engagement. However, the notable playtime of the "regular" and "amateur" players suggests that interest in new experiences and enthusiasm may outweigh skill or experience for some demographics. Similarly, while the high playtime among younger players was expected, the drop-off in hours was not as linear as predicted, with ages 19 and 20 showing unexpectedly high engagement and ages 17 and 18 showing surprisingly low playtime. The findings related to gender and playtime were not guided by specific expectations, as gender is not typically associated with gaming behaviour; nevertheless, the plot showed clear differences in average playtime between genders rather than consistent playtime across them.

Such findings have important implications for research recruitment efforts when attempting to identify players who will contribute the most data. Understanding that both "regular" and "amateur" players contribute substantially, that individuals aged 16-20 contribute the most data, and that non-binary and female players tend to have the highest playtimes can guide strategies to recruit individuals with these characteristics. The analysis highlights potentially key demographics for targeted recruitment of research subjects. Furthermore, the association between email subscription and playtime highlights the importance of maintaining a large and engaged subscriber base to sustain activity levels on the server.

Future research could explore several intriguing questions. For instance, what specific factors drive the unexpected playtime patterns among "amateur" players and players aged 17-18? Additionally, why do non-binary players seem to engage more with the game, and how can developers create inclusive environments that support this trend? Moreover, what combinations of player attributes tend to lead to high playtime? These findings open the door to deeper analyses that could help refine strategies for growing and sustaining research data.

## Secondary Question

Having tackled the primary question asked by the PlaiCraft hosts, a secondary question was proposed to gain further insight into the players of the server: “How can an exploratory kNN model be used to highlight and predict patterns in the gaming behaviour of players of different gender, age, and experience level? What insights and conclusions can be drawn from the findings of such a model, and what combination of player attributes leads to the highest playtime?”

Answering this question begins with the players_clean dataset. Two of the variables we wish to predict, gender and experience, are categorical and thus need to be converted to numerical codes to allow for modelling; age is already numeric. The conversion is done using pandas’ Categorical function. Each column's categorical values are explicitly converted, and the mappings of categories to their codes are saved for decoding predictions later.


In [130]:
# Encoding 'gender'
players_clean["gender"] = pd.Categorical(players_clean["gender"])
gender_categories = players_clean["gender"].cat.categories
players_clean["gender"] = players_clean["gender"].cat.codes

# Encoding 'experience'
players_clean["experience"] = pd.Categorical(players_clean["experience"])
experience_categories = players_clean["experience"].cat.categories
players_clean["experience"] = players_clean["experience"].cat.codes
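
The encode-then-decode round trip can be seen on a small made-up series. This sketch only illustrates how `pd.Categorical` assigns codes; the values are not taken from players.csv.

```python
import pandas as pd

# Made-up experience labels for illustration.
levels = pd.Series(["Pro", "Veteran", "Amateur", "Pro"])

as_categorical = pd.Categorical(levels)
categories = as_categorical.categories  # unique labels, sorted alphabetically
codes = as_categorical.codes            # integer position of each label

# Indexing the categories with the codes recovers the original labels.
decoded = categories[codes]

print(list(categories))
print(list(codes))
print(list(decoded))
```

Note that the codes follow alphabetical order of the labels, not skill order, so a larger code does not mean a more experienced player.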

To build the model, the data is divided into the feature (played hours), defined as X, and the target variables to be predicted (gender, age, and experience), defined as y. This separation is performed so that kNN can use the provided feature to predict the target variables.

In [131]:
X = players_clean[["played_hours"]]
y = players_clean[["gender", "age", "experience"]]

A preprocessor is then made in order to standardize the data using the StandardScaler() function. This ensures that all features have a mean of 0 and a standard deviation of 1. This is especially important for distance-based models like kNN, where differences in scales could skew results.

In [132]:
preprocessor = StandardScaler()
X_scaled = preprocessor.fit_transform(X)
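
The standardization applied by `StandardScaler` is z = (x - mean) / std. The sketch below checks this on made-up playtimes and, as one common alternative to the cell above, fits the scaler on the training split only so that test values do not influence the scaling. The toy values and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up playtimes (hours), one feature column.
hours = np.array([[0.0], [0.5], [1.0], [2.0], [3.8],
                  [30.3], [0.1], [7.0], [12.0], [0.0]])

hours_train, hours_test = train_test_split(hours, test_size=0.2, random_state=0)

# Fit on the training split only, then apply the same transform to both splits.
scaler = StandardScaler().fit(hours_train)
train_scaled = scaler.transform(hours_train)
test_scaled = scaler.transform(hours_test)

# Manual z-score using the training mean and (population) standard deviation.
manual = (hours_train - hours_train.mean()) / hours_train.std()
print(np.allclose(train_scaled, manual))
```

With a single feature, scaling does not change which neighbours are nearest, but it keeps the workflow ready for additional features on different scales.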

We then split the dataset into training and testing subsets with a test size of 0.2. The training data is used to train the models, and the testing data helps evaluate their performance.
Using the split data, a cross-validation of the model with a different number of nearest neighbours is performed for each variable and a cross-validation plot is created to allow for the optimal number of nearest neighbours to be identified for each attribute. Nearest neighbours from 1 to 15 are tested.

In [133]:
#Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=1234)

#Creating a dataset with only gender to use for the cross-validation
y_train_gender = y_train[["gender"]]

#Creating the cross-validation model
knn_spec = KNeighborsClassifier()
param_grid = {"n_neighbors": range(1, 16, 1),}
knn_tune_grid = GridSearchCV(knn_spec, param_grid, return_train_score = True, n_jobs = -1, cv = 5, scoring='accuracy',)
knn_train_grid_gender = knn_tune_grid.fit(X_train, y_train_gender)
accuracies_grid_gender = pd.DataFrame(knn_train_grid_gender.cv_results_)

#Plotting the mean test score against the number of nearest neighbours for gender
cross_val_plot_gender = alt.Chart(accuracies_grid_gender).mark_line(point=True).encode(x=alt.X('param_n_neighbors').title('Number of Nearest Neighbours').scale(zero=False),y=alt.Y('mean_test_score').title('Mean Test Score').scale(zero=False)).properties(
    title = 'Figure 5. Mean Test Score of Exploratory Models of Different Numbers of Nearest Neighbours for Gender')

#Creating a dataset with only experience to use for the cross-validation
y_train_experience = y_train[["experience"]]

#Creating the cross-validation model
knn_train_grid_experience = knn_tune_grid.fit(X_train, y_train_experience)
accuracies_grid_experience = pd.DataFrame(knn_train_grid_experience.cv_results_)

#Plotting the mean test score against the number of nearest neighbours for experience
cross_val_plot_experience = alt.Chart(accuracies_grid_experience).mark_line(point=True).encode(x=alt.X('param_n_neighbors').title('Number of Nearest Neighbours').scale(zero=False),y=alt.Y('mean_test_score').title('Mean Test Score').scale(zero=False)).properties(
    title = 'Figure 6. Mean Test Score of Exploratory Models of Different Numbers of Nearest Neighbours for Experience')


#Creating a dataset with only age to use for the cross-validation
y_train_age = y_train[["age"]]

#Creating the cross-validation model
knn_train_grid_age = knn_tune_grid.fit(X_train, y_train_age)
accuracies_grid_age = pd.DataFrame(knn_train_grid_age.cv_results_)

#Plotting the mean test score against the number of nearest neighbours for age
cross_val_plot_age = alt.Chart(accuracies_grid_age).mark_line(point=True).encode(x=alt.X('param_n_neighbors').title('Number of Nearest Neighbours').scale(zero=False),y=alt.Y('mean_test_score').title('Mean Test Score').scale(zero=False)).properties(
    title = 'Figure 7. Mean Test Score of Exploratory Models of Different Numbers of Nearest Neighbours for Age')



display(cross_val_plot_gender)
display(cross_val_plot_experience)
display(cross_val_plot_age)

From the plots above, the numbers of nearest neighbours that maximize the mean test score are 15, 14, and 12 for the gender, experience, and age models respectively. With these values, separate k-Nearest Neighbors models are trained for each target variable (gender, age, and experience). Each model learns to predict one specific attribute based on the input feature of played hours.
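
Besides reading the optimum off a plot, a fitted `GridSearchCV` object exposes it directly through `best_params_` and `best_score_`. The sketch below demonstrates this on synthetic data generated for the example; it does not reproduce the values reported above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic one-feature data with two classes, for illustration only.
rng = np.random.default_rng(0)
X_demo = np.concatenate([rng.normal(0, 1, (50, 1)), rng.normal(3, 1, (50, 1))])
y_demo = np.array([0] * 50 + [1] * 50)

# Same search space as in the notebook: k from 1 to 15, 5-fold cross-validation.
grid = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": range(1, 16)},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_demo, y_demo)

best_k = grid.best_params_["n_neighbors"]  # k with the highest mean CV accuracy
best_score = grid.best_score_              # that mean CV accuracy
print(best_k, round(best_score, 3))
```

Reading the values programmatically avoids misreading a plot when several values of k give nearly identical scores.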

In [134]:
# Training a kNN model for each target variable
knn_gender = KNeighborsClassifier(n_neighbors=15)
knn_experience = KNeighborsClassifier(n_neighbors=14)
knn_age = KNeighborsClassifier(n_neighbors=12)
# Fitting the models
knn_gender.fit(X_train, y_train["gender"])
knn_experience.fit(X_train, y_train["experience"])
knn_age.fit(X_train, y_train["age"])

We then predict a player’s attributes from an input playtime. The code scales the input data, makes predictions using the models trained in the previous step, and decodes categorical predictions back into their original categories for interpretability. The predictions of each model are then merged into one output.

The code below is run with a sample input of 100 hours, outputting the attributes predicted for a player with that playtime.

In [138]:
# Input for prediction
played_hours_input = np.array([[100]])
played_hours_scaled = preprocessor.transform(played_hours_input)

# Predictions
predicted_gender_code = knn_gender.predict(played_hours_scaled)
predicted_age = knn_age.predict(played_hours_scaled)
predicted_experience_code = knn_experience.predict(played_hours_scaled)

# Decoding predictions
predicted_gender = gender_categories[predicted_gender_code]
predicted_experience = experience_categories[predicted_experience_code]

# Output
predicted_attributes = {
    "gender": predicted_gender,
    "age": predicted_age,
    "experience": predicted_experience,
}
print(predicted_attributes)

{'gender': Index(['Male'], dtype='object'), 'age': array([17]), 'experience': Index(['Amateur'], dtype='object')}
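
The held-out test split created earlier could also be used to gauge how trustworthy such predictions are. The following is a sketch on synthetic data, assuming a comparison against a simple majority-class baseline; the labels are random and unrelated to the feature, so the numbers say nothing about the PlaiCraft models.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic playtimes and three arbitrary class labels, for illustration only.
rng = np.random.default_rng(1)
X_demo = rng.exponential(scale=5.0, size=(120, 1))
y_demo = rng.integers(0, 3, size=120)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1234
)

# Accuracy of a kNN classifier on data it has not seen during training.
model = KNeighborsClassifier(n_neighbors=15).fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)

# Baseline: always predict the most common label in the training data.
majority_label = np.bincount(y_tr).argmax()
baseline_accuracy = float(np.mean(y_te == majority_label))

print(round(test_accuracy, 3), round(baseline_accuracy, 3))
```

If a model's test accuracy is close to the baseline, playtime alone carries little information about that attribute, and predictions such as the one above should be read as exploratory.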
