# Predicting Hours Played by MineCraft Users Based on Age and Experience Level


### Introduction 

Researchers in the Computer Science department at UBC are collecting data on how people play video games to answer a few questions. One of the questions the researchers are asking is “Which ‘kinds’ of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts?” 

The researchers have set up a MineCraft server to record players' actions as they navigate through the server’s world. The project they are running is a lot more complicated than it seems. They need to ensure there are enough resources (server hardware, software licences, etc.) in order to accommodate the number of players they attract to contribute to the study.

To understand player demographics and session activity, participants must answer questions formulated by the research group before playing, such as their age, gender, and experience level. They can then join the server to play, where their session activity is monitored and recorded, so that both demographic information and gameplay duration can be tracked.

### Our Question

We will try to answer the researchers' question: “Which ‘kinds’ of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts?” Further, we explore whether we can predict the number of hours played based on a player's age and experience.

### The Data Set 

We will be focusing on the `players.csv` dataset containing demographic and experience level information for each participant, which will allow us to examine the relationship between the number of hours played and the characteristics of the participants. 

The `sessions.csv` dataset will not be used in this analysis because it does not contain demographic information about the players. We believed it did not provide any meaningful information for our study beyond what is already available in `players.csv`.

Further analysis will allow us to observe which types of players are likely to contribute more hours when playing the game.

`players.csv` dataset:

- Number of observations: 196 (one for each player in the study)
- Number of variables: 9

Variables:
- `experience` (string): player's experience level
- `subscribe` (boolean): subscription to study's mailing list.
- `hashedEmail` (string): encrypted version of the player's email address.
- `played_hours` (float): number of hours the player has spent on the server.
- `name` (string): player's name.
- `gender` (string): player's gender.
- `age` (integer): age of the player in years.
- `individualId` (NoneType): contains no data. It may have been intended as an alternative ID for the player.
- `organizationName` (NoneType): contains no data.


### Methods and Results

We started by importing the required libraries and functions into Jupyter.

In [1]:
import pandas as pd
import altair as alt
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV, train_test_split, cross_validate
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import set_config
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

Then, we loaded the players csv file into Jupyter.

In [3]:
players_data = pd.read_csv("data/players.csv")
players_data

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,



##### Exploratory Data Analysis and Visualization

To address the researchers' question, "Which 'kinds' of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts?", we will create two faceted scatter plots of `played_hours` against `age`, one faceted by `experience` and one by `gender`, to see whether there are any patterns in the data.

We do this by selecting the `experience`, `played_hours`, `gender`, and `age` columns with `[]` indexing. This drops the columns that do not describe the players' demographics or play time (`subscribe`, `hashedEmail`, `name`, `individualId`, and `organizationName`).

In [6]:
players_filtered = players_data[["experience","played_hours", "gender", "age",]]
players_filtered

Unnamed: 0,experience,played_hours,gender,age
0,Pro,30.3,Male,9
1,Veteran,3.8,Male,17
2,Veteran,0.0,Male,17
3,Amateur,0.7,Female,21
4,Regular,0.1,Male,21
...,...,...,...,...
191,Amateur,0.0,Female,17
192,Veteran,0.3,Male,22
193,Amateur,0.0,Prefer not to say,17
194,Amateur,2.3,Male,17


Now we will create our scatter plots with `alt.Chart` so we can look for patterns or trends in the data and more easily identify which types of players tend to play for more hours.

The scatter plots show whether there is a relationship between experience level and played hours, or between gender and played hours. We can also see which age groups are more likely to contribute.

In [7]:
players_plot_experience = alt.Chart(players_filtered).mark_point(opacity=0.5, size=50).encode(
    x=alt.X('age').scale(zero=False).title("Age (in years)"),
    y=alt.Y('played_hours').scale(zero=False)
    .title("Number of Hours played (in hours)"), 
    color=alt.Color("experience").title("Experience Level")
).facet(
    "experience:N",
    columns=5
).configure_axis(titleFontSize=12)

players_plot_experience

In [8]:
players_plot_gender = alt.Chart(players_filtered).mark_point(opacity=0.4, size=50).encode(
    x=alt.X('age').scale(zero=False).title("Age (in years)"),
    y=alt.Y('played_hours').scale(zero=False)
    .title("Number of Hours played (in hours)"), 
    color=alt.Color("gender").title("Gender type")
).facet(
    "gender:N",
    columns=4
).configure_axis(titleFontSize=12)

players_plot_gender

From the plots above we see that:
- A few amateur and regular players have high play times, but this is only a handful of data points, so any relationship between those experience levels and longer play time is very weak.
- The other experience levels tend to have lower played hours.
- Most players contributing to the study are young (older than 10 and younger than 30), and many are 17 years old.
- More males than females participate in the study. A few players of each gender have high play times, with more such points among males.

This can provide some insight for researchers to ... 
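The visual impressions above can also be checked with a per-group summary. The sketch below is a dependency-free illustration that uses only the nine rows visible in the truncated table above, so its numbers are not representative of the full dataset; the names `visible_rows` and `summary` are ours. In the notebook itself, a `groupby("experience")` on `players_filtered` would be the more natural way to do this.

```python
from collections import defaultdict
from statistics import mean

# The nine rows visible in the truncated table above (experience, played_hours).
visible_rows = [
    ("Pro", 30.3), ("Veteran", 3.8), ("Veteran", 0.0), ("Amateur", 0.7),
    ("Regular", 0.1), ("Amateur", 0.0), ("Veteran", 0.3), ("Amateur", 0.0),
    ("Amateur", 2.3),
]

# Collect the played hours for each experience level.
hours_by_level = defaultdict(list)
for level, hours in visible_rows:
    hours_by_level[level].append(hours)

# Count, mean, and maximum of played hours per experience level.
summary = {
    level: {
        "players": len(values),
        "mean_hours": mean(values),
        "max_hours": max(values),
    }
    for level, values in hours_by_level.items()
}

for level, stats in sorted(summary.items()):
    print(level, stats)
```

Comparing the mean with the maximum in each group gives a rough sense of how much a single heavy player can dominate a category.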

Now we will focus on our predictive question of whether we can predict the number of hours played by MineCraft users based on age and experience level. Following the filtering step above, the next step was to wrangle the data. We kept the `played_hours`, `experience`, and `age` columns by using `[]`, and dropped the columns that were not needed, including the `gender` column.

Next, we used one-hot encoding to turn the `experience` column into numerical indicator columns so that we could use it for regression. Lastly, we combined the original data frame with the new one-hot encoded data frame using `pd.concat` to get our final data frame.

In [9]:
players_clean = players_data[["played_hours", "experience", "age"]]

enc = OneHotEncoder(handle_unknown='ignore', sparse_output = False).set_output(transform  = "pandas")
enctransform = enc.fit_transform(players_clean[["experience"]])

# Append the indicator columns and drop the original text column.
players_final = pd.concat([players_clean, enctransform], axis = 1).drop(columns = ["experience"])
players_final

Unnamed: 0,played_hours,age,experience_Amateur,experience_Beginner,experience_Pro,experience_Regular,experience_Veteran
0,30.3,9,0.0,0.0,1.0,0.0,0.0
1,3.8,17,0.0,0.0,0.0,0.0,1.0
2,0.0,17,0.0,0.0,0.0,0.0,1.0
3,0.7,21,1.0,0.0,0.0,0.0,0.0
4,0.1,21,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...
191,0.0,17,1.0,0.0,0.0,0.0,0.0
192,0.3,22,0.0,0.0,0.0,0.0,1.0
193,0.0,17,1.0,0.0,0.0,0.0,0.0
194,2.3,17,1.0,0.0,0.0,0.0,0.0
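To make the encoding step concrete, here is a dependency-free sketch of what one-hot encoding does to the `experience` column. It is only an illustration, not the scikit-learn implementation: the helper `one_hot` is hypothetical, and the sample values are the experience levels of the first five players shown above (so there is no `Beginner` column in this small example).

```python
def one_hot(values, prefix):
    """Return one dict per value with a 0.0/1.0 indicator for every category."""
    # Sort the categories so the column order is stable; this mirrors the
    # alphabetical column order seen in the table above.
    categories = sorted(set(values))
    return [
        {f"{prefix}_{cat}": 1.0 if value == cat else 0.0 for cat in categories}
        for value in values
    ]


# Experience levels of the first five players shown in the table above.
experience = ["Pro", "Veteran", "Veteran", "Amateur", "Regular"]
encoded = one_hot(experience, prefix="experience")

for level, row in zip(experience, encoded):
    print(level, row)
```

Each row contains exactly one `1.0`, marking the player's experience level, which is why every row of indicator columns in the table above sums to one.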


We will now split our data into a training set containing 75% of the data and a testing set containing the other 25%. We set a random seed (`random_state`) so that the split is reproducible. We will use the training set to find the best value of k for regression, and then use the testing set to see how well our model predicts played hours from age and experience level. In the same cell we build a pipeline that standardizes the predictors and fits a KNN regressor, and as a first check we run 5-fold cross-validation with the regressor's default number of neighbors (5).

In [10]:
players_training, players_testing = train_test_split(players_final, test_size = 0.25, random_state = 2020)

X_train = players_training[["age", "experience_Amateur", "experience_Beginner", "experience_Pro", "experience_Veteran", 
                            "experience_Regular"]]
y_train = players_training["played_hours"]

X_test = players_testing[["age", "experience_Amateur", "experience_Beginner", "experience_Pro", "experience_Veteran", 
                            "experience_Regular"]]
y_test = players_testing["played_hours"]

players_pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())

players_cv = pd.DataFrame(
    cross_validate(
        estimator = players_pipe, 
        cv = 5, 
        X = X_train, 
        y = y_train, 
        scoring = "neg_root_mean_squared_error", 
        return_train_score = True
    ))
players_cv

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.003672,0.00365,-46.898001,-26.422229
1,0.003034,0.001665,-49.872942,-23.684785
2,0.00256,0.001503,-18.824665,-32.41211
3,0.002482,0.001506,-31.367118,-30.304666
4,0.002478,0.001453,-20.006249,-32.23844


Now we will use cross-validation to choose the optimal value of k. The pipeline defined above already standardizes the predictors with `StandardScaler` (via `make_pipeline`) before the KNN regressor, so we reuse it here. We tune the number of neighbors with `GridSearchCV`, using 5-fold cross-validation and trying every value of k from 1 to 110. We then fit the grid search on the training data and collect the scores in a new data frame to see how well the model predicts our response variable for each k.

In [11]:
param_grid = {"kneighborsregressor__n_neighbors": range(1, 111, 1),}

players_tune = GridSearchCV(players_pipe, param_grid, cv = 5, n_jobs = -1, scoring = "neg_root_mean_squared_error")

players_results = pd.DataFrame(players_tune.fit(X_train, y_train).cv_results_)

players_results["sem_test_score"] = players_results["std_test_score"]/5**(1/2)

players_results = (
    players_results[[
        "param_kneighborsregressor__n_neighbors", 
        "mean_test_score", 
        "sem_test_score"
    ]].rename(columns = {"param_kneighborsregressor__n_neighbors": "n_neighbors"})
)

In [12]:
players_results["mean_test_score"] = - players_results["mean_test_score"]
players_results

Unnamed: 0,n_neighbors,mean_test_score,sem_test_score
0,1,53.729896,8.423551
1,2,38.760618,6.246873
2,3,36.270068,4.469285
3,4,35.961546,5.304732
4,5,33.393795,5.829303
...,...,...,...
105,106,28.213674,7.150645
106,107,28.201069,7.161697
107,108,28.191350,7.169811
108,109,28.179194,7.179594


To visualize the results we used `alt.Chart` on the results data frame with `mark_line(point = True)` to create a line plot that connects the mean test score for each number of neighbors. Because the number of neighbors is our independent variable, we placed it on the x-axis. The mean test score (the cross-validation RMSPE) is our dependent variable, so we placed it on the y-axis, and we set `zero = False` on its scale so that the axis does not have to start at zero.

In [13]:
players_graph = alt.Chart(players_results).mark_line(point = True).encode(
    x = alt.X("n_neighbors").title("Neighbors"), 
    y = alt.Y("mean_test_score").title("Mean Test Score").scale(zero = False)
)
players_graph

After creating the graph to visualize what our best k may be, we use `.best_params_` to get the best k value for our model and `.best_score_` to get the corresponding cross-validation RMSPE. We then evaluate the tuned model on our test set to get a better idea of how well it performs on data it has never seen. Finally, we added a column called `predictions` to the training data frame, which contains the hours played that the model predicts for each player in it.

In [14]:
players_min = players_tune.best_params_
players_min

{'kneighborsregressor__n_neighbors': 60}

In [15]:
players_best_rmspe = -players_tune.best_score_
players_best_rmspe

np.float64(27.713922905448403)

In [17]:
players_prediction = players_tune.predict(X_test)
players_summary = mean_squared_error(y_true = players_testing["played_hours"], y_pred = players_prediction)**(1/2)
players_summary

np.float64(8.030762100155691)

In [18]:
players_preds = players_training.assign(predictions = players_tune.predict(players_training[["age", "experience_Amateur", "experience_Beginner", "experience_Pro", "experience_Veteran", 
                            "experience_Regular"]]))
players_preds

Unnamed: 0,played_hours,age,experience_Amateur,experience_Beginner,experience_Pro,experience_Regular,experience_Veteran,predictions
159,0.0,22,0.0,1.0,0.0,0.0,0.0,3.070000
174,0.0,17,0.0,0.0,0.0,0.0,1.0,4.675000
60,0.0,17,0.0,0.0,0.0,0.0,1.0,4.675000
49,0.4,22,0.0,1.0,0.0,0.0,0.0,3.070000
130,56.1,23,1.0,0.0,0.0,0.0,0.0,6.310000
...,...,...,...,...,...,...,...,...
91,0.0,17,0.0,1.0,0.0,0.0,0.0,5.306667
118,0.0,46,0.0,0.0,0.0,0.0,1.0,2.265000
67,17.2,14,1.0,0.0,0.0,0.0,0.0,6.193333
136,0.0,20,0.0,0.0,0.0,1.0,0.0,13.161667


We then created a visualization of the predictions created by our model by first producing a scatter plot of our data points, and then creating a line of the predictions of hours played by our model.

In [19]:
# Scatter plot of the observed hours played against age.
base_plot = alt.Chart(players_clean).mark_circle(opacity = 0.3).encode(
    x = alt.X("age").title("Player's Age"),
    y = alt.Y("played_hours").title("Hours Played")
)

# Line showing the hours played predicted by the tuned model.
players_line = alt.Chart(players_preds).mark_line(color = "black").encode(
    x = "age", 
    y = "predictions"
)

# Overlay the predictions on the observed data.
players_graph = alt.layer(base_plot, players_line, title = "Age vs Observed and Predicted Hours Played")
players_graph

## Discussion

In our KNN regression, we were unable to find a clear relationship between experience level, age, and playing time. During cross-validation, we found that the best k value to use is 60. Given that the dataset has only 196 observations, and that we used 75% of them to train our model, there were not enough data points to produce a model that can reliably predict the number of hours a player would contribute based on their experience level and age.

These were not the findings we expected, as we thought there would be a clear way to predict the number of hours played by a participant based on their age and experience level. With this information, we aimed to answer what kinds of players were likely to provide the most data, which in this case is represented by the number of hours played. Our findings showed that, with the data available, it is difficult to create a model that accurately predicts the number of hours a participant will play based on their age and experience level. It should also be noted that the majority of our data lies under one hour played, with most players contributing to the study being 17-year-old amateur males. The skewed nature of the dataset makes it difficult to produce a model with a low prediction error.

Since we were unable to produce an effective model, the impact of our findings has to be considered in the broader context of the entire PLAICRAFT research project. If we were supplied with more data points and a more evenly balanced dataset, we might be able to create a model that more accurately connects our predictors (age and experience level) to our response variable (played hours). Such a model could more accurately predict the kinds of players who are likely to contribute more hours to the study. This would allow the researchers to target that audience, or to target players outside of it in order to produce a well-rounded dataset, which would in turn make it easier to create a more accurate predictive model.

Given that we were not able to build a model that accurately answers the question at hand, one question we might ask in the future is how we can better sample data in order to form a more accurate model. Once we understand what caused the issues with our model and how to solve them, we could create a more accurate model and gain a better understanding of which kinds of players contribute the most data. Once we know which kinds of players contribute the most data to the study, we can also ask how to better advertise the study to that demographic. Knowing this would allow the researchers to advertise their study more effectively in order to collect more data points.

## References

Nolan, Ryan. “One Hot Encoder with Python Machine Learning (Scikit-Learn).” YouTube, 14 Aug. 2023, www.youtube.com/watch?v=rsyrZnZ8J2o.