# **Predicting player engagement based on age**

## **Introduction**

This dataset was compiled by a research group at UBC who were observing how players interacted with a Minecraft server. By analyzing it, we aim to see which types of players contribute large volumes of data, indicating that they may be worth pursuing for recruitment. Specifically, we would like to know whether player age can predict whether a player is a low-hour, medium-hour, or high-hour player. With such a predictive model, we can better understand which types of players are more likely to interact heavily with the game and are therefore worth targeting for recruitment.

Our research question is: 
> Can player age accurately predict whether the player is a low-hour, medium-hour or high-hour player?

#### Describing the dataset

To achieve this, we took the original players.csv dataset and focused on each player’s age and total play time to build the model. players.csv records the experience level, subscription status, hashed email, played hours, name, gender, and age of each player:

| Name of variable | Type | Description |
|----------------- | ---- | ----------- |
| experience | chr | The self-reported level of experience of each player |
| subscribe | lgl | The player's subscription status to a game-related newsletter |
| hashedEmail | chr | The hashed email addresses of each player to identify players without revealing their contact information| 
| played_hours | dbl | The number of hours each player spent playing the game |
| name | chr | The names of each of the players | 
| gender | chr | The gender that each player identifies by | 
| Age | dbl | The age of each player in years |

## **Method**

In [None]:
#Loading the libraries
library(tidyverse)
library(tidymodels)
library(dplyr)
library(RColorBrewer)
set.seed(42)

#### **1. Data processing and summary statistics**

Before starting the analysis, we cleaned and wrangled the dataset so that it is suitable for use. First, we read in the dataset using a URL. 

In [None]:
players_url <- read_csv("https://raw.githubusercontent.com/emma-chow/DSCI-Final-Project/70bbf2c6fcb0a1fd395c3b650eb82c00067f8953/players.csv")
head(players_url)

To tidy the dataset, we checked for missing values, which need to be removed so that computing summary statistics later does not fail or return NA.

In [None]:
players_missing <- players_url |> 
    sapply(function(x) sum(is.na(x)))
players_missing

As shown above, there are very few missing values in the dataset, and they occur in the variable "Age". The affected observations can therefore be removed without substantially biasing the data.

In [None]:
players_data <- players_url |>
    drop_na()
glimpse(players_data)

Then, the variables used in the analysis were selected to create a separate dataframe. 

In [None]:
players <- players_data |>
    select(Age, played_hours)
head(players)

To determine the engagement level of the players, the played hours variable was categorised. The cut-offs for low, medium and high hours were chosen using data from Statista (U.S. Adults Weekly Gaming Hours by Age 2024 | Statista, 2024). According to this source, the majority of people played between 1 and 15 hours, so players with at most 1 hour were labelled "Low", those with more than 1 and up to 15 hours "Medium", and those with more than 15 hours "High".

In [None]:
players_engagement <- players |>
    mutate(engagement_level = factor((played_hours > 15) + (played_hours > 1),
        levels = c(0, 1, 2),
        labels = c("Low", "Medium", "High")))
head(players_engagement)
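
The factor construction above relies on R treating TRUE and FALSE as 1 and 0 when logical values are added. A minimal sketch on a few made-up hour values (the vector `example_hours` is hypothetical and not taken from the dataset) shows where the boundaries fall:

```r
# Hypothetical played-hours values chosen to sit on and around the cut-offs
example_hours <- c(0, 1, 1.5, 15, 15.1, 200)

# Each comparison yields TRUE/FALSE; adding them gives 0, 1 or 2
level_code <- (example_hours > 15) + (example_hours > 1)

# Map the numeric codes to the same labels used in the analysis
example_levels <- factor(level_code,
    levels = c(0, 1, 2),
    labels = c("Low", "Medium", "High"))

data.frame(example_hours, level_code, example_levels)
```

Because both comparisons are strict, exactly 1 hour is still "Low" and exactly 15 hours is still "Medium".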

Then, the data was split 70/30 into training and testing sets, stratified by engagement level.

In [None]:
set.seed(1)
players_split <- initial_split(players_engagement, prop = 0.70, strata = engagement_level)  
players_train <- training(players_split)
players_test <- testing(players_split)

head(players_train)
head(players_test)

Next, the number and proportion of observations in each class of engagement_level were computed. This will be useful for evaluating the classifier later on. The tables below show that, in both the training and testing sets, the low engagement level is the majority class.

In [None]:
players_train_proportions <- players_train |> 
    group_by(engagement_level) |>
    summarize(n = n()) |>
    mutate(percent = n/nrow(players_train)*100)
players_train_proportions

players_test_proportions <- players_test |> 
    group_by(engagement_level) |>
    summarize(n = n()) |>
    mutate(percent = n/nrow(players_test)*100)
players_test_proportions
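
One way the class proportions help later: a classifier that always predicts the most common class reaches an accuracy equal to that class's share of the data, so a useful model should beat this figure. The sketch below uses made-up counts (the `example_counts` values are assumptions for illustration, not the output of the tables above):

```r
# Hypothetical class counts; in practice, use the n column from the tables above
example_counts <- c(Low = 80, Medium = 15, High = 5)

# Share of each class, in percent
example_percent <- example_counts / sum(example_counts) * 100

# Accuracy of a baseline that always predicts the majority class
majority_class <- names(which.max(example_counts))
baseline_accuracy <- max(example_counts) / sum(example_counts)

majority_class
baseline_accuracy
```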

The next step was to compute summary statistics for each variable in the training set, which is straightforward now that the missing values have been removed:
- Age: The ages span a 41-year range, with an average of about 20 years.
- played_hours: Played hours span a wide range of 218.10 hours, with a minimum of 0.
- engagement_level: The majority class is the low engagement level; the medium and high levels have considerably fewer observations.

In [None]:
players_summary <- players_train |>
    summary()
players_summary

#### **2. Preliminary visualisations**

In [None]:
options(repr.plot.height = 4, repr.plot.width = 10)
plot_1 <- players_train |>
    ggplot(aes(x = Age, colour = engagement_level)) +
    geom_density(alpha = 0.2, linewidth = 0.75) + 
    labs(x = "Player age (years)", 
         colour = "Engagement level", 
         title = "Figure X: Distribution of player age based on engagement level") +
    theme(text = element_text(size = 10))
plot_1

This plot was created to visualise the potential relationship between age (years) and engagement level. The density curves overlap substantially, which suggests that any relationship between player age and engagement level is weak at best. However, the medium and high engagement curves (green and blue) are wider than the low engagement curve, suggesting that players who are noticeably older or younger than the average age may be more likely to engage heavily with the game. The low engagement curve peaks at around age 18 and again at around age 22, suggesting that most low-engagement players are close to the typical age in the dataset.

#### **3. Data analysis**

To begin the data analysis, we created an initial recipe in which the predictor is age and the response variable (class label) is engagement level. The predictor is scaled and centred so that it is standardised. For this first model specification, an arbitrary number of neighbours (2) was chosen, because it only serves as a starting point; cross-validation is used later to pick the best number of neighbours. A workflow combining the recipe and model was then created and fit to the training set.

In [None]:
set.seed(2)
players_recipe <- recipe(engagement_level ~ Age, data = players_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 2) |>
    set_engine("kknn") |>
    set_mode("classification")

knn_fit <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(knn_spec) |>
    fit(data = players_train)

knn_fit
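
To make the two pieces of this cell more concrete, here is a minimal base-R sketch of what standardising a predictor and taking a majority vote among the nearest neighbours amounts to. It is an illustration on assumed toy data (`toy_age`, `toy_level` and the helper `knn_vote` are hypothetical), not a reimplementation of the kknn engine:

```r
# Toy training data (hypothetical values, not from players.csv)
toy_age   <- c(17, 18, 19, 21, 22, 30, 45)
toy_level <- c("Low", "Low", "Medium", "Low", "Low", "High", "Medium")

# Standardise with the training mean and standard deviation
age_mean   <- mean(toy_age)
age_sd     <- sd(toy_age)
toy_scaled <- (toy_age - age_mean) / age_sd

# Classify one new player by majority vote among the k nearest neighbours
knn_vote <- function(new_age, k) {
    new_scaled <- (new_age - age_mean) / age_sd
    distances  <- abs(toy_scaled - new_scaled)
    nearest    <- order(distances)[seq_len(k)]
    votes      <- table(toy_level[nearest])
    names(votes)[which.max(votes)]
}

knn_vote(new_age = 20, k = 3)
```

With a single predictor, standardising does not change which neighbours are closest; it matters mainly once several predictors on different scales are combined.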

Next, we conducted five-fold cross-validation to determine the optimal number of neighbours for this model. The data frame k_vals specifies the range of k-values to be tested. We tested and plotted k from 1 to 20, which was sufficient because the accuracy estimate plateaued after k = 6, indicating that the optimal number of neighbours is 6.

In [None]:
set.seed(3)
players_vfold <- vfold_cv(players_train, v = 5, strata = engagement_level)

knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

k_vals <- tibble(neighbors = seq(from = 1, to = 20, by = 1))

players_fit <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(knn_tune) |>
    tune_grid(resamples = players_vfold, grid = k_vals) |>
    collect_metrics()
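
As a rough picture of what five-fold cross-validation does with the training rows, the base-R sketch below assigns each row to one of five folds; each fold is held out once for validation while the other four are used for fitting. The row count `n_rows` is a hypothetical number, and unlike `vfold_cv()` with `strata`, this sketch ignores stratification:

```r
set.seed(3)

# Hypothetical number of training rows
n_rows <- 23

# Shuffle the fold labels 1..5 so each row belongs to exactly one fold
fold_id <- sample(rep(1:5, length.out = n_rows))

# For each fold: rows held out for validation vs. rows used for fitting
n_validation <- as.integer(table(fold_id))
fold_sizes <- data.frame(
    fold = 1:5,
    n_validation = n_validation,
    n_fitting = n_rows - n_validation
)
fold_sizes
```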

In [None]:
set.seed(4)
options(repr.plot.height = 8, repr.plot.width = 10)
k_acc <- players_fit |>
    filter(.metric == "accuracy")

accuracy_vs_k <- ggplot(k_acc, aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Neighbors", y = "Accuracy Estimate") +
    ggtitle("Figure x: Neighbours vs accuracy estimate")
accuracy_vs_k

In [None]:
best_k <- k_acc |>
    arrange(desc(mean)) |>
    head(1) |>
    pull(neighbors)
best_k
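
When several values of k share the highest accuracy estimate, the value that is returned depends on how ties are ordered. A minimal sketch with made-up accuracy estimates (the `toy_acc` values are hypothetical, not our results) that breaks ties explicitly in favour of the smaller k:

```r
# Hypothetical accuracy estimates for k = 1..6 (not real results)
toy_acc <- data.frame(
    neighbors = 1:6,
    mean = c(0.60, 0.62, 0.70, 0.74, 0.74, 0.73)
)

# Order by descending accuracy, breaking ties by the smaller k
toy_sorted <- toy_acc[order(-toy_acc$mean, toy_acc$neighbors), ]
toy_best_k <- toy_sorted$neighbors[1]
toy_best_k
```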

Then, the model was recreated, this time with the optimal k-value. A workflow was created, and the model and recipe were fit to the training data.

In [None]:
set.seed(3)

knn_spec_best <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
    set_engine("kknn") |>
    set_mode("classification")

players_fit_best <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(knn_spec_best) |>
    fit(data = players_train)

players_fit_best

Finally, the usefulness of the classifier was evaluated on the testing set. The accuracy, precision and recall were found. Additionally, a confusion matrix was created. 

In [None]:
players_test_predictions <- predict(players_fit_best, players_test) |>
    bind_cols(players_test) 

players_test_predictions |>
    metrics(truth = engagement_level, estimate = .pred_class) |>
    filter(.metric == "accuracy")

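# engagement_level has three classes, so precision and recall are multiclass
# estimates (macro-averaged by default); event_level only affects binary metrics
players_test_predictions |>
    precision(truth = engagement_level, estimate = .pred_class, event_level = "first") 

players_test_predictions |>
    recall(truth = engagement_level, estimate = .pred_class, event_level = "first")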

confusion <- players_test_predictions |>
    conf_mat(truth = engagement_level, estimate = .pred_class)
confusion
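
To show how these metrics relate to the confusion matrix, the base-R sketch below computes accuracy, per-class precision and per-class recall by hand. The matrix `toy_conf` is hypothetical (rows are predictions, columns are true classes) and does not reproduce our results; the sketch assumes macro averaging, that is, an unweighted mean across the three classes:

```r
# Hypothetical 3x3 confusion matrix: rows = predicted class, columns = true class
toy_conf <- matrix(
    c(40, 6, 2,
       3, 2, 1,
       1, 0, 1),
    nrow = 3, byrow = TRUE,
    dimnames = list(prediction = c("Low", "Medium", "High"),
                    truth      = c("Low", "Medium", "High"))
)

# Accuracy: correct predictions (the diagonal) over all predictions
toy_accuracy <- sum(diag(toy_conf)) / sum(toy_conf)

# Per-class precision: correct / everything predicted as that class (row sums)
toy_precision <- diag(toy_conf) / rowSums(toy_conf)

# Per-class recall: correct / everything truly in that class (column sums)
toy_recall <- diag(toy_conf) / colSums(toy_conf)

# Macro averages give each class equal weight
c(accuracy = toy_accuracy,
  macro_precision = mean(toy_precision),
  macro_recall = mean(toy_recall))
```

With an imbalanced class distribution like ours, the macro averages can be much lower than the accuracy, because poor performance on the small classes counts as much as good performance on the majority class.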