# DSCI 100 Project Final Report: Predicting Hours Contribution To “Players” Dataset From UBC Minecraft Players Based On Experience Level, Age, & Gender.

**Section:** 100-002. <br>
**Group Number:** 39. <br>
**Members:** <br>
Kevin Liu (91073668). <br>
Sophie Schlatter (77753598). <br>
Robbie Suganob (18247866). <br>
Adrian Tang (56996051).

***

## **PART 1: INTRODUCTION**

### Background:

Understanding player engagement is a critical aspect of game research, influencing game design, resource allocation, and targeted player recruitment, which has led to increased interest in analyzing player activity data. A research group in the Computer Science department at the University of British Columbia (UBC), led by Frank Wood, is investigating how players interact with video games by collecting their in-game behavioural data.

For their study, the research team has set up a Minecraft server, where players' in-game actions are recorded as they navigate through the virtual world. The data collected provides an understanding of how different types of players engage with the game. However, not all players contribute equally. Some engage with the game significantly more than others, and running such a project efficiently requires targeted recruitment strategies to attract the players who will contribute more data. By identifying player characteristics that correlate with high levels of engagement, the research team can focus their recruitment efforts on individuals who are most likely to produce relevant and necessary data.

### Research Question:

This report aims to answer the following question:

**"Can certain experience levels, ages, and genders predict the total number of hours a player contributes to the players dataset?"**

By analyzing how each of these player characteristics relates to total playtime, we can determine which groups of players are most engaged and therefore contribute the most data. These findings will help the research team refine recruitment strategies and allocate their resources more efficiently.

### Dataset Analysis:

To answer this question, we used two datasets: <br>
**`players.csv`** &mdash; Contains general demographic and experience-related information about players. <br>
**`sessions.csv`** &mdash; The logs of individual game sessions.

***
The descriptive summary of the variables in the **`players.csv`** dataset:

- Number of observations: **196** (indicates 196 unique users/players).
- Number of variables: **7** (listed below).

| Variable | Type | Description |
| --- | --- | --- | 
| `experience` | categorical (chr) | Refers to the player's experience level (Amateur, Beginner, Regular, Veteran, Pro). |
| `subscribe` | logical (lgl) | Whether the player is a subscriber to the game-related newsletter. |
| `hashedEmail` | categorical (chr) | A hashed version of a player's email that acts as an anonymized identifier for the player; this avoids storing actual email addresses and thus protects the players' privacy. |
| `played_hours` | numerical (dbl) | Total hours played by the player. |
| `name` | categorical (chr) | Player's name. |
| `gender` | categorical (chr) | Gender of the player. | 
| `Age` | numerical (dbl) | Player's age. |


Other notes for the **`players.csv`** dataset:
<br> - Some variables such as `gender` and `experience` may not be evenly distributed, which may introduce biases to predictions.
<br> - `played_hours` could have outliers; some players are observed to have extreme values. This could potentially skew the averages and impact the modeling.

***
The descriptive summary of the variables in the **`sessions.csv`** dataset:

- Number of observations: **1535** (indicates 1535 recorded sessions).
- Number of variables: **5** (listed below).

| Variable | Type | Description |
| --- | --- | --- | 
| `hashedEmail`| categorical (chr) | Anonymized player identifier, matches the **`players.csv`** dataset. |
| `start_time` | categorical (chr) | Date and time for the start of player's session. |
| `end_time` | categorical (chr) | Date and time for the end of player's session. |
| `original_start_time` | numerical (dbl) | A timestamp version of `start_time`. |
| `original_end_time` | numerical (dbl) | A timestamp version of `end_time`. |

Other notes for the **`sessions.csv`** dataset:
<br> - There are significantly more session observations than player observations, indicating that some players have multiple sessions.

***

## **PART 2: METHODS & RESULTS**

Here, we will load the necessary libraries and data, assigning **`players.csv`** and **`sessions.csv`** to objects **players** and **sessions** respectively:

In [None]:
library(ggplot2)
library(repr)
library(tidymodels)
library(tidyverse)

In [None]:
players <- read.csv("https://raw.githubusercontent.com/frogbie/dsci100project-002-39/refs/heads/main/players.csv")
head(players)

In [None]:
sessions <- read.csv("https://raw.githubusercontent.com/frogbie/dsci100project-002-39/refs/heads/main/sessions.csv")
head(sessions)

### Data Wrangling & Cleaning:

***

In terms of data wrangling, we will match the **players** and **sessions** datasets using the `hashedEmail` variable, which exists in both. We will then aggregate the total number of sessions per player in a new variable, `total_sessions`, and calculate the average session duration per player. Using the `mutate` function, we can create a new variable, `avg_session_length`, computed as `played_hours / total_sessions`. This lets us compare the `played_hours` and `avg_session_length` variables against `experience`, `Age`, and `gender` to determine which characteristics are the strongest predictors of the amount of data a player contributes.


In order to accomplish this, we need to perform the following: <br>
- Store our `players` data frame in the `tidy_players` variable. <br>
- Store our `sessions` data frame in the `tidy_sessions` variable, while also removing redundant data from the sessions dataset. <br>
- Find the mean value for each quantitative variable in the `players.csv` dataset. 
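
The join-and-aggregate plan described above can be sketched as follows. This is a minimal illustration on made-up toy data, not the project's actual wrangling code: the names `toy_players`, `toy_sessions`, `session_counts`, and `toy_session_summary` are hypothetical, and only the column names `hashedEmail`, `played_hours`, `total_sessions`, and `avg_session_length` come from the text.

```r
library(dplyr)

# Toy stand-ins for the real data (values are made up for illustration).
toy_players <- tibble(
  hashedEmail = c("a1", "b2", "c3"),
  played_hours = c(6, 1.5, 0)
)
toy_sessions <- tibble(
  hashedEmail = c("a1", "a1", "a1", "b2")
)

# Count the sessions recorded for each player.
session_counts <- toy_sessions |>
  count(hashedEmail, name = "total_sessions")

# Attach the counts to the player table and derive the average session length.
toy_session_summary <- toy_players |>
  left_join(session_counts, by = "hashedEmail") |>
  mutate(
    # Players with no recorded sessions get a count of 0 instead of NA.
    total_sessions = coalesce(total_sessions, 0L),
    # Leave the average undefined (NA) rather than dividing by zero.
    avg_session_length = if_else(total_sessions > 0,
                                 played_hours / total_sessions,
                                 NA_real_)
  )

toy_session_summary
```

A `left_join` is used here so that players without any recorded session stay in the table; whether such players should be kept is a design choice for the analysis.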

**Wrangling of `players` dataset:** <br>
- The data is mostly tidy, but we have arranged the experience column by experience level for the sake of organization.

In [None]:
tidy_players <- players |>
    mutate(experience = factor(experience, levels = c("Amateur", "Beginner", "Regular", "Veteran", "Pro"))) |>
    arrange(experience)

head(tidy_players)
tail(tidy_players)

**Wrangling of the `sessions` data set:** <br>
Again, the data is mostly tidy with regard to what we require for this project. Although values in the `hashedEmail` column repeat, each row represents a different observation (a unique session). Removing the `original_start_time` and `original_end_time` columns avoids redundancy; we only need `start_time` and `end_time` to calculate session durations. We will also convert these variables from character format to POSIXct format (dttm) to simplify calculations.

In [None]:
tidy_sessions <- sessions |>
    select(-original_start_time, -original_end_time) |>
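    # "PTC" is not a recognised time zone name, so "UTC" is given explicitly;
    # session durations (end_time - start_time) do not depend on this choice.
    mutate(start_time = as.POSIXct(start_time, format="%d/%m/%Y %H:%M", tz="UTC")) |>
    mutate(end_time = as.POSIXct(end_time, format="%d/%m/%Y %H:%M", tz="UTC"))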

head(tidy_sessions)

**Mean value for each quantitative variable in the `players.csv` dataset:** <br>
- Calculating the mean for each quantitative variable (age and played hours) offers a useful baseline for the analysis. For example, the mean age describes the typical player in the dataset, which gives context when identifying the age ranges where players are most active. Similarly, the mean played hours may serve as a benchmark for what we define as a significant contributor (i.e. we can check whether a given characteristic is generally associated with higher playtime than the mean).

In [None]:
players_means <- players |>
    summarise(mean_played_hours = mean(played_hours, na.rm = TRUE), mean_age = mean(Age, na.rm = TRUE))

players_means

***

### Dataset Visualisations:

#### 1. Experience Level vs. Played Hours:

In [None]:
experience_plot1 <- tidy_players |>
    ggplot(aes(x = experience, y = played_hours)) +
    geom_point(alpha = 0.6) +
    labs(title = "Played Hours By Experience Level", x = "Experience Level", y = "Played Hours") +
    theme(axis.text.x = element_text(color = "grey20", size = 12, angle = 30, hjust = .5, vjust = .5, face = "plain"),
        axis.text.y = element_text(color = "grey20", size = 12, hjust = 1, vjust = 0, face = "plain"),  
        axis.title.x = element_text(color = "grey20", size = 15, hjust = .5, vjust = 0, face = "bold"),
        axis.title.y = element_text(color = "grey20", size = 15, angle = 90, hjust = .5, vjust = 2.5, face = "bold"),
        title = element_text(size=18, face = "bold"))

experience_plot1

In [None]:
experience_means <- tidy_players |>
    group_by(experience) |>
    summarise(mean_hours = mean(played_hours, na.rm = TRUE))

experience_plot2 <- experience_means |>
    ggplot(aes(x = experience, y = mean_hours)) + 
    geom_col(fill = "green") +
    labs(title = "Average Played Hours By Experience Level", x = "Experience Level", y = "Average Played Hours") +
    theme(axis.text.x = element_text(color = "grey20", size = 12, angle = 30, hjust = .5, vjust = .5, face = "plain"),
        axis.text.y = element_text(color = "grey20", size = 12, hjust = 1, vjust = 0, face = "plain"),  
        axis.title.x = element_text(color = "grey20", size = 15, hjust = .5, vjust = 0, face = "bold"),
        axis.title.y = element_text(color = "grey20", size = 15, angle = 90, hjust = .5, vjust = 2.5, face = "bold"),
        title = element_text(size=18, face = "bold"))


experience_plot2

From the two visualizations above, amateurs appear to have the highest played hours overall. It may seem that regulars have the most playing time, which intuitively makes sense; however, because averages are sensitive to outliers in the data, we cannot say this with certainty. Veterans and Beginners had the least playing time.
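
One way to check how strongly outliers drive a group average is to report the median next to the mean. The sketch below uses made-up toy values (`toy_hours` and `hours_summary` are hypothetical names) and is only meant to illustrate the comparison, not to reproduce the project's results.

```r
library(dplyr)

# Toy data: one extreme value in the "Regular" group (values are made up).
toy_hours <- tibble(
  experience = c("Amateur", "Amateur", "Amateur", "Regular", "Regular", "Regular"),
  played_hours = c(1, 2, 3, 0.5, 1, 200)
)

# The mean is pulled up by the extreme value, while the median is not.
hours_summary <- toy_hours |>
  group_by(experience) |>
  summarise(
    n = n(),
    mean_hours = mean(played_hours, na.rm = TRUE),
    median_hours = median(played_hours, na.rm = TRUE)
  )

hours_summary
```

When the mean of a group is far above its median, as for the toy "Regular" group here, a few heavy players are likely responsible for most of the average.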

#### 2. Age vs. Played Hours:

In [None]:
age_plot <- tidy_players |>
    ggplot(aes(x = Age, y = played_hours)) +
    geom_point(alpha = 0.6) +
    labs(title = "Played Hours By Age", x = "Age", y = "Played Hours") +
    theme(axis.text.x = element_text(color = "grey20", size = 12, hjust = .5, vjust = .5, face = "plain"),
        axis.text.y = element_text(color = "grey20", size = 12, hjust = 1, vjust = 0, face = "plain"),  
        axis.title.x = element_text(color = "grey20", size = 15, hjust = .5, vjust = 0, face = "bold"),
        axis.title.y = element_text(color = "grey20", size = 15, angle = 90, hjust = .5, vjust = 2.5, face = "bold"),
        title = element_text(size=18, face = "bold"))

age_plot

From the graph above, players aged roughly 15-30 appear to contribute the most data, exhibiting the highest played hours. Outliers with extremely high played hours appear to fall in the 16-22 age range, with values from about 150 to 200 hours and beyond.

#### 3. Gender vs. Played Hours:

In [None]:
gender_plot <- tidy_players |>
    ggplot(aes(x = gender, y = played_hours)) +
    geom_point(alpha = 0.6) +
    labs(title = "Played Hours By Gender", x = "Gender", y = "Played Hours") +
    theme(axis.text.x = element_text(color = "grey20", size = 12, angle = 30, hjust = .5, vjust = .5, face = "plain"),
        axis.text.y = element_text(color = "grey20", size = 12, hjust = 1, vjust = 0, face = "plain"),  
        axis.title.x = element_text(color = "grey20", size = 15, hjust = .5, vjust = 0, face = "bold"),
        axis.title.y = element_text(color = "grey20", size = 15, angle = 90, hjust = .5, vjust = 2.5, face = "bold"),
        title = element_text(size=18, face = "bold"))
   


gender_plot

From the graph above, males make up the highest concentration of players and appear to take the top spot in hours played, with females a close second. Played hours for the other genders are generally quite low, with the majority under 20 hours. However, there is a single outlier among non-binary players (at roughly 220 hours), which should not be considered in further analysis.
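
A common way to flag such extreme values is the 1.5 × IQR rule, applied within each group. The sketch below is one possible approach on made-up toy data; the helper `is_iqr_outlier()` and the objects `toy_gender` and `flagged` are hypothetical names, and the rule itself is an assumption rather than the method used in this report.

```r
library(dplyr)

# Hypothetical helper: TRUE for values outside the 1.5 * IQR fences.
is_iqr_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Toy data with one extreme value (values are made up for illustration).
toy_gender <- tibble(
  gender = c(rep("Male", 5), rep("Non-binary", 5)),
  played_hours = c(0, 1, 2, 3, 4, 0, 1, 1, 2, 220)
)

# Flag outliers separately within each gender group.
flagged <- toy_gender |>
  group_by(gender) |>
  mutate(is_outlier = is_iqr_outlier(played_hours)) |>
  ungroup()

# Rows that would be set aside under this rule.
filter(flagged, is_outlier)
```

Flagging rather than deleting keeps the decision visible: the flagged rows can be excluded with `filter(!is_outlier)` or reported separately.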

***

### Data Analysis & Visualisation:

To answer the research question, we will use linear regression as our primary method, predicting the total number of hours a player contributes based on experience, gender, and age. Linear regression suits this task because it models a continuous response variable (total hours played) as a function of multiple predictors. It yields a straightforward model that is easy to interpret and adjust, but only under the assumption that the relationship between the predictors and the response is linear. Since experience and gender are categorical variables, they must be encoded numerically as indicator (dummy) variables before fitting; R's `lm` engine does this automatically for factor columns.
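
To make the encoding of categorical predictors concrete, the sketch below shows the design matrix that base R builds for a factor column. The data frame `toy_design_data` is a made-up example; only the column names `experience` and `Age` follow the dataset.

```r
# Toy data: one player per experience level (values are made up).
toy_design_data <- data.frame(
  experience = factor(c("Amateur", "Beginner", "Regular"),
                      levels = c("Amateur", "Beginner", "Regular")),
  Age = c(17, 21, 25)
)

# model.matrix() shows the numeric columns a linear model would actually use.
# The first factor level ("Amateur") becomes the reference and gets no column.
design <- model.matrix(~ experience + Age, data = toy_design_data)
design
```

Each remaining level gets a 0/1 indicator column, so a coefficient for `experienceBeginner` is read as a difference from the reference level.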

#### Preprocessing and fitting the data:

In [None]:
tidy_players_clean <- tidy_players
tidy_players_clean$experience <- as.factor(tidy_players_clean$experience)
tidy_players_clean$gender <- as.factor(tidy_players_clean$gender)

set.seed(1000)

players_split <- initial_split(tidy_players_clean, prop = 0.70, strata = played_hours)
players_train <- training(players_split)
players_test <- testing(players_split)

players_spec <- linear_reg() |>
    set_engine("lm") |>
    set_mode("regression")

players_recipe <- recipe(played_hours ~ experience + Age + gender, data = players_train)

players_fit <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(players_spec) |>
    fit(data = players_train)

players_fit

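The coefficients indicate the estimated influence of each variable. For example, the coefficient for "Beginner" is -5.9455, meaning that, holding the other predictors constant, a player at the "Beginner" experience level is predicted to play about 5.95 fewer hours than a player at the reference level. The reference level is "Amateur": it is the first level of the `experience` factor, so it has no coefficient of its own and is absorbed into the intercept. The same logic applies to `gender`, where each coefficient is measured against the gender level that does not appear in the output. The `Age` coefficient of -0.3155 means that each additional year of age is associated with about 0.32 fewer predicted hours. These are associations in the training data, not causal effects.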
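
As a worked example of how the coefficients combine, the sketch below computes the prediction for a hypothetical 20-year-old male player at the "Regular" level. The coefficient values are the ones reported in this report; the player profile and the object names are assumptions made for illustration.

```r
# Coefficients as reported for the fitted model in this report.
coefs <- c(
  intercept = 23.6626,
  experienceRegular = 12.5299,
  Age = -0.3155,
  genderMale = -10.8989
)

# Hypothetical player: Regular experience, male, 20 years old.
age <- 20

# Intercept + experience offset + age effect + gender offset.
predicted_hours <- coefs[["intercept"]] +
  coefs[["experienceRegular"]] +
  coefs[["Age"]] * age +
  coefs[["genderMale"]]

predicted_hours
```

For this profile the model predicts roughly 19 hours of play time.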

In [None]:
players_test_results <- players_fit |>
    predict(players_test) |>
    bind_cols(players_test) |>
    metrics(truth = played_hours, estimate = .pred)

players_test_results

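For a numeric outcome, `metrics()` reports three measures of how well the predictions match the test data. RMSE (root mean squared error) is the square root of the average squared difference between predicted and observed played hours; it is expressed in hours and penalises large errors heavily. MAE (mean absolute error) is the average absolute difference, also in hours, and is less sensitive to extreme values. R² (`rsq`) is the squared correlation between predictions and observations and indicates how much of the variation in played hours the model accounts for. Because `played_hours` contains extreme values, an RMSE much larger than the MAE would suggest that a few players are predicted very poorly.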
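
To make these definitions concrete, the sketch below computes the three metrics by hand in base R on made-up vectors. The numbers are illustrative only and are not results from this report.

```r
# Toy observed and predicted values (made up for illustration).
truth <- c(1, 2, 3, 4)
pred <- c(1.5, 2, 2.5, 5)

errors <- pred - truth

# Root mean squared error: large errors are weighted more heavily.
rmse <- sqrt(mean(errors^2))

# Mean absolute error: every error counts in proportion to its size.
mae <- mean(abs(errors))

# R-squared as the squared correlation between truth and prediction.
rsq <- cor(truth, pred)^2

c(rmse = rmse, rsq = rsq, mae = mae)
```

In this toy case the single error of 1 lifts the RMSE above the MAE, which is the pattern to expect whenever a few predictions are far off.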

In [None]:
coef_data <- data.frame(
    variable = c("Intercept", "experienceBeginner", "experienceRegular", "experienceVeteran",
               "experiencePro", "Age", "genderFemale", "genderMale", "genderNon-binary",
               "genderPrefer_not_to_say", "genderTwo_Spirited"),
    coefficient = c(23.6626, -5.9455, 12.5299, -9.9988, -5.8174, -0.3155, -5.3258, -10.8989, 5.3143, -15.4742, -18.0994))

coef_plot <- coef_data |>
    ggplot(aes(x = reorder(variable, coefficient), y = coefficient)) +
    geom_bar(stat = "identity", fill = "skyblue") +
    coord_flip() + 
    labs(title = "Coefficients Of Linear Model", x = "Predictor Variable", y = "Coefficient") +
    theme(axis.text.x = element_text(color = "grey20", size = 12, hjust = .5, vjust = .5, face = "plain"),
        axis.text.y = element_text(color = "grey20", size = 12, hjust = 1, vjust = 0, face = "plain"),  
        axis.title.x = element_text(color = "grey20", size = 15, hjust = .5, vjust = 0, face = "bold"),
        axis.title.y = element_text(color = "grey20", size = 15, angle = 90, hjust = .5, vjust = 2.5, face = "bold"),
        title = element_text(size=18, face = "bold"))

coef_plot

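This is our main visualisation: it shows which groups the model associates with more or fewer played hours relative to the reference levels. Regular players have the largest positive coefficient, suggesting that they contribute the most data. Age appears to have little influence, although its coefficient is a per-year effect and so is not directly comparable in size to the coefficients of the categorical groups. The model does not account for outliers, which would still need to be addressed (especially among male and female players).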
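
The coefficient table for such a plot can also be built directly from a fitted model instead of being typed in by hand, which keeps the plot in sync if the model is refit. The sketch below does this with base R on a toy `lm` fit; the data and the names `toy_fit_data`, `toy_fit`, and `coef_table` are made up for illustration. For a tidymodels workflow, the underlying `lm` object can be obtained with `extract_fit_engine()`.

```r
# Toy data (values are made up for illustration).
toy_fit_data <- data.frame(
  experience = factor(c("Amateur", "Amateur", "Amateur",
                        "Regular", "Regular", "Regular")),
  Age = c(17, 20, 25, 18, 22, 30),
  played_hours = c(2, 1, 3, 10, 12, 9)
)

toy_fit <- lm(played_hours ~ experience + Age, data = toy_fit_data)

# Build the variable/coefficient table from the fitted model itself.
coef_table <- data.frame(
  variable = names(coef(toy_fit)),
  coefficient = unname(coef(toy_fit))
)

coef_table
```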

***

## **PART 3: DISCUSSIONS**

**Summary:**

Through our data analysis, we found that *Amateur* players contribute the most played hours compared with *Beginner*, *Regular*, *Pro*, and *Veteran* players. This suggests that amateur players may combine growing interest with available time.

Regarding age demographics, we found that players aged around 15-30 contribute the most played hours. This is consistent with the general expectation that younger audiences engage more with games.

In terms of gender within our dataset, we found that *Males* appear to have not only the highest concentration of players but also the highest average played hours. *Female* players follow closely behind in both representation and engagement. The data for *Non-binary* players includes an outlier that disproportionately skews the average playtime. This anomaly should be excluded when interpreting overall trends, as it does not accurately reflect patterns within this demographic.

**Expectations:**

Initially, we expected *Regular*, *Pro*, or *Veteran* players to contribute the most played hours, based on the assumption that more skilled players are more engaged. Our visual analysis challenged this assumption: *Amateurs* showed the highest playing time, while the playing time of *Regular* players was inflated by outliers. The age findings matched our expectation that players aged 15-30 would produce the most played hours, as did the finding that *Males* led playing time in our dataset.

**Impacts:**

Understanding player engagement is an essential aspect of game research, influencing promotional efforts, resource allocation, and targeted player recruitment. By identifying which *age*, *gender*, and *experience level* groups contribute the most played hours, game developers and researchers can use these findings to focus on the right demographics, while also allocating time and budget to recruiting less involved players. This dual approach not only enhances the reach and inclusivity of future recruitment efforts, but also supports the development of more balanced and player-centered game design.

***

## **REFERENCES**