<h1>Exploring Minecraft Players Data: Can played_hours and Age predict the subscribe variable (whether a player subscribes to the newsletter) in the players dataset?</h1>

## Introduction

In online gaming communities, player engagement goes beyond just time spent in-game. Some players actively seek deeper involvement, such as following updates, engaging with their communities, or subscribing to newsletters. Others log in, play, and leave without further interaction, remaining passive participants. Understanding what drives that deeper level of engagement can help researchers make better decisions about outreach and recruitment.

A UBC Computer Science research group, led by Frank Wood, operates a Minecraft research server that collects detailed player activity data. The dataset includes two files:
- `players.csv`: A list of all unique players, including data about each player.
- `sessions.csv`: A list of individual play sessions by each player, including data about the session.

This project examines whether **played hours** and **age** predict a player’s likelihood of subscribing to the newsletter. Are younger players more likely to stay connected? Does spending more time in-game mean someone is more engaged overall? By analyzing data from `players.csv`, we aim to uncover patterns that provide insight into player behavior and long-term involvement.

The dataset `players.csv` contains 7 variables with 196 observations:

| Variable      | Description                                                                 |
|---------------|-----------------------------------------------------------------------------|
| experience    | Player's experience level (e.g., beginner, intermediate)                   |
| subscribe     | Whether the player is subscribed (`TRUE`/`FALSE`)                          |
| hashedEmail   | An anonymized player identifier                                             |
| played_hours  | Total hours the player has played                                           |
| name          | Player's name (may not be unique)                                           |
| gender        | Gender identity of the player                                               |
| Age           | Player’s age                                                                |

We chose `played_hours` and `Age` to predict `subscribe` because they represent key behavioral and demographic factors that likely influence a player's decision to subscribe. `played_hours` reflects how invested a player is in the game, and those who spend more time playing may be more inclined to subscribe. `Age`, on the other hand, can affect habits and preferences, as different age groups may differ in disposable income or subscription behavior.


Load data: We will only use the players dataset in this analysis!

In [None]:
library(tidyverse)

# Read the players dataset directly from the project repository
players_data <- read_csv("https://raw.githubusercontent.com/emmah47/dsci100-project/refs/heads/main/players.csv", show_col_types = FALSE)

In [None]:
head(players_data)

In [None]:
summary(players_data)

**Data Wrangling**

The summary shows 2 NA values in `Age`; we remove those observations.

In [None]:
clean_data <- filter(players_data, !is.na(Age))
summary(clean_data)

<br>

**Data Summary**

Categorical variables:

In [None]:
print("Player experience summary")
table(clean_data$experience)

In [None]:
print("Player subscription summary")
table(clean_data$subscribe)

In [None]:
print("Player gender summary")
table(clean_data$gender)

<br>

### Number of Observations:
194

### Number of Variables:
7

### Summary of numerical features: 
**Type**: `played_hours` and `Age` have type \<dbl>, a numeric type that can hold decimal values. <br>
**Description**: `played_hours` is the number of hours a player has played the game; `Age` is the age of the player.
| Column       | Min      | 1st Qu.  | Median  | Mean    | 3rd Qu.  | Max    | 
|-------------|---------|----------|---------|--------|----------|---------|
| played_hours | 0.000  | 0.000    | 0.100   | 5.846  | 0.600    | 223.100 | 
| Age         | 8.00    | 17.00    | 19.00   | 20.52  | 22.00    | 50.00   | 


### Summary of categorical features: 
**Experience Summary** <br>
**Type**: \<chr>, a string<br>
**Description**: the level of previous experience the player has.
| Category  | Count |
|-----------|-------|
| Amateur   | 63    |
| Beginner  | 35    |
| Pro       | 14    |
| Regular   | 36    |
| Veteran   | 48    |

**Subscription Summary**  <br>
**Type**: \<lgl>, true or false boolean value<br>
**Description**: true if the player has subscribed to a game-related newsletter, false if not.
| Subscribed | Count |
|------------|-------|
| FALSE      | 52    |
| TRUE       | 144   |

**Gender Summary**  <br>
**Type**:  \<chr>, a string<br>
**Description**: the player's gender.
| Gender               | Count |
|----------------------|-------|
| Agender             | 2     |
| Female              | 37    |
| Male                | 124   |
| Non-binary          | 15    |
| Other               | 1     |
| Prefer not to say   | 11    |
| Two-Spirited        | 6     |


### How the data is collected:
When users sign up, they fill out an anonymized form in which they choose a name from a list of available names and provide their experience, gender, age, and email. `played_hours` is presumably recorded by the researchers logging connections to the server, or by some similar method.

**Visualizations**

In [None]:
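# Percentage of subscribed vs. non-subscribed players in the cleaned data
# (players_train is not defined until the split further down)
sub_proportions <- clean_data |>
                      group_by(subscribe) |>
                      summarize(n = n()) |>
                      mutate(percent = 100*n/nrow(clean_data))

sub_proportions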

In [None]:
options(repr.plot.width = 10, repr.plot.height = 8)

# Played hours by experience level, coloured by subscription status
players_plot <- ggplot(players_data, aes(x = experience, y = played_hours, colour = subscribe)) +
    geom_point() +
    xlab("experience") +
    ylab("played hours") +
    labs(colour = "subscribe") +
    theme_minimal() +
    scale_colour_brewer(palette = "Set2")  # colour scale, since the mapped aesthetic is colour, not fill
    # scale_y_log10()
players_plot

**KNN**

In [None]:
players_data <- clean_data |>
    select(played_hours, experience, subscribe) |>
    mutate(subscribe = as_factor(subscribe), 
           experience = recode(experience, Beginner = 1, Amateur = 2, Regular = 3, Pro = 4, Veteran = 5))
    
head(players_data)

In [None]:
library(tidymodels)  # provides initial_split(), training(), testing()
set.seed(1)

players_split <- initial_split(players_data, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

In [None]:
library(tidymodels)
set.seed(1)


knn_spec <- nearest_neighbor(weight_func = "optimal", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")
knn_spec

In [None]:
players_recipe <- recipe(subscribe ~ played_hours + experience, data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

players_recipe
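
The recipe above standardizes both predictors before KNN measures distances. As a rough illustration of why that matters, the sketch below (base R, so it runs without extra packages) compares how much of the squared Euclidean distance between two players comes from `played_hours` before and after standardizing. The values in `toy` are made up for illustration and are not rows from `players.csv`; the helper `hours_share()` is likewise hypothetical.

```r
# Made-up values for illustration only (not rows from players.csv)
toy <- data.frame(
  played_hours = c(0, 0.1, 0.6, 5, 223.1),
  experience   = c(1, 2, 3, 4, 5)
)

# Hypothetical helper: share of the squared Euclidean distance between
# rows i and j that is contributed by played_hours
hours_share <- function(df, i, j) {
  sq_diff <- (unlist(df[i, ]) - unlist(df[j, ]))^2
  unname(sq_diff["played_hours"] / sum(sq_diff))
}

# Unscaled predictors
raw_share <- hours_share(toy, 1, 5)

# Standardized predictors (mean 0, standard deviation 1 per column),
# which is the combined effect of step_scale() and step_center()
scaled_toy <- as.data.frame(scale(toy))
scaled_share <- hours_share(scaled_toy, 1, 5)

c(raw = raw_share, scaled = scaled_share)
```

In this toy example nearly all of the unscaled distance comes from `played_hours`, while after standardizing both predictors contribute on a comparable scale.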

In [None]:
set.seed(1)

players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)
grid_vals <- tibble(neighbors = seq(1, 50, 1))

cv_results <- workflow() |>
                  add_recipe(players_recipe) |>
                  add_model(knn_spec) |>
                  tune_grid(resamples = players_vfold, grid = grid_vals) 

vfold_metrics <- cv_results |>
                  collect_metrics()

accuracies <- vfold_metrics |>
  filter(.metric == "accuracy")

accuracies

In [None]:
accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Neighbors", y = "Accuracy Estimate") +
  theme(text = element_text(size = 12))

accuracy_vs_k

In [None]:
set.seed(1)

# k chosen from the cross-validation accuracy plot above
best_k <- 17

# Use the same weight function that was used while tuning k
knn_spec <- nearest_neighbor(weight_func = "optimal", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(knn_spec) |>
  fit(data = players_train)

knn_fit


In [None]:
# Predict on the test set with the fitted KNN workflow
players_test_predictions <- predict(knn_fit, players_test) |>
  bind_cols(players_test)

players_test_predictions |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy")

In [None]:
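# as_factor() on a logical column gives the levels FALSE, TRUE, so
# event_level = "first" treats FALSE (not subscribed) as the positive class
players_test_predictions |>
    precision(truth = subscribe, estimate = .pred_class, event_level="first")

In [None]:
players_test_predictions |>
    recall(truth = subscribe, estimate = .pred_class, event_level="first")

In [None]:
# Named knn_conf_mat so the result does not shadow yardstick's conf_mat()
knn_conf_mat <- players_test_predictions |>
    conf_mat(truth = subscribe, estimate = .pred_class)
knn_conf_mat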

**Logistic Regression**

In [None]:
# Logistic regression has no hyperparameter to tune here, so no grid is needed
logit_spec <- logistic_reg() |>
    set_engine("glm") |>
    set_mode("classification")

logit_fit <- workflow() |>
              add_recipe(players_recipe) |>
              add_model(logit_spec) |>
              fit(data = players_train)

logit_fit

In [None]:
players_predictions <- predict(logit_fit, players_test) |>
  bind_cols(players_test)

players_predictions |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy")

In [None]:
# Evaluate the logistic regression predictions (players_predictions), not the KNN ones
players_predictions |>
    precision(truth = subscribe, estimate = .pred_class, event_level="first")

In [None]:
players_predictions |>
    recall(truth = subscribe, estimate = .pred_class, event_level="first")

In [None]:
logit_conf_mat <- players_predictions |>
    conf_mat(truth = subscribe, estimate = .pred_class)
logit_conf_mat
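
The accuracy, precision, and recall reported by the cells above can all be derived by hand from a confusion matrix. The sketch below shows the arithmetic in base R on hypothetical counts; they are not results from this analysis. It assumes the layout that yardstick's `conf_mat()` prints (predictions in rows, truth in columns) and uses a hypothetical helper, `precision_recall()`, to show how the choice of positive class changes the numbers.

```r
# Hypothetical test-set counts, NOT results from this analysis.
# Rows are predictions, columns are the truth.
cm <- matrix(
  c(3, 10,   # truth = FALSE: predicted FALSE, predicted TRUE
    2, 34),  # truth = TRUE:  predicted FALSE, predicted TRUE
  nrow = 2,
  dimnames = list(Prediction = c("FALSE", "TRUE"), Truth = c("FALSE", "TRUE"))
)

# Hypothetical helper: precision and recall for a chosen positive class
precision_recall <- function(cm, positive) {
  tp <- cm[positive, positive]
  c(precision = tp / sum(cm[positive, ]),  # TP / everything predicted positive
    recall    = tp / sum(cm[, positive]))  # TP / everything truly positive
}

accuracy <- sum(diag(cm)) / sum(cm)
pr_true  <- precision_recall(cm, "TRUE")   # subscribed as the positive class
pr_false <- precision_recall(cm, "FALSE")  # not subscribed as the positive class

accuracy
pr_true
pr_false
```

Because `subscribe` has the levels `FALSE`, `TRUE` in that order, `event_level = "first"` corresponds to the `pr_false` calculation; `event_level = "second"` would correspond to `pr_true`.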

In [None]:
library(tidyverse)

players_data <- read_csv("https://raw.githubusercontent.com/emmah47/dsci100-project/refs/heads/main/players.csv", show_col_types = FALSE)
session_data <- read_csv("https://raw.githubusercontent.com/emmah47/dsci100-project/refs/heads/main/sessions.csv", show_col_types = FALSE)

<br><br>

<h2>1. Data Description</h2>

<h3>First, we take a look at the players data</h3>

In [None]:
head(players_data)

<br>

Below we find some summary statistics and distribution information for variables inside the dataset:

In [None]:
summary(players_data)

In [None]:
print("Player experience summary")
table(players_data$experience)

In [None]:
print("Player subscription summary")
table(players_data$subscribe)

In [None]:
print("Player gender summary")
table(players_data$gender)

In [None]:
# finding rows with missing values
players_data[!complete.cases(players_data), ]

### Number of Observations:
196

### Number of Variables:
7

### Summary of numerical features: 
**Type**: `played_hours` and `Age` have type \<dbl>, a numeric type that can hold decimal values. <br>
**Description**: `played_hours` is the number of hours a player has played the game; `Age` is the age of the player.
| Column       | Min      | 1st Qu.  | Median  | Mean    | 3rd Qu.  | Max     | NA Count |
|-------------|---------|----------|---------|--------|----------|---------|----------|
| played_hours | 0.000  | 0.000    | 0.100   | 5.846  | 0.600    | 223.100 | 0        |
| Age         | 8.00    | 17.00    | 19.00   | 20.52  | 22.00    | 50.00   | 2        |


### Summary of categorical features: 
**Experience Summary** <br>
**Type**: \<chr>, a string<br>
**Description**: the level of previous experience the player has.
| Category  | Count |
|-----------|-------|
| Amateur   | 63    |
| Beginner  | 35    |
| Pro       | 14    |
| Regular   | 36    |
| Veteran   | 48    |

**Subscription Summary**  <br>
**Type**: \<lgl>, true or false boolean value<br>
**Description**: true if the player has subscribed to a game-related newsletter, false if not.
| Subscribed | Count |
|------------|-------|
| FALSE      | 52    |
| TRUE       | 144   |

**Gender Summary**  <br>
**Type**:  \<chr>, a string<br>
**Description**: the player's gender.
| Gender               | Count |
|----------------------|-------|
| Agender             | 2     |
| Female              | 37    |
| Male                | 124   |
| Non-binary          | 15    |
| Other               | 1     |
| Prefer not to say   | 11    |
| Two-Spirited        | 6     |


### Issues with the data:
We have two observations that are missing the `Age` feature. Another potential issue is that the dataset is very small (196 records), so it may be less representative of the entire population than a larger dataset would be.

### How the data is collected:
When users sign up, they fill out an anonymized form in which they choose a name from a list of available names and provide their experience, gender, age, and email. I assume played_hours is recorded by the researchers logging connections to the server, or by some similar method.

<br><br>

<h3>Now we can do the same analysis on session data:</h3>

In [None]:
head(session_data)

In [None]:
summary(session_data)

In [None]:
# finding rows with missing values
session_data[!complete.cases(session_data), ]

### Number of Observations:
1535

### Number of Variables:
5

### Summary of numerical features: 
**original_start_time** <br>
**Type**: \<dbl>, which is a numeric value that can have decimal points. <br>
**Description**: session start time, as a timestamp in milliseconds since the Unix epoch (January 1, 1970, UTC)
| Statistic  | Value        |
|------------|-------------|
| Min        | 1.712e+12   |
| 1st Quartile | 1.716e+12   |
| Median     | 1.719e+12   |
| Mean       | 1.719e+12   |
| 3rd Quartile | 1.722e+12   |
| Max        | 1.727e+12   |
| NA's       | 0          |


**original_end_time** <br>
**Type**: \<dbl>, which is a numeric value that can have decimal points. <br>
**Description**: session end time, as a timestamp in milliseconds since the Unix epoch (January 1, 1970, UTC)
| Statistic  | Value        |
|------------|-------------|
| Min        | 1.712e+12   |
| 1st Quartile | 1.716e+12   |
| Median     | 1.719e+12   |
| Mean       | 1.719e+12   |
| 3rd Quartile | 1.722e+12   |
| Max        | 1.727e+12   |
| NA's       | 2           |



### Issues with the data:
Two observations are missing both the end_time and original_end_time features. The data is also untidy: start_time and end_time each contain both a date (day, month, year) and a time, which should be separated.
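
One possible way to tidy the timestamps is to start from the epoch-millisecond columns and derive separate date and time fields. The sketch below uses base R and two rounded values from the `original_start_time` summary table (minimum and median), so the results are approximate; the column names `start_date` and `start_time` in the output are illustrative choices, not part of the dataset.

```r
# Rounded example values from the summary table (minimum and median start times)
original_start_time <- c(1.712e12, 1.719e12)

# POSIXct counts seconds since the epoch, so divide the milliseconds by 1000
start_dt <- as.POSIXct(original_start_time / 1000, origin = "1970-01-01", tz = "UTC")

# Split each timestamp into a date column and a time-of-day column
sessions_tidy <- data.frame(
  start_date = format(start_dt, "%Y-%m-%d"),
  start_time = format(start_dt, "%H:%M:%S")
)

sessions_tidy
```

The same conversion would apply to `original_end_time`, with the two missing values remaining `NA` after conversion.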

### How the data is collected:
This data is collected by recording the start and end of each user's play sessions.

<br><br>

<h2>2. Questions</h2>

**Broad question**: Question 1: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

**Specific question**: Can experience, played_hours, gender, and age predict the subscribe variable (whether a player subscribes to the newsletter or not) in the players_data dataset?

This data will help me address my question because I can analyze the relationships between the explanatory variables and the response variable using methods such as logistic regression or KNN. I will first wrangle the data to impute any NA values by filling them in with the median.
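
As a minimal sketch of the median imputation described above, the base R snippet below fills missing ages with the median of the observed ones. The vector `toy_age` is hypothetical and does not contain the real `players.csv` values.

```r
# Hypothetical ages with two missing values (not the real players.csv rows)
toy_age <- c(17, 19, NA, 22, 50, NA)

# Median of the observed values only
age_median <- median(toy_age, na.rm = TRUE)

# Replace each NA with that median and leave observed values untouched
age_imputed <- ifelse(is.na(toy_age), age_median, toy_age)

age_imputed
```

In a modelling workflow the median would normally be computed from the training split only, so that no information leaks from the test set; the `recipes` package provides `step_impute_median()` for that purpose.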

<br><br>

<h2>3. Exploratory Data Analysis and Visualization</h2>

I do not think the sessions dataset will be useful to my predictions and I am not using any variables from that dataset in my question. Therefore, I will not be wrangling or exploring the second dataset in this section.

**Summary statistics for quantitative variables in players.csv:**

In [None]:
players_quantitative_data <- select(players_data, played_hours, Age)
summary(players_quantitative_data)

**Mean values of quantitative variables in players.csv dataset:**

| Column       | Mean    | 
|-------------|---------|
| played_hours | 5.846  | 
| Age         | 20.52  |

<br>

<h3>Visualizations</h3>

In [None]:
ggplot(players_data, aes(x = experience, fill = as.factor(subscribe))) +
  geom_bar(position = "dodge") +
  labs(title = "Subscription Count by Experience Level",
       x = "Experience Level",
       y = "Count",
       fill = "Subscribed")

In [None]:
ggplot(players_data, aes(x = gender, fill = as.factor(subscribe))) +
  geom_bar(position = "dodge") +
  labs(title = "Subscription Count by Gender",
       x = "Gender",
       y = "Count",
       fill = "Subscribed")

In [None]:
played_hours_scatterplot <- ggplot(players_data, aes(x = played_hours, y = subscribe)) +
    geom_point(alpha = 0.3) +
    labs(title = "Subscription vs Play Time (hrs)",
         x = "Play time (hours)",
         y = "Subscribed")

played_hours_scatterplot

<br><br>

<h2>4. Methods and Plan</h2>

<h3>Proposed Method: K-Nearest-Neighbors (KNN)</h3>

**I will use KNN because:**
- It doesn't require the relationship between the variables to be linear, unlike linear/logistic regression
- It works well when the dimensionality of the data isn't high

**Assumptions:**
- the data is large enough for KNN to have enough neighbors to compare a new data point to
- data will be scaled (this will be done in the next part of the project)
- no irrelevant features (this will skew distance calculation)

**Weaknesses:**
- irrelevant features can skew prediction
- highly unbalanced dataset can skew prediction
- less interpretable compared to linear models
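
To put the class-imbalance weakness in numbers, the sketch below uses the subscription counts reported in the Data Description (52 `FALSE`, 144 `TRUE`) to compute the accuracy of a trivial classifier that always predicts the majority class. Any KNN model would need to beat this baseline to be useful. The variable names are illustrative.

```r
# Subscription counts from the summary table for the full players.csv (196 rows)
subscribe_counts <- c("FALSE" = 52, "TRUE" = 144)

# Proportion of each class
class_proportions <- subscribe_counts / sum(subscribe_counts)

# A classifier that always predicts the majority class achieves this accuracy
majority_class <- names(which.max(subscribe_counts))
baseline_accuracy <- max(class_proportions)

round(class_proportions, 3)
majority_class
baseline_accuracy
```

Because roughly three quarters of players are subscribed, accuracy alone can look good even for a model that never predicts `FALSE`, which is why precision and recall are worth checking as well.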

**Model selection and cross-validation**
- I will first split the data to 80% training 20% test because the dataset is small so I would like more data points for training.
- I will do cross validation with the training and validation set in order to find the best hyperparameter k.
- I will use the best k in my final KNN model.
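
The plan above can be sketched in base R as follows. This is only an illustration of the index bookkeeping, assuming stand-in labels with the reported class counts (144 `TRUE`, 52 `FALSE`) and 5 folds; the fold count and all variable names are assumptions, and the folds here are assigned at random rather than stratified.

```r
set.seed(1)

# Stand-in labels with the class counts reported for players.csv
subscribe <- rep(c(TRUE, FALSE), times = c(144, 52))

# Stratified 80/20 split: sample 80% of the row indices within each class
idx_by_class <- split(seq_along(subscribe), subscribe)
train_idx <- unlist(
  lapply(idx_by_class, function(idx) sample(idx, round(0.8 * length(idx)))),
  use.names = FALSE
)
test_idx <- setdiff(seq_along(subscribe), train_idx)

# Assign each training row to one of 5 cross-validation folds of near-equal size
folds <- sample(rep(1:5, length.out = length(train_idx)))

c(train = length(train_idx), test = length(test_idx))
table(folds)
```

Each candidate k would then be scored by training on four folds and validating on the fifth, rotating through all five; with tidymodels, `initial_split()` and `vfold_cv()` with a `strata` argument handle this bookkeeping.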
