# Predicting Game Newsletter Subscription Using K-Nearest Neighbors

## Introduction
The purpose of this data science project is to analyze and predict player behaviours on a Minecraft server based on data collected by a UBC Computer Science research group. This data consisted of two primary files: players.csv and sessions.csv. The players.csv file included detailed information about each player, such as their age, experience level, email, and other personal traits. The sessions.csv file tracked individual play sessions for each player based on their email, including data on session times. 



### Research Question

**What player characteristics and behaviors are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?**

The specific question we addressed in this project is: "Can experience level, age, and total hours played predict newsletter subscription?"

By analyzing a player's age, experience level, and total hours played, we build a predictive model (K-NN classification) to assess how likely that player is to subscribe to the newsletter.


To answer this question, we use only the players.csv file, as it contains the relevant characteristics (age, experience level, and total hours played) for predicting a player's subscription behaviour. We first explore the dataset to understand how these variables relate to subscription status, then build the model and assess its prediction accuracy.



### Dataset Description

The dataset contains 196 observations with the following variables:

- `experience`: Player experience level (Pro, Veteran, Regular, Amateur)
- `subscribe`: Whether the player subscribed to the newsletter
- `played_hours`: Total hours played
- `Age`: Player's age
- `gender`: Player's gender
- `hashedEmail` and `name`: Removed for irrelevance

This analysis uses techniques covered in the DSCI 100 course including data wrangling, visualization, and K-Nearest Neighbors classification.


K-NN classification is a method for predicting a categorical label. In this project, the label is whether or not a player subscribes, and the prediction is based on that player's features.

K-NN classifies new observations by comparing them to similar, labeled examples in the training set. This is effective when we want to base predictions not on assumptions about the data, but on how similar a new observation is to other known observations. Since we are working with a binary classification (TRUE/FALSE) and want to leverage patterns found in past player behavior, K-NN is a suitable choice for building our predictive model.
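
The idea of classifying by similarity can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the toy values, the column names `age` and `hours`, and the helper `knn_predict_one()` are all made up for this example and are not part of the project code, which uses tidymodels instead. The features are also left unscaled here to keep the sketch short.

```r
# Toy training set: two numeric features and a TRUE/FALSE label (made-up values).
train <- data.frame(
  age   = c(17, 19, 22, 35, 41, 48),
  hours = c(30, 25, 18,  2,  1,  3),
  label = factor(c("TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "FALSE"))
)

# Hypothetical helper: classify one new point by majority vote
# among its k nearest training points.
knn_predict_one <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from new_x to every training row
  diffs <- sweep(as.matrix(train_x), 2, as.numeric(new_x))
  dists <- sqrt(rowSums(diffs^2))
  # Indices of the k closest training rows
  nearest <- order(dists)[seq_len(k)]
  # Majority vote over the labels of those rows
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

new_player <- c(age = 20, hours = 22)
prediction <- knn_predict_one(train[, c("age", "hours")], train$label,
                              new_player, k = 3)
print(prediction)
```

Here the three closest training points are all labelled TRUE, so the new player is predicted to subscribe.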







**We selected the K-Nearest Neighbors (KNN) algorithm because it is non-parametric, easy to understand, and driven directly by the similarity between observations' features. This suits our small dataset and makes it easier to see how different features affect the predictions.**

**Because `experience` is a categorical variable, we encoded it as numeric indicator columns (one-hot encoding) so that the KNN algorithm can use it when computing distances.**
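
As a rough illustration of what this encoding produces, the sketch below uses base R's `model.matrix()` on a made-up `experience` column. The toy data frame is an assumption for this example; the project itself relies on `step_dummy()` in the recipe, which (by default, as far as we know) produces the reference-level form shown second.

```r
# Hypothetical toy column mirroring the experience levels listed above.
toy <- data.frame(
  experience = factor(c("Pro", "Amateur", "Veteran", "Regular", "Pro"))
)

# With "- 1" in the formula, model.matrix() keeps one indicator column
# per level (full one-hot encoding).
one_hot <- model.matrix(~ experience - 1, data = toy)
print(one_hot)

# Without "- 1", the first level (Amateur, alphabetically) is absorbed into
# the intercept, leaving one fewer indicator column (reference-level coding).
dummies <- model.matrix(~ experience, data = toy)[, -1]
print(dummies)
```

Each row of `one_hot` contains exactly one 1, marking that player's experience level.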

**Before modeling, we prepared the dataset as follows:**

- Removed unnecessary variables (name, hashedEmail)
- Removed rows with missing values using na.omit()



In [None]:
library(tidyverse)
library(class)
library(ggplot2)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 20)
library(recipes)
library(themis)
library(dplyr)


In [None]:
# Reading players.csv file

players <- read_csv("players.csv") 

players_clean <- players |>
    select(-hashedEmail, -name) |>
    mutate(subscribe = as.factor(subscribe)) |>
    na.omit()


head(players_clean)


Upon inspecting the data, we observed that subscribers outnumber non-subscribers. To verify this, we created a table comparing the counts of each group. As shown below, 73% are subscribers, while 27% are non-subscribers. Given this class imbalance, we upsample the minority class (non-subscribers) in the training data when building the model to obtain a more balanced representation.

In [None]:
num_obs <- nrow(players)

players_subscribe <- players |>
  group_by(subscribe) |>
  summarize(
    count = n(),
    percentage = n() / num_obs * 100
  )

players_subscribe
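
To make the idea of upsampling concrete, here is a minimal base R sketch on made-up labels. The toy data frame and the seed are assumptions for illustration only; in the project, upsampling is handled by `step_upsample()` inside the recipe rather than by hand.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical imbalanced labels: 8 subscribers, 3 non-subscribers.
toy <- data.frame(
  id = 1:11,
  subscribe = factor(c(rep("TRUE", 8), rep("FALSE", 3)))
)

counts   <- table(toy$subscribe)
minority <- names(counts)[which.min(counts)]
n_needed <- max(counts) - min(counts)

# Draw extra minority rows with replacement until both classes are the same size.
minority_rows <- which(toy$subscribe == minority)
extra    <- sample(minority_rows, size = n_needed, replace = TRUE)
balanced <- rbind(toy, toy[extra, ])

print(table(balanced$subscribe))
```

Duplicated rows should only ever be added to the training data, since copies of the same player in both training and test sets would inflate the measured accuracy.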

We then summarized numeric features to understand their scale and variance (e.g., max, min, mean).

In [None]:
players_clean |> 
  summarise(
    max_hours = max(played_hours),
    min_hours = min(played_hours),
    mean_hours = mean(played_hours),
    max_age = max(Age),
    min_age = min(Age),
    mean_age = mean(Age)
  )


### **Findings on Player Type Differences**  

We visualized three relationships:

- Subscription Rate by Experience Level

- Played Hours vs Subscription

- Age Distribution vs Subscription


Based on these visualizations, we can examine how player characteristics vary and how they relate to **subscription behaviour**.

In [None]:
ggplot(players, aes(x = experience, fill = subscribe)) +
  geom_bar(position = "dodge") +
  labs(title = "Figure 1. Subscription Rate by Experience Level", x = "Experience", y = "Count", fill = "Subscribed")


In [None]:

# A boxplot is used so that the median and spread of played hours can be compared
ggplot(players, aes(x = subscribe, y = played_hours, color = subscribe)) +
  geom_boxplot() +
  labs(
    title = "Figure 2. Played Hours vs Subscription",
    x = "Subscribed",
    y = "Played Hours"
  )


In [None]:
ggplot(players, aes(x = Age, fill = subscribe)) +
  geom_histogram(bins = 15, alpha = 0.6, position = "identity", color = "black") +
  labs(title = "Figure 3. Age Distribution by Subscription", x = "Age", y = "Count", fill = "Subscribed")

In [None]:
# ggplot(players, aes(x = gender, fill = factor(subscribe))) +
#   geom_bar(position = "fill") +
#   labs(
#     title = "Figure 4. Subscription Proportion by Gender",
#     x = "Gender",
#     y = "Proportion",
#     fill = "Subscribed"
#   )


#### **1 Experience Level vs Subscription Rate**
- Pro and Veteran players have the highest subscription rates, while Regular and Amateur players have lower subscription rates.

#### **2 Played Hours vs Subscription**
- Subscribed players tend to have higher median played hours.


#### **3 Age vs Subscription**
- Younger players are more likely to subscribe, while older players (above 30) show a lower subscription rate.

- Older players may be more focused on playing the game itself than on game-related communications.


### **Training the Model**  

We split the dataset into training and test sets using an 80/20 ratio with initial_split(), stratifying on subscribe. Within the recipe, step_scale() and step_center() standardize the predictors so that they are on comparable scales, which is essential for KNN because it relies on distances between observations.
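
The reason scaling matters, and why its constants should come from the training set only, can be sketched with base R's `scale()`. The toy values below are assumptions for illustration and are not taken from players.csv.

```r
# Hypothetical training and test values for two predictors on very different scales.
train <- data.frame(Age = c(17, 21, 25, 40), played_hours = c(0.5, 2, 30, 150))
test  <- data.frame(Age = c(19, 33),         played_hours = c(1, 80))

# Learn the centering and scaling constants from the training set only.
train_mean <- colMeans(train)
train_sd   <- apply(train, 2, sd)

# Apply the same constants to both sets so that no information
# from the test set leaks into preprocessing.
train_std <- scale(train, center = train_mean, scale = train_sd)
test_std  <- scale(test,  center = train_mean, scale = train_sd)

print(round(colMeans(train_std), 10))  # training columns now have mean 0
print(apply(train_std, 2, sd))         # and standard deviation 1
print(test_std)
```

Without this step, a difference of 100 played hours would dominate a difference of 10 years of age in any Euclidean distance, regardless of which variable is more informative.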

In [None]:
set.seed(2020)

# Split the data into training (80%) and testing (20%) sets, stratified by subscribe
players_split <- initial_split(players_clean, prop = 0.80, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

# glimpse(players_train)
colnames(players_train)

# Define the recipe for preprocessing (specified on the training data only)
players_recipe <- recipe(subscribe ~ Age + experience + played_hours, data = players_train) |>
    step_dummy(experience) |> # Convert experience to numeric binary model terms
    step_naomit() |>
    step_upsample(subscribe, over_ratio = 0.6) |>  # Upsample the minority class of subscribe to 60% of the majority
    step_scale(all_predictors()) |>  # Scale all predictor variables
    step_center(all_predictors())  # Center all predictor variables

# Prepare the recipe with the training data
players_recipe_prep <- prep(players_recipe, training = players_train)

# Apply the transformations to both the training and test data
players_train_transformed <- bake(players_recipe_prep, new_data = players_train)
players_test_transformed <- bake(players_recipe_prep, new_data = players_test)

# Define the KNN model specification
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# Define the cross-validation splits
players_vfold <- vfold_cv(players_train, v = 10, strata = subscribe)

# Control for resampling
resample_control <- control_resamples(save_pred = TRUE)

# Define the workflow for KNN model fitting
knn_workflow <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(knn_spec)

# Tune the KNN model using cross-validation to find the best number of neighbors (k)
k_vals <- tibble(neighbors = seq(from = 1, to = 20, by = 1))

knn_results <- knn_workflow |>
    tune_grid(resamples = players_vfold, grid = k_vals) |>
    collect_metrics()

# Collect accuracy metrics and find the best k
accuracies <- knn_results |>
  filter(.metric == "accuracy")

best_k <- accuracies |>
    arrange(desc(mean)) |>
    head(1) |>
    pull(neighbors)

best_k

# Update the KNN model with the best k value
knn_spec_best <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
    set_engine("kknn") |>
    set_mode("classification")

In [None]:
# Update the workflow with the best k

colnames(players_train_transformed)
knn_workflow_final <- knn_workflow |>
    update_model(knn_spec_best)

# Fit the final model on the raw training data; the workflow applies the recipe itself,
# so passing already-baked data would fail because `experience` no longer exists there
knn_final_fit <- fit(knn_workflow_final, data = players_train)

# Make predictions on the raw test set and keep the true labels alongside them
knn_predictions <- predict(knn_final_fit, new_data = players_test) |>
    bind_cols(players_test)

# Evaluate the predictions
confusion_matrix <- conf_mat(knn_predictions, truth = subscribe, estimate = .pred_class)

# Print the confusion matrix to see the results
confusion_matrix



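Next, we created the recipe again, this time upsampling the minority class to a 1:1 ratio (over_ratio = 1), together with a K-NN specification using 3 neighbors.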

In [None]:
set.seed(4040) # DO NOT REMOVE
players_recipe <- recipe(subscribe ~ Age + experience + played_hours, data = players_train) |>
    step_dummy(experience) |> # To turn into numeric binary model terms
    step_upsample(subscribe, over_ratio = 1) |>  # To upsample the minority class (non-subscribers) to a 1:1 ratio
    step_scale(all_predictors()) |>
    step_center(all_predictors()) 

players_recipe


knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
    set_engine("kknn") |>
    set_mode("classification")



We then used 10-fold cross-validation on the training set to find the best value of K.

In [None]:
set.seed(2020) # DO NOT REMOVE


players_vfold <- vfold_cv(players_train, v = 10, strata = subscribe)

resample_control <- control_resamples(save_pred = TRUE)


players_vfold_fit <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(knn_spec) |>
    fit_resamples(resamples = players_vfold, control = resample_control)


players_metrics <- collect_metrics(players_vfold_fit)


knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

k_vals <- tibble(neighbors = seq(from = 1, to = 20, by = 1))

knn_results <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(knn_tune) |>
    tune_grid(resamples = players_vfold, grid = k_vals) |>
    collect_metrics()

knn_results

accuracies <- knn_results |>
  filter(.metric == "accuracy")

accuracies
accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line(color = "blue") +
  labs(
    title = "Figure 5. KNN Accuracy vs K",
    x = "Number of Neighbors (K)",
    y = "Prediction Accuracy"
  ) +
  theme(text = element_text(size = 12))

accuracy_vs_k

best_k <- accuracies |>
        arrange(desc(mean)) |>
        head(1) |>
        pull(neighbors)
best_k

In [None]:
# table(players_train$subscribe)
# table(players_test$subscribe)
# library(yardstick)

# players_predictions <- players_vfold_fit |>
#   collect_predictions()

# conf_mat(players_predictions, truth = subscribe, estimate = .pred_class)

# rec <- prep(players_recipe)
# baked_data <- bake(rec, new_data = NULL)

# table(baked_data$subscribe)

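We tested K values from 1 to 20 and estimated the accuracy for each value using 10-fold cross-validation on the training set. The results are plotted in Figure 5, which shows how model performance varies with K.

We selected K = 15, the value with the highest cross-validation accuracy, as the best K (stored in best_k).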
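
The tuning procedure can be sketched without tidymodels to show what happens under the hood. This is a simplified illustration under several assumptions: it uses the built-in `iris` data as a stand-in for the players data, 5 unstratified folds instead of 10 stratified ones, `knn()` from the `class` package, and it scales the predictors once up front rather than within each fold. The name `best_k_sketch` is hypothetical and unrelated to `best_k` above.

```r
library(class)  # provides knn()

set.seed(2020)  # arbitrary seed for reproducible fold assignment

# Built-in iris data stands in for the players data (illustration only).
x <- scale(iris[, 1:4])
y <- iris$Species

n_folds <- 5
fold_id <- sample(rep(seq_len(n_folds), length.out = nrow(x)))
k_grid  <- 1:20

# For each k, average the held-out accuracy across the folds.
cv_accuracy <- sapply(k_grid, function(k) {
  fold_acc <- sapply(seq_len(n_folds), function(f) {
    held_out <- fold_id == f
    pred <- knn(train = x[!held_out, ], test = x[held_out, ],
                cl = y[!held_out], k = k)
    mean(pred == y[held_out])
  })
  mean(fold_acc)
})

# Pick the k with the highest mean cross-validation accuracy.
best_k_sketch <- k_grid[which.max(cv_accuracy)]
print(data.frame(k = k_grid, accuracy = round(cv_accuracy, 3)))
print(best_k_sketch)
```

Because every candidate K is scored only on data held out from fitting, the test set stays untouched until the final evaluation.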

In [None]:
set.seed(2020) # DO NOT REMOVE

# Creating model
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(knn_spec) |>
  fit(data = players_train)

knn_fit



In [None]:
set.seed(2020) # DO NOT REMOVE

players_test_predictions <- predict(knn_fit, players_test) |>
  bind_cols(players_test)

players_test_predictions 


In [None]:
set.seed(2020) # DO NOT REMOVE

players_test_predictions |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy")

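# Note: event_level = "first" treats the first factor level of subscribe
# (FALSE, i.e. non-subscribers) as the positive class for precision and recall
players_test_predictions |>
    precision(truth = subscribe, estimate = .pred_class, event_level = "first")

players_test_predictions |>
    recall(truth = subscribe, estimate = .pred_class, event_level = "first")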

In [None]:
confusion <- players_test_predictions |>
             conf_mat(truth = subscribe, estimate = .pred_class)
confusion
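
For reference, the relationship between a confusion matrix and the three metrics above can be worked through by hand. The labels below are made up for illustration and are not results from this project; the sketch also treats subscribers (TRUE) as the positive class, which is an assumption that differs from the `event_level = "first"` setting used above.

```r
# Hypothetical truth and predictions (not results from this project).
truth <- factor(c("TRUE", "TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE"),
                levels = c("FALSE", "TRUE"))
pred  <- factor(c("TRUE", "TRUE", "FALSE", "TRUE", "TRUE", "FALSE", "TRUE", "FALSE"),
                levels = c("FALSE", "TRUE"))

# Rows are predictions, columns are true labels.
cm <- table(Prediction = pred, Truth = truth)
print(cm)

positive <- "TRUE"  # treat subscribers as the positive class in this sketch
tp <- cm[positive, positive]   # predicted TRUE, actually TRUE
fp <- cm[positive, "FALSE"]    # predicted TRUE, actually FALSE
fn <- cm["FALSE", positive]    # predicted FALSE, actually TRUE

accuracy  <- sum(diag(cm)) / sum(cm)   # share of all predictions that are correct
precision <- tp / (tp + fp)            # share of predicted positives that are correct
recall    <- tp / (tp + fn)            # share of actual positives that are found
print(c(accuracy = accuracy, precision = precision, recall = recall))
```

With imbalanced classes, precision and recall are worth reporting alongside accuracy, since a model that always predicts the majority class can still score a high accuracy.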

## Discussion
**Figure 1: Subscription Rate by Experience Level**

Analysis:
The plot shows that players with higher experience levels (Veteran and Pro) have higher subscription rates than Regular or Amateur players. This supports the assumption that more experienced players are more engaged with the game and thus more likely to subscribe. It may also reflect that experienced players have been in the game longer and have had more opportunities to see the newsletter option.

**Figure 2: Played Hours vs Subscription (Boxplot)**

Analysis:
The boxplot indicates that subscribed players tend to have higher median and maximum playtime. This seems intuitive: more invested players may want to stay updated. However, it is also possible that subscription leads to more engagement over time (reverse causality).

**Figure 3: Age Distribution by Subscription**

Analysis:
Younger players appear more likely to subscribe. This may reflect digital habits: younger users tend to be more open to email subscriptions and more accustomed to in-game notifications and marketing. Conversely, older users may be less willing to accept such subscriptions.

**Figure 4: Subscription Count by Gender**

Analysis:
There is a small difference in subscription counts between male and non-male players. However, the difference is minor and may not be statistically meaningful. 

These results mostly align with our expectations. Highly engaged players, meaning those who play longer or identify as experienced, are more likely to subscribe. These players may have a greater emotional investment in the game and are therefore more interested in receiving updates and special content. The limited effect of gender was also expected, as gender is often a weak predictor unless the product is gender-targeted.

These findings can help game developers better understand user behavior. For example, promotional efforts could be tailored to high-playtime players to increase conversions. Knowing that experience and age influence subscription likelihood, marketing emails could highlight tips, achievements, or community events to attract younger or more engaged players. Furthermore, the game team could consider promoting newsletter subscriptions through platforms like Instagram and TikTok, which are more popular among younger audiences.

## Project Limitations and Improvements
**Sample Size**: With only 196 players in the dataset, the sample may not be sufficiently representative of the overall player population. In particular, certain groups of people (e.g., older players, non-male genders, or very casual users) may be underrepresented, potentially biasing the model’s conclusions. Expanding the dataset with more diverse and balanced samples would improve the robustness and generalizability of the findings.

**Feature Limitation**: The model currently relies on a limited set of variables: age, experience, and total playtime. These features, while relevant, provide only a static snapshot of user behavior. Incorporating dynamic behavioral indicators such as login frequency, session duration, recent activity, or interaction with in-game content could significantly enhance predictive power. Behavioral features often capture intent and engagement more effectively than static demographics.

**Platform Bias**: The dataset does not capture from which channels players encountered or accessed the subscription option. Players who engage primarily via mobile or social platforms might behave differently. Future data collection could include interaction channels to analyze cross-platform behavior differences.