# Individual Planning Report — Predicting Newsletter Subscription

**Course**: DSCI 100  
**Author**: Bonyoon Goo  


In [None]:
library(tidyverse)
library(janitor)

In [None]:
players_url  <- "https://raw.githubusercontent.com/bonyoongoo/ubc-dsci100-minecraft-forecasting/refs/heads/main/data/players.csv"
sessions_url <- "https://raw.githubusercontent.com/bonyoongoo/ubc-dsci100-minecraft-forecasting/refs/heads/main/data/sessions.csv"

players  <- read_csv(players_url) |> clean_names()
sessions <- read_csv(sessions_url) |> clean_names()


nrow(players); ncol(players); names(players)
nrow(sessions); ncol(sessions); names(sessions)

### Data collection 
These datasets come from a UBC Computer Science research project running a Minecraft server that logs player sessions and survey metadata. Player attributes (e.g., age, gender, experience, newsletter subscription) were collected via on-boarding forms; session timing and play duration come from server logs. Potential unseen issues include selection bias (self-selected participants), inaccuracies in self-reported fields, timezone normalization for timestamps, duplicate accounts, and bot or shared-account behavior.


### Questions
**Broad question:** What player characteristics and behaviours are most predictive of subscribing to the newsletter?

**Specific question:** Can `played_hours`, `experience`, and `age` predict whether a player subscribes to the newsletter?


In [None]:
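# Recode the logical `subscribe` column as a labelled factor for plots and modelling.
# Missing values in `subscribe` stay NA because if_else() propagates NA conditions.
players <- players |>
  mutate(
    newsletter_subscribed = if_else(subscribe, "Subscribed", "Not subscribed"),
    newsletter_subscribed = factor(newsletter_subscribed,
                                   levels = c("Subscribed", "Not subscribed"))
  )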

players |>
  group_by(newsletter_subscribed) |>
  summarize(players = n())


In [None]:

nrow(players); ncol(players)
names(players)


nrow(sessions); ncol(sessions)
names(sessions)

In [None]:
players |>
  summarize(across(everything(), ~ sum(is.na(.))))
sessions |>
  summarize(across(everything(), ~ sum(is.na(.))))

In [None]:
# First class of each column (the paste() wrapper was redundant)
players |>
  summarize(across(everything(), ~ class(.x)[1]))
sessions |>
  summarize(across(everything(), ~ class(.x)[1]))

In [None]:
players |>
  summarize(across(where(is.numeric), ~ round(mean(.x, na.rm = TRUE), 2)))


## Data Description

The players dataset has 8 columns and around 196 rows. The sessions dataset has 5 columns and about 1,500 rows.

Some minor missing values are present in a few variables such as age and played_hours.

Numeric variables (especially `played_hours`) are right-skewed: many players have very low or zero total hours played, with a few high outliers.

The name and hashed_email columns act as unique identifiers; they contain personally identifiable information and will be excluded from modelling to avoid leakage or privacy issues.

The experience and gender columns are categorical and will need encoding before modelling.

The subscribe column has been cleaned into the new binary variable newsletter_subscribed, which will be the response variable.

The sessions data include start_time and end_time stored as text strings; these may need to be converted to proper datetime values later during the group phase for any time-based analyses.
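
As a minimal sketch of that conversion, the snippet below parses text timestamps with lubridate and derives a session length in minutes. It runs on a small made-up tibble, not the real data. The day/month/year hour:minute format, the UTC time zone, and the `session_minutes` column name are all assumptions to check against the actual `sessions` file.

```r
library(dplyr)
library(lubridate)

# Illustrative rows only; the timestamp format "dd/mm/yyyy HH:MM" is an assumption.
toy_sessions <- tibble(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time   = c("30/06/2024 18:24", "17/06/2024 23:46")
)

sessions_parsed <- toy_sessions |>
  mutate(
    # dmy_hm() parses day-month-year plus hour:minute strings into POSIXct
    start_time = dmy_hm(start_time, tz = "UTC"),
    end_time   = dmy_hm(end_time, tz = "UTC"),
    # Hypothetical derived column: session length in minutes
    session_minutes = as.numeric(difftime(end_time, start_time, units = "mins"))
  )

sessions_parsed
```

If the real timestamps use a different layout, a different lubridate parser (for example `ymd_hms()`) would be needed instead.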

| Variable | Description | Type | Notes / Issues |
|-----------|-------------|------|----------------|
| name | Unique player identifier | chr | Remove to protect privacy |
| hashed_email | Anonymized email ID | chr | Remove to protect privacy |
| age | Player age (years) | dbl | Some missing values |
| gender | Self-reported gender | chr | Convert to factor; needs dummy encoding |
| experience | Gaming experience level | chr | Convert to factor; ordinal categorical |
| played_hours | Total hours played | dbl | Right-skewed; outliers |
| subscribe | Subscribed to newsletter (yes/no) | lgl | Response variable |
| start_time / end_time | Session timestamps | chr | Convert to datetime later |

No other serious data-quality problems are visible at this stage.

In [None]:
# count() + geom_col() is equivalent to group_by/summarize + geom_bar(stat = "identity")
players |>
  count(newsletter_subscribed, name = "players") |>
  ggplot(aes(x = newsletter_subscribed, y = players)) +
  geom_col() +
  labs(x = "Newsletter status",
       y = "Number of players",
       title = "Class balance: newsletter subscription") +
  theme_minimal()


In [None]:
players |>
  ggplot(aes(x = played_hours)) +
  geom_histogram(bins = 30) +
  labs(x = "Total played hours",
       y = "Number of players",
       title = "Distribution of played_hours") +
  theme_minimal()


In [None]:
players |>
  filter(!is.na(newsletter_subscribed)) |>
  ggplot(aes(x = newsletter_subscribed, y = played_hours)) +
  geom_boxplot() +
  labs(x = "Newsletter status",
       y = "Total played hours",
       title = "Played hours by newsletter subscription") +
  theme_minimal()


- The bar chart shows more players subscribed than not, suggesting mild class imbalance.
- The histogram reveals `played_hours` is highly right-skewed; most players have very few hours while a few play much more.
- The boxplot suggests subscribers generally spend more total hours playing than non-subscribers.
- These visual patterns indicate that play time could be a useful predictor for newsletter subscription.
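
The boxplot comparison can be backed up with a numeric summary. The sketch below computes group sizes, medians, and means of `played_hours` by subscription status on a small made-up tibble. The values are illustrative only, and the summary column names are hypothetical. The same pipeline should apply to `players` once `newsletter_subscribed` exists.

```r
library(dplyr)

# Illustrative values only, not taken from the real players data.
toy_players <- tibble(
  newsletter_subscribed = c("Subscribed", "Subscribed", "Subscribed",
                            "Not subscribed", "Not subscribed"),
  played_hours = c(0.5, 2.0, 30.0, 0.0, 0.4)
)

hours_by_status <- toy_players |>
  group_by(newsletter_subscribed) |>
  summarize(
    players      = n(),
    median_hours = median(played_hours, na.rm = TRUE),  # robust to the long right tail
    mean_hours   = mean(played_hours, na.rm = TRUE),    # pulled upward by outliers
    .groups = "drop"
  )

hours_by_status
```

With a right-skewed variable, the median is usually the safer figure to compare between groups.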

## Methods and Plan

I will use a K-Nearest Neighbors (KNN) classification model to predict whether a player subscribes to the newsletter (`newsletter_subscribed`, derived from `subscribe`) using predictors such as `played_hours`, `experience`, and `age`.

Why KNN is Appropriate:
KNN is a non-parametric model that classifies observations based on the majority class among their closest neighbors in feature space.

- It works well for classification problems where relationships may be nonlinear.

- The model’s simplicity and interpretability make it ideal for an introductory data science project.

- It aligns with DSCI 100 course concepts and is easy to implement and tune with tidymodels.
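
To make the majority-vote idea concrete, here is a from-scratch sketch in base R. It is not the tidymodels implementation that the analysis will use. The function name `knn_predict` and the toy points are hypothetical, and the predictors are assumed to be already on a comparable scale.

```r
# Classify one new point by majority vote among its k nearest training points.
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from new_x to every row of train_x
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Indices of the k closest training points
  nearest <- order(dists)[seq_len(k)]
  # Tally the labels of those neighbours; ties go to the alphabetically first label
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Toy, already-scaled predictors (two columns) with made-up labels
train_x <- matrix(c(-1.0, -0.5,
                    -0.8,  0.2,
                     1.2,  0.4,
                     0.9, -0.3,
                     1.1,  0.1),
                  ncol = 2, byrow = TRUE)
train_y <- c("Not subscribed", "Not subscribed",
             "Subscribed", "Subscribed", "Subscribed")

knn_predict(train_x, train_y, new_x = c(1.0, 0.0), k = 3)
```

The three nearest neighbours of the point (1.0, 0.0) are all labelled "Subscribed", so that is the predicted class.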

## Data Preprocessing Plan

Construct a tidymodels workflow for reproducibility.

Recipe steps, in the order they will be applied:

1. `step_impute_mean(all_numeric_predictors())` – fill in missing numeric values first, so that later steps work on complete columns.

2. `step_dummy(all_nominal_predictors())` – convert categorical predictors (e.g., experience, gender) to numeric indicator columns.

3. `step_zv(all_predictors())` – remove zero-variance predictors.

4. `step_normalize(all_numeric_predictors())` – centre and scale the predictors so that no single variable dominates the distance calculation.

If class imbalance affects results, apply `step_upsample(newsletter_subscribed)` from the themis package so that only the training data are rebalanced.
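
A recipe along these lines might look like the sketch below. It runs on a tiny made-up tibble, so the values and the object names (`toy_players`, `knn_recipe`) are illustrative. It applies imputation first, so that dummy encoding, the zero-variance filter, and normalization all see complete columns, and it passes `all_predictors()` to `step_zv()` explicitly. Upsampling is left out because it needs the separate themis package.

```r
library(tidymodels)

# Toy stand-in for the cleaned players data; values are illustrative only.
toy_players <- tibble(
  newsletter_subscribed = factor(c("Subscribed", "Not subscribed", "Subscribed",
                                   "Subscribed", "Not subscribed", "Subscribed")),
  played_hours = c(0.0, 0.1, 12.5, 3.2, 0.0, 48.0),
  age          = c(17, 21, NA, 19, 25, 22),
  experience   = factor(c("Amateur", "Pro", "Veteran", "Amateur", "Regular", "Pro"))
)

knn_recipe <- recipe(newsletter_subscribed ~ played_hours + age + experience,
                     data = toy_players) |>
  step_impute_mean(all_numeric_predictors()) |>  # fill missing numeric values
  step_dummy(all_nominal_predictors()) |>        # factors -> indicator columns
  step_zv(all_predictors()) |>                   # drop constant columns
  step_normalize(all_numeric_predictors())       # centre and scale for distances

# prep() estimates the means and standard deviations; bake(new_data = NULL)
# returns the processed training data for inspection.
baked <- knn_recipe |> prep() |> bake(new_data = NULL)
baked
```

In the real workflow the recipe would be built from the training split only and attached to the model with `workflow()`, so the preprocessing statistics never see the test data.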

## Data Splitting

Split the data once at the start:

`initial_split(players, prop = 0.75, strata = newsletter_subscribed)`

Training set: 75%; testing set: 25%.

Use 5-fold cross-validation on the training data to tune the number of neighbors (`neighbors = c(1, 3, 5, …, 51)`).
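
The split and the cross-validation folds could be set up roughly as follows. The data here are simulated placeholders with made-up distributions, and the seed is arbitrary. The grid assumes the elided sequence means every odd value from 1 to 51, which is a reading of the plan and not something stated explicitly.

```r
library(tidymodels)

set.seed(100)  # arbitrary seed so the random split is reproducible

# Simulated placeholder data; distributions are made up for illustration.
n <- 200
sim_players <- tibble(
  played_hours = rexp(n, rate = 0.2),
  age = round(runif(n, min = 10, max = 40)),
  newsletter_subscribed = factor(
    sample(c("Subscribed", "Not subscribed"), n, replace = TRUE, prob = c(0.7, 0.3))
  )
)

# One stratified 75/25 split, made before any modelling
players_split <- initial_split(sim_players, prop = 0.75, strata = newsletter_subscribed)
players_train <- training(players_split)
players_test  <- testing(players_split)

# 5-fold cross-validation on the training set only, also stratified
players_folds <- vfold_cv(players_train, v = 5, strata = newsletter_subscribed)

# Candidate K values (assumed: odd numbers from 1 to 51)
k_grid <- tibble(neighbors = seq(1, 51, by = 2))
```

Stratifying both the split and the folds keeps the subscribed and not-subscribed proportions similar across every subset, which matters when one class is smaller.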

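## Model Evaluation

Evaluate model performance with:

- Accuracy – overall proportion of correct predictions.
- Balanced accuracy – the average of sensitivity and specificity, which accounts for class imbalance.
- ROC AUC – how well the predicted class probabilities rank subscribers above non-subscribers.

Select the best K by cross-validated balanced accuracy or ROC AUC, then evaluate the final model once on the held-out test data.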
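
Putting the tuning and evaluation plan together, a hedged end-to-end sketch is shown below. It uses simulated data, not the real players table, it assumes the kknn package is installed as the engine, and it selects K by ROC AUC purely as an example (balanced accuracy could be substituted). The object names and the candidate grid of odd values from 1 to 51 are assumptions.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed for reproducibility

# Simulated placeholder data in which longer play time raises the chance of subscribing.
n <- 200
sim_players <- tibble(
  played_hours = rexp(n, rate = 0.2),
  age = round(runif(n, min = 10, max = 40)),
  newsletter_subscribed = factor(
    if_else(played_hours + rnorm(n, sd = 3) > 3, "Subscribed", "Not subscribed"),
    levels = c("Subscribed", "Not subscribed")
  )
)

players_split <- initial_split(sim_players, prop = 0.75, strata = newsletter_subscribed)
players_train <- training(players_split)
players_folds <- vfold_cv(players_train, v = 5, strata = newsletter_subscribed)

# KNN specification with the number of neighbours left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_workflow <- workflow() |>
  add_recipe(
    recipe(newsletter_subscribed ~ ., data = players_train) |>
      step_normalize(all_numeric_predictors())
  ) |>
  add_model(knn_spec)

# The three metrics named in the plan
class_metrics <- metric_set(accuracy, bal_accuracy, roc_auc)

tuned <- tune_grid(
  knn_workflow,
  resamples = players_folds,
  grid = tibble(neighbors = seq(1, 51, by = 2)),
  metrics = class_metrics
)

# Pick K by cross-validated ROC AUC, then fit once on the training set
# and score once on the held-out test set.
best_k <- select_best(tuned, metric = "roc_auc")

final_fit <- knn_workflow |>
  finalize_workflow(best_k) |>
  last_fit(players_split, metrics = class_metrics)

collect_metrics(final_fit)
```

`last_fit()` touches the test set exactly once, which keeps the reported metrics an honest estimate of out-of-sample performance.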

## Limitations & Assumptions

- Assumes nearby players in feature space have similar outcomes.
- Sensitive to feature scaling, irrelevant predictors, and class imbalance.
- Computationally expensive for large datasets (but feasible here).
- Boundaries can be noisy; model performance depends on the chosen K.

**GitHub repository**: https://github.com/bonyoongoo/ubc-dsci100-minecraft-forecasting