# Predicting Newsletter Subscription from Player Behaviour
**Final Report — DSCI Project (Question 1)**  
**Team:** _Caleb · Melki · Four · Your Name_  
**Date:** 2025-12-03

> This notebook is the final, fully reproducible report. It follows the required sections: Title, Introduction, Methods & Results, Discussion, and References. All figures are numbered with captions. Keep non-code text ≤ **2000 words** (citations excluded).


## Introduction
### Background
Briefly describe the Minecraft research server context and why predicting newsletter subscription matters (recruitment targeting, resource planning, engagement).

### Question
> **Which player characteristics and behaviours best predict whether a player subscribes to the project newsletter?**

### Data Description
- **players.csv**: one row per player (account/demographic fields).  
- **sessions.csv**: one row per session (behavioural logs such as start/end time and duration).

Potential issues: missing values, inconsistent time zones, extreme outliers in durations, class imbalance, and an observational design that does not support causal claims.


In [None]:

#| label: setup
#| message: false
suppressPackageStartupMessages({
  library(readr)
  library(dplyr)
  library(tidyr)
  library(janitor)
  library(lubridate)
  library(ggplot2)
  library(stringr)
  library(forcats)
  library(pROC)      # ROC curves
  library(rpart)     # decision tree
  library(rpart.plot)
})

options(scipen = 999)
theme_set(theme_minimal())

# Data directory (edit if needed)
DATA_DIR <- "."
players_path  <- file.path(DATA_DIR, "players.csv")
sessions_path <- file.path(DATA_DIR, "sessions.csv")


### Data Loading

In [None]:

players_raw  <- read_csv(players_path, show_col_types = FALSE) |> clean_names()
sessions_raw <- read_csv(sessions_path, show_col_types = FALSE) |> clean_names()

list(
  players_dim  = c(nrow(players_raw),  ncol(players_raw)),
  sessions_dim = c(nrow(sessions_raw), ncol(sessions_raw))
)
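
The potential issues listed under Data Description (missing values, extreme durations, class imbalance) can be quantified right after loading. The following is a minimal sketch in base R, not part of the original analysis; `toy_players` and `toy_sessions` and their values are hypothetical stand-ins for the real tables, and Tukey's 1.5 × IQR fence is just one possible outlier rule.

```r
# Toy stand-ins for the real tables (hypothetical values)
toy_players <- data.frame(
  player_id = c("a", "b", "c", "d"),
  subscribed_newsletter = c(TRUE, TRUE, FALSE, NA)
)
toy_sessions <- data.frame(
  player_id = c("a", "a", "b", "c", "c"),
  duration_min = c(12, 30, NA, 5, 900)
)

# Share of missing values per column
missing_share <- colMeans(is.na(toy_sessions))

# Class balance of the response, ignoring missing labels
class_balance <- prop.table(table(toy_players$subscribed_newsletter))

# Flag durations above Q3 + 1.5 * IQR as candidate outliers
q <- quantile(toy_sessions$duration_min, c(0.25, 0.75), na.rm = TRUE)
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])
n_outliers <- sum(toy_sessions$duration_min > upper_fence, na.rm = TRUE)

missing_share
class_balance
n_outliers
```

On the real data, the same three summaries would be computed on `players_raw` and `sessions_raw`, using whatever column names those files actually contain.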


### Wrangling & Feature Engineering
We standardize the response (`subscribed_newsletter` → 0/1), aggregate session-level behaviour to per-player features, and join onto the player table to produce **one row per player**.


In [None]:

# ---- Edit column names below to match your data if different ----
# Assumptions:
# - players_raw has: player_id, subscribed_newsletter, region (optional), platform (optional)
# - sessions_raw has: player_id, start_time, end_time, duration_min (or duration_sec)

# Harmonize response to binary 0/1; any other value becomes NA and is dropped later
players <- players_raw |>
  mutate(subscribed_newsletter = case_when(
    tolower(as.character(subscribed_newsletter)) %in% c("yes","y","true","1") ~ 1L,
    tolower(as.character(subscribed_newsletter)) %in% c("no","n","false","0") ~ 0L,
    TRUE ~ NA_integer_
  ))

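# Parse times & durations in sessions
# Only one of duration_min / duration_sec may exist, so add the absent one as NA
sessions <- sessions_raw
for (col in setdiff(c("duration_min", "duration_sec"), names(sessions))) {
  sessions[[col]] <- NA_real_
}

sessions <- sessions |>
  mutate(
    start_time = suppressWarnings(ymd_hms(start_time, quiet = TRUE)),
    end_time   = suppressWarnings(ymd_hms(end_time,   quiet = TRUE)),
    # Prefer minutes; otherwise convert seconds to minutes
    duration_min = coalesce(as.numeric(duration_min), as.numeric(duration_sec) / 60)
  ) |>
  # Drop negative durations but keep sessions with unknown duration
  filter(is.na(duration_min) | duration_min >= 0)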

# Aggregate session-level features per player
sessions_agg <- sessions |>
  group_by(player_id) |>
  summarize(
    n_sessions = n(),
    total_duration_min = sum(duration_min, na.rm = TRUE),
    median_duration_min = median(duration_min, na.rm = TRUE),
    avg_duration_min = mean(duration_min, na.rm = TRUE),
    first_seen = suppressWarnings(min(start_time, na.rm = TRUE)),
    last_seen  = suppressWarnings(max(end_time,   na.rm = TRUE))
  )

# Join onto players
player_df <- players |>
  left_join(sessions_agg, by = "player_id") |>
  # Drop rows missing the response
  filter(!is.na(subscribed_newsletter)) |>
  # Replace NA numeric features with 0 for players with no sessions
  mutate(
    across(c(n_sessions, total_duration_min, median_duration_min, avg_duration_min),
           ~ ifelse(is.na(.x), 0, .x))
  )

# Basic sanity
summary(select(player_df, subscribed_newsletter, n_sessions, total_duration_min, median_duration_min, avg_duration_min))
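
Because the modeling table is meant to hold one row per player, it is worth asserting that the join neither duplicated nor dropped players. A minimal sketch follows, with hypothetical toy tables (`toy_players`, `toy_agg`) and base R `merge()` standing in for `left_join()` so that it runs without extra packages.

```r
# Hypothetical player table and per-player session aggregate
toy_players <- data.frame(player_id = c("a", "b", "c"),
                          subscribed_newsletter = c(1L, 0L, 1L))
toy_agg <- data.frame(player_id = c("a", "b"),
                      n_sessions = c(3L, 1L))

# all.x = TRUE keeps every player, like a left join
toy_joined <- merge(toy_players, toy_agg, by = "player_id", all.x = TRUE)

# Players without sessions get a count of 0
toy_joined$n_sessions[is.na(toy_joined$n_sessions)] <- 0L

# The join should neither duplicate nor drop players
stopifnot(
  nrow(toy_joined) == nrow(toy_players),
  anyDuplicated(toy_joined$player_id) == 0
)
toy_joined
```

The same two assertions could be applied to `player_df` against the rows of `players` that have a non-missing response.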


## Methods & Results
### Exploratory Data Analysis

In [None]:

# Figure 1: Subscription counts
p1 <- ggplot(player_df, aes(x = factor(subscribed_newsletter))) +
  geom_bar() +
  labs(
    title = "Newsletter Subscription Count",
    x = "Subscribed (0 = No, 1 = Yes)", y = "Count",
    caption = "Figure 1. Overall newsletter subscription counts."
  )
p1


In [None]:

# Figure 2: Total duration vs subscription
p2 <- ggplot(player_df, aes(x = factor(subscribed_newsletter), y = total_duration_min)) +
  geom_boxplot(outlier.alpha = 0.4) +
  labs(
    title = "Total Playtime vs Subscription",
    x = "Subscribed (0/1)", y = "Total duration (minutes)",
    caption = "Figure 2. Distribution of total playtime by subscription status."
  )
p2


In [None]:

# Figure 3: Sessions vs subscription
p3 <- ggplot(player_df, aes(x = factor(subscribed_newsletter), y = n_sessions)) +
  geom_boxplot(outlier.alpha = 0.4) +
  labs(
    title = "Number of Sessions vs Subscription",
    x = "Subscribed (0/1)", y = "Number of sessions",
    caption = "Figure 3. Session counts by subscription status."
  )
p3


In [None]:

# Figure 4 (optional): Categorical predictor vs subscription if present
cat_plot <- NULL
if ("region" %in% names(player_df)) {
  cat_plot <- ggplot(player_df, aes(x = fct_lump_n(as.factor(region), 10),
                                    fill = factor(subscribed_newsletter))) +
    geom_bar(position = "fill") +
    coord_flip() +
    labs(
      title = "Subscription Rate by Region",
      x = "Region (top 10)", y = "Proportion",
      fill = "Subscribed",
      caption = "Figure 4. Proportion subscribed by region."
    )
  print(cat_plot)
} else {
  message("No 'region' column detected; skipping Figure 4.")
}


### Modeling
We compare a baseline **logistic regression** with a **decision tree**. We use a stratified 70/30 train/test split, sampled separately within each class, so that both sets keep roughly the same subscription rate.


In [None]:

set.seed(123)

# Minimal feature set for demonstration; adjust as needed
model_df <- player_df |>
  select(subscribed_newsletter, n_sessions, total_duration_min, median_duration_min, avg_duration_min) |>
  mutate(subscribed_newsletter = factor(subscribed_newsletter, levels = c(0,1)))

# Stratified split
idx_1 <- which(model_df$subscribed_newsletter == "1")
idx_0 <- which(model_df$subscribed_newsletter == "0")

train_1 <- sample(idx_1, size = floor(0.7 * length(idx_1)))
train_0 <- sample(idx_0, size = floor(0.7 * length(idx_0)))
train_idx <- c(train_1, train_0)

train <- model_df[train_idx, ]
test  <- model_df[-train_idx, ]

list(train_n = nrow(train), test_n = nrow(test),
     prop_train_1 = mean(train$subscribed_newsletter == "1"),
     prop_test_1  = mean(test$subscribed_newsletter == "1"))
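
With an imbalanced response, test accuracy is only informative relative to a trivial classifier that always predicts the more common class. The sketch below shows that baseline on hypothetical label vectors (`toy_train_y`, `toy_test_y`) rather than on the `train` and `test` objects above.

```r
# Hypothetical labels; in the notebook these would come from train and test
toy_train_y <- factor(c("1", "1", "1", "0", "1", "0", "1"), levels = c("0", "1"))
toy_test_y  <- factor(c("1", "0", "1"), levels = c("0", "1"))

# Most frequent class in the training labels
majority_class <- names(which.max(table(toy_train_y)))

# Accuracy of always predicting that class on the test labels
baseline_acc <- mean(as.character(toy_test_y) == majority_class)

majority_class
baseline_acc
```

A model whose test accuracy does not clearly exceed this baseline has learned little beyond the class proportions.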


In [None]:

# Logistic regression
glm_fit <- glm(subscribed_newsletter ~ n_sessions + total_duration_min + median_duration_min + avg_duration_min,
               data = train, family = binomial())

summary(glm_fit)


In [None]:

# Decision tree
tree_fit <- rpart(subscribed_newsletter ~ n_sessions + total_duration_min + median_duration_min + avg_duration_min,
                  data = train, method = "class", control = rpart.control(cp = 0.001))
rpart.plot(tree_fit)
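
Setting `cp = 0.001` grows a fairly deep tree, which may overfit a small training set. One common remedy is to prune back to the complexity parameter with the lowest cross-validated error in the fitted object's `cptable`. The sketch below uses simulated data; the `toy` data frame and its coefficients are made up for illustration and are not the project data.

```r
library(rpart)

set.seed(123)
# Simulated data (hypothetical): subscription odds rise with session count
n <- 300
toy <- data.frame(n_sessions = rpois(n, 5),
                  total_duration_min = rexp(n, 1 / 60))
toy$y <- factor(rbinom(n, 1, plogis(-3 + 0.6 * toy$n_sessions)), levels = c(0, 1))

# Grow a deliberately large tree with a small cp
big_tree <- rpart(y ~ n_sessions + total_duration_min, data = toy,
                  method = "class", control = rpart.control(cp = 0.001))

# Pick the cp with the lowest cross-validated error and prune back to it
cp_tab  <- big_tree$cptable
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]
pruned  <- prune(big_tree, cp = best_cp)

# Compare tree sizes (number of nodes) before and after pruning
c(before = nrow(big_tree$frame), after = nrow(pruned$frame))
```

Applied to `tree_fit`, the pruned tree could then be compared with the unpruned one on the held-out `test` set.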


In [None]:

# Predictions and metrics
# Logistic
glm_prob <- predict(glm_fit, newdata = test, type = "response")
glm_pred <- ifelse(glm_prob >= 0.5, "1", "0")

# Tree
tree_prob <- predict(tree_fit, newdata = test, type = "prob")[, "1"]
tree_pred <- ifelse(tree_prob >= 0.5, "1", "0")

# Accuracy
acc_glm  <- mean(glm_pred == as.character(test$subscribed_newsletter))
acc_tree <- mean(tree_pred == as.character(test$subscribed_newsletter))

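# ROC AUC (fix levels and direction so pROC cannot auto-flip a weak model)
roc_glm  <- roc(response = test$subscribed_newsletter, predictor = glm_prob,
                levels = c("0", "1"), direction = "<", quiet = TRUE)
roc_tree <- roc(response = test$subscribed_newsletter, predictor = tree_prob,
                levels = c("0", "1"), direction = "<", quiet = TRUE)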

auc_glm  <- as.numeric(auc(roc_glm))
auc_tree <- as.numeric(auc(roc_tree))

tibble(
  model = c("Logistic Regression", "Decision Tree"),
  accuracy = round(c(acc_glm, acc_tree), 3),
  roc_auc  = round(c(auc_glm, auc_tree), 3)
)
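
Accuracy alone can hide how a model treats each class, so a confusion matrix with sensitivity and specificity is a useful companion to the table above. A minimal sketch on hypothetical label vectors (`truth`, `pred`), not on the notebook's actual predictions:

```r
# Hypothetical true labels and predictions on a small test set
truth <- factor(c("1", "1", "0", "1", "0", "1", "0", "1"), levels = c("0", "1"))
pred  <- factor(c("1", "1", "1", "0", "0", "1", "1", "1"), levels = c("0", "1"))

# Rows are predictions, columns are true labels
cm <- table(predicted = pred, actual = truth)

sensitivity <- cm["1", "1"] / sum(cm[, "1"])  # true positives / actual positives
specificity <- cm["0", "0"] / sum(cm[, "0"])  # true negatives / actual negatives

cm
c(sensitivity = sensitivity, specificity = specificity)
```

In the notebook, `truth` would correspond to `test$subscribed_newsletter` and `pred` to `glm_pred` or `tree_pred` converted to a factor with the same levels.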


In [None]:

# Figure 5: ROC curves
plot(roc_glm, main = "ROC Curves — Logistic vs Tree")
plot(roc_tree, col = "red", add = TRUE)
legend("bottomright", legend = c("Logistic", "Tree"), lty = 1, col = c("black","red"))
mtext("Figure 5. ROC curves comparing logistic regression and decision tree.", side = 1, line = 3)


In [None]:

# Figure 6: Logistic coefficients (odds-scale)
coef_tbl <- broom::tidy(glm_fit) |>
  filter(term != "(Intercept)") |>
  mutate(odds_ratio = exp(estimate)) |>
  select(term, odds_ratio, std.error, statistic, p.value)

p_coef <- ggplot(coef_tbl, aes(x = reorder(term, odds_ratio), y = odds_ratio)) +
  geom_point() + coord_flip() +
  labs(title = "Logistic Regression — Odds Ratios",
       x = "Predictor", y = "Odds ratio",
       caption = "Figure 6. Odds ratios for predictors (error bars omitted for brevity).")
p_coef
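
Figure 6 omits error bars, but point estimates of odds ratios are hard to interpret without their uncertainty. One way to obtain intervals is to exponentiate Wald confidence limits from `confint.default()`. The sketch below fits a logistic regression to simulated data; `toy`, `toy_fit`, and the coefficients used to generate the data are hypothetical.

```r
set.seed(123)
# Simulated data (hypothetical coefficients), standing in for `train`
n <- 200
toy <- data.frame(n_sessions = rpois(n, 5))
toy$y <- rbinom(n, 1, plogis(-1 + 0.3 * toy$n_sessions))

toy_fit <- glm(y ~ n_sessions, data = toy, family = binomial())

# Wald intervals on the log-odds scale, exponentiated to the odds-ratio scale
ci <- exp(confint.default(toy_fit))
or_tbl <- data.frame(
  term       = names(coef(toy_fit)),
  odds_ratio = unname(exp(coef(toy_fit))),
  conf_low   = ci[, 1],
  conf_high  = ci[, 2],
  row.names  = NULL
)
or_tbl
```

The `conf_low` and `conf_high` columns could then be drawn with `geom_errorbar()` on a plot like Figure 6; an interval that includes 1 indicates that the direction of the association is uncertain.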


## Discussion
**Summary of findings.** Briefly state which behaviours were most predictive of subscription and how strong the discrimination was (ROC AUC).  
**Expectations vs results.** Did heavier play correlate with subscription as expected? Any surprises?  
**Limitations.** Observational design, potential measurement gaps, non-random sample, temporal drift, and class imbalance. No causal claims.  
**Impact.** How the research team might target recruitment or allocate resources.  
**Future work.** Richer temporal features, non-linear models, calibration checks, SHAP/partial dependence for interpretability, and player clustering.


## References
Use a consistent style. Cite the project description or course notes if needed.


## Appendix

In [None]:

sessionInfo()
