# Predicting Game Newsletter Subscription Using Player Characteristics and Behavior 

## Introduction
In many industries, understanding the target audience and its behaviour is crucial for creating and adapting a product or service that benefits many people. Game publishers and developers in particular need a thorough understanding of player behaviour: it lets them market and advertise their products and services effectively, and it can also improve player engagement and support customized experiences. A subscription-based newsletter is one of the ways publishers and developers can interact with players outside the game itself. This project explores which in-game behaviours and player attributes predict a player's decision to sign up for a newsletter.


## Question

To investigate which in-game behaviours and player attributes predict a player's decision to subscribe to a newsletter, this project aims to answer the following question:
Can a player's total playtime, session frequency and age predict whether they will subscribe to a game-related newsletter? 


## Data Description 


To answer the predictive question above, this project extracts and modifies data from two datasets: players.csv and sessions.csv.

### players.csv
players.csv mainly contains information regarding player demographics and subscription status.

### sessions.csv
sessions.csv logs gameplay sessions. Each row records the time-stamped activity of one session.




### Variables from players.csv

| Variable       | Type | Description                                                   |
|----------------|------|---------------------------------------------------------------|
| hashedEmail    | chr  | Anonymized player ID                                          |
| subscribe      | lgl  | Subscription status (TRUE/FALSE); recoded to 1/0 for modeling |
| Age            | dbl  | Player age                                                    |
| played_hours   | dbl  | Deprecated metric, not used in model                          |
| experience     | chr  | Player experience level (e.g., Pro, Veteran, Amateur)         |
| name           | chr  | Player name                                                   |
| gender         | chr  | Player gender                                                 |


### Variables from sessions.csv


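| Variable            | Type | Description                                                                    |
|---------------------|------|--------------------------------------------------------------------------------|
| hashedEmail         | chr  | Anonymized player ID used for joining datasets                                 |
| start_time          | chr  | Session start timestamp (day/month/year hour:minute)                           |
| end_time            | chr  | Session end timestamp (day/month/year hour:minute)                             |
| original_start_time | dbl  | Numeric start timestamp from the raw file (not used in the analysis)           |
| original_end_time   | dbl  | Numeric end timestamp from the raw file (not used in the analysis)             |
| session_length      | dbl  | Session length in minutes, computed from start_time and end_time (not in the raw file) |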


### Computed Variables 

| Variable        | Type    | Description                                    |
|-----------------|---------|------------------------------------------------|
| total_playtime  | dbl     | Sum of all session lengths for a player        |
| session_count   | int     | Number of valid sessions per player            |
| age_group       | factor  | Categorized age ranges (0–9, 10–19, etc.)      |



### Variables Used in Analysis 

| Variable        | Type    | Description                                      |
|-----------------|---------|--------------------------------------------------|
| subscribe       | factor  | Target variable: whether player subscribed       |
| Age             | dbl     | Player age                                       |
| total_playtime  | dbl     | Total playtime in minutes                        |
| session_count   | int     | Number of gameplay sessions                      |
| age_group       | factor  | Age bucket used for grouping and plots           |





## Methods and Results

The first step is to load the libraries used in the notebook. Many functions used routinely in R, such as ggplot(), read_csv() and dplyr's filter(), are not built into base R. They come from external packages, which must be loaded with library(package_name) before any of their functions can be used. The cell below loads the packages required for the analysis that follows.

In [1]:
library(tidyverse)
library(lubridate)
library(tidymodels)
library(janitor)
library(tibble)
library(knitr)

options(repr.matrix.max.rows = 6)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

Before any analysis or data wrangling, the raw data files were uploaded to a GitHub repository. To keep the analysis portable and reproducible, the data are read directly from their raw file URLs, so anyone with the notebook can rerun the analysis from scratch without local files.

In [2]:
players <- read_csv("https://raw.githubusercontent.com/Zohranikjo/Files-/refs/heads/main/players.csv")
players 
sessions <- read_csv("https://raw.githubusercontent.com/Zohranikjo/Files-/refs/heads/main/sessions.csv") 
sessions

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,TRUE,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,TRUE,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,FALSE,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
⋮,⋮,⋮,⋮,⋮,⋮,⋮
Amateur,FALSE,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db299bd4fedb06a46ad5bb,0.0,Dylan,Prefer not to say,17
Amateur,FALSE,f19e136ddde68f365afc860c725ccff54307dedd13968e896a9f890c40aea436,2.3,Harlow,Male,17
Pro,TRUE,d9473710057f7d42f36570f0be83817a4eea614029ff90cf50d8889cdd729d11,0.2,Ahmed,Other,


Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


hashedEmail,start_time,end_time,original_start_time,original_end_time
<chr>,<chr>,<chr>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,30/06/2024 18:12,30/06/2024 18:24,1.71977e+12,1.71977e+12
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,17/06/2024 23:33,17/06/2024 23:46,1.71867e+12,1.71867e+12
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,25/07/2024 17:34,25/07/2024 17:57,1.72193e+12,1.72193e+12
⋮,⋮,⋮,⋮,⋮
fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33cbb5e894a3867ca44d,28/07/2024 15:36,28/07/2024 15:57,1.72218e+12,1.72218e+12
fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33cbb5e894a3867ca44d,25/07/2024 06:15,25/07/2024 06:22,1.72189e+12,1.72189e+12
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,20/05/2024 02:26,20/05/2024 02:45,1.71617e+12,1.71617e+12


After importing the data, I wrangled the sessions.csv dataset. The timestamps are stored as day-first strings without seconds (for example, 30/06/2024 18:12), so dmy_hm() is used to convert them into datetime objects. Each session's duration is first computed in seconds and then converted to minutes, a more readable unit for gameplay length. Sessions with a missing or non-positive duration are dropped.

In [None]:
# 1. Calculate session length safely
sessions <- sessions |> 
  mutate(
    # Timestamps look like "30/06/2024 18:12": day/month/year, no seconds
    start_time = dmy_hm(start_time),
    end_time = dmy_hm(end_time),
    session_seconds = as.numeric(difftime(end_time, start_time, units = "secs")),
    session_length = session_seconds / 60  # minutes
  ) |>
  filter(session_length > 0) |>  # drop zero, negative, and missing (NA) durations
  select(-session_seconds) |> 
  relocate(session_length, .after = end_time)
 
sessions 
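
As a quick check of the duration calculation, here is a minimal sketch on two start/end pairs copied from the sessions.csv preview above. It assumes the strings are day-first and parsed with lubridate's dmy_hm(). The `toy_*` names are illustrative only and are not part of the analysis.

```r
library(lubridate)

# Two timestamp pairs in the same day/month/year hour:minute layout as sessions.csv
toy_start <- c("30/06/2024 18:12", "17/06/2024 23:33")
toy_end   <- c("30/06/2024 18:24", "17/06/2024 23:46")

# dmy_hm() reads day-first strings that have hours and minutes but no seconds
start_parsed <- dmy_hm(toy_start)
end_parsed   <- dmy_hm(toy_end)

# Session length in minutes
toy_minutes <- as.numeric(difftime(end_parsed, start_parsed, units = "mins"))
toy_minutes
```

Checking a handful of parsed values against the raw strings like this is a cheap way to confirm that the parser matches the file's date layout before aggregating.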

This information was then aggregated to calculate each player's total playtime and number of sessions.

In [None]:
# 2. Summarize playtime + sessions per player
session_summary <- sessions |> 
  group_by(hashedEmail) |> 
  summarize(
    total_playtime = sum(session_length, na.rm = TRUE),
    session_count = n()
  )
session_summary 

The session summary was then joined with players.csv to give a combined dataset.

In [None]:
# 3. Join with players dataset
player_data <- players |>
  left_join(session_summary, by = "hashedEmail")
player_data
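
The aggregation and join can be sketched on hypothetical toy tables. Here `id` stands in for hashedEmail, and all values are made up for illustration. The point of the sketch is that left_join() keeps players who have no sessions and fills their summary columns with NA.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: player "c" has no recorded sessions
toy_players  <- tibble(id = c("a", "b", "c"), Age = c(17, 21, 30))
toy_sessions <- tibble(id = c("a", "a", "b"), session_length = c(12, 13, 23))

# One row per player: total minutes played and number of sessions
toy_summary <- toy_sessions |>
  group_by(id) |>
  summarize(total_playtime = sum(session_length), session_count = n())

# left_join() keeps every player; players without sessions get NA summaries
toy_joined <- toy_players |> left_join(toy_summary, by = "id")
toy_joined
```

Any later drop_na() step will therefore remove players like "c", which is worth keeping in mind when reading the final sample size.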

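Following the merge, the relevant columns were selected and cleaned. Rows with missing values were removed, which includes players with no recorded sessions, whose session summaries are NA after the left join. Players with no playtime were removed as well.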

In [None]:
# 4. Prepare modeling data
model_data <- player_data |> 
  select(subscribe, Age, total_playtime, session_count) |> 
  drop_na() |>   
  filter(total_playtime > 0)
model_data 

In [None]:
# 5. Create age groups
model_data <- model_data |> 
  mutate(age_group = cut(
    Age,
    breaks = c(0, 9, 19, 29, 39, 49, 59, Inf),
    labels = c("0–9", "10–19", "20–29", "30–39", "40–49", "50–59", "60+"),
    right = TRUE, include.lowest = TRUE
  ))
model_data
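
To see how the break points assign ages to buckets, here is a small base R sketch using the same `breaks` and `labels` on a few hypothetical ages chosen to sit on the bucket edges.

```r
# Hypothetical ages, chosen to sit on or near the bucket boundaries
toy_age <- c(9, 10, 17, 19, 20, 65)

toy_group <- cut(
  toy_age,
  breaks = c(0, 9, 19, 29, 39, 49, 59, Inf),
  labels = c("0–9", "10–19", "20–29", "30–39", "40–49", "50–59", "60+"),
  right = TRUE, include.lowest = TRUE
)

# Each interval is closed on the right, so 9 falls in "0–9" and 19 in "10–19"
data.frame(toy_age, toy_group)
```

The labels line up with the intervals for whole-number ages. A fractional age such as 9.5 would land in the "10–19" bucket, since that interval covers everything above 9 up to and including 19.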

In [None]:
playtime_summary <- model_data |>
  mutate(age_group = factor(age_group, levels = c("0–9", "10–19", "20–29", "30–39", "40–49", "50–59", "60+"))) |> 
  group_by(age_group, subscribe, .drop = FALSE) |>
  summarize(
    mean_playtime = mean(total_playtime, na.rm = TRUE),
    .groups = "drop"
  )
playtime_summary

playtime_summary <- playtime_summary |>
  replace_na(list(mean_playtime = 0))

playtime_summary

In [None]:
# 1. Subscription count bar plot (grouped by age group)
plot_subscription_count_by_age_group <- ggplot(model_data, aes(x = age_group, fill = subscribe)) +
  geom_bar(position = "dodge") +
  labs(
    title = "Subscription Count by Age Group",
    x = "Age Group",
    y = "Number of Players",
    fill = "Subscribed"
  ) +
  theme_minimal()

plot_subscription_count_by_age_group

# 2. Average total playtime (minutes) by age group and subscription status
plot_grouped_bar_playtime <- ggplot(playtime_summary, aes(x = age_group, y = mean_playtime, fill = subscribe)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_y_continuous(labels = scales::comma_format(accuracy = 1)) +
  labs(
    title = "Average Total Playtime (in Minutes) by Age Group and Subscription Status",
    x = "Age Group",
    y = "Average Total Playtime (minutes)",
    fill = "Subscribed"
  ) +
  theme_minimal()


plot_grouped_bar_playtime

# 3. Age distribution density plot (uses Age column + labeled object)
plot_age_distribution <- ggplot(model_data, aes(x = Age, color = factor(subscribe))) +
  geom_density() +
  labs(
    title = "Age Distribution by Subscription",
    x = "Age",
    color = "Subscribed"
  ) +
  theme_minimal()

plot_age_distribution


In [None]:
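model_data <- model_data |> 
  mutate(subscribe = case_when(
    subscribe == TRUE ~ "1",
    subscribe == FALSE ~ "0",
    TRUE ~ NA_character_
  )) |> 
  filter(!is.na(subscribe)) |>
  # Classification in tidymodels expects a factor outcome
  mutate(subscribe = factor(subscribe, levels = c("0", "1")))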

model_data

In [None]:

set.seed(123)
split <- initial_split(model_data, strata = subscribe)
train <- training(split)
test <- testing(split)



In [None]:
knn_recipe <- recipe(subscribe ~ Age + total_playtime + session_count, data = train) |> 
  step_center(all_predictors()) |> step_scale(all_predictors()) 
knn_recipe
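
Centering and scaling matter for KNN because the method relies on distances between players. The base R sketch below uses hypothetical values to show how a predictor measured in minutes can dominate Euclidean distance until every column is standardized. scale() is used here as a stand-in for what step_center() and step_scale() do to the training data.

```r
# Hypothetical toy predictors on very different scales
toy <- data.frame(
  Age = c(17, 45, 18),
  total_playtime = c(100, 120, 900)
)

# Raw Euclidean distances from player 1: playtime dominates
raw_dist <- as.matrix(dist(toy))[1, ]

# Standardize each column to mean 0 and standard deviation 1
toy_scaled <- scale(toy)
scaled_dist <- as.matrix(dist(toy_scaled))[1, ]

raw_dist
scaled_dist
```

On the raw scale, player 3 looks far more distant from player 1 than player 2 does, even though players 1 and 3 are almost the same age. After standardizing, the age gap and the playtime gap contribute on comparable terms.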



In [None]:
knn_spec <- nearest_neighbor(
  mode = "classification",
  neighbors = tune(),         # we'll tune 'k'
  weight_func = "rectangular" # default: equal weighting
) |> 
  set_engine("kknn")
knn_spec
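
The idea behind "rectangular" weighting is that each of the k nearest neighbours gets one equal vote. The sketch below is a hypothetical base R helper (`knn_vote`, not part of kknn or tidymodels) that illustrates the voting rule on made-up points. It is not how the kknn engine is implemented.

```r
# Hypothetical helper: classify one new point by unweighted majority vote
knn_vote <- function(train_x, train_y, new_x, k) {
  # Euclidean distance from new_x to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Indices of the k closest training rows
  nearest <- order(d)[seq_len(k)]
  # Count votes per class and return the most common label
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Made-up training points: three of class "0" near the origin, two of class "1"
train_x <- matrix(c(0, 0,
                    0, 1,
                    1, 0,
                    5, 5,
                    5, 6), ncol = 2, byrow = TRUE)
train_y <- factor(c("0", "0", "0", "1", "1"))

pred_k1 <- knn_vote(train_x, train_y, new_x = c(4.5, 5), k = 1)
pred_k5 <- knn_vote(train_x, train_y, new_x = c(4.5, 5), k = 5)
c(k1 = pred_k1, k5 = pred_k5)
```

With k = 1 the new point takes the label of its single closest neighbour. With k = 5 every training point votes, so the majority class wins regardless of distance. This sensitivity to k is the reason the number of neighbours is tuned rather than fixed.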


In [None]:
knn_wf <- workflow() |> 
  add_recipe(knn_recipe) |> 
  add_model(knn_spec)
knn_wf 



In [None]:
set.seed(999)
folds <- vfold_cv(train, v = 5, strata = subscribe)

k_vals <- tibble(neighbors = seq(1, 25, by = 2))  # try k = 1, 3, ..., 25

knn_results <- tune_grid(
  knn_wf,
  resamples = folds,
  grid = k_vals,
  metrics = metric_set(accuracy, roc_auc)
)



In [None]:
knn_plot <- knn_results |> 
  collect_metrics() |> 
  filter(.metric == "accuracy") |> 
  ggplot(aes(x = neighbors, y = mean)) +
  geom_line() + geom_point() +
  labs(title = "KNN Accuracy vs Number of Neighbors", x = "k (neighbors)", y = "Accuracy")

knn_plot


In [None]:
# Name the metric explicitly; newer versions of tune expect `metric =`
best_k <- knn_results |> 
  select_best(metric = "accuracy")

final_knn_spec <- nearest_neighbor(
  mode = "classification",
  neighbors = best_k$neighbors,
  weight_func = "rectangular"
) |> 
  set_engine("kknn")

final_wf <- workflow() |> 
  add_recipe(knn_recipe) |> 
  add_model(final_knn_spec)

final_fit <- final_wf |> last_fit(split)

final_fit |> collect_metrics()



In [None]:
final_fit |> 
  collect_predictions() |> 
  ggplot(aes(x = .pred_class, fill = subscribe)) +
  geom_bar(position = "dodge") +
  labs(
    title = "Predicted Class Breakdown by Actual Subscription (KNN)",
    x = "Predicted Class", 
    y = "Count", 
    fill = "Actual Subscribe"
  )


In [None]:
# Confusion matrix
final_fit |> collect_predictions() |> 
  conf_mat(truth = subscribe, estimate = .pred_class)

# Probability distribution
final_fit |> collect_predictions() |> 
  ggplot(aes(x = .pred_1, fill = factor(subscribe))) +
  geom_histogram(position = "identity", bins = 30, alpha = 0.6) +
  labs(title = "Predicted Subscription Probabilities (KNN)", x = "Predicted Probability", fill = "Actual Subscribe")
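
To read a confusion matrix like the one above, it helps to derive the summary metrics by hand. The base R sketch below uses hypothetical truth and prediction vectors coded "0"/"1" as in the notebook, and it treats "1" (subscribed) as the positive class, which is an assumption made for this illustration.

```r
# Hypothetical truth and predictions, coded like the notebook ("0"/"1")
truth <- factor(c("1", "1", "1", "0", "0", "1", "0", "1"), levels = c("0", "1"))
pred  <- factor(c("1", "1", "0", "0", "1", "1", "0", "1"), levels = c("0", "1"))

# Rows are predictions, columns are the true classes
cm <- table(Prediction = pred, Truth = truth)
cm

accuracy  <- sum(diag(cm)) / sum(cm)          # share of all predictions that are correct
precision <- cm["1", "1"] / sum(cm["1", ])    # of predicted subscribers, share correct
recall    <- cm["1", "1"] / sum(cm[, "1"])    # of actual subscribers, share found
c(accuracy = accuracy, precision = precision, recall = recall)
```

Note that yardstick treats the first factor level as the event by default, so with levels "0" and "1" its precision() and recall() would need `event_level = "second"` to match the convention used in this sketch.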

