In [1]:
library(tidyverse)

── [1mAttaching core tidyverse packages[22m ──────────────────────── tidyverse 2.0.0 ──
[32m✔[39m [34mdplyr    [39m 1.1.4     [32m✔[39m [34mreadr    [39m 2.1.5
[32m✔[39m [34mforcats  [39m 1.0.0     [32m✔[39m [34mstringr  [39m 1.5.1
[32m✔[39m [34mggplot2  [39m 3.5.1     [32m✔[39m [34mtibble   [39m 3.2.1
[32m✔[39m [34mlubridate[39m 1.9.3     [32m✔[39m [34mtidyr    [39m 1.3.1
[32m✔[39m [34mpurrr    [39m 1.0.2     
── [1mConflicts[22m ────────────────────────────────────────── tidyverse_conflicts() ──
[31m✖[39m [34mdplyr[39m::[32mfilter()[39m masks [34mstats[39m::filter()
[31m✖[39m [34mdplyr[39m::[32mlag()[39m    masks [34mstats[39m::lag()
[36mℹ[39m Use the conflicted package ([3m[34m<http://conflicted.r-lib.org/>[39m[23m) to force all conflicts to become errors


In this part, a regression model was performed to predict whether a player subscribes to the news letter based on demographic and behavioral features. Using cleaned and merged data, we selected key predictors including age, total played hours, number of sessions, average session duration, and total play duration. Moreover, the dataset was split into training (70%) and testing (30%) sets. 

The results show that session-based features, especially playtime and engagement, are a valuable metric to distinguish subscribers from non-subscribers. Therefore, this model provides a solid baseline for understanding how player behavior correlates with subscription likelihood. 

In [4]:
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

sessions <- sessions |>
  mutate(start_time = dmy_hm(start_time),
    end_time = dmy_hm(end_time),
    session_duration_minutes = as.numeric(difftime(end_time, start_time, units = "mins")))

session_summary <- sessions |>
  group_by(hashedEmail) |>
  summarise(total_sessions = n(),
    avg_session_duration = mean(session_duration_minutes, na.rm = TRUE),
    total_play_duration = sum(session_duration_minutes, na.rm = TRUE))

player_data <- players |>
  left_join(session_summary, by = "hashedEmail") |>
  mutate(total_sessions = replace_na(total_sessions, 0),
    avg_session_duration = replace_na(avg_session_duration, 0),
    total_play_duration = replace_na(total_play_duration, 0),
    Age = if_else(is.na(Age), median(Age, na.rm = TRUE), Age),
    subscribe = as.factor(subscribe),
    experience = as.factor(experience),
    gender = as.factor(gender))

model_data <- player_data |>
  select(subscribe, Age, played_hours, total_sessions, avg_session_duration, total_play_duration)

set.seed(123)
split <- createDataPartition(model_data$subscribe, p = 0.7, list = FALSE)
train_data <- model_data[split, ]
test_data <- model_data[-split, ]

model <- train(subscribe ~ .,
               data = train_data,
               method = "glm",
               family = "binomial",
               trControl = trainControl(method = "none"))

predictions <- predict(model, newdata = test_data)
confusionMatrix(predictions, test_data$subscribe)

results <- test_data |>
  select(subscribe) |>
  mutate(predicted = predictions)

head(results)

[1mRows: [22m[34m196[39m [1mColumns: [22m[34m7[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (4): experience, hashedEmail, name, gender
[32mdbl[39m (2): played_hours, Age
[33mlgl[39m (1): subscribe

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.
[1mRows: [22m[34m1535[39m [1mColumns: [22m[34m5[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (3): hashedEmail, start_time, end_time
[32mdbl[39m (2): original_start_time, original_end_time

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


ERROR: Error in createDataPartition(model_data$subscribe, p = 0.7, list = FALSE): could not find function "createDataPartition"
