In [None]:
library(readr)
library(repr)
library(tidyverse)
library(tidymodels)
options(repr.matrix.max.rows = 10)

### Data Description

This project uses data collected from a research Minecraft server operated by UBC Computer Science. The data consist of two tables: `players.csv` and `sessions.csv`.

- `players.csv`: one row per unique player  
- `sessions.csv`: one row per play session

#### Players Dataset

The `players.csv` dataset includes **196 players** and **7 variables**:

| Variable | Description | Data type |
|---|---|---|
| `experience` | Self-reported Minecraft skill level (Beginner/Amateur/Regular/Veteran/Pro) | chr |
| `subscribe` | Whether the player subscribed to the research newsletter (TRUE/FALSE) | lgl |
| `hashedEmail` | Anonymous player identifier | chr |
| `played_hours` | Total hours played on the server | dbl |
| `name` | Player name (not used for modeling) | chr |
| `gender` | Player gender | chr |
| `Age` | Player-reported age | dbl |

#### Sessions Dataset

The `sessions.csv` dataset contains **1535 sessions** and **5 variables**. Each row represents one play session.

| Variable | Description | Data type |
|---|---|---|
| `hashedEmail` | Anonymous player ID | chr |
| `start_time` | Session start time | chr |
| `end_time` | Session end time | chr |
| `original_start_time` | Start time (Unix timestamp) | dbl |
| `original_end_time` | End time (Unix timestamp) | dbl |

The `hashedEmail` column links the two datasets. In this planning stage, analysis focuses on the player-level data from `players.csv`, while session-level timing data will be considered in the final project.

In [None]:
player <- read_csv("data/players.csv")
session <- read_csv("data/sessions.csv")
player
session

### Question

Can a player's total play time, experience level, and age predict whether they subscribe to the research newsletter?

#### How the data helps

The `players.csv` dataset contains a binary `subscribe` variable indicating whether each player signed up for the research newsletter. The variables `played_hours`, `experience`, and `Age` are available as predictors. This makes it possible to frame the question as a binary classification problem at the player level.
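As a minimal sketch of that framing, the outcome and the categorical predictor can be converted to factors before modeling. The tibble `toy_players` below is a made-up stand-in for `players.csv` (its values are invented for illustration), and the name `player_model` is hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for players.csv; these values are invented.
toy_players <- tibble(
  experience   = c("Pro", "Veteran", "Amateur", "Regular", "Beginner", "Amateur"),
  subscribe    = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE),
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 1.5),
  Age          = c(9, 17, 17, 21, 21, NA)
)

# Keep only the outcome and the three predictors, and encode
# the outcome and experience level as factors for classification.
player_model <- toy_players |>
  select(subscribe, played_hours, experience, Age) |>
  mutate(
    subscribe  = factor(subscribe, levels = c("TRUE", "FALSE")),
    experience = factor(
      experience,
      levels = c("Beginner", "Amateur", "Regular", "Veteran", "Pro")
    )
  )

player_model
```

With the real data, `toy_players` would be replaced by the `player` data frame loaded above.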

In [None]:
player_means <- player |>
 summarise(across(where(is.numeric), ~ round(mean(.x, na.rm = TRUE), 2)))

player_means

In the `players.csv` dataset, the mean total play time is approximately 5.85 hours and the mean age is about 21.14 years.
A low mean play time suggests that typical engagement with the server is modest, and the mean age suggests that players skew young, although a mean alone does not describe the shape of either distribution.
Play time is likely right-skewed, with many players logging little time and a few logging much more; the histogram of `played_hours` checks this.

In [None]:
library(ggplot2)

ggplot(player, aes(x = played_hours)) +
  geom_histogram(bins = 15) +
  labs(
    title = "Distribution of Total Played Hours",
    x = "Total Played Hours",
    y = "Count of Players"
  )

Most players spent very little time on the server, while a small number of players accumulated many hours. The distribution is heavily right-skewed, so player engagement varies widely, with a long tail of highly active players. This matters for later modeling, because `played_hours` may need a transformation (for example, a log transform) or scaling.
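One possible way to reduce the skew is a `log(x + 1)` transform, which stays defined when play time is zero. The sketch below uses an invented vector of hours (`toy_hours` is hypothetical, not taken from `players.csv`).

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Invented play-time values with a long right tail, for illustration only.
toy_hours <- tibble(played_hours = c(0, 0, 0.1, 0.2, 0.5, 1, 2, 10, 50, 200))

# log1p(x) computes log(1 + x), so zero hours maps to zero.
toy_hours <- toy_hours |>
  mutate(log_hours = log1p(played_hours))

# Histogram of the transformed values; the long tail is compressed.
log_hours_plot <- ggplot(toy_hours, aes(x = log_hours)) +
  geom_histogram(bins = 15) +
  labs(
    title = "Distribution of log(1 + Played Hours)",
    x = "log(1 + Total Played Hours)",
    y = "Count of Players"
  )

log_hours_plot
```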

In [None]:
ggplot(player, aes(x = subscribe)) +
  geom_bar() +
  labs(
    title = "Count of Players by Subscription Status",
    x = "Subscribed to Newsletter",
    y = "Number of Players" )

Most players did not subscribe to the newsletter. This indicates that subscription is relatively rare in the dataset, which may result in class imbalance during modeling.
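The balance between the two classes can also be checked numerically. This is a small sketch on an invented `subscribe` column (`toy_subs` is hypothetical); with the real data, `player` would take its place.

```r
library(dplyr)
library(tibble)

# Invented subscription flags, for illustration only.
toy_subs <- tibble(
  subscribe = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)
)

# Count players per class and convert the counts to proportions.
class_counts <- toy_subs |>
  count(subscribe) |>
  mutate(prop = n / sum(n))

class_counts
```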

### Methods and Plan

For this project, I will build a predictive model to determine whether a player will subscribe to the research newsletter based on their gameplay behaviour and demographics.

Since the outcome variable `subscribe` is binary (TRUE/FALSE), I plan to use **logistic regression** as the primary modeling method. Logistic regression is appropriate for binary classification problems and is easy to interpret, as its coefficients show how each predictor is associated with the probability of subscribing.

I will use `played_hours`, `experience`, and `Age` as predictors. Before modeling, I will check for skewed variables and consider transformations if necessary (e.g., log-transforming heavily skewed play-time values). I will also check each predictor for missing values and either drop or impute the affected rows.
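The missing-value check might look like the following sketch. The tibble `toy_players` is invented for illustration, and dropping incomplete rows is only one possible choice; imputation would be an alternative.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for players.csv; these values are invented.
toy_players <- tibble(
  experience   = c("Pro", "Veteran", "Amateur", "Regular", "Beginner", "Amateur"),
  subscribe    = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE),
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 1.5),
  Age          = c(9, 17, 17, 21, 21, NA)
)

# Number of missing values in each column.
missing_counts <- toy_players |>
  summarise(across(everything(), ~ sum(is.na(.x))))

# Assumption: rows with a missing predictor are dropped rather than imputed.
player_clean <- toy_players |>
  drop_na(played_hours, experience, Age)

missing_counts
player_clean
```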

The data will be split into **training and testing sets (80/20)** to evaluate the model's performance. Accuracy and a confusion matrix computed on the testing set will be used as evaluation metrics.
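The plan above could be sketched with `tidymodels` roughly as follows. The data here are simulated (`toy_players` and every value in it are assumptions, not the real `players.csv`), the seed is arbitrary, and the `log(x + 1)` step and the stratified split are illustrative choices rather than final decisions.

```r
library(tidymodels)

set.seed(2024)
n <- 200

# Simulated stand-in for players.csv; all values are invented.
toy_players <- tibble(
  played_hours = rexp(n, rate = 0.2),
  Age          = sample(10:50, n, replace = TRUE),
  experience   = factor(sample(
    c("Beginner", "Amateur", "Regular", "Veteran", "Pro"), n, replace = TRUE
  ))
) |>
  mutate(
    subscribe = factor(
      rbinom(n, 1, plogis(-0.5 + 0.3 * log1p(played_hours))) == 1,
      levels = c("TRUE", "FALSE")
    )
  )

# 80/20 split, stratified on the outcome to keep class proportions similar.
player_split <- initial_split(toy_players, prop = 0.8, strata = subscribe)
player_train <- training(player_split)
player_test  <- testing(player_split)

# Preprocessing: log(1 + x) for play time, dummy variables for experience.
player_recipe <- recipe(
  subscribe ~ played_hours + experience + Age,
  data = player_train
) |>
  step_log(played_hours, offset = 1) |>
  step_dummy(experience)

# Logistic regression fitted with the base R glm engine.
logit_spec <- logistic_reg() |>
  set_engine("glm")

logit_fit <- workflow() |>
  add_recipe(player_recipe) |>
  add_model(logit_spec) |>
  fit(data = player_train)

# Predict on the held-out testing set and evaluate.
player_preds <- predict(logit_fit, player_test) |>
  bind_cols(player_test)

test_accuracy <- accuracy(player_preds, truth = subscribe, estimate = .pred_class)
test_conf_mat <- conf_mat(player_preds, truth = subscribe, estimate = .pred_class)

test_accuracy
test_conf_mat
```

Because the simulated outcome depends only weakly on play time, the accuracy from this sketch says nothing about how the model will perform on the real data.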

