# DSCI 100 Project 2025- 

## Introduction

### Background

Understanding player engagement is an important component of managing and expanding an online gaming community. The UBC research group is collecting detailed behavioural data from a Minecraft server, where every player action is recorded across individual play sessions. The dataset includes both player-level characteristics and session-level behavioural metrics such as total play time, number of sessions, and number of in-game events. Predicting which players are likely to subscribe to a game-related newsletter can help the research team better target recruitment efforts, allocate server resources, and design more effective communication strategies. Since newsletter subscription is a binary outcome, this problem represents a typical predictive classification task, where behavioural variables may serve as meaningful indicators of player interest and long-term engagement.


### Data Description

The project uses two datasets: players.csv and sessions.csv.

+) players.csv: Contains one row per unique player. Variables include demographic information and overall player attributes such as age, country, device type, and whether the player subscribed to the newsletter.
| Variable    | Type        | Meaning                          | Notes                |
| ----------- | ----------- | -------------------------------- | -------------------- |
| age         | numeric     | Player age                       | Some missing values  |
| country     | categorical | Country                          | Many unique values   |
| device_type | categorical | Device used                      | PC / Mobile / Tablet |
| subscribe   | categorical | Newsletter subscription (Yes/No) | -                    |
| ...         | ...         | ...                              | ...                  |


+) sessions.csv: Contains one row per play session. Each row includes the player ID, session start and end time, session duration, number of events generated during that session, and other behavioural metrics.
| Variable            | Type        | Meaning                         | Notes              |
| ------------------- | ----------- | ------------------------------- | ------------------ |
| hashedEmail         | categorical | Player ID                       | -                  |
| original_start_time | numeric     | Session start time              | Unix epoch, in ms  |
| original_end_time   | numeric     | Session end time                | Unix epoch, in ms  |
| number_of_events    | numeric     | Number of events in the session | Outliers may exist |
| ...                 | ...         | ...                             | ...                |


### Scientific Question


- Broad question: Which player behaviours are most predictive of subscribing to the game newsletter?
- Specific question: Can average session length and total number of sessions predict whether a player subscribes to the newsletter? These explanatory variables are derived from sessions.csv and merged with players.csv using hashedEmail.

### Exploratory Data Analysis and Visualization

#### 1. Load Data

In [None]:
### Run this cell before continuing.
library(tidyverse)
library(tidymodels)
library(tidyclust)
library(forcats)
library(repr)
options(repr.matrix.max.rows = 6)
set.seed(123)

In [None]:
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

players
sessions

#### 2. Minimal Data Wrangling

In [None]:
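# Timestamps are treated as milliseconds since the Unix epoch,
# so dividing the difference by 1000 * 60 * 60 gives hours.
sessions_summary <- sessions |>
  mutate(
    session_length = (original_end_time - original_start_time) / (1000 * 60 * 60)
  ) |>
  group_by(hashedEmail) |>
  summarize(
    avg_session_length = mean(session_length, na.rm = TRUE),  # mean length in hours
    total_sessions = n()                                      # number of sessions
  )
sessions_summary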
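
To make the unit conversion concrete, here is a minimal sketch on a hypothetical toy table (`toy_sessions` and its values are invented for illustration and are not taken from sessions.csv). It assumes, as the wrangling cell does, that the timestamps are in milliseconds.

```r
library(dplyr)

# Hypothetical toy data: timestamps assumed to be milliseconds since the Unix epoch.
toy_sessions <- tibble(
  hashedEmail = c("a", "a", "b"),
  original_start_time = c(0, 3600000, 0),
  original_end_time = c(1800000, 10800000, 5400000)
)

toy_summary <- toy_sessions |>
  # 1000 ms * 60 s * 60 min = number of milliseconds in one hour
  mutate(session_length = (original_end_time - original_start_time) / (1000 * 60 * 60)) |>
  group_by(hashedEmail) |>
  summarize(
    avg_session_length = mean(session_length, na.rm = TRUE),
    total_sessions = n()
  )

toy_summary
```

Player "a" has sessions of 0.5 and 2 hours (mean 1.25 hours, 2 sessions), and player "b" has one session of 1.5 hours.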

In [None]:
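# Keep every player; players with no recorded sessions get NA
# for avg_session_length and total_sessions.
df <- players |>
  left_join(sessions_summary, by = "hashedEmail") |>
  mutate(subscribe = as.factor(subscribe))
df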
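
Because a left join keeps players who never appear in the session table, those rows carry missing values. The sketch below shows one possible way to handle them, on hypothetical toy tables (`toy_players` and `toy_summary` are invented names and values). Treating a missing session count as zero is an assumption about the data, not something the project has settled; dropping those players instead would be an equally defensible choice.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy tables; player "c" has no recorded sessions.
toy_players <- tibble(
  hashedEmail = c("a", "b", "c"),
  subscribe = c(TRUE, FALSE, TRUE)
)
toy_summary <- tibble(
  hashedEmail = c("a", "b"),
  avg_session_length = c(1.25, 1.5),
  total_sessions = c(2L, 1L)
)

toy_df <- toy_players |>
  left_join(toy_summary, by = "hashedEmail") |>
  mutate(
    subscribe = as.factor(subscribe),
    # Assumption: no matching sessions means zero sessions.
    total_sessions = replace_na(total_sessions, 0L)
  )

toy_df
```

Here `avg_session_length` stays `NA` for player "c", since an average over zero sessions is undefined.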

#### 3. Table: Mean Values of Numeric Variables



In [None]:
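# Column means of all numeric variables, ignoring missing values.
# An anonymous function is used because passing extra arguments
# through across() is deprecated in recent dplyr versions.
df_means <- df |>
  select(where(is.numeric)) |>
  summarise(across(everything(), \(x) mean(x, na.rm = TRUE)))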

df_means

+) Average total sessions per player: X

+) Average session length: Y hours

+) Other numeric variables provide basic understanding of player data

#### 4. Scatter Plot: Avg Session Length vs Total Sessions

In [None]:
ggplot(df, aes(x = avg_session_length, y = total_sessions, color = subscribe)) +
  geom_point(alpha = 0.7) +
  labs(
    x = "Average Session Length (hours)",
    y = "Total Sessions",
    color = "Subscribed",
    title = "Player Activity vs Subscription Status"
  ) +
  scale_color_manual(values = c("darkorange", "steelblue")) +
  theme_minimal(base_size = 12)

+) Subscribed players tend to have longer and more frequent sessions.

+) Some outliers with unusually long or numerous sessions exist.

#### 5. Histogram: Distribution of Total Sessions

In [None]:
ggplot(df, aes(x = total_sessions)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  labs(
    x = "Total Sessions",
    y = "Number of Players",
    title = "Distribution of Total Sessions"
  ) +
  theme_minimal(base_size = 12)

+) Most players have fewer than 50 sessions.

+) A small number of players have an unusually high number of sessions, considered outliers.
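
The notes above do not state a rule for what counts as an outlier. One common convention, used here purely as an assumption, is to flag values more than 1.5 times the interquartile range above the third quartile. A minimal sketch on hypothetical counts (`toy_counts` is invented for illustration):

```r
library(dplyr)

# Hypothetical session counts; the last value is deliberately extreme.
toy_counts <- tibble(total_sessions = c(1, 2, 2, 3, 4, 5, 150))

flagged <- toy_counts |>
  mutate(
    # Upper fence of the 1.5 * IQR rule (an assumed convention, not the project's stated rule).
    upper_fence = quantile(total_sessions, 0.75, names = FALSE) + 1.5 * IQR(total_sessions),
    is_outlier = total_sessions > upper_fence
  )

flagged
```

With these values the third quartile is 4.5 and the IQR is 2.5, so the upper fence is 8.25 and only the count of 150 is flagged.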

#### 6. Boxplot: Avg Session Length by Subscription

In [None]:
ggplot(df, aes(x = subscribe, y = avg_session_length, fill = subscribe)) +
  geom_boxplot() +
  labs(
    x = "Subscription Status",
    y = "Average Session Length (hours)",
    title = "Average Session Length by Subscription"
  ) +
  scale_fill_manual(values = c("darkorange", "steelblue")) +
  theme_minimal(base_size = 12)

+) Subscribed players have slightly higher average session lengths than non-subscribed players.

+) Boxplots help identify outliers and differences between groups.


### Methods and Plan

To predict whether a player subscribes to the newsletter, we propose comparing two classifiers: Logistic Regression and K-Nearest Neighbors (KNN) Classification. Logistic Regression is appropriate because the response variable is binary (Yes/No); it models the probability of subscription as a function of explanatory variables such as average session length and total sessions. KNN is an alternative that classifies each player according to the most similar players in the training data. Because it relies on distances between numeric predictors, the predictors must be scaled; its performance depends on the number of neighbors (k) and can be sensitive to outliers.

The dataset will be split into 75% training and 25% testing, stratified by subscription status so that both sets preserve the class proportions. Cross-validation will be used to tune the KNN hyperparameter k.

Model evaluation will include accuracy, the confusion matrix, and ROC/AUC. The ROC (Receiver Operating Characteristic) curve visualizes the trade-off between true positive and false positive rates at different classification thresholds, while the AUC (Area Under the Curve) quantifies the model's ability to distinguish between subscribing and non-subscribing players: a value of 1 indicates perfect classification and 0.5 indicates performance no better than random. Logistic Regression and KNN will be compared using these metrics to select the better-performing model.
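
The plan above could be sketched with tidymodels roughly as follows. This is an illustration under stated assumptions, not the project's final analysis: the data are simulated (`sim` and its coefficients are invented), the grid of k values and the choice of 5 folds are arbitrary, and the KNN engine requires the `kknn` package to be installed.

```r
library(tidymodels)

set.seed(123)

# Simulated stand-in data (hypothetical; NOT the real players/sessions data).
n_players <- 200
sim <- tibble(
  avg_session_length = rexp(n_players, rate = 1),
  total_sessions = rpois(n_players, lambda = 10)
) |>
  mutate(
    p = plogis(-1 + 0.8 * avg_session_length + 0.05 * total_sessions),
    # "Yes" is the first level so yardstick treats it as the event of interest.
    subscribe = factor(if_else(runif(n_players) < p, "Yes", "No"), levels = c("Yes", "No"))
  ) |>
  select(-p)

# 75/25 split, stratified by the outcome.
data_split <- initial_split(sim, prop = 0.75, strata = subscribe)
train_data <- training(data_split)
test_data <- testing(data_split)

# Shared preprocessing: centre and scale the predictors (needed for KNN).
rec <- recipe(subscribe ~ avg_session_length + total_sessions, data = train_data) |>
  step_normalize(all_predictors())

# KNN: tune k with 5-fold cross-validation on the training data.
knn_spec <- nearest_neighbor(neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")
knn_wf <- workflow() |> add_recipe(rec) |> add_model(knn_spec)
folds <- vfold_cv(train_data, v = 5, strata = subscribe)
knn_tuned <- tune_grid(
  knn_wf,
  resamples = folds,
  grid = tibble(neighbors = seq(1, 21, by = 4)),
  metrics = metric_set(roc_auc)
)
best_k <- select_best(knn_tuned, metric = "roc_auc")
knn_fit <- finalize_workflow(knn_wf, best_k) |> fit(data = train_data)

# Logistic regression with the same recipe.
log_fit <- workflow() |>
  add_recipe(rec) |>
  add_model(logistic_reg() |> set_engine("glm")) |>
  fit(data = train_data)

# Accuracy and AUC on the held-out test data.
evaluate <- function(fitted, data) {
  preds <- augment(fitted, new_data = data)
  bind_rows(
    accuracy(preds, truth = subscribe, estimate = .pred_class),
    roc_auc(preds, truth = subscribe, .pred_Yes)
  )
}

results <- bind_rows(
  evaluate(log_fit, test_data) |> mutate(model = "logistic regression"),
  evaluate(knn_fit, test_data) |> mutate(model = "knn")
)
results

# Confusion matrix for one of the models.
conf_mat(augment(log_fit, new_data = test_data), truth = subscribe, estimate = .pred_class)
```

Using one recipe for both models keeps the comparison fair, since both see identically preprocessed predictors. On the real data, `sim` would be replaced by the joined player table after deciding how to handle players with missing session summaries.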