# Planning Report — Predicting Usage of a Video Game Research Server
**Student:** Ansh Taparia 

**Project:** UBC Data Science Project  
**Student ID:** 32652604 

This notebook is fully reproducible.


In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(GGally)
library(ISLR)
options(repr.matrix.max.rows = 6)

In [None]:
set.seed(160)

players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")

players 
sessions

## Data Description

The dataset `players.csv` contains one row per player and describes individual player characteristics and activity levels in the game.  
Each player has information such as:
- **experience** – the player’s skill level or familiarity with the game (Amateur, Veteran, or Pro).  
- **subscribe** – whether the player subscribed to the newsletter (TRUE/FALSE), which is our **response variable**.  
- **hashedEmail** – a unique player ID.  
- **played_hours** – the total number of hours each player has spent in the game.  
- **name** – the player’s name.  
- **gender** – the player’s gender.  
- **Age** – the player’s age in years.  
 
This analysis uses only `players.csv`; `sessions.csv` is loaded but not used.  
To describe the overall player base, we compute the **mean**, **standard deviation (sd)**, and **median** of the two numerical variables, `Age` and `played_hours`.  

These summary statistics show the typical playtime and age of players, as well as how much each variable varies across players.  


In [None]:
ph <- players |>
  summarise(
    mean = round(mean(played_hours, na.rm = TRUE), 2),
    median = round(median(played_hours, na.rm = TRUE), 2),
    sd = round(sd(played_hours, na.rm = TRUE), 2)
  ) |>
  pivot_longer(everything(), names_to = "stat", values_to = "value") |>
  mutate(variable = "played_hours")

ag <- players |>
  summarise(
    mean = round(mean(Age, na.rm = TRUE), 2),
    median = round(median(Age, na.rm = TRUE), 2),
    sd = round(sd(Age, na.rm = TRUE), 2)
  ) |>
  pivot_longer(everything(), names_to = "stat", values_to = "value") |>
  mutate(variable = "Age")

players_stats <- rbind(ph, ag) |>
  arrange(variable, stat)

players_stats

## Questions

**Broad question:**  
What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

**Specific question (players.csv only):**  
Can a player’s **experience**, **Age**, **gender**, and **played_hours** predict whether they **subscribe** to the newsletter?  
How do these features differ between **experience** groups (Amateur, Veteran, Pro)?


## Exploratory Data Analysis (EDA)

The goal of this analysis is to understand how player characteristics relate to whether a player subscribes to the newsletter.  
I explore three key visualizations that connect player traits to subscription status:

1. **Subscription Rate by Experience:**  
   This shows how likely different player types (Amateur, Veteran, Pro) are to subscribe.  
   It directly answers how subscription patterns differ between experience levels.

2. **Played Hours by Subscription:**  
   This checks if players who spend more time in the game are more likely to subscribe.  
   It compares game engagement between subscribers and non-subscribers.

3. **Age vs. Played Hours, Coloured by Subscription:**  
   This explores whether both age and playtime together influence who subscribes.  
   It helps visualize any combined effects of demographics and behaviour.

These plots help identify which variables might be most predictive of subscribing before fitting a model.



In [None]:
rate_by_exp <- players |>
  group_by(experience) |>
  summarise(sub_rate = mean(subscribe, na.rm = TRUE))

# Order the bars by subscription rate so the groups are easy to compare
ggplot(rate_by_exp, aes(x = reorder(experience, sub_rate), y = sub_rate)) +
  geom_col() +
  coord_flip() +
  labs(title = "Subscription Rate by Experience",
       x = "Experience", y = "Subscription rate")


In [None]:
ggplot(players, aes(x = subscribe, y = played_hours)) +
  geom_boxplot() +
  labs(title = "Played Hours by Subscription Status",
       x = "Subscribed", y = "Played hours")

In [None]:
ggplot(players, aes(x = Age, y = played_hours, color = subscribe)) +
  geom_point(alpha = 0.6) +
  labs(title = "Age vs Played Hours (colored by Subscription)",
       x = "Age (years)", y = "Played hours")

### Insights

From these plots, several trends appear:
- **Experienced players** (Veterans and Pros) have noticeably higher subscription rates than Amateurs.  
- **Subscribers** generally spend more hours playing than non-subscribers, suggesting engagement predicts interest in the newsletter.  
- The scatterplot shows that **younger and middle-aged players** who spend more time in the game are more likely to subscribe, while less active or older players are less represented among subscribers.

These patterns are descriptive only, but they suggest that experience level and playtime are promising predictors of newsletter subscription, which the model will test formally.


## Methods and Plan

**Goal.** Predict whether a player subscribes to the newsletter (TRUE/FALSE) using traits in `players.csv`.

**Method.** Logistic regression (classification).  
**Response:** `subscribe`  
**Predictors:** `experience`, `gender`, `Age`, `played_hours`

**Why this method?**  
Logistic regression is chosen because it models a binary outcome and is easy to interpret. It estimates the probability that a player subscribes based on their characteristics, while keeping the model simple. Each coefficient describes how a one-unit change in a predictor changes the log-odds of subscribing (exponentiating the coefficient gives the multiplicative change in the odds), allowing me to identify which player characteristics have the greatest effect.
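
As a minimal sketch of how the coefficients are read, the cell below fits a logistic regression with base R's `glm()` on a small made-up data frame. The `toy` data and its values are invented purely for illustration and are not taken from `players.csv`; the planned analysis may use a different fitting interface.

```r
# Toy data, invented for illustration only (NOT the real players.csv).
toy <- data.frame(
  played_hours = c(0, 0.5, 1, 2, 3, 4, 5, 6, 8, 10),
  subscribe    = c(FALSE, FALSE, TRUE, FALSE, TRUE,
                   FALSE, TRUE, TRUE, FALSE, TRUE)
)

# Logistic regression: the log-odds of subscribing are modelled as a
# linear function of played_hours.
fit <- glm(subscribe ~ played_hours, data = toy, family = binomial)

# Coefficients are on the log-odds scale. Exponentiating them gives the
# multiplicative change in the odds for a one-hour increase in playtime.
odds_ratios <- exp(coef(fit))
odds_ratios
```

An odds ratio above 1 for `played_hours` would mean that each extra hour multiplies the odds of subscribing by that factor; a value below 1 would mean the odds shrink.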

**Key assumptions.**  
This method assumes that each observation (player) is independent, that the predictors are correctly coded, and that the relationship between the numeric variables and the log-odds of subscribing is approximately linear. It also assumes there is no extreme correlation among the predictors that could distort the coefficient estimates.

**Limitations.**  
A few challenges may affect the results. Missing values in the Age column could reduce the amount of usable data, and differences in how players report their information might add inaccuracy. The variable `played_hours` is likely right-skewed, meaning a few players have much higher values than most others, which may impact the model fit. There may also be a class imbalance if most players did not subscribe, making it harder for the model to correctly predict the smaller “subscribe” class.
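
The concerns above can be checked directly before modelling. The sketch below does this on a small hypothetical data frame (`toy_players`, with invented values) that reuses the column names from `players.csv`; the same `summarise()` call could be pointed at `players` instead.

```r
library(dplyr)

# Hypothetical stand-in for players.csv; all values are invented.
toy_players <- data.frame(
  Age          = c(17, 21, NA, 19, 25, 30, NA, 22),
  played_hours = c(0, 0.1, 0.3, 1, 2, 0, 48, 150),
  subscribe    = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)
)

checks <- toy_players |>
  summarise(
    n_missing_age   = sum(is.na(Age)),       # rows that would need imputing
    mean_hours      = mean(played_hours),    # pulled upward by heavy players
    median_hours    = median(played_hours),  # robust to the long right tail
    prop_subscribed = mean(subscribe)        # class balance of the response
  )

checks
```

A mean far above the median points to right skew in `played_hours`, and a `prop_subscribed` far from 0.5 points to class imbalance.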

**How the model will be compared and selected.**  
To evaluate model performance, I will focus on how well it distinguishes between subscribers and non-subscribers. The area under the ROC curve (AUC) measures the model’s ability to rank players correctly by their likelihood of subscribing. I will also record accuracy and F1-score as supporting metrics to capture both overall performance and the balance between precision and recall. Two versions of the logistic regression will be compared: one using raw played hours and another using a log transformation of played hours to address skewness. The version with the higher cross-validation AUC will be selected as the final model.
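
One way this comparison could be set up with `tidymodels` is sketched below. It runs on simulated data (`toy`), not on `players.csv`, and the number of folds (5), the `log(x + 1)` form of the transformation, and the helper name `cv_metrics` are assumptions made for the example rather than settled choices.

```r
library(tidymodels)

set.seed(160)

# Simulated stand-in for players.csv; values are generated, not real.
n <- 200
hours <- rexp(n, rate = 0.2)  # right-skewed, like playtime often is
toy <- tibble(
  played_hours = hours,
  subscribe = factor(
    rbinom(n, 1, plogis(-0.5 + 0.6 * log1p(hours))) == 1,
    levels = c(TRUE, FALSE)   # first level is treated as the event
  )
)

spec <- logistic_reg() |> set_engine("glm")

# Two candidate preprocessing recipes: raw hours vs. log(hours + 1).
rec_raw <- recipe(subscribe ~ played_hours, data = toy)
rec_log <- rec_raw |> step_log(played_hours, offset = 1)

folds   <- vfold_cv(toy, v = 5, strata = subscribe)
metrics <- metric_set(roc_auc, accuracy, f_meas)

# Hypothetical helper: cross-validate one recipe and return mean metrics.
cv_metrics <- function(rec) {
  workflow() |>
    add_recipe(rec) |>
    add_model(spec) |>
    fit_resamples(resamples = folds, metrics = metrics) |>
    collect_metrics()
}

compare <- bind_rows(
  raw = cv_metrics(rec_raw),
  log = cv_metrics(rec_log),
  .id = "model"
)

# The candidate with the higher mean roc_auc would be selected.
compare |>
  filter(.metric == "roc_auc") |>
  arrange(desc(mean))
```

Both candidates are evaluated on the same folds, so any difference in AUC comes from the transformation rather than from how the data were split.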

**How the data will be processed to apply the model.**  
Before fitting the model, I will prepare the data by converting categorical variables such as experience and gender into factor types and by imputing missing age values with the median. To prevent data
