# Planning Report — Predicting Usage of a Video Game Research Server
**Student:** Ansh Taparia 

**Project:** UBC Data Science Project  
**Student ID:** 32652604 

This notebook is fully reproducible.


In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(GGally)
library(ISLR)
set.seed(160)
options(repr.matrix.max.rows = 6)

In [None]:
players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")

players 
sessions

## Data Description

The dataset `players.csv` contains one row per player and describes individual player characteristics and activity levels in the game. 

Each player has information such as:

| Variable       | Description                                                                 |
|----------------|------------------------------------------------------------------------------|
| **experience** | The player’s skill level or familiarity with the game (Amateur, Veteran, or Pro). |
| **subscribe**  | Whether the player subscribed to the newsletter (TRUE/FALSE); this is the response variable. |
| **hashedEmail**| A unique player ID used to identify each individual.                         |
| **played_hours** | The total number of hours each player has spent in the game.                |
| **name**       | The player’s name (not used for analysis).                                   |
| **gender**     | The player’s reported gender.                                                |
| **Age**        | The player’s age in years.                                                   |

 
I use only the `players.csv` dataset, not `sessions.csv`.  
To describe the overall player base, I summarise the two numerical variables, `Age` and `played_hours`,
by their **mean**, **median**, and **standard deviation (sd)**.

In [None]:
ph <- players |>
  summarise(
    mean = round(mean(played_hours, na.rm = TRUE), 2),
    median = round(median(played_hours, na.rm = TRUE), 2),
    sd = round(sd(played_hours, na.rm = TRUE), 2)
  ) |>
  pivot_longer(everything(), names_to = "stat", values_to = "value") |>
  mutate(variable = "played_hours")

ag <- players |>
  summarise(
    mean = round(mean(Age, na.rm = TRUE), 2),
    median = round(median(Age, na.rm = TRUE), 2),
    sd = round(sd(Age, na.rm = TRUE), 2)
  ) |>
  pivot_longer(everything(), names_to = "stat", values_to = "value") |>
  mutate(variable = "Age")

players_stats <- rbind(ph, ag) |>
  arrange(variable, stat)

players_stats

## Questions

**Broad question:**  
What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

**Specific question (players.csv only):**  
Can a player’s **experience**, **Age**, **gender**, and **played_hours** predict whether they **subscribe** to the newsletter?  

How do these features differ between **experience** groups (Amateur, Veteran, Pro)?

## Exploratory Data Analysis (EDA)

I explore three key visualizations that connect player traits to subscription status:

1. **Subscription Rate by Experience:**  
   This shows how likely different player types (Amateur, Veteran, Pro) are to subscribe.  

2. **Played Hours by Subscription:**  
   This compares the distribution of played hours between subscribers and non-subscribers.  

3. **Age vs. Played Hours, Coloured by Subscription:**  
   This explores whether age and playtime together are related to who subscribes.  

In [None]:
rate_by_exp <- players |>
  group_by(experience) |>
  summarise(sub_rate = mean(subscribe, na.rm = TRUE))

ggplot(rate_by_exp, aes(x = reorder(experience, sub_rate), y = sub_rate)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Subscription Rate by Experience",
       x = "Experience Level",
       y = "Subscription Rate") +
  theme_minimal()


In [None]:
ggplot(players, aes(x = subscribe, y = played_hours)) +
  geom_boxplot() +
  labs(title = "Played Hours by Subscription Status",
       x = "Subscribed", y = "Played hours")

In [None]:
ggplot(players, aes(x = Age, y = played_hours, color = subscribe)) +
  geom_point(alpha = 0.6) +
  labs(title = "Age vs Played Hours (colored by Subscription)",
       x = "Age (years)", y = "Played hours")

### Insights

From these plots, several trends appear:
- **Experienced players** (Veterans and Pros) appear to subscribe at a higher rate than Amateurs in the bar chart.
- **Subscribers** generally log more played hours than non-subscribers, suggesting that engagement is associated with interest in the newsletter.
- In the scatterplot, **younger and middle-aged players** who spend more time in the game appear more likely to subscribe, while less active or older players are less represented among subscribers.
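
Because the bar chart shows rates rather than counts, a small table of both can help check the experience trend. The following is a minimal sketch on made-up data: `toy_players` and its values are hypothetical stand-ins for `players.csv`, so the numbers say nothing about the real players.

```r
library(dplyr)

# Hypothetical toy data standing in for players.csv (values are made up)
toy_players <- tibble(
  experience = c("Amateur", "Amateur", "Veteran", "Veteran", "Pro", "Pro"),
  subscribe  = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)
)

# Group size, number of subscribers, and subscription rate per experience level
exp_summary <- toy_players |>
  group_by(experience) |>
  summarise(
    n_players     = n(),
    n_subscribers = sum(subscribe),
    sub_rate      = mean(subscribe)
  )

exp_summary
```

Reporting `n_players` next to `sub_rate` makes it easier to see whether a high rate comes from a very small group.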



## Methods and Plan

**Objective:** Use features in `players.csv` to predict whether a player subscribes to the newsletter (TRUE/FALSE).

**Method:** Logistic Regression (classification).

**Response:** `subscribe`  
**Predictors:** `experience`, `gender`, `Age`, `played_hours`

**Why this method?**  
Logistic regression is chosen because it models a binary outcome and is easy to interpret. It estimates the probability that a player subscribes based on their characteristics, while keeping the model simple. 
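
As a rough illustration of what this looks like in `tidymodels`, here is a minimal sketch that fits a logistic regression to simulated data. The data frame `sim_players`, the sample size, and the coefficients used to simulate it are all assumptions made for the example; they are not estimates from `players.csv`.

```r
library(tidymodels)

set.seed(160)
n_sim <- 200

# Hypothetical simulated data (not the real players.csv)
sim_players <- tibble(
  played_hours = rexp(n_sim, rate = 0.2),
  Age          = round(runif(n_sim, min = 10, max = 50))
) |>
  mutate(
    # Assumed relationship, chosen only for illustration
    subscribe = factor(
      rbinom(n_sim, size = 1, prob = plogis(-0.5 + 0.15 * played_hours)) == 1,
      levels = c(FALSE, TRUE)
    )
  )

# Logistic regression specification using the base R glm engine
log_spec <- logistic_reg() |>
  set_engine("glm")

log_fit <- log_spec |>
  fit(subscribe ~ played_hours + Age, data = sim_players)

# Coefficients are on the log-odds scale
coef_table <- tidy(log_fit)
coef_table

# Predicted probability of each class for every player
sim_probs <- predict(log_fit, new_data = sim_players, type = "prob")
head(sim_probs)
```

Each coefficient is the change in the log-odds of subscribing for a one-unit increase in that predictor, which is what makes the model easy to interpret.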

**Key assumptions.**  
Logistic regression assumes that the observations are independent of one another and that each numeric predictor has a linear relationship with the log-odds of the outcome. It also assumes that the predictors are not too highly correlated with each other (no strong multicollinearity).
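
One quick way to screen the numeric predictors for strong correlation is a correlation matrix. This is only a sketch: `toy_numeric` holds made-up values, and a pairwise correlation is a rough check rather than a full multicollinearity diagnostic.

```r
# Hypothetical toy values standing in for the numeric columns of players.csv
toy_numeric <- data.frame(
  Age          = c(17, 21, 25, 19, 30, 22, NA),
  played_hours = c(0.5, 0.0, 30.2, 1.1, 0.3, 4.8, 2.0)
)

# Pairwise Pearson correlations, using only rows with no missing values
cor_matrix <- cor(toy_numeric, use = "complete.obs")
cor_matrix
```

Off-diagonal values close to 1 or -1 would suggest that two predictors carry largely the same information.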

**Limitations.**  
Missing values in `Age` reduce the usable data if the affected rows are dropped. In addition, `played_hours` is right-skewed: a few players log far more hours than most, and these extreme values could distort the fitted model and its predictions.
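
One common option for a right-skewed predictor, which is not part of the plan above and is shown only as an assumption, is a `log1p` transform. The sketch below uses made-up hours in `toy_hours` to show how the gap between mean and median shrinks after transforming.

```r
# Hypothetical right-skewed hours (made-up values, not from players.csv)
toy_hours <- c(0, 0.1, 0.5, 1, 2, 3, 150)

# A mean far above the median is a quick sign of right skew
raw_gap <- mean(toy_hours) - median(toy_hours)

# log1p(x) = log(1 + x) is defined at zero hours and compresses the long tail
log_hours <- log1p(toy_hours)
log_gap <- mean(log_hours) - median(log_hours)

c(raw_gap = raw_gap, log_gap = log_gap)
```

Whether such a transform helps on the real data would need to be checked, for example by comparing cross-validated performance with and without it.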

**How the model will be compared and selected.**  
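Before fitting the model, the numeric predictors `Age` and `played_hours` will be standardized to put them on a common scale. For an unpenalized logistic regression this does not change the predicted probabilities, but it makes the coefficients directly comparable, so no variable appears more influential simply because it is measured in larger numeric values. The model will then be assessed with cross-validation on the training set before a final check on the held-out testing set.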
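
Standardization can be expressed as a `recipes` step. The following is a minimal sketch on a hypothetical toy data frame (`toy_scale`, with made-up values); it shows that each numeric predictor ends up with mean 0 and standard deviation 1.

```r
library(tidymodels)

# Hypothetical toy data (made-up values, not from players.csv)
toy_scale <- tibble(
  subscribe    = factor(c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)),
  Age          = c(17, 21, 25, 19, 30, 22),
  played_hours = c(0.5, 0, 30.2, 1.1, 0.3, 4.8)
)

# Centre and scale every numeric predictor
scale_recipe <- recipe(subscribe ~ Age + played_hours, data = toy_scale) |>
  step_normalize(all_numeric_predictors())

# prep() estimates the means and sds; bake(new_data = NULL) returns the processed training data
scaled <- scale_recipe |>
  prep() |>
  bake(new_data = NULL)

scaled |>
  summarise(across(c(Age, played_hours), list(mean = mean, sd = sd)))
```

Putting the step in a recipe means the means and standard deviations are learned from the training data and then reused on new data.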

**How the data will be processed to apply the model.**  
Before fitting the model, I will convert the categorical variables to factors and split the dataset into training and testing sets. If `subscribe` is imbalanced, the split will be stratified so that both classes are fairly represented in each subset. Missing `Age` values will be imputed with the median; to avoid data leakage, the median will be computed from the training data only, within each cross-validation fold.
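
The processing steps described here could be wired together roughly as follows. This is a sketch under stated assumptions: `sim_data` is simulated, and the 75/25 split, the 5 folds, and the seed are illustrative choices rather than decisions made in this report.

```r
library(tidymodels)

set.seed(160)
n_rows <- 200

# Hypothetical simulated data with the same column names as players.csv
sim_data <- tibble(
  experience   = factor(sample(c("Amateur", "Veteran", "Pro"), n_rows, replace = TRUE)),
  gender       = factor(sample(c("Male", "Female"), n_rows, replace = TRUE)),
  Age          = round(runif(n_rows, min = 10, max = 50)),
  played_hours = rexp(n_rows, rate = 0.2)
) |>
  mutate(subscribe = factor(
    rbinom(n_rows, size = 1, prob = plogis(-0.5 + 0.15 * played_hours)) == 1,
    levels = c(FALSE, TRUE)
  ))
sim_data$Age[sample(n_rows, 5)] <- NA  # inject a few missing ages

# Stratified split so both classes appear in each subset (assumed 75/25)
data_split <- initial_split(sim_data, prop = 0.75, strata = subscribe)
train_data <- training(data_split)
test_data  <- testing(data_split)

# Imputation and scaling live in the recipe, so they are re-estimated per fold
player_recipe <- recipe(subscribe ~ experience + gender + Age + played_hours,
                        data = train_data) |>
  step_impute_median(Age) |>
  step_normalize(all_numeric_predictors())

player_wf <- workflow() |>
  add_recipe(player_recipe) |>
  add_model(logistic_reg() |> set_engine("glm"))

# 5-fold cross-validation on the training set only (assumed number of folds)
folds <- vfold_cv(train_data, v = 5, strata = subscribe)
cv_metrics <- player_wf |>
  fit_resamples(resamples = folds) |>
  collect_metrics()
cv_metrics

# Final fit on the training set, evaluated once on the testing set
test_metrics <- last_fit(player_wf, data_split) |>
  collect_metrics()
test_metrics
```

Because the recipe is part of the workflow, the median used for imputation is recomputed from the analysis portion of each fold, which is what keeps information from the held-out rows out of the preprocessing.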

## GitHub Repository

This project is version-controlled on GitHub.  
Repository link: https://github.com/anshtaparia20689/dsci-100-project.git

## Word Count 

**Word Count** : 546