# Planning Report — Predicting Usage of a Video Game Research Server
**Student:** Ansh Taparia 

**Project:** UBC Data Science Project  
**Student ID:** 32652604 

This notebook is reproducible: it fixes the random seed with `set.seed()` and reads both data files from relative paths.


In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(GGally)
library(ISLR)
options(repr.matrix.max.rows = 6)

In [None]:
set.seed(160)

players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")

players 
sessions

## Data Description

`players.csv` has one row per player with: experience, subscribe, hashedEmail (ID), played_hours, name, gender, and Age.  
`sessions.csv` has one row per session with hashedEmail, start_time, end_time, and raw timestamps.

To combine the two datasets, I count the sessions per player in `sessions.csv` and then **inner join** that count to `players.csv` on `hashedEmail`. This keeps only players with at least one recorded session. I then summarize the numeric variables (played_hours, Age, and total_sessions) and make two plots to explore patterns.
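The join described above can be sketched as follows. This is a minimal illustration on made-up toy tables, not the project data: `players_toy`, `sessions_toy`, `session_counts`, and `players_sessions` are hypothetical names, and the values are invented so the block runs on its own. Only `hashedEmail` and `total_sessions` come from the text.

```r
library(tidyverse)

# Toy stand-ins for players.csv and sessions.csv (values invented for illustration)
players_toy <- tibble(
  hashedEmail  = c("a1", "b2", "c3"),
  played_hours = c(1.5, 0.0, 12.3),
  Age          = c(17, 22, 30)
)
sessions_toy <- tibble(
  hashedEmail = c("a1", "a1", "c3")  # one row per session
)

# Count sessions per player
session_counts <- sessions_toy |>
  count(hashedEmail, name = "total_sessions")

# inner_join() drops players with no matching row in session_counts (here "b2")
players_sessions <- players_toy |>
  inner_join(session_counts, by = "hashedEmail")

players_sessions
```

If players with zero sessions should be kept instead, `left_join()` followed by replacing the missing counts with 0 would be the alternative; the inner join matches the choice stated above.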


In [None]:
ph <- players |>
  summarise(
    mean = round(mean(played_hours, na.rm = TRUE), 2),
    median = round(median(played_hours, na.rm = TRUE), 2),
    sd = round(sd(played_hours, na.rm = TRUE), 2)
  ) |>
  pivot_longer(everything(), names_to = "stat", values_to = "value") |>
  mutate(variable = "played_hours")

ag <- players |>
  summarise(
    mean = round(mean(Age, na.rm = TRUE), 2),
    median = round(median(Age, na.rm = TRUE), 2),
    sd = round(sd(Age, na.rm = TRUE), 2)
  ) |>
  pivot_longer(everything(), names_to = "stat", values_to = "value") |>
  mutate(variable = "Age")

# Stack both summaries into one long table, sorted by variable and then statistic
players_stats <- bind_rows(ph, ag) |>
  arrange(variable, stat)

players_stats

## Questions

**Broad question:**  
What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

**Specific question (players.csv only):**  
Can a player’s **experience**, **Age**, **gender**, and **played_hours** predict whether they **subscribe** to the newsletter?  
How do these features differ between **experience** groups (e.g., Amateur, Veteran, Pro)?


## Exploratory Data Analysis (EDA)

I explore how **subscription** relates to player traits in four steps:

1. Look at the distributions of **played_hours** and **Age**.
2. Compare **subscription rates by experience** and **by gender**.
3. Compare **played_hours** across subscription groups and across experience levels.
4. Show a small table of group means to summarize differences.

These checks help identify which features (experience, age, gender, playtime) are associated with subscribing, and how patterns differ across player types.
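As a rough sketch of steps 2 and 4, the block below computes subscription rates by gender and a small table of group means. It runs on an invented toy tibble (`players_toy`, a hypothetical name) that only mimics the column names of `players.csv`; the numbers carry no meaning, and `rate_by_gender` and `means_by_sub` are illustrative names as well.

```r
library(tidyverse)

# Toy data with the same column names as players.csv (values invented)
players_toy <- tibble(
  experience   = c("Amateur", "Veteran", "Pro", "Amateur", "Pro", "Veteran"),
  subscribe    = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE),
  gender       = c("Male", "Female", "Female", "Male", "Male", "Male"),
  played_hours = c(2.0, 0.5, 10.0, 4.0, 1.0, 3.0),
  Age          = c(17, 25, 21, 19, 30, NA)
)

# Step 2: share of subscribers within each gender, with group sizes
rate_by_gender <- players_toy |>
  group_by(gender) |>
  summarise(sub_rate = mean(subscribe, na.rm = TRUE), n = n())

# Step 4: group means of the numeric variables by subscription status
means_by_sub <- players_toy |>
  group_by(subscribe) |>
  summarise(
    mean_played_hours = mean(played_hours, na.rm = TRUE),
    mean_age          = mean(Age, na.rm = TRUE),
    n                 = n()
  )

rate_by_gender
means_by_sub
```

Reporting `n` next to each rate or mean makes it easier to spot groups that are too small for their summary to be trusted.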


In [None]:
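# Share of subscribers within each experience level
# (assumes subscribe is logical, so its mean is a proportion)
rate_by_exp <- players |>
  group_by(experience) |>
  summarise(sub_rate = mean(subscribe, na.rm = TRUE))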

ggplot(rate_by_exp, aes(x = reorder(experience, sub_rate), y = sub_rate)) +
  geom_col() +
  coord_flip() +
  labs(title = "Subscription Rate by Experience",
       x = "Experience", y = "Subscription rate")
