# Planning Report — Predicting Usage of a Video Game Research Server
**Student:** Ansh Taparia 

**Project:** UBC Data Science Project  
**Student ID:** 32652604 

This notebook is fully reproducible.


In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(GGally)
library(ISLR)
options(repr.matrix.max.rows = 6)

In [None]:
set.seed(160)

players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")

players 
sessions

## Data Description

`players.csv` has one row per player with: experience, subscribe, hashedEmail (ID), played_hours, name, gender, and Age.  
`sessions.csv` has one row per session with hashedEmail, start_time, end_time, and raw timestamps.

To combine both datasets, I count how many sessions each player has in `sessions.csv` and then **inner join** that count to `players.csv` by `hashedEmail`. This keeps only players who actually have at least one recorded session. I then summarize the numeric variables (played_hours, Age, and total_sessions), and make two plots to explore patterns.
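
The count-then-join step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual cell: the helper name `add_session_counts` and the toy tibbles are hypothetical, and the only columns assumed are `hashedEmail` plus whatever else the player table carries.

```r
library(dplyr)
library(tibble)

# Hypothetical helper (not from the original notebook): count sessions per
# player, then inner join those counts onto the player table by hashedEmail.
add_session_counts <- function(players_tbl, sessions_tbl) {
  session_counts <- sessions_tbl |>
    count(hashedEmail, name = "total_sessions")

  # inner_join() drops players with no matching row in session_counts
  players_tbl |>
    inner_join(session_counts, by = "hashedEmail")
}

# Made-up toy data, used only to show how the join behaves.
toy_players <- tibble(
  hashedEmail  = c("a", "b", "c"),
  played_hours = c(1.5, 0.0, 3.2),
  Age          = c(17, 21, 19)
)
toy_sessions <- tibble(
  hashedEmail = c("a", "a", "c")
)

toy_joined <- add_session_counts(toy_players, toy_sessions)
toy_joined
```

In the toy example, player `"b"` has no sessions and is therefore dropped by the inner join, which is the behaviour the paragraph above relies on. A `left_join()` would keep such players with a missing `total_sessions` instead, if zero-session players were of interest.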


In [None]:
# Summary statistics for played_hours in long format (one row per statistic)
players_stats_pre <- bind_rows(
  # --- played_hours stats ---
  players |>
    summarise(
      mean_played_hours   = round(mean(played_hours,   na.rm = TRUE), 2),
      median_played_hours = round(median(played_hours, na.rm = TRUE), 2),
      sd_played_hours     = round(sd(played_hours,     na.rm = TRUE), 2)
    ) |>
    pivot_longer(
      cols = c(mean_played_hours, median_played_hours, sd_played_hours),
      names_to = "stat",
      values_to = "value"
    ) |>
    mutate(variable = "played_hours")
)

players_stats_pre

## Questions

**Broad:** Which player characteristics are linked to higher server activity?

**Specific:** Can a player’s **experience**, **Age**, **gender**, **subscribe**, and **played_hours**
help predict their **total number of sessions** on the server?

In [None]:
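# Count sessions per player, then inner join the counts to players by hashedEmail
players_joined <- players |>
  inner_join(count(sessions, hashedEmail, name = "total_sessions"),
             by = "hashedEmail")

# Mean of played_hours and Age, rounded to 2 decimal places, one row per variable
players_mean <- players_joined |>
  select(played_hours, Age) |>
  summarise(across(c(played_hours, Age), ~ round(mean(.x, na.rm = TRUE), 2))) |>
  pivot_longer(cols = c(played_hours, Age),
               names_to = "variable",
               values_to = "mean_2dp") |>
  arrange(variable)

players_mean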


