# Data Science Project: Planning Report  
UBC Minecraft Research Server  
Abhijeet Kaler 
DSCI 100  

## 1. Data Description

This project uses two datasets from a Minecraft research server run by a UBC Computer Science group. Players voluntarily joined the server, their gameplay sessions were logged automatically, and they completed a short questionnaire, giving us both behavioural and self-reported data.

### players.csv  
This file contains **196 players** and **7 variables**:

- `experience` (factor): self-reported Minecraft skill level.  
- `subscribe` (logical): newsletter opt-in.  
- `hashedEmail` (ID): anonymous player identifier.  
- `played_hours` (numeric): total hours spent on the server.  
- `name` (character): first name only.  
- `gender` (factor): self-reported gender.  
- `Age` (numeric): age in years (some missing).

### sessions.csv  
This file contains **1535 sessions** and **5 variables**:

- `hashedEmail` (ID): links sessions to players.  
- `start_time`, `end_time` (character): readable timestamps.  
- `original_start_time`, `original_end_time` (numeric): timestamps in milliseconds.

### Data Issues  
Several issues appeared during exploration:

- **Missing values:** Some ages are missing, and some sessions lack end times.  
- **Skewed data:** `played_hours` is extremely right-skewed: most players have low hours while a few contribute very high amounts.  
- **Self-report bias:** Variables like experience and gender may not be perfectly accurate.  
- **Selection bias:** Only players who opted into the research server are represented.  
- **Irrelevant variables:** `name` will not be used in modelling.
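
The missing-value and skew issues listed above can be checked with a few lines of R. The following is a minimal sketch on a made-up stand-in for `players.csv`: the `toy_players` table and all of its values are hypothetical, chosen only to show the pattern, and are not results from the real data.

```r
library(tidyverse)

# Hypothetical toy stand-in for players.csv; every value here is made up.
toy_players <- tibble(
  played_hours = c(0, 0.1, 0.3, 0.5, 1.2, 2.0, 48.4, 150.0),
  Age          = c(17, 21, NA, 19, 25, NA, 22, 30)
)

# Count the missing values in every column.
na_counts <- toy_players %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# A mean far above the median is a quick numeric sign of right skew.
skew_check <- toy_players %>%
  summarise(
    mean_hours   = mean(played_hours),
    median_hours = median(played_hours)
  )

na_counts
skew_check
```

Running the same two summaries on the real `players` table would show how many ages are missing and how far the mean of `played_hours` sits above its median.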

For now, the player-level dataset is the focus since it contains the response variable of interest. Later, I may aggregate `sessions.csv` (e.g., session count, average session duration) to create additional predictors.
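
As a rough sketch of the aggregation mentioned above, the code below summarises sessions per player and joins the result back to the player table. The toy tables, the summary column names `n_sessions` and `mean_session_hours`, and the choice to give players with no sessions a count of zero are all assumptions made for illustration.

```r
library(tidyverse)

# Hypothetical toy stand-ins for sessions.csv and players.csv (values made up).
toy_sessions <- tibble(
  hashedEmail         = c("a", "a", "b"),
  original_start_time = c(0, 7.2e6, 0),       # milliseconds
  original_end_time   = c(3.6e6, 9.0e6, 1.8e6)
)
toy_players <- tibble(
  hashedEmail  = c("a", "b", "c"),
  played_hours = c(1.5, 0.5, 0)
)

# One row per player: session count and mean session length in hours.
session_summary <- toy_sessions %>%
  mutate(session_hours = (original_end_time - original_start_time) / (1000 * 60 * 60)) %>%
  group_by(hashedEmail) %>%
  summarise(
    n_sessions         = n(),
    mean_session_hours = mean(session_hours, na.rm = TRUE),
    .groups = "drop"
  )

# Keep every player; players with no logged sessions get a count of 0.
players_with_sessions <- toy_players %>%
  left_join(session_summary, by = "hashedEmail") %>%
  mutate(n_sessions = replace_na(n_sessions, 0L))

players_with_sessions
```

A left join is used so that players without any logged sessions stay in the table. Their mean session length is left as `NA` rather than guessed.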


## 2. Questions

**Broad question:**  
Which types of players are most likely to contribute large amounts of data?

**Specific question:**  
Can player demographics and experience (`Age`, `gender`, `experience`, `subscribe`) predict total hours played (`played_hours`) on the research server?

`played_hours` directly measures data contribution, and the predictors describe player characteristics. This planning stage focuses on understanding these variables before doing any modelling.


In [None]:

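# Load packages: tidyverse for wrangling and plotting, lubridate for date parsing
library(tidyverse)
library(lubridate)

# Read both datasets (assumes the CSV files are in the working directory)
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

# Inspect column names and types
glimpse(players)
glimpse(sessions)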


In [None]:
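# Convert categorical columns to factors and make sure subscribe is logical
players <- players %>% 
  mutate(
    experience = as.factor(experience),
    gender     = as.factor(gender),
    subscribe  = as.logical(subscribe)
  )

# Parse the readable timestamps (assumed to be day/month/year hour:minute)
# and compute each session's length in hours from the millisecond timestamps
sessions <- sessions %>% 
  mutate(
    start_time = dmy_hm(start_time),
    end_time   = dmy_hm(end_time),
    session_hours = (original_end_time - original_start_time) / (1000 * 60 * 60)
  )

# Mean of every numeric player variable, ignoring missing values,
# reshaped to one row per variable
players_means <- players %>% 
  summarise(across(where(is.numeric), ~ round(mean(.x, na.rm = TRUE), 2))) %>% 
  pivot_longer(everything(),
               names_to = "variable",
               values_to = "mean")

players_means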

In [None]:

simple_theme <- theme_minimal(base_size = 13)

# boundary = 0 makes the bins start at zero, so no bin spans negative hours
ggplot(players, aes(x = played_hours)) +
  geom_histogram(binwidth = 10, boundary = 0, fill = "lightblue", color = "black") +
  labs(
    title = "Total Hours Played (Histogram)",
    x = "Total Hours Played",
    y = "Count"
  ) +
  simple_theme


ggplot(players, aes(x = experience, y = played_hours)) +
  geom_boxplot(fill = "lightblue") +
  labs(
    title = "Hours Played by Experience Level",
    x = "Experience Level",
    y = "Total Hours Played"
  ) +
  simple_theme


ggplot(players, aes(x = subscribe, y = played_hours)) +
  geom_boxplot(fill = "lightblue") +
  labs(
    title = "Hours Played by Newsletter Subscription",
    x = "Subscribed?",
    y = "Total Hours Played"
  ) +
  simple_theme


# Players with a missing Age cannot be plotted; na.rm = TRUE drops them silently
ggplot(players, aes(x = Age, y = played_hours)) +
  geom_point(color = "lightblue", na.rm = TRUE) +
  labs(
    title = "Age vs Total Hours Played",
    x = "Age",
    y = "Total Hours Played"
  ) +
  simple_theme


## 3. Exploratory Data Analysis and Visualization

The histogram shows that `played_hours` is highly right-skewed: most players contributed very few hours, and a small number contributed a lot. The boxplot by experience level suggests that more experienced players tend to record more hours. Subscribers show slightly higher hours than non-subscribers, but the two distributions overlap substantially. The scatterplot of `Age` against `played_hours` shows no strong linear trend, although the highest playtimes appear among younger and mid-range ages. Overall, several variables appear to be related to `played_hours`, but these plots are suggestive rather than conclusive.



## 4. Methods and Plan

I plan to use **k-nearest neighbours (k-NN) regression** to predict `played_hours` from `Age`, `gender`, `experience`, and `subscribe`. This method is suitable because the relationship between the predictors and playtime is likely non-linear, and k-NN does not assume any particular functional form for that relationship.

However, k-NN is sensitive to scaling, outliers, and the choice of *k*. The skewness of `played_hours` may also affect performance, so a transformation may be considered during the full project.
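
One candidate transformation is the log of one plus the hours, which stays defined at zero hours. It is shown here only as an illustration, since the plan does not commit to a specific transformation, and the `hours` vector is made up.

```r
# Hypothetical right-skewed playtimes (made-up values, not from players.csv).
hours <- c(0, 0.1, 0.3, 0.5, 1.2, 2.0, 48.4, 150.0)

# log1p(x) computes log(1 + x), so zero hours maps to zero instead of -Inf.
log_hours <- log1p(hours)

# expm1() inverts log1p(), which converts predictions back to hours.
back_to_hours <- expm1(log_hours)

# The largest value drops from 150 to about 5 on the transformed scale.
range(hours)
range(log_hours)
```

If the response were transformed this way, predictions would need to be passed through `expm1()` before computing an error in hours.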

I will split the data into **80% training** and **20% test** sets. All preprocessing (handling missing values, scaling numeric variables, encoding factors) will be defined in a recipe estimated on the training data only, so that no information from the test set leaks into the model. I will tune *k* with cross-validation on the training set, choosing the value with the lowest cross-validated RMSE. The final model will then be evaluated once on the held-out test set, and its RMSPE will serve as the estimate of predictive error.
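
The plan above could be sketched with the tidymodels packages as follows. This is a minimal illustration under several assumptions, not the final analysis: the data are simulated, the factor levels are placeholders, `subscribe` is stored as a factor so it can be dummy-encoded, median imputation stands in for whichever missing-value strategy is eventually chosen, and the grid of *k* values and the five folds are arbitrary. It also assumes the `kknn` package is installed.

```r
library(tidyverse)
library(tidymodels)  # the "kknn" engine also needs the kknn package installed

set.seed(2024)

# Simulated stand-in for players.csv; all values and levels are hypothetical.
n <- 120
toy_players <- tibble(
  Age          = as.numeric(sample(10:50, n, replace = TRUE)),
  experience   = factor(sample(c("low", "mid", "high"), n, replace = TRUE)),
  subscribe    = factor(sample(c("yes", "no"), n, replace = TRUE)),
  played_hours = rexp(n, rate = 0.2)
)
toy_players$Age[c(5, 40)] <- NA  # mimic a few missing ages

# 80/20 split into training and test sets.
data_split <- initial_split(toy_players, prop = 0.8)
train_data <- training(data_split)
test_data  <- testing(data_split)

# Preprocessing is estimated from the training data only.
knn_recipe <- recipe(played_hours ~ Age + experience + subscribe, data = train_data) %>%
  step_impute_median(Age) %>%
  step_normalize(Age) %>%
  step_dummy(all_nominal_predictors())

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")

# Tune k with 5-fold cross-validation on the training set.
folds  <- vfold_cv(train_data, v = 5)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

cv_results <- workflow() %>%
  add_recipe(knn_recipe) %>%
  add_model(knn_spec) %>%
  tune_grid(resamples = folds, grid = k_grid) %>%
  collect_metrics() %>%
  filter(.metric == "rmse")

best_k <- cv_results %>%
  slice_min(mean, n = 1, with_ties = FALSE) %>%
  pull(neighbors)

# Refit on the full training set with the chosen k.
final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) %>%
  set_engine("kknn") %>%
  set_mode("regression")

final_fit <- workflow() %>%
  add_recipe(knn_recipe) %>%
  add_model(final_spec) %>%
  fit(data = train_data)

# Evaluate once on the test set; RMSE on unseen data is the RMSPE.
rmspe <- predict(final_fit, test_data) %>%
  bind_cols(test_data) %>%
  rmse(truth = played_hours, estimate = .pred) %>%
  pull(.estimate)

rmspe
```

Keeping the recipe inside the workflow means the imputation and scaling are re-estimated within each cross-validation fold, so the held-out fold does not inform its own preprocessing.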
