# Predicting Newsletter Sign-Up From First-Week Playtime on a Minecraft Research Server
*Allan Xue – DSCI 100 Final Project – June 2025*

### Introduction

#### Background

A research group in the UBC Computer Science department, led by Frank Wood, operates a public Minecraft server to investigate how people explore, build, and cooperate in an open-world game. Given the nature of the project, the team must recruit and retain players while ensuring it has sufficient resources to manage a large number of them. One early signal of engagement is whether a newcomer subscribes to the project newsletter. If we can predict that choice from the behaviour shown in the first seven days, the team can automate follow-up emails and avoid spending resources on players who are unlikely to contribute.

#### Question
Can a player's total playtime in their first seven days on the server predict whether they will subscribe to the newsletter?

#### Data Description


| File | Rows | Cols | One row = | Key variables (type) | Notes |
|------|------|------|-----------|----------------------|-------|
| **`players.csv`** | **196** | 7 | an individual player | `hashedEmail` (chr, unique ID)  <br>`experience` (chr – “Amateur”, “Regular”, …)  <br>`played_hours` (num – lifetime hours when file exported)  <br>`gender` (chr)  <br>`Age` (num)  <br>`subscribe` (lgl) | 2 missing `Age` values; 144 subscribers vs 52 non-subscribers |
| **`sessions.csv`** | **1535** | 5 | one login session | `hashedEmail` (chr, foreign key → players)  <br>`start_time`, `end_time` (chr “dd/mm/YYYY HH:MM”)  <br>`original_start_time`, `original_end_time` (num – Unix ms) | 2 sessions have missing `end_time` (player disconnected abruptly) |
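
The class counts in the Notes column imply a baseline that any classifier should beat. As a small sketch (written in Python, which is an assumption about tooling rather than something the project specifies), always predicting the majority class would give the following accuracy:

```python
# Class counts taken from the players.csv row of the table above.
n_subscribed = 144
n_not_subscribed = 52
n_players = n_subscribed + n_not_subscribed  # 196 rows in total

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(n_subscribed, n_not_subscribed) / n_players
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")  # 73.5%
```

A model built on first-week playtime is only informative if it improves on this figure.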


#### Collected variables

* **`hashedEmail`** – identifies a participant while preserving anonymity.  
* **`experience`** – self-reported Minecraft skill tier at sign-up (`Amateur`, `Regular`, `Pro`, `Veteran`, `Legend`).  
* **`played_hours`** – hours played before the current data pull.  
* **Session times**  
  * `start_time`/`end_time`: human-readable strings; parsed into date-time values during wrangling.  
  * `original_start_time`/`original_end_time`: Unix timestamps in milliseconds; kept as reference.
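
To illustrate the two timestamp representations, the sketch below parses a day-first string and converts a Unix-millisecond value. The sample values are invented for illustration and are not taken from `sessions.csv`. Treating the Unix values as UTC is an assumption, since the data description does not state a time zone.

```python
from datetime import datetime, timezone

# Illustrative values only; not taken from sessions.csv.
start_text = "05/06/2024 18:30"   # dd/mm/YYYY HH:MM, i.e. 5 June, not 6 May
start_unix_ms = 1717612200000     # milliseconds since the Unix epoch

# An explicit day-first format avoids misreading the day as the month.
parsed_text = datetime.strptime(start_text, "%d/%m/%Y %H:%M")

# Unix milliseconds are divided by 1000 to get seconds (UTC is assumed here).
parsed_unix = datetime.fromtimestamp(start_unix_ms / 1000, tz=timezone.utc)

print(parsed_text.isoformat())
print(parsed_unix.isoformat())
```

Keeping the original Unix columns makes it possible to cross-check the parsed strings in this way.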



#### Considerations
* **Missing values:** 2 ages and 2 session `end_time`s are missing. We set `total_minutes_first_week = 0` when a player has no sessions in week 1.  
* **Potential bias:** Players choose to sign up, so the sample may over-represent highly motivated individuals.  
* **Time stamps:** Stored as strings in `dd/mm/YYYY HH:MM` format, so they must be parsed with an explicit day-first format to avoid swapping day and month.  
* **Unit mismatch:** `played_hours` is in hours, but the predictor is in minutes. We keep both and label units in plots.
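
The considerations above can be sketched as a small aggregation. This is a minimal Python sketch under stated assumptions, not the project's actual wrangling code. The sample rows are invented, and the function name `first_week_minutes` is hypothetical. "First week" is assumed to mean the seven days following each player's first recorded session start, a session is assumed to count in full if it starts inside that window, and sessions with a missing `end_time` are assumed to contribute no minutes.

```python
from collections import defaultdict
from datetime import datetime, timedelta

TIME_FORMAT = "%d/%m/%Y %H:%M"  # day-first format described for sessions.csv

# Invented sample data for illustration; real IDs are hashed emails.
players = ["player_a", "player_b", "player_c"]
sessions = [
    ("player_a", "01/05/2024 10:00", "01/05/2024 10:45"),
    ("player_a", "03/05/2024 20:00", "03/05/2024 21:30"),
    ("player_a", "20/05/2024 09:00", "20/05/2024 10:00"),  # after week 1
    ("player_b", "02/05/2024 12:00", ""),                  # missing end_time
]


def first_week_minutes(players, sessions):
    """Return {player: minutes played in the 7 days after their first start}."""
    parsed = defaultdict(list)
    for player, start_text, end_text in sessions:
        start = datetime.strptime(start_text, TIME_FORMAT)
        # Assumption: a session without an end_time has unknown length,
        # so it marks the player's start but contributes no minutes.
        end = datetime.strptime(end_text, TIME_FORMAT) if end_text else None
        parsed[player].append((start, end))

    totals = {}
    for player in players:
        rows = parsed.get(player, [])
        total = 0.0
        if rows:
            window_end = min(start for start, _ in rows) + timedelta(days=7)
            for start, end in rows:
                # Assumption: a session counts in full if it starts in week 1.
                if end is not None and start < window_end:
                    total += (end - start).total_seconds() / 60
        totals[player] = total  # stays 0 for players with no usable sessions
    return totals


totals = first_week_minutes(players, sessions)
print(totals)

# played_hours is in hours; convert to minutes before comparing the two.
played_hours = 2.5  # illustrative value
print(played_hours * 60)
```

With these sample rows, `player_a` accumulates 135 minutes (45 + 90), while `player_b` and `player_c` both receive 0, matching the zero-fill rule for players without usable week-1 sessions.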
