# Predicting Newsletter Subscription in a Minecraft Research Server Dataset

## Introduction

### Background

The University of British Columbia's Computer Science department is conducting a research project on player behavior in a Minecraft server. Participants play freely, and their in-game interactions are logged. One question of interest to the research team is which players are likely to subscribe to a related newsletter. Such predictions can help the team improve participant engagement strategies and allocate resources, such as server maintenance and software licenses, more effectively.

### Question

Can player demographics (e.g., gender, age) and gameplay behaviors (e.g., total hours played, self-reported experience) predict newsletter subscription in the Minecraft research server dataset?

## Data Description

### Dataset Overview

Two datasets were used:

* **players.csv**: Contains demographic information and subscription status.
* **sessions.csv**: Contains session-level logs of gameplay activity (not directly used in this phase).

### Summary

* Total observations: a little over 200 players; the exact number of rows in each model depends on filtering and on the training/test split
* Target variable: `subscribe` (Yes or No)
* Key explanatory variables:

  * `scaled_played_hours`: Standardized total hours played
  * `scaled_Age`: Standardized player age
  * `experience`: Ordinal categorical feature
  * `gender_Female`, `gender_Male`, `gender_others`: One-hot encoded gender identities

### Variable Description

| Variable       | Type    | Description                             |
| -------------- | ------- | --------------------------------------- |
| subscribe      | Factor  | Whether player subscribed to newsletter |
| played\_hours  | Numeric | Total hours played in Minecraft         |
| Age            | Numeric | Self-reported age                       |
| experience     | Ordinal | Gameplay experience level               |
| gender\_Female | Binary  | 1 if female, 0 otherwise                |
| gender\_Male   | Binary  | 1 if male, 0 otherwise                  |
| gender\_others | Binary  | 1 if any other gender, 0 otherwise      |

### Data Issues

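* The `subscribe` classes are imbalanced, so accuracy should be read against the majority-class baseline rather than against 50%.
* The numeric predictors `played_hours` and `Age` were standardized with `scale()` (giving `scaled_played_hours` and `scaled_Age`) so that neither dominates the distance calculations in KNN.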
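
As a rough illustration of the standardization step, the sketch below computes z-scores in Python. It assumes the sample standard deviation (n - 1 denominator), which is what R's `scale()` uses by default; the `standardize` helper and the input values are hypothetical and are not taken from `players.csv`.

```python
from statistics import mean, stdev

def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample SD (n - 1 denominator)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(x - mu) / sigma for x in values]

# Hypothetical values for illustration only (not from players.csv).
played_hours = [0.0, 0.5, 1.0, 30.0, 120.0]
age = [17, 19, 21, 23, 40]

scaled_played_hours = standardize(played_hours)
scaled_age = standardize(age)
print(scaled_played_hours)
print(scaled_age)
```

After this step both columns have mean 0 and standard deviation 1, so a difference of one unit means "one standard deviation" for either variable when KNN computes distances.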

## Methods & Results

### Data Processing

* Removed unnecessary identifiers (`hashedEmail`, `name`)
* One-hot encoded gender
* Combined minority gender categories into `gender_others`
* Split data into training and test sets (80/20 split, stratified by `subscribe`)
* Used 5-fold cross-validation on the training set to tune the number of neighbors (K)
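
The encoding and splitting steps above can be sketched as follows. This is a minimal Python illustration, not the project's actual pipeline: the helper names (`encode_gender`, `stratified_split`), the raw gender labels, and the 70/30 class mix are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def encode_gender(genders, keep=("Female", "Male")):
    """One-hot encode gender; every category not in `keep` goes to gender_others."""
    rows = []
    for g in genders:
        row = {f"gender_{k}": int(g == k) for k in keep}
        row["gender_others"] = int(g not in keep)
        rows.append(row)
    return rows

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with similar class proportions in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)  # per-class share of the test set
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical raw labels (the real category names may differ).
genders = ["Male", "Female", "Non-binary", "Male", "Prefer not to say"]
encoded = encode_gender(genders)
print(encoded[2])

# Hypothetical imbalanced target: 70 subscribers, 30 non-subscribers.
labels = ["Yes"] * 70 + ["No"] * 30
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in test_idx))
```

Splitting within each class separately is what keeps the share of subscribers roughly equal in the training and test sets, which matters when the target is imbalanced.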

### Model 1: Behavior + Age + Experience

* Features: `scaled_played_hours`, `scaled_Age`, `experience`, `gender_Female`, `gender_Male`, `gender_others`

* Best number of neighbors (K): 7
* Cross-validation Accuracy: **0.731**
* Test Accuracy: **0.694**
* Kappa: **-0.008** (essentially no agreement beyond chance)
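
To make the role of K concrete, the following is a minimal sketch of K-nearest-neighbors prediction and of choosing K by 5-fold cross-validation. It is written in plain Python under stated assumptions: Euclidean distance, simple majority vote, unstratified folds, and synthetic two-cluster data. The function names and candidate K values are hypothetical and do not reproduce the models reported here.

```python
import math
import random
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Predict the majority label among the k nearest training points."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]

def cv_accuracy(X, y, k, n_folds=5, seed=0):
    """Mean accuracy of k-NN across n_folds cross-validation folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        correct = sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in fold)
        scores.append(correct / len(fold))
    return sum(scores) / n_folds

# Synthetic, already-standardized data: two well-separated clusters.
rng = random.Random(1)
X = [(rng.gauss(-2, 0.5), rng.gauss(-2, 0.5)) for _ in range(30)]
X += [(rng.gauss(2, 0.5), rng.gauss(2, 0.5)) for _ in range(30)]
y = ["No"] * 30 + ["Yes"] * 30

candidates = [1, 3, 5, 7, 9]  # hypothetical tuning grid
scores = {k: cv_accuracy(X, y, k) for k in candidates}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Small values of K follow individual training points closely and tend to overfit, while larger values smooth the decision boundary; cross-validation is one way to pick a value between the two extremes.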

### Model 2: Gender only

* Features: `gender_Female`, `gender_Male`, `gender_others`
* Best K: 1
* Cross-validation Accuracy: **0.735**
* Test Accuracy: **0.449**
* Kappa: **0.068** (slight agreement, only marginally above chance)
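
Because accuracy and kappa tell different stories for these models, a short worked example of Cohen's kappa may help. The sketch below computes kappa from a 2x2 confusion matrix. The counts are hypothetical, chosen only to show how an accuracy near 0.69 can coexist with a kappa near zero when a classifier mostly predicts the majority class; they are not the project's actual confusion matrix.

```python
def cohen_kappa(tp, fn, fp, tn):
    """Cohen's kappa for a 2x2 confusion matrix (actual vs. predicted)."""
    n = tp + fn + fp + tn
    p_observed = (tp + tn) / n  # plain accuracy
    # Chance agreement: actual share times predicted share, summed over classes.
    p_yes = ((tp + fn) / n) * ((tp + fp) / n)
    p_no = ((fp + tn) / n) * ((fn + tn) / n)
    p_expected = p_yes + p_no
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical counts: 36 actual "Yes", 13 actual "No",
# with the classifier predicting "Yes" for 45 of the 49 players.
tp, fn, fp, tn = 33, 3, 12, 1
accuracy = (tp + tn) / (tp + fn + fp + tn)
kappa = cohen_kappa(tp, fn, fp, tn)
print(f"accuracy = {accuracy:.3f}, kappa = {kappa:.3f}")
```

With these illustrative counts the accuracy is about 0.69, yet kappa is slightly below zero, because a classifier that almost always predicts "Yes" would agree with the true labels about as often by chance alone.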

### Model 3: Combined (to be completed)

* Planned to include all above features to improve performance

### Visualization

* Plots showed a modest separation in `played_hours` and `Age` between subscribers and non-subscribers
* Gender alone was not a strong visual differentiator

## Discussion

### Summary

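* Model 1, which includes the behavioral features (played hours, experience), had higher test accuracy (0.694) than the gender-only Model 2 (0.449), although its kappa of -0.008 shows no agreement beyond chance
* The gender-only model performed poorly on the test set despite a high cross-validation accuracy (0.735 vs. 0.449), which likely reflects overfitting at K = 1
* Cross-validation selected K = 7 for Model 1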

### Insights

* Playing behavior appears to be a stronger indicator of newsletter interest than self-identified gender, though the near-zero kappa values mean the evidence is weak
* The direction is plausible: more active players may be more interested in newsletters

### Impact

* Helps project managers target recruitment and plan resource allocation
* Can be expanded into real-time recommendation or engagement systems

### Future Questions

* Can we incorporate more session-based features (e.g., frequency, peak times)?
* Can time-series or ensemble models perform better?
* Is there a causal link between gameplay style and subscription?

