In [None]:
# load libraries
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

# Individual Project Planning

## Data Description
In this project, we aim to explore player engagement and predict user behaviour on a research Minecraft server operated by the UBC Computer Science department. The server logs detailed player activity across multiple sessions, providing an opportunity to analyze how different player characteristics and behaviours relate to participation patterns. Two datasets were provided for analysis:

In [None]:
players_data <- read_csv("players.csv")
head(players_data)

#### players.csv

The `players.csv` file contains information about each unique player. It includes 196 observations and seven variables:
- **experience**: a character variable that describes the player's skill level as one of five categories.
    - The 5 categories: 'Beginner', 'Amateur', 'Regular', 'Pro', 'Veteran'.
- **subscribe**: a logical variable that indicates whether the player is subscribed to the newsletter.
- **hashedEmail**: a character variable that holds a unique anonymized player ID.
- **played_hours**: a double variable that gives the total number of hours played.
- **name**: a character variable that reports the first name of the player.
- **gender**: a character variable that reports the gender of the player.
- **Age**: a double variable that gives the age of the player in years.

In [None]:
summary_player <- summary(players_data)
summary_player

In [None]:
sessions_data <- read_csv("sessions.csv")
head(sessions_data)

#### sessions.csv

The `sessions.csv` file records the individual play sessions of each player. It contains 1535 observations and five variables:
- **hashedEmail**: a character variable that holds the unique anonymized player ID linking session data to player data.
- **start_time**: a character variable that reports the session start timestamp.
- **end_time**: a character variable that reports the session end timestamp.
- **original_start_time**: a double variable that gives the session start time in UNIX time (milliseconds).
- **original_end_time**: a double variable that gives the session end time in UNIX time (milliseconds).
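
As a minimal sketch of how a UNIX timestamp in milliseconds maps to a readable date-time, the snippet below converts a single value with base R. The value `example_ms` is a hypothetical number chosen for illustration; it is not taken from `sessions.csv`.

```r
# Hypothetical timestamp in milliseconds since 1970-01-01 UTC (not from sessions.csv)
example_ms <- 1.7e12

# as.POSIXct() works in seconds, so the millisecond value is divided by 1000 first
example_time <- as.POSIXct(example_ms / 1000, origin = "1970-01-01", tz = "UTC")

# Print the result as a plain date-time string
format(example_time, "%Y-%m-%d %H:%M:%S")
```

The same division by 1000 would apply to the `original_start_time` and `original_end_time` columns, assuming they are indeed stored in milliseconds as described above.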

In [None]:
summary_sessions <- summary(sessions_data)
summary_sessions

## Question

#### Broad Question
What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?
#### Specific Question
Are more experienced or older players more likely to subscribe to the newsletter than new or younger players?
- Response variable: `subscribe`.
- Explanatory variables: `experience`, `Age`, `played_hours`.

#### Data Relevance
The dataset is relevant to addressing the question because it captures both demographic characteristics (age and experience) and behavioral engagement metrics (playtime). By analyzing these variables, we can assess whether more experienced or older players are more inclined to subscribe, potentially indicating higher interest in the game's community or research goals.

## Exploratory Data Analysis and Visualizations

In [None]:
players_data <- read_csv("players.csv")
head(players_data)

In [None]:
mean_players <- summarize(players_data, 
                             mean_played_hours = mean(played_hours, na.rm = TRUE), 
                             mean_age = mean(Age, na.rm = TRUE))
mean_players

In [None]:
options(repr.plot.height = 8, repr.plot.width = 10)
age_playtime_plot <- players_data |>
    ggplot(aes(x = Age, y = played_hours, colour = subscribe)) +
    geom_point() +
    xlab("Age of the Player (in years)") +
    ylab("Total Number of Hours Played") +
    labs(colour = "Subscribed") +
    ggtitle("Relationship Between Playtime and Age by Subscription Status")
age_playtime_plot

## Methods and Plan

To predict whether a player subscribes to the newsletter based on their age, experience, and playtime, we will use a K-nearest neighbours (KNN) classification model.

KNN is a non-parametric and intuitive method that classifies each player based on how similar their characteristics are to those of other players in the dataset, which makes it an appropriate option here. It makes minimal assumptions, and its main requirements are:

- Observations that are close together in feature space are likely to belong to the same class.
- Predictor variables must be numeric and scaled, so that distance calculations are meaningful.
- The choice of K significantly affects model performance, so it must be tuned carefully.
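
To illustrate why scaling matters for distance calculations, here is a small sketch using three made-up players. The values in `toy` are invented for illustration and do not come from `players.csv`.

```r
# Toy values for illustration only (not taken from players.csv)
toy <- data.frame(Age = c(17, 18, 40), played_hours = c(0.5, 150, 1))

# Unscaled Euclidean distances: the played_hours gap between players 1 and 2
# dwarfs the age gap between players 1 and 3
d_raw <- as.matrix(dist(toy))
d_raw

# Standardized distances: scale() centres each column to mean 0 and
# rescales it to standard deviation 1 before distances are computed
d_std <- as.matrix(dist(scale(toy)))
d_std
```

In the unscaled matrix, player 2 is several times farther from player 1 than player 3 is, purely because hours span a wider numeric range than age. After standardizing, the two distances are of similar size, so neither variable dominates.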

However, the KNN model has a few potential limitations:

- KNN can be computationally expensive for large datasets, though this dataset is relatively small.
- It is sensitive to irrelevant or highly correlated features.
- It does not produce coefficients that directly show variable importance, so interpretation may require additional visual or model-based tools (e.g., feature importance from tree-based models).

#### Model Evaluation Plan

- **Data Splitting**: Split the dataset into 70% training and 30% testing subsets.
- **Cross-Validation**: Use 5-fold cross-validation on the training set to select the optimal number of neighbours (K) by minimizing classification error or maximizing F1 score.
- **Evaluation Metrics**: Assess model performance using accuracy, precision, recall, and AUC (Area Under the ROC Curve).
- **Comparison**: Compare KNN results with a simple logistic regression baseline to determine whether a nonlinear, distance-based model improves predictive accuracy.
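
The evaluation steps could be sketched with `tidymodels` roughly as follows. This is a sketch under stated assumptions, not the final analysis: `toy_players` is a synthetic stand-in with made-up values, the seed is arbitrary, and the baseline uses only `Age` and `played_hours` to keep the example short.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed so the split is reproducible

# Synthetic stand-in for players_data (values are made up for illustration).
# "TRUE" is listed first so yardstick treats subscribing as the event of interest.
toy_players <- tibble(
  subscribe = factor(rep(c("TRUE", "FALSE"), times = c(60, 40)),
                     levels = c("TRUE", "FALSE")),
  Age = round(runif(100, min = 10, max = 50)),
  played_hours = round(rexp(100, rate = 0.2), 1)
)

# 70/30 split, stratified so both sets keep a similar share of subscribers
players_split <- initial_split(toy_players, prop = 0.7, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

# 5 cross-validation folds built from the training set only
players_folds <- vfold_cv(players_train, v = 5, strata = subscribe)

# The metrics named in the plan, bundled into one function
plan_metrics <- metric_set(accuracy, precision, recall, roc_auc)

# Logistic regression baseline fitted on the training set
baseline_fit <- logistic_reg() |>
  set_engine("glm") |>
  fit(subscribe ~ Age + played_hours, data = players_train)

# Predicted classes and class probabilities for the held-out test set
baseline_preds <- augment(baseline_fit, new_data = players_test)

# roc_auc uses the probability column; the other metrics use the predicted class
baseline_metrics <- plan_metrics(baseline_preds,
                                 truth = subscribe,
                                 estimate = .pred_class,
                                 .pred_TRUE)
baseline_metrics
```

With the real data, `subscribe` is logical, so it would first need converting to a factor before any classification model can be fitted.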

#### Data Processing Plan

- **Encoding**: Convert categorical variables (e.g., `experience`) into numeric dummy variables.
- **Scaling**: Standardize numeric features (`Age`, `played_hours`) to zero mean and unit variance.
- **Feature Selection**: Evaluate which variables most improve predictive accuracy.
- **Training**: Fit the KNN classifier on the processed training data.
- **Validation**: Tune the number of neighbours (K) using grid search and cross-validation.
- **Testing**: Evaluate the tuned model on the held-out test data and interpret results.
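
One possible shape for the encoding, scaling, and tuning steps is sketched below with a `tidymodels` recipe and workflow. It rests on several assumptions: `toy_players` is synthetic data with made-up values, the grid of K values and the seed are arbitrary choices, and the `kknn` package must be installed for the `"kknn"` engine to work.

```r
library(tidymodels)  # the "kknn" engine also requires the kknn package

set.seed(2024)  # arbitrary seed for reproducibility

# Synthetic stand-in for players_data (values are made up for illustration)
toy_players <- tibble(
  subscribe = factor(rep(c("TRUE", "FALSE"), times = c(60, 40)),
                     levels = c("TRUE", "FALSE")),
  experience = sample(c("Beginner", "Amateur", "Regular", "Pro", "Veteran"),
                      size = 100, replace = TRUE),
  Age = round(runif(100, min = 10, max = 50)),
  played_hours = round(rexp(100, rate = 0.2), 1)
)

players_split <- initial_split(toy_players, prop = 0.7, strata = subscribe)
players_train <- training(players_split)
players_folds <- vfold_cv(players_train, v = 5, strata = subscribe)

# Scaling, then encoding: standardize the two numeric predictors,
# then turn experience into dummy (0/1) columns
knn_recipe <- recipe(subscribe ~ experience + Age + played_hours,
                     data = players_train) |>
  step_normalize(Age, played_hours) |>
  step_dummy(experience)

# KNN model with the number of neighbours left open for tuning
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# Candidate K values (an arbitrary grid for this sketch)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

# Grid search: cross-validated accuracy for each candidate K
knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = players_folds, grid = k_grid,
            metrics = metric_set(accuracy)) |>
  collect_metrics()

# K with the highest mean cross-validated accuracy
best_k <- knn_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)
best_k
```

Because the recipe is part of the workflow, the scaling parameters are re-estimated inside each fold, which keeps information from the assessment rows out of the preprocessing step.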

#### Summary

This approach will help determine whether players with similar profiles (in terms of age, experience, and playtime) share similar subscription behaviours. By comparing model performance across different values of K, we can assess how well similarity in these features predicts newsletter engagement. This method is especially suitable when relationships between predictors and outcomes are complex or non-linear, which is a likely scenario in human gameplay data.