In [None]:
# load libraries
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

# Individual Project Planning

## Data Description
In this project, we aim to explore player engagement and predict user behaviour on a research Minecraft server operated by the UBC Computer Science department. The server logs detailed player activity across multiple sessions, providing a unique opportunity to analyze how different player characteristics and behaviours relate to participation patterns. Two datasets were provided for analysis:

In [None]:
players_data <- read_csv("players.csv")
head(players_data)

#### players.csv

The `players.csv` file contains information about each unique player. It includes 196 observations and seven variables:
- **experience**: a character variable that describes the player's skill level as one of five categories.
    - The 5 categories: 'Beginner', 'Amateur', 'Regular', 'Pro', 'Veteran'.
- **subscribe**: a logical variable that indicates whether the player is subscribed to the newsletter.
- **hashedEmail**: a character variable that holds a unique anonymized player ID.
- **played_hours**: a double variable that gives the total playtime in hours.
- **name**: a character variable that reports the first name of the player.
- **gender**: a character variable that gives the gender of the player.
- **Age**: a double variable that gives the age of the player in years.
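
Because `experience` is read in as plain text, R sorts its categories alphabetically by default. The sketch below shows one way to store it as a factor with explicit levels. It uses a small made-up tibble (`players_example`) rather than the real file, and it assumes that the order in which the categories are listed above reflects increasing skill, which the data description does not confirm.

```r
library(tidyverse)

# Hypothetical rows standing in for players.csv (values are made up)
players_example <- tibble(
    experience = c("Pro", "Beginner", "Veteran", "Amateur", "Regular"),
    played_hours = c(30.3, 0.1, 3.8, 0.7, 1.2)
)

# Assumed ordering: the order in which the categories are listed above
experience_levels <- c("Beginner", "Amateur", "Regular", "Pro", "Veteran")

# Store experience as a factor so tables and plots follow this order
players_example <- players_example |>
    mutate(experience = factor(experience, levels = experience_levels))

levels(players_example$experience)
```

Any value that is not one of the five listed categories would become `NA` after this conversion, so it is worth checking for missing values afterwards.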

In [None]:
summary_player <- summary(players_data)
summary_player

In [None]:
sessions_data <- read_csv("sessions.csv")
head(sessions_data)

#### sessions.csv

The `sessions.csv` file records the individual play sessions of each player. It contains 1535 observations and five variables:
- **hashedEmail**: a character variable that holds the unique anonymized player ID linking session data to player data.
- **start_time**: a character variable that reports the session start timestamp.
- **end_time**: a character variable that reports the session end timestamp.
- **original_start_time**: a double variable that gives the session start time in UNIX time (milliseconds).
- **original_end_time**: a double variable that gives the session end time in UNIX time (milliseconds).
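
Since the UNIX columns are stored in milliseconds, they need to be divided by 1000 before R can treat them as date-times. The sketch below shows how session length and per-player totals could be derived. The tibble `sessions_example`, its player IDs, and its timestamps are invented for illustration and are not taken from `sessions.csv`.

```r
library(tidyverse)

# Hypothetical sessions; timestamps are made-up UNIX times in milliseconds
sessions_example <- tibble(
    hashedEmail = c("player_a", "player_a", "player_b"),
    original_start_time = c(1.7e12, 1.7e12 + 3.6e6, 1.7e12 + 7.2e6),
    original_end_time = c(1.7e12 + 1.8e6, 1.7e12 + 4.5e6, 1.7e12 + 7.8e6)
)

sessions_example <- sessions_example |>
    mutate(
        # as.POSIXct() expects seconds, so convert from milliseconds first
        start_dt = as.POSIXct(original_start_time / 1000,
                              origin = "1970-01-01", tz = "UTC"),
        end_dt = as.POSIXct(original_end_time / 1000,
                            origin = "1970-01-01", tz = "UTC"),
        # Session length in minutes
        duration_min = as.numeric(difftime(end_dt, start_dt, units = "mins"))
    )

# Number of sessions and total minutes played per player
sessions_per_player <- sessions_example |>
    group_by(hashedEmail) |>
    summarize(n_sessions = n(), total_minutes = sum(duration_min))

sessions_per_player
```

A per-player summary like this could later be joined to the player table through `hashedEmail`.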

In [None]:
summary_sessions <- summary(sessions_data)
summary_sessions

## Question

#### Broad Question
What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?
#### Specific Question
Are more experienced or older players more likely to subscribe to the newsletter than new or younger players?
- Response variable: `subscribe`.
- Explanatory variables: `experience`, `Age`, `played_hours`.
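
Classification models in tidymodels expect the response to be a factor, while `subscribe` is read in as a logical. The sketch below converts it and checks the class balance. It uses a made-up tibble (`players_example`), so the proportions it prints say nothing about the real data.

```r
library(tidyverse)

# Hypothetical players (values are made up)
players_example <- tibble(
    Age = c(17, 21, 19, 34, 22, 16, 25, 40),
    subscribe = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Convert the logical response to a factor; TRUE is listed first so that
# "subscribed" is treated as the event of interest by default in yardstick
players_example <- players_example |>
    mutate(subscribe = factor(subscribe, levels = c(TRUE, FALSE)))

# Count and proportion of each class
class_balance <- players_example |>
    group_by(subscribe) |>
    summarize(count = n()) |>
    mutate(proportion = count / sum(count))

class_balance
```

If one class turns out to be much more common than the other, accuracy alone can be misleading, which is one reason to report a confusion matrix as well.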

#### Data Relevance
The dataset is relevant to addressing the question because it captures both demographic characteristics (age and experience) and behavioral engagement metrics (playtime). By analyzing these variables, we can assess whether more experienced or older players are more inclined to subscribe, potentially indicating higher interest in the game's community or research goals.

## Exploratory Data Analysis and Visualizations

In [None]:
players_data <- read_csv("players.csv")
head(players_data)

In [None]:
mean_players <- summarize(players_data, 
                             mean_played_hours = mean(played_hours, na.rm = TRUE), 
                             mean_age = mean(Age, na.rm = TRUE))
mean_players

In [None]:
options(repr.plot.height = 8, repr.plot.width = 10)
age_playtime_plot <- players_data |>
    ggplot(aes(x = Age, y = played_hours, colour = subscribe)) +
    geom_point() +
    xlab("Age of the Player (in years)") +
    ylab("Total Number of Hours Played") +
    labs(colour = "Subscribed") +
    ggtitle("Relationship Between Playtime and Age by Subscription Status")
age_playtime_plot

## Methods and Plan

To predict whether a player subscribes to the newsletter based on their age, experience, and playtime, we will use a K-nearest neighbours (KNN) classification model.

KNN is a non-parametric and intuitive method that classifies each player by the similarity of their characteristics to those of other players in the dataset, which makes it an appropriate choice here. It also requires few assumptions:

- Observations that are close together in feature space are likely to belong to the same class.
- Predictor variables must be numeric and scaled so that distance calculations are meaningful.
- The choice of K significantly affects model performance, so it must be tuned carefully.
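
To see why scaling matters for distance calculations, the sketch below compares nearest neighbours before and after standardizing. The four players (A to D) and their values are invented for illustration. Only base R is used.

```r
# Four hypothetical players (values are made up); D has unusually high playtime
players_example <- data.frame(
    Age = c(20, 21, 45, 30),
    played_hours = c(5, 40, 10, 200),
    row.names = c("A", "B", "C", "D")
)

# Euclidean distances on the raw values
raw_dist <- as.matrix(dist(players_example))

# Euclidean distances after centering and scaling each column
scaled_dist <- as.matrix(dist(scale(players_example)))

# Nearest neighbour of player A (column 1 is A itself, so drop it)
nearest_raw <- names(which.min(raw_dist["A", -1]))
nearest_scaled <- names(which.min(scaled_dist["A", -1]))

nearest_raw
nearest_scaled
```

In this made-up example, the nearest neighbour of A is C on the raw values and B after scaling, so the choice of units alone can change which players count as "similar".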

However, KNN has a few potential limitations:

- It is sensitive to irrelevant variables and to the scale of each predictor.
- The choice of k matters, so k must be tuned using cross-validation.
- KNN does not provide easily interpretable coefficients, so interpretation may require additional visual elements.

To do this, we will first split the data into 80% training and 20% testing sets, stratifying by `subscribe`.

We will then tune the parameter k using cross-validation on the training data, comparing performance across different values of k and choosing the best one.



#### Data Processing Plan

- Split the data into training (80%) and testing (20%) sets using stratified sampling.
- Preprocess training data:
    - Convert experience to dummy variables
    - Center and scale age
- Cross-validate KNN on the training set.
- Apply best model to the held-out test set.
- Use confusion matrix and ROC curve for final evaluation.
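
The plan above could be sketched with tidymodels roughly as follows. This is a minimal outline, not the final analysis: it runs on simulated data (`players_sim`) with random labels, assumes the `kknn` package is installed, and uses 5 folds and a grid of odd k from 1 to 15 purely as placeholder choices. It also scales `played_hours` alongside `Age`, which is an added assumption, since KNN distances use every numeric predictor.

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)

# Simulated stand-in for players.csv (all values are made up)
n <- 200
experience_levels <- c("Beginner", "Amateur", "Regular", "Pro", "Veteran")
players_sim <- tibble(
    experience = factor(sample(experience_levels, n, replace = TRUE),
                        levels = experience_levels),
    Age = sample(10:50, n, replace = TRUE),
    played_hours = round(rexp(n, rate = 0.1), 1),
    subscribe = factor(sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.7, 0.3)),
                       levels = c(TRUE, FALSE))
)

# Stratified 80/20 split
players_split <- initial_split(players_sim, prop = 0.8, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

# Preprocessing: dummy variables for experience, scaling for numeric predictors
players_recipe <- recipe(subscribe ~ experience + Age + played_hours,
                         data = players_train) |>
    step_dummy(experience) |>
    step_normalize(Age, played_hours)

# KNN specification with k left to be tuned
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# Cross-validation on the training set over a placeholder grid of k values
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(from = 1, to = 15, by = 2))

knn_results <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(knn_tune_spec) |>
    tune_grid(resamples = players_vfold, grid = k_grid) |>
    collect_metrics()

# Pick the k with the highest cross-validated accuracy
best_k <- knn_results |>
    filter(.metric == "accuracy") |>
    slice_max(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)

# Refit on the full training set with the chosen k
knn_final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
    set_engine("kknn") |>
    set_mode("classification")

knn_fit <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(knn_final_spec) |>
    fit(data = players_train)

# Evaluate once on the held-out test set
test_predictions <- augment(knn_fit, new_data = players_test)

test_conf_mat <- conf_mat(test_predictions, truth = subscribe, estimate = .pred_class)
test_roc <- roc_curve(test_predictions, truth = subscribe, .pred_TRUE)

test_conf_mat
autoplot(test_roc)
```

Because the labels here are random, the resulting accuracy and ROC curve are meaningless. The sketch only shows how the steps fit together and that the test set is touched once, after k has been chosen.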