In [None]:
# load libraries
library(tidyverse)
library(repr)
library(tidymodels)
library(RColorBrewer)
options(repr.matrix.max.rows = 6)

# Individual Project Planning

## Data Description
In this project, we explore player engagement and predict user behaviour on a research Minecraft server operated by the UBC Computer Science department. The server logs detailed player activities across sessions, providing an opportunity to analyze how player characteristics and behaviours relate to their participation. Two datasets were provided for analysis:

In [None]:
players_data <- read_csv("players.csv")
head(players_data)

#### players.csv

The `players.csv` file contains information about each unique player. It has 196 observations of seven variables:
- **experience**: a character variable that describes an individual player's skill level in categories.
    - The 5 categories: 'Beginner', 'Amateur', 'Regular', 'Pro', 'Veteran'.
- **subscribe**: a logical variable that indicates whether the player is subscribed to the newsletter.
- **hashedEmail**: a character variable that holds a unique anonymized player ID.
- **played_hours**: a double variable that records the total number of hours played.
    - Min. = 0.00
    - Median = 0.100
    - Mean = 5.846
    - Max. = 223.100 
- **name**: a character variable that reports the first name of the player.
- **gender**: a character variable that reports the gender of the player.
- **Age**: a double variable that records the age of the player in years.
    - Min. = 9.00
    - Median = 19.00
    - Mean = 21.14
    - Max. = 58.00
    - NA's = 2

In [None]:
summary_player <- summary(players_data)
summary_player

In [None]:
sessions_data <- read_csv("sessions.csv")
head(sessions_data)

#### sessions.csv

The `sessions.csv` file records individual play sessions for each player. It has 1535 observations of five variables:
- **hashedEmail**: a character variable that displays the unique anonymized player ID linking session data to player data.
- **start_time**: a character variable that reports the session start timestamp. 
- **end_time**: a character variable that reports the session end timestamp.
- **original_start_time**: a double variable that displays the session start time in UNIX time (milliseconds).
    - Min. = 1.712e+12
    - Median = 1.719e+12
    - Mean = 1.719e+12
    - Max. = 1.727e+12
- **original_end_time**: a double variable that displays the session end time in UNIX time (milliseconds).
    - Min. = 1.712e+12
    - Median = 1.719e+12
    - Mean = 1.719e+12
    - Max. = 1.727e+12
    - NA's = 2
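
The UNIX-millisecond columns can be turned into readable timestamps and session lengths. The following is a minimal sketch in base R; the values `start_ms` and `end_ms` are hypothetical numbers chosen to fall in the range shown in the summary, not actual rows from `sessions.csv`.

```r
# Hypothetical session: starts at 1.719e12 ms and lasts 30 minutes (made-up values)
start_ms <- 1.719e12
end_ms   <- start_ms + 30 * 60 * 1000

# UNIX time counts from 1970-01-01 UTC; divide by 1000 to convert ms to seconds
session_start <- as.POSIXct(start_ms / 1000, origin = "1970-01-01", tz = "UTC")
session_end   <- as.POSIXct(end_ms / 1000, origin = "1970-01-01", tz = "UTC")

# Session length in minutes, computed directly from the millisecond values
session_minutes <- (end_ms - start_ms) / 1000 / 60

session_start
session_minutes
```

Because the summary prints these columns rounded to four significant digits, the exact timestamps have to be read from the data itself rather than from the summary table.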

In [None]:
summary_sessions <- summary(sessions_data)
summary_sessions

#### Potential Issues

Potential issues include:

- Self-reported values may contain errors.
- Sampling bias: we are unsure how the players were selected.
- Newsletter subscription behaviour may depend on unmeasured variables.
- Age distribution: younger players make up most of the dataset.
- Experience level may not be standardized.
- Missing values (2 in `Age`, 2 in `original_end_time`) and possible errors made when collecting the data.

## Question

#### Broad Question
What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?
#### Specific Question
Are more experienced or older players more likely to subscribe to the newsletter than new or younger players?
- Response variable: `subscribe`.
- Explanatory variables: `experience`, `Age`, `played_hours`.

#### Data Relevance
The dataset is relevant to addressing the question because it captures both demographic characteristics (age and experience) and behavioral engagement metrics (playtime). By analyzing these variables, we can assess whether more experienced or older players are more inclined to subscribe, potentially indicating higher interest in the game's community or research goals.
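
One way to start this comparison is to summarize the subscription rate within each experience level. The sketch below uses a small made-up tibble (`toy_players`) with the same column names as `players.csv`; the rows are illustrative only and are not taken from the real data.

```r
library(dplyr)

# Made-up rows with the same column names as players.csv (NOT the real data)
toy_players <- tibble(
  experience = c("Beginner", "Beginner", "Pro", "Pro", "Veteran", "Veteran"),
  Age        = c(15, 22, 19, 31, 17, 40),
  subscribe  = c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE)
)

# Proportion of subscribers and mean age within each experience level
subscription_by_experience <- toy_players |>
  group_by(experience) |>
  summarize(
    n_players      = n(),
    subscribe_rate = mean(subscribe),        # mean of a logical = proportion TRUE
    mean_age       = mean(Age, na.rm = TRUE)
  )
subscription_by_experience
```

Applying the same pipeline to `players_data` would give a first descriptive look at the question before any model is fitted.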

#### Data Wrangling
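Minimal wrangling will be required:
- Remove variables that do not contribute to the predictions (for example, `hashedEmail` and `name`).
- Handle the 2 missing values in `Age`, since KNN cannot compute distances for observations with missing predictors.
- `experience` is a categorical variable, so it must be converted into dummy variables.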
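
A minimal sketch of these wrangling steps is shown below. It runs on a made-up stand-in for `players_data` (`toy_players`), and dropping rows with a missing `Age` is just one possible choice, assumed here for simplicity. The dummy-variable conversion itself is left to the modelling recipe, so this step only turns `experience` into an ordered set of factor levels.

```r
library(dplyr)

# Toy stand-in for players_data (made-up rows, same column names and types)
toy_players <- tibble(
  experience   = c("Pro", "Veteran", "Amateur", "Regular", "Beginner"),
  subscribe    = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  hashedEmail  = c("a1", "b2", "c3", "d4", "e5"),
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1),
  name         = c("P1", "P2", "P3", "P4", "P5"),
  gender       = c("Male", "Male", "Female", "Female", "Male"),
  Age          = c(9, 17, NA, 21, 22)
)

experience_levels <- c("Beginner", "Amateur", "Regular", "Pro", "Veteran")

players_clean <- toy_players |>
  select(subscribe, experience, Age, played_hours) |>  # keep response and predictors
  filter(!is.na(Age)) |>                               # assumption: drop missing ages
  mutate(
    subscribe  = factor(subscribe, levels = c(TRUE, FALSE)),  # classification needs a factor
    experience = factor(experience, levels = experience_levels)
  )
players_clean
```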

## Exploratory Data Analysis and Visualizations

In [None]:
players_data <- read_csv("players.csv")
head(players_data)

In [None]:
mean_players <- summarize(players_data, 
                             mean_played_hours = mean(played_hours, na.rm = TRUE), 
                             mean_age = mean(Age, na.rm = TRUE))
mean_players

In [None]:
options(repr.plot.height = 8, repr.plot.width = 10)
age_playtime_plot <- players_data |>
    ggplot(aes(x = Age, y = played_hours, colour = subscribe)) +
    geom_point(alpha = 0.7) +
    xlab("Age of the Player (in years)") +
    ylab("Total Number of Hours Played") +
    labs(colour = "Subscribed") +
    ggtitle("Relationship Between Playtime and Age by Subscription Status")
age_playtime_plot

In [None]:
# Bar chart of player ages; the bars use one fixed colour, so no fill aesthetic is mapped
age_dist_plot <- ggplot(players_data, aes(x = Age)) +
    geom_bar(fill = "steelblue") +
    xlab("Player's Age (in years)") +
    ggtitle("Distribution of Player's Age")
age_dist_plot

In [None]:
exp_dist_plot <- ggplot(players_data, aes(x = experience)) +
    geom_bar(fill = 'lightgreen') +
    xlab("Player's Experience Level") +
    ggtitle("Distribution of Player's Experience")
exp_dist_plot

## Methods and Plan

To predict whether a player subscribes to the newsletter based on their age, experience, and playtime, we will use a K-nearest neighbours (KNN) classification model.

We chose KNN because it is a non-parametric, intuitive method that classifies players by how similar their characteristics are to those of other players in the dataset. It makes few assumptions, but the following conditions and considerations apply:

- Observations that are close together in feature space are likely to belong to the same class.
- Predictor variables must be numeric and scaled so that distance calculations are meaningful.
- The choice of K significantly affects model performance, so it must be tuned carefully.
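
The scaling requirement can be illustrated with a small base R sketch. The three players below are made up, and `nearest_to_first` is a hypothetical helper written only for this illustration. The same playtime is expressed once in hours and once in minutes to show that the nearest neighbour depends on the unit unless the predictors are standardized.

```r
# Three made-up players; the same playtime is expressed in hours and in minutes
age   <- c(20, 23, 20.5)
hours <- c(1, 1.2, 3)

in_hours   <- cbind(age = age, playtime = hours)
in_minutes <- cbind(age = age, playtime = hours * 60)

# Index of the nearest other player to player 1, using Euclidean distance
nearest_to_first <- function(x) {
  d <- as.matrix(dist(x))[1, ]
  d[1] <- Inf                # ignore the distance from player 1 to itself
  unname(which.min(d))
}

nearest_to_first(in_hours)     # player 3 is closest when playtime is in hours
nearest_to_first(in_minutes)   # player 2 is closest when playtime is in minutes

# After centering and scaling, the unit of measurement no longer matters
same_after_scaling <- isTRUE(all.equal(
  as.matrix(dist(scale(in_hours))),
  as.matrix(dist(scale(in_minutes)))
))
same_after_scaling
```

In the unscaled data, whichever variable has the larger numeric spread dominates the distance, which is why standardizing before fitting KNN is part of the plan.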

However, KNN has some limitations:

- It is sensitive to irrelevant variables and to the scale of the predictors.
- Its performance depends on the choice of k, which must be tuned using cross-validation.
- It does not provide easily interpretable coefficients, so interpretation may require additional visualizations.

To select the best KNN model, we will first split the data into training (80%) and testing (20%) sets, stratifying by `subscribe`. The training data will be preprocessed by converting `experience` into dummy variables and by centering and scaling the numeric predictors (`Age` and `played_hours`). We will then tune k using cross-validation on the training set and compare models across different values of k. The best-performing value of k will be selected, and the corresponding model will be refit on the training set and applied to the test set. Final evaluation will use a confusion matrix computed on the test set, which is not used during tuning, to give a fair assessment of model performance.
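
The plan above could be implemented roughly as follows with `tidymodels`. This is a sketch under several assumptions that are not fixed by the plan itself: the data are a randomly generated stand-in (`toy_players`), cross-validation uses 5 folds, the grid of k values is the odd numbers from 1 to 15, accuracy is the tuning metric, and the `kknn` package is installed as the engine.

```r
library(tidymodels)

set.seed(2024)

# Randomly generated stand-in for the cleaned player data (NOT the real data)
n <- 200
toy_players <- tibble(
  subscribe    = factor(sample(c("TRUE", "FALSE"), n, replace = TRUE, prob = c(0.7, 0.3))),
  experience   = factor(sample(c("Beginner", "Amateur", "Regular", "Pro", "Veteran"),
                               n, replace = TRUE)),
  Age          = round(runif(n, 9, 58)),
  played_hours = round(rexp(n, rate = 1 / 5), 1)
)

# 80/20 split, stratified on the response
player_split <- initial_split(toy_players, prop = 0.8, strata = subscribe)
player_train <- training(player_split)
player_test  <- testing(player_split)

# Preprocessing: center and scale the numeric predictors, dummy-code experience
knn_recipe <- recipe(subscribe ~ experience + Age + played_hours, data = player_train) |>
  step_normalize(Age, played_hours) |>
  step_dummy(experience)

# KNN classifier with the number of neighbours left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_workflow <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec)

# Cross-validation on the training set over a hypothetical grid of k values
player_folds <- vfold_cv(player_train, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

tuning_results <- tune_grid(knn_workflow, resamples = player_folds, grid = k_grid)
best_params <- select_best(tuning_results, metric = "accuracy")

# Refit on the full training set with the chosen k, then evaluate on the test set
final_fit <- knn_workflow |>
  finalize_workflow(best_params) |>
  fit(data = player_train)

test_predictions <- predict(final_fit, player_test) |>
  bind_cols(player_test)

test_conf_mat <- conf_mat(test_predictions, truth = subscribe, estimate = .pred_class)
test_conf_mat
```

Because the stand-in data are random, the resulting confusion matrix says nothing about the real players; the sketch only shows how the steps fit together and keeps the test set out of the tuning process.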