## Predicting Players' Subscription Status Using KNN Classification Modelling

## Introduction


A research group in Computer Science at UBC has collected data from a Minecraft research server, with the goal of predicting how players use it. This study investigates the factors associated with a player's decision to subscribe to a game-related newsletter. Such a subscription may serve as a proxy for deeper engagement with the server or interest in the research project.


**Research Question**

This study explores the question: *What player characteristics and behaviors are most predictive of subscribing to a game-related newsletter?*  
More specifically, we ask: *Can age, gender, experience level, and hours played predict whether a player subscribes to the newsletter?*

**Dataset Description**

To answer this question, we analyze a dataset collected from the Minecraft research server. Each row in the dataset represents an individual player, with the following variables:

- `experience`: The level or rank of the player (categorical)
- `played_hours`: The number of hours the player has spent on the server (numerical)
- `gender`: The gender identity of the player (categorical)
- `Age`: The age of the player in years (numerical, ordered)
- `subscribe`: Whether the player subscribed to the game-related newsletter (categorical)

The response variable is `subscribe`, and the explanatory variables are `Age`, `gender`, `experience`, and `played_hours`.
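
Before modelling a categorical response, it helps to know how balanced its classes are. The following is a minimal sketch of that check, assuming a small made-up table with the same column layout as described above; `toy_players` and its values are hypothetical and are not taken from the real dataset.

```r
library(tidyverse)

# Hypothetical toy rows mirroring the variables listed above;
# the values are invented purely for illustration.
toy_players <- tibble(
  experience   = c("Pro", "Amateur", "Veteran", "Amateur"),
  played_hours = c(30.3, 0.7, 3.8, 0.1),
  gender       = c("Male", "Female", "Male", "Female"),
  Age          = c(9, 21, 17, 19),
  subscribe    = c(TRUE, FALSE, TRUE, TRUE)
)

# Count and proportion of players in each class of the response variable
class_balance <- toy_players |>
  count(subscribe) |>
  mutate(proportion = n / sum(n))
class_balance
```

The same `count()` and `mutate()` steps could be applied to the real player table once it is loaded.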

This report will explore relationships between these variables and apply predictive modeling techniques to determine which characteristics are most useful in predicting newsletter subscription.


In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
source("cleanup.R")

In [None]:
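# Load the player data and drop the identifying columns and `experience`
player_data <- read_csv("https://raw.githubusercontent.com/Cna-51/minecraft_indiv/refs/heads/main/players%20(1).csv") |>
    select(-hashedEmail, -name, -experience) |>
    # Keep only players with recorded play time
    filter(played_hours > 0) |>
    # Classification models need the response to be a factor
    mutate(subscribe = as.factor(subscribe)) |>
    drop_na()
player_data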

In [None]:
player_plot <- player_data |>
    ggplot(aes(x = Age, y = played_hours, colour = subscribe)) +
    geom_point() +
    labs(x = "Player's Age (yrs)", y = "Played hours (hrs)", colour = "Subscribed", title = "Player's Age vs Played Hours")
player_plot

In [None]:
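# Set the seed before splitting so the train/test partition is reproducible
set.seed(1234)
# 70% training / 30% testing, stratified on the response
player_split <- initial_split(player_data, prop = 0.7, strata = subscribe)
player_training <- training(player_split)
player_testing <- testing(player_split)
player_training
player_testing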
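
The split above stratifies on `subscribe`, which is meant to keep the share of subscribers similar in the training and testing sets. The sketch below illustrates that effect on made-up data; `toy`, the 70/30 class mix, and the seed are assumptions chosen only for the illustration.

```r
library(rsample)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# Hypothetical toy data: 70 subscribers and 30 non-subscribers (invented)
toy <- data.frame(
  played_hours = runif(100, 0, 50),
  subscribe    = factor(rep(c("TRUE", "FALSE"), times = c(70, 30)))
)

toy_split <- initial_split(toy, prop = 0.7, strata = subscribe)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Share of subscribers in each piece; both should sit close to 0.7
mean(toy_train$subscribe == "TRUE")
mean(toy_test$subscribe == "TRUE")
```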

In [None]:
set.seed(1234)
player_recipe <- recipe(subscribe ~ played_hours + Age, data = player_training) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())
player_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
    set_engine("kknn") |>
    set_mode("classification")
player_fit <- workflow() |>
    add_recipe(player_recipe) |>
    add_model(player_spec) |>
    fit(data = player_training)
player_predictions <- predict(player_fit, player_testing) |>
    bind_cols(player_testing)
prediction_accuracy <- player_predictions |>
    metrics(truth = subscribe, estimate = .pred_class)
prediction_accuracy
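
The model specification above uses three neighbours with `weight_func = "rectangular"`, so each of the nearest neighbours gets an equal vote. As a rough illustration of that idea (not the `kknn` implementation itself), here is a base-R sketch on made-up, already-standardized points; `train_x`, `train_y`, and `knn_vote` are hypothetical names introduced for this example.

```r
# Hypothetical, already-standardized training points (invented for illustration)
train_x <- matrix(c(-1.0, -0.5,
                    -0.8,  0.2,
                     0.9,  1.1,
                     1.2,  0.7,
                     0.1, -0.2),
                  ncol = 2, byrow = TRUE)
train_y <- c("FALSE", "FALSE", "TRUE", "TRUE", "TRUE")

knn_vote <- function(new_point, x, y, k = 3) {
  # Euclidean distance from the new point to every training point
  dists <- sqrt(rowSums(sweep(x, 2, new_point)^2))
  # Labels of the k closest training points
  nearest <- y[order(dists)[seq_len(k)]]
  # Unweighted majority vote among those labels
  names(which.max(table(nearest)))
}

knn_vote(c(1.0, 1.0), train_x, train_y)   # nearest three are all "TRUE"
knn_vote(c(-1.0, 0.0), train_x, train_y)  # two of the nearest three are "FALSE"
```

Choosing an odd number of neighbours, as the model above does, avoids tied votes when there are two classes.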