In [None]:
library(tidyverse)
library(repr)
library(readxl)
library(tidymodels)
source("cleanup.R")
options(repr.matrix.max.rows = 6)

# KNN Classification of Player Subscription Patterns

## Introduction
At UBC, a computer science research team directed by Frank Wood is collecting data about how people play video games. The data will be used to train autoregressive diffusion models, in an attempt to create a model that learns continuously. To collect this data, the team set up a Minecraft server that tracks and records the movements and actions of players. To run this research, they need to recruit and record enough players to obtain sufficient data while still having enough resources, such as software licenses, to run the project. To support this work, they want to know who subscribes to the newsletter, since stakeholders may want to know which demographic subscribes.


To find out who is most likely to subscribe, we will build a model that predicts whether a player will **subscribe** to the newsletter based on the player's **playing hours** and **age**. To do this, we will select and tidy the data relevant to the question.


The data set contains seven variables describing each player: experience, subscribe, hashed email, played_hours, name, gender, and age. The variables age and played_hours are numeric, whilst the rest are non-numeric (character, or logical in the case of subscribe). The data will be wrangled into a tidy form for use in the model. The model is a K-nearest neighbors (KNN) classifier, fit with the kknn engine, that predicts subscription from the played hours and age of the players. Ultimately, we aim to build a prediction model whose results will aid the UBC computer science research team in their research.

## Methods and Results



### Load and wrangle data
The analysis begins with loading and preparing the dataset, players.csv. This dataset contains three variables relevant to our predictive task: subscription, total hours played, and age. Before any modeling, the data must be cleaned so that it is tidy and workable. The first step is to select these three variables and convert the subscription variable into a factor, so that it is treated as a categorical outcome in the classification model. Rows with missing values for either age or played hours are then removed.


In [None]:
# Load the data, keep the three relevant columns, and drop rows with missing predictors
player_data <- read_csv("players.csv") |>
    select(subscribe, played_hours, Age) |>
    mutate(subscribe = as.factor(subscribe)) |>
    filter(!is.na(Age), !is.na(played_hours))

# Split 75/25 into training and testing sets, stratified by the outcome
set.seed(5)
player_split <- initial_split(player_data, prop = 3/4, strata = subscribe)
player_train <- training(player_split)
player_test <- testing(player_split)
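
Because the split is stratified on `subscribe`, the share of subscribers should be similar in the training and testing sets. The sketch below shows one way to check this. It runs on a hypothetical toy data set (`toy_players`, with made-up values that only mimic the column names of `player_data`), and `class_share()` is a helper introduced here for illustration, not part of the original analysis.

```r
library(tidymodels)

# Hypothetical toy data standing in for player_data (same column names, made-up values)
set.seed(5)
toy_players <- tibble(
  subscribe    = factor(rep(c(TRUE, FALSE), times = c(73, 27))),
  played_hours = rexp(100, rate = 0.2),
  Age          = sample(9:58, 100, replace = TRUE)
)

toy_split <- initial_split(toy_players, prop = 3/4, strata = subscribe)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Illustrative helper: count each class and convert the counts to proportions
class_share <- function(data) {
  data |>
    count(subscribe) |>
    mutate(share = n / sum(n))
}

class_share(toy_train)
class_share(toy_test)
```

The same helper could be applied to `player_train` and `player_test` to confirm that both sets have a comparable proportion of subscribers.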

### Summary of Data Set

In [None]:
#summary of data relevant to analysis
nrow(player_data)
ncol(player_data)

summary(player_data)

**Summary Table of Variables from player_data** <br>
Below is a summary of the relevant variables and their descriptions for the player data set: <br>

|Variable Name|Data Type|Description/Meaning|Summary Statistics/Values|
|:-------------:|:---------:|:-------------------:|:-------------------------:|
|subscribe| logical (converted to factor) | Whether the player is subscribed to the newsletter | TRUE = 142, FALSE = 52|
|played_hours| numeric | Total hours played by each player | Mean = 5.95, Median = 0.10, Min = 0.00, Max = 223.10|
|Age| numeric | Player's age in years | Mean = 21.14, Median = 19.00, Min = 9.00, Max = 58.00|

Number of rows: 194 <br>
Number of Columns: 3
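
The statistics in the table can also be computed explicitly rather than read off `summary()`. The sketch below is one way to do this. `summarise_numeric()` is a helper name introduced here for illustration, and the toy values are made up so that the block runs on its own.

```r
library(dplyr)

# Illustrative helper: mean, median, min and max of one numeric column
summarise_numeric <- function(data, column) {
  data |>
    summarise(
      mean_value   = mean({{ column }}, na.rm = TRUE),
      median_value = median({{ column }}, na.rm = TRUE),
      min_value    = min({{ column }}, na.rm = TRUE),
      max_value    = max({{ column }}, na.rm = TRUE)
    )
}

# Hypothetical toy values, not taken from players.csv
toy_hours <- tibble(played_hours = c(0, 0.1, 0.5, 2, 30))
hours_summary <- summarise_numeric(toy_hours, played_hours)
hours_summary
```

Calling `summarise_numeric(player_data, played_hours)` or `summarise_numeric(player_data, Age)` would produce the corresponding row of the table, and `count(player_data, subscribe)` would give the class counts.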

## Visualizations for Player Data Set - Exploratory Data Analysis 

In [None]:
library(RColorBrewer)

Visualization 1: Distribution of Age

In [None]:
options(repr.plot.width = 13, repr.plot.height = 8)

dist_age_player <- player_data |>
    ggplot(aes(x = Age)) +
    geom_histogram(binwidth = 5, fill = "steelblue", color = "black") +
    labs(x = "Age of Player (Years)",
         y = "Number of Players") +
    ggtitle("Distribution of Player Ages") +
    theme(text = element_text(size = 20))

dist_age_player

Visualization 2: Distribution of Player Hours

In [None]:
options(repr.plot.width = 13, repr.plot.height = 8)

dist_player_hours <- player_data |>
    ggplot(aes(x = played_hours)) +
    geom_histogram(binwidth = 10, fill = "skyblue", color = "black") +
    labs(x = "Hours Played by Each Player (Hours)",
         y = "Number of Players") +
    ggtitle("Distribution of Hours Played") +
    theme(text = element_text(size = 20))

dist_player_hours

Visualization 3: Player age vs Hours played vs Subscribed or Not

In [None]:
options(repr.plot.width = 13, repr.plot.height = 8)

played_hrs_vs_age <- player_data |>
    ggplot(aes(x = played_hours, y = Age, color = subscribe)) +
    geom_point(alpha = 1) +
    labs(x = "Played Hours by Each Player (Hours)",
         y = "Age of Player (Years)",
         color = "Subscribed") +
    ggtitle("Player Age vs Hours Played, by Subscription Status") +
    theme(text = element_text(size=20)) +
    scale_color_manual(values = c("TRUE" = "steelblue", "FALSE" = "red"))

played_hrs_vs_age
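
Since the median of `played_hours` is 0.10 while the maximum is 223.10, most points in this plot sit very close to zero. One possible remedy, shown here only as a sketch, is to plot `log1p(played_hours)`, that is log(1 + hours), which spreads out the small values and is still defined at zero. The toy data below are made up for illustration, and the transformation is a suggestion for visualization only, not something the model above uses.

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data with the same column names as player_data
toy_players <- tibble(
  played_hours = c(0, 0.1, 0.3, 1.2, 5, 48, 223.1),
  Age          = c(17, 19, 21, 22, 17, 20, 17),
  subscribe    = factor(c("FALSE", "TRUE", "TRUE", "FALSE", "TRUE", "TRUE", "TRUE"))
)

# Plot age against log(1 + hours) so that low-hour players are distinguishable
toy_plot <- toy_players |>
  mutate(log_hours = log1p(played_hours)) |>
  ggplot(aes(x = log_hours, y = Age, color = subscribe)) +
  geom_point(alpha = 0.6) +
  labs(x = "log(1 + Hours Played)",
       y = "Age of Player (Years)",
       color = "Subscribed")

toy_plot
```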

## Use V-fold cross-validation to choose K

In [None]:
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune())|>
            set_engine('kknn') |>
            set_mode('classification')

player_recipe <- recipe(subscribe ~ played_hours + Age, data = player_train)|>
                step_scale(all_predictors())|>
                step_center(all_predictors())

player_vfold <- vfold_cv(player_train, v = 5, strata = subscribe)

k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 1))

player_k_results <- workflow()|>
                    add_recipe(player_recipe)|>
                    add_model(knn_tune)|>
                    tune_grid(resamples = player_vfold, grid = k_vals)|>
                    collect_metrics()

In [None]:
player_k_accuracy <- player_k_results|>
                    filter(.metric == "accuracy")

player_k_accuracy_plot <- ggplot(player_k_accuracy, aes(x = neighbors, y = mean)) +
                        geom_point() +
                        geom_line() +
                        labs(x = "Number of Neighbors (K)",
                             y = "Cross-Validation Accuracy Estimate")

player_k_best <- player_k_results|>
                    filter(.metric == "accuracy")|>
                    arrange(desc(mean))|>
                    slice(1)|>
                    pull(neighbors)

head(player_k_accuracy)
player_k_accuracy_plot
player_k_best
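
To make concrete what the recipe and the model specification do together, the sketch below implements the same idea in base R: standardize the predictors using the training data, find the K nearest training points, and take an unweighted majority vote (which is what `weight_func = "rectangular"` requests). It assumes Euclidean distance, uses made-up toy values, and `knn_predict()` is a name introduced here for illustration. It is not the kknn implementation.

```r
# Hypothetical toy training data, not taken from players.csv
train <- data.frame(
  played_hours = c(0.1, 0.5, 1.0, 40, 60, 80),
  Age          = c(17, 18, 19, 24, 25, 26),
  subscribe    = factor(c("FALSE", "FALSE", "FALSE", "TRUE", "TRUE", "TRUE"))
)
new_player <- data.frame(played_hours = 50, Age = 23)

knn_predict <- function(train, new_obs, predictors, outcome, k) {
  # Centre and scale with statistics from the training data only
  centers <- colMeans(train[predictors])
  scales  <- apply(train[predictors], 2, sd)
  train_std <- scale(train[predictors], center = centers, scale = scales)
  new_std   <- scale(new_obs[predictors], center = centers, scale = scales)

  # Euclidean distance from the new observation to every training point
  distances <- sqrt(rowSums(sweep(train_std, 2, as.numeric(new_std))^2))

  # Unweighted majority vote among the k nearest neighbours
  nearest <- order(distances)[seq_len(k)]
  votes <- table(train[[outcome]][nearest])
  names(votes)[which.max(votes)]
}

prediction <- knn_predict(train, new_player,
                          predictors = c("played_hours", "Age"),
                          outcome = "subscribe", k = 3)
prediction
```

Standardizing matters here because `played_hours` spans a much wider range than `Age`; without it, distances would be dominated by hours played.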

### Building the Model with the Chosen K Value

In [None]:
# neighbors = 19 is hardcoded; it should match the value of player_k_best reported above
player_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 19)|>
            set_engine('kknn') |>
            set_mode('classification')

player_fit <- workflow()|>
                    add_recipe(player_recipe)|>
                    add_model(player_spec)|>
                    fit(data = player_train)

player_prediction <- predict(player_fit, player_test)|>
                    bind_cols(player_test)

player_prediction_accuracy <- player_prediction|>
                            metrics(truth = subscribe, estimate = .pred_class)
player_prediction_accuracy
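
Accuracy is easier to interpret next to a baseline. In the full data set, 142 of 194 players are subscribed, so a classifier that always predicts "subscribed" would already be right about 73% of the time. The sketch below computes that baseline from the counts in the summary table and shows how a confusion matrix can be built with `conf_mat()` from yardstick. The `toy_predictions` values are made up for illustration and are not results from this analysis.

```r
library(yardstick)
library(tibble)

# Majority-class baseline from the class counts in the summary table
n_subscribed     <- 142
n_not_subscribed <- 52
baseline_accuracy <- max(n_subscribed, n_not_subscribed) /
  (n_subscribed + n_not_subscribed)
round(baseline_accuracy, 3)

# Hypothetical predictions with the same column names as player_prediction
toy_predictions <- tibble(
  subscribe   = factor(c("TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "TRUE"),
                       levels = c("FALSE", "TRUE")),
  .pred_class = factor(c("TRUE", "TRUE", "FALSE", "TRUE", "FALSE", "TRUE"),
                       levels = c("FALSE", "TRUE"))
)

# Rows are predicted classes, columns are true classes
toy_conf_mat <- conf_mat(toy_predictions, truth = subscribe, estimate = .pred_class)
toy_conf_mat
```

Applying `conf_mat(player_prediction, truth = subscribe, estimate = .pred_class)` would show whether the model's errors are concentrated among non-subscribers, which accuracy alone does not reveal.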

### Visualization of Data Analysis

In [None]:
test_accuracy <- player_prediction |>
    mutate(correct = .pred_class == subscribe)

options(repr.plot.width = 13, repr.plot.height = 8)

test_accuracy_plot <- test_accuracy |>
    ggplot(aes(x = played_hours, y = Age, color = correct)) +
    geom_point() +
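    # Name the values so colors and labels stay correct even if only one outcome occurs
    scale_color_manual(values = c("FALSE" = "red", "TRUE" = "steelblue"),
                       labels = c("FALSE" = "Incorrect", "TRUE" = "Correct")) +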
    labs(x = "Hours Played (Hours)",
         y = "Player's Age (Years)",
         color = "Prediction") +
    ggtitle("KNN Classification Accuracy on Test Data") +
    theme(text = element_text(size=20))

test_accuracy_plot
