### **DSCI100 Project Final Report - Predicting Game Newsletter Subscriptions**

**Introduction**

A research group in Computer Science at UBC is collecting data on how people play video games. The group recruited players and recorded several kinds of information about them as they played on a Minecraft server set up for the study.
One of the broad questions explored by this study is "What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?"

To investigate which variables suit this purpose, we ask: "Can the variables Age and played_hours accurately predict whether a player is subscribed to a game-related newsletter in the players dataset?"

To answer this question, we will use the players.csv dataset. It contains 196 observations of player-specific information across seven columns.

The seven columns are:
1. hashedEmail (character)
   - The hashed email address of each player, which serves as an identifier
   
2. experience (character)
   - Classification of the player type by experience
   - ordered: Beginner, Amateur, Regular, Pro, Veteran

3. subscribe (logical)
   - Whether the player is subscribed to the newsletter

4. name (character)
   - Name of each person participating

5. played_hours (double precision)
   - The total time the player has spent in the game (in hours)

6. gender (character)
   - The gender of the people participating

7. Age (double precision)
   - The age of each person participating

**Methods**

To perform our analysis, we will start by loading the players.csv data and naming it players. Next, since we will focus on the Age, played_hours, and subscribe variables, we will keep only those three columns, remove the rows with NA in the Age column, and convert the subscribe column from a logical to a factor.

The data analysis will start by splitting the players dataset, with 75% of the data allocated to the training set and 25% to the testing set. The testing set will be set aside while we build a K-nearest neighbours (KNN) classifier with the training set. Since K-nearest neighbours is sensitive to the scale of the predictors, the predictors will be standardized in the recipe. Next, a K-nearest neighbours model specification will be made, with the "neighbors" argument set to tune() so that the best K value can be found.
A simple tibble will list the K values to test, and the vfold_cv function will create the folds for five-fold cross-validation. After this, a workflow will combine the recipe and the model specification. Because K is being tuned, the tune_grid function will be used instead of fit, taking the cross-validation folds as its resamples argument and the tibble of K values as its grid argument. Lastly, we will use the collect_metrics function to aggregate the mean and standard error of each metric across the folds.
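
To illustrate why the predictors are standardized before KNN, here is a minimal base R sketch. The three toy players are hypothetical values chosen for illustration, not rows from players.csv. On the raw scale, the distance from player 1 to player 2 (roughly 150) is driven almost entirely by played_hours, while the 23-year age gap to player 3 counts for far less; after standardizing, the two distances are nearly equal.

```r
# Hypothetical toy players (assumed values, not taken from players.csv)
toy <- data.frame(
  Age          = c(17, 18, 40),
  played_hours = c(0.5, 150, 1.0)
)

# Euclidean distance from player 1 to players 2 and 3 on the raw scale
raw_dist <- as.matrix(dist(toy))[1, 2:3]

# Standardize each column to mean 0 and standard deviation 1
toy_scaled <- as.data.frame(scale(toy))

# The same two distances after standardizing
scaled_dist <- as.matrix(dist(toy_scaled))[1, 2:3]

print(raw_dist)
print(scaled_dist)
```

The recipe in the analysis below performs the equivalent transformation with step_scale and step_center, estimated from the training set only.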

Below, we will perform the analysis. At the top of each code cell will be a comment explaining what is being accomplished in the cell.

In [4]:
# Importing relevant packages for answering our question
library(tidyverse)
library(repr)
library(tidymodels)

options(repr.matrix.max.rows = 6)

In [5]:
# Reading in the players.csv dataset
players_url <- "https://raw.githubusercontent.com/ryan-jleung/DSCI-planning-individual/main/players.csv"
players <- read_csv(players_url)

[1mRows: [22m[34m196[39m [1mColumns: [22m[34m7[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (4): experience, hashedEmail, name, gender
[32mdbl[39m (2): played_hours, Age
[33mlgl[39m (1): subscribe

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [34]:
# Wrangling so that players_data has only the relevant columns and the "subscribe" column is a factor rather than a logical.
# Also removes rows with NA in the "Age" column
players_data <- players |>
    select(Age, played_hours, subscribe) |>
    filter(!is.na(Age)) |>
    mutate(subscribe = as_factor(subscribe))
    

players_data

Age,played_hours,subscribe
<dbl>,<dbl>,<fct>
9,30.3,TRUE
17,3.8,TRUE
17,0.0,FALSE
⋮,⋮,⋮
22,0.3,FALSE
17,0.0,FALSE
17,2.3,FALSE


In [37]:
# Splitting the players data into a testing and training set
players_split <- initial_split(players_data, prop=0.75, strata=subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

players_train
players_test

Age,played_hours,subscribe
<dbl>,<dbl>,<fct>
17,0,FALSE
21,0,FALSE
22,0,FALSE
⋮,⋮,⋮
17,0,TRUE
20,0,TRUE
17,0,TRUE


Age,played_hours,subscribe
<dbl>,<dbl>,<fct>
17,3.8,TRUE
21,0.7,TRUE
17,0.1,TRUE
⋮,⋮,⋮
17,0.0,FALSE
22,0.3,FALSE
17,2.3,FALSE
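
The strata argument asks for a split that keeps the proportion of each subscribe class roughly the same in both sets. As a rough illustration of the idea, the hand-rolled base R sketch below samples 75% of the rows within each class separately. It is not how rsample implements initial_split; the toy labels and the seed are assumptions made up for this example.

```r
set.seed(1)  # arbitrary seed, only so this toy example is repeatable

# Hypothetical labels (not the real subscribe column): 32 TRUE, 12 FALSE
toy <- data.frame(
  id        = 1:44,
  subscribe = factor(rep(c("TRUE", "FALSE"), times = c(32, 12)))
)

# Sample 75% of the row indices within each class separately
rows_by_class <- split(seq_len(nrow(toy)), toy$subscribe)
train_idx <- unlist(
  lapply(rows_by_class, function(rows) sample(rows, size = floor(0.75 * length(rows)))),
  use.names = FALSE
)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class proportions in each set
print(prop.table(table(toy_train$subscribe)))
print(prop.table(table(toy_test$subscribe)))
```

In this toy case both sets keep the original 32:12 class ratio exactly, because 75% of each class size is a whole number; with real data the proportions only match approximately.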


In [39]:
# Creating a recipe and scaling/centering the data to standardize and ensure age and played_hours contribute equally to the classification algorithm
players_recipe <- recipe(subscribe ~ Age + played_hours, data=players_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

players_recipe



[36m──[39m [1mRecipe[22m [36m──────────────────────────────────────────────────────────────────────[39m



── Inputs 

Number of variables by role

outcome:   1
predictor: 2



── Operations 

[36m•[39m Scaling for: [34mall_predictors()[39m

[36m•[39m Centering for: [34mall_predictors()[39m



In [42]:
# Creating the classifier for KNN-Classification using the straight-line distance
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors=tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

knn_spec

K-Nearest Neighbor Model Specification (classification)

Main Arguments:
  neighbors = tune()
  weight_func = rectangular

Computational engine: kknn 
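
With weight_func = "rectangular", every one of the K nearest neighbours gets an equal vote. The following base R function is a minimal sketch of that idea for a single new observation. The function name knn_vote and the toy data are hypothetical, and this is not the kknn implementation.

```r
# Unweighted K-nearest-neighbours vote for one new point.
# train_x: numeric matrix of already-standardized predictors (one row per player)
# train_y: character vector of class labels
# new_x:   numeric vector with one value per predictor
# k:       number of neighbours that vote
knn_vote <- function(train_x, train_y, new_x, k) {
  # Straight-line (Euclidean) distance from new_x to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Labels of the k closest rows; each neighbour gets one equal vote
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Most frequent label (ties go to the label that sorts first)
  names(which.max(table(nearest)))
}

# Hypothetical standardized toy data (assumed values, not from players.csv)
train_x <- rbind(c(-1.0, -0.5), c(-0.8, -0.4), c(1.2, 0.9), c(1.0, 1.1), c(0.9, 0.8))
train_y <- c("FALSE", "FALSE", "TRUE", "TRUE", "TRUE")

print(knn_vote(train_x, train_y, new_x = c(1.0, 1.0), k = 3))     # "TRUE"
print(knn_vote(train_x, train_y, new_x = c(-0.9, -0.45), k = 3))  # "FALSE"
```

An odd K avoids tied votes in a two-class problem like this one.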


In [53]:
# Creating the 5-fold cross-validation splits and a tibble of candidate K values.
# Note: Given that the players dataset has fewer than 200 observations, testing K values up to K=15 seems reasonable to avoid over/underfitting.

k_values <- tibble(neighbors=seq(1,15, by=1))
cv_folds <- vfold_cv(players_train, v=5, strata=subscribe)

k_values
# cv_folds

neighbors
<dbl>
1
2
3
⋮
13
14
15
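
To make the cross-validation step concrete, the base R sketch below deals a hypothetical 20-row training set into five folds and shows which rows are used for fitting and which for validation in each round. The row count and seed are assumptions for illustration. Unlike the vfold_cv call above, this simple version does not stratify by subscribe.

```r
set.seed(1)  # arbitrary seed for a repeatable toy example

n_rows <- 20  # hypothetical training-set size, not the real one
v <- 5        # number of folds, as in the cell above

# Shuffle the fold labels so each row lands in exactly one of v equal folds
fold_id <- sample(rep(seq_len(v), length.out = n_rows))

for (fold in seq_len(v)) {
  validation_rows <- which(fold_id == fold)
  analysis_rows   <- which(fold_id != fold)
  cat("Fold", fold, ": fit on", length(analysis_rows),
      "rows, validate on", length(validation_rows), "rows\n")
}
```

Each row is used for validation exactly once, so every candidate K is scored on five different held-out subsets.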


In [54]:
# Tuning the classifier with 5-fold cross-validation over K values of 1-15.
# Note: tune_grid() fits and evaluates the workflow for every K value in the grid on each fold.
players_fit <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = cv_folds, grid=k_values) |>
    collect_metrics()

players_fit

neighbors,.metric,.estimator,mean,n,std_err,.config
<dbl>,<chr>,<chr>,<dbl>,<int>,<dbl>,<chr>
1,accuracy,binary,0.4818391,5,0.04895715,Preprocessor1_Model01
1,roc_auc,binary,0.5094697,5,0.05505304,Preprocessor1_Model01
2,accuracy,binary,0.4818391,5,0.04895715,Preprocessor1_Model02
⋮,⋮,⋮,⋮,⋮,⋮,⋮
14,roc_auc,binary,0.4928455,5,0.040588780,Preprocessor1_Model14
15,accuracy,binary,0.7242529,5,0.009854939,Preprocessor1_Model15
15,roc_auc,binary,0.5143089,5,0.043242765,Preprocessor1_Model15
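
A natural next step is to keep only the accuracy rows and pick the K with the highest mean. The base R sketch below shows the mechanics using only the three accuracy rows visible in the truncated output above, so its answer applies to those rows alone and says nothing about the K values hidden behind the ellipsis. The data frame name visible_metrics is hypothetical.

```r
# Only the accuracy rows visible in the truncated output above;
# the rows hidden behind the ellipsis are deliberately left out.
visible_metrics <- data.frame(
  neighbors = c(1, 2, 15),
  .metric   = "accuracy",
  mean      = c(0.4818391, 0.4818391, 0.7242529),
  std_err   = c(0.04895715, 0.04895715, 0.009854939)
)

# Keep the accuracy metric and take the row with the highest mean
accuracy_rows <- visible_metrics[visible_metrics$.metric == "accuracy", ]
best_row <- accuracy_rows[which.max(accuracy_rows$mean), ]

print(best_row)
```

Applied to the full players_fit tibble, the same filter-then-maximum logic would compare all 15 candidate K values.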
