## Predicting Subscription to Minecraft Research Newsletter
DSCI 100: Group Project

<h2> Introduction </h2>

**Players Data Set**: The `players.csv` data set describes individual players of the game. It has 196 observations and the following 7 variables:
|Variable|Data Type|Description|
|---|---|---|
|**experience**|Character|Player's experience with the game: "Pro", "Veteran", "Regular", "Amateur", or "Beginner"|
|**subscribe**|Logical|`TRUE` if the player is subscribed to the newsletter, `FALSE` otherwise|
|**hashedEmail**|Character|Hashed (privacy-preserving) email address that uniquely identifies the player|
|**played_hours**|Double|Total time (hours) the player spent across all sessions|
|**name**|Character|Player's first name|
|**gender**|Character|Player's gender: "Female", "Male", "Agender", "Two Spirited", "Non-Binary", or "Prefer not to say"|
|**Age**|Double|Player's age in years|



**Summary Statistics:**
|Variable|Min|Max|Mean|Q1|Median (Q2)|Q3|
|---|---|---|---|---|---|---|
|played_hours (hrs)|0.00|223.10|5.85|0.00|0.10|0.60|
|Age (years)|9.00|58.00|21.14|17.00|19.00|22.75|

In [None]:
library(tidyverse)
library(janitor)
library(tidymodels)
library(repr)
library(GGally)

In [None]:
# load the data and standardize column names to snake_case (e.g. Age -> age)
players <- read_csv("players.csv") |>
    clean_names()
head(players)
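The summary statistics reported in the introduction could be reproduced along the following lines. This is a minimal sketch rather than the code originally used: `summarise_numeric()` is a hypothetical helper introduced here for illustration, and the commented usage assumes the cleaned column names `played_hours` and `age`.

```r
library(tidyverse)

# Hypothetical helper (not part of the original analysis): returns one row of
# summary statistics for a numeric vector, ignoring missing values.
summarise_numeric <- function(x) {
  tibble(
    min  = min(x, na.rm = TRUE),
    max  = max(x, na.rm = TRUE),
    mean = mean(x, na.rm = TRUE),
    q1   = quantile(x, 0.25, na.rm = TRUE, names = FALSE),
    q2   = median(x, na.rm = TRUE),
    q3   = quantile(x, 0.75, na.rm = TRUE, names = FALSE)
  )
}

# Intended use with the data loaded above:
# bind_rows(
#   played_hours = summarise_numeric(players$played_hours),
#   age          = summarise_numeric(players$age),
#   .id = "variable"
# )
```

Missing values are dropped inside the helper because the later analysis calls `drop_na()`, which suggests at least one column may contain them. Rounding the output to two decimals would match the layout of the summary table.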

**Predictive Question:** Can a player's experience level, total play time, and age predict whether they subscribe to the Minecraft research newsletter, based on the players data set?

<h2>Methods and Results</h2>

<h4>Exploring the Relationships Among the Three Chosen Predictors</h4>

In [None]:
options(repr.plot.width = 10, repr.plot.height = 10)
ggpairs(players, 
        columns = c("age", "experience", "played_hours"), 
        aes(color = subscribe)) +
    ggtitle("Figure 1. Visualization of the Relationship Between Age, Experience, and Played Hours") +
    theme(text = element_text(size = 14))

In [None]:
set.seed(2024)

players <- players |>
    select(subscribe, age, experience, played_hours) |>
    mutate(subscribe = as_factor(subscribe)) |>
    drop_na()  # removes rows with missing values
      

# Optional: recode experience as an ordinal number.
# Probably no longer needed for the current model; kept commented out for reference.
# players <- players |>
#     mutate(experience = recode(experience,
#                                "Beginner" = 1,
#                                "Amateur"  = 2,
#                                "Regular"  = 3,
#                                "Veteran"  = 4,
#                                "Pro"      = 5))

# split into training (75%) and testing (25%) sets, stratified by subscribe
players_split <- initial_split(players, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

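# K-nearest neighbors classification spec; the number of neighbors is tuned
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# recipe: drop zero-variance predictors, then scale and center the numeric ones
# (experience is categorical, so it is not scaled or centered)
players_recipe <- recipe(subscribe ~ experience + played_hours + age, data = players_train) |>
    step_zv(all_predictors()) |>
    step_scale(all_numeric_predictors()) |>
    step_center(all_numeric_predictors())

# 5-fold cross-validation on the training set, stratified by subscribe
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)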



# candidate values of k: 1 through 20
k_vals <- tibble(neighbors = seq(from = 1, to = 20, by = 1))

# tune k over the cross-validation folds and collect the metrics
knn_results <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(knn_tune_spec) |>
    tune_grid(resamples = players_vfold, grid = k_vals) |>
    collect_metrics()

# keep the mean cross-validation accuracy for each k
accuracies <- knn_results |>
    filter(.metric == "accuracy") |>
    select(neighbors, mean)

# pick the k with the highest mean cross-validation accuracy
best_k <- accuracies |>
    arrange(desc(mean)) |>
    head(1) |>
    pull(neighbors)
best_k
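Before settling on a single value of k, it can help to look at how the cross-validation accuracy changes across the whole grid, since a narrow peak may be noise. The sketch below is an illustration, not part of the original analysis: `plot_accuracy_vs_k()` is a hypothetical helper that assumes a data frame with the columns `neighbors` and `mean`, like `accuracies` above.

```r
library(tidyverse)

# Hypothetical helper (not part of the original analysis): plots the mean
# cross-validation accuracy against the number of neighbors.
# Expects a data frame with columns `neighbors` and `mean`.
plot_accuracy_vs_k <- function(accuracies) {
  ggplot(accuracies, aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Number of neighbors (k)",
         y = "Mean cross-validation accuracy") +
    theme(text = element_text(size = 14))
}

# Intended use with the tuning results above:
# plot_accuracy_vs_k(accuracies)
```

A value of k in a region where neighboring values give similar accuracy is usually a safer choice than an isolated maximum.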

In [None]:
# K-nearest neighbors spec with a fixed number of neighbors (k = 7)
players_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) |>
    set_engine("kknn") |>
    set_mode("classification")

# fit the final workflow on the training set
players_fit <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(players_spec) |>
    fit(data = players_train)
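A natural next step is to score the fitted workflow on the held-out test set. The sketch below shows one way this might be done. `evaluate_on_test()` is a hypothetical helper introduced here for illustration, and it assumes the response column is named `subscribe`, as in the recipe above. No results are shown because this section does not report them.

```r
library(tidymodels)

# Hypothetical helper (not part of the original analysis): predicts on held-out
# data with a fitted workflow and returns the accuracy and a confusion matrix.
# Assumes the true labels are in a factor column named `subscribe`.
evaluate_on_test <- function(fitted_workflow, test_data) {
  predictions <- predict(fitted_workflow, new_data = test_data) |>
    bind_cols(test_data)

  list(
    accuracy = predictions |>
      metrics(truth = subscribe, estimate = .pred_class) |>
      filter(.metric == "accuracy"),
    confusion = predictions |>
      conf_mat(truth = subscribe, estimate = .pred_class)
  )
}

# Intended use with the objects defined above:
# test_results <- evaluate_on_test(players_fit, players_test)
# test_results$accuracy
# test_results$confusion
```

Reporting the confusion matrix alongside accuracy is useful here because accuracy alone can look good when one class is much more common than the other.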
                                

<h2>Discussion</h2>