# Project Final Report (Group)


## Introduction

- provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your report
- clearly state the question you tried to answer with your project
- identify and fully describe the dataset that was used to answer the question

## Methods & Results

- describe the methods you used to perform your analysis from beginning to end that narrates the analysis code.
- your report should include code which:
    - loads data 
    - wrangles and cleans the data to the format necessary for the planned analysis
    - performs a summary of the data set that is relevant for exploratory data analysis related to the planned analysis 
    - creates a visualization of the dataset that is relevant for exploratory data analysis related to the planned analysis
    - performs the data analysis
    - creates a visualization of the analysis 
note: all figures should have a figure number and a legend


Can the player type, age, and played hours of players predict whether they will subscribe to a game-related newsletter, and which player type is the most predictive?

- ### Loading Libraries

Here, we load the repr, tidyverse, and tidymodels libraries using the library() function in order to use the specialized functions they provide for data manipulation, computation, and visualization.

In [None]:
library(repr)
library(tidyverse)
library(tidymodels)

- ### Reading Data

Here we read the "players.csv" dataset using the read_csv() function and display the first five rows using the slice_head() function to ensure that the dataset was read properly.

In [None]:
player <- read_csv("https://raw.githubusercontent.com/gatory/DSCI_100_Term_Project/refs/heads/main/data/players.csv")
slice_head(player, n=5)

Here, we remove the columns that are unnecessary for our purposes ("hashedEmail", "gender", and "name") using the select() function. Using the mutate() function, we convert the "experience" and "subscribe" variables to factors and recode "subscribe" from TRUE/FALSE to "Yes"/"No". We also display the first five rows using the slice_head() function to ensure that the dataset was altered properly.

In [None]:
tidy_player <- player |> 
    select(-hashedEmail, -gender, -name) |>
    mutate(experience = as_factor(experience), subscribe = as_factor(subscribe)) |>
    mutate(subscribe = recode(subscribe, "TRUE" = "Yes", "FALSE" = "No"))
slice_head(tidy_player, n=5)

Here we display a general summary of the new dataset using the summarize() function to calculate the average hours played by all players as well as the average age of all players.

In [None]:
tidy_player_mean <- tidy_player |> summarize(player_hours_mean = mean(played_hours, na.rm = TRUE), age_mean = mean(Age, na.rm = TRUE))
tidy_player_mean
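
Because the later models are fit separately for each player type, a summary grouped by player type can complement the overall means. Here is a minimal sketch of that idea, assuming the same column names as above; `toy_players` and all of its values are invented for illustration and are not taken from "players.csv".

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for tidy_player; every value here is invented
toy_players <- tibble(
    experience   = factor(c("Pro", "Pro", "Amateur", "Amateur", "Veteran")),
    subscribe    = factor(c("Yes", "No", "Yes", "Yes", "No")),
    played_hours = c(30.0, 1.0, 0.5, 2.5, 0.0),
    Age          = c(21, 17, 19, NA, 25)
)

# Count players and average the numeric variables within each player type
toy_summary <- toy_players |>
    group_by(experience) |>
    summarize(
        n_players         = n(),
        prop_subscribed   = mean(subscribe == "Yes"),
        player_hours_mean = mean(played_hours, na.rm = TRUE),
        age_mean          = mean(Age, na.rm = TRUE)
    )
toy_summary
```

The same pipeline could be applied to `tidy_player` in place of the toy tibble.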

Here we display two graphs using the ggplot() function. The first is a scatterplot titled "Hours Played vs. Age Relationship" that shows the hours played against the age of each player, colored by subscription status, with both axes on a log scale. The second is a bar graph titled "Distribution of Players Across Experience and Subscription" that shows the number of subscribed and non-subscribed players for each player type.

In [None]:
options(repr.plot.width = 10, repr.plot.height = 8) 
tidy_player_age_plot <- tidy_player |>
    ggplot(aes(x = Age, y = played_hours, color = subscribe)) +
    geom_point(alpha = 0.9) + 
    scale_x_log10() +
    scale_y_log10() +
    labs(x = "Age (years)", y = "Hours Played (hours)", color = "Subscribed?") +
    ggtitle("Hours Played vs. Age Relationship") +
    theme(text = element_text(size = 18))

tidy_players_experience_plot <- tidy_player |>
    ggplot(aes(y = experience, fill = subscribe)) +
    geom_bar(stat = "count") +
    labs(x = "Number of Players", y = "Player Type", fill = "Subscribed?") +
    ggtitle("Distribution of Players Across Experience and Subscription") +
    theme(text = element_text(size = 18))
tidy_player_age_plot
tidy_players_experience_plot

- ### Modeling Classification Models

- #### Universal Tuning Specifications for Subscription Prediction

Here we use the nearest_neighbor() function to create a model specification, with the number of neighbors left to be tuned, that we will later use to tune the models for all the individual player types.

In [None]:
set.seed(4923)
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

- #### Filter and Split Datasets

Here we use the filter() function to select only the rows for each player type and assign them to individual data sets. Then for each set, we split the data into training and testing sets so that our model will not see the data that will be used to test it.

In [None]:
set.seed(4923)
beginner_players <- tidy_player |> filter(experience == "Beginner")
beginner_split <- initial_split(beginner_players, prop = 0.75, strata = subscribe)
beginner_train <- training(beginner_split)
beginner_test <- testing(beginner_split)

regular_players <- tidy_player |> filter(experience == "Regular")
regular_split <- initial_split(regular_players, prop = 0.75, strata = subscribe)
regular_train <- training(regular_split)
regular_test <- testing(regular_split)

amateur_players <- tidy_player |> filter(experience == "Amateur")
amateur_split <- initial_split(amateur_players, prop = 0.75, strata = subscribe)
amateur_train <- training(amateur_split)
amateur_test <- testing(amateur_split)

veteran_players <- tidy_player |> filter(experience == "Veteran")
veteran_split <- initial_split(veteran_players, prop = 0.75, strata = subscribe)
veteran_train <- training(veteran_split)
veteran_test <- testing(veteran_split)

pro_players <- tidy_player |> filter(experience == "Pro") |> na.omit()
pro_split <- initial_split(pro_players, prop = 0.75, strata = subscribe)
pro_train <- training(pro_split)
pro_test <- testing(pro_split)
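
The filter-then-split steps above repeat the same pattern for every player type. As one possible way to express that pattern once, here is a sketch using a small helper; `split_player_type` is a hypothetical name introduced here, and `toy_players` is randomly generated data, not the real dataset.

```r
library(dplyr)
library(tibble)
library(rsample)

# Hypothetical helper: keep one player type, then make a stratified split
split_player_type <- function(data, type, prop = 0.75) {
    data |>
        filter(experience == type) |>
        initial_split(prop = prop, strata = subscribe)
}

set.seed(4923)
# Invented data: 40 "Pro" and 40 "Amateur" players, 75% of them subscribed
toy_players <- tibble(
    experience   = factor(rep(c("Pro", "Amateur"), each = 40)),
    subscribe    = factor(rep(c("Yes", "Yes", "Yes", "No"), times = 20)),
    played_hours = runif(80, 0, 10),
    Age          = sample(15:40, 80, replace = TRUE)
)

toy_split <- split_player_type(toy_players, "Pro")
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Every "Pro" row ends up in exactly one of the two sets
nrow(toy_train) + nrow(toy_test)
```

Whether a helper like this is clearer than the explicit version is a matter of taste; the explicit version keeps each player type's objects easy to find by name.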

- #### Creating 5 Folds for Each Player Type

Here we use the vfold_cv() function to create 5 cross-validation folds, stratified by subscription status, from the training set of each player type.

In [None]:
set.seed(4923)
beginner_vfold <- vfold_cv(beginner_train, v = 5, strata = subscribe)
regular_vfold <- vfold_cv(regular_train, v = 5, strata = subscribe)
amateur_vfold <- vfold_cv(amateur_train, v = 5, strata = subscribe)
veteran_vfold <- vfold_cv(veteran_train, v = 5, strata = subscribe)
pro_vfold <- vfold_cv(pro_train, v = 5, strata = subscribe)

- #### Tuning the Model Using a Suitable Range of K Values

Here, we choose a range of k values to try for each player type based on how many observations it has. Because the player types have different numbers of observations, we use the largest range possible for each type so that the best k value can be found during tuning.

In [None]:
# Tune for best k given proper range
set.seed(4923)
beginner_k_vals <- tibble(neighbors = seq(from = 1, to = 10, by = 1))
beginner_recipe <- recipe(subscribe ~ played_hours + Age, data = beginner_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())
beginner_fit <- workflow() |>
  add_recipe(beginner_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = beginner_vfold, grid = beginner_k_vals)

regular_k_vals <- tibble(neighbors = seq(from = 1, to = 10, by = 1))
regular_recipe <- recipe(subscribe ~ played_hours + Age, data = regular_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())
regular_fit <- workflow() |>
  add_recipe(regular_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = regular_vfold, grid = regular_k_vals)

amateur_k_vals <- tibble(neighbors = seq(from = 1, to = 20, by = 1))
amateur_recipe <- recipe(subscribe ~ played_hours + Age, data = amateur_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())
amateur_fit <- workflow() |>
  add_recipe(amateur_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = amateur_vfold, grid = amateur_k_vals)

veteran_k_vals <- tibble(neighbors = seq(from = 1, to = 22, by = 1))
veteran_recipe <- recipe(subscribe ~ played_hours + Age, data = veteran_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())
veteran_fit <- workflow() |>
  add_recipe(veteran_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = veteran_vfold, grid = veteran_k_vals)

pro_k_vals <- tibble(neighbors = seq(from = 1, to = 6, by = 1))
pro_recipe <- recipe(subscribe ~ played_hours + Age, data = pro_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())
pro_fit <- workflow() |>
  add_recipe(pro_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = pro_vfold, grid = pro_k_vals)
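
One way to make "a suitable range for k" concrete: in 5-fold cross-validation each model is fit on roughly four fifths of the training rows, so k cannot usefully exceed that count. The sketch below computes such an upper bound. `max_k_for_cv` is a hypothetical helper and the training-set sizes are invented, so the numbers are not the ones behind the ranges used above; stratified folds can also be slightly uneven, so this is only an approximation.

```r
# Hypothetical helper: rows available for fitting when the largest of v folds
# is held out from a training set with n_train rows
max_k_for_cv <- function(n_train, v = 5) {
    largest_fold <- ceiling(n_train / v)  # size of the biggest held-out fold
    n_train - largest_fold                # rows left to serve as neighbors
}

# Invented training-set sizes, for illustration only
toy_train_sizes <- c(beginner = 26, pro = 9)
toy_max_k <- sapply(toy_train_sizes, max_k_for_cv)
toy_max_k
```

A bound like this could then be passed as the `to` argument of seq() when building the grid of k values.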

- #### Analyzing Accuracies and Picking Best K

Here, we analyze the accuracies of the predictions made using each k value in each range for each player type. Based on the accuracy for each k value, we can choose the best one to use in our final model that we will test using the test data.

In [None]:
# Analyze for best k
options(repr.plot.width = 5, repr.plot.height = 5)
set.seed(4923)
beginner_accuracies <- beginner_fit |> collect_metrics() |>
  filter(.metric == "accuracy")
beginner_cross_val_plot <- ggplot(beginner_accuracies, aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Neighbors", y = "Accuracy Estimate", title = "Beginner Accuracies") +
    theme(text = element_text(size = 12))

regular_accuracies <- regular_fit |> collect_metrics() |>
  filter(.metric == "accuracy")
regular_cross_val_plot <- ggplot(regular_accuracies, aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Neighbors", y = "Accuracy Estimate", title = "Regular Accuracies") +
    theme(text = element_text(size = 12))

amateur_accuracies <- amateur_fit |> collect_metrics() |>
  filter(.metric == "accuracy")
amateur_cross_val_plot <- ggplot(amateur_accuracies, aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Neighbors", y = "Accuracy Estimate", title = "Amateur Accuracies") +
    theme(text = element_text(size = 12))

veteran_accuracies <- veteran_fit |> collect_metrics() |>
  filter(.metric == "accuracy")
veteran_cross_val_plot <- ggplot(veteran_accuracies, aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Neighbors", y = "Accuracy Estimate", title = "Veteran Accuracies") +
    theme(text = element_text(size = 12))

pro_accuracies <- pro_fit |> collect_metrics() |>
  filter(.metric == "accuracy")
pro_cross_val_plot <- ggplot(pro_accuracies, aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Neighbors", y = "Accuracy Estimate", title = "Pro Accuracies") +
    theme(text = element_text(size = 12))

beginner_cross_val_plot
regular_cross_val_plot
amateur_cross_val_plot
veteran_cross_val_plot
pro_cross_val_plot
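
Besides reading the best k off each plot, it can also be picked programmatically. Here is a minimal sketch on an invented accuracy table shaped like the filtered collect_metrics() output; `toy_accuracies` and its values are hypothetical, and the tie-breaking rule (prefer the smaller k) is our assumption rather than something the analysis above specifies.

```r
library(dplyr)
library(tibble)

# Invented cross-validation results, for illustration only
toy_accuracies <- tibble(
    neighbors = 1:5,
    .metric   = "accuracy",
    mean      = c(0.60, 0.72, 0.75, 0.75, 0.70)
)

# Sort by highest mean accuracy, break ties with the smaller k, keep the top row
toy_best_k <- toy_accuracies |>
    arrange(desc(mean), neighbors) |>
    slice(1) |>
    pull(neighbors)
toy_best_k
```

The same steps could be applied to `beginner_accuracies` and the other accuracy tables; the tune package also provides select_best() for this purpose.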

- #### Retraining Specification Model Using Best K

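Here we retrain the model for each player type on its full training set, using the k value that gave the best accuracy estimate in the plots above.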

In [None]:
# Retrain model using best k
# The recipes from the tuning step are reused so that the predictors are
# scaled and centered here in the same way as during tuning
set.seed(4923)
beginner_mnist_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
    set_engine("kknn") |>
    set_mode("classification")
beginner_mnist_fit <- workflow() |>
    add_recipe(beginner_recipe) |>
    add_model(beginner_mnist_spec) |>
    fit(data = beginner_train)

regular_mnist_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 6) |>
    set_engine("kknn") |>
    set_mode("classification")
regular_mnist_fit <- workflow() |>
    add_recipe(regular_recipe) |>
    add_model(regular_mnist_spec) |>
    fit(data = regular_train)

amateur_mnist_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 18) |>
    set_engine("kknn") |>
    set_mode("classification")
amateur_mnist_fit <- workflow() |>
    add_recipe(amateur_recipe) |>
    add_model(amateur_mnist_spec) |>
    fit(data = amateur_train)

veteran_mnist_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 20) |>
    set_engine("kknn") |>
    set_mode("classification")
veteran_mnist_fit <- workflow() |>
    add_recipe(veteran_recipe) |>
    add_model(veteran_mnist_spec) |>
    fit(data = veteran_train)

pro_mnist_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 4) |>
    set_engine("kknn") |>
    set_mode("classification")
pro_mnist_fit <- workflow() |>
    add_recipe(pro_recipe) |>
    add_model(pro_mnist_spec) |>
    fit(data = pro_train)

- #### Determine Final Accuracies

In [None]:
# See results and accuracies
set.seed(4923)
beginner_mnist_predictions <- predict(beginner_mnist_fit, beginner_test) |> bind_cols(beginner_test)
beginner_mnist_metrics <- beginner_mnist_predictions |> 
    metrics(truth = subscribe, estimate = .pred_class) |> 
    filter(.metric == "accuracy")

regular_mnist_predictions <- predict(regular_mnist_fit, regular_test) |> bind_cols(regular_test)
regular_mnist_metrics <- regular_mnist_predictions |> 
    metrics(truth = subscribe, estimate = .pred_class) |> 
    filter(.metric == "accuracy")

amateur_mnist_predictions <- predict(amateur_mnist_fit, amateur_test) |> bind_cols(amateur_test)
amateur_mnist_metrics <- amateur_mnist_predictions |> 
    metrics(truth = subscribe, estimate = .pred_class) |> 
    filter(.metric == "accuracy")

veteran_mnist_predictions <- predict(veteran_mnist_fit, veteran_test) |> bind_cols(veteran_test)
veteran_mnist_metrics <- veteran_mnist_predictions |> 
    metrics(truth = subscribe, estimate = .pred_class) |> 
    filter(.metric == "accuracy")

pro_mnist_predictions <- predict(pro_mnist_fit, pro_test) |> bind_cols(pro_test)
pro_mnist_metrics <- pro_mnist_predictions |> 
    metrics(truth = subscribe, estimate = .pred_class) |> 
    filter(.metric == "accuracy")
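
To make the accuracy metric concrete, here is a small worked example that computes it by hand and compares it with a simple baseline: always predicting the most common class. The labels and predictions are invented for illustration; they are not output from the models above.

```r
# Invented test-set labels and predictions, for illustration only
toy_truth <- factor(c("Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes"),
                    levels = c("No", "Yes"))
toy_pred  <- factor(c("Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "No"),
                    levels = c("No", "Yes"))

# Accuracy: share of predictions that match the true label
toy_accuracy <- mean(toy_pred == toy_truth)

# Baseline: accuracy of always predicting the most common true class
toy_baseline <- max(table(toy_truth)) / length(toy_truth)

c(accuracy = toy_accuracy, baseline = toy_baseline)
```

In this toy case the two numbers are equal, which shows why an accuracy value may be easier to interpret next to such a baseline, especially when most players in a group subscribe.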

In [None]:
beginner_results <- beginner_mnist_metrics |> mutate(experience = "beginner")
regular_results <- regular_mnist_metrics |> mutate(experience = "regular")
amateur_results <- amateur_mnist_metrics |> mutate(experience = "amateur")
veteran_results <- veteran_mnist_metrics |> mutate(experience = "veteran")
pro_results <- pro_mnist_metrics |> mutate(experience = "pro")

In [None]:
# Finalize results and graph them
total_metrics <- bind_rows(beginner_results, regular_results, amateur_results, veteran_results, pro_results) |>
    mutate(accuracy = .estimate * 100, experience = as_factor(experience)) |>
    select(experience, accuracy)
total_metrics

options(repr.plot.width = 10, repr.plot.height = 10)

# Use a new name so the earlier experience bar chart is not overwritten
accuracy_by_experience_plot <- total_metrics |>
    ggplot(aes(x = accuracy, y = experience)) +
    geom_bar(stat = "identity") +
    labs(x = "Accuracy (%)", y = "Player Type") +
    ggtitle("Predictability of Different Player Types") +
    theme(text = element_text(size = 18))

accuracy_by_experience_plot

## Discussion

- summarize what you found
- discuss whether this is what you expected to find?
- discuss what impact could such findings have?
- discuss what future questions could this lead to?

## References
- You may include references if necessary, as long as they all have a consistent citation style.