# Project Final Report (Group)


## Introduction

- provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your report
- clearly state the question you tried to answer with your project
- identify and fully describe the dataset that was used to answer the question

## Methods & Results

- describe the methods you used to perform your analysis from beginning to end, narrating the analysis code.
- your report should include code which:
    - loads data 
    - wrangles and cleans the data to the format necessary for the planned analysis
    - performs a summary of the data set that is relevant for exploratory data analysis related to the planned analysis 
    - creates a visualization of the dataset that is relevant for exploratory data analysis related to the planned analysis
    - performs the data analysis
    - creates a visualization of the analysis 

Note: all figures should have a figure number and a legend.


Can a player's type (experience level), age, and hours played predict whether they will subscribe to a game-related newsletter, and which player type is the most predictive?

In [None]:
#Loading required libraries
library(repr)
library(tidyverse)
library(tidymodels)

In [None]:
#Read csv file
player <- read_csv("data/players.csv")
slice_head(player, n=5)

In [None]:
# Drop unnecessary columns and re-assign data types
tidy_player <- player |> 
    select(-hashedEmail, -gender, -name) |>
    mutate(experience = as_factor(experience), subscribe = as_factor(subscribe)) |>
    mutate(subscribe = recode(subscribe, "TRUE" = "Yes", "FALSE" = "No"))
slice_head(tidy_player, n=5)
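
The report requirements above ask for a summary of the data set relevant to the exploratory analysis. A minimal sketch of one possible summary follows, assuming a data frame with the same column names as `tidy_player`. The `toy_player` rows are hypothetical and exist only to make the example self-contained; in the notebook, `tidy_player` would take their place.

```r
library(dplyr)
library(tibble)

# Hypothetical toy rows mimicking the columns of `tidy_player`; all values are made up
toy_player <- tibble(
  experience = c("Beginner", "Beginner", "Pro", "Pro", "Veteran"),
  subscribe = c("Yes", "No", "Yes", "Yes", "No"),
  played_hours = c(1.5, 0.0, 30.2, 4.0, 0.1),
  Age = c(17, 21, 19, NA, 25)
)

# Count players and average the numeric predictors per experience/subscription group
player_summary <- toy_player |>
  group_by(experience, subscribe) |>
  summarise(
    n_players = n(),
    mean_hours = mean(played_hours, na.rm = TRUE), # na.rm guards against missing values
    mean_age = mean(Age, na.rm = TRUE),
    .groups = "drop"
  )
player_summary
```

Grouping by both `experience` and `subscribe` mirrors the variables used in the research question, so the table can be read alongside the exploratory plots.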

In [None]:
#General Visualization
options(repr.plot.width = 10, repr.plot.height = 8) 
tidy_player_age_plot <- tidy_player |>
    ggplot(aes(x = Age, y = played_hours, color = subscribe)) +
    geom_point(alpha = 0.9) + 
    scale_x_log10() +
    scale_y_log10() +
    labs(x = "Age (years)", y = "Hours Played (hours)", color = "Subscribed?") +
    ggtitle("Hours Played vs. Age Relationship") +
    theme(text = element_text(size = 18))

tidy_players_experience_plot <- tidy_player |>
    ggplot(aes(x = experience, fill = subscribe)) +
    geom_bar(stat = "count") +
    labs(x = "Player Type", y = "Number of Players", fill = "Subscribed?") +
    ggtitle("Distribution of Players Across Experience and Subscription") +
    theme(text = element_text(size = 18))
tidy_player_age_plot
tidy_players_experience_plot

In [None]:
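# Filter the data frame by player type, starting with Beginner players
options(repr.plot.width = 5, repr.plot.height = 5)
set.seed(2024) # arbitrary seed so the random split and folds are reproducible
beginner_players <- tidy_player |> filter(experience == "Beginner")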

beginner_split <- initial_split(beginner_players, prop = 0.75, strata = subscribe)
beginner_train <- training(beginner_split)
beginner_test <- testing(beginner_split)

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

k_vals <- tibble(neighbors = seq(from = 1, to = 15, by = 1))

beginner_vfold <- vfold_cv(beginner_train, v = 5, strata = subscribe)

beginner_recipe <- recipe(subscribe ~ played_hours + Age, data = beginner_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

beginner_fit <- workflow() |>
  add_recipe(beginner_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = beginner_vfold, grid = k_vals)

accuracies <- beginner_fit |> collect_metrics() |>
  filter(.metric == "accuracy")

cross_val_plot <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Neighbors", y = "Accuracy Estimate") +
    theme(text = element_text(size = 12))
cross_val_plot
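
The cross-validation plot is typically used to choose a value of K. As a hedged sketch, the choice can also be made programmatically from a tibble shaped like `accuracies`. The `toy_accuracies` values below are invented purely for illustration and are not results from this analysis; the tie-breaking rule (prefer the smaller K) is likewise an assumption, not something the notebook specifies.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for `accuracies`; these numbers are invented for illustration only
toy_accuracies <- tibble(
  neighbors = 1:5,
  .metric = "accuracy",
  mean = c(0.55, 0.60, 0.72, 0.70, 0.72)
)

# Sort by highest mean accuracy, breaking ties in favour of the smaller K
best_k <- toy_accuracies |>
  arrange(desc(mean), neighbors) |>
  slice(1) |>
  pull(neighbors)
best_k
```

In the notebook, replacing `toy_accuracies` with `accuracies` would give the K to use when refitting the model on `beginner_train`.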

In [None]:
# Subset the remaining player types for the same analysis
regular_players <- tidy_player |> filter(experience == "Regular")
amateur_players <- tidy_player |> filter(experience == "Amateur")
veteran_players <- tidy_player |> filter(experience == "Veteran")
pro_players <- tidy_player |> filter(experience == "Pro")
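
As an alternative to writing one `filter()` call per player type, the subsets could be built in a single step. The sketch below uses base R's `split()` on a hypothetical `toy_player` tibble (made-up values, only two of the real columns); with the notebook's data, `tidy_player` would be split on its `experience` column in the same way.

```r
library(tibble)

# Hypothetical toy data; the values are made up for illustration
toy_player <- tibble(
  experience = c("Beginner", "Regular", "Amateur", "Veteran", "Pro", "Beginner"),
  played_hours = c(0.5, 2.0, 1.1, 0.0, 12.3, 3.4)
)

# split() returns a named list with one data frame per experience level
players_by_type <- split(toy_player, toy_player$experience)

# Number of players in each subset
sapply(players_by_type, nrow)
```

Keeping the subsets in a named list makes it straightforward to apply the same tuning steps to each player type, for example with `lapply()` or `purrr::map()`.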

## Discussion

- summarize what you found
- discuss whether this is what you expected to find
- discuss what impact such findings could have
- discuss what future questions this could lead to

## References
- You may include references if necessary, as long as they all have a consistent citation style.