In [None]:
library(tidyverse)
players_data <- read_csv("Data/players.csv")
players_data

In [None]:
players_data_tidy <- players_data |>
  # Keep only the columns used below; hashedEmail and gender are dropped
  select(name, Age, played_hours, experience, subscribe) |>
  rename(age = Age) |>
  arrange(age)
players_data_tidy

**Data Description**

**Broad Question**
- What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

**Specific Question**
- 

In [None]:
summary_mean <- players_data |>
summarize(mean_age = mean(Age, na.rm = TRUE),
          mean_played_hours = mean(played_hours, na.rm = TRUE))
summary_table <- summary_mean |>
pivot_longer(cols = everything(),
             names_to = "variable",
             values_to = "mean")
summary_table

In [None]:
options(repr.plot.width = 10, repr.plot.height = 15)
hours_played_plot <- players_data |>
  ggplot(aes(x = Age, y = played_hours, color = subscribe)) +
  geom_point(size = 5, alpha = 0.8) +
  labs(x = "Age", y = "Time played (hrs)", color = "Subscribed", title = "Play time vs. Age") +
  # Note: ylim() drops players with more than 50 hours from the plot (with a warning)
  ylim(0.0, 50) +
  scale_color_brewer(palette = "Set3")
hours_played_plot

In [None]:
# dplyr and ggplot2 are already attached by library(tidyverse) above
options(repr.plot.width = 10, repr.plot.height = 10)

# Mean age of players within each experience level
age_mean <- players_data |>
  group_by(experience) |>
  summarize(mean_age = mean(Age, na.rm = TRUE))

experience_plot <- ggplot(age_mean, aes(x = experience, y = mean_age, fill = experience)) +
  geom_col() +
  labs(x = "Player Experience", y = "Mean age", title = "Mean Age vs. Player Experience")
experience_plot

**Methods and Plan**


**KNN classification is the best method to answer this question**
- The response variable, subscribe, is categorical (true/false) rather than numerical, so this is a classification problem
- KNN does not assume a linear relationship between the predictors and the response, which suits this question, where the relationships are non-linear

**Limitations include choosing the right number of nearest neighbours, scaling, and the distance between points**
- The number of nearest neighbours (k) chosen affects how accurate the predictions are
- Without scaling the predictors, the variable with the largest range dominates the distance calculation and can give inaccurate results
- If points from different classes lie close together, it may be hard to predict the correct class
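
The scaling limitation can be illustrated with a small sketch. The four players below are made-up values, not rows from players.csv, and `nearest_to_first` is a hypothetical helper written only for this example. It finds the nearest neighbour of the first player by Euclidean distance, once on the raw values and once after standardizing each column with `scale()`.

```r
# Hypothetical players (illustrative values only, not taken from players.csv)
players_toy <- data.frame(
  Age = c(17, 40, 18, 25),
  played_hours = c(10, 12, 40, 150),
  row.names = c("p1", "p2", "p3", "p4")
)

# Hypothetical helper: name of the row closest to the first row
nearest_to_first <- function(x) {
  d <- as.matrix(dist(x))   # dist() uses Euclidean distance by default
  others <- d[1, -1]        # distances from the first row to every other row
  names(others)[which.min(others)]
}

nearest_raw <- nearest_to_first(players_toy)            # "p2": played_hours dominates
nearest_scaled <- nearest_to_first(scale(players_toy))  # "p3": both columns now count
nearest_raw
nearest_scaled
```

On the raw values, the wide spread of `played_hours` decides which player counts as nearest. After standardizing, `Age` and `played_hours` contribute on the same scale, so the nearest neighbour changes.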

**Selecting and composing a model**
- To find the most accurate model, tune k by testing a range of values
- Use cross-validation to estimate the accuracy for each value of k
- Use age, experience, and hours played as predictors
- The prediction is whether subscribe is true or false

**Processing the data**
- Select variables relevant to the question asked
- After tidying the data, split it into 75% training and 25% testing sets
- Use k-fold cross-validation on the training set to determine a suitable number of neighbours
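
The steps above could be sketched with tidymodels roughly as follows. This is a minimal sketch under several assumptions that are not part of the plan itself: the tidymodels and kknn packages are installed, `tune_knn_subscribe` is a hypothetical helper name, experience is encoded with dummy variables, rows with missing values are dropped, and the seed, the grid of k values, and the 5 folds are arbitrary choices.

```r
library(tidyverse)
library(tidymodels)

# Hypothetical helper: 75/25 split, then tune k with cross-validation.
# Assumes `data` has the columns Age, played_hours, experience, subscribe.
tune_knn_subscribe <- function(data, k_values = seq(1, 15, by = 2),
                               folds = 5, seed = 2024) {
  set.seed(seed)  # arbitrary seed so the split and folds are reproducible

  model_data <- data |>
    select(Age, played_hours, experience, subscribe) |>
    drop_na() |>                                # assumption: drop incomplete rows
    mutate(subscribe = factor(subscribe),       # classification needs a factor outcome
           experience = factor(experience))

  # 75% training / 25% testing, stratified by the outcome
  data_split <- initial_split(model_data, prop = 0.75, strata = subscribe)
  train <- training(data_split)

  knn_recipe <- recipe(subscribe ~ Age + played_hours + experience, data = train) |>
    step_dummy(experience) |>          # assumption: dummy-encode the categorical predictor
    step_zv(all_predictors()) |>       # drop columns that are constant within a fold
    step_normalize(all_predictors())   # centre and scale so no predictor dominates

  knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

  cv_folds <- vfold_cv(train, v = folds, strata = subscribe)

  cv_accuracy <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = cv_folds, grid = tibble(neighbors = k_values)) |>
    collect_metrics() |>
    filter(.metric == "accuracy") |>
    arrange(desc(mean))

  # Return the split too, so the testing set stays available for a final check
  list(split = data_split, cv_accuracy = cv_accuracy)
}
```

Calling `tune_knn_subscribe(players_data)` would return the cross-validated accuracy for each k, sorted from best to worst. The testing set from `split` is left untouched until a final k has been chosen.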