Question: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?
Specific Question: Can the age of the player predict if the player subscribes to a game-related newsletter in players.csv?

In [None]:
library(tidyverse)
library(purrr)

In [None]:
players_data <- read_csv("players.csv")
players_data

In [None]:
players_mean <- players_data |>
                select(played_hours, Age) |>
                map_dfr(mean, na.rm = TRUE)
players_mean

| Variable     | Mean |
| ------------ | ---- |
| Hours Played | 6    |
| Age          | 21   |

In [None]:
experience_vs_subscription_graph <- players_data |>
                            ggplot(aes(x = subscribe, fill = experience)) +
                            geom_bar() +
                            labs(x = "If the player subscribed", y = "Number of Players", fill = "Level of Experience") +
                            ggtitle("Newsletter subscription by player experience level")
experience_vs_subscription_graph

In [None]:
gender_vs_subscription_graph <- players_data |>
                            filter(gender != "Prefer not to say") |>
                            ggplot(aes(x = subscribe, fill = gender)) +
                            geom_bar() +
                            labs(x = "If the player subscribed", y = "Number of Players", fill = "Gender") +
                            ggtitle("Newsletter subscription by player gender")
gender_vs_subscription_graph

In [None]:
age_vs_subscription_graph <- players_data |>
                            ggplot(aes(x = Age, fill = subscribe)) +
                            geom_bar() +
                            labs(x = "Age of the Player", y = "Number of Players", fill = "If they subscribed") +
                            ggtitle("Newsletter subscription by player age")
age_vs_subscription_graph

In [None]:
# Named for hours played so it does not overwrite the age graph above.
# Only players with fewer than 3 hours are shown.
hours_vs_subscription_graph <- players_data |>
                            filter(played_hours < 3) |>
                            ggplot(aes(x = played_hours, fill = subscribe)) +
                            geom_bar() +
                            labs(x = "Hours spent playing", y = "Number of Players", fill = "If they subscribed") +
                            ggtitle("Newsletter subscription by hours spent playing (under 3 hours)")
hours_vs_subscription_graph

Insights from the graphs: 

Based on these graphs, the players' level of experience, their gender, and the hours they spent playing do not appear to be related to whether or not they subscribed. However, there does seem to be a relationship between the age of the player and whether or not they subscribed. It seems as though younger players are 

Method and Plan: 

To address my question, I would use K-nearest neighbours (KNN) classification. This is appropriate because I am trying to predict which category a player falls into, subscribed or not subscribed, based on their age. Since subscription is a categorical variable, this is a classification problem. One limitation is that the model would look at only one variable, ignoring the fact that others may also influence the prediction. The approach also requires the assumption that age can predict whether or not a player will subscribe. 

To apply KNN, the data should be split after wrangling but before building the model. 75% of the data should go into the training set and the remaining 25% into the testing set, with the rows shuffled and the split stratified by subscription status. Cross-validation should then be run on the training set to find the best number of neighbours. 
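
The plan above could be sketched with tidymodels roughly as follows. This is only a minimal sketch, not the final analysis. The helper name `tune_knn_k`, the seed, the 5 folds, and the grid of odd K values from 1 to 15 are all assumptions chosen for illustration. It also assumes `subscribe` and `Age` are the relevant columns, as in the plots above, and that the `kknn` package is installed.

```r
library(tidyverse)
library(tidymodels)

# Hypothetical helper: splits the data 75/25, then uses cross-validation
# on the training set to pick the number of neighbours.
tune_knn_k <- function(data, seed = 2024, folds = 5,
                       k_values = seq(from = 1, to = 15, by = 2)) {
  set.seed(seed)  # arbitrary seed, only so the split is reproducible

  # Keep the outcome and the single predictor; the outcome must be a factor.
  model_data <- data |>
    select(subscribe, Age) |>
    drop_na() |>
    mutate(subscribe = as.factor(subscribe))

  # initial_split() shuffles the rows; strata keeps the class balance similar.
  data_split <- initial_split(model_data, prop = 0.75, strata = subscribe)
  train <- training(data_split)
  test <- testing(data_split)

  # Standardise the predictor using the training data only.
  knn_recipe <- recipe(subscribe ~ Age, data = train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

  # KNN classifier with the number of neighbours left to be tuned.
  knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

  # Cross-validation folds are built from the training set only.
  train_vfold <- vfold_cv(train, v = folds, strata = subscribe)

  cv_accuracy <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = train_vfold, grid = tibble(neighbors = k_values)) |>
    collect_metrics() |>
    filter(.metric == "accuracy")

  # K with the highest mean cross-validated accuracy.
  best_k <- cv_accuracy |>
    slice_max(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)

  list(train = train, test = test, cv_accuracy = cv_accuracy, best_k = best_k)
}
```

Calling `tune_knn_k(players_data)` would return the training and testing sets, the cross-validated accuracy for each K, and the chosen K. The testing set is not used during tuning, so it stays available for a final evaluation of the model refit with `best_k`.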


How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross validation?