Benjamin Hobson 94708385

(1) Data Description:

players.csv --> The players.csv data set has 196 observations, each recording a player with some basic information about that player. There are 7 columns, each representing a different variable. 
- experience: this variable is of type "chr" and represents a categorical variable. It sorts players into different groups, like beginner, amateur, normal, pro, and veteran.
- subscribe: this is a variable of type "lgl", which means it is either true or false. It records whether the player is subscribed to the game-related newsletter. This could be an interesting variable to use as the class variable to predict, since it has exactly two categories (perhaps variables like Age and experience can be used to predict the likelihood of someone being subscribed).
- hashedEmail: this variable is of type "chr" and provides the player's email (hashed so it is anonymous).
- played_hours: this variable is of type "dbl" and represents how many hours the player has played.
- name: "chr" variable that records the name of the player.
- gender: categorical variable of type "chr" that records gender. It has more options than just male/female.
- Age: records age in years, is of type "dbl".
Notes on the data: There are some missing values in the Age column, so I need to use na.rm = TRUE when I compute summary statistics on that variable. Besides that, this data is tidy and straightforward to manipulate.
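As a small sketch of the missing-value handling described in the note above, the cell below counts the missing values in each column and then computes a mean that ignores them. The `toy_players` tibble and its values are made up for illustration and only stand in for the real players data.

```r
library(tidyverse)

# Hypothetical toy data standing in for players.csv (values are made up)
toy_players <- tibble(
  played_hours = c(0.0, 1.5, 30.2, 0.3),
  Age = c(17, NA, 21, 25)
)

# Count the missing values in every column
missing_counts <- toy_players |>
  summarise(across(everything(), ~ sum(is.na(.x))))

# Without na.rm = TRUE, the mean of Age would be NA
mean_age <- toy_players |>
  summarise(mean_age = mean(Age, na.rm = TRUE)) |>
  pull(mean_age)

missing_counts
mean_age
```

Running the same two steps on the real players data frame would show how many Age values are actually missing before any modelling is done.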

sessions.csv --> this dataframe has 1535 observations/rows and it has 5 variables/columns. From what I can reasonably infer, here is a basic description of each column: 
- hashedEmail: this is essentially an identifier. It is of type "chr" and is a hashed version of the email of the user who recorded the session.
- start_time: this is a variable of type "chr" that gives the exact minute (with the day, month and year attached) when a player started a session on the server.
- end_time: also a variable of type "chr". Similar to start_time, this gives the exact minute (with the day, month and year attached) when a player ended a session on the server.
- original_start_time and original_end_time (both of type "dbl"): it seems like these columns record the time elapsed since some fixed reference time, in seconds or some smaller increment, which would explain why the values are so large. I think it is worth printing the values out in full to determine their usability in comparison to the other two time variables.
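One way to check the guess about original_start_time and original_end_time is to print a value at full precision and try converting it to a date. The sketch below uses a made-up number and assumes, without confirmation from the data documentation, that the unit is milliseconds since 1970-01-01 UTC; if the resulting date is implausible, the assumption is wrong.

```r
# Hypothetical value with the magnitude of a millisecond timestamp (made up)
original_start_time <- 1.71977e12

# Print the full number instead of scientific notation
full_value <- format(original_start_time, scientific = FALSE)
full_value

# ASSUMPTION: the unit is milliseconds since 1970-01-01 UTC,
# so divide by 1000 to get seconds before converting
start_dt <- as.POSIXct(original_start_time / 1000, origin = "1970-01-01", tz = "UTC")
start_dt
```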

Notes on the data: One idea that came to mind after looking through the data a couple of times is to use mutate to create a new "play_time" column that subtracts start_time from end_time, to quickly see how long each session lasted. It would make summary statistics on play time much simpler. This would, however, require some work to convert the "chr" variables into a date-time format first.
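A minimal sketch of that play_time idea is shown below. The `toy_sessions` tibble is hypothetical, and the timestamp format "dd/mm/yyyy HH:MM" is an assumption; if the real columns use a different layout, a different lubridate parser (for example `ymd_hm()`) would be needed.

```r
library(tidyverse)
library(lubridate)

# Hypothetical sessions (made up); ASSUMPTION: times are stored as "dd/mm/yyyy HH:MM"
toy_sessions <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
  end_time = c("30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 17:57")
)

sessions_play_time <- toy_sessions |>
  mutate(
    # Parse the character columns into date-times
    start_time = dmy_hm(start_time),
    end_time = dmy_hm(end_time),
    # Session length in minutes
    play_time = as.numeric(difftime(end_time, start_time, units = "mins"))
  )

sessions_play_time
```

With play_time in place, per-player totals could be obtained with `group_by(hashedEmail)` followed by `summarise()`.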



(2) Questions: 
My question of interest: using the players.csv data set, can a player’s age and hours played predict whether they subscribe to the game’s newsletter? This question is interesting to me because if I were trying to boost the number of subscribers to the newsletter, I would want to know which types of players to target in my advertising efforts. For instance, if younger players were more receptive to joining the newsletter, I would be more inclined to use social media to increase subscribership. To wrangle this data I will need to account for missing values in the Age column, and I will need to standardize the predictors before making any predictions so that neither Age nor playing time dominates the distance calculations. It will also be useful to convert the lgl subscribe variable into a factor, since it is the variable we are trying to predict. Once those steps are taken care of, we can use K-nearest neighbors classification to try to predict whether someone is a subscriber or not.






(3) Exploratory data analysis and visualization: 

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
##source('cleanup.R')

In [None]:
players <- read_csv("data/players.csv")
players

sessions <- read_csv("data/sessions.csv")


In [None]:
players_recode <- players |>
  mutate(subscribe = as_factor(subscribe)) |>
  mutate(subscribe = fct_recode(subscribe, "Subscribed" = "TRUE", "Not subscribed" = "FALSE"))
players_recode

#Now we have turned subscribe from lgl to fct, making it a factor variable 
#as well as improving readability allowing us to use it as the variable we are trying to predict 

In [None]:
# Summary statistics on the quantitative variables in the players data frame:

players_mean <- players |>
  select(played_hours, Age) |>
  map_df(mean, na.rm = TRUE)

players_mean

In [None]:
# Data visualizations to understand the data better
options(repr.plot.width = 12, repr.plot.height = 7)
players_plot <- players_recode |>
  ggplot(aes(x = played_hours, y = Age, color = subscribe)) +
  geom_point() +
  labs(x = "Time spent playing (hours)", y = "Age of player (years)", title = "Time played and Age (unstandardized)") +
  theme(text = element_text(size = 15))
players_plot

Based on this first plot, it does seem plausible that these variables can help predict subscription: as time spent playing increases, players appear more likely to be subscribed. The data is unstandardized here; standardizing will not change this pattern, but it will put Age and played_hours on a comparable scale for the distance calculations in the predictive analysis. For now, the main insight is that more time spent playing seems to go along with much higher odds of subscribing. The next plot shows the basic distribution of the played_hours variable to see where most players fall:

In [None]:

# Histogram of playing time
players_histogram <- ggplot(players_recode, aes(x = played_hours)) +
  geom_histogram(binwidth = 0.5) +
  labs(x = "Time spent playing (hours)", y = "Count", title = "Distribution of playing time") +
  theme(text = element_text(size = 15))

# coord_cartesian() zooms in without discarding data; limits set through
# scale_*_continuous() would drop bars that fall outside the range
players_histogram + coord_cartesian(xlim = c(0, 50), ylim = c(0, 25))

From this visualization (restricted to the less extreme values, where most of the data is concentrated) we can tell that a large proportion of the players have played less than 5 hours, and many of them have played 0-1 hours, which suggests that they are joining the server and not playing much. This may be useful: targeting those groups with the newsletter could perhaps increase the amount of time they play.
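The share of players in those low ranges can also be computed directly rather than read off the histogram. The sketch below uses a hypothetical `toy_players` tibble with made-up hours; the same `summarise()` call could be applied to the real players data.

```r
library(tidyverse)

# Hypothetical playing times in hours (made up for illustration)
toy_players <- tibble(played_hours = c(0, 0, 0.1, 0.5, 2.3, 4.8, 12, 150))

# The mean of a logical vector is the proportion of TRUE values
hours_summary <- toy_players |>
  summarise(
    prop_under_1 = mean(played_hours < 1),
    prop_under_5 = mean(played_hours < 5)
  )

hours_summary
```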

(4) Methods and Plan: 
I plan to use a K-nearest neighbors classification algorithm to see how well the Age and played_hours variables predict the subscribe variable. One limitation revealed by the visualizations above is that many users have close to 0 hours of playing time, so the observations are heavily concentrated in one region and predictions near 0 played_hours may be unreliable. First I will divide the data into a training set and a test set with a 0.75 to 0.25 split. Using only the training set, I will create a recipe with the chosen predictor variables and the target variable, and add step_scale() and step_center() so that the predictors are standardized and each carries the same weight in the distance calculations. I will not create a separate validation set but will use 5-fold cross-validation on the training set to choose the best value of K. Then I will fit the model with that K and make predictions on the test set to evaluate performance. Together, these steps will show how predictive these variables are, and once the analysis is complete it can inform strategy for getting more subscribers to the newsletter.
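The plan above could be sketched roughly as follows. This is only an outline under stated assumptions, not the final analysis: `toy_players` is synthetic data generated to stand in for players_recode, the grid of K values and the seed are arbitrary choices, and the kknn package is assumed to be installed as the engine.

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)

# Hypothetical synthetic data standing in for players_recode (values are made up)
n <- 200
toy_players <- tibble(
  played_hours = rexp(n, rate = 0.2),
  Age = round(runif(n, 10, 50)),
  subscribe = factor(
    if_else(played_hours + rnorm(n, sd = 3) > 3, "Subscribed", "Not subscribed")
  )
) |>
  drop_na(Age, played_hours) # the real data has missing Age values

# 75/25 split, stratified on the class variable
players_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

# Recipe built on the training set only; both predictors are standardized
players_recipe <- recipe(subscribe ~ Age + played_hours, data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Model specification with K left to be tuned
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation over an arbitrary grid of K values
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

knn_results <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(knn_tune) |>
  tune_grid(resamples = players_vfold, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# K with the highest cross-validated accuracy
best_k <- knn_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit on the full training set with the chosen K
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(knn_spec) |>
  fit(data = players_train)

# Evaluate on the held-out test set
test_predictions <- predict(knn_fit, players_test) |>
  bind_cols(players_test)

test_accuracy <- test_predictions |>
  accuracy(truth = subscribe, estimate = .pred_class)

test_accuracy
```

Because the recipe is estimated from the training set alone, the test set plays no part in choosing the scaling constants or K, which keeps the final accuracy estimate honest.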

In [None]:
##recipe
##model
