<h2>Introduction</h2> Researchers at UBC, led by Frank Wood, collected data about how people play video games by setting up Minecraft servers and recording player activity. One of their goals was to determine which player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how these features differ between player types.

In this proposal, we will focus on the specific question: 

<em>Which set of variables is more predictive of subscription: Player Characteristics (age and gender) or Player Behaviours (total game sessions and average length of gaming sessions)?</em>

<h2>Dataset Descriptions</h2> For this inquiry, we will combine two datasets: <br>
    <ul>
    <li>players.csv  -> data about each unique player</li>
	<li>sessions.csv  -> data on individual player gaming sessions</li>
</ul>

<h4>players.csv Description</h4>

players.csv contains data on 196 unique players, collected through a self-reported survey and records of players' actions.

| Variable Name | Data Type | Meaning |
| -------- | ------- |---------|
| experience | chr | Player's self-reported skill level |
| subscribe | lgl | Whether or not the player subscribed to the newsletter |
| hashedEmail | chr | Anonymous and unique player ID |
| played_hours | dbl | Total cumulative hours played |
| name | chr | Player's name |
| gender | chr | Player's self-reported gender |
| Age | dbl | Player's age in years |

Issues: <br>
    <ul>
    <li>Missing values: 2 missing Age values</li>
	<li>Inherent self-report bias in the survey variables</li>
</ul>

<h4>sessions.csv Description</h4>

sessions.csv contains 1535 records, one per game session, each with the ID of the player it belongs to. The data was collected by recording player playtimes.

| Variable Name | Data Type | Meaning |
| -------- | ------- |---------|
| hashedEmail | chr | Anonymous and unique player ID |
| start_time | chr | Timestamp for start of session |
| end_time | chr | Timestamp for end of session |
| original_start_time | dbl | Unix timestamp for session start |
| original_end_time | dbl | Unix timestamp for session end |


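Issues: <br>
    <ul>
    <li>Missing values: 2 missing end_time values</li>
	<li>One row per session rather than per player: there are multiple entries for a single player, so the data must be summarised per player before joining</li> 
    <li>Nonparticipating players: only 125/196 (64%) of registered players actually played</li>
</ul>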

<h4>Summary Statistics</h4>
<br>
    <ul>
    <li>Subscription rate: 73.47% (144/196)</li>
	<li>Mean played_hours: 5.85</li> 
    <li>played_hours range: 0-223.1</li> 
    <li>Mean age: 21.14</li>
	<li>Age range: 9-58</li> 
    <li>Average sessions per active player: 12.28</li> 
    <li>Average sessions per registered player: 7.83</li>
</ul>
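As a quick sanity check, the rate and per-player figures can be recomputed from the counts reported in the dataset descriptions (196 players, 144 subscribers, 1535 sessions, 125 active players). This is a minimal sketch that uses only those counts, not the data files; the variable names are illustrative.

```r
# Counts taken from the dataset descriptions above
n_players    <- 196   # rows in players.csv
n_subscribed <- 144   # players with subscribe == TRUE
n_sessions   <- 1535  # rows in sessions.csv
n_active     <- 125   # players with at least one recorded session

subscription_rate   <- round(100 * n_subscribed / n_players, 2)  # 73.47
active_share        <- round(100 * n_active / n_players)         # 64
sessions_per_active <- round(n_sessions / n_active, 2)           # 12.28
sessions_per_player <- round(n_sessions / n_players, 2)          # 7.83

print(c(subscription_rate, active_share, sessions_per_active, sessions_per_player))
```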

<h2>Exploratory Data Analysis and Visualization</h2>We will begin by loading the packages we need into R.

In [None]:
library(tidyverse)
library(lubridate) # needed for tidying the sessions.csv dataset later

Now we load players.csv. It requires a little tidying: removing the rows with the missing (NA) values mentioned previously.

In [None]:
players <- read_csv("players.csv") |> 
        drop_na()
players

<h4>Mean Values of players.csv</h4> <br> 

| Variable | Mean Value | 
| -------- | ------- |
| played_hours | 5.85 | 
| Age | 21.14 |

Now we will load sessions.csv. It requires tidying because each row is a single session, so an individual player can have multiple rows; we summarise it to one row per player.

In [None]:
sessions <- read_csv("sessions.csv") |>
  # The Unix timestamp columns are not needed alongside start_time/end_time
  select(-original_start_time, -original_end_time) |>
  mutate(
    # Use lubridate to parse the day/month/year hour:minute strings into date-times
    start_time = dmy_hm(start_time),
    end_time = dmy_hm(end_time),
    session_length = as.numeric(difftime(end_time, start_time, units = "mins"))
  ) |>
  group_by(hashedEmail) |>
  summarise(
    total_sessions = n(),
    # Sessions with a missing end_time are ignored when averaging
    avg_session_time_mins = mean(session_length, na.rm = TRUE)
  )

sessions
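
To make the per-player summary concrete, here is a minimal sketch on made-up data. The three toy rows and the IDs "a" and "b" are hypothetical, and the timestamp strings assume the day/month/year hour:minute layout that `dmy_hm()` expects. It also shows what happens when a player's only session has no `end_time`.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy sessions (NOT the study data): player "a" has two complete
# sessions; player "b" has one session with a missing end_time
toy_sessions <- tibble(
  hashedEmail = c("a", "a", "b"),
  start_time  = c("30/06/2024 18:12", "01/07/2024 09:00", "02/07/2024 20:30"),
  end_time    = c("30/06/2024 18:24", "01/07/2024 09:45", NA)
)

toy_summary <- toy_sessions |>
  mutate(
    start_time = dmy_hm(start_time),
    end_time = dmy_hm(end_time),
    session_length = as.numeric(difftime(end_time, start_time, units = "mins"))
  ) |>
  group_by(hashedEmail) |>
  summarise(
    total_sessions = n(),
    avg_session_time_mins = mean(session_length, na.rm = TRUE)
  )

toy_summary
```

Player "a" ends up with 2 sessions averaging 28.5 minutes (12 and 45 minutes). Player "b" still counts 1 session, but the average is `NaN` because no session length could be computed, so such values need handling before modelling.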

Now that we have both datasets and they are tidy, we can combine them to create our final dataset!

In [None]:
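final_data <- left_join(players, sessions, by = "hashedEmail") |>
  # Players with no recorded sessions get NA after the join; they must be kept
  # rather than omitted, so their session values are set to 0
  mutate(across(c(total_sessions, avg_session_time_mins), ~ replace_na(.x, 0)))

final_data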
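
The effect of the join on players who never played can be seen on a small made-up example. The tibbles below are hypothetical stand-ins for the real tables, and restricting the zero-fill to the two session columns is one possible choice, sketched here as an assumption.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-ins (NOT the study data): player "c" registered but never played
toy_players <- tibble(
  hashedEmail = c("a", "b", "c"),
  Age = c(17, 21, 30)
)
toy_session_summary <- tibble(
  hashedEmail = c("a", "b"),
  total_sessions = c(2L, 1L),
  avg_session_time_mins = c(28.5, 12)
)

toy_final <- toy_players |>
  # left_join keeps every player, including those without sessions
  left_join(toy_session_summary, by = "hashedEmail") |>
  # only the session columns can be NA here, so only they are filled with 0
  mutate(across(c(total_sessions, avg_session_time_mins), ~ replace_na(.x, 0)))

toy_final
```

Player "c" is retained with `total_sessions` and `avg_session_time_mins` both equal to 0, which is how non-playing players stay in the final dataset.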

Now we will create a few visualizations to better understand our data.

In [None]:
ggplot(final_data, aes(x = total_sessions)) +
  geom_histogram(
    binwidth = 5,      
    fill = "blue",          
    color = "black",         
    alpha = 0.7           
  ) +
  labs(
    title = "Distribution of Number of Gaming Sessions",       
    x = "Number of Sessions",         
    y = "Number of Players"            
  ) 

This histogram shows the distribution of the number of gaming sessions each player had over the course of the research. Almost all players had few total gaming sessions, and only a select few had more than ~50 sessions.

In [None]:
ggplot(final_data, aes(x = Age)) +
  geom_histogram(
    binwidth = 1,      
    fill = "blue",          
    color = "black",         
    alpha = 0.7         
  ) +
  labs(
    title = "Distribution of Ages",     
    x = "Age",           
    y = "Number of Players"          
  ) 

This histogram shows the distribution of ages among the players. The most notable feature is the enormous spike at 17 years old. This matters for our question because it may skew the age predictor, since a disproportionately large share of players are 17 years old.

In [None]:
ggplot(final_data, aes(x = experience, fill = subscribe)) +
  geom_bar(position = "dodge") +   
  labs(
    title = "Subscription Status by Experience Level",
    x = "Self-Reported Experience Level",
    y = "Count",
    fill = "Subscribed"
  ) 

This grouped bar chart shows the number of players who subscribed versus those who did not, grouped by self-reported skill level. It shows that most players subscribed to the newsletter regardless of skill level, and that most players are self-reported Amateurs.

<h2>Methods and Plan</h2> Question Restated: Which set of variables is more predictive of subscription: Player Characteristics (age and gender) or Player Behaviours (total game sessions and average length of gaming sessions)? <br>
<br>
To answer this, we will build two separate classification models that both predict subscription but use different predictors: one based on Player Characteristics and one based on Player Behaviours. For both models, we will split the data into 80% training and 20% testing sets, then run 10-fold cross-validation on the training set to estimate each model's performance metrics. This method is appropriate because cross-validation provides out-of-sample performance estimates, which allow a fair comparison. The model with the higher performance metrics will be considered the more predictive. <br>
<br>
Potential issues with this method are the sample size and class imbalance. These might reduce the accuracy of both models, as they might not be able to capture the seemingly complex and arbitrary patterns of certain variables. 
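
The plan above could be sketched roughly as follows. This is an illustration under stated assumptions, not the final analysis: the data is a randomly generated stand-in for `final_data`, logistic regression is assumed as the classifier because the plan does not name one, and the helper `cv_metrics()` is a hypothetical name. It uses the tidymodels packages.

```r
library(tidymodels)

set.seed(2024)

# Synthetic stand-in for final_data (random values, NOT the real study data)
n <- 196
sim_data <- tibble(
  subscribe = factor(sample(c("TRUE", "FALSE"), n, replace = TRUE, prob = c(0.73, 0.27))),
  Age = sample(9:58, n, replace = TRUE),
  gender = factor(sample(c("Male", "Female", "Other"), n, replace = TRUE)),
  total_sessions = rpois(n, 8),
  avg_session_time_mins = rexp(n, rate = 1 / 30)
)

# 80/20 split, stratified on the outcome to preserve the class balance
data_split <- initial_split(sim_data, prop = 0.8, strata = subscribe)
train_data <- training(data_split)

# 10-fold cross-validation on the training set only; the test set stays untouched
folds <- vfold_cv(train_data, v = 10, strata = subscribe)

# Assumed classifier: logistic regression (any classification model could be swapped in)
model_spec <- logistic_reg() |> set_engine("glm")

# Hypothetical helper: cross-validated metrics for one set of predictors
cv_metrics <- function(model_formula) {
  workflow() |>
    add_formula(model_formula) |>
    add_model(model_spec) |>
    fit_resamples(resamples = folds, metrics = metric_set(accuracy, roc_auc)) |>
    collect_metrics()
}

characteristics_metrics <- cv_metrics(subscribe ~ Age + gender)
behaviour_metrics <- cv_metrics(subscribe ~ total_sessions + avg_session_time_mins)

# Side-by-side comparison of the two predictor sets
bind_rows(
  characteristics = characteristics_metrics,
  behaviours = behaviour_metrics,
  .id = "model"
)
```

Because both models are evaluated on the same folds, differences in the cross-validated metrics reflect the predictors rather than the resampling. With imbalanced classes, accuracy alone can be misleading, so a second metric such as ROC AUC is reported alongside it.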
