Individual Planning Report Group 43 - Anson Ng Student ID (34713040) 


**Data description**

In [None]:
#load library first
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
source('cleanup.R')

There are two datasets: the player dataset and the session dataset. The player dataset contains information about each player who has logged on to play the game.

This includes

- experience 
- subscribe
- hashedEmail 
- played_hours 
- name
- gender
- Age

The session dataset records the start time and end time of every session each player plays. As a result, the same hashed email can appear in two or more rows, each with its own start and end time. The original start and end time columns hold the same start and end times, recorded as UNIX time.

The variables include

- hashedEmail
- start_time
- end_time
- original_start_time (in UNIX time (milliseconds))
- original_end_time (in UNIX time (milliseconds))

Based on the datasets, there are three types of variables within the player dataset.
1. "experience", "name", "hashedEmail" and "gender" are character variables, since their values are text.
2. "subscribe" is a logical variable, meaning each value is either TRUE or FALSE.
3. "played_hours" and "Age" are numeric variables that can include decimals.


- "experience" describes each player's level of gaming skill, with levels such as Beginner, Amateur, Pro and Veteran.
- "name" gives the player's real-life name.
- "hashedEmail" is the code the game uses to identify the player; it works like a gamer ID.
- "gender" gives the player's gender, with values such as "Male", "Female", "Other" and "Prefer not to say" for those who choose not to answer.
- "subscribe" records whether the player has chosen to subscribe to the newsletter: TRUE for those who subscribed and FALSE for those who did not.
- "played_hours" is the total number of hours each player has played.
- "Age" is the player's age in years.

It is important to note that the "Age" variable contains NA values, so when processing the data we must pass the argument that ignores NA ("na.rm = TRUE"). Additionally, the gender column includes not only "Male" and "Female" but also "Prefer not to say" and "Other", so it may need to be filtered if we choose to analyze only "Male" and "Female".
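Both points can be sketched on a small made-up tibble. The object `toy_players` and all of its values are hypothetical; only the column names mirror the player dataset.

```r
library(tidyverse)

# Hypothetical toy data; the values are invented for illustration only.
toy_players <- tibble(
  gender = c("Male", "Female", "Other", "Prefer not to say", "Male"),
  Age = c(17, 21, NA, 30, 22)
)

# mean() returns NA if any value is missing, unless na.rm = TRUE is set.
mean_age_with_na <- mean(toy_players$Age)
mean_age_no_na <- mean(toy_players$Age, na.rm = TRUE)

# Keep only "Male" and "Female" rows, if the analysis is restricted to them.
male_female_only <- toy_players |>
  filter(gender %in% c("Male", "Female"))
```

Whether to drop the other gender categories is a design choice: filtering simplifies the comparison but discards observations.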

The session dataset records each time a player starts and ends a play session. It contains the following variables:

- "hashedEmail" details the player who is playing. It uses the same variable as the "hashedEmail" of the player file; thus if necessary, we can combine to see what time each player starts or ends and their information. 
- "start_time" gives the date and time at which the player started playing. Note that this column holds two pieces of information, the date and the time, so it may require additional wrangling.
- "end_time" gives the date and time at which the player finished playing. Like "start_time", it contains both the date and the time, so it will also require additional wrangling.
- "original_start_time" is the same variable as the start_time but is formatted as a UNIX time (milliseconds). 
- "original_end_time" is the same variable as the end_time but is formatted as a UNIX time (milliseconds). 
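The wrangling described above could look roughly like the following sketch. It assumes the text columns use a "dd/mm/yyyy HH:MM" format, which should be checked against the real file; `toy_sessions`, `toy_players` and their values are invented for illustration.

```r
library(tidyverse)
library(lubridate)

# Hypothetical rows; the "dd/mm/yyyy HH:MM" text format is an assumption.
toy_sessions <- tibble(
  hashedEmail = c("abc123", "abc123", "def456"),
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
  end_time = c("30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 17:57"),
  original_start_time = c(1.71977e12, 1.71867e12, 1.72193e12)
)

# Hypothetical player rows sharing the same hashedEmail key.
toy_players <- tibble(
  hashedEmail = c("abc123", "def456"),
  experience = c("Pro", "Veteran")
)

parsed_sessions <- toy_sessions |>
  mutate(
    start_time = dmy_hm(start_time),    # text -> date-time (UTC by default)
    end_time = dmy_hm(end_time),
    session_minutes = as.numeric(difftime(end_time, start_time, units = "mins")),
    start_date = as_date(start_time),   # date part only
    start_hour = hour(start_time),      # hour of the day
    # UNIX milliseconds -> seconds -> date-time
    original_start = as_datetime(original_start_time / 1000)
  ) |>
  # attach player information through the shared hashedEmail column
  left_join(toy_players, by = "hashedEmail")
```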

Using the steps detailed later in this report, we can compute the mean of the quantitative variables in the player table:
1. "played_hours": the average number of hours played is 5.85 hours.
2. "Age": the average age of players is 21.14 years.

**Question** 

**Broad Question:**
What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

**Specific Question:**
Between a player's gender and played_hours, which is more predictive of whether the player subscribes to the game-related newsletter?

**Data and Approach:**
My question will be addressed by comparing the predictive ability of gender and played_hours. To do this, I plan to apply a linear regression model to assess which variable (gender or played_hours) provides higher predictive accuracy. Before applying the model, the data will need to be wrangled. This includes:
- Filtering out any missing or unavailable values,
- Selecting only the relevant columns: played_hours, gender, and subscribe,
- Converting the gender variable to a categorical predictor, and
- Ensuring that played_hours is stored as a numeric variable.
    
Notably, the session data is not needed, because the analysis uses each player's total played hours rather than individual sessions.

**Exploratory Data Analysis**

To explore the data, we can start with a summary of the variables. Not all variables have a meaningful summary statistic: "name", "gender", "subscribe", "experience" and "hashedEmail" are not quantitative, so a mean cannot be computed for them. Instead we focus on the quantitative variables for which a mean can be calculated: "played_hours" and "Age".

We will use the R commands as stated below to wrangle and summarize the data. 

First we make sure we are able to load the data 

In [None]:
#load the datasets
players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")

players
sessions    

Wrangling is important because the variables must be workable and easy to read before the data can be processed. To wrangle the data for exploratory visuals, we should
- Select only the relevant columns
- Remove missing or invalid values
- Convert variables to the desired type

In [None]:
# select "played_hours" and "Age", then compute their means (ignoring NA values)
mean_data <- players |>
  select(played_hours, Age) |>
  summarize(
    mean_played_hours = round(mean(played_hours, na.rm = TRUE), 2),
    mean_age = round(mean(Age, na.rm = TRUE), 2)
  )

mean_data

In [None]:
# wrangling the data
wrangled_players <- players |>
  select(subscribe, played_hours, gender) |>
  filter(!is.na(played_hours), !is.na(gender), !is.na(subscribe)) |>
  mutate(
    subscribe = as.factor(subscribe),
    gender = as.factor(gender),
    played_hours = as.numeric(played_hours)
  )

wrangled_players 

With the wrangled data, we are able to produce a few exploratory visualizations.

In [None]:
# stacked bar chart: number of players per gender, split by subscription status
gender_ver_subscribe <- wrangled_players |>
  ggplot(aes(x = gender, fill = subscribe)) +
  geom_bar() +
  labs(x = "gender", y = "count", fill = "subscribe", title = "Subscription by Gender") +
  theme_minimal()

# mean played hours for subscribers and non-subscribers
summary_subscribe <- wrangled_players |>
  group_by(subscribe) |>
  summarise(mean_hours = mean(played_hours, na.rm = TRUE))

# bar chart of the mean played hours per subscription status
hours_ver_subscribe <- ggplot(summary_subscribe, aes(x = subscribe, y = mean_hours, fill = subscribe)) +
  geom_col() +
  labs(
    title = "Average Played Hours by Subscription Status",
    x = "Subscription Status",
    y = "Average Played Hours (hours)",
    fill = "Subscribed"
  )


In [None]:
gender_ver_subscribe
hours_ver_subscribe

Our two visuals reveal several patterns to be aware of.
First, the male category has the tallest bar overall, indicating that there are more male players in the dataset than players of any other gender. Since both the subscribed and unsubscribed counts are higher for males, this primarily reflects the larger representation of male players rather than a meaningful difference in subscription behavior. This imbalance may limit a predictive model, as the small number of non-subscribers across genders could reduce the model's ability to capture patterns relating gender to subscription.
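One way to separate group size from subscription behavior is to compare proportions within each gender instead of raw counts. The sketch below is illustrative only: `toy_wrangled` is a hypothetical stand-in for the wrangled data, with invented values.

```r
library(tidyverse)

# Hypothetical stand-in for the wrangled player data; values are invented.
toy_wrangled <- tibble(
  gender = c("Male", "Male", "Male", "Male", "Female", "Female"),
  subscribe = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
)

# Share of subscribers and non-subscribers within each gender.
subscribe_share <- toy_wrangled |>
  count(gender, subscribe) |>
  group_by(gender) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

# position = "fill" rescales every bar to height 1, so bars show proportions.
share_plot <- ggplot(toy_wrangled, aes(x = gender, fill = subscribe)) +
  geom_bar(position = "fill") +
  labs(x = "gender", y = "proportion", fill = "subscribe",
       title = "Subscription Share by Gender")
```

Proportions remain noisy for genders with very few players, so the counts are still worth reporting alongside them.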


Our second visual shows that subscribers have a higher average played time than non-subscribers. This suggests a possible trend: the more hours someone plays, the more likely they are to subscribe.

**Method and Plan**

**Question** 
 Which is more predictive of whether a player subscribes: played_hours or gender?

 **Method**
 Create two linear regression models, one for each variable: played_hours and gender. Each model produces an RMSE value (RMSPE when computed on the testing set); the predictor whose model has the lower value is treated as more predictive of subscription likelihood.


This method is appropriate because we can code subscription yes/no as 1/0. With these numbers, the model's predicted value can be read as an estimated probability of subscription for a given value of the predictor. Additionally, the coefficients tell us how much that estimated probability changes for each unit change in a predictor.
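A minimal sketch of the 0/1 coding and the interpretation of the coefficients, using base R's `lm()` on invented data (`toy` and `subscribe_num` are hypothetical names, not part of the datasets):

```r
library(tidyverse)

# Hypothetical toy data; values are invented for illustration only.
toy <- tibble(
  subscribe = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE),
  played_hours = c(3.0, 0.1, 12.5, 0.5, 0.0, 30.2)
) |>
  # A logical converts to 0/1 directly. If subscribe were already a factor,
  # as.numeric() would return the level codes (1/2), so one would use
  # as.numeric(subscribe == "TRUE") instead.
  mutate(subscribe_num = as.numeric(subscribe))

# Linear regression on a 0/1 outcome.
lpm <- lm(subscribe_num ~ played_hours, data = toy)

# The slope is the estimated change in subscription probability
# per additional played hour.
lpm_coefs <- coef(lpm)

# Fitted values act as estimated probabilities, though a linear model
# does not constrain them to stay between 0 and 1.
fitted_probs <- fitted(lpm)
```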

**Assumptions**
The assumptions that need to be made include:
- the relationship between each variable and subscription is linear
- there aren't too many outliers
- there are observations for all variables


 **Plan** 
 1. Split the wrangled data into **60% training** and **40% testing**.
 2. With the training data, use 5-fold cross-validation to estimate the model's error.
 3. Build a linear regression workflow for the played_hours variable, consisting of a recipe and a model specification.
 4. Fit the workflow on the training set and predict on the testing set.
 5. Compute the RMSPE of the played_hours model on the testing set.
 6. Repeat steps 3 to 5 for the gender variable and compare the two RMSPE values.
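
The plan above could be sketched with tidymodels as follows. This is only an outline under stated assumptions: the data are simulated, `toy`, `subscribe_num` and `evaluate_predictor()` are hypothetical names, and the outcome is assumed to be coded as 0/1 so that linear regression can be applied.

```r
library(tidymodels)

set.seed(2024)  # make the random split and folds reproducible

# Simulated stand-in for the wrangled player data; all values are invented.
toy <- tibble(
  played_hours = rexp(200, rate = 0.2),
  gender = factor(sample(c("Male", "Female", "Other"), 200, replace = TRUE)),
  subscribe_num = rbinom(200, size = 1, prob = 0.7)  # 1 = subscribed
)

# Step 1: 60% training, 40% testing
toy_split <- initial_split(toy, prop = 0.6)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Step 2: 5-fold cross-validation on the training set
toy_folds <- vfold_cv(toy_train, v = 5)

lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

# Hypothetical helper covering steps 3 to 5 for a single predictor.
evaluate_predictor <- function(predictor) {
  rec <- recipe(reformulate(predictor, response = "subscribe_num"),
                data = toy_train) |>
    step_dummy(all_nominal_predictors())  # encodes gender; no-op for numbers

  wf <- workflow() |>
    add_recipe(rec) |>
    add_model(lm_spec)

  # cross-validated RMSE on the training folds
  cv_rmse <- fit_resamples(wf, resamples = toy_folds) |>
    collect_metrics() |>
    filter(.metric == "rmse") |>
    pull(mean)

  # RMSPE: fit on the training set, evaluate on the testing set
  test_rmspe <- fit(wf, data = toy_train) |>
    predict(toy_test) |>
    bind_cols(toy_test) |>
    rmse(truth = subscribe_num, estimate = .pred) |>
    pull(.estimate)

  tibble(predictor = predictor, cv_rmse = cv_rmse, test_rmspe = test_rmspe)
}

# Step 6: repeat for both predictors and compare
results <- bind_rows(
  evaluate_predictor("played_hours"),
  evaluate_predictor("gender")
)
```

Because the toy outcome is generated independently of both predictors, the two errors should come out similar here; any difference on the real data would need to be interpreted with the class imbalance in mind.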