Individual Planning Report Group 43 - Anson Ng Student ID (34713040) 


**Data description**

In [None]:
#load library first
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
source('cleanup.R')

There are two datasets: the players dataset and the sessions dataset. The players dataset contains information about each player who has logged on to play the game.

The players dataset includes the following variables:

| Variable | Type | Description |
|----------|------|-------------|
|experience|character|Gaming skill level: Beginner, Amateur, Pro, Veteran|
|name|character|Player’s real name|
|hashedEmail|character|Unique player ID (hashed email)|
|gender|character|Player’s gender: Male, Female, Other, Prefer not to say|
|subscribe|logical|Whether player subscribed to newsletter (TRUE/FALSE)|
|played_hours|numeric|Total hours played|
|Age|numeric|Player’s age (some NAs present)|

It is important to note:
- Age contains missing values, so NAs must be handled.
- gender contains categories beyond Male/Female, which may need filtering.
- Some variables (e.g., name, hashedEmail) are identifiers and are not predictive.

The sessions dataset includes the following variables:

| Variable | Type | Description |
|----------|------|-------------|
|hashedEmail|character|Player identifier|
|start_time|datetime|Session start (date & time)|
|end_time|datetime|Session end (date & time)|
|original_start_time|numeric|Start time in UNIX milliseconds|
|original_end_time|numeric|End time in UNIX milliseconds|

It is important to note:
- Each player may have multiple sessions.
- Start and end times are stored in both datetime and UNIX format, which requires wrangling for consistency.
- For our analysis, session-level details are not required: played_hours already represents the total time each player has spent in the game, so session start and end times would only add redundant complexity when predicting subscription.
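
If the session times were ever needed, the UNIX millisecond columns could be converted to datetimes along the following lines. This is only a sketch in base R: the two timestamp values are made up for illustration, and UTC is an assumed time zone, since the report does not state one.

```r
# Hypothetical UNIX timestamps in milliseconds (not taken from sessions.csv)
original_start_time <- 1719770000000
original_end_time <- 1719773600000

# as.POSIXct() expects seconds since 1970-01-01, so divide milliseconds by 1000
start_time <- as.POSIXct(original_start_time / 1000, origin = "1970-01-01", tz = "UTC")
end_time <- as.POSIXct(original_end_time / 1000, origin = "1970-01-01", tz = "UTC")

# Session length in minutes
duration_mins <- difftime(end_time, start_time, units = "mins")
duration_mins
```

The two example timestamps are 3,600,000 milliseconds apart, so the session length comes out to 60 minutes.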

Using the steps detailed later in this report, we computed the mean of each quantitative variable in the players table:
1. "played_hours": the mean number of hours played is 5.85 hours
2. "Age": the mean player age is 21.14 years

**Question** 

**Broad Question:**
What player characteristics and behaviors are most predictive of subscribing to a game-related newsletter, and how do these features differ between player types?

**Specific Question:**
Which is more predictive of subscription: a player’s gender or played_hours?

**Data and Approach:**
We will compare the predictive ability of gender and played_hours for subscription. Only the players dataset is required to determine predictability. Steps include:
- Filter missing values
- Select relevant columns: played_hours, gender, subscribe
- Convert gender to categorical and ensure played_hours is numeric
- Apply a linear regression model

**Exploratory Data Analysis**

To explore the data, we first summarize the quantitative variables. The variables "name", "gender", "subscribe", "experience", and "hashedEmail" have no summary statistics because they are not quantitative. Instead, we focus on the quantitative variables "played_hours" and "Age".


We will use the R commands as stated below to wrangle and summarize the data. 

In [None]:
#load the datasets
players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")

players
sessions    

Wrangling comes first because the variables must be workable and easy to read before the data can be processed. To prepare the data for exploratory visualizations, we should:
- Select only the relevant columns
- Remove missing or invalid values
- Convert variables to the desired type

In [None]:
# select "played_hours" and "Age", then compute the mean of each (ignoring NAs)
mean_data <- players |>
  select(played_hours, Age) |>
  summarize(
    mean_played_hours = round(mean(played_hours, na.rm = TRUE), 2),
    mean_age = round(mean(Age, na.rm = TRUE), 2)
  )

mean_data

In [None]:
# wrangle the data: keep relevant columns, drop NAs, and set variable types
wrangled_players <- players |>
  select(subscribe, played_hours, gender) |>
  filter(!is.na(played_hours), !is.na(gender), !is.na(subscribe)) |>
  mutate(
    subscribe = as.factor(subscribe),
    gender = as.factor(gender),
    played_hours = as.numeric(played_hours)
  )

wrangled_players 

With the wrangled data, we can produce a few exploratory visualizations.

In [None]:
# stacked bar chart: subscription counts within each gender
gender_ver_subscribe <- wrangled_players |>
  ggplot(aes(x = gender, fill = subscribe)) +
  geom_bar() +
  labs(x = "gender", y = "count", fill = "subscribe", title = "Subscription by Gender") +
  theme_minimal()

# mean played hours for subscribers and non-subscribers
summary_subscribe <- wrangled_players |>
  group_by(subscribe) |>
  summarise(mean_hours = mean(played_hours, na.rm = TRUE))

# bar chart of the mean played hours by subscription status
hours_ver_subscribe <- ggplot(summary_subscribe, aes(x = subscribe, y = mean_hours, fill = subscribe)) +
  geom_col() +
  labs(title = "Average Played Hours by Subscription Status", x = "Subscription Status", y = "Average Played Hours (hours)", fill = "Subscribed")


In [None]:
gender_ver_subscribe
hours_ver_subscribe

The first visual shows that males are overrepresented, which may bias predictions. 

The second visual indicates that subscribers have higher average played hours, which suggests that played_hours may be a useful predictor of subscription.

**Method and Plan**

**Question**
Which is more predictive of whether a player subscribes: played_hours or gender?

**Method**
Create two linear regression models, one for each predictor (played_hours and gender). Each model yields an RMSE, and the predictor whose model has the lower RMSE is taken to be more predictive of subscription likelihood. Additionally, the coefficients show how changes in the predictor affect the predicted subscription probability. (We must remember to convert subscribe to a binary 1/0 variable.)
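
The binary conversion needs some care, because subscribe is stored as a factor in wrangled_players. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical) to show one way to get a 1/0 outcome and to compute the RMSE of a base R `lm()` fit by hand.

```r
# Hypothetical toy data; subscribe is a factor, as in wrangled_players
toy <- data.frame(
  subscribe = factor(c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)),
  played_hours = c(10.5, 3.0, 0.1, 7.2, 0.0, 1.5)
)

# as.numeric() on a factor returns the level index (1 or 2), not 0/1,
# so compare against the "TRUE" label instead
toy$subscribe_num <- as.numeric(toy$subscribe == "TRUE")

# Linear regression of the 1/0 outcome on played_hours
fit <- lm(subscribe_num ~ played_hours, data = toy)
coef(fit)

# RMSE: square root of the mean squared difference between observed and fitted values
rmse_value <- sqrt(mean((toy$subscribe_num - fitted(fit))^2))
rmse_value
```

Calling `as.numeric()` directly on the factor would code FALSE as 1 and TRUE as 2, which shifts every outcome value by 1 and changes the intercept accordingly.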

**Assumptions**
The following assumptions will need to be made:
- The relationship between each variable and subscription is linear
- There are not too many outliers
- There are observations for all variables

**Potential limitations or weaknesses**
- With a binary outcome, class imbalance (e.g., the overrepresentation of males) may bias predictions toward the majority class
- Non-linear relationships may not be captured by a linear regression, so predictions may be over- or underestimated

**Plan**
1. Using the wrangled data, split it into **60% training** and **40% testing**.
2. With the training data, use K-fold cross-validation to estimate the model's validation error.
3. Build a linear regression model for played_hours using a recipe, a model specification, and a workflow.
4. Fit the model on the training set and predict on the testing set.
5. Determine the RMSPE of the played_hours model.
6. Repeat for the gender variable.
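
The plan above could be carried out roughly as follows. This is one possible sketch, not a final implementation: it runs on randomly generated toy data shaped like wrangled_players, and the seed, the choice of 5 folds, and the helper name `evaluate_predictor` are all assumptions made for illustration.

```r
library(tidymodels)

set.seed(1234)  # arbitrary seed so the split and folds are reproducible

# Hypothetical toy data shaped like wrangled_players (values are made up)
toy_players <- tibble(
  subscribe = factor(sample(c(TRUE, FALSE), 120, replace = TRUE)),
  played_hours = round(rexp(120, rate = 0.2), 1),
  gender = factor(sample(c("Male", "Female", "Other"), 120, replace = TRUE))
) |>
  # subscribe is a factor, so compare against the "TRUE" label to get 1/0
  mutate(subscribe_num = as.numeric(subscribe == "TRUE"))

# Step 1: 60% training, 40% testing
players_split <- initial_split(toy_players, prop = 0.6)
players_train <- training(players_split)
players_test <- testing(players_split)

# Step 2: cross-validation folds on the training data (v = 5 is an assumed choice)
players_folds <- vfold_cv(players_train, v = 5)

lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

# Steps 3 to 5 wrapped in a helper so that step 6 is just a second call
evaluate_predictor <- function(predictor) {
  model_formula <- as.formula(paste("subscribe_num ~", predictor))
  wf <- workflow() |>
    add_recipe(recipe(model_formula, data = players_train)) |>
    add_model(lm_spec)

  # Mean RMSE across the cross-validation folds
  cv_rmse <- wf |>
    fit_resamples(resamples = players_folds) |>
    collect_metrics() |>
    filter(.metric == "rmse") |>
    pull(mean)

  # Fit on the full training set, then compute RMSPE on the testing set
  test_rmspe <- wf |>
    fit(data = players_train) |>
    predict(players_test) |>
    bind_cols(players_test) |>
    rmse(truth = subscribe_num, estimate = .pred) |>
    pull(.estimate)

  tibble(predictor = predictor, cv_rmse = cv_rmse, test_rmspe = test_rmspe)
}

# Step 6: repeat for gender and collect both rows for comparison
results <- bind_rows(
  evaluate_predictor("played_hours"),
  evaluate_predictor("gender")
)
results
```

Because the toy data are random, the resulting error values carry no meaning; only the structure of the comparison table is of interest here.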

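With these models, we will compare the cross-validation error and the test RMSPE of the two linear regression models. The model with the lower error is the more predictive one. The coefficients indicate the direction and size of each predictor's association with subscription, although coefficients of predictors measured on different scales are not directly comparable.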