### Setting Up Libraries and Parameters

In [1]:
#Run this first.
library(tidyverse)
library(repr)
library(RColorBrewer)
library(tidymodels)
options(repr.matrix.max.rows = 10)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample     

In [2]:
players <- read_csv("data/players.csv")
players

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| experience | subscribe | hashedEmail | played_hours | name | gender | Age |
|---|---|---|---|---|---|---|
| `<chr>` | `<lgl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| Pro | TRUE | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Morgan | Male | 9 |
| Veteran | TRUE | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Christian | Male | 17 |
| Veteran | FALSE | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Blake | Male | 17 |
| Amateur | TRUE | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5 | 0.7 | Flora | Female | 21 |
| Regular | TRUE | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e | 0.1 | Kylie | Male | 21 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| Amateur | TRUE | b6e9e593b9ec51c5e335457341c324c34a2239531e1890b93ca52ac1dc76b08f | 0.0 | Bailey | Female | 17 |
| Veteran | FALSE | 71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778b35c5802c3292c87bd | 0.3 | Pascal | Male | 22 |
| Amateur | FALSE | d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db299bd4fedb06a46ad5bb | 0.0 | Dylan | Prefer not to say | 57 |
| Amateur | FALSE | f19e136ddde68f365afc860c725ccff54307dedd13968e896a9f890c40aea436 | 2.3 | Harlow | Male | 17 |


# Part 1: Data Description 

The **players.csv** dataset provides demographic and behavioural information about each player: name, gender, age, experience level, subscription status, and total hours played. It consists of 196 observations (one per player) and 7 variables. The data was collected from a Minecraft research server managed by a UBC Computer Science group led by Frank Wood, which records player interactions and engagement patterns.

### Variable Summary
- `experience` *(character)*: level of player experience (e.g., Amateur, Regular, Pro, Veteran)
- `hashedEmail` *(character)*: each player's hashed email address, which serves as a unique identifier
- `name` *(character)*: player's username
- `gender` *(character)*: self-identified gender of the player
- `played_hours` *(double)*: total number of hours the player spent in the game
- `Age` *(double)*: age of the player in years
- `subscribe` *(logical)*: whether the player subscribed to the newsletter of the game
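
Before modelling, the character columns and the logical `subscribe` column would typically be converted to factors. The following is a minimal sketch on a made-up three-row tibble: `toy_players` and its values are illustrative only, not rows from `players.csv`. Placing `TRUE` first in the levels of `subscribe` is an assumption made here so that tidymodels metrics, which treat the first factor level as the event by default, would score "subscribed" as the positive class.

```r
library(dplyr)
library(tibble)

# Made-up stand-in with the same column types as the modelling columns of players.csv
toy_players <- tibble(
  experience   = c("Pro", "Veteran", "Amateur"),
  subscribe    = c(TRUE, FALSE, TRUE),
  gender       = c("Male", "Female", "Male"),
  played_hours = c(12.5, 0.0, 1.2),
  Age          = c(18, 25, 31)
)

toy_typed <- toy_players %>%
  mutate(
    experience = as.factor(experience),
    gender     = as.factor(gender),
    # Explicit level order: "TRUE" first, so it is treated as the event level
    subscribe  = factor(as.character(subscribe), levels = c("TRUE", "FALSE"))
  )

glimpse(toy_typed)
```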

In [3]:
summary(players)

  experience        subscribe       hashedEmail         played_hours    
 Length:196         Mode :logical   Length:196         Min.   :  0.000  
 Class :character   FALSE:52        Class :character   1st Qu.:  0.000  
 Mode  :character   TRUE :144       Mode  :character   Median :  0.100  
                                                       Mean   :  5.846  
                                                       3rd Qu.:  0.600  
                                                       Max.   :223.100  
                                                                        
     name              gender               Age       
 Length:196         Length:196         Min.   : 9.00  
 Class :character   Class :character   1st Qu.:17.00  
 Mode  :character   Mode  :character   Median :19.00  
                                       Mean   :21.14  
                                       3rd Qu.:22.75  
                                       Max.   :58.00  
                               

# Part 2: Questions

##### **Question:**
What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?
##### **Specific Question:**
Can demographic and gameplay-related features like `Age`, `gender`, `experience`, and `played_hours` predict whether a player will `subscribe` to the Minecraft research server newsletter?

##### Specific Question Explanation: 


# Part 3: Exploratory Data Analysis and Visualization

# Part 4: Methods and Plan

#### The Method and Why it is Appropriate
To address the research question, a **logistic regression model** will be used. This method is appropriate because it is designed for a **binary response variable** such as `subscribe`, which has only two possible outcomes (true/false). It also accepts multiple explanatory variables, such as `Age`, `gender`, `experience`, and `played_hours`, and its coefficients indicate which of them most strongly influence the likelihood of subscription.
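
As a rough illustration of the method, the sketch below fits a logistic regression with base R's `glm()`. The data frame `toy_players` is simulated purely so the example runs on its own (its values and the simulated link between `played_hours` and `subscribe` are assumptions, not findings from `players.csv`); only the column names follow the dataset.

```r
# Synthetic stand-in for the players data (all values simulated)
set.seed(2024)
n <- 200
toy_players <- data.frame(
  Age          = sample(9:58, n, replace = TRUE),
  gender       = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  experience   = factor(sample(c("Amateur", "Regular", "Pro", "Veteran"), n, replace = TRUE)),
  played_hours = round(rexp(n, rate = 0.2), 1)
)
# Simulated outcome: subscribing becomes more likely with more hours played
toy_players$subscribe <- rbinom(n, 1, plogis(-0.5 + 0.15 * toy_players$played_hours)) == 1

# family = binomial gives logistic regression; a logical response is modelled as P(TRUE)
fit <- glm(subscribe ~ Age + gender + experience + played_hours,
           data = toy_players, family = binomial)

# Exponentiated coefficients are odds ratios (values above 1 raise the odds of subscribing)
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 2))

# Predicted subscription probabilities for the first few players
pred_prob <- predict(fit, type = "response")
head(pred_prob)
```

Each odds ratio describes the multiplicative change in the odds of subscribing for a one-unit increase in a numeric predictor, or relative to the reference level for a categorical one, holding the other predictors fixed.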

#### Assumptions
Logistic regression gives unbiased, reliable predictions and coefficient interpretations only when its assumptions hold. These assumptions are:
- **Independence of observations:** Each player is an independent data point, and a player's subscription status does not influence another player's.
- **Linearity in the logit:** The relationship between each numeric predictor and the log-odds of the response variable should be approximately linear.
- **No multicollinearity:** Explanatory variables should not be highly correlated with each other. 
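- **Large sample size:** The sample should be large enough, relative to the number of predictors, for the coefficient estimates to be stable and to limit overfitting. The smaller outcome class matters most here: only 52 of the 196 players did not subscribe.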
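
Two of these assumptions can be checked numerically. The sketch below, run on simulated data (`toy_players` is a made-up stand-in, not `players.csv`), counts observations in the rarer outcome class per estimated slope and computes a variance inflation factor (VIF) for each design-matrix column. The thresholds usually quoted (roughly 10 or more observations per slope, and a VIF above about 5 to 10 as a warning sign) are rules of thumb rather than strict tests, and the per-column VIF is only a rough check for multi-level factors.

```r
# Synthetic stand-in for the players data (all values simulated)
set.seed(2024)
n <- 200
toy_players <- data.frame(
  Age          = sample(9:58, n, replace = TRUE),
  gender       = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  experience   = factor(sample(c("Amateur", "Regular", "Pro", "Veteran"), n, replace = TRUE)),
  played_hours = round(rexp(n, rate = 0.2), 1)
)
toy_players$subscribe <- rbinom(n, 1, plogis(-0.5 + 0.15 * toy_players$played_hours)) == 1

# Design matrix with dummy columns, as the model would see the predictors
X <- model.matrix(~ Age + gender + experience + played_hours, data = toy_players)
n_slopes <- ncol(X) - 1  # exclude the intercept column

# Sample size: observations in the rarer outcome class per estimated slope
events_per_predictor <- min(table(toy_players$subscribe)) / n_slopes

# Multicollinearity: VIF = 1 / (1 - R^2), regressing each column on all the others
vif <- sapply(colnames(X)[-1], function(col) {
  others <- X[, setdiff(colnames(X), c("(Intercept)", col)), drop = FALSE]
  r2 <- summary(lm(X[, col] ~ others))$r.squared
  1 / (1 - r2)
})

print(events_per_predictor)
print(round(vif, 2))
```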

#### Potential Limitations and Weaknesses 
A limitation of logistic regression is that it assumes a linear relationship between the predictors and the log-odds of the outcome. Unless non-linear terms are added explicitly, it cannot capture non-linear patterns and may oversimplify complex player behaviours.
Additionally, logistic regression is designed for binary classification, so it cannot be applied to continuous response variables. The model is also sensitive to outliers: `played_hours` has a median of 0.1 but a maximum of 223.1, and such extreme values can distort coefficient estimates and reduce model stability.

#### Compare and Select
The final model will be selected based on both predictive performance and interpretability. Accuracy, precision, and the area under the ROC curve will measure how well each candidate model distinguishes subscribed from non-subscribed players. Because 144 of the 196 players (about 73%) subscribed, accuracy will be judged against that majority-class baseline rather than against 50%. Additionally, 5-fold cross-validation will be applied to the training set to tune the model and detect overfitting.
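
One way the cross-validated comparison could look in tidymodels is sketched below. It is an illustration under stated assumptions: `toy_players` is simulated and stands in for the training set, only two predictors are used to keep the example short, and `subscribe` is stored as a factor with `"TRUE"` as the first level so that precision is computed for the subscribed class.

```r
library(tidymodels)

# Synthetic stand-in for the training data (all values simulated)
set.seed(2024)
n <- 200
toy_players <- tibble(
  Age          = sample(9:58, n, replace = TRUE),
  played_hours = round(rexp(n, rate = 0.2), 1),
  subscribe    = factor(
    rbinom(n, 1, plogis(-0.5 + 0.15 * played_hours)) == 1,
    levels = c("TRUE", "FALSE")
  )
)

# Five folds, each keeping a similar share of subscribers
folds <- vfold_cv(toy_players, v = 5, strata = subscribe)

logit_spec <- logistic_reg() %>%
  set_engine("glm")

cv_results <- workflow() %>%
  add_formula(subscribe ~ Age + played_hours) %>%
  add_model(logit_spec) %>%
  fit_resamples(
    resamples = folds,
    metrics   = metric_set(accuracy, precision, roc_auc)
  )

# One row per metric, averaged over the five folds
cv_metrics <- collect_metrics(cv_results)
cv_metrics
```

Running the same resampling for each candidate workflow and comparing the `mean` column gives a like-for-like comparison that never touches the test set.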

#### Data Processing Plan
The dataset will first be cleaned by removing rows with missing `Age` values, transforming `played_hours` with log(1 + x) to reduce skewness (a plain log is undefined for the many players with zero hours), and encoding the categorical variables (`experience` and `gender`) as dummy variables. The data will then be split into training (70%) and testing (30%) sets, stratified by `subscribe` so that both sets keep a similar subscription rate. Within the training data, 5-fold cross-validation will be used to tune the model and assess generalization before the final predictive performance is measured on the test set.
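
The processing plan above can be sketched with tidymodels as follows. This is one possible arrangement, not a finished pipeline: `toy_players` is simulated with two missing `Age` values inserted on purpose, and the object names (`players_clean`, `players_recipe`, and so on) are placeholders chosen for this example.

```r
library(tidymodels)

# Synthetic stand-in for players.csv (all values simulated)
set.seed(2024)
n <- 200
toy_players <- tibble(
  experience   = factor(sample(c("Amateur", "Regular", "Pro", "Veteran"), n, replace = TRUE)),
  gender       = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  played_hours = round(rexp(n, rate = 0.2), 1),
  Age          = replace(sample(9:58, n, replace = TRUE), c(5, 50), NA),
  subscribe    = factor(
    sample(c("TRUE", "FALSE"), n, replace = TRUE, prob = c(0.75, 0.25)),
    levels = c("TRUE", "FALSE")
  )
)

# Step 1: drop rows with a missing Age
players_clean <- toy_players %>%
  drop_na(Age)

# Step 2: 70/30 split, stratified by the response
players_split <- initial_split(players_clean, prop = 0.7, strata = subscribe)
players_train <- training(players_split)
players_test  <- testing(players_split)

# Step 3: log(1 + x) transform and dummy encoding, defined on the training set
players_recipe <- recipe(
  subscribe ~ Age + gender + experience + played_hours,
  data = players_train
) %>%
  step_log(played_hours, offset = 1) %>%
  step_dummy(all_nominal_predictors())

# Inspect the processed training data
baked <- players_recipe %>%
  prep() %>%
  bake(new_data = NULL)
glimpse(baked)
```

Keeping the transformations inside a recipe means the same steps are reapplied within each cross-validation fold and to the test set, rather than being computed once on the full data.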

# Part 5: GitHub Repository