# 1. Data Description
### 1.1 Read in Data and Preview

In [1]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [3]:
# Read the data
players  <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [4]:
# Preview datasets
head(players)
head(sessions)

experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17


hashedEmail,start_time,end_time,original_start_time,original_end_time
<chr>,<chr>,<chr>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,30/06/2024 18:12,30/06/2024 18:24,1719770000000.0,1719770000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,17/06/2024 23:33,17/06/2024 23:46,1718670000000.0,1718670000000.0
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,25/07/2024 17:34,25/07/2024 17:57,1721930000000.0,1721930000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,25/07/2024 03:22,25/07/2024 03:58,1721880000000.0,1721880000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,25/05/2024 16:01,25/05/2024 16:12,1716650000000.0,1716650000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,23/06/2024 15:08,23/06/2024 17:10,1719160000000.0,1719160000000.0
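
Because `start_time` and `end_time` were read as character columns, they have to be parsed before session lengths can be computed. The following is a minimal sketch, assuming the strings follow the `dd/mm/yyyy HH:MM` pattern seen in the preview and that UTC is an acceptable time zone (the data does not state one). The names `toy_sessions`, `toy_sessions_parsed`, and `duration_mins` are illustrative and not part of the original analysis.

```r
library(dplyr)
library(lubridate)

# Toy rows copied from the sessions preview (hashes shortened for readability)
toy_sessions <- tibble(
  hashedEmail = c("bfce39c8", "36d9cbb4"),
  start_time  = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time    = c("30/06/2024 18:24", "17/06/2024 23:46")
)

toy_sessions_parsed <- toy_sessions |>
  mutate(
    # dmy_hm() parses day/month/year hour:minute strings; tz = "UTC" is an assumption
    start_time    = dmy_hm(start_time, tz = "UTC"),
    end_time      = dmy_hm(end_time, tz = "UTC"),
    # Session length in minutes
    duration_mins = as.numeric(difftime(end_time, start_time, units = "mins"))
  )

toy_sessions_parsed
```

Applied to the full `sessions` tibble, the same `mutate()` call would leave `NA` in `end_time` and `duration_mins` for any row whose end time is missing.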


### 1.2 Overview: Number of Observations and Variables

In [5]:
dataset_overview <- tibble(
  dataset        = c("players", "sessions"),
  n_observations = c(nrow(players), nrow(sessions)),
  n_variables    = c(ncol(players), ncol(sessions))
)

dataset_overview

dataset,n_observations,n_variables
<chr>,<int>,<int>
players,196,7
sessions,1535,5


### 1.3 Variable Summaries (players.csv, sessions.csv)

In [12]:
players_var_summary <- tibble(
  variable = c("experience", "subscribe", "hashedEmail",
               "played_hours", "name", "gender", "Age"),
  
  type = c("categorical", "categorical", "character",
           "numeric", "character", "categorical", "numeric"),
  
  description = c(
    "Self-reported experience level",
    "Whether the player subscribed to the game-related newsletter (TRUE/FALSE)",
    "Hashed email identifier used to link to sessions.csv",
    "Total hours the player has spent on the server",
    "Player's in-game name (may not be unique)",
    "Player gender",
    "Player age in years"
  )
)

players_var_summary

sessions_var_summary <- tibble(
  variable = c("hashedEmail", "start_time", "end_time",
               "original_start_time", "original_end_time"),
  
  type = c("character", "character", "character",
           "numeric", "numeric"),
  
  description = c(
    "Hashed email identifier used to link to players.csv",
    "Session start time as readable string (dd/mm/yyyy HH:MM)",
    "Session end time as readable string",
    "Session start time as system clock value",
    "Session end time as system clock value"
  )
)

sessions_var_summary


variable,type,description
<chr>,<chr>,<chr>
experience,categorical,Self-reported experience level
subscribe,categorical,Whether the player subscribed to the game-related newsletter (TRUE/FALSE)
hashedEmail,character,Hashed email identifier used to link to sessions.csv
played_hours,numeric,Total hours the player has spent on the server
name,character,Player's in-game name (may not be unique)
gender,categorical,Player gender
Age,numeric,Player age in years


variable,type,description
<chr>,<chr>,<chr>
hashedEmail,character,Hashed email identifier used to link to players.csv
start_time,character,Session start time as readable string (dd/mm/yyyy HH:MM)
end_time,character,Session end time as readable string
original_start_time,numeric,Session start time as system clock value
original_end_time,numeric,Session end time as system clock value
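
The `original_start_time` and `original_end_time` values have the magnitude of Unix timestamps in milliseconds. This reading is inferred from the values and is not documented in the data, so treat it as an assumption. Under that assumption, the first value in the preview converts as follows:

```r
# First original_start_time value shown in the sessions preview
original_start_time <- 1719770000000

# Assumption: the value counts milliseconds since 1970-01-01 00:00:00 UTC
start_utc <- as.POSIXct(original_start_time / 1000, origin = "1970-01-01", tz = "UTC")

format(start_utc, "%Y-%m-%d %H:%M:%S")
# "2024-06-30 17:53:20"
```

The converted date matches the `30/06/2024` in the first `start_time` string, but the time differs from `18:12`, and the start and end values in the preview are identical. The numeric columns therefore appear to be stored or displayed at reduced precision, which suggests the string columns are the better source for session lengths.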


### 1.4 Numeric Summary Statistics (2 decimals)

In [19]:
players_numeric_summary <- players |>
  summarise(
    played_hours_mean = round(mean(played_hours, na.rm = TRUE), 2),
    played_hours_sd   = round(sd(played_hours, na.rm = TRUE), 2),
    played_hours_min  = round(min(played_hours, na.rm = TRUE), 2),
    played_hours_max  = round(max(played_hours, na.rm = TRUE), 2),
    Age_mean          = round(mean(Age, na.rm = TRUE), 2),
    Age_sd            = round(sd(Age, na.rm = TRUE), 2),
    Age_min           = round(min(Age, na.rm = TRUE), 2),
    Age_max           = round(max(Age, na.rm = TRUE), 2)
  )

players_numeric_summary

sessions_numeric_summary <- sessions |>
  summarise(
    original_start_time_mean = round(mean(original_start_time, na.rm = TRUE), 2),
    original_start_time_sd   = round(sd(original_start_time, na.rm = TRUE), 2),
    original_start_time_min  = round(min(original_start_time, na.rm = TRUE), 2),
    original_start_time_max  = round(max(original_start_time, na.rm = TRUE), 2),
    original_end_time_mean   = round(mean(original_end_time, na.rm = TRUE), 2),
    original_end_time_sd     = round(sd(original_end_time, na.rm = TRUE), 2),
    original_end_time_min    = round(min(original_end_time, na.rm = TRUE), 2),
    original_end_time_max    = round(max(original_end_time, na.rm = TRUE), 2)
  )

sessions_numeric_summary

played_hours_mean,played_hours_sd,played_hours_min,played_hours_max,Age_mean,Age_sd,Age_min,Age_max
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
5.85,28.36,0,223.1,21.14,7.39,9,58


original_start_time_mean,original_start_time_sd,original_start_time_min,original_start_time_max,original_end_time_mean,original_end_time_sd,original_end_time_min,original_end_time_max
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1719201000000.0,3557491589,1712400000000.0,1727330000000.0,1719196000000.0,3552813134,1712400000000.0,1727340000000.0
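
The summary code above repeats the same four statistics for every column. As an alternative sketch, `across()` can apply a named list of functions to several columns at once and produces the same `column_statistic` names. The toy tibble `toy_players` is illustrative data, not the real file.

```r
library(dplyr)

# Toy stand-in for the numeric columns of players
toy_players <- tibble(
  played_hours = c(30.3, 3.8, 0.0, 0.7),
  Age          = c(9, 17, 17, NA)
)

toy_summary <- toy_players |>
  summarise(
    across(
      c(played_hours, Age),
      list(
        mean = \(x) round(mean(x, na.rm = TRUE), 2),
        sd   = \(x) round(sd(x, na.rm = TRUE), 2),
        min  = \(x) round(min(x, na.rm = TRUE), 2),
        max  = \(x) round(max(x, na.rm = TRUE), 2)
      )
    )
  )

# Columns are named played_hours_mean, played_hours_sd, ..., Age_max
toy_summary
```

Replacing `toy_players` with `players` (or using the two timestamp columns of `sessions`) should give the same layout as the tables above.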


### 1.5 Categorical Summaries

In [20]:
# players: experience
players_experience_counts <- players |>
  count(experience)

players_experience_counts

# players: subscribe
players_subscribe_counts <- players |>
  count(subscribe)

players_subscribe_counts

# players: gender
players_gender_counts <- players |>
  count(gender)

players_gender_counts

experience,n
<chr>,<int>
Amateur,63
Beginner,35
Pro,14
Regular,36
Veteran,48


subscribe,n
<lgl>,<int>
False,52
True,144


gender,n
<chr>,<int>
Agender,2
Female,37
Male,124
Non-binary,15
Other,1
Prefer not to say,11
Two-Spirited,6
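
Counts are easier to compare as proportions. The sketch below rebuilds the `subscribe` table from the counts printed above and adds a `prop` column; `subscribe_counts` and `subscribe_props` are illustrative names.

```r
library(dplyr)

# Counts copied from the subscribe table above
subscribe_counts <- tibble(
  subscribe = c(FALSE, TRUE),
  n         = c(52L, 144L)
)

subscribe_props <- subscribe_counts |>
  mutate(prop = round(n / sum(n), 3))

subscribe_props
```

About 73% of players subscribed, so the two classes are imbalanced. On the real data the same column can be added with `players |> count(subscribe) |> mutate(prop = n / sum(n))`.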


### 1.6 Missing Values

In [22]:
players_missing <- tibble(
  variable   = c("experience", "subscribe", "hashedEmail",
                 "played_hours", "name", "gender", "Age"),
  n_missing  = c(
    sum(is.na(players$experience)),
    sum(is.na(players$subscribe)),
    sum(is.na(players$hashedEmail)),
    sum(is.na(players$played_hours)),
    sum(is.na(players$name)),
    sum(is.na(players$gender)),
    sum(is.na(players$Age))
  )
)

players_missing

sessions_missing <- tibble(
  variable  = c("hashedEmail", "start_time", "end_time",
                "original_start_time", "original_end_time"),
  n_missing = c(
    sum(is.na(sessions$hashedEmail)),
    sum(is.na(sessions$start_time)),
    sum(is.na(sessions$end_time)),
    sum(is.na(sessions$original_start_time)),
    sum(is.na(sessions$original_end_time))
  )
)

sessions_missing

variable,n_missing
<chr>,<int>
experience,0
subscribe,0
hashedEmail,0
played_hours,0
name,0
gender,0
Age,2


variable,n_missing
<chr>,<int>
hashedEmail,0
start_time,0
end_time,2
original_start_time,0
original_end_time,2
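
The per-column `sum(is.na(...))` calls can also be written without listing each variable by hand. This is a sketch on a small illustrative tibble (`toy_sessions`), using `across(everything(), ...)` followed by `pivot_longer()` to get the same `variable` / `n_missing` layout.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for sessions with one missing end time
toy_sessions <- tibble(
  hashedEmail       = c("a", "b", "c"),
  end_time          = c("30/06/2024 18:24", NA, "25/07/2024 17:57"),
  original_end_time = c(1719770000000, NA, 1721930000000)
)

toy_missing <- toy_sessions |>
  # Count NA values in every column
  summarise(across(everything(), \(x) sum(is.na(x)))) |>
  # Reshape from one wide row to one row per variable
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing")

toy_missing
```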


### 1.7 Data Quality

#### Uniqueness of hashedEmail in players

In [23]:
players_hashedEmail_uniqueness <- players |>
  summarise(
    n_players          = n(),
    n_unique_hashed_id = length(unique(hashedEmail))
  )

players_hashedEmail_uniqueness

n_players,n_unique_hashed_id
<int>,<int>
196,196


#### Number of players with at least one session

In [24]:
players_with_sessions <- sessions |>
  select(hashedEmail) |>
  distinct() |>
  inner_join(players, by = "hashedEmail") |>
  summarise(n_players_with_sessions = n())

players_with_sessions

n_players_with_sessions
<int>
125
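
With 196 players and 125 of them appearing in `sessions`, 71 players have no recorded session. The sketch below shows one way to list such players with `anti_join()` and to count sessions per player while keeping the zero-session players. The toy tibbles and the `n_sessions` column name are illustrative.

```r
library(dplyr)

# Toy stand-ins: four players, three session rows
toy_players  <- tibble(hashedEmail = c("a", "b", "c", "d"))
toy_sessions <- tibble(hashedEmail = c("a", "a", "c"))

# Players with no matching row in sessions
players_without_sessions <- toy_players |>
  anti_join(toy_sessions, by = "hashedEmail")

# Sessions per player; players without sessions get 0 instead of NA
sessions_per_player <- toy_players |>
  left_join(count(toy_sessions, hashedEmail, name = "n_sessions"),
            by = "hashedEmail") |>
  mutate(n_sessions = coalesce(n_sessions, 0L))

players_without_sessions
sessions_per_player
```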


# 2. Question Establishment

### Chosen Broad Question: 
Question 1: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

### Chosen Specific Question: 
Can a player’s experience level, age, gender, and total hours played predict whether they subscribe to the newsletter in the players.csv dataset?

### Description
* Response variable: **subscribe** is chosen because it is a TRUE/FALSE (two-class categorical) variable, which makes it a suitable target for classification algorithms
* Explanatory variables:
    * **experience** (factor: Pro, Veteran, etc.)
    * **played_hours** (numeric)
    * **Age** (numeric)
    * **gender** (categorical)
* How data will help answer question:
    * players.csv directly provides all predictors.
    * No merging with sessions.csv is needed unless session-level features are wanted, e.g., number of sessions per player
    * subscribe is cleanly provided as a categorical variable with only 2 classes and no missing values
* Wrangling needed:
    * Convert categorical variables (experience, gender) to factors
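    * Check for missing values (NA or ####) in all predictors; Section 1.6 found 2 missing Age values
    * Possibly normalize played_hours, which is heavily right-skewed (mean 5.85, max 223.1), since k-NN relies on distances
    * Produce a tidy tibble by removing redundant or identifier columns (hashedEmail)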
* Predictive method:
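    * k-NN classification, which suits a two-class response; the categorical predictors would need a numeric (e.g., dummy) encoding because k-NN relies on distances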
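
The wrangling steps listed above can be sketched as follows on a small illustrative tibble (`toy_players`). Two choices here are assumptions rather than decisions stated in this plan: rows with a missing `Age` are dropped (imputation would be an alternative), and `subscribe` is also converted to a factor, since classification tools typically expect a factor response.

```r
library(dplyr)

# Toy stand-in with the same columns as players.csv
toy_players <- tibble(
  experience   = c("Pro", "Veteran", "Amateur", "Regular"),
  subscribe    = c(TRUE, TRUE, FALSE, TRUE),
  hashedEmail  = c("h1", "h2", "h3", "h4"),
  played_hours = c(30.3, 3.8, 0.0, 0.7),
  name         = c("A", "B", "C", "D"),
  gender       = c("Male", "Male", "Female", "Male"),
  Age          = c(9, 17, NA, 21)
)

players_tidy <- toy_players |>
  # Keep the response and the four predictors; drops hashedEmail and name
  select(subscribe, experience, gender, played_hours, Age) |>
  # Assumption: remove rows with missing Age rather than imputing
  filter(!is.na(Age)) |>
  mutate(
    subscribe           = factor(subscribe, levels = c(FALSE, TRUE)),
    experience          = factor(experience),
    gender              = factor(gender),
    # Centre and scale so both numeric predictors contribute comparably to distances
    played_hours_scaled = as.numeric(scale(played_hours)),
    Age_scaled          = as.numeric(scale(Age))
  )

players_tidy
```

In a full modelling workflow the scaling would normally be estimated on the training split only, so that no information from the test split leaks into preprocessing.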