In [2]:
library(tidyverse)
library(repr)
library(tidymodels)
library(GGally)
library(ISLR)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

## 1. Data Description

In [16]:
players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [28]:
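# Subscription counts and proportions, computed from the data instead of hard-coded
subscribe_summary <- players |>
  summarise(true = sum(subscribe, na.rm = TRUE),
    false = sum(!subscribe, na.rm = TRUE),
    t = mean(subscribe, na.rm = TRUE),
    f = mean(!subscribe, na.rm = TRUE))

# Mean, minimum, and maximum of total hours played
played_hours_summary <- players |>
  summarise(mean = round(mean(played_hours, na.rm = TRUE), 2),
    min = min(played_hours, na.rm = TRUE),
    max = max(played_hours, na.rm = TRUE))

# Mean, minimum, and maximum of player age (missing ages are dropped)
age_summary <- players |>
  summarise(mean = round(mean(Age, na.rm = TRUE), 2),
    min = min(Age, na.rm = TRUE),
    max = max(Age, na.rm = TRUE))

subscribe_summary
played_hours_summary
age_summary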

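| true | false | t | f |
|------|-------|---|---|
| `<int>` | `<int>` | `<dbl>` | `<dbl>` |
| 144 | 52 | 0.7346939 | 0.2653061 |


| mean | min | max |
|------|-----|-----|
| `<dbl>` | `<dbl>` | `<dbl>` |
| 5.85 | 0 | 223.1 |


| mean | min | max |
|------|-----|-----|
| `<dbl>` | `<dbl>` | `<dbl>` |
| 21.14 | 9 | 58 |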


### 1.1 Summary for `players.csv`

General information
- Number of observations: 196

- Number of variables: 7

- Purpose: Contains player profile and activity data.


| Variable Name | Type | Description | Summary |
|----------------|------|--------------|--------------------------|
| experience | chr | Player experience level | N/A |
| hashedEmail | chr | Unique identifier (hashed email) for linking datasets | N/A |
| name | chr | Player name | N/A |
| gender | chr | Player gender identity | N/A |
| subscribe | lgl | Whether the player has an active subscription | 144 True (73.5%), 52 False (26.5%) |
| played_hours | dbl | Total number of hours played | Mean = 5.85, Min = 0, Max = 223.1 |
| Age | dbl | Player age | Mean = 21.14, Min = 9, Max = 58, 2 NAs |
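
The Summary column can be produced in a single `summarise()` call. The sketch below is one possible way to do it, not necessarily how the numbers above were obtained. It uses a small toy tibble, `players_toy`, whose rows are invented for illustration; only the column names come from `players.csv`.

```r
library(dplyr)

# Toy stand-in for `players`; these rows are invented for illustration only.
players_toy <- tibble(
  subscribe    = c(TRUE, TRUE, FALSE, TRUE),
  played_hours = c(0, 1.5, 0.2, 30.3),
  Age          = c(17, NA, 21, 25)
)

players_summary <- players_toy |>
  summarise(
    n_subscribed    = sum(subscribe, na.rm = TRUE),
    prop_subscribed = mean(subscribe, na.rm = TRUE),  # share of TRUE values
    mean_hours      = round(mean(played_hours, na.rm = TRUE), 2),
    mean_age        = round(mean(Age, na.rm = TRUE), 2),
    age_missing     = sum(is.na(Age))                 # report NAs explicitly
  )

players_summary
```

Taking the mean of a logical column gives the proportion of `TRUE` values directly, so the percentages do not need to be typed in by hand.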

### 1.2 Summary for `sessions.csv`

General information
- Number of observations: 1,535

- Number of variables: 5

- Purpose: Records game session information per player.


| Variable Name | Type | Description |
|----------------|------|-------------|
| hashedEmail | chr | Player identifier |
| start_time | chr | Session start time |
| end_time | chr | Session end time |
| original_start_time | dbl | System-recorded session start |
| original_end_time | dbl | System-recorded session end |
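
Because `hashedEmail` appears in both files, it can serve as the key for linking sessions to players. The following is a minimal sketch of that idea, assuming one row per player in the player table. The tibbles `players_toy` and `sessions_toy`, their IDs, and the `n_sessions` column are hypothetical and exist only for illustration.

```r
library(dplyr)

# Toy stand-ins; the IDs and values are invented for illustration only.
players_toy <- tibble(
  hashedEmail  = c("a1", "b2", "c3"),
  played_hours = c(0.5, 12.0, 0)
)
sessions_toy <- tibble(
  hashedEmail = c("a1", "b2", "b2", "b2")
)

# Count sessions per player, then attach the counts to the player table.
session_counts <- sessions_toy |>
  count(hashedEmail, name = "n_sessions")

players_with_sessions <- players_toy |>
  left_join(session_counts, by = "hashedEmail") |>
  mutate(n_sessions = coalesce(n_sessions, 0L))  # players with no sessions get 0

players_with_sessions
```

A `left_join()` keeps every player, including those who never started a session, which an `inner_join()` would silently drop.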

### 1.3 Potential issues

- Missing Data: 2 missing entries for `Age`, and 2 missing values in `end_time` and `original_end_time`.
- Categorical Imbalance: Most players are male, which suggests that `gender` may not be a useful predictor.
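
Both issues can be checked with a few lines of dplyr. The sketch below counts missing values in every column and computes the share of each `gender` category. The `players_toy` tibble and its rows are invented for illustration, so the numbers it prints say nothing about the real data.

```r
library(dplyr)

# Toy stand-in for `players`; these rows are invented for illustration only.
players_toy <- tibble(
  gender = c("Male", "Male", "Female", "Male", "Non-binary"),
  Age    = c(17, NA, 21, 25, NA)
)

# Number of missing values in every column.
missing_counts <- players_toy |>
  summarise(across(everything(), ~ sum(is.na(.x))))

# Count and share of each gender category, largest first.
gender_share <- players_toy |>
  count(gender, sort = TRUE) |>
  mutate(prop = n / sum(n))

missing_counts
gender_share
```

Running the same two pipelines on `players` and `sessions` would show how many rows are affected by missing values and how dominant the largest gender category is.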