# Exploring Data from a Video Game Research Server

In [1]:
### Please run this cell before moving forward:
library(tidyverse)
library(repr)
library(tidymodels)
library(cowplot)
options(repr.matrix.max.rows = 6)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

In [28]:
players_data <- read.csv("data/players.csv")
players_data

experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<int>
Pro,TRUE,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,TRUE,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,FALSE,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
⋮,⋮,⋮,⋮,⋮,⋮,⋮
Amateur,FALSE,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db299bd4fedb06a46ad5bb,0.0,Dylan,Prefer not to say,57
Amateur,FALSE,f19e136ddde68f365afc860c725ccff54307dedd13968e896a9f890c40aea436,2.3,Harlow,Male,17
Pro,TRUE,d9473710057f7d42f36570f0be83817a4eea614029ff90cf50d8889cdd729d11,0.2,Ahmed,Other,


In [7]:
sessions_data <- read.csv("data/sessions.csv")
sessions_data

hashedEmail,start_time,end_time,original_start_time,original_end_time
<chr>,<chr>,<chr>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,30/06/2024 18:12,30/06/2024 18:24,1.71977e+12,1.71977e+12
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,17/06/2024 23:33,17/06/2024 23:46,1.71867e+12,1.71867e+12
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,25/07/2024 17:34,25/07/2024 17:57,1.72193e+12,1.72193e+12
⋮,⋮,⋮,⋮,⋮
fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33cbb5e894a3867ca44d,28/07/2024 15:36,28/07/2024 15:57,1.72218e+12,1.72218e+12
fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33cbb5e894a3867ca44d,25/07/2024 06:15,25/07/2024 06:22,1.72189e+12,1.72189e+12
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,20/05/2024 02:26,20/05/2024 02:45,1.71617e+12,1.71617e+12


## Data Description

#### I have chosen to work with the players dataset from "players.csv".

In [12]:
number_observations <- nrow(players_data)
number_variables    <- ncol(players_data)
# cat() prints its arguments and returns NULL, so there is nothing useful to assign
cat("Observations:", number_observations, 
    " Variables:", number_variables)


Observations: 196  Variables: 7

- There are only 196 players, so only 196 observations to work with. **With so little data, a stratified split combined with cross-validation (e.g. 5-fold) is probably the best approach to training and validation**.

- Two of the 7 variables provided will most likely **not be used** for any statistical analysis and modelling regardless of the question chosen. These are:
    1.  "hashedEmail"
    2.  "name"
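The stratified cross-validation idea can be sketched with `vfold_cv()` from rsample (attached by tidymodels above). This is only a minimal sketch: `toy_players` is a made-up stand-in for `players_data`, and stratifying on `subscribe` is an assumption, since the response variable has not been chosen yet.

```r
library(rsample)

# Toy stand-in for players_data (values are made up for illustration only)
set.seed(2024)
toy_players <- data.frame(
    subscribe    = rep(c(TRUE, FALSE), times = c(30, 10)),
    played_hours = round(rexp(40, rate = 0.2), 1)
)

# Convert the logical column to a factor so the strata are explicit classes
toy_players$subscribe <- factor(toy_players$subscribe)

# 5 folds, each aiming to keep the TRUE/FALSE ratio of the full data
folds <- vfold_cv(toy_players, v = 5, strata = subscribe)
folds
```

With the real data, `toy_players` would be replaced by `players_data` and `strata` by whichever column ends up being predicted.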

In [32]:
statistical_summary_played_hours <- players_data |> 
    summarize(
        played_hours_mean   = round(mean(played_hours, na.rm = TRUE), 2),
        played_hours_median = round(median(played_hours, na.rm = TRUE), 2),
        played_hours_min    = round(min(played_hours, na.rm = TRUE), 2),
        played_hours_max    = round(max(played_hours, na.rm = TRUE), 2))

played_none <- players_data |>
    filter(played_hours == 0) |>
    summarize(count = n(), .groups = "drop")

statistical_summary_played_hours
played_none

played_hours_mean,played_hours_median,played_hours_min,played_hours_max
<dbl>,<dbl>,<dbl>,<dbl>
5.85,0.1,0,223.1


count
<int>
85


**Played Hours Numeric Summary:**
- The mean (5.85 hours) is far above the median (0.1 hours) and the maximum is 223.1 hours, so the data show a **strongly right-skewed distribution**. Counting the players with 0 recorded hours shows that a rather large number of players **(85 players, 43.37%) have not played the game and are still included in the study**. This mass at zero pulls the median towards 0, while a few heavy players stretch the right tail.
- This prompts me to investigate the idea further via a **histogram visualization**.
- This observation also suggests that it might be interesting to ask **what characteristics predict which players will contribute significantly to the project**, assuming that a **large amount of data contributed per player** produces better results for the Pacific Team (relating to **Question 2**).
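A rough sketch of the histogram check described above. The values in `toy_players` are invented for illustration and are not taken from `players.csv`; the `log1p()` transform is one possible choice for a heavily skewed variable, not a decision already made in this analysis.

```r
library(ggplot2)

# Toy stand-in for players_data$played_hours (made-up values, not the real data)
toy_players <- data.frame(
    played_hours = c(0, 0, 0, 0, 0.1, 0.3, 1.2, 3.8, 30.3, 120)
)

# Share of players with zero recorded hours, as a percentage
zero_share <- round(100 * mean(toy_players$played_hours == 0), 2)

# log1p() keeps the zeros (log1p(0) = 0) while compressing the long right tail
played_hours_hist <- ggplot(toy_players, aes(x = log1p(played_hours))) +
    geom_histogram(bins = 10) +
    labs(x = "log(1 + played hours)", y = "Number of players")

zero_share
played_hours_hist
```

On the untransformed scale almost every player would fall into the first bin, which is why a transformed axis may be easier to read here.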

In [33]:
statistical_summary_age <- players_data |> 
    summarize(
        age_mean            = round(mean(Age, na.rm = TRUE), 2),
        age_median          = round(median(Age, na.rm = TRUE), 2),
        age_min             = round(min(Age, na.rm = TRUE), 2),
        age_max             = round(max(Age, na.rm = TRUE), 2)) 
statistical_summary_age

age_mean,age_median,age_min,age_max
<dbl>,<dbl>,<dbl>,<dbl>
21.14,19,9,58


**Age Numeric Summary:**

- Ages range from 9 to 58, so the study includes some very young and some much older players (potential outliers), but with a median of 19 and a mean of 21.14, most participants are in their late teens to early twenties.
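One way to make "outlier" concrete is the common 1.5 × IQR rule. The sketch below applies it to a made-up vector of ages. The rule itself is an assumption on my part; it is not how the outliers above were identified.

```r
# Toy ages (made up for illustration; the real column is players_data$Age)
toy_age <- c(9, 17, 17, 18, 19, 19, 20, 21, 22, 23, 57)

# First and third quartiles, ignoring any missing ages
q <- quantile(toy_age, probs = c(0.25, 0.75), na.rm = TRUE)
fence_width <- 1.5 * (q[[2]] - q[[1]])

# which() drops NA comparisons, so missing ages are never flagged
is_outlier   <- toy_age < q[[1]] - fence_width | toy_age > q[[2]] + fence_width
age_outliers <- toy_age[which(is_outlier)]
age_outliers
```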


In [23]:
experience_count_per <- players_data |>
    group_by(experience) |>
    summarize(count = n(), .groups = "drop") |>
    mutate(per = round(100 * count / sum(count), 2))

experience_count_per

experience,count,per
<chr>,<int>,<dbl>
Amateur,63,32.14
Beginner,35,17.86
Pro,14,7.14
Regular,36,18.37
Veteran,48,24.49


**Experience - Categorical Composition:**

- Experience is reasonably spread across the five classes; the smallest is "Pro" at around 7%. **A worthwhile question would be how much each class contributes to total play time and how newsletter subscription status varies across classes**.
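The per-class comparison suggested above could look roughly like this. The rows in `toy_players` are invented, and the summary column names (`total_hours`, `subscribe_rate`, `hours_share`) are only illustrative.

```r
library(dplyr)

# Toy stand-in for players_data (made-up rows for illustration)
toy_players <- data.frame(
    experience   = c("Pro", "Pro", "Amateur", "Amateur", "Veteran"),
    subscribe    = c(TRUE, FALSE, TRUE, TRUE, FALSE),
    played_hours = c(10, 0, 2, 8, 5)
)

experience_contribution <- toy_players |>
    group_by(experience) |>
    summarize(
        total_hours    = sum(played_hours, na.rm = TRUE),   # hours contributed by the class
        subscribe_rate = round(100 * mean(subscribe), 2),   # % of the class subscribed
        .groups = "drop") |>
    mutate(hours_share = round(100 * total_hours / sum(total_hours), 2))

experience_contribution
```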

In [27]:
gender_count_per <- players_data |>
    group_by(gender) |>
    summarize(count = n(), .groups = "drop") |>
    mutate(per = round(100 * count / sum(count), 2))

gender_count_per

gender,count,per
<chr>,<int>,<dbl>
Agender,2,1.02
Female,37,18.88
Male,124,63.27
⋮,⋮,⋮
Other,1,0.51
Prefer not to say,11,5.61
Two-Spirited,6,3.06


**Gender - Categorical Composition:**

- There is a **strong class imbalance: the "Male" category** accounts for about **63.27%** of players. Given how many players recorded 0 hours and the size of this imbalance, **it is imperative to check how much each player in this class actually contributes to the study**. *My hypothesis is that many male players signed up but did not engage with the game or the study, which would help explain the large spike at 0 hours played*.
- The category is also **noisy because it is self-reported**: "Other" and "Prefer not to say" responses cannot be assigned to any of the listed categories, and they may stand for genders that are not otherwise represented in the data.
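The hypothesis about zero-hour players can be checked by computing, per gender, the share of players with no recorded play time. A minimal sketch on made-up rows (the real column names are reused, the values are not):

```r
library(dplyr)

# Toy stand-in for players_data (made-up rows for illustration)
toy_players <- data.frame(
    gender       = c("Male", "Male", "Male", "Male", "Female", "Female"),
    played_hours = c(0, 0, 1.5, 0, 0, 2)
)

zero_hours_by_gender <- toy_players |>
    group_by(gender) |>
    summarize(
        players    = n(),
        zero_hours = sum(played_hours == 0, na.rm = TRUE),  # players who never played
        .groups = "drop") |>
    mutate(zero_share = round(100 * zero_hours / players, 2))

zero_hours_by_gender
```

If the hypothesis holds on the real data, the "Male" row should show a `zero_share` clearly above the overall 43.37%.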

In [34]:
subscription_status <- players_data |>
    group_by(subscribe) |>
    summarize(count = n(), .groups = "drop") |>
    mutate(per = round(100 * count / sum(count), 2))

subscription_status

subscribe,count,per
<lgl>,<int>,<dbl>
FALSE,52,26.53
TRUE,144,73.47


**Subscription Status - Categorical Composition:**

- There is a **strong imbalance towards "TRUE"** (73.47%), which suggests strong interest in the experiment. However, **since subscription status can be thought of as a proxy for interest, one should examine the relationship between subscription status and contribution to the experiment (playing time)**, as one would *expect* newsletter subscribers to have higher playing hours on average.
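A first look at that relationship could compare mean and median playing time between subscribers and non-subscribers. The rows below are invented for illustration. Given the skew in `played_hours`, the median is probably the more informative of the two.

```r
library(dplyr)

# Toy stand-in for players_data (made-up rows for illustration)
toy_players <- data.frame(
    subscribe    = c(TRUE, TRUE, TRUE, FALSE, FALSE),
    played_hours = c(0, 2, 10, 0, 1)
)

hours_by_subscription <- toy_players |>
    group_by(subscribe) |>
    summarize(
        players      = n(),
        mean_hours   = round(mean(played_hours, na.rm = TRUE), 2),
        median_hours = round(median(played_hours, na.rm = TRUE), 2),
        .groups = "drop")

hours_by_subscription
```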

## Questions

## Exploratory Data Analysis and Visualization

## Methods and Plan