# Data Description
#### Broad Question Prompt from assignment: 
What player characteristics and behaviours are predictive of subscribing to the game-related newsletter?

#### Specific Question I formulated: 
Can player demographics and play behaviour (e.g. total playtime, number of sessions, player type) predict whether a player subscribes to the newsletter?

#### Dataset to be used: players.csv
I use players.csv rather than sessions.csv because sessions.csv only contains data relevant to playtime analyses, whereas I want to find out how newsletter subscription relates to player-level variables such as experience, played hours, and age.

## 0.0) Loading in the data

In [1]:
getwd()

In [2]:
list.files()

In [3]:
library(tidyverse)
sessions_original <- read_csv("sessions.csv")
players_original <- read_csv("players.csv")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────

In [4]:
# tibbles take `n` (not `rows`) to control how many rows are printed
print(players_original, n = 10)

# A tibble: 196 × 7
   experience subscribe hashedEmail              played_hours name  gender   Age
   <chr>      <lgl>     <chr>                           <dbl> <chr> <chr>  <dbl>
 1 Pro        TRUE      f6daba428a5e19a3d475748…         30.3 Morg… Male       9
 2 Veteran    TRUE      f3c813577c458ba0dfef809…          3.8 Chri… Male      17
 3 Veteran    FALSE     b674dd7ee0d24096d1c0196…          0   Blake Male      17
 4 Amateur    TRUE      23fe711e0e3b77f1da7aa22…          0.7 Flora Female    21
 5 Regular    TRUE      7dc01f10bf20671ecfccdac…          0.1 Kylie Male      21
 6 Amateur    TRUE      f58aad5996a435f16b0284a…          0   Adri… Female    17
 7 Regular    TRUE      8e594b8953193b26f498db9…          0   Luna  Female    19
 8 Amateur    FALSE     1d23

In [5]:
# read_csv() already returns a tibble; as_tibble() just makes that explicit
players_original <- as_tibble(players_original)

In [33]:
# Select the variables used later in the analysis (stored here to keep things tidy)
player_data <- players_original |>
    select(experience, subscribe, played_hours, gender, Age)
player_data

experience,subscribe,played_hours,gender,Age
<chr>,<lgl>,<dbl>,<chr>,<dbl>
Pro,TRUE,30.3,Male,9
Veteran,TRUE,3.8,Male,17
Veteran,FALSE,0.0,Male,17
Amateur,TRUE,0.7,Female,21
Regular,TRUE,0.1,Male,21
Amateur,TRUE,0.0,Female,17
Regular,TRUE,0.0,Female,19
Amateur,FALSE,0.0,Male,21
Amateur,TRUE,0.1,Male,47
Veteran,TRUE,0.0,Female,22


## 1) Numbers of observations/variables in the players.csv dataset

In [32]:
#number of observations/variables (given from source, unaltered):
n_players.csv <- nrow(players_original)
p_players.csv <- ncol(players_original)
cat("players.csv:", n_players.csv, "rows,", p_players.csv, "columns\n")

players.csv: 196 rows, 7 columns


Therefore, in players.csv, there are: 
- 196 observations
- 7 variables

## 2) Variable names, types, and descriptions

In [30]:
desc_of_vars <- c(
    "Self-reported gaming experience level of the player",
    "Whether player has subscribed to the newsletter (TRUE or FALSE)",
    "Hashed email address of player (anonymized identifier)",
    "Total time spent in the game (in hours)",
    "Name of player",
    "Gender of player",
    "Age of player (in years)")

description_players <- tibble(
    variable = names(players_original),
    type = map_chr(players_original, ~class(.x)[1]),
    description = desc_of_vars)

description_players

variable,type,description
<chr>,<chr>,<chr>
experience,character,Self-reported gaming experience level of the player
subscribe,logical,Whether player has subscribed to the newsletter (TRUE or FALSE)
hashedEmail,character,Hashed email address of player (anonymized identifier)
played_hours,numeric,Total time spent in the game (in hours)
name,character,Name of player
gender,character,Gender of player
Age,numeric,Age of player (in years)


#### potential issues
- played_hours: may have extreme outliers
- Age: may have missing values?
- hashedEmail, name: no serious issue but irrelevant to data analysis
- may have inconsistent or duplicated entries
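
The concerns listed above can be checked directly. The following is a minimal sketch in base R, run on a small made-up data frame (`toy_players` is hypothetical and is not the real players.csv). The 1.5 × IQR rule used to flag outliers is one common convention that I am assuming here; it is not specified by the assignment.

```r
# Hypothetical toy data standing in for players.csv (all values are made up)
toy_players <- data.frame(
  experience   = c("Pro", "Veteran", "Amateur", "Amateur", "Regular", "Regular"),
  played_hours = c(0.0, 0.1, 0.3, 0.5, 0.7, 150.0),
  Age          = c(17, 21, NA, 19, 22, 22)
)

# Append an exact copy of row 1 to simulate a duplicated entry
toy_players <- rbind(toy_players, toy_players[1, ])

# Number of missing values in each column
na_counts <- colSums(is.na(toy_players))

# Number of rows that exactly repeat an earlier row
n_duplicates <- sum(duplicated(toy_players))

# Flag unusually large playtimes: above Q3 + 1.5 * IQR (assumed convention)
q3 <- quantile(toy_players$played_hours, 0.75, names = FALSE)
upper_fence <- q3 + 1.5 * IQR(toy_players$played_hours)
outliers <- toy_players$played_hours[toy_players$played_hours > upper_fence]

print(na_counts)
print(n_duplicates)
print(outliers)
```

The same three checks could be run on `players_original` by swapping it in for `toy_players`; whether a flagged value should actually be removed is a separate judgment call.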

#### possible hidden issues
- the "experience" variable is self-reported, so it is subjective and may not be comparable between players
- possible sampling bias: the data may only include players from a certain area
- playtime records may be inaccurate because of technical issues, for example if playtime was recorded by built-in game timers
- it is unclear whether playtime counts only active play or also idle (AFK) time

#### how the data were collected
- The dataset does not specify how the data were collected. However, one can reasonably assume that they were taken from in-game activity reports/logs, recorded automatically by the system

## 3) Summary Statistics

In [31]:
summary(players_original)

  experience        subscribe       hashedEmail         played_hours    
 Length:196         Mode :logical   Length:196         Min.   :  0.000  
 Class :character   FALSE:52        Class :character   1st Qu.:  0.000  
 Mode  :character   TRUE :144       Mode  :character   Median :  0.100  
                                                       Mean   :  5.846  
                                                       3rd Qu.:  0.600  
                                                       Max.   :223.100  
                                                                        
     name              gender               Age       
 Length:196         Length:196         Min.   : 9.00  
 Class :character   Class :character   1st Qu.:17.00  
 Mode  :character   Mode  :character   Median :19.00  
                                       Mean   :21.14  
                                       3rd Qu.:22.75  
                                       Max.   :58.00  
                               

Definitions of each term (from the above summary of the dataset):
- Length = (in this context) how many rows are in the column
- Class = what type of R object the column is (e.g. numeric, factor, character)
- Mode = how the data are stored internally (e.g. numeric, character, logical, list)
- Mean = average value
- Median = middle value (the 50th percentile, i.e. the 2nd quartile)
- Min/Max = smallest/largest values
- 1st Qu. = first quartile: the value below which about 25% of the observations fall; likewise, 3rd Qu. is the value below which about 75% fall
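
To make the quartile terms concrete, here is a small sketch on a made-up vector (the values in `x` are hypothetical and not taken from players.csv). It uses R's default `quantile()` method, which interpolates between observations when needed.

```r
# Hypothetical example vector with nine evenly spaced values
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# Values at the 25%, 50% and 75% cumulative positions
quartiles <- quantile(x, probs = c(0.25, 0.50, 0.75))
print(quartiles)

# The 50% quantile is the same number that median() returns
print(quartiles[["50%"]] == median(x))

# summary() reports these same quartiles alongside min, mean and max
print(summary(x))
```

In the `summary()` printout, "1st Qu." and "3rd Qu." correspond to the 25% and 75% entries from `quantile()`.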

## 3.1) Means of each numeric variable

In [36]:
numeric_players_data_only <- players_original |>
    select(played_hours, Age)

mean_table <- numeric_players_data_only |>
    summarise(across(everything(), ~ round(mean(.x, na.rm = TRUE), 2))) |>
    pivot_longer(everything(), names_to = "variable", values_to = "mean")

mean_table

variable,mean
<chr>,<dbl>
played_hours,5.85
Age,21.14


The average hours played is 5.85 hours, and the average age is 21.14 years.
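
The summary in section 3 reports a median of 0.1 hours for played_hours, well below its mean of 5.85 hours, which suggests that a few very large values pull the mean upward. The sketch below illustrates this effect, and the role of `na.rm = TRUE`, on a made-up vector (`hours` is hypothetical and is not the real data).

```r
# Hypothetical playtimes in hours: mostly small, one missing, one very large
hours <- c(0, 0, 0.1, 0.2, 0.6, NA, 200)

# Without na.rm, a single missing value makes the mean NA
mean_with_na <- mean(hours)

# With na.rm = TRUE, the mean and median use only the six observed values
mean_no_na   <- mean(hours, na.rm = TRUE)
median_no_na <- median(hours, na.rm = TRUE)

print(mean_with_na)
print(round(mean_no_na, 2))
print(median_no_na)
```

Here the single large value dominates the mean while the median stays close to the typical player, so for a skewed variable it may be worth reporting both.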