In [50]:
library(tidyverse)

In [51]:
players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [25]:
players_variables <- tibble(
    variable = c("experience", "subscribe", "hashedEmail", "played_hours",
                 "name", "gender", "Age"),

    type = c("chr", "lgl", "chr", "dbl", "chr", "chr", "dbl"),

    description = c("User experience level",
                    "Subscription",
                    "Hashed email identifier",
                    "Hours played",
                    "Username",
                    "Gender",
                    "Age in years"))

sessions_variables <- tibble(
    variable = c("hashedEmail", "start_time", "end_time",
                 "original_start_time", "original_end_time"),

    type = c("chr", "chr", "chr", "dbl", "dbl"),

    description = c("Hashed email identifier",
                    "Session Start time in dd/mm/yyyy, 24h time",
                    "Session End time in dd/mm/yyyy, 24h time",
                    "Original Start time of session as Unix timestamp in milliseconds",
                    "Original End time of session as Unix timestamp in milliseconds"))
         

In [31]:
# Mean, max, and min of every numeric column, ignoring missing values
players_st_sum <- players |>
  summarise(across(where(is.numeric), list(
    mean = ~mean(.x, na.rm = TRUE),
    max  = ~max(.x, na.rm = TRUE),
    min  = ~min(.x, na.rm = TRUE)
  )))

sessions_st_sum <- sessions |>
  summarise(across(where(is.numeric), list(
    mean = ~mean(.x, na.rm = TRUE),
    max  = ~max(.x, na.rm = TRUE),
    min  = ~min(.x, na.rm = TRUE)
  )))

In [47]:
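# Count how often each value of the categorical columns occurs
players_var_sum <- players |>
    select(experience, subscribe, gender) |>
    # Convert to character so logical and character columns can share one value column
    mutate(across(everything(), as.character)) |>
    pivot_longer(everything(),
                 names_to = "variable",
                 values_to = "value") |>
    group_by(variable, value) |>
    summarise(count = n(), .groups = "drop")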

## 1. Data Description

__General Overview__

players.csv: A list of all unique players, including data about each player.
>Total number of observations (rows) - 196

>Total number of variables (columns) - 7

sessions.csv: A list of individual play sessions by each player, including data about the session.
>Total number of observations (rows) - 1535

>Total number of variables (columns) - 5

__Variable Summary Tables (players - sessions):__

In [28]:
players_variables
sessions_variables 

variable,type,description
<chr>,<chr>,<chr>
experience,chr,User experience level
subscribe,lgl,Subscription
hashedEmail,chr,Hashed email identifier
played_hours,dbl,Hours played
name,chr,Username
gender,chr,Gender
Age,dbl,Age in years


variable,type,description
<chr>,<chr>,<chr>
hashedEmail,chr,Hashed email identifier
start_time,chr,"Session Start time in dd/mm/yyyy, 24h time"
end_time,chr,"Session End time in dd/mm/yyyy, 24h time"
original_start_time,dbl,Original Start time of session as Unix timestamp in milliseconds
original_end_time,dbl,Original End time of session as Unix timestamp in milliseconds
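
The session times are stored as character strings and as millisecond Unix timestamps, so they need converting before durations can be computed. The sketch below shows one possible way to do that with `lubridate`. The example rows are made up, and the exact string layout (`dd/mm/yyyy HH:MM`) is an assumption based on the description above, not something checked against sessions.csv.

```r
library(dplyr)
library(lubridate)

# Hypothetical rows shaped like sessions.csv; the values are made up for illustration.
sessions_example <- tibble(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time = c("30/06/2024 18:24", "17/06/2024 23:46"),
  original_start_time = c(1719770000000, 1718670000000)
)

sessions_parsed <- sessions_example |>
  mutate(
    # Assumes the strings follow "dd/mm/yyyy HH:MM" in 24h time
    start_dt = dmy_hm(start_time),
    end_dt = dmy_hm(end_time),
    # Session length in minutes
    duration_min = as.numeric(difftime(end_dt, start_dt, units = "mins")),
    # Unix timestamps are in milliseconds, so divide by 1000 to get seconds
    original_start_dt = as_datetime(original_start_time / 1000)
  )

sessions_parsed
```

If the real strings carry seconds or a different separator, `dmy_hm()` would need swapping for the matching parser (for example `dmy_hms()`).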


__Summary Statistics (numeric):__

In [32]:
players_st_sum
sessions_st_sum

played_hours_mean,played_hours_max,played_hours_min,Age_mean,Age_max,Age_min
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
5.845918,223.1,0,21.13918,58,9


original_start_time_mean,original_start_time_max,original_start_time_min,original_end_time_mean,original_end_time_max,original_end_time_min
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1719201000000.0,1727330000000.0,1712400000000.0,1719196000000.0,1727340000000.0,1712400000000.0
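
The wide one-row summaries above can be hard to scan. As a rough sketch, the same `across()` summary could be reshaped into one row per variable and statistic. The data frame `players_example` and its values are hypothetical, and the `names_pattern` relies on the `{column}_{statistic}` names that `across()` produces with a named list.

```r
library(dplyr)
library(tidyr)

# Hypothetical data shaped like the numeric columns of players.csv
players_example <- tibble(
  played_hours = c(0, 1.5, 30),
  Age = c(17, 21, NA)
)

summary_long <- players_example |>
  summarise(across(where(is.numeric), list(
    mean = ~mean(.x, na.rm = TRUE),
    max  = ~max(.x, na.rm = TRUE),
    min  = ~min(.x, na.rm = TRUE)
  ))) |>
  # Split names such as "played_hours_mean" into variable and statistic
  pivot_longer(everything(),
               names_to = c("variable", "statistic"),
               names_pattern = "(.*)_(mean|max|min)$",
               values_to = "value")

summary_long
```

A pattern is used instead of `names_sep = "_"` because column names such as `played_hours` contain an underscore themselves.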


__Summary Variables (Categorical):__
>Session data categorical variables have no applicable counts to summarize

In [48]:
players_var_sum

variable,value,count
<chr>,<chr>,<int>
experience,Amateur,63
experience,Beginner,35
experience,Pro,14
experience,Regular,36
experience,Veteran,48
gender,Agender,2
gender,Female,37
gender,Male,124
gender,Non-binary,15
gender,Other,1
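
Raw counts are easier to compare across variables when shown alongside proportions. The sketch below is one possible extension of the count table, using made-up rows in a hypothetical `players_example` tibble; `count()` is used here as a shorthand for `group_by()` followed by `summarise(n())`.

```r
library(dplyr)
library(tidyr)

# Hypothetical rows shaped like the categorical columns of players.csv
players_example <- tibble(
  experience = c("Pro", "Amateur", "Amateur", "Veteran"),
  subscribe = c(TRUE, FALSE, TRUE, TRUE),
  gender = c("Male", "Female", "Male", "Non-binary")
)

category_shares <- players_example |>
  mutate(across(everything(), as.character)) |>
  pivot_longer(everything(), names_to = "variable", values_to = "value") |>
  count(variable, value, name = "count") |>
  # Share of each value within its own variable
  group_by(variable) |>
  mutate(share = count / sum(count)) |>
  ungroup()

category_shares
```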


__How the data was collected__
- Players' actions are recorded as they navigate through the Minecraft world (behavioural data).
- Over 10,000 hours of multiplayer Minecraft gameplay were collected.
- The project also records spoken interactions between players through each player's microphone.

__Potential Issues with the Data__
> Sampling bias:
- Age range: players under 13 need parental consent, and "The parent can revoke consent any time", so younger players may be under-represented.
- Geographic ties: respondents are likely UBC students or others to whom the study was advertised.

>Non-response bias: 
- Access to the study: individuals who never receive the Plaicraft access link or never discover the study cannot take part, which likely skews the sample toward players of university age.

>Other:
- Ability to delete recorded data: although this is a necessary ethical option, participants can have some of their data deleted so that it is not used in the study: "provide us with session ID (found in the email or SMS with your access link) of the game session you would like us to delete. Alternatively, email support@plaicraft.ai for assistance."

## 2. Questions

__Broad Question__
> Question 1: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

__Specific Question__
> Can hours played, relative to the mean hours played, predict a subscription (subscribe = TRUE) in the players.csv dataset?

__Notes__
- The players dataset contains the hours of Minecraft played (played_hours) and the subscription status (subscribe, TRUE/FALSE).
- This allows me to calculate the mean hours played and test whether playing at or above the mean is associated with a subscription.
- Are hours played an accurate predictor of subscriptions?
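
As a first look at this question, the comparison described in the notes could be sketched as follows. The `players_example` tibble and its values are hypothetical, and this only compares subscription rates on either side of the mean; it is not a fitted predictive model.

```r
library(dplyr)

# Hypothetical players; the values are made up for illustration
players_example <- tibble(
  played_hours = c(0, 0.5, 1, 2, 10, 30),
  subscribe = c(FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
)

# Mean hours played across all players
mean_hours <- mean(players_example$played_hours, na.rm = TRUE)

subscribe_by_hours <- players_example |>
  # Flag players who played at least the mean number of hours
  mutate(at_or_above_mean = played_hours >= mean_hours) |>
  group_by(at_or_above_mean) |>
  summarise(n_players = n(),
            subscribe_rate = mean(subscribe),
            .groups = "drop")

subscribe_by_hours
```

Because played_hours is heavily skewed in the real data (mean about 5.8, max 223.1), most players are likely to fall below the mean, so the two groups may be very unequal in size.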