# DSCI 100 – Individual Planning Stage

**Minecraft** is a game many of us grew up with, whether it was playing on a split screen with your siblings, playing online with your friends from a distance, or even just on your phone in the middle of class. In this project, we will explore **two datasets**: `players.csv`, and `sessions.csv`. Below is shown the first few observations of the two datasets.

In [1]:
library(tidyverse)
library(repr)
library(tidymodels)
library(dplyr)
library(lubridate)
options(repr.matrix.max.rows = 6)

── [1mAttaching core tidyverse packages[22m ──────────────────────── tidyverse 2.0.0 ──
[32m✔[39m [34mdplyr    [39m 1.1.4     [32m✔[39m [34mreadr    [39m 2.1.5
[32m✔[39m [34mforcats  [39m 1.0.0     [32m✔[39m [34mstringr  [39m 1.5.1
[32m✔[39m [34mggplot2  [39m 3.5.1     [32m✔[39m [34mtibble   [39m 3.2.1
[32m✔[39m [34mlubridate[39m 1.9.3     [32m✔[39m [34mtidyr    [39m 1.3.1
[32m✔[39m [34mpurrr    [39m 1.0.2     
── [1mConflicts[22m ────────────────────────────────────────── tidyverse_conflicts() ──
[31m✖[39m [34mdplyr[39m::[32mfilter()[39m masks [34mstats[39m::filter()
[31m✖[39m [34mdplyr[39m::[32mlag()[39m    masks [34mstats[39m::lag()
[36mℹ[39m Use the conflicted package ([3m[34m<http://conflicted.r-lib.org/>[39m[23m) to force all conflicts to become errors
── [1mAttaching packages[22m ────────────────────────────────────── tidymodels 1.1.1 ──

[32m✔[39m [34mbroom       [39m 1.0.6     [32m✔[39m [34mrsample     [39

In [2]:
url_players <- "https://raw.githubusercontent.com/egshiglened/dsci_100_project_033/main/players.csv"
url_sessions <- "https://raw.githubusercontent.com/egshiglened/dsci_100_project_033/main/sessions.csv"

players <- read_csv(url_players)
sessions <- read_csv(url_sessions)

head(players) 
head(sessions)

[1mRows: [22m[34m196[39m [1mColumns: [22m[34m7[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (4): experience, hashedEmail, name, gender
[32mdbl[39m (2): played_hours, Age
[33mlgl[39m (1): subscribe

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.
[1mRows: [22m[34m1535[39m [1mColumns: [22m[34m5[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (3): hashedEmail, start_time, end_time
[32mdbl[39m (2): original_start_time, original_end_time

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17


hashedEmail,start_time,end_time,original_start_time,original_end_time
<chr>,<chr>,<chr>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,30/06/2024 18:12,30/06/2024 18:24,1719770000000.0,1719770000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,17/06/2024 23:33,17/06/2024 23:46,1718670000000.0,1718670000000.0
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,25/07/2024 17:34,25/07/2024 17:57,1721930000000.0,1721930000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,25/07/2024 03:22,25/07/2024 03:58,1721880000000.0,1721880000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,25/05/2024 16:01,25/05/2024 16:12,1716650000000.0,1716650000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,23/06/2024 15:08,23/06/2024 17:10,1719160000000.0,1719160000000.0


The dataset `players.csv` has information of **196** players, including their age, gender, experience level, total play hours and whether they subscribed to a newsletter. These features describe who the players are and how they interact with the research community. The `sessions.csv` contains **1535** gameplay sessions, that show when each gameplay started and ended. This data shows how long and how often players engage with the game. However, potential issues that may affect our overall answers to the question include unclear units for session duration, possible time zone differences between players, and we may need to order the experience level of the players. In addition, the sample may not fully represent the Minecraft community as a whole since it only comes from a specific research server. 

In [3]:
players_summary <- summary(players)
sessions_summary <- summary(sessions)
players_summary
sessions_summary 

  experience        subscribe       hashedEmail         played_hours    
 Length:196         Mode :logical   Length:196         Min.   :  0.000  
 Class :character   FALSE:52        Class :character   1st Qu.:  0.000  
 Mode  :character   TRUE :144       Mode  :character   Median :  0.100  
                                                       Mean   :  5.846  
                                                       3rd Qu.:  0.600  
                                                       Max.   :223.100  
                                                                        
     name              gender               Age       
 Length:196         Length:196         Min.   : 9.00  
 Class :character   Class :character   1st Qu.:17.00  
 Mode  :character   Mode  :character   Median :19.00  
                                       Mean   :21.14  
                                       3rd Qu.:22.75  
                                       Max.   :58.00  
                               

 hashedEmail         start_time          end_time         original_start_time
 Length:1535        Length:1535        Length:1535        Min.   :1.712e+12  
 Class :character   Class :character   Class :character   1st Qu.:1.716e+12  
 Mode  :character   Mode  :character   Mode  :character   Median :1.719e+12  
                                                          Mean   :1.719e+12  
                                                          3rd Qu.:1.722e+12  
                                                          Max.   :1.727e+12  
                                                                             
 original_end_time  
 Min.   :1.712e+12  
 1st Qu.:1.716e+12  
 Median :1.719e+12  
 Mean   :1.719e+12  
 3rd Qu.:1.722e+12  
 Max.   :1.727e+12  
 NA's   :2          

**Description of variables for players.csv** :

| Column | Data type     | Description |
|--------|-----------|-------------|
|`experience` | Character   | Tells us the experience level of each player (e.g. "Pro", "Amateur") |
|`subscribe` | Logical   | Tells us whether the player subcribed to the newsletter or not. 'TRUE' means they have and 'FALSE' means they have not. | 
| `hashedEmail` | Character   | Shows us the player identifier that is unique to each player. |
| `played_hours` | Double (numeric)   | Describes the total hours played by each player |
| `name` | Character   | Players' in-game name (won't be used for analysis) | 
|`gender` | Character   | Players' gender |
|`Age` | Double (numeric)   | Players' age |


**Description of variables for sessions.csv** :

| Column | Data type | Description | 
|---------|----------|--------------|
|`hashedEmail`| Character | Identifier (similar to the one in `players.csv`) that can help us link each session to the player |
|`start_time` | Character | Timestamp that shows when the session started |
|`end_time` | Character | Timestamp that shows when the session ended |
|`original_start_time` | Double (numeric) | Start time in milliseconds |
|`original_end_time` | Double (numeric) | End time in milliseconds |



## Questions
**Broad question:**
What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

**Specific question:**
Does the number of sessions played and the average session duration help predict whether a player subscribes to the newsletter, and do these relationships vary across experience levels? 

Next steps:
- Removing NAs
- Wrangling `sessions` to  convert units, get session count and average duration per player
- Merge the `sessions` and `players` data

There will not be any tidying because the data is **already tidy**, because each row represents a single play session, each column contains one variable such as age, experience, and each cell has one value. We will be wrangling a bit to get the data ready for analysis and visualization. 


In [4]:
sessions_clean <- sessions |>
mutate(start_time = as_datetime(start_time),
       end_time = as_datetime(end_time), 
       duration = as.numeric(difftime(end_time, start_time, units = "mins"))) |>
group_by(hashedEmail) |>
summarise(num_sessions = n(), 
          avg_session_duration = mean(duration, na.rm = TRUE))

playersessions <- players |>
left_join(sessions_clean, by = "hashedEmail")
playersessions
          

experience,subscribe,hashedEmail,played_hours,name,gender,Age,num_sessions,avg_session_duration
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>,<int>,<dbl>
Pro,TRUE,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9,27,-4.854432e+05
Veteran,TRUE,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17,3,1.416667e+00
Veteran,FALSE,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17,1,8.333333e-02
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
Amateur,FALSE,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db299bd4fedb06a46ad5bb,0.0,Dylan,Prefer not to say,57,1,0.08333333
Amateur,FALSE,f19e136ddde68f365afc860c725ccff54307dedd13968e896a9f890c40aea436,2.3,Harlow,Male,17,6,0.49722222
Pro,TRUE,d9473710057f7d42f36570f0be83817a4eea614029ff90cf50d8889cdd729d11,0.2,Ahmed,Other,,1,0.25000000
