# Individual Planning Report - Chris Gibbons (Group 2)

In [6]:
library(tidyverse)

players_data <- read_csv("https://raw.githubusercontent.com/chrispy-tacos/dsci_idv_planning/refs/heads/main/players.csv")
sessions_data <- read_csv("https://raw.githubusercontent.com/chrispy-tacos/dsci_idv_planning/refs/heads/main/sessions.csv")

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


## Data Description

In [13]:
players_data |> slice(1) #example observation
summary(players_data)
unique(players_data$experience)
unique(players_data$gender)

| experience | subscribe | hashedEmail | played_hours | name | gender | Age |
|---|---|---|---|---|---|---|
| `<chr>` | `<lgl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| Pro | TRUE | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Morgan | Male | 9 |


  experience        subscribe       hashedEmail         played_hours    
 Length:196         Mode :logical   Length:196         Min.   :  0.000  
 Class :character   FALSE:52        Class :character   1st Qu.:  0.000  
 Mode  :character   TRUE :144       Mode  :character   Median :  0.100  
                                                       Mean   :  5.846  
                                                       3rd Qu.:  0.600  
                                                       Max.   :223.100  
                                                                        
     name              gender               Age       
 Length:196         Length:196         Min.   : 9.00  
 Class :character   Class :character   1st Qu.:17.00  
 Mode  :character   Mode  :character   Median :19.00  
                                       Mean   :21.14  
                                       3rd Qu.:22.75  
                                       Max.   :58.00  
                               

The players data consist of 196 observations, each corresponding to a unique participant on the PLAICraft server. Seven variables are recorded for each participant:

- **experience:** Qualitative character variable classifying the participant's Minecraft experience as one of 5 categories.
  
- **subscribe:** Qualitative logical variable indicating whether or not the participant subscribes to a game-related newsletter.

- **hashedEmail:** Qualitative character variable containing the participant's hashed email address, which serves as an anonymized identifier.

- **played_hours:** Quantitative double variable measuring the hours the participant has spent playing on the PLAICraft server.

- **name:** Qualitative character variable indicating the participant's name.

- **gender:** Qualitative character variable classifying the participant's gender as one of 7 categories.

- **Age:** Quantitative double variable indicating the participant's age (in years).

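Further inspection of these data reveals that the "played_hours" variable is strongly right-skewed and contains major outliers: the median is 0.1 hours and the third quartile is 0.6 hours, yet the mean is 5.846 hours and the maximum is 223.1 hours. Furthermore, the "experience" and "gender" variables are categorical and should therefore be converted to factor variables.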
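The two observations about the players data can be sketched in code. This is a minimal illustration, not the report's final wrangling: `players_toy` is a hypothetical stand-in for `players_data` (only its first row mirrors the example observation, the rest is invented), and the Q3 + 1.5 × IQR fence is a common rule of thumb chosen here as an assumption, not a method stated in the report.

```r
library(dplyr)

# Hypothetical stand-in for players_data; only the first row mirrors the
# example observation, the remaining values are invented for illustration.
players_toy <- tibble(
  experience   = c("Pro", "Amateur", "Amateur", "Pro"),
  gender       = c("Male", "Female", "Male", "Female"),
  played_hours = c(30.3, 0.0, 0.1, 223.1)
)

# Convert the two categorical columns from character to factor.
players_factored <- players_toy |>
  mutate(across(c(experience, gender), as.factor))

# Upper fence for outliers: Q3 + 1.5 * IQR (assumed rule of thumb).
upper_fence <- quantile(players_toy$played_hours, 0.75, names = FALSE) +
  1.5 * IQR(players_toy$played_hours)

# Keep only the rows whose play time lies above the fence.
outlier_rows <- players_factored |>
  filter(played_hours > upper_fence)

levels(players_factored$experience)
outlier_rows
```

On the real data the same `mutate()` call could be applied to `players_data` directly. Whether outlying play times should be kept, transformed, or removed is a separate modelling decision.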

In [12]:
sessions_data |> slice(1) #example observation
summary(sessions_data)

| hashedEmail | start_time | end_time | original_start_time | original_end_time |
|---|---|---|---|---|
| `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` |
| bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf | 30/06/2024 18:12 | 30/06/2024 18:24 | 1719770000000.0 | 1719770000000.0 |


 hashedEmail         start_time          end_time         original_start_time
 Length:1535        Length:1535        Length:1535        Min.   :1.712e+12  
 Class :character   Class :character   Class :character   1st Qu.:1.716e+12  
 Mode  :character   Mode  :character   Mode  :character   Median :1.719e+12  
                                                          Mean   :1.719e+12  
                                                          3rd Qu.:1.722e+12  
                                                          Max.   :1.727e+12  
                                                                             
 original_end_time  
 Min.   :1.712e+12  
 1st Qu.:1.716e+12  
 Median :1.719e+12  
 Mean   :1.719e+12  
 3rd Qu.:1.722e+12  
 Max.   :1.727e+12  
 NA's   :2          

The sessions data consist of 1535 observations, each corresponding to an individual play session by a participant on the PLAICraft server. Five variables are recorded for each session:

- **hashedEmail:** Qualitative character variable containing the hashed email address of the participant to whom the session belongs.
  
- **start_time:** Qualitative character variable describing the date and time the session started.

- **end_time:** Qualitative character variable describing the date and time the session ended.

- **original_start_time:** Quantitative double variable recording the start of the session in UNIX time (apparently milliseconds since 1 January 1970, given values near 1.7e+12).

- **original_end_time:** Quantitative double variable recording the end of the session in UNIX time, in the same units.

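Further inspection of these data reveals that the "start_time" and "end_time" variables are stored as character strings and record two separate values in each cell: a date and a time. The data are therefore untidy. In addition, "original_end_time" contains 2 missing values, and in the example observation both UNIX timestamps display as 1719770000000 even though the session lasted 12 minutes, which suggests they are stored at reduced precision.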
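One possible way to tidy the time columns is sketched here with `lubridate`. Several things are assumptions: `sessions_toy` is a hypothetical stand-in for `sessions_data` (the first row copies the example observation, the second is invented), the UTC time zone is a guess because the source does not state one, and the UNIX values are treated as milliseconds based only on their magnitude.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for sessions_data; the first row copies the example
# observation, the second row is invented for illustration.
sessions_toy <- tibble(
  start_time          = c("30/06/2024 18:12", "01/07/2024 09:00"),
  end_time            = c("30/06/2024 18:24", "01/07/2024 10:30"),
  original_start_time = c(1719770000000, 1719820000000)
)

sessions_parsed <- sessions_toy |>
  mutate(
    # Parse the "day/month/year hour:minute" strings into date-times.
    # UTC is an assumption; the source does not state a time zone.
    start_dt = dmy_hm(start_time, tz = "UTC"),
    end_dt   = dmy_hm(end_time, tz = "UTC"),
    # Split the start date-time into its two components.
    start_date = as_date(start_dt),
    start_hour = hour(start_dt) + minute(start_dt) / 60,
    # Session length in minutes.
    duration_min = as.numeric(difftime(end_dt, start_dt, units = "mins")),
    # Assumed to be milliseconds since the epoch, so divide by 1000.
    original_start_dt = as_datetime(original_start_time / 1000)
  )

sessions_parsed
```

Keeping the parsed date-time in a single column alongside the separate date and hour columns makes it easy to compute durations while still satisfying the one-value-per-cell requirement.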

## Questions of Interest

This project will broadly consider which player characteristics and behaviours are most predictive of subscribing to a game-related newsletter. Specifically, it will attempt to determine whether a participant's age and number of individual play sessions on PLAICraft can predict whether they subscribe to the newsletter.

To prepare the data for a predictive method, the sessions data can be summarized into a count of sessions per hashed email address. These counts can then be joined to the players data, so that age and session count sit alongside the response variable "subscribe" in a single dataset. A suitable predictive model can then be trained on this dataset.
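The wrangling described above might look roughly like the following sketch. The tibbles `players_toy` and `sessions_toy` are hypothetical stand-ins with invented values. Two choices are assumptions and not taken from the report: the sketch gives players with no recorded sessions a count of zero, and it converts `subscribe` to a factor.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-ins for the two datasets; all values are invented.
players_toy <- tibble(
  hashedEmail = c("aaa", "bbb", "ccc"),
  Age         = c(17, 21, 19),
  subscribe   = c(TRUE, FALSE, TRUE)
)
sessions_toy <- tibble(
  hashedEmail = c("aaa", "aaa", "aaa", "bbb")
)

# Count the number of sessions per hashed email address.
session_counts <- sessions_toy |>
  count(hashedEmail, name = "n_sessions")

# Join the counts onto the players data, keeping every player.
model_data <- players_toy |>
  left_join(session_counts, by = "hashedEmail") |>
  mutate(
    # Players absent from the sessions data get a count of 0 (assumption).
    n_sessions = replace_na(n_sessions, 0L),
    # Classification tools typically expect a factor response (assumption).
    subscribe  = as.factor(subscribe)
  ) |>
  select(subscribe, Age, n_sessions)

model_data
```

A `left_join()` is used here in place of an `inner_join()` so that participants who never started a session remain in the dataset and are not silently dropped.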

## GitHub Repository

Link: https://github.com/chrispy-tacos/dsci_idv_planning.git