In [1]:
#load libraries
library(tidyverse)
library(dplyr)
library(repr)
library(tidymodels)
library(GGally)
library(ISLR)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Introduction
This project explores whether a player's age and the time they spend in the game are predictive of the player subscribing to a game-related newsletter. Exploring the datasets will help us understand which player characteristics can help the research group recruit players and distribute its resources more efficiently.

In [23]:
#Inspect the datasets
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")
head(players)
summary(players)
# head(sessions)
# summary(sessions)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17


  experience        subscribe       hashedEmail         played_hours    
 Length:196         Mode :logical   Length:196         Min.   :  0.000  
 Class :character   FALSE:52        Class :character   1st Qu.:  0.000  
 Mode  :character   TRUE :144       Mode  :character   Median :  0.100  
                                                       Mean   :  5.846  
                                                       3rd Qu.:  0.600  
                                                       Max.   :223.100  
                                                                        
     name              gender               Age       
 Length:196         Length:196         Min.   : 9.00  
 Class :character   Class :character   1st Qu.:17.00  
 Mode  :character   Mode  :character   Median :19.00  
                                       Mean   :21.14  
                                       3rd Qu.:22.75  
                                       Max.   :58.00  
                               

In [24]:
players_mean <- players |>
    select(played_hours, Age)|>
    map_df(mean, na.rm = TRUE)|>
    round(2)

players_mean

played_hours,Age
<dbl>,<dbl>
5.85,21.14
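
The column means above could also be computed with `summarize()` and `across()`, which keeps the whole step inside dplyr. The following is a minimal sketch on a made-up tibble (`toy_players` and its values are illustrative assumptions, not rows from `players.csv`):

```r
library(dplyr)

# Hypothetical stand-in for `players`; the values are invented for illustration
toy_players <- tibble(
  played_hours = c(0, 0.1, 0.6, 3.7),
  Age = c(17, 21, NA, 19)
)

# Mean of each selected column, ignoring NAs, rounded to two decimals
toy_means <- toy_players |>
  summarize(across(c(played_hours, Age), ~ round(mean(.x, na.rm = TRUE), 2)))

toy_means
```

Swapping `toy_players` for `players` should give the same kind of one-row summary as the cell above.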


# Data description
The `players` data frame contains 196 observations and 7 columns.

|Variable|Type|Description|Notes/Potential issues|
|:----------------:|----------|:--------------:|:--------------:|
|`experience`  |character|Player's experience level|Notes: one of five levels (Beginner, Amateur, Regular, Veteran, Pro)|
|`subscribe`   |logical  |Whether the player has subscribed to the newsletter|Notes: 144 players have subscribed and 52 have not|
|`hashedEmail` |character|Hashed email that identifies the player|Notes: unique for each player|
|`played_hours`|numeric  |Total time spent in the game (hours)|Mean: 5.85 hours, min: 0 hours, max: 223.1 hours|
|`name`        |character|Player's name|Notes: also unique for each player|
|`gender`      |character|Player's gender|Notes: contains more than two categories (not only Female and Male)|
|`Age`         |numeric  |Player's age (years)|Issue: contains 2 missing values (NAs)|

Some features of the dataset could make the analysis unreliable. One is the missing values in the `Age` column. We can handle these in several ways: by passing `na.rm = TRUE` so that R ignores the missing values when computing summaries, or by replacing each missing value with the mean of the column. Another is the `played_hours` column, where most values are under 1 (the median is 0.1 hours and the third quartile is 0.6 hours). The distribution is skewed by a few values that reach as high as 223.1. Possible solutions include treating the extreme values as outliers or using a logarithmic scale.
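
The options above can be sketched as follows. This is only an illustration on a made-up tibble (`toy_players`, `Age_imputed`, and `toy_complete` are hypothetical names, and the values are invented), not the approach this analysis commits to:

```r
library(dplyr)

# Hypothetical stand-in for `players`; the values are invented for illustration
toy_players <- tibble(
  played_hours = c(0, 0.1, 0.6, 3.8, 223.1),
  Age = c(17, 21, NA, 19, 23)
)

toy_clean <- toy_players |>
  mutate(
    # Option 1: replace missing ages with the mean of the observed ages
    Age_imputed = if_else(is.na(Age), mean(Age, na.rm = TRUE), Age),
    # log1p(x) computes log(1 + x), so players with zero hours map to zero
    played_hours_log = log1p(played_hours)
  )

# Option 2: drop the rows with a missing age instead of imputing
toy_complete <- toy_players |>
  filter(!is.na(Age))
```

Mean imputation keeps every row but pulls the imputed ages toward the centre of the distribution, while dropping rows loses a small number of observations. Which trade-off is preferable depends on the later modelling step.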

# Question
**Can the player's age and number of hours spent in the game predict whether or not they will subscribe to the game-related newsletter?**

This is a predictive question. We will use the `players` data frame with `Age` and `played_hours` as our predictors and `subscribe` as our response variable (the one we want to predict). Each column corresponds to a variable, each row corresponds to an observation, and each cell holds a single value, so the data frame is tidy.

To clean up, I will drop the unrelated columns and keep only the relevant ones (`Age`, `played_hours`, and `subscribe`). Since we want to work with `subscribe` as a categorical variable, I will use `mutate` to convert it to a factor and relabel its levels as "Yes" and "No". I will also add a `played_hours_log` column, computed as `log(played_hours + 1)`, to reduce the skew in `played_hours`. To handle the `NA`s in `Age` when computing summaries, I will pass `na.rm = TRUE` to the summary function.


In [48]:
players_data <- players |> 
    select(subscribe, played_hours, Age)|>
    mutate(subscribe = as_factor(subscribe))|>
    mutate(subscribe = fct_recode(subscribe, "Yes" = "TRUE", "No" = "FALSE"))|>
    mutate(played_hours_log = log(played_hours + 1))
head(players_data)
players_data_mean <- players_data |>
    select(played_hours, Age)|>
    map_df(mean, na.rm = TRUE)|>
    round(2)

players_data_mean

subscribe,played_hours,Age,played_hours_log
<fct>,<dbl>,<dbl>,<dbl>
Yes,30.3,9,3.4436181
Yes,3.8,17,1.56861592
No,0.0,17,0.0
Yes,0.7,21,0.53062825
Yes,0.1,21,0.09531018
Yes,0.0,17,0.0


played_hours,Age
<dbl>,<dbl>
5.85,21.14
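
After recoding, it can be worth confirming that the factor has the intended labels and checking how many observations fall in each class. A minimal sketch on made-up data (`toy` and `toy_fct` are hypothetical names, and the values are invented):

```r
library(dplyr)
library(forcats)

# Hypothetical stand-in for the `subscribe` column; values are invented
toy <- tibble(subscribe = c(TRUE, FALSE, TRUE, TRUE))

# Same recoding steps as in the cell above, applied to the toy data
toy_fct <- toy |>
  mutate(subscribe = as_factor(subscribe)) |>
  mutate(subscribe = fct_recode(subscribe, "Yes" = "TRUE", "No" = "FALSE"))

# Inspect the labels and the number of observations per class
levels(toy_fct$subscribe)
toy_counts <- count(toy_fct, subscribe)
toy_counts
```

Running the same `count()` on `players_data` should reproduce the 144 subscribed and 52 not subscribed reported by `summary(players)`.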


# Exploratory Data Analysis and Visualization
