# Data Science Project: Individual Planning Stage 

In [1]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


## Data Description

#### Reading in the data

In [2]:
players <- read_csv("players.csv")
head(players)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17


#### Variable Breakdown

As seen above, the players dataset contains 196 observations and 7 variables. Each variable can be broken down in the following manner:

1. **experience**: Categorical variable; Minecraft experience level 

2. **subscribe**: Categorical variable; lack or presence of subscription to a game-related newsletter

3. **hashedEmail**: Identifier variable; unique hashed identifier for the player's email

4. **played_hours**: Quantitative variable; total hours spent playing Minecraft 

5. **name**: Identifier variable; player's name

6. **gender**: Categorical variable; player's gender

7. **Age**: Quantitative variable; player's age

#### Summary Statistics

In [3]:
summary_stats <- summary(players)
summary_stats

  experience        subscribe       hashedEmail         played_hours    
 Length:196         Mode :logical   Length:196         Min.   :  0.000  
 Class :character   FALSE:52        Class :character   1st Qu.:  0.000  
 Mode  :character   TRUE :144       Mode  :character   Median :  0.100  
                                                       Mean   :  5.846  
                                                       3rd Qu.:  0.600  
                                                       Max.   :223.100  
                                                                        
     name              gender               Age       
 Length:196         Length:196         Min.   : 9.00  
 Class :character   Class :character   1st Qu.:17.00  
 Mode  :character   Mode  :character   Median :19.00  
                                       Mean   :21.14  
                                       3rd Qu.:22.75  
                                       Max.   :58.00  
                               

As seen above, no values appear to be missing from either the identifier or the categorical variables. However, the categorical variables experience and gender are encoded as characters when they should be encoded as factors. The subscribe variable is the exception, since it is already properly encoded as a logical. Let's fix this:

In [4]:
players <- players |>
    mutate(experience = as.factor(experience),
           gender = as.factor(gender))

summary(players)

    experience subscribe       hashedEmail         played_hours    
 Amateur :63   Mode :logical   Length:196         Min.   :  0.000  
 Beginner:35   FALSE:52        Class :character   1st Qu.:  0.000  
 Pro     :14   TRUE :144       Mode  :character   Median :  0.100  
 Regular :36                                      Mean   :  5.846  
 Veteran :48                                      3rd Qu.:  0.600  
                                                  Max.   :223.100  
                                                                   
     name                         gender         Age       
 Length:196         Agender          :  2   Min.   : 9.00  
 Class :character   Female           : 37   1st Qu.:17.00  
 Mode  :character   Male             :124   Median :19.00  
                    Non-binary       : 15   Mean   :21.14  
                    Other            :  1   3rd Qu.:22.75  
                    Prefer not to say: 11   Max.   :58.00  
                    Two-Spirited    

Now that the categorical variables are properly encoded, we can observe the counts for each category. This gives us the following modes:

1. experience: Amateur (63)
2. subscribe: TRUE (144)
3. gender: Male (124)

All three categorical variables show imbalance across their categories. Firstly, although the experience variable is roughly balanced overall, the Pro category is severely underrepresented (14 of 196 players). Some underrepresentation makes sense, since fewer players will naturally reach high experience levels, yet the Veteran level, being a level higher than Pro, has a much higher count (48). Secondly, the TRUE category is severely overrepresented in the subscribe variable (144 of 196 players), which creates class imbalance in what seems to be the natural target variable for the players dataset. Thirdly, the Male category is severely overrepresented in the gender variable (124 of 196 players). Together, these imbalances make the dataset less representative, which could lead to a biased model.
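To put a number on the class imbalance in subscribe, the counts can be converted to proportions. The snippet below is a minimal sketch: rather than reading players.csv, it rebuilds the column from the counts printed by `summary(players)` (52 FALSE, 144 TRUE), and the names `subscribe_toy` and `subscribe_props` are illustrative, not part of the original analysis.

```r
library(dplyr)

# Rebuild the subscribe column from the counts shown by summary(players):
# 52 FALSE and 144 TRUE, for 196 rows in total.
subscribe_toy <- tibble(subscribe = rep(c(FALSE, TRUE), times = c(52, 144)))

# Count each class, then express each count as a share of all rows.
subscribe_props <- subscribe_toy |>
    count(subscribe) |>
    mutate(prop = n / sum(n))

subscribe_props
```

On the real data, the same `count()` and `mutate()` steps could be applied directly to `players`, and to experience or gender in place of subscribe.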

The quantitative variables also show some issues. Notably, the middle half of the played_hours values is tightly concentrated, with an IQR of 0.6. Yet its mean is much higher than its median, with a difference of 5.746. This suggests that the distribution of played_hours is skewed to the right and likely contains severe outliers. The maximum value of 223.100 hours reinforces this, since it lies 222.5 hours above the third quartile of 0.6 hours. The Age variable also has a fairly low spread, with an IQR of 5.75. Its distribution seems to be skewed to the right as well, since its mean exceeds its median by 2.14. However, the skew appears less drastic than for played_hours, which is reinforced by the smaller difference between the maximum value and the third quartile (35.25). Finally, it is worth noting that the Age variable has 2 missing values, so the affected rows must be imputed or removed before model-building.
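The outlier argument above can be checked with a short calculation. The sketch below applies the conventional 1.5 × IQR upper fence, which is an assumption introduced here for illustration and not a rule the analysis above commits to. The quartiles and maxima are copied from the `summary(players)` output, and the name `quartile_summary` is hypothetical.

```r
library(dplyr)

# Quartiles and maxima copied from the summary(players) output.
quartile_summary <- tibble(
    variable = c("played_hours", "Age"),
    q1       = c(0.000, 17.00),
    q3       = c(0.600, 22.75),
    max      = c(223.100, 58.00)
)

# Assumption: flag values above Q3 + 1.5 * IQR as potential outliers.
quartile_summary <- quartile_summary |>
    mutate(iqr               = q3 - q1,
           upper_fence       = q3 + 1.5 * iqr,
           max_exceeds_fence = max > upper_fence)

quartile_summary
```

Under this rule the upper fences come out at 1.5 hours for played_hours and 31.375 years for Age, so the maximum of each variable lies beyond its fence, by a far wider margin for played_hours.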

## Questions

## Exploratory Data Analysis and Visualization

In [5]:
means_quant_vars <- players |>
    select(played_hours, Age) |>
    map_dfr(mean)

means_quant_vars

played_hours,Age
<dbl>,<dbl>
5.845918,NA
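The mean of Age comes back missing because `mean()` returns `NA` whenever its input contains a missing value, and Age has missing entries. One possible fix, sketched below, is to pass `na.rm = TRUE`. The example uses `summarise()` with `across()` as an alternative to the `map_dfr()` call above, and it runs on a small hypothetical tibble (`toy_players`) rather than the real players.csv.

```r
library(dplyr)

# Hypothetical toy data, not the real players.csv: one Age value is missing.
toy_players <- tibble(
    played_hours = c(30.3, 3.8, 0.0, 0.7),
    Age          = c(9, 17, NA, 21)
)

# Without na.rm, a single NA makes the mean of that column NA.
means_with_na <- toy_players |>
    summarise(across(c(played_hours, Age), mean))

# With na.rm = TRUE, missing values are dropped before averaging.
means_without_na <- toy_players |>
    summarise(across(c(played_hours, Age), \(x) mean(x, na.rm = TRUE)))

means_with_na
means_without_na
```

Dropping missing values only inside the summary keeps every row in the dataset, which leaves the decision between imputing and removing rows open until the modelling stage.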


## Methods and Plan