# Project Planning Individual Stage

In [1]:
### Run this cell before continuing. 
library(tidyverse)
library(repr)
options(repr.matrix.max.rows = 6)
source("cleanup.R")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
“cannot open file 'cleanup.R': No such file or directory”


ERROR: Error in file(filename, "r", encoding = encoding): cannot open the connection


## 1. Data Description

### players.csv

This dataset describes each player's identity, gaming experience, and behaviour in the game. It contains **196 observations** and **7 variables**.
The variables are listed below:

| **Variable**     | **Type** | **Description**                     | **Notes** |
|------------------|----------|--------------------------------------|-----------|
| `experience` | chr | Player's gaming experience level (Beginner, Amateur, Regular, Veteran, Pro) | May be subjectively classified. |
| `subscribe` | lgl | Whether the player subscribed to a game-related newsletter (TRUE) or not (FALSE) | Can be used as a target variable for prediction. |
| `hashedEmail` | chr | Player ID | 196 different IDs. Can identify each player. |
| `played_hours` | dbl | Number of hours played by each player | May be right-skewed, with some invalid data. |
| `name` | chr | The name of each player | 196 different names. Can serve the same identifying role as `hashedEmail`. |
| `gender` | chr | The gender of each player | Mostly Male/Female. Other gender categories may be underrepresented. |
| `Age` | dbl | The age of each player | Has two missing values. Can be used as a predictor. |
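
Because `experience` is read as plain text, its levels would sort alphabetically by default. The sketch below shows one way to store it as a factor so that summaries follow the order listed in the table above. It assumes that this listed order (Beginner to Pro) is the intended ranking, which the data itself does not confirm, and it uses a hypothetical `toy_players` tibble with invented rows rather than the real file.

```r
library(dplyr)

# Hypothetical toy rows: only the column name and the level labels
# come from the variable table; the values are invented.
toy_players <- tibble(
  experience = c("Pro", "Beginner", "Veteran", "Amateur", "Regular", "Beginner")
)

# Assumed ordering, taken from the order in which the levels are listed
experience_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

toy_players <- toy_players |>
  mutate(experience = factor(experience, levels = experience_levels))

# Counts now follow the level order rather than alphabetical order
experience_counts <- toy_players |>
  count(experience)

experience_counts
```

The same `mutate()` step could be applied to `players` once the file has been read.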

### Potential Data Issues

- Two missing values in `Age` can cause errors during data processing unless they are handled explicitly.
- `played_hours` may be right-skewed (many low-hour players, few high-hour players).
- Some values in the `played_hours` column are 0.0, meaning the player logged no playtime; these records may be invalid.
- The `gender` column is dominated by `Male` and `Female`, so other gender categories may be underrepresented in the analysis.
- The values in the `experience` column may be subjective, since there is no formal definition of each experience level.
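
The issues listed above can each be checked with a short summary. The following is a minimal sketch on a hypothetical `toy_players` tibble whose values are invented for illustration; only the column names match `players.csv`. Comparing the mean with the median is used here as a rough indicator of right skew, not a formal test.

```r
library(dplyr)

# Hypothetical toy data with the same column names as players.csv;
# the values are invented for illustration only.
toy_players <- tibble(
  played_hours = c(0, 0, 0.1, 0.5, 1.2, 30.3),
  gender = c("Male", "Female", "Male", "Male", "Other", "Female"),
  Age = c(17, 21, NA, 19, 25, NA)
)

# Count missing ages and zero-hour players, and compare mean with median
issue_summary <- toy_players |>
  summarise(
    missing_age = sum(is.na(Age)),
    zero_hours = sum(played_hours == 0),
    hours_mean = mean(played_hours),
    hours_median = median(played_hours)
  )

# Tally each gender category to see which groups are small
gender_counts <- toy_players |>
  count(gender, sort = TRUE)

issue_summary
gender_counts
```

A mean well above the median, as in the toy data, is consistent with a few high-hour players pulling the average up.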

### Likely Data Collection Method

- `subscribe` is likely collected from in-app subscription records.
- `experience`, `name`, `gender`, and `Age` are likely entered by the user.
- `played_hours` likely comes from platform usage logs.

### Summary Statistics

In [4]:
players <- read_csv("data/players.csv")

# One-row summary of playtime, age, and newsletter subscription
players |> summarise(
    played_hours_mean = mean(played_hours, na.rm = TRUE),
    age_mean = mean(Age, na.rm = TRUE),
    played_hours_median = median(played_hours, na.rm = TRUE),
    total_players = n(),
    # subscribe is logical, so summing it counts the TRUE values
    subscribe_count = sum(subscribe, na.rm = TRUE),
    subscribe_percentage = subscribe_count / total_players * 100) |>
    round(2)


Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| played_hours_mean | age_mean | played_hours_median | total_players | subscribe_count | subscribe_percentage |
|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 5.85 | 21.14 | 0.1 | 196 | 144 | 73.47 |


- Mean played hours = 5.85 hours
- Mean age = 21.14 years
- Median played hours = 0.1 hours
- Total players = 196
- Subscribed players = 144
- Percentage of players subscribed to the newsletter = 73.47%
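
As a quick check of the arithmetic, the subscription percentage can be recomputed from the two counts reported above. This is a small base R sketch; the variable names are illustrative, and the non-subscriber figures are derived here rather than taken from the notebook output.

```r
# Counts reported in the summary output
total_players <- 196
subscribe_count <- 144

# Share of players who subscribed, as a percentage rounded to 2 decimals
subscribe_percentage <- round(subscribe_count / total_players * 100, 2)

# Remaining players did not subscribe
non_subscribe_count <- total_players - subscribe_count
non_subscribe_percentage <- round(non_subscribe_count / total_players * 100, 2)

subscribe_percentage
non_subscribe_percentage
```

Since roughly three quarters of players subscribed, the two classes of `subscribe` are unevenly sized, which is worth keeping in mind if it is used as a prediction target.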