In [2]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


# Data Science Project: Planning Stage
#### Name: Daniel Jung 
#### Student #: 13449384 

### (1) Data Description:
The provided data consists of two CSV files, `players.csv` and `sessions.csv`.

**Dataset Summary**
- players.csv: 196 observations (players) and 7 variables.
- sessions.csv: 1535 observations (sessions) and 5 variables.

**players.csv**

| Variable | Data Type | Description |
| :--- | :--- | :--- |
| experience | character | The player's self-reported experience level (e.g., Amateur, Veteran). |
| subscribe | logical | Whether the player subscribed to a newsletter (TRUE/FALSE). |
| hashedEmail | character | A hashed identifier for each player. |
| played_hours | numeric | Total hours played by the player. |
| name | character | The player's name. |
| gender | character | The player's self-reported gender. Contains 7 unique categories. |
| Age | numeric | The player's age. |

**sessions.csv**
| Variable | Data Type (in R) | Description |
| :--- | :--- | :--- |
| hashedEmail | character | The player ID, used to link to the `players.csv` file (Foreign Key). |
| start_time | character | The session start time. |
| end_time | character | The session end time. Contains 2 missing values and requires conversion to a datetime object. |
| original_start_time | numeric | The session start time as a Unix timestamp. |
| original_end_time | numeric | The session end time as a Unix timestamp. |
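
The datetime conversion noted in the table could look roughly like the sketch below. It uses a small made-up tibble rather than the real file, and it assumes the timestamps are stored as day/month/year strings such as `"30/06/2024 18:12"`; that format, the toy values, and the names `sessions_toy`, `sessions_parsed`, and `session_summary` are all assumptions for illustration.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for sessions.csv (hypothetical values and assumed date format)
sessions_toy <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time  = c("30/06/2024 18:12", "01/07/2024 09:00", "02/07/2024 21:30"),
  end_time    = c("30/06/2024 18:24", "01/07/2024 10:15", NA)
)

# Parse the character columns into datetimes and derive a session length
sessions_parsed <- sessions_toy |>
  mutate(
    start_time    = dmy_hm(start_time),
    end_time      = dmy_hm(end_time),  # missing end times stay NA
    duration_mins = as.numeric(difftime(end_time, start_time, units = "mins"))
  )

# One row per player, keyed by hashedEmail
session_summary <- sessions_parsed |>
  group_by(hashedEmail) |>
  summarise(
    n_sessions = n(),
    total_mins = sum(duration_mins, na.rm = TRUE)
  )
```

Because `session_summary` is keyed by `hashedEmail`, it could then be combined with the player table using `left_join(players, session_summary, by = "hashedEmail")`.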


#### Key Data Issues and Potential Problems

1.  **Missing Data:**
    * `players.csv`: 2 missing values exist in the `Age` column.
    * `sessions.csv`: 2 missing values exist in the `end_time` and `original_end_time` columns.
2.  **Data Type Conversion:**
    * The `experience` and `gender` variables are loaded as `character` but must be converted to `factor` for statistical modeling in R.
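
Both issues can be checked and handled with a few lines of dplyr. The following is a minimal sketch on a made-up tibble (`players_toy` and its values are hypothetical, not taken from `players.csv`); the same calls should apply to the real `players` data frame.

```r
library(dplyr)

# Toy stand-in for players.csv (hypothetical values)
players_toy <- tibble(
  experience   = c("Amateur", "Veteran", "Amateur", "Veteran"),
  gender       = c("Male", "Female", "Female", "Male"),
  played_hours = c(0.5, 12.0, 3.2, 0.0),
  Age          = c(17, NA, 22, 25)
)

# Count missing values in every column
na_counts <- players_toy |>
  summarise(across(everything(), ~ sum(is.na(.x))))

# Convert the categorical columns from character to factor
players_clean <- players_toy |>
  mutate(across(c(experience, gender), as.factor))

na_counts
levels(players_clean$experience)
```

Using `across()` keeps the conversion in one place, so additional categorical columns can be added to the selection later without repeating the `mutate()` call.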

#### Summary
| Variable | Mean |
| :--- | :--- |
| `played_hours` | 5.85 |
| `Age` | 21.14 |

In [8]:
players <- read_csv("data/players.csv")

mean_players <- players |>
    summarise(
        mean_played_hours = round(mean(played_hours, na.rm = TRUE), 2),
        mean_Age = round(mean(Age, na.rm = TRUE), 2)
    )

mean_players

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


mean_played_hours,mean_Age
<dbl>,<dbl>
5.85,21.14


### (2) Questions:
**Research Question:** What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between player types?

**Goal**: To identify the player characteristics and behaviours that most strongly predict newsletter subscription, and to understand how these predictors vary among player types.

**Data Wrangling Plan**:
- The categorical variables `gender` and `experience` must be converted from `character` to `factor` type for use in the model.
- The 7 categories in the `gender` variable may need to be grouped into fewer levels (e.g., 'Male', 'Female', 'Other').
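
One possible way to do the grouping is `forcats::fct_other()`, which keeps the named levels and lumps everything else into a single level. The sketch below runs on a made-up vector; the category labels other than 'Male' and 'Female' are hypothetical placeholders, since the actual 7 labels are not listed here.

```r
library(forcats)

# Hypothetical gender values; the real labels in players.csv may differ
gender_raw <- factor(
  c("Male", "Female", "Non-binary", "Prefer not to say", "Male")
)

# Keep 'Male' and 'Female'; collapse all remaining levels into 'Other'
gender_grouped <- fct_other(
  gender_raw,
  keep = c("Male", "Female"),
  other_level = "Other"
)

table(gender_grouped)
```

Whether to collapse at all is a judgment call: fewer levels give more stable estimates for small groups, but the distinctions between the merged categories are lost.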
