#### Step 1
I imported the appropriate libraries and loaded the datasets into variables.

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
source('cleanup.R')

In [None]:
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

## Data Description

#### Players Dataset

| Column Name  |  Data Type  |                            Description                             |
|--------------|-------------|--------------------------------------------------------------------|
| experience   | Categorical | Skill level: Beginner, Amateur, Regular, Veteran, Pro              |
| subscribe    | Boolean     | Whether the player is subscribed (TRUE/FALSE)                      |
| hashed_email | String      | Hashed email acting as player identification                       |
| played_hours | Float       | Total hours played                                                 |
| name         | String      | Name of player                                                     |
| gender       | Categorical | Male, Female, Non-binary, Two-Spirited, Agender, Prefer not to say |
| Age          | Integer     | Age of player                                                      |


**Key Observations**
- Number of Observations : 196
- Number of Variables : 7
- Issues observed:
    - The groupings for 'experience' may be subjective: different players may interpret the experience levels differently
    - The long strings in 'hashed_email' make it difficult to verify by eye whether the same player appears more than once
    - 'played_hours' has extreme values, such as 218.1 hours played, which suggests that the dataset contains outliers
- Potential Issues:
    - It is unclear whether 'experience' is self-reported or officially assigned, so analyses that rely on it may be inaccurate
    - Some categories, such as a particular age range or skill level, may be underrepresented, which may skew the data
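
The checks described above can be sketched in code rather than done by eye. The block below is a minimal illustration, not the analysis actually run for this report. It uses a small made-up data frame (`toy_players`) in place of `players`, sticks to base R so it runs on its own, and assumes the common 1.5 × IQR rule as the outlier threshold.

```r
# Toy stand-in for players.csv (hypothetical values, not the real data)
toy_players <- data.frame(
  hashed_email = c("a1", "b2", "c3", "a1", "d4", "e5"),
  experience   = c("Pro", "Amateur", "Amateur", "Pro", "Beginner", "Amateur"),
  played_hours = c(0, 0.1, 0.3, 0.1, 1.2, 218.1),
  stringsAsFactors = FALSE
)

# Count rows whose hashed_email already appeared earlier in the data
n_duplicate_ids <- sum(duplicated(toy_players$hashed_email))

# Flag high values with the 1.5 * IQR rule (an assumed convention, not
# necessarily the threshold the final analysis should use)
q3 <- quantile(toy_players$played_hours, 0.75, names = FALSE)
upper_fence <- q3 + 1.5 * IQR(toy_players$played_hours)
outlier_hours <- toy_players$played_hours[toy_players$played_hours > upper_fence]

# Tabulate experience levels to spot underrepresented categories
experience_counts <- table(toy_players$experience)

print(n_duplicate_ids)
print(outlier_hours)
print(experience_counts)
```

Swapping `toy_players` for `players` would apply the same checks to the real data, provided the column names match the table above.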

In [None]:
hours_mean <- mean(players$played_hours)
hours_median <- median(players$played_hours)
hours_min <- min(players$played_hours)
hours_max <- max(players$played_hours)
# hours_mean
# hours_median
# hours_min
# hours_max

age_mean <- mean(players$Age, na.rm = TRUE)
age_median <- median(players$Age, na.rm = TRUE)
age_min <- min(players$Age, na.rm = TRUE)
age_max <- max(players$Age, na.rm = TRUE)
# age_mean
# age_median
# age_min
# age_max

**Summary Statistics for Players Dataset**

| Column Name  |  Mean  | Median | Min |  Max  |
|--------------|--------|--------|-----|-------|
| played_hours | 5.85   | 0.1    |  0  | 223.1 |
| Age          | 21.14  | 19     |  9  |  58   |
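
A table like this can also be generated directly from the data, which avoids copying numbers by hand. The following is a rough sketch on a made-up data frame (`toy_players`), so its output will not match the values above. The helper `summarise_column` is a hypothetical name introduced only for this example.

```r
# Toy stand-in for the players data (hypothetical values)
toy_players <- data.frame(
  played_hours = c(0, 0.1, 0.5, 2.0),
  Age          = c(17, 21, NA, 30)
)

# Hypothetical helper: mean, median, min and max of one numeric column,
# ignoring missing values as the Age calculations above do
summarise_column <- function(x) {
  c(
    mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    min    = min(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE)
  )
}

# One row per column, one column per statistic
summary_table <- t(sapply(toy_players[c("played_hours", "Age")], summarise_column))
print(round(summary_table, 2))
```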


#### Sessions Dataset

| Column Name         | Data Type | Description                                             |
|---------------------|-----------|---------------------------------------------------------|
| hashed_email        | String    | Hashed email of the player which acts as identification |
| start_time          | Datetime  | Starting timestamp of session                           |
| end_time            | Datetime  | Ending timestamp of session                             |
| original_start_time | Numeric   | Timestamp of session start in milliseconds              |
| original_end_time   | Numeric   | Timestamp of session end in milliseconds                |


**Key Observations**
- Number of Observations : 1535
- Number of Variables : 5
- Issues observed:
    - The values of original_start_time and original_end_time are not easily readable by humans
    - Session length varies widely: the start and end times show that some sessions last for hours while others last only minutes
- Potential Issues:
    - As the values of hashed_email are not easily human-readable, it is hard to spot sessions from the same user that overlap in time
        - Additionally, it is difficult to tell if the same session of the same user is logged more than once
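
The session issues listed above could be checked programmatically along these lines. This is a sketch under stated assumptions: `toy_sessions` is made-up data, the millisecond columns are assumed to count from the Unix epoch (1970-01-01 UTC), and the overlap check only compares each session with the one immediately before it for the same user.

```r
# Toy stand-in for sessions.csv (hypothetical values); times in milliseconds
base_ms <- 1719770000000
minute_ms <- 60 * 1000
toy_sessions <- data.frame(
  hashed_email        = c("a1", "a1", "a1", "b2"),
  original_start_time = base_ms + c(0, 0, 30, 0) * minute_ms,
  original_end_time   = base_ms + c(60, 60, 90, 5) * minute_ms,
  stringsAsFactors = FALSE
)

# Human-readable start time, ASSUMING milliseconds since the Unix epoch
toy_sessions$start_readable <- as.POSIXct(
  toy_sessions$original_start_time / 1000, origin = "1970-01-01", tz = "UTC"
)

# Session length in minutes, to see the spread between short and long sessions
toy_sessions$duration_min <-
  (toy_sessions$original_end_time - toy_sessions$original_start_time) / minute_ms

# Rows that exactly repeat an earlier row (same user, same start and end)
n_duplicate_rows <- sum(duplicated(toy_sessions))
unique_sessions <- unique(toy_sessions)

# Sort by user and start time, then compare each start with the previous end
unique_sessions <- unique_sessions[
  order(unique_sessions$hashed_email, unique_sessions$original_start_time),
]
prev_end <- ave(
  unique_sessions$original_end_time,
  unique_sessions$hashed_email,
  FUN = function(x) c(NA, head(x, -1))
)
unique_sessions$overlaps_previous <-
  !is.na(prev_end) & unique_sessions$original_start_time < prev_end

print(n_duplicate_rows)
print(unique_sessions[, c("hashed_email", "duration_min", "overlaps_previous")])
```

Comparing only with the previous session keeps the sketch short. A stricter check would compare each start against the latest end time seen so far for that user.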