#### Step 1
I imported the appropriate libraries and loaded the datasets into variables.

In [1]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
source('cleanup.R')

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

ERROR: Error in file(filename, "r", encoding = encoding): cannot open the connection


In [None]:
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

## Data Description

#### Players Dataset

| Column Name  |  Data Type  |                            Description                             |
|--------------|-------------|--------------------------------------------------------------------|
| experience   | Categorical | Skill level: Beginner, Amateur, Regular, Veteran, Pro              |
| subscribe    | Boolean     | Whether the player is subscribed (TRUE/FALSE)                      |
| hashed_email | String      | Hashed email acting as player identification                       |
| played_hours | Float       | Total hours played                                                 |
| name         | String      | Name of player                                                     |
| gender       | Categorical | Male, Female, Non-binary, Two-Spirited, Agender, Prefer not to say |
| Age          | Integer     | Age of player                                                      |


**Key Observations**
- Number of Observations : 196
- Number of Variables : 7
- Issues observed:
    - The groupings for 'experience' may be subjective: different players may interpret the same experience level differently
    - The long strings in 'hashed_email' make it difficult to check by eye whether the same player appears more than once
    - There are extreme values for 'played_hours', such as 218.1 hours played, which suggests that the dataset contains outliers
- Potential Issues:
    - It is unclear whether 'experience' is self-reported or officially assigned, which may lead to inaccurate results
    - Some categories, such as a particular age range or skill level, may be underrepresented, which may skew the results
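
The duplicate and outlier concerns above can be checked in code rather than by eye. The following is a minimal sketch, assuming a toy table `players_demo` whose values are invented for illustration (only the column names `hashed_email` and `played_hours` come from the dataset). The 1.5 × IQR fence is one common rule of thumb for flagging outliers, not a method prescribed by the data source.

```r
library(dplyr)

# Toy stand-in for `players`; every value here is made up for illustration.
players_demo <- tibble(
  hashed_email = c("a1f3", "b7c9", "a1f3", "d4e2", "e5f6", "f6a7"),
  played_hours = c(0.0, 0.1, 0.0, 1.5, 0.3, 150.0)
)

# Count rows per hashed_email; any count above 1 means a repeated player.
duplicate_ids <- players_demo |>
  count(hashed_email, name = "n_rows") |>
  filter(n_rows > 1)

# Upper fence: third quartile plus 1.5 times the interquartile range.
upper_fence <- quantile(players_demo$played_hours, 0.75, names = FALSE) +
  1.5 * IQR(players_demo$played_hours)

# Rows whose played_hours lies above the fence are flagged as outliers.
outliers <- players_demo |>
  filter(played_hours > upper_fence)

duplicate_ids
outliers
```

Running the same two checks on the real `players` table would show whether any `hashed_email` repeats and which `played_hours` values sit far from the rest.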

In [None]:
# Summary statistics for played_hours
hours_mean <- mean(players$played_hours)
hours_median <- median(players$played_hours)
hours_min <- min(players$played_hours)
hours_max <- max(players$played_hours)
# Uncomment a line below to print that value
# hours_mean
# hours_median
# hours_min
# hours_max

# Summary statistics for Age; na.rm = TRUE skips missing ages
age_mean <- mean(players$Age, na.rm = TRUE)
age_median <- median(players$Age, na.rm = TRUE)
age_min <- min(players$Age, na.rm = TRUE)
age_max <- max(players$Age, na.rm = TRUE)
# Uncomment a line below to print that value
# age_mean
# age_median
# age_min
# age_max

**Summary Statistics for Players Dataset**

| Column Name  |  Mean  | Median | Min |  Max  |
|--------------|--------|--------|-----|-------|
| played_hours | 5.85   | 0.1    |  0  | 223.1 |
| Age          | 21.14  | 19     |  9  |  58   |


#### Sessions Dataset

| Column Name         | Data Type | Description                                             |
|---------------------|-----------|---------------------------------------------------------|
| hashed_email        | String    | Hashed email of the player which acts as identification |
| start_time          | Datetime  | Starting timestamp of session                           |
| end_time            | Datetime  | Ending timestamp of session                             |
| original_start_time | Numeric   | Timestamp of session start in milliseconds              |
| original_end_time   | Numeric   | Timestamp of session end in milliseconds                |


**Key Observations**
- Number of Observations : 1535
- Number of Variables : 5
- Issues observed:
    - The values of original_start_time and original_end_time are not easily readable by humans
    - Session length varies widely: the start and end times show that some sessions last for hours while others last only minutes
- Potential Issues:
    - Because the values of hashed_email are not easily human readable, it is hard to check by eye whether a single user has sessions with overlapping timestamps
        - Additionally, it is difficult to tell if the same session of the same user is logged more than once
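
The readability, duplicate, and overlap concerns above could be examined along the following lines. This is a sketch under stated assumptions: `sessions_demo` is a toy table with invented values, and the millisecond columns are assumed to count from the Unix epoch (1970-01-01 UTC), which the dataset description does not confirm.

```r
library(dplyr)

# Toy stand-in for `sessions`; all values are made up for illustration.
# Assumption: the original_* columns are milliseconds since the Unix epoch.
sessions_demo <- tibble(
  hashed_email        = c("a1f3", "a1f3", "a1f3", "b7c9"),
  original_start_time = c(1719000000000, 1719000000000, 1719001800000, 1719100000000),
  original_end_time   = c(1719003600000, 1719003600000, 1719005400000, 1719100300000)
)

# Convert milliseconds to readable datetimes and compute session length.
sessions_clean <- sessions_demo |>
  mutate(
    start_utc = as.POSIXct(original_start_time / 1000, origin = "1970-01-01", tz = "UTC"),
    end_utc   = as.POSIXct(original_end_time / 1000, origin = "1970-01-01", tz = "UTC"),
    duration_mins = (original_end_time - original_start_time) / 60000
  )

# Rows that exactly repeat an earlier row (same player, same timestamps).
n_duplicate_rows <- sum(duplicated(sessions_demo))

# After dropping exact duplicates, flag sessions that start before the
# same player's previous session has ended.
overlaps <- sessions_clean |>
  distinct() |>
  group_by(hashed_email) |>
  arrange(original_start_time, .by_group = TRUE) |>
  mutate(overlaps_previous = original_start_time < lag(original_end_time)) |>
  ungroup() |>
  filter(overlaps_previous)
```

Note that this overlap check compares each session only with the one immediately before it for the same player, so it is a first pass rather than an exhaustive test.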

## Question: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

For this project, I will focus on the broad question of which player characteristics are most predictive of subscribing to a game-related newsletter. Specifically, I aim to answer the question: **"Can a player's in-game activity patterns, such as total playtime, number of sessions, and the time periods in which they are active, predict whether they will subscribe to a game-related newsletter?"**

The Minecraft server logs provide data on each player's activity and sessions, which I can wrangle into summary metrics for each individual player. By computing metrics such as total playtime, average session duration, and number of sessions, and combining these with the player's subscription status, I will be able to create a dataset suitable for predicting whether a player is subscribed to a game-related newsletter. At the moment, I expect K-NN classification to be more suitable than linear regression for this task: the response, subscription status, is a binary category rather than a numeric value, the relationship between the predictors and subscription status may be non-linear, and K-NN makes fewer assumptions about the shape of that relationship.
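
The wrangling described above might look roughly like the sketch below. The tables `players_demo` and `sessions_demo` hold invented values, and the columns `start_hour` and `duration_mins` are hypothetical: they are assumed to have already been derived from the session timestamps. The 18:00 cutoff for "evening" is likewise only an illustrative choice.

```r
library(dplyr)

# Toy stand-ins; all values are made up for illustration.
players_demo <- tibble(
  hashed_email = c("a1f3", "b7c9", "c2d8"),
  subscribe    = c(TRUE, FALSE, TRUE)
)

# start_hour and duration_mins are hypothetical derived columns.
sessions_demo <- tibble(
  hashed_email  = c("a1f3", "a1f3", "b7c9"),
  start_hour    = c(20, 22, 9),
  duration_mins = c(60, 30, 5)
)

# One row per player: session count, total and mean playtime, and the
# share of sessions that started at 18:00 or later.
player_metrics <- sessions_demo |>
  group_by(hashed_email) |>
  summarise(
    n_sessions        = n(),
    total_mins        = sum(duration_mins),
    mean_session_mins = mean(duration_mins),
    prop_evening      = mean(start_hour >= 18),
    .groups = "drop"
  )

# Attach the metrics to each player. Players with no logged sessions get
# zero sessions and zero minutes; their mean and proportion stay NA and
# would need handling before modelling.
model_data <- players_demo |>
  left_join(player_metrics, by = "hashed_email") |>
  mutate(
    n_sessions = coalesce(n_sessions, 0L),
    total_mins = coalesce(total_mins, 0),
    subscribe  = factor(subscribe)  # classification needs a factor response
  )

model_data
```

Because K-NN relies on distances between observations, predictors on different scales (minutes versus session counts) would normally be standardised before fitting.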