In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(cowplot)

In [None]:
player_data <- read_csv("https://raw.githubusercontent.com/angusesgus/Individual-planning-report/refs/heads/main/players.csv")
glimpse(player_data)
summary(player_data)
session_data <- read_csv("https://raw.githubusercontent.com/angusesgus/Individual-planning-report/refs/heads/main/sessions.csv")
glimpse(session_data)
summary(session_data)

(1) Data Description:
    
Dataset 1 has 196 observations and 7 variables, while dataset 2 has 1535 observations and 5 variables.

Dataset 1:

| Variable Name   | Type      | Description                                      | Issues         |
|-----------------|-----------|--------------------------------------------------|----------------|
| experience      | chr       | Experience of each player                        | Not factor     |
| subscribe       | lgl       | Whether the player subscribed to the newsletter  | None           |
| hashedEmail     | chr       | Hashed email of player                           | None           |
| played_hours    | dbl       | Total playtime across all sessions (hours)       | None           |
| name            | chr       | Player first name                                | None           |
| gender          | chr       | Player gender                                    | Not factor     |
| Age             | dbl       | Age of the player in years                       | None           |

Dataset 2:

| Variable Name       | Type      | Description                                      | Issues                          |
|---------------------|-----------|--------------------------------------------------|---------------------------------|
| hashedEmail         | chr       | Hashed email of each player                      | None                            |
| start_time          | chr       | Time play session begins                         | None                            |
| end_time            | chr       | Time play session ends                           | None                            |
| original_start_time | dbl       | Start time in unix time                          | Not accurate enough to be useful|
| original_end_time   | dbl       | End time in unix time                            | Not accurate enough to be useful|

Summary statistics:

| Statistic         | Value  |
|-------------------|--------|
| Mean age (years)  | 21.14  |
| Mean hours played | 5.846  |

(2) Questions:

Broad question: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

Specific question: Can experience, gender, and age predict whether a player in the player dataset subscribes to the newsletter?

I will use the experience, gender, and age columns of dataset 1 to predict whether the subscribe column is TRUE or FALSE.
The main wrangling step is to convert experience and gender into factors so that they are treated as categories.

(3) Exploratory Data Analysis and Visualization

In [None]:
slice(player_data, 1:10)

Little wrangling is needed to tidy the data: each column holds one variable, each row one observation, and each cell one value.
However, it is still useful to convert experience and gender into factors so that they are handled as categorical variables in the visualizations.

In [None]:
player_data <- player_data |>
    mutate(across(c(experience, gender), as.factor))
slice(player_data, 1:10)

In [None]:
mean_data <- player_data |>
    # Pass na.rm inside an anonymous function; extra arguments to across() are deprecated
    summarize(across(c(played_hours, Age), \(x) mean(x, na.rm = TRUE)))
mean_data

In [None]:
# Set the size of the plots rendered in the notebook
options(repr.plot.width = 15, repr.plot.height = 7)
# position = "fill" rescales each bar to 1, so the fill shows the share of subscribers
plot_1 <- ggplot(player_data, aes(x = experience, fill = subscribe)) +
    geom_bar(position = "fill") +
    labs(x = "Experience of player", y = "Proportion of subscribers", title = "Experience vs subscribers")
plot_2 <- ggplot(player_data, aes(x = gender, fill = subscribe)) +
    geom_bar(position = "fill") +
    labs(x = "Gender of player", y = "Proportion of subscribers", title = "Gender vs subscribers")
# Age is numeric, so it is binned with a histogram instead of a bar chart
plot_3 <- ggplot(player_data, aes(x = Age, fill = subscribe)) +
    geom_histogram(bins = 15, position = "fill") +
    labs(x = "Age of player", y = "Proportion of subscribers", title = "Age vs subscribers")
plot_1
plot_2
plot_3
                 

These visualizations suggest that, of the three predictors, only age may be related to whether a player subscribes. plot_3 indicates that younger players are more likely to subscribe.
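The proportions read off the bar charts could also be checked numerically. The sketch below is only an illustration: `subscribe_rate` is a hypothetical helper that is not part of the original analysis, and the `toy` tibble is made-up data, not the real player data.

```r
library(dplyr)

# Hypothetical helper: group size and share of subscribers for one grouping variable
subscribe_rate <- function(data, group) {
    data |>
        group_by({{ group }}) |>
        summarize(n = n(), prop_subscribed = mean(subscribe), .groups = "drop")
}

# Tiny made-up example (NOT the real player data)
toy <- tibble(
    experience = c("Pro", "Pro", "Amateur", "Amateur", "Amateur", "Amateur"),
    subscribe  = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)
)

rates <- subscribe_rate(toy, experience)
rates
# Amateur: n = 4, prop_subscribed = 0.75
# Pro:     n = 2, prop_subscribed = 0.50
```

On the real data, the same helper would presumably be called as `subscribe_rate(player_data, experience)` or `subscribe_rate(player_data, gender)`. Reporting `n` next to each proportion helps flag groups that are too small to interpret.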

(4) Methods and Plan

I chose K-nearest neighbours (KNN) classification. It suits this problem because the response variable, subscribe, is categorical (TRUE or FALSE), and KNN predicts a class for each observation.
The method assumes that players with similar features are more likely to have the same subscription status, and that the dataset represents the overall player population well.
A potential limitation is that an effective "k" must be chosen to avoid overfitting or underfitting.
The optimal "k" will be selected through cross-validation, and the performance of the model will be measured with accuracy, precision, recall, and F1-score.
I will split the data into 80% training and 20% testing sets, then run 5-fold cross-validation (vfold_cv) on the training set.
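The plan above could be sketched with tidymodels roughly as follows. This is a minimal sketch under several assumptions, not a final implementation: `tune_knn` is a hypothetical helper, the grid of candidate "k" values and the seed are arbitrary, the factor predictors are dummy-encoded before scaling (one possible choice for KNN), and `toy_players` is synthetic stand-in data with made-up category labels. It also assumes the kknn package is installed.

```r
library(tidymodels)

# Hypothetical helper: tunes k for a KNN classifier that predicts `subscribe`
# from `experience`, `gender`, and `Age` using 5-fold cross-validation.
tune_knn <- function(data, k_values = seq(1, 15, by = 2), seed = 1) {
    set.seed(seed)

    # Drop incomplete rows and make the outcome a factor; TRUE is listed first
    # so that yardstick treats "subscribed" as the event for precision and recall
    data <- data |>
        select(subscribe, experience, gender, Age) |>
        drop_na() |>
        mutate(subscribe = factor(subscribe, levels = c(TRUE, FALSE)))

    # 80% training / 20% testing, stratified by the outcome
    data_split <- initial_split(data, prop = 0.8, strata = subscribe)
    train <- training(data_split)

    # Assumption: dummy-encode the factor predictors, drop constant columns,
    # then scale so that no single predictor dominates the distance
    knn_recipe <- recipe(subscribe ~ experience + gender + Age, data = train) |>
        step_dummy(all_nominal_predictors()) |>
        step_zv(all_predictors()) |>
        step_normalize(all_predictors())

    knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
        set_engine("kknn") |>
        set_mode("classification")

    # 5-fold cross-validation on the training set only
    folds <- vfold_cv(train, v = 5, strata = subscribe)

    workflow() |>
        add_recipe(knn_recipe) |>
        add_model(knn_spec) |>
        tune_grid(
            resamples = folds,
            grid = tibble(neighbors = k_values),
            metrics = metric_set(accuracy, precision, recall, f_meas)
        ) |>
        collect_metrics()
}

# Synthetic stand-in data (NOT the real player data; labels are made up)
set.seed(2024)
toy_players <- tibble(
    experience = factor(sample(c("Amateur", "Beginner", "Pro"), 200, replace = TRUE)),
    gender     = factor(sample(c("Female", "Male", "Other"), 200, replace = TRUE)),
    Age        = sample(10:50, 200, replace = TRUE),
    subscribe  = sample(c(TRUE, FALSE), 200, replace = TRUE, prob = c(0.7, 0.3))
)

knn_results <- tune_knn(toy_players)
knn_results
```

The result has one row per candidate "k" and metric, so the best "k" could be picked by filtering for one metric and sorting by `mean`. The testing set is deliberately left untouched here; it would only be used once, after "k" is chosen, to estimate final performance.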