# Predicting Newsletter Subscription Based on Player Age and Playtime

---

## General Overview
- **Number of Observations (Rows):** 400
- **Number of Variables (Columns):** 7

---

## Potential Issues
- Missing values in some columns (e.g., `Age` and `played_hours`).
- Some columns, such as `hashedEmail` and `name`, are unlikely to be useful for analysis because they identify players rather than describe them.
- Possible outliers in `played_hours` (e.g., values like 218.1 and 223.1).
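
A quick way to check for these issues is sketched below. It uses a small hypothetical data frame (`toy_players`) rather than the real data: apart from the two large `played_hours` values quoted above, every value is made up. The 1.5 × IQR rule in the helper `flag_outliers` is only one common convention for flagging outliers and is an assumption here, not the method used in this report.

```r
# Hypothetical stand-in for the players data; values are illustrative only.
toy_players <- data.frame(
  Age          = c(17, 21, NA, 19, 22, 9, 50, 18, 20, 23),
  played_hours = c(0.1, 0.0, 1.5, NA, 0.3, 0.2, 0.7, 0.1, 218.1, 223.1)
)

# Count missing values per column.
missing_counts <- colSums(is.na(toy_players))
print(missing_counts)

# Flag values more than 1.5 * IQR outside the quartiles (one common convention).
flag_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
}

hours_outliers <- toy_players$played_hours[flag_outliers(toy_players$played_hours)]
print(hours_outliers)
```

Whether flagged values should be removed, transformed, or kept is a separate decision; the sketch only identifies them.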

---

## Variables Summary

| **Variable Name** | **Data Type** | **Description**                                                                 | **Potential Issues**                                                                 |
|--------------------|---------------|---------------------------------------------------------------------------------|-------------------------------------------------------------------------------------|
| **experience**     | Categorical   | Player's experience level ("Beginner," "Amateur," "Regular," "Veteran," "Pro"). | No major issues.                                                                    |
| **subscribe**      | Boolean       | Indicates whether the player has subscribed (TRUE or FALSE).                    | No major issues.                                                                    |
| **hashedEmail**    | String        | Hashed email address of the player.                                             | Not useful for analysis.                                                            |
| **played_hours**   | Numeric       | Total hours played by the player.                                               | Missing values. |
| **name**           | String        | Name of the player.                                                             | Not useful for analysis.                                       |
| **gender**         | Categorical   | Gender of the player (e.g., "Male," "Female," "Non-binary," "Prefer not to say").| Inconsistent categories (e.g., "Two-Spirited," "Agender"). May need standardization.|
| **Age**            | Numeric       | Age of the player.                                                              | Possible outliers (e.g., 9, 49, 50). Missing values.                                |

---

## Key Insights

### Demographics:
- The majority of players are male (70%), with a small percentage identifying as non-binary, agender, or preferring not to say.

### Gaming Behavior:
- Most players have played very few hours (median = 0.1), but there are extreme outliers (e.g., 223.1 hours).
- The most common experience level is "Amateur," followed by "Veteran" and "Regular."

### Subscription Status:
- Approximately 75% of players are subscribed, indicating a high subscription rate.
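
Summaries like these can be computed in a few lines. The sketch below uses a hypothetical data frame (`toy`) with made-up values chosen to resemble the figures above; it is not the real players data.

```r
# Hypothetical values for illustration; not the real players data.
toy <- data.frame(
  subscribe    = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE),
  played_hours = c(0.0, 0.1, 0.1, 0.0, 2.3, 0.4, 0.1, 30.5)
)

# Share of subscribed vs. not subscribed players.
subscribe_share <- prop.table(table(toy$subscribe))
print(round(subscribe_share, 2))

# The median is robust to extreme values, unlike the mean.
hours_median <- median(toy$played_hours)
hours_mean <- mean(toy$played_hours)
print(c(median = hours_median, mean = hours_mean))
```

Reporting the median alongside the mean makes the effect of a few very large `played_hours` values visible.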

---

## Potential Issues

### Data Quality:
- Missing values in `Age` and `played_hours`.
- Outliers in `played_hours` and `Age` that may need to be addressed.
- Inconsistent gender categories that may require standardization.

### Data Collection:
- The dataset may suffer from self-reporting bias (e.g., players may misreport their age or hours played).
- The hashed email column (`hashedEmail`) is not useful for analysis and could be removed.
- The `name` column is also not useful for analysis and could be removed.

### Ethical Concerns:
- The dataset includes sensitive information (e.g., gender, age), which should be handled carefully to ensure privacy.

### Summary:
- This report examines data collected from a UBC-led Minecraft server. The dataset, `players.csv`, records each player's name, hashed email, gender, age, experience level, hours played on the server, and whether they subscribe to the newsletter. The report asks the following question: can age and hours played predict whether a player will subscribe to the newsletter? Of these variables, only hours played, age, and subscription status are used in this analysis.

---


In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

In [None]:
url <- "https://raw.githubusercontent.com/g-amadorz/dsci-project/refs/heads/main/data/players.csv"

players <- read_csv(url)

In [None]:
players

In [None]:
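# Keep only the variables used in the analysis and drop incomplete rows.
players_clean <- players |>
    mutate(age = Age, subscribe = as.factor(subscribe)) |>  # factor outcome for classification
    select(age, played_hours, subscribe) |>
    drop_na()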

glimpse(players_clean)


In [None]:
# Note: on a log10 axis, rows with played_hours = 0 become -Inf and are dropped.
hours_played_hist <- players_clean |>
    ggplot(aes(x = played_hours, fill = subscribe)) +
    geom_histogram(bins = 30) +
    scale_x_log10() +
    facet_grid(rows = vars(subscribe)) +
    labs(x = "Played Hours", fill = "Subscribed to Newsletter", y = "Count") +
    theme(text = element_text(size = 14))

hours_played_hist
    

In [None]:
hours_played_density_plot <- players_clean |>
    ggplot(aes(x = played_hours, fill = subscribe)) +
    geom_density(alpha = 0.5) +
    scale_x_log10() +
    labs(title = "Density of Played Hours by Subscription Status",
         x = "Played Hours",
         y = "Density") +
    theme(text = element_text(size = 14))

hours_played_density_plot

In [None]:
age_hist <- players_clean |>
    ggplot(aes(x = age, fill = subscribe)) +
    geom_histogram(bins = 30) +  # set bins explicitly (30 is the ggplot2 default)
    facet_grid(rows = vars(subscribe)) +
    labs(x = "Age of Player", fill = "Subscribed to Newsletter", y = "Count") +
    theme(text = element_text(size = 14))

age_hist

In [None]:
age_density_plot <- players_clean |>
  ggplot(aes(x = age, fill = subscribe)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density of Age by Subscription Status",
       x = "Age",
       y = "Density") +
    theme(text = element_text(size = 14))

age_density_plot

In [None]:
age_hours_scatter_plot <- players_clean |>
    ggplot(aes(x = played_hours, y = age, color = subscribe)) +
    geom_point(alpha = 0.5) +
    scale_x_log10() +
    labs(x = "Hours Played",
         y = "Age of Player",
         color = "Subscribed?") +
    scale_color_manual(values = c("darkorange", "steelblue")) +
    ggtitle("Age vs. Hours Played by Subscription Status") +
    theme(text = element_text(size = 14))

age_hours_scatter_plot

***KNN Cross Validation***

In [None]:
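set.seed(123)  # seed before splitting so the train/test split is reproducible

player_split <- initial_split(players_clean, prop = 0.75, strata = subscribe)
player_train <- training(player_split)
player_test <- testing(player_split)

# KNN specification; the number of neighbours is tuned by cross-validation.
player_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# Standardize both predictors so that neither dominates the distance calculation.
player_recipe <- recipe(subscribe ~ age + played_hours, data = player_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())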



In [None]:
set.seed(123)

players_vfold <- vfold_cv(player_train, v = 5, strata = subscribe)

knn_results <- workflow() |>
                 add_recipe(player_recipe) |>
                 add_model(player_spec) |>
                 tune_grid(resamples = players_vfold, grid = tibble(neighbors = c(2,3,4,5,6))) |>
                 collect_metrics()

accuracies <- knn_results |>
                 filter(.metric == 'accuracy')

cross_val_plot <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
                  geom_point() +
                  geom_line() +
                  labs(x = 'Neighbors', y = 'Accuracy Estimate') +
                  theme(text = element_text(size = 20))

cross_val_plot

***Players Model***

In [None]:
set.seed(9999)

# Refit on the full training set with K = 4.
tuned_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 4) |>
    set_engine("kknn") |>
    set_mode("classification")

player_fit <- workflow() |>
    add_recipe(player_recipe) |>
    add_model(tuned_spec) |>
    fit(data = player_train)
player_fit

***Testing Players Model***

In [None]:
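player_predictions <- predict(player_fit, player_test) |>
    bind_cols(player_test)

# Accuracy and kappa on the held-out test set.
player_metrics <- player_predictions |>
    metrics(truth = subscribe, estimate = .pred_class)
# Precision and recall with "TRUE" (the second factor level) as the positive class.
player_precision <- player_predictions |>
    precision(truth = subscribe, estimate = .pred_class, event_level = "second")
player_recall <- player_predictions |>
    recall(truth = subscribe, estimate = .pred_class, event_level = "second")
player_conf_mat <- player_predictions |>
    conf_mat(truth = subscribe, estimate = .pred_class)

player_metrics
player_precision
player_recall
player_conf_mat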

### Notes 

I don't think that age and played_hours are good predictors of subscription. Both the confusion matrix and the cross-validation accuracies leave a lot to be desired: of the five K values tried, none achieved an accuracy above 60%. Five folds were used because of the small number of observations.
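
One way to put these accuracies in context is to compare them with a majority-class baseline, that is, a classifier that always predicts the most common class. The sketch below assumes the roughly 75% subscription rate reported in the Key Insights; the `labels` vector is hypothetical and only mirrors that rate.

```r
# Hypothetical labels with 75% subscribed, mirroring the rate reported earlier.
labels <- factor(c(rep("TRUE", 75), rep("FALSE", 25)))

# A majority-class classifier always predicts the most frequent label.
majority_class <- names(which.max(table(labels)))
baseline_accuracy <- mean(labels == majority_class)

print(majority_class)
print(baseline_accuracy)
```

Under that assumption, a model whose accuracy stays below 60% does worse than always predicting "subscribed", which supports the conclusion that these two predictors carry little signal.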

## Discussion

Our aim for this project was to assess whether the variables "age" and "hours played" could effectively predict whether or not a Minecraft player was subscribed to a Minecraft-related newsletter. To achieve this, we created visualizations to assess trends between the predictor variables and subscription status, and fit a K-nearest neighbours (KNN) classification model to determine its accuracy, precision, and recall. After exploring the model's performance, we question the validity of age and hours played as predictors of subscription status.
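
For readers unfamiliar with KNN classification, the idea can be sketched in base R: standardize the predictors, find the K training points closest to a new observation, and take an unweighted majority vote. The function `knn_predict` and the data below are hypothetical and for illustration only; this is not the `kknn` implementation used in the analysis above.

```r
# Minimal KNN classifier for illustration only; not the kknn implementation.
knn_predict <- function(train_x, train_y, new_x, k) {
  # Standardize using the training means and standard deviations.
  mu <- colMeans(train_x)
  sigma <- apply(train_x, 2, sd)
  train_z <- scale(train_x, center = mu, scale = sigma)
  new_z <- scale(new_x, center = mu, scale = sigma)

  apply(new_z, 1, function(point) {
    # Euclidean distance from the new point to every training point.
    dists <- sqrt(rowSums(sweep(train_z, 2, point)^2))
    nearest <- order(dists)[seq_len(k)]
    # Unweighted majority vote among the k nearest neighbours.
    names(which.max(table(train_y[nearest])))
  })
}

# Hypothetical training data: columns are age and played_hours.
train_x <- matrix(c(17, 0.1,
                    18, 0.2,
                    19, 0.0,
                    30, 25.0,
                    32, 30.0,
                    35, 28.0),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(NULL, c("age", "played_hours")))
train_y <- c("FALSE", "FALSE", "FALSE", "TRUE", "TRUE", "TRUE")

new_x <- matrix(c(18, 0.1,
                  33, 27.0),
                ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("age", "played_hours")))

predicted <- knn_predict(train_x, train_y, new_x, k = 3)
print(predicted)
```

The toy clusters are deliberately well separated so the vote is unambiguous; the real data shows no such separation.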

#### KNN Results 

The output from our KNN workflow gave a minimum misclassification rate of 0.4069, meaning that roughly 40.69% of observations were misclassified. The overall accuracy of the model was 53.06%, and the precision, which is the percentage of subscription predictions that are actually correct, was 72.41%: true positives / (true positives + false positives) = 21 / (21 + 8).
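
The arithmetic behind these metrics can be sketched as follows. Only the true positive and false positive counts (21 and 8) come from the text. The false negative and true negative counts are not reported here, so the values used below are assumptions, chosen to be consistent with the reported 53.06% accuracy on an assumed 49-row test set; the recall they produce is therefore illustrative, not a reported result.

```r
# Counts from the text: true positives and false positives.
tp <- 21
fp <- 8
# ASSUMPTION: fn and tn are not reported in the text. These values are chosen
# so that accuracy matches the reported 53.06% on an assumed 49-row test set.
fn <- 15
tn <- 5

precision <- tp / (tp + fp)                  # correct among predicted "subscribed"
recall <- tp / (tp + fn)                     # found among actual "subscribed"
accuracy <- (tp + tn) / (tp + fp + fn + tn)  # correct among all predictions

print(round(c(precision = precision, recall = recall, accuracy = accuracy), 4))
```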