# Analyzing the Predictability of Different Player Types (Final Group Report)

## Introduction

- ### Background Information
    - In this project, we explored how player characteristics and behaviour might help predict whether a player subscribes to a game-related newsletter. We were provided with two datasets, players.csv and sessions.csv, and focused on players.csv for this analysis. Each observation is one player, with columns for experience, subscription status, hashed email, played hours, name, gender, and age.

- ### Question
    - **Can the player type, age, and played hours in players.csv predict whether a player will subscribe to a game-related newsletter, and for which player type are these predictions most accurate?**

- ### Reason for Chosen Predictors
    - Played_hours tells us how much time each player has spent on the server. Our assumption is that players who spend more time playing are more engaged in the game and more likely to want updates, news, or event announcements, which are things typically shared through newsletters. So, higher played hours might be linked to a greater chance of subscribing.
    - Experience provides extra context beyond just how long someone has played. Two players might have the same number of hours, but very different levels of skill or game knowledge. We included this variable to see whether more experienced players are more likely to subscribe, or if newer players are more eager to receive tips and updates through the newsletter.
    - Age is a basic but important factor. Players of different ages often have different habits. Younger players might be more likely to stay connected with game updates and community content, while older players might not be as interested. So age could help us spot differences in who’s more likely to subscribe.

- ### Description of Dataset
| Variable     | Type          | Description                           |
|--------------|---------------|---------------------------------------|
| hashedEmail  | Character     | Hashed email address of player        |
| experience   | Character     | Self-reported skill level             |
| subscribe    | Logical       | Subscribed to newsletter (TRUE/FALSE) |
| played_hours | Numeric (dbl) | Total hours played on server          |
| name         | Character     | Player's chosen name                  |
| gender       | Character     | Gender reported by player             |
| Age          | Numeric (dbl) | Player age in years                   |

 - The dataset we will be using is players.csv. There are 196 observations and 7 variables in this dataset. Issues in players.csv include missing age values and potential outliers in played_hours, where some players have much higher playtime.
 - The data was collected from players who voluntarily joined the Minecraft server. Demographic data was self-reported, while session data was logged automatically. 

- ### Assumptions and Limitations:
    - Potential issues include bias from self-reported data (e.g., experience) and sampling bias, as the dataset only includes players who chose to participate.
    - The analysis assumes independent observations, which may not fully hold since gaming sessions can be correlated.
    - The relationships between predictors and subscription status may not be perfectly linear, which could impact model performance.
    - The data may require balancing if subscribed and non-subscribed players are unevenly represented.


## Methods & Results

- ### Loading Libraries
    - Here, we load the libraries needed for the rest of the project using the library() function. These are the repr, tidyverse, and tidymodels libraries, which will help us manipulate, compute, and visualize data.

In [None]:
library(repr)
library(tidyverse)
library(tidymodels)

- ### Reading Data
    - Here we read the "players.csv" dataset from a URL using the read_csv() function and display the first five rows using the slice_head() function to ensure that the dataset was read properly.

In [None]:
player <- read_csv("https://raw.githubusercontent.com/gatory/DSCI_100_Term_Project/refs/heads/main/data/players.csv")
slice_head(player, n=5)

- ### Wrangling Data
    - Here, we remove the columns that are unnecessary for our purposes ("hashedEmail", "gender" and "name") using the select() function.
    - We convert the "experience" and "subscribe" variables to factors using the mutate() function, then recode the values of "subscribe" from TRUE/FALSE to 'Yes'/'No'.
    - We display the first five rows using the slice_head() function to ensure that the dataset was altered properly.

In [None]:
tidy_player <- player |> 
    select(-hashedEmail, -gender, -name) |>
    mutate(experience = as_factor(experience), subscribe = as_factor(subscribe)) |>
    mutate(subscribe = recode(subscribe, "TRUE" = "Yes", "FALSE" = "No"))
slice_head(tidy_player, n=5)
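
A quick sanity check after wrangling can surface the missing ages mentioned in the dataset description and show how balanced the two subscription classes are. The sketch below is only an illustration: `toy_player` is a small made-up tibble with the same column names as `tidy_player`, not the real data.

```r
library(dplyr)

# Hypothetical stand-in for tidy_player; all values are invented.
toy_player <- tibble(
    experience   = factor(c("Pro", "Veteran", "Amateur", "Regular", "Beginner", "Amateur")),
    subscribe    = factor(c("Yes", "Yes", "No", "Yes", "No", "Yes")),
    played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 0.0),
    Age          = c(9, 17, NA, 21, 17, 22)
)

# Count missing values in every column.
missing_counts <- toy_player |>
    summarize(across(everything(), ~ sum(is.na(.x))))
missing_counts

# Count players in each subscription class to check class balance.
class_counts <- toy_player |> count(subscribe)
class_counts
```

Running the same two steps on the real `tidy_player` would show how many rows are affected before any model is fit.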

- ### Summarizing Data
    - Here we summarize the new dataset using the group_by() and summarize() functions to calculate the average hours played and the average age of players for each combination of player type and subscription status.

In [None]:
tidy_player_mean <- tidy_player |>
    group_by(experience, subscribe) |>
    summarize(player_hours_mean = mean(played_hours, na.rm = TRUE), age_mean = mean(Age, na.rm = TRUE))
tidy_player_mean

- ### Visualizing Data
    - Here we display two graphs using the ggplot() function.
        - The first is a scatterplot titled "Hours Played vs. Age Relationship", which shows each player's hours played against their age on log-scaled axes, coloured by subscription status.
        - The second is a bar graph titled "Distribution of Players Across Experience and Subscription", which shows the number of subscribed and unsubscribed players for each player type.
    - The graphs suggest there may be some correlation between whether a player is subscribed and their age and played hours.

In [None]:
options(repr.plot.width = 10, repr.plot.height = 8) 
tidy_player_age_plot <- tidy_player |>
    ggplot(aes(x = Age, y = played_hours, color = subscribe)) +
    geom_point(alpha = 0.9) + 
    scale_x_log10() +
    scale_y_log10() +
    labs(x = "Age (years)", y = "Hours Played (hours)", color = "Subscribed?") +
    ggtitle("Hours Played vs. Age Relationship (Fig. 1)") +
    theme(text = element_text(size = 18))

tidy_players_experience_plot <- tidy_player |>
    ggplot(aes(y = experience, fill = subscribe)) +
    geom_bar(stat = "count") +
    labs(x = "Number of Players", y = "Player Type", fill = "Subscribed?") +
    ggtitle("Distribution of Players Across Experience and Subscription (Fig.2)") +
    theme(text = element_text(size = 18))

tidy_player_age_plot
tidy_players_experience_plot
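
One caveat with the log-scaled axes in Fig. 1: `log10(0)` is `-Inf`, so a player with 0 played hours cannot be placed on a log axis, and ggplot2 typically warns about infinite values in that case. Assuming such players exist in the data (not verified here), one common workaround is to add 1 before taking the log. A minimal sketch with invented numbers:

```r
# Hypothetical played_hours values, including a zero.
hours <- c(0, 0.5, 9, 99)

# A plain log transform sends 0 to -Inf.
plain_log <- log10(hours)

# Shifting by 1 first keeps 0 at 0 and preserves the ordering.
shifted_log <- log10(hours + 1)

plain_log
shifted_log
```

The shift distorts very small values slightly, so it is a trade-off rather than a strict improvement.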

- ### Modeling Data

- #### Universal Tuning Specifications for Subscription Prediction
    - Here we use the nearest_neighbor() function to create the model specification that we will later use to tune each of the individual player type models.

In [None]:
set.seed(4923)
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

- #### Filter and Split Datasets
    - Here we use the filter() function to select only the rows for each player type and assign them to individual data sets.
    - For each player type data set, we split the data into training and testing sets so that our model will not see the data that will be used to test it.

In [None]:
set.seed(4923)
beginner_players <- tidy_player |> filter(experience == "Beginner")
beginner_split <- initial_split(beginner_players, prop = 0.75, strata = subscribe)
beginner_train <- training(beginner_split)
beginner_test <- testing(beginner_split)

regular_players <- tidy_player |> filter(experience == "Regular")
regular_split <- initial_split(regular_players, prop = 0.75, strata = subscribe)
regular_train <- training(regular_split)
regular_test <- testing(regular_split)

amateur_players <- tidy_player |> filter(experience == "Amateur")
amateur_split <- initial_split(amateur_players, prop = 0.75, strata = subscribe)
amateur_train <- training(amateur_split)
amateur_test <- testing(amateur_split)

veteran_players <- tidy_player |> filter(experience == "Veteran")
veteran_split <- initial_split(veteran_players, prop = 0.75, strata = subscribe)
veteran_train <- training(veteran_split)
veteran_test <- testing(veteran_split)

pro_players <- tidy_player |> filter(experience == "Pro") |> na.omit()
pro_split <- initial_split(pro_players, prop = 0.75, strata = subscribe)
pro_train <- training(pro_split)
pro_test <- testing(pro_split)
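
Since the same filter-and-split step is repeated for every player type, it could be wrapped in a small function. The sketch below is one possible way to do this, not the approach used in this report: `split_by_experience()` is a hypothetical helper, and `toy_player` is synthetic data with invented values.

```r
library(dplyr)
library(rsample)

set.seed(4923)

# Synthetic stand-in for tidy_player (invented values, 40 rows per player type).
toy_player <- tibble(
    experience   = factor(rep(c("Beginner", "Regular"), each = 40)),
    subscribe    = factor(rep(c("Yes", "Yes", "Yes", "No"), times = 20)),
    played_hours = runif(80, min = 0, max = 20),
    Age          = sample(10:40, size = 80, replace = TRUE)
)

# Hypothetical helper: keep one player type, then make a stratified 75/25 split.
split_by_experience <- function(data, level) {
    data |>
        filter(experience == level) |>
        initial_split(prop = 0.75, strata = subscribe)
}

beginner_split <- split_by_experience(toy_player, "Beginner")
beginner_train <- training(beginner_split)
beginner_test  <- testing(beginner_split)

nrow(beginner_train)
nrow(beginner_test)
```

Calling the helper once per player type would replace the five near-identical blocks while keeping the same proportion and stratification.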

- #### Folding 5 Folds for each player type
    - Here we create 5-fold cross-validation splits (vfold) for each player type's training set so that we can cross-validate our predictions.

In [None]:
set.seed(4923)
beginner_vfold <- vfold_cv(beginner_train, v = 5, strata = subscribe)
regular_vfold <- vfold_cv(regular_train, v = 5, strata = subscribe)
amateur_vfold <- vfold_cv(amateur_train, v = 5, strata = subscribe)
veteran_vfold <- vfold_cv(veteran_train, v = 5, strata = subscribe)
pro_vfold <- vfold_cv(pro_train, v = 5, strata = subscribe)

- #### Tuning Model using a suitable range of Ks
    - Here, we find a suitable range for k based on how many observations exist for each player type data set. 
        - Because each of the player types has a different number of observations, we use the largest range of k-values possible so that we can give each player type the best possible k value once that value is determined.
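
The upper end of each range is limited by how many rows the model actually trains on during cross-validation. As a rough sketch, each analysis set in 5-fold cross-validation holds about four fifths of the training rows, and k should not exceed that number. `max_k()` is a hypothetical helper, and the row counts below are invented rather than taken from our data.

```r
# Hypothetical helper: approximate largest usable k for v-fold cross-validation.
# Each analysis set contains roughly (v - 1) / v of the training rows.
max_k <- function(n_train, v = 5) {
    floor(n_train * (v - 1) / v)
}

max_k(10)  # a small group with 10 training rows
max_k(45)  # a larger group with 45 training rows
```

Stratified folds can differ in size by a row or two, so this is an estimate rather than an exact bound.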

In [None]:
# Tune for best k given proper range
set.seed(4923)
beginner_k_vals <- tibble(neighbors = seq(from = 1, to = 10, by = 1))
beginner_recipe <- recipe(subscribe ~ played_hours + Age, data = beginner_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())
beginner_fit <- workflow() |>
  add_recipe(beginner_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = beginner_vfold, grid = beginner_k_vals)

regular_k_vals <- tibble(neighbors = seq(from = 1, to = 10, by = 1))
regular_recipe <- recipe(subscribe ~ played_hours + Age, data = regular_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())
regular_fit <- workflow() |>
  add_recipe(regular_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = regular_vfold, grid = regular_k_vals)

amateur_k_vals <- tibble(neighbors = seq(from = 1, to = 20, by = 1))
amateur_recipe <- recipe(subscribe ~ played_hours + Age, data = amateur_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())
amateur_fit <- workflow() |>
  add_recipe(amateur_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = amateur_vfold, grid = amateur_k_vals)

veteran_k_vals <- tibble(neighbors = seq(from = 1, to = 22, by = 1))
veteran_recipe <- recipe(subscribe ~ played_hours + Age, data = veteran_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())
veteran_fit <- workflow() |>
  add_recipe(veteran_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = veteran_vfold, grid = veteran_k_vals)

pro_k_vals <- tibble(neighbors = seq(from = 1, to = 6, by = 1))
pro_recipe <- recipe(subscribe ~ played_hours + Age, data = pro_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())
pro_fit <- workflow() |>
  add_recipe(pro_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = pro_vfold, grid = pro_k_vals)

- #### Analyzing Accuracies and Picking Best K
    - Here, we analyze the accuracies of the predictions made using each k value in each range for each player type.
        - Based on the accuracy for each k value, we can choose the best one, the one with the highest accuracy, to use in our final model that we will test using the test data. 

In [None]:
# Analyze for best k
options(repr.plot.width = 5, repr.plot.height = 5)
set.seed(4923)
beginner_accuracies <- beginner_fit |> collect_metrics() |>
  filter(.metric == "accuracy")
beginner_cross_val_plot <- ggplot(beginner_accuracies, aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Neighbors", y = "Accuracy Estimate", title = "Beginner Accuracies") +
    theme(text = element_text(size = 12))

regular_accuracies <- regular_fit |> collect_metrics() |>
  filter(.metric == "accuracy")
regular_cross_val_plot <- ggplot(regular_accuracies, aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Neighbors", y = "Accuracy Estimate", title = "Regular Accuracies") +
    theme(text = element_text(size = 12))

amateur_accuracies <- amateur_fit |> collect_metrics() |>
  filter(.metric == "accuracy")
amateur_cross_val_plot <- ggplot(amateur_accuracies, aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Neighbors", y = "Accuracy Estimate", title = "Amateur Accuracies") +
    theme(text = element_text(size = 12))

veteran_accuracies <- veteran_fit |> collect_metrics() |>
  filter(.metric == "accuracy")
veteran_cross_val_plot <- ggplot(veteran_accuracies, aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Neighbors", y = "Accuracy Estimate", title = "Veteran Accuracies") +
    theme(text = element_text(size = 12))

pro_accuracies <- pro_fit |> collect_metrics() |>
  filter(.metric == "accuracy")
pro_cross_val_plot <- ggplot(pro_accuracies, aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Neighbors", y = "Accuracy Estimate", title = "Pro Accuracies") +
    theme(text = element_text(size = 12))

beginner_cross_val_plot
regular_cross_val_plot
amateur_cross_val_plot
veteran_cross_val_plot
pro_cross_val_plot
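
Besides reading the best k off each plot, it can also be picked programmatically. The sketch below uses an invented accuracy table shaped like the output of collect_metrics(); the numbers are made up for illustration. The tune package also provides select_best() for this purpose.

```r
library(dplyr)

# Invented cross-validation results in the shape returned by collect_metrics().
toy_accuracies <- tibble(
    neighbors = 1:6,
    .metric   = "accuracy",
    mean      = c(0.55, 0.61, 0.72, 0.68, 0.70, 0.66)
)

# Keep the single row with the highest mean accuracy and extract its k.
best_k <- toy_accuracies |>
    slice_max(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)

best_k
```

Selecting k in code avoids transcription mistakes when the value is copied into the final model specification.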

- #### Retraining Specification Model Using Best K
    - Here we create new specifications for each player type-specific model using the best k value derived from the previous step.
    - This will be the specification that we will use to train the final classification model for each player type. 

In [None]:
# Retrain model using best k
set.seed(4923)
beginner_mnist_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
    set_engine("kknn") |>
    set_mode("classification")
beginner_mnist_fit <- fit(beginner_mnist_spec, subscribe ~ played_hours + Age, data = beginner_train)

regular_mnist_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 6) |>
    set_engine("kknn") |>
    set_mode("classification")
regular_mnist_fit <- fit(regular_mnist_spec, subscribe ~ played_hours + Age, data = regular_train)

amateur_mnist_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 18) |>
    set_engine("kknn") |>
    set_mode("classification")
amateur_mnist_fit <- fit(amateur_mnist_spec, subscribe ~ played_hours + Age, data = amateur_train)

veteran_mnist_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 20) |>
    set_engine("kknn") |>
    set_mode("classification")
veteran_mnist_fit <- fit(veteran_mnist_spec, subscribe ~ played_hours + Age, data = veteran_train)

pro_mnist_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 4) |>
    set_engine("kknn") |>
    set_mode("classification")
pro_mnist_fit <- fit(pro_mnist_spec, subscribe ~ played_hours + Age, data = pro_train)
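
The tuning step used a recipe that scales and centers the predictors, while the final fits above pass a formula directly to fit(). Whether this changes the results depends on how the engine handles scaling internally. One way to guarantee identical preprocessing is to fit the final model inside a workflow with the same recipe. The sketch below illustrates that alternative on synthetic data; `toy_train` and the choice of k = 3 are placeholders, not values from our analysis.

```r
library(tidymodels)

set.seed(4923)

# Synthetic training data standing in for one player type (invented values).
toy_train <- tibble(
    subscribe    = factor(rep(c("Yes", "Yes", "No"), times = 10)),
    played_hours = runif(30, min = 0, max = 20),
    Age          = sample(10:40, size = 30, replace = TRUE)
)

# Same preprocessing steps as in the tuning recipes.
toy_recipe <- recipe(subscribe ~ played_hours + Age, data = toy_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# k = 3 is only a placeholder for the value chosen by cross-validation.
toy_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
    set_engine("kknn") |>
    set_mode("classification")

# Bundling recipe and model applies the same scaling at fit and predict time.
toy_fit <- workflow() |>
    add_recipe(toy_recipe) |>
    add_model(toy_spec) |>
    fit(data = toy_train)

toy_predictions <- predict(toy_fit, toy_train)
```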

- #### Determine Final Accuracies
    - Here, we use each model we created for each player type and make predictions for the test datasets that we split earlier.

In [None]:
# See results and accuracies
set.seed(4923)
beginner_mnist_predictions <- predict(beginner_mnist_fit, beginner_test) |> bind_cols(beginner_test)
beginner_mnist_metrics <- beginner_mnist_predictions |> 
    metrics(truth = subscribe, estimate = .pred_class) |> 
    filter(.metric == "accuracy")

regular_mnist_predictions <- predict(regular_mnist_fit, regular_test) |> bind_cols(regular_test)
regular_mnist_metrics <- regular_mnist_predictions |> 
    metrics(truth = subscribe, estimate = .pred_class) |> 
    filter(.metric == "accuracy")

amateur_mnist_predictions <- predict(amateur_mnist_fit, amateur_test) |> bind_cols(amateur_test)
amateur_mnist_metrics <- amateur_mnist_predictions |> 
    metrics(truth = subscribe, estimate = .pred_class) |> 
    filter(.metric == "accuracy")

veteran_mnist_predictions <- predict(veteran_mnist_fit, veteran_test) |> bind_cols(veteran_test)
veteran_mnist_metrics <- veteran_mnist_predictions |> 
    metrics(truth = subscribe, estimate = .pred_class) |> 
    filter(.metric == "accuracy")

pro_mnist_predictions <- predict(pro_mnist_fit, pro_test) |> bind_cols(pro_test)
pro_mnist_metrics <- pro_mnist_predictions |> 
    metrics(truth = subscribe, estimate = .pred_class) |> 
    filter(.metric == "accuracy")
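
For reference, the accuracy reported by metrics() is simply the share of test-set predictions that match the true label. The sketch below recomputes it by hand on six invented predictions and adds a confusion table, which shows the kind of mistakes made.

```r
# Invented true labels and predictions for six test-set players.
truth    <- factor(c("Yes", "Yes", "No", "Yes", "No", "Yes"), levels = c("No", "Yes"))
estimate <- factor(c("Yes", "No",  "No", "Yes", "Yes", "Yes"), levels = c("No", "Yes"))

# Accuracy is the share of predictions that match the true label.
accuracy <- mean(truth == estimate)
accuracy

# A confusion table separates the two kinds of error.
confusion <- table(truth = truth, estimate = estimate)
confusion
```

With very small test sets, as for the smaller player types, a single misclassified player shifts the accuracy by a large amount.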

In [None]:
beginner_results <- beginner_mnist_metrics |> mutate(experience = "beginner")
regular_results <- regular_mnist_metrics |> mutate(experience = "regular")
amateur_results <- amateur_mnist_metrics |> mutate(experience = "amateur")
veteran_results <- veteran_mnist_metrics |> mutate(experience = "veteran")
pro_results <- pro_mnist_metrics |> mutate(experience = "pro")

- ### Visualizing Results
    - Here, the final accuracies of the predictions made for each player type are tabulated and graphed so that they can be compared side by side.
    - The higher the accuracy, the better played hours and age predict subscription status for that player type.

In [None]:
total_metrics <- bind_rows(beginner_results, regular_results, amateur_results, veteran_results, pro_results) |> 
                mutate(accuracy = .estimate * 100, experience = as_factor(experience)) |>
                select(experience, accuracy)
total_metrics

options(repr.plot.width = 10, repr.plot.height = 10)

tidy_players_experience_plot <- total_metrics |>
    ggplot(aes(x = accuracy, y = experience, fill = experience)) +
    geom_bar(stat = "identity") +
    labs(x = "Accuracies in %", y = "Player Type", fill = "Player Type") +
    ggtitle("Predictability of Different Player Types (Fig. 3)") +
    theme(text = element_text(size = 18))

tidy_players_experience_plot
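
When comparing these accuracies, it helps to keep a majority-class baseline in mind: a model that always guesses the more common class already achieves an accuracy equal to that class's share. The sketch below computes this baseline on invented labels; the real value would have to be calculated separately for each player type.

```r
# Invented subscription labels for one player type's test set.
subscribe <- factor(c("Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes"))

# Share of each class; the largest share is the accuracy of always guessing it.
class_share <- prop.table(table(subscribe))
baseline_accuracy <- max(class_share)

baseline_accuracy
```

A model accuracy is only informative to the extent that it exceeds this baseline.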

## Discussion
### Findings Summary
Our final models showed that beginner, regular, and amateur players had the highest prediction accuracies, consistently reaching 70% or above. These results suggest that for players who are newer or moderately experienced, there is a clearer link between their in-game activity and their decision to subscribe to newsletters. These players may be more invested in learning and staying informed, which makes them more likely to sign up for game newsletters.
However, our models didn’t work as well for pro and veteran players. The prediction accuracy for these groups was much lower, which suggests that their behaviour doesn’t clearly show whether they’ll subscribe or not. This matches something we noticed early in our analysis: even though we thought experienced players would be more engaged, they actually spent less time on the server. This could mean that their reasons for playing, or for subscribing, are very different from those of newer players.

### Was it what we expected?
The results for beginner and regular players were what we expected. Players in these groups with more hours played are more likely to subscribe to the newsletter, since more playtime suggests that they are interested in the game's updates and content and want to stay up to date with every new detail.

The strong performance for beginners and regular players met our expectations. These players are still learning and exploring the game, so it makes sense that they would want updates and news, which are things they can get through a newsletter.
What surprised us was the lower accuracy for pro players. You might think that players who are really into the game would also want to stay connected and get updates, but our results and our initial visualizations showed that this wasn’t the case. Pro players might already know the game well and not need updates, or they might get their news from other places like online forums or communities.
Another point that stood out was that players with the most experience were spending the least amount of time on the server. This goes against the idea that more experience means more time spent playing. It may be that experienced players log in less often but play more efficiently, or they could be returning players who already know the game well and don’t need regular sessions.

### What impacts does this finding have?
These findings could inform how game developers and marketing teams design their player engagement strategies. For instance, newsletters may be most effective when targeted toward newer or average players who play regularly and are more likely to be interested in learning more about the game. Developers could even make different versions of the newsletter, for example, providing beginner tips for new players and advanced patch notes or balance changes for pros.
For pro or veteran players, who are harder to predict and may not care about newsletters, developers could try other ways to keep them interested. This could include things like in-game messages, special events, or rewards for being active in the community.
Our project also shows that even with a small dataset and just a few simple features, we can still find useful patterns. By using a basic model like KNN, we were able to learn something real about how different players interact with the game and what keeps them engaged.

### What future questions could this lead to?
This project brings up some interesting new questions we could look into next such as:
- Can additional features improve accuracy? Things like how often players log in, how many sessions they have, what achievements they’ve earned, or even what they say in the chat might help us make better predictions.
- Do players’ habits change over time? It would be useful to follow players over a longer period to see if beginners usually subscribe early on and then stop, or if pro players ever go back to using the newsletter.
- Would other models work better? We used KNN, which is simple and easy to understand, but other methods like logistic regression could potentially offer better performance, especially for experience types with lower prediction accuracy.
- Can we make things more personal? Could these findings be used to build recommendation systems or newsletter content tailored to each player’s behaviour and interests?

This could also lead to the question of whether focusing only on beginner and regular players would significantly increase the number of players who subscribe to the newsletter.
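
To make the logistic regression suggestion concrete, here is a minimal sketch using base R's glm() on synthetic data. Everything in it is an assumption for illustration: the data are simulated so that subscription probability rises with played hours, and the 0.5 threshold is a common default rather than a tuned value.

```r
set.seed(4923)

# Synthetic data (invented): subscription probability rises with played hours.
n <- 200
toy_player <- data.frame(
    played_hours = runif(n, min = 0, max = 20),
    Age          = sample(10:40, size = n, replace = TRUE)
)
toy_player$subscribe <- rbinom(n, size = 1, prob = plogis(-1 + 0.2 * toy_player$played_hours))

# Logistic regression models the log-odds of subscribing as a linear
# function of the predictors.
logit_fit <- glm(subscribe ~ played_hours + Age, data = toy_player, family = binomial)

# Predicted probabilities, converted to class labels at a 0.5 threshold.
pred_prob  <- predict(logit_fit, type = "response")
pred_class <- ifelse(pred_prob > 0.5, "Yes", "No")

coef(logit_fit)
```

Unlike KNN, the fitted coefficients give a direct reading of how each predictor is associated with the odds of subscribing.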

### Potential issues for future reference
One issue we ran into during this project was that some player experience groups didn’t have enough data. For example, there were fewer players in the pro and veteran categories compared to beginners or regulars. Because of this, our model didn’t have as much information to learn from, which likely made the predictions for those groups less accurate.

When a dataset is small or unbalanced, it becomes harder for the model to recognize patterns. This can make the model either learn random details that don’t really matter (called overfitting), or not learn enough to make good predictions (called underfitting). In our case, the model worked well for groups with more data, but not as well for smaller ones. This means our results might not fully reflect how predictive the features are for every type of player.

In future work, it would be helpful to collect more data, especially for underrepresented player types. Another option would be to use techniques like oversampling or data balancing methods to even out the number of examples in each group.
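
As an illustration of the oversampling idea, the sketch below balances two classes by resampling rows of the smaller class with replacement. The data are invented, and this is only one simple form of oversampling. In practice it should be applied to the training set only, so that the test set keeps its original class proportions.

```r
set.seed(4923)

# Invented, imbalanced data: 8 subscribers and 2 non-subscribers.
toy_player <- data.frame(
    subscribe    = c(rep("Yes", 8), rep("No", 2)),
    played_hours = c(5, 2, 9, 1, 7, 3, 4, 6, 0.5, 0.2)
)

counts   <- table(toy_player$subscribe)
minority <- names(counts)[which.min(counts)]
n_extra  <- max(counts) - min(counts)

# Draw extra minority rows with replacement until both classes are the same size.
minority_rows <- which(toy_player$subscribe == minority)
extra_rows    <- sample(minority_rows, size = n_extra, replace = TRUE)
balanced      <- rbind(toy_player, toy_player[extra_rows, ])

table(balanced$subscribe)
```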

Ultimately, this project highlights the potential of using player data not just to study gameplay, but to better understand player engagement, communication habits, and content preferences.