## Final Project Report: Exploring Simultaneous Player Prediction Using K-Nearest Neighbor Regression


**Background information**




**Goal:**

The question we are trying to answer is: "Given the time of day, the day of the week, and player experience, which time window is most likely to have the highest number of simultaneous players?" We address it by using K-nearest neighbor (KNN) regression to predict the number of simultaneous players and identify the peak demand period.


**About the dataset:**

The two datasets contain information about players' game sessions and their personal profiles, respectively.
The sessions.csv dataset consists of 1535 observations and 5 variables. The hashedEmail variable identifies each player by a hashed email address. Its values repeat, indicating that some players have multiple game sessions. The start_time and end_time variables give the start and end of each game session, formatted as DD/MM/YYYY HH:MM, while original_start_time and original_end_time hold the same information in UNIX time format.
The players.csv dataset has 196 observations and 9 variables. Each hashedEmail value appears only once here, so there is one row per player. The dataset gives each player's experience level (amateur, beginner, pro, regular, or veteran) along with additional personal information: gender, age, name, subscription status, individual ID, and organization name. Unlike sessions.csv, it contains no timestamps; instead, played_hours records the total hours each player has played on PLAICraft.
Together, these two datasets provide sufficient information to predict the time slots that attract the largest number of simultaneous active players.
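As a small illustration of how a session timestamp can be turned into the two time features used later, the sketch below parses a single made-up value. It assumes the timestamps include minutes (as the use of `dmy_hm()` in the wrangling code suggests); the example string is hypothetical and is not taken from sessions.csv.

```r
library(lubridate)

# Hypothetical timestamp in DD/MM/YYYY HH:MM layout (not a real row of sessions.csv)
example_start <- dmy_hm("30/06/2024 18:12")

# Day of the week as a number (1 = Sunday with lubridate's default week start)
example_day <- wday(example_start)

# Hour of the day on a 0-23 scale
example_hour <- hour(example_start)

example_day
example_hour
```

Passing `label = TRUE` to `wday()` returns the day name as an ordered factor instead of a number, which is the form used in the wrangling step.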



## Methods & Results

**Preprocessing and Exploratory Data Analysis**


Importing Libraries and Setting Seed:
The first thing we will do is import all the necessary libraries.

In [5]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
# used lubridate in order to separate datetime data into useful form
library(lubridate)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

## Importing Dataset

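Here we load both files, extract the day of the week and the hour from each session's start time, attach each player's experience level, and average the session counts by hour and experience level.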

In [2]:
# Load the raw session and player tables
sessions_data <- read_csv("sessions.csv")
players_data <- read_csv("players.csv")

# Parse the start-time strings, then extract day of week and hour of day
sessions_dt <- sessions_data |>
    mutate(date_start_time = dmy_hm(start_time),
           day_of_week = wday(date_start_time, label = TRUE),
           hour_of_day = hour(date_start_time)) |>
    mutate(date_end_time = dmy_hm(end_time)) |>  # parsed, but dropped by select() below
    select(hashedEmail, day_of_week, hour_of_day)

# Keep only the player columns needed for the join
players_select <- players_data |>
    select(experience, hashedEmail, played_hours)

# Attach each player's experience level to every one of their sessions
sessions_players_merge <- left_join(sessions_dt, players_select, by = "hashedEmail")

# Count the sessions that start in each (day, hour, experience) cell;
# this count is what the report treats as simultaneous players
hourly_data <- sessions_players_merge |>
    group_by(day_of_week, hour_of_day, experience) |>
    summarize(simultaneous_players = n(), .groups = "drop")

# Average those counts across days of the week for each hour and experience level
average_day_data <- hourly_data |>
    group_by(hour_of_day, experience) |>
    summarize(avg_players = mean(simultaneous_players), .groups = "drop")
average_day_data


Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 196 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, age
lgl (3): subscribe, individualId, organizationName

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

| hour_of_day | experience | avg_players |
|---|---|---|
| `<int>` | `<chr>` | `<dbl>` |
| 0 | Amateur | 7.857143 |
| 0 | Beginner | 1.666667 |
| 0 | Pro | 1.500000 |
| ⋮ | ⋮ | ⋮ |
| 23 | Pro | 1.333333 |
| 23 | Regular | 5.571429 |
| 23 | Veteran | 1.500000 |


*figure 1*
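To make the meaning of `avg_players` concrete, here is a minimal sketch of the same two-stage aggregation on a tiny made-up table. The `toy_sessions` rows are hypothetical and exist only for illustration: two sessions start at 20:00 on Monday and four at 20:00 on Tuesday, all by players with the same experience level.

```r
library(dplyr)
library(tibble)

# Hypothetical sessions: one row per session start (not real PLAICraft data)
toy_sessions <- tribble(
  ~day_of_week, ~hour_of_day, ~experience,
  "Mon",        20,           "Pro",
  "Mon",        20,           "Pro",
  "Tue",        20,           "Pro",
  "Tue",        20,           "Pro",
  "Tue",        20,           "Pro",
  "Tue",        20,           "Pro"
)

# Stage 1: count session starts per day, hour, and experience level
toy_hourly <- toy_sessions |>
  group_by(day_of_week, hour_of_day, experience) |>
  summarize(simultaneous_players = n(), .groups = "drop")

# Stage 2: average those counts over the days of the week
toy_average <- toy_hourly |>
  group_by(hour_of_day, experience) |>
  summarize(avg_players = mean(simultaneous_players), .groups = "drop")

toy_average
```

In this toy case the Monday count is 2 and the Tuesday count is 4, so the single resulting row has an `avg_players` value of 3. Note that only days with at least one session in a given hour contribute to the mean.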

## Splitting Data Into Training and Testing Sets

We will split our data into training and testing sets before working on the model or performing any exploratory data analysis. Since we are trying to predict the average number of players, the split is stratified on the `avg_players` variable, with 80% of the rows going to the training set.

Since this is a random split, it is important to set a seed for reproducibility. For this, we have chosen seed 1111.

In [3]:
set.seed(1111)
plaicraft_split <- initial_split(average_day_data, prop = 8/10, strata = avg_players)
plaicraft_training <- training(plaicraft_split)
plaicraft_testing <- testing(plaicraft_split)
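One way to sanity-check a split like this is to compare the row counts of the two sets. The sketch below does so on a synthetic table, since the real `average_day_data` is not reproduced here; `toy_data` and its randomly generated `avg_players` values are assumptions made purely for illustration.

```r
library(rsample)

set.seed(1111)

# Synthetic stand-in for average_day_data (values are random, not real results)
toy_data <- data.frame(
  hour_of_day = rep(0:23, times = 5),
  avg_players = runif(120, min = 1, max = 8)
)

# Same call pattern as in the report: 80/20 split stratified on the response
toy_split <- initial_split(toy_data, prop = 8/10, strata = avg_players)
toy_training <- training(toy_split)
toy_testing <- testing(toy_split)

# Row counts of the two sets; together they should cover every row exactly once
c(training = nrow(toy_training), testing = nrow(toy_testing))
```

Stratifying on a numeric variable bins it into quantiles first, so the exact counts can differ slightly from a plain 80/20 cut.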


## Performing the Data Analysis
### Finding the Best K-value
To find the best k-value (number of neighbors), we use 10-fold cross-validation on the training set to select the optimal *k* for our regression.
This can be achieved with the following steps:
- Create a model specification that tunes the number of neighbors
- Create a recipe that uses `hour_of_day` as the predictor. Here we also add steps for scaling and centering the data.
- Perform 10-fold cross-validation using a workflow
- Collect the metrics from the results of the workflow analysis
- Keep only the rows for the rmse metric and find the best k value using the slice_min() function
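For intuition about what is being tuned, the following base-R sketch shows how KNN regression with a rectangular (unweighted) kernel forms a prediction: it averages the response of the k training points closest to the query. The helper `knn_predict()` and the toy values are hypothetical and are not part of the report's pipeline, which relies on the kknn engine instead.

```r
# Hypothetical helper: unweighted KNN regression on a single numeric predictor
knn_predict <- function(train_x, train_y, new_x, k) {
  distances <- abs(train_x - new_x)        # distance from the query to each training point
  nearest <- order(distances)[seq_len(k)]  # indices of the k closest training points
  mean(train_y[nearest])                   # rectangular weighting = plain average
}

# Toy training data (made up): hour of day and average players
train_hours <- c(0, 1, 2, 3, 4)
train_players <- c(4, 6, 9, 6, 2)

# Predict for hour 2 using the 3 nearest hours (2, 1, and 3)
toy_prediction <- knn_predict(train_hours, train_players, new_x = 2, k = 3)
toy_prediction
```

Here the three nearest hours have 9, 6, and 6 players, so the prediction is 7. A small k follows the training points closely, while a larger k averages over more hours and gives a smoother curve, which is the trade-off that cross-validation is used to balance.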



In [4]:
set.seed(1111)
plaicraft_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |> 
      set_engine("kknn") |>
      set_mode("regression") 


plaicraft_recipe <- recipe(avg_players ~ hour_of_day, data = plaicraft_training) |>
      step_scale(all_predictors()) |>
      step_center(all_predictors())

plaicraft_vfold <- vfold_cv(plaicraft_training, v = 10, strata = avg_players)

plaicraft_workflow <- workflow() |>
    add_recipe(plaicraft_recipe) |>
    add_model(plaicraft_spec)

plaicraft_workflow



“The number of observations in each quantile is below the recommended threshold of 20.
• Stratification will use 3 breaks instead.”


══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: nearest_neighbor()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_scale()
• step_center()

── Model ───────────────────────────────────────────────────────────────────────
K-Nearest Neighbor Model Specification (regression)

Main Arguments:
  neighbors = tune()
  weight_func = rectangular

Computational engine: kknn 


In [3]:
set.seed(1234)
gridvals <- tibble(neighbors = seq(from = 1, to = 10, by = 2))
plaicraft_results <- workflow() |>
    add_recipe(plaicraft_recipe) |>
    add_model(plaicraft_spec) |>
    tune_grid(resamples = plaicraft_vfold, grid = gridvals) |>
    collect_metrics()
plaicraft_results

plaicraft_min <- plaicraft_results |>
   filter(.metric == "rmse") |>
   slice_min(mean, n = 1)
plaicraft_min



*figure 2 and 3*

**Results:** 

From Figure 3, we can conclude that, among the values tested (1, 3, 5, 7, and 9), the K value with the lowest cross-validated RMSE is **9**, so it is expected to give the most accurate predictions. We will now use this K value to continue with our KNN regression model.

## Building the Model

In [None]:
set.seed(4444)
k_min <- plaicraft_min |>
         pull(neighbors)

plaicraft_best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = k_min) |>
         set_engine("kknn") |>
         set_mode("regression")

plaicraft_best_fit <- workflow() |>
         add_recipe(plaicraft_recipe) |>
         add_model(plaicraft_best_spec) |>
         fit(data = plaicraft_training)

plaicraft_summary <- plaicraft_best_fit |>
          predict(plaicraft_testing) |>
          bind_cols(plaicraft_testing) |>
          metrics(truth = avg_players, estimate = .pred)


plaicraft_summary

*figure 4*

Figure 4 shows the three evaluation metrics for this model on the testing set: RMSE, R², and MAE.
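To show what these metrics measure, the sketch below computes them by hand in base R on a few made-up values. The `truth` and `estimate` vectors are hypothetical, and the R² line assumes the squared-correlation definition that the yardstick documentation gives for `rsq`.

```r
# Hypothetical observed and predicted values (not the model's real output)
truth <- c(3, 5, 7, 4)
estimate <- c(2, 5, 9, 4)

# RMSE: square root of the mean squared error; penalizes large misses more heavily
rmse_value <- sqrt(mean((truth - estimate)^2))

# MAE: mean absolute error; the typical size of a miss, in players
mae_value <- mean(abs(truth - estimate))

# R squared, assumed here to be the squared correlation of truth and estimate
rsq_value <- cor(truth, estimate)^2

c(rmse = rmse_value, rsq = rsq_value, mae = mae_value)
```

RMSE and MAE are both in the units of the response (average players), so they can be read directly against the range of `avg_players` in the data.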

We will use the trained model (plaicraft_best_fit) to generate predictions for the training data (plaicraft_training) and combine these predictions with the original data. The results will be visualized in a scatter plot showing the actual average number of players against the hour of the day, with the model's predicted values overlaid as a black line. The points will be color-coded by experience level; because the recipe uses only `hour_of_day` as a predictor, the predicted line is the same for every experience level.


## Visualizing the Results

In [None]:
set.seed(1000)
options(repr.plot.width = 7, repr.plot.height = 7)

        
plaicraft_preds <- plaicraft_best_fit |>
    predict(plaicraft_training) |>
    bind_cols(plaicraft_training)
plaicraft_preds

plaicraft_plot <- plaicraft_preds |>
    ggplot(aes(x = hour_of_day, y = avg_players, colour = experience)) +
    geom_point(alpha = 0.5) +
    xlab("Hour of Day") +
    ylab("Average Number of Players") +
    geom_line(aes(x = hour_of_day, y = .pred), color = "black") +
    labs(color = "Experience") +
    ggtitle(paste0("K = ", k_min))


plaicraft_plot

*figure 5 and 6*

From Figure 6, we can conclude that the busiest period of the day is between 0:00 and 5:00, with the peak at 3:00 AM.

## Effectiveness of the model 
To better demonstrate the effectiveness of the model, we will visualize both the training and testing results. This will allow us to compare how well the model performs on data it has seen (training data) versus data it has not seen (testing data), helping to assess its accuracy and ability to generalize to new, unseen data.

In [4]:
set.seed(1000)
library(cowplot)  # provides plot_grid(); not attached by tidyverse or tidymodels
options(repr.plot.width = 7, repr.plot.height = 7)

# Predictions on the data the model was fitted to
plaicraft_pred_training <- plaicraft_best_fit |>
    predict(plaicraft_training) |>
    bind_cols(plaicraft_training)

# Predictions on the held-out testing data
plaicraft_pred_testing <- plaicraft_best_fit |>
    predict(plaicraft_testing) |>
    bind_cols(plaicraft_testing)

# Observed points coloured by experience, with the fitted curve in black
plaicraft_pred_training_plot <- plaicraft_pred_training |>
    ggplot(aes(x = hour_of_day, y = avg_players, colour = experience)) +
    geom_point(alpha = 0.5) +
    xlab("Hour of Day") +
    ylab("Average Number of Players") +
    geom_line(aes(x = hour_of_day, y = .pred), color = "black") +
    labs(color = "Experience") +
    ggtitle(paste0("Training data, K = ", k_min))

plaicraft_pred_testing_plot <- plaicraft_pred_testing |>
    ggplot(aes(x = hour_of_day, y = avg_players, colour = experience)) +
    geom_point(alpha = 0.5) +
    xlab("Hour of Day") +
    ylab("Average Number of Players") +
    geom_line(aes(x = hour_of_day, y = .pred), color = "black") +
    labs(color = "Experience") +
    ggtitle(paste0("Testing data, K = ", k_min))

# Show the two panels side by side, then each plot on its own
plot_grid(plaicraft_pred_training_plot, plaicraft_pred_testing_plot, ncol = 2, nrow = 1)

plaicraft_pred_training_plot
plaicraft_pred_testing_plot





In [None]:

set.seed(4000)

ranked_hours <- plaicraft_pred_training |> 
  select(hour_of_day, .pred) |> 
  group_by(hour_of_day) |> 
  summarize(avg_predicted_players = mean(.pred), .groups = "drop") |> 
  arrange(desc(avg_predicted_players)) |> 
  slice_head(n = 5)
ranked_hours
