# DSCI 100 Project Report
## Predicting time windows with high-demand usage for efficient allocation of licenses
### Introduction
Over the past few decades, digital technology has advanced more rapidly than any other human innovation and has reached the point where our society is almost completely dependent on it. The increasing volume of data, as a consequence of such advancement, has made data science is one of the fastest growing field across every industry(IBM, 2021).

A popular field where the importance of data science has skyrocketed is the gaming industry. With the use of data, developers can patterns and preferences, enabling them enhance the gaming experince of the players(Whitehead, 2024). Data science helps gaming companies develop "effective monetisation strategies" (Whitehead, 2024) by examining the spending patterns of the players and forecasting behaviour. In this porject, data science methods will be used to explore a dataset, extracted from the Minecraft server, in an attempt to predict the time windows where player activity patterns are high. Using these predictions will allow better allocation of server licenses. The data, collected by a research group in Computer Science led by Frank Wood at UBC Point Grey Campus, will be used to answer the question: which day of the week is a player most likely to log on based on their age, gender and experience?
 


#### Data Description: identify and fully describe the dataset that was used to answer the question. Provide a full descriptive summary of the dataset, including information such as the number of observations, summary statistics, number of variables, name and type of variables, what the variables mean, any issues you see in the data, any other potential issues related to things you cannot directly see, how the data were collected, etc. Make sure to use bullet point lists or tables to summarize the variables in an easy-to-understand format. Note that the selected dataset(s) will probably contain more variables than you need. 

Rough Description 
Dataset 1: Player Characteristics and Behaviors
Aspect	Details
Number of observations	27 (players)
Number of variables	7
Variables:	
- experience	Categorical (factor) — Player experience level (e.g., Pro, Veteran, Amateur, Regular, Beginner)
- subscribe	Logical — Whether the player is subscribed to the game-related newsletter (TRUE/FALSE)
- hashedEmail	Character — Unique hashed identifier for each player (anonymized email)
- played_hours	Numeric — Total hours the player has played on the server
- name	Character — Player name
- gender	Categorical (factor) — Player gender (Male, Female, Non-binary)
- Age	Numeric — Age of the player in years

Summary Statistics (selected numeric variables):
Age ranges from 8 to 25 years.

Played hours vary from 0 to 48.4 hours, with many players having low or zero hours.

Notes & Potential Issues:
Small sample size (27 players).

Some players have zero playtime — may represent inactive accounts or new users.

Gender categories include Male, Female, and Non-binary — good inclusivity.

Age distribution is skewed towards younger players (mostly teens).

Data collection method: presumably logged from server and survey data (for demographics).

Player identities anonymized by hashing emails.

Dataset 2: Gameplay Session Logs
Aspect	Details
Number of observations	24 (gameplay sessions)
Number of variables	5
Variables:	
- hashedEmail	Character — Hashed unique player identifier, matches Dataset 1
- start_time	Date-time string — Timestamp when session started (format: dd/mm/yyyy HH:MM)
- end_time	Date-time string — Timestamp when session ended
- original_start_time	Numeric — Unix epoch timestamp for session start (milliseconds since 1970-01-01)
- original_end_time	Numeric — Unix epoch timestamp for session end

Notes & Potential Issues:
Sessions have varying lengths, some very short (minutes).

Dates span from April to August 2024.

Timestamps are in local time (assumed), but time zones are not explicitly mentioned.

Data linkage: hashedEmail connects this to player info.

Session data allows calculation of session duration, day of week, time of day.

Only 24 sessions recorded here — possibly a subset of total gameplay.

Some players have multiple sessions (repeat rows with same hashedEmail).

In [2]:
# importing libraries
library(tidyverse)
library(tidymodels)
library(repr)
library(RColorBrewer)
library(ggplot2)
#setting seed 
set.seed(26)

In [52]:
# Importing the data sets  

#Data set 1 - players (A list of all unique players, including data about each player)
players <- read_csv("data/players.csv")
#Data set 2 - sessions (A list of individual play sessions by each player, including data about the session.)
session <- read_csv("data/sessions.csv")
#players
#session

[1mRows: [22m[34m196[39m [1mColumns: [22m[34m7[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (4): experience, hashedEmail, name, gender
[32mdbl[39m (2): played_hours, Age
[33mlgl[39m (1): subscribe

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.
[1mRows: [22m[34m1535[39m [1mColumns: [22m[34m5[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (3): hashedEmail, start_time, end_time
[32mdbl[39m (2): original_start_time, original_end_time

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [42]:
# Finding the age range of the players
#players_age_analysis <- players |>
#                        summarize (min_player_age = min(Age, na.rm = TRUE), 
#                                   max_player_age = max(Age, na.rm = TRUE))
# players_age_analysis

## age range is from 8-50


In [46]:
# combining the data 
combined_data <- merge(players, session, by = "hashedEmail") 

combined_data

hashedEmail,experience,subscribe,played_hours,name,gender,Age,start_time,end_time,original_start_time,original_end_time
<chr>,<chr>,<lgl>,<dbl>,<chr>,<chr>,<dbl>,<chr>,<chr>,<dbl>,<dbl>
0088b5e134c3f0498a18c7ea6b8d77b4b0ff1636fc93355ccc95b45423367832,Regular,TRUE,1.5,Isaac,Male,20,23/05/2024 00:22,23/05/2024 01:07,1.71642e+12,1.71643e+12
0088b5e134c3f0498a18c7ea6b8d77b4b0ff1636fc93355ccc95b45423367832,Regular,TRUE,1.5,Isaac,Male,20,22/05/2024 23:12,23/05/2024 00:13,1.71642e+12,1.71642e+12
060aca80f8cfbf1c91553a72f4d5ec8034764b05ab59fe8e1cf0eee9a7b67967,Pro,FALSE,0.4,Lyra,Male,21,28/06/2024 04:28,28/06/2024 04:58,1.71955e+12,1.71955e+12
0ce7bfa910d47fc91f21a7b3acd8f33bde6db57912ce0290fa0437ce0b97f387,Beginner,TRUE,0.1,Osiris,Male,17,19/09/2024 21:01,19/09/2024 21:12,1.72678e+12,1.72678e+12
0d4d71be33e2bc7266ee4983002bd930f69d304288a8663529c875f40f1750f3,Regular,TRUE,5.6,Winslow,Male,17,30/08/2024 03:40,30/08/2024 04:04,1.72499e+12,1.72499e+12
0d4d71be33e2bc7266ee4983002bd930f69d304288a8663529c875f40f1750f3,Regular,TRUE,5.6,Winslow,Male,17,27/08/2024 19:18,27/08/2024 19:52,1.72479e+12,1.72479e+12
0d4d71be33e2bc7266ee4983002bd930f69d304288a8663529c875f40f1750f3,Regular,TRUE,5.6,Winslow,Male,17,30/08/2024 17:49,30/08/2024 18:48,1.72504e+12,1.72504e+12
0d4d71be33e2bc7266ee4983002bd930f69d304288a8663529c875f40f1750f3,Regular,TRUE,5.6,Winslow,Male,17,31/08/2024 22:44,31/08/2024 23:20,1.72514e+12,1.72515e+12
0d4d71be33e2bc7266ee4983002bd930f69d304288a8663529c875f40f1750f3,Regular,TRUE,5.6,Winslow,Male,17,24/08/2024 03:15,24/08/2024 03:48,1.72447e+12,1.72447e+12
0d4d71be33e2bc7266ee4983002bd930f69d304288a8663529c875f40f1750f3,Regular,TRUE,5.6,Winslow,Male,17,31/08/2024 03:14,31/08/2024 03:59,1.72507e+12,1.72508e+12


In [18]:
#  visulaizing relationships
#visual_data <- combined_data |>
#        select(experience, gender, Age)
#
#visuals <- visual_data |> 
#        ggpairs(aes(alpha = 0.05)) +
#        theme(text = element_text(size = 20)) 
#visuals
        

## Cleaning Data

In [51]:
# converting start and end time to days of the week 
combined_data_weekdays <- combined_data |>
mutate( start_time = dmy_hm(start_time),
    end_time = dmy_hm(end_time),
    start_day_of_week = wday(start_time, label = TRUE, abbr = FALSE), 
    end_day_of_week = wday(end_time, label = TRUE, abbr = FALSE))
#combined_data_weekdays

# Now we have start time and end time in terms of weekdays ! 
# source - https://lubridate.tidyverse.org/reference/day.html

In [62]:
# predict day of the week using age, gender and experience
polished_data <- combined_data_weekdays |>
        mutate(gender = as_factor(gender), experience = as_factor(experience)) |>
        select(gender, experience, Age, start_day_of_week) 

#polished_data
nrow(polished_data)
#distinct(polished_data)

# perfect, now we can start splitting the data set

In [65]:
#Splitting the data 
data_split <- initial_split(polished_data, prop = 0.7, strata = start_day_of_week)
training_set <- training(data_split)
testing_set <- testing(data_split)
#training_set 
#testing_set


# Taking out the NAs
training_set <- training_set |>
  drop_na(start_day_of_week, Age, gender, experience)
testing_set <- training_set |>
  drop_na(start_day_of_week, Age, gender, experience)

In [70]:
# training our data set using KNN engine model 
# finding the best k 


# setting the recipe
knn_recipe <- recipe(start_day_of_week ~ Age + gender + experience, data = training_set)

#buiding the model 
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# setting up for cross validation
knn_vfold <- vfold_cv(training_set, v = 2, strata = start_day_of_week)

k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))  # Try neighbors from 2 to 10

knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = knn_vfold, grid = k_vals) |>
  collect_metrics()

knn_accuracy <- knn_results %>%
  filter(.metric == "accuracy")

#knn_accuracy
cross_val_plot <- knn_accuracy |>
  ggplot(aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Number of Neighbors (k)", y = "Cross-validated Accuracy") +
  theme(text = element_text(size = 14))
#cross_val_plot

best_k <- knn_accuracy |>
  arrange(desc(mean)) |>
  slice(1) |>
  pull(neighbors)
best_k

In [72]:
# The best k = 36 !!!!
final_knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

final_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(final_knn_spec)|>
  fit(data = training_set)

# Predict test set
test_predictions <- predict(final_fit, testing_set) |>
  bind_cols(testing_set)

# Confusion matrix & accuracy
conf_mat <- conf_mat(test_predictions, truth = start_day_of_week, estimate = .pred_class)
accuracy <- accuracy(test_predictions, truth = start_day_of_week, estimate = .pred_class)

print(conf_mat)
print(accuracy)

           Truth
Prediction  Sunday Monday Tuesday Wednesday Thursday Friday Saturday
  Sunday         0      0       0         0        0      0        0
  Monday        60     69      59        60       53     38       55
  Tuesday        7      9       5         1        4      5        8
  Wednesday      6     15      11        18       10     12       10
  Thursday       7      5       7         8       14     18       10
  Friday        94     46      60        58       75     53       99
  Saturday       0      0       0         0        0      0        0
[90m# A tibble: 1 × 3[39m
  .metric  .estimator .estimate
  [3m[90m<chr>[39m[23m    [3m[90m<chr>[39m[23m          [3m[90m<dbl>[39m[23m
[90m1[39m accuracy multiclass     0.149


This means your model correctly predicted the day of the week only 15% of the time 
KNN is not the right tool for this ?????

References - 
1. IBM. (2021, September 21). Data science: Transforming the future with artificial intelligence. IBM. Retrieved June 20, 2025, from https://www.ibm.com/think/topics/data-science
2. Whitehead, R. (2024, May 23). Role of data science in the gaming industry. I.O.A. Global. Retrieved June 20, 2025, from https://ioaglobal.org/blog/role-of-data-science-in-gaming-industry/
3. Tidyverse. (2024, December 8). Get/set days component of a date-time. lubridate. https://lubridate.tidyverse.org/reference/day.html