# DSCI 100 Project Report
## Predicting time windows with high-demand usage for efficient allocation of licenses
### Introduction
Over the past few decades, digital technology has advanced more rapidly than any other human innovation and has reached the point where our society is almost completely dependent on it. The resulting growth in the volume of data has made data science one of the fastest-growing fields across industries (IBM, 2021). <br>

One field where the importance of data science has grown sharply is the gaming industry. With data, developers can identify player patterns and preferences and use them to improve the gaming experience (Whitehead, 2024). Data science also helps gaming companies develop "effective monetisation strategies" (Whitehead, 2024) by examining players' spending patterns and forecasting their behaviour. In this project, data science methods are used to explore a dataset collected from a Minecraft server, with the aim of predicting the time windows in which player activity is high. These predictions would allow better allocation of server licenses. The data, collected by a research group in Computer Science led by Frank Wood at the UBC Point Grey Campus, are used to answer the question: Which day of the week is a player most likely to log on, based on their age, gender and experience? <br>
 

This predictive analysis will help companies understand player usage patterns and decide how to allocate licenses across different time windows throughout the week. To answer the question, two datasets were used: players.csv and sessions.csv. These datasets were merged on a common identifier, hashedEmail, so that player characteristics can be viewed alongside session timings. <br><br>
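
The merge can be illustrated on a small scale. The sketch below uses made-up stand-ins for the two files (the `*_toy` names and all values are hypothetical) and assumes an inner join, which is what `merge()` does with its default arguments. It also shows how to list players who have no recorded session, since those players do not appear in the merged data.

```r
library(dplyr)

# Toy stand-ins for players.csv and sessions.csv (all values are made up)
players_toy <- tibble(
  hashedEmail = c("a1", "b2", "c3"),
  Age = c(17, 21, 19)
)
sessions_toy <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time = c("06/04/2024 09:27", "07/04/2024 18:05", "10/05/2024 21:40")
)

# Inner join: one row per session, with the player's details attached
combined_toy <- inner_join(players_toy, sessions_toy, by = "hashedEmail")

# Players with no recorded session are dropped by the inner join
players_without_sessions <- anti_join(players_toy, sessions_toy, by = "hashedEmail")
```

Because the merged data has one row per session, players with many sessions contribute many rows, which is worth keeping in mind when interpreting the model.
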
##### Dataset 1: Player Details (players.csv) <br>
This dataset contains information about the players registered with the Minecraft server. Each row represents observations about a unique player.<br>
Number of observations - 196<br>
Number of variables - 7<br><br>
##### Dataset 2: Session Details (sessions.csv) <br>
This dataset contains information about the players' sessions, including the start and end timestamps of each session.<br>
Number of observations - 1534<br>
Number of variables - 5<br><br>
##### Variables <br>
1. hashedEmail (chr): A unique, anonymized identifier for each player, derived by hashing the email address to protect privacy.<br>
2. name (chr): The name of the player registered with the server.<br>
3. gender (chr): The player's self-identified gender (Male, Female, Non-binary, Two-Spirited, Prefer not to say, Agender, Other).<br>
4. experience (chr): The player's experience level (Beginner, Amateur, Regular, Pro, Veteran).<br>
5. Age (dbl): Age of the player in years (range 8 to 50).<br>
6. subscribe (lgl): Whether the player has subscribed to the newsletter (TRUE/FALSE).<br>
7. played_hours (dbl): Total hours played by the player (range 0 to 225).<br>
8. start_time (chr): Timestamp for the start of each session (Day/Month/Year Hour:Minute).<br>
9. end_time (chr): Timestamp for the end of each session (Day/Month/Year Hour:Minute).<br>
10. original_start_time (dbl): Session start timestamp in milliseconds.<br>
11. original_end_time (dbl): Session end timestamp in milliseconds.<br>
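
The millisecond columns can be converted back into date-times as a sanity check against `start_time` and `end_time`. This is a minimal sketch, assuming the values are Unix epoch milliseconds and that UTC is the right time zone (the time zone is not documented, so this is an assumption). The value below is made up, not taken from sessions.csv.

```r
library(lubridate)

# Hypothetical epoch timestamp in milliseconds (not taken from sessions.csv)
original_start_ms <- 1.7124e12

# as_datetime() expects seconds since 1970-01-01, so divide by 1000
start_utc <- as_datetime(original_start_ms / 1000, tz = "UTC")
start_utc
```
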

Data summary characteristics:
- Most of the players reported their gender as male.
- The mean age of players is around 20.52 years, with most players being around 19 years of age.
- Amateur is the most common experience level (63 players), while only 14 players are at the Pro experience level.
  

### Methods & Results

##### Loading the Data and setting seed value 

In [None]:
### Loading the Data and setting seed value 
# importing libraries
library(tidyverse)
library(tidymodels)
library(repr)
library(RColorBrewer)
library(ggplot2)
#setting seed 
set.seed(26)

##### Import the datasets

In [None]:
# Importing the data sets  

#Data set 1 - players (A list of all unique players, including data about each player)
players <- read_csv("data/players.csv")
#Data set 2 - sessions (A list of individual play sessions by each player, including data about the session.)
session <- read_csv("data/sessions.csv")

# Merge datasets using hashedEmail
combined_data <- merge(players, session, by = "hashedEmail") 

##### Data Cleaning and Wrangling

In [None]:
# Methods:
# Missing data: Some players have missing information (e.g., age or gender). These rows were removed before model fitting.
# Time format: Timestamps were initially stored as strings and were converted to date-time objects.


### Data Cleaning and Wrangling
```r
# Convert timestamps to proper datetime and extract weekday
combined_data_weekdays <- combined_data |>
  mutate(
    start_time = dmy_hm(start_time),
    end_time = dmy_hm(end_time),
    start_day_of_week = wday(start_time, label = TRUE, abbr = FALSE),
    end_day_of_week = wday(end_time, label = TRUE, abbr = FALSE)
  )

# Select and format relevant columns
polished_data <- combined_data_weekdays |>
  mutate(
    gender = as_factor(gender),
    experience = as_factor(experience)
  ) |>
  select(gender, experience, Age, start_day_of_week)

# Remove missing values
polished_data <- polished_data |> drop_na()
```

### Exploratory Data Analysis
#### Summary Statistics
```r
# Summary of Age
players_summary <- players |>
  summarize(
    n_players = n(),
    n_missing_age = sum(is.na(Age)),
    min_age = min(Age, na.rm = TRUE),
    max_age = max(Age, na.rm = TRUE),
    mean_age = mean(Age, na.rm = TRUE),
    median_age = median(Age, na.rm = TRUE),
    sd_age = sd(Age, na.rm = TRUE)
  )

# Summary by Gender and Experience
gender_summary <- players |>
  group_by(gender) |>
  summarize(count = n())

experience_summary <- players |>
  group_by(experience) |>
  summarize(count = n())
```

#### Sessions by Weekday
```r
weekday_activity <- combined_data_weekdays |>
  count(start_day_of_week, sort = TRUE)

# Plot
ggplot(weekday_activity, aes(x = reorder(start_day_of_week, -n), y = n)) +
  geom_col(fill = "#3c8dbc") +
  labs(
    title = "Figure 1: Player Activity by Day of the Week",
    x = "Weekday",
    y = "Number of Sessions"
  ) +
  theme_minimal(base_size = 14)
```

In [None]:
## Finding variable ranges

# 1. Experience
#combined_data |> distinct(experience)
# Levels: Beginner, Amateur, Regular, Pro, Veteran

# 2. Age
# Finding the age range of the players
#players_age_analysis <- players |>
#  summarize(min_player_age = min(Age, na.rm = TRUE),
#            max_player_age = max(Age, na.rm = TRUE))
#players_age_analysis
# Age range is from 8 to 50

# 3. Gender
#combined_data |> distinct(gender)
# Levels: Male, Female, Non-binary, Two-Spirited, Prefer not to say, Agender, Other

# 4. Hours played
#players_hr_analysis <- combined_data |>
#  summarize(min_player_hr = min(played_hours, na.rm = TRUE),
#            max_player_hr = max(played_hours, na.rm = TRUE))
#players_hr_analysis
# Hours played range from 0 to 225

# Summary of player age, gender, experience
players_summary <- players |>
  summarize(
    n_players = n(),
    n_missing_age = sum(is.na(Age)),
    min_age = min(Age, na.rm = TRUE),
    max_age = max(Age, na.rm = TRUE),
    mean_age = mean(Age, na.rm = TRUE),
    median_age = median(Age, na.rm = TRUE),
    sd_age = sd(Age, na.rm = TRUE)
  )



# Categorical summaries
gender_counts <- players |> count(gender)
experience_counts <- players |> count(experience)
subscribe_counts <- players |> count(subscribe)

# View results
players_summary
gender_counts
experience_counts
subscribe_counts


## Cleaning Data

In [None]:
# Converting start and end times to date-times and extracting the day of the week
combined_data_weekdays <- combined_data |>
  mutate(
    start_time = dmy_hm(start_time),
    end_time = dmy_hm(end_time),
    start_day_of_week = wday(start_time, label = TRUE, abbr = FALSE),
    end_day_of_week = wday(end_time, label = TRUE, abbr = FALSE)
  )
#combined_data_weekdays

# The start and end times are now also available as weekdays
# source - https://lubridate.tidyverse.org/reference/day.html
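
To make the conversion step concrete, here is a minimal sketch on a single made-up timestamp (the value is hypothetical and not taken from the data). It parses a Day/Month/Year Hour:Minute string and extracts the weekday in both numeric and labelled form.

```r
library(lubridate)

# Made-up timestamp in the Day/Month/Year Hour:Minute format
example_start <- dmy_hm("06/04/2024 09:27")

# Numeric weekday: 1 = Sunday, ..., 7 = Saturday when the week starts on Sunday
example_wday <- wday(example_start, week_start = 7)
example_wday

# Full weekday name as an ordered factor (the label text depends on the locale)
wday(example_start, label = TRUE, abbr = FALSE)
```
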

In [None]:
weekday_activity <- combined_data_weekdays |>
  count(start_day_of_week, sort = TRUE)

weekday_activity

In [None]:
# predict day of the week using age, gender and experience
polished_data <- combined_data_weekdays |>
        mutate(gender = as_factor(gender), experience = as_factor(experience)) |>
        select(gender, experience, Age, start_day_of_week) 

#polished_data
nrow(polished_data)
#distinct(polished_data)

# perfect, now we can start splitting the data set

In [None]:
#Splitting the data 
data_split <- initial_split(polished_data, prop = 0.7, strata = start_day_of_week)
training_set <- training(data_split)
testing_set <- testing(data_split)
#training_set 
#testing_set


# Taking out the NAs
training_set <- training_set |>
  drop_na(start_day_of_week, Age, gender, experience)
testing_set <- testing_set |>
  drop_na(start_day_of_week, Age, gender, experience)
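
A quick way to check that a stratified split keeps the weekday proportions similar in the training set is sketched below. It runs on made-up, evenly balanced data (the `toy_*` objects are hypothetical), so the real proportions will differ.

```r
library(dplyr)
library(rsample)

set.seed(26)

# Toy data: 700 made-up sessions spread evenly over the seven weekdays
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")
toy_sessions <- tibble(
  start_day_of_week = factor(rep(days, each = 100), levels = days),
  Age = sample(8:50, 700, replace = TRUE)
)

# Same split settings as above: 70% training, stratified by weekday
toy_split <- initial_split(toy_sessions, prop = 0.7, strata = start_day_of_week)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Share of each weekday in the training set; should be close to 1/7 each here
train_props <- toy_train |>
  count(start_day_of_week) |>
  mutate(prop = n / sum(n))
train_props
```
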

In [None]:
# Tuning a K-nearest neighbours (KNN) classifier to find the best k

# Setting the recipe
knn_recipe <- recipe(start_day_of_week ~ Age + gender + experience, data = training_set)

# Building the model specification
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# Setting up 2-fold cross-validation
knn_vfold <- vfold_cv(training_set, v = 2, strata = start_day_of_week)

# Candidate values of k: 1, 6, 11, ..., 96
k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))

knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = knn_vfold, grid = k_vals) |>
  collect_metrics()

knn_accuracy <- knn_results |>
  filter(.metric == "accuracy")

#knn_accuracy
cross_val_plot <- knn_accuracy |>
  ggplot(aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Number of Neighbors (k)", y = "Cross-validated Accuracy") +
  theme(text = element_text(size = 14))
#cross_val_plot

best_k <- knn_accuracy |>
  arrange(desc(mean)) |>
  slice(1) |>
  pull(neighbors)
best_k

In [None]:
# Cross-validation selected k = 36 (stored in best_k)
final_knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

final_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(final_knn_spec)|>
  fit(data = training_set)

# Predict test set
test_predictions <- predict(final_fit, testing_set) |>
  bind_cols(testing_set)

# Confusion matrix and accuracy on the test set
# (named so they do not mask the yardstick functions conf_mat() and accuracy())
test_conf_mat <- conf_mat(test_predictions, truth = start_day_of_week, estimate = .pred_class)
test_accuracy <- accuracy(test_predictions, truth = start_day_of_week, estimate = .pred_class)

print(test_conf_mat)
print(test_accuracy)

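The model predicted the correct day of the week for only about 15% of the test sessions. This is close to the roughly 14% (1 in 7) expected from guessing uniformly among the seven days, which suggests that KNN with age, gender and experience as predictors is not well suited to this question.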
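
One way to put the 15% figure in context is to compare it with a baseline that always predicts the most frequent weekday in the training data. The sketch below uses made-up labels (both vectors are hypothetical, not the project's data) to show the calculation.

```r
# Hypothetical weekday labels standing in for the training and testing sets
train_days <- c("Saturday", "Saturday", "Sunday", "Monday", "Saturday",
                "Friday", "Sunday", "Tuesday", "Saturday", "Wednesday")
test_days <- c("Saturday", "Sunday", "Thursday", "Saturday", "Monday")

# The majority-class baseline always predicts the most common training label
majority_day <- names(which.max(table(train_days)))

# Baseline accuracy: share of test sessions that fall on that day
baseline_accuracy <- mean(test_days == majority_day)
baseline_accuracy
```

A classifier adds value only if its test accuracy clearly exceeds this baseline.
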

References - 
1. IBM. (2021, September 21). Data science: Transforming the future with artificial intelligence. IBM. Retrieved June 20, 2025, from https://www.ibm.com/think/topics/data-science
2. Whitehead, R. (2024, May 23). Role of data science in the gaming industry. I.O.A. Global. Retrieved June 20, 2025, from https://ioaglobal.org/blog/role-of-data-science-in-gaming-industry/
3. Tidyverse. (2024, December 8). Get/set days component of a date-time. lubridate. https://lubridate.tidyverse.org/reference/day.html

In [None]:
Rough Description 
Dataset 1: Player Characteristics and Behaviors

| Aspect | Details |
|---|---|
| Number of observations | 27 (players) |
| Number of variables | 7 |

| Variable | Type | Description |
|---|---|---|
| experience | Categorical (factor) | Player experience level (e.g., Pro, Veteran, Amateur, Regular, Beginner) |
| subscribe | Logical | Whether the player is subscribed to the game-related newsletter (TRUE/FALSE) |
| hashedEmail | Character | Unique hashed identifier for each player (anonymized email) |
| played_hours | Numeric | Total hours the player has played on the server |
| name | Character | Player name |
| gender | Categorical (factor) | Player gender (Male, Female, Non-binary) |
| Age | Numeric | Age of the player in years |

Summary Statistics (selected numeric variables):
Age ranges from 8 to 25 years.

Played hours vary from 0 to 48.4 hours, with many players having low or zero hours.

Notes & Potential Issues:
Small sample size (27 players).

Some players have zero playtime — may represent inactive accounts or new users.

Gender categories include Male, Female, and Non-binary — good inclusivity.

Age distribution is skewed towards younger players (mostly teens).

Data collection method: presumably logged from server and survey data (for demographics).

Player identities anonymized by hashing emails.

Dataset 2: Gameplay Session Logs

| Aspect | Details |
|---|---|
| Number of observations | 24 (gameplay sessions) |
| Number of variables | 5 |

| Variable | Type | Description |
|---|---|---|
| hashedEmail | Character | Hashed unique player identifier, matches Dataset 1 |
| start_time | Date-time string | Timestamp when session started (format: dd/mm/yyyy HH:MM) |
| end_time | Date-time string | Timestamp when session ended |
| original_start_time | Numeric | Unix epoch timestamp for session start (milliseconds since 1970-01-01) |
| original_end_time | Numeric | Unix epoch timestamp for session end |

Notes & Potential Issues:
Sessions have varying lengths, some very short (minutes).

Dates span from April to August 2024.

Timestamps are in local time (assumed), but time zones are not explicitly mentioned.

Data linkage: hashedEmail connects this to player info.

Session data allows calculation of session duration, day of week, time of day.

Only 24 sessions recorded here — possibly a subset of total gameplay.

Some players have multiple sessions (repeat rows with same hashedEmail).
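
The session-level features mentioned in the notes above (duration, time of day) could be derived along these lines. This is a minimal sketch on made-up sessions. The `toy_sessions` table and the column names `duration_min` and `start_hour` are hypothetical and are not part of the analysis in this report.

```r
library(dplyr)
library(lubridate)

# Made-up sessions in the dd/mm/yyyy HH:MM format described above
toy_sessions <- tibble(
  start_time = c("06/04/2024 09:27", "10/05/2024 21:40"),
  end_time = c("06/04/2024 09:45", "10/05/2024 23:10")
)

session_features <- toy_sessions |>
  mutate(
    start_time = dmy_hm(start_time),
    end_time = dmy_hm(end_time),
    # Session length in minutes
    duration_min = as.numeric(difftime(end_time, start_time, units = "mins")),
    # Hour of day (0-23) at which the session started
    start_hour = hour(start_time)
  )
session_features
```
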