# DSCI 100 Project Report
## Predicting time windows with high-demand usage for efficient allocation of licenses
### Introduction
Over the past few decades, digital technology has advanced more rapidly than any other human innovation and has reached the point where our society is almost completely dependent on it. The growing volume of data produced by this advancement has made data science one of the fastest-growing fields across every industry (IBM, 2021).

One field where the importance of data science has skyrocketed is the gaming industry. With data, developers can identify player patterns and preferences, enabling them to enhance the players' gaming experience (Whitehead, 2024). Data science also helps gaming companies develop "effective monetisation strategies" (Whitehead, 2024) by examining players' spending patterns and forecasting their behaviour. In this project, data science methods are used to explore a dataset extracted from a Minecraft server in an attempt to predict the time windows in which player activity is high. These predictions would allow better allocation of server licenses. The data, collected by a research group in Computer Science led by Frank Wood at UBC's Point Grey campus, will be used to answer the question: which day of the week is a player most likely to log on, based on their age, gender and experience?
 


#### Data Description

Rough Description 
##### Dataset 1: Player Characteristics and Behaviors

| Aspect | Details |
|---|---|
| Number of observations | 27 (players) |
| Number of variables | 7 |

| Variable | Type | Description |
|---|---|---|
| `experience` | Categorical (factor) | Player experience level (e.g., Pro, Veteran, Amateur, Regular, Beginner) |
| `subscribe` | Logical | Whether the player is subscribed to the game-related newsletter (TRUE/FALSE) |
| `hashedEmail` | Character | Unique hashed identifier for each player (anonymized email) |
| `played_hours` | Numeric | Total hours the player has played on the server |
| `name` | Character | Player name |
| `gender` | Categorical (factor) | Player gender (Male, Female, Non-binary) |
| `Age` | Numeric | Age of the player in years |

Summary statistics (selected numeric variables):
- `Age` ranges from 8 to 25 years.
- `played_hours` varies from 0 to 48.4 hours, with many players having low or zero hours.

Notes and potential issues:
- Small sample size (27 players).
- Some players have zero playtime; these may be inactive accounts or new users.
- Gender categories include Male, Female, and Non-binary, which is good for inclusivity.
- The age distribution is skewed towards younger players (mostly teens).
- Data collection method: presumably logged from the server, with survey data for demographics.
- Player identities are anonymized by hashing emails.

##### Dataset 2: Gameplay Session Logs

| Aspect | Details |
|---|---|
| Number of observations | 24 (gameplay sessions) |
| Number of variables | 5 |

| Variable | Type | Description |
|---|---|---|
| `hashedEmail` | Character | Hashed unique player identifier, matches Dataset 1 |
| `start_time` | Date-time string | Timestamp when the session started (format: dd/mm/yyyy HH:MM) |
| `end_time` | Date-time string | Timestamp when the session ended |
| `original_start_time` | Numeric | Unix epoch timestamp for session start (milliseconds since 1970-01-01) |
| `original_end_time` | Numeric | Unix epoch timestamp for session end |

Notes and potential issues:
- Sessions have varying lengths; some are very short (minutes).
- Dates span from April to August 2024.
- Timestamps are assumed to be in local time, but time zones are not explicitly mentioned.
- Data linkage: `hashedEmail` connects the sessions to the player information.
- Session data allows calculation of session duration, day of week, and time of day.
- Only 24 sessions are recorded here, possibly a subset of total gameplay.
- Some players have multiple sessions (repeated rows with the same `hashedEmail`).
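
As a rough illustration of the derived quantities mentioned above, the sketch below computes session duration and start hour from timestamp strings. The rows in `toy_sessions` are hypothetical values written in the dd/mm/yyyy HH:MM format described for `start_time`; they are not taken from the real data, and the column names `duration_mins` and `start_hour` are introduced here for illustration only.

```r
library(dplyr)
library(lubridate)

# Hypothetical sample rows (not real data) in dd/mm/yyyy HH:MM format
toy_sessions <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time  = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
  end_time    = c("30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 17:57")
)

toy_sessions_derived <- toy_sessions |>
  mutate(
    # Parse the strings into date-times (UTC is assumed because no time zone is recorded)
    start_time = dmy_hm(start_time),
    end_time = dmy_hm(end_time),
    # Session length in minutes
    duration_mins = as.numeric(difftime(end_time, start_time, units = "mins")),
    # Hour of the day (0-23) at which the session started
    start_hour = hour(start_time)
  )

toy_sessions_derived
```

Because the time zone is unknown, the hour of day should be read with caution; durations are unaffected as long as both timestamps share the same zone.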

In [None]:
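# Importing libraries
library(tidyverse)
library(lubridate)  # needed for dmy_hm() and wday(); not attached by older tidyverse versions
library(tidymodels)
library(repr)
library(RColorBrewer)
library(ggplot2)
# Setting the seed for reproducibility
set.seed(26)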

In [None]:
# Importing the data sets  

#Data set 1 - players (A list of all unique players, including data about each player)
players <- read_csv("data/players.csv")
#Data set 2 - sessions (A list of individual play sessions by each player, including data about the session.)
session <- read_csv("data/sessions.csv")
#players
#session

In [None]:
# Finding the age range of the players
players_age_analysis <- players |>
                        summarize (min_player_age = min(Age, na.rm = TRUE), 
                                   max_player_age = max(Age, na.rm = TRUE))
# players_age_analysis

## age range is from 8-50


In [None]:
# Combining the data: keep the full joined table for later steps and preview only the first rows
combined_data <- merge(players, session, by = "hashedEmail")
head(combined_data)
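
By default, `merge()` keeps only rows whose key appears in both tables, so players with no recorded sessions drop out of the combined data. The sketch below, on hypothetical toy tables (`toy_players` and `toy_sessions` are illustrative names and values, not the real data), shows the equivalent dplyr `inner_join()` and uses `anti_join()` to list the players who would be lost.

```r
library(dplyr)

# Hypothetical toy tables (not real data)
toy_players <- tibble(
  hashedEmail = c("a1", "b2", "c3"),
  Age = c(17, 21, 9)
)
toy_sessions <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34")
)

# Inner join: one row per session, with the player's attributes attached
toy_combined <- inner_join(toy_players, toy_sessions, by = "hashedEmail")

# Players that have no matching session and are therefore absent from the join
players_without_sessions <- anti_join(toy_players, toy_sessions, by = "hashedEmail")

toy_combined
players_without_sessions
```

Checking the `anti_join()` result before modelling gives a sense of how many players are excluded from the analysis.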

In [None]:
# Visualizing relationships (ggpairs() comes from the GGally package, which must be loaded first)
#visual_data <- combined_data |>
#        select(experience, gender, Age)
#
#visuals <- visual_data |> 
#        ggpairs(aes(alpha = 0.05)) +
#        theme(text = element_text(size = 20)) 
#visuals
        

## Cleaning Data

In [None]:
# Converting start and end times to days of the week
combined_data_weekdays <- combined_data |>
    mutate(start_time = dmy_hm(start_time),
           end_time = dmy_hm(end_time),
           start_day_of_week = wday(start_time, label = TRUE, abbr = FALSE),
           end_day_of_week = wday(end_time, label = TRUE, abbr = FALSE))
# Now start_time and end_time each have a matching weekday column
# Source: https://lubridate.tidyverse.org/reference/day.html
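
To make the weekday conversion concrete, here is a small worked example on hypothetical timestamps (the values in `toy_starts` are made up for illustration). It parses the strings, labels each with its weekday, and counts sessions per weekday, which is one simple way to see which day is busiest.

```r
library(dplyr)
library(lubridate)

# Hypothetical start times (not real data) in dd/mm/yyyy HH:MM format
toy_starts <- tibble(
  start_time = c("01/05/2024 14:30", "04/05/2024 09:00",
                 "05/05/2024 22:15", "11/05/2024 10:00")
)

weekday_counts <- toy_starts |>
  mutate(
    start_time = dmy_hm(start_time),
    # Full weekday name as an ordered factor (Sunday first by default)
    start_day_of_week = wday(start_time, label = TRUE, abbr = FALSE)
  ) |>
  # Number of sessions per weekday, most frequent first
  count(start_day_of_week, sort = TRUE)

weekday_counts
```

Without `label = TRUE`, `wday()` returns a number instead, with 1 meaning Sunday under the default week start.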

In [None]:
# Predict day of the week using age, gender and experience
polished_data <- combined_data_weekdays |>
        mutate(gender = as_factor(gender), experience = as_factor(experience)) |>
        select(gender, experience, Age, start_day_of_week, end_day_of_week)
#polished_data
# The data are now ready to be split

In [None]:
#Splitting the data 
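
One possible way to carry out this step is a stratified train/test split with `initial_split()` from tidymodels. This is only a sketch: `toy_polished` is a hypothetical stand-in for `polished_data`, and the 75% training proportion and the choice of `start_day_of_week` as the stratification variable are assumptions, not decisions taken from the report.

```r
library(tidymodels)

set.seed(26)

weekday_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                   "Thursday", "Friday", "Saturday")

# Hypothetical stand-in for polished_data: 10 made-up rows per weekday
toy_polished <- tibble(
  Age = rep(c(12, 15, 17, 19, 21, 23, 25, 14, 16, 18), times = 7),
  start_day_of_week = factor(rep(weekday_names, each = 10), levels = weekday_names)
)

# Stratify on the response so each weekday appears in both sets (assumed 75/25 split)
toy_split <- initial_split(toy_polished, prop = 0.75, strata = start_day_of_week)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

nrow(toy_train)
nrow(toy_test)
```

Setting the seed before the split keeps the partition reproducible across runs.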


### References
1. IBM. (2021, September 21). Data science: Transforming the future with artificial intelligence. IBM. Retrieved June 20, 2025, from https://www.ibm.com/think/topics/data-science
2. Whitehead, R. (2024, May 23). Role of data science in the gaming industry. I.O.A. Global. Retrieved June 20, 2025, from https://ioaglobal.org/blog/role-of-data-science-in-gaming-industry/
3. Tidyverse. (2024, December 8). Get/set days component of a date-time. lubridate. https://lubridate.tidyverse.org/reference/day.html