In [None]:
library(tidyverse)
library(repr)
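library(tidymodels)
library(lubridate)  # as_datetime(), floor_date(), hour(), wday(); only attached by tidyverse from 2.0.0 on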
options(repr.matrix.max.rows = 6)
source('cleanup.R')

In [None]:

sessions <- read_csv("sessions.csv")
players  <- read_csv("players.csv")

players_means_table <- players |>
  summarise(
    played_hours = round(mean(played_hours, na.rm = TRUE), 2),
    Age          = round(mean(Age,          na.rm = TRUE), 2)
  ) |>
  pivot_longer(
    cols      = c(played_hours, Age),
    names_to  = "Variable",
    values_to = "Mean_Value"
  )

print(players_means_table)


updated_sessions <- sessions |>
  rename(player_id = hashedEmail) |>
  # The original_* columns are described below as timestamps in milliseconds;
  # as_datetime() interprets numbers as seconds, so divide by 1000 first.
  mutate(
    start_time_updated = as_datetime(original_start_time / 1000),
    end_time_updated   = as_datetime(original_end_time / 1000)
  ) |>
  mutate(
    duration_minutes = as.numeric(difftime(end_time_updated,
                                           start_time_updated,
                                           units = "mins"))
  ) |>
  select(player_id, start_time_updated, end_time_updated, duration_minutes) |>
  mutate(player_id = as.character(player_id))

print(updated_sessions)

hourly_activity_df <- updated_sessions |>
  filter(duration_minutes > 0) |>
  mutate(date = floor_date(start_time_updated, "day"),
    hour = hour(start_time_updated),
    day_of_week = wday(start_time_updated, label = TRUE, week_start = 1) ) |>
  select(player_id, date, hour, day_of_week) |>
  distinct() |>  # one row per player per date-hour, so n() counts unique players
  group_by(date, hour, day_of_week) |>
  summarise(num_unique_players_starting = n(), .groups = "drop") |>
  mutate(day_of_week = as_factor(day_of_week))

print(hourly_activity_df)


# Plot 1: Average Players Starting by Hour
plot_avg_by_hour <- hourly_activity_df |>
  group_by(hour) |>
  summarise( avg_players = mean(num_unique_players_starting, na.rm = TRUE) ) |>
  ungroup() |>
  ggplot(aes(x = hour, y = avg_players)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Average Unique Players Starting Sessions by Hour of Day",
    x     = "Hour of Day (0–23)",
    y     = "Average Unique Players") +
  scale_x_continuous(breaks = seq(0, 23, by = 2)) +
  theme( plot.title = element_text(hjust = 0.5))

print(plot_avg_by_hour)


# Plot 2: Average Players Starting by Day of Week
plot_avg_by_day <- hourly_activity_df |>
  group_by(day_of_week) |>
  summarise(avg_players = mean(num_unique_players_starting, na.rm = TRUE)) |>
  ungroup() |>
  ggplot(aes(x = day_of_week, y = avg_players, group = 1)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Average Unique Players Starting Sessions by Day of Week",
    x     = "Day of Week",
    y     = "Average Unique Players") +
  theme( plot.title = element_text(hjust = 0.5) )

print(plot_avg_by_day)

1. Data Description
This project uses two datasets, players.csv and sessions.csv, with an emphasis on the sessions dataset.
The players dataset contains 196 observations and 7 variables, with each observation representing a unique player. The variables cover player demographics, experience, and behavioral metrics such as played_hours, and include both numerical and categorical types. One notable issue is that several players have zero played hours, which may reflect very short sessions or records collected before any gameplay occurred.
The sessions dataset contains 1,535 observations, each representing a single continuous gameplay session by a specific player. Key variables include player identifiers, session start and end times, and the corresponding timestamps in milliseconds. After converting the millisecond timestamps to proper datetime values, I compute new start-time and end-time variables, as well as the session duration in minutes. Potential issues include sessions with zero duration, missing or invalid timestamps, and players with very large or very small numbers of sessions.
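The timestamp conversion described above can be sketched on a single pair of values. This is a minimal illustration, assuming the timestamps count milliseconds since the Unix epoch; the two numbers below are made up for the example and are not taken from sessions.csv.

```r
library(lubridate)

# Hypothetical millisecond timestamps (not real rows from sessions.csv)
ms_start <- 1719770000000
ms_end   <- 1719770600000

# as_datetime() treats numeric input as seconds since 1970-01-01 UTC,
# so millisecond values must be divided by 1000 before conversion.
start_time <- as_datetime(ms_start / 1000)
end_time   <- as_datetime(ms_end / 1000)

# Session length in minutes
duration_minutes <- as.numeric(difftime(end_time, start_time, units = "mins"))

print(start_time)
print(duration_minutes)
```

Skipping the division would place the converted dates tens of thousands of years in the future, so checking the year of a few converted values is a quick sanity test.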
Together, these two datasets describe who is playing and on which days and at which times the server is busiest. Understanding these patterns is essential for resource planning and for anticipating server demand, so that the server can accommodate periods of heavy traffic.
2. Broad and Specific Questions
Broad Question:
From the three broad questions provided, I chose question 3: “We are interested in demand forecasting, namely, what time windows are most likely to have large number of simultaneous players.”
Specific Question:
For the specific question, I have chosen: “Can hour of day and day of week predict the number of unique players starting sessions in each hour?”
This question is supported by the sessions dataset, because each session includes a start time and a player ID. Counting the number of unique players who begin a session within each hour should provide information useful for forecasting demand.
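The hourly counting step can be sketched on a toy table. The data below are invented for illustration. The sketch also shows one optional design choice that is an assumption on my part rather than part of the analysis above: using tidyr's complete() to add explicit zero-count rows for hours in which nobody started a session, since a grouped count alone only contains hours with at least one start.

```r
library(tidyverse)

# Toy session starts (hypothetical data, not from sessions.csv)
toy_starts <- tibble(
  player_id = c("a", "b", "a", "c"),
  date      = as.Date(c("2024-06-01", "2024-06-01", "2024-06-01", "2024-06-02")),
  hour      = c(18L, 18L, 20L, 18L)
)

# Unique players starting a session in each date-hour
hourly_counts <- toy_starts |>
  group_by(date, hour) |>
  summarise(num_unique_players_starting = n_distinct(player_id), .groups = "drop")

# Optional: add rows for every hour 0-23 of each observed date, with a count of 0
hourly_counts_filled <- hourly_counts |>
  complete(date, hour = 0:23, fill = list(num_unique_players_starting = 0L))

print(hourly_counts)
print(hourly_counts_filled)
```

Whether to include the zero-count hours depends on the question being asked: averages over the filled table describe a typical hour, while averages over the unfilled table describe a typical hour in which at least one player started.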
3. Exploratory Analysis & Visualizations
I computed the means of the quantitative variables in the players dataset (played_hours and Age) and presented them in a summary table. I then constructed two visualizations:
Average unique players starting per hour, which shows the periods of busy traffic, typically in the late afternoon and evening. This insight helps frame the predictive modelling problem.
Average unique players starting per day of week, which shows weekly patterns, such as lower activity on weekdays and increased activity during weekends.
Together, these plots justify using day_of_week as an explanatory variable.
4. Methods and Plan
I plan to use a linear regression model to predict hourly unique player counts using hour, day_of_week, and other relevant variables. Linear regression is appropriate here because the response variable is numerical and the fitted model makes the relationships between the predictors and player activity easy to interpret. I will split the data into training and testing sets (80/20), evaluate performance on the test set using RMSE, and, if necessary, compare the result with alternative models such as k-nearest neighbors (KNN) regression.
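The plan above could look roughly like the following tidymodels sketch. It is only an outline under stated assumptions: toy_activity is synthetic data standing in for hourly_activity_df, the seed is arbitrary, and only hour and day_of_week are used as predictors.

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)  # arbitrary seed, for a reproducible split

# Synthetic stand-in for hourly_activity_df (hypothetical values)
toy_activity <- tibble(
  hour        = rep(0:23, times = 14),
  day_of_week = factor(rep(rep(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                               each = 24), times = 2)),
  num_unique_players_starting = rpois(24 * 14, lambda = 2) + 1
)

# 80/20 train/test split
activity_split <- initial_split(toy_activity, prop = 0.80)
activity_train <- training(activity_split)
activity_test  <- testing(activity_split)

# Linear regression fitted on the training set only
lm_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(num_unique_players_starting ~ hour + day_of_week, data = activity_train)

# RMSE on the held-out test set
lm_rmse <- predict(lm_fit, new_data = activity_test) |>
  bind_cols(activity_test) |>
  rmse(truth = num_unique_players_starting, estimate = .pred)

print(lm_rmse)
```

In this sketch hour enters as a single numeric term, so the model can only capture a straight-line trend across the day; treating hour as a factor is one alternative worth comparing if the hourly plot shows a peak rather than a steady trend.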

