# Individual Planning Report

### Introduction: A question that we will be exploring

A broad question: “We are interested in demand forecasting, namely, what time windows are most likely to have large number of simultaneous players. This is because we need to ensure that the number of licenses on hand is sufficiently large to accommodate all parallel players with high probability.” 

In other words, the question is whether the server can handle a large number of players at once.

**My formulated question: Which time windows during the day and which days of the week are most likely to have a large number of engaged players?**

In Minecraft, actively playing the game has a significantly greater impact on server traffic than simply connecting to the server. I therefore want to focus on engaged players, those who play long enough to have a noticeable effect on server traffic, in order to understand how much capacity the server needs.

In this project, an engaged player is defined as one who plays for more than **1 hour per session**.

To answer this question, I will use the provided sessions.csv data. Using the start and end time of each session, I will keep only the play sessions that last more than 60 minutes, then use regression to predict which times of day and days of the week have the highest expected number of engaged players.
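The filtering step described above can be sketched as follows. This is only a minimal illustration on made-up rows, not the final analysis: `toy_sessions` is hypothetical, and the day/month/year hour:minute format passed to `dmy_hm()` is an assumption about how `start_time` and `end_time` are stored.

```r
library(tidyverse)
library(lubridate)

# Hypothetical toy rows standing in for sessions.csv (not real data)
toy_sessions <- tibble(
  hashedEmail = c("a", "b", "c", "d"),
  start_time  = c("30/06/2024 18:12", "30/06/2024 23:30", "01/07/2024 09:00", NA),
  end_time    = c("30/06/2024 18:24", "01/07/2024 01:15", "01/07/2024 10:30", NA)
)

engaged_sessions <- toy_sessions |>
  mutate(
    start = dmy_hm(start_time),  # assumed format: day/month/year hour:minute
    end   = dmy_hm(end_time),
    duration_min = as.numeric(difftime(end, start, units = "mins"))
  ) |>
  # keep only sessions longer than 60 minutes, dropping rows with missing times
  filter(!is.na(duration_min), duration_min > 60) |>
  mutate(
    start_hour = hour(start),
    weekday    = wday(start, label = TRUE)
  )

# Number of engaged sessions starting in each weekday/hour combination
engaged_counts <- engaged_sessions |>
  count(weekday, start_hour, name = "n_engaged")
engaged_counts
```

Counting sessions by their start hour is a simplification: a session that spans several hours is only counted once here, so a count of truly simultaneous players would need each session expanded across the hours it covers.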

### Data Description

In [None]:
# We will use the following packages in our project
library(tidyverse)

In [None]:
# First, let's load our data!
url <- "https://raw.githubusercontent.com/danchoi0320-ui/dsci100-group20-DanChoi-Individual-Planning-Report/refs/heads/main/sessions.csv"
session_data <- read_csv(url)

# Now let's view our data
session_data

This dataset contains 1535 observations and 5 variables.

It has 3 character variables and 2 dbl (numeric) variables:

Character:
- **hashedEmail**: anonymized unique player ID  
- **start_time**: time and date when the session started  
- **end_time**: time and date when the session ended

Numeric:
- **original_start_time**: numeric UNIX timestamp of the start time (values near 1.7e+12 suggest milliseconds since the epoch)  
- **original_end_time**: numeric UNIX timestamp of the end time (same units)

Examining the data reveals at least two NA values. With 1535 observations, these should not significantly affect our results, but sessions with missing times cannot be assigned a duration and may introduce a slight error in our prediction.
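One way to check where the missing values sit is to count NAs per column. The sketch below uses a hypothetical `toy_sessions` tibble with the same column names as the dataset; the values are made up for illustration, and dropping incomplete rows with `drop_na()` is one possible choice rather than a decision already made for this project.

```r
library(tidyverse)

# Hypothetical toy data with the same column names as sessions.csv (values are made up)
toy_sessions <- tibble(
  hashedEmail         = c("a", "b", "c"),
  start_time          = c("30/06/2024 18:12", "30/06/2024 23:30", "01/07/2024 09:00"),
  end_time            = c("30/06/2024 18:24", NA, "01/07/2024 10:30"),
  original_start_time = c(1.71977e12, 1.71979e12, 1.71982e12),
  original_end_time   = c(1.71977e12, NA, 1.71983e12)
)

# Count the missing values in every column
na_counts <- toy_sessions |>
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts

# Keep only sessions whose end time is known
complete_sessions <- toy_sessions |>
  drop_na(end_time, original_end_time)
complete_sessions
```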

Additionally, since we cannot see what each player is doing in-game, we cannot accurately estimate how much traffic each player generates on the server.

The data were collected from the PLAICraft platform, where participants accessed a browser-based Minecraft server and played in a shared open-world environment. Each record corresponds to a player’s session, including the automatically recorded start and end timestamps.

#### Summary Statistics

We have 2 numeric variables, so let's calculate summary statistics for them!

In [None]:
summary_stat <- session_data |>
  summarise(
    min_start = round(min(original_start_time, na.rm = TRUE), 2),
    mean_start = round(mean(original_start_time, na.rm = TRUE), 2),
    max_start = round(max(original_start_time, na.rm = TRUE), 2),
    min_end   = round(min(original_end_time, na.rm = TRUE), 2),
    mean_end  = round(mean(original_end_time, na.rm = TRUE), 2),
    max_end   = round(max(original_end_time, na.rm = TRUE), 2)
  )
summary_stat

The resulting summary statistics are:

For original_start_time:
- **min:** 1.71e+12
- **mean:** 1.72e+12
- **max:** 1.73e+12

For original_end_time:
- **min:** 1.71e+12
- **mean:** 1.72e+12
- **max:** 1.73e+12
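Raw timestamps of this size are hard to interpret, and the minimum, mean, and maximum look almost identical at this precision. Assuming the values are milliseconds since the UNIX epoch (an assumption based on their magnitude), they can be converted to readable date-times. The sketch below uses the three rounded values from the table above as illustrative input, not the exact values in the dataset.

```r
library(tidyverse)
library(lubridate)

# Illustrative rounded values only, not exact timestamps from the dataset
toy_timestamps <- tibble(original_start_time = c(1.71e12, 1.72e12, 1.73e12))

readable <- toy_timestamps |>
  # divide by 1000 to go from (assumed) milliseconds to seconds
  mutate(start_utc = as_datetime(original_start_time / 1000))
readable
```

Under this assumption all three values fall in 2024, which gives the summary statistics a more meaningful reading: they describe the range of dates over which sessions were recorded.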

### Exploratory Data Analysis and Visualization

In [None]:
# Let's load the dataset again, for demonstration
url <- "https://raw.githubusercontent.com/danchoi0320-ui/dsci100-group20-DanChoi-Individual-Planning-Report/refs/heads/main/sessions.csv"
session_data <- read_csv(url)
session_data

Mean value of original_start_time: 1.72e+12

Mean value of original_end_time: 1.72e+12

This dataset is already tidy, since it has:
- One variable per column
- One observation per row
- One value per cell