In [None]:
library(tidyverse)
library(tidymodels)
options(repr.matrix.max.rows = 6)

In [None]:
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

In [None]:
glimpse(players)

## Players Data

| Attribute | Detail |
| :--- | :--- |
| **Number of rows** | 196 observations |
| **Number of columns** | 7 variables |
| **Data Collection** | from the Minecraft server |

---

### Variable Names, Types, and Meanings

| Variable Name | Type | Meaning |
| :--- | :--- | :--- |
| `experience` | chr | The player's self-reported level of experience in Minecraft (Pro, Veteran, Amateur, Regular) |
| `subscribe` | lgl | Whether the player is subscribed to the newsletter (`TRUE` or `FALSE`) |
| `hashedEmail` | chr | The player's email address, hashed for privacy |
| `played_hours` | dbl | How many hours the player has played on the server |
| `name` | chr | The player's name |
| `gender` | chr | The player's self-reported gender |
| `age` | dbl | The player's age |

---

### Missing Values or Inconsistencies You See

* The last observation is a player named Ahmed whose gender is recorded as Other and whose age is missing (`NA`). This row adds little to an age-based analysis and could be removed so that it is not used.
* Devin also has a missing age (`NA`) but did report a gender. This player can be kept for gender comparisons and excluded from any analysis that uses age.
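
One way to check these observations systematically is to count the missing values in every column. The sketch below runs on a small invented tibble (`toy_players` and the helper name `count_missing()` are hypothetical, not part of the original analysis); the same helper could be applied to `players`.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for a few rows of `players` (all values invented).
toy_players <- tibble(
  name   = c("A", "B", "C"),
  gender = c("Male", "Other", NA),
  age    = c(21, NA, NA)
)

# Count the NA values in every column; returns a one-row tibble.
count_missing <- function(df) {
  df |>
    summarise(across(everything(), ~ sum(is.na(.x))))
}

missing_counts <- count_missing(toy_players)
missing_counts

# Keep only rows with a known age for age-based comparisons,
# while the full table stays available for gender comparisons.
age_subset <- toy_players |>
  filter(!is.na(age))
```

Filtering at the point of use, rather than dropping rows up front, keeps players with a missing age available for analyses that do not need it.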

---

### Potential Issues Not Visible (e.g., bias, incomplete data)


In [None]:
glimpse(sessions)

## Session Data

| Attribute | Detail |
| :--- | :--- |
| **Number of rows** | 1535 observations |
| **Number of columns** | 5 variables |
| **Data Collection** | from the Minecraft server |

---

### Variable Names, Types, and Meanings

| Variable Name | Type | Meaning |
| :--- | :--- | :--- |
| `hashedEmail` | chr | The hashed email of the player who played this session |
| `start_time` | chr | The date and time when the player started playing this session |
| `end_time` | chr | The date and time when the player stopped playing this session |
| `original_start_time` | dbl | The numeric timestamp corresponding to the session's start date and time |
| `original_end_time` | dbl | The numeric timestamp corresponding to the session's end date and time |

---

### Missing Values or Inconsistencies You See

* No missing values or inconsistencies are apparent. The sessions cover a range of times throughout the day.

---

### Potential Issues Not Visible (e.g., bias, incomplete data)

* `start_time` and `end_time` are stored as character strings. Converting them to a date-time type would make them easier to manipulate and display.
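
A minimal sketch of that conversion, using `dmy_hm()` from lubridate. The sample rows are invented, and the `dd/mm/yyyy HH:MM` layout and the UTC time zone are both assumptions that should be checked against the real `sessions` file before reusing this.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Hypothetical sample rows; the "dd/mm/yyyy HH:MM" layout is an assumption.
toy_sessions <- tibble(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time   = c("30/06/2024 18:24", "17/06/2024 23:46")
)

# Parse both character columns into POSIXct date-times (UTC assumed).
toy_sessions_parsed <- toy_sessions |>
  mutate(across(c(start_time, end_time), ~ dmy_hm(.x, tz = "UTC")))

# Session length in minutes, now that the columns support arithmetic.
session_minutes <- as.numeric(
  difftime(toy_sessions_parsed$end_time,
           toy_sessions_parsed$start_time,
           units = "mins")
)
```

If the real strings use a different layout, another lubridate parser (for example `ymd_hms()`) would be the one to reach for.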


## Demand Forecasting: Predicting Active Players

### Broad Goal

The broad goal is **demand forecasting**: identifying which time windows are most likely to have a large number of simultaneous players. This matters because the number of licenses on hand must be large enough to accommodate all concurrent players with high probability. **Accurate demand forecasting would reduce player frustration caused by capacity issues and help control spending on server resources and licensing.**

---

### Specific Question for Analysis

**Can day of week and time of day predict the number of currently active players?**

| Component | Variable | Details |
| :--- | :--- | :--- |
| **Response Variable (Y)** | **number of active players** | The metric representing server load at any point in time. |
| **Explanatory Variables (X)** | **day of week** and **time of day** | The temporal factors hypothesized to drive player activity. |
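
The response variable can be made precise as follows. This definition is an assumption of this write-up rather than something stated with the data: a player counts as active at time $t$ if one of their sessions is in progress at $t$. Writing $p_i$ for the player of session $i$, and $s_i$ and $e_i$ for its start and end times, the number of active players is

$$
N(t) = \left|\{\, p_i : s_i \le t < e_i \,\}\right|
$$

That is, $N(t)$ counts the distinct players with at least one session covering $t$, so a player with overlapping sessions is counted once.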

---

### Why This Question is Important (Business Rationale)

* If we know when the most players are usually online, we know when the server is busiest. We can prepare for these times and buy enough licenses to accommodate these players. **This directly affects player satisfaction and helps provide a smooth experience during peak gaming periods.**
* If we know when the server will be under its heaviest load, we can prepare it in advance. When load is lowest, we can scale the server down to save on costs. **This dynamic scaling approach improves cost efficiency by running at maximum capacity only when necessary, which reduces operational expenses.**
* Knowing how many players are online at peak and off-peak times also gives the maximum and minimum number of licenses needed to serve everyone. **These maximum and minimum capacity requirements are essential for budgeting and future infrastructure planning.**

### How I Will Apply What We Learned to Answer This
* Wrangle the data first:
    * Add two new columns to the session data, `day_of_week` and `hour_of_day`, using date-time functions such as `wday()` and `hour()`.
    * Store `day_of_week` as a categorical factor with one level per day of the week.
    * Store `hour_of_day` either as a number (0-23, as returned by `hour()`) or as categories (morning, afternoon, evening, night).
* After wrangling, group by these variables and calculate the average number of active players per group.
* Visualizations that can be created:
    * Average players by day of week: `geom_col()` with `x = day_of_week`, `y = avg_players`.
    * Average players by hour of day: `geom_line()` and `geom_point()` with `x = hour_of_day`, `y = avg_players`.
* A model that can be used: linear regression with `active_players ~ day_of_week + hour_of_day`.
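
The plan above could be sketched roughly as follows. Everything here runs on invented sessions (`toy_sessions` is hypothetical), and two choices are assumptions of this sketch rather than settled decisions: active players are counted per one-hour slot, and a session counts toward every slot it touches. The data set does not include a ready-made `active_players` column, so it has to be derived from the sessions first.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(lubridate)
library(ggplot2)

# Hypothetical sessions, already parsed to date-times (all values invented).
toy_sessions <- tibble(
  hashedEmail = c("a", "b", "a", "c"),
  start_time  = ymd_hm(c("2024-06-03 18:10", "2024-06-03 18:40",
                         "2024-06-04 09:05", "2024-06-04 09:30")),
  end_time    = ymd_hm(c("2024-06-03 19:20", "2024-06-03 18:55",
                         "2024-06-04 09:50", "2024-06-04 10:15"))
)

# 1. Expand each session into the hourly slots it touches.
session_hours <- toy_sessions |>
  rowwise() |>
  mutate(slot = list(seq(floor_date(start_time, "hour"),
                         floor_date(end_time, "hour"),
                         by = "hour"))) |>
  ungroup() |>
  unnest(slot)

# 2. Count distinct players per slot, filling empty hours with zero
#    so that quiet periods are not silently dropped from the averages.
active_by_slot <- session_hours |>
  group_by(slot) |>
  summarise(active_players = n_distinct(hashedEmail), .groups = "drop") |>
  complete(slot = seq(min(slot), max(slot), by = "hour"),
           fill = list(active_players = 0L))

# 3. Derive the two explanatory variables.
#    wday(label = TRUE) gives an ordered factor; it is converted to an
#    unordered one so the model uses ordinary dummy coding.
active_by_slot <- active_by_slot |>
  mutate(
    day_of_week = factor(wday(slot, label = TRUE), ordered = FALSE),
    hour_of_day = hour(slot)
  )

# 4. Average active players per hour of day, and a line plot of it.
avg_by_hour <- active_by_slot |>
  group_by(hour_of_day) |>
  summarise(avg_players = mean(active_players), .groups = "drop")

hour_plot <- ggplot(avg_by_hour, aes(x = hour_of_day, y = avg_players)) +
  geom_line() +
  geom_point() +
  labs(x = "Hour of day", y = "Average active players")

# 5. Linear regression with the formula proposed above (base R lm()).
model <- lm(active_players ~ day_of_week + hour_of_day, data = active_by_slot)
```

Treating `hour_of_day` as a number forces a straight-line relationship across the day, which may fit poorly if activity peaks in the evening. Converting it to a factor or to the morning/afternoon/evening/night categories is one alternative worth comparing.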