# Data Description:

A research group at UBC has been collecting data on a Minecraft server to try to learn more about how people play games. They have provided some data for students to analyze. 
  
There were two datasets given:
- players.csv
- sessions.csv

players.csv lists every unique player who has played on the research group's Minecraft server.  
This table contains the following seven columns:
- `experience` - Self-reported level of previous Minecraft experience
- `subscribe` - True if the user is subscribed to a gaming-related newsletter
- `hashedEmail` - Hashed email address of the user
- `played_hours` - Hours played on the Minecraft server
- `name`, `gender`, `Age` - Basic demographic information  
  
There are 196 registered players in the dataset.

sessions.csv lists every individual play session.  
This table contains the following five columns:
- `hashedEmail` - Same hashed email used in players.csv
- `start_time` - Start date and time
- `end_time` - End date and time
- `original_start_time`, `original_end_time` - Apparently the raw timestamps from which `start_time` and `end_time` were derived
  
There are 1535 playing sessions in the dataset.
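
Since both tables share `hashedEmail`, they can be linked to see how many sessions each player has. The following is a minimal sketch on made-up toy data (`players_toy`, `sessions_toy`, and the email values are placeholders, not the real files); it assumes a player with no sessions should get a count of zero.

```r
library(dplyr)

# Toy stand-ins for players.csv and sessions.csv (all values are made up)
players_toy <- tibble(
    hashedEmail = c("aaa", "bbb", "ccc"),
    played_hours = c(1.5, 0, 12.0)
)
sessions_toy <- tibble(
    hashedEmail = c("aaa", "aaa", "ccc")
)

# Count sessions per player
sessions_per_player <- sessions_toy |>
    count(hashedEmail, name = "n_sessions")

# Attach the counts to the player table; players without sessions get 0
players_with_sessions <- players_toy |>
    left_join(sessions_per_player, by = "hashedEmail") |>
    mutate(n_sessions = coalesce(n_sessions, 0L))

players_with_sessions
```

A `left_join` keeps every row of the player table, so players who registered but never played are not silently dropped.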

There are still some potential issues with this dataset:
- Self-reported demographics and experience may not be accurate
- The restriction on age can remove some possible player data
- People may create multiple accounts
- End time may not be accurate if the user leaves the tab open
- The gaming newsletter may not be specific to what the user likes

## Summary statistics:  

To obtain the summary statistics, we must first load the R libraries and settings.

In [None]:
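library(tidyverse)   # core packages: readr, dplyr, ggplot2, forcats, ...
library(repr)        # notebook display options
library(tidymodels)
library(dplyr)
library(lubridate)   # date-time parsing
# Show at most 7 rows of each table and set the default plot size
options(repr.matrix.max.rows = 7)
options(repr.plot.height = 8, repr.plot.width = 10)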

Next, we read each dataset into a variable and convert the categorical columns of `players` to factors with more descriptive labels. In `sessions`, we also parse the start and end times from text into datetime values for later calculations.  
With these changes, it is easier to compute summary statistics for both tables.

In [None]:
# Read both datasets into tibbles
sessions <- read_csv("sessions.csv")
players <- read_csv("players.csv")

# Convert the categorical columns to factors and relabel subscribe as Yes/No
players <- players |>
    mutate(experience = as_factor(experience), subscribe = as_factor(subscribe), gender = as_factor(gender)) |>
    mutate(subscribe = fct_recode(subscribe, "Yes" = "TRUE", "No" = "FALSE"))

# Parse the day-month-year hour:minute strings into datetimes (UTC by default)
sessions <- sessions |>
    mutate(start_time = as_datetime(parse_date_time(start_time, orders = "dmy HM"))) |>
    mutate(end_time = as_datetime(parse_date_time(end_time, orders = "dmy HM")))
sessions
players

We can see from the above tables that there are 1535 recorded playing sessions and 196 registered users in this dataset.  
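
One example of a later calculation that the datetime conversion makes possible is session length. The sketch below uses two made-up sessions (`sessions_toy` is a placeholder, and the strings only assume the same day-month-year layout implied by `orders = "dmy HM"`), so it illustrates the idea rather than reproducing the actual analysis.

```r
library(dplyr)
library(lubridate)

# Two made-up sessions in day/month/year hour:minute form
sessions_toy <- tibble(
    start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
    end_time = c("30/06/2024 18:24", "18/06/2024 00:04")
)

# Parse the text into datetimes, then take the difference in minutes
session_lengths <- sessions_toy |>
    mutate(start_time = parse_date_time(start_time, orders = "dmy HM"),
           end_time = parse_date_time(end_time, orders = "dmy HM"),
           minutes = as.numeric(difftime(end_time, start_time, units = "mins")))

session_lengths
```

Because both columns are real datetimes, the second session is handled correctly even though it crosses midnight.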
  
Next, we focus on the categorical data. By using `group_by` and `summarize`, we can count the players at each experience level, in each subscription status, and of each gender. Using `arrange` to sort in descending order, we can see the most common value of each.

In [None]:
experience_count <- players |>
    group_by(experience) |>
    summarize(count = n()) |>
    arrange(-count)
subscribed_count <- players |>
    group_by(subscribe) |>
    summarize(count = n()) |>
    arrange(-count)
gender_count <- players |>
    group_by(gender) |>
    summarize(count = n()) |>
    arrange(-count)
experience_count
subscribed_count
gender_count

From these three tables, we can determine that:
- 63/196 ≈ 32% of all players are Amateur (described on the website as "Played a few hours of Minecraft")
- 144/196 ≈ 73% are subscribed to the newsletter
- 124/196 ≈ 63% of all players are Male  
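
The percentages above were worked out by hand from the counts. As a sketch of how the same shares could be computed directly, the example below uses `count` and `mutate` on a made-up column (`players_toy` and its four values are placeholders, not the real data).

```r
library(dplyr)

# Made-up subscription statuses for four players
players_toy <- tibble(
    subscribe = c("Yes", "Yes", "No", "Yes")
)

# Count each category (largest first) and convert the counts to percentages
subscribe_share <- players_toy |>
    count(subscribe, sort = TRUE) |>
    mutate(percent = round(100 * n / sum(n)))

subscribe_share
```

The same pattern would apply to `experience` and `gender` by swapping the column name.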
  
The next statistics we will look at are the numeric ones: playtime and age. By using `arrange` and `summarize`, we can find various measurements, including the maximum, minimum, mean, median, and standard deviation.

In [None]:
playtime <- players |>
    select(played_hours) |>
    arrange(played_hours)
max_playtime <- playtime |>
    arrange(-played_hours) |>
    slice(1) |>
    pull()
min_playtime <- playtime |>
    slice(1) |>
    pull()
mean_playtime <- playtime |>
    summarize(mean = mean(played_hours)) |>
    pull()
median_playtime <- playtime |>
    summarize(median = median(played_hours)) |>
    pull()
sd_playtime <- playtime |>
    summarize(sd = sd(played_hours)) |>
    pull()
max_playtime
min_playtime
round(mean_playtime, digits = 2)
round(median_playtime, digits = 2)
round(sd_playtime, digits = 2)

In [None]:
age <- players |>
    select(Age) |>
    arrange(Age)
max_age <- age |>
    arrange(-Age) |>
    slice(1) |>
    pull()
min_age <- age |>
    slice(1) |>
    pull()
mean_age <- age |>
    summarize(mean = mean(Age, na.rm = TRUE)) |>
    pull()
median_age <- age |>
    summarize(median = median(Age, na.rm = TRUE)) |>
    pull()
sd_age <- age |>
    summarize(sd = sd(Age, na.rm = TRUE)) |>
    pull()
max_age
min_age
round(mean_age, digits = 2)
round(median_age, digits = 2)
round(sd_age, digits = 2)

| Statistic | Hours of Playtime | Age (years) |
| ------ | ------ | ------ |
| Max    | 223.1  | 50 |
| Min    | 0      | 8 |
| Mean   | 5.85   | 20.52 |
| Median | 0.1    | 19 |
| SD     | 28.36  | 6.17 |
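
The five statistics for both columns can also be computed in a single call. The sketch below is one possible alternative using `across` inside `summarize`; it runs on made-up values (`players_toy` is a placeholder) and is not the code that produced the table above.

```r
library(dplyr)

# Made-up values, including a missing age
players_toy <- tibble(
    played_hours = c(0, 0.1, 2.4, 30),
    Age = c(17, 21, NA, 25)
)

# Apply the same five summary functions to both numeric columns,
# ignoring missing values; result columns are named <column>_<statistic>
summary_toy <- players_toy |>
    summarize(across(
        c(played_hours, Age),
        list(
            max = \(x) max(x, na.rm = TRUE),
            min = \(x) min(x, na.rm = TRUE),
            mean = \(x) mean(x, na.rm = TRUE),
            median = \(x) median(x, na.rm = TRUE),
            sd = \(x) sd(x, na.rm = TRUE)
        )
    ))

summary_toy
```

This keeps all results in one tibble, which avoids repeating the `summarize` and `pull` steps for each statistic.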

The final columns to look at are the start and end times. This includes the original times, which appear to be Unix timestamps (in milliseconds, judging by their magnitude) recorded before conversion into a standard datetime format. As no time zone is indicated, the code reports results in UTC.

In [None]:
min_start <- sessions |>
    summarize(time=min(start_time, na.rm=TRUE)) |>
    pull()
max_start <- sessions |>
    summarize(time=max(start_time, na.rm=TRUE)) |>
    pull()
mean_start <- sessions |>
    summarize(time=mean(start_time, na.rm=TRUE)) |>
    pull()
median_start <- sessions |>
    summarize(median = median(start_time, na.rm = TRUE)) |>
    pull()
min_start
max_start
mean_start
median_start
min_end <- sessions |>
    summarize(time=min(end_time, na.rm=TRUE)) |>
    pull()
max_end <- sessions |>
    summarize(time=max(end_time, na.rm=TRUE)) |>
    pull()
mean_end <- sessions |>
    summarize(time=mean(end_time, na.rm=TRUE)) |>
    pull()
median_end <- sessions |>
    summarize(median = median(end_time, na.rm = TRUE)) |>
    pull()
min_end
max_end
mean_end
median_end

In [None]:
min_original_start <- sessions |>
    summarize(time=min(original_start_time, na.rm=TRUE)) |>
    pull()
max_original_start <- sessions |>
    summarize(time=max(original_start_time, na.rm=TRUE)) |>
    pull()
mean_original_start <- sessions |>
    summarize(time=mean(original_start_time, na.rm=TRUE)) |>
    pull()
median_original_start <- sessions |>
    summarize(time=median(original_start_time, na.rm=TRUE)) |>
    pull()
min_original_start
max_original_start
mean_original_start
median_original_start
min_original_end <- sessions |>
    summarize(time=min(original_end_time, na.rm=TRUE)) |>
    pull()
max_original_end <- sessions |>
    summarize(time=max(original_end_time, na.rm=TRUE)) |>
    pull()
mean_original_end <- sessions |>
    summarize(time=mean(original_end_time, na.rm=TRUE)) |>
    pull()
median_original_end <- sessions |>
    summarize(time=median(original_end_time, na.rm=TRUE)) |>
    pull()
min_original_end
max_original_end
mean_original_end
median_original_end

| Statistic | Start Time     | End Time        | Original Start Time | Original End Time |
| --------- | -------------- | --------------- | ------------------- | ----------------- |
| Min       | Apr 6th, 9:27  | Apr 6th, 9:31   | 1.7124e+12          | 1.7124e+12        |
| Max       | Sep 26th, 6:09 | Sep 26th, 7:39  | 1.72733e+12         | 1.72734e+12       |
| Mean      | Jun 24th, 3:54 | Jun 24th, 2:26  | 1.71920e+12         | 1.71919e+12       |
| Median    | Jun 24th, 2:51 | Jun 23rd, 22:04 | 1.7192e+12          | 1.71918e+12       |
As all dates fall within 2024, I have omitted the year from the table.
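
To check that the original columns line up with the converted ones, a raw value can be turned back into a datetime. This sketch assumes the values are Unix time in milliseconds (an assumption based only on their size) and uses the rounded minimum from the table, so the result is approximate.

```r
library(lubridate)

# Rounded minimum original start time from the table (assumed milliseconds)
original_start_ms <- 1.7124e12

# as_datetime() expects seconds since 1970-01-01, so divide by 1000
converted <- as_datetime(original_start_ms / 1000, tz = "UTC")

converted
```

The result falls on April 6th, 2024, the same day as the minimum `start_time`. The time of day differs only because the printed original value is rounded.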

## Question:

In [None]:
# Playtime per experience level, split by newsletter subscription
# (named playtime_plot so it does not overwrite the playtime tibble above)
playtime_plot <- ggplot(players, aes(x=experience, y=played_hours, fill=subscribe)) +
    geom_bar(stat="identity", position="dodge") +
    labs(x="Amount of Experience (in Minecraft)", y="Playtime (hours)", fill="Gaming Newsletter Subscription") +
    ggtitle("Amount of playtime for each level of experience")
# Number of players at each experience level
experience_count_plot <- ggplot(experience_count, aes(x=experience, y=count, fill=experience)) +
    geom_bar(stat="identity") +
    labs(x="Amount of Experience (in Minecraft)", y="Number of Players")
experience_count_plot
experience_count_plot |> invisible()
playtime_plot
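
One caveat about the playtime chart: the player table has one row per player, so a dodged bar chart with `stat="identity"` draws one bar per player, and the bars within each group overlap rather than add up. As far as I understand, the visible height is then the largest single value in the group. If a total per group is wanted instead, the data can be aggregated before plotting. The sketch below shows this on made-up data (`players_toy`, the "Other" level, and the hours are all placeholders).

```r
library(dplyr)
library(ggplot2)

# Made-up players; "Other" is a placeholder experience level
players_toy <- tibble(
    experience = c("Amateur", "Amateur", "Other", "Other"),
    subscribe = c("Yes", "Yes", "No", "Yes"),
    played_hours = c(1, 3, 0.5, 10)
)

# Total hours for each experience/subscription combination
hours_by_group <- players_toy |>
    group_by(experience, subscribe) |>
    summarize(total_hours = sum(played_hours), .groups = "drop")

# One bar per combination, so the bar height is the group total
total_hours_plot <- ggplot(hours_by_group, aes(x = experience, y = total_hours, fill = subscribe)) +
    geom_col(position = "dodge") +
    labs(x = "Amount of Experience (in Minecraft)", y = "Total Playtime (hours)",
         fill = "Gaming Newsletter Subscription")

total_hours_plot
```

Replacing `sum` with `mean` or `median` would give a per-player view, which may be more suitable when group sizes differ.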