# Term Project - Eric Xiao (Group 34)

In [1]:
library(tidyverse)

── Attaching core tidyverse packages ───────────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


## 0. Context

This project concerns a Minecraft research server. We will answer questions that help with the logistics of conducting the research: for example, which players to recruit because they are likely to contribute a lot of data, or how many licenses and how much hardware are required.

## 1. Dataset Description

This dataset consists of two files. The first, `players.csv`, contains data about all the players who have signed up to play on the server. The second, `sessions.csv`, contains data about the play sessions that participants had on the server.

**NOTE: the variables (column names) have been renamed to PascalCase because the original names were inconsistent.**

*The following markdown tables were made using https://www.tablesgenerator.com/markdown_tables*

### Players.csv

There are a total of 7 variables and 196 observations.

| Variable    | Type | Interpreted Meaning                                                          | Statistical Summary                                          | Possible Issues                                                                                               |
|-------------|------|------------------------------------------------------------------------------|--------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|
| Experience  | chr  | The amount of experience the player has with Minecraft                       | 35 Beginners, 63 Amateurs, 36 Regulars, 14 Pros, 48 Veterans | The distinction between some labels, such as "Pro" and "Veteran", is unclear                                  |
| Subscribe   | lgl  | Represents whether or not the player subscribed to a game-related newsletter | 52 not subscribed, 144 subscribed to the newsletter          | N/A                                                                                                           |
| HashedEmail | chr  | The email of the player, which has been hashed for their privacy             | N/A                                                          | Not useful for analysis by itself, but it links each player to their sessions                                 |
| PlayedHours | dbl  | The total number of hours that the player has played on the server           | Average of 5.85 hours played                                 | N/A                                                                                                           |
| Name        | chr  | Their name                                                                   | N/A                                                          | An identifier, so it is not useful for data analysis                                                          |
| Gender      | chr  | Their gender                                                                 | 124 Male, 37 Female, 35 other or undisclosed                 | Male players far outnumber female players                                                                     |
| Age         | dbl  | Their age                                                                    | Average age of 21.14                                         | Two players have NA as their age                                                                              |

### Sessions.csv

There are a total of 5 variables and 1535 observations.

| Variable          | Type | Interpreted Meaning                                                 | Statistical Summary                               | Possible Issues                                                                                                                             |
|-------------------|------|---------------------------------------------------------------------|---------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------|
| HashedEmail       | chr  | The email of the player playing in this session, hashed for privacy | N/A                                               | Cannot be used for data analysis, but can be used to match the player in `players.csv` to the session                                       |
| StartTime         | chr  | The starting time of the play session                               | TODO: add this after tidying                      | Stored as text, so it must be parsed before use as a time; may also need splitting into multiple columns                                    |
| EndTime           | chr  | The end time of the play session                                    | TODO: add this after tidying                      | *Same issues as StartTime*                                                                                                                  |
| OriginalStartTime | dbl  | The Unix timestamp in milliseconds, representing the start time     | First session started on Apr 06 2024 (1.7124e+12) | Redundant with StartTime and less precise: values only resolve to steps of 10,000,000 ms (about 2 h 47 min)                                 |
| OriginalEndTime   | dbl  | The Unix timestamp in milliseconds, representing the end time       | Last session ended on Sep 26 2024 (1.72734e+12)   | *Same issues as OriginalStartTime*                                                                                                          |
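
The `StartTime` and `EndTime` issue noted in the table could be handled by parsing the text into date-times. The following is only a minimal sketch on made-up rows: it assumes the strings follow a `DD/MM/YYYY HH:MM` layout and that the times can be treated as UTC, neither of which is confirmed by the output shown in this notebook. The names `toy_sessions`, `toy_tidy`, and `DurationMins` are hypothetical.

```r
library(tidyverse)  # lubridate is attached as part of tidyverse 2.0.0

# Hypothetical rows. The "DD/MM/YYYY HH:MM" layout is an assumption.
toy_sessions <- tibble(
    HashedEmail = c("abc123", "abc123", "def456"),
    StartTime   = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
    EndTime     = c("30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 17:57")
)

# Parse both columns into date-times (UTC is assumed, not known),
# then compute each session's length in minutes.
toy_tidy <- toy_sessions |>
    mutate(StartTime = dmy_hm(StartTime, tz = "UTC"),
           EndTime = dmy_hm(EndTime, tz = "UTC"),
           DurationMins = as.numeric(difftime(EndTime, StartTime, units = "mins")))

toy_tidy
```

If the real strings use a different layout, another lubridate parser such as `ymd_hms()` would be the one to try instead.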

In [2]:
players <- read_csv("https://raw.githubusercontent.com/avahbot/dsci-100-term-project/27334598713332bcae2f0dfbe56513e06786391f/data/players.csv") |>
    rename(Experience = experience,
           Subscribe = subscribe,
           HashedEmail = hashedEmail,
           PlayedHours = played_hours,
           Name = name,
           Gender = gender,
           Age = Age) # technically doesn't need to be here

sessions <- read_csv("https://raw.githubusercontent.com/avahbot/dsci-100-term-project/27334598713332bcae2f0dfbe56513e06786391f/data/sessions.csv") |>
    rename(HashedEmail = hashedEmail,
           StartTime = start_time,
           EndTime = end_time,
           OriginalStartTime = original_start_time,
           OriginalEndTime = original_end_time)

Rows: 196 Columns: 7
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
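
Because `HashedEmail` appears in both files, it can serve as a join key between players and their sessions. The sketch below illustrates the idea on made-up miniature tables rather than the real data; `toy_players`, `toy_sessions`, `session_counts`, and `SessionCount` are hypothetical names introduced only for this example.

```r
library(tidyverse)

# Hypothetical miniature versions of the two tables (values are made up).
toy_players <- tibble(
    HashedEmail = c("abc123", "def456", "ghi789"),
    Experience  = c("Pro", "Amateur", "Veteran")
)
toy_sessions <- tibble(
    HashedEmail = c("abc123", "abc123", "def456")
)

# Count the sessions recorded for each hashed email.
session_counts <- toy_sessions |>
    count(HashedEmail, name = "SessionCount")

# left_join() keeps players with no sessions. Their count is NA after
# the join, so replace_na() sets it to 0.
toy_joined <- toy_players |>
    left_join(session_counts, by = "HashedEmail") |>
    mutate(SessionCount = replace_na(SessionCount, 0L))

toy_joined
```

A left join is used here so that players who signed up but never played stay in the result, which matters for recruitment questions.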

In [3]:
# Used to get variable statistical summaries

players |> group_by(Experience) |> summarize(count = n())
players |> group_by(Subscribe) |> summarize(count = n())
players |> group_by(Gender) |> summarize(count = n())
players |> summarize(hours = mean(PlayedHours))
players |> summarize(age = mean(Age, na.rm = TRUE))

Experience,count
<chr>,<int>
Amateur,63
Beginner,35
Pro,14
Regular,36
Veteran,48


Subscribe,count
<lgl>,<int>
False,52
True,144


Gender,count
<chr>,<int>
Agender,2
Female,37
Male,124
Non-binary,15
Other,1
Prefer not to say,11
Two-Spirited,6


hours
<dbl>
5.845918


age
<dbl>
21.13918
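
The group counts above can also be written with `count()`, and the number of missing ages can be checked directly instead of being read off the data by eye. This is a small sketch on made-up values, not the real `players` table; `toy_players`, `experience_counts`, and `age_summary` are hypothetical names.

```r
library(tidyverse)

# Hypothetical rows with one missing age (values are made up).
toy_players <- tibble(
    Experience = c("Pro", "Amateur", "Amateur", "Veteran"),
    Age        = c(17, NA, 21, 22)
)

# count() is shorthand for group_by() followed by summarize(n()).
experience_counts <- toy_players |>
    count(Experience, name = "count")

# Report how many ages are missing alongside the mean of the rest.
age_summary <- toy_players |>
    summarize(missing_age = sum(is.na(Age)),
              mean_age = mean(Age, na.rm = TRUE))

experience_counts
age_summary
```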


In [4]:
sessions |> summarize(start = min(OriginalStartTime))
sessions |> summarize(end = max(OriginalEndTime, na.rm = TRUE))

start
<dbl>
1712400000000.0


end
<dbl>
1727340000000.0
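
To relate these raw values to calendar dates, the millisecond timestamps can be divided by 1000 and converted with base R. This sketch assumes the timestamps count from the Unix epoch and reads them in UTC; the variable names are hypothetical.

```r
# The two values printed above, in milliseconds since 1970-01-01.
first_start_ms <- 1712400000000
last_end_ms    <- 1727340000000

# as.POSIXct() expects seconds, so divide by 1000. UTC is an assumption.
first_start <- as.POSIXct(first_start_ms / 1000, origin = "1970-01-01", tz = "UTC")
last_end    <- as.POSIXct(last_end_ms / 1000, origin = "1970-01-01", tz = "UTC")

format(first_start, "%Y-%m-%d %H:%M")  # "2024-04-06 10:40"
format(last_end, "%Y-%m-%d %H:%M")     # "2024-09-26 08:40"

# A step of 10,000,000 ms expressed in minutes (2 h 46 min 40 s).
step_minutes <- 1e7 / 1000 / 60
step_minutes
```

The resulting dates agree with the Apr 06 2024 and Sep 26 2024 entries in the `sessions.csv` table.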
