**Planning Report:**

**(1) Data Description:**

This project uses two datasets provided by a UBC research group:
- `players.csv` contains a list of all unique players, including data about each player.
- `sessions.csv` contains a list of individual play sessions by each player, including data about the session.
  
The `players.csv` dataset has 196 observations and 7 variables, where each row represents a different player. The variables include:
- `experience`, a categorical variable giving the player's experience level (Beginner, Amateur, Regular, Pro, or Veteran).
- `subscribe`, a logical variable (TRUE, FALSE, or NA) indicating whether the player is subscribed to a game-related newsletter.
- `hashedEmail`, a unique, anonymous ID for each player, generated from their personal email to keep it private. This variable also links the two datasets.
- `played_hours`, a numeric variable giving the number of hours the player spent on the server.
- `name`, the player's first name.
- `gender`, a categorical variable giving the player's gender.
- `Age`, a numeric variable giving the player's age in years.

The `sessions.csv` dataset has 1,535 observations and 5 variables, where each row represents one play session. The variables include:
- `hashedEmail`, the anonymous player ID that links each session to a row in `players.csv`.
- `start_time` and `end_time`, character strings giving the session's start and end time in the format DD/MM/YYYY HH:MM.
- `original_start_time` and `original_end_time`, numeric timestamps of the session's start and end. Their size (about 1.7e+12) is consistent with milliseconds since the Unix epoch, although the data source does not confirm this.
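
Because `hashedEmail` appears in both files, the two datasets can be combined on it. The following is a minimal sketch with made-up toy data: `players_demo`, `sessions_demo`, and the shortened IDs are hypothetical stand-ins, not values from the real files.

```r
library(dplyr)

# Hypothetical toy data; the IDs stand in for real hashedEmail values.
players_demo <- tibble(
    hashedEmail = c("aaa", "bbb", "ccc"),
    experience = c("Pro", "Veteran", "Amateur"))
sessions_demo <- tibble(
    hashedEmail = c("aaa", "aaa", "bbb"))

# Count sessions per player, then attach the counts to the player table.
session_counts <- sessions_demo |>
    count(hashedEmail, name = "n_sessions")

players_with_sessions <- players_demo |>
    left_join(session_counts, by = "hashedEmail") |>
    mutate(n_sessions = coalesce(n_sessions, 0L))  # players with no sessions get 0

players_with_sessions
```

A `left_join` keeps every player, including those who never logged a session, which an `inner_join` would silently drop.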

**Summary Statistics Insights:**

From the `players` dataset, the average age is 21.14 years, and players spent an average of 5.85 hours on the server. Most players (73%) are subscribed to a gaming-related newsletter, and most identify as male (63%). The most common experience level is Amateur (32%).

From the `sessions` dataset, the mean `original_start_time` is 1.719201e+12 and the mean `original_end_time` is 1.719196e+12. The end-time mean is computed with `na.rm = TRUE` and the start-time mean is not, so the two means may not cover the same sessions, which could explain why the mean end time is earlier than the mean start time. We also see that most players had only a few gaming sessions.

Note: the per-player session counts (`hashedEmail`) are summarized from the `sessions` dataset and not from `players`, because each row in `players` is a different user, whereas each row in `sessions` is one logged gaming session and a single user can have many. I did not provide a summary for `start_time` and `end_time`: they are stored as date-time strings, and we have not yet learned how to compute a mean for them.
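
One possible way to summarize the date-time strings is to parse them with `dmy_hm()` from the lubridate package, which matches the DD/MM/YYYY HH:MM format. This is only a sketch: `sessions_demo` holds three rows copied from the printed `sessions` output, and the times are parsed as UTC because the true time zone is not known.

```r
library(dplyr)
library(lubridate)

# Three rows copied from the printed `sessions` output; the real data has 1,535 rows.
sessions_demo <- tibble(
    start_time = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
    end_time = c("30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 17:57"))

sessions_parsed <- sessions_demo |>
    mutate(
        start_dt = dmy_hm(start_time),  # assumption: UTC, the true time zone is unknown
        end_dt = dmy_hm(end_time),
        duration_min = as.numeric(difftime(end_dt, start_dt, units = "mins")))

# Means of date-times and of session lengths, ignoring any missing values.
session_time_summary <- sessions_parsed |>
    summarize(
        mean_start = mean(start_dt, na.rm = TRUE),
        mean_duration_min = mean(duration_min, na.rm = TRUE))

session_time_summary
```

Session length in minutes is probably easier to interpret than a mean start time, since it describes how long a typical session lasts.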

**Potential Issues in the Data:**

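- Some numeric variables contain NA values: `Age` is missing for at least one player in the printed output, so means must be computed with `na.rm = TRUE`.
- We do not know how the data was collected or how players were selected, so there may be selection bias, which limits how far the results can be generalized.
- In the `sessions` dataset, some players logged far more sessions than others (one player accounts for 310 of the 1,535 sessions), so a few heavy users could dominate session-level analyses.
- The values in `original_start_time` and `original_end_time` are hard to interpret: they print in scientific notation (for example 1.71977e+12), so the start and end of a session often look identical.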
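
The raw timestamps become easier to read once they are printed in full or converted to date-times. The sketch below assumes the values count milliseconds since 1970-01-01 UTC, which the data source does not confirm. The example value is copied from the printed output, so it is already rounded.

```r
library(lubridate)

# Value copied from the printed `sessions` output (rounded by the display).
original_start_time <- 1.71977e12

# Print the full number instead of scientific notation.
full_value <- format(original_start_time, scientific = FALSE)
full_value

# Assumption: milliseconds since the Unix epoch, so divide by 1000 to get seconds.
start_dt <- as_datetime(original_start_time / 1000)
start_dt
```

The converted date falls on the same day as the matching `start_time` string (30/06/2024), which supports the millisecond assumption but does not prove it.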


In [35]:
### Run this cell before continuing.
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
# cleanup.R is not in the working directory here, so source it only when it exists
if (file.exists("cleanup.R")) source("cleanup.R")


In [2]:
# loading in the datasets

players <- read_csv("project_data/players.csv")
sessions <- read_csv("project_data/sessions.csv")

players
sessions

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,TRUE,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,TRUE,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,FALSE,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
⋮,⋮,⋮,⋮,⋮,⋮,⋮
Amateur,FALSE,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db299bd4fedb06a46ad5bb,0.0,Dylan,Prefer not to say,57
Amateur,FALSE,f19e136ddde68f365afc860c725ccff54307dedd13968e896a9f890c40aea436,2.3,Harlow,Male,17
Pro,TRUE,d9473710057f7d42f36570f0be83817a4eea614029ff90cf50d8889cdd729d11,0.2,Ahmed,Other,


hashedEmail,start_time,end_time,original_start_time,original_end_time
<chr>,<chr>,<chr>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,30/06/2024 18:12,30/06/2024 18:24,1.71977e+12,1.71977e+12
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,17/06/2024 23:33,17/06/2024 23:46,1.71867e+12,1.71867e+12
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,25/07/2024 17:34,25/07/2024 17:57,1.72193e+12,1.72193e+12
⋮,⋮,⋮,⋮,⋮
fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33cbb5e894a3867ca44d,28/07/2024 15:36,28/07/2024 15:57,1.72218e+12,1.72218e+12
fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33cbb5e894a3867ca44d,25/07/2024 06:15,25/07/2024 06:22,1.72189e+12,1.72189e+12
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,20/05/2024 02:26,20/05/2024 02:45,1.71617e+12,1.71617e+12


In [34]:
# summary statistics for players

# numerical variables in players
mean_values_players <- players |>
    summarize(
        mean_played_hours = round(mean(played_hours), 2),
        mean_age = round(mean(Age, na.rm = TRUE), 2))

mean_values_players

# categorical variables in players

experience_summary <- players |>
    group_by(experience) |>
    summarize(
        Count = n(),
        Proportion = round(Count / nrow(players), 2))

subscribe_summary <- players |>
    group_by(subscribe) |>
    summarize(
        Count = n(),
        Proportion = round(Count / nrow(players), 2))

gender_summary <- players |>
    group_by(gender) |>
    summarize(
        Count = n(),
        Proportion = round(Count / nrow(players), 2))

experience_summary
subscribe_summary
gender_summary

# summary statistics for sessions

# numerical variables in sessions
mean_values_sessions <- sessions |>
    summarize(
        mean_original_start_time = round(mean(original_start_time), 2),
        mean_original_end_time = round(mean(original_end_time, na.rm = TRUE), 2))

mean_values_sessions


# number of sessions per player
hashedEmail_summary <- sessions |>
    group_by(hashedEmail) |>
    summarize(
        Count = n(),
        Proportion = round(Count / nrow(sessions), 2))     

hashedEmail_summary        


mean_played_hours,mean_age
<dbl>,<dbl>
5.85,21.14


experience,Count,Proportion
<chr>,<int>,<dbl>
Amateur,63,0.32
Beginner,35,0.18
Pro,14,0.07
Regular,36,0.18
Veteran,48,0.24


subscribe,Count,Proportion
<lgl>,<int>,<dbl>
FALSE,52,0.27
TRUE,144,0.73


gender,Count,Proportion
<chr>,<int>,<dbl>
Agender,2,0.01
Female,37,0.19
Male,124,0.63
⋮,⋮,⋮
Other,1,0.01
Prefer not to say,11,0.06
Two-Spirited,6,0.03


mean_original_start_time,mean_original_end_time
<dbl>,<dbl>
1719201000000.0,1719196000000.0


hashedEmail,Count,Proportion
<chr>,<int>,<dbl>
0088b5e134c3f0498a18c7ea6b8d77b4b0ff1636fc93355ccc95b45423367832,2,0
060aca80f8cfbf1c91553a72f4d5ec8034764b05ab59fe8e1cf0eee9a7b67967,1,0
0ce7bfa910d47fc91f21a7b3acd8f33bde6db57912ce0290fa0437ce0b97f387,1,0
⋮,⋮,⋮
fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33cbb5e894a3867ca44d,310,0.2
fe218a05c6c3fc6326f4f151e8cb75a2a9fa29e22b110d4c311fb58fb211f471,1,0.0
fef4e1bed8c3f6dcd7bcd39ab21bd402386155b2ff8c8e53683e1d2793bf1ed1,1,0.0
