# Individual Planning Report

In [1]:
library(tidyverse)
library(glue)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


**Part 1 - Data Description**

Provide a full descriptive summary of the dataset, including information such as the number of observations, summary statistics (report values to 2 decimal places), number of variables, name and type of variables, what the variables mean, any issues you see in the data, any other potential issues related to things you cannot directly see, how the data were collected, etc. Make sure to use bullet point lists or tables to summarize the variables in an easy-to-understand format.

Note that the selected dataset(s) will probably contain more variables than you need. In fact, exploring how the different variables in the dataset affect your model may be a crucial part of the project. You need to summarize the full data regardless of which variables you may choose to use later on.

## Players Dataset Description

In [11]:
players <- read_csv('https://raw.githubusercontent.com/evanvoorbergen/dsci100-project/refs/heads/main/data/players.csv') |>
    # Use consistent snake_case column names
    rename(age = Age, hashed_email = hashedEmail)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [30]:
head(players)

experience,subscribe,hashed_email,played_hours,name,gender,age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17


In [13]:
dim(players)

sum(is.na(players))
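
`sum(is.na(players))` returns a single total, which does not show which variables are affected. A minimal sketch of a per-column count follows; `toy_players` is hypothetical stand-in data, not the real `players` table, and the same pipeline should work on `players` unchanged.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for `players`
toy_players <- tibble(
    played_hours = c(30.3, 3.8, 0.0, 0.7),
    age = c(9, 17, NA, 21),
    gender = c("Male", "Male", NA, "Female")
)

# Count the missing values in every column at once
na_by_column <- toy_players |>
    summarize(across(everything(), ~ sum(is.na(.x))))
na_by_column
```

Each column of the result holds the number of `NA` values in the matching input column, so a zero means that variable is complete.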

In [14]:
# Experience
experience_count <- players |>
    group_by(experience) |>
    summarize(count = n())
experience_count

experience_most <- experience_count |>
    slice_max(count) |>
    pull(experience)
print(glue('The experience level with the most players: {experience_most}'))

experience_least <- experience_count |>
    slice_min(count) |>
    pull(experience)
print(glue('The experience level with the fewest players: {experience_least}'))

experience,count
<chr>,<int>
Amateur,63
Beginner,35
Pro,14
Regular,36
Veteran,48


The experience level with the most players: Amateur
The experience level with the fewest players: Pro


In [15]:
# Subscribers
subscribe_numbers <- players |>
    group_by(subscribe) |>
    summarize(count = n()) |>
    mutate(total_players = sum(count)) |>
    mutate(proportion_sub = round(count / total_players, 2)) |>
    select(-total_players)
subscribe_numbers

subscribe,count,proportion_sub
<lgl>,<int>,<dbl>
False,52,0.27
True,144,0.73


In [23]:
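# Hours played
hour_count <- players |>
    group_by(played_hours) |>
    summarize(count = n())
# hour_count

# Take the extremes from played_hours itself: max()/min() on the whole
# hour_count table would also scan the count column.
hour_most <- players |>
    pull(played_hours) |>
    max()
print(glue('The most hours played: {hour_most}'))

hour_least <- players |>
    pull(played_hours) |>
    min()
print(glue('The least hours played: {hour_least}'))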

hour_mean <- players |>
    summarize(mean_hours = mean(played_hours)) |>
    pull() |>
    round(2)
print(glue('The average hours played: {hour_mean}'))

The most hours played: 223.1
The least hours played: 0
The average hours played: 5.85
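
The assignment asks for summary statistics reported to 2 decimal places. One possible way to collect them in a single table is sketched below; `toy_hours` is hypothetical data built from the first six `played_hours` values shown by `head(players)`, and the choice of statistics is an assumption rather than a requirement of the report.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: the six played_hours values printed by head(players)
toy_hours <- tibble(played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 0.0))

# Compute several statistics in one summarize() call, then round them all
hours_summary <- toy_hours |>
    summarize(
        min_hours = min(played_hours),
        median_hours = median(played_hours),
        mean_hours = mean(played_hours),
        max_hours = max(played_hours)
    ) |>
    mutate(across(everything(), ~ round(.x, 2)))
hours_summary
```

Rounding in a final `mutate(across(...))` step keeps the unrounded values available for any earlier calculation.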


In [27]:
# Gender
gender_count <- players |>
    group_by(gender) |>
    summarize(count = n())
gender_count

# These counts are numbers of players per gender, not hours played
gender_most <- gender_count |>
    slice_max(count) |>
    pull(gender)
print(glue('The most common gender among players: {gender_most}'))

gender_least <- gender_count |>
    slice_min(count) |>
    pull(gender)
print(glue('The least common gender among players: {gender_least}'))

gender,count
<chr>,<int>
Agender,2
Female,37
Male,124
Non-binary,15
Other,1
Prefer not to say,11
Two-Spirited,6


The most common gender among players: Male
The least common gender among players: Other


In [28]:
# Name
name_count <- players |>
    group_by(name) |>
    summarize(count = n()) |>
    # Sort by frequency so any repeated name would appear first
    arrange(desc(count))
# name_count


# All names are used only once

In [36]:
# age
age_count <- players |>
    group_by(age) |>
    summarize(count = n())
age_count

age_most <- age_count |>
    slice_max(age) |>
    pull(age)
print(glue('The oldest player is: {age_most}'))

age_least <- age_count |>
    slice_min(age) |>
    pull(age)
print(glue('The youngest player is: {age_least}'))

age_mean <- players |>
    summarize(mean_age = mean(age, na.rm = TRUE)) |>
    pull() |>
    round(2)
print(glue('The mean age is: {age_mean}'))

age_mode <- age_count |>
    slice_max(count) |>
    pull(age)
print(glue('The most common age is: {age_mode}'))

na_number <- players |>
    summarize(num_NA = sum(is.na(age))) |>
    pull()
print(glue('{na_number} people have not provided their age'))

# NA ages must be removed (or skipped with na.rm = TRUE) before age is used in any calculation

age,count
<dbl>,<int>
9.0,1
10.0,1
11.0,1
12.0,1
14.0,2
15.0,2
16.0,3
17.0,73
18.0,7
19.0,7


The oldest player is: 58
The youngest player is: 9
The mean age is: 21.14
The most common age is: 17
2 people have not provided their age
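
Since two players have no recorded age, one option is to drop those rows before any age-based analysis. The sketch below uses `tidyr::drop_na()` on hypothetical stand-in data (`toy_players`); whether to drop or keep these rows in the real analysis is a judgment call that this example does not settle.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data standing in for `players`
toy_players <- tibble(
    name = c("A", "B", "C", "D"),
    age = c(9, 17, NA, 21)
)

# Keep only the rows where age is present
players_with_age <- toy_players |>
    drop_na(age)
nrow(players_with_age)
```

Passing the column name to `drop_na()` removes rows only when `age` is missing, so rows with gaps in other columns are kept.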


In [42]:
sessions <- read_csv('https://raw.githubusercontent.com/evanvoorbergen/dsci100-project/refs/heads/main/data/sessions.csv') |>
    rename(hashed_email = hashedEmail) |>
    separate(start_time, c("start_date", "start_time"), " ") |>
    separate(end_time, c("end_date", "end_time"), " ")
    

Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [43]:
head(sessions)

hashed_email,start_date,start_time,end_date,end_time,original_start_time,original_end_time
<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,30/06/2024,18:12,30/06/2024,18:24,1719770000000.0,1719770000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,17/06/2024,23:33,17/06/2024,23:46,1718670000000.0,1718670000000.0
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,25/07/2024,17:34,25/07/2024,17:57,1721930000000.0,1721930000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,25/07/2024,03:22,25/07/2024,03:58,1721880000000.0,1721880000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,25/05/2024,16:01,25/05/2024,16:12,1716650000000.0,1716650000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,23/06/2024,15:08,23/06/2024,17:10,1719160000000.0,1719160000000.0
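
The date and time columns above are stored as character strings in day/month/year order, so they cannot be subtracted directly. A minimal sketch of turning them into date-times and computing a session length is shown here. `toy_sessions` copies two rows from the output above; the `start_dt`, `end_dt`, and `duration_min` names are hypothetical, and the UTC time zone that `dmy_hm()` uses by default is an assumption about the data.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Two rows copied from head(sessions), without the email and epoch columns
toy_sessions <- tibble(
    start_date = c("30/06/2024", "23/06/2024"),
    start_time = c("18:12", "15:08"),
    end_date = c("30/06/2024", "23/06/2024"),
    end_time = c("18:24", "17:10")
)

toy_durations <- toy_sessions |>
    mutate(
        # Recombine date and time, then parse as day/month/year hour:minute
        start_dt = dmy_hm(paste(start_date, start_time)),
        end_dt = dmy_hm(paste(end_date, end_time)),
        # Session length in minutes
        duration_min = as.numeric(difftime(end_dt, start_dt, units = "mins"))
    )
toy_durations
```

Working with parsed date-times also handles sessions that run past midnight, where the end date differs from the start date.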


In [44]:
dim(sessions)

In [59]:
#hashed Email
email_count <- sessions |>
    group_by(hashed_email) |>
    summarize(count = n())
# email_count

max_sessions <- email_count |>
    slice_max(count) |>
    pull(count)
print(glue('The maximum number of sessions for a player is: {max_sessions}'))

min_sessions <- email_count |>
    arrange(count) |>
    slice(1) |>
    pull(count)
print(glue('The minimum number of sessions for a player is: {min_sessions}'))

mean_sessions <- email_count |>
    summarize(mean = mean(count)) |>
    pull()
print(glue('The mean number of sessions for a player is: {mean_sessions}'))

The maximum number of sessions for a player is: 310
The minimum number of sessions for a player is: 1
The mean number of sessions for a player is: 12.28


In [68]:
#start_date
startdate_count <- sessions |>
    group_by(start_date) |>
    summarize(count = n())
# startdate_count

max_starters <- startdate_count |>
    slice_max(count) |>
    pull(start_date)
print(glue('The most common start date is: {max_starters}'))


The most common start date is: 25/07/2024


In [70]:
#end_date
enddate_count <- sessions |>
    group_by(end_date) |>
    summarize(count = n())


max_enders <- enddate_count |>
    slice_max(count) |>
    pull(end_date)
print(glue('The most common end date is: {max_enders}'))

The most common end date is: 25/07/2024


In [85]:
#start time
starttime_count <- sessions |>
    group_by(start_time) |>
    summarize(count = n())
# starttime_count

max_startertime <- starttime_count |>
    slice_max(count) |>
    pull(start_time)
print(glue('The most common start time: {max_startertime}'))

The most common start time: 02:29
The most common start time: 02:31
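
Two lines are printed above because `slice_max()` keeps every row tied for the maximum by default, and `glue()` then produces one string per value. The sketch below shows the difference on hypothetical counts (`toy_counts`, whose values are invented for illustration).

```r
library(dplyr)
library(tibble)

# Hypothetical counts with a tie for first place
toy_counts <- tibble(
    start_time = c("02:29", "02:31", "18:12"),
    count = c(5, 5, 3)
)

# Default behaviour: all tied rows are returned
tied <- toy_counts |>
    slice_max(count)

# with_ties = FALSE returns exactly one row
single <- toy_counts |>
    slice_max(count, with_ties = FALSE)

nrow(tied)
nrow(single)
```

Keeping the ties is often the more honest summary, since `with_ties = FALSE` picks one of the tied rows without that choice being meaningful.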


In [80]:
#end time
endtime_count <- sessions |>
    group_by(end_time) |>
    summarize(count = n())
# endtime_count

max_endertime <- endtime_count |>
    slice_max(count) |>
    pull(end_time)
print(glue('The most common end time: {max_endertime}'))

The most common end time: 03:39


In [86]:
# Original start time (milliseconds since the Unix epoch)
ogstarttime_count <- sessions |>
    group_by(original_start_time) |>
    summarize(count = n())
# ogstarttime_count

max_ogstartertime <- ogstarttime_count |>
    slice_max(count) |>
    pull(original_start_time)
print(glue('The most common original start time: {max_ogstartertime}'))

The most common original start time: 1.72189e+12


In [89]:
# Original end time (milliseconds since the Unix epoch)
ogendtime_count <- sessions |>
    group_by(original_end_time) |>
    summarize(count = n())
# ogendtime_count

max_ogendertime <- ogendtime_count |>
    slice_max(count) |>
    pull(original_end_time)
print(glue('The most common original end time: {max_ogendertime}'))

The most common original end time: 1.72189e+12
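
Values such as `1.72189e+12` are hard to read as times. Assuming they are milliseconds since the Unix epoch (which matches the June and July 2024 dates in the other columns), they can be converted to date-times with base R, as in this sketch. `epoch_ms` is taken from the first row of `head(sessions)`, where the value appears to be rounded, so the converted time is only approximate.

```r
# Value from the first row of head(sessions); assumed to be milliseconds since 1970-01-01 UTC
epoch_ms <- 1719770000000

# as.POSIXct() expects seconds, so divide by 1000 first
session_start <- as.POSIXct(epoch_ms / 1000, origin = "1970-01-01", tz = "UTC")
format(session_start, "%Y-%m-%d")
```

Because several sessions share the same rounded value, these two columns carry less information than the parsed `start_time` and `end_time` strings.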
