In [1]:
### Loading libraries and project data

In [2]:
library(tidyverse)
library(repr)
options(repr.matrix.max.rows = 6)


── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


<mark>Will need to find a way to load the data that does not require additional files.</mark>
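
One possible direction for this, as a sketch only: `read_csv()` accepts a URL or literal text wrapped in `I()` as well as a local path, so a small helper could read from whatever source ends up being available. The helper name `load_sessions` is hypothetical, and the literal text below is a one-row stand-in that mimics the structure of the file, not the real data.

```r
library(readr)
library(dplyr)

# Hypothetical helper: `source` may be a local path, a URL,
# or literal CSV text wrapped in I().
load_sessions <- function(source) {
  read_csv(source, show_col_types = FALSE) |>
    select(-hashedEmail)
}

# Stand-in data (made up) so that the sketch runs without any extra files.
demo_csv <- I("hashedEmail,start_time,end_time\nabc123,30/06/2024 18:12,30/06/2024 18:24\n")

sessions_demo <- load_sessions(demo_csv)
sessions_demo
```

If the file were hosted somewhere, the same helper could be called with that address in place of `demo_csv`.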

### Start of Coherent Project Planning

#### Question to Explore: At what time of day do players who start a session tend to play for the longest periods?

To answer this question, I will need to explore the players' start times and the duration of their play sessions. It may also be interesting to explore the time at which the most players end their sessions, in order to help with demand forecasting.

For these reasons, I only need to load one of the datasets available to me. I will then select the start_time and end_time columns, transform them into tidy data, and mutate the result to create a new column holding the duration of each session. Let's begin with these steps.
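
As a rough sketch of where this plan is heading, the duration could also be computed by parsing the full timestamps rather than splitting them into pieces. This assumes the strings are in day/month/year hour:minute form, which is how they appear in the dataset preview; the two rows below are copied from that preview, and the names `sessions_example`, `start_dt`, `end_dt`, `duration_min`, and `start_hour` are my own.

```r
library(dplyr)
library(lubridate)

# Two rows copied from the dataset preview, used as a small stand-in.
sessions_example <- tibble(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time   = c("30/06/2024 18:24", "17/06/2024 23:46")
)

sessions_example_duration <- sessions_example |>
  mutate(
    start_dt = dmy_hm(start_time),   # parse day/month/year hour:minute
    end_dt = dmy_hm(end_time),
    # session length in minutes
    duration_min = as.numeric(difftime(end_dt, start_dt, units = "mins")),
    # hour of day at which the session started
    start_hour = hour(start_dt)
  )

sessions_example_duration
```

This is an alternative to the column-splitting approach used in the cells that follow, not a replacement for it.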

In [20]:
sessions_data <- read_csv("sessions.csv") |> select(-hashedEmail)
sessions_data

Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


start_time,end_time,original_start_time,original_end_time
<chr>,<chr>,<dbl>,<dbl>
30/06/2024 18:12,30/06/2024 18:24,1.71977e+12,1.71977e+12
17/06/2024 23:33,17/06/2024 23:46,1.71867e+12,1.71867e+12
25/07/2024 17:34,25/07/2024 17:57,1.72193e+12,1.72193e+12
⋮,⋮,⋮,⋮
28/07/2024 15:36,28/07/2024 15:57,1.72218e+12,1.72218e+12
25/07/2024 06:15,25/07/2024 06:22,1.72189e+12,1.72189e+12
20/05/2024 02:26,20/05/2024 02:45,1.71617e+12,1.71617e+12


In [21]:
### Tidying the data


sessions_tidy <- sessions_data |>
                    select(start_time, end_time) |>
                    separate(col = start_time, into = c("date_start", "time_start"), sep = " ") |>
                    separate(col = end_time, into = c("date_end", "time_end"), sep = " ") |>
                    separate(col = date_start, into = c("day_start", "month_start", "year_start"), sep = "/") |>
                    separate(col = date_end, into = c("day_end", "month_end", "year_end"), sep = "/") |>
                    separate(col = time_start, into = c("hour_start", "minute_start"), sep = ":") |>
                    separate(col = time_end, into = c("hour_end", "minute_end"), sep = ":") 




sessions_tidy

day_start,month_start,year_start,hour_start,minute_start,day_end,month_end,year_end,hour_end,minute_end
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
30,06,2024,18,12,30,06,2024,18,24
17,06,2024,23,33,17,06,2024,23,46
25,07,2024,17,34,25,07,2024,17,57
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
28,07,2024,15,36,28,07,2024,15,57
25,07,2024,06,15,25,07,2024,06,22
20,05,2024,02,26,20,05,2024,02,45


I now need to make a column for the duration of each play session. I am more concerned with the **time** of day at which players are playing than with the month and year of their gameplay. Thus, I will remove the columns pertaining to month and year. I should be careful about the day columns though, because someone may have played through midnight.
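
The midnight concern can be illustrated with a made-up session (not taken from the data) that crosses both midnight and a month boundary. In this sketch, subtracting the day numbers alone gives a misleading result once the month is gone, while subtracting full dates does not. The column names `naive_day_diff` and `date_diff` are hypothetical.

```r
library(dplyr)
library(lubridate)

# Made-up session that starts on 31 May and ends on 1 June.
midnight_example <- tibble(
  start_time = "31/05/2024 23:50",
  end_time = "01/06/2024 00:10"
) |>
  mutate(
    start_dt = dmy_hm(start_time),
    end_dt = dmy_hm(end_time),
    # day-of-month difference only: 1 - 31
    naive_day_diff = day(end_dt) - day(start_dt),
    # difference between the calendar dates, in days
    date_diff = as.numeric(as_date(end_dt) - as_date(start_dt)),
    # true session length in minutes
    duration_min = as.numeric(difftime(end_dt, start_dt, units = "mins"))
  )

midnight_example |>
  select(naive_day_diff, date_diff, duration_min)
```

Whether any session in the real data crosses a month boundary is something I would still need to check.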

In [22]:
sessions_days <- sessions_tidy |>
        select(-month_start, -year_start, -month_end, -year_end)

sessions_days
names(sessions_days)

day_start,hour_start,minute_start,day_end,hour_end,minute_end
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
30,18,12,30,18,24
17,23,33,17,23,46
25,17,34,25,17,57
⋮,⋮,⋮,⋮,⋮,⋮
28,15,36,28,15,57
25,06,15,25,06,22
20,02,26,20,02,45


I need to convert the character columns to numbers so that I can do arithmetic on them.

In [36]:
sessions_numeric <- sessions_days |>
        mutate(
            day_start = as.numeric(day_start),
            minute_start = as.numeric(minute_start),
            hour_start = as.numeric(hour_start),
            day_end = as.numeric(day_end),
            hour_end = as.numeric(hour_end),
            minute_end = as.numeric(minute_end)
        )
sessions_numeric

day_start,hour_start,minute_start,day_end,hour_end,minute_end
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
30,18,12,30,18,24
17,23,33,17,23,46
25,17,34,25,17,57
⋮,⋮,⋮,⋮,⋮,⋮
28,15,36,28,15,57
25,6,15,25,6,22
20,2,26,20,2,45
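
The conversion above could also be written more compactly with `across()`, which applies one function to several columns at once. This is a sketch on a two-row stand-in built from rows shown in the preview; `days_example` and `numeric_example` are hypothetical names.

```r
library(dplyr)

# Two rows copied from the preview of sessions_days, still stored as text.
days_example <- tibble(
  day_start = c("30", "25"),
  hour_start = c("18", "06"),
  minute_start = c("12", "15"),
  day_end = c("30", "25"),
  hour_end = c("18", "06"),
  minute_end = c("24", "22")
)

# Apply as.numeric() to every column in one step.
numeric_example <- days_example |>
  mutate(across(everything(), as.numeric))

numeric_example
```

Using `everything()` is only appropriate here because every column is meant to become a number.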


I want to see if there are any cases where people played across multiple days, because I will need to handle those cases separately.

In [12]:
sessions_with_duration <- sessions_numeric |>
        mutate(day_duration = (day_end-day_start))

ERROR: Error in `mutate()`:
ℹ In argument: `day_duration = (day_end - day_start)`.
Caused by error in `day_end - day_start`:
! non-numeric argument to binary operator
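
This error means that at least one of day_end and day_start was not numeric when the subtraction ran. One possible cause, which is an assumption on my part, is that the cell was run against a version of the table whose columns were still character. A sketch of checking the column types before subtracting is shown here; the second row of `stale_example` is made up to give one multi-day case, and all object names are hypothetical.

```r
library(dplyr)

# Stand-in table whose day columns are still text.
# The first row matches the preview; the second row is made up.
stale_example <- tibble(
  day_start = c("30", "17"),
  day_end = c("30", "18")
)

# Inspect the column classes first.
sapply(stale_example, class)

# Convert, then confirm the columns are numeric before doing arithmetic.
checked_example <- stale_example |>
  mutate(across(c(day_start, day_end), as.numeric))
stopifnot(is.numeric(checked_example$day_start), is.numeric(checked_example$day_end))

with_day_duration <- checked_example |>
  mutate(day_duration = day_end - day_start)

# Sessions whose start and end day numbers differ.
multi_day <- with_day_duration |>
  filter(day_duration != 0)

multi_day
```

A nonzero day_duration only flags candidates; a negative value would suggest a session that crossed a month boundary.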
