# Data Science Project Planning

## Data Description

### Player Data

The player data has 196 observations and 7 variables.

Two numerical variables:

  + played_hours: total hours played by a player
  + Age: age of player

Four character variables:

  + experience: the player's experience level (for example Pro, Veteran, Regular, Amateur)
  + hashedEmail: hashed (anonymized) email address, used to identify the player
  + name: player's first name
  + gender: player's gender

One logical variable:

  + subscribe: whether the player is subscribed to game-related newsletter

Summary stats:
* Age Range: [9, 58] years
* Age Average: 21.14 years (about 21)
* Played Hours Range: [0, 223.1] hours
* Played Hours Average: 5.85 hours
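
The summary statistics above can be reproduced with a short `dplyr` call. The sketch below is illustrative only: it uses the first five `played_hours` and `Age` values visible in the `glimpse()` output rather than the full file, and `summarise_players()` is a hypothetical helper name, not part of the project code.

```r
library(dplyr)

# First five rows of played_hours and Age as shown in the glimpse() output.
# The full players.csv has 196 rows, so these results will not match the
# full-data summary statistics.
players_sample <- data.frame(
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1),
  Age          = c(9, 17, 17, 21, 21)
)

# Hypothetical helper: range and mean of both numeric columns, ignoring NAs.
summarise_players <- function(players) {
  players %>%
    summarise(
      age_min    = min(Age, na.rm = TRUE),
      age_max    = max(Age, na.rm = TRUE),
      age_mean   = mean(Age, na.rm = TRUE),
      hours_min  = min(played_hours, na.rm = TRUE),
      hours_max  = max(played_hours, na.rm = TRUE),
      hours_mean = mean(played_hours, na.rm = TRUE)
    )
}

player_summary <- summarise_players(players_sample)
print(player_summary)
```

Passing the full `players` data frame to the same helper should give the ranges and averages listed above, assuming they were computed with missing values removed.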

Potential Issues:
* It is unclear how played_hours was measured (self-reported or tracked).
* Ages span 9 to 58 but average about 21, so a small number of older players may have an outsized influence on results.
* Experience levels may be inconsistently defined if self-reported.
* Missing (NA) values may bias results.
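
One way to gauge the missing-value issue is to count NAs per column before deciding how to handle them. This is a minimal sketch on a toy data frame: the column names match `players.csv`, but the NA values are planted for illustration and say nothing about where NAs occur in the real file.

```r
library(dplyr)

# Hypothetical toy data; NA positions are made up for illustration.
players_toy <- data.frame(
  experience   = c("Pro", "Veteran", NA, "Amateur"),
  played_hours = c(30.3, 3.8, 0.0, NA),
  Age          = c(9, NA, 17, NA)
)

# Count missing values in every column.
na_counts <- players_toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
print(na_counts)

# Keep only rows where the columns needed for the analysis are present.
# Which columns are "needed" is an assumption that depends on the question.
players_complete <- players_toy %>%
  filter(!is.na(Age), !is.na(played_hours))
```

Dropping rows is only one option. Comparing the dropped rows with the kept rows (for example by experience level) can show whether the missingness is likely to bias results.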

### Sessions Data

The sessions data has 1535 observations and 5 variables.

Two numerical variables:

  + original_start_time and original_end_time: session start and end times in UNIX milliseconds.

Three character variables:

  + start_time and end_time: human-readable versions of the above, formatted as day/month/year hour:minute
  + hashedEmail: hashed email that links each session to a player

Summary statistics of the raw timestamps are not meaningful on their own. The quantity worth summarizing is session duration (end time minus start time).
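
Session durations can be derived from the human-readable columns, as sketched below. The two rows are the first two sessions shown in the `glimpse()` output. The UTC time zone is an assumption (the time zone of the data is not documented), and the column names `start_dt`, `end_dt`, and `duration_mins` are introduced here for illustration.

```r
library(dplyr)

# First two sessions as shown in the glimpse() output.
sessions_sample <- data.frame(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time   = c("30/06/2024 18:24", "17/06/2024 23:46")
)

# Parse day/month/year hour:minute strings. tz = "UTC" is an assumption;
# it does not affect the duration as long as both columns share a time zone.
parse_session_time <- function(x) {
  as.POSIXct(x, format = "%d/%m/%Y %H:%M", tz = "UTC")
}

sessions_with_duration <- sessions_sample %>%
  mutate(
    start_dt      = parse_session_time(start_time),
    end_dt        = parse_session_time(end_time),
    duration_mins = as.numeric(difftime(end_dt, start_dt, units = "mins"))
  )

print(sessions_with_duration$duration_mins)
```

Once `duration_mins` exists, the usual summaries (range, mean, median) become meaningful, and durations can be totalled per `hashedEmail` for comparison with `played_hours`.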

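Potential Issues:
* It is unclear which time zone the session times are recorded in.
* Start or end times may be missing or inconsistent (for example, an end time before the start time).
* A player who used more than one email would appear under several hashedEmail values, fragmenting their sessions.
* Missing (NA) values could bias results.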
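
Some of these issues can be checked directly. The sketch below flags sessions with missing or reversed timestamps and uses `anti_join()` to find sessions and players that do not link up. Both tables are hypothetical toy data: the `id_*` values stand in for real `hashedEmail` values and the timestamps are made up.

```r
library(dplyr)

# Hypothetical toy tables; IDs and timestamps are illustrative only.
players_toy <- data.frame(hashedEmail = c("id_a", "id_b", "id_c"))
sessions_toy <- data.frame(
  hashedEmail         = c("id_a", "id_a", "id_b", "id_x"),
  original_start_time = c(1000, 5000, 9000, 12000),
  original_end_time   = c(4000, NA, 8000, 15000)
)

# Sessions with a missing timestamp or an end time before the start time.
bad_times <- sessions_toy %>%
  filter(
    is.na(original_start_time) |
      is.na(original_end_time) |
      original_end_time < original_start_time
  )

# Sessions whose hashedEmail has no match in the players table.
orphan_sessions <- anti_join(sessions_toy, players_toy, by = "hashedEmail")

# Players with no recorded sessions.
players_without_sessions <- anti_join(players_toy, sessions_toy, by = "hashedEmail")
```

These checks cannot detect one person using several emails, since each hashed email looks like a separate player.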

## Questions

## Exploratory Data Analysis and Visualization

```r
library(tidyverse)
sessions <- read_csv("dsci project/data/sessions.csv")
glimpse(sessions)
```

```
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Rows: 1,535
Columns: 5
$ hashedEmail         <chr> "bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8a…
$ start_time          <chr> "30/06/2024 18:12", "17/06/2024 23:33", "25/07/202…
$ end_time            <chr> "30/06/2024 18:24", "17/06/2024 23:46", "25/07/202…
$ original_start_time <dbl> 1.71977e+12, 1.71867e+12, 1.72193e+12, 1.72188e+12…
$ original_end_time   <dbl> 1.71977e+12, 1.71867e+12, 1.72193e+12, 1.72188e+12…
```


```r
library(tidyverse)
players <- read_csv("dsci project/data/players.csv")
glimpse(players)
```

```
Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Rows: 196
Columns: 7
$ experience   <chr> "Pro", "Veteran", "Veteran", "Amateur", "Regular", "Amate…
$ subscribe    <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, T…
$ hashedEmail  <chr> "f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8…
$ played_hours <dbl> 30.3, 3.8, 0.0, 0.7, 0.1, 0.0, 0.0, 0.0, 0.1, 0.0, 1.6, 0…
$ name         <chr> "Morgan", "Christian", "Blake", "Flora", "Kylie", "Adrian…
$ gender       <chr> "Male", "Male", "Male", "Female", "Male", "Female", "Fema…
$ Age          <dbl> 9, 17, 17, 21, 21, 17, 19, 21, 47, 22, 23, 17, 25, 22, 17…
```


## Methods and Plan