# Individual Planning Report

Please install the libraries listed in the cell below before running this notebook.

In [1]:
### Run this cell before continuing. 
library(tidyverse)
library(repr)
library(janitor)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: ‘janitor’


The following objects are masked from ‘package:stats’:

    chisq.test, fisher.test




## 1. Data Description

### Dataset: players.csv

This report explores data collected by a UBC research group on how people play video games, focusing on a Minecraft server they set up. The first dataset (`players.csv`) contains 196 observations, with each row representing one player (see Table 1 for a detailed description of each variable). Several issues were also identified and will be addressed in Part 3:

- Variable names are not in a standardized format.
- `experience` and `gender` should be treated as factor variables.

The code below first reads the dataset using the shortest relative path and then calculates the minimum and maximum values for the two numeric variables, `played_hours` and `Age`.

In [2]:
players <- read_csv("data/players.csv")

players_min_max <- players |>
    summarize(played_hours_min = round(min(played_hours, na.rm = TRUE), 2),
              age_min = round(min(Age, na.rm = TRUE), 2),
              played_hours_max = round(max(played_hours, na.rm = TRUE), 2),
              age_max = round(max(Age, na.rm = TRUE), 2))
players_min_max

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


played_hours_min,age_min,played_hours_max,age_max
<dbl>,<dbl>,<dbl>,<dbl>
0,9,223.1,58


#### Table 1: Variable Description of Dataset players
| Name | Type | Description | Min | Max |
| ---- | ---- | ---- | ---- | ---- |
| `experience` | chr | Player experience level (Amateur, Beginner, Pro, Regular, and Veteran) | N/A | N/A |
| `subscribe` | lgl | Whether the player has subscribed to a game-related newsletter | N/A | N/A |
| `hashedEmail` | chr | Anonymized email identifier for each player | N/A | N/A |
| `played_hours` | dbl | Number of hours the player played | 0 | 223.1 |
| `name` | chr | Player's name | N/A | N/A |
| `gender` | chr | Player's gender | N/A | N/A |
| `Age` | dbl | Player's age in years | 9 | 58 |


### Dataset: sessions.csv

The second dataset (`sessions.csv`) contains 1535 observations, with each row representing an individual play session by one player (see Table 2 for a detailed description of each variable). Several issues were also identified:
- The variable name `hashedEmail` is not in a standardized format.
- To compute the minimum and maximum of `start_time` and `end_time`, they must first be converted from character to the POSIXct type with the `as.POSIXct()` function, which stores a date-time as the number of seconds since 1970-01-01 00:00:00 UTC.

The code below first reads the dataset using the shortest relative path and then calculates the minimum and maximum values for the two numeric variables, `original_start_time` and `original_end_time`.

In [61]:
sessions <- read_csv("data/sessions.csv")

sessions_min_max <- sessions |>
    summarize(original_start_time_min = round(min(original_start_time, na.rm = TRUE), 2),
    original_end_time_min = round(min(original_end_time, na.rm = TRUE),2),
    original_start_time_max = round(max(original_start_time, na.rm = TRUE), 2),
    original_end_time_max = round(max(original_end_time, na.rm = TRUE),2))
sessions_min_max

original_start_time_min,original_end_time_min,original_start_time_max,original_end_time_max
<dbl>,<dbl>,<dbl>,<dbl>
1712400000000.0,1712400000000.0,1727330000000.0,1727340000000.0


#### Table 2: Variable Description of Dataset sessions
| Name | Type | Description | Min | Max |
| ---- | ---- | ---- | ---- | ---- |
| `hashedEmail` | chr | Player's hashed email address | N/A | N/A |
| `start_time` | chr | Time at which the play session started | N/A | N/A |
| `end_time` | chr | Time at which the play session ended | N/A | N/A |
| `original_start_time` | dbl | Session start time as a numeric timestamp (the magnitude suggests milliseconds since the Unix epoch) | 1.7124e+12 | 1.72733e+12 |
| `original_end_time` | dbl | Session end time as a numeric timestamp (the magnitude suggests milliseconds since the Unix epoch) | 1.7124e+12 | 1.72734e+12 |
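
The time conversion described above can be sketched as follows. This is only an illustration on made-up values: the day/month/year layout of the strings and the millisecond unit of the numeric timestamps are assumptions that should be checked against the actual contents of `sessions.csv`, and the vector names are hypothetical.

```r
# Made-up example strings; the "%d/%m/%Y %H:%M" layout is an assumption.
start_time <- c("30/06/2024 18:12", "17/06/2024 23:33")

# as.POSIXct() parses the strings into date-times, stored internally as
# seconds since 1970-01-01 00:00:00 UTC.
start_parsed <- as.POSIXct(start_time, format = "%d/%m/%Y %H:%M", tz = "UTC")

# Once parsed, min() and max() work as they do for any numeric variable.
min(start_parsed)
max(start_parsed)

# Assumption: the original_* columns are milliseconds since the epoch,
# so they are divided by 1000 to obtain seconds before converting.
original_start_time <- c(1.7124e12, 1.72733e12)
original_start_parsed <- as.POSIXct(original_start_time / 1000,
                                    origin = "1970-01-01", tz = "UTC")
original_start_parsed
```

If the parsed values come back as `NA`, the format string does not match the data and needs to be adjusted.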

## 2. Questions

My **broad question** is Question 2: “Which ‘kinds’ of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts?”

My **specific question** is: “Can players’ gender, experience level, age, and newsletter subscription status predict the total number of hours they played?” In this context, players who have played for a greater number of hours are considered to contribute a “larger amount of data”, while those with fewer hours are considered to contribute less.

To answer this question, `players.csv` will be used, as it contains both the explanatory variables (`gender`, `experience`, `Age`, `subscribe`) and the response variable of interest (`played_hours`). To prepare the data for a multivariable linear regression analysis, the data will be wrangled in the following steps:
1. Put variable names into standardized format using `clean_names()`.
2. Convert `experience`, `gender`, and `subscribe` into factor variables using `as_factor()`.
3. Produce a final dataset that only includes:
    - `played_hours` (numeric)
    - `gender` (factor)
    - `experience` (factor)
    - `age` (numeric)
    - `subscribe` (factor)
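
As a rough illustration of the kind of model this wrangled dataset is meant to feed, the sketch below fits a multivariable linear regression with base R's `lm()`. The data frame `toy_players` and its values are invented for this example and are not taken from `players.csv`; `gender` is left out only to keep the toy data small. The planned analysis may well use a different modelling workflow.

```r
# Invented toy data for illustration only (not values from players.csv).
toy_players <- data.frame(
  played_hours = c(0.5, 2.1, 0.0, 10.3, 1.2, 4.8, 0.3, 7.9),
  age          = c(17, 21, 19, 25, 22, 30, 18, 27),
  experience   = factor(c("Amateur", "Pro", "Amateur", "Veteran",
                          "Regular", "Pro", "Regular", "Veteran")),
  subscribe    = factor(c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE))
)

# Numeric response regressed on one numeric and two factor predictors.
# lm() expands each factor into dummy variables automatically.
toy_fit <- lm(played_hours ~ age + experience + subscribe, data = toy_players)

# One intercept, one slope for age, and one coefficient per non-reference
# factor level.
coef(toy_fit)
```

Storing the categorical predictors as factors is what lets the model treat each level as its own category rather than as text.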

## 3. Exploratory Data Analysis and Visualization

Following the steps above, we convert `players` into a tidy tibble (`players_aggregate`):

In [5]:
players_aggregate <- players |>
    clean_names() |>
    mutate(experience = as_factor(experience), 
           gender = as_factor(gender),
           subscribe = as_factor(subscribe)) |>
    select(played_hours, gender, experience, age, subscribe)
players_aggregate

played_hours,gender,experience,age,subscribe
<dbl>,<fct>,<fct>,<dbl>,<fct>
30.3,Male,Pro,9,TRUE
3.8,Male,Veteran,17,TRUE
0.0,Male,Veteran,17,FALSE
0.7,Female,Amateur,21,TRUE
0.1,Male,Regular,21,TRUE
0.0,Female,Amateur,17,TRUE
0.0,Female,Regular,19,TRUE
0.0,Male,Amateur,21,FALSE
0.1,Male,Amateur,47,TRUE
0.0,Female,Veteran,22,TRUE


The code below computes the mean of each quantitative variable in the `players.csv` dataset (`played_hours` and `age`); the results are reported in the table that follows.


In [35]:
# Named players_mean rather than mean to avoid masking base::mean()
players_mean <- players_aggregate |>
    summarize(played_hours_mean = mean(played_hours, na.rm = TRUE),
              age_mean = mean(age, na.rm = TRUE))
players_mean

played_hours_mean,age_mean
<dbl>,<dbl>
5.845918,21.13918


#### Table 3: Mean values
| Played Hours | Age |
| ---- | ----|
| 5.845918 | 21.13918|

## 4. Methods and Plan