## 1) Dataset Description

### Player.csv
This dataset has 196 observations and 7 variables, described in Table 1.1 below. The mean player age is 21.14 years and the mean play time is 5.85 hours, as computed in section 3) Exploratory Data Analysis and Visualization.

Table 1.1:
| Variable | Type | Meaning |
|---|---|---|
| experience | chr | experience level of player |
| subscribe | lgl | whether the player is subscribed to the newsletter |
| hashedEmail | chr | hashed player email |
| played_hours | dbl | total hours played in game |
| name | chr | player name |
| gender | chr | player gender |
| Age | dbl | player age in years (read by `read_csv` as dbl) |

Issues and Potential issues:

- It is unclear how a player's experience level is determined. It does not appear to depend on played_hours.
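
One way to check the observation above is to compare play time across experience levels. The sketch below is only illustrative: `toy_players`, its experience labels, and its hours are made-up stand-ins, not values from players.csv. With the real data, `toy_players` would be replaced by the loaded players table.

```r
library(dplyr)

# Hypothetical stand-in for players.csv; labels and hours are invented for illustration.
toy_players <- tibble(
  experience   = c("Amateur", "Amateur", "Pro", "Pro", "Veteran", "Veteran"),
  played_hours = c(0.5, 30.0, 1.0, 2.0, 0.0, 4.0)
)

# Summarize played_hours within each experience level.
hours_by_experience <- toy_players |>
  group_by(experience) |>
  summarize(
    n_players    = n(),
    mean_hours   = mean(played_hours, na.rm = TRUE),
    median_hours = median(played_hours, na.rm = TRUE)
  )

hours_by_experience
```

If the group means and medians do not increase with the experience label, that would support the impression that experience is not derived from play time.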

How the data was collected:

The data was collected through a Minecraft server run by a research group in Computer Science at UBC, led by Frank Wood.


### Sessions.csv
This data set has 1535 observations and 5 variables, described in Table 2.1 below.

Table 2.1:
| Variable | Type | Meaning |
|---|---|---|
| hashedEmail | chr | hashed player email |
| start_time | chr | start date and time of session in game |
| end_time | chr | end date and time of session in game |
| original_start_time | dbl | start_time as a UNIX timestamp |
| original_end_time | dbl | end_time as a UNIX timestamp |

Issues and Potential issues:

- Issue with original_start_time and original_end_time: the UNIX values are stored with reduced precision in the original dataset, so the start and end of a session often share the same value.
- Issue with start_time and end_time: the data is not tidy, because each cell holds two values (a date and a time).
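
Both issues can be illustrated with a single session. The sketch below uses the first row shown in the tidied output in section 3 and base R only. It assumes that the timestamps are in milliseconds and that the times are in UTC; neither is documented in the data.

```r
# One session, with the date and time still combined as in sessions.csv.
start_time <- "30/06/2024 18:12"
end_time   <- "30/06/2024 18:24"
original_start_time <- 1719770000000
original_end_time   <- 1719770000000

# Parse the day/month/year hour:minute strings (time zone assumed to be UTC).
start_dt <- as.POSIXct(start_time, format = "%d/%m/%Y %H:%M", tz = "UTC")
end_dt   <- as.POSIXct(end_time,   format = "%d/%m/%Y %H:%M", tz = "UTC")

# Session length in minutes from the parsed strings.
duration_from_strings <- as.numeric(difftime(end_dt, start_dt, units = "mins"))

# Session length in minutes from the UNIX columns (assumed to be milliseconds).
duration_from_unix <- (original_end_time - original_start_time) / 1000 / 60

duration_from_strings
duration_from_unix
```

For this row, the parsed strings give a 12-minute session while the UNIX columns give 0, which suggests that session lengths are better derived from start_time and end_time.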

How the data was collected:

The data was collected through a Minecraft server run by a research group in Computer Science at UBC, led by Frank Wood.

## 2) Questions:

Question 1: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

 “Can [explanatory variable(s)] predict [response variable] in [dataset]?”

## 3) Exploratory Data Analysis and Visualization

Make a few exploratory visualizations of the data to help you understand it.
Use our visualization best practices to make high-quality plots (make sure to include labels, titles, units of measurement, etc)

Explain any insights you gain from these plots that are relevant to address your question
Note: do not perform any predictive analysis here. We are asking for an exploration of the relevant variables to demonstrate that you understand them well before performing any additional modelling, and to identify potential problems you anticipate encountering.

In [1]:
# load packages
library(tidyverse)
library(repr)
library(tidymodels)
library(dplyr)


── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom       1.0.6     ✔ rsample

In [2]:
# load datasets
player_data <- read_csv("data/players.csv")
sessions_data <- read_csv("data/sessions.csv")

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [7]:
# Observations
play_rows <- nrow(player_data)
print(play_rows)
sess_rows <- nrow(sessions_data)
print(sess_rows)

[1] 196
[1] 1535


In [10]:
# tidy sessions.csv
options(scipen = 999)

sessions_start <- separate(sessions_data, 
         col = start_time,
         into = c("start_date", "start_time"),
         sep = " "
         )
sessions_tidy <- separate(sessions_start, 
         col = end_time,
         into = c("end_date", "end_time"),
         sep = " "
         )
sessions_tidy

hashedEmail,start_date,start_time,end_date,end_time,original_start_time,original_end_time
<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,30/06/2024,18:12,30/06/2024,18:24,1719770000000,1719770000000
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,17/06/2024,23:33,17/06/2024,23:46,1718670000000,1718670000000
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,25/07/2024,17:34,25/07/2024,17:57,1721930000000,1721930000000
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,25/07/2024,03:22,25/07/2024,03:58,1721880000000,1721880000000
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,25/05/2024,16:01,25/05/2024,16:12,1716650000000,1716650000000
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,23/06/2024,15:08,23/06/2024,17:10,1719160000000,1719160000000
fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33cbb5e894a3867ca44d,15/04/2024,07:12,15/04/2024,07:21,1713170000000,1713170000000
ad6390295640af1ed0e45ffc58a53b2d9074b0eea694b16210addd44d7c81f83,21/09/2024,02:13,21/09/2024,02:30,1726880000000,1726890000000
96e190b0bf3923cd8d349eee467c09d1130af143335779251492eb4c2c058a5f,21/06/2024,02:31,21/06/2024,02:49,1718940000000,1718940000000
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,16/05/2024,05:13,16/05/2024,05:52,1715840000000,1715840000000


In [9]:
# summarize players.csv

player_summary <- summarize(player_data,
                            Avg_play_hours = mean(played_hours, na.rm = TRUE),
                            Avg_age = mean(Age, na.rm = TRUE)
                            )
player_summary

Avg_play_hours,Avg_age
<dbl>,<dbl>
5.845918,21.13918


## 4) Methods and Plan
Propose one method to address your question of interest using the selected dataset and explain why it was chosen. Do not perform any modelling or present results at this stage. We are looking for high-level planning regarding model choice and justifying that choice.

In your explanation, respond to the following questions:

Why is this method appropriate?
Which assumptions are required, if any, to apply the method selected?
What are the potential limitations or weaknesses of the method selected?
How are you going to compare and select the model?
How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross validation?
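
As a starting point for the data-processing questions above, the sketch below shows one possible setup: a stratified train/test split followed by cross-validation folds on the training set, using `rsample` from tidymodels. It is not the chosen plan. The 75/25 proportion, the 5 folds, the seed, and the `toy_players` data frame with its invented values are all assumptions made for illustration, and no model is fitted.

```r
library(rsample)

set.seed(2024)  # arbitrary seed so the split is reproducible

# Hypothetical stand-in for players.csv; all values are invented for illustration.
toy_players <- data.frame(
  subscribe    = factor(rep(c(TRUE, FALSE), times = c(28, 12))),
  played_hours = round(runif(40, min = 0, max = 20), 1),
  Age          = sample(15:40, size = 40, replace = TRUE)
)

# Split once, before any preprocessing, stratifying on the response variable.
player_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
player_train <- training(player_split)
player_test  <- testing(player_split)

# Cross-validation folds built from the training set only.
player_folds <- vfold_cv(player_train, v = 5, strata = subscribe)

nrow(player_train)
nrow(player_test)
```

Stratifying on `subscribe` keeps the proportion of subscribers similar in the training and test sets, which matters if the classes are imbalanced.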

## 5) GitHub Repository
https://github.com/ericayagi/toy_ds_project