<h2>Introduction</h2> Researchers at UBC, led by Frank Wood, collected data on how people play video games by setting up Minecraft servers and recording player activity. One of their goals was to determine which player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how these features differ between player types.

In this proposal, we will focus on the specific question: 

<em>Which set of variables is more predictive of subscription: Player Characteristics (age and gender) or Player Behaviors (total game sessions and average length of gaming sessions)?</em>

<h2>Dataset Descriptions</h2> For this inquiry, we will combine two datasets: <br>
    <ul>
    <li>players.csv: data about each unique player</li>
    <li>sessions.csv: data on individual gaming sessions</li>
</ul>

<h4>players.csv Description</h4>

players.csv contains data on 196 unique players, collected through a self-reported survey and records of players' actions.

| Variable Name | Data Type | Meaning |
| -------- | ------- | --------- |
| experience | chr | Player's self-reported skill level |
| subscribe | lgl | Whether the player subscribed to the newsletter |
| hashedEmail | chr | Anonymized, unique player ID |
| played_hours | dbl | Total hours played |
| name | chr | Player's name |
| gender | chr | Player's self-reported gender |
| Age | dbl | Player's age in years |

Issues: <br>
    <ul>
    <li>Missing values: 2 missing Age values</li>
    <li>Inherent self-report bias</li>
</ul>

<h4>sessions.csv Description</h4>

sessions.csv contains 1535 records, one per game session, each tagged with the ID of the player it belongs to. The data were collected by recording players' play times.

| Variable Name | Data Type | Meaning |
| -------- | ------- | --------- |
| hashedEmail | chr | Anonymized, unique player ID |
| start_time | chr | Timestamp for start of session |
| end_time | chr | Timestamp for end of session |
| original_start_time | dbl | Unix timestamp for session start |
| original_end_time | dbl | Unix timestamp for session end |


Issues: <br>
    <ul>
    <li>Missing values: 2 missing end_time values</li>
    <li>Data are not tidy: multiple rows for a single player</li>
    <li>Non-participating players: only 125 of 196 (64%) registered players recorded at least one session</li>
</ul>
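
The first two issues interact, so a small sketch on made-up data may help. The `toy_sessions` rows, the IDs, and the timestamps below are hypothetical, and the day/month/year format is an assumption based on the `dmy_hm()` parser used later in this proposal. It shows one way to collapse several session rows into one row per player: a session with a missing `end_time` still counts toward the session total, but it cannot contribute to the average length.

```r
library(tidyverse)
library(lubridate)

# Hypothetical sessions: player "a1" has two sessions, one with no end_time
toy_sessions <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time  = c("30/06/2024 18:12", "01/07/2024 10:00", "02/07/2024 09:30"),
  end_time    = c("30/06/2024 18:24", NA, "02/07/2024 10:00")
)

per_player <- toy_sessions |>
  mutate(
    start_time = dmy_hm(start_time),
    end_time = dmy_hm(end_time),
    # NA end_time gives an NA session length
    session_length = as.numeric(difftime(end_time, start_time, units = "mins"))
  ) |>
  group_by(hashedEmail) |>
  summarise(
    total_sessions = n(),  # counts every row, including the incomplete one
    avg_session_time_mins = mean(session_length, na.rm = TRUE)
  )

per_player
# a1: 2 sessions, average 12 minutes (only the complete session is averaged)
# b2: 1 session, average 30 minutes
```

Whether an incomplete session should count toward `total_sessions` is a judgment call; filtering those rows out before summarising would be an equally defensible choice.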

<h4>Summary Statistics</h4>
<br>
    <ul>
    <li>Subscription rate: 73.47% (144/196)</li>
    <li>Mean played_hours: 5.85</li>
    <li>played_hours range: 0-223.1</li>
    <li>Mean age: 21.14</li>
    <li>Age range: 9-58</li>
    <li>Average sessions per active player: 12.28</li>
    <li>Average sessions per registered player: 7.83</li>
</ul>
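
The per-player session averages and the participation rate follow directly from the record counts given in the dataset descriptions. As a quick check, the arithmetic can be reproduced in base R. The variable names here are illustrative only.

```r
# Counts reported in the dataset descriptions
n_players  <- 196   # rows in players.csv
n_active   <- 125   # players with at least one recorded session
n_sessions <- 1535  # rows in sessions.csv

participation_pct   <- round(100 * n_active / n_players, 2)  # 63.78, i.e. about 64%
sessions_per_active <- round(n_sessions / n_active, 2)       # 12.28
sessions_per_player <- round(n_sessions / n_players, 2)      # 7.83

print(c(participation_pct, sessions_per_active, sessions_per_player))
```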

<h2>Exploratory Data Analysis and Visualization</h2> We begin by loading the required packages into R.

In [1]:
library(tidyverse)
library(lubridate) # needed for tidying the sessions.csv dataset later

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


Now we load players.csv. It needs a little tidying: we drop the rows with the missing (NA) values mentioned earlier.

In [4]:
players <- read_csv("players.csv") |>
  drop_na() # drops every row that has an NA in any column
players

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| experience | subscribe | hashedEmail | played_hours | name | gender | Age |
| -------- | ------- | --------- | ------- | ------- | ------- | ------- |
| `<chr>` | `<lgl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| Pro | TRUE | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Morgan | Male | 9 |
| Veteran | TRUE | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Christian | Male | 17 |
| Veteran | FALSE | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Blake | Male | 17 |
| Amateur | TRUE | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5 | 0.7 | Flora | Female | 21 |
| Regular | TRUE | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e | 0.1 | Kylie | Male | 21 |
| Amateur | TRUE | f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977 | 0.0 | Adrian | Female | 17 |
| Regular | TRUE | 8e594b8953193b26f498db95a508b03c6fe1c24bb5251d392c18a0da9a722807 | 0.0 | Luna | Female | 19 |
| Amateur | FALSE | 1d2371d8a35c8831034b25bda8764539ab7db0f63938696917c447128a2540dd | 0.0 | Emerson | Male | 21 |
| Amateur | TRUE | 8b71f4d66a38389b7528bb38ba6eb71157733df7d1740371852a797ae97d82d1 | 0.1 | Natalie | Male | 47 |
| Veteran | TRUE | bbe2d83de678f519c4b3daa7265e683b4fe2d814077f9094afd11d8f217039ec | 0.0 | Nyla | Female | 22 |
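
Called with no arguments, `drop_na()` removes a row if any column is missing. Since the only missing values reported for players.csv are in `Age`, the result here should be the same either way, but naming the column makes the intent explicit. The sketch below illustrates the difference on a made-up tibble. `toy_players` and its values are hypothetical, not taken from the dataset.

```r
library(tidyverse)

# Hypothetical data: one row missing played_hours, one row missing Age
toy_players <- tibble(
  name         = c("A", "B", "C", "D"),
  played_hours = c(1.5, 0, NA, 2.0),
  Age          = c(17, NA, 21, 30)
)

all_cols <- drop_na(toy_players)       # drops rows B and C (any NA)
age_only <- drop_na(toy_players, Age)  # drops only row B (NA in Age)

nrow(all_cols)  # 2
nrow(age_only)  # 3
```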


<h4>Mean Values of players.csv</h4>

| Variable | Mean Value | 
| -------- | ------- |
| played_hours | 5.85 |
| Age | 21.14   |

Next we load sessions.csv. It needs tidying because a single player can have several rows, one per session.

In [5]:
sessions <- read_csv("sessions.csv") |>
  # drop the raw Unix timestamps, which are not needed here
  select(-original_start_time, -original_end_time) |>
  # use lubridate to parse the timestamp strings into date-times
  mutate(
    start_time = dmy_hm(start_time),
    end_time = dmy_hm(end_time)
  ) |>
  # length of each session in minutes
  mutate(session_length = as.numeric(difftime(end_time, start_time, units = "mins"))) |>
  # collapse to one row per player
  group_by(hashedEmail) |>
  summarise(
    total_sessions = n(),
    # na.rm = TRUE skips sessions with a missing end_time
    avg_session_time_mins = mean(session_length, na.rm = TRUE)
  )

sessions

Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| hashedEmail | total_sessions | avg_session_time_mins |
| -------- | ------- | --------- |
| `<chr>` | `<int>` | `<dbl>` |
| 0088b5e134c3f0498a18c7ea6b8d77b4b0ff1636fc93355ccc95b45423367832 | 2 | 53.000000 |
| 060aca80f8cfbf1c91553a72f4d5ec8034764b05ab59fe8e1cf0eee9a7b67967 | 1 | 30.000000 |
| 0ce7bfa910d47fc91f21a7b3acd8f33bde6db57912ce0290fa0437ce0b97f387 | 1 | 11.000000 |
| 0d4d71be33e2bc7266ee4983002bd930f69d304288a8663529c875f40f1750f3 | 13 | 32.153846 |
| 0d70dd9cac34d646c810b1846fe6a85b9e288a76f5dcab9c1ff1a0e7ca200b3a | 2 | 35.000000 |
| 11006065e9412650e99eea4a4aaaf0399bc338006f85e80cc82d18b49f0e2aa4 | 1 | 10.000000 |
| 119f01b9877fc5ea0073d05602a353b91c4b48e4cf02f42bb8d661b46a34b760 | 1 | 50.000000 |
| 18936844e06b6c7871dce06384e2d142dd86756941641ef39cf40a9967ea14e3 | 41 | 29.682927 |
| 1a2b92f18f36b0b59b41d648d10a9b8b20a2adff550ddbcb8cec2f47d4d881d0 | 1 | 18.000000 |
| 1d2371d8a35c8831034b25bda8764539ab7db0f63938696917c447128a2540dd | 1 | 5.000000 |
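
Because the two datasets share the `hashedEmail` key, one possible way to combine them is a left join from the player table onto the per-player session summary. This is a sketch under assumptions, not necessarily the approach taken in the rest of this proposal: the toy tibbles and the name `combined` are hypothetical, and treating players with no recorded sessions as having zero sessions is a modelling choice.

```r
library(tidyverse)

# Hypothetical stand-ins for the tidied players and sessions tables
toy_players <- tibble(
  hashedEmail = c("a1", "b2", "c3"),
  subscribe   = c(TRUE, FALSE, TRUE)
)
toy_sessions <- tibble(
  hashedEmail           = c("a1", "c3"),
  total_sessions        = c(2L, 13L),
  avg_session_time_mins = c(53, 32.2)
)

combined <- toy_players |>
  # keep every player, including those who never started a session
  left_join(toy_sessions, by = "hashedEmail") |>
  # players absent from the session summary get a session count of 0
  mutate(total_sessions = replace_na(total_sessions, 0L))

combined
```

In this sketch the average session length stays `NA` for player "b2", since there is no session to average over. Those rows would need to be dropped or imputed before fitting any model that cannot handle missing values.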


Now that we have 