In [19]:
library(tidyverse)
library(repr)
library(broom)
library(rvest)
options(repr.matrix.max.rows = 196) # shows up to 196 rows of a data frame, enough for every row of players.csv

# Project Final Report

## Introduction

#### General Information
This final report analyzes datasets provided by a Computer Science research group at UBC, led by Frank Wood, which collected data on how people play video games. The datasets are `players.csv`, which contains general information about each participant, and `sessions.csv`, which contains information about each player's individual sessions.

#### Aim

The aim of this final report is to gain relevant information regarding two questions. First, a broad question:
* Which player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

and second, a more specific question:
* How accurately can age predict subscription to a game-related newsletter compared to hours played in `players.csv`?

Answering the specific question gives us insight into the broader one, which helps the computer science research group and their stakeholders understand their data.

#### Datasets

As stated in our specific question, we will be working with the `players.csv` dataset.

#### Data Characteristics  
The `players.csv` dataset represents 196 individuals who participated in the study, described by 7 features:
* `experience <chr>` - Player's experience
* `subscribe <lgl>` - Whether the player subscribed to a video game newsletter
* `hashedEmail <chr>` - Unique identifier for each player
* `played_hours <dbl>` - Total hours played
* `name <chr>` - First Name
* `gender <chr>` - Gender
* `age <dbl>` - Age in years

##### Data Summary
* `experience` is composed of
    * `Beginner`
    * `Amateur`
    * `Regular`
    * `Veteran`
    * `Pro`
* `subscribe` is composed of 52 `TRUE` and 144 `FALSE`  
* `gender` is composed of
    * `Male`
    * `Female`
    * `Non-binary`
    * `Two-Spirited`
    * `Agender`
    * `Other`
    * `Prefer not to say`
* `played_hours`
    * Maximum - 223.1 hours
    * Minimum - 0 hours
    * Mean - 5.85 hours
* `age`
    * Oldest - 50 years
    * Youngest - 8 years
    * Mean - 20.52 years
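
Summary figures like the ones above can be computed with a few `dplyr` calls. The sketch below is only an illustration: `players_demo` is a hypothetical, hand-made tibble standing in for the real `players` data frame, so its numbers will not match those reported above.

```r
library(dplyr)

# Hypothetical stand-in for `players`; values are invented for illustration only
players_demo <- tibble(
  subscribe    = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.0),
  age          = c(9, 17, 17, 21, NA)
)

# Number of players in each subscription status
subscribe_counts <- players_demo |>
  count(subscribe)

# Range and mean of the numeric columns; na.rm = TRUE skips missing ages
numeric_summary <- players_demo |>
  summarise(
    max_hours  = max(played_hours),
    min_hours  = min(played_hours),
    mean_hours = mean(played_hours),
    max_age    = max(age, na.rm = TRUE),
    min_age    = min(age, na.rm = TRUE),
    mean_age   = mean(age, na.rm = TRUE)
  )
```

Without `na.rm = TRUE`, any missing value in `age` would make the age summaries `NA`.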


##### Potential Issues
* The `age` column contains `NA` values.
* The dataset is male-dominated.
* The majority of players in the dataset are Amateur players.
* The order of the experience levels is not specified; we will assume `Beginner -> Amateur -> Regular -> Pro -> Veteran`.
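
One way to make the assumed experience ordering explicit in code is to store `experience` as an ordered factor. The sketch below is a minimal illustration on a hypothetical tibble (`players_demo` and `experience_levels` are names introduced here, not part of the original analysis), and the level order is the assumption stated above, not something the dataset specifies.

```r
library(dplyr)

# Assumed ordering from least to most experienced (not specified by the data)
experience_levels <- c("Beginner", "Amateur", "Regular", "Pro", "Veteran")

# Hypothetical rows standing in for `players`
players_demo <- tibble(
  experience = c("Pro", "Veteran", "Amateur", "Regular", "Beginner")
)

# Convert the character column into an ordered factor
players_ordered <- players_demo |>
  mutate(experience = factor(experience, levels = experience_levels, ordered = TRUE))

# Ordered factors support comparisons such as "more experienced than Regular"
above_regular <- players_ordered |>
  filter(experience > "Regular")
```

If the research group later confirms a different ordering, only `experience_levels` needs to change.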

## Methods and Results

Our specific question is primarily a predictive data analysis question. We want to determine whether `age` can predict whether a player will subscribe to a video game newsletter more accurately than `played_hours` can. Because `subscribe` has two categories, `TRUE` if the player subscribed to a video game newsletter and `FALSE` if the player didn't, we can separately analyze the accuracy of using `age` to predict `subscribe` and of using `played_hours` to predict `subscribe`.

To determine which classification method to use, we first need to look at the `age` and `played_hours` data.
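
A first look at the two predictors could compare them across the two `subscribe` groups. The sketch below is one possible way to do this, not the method used in this report; `players_demo` is a hypothetical tibble with invented values standing in for `players`.

```r
library(dplyr)

# Hypothetical stand-in for `players`; values are invented for illustration only
players_demo <- tibble(
  subscribe    = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE),
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.0, 0.2),
  age          = c(9, 17, 17, 21, NA, 22)
)

# Summarise each predictor separately for subscribers and non-subscribers
by_subscription <- players_demo |>
  group_by(subscribe) |>
  summarise(
    n            = n(),
    mean_age     = mean(age, na.rm = TRUE),
    median_hours = median(played_hours),
    missing_age  = sum(is.na(age))
  )
```

The median is used for `played_hours` because a few very large values can pull the mean far from a typical player.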

In [20]:
# Reading and viewing the players.csv data
url <- "https://raw.githubusercontent.com/ckwok07/DSCI-100-Project-Final-Report/refs/heads/main/data/players.csv"
players <- read_delim(url, delim = ",", skip = 1)
players

# Rows with a missing age, to see how many NA values need handling
players_age_na <- players |>
    filter(is.na(age))
players_age_na

Rows: 196 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, age
lgl (3): subscribe, individualId, organizationName

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| experience | subscribe | hashedEmail | played_hours | name | gender | age | individualId | organizationName |
|---|---|---|---|---|---|---|---|---|
| `<chr>` | `<lgl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` | `<lgl>` | `<lgl>` |
| Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Morgan | Male | 9 | | |
| Veteran | True | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Christian | Male | 17 | | |
| Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Blake | Male | 17 | | |
| Amateur | True | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5 | 0.7 | Flora | Female | 21 | | |
| Regular | True | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e | 0.1 | Kylie | Male | 21 | | |
| Amateur | True | f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977 | 0.0 | Adrian | Female | 17 | | |
| Regular | True | 8e594b8953193b26f498db95a508b03c6fe1c24bb5251d392c18a0da9a722807 | 0.0 | Luna | Female | 19 | | |
| Amateur | False | 1d2371d8a35c8831034b25bda8764539ab7db0f63938696917c447128a2540dd | 0.0 | Emerson | Male | 21 | | |
| Amateur | True | 8b71f4d66a38389b7528bb38ba6eb71157733df7d1740371852a797ae97d82d1 | 0.1 | Natalie | Male | 17 | | |
| Veteran | True | bbe2d83de678f519c4b3daa7265e683b4fe2d814077f9094afd11d8f217039ec | 0.0 | Nyla | Female | 22 | | |


| experience | subscribe | hashedEmail | played_hours | name | gender | age | individualId | organizationName |
|---|---|---|---|---|---|---|---|---|
| `<chr>` | `<lgl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` | `<lgl>` | `<lgl>` |


## Discussion