# Final Report: Insert Title Name
---

## Introduction 
---

## Methods & Results
---

In [1]:
### Libraries
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


Begin by loading the `players.csv` dataset:

In [2]:
raw_players <- read_csv("data/players.csv", show_col_types = FALSE) # suppress the column specification message

Perform the following wrangling steps to tidy the data and prepare for $k$-nearest neighbours classification:
1. Rename `hashedEmail` and `Age` to `hashed_email` and `age` with `rename()` for consistency with the other column names.
2. Keep only the relevant predictor columns `age` and `played_hours` and the label column `subscribe` using `select()`.
3. Remove rows containing `NA` values using `drop_na()`. Missing values are assumed to occur at random rather than to be informative (DSCI 100 textbook, Ch. 5.7.3). <i><b>(requires proper citation)</b></i>
4. Convert the label `subscribe` from logical (`lgl`) to factor (`fct`) type with `mutate()` and `as_factor()`.
5. Recode the default factor levels `"TRUE"` and `"FALSE"` to `"Yes"` and `"No"` with `fct_recode()` so that the answer to the predictive question is clearer.

In [3]:
options(repr.matrix.max.rows = 6) # limit to 6 observations for brevity
players <- raw_players |>
    rename("hashed_email" = "hashedEmail", "age" = "Age") |>
    select(age, played_hours, subscribe) |>
    drop_na() |>
    mutate(subscribe = as_factor(subscribe)) |>
    mutate(subscribe = fct_recode(subscribe, "Yes" = "TRUE", "No" = "FALSE"))
players

| age | played_hours | subscribe |
|---|---|---|
| `<dbl>` | `<dbl>` | `<fct>` |
| 9 | 30.3 | Yes |
| 17 | 3.8 | Yes |
| 17 | 0.0 | No |
| ⋮ | ⋮ | ⋮ |
| 22 | 0.3 | No |
| 17 | 0.0 | No |
| 17 | 2.3 | No |


Before beginning the $k$-nearest neighbours analysis, compute some basic summary statistics for `players`:

In [4]:
players_summary <- players |>
    summarize(
        observation_count = n(),
        min_hours = min(played_hours),
        max_hours = max(played_hours),
        mean_hours = mean(played_hours),
        min_age = min(age),
        max_age = max(age),
        mean_age = mean(age)
    )
players_summary

| observation_count | min_hours | max_hours | mean_hours | min_age | max_age | mean_age |
|---|---|---|---|---|---|---|
| `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 194 | 0 | 223.1 | 5.904639 | 8 | 50 | 20.52062 |


The ranges and means of the two predictors differ greatly: `played_hours` spans 0 to 223.1 (mean about 5.9), while `age` spans 8 to 50 (mean about 20.5). Both predictors must therefore be centred and scaled so that they contribute equally to the Euclidean distance used by $k$-nearest neighbours.
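To illustrate why standardization matters, the following minimal sketch compares the Euclidean distance between two players before and after centring and scaling. It uses only the six rows visible in the `players` preview above, so the numbers are illustrative rather than results for the full dataset. The names `players_demo`, `players_scaled`, and the helper `euclidean_distance()` are introduced here for the example and are not part of the analysis.

```r
library(tidyverse)

# Illustrative subset only: the six rows shown in the `players` preview
players_demo <- tibble(
    age = c(9, 17, 17, 22, 17, 17),
    played_hours = c(30.3, 3.8, 0.0, 0.3, 0.0, 2.3)
)

# Centre each predictor to mean 0 and scale it to standard deviation 1
players_scaled <- players_demo |>
    mutate(across(c(age, played_hours), \(x) (x - mean(x)) / sd(x)))

# Hypothetical helper: Euclidean distance between rows i and j of a data frame
euclidean_distance <- function(df, i, j) {
    sqrt(sum((unlist(df[i, ]) - unlist(df[j, ]))^2))
}

# Distance between the first two players, on the raw and standardized scales
dist_raw <- euclidean_distance(players_demo, 1, 2)
dist_scaled <- euclidean_distance(players_scaled, 1, 2)
c(raw = dist_raw, scaled = dist_scaled)
```

On the raw values the distance is driven mostly by the gap in `played_hours`; after standardization the two predictors are on a comparable scale.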

Next, check for class imbalances:

In [6]:
total_number_of_observations <- nrow(players)
players_classes <- players |>
    group_by(subscribe) |>
    summarize(
        count = n(),
        percentage = n() / total_number_of_observations * 100
    )
players_classes

| subscribe | count | percentage |
|---|---|---|
| `<fct>` | `<int>` | `<dbl>` |
| No | 52 | 26.80412 |
| Yes | 142 | 73.19588 |
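The classes are imbalanced, with roughly three subscribers for every non-subscriber. One way to read these counts is as a majority-class baseline: a classifier that always predicts the most common class is correct exactly as often as that class occurs. The sketch below computes that baseline from the counts in the table above; `players_classes_demo` and `majority_baseline` are names introduced here for illustration only.

```r
library(tidyverse)

# Class counts copied from the `players_classes` output above
players_classes_demo <- tibble(
    subscribe = factor(c("No", "Yes")),
    count = c(52, 142)
)

# Proportion of each class, keeping only the most common one
majority_baseline <- players_classes_demo |>
    mutate(proportion = count / sum(count)) |>
    slice_max(proportion, n = 1)
majority_baseline
```

Any $k$-nearest neighbours model would need to beat this proportion in accuracy to be doing better than always guessing the majority class.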


## Discussion
---

## References
---