**Group Members:** Jojo Hu (47939335), (TODO: add your names)

In [18]:
library(tidyverse)
library(tidymodels)

# Introduction

## Background Information
A UBC Computer Science research group, led by Frank Wood, is studying how people play Minecraft by collecting player activity data from a specific server. To run the project effectively, they need to recruit the right players and manage various software and hardware resources. They want to know which "kinds" of players are most likely to contribute a large amount of data so that they can target those players in their recruiting efforts.

## Research Question
We want to know: can we predict total playtime based on player age and newsletter subscription status? We will investigate this research question using KNN regression on the `players.csv` dataset.
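
To make the method concrete, here is a minimal base R sketch of what KNN regression computes: the prediction for a new player is the mean playtime of the k closest players in predictor space. The helper `knn_predict()` and the toy values are hypothetical illustrations, not the project's data or its final tidymodels workflow.

```r
# Hypothetical helper: predict a numeric outcome as the mean of the
# k nearest training points (Euclidean distance on the predictors).
knn_predict <- function(train_x, train_y, new_x, k = 3) {
    train_x <- as.matrix(train_x)
    # distance from the new observation to every training row
    dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
    # indices of the k closest training rows
    nearest <- order(dists)[seq_len(k)]
    mean(train_y[nearest])
}

# Toy data (made up for illustration): age and subscription status (0/1)
toy_x <- data.frame(age = c(17, 18, 19, 30, 31), subscribe = c(1, 1, 1, 0, 0))
toy_y <- c(10, 12, 14, 1, 2)  # played hours

toy_pred <- knn_predict(toy_x, toy_y, new_x = c(18, 1), k = 3)
toy_pred
```

In practice the predictors would be standardized first so that a wide-ranging variable such as age does not dominate the distance.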

## Dataset
First, we load the data, taking care to convert the `experience` and `gender` columns to factors.

In [7]:
players <- read_csv("https://raw.githubusercontent.com/Yh194/ubc-dsci100-21/refs/heads/main/data/players.csv") |>
    mutate(across(c(experience, gender), as.factor))
# order experience levels
players$experience <- factor(players$experience, levels = c('Beginner', 'Amateur', 'Regular', 'Pro', 'Veteran'))

# list possible values for factors and their distribution
table(players$gender)
table(players$experience)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.



          Agender            Female              Male        Non-binary 
                2                37               124                15 
            Other Prefer not to say      Two-Spirited 
                1                11                 6 


Beginner  Amateur  Regular      Pro  Veteran 
      35       63       36       14       48 

The `players.csv` file is a comma-separated values (CSV) file with 196 rows (one per player) and 7 columns.

### Specification

| Column Name      | Data Type | Meaning                                                   | Possible values/constraints                                                             |
|------------------|-----------|-----------------------------------------------------------|-----------------------------------------------------------------------------------------|
| **experience**   | `fct`     | How much experience the player has in the game            | 'Beginner', 'Amateur', 'Regular', 'Pro', 'Veteran'                                      |
| **subscribe**    | `lgl`     | Whether or not the player is subscribed to the newsletter | `TRUE`, `FALSE`                                                                         |
| **hashedEmail**  | `chr`     | The hash of the player's email, serving as a unique ID    |                                                                                         |
| **played_hours** | `dbl`     | The total number of hours the player has played           | Non-negative                                                                            |
| **name**         | `chr`     | The player's name                                         |                                                                                         |
| **gender**       | `fct`     | The player's gender                                       | 'Male', 'Female', 'Non-binary', 'Prefer not to say', 'Agender', 'Two-Spirited', 'Other' |
| **Age**          | `dbl`     | The player's age                                          | Non-negative; contains `NA` values                                                      |

### Irrelevant Variables
`name` and `hashedEmail` are identifiers that are unique to each player. They carry no information that generalizes across players, so we exclude them from our analysis.
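
One way to drop these identifier columns is sketched below. The `toy_players` tibble is a made-up stand-in for `players`, and `toy_model_data` is a hypothetical name used only for this illustration.

```r
library(dplyr)

# Toy stand-in for `players` (values are made up for illustration)
toy_players <- tibble(
    name = c("A", "B"),
    hashedEmail = c("x1", "y2"),
    played_hours = c(0.5, 3.2),
    Age = c(17, 21)
)

# Drop the identifier columns before modelling
toy_model_data <- toy_players |>
    select(-name, -hashedEmail)

names(toy_model_data)
```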

### Potential Issues
- Played hours may be inaccurate because players might stay idle/AFK within the game, which would artificially inflate their playtime
- Some players have disproportionately high playtime compared to others, which could skew averages
- Players could lie about or omit information such as age, gender, etc.
- There are far more male players (124 of 196) than players of any other gender, which could introduce bias
- There are `NA` values present in the `Age` column
- Skill level/experience is a very subjective column
  - Different people may have different opinions about what level of skill constitutes a "Pro", for example
  - The ranking is ambiguous: you can guess which value is more skilled than another, but the order is not clearly defined
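
Two of these issues, missing ages and skewed playtime, can be checked directly. The sketch below uses a made-up `toy_players` tibble rather than the real data, so the numbers are purely illustrative; it counts the `NA` ages and compares the mean and median playtime, which differ sharply when a few players have very high playtime.

```r
library(dplyr)

# Toy stand-in for `players` (values are made up for illustration)
toy_players <- tibble(
    played_hours = c(0, 0.1, 0.5, 1, 150),
    Age = c(17, NA, 21, 19, 25)
)

toy_summary <- toy_players |>
    summarize(
        missing_age = sum(is.na(Age)),          # number of NA ages
        mean_hours = mean(played_hours),        # pulled up by the outlier
        median_hours = median(played_hours)     # robust to the outlier
    )

toy_summary
```

The same `summarize()` call could be run on `players` to see how strongly the high-playtime players pull the mean away from the median.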

# Methods and Results

The data is already cleaned and wrangled, so we can move on to (TODO)

In [20]:
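# preprocessing recipe: predict played_hours from experience and subscribe
exp_recipe <- recipe(played_hours ~ experience + subscribe, players) |>
    step_bin2factor(subscribe) |>          # logical -> factor with levels "yes"/"no"
    step_dummy(all_predictors()) |>        # factors -> 0/1 indicator columns
    step_normalize(all_predictors())       # centre and scale every predictor

# preview the preprocessed data
exp_recipe |> prep() |> bake(players) |> head()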

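| played_hours | experience_Amateur | experience_Regular | experience_Pro | experience_Veteran | subscribe_no |
|--------------|--------------------|--------------------|----------------|--------------------|--------------|
| `<dbl>`      | `<dbl>`            | `<dbl>`            | `<dbl>`        | `<dbl>`            | `<dbl>`      |
| 30.3         | -0.6864892         | -0.47313           | 3.5963417      | -0.5680401         | -0.5993903   |
| 3.8          | -0.6864892         | -0.47313           | -0.2766417     | 1.7514571          | -0.5993903   |
| 0.0          | -0.6864892         | -0.47313           | -0.2766417     | 1.7514571          | 1.65985      |
| 0.7          | 1.449255           | -0.47313           | -0.2766417     | -0.5680401         | -0.5993903   |
| 0.1          | -0.6864892         | 2.1028             | -0.2766417     | -0.5680401         | -0.5993903   |
| 0.0          | 1.449255           | -0.47313           | -0.2766417     | -0.5680401         | -0.5993903   |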
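
The non-integer values in the dummy columns come from normalizing 0/1 indicators. As a check, the sketch below rebuilds the `experience_Pro` column by hand in base R, using the count of 14 "Pro" players out of 196 from the `table()` output above, and assuming the normalization centres each column by its mean and scales it by its sample standard deviation. The names `pro_dummy` and `pro_scaled` are introduced only for this illustration.

```r
# 14 of the 196 players are "Pro" (counts from the table() output above)
pro_dummy <- c(rep(1, 14), rep(0, 182))

# Centre by the mean and scale by the sample standard deviation
pro_scaled <- (pro_dummy - mean(pro_dummy)) / sd(pro_dummy)

# Only two distinct values remain: one for "Pro", one for everyone else
round(unique(pro_scaled), 7)
# → 3.5963417 -0.2766417
```

These match the two values that appear in the `experience_Pro` column of the baked data: rare levels such as "Pro" produce large positive scores for the few players who have them.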


# Discussion

TODO

# References

TODO (Do we need this section?)