In [1]:
set.seed(888)
# libraries used by different members for the report
library(tidyverse)
library(tidymodels)
library(repr)
library(readxl)
library(rvest)
library(stringr)
library(janitor)
library(lubridate)
library(GGally)
library(ISLR)

# libraries for visualization
library(ggplot2)
library(dplyr)
library(tidyr)
library(patchwork)


# Introduction

**The research project**
The Pacific Laboratory for Artificial Intelligence (PLAI) at UBC, led by Professor Frank Wood, is trying to build embodied AI agents that behave like real human players inside Minecraft. Data were collected from players on a Minecraft server called PLAICraft, where the players' behaviours and traits were recorded. There are currently 196 observations of players. To save resources such as software licenses and server hardware, the lab needs to recruit players who will play on the server for several hours.

**The question**
Can a player's characteristics, such as experience, subscription to the game's newsletter, gender, and age, predict how many hours that player will play, according to the `players` dataset?

**Why**
Building a believable AI requires a good grasp of how human players behave, and that requires a significant amount of data. This data is collected through the interactions players have on the server, so it is crucial that recruited participants stay online for long periods of time.

Three characteristics of the players were chosen in order to provide a comprehensive list of what should be prioritised in recruiting efforts. Because these characteristics are largely self-reported, they may introduce bias (for example, social desirability in reporting gender, or overstatement of experience).

Age was not chosen because, as seen in the exploration of the data below, most players have similar ages. With such a high concentration around a single age group, age carries little weight for recruiting: the highest- and lowest-contributing players will most likely be in the same age range. Age is also a risky variable for prediction, since the targeted players grow older over the course of the research.

Only the `players` data set is needed for this question, as it contains all the demographic information. The sessions data could be useful for studying players' habits, but that level of detail is more specific than this question requires.


**The dataset: players**


The data is stored in a tibble with 196 observations (rows) and 7 variables (columns):

|**variable**|**data type**|**categories**|**meaning**|
|-|-|-|-|
| experience | character | 5 | skillset of the player: Beginner, Amateur, Regular, Veteran, Pro|
| subscribe | logical | 2 | indicating active subscription status: TRUE (subscribed) or FALSE (not subscribed) |
| hashedEmail | character | 196 | unique identifications |
| played_hours | real number | n/a | time in hours spent on the server by a player |
| name | character | 196 | unique identifications |
| gender | character | 7 | gender of the player: Male, Female, Non-binary, Agender, Two-Spirited, Prefer not to say, Other |
| Age | real number | n/a | age of the player |


Potential issues:
- The gender variable is inclusive, but it could reduce data accuracy: categories like "Prefer not to say" introduce ambiguity, since they could represent individuals from any other gender group.
- played_hours is positively skewed: the majority of values are very close to 0 h, with a few large outliers (around 200 h).
- Three of the variables (experience, gender, age) are self-identified, so they may introduce bias (social desirability in reporting age or gender, overstatement of experience, etc.).
  - Emails and names are self-identified as well, but they are not characteristics that affect play time or engagement. They identify individuals too specifically and do not represent a "type" of player.
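
A quick way to see what positive skew looks like in numbers is to compare the mean with the median. The sketch below uses a small made-up vector (`toy_hours` is hypothetical and is not taken from `players.csv`); the same calls could be applied to the real `played_hours` column.

```r
# Hypothetical play times in hours (NOT taken from players.csv):
# mostly near zero, with one very large value.
toy_hours <- c(0, 0, 0.1, 0.2, 0.5, 0.7, 1.0, 3.8, 30.3, 200)

hours_mean   <- mean(toy_hours)
hours_median <- median(toy_hours)

# A mean far above the median is a quick sign of positive (right) skew.
c(mean = hours_mean, median = hours_median)

# Share of toy players with at most one hour of play time.
share_low <- mean(toy_hours <= 1)
share_low
```

In a skewed variable like this, a handful of large values pulls the mean well above the typical player, which is worth keeping in mind when reading prediction errors later on.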

<h1> Methods and Results </h1>

To answer the question of whether a player's experience, subscription to the game's newsletter, gender, and age can predict the player's total played hours, a k-NN regression model will be used. k-NN regression predicts a numerical outcome from a set of predictor variables. Since our response variable, played hours, is numerical, the predictions will be numerical as well. First, the players data set is loaded below.


In [2]:
players <- read_csv("https://raw.githubusercontent.com/ctrl-tiramisu/dsci100-group-008/refs/heads/main/players.csv", show_col_types = FALSE)
head(players)

experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17
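
To make the idea behind k-NN regression concrete, here is a minimal base-R sketch, assuming plain Euclidean distance and an unweighted average of the neighbours' outcomes. The training values and the helper `knn_predict()` are hypothetical and for illustration only; the analysis in this report uses tidymodels with the kknn engine instead, and also standardizes the predictors first.

```r
# Toy training data (hypothetical values, not from players.csv).
train_x <- matrix(c(1, 1,
                    2, 1,
                    5, 2,
                    4, 2,
                    3, 1),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(NULL, c("experience", "subscribe")))
train_y <- c(0.1, 0.5, 30.0, 4.0, 1.0)   # played hours for each training row

# Predict for one new point by averaging the outcomes of its k nearest rows.
knn_predict <- function(new_x, x, y, k) {
  # Euclidean distance from new_x to every training row
  dists <- sqrt(rowSums(sweep(x, 2, new_x)^2))
  # Indices of the k closest rows
  nearest <- order(dists)[seq_len(k)]
  mean(y[nearest])
}

new_player <- c(experience = 4, subscribe = 2)
knn_predict(new_player, train_x, train_y, k = 2)
```

With k = 2 the two closest rows are (4, 2) and (5, 2), so the prediction is the average of 4.0 and 30.0. A larger k averages over more players, which smooths the prediction but can wash out real differences.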


The aim is to use a k-NN regression model to predict a player's played hours from subscription status, experience level, and gender. We exclude the age variable as a predictor (see the discussion of age in the Introduction). Since k-NN relies on distance calculations and requires numerical data, we first convert the categorical variables into "made-up" numerical values that stand in for their categories:

* First we change the categorical variables into factors:

In [40]:
players_tidy <- players |>
    mutate(
        Age = as.numeric(Age),
        subscribe = factor(subscribe),
        experience = factor(experience, levels = c("Beginner", "Amateur", "Regular", "Veteran", "Pro")),
        gender = as_factor(gender))
head(players_tidy)

experience,subscribe,hashedEmail,played_hours,name,gender,Age
<fct>,<fct>,<chr>,<dbl>,<chr>,<fct>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17


* In this step, we replace each categorical value with a "made-up" numerical value. In addition, because very few players reported their gender as Agender, Non-binary, Prefer not to say, Two-Spirited, or Other, we combine these into a single category called "Other", which is also represented by a numerical value.

In [45]:
# Combining the genders with small data into one category
players_tidy <- players_tidy |> mutate(
    gender = case_when(
      gender %in% c("Male") ~ "Male",
      gender %in% c("Female") ~ "Female",
      TRUE ~ "Other"),
    gender = as.factor(gender))

# Making the "made-up" numerical values.
# fct_recode() only relabels levels and keeps their order, so each new label
# is chosen to match the position of its level; as.integer() in the next step
# then returns the same number that is shown here.
players_numerical <- players_tidy |>
    mutate(
        subscribe = fct_recode(subscribe, "1" = "FALSE", "2" = "TRUE"),
        experience = fct_recode(experience, "1" = "Beginner", "2" = "Amateur",
                                "3" = "Regular", "4" = "Veteran", "5" = "Pro"),
        gender = fct_recode(gender, "1" = "Female", "2" = "Male", "3" = "Other"))
head(players_numerical)

experience,subscribe,hashedEmail,played_hours,name,gender,Age
<fct>,<fct>,<chr>,<dbl>,<chr>,<fct>,<dbl>
5,2,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,2,9
4,2,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,2,17
4,1,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,2,17
2,2,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,1,21
3,2,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,2,21
2,2,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,1,17


* Next, we convert the recoded categorical variables from factors to the integer (`<int>`) data type:

In [46]:
players_finals <- players_numerical |> mutate(experience = as.integer(experience),
       subscribe = as.integer(subscribe),
       gender = as.integer(gender) )

head(players_finals)
       

experience,subscribe,hashedEmail,played_hours,name,gender,Age
<int>,<int>,<chr>,<dbl>,<chr>,<int>,<dbl>
5,2,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,2,9
4,2,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,2,17
4,1,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,2,17
2,2,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,1,21
3,2,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,2,21
2,2,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,1,17
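
One detail worth checking whenever a factor is converted to a number: `as.integer()` on a factor returns the position of each level, not the digit stored in its label. The sketch below illustrates this with a small hypothetical factor in base R (it uses `levels<-` rather than `fct_recode()`, but the relabelling has the same effect on the underlying codes).

```r
# Hypothetical factor with an explicit level order.
f <- factor(c("Pro", "Veteran", "Beginner"),
            levels = c("Beginner", "Veteran", "Pro"))

# Relabel the levels with digits that do NOT follow the level order:
# Beginner -> "1", Veteran -> "3", Pro -> "2"
levels(f) <- c("1", "3", "2")

labels_as_text  <- as.character(f)              # the labels themselves
level_positions <- as.integer(f)                # position of each level
label_numbers   <- as.integer(as.character(f))  # digits stored in the labels

# Side-by-side view: the last two columns disagree for "Pro" and "Veteran".
data.frame(labels_as_text, level_positions, label_numbers)
```

If the digits in the labels are meant to be the final codes, converting through `as.character()` first keeps them; otherwise the codes follow the level order, which is why comparing counts before and after the conversion is a useful check.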


In [53]:
#Checking if the count is correct
player_experience1 <- players_finals |> count(experience)
player_experience1

player_experience2 <- players_tidy |> count(experience)
player_experience2

player_subscribe1 <- players_finals |> count(subscribe)
player_subscribe1

player_subscribe2 <- players_tidy |> count(subscribe)
player_subscribe2

player_gender1 <- players_finals |> count(gender)
player_gender1

player_gender2 <- players_tidy |> count(gender)
player_gender2

experience,n
<int>,<int>
1,35
2,63
3,36
4,48
5,14


experience,n
<fct>,<int>
Beginner,35
Amateur,63
Regular,36
Veteran,48
Pro,14


subscribe,n
<int>,<int>
1,52
2,144


subscribe,n
<fct>,<int>
False,52
True,144


gender,n
<int>,<int>
1,37
2,124
3,35


gender,n
<fct>,<int>
Female,37
Male,124
Other,35


In [33]:
# Keep only the outcome and the integer-encoded predictors.
# step_scale() and step_center() need numeric columns, so the split is made
# from players_finals rather than from the factor-typed tibbles.
players_selected <- players_finals |>
    select(played_hours, experience, subscribe, gender)

players_split <- initial_split(players_selected, prop = 0.80, strata = played_hours)
players_train <- training(players_split)
players_test <- testing(players_split)

# Standardize the predictors so each contributes comparably to the distance
players_recipe <- recipe(played_hours ~ experience + subscribe + gender, data = players_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# k-NN regression with unweighted neighbours; the number of neighbours is tuned
players_spec <- nearest_neighbor(weight_func = "rectangular",
                                 neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("regression")

players_vfold <- vfold_cv(players_train, v = 5, strata = played_hours)

players_wkflw <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(players_spec)

gridvals <- tibble(neighbors = seq(from = 1, to = 50, by = 1))

# 5-fold cross-validation over k = 1 to 50
players_results <- players_wkflw |>
    tune_grid(resamples = players_vfold, grid = gridvals) |>
    collect_metrics()
players_results


* Tables showing how the original values map to the new numerical values in `players_finals`

<u>Experience:</u>

|Old      |New|
|---------|---|
|Beginner | 1 |
|Amateur  | 2 |
|Regular  | 3 |
|Veteran  | 4 |
|Pro      | 5 |

<u>subscribe:</u>

|Old      |New|
|---------|---|
|FALSE    | 1 |
|TRUE     | 2 |

<u>gender:</u>

|Old                                                             |New|
|----------------------------------------------------------------|---|
|Female                                                          | 1 |
|Male                                                            | 2 |
|Agender + Non-binary + Other + Prefer not to say + Two-Spirited | 3 |


