# Introduction

**The Research Project**
The Pacific Laboratory for Artificial Intelligence (PLAI) at UBC, led by Professor Frank Wood, is trying to build embodied AI agents that can behave like real human players inside Minecraft. Data were collected from players on a Minecraft server called PLAICraft, where the players' behaviours and traits were recorded. The data set currently contains 196 player observations. To make good use of limited resources, such as software licenses and server hardware, the lab needs to recruit players who will play on the server for several hours.

**The Question**
We want to know whether players' characteristics (experience, subscription to the game's newsletter, gender, and age) can predict how many hours a player will play, using the players data set.

**Why**
To understand human players' behaviour well enough to build a believable AI, a significant amount of data is needed. This data is collected through the interactions players have on the server, so it is crucial that recruited participants stay online for long periods of time. The four characteristics (experience, subscription to the game's newsletter, gender, and age) were chosen to give a comprehensive picture of what should be prioritised in recruiting efforts. Since three of the variables (experience, gender, age) are self-identified, they may introduce bias (social desirability in reporting gender, overstatement of experience, etc.). Only the players data set is needed for this question, as it contains all the demographic information. The sessions data set could be useful for studying players' habits, but that level of detail is too specific for the question of who contributes the most data.


**The Dataset: Players**

The data set is a tibble with 196 observations (rows) and 7 variables (columns):

|**variable**|**data type**|**number of categories**|**meaning**|
|-|-|-|-|
| experience | character | 5 | skill level of the player: Beginner, Amateur, Regular, Veteran, Pro |
| subscribe | logical | 2 | newsletter subscription status: TRUE (subscribed) or FALSE (not subscribed) |
| hashedEmail | character | 196 | hashed email address, unique to each player |
| played_hours | real number | n/a | time in hours spent on the server by a player |
| name | character | 196 | name of the player, unique to each player |
| gender | character | 7 | gender of the player: Male, Female, Non-binary, Agender, Two-Spirited, Prefer not to say, Other |
| Age | real number | n/a | age of the player |


Potential issues:
- The gender variable is inclusive, but categories like "Prefer not to say" introduce ambiguity, since they could represent individuals from any other gender group. This could reduce data accuracy.
- played_hours is positively skewed: the majority of values are very close to 0 hours, with a few large outliers (around 200 hours).
- As mentioned, three of the variables (experience, gender, age) are self-identified, so they may introduce bias (social desirability in reporting age or gender, overstatement of experience, etc.).
- Emails and names are also self-identified, but they are not characteristics that affect play time or engagement. They identify individual players too specifically and do not represent a "type" of player.

# Methods & Results

To answer our question (can experience, subscription to the game's newsletter, gender, and age predict a player's total played hours?), we will perform K-NN regression. Our predictor variables will be experience, subscription to the game's newsletter, gender, and age, and our response variable will be played hours. Since played hours is numerical, K-NN regression is appropriate: it produces numerical predictions rather than class labels.
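To make the idea concrete, here is a minimal hand-rolled sketch of what K-NN regression does with a single predictor: it finds the k training points closest to a new observation and averages their response values. The function name `knn_predict` and the toy ages and hours are hypothetical and are not taken from the players data; the actual analysis would rely on a modelling library rather than this helper.

```r
# Hypothetical helper: predict a numeric response as the mean of the
# responses of the k nearest training points (one predictor only).
knn_predict <- function(train_x, train_y, new_x, k) {
  dists <- abs(train_x - new_x)            # distance to every training point
  nearest <- order(dists)[seq_len(k)]      # indices of the k closest points
  mean(train_y[nearest])                   # average their responses
}

# Toy data (made up for illustration only).
train_age   <- c(15, 17, 20, 22, 30, 45)
train_hours <- c(0.5, 1.0, 2.0, 0.0, 4.0, 10.0)

predicted_hours <- knn_predict(train_age, train_hours, new_x = 21, k = 3)
print(predicted_hours)
```

With several predictors the absolute difference would be replaced by a distance over all of them, which is why the categorical variables need a numerical encoding before modelling.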

* Loading the Data



* Wrangling and Cleaning the Data

What we did:
- Tidied the data to ensure a smooth analysis
- Combined the genders with too little data into "Other"
- Changed our categorical variables into "dummy" variables
- Dropped the NAs in the data set
- Removed outliers in played hours (values above 100)

Generally, the players data set is already tidy: each row is a single observation, each column is a single variable, and each value sits in a single cell. However, to make sure the analysis runs smoothly, we take the following further steps.

First, the gender variable contains 7 categories, some of which have very few observations. Having many categories, some of them sparsely populated, may affect the analysis negatively. To resolve this, we combine the genders that have too little data into the "Other" category, leaving three categories under the gender variable: Male, Female, and Other.
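The grouping step could be sketched roughly as follows. The tibble `toy_players` is a made-up stand-in for the real players data, and using `if_else()` is one of several reasonable ways to do this; it is not necessarily the approach taken in the original notebook.

```r
library(tidyverse)

# Toy stand-in for the players data (values invented for illustration).
toy_players <- tibble(
  gender = c("Male", "Female", "Non-binary", "Agender", "Male",
             "Two-Spirited", "Prefer not to say", "Other", "Female")
)

# Keep Male and Female as they are; fold every other category into "Other".
toy_players <- toy_players |>
  mutate(gender = if_else(gender %in% c("Male", "Female"), gender, "Other"))

# Check how many rows fall into each of the three remaining categories.
count(toy_players, gender)
```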

Moreover, since K-NN relies on distance calculations and requires numerical data, the values of the categorical variables used in the analysis (experience, gender, and subscribe) are converted into "dummy" numerical values that represent the categories of each variable.
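One possible encoding is sketched below, assuming 0/1 indicator columns built with base R's `model.matrix()`. The toy rows are invented, and the original notebook may have encoded the categories differently (for example, with integer codes), so treat this as an illustration of the idea rather than the exact method used.

```r
library(tidyverse)

# Toy stand-in for the players data (values invented for illustration).
toy_players <- tibble(
  experience = c("Beginner", "Amateur", "Regular", "Veteran", "Pro", "Amateur"),
  gender     = c("Male", "Female", "Other", "Male", "Female", "Male"),
  subscribe  = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
)

# Turn the character columns into factors and the logical into 0/1.
encoded <- toy_players |>
  mutate(
    experience = factor(experience,
                        levels = c("Beginner", "Amateur", "Regular", "Veteran", "Pro")),
    gender     = factor(gender, levels = c("Male", "Female", "Other")),
    subscribe  = as.integer(subscribe)
  )

# model.matrix() expands each factor into 0/1 indicator columns. The first
# level of each factor is the reference and gets no column of its own.
# The first column is the intercept, which is dropped here.
dummies <- model.matrix(~ experience + gender + subscribe, data = encoded)[, -1] |>
  as_tibble()

dummies
```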

Next, the NA values in the data must be dealt with. Since the players data set has only two NA values, both under the Age variable, removing those rows suffices. As noted earlier, the large range of played hours suggests that there may be outliers. After calculating the upper and lower fences, played hours above 100 were identified as outliers and removed.
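As a rough sketch of these two steps, the code below drops rows with a missing Age and then computes fences from the quartiles, assuming the common Tukey rule (Q1 - 1.5 × IQR and Q3 + 1.5 × IQR). The toy values are invented, so the fences they produce are unrelated to the cutoff reported above, and the original notebook may have computed its fences differently.

```r
library(tidyverse)

# Toy stand-in for the players data (values invented for illustration).
toy_players <- tibble(
  played_hours = c(0, 0.1, 0.2, 0.5, 1.0, 2.0, 150),
  Age          = c(17, 19, NA, 22, 21, 30, 25)
)

# Drop rows where Age is missing.
cleaned <- toy_players |> drop_na(Age)

# Tukey fences: 1.5 * IQR below the first quartile and above the third.
quartiles   <- quantile(cleaned$played_hours, probs = c(0.25, 0.75))
iqr         <- quartiles[[2]] - quartiles[[1]]
lower_fence <- quartiles[[1]] - 1.5 * iqr
upper_fence <- quartiles[[2]] + 1.5 * iqr

# Keep only the rows whose played hours lie within the fences.
no_outliers <- cleaned |>
  filter(played_hours >= lower_fence, played_hours <= upper_fence)

no_outliers
```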

* Summary Statistics of The Data Set

The table below shows the summary statistics of the players' played hours in our data set. The range of played hours is quite large, with a minimum of 0 (meaning under 1 hour) and a maximum of 223.1 hours. This suggests that there may be outliers in our data set.
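A summary table of this kind could be produced along the following lines. The toy values are invented and the column names `min_hours`, `median_hours`, `mean_hours`, and `max_hours` are hypothetical, so the numbers here do not match the report.

```r
library(tidyverse)

# Toy stand-in for the played_hours column (values invented for illustration).
toy_players <- tibble(played_hours = c(0, 0, 0.1, 0.6, 2.5, 40))

# One row of summary statistics for played hours.
hours_summary <- toy_players |>
  summarise(
    min_hours    = min(played_hours),
    median_hours = median(played_hours),
    mean_hours   = mean(played_hours),
    max_hours    = max(played_hours)
  )

hours_summary
```

A mean far above the median, as in this toy example, is the usual sign of the positive skew described in the data set section.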

* Visualization of The Data Set

Figure 1: The bar plot below shows that most of the players' played hours are below 25. In fact, most are in the 0 

* Data Analysis

In the code below, the model is tuned to find the optimal k-value for the K-NN regression model.

The data are split into a training set (70%) and a testing set (30%). The training set is used below to find the optimal k-value.
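A minimal sketch of the split and the tuning step is shown below, assuming the tidymodels workflow with the kknn engine. The synthetic data, the seed, the two predictors, the 5 folds, and the grid of k-values are all assumptions made for illustration; the original notebook's settings are not shown in this report.

```r
library(tidyverse)
library(tidymodels)

set.seed(1)  # arbitrary seed so the toy example is reproducible

# Synthetic stand-in for the cleaned players data (hypothetical values).
toy_players <- tibble(
  Age          = runif(120, min = 10, max = 50),
  subscribe    = rbinom(120, size = 1, prob = 0.7),
  played_hours = rexp(120, rate = 0.5)
)

# 70/30 split, stratified on the response.
players_split <- initial_split(toy_players, prop = 0.70, strata = played_hours)
players_train <- training(players_split)
players_test  <- testing(players_split)

# Standardize the predictors so no single variable dominates the distance.
knn_recipe <- recipe(played_hours ~ Age + subscribe, data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Model specification with the number of neighbours left to be tuned.
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

# Cross-validation on the training set only, over a grid of candidate k.
players_folds <- vfold_cv(players_train, v = 5, strata = played_hours)
k_grid <- tibble(neighbors = seq(from = 1, to = 25, by = 2))

tuning_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = players_folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "rmse")

# The k with the lowest cross-validated RMSE.
best_k <- tuning_results |>
  slice_min(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

best_k
```

Keeping the test set out of the tuning step means the final error estimate is not influenced by the choice of k.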

The optimal k-value found was 13. A K-NN regression model with k = 13 is then fit on the training data and used to predict the played hours in the test data.
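The final fit and evaluation could look roughly like this, again assuming tidymodels with the kknn engine. The synthetic data, seed, and predictors are placeholders, so the error value this produces says nothing about the real players data; only k = 13 comes from the report.

```r
library(tidyverse)
library(tidymodels)

set.seed(1)  # arbitrary seed so the toy example is reproducible

# Synthetic stand-in for the cleaned players data (hypothetical values).
toy_players <- tibble(
  Age          = runif(120, min = 10, max = 50),
  subscribe    = rbinom(120, size = 1, prob = 0.7),
  played_hours = rexp(120, rate = 0.5)
)

players_split <- initial_split(toy_players, prop = 0.70, strata = played_hours)
players_train <- training(players_split)
players_test  <- testing(players_split)

knn_recipe <- recipe(played_hours ~ Age + subscribe, data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Model specification with the chosen number of neighbours, k = 13.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 13) |>
  set_engine("kknn") |>
  set_mode("regression")

# Fit on the training data only.
knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = players_train)

# Predict on the held-out test data and attach the true values.
test_predictions <- knn_fit |>
  predict(players_test) |>
  bind_cols(players_test)

# RMSE computed on the test set, i.e. the prediction error (RMSPE).
rmspe <- test_predictions |>
  metrics(truth = played_hours, estimate = .pred) |>
  filter(.metric == "rmse") |>
  pull(.estimate)

rmspe
```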





* Visualization of the Analysis

```r
# Loading the Data
library(tidyverse)
players <- read_csv("https://raw.githubusercontent.com/ctrl-tiramisu/dsci100-group-008/refs/heads/main/players.csv", show_col_types = FALSE)
summary(players)

# Wrangling and Cleaning the Data
#First, we will be doing the following tidying to ensure 

# Summary Statistics of The Data Set

# Visualization of The Data Set
```

```
  experience        subscribe       hashedEmail         played_hours    
 Length:196         Mode :logical   Length:196         Min.   :  0.000  
 Class :character   FALSE:52        Class :character   1st Qu.:  0.000  
 Mode  :character   TRUE :144       Mode  :character   Median :  0.100  
                                                       Mean   :  5.846  
                                                       3rd Qu.:  0.600  
                                                       Max.   :223.100  

     name              gender               Age       
 Length:196         Length:196         Min.   : 9.00  
 Class :character   Class :character   1st Qu.:17.00  
 Mode  :character   Mode  :character   Median :19.00  
                                       Mean   :21.14  
                                       3rd Qu.:22.75  
                                       Max.   :58.00  
```