Please run the cell below to ensure that the rest of the report goes as expected!

In [2]:
library(tidyverse)
library(repr)
library(ggplot2)
library(RColorBrewer)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


# From Sessions to Predictions: Modeling Engagement with KNN Regression

---

## Introduction

From the title, you're probably wondering what 'sessions' are, what is being predicted, and, most likely, what 'KNN regression' even means. No need to worry! Our project stems from a UBC Computer Science study called **PLAIcraft**, which describes itself as follows: "PLAICraft is a research data collection project with the aim of enabling advanced embodied AI research." (PLAIcraft Frequently Asked Questions). The team tasked us with answering a few questions based on two anonymised datasets: one with player data and one with individual play-session data. That brings us to the question we attempted to answer!

### Our Question

**Can we predict the number of future sessions a player will engage in based on their initial session characteristics and player information?**

To simplify further, our goal in this analysis was to use a player's own data, together with the data from only their very first session, to determine whether they will return for subsequent sessions and, if so, how many.
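To make the target concrete, here is a minimal sketch of how "number of future sessions" could be counted per player. It assumes the response is simply each player's total session count minus one (their first session); the toy data, the `start` column, and the `future_sessions` name are made up for illustration and are not taken from the real files.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the sessions table; emails and times are invented.
toy_sessions <- tibble(
  hashedEmail = c("a", "a", "a", "b", "c", "c"),
  start = as.POSIXct(
    c("2024-06-01 10:00", "2024-06-03 12:00", "2024-06-09 09:30",
      "2024-06-02 18:00", "2024-06-05 20:15", "2024-06-06 21:00"),
    tz = "UTC"
  )
)

# One row per player: when they first played, and how many sessions
# came after that first one (assumed definition of the response).
future_sessions <- toy_sessions %>%
  group_by(hashedEmail) %>%
  summarise(
    first_start = min(start),
    future_sessions = n() - 1L,
    .groups = "drop"
  )

future_sessions
```

A player who never came back ends up with `future_sessions` equal to 0, so "did they return?" and "how many times?" live in the same column.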

### The Dataset

Now for the bread and butter. Well, perhaps the dataset is the bread and the data analysis that comes later is the butter, who knows. In this brief section we load the data, walk through it, and, most importantly, help you understand what it all means before we move on to the next section, where the real fun begins.

In [4]:
players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")

Rows: 196 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, age
lgl (3): subscribe, individualId, organizationName

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

In [5]:
head(players)

experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>,<lgl>,<lgl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9,,
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17,,
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17,,
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21,,
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21,,
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17,,


Alright, so we've loaded both datasets, and directly above is a sneak peek at the player information dataset! Some of the columns are self-explanatory, such as **name**, **gender**, and **age**. Others are a little less obvious. The **experience** column records each player's self-identified Minecraft skill level, which they can choose as Pro, Amateur, Regular, and more. To the right of that is **subscribe**, which shows whether or not they enrolled in the study's email list. Look right once more and we see **hashedEmail**, an anonymised encoding of each user's email (this comes in handy later on). Finally, **played_hours** holds each user's *total* hours played on the server. Since we have not cleaned the data yet, the **individualId** and **organizationName** columns are still visible, but they contain no data, so we will drop them later on.
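As a rough sketch of that later clean-up step, one way to drop columns that contain nothing but missing values is shown below. The toy tibble mimics the shape of the players table but is made up, and selecting with `where()` is just one option; naming the two columns explicitly in `select()` would work equally well.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the players table, with two all-NA columns.
toy_players <- tibble(
  experience = c("Pro", "Amateur"),
  played_hours = c(30.3, 0.7),
  individualId = c(NA, NA),
  organizationName = c(NA, NA)
)

# Keep only the columns that hold at least one non-missing value.
cleaned <- toy_players %>%
  select(where(~ !all(is.na(.x))))

names(cleaned)
```

The advantage of the `where()` approach is that it does not rely on knowing the empty columns by name, at the cost of silently dropping any other column that happens to be entirely missing.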

In [6]:
head(sessions)

hashedEmail,start_time,end_time,original_start_time,original_end_time
<chr>,<chr>,<chr>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,30/06/2024 18:12,30/06/2024 18:24,1719770000000.0,1719770000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,17/06/2024 23:33,17/06/2024 23:46,1718670000000.0,1718670000000.0
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,25/07/2024 17:34,25/07/2024 17:57,1721930000000.0,1721930000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,25/07/2024 03:22,25/07/2024 03:58,1721880000000.0,1721880000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,25/05/2024 16:01,25/05/2024 16:12,1716650000000.0,1716650000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,23/06/2024 15:08,23/06/2024 17:10,1719160000000.0,1719160000000.0


Now we're looking at the sessions dataset, which combines the play-session records of all the players in the study. Again, this is only a sneak peek; the full table has 1,535 rows. We still see **hashedEmail** in this dataset, but the rest of the variables are different! The **start_time** and **end_time** columns give the date and time at which each play session started and ended, stored as text in day/month/year format. The **original_start_time** and **original_end_time** columns are a bit more complicated, but they are just UNIX-timestamp versions (in milliseconds) of the same start and end times, so don't worry about them!
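Because **start_time** and **end_time** are read in as text, they need to be parsed before any arithmetic is possible. The sketch below uses two rows copied from the preview above and assumes the times are in UTC, which the data itself does not confirm. The `start_dt`, `end_dt`, `duration_mins`, and `start_from_unix` names are illustrative only.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Two rows copied from the head(sessions) preview.
toy_sessions <- tibble(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time = c("30/06/2024 18:24", "17/06/2024 23:46"),
  original_start_time = c(1719770000000, 1718670000000)
)

parsed <- toy_sessions %>%
  mutate(
    # Parse day/month/year hour:minute strings (time zone is an assumption).
    start_dt = dmy_hm(start_time, tz = "UTC"),
    end_dt = dmy_hm(end_time, tz = "UTC"),
    # Session length in minutes.
    duration_mins = as.numeric(difftime(end_dt, start_dt, units = "mins")),
    # The UNIX columns are in milliseconds, so divide by 1000 first.
    start_from_unix = as_datetime(original_start_time / 1000, tz = "UTC")
  )

parsed
```

Note that the UNIX values in the preview look rounded, so the converted timestamps land close to, but not exactly on, the text versions; the text columns are likely the safer choice for computing session lengths.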