**1) Data Description**  
**Introduction:** The data analyzed in this project were collected from a Minecraft research server hosted by the Programming Languages for Artificial Intelligence (PLAI) Group at UBC. The objective of the study is to better understand player behavior and engagement in a controlled gaming environment. Players interact with the game server, and both their demographic details and gameplay actions are recorded.  

**players.csv:** This file contains 7 variables holding demographic and account-related information for each unique player. There are 196 observations.
  
**Variables in players.csv:**  
1) experience: A character variable that specifies the player's experience level. It takes one of five values: Veteran, Pro, Regular, Amateur, or Beginner.  
2) subscribe: A logical variable (TRUE or FALSE) that specifies whether the player is subscribed.  
3) hashedEmail: A character variable that contains the player's email in hashed format, which anonymizes the email information.  
4) played_hours: A numerical variable (dbl) that contains the total number of gameplay hours per player. The mean number of hours played per player is 5.85, the minimum is 0 hours, and the maximum is 223.1 hours.  
5) name: A character variable that contains the player's chosen name for the game.  
6) gender: A character variable that contains the player's specified gender.  
7) Age: A numerical variable (dbl) that contains the player's age. The mean age is 21.14 years, the minimum is 9 years, and the maximum is 58 years.  

**Errors in players.csv**  
Some players have missing Age values, which is a concern because it reduces the completeness of the data and could bias any analysis or predictions that rely on age as an important demographic factor.  
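
One way to quantify the missing Age values is to count the NA entries in each column. The sketch below uses a small made-up tibble (`toy_players`, with hypothetical values) in place of players.csv, so its counts are not those of the real data. Dropping the incomplete rows is shown as one possible option, not necessarily the approach taken later.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for players.csv; all values are invented for illustration
toy_players <- tibble(
  experience = c("Pro", "Veteran", "Amateur", "Regular"),
  subscribe = c(TRUE, FALSE, TRUE, TRUE),
  played_hours = c(30.3, 0.0, 0.7, 1.2),
  Age = c(9, 17, NA, 21)
)

# Count the missing values in every column
missing_counts <- toy_players |>
  summarize(across(everything(), \(x) sum(is.na(x))))

# One option: keep only players with a recorded age (imputation is another)
players_complete <- toy_players |>
  filter(!is.na(Age))
```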

**sessions.csv:** This file contains 5 variables and includes information on individual gameplay sessions recorded in the game. Each row captures a single gameplay session linked to a player. There are 1535 observations.  

**Variables in sessions.csv**  
1) hashedEmail: This is a character variable that contains the player's email in hashed format, which anonymizes the email information.  
2) start_time: This is a character variable, it contains the gameplay session start time in the DD/MM/YYYY and HH:MM format.  
3) end_time: This is a character variable, it contains the gameplay session end time in the DD/MM/YYYY and HH:MM format.
4) original_start_time: This is a numerical variable (dbl). It contains the UNIX timestamp (in milliseconds) corresponding to the session start time.  
5) original_end_time: This is a numerical variable (dbl). It contains the UNIX timestamp (in milliseconds) corresponding to the session end time.

**Errors in sessions.csv**  
This dataset contains missing values in the end_time and original_end_time columns, and the time data are stored in two different formats (human-readable strings and UNIX timestamps), which will need to be cleaned and transformed before calculating useful information such as session duration.  
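
The cleaning described above can be sketched on a couple of made-up sessions. The tibble `toy_sessions` and its values are hypothetical, and the timestamps are treated as UTC purely for illustration; the real data may use a different time zone. The string columns are parsed with `dmy_hm()`, the millisecond UNIX timestamps are divided by 1000 before being passed to `as_datetime()`, and a missing end time simply yields a missing duration.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Hypothetical stand-in for sessions.csv; the second session has no end time
toy_sessions <- tibble(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time = c("30/06/2024 18:24", NA),
  original_start_time = c(1719771120000, 1718667180000),
  original_end_time = c(1719771840000, NA)
)

sessions_parsed <- toy_sessions |>
  mutate(
    # Human-readable strings -> datetimes (UTC assumed here)
    start_dt = dmy_hm(start_time),
    end_dt = dmy_hm(end_time),
    # UNIX timestamps are in milliseconds, so convert to seconds first
    start_from_unix = as_datetime(original_start_time / 1000),
    # Explicit units keep the duration column unambiguous
    duration_mins = as.numeric(difftime(end_dt, start_dt, units = "mins"))
  )
```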





**2) Question:** Can player demographics and gameplay behavior predict whether a player is subscribed to the game newsletter?  
To address this question, I will first merge the players.csv dataset with sessions.csv using the common hashedEmail identifier. I will compute useful behavioral metrics such as the total number of sessions and average session duration for each player. These variables, along with existing demographic features (e.g., age, experience, played_hours), will form the predictor set. I will then standardize numeric variables and encode categorical ones as needed to prepare the data for use in K-Nearest Neighbors (KNN). This wrangling will ensure the dataset is tidy and suitable for analysis.
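
As a rough illustration of the standardizing and encoding step, the sketch below works on a made-up tibble (`toy_combined`, hypothetical values) shaped like the merged data. The ordering of the experience levels is an assumption made only for this example. In a real KNN workflow the centering and scaling would normally be estimated on the training split only.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the merged player/session data
toy_combined <- tibble(
  subscribe = c(TRUE, FALSE, TRUE, TRUE),
  experience = c("Pro", "Beginner", "Amateur", "Veteran"),
  Age = c(17, 21, 19, 23),
  played_hours = c(30.0, 0.0, 1.0, 5.0),
  total_sessions = c(27, 1, 2, 4)
)

knn_ready <- toy_combined |>
  # Encode the response and the categorical predictor as factors
  # (the level order for experience is assumed, not taken from the data)
  mutate(
    subscribe = factor(subscribe, levels = c(FALSE, TRUE)),
    experience = factor(
      experience,
      levels = c("Beginner", "Amateur", "Regular", "Pro", "Veteran")
    )
  ) |>
  # Standardize every numeric column to mean 0 and standard deviation 1
  mutate(across(
    where(is.numeric),
    \(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  ))
```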

In [1]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [23]:
players_data <- read_csv("https://raw.githubusercontent.com/gavind1111/DSCI-100-Project-Planning-Stage-Gavin/refs/heads/main/players.csv")
sessions_data <- read_csv("https://raw.githubusercontent.com/gavind1111/DSCI-100-Project-Planning-Stage-Gavin/refs/heads/main/sessions.csv")

sessions_selected <- sessions_data |> 
  select(hashedEmail, start_time, end_time) |> # Keep only the variables needed for analysis
  mutate(
    start_time = dmy_hm(start_time), # Convert start and end times from character to datetime format
    end_time = dmy_hm(end_time),
    session_duration = as.numeric(difftime(end_time, start_time, units = "mins")) # Session length in minutes (explicit units, since `-` picks units automatically)
  ) |>
  group_by(hashedEmail) |> # Group by player so we can summarize behaviour per user
  summarize(average_session_duration = mean(session_duration, na.rm = TRUE), total_sessions = n()) # Average session length (minutes) and number of sessions per player


combined_data <- inner_join(players_data, sessions_selected, by = "hashedEmail") # Combine the two datasets on hashedEmail; players with no recorded sessions are dropped

players_means <- players_data |>
  select(where(is.numeric)) |>       # keep only numeric variables
  summarize(across(everything(), \(x) mean(x, na.rm = TRUE))) # anonymous function replaces the deprecated across(..., na.rm = TRUE) form

players_means





age_plot <- players_data |>
    ggplot(aes(x = Age)) +
    geom_histogram(bins = 20) +
    labs(title = "Distribution of Player Ages", x = "Age (years)", y = "Count")

playtime_plot <- players_data |> 
    ggplot(aes(x = played_hours)) +
    geom_histogram(bins = 30) + # set bins explicitly (30 is the default) to avoid the binwidth message
    labs(title = "Distribution of Total Played Hours per Player", x = "Played Hours (hours)", y = "Count")





Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


played_hours  Age
<dbl>         <dbl>
5.845918      21.13918
