# Group 26 DSCI 100 Final Project

## Introduction

A UBC Computer Science research group, led by Frank Wood, is collecting data on how people play video games using a Minecraft server that tracks players’ actions. The team must carefully manage recruitment and resources, such as server capacity and software licenses, to handle the number of participants effectively.

The broad question posed is: which "kinds" of players are most likely to contribute a large amount of data, so that recruiting efforts can target those players?

Our team's specific question is: can player characteristics (experience and age) predict total hours played in the player dataset? Answering it addresses the broad question by showing which kinds of players (demographically) play the most, under the assumption that more hours correspond to more sessions and, consequently, a larger contribution of data.

Here is a breakdown of the data:

Player data has 196 observations and 7 variables.

Two numerical:

  + played_hours: total hours played by a player
  + Age: age of player

Four character:

  + experience: level of the player ("Beginner", "Amateur", "Regular", "Veteran", "Pro")
  + hashedEmail: anonymized email for privacy.
  + name: player's first name
  + gender: player's gender

One logical:

  + subscribe: whether the player is subscribed to a game-related newsletter

KNN Classification:

To address our research question, we will use K-Nearest Neighbours (KNN) classification to evaluate how well experience and age predict played hours.

First, we convert played_hours into a factor with four categories: None, Low, Medium, and High activity. The experience variable is mapped to a 1–5 numeric scale, and observations with a missing (NA) age are removed. Both predictors are then standardized, and the KNN classification algorithm is applied to the cleaned and prepared dataset.
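
To make the idea concrete, here is a minimal base-R sketch of how a KNN classifier labels one new player: it measures the Euclidean distance from the new point to every training point, keeps the k closest, and takes a majority vote. The toy values and the helper name `knn_vote` are hypothetical and for illustration only; the analysis itself relies on the tidymodels `kknn` engine rather than this function.

```r
# Hypothetical toy training set: two standardized predictors and a class label.
train_x <- matrix(c(-1.0, -0.5,
                    -0.8,  0.2,
                     0.9,  1.1,
                     1.2,  0.7,
                     0.1, -1.3),
                  ncol = 2, byrow = TRUE)
train_y <- c("Low", "Low", "High", "High", "Low")

# Classify one new observation by majority vote among its k nearest neighbours.
knn_vote <- function(new_x, train_x, train_y, k = 3) {
  # Euclidean distance from new_x to every row of train_x
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Indices of the k closest training points
  nearest <- order(dists)[seq_len(k)]
  # Most common label among those neighbours
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

knn_vote(c(1.0, 1.0), train_x, train_y, k = 3)
```

With these toy points, the three nearest neighbours of (1, 1) carry the labels High, High, and Low, so the vote returns "High".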

## Methods and Results

### 1) Loading and Cleaning

In [None]:
library(tidyverse)
library(tidymodels)
players_url <- "https://raw.githubusercontent.com/averykimura/avery_gamet_project_planning_DSCI100/refs/heads/main/players.csv"
players <- read_csv(players_url, show_col_types = FALSE)

In [None]:
# Rename columns so that all variable names use consistent snake_case
players <- rename(players, hashed_email = hashedEmail, age = Age)

# Keep only the variables needed for the analysis
players <- select(players, age, played_hours, experience)

### 2) Preparing for Analysis

#### Summary Stats

In [None]:
players_summary <- summarise(players,
  age_min = min(age, na.rm = TRUE),
  age_max = max(age, na.rm = TRUE),
  age_mean = mean(age, na.rm = TRUE),
  played_hours_min = min(played_hours, na.rm = TRUE),
  played_hours_max = max(played_hours, na.rm = TRUE),
  played_hours_mean = mean(played_hours, na.rm = TRUE),
  played_hours_median = median(played_hours, na.rm = TRUE))

players_summary

The played_hours distribution is extremely right-skewed: most players have very few played hours, while a small number have very many. This skew matters when choosing category boundaries, because equal-width bins would group almost all players into the lowest category.
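
The effect of skew on binning can be seen in a small sketch. The vector below is made-up toy data, not the project data; it only mimics the shape described above (many small values and one very large one). The hand-picked cutoffs are the same thresholds used for the classifier variable in this report.

```r
# Hypothetical right-skewed toy data (not the real played_hours values).
toy_hours <- c(0, 0, 0, 0.1, 0.2, 0.3, 0.5, 1, 2, 150)

# Three equal-width bins across the range: almost everything lands in the first bin.
equal_width <- cut(toy_hours, breaks = 3, labels = c("Low", "Medium", "High"))
table(equal_width)

# Hand-picked cutoffs (right-closed intervals): 0, (0, 1.5], (1.5, 4], above 4.
custom <- cut(toy_hours, breaks = c(-Inf, 0, 1.5, 4, Inf),
              labels = c("None", "Low", "Medium", "High"))
table(custom)
```

On this toy vector the equal-width bins put 9 of 10 values into "Low", while the hand-picked cutoffs spread them over all four groups.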

#### Creating Classifier Variable

In [None]:
# Bin played_hours into four activity groups; case_when() uses the first condition that matches
players_hours <- players |>
    mutate(played_hours_group = case_when(
        played_hours <= 0   ~ "None",
        played_hours <= 1.5 ~ "Low",
        played_hours <= 4   ~ "Medium",
        TRUE                ~ "High")) |>
    select(age, experience, played_hours_group)

# Convert the response to a factor with levels in increasing order of activity
players_factor <- mutate(players_hours, played_hours_group = factor(played_hours_group, levels = c("None", "Low", "Medium", "High")))

# Check how many players fall into each group
check <- count(players_factor, played_hours_group)
check

Because the median played_hours was extremely low, we double-checked the distribution and found that a large number of observations had exactly 0 hours. To avoid lumping all of these into the "Low" category, we created a separate "None" category for played_hours = 0. We then converted the grouped variable into a factor with its levels in increasing order (None, Low, Medium, High) so it could be used as the response variable in our KNN classification.

#### Wrangling Predictors

In [None]:
players_ready <- players_factor |>
    mutate(experience = factor(experience,
                               levels = c("Beginner", "Amateur", "Regular", "Veteran", "Pro"),
                               ordered = TRUE),
           experience_score = as.numeric(experience)) |>  # 1 = Beginner, ..., 5 = Pro
    select(played_hours_group, age, experience_score) |>
    filter(!is.na(age))  # drop observations with a missing age

Since KNN requires numerical predictors, we converted the experience variable into a 1–5 numeric scale. We also removed the two observations with missing age values, since KNN cannot compute distances for observations with a missing predictor. After these adjustments, the dataset was fully wrangled and ready for KNN classification.
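
The mapping from experience level to score can be checked on a few values. This is a small sketch with hypothetical toy inputs; it uses the same level order as the wrangling step. Note that a 1–5 score treats the gaps between adjacent levels as equal, which is a modelling assumption rather than something the data guarantees.

```r
# Hypothetical toy values of the experience variable.
toy_experience <- c("Pro", "Beginner", "Regular", "Amateur", "Veteran")

experience_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")
toy_factor <- factor(toy_experience, levels = experience_levels, ordered = TRUE)

# as.numeric() on a factor returns the position of each value in `levels`.
toy_score <- as.numeric(toy_factor)
toy_score
# 5 1 3 2 4

# A value that is not listed in `levels` (for example a typo) silently becomes NA.
unknown_score <- as.numeric(factor("Expert", levels = experience_levels))
unknown_score
# NA
```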

#### Visualization

In [None]:
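# Figure 1: exploratory plot of the unstandardized predictors, coloured by played-hours group
# (points are jittered slightly because experience_score only takes the values 1 to 5)
players_plot <- ggplot(players_ready, aes(x = age, y = experience_score, colour = played_hours_group)) +
    geom_jitter(width = 0.3, height = 0.15, alpha = 0.6) +
    labs(title = "Figure 1: Experience score vs. age by played-hours group",
         x = "Age (years)", y = "Experience score (1 = Beginner, 5 = Pro)", colour = "Played-hours group")
players_plot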

WRITE HERE: Note that data is not standardized and explain graph

### 3) KNN Classification 

#### Split Data

In [None]:
set.seed(2000)

players_split <- initial_split(players_ready, prop=0.75, strata=played_hours_group)
players_train <- training(players_split)
players_test <- testing(players_split)

#### Preprocess Data

In [None]:
set.seed(2000)
# Center and scale both predictors so that neither dominates the distance calculation
players_recipe <- recipe(played_hours_group ~ ., data = players_train) |>
    step_center(all_predictors()) |>
    step_scale(all_predictors())
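
As a rough illustration of what centering and scaling do, the base-R sketch below standardizes a toy age column by hand. The numbers are hypothetical. The key point is that the mean and standard deviation come from the training data only and are then reused for the test data, which is our understanding of how a recipe inside a fitted workflow behaves.

```r
# Hypothetical toy ages for a training set and a test set.
train_age <- c(17, 19, 21, 23, 30)
test_age  <- c(18, 40)

# The statistics are estimated on the training data only.
age_mean <- mean(train_age)
age_sd   <- sd(train_age)

# Center, then scale: the training column ends up with mean 0 and sd 1.
train_scaled <- (train_age - age_mean) / age_sd

# The test data reuses the training mean and sd, so no test information is used.
test_scaled <- (test_age - age_mean) / age_sd

train_scaled
test_scaled
```

Without this step, age (which spans tens of years) would outweigh experience_score (which spans 1 to 5) in every Euclidean distance.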

#### Cross Validation and Picking K

In [None]:
set.seed(2000)
tune_model <- nearest_neighbor(weight_func = "rectangular", neighbors=tune())|>
        set_engine("kknn")|>
        set_mode("classification")


v_fold <- vfold_cv(players_train, v=5, strata=played_hours_group)
k_vals <- tibble(neighbors = seq(from = 1, to = 10, by = 2))

results <- workflow() |>
       add_recipe(players_recipe) |>
       add_model(tune_model) |>
        tune_grid(resamples = v_fold, grid = k_vals) |>
        collect_metrics()

accuracies <- results |> 
      filter(.metric == "accuracy")

best_k <- accuracies |>
        arrange(desc(mean)) |>
        head(1) |>
        pull(neighbors)
accuracies
best_k

#### Determine Model Accuracy

In [None]:
set.seed(2000)
# Use the K selected by cross-validation above instead of a hard-coded value
players_model <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
                set_engine("kknn") |>
                set_mode("classification")

players_fit <- workflow() |>
       add_recipe(players_recipe) |>
       add_model(players_model) |>
       fit(data = players_train)

players_predict <- predict(players_fit, players_test)|>
                    bind_cols(players_test)

players_accuracy<- players_predict|>
                    accuracy(truth=played_hours_group, estimate=.pred_class)

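# The response has four classes, so precision and recall are macro-averaged across classes
# (event_level only applies to binary outcomes)
players_precision <- players_predict |>
                    precision(truth = played_hours_group, estimate = .pred_class, estimator = "macro")

players_recall <- players_predict |>
                    recall(truth = played_hours_group, estimate = .pred_class, estimator = "macro")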

players_confmat<- players_predict|>
                    conf_mat(truth=played_hours_group, estimate=.pred_class)

players_accuracy
players_precision
players_recall
players_confmat
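
A useful reference point for the accuracy above is a majority-class baseline: a classifier that always predicts the most common group. The sketch below uses hypothetical toy labels, not the project's results, and for simplicity finds the majority class in the same vector it scores; in a real comparison the majority class would come from the training set and be scored on the test set.

```r
# Hypothetical toy test-set labels (not the project's results).
toy_truth <- factor(c("None", "None", "None", "Low", "Low", "Medium", "High", "None"),
                    levels = c("None", "Low", "Medium", "High"))

# The majority classifier always predicts the most frequent class.
class_counts <- table(toy_truth)
majority_class <- names(class_counts)[which.max(class_counts)]

# Its accuracy is simply the share of observations in that class.
baseline_accuracy <- max(class_counts) / length(toy_truth)

majority_class
baseline_accuracy
```

A KNN accuracy close to this baseline would suggest the model is doing little more than predicting the largest group.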

#### Visualize Results

WRITE HERE: create scatter plot of actual vs predicted. Plot actual data points in age vs experience_score, colored by their true played_hours_group.


## Discussion

WRITE HERE: Conclude something along the lines of: even though we attempted to make the categories more even in size, the KNN model still primarily predicted the majority categories (None and Low). This indicates that age and experience_score are only weak predictors of played hours, especially for distinguishing the Medium and High groups. Use the accuracy, precision, recall, and graph to make this point.