# Data Science Final Project Report

In [None]:
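# tidyverse covers data wrangling and plotting; tidymodels provides the
# splitting, recipe, and K-nearest neighbours tuning functions used below
library(tidyverse)
library(dplyr)
library(tidymodels)
library(repr)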

## Introduction

### Background

Video games are increasingly being used as platforms for research, offering rich data on user behaviour in interactive environments. A research group at the University of British Columbia (UBC) has set up a customized Minecraft server to study how players interact with the game world, logging detailed information about each player's characteristics and in-game activity. These data can help address practical challenges such as server capacity planning and targeted participant recruitment by identifying patterns in user engagement. In this project, we use data from the Minecraft server to investigate whether player characteristics, specifically age and hours played, can predict whether a player is subscribed. To conduct our analysis, we use R to wrangle, clean, and visualize the data, and apply a K-nearest neighbours classifier to answer our predictive question. The findings may provide insights that support more efficient resource allocation and outreach efforts for the research team.

### Question

For this report, I decided to use question two as my guiding question, which is as follows:
        We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can
        target those players in our recruiting efforts.
From this question, I am interested in whether a player's playing time (in hours) and age can predict whether the player is subscribed. If they can, these predictions can help identify which types of players are most likely to contribute the most data by being subscribed to Minecraft and willing to pay to keep playing.

### Data Description

The data set used to answer the question comes from the file players.csv and contains 196 observations of 7 variables:
##### character data type
- experience
- hashedEmail
- name
- gender
##### double data type
- Age
- played_hours
##### logical data type
- subscribe

The variables needed for this analysis are Age, played_hours, and subscribe. The Age variable gives the age of each player, while the played_hours variable gives the time each player has spent in the game, in hours. Finally, the subscribe variable records whether a player is subscribed to the game or not. There are no issues with the variables I will be using; therefore, once I reduce the data to only the variables I need, it will be ready to use in my analyses. The data set can be viewed below:

In [None]:
players_data <- read_csv("data/players.csv")
players_data
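
One way to back up the claim that these variables need no cleaning is to check the column types and count missing values. The sketch below does this on a small made-up tibble (`toy_players` is hypothetical and its values are not taken from players.csv); the same two steps could be applied to `players_data`.

```r
library(dplyr)

# Made-up rows for illustration only; not taken from players.csv
toy_players <- tibble(
  Age = c(17, NA, 25),
  played_hours = c(0.3, 1.5, 4.0),
  subscribe = c(TRUE, FALSE, TRUE)
)

# Class of each column (e.g. "numeric", "logical")
col_types <- sapply(toy_players, class)

# Number of missing values in each column
missing_counts <- toy_players |>
  summarize(across(everything(), ~ sum(is.na(.x))))

col_types
missing_counts
```

If any count were non-zero for a predictor, those rows would need to be handled before modelling.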

## Method

The first step of the analysis is to load the data set into R, which was done above using the read_csv function. The next step is to keep only the columns of interest, which are Age, played_hours, and subscribe. This was done using the select function. Furthermore, every player with more than 10 hours of playing time is subscribed regardless of age, so we filter for playing times under 10 hours to get a better understanding of how Age and played_hours together relate to subscription status. (A graph with all the values would also suit K-nearest neighbours poorly, because a few very large playing times stretch the scale and crowd the remaining points together.)

In [None]:
# Keep only the columns used in the analysis, then drop players with 10 or more hours
players_data <- players_data |>
    select(Age, played_hours, subscribe) |>
    filter(played_hours < 10)

players_data
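
The 10-hour cutoff rests on the observation that every player above it is subscribed. A check along the following lines could confirm that before filtering. This is only a sketch on made-up rows (`toy_players` and its values are hypothetical); on the real data it would need to run on the unfiltered table.

```r
library(dplyr)

# Made-up rows for illustration only; not taken from players.csv
toy_players <- tibble(
  Age = c(17, 21, 25, 30),
  played_hours = c(0.3, 12.5, 4.0, 48.0),
  subscribe = c(FALSE, TRUE, TRUE, TRUE)
)

# Among players with at least 10 hours, count them and check whether all subscribe
heavy_players <- toy_players |>
  filter(played_hours >= 10) |>
  summarize(n_players = n(), all_subscribed = all(subscribe))

heavy_players
```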

The next step is to make a graph showing the relationship between the variables. The best option was a scatterplot with Age on the x axis and played_hours on the y axis, with the points coloured by whether the player is subscribed or not. This was done using ggplot and geom_point, and I assigned the graph to an object called players_plot.

In [None]:
players_plot <- players_data |>
    ggplot(aes(x = Age, y = played_hours, color = subscribe)) +
    geom_point() +
    labs(x = "Age of player (in years)", y = "Time spent playing (in hours)", color = "Subscription Status") +
    ggtitle("Age of players versus time spent playing Minecraft (in hours), coloured by subscription status")
players_plot
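
The two axes of this plot cover very different ranges, which matters for K-nearest neighbours because the method is based on distances between points. The sketch below uses three made-up players (`toy` is hypothetical, not from players.csv) to show that on the raw scale the age gap dominates the distance, and that standardizing each column to mean 0 and standard deviation 1 puts both variables on a comparable footing.

```r
# Made-up players for illustration only; not taken from players.csv
toy <- data.frame(
  Age = c(17, 45, 19),
  played_hours = c(0.5, 0.6, 9.0)
)

# Euclidean distances between players on the raw scale
raw_dist <- as.matrix(dist(toy))

# Standardize each column: subtract its mean, divide by its standard deviation
toy_scaled <- scale(toy)

# Euclidean distances after standardization
scaled_dist <- as.matrix(dist(toy_scaled))

raw_dist
scaled_dist
```

On the raw scale, player 1 is much farther from player 2 (similar hours, different age) than from player 3 (similar age, very different hours), simply because age is measured in larger numbers.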

In [None]:
# Next we will create two sample mean distributions of

Now that a summary and visualization of the data have been made for exploratory data analysis, we can split our data into training and testing sets, and tune our model to find the best K value using 5-fold cross-validation.

In [None]:
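# Requires library(tidymodels) for the functions used in this cell
set.seed(2025)

# Classification needs a factor outcome, so convert the logical subscribe column
players_data <- players_data |>
    mutate(subscribe = as.factor(subscribe))

# Split into 75% training and 25% testing data, stratified by subscription status
players_split <- initial_split(players_data, prop = 0.75, strata = subscribe)
players_training <- training(players_split)
players_testing <- testing(players_split)

# Build the recipe on the training data only, so the testing data stay unseen
players_recipe <- recipe(subscribe ~ Age + played_hours, data = players_training) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# K-nearest neighbours specification with the number of neighbours left to tune
players_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# 5-fold cross-validation on the training data
players_vfold <- vfold_cv(players_training, v = 5, strata = subscribe)

# Candidate K values: 1, 6, 11, ..., 96
k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))

# Tune K across the folds and collect the resulting metrics
players_wrk <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(players_spec) |>
    tune_grid(resamples = players_vfold, grid = k_vals) |>
    collect_metrics()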

accuracies <- players_wrk |>
  filter(.metric == "accuracy")

accuracies
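
A natural follow-up is to pull out the K with the highest cross-validated accuracy. The sketch below shows one way to do that with `slice_max`. The `toy_accuracies` tibble and every number in it are placeholders invented to demonstrate the step; they are not results from this analysis. It mimics only the `neighbors`, `.metric`, and `mean` columns of the table above.

```r
library(dplyr)

# Placeholder values for illustration only; NOT results from this report
toy_accuracies <- tibble(
  neighbors = c(1, 6, 11, 16),
  .metric = "accuracy",
  mean = c(0.60, 0.68, 0.72, 0.70)
)

# Keep the row with the highest mean accuracy and extract its K
best_k <- toy_accuracies |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

best_k
```

Setting `with_ties = FALSE` returns a single row even when several K values share the top accuracy.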

    