# Data Science Final Project Report

In [None]:
library(tidyverse)
library(dplyr)
library(repr)
library(tidymodels)

## Introduction

### Background

Video games are increasingly being used as platforms for research, offering rich data on user behaviour in interactive environments. A research group at UBC has set up a customized Minecraft server to study how players interact with the game world, logging detailed information about each player's characteristics and in-game activity. These data can help address practical challenges such as server capacity planning and targeted participant recruitment by identifying patterns in user engagement. In this project, we use data from the Minecraft server to investigate whether player characteristics, specifically age and hours played on the server, can predict whether a player is subscribed. To conduct our analysis, we use R to wrangle, clean, and visualize the data, and apply a K-nearest neighbours (KNN) classification model to answer our predictive question. The findings may provide insights that support more efficient resource allocation and outreach efforts for the research team.

### Question

For this report, I decided to use question two as my guiding question, which is as follows:
        We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can
        target those players in our recruiting efforts.
From this question, I am interested in whether a player's playing time (in hours) and age can predict whether the player is subscribed. If they can, these predictions can help show which types of players are most likely to contribute the most data by being subscribed to Minecraft and willing to pay to keep playing. I believe that age and playing time can be used to predict subscription because both are numerical variables, which makes them suitable predictors for KNN classification.

### Data Description

The data set that was used to answer the question is from the file players.csv, and it contains 196 observations. It has 7 variables:
##### character data type
- experience
- hashedEmail
- name
- gender
##### double data type
- Age
- played_hours
##### logical data type
- subscribe

The variables needed for this analysis are subscribe, played_hours, and Age. The Age variable describes the age of players, while the played_hours variable shows the time players spend in the game in hours. Finally, the subscribe variable describes whether a player is subscribed to the game or not. The only issue with these variables is that Age has missing values, and those rows are removed before modelling; therefore, once I modify the data to contain only the variables I need, it will be ready to use in my analyses. The data set can be viewed below:

In [None]:
players_data <- read_csv("data/players.csv")
players_data
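
The modelling code later in this report filters out rows where Age is missing, so a quick count of missing values per column is a useful check at this stage. The sketch below uses a small made-up data frame (`toy_players` is a hypothetical stand-in, not the real players.csv) to show one way to do that in base R.

```r
# Toy stand-in for players.csv; these values are made up for illustration.
toy_players <- data.frame(
  Age          = c(17, 21, NA, 25, 19),
  played_hours = c(0.5, 3.2, 0.0, 1.1, 7.4),
  subscribe    = c(TRUE, TRUE, FALSE, TRUE, FALSE)
)

# Count the missing values in each column.
na_counts <- colSums(is.na(toy_players))
print(na_counts)

# Keep only rows with a recorded Age, mirroring filter(!is.na(Age)).
toy_complete <- toy_players[!is.na(toy_players$Age), ]
print(nrow(toy_complete))
```

The same check could be run on `players_data` right after it is read in.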

## Method

The first step of the analysis is to load the data set into R, which was done above using the read_csv function. The next step is to select only the columns of interest, which are Age, played_hours, and subscribe. This was done using the select function. Furthermore, above 10 hours of playing time all players are subscribed regardless of age, so we filter for playing times under 10 hours to better understand how Age and played_hours together relate to subscription status. (A graph with all the values would also suit K-nearest neighbours poorly, because the few very large playing times stretch the scale and leave the points badly spaced.) Finally, subscribe is converted to a factor so that it can be used as the class label.

In [None]:
players_data <- players_data |>
    select(Age, played_hours, subscribe) |>
    filter(played_hours<10) |>
    mutate(subscribe = as_factor(subscribe))

players_data
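
One way to check a cutoff like the 10-hour threshold is to compare the proportion of subscribed players on each side of it. The sketch below does this on made-up values (`toy_players` and its numbers are hypothetical and only illustrate the calculation; they are not results from the server data).

```r
# Made-up toy values, not the real server data.
toy_players <- data.frame(
  played_hours = c(0.1, 0.5, 2.0, 8.0, 12.0, 30.5, 150.0),
  subscribe    = c(FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE)
)

# Label each player by whether they fall at or above the 10-hour cutoff.
toy_players$group <- ifelse(toy_players$played_hours >= 10,
                            "10+ hours", "under 10 hours")

# Proportion subscribed in each group (the mean of a logical is the share of TRUE).
prop_subscribed <- tapply(toy_players$subscribe, toy_players$group, mean)
print(prop_subscribed)
```

Running the same calculation on the unfiltered data would show whether the group at or above 10 hours really is fully subscribed.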

The next step is to make a graph showing the relationship between the variables. The best method was a scatterplot with Age on the x axis and played_hours on the y axis, with the points coloured by whether the player is subscribed or not. This was done using ggplot and geom_point, and I assigned the graph to an object called players_plot.

In [None]:
players_plot <- players_data |>
    ggplot(aes(x = Age, y = played_hours, color = subscribe)) +
    geom_point() +
    labs(x = "Age of player (in years)", y = "Time spent playing (in hours)", color = "Subscription status") +
    ggtitle("Relationship between age of players and their time spent playing Minecraft (in hours)")
players_plot

This initial visualization shows that age and playing time might not be very good at predicting subscription status, especially using KNN classification, as the points get very spread out at the top of the graph. This could become a source of error once we implement the KNN classification algorithm. The graph also shows that the relationship between the two variables is not linear and does not seem to be very strong either.
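
Because the scatterplot is coloured by class, it is also worth checking how balanced the two classes are. The sketch below uses made-up labels (the vector `subscribe` here is hypothetical, not the real column) to compute the share of each class and the accuracy a classifier would reach by always guessing the most common class.

```r
# Made-up labels for illustration; the real counts come from players_data.
subscribe <- factor(c("TRUE", "TRUE", "TRUE", "FALSE", "TRUE",
                      "FALSE", "TRUE", "TRUE", "FALSE", "TRUE"))

# Share of each class.
class_props <- prop.table(table(subscribe))
print(class_props)

# A classifier that always guesses the most common class would reach this accuracy.
baseline_accuracy <- max(class_props)
print(baseline_accuracy)
```

Comparing a model's test accuracy against this kind of baseline gives a sense of how much the predictors actually add.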

Now that a summary and visualization of the data have been made for exploratory data analysis, we can begin creating the model. The first step is to set a seed so the results are reproducible. Then we split the data into a training set and a testing set so we can evaluate how accurate our model is. We create a recipe that standardizes the predictors and a KNN model specification, combine them in a workflow, and use 5-fold cross-validation to try values of K from 1 up to 100 in steps of 5 (1, 6, ..., 96). We then pull the K value with the highest mean cross-validation accuracy to decide which K to use.

In [None]:
set.seed(2025)
players_split <- initial_split(players_data, prop = 0.75, strata = subscribe)
players_training <- training(players_split) |>
    filter(!is.na(Age))
players_testing <- testing(players_split) |>
    filter(!is.na(Age))

players_recipe <- recipe(subscribe ~ Age + played_hours, data = players_training) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

players_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

players_vfold <- vfold_cv(players_training, v = 5, strata = subscribe)

k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))

# Tune K with cross-validation; note the argument name is `resamples`.
players_wrk <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(players_spec) |>
    tune_grid(resamples = players_vfold, grid = k_vals) |>
    collect_metrics()

accuracy_of_neighbors <- players_wrk |>
  filter(.metric == "accuracy")

best_k <- accuracy_of_neighbors |>
        arrange(desc(mean)) |>
        head(1) |>
        pull(neighbors)
best_k
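
To make concrete what the tuned model does for a single value of K, here is a minimal base R sketch of unweighted KNN classification: standardize the predictors, measure Euclidean distance to every training point, and take a majority vote among the K closest. The training values and the helper `knn_predict_one` are hypothetical and written only for illustration; the report itself relies on the kknn engine through tidymodels.

```r
# Made-up training data for illustration only.
train <- data.frame(
  Age          = c(17, 18, 20, 22, 25, 30, 35, 40),
  played_hours = c(0.2, 1.5, 0.1, 4.0, 0.3, 6.5, 0.0, 2.2),
  subscribe    = c("TRUE", "TRUE", "FALSE", "TRUE",
                   "FALSE", "TRUE", "FALSE", "TRUE")
)

# Hypothetical helper: classify one new point by majority vote of its k nearest neighbours.
knn_predict_one <- function(train, new_point, k) {
  predictors <- c("Age", "played_hours")

  # Standardize using the training mean and standard deviation of each predictor.
  centers <- colMeans(train[, predictors])
  scales  <- apply(train[, predictors], 2, sd)
  train_z <- scale(train[, predictors], center = centers, scale = scales)
  new_z   <- (unlist(new_point[predictors]) - centers) / scales

  # Euclidean distance from the new point to every training row.
  dists <- sqrt(rowSums(sweep(train_z, 2, new_z)^2))

  # Majority vote among the k closest rows (equal weight for each neighbour).
  nearest <- order(dists)[seq_len(k)]
  votes   <- table(train$subscribe[nearest])
  names(votes)[which.max(votes)]
}

prediction <- knn_predict_one(train, data.frame(Age = 30, played_hours = 0.5), k = 3)
print(prediction)
```

Standardizing first matters because Age and played_hours are on different scales, and an unscaled distance would be dominated by whichever variable has the larger spread.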

    

Now that we have found the best K value, we can train our model using that K value on our training data.

In [None]:
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(knn_spec) |>
  fit(data = players_training)

players_predictions <- predict(knn_fit, players_testing) |>
  bind_cols(players_testing)
players_predictions

players_predictions |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy")
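
Accuracy alone does not show which kind of mistake a classifier makes, so a confusion matrix is a useful companion to it. The sketch below builds one in base R from made-up truth and prediction vectors (both hypothetical, not the report's results) and computes accuracy and per-class recall from it.

```r
# Made-up truth and predictions for illustration only.
truth <- factor(
  c("TRUE", "TRUE", "FALSE", "TRUE", "FALSE", "TRUE", "FALSE", "TRUE"),
  levels = c("FALSE", "TRUE")
)
predicted <- factor(
  c("TRUE", "TRUE", "TRUE", "TRUE", "FALSE", "TRUE", "TRUE", "FALSE"),
  levels = c("FALSE", "TRUE")
)

# Rows are predicted classes, columns are true classes.
conf <- table(predicted = predicted, truth = truth)
print(conf)

# Accuracy: share of all predictions that are correct.
accuracy <- sum(diag(conf)) / sum(conf)

# Recall per class: correct predictions divided by the true count of that class.
recall <- diag(conf) / colSums(conf)
print(accuracy)
print(recall)
```

With the objects in this notebook, the same kind of table should be available from yardstick via `players_predictions |> conf_mat(truth = subscribe, estimate = .pred_class)`.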

Now that we have measured the accuracy of our model on the testing data, we can create a new observation and predict its class using our workflow. We can then add the new observation to the scatterplot to visually demonstrate the analysis we have performed and see how well the prediction worked.

In [None]:
new_player <- tibble(Age = 30, played_hours = 0.5)
player_predicted <- predict(knn_fit, new_player)
player_predicted


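# Plot the new observation once, using its own data rather than the plot's data.
players_plot +
    geom_point(data = new_player, aes(x = Age, y = played_hours),
               color = "black", size = 4, inherit.aes = FALSE)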

## Discussion

In conclusion, this analysis suggests that age and playing time can be used to predict the subscription status of a player. However, KNN classification might not be the best algorithm for these data: the scatterplot shows many more subscribed than unsubscribed players, so the model tended to predict too many observations in the test data as TRUE. In future work, collecting more observations with a FALSE subscription status could help the KNN classifier perform better. Overall, the accuracy of the model was 75%, which is reasonable, and the best K was chosen to be 21. When the model was given a new observation to predict, it did a fairly good job, and its prediction aligns well with where the point lies on the scatterplot, as seen above.

This conclusion is similar to what I expected to find, as I predicted in the beginning that these two variables would be able to predict subscription status; however, I was surprised that the accuracy of the model was somewhat low even after tuning to find the best K.

I think the findings of this analysis could help the UBC computer science team choose which age groups to target for their studies, and cross-reference age with playing time to see which players are more likely to subscribe to the game and continue playing after the study. This could help them find players who are committed to continuing to play Minecraft; therefore, it addresses the overarching problem of finding players who will keep contributing the most data to the study.

This analysis might lead to new questions, such as whether other variables in the data set are related to subscription status and whether the KNN classification algorithm can detect those relationships. Furthermore, researchers could look into unsupervised methods of analysis, such as asking whether there are any subgroups within the data used in this analysis, which could be approached using the k-means clustering algorithm.
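
As a rough illustration of the k-means idea, the sketch below clusters a small made-up data frame on standardized Age and played_hours using base R's `kmeans`. The values in `toy_players` and the choice of two clusters are assumptions made for the example, not findings from the server data.

```r
set.seed(2025)

# Made-up toy values, not the real server data.
toy_players <- data.frame(
  Age          = c(17, 18, 19, 21, 35, 38, 40, 42),
  played_hours = c(0.2, 0.5, 1.0, 0.3, 6.0, 7.5, 5.5, 8.0)
)

# Standardize both variables so neither dominates the distance calculation.
toy_scaled <- scale(toy_players)

# Fit k-means with 2 clusters; nstart = 10 tries several random starts.
clusters <- kmeans(toy_scaled, centers = 2, nstart = 10)

# Attach the cluster label to each player.
toy_players$cluster <- factor(clusters$cluster)
print(toy_players)
```

In practice the number of clusters would itself need to be chosen, for example by comparing the total within-cluster sum of squares across several values of k.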