In [1]:
library(tidyverse)
library(repr)
library(tidymodels)
library(cowplot)
options(repr.matrix.max.rows = 6)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
✔ broom        1.0.6     ✔ rsample

In [8]:
PLAYERS_DATA_URL <- "https://raw.githubusercontent.com/Michael-R-Dickinson/DSCI-100-individual-project/refs/heads/main/players.csv"

# Download Data
download.file(PLAYERS_DATA_URL, "players.csv")

# Load Data
players_df <- read_csv("players.csv")

# Clean Data

# Remove rows with a missing Age, rename columns to snake_case,
# and set the level order of experience for visualization
players_df <- players_df |>
    filter(!is.na(Age)) |>
    mutate(
        age = Age,
        subscribe = as.factor(subscribe),
        hashed_email = hashedEmail,
        experience = factor(
            experience,
            levels = c("Beginner", "Amateur", "Regular", "Veteran", "Pro")
        )
    ) |>
    select(-Age, -hashedEmail)



Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [9]:
# Convert experience to its numeric level code (1 = Beginner, ..., 5 = Pro)
players_df <- players_df |>
    mutate(experience = as.numeric(experience))

predictors_df <- players_df |> 
    select(played_hours, age, experience, subscribe)

In [12]:
# Histograms of each predictor, filled by subscription status

# for played hours 
options(repr.plot.width = 8, repr.plot.height = 6)
hours_his <- predictors_df |> ggplot(aes(x = played_hours, fill = subscribe)) +
    geom_histogram(bins = 12) +
    scale_x_log10() + 
    labs(
            x = "Hours Played",
            y = "Number of Players",
            fill = "Subscribed",
            title = "Played Hours Colored by Subscription"
    ) + 
    theme(text = element_text(size = 15))

# for age
options(repr.plot.width = 8, repr.plot.height = 6)
age_his <- predictors_df |> ggplot(aes(x = age, fill = subscribe)) +
    geom_histogram(bins = 12) +
    labs(
            x = "Player Age",
            y = "Number of Players",
            fill = "Subscribed",
            title = "Player Age Colored by Subscription"
    ) + 
    # Pseudo-log scaling on y: roughly logarithmic for large counts but still
    # defined at 0, where a plain log scale would give log(0) = -Inf
    scale_y_continuous(trans = scales::pseudo_log_trans()) + 
    theme(text = element_text(size = 15))

# for experience
options(repr.plot.width = 8, repr.plot.height = 6)
experience_his <- predictors_df |> ggplot(aes(x = experience, fill = subscribe)) +
    geom_histogram(bins = 5) + 
    labs(
            x = "Experience Level",
            y = "Number of Players",
            fill = "Subscribed",
            title = "Player Experience Colored by Subscription"
    ) + 
    theme(text = element_text(size = 15))

Using Age, Experience, and Played Hours to Predict Minecraft Newsletter Subscription Status
-

**DSCI 100 010 Group 6** 

**Sayyam Arora, Michael Dickinson, Cecile Nava, Zoey Qiu** 

------------

### Introduction

The Pacific Laboratory for Artificial Intelligence (PLAI), a research group in UBC’s Computer Science department led by Frank Wood, is interested in understanding which player characteristics and behaviours are most predictive of subscribing to a Minecraft game newsletter. Group 6 asks whether age (`Age`), experience (`experience`), and played hours (`played_hours`) are reliable predictors of subscription status (`subscribe`) in `players.csv`. Although we use only those three variables as predictors, the file also contains `gender`, `name`, and `hashedEmail`, for a total of 7 columns and 196 rows. The 7 variables are described below.


| # | Variable     | Description                                        | Variable Type |
|---|--------------|----------------------------------------------------|---------------|
| 1 | experience   | experience status of player                        | character     |
| 2 | subscribe    | whether player is subscribed to the game newsletter | logical       |
| 3 | hashedEmail  | player's hashed email to identify them             | character     |
| 4 | played_hours | how long player has played (in hours)              | double        |
| 5 | name         | name of player                                     | character     |
| 6 | gender       | gender of player                                   | character     |
| 7 | Age          | age of player                                      | double        |

 

Based on data exploration, the mean played time is 5.8 hours; one player logged 223.1 hours, while some did not reach a single full hour. The mean age of the 196 recorded players is 20, with the oldest being 50 and the youngest 8 years old.

### Methods Used


To explore whether a Minecraft player’s age, experience level, and hours played could predict whether they subscribe to the game’s newsletter, we used a K-Nearest Neighbors (KNN) classification model.

**Data Preparation**

We began by importing the dataset `players.csv`, which contains 196 observations and 7 variables, including the target variable `subscribe`. Only the relevant predictor variables (`age`, `played_hours`, and `experience`) were selected for analysis. We removed entries with a missing age and reformatted variable types as needed: `subscribe` became a factor, and `experience` became a factor with levels ordered from Beginner to Pro, which we then converted to its numeric level code. Irrelevant identifiers such as `name` and `hashedEmail` were excluded.
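As a small illustration of the factor-to-number step, the sketch below uses made-up experience values (not rows from `players.csv`) together with the same level order as the cleaning cell. It shows that `as.numeric()` on a factor returns each value's position in `levels`, so the explicit level order is what makes the resulting codes meaningful.

```r
# Hypothetical toy values for illustration only, not rows from players.csv
experience_raw <- c("Pro", "Beginner", "Regular", "Amateur", "Veteran")

# Same level order as in the cleaning cell
experience_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

experience_fct <- factor(experience_raw, levels = experience_levels)

# as.numeric() on a factor returns the position of each value in `levels`
experience_code <- as.numeric(experience_fct)
print(experience_code)
# [1] 5 1 3 2 4
```

Without the explicit `levels` argument, R would order the levels alphabetically, and the codes would no longer run from least to most experienced.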

**Exploring the Data**

Before building our model, we took a closer look at the data to understand the general trends. Most players were around 20 years old, with the youngest being 8 and the oldest 50. The time spent playing Minecraft varied widely: some players had barely played at all, while one had logged over 220 hours. This step helped us understand the range and behaviour of our predictors.
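Summary figures of this kind can be computed with `summarise()`. The sketch below is a minimal example on a hypothetical toy tibble (`toy_players` and its values are invented for illustration and are not the real data); the column names mirror the cleaned data frame.

```r
library(dplyr)

# Hypothetical toy rows for illustration only, not the real players.csv values
toy_players <- tibble(
  played_hours = c(0.0, 0.3, 1.5, 12.0),
  age = c(17, 21, NA, 30)
)

# na.rm = TRUE skips missing ages instead of returning NA
toy_summary <- toy_players |>
  summarise(
    mean_hours = mean(played_hours),
    max_hours = max(played_hours),
    mean_age = mean(age, na.rm = TRUE),
    min_age = min(age, na.rm = TRUE),
    max_age = max(age, na.rm = TRUE)
  )

print(toy_summary)
```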

**Building the Model**

To train and test our model fairly, we split the data into two parts: one for training the model and one for testing how well it performs. We used a method called cross-validation to figure out how many neighbours the KNN model should consider when making a prediction. This helped us find the best version of the model.
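The split-then-tune procedure can be sketched with tidymodels roughly as follows. This is an illustrative outline rather than the exact code behind our results: the data are synthetic, and the seed, the 75/25 split, the 5 folds, and the grid of candidate `neighbors` values are all assumptions made for the example. It also assumes the `kknn` package is installed.

```r
library(tidymodels)

set.seed(1) # hypothetical seed, chosen only for reproducibility of the sketch

# Synthetic stand-in data with the same column names as predictors_df
toy_df <- tibble(
  played_hours = rexp(200, rate = 0.2),
  age = round(runif(200, 8, 50)),
  experience = sample(1:5, 200, replace = TRUE),
  subscribe = factor(sample(c(TRUE, FALSE), 200, replace = TRUE, prob = c(0.7, 0.3)))
)

# Assumed 75/25 train/test split, stratified by the class label
toy_split <- initial_split(toy_df, prop = 0.75, strata = subscribe)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Standardize predictors so no single variable dominates the distance
knn_recipe <- recipe(subscribe ~ played_hours + age + experience, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# KNN specification with the number of neighbours left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# Assumed 5-fold cross-validation over an assumed grid of odd k values
toy_folds <- vfold_cv(toy_train, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

cv_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = toy_folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# Pick the k with the highest mean cross-validated accuracy
best_k <- cv_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)
```

Only the training portion is used for cross-validation, so the test set stays untouched until the final evaluation.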

**Evaluating the Model**

After building the model, we tested it on the unseen portion of the data to see how accurate it was. The model correctly predicted whether a player subscribed about 72.5% of the time. Although this accuracy looks solid, the model did not consistently perform significantly better than chance, likely because the characteristics of subscribed and unsubscribed players were quite similar.
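One common way to make a comparison with chance concrete is to set the model's accuracy beside a majority-class baseline, that is, the accuracy of always predicting the most frequent label. The sketch below uses made-up labels and predictions (`truth` and `predicted` are hypothetical and are not our test-set results) to show the calculation.

```r
# Hypothetical labels and predictions for illustration only,
# not the report's actual test-set results
truth <- factor(c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE))
predicted <- factor(c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE))

# Share of predictions that match the true label
model_accuracy <- mean(predicted == truth)

# Accuracy of always guessing the most common class
majority_class <- names(which.max(table(truth)))
baseline_accuracy <- mean(truth == majority_class)

print(c(model = model_accuracy, baseline = baseline_accuracy))
```

When the two numbers are close, a seemingly solid accuracy mostly reflects class imbalance rather than what the model learned from the predictors.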