# Project







## Link to GitHub repository
https://github.com/X9126/dsci-100-2025ss-individual/tree/main

## Introduction
The video game industry has seen major growth, with players of all ages joining in. To keep players engaged, developers and marketers are paying closer attention to player behavior, especially when it comes to game newsletters. Understanding which types of players are most likely to subscribe can help tailor communication strategies and boost long-term player retention.



## Question 
Can `experience` and `played_hours` predict newsletter subscription (`subscribe`) in the players dataset?


## Data Description
This analysis uses the `players.csv` dataset, which contains 196 records and 7 variables describing individual game players. The dataset was provided by a research group at UBC studying player behavior in a Minecraft server environment. It includes both categorical and numerical variables:

- `experience`: categorical; the player's self-reported gaming experience (`Amateur`, `Beginner`, `Pro`, `Regular`, or `Veteran`).
- `subscribe`: logical; whether the player subscribed to a game-related newsletter.
- `played_hours`: numerical; the total hours played.
- `Age`: numerical; the player's age.
- `gender`: categorical; the player's gender identity.
- `name` and `hashedEmail`: identifier variables, excluded from the analysis.

Among the numerical variables, `played_hours` ranges from 0 to 223.1 hours, with a mean of about 5.85 and a median of 0.1, indicating a strongly right-skewed distribution: most players logged almost no time, while a few logged a great deal. The `Age` variable ranges from 8 to 29, with an average of about 17.3 and 2 missing values that would need to be handled before `Age` could be used in a model. This project uses `experience` and `played_hours` as explanatory variables to predict the binary target variable `subscribe`; `gender` and `Age` are not included in the model. The categorical variable `experience` will require encoding during analysis. The dataset is suitable for both prediction and player-type comparison, as `experience` can also serve as a grouping variable. Overall, the data is clean, organized, and mostly complete, with only minimal preprocessing needed before modeling.

## Methods & Results

In [2]:
library(tidyverse)    
library(tidymodels) 


In [34]:
players <- read_csv("players.csv")

clean_players <- players |>
  select(experience, played_hours, subscribe) |>
  mutate(subscribe = as.factor(subscribe)) 

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [35]:
head(clean_players)

| experience | played_hours | subscribe |
|---|---|---|
| `<chr>` | `<dbl>` | `<fct>` |
| Pro | 30.3 | True |
| Veteran | 3.8 | True |
| Veteran | 0.0 | False |
| Amateur | 0.7 | True |
| Regular | 0.1 | True |
| Amateur | 0.0 | True |


In [36]:
clean_players |>
  group_by(experience) |>
  summarize(count = n())

| experience | count |
|---|---|
| `<chr>` | `<int>` |
| Amateur | 63 |
| Beginner | 35 |
| Pro | 14 |
| Regular | 36 |
| Veteran | 48 |


In [45]:
clean_players |>
  summarize(
    min_hours = min(played_hours),
    max_hours = max(played_hours),
    mean_hours = mean(played_hours),
    median_hours = median(played_hours))

| min_hours | max_hours | mean_hours | median_hours |
|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 0 | 223.1 | 5.845918 | 0.1 |


In [46]:
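# Stratified 80/20 train/test split; the split is random, so results vary
# between runs unless a seed is set beforehand
players_split <- initial_split(clean_players, prop = 0.8, strata = subscribe)
players_training <- training(players_split)
players_testing  <- testing(players_split)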
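
Because `initial_split()` draws rows at random, the split (and every metric computed from it) changes between runs unless a seed is fixed first. The sketch below shows this on a made-up data frame. The names `toy_players` and `split_with_seed`, and the seed value `2025`, are illustrative assumptions and are not part of the original analysis.

```r
library(rsample)

# Hypothetical stand-in for clean_players (not the real data)
toy_players <- data.frame(
  played_hours = (0:99) / 10,
  subscribe = factor(rep(c("TRUE", "TRUE", "TRUE", "FALSE"), times = 25))
)

# Hypothetical helper: fix the seed, then draw a stratified 80/20 split
split_with_seed <- function(seed) {
  set.seed(seed)
  initial_split(toy_players, prop = 0.8, strata = subscribe)
}

train_a <- training(split_with_seed(2025))
train_b <- training(split_with_seed(2025))

# The same seed selects the same training rows
identical(train_a, train_b)
```

In a notebook, calling `set.seed()` once near the top is usually enough to make the downstream split and results repeatable.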

In [47]:
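# Preprocessing: standardize played_hours (scale to sd 1, then center to mean 0);
# experience is passed through without an explicit encoding step
players_recipe <- recipe(subscribe ~ experience + played_hours, data = players_training) |>
  step_scale(played_hours) |>
  step_center(played_hours)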
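
K-nearest neighbors relies on distances between observations, so it helps to see exactly which numeric columns the model receives. The recipe above standardizes only `played_hours` and does not specify how the categorical `experience` column becomes numeric. One way to make that step explicit is sketched below, using `step_dummy()` and `step_normalize()` from recipes on a small made-up data frame (`toy_players`, an assumption for illustration). This is not the preprocessing used for the reported results.

```r
library(recipes)

# Hypothetical toy data standing in for players_training (not the real data)
toy_players <- data.frame(
  experience = factor(c("Amateur", "Pro", "Veteran", "Amateur", "Pro", "Veteran")),
  played_hours = c(0.0, 30.3, 3.8, 0.7, 1.2, 0.1),
  subscribe = factor(c("TRUE", "TRUE", "FALSE", "TRUE", "FALSE", "TRUE"))
)

toy_recipe <- recipe(subscribe ~ experience + played_hours, data = toy_players) |>
  step_dummy(experience) |>        # one 0/1 column per non-reference level
  step_normalize(played_hours)     # center to mean 0 and scale to sd 1

# Estimate the preprocessing on the toy data and return the processed data
baked <- toy_recipe |>
  prep() |>
  bake(new_data = NULL)

names(baked)
```

Inspecting the baked data this way shows the columns that would enter the distance calculation: the standardized `played_hours` plus one indicator column for each non-reference `experience` level.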

In [48]:
# K-nearest neighbors classifier with K = 5 and unweighted ("rectangular") votes
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

In [49]:
players_fit <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(knn_spec) |>
  fit(data = players_training)

In [50]:
players_predictions <- predict(players_fit, players_testing) |>
  bind_cols(players_testing)

In [51]:
metrics(players_predictions, truth = subscribe, estimate = .pred_class)

| .metric | .estimator | .estimate |
|---|---|---|
| `<chr>` | `<chr>` | `<dbl>` |
| accuracy | binary | 0.5 |
| kap | binary | 0.02200489 |


In [52]:
conf_mat(players_predictions, truth = subscribe, estimate = .pred_class)

          Truth
Prediction FALSE TRUE
     FALSE     6   15
     TRUE      5   14
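
The accuracy reported by `metrics()` can be recomputed by hand from the confusion matrix, which also makes it easy to compare the classifier against a baseline that always predicts the majority class. The sketch below copies the four counts from the table above into a base R matrix and treats `TRUE` (subscribed) as the positive class; that choice of positive class is an assumption made for illustration.

```r
# Counts copied from the confusion matrix (rows = prediction, columns = truth)
cm <- matrix(
  c(6, 15,
    5, 14),
  nrow = 2, byrow = TRUE,
  dimnames = list(Prediction = c("FALSE", "TRUE"), Truth = c("FALSE", "TRUE"))
)

total <- sum(cm)                                       # 40 test observations
accuracy <- sum(diag(cm)) / total                      # (6 + 14) / 40
precision <- cm["TRUE", "TRUE"] / sum(cm["TRUE", ])    # 14 / (14 + 5)
recall <- cm["TRUE", "TRUE"] / sum(cm[, "TRUE"])       # 14 / (14 + 15)

# Baseline: always predict the more common true class
majority_baseline <- max(colSums(cm)) / total          # 29 / 40

round(c(accuracy = accuracy, precision = precision,
        recall = recall, baseline = majority_baseline), 3)
```

With these counts, the classifier's accuracy of 0.5 falls below the 0.725 that always predicting `TRUE` would achieve on this test set.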

In [53]:
head(players_predictions)

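| .pred_class | experience | played_hours | subscribe |
|---|---|---|---|
| `<fct>` | `<chr>` | `<dbl>` | `<fct>` |
| True | Amateur | 0.5 | True |
| True | Amateur | 0.7 | True |
| True | Regular | 0.6 | True |
| False | Veteran | 0.0 | False |
| False | Veteran | 0.1 | True |
| False | Amateur | 0.0 | True |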
