In [None]:
library(tidyverse)
library(tidymodels)
library(themis)
library(repr)
library(broom)
library(infer)
library(cowplot)

options(repr.matrix.max.rows = 6)  
tidymodels_prefer()

Intro

## 1. Data Description

This project uses two datasets:

- **players.csv**: one row per player (demographics + gameplay summary)
- **sessions.csv**: one row per play session (timestamps + durations)

The cells below show each dataset's structure and number of observations, and flag potential data issues.


In [None]:
players  <- read_csv("data/players.csv", show_col_types = FALSE)
sessions <- read_csv("data/sessions.csv", show_col_types = FALSE)

dim(players)
head(players)

dim(sessions)
head(sessions)

In [None]:
glimpse(players)

players |>
  summarise(
    n_players      = n(),
    mean_age       = mean(Age, na.rm = TRUE),
    min_age        = min(Age, na.rm = TRUE),
    max_age        = max(Age, na.rm = TRUE),
    mean_hours     = mean(played_hours, na.rm = TRUE),
    min_hours      = min(played_hours, na.rm = TRUE),
    max_hours      = max(played_hours, na.rm = TRUE)
  )

players |>
  count(experience) |>
  mutate(prop = n / sum(n))

players |>
  count(gender) |>
  mutate(prop = n / sum(n))

players |>
  count(subscribe)
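
The summaries above pass `na.rm = TRUE`, which hides how many values are actually missing. A minimal sketch of counting missing values per column is shown below. It runs on a small invented table (`toy_players`) whose values and experience labels are assumptions for illustration only; the same two steps could be applied to `players`.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for players.csv: only the column names come from
# the project, every value here is invented.
toy_players <- tibble(
  Age          = c(21, NA, 17, 30),
  played_hours = c(0.5, 12.0, NA, 3.2),
  experience   = c("Pro", "Amateur", NA, "Veteran"),
  subscribe    = c(TRUE, FALSE, TRUE, TRUE)
)

# Count the NA values in every column.
missing_counts <- toy_players |>
  summarise(across(everything(), ~ sum(is.na(.x))))

# Keep only rows that are complete in the columns a model would use.
toy_complete <- toy_players |>
  filter(!is.na(Age), !is.na(played_hours), !is.na(experience))

missing_counts
nrow(toy_complete)
```

Filtering on named columns, rather than dropping every row with any `NA`, keeps players whose only missing values are in columns the model does not use.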

In [None]:
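# Count how many sessions each player has logged.
session_counts <- sessions |>
  group_by(hashedEmail) |>
  summarise(
    total_sessions = n(),
    .groups = "drop"
  )

# Attach session counts to the player table; players with no recorded
# sessions get 0, and subscribe becomes a factor for classification.
players_full <- players |>
  left_join(session_counts, by = "hashedEmail") |>
  mutate(
    total_sessions = replace_na(total_sessions, 0L),
    subscribe      = factor(subscribe, levels = c(FALSE, TRUE), labels = c("no", "yes"))
  )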

glimpse(players_full)
summary(players_full$total_sessions)

In [None]:
options(repr.plot.width = 7, repr.plot.height = 5)

ggplot(players_full, aes(x = played_hours)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
  labs(
    title = "Figure 1. Distribution of Total Played Hours",
    x = "Total played hours",
    y = "Number of players"
  ) +
  theme_minimal()

## 2. Questions

### Broad Question
What player behaviours and patterns are most predictive of long-term engagement on the server?

### Specific Question

Can the total number of sessions and total played hours predict whether a player subscribes to the newsletter?

### Why this question?

Newsletter subscription signals deeper engagement with the project. Understanding whether behaviours such as the number of sessions and total hours played predict subscription helps the research team identify which factors matter and which players are most likely to stay subscribed.


## 3. Exploratory Data Analysis and Visualization

Below are basic visualizations to understand key variables related to my predictive question.


In [None]:
players_summary <- players |>
  group_by(experience) |>
  summarise(
    mean_played_hours = round(mean(played_hours, na.rm = TRUE), 2),
    n_players = n()
  )

players_summary

In [None]:
ggplot(players_summary, aes(x = experience, y = mean_played_hours, fill = experience)) +
  geom_col() +
  labs(title = "Mean Played Hours by Experience Level",
       x = "Experience Level",
       y = "Mean Played Hours") +
  theme_minimal() +
  theme(legend.position = "none")


### Comment:
More experienced players tend to have higher total played hours. This supports using both variables as predictors for newsletter subscription.


In [None]:
ggplot(players, aes(x = played_hours)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Played Hours", x = "Played Hours", y = "Count") +
  theme_minimal()

### Comment:
The distribution of played_hours is heavily right-skewed: most players have very low hours, while a few have very high totals. This suggests we may need to scale the variable, depending on the model.
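
To put a number on how lopsided the played_hours distribution is, one option is to compare the mean with the median. The sketch below uses an invented vector (`toy_hours`), not the project data. It also shows `log1p()`, a transform that is not part of the plan in this report and is included only as one possible way to compress a long right tail.

```r
# Invented hours: most values are small, a few are very large.
toy_hours <- c(0, 0.1, 0.2, 0.5, 0.5, 1, 1.5, 2, 40, 150)

# In a right-skewed variable the mean is pulled above the median.
mean_hours   <- mean(toy_hours)
median_hours <- median(toy_hours)

# log1p(x) is log(1 + x): zeros stay valid and large values are compressed.
log_hours <- log1p(toy_hours)

c(mean = mean_hours, median = median_hours)
range(log_hours)
```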


## 4. Methods and Plan

### Proposed Method
**k-Nearest Neighbors Classification**

### Why this method?
- Only uses basic concepts such as distance and similarity
- Does not require assumptions about linear relationships
- Simple and easy to understand

### Assumptions
- Players who have similar experience/played hours will behave similarly
- Variables are on similar scales (we will normalize them)
- No missing values in the predictors

### Possible Limitations
- Must normalize numeric variables so one does not dominate the distance 
- Does not automatically show which variable is “most important”
- kNN can be slower with large datasets
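
The normalization point above can be checked on a small example. The sketch below uses four invented players (labels A to D and all values are assumptions) and compares Euclidean distances before and after standardizing each column with base R's `scale()`.

```r
# Hypothetical players; the values are made up for illustration.
toy <- data.frame(
  played_hours = c(1, 3, 40, 80),
  Age          = c(20, 45, 21, 30),
  row.names    = c("A", "B", "C", "D")
)

# Euclidean distances on the raw scale.
raw_dist <- dist(toy)

# Standardize each column to mean 0 and standard deviation 1, then recompute.
toy_scaled  <- scale(toy)
scaled_dist <- dist(toy_scaled)

raw_dist
scaled_dist
```

In this made-up example, player A's nearest neighbour is B on the raw scale but C after standardizing, so the choice of scale can change which players count as "similar".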

### Data Processing Plan
1. Clean newsletter variable (convert subscribe to a factor so it is treated as a class label).
2. Remove rows with missing values in key columns.
3. Normalize numeric predictors: played_hours, Age
4. Convert experience to a numeric scale
5. Train-test split:  
   - **80% training**, **20% test**  
6. Choose values of k to test 
7. Fit kNN model using training data.
8. Evaluate performance
9. Interpret results
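
The steps above could be sketched with tidymodels roughly as follows. This is a minimal sketch under stated assumptions, not the final analysis: the data are simulated (`toy_players`), the experience level names and their ordering are assumed, and the seed, the grid of k values, and the use of 5-fold cross-validation to compare them are illustrative choices. The `kknn` package needs to be installed for the engine used here.

```r
library(tidymodels)  # the kknn package must also be installed

set.seed(2024)  # arbitrary seed so the split and folds are reproducible

# Simulated stand-in for the cleaned player table; all values are invented.
n <- 200
exp_levels <- c("Beginner", "Amateur", "Regular", "Pro", "Veteran")  # assumed order
toy_players <- tibble(
  played_hours = rexp(n, rate = 0.2),
  Age          = sample(10:50, n, replace = TRUE),
  experience   = sample(exp_levels, n, replace = TRUE),
  subscribe    = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.3, 0.7)))
) |>
  # Map the assumed ordering of experience levels to 1..5.
  mutate(experience_num = as.integer(factor(experience, levels = exp_levels)))

# 80/20 split, stratified on the outcome.
player_split <- initial_split(toy_players, prop = 0.8, strata = subscribe)
player_train <- training(player_split)
player_test  <- testing(player_split)

# Standardize predictors inside the recipe so the scaling is learned
# from the training data only.
knn_recipe <- recipe(subscribe ~ played_hours + Age + experience_num,
                     data = player_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_workflow <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec)

# Candidate k values, compared with 5-fold cross-validation on the training set.
k_grid <- tibble(neighbors = seq(1, 15, by = 2))
folds  <- vfold_cv(player_train, v = 5, strata = subscribe)
cv_results <- tune_grid(knn_workflow, resamples = folds, grid = k_grid,
                        metrics = metric_set(accuracy))

best_k <- select_best(cv_results, metric = "accuracy")

# Refit on the full training set with the chosen k, then score the test set.
final_fit <- knn_workflow |>
  finalize_workflow(best_k) |>
  fit(data = player_train)

test_predictions <- predict(final_fit, player_test) |>
  bind_cols(player_test)

test_accuracy <- accuracy(test_predictions, truth = subscribe, estimate = .pred_class)
test_accuracy
```

Because the simulated outcome is unrelated to the predictors, the accuracy printed here says nothing about the real data; the sketch only shows how the pieces fit together.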


## 5. GitHub Repository

My GitHub project repository:

**<https://github.com/daniel-zouli/Dsci-100-Individual-Project.git>**

In [None]:
"This project aims to explore the behaviours of players on a research Minecraft server to figure out which factors are most closely tied to newsletter subscription, a key signal of long-term engagement and interest in the community. The dataset is made up of two files: players.csv, which has one row per player with details like age, gender, experience level, total hours played, hashed email, name, and subscription status; and sessions.csv, which records individual play sessions with login and logout times. The players.csv file contains 196 entries with a mix of numeric variables (Age, played_hours), a logical variable (subscribe), and categorical ones (experience, gender, name, hashedEmail). The sessions.csv file is much larger since players can have multiple sessions, and its timestamp data will need extra cleaning before it can be used for time analysis.

Key challenges in the data include missing values for Age and played_hours, possible mismatches between played_hours and the total time calculated from sessions, and categorical variables stored as text instead of factors. Another important detail is that the subscription variable, stored as TRUE/FALSE, should be converted to a factor before modelling so that it is treated as a class label.

The broader research question guiding this work is: What player characteristics and behaviours are most related to continued engagement with the game? This matters because subscribing to the newsletter shows a player’s interest in staying connected with the project and its community. My specific focus is: Can a player’s experience level and total hours played predict whether they subscribe to the newsletter? This is a good fit because both variables are directly related to engagement: experience reflects skill, while played_hours shows actual time investment. Both are readily available in players.csv, so sessions.csv isn’t needed for this initial model.

Exploratory analysis suggests that players with higher experience levels generally log more hours, and the distribution of played_hours is heavily skewed, with many players spending only a small amount of time on the server. These trends point to the idea that players who invest more time and report greater experience are more likely to subscribe. Visualizations like bar plots of average played_hours by experience category and histograms of played_hours highlight these differences, and early summaries show group-level variation that could affect classification results.

To answer my predictive question, I plan to use the k-Nearest Neighbours (kNN) classification method. Because the subscribe variable is binary, kNN is appropriate for treating subscription as a classification task. kNN is a simple, interpretable method consistent with the course content, relying only on distance calculations between players based on selected predictors. The assumptions include proper scaling of numeric variables, an appropriate numeric encoding of categorical variables such as experience, and the absence of irrelevant variables that might distort distances. Potential limitations include unbalanced classes, the influence of individual variables on distance calculations, and the difficulty of interpreting the model beyond neighbour comparisons.

For modelling, I will first clean and encode the necessary variables, scale the predictors, and then split the players.csv data into an 80% training set and a 20% test set. I will apply kNN to classify newsletter subscription based on experience and played_hours and evaluate performance using accuracy."