In [None]:
### Run this cell before continuing. 
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
source("cleanup.R")

**Predicting Newsletter Subscription Based on Player Experience and Playtime**

### **Broad Question:**  
What player characteristics and behaviours are most predictive of subscribing to a newsletter, and how do these features differ between various player types?

### **Specific Question:**  
Can a player’s **experience level** and **total played hours** predict whether they will subscribe to the **newsletter** in the `players.csv` dataset?

### **Response and Explanatory Variables:**  
- **Response Variable (Outcome):** `subscribe` (Logical: TRUE/FALSE) – Whether a player subscribes to the newsletter.
- **Explanatory Variables (Predictors):**
  - `experience` (Categorical: Pro, Veteran, Amateur, Regular) – The skill level or familiarity of the player.
  - `played_hours` (Numeric) – The total number of hours a player has spent in the game.

### **Data Preparation & Wrangling:**  
To ensure the data is in a usable form for predictive modeling, the following preprocessing steps will be applied:
1. **Data Cleaning:**
   - Remove unnecessary columns such as `hashedEmail` and `name`, as they do not contribute to prediction.
   - Handle missing values by filling in missing `Age` and `played_hours` values with the median or mean.
   - Standardize the category labels of categorical variables (`experience`) so that each level is spelled consistently.

2. **Feature Selection:**
   - Keep only relevant columns (`subscribe`, `experience`, `played_hours`).
   
3. **Transformations & Encoding:**
   - Convert `experience` into a factor variable to use it as a categorical predictor.
   - Ensure `subscribe` is treated as a binary outcome variable.
   
4. **Exploratory Analysis:**
   - Compute mean played hours for each experience level and visualize trends.
   - Examine the proportion of subscribers within each experience group.
   - Use visualizations (histograms, bar charts, scatter plots) to check for patterns between playtime, experience, and subscription likelihood.
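
The cleaning, selection, and encoding steps above can be sketched as follows. This is a minimal sketch rather than the final pipeline: `toy_players` is a hypothetical stand-in for `players.csv` with invented rows, and median imputation is only one of the two options mentioned above.

```r
library(dplyr)
library(tibble)

# Hypothetical toy rows that mimic the columns named above (not real data).
toy_players <- tibble(
  experience   = c("Pro", "Veteran", "Amateur", "Regular", "Amateur"),
  subscribe    = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  hashedEmail  = c("a1", "b2", "c3", "d4", "e5"),
  played_hours = c(30.3, NA, 0.7, 3.8, 0.0),
  name         = c("P1", "P2", "P3", "P4", "P5"),
  Age          = c(9, 17, NA, 21, 22)
)

players_clean <- toy_players |>
  # Fill missing played_hours with the median of the observed values.
  mutate(played_hours = if_else(is.na(played_hours),
                                median(played_hours, na.rm = TRUE),
                                played_hours)) |>
  # Keep only the outcome and the two predictors.
  select(subscribe, experience, played_hours) |>
  # Treat the outcome and experience as factors for classification.
  mutate(subscribe  = factor(subscribe, levels = c(FALSE, TRUE)),
         experience = factor(experience))

players_clean
```

In the notebook itself, the same pipeline would start from the data frame read from `players.csv` instead of `toy_players`.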

### **How the Data Will Address the Question:**  
By comparing subscription rates across `experience` levels and across the range of `played_hours`, we can assess whether certain player types are more likely to subscribe.

### **Next Steps:**
- Perform descriptive statistics and visualizations to confirm assumptions.
- Identify any necessary data transformations before predictive modeling.
- Choose and implement an appropriate classification model for prediction.
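
As a rough illustration of the last step, the sketch below fits one possible classifier with `tidymodels`. Logistic regression is an assumed choice made only for illustration; the model has not been selected yet. The data are hypothetical, randomly generated rows, so the resulting accuracy says nothing about the real players.

```r
library(tidymodels)

set.seed(2024)  # make the toy data and the split reproducible

# Hypothetical toy data with the three modelling columns (not real data).
toy_players <- tibble(
  experience   = factor(sample(c("Amateur", "Pro", "Regular", "Veteran"),
                               size = 100, replace = TRUE)),
  played_hours = round(rexp(100, rate = 0.2), 1),
  subscribe    = factor(sample(c("FALSE", "TRUE"), size = 100,
                               replace = TRUE, prob = c(0.25, 0.75)))
)

# Hold out a quarter of the rows for evaluation, stratified on the outcome.
players_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test  <- testing(players_split)

# ASSUMPTION: logistic regression stands in for whichever model is chosen.
log_spec <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")

players_fit <- workflow() |>
  add_formula(subscribe ~ experience + played_hours) |>
  add_model(log_spec) |>
  fit(data = players_train)

# Predict classes for the held-out rows and compute accuracy.
players_preds <- predict(players_fit, new_data = players_test) |>
  bind_cols(players_test)

accuracy(players_preds, truth = subscribe, estimate = .pred_class)
```

Swapping in a different model would only change the `log_spec` definition; the split, workflow, and evaluation steps stay the same.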


## Exploring Subscription Rates Through Visual Analysis


The plots below explore the `players.csv` dataset and its variables, with the aim of finding the variables that best predict newsletter subscription (`subscribe`). Each plot is introduced with a short explanation and followed by a conclusion.

---

In [None]:
url <- "https://raw.githubusercontent.com/g-amadorz/dsci-project/refs/heads/main/data/players.csv"

data <- read_csv(url)

## Mean Values of Quantitative Variables

To understand how different player characteristics influence newsletter subscription rates, we first compute the means of the quantitative variables in the dataset. The mean player age in this dataset is about 21 years, and the mean played time is about 6 hours.

In [None]:
mean_values <- data |>
  summarize(
    mean_age = mean(Age, na.rm = TRUE),
    mean_played_hours = mean(played_hours, na.rm = TRUE)
  )
mean_values
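
The exploratory plan also calls for the mean played hours within each experience level. A minimal sketch of that grouped summary is shown below, using a hypothetical `toy_players` table with invented rows; in the notebook, `data` would take its place.

```r
library(dplyr)
library(tibble)

# Hypothetical toy rows (not real data) using the notebook's column names.
toy_players <- tibble(
  experience   = c("Pro", "Pro", "Amateur", "Amateur", "Veteran"),
  played_hours = c(10, 20, 1, NA, 4)
)

mean_hours_by_experience <- toy_players |>
  group_by(experience) |>
  summarize(
    n_players         = n(),                             # group size
    mean_played_hours = mean(played_hours, na.rm = TRUE), # ignore missing hours
    .groups = "drop"
  )

mean_hours_by_experience
```

Reporting the group size next to each mean makes it easier to tell when a mean rests on only a few players.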

## Plot 1: Subscription Rate by Experience Level

To understand how player experience levels influence newsletter subscription rates, I created a bar plot showing the proportion of players who subscribed (`subscribe = TRUE`) for each experience level.

---

In [None]:
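# Proportion of subscribers within each experience level
experience <- data |>
  group_by(experience, subscribe) |>
  # Keep the grouping by experience so sum(count) is taken per level
  summarize(count = n(), .groups = "drop_last") |>
  mutate(proportion = count / sum(count)) |>
  filter(subscribe == TRUE)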

experience_plot <- experience |>
  ggplot(aes(x = experience, y = proportion, fill = experience)) +
  geom_col() +
  labs(title = "Subscription Rate by Experience Level",
       x = "Experience Level",
       y = "Proportion Subscribed")

experience_plot

## Conclusion


The plot reveals a clear trend: as player experience increases, so does the likelihood of subscribing to the newsletter.
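
Because `experience` is read in as text, the bars are drawn in alphabetical order, which may not match the order of increasing experience. The sketch below shows one way to fix the axis order before plotting. Both the proportions and the ranking in `assumed_order` are hypothetical and used only for illustration.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Hypothetical proportions (not real results), shaped like the `experience` table.
toy_rates <- tibble(
  experience = c("Amateur", "Pro", "Regular", "Veteran"),
  proportion = c(0.70, 0.72, 0.80, 0.68)
)

# ASSUMPTION: this ranking of experience levels is illustrative only.
assumed_order <- c("Amateur", "Regular", "Veteran", "Pro")

# Setting explicit factor levels controls the order of bars on the x-axis.
ordered_rates <- toy_rates |>
  mutate(experience = factor(experience, levels = assumed_order))

ordered_plot <- ordered_rates |>
  ggplot(aes(x = experience, y = proportion, fill = experience)) +
  geom_col() +
  labs(title = "Subscription Rate by Experience Level",
       x = "Experience Level",
       y = "Proportion Subscribed")

ordered_plot
```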

---

## Plot 2: Subscription Rate by Gender
To understand how player gender could influence newsletter subscription rates, I created a bar plot showing the proportion of players who subscribed (`subscribe = TRUE`) within each gender category.

---

In [None]:
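# Proportion of subscribers within each gender category
gender <- data |>
  group_by(gender, subscribe) |>
  # Keep the grouping by gender so sum(count) is taken per category
  summarize(count = n(), .groups = "drop_last") |>
  mutate(proportion = count / sum(count)) |>
  filter(subscribe == TRUE)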

gender_plot <- gender |>
  ggplot(aes(x = gender, y = proportion, fill = gender)) +
  geom_col() +
  labs(title = "Subscription Rate by Gender",
       x = "Gender",
       y = "Proportion Subscribed")

gender_plot

## Conclusion


The plot does not support a clear conclusion about gender; it mainly reflects the fact that around 75% of the player base is subscribed.
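
The overall rate can be checked directly, and the size of each gender group can be reported alongside its rate. The sketch below uses a hypothetical `toy_players` table whose rows and gender labels are invented; in the notebook, `data` would take its place.

```r
library(dplyr)
library(tibble)

# Hypothetical toy rows (not real data) using the notebook's column names.
toy_players <- tibble(
  gender    = c("Male", "Male", "Male", "Female", "Female",
                "Non-binary", "Male", "Female"),
  subscribe = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE)
)

# Overall rate: the mean of a logical column is the share of TRUE values.
overall_rate <- toy_players |>
  summarize(subscription_rate = mean(subscribe))

# Per-gender rate alongside group size, so small groups are easy to spot.
gender_rates <- toy_players |>
  group_by(gender) |>
  summarize(n_players = n(),
            subscription_rate = mean(subscribe),
            .groups = "drop")

overall_rate
gender_rates
```

A rate computed from a handful of players can swing widely, so the `n_players` column helps judge how much weight each bar in the plot deserves.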

---