### DSCI 100 – Project Planning Stage (Individual)
Name: Danial Zahid (62865654)

Date: November 11, 2025

## 1. Introduction

This project explores whether both **player age** and **time spent playing** can be used to predict if a player will subscribe to the game newsletter. By analyzing this relationship, we can identify which age groups are most engaged with the game and could be the most responsive to future newsletter campaigns.

The data was collected from a Minecraft research server managed by the UBC PLAI group (https://plaicraft.ai), which collects information about players' demographics and statistics. This analysis will focus on the **players.csv dataset**, which contains each player's age, subscription status, and the amount of time played.

## 2. Data Description

In [None]:
# Load the libraries
library(tidyverse)
library(ggplot2)
library(repr)
library(scales)

In [None]:
# Load the dataset
players <- read_csv("players.csv")
players

In [None]:
# Inspect the dataset
glimpse(players)

In [None]:
# Find the number of rows and columns
cat("Number of observations:", nrow(players), "\n")
cat("Number of variables:", ncol(players), "\n")

In [None]:
# Show the variable names of the dataset
colnames(players)

### Variable Summary Table:

| Variable Name | Type | Description | Example | Issues |
|---------------|------|-------------|----------|-------|
| experience | character | Self-reported experience with game | "Pro" | Possible bias |
| subscribe | logical | Whether player subscribed to newsletter | TRUE/FALSE | Imbalanced |
| hashedEmail | character | Unique ID for each player | "abc123" | none |
| played_hours | numeric | Amount of time played | 3.0 | Possible bias |
| name | character | Name for each player | "Luna" | none |
| gender | character | Gender for each player | "Female" | none |
| Age | numeric | Age for each player | 21 | Some missing values (N/A) |

### Summary:
- This dataset contains 196 players and 7 variables
- Some values are missing for the 'Age' and 'played_hours' variables
- Variables of focus: 'subscribe', 'Age', and 'played_hours'
- The data were recorded directly by the UBC PLAI Minecraft research server
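The missing-value claim above can be checked with a per-column count. The following is a minimal sketch in base R; `toy_players` and its values are hypothetical stand-ins for `players.csv`, not the real data.

```r
# Toy stand-in for players.csv (hypothetical values, not the real data)
toy_players <- data.frame(
  subscribe    = c(TRUE, FALSE, TRUE, TRUE),
  played_hours = c(3.0, 0.0, 12.5, 0.4),
  Age          = c(21, NA, 17, NA)
)

# Count the missing values in each column
missing_counts <- colSums(is.na(toy_players))
missing_counts
```

Applied to the real data, the same call would be `colSums(is.na(players))`.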

## 3. Questions

**Broad Question:**
What player characteristics and behaviours are most predictive of subscribing to a newsletter, and how do these features differ between various player types?

**Specific Question:**
Is there a relationship between the age of newsletter subscribers and the time they spend playing that could inform which group of players to target for newsletter subscription campaigns?

**Response Variable:**
`played_hours`

**Explanatory Variable:**
`Age`

## 4. Exploratory Data Analysis

In [None]:
# Compute the mean value for each quantitative variable
players_mean <- players |>
    select(played_hours, Age) |>
    map_dbl(mean, na.rm = TRUE)
players_mean

| Variable Name | Mean |
|---------------|------|
| played_hours | 5.85 |
| Age | 21.14 |
In [None]:
# Filter subscribers only
subscribers_filter <- players |>
    filter(subscribe == TRUE, !is.na(Age), !is.na(played_hours))
subscribers_filter

In [None]:
# Visualization 1: Distribution of Player Age
plot1 <- ggplot(players, aes(x = Age)) +
    geom_histogram(binwidth = 1, fill = "blue", color = "black") +
    labs(title = "Distribution of Player Age", x = "Age (years)", y = "Count") +
    theme(text = element_text(size = 15))
plot1

In [None]:
# Visualization 2: Newsletter Subscription Count
plot2 <- ggplot(players, aes(x = subscribe, fill = subscribe)) +
    geom_bar() +
    labs(title = "Number of Players Subscribed vs Not Subscribed", x = "Subscribed", y = "Count") +
    theme(text = element_text(size = 15))
plot2

In [None]:
# Visualization 3: Age vs. Time Played
plot3 <- ggplot(players, aes(x = Age, y = played_hours, color = subscribe)) +
    geom_point(alpha = 0.7) +
    scale_x_log10() +
    scale_y_log10() +
    labs(title = "Relationship Between Age and Total Playtime", x = "Age (years)", y = "Total Playtime (hours)", color = "Subscribed") +
    theme(text = element_text(size = 15))
plot3

### Insights from visualizations
- The histogram shows that most players are between 10 and 30 years old, concentrated among teenagers and players in their twenties. Because the histogram includes all players rather than subscribers only, it suggests (but does not by itself confirm) that most subscriptions come from younger players, who are therefore a plausible target group for newsletter campaigns.
- The scatterplot, in which each point represents one player, shows that most players have low playtime, while a few outliers have 150 or more hours. The trend is weak and slightly negative: as 'Age' increases, 'played_hours' tends to decrease. This suggests that younger players are somewhat more active in the game than older players.
- The bar graph shows that about three times as many players are subscribed as are not subscribed.

## 5. Methods and Plan

**Why is this method appropriate?**

To answer my question of interest, I will use linear regression and correlation analysis to examine the relationship between age and total playtime among newsletter subscribers. This approach is appropriate because both 'Age' and 'played_hours' are numeric, so linear regression can estimate the direction and size of their relationship, while correlation analysis measures how closely the two variables move together.
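As a minimal sketch of this method, the code below computes a Pearson correlation and fits a simple linear regression in base R. The data frame `toy_subscribers` and its values are hypothetical; in the actual analysis it would be replaced by the filtered subscriber data.

```r
# Toy data standing in for the filtered subscribers (hypothetical values)
toy_subscribers <- data.frame(
  Age          = c(15, 17, 19, 21, 23, 25, 30, 40),
  played_hours = c(12.0, 9.5, 8.0, 7.5, 5.0, 4.5, 3.0, 1.0)
)

# Pearson correlation: the sign gives the direction, the magnitude the strength
age_hours_cor <- cor(toy_subscribers$Age, toy_subscribers$played_hours)

# Simple linear regression of playtime on age
age_hours_fit <- lm(played_hours ~ Age, data = toy_subscribers)

age_hours_cor
coef(age_hours_fit)
```

The `Age` coefficient estimates the change in hours played per additional year of age, and the correlation summarizes how tightly the points follow a straight line.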

**Which assumptions are required, if any, to apply the method selected?**

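Linear regression assumes that the relationship between 'Age' and 'played_hours' is approximately linear, that observations are independent, and that the residuals have roughly constant variance. Because a few extreme playtimes can dominate the fit, the data should also be checked for outliers before modelling (for example, by viewing 'played_hours' on a log scale).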

**What are the potential limitations or weaknesses of the method selected?**

A limitation of this method is that 'Age' values are missing for some players, so those rows must be dropped. In addition, a few players have much higher playtime than the rest, which can distort the fitted linear trend and make the estimated relationship less reliable.
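The effect of a single heavy player on the fitted trend can be illustrated with a small sketch. All values below are hypothetical; the 150-hour point only mimics the kind of outlier seen in the scatterplot.

```r
# Toy data (hypothetical): playtime declines gently with age
base_data <- data.frame(
  Age          = c(15, 18, 20, 22, 25, 28, 32, 40),
  played_hours = c(6, 5.5, 5, 5, 4.5, 4, 3.5, 3)
)

# Same data plus one heavy player, an outlier at 150 hours
with_outlier <- rbind(base_data, data.frame(Age = 17, played_hours = 150))

# Compare the fitted slopes with and without the outlier
slope_base    <- coef(lm(played_hours ~ Age, data = base_data))[["Age"]]
slope_outlier <- coef(lm(played_hours ~ Age, data = with_outlier))[["Age"]]

c(without_outlier = slope_base, with_outlier = slope_outlier)
```

In this toy example the one extra point makes the slope far steeper, which is why outliers need to be examined before interpreting the regression line.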

**How are you going to compare and select the model?**

- 

**How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross-validation?**

I will begin by cleaning the data, removing outliers and missing values in 'Age' and 'played_hours' to ensure data quality. Then, I will split the data into training and testing sets.


Finally, I will use scatterplots, correlation, and regression lines to visualize and test the relationship between the two variables.
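The split-then-fit workflow could look roughly like the sketch below, written in base R. The 75/25 proportion, the seed, the simulated `toy_data`, and the use of RMSE on the test set are all assumptions made for illustration; the plan above does not fix any of them. The tidymodels function `initial_split()` offers the same idea in a single call.

```r
set.seed(2025)  # arbitrary seed so the split is reproducible

# Toy data (hypothetical) standing in for the cleaned subscriber data
n <- 40
toy_data <- data.frame(Age = sample(10:50, n, replace = TRUE))
toy_data$played_hours <- pmax(0, 10 - 0.15 * toy_data$Age + rnorm(n))

# Assumed 75/25 split: draw row indices for the training set
train_rows <- sample(seq_len(n), size = floor(0.75 * n))
train_set  <- toy_data[train_rows, ]
test_set   <- toy_data[-train_rows, ]

# Fit the regression on the training set only
fit <- lm(played_hours ~ Age, data = train_set)

# Evaluate on the held-out test set with root mean squared error
test_pred <- predict(fit, newdata = test_set)
test_rmse <- sqrt(mean((test_set$played_hours - test_pred)^2))
test_rmse
```

Splitting before any model fitting keeps the test set untouched, so the test RMSE reflects how the model performs on players it has not seen.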

## 6. GitHub Repository

**GitHub Repository Link:**
https://github.com/danial-z/dsci-100-2025w1-group-39.git