# **Predicting Newsletter Subscription on a Minecraft Research Server**

Student number: 74745191  
Group number: 33  
Name: Cheryl  
Begin date: November 4  
Total word count:

In [None]:
library(tidyverse)
library(readr)

## I. Data Description

In [None]:
# Load both data files from GitHub and preview the first rows

url_players <- "https://raw.githubusercontent.com/cheryldobiki/DSCI-100-individual-project/refs/heads/main/players.csv"
url_sessions <- "https://raw.githubusercontent.com/cheryldobiki/DSCI-100-individual-project/refs/heads/main/sessions.csv"
players <- read_csv(url_players)
sessions <- read_csv(url_sessions)
head(players)
head(sessions)

**Data Description:**

The dataset was collected from a UBC Minecraft research server that records player activity. It includes two files: **players.csv** and **sessions.csv**.

#### players.csv
- 196 observations, 7 variables  
- Contains basic information about each player

| Variable | Type | Description |
|-----------|------|-------------|
| experience | chr | Player’s experience level (e.g., Pro, Amateur) |
| subscribe | lgl | Whether the player subscribed to the newsletter |
| hashedEmail | chr | Hashed email address, used as a unique player ID |
| played_hours | dbl | Total number of hours played |
| name | chr | Player’s display name |
| gender | chr | Player’s gender |
| Age | dbl | Player’s age in years |

#### sessions.csv
- 1,535 observations, 5 variables  
- Records each play session together with the player's ID

| Variable | Type | Description |
|-----------|------|-------------|
| hashedEmail | chr | Player identifier |
| start_time / end_time | chr | Session start and end times |
| original_start_time / original_end_time | dbl | Raw numeric start and end timestamps recorded automatically by the server |

##### Data quality and potential problems
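- `played_hours` is 0.0 for some players, which suggests they registered but may never have opened the game.
- `name` is only a display label; it carries no useful information for prediction and is not used in the analysis.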
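The checks described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `toy_players` is a hypothetical stand-in with made-up values, not rows from players.csv, and the missing `Age` value is included only to show how missingness would be counted. Swapping in the real `players` data frame should work the same way.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for players.csv; every value here is made up.
toy_players <- tibble(
  experience   = c("Pro", "Amateur", "Regular", "Veteran", "Amateur"),
  subscribe    = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  played_hours = c(30.3, 0.0, 0.1, 0.0, 1.5),
  Age          = c(21, 17, NA, 25, 19)
)

# Count players with zero recorded playtime and players with a missing age.
quality_summary <- toy_players |>
  summarise(
    n_players     = n(),
    n_zero_hours  = sum(played_hours == 0, na.rm = TRUE),
    n_missing_age = sum(is.na(Age))
  )

quality_summary
```

A large share of zero-hour players would be worth keeping in mind when interpreting any playtime-based predictor.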

## II. Questions

**Broad Question:**

What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?


**Specific Question:**

How do a player's experience level (`experience`) and total playtime (`played_hours`) relate to subscribing to the game-related newsletter (`subscribe`), and does the relationship between playtime and subscription differ across experience levels?

**Explanation**

To address this question, we will use the following variables from players.csv:

1. **experience** (e.g., `Pro`, `Veteran`, `Regular`, `Amateur`) as an explanatory variable.
2. **played_hours** (total hours played) as an explanatory variable.
3. **subscribe** (`TRUE`/`FALSE`) as the response variable.

This approach allows us to examine how both experience level and actual playtime relate to newsletter subscription.
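One possible way to prepare these variables is sketched below. It is an illustration, not the analysis itself: `toy_players` is a hypothetical data frame with made-up values, and the `"No"`/`"Yes"` labels for the response are an assumed naming choice. The sketch keeps only the variables named in the question and converts the categorical ones to factors.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for players.csv; every value here is made up.
toy_players <- tibble(
  experience   = c("Pro", "Amateur", "Regular", "Veteran"),
  subscribe    = c(TRUE, FALSE, TRUE, TRUE),
  played_hours = c(30.3, 0.0, 0.1, 1.5),
  name         = c("A", "B", "C", "D")
)

# Keep the two explanatory variables and the response, and store the
# categorical variables as factors.
model_data <- toy_players |>
  select(experience, played_hours, subscribe) |>
  mutate(
    experience = factor(experience),
    subscribe  = factor(subscribe, levels = c(FALSE, TRUE), labels = c("No", "Yes"))
  )

model_data
```

Storing `subscribe` as a factor makes it explicit that this is a classification problem rather than a numeric one.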


## III. Exploratory Data Analysis and Visualization

In [None]:
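# Mean of every numeric column in players, ignoring missing values,
# rounded to two decimal places
players_mean <- players |>
  select(where(is.numeric)) |>
  summarise(across(everything(), ~ round(mean(.x, na.rm = TRUE), 2)))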

players_mean

In [None]:
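# Playtime is shifted by 1 hour so that players with 0 hours can be shown on a log scale.
# Points are jittered horizontally only, so the plotted playtime values are not distorted.
ggplot(players, aes(x = experience, y = played_hours + 1, color = subscribe)) +
  geom_point(position = position_jitter(width = 0.2, height = 0)) +
  scale_y_log10() +
  labs(title = "Total Playtime by Experience Level and Subscription Status (log scale)",
       x = "Experience Level",
       y = "Total Playtime + 1 (hours, log10 scale)",
       color = "Subscribed") +
  theme(text = element_text(size = 16))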

In [None]:
options(repr.plot.width = 10, repr.plot.height = 10)

experience_level_plot <- ggplot(players, aes(x = experience, fill = subscribe)) +
  geom_bar(position = "fill") +
  labs(
    title = "Newsletter Subscription Rate by Experience Level",
    x = "Experience Level",
    y = "Subscription Rate") +
  theme(text = element_text(size = 20))
experience_level_plot
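
The proportions shown in a filled bar chart like the one above can also be tabulated directly. The sketch below is an illustration only: `toy_players` is a hypothetical data frame with made-up values, and the summary column names are assumed. It computes the number of players and the share who subscribed within each experience level.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for players.csv; every value here is made up.
toy_players <- tibble(
  experience = c("Pro", "Pro", "Amateur", "Amateur", "Amateur", "Regular"),
  subscribe  = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
)

# mean() of a logical vector gives the proportion of TRUE values,
# which here is the subscription rate within each experience level.
rate_by_experience <- toy_players |>
  group_by(experience) |>
  summarise(
    n_players         = n(),
    subscription_rate = mean(subscribe)
  )

rate_by_experience
```

Reporting `n_players` next to each rate helps flag levels where the rate rests on only a handful of players.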

**Link to my GitHub repository**

https://github.com/cheryldobiki/DSCI-100-individual-project/tree/main