# Individual Planning Report
**Caelyn Tiu #79942686**  
**Group 43**  



## I. Data Description

The dataset used in this report is `players.csv`, which lists every unique player on a Minecraft research server managed by a UBC research group, along with information about each player. It records 196 observations of 7 variables covering both demographic and behavioural characteristics, such as experience level, gender, total playtime, and age. These variables give insight into each player's profile and allow us to analyze the factors that may influence whether a player subscribes to a game-related newsletter. The loaded dataset is shown below.

In [None]:
library(tidyverse)  # also attaches readr, which provides read_csv()

# Load the player data
players <- read_csv("data/players.csv")
players

In [None]:
head(players)
dim(players)

### Summary Statistics

In [None]:
players |>
  summarise(
    mean_played_hours = round(mean(played_hours, na.rm = TRUE), 2),
    sd_played_hours   = round(sd(played_hours, na.rm = TRUE), 2),
    min_played_hours  = round(min(played_hours, na.rm = TRUE), 2),
    max_played_hours  = round(max(played_hours, na.rm = TRUE), 2),
    mean_age = round(mean(Age, na.rm = TRUE), 2),
    sd_age   = round(sd(Age, na.rm = TRUE), 2),
    min_age  = round(min(Age, na.rm = TRUE), 2),
    max_age  = round(max(Age, na.rm = TRUE), 2)
  )


The dataset contains seven variables in total. Not all of them will be used in the predictive analysis, but knowing what each variable represents helps in judging how useful it could be for modelling. The table below summarizes all seven variables and what they describe.

| **Variable** | **Description** |
|---------------|-----------------|
| `experience` | the player's self-reported experience level in the game (e.g., Amateur, Beginner, Pro, Regular, Veteran) |
| `subscribe` | whether the player has subscribed to the game-related newsletter (TRUE/FALSE) |
| `hashedEmail` | hashed version of the player's email, which identifies the player while protecting privacy |
| `played_hours` | total number of hours the player has spent in the game |
| `name` | name used to identify the player |
| `gender` | self-identified gender of the player |
| `Age` | the player's age in years |



### Potential Issues and Limitations
Although the dataset provides valuable information about the players, several potential issues and limitations may affect the quality and accuracy of the analysis. First, not all players chose to disclose demographic variables such as `Age` and `gender`, so some entries are missing or incomplete. This reduces the effective sample size and may bias the results. Second, the number of players who subscribed to the newsletter may differ substantially from the number who did not; such a class imbalance could hurt the performance of the classification model and lead to biased predictions. Third, the experience level is self-reported and may not reflect a player's true skill in the game unless it is validated against another measure. Finally, `played_hours` may contain outliers that could skew the overall analysis if left unnoticed.
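The issues listed above can be checked directly before modelling. The sketch below is one possible way to do that, not part of the original analysis: `data_quality_summary()` is a hypothetical helper, and the 1.5 × IQR rule used to flag unusually high `played_hours` values is an assumed convention rather than a threshold taken from the dataset.

```r
library(dplyr)

# Hypothetical helper (an illustration, not part of the original analysis):
# summarise missing values, class balance, and possible outliers for a
# data frame shaped like `players`.
data_quality_summary <- function(data) {
  hours <- data$played_hours
  # Assumed rule of thumb: values above Q3 + 1.5 * IQR are possible outliers
  upper_fence <- quantile(hours, 0.75, na.rm = TRUE, names = FALSE) +
    1.5 * IQR(hours, na.rm = TRUE)

  data |>
    summarise(
      n_players       = n(),
      missing_age     = sum(is.na(Age)),
      missing_gender  = sum(is.na(gender)),
      # Share of players with subscribe == TRUE (works for logical or factor)
      prop_subscribed = mean(as.character(subscribe) == "TRUE", na.rm = TRUE),
      n_high_hours    = sum(played_hours > upper_fence, na.rm = TRUE)
    )
}
```

Calling `data_quality_summary(players)` would return a one-row table. A `prop_subscribed` value far from 0.5 would point to class imbalance, and a nonzero `n_high_hours` would suggest inspecting the distribution of `played_hours` more closely.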

## II. Questions

The broad question addressed in this project is Question 1: "What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?" From this, my specific question is: "Can the experience of a player predict whether they will subscribe to a game-related newsletter?"

My question explores whether there is an association between a player's experience level (classified as Amateur, Beginner, Regular, Pro, or Veteran) and their likelihood of subscribing to the game newsletter. It aims to determine whether players with higher experience levels are more likely to stay updated about the game through the newsletter. Other variables such as `played_hours`, `Age`, and `gender` may also be considered later to see whether they influence the likelihood of subscribing when examined alongside experience level.

The data in `players.csv` can address my question because it contains the response variable, `subscribe`, which indicates whether a player has subscribed to the game-related newsletter, while explanatory variables such as `experience`, `played_hours`, `Age`, and `gender` describe player characteristics that may influence this outcome. Before fitting a predictive model, the data must be wrangled so that missing values and outliers are accounted for. The variables `experience`, `gender`, and `subscribe` will be converted to factors since they are categorical. After tidying, the data will be split into training and testing sets for model validation. These steps ensure that the final dataset is properly formatted, consistent, and ready for predictive analysis.
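The wrangling and splitting steps described above could look roughly like the sketch below. It assumes the `rsample` package (part of tidymodels) is installed. `prepare_and_split()` is a hypothetical helper, and the 75/25 proportion and the seed are illustrative choices, not values fixed by this report.

```r
library(dplyr)
library(rsample)  # part of tidymodels; provides initial_split()

# Hypothetical helper: convert the categorical columns to factors, then
# split the data into training and testing sets.
# The default proportion and seed are illustrative assumptions.
prepare_and_split <- function(data, prop = 0.75, seed = 2024) {
  set.seed(seed)  # make the random split reproducible

  tidy <- data |>
    mutate(
      subscribe  = as.factor(subscribe),
      experience = as.factor(experience),
      gender     = as.factor(gender)
    )

  # Stratify on the outcome so both sets keep a similar subscription rate
  split <- initial_split(tidy, prop = prop, strata = subscribe)
  list(train = training(split), test = testing(split))
}
```

For example, `sets <- prepare_and_split(players)` would give `sets$train` for fitting a model and `sets$test` for evaluating it. Stratifying on `subscribe` is one way to limit the effect of class imbalance on the split.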






## III. Exploratory Data Analysis and Visualization

In [None]:
library(tidyverse)  # also attaches readr, which provides read_csv()

# Load the player data
players <- read_csv("data/players.csv")
players

### Mean Values for Quantitative Variables

In [None]:
mean_values <- players |>
  summarise(
    mean_played_hours = round(mean(played_hours, na.rm = TRUE), 2),
    mean_age = round(mean(Age, na.rm = TRUE), 2)
  )

mean_values

### Data Wrangling

In [None]:
# Convert the categorical variables to factors
players <- players |>
  mutate(
    subscribe = as_factor(subscribe),
    experience = as_factor(experience),
    gender = as_factor(gender)
  )
players

### Visualization 1: Experience of Players vs Subscription of Newsletters

In [None]:
# Side-by-side bars: subscribed vs. not subscribed within each experience level
exp_sub_plot <- players |>
  ggplot(aes(x = experience, fill = subscribe)) +
  geom_bar(position = "dodge") +
  labs(
    x = "Experience Level",
    y = "Number of Players",
    fill = "Subscribed",
    title = "Newsletter Subscription by Experience Level"
  )
exp_sub_plot

            

The bar chart above compares the number of subscribed and non-subscribed players at each experience level, to see whether more experienced players (Pro, Veteran) are more likely to subscribe than Amateur or Beginner players. The Amateur group has the most subscribed players, while the Pro group has the fewest. Because the groups may differ in size, these counts alone do not show which group is most likely to subscribe; that requires comparing the proportion of subscribers within each level.
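Counts can be turned into within-group proportions with a short summary. The sketch below is one way to do this; `subscription_rate_by()` is a hypothetical helper introduced for illustration, and it assumes `subscribe` holds TRUE/FALSE values, either as a logical or as a factor.

```r
library(dplyr)

# Hypothetical helper: number and share of subscribed players in each group.
subscription_rate_by <- function(data, group_var) {
  data |>
    group_by({{ group_var }}) |>
    summarise(
      n_players       = n(),
      # Works whether `subscribe` is logical or a factor with TRUE/FALSE levels
      n_subscribed    = sum(as.character(subscribe) == "TRUE"),
      prop_subscribed = n_subscribed / n_players,
      .groups = "drop"
    ) |>
    arrange(desc(prop_subscribed))  # highest subscription rate first
}
```

Calling `subscription_rate_by(players, experience)` would list each experience level with its group size and subscription rate, which makes small groups such as Pro easier to compare fairly against larger ones.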

### Visualization 2: Gender vs Newsletter Subscription

In [None]:
# Side-by-side bars: subscribed vs. not subscribed within each gender
gender_plot <- players |>
  ggplot(aes(x = gender, fill = subscribe)) +
  geom_bar(position = "dodge") +
  labs(
    x = "Gender",
    y = "Number of Players",
    fill = "Subscribed",
    title = "Newsletter Subscription by Gender"
  )
gender_plot


This bar chart shows the number of subscribed and non-subscribed players for each gender. Male players account for the highest number of subscriptions, followed by female players, while non-binary players and those with other gender identities account for fewer. This suggests a possible relationship between gender and newsletter subscription, although the differences in counts may partly reflect how many players of each gender are in the dataset.
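To separate group size from subscription rate visually, the bars can be rescaled so that each gender sums to 1. The sketch below uses `geom_bar(position = "fill")` for this; `proportion_plot()` and its arguments are hypothetical names introduced for illustration, not part of the original analysis.

```r
library(ggplot2)

# Hypothetical helper: stacked bars rescaled to proportions within each group.
# `group_col` is the column name as a string; `group_label` is the axis label.
proportion_plot <- function(data, group_col, group_label) {
  ggplot(data, aes(x = .data[[group_col]], fill = factor(subscribe))) +
    geom_bar(position = "fill") +  # each bar is rescaled to a total height of 1
    labs(
      x = group_label,
      y = "Proportion of Players",
      fill = "Subscribed",
      title = paste("Proportion Subscribed by", group_label)
    )
}
```

For example, `proportion_plot(players, "gender", "Gender")` would show the share of subscribers within each gender. Proportions for genders with very few players should be read cautiously, since one or two players can shift them substantially.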