In [None]:
library(tidyverse)

In [None]:
player_data<- read_csv("players.csv")
player_data

In [None]:
# Summary statistics (mean, min, max, % missing) for the two numeric variables.
players_numeric_mean <- player_data |>
    summarise(
        mean_played_hours = round(mean(played_hours, na.rm = TRUE), 2),
        min_played_hours = min(played_hours, na.rm = TRUE),
        max_played_hours = max(played_hours, na.rm = TRUE),
        missing_played_hours = round(mean(is.na(played_hours)) * 100, 2),

        mean_age = round(mean(Age, na.rm = TRUE), 2),
        min_age = min(Age, na.rm = TRUE),
        max_age = max(Age, na.rm = TRUE),
        missing_age = round(mean(is.na(Age)) * 100, 2)
    )
players_numeric_mean

<h1> (1) Data Description: </h1>

**Summary:** This data frame contains information about individual players, including their experience level, subscription status, playtime, name, gender, and age. Each record represents a unique player identified by a hashed email address.
 
**Number of observations:** 196 players

**Number of Variables:** 7

<h2>Variables</h2>

| Variable Name | Type | Description | Example Value |
|----------------|------|--------------|----------------|
| `experience` | Categorical (`chr`) | Player’s skill level or rank. | `Pro` |
| `subscribe` | Boolean (`lgl`) | Indicates whether the player has an active subscription (TRUE) or not (FALSE). | `TRUE` |
| `hashedEmail` | String (`chr`) | Unique anonymized identifier for each player. | `f6daba4...` |
| `played_hours` | Numeric (`dbl`) | Total number of hours the player has spent playing. | `30.3` |
| `name` | String (`chr`) | Player’s first name. | `Morgan` |
| `gender` | Categorical (`chr`) | Player’s gender identity. | `male` |
| `Age` | Numeric (`dbl`) | Player’s age in years. Contains some missing values (`NA`). | `17` |

---

<h2>Summary Statistics</h2>

| Variable | Mean | Min | Max | Missing (%) |
|-----------|------|-----------|------|--------------|
| `played_hours` | *5.85* | *0* | *223.1* | 0% |
| `Age` | *21.14* | *9* | *58* | 1.02% |

---

<h2>Observations and problems in the data frame</h2>

<h3>Some Direct Observations</h3>

- The hashedEmail variable appears to be the unique player identifier.
  
- The experience variable may represent skill progression and could be useful in predicting playtime.

<h3>Direct problems</h3>

- Some players are missing their age, either because it was not recorded or because they preferred not to report it, so the data set is incomplete for this variable.
  
- Many players have 0 played hours, which may indicate that they signed up but have not actually played. This could affect later predictions when answering the question with these data.

- The gender column has many different responses, such as “Other”, “Two-Spirited”, “Prefer not to say”, etc. This might make it hard to group or summarize.
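
The points above can be quantified directly. The following is a minimal sketch, assuming the column names from the variable table and the `player_data` object loaded earlier; `check_player_issues()` is a hypothetical helper name introduced here for illustration.

```r
library(dplyr)

# Hypothetical helper: counts the data-quality issues listed above.
# Assumes the columns played_hours and Age exist, as in the variable table.
check_player_issues <- function(players) {
  players |>
    summarise(
      n_players = n(),
      n_zero_hours = sum(played_hours == 0, na.rm = TRUE),
      pct_zero_hours = round(mean(played_hours == 0, na.rm = TRUE) * 100, 2),
      n_missing_age = sum(is.na(Age))
    )
}

# Run only when the notebook's player_data object is available.
if (exists("player_data")) {
  print(check_player_issues(player_data))
  # Number of players in each gender category, largest group first.
  print(count(player_data, gender, sort = TRUE))
}
```

If a few gender categories turn out to be very small, that would support the concern that this variable is hard to group or summarize.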

<h3>Other potential issues</h3>

- The data may not represent all types of players (for example, older players or casual players may be missing).
  
- If some of the data are self-reported (such as age), they may be inaccurate, which would reduce the reliability of results based on this data set.

<h3>How the data were collected</h3>

<p> A research group in Computer Science at UBC, led by Frank Wood, is collecting data about how people play video games. They have set up a Minecraft server, and players' actions are recorded as they navigate through the world. </p>

<h1>(2) Questions:</h1>

<h3>The Question that I will be addressing </h3>

**Question 1:** What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

<h3>The specific question</h3>

Can a player's total playtime and age predict whether they subscribe to the newsletter in the players data set?

<h3> How the data will help me address the question of interest</h3>

<p>This data set contains the total playtime, age, and subscription status of each player. I will focus on these three variables and remove rows with missing values (<code>NA</code>). </p>
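
A minimal sketch of this selection step, assuming the `player_data` object loaded earlier. The helper name `prepare_players()` and the conversion of `subscribe` to a factor (convenient for a later classification model) are assumptions added for illustration, not part of the original plan.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: keep the three variables of interest and drop
# rows that contain a missing value in any of them.
prepare_players <- function(players) {
  players |>
    select(played_hours, Age, subscribe) |>
    drop_na() |>
    # Assumption: store the outcome as a factor for classification later.
    mutate(subscribe = as.factor(subscribe))
}

# Run only when the notebook's player_data object is available.
if (exists("player_data")) {
  players_model <- prepare_players(player_data)
  print(players_model)
}
```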

<h1>(3) Exploratory Data Analysis and Visualization</h1>

<h4>Demonstrate that the dataset can be loaded into R:</h4>

In [None]:
player_data<- read_csv("players.csv")
player_data

<h3>Do the minimum necessary wrangling to turn your data into a tidy format:</h3>

<p>The players.csv data set is already tidy. Each variable being measured is stored in its own column, each observation forms a row, and each type of observational unit forms a table.</p>

<h3>Compute the mean value for each quantitative variable in the players.csv data set: </h3>

In [None]:
# Mean of each quantitative variable, ignoring missing values.
players_numeric_mean <- player_data |>
    summarise(
        mean_played_hours = round(mean(played_hours, na.rm = TRUE), 2),
        mean_age = round(mean(Age, na.rm = TRUE), 2)
    )
players_numeric_mean

<h3> Exploratory visualizations of the data: </h3>

In [None]:
# Overlaid histograms of age and playtime, filled by subscription status.
visualization_age_subscribe <- player_data |>
    ggplot(aes(x = Age, fill = subscribe)) +
    geom_histogram(position = "identity", alpha = 0.3, binwidth = 5) +
    labs(title = "Distribution of age by subscription status",
         x = "Age (years)",
         y = "Number of players",
         fill = "Subscribed")

visualization_playtime_subscribe <- player_data |>
    ggplot(aes(x = played_hours, fill = subscribe)) +
    geom_histogram(position = "identity", alpha = 0.3, binwidth = 5) +
    labs(title = "Distribution of played hours by subscription status",
         x = "Played hours (h)",
         y = "Number of players",
         fill = "Subscribed")

visualization_playtime_subscribe
visualization_age_subscribe

The overlaid histograms show how age and playtime are distributed among subscribers and non-subscribers. Most players who subscribed to the newsletter have played 0-5 hours and are between the ages of 13 and 17. Because the histograms show counts, this may simply reflect that most players fall in these ranges; comparing the proportion of subscribers within each range would be needed to conclude that such players are more likely to subscribe.
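
Since histograms show counts rather than rates, one way to check the interpretation is to compute the share of subscribers within each bin. This is a sketch only: `subscribe_rate_by_bin()` is a hypothetical helper, it assumes `subscribe` is stored as `TRUE`/`FALSE`, and its bins start at multiples of `width`, so they may not line up exactly with the bins drawn by `geom_histogram()`.

```r
library(dplyr)

# Hypothetical helper: proportion of subscribers within fixed-width bins
# of a numeric column (for example Age or played_hours).
subscribe_rate_by_bin <- function(players, column, width = 5) {
  players |>
    filter(!is.na({{ column }})) |>
    # Label each row by the lower edge of its bin: [0, 5), [5, 10), ...
    mutate(bin_start = floor({{ column }} / width) * width) |>
    group_by(bin_start) |>
    summarise(
      n_players = n(),
      prop_subscribed = round(mean(subscribe), 2)
    )
}

# Run only when the notebook's player_data object is available.
if (exists("player_data")) {
  print(subscribe_rate_by_bin(player_data, Age))
  print(subscribe_rate_by_bin(player_data, played_hours))
}
```

Bins with very few players will give unstable proportions, so `n_players` is reported alongside each rate.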

<h1>(4) Methods and Plan</h1>

<h4>Method</h4>

<p> The method that I plan to use to address the question "Can a player's total playtime and age predict whether they subscribe to the newsletter in the players data set?" is K-nearest neighbours (KNN) classification. This will allow me to classify players as "subscribers" or "non-subscribers" based on their playtime and age.</p>

<h4>Why is this method appropriate?</h4>

<p> This method is appropriate because the outcome (subscribe) is categorical with two classes and both predictors are numeric, which is the setting KNN classification is designed for. In addition, KNN does not assume any specific form for the relationship between the predictors and the outcome, which is helpful because age and playtime might have a non-linear relationship with subscription. </p>

<h4>Which assumptions are required, if any, to apply the method selected?</h4>

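 - KNN makes few assumptions, but because it relies on distances between observations, the predictors must be standardized (centred and scaled). Otherwise the variable with the larger range, here played_hours, would dominate the distance.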
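
A small illustration of why scaling matters for a distance-based method. The three players below are made-up values chosen for illustration only; they are not taken from players.csv.

```r
# Three made-up players (illustrative values, not from players.csv).
toy_players <- data.frame(
  played_hours = c(0, 0, 40),
  Age = c(17, 37, 17)
)

# Euclidean distances on the raw scale.
raw_dist <- as.matrix(dist(toy_players))

# Distances after centring and scaling each column.
scaled_dist <- as.matrix(dist(scale(toy_players)))

raw_dist
scaled_dist
```

On the raw scale, player 3 (40 more hours) is twice as far from player 1 as player 2 (20 years older) is. After scaling, the two differences contribute equally to the distance.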

<h4>What are the potential limitations or weaknesses of the method selected?</h4>

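 - The result depends on the choice of k: a very small k can fit noise in the training data, while a very large k can smooth over real patterns.
 - KNN is sensitive to outliers, which matters for the playtime data because it ranges from 0 to over 200 hours.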

<h4>How are you going to compare and select the model?</h4>

<p> I’ll use k-fold cross-validation (around 5 folds) on the training data to choose a good value for k. Model performance will be compared mainly through accuracy and the misclassification rate. </p>

<h4>How are you going to process the data to apply the model? </h4>

<p>I’ll keep only the variables I need (played_hours, Age, and subscribe), remove missing values, and standardize the numeric predictors. I’ll also take a quick look at any extreme values in playtime since they might affect the distance calculations.</p>

<p>After cleaning, I will split the data into 70% for training and 30% for testing. The training set will be used for all model development, including estimating the centring and scaling parameters, tuning the number of neighbours (k), and running cross-validation. The same scaling will then be applied to the test set. I plan to use 5-fold cross-validation within the training set to select the best k. The test set will remain untouched until the final evaluation, so that it provides an estimate of how the model performs on unseen data.</p>
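
One way the plan above could look in code is sketched below. It assumes the `tidymodels` and `kknn` packages are installed. The function name `tune_knn_subscribe()`, the seed, the grid of k values, and the stratified split are assumptions made for illustration, not decisions stated in the plan.

```r
library(tidymodels)

# Hypothetical helper sketching the plan: split, scale, tune k, evaluate.
tune_knn_subscribe <- function(players, k_values = seq(1, 15, by = 2), seed = 2024) {
  set.seed(seed)  # assumed seed, only for reproducibility

  players_clean <- players |>
    select(played_hours, Age, subscribe) |>
    drop_na() |>
    mutate(subscribe = as.factor(subscribe))

  # 70/30 split; stratifying on subscribe is an added assumption.
  players_split <- initial_split(players_clean, prop = 0.7, strata = subscribe)
  players_train <- training(players_split)
  players_test <- testing(players_split)

  # Scaling is declared in the recipe, so it is estimated on training data only.
  knn_recipe <- recipe(subscribe ~ played_hours + Age, data = players_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

  knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

  knn_workflow <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec)

  # 5-fold cross-validation within the training set.
  folds <- vfold_cv(players_train, v = 5, strata = subscribe)

  cv_accuracy <- knn_workflow |>
    tune_grid(
      resamples = folds,
      grid = tibble(neighbors = k_values),
      metrics = metric_set(accuracy)
    ) |>
    collect_metrics()

  # Pick the k with the highest mean cross-validated accuracy.
  best_k <- cv_accuracy |>
    slice_max(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)

  final_fit <- knn_workflow |>
    finalize_workflow(tibble(neighbors = best_k)) |>
    fit(data = players_train)

  # The test set is used once, for the final accuracy estimate.
  test_accuracy <- predict(final_fit, players_test) |>
    bind_cols(players_test) |>
    accuracy(truth = subscribe, estimate = .pred_class)

  list(cv_accuracy = cv_accuracy, best_k = best_k, test_accuracy = test_accuracy)
}

# Run only when the notebook's player_data object is available.
if (exists("player_data")) {
  knn_results <- tune_knn_subscribe(player_data)
  print(knn_results$best_k)
  print(knn_results$test_accuracy)
}
```

Putting the scaling steps inside the recipe, rather than scaling the full data set up front, keeps information from the test set out of model development.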
