In [None]:
library(tidyverse)

In [None]:
player_data <- read_csv("players.csv")
player_data

In [None]:
# Mean, range, and percentage of missing values for each numeric variable
players_numeric_mean <- player_data |>
  summarise(
    mean_played_hours = round(mean(played_hours, na.rm = TRUE), 2),
    min_played_hours = min(played_hours, na.rm = TRUE),
    max_played_hours = max(played_hours, na.rm = TRUE),
    missing_played_hours = round(mean(is.na(played_hours)) * 100, 2),

    mean_age = round(mean(Age, na.rm = TRUE), 2),
    min_age = min(Age, na.rm = TRUE),
    max_age = max(Age, na.rm = TRUE),
    missing_age = round(mean(is.na(Age)) * 100, 2)
  )
players_numeric_mean

<h1> (1) Data Description: </h1>

**Summary:** This data frame contains information about individual players, including their experience level, subscription status, playtime, name, gender, and age. Each record represents a unique player identified by a hashed email address.
 
**Number of observations:** 196 players

**Number of Variables:** 7

<h2>Variables</h2>

| Variable Name | Type | Description | Example Value |
|----------------|------|--------------|----------------|
| `experience` | Categorical (`chr`) | Player’s skill level or rank. | `Pro` |
| `subscribe` | Boolean (`lgl`) | Indicates whether the player has an active subscription (TRUE) or not (FALSE). | `TRUE` |
| `hashedEmail` | String (`chr`) | Unique anonymized identifier for each player. | `f6daba4...` |
| `played_hours` | Numeric (`dbl`) | Total number of hours the player has spent playing. | `30.3` |
| `name` | String (`chr`) | Player’s first name. | `Morgan` |
| `gender` | Categorical (`chr`) | Player’s gender identity. | `male` |
| `Age` | Numeric (`dbl`) | Player’s age in years. Contains some missing values (`NA`). | `17` |

---

<h2>Summary Statistics</h2>

| Variable | Mean | Min | Max | Missing (%) |
|-----------|------|-----------|------|--------------|
| `played_hours` | *5.85* | *0* | *223.1* | 0% |
| `Age` | *21.14* | *9* | *58* | 1.02% |

---

<h2>Observations and problems that can be seen in the data frame</h2>

<h3> Some Direct Observations</h3>

- The hashedEmail variable appears to be the unique player identifier.
  
- The experience variable may represent skill progression and could be useful in predicting playtime.

<h3>Direct Problems</h3>

- Some players are missing their age, either because it was not recorded or because they preferred not to say, so the dataset is incomplete.
  
- Many players have 0 played hours, which may indicate that they signed up but never actually played. This could affect later predictions when we use the data to answer the question.

- The gender column has many different responses, such as “Other”, “Two-Spirited”, “Prefer not to say”, etc. This might make it hard to group or summarize.
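
The problems listed above can be quantified with a few summary calls. The following is a minimal sketch, not part of the original analysis: `toy_players` is a hypothetical stand-in for `player_data` with made-up values, so the numbers it prints say nothing about the real data set. Swapping in `player_data` would give the actual counts.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data; the values are invented for illustration only.
toy_players <- tibble(
  played_hours = c(0, 0, 1.5, 30.3, 0.2),
  gender = c("Male", "Female", "Prefer not to say", "Male", "Two-Spirited"),
  Age = c(17, NA, 21, 9, 25)
)

# Count and percentage of players with zero playtime, plus missing ages.
quality_summary <- toy_players |>
  summarise(
    n_zero_hours = sum(played_hours == 0, na.rm = TRUE),
    pct_zero_hours = round(mean(played_hours == 0, na.rm = TRUE) * 100, 2),
    n_missing_age = sum(is.na(Age))
  )
quality_summary

# Number of players in each gender category, largest group first.
gender_counts <- toy_players |>
  count(gender, sort = TRUE)
gender_counts
```

Seeing how many players fall into each gender category would help decide whether small groups need to be combined before summarizing.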

<h3>Other Potential Issues</h3>

- The data may not represent all types of players (for example, older players or casual players may be missing).
  
- If some of the data are self-reported (such as age), results based on this data set might not be fully accurate.

<h2>How the Data Were Collected</h2>
<p> A research group in Computer Science at UBC, led by Frank Wood, is collecting data about how people play video games. They have set up a Minecraft server, and players' actions are recorded as they navigate through the world. </p>

<h1>(2) Questions:</h1>

<h3>The Question that I will be addressing </h3>

**Question 1:** What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

<h3>The specific question</h3>

Can a player's total playtime and age predict whether they subscribe to the newsletter in the player database?

<h3> How the data will help me address the question of interest</h3>

<p>This dataset contains each player's total playtime, age, and subscription status. I will focus on these three variables and remove rows with missing values (NA). Then I can fit a K-nearest neighbours (KNN) classification model to test whether playtime and age predict which players are more likely to subscribe to the newsletter, as asked in the broad question.</p>
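
The plan above could be sketched roughly as follows. This is only an illustration under stated assumptions: `toy_players` is randomly generated stand-in data rather than the real `player_data`, and the 75/25 split, the choice of 5 neighbours, and the scaling steps are placeholder choices that are not fixed anywhere in this report. It also assumes the `tidymodels` and `kknn` packages are installed.

```r
library(tidyverse)
library(tidymodels)

set.seed(1)

# Hypothetical toy data with the same three columns; values are random.
toy_players <- tibble(
  played_hours = round(runif(60, min = 0, max = 50), 1),
  Age = c(NA, NA, sample(9:58, 58, replace = TRUE)),
  subscribe = sample(c(TRUE, FALSE), 60, replace = TRUE, prob = c(0.7, 0.3))
)

# Keep the three variables of interest, drop rows with NA,
# and turn the response into a factor for classification.
players_clean <- toy_players |>
  select(subscribe, played_hours, Age) |>
  drop_na() |>
  mutate(subscribe = as.factor(subscribe))

# Assumed 75/25 train/test split, stratified by the response.
players_split <- initial_split(players_clean, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

# Standardize the predictors so neither dominates the distance calculation.
knn_recipe <- recipe(subscribe ~ played_hours + Age, data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# KNN classifier; neighbors = 5 is an arbitrary placeholder, not a tuned value.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = players_train)

# Predict on the held-out rows and report accuracy.
players_predictions <- predict(knn_fit, players_test) |>
  bind_cols(players_test)

players_predictions |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy")
```

In a full analysis, the number of neighbours would normally be chosen by cross-validation on the training set rather than fixed in advance.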

<h1>(3) Exploratory Data Analysis and Visualization</h1>

<h4>Demonstrate that the dataset can be loaded into R:</h4>

In [None]:
player_data <- read_csv("players.csv")
player_data

<h4>Do the minimum necessary wrangling to turn your data into a tidy format:</h4>

<p>The players.csv data set is already tidy: each variable is stored in its own column, each observation forms a row, and each type of observational unit forms a table.</p>

<h4>Compute the mean value for each quantitative variable in the players.csv data set: </h4>

In [None]:
# Mean of each quantitative variable, ignoring missing values
players_numeric_mean <- player_data |>
  summarise(
    mean_played_hours = round(mean(played_hours, na.rm = TRUE), 2),
    mean_age = round(mean(Age, na.rm = TRUE), 2)
  )
players_numeric_mean

<h4>Exploratory visualizations of the data:</h4>

In [None]:
visualization<- player_data|>
    