In [None]:
library(tidyverse)   # attaches ggplot2, dplyr, readr, and forcats
library(tidymodels)  # attaches recipes, parsnip, rsample, and tune

# Question 1<br>

## players.csv
The players dataset has 7 variables and 196 rows; each row holds the information for one player. <br>
Players are identified by a unique hashed email, and the data records each player's played hours and personal information.

1. **experience**:<br>Character type<br>The player's experience level in the game
2. **subscribe**:<br>Logical type<br>Whether the player has subscribed to the game-related newsletter
3. **hashedEmail**:<br>Character type<br>A private, unique identifier produced by applying a cryptographic hash function to the player's email address. Each player in the dataset has a unique hashed email.
4. **played_hours**:<br>Numeric type<br>The time the player spent on the game (hours)
5. **name**:<br>Character type<br>Name of the player
6. **gender**:<br>Character type<br>Gender of the player
7. **Age**:<br>Numeric type<br>Age of the player (years)


<br><br><br>

# Question 2
My broad research question is: which time windows are most likely to have a large number of simultaneous players? Answering it helps ensure that the number of licenses on hand is large enough to accommodate all parallel players with high probability.

The specific question is: Can start_time and end_time predict the time_period?<br><br>

###  Wrangling Data
1. We convert **experience**, **gender**, and **subscribe** to factors because they are categorical variables.
2. We recode the levels of **subscribe** from "TRUE" to "Yes" and from "FALSE" to "No" so they are easier to read and understand.
3. **name** and **hashedEmail** are identifiers, neither numeric nor categorical. They are not useful for prediction, so we drop those columns from the data frame.
4. Since we only need **played_hours**, **subscribe**, and **Age** to explore our question, we can create a new data frame with only these three variables.
5. We standardize **played_hours** and **Age** so that both predictors are on a comparable scale and contribute equally to the distance between points.
6. Then we can use K-NN classification to predict the label of **subscribe**.
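
To see why step 5 matters for a distance-based method, here is a minimal sketch using base R only. The three players in `toy` are made-up values, not rows from **players.csv**, and `scale()` stands in for the centering and scaling that the recipe performs later.

```r
# Hypothetical players: values are invented for illustration only.
toy <- data.frame(
  played_hours = c(1, 60, 3),
  Age          = c(17, 18, 45)
)

# Euclidean distance between rows i and j of a numeric data frame.
euclid <- function(df, i, j) {
  m <- as.matrix(df)
  sqrt(sum((m[i, ] - m[j, ])^2))
}

# Distances on the raw scale: the variable with the larger spread dominates.
raw_ab <- euclid(toy, 1, 2)  # players differ mostly in played_hours
raw_ac <- euclid(toy, 1, 3)  # players differ mostly in Age

# Standardize each column to mean 0 and standard deviation 1.
toy_scaled <- as.data.frame(scale(toy))

# Distances after standardization: both variables are on the same footing.
scaled_ab <- euclid(toy_scaled, 1, 2)
scaled_ac <- euclid(toy_scaled, 1, 3)

c(raw_ab = raw_ab, raw_ac = raw_ac, scaled_ab = scaled_ab, scaled_ac = scaled_ac)
```

Comparing the raw and scaled distances shows how much the choice of units can change which point counts as the nearest neighbor.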

<br><br><br>
# Question 3 

We load the sessions and players datasets directly from their raw GitHub URLs.


In [None]:
url_sessions <- "https://raw.githubusercontent.com/Yiliny110/project/refs/heads/main/sessions.csv"
sessions <- read_csv(url_sessions)
sessions

In [None]:
url_players <- "https://raw.githubusercontent.com/Yiliny110/project/refs/heads/main/players.csv"
players <- read_csv(url_players)
players

<br><br><br><br>We wrangle the data as described in Question 2.

In [None]:
# Convert subscribe, experience, and gender to factors

players_data <- players |>
    mutate(
        subscribe = as_factor(subscribe),
        experience = as_factor(experience),
        gender = as_factor(gender)
    ) |>
    mutate(subscribe = fct_recode(subscribe, "Yes" = "TRUE", "No" = "FALSE")) |>
    select(-hashedEmail, -name)
players_data

In [None]:
# 1. Explore the players data

num_obs <- nrow(players)
players_data |>
  group_by(subscribe) |>
  summarize(
    count = n(),
    percentage = round(n() / num_obs * 100, 2)
  )

We explore the data by computing each class's percentage: its count divided by the total number of observations, multiplied by 100. There are 144 (73.5%) "Yes" and 52 (26.5%) "No" observations, so the classes are imbalanced, with roughly three quarters of players subscribed. These proportions are the reference to check against when we split the data. <br><br>

In [None]:
# 2. Standardize the data set

set.seed(9999) 
subscribe_recipe <- recipe(subscribe ~ played_hours + Age, data = players_data) |>
                    step_scale(all_predictors()) |>
                    step_center(all_predictors())

subscribe_scaled <- subscribe_recipe |>
                    prep() |>
                    bake(players_data)

subscribe_scaled


<br><br><br><br>



We use `where(is.numeric)` to select all the numeric variables in the players dataset and calculate their means. The average played time is about 5.8 hours, and the average age of players is about 21 years.

In [None]:
# Calculate the mean value for each quantitative variable in the players.csv data set
# round to 2 decimals

numeric_variable_mean <- players_data |>
  select(where(is.numeric)) |>
  summarize(across(everything(), ~ round(mean(.x, na.rm = TRUE), 2)))

numeric_variable_mean

<br><br><br><br>
## Plots
<br><br>
### 1. Scatter Plot
<br>To explore how the predictors relate to the class of **subscribe**, we draw a scatter plot with age on the x-axis and played_hours on the y-axis, colored by subscribe.
<br><br>Most of the points are concentrated at the bottom of the graph. Players with high total played hours tend to subscribe to the game-related newsletter, while most non-subscribers spent a short (below-average) time on the game.<br><br>Most subscribers are around the average age, while among older players the non-subscribers are the majority.

In [None]:
options(repr.plot.width = 8, repr.plot.height = 7)

subscribe_plot <- subscribe_scaled |>
    ggplot(aes(x = Age, y = played_hours, colour = subscribe)) +
        geom_point(alpha = 0.6) +
        labs(x = "Age of Players (standardized)",
             y = "Played Hours (standardized)",
             color = "Subscribed",
             title = "Played Hours and Age by Newsletter Subscription") +
        theme(
            text = element_text(size = 20),
            plot.title = element_text(size = 14, face = "bold"),
            strip.text = element_text(size = 16, face = "bold")
            )
subscribe_plot

### 2. Histogram Plots
We look at each predictor separately with a histogram to see its distribution within each **subscribe** class. <br><br>The tallest bar for non-subscribers holds about 48 players, at a standardized played-hours value of about -0.5. The tallest bar for subscribers, about 123 players, is also at about -0.5. In both groups there is a large gap between the count in this most common bin and the counts at all other played-hours values.

In [None]:

options(repr.plot.width = 14, repr.plot.height = 7)

played_hours_hist <- subscribe_scaled |>
    ggplot(aes(x = played_hours, fill = subscribe)) +
      geom_histogram(bins = 30, alpha = 0.8, color = "white") +
      facet_grid(. ~ subscribe) +
      scale_fill_manual(values = c("No" = "red", "Yes" = "skyblue")) +
      labs(
        title = "Distribution of Played Hours by Newsletter Subscription",
        x = "Played Hours (standardized)",
        y = "Number of Players",
        fill = "Subscribed to Newsletter"
      ) +
      theme_minimal(base_size = 16) +
      theme(
        plot.title = element_text(size = 18, face = "bold"),
        strip.text = element_text(size = 16, face = "bold")
      )

played_hours_hist

The tallest bar for non-subscribers holds about 22 players, at a standardized age of about -0.5. The tallest bar for subscribers, about 58 players, is also at a standardized age of about -0.5. In both groups there is a large gap between the count at the most common age and the counts at other ages. In general, the number of subscribers decreases at ages above the standardized value of -0.5.

In [None]:
options(repr.plot.width = 14, repr.plot.height = 7)

age_hist <- subscribe_scaled |>
    ggplot(aes(x = Age, fill = subscribe)) +
      geom_histogram(bins = 30, alpha = 0.8, color = "white") +
      facet_grid(. ~ subscribe) +
      scale_fill_manual(values = c("No" = "red", "Yes" = "skyblue")) +
      labs(
        title = "Distribution of Player Age by Newsletter Subscription",
        x = "Age of Player (standardized)",
        y = "Number of Players",
        fill = "Subscribed to Newsletter"
      ) +
      theme_minimal(base_size = 16) +
      theme(
        plot.title = element_text(size = 18, face = "bold"),
        strip.text = element_text(size = 16, face = "bold")
      )

age_hist

# Question 4
<br><br>
### 1. K-NN Classification Method and Plan
Predicting a categorical variable from two numeric predictors, **Age** and **played_hours**, is a binary classification problem that suits K-NN. K-NN is a simple, intuitive algorithm: it works directly from the data using Euclidean distance, without fitting a parametric model. The dataset is also small (only 196 rows), so the distance computations are cheap. <br><br>
We split the **subscribe_scaled** data frame into 70% training data and 30% testing data. We then split the training data into 5 evenly sized folds and use 5-fold cross-validation to compare candidate values of k and choose the best one.
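
The plan above can be sketched with tidymodels roughly as follows. This is an illustration under assumptions, not the final analysis: `toy_players` is simulated data standing in for the real players, the seed and the grid of odd k values from 1 to 15 are arbitrary choices, and the `kknn` package must be installed for the engine to run. One variation from the text: the recipe here is estimated on the training set only, so that the test set does not influence the scaling.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed, used only to make this sketch reproducible

# Hypothetical stand-in for the players data: simulated, not real values.
n <- 196
toy_players <- tibble(
  played_hours = rexp(n, rate = 1 / 6),
  Age          = round(runif(n, min = 10, max = 50)),
  subscribe    = factor(
    sample(c("Yes", "No"), n, replace = TRUE, prob = c(0.73, 0.27)),
    levels = c("No", "Yes")
  )
)

# 70/30 split, stratified so both sets keep similar class proportions.
players_split <- initial_split(toy_players, prop = 0.7, strata = subscribe)
players_train <- training(players_split)
players_test  <- testing(players_split)

# Standardize the predictors; parameters are learned from the training set.
knn_recipe <- recipe(subscribe ~ played_hours + Age, data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# K-NN classifier with the number of neighbors left to be tuned.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation on the training set.
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)

# Candidate k values (an assumed grid).
k_grid <- tibble(neighbors = seq(from = 1, to = 15, by = 2))

# Cross-validated accuracy for each candidate k.
knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = players_vfold, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# Pick the k with the highest mean cross-validated accuracy.
best_k <- knn_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

best_k
```

Because the data here is simulated with no real signal, the selected k is not meaningful; the sketch only shows how the split, the folds, and the tuning step fit together.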
### 2. Assumptions
1. To use the K-NN model, we assume that the closest neighboring points are likely to have the same class.
2. We assume the dataset contains enough observations, because K-NN needs enough nearby points to tune k reliably and to find informative neighbors.
3. We do not assume any parametric relationship between **Age** and **subscribe** or between **played_hours** and **subscribe**; K-NN makes no assumption about the shape of these relationships.
4. Since "Yes" is the majority class of **subscribe** in the training data, the majority classifier would always predict "Yes" for a new observation. Its estimated accuracy is usually close to the majority class proportion in the training data, so we expect the majority classifier to have an accuracy of around 74%, in line with the 73.5% "Yes" share computed earlier.
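
The majority-classifier baseline in the last assumption is simple arithmetic, sketched here in base R. The counts are the full-dataset class counts reported earlier in this notebook (144 and 52); using them for the training set assumes the split keeps roughly the same proportions.

```r
# Class counts reported earlier for the full players dataset.
class_counts <- c(Yes = 144, No = 52)

# The majority classifier always predicts the most common class.
majority_class <- names(which.max(class_counts))

# Its accuracy on data with these proportions is that class's share.
baseline_accuracy <- max(class_counts) / sum(class_counts)

majority_class
round(baseline_accuracy * 100, 1)
```

Any tuned K-NN model should be compared against this baseline, since a classifier that does no better adds nothing over always predicting "Yes".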
### 3. Weakness
1. **players.csv** is a small dataset with only 196 observations, so K-NN predictions might not be accurate.
2. Some players have a missing (NA) **Age**. K-NN cannot compute a distance for a row with a missing predictor, so these rows must be dropped or imputed before fitting, and either choice may reduce prediction accuracy.
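
Two common ways to handle the missing ages are sketched below. The rows in `toy` are invented for illustration, and mean imputation is just one possible choice, not a step taken in the analysis above.

```r
library(tidyverse)

# Hypothetical rows, made up for illustration; one Age is missing.
toy <- tibble(
  played_hours = c(0.0, 3.8, 30.3, 0.7),
  Age          = c(17, NA, 9, 21),
  subscribe    = factor(c("No", "Yes", "Yes", "No"))
)

# Count the missing values in each column.
na_counts <- toy |>
  summarize(across(everything(), ~ sum(is.na(.x))))

# Option 1: drop rows with a missing Age before fitting.
toy_complete <- toy |>
  drop_na(Age)

# Option 2: replace a missing Age with the mean of the observed ages.
toy_imputed <- toy |>
  mutate(Age = if_else(is.na(Age), mean(Age, na.rm = TRUE), Age))

na_counts
toy_complete
toy_imputed
```

Dropping rows shrinks an already small dataset, while imputing keeps every row at the cost of introducing estimated values.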