# Project Planning Stage (Individual)

## (1) Data Description:
- ### players.csv
    - The players.csv dataset contains information about each unique player who has played on the Minecraft server. The record spans 196 different players (observations), and the dataset tracks 7 kinds of information (variables) about each player, as shown in the following table.

| **Variable**       | **Type**             | **R Type**             | **Summary Statistics**   |
|--------------------|----------------------|------------------------|--------------------------|
| experience         | Factor               | (fct)                  | -                        |
| subscribe          | Logical              | (lgl)                  | -                        |
| hashedEmail        | Character Vector     | (chr)                  | -                        |
| played_hours       | Double               | (dbl)                  | Mean/Median/Mode/Min/Max |
| name               | String               | (chr)                  | -                        |
| gender             | String               | (chr)                  | -                        |
| Age                | Double               | (dbl)                  | Mean/Median/Mode/Min/Max |

- There are some potential issues in this dataset.
  1. Some cells contain 'N/A' values, so those cells must either be skipped or replaced with a different value before analysis.
  2. The dataset underrepresents non-binary people, as they seem to be a minority in the gender category.
  3. The dataset underrepresents older people, as most of the players are around 18 - 21 years old.
- Additionally, some unseen factors may affect the data, such as where it was collected or why some people entered 'N/A' as an answer.
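
The issues listed above can be checked directly before any modelling. The following is a minimal sketch, assuming a small made-up table (`toy_players` and its values are hypothetical, not the real data) with the same column names as the dataset. It counts missing values per column and tallies the gender categories.

```r
library(tidyverse)

# Hypothetical stand-in for the real data; the values are invented for illustration
toy_players <- tibble(
  Age = c(17, 21, NA, 19, 45),
  played_hours = c(0.5, 30.3, 1.2, NA, 0),
  gender = c("Male", "Female", "Non-binary", "Male", "Male")
)

# Number of missing values in each column
na_counts <- toy_players |>
  summarise(across(everything(), ~ sum(is.na(.x))))

# Number of players in each gender category, largest group first
gender_counts <- toy_players |>
  count(gender, sort = TRUE)

na_counts
gender_counts
```

Running the same two summaries on the real table would show how many rows are affected by missing values and how uneven the gender groups are.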

## (2) Questions:
- ### Broad Question
    - What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?
- ### Specific Question
    - Can the skill level of players, age, and played hours predict subscription rates in players.csv?

## (3) Exploratory Data Analysis and Visualization:

In [None]:
#Required libraries
library(tidyverse)

In [None]:
# Loading dataset into R
players_data <- read_csv("data/players.csv")
players_data |> head(10)

In [None]:
# Minimal wrangling: convert the categorical variables to factors
players_tidy <- players_data |>
    mutate(experience = as_factor(experience), gender = as_factor(gender))
players_tidy |> head(10)

In [None]:
# Computing mean values for each quantitative variable
players_mean <- players_tidy |> 
    select(played_hours, Age) |> 
    map_dfr(mean, na.rm = TRUE)
players_mean

In [None]:
# Exploratory Visualizations
# Hours played vs. age, coloured by experience level.
# Note: log10(0) is -Inf, so ggplot2 warns if any player has 0 hours.
players_scatter <- players_tidy |>
    ggplot(aes(x = Age, y = played_hours, color = experience)) +
    geom_point(alpha = 0.9) +
    scale_x_log10() +
    scale_y_log10() +
    labs(x = "Age (years)", y = "Hours Played (hours)", color = "Experience Level") +
    ggtitle("Hours Played vs. Age by Experience Level") +
    theme(text = element_text(size = 18))

# Hours played vs. age, coloured by subscription status
players_scatter_sub <- players_tidy |>
    ggplot(aes(x = Age, y = played_hours, color = subscribe)) +
    geom_point(alpha = 0.9) +
    scale_x_log10() +
    scale_y_log10() +
    labs(x = "Age (years)", y = "Hours Played (hours)", color = "Subscribed?") +
    ggtitle("Hours Played vs. Age by Subscription Status") +
    theme(text = element_text(size = 18))

# Total hours played at each age, one panel per gender
players_bar_gender <- players_tidy |>
    ggplot(aes(x = Age, y = played_hours)) +
    geom_col() +
    labs(x = "Age (years)", y = "Hours Played (hours)") +
    ggtitle("Hours Played vs. Age by Gender") +
    theme(text = element_text(size = 18)) +
    facet_grid(rows = vars(gender))

# Total hours played at each age, one panel per experience level
players_bar_ex <- players_tidy |>
    ggplot(aes(x = Age, y = played_hours, fill = experience)) +
    geom_col() +
    labs(x = "Age (years)", y = "Hours Played (hours)", fill = "Experience Level") +
    ggtitle("Hours Played vs. Age by Experience Level") +
    theme(text = element_text(size = 18)) +
    facet_grid(rows = vars(experience))

# Number of players in each gender category, split by experience level
players_bar_gender_better <- players_tidy |>
    ggplot(aes(x = gender, fill = experience)) +
    geom_bar() +
    labs(x = "Gender", y = "Number of Players", fill = "Experience Level") +
    ggtitle("Distribution of players across gender and \nexperience") +
    theme(text = element_text(size = 18),
          axis.text.x = element_text(angle = 45, hjust = 1))

players_scatter
players_bar_gender
players_bar_ex
players_scatter_sub
players_bar_gender_better

- ### Insight
    - Players with the most experience spend the least amount of time on the server.
    - Male players seem to be overrepresented in the gender category.
    - Gender does not seem to be related to the number of hours played.
    - Most of the players in the dataset are subscribed.
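
These observations come from reading the plots, so it may help to back them with grouped summaries. The following is a minimal sketch on made-up data (`toy_players`, its values, and the two experience labels are hypothetical) showing one way to compute mean hours per experience level and the share of subscribed players.

```r
library(tidyverse)

# Hypothetical stand-in data; experience labels and values are invented
toy_players <- tibble(
  experience = c("Beginner", "Beginner", "Pro", "Pro", "Pro"),
  played_hours = c(2, 4, 1, 0, 2),
  subscribe = c(TRUE, TRUE, FALSE, TRUE, TRUE)
)

# Mean hours played and group size for each experience level
hours_by_experience <- toy_players |>
  group_by(experience) |>
  summarise(mean_hours = mean(played_hours, na.rm = TRUE),
            n_players = n())

# Share of players who are subscribed (the mean of a logical vector)
subscribed_share <- mean(toy_players$subscribe)

hours_by_experience
subscribed_share
```

Applied to `players_tidy`, the same summaries would put numbers behind the insights on experience and subscription.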

## (4) Methods and Plan:

Method:
1. Wrangle the dataset and filter out any unused variables.
2. Use visualization techniques to get a general sense of where the answer may lie.
3. Split the dataset into a training set and a testing set, using a 75/25 split.
4. Run cross-validation on the folds of the training set to find the optimal K.
5. Train the model on the training set using the chosen K.
6. Finally, use the model to predict on the testing set and measure its accuracy.
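
The steps above can be sketched with the tidymodels packages. This is an illustration under stated assumptions, not the final analysis: the data (`players_toy`) is randomly generated, and the seed, the 5 folds, and the candidate K values are placeholder choices. It also assumes the `kknn` package is installed.

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)  # arbitrary seed so the sketch is reproducible

# Hypothetical stand-in data: randomly generated, not the real players table
players_toy <- tibble(
  Age = round(runif(120, 10, 50)),
  played_hours = round(rexp(120, rate = 0.2), 1),
  subscribe = as_factor(if_else(played_hours + rnorm(120) > 3, "TRUE", "FALSE"))
)

# Step 3: 75/25 split, stratified on the outcome
players_split <- initial_split(players_toy, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

# KNN is distance based, so the predictors are standardized
knn_recipe <- recipe(subscribe ~ Age + played_hours, data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Step 4: tune K with 5-fold cross-validation on the training set
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))  # assumed candidate values

knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = players_vfold, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

best_k <- knn_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Step 5: refit on the full training set with the chosen K
knn_final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_final_spec) |>
  fit(data = players_train)

# Step 6: predict on the testing set and measure accuracy
test_accuracy <- predict(knn_fit, players_test) |>
  bind_cols(players_test) |>
  accuracy(truth = subscribe, estimate = .pred_class)

test_accuracy
```

The testing set is only touched in the last step, so the reported accuracy is not influenced by the choice of K.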

- Why is this method appropriate?
    - Getting a general understanding of the question at hand first helps confirm that the rest of the project is going in the right direction.
- Which assumptions are required, if any, to apply the method selected?
    - That the dataset is balanced and representative, and that no biases occurred when collecting it.
- What are the potential limitations or weaknesses of the method selected?
    - The class imbalance may bias the model toward the majority class.
- How are you going to compare and select the model?
    - By measuring accuracy; any additional metrics are still to be decided.
- How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits?
    - Tentatively, by splitting the training set into 5 folds.
- What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross validation?
    - A 75/25 split, made right after the visualization phase.
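
Because class imbalance is listed as a limitation, accuracy is easier to interpret next to a majority-class baseline. The following is a minimal sketch on made-up labels (`toy_labels` and its 15/5 split are hypothetical). It computes the proportion of each class and the accuracy a model would get by always predicting the most common class.

```r
library(tidyverse)

# Hypothetical labels: 15 subscribed and 5 not subscribed
toy_labels <- tibble(subscribe = c(rep(TRUE, 15), rep(FALSE, 5)))

# Proportion of observations in each class
class_balance <- toy_labels |>
  count(subscribe) |>
  mutate(proportion = n / sum(n))

# Accuracy of always predicting the majority class
baseline_accuracy <- max(class_balance$proportion)

class_balance
baseline_accuracy  # 0.75 for these toy labels
```

A classifier is only adding value if its accuracy on the testing set is clearly above this baseline.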


To answer the broad and specific questions, I would use the K-nearest neighbors classification model. Specifically, I would use a binary classification approach, as the value we are predicting is a logical type, which means there are only two possible outcomes. However, before applying this method, a few things must be considered. W