# Project Planning Stage (Individual)

## (1) Data Description:
- ### players.csv
    - The players.csv dataset describes each unique player who has played on the Minecraft server. It spans 196 players (observations) and records 7 variables for each player, summarized in the table below.

| **Variable** | **Type**  | **Description**                                     | **Summary Statistics**       |
|--------------|-----------|-----------------------------------------------------|------------------------------|
| subscribe    | Logical   | Whether the player is subscribed to the newsletter  | N/A                          |
| hashedEmail  | Character | Hashed email address of the player                  | N/A                          |
| played_hours | Double    | Number of hours the player spent on the server      | Mean: 20, Min: 0, Max: 223.1 |
| name         | Character | Name of the player                                  | N/A                          |
| gender       | Character | Gender of the player                                | N/A                          |
| Age          | Double    | Age of the player (years)                           | Mean: 20, Min: 8, Max: 50    |
| experience   | Character | Experience level of the player                      | N/A                          |

- The dataset has some potential issues:
  1. Some cells contain 'N/A' values, so those observations must either be skipped or have the missing values replaced with a different value.
  2. Non-binary players are underrepresented, as they appear to be a minority in the gender category.
  3. Older players are underrepresented, as most players are around 18 - 21 years old.
- Additionally, some unseen factors may include where the data was collected and why some players entered 'N/A' as an answer.
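
The missing-value issue in point 1 can be checked directly before choosing between skipping and replacing. The following is a minimal sketch on a toy tibble; `toy_players` and its values are made up for illustration and are not taken from players.csv.

```r
library(tidyverse)

# Toy stand-in for the players data (hypothetical values, illustration only)
toy_players <- tibble(
  Age          = c(17, NA, 21, 19),
  played_hours = c(0.1, 3.8, 1.0, 0),
  gender       = c("Male", "Female", "Non-binary", "Male")
)

# Count the missing values in every column
na_counts <- toy_players |>
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts

# Option 1: skip observations that contain any missing value
complete_players <- toy_players |> drop_na()
nrow(complete_players)
```

Dropping rows is the simpler option, but it shrinks an already small dataset, so the count is worth checking before deciding.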

## (2) Questions:
- ### Broad Question
    - What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?
- ### Specific Question
    - Can the player type, age, and played hours of players predict whether they will subscribe to a game-related newsletter, and which player type is the most predictive?
 
### Process with Given Data
I plan to use the provided data to predict whether a player will subscribe to the newsletter. With a binary K-NN classification model, I can use the values of Age and played_hours to predict whether a player will subscribe. Additionally, I can group the players by experience level into separate datasets and see which one is the most predictable. Since only quantitative data is useful as predictors in this classification model, I only need the subscribe, experience, played_hours, and Age columns. Hence, I would wrangle the data with the select() function so that it contains only those 4 columns. Lastly, I would filter for each player experience level and train a model on each subset.
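
The wrangling described above could look roughly like the sketch below. It assumes a toy tibble with the same column names as players.csv; the rows are invented, and `players_model` and `players_by_experience` are hypothetical names.

```r
library(tidyverse)

# Toy stand-in with the same columns as players.csv (hypothetical values)
toy_players <- tibble(
  experience   = c("Pro", "Amateur", "Amateur", "Veteran"),
  subscribe    = c(TRUE, FALSE, TRUE, TRUE),
  hashedEmail  = c("a1", "b2", "c3", "d4"),
  played_hours = c(30.3, 0, 0.7, 3.8),
  name         = c("P1", "P2", "P3", "P4"),
  gender       = c("Male", "Female", "Male", "Non-binary"),
  Age          = c(9, 21, 17, 17)
)

# Keep only the four columns needed for modelling;
# the response is converted to a factor for classification
players_model <- toy_players |>
  select(subscribe, experience, played_hours, Age) |>
  mutate(subscribe = factor(subscribe))

# One data frame per experience level (equivalent to repeated filter() calls)
players_by_experience <- split(players_model, players_model$experience)
names(players_by_experience)
```

Splitting into a named list keeps one object per experience level, so the same modelling steps can later be applied to each element in turn.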

## (3) Exploratory Data Analysis and Visualization:

In [None]:
#Required libraries
library(tidyverse)

In [None]:
# Loading dataset into R
players_data <- read_csv("data/players.csv")
players_data |> head(10)

In [None]:
# Minimal wrangling: convert the categorical columns to factors
players_tidy <- players_data |>
    mutate(experience = as_factor(experience), gender = as_factor(gender))
players_tidy |> head(10)

In [None]:
#Computing mean values for each quantitative variable
players_mean <- players_tidy |> 
    select(played_hours, Age) |> 
    map_dfr(mean, na.rm = TRUE)
players_mean

In [None]:
# Exploratory Visualizations
players_scatter <- players_tidy |>
    ggplot(aes(x = Age, y = played_hours, color = experience)) +
    geom_point(alpha = 0.9) + 
    scale_x_log10() +
    scale_y_log10() +
    labs(x = "Age (years)", y = "Hours Played (hours)", color = "Experience Level") +
    ggtitle("Hours Played vs. Age Relationship") +
    theme(text = element_text(size = 18))

players_scatter_sub <- players_tidy |>
    ggplot(aes(x = Age, y = played_hours, color = subscribe)) +
    geom_point(alpha = 0.9) + 
    scale_x_log10() +
    scale_y_log10() +
    labs(x = "Age (years)", y = "Hours Played (hours)", color = "Subscribed?") +
    ggtitle("Hours Played vs. Age by Subscription Status") +
    theme(text = element_text(size = 18))

# Total hours played at each age, one panel per experience level
# (despite its name, this plot facets by experience, not gender)
players_bar_gender <- players_tidy |>
    ggplot(aes(x = Age, y = played_hours)) +
    geom_col() +
    labs(x = "Age (years)", y = "Hours Played (hours)") +
    ggtitle("Total Hours Played by Age") +
    theme(text = element_text(size = 18)) +
    facet_grid(rows = vars(experience))

# Same plot as above, with bars coloured by experience level
players_bar_ex <- players_tidy |>
    ggplot(aes(x = Age, y = played_hours, fill = experience)) +
    geom_col() +
    labs(x = "Age (years)", y = "Hours Played (hours)", fill = "Experience Level") +
    ggtitle("Total Hours Played by Age") +
    theme(text = element_text(size = 18)) +
    facet_grid(rows = vars(experience))

# Number of players per gender, stacked by experience level
players_bar_gender_better <- players_tidy |>
    ggplot(aes(x = gender, fill = experience)) +
    geom_bar() +
    labs(x = "Gender", y = "Number of Players", fill = "Experience Level") +
    ggtitle("Distribution of players across gender and \nexperience") +
    theme(text = element_text(size = 18),
          axis.text.x = element_text(angle = 45, hjust = 1))

players_scatter
players_bar_gender
players_bar_ex
players_scatter_sub
players_bar_gender_better

- ### Insight
    - Players with the most experience spend the least amount of time on the server.
    - There seems to be an overrepresentation of players around the ages of ~18-20.
    - Gender does not seem to have a relationship with the number of hours played.
    - Most of the players in the dataset are subscribed.
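
Insights like these can be backed by a grouped summary in addition to the plots. The sketch below uses invented toy values, not the real data, and `experience_summary` is a hypothetical name.

```r
library(tidyverse)

# Toy stand-in for players_tidy (hypothetical values, illustration only)
toy_players <- tibble(
  experience   = c("Pro", "Pro", "Amateur", "Amateur", "Regular", "Regular"),
  subscribe    = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE),
  played_hours = c(0.5, 1.5, 10, 20, 3, 5)
)

# Per experience level: group size, mean hours, and share of subscribers
# (the mean of a logical column is the proportion of TRUE values)
experience_summary <- toy_players |>
  group_by(experience) |>
  summarise(
    n_players       = n(),
    mean_hours      = mean(played_hours, na.rm = TRUE),
    prop_subscribed = mean(subscribe)
  )
experience_summary
```

Running the same summary on `players_tidy` would show whether the pattern seen in the plots holds numerically.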

## (4) Methods and Plan:

Method:
1. Wrangle the dataset and remove irrelevant variables (such as hashedEmail).
2. Use visualization techniques to understand the data and form a rough expectation of the answer.
3. Filter for each player type and perform the following steps on each resulting dataset:
    1. Split the data into training and testing sets with an 85/15 split.
    2. Divide the training data into 5 folds.
    3. Train and tune a K-NN binary classification model with cross-validation, over K from 1 to 30, until the K with the best accuracy is found.
    4. Retrain the K-NN classification model on the training set using the best K.
    5. Assess accuracy and precision on the testing set and make sure the model is better than the majority classifier.
    6. Run the predictors through forward selection to find the most important predictors.
4. Compare which model yields the highest accuracy to see which player type is the most predictable.
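
Steps 3.1 to 3.5 could be sketched with tidymodels as follows. This is only an outline under several assumptions: the data is synthetic stand-in data for one experience-level subset, the `kknn` package is installed, and the predictors are standardized in a recipe (a step not listed in the plan, added here because K-NN is distance based). All object names are hypothetical.

```r
library(tidymodels)

set.seed(2024)

# Synthetic stand-in for one experience-level subset (not real data)
n <- 200
toy <- tibble(
  Age          = round(runif(n, 8, 50)),
  played_hours = round(rexp(n, rate = 0.2), 1)
) |>
  mutate(subscribe = factor(played_hours + rnorm(n, sd = 3) > 3))

# 3.1: 85/15 split, stratified on the response
players_split <- initial_split(toy, prop = 0.85, strata = subscribe)
players_train <- training(players_split)
players_test  <- testing(players_split)

# 3.2: five cross-validation folds from the training set
players_folds <- vfold_cv(players_train, v = 5, strata = subscribe)

# Assumption: standardize predictors so both contribute equally to distances
players_recipe <- recipe(subscribe ~ Age + played_hours, data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 3.3: cross-validated accuracy for K = 1, ..., 30
cv_results <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(knn_tune) |>
  tune_grid(resamples = players_folds, grid = tibble(neighbors = 1:30)) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

best_k <- cv_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# 3.4: refit on the full training set with the best K
knn_best <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

final_fit <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(knn_best) |>
  fit(data = players_train)

# 3.5: accuracy and precision on the held-out testing set
# ("TRUE" is the second factor level, so it is treated as the positive class)
test_preds <- predict(final_fit, players_test) |> bind_cols(players_test)
test_preds |> accuracy(truth = subscribe, estimate = .pred_class)
test_preds |> precision(truth = subscribe, estimate = .pred_class, event_level = "second")
```

Repeating this on each experience-level subset and comparing the test accuracies would give the comparison in step 4. Forward selection (step 3.6) is not covered by this sketch.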


To answer the broad and specific questions, I will use the K-Nearest-Neighbor binary classification model. This is mainly because the prediction I am making has only two states: whether a new player is going to subscribe or not. Hence, a binary classification model is the most appropriate choice. However, before executing this, there are a couple of assumptions I need to make. Firstly, I assume the dataset has no outliers, since outliers might distort the predictions. Secondly, I assume the dataset contains a sufficient sample of each player experience level to make a fairly accurate prediction. Some limitations of this model are that it performs poorly when there are too many predictors or when the classes are imbalanced. However, I can try to maximize the accuracy of the models by using cross-validation when training the classification model. This way, I can compare models with different K values and identify the one that yields the highest accuracy.
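
Because class imbalance is one of the stated limitations, it helps to know the accuracy a majority classifier would achieve, since that is the bar the K-NN model has to clear. A minimal sketch on invented labels (not the real data):

```r
library(tidyverse)

# Toy subscription labels (hypothetical): 6 subscribed, 2 not subscribed
toy_labels <- tibble(subscribe = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE))

# Class proportions; the largest one is the majority classifier's accuracy
class_props <- toy_labels |>
  count(subscribe) |>
  mutate(prop = n / sum(n))

baseline_accuracy <- max(class_props$prop)
baseline_accuracy
```

For these toy labels, always predicting "subscribed" is already right 75% of the time, so a K-NN accuracy near that value would not indicate a useful model.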

Through my process, I will split each player experience level dataset 6 times in total. First, I would split the dataset into training and testing sets right after it is wrangled into a tidy format, using an 85/15 split. Then, when I need to tune K, I would split the training set into 5 chunks of equal proportion so that cross-validation can be applied to them.
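
To see what these proportions mean in rows, the sketch below does the arithmetic by hand in base R for a hypothetical subset of 60 players (the real subset sizes are not assumed here). In practice, `initial_split()` and `vfold_cv()` from rsample would handle this.

```r
set.seed(1)

n <- 60  # hypothetical number of players in one experience-level subset

# 85/15 split: pick the training rows at random, the rest are for testing
train_idx <- sample(n, size = round(0.85 * n))
n_train   <- length(train_idx)  # 51 training rows
n_test    <- n - n_train        # 9 testing rows

# Assign each training row to one of 5 folds of (nearly) equal size
fold_id    <- sample(rep(1:5, length.out = n_train))
fold_sizes <- table(fold_id)
fold_sizes
```

With only 51 training rows, each fold holds about 10 players, which is why a sufficient sample size per experience level matters for the plan.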

In [None]:
source("cleanup.R")