# Project Planning Stage

## 1. Data Description

In [None]:
library(tidyverse)

In [None]:
sessions <- read_csv("data/sessions.csv")
players <- read_csv("data/players.csv")

In [None]:
# Join each session to its player's demographic details on the shared key
combined_data <- inner_join(sessions, players, by = "hashEmail")
combined_data

In [None]:
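# Mean of played_hours over all session rows, so players with more
# sessions carry more weight in this average
average_hr_played <- combined_data |>
  summarize(average_hr_played = round(mean(played_hours, na.rm = TRUE), 2))
average_hr_played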

In [None]:
# Count and percentage of session rows for each gender category
average_gender <- combined_data |>
  group_by(gender) |>
  summarize(count = n()) |>
  mutate(percentage = round(count / sum(count) * 100, 2))
average_gender

In [None]:
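# Count and percentage of session rows for each experience level,
# sorted from most to least common
average_experience <- combined_data |>
  group_by(experience) |>
  summarize(count = n()) |>
  mutate(percentage = round(count / sum(count) * 100, 2)) |>
  arrange(desc(count))
average_experience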

In [None]:
# Mean of every numeric column, rounded to two decimal places
average_across <- combined_data |>
  summarize(across(where(is.numeric), ~ round(mean(.x, na.rm = TRUE), 2)))
average_across

## 2. Questions

The broad question that I will be addressing is:  

**We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.**

The specific question that I'll be addressing is: 

**Can player demographics predict whether a player is a high-activity contributor in the Minecraft dataset?**

This is a valid question because it asks which player demographics (age, gender, and experience) are associated with playing Minecraft for a long period of time. A high contributor is defined as a player whose play time is above the average play time. I will use KNN classification to classify players as either "high" or "low" based on the variables listed above.
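The "high"/"low" definition above can be sketched as follows. This is a minimal illustration rather than the final labelling code: `players_toy` and its values are made up, the name `contribution` is hypothetical, and it assumes one row per player so that each player counts once toward the average.

```r
library(tidyverse)

# Hypothetical toy data standing in for one-row-per-player records
players_toy <- tibble(
  hashEmail = c("a", "b", "c", "d"),
  played_hours = c(0.5, 2.0, 30.0, 150.0)
)

# Threshold: mean play time across players (assumed meaning of "average")
threshold <- mean(players_toy$played_hours, na.rm = TRUE)

# Label a player "high" if their play time exceeds the threshold, else "low"
players_labelled <- players_toy |>
  mutate(contribution = factor(
    if_else(played_hours > threshold, "high", "low"),
    levels = c("low", "high")
  ))

players_labelled
```

Because play time is typically skewed by a few very heavy players, a mean-based threshold may place most players in the "low" class; the median would be one alternative cut-off to consider.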

#### Summary of the dataset: 
- The data were collected from volunteers through the plaicraft.ai program, launched by the Pacific Laboratory for Artificial Intelligence.
- **1535** observations in total.

- **11** variables in total:
  - **"hashEmail"**: hashed email address acting as an identifier for each player
  - **"start_time"**: session start time, recorded in day/month/year time format
  - **"end_time"**: session end time, recorded in day/month/year time format
  - **"original_start_time"**: numerical value tracking the session start time in a different format
  - **"original_end_time"**: numerical value tracking the session end time in a different format
  - **"experience"**: categorical variable indicating the player's experience level (e.g., Beginner, Amateur, Regular, Pro, Veteran)
  - **"subscribe"**: logical variable indicating subscription status (TRUE/FALSE)
  - **"played_hours"**: numerical value indicating the player's total play time
  - **"name"**: categorical variable holding the player's name
  - **"gender"**: categorical variable indicating the player's gender
  - **"Age"**: numerical value recording the player's age
  
- Data source: *players.csv* and *sessions.csv*, combined using the join key **"hashEmail"**.

- Each row represents a single Minecraft session, together with the demographic details of the player.

- One issue with the data is incompleteness: the **"gender"** variable has multiple **"prefer not to answer"** responses, and the age variable in the players.csv dataset has some **NA** values.
- Sampling bias: the data only include players who chose to take part in the research.
- Inconsistent measurement: several categorical variables are hard to assign objectively. For example, **"experience"** (e.g., Beginner, Amateur, Regular, Pro, Veteran) is difficult to measure.
- Play time may be overstated, as players could have left the game running during a session.
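The incompleteness noted above could be quantified with a check along these lines. This is a sketch on made-up data: `toy` is a hypothetical stand-in for the joined dataset, and the exact spelling of the "prefer not to answer" level is an assumption that should be checked against the real values.

```r
library(tidyverse)

# Hypothetical toy data; column names follow the dataset description,
# but the values are invented for illustration
toy <- tibble(
  gender = c("Male", "Female", "prefer not to answer", "Male"),
  Age = c(17, NA, 21, NA)
)

# Number of missing values in each column
na_counts <- toy |>
  summarize(across(everything(), ~ sum(is.na(.x))))

# Number of rows with a non-informative gender response
# (the level string is an assumption)
n_no_answer <- toy |>
  filter(gender == "prefer not to answer") |>
  nrow()

na_counts
n_no_answer
```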

##### Summary Statistics
- Average hours played = **98.57**
- **66.12%** male, **24.89%** female.  
- **53.42%** Amateur, **33.81%** Regular, **6.91%** Beginner, **3.32%** Veteran,  **2.54%** Pro.
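- Average original start time = **1.72e+12** (this appears to be a Unix timestamp in milliseconds, so the mean is not directly interpretable).
- Average original end time = **1.72e+12** (same format as the original start time).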
- Average age = **19.43** years old. 

## 3. Exploratory Data Analysis and Visualization

In [None]:
library(tidyverse)

## 4. Methods and Plan

- One method of interest is KNN classification. It is appropriate because it can classify each player's contribution, defined earlier as "high" or "low", based on player demographics and experience.
- KNN classification relies on the assumption that players with similar demographics will have similar outcomes.
- The data will be scaled and centered so that all variables contribute equally to the prediction, since KNN predictions are based on distance.
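The effect of scaling and centering on a distance-based method can be illustrated with a small sketch. The data below are made up, `toy` and `toy_scaled` are hypothetical names, and no model is fitted; the point is only to show how standardizing changes the distances that KNN would use.

```r
library(tidyverse)

# Hypothetical toy predictors measured on very different scales
toy <- tibble(
  Age = c(17, 21, 45),
  played_hours = c(0.5, 150, 2)
)

# Center each numeric column to mean 0 and scale it to standard deviation 1
toy_scaled <- toy |>
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x))))

# Euclidean distances between players before and after standardizing
dist(toy)
dist(toy_scaled)
```

Before standardizing, the distances are dominated by `played_hours` because its range is much larger than that of `Age`. In a tidymodels workflow, the same preprocessing would usually be expressed with recipe steps such as `step_center()` and `step_scale()`.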

