# Project Planning Stage

## 1. Data Description

In [None]:
library(tidyverse)

In [None]:
sessions <- read_csv("dsci100-project/sessions.csv")
players <- read_csv("dsci100-project/players.csv")

In [None]:
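# Attach each player's details to their sessions. With no `by` argument,
# inner_join() matches on the column(s) the two tables have in common.
combined_data <- inner_join(sessions, players)
combined_data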
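An inner join keeps only the rows whose key appears in both tables, so it can be worth checking what gets dropped. The following is a minimal sketch on toy data, not the real files: the tibbles `players_toy` and `sessions_toy` and their values are hypothetical, and the key name `hashEmail` is taken from the data description in this report.

```r
library(dplyr)

# Toy stand-ins for the two CSV files (hypothetical values, not the real data)
players_toy <- tibble(
  hashEmail  = c("a1", "b2", "c3"),
  experience = c("Amateur", "Pro", "Veteran")
)
sessions_toy <- tibble(
  hashEmail           = c("a1", "a1", "b2", "z9"),
  original_start_time = c(1.72e12, 1.72e12 + 3.6e6, 1.72e12, 1.72e12)
)

# Naming the key explicitly avoids relying on whichever columns happen to match
combined_toy <- inner_join(sessions_toy, players_toy, by = "hashEmail")

# Rows that the inner join drops because they have no match in the other table
sessions_without_player <- anti_join(sessions_toy, players_toy, by = "hashEmail")
players_without_session <- anti_join(players_toy, sessions_toy, by = "hashEmail")

nrow(combined_toy)
nrow(sessions_without_player)
nrow(players_without_session)
```

If either `anti_join()` result is non-empty on the real data, the joined table describes fewer players or sessions than the source files do.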

In [None]:
# Mean of played_hours over every row of the joined (session-level) table
average_hr_played <- combined_data |>
  summarize(average_hr_played = round(mean(played_hours, na.rm = TRUE), 2))
average_hr_played

In [None]:
average_gender <- combined_data |>
  group_by(gender) |>
  summarize(count = n()) |>
  mutate(percentage = round(count / sum(count) * 100, 2))
average_gender

In [None]:
average_experience <- combined_data |> 
  group_by(experience) |>
  summarize(count = n()) |>
  mutate(percentage = round(count / sum(count) * 100, 2)) |>
  arrange(desc(count))
average_experience

In [None]:
# Mean of every numeric column, ignoring missing values
average_across <- combined_data |>
  summarize(across(where(is.numeric), ~ round(mean(.x, na.rm = TRUE), 2)))
average_across

#### Summary of the dataset:
- The data were collected from players who voluntarily signed up for the plaicraft.ai program, launched by the Pacific Laboratory for Artificial Intelligence.
- There are **1535** observations in total.

- There are **11** variables in total:
  - **"hashEmail"**: a hashed value acting as an identifier for each Minecraft player
  - **"start_time"**: the game start time, recorded in day/month/year time format
  - **"end_time"**: the game end time, recorded in day/month/year time format
  - **"original_start_time"**: numerical value tracking the game start time, recorded in a different format
  - **"original_end_time"**: numerical value tracking the game end time, recorded in a different format
  - **"experience"**: categorical variable indicating the player's experience level (Beginner, Amateur, Regular, Pro, Veteran)
  - **"subscribe"**: logical variable indicating whether or not the player has subscribed (TRUE/FALSE)
  - **"played_hours"**: numerical value indicating the player's total play time in recorded sessions
  - **"name"**: categorical variable of the player's name
  - **"gender"**: categorical variable indicating the player's gender
  - **"Age"**: numerical value recording the player's age
  
- The data source is the combination of *players.csv* and *sessions.csv*. The two datasets were joined on the shared key **"hashEmail"**.

- Each row represents a single Minecraft session, together with the demographic details of the player.

- One issue is that the data are incomplete. The **"gender"** variable contains several **"prefer not to answer"** responses, and the **"Age"** variable has some **NA** values in the players.csv dataset.
- The data may also contain biases. For instance, as mentioned above, players signed up voluntarily, so there may be sampling bias: the dataset only includes players who chose to be a part of the research.
- Inconsistencies in measurement also arise. Several categorical variables are hard to assign; for example, the **"experience"** variable (Beginner, Amateur, Regular, Pro, Veteran) is difficult to measure.
- Inconsistencies could also exist in the recorded play time (when a session started or ended), as players could have left the game running during the experiment.
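
The data-quality issues listed above can be checked directly. The sketch below uses a small hypothetical table (`toy`) with the same column names as the joined data; its values, the exact non-response label, and the 12-hour cutoff are illustrative assumptions. It also assumes that the `original_*` columns are Unix timestamps in milliseconds, which this report does not confirm.

```r
library(dplyr)

# Toy joined table (hypothetical values; column names follow the list above)
toy <- tibble(
  hashEmail           = c("a1", "a1", "b2", "c3"),
  gender              = c("Male", "Male", "Prefer not to answer", "Female"),
  Age                 = c(17, 17, NA, 21),
  original_start_time = c(1.7200e12, 1.7201e12, 1.7200e12, 1.7200e12),
  original_end_time   = c(1.7200e12 + 1.8e6, 1.7201e12 + 3.6e6,
                          1.7200e12 + 9.0e7, NA)
)

# Number of missing values in every column
na_counts <- toy |>
  summarize(across(everything(), ~ sum(is.na(.x))))

# How often each gender response occurs, including non-responses
gender_counts <- toy |>
  count(gender)

# Session length in hours, ASSUMING the original_* columns are Unix
# timestamps in milliseconds (3.6e6 ms = 1 hour)
durations <- toy |>
  mutate(session_hours = (original_end_time - original_start_time) / 3.6e6)

# Sessions above an arbitrary 12-hour cutoff may be a game left running;
# filter() also drops rows where the duration is NA
long_sessions <- durations |>
  filter(session_hours > 12)

na_counts
gender_counts
long_sessions
```

The cutoff is a judgment call; plotting the distribution of `session_hours` first would be a reasonable way to choose it.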

##### Summary Statistics
- The average hours played across all players was **98.57** (computed over the joined, session-level rows).
- Of all the players, the majority (**66.12%**) were male, and **24.89%** were female.
- **53.42%** of players were Amateur based on experience level, with **33.81%** Regular, **6.91%** Beginner, **3.32%** Veteran, and **2.54%** Pro.
- The average start time was **1.72e+12** (the mean of the numeric **"original_start_time"** column).
- The average end time was **1.72e+12** (the mean of the numeric **"original_end_time"** column).
- The average play time was **98.57** hours, matching the figure above.
- The average age was **19.43** years old.
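
Because the joined table has one row per session, a player with many sessions contributes many rows to these averages and percentages. If per-player figures are intended, one option is to keep a single row per player before summarizing. The sketch below shows the difference on hypothetical toy values. It also converts a value of the size reported above into a date, under the assumption (not confirmed in this report) that the timestamps are Unix times in milliseconds.

```r
library(dplyr)

# Toy joined table (hypothetical): player "a1" has three sessions, "b2" has one
combined_toy <- tibble(
  hashEmail    = c("a1", "a1", "a1", "b2"),
  played_hours = c(10, 10, 10, 2),
  Age          = c(20, 20, 20, 30)
)

# Mean over session rows: "a1" is counted three times
per_session_mean <- combined_toy |>
  summarize(mean_hours = mean(played_hours, na.rm = TRUE)) |>
  pull(mean_hours)

# Mean over players: keep one row per hashEmail first
per_player_mean <- combined_toy |>
  distinct(hashEmail, .keep_all = TRUE) |>
  summarize(mean_hours = mean(played_hours, na.rm = TRUE)) |>
  pull(mean_hours)

per_session_mean  # 8
per_player_mean   # 6

# ASSUMPTION: timestamps are milliseconds since 1970-01-01 UTC
approx_time <- as.POSIXct(1.72e12 / 1000, origin = "1970-01-01", tz = "UTC")
approx_time
```

Computing the same summaries on `players` directly, without the join, would be another way to get per-player figures.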

## 2. Questions

The broad question that I will be addressing is: we would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts. The specific question that I'll be addressing is the