# Individual Planning Report  

Ariel Zhang  
DSCI 100

## Questions

**Broad Question:**   
We would like to know which "kinds" of players are most likely to contribute a large amount of data, so that we can target those players in our recruiting efforts.

**Specific Question:**  
Can a player’s age, gender, and experience level predict the total number of hours they have played on the Minecraft server?

**Description:**  
This question examines which player characteristics are associated with higher engagement, and therefore with more data contributed, on the Minecraft research server. Only one of the data sets is needed: `players.csv`, which records each player’s age, gender, and self-reported experience level along with their total hours played. The response variable is `played_hours`, the total time each player spent in the game. The explanatory variables are `Age`, `gender`, and `experience`, the characteristics that might influence how long a player spends in the game.

Before applying any regression method, we will wrangle the dataset so it is ready for analysis: dropping the columns that are not needed (`subscribe`, `hashedEmail`, and `name`), removing rows with missing values, and converting the categorical variables `gender` and `experience` to factors so they can be used in regression models. Once the data is tidy, we will fit linear and KNN regression models and compare how well player characteristics predict total play time.
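
The model comparison described above could be sketched roughly as follows. This is only an illustration under several assumptions that the report does not specify: the tidymodels framework with the `lm` and `kknn` engines, a 75/25 train/test split, a placeholder `k = 5` (in practice k would be tuned, for example by cross-validation), and test-set RMSE as the comparison metric. The helper name `compare_rmse` is hypothetical.

```r
library(tidyverse)
library(tidymodels)

# Hypothetical helper (not part of the original report): fits a linear and a
# KNN regression on the same training split and returns each model's test RMSE.
# Expects columns played_hours, Age, gender (factor), and experience (factor).
compare_rmse <- function(data, k = 5, prop = 0.75, seed = 2024) {
  set.seed(seed)  # make the random split reproducible
  data_split <- initial_split(data, prop = prop)
  data_train <- training(data_split)
  data_test <- testing(data_split)

  # Dummy-code the factors, drop constant columns, then scale the predictors
  # so that KNN distances weigh them equally.
  players_recipe <- recipe(played_hours ~ Age + gender + experience,
                           data = data_train) |>
    step_dummy(all_nominal_predictors()) |>
    step_zv(all_predictors()) |>
    step_normalize(all_predictors())

  specs <- list(
    linear = linear_reg() |>
      set_engine("lm") |>
      set_mode("regression"),
    knn = nearest_neighbor(weight_func = "rectangular", neighbors = k) |>
      set_engine("kknn") |>
      set_mode("regression")
  )

  # Fit each specification on the training set and score it on the test set.
  map_dfr(specs, \(spec) {
    workflow() |>
      add_recipe(players_recipe) |>
      add_model(spec) |>
      fit(data = data_train) |>
      predict(data_test) |>
      bind_cols(data_test) |>
      metrics(truth = played_hours, estimate = .pred) |>
      filter(.metric == "rmse")
  }, .id = "model")
}
```

With the wrangled data from this report, the call would be `compare_rmse(players_tidy)`. The model with the lower test RMSE predicts play time more accurately on players it has not seen.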


## Data Description

The dataset `players.csv` contains information about individual Minecraft players who participated in a research project. Each row represents one player and records their experience level, newsletter subscription status, total hours played, gender, and age, along with a hashed email and a name. This lets us explore how player characteristics relate to total playtime, which indicates how much data each player contributes.

**Dataset Overview**
- Number of Observations: 196
- Variables: 7  
- Unit of observation: One row per player

**Variables Summary**
| Variable | Type | Description | Use in Analysis |
|-----------|------|-------------|----------------|
| `experience` | Categorical | Player’s self-reported experience level | Explanatory variable |
| `subscribe` | Logical | Whether the player subscribed to the newsletter | Not used |
| `hashedEmail` | Character | Hashed email address | Not used |
| `played_hours` | Double | Total hours the player spent in the game | Response variable |
| `name` | Character | Player’s name | Not used |
| `gender` | Categorical | Player’s gender | Explanatory variable |
| `Age` | Double (whole numbers) | Player’s age in years | Explanatory variable |

**Summary Statistics**
| Variable | Mean |
|-----------|------|
| `played_hours` |  | 
| `Age` |  | 
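
The means for this table could be computed along the following lines. This is a sketch, assuming `players.csv` sits in the working directory and that missing values should be ignored when averaging. The helper name `quantitative_means` is hypothetical, and no values are filled in here.

```r
library(tidyverse)

# Hypothetical helper: mean of each quantitative column, ignoring missing
# values, returned in the same Variable/Mean layout as the table above.
quantitative_means <- function(data) {
  data |>
    summarise(across(c(played_hours, Age), \(x) mean(x, na.rm = TRUE))) |>
    pivot_longer(everything(), names_to = "Variable", values_to = "Mean") |>
    mutate(Mean = round(Mean, 2))
}

# Run only when the data file is available in the working directory.
if (file.exists("players.csv")) {
  read_csv("players.csv", show_col_types = FALSE) |>
    quantitative_means() |>
    print()
}
```

Rounding to two decimal places is a presentation choice for the table, not a requirement of the analysis.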




## Exploratory Data Analysis and Visualization

**Load and preview data:**

In [47]:
#Load tidyverse
library(tidyverse)

#Load dataset into R
players <- read_csv("players.csv")

#Preview data
head(players)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| experience | subscribe | hashedEmail | played_hours | name | gender | Age |
|------------|-----------|-------------|--------------|------|--------|-----|
| `<chr>` | `<lgl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| Pro | TRUE | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Morgan | Male | 9 |
| Veteran | TRUE | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Christian | Male | 17 |
| Veteran | FALSE | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Blake | Male | 17 |
| Amateur | TRUE | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5 | 0.7 | Flora | Female | 21 |
| Regular | TRUE | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e | 0.1 | Kylie | Male | 21 |
| Amateur | TRUE | f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977 | 0.0 | Adrian | Female | 17 |


**Minimal Wrangling:**

In [48]:
#Keep only the columns used in the analysis, drop rows with missing values,
#and convert the categorical variables to factors
players_tidy <- players |>
  select(experience, played_hours, gender, Age) |>
  drop_na() |>
  mutate(experience = as_factor(experience),
         gender = as_factor(gender))

#Preview the wrangled data
head(players_tidy)

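| experience | played_hours | gender | Age |
|------------|--------------|--------|-----|
| `<fct>` | `<dbl>` | `<fct>` | `<dbl>` |
| Pro | 30.3 | Male | 9 |
| Veteran | 3.8 | Male | 17 |
| Veteran | 0.0 | Male | 17 |
| Amateur | 0.7 | Female | 21 |
| Regular | 0.1 | Male | 21 |
| Amateur | 0.0 | Female | 17 |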


**Mean values for each quantitative variable in `players.csv`:**