In [None]:
library(tidyverse)
library(repr)

players_data <- read_csv("players.csv")
sessions_data <- read_csv("sessions.csv")

# 1. Data Description

There are two datasets for this project: *players.csv* and *sessions.csv*.  

Data for both datasets were collected through an online Minecraft server called PLAICraft where researchers could track players' actions and progress through the game. 

### 1.1 players.csv

In [None]:
# number of observations

nrow(players_data)

- number of variables: 7
- name and type of variables:
    - experience (chr), subscribe (lgl), hashedEmail (chr), played_hours (dbl), name (chr), gender (chr), Age (dbl)
- what variables mean:
    - experience: player experience sorted into beginner, amateur, regular, veteran, or pro.
    - subscribe: whether players are subscribed to the game.
    - hashedEmail: encrypted version of players' email addresses.
    - played_hours: total hours spent playing the game
    - name: first name of players
    - gender: gender of players
    - Age: age of players in years
- visible issues:
    - The Age column contains some NA entries, so summary functions need na.rm = TRUE to ignore them.
- invisible issues:
    - unclear how player experience was determined
    - unclear how player experience levels are ranked (i.e. is pro higher or lower than veteran?)
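
As a small illustration of the NA issue noted above, the sketch below shows how a single missing value affects `mean()` and how `na.rm = TRUE` changes the result. The vector `toy_ages` is made up for this example and is not taken from players.csv.

```r
# Made-up ages for illustration only; not taken from players.csv.
toy_ages <- c(17, NA, 21, 25)

mean(toy_ages)                 # NA, because one entry is missing
mean(toy_ages, na.rm = TRUE)   # 21, the mean of the three known ages
sum(is.na(toy_ages))           # 1, the number of missing entries
```

The same `na.rm = TRUE` argument is accepted by `max()` and `min()`.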

### 1.2 sessions.csv


In [None]:
# number of observations

nrow(sessions_data)

- number of variables: 5
- name and type of variables:
    - hashedEmail (chr), start_time (chr), end_time (chr), original_start_time (dbl), original_end_time (dbl)
- what variables mean:
    - hashedEmail: explained above in players.csv
    - start_time: when player started playing, in dd/mm/yyyy [24h time]
    - end_time: when player stopped playing, in dd/mm/yyyy [24h time]
    - original_start_time: when player started playing, in UNIX format (milliseconds)
    - original_end_time: when player stopped playing, in UNIX format (milliseconds)
- visible issues: none apparent in the data
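
Since the time columns are stored as text and as UNIX milliseconds, a sketch of how they might be converted to date-times is given here. The values are made up, and both the UTC time zone and the exact `dd/mm/yyyy HH:MM` string layout are assumptions, as the data description does not state them.

```r
# Made-up session times for illustration; not real rows from sessions.csv.
toy_start_ms <- 1704067200000   # 2024-01-01 00:00:00 UTC, in milliseconds
toy_end_ms   <- 1704070800000   # one hour later

# as.POSIXct() expects seconds since 1970-01-01, so divide milliseconds by 1000.
# The UTC time zone is an assumption.
toy_start <- as.POSIXct(toy_start_ms / 1000, origin = "1970-01-01", tz = "UTC")
toy_end   <- as.POSIXct(toy_end_ms / 1000, origin = "1970-01-01", tz = "UTC")

# Parse a dd/mm/yyyy 24-hour string (assumed layout) into the same kind of object.
toy_start_parsed <- as.POSIXct("01/01/2024 00:00",
                               format = "%d/%m/%Y %H:%M", tz = "UTC")

# Session length in minutes.
toy_minutes <- as.numeric(difftime(toy_end, toy_start, units = "mins"))
toy_minutes
```

If the two representations describe the same moment, the parsed string and the converted timestamp should compare as equal.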

# 2. Questions


I would like to address the broad question of which "kinds" of players are most likely to contribute a large amount of data to the research, so that they can be targeted in recruitment efforts. Specifically, I am interested in the question: can player gender be predicted from the number of hours of PLAICraft played and player age in the players.csv dataset? The players.csv data will help me answer this question by supplying the observations from which I can build a classification model to visualize and predict whether there is a relationship between hours played, age, and player gender (i.e. which gender tends to spend the most time on the game). I would use the "played_hours" and "Age" variables as predictors and the "gender" variable as the classification label.
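
The text does not name a specific classifier. As one possibility, the sketch below uses k-nearest neighbours via `knn()` from the `class` package, with two standardized predictors and a factor label. The data frames `toy_train` and `toy_new` are made up and deliberately easy to separate; they say nothing about the real players.csv data.

```r
library(class)  # provides knn(); a recommended package shipped with most R installs

# Made-up training data, chosen only so the example is easy to follow.
toy_train <- data.frame(
    played_hours = c(0.5, 1.0, 1.5, 30, 35, 40),
    Age = c(17, 18, 19, 25, 26, 27),
    gender = factor(c("Female", "Female", "Female", "Male", "Male", "Male"))
)
# A hypothetical new player to classify.
toy_new <- data.frame(played_hours = 1.2, Age = 18)

predictors <- c("played_hours", "Age")

# Standardize the predictors so neither variable dominates the distance.
train_scaled <- scale(toy_train[, predictors])
# Apply the training centre and spread to the new observation.
new_scaled <- scale(toy_new[, predictors],
                    center = attr(train_scaled, "scaled:center"),
                    scale = attr(train_scaled, "scaled:scale"))

# Predict the label from the 3 nearest training points.
toy_prediction <- knn(train = train_scaled, test = new_scaled,
                      cl = toy_train$gender, k = 3)
toy_prediction
```

With real data, the choice of `k` would need tuning, and rows with missing Age values would have to be handled before scaling.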


# 3. Exploratory Data Analysis and Visualization


**I will focus only on the players.csv dataset in this section, as my question pertains only to the "gender", "played_hours", and "Age" variables, none of which come from sessions.csv.**

The players.csv data is already in a tidy format. We could change all the variable/column names to be lowercase; however, this is not necessary for the data to be tidy. The data are shown below:

In [None]:
options(repr.matrix.max.rows = 6)
players_data 

### 3.1 Summary stats

**Summary of player experience**

In [None]:
players_experience <- players_data |>
    group_by(experience) |>
    summarize(count = n()) |>
    arrange(desc(count))
players_experience


Amateurs are the most common experience level, while pro players make up the smallest group in this data.
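
To judge how large each group is relative to the whole, the counts can be turned into proportions. The sketch below uses `count()` from dplyr on a made-up data frame, `toy_players`, whose values are illustrative and not the real experience counts.

```r
library(dplyr)

# Made-up experience labels for illustration only; not the real players.csv values.
toy_players <- data.frame(
    experience = c("Amateur", "Amateur", "Pro", "Veteran", "Amateur", "Beginner")
)

# count() combines group_by() and summarize(n()); sort = TRUE orders by count.
toy_experience <- toy_players |>
    count(experience, name = "count", sort = TRUE) |>
    mutate(proportion = count / sum(count))
toy_experience
```

A proportion above 0.5 would indicate a true majority rather than simply the largest group.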

**Summary of player subscription**

In [None]:
players_sub <- players_data |>
    group_by(subscribe) |>
    summarize(count = n()) |>
    arrange(desc(count))   
players_sub

More players are subscribed than not, which the researchers would likely favour, as subscribed players receive news and updates more easily and can be more easily recruited for future projects.


**Summary of hours played**

In [None]:
# played hours summary stats: summarize() returns a single row directly
players_hours <- players_data |>
    summarize(max_hours = max(played_hours), 
              min_hours = min(played_hours), 
              avg_hours = round(mean(played_hours), 2))
players_hours


The minimum and maximum hours played are 0 and 223.1 hours, respectively. There is an average of 5.85 hours played.

**Summary of player genders**

In [None]:
# summary of player genders
players_gender <- players_data |>
    group_by(gender) |>
    summarize(count = n()) |>
    arrange(desc(count))
players_gender

Most players are male.

**Summary of player age**

In [None]:
# summary of player ages; na.rm = TRUE skips the missing Age entries
players_age <- players_data |>
    summarize(max_age = max(Age, na.rm = TRUE), 
              min_age = min(Age, na.rm = TRUE), 
              avg_age = round(mean(Age, na.rm = TRUE), 2))
players_age

Ages range from 9 to 58 years, with an average of about 21 years.

### 3.2 Visualizations

**Age vs played_hours**\
This plot shows Age against played_hours in the players.csv dataset to assess whether there is any clear trend in how long someone plays the game based on how old they are.

In [None]:
options(repr.plot.width = 12, repr.plot.height = 8)

age_vs_hours_plot <- players_data |>
    ggplot(aes(x = Age, y = played_hours)) +
    geom_point(alpha = 0.4) +
    xlab("Player age (in years)") +
    ylab("Hours of PLAICraft played") +
    theme(text = element_text(size = 15))
age_vs_hours_plot
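
Because the question concerns gender, one possible variation of this plot colours the points by the gender label. The sketch below assumes this is a useful view and runs on a made-up data frame, `toy_players`, rather than on the real data.

```r
library(ggplot2)

# Made-up rows for illustration only; not taken from players.csv.
toy_players <- data.frame(
    Age = c(17, 21, 25, 19, 30, 22),
    played_hours = c(0.5, 12.0, 3.2, 0.0, 45.1, 1.7),
    gender = c("Male", "Female", "Male", "Non-binary", "Female", "Male")
)

# Same scatter plot as above, with colour mapped to the classification label.
toy_plot <- ggplot(toy_players, aes(x = Age, y = played_hours, colour = gender)) +
    geom_point(alpha = 0.4) +
    labs(x = "Player age (in years)",
         y = "Hours of PLAICraft played",
         colour = "Gender") +
    theme(text = element_text(size = 15))
toy_plot
```

On the real data, rows with a missing Age would be dropped from the plot with a warning.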

# 4. Methods and Plan

# 5. GitHub Repository