In [1]:
library(tidyverse)
library(repr)
options(repr.matrix.max.rows = 6) 

players_data <- read_csv("players.csv")
sessions_data <- read_csv("sessions.csv")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────


**QUESTIONS:**
- does the data record users who visited multiple times?


**Notes:**
- formatting doesn't really matter
- writing is more what is being marked
- justify any code you do!!! including summary stats

## Individual Planning Report 
- describe the data you're working on
- demonstrate an understanding of all variables and potential issues in the data
- identify both the *broad question* I would like to address and the specific question I have formulated


# Data Description

There are two datasets for this project: *players.csv* and *sessions.csv*.  

Data for both datasets were collected through an online Minecraft server called PLAICraft where researchers could track players' actions and progress through the game. 

### players.csv

In [2]:
# number of observations

nrow(players_data)

- number of variables: 7
- name and type of variables:
    - experience (chr), subscribe (lgl), hashedEmail (chr), played_hours (dbl), name (chr), gender (chr), Age (dbl)
- what the variables mean:
    - experience: how experienced the player is with the video game. Players are Beginner, Amateur, Regular, Veteran, or Pro
    - subscribe: whether or not players are subscribed to the video game
    - hashedEmail: a hashed (anonymized) version of the email address each player used to play the video game
    - played_hours: length of time spent playing the game, in hours
    - name: first name of players
    - gender: gender of players
    - Age: age of the players in years
- issues you see in the data:
    - Many columns are stored as character vectors (chr), so if we would like to perform **[insert wrangling task]**, we would need to convert them to factors (fct)
- issues you cannot directly see:
    - How was player experience determined? Through surveys, or through analysis of previous video game activity?
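
The character-to-factor conversion mentioned above could look like the following minimal sketch. The `toy_players` tibble and its values are made up for illustration; only the column names come from players.csv. The level order for `experience` simply follows the order in which the categories are listed above, which is an assumption about how they rank.

```r
library(dplyr)

# Made-up rows for illustration; only the column names come from players.csv
toy_players <- tibble(
  experience = c("Pro", "Beginner", "Veteran", "Amateur"),
  gender = c("Male", "Female", "Male", "Prefer not to say"),
  played_hours = c(30.3, 0.0, 3.8, 0.7)
)

# Assumed order of the experience categories (not confirmed by the data source)
experience_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

toy_players_fct <- toy_players |>
  mutate(
    # Explicit levels keep categories that are absent from the rows (here "Regular")
    experience = factor(experience, levels = experience_levels),
    # as.factor() takes the levels from the values present, sorted alphabetically
    gender = as.factor(gender)
  )

levels(toy_players_fct$experience)
levels(toy_players_fct$gender)
```

Setting the levels explicitly matters mostly for plots and tables, where the default alphabetical order would place Amateur before Beginner.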

### sessions.csv


In [3]:
# number of observations

nrow(sessions_data)

- number of variables: 5
- name and type of variables:
    - hashedEmail (chr), start_time (chr), end_time (chr), original_start_time (dbl), original_end_time (dbl)
- what the variables mean:
    - hashedEmail: explained above in players.csv
    - start_time: when the player started playing the game, in the format dd/mm/yyyy [24h clock time]
    - end_time: when the player stopped playing the game, in the format dd/mm/yyyy [24h clock time]
    - original_start_time: when the player started playing the game, as Unix time (milliseconds since 1970-01-01 UTC)
    - original_end_time: when the player stopped playing the game, as Unix time (milliseconds since 1970-01-01 UTC)
- issues you see in the data:
    - not tidy: start_time and end_time each pack the day, month, year, and 24h clock time into a single cell, which violates the rule that each variable has its own column. We should use `separate()` to give each of these its own column, or parse each cell into a single date-time value with a lubridate function such as `dmy_hm()`
- issues you cannot directly see:
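
One possible way to make the time columns usable is sketched below. It assumes the clock part of `start_time` and `end_time` is written as HH:MM and that the times are in UTC; neither is confirmed by the data description. The `toy_sessions` rows are made up for illustration, and only the column names come from sessions.csv.

```r
library(dplyr)
library(lubridate)

# Made-up rows for illustration; only the column names come from sessions.csv
toy_sessions <- tibble(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time = c("30/06/2024 18:24", "17/06/2024 23:46"),
  original_start_time = c(1719771120000, 1718667180000)
)

toy_parsed <- toy_sessions |>
  mutate(
    # Parse the dd/mm/yyyy HH:MM strings into date-time values (UTC is assumed)
    start_dt = dmy_hm(start_time, tz = "UTC"),
    end_dt = dmy_hm(end_time, tz = "UTC"),
    # Session length in minutes
    session_mins = as.numeric(difftime(end_dt, start_dt, units = "mins")),
    # Unix milliseconds to seconds, then to a date-time value
    start_from_unix = as_datetime(original_start_time / 1000, tz = "UTC")
  )

toy_parsed
```

Once the strings are parsed, the day, month, year, and hour can be pulled out with `day()`, `month()`, `year()`, and `hour()` if separate columns are needed.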

# Questions


I would like to address the broad question of which "kinds" of players are most likely to contribute a large amount of data to the research, so that they can be targeted in recruitment efforts. Specifically, I am interested in the question: can player gender be predicted from the number of hours of PLAICraft played in the players.csv dataset? The players.csv data will help me answer this question by supplying data from which I can build a classification model and create visualizations to assess whether there is a relationship between hours played and gender (i.e. which gender tends to spend the most time on the game). I would use the "played_hours" variable as the predictor and the "gender" variable as the class label. **[You may need to describe how you plan to wrangle your data to get it into a form where you can apply one of the predictive methods from this class.]**
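
As a rough sketch of the wrangling this question would need, the code below keeps only the predictor and the label, converts the label to a factor, and summarizes hours by gender. The `toy_players` rows are made up for illustration, and dropping rows with missing values is an assumption about how missing data would be handled, not a decision taken in this report.

```r
library(dplyr)

# Made-up rows for illustration; only the column names come from players.csv
toy_players <- tibble(
  name = c("A", "B", "C", "D", "E", "F"),
  gender = c("Male", "Female", "Male", "Female", "Male", NA),
  played_hours = c(30.3, 0.0, 3.8, 0.7, 0.1, 1.5)
)

model_data <- toy_players |>
  # Keep only the predictor and the class label
  select(played_hours, gender) |>
  # Assumption: rows with a missing predictor or label are dropped
  filter(!is.na(played_hours), !is.na(gender)) |>
  # Classification needs the label as a factor
  mutate(gender = as.factor(gender))

# Quick check of class balance and typical hours per gender
hours_by_gender <- model_data |>
  group_by(gender) |>
  summarize(
    n_players = n(),
    mean_hours = mean(played_hours),
    median_hours = median(played_hours)
  )

hours_by_gender
```

The per-class counts are worth checking before modelling, because a label with very few players in some categories is hard to predict from a single variable.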

# Exploratory Data Analysis and Visualization


In [5]:
#summary statistics of players data

#summary of player experience
players_experience <- players_data |>
    group_by(experience) |>
    summarize(count = n()) |>
    # divide by the total row count rather than hard-coding 196
    mutate(pct_of_players = round(count / sum(count) * 100, 2))
players_experience

#summary of player subscriptions
players_sub <- players_data |>
    group_by(subscribe) |>
    summarize(count = n()) |>
    # divide by the total row count rather than hard-coding 196
    mutate(pct_of_players = round(count / sum(count) * 100, 2))
players_sub

#played hours summary stats
players_hours <- players_data |>
    # summarize() returns a single row, so no select() or slice() is needed
    summarize(max_hours = max(played_hours),
              min_hours = min(played_hours),
              avg_hours = round(mean(played_hours), 2))
players_hours

# summary of player genders
players_gender <- players_data |>
    group_by(gender) |>
    summarize(count = n()) |>
    # divide by the total row count rather than hard-coding 196
    mutate(pct_of_players = round(count / sum(count) * 100, 2))
players_gender

# summary of player ages
players_age <- players_data |>
    # na.rm = TRUE ignores missing ages; summarize() returns a single row
    summarize(max_age = max(Age, na.rm = TRUE),
              min_age = min(Age, na.rm = TRUE),
              avg_age = round(mean(Age, na.rm = TRUE), 2))
players_age

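| experience `<chr>` | count `<int>` | pct_of_players `<dbl>` |
|---|---|---|
| Amateur | 63 | 32.14 |
| Beginner | 35 | 17.86 |
| Pro | 14 | 7.14 |
| Regular | 36 | 18.37 |
| Veteran | 48 | 24.49 |


| subscribe `<lgl>` | count `<int>` | pct_of_players `<dbl>` |
|---|---|---|
| False | 52 | 26.53 |
| True | 144 | 73.47 |


| max_hours `<dbl>` | min_hours `<dbl>` | avg_hours `<dbl>` |
|---|---|---|
| 223.1 | 0 | 5.85 |


| gender `<chr>` | count `<int>` | pct_of_players `<dbl>` |
|---|---|---|
| Agender | 2 | 1.02 |
| Female | 37 | 18.88 |
| Male | 124 | 63.27 |
| ⋮ | ⋮ | ⋮ |
| Other | 1 | 0.51 |
| Prefer not to say | 11 | 5.61 |
| Two-Spirited | 6 | 3.06 |


| max_age `<dbl>` | min_age `<dbl>` | avg_age `<dbl>` |
|---|---|---|
| 58 | 9 | 21.14 |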
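
The summaries above show played_hours ranging from 0 to 223.1 with a mean of only 5.85, which suggests a strong right skew. One possible first visualization for the hours-versus-gender question is a boxplot on a log(1 + x) scale, sketched below. The `toy_players` rows are made up for illustration, and the choice of a boxplot and of the `log1p()` transformation are assumptions, not requirements of the project.

```r
library(dplyr)
library(ggplot2)

# Made-up rows for illustration; only the column names come from players.csv
toy_players <- tibble(
  gender = c("Male", "Female", "Male", "Female", "Male", "Female"),
  played_hours = c(223.1, 0.0, 3.8, 0.7, 0.1, 1.5)
)

hours_plot <- toy_players |>
  # log1p() computes log(1 + x), so players with 0 hours stay on the plot
  ggplot(aes(x = gender, y = log1p(played_hours))) +
  geom_boxplot() +
  labs(
    x = "Gender",
    y = "log(1 + hours played)",
    title = "Hours played by gender (toy data)"
  )

hours_plot
```

On the real data, genders with only a handful of players would give boxes built from very few points, so they should be interpreted with care.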


# Methods and Plan

# GitHub Repository