1) Summary of dataset

- Full descriptive summary of the dataset
- Number of observations
- Summary statistics (to 2 decimal places)
- Number of variables
- Name and type of variables
- What these variables mean
- Any issues with the data
- Any issues related to things we can't directly see
- How the data was collected
- Etc

Should be in tables and bullet point lists, especially for summarizing variables.
Should summarize all variables, not just the ones I need for my question.

| Variable Name | Type | Meaning | Summary Stat |
| ------------- | ---- | ------- | ------------ |
| experience    | chr  | Category of familiarity with Minecraft | See summary below |
| subscribe     | lgl  | True if player is subscribed to mailing list, False if not | 144 True / 52 False |
| hashedEmail   | chr  | Player email, hashed for privacy | 0 repeated emails |
| played_hours  | dbl  | Hours played | Mean: 5.85 hours |
| name          | chr  | Identifying name of player | 0 repeated names |
| gender        | chr  | Gender disclosed by player (including choice to opt-out) | See summary below |
| Age           | dbl  | Age of player | Mean: 21.14 years |

Players: 196 observations. 7 variables.

Sessions: 1535 observations. 5 variables.

Issues:
- Experience and gender should be converted to factors
- Age has some NAs
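
To make "some NAs" concrete, the missing values can be counted before deciding how to handle them. The following is a minimal sketch on a made-up tibble (`toy_players` and its values are hypothetical, not taken from players.csv); the same `summarize()` call should apply to the real `players` data once it is loaded.

```r
library(tidyverse)

# Toy stand-in for the players data; these ages are made up for illustration.
toy_players <- tibble(Age = c(17, NA, 21, NA, 9))

# Count missing and non-missing ages so the issue can be reported precisely.
age_na_summary <- toy_players |>
    summarize(n_missing = sum(is.na(Age)), n_present = sum(!is.na(Age)))

age_na_summary
```

Reporting the count alongside the mean makes it clear how many rows `na.rm = TRUE` silently drops.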

In [3]:
library(tidyverse)
library(repr)
options(repr.matrix.max.rows = 6)
source("cleanup.R")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
“cannot open file 'cleanup.R': No such file or directory”


ERROR: Error in file(filename, "r", encoding = encoding): cannot open the connection


In [34]:
players = read_csv("players.csv")
sessions = read_csv("sessions.csv")

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [47]:
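# Convert the categorical text columns to factors
players = players |>
    mutate(experience = as_factor(experience)) |>
    mutate(gender = as_factor(gender))

# Number of players at each experience level
exp_counts = players |>
    group_by(experience) |>
    summarize(n = n())

exp_counts

# Number of subscribed vs. unsubscribed players
subscribe_counts = players |>
    group_by(subscribe) |>
    summarize(n = n())

subscribe_counts

# Hashed emails that appear more than once (an empty result means no duplicates)
unique_email_count = players |>
    group_by(hashedEmail) |>
    summarize(n = n()) |>
    filter(n > 1)

unique_email_count

# Mean hours played, rounded to 2 decimal places
hours_mean = players |>
    summarize(mean = mean(played_hours)) |>
    pull() |>
    round(2)

hours_mean

# Names that appear more than once (an empty result means no duplicates)
name_overlaps = players |>
    group_by(name) |>
    summarize(n = n()) |>
    filter(n > 1)

name_overlaps

# Number of players in each gender category
gender_counts = players |>
    group_by(gender) |>
    summarize(n = n())

gender_counts

# Mean age, ignoring missing values, rounded to 2 decimal places
age_mean = players |>
    summarize(age_mean = mean(Age, na.rm = TRUE)) |>
    pull() |>
    round(2)

age_mean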

experience,n
<fct>,<int>
Pro,14
Veteran,48
Amateur,63
Regular,36
Beginner,35


subscribe,n
<lgl>,<int>
False,52
True,144


hashedEmail,n
<chr>,<int>


name,n
<chr>,<int>


gender,n
<fct>,<int>
Male,124
Female,37
Non-binary,15
⋮,⋮
Agender,2
Two-Spirited,6
Other,1


2) Question:
The one broad question I plan to address, and then how I've formulated it into a specific question that uses one response variable of interest and one or more explanatory variables.
- It should be stated as a question
- E.g., can foo predict bar in the dataset?
- Describe how the data will help address the question of interest
- Potentially describe how to wrangle the data to get it into the form where I can apply one of the predictive methods we've learned.

3) Exploratory Data Analysis and Visualization

- Load the dataset into R
- Do the minimum necessary wrangling to make the data tidy. Nothing extra!
- Compute the mean value for each quantitative variable in players.csv. Report the mean values in a table format.
- Make a few exploratory visualizations of the data to help understand it
- Use viz best practices (labels, titles, units, etc)
- Explain any insights from the plots that are relevant to my question
- No predictive analysis, just exploration before later modelling.
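
One possible way to compute the mean of every quantitative variable at once and report it as a table is sketched below. It runs on a made-up tibble (`toy_players` and its values are hypothetical stand-ins for players.csv); `across(where(is.numeric), ...)` picks out the numeric columns so they do not have to be listed by hand.

```r
library(tidyverse)

# Toy stand-in for players.csv; values are made up for illustration only.
toy_players <- tibble(
    name = c("a", "b", "c", "d"),
    played_hours = c(0, 1.5, 30, 0.5),
    Age = c(17, 21, NA, 25)
)

# Mean of each numeric column (ignoring NAs), rounded to 2 decimal places,
# then reshaped so that each variable gets its own row in the table.
toy_means <- toy_players |>
    summarize(across(where(is.numeric), \(x) round(mean(x, na.rm = TRUE), 2))) |>
    pivot_longer(everything(), names_to = "variable", values_to = "mean")

toy_means
```

The long format (one row per variable) tends to read better in a report than a single wide row, especially if more numeric columns are added later.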

4) Methods and Plan
- Propose one method to address the question of interest using the selected dataset
- Explain why you chose it
- Don't actually do any modelling or presenting of results, this is just the high-level plan and justification

Must include:
- Why is this method appropriate?
- Which assumptions are required, if any, to apply the method selected?
- What are the potential limitations or weaknesses of the method selected?
- How are you going to compare and select the model?
- How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross validation?
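
As a rough illustration of the data-processing questions above, here is a sketch of a single train/test split followed by cross-validation on the training set. Everything specific in it is an assumption: it uses the `rsample` package (part of tidymodels, which these notes do not name), simulated data, a 75/25 split, 5 folds, and `subscribe` as the response used for stratification. No model is fit.

```r
library(rsample)

set.seed(2024)  # arbitrary seed so the split is reproducible

# Simulated stand-in data; column names mirror players.csv, values are random.
toy_players <- data.frame(
    played_hours = runif(100, 0, 50),
    Age = sample(10:50, 100, replace = TRUE),
    subscribe = factor(sample(c("TRUE", "FALSE"), 100, replace = TRUE))
)

# Split once, before any modelling: about 75% training and 25% testing
# (hypothetical proportions), stratified on the assumed response variable.
toy_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# 5-fold cross-validation on the training set only; the test set stays untouched.
toy_folds <- vfold_cv(toy_train, v = 5, strata = subscribe)

toy_folds
```

Splitting before any tuning or model comparison keeps the test set as an honest final check, and the folds stand in for a separate validation set.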