In [26]:
library(tidyverse)

### 1) Data Description
_Provide a full descriptive summary of the dataset, including information such as the number of observations, summary statistics (report values to 2 decimal places), number of variables, name and type of variables, what the variables mean, any issues you see in the data, any other potential issues related to things you cannot directly see, how the data were collected, etc. Make sure to use bullet point lists or tables to summarize the variables in an easy-to-understand format._

_Note that the selected dataset(s) will probably contain more variables than you need. In fact, exploring how the different variables in the dataset affect your model may be a crucial part of the project. You need to summarize the full data regardless of which variables you may choose to use later on._

The provided dataset for this assignment consists of two csv files, ```players.csv``` and ```sessions.csv```.

As shown below in ```overview```, ```players``` consists of 196 players and 7 variables, and ```sessions``` consists of 1535 sessions and 5 variables. Further details are described in the next few blocks of code & markdown cells:

In [35]:
# reading the files
players <- read_csv('players.csv', show_col_types = FALSE)
sessions <- read_csv('sessions.csv', show_col_types = FALSE)

# quick overview of datasets
overview <- tibble(dataset = c("players.csv", "sessions.csv"),
                  n_rows = c(nrow(players), nrow(sessions)),
                  n_cols = c(ncol(players), ncol(sessions)))
overview

dataset,n_rows,n_cols
<chr>,<int>,<int>
players.csv,196,7
sessions.csv,1535,5


In ```players.csv```, each observation consists of **one unique player**, identified by a ```hashedEmail```. In total, there are 196 unique players in this dataset. Other demographic information per player is included, as described below in ```players_variables```:

In [36]:
## variables in players.csv
players_variables <- tibble(variable = names(players), 
                            type = map_chr(players, ~class(.x)[1]),
                            description = c("Player's experience level (Amateur, Veteran, Pro)",
                                            "The player's subscription status to the newsletter",
                                            "The player's anonymized e-mail",
                                            "The total number of hours played",
                                            "The player's name",
                                            "The player's gender",
                                            "The player's age"))
players_variables

variable,type,description
<chr>,<chr>,<chr>
experience,character,"Player's experience level (Amateur, Veteran, Pro)"
subscribe,logical,The player's subscription status to the newsletter
hashedEmail,character,The player's anonymized e-mail
played_hours,numeric,The total number of hours played
name,character,The player's name
gender,character,The player's gender
Age,numeric,The player's age


Summary statistics for ```played_hours``` and ```Age``` are stored in ```hours_stats``` and ```age_stats```, respectively (see below). From here, I can already identify some potential issues with the data:

For ```played_hours```:
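- ```min_hours``` is 0.00 and ```median_hours``` is only 0.10, so at least half of the players logged almost no playtime, and some appear to have registered without ever playing. These zeros may be genuine, but they could also reflect missing or unrecorded playtime.
- ```max_hours``` is 223.10, while ```mean_hours``` (5.85) is far above ```median_hours``` (0.10). This points to a strongly positively skewed distribution with potential outliers, so ```mean_hours``` is likely inflated by a few heavy players.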

For ```Age```:
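- There seem to be fewer obvious issues overall, although I had to use ```na.rm = TRUE```, which means that some players did not report their age. This might be problematic because it reduces the effective sample size and could introduce bias if the missing ages are not random. Note that ```total_players``` in ```age_stats``` counts all 196 rows, including those with a missing ```Age```.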
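
To quantify the missingness noted above rather than infer it from ```na.rm = TRUE```, the missing values can be counted per column. The following is a minimal sketch on a small made-up tibble (```toy_players``` and its values are hypothetical, not taken from ```players.csv```); the same ```summarise()``` call should work on ```players```.

```r
library(dplyr)

# Hypothetical toy data standing in for players.csv (values are made up)
toy_players <- tibble(
  hashedEmail  = c("a1", "b2", "c3", "d4"),
  played_hours = c(0, 0.1, 223.1, 1.5),
  Age          = c(17, NA, 21, NA)
)

# Count the missing values in every column
missing_counts <- toy_players |>
  summarise(across(everything(), ~ sum(is.na(.x))))

# Number of players who actually contribute to the age statistics
n_age_reported <- sum(!is.na(toy_players$Age))

missing_counts
n_age_reported
```

Reporting the non-missing count alongside the age statistics makes it clear how many players the mean and median are based on.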

In [37]:
## summary statistics for played_hours and Age
hours_stats <- players |>
  summarise(min_hours = round(min(played_hours, na.rm = TRUE), 2),
    max_hours = round(max(played_hours, na.rm = TRUE), 2),
    mean_hours = round(mean(played_hours, na.rm = TRUE), 2),
    median_hours = round(median(played_hours, na.rm = TRUE), 2),
    total_players = n())

age_stats <- players |>
  summarise(min_age = round(min(Age, na.rm = TRUE), 2),
    max_age = round(max(Age, na.rm = TRUE), 2),
    mean_age = round(mean(Age, na.rm = TRUE), 2),
    median_age = round(median(Age, na.rm = TRUE), 2),
    total_players = n())

hours_stats
age_stats

min_hours,max_hours,mean_hours,median_hours,total_players
<dbl>,<dbl>,<dbl>,<dbl>,<int>
0,223.1,5.85,0.1,196


min_age,max_age,mean_age,median_age,total_players
<dbl>,<dbl>,<dbl>,<dbl>,<int>
9,58,21.14,19,196


In ```sessions.csv```, each row is information about **one session**. The identifying ```hashedEmail``` is included per play session, so this information can be mapped back to ```players.csv```. In total, there are 1535 observations in this dataset. There are 5 columns in this dataset, described below in ```sessions_variables```:

In [38]:
sessions_variables <- tibble(variable = names(sessions),
                             type = map_chr(sessions, ~class(.x)[1]),
                             description = c("The player's anonymized e-mail",
                                             "session start time (dd/mm/yyyy hh:mm)",
                                             "session end time (dd/mm/yyyy hh:mm)",
                                             "UNIX start time",
                                             "UNIX end time"))
sessions_variables

variable,type,description
<chr>,<chr>,<chr>
hashedEmail,character,The player's anonymized e-mail
start_time,character,session start time (dd/mm/yyyy hh:mm)
end_time,character,session end time (dd/mm/yyyy hh:mm)
original_start_time,numeric,UNIX start time
original_end_time,numeric,UNIX end time


Once again, summary statistics for ```sessions``` are shown below in ```sessions_stats```. The data were grouped by player (```hashedEmail```) to count the number of sessions per player. Here, I noticed that ```min_sessions``` is 1 and that only 125 of the 196 players appear in ```sessions.csv```, which indicates that players with no recorded sessions (presumably those with 0 played hours) were not included.

Some potential problems with ```sessions.csv``` might include:
- ```min_sessions``` and ```median_sessions``` are both 1, while ```max_sessions``` is 310 and ```mean_sessions``` is 12.28. This indicates an extremely strong positive skew. Similar to ```played_hours``` in ```players.csv```, the mean is inflated by a few very active players.
- This also points to another issue: since at least half of these players have only one session, there are no repeated observations for them, which makes per-player behaviour (such as preferred play times) hard to estimate reliably.
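
One way to check which players are absent from the session data is an anti-join on ```hashedEmail```. The sketch below uses small made-up tibbles (```toy_players``` and ```toy_sessions``` are hypothetical); on the real data, the same calls would be applied to ```players``` and ```sessions```.

```r
library(dplyr)

# Hypothetical toy data (values are made up for illustration)
toy_players <- tibble(
  hashedEmail  = c("a1", "b2", "c3"),
  played_hours = c(0, 2.5, 0.3)
)
toy_sessions <- tibble(
  hashedEmail = c("b2", "b2", "c3")
)

# Keep only the players with no matching row in the sessions table
players_without_sessions <- toy_players |>
  anti_join(toy_sessions, by = "hashedEmail")

# Cross-check: how many of these players also have zero played hours?
players_without_sessions |>
  summarise(n_players = n(),
            n_zero_hours = sum(played_hours == 0))
```

If the two counts match on the real data, that would support the assumption that the missing players are exactly those who never played.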

In [39]:
## number of sessions per player
sessions_info <- sessions |>
    group_by(hashedEmail) |>
    summarise(count = n())

## summary statistics for sessions per player (mean rounded to 2 decimal places)
sessions_stats <- sessions_info |>
    summarise(min_sessions = min(count),
              max_sessions = max(count),
              mean_sessions = round(mean(count), 2),
              median_sessions = median(count),
              total_players = n())

sessions_stats

min_sessions,max_sessions,mean_sessions,median_sessions,total_players
<int>,<int>,<dbl>,<int>,<int>
1,310,12.28,1,125


In [40]:
## histograms for played_hours and Age
hours_plot <- players |>
    ggplot(aes(x = played_hours)) +
    geom_histogram(bins = 50,
                   aes(y = after_stat(count/sum(count) * 100))) +
    labs(title = 'Distribution of Hours Played',
         x = 'Hours Played',
         y = 'Percent Total Players (%)') +
    theme(plot.title = element_text(hjust = 0.5))

age_plot <- players |>
    ggplot(aes(x = Age)) +
    geom_histogram(bins = 50,
                   aes(y = after_stat(count/sum(count) * 100))) +
    labs(title = 'Distribution of Player Age',
         x = 'Age (years)',
         y = 'Percent Total Players (%)') +
    theme(plot.title = element_text(hjust = 0.5))

## display the plots
hours_plot
age_plot

### 2) Questions
_Clearly state one broad question that you will address, and the specific question that you have formulated. Your question should involve one response variable of interest and one or more explanatory variables, and should be stated as a question. One common question format is: “Can [explanatory variable(s)] predict [response variable] in [dataset]?”, but you are free to format your question as you choose so long as it is clear. Describe clearly how the data will help you address the question of interest. You may need to describe how you plan to wrangle your data to get it into a form where you can apply one of the predictive methods from this class._

**Broad Question 1**: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

**Specific Question**: Are players who play more often during certain times of day (mornings/afternoons/evenings) more likely to subscribe to the newsletter?


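The predictor variable will be the proportion of each player's sessions (or hours) that fall in morning, afternoon, and evening time windows, which I will derive from ```start_time``` in ```sessions.csv``` and map back to ```players.csv``` using ```hashedEmail```. This will be a **predictive classification** question because the response variable, ```subscribe```, is binary. I can also include other variables (```Age```, ```experience```) as control variables. This problem fits the logistic regression model we learned in class.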
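
The wrangling described above could look roughly like the sketch below. It is illustrative only: ```toy_sessions``` is made-up data, the ```dd/mm/yyyy hh:mm``` timestamp format is assumed, and the window boundaries (morning 05:00 to 11:59, afternoon 12:00 to 17:59, everything else evening) are my own assumption, not something defined by the dataset.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical toy data standing in for sessions.csv (values are made up)
toy_sessions <- tibble(
  hashedEmail = c("a1", "a1", "a1", "b2", "b2"),
  start_time  = c("30/06/2024 08:15", "01/07/2024 14:30", "02/07/2024 20:05",
                  "30/06/2024 21:40", "03/07/2024 22:10")
)

time_of_day_props <- toy_sessions |>
  # parse the timestamp and extract the hour at which the session started
  mutate(start_hour = hour(dmy_hm(start_time)),
         # assumed windows; hours before 05:00 also fall into "evening"
         window = case_when(
           start_hour >= 5 & start_hour < 12 ~ "morning",
           start_hour >= 12 & start_hour < 18 ~ "afternoon",
           TRUE ~ "evening")) |>
  # number of sessions per player and window
  count(hashedEmail, window) |>
  # convert the counts to within-player proportions
  group_by(hashedEmail) |>
  mutate(prop = n / sum(n)) |>
  ungroup() |>
  select(-n) |>
  # one row per player, one column per window
  pivot_wider(names_from = window, values_from = prop,
              values_fill = 0, names_prefix = "prop_")

time_of_day_props
```

The resulting table has one row per player and could then be joined to ```players``` with ```left_join(..., by = "hashedEmail")``` before modelling. Players without any sessions would get missing proportions and would need to be handled explicitly.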