In [None]:
library(tidyverse)

In [None]:
# Load the player-level and session-level data
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

The `players` dataset contains **196 observations**, each representing an individual who participated in a study involving an online Minecraft server. There are **7 variables** in total.

### List of Variables

| Variable        | Description |
|----------------|-------------|
| `experience`    | Self-reported skill level in Minecraft. It is either: Beginner, Amateur, Regular, Pro, or Veteran. |
| `subscribe`     | Whether the given participant subscribed to a gaming-related newsletter. Either: Yes or No. |
| `hashedEmail`   | A hashed (anonymized) version of the participant's email. Each email is unique. |
| `played_hours`  | The total number of hours of Minecraft played during data collection. |
| `name`          | The participant’s name. |
| `gender`        | The participant’s gender. |
| `Age`           | The participant’s age in years. |

The `sessions` dataset contains the following variables:

| Variable              | Description |
|-----------------------|-------------|
| `hashedEmail`         | A hashed (anonymized) version of the participant's email. It links each session to a player in `players`. |
| `start_time`          | The session start time, formatted as `dd/mm/yyyy hh:mm`. |
| `end_time`            | The session end time, formatted as `dd/mm/yyyy hh:mm`. |
| `original_start_time` | The raw numeric timestamp representing the start time (POSIX-style format). |
| `original_end_time`   | The raw numeric timestamp representing the end time. |
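
Because `start_time` and `end_time` are stored as `dd/mm/yyyy hh:mm` text, they need to be parsed before session lengths can be computed. The following is a minimal sketch on a toy tibble with made-up values; `sessions_toy` and `session_hrs` are hypothetical names, and the time zone is assumed to be UTC (the default for `dmy_hm()`).

```r
library(tidyverse)
library(lubridate)  # loaded explicitly in case the installed tidyverse does not attach it

# Toy rows with made-up values that mimic the documented text format
sessions_toy <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time  = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
  end_time    = c("30/06/2024 18:24", "18/06/2024 00:03", "25/07/2024 17:57")
)

sessions_parsed <- sessions_toy %>%
  mutate(
    # dmy_hm() parses day/month/year hour:minute strings into date-times
    start_time  = dmy_hm(start_time),
    end_time    = dmy_hm(end_time),
    # Session length in hours (hypothetical helper column)
    session_hrs = as.numeric(difftime(end_time, start_time, units = "hours"))
  )

print(sessions_parsed)
```

Parsing to date-times rather than subtracting strings also handles sessions that cross midnight, as in the second toy row.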

### Data Collection Process

The data was collected from a free public Minecraft server operated by researchers at the University of British Columbia. The goal of this data collection was to gather behavioral and demographic information about players to better support the development of an AI capable of playing Minecraft.

### Data Quality

Overall, the dataset is of high quality. However, variable naming is inconsistent:
- Some variables use `snake_case` (e.g., `played_hours`), others use `camelCase` (e.g., `hashedEmail`), and some are inconsistently capitalized (e.g., `Age`).
- These issues are minor and do not affect the usability of the dataset.

### Question

**Which Has a Greater Effect on a Player Subscribing to a Game Newsletter: Hours Played or Total Experience?**

We examine this by modeling the relationship between the **response variable**:

- `subscribe` (Yes/No)

and the **explanatory variables**:

- `experience` (categorical)
- `total_session_time_hrs` (numeric)
- `average_session_time_hrs` (numeric)
- `num_of_sessions` (numeric) 
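
The three numeric explanatory variables are not columns of the raw files, so they have to be derived by summarizing sessions per player. The sketch below shows one way this could be done, assuming each session's length in hours is already available in a hypothetical `session_hrs` column; the toy values are made up.

```r
library(tidyverse)

# Toy per-session lengths (made-up values); session_hrs is a hypothetical column
session_lengths <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  session_hrs = c(0.5, 1.5, 0.25)
)

# One row per player with the three derived explanatory variables
session_summary <- session_lengths %>%
  group_by(hashedEmail) %>%
  summarise(
    total_session_time_hrs   = sum(session_hrs, na.rm = TRUE),
    average_session_time_hrs = mean(session_hrs, na.rm = TRUE),
    num_of_sessions          = n(),
    .groups = "drop"
  )

print(session_summary)
```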

### Exploratory Data Analysis and Visualization

In this section, we prepare the `players` and `sessions` datasets for exploratory analysis:

* Load the datasets into R to ensure they can be accessed and manipulated.
* Perform only the minimum wrangling needed to produce a tidy dataset; further processing is deferred to later sections.
* Merge `sessions` with `players` using `hashedEmail` to combine demographic and gameplay information.
* Select only the relevant columns: `experience`, `subscribe`, `Age`, `total_session_time_hrs`, `average_session_time_hrs`, and `num_of_sessions`.
* Convert `experience` to an ordered factor representing increasing proficiency: Beginner < Amateur < Regular < Pro < Veteran.
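
The merge, column selection, and factor conversion listed above can be sketched as follows. This uses toy tibbles with made-up values; `players_toy`, `session_summary_toy`, and `tidy_players` are hypothetical names. The choice of `inner_join()` is an assumption: it drops players with no recorded sessions, whereas `left_join()` would keep them with missing session values.

```r
library(tidyverse)

# Toy player-level data (made-up values)
players_toy <- tibble(
  experience  = c("Pro", "Beginner", "Veteran"),
  subscribe   = c("Yes", "No", "Yes"),
  hashedEmail = c("a1", "b2", "c3"),
  Age         = c(21, 17, 30)
)

# Toy per-player session summaries (made-up values)
session_summary_toy <- tibble(
  hashedEmail              = c("a1", "b2"),
  total_session_time_hrs   = c(2, 0.25),
  average_session_time_hrs = c(1, 0.25),
  num_of_sessions          = c(2L, 1L)
)

tidy_players <- players_toy %>%
  # Assumption: keep only players that appear in both tables
  inner_join(session_summary_toy, by = "hashedEmail") %>%
  select(experience, subscribe, Age,
         total_session_time_hrs, average_session_time_hrs, num_of_sessions) %>%
  # Ordered factor so that comparisons follow increasing proficiency
  mutate(experience = factor(
    experience,
    levels  = c("Beginner", "Amateur", "Regular", "Pro", "Veteran"),
    ordered = TRUE
  ))

print(tidy_players)
```

Listing all five levels explicitly keeps the ordering stable even when a level is absent from the data.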

Next steps for exploration:

* Compute the mean value for each quantitative variable (`Age`, `total_session_time_hrs`, `average_session_time_hrs`, `num_of_sessions`) and present the results in a table.
* Create a few visualizations to explore relationships between these variables and the subscription status (`subscribe`).
* Follow visualization best practices: include clear labels, titles, and units of measurement.
* Summarize insights from the plots that are relevant to understanding how player engagement and experience may relate to newsletter subscription.
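
As a rough illustration of the first two exploration steps, the sketch below computes the means of the quantitative variables and draws one labelled plot. It runs on a toy tibble with made-up values, so the numbers carry no meaning; `tidy_toy`, `mean_table`, and `hours_plot` are hypothetical names, and a boxplot is only one of several reasonable plot choices.

```r
library(tidyverse)

# Toy tidy data (made-up values), including one missing age
tidy_toy <- tibble(
  subscribe                = c("Yes", "No", "Yes", "No"),
  Age                      = c(21, 17, 30, NA),
  total_session_time_hrs   = c(2, 0.25, 10, 1),
  average_session_time_hrs = c(1, 0.25, 2.5, 0.5),
  num_of_sessions          = c(2, 1, 4, 2)
)

# Mean of each quantitative variable, ignoring missing values
mean_table <- tidy_toy %>%
  summarise(across(
    c(Age, total_session_time_hrs, average_session_time_hrs, num_of_sessions),
    ~ mean(.x, na.rm = TRUE)
  ))

print(mean_table)

# Distribution of total session time by subscription status, with units on the axis
hours_plot <- ggplot(tidy_toy, aes(x = subscribe, y = total_session_time_hrs)) +
  geom_boxplot() +
  labs(
    x     = "Subscribed to newsletter",
    y     = "Total session time (hours)",
    title = "Total session time by subscription status"
  )
```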