### Summary of Dataset

#### Players Dataset
This dataset contains 196 observations of players registered by the PLAICraft team, with 9 variables:

| Variable Name | Type | Meaning | Summary Stat |
| ------------- | ---- | ------- | ------------ |
| experience    | chr |Category of familiarity with Minecraft, with 5 options |  |
| subscribe     | lgl  | True if player is subscribed to mailing list, False if not | 144 True/52 False (73.47% True) |
| hashedEmail   | chr  | Player email, hashed for privacy | |
| played_hours  | dbl  | Hours played  | Min: 0<br>Max: 223.1 <br>Mean: 5.85|
| name          | chr  | Identifying name of player | |
| gender        | chr  | Gender disclosed by player (including choice to opt-out) |  |
| Age           | dbl  | Age of player | Min: 9<br> Max: 58 <br> Mean: 21.14 |

hashedEmail and name both have 0 repeated values, so either could function as an identifier, but we'll see in the second dataset that hashedEmail is the identifier shared across both sets. We aren't told whether the name field is a username with enforced uniqueness or a generated name meant to anonymize participants, but given the hashedEmail identifier we won't need the name field for any investigations.

The dataset is already tidy, and NAs are present only in the Age variable, which has 2. Gender should be treated as a factor, as all responses fit into 7 options (which we visualize later in this report).
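As a minimal sketch of these two checks, the base R snippet below counts NAs per column and converts gender to a factor. The data frame is a toy stand-in built inline, and the gender labels are illustrative placeholders rather than the actual 7 options in the data.

```r
# Toy stand-in for the players dataset; values are invented for illustration.
players <- data.frame(
  gender = c("Male", "Female", "Prefer not to say", "Male"),
  Age    = c(17, NA, 21, 25),
  stringsAsFactors = FALSE
)

# Count missing values in every column.
na_counts <- colSums(is.na(players))
print(na_counts)

# Treat gender as a categorical variable and tabulate its levels.
players$gender <- factor(players$gender)
print(table(players$gender))
```

On the real data, the same `colSums(is.na(...))` call is one way to confirm that Age is the only column containing NAs.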

Experience could be treated as a factor with 5 levels, but it also implies an increasing amount of experience that could be represented numerically (i.e., Beginner = 1, Amateur = 2, etc.). The downside is that we can't be sure what scale the researchers actually presented (e.g., is Pro a higher or lower level of experience than Veteran?).
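One way to sketch the numeric encoding in base R is with an ordered factor. The level order below is an assumption, in particular the placement of Veteran below Pro, since the scale shown to participants is not documented. The input vector is toy data.

```r
# Toy values only; the real column comes from the players dataset.
experience <- c("Pro", "Beginner", "Veteran", "Amateur", "Regular", "Beginner")

# Assumed ordering, lowest to highest. Whether Veteran ranks below Pro
# is a guess, because the survey scale is not documented.
experience_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

# An ordered factor keeps the labels; as.integer() exposes the 1-5 rank.
experience_fct <- factor(experience, levels = experience_levels, ordered = TRUE)
experience_num <- as.integer(experience_fct)

print(data.frame(experience, experience_num))
```

Keeping the ordered factor alongside the integer column makes it easy to change the assumed order later by editing `experience_levels` alone.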

#### Sessions Dataset
This dataset contains 1535 observations of individual play sessions recorded by the PLAICraft team, with 5 variables:

| Variable Name       | Type | Meaning | Summary Stat |
| ------------------- | ---- | ------- | ------------ |
| hashedEmail         | chr  | Player email, hashed for privacy |  |
| start_time          | chr  | Start of session in day/month/year and 24 hr format | (All calculated by converting to datetime)<br>Min : 2024-04-06 09:27:00 |
| end_time            | chr  | End of session in same format as start_time | (All calculated by converting to datetime)<br> Max: 2024-09-26 07:39:00 <br> Mean duration between start_time and end_time: 50.86 mins|
| original_start_time | dbl  | Lower precision session start time, expressed in milliseconds from Unix Epoch | |
| original_end_time   | dbl  | Lower precision session end time, in same format as original_start_time |  |

While the original_start_time and original_end_time are numerical variables, they represent Unix Epoch time and calculating summary statistics like mean doesn't make sense. I've included the minimum for start_time and maximum for end_time as this tells us that data was collected between April 6th and Sept 26th, 2024. The start_time and end_time are provided as character type, but should be converted to datetime.
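The conversion to datetime and the duration calculation could look roughly like the base R sketch below. The rows are toy data, and the minute-level day/month/year layout and the UTC time zone are assumptions, since the source does not state them.

```r
# Toy rows in the day/month/year, 24-hour layout described above.
sessions <- data.frame(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time   = c("30/06/2024 18:24", "18/06/2024 00:03"),
  stringsAsFactors = FALSE
)

# Assumed format string and time zone; adjust both to match the real data.
fmt <- "%d/%m/%Y %H:%M"
sessions$start_dt <- as.POSIXct(sessions$start_time, format = fmt, tz = "UTC")
sessions$end_dt   <- as.POSIXct(sessions$end_time, format = fmt, tz = "UTC")

# Session length in minutes, as a plain numeric column.
sessions$duration_mins <- as.numeric(
  difftime(sessions$end_dt, sessions$start_dt, units = "mins")
)

print(sessions[, c("start_dt", "end_dt", "duration_mins")])
```

Once the columns are datetimes, `min()`, `max()`, and `mean(duration_mins, na.rm = TRUE)` give the kind of summary values reported in the table.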

The reason for including start and end times twice, once in datetime format and once as Unix Epoch milliseconds, is unclear. When compared, the two versions don't match exactly, and in some cases original_start_time and original_end_time are the same number. In those cases start_time and end_time show a session of short duration, so my theory is that some precision has been lost in the millisecond Unix Epoch times, which is why they don't match precisely. Since start_time and end_time are more detailed and still distinguish the start of a session from its end, those are the fields I would use if my question related to session information.
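To illustrate how this kind of precision loss could arise, the sketch below rounds two nearby millisecond timestamps to 6 significant digits. The 6-digit figure is purely hypothetical and is not confirmed by the source; at that precision a 13-digit millisecond value moves in steps of roughly 2.8 hours, so a short session can collapse to a single number.

```r
# Two toy timestamps 12 minutes apart (the time zone is an assumption).
start_dt <- as.POSIXct("2024-06-30 18:12:00", tz = "UTC")
end_dt   <- as.POSIXct("2024-06-30 18:24:00", tz = "UTC")

# Milliseconds since the Unix Epoch at full precision.
start_ms <- as.numeric(start_dt) * 1000
end_ms   <- as.numeric(end_dt) * 1000

# Hypothetical precision loss: keep only 6 significant digits.
start_ms_rounded <- signif(start_ms, 6)
end_ms_rounded   <- signif(end_ms, 6)

print(end_ms - start_ms)                   # true gap in milliseconds
print(start_ms_rounded == end_ms_rounded)  # TRUE when both collapse to one value

# Converting the rounded value back shows the drift from the real start time.
print(as.POSIXct(start_ms_rounded / 1000, origin = "1970-01-01", tz = "UTC"))
```

If the real columns behave this way, the rounded values cannot recover session durations, which supports relying on start_time and end_time instead.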

hashedEmail acts as the identifier and can be matched with the data in the players dataset. Grouping by hashedEmail showed that there are 125 unique hashed emails in the set, with a mean of 12.28 sessions per player. This is notable since 125 is lower than the 196 entries in the players dataset, but as we will see in the visualization, a significant number of entries in the players dataset have 0 played hours and so would not be present in the sessions dataset.
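The counts above can be reproduced with a grouping step and a check for players who never appear in the sessions data. The sketch below uses base R on toy data; the hash strings are shortened placeholders, not real values.

```r
# Toy data; hashes are shortened placeholders, not real values.
players <- data.frame(
  hashedEmail  = c("aaa", "bbb", "ccc", "ddd"),
  played_hours = c(3.2, 0.0, 1.5, 0.0),
  stringsAsFactors = FALSE
)
sessions <- data.frame(
  hashedEmail = c("aaa", "aaa", "aaa", "ccc"),
  stringsAsFactors = FALSE
)

# Sessions per player: one entry per unique hashedEmail.
sessions_per_player <- as.vector(table(sessions$hashedEmail))
n_players_with_sessions <- length(sessions_per_player)
mean_sessions <- mean(sessions_per_player)

# Players in the players dataset with no recorded session.
no_sessions <- players[!players$hashedEmail %in% sessions$hashedEmail, ]

print(n_players_with_sessions)
print(mean_sessions)
print(no_sessions)
```

On the real data the mean is simply total sessions divided by unique players, 1535 / 125 = 12.28, and `no_sessions` is where the players with 0 played hours would be expected to show up.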

The only NAs present in the set are from two observations that have NA in both end_time and original_end_time. The start times of these two observations are different, so it's not clear why these two observations are missing end times.

### Question of Interest

Broad question: We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

Specific: Can the age and experience of a player be used to predict the sum total of hours played across multiple sessions of PLAICraft?

Since the research team is looking for 'kinds' of players, we include Age as demographic information. I decided against including gender because most gender categories have very few observations, so we don't have sufficient training data to use it as a predictor. Experience is the one additional detail for 'type' of player, and I plan to convert it from a factor with 5 levels into a linear scale of 1-5 in the order Beginner, Amateur, Regular, Veteran, and Pro.

The researchers are asking to identify predictors for players who contribute a "large amount" of data, which could be framed as a classification problem, but I plan to treat it as a regression problem where we predict the numerical value of played_hours for a given type of player. This gives the researchers the ability to define their own cutoff for the minimum number of hours considered 'large'.
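To make the regression-then-cutoff idea concrete, here is a minimal sketch in base R. Linear regression is used only as a stand-in model, not as a choice made in this report. The training rows, the `experience_num` column name, and the cutoff of 5 hours are all hypothetical.

```r
# Toy training data; the values are invented for illustration only.
train <- data.frame(
  Age            = c(17, 21, 25, 19, 30, 22),
  experience_num = c(1, 3, 2, 5, 4, 2),
  played_hours   = c(0.5, 2.0, 0.0, 12.3, 1.1, 0.7)
)

# Linear regression is a stand-in model, not a choice made in this report.
fit <- lm(played_hours ~ Age + experience_num, data = train)

# Predict hours for two hypothetical player profiles.
new_players <- data.frame(Age = c(18, 27), experience_num = c(4, 1))
new_players$predicted_hours <- predict(fit, newdata = new_players)

# Hypothetical cutoff; the researchers would pick their own.
large_cutoff <- 5
new_players$large_contributor <- new_players$predicted_hours >= large_cutoff

print(new_players)
```

Because the cutoff is applied after prediction, changing `large_cutoff` relabels players without refitting the model.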

### Exploratory Data Analysis and Visualization

### Methods and Plan