### Description
This report investigates whether a player's age and hours played can be used to predict whether they signed up for the game's newsletter. The data in players.csv was collected by a computer science research group at UBC, led by Frank Wood, which gathers information on video game play patterns from a Minecraft server the group set up. As players play the game, their actions are tracked. The research group made sure it had adequate resources, such as server hardware and software licenses, to handle the number of players it recruited. Because the raw data in players.csv is not in a tidy format, this report focuses on cleaning the dataset, generating summary statistics, and creating visualizations to analyze trends. These steps provide the insights needed to answer the classification question.

### Question
Can a player’s age and hours played predict whether they subscribe to the game’s newsletter?

### Summary of Variable names
The players.csv dataset has 196 observations and 7 variables. Each observation is one player, and each variable records a piece of information about that player:

| Variable     | Type              | Description                                                     |
| :----------- | :---------------- | :-------------------------------------------------------------- |
| experience   | Categorical (chr) | Player's skill level (Veteran, Pro, Amateur, Regular, Beginner) |
| subscribe    | Boolean (lgl)     | Whether the player has subscribed (TRUE/FALSE)                  |
| hashedEmail  | Categorical (chr) | Hashed email address for privacy                                |
| played_hours | Numeric (dbl)     | Number of hours the player has played                           |
| name         | Categorical (chr) | Player's first name                                             |
| gender       | Categorical (chr) | Gender of the player (Male, Female, Other, Prefer not to say)   |
| Age          | Numeric (dbl)     | Player’s age                                                    |
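
As a rough illustration of the table above, the sketch below declares the seven columns with matching readr column types and checks the result. The two inline rows are made-up placeholder values, not records from players.csv. With the real file, the same `col_types` specification could be passed to `read_csv("players.csv", col_types = player_cols)`, assuming the file sits in the working directory.

```r
library(readr)
library(dplyr)

# Column specification mirroring the variable table (chr, lgl, dbl).
player_cols <- cols(
  experience   = col_character(),
  subscribe    = col_logical(),
  hashedEmail  = col_character(),
  played_hours = col_double(),
  name         = col_character(),
  gender       = col_character(),
  Age          = col_double()
)

# Two made-up rows standing in for players.csv (illustrative values only).
example_csv <- I(
  "experience,subscribe,hashedEmail,played_hours,name,gender,Age
Pro,TRUE,abc123,30.3,Alex,Male,9
Amateur,FALSE,def456,0.1,Sam,Female,21"
)

players <- read_csv(example_csv, col_types = player_cols)

# Inspect column names and types; the output should list 7 columns.
glimpse(players)
```

Declaring the types up front, rather than relying on type guessing, makes a mismatch between the file and the table surface as a parsing problem at load time.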

### Issues
| Issue | Description of Issue | Solution |
| :------| :-------------------- | :-------- |
| Duplicate Entries | Some participants have multiple records | Group the data by hashedEmail to check whether repeated records are distinct play sessions from the same user or duplicated rows from a data entry error |
| Missing Data | Missing values in key columns such as start_time, end_time, experience, subscribe, and played_hours | Either drop rows with missing values using drop_na(), or fill them in by mean imputation with step_impute_mean(all_predictors()), which applies only to numeric predictors |
| Time Format Consistency | The start_time and end_time columns are stored as strings | Convert these strings to a proper datetime type |
| Age Distribution | The age range is wide and not evenly distributed | Create age categories (e.g., 0-18, 19-35, 36-50, etc.) to simplify analysis |
| Gender Representation | Some entries for gender are non-binary or unknown (e.g., Other, Prefer not to say) | Either group these responses into a single category or exclude them |
| Multiple Hashed Emails | The same user may appear under more than one hashedEmail, which could indicate duplicates or data handled incorrectly during hashing | Verify that each hashedEmail corresponds to a unique user |
| Inconsistent Session Lengths | Some played_hours values are unusually low, suggesting that some players logged incomplete sessions | Set a threshold for what counts as a valid session duration and filter out sessions that fall below it |
| Correlation Between Variables | Columns like experience and played_hours might have a strong relationship | Use correlation metrics or scatter plots to visualize relationships |
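
A few of the solutions in the table (duplicate checks, missing values, and age categories) can be sketched with dplyr and tidyr as follows. This is a minimal sketch under stated assumptions, not the report's actual pipeline: the rows are made-up placeholder values rather than records from players.csv, the `51+` bin is an assumed way of closing the "etc." in the age ranges, and the `coalesce()` step is a hand-rolled stand-in for mean imputation rather than a call to `step_impute_mean()`.

```r
library(dplyr)
library(tidyr)

# Made-up rows standing in for players.csv (illustrative values only).
players <- tibble(
  hashedEmail  = c("a1", "a1", "b2", "c3", "d4"),
  subscribe    = c(TRUE, TRUE, FALSE, TRUE, NA),
  played_hours = c(30.3, 30.3, 0.1, NA, 1.5),
  Age          = c(9, 9, 21, 17, 40)
)

# Duplicate entries: count rows per hashedEmail to see which players repeat.
repeat_players <- players |>
  count(hashedEmail, name = "n_rows") |>
  filter(n_rows > 1)

# Drop rows that are exact copies of another row.
players_unique <- distinct(players)

# Missing data, option 1: drop rows with NA in the modelling columns.
players_complete <- players_unique |>
  drop_na(subscribe, played_hours, Age)

# Missing data, option 2: replace missing played_hours with the column mean
# (manual stand-in for mean imputation).
players_imputed <- players_unique |>
  mutate(played_hours = coalesce(played_hours, mean(played_hours, na.rm = TRUE)))

# Age distribution: bin ages into categories; the "51+" bin is an assumption.
players_binned <- players_unique |>
  mutate(age_group = cut(
    Age,
    breaks = c(0, 18, 35, 50, Inf),
    labels = c("0-18", "19-35", "36-50", "51+"),
    include.lowest = TRUE
  ))
```

Dropping rows and imputing are alternatives rather than steps to combine. Dropping keeps only observed values but shrinks the dataset, while imputing keeps every row at the cost of pulling the filled values toward the mean.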


