## (1) Data Description

### Overview

#### `players.csv`
- **Number of observations:** 196  
- **Number of variables:** 7  
- **Each row represents:** One unique player and their demographic and experience data.

#### `sessions.csv`
- **Number of observations:** 1,535  
- **Number of variables:** 5  
- **Each row represents:** A single recorded game session for a player.

---

### Variable Summary

#### `players.csv`

| Variable | Type | Description | Summary Statistics | Issues / Notes |
|-----------|------|--------------|--------------------|----------------|
| `experience` | Categorical | Player’s self-reported skill level (e.g., Amateur, Expert) | 5 unique levels (most common: *Amateur*, 63 players) | Subjective scale; not numerical |
| `subscribe` | Boolean | Whether the player subscribed to the game newsletter (True/False) | 144 subscribed (73.5%) | Imbalanced classes; stored as text, so it may need conversion to a logical type |
| `hashedEmail` | String | Unique hashed email identifier | 196 unique values | Used as join key with `sessions` |
| `played_hours` | Numeric | Total hours spent playing | Mean = **5.85**, Min = **0.50**, Max = **12.00** | Outliers possible; may not capture inactive time |
| `name` | String | Player’s display name | 196 unique names | Not relevant for modeling |
| `gender` | Categorical | Player gender (7 categories) | Most common: *Male* (124 players) | Imbalanced representation |
| `Age` | Numeric | Age in years | Mean = **21.14**, Min = **14**, Max = **38** | 2 missing values; potential data entry errors |

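The imbalance in `subscribe` sets a baseline worth keeping in mind: a rule that always predicts the majority class is already correct for every subscribed player. A small Python sketch of that arithmetic, using only the counts reported in the table above (the variable names are illustrative):

```python
# Counts taken from the players.csv summary table.
n_players = 196
n_subscribed = 144

# Accuracy of a rule that always predicts "subscribed".
majority_baseline = n_subscribed / n_players
print(f"{majority_baseline:.1%}")  # 73.5%
```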
---

#### `sessions.csv`

| Variable | Type | Description | Summary Statistics | Issues / Notes |
|-----------|------|--------------|--------------------|----------------|
| `hashedEmail` | String | Foreign key linking to `players.csv` | 125 unique player IDs | Only 125 of the 196 players have any recorded session |
| `start_time` | String (datetime) | Session start timestamp | 1,535 total, 1,504 unique | Stored as text; requires datetime parsing |
| `end_time` | String (datetime) | Session end timestamp | 1,533 non-missing | 2 missing end times; duration cannot be computed for those sessions |
| `original_start_time` | Numeric (timestamp) | Original start time as a Unix epoch timestamp (milliseconds, judging by the magnitude) | Mean = **1.7192 × 10¹²** | Needs conversion to a readable date |
| `original_end_time` | Numeric (timestamp) | Original end time as a Unix epoch timestamp (milliseconds, judging by the magnitude) | Mean = **1.7192 × 10¹²** | 2 missing values; possible precision loss from rounding |

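The two timestamp representations can be reconciled by parsing both into datetime objects. The sketch below is a minimal Python illustration. It assumes the text columns use a `DD/MM/YYYY HH:MM` layout and that the numeric columns are milliseconds since the Unix epoch (consistent with values near 1.72 × 10¹²). Both assumptions, the helper names, and the sample values are hypothetical and should be checked against the raw file.

```python
from datetime import datetime, timezone

# Assumed layout of the text timestamps; adjust after inspecting the raw file.
TEXT_FORMAT = "%d/%m/%Y %H:%M"

def parse_text_time(value):
    """Parse a text timestamp; return None for missing values."""
    if not value:
        return None
    return datetime.strptime(value, TEXT_FORMAT)

def parse_epoch_ms(value):
    """Convert epoch milliseconds (assumed unit) to a UTC datetime."""
    if value is None:
        return None
    return datetime.fromtimestamp(value / 1000, tz=timezone.utc)

# Hypothetical sample values, not taken from the dataset.
start = parse_text_time("30/06/2024 18:12")
end = parse_text_time("30/06/2024 18:24")
duration_minutes = (end - start).total_seconds() / 60
print(duration_minutes)  # 12.0
print(parse_epoch_ms(1_719_770_000_000).year)  # 2024
```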
---

### Observations and Issues

- **Missingness:**  
  - 2 missing `Age` values in `players`.  
  - 2 missing `end_time` and `original_end_time` entries in `sessions`.

- **Data linkage:**  
  - Only 125 of 196 players appear in the `sessions` dataset, suggesting some inactive users or unrecorded sessions.

- **Data quality issues:**  
  - Timestamps stored inconsistently (string and numeric formats).  
  - Potential mismatches between session logs and player data.  
  - Categorical imbalance (e.g., gender, experience levels).  
  - Manual self-reporting bias for `experience`.

- **Unseen issues (possible):**  
  - Sampling bias (active users only).  
  - Time-zone inconsistencies in session logs.  
  - Measurement inaccuracies for `played_hours` or inactive sessions not tracked.

- **Data collection context:**  
  Data likely originated from an online gaming analytics platform. Player profiles (`players.csv`), which appear to be at least partly self-reported, sit alongside gameplay logs (`sessions.csv`), which were presumably collected automatically by backend telemetry that records login times and play durations.

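The linkage and missingness checks listed above can be made concrete with a few set and count operations. The following Python sketch uses tiny made-up records (the IDs and values are hypothetical, not drawn from the data) to show how players without sessions and sessions without an end time could be counted.

```python
# Hypothetical miniature versions of the two tables.
players = [
    {"hashedEmail": "a1", "Age": 21},
    {"hashedEmail": "b2", "Age": None},  # missing age
    {"hashedEmail": "c3", "Age": 17},
]
sessions = [
    {"hashedEmail": "a1", "start_time": "01/05/2024 10:00", "end_time": "01/05/2024 10:30"},
    {"hashedEmail": "a1", "start_time": "02/05/2024 09:00", "end_time": None},  # missing end
    {"hashedEmail": "c3", "start_time": "03/05/2024 20:00", "end_time": "03/05/2024 21:00"},
]

player_ids = {p["hashedEmail"] for p in players}
session_ids = {s["hashedEmail"] for s in sessions}

# Players who never appear in the session log, and session IDs with no player row.
players_without_sessions = player_ids - session_ids
orphan_sessions = session_ids - player_ids

# Simple missing-value counts.
missing_age = sum(p["Age"] is None for p in players)
missing_end = sum(s["end_time"] is None for s in sessions)

print(sorted(players_without_sessions), sorted(orphan_sessions), missing_age, missing_end)
```

On the full data, and provided every session ID matches a player, the first set would be expected to hold the 71 players (196 − 125) with no recorded session.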

## (2) Question

### Broad Question
What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

### Specific Question
Can a player’s **age**, **gender**, **experience level**, and **total playtime** predict whether they **subscribe** to the game newsletter?

### Variables
- **Response variable:**  
  - `subscribe` — whether the player subscribed to the newsletter (True/False)

- **Explanatory variables:**  
  - `Age` — player’s age in years  
  - `gender` — categorical variable describing player gender  
  - `experience` — player’s self-reported skill level (categorical)  
  - `played_hours` — total hours spent playing  

### How the Data Will Be Used
This project will use data from `players.csv`, which contains all demographic and experience-related variables.  
To better capture behavioural aspects, I will also use `sessions.csv` to compute additional summaries for each player:
- total number of sessions recorded  
- average session duration (from start and end times)

These values will then be joined to the `players` data using the `hashedEmail` variable.

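One way to build these per-player summaries is to group sessions by `hashedEmail` and then attach the results to the player rows. The Python sketch below assumes a `DD/MM/YYYY HH:MM` text format and uses made-up records. The names `n_sessions` and `mean_session_minutes` are illustrative choices, not columns in the data.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

FMT = "%d/%m/%Y %H:%M"  # assumed timestamp layout

# Hypothetical records for illustration only.
players = [
    {"hashedEmail": "a1", "subscribe": True},
    {"hashedEmail": "b2", "subscribe": False},
]
sessions = [
    {"hashedEmail": "a1", "start_time": "01/05/2024 10:00", "end_time": "01/05/2024 10:30"},
    {"hashedEmail": "a1", "start_time": "02/05/2024 09:00", "end_time": "02/05/2024 09:10"},
    {"hashedEmail": "a1", "start_time": "03/05/2024 09:00", "end_time": ""},  # no end time
]

# Count sessions per player and collect durations (minutes) for complete sessions.
counts = defaultdict(int)
durations = defaultdict(list)
for s in sessions:
    counts[s["hashedEmail"]] += 1
    if s["start_time"] and s["end_time"]:
        start = datetime.strptime(s["start_time"], FMT)
        end = datetime.strptime(s["end_time"], FMT)
        durations[s["hashedEmail"]].append((end - start).total_seconds() / 60)

# Left join: every player is kept, even those with no sessions.
joined = []
for p in players:
    key = p["hashedEmail"]
    player_durations = durations.get(key, [])
    joined.append({
        **p,
        "n_sessions": counts.get(key, 0),
        "mean_session_minutes": mean(player_durations) if player_durations else None,
    })

for row in joined:
    print(row)
```

Summarising before joining keeps the result at one row per player. Joining the raw session rows first would repeat each player once per session.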
### Data Wrangling Plan
1. **Convert data types:** Parse `start_time` and `end_time` as datetimes, ensure `subscribe` is a logical (TRUE/FALSE) variable, and convert categorical predictors with `factor()`.  
2. **Clean data:** Remove missing or invalid values (e.g., missing ages or incomplete session times).  
3. **Feature creation:** Summarise `sessions.csv` per player to get the total number of sessions and the average session duration.  
4. **Join datasets:** Join these per-player summaries to `players.csv` by `hashedEmail`.  
5. **Tidy data:** Confirm the result has one row per player, ready for analysis.

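The cleaning and type-conversion steps can be sketched as follows. This is an illustrative Python version (in R the type step would use `as.logical()` and `factor()`). It assumes `subscribe` arrives as the text `TRUE`/`FALSE` and that a missing age appears as an empty string or `NA`; both are guesses about the raw file, and the rows are made up.

```python
# Hypothetical raw rows as they might be read by csv.DictReader (all text).
raw_rows = [
    {"hashedEmail": "a1", "subscribe": "TRUE", "Age": "21", "experience": "Amateur"},
    {"hashedEmail": "b2", "subscribe": "FALSE", "Age": "NA", "experience": "Expert"},
    {"hashedEmail": "c3", "subscribe": "TRUE", "Age": "17", "experience": "Expert"},
]

MISSING = {"", "NA"}  # assumed missing-value markers

def clean_row(row):
    """Return a typed copy of the row, or None if a required value is missing."""
    if row["Age"] in MISSING or row["subscribe"] in MISSING:
        return None
    return {
        "hashedEmail": row["hashedEmail"],
        "subscribe": row["subscribe"].strip().upper() == "TRUE",
        "Age": float(row["Age"]),
        "experience": row["experience"],
    }

cleaned = [r for r in map(clean_row, raw_rows) if r is not None]

# Record the category levels once, similar in spirit to a factor's levels.
experience_levels = sorted({r["experience"] for r in cleaned})

print(len(cleaned), experience_levels)
```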
### Why This Question Fits the Data and Methods
- The response variable (`subscribe`) is binary (yes/no), which makes it suitable for a **classification model** such as logistic regression or k-nearest neighbours.
- The explanatory variables include both numeric and categorical data, which allows for simple comparisons and visualization using bar graphs and scatterplots.
- The relationship between player behaviour (e.g., `played_hours`) and subscription status can be explored visually before modeling.

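Before any model is fitted, a simple tabulation of the subscription rate within each category gives the numbers a bar graph would display. A minimal Python sketch with made-up rows (the helper name `subscription_rate_by` is hypothetical):

```python
from collections import defaultdict

# Hypothetical cleaned rows; values are illustrative only.
rows = [
    {"experience": "Amateur", "subscribe": True},
    {"experience": "Amateur", "subscribe": False},
    {"experience": "Expert", "subscribe": True},
    {"experience": "Expert", "subscribe": True},
]

def subscription_rate_by(rows, column):
    """Return {category: share of rows with subscribe == True}."""
    totals = defaultdict(int)
    subscribed = defaultdict(int)
    for r in rows:
        totals[r[column]] += 1
        subscribed[r[column]] += int(r["subscribe"])
    return {k: subscribed[k] / totals[k] for k in totals}

rates = subscription_rate_by(rows, "experience")
print(rates)
```

The same helper could be applied to `gender`, while numeric predictors such as `played_hours` would need binning or a scatterplot instead.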
### Expected Insights
Exploring this question will help identify which types of players are more likely to subscribe to the newsletter.  
This can provide useful information about player engagement and help guide marketing or community outreach efforts for the game.
