## **(1) Data Description**
The dataset `players.csv` contains 196 observations and 9 variables describing online players’ demographics, experience, and gaming behavior.

| Variable | Type | Meaning |
|-----------|------|---------|
| `experience` | Categorical | Player skill level (e.g., Beginner, Amateur, Pro, Veteran, Expert). |
| `subscribe` | Boolean | Whether the player has a premium subscription. |
| `hashedEmail` | String | Unique hashed identifier. |
| `played_hours` | Numeric | Total hours spent playing. |
| `name` | String | Player's name. |
| `gender` | Categorical | Player's gender. |
| `age` | Numeric | Player age. |
| `individualId` | Numeric | No data (all values missing). |
| `organizationName` | String | No data (all values missing). |

The dataset is complete for key features, but two columns (`individualId`, `organizationName`) have no data and will be dropped. `played_hours` is highly skewed, and subscription counts are imbalanced (144 True vs 52 False).

## **(2) Question**  
Can we predict whether a player will subscribe to a premium service based on age, gender, experience, and total hours played?

**Response variable:** `subscribe`  
**Predictors:** `played_hours`, `experience`, `age`, `gender`

The data allow us to analyze how engagement (hours played) and demographics affect subscription behaviour. To prepare the data, I will remove unused columns, convert the categorical variables to proper types, and normalize the numeric ones.

### How I Would Wrangle the Data
I will drop empty or unnecessary columns (`name`, `hashedEmail`, `individualId`, `organizationName`) and convert the categorical columns (`experience`, `gender`) to a categorical type.


## **(3) Exploratory Data Analysis and Visualization**
#### 1. Histogram: Player Playtime
A histogram of `played_hours` shows that most players have very low playtime (under 5 hours),  
while a few have extremely high values.  
This shows a right-skewed distribution, meaning a small group of players are very active.

#### 2. Bar Chart: Experience and Subscription
A bar chart of `experience` vs `subscribe` shows that players with higher experience levels  
(e.g., Pro or Veteran) tend to subscribe more often.  
Less experienced players are less likely to pay for premium features.
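
The proportions behind a bar chart like this can also be tabulated directly. The following is a minimal sketch using a pandas `groupby`; the small `toy` data frame is made up purely for illustration and does not reflect the real `players.csv` values.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from players.csv; it only illustrates the call.
toy = pd.DataFrame({
    "experience": ["Beginner", "Beginner", "Pro", "Pro", "Veteran", "Veteran"],
    "subscribe": [False, True, True, True, True, False],
})

# The mean of a boolean column is the share of True values,
# so this gives the subscription rate within each experience level.
rates = toy.groupby("experience")["subscribe"].mean()
print(rates)
```

Comparing rates rather than raw counts keeps experience levels with few players from looking unimportant just because their bars are short.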

#### 3. Histogram: Age Distribution
A histogram of `age` shows most players are between 15 and 30 years old.  
There are a few older players, but very few children under 10.  
This helps us understand which age group dominates the player base.

#### 4. Scatter Plot: Playtime vs Age
A scatter plot of `played_hours` vs `age`, colored by `subscribe`, shows that players who play longer are more likely to subscribe.  
However, there is no clear pattern between age and playtime; engagement varies across ages.

#### 5. Facet Plot: Playtime by Gender
Faceting the playtime plot by `gender` shows that male and female players follow a similar pattern.
The facets also make it possible to compare which gender tends to play longer.

Taken together, these plots suggest that subscribers may play more hours on average and tend to have higher experience levels.

```python
import pandas as pd

# Load the dataset
df = pd.read_csv("dsci-100-group23/data/players.csv")

# Remove columns we don't need or that are empty
df = df.drop(columns=["individualId", "organizationName", "hashedEmail", "name"], errors="ignore")

# Make sure categorical columns have the right type
df["experience"] = df["experience"].astype("category")
df["gender"] = df["gender"].astype("category")

# Drop any rows that are missing key information
df = df.dropna(subset=["experience", "gender", "age", "played_hours", "subscribe"])

# Show the first few rows to confirm the data looks tidy
df.head()
```



## **(4) Methods and Plan**

### Proposed Method
- I will use the K-Nearest Neighbours (KNN) algorithm to predict whether a player will subscribe to a premium service based on their age, gender, experience, and total hours played.


### Why This Method Is Appropriate
- KNN is a simple method that makes predictions based on similarity between players.
- It’s easy to understand and interpret visually.


### Assumptions
- Data points that are close together in feature space have similar outcomes.  
- The distance metric (usually Euclidean) accurately reflects similarity.  
- Features should be normalized so no variable dominates.
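
To see why normalization matters for Euclidean distance, here is a small sketch in plain Python. The three players and their values are hypothetical, chosen only to make the effect visible, and `zscore` is an illustrative helper rather than part of the analysis code.

```python
import math
import statistics

# Hypothetical (age, played_hours) pairs, made up for illustration only.
players = [(17.0, 0.5), (40.0, 1.0), (18.0, 60.0)]

def euclidean(a, b):
    """Straight-line distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def zscore(rows):
    """Standardize each column to mean 0 and (population) standard deviation 1."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds)) for row in rows]

raw_d01 = euclidean(players[0], players[1])  # players differ mainly in age
raw_d02 = euclidean(players[0], players[2])  # players differ mainly in played_hours

scaled = zscore(players)
scaled_d01 = euclidean(scaled[0], scaled[1])
scaled_d02 = euclidean(scaled[0], scaled[2])

print(f"raw:    d(0,1)={raw_d01:.2f}  d(0,2)={raw_d02:.2f}")
print(f"scaled: d(0,1)={scaled_d01:.2f}  d(0,2)={scaled_d02:.2f}")
```

On the raw values the gap in `played_hours` makes the third player look far more distant than the second; after standardizing, the two gaps are of comparable size, so neither feature decides the neighbours on its own.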


### Limitations
- KNN can be sensitive to noisy data and outliers.  
- The choice of k (number of neighbours) strongly affects accuracy.

### Model Comparison and Selection
- Different values of k will be tested (e.g., k = 3, 5, 7, ...) using cross-validation to find the best-performing model.
- The final model will be evaluated on accuracy, precision, and recall, and compared against a baseline model.
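
As a reference point for these metrics, a majority-class baseline can be worked out from the class counts reported in the data description (144 subscribers, 52 non-subscribers). The sketch below assumes the baseline always predicts `True`; that choice of baseline is an assumption, and `precision_recall` is an illustrative helper, not a library function.

```python
# Class counts taken from the data description (144 True vs 52 False).
n_true, n_false = 144, 52
n_total = n_true + n_false

# A majority-class baseline predicts subscribe=True for every player,
# so it is correct exactly for the true subscribers.
baseline_accuracy = n_true / n_total

def precision_recall(tp, fp, fn):
    """Precision and recall for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# For the always-True baseline: every subscriber is a true positive,
# every non-subscriber is a false positive, and there are no false negatives.
baseline_precision, baseline_recall = precision_recall(tp=n_true, fp=n_false, fn=0)

print(f"accuracy={baseline_accuracy:.3f}, "
      f"precision={baseline_precision:.3f}, recall={baseline_recall:.3f}")
```

Because of the class imbalance, always guessing "subscribed" is already right for roughly 73% of the full data set, so a KNN model is only informative if it clearly beats that figure.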


### Data Processing Plan
- Split the data into 80% training and 20% testing sets.
- Normalize numeric features (`age`, `played_hours`) so all variables are on the same scale.
- Use 5-fold cross-validation on the training data to tune `k`.
- Evaluate the final model on the test data.
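
The plan above could be sketched as follows. This is one possible implementation, not the project's final code: it assumes scikit-learn (no library is named in the plan), uses a synthetic stand-in for `players.csv` with made-up values, and adds two choices that go beyond the plan, namely a stratified split and one-hot encoding of `experience` and `gender` so that KNN can compute distances on them.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for players.csv (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(10, 50, size=n),
    "played_hours": rng.exponential(5.0, size=n),
    "experience": rng.choice(["Beginner", "Amateur", "Pro", "Veteran"], size=n),
    "gender": rng.choice(["Male", "Female"], size=n),
})
df["subscribe"] = df["played_hours"] > 2.0  # arbitrary rule so both classes exist

X = df[["age", "played_hours", "experience", "gender"]]
y = df["subscribe"]

# 80/20 split; stratifying (an assumption) keeps the class balance similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scale numeric columns and one-hot encode categorical ones (encoding is an assumption).
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "played_hours"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["experience", "gender"]),
])
pipeline = Pipeline([("prep", preprocess), ("knn", KNeighborsClassifier())])

# 5-fold cross-validation over odd k values, using the training data only.
search = GridSearchCV(
    pipeline, {"knn__n_neighbors": list(range(1, 22, 2))}, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

best_k = search.best_params_["knn__n_neighbors"]
test_accuracy = search.score(X_test, y_test)
print(f"best k = {best_k}, test accuracy = {test_accuracy:.3f}")
```

Putting the scaler inside the pipeline means it is refit on the training portion of each fold, so no information from the validation folds or the test set leaks into the normalization.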
