# 📊 Provisional EDA - Burnout Labels Creation

## Purpose

This notebook is a provisional version of the EDA, focused on **target variable creation** (`burnout_score` and `burnout_level`).

### Workflow
1. Load the 4 raw datasets from Kaggle
2. Create a composite burnout score (signed mean of z-scores)
3. Discretize into 3 classes using the 33rd and 66th percentiles
4. Merge with daily data to propagate labels

> **Note**: The final and more complete version is in `01_eda.ipynb`

In [None]:
# =============================================================================
# DATA LOADING
# =============================================================================
# Import necessary libraries and load all datasets
# from the data/raw/ folder (downloaded from Kaggle)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option("display.max_columns", None)

# Load all 4 dataset files
daily_logs = pd.read_csv("../data/raw/daily_logs.csv")      # Daily metrics
daily_all = pd.read_csv("../data/raw/daily_all.csv")        # Expanded version
interventions = pd.read_csv("../data/raw/interventions.csv") # Wellness interventions
weekly = pd.read_csv("../data/raw/weekly_summaries.csv")     # Weekly summaries

# Print dimensions for verification
print("Daily logs:", daily_logs.shape)      # One row per user per day
print("Daily all:", daily_all.shape)
print("Interventions:", interventions.shape)
print("Weekly summaries:", weekly.shape)    # One row per user per week

Daily logs: (731000, 26)
Daily all: (731000, 53)
Interventions: (332, 6)
Weekly summaries: (105000, 10)
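
The row counts alone do not say how many users or how long a period the logs cover. A minimal sketch of a per-user coverage check follows, assuming the `user_id` and `date` columns used later in this notebook; `coverage_summary` and the toy frame are hypothetical names introduced only for illustration.

```python
import pandas as pd


def coverage_summary(daily: pd.DataFrame) -> pd.DataFrame:
    """Per-user first date, last date, and number of logged days."""
    dates = pd.to_datetime(daily["date"])
    return (
        daily.assign(date=dates)
        .groupby("user_id")["date"]
        .agg(first_day="min", last_day="max", n_days="count")
    )


# Toy frame standing in for daily_logs (2 users, 3 days each)
toy_daily = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "date": ["2024-01-01", "2024-01-02", "2024-01-03"] * 2,
})

summary = coverage_summary(toy_daily)
print(summary)
```

On the real data, `coverage_summary(daily_logs)["n_days"].describe()` would show whether every user has the same number of logged days.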


In [None]:
# =============================================================================
# BURNOUT SCORE CREATION
# =============================================================================
# Burnout is not directly observable in the dataset.
# We derive it by combining correlated psychometric indicators.
#
# FORMULA:
# burnout = (z_stress + z_anxiety + z_depression + z_sleep_debt - z_job_satisfaction) / 5
#
# Where z_* indicates the standardized value (z-score) of each variable.
# Standardization makes the scales comparable.

from sklearn.preprocessing import StandardScaler

# Select columns that contribute to burnout
# These are available in the weekly_summaries dataset
burnout_features = weekly[[
    "perceived_stress_scale",  # PSS-10: Perceived Stress Scale (0-40)
    "anxiety_score",           # GAD-7: Generalized Anxiety Disorder scale
    "depression_score",        # PHQ-9: Patient Health Questionnaire for depression
    "sleep_debt_hours",        # Hours of sleep lost relative to ideal needs
    "job_satisfaction",        # Job satisfaction scale (1-10)
]]

# Z-score standardization
# Transforms each column to: (x - mean) / standard_deviation
# Result: mean = 0, std = 1 for each column
scaler = StandardScaler()
burnout_z = scaler.fit_transform(burnout_features)

burnout_z = pd.DataFrame(
    burnout_z,
    columns=burnout_features.columns,
    index=weekly.index
)

# Calculate composite burnout score
# NOTE: job_satisfaction has a NEGATIVE coefficient because it is a protective factor
# (high satisfaction = low burnout)
weekly["burnout_score"] = (
    burnout_z["perceived_stress_scale"]
    + burnout_z["anxiety_score"]
    + burnout_z["depression_score"]
    + burnout_z["sleep_debt_hours"]
    - burnout_z["job_satisfaction"]  # Subtracted!
) / 5.0  # Average of 5 contributions
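
To make the formula concrete, here is a small sketch that rebuilds the composite score by hand on toy values. The numbers are invented for illustration and do not come from the dataset. It uses plain pandas instead of `StandardScaler`; the two agree as long as the standard deviation is computed with `ddof=0`, which is what `StandardScaler` uses.

```python
import numpy as np
import pandas as pd

# Invented values for four hypothetical user-weeks
toy = pd.DataFrame({
    "perceived_stress_scale": [10.0, 20.0, 30.0, 25.0],
    "anxiety_score": [3.0, 8.0, 15.0, 10.0],
    "depression_score": [2.0, 9.0, 20.0, 12.0],
    "sleep_debt_hours": [0.5, 2.0, 6.0, 3.0],
    "job_satisfaction": [9.0, 6.0, 2.0, 5.0],
})

# Z-score each column with the population standard deviation (ddof=0)
toy_z = (toy - toy.mean()) / toy.std(ddof=0)

# +1 for risk factors, -1 for the protective factor
signs = pd.Series({
    "perceived_stress_scale": 1,
    "anxiety_score": 1,
    "depression_score": 1,
    "sleep_debt_hours": 1,
    "job_satisfaction": -1,
})

# Signed mean of the five z-scores (columns are aligned by name)
toy_score = (toy_z * signs).sum(axis=1) / len(signs)
print(toy_score)
```

Row 2 has the highest stress, anxiety, depression and sleep debt and the lowest satisfaction, so it receives the highest score; row 0 is the mirror image and receives the lowest.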

In [None]:
# =============================================================================
# BURNOUT SCORE DESCRIPTIVE STATISTICS
# =============================================================================
# Verify that the burnout score has a reasonable distribution

weekly["burnout_score"].describe()

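# Expected output:
# - mean ≈ 0 (each component is a z-score)
# - std ≈ 0.45 if the five components were uncorrelated, approaching 1 as they
#   become strongly correlated (after flipping the sign of job_satisfaction)
# - min/max within a few standard deviations (inspect the tails for outliers)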

count    1.050000e+05
mean    -1.808162e-16
std      6.748538e-01
min     -2.011962e+00
25%     -4.795661e-01
50%     -1.075208e-01
75%      3.998777e-01
max      4.497371e+00
Name: burnout_score, dtype: float64
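
The standard deviation of the composite depends on how strongly its components move together. The sketch below checks this on synthetic data: the variance of the signed mean equals `s @ R @ s / k**2`, where `R` is the correlation matrix and `s` the sign vector. The shared-factor model and its 0.6 loading are assumptions chosen for illustration, not estimates from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 10_000, 5
signs = np.array([1, 1, 1, 1, -1])  # last component plays the role of job_satisfaction

# Hypothetical components: one shared factor plus independent noise
shared = rng.normal(size=(n, 1))
raw = 0.6 * shared * signs + rng.normal(size=(n, k))

# Standardize each column (population std, ddof=0)
z = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Signed mean of the standardized components
composite = (z * signs).sum(axis=1) / k

# Std predicted from the correlation matrix alone
corr = np.corrcoef(z, rowvar=False)
predicted_std = np.sqrt(signs @ corr @ signs) / k

print("empirical std:     ", composite.std())
print("predicted std:     ", predicted_std)
print("uncorrelated case: ", np.sqrt(k) / k)
```

A value above `sqrt(5) / 5` therefore indicates that the sign-adjusted components are, on balance, positively correlated.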

In [None]:
# =============================================================================
# DISCRETIZATION INTO 3 CLASSES
# =============================================================================
# For classification, we convert the continuous burnout score into 3 levels:
# - 0 = Low burnout
# - 1 = Medium burnout
# - 2 = High burnout
#
# We use the 33rd and 66th percentiles as thresholds, which gives roughly
# balanced classes (about 33% / 33% / 34%).

low_thr = weekly["burnout_score"].quantile(0.33)   # Threshold between low and medium
high_thr = weekly["burnout_score"].quantile(0.66)  # Threshold between medium and high

def burnout_class(score):
    """Convert continuous burnout score to discrete class."""
    if score < low_thr:
        return 0  # Low burnout
    elif score < high_thr:
        return 1  # Medium burnout
    else:
        return 2  # High burnout

weekly["burnout_level"] = weekly["burnout_score"].apply(burnout_class)

# Verify class balance
# We should see approximately 33% for each class
weekly["burnout_level"].value_counts()

burnout_level
2    35700
1    34658
0    34642
Name: count, dtype: int64
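
A vectorized alternative to the row-wise `apply` is sketched below; `burnout_levels` is a hypothetical helper, not part of the notebook. It uses the same boundaries (`score < low` gives 0, `low <= score < high` gives 1, otherwise 2) but keeps missing scores missing. In the `apply` version a NaN score fails both comparisons and ends up in class 2, which is harmless here only because the score has no missing values.

```python
import numpy as np
import pandas as pd


def burnout_levels(scores: pd.Series, low_q: float = 0.33, high_q: float = 0.66) -> pd.Series:
    """Discretize scores into 0/1/2 using quantile thresholds; NaN stays missing."""
    low_thr, high_thr = scores.quantile([low_q, high_q])
    # np.digitize returns 0 below low_thr, 1 in [low_thr, high_thr), 2 from high_thr up
    codes = np.digitize(scores.to_numpy(), [low_thr, high_thr])
    levels = pd.Series(codes, index=scores.index, dtype="Int64")
    # Missing scores must not be assigned a class
    return levels.mask(scores.isna())


toy_scores = pd.Series([-1.0, -0.5, 0.0, 0.5, 1.0, np.nan])
toy_levels = burnout_levels(toy_scores)
print(toy_levels)
```

The nullable `Int64` dtype is one way to hold integer classes alongside missing values; dropping the missing rows before labelling would work as well.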

In [None]:
# =============================================================================
# MERGE WITH DAILY DATA
# =============================================================================
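# Burnout is calculated at the weekly level, but we want to use daily
# features for training. We merge on (user_id, iso_year, week) to
# propagate burnout_score and burnout_level to each day.

# Convert dates to datetime
daily_logs["date"] = pd.to_datetime(daily_logs["date"])
weekly["week_start"] = pd.to_datetime(weekly["week_start"])

# Extract ISO year and ISO week number (1-53).
# The year is needed as a key: the data spans more than one year, so the
# week number alone would match each day to weeks from different years.
# NOTE: assumes week_start falls in the same ISO week as the days it covers
# (e.g., weeks start on Monday).
daily_iso = daily_logs["date"].dt.isocalendar()
daily_logs["iso_year"] = daily_iso["year"]
daily_logs["week"] = daily_iso["week"]
weekly_iso = weekly["week_start"].dt.isocalendar()
weekly["iso_year"] = weekly_iso["year"]
weekly["week"] = weekly_iso["week"]

# Left join: keep all daily records
# and add burnout_score/level from the corresponding week
merged = pd.merge(
    daily_logs,
    weekly[["user_id", "iso_year", "week", "burnout_score", "burnout_level"]],
    on=["user_id", "iso_year", "week"],
    how="left",
    validate="many_to_one",  # Fail loudly if a (user, year, week) key repeats in weekly
)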

print(f"Merged dataset shape: {merged.shape}")
print(f"Missing burnout values: {merged['burnout_level'].isna().sum()}")
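
ISO week numbers repeat every year, so a join keyed on the week number alone can match one daily row to several weekly rows once the data covers more than one year. The toy sketch below (invented dates and labels) contrasts a week-only key with an ISO year plus week key.

```python
import pandas as pd

# One user, two Mondays exactly one ISO year apart (both fall in ISO week 10)
toy_daily = pd.DataFrame({
    "user_id": [1, 1],
    "date": pd.to_datetime(["2023-03-06", "2024-03-04"]),
})
toy_weekly = pd.DataFrame({
    "user_id": [1, 1],
    "week_start": pd.to_datetime(["2023-03-06", "2024-03-04"]),
    "burnout_level": [0, 2],
})

# Add ISO year and ISO week columns to both frames
for frame, col in ((toy_daily, "date"), (toy_weekly, "week_start")):
    iso = frame[col].dt.isocalendar()
    frame["iso_year"] = iso["year"]
    frame["week"] = iso["week"]

# Week-only key: each day matches the same week number in both years
week_only = toy_daily.merge(
    toy_weekly[["user_id", "week", "burnout_level"]],
    on=["user_id", "week"],
    how="left",
)

# Year + week key: each day matches exactly one weekly row
year_week = toy_daily.merge(
    toy_weekly[["user_id", "iso_year", "week", "burnout_level"]],
    on=["user_id", "iso_year", "week"],
    how="left",
    validate="many_to_one",
)

print("rows with week-only key:  ", len(week_only))
print("rows with year + week key:", len(year_week))
```

On the real data, comparing `len(merged)` with `len(daily_logs)` is a quick way to detect this kind of row duplication after a left join.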