# Exploratory Data Analysis (EDA)

## Objectives

This notebook performs exploratory data analysis on the processed dataset to better understand the engineered features created during the ETL stage. It combines statistical hypothesis testing and graphical exploration to identify meaningful patterns, validate assumptions, and confirm relationships relevant to stress detection.

The insights from this analysis will provide a deeper understanding of feature significance, support model development, and inform the design of visual elements in the Streamlit dashboard.

The following hypotheses each compare a variable across three stress groups (Low, Medium, High). Because the data may be skewed, contain outliers, or be better described by ranks or medians than by means, I use a non-parametric test: the Kruskal-Wallis H test, which is designed for comparing three or more independent groups.

- **H1:** Lower sleep duration is linked to higher stress.
- **H2:** Lower sleep quality is linked to higher stress.
- **H5:** Higher physical activity is linked to lower stress.
- **H6:** Higher caffeine intake is linked to higher stress.
- **H7:** Longer work hours and longer travel time are linked to higher stress.
- **H8:** Health indicators differ across stress groups (BP, cholesterol, blood sugar).

In contrast, the Chi-square test of independence is used for hypotheses H3 and H4 (listed below), because they involve categorical variables and ask whether there is a significant association between stress level and a lifestyle characteristic.

The Chi-square test is appropriate when comparing frequency counts across categories (e.g., meditates vs does not meditate, exercise type). It evaluates whether observed differences between groups are likely due to chance or reflect a genuine relationship.

Where a contingency table has very small expected cell counts (typically fewer than 5), Fisher's exact test is also conducted, because the Chi-square approximation becomes unreliable for sparse tables.

Hypotheses tested with Chi-square (and Fisher's Exact if needed):

- **H3:** Higher screen time is linked to higher stress.
- **H4:** Meditation practice is linked to lower stress.
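
The Chi-square workflow described above can be sketched as follows. This is a minimal illustration rather than the notebook's actual analysis: the contingency counts are made up, and `chi_square_with_check` is a hypothetical helper name. It runs `chi2_contingency` and flags tables where any expected count falls below 5, which is the point at which an exact test would be considered.

```python
import numpy as np
from scipy.stats import chi2_contingency


def chi_square_with_check(table, min_expected=5):
    """Run a Chi-square test of independence and flag sparse tables.

    `table` is a 2-D array of observed counts (rows = stress groups,
    columns = categories). Hypothetical helper for illustration only.
    """
    observed = np.asarray(table)
    chi2, p_value, dof, expected = chi2_contingency(observed)
    return {
        "chi2": chi2,
        "p_value": p_value,
        "dof": dof,
        "min_expected": expected.min(),
        # True when the Chi-square approximation may be unreliable
        "needs_exact_test": bool((expected < min_expected).any()),
    }


# Illustrative counts only (NOT taken from the dataset):
# rows = Low / Medium / High stress, columns = meditates / does not meditate
example_table = [[30, 20], [25, 25], [10, 40]]
result = chi_square_with_check(example_table)
print(result)
```

Note that `scipy.stats.fisher_exact` has historically accepted only 2×2 tables, so a three-group table may need a different fallback (for example, collapsing categories). That choice is left open here.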

---

## Change working directory

I need to change the working directory from the notebook's folder to its parent folder. This is required because the notebook runs from inside the jupyter notebooks subfolder, while the data is loaded through a path relative to the parent folder.

In [None]:
import os
current_dir = os.getcwd()
current_dir

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

In [None]:
current_dir = os.getcwd()
current_dir
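
One caveat with the cells above: running the `os.chdir` cell a second time moves up another level. A guarded variant is sketched below. It assumes the notebooks folder is called `jupyter_notebooks`, which is a guess and not confirmed by this notebook, and `move_to_project_root` is a hypothetical helper name.

```python
import os


def move_to_project_root(notebook_dir_name="jupyter_notebooks"):
    """Change to the parent folder only when currently inside the notebooks folder.

    `notebook_dir_name` is an assumed folder name; adjust it to the project.
    Returns the working directory after the (possible) change.
    """
    cwd = os.getcwd()
    if os.path.basename(cwd) == notebook_dir_name:
        os.chdir(os.path.dirname(cwd))
    return os.getcwd()


print(move_to_project_root())
```

Because the function checks the folder name first, re-running the cell leaves the working directory unchanged once it already points at the parent folder.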

---

## Load Libraries and Data

In [None]:
# Core libraries
import numpy as np
import pandas as pd
from pathlib import Path

# Visualisation libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical testing libraries
from scipy import stats
from scipy.stats import kruskal, mannwhitneyu, chi2_contingency, spearmanr

# Display settings
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-whitegrid')

# Load the processed dataset
df = pd.read_csv('data/processed/stress_data_processed.csv')

print(f"Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns")
df.head()

In [None]:
# Verify target variable distribution
print("\nStress Level Distribution:")
print("=" * 40)
print(df['Stress_Detection'].value_counts())
print("\nProportions:")
print(df['Stress_Detection'].value_counts(normalize=True).round(3))

---

## Helper Functions for Statistical Testing

Define reusable functions for hypothesis testing and effect size calculation.

In [None]:
def calculate_epsilon_squared(H_statistic, n):
    """
    Calculate epsilon-squared effect size for Kruskal-Wallis test.
    
    Interpretation (rule-of-thumb thresholds):
    - Negligible effect: < 0.01
    - Small effect: 0.01 - 0.06
    - Medium effect: 0.06 - 0.14
    - Large effect: >= 0.14
    """
    epsilon_sq = H_statistic / (n - 1)
    return epsilon_sq


def interpret_epsilon_squared(epsilon_sq):
    """
    Interpret epsilon-squared effect size.
    """
    if epsilon_sq < 0.01:
        return "negligible"
    elif epsilon_sq < 0.06:
        return "small"
    elif epsilon_sq < 0.14:
        return "medium"
    else:
        return "large"


def kruskal_wallis_test(df, variable, group_col='Stress_Detection'):
    """
    Perform Kruskal-Wallis H test comparing a continuous variable across stress groups.
    
    Returns a dict with keys: H_statistic, p_value, epsilon_squared,
    effect_size, medians, n
    """
    # Separate data by stress group
    groups = []
    group_names = ['Low', 'Medium', 'High']
    
    for name in group_names:
        group_data = df[df[group_col] == name][variable].dropna()
        groups.append(group_data)
    
    # Perform Kruskal-Wallis test
    H_stat, p_value = kruskal(*groups)
    
    # Calculate effect size (epsilon-squared) from the observations actually tested
    n = sum(len(group) for group in groups)
    epsilon_sq = calculate_epsilon_squared(H_stat, n)
    effect_interpretation = interpret_epsilon_squared(epsilon_sq)
    
    # Calculate medians for each group
    medians = {name: df[df[group_col] == name][variable].median() 
               for name in group_names}
    
    return {
        'H_statistic': H_stat,
        'p_value': p_value,
        'epsilon_squared': epsilon_sq,
        'effect_size': effect_interpretation,
        'medians': medians,
        'n': n
    }
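
As a quick sanity check of the effect-size formula, epsilon-squared = H / (n - 1), the sketch below applies it to synthetic data. The three samples are generated at random and are not drawn from the project dataset; the group means are arbitrary assumptions chosen only so that the groups differ.

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(42)

# Synthetic sleep durations in hours (NOT the real dataset)
low = rng.normal(7.5, 1.0, 100)
medium = rng.normal(7.0, 1.0, 100)
high = rng.normal(6.0, 1.0, 100)

# Kruskal-Wallis H test across the three groups
H_stat, p_value = kruskal(low, medium, high)

# Effect size uses the total number of observations tested
n = len(low) + len(medium) + len(high)
epsilon_sq = H_stat / (n - 1)

print(f"H = {H_stat:.2f}, p = {p_value:.2e}, epsilon-squared = {epsilon_sq:.3f}")

# Hand-checkable case: H = 30 with n = 301 gives 30 / 300 = 0.10,
# which falls in the "medium" band (0.06 to 0.14)
print(30 / (301 - 1))
```

Since H cannot exceed n - 1, epsilon-squared always lies between 0 and 1, which is what makes the fixed thresholds usable across sample sizes.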

---

## Hypotheses Tests and Visualisations

In this section, statistical tests are performed to examine the main hypotheses about stress behaviour. Tests are supported by relevant visualisations (where possible) to illustrate patterns and provide context for the statistical results.

---

### H1: Lower sleep duration is linked to higher stress

- **Aim:** To test whether lower sleep duration is associated with higher stress levels.
- **Test details:**
    - **Variable tested:** Sleep_Duration (hours per night)
    - **Test used:** Kruskal-Wallis H test (non-parametric, suitable for comparing 3 groups)
    - **Reason for non-parametric test:** Sleep duration may not be normally distributed within every group. The test works on ranks, so results are summarised with medians rather than means and are robust to outliers.
- **Expected outcome:** If H1 is supported, we expect the High stress group to have significantly lower median sleep duration than the Low stress group.
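
The normality concern above can be checked directly. The following is a minimal sketch using the Shapiro-Wilk test from SciPy on synthetic data. The `demo` frame and the `shapiro_by_group` helper are hypothetical and only mirror the notebook's column names; the notebook itself does not state that this check was run.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro


def shapiro_by_group(data, variable, group_col="Stress_Detection"):
    """Return the Shapiro-Wilk W statistic and p-value for each group."""
    rows = []
    for name, values in data.groupby(group_col)[variable]:
        w_stat, p_value = shapiro(values.dropna())
        rows.append({"group": name, "W": w_stat, "p_value": p_value})
    return pd.DataFrame(rows).set_index("group")


# Synthetic example (NOT the real dataset): one roughly normal group
# and one right-skewed group
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Stress_Detection": ["Low"] * 200 + ["High"] * 200,
    "Sleep_Duration": np.concatenate([
        rng.normal(7.5, 1.0, 200),
        4.0 + rng.exponential(1.5, 200),
    ]),
})
normality = shapiro_by_group(demo, "Sleep_Duration")
print(normality)
```

A small p-value in any group suggests a departure from normality, which supports using the rank-based Kruskal-Wallis test instead of a one-way ANOVA.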

In [None]:
# H1: Statistical Test - Kruskal-Wallis

# Perform the test
h1_results = kruskal_wallis_test(df, 'Sleep_Duration')

# Display results
print("H1: Lower sleep duration is linked to higher stress")
print("=" * 60)
print("\nTest: Kruskal-Wallis H test")
print("Variable: Sleep_Duration")
print(f"Sample size: n = {h1_results['n']}")
print("\nResults:")
print("-" * 40)
print(f"H-statistic: {h1_results['H_statistic']:.3f}")
print(f"p-value: {h1_results['p_value']:.2e}")
print(f"Epsilon-squared (ε²): {h1_results['epsilon_squared']:.4f} ({h1_results['effect_size']} effect)")
print("\nMedian Sleep Duration by Stress Level:")
print("-" * 40)
for group, median in h1_results['medians'].items():
    print(f"{group}: {median:.2f} hours")

#### Interpretation

Based on the Kruskal-Wallis test results:

- **Statistical significance:** If p < 0.05, sleep duration differs significantly across the stress groups (at least one group differs from the others; the test does not say which).
- **Effect size:** The epsilon-squared value estimates the proportion of variance in the ranks of sleep duration that is explained by stress level.
- **Direction:** Compare the medians to determine if higher stress is associated with lower sleep duration as hypothesised.

The interpretation will be completed after viewing the actual results.
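
Because the Kruskal-Wallis test is an omnibus test, a significant result needs a follow-up to show which pairs of groups differ. One common option is pairwise Mann-Whitney U tests with a Bonferroni correction; the notebook already imports `mannwhitneyu`. The sketch below uses synthetic data, and `pairwise_mannwhitney` is a hypothetical helper, not part of the notebook.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu


def pairwise_mannwhitney(data, variable, group_col="Stress_Detection",
                         group_names=("Low", "Medium", "High")):
    """Pairwise two-sided Mann-Whitney U tests with Bonferroni correction."""
    pairs = list(combinations(group_names, 2))
    rows = []
    for a, b in pairs:
        x = data.loc[data[group_col] == a, variable].dropna()
        y = data.loc[data[group_col] == b, variable].dropna()
        u_stat, p_value = mannwhitneyu(x, y, alternative="two-sided")
        rows.append({
            "comparison": f"{a} vs {b}",
            "U": u_stat,
            "p_value": p_value,
            # Bonferroni: multiply by the number of comparisons, cap at 1
            "p_adjusted": min(p_value * len(pairs), 1.0),
        })
    return pd.DataFrame(rows)


# Synthetic example (NOT the real dataset)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Stress_Detection": np.repeat(["Low", "Medium", "High"], 100),
    "Sleep_Duration": np.concatenate([
        rng.normal(7.5, 1.0, 100),
        rng.normal(7.0, 1.0, 100),
        rng.normal(6.0, 1.0, 100),
    ]),
})
posthoc = pairwise_mannwhitney(demo, "Sleep_Duration")
print(posthoc)
```

Bonferroni is deliberately conservative. With only three comparisons the loss of power is modest, but other corrections (such as Holm) would also be reasonable.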

#### H1 Visualisation: Sleep Duration by Stress Level (Box and Violin Plots)

- **Matplotlib and Seaborn** were chosen for their simplicity and clarity in showing distributions across categorical groups.
- The box plot summarises the relationship between Sleep_Duration and Stress_Detection, showing the median, interquartile range (IQR), and any outliers for each stress group.
- The violin plot shows the same comparison with the full shape of each group's distribution.
- **Insights:** Look for:
    - Lower median sleep duration in the High stress group
    - Separation between the boxes indicating meaningful differences
    - Outliers that may influence the results

In [None]:
# H1 Visualisation: Box Plot of Sleep Duration by Stress Level

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Define consistent order for stress levels
stress_order = ['Low', 'Medium', 'High']
colors = ['#2ecc71', '#f39c12', '#e74c3c']  # Green, Orange, Red

# Box Plot
ax1 = axes[0]
sns.boxplot(
    data=df, 
    x='Stress_Detection', 
    y='Sleep_Duration',
    order=stress_order,
    palette=colors,
    ax=ax1
)
ax1.set_title('H1: Sleep Duration by Stress Level', fontsize=12, fontweight='bold')
ax1.set_xlabel('Stress Level', fontsize=10)
ax1.set_ylabel('Sleep Duration (hours)', fontsize=10)

# Add median annotations
for i, stress_level in enumerate(stress_order):
    median_val = df[df['Stress_Detection'] == stress_level]['Sleep_Duration'].median()
    ax1.annotate(
        f'Median: {median_val:.1f}h',
        xy=(i, median_val),
        xytext=(i + 0.25, median_val + 0.3),
        fontsize=9,
        ha='left'
    )

# Violin Plot (alternative view)
ax2 = axes[1]
sns.violinplot(
    data=df, 
    x='Stress_Detection', 
    y='Sleep_Duration',
    order=stress_order,
    palette=colors,
    ax=ax2
)
ax2.set_title('H1: Sleep Duration Distribution by Stress Level', fontsize=12, fontweight='bold')
ax2.set_xlabel('Stress Level', fontsize=10)
ax2.set_ylabel('Sleep Duration (hours)', fontsize=10)

plt.tight_layout()
plt.show()

In [None]:
# H1: Summary Statistics by Stress Group

print("\nH1: Descriptive Statistics - Sleep Duration by Stress Level")
print("=" * 70)

h1_summary = df.groupby('Stress_Detection')['Sleep_Duration'].agg(
    ['count', 'mean', 'median', 'std', 'min', 'max']
).round(2)

# Reorder index
h1_summary = h1_summary.reindex(['Low', 'Medium', 'High'])
h1_summary.columns = ['Count', 'Mean', 'Median', 'Std Dev', 'Min', 'Max']

print(h1_summary)

#### H1 Conclusion

Based on the statistical test and visualisations:

- **Findings:** [To be interpreted after running the notebook with actual data]
    - The Kruskal-Wallis test indicates whether the differences in sleep duration across stress groups are statistically significant.
    - The effect size (epsilon-squared) indicates the practical significance of the relationship.
    - The box/violin plots show whether the same pattern is visible in the distributions.

- **Support for H1:** 
    - If p < 0.05 and the High stress group shows lower median sleep duration, H1 is supported.
    - If the effect size is small, the relationship exists but may not be a dominant factor in stress prediction.

- **Implications for modelling:**
    - Sleep_Duration should be considered as a feature in the predictive model.
    - The binned version (Sleep_Duration_Bin) may also be useful for interpretability.

---

## Summary and Next Steps

### H1 Summary

| Hypothesis | Variable | Test | p-value | Effect Size | Supported? |
|------------|----------|------|---------|-------------|------------|
| H1: Lower sleep → higher stress | Sleep_Duration | Kruskal-Wallis | [result] | [result] | [Yes/No] |

