# Exploratory Data Analysis (EDA)

## Objectives

This notebook performs exploratory data analysis on the processed dataset to better understand the engineered features created during the ETL stage. It combines statistical hypothesis testing and graphical exploration to identify meaningful patterns, validate assumptions, and confirm relationships relevant to stress detection.

The insights from this analysis will provide a deeper understanding of feature significance, support model development, and inform the design of visual elements in the Streamlit dashboard.

The following hypotheses compare a numeric or ordinal variable across the three stress groups (Low, Medium, High). Because these variables may be skewed, contain outliers, or be better described by ranks and medians than by means, I use the non-parametric Kruskal-Wallis H test, which is designed for comparing three or more independent groups:

- **H1:** Lower sleep duration is linked to higher stress.
- **H2:** Lower sleep quality is linked to higher stress.
- **H3:** Higher screen time is linked to higher stress (supplemented by a Spearman correlation).
- **H5:** Higher physical activity is linked to lower stress.
- **H6:** Higher caffeine intake is linked to higher stress.
- **H7:** Longer work hours and longer travel time are linked to higher stress.
- **H8:** Health indicators differ across stress groups (BP, cholesterol, blood sugar).

In comparison, the Chi-square test of independence is used where both variables are categorical, to assess whether there is a significant association between stress level and a lifestyle characteristic.

The Chi-square test is appropriate when comparing frequency counts across categories (e.g., meditates vs does not meditate, exercise type). It evaluates whether observed differences between groups are likely due to chance or reflect a genuine relationship.

Where the Chi-square test produces a contingency table with very small expected cell counts (typically fewer than 5), Fisher's Exact Test is also conducted, as it provides a more reliable result for small samples.

Hypothesis tested with Chi-square (and Fisher's Exact if needed):

- **H4:** Meditation practice is linked to lower stress.
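
The small-expected-count rule described above can be sketched as follows. This is a minimal illustration on a hypothetical 2×2 table, not project data. It uses a 2×2 table because that is the case `scipy.stats.fisher_exact` is known to handle; a larger table, such as meditation by three stress levels, may need categories collapsed or a different exact method, depending on the SciPy version.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical 2x2 table of counts (illustrative only, not project data):
# rows = meditates (no / yes), columns = stress (not high / high)
toy_table = np.array([[8, 2],
                      [1, 5]])

chi2_stat, chi2_p, dof, expected = chi2_contingency(toy_table)

# Rule of thumb from the text: expected counts below 5 make Chi-square unreliable
needs_fisher = bool((expected < 5).any())

if needs_fisher:
    odds_ratio, fisher_p = fisher_exact(toy_table)
    print(f"Small expected counts; Fisher's exact p-value: {fisher_p:.4f}")
else:
    print(f"Chi-square p-value: {chi2_p:.4f}")
```

The check is made on the expected counts returned by `chi2_contingency`, not on the observed counts, since the rule of thumb refers to expected frequencies.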

---

## Change working directory

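I need to change the working directory from the current folder to its parent folder, because the notebook runs from inside the Jupyter notebooks subfolder while the data paths used later are relative to the project root. The `os.chdir` cell should be run only once per session: each re-run moves the working directory up another level.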
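
As an optional alternative, the directory change can be guarded so that re-running the cell does not climb further up the folder tree. The sketch below is an assumption-laden illustration: the folder name `jupyter_notebooks` and the helper `resolve_project_root` are hypothetical and should be adjusted to the actual project layout.

```python
import os
from pathlib import Path

NOTEBOOK_DIR_NAME = "jupyter_notebooks"  # assumed folder name; adjust to the project


def resolve_project_root(cwd, notebook_dir_name=NOTEBOOK_DIR_NAME):
    """Return the parent folder if `cwd` is the notebooks folder, else `cwd` unchanged."""
    cwd = Path(cwd)
    return cwd.parent if cwd.name == notebook_dir_name else cwd


project_root = resolve_project_root(Path.cwd())
os.chdir(project_root)
print(f"Working directory: {project_root}")
```

Because the move only happens when the current folder has the expected name, running the cell a second time leaves the working directory where it is.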

In [None]:
import os
current_dir = os.getcwd()
current_dir

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

In [None]:
current_dir = os.getcwd()
current_dir

---

## Load Libraries and Data

In [None]:
# Core libraries
import numpy as np
import pandas as pd
from pathlib import Path

# Visualisation libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical testing libraries
from scipy import stats
from scipy.stats import kruskal, mannwhitneyu, chi2_contingency, spearmanr

# Display settings
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-whitegrid')

# Load the processed dataset
df = pd.read_csv('data/processed/stress_data_processed.csv')

print(f"Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns")
df.head()

In [None]:
# Verify target variable distribution
print("\nStress Level Distribution:")
print("=" * 40)
print(df['Stress_Detection'].value_counts())
print("\nProportions:")
print(df['Stress_Detection'].value_counts(normalize=True).round(3))

---

## Helper Functions for Statistical Testing

Define reusable functions for hypothesis testing and effect size calculation.
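
As a rough illustration of the effect-size calculation used by the helpers in this section, the sketch below runs a Kruskal-Wallis test on synthetic data (randomly generated, not project data) and converts the H statistic to epsilon-squared by dividing by n − 1. The group means and sizes are arbitrary assumptions chosen only to produce a visible difference.

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(42)

# Synthetic sleep-duration-like samples for three groups (not project data)
low = rng.normal(loc=7.5, scale=1.0, size=50)
medium = rng.normal(loc=7.0, scale=1.0, size=50)
high = rng.normal(loc=6.0, scale=1.0, size=50)

H_stat, p_value = kruskal(low, medium, high)

# Epsilon-squared: H divided by (n - 1), where n is the total sample size
n = len(low) + len(medium) + len(high)
epsilon_sq = H_stat / (n - 1)

print(f"H = {H_stat:.2f}, p = {p_value:.3g}, epsilon-squared = {epsilon_sq:.3f}")
```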

In [None]:
def calculate_epsilon_squared(H_statistic, n):
    """
    Calculate epsilon-squared effect size for Kruskal-Wallis test.
    
    Interpretation (rule-of-thumb thresholds):
    - Negligible effect: < 0.01
    - Small effect: 0.01 - 0.06
    - Medium effect: 0.06 - 0.14
    - Large effect: > 0.14
    """
    epsilon_sq = H_statistic / (n - 1)
    return epsilon_sq


def interpret_epsilon_squared(epsilon_sq):
    """
    Interpret epsilon-squared effect size.
    """
    if epsilon_sq < 0.01:
        return "negligible"
    elif epsilon_sq < 0.06:
        return "small"
    elif epsilon_sq < 0.14:
        return "medium"
    else:
        return "large"


def kruskal_wallis_test(df, variable, group_col='Stress_Detection'):
    """
    Perform Kruskal-Wallis H test comparing a continuous variable across stress groups.
    
    Returns: H-statistic, p-value, epsilon-squared, group medians
    """
    # Separate data by stress group, dropping missing values
    group_names = ['Low', 'Medium', 'High']
    groups = [
        df.loc[df[group_col] == name, variable].dropna()
        for name in group_names
    ]
    
    # Perform Kruskal-Wallis test
    H_stat, p_value = kruskal(*groups)
    
    # Calculate effect size (epsilon-squared) using only the observations
    # that entered the test (rows in the three groups with a non-missing value)
    n = sum(len(group) for group in groups)
    epsilon_sq = calculate_epsilon_squared(H_stat, n)
    effect_interpretation = interpret_epsilon_squared(epsilon_sq)
    
    # Calculate medians for each group
    medians = {name: group.median() for name, group in zip(group_names, groups)}
    
    return {
        'H_statistic': H_stat,
        'p_value': p_value,
        'epsilon_squared': epsilon_sq,
        'effect_size': effect_interpretation,
        'medians': medians,
        'n': n
    }


def cramers_v(contingency_table):
    """
    Calculate Cramér's V effect size for Chi-square test.
    
    Interpretation:
    - Small effect: 0.1
    - Medium effect: 0.3
    - Large effect: 0.5
    """
    chi2 = chi2_contingency(contingency_table)[0]
    n = contingency_table.sum().sum()
    min_dim = min(contingency_table.shape) - 1
    return np.sqrt(chi2 / (n * min_dim))


def interpret_cramers_v(v):
    """
    Interpret Cramér's V effect size.
    """
    if v < 0.1:
        return "negligible"
    elif v < 0.3:
        return "small"
    elif v < 0.5:
        return "medium"
    else:
        return "large"

---

## Hypotheses Tests and Visualisations

In this section, statistical tests are performed to examine the main hypotheses about stress behaviour. Tests are supported by relevant visualisations (where possible) to illustrate patterns and provide context for the statistical results.

---

### H1: Lower sleep duration is linked to higher stress

- **Aim:** To test whether lower sleep duration is associated with higher stress levels.
- **Test details:**
    - **Variable tested:** Sleep_Duration (hours per night)
    - **Test used:** Kruskal-Wallis H test (non-parametric, suitable for comparing 3 groups)
    - **Reason for non-parametric test:** Sleep duration may not be normally distributed across all groups. The test works on ranks, which makes it robust against outliers, and group medians are reported to show the direction of any difference.
- **Expected outcome:** If H1 is supported, we expect the High stress group to have significantly lower median sleep duration than the Low stress group.

In [None]:
# H1: Statistical Test - Kruskal-Wallis

# Perform the test
h1_results = kruskal_wallis_test(df, 'Sleep_Duration')

# Display results
print("H1: Lower sleep duration is linked to higher stress")
print("=" * 60)
print("\nTest: Kruskal-Wallis H test")
print("Variable: Sleep_Duration")
print(f"Sample size: n = {h1_results['n']}")
print("\nResults:")
print("-" * 40)
print(f"H-statistic: {h1_results['H_statistic']:.3f}")
print(f"p-value: {h1_results['p_value']:.2e}")
print(f"Epsilon-squared (ε²): {h1_results['epsilon_squared']:.4f} ({h1_results['effect_size']} effect)")
print("\nMedian Sleep Duration by Stress Level:")
print("-" * 40)
for group, median in h1_results['medians'].items():
    print(f"{group}: {median:.2f} hours")

#### Interpretation

Based on the Kruskal-Wallis test results:

- **Statistical significance:** If p < 0.05, there is a statistically significant difference in sleep duration across stress groups. A significant result shows only that at least one group differs from the others; it does not identify which groups differ.
- **Effect size:** The epsilon-squared value indicates the proportion of variability in the ranked sleep duration values that is explained by stress group.
- **Direction:** Compare the medians to determine if higher stress is associated with lower sleep duration as hypothesised.

The interpretation will be completed after viewing the actual results.
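
Because the Kruskal-Wallis test is an omnibus test, a pairwise follow-up is one way to check the specific High-versus-Low comparison named in the expected outcome. The sketch below is an assumption, not part of the original analysis plan: it uses pairwise Mann-Whitney U tests with a Bonferroni adjustment, a hypothetical helper `pairwise_mannwhitney`, and synthetic data in place of the project dataset.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu


def pairwise_mannwhitney(data, variable, group_col, group_names):
    """Pairwise two-sided Mann-Whitney U tests with Bonferroni-adjusted p-values."""
    pairs = list(combinations(group_names, 2))
    rows = []
    for first, second in pairs:
        x = data.loc[data[group_col] == first, variable].dropna()
        y = data.loc[data[group_col] == second, variable].dropna()
        u_stat, p_raw = mannwhitneyu(x, y, alternative="two-sided")
        rows.append({
            "pair": f"{first} vs {second}",
            "U": u_stat,
            "p_raw": p_raw,
            # Bonferroni: multiply by the number of comparisons, cap at 1
            "p_adjusted": min(p_raw * len(pairs), 1.0),
        })
    return pd.DataFrame(rows)


# Synthetic demonstration data (not project data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Stress_Detection": np.repeat(["Low", "Medium", "High"], 40),
    "Sleep_Duration": np.concatenate([
        rng.normal(7.5, 1.0, 40),
        rng.normal(7.0, 1.0, 40),
        rng.normal(6.0, 1.0, 40),
    ]),
})

posthoc = pairwise_mannwhitney(demo, "Sleep_Duration", "Stress_Detection",
                               ["Low", "Medium", "High"])
print(posthoc)
```

With the real data, the same call could be made on `df` instead of `demo`. Bonferroni is a conservative choice; other corrections would be equally defensible.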

#### H1 Visualisation: Sleep Duration by Stress Level (Box and Violin Plots)

- **Matplotlib and Seaborn** were chosen for their simplicity and clarity in showing distributions across categorical groups.
- The box plot summarises the relationship between Sleep_Duration and Stress_Detection, showing the median, interquartile range (IQR), and any outliers for each stress group.
- The violin plot alongside it shows the full shape of each group's distribution.
- **Insights:** Look for:
    - Lower median sleep duration in the High stress group
    - Separation between the boxes indicating meaningful differences
    - Outliers that may influence the results

In [None]:
# H1 Visualisation: Box Plot of Sleep Duration by Stress Level

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Define consistent order for stress levels
stress_order = ['Low', 'Medium', 'High']
colors = ['#2ecc71', '#f39c12', '#e74c3c']  # Green, Orange, Red

# Box Plot
ax1 = axes[0]
sns.boxplot(
    data=df, 
    x='Stress_Detection', 
    y='Sleep_Duration',
    order=stress_order,
    palette=colors,
    ax=ax1
)
ax1.set_title('H1: Sleep Duration by Stress Level', fontsize=12, fontweight='bold')
ax1.set_xlabel('Stress Level', fontsize=10)
ax1.set_ylabel('Sleep Duration (hours)', fontsize=10)

# Add median annotations
for i, stress_level in enumerate(stress_order):
    median_val = df[df['Stress_Detection'] == stress_level]['Sleep_Duration'].median()
    ax1.annotate(
        f'Median: {median_val:.1f}h',
        xy=(i, median_val),
        xytext=(i + 0.25, median_val + 0.3),
        fontsize=9,
        ha='left'
    )

# Violin Plot (alternative view)
ax2 = axes[1]
sns.violinplot(
    data=df, 
    x='Stress_Detection', 
    y='Sleep_Duration',
    order=stress_order,
    palette=colors,
    ax=ax2
)
ax2.set_title('H1: Sleep Duration Distribution by Stress Level', fontsize=12, fontweight='bold')
ax2.set_xlabel('Stress Level', fontsize=10)
ax2.set_ylabel('Sleep Duration (hours)', fontsize=10)

plt.tight_layout()
plt.show()

In [None]:
# H1: Summary Statistics by Stress Group

print("\nH1: Descriptive Statistics - Sleep Duration by Stress Level")
print("=" * 70)

h1_summary = df.groupby('Stress_Detection')['Sleep_Duration'].agg(
    ['count', 'mean', 'median', 'std', 'min', 'max']
).round(2)

# Reorder index
h1_summary = h1_summary.reindex(['Low', 'Medium', 'High'])
h1_summary.columns = ['Count', 'Mean', 'Median', 'Std Dev', 'Min', 'Max']

print(h1_summary)

#### H1 Conclusion

Based on the statistical test and visualisations:

- **Findings:** [To be interpreted after running the notebook with actual data]
    - The Kruskal-Wallis test indicates whether the differences in sleep duration across stress groups are statistically significant.
    - The effect size (epsilon-squared) indicates the practical significance of the relationship.
    - The box/violin plots visually confirm the pattern.

- **Support for H1:** 
    - If p < 0.05 and the High stress group shows lower median sleep duration, H1 is supported.
    - If the effect size is small, the relationship exists but may not be a dominant factor in stress prediction.

- **Implications for modelling:**
    - Sleep_Duration should be considered as a feature in the predictive model.
    - The binned version (Sleep_Duration_Bin) may also be useful for interpretability.

---

### H2: Lower sleep quality is linked to higher stress

- **Aim:** To test whether lower sleep quality is associated with higher stress levels.
- **Test details:**
    - **Variable tested:** Sleep_Quality (1-5 scale rating)
    - **Test used:** Kruskal-Wallis H test (non-parametric, suitable for comparing 3 groups)
    - **Reason for non-parametric test:** Sleep quality is measured on an ordinal scale (1-5), and the distribution may not be normal. The Kruskal-Wallis test compares ranks rather than assuming equal intervals between scale points.
- **Expected outcome:** If H2 is supported, we expect the High stress group to have significantly lower median sleep quality than the Low stress group.
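
The point that Kruskal-Wallis compares ranks and does not assume equal intervals can be checked directly: stretching the rating scale unevenly, while keeping the order of the ratings, should leave the H statistic unchanged. The sketch below uses synthetic 1-5 ratings (not project data), and the `stretch` helper is a hypothetical transformation chosen only for illustration.

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(1)

# Synthetic 1-5 ratings for three groups (not project data)
low = rng.integers(3, 6, size=30)     # ratings 3-5
medium = rng.integers(2, 5, size=30)  # ratings 2-4
high = rng.integers(1, 4, size=30)    # ratings 1-3

H_original, _ = kruskal(low, medium, high)


def stretch(ratings):
    """Stretch the scale unevenly while preserving order: 1,2,3,4,5 -> 1,2,4,8,16."""
    return 2 ** (ratings - 1)


H_stretched, _ = kruskal(stretch(low), stretch(medium), stretch(high))

print(f"H on original scale:  {H_original:.4f}")
print(f"H on stretched scale: {H_stretched:.4f}")
```

A test based on means, such as one-way ANOVA, would generally give different results on the two scales, which is why a rank-based test suits an ordinal rating.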

In [None]:
# H2: Statistical Test - Kruskal-Wallis

# Perform the test
h2_results = kruskal_wallis_test(df, 'Sleep_Quality')

# Display results
print("H2: Lower sleep quality is linked to higher stress")
print("=" * 60)
print("\nTest: Kruskal-Wallis H test")
print("Variable: Sleep_Quality")
print(f"Sample size: n = {h2_results['n']}")
print("\nResults:")
print("-" * 40)
print(f"H-statistic: {h2_results['H_statistic']:.3f}")
print(f"p-value: {h2_results['p_value']:.2e}")
print(f"Epsilon-squared (ε²): {h2_results['epsilon_squared']:.4f} ({h2_results['effect_size']} effect)")
print("\nMedian Sleep Quality by Stress Level:")
print("-" * 40)
for group, median in h2_results['medians'].items():
    print(f"{group}: {median:.2f}")

#### Interpretation

Based on the Kruskal-Wallis test results:

- **Statistical significance:** If p < 0.05, there is a statistically significant difference in sleep quality across stress groups. A significant result shows only that at least one group differs from the others; it does not identify which groups differ.
- **Effect size:** The epsilon-squared value indicates the proportion of variability in the ranked sleep quality values that is explained by stress group.
- **Direction:** Compare the medians to determine if higher stress is associated with lower sleep quality as hypothesised.

The interpretation will be completed after viewing the actual results.

#### H2 Visualisation: Sleep Quality by Stress Level (Box and Violin Plots)

- **Matplotlib and Seaborn** were chosen for their simplicity and clarity in showing distributions across categorical groups.
- The box plot summarises the relationship between Sleep_Quality and Stress_Detection, showing the median, interquartile range (IQR), and any outliers for each stress group.
- The violin plot alongside it shows where the ratings are concentrated within each group.
- **Insights:** Look for:
    - Lower median sleep quality in the High stress group
    - Clear separation between the boxes indicating meaningful differences
    - The distribution shape in the violin plot showing concentration of values

In [None]:
# H2 Visualisation: Box Plot of Sleep Quality by Stress Level

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Define consistent order for stress levels
stress_order = ['Low', 'Medium', 'High']
colors = ['#2ecc71', '#f39c12', '#e74c3c']  # Green, Orange, Red

# Box Plot
ax1 = axes[0]
sns.boxplot(
    data=df, 
    x='Stress_Detection', 
    y='Sleep_Quality',
    order=stress_order,
    palette=colors,
    ax=ax1
)
ax1.set_title('H2: Sleep Quality by Stress Level', fontsize=12, fontweight='bold')
ax1.set_xlabel('Stress Level', fontsize=10)
ax1.set_ylabel('Sleep Quality (1-5 scale)', fontsize=10)
ax1.set_ylim(0, 6)  # Set y-axis limits for clarity

# Add median annotations
for i, stress_level in enumerate(stress_order):
    median_val = df[df['Stress_Detection'] == stress_level]['Sleep_Quality'].median()
    ax1.annotate(
        f'Median: {median_val:.1f}',
        xy=(i, median_val),
        xytext=(i + 0.25, median_val + 0.3),
        fontsize=9,
        ha='left'
    )

# Violin Plot (alternative view)
ax2 = axes[1]
sns.violinplot(
    data=df, 
    x='Stress_Detection', 
    y='Sleep_Quality',
    order=stress_order,
    palette=colors,
    ax=ax2
)
ax2.set_title('H2: Sleep Quality Distribution by Stress Level', fontsize=12, fontweight='bold')
ax2.set_xlabel('Stress Level', fontsize=10)
ax2.set_ylabel('Sleep Quality (1-5 scale)', fontsize=10)
ax2.set_ylim(0, 6)

plt.tight_layout()
plt.show()

In [None]:
# H2: Summary Statistics by Stress Group

print("\nH2: Descriptive Statistics - Sleep Quality by Stress Level")
print("=" * 70)

h2_summary = df.groupby('Stress_Detection')['Sleep_Quality'].agg(
    ['count', 'mean', 'median', 'std', 'min', 'max']
).round(2)

# Reorder index
h2_summary = h2_summary.reindex(['Low', 'Medium', 'High'])
h2_summary.columns = ['Count', 'Mean', 'Median', 'Std Dev', 'Min', 'Max']

print(h2_summary)

In [None]:
# H2: Additional visualisation - Sleep Quality value counts by Stress Level

fig, ax = plt.subplots(figsize=(10, 6))

# Create grouped bar chart
sleep_quality_by_stress = df.groupby(['Stress_Detection', 'Sleep_Quality']).size().unstack(fill_value=0)
sleep_quality_by_stress = sleep_quality_by_stress.reindex(['Low', 'Medium', 'High'])

# 'RdYlGn' maps the lowest sleep quality rating to red and the highest to green
sleep_quality_by_stress.plot(kind='bar', ax=ax, colormap='RdYlGn', edgecolor='black')
ax.set_title('H2: Sleep Quality Distribution Across Stress Levels', fontsize=12, fontweight='bold')
ax.set_xlabel('Stress Level', fontsize=10)
ax.set_ylabel('Count', fontsize=10)
ax.legend(title='Sleep Quality', bbox_to_anchor=(1.02, 1), loc='upper left')
ax.tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show()

#### H2 Conclusion

Based on the statistical test and visualisations:

- **Findings:** [To be interpreted after running the notebook with actual data]
    - The Kruskal-Wallis test indicates whether the differences in sleep quality across stress groups are statistically significant.
    - The effect size (epsilon-squared) indicates the practical significance of the relationship.
    - The box/violin plots visually confirm the pattern.

- **Support for H2:** 
    - If p < 0.05 and the High stress group shows lower median sleep quality, H2 is supported.
    - The grouped bar chart shows the distribution of sleep quality ratings within each stress level.

- **Implications for modelling:**
    - Sleep_Quality should be considered as a feature in the predictive model.
    - May interact with Sleep_Duration to provide a more complete picture of sleep health.

---

### H3: Higher screen time is linked to higher stress

- **Aim:** To test whether higher screen time is associated with higher stress levels.
- **Test details:**
    - **Variable tested:** Screen_Time (hours per day)
    - **Tests used:** 
        1. Kruskal-Wallis H test (comparing screen time across 3 stress groups)
        2. Spearman correlation (measuring monotonic relationship between screen time and encoded stress level)
    - **Reason for dual approach:** The Kruskal-Wallis test determines if groups differ significantly, while Spearman correlation quantifies the strength and direction of the relationship.
- **Expected outcome:** If H3 is supported, we expect the High stress group to have significantly higher median screen time, and a positive Spearman correlation.
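
The Spearman step relies on an ordinal encoding of the stress labels. The sketch below shows one way this could look on a small synthetic table (not project data). The mapping Low = 0, Medium = 1, High = 2 is an assumption; the `Stress_Level_Encoded` column produced by the ETL stage may use different codes, although any order-preserving encoding gives the same Spearman coefficient.

```python
import pandas as pd
from scipy.stats import spearmanr

# Assumed ordinal encoding; the ETL stage may use a different mapping
stress_codes = {"Low": 0, "Medium": 1, "High": 2}

# Small synthetic example (not project data)
demo = pd.DataFrame({
    "Screen_Time": [2.0, 3.5, 4.0, 5.5, 6.0, 7.5, 8.0, None],
    "Stress_Detection": ["Low", "Low", "Medium", "Medium", "High", "High", "High", "Low"],
})
demo["Stress_Level_Encoded"] = demo["Stress_Detection"].map(stress_codes)

# Drop incomplete rows so both inputs stay aligned and free of NaN
complete = demo[["Screen_Time", "Stress_Level_Encoded"]].dropna()
rho, p_value = spearmanr(complete["Screen_Time"], complete["Stress_Level_Encoded"])

print(f"Spearman rho = {rho:.3f}, p = {p_value:.3g}")
```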

In [None]:
# H3: Statistical Test - Kruskal-Wallis

# Perform the Kruskal-Wallis test
h3_results = kruskal_wallis_test(df, 'Screen_Time')

# Display results
print("H3: Higher screen time is linked to higher stress")
print("=" * 60)
print("\nTest 1: Kruskal-Wallis H test")
print("Variable: Screen_Time")
print(f"Sample size: n = {h3_results['n']}")
print("\nResults:")
print("-" * 40)
print(f"H-statistic: {h3_results['H_statistic']:.3f}")
print(f"p-value: {h3_results['p_value']:.2e}")
print(f"Epsilon-squared (ε²): {h3_results['epsilon_squared']:.4f} ({h3_results['effect_size']} effect)")
print("\nMedian Screen Time by Stress Level:")
print("-" * 40)
for group, median in h3_results['medians'].items():
    print(f"{group}: {median:.2f} hours")

In [None]:
# H3: Spearman Correlation Test

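# Calculate Spearman correlation between Screen_Time and encoded stress level
# (Stress_Level_Encoded is expected from the ETL stage; pairs with missing values are omitted)
spearman_corr, spearman_p = spearmanr(
    df['Screen_Time'], 
    df['Stress_Level_Encoded'],
    nan_policy='omit'
)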

print("\nTest 2: Spearman Correlation")
print("-" * 40)
print(f"Spearman correlation (ρ): {spearman_corr:.4f}")
print(f"p-value: {spearman_p:.2e}")

# Interpret correlation strength
if abs(spearman_corr) < 0.1:
    corr_strength = "negligible"
elif abs(spearman_corr) < 0.3:
    corr_strength = "weak"
elif abs(spearman_corr) < 0.5:
    corr_strength = "moderate"
else:
    corr_strength = "strong"

direction = "positive" if spearman_corr > 0 else "negative"
print(f"Interpretation: {corr_strength} {direction} correlation")

#### Interpretation

Based on the Kruskal-Wallis and Spearman correlation results:

- **Kruskal-Wallis:** If p < 0.05, there is a statistically significant difference in screen time across stress groups.
- **Spearman correlation:** A positive correlation indicates that higher screen time is associated with higher stress levels.
- **Effect sizes:** Both epsilon-squared and the correlation coefficient indicate the strength of the relationship.

The interpretation will be completed after viewing the actual results.

#### H3 Visualisation: Screen Time by Stress Level

- **Box plot** shows the distribution of screen time across stress groups.
- **Scatter plot** shows the relationship between screen time and sleep duration, coloured by stress level, to explore potential interactions.
- **Insights:** Look for:
    - Higher median screen time in the High stress group
    - Patterns in the scatter plot showing high stress individuals clustering at high screen time and/or low sleep duration

In [None]:
# H3 Visualisation: Screen Time by Stress Level

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Define consistent order for stress levels
stress_order = ['Low', 'Medium', 'High']
colors = ['#2ecc71', '#f39c12', '#e74c3c']  # Green, Orange, Red
color_dict = dict(zip(stress_order, colors))

# Box Plot
ax1 = axes[0]
sns.boxplot(
    data=df, 
    x='Stress_Detection', 
    y='Screen_Time',
    order=stress_order,
    palette=colors,
    ax=ax1
)
ax1.set_title('H3: Screen Time by Stress Level', fontsize=12, fontweight='bold')
ax1.set_xlabel('Stress Level', fontsize=10)
ax1.set_ylabel('Screen Time (hours)', fontsize=10)

# Add median annotations
for i, stress_level in enumerate(stress_order):
    median_val = df[df['Stress_Detection'] == stress_level]['Screen_Time'].median()
    ax1.annotate(
        f'Median: {median_val:.1f}h',
        xy=(i, median_val),
        xytext=(i + 0.25, median_val + 0.3),
        fontsize=9,
        ha='left'
    )

# Scatter Plot: Screen Time vs Sleep Duration (coloured by Stress Level)
ax2 = axes[1]
for stress_level in stress_order:
    subset = df[df['Stress_Detection'] == stress_level]
    ax2.scatter(
        subset['Screen_Time'], 
        subset['Sleep_Duration'],
        c=color_dict[stress_level],
        label=stress_level,
        alpha=0.6,
        edgecolors='white',
        s=50
    )

ax2.set_title('H3: Screen Time vs Sleep Duration by Stress Level', fontsize=12, fontweight='bold')
ax2.set_xlabel('Screen Time (hours)', fontsize=10)
ax2.set_ylabel('Sleep Duration (hours)', fontsize=10)
ax2.legend(title='Stress Level')

plt.tight_layout()
plt.show()

In [None]:
# H3: Summary Statistics by Stress Group

print("\nH3: Descriptive Statistics - Screen Time by Stress Level")
print("=" * 70)

h3_summary = df.groupby('Stress_Detection')['Screen_Time'].agg(
    ['count', 'mean', 'median', 'std', 'min', 'max']
).round(2)

# Reorder index
h3_summary = h3_summary.reindex(['Low', 'Medium', 'High'])
h3_summary.columns = ['Count', 'Mean', 'Median', 'Std Dev', 'Min', 'Max']

print(h3_summary)

In [None]:
# H3: Stress Level Proportions by Screen Time Bin

# Use the binned screen time from ETL (the plot is skipped if the column is missing)
if 'Screen_Time_Bin' in df.columns:
    screen_time_stress = df.groupby('Screen_Time_Bin')['Stress_Detection'].value_counts(normalize=True).unstack(fill_value=0)
    
    fig, ax = plt.subplots(figsize=(10, 6))
    
    # Plot percentage of each stress level by screen time bin
    screen_time_stress[['Low', 'Medium', 'High']].plot(
        kind='bar', 
        stacked=True, 
        ax=ax, 
        color=colors,
        edgecolor='black'
    )
    
    ax.set_title('H3: Stress Level Distribution by Screen Time Category', fontsize=12, fontweight='bold')
    ax.set_xlabel('Screen Time Category', fontsize=10)
    ax.set_ylabel('Proportion', fontsize=10)
    ax.legend(title='Stress Level', bbox_to_anchor=(1.02, 1), loc='upper left')
    ax.tick_params(axis='x', rotation=45)
    
    # Add proportion labels (values between 0 and 1) to each bar segment
    for c in ax.containers:
        ax.bar_label(c, fmt='%.2f', label_type='center', fontsize=8)
    
    plt.tight_layout()
    plt.show()
else:
    print("Screen_Time_Bin not found in dataset. Run ETL notebook first.")

#### H3 Conclusion

Based on the statistical tests and visualisations:

- **Findings:** [To be interpreted after running the notebook with actual data]
    - The Kruskal-Wallis test indicates whether screen time differs significantly across stress groups.
    - The Spearman correlation quantifies the relationship direction and strength.
    - The scatter plot reveals potential interactions with sleep duration.

- **Support for H3:** 
    - If p < 0.05 for both tests and the High stress group shows higher median screen time with a positive correlation, H3 is supported.

- **Implications for modelling:**
    - Screen_Time should be considered as a feature in the predictive model.
    - The Screen_Activity_Ratio (screen time to physical activity) may capture additional signal.
    - Potential interaction with sleep variables could be explored.

---

### H4: Meditation practice is linked to lower stress

- **Aim:** To test whether meditation practice is associated with lower stress levels.
- **Test details:**
    - **Variable tested:** Meditation_Practice (binary: 0 = No, 1 = Yes)
    - **Test used:** Chi-square test of independence
    - **Reason for Chi-square:** Both variables are categorical (Meditation: Yes/No, Stress: Low/Medium/High). Chi-square tests whether the distribution of stress levels differs significantly between those who meditate and those who don't.
    - **Effect size:** Cramér's V measures the strength of association.
- **Expected outcome:** If H4 is supported, we expect meditators to have a higher proportion of Low stress and a lower proportion of High stress compared to non-meditators.
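
To make the effect-size step concrete, the sketch below computes the Chi-square statistic and Cramér's V for a hypothetical 2×3 table. The counts are invented for illustration and are not project results.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts (illustrative only, not project data)
toy_table = pd.DataFrame(
    [[30, 40, 50],
     [45, 35, 20]],
    index=["No Meditation", "Meditates"],
    columns=["Low", "Medium", "High"],
)

chi2_stat, p_value, dof, expected = chi2_contingency(toy_table)

# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
n = toy_table.to_numpy().sum()
min_dim = min(toy_table.shape) - 1
cramers_v_value = np.sqrt(chi2_stat / (n * min_dim))

print(f"chi2 = {chi2_stat:.2f}, dof = {dof}, V = {cramers_v_value:.3f}")
```

For a table with two rows, `min_dim` is 1, so V reduces to the square root of chi-square divided by the sample size.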

In [None]:
# H4: Statistical Test - Chi-square test of independence

# Create contingency table
contingency_table = pd.crosstab(
    df['Meditation_Practice'], 
    df['Stress_Detection']
)

# Reorder columns
contingency_table = contingency_table[['Low', 'Medium', 'High']]

# Rename index for clarity
contingency_table.index = ['No Meditation', 'Meditates']

print("H4: Meditation practice is linked to lower stress")
print("=" * 60)
print("\nContingency Table (Counts):")
print("-" * 40)
print(contingency_table)

# Show proportions within each meditation group
print("\nProportions within each group:")
print("-" * 40)
print(contingency_table.div(contingency_table.sum(axis=1), axis=0).round(3))

In [None]:
# H4: Perform Chi-square test

chi2, p_value, dof, expected = chi2_contingency(contingency_table)

# Calculate Cramér's V effect size using the helper defined earlier
cramers_v_value = cramers_v(contingency_table)
effect_interpretation = interpret_cramers_v(cramers_v_value)

print("\nTest: Chi-square test of independence")
print("-" * 40)
print(f"Chi-square statistic: {chi2:.3f}")
print(f"p-value: {p_value:.2e}")
print(f"Degrees of freedom: {dof}")
print(f"Cramér's V: {cramers_v_value:.4f} ({effect_interpretation} effect)")

print("\nExpected frequencies (if no association):")
print("-" * 40)
expected_df = pd.DataFrame(
    expected, 
    index=['No Meditation', 'Meditates'],
    columns=['Low', 'Medium', 'High']
)
print(expected_df.round(1))

# Check for small expected cell counts
if (expected < 5).any():
    print("\n⚠️  Warning: Some expected cell counts are less than 5.")
    print("   Consider Fisher's Exact Test for more reliable results.")

#### Interpretation

Based on the Chi-square test results:

- **Statistical significance:** If p < 0.05, there is a statistically significant association between meditation practice and stress level distribution.
- **Effect size:** Cramér's V indicates the strength of the association (0.1 = small, 0.3 = medium, 0.5 = large).
- **Direction:** Compare the proportions to determine if meditators have lower stress as hypothesised.

The interpretation will be completed after viewing the actual results.

#### H4 Visualisation: Stress Distribution by Meditation Practice

- **Stacked bar chart** shows the proportion of each stress level within meditation vs non-meditation groups.
- **Grouped bar chart** provides an alternative view comparing counts.
- **Insights:** Look for:
    - Higher proportion of Low stress among meditators
    - Lower proportion of High stress among meditators
    - Clear visual difference between the two groups

In [None]:
# H4 Visualisation: Stress Distribution by Meditation Practice

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Define colors for stress levels
colors = ['#2ecc71', '#f39c12', '#e74c3c']  # Green, Orange, Red

# Stacked Bar Chart (Proportions)
ax1 = axes[0]
proportions = contingency_table.div(contingency_table.sum(axis=1), axis=0)
proportions.plot(
    kind='bar', 
    stacked=True, 
    ax=ax1, 
    color=colors,
    edgecolor='black'
)
ax1.set_title('H4: Stress Distribution by Meditation Practice (Proportions)', fontsize=12, fontweight='bold')
ax1.set_xlabel('Meditation Practice', fontsize=10)
ax1.set_ylabel('Proportion', fontsize=10)
ax1.legend(title='Stress Level', bbox_to_anchor=(1.02, 1), loc='upper left')
ax1.tick_params(axis='x', rotation=0)
ax1.set_ylim(0, 1)

# Add proportion labels (values between 0 and 1) to each bar segment
for c in ax1.containers:
    ax1.bar_label(c, fmt='%.2f', label_type='center', fontsize=9)

# Grouped Bar Chart (Counts)
ax2 = axes[1]
contingency_table.plot(
    kind='bar', 
    ax=ax2, 
    color=colors,
    edgecolor='black'
)
ax2.set_title('H4: Stress Counts by Meditation Practice', fontsize=12, fontweight='bold')
ax2.set_xlabel('Meditation Practice', fontsize=10)
ax2.set_ylabel('Count', fontsize=10)
ax2.legend(title='Stress Level', bbox_to_anchor=(1.02, 1), loc='upper left')
ax2.tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show()

In [None]:
# H4: Summary Statistics

print("\nH4: Stress Distribution Summary by Meditation Practice")
print("=" * 70)

# Create summary with counts and percentages
meditation_summary = df.groupby('Meditation_Practice')['Stress_Detection'].value_counts().unstack(fill_value=0)
meditation_summary = meditation_summary[['Low', 'Medium', 'High']]
meditation_summary.index = ['No Meditation (0)', 'Meditates (1)']

# Add totals
meditation_summary['Total'] = meditation_summary.sum(axis=1)

print("\nCounts:")
print(meditation_summary)

# Calculate and display High Stress rate
print("\nHigh Stress Rate by Meditation Practice:")
print("-" * 40)
for idx in meditation_summary.index:
    high_rate = meditation_summary.loc[idx, 'High'] / meditation_summary.loc[idx, 'Total'] * 100
    print(f"{idx}: {high_rate:.1f}% High Stress")

#### H4 Conclusion

Based on the statistical test and visualisations:

- **Findings:** [To be interpreted after running the notebook with actual data]
    - The Chi-square test indicates whether meditation practice is significantly associated with stress level distribution.
    - Cramér's V indicates the strength of the association.
    - The stacked bar chart visually shows the difference in stress proportions.

- **Support for H4:** 
    - If p < 0.05 and meditators show a higher proportion of Low stress (or lower proportion of High stress), H4 is supported.
    - The effect size indicates whether this is a meaningful practical difference.

- **Implications for modelling:**
    - Meditation_Practice should be considered as a feature in the predictive model.
    - This binary feature may be particularly useful given its interpretability.
    - May interact with other lifestyle factors (e.g., physical activity, exercise type).

---

## Summary and Next Steps

### Hypotheses Summary (H1-H4)

| Hypothesis | Variable | Test | p-value | Effect Size | Supported? |
|------------|----------|------|---------|-------------|------------|
| H1: Lower sleep duration → higher stress | Sleep_Duration | Kruskal-Wallis | [result] | [result] | [Yes/No] |
| H2: Lower sleep quality → higher stress | Sleep_Quality | Kruskal-Wallis | [result] | [result] | [Yes/No] |
| H3: Higher screen time → higher stress | Screen_Time | Kruskal-Wallis + Spearman | [result] | [result] | [Yes/No] |
| H4: Meditation → lower stress | Meditation_Practice | Chi-square | [result] | [result] | [Yes/No] |

### Key Insights So Far

1. **Sleep factors (H1, H2):** [To be completed after analysis]
2. **Screen time (H3):** [To be completed after analysis]
3. **Meditation (H4):** [To be completed after analysis]

### Next Steps

Continue with the remaining hypotheses:
- **H5:** Higher physical activity is linked to lower stress
- **H6:** Higher caffeine intake is linked to higher stress
- **H7:** Longer work hours and travel time are linked to higher stress
- **H8:** Health indicators differ across stress groups