In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats
from statsmodels.stats.weightstats import ztest

In [2]:
df = pd.read_csv("data/cleaned_student_lifestyle_dataset.csv")

In [3]:
df.head()

Unnamed: 0,student_id,study_hours_per_day,extracurricular_hours_per_day,sleep_hours_per_day,social_hours_per_day,physical_activity_hours_per_day,gpa,stress_level,stress_level_numeric,gpa_group
0,1,6.9,3.8,8.7,2.8,1.8,2.99,Moderate,2,2.5-3
1,2,5.3,3.5,8.0,4.2,3.0,2.75,Low,1,2.5-3
2,3,5.1,3.9,9.2,1.2,4.6,2.67,Low,1,2.5-3
3,4,6.5,2.1,7.2,1.7,6.5,2.88,Moderate,2,2.5-3
4,5,8.1,0.6,6.5,2.2,6.6,3.51,High,3,3.5-4


Let's plot the data first to get a sense of what results to expect.

1. Histogram of sleep hours

### 1. Can we reject the hypothesis that the students get healthy 7 hours of sleep per day on average?


Hypothesis testing
H0: μ <= 7
H1: μ > 7

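Where μ is the population mean of sleep hours per day.

Since there are 2000 observations, we can use the z-test for large samples: by the central limit theorem (CLT), the sample mean is approximately normally distributed.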

In [4]:
z_stat, p_value = ztest(df['sleep_hours_per_day'], value=7, alternative='larger')

In [5]:
z_stat, p_value

(np.float64(15.343854306097922), np.float64(1.9468654098786628e-53))
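
To make the arithmetic behind this result visible, here is a minimal sketch of an upper-tailed one-sample z-test written with the standard library only. It assumes `ztest` uses the sample standard deviation (`ddof=1`, its default) for the standard error. The helper name `one_sample_z_upper` and the toy values are hypothetical and are not taken from the dataset; the toy sample is far too small for the CLT argument and is only there to show the calculation.

```python
import math
from statistics import NormalDist, mean, stdev


def one_sample_z_upper(sample, mu0):
    """Upper-tailed one-sample z-test: H0: mu <= mu0 vs H1: mu > mu0."""
    n = len(sample)
    # Standard error from the sample standard deviation (ddof=1)
    se = stdev(sample) / math.sqrt(n)
    z = (mean(sample) - mu0) / se
    # P(Z > z) equals P(Z < -z) by symmetry; this avoids computing 1 - cdf(z)
    p = NormalDist().cdf(-z)
    return z, p


# Hypothetical sleep-hour values, for illustration only (not from the dataset)
toy_sample = [7.4, 8.1, 6.9, 7.8, 7.2, 8.4, 7.6, 7.0, 7.9, 7.5]
z, p = one_sample_z_upper(toy_sample, mu0=7)
print(f"z = {z:.3f}, p = {p:.4g}")
```

Writing the tail probability as `cdf(-z)` rather than `1 - cdf(z)` keeps very small p-values, like the one above, from rounding to exactly zero.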

In [6]:
mean_sleep = df['sleep_hours_per_day'].mean()
std_sleep = df['sleep_hours_per_day'].std()
n = len(df['sleep_hours_per_day'])

In [9]:
for conf in [0.90, 0.95, 0.99]:
    # For Z-test, use the normal distribution directly
    confidence_interval = stats.norm.interval(conf, loc=mean_sleep, scale=std_sleep/np.sqrt(n))
    print(f"Confidence interval at {int(conf * 100)}%: {confidence_interval[0]:.2f} to {confidence_interval[1]:.2f}")

Confidence interval at 90%: 7.45 to 7.55
Confidence interval at 95%: 7.44 to 7.57
Confidence interval at 99%: 7.42 to 7.59


### Interpretation of Results

#### Z-statistic: 15.34
This indicates that the sample mean is significantly greater than the hypothesized mean of 7 hours. The large positive value shows strong evidence against the null hypothesis.

#### P-value: 1.95e-53
The p-value is extremely small (essentially zero), providing overwhelming evidence to reject the null hypothesis that students sleep 7 hours or fewer per day on average.

#### Confidence Intervals:
- **90% Confidence Interval: 7.45 to 7.55**: We are 90% confident that the true mean sleep hours per day lies between 7.45 and 7.55 hours.
- **95% Confidence Interval: 7.44 to 7.57**: We are 95% confident that the true mean sleep hours per day lies between 7.44 and 7.57 hours.
- **99% Confidence Interval: 7.42 to 7.59**: We are 99% confident that the true mean sleep hours per day lies between 7.42 and 7.59 hours.

Since all these intervals lie above 7 hours (and below 8 hours), they are consistent with rejecting the null hypothesis: students sleep more than 7 hours per day on average, though less than 8.
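
The intervals above are two-sided, while the test is one-sided (H1: μ > 7). The interval that matches a one-sided test at level α is a one-sided lower confidence bound at confidence 1 − α. The sketch below shows that calculation with the standard library; the helper name `lower_confidence_bound` and the summary statistics are hypothetical round numbers, not values read from the dataset.

```python
import math
from statistics import NormalDist


def lower_confidence_bound(sample_mean, sample_std, n, conf):
    """One-sided lower bound: mean - z_conf * std / sqrt(n)."""
    z_crit = NormalDist().inv_cdf(conf)  # e.g. about 1.645 for conf=0.95
    return sample_mean - z_crit * sample_std / math.sqrt(n)


# Hypothetical summary statistics, for illustration only
for conf in (0.90, 0.95, 0.99):
    bound = lower_confidence_bound(sample_mean=7.5, sample_std=1.46, n=2000, conf=conf)
    print(f"{int(conf * 100)}% lower bound: {bound:.2f}")
```

If the lower bound at confidence 1 − α is above 7, the upper-tailed test rejects H0 at level α.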

### Conclusion
We can reject the hypothesis that students get an average of 7 or fewer hours of sleep per day. The data show that students sleep significantly more than 7 hours on average. The estimated average sleep duration (likely between 7.42 and 7.59 hours) is above the recommended 7 hours of sleep for optimal health.

In [11]:
df['gpa_group'] = df['gpa_group'].astype('category')

Function to run a two-sided one-sample t-test (H0: μ = 8) on sleep hours within one GPA group and calculate confidence intervals for the group mean

In [12]:
def analyze_sleep_by_gpa_group(df, group):
    """Two-sided one-sample t-test of H0: mean sleep = 8 hours for one GPA group."""
    group_data = df[df['gpa_group'] == group]['sleep_hours_per_day']
    # ttest_1samp is two-sided by default
    t_stat, p_value = stats.ttest_1samp(group_data, 8)
    mean_sleep = group_data.mean()
    std_sleep = group_data.std()  # sample standard deviation (ddof=1)
    n = len(group_data)

    # t-based confidence intervals for the group mean, n-1 degrees of freedom
    confidence_intervals = {}
    for conf in [0.90, 0.95, 0.99]:
        confidence_interval = stats.t.interval(conf, n-1, loc=mean_sleep, scale=std_sleep/np.sqrt(n))
        confidence_intervals[f"{int(conf * 100)}%"] = confidence_interval

    return t_stat, p_value, confidence_intervals

Analyze each GPA group

In [13]:
results = {}
for group in df['gpa_group'].cat.categories:
    t_stat, p_value, confidence_intervals = analyze_sleep_by_gpa_group(df, group)
    results[group] = {
        'T-statistic': t_stat,
        'P-value': p_value,
        'Confidence Intervals': confidence_intervals
    }

In [14]:
for group, result in results.items():
    print(f"GPA Group: {group}")
    print(f"T-statistic: {result['T-statistic']:.2f}")
    print(f"P-value: {result['P-value']:.2f}")
    for conf, interval in result['Confidence Intervals'].items():
        print(f"{conf} Confidence Interval: {interval[0]:.2f} to {interval[1]:.2f}")
    print()

GPA Group: 2-2.5
T-statistic: -1.95
P-value: 0.06
90% Confidence Interval: 7.10 to 7.94
95% Confidence Interval: 7.01 to 8.02
99% Confidence Interval: 6.84 to 8.19

GPA Group: 2.5-3
T-statistic: -8.32
P-value: 0.00
90% Confidence Interval: 7.44 to 7.63
95% Confidence Interval: 7.43 to 7.65
99% Confidence Interval: 7.39 to 7.68

GPA Group: 3-3.5
T-statistic: -12.59
P-value: 0.00
90% Confidence Interval: 7.37 to 7.52
95% Confidence Interval: 7.36 to 7.53
99% Confidence Interval: 7.33 to 7.56

GPA Group: 3.5-4
T-statistic: -3.00
P-value: 0.00
90% Confidence Interval: 7.52 to 7.86
95% Confidence Interval: 7.48 to 7.89
99% Confidence Interval: 7.42 to 7.96
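
The `P-value: 0.00` lines above come from the `:.2f` format, which rounds any p-value below 0.005 to zero and hides its magnitude. A minimal sketch of a formatter that switches to scientific notation for small values follows; the helper name `format_p_value`, its `threshold` parameter, and the sample p-values are hypothetical.

```python
def format_p_value(p, threshold=0.001):
    """Fixed-point for ordinary p-values, scientific notation for tiny ones."""
    return f"{p:.3f}" if p >= threshold else f"{p:.2e}"


# Hypothetical p-values, for illustration only
for p in (0.06, 0.0031, 2.5e-16):
    print(f"{p:.2f}  ->  {format_p_value(p)}")
```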



### Interpretation of Results

#### GPA Group: 2-2.5
- **T-statistic**: -1.95
- **P-value**: 0.06
- **90% Confidence Interval**: 7.10 to 7.94
- **95% Confidence Interval**: 7.01 to 8.02
- **99% Confidence Interval**: 6.84 to 8.19

#### GPA Group: 2.5-3
- **T-statistic**: -8.32
- **P-value**: 0.00
- **90% Confidence Interval**: 7.44 to 7.63
- **95% Confidence Interval**: 7.43 to 7.65
- **99% Confidence Interval**: 7.39 to 7.68

#### GPA Group: 3-3.5
- **T-statistic**: -12.59
- **P-value**: 0.00
- **90% Confidence Interval**: 7.37 to 7.52
- **95% Confidence Interval**: 7.36 to 7.53
- **99% Confidence Interval**: 7.33 to 7.56

#### GPA Group: 3.5-4
- **T-statistic**: -3.00
- **P-value**: 0.00
- **90% Confidence Interval**: 7.52 to 7.86
- **95% Confidence Interval**: 7.48 to 7.89
- **99% Confidence Interval**: 7.42 to 7.96

### Conclusion
The null hypothesis in each group is that mean sleep equals 8 hours (two-sided test). For GPA group 2-2.5, the evidence is not strong enough to reject it at the 5% significance level (p = 0.06), although the 90% confidence interval (7.10 to 7.94) excludes 8, so it would be rejected at the 10% level. For GPA groups 2.5-3, 3-3.5, and 3.5-4, the t-statistics are negative and the p-values round to 0.00, indicating strong evidence against the null hypothesis: the average sleep duration for these groups is significantly less than 8 hours.
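
The conclusion is directional ("less than 8 hours"), while `stats.ttest_1samp` is two-sided by default. As a sketch, the one-sided version can be requested with the `alternative` argument (available in SciPy 1.6 and later). The sample values below are hypothetical and are not taken from any GPA group.

```python
from scipy import stats

# Hypothetical sleep-hour values for one group (not from the dataset)
group_sleep = [7.1, 6.8, 8.2, 7.5, 7.9, 6.5, 8.4, 7.2, 7.7, 7.0]

# Two-sided test, H1: mean != 8 (the default used in the function above)
two_sided = stats.ttest_1samp(group_sleep, popmean=8)
# One-sided test, H1: mean < 8
one_sided = stats.ttest_1samp(group_sleep, popmean=8, alternative="less")

print(f"t = {two_sided.statistic:.3f}")
print(f"two-sided p = {two_sided.pvalue:.4f}")
print(f"one-sided p (H1: mean < 8) = {one_sided.pvalue:.4f}")
```

When the t-statistic is negative, the one-sided p-value for `alternative="less"` is half the two-sided one. The direction of the alternative should be fixed before looking at the data.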

### 2. Can we reject the hypothesis that the proportion of students getting at least 8 hours of sleep per day is 50%?

In [15]:
n = len(df)  # Total number of students (2000)
X = (df['sleep_hours_per_day'] >= 8).sum()  # Number of students sleeping at least 8 hours
p_hat = X / n  # Sample proportion
p_0 = 0.5  # Null hypothesis proportion

Check the assumptions of the test: the normal approximation requires both n·p0 and n·(1 − p0) to exceed 5.

In [21]:
if n * p_0 > 5 and n * (1 - p_0) > 5:
    print("Assumptions met: Proceed with the test.")
else:
    print("Assumptions not met: Cannot proceed with the test.")

Assumptions met: Proceed with the test.


Define several significance levels so that the hypothesis can be tested at each of them.

In [16]:
significance_level = [0.01, 0.05, 0.1]

In [17]:
for alpha in significance_level:
    # Compute test statistic (z-score)
    z_stat = (p_hat - p_0) / np.sqrt(p_0 * (1 - p_0) / n)

    # Compute p-value (two-tailed test)
    p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))

    # Compute confidence interval
    z_critical = stats.norm.ppf(1 - alpha / 2)
    margin_of_error = z_critical * np.sqrt(p_hat * (1 - p_hat) / n)
    confidence_interval = (p_hat - margin_of_error, p_hat + margin_of_error)

    # Output results
    print(f"Significance Level: {alpha}")
    print(f"Z-Statistic: {z_stat:.4f}")
    print(f"P-Value: {p_value:.4f}")
    print(f"{int((1 - alpha) * 100)}% Confidence Interval: {confidence_interval[0]:.4f} to {confidence_interval[1]:.4f}")
    if p_value < alpha:
        print("Reject the null hypothesis: The proportion is significantly different from 50%.")
    else:
        print("Fail to reject the null hypothesis: No significant evidence that the proportion is different from 50%.")
    print()


Significance Level: 0.01
Z-Statistic: -7.5579
P-Value: 0.0000
99% Confidence Interval: 0.3871 to 0.4439
Reject the null hypothesis: The proportion is significantly different from 50%.

Significance Level: 0.05
Z-Statistic: -7.5579
P-Value: 0.0000
95% Confidence Interval: 0.3939 to 0.4371
Reject the null hypothesis: The proportion is significantly different from 50%.

Significance Level: 0.1
Z-Statistic: -7.5579
P-Value: 0.0000
90% Confidence Interval: 0.3974 to 0.4336
Reject the null hypothesis: The proportion is significantly different from 50%.



### Conclusion
Across all significance levels (0.01, 0.05, 0.1), the null hypothesis is rejected. This means there is strong statistical evidence that the proportion of students sleeping at least 8 hours per day differs from 50%. Even the widest (99%) confidence interval, approximately 38.7% to 44.4%, lies well below 50%. This suggests that fewer than half of the students are getting at least 8 hours of sleep per day.
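
The test statistic and p-value do not depend on the significance level, so they only need to be computed once; only the confidence interval changes with α. The sketch below separates the two steps using the standard library. The helper names `proportion_z_test` and `wald_interval` are hypothetical, and the count of 831 students is an assumption back-computed from the printed z-statistic (−7.5579 with n = 2000), not a value read from the data.

```python
import math
from statistics import NormalDist


def proportion_z_test(successes, n, p0=0.5):
    """Two-sided one-sample z-test for a proportion, H0: p = p0."""
    p_hat = successes / n
    # The standard error for the test uses p0, the value assumed under H0
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    # Two-sided tail probability, written without the 1 - cdf subtraction
    p_value = 2 * NormalDist().cdf(-abs(z))
    return p_hat, z, p_value


def wald_interval(p_hat, n, alpha):
    """Wald confidence interval; its standard error uses p_hat, not p0."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z_crit * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# ASSUMPTION: 831 is inferred from the printed z-statistic, not read from the data
p_hat, z, p_value = proportion_z_test(successes=831, n=2000)
print(f"p_hat = {p_hat:.4f}, z = {z:.4f}, p = {p_value:.2e}")
for alpha in (0.01, 0.05, 0.1):
    lo, hi = wald_interval(p_hat, 2000, alpha)
    print(f"{int((1 - alpha) * 100)}% CI: {lo:.4f} to {hi:.4f}")
```

Printing the p-value in scientific notation shows its order of magnitude, which `0.0000` hides.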

Function to test the hypothesis for each GPA group

In [18]:
def test_proportion_by_gpa_group(df, group_column, sleep_column, threshold, p_0=0.5, significance_levels=(0.01, 0.05, 0.1)):
    results = {}
    for group in df[group_column].unique():
        group_data = df[df[group_column] == group]
        n = len(group_data)  # Total number of students in the group
        X = sum(group_data[sleep_column] >= threshold)  # Number of students sleeping at least the threshold
        p_hat = X / n  # Sample proportion

        group_results = {}
        for alpha in significance_levels:
            # Compute test statistic (z-score)
            z_stat = (p_hat - p_0) / np.sqrt(p_0 * (1 - p_0) / n)

            # Compute p-value (two-tailed test)
            p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))

            # Compute confidence interval
            z_critical = stats.norm.ppf(1 - alpha / 2)
            margin_of_error = z_critical * np.sqrt(p_hat * (1 - p_hat) / n)
            confidence_interval = (p_hat - margin_of_error, p_hat + margin_of_error)

            # Store results
            group_results[alpha] = {
                "Z-Statistic": z_stat,
                "P-Value": p_value,
                "Confidence Interval": confidence_interval,
                "Reject Null": p_value < alpha
            }
        results[group] = group_results
    return results


In [19]:
results = test_proportion_by_gpa_group(df, group_column='gpa_group', sleep_column='sleep_hours_per_day', threshold=8)

In [20]:
for group, group_results in results.items():
    print(f"GPA Group: {group}")
    for alpha, result in group_results.items():
        print(f"  Significance Level: {alpha}")
        print(f"    Z-Statistic: {result['Z-Statistic']:.4f}")
        print(f"    P-Value: {result['P-Value']:.4f}")
        print(f"    Confidence Interval: {result['Confidence Interval'][0]:.4f} to {result['Confidence Interval'][1]:.4f}")
        print(f"    Reject Null Hypothesis: {result['Reject Null']}")
    print()

GPA Group: 2.5-3
  Significance Level: 0.01
    Z-Statistic: -3.7499
    P-Value: 0.0002
    Confidence Interval: 0.3807 to 0.4773
    Reject Null Hypothesis: True
  Significance Level: 0.05
    Z-Statistic: -3.7499
    P-Value: 0.0002
    Confidence Interval: 0.3922 to 0.4657
    Reject Null Hypothesis: True
  Significance Level: 0.1
    Z-Statistic: -3.7499
    P-Value: 0.0002
    Confidence Interval: 0.3981 to 0.4598
    Reject Null Hypothesis: True

GPA Group: 3.5-4
  Significance Level: 0.01
    Z-Statistic: -0.8402
    P-Value: 0.4008
    Confidence Interval: 0.3806 to 0.5606
    Reject Null Hypothesis: False
  Significance Level: 0.05
    Z-Statistic: -0.8402
    P-Value: 0.4008
    Confidence Interval: 0.4021 to 0.5391
    Reject Null Hypothesis: False
  Significance Level: 0.1
    Z-Statistic: -0.8402
    P-Value: 0.4008
    Confidence Interval: 0.4131 to 0.5281
    Reject Null Hypothesis: False

GPA Group: 3-3.5
  Significance Level: 0.01
    Z-Statistic: -6.6895
    P-Value:

### Interpretation of Sleep Patterns Across GPA Groups

**Statistical Findings:**
- **Middle GPA ranges (2.5-3 and 3-3.5)**: Significantly fewer than 50% of students get the recommended 8+ hours of sleep.
- **Highest and lowest GPA ranges (2-2.5 and 3.5-4)**: No statistically significant evidence that the proportion differs from 50%.

**Potential Explanations:**
1. **U-shaped relationship**: Sleep patterns may follow a U-shaped curve relative to academic performance, with both high and low GPA students more likely to prioritize adequate sleep.

2. **Different causes**:
   - High-GPA students (3.5-4) may practice better sleep hygiene as part of overall better self-care.
   - Low-GPA students (2-2.5) might have lower academic demands or prioritize sleep over additional study time.
   - Middle-GPA students (2.5-3.5) may sacrifice sleep to maintain their academic standing through longer study hours.

3. **Time management**:
    Students in the middle GPA ranges might be struggling with time management, trying to balance academic requirements with other commitments.

These findings challenge the simple assumption that **"more sleep equals better grades"** and suggest a more complex relationship between sleep habits and academic performance.