In [11]:
#imported packages
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
%matplotlib inline

print("packages imported")

packages imported


# Problem Statement

Model the total sleep duration of 374 people using linear regression, given the number of minutes engaged in physical activity, self-reported stress level (1-10), and self-reported sleep quality (1-10).

# Variables and Parameters

| Symbol | Description | Type | Dimension | Units |
|---|---|---|---|---|
| $Y$ | The total number of hours slept | dependent variable | $T$ | hours |
| $B_0$ | Intercept: the baseline sleep duration when all predictors are zero | parameter | $T$ | hours |
| $B_1$ | Regression coefficient for physical activity | parameter | 1 | hours/minute |
| $X_1$ | Time engaged in physical activity during the day | independent variable | $T$ | minutes |
| $B_2$ | Regression coefficient for stress level | parameter | $T$ | hours per scale point |
| $X_2$ | The (self-reported) stress level on a scale of 1 to 10 | independent variable | 1 |  |
| $B_3$ | Regression coefficient for quality of sleep | parameter | $T$ | hours per scale point |
| $X_3$ | The (self-reported) quality of sleep on a scale of 1 to 10 | independent variable | 1 |  |
| $\epsilon$ | Residual (error term) | error term | $T$ | hours |
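
Written out with the symbols in the table, the model is assumed to take the standard multiple linear regression form (the notebook does not state the equation explicitly, so this is a reconstruction from the variable definitions):

$$
Y = B_0 + B_1 X_1 + B_2 X_2 + B_3 X_3 + \epsilon
$$

Each term on the right-hand side has units of hours: $B_1 X_1$ is (hours/minute) × minutes, while $B_2 X_2$ and $B_3 X_3$ multiply an hours-valued coefficient by a dimensionless rating.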


# Assumptions and Constraints:

- Other variables that may affect sleep duration, such as timezone, room temperature, and diet, are assumed constant across participants, so only the three predictors above are modeled.
- Assume that the data comes from 374 people of working age with no sleep-related diseases.


# Building the Solution:

In [9]:
# pd.read_csv already returns a DataFrame, so no extra conversion is needed
sleep = pd.read_csv('Sleep.csv')
sleep.head()

| Unnamed: 0 | Duration | Quality | Physical_Activity | Stress |
|---|---|---|---|---|
| 0 | 6.1 | 6 | 42 | 6 |
| 1 | 6.2 | 6 | 60 | 8 |
| 2 | 6.2 | 6 | 60 | 8 |
| 3 | 5.9 | 4 | 30 | 8 |
| 4 | 5.9 | 4 | 30 | 8 |


In [13]:
X = sleep[['Physical_Activity','Quality','Stress']]
Y = sleep['Duration']
X = sm.add_constant(X)
reg = sm.OLS(Y, X).fit()
print(reg.summary())

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | Duration | R-squared: | 0.785 |
| Model: | OLS | Adj. R-squared: | 0.783 |
| Method: | Least Squares | F-statistic: | 450.5 |
| Date: | Wed, 26 Nov 2025 | Prob (F-statistic): | 4.13e-123 |
| Time: | 03:23:22 | Log-Likelihood: | -157.2 |
| No. Observations: | 374 | AIC: | 322.4 |
| Df Residuals: | 370 | BIC: | 338.1 |
| Df Model: | 3 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 3.6737 | 0.402 | 9.132 | 0.000 | 2.883 | 4.465 |
| Physical_Activity | 0.0024 | 0.001 | 2.434 | 0.015 | 0.000 | 0.004 |
| Quality | 0.4981 | 0.039 | 12.663 | 0.000 | 0.421 | 0.575 |
| Stress | -0.0607 | 0.026 | -2.331 | 0.020 | -0.112 | -0.009 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 31.756 | Durbin-Watson: | 0.77 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 35.774 |
| Skew: | 0.729 | Prob(JB): | 1.71e-08 |
| Kurtosis: | 2.59 | Cond. No. | 1340.0 |
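
To make the fitted coefficients concrete, the sketch below evaluates the model by hand for the first row shown by `sleep.head()` (42 minutes of activity, quality 6, stress 6, observed duration 6.1 hours). It uses the rounded coefficients printed in the summary rather than `reg.params`, so the result is approximate; the helper name `predict_duration` is hypothetical and not part of the notebook.

```python
# Rounded coefficients copied from the regression summary (approximate).
COEF = {
    "const": 3.6737,
    "Physical_Activity": 0.0024,  # hours per minute of activity
    "Quality": 0.4981,            # hours per quality point
    "Stress": -0.0607,            # hours per stress point
}

def predict_duration(physical_activity, quality, stress):
    """Predicted sleep duration in hours (hypothetical helper for illustration)."""
    return (COEF["const"]
            + COEF["Physical_Activity"] * physical_activity
            + COEF["Quality"] * quality
            + COEF["Stress"] * stress)

# First row of the data: 42 minutes, quality 6, stress 6, observed 6.1 hours.
predicted = predict_duration(42, 6, 6)
residual = 6.1 - predicted
print(f"predicted = {predicted:.2f} h, residual = {residual:.2f} h")
# prints: predicted = 6.40 h, residual = -0.30 h
```

In the notebook itself, `reg.predict(X)` would give the same predictions at full coefficient precision.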


### Table Analysis
Having built the linear regression model with no feature transformation, we now analyze the summary table. The first point of note is the warning that the condition number is large (1340). We know from class that this can indicate linearly dependent features; it can also arise when features are measured on very different scales, as with minutes of activity versus 1-10 ratings.
Additionally, $R^2 = 0.785$ and $R^2_{adj} = 0.783$ are both relatively high, which indicates that about 78% of the variance in the target variable (sleep duration) is explained by the features of the model.
Finally, the regression coefficient for physical activity (0.0024) is near $0$. This could indicate either that physical activity is a weak predictor of the target variable or that it is not on the same scale as the rest of the data. The coefficient is expressed in hours per minute of activity, so each additional minute is associated with about 0.0024 hours (roughly 9 seconds) more sleep, and its p-value of 0.015 shows it is still statistically significant at the 5% level.
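
One way to probe the scale explanation is to standardize the features and refit. The sketch below is a minimal illustration on synthetic stand-in data, not on `Sleep.csv`: the value ranges and the generating coefficients are assumptions chosen only to resemble the notebook's columns. It uses plain NumPy least squares instead of `sm.OLS` to stay self-contained, and `fit_ols` is a hypothetical helper.

```python
import numpy as np

# Synthetic stand-in data (NOT the Sleep.csv values); ranges are assumptions.
rng = np.random.default_rng(0)
n = 374
activity = rng.uniform(30, 90, n)                # minutes per day
quality = rng.integers(4, 10, n).astype(float)   # rating on a 1-10 scale
stress = rng.integers(3, 9, n).astype(float)     # rating on a 1-10 scale
duration = (3.7 + 0.002 * activity + 0.5 * quality - 0.06 * stress
            + rng.normal(0, 0.3, n))             # hours

def fit_ols(features, target):
    """Least-squares fit with an intercept; returns (design matrix, coefficients)."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design, coef

raw = np.column_stack([activity, quality, stress])
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)  # zero mean, unit variance

design_raw, coef_raw = fit_ols(raw, duration)
design_scaled, coef_scaled = fit_ols(scaled, duration)

# 2-norm condition number of each design matrix.
cond_raw = np.linalg.cond(design_raw)
cond_scaled = np.linalg.cond(design_scaled)

print("condition number, raw features:   ", round(cond_raw, 1))
print("condition number, scaled features:", round(cond_scaled, 1))
print("raw slopes:   ", coef_raw[1:])
print("scaled slopes:", coef_scaled[1:])
```

After standardization each slope is the change in predicted hours per one standard deviation of that feature (the raw slope multiplied by the feature's standard deviation), so the slopes become directly comparable. If the condition number also drops sharply, the warning was driven by feature scale rather than by linear dependence.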

# Analysis and Assessment: