In this lab you'll do a more comprehensive analysis. Although the data comes from Kaggle, this is not a competition. Check out the data source and context here:

https://www.kaggle.com/datasets/riinuanslan/sleep-data-from-fitbit-tracker

It's a relatively simple dataset, but your goal is this: 

Can we predict Sleep Score (posted by the FitBit app) using the other metrics in the dataset? In other words, is there a formula here that the FitBit app uses to compute Sleep Score that we can reverse-engineer?

Two constraints for this assignment:

1. Your modeling efforts must involve bagging and stacking in some way. Otherwise, you may try whatever you like.

2. You are allowed, even encouraged, to compute and/or gather additional features to use as explanatory variables in your model. For example, you might create a variable for the time they went to sleep (as a measure of how "early" they went to bed, or not). There are multiple datasets and you should use all of them, which means you may use the corresponding month for the dataset as a variable as well (or anything related to it).

Your submission should be an HTML or .ipynb file of all of your work.

In [82]:
import warnings

import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingRegressor, StackingRegressor
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RepeatedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Suppress convergence warnings and pandas chained-assignment warnings
warnings.simplefilter("ignore", ConvergenceWarning)
pd.options.mode.chained_assignment = None

# Cleaning data

In [83]:
# load in data

april = pd.read_csv("/Users/chloefeehan/Desktop/python/GSB-545/Labs/Lab-1/data/April sleep data - Sheet1.csv")
december = pd.read_csv("/Users/chloefeehan/Desktop/python/GSB-545/Labs/Lab-1/data/December sleep data - Sheet1.csv")
february = pd.read_csv("/Users/chloefeehan/Desktop/python/GSB-545/Labs/Lab-1/data/February sleep data - Sheet1 (1).csv")
january = pd.read_csv("/Users/chloefeehan/Desktop/python/GSB-545/Labs/Lab-1/data/January sleep data - Sheet1.csv")
march = pd.read_csv("/Users/chloefeehan/Desktop/python/GSB-545/Labs/Lab-1/data/March sleep data - Sheet1.csv")
november = pd.read_csv("/Users/chloefeehan/Desktop/python/GSB-545/Labs/Lab-1/data/November sleep data - Sheet1.csv")

# Add a 'Month' column to each sheet and standardize inconsistent column names
april["Month"] = "April"
april = april.rename(columns={"APRIL": "Day"})
december["Month"] = "December"
december = december.rename(columns={"DECEMBER": "Day"})
february["Month"] = "February"
february = february.rename(columns={"FEBEUARY": "Day"})
february = february.rename(columns={"SLEEP SQORE": "SLEEP SCORE"})
january["Month"] = "January"
january = january.rename(columns={"JANUARY": "Day"})
january = january.rename(columns={"HEART RATE UNDER RESTING": "HEART RATE BELOW RESTING"})
march["Month"] = "March"
march = march.rename(columns={"MARCH": "Day"})
march = march.rename(columns = {"HEARTRATE BELOW RESTING": "HEART RATE BELOW RESTING"})
november["Month"] = "November"
november = november.rename(columns={"NOVEMBER": "Day"})

# create combined dataset
sleep_data = pd.concat([november, december, january, february, march, april], ignore_index=True)
sleep_data = sleep_data.dropna()
sleep_data.head()

# clean sleep data
sleep_data["REM SLEEP"] = sleep_data["REM SLEEP"].str.replace('%', '').astype(float)
sleep_data["DEEP SLEEP"] = sleep_data["DEEP SLEEP"].str.replace('%', '').astype(float)
sleep_data["HEART RATE BELOW RESTING"] = sleep_data["HEART RATE BELOW RESTING"].str.replace('%', '').astype(float)

sleep_data["DATE"] = pd.to_datetime(sleep_data["DATE"], errors= "coerce")
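
The renames above are needed because the monthly sheets spell some headers differently. When `pd.concat` meets a header that only one sheet has, it fills the other sheets' rows with NaN, and the `dropna()` call would then silently discard those rows. The sketch below shows one way to catch that before combining. The two tiny frames, the `CANONICAL` mapping, and the `load_month` helper are hypothetical stand-ins for illustration, not the real sheets.

```python
import pandas as pd

# Hypothetical miniature stand-ins for two monthly sheets (the real files have more columns).
jan = pd.DataFrame({"JANUARY": ["Monday"], "SLEEP SCORE": [85], "HEART RATE UNDER RESTING": ["90%"]})
mar = pd.DataFrame({"MARCH": ["Tuesday"], "SLEEP SCORE": [80], "HEARTRATE BELOW RESTING": ["75%"]})

# Map each header variant to one canonical name (variants taken from the renames above).
CANONICAL = {
    "JANUARY": "Day",
    "MARCH": "Day",
    "HEART RATE UNDER RESTING": "HEART RATE BELOW RESTING",
    "HEARTRATE BELOW RESTING": "HEART RATE BELOW RESTING",
}

def load_month(df, month):
    """Rename headers to their canonical form and tag the rows with the month."""
    out = df.rename(columns=CANONICAL)
    out["Month"] = month
    return out

frames = [load_month(jan, "January"), load_month(mar, "March")]

# Fail fast if any sheet still has a header that the first sheet lacks (or vice versa).
expected = set(frames[0].columns)
for frame in frames:
    mismatch = expected.symmetric_difference(frame.columns)
    assert not mismatch, f"Unexpected columns: {mismatch}"

combined = pd.concat(frames, ignore_index=True)

# Strip the literal '%' sign and convert to float, as in the cleaning step above.
pct = combined["HEART RATE BELOW RESTING"].str.replace("%", "", regex=False).astype(float)
print(combined)
```

With the check in place, a misspelled header raises an error at load time instead of shrinking the dataset later.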


In [84]:

# Split into bedtime and wake time
split_times = sleep_data['SLEEP TIME'].str.split(' - ', expand=True)
sleep_data['bedtime_raw'] = split_times[0]
sleep_data['wakeup_raw'] = split_times[1]

# Append a default am/pm suffix when the raw time string has none
def fix_ampm(time_str, default_ampm):
    if pd.isna(time_str):
        return None
    if 'am' in time_str.lower() or 'pm' in time_str.lower():
        return time_str
    return time_str + default_ampm

# Normalize times that use a dash between hours and minutes to use a colon
def replace_dash_with_colon(time_str):
    if pd.isna(time_str):
        return time_str
    time_str = str(time_str).strip()
    return time_str.replace('-', ':')

sleep_data['bedtime_raw'] = sleep_data['bedtime_raw'].apply(replace_dash_with_colon)
sleep_data['wakeup_raw'] = sleep_data['wakeup_raw'].apply(replace_dash_with_colon)

sleep_data['bedtime_fixed'] = sleep_data['bedtime_raw'].apply(lambda x: fix_ampm(x, 'pm'))
sleep_data['wakeup_fixed'] = sleep_data['wakeup_raw'].apply(lambda x: fix_ampm(x, 'am'))

sleep_data['Bedtime'] = pd.to_datetime(
    sleep_data['DATE'].dt.strftime('%Y-%m-%d') + ' ' + sleep_data['bedtime_fixed'],
    format='%Y-%m-%d %I:%M%p',
    errors='coerce'
)

# create Wakeup time
sleep_data['Wakeup'] = pd.to_datetime(
    sleep_data['DATE'].dt.strftime('%Y-%m-%d') + ' ' + sleep_data['wakeup_fixed'],
    format='%Y-%m-%d %I:%M%p',
    errors='coerce'
)

# If Wakeup is before Bedtime, assume wakeup is the next day
sleep_data.loc[sleep_data['Wakeup'] < sleep_data['Bedtime'], 'Wakeup'] += pd.Timedelta(days=1)

# Create Sleep Duration in hours
sleep_data['Sleep Duration'] = (sleep_data['Wakeup'] - sleep_data['Bedtime']).dt.total_seconds() / 3600

# Convert Bedtime to hours
sleep_data['Bedtime_hours'] = sleep_data['Bedtime'].dt.hour + sleep_data['Bedtime'].dt.minute / 60

# Convert Wakeup to hours
sleep_data['Wakeup_hours'] = sleep_data['Wakeup'].dt.hour + sleep_data['Wakeup'].dt.minute / 60


#sleep_data = pd.get_dummies(sleep_data, columns=['Month', 'Day'], drop_first=True)
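
One caveat about `Bedtime_hours` as computed above: it is a clock value, so a bedtime of 11:45 pm becomes 23.75 while 12:15 am becomes 0.25, even though the two are only half an hour apart. If bedtime is used as a measure of how "early" someone went to bed, one option is to express it in hours relative to midnight. The sketch below assumes that any bedtime before noon belongs to the night that started the previous evening. The noon cutoff, the example timestamps, and the name `hours_from_midnight` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical bedtimes: two before midnight, one shortly after.
bedtimes = pd.Series(pd.to_datetime([
    "2022-01-10 22:30",
    "2022-01-11 23:45",
    "2022-01-12 00:15",
]))

# Clock value, as in Bedtime_hours above.
clock_hours = bedtimes.dt.hour + bedtimes.dt.minute / 60

# Treat times before noon as "after midnight" (add 24), then recenter on midnight,
# so evening bedtimes are negative and after-midnight bedtimes are positive.
hours_from_midnight = clock_hours.where(clock_hours >= 12, clock_hours + 24) - 24

print(hours_from_midnight.tolist())  # [-1.5, -0.25, 0.25]
```

On this scale a later bedtime always has a larger value, which a clock value does not guarantee.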

# Bagging and Stacking Models


In [142]:
# Define X and y
X = sleep_data[["Day", "REM SLEEP", "DEEP SLEEP", "HEART RATE BELOW RESTING", "Month", "Sleep Duration"]]
y = sleep_data["SLEEP SCORE"]

# One-hot encode 'Month' and 'Day' (assuming they're categorical, not numeric ordered)
X_encoded = pd.get_dummies(X, drop_first=True)

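# Scale all features (including the one-hot dummy columns).
# Note: the scaler is fit on the full dataset before the split, so test-set
# statistics inform the scaling; fitting it on X_train only would avoid this leakage.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_encoded)
X_scaled_df = pd.DataFrame(X_scaled, columns=X_encoded.columns, index=X_encoded.index)

# Then split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled_df, y, test_size=0.2, random_state=6)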
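
An alternative to scaling before the split is to place the scaler inside a scikit-learn pipeline, so it is refit on the training portion of every split and never sees held-out rows. The sketch below is a minimal illustration on synthetic data from `make_regression`, not the sleep data. The names `X_demo`, `y_demo`, and `pipe` are hypothetical, and KNN is used only because it is sensitive to feature scale.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded feature matrix and the sleep score.
X_demo, y_demo = make_regression(n_samples=120, n_features=6, noise=10.0, random_state=6)

# Split first, so the test rows play no part in any fitting step.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=6)

# The scaler is a pipeline step, so each CV fold fits it on that fold's training rows only.
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
cv_r2 = cross_val_score(pipe, X_tr, y_tr, cv=5, scoring="r2")
print(f"Mean CV R^2: {cv_r2.mean():.2f}")

# Final fit on the training set; the test set is scaled with training statistics.
pipe.fit(X_tr, y_tr)
test_r2 = pipe.score(X_te, y_te)
print(f"Test R^2: {test_r2:.2f}")
```

The same pipeline object can be passed to `cross_val_score` or used as a base estimator, so the scaling step travels with the model.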

# Bagging Model

In [149]:
model = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=100,
    random_state=6,
    n_jobs=-1
)

model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")

cv_scores = cross_val_score(model, X_scaled, y, cv=5, scoring='r2')
print(f"Average CV R^2 Score (5-fold): {np.mean(cv_scores):.2f}")

Mean Squared Error: 108.69
Average CV R^2 Score (5-fold): 0.43


# Stacking Model


In [None]:

# Define stacking model
def get_stacking():
    # Level-0 (base) learners
    level0 = [
        ('knn', KNeighborsRegressor()),
        ('cart', DecisionTreeRegressor()),
        ('svm', SVR()),
        ('bagging', BaggingRegressor(
            estimator=DecisionTreeRegressor(),
            n_estimators=100,
            random_state=6,
            n_jobs=-1,
        )),
    ]
    # Level-1 (meta) learner combines the base learners' cross-validated predictions
    level1 = LinearRegression()
    return StackingRegressor(estimators=level0, final_estimator=level1, cv=5)

# List of models to evaluate
def get_models():
    return {
        'knn': KNeighborsRegressor(),
        'cart': DecisionTreeRegressor(),
        'svm': SVR(),
        # Note: default settings (10 estimators), unlike the 100-tree bagging model inside the stack
        'bagging': BaggingRegressor(),
        'stacking': get_stacking(),
    }

# Evaluate using cross-validation
def evaluate_model(model, X, y):
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=6)
    scores = cross_val_score(model, X, y, scoring='r2', cv=cv, n_jobs=-1, error_score='raise')
    return scores

# Evaluate models and predict
models = get_models()
results, names = [], []
avg_predictions = {}

for name, model in models.items():
    # Cross-validation
    scores = evaluate_model(model, X_train, y_train)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))

    # Fit on train and predict on test
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    avg_predictions[name] = np.mean(preds)

# Create and print prediction summary
avg_pred_df = pd.DataFrame(list(avg_predictions.items()), columns=['Model', 'Average Predicted Sleep Score'])
print("\nAverage Prediction per Model:")
print(avg_pred_df)


>knn 0.085 (0.270)
>cart 0.272 (0.326)
>svm 0.309 (0.125)
>bagging 0.587 (0.187)
>stacking 0.615 (0.171)

Average Prediction per Model:
      Model  Average Predicted Sleep Score
0       knn                      85.294444
1      cart                      83.305556
2       svm                      84.734516
3   bagging                      83.427778
4  stacking                      83.598362


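Based on the cross-validated results, the stacking model has the highest mean R-squared (0.615), slightly ahead of bagging (0.587) and well ahead of SVM (0.309), CART (0.272), and KNN (0.085). The gap between stacking and bagging is small relative to their fold-to-fold standard deviations (0.171 and 0.187), so stacking is at best marginally more accurate. The average predicted sleep scores are nearly identical across models (about 83 to 85), so they do not help distinguish them. An R-squared near 0.6 also means these features explain only part of the variation in Sleep Score, so the app's formula is not fully recovered.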

# Appendix and References

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingRegressor.html


Generative AI Statement: Generative AI tools were used to streamline processes and troubleshoot errors. They also aided in building the functions used to create variables.