In this lab you'll be doing a more comprehensive analysis. Although the data comes from Kaggle, this is not a competition. Check out the data source and context here:

https://www.kaggle.com/datasets/riinuanslan/sleep-data-from-fitbit-tracker

It's a relatively simple dataset, but your goal is this:

Can we predict Sleep Score (posted by the FitBit app) using the other metrics in the dataset? In other words, is there a formula here that the FitBit app uses to compute Sleep Score that we can reverse-engineer?

Two constraints for this assignment:

1. Your modeling efforts must involve bagging and stacking in some way. Otherwise, you may try whatever you like.

2. You are allowed, even encouraged, to compute and/or gather additional features to use as explanatory variables in your model. For example, you might create a variable for the time they went to sleep (as a measure of how "early" they went to bed, or not). There are multiple datasets and you should use all of them, which means you may use the corresponding month for the dataset as a variable as well (or anything related to it).

Your submission should be an HTML or .ipynb file of all of your work.



##Cleaning the data

In [None]:
import re
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from numpy import mean, std
from sklearn.ensemble import BaggingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import RepeatedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

warnings.filterwarnings('ignore')

In [None]:
april=pd.read_csv('/content/April sleep data - Sheet1.csv')
december=pd.read_csv('/content/December Sleep data - Sheet1.csv')
february=pd.read_csv('/content/February sleep data - Sheet1 (1).csv')
january=pd.read_csv('/content/January sleep data - Sheet1.csv')
march=pd.read_csv('/content/March sleep data - Sheet1.csv')
november=pd.read_csv('/content/November Sleep Data - Sheet1.csv')


In [None]:
february.rename(columns={"SLEEP SQORE": "SLEEP SCORE"}, inplace=True)
february.rename(columns={"FEBEUARY": "FEBRUARY"}, inplace=True)


In [None]:
february.rename(columns={"FEBRUARY": "DAYS OF THE WEEK"}, inplace=True)
april.rename(columns={"APRIL": "DAYS OF THE WEEK"}, inplace=True)
december.rename(columns={"DECEMBER": "DAYS OF THE WEEK"}, inplace=True)
january.rename(columns={"JANUARY": "DAYS OF THE WEEK"}, inplace=True)
march.rename(columns={"MARCH": "DAYS OF THE WEEK"}, inplace=True)
november.rename(columns={"NOVEMBER": "DAYS OF THE WEEK"}, inplace=True)

In [None]:
april["MONTH"] = "APRIL"
december["MONTH"] = "DECEMBER"
february["MONTH"] = "FEBRUARY"
january["MONTH"] = "JANUARY"
march["MONTH"] = "MARCH"
november["MONTH"] = "NOVEMBER"

In [None]:
january.rename(columns={"HEART RATE UNDER RESTING": "HEART RATE BELOW RESTING"}, inplace=True)
march.rename(columns={"HEARTRATE BELOW RESTING": "HEART RATE BELOW RESTING"}, inplace=True)

In [None]:
april = april.dropna()
december = december.dropna()
february = february.dropna()
january = january.dropna()
march = march.dropna()
november = november.dropna()

In [None]:
all_sleep_data = pd.concat([november, december, january, february, march, april], ignore_index=True)
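
The per-month renames, the MONTH column, and the `dropna` calls above repeat the same steps for each file, so they could also be written once as a helper. This is a minimal sketch on made-up toy frames rather than the real CSVs. `COLUMN_FIXES` and `standardize_month` are hypothetical names, and it assumes the month-named column is the first column in each file.

```python
import pandas as pd

# Header typos/variants mapped to one canonical name (taken from the renames above).
COLUMN_FIXES = {
    "SLEEP SQORE": "SLEEP SCORE",
    "HEART RATE UNDER RESTING": "HEART RATE BELOW RESTING",
    "HEARTRATE BELOW RESTING": "HEART RATE BELOW RESTING",
}

def standardize_month(df, month):
    """Return a cleaned copy: canonical headers, a MONTH column, no rows with NaN."""
    out = df.rename(columns=COLUMN_FIXES)
    # Assumption: the first column is headed with the month name and holds the weekday.
    out = out.rename(columns={out.columns[0]: "DAYS OF THE WEEK"})
    out["MONTH"] = month
    return out.dropna()

# Toy stand-ins for two of the monthly files (values are made up).
toy_feb = pd.DataFrame({"FEBEUARY": ["Mon", "Tue"], "SLEEP SQORE": [85, None]})
toy_mar = pd.DataFrame({"MARCH": ["Wed"], "SLEEP SCORE": [80]})

combined = pd.concat(
    [standardize_month(toy_feb, "FEBRUARY"), standardize_month(toy_mar, "MARCH")],
    ignore_index=True,
)
print(combined)
```

Renaming by position avoids having to list every misspelled month header, but it would silently rename the wrong column if a file had a different column order.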

In [None]:
#had chat help with doing this

# Split "SLEEP TIME" into two separate columns
all_sleep_data[["SLEEP START", "SLEEP END"]] = all_sleep_data["SLEEP TIME"].str.split(" - ", expand=True)

# Clean and standardize time strings
def fix_time_string(s):
    if pd.isna(s):
        return s
    s = s.strip().lower()
    s = re.sub(r"(\d{1,2})-(\d{2})(am|pm)", r"\1:\2\3", s)
    s = re.sub(r"[^\dxapm:]", "", s)
    return s

all_sleep_data["SLEEP START"] = all_sleep_data["SLEEP START"].apply(fix_time_string)
all_sleep_data["SLEEP END"] = all_sleep_data["SLEEP END"].apply(fix_time_string)

# Convert times to datetime, then to string format
all_sleep_data["SLEEP START"] = pd.to_datetime(all_sleep_data["SLEEP START"], errors="coerce").dt.strftime("%H:%M")
all_sleep_data["SLEEP END"] = pd.to_datetime(all_sleep_data["SLEEP END"], errors="coerce").dt.strftime("%H:%M")

# Fix time strings to have seconds (e.g., HH:MM:SS)
all_sleep_data['HOURS OF SLEEP'] = all_sleep_data['HOURS OF SLEEP'].apply(
    lambda x: x if len(x.split(':')) == 3 else f"{x}:00"
)

# Convert sleep duration to float hours
all_sleep_data['HOURS OF SLEEP'] = pd.to_timedelta(all_sleep_data['HOURS OF SLEEP']).dt.total_seconds() / 3600

# Convert percentage columns to float
for col in ['REM SLEEP', 'DEEP SLEEP', 'HEART RATE BELOW RESTING']:
    all_sleep_data[col] = all_sleep_data[col].str.replace('%', '', regex=False).astype(float)

# Convert sleep start and end times to float hours
all_sleep_data['SLEEP START'] = pd.to_datetime(all_sleep_data['SLEEP START'].astype(str), format='%H:%M', errors='coerce')
all_sleep_data['SLEEP END'] = pd.to_datetime(all_sleep_data['SLEEP END'].astype(str), format='%H:%M', errors='coerce')

all_sleep_data['SLEEP START'] = all_sleep_data['SLEEP START'].dt.hour + all_sleep_data['SLEEP START'].dt.minute / 60
all_sleep_data['SLEEP END'] = all_sleep_data['SLEEP END'].dt.hour + all_sleep_data['SLEEP END'].dt.minute / 60
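
The assignment suggests a feature for how "early" the person went to bed. As an hour of the day, SLEEP START wraps around at midnight, so 11:45 pm (23.75) looks far from 12:30 am (0.5) even though the two bedtimes are 45 minutes apart. One possible fix, sketched here on made-up values, is to measure bedtime as hours since noon. The helper `hours_after_noon` and the column name `BEDTIME AFTER NOON` are hypothetical.

```python
import pandas as pd

def hours_after_noon(start_hour):
    """Map hour-of-day (0-24) to hours elapsed since noon, so later bedtimes get larger values."""
    return (start_hour - 12) % 24

# Made-up bedtimes as float hours: 9:30 pm, 11:45 pm, 12:30 am, 1:15 am
toy = pd.DataFrame({"SLEEP START": [21.5, 23.75, 0.5, 1.25]})
toy["BEDTIME AFTER NOON"] = hours_after_noon(toy["SLEEP START"])
print(toy)
```

This assumes nobody goes to bed between noon and the early evening in a way that matters; naps or daytime sleep would need a different reference point.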


##Models

###Bagging

In [None]:


# Drop rows with missing sleep score
all_sleep_data = all_sleep_data.dropna(subset=['SLEEP SCORE'])

# Separate features and target
X = all_sleep_data.drop(columns=['SLEEP SCORE', 'SLEEP TIME'])  # drop target + any non-feature columns
y = all_sleep_data['SLEEP SCORE']

# One-hot encode categorical variables
X = pd.get_dummies(X, drop_first=True)

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
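
Note that the scaler above is fit on all rows before the split, so the test rows influence the scaling. Tree-based models are largely insensitive to this, but distance-based models such as KNN and SVR are not. One common way to avoid it is to put the scaler inside a `Pipeline` so it is refit on whatever rows each `fit` call sees. This is a minimal sketch on synthetic data from `make_regression`, which stands in for the sleep data; the names `X_demo`, `y_demo`, and `pipe` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sleep features and score (not the real data).
X_demo, y_demo = make_regression(n_samples=120, n_features=6, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler is a pipeline step, so it is fit only on the rows passed to fit().
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())

# During cross-validation the scaler is refit on each training fold.
cv_r2 = cross_val_score(pipe, X_tr, y_tr, cv=5, scoring="r2")

# Final fit on the training split only, then score on the untouched test split.
pipe.fit(X_tr, y_tr)
test_r2 = pipe.score(X_te, y_te)
print(f"CV R^2: {cv_r2.mean():.2f}, test R^2: {test_r2:.2f}")
```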



In [None]:
model = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    random_state=42,
    n_jobs=-1
)

model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")

cv_scores = cross_val_score(model, X_scaled, y, cv=5, scoring='r2')
print(f"Average CV R^2 Score (5-fold): {np.mean(cv_scores):.2f}")


Mean Squared Error: 3.77
Average CV R^2 Score (5-fold): 0.63


I also tried n_estimators values of 70, 80, 90, and 100, but they all produced similar R^2 values.
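
A comparison across ensemble sizes like this can be run in a single loop. The sketch below uses synthetic data from `make_regression` as a stand-in for the sleep data, so its numbers say nothing about the real results; `X_demo`, `y_demo`, and `sweep` are illustrative names.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the sleep features and score (not the real data).
X_demo, y_demo = make_regression(n_samples=150, n_features=6, noise=10.0, random_state=0)

# Mean 5-fold CV R^2 for each ensemble size.
sweep = {}
for n in [50, 70, 80, 90, 100]:
    bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=n, random_state=42)
    sweep[n] = cross_val_score(bag, X_demo, y_demo, cv=5, scoring="r2").mean()

for n, r2 in sweep.items():
    print(f"n_estimators={n}: mean CV R^2 = {r2:.3f}")
```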

###Stacking

In [None]:
from sklearn.ensemble import StackingRegressor, BaggingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split, cross_val_score, RepeatedKFold
from sklearn.metrics import mean_squared_error, r2_score
from numpy import mean, std
import numpy as np
import pandas as pd

# Split your data if not already done
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Define the stacking model with bagging included
def get_stacking():
    level0 = list()
    level0.append(('knn', KNeighborsRegressor()))
    level0.append(('cart', DecisionTreeRegressor()))
    level0.append(('svm', SVR()))

    # Add Bagging Regressor as a base model
    bagging = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=50, random_state=42)
    level0.append(('bagging', bagging))

    level1 = LinearRegression()
    model = StackingRegressor(estimators=level0, final_estimator=level1, cv=5)
    return model

# Get a list of models to evaluate
def get_models():
    models = dict()
    models['knn'] = KNeighborsRegressor()
    models['cart'] = DecisionTreeRegressor()
    models['svm'] = SVR()
    models['bagging'] = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=50, random_state=42)
    models['stacking'] = get_stacking()
    return models

# Evaluate a given model using cross-validation
def evaluate_model(model, X, y):
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='r2', cv=cv, n_jobs=-1, error_score='raise')
    return scores


models = get_models()
results, names = list(), list()
avg_predictions = dict()

print("Cross-Validation Results:")
for name, model in models.items():
    scores = evaluate_model(model, X_train, y_train)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))

    # Fit model and make predictions
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    avg_predictions[name] = np.mean(preds)  # Average prediction for that model

# Display average predicted score for each model
avg_pred_df = pd.DataFrame(list(avg_predictions.items()), columns=['Model', 'Average Predicted Sleep Score'])

print("\nAverage Prediction per Model:")
print(avg_pred_df)

Cross-Validation Results:
>knn -1.158 (1.303)
>cart 0.446 (0.302)
>svm -0.027 (0.077)
>bagging 0.659 (0.248)
>stacking 0.595 (0.268)

Average Prediction per Model:
      Model  Average Predicted Sleep Score
0       knn                      80.400000
1      cart                      84.111111
2       svm                      85.443818
3   bagging                      84.752222
4  stacking                      84.347556


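The bagging model has the highest cross-validated R^2 of all the models (0.659), with stacking close behind (0.595), meaning it explains the largest share of the variation in Sleep Score. KNN and SVM have negative R^2 values, so on this data they do worse than always predicting the mean score. The average predicted score in the table only shows that each model's predictions are centered in a similar range; it does not measure accuracy.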
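
R^2 and the average predicted score measure different things, which a small example with made-up numbers can illustrate. A model that always predicts the mean gets the average exactly right but has an R^2 of zero, while a model that tracks each night but runs one point high has a worse average and a much better R^2. All values below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up sleep scores for illustration only.
y_true = np.array([78.0, 82.0, 85.0, 88.0, 92.0])

# Model A always predicts the true mean: its average is exact, but it explains no variation.
pred_constant = np.full_like(y_true, y_true.mean())

# Model B follows each night's score but is biased one point high.
pred_tracking = y_true + 1.0

for label, pred in [("constant", pred_constant), ("tracking", pred_tracking)]:
    print(f"{label}: mean prediction = {pred.mean():.1f}, R^2 = {r2_score(y_true, pred):.3f}")
```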

I had help from chat making the loops for the stacking code