### Run in stages: doing everything at once takes too long

### Preprocessing
- Time processing: Bedtime, Wakeup time → hour extraction
- Multicollinearity removal: drop Sleep duration, Deep sleep percentage
- Binary encoding: Gender, Smoking status
- Outlier removal: Caffeine consumption (IQR)
- Missing-value imputation: median per Gender for Caffeine consumption; median or mode for the other columns
- Optional exclusion of REM/Light sleep percentage (`drop_sleep_features`)

In [1]:
import pandas as pd
import numpy as np

def preprocess(df, drop_sleep_features=True):
    df = df.copy()

    # Time Variables → Decompose numerically
    df["Bedtime"] = pd.to_datetime(df["Bedtime"])
    df["Wakeup time"] = pd.to_datetime(df["Wakeup time"])
    df["Bedtime_hour"] = df["Bedtime"].dt.hour
    df["Wakeup_hour"] = df["Wakeup time"].dt.hour

    # Remove multicollinear features (Sleep duration, Deep sleep percentage),
    # plus the ID and the raw time columns that are no longer needed
    df.drop(columns=["Sleep duration", "Deep sleep percentage", "ID", "Bedtime", "Wakeup time"], inplace=True, errors='ignore')

    # categorical encoding
    df["Gender"] = df["Gender"].map({"Female": 0, "Male": 1})
    df["Smoking status"] = df["Smoking status"].map({"No": 0, "Yes": 1})

    # Eliminating Outliers (Caffeine IQR)
    Q1 = df["Caffeine consumption"].quantile(0.25)
    Q3 = df["Caffeine consumption"].quantile(0.75)
    IQR = Q3 - Q1
    # Keep rows with missing caffeine values: comparisons against NaN are False,
    # so without the isna() term those rows would be dropped before imputation
    caffeine = df["Caffeine consumption"]
    df = df[caffeine.between(Q1 - 1.5 * IQR, Q3 + 1.5 * IQR) | caffeine.isna()].copy()

    # Missing value processing (median value based on gender)
    df["Caffeine consumption"] = df.groupby("Gender")["Caffeine consumption"].transform(lambda x: x.fillna(x.median()))
    
    # Process other missing values as well
    for col in df.columns:
        if df[col].isnull().sum() > 0:
            if pd.api.types.is_numeric_dtype(df[col]):
                df[col] = df[col].fillna(df[col].median())
            else:
                df[col] = df[col].fillna(df[col].mode()[0])

    # Remove as per REM/Light option
    if drop_sleep_features:
        df = df.drop(columns=["REM sleep percentage", "Light sleep percentage"], errors='ignore')

    return df
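
One detail of the IQR step is worth checking: comparisons against NaN evaluate to False, so a plain range mask removes rows with missing values along with the outliers. The sketch below uses made-up toy values (not the dataset) and hypothetical variable names to compare a plain mask with one that keeps missing rows for later imputation.

```python
import numpy as np
import pandas as pd

# Toy values (made up for illustration): one missing entry and one extreme value
caffeine = pd.Series([0.0, 25.0, 50.0, 50.0, 75.0, np.nan, 500.0])

# quantile() skips NaN when computing the quartiles
q1 = caffeine.quantile(0.25)
q3 = caffeine.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Plain range mask: NaN compares as False, so the missing row is dropped too
plain_mask = (caffeine >= lower) & (caffeine <= upper)

# NaN-preserving mask: keeps missing rows so they can be imputed afterwards
keep_mask = caffeine.between(lower, upper) | caffeine.isna()

print("rows kept by plain mask:", int(plain_mask.sum()))
print("rows kept by NaN-preserving mask:", int(keep_mask.sum()))
```

If imputation is meant to run after outlier removal, the second mask is the one that leaves something to impute.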

Compare StandardScaler, MinMaxScaler, and RobustScaler, and select the scaler that performs best.

In [2]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Import data
df = pd.read_csv("/kaggle/input/sleep-efficiency/Sleep_Efficiency_preprocessed.csv")
X = df.drop('Sleep efficiency', axis=1)
y_reg = df['Sleep efficiency']

y_cls = (y_reg >= 0.85).astype(int)

# Define the scaler to compare
scalers = {
    'StandardScaler': StandardScaler(),
    'MinMaxScaler': MinMaxScaler(),
    'RobustScaler': RobustScaler()
}

# Comparison of regression model performance
print('=== Regression (Linear Regression) ===')
for name, scaler in scalers.items():
    pipe = Pipeline([
        ('scaler', scaler),
        ('model', LinearRegression())
    ])
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    # neg_root_mean_squared_error returns negative values, so flip the sign to get RMSE
    rmse = -cross_val_score(pipe, X, y_reg, cv=kf, scoring='neg_root_mean_squared_error')
    r2  =  cross_val_score(pipe, X, y_reg, cv=kf, scoring='r2')
    print(f'\n{name}:')
    print(f'  Mean RMSE: {rmse.mean():.4f} ± {rmse.std():.4f}')
    print(f'  Mean R²:   {r2.mean():.4f} ± {r2.std():.4f}')

# Classification Model Performance Comparison
print('\n=== Classification (Random Forest) ===')
for name, scaler in scalers.items():
    pipe = Pipeline([
        ('scaler', scaler),
        ('model', RandomForestClassifier(random_state=42))
    ])
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    acc = cross_val_score(pipe, X, y_cls, cv=skf, scoring='accuracy')
    print(f'\n{name}:')
    print(f'  Mean Accuracy: {acc.mean():.4f} ± {acc.std():.4f}')


=== Regression (Linear Regression) ===

StandardScaler:
  Mean RMSE: 0.0618 ± 0.0058
  Mean R²:   0.7826 ± 0.0562

MinMaxScaler:
  Mean RMSE: 0.0618 ± 0.0058
  Mean R²:   0.7826 ± 0.0562

RobustScaler:
  Mean RMSE: 0.0618 ± 0.0058
  Mean R²:   0.7826 ± 0.0562

=== Classification (Random Forest) ===

StandardScaler:
  Mean Accuracy: 0.9286 ± 0.0149

MinMaxScaler:
  Mean Accuracy: 0.9264 ± 0.0164

RobustScaler:
  Mean Accuracy: 0.9242 ± 0.0189


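StandardScaler is selected: all three scalers give identical results for the regression model (expected, since rescaling features does not change least-squares predictions) and nearly identical results for the classification model, where the differences are smaller than one standard deviation.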
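
The identical regression scores are what ordinary least squares should produce. Each of these scalers applies an invertible affine map to every column, and a linear model with an intercept absorbs that map into its coefficients. The sketch below checks this with NumPy on synthetic data; the data and the helper `ols_predict` are illustrative assumptions, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic features on very different scales (illustrative only)
X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 100.0])
y = X @ np.array([0.5, -0.2, 0.01]) + rng.normal(scale=0.1, size=100)

def ols_predict(X_train, y_train):
    """Fit least squares with an intercept and return in-sample predictions."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return A @ coef

# Standardization and min-max scaling written out by hand
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

pred_raw = ols_predict(X, y)
pred_std = ols_predict(X_std, y)
pred_minmax = ols_predict(X_minmax, y)

print(np.allclose(pred_raw, pred_std), np.allclose(pred_raw, pred_minmax))
```

The same argument does not carry over to regularized linear models or distance-based models, where the choice of scaler can change the fit.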

### Classification
- Criteria: Sleep efficiency ≥ 0.85 → 1, otherwise → 0
- Models: LogisticRegression, DecisionTree, RandomForest
- Evaluation: Accuracy, plus MSE and R² computed on the predicted probabilities (StratifiedKFold)
- Choose the best of the three models and tune it
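
For a classifier, the MSE between the 0/1 labels and the predicted probability of class 1 is the quantity usually called the Brier score. The sketch below uses made-up labels and probabilities to show how it differs from the MSE of hard predictions, which reduces to 1 − Accuracy.

```python
import numpy as np

# Made-up labels and predicted probabilities for the positive class
y_true = np.array([1, 0, 1, 1, 0])
y_proba = np.array([0.9, 0.2, 0.6, 0.4, 0.1])

# MSE between labels and probabilities (the Brier score)
brier = np.mean((y_proba - y_true) ** 2)

# After thresholding at 0.5, every squared error is 0 or 1,
# so the MSE equals the error rate, 1 - accuracy
y_pred = (y_proba >= 0.5).astype(int)
mse_hard = np.mean((y_pred - y_true) ** 2)
accuracy = np.mean(y_pred == y_true)

print(f"Brier score:     {brier:.3f}")
print(f"MSE (hard 0/1):  {mse_hard:.3f}")
print(f"1 - accuracy:    {1 - accuracy:.3f}")
```

This is a likely reason the Decision Tree rows in the results table report an MSE equal to 1 − Accuracy: an unpruned tree usually outputs probabilities of exactly 0 or 1.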

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_predict, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score, confusion_matrix
from scipy.stats import randint

# Data loading and label creation
df = pd.read_csv("/kaggle/input/sleep-efficiency/ModelingSet.csv")
df['Sleep_Label'] = (df['Sleep efficiency'] >= 0.85).astype(int)

# Feature settings
X = df.drop(columns=['Sleep efficiency', 'Sleep_Label'])
y = df['Sleep_Label']

# Train/test split and scaling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Defining an Evaluation Function
results = []

def evaluate_test(model, model_name, stage):
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    y_proba = model.predict_proba(X_test_scaled)[:, 1]
    results.append({
        'KFold': 'Test',
        'Model': model_name,
        'Stage': stage,
        'Accuracy': accuracy_score(y_test, y_pred),
        'MSE': mean_squared_error(y_test, y_proba),
        'R²': r2_score(y_test, y_proba)
    })

def evaluate_cv(model, model_name, n_fold):
    skf = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=42)
    y_proba = cross_val_predict(model, X_train_scaled, y_train, cv=skf, method="predict_proba")[:, 1]
    y_pred = (y_proba >= 0.5).astype(int)
    results.append({
        'KFold': f"{n_fold}-Fold",
        'Model': model_name,
        'Stage': 'CV',
        'Accuracy': accuracy_score(y_train, y_pred),
        'MSE': mean_squared_error(y_train, y_proba),
        'R²': r2_score(y_train, y_proba)
    })

# Evaluation of the base model
models = [
    ("Logistic Regression", LogisticRegression(max_iter=1000, random_state=42)),
    ("Decision Tree", DecisionTreeClassifier(random_state=42)),
    ("Random Forest", RandomForestClassifier(random_state=42))
]

for name, model in models:
    evaluate_test(model, name, "Test")
    for k in [3, 5, 10]:
        evaluate_cv(model, name, k)

# Setting parameters for tuning
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 5]
}

param_dist = {
    'n_estimators': randint(50, 200),
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10]
}

# Tuning (RandomForestClassifier) for k=3,5,10
for k in [3, 5, 10]:
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)

    # GridSearchCV
    grid = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid,
        cv=skf,
        n_jobs=-1
    )
    grid.fit(X_train_scaled, y_train)
    best_grid = grid.best_estimator_
    evaluate_test(best_grid, f"Random Forest (GridSearch, {k}-Fold)", "Test")
    evaluate_cv(best_grid, f"Random Forest (GridSearch, {k}-Fold)", k)

    # RandomizedSearchCV
    random_search = RandomizedSearchCV(
        RandomForestClassifier(random_state=42),
        param_distributions=param_dist,
        n_iter=20,
        cv=skf,
        random_state=42,
        n_jobs=-1
    )
    random_search.fit(X_train_scaled, y_train)
    best_random = random_search.best_estimator_
    evaluate_test(best_random, f"Random Forest (RandomSearch, {k}-Fold)", "Test")
    evaluate_cv(best_random, f"Random Forest (RandomSearch, {k}-Fold)", k)

# Output Results
df_results = pd.DataFrame(results)
print(df_results)

      KFold                                  Model Stage  Accuracy       MSE  \
0      Test                    Logistic Regression  Test  0.920455  0.065009   
1    3-Fold                    Logistic Regression    CV  0.908571  0.072590   
2    5-Fold                    Logistic Regression    CV  0.905714  0.068050   
3   10-Fold                    Logistic Regression    CV  0.902857  0.070470   
4      Test                          Decision Tree  Test  0.943182  0.056818   
5    3-Fold                          Decision Tree    CV  0.885714  0.114286   
6    5-Fold                          Decision Tree    CV  0.880000  0.120000   
7   10-Fold                          Decision Tree    CV  0.882857  0.117143   
8      Test                          Random Forest  Test  0.943182  0.047052   
9    3-Fold                          Random Forest    CV  0.911429  0.075722   
10   5-Fold                          Random Forest    CV  0.922857  0.069908   
11  10-Fold                          Ran

Random Forest is selected based on Accuracy, MSE, and R².
- After tuning, evaluate with K-Fold [3, 5, 10] (considering the test vs. CV gap, Accuracy, MSE, R², etc.)
- Best combination: Random Forest (RandomSearch, 5-Fold)
- Next top 4 (k=3,5,10):
  - Random Forest (GridSearch, 10-Fold)
  - Random Forest (RandomSearch, 10-Fold)
  - Random Forest (GridSearch, 3-Fold)
  - Random Forest (GridSearch, 5-Fold)
- Tuning clearly improves performance. Because tuning takes a long time, it is run separately.
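
The "test vs. CV gap" criterion can be made explicit by pivoting the results so that each model has its test and CV scores side by side. The sketch below copies the Accuracy values of the Test and 5-Fold rows for the three base models from the results table above; the names `df_demo` and `wide` are illustrative.

```python
import pandas as pd

# Accuracy values copied from the Test and 5-Fold rows of the results table
rows = [
    {"Model": "Logistic Regression", "KFold": "Test", "Accuracy": 0.920455},
    {"Model": "Logistic Regression", "KFold": "5-Fold", "Accuracy": 0.905714},
    {"Model": "Decision Tree", "KFold": "Test", "Accuracy": 0.943182},
    {"Model": "Decision Tree", "KFold": "5-Fold", "Accuracy": 0.880000},
    {"Model": "Random Forest", "KFold": "Test", "Accuracy": 0.943182},
    {"Model": "Random Forest", "KFold": "5-Fold", "Accuracy": 0.922857},
]
df_demo = pd.DataFrame(rows)

# One row per model, one column per evaluation setting
wide = df_demo.pivot(index="Model", columns="KFold", values="Accuracy")

# A large positive gap means the single test split looks better than CV
wide["Gap"] = wide["Test"] - wide["5-Fold"]

print(wide.sort_values("Gap"))
```

On these numbers the Decision Tree shows the largest gap between its test score and its CV score, while Random Forest matches its test accuracy with a much smaller gap.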

### Regression
- Model: Linear, RandomForest, GradientBoosting
- Evaluation: R², MSE (KFold)
- Choose the best model among the three models and tune it

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split, KFold, GridSearchCV, RandomizedSearchCV, cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from scipy.stats import randint

# data loading
df = pd.read_csv("/kaggle/input/sleep-efficiency/ModelingSet.csv")
X = df.drop(columns=['Sleep efficiency'])
y = df['Sleep efficiency']

# Train/test split and scaling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Defining an Evaluation Function
results = []

def evaluate_test_reg(model, model_name, stage, kfold_desc):
    model.fit(X_train_scaled, y_train)
    y_pred_test = model.predict(X_test_scaled)
    mse = mean_squared_error(y_test, y_pred_test)
    r2 = r2_score(y_test, y_pred_test)
    results.append({
        'KFold': kfold_desc,
        'Model': model_name,
        'Stage': stage,
        'MSE': mse,
        'R²': r2
    })

def evaluate_cv_reg(model, model_name, kf, kfold_desc):
    y_pred_cv = cross_val_predict(model, X_train_scaled, y_train, cv=kf)
    mse = mean_squared_error(y_train, y_pred_cv)
    r2 = r2_score(y_train, y_pred_cv)
    results.append({
        'KFold': kfold_desc,
        'Model': model_name,
        'Stage': 'CV',
        'MSE': mse,
        'R²': r2
    })

# Evaluation of the base model
models = [
    ("Linear Regression", LinearRegression()),
    ("Random Forest Regressor", RandomForestRegressor(random_state=42)),
    ("Gradient Boosting Regressor", GradientBoostingRegressor(random_state=42))
]

for name, model in models:
    evaluate_test_reg(model, name, "Test", "Test")
    for k in [3, 5, 10]:
        kf = KFold(n_splits=k, shuffle=True, random_state=42)
        evaluate_cv_reg(model, name, kf, f"{k}-Fold")

# Parameters to be tuned
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 5]
}
param_dist = {
    'n_estimators': randint(50, 200),
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10]
}

# Tuning - GridSearch & RandomSearch for k in 3,5,10
for k in [3, 5, 10]:
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    # GridSearchCV
    grid = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=kf, n_jobs=-1)
    grid.fit(X_train_scaled, y_train)
    best_rf_grid = grid.best_estimator_
    evaluate_test_reg(best_rf_grid, f"Random Forest (GridSearch, {k}-Fold)", "Test", "Test")
    evaluate_cv_reg(best_rf_grid, f"Random Forest (GridSearch, {k}-Fold)", kf, f"{k}-Fold")

    # RandomizedSearchCV
    random_search = RandomizedSearchCV(
        RandomForestRegressor(random_state=42),
        param_distributions=param_dist,
        n_iter=20,
        cv=kf,
        random_state=42,
        n_jobs=-1
    )
    random_search.fit(X_train_scaled, y_train)
    best_rf_random = random_search.best_estimator_
    evaluate_test_reg(best_rf_random, f"Random Forest (RandomSearch, {k}-Fold)", "Test", "Test")
    evaluate_cv_reg(best_rf_random, f"Random Forest (RandomSearch, {k}-Fold)", kf, f"{k}-Fold")

# Output Results
df_results = pd.DataFrame(results)
print(df_results)


      KFold                                  Model Stage       MSE        R²
0      Test                      Linear Regression  Test  0.004038  0.787850
1    3-Fold                      Linear Regression    CV  0.003887  0.780148
2    5-Fold                      Linear Regression    CV  0.003823  0.783786
3   10-Fold                      Linear Regression    CV  0.003843  0.782666
4      Test                Random Forest Regressor  Test  0.002387  0.874627
5    3-Fold                Random Forest Regressor    CV  0.002827  0.840089
6    5-Fold                Random Forest Regressor    CV  0.002888  0.836644
7   10-Fold                Random Forest Regressor    CV  0.002869  0.837748
8      Test            Gradient Boosting Regressor  Test  0.002248  0.881899
9    3-Fold            Gradient Boosting Regressor    CV  0.002921  0.834805
10   5-Fold            Gradient Boosting Regressor    CV  0.002874  0.837451
11  10-Fold            Gradient Boosting Regressor    CV  0.002759  0.843974

- Random Forest Regressor and Gradient Boosting Regressor have comparable predictive performance; Random Forest is chosen because its drop from the test score to the K-fold CV scores is smaller
- After tuning, evaluate with K-Fold [3, 5, 10] (considering the test vs. CV gap, MSE, R², etc.)
- Best combination: Random Forest (RandomSearch, 5-Fold)
- Next top 4 (k=3,5,10):
  - Random Forest (GridSearch, 5-Fold)
  - Random Forest (RandomSearch, 10-Fold)
  - Random Forest (RandomSearch, 3-Fold)
  - Random Forest (GridSearch, 10-Fold)
- Tuning clearly improves performance here as well.
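
The tuning cells fit the scaler once on `X_train` before cross-validation, so each validation fold has already influenced the scaling statistics. For tree-based models this probably changes little, but one way to keep scaling inside each fold is to wrap the scaler and the model in a `Pipeline` and tune that. The sketch below uses synthetic data and a deliberately small grid, both assumptions made for illustration rather than the notebook's actual setup.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; ModelingSet.csv is not used here
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestRegressor(random_state=42)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid_demo = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [5, None],
}

kf_demo = KFold(n_splits=3, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid_demo, cv=kf_demo)

# The scaler is re-fitted on the training part of every fold
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best mean CV R²: {search.best_score_:.4f}")
```

With this layout, `search.best_estimator_` is the whole pipeline, so it can be applied to unscaled test data directly.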