### Problem Statement: 
Predict the presence or absence of a sleep disorder (Insomnia, Sleep Apnea, or None) based on lifestyle and health factors such as age, gender, quality of sleep, physical activity level, stress levels, BMI category, and blood pressure.

### Here's a brief outline of how I tackled this project:

#### Data Preprocessing: 
I initially cleaned the data by transforming categorical variables to numerical ones where necessary (for instance, transforming 'Gender' and 'BMI Category' using Label Encoding) and by extracting systolic and diastolic blood pressure as separate features from the 'Blood Pressure' column. I then selected relevant features for predicting sleep disorders.

#### Handling Imbalanced Data: 
I noticed a class imbalance in the target variable 'Sleep Disorder'. To handle this, I used the Synthetic Minority Over-sampling Technique (SMOTE) to create a balanced dataset by oversampling the two minority classes (Insomnia and Sleep Apnea). The aim was to make the model better at distinguishing between the classes.

#### Model Selection & Training: 
I chose to use the Random Forest Classifier due to its robustness and versatility. This model was trained on the resampled dataset.

#### Model Evaluation: 
I used the standard train-test-split approach for validating the model. The model's predictions were evaluated on the held-out test split with a classification report (precision, recall, F1-score, and accuracy).

#### Saving the Model: 
I serialized the trained model using joblib, allowing it to be saved to a file and then reloaded later. This is an important step for deploying the model.

#### Creating an Interactive Predictive Tool: 
I created a simple, interactive tool using the ipywidgets library. This tool allows a user to input their personal data, and then it uses the trained model to predict whether the user is likely to have a sleep disorder.

While I didn't deploy this model as a web application or dive deep into model interpretability for this project, these are potential next steps that could be taken. These additional steps would allow me to showcase more skills in machine learning engineering and explain the decision-making process of the model, respectively.

Throughout this process, I was able to apply and demonstrate skills in various important areas of a machine learning project, including data preprocessing, handling imbalanced datasets, model selection and training, and creating an interactive tool for making predictions.

# Dataset and Libraries

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('/content/Sleep_health_and_lifestyle_dataset.csv')
df

index,Person ID,Gender,Age,Occupation,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,BMI Category,Blood Pressure,Heart Rate,Daily Steps,Sleep Disorder
0,1,Male,27,Software Engineer,6.10,6,42,6,Overweight,126/83,77,4200,
1,2,Male,28,Doctor,6.20,6,60,8,Normal,125/80,75,10000,
2,3,Male,28,Doctor,6.20,6,60,8,Normal,125/80,75,10000,
3,4,Male,28,Sales Representative,5.90,4,30,8,Obese,140/90,85,3000,Sleep Apnea
4,5,Male,28,Sales Representative,5.90,4,30,8,Obese,140/90,85,3000,Sleep Apnea
...,...,...,...,...,...,...,...,...,...,...,...,...,...
369,370,Female,59,Nurse,8.10,9,75,3,Overweight,140/95,68,7000,Sleep Apnea
370,371,Female,59,Nurse,8.00,9,75,3,Overweight,140/95,68,7000,Sleep Apnea
371,372,Female,59,Nurse,8.10,9,75,3,Overweight,140/95,68,7000,Sleep Apnea
372,373,Female,59,Nurse,8.10,9,75,3,Overweight,140/95,68,7000,Sleep Apnea


# Exploratory Data Analysis

In [None]:
df.isnull().sum()

Person ID                  0
Gender                     0
Age                        0
Occupation                 0
Sleep Duration             0
Quality of Sleep           0
Physical Activity Level    0
Stress Level               0
BMI Category               0
Blood Pressure             0
Heart Rate                 0
Daily Steps                0
Sleep Disorder             0
dtype: int64

Great! No missing or null values.

In [None]:
df['Sleep Disorder'].value_counts()

None           219
Sleep Apnea     78
Insomnia        77
Name: Sleep Disorder, dtype: int64

There is a clear class imbalance: None (219) far outnumbers Sleep Apnea (78) and Insomnia (77).
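
To put numbers on the imbalance, the counts printed above can be turned into class shares and a majority-class baseline (the accuracy of always predicting None). This is only a sketch built from those three counts. The last two lines assume SMOTE's default behaviour of raising every non-majority class to the majority count, and the 0.2 test fraction used in the modeling cells of this notebook.

```python
import math

# Class counts copied from the value_counts() output above
counts = {"None": 219, "Sleep Apnea": 78, "Insomnia": 77}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the largest class
majority_baseline = max(shares.values())

# Assumption: SMOTE with default settings brings each class up to the majority count
balanced_total = max(counts.values()) * len(counts)
# Assumption: a 0.2 test fraction, rounded up as train_test_split does
test_rows = math.ceil(0.2 * balanced_total)

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"majority baseline accuracy: {majority_baseline:.3f}")
print(f"rows after balancing: {balanced_total}, test rows: {test_rows}")
```

Any model worth keeping should beat the majority baseline on the original data. The balanced row count and test size line up with the support of 132 in the resampled classification report further down.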

In [None]:
df['Occupation'].unique()

array(['Software Engineer', 'Doctor', 'Sales Representative', 'Teacher',
       'Nurse', 'Engineer', 'Accountant', 'Scientist', 'Lawyer',
       'Salesperson', 'Manager'], dtype=object)

In [None]:
software_engineer_mean = df[df['Occupation'] == 'Software Engineer']['Sleep Duration'].mean()
engineer_mean = df[df['Occupation'] == 'Engineer']['Sleep Duration'].mean()
print("Mean Sleep Duration for Software Engineers: ", software_engineer_mean)
print("Mean Sleep Duration for Engineers: ", engineer_mean)

Mean Sleep Duration for Software Engineers:  6.75

Mean Sleep Duration for Engineers:  7.987301587301586


Software Engineers and Engineers are not similar: their mean sleep durations differ by more than an hour (6.75 vs. about 7.99).

In [None]:
Doctor_mean = df[df['Occupation'] == 'Doctor']['Sleep Duration'].mean()
Nurse_mean = df[df['Occupation'] == 'Nurse']['Sleep Duration'].mean()

print("Mean Sleep Duration for Doctor: ", Doctor_mean)
print("Mean Sleep Duration for Nurse: ", Nurse_mean)

Mean Sleep Duration for Doctor:  6.970422535211269

Mean Sleep Duration for Nurse:  7.0630136986301375


Nurses and Doctors are extremely similar (about 7.06 vs. 6.97).
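
One way to make "similar" concrete is to express the gap between two means as a fraction of the smaller one. The sketch below uses the four means printed above; `relative_gap` is a hypothetical helper introduced only for illustration, and no cut-off for "similar enough" is implied.

```python
def relative_gap(a: float, b: float) -> float:
    """Absolute difference between two means, as a fraction of the smaller one."""
    return abs(a - b) / min(a, b)

# Means copied from the outputs above
healthcare_gap = relative_gap(6.970422535211269, 7.0630136986301375)
engineering_gap = relative_gap(6.75, 7.987301587301586)

print(f"Doctor vs Nurse: {healthcare_gap:.1%}")
print(f"Software Engineer vs Engineer: {engineering_gap:.1%}")
```

The Doctor/Nurse gap is roughly an order of magnitude smaller than the Software Engineer/Engineer gap. Mean sleep duration is only one feature, so this supports merging the two groups but does not prove they behave alike in every respect.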

In [None]:
# Combine Doctor and Nurse into one category because their mean sleep durations are so similar
df['Occupation'] = df['Occupation'].replace(['Doctor', 'Nurse'], 'Healthcare')

In [None]:
# BMI Category 
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['BMI Category'] = le.fit_transform(df['BMI Category'])

In [None]:
# Gender Male: 1 and Female: 0
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])
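
For reference, `LabelEncoder` assigns integer codes in sorted order of the distinct labels, which is why Female becomes 0 and Male becomes 1. The plain-Python sketch below mimics that mapping; `encode_labels` is a hypothetical helper, not scikit-learn's implementation.

```python
def encode_labels(values):
    """Map each distinct label to its rank in sorted order, as LabelEncoder does for strings."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping

codes, mapping = encode_labels(["Male", "Female", "Female", "Male"])
print(mapping)
print(codes)
```

Because the codes follow alphabetical order, they carry no ordinal meaning. In the encoded table, for example, Obese is 2 and Overweight is 3. Tree-based models such as Random Forest can split around this, but linear models may read the codes as an ordered scale.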

In [None]:
df = pd.get_dummies(df, columns=['Occupation'])

In [None]:
df[['Systolic_BP', 'Diastolic_BP']] = df['Blood Pressure'].str.split('/', expand=True)
df['Systolic_BP'] = pd.to_numeric(df['Systolic_BP'])
df['Diastolic_BP'] = pd.to_numeric(df['Diastolic_BP'])
df = df.drop(columns='Blood Pressure') 

In [None]:
display(df)

index,Person ID,Gender,Age,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,BMI Category,Heart Rate,Daily Steps,...,Occupation_Healthcare,Occupation_Lawyer,Occupation_Manager,Occupation_Sales Representative,Occupation_Salesperson,Occupation_Scientist,Occupation_Software Engineer,Occupation_Teacher,Systolic_BP,Diastolic_BP
0,1,1,27,6.10,6,42,6,3,77,4200,...,0,0,0,0,0,0,1,0,126,83
1,2,1,28,6.20,6,60,8,0,75,10000,...,1,0,0,0,0,0,0,0,125,80
2,3,1,28,6.20,6,60,8,0,75,10000,...,1,0,0,0,0,0,0,0,125,80
3,4,1,28,5.90,4,30,8,2,85,3000,...,0,0,0,1,0,0,0,0,140,90
4,5,1,28,5.90,4,30,8,2,85,3000,...,0,0,0,1,0,0,0,0,140,90
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
369,370,0,59,8.10,9,75,3,3,68,7000,...,1,0,0,0,0,0,0,0,140,95
370,371,0,59,8.00,9,75,3,3,68,7000,...,1,0,0,0,0,0,0,0,140,95
371,372,0,59,8.10,9,75,3,3,68,7000,...,1,0,0,0,0,0,0,0,140,95
372,373,0,59,8.10,9,75,3,3,68,7000,...,1,0,0,0,0,0,0,0,140,95


# ML Modeling

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

X = df.drop('Sleep Disorder', axis=1)
y = df['Sleep Disorder']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

    Insomnia       0.72      0.81      0.76        16
        None       0.95      0.98      0.97        43
 Sleep Apnea       0.85      0.69      0.76        16

    accuracy                           0.88        75
   macro avg       0.84      0.83      0.83        75
weighted avg       0.88      0.88      0.88        75
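
As a reminder of how the columns in this report relate, the sketch below recomputes precision, recall, and F1 for a single class. The counts (11 true positives, 13 predicted, 16 actual) are an assumption chosen to be consistent with the rounded Sleep Apnea row; the confusion matrix itself is not shown in this notebook.

```python
def precision_recall_f1(tp: int, predicted: int, actual: int):
    """Per-class metrics from true positives, predicted positives, and actual positives."""
    precision = tp / predicted   # share of predictions for the class that were right
    recall = tp / actual         # share of the class's true rows that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Assumed counts, picked to match the rounded Sleep Apnea row (support = 16)
precision, recall, f1 = precision_recall_f1(tp=11, predicted=13, actual=16)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The low recall means several true Sleep Apnea cases were labelled as something else, which the overall accuracy of 0.88 hides.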




In [None]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

X_train_resampled, X_test_resampled, y_train_resampled, y_test_resampled = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

clf_resampled = RandomForestClassifier(random_state=42)
clf_resampled.fit(X_train_resampled, y_train_resampled)

y_pred_resampled = clf_resampled.predict(X_test_resampled)

print(classification_report(y_test_resampled, y_pred_resampled))

              precision    recall  f1-score   support

    Insomnia       0.84      0.90      0.87        29
        None       0.94      0.96      0.95        52
 Sleep Apnea       0.96      0.90      0.93        51

    accuracy                           0.92       132
   macro avg       0.91      0.92      0.92       132
weighted avg       0.93      0.92      0.92       132




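SMOTE (Synthetic Minority Over-sampling Technique) balances an imbalanced dataset by generating synthetic minority-class samples, each interpolated between an existing minority sample and one of its nearest neighbors from the same class. Because the resampling above is applied before the train/test split, the resampled test split also contains synthetic rows, so its scores are not directly comparable with the first report.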
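
The interpolation step at the heart of SMOTE can be sketched in a few lines of plain Python. This is an illustration of the idea only, not imbalanced-learn's implementation: the real algorithm picks the neighbor from the k nearest same-class rows (k is 5 by default), while here the neighbor is simply given. `smote_like_sample` and the two feature vectors are hypothetical.

```python
import random

def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the line segment between point and neighbor."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [p + lam * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(42)
a = [126.0, 83.0]  # illustrative feature vector (e.g. systolic, diastolic)
b = [140.0, 90.0]  # illustrative same-class neighbor
synthetic = smote_like_sample(a, b, rng)
print(synthetic)
```

Every synthetic row lies between two real rows, so SMOTE adds no information from outside the region already covered by the minority class.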

# LazyPredict

In [None]:
!pip install lazypredict

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting lazypredict
  Downloading lazypredict-0.2.12-py2.py3-none-any.whl (12 kB)
Installing collected packages: lazypredict
Successfully installed lazypredict-0.2.12


In [None]:
from lazypredict.Supervised import LazyClassifier

In [None]:
# Create a LazyClassifier instance and fit it on the original (imbalanced) split
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)

display(models)

100%|██████████| 29/29 [00:01<00:00, 19.54it/s]


Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Perceptron,0.92,0.89,,0.92,0.01
RidgeClassifierCV,0.91,0.88,,0.91,0.01
BaggingClassifier,0.92,0.88,,0.92,0.08
LinearSVC,0.91,0.87,,0.91,0.08
CalibratedClassifierCV,0.91,0.87,,0.91,0.22
RidgeClassifier,0.89,0.86,,0.89,0.02
DecisionTreeClassifier,0.91,0.85,,0.9,0.01
LinearDiscriminantAnalysis,0.88,0.85,,0.88,0.04
SGDClassifier,0.89,0.85,,0.89,0.02
LabelPropagation,0.89,0.85,,0.89,0.04
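
The Balanced Accuracy column is the unweighted mean of per-class recall, while plain accuracy is effectively the support-weighted mean. The sketch below shows the difference using the Random Forest recalls and supports from the first classification report in this notebook. Those inputs are already rounded to two decimals, so the results are approximate.

```python
# Per-class recall and support copied from the first classification report
recalls = {"Insomnia": 0.81, "None": 0.98, "Sleep Apnea": 0.69}
support = {"Insomnia": 16, "None": 43, "Sleep Apnea": 16}

# Balanced accuracy: every class counts equally
balanced_accuracy = sum(recalls.values()) / len(recalls)

# Accuracy: classes count in proportion to their number of rows
accuracy = sum(recalls[c] * support[c] for c in recalls) / sum(support.values())

print(f"balanced accuracy ~ {balanced_accuracy:.2f}")
print(f"accuracy ~ {accuracy:.2f}")
```

On an imbalanced test split, balanced accuracy falls below accuracy whenever the minority classes are recalled less well than the majority class, which is the pattern in the table above.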


In [None]:
# Create a LazyClassifier instance and fit it on the resampled data
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)
models, predictions = clf.fit(X_train_resampled, X_test_resampled, y_train_resampled, y_test_resampled)

display(models)

100%|██████████| 29/29 [00:02<00:00, 13.32it/s]


Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
LinearSVC,0.94,0.93,,0.94,0.14
ExtraTreesClassifier,0.93,0.93,,0.93,0.25
LogisticRegression,0.93,0.92,,0.93,0.04
CalibratedClassifierCV,0.93,0.92,,0.93,0.36
RandomForestClassifier,0.92,0.92,,0.92,0.29
LabelPropagation,0.92,0.92,,0.92,0.04
LabelSpreading,0.92,0.92,,0.92,0.06
SGDClassifier,0.91,0.91,,0.91,0.02
SVC,0.9,0.91,,0.9,0.03
Perceptron,0.92,0.91,,0.92,0.02


In [None]:
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', LinearSVC(random_state=42))
])

param_grid = {
    'svc__C': [0.1, 1, 10, 100],
    'svc__tol': [10 ,1 ,0.1 ,1e-2, 1e-3, 1e-4, 1e-5, 1e-6]
}

# Grid search over C and tol with 5-fold cross-validation on the resampled training split
grid_search = GridSearchCV(pipe, param_grid, cv=5)
grid_search.fit(X_train_resampled, y_train_resampled)

best_params = grid_search.best_params_
best_score = grid_search.best_score_
y_pred = grid_search.predict(X_test_resampled)
print(classification_report(y_test_resampled, y_pred, target_names=['Insomnia', 'None', 'Sleep Apnea']))

              precision    recall  f1-score   support

    Insomnia       0.93      0.86      0.89        29
        None       0.93      0.96      0.94        52
 Sleep Apnea       0.96      0.96      0.96        51

    accuracy                           0.94       132
   macro avg       0.94      0.93      0.93       132
weighted avg       0.94      0.94      0.94       132




In [None]:
print(best_params)

{'svc__C': 1, 'svc__tol': 0.1}
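
To get a feel for the cost of this search, the grid can be enumerated by hand. The sketch below copies the parameter grid and fold count from the cell above and counts the model fits. The extra fit at the end assumes `GridSearchCV` is left at its default `refit=True`, as it is here.

```python
from itertools import product

# Same grid and fold count as the GridSearchCV cell above
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__tol": [10, 1, 0.1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6],
}
cv_folds = 5

combinations = list(product(*param_grid.values()))
cv_fits = len(combinations) * cv_folds
total_fits = cv_fits + 1  # one final refit of the best combination on all training data

print(f"{len(combinations)} combinations, {cv_fits} cross-validation fits, {total_fits} fits in total")
```

With a model as cheap as `LinearSVC` this is fast, but the count grows multiplicatively with every added parameter.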


# Next Steps

In [None]:
importances = clf_resampled.feature_importances_
feature_names = X_train_resampled.columns
feature_importances = dict(zip(feature_names, importances))
sorted_feature_importances = sorted(feature_importances.items(), key=lambda item: item[1], reverse=True)
print(sorted_feature_importances)

[('Person ID', 0.14852511893508102), ('Diastolic_BP', 0.14774320115962555), ('Systolic_BP', 0.1357872439054357), ('BMI Category', 0.10920245722519521), ('Physical Activity Level', 0.08884135583779726), ('Sleep Duration', 0.07717184415852316), ('Daily Steps', 0.07266054403137205), ('Age', 0.05871879361692078), ('Occupation_Healthcare', 0.03510059529776193), ('Quality of Sleep', 0.031469457002867754), ('Heart Rate', 0.02949396149723242), ('Stress Level', 0.02298108401982315), ('Occupation_Salesperson', 0.010910854775814516), ('Gender', 0.009248844391415139), ('Occupation_Lawyer', 0.006030525152674882), ('Occupation_Teacher', 0.005219293024668201), ('Occupation_Engineer', 0.004721041493003322), ('Occupation_Accountant', 0.0030116539009941), ('Occupation_Software Engineer', 0.0012436514904302484), ('Occupation_Sales Representative', 0.0012383882464336356), ('Occupation_Scientist', 0.0004681072613003495), ('Occupation_Manager', 0.0002119835756295594)]


In [None]:
!pip install eli5

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting eli5
  Downloading eli5-0.13.0.tar.gz (216 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: eli5
  Building wheel for eli5 (setup.py) ... done
  Created wheel for eli5: filename=eli5-0.13.0-py2.py3-none-any.whl size=107730 sha256=78a63a38e071ae7aac00dc06e8a0b6dcb068bd912e02fb7f64f21c8a0ebd6ea8
  Stored in directory: /root/.cache/pip/wheels/b8/58/ef/2cf4c306898c2338d51540e0922c8e0d6028e07007085c0004
Successfully built eli5
Installing collected packages: eli5
Successfully installed eli5-0.13.0


In [None]:
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(clf_resampled, random_state=1).fit(X_test_resampled, y_test_resampled)
eli5.show_weights(perm, feature_names = X_test_resampled.columns.tolist())

Weight,Feature
0.1000  ± 0.0113,Person ID
0.0152  ± 0.0096,BMI Category
0.0106  ± 0.0121,Systolic_BP
0.0091  ± 0.0113,Diastolic_BP
0.0076  ± 0.0000,Gender
0.0061  ± 0.0177,Quality of Sleep
0.0030  ± 0.0155,Physical Activity Level
0.0015  ± 0.0113,Stress Level
0  ± 0.0000,Occupation_Teacher
0  ± 0.0000,Occupation_Software Engineer


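Permutation Importance, from the eli5 library, scores each feature by how much the model's test score drops when that feature's values are randomly shuffled. Person ID ranks first in both this table and the Random Forest importances above. Since it is a row identifier rather than a health factor, that ranking most likely reflects how the rows are ordered in the file, and the deployment feature set excludes it.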
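
The idea behind permutation importance can be sketched without any library: shuffle one column, re-score, and record the drop. The code below is a toy illustration, not eli5's implementation. The data, the stand-in `model`, and both helper functions are hypothetical.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows the model labels correctly."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, column, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling the values of one column."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / n_repeats

# Toy data: the label depends only on column 0; column 1 is noise
rows = [[i, (i * 7) % 10] for i in range(10)]
labels = [int(r[0] >= 5) for r in rows]
model = lambda r: int(r[0] >= 5)  # stand-in "model" that reads only column 0

informative = permutation_importance(model, rows, labels, column=0)
ignored = permutation_importance(model, rows, labels, column=1)
print(f"column 0: {informative:.3f}, column 1: {ignored:.3f}")
```

A feature the model never reads scores exactly zero, which matches the zero-weight occupation columns in the table above.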

# Training for Deployment

In [None]:
import joblib
from imblearn.over_sampling import SMOTE

# Load your data
df = pd.read_csv('/content/Sleep_health_and_lifestyle_dataset.csv')
le = LabelEncoder()

df['BMI Category'] = le.fit_transform(df['BMI Category'])
df['Gender'] = le.fit_transform(df['Gender'])
df[['Systolic_BP', 'Diastolic_BP']] = df['Blood Pressure'].str.split('/', expand=True)
df['Systolic_BP'] = pd.to_numeric(df['Systolic_BP'])
df['Diastolic_BP'] = pd.to_numeric(df['Diastolic_BP'])

df = df.drop(columns='Blood Pressure') 
features = ['Age', 'BMI Category', 'Systolic_BP', 'Diastolic_BP', 'Gender', 'Quality of Sleep', 'Physical Activity Level', 'Stress Level', 'Sleep Disorder']

df = df[features]
X = df.drop('Sleep Disorder', axis=1)
y = df['Sleep Disorder']

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
X_train_resampled, X_test_resampled, y_train_resampled, y_test_resampled = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

clf_resampled = RandomForestClassifier(random_state=42)
clf_resampled.fit(X_train_resampled, y_train_resampled)

y_pred_resampled = clf_resampled.predict(X_test_resampled)
joblib.dump(clf_resampled, 'model.pkl')

['model.pkl']

In [1]:
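import ipywidgets as widgets
import joblib
import pandas as pd
from IPython.display import display

model = joblib.load('model.pkl')  # model saved by the training cell above

input_age = widgets.BoundedIntText(value=25, min=0, max=100, description='Age:', step=1)
# BMI Category is label-encoded in training (codes 0-3; e.g. 0 = Normal, 2 = Obese, 3 = Overweight)
input_bmi_category = widgets.BoundedIntText(value=0, min=0, max=3, description='BMI Category:', step=1)
input_systolic_bp = widgets.BoundedIntText(value=120, min=0, max=200, description='Systolic BP:', step=1)
input_diastolic_bp = widgets.BoundedIntText(value=80, min=0, max=200, description='Diastolic BP:', step=1)
input_gender = widgets.Dropdown(options=['Male', 'Female'], value='Male', description='Gender:')
input_quality_of_sleep = widgets.BoundedFloatText(value=7, min=0, max=10, description='Quality of Sleep:', step=0.1)
# Physical Activity Level reaches 75 in the rows shown earlier, so allow values up to 100
input_physical_activity_level = widgets.BoundedFloatText(value=60, min=0, max=100, description='Physical Activity Level:', step=1)
input_stress_level = widgets.BoundedFloatText(value=5, min=0, max=10, description='Stress Level:', step=0.1)
output = widgets.Output()

def on_button_predict_clicked(button):
    # Build one row with the same column names and order used for training
    row = pd.DataFrame([{
        'Age': input_age.value,
        'BMI Category': input_bmi_category.value,
        'Systolic_BP': input_systolic_bp.value,
        'Diastolic_BP': input_diastolic_bp.value,
        'Gender': 1 if input_gender.value == 'Male' else 0,  # Male: 1, Female: 0
        'Quality of Sleep': input_quality_of_sleep.value,
        'Physical Activity Level': input_physical_activity_level.value,
        'Stress Level': input_stress_level.value,
    }])
    with output:
        output.clear_output()
        print('Predicted sleep disorder:', model.predict(row)[0])

# Register the callback only after it has been defined
button_predict = widgets.Button(description='Predict')
button_predict.on_click(on_button_predict_clicked)

display(input_age, input_bmi_category, input_systolic_bp, input_diastolic_bp, input_gender,
        input_quality_of_sleep, input_physical_activity_level, input_stress_level,
        button_predict, output)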

# You May Edit this for Deployment 🙌