# Final Project

**Brief overview**: Predict either sleep medication usage or trouble sleeping from health and demographic factors. In practice, this means predicting values in either the "Prescription Sleep Medication" or the "Trouble Sleeping" column of the dataset from the values in the other columns.

**Project type**: This project involves building a predictive model. 

**Goal**: Accurately forecast a patient's likelihood of using prescription sleep medication. 

**Problem**: Address the need for early identification of patients who may require sleep medication. 

**Approach**: Develop a machine learning model to predict sleep medication use. 

**Model algorithms**: Employ logistic regression, decision trees, or neural networks with appropriate preprocessing. 

**Important notes**:

1. Your model must be deployed: it should not be a static notebook, but should have a server component.
2. Your model must be trained and saved. In your final presentation, you should be able to run your saved model on new data.

## Viewing the data

In [3]:
# Load the necessary Python libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
from flask import Flask, request, jsonify
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_selection import chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

In [4]:
# Load the dataset

#sleep_df = pd.read_csv('../data/NPHA-doctor-visits.csv')
sleep_df = pd.read_csv('NPHA-doctor-visits.csv')

sleep_df

Unnamed: 0,Number of Doctors Visited,Age,Phyiscal Health,Mental Health,Dental Health,Employment,Stress Keeps Patient from Sleeping,Medication Keeps Patient from Sleeping,Pain Keeps Patient from Sleeping,Bathroom Needs Keeps Patient from Sleeping,Uknown Keeps Patient from Sleeping,Trouble Sleeping,Prescription Sleep Medication,Race,Gender
0,3,2,4,3,3,3,0,0,0,0,1,2,3,1,2
1,2,2,4,2,3,3,1,0,0,1,0,3,3,1,1
2,3,2,3,2,3,3,0,0,0,0,1,3,3,4,1
3,1,2,3,2,3,3,0,0,0,1,0,3,3,4,2
4,3,2,3,3,3,3,1,0,0,0,0,2,3,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
709,2,2,2,2,2,3,0,0,0,1,0,3,3,1,1
710,3,2,2,2,2,2,1,0,0,0,1,2,3,1,2
711,3,2,4,2,3,3,0,0,0,0,0,3,3,1,1
712,3,2,3,1,3,3,1,0,1,1,1,3,3,1,2


In [5]:
# Display basic information and check for missing values

sleep_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 714 entries, 0 to 713
Data columns (total 15 columns):
 #   Column                                      Non-Null Count  Dtype
---  ------                                      --------------  -----
 0   Number of Doctors Visited                   714 non-null    int64
 1   Age                                         714 non-null    int64
 2   Phyiscal Health                             714 non-null    int64
 3   Mental Health                               714 non-null    int64
 4   Dental Health                               714 non-null    int64
 5   Employment                                  714 non-null    int64
 6   Stress Keeps Patient from Sleeping          714 non-null    int64
 7   Medication Keeps Patient from Sleeping      714 non-null    int64
 8   Pain Keeps Patient from Sleeping            714 non-null    int64
 9   Bathroom Needs Keeps Patient from Sleeping  714 non-null    int64
 10  Uknown Keeps Patient from Sleeping    

In [6]:
# Explore to see if there are nan values

sleep_df.isnull().sum()

Number of Doctors Visited                     0
Age                                           0
Phyiscal Health                               0
Mental Health                                 0
Dental Health                                 0
Employment                                    0
Stress Keeps Patient from Sleeping            0
Medication Keeps Patient from Sleeping        0
Pain Keeps Patient from Sleeping              0
Bathroom Needs Keeps Patient from Sleeping    0
Uknown Keeps Patient from Sleeping            0
Trouble Sleeping                              0
Prescription Sleep Medication                 0
Race                                          0
Gender                                        0
dtype: int64

In [7]:
# Show the unique values for each column in the dataframe

sleep_df.apply(np.unique)

Number of Doctors Visited                                  [1, 2, 3]
Age                                                              [2]
Phyiscal Health                                  [-1, 1, 2, 3, 4, 5]
Mental Health                                    [-1, 1, 2, 3, 4, 5]
Dental Health                                 [-1, 1, 2, 3, 4, 5, 6]
Employment                                              [1, 2, 3, 4]
Stress Keeps Patient from Sleeping                            [0, 1]
Medication Keeps Patient from Sleeping                        [0, 1]
Pain Keeps Patient from Sleeping                              [0, 1]
Bathroom Needs Keeps Patient from Sleeping                    [0, 1]
Uknown Keeps Patient from Sleeping                            [0, 1]
Trouble Sleeping                                       [-1, 1, 2, 3]
Prescription Sleep Medication                          [-1, 1, 2, 3]
Race                                                 [1, 2, 3, 4, 5]
Gender                            

In [8]:
# Drop the 'Age' column: it is constant (all subjects are in age group 2), so it carries no predictive information

sleep_df.drop(['Age'], axis=1, inplace=True)

sleep_df

Unnamed: 0,Number of Doctors Visited,Phyiscal Health,Mental Health,Dental Health,Employment,Stress Keeps Patient from Sleeping,Medication Keeps Patient from Sleeping,Pain Keeps Patient from Sleeping,Bathroom Needs Keeps Patient from Sleeping,Uknown Keeps Patient from Sleeping,Trouble Sleeping,Prescription Sleep Medication,Race,Gender
0,3,4,3,3,3,0,0,0,0,1,2,3,1,2
1,2,4,2,3,3,1,0,0,1,0,3,3,1,1
2,3,3,2,3,3,0,0,0,0,1,3,3,4,1
3,1,3,2,3,3,0,0,0,1,0,3,3,4,2
4,3,3,3,3,3,1,0,0,0,0,2,3,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
709,2,2,2,2,3,0,0,0,1,0,3,3,1,1
710,3,2,2,2,2,1,0,0,0,1,2,3,1,2
711,3,4,2,3,3,0,0,0,0,0,3,3,1,1
712,3,3,1,3,3,1,0,1,1,1,3,3,1,2


In [9]:
# We have to consider the -1 values, which mark questions the respondents refused to answer.

def NR_fun(in_col):
    # count number of "-1"'s in each column
    return sum(in_col==-1)

sleep_df.apply(NR_fun)

Number of Doctors Visited                      0
Phyiscal Health                                1
Mental Health                                 10
Dental Health                                  4
Employment                                     0
Stress Keeps Patient from Sleeping             0
Medication Keeps Patient from Sleeping         0
Pain Keeps Patient from Sleeping               0
Bathroom Needs Keeps Patient from Sleeping     0
Uknown Keeps Patient from Sleeping             0
Trouble Sleeping                               2
Prescription Sleep Medication                  3
Race                                           0
Gender                                         0
dtype: int64

> Of the 714 respondents, only a few refused to answer some of the survey questions (at most 10 refusals in any one column).

In [10]:
# Handle refused responses (-1)

sleep_df.replace(-1, np.nan, inplace=True)
sleep_df.dropna(inplace=True)  # Drop rows with refused responses

sleep_df

Unnamed: 0,Number of Doctors Visited,Phyiscal Health,Mental Health,Dental Health,Employment,Stress Keeps Patient from Sleeping,Medication Keeps Patient from Sleeping,Pain Keeps Patient from Sleeping,Bathroom Needs Keeps Patient from Sleeping,Uknown Keeps Patient from Sleeping,Trouble Sleeping,Prescription Sleep Medication,Race,Gender
0,3,4.0,3.0,3.0,3,0,0,0,0,1,2.0,3.0,1,2
1,2,4.0,2.0,3.0,3,1,0,0,1,0,3.0,3.0,1,1
2,3,3.0,2.0,3.0,3,0,0,0,0,1,3.0,3.0,4,1
3,1,3.0,2.0,3.0,3,0,0,0,1,0,3.0,3.0,4,2
4,3,3.0,3.0,3.0,3,1,0,0,0,0,2.0,3.0,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
709,2,2.0,2.0,2.0,3,0,0,0,1,0,3.0,3.0,1,1
710,3,2.0,2.0,2.0,2,1,0,0,0,1,2.0,3.0,1,2
711,3,4.0,2.0,3.0,3,0,0,0,0,0,3.0,3.0,1,1
712,3,3.0,1.0,3.0,3,1,0,1,1,1,3.0,3.0,1,2
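
The effect of this step can be checked on a small frame. The sketch below uses a hypothetical three-row `toy_df` (made-up values, not the survey data) to show how replacing -1 with NaN and dropping incomplete rows both shrinks the dataset and turns the affected columns into floats.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: -1 marks a refused response, as in the survey
toy_df = pd.DataFrame({
    "Mental Health": [3, -1, 2],
    "Trouble Sleeping": [2, 3, -1],
    "Gender": [1, 2, 1],
})

rows_before = len(toy_df)

# Same two steps as the cell above, written without inplace=True
cleaned = toy_df.replace(-1, np.nan).dropna()
rows_after = len(cleaned)

print(f"Dropped {rows_before - rows_after} of {rows_before} rows")
print(cleaned.dtypes)  # columns that contained -1 are now float; 'Gender' stays integer
```

Printing the row counts before and after on `sleep_df` in the same way makes it easy to confirm how many respondents were removed.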


In [11]:
# Replacing -1 with NaN converted the affected columns to float data types.

# List of columns to convert back to integers

columns_to_convert = ['Phyiscal Health', 'Mental Health', 'Dental Health', 'Trouble Sleeping', 'Prescription Sleep Medication']

# Convert the columns to integer type

sleep_df[columns_to_convert] = sleep_df[columns_to_convert].astype(int)

sleep_df

Unnamed: 0,Number of Doctors Visited,Phyiscal Health,Mental Health,Dental Health,Employment,Stress Keeps Patient from Sleeping,Medication Keeps Patient from Sleeping,Pain Keeps Patient from Sleeping,Bathroom Needs Keeps Patient from Sleeping,Uknown Keeps Patient from Sleeping,Trouble Sleeping,Prescription Sleep Medication,Race,Gender
0,3,4,3,3,3,0,0,0,0,1,2,3,1,2
1,2,4,2,3,3,1,0,0,1,0,3,3,1,1
2,3,3,2,3,3,0,0,0,0,1,3,3,4,1
3,1,3,2,3,3,0,0,0,1,0,3,3,4,2
4,3,3,3,3,3,1,0,0,0,0,2,3,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
709,2,2,2,2,3,0,0,0,1,0,3,3,1,1
710,3,2,2,2,2,1,0,0,0,1,2,3,1,2
711,3,4,2,3,3,0,0,0,0,0,3,3,1,1
712,3,3,1,3,3,1,0,1,1,1,3,3,1,2


## Finding the best predictor

The following steps apply three methods to rank the candidate predictors:
1. Correlation Analysis
2. Logistic Regression
3. Decision Tree

All columns are evaluated in order to find the best predictors among them.

In [12]:
# Correlation Analysis (for numeric features) with 'Trouble Sleeping'

corr_matrix = sleep_df.corr()
print(corr_matrix['Trouble Sleeping'].sort_values(ascending=False))

Trouble Sleeping                              1.000000
Prescription Sleep Medication                 0.288450
Bathroom Needs Keeps Patient from Sleeping    0.052819
Race                                         -0.015863
Uknown Keeps Patient from Sleeping           -0.022306
Number of Doctors Visited                    -0.065880
Dental Health                                -0.072012
Gender                                       -0.093479
Mental Health                                -0.132863
Employment                                   -0.153047
Medication Keeps Patient from Sleeping       -0.159543
Stress Keeps Patient from Sleeping           -0.174938
Phyiscal Health                              -0.217978
Pain Keeps Patient from Sleeping             -0.243360
Name: Trouble Sleeping, dtype: float64


In [13]:
# Logistic Regression (Using Coefficients)

# Assuming 'sleep_df' is your dataset and 'Trouble Sleeping' is your target column
X = sleep_df.drop(columns=['Trouble Sleeping'])  # Features
y = sleep_df['Trouble Sleeping']  # Target column

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Scaled training features
X_test_scaled = scaler.transform(X_test)  # Scaled test features


# Logistic Regression for 'Trouble Sleeping'
log_reg = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=500)
log_reg.fit(X_train_scaled, y_train)

# Display the feature importance (coefficients)
importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Coefficient': log_reg.coef_[0]  # coef_ has one row per class; row 0 holds the coefficients for the first class (1)
})

# Sort by absolute coefficient value
importance['Abs Coefficient'] = importance['Coefficient'].abs()
print(importance.sort_values(by='Abs Coefficient', ascending=False))


                                       Feature  Coefficient  Abs Coefficient
4                                   Employment     0.689211         0.689211
1                              Phyiscal Health     0.423193         0.423193
8   Bathroom Needs Keeps Patient from Sleeping    -0.280760         0.280760
10               Prescription Sleep Medication    -0.257109         0.257109
5           Stress Keeps Patient from Sleeping     0.228947         0.228947
6       Medication Keeps Patient from Sleeping     0.206396         0.206396
12                                      Gender     0.165246         0.165246
7             Pain Keeps Patient from Sleeping     0.162328         0.162328
3                                Dental Health    -0.144045         0.144045
9           Uknown Keeps Patient from Sleeping     0.124760         0.124760
0                    Number of Doctors Visited    -0.074026         0.074026
11                                        Race     0.039128         0.039128
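
Note that `coef_[0]` holds the coefficients for only one of the three classes. As a hedged illustration on synthetic data (the `X_demo` and `y_demo` arrays below are made up, not the survey), a multiclass logistic regression exposes one row of coefficients per class. One possible way to obtain a single ranking is to average the absolute coefficients across classes; this is an alternative sketch, not the method used in the cell above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: 4 features, 3 classes driven mostly by feature 0
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
noisy_signal = X_demo[:, 0] + 0.1 * rng.normal(size=300)
y_demo = np.digitize(noisy_signal, bins=[-0.5, 0.5]) + 1  # labels 1, 2, 3

demo_model = LogisticRegression(max_iter=500).fit(X_demo, y_demo)

print(demo_model.classes_)     # class label belonging to each row of coef_
print(demo_model.coef_.shape)  # (n_classes, n_features)

# Average the absolute coefficients over classes to rank features once
mean_abs_coef = np.abs(demo_model.coef_).mean(axis=0)
print(mean_abs_coef.argsort()[::-1])  # feature indices, most influential first
```

Applied to the notebook, the same aggregation would use `log_reg.coef_` together with `X_train.columns`.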

In [14]:
# Decision Tree for 'Trouble Sleeping'

tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)

# Get feature importance from the decision tree model
importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': tree.feature_importances_
})

# Sort by importance
print(importance.sort_values(by='Importance', ascending=False))


                                       Feature  Importance
3                                Dental Health    0.156260
2                                Mental Health    0.139137
1                              Phyiscal Health    0.132729
0                    Number of Doctors Visited    0.096802
11                                        Race    0.088012
12                                      Gender    0.077007
4                                   Employment    0.063138
10               Prescription Sleep Medication    0.054217
8   Bathroom Needs Keeps Patient from Sleeping    0.051158
7             Pain Keeps Patient from Sleeping    0.047472
5           Stress Keeps Patient from Sleeping    0.040462
9           Uknown Keeps Patient from Sleeping    0.036264
6       Medication Keeps Patient from Sleeping    0.017342


Based on the analysis results, we can summarize the best predictors for **'Trouble Sleeping'** by considering the three methods: **correlation analysis**, **logistic regression coefficients**, and **decision tree feature importance**.

### **1. Correlation Analysis**:
- The feature with the highest correlation to 'Trouble Sleeping' is **Prescription Sleep Medication** (0.288). This suggests that the use of sleep medication is somewhat related to trouble sleeping.
- Other features with weaker but notable correlations include **Pain Keeps Patient from Sleeping** (-0.243) and **Physical Health** (-0.218). **Bathroom Needs Keeps Patient from Sleeping** (0.053) is the only other positive correlation, but it is close to zero.

### **2. Logistic Regression Coefficients**:
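- The **logistic regression** model shows that **Employment** (coefficient: 0.689) and **Physical Health** (coefficient: 0.423) have the largest coefficients. These values come from `log_reg.coef_[0]`, the coefficients for the first class only, so they describe how each feature shifts the odds of Class 1 rather than overall importance.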
- Other important predictors include **Bathroom Needs Keeps Patient from Sleeping**, **Prescription Sleep Medication**, and **Stress Keeps Patient from Sleeping**.

### **3. Decision Tree Feature Importance**:
- The **decision tree** model ranks **Dental Health** (0.156), **Mental Health** (0.139), and **Physical Health** (0.133) as the most important features. This suggests that self-reported health plays a significant role in predicting trouble sleeping in this model. Because the tree is built without a fixed `random_state`, these values can vary slightly between runs.

### **Conclusion**:
Across all methods, several features consistently appear as strong predictors:
- **Physical Health** is identified as important in correlation, logistic regression, and decision tree models.
- **Pain Keeps Patient from Sleeping**, **Medication Keeps Patient from Sleeping**, and **Stress Keeps Patient from Sleeping** show some of the strongest correlations (-0.243, -0.160, and -0.175) and moderate logistic regression coefficients.
- **Prescription Sleep Medication** is a top predictor in correlation and logistic regression.

Thus, the **best predictors** for 'Trouble Sleeping' are likely **Physical Health**, **Pain Keeps Patient from Sleeping**, **Stress Keeps Patient from Sleeping**, **Medication Keeps Patient from Sleeping**, and **Prescription Sleep Medication**.

In [15]:
# Select the best predictors and the target column

predictors = ['Phyiscal Health', 'Pain Keeps Patient from Sleeping', 
              'Stress Keeps Patient from Sleeping', 'Medication Keeps Patient from Sleeping', 
              'Prescription Sleep Medication']
X = sleep_df[predictors]
y = sleep_df['Trouble Sleeping']


# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [16]:
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [17]:
# Logistic Regression model
log_reg = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=500)
log_reg.fit(X_train_scaled, y_train)

# Predictions and evaluation
y_pred_log_reg = log_reg.predict(X_test_scaled)
print("\n### Logistic Regression ###")
print("Classification Report:\n", classification_report(y_test, y_pred_log_reg))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_log_reg))

# ROC AUC Score
roc_auc_log_reg = roc_auc_score(y_test, log_reg.predict_proba(X_test_scaled), multi_class='ovr')
print("ROC AUC Score:", roc_auc_log_reg)


### Logistic Regression ###
Classification Report:
               precision    recall  f1-score   support

           1       0.00      0.00      0.00        12
           2       0.50      0.47      0.49        57
           3       0.65      0.77      0.71        71

    accuracy                           0.59       140
   macro avg       0.38      0.42      0.40       140
weighted avg       0.53      0.59      0.56       140

Confusion Matrix:
 [[ 0 11  1]
 [ 1 27 29]
 [ 0 16 55]]
ROC AUC Score: 0.7276988862772401


In [18]:
# Decision Tree model
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)

# Predictions and evaluation
y_pred_tree = tree.predict(X_test)
print("\n### Decision Tree ###")
print("Classification Report:\n", classification_report(y_test, y_pred_tree))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_tree))



### Decision Tree ###
Classification Report:
               precision    recall  f1-score   support

           1       0.60      0.25      0.35        12
           2       0.45      0.35      0.40        57
           3       0.58      0.75      0.65        71

    accuracy                           0.54       140
   macro avg       0.55      0.45      0.47       140
weighted avg       0.53      0.54      0.52       140

Confusion Matrix:
 [[ 3  7  2]
 [ 1 20 36]
 [ 1 17 53]]


In [19]:
# Random Forest model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predictions and evaluation
y_pred_rf = rf.predict(X_test)
print("\n### Random Forest ###")
print("Classification Report:\n", classification_report(y_test, y_pred_rf))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))

# ROC AUC Score
roc_auc_rf = roc_auc_score(y_test, rf.predict_proba(X_test), multi_class='ovr')
print("ROC AUC Score:", roc_auc_rf)


### Random Forest ###
Classification Report:
               precision    recall  f1-score   support

           1       0.50      0.08      0.14        12
           2       0.46      0.37      0.41        57
           3       0.58      0.75      0.65        71

    accuracy                           0.54       140
   macro avg       0.51      0.40      0.40       140
weighted avg       0.52      0.54      0.51       140

Confusion Matrix:
 [[ 1  8  3]
 [ 0 21 36]
 [ 1 17 53]]
ROC AUC Score: 0.670672745426895


In [20]:
# Gradient Boosting model
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb.fit(X_train, y_train)

# Predictions and evaluation
y_pred_gb = gb.predict(X_test)
print("\n### Gradient Boosting ###")
print("Classification Report:\n", classification_report(y_test, y_pred_gb))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_gb))

# ROC AUC Score
roc_auc_gb = roc_auc_score(y_test, gb.predict_proba(X_test), multi_class='ovr')
print("ROC AUC Score:", roc_auc_gb)



### Gradient Boosting ###
Classification Report:
               precision    recall  f1-score   support

           1       1.00      0.17      0.29        12
           2       0.46      0.37      0.41        57
           3       0.58      0.75      0.65        71

    accuracy                           0.54       140
   macro avg       0.68      0.43      0.45       140
weighted avg       0.56      0.54      0.52       140

Confusion Matrix:
 [[ 2  7  3]
 [ 0 21 36]
 [ 0 18 53]]
ROC AUC Score: 0.7030483237384654


In [21]:
# Neural Network model
mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000)
mlp.fit(X_train_scaled, y_train)

# Predictions and evaluation
y_pred_mlp = mlp.predict(X_test_scaled)
print("\n### Neural Network (MLP) ###")
print("Classification Report:\n", classification_report(y_test, y_pred_mlp))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_mlp))



### Neural Network (MLP) ###
Classification Report:
               precision    recall  f1-score   support

           1       0.67      0.17      0.27        12
           2       0.45      0.37      0.40        57
           3       0.59      0.75      0.66        71

    accuracy                           0.54       140
   macro avg       0.57      0.43      0.44       140
weighted avg       0.54      0.54      0.52       140

Confusion Matrix:
 [[ 2  8  2]
 [ 1 21 35]
 [ 0 18 53]]


In [22]:
# Hyperparameter tuning for Neural Network (MLP)
param_grid = {
    'hidden_layer_sizes': [(50,), (100,), (50, 50)],
    'alpha': [0.0001, 0.001, 0.01],
    'learning_rate_init': [0.001, 0.01]
}
grid_search = GridSearchCV(MLPClassifier(max_iter=1000), param_grid, cv=5, scoring='roc_auc_ovr')
grid_search.fit(X_train_scaled, y_train)

# Best Parameters
print("Best Parameters from Grid Search: ", grid_search.best_params_)


Best Parameters from Grid Search:  {'alpha': 0.01, 'hidden_layer_sizes': (50,), 'learning_rate_init': 0.001}


## Interpretation

Based on the results, Gradient Boosting is the best model for predicting 'Trouble Sleeping' due to its comparatively balanced performance across all three classes. For Class 1 (No Trouble Sleeping) it reaches 1.00 precision but only 0.17 recall, while maintaining good performance for Class 3 (Yes, Trouble Sleeping) with 0.58 precision and 0.75 recall. Its ROC AUC score is 0.70, second only to Logistic Regression, and its macro-averaged precision (0.68) is the highest of all models.

Logistic Regression performs well for Class 3 but completely misses Class 1, with 0 precision and recall. It has the highest accuracy (0.59) and the highest ROC AUC score (0.73) but lacks balance across classes. Random Forest underperforms, particularly in Class 1 (0.08 recall), and has the lowest ROC AUC score (0.67) of the three models for which it was computed; its accuracy (0.54) matches the other non-linear models. Neural Network and Decision Tree perform similarly, with moderate results for Class 3 but weaker performance for Classes 1 and 2.

In conclusion, Gradient Boosting provides the most reliable and balanced predictions for 'Trouble Sleeping', especially if identifying all patient groups is important. Other models, like Logistic Regression, are good for specific classes but not ideal for overall balance.
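
The per-class figures quoted in this section can be re-derived from a confusion matrix by hand. The sketch below does this for the Gradient Boosting matrix printed earlier, assuming scikit-learn's convention that rows are true classes and columns are predicted classes; `per_class_metrics` is a hypothetical helper written for this illustration.

```python
# Confusion matrix printed for Gradient Boosting above
# rows = true classes 1..3, columns = predicted classes 1..3
conf_mat = [
    [2, 7, 3],
    [0, 21, 36],
    [0, 18, 53],
]

def per_class_metrics(matrix):
    """Return {class_label: (precision, recall)} rounded to two decimals."""
    n = len(matrix)
    metrics = {}
    for k in range(n):
        true_pos = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column sum
        actual_k = sum(matrix[k])                          # row sum
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / actual_k if actual_k else 0.0
        metrics[k + 1] = (round(precision, 2), round(recall, 2))
    return metrics

gb_metrics = per_class_metrics(conf_mat)
accuracy = sum(conf_mat[k][k] for k in range(3)) / sum(map(sum, conf_mat))

print(gb_metrics)          # {1: (1.0, 0.17), 2: (0.46, 0.37), 3: (0.58, 0.75)}
print(round(accuracy, 2))  # 0.54
```

The values agree with the classification report for Gradient Boosting, which is a quick way to check that any numbers cited in the interpretation match the run that produced the outputs.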

## New input

In [23]:
# 1. Select the features and the target
features = ['Phyiscal Health', 'Pain Keeps Patient from Sleeping', 'Stress Keeps Patient from Sleeping', 'Medication Keeps Patient from Sleeping']
X = sleep_df[features]
y = sleep_df['Trouble Sleeping']

# 2. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Build and train a Gradient Boosting model (or any other model like Logistic Regression)
model = Pipeline([
    ('scaler', StandardScaler()),  # Scale the input features
    ('classifier', GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42))
])

# Train the model
model.fit(X_train, y_train)

# 4. Function to predict 'Trouble Sleeping' based on user input
def predict_trouble_sleeping(physical_health, pain, stress, medication):
    # Create a DataFrame for the input data
    input_data = pd.DataFrame({
        'Phyiscal Health': [physical_health],
        'Pain Keeps Patient from Sleeping': [pain],
        'Stress Keeps Patient from Sleeping': [stress],
        'Medication Keeps Patient from Sleeping': [medication]
    })

    # Make a prediction using the trained model
    prediction = model.predict(input_data)[0]

    # Map the prediction to the corresponding label
    if prediction == 1:
        return "No Trouble Sleeping"
    elif prediction == 2:
        return "Little Trouble Sleeping"
    else:
        return "Yes, Trouble Sleeping"

# 5. Example user input and prediction
def user_input_prediction():
    # Get user inputs for the features
    print("Enter the following details:")
    physical_health = int(input("Physical Health (1: Excellent, 2: Very Good, 3: Good, 4: Fair, 5: Poor): "))
    pain = int(input("Pain Keeps Patient from Sleeping (0: No, 1: Yes): "))
    stress = int(input("Stress Keeps Patient from Sleeping (0: No, 1: Yes): "))
    medication = int(input("Medication Keeps Patient from Sleeping (0: No, 1: Yes): "))

    # Get the prediction result
    result = predict_trouble_sleeping(physical_health, pain, stress, medication)
    
    # Display the result
    print(f"Prediction Result: {result}")

# Run the user input function to test the system
user_input_prediction()

Enter the following details:
Physical Health (1: Excellent, 2: Very Good, 3: Good, 4: Fair, 5: Poor): 1
Pain Keeps Patient from Sleeping (0: No, 1: Yes): 1
Stress Keeps Patient from Sleeping (0: No, 1: Yes): 1
Medication Keeps Patient from Sleeping (0: No, 1: Yes): 1
Prediction Result: Little Trouble Sleeping
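
The interactive function above passes whatever the user types straight to the model, so a value such as 7 for Physical Health would still produce a prediction. A minimal sketch of a guard is shown below; `validate_inputs` and `ALLOWED_VALUES` are hypothetical names, and the allowed codes are taken from the input prompts above.

```python
# Hypothetical helper: allowed codes taken from the input prompts above
ALLOWED_VALUES = {
    "physical_health": {1, 2, 3, 4, 5},
    "pain": {0, 1},
    "stress": {0, 1},
    "medication": {0, 1},
}

def validate_inputs(**answers):
    """Return the answers as ints, or raise ValueError on a missing or out-of-range code."""
    validated = {}
    for name, allowed in ALLOWED_VALUES.items():
        if name not in answers:
            raise ValueError(f"Missing value for {name}")
        value = int(answers[name])  # also raises ValueError for non-numeric text
        if value not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}, got {value}")
        validated[name] = value
    return validated

checked = validate_inputs(physical_health="1", pain=1, stress=1, medication=1)
print(checked)
```

The validated dictionary can then be unpacked into the prediction function, for example `predict_trouble_sleeping(**checked)`.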


## Export model

In [25]:
import pickle

# Create a dictionary with the model and encoders
model_data = {
    'model': model,  # Trained model
    'encoders': {
        'Physical Health': {1: 'Excellent', 2: 'Very Good', 3: 'Good', 4: 'Fair', 5: 'Poor'},
        'Pain Keeps Patient from Sleeping': {0: 'No', 1: 'Yes'},
        'Stress Keeps Patient from Sleeping': {0: 'No', 1: 'Yes'},
        'Medication Keeps Patient from Sleeping': {0: 'No', 1: 'Yes'}
    }
}

# Save the model and encoders to a pickle file
with open('sleep_trouble_model.pkl', 'wb') as file:
    pickle.dump(model_data, file)

print("Model and encoders saved successfully.")


Model and encoders saved successfully.
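
The project notes require a server component, and Flask is already imported at the top of the notebook. The following is one possible sketch of such a server, not a finished deployment: the `/predict` route, the JSON field names, and the `create_app` and `load_app` helpers are assumptions made for illustration. It expects the pickle file written above to contain a dictionary with a `'model'` entry.

```python
import pickle

import pandas as pd
from flask import Flask, jsonify, request

FEATURES = [
    "Phyiscal Health",  # spelling matches the dataset column
    "Pain Keeps Patient from Sleeping",
    "Stress Keeps Patient from Sleeping",
    "Medication Keeps Patient from Sleeping",
]
LABELS = {1: "No Trouble Sleeping", 2: "Little Trouble Sleeping", 3: "Yes, Trouble Sleeping"}

def create_app(model):
    """Build a Flask app around any object with a scikit-learn style predict()."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(silent=True) or {}
        missing = [name for name in FEATURES if name not in payload]
        if missing:
            return jsonify({"error": f"missing fields: {missing}"}), 400
        # One-row DataFrame with the same column names used for training
        row = pd.DataFrame([{name: int(payload[name]) for name in FEATURES}])
        prediction = int(model.predict(row)[0])
        return jsonify({"prediction": prediction, "label": LABELS.get(prediction, "Unknown")})

    return app

def load_app(path="sleep_trouble_model.pkl"):
    """Load the dictionary saved above and wrap its model in the app."""
    with open(path, "rb") as file:
        model_data = pickle.load(file)
    return create_app(model_data["model"])
```

Calling `load_app().run(port=5000)` from a separate script would start Flask's development server; a client could then POST a JSON object containing the four feature names to `/predict`. The port number is an arbitrary choice.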
