# **Predicting Stress Level By Training a Multilayer Perceptron**

This machine learning project aims to predict the stress level being experienced by an individual given their sleep health and various lifestyle habits. The project uses a Multilayer Perceptron (MLP) as the machine learning algorithm. [Sleep Health Data](https://www.kaggle.com/datasets/imaginativecoder/sleep-health-data-sampledt) is the dataset used for the project, and is acquired from Kaggle. 

This project is a course requirement for CS 180 (Artificial Intelligence) of the Department of Computer Science, College of Engineering, University of the Philippines Diliman, under the guidance of Carlo Raquel for A.Y. 2023-2024.

- MAXIMO, Calvin James T.
- MENDOZA, Janelle M.
- MURILLO, Joana Marie V.

The GitHub repository for this project can be accessed [here](https://github.com/cjmax34/cs180-project).

## 1. Importing libraries

First, we import the libraries needed for the project. The libraries imported here are for loading the dataset and performing exploratory data analysis (EDA). Additional libraries for modeling and metrics will be imported later in the notebook.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## 2. Loading the dataset

Next, we load the dataset and store it in a variable named `dataset`.

In [None]:
dataset = pd.read_csv('Sleep_Data_Sampled.csv')

## 3. Performing Exploratory Data Analysis (EDA)

Now we will perform EDA on our dataset. This is important to help us understand the dataset more before we make assumptions. There might be patterns, trends, and relationships in the dataset that may not be visible at first glance. There might also be rows that contain missing values, outliers, inconsistencies, or biases, which could lead to inaccurate results. By gaining a deeper understanding of the dataset at hand through performing EDA, we are able to choose the appropriate techniques and approaches in training the model.

### Data exploration

#### Shape of dataset

We want to know the shape of the dataset, that is, the number of rows and columns it has. We can easily do this by accessing the `shape` attribute of `dataset`.

In [None]:
# Prints (# of rows, # of columns)
dataset.shape

Our dataset has 15000 rows (!!) and 13 columns/features.
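As a small illustration of how `shape` behaves, the sketch below uses a hypothetical three-row frame named `toy` (not the project's dataset). It shows that `shape` is an attribute that returns a `(rows, columns)` tuple, which can be unpacked directly.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "Age": [25, 31, 47],
    "Sleep Duration": [6.5, 7.0, 8.1],
})

# shape is an attribute (no parentheses) holding (rows, columns)
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")
```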

#### Getting information about the dataset

We want more information about the dataset. We can do this by calling `info()` on `dataset`. This provides us essential information such as the number of columns, column names, and data types of each column.

In [None]:
# Provides information about the dataset
dataset.info()

In [None]:
# Print column names of dataset
dataset.columns

We can see that there are 13 columns in total, namely: Person ID, Gender, Age, Occupation, Sleep Duration, Quality of Sleep, Physical Activity Level, Stress Level, BMI Category, Blood Pressure, Heart Rate, Daily Steps, and Sleep Disorder. 

The data types across all columns are float, int, and object (string).

#### Unique values per column

In this part, we want to identify the unique values for each of the dataset's columns/features. This helps us verify that values are formatted consistently across all columns and assess the complexity of each categorical feature (number of unique values).

In [None]:
# Number of unique values per column
dataset.nunique()

The number of unique Person IDs is 15000, which makes sense because it is used to identify an individual in the dataset. We will deal with this later.

In [None]:
# Unique values per categorical feature
for col in dataset.select_dtypes(include='object').columns:
    print(f"{col}: {list(dataset[col].unique())}")

From the output above, it seems that the formatting of each feature's values is consistent. However, we have to preprocess some of the values, such as "Male" and "Female" in the Gender feature, "Normal Weight" in the BMI Category feature, and the systolic and diastolic blood pressure measurements in the Blood Pressure feature. We will delve into this later on.


##### Renaming of column values

As seen from the output of the previous code block, the unique values of the `BMI Category` feature are `Normal Weight`, `Normal`, `Overweight`, and `Obese`. We want to change `Normal Weight` to `Normal` since they are equivalent.

In [None]:
# Rename Normal Weight to Normal
dataset['BMI Category'] = dataset['BMI Category'].str.replace('Normal Weight', 'Normal')
dataset['BMI Category'].unique()    # Verify that there are only three unique values  

In [None]:
dataset.head()

We have successfully renamed all `Normal Weight` entries to `Normal`.
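An alternative worth noting: `str.replace` matches substrings, whereas `Series.replace` with a dictionary matches whole values only. The sketch below uses a hypothetical series `bmi` with the same four labels to show the exact-match variant. It is an illustration under that assumption, not a change to the approach used above.

```python
import pandas as pd

# Hypothetical series with the same labels as the BMI Category feature
bmi = pd.Series(
    ["Normal", "Normal Weight", "Overweight", "Obese"],
    name="BMI Category",
)

# Series.replace with a dict replaces whole values only (no substring matching)
cleaned = bmi.replace({"Normal Weight": "Normal"})
print(sorted(cleaned.unique()))
```

Both approaches give the same result here because no other label contains the text `Normal Weight`.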

#### Statistical summary

The `describe()` method returns a statistical summary of the numerical features in the given dataset. For each column, it returns the following information.

count - The number of non-empty values.\
mean - The average (mean) value.\
std - The standard deviation.\
min - The minimum value.\
25% - The 25th percentile.\
50% - The 50th percentile (median).\
75% - The 75th percentile.\
max - The maximum value.

Reference: https://www.w3schools.com/python/pandas/ref_df_describe.asp

In [None]:
dataset.describe()

We can see from the table above that all numerical features have the same count (15000).

The average age is 44.13 years. The average sleep duration is 7 hours. The average quality of sleep (1-10, 1 being the lowest and 10 being the highest) is 7.13. The average amount of daily physical activity is 59.93 minutes, or nearly an hour. The average stress level (1-10, 1 being the lowest and 10 being the highest) is 5.65. The average heart rate is 70.86 beats per minute. The average number of daily steps is 6795.
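To make the percentile rows of `describe()` concrete, here is a minimal sketch on a hypothetical five-value series named `scores`. It shows that the `25%`, `50%`, and `75%` rows correspond to what `quantile` returns for 0.25, 0.5, and 0.75.

```python
import pandas as pd

# Hypothetical values, chosen so the quartiles are easy to check by hand
scores = pd.Series([1, 2, 3, 4, 5], name="Quality of Sleep")

summary = scores.describe()                       # count, mean, std, min, quartiles, max
quartiles = scores.quantile([0.25, 0.5, 0.75])    # same quartiles, computed directly

print(summary)
print(quartiles)
```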

#### Checking for the presence of null values

An important part of data exploration is checking for the presence of null values in the dataset. It is vital to handle null values because they can produce inaccurate or misleading results. Many machine learning algorithms, including MLPs, also cannot handle null values directly. Knowing where null values occur helps us decide how best to handle them, for example through imputation or removal of rows.

In [None]:
# Checking for null values
dataset.isnull().sum()

Fortunately, there are no null values in **ANY** column of the dataset.
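Although our dataset needs no such treatment, the two handling strategies mentioned earlier (removal of rows and imputation) can be sketched on a hypothetical frame `toy` that does contain missing values. Median imputation is used here only as an example; the best strategy depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing entries
toy = pd.DataFrame({
    "Sleep Duration": [7.0, np.nan, 6.0, 8.0],
    "Heart Rate": [70, 72, np.nan, 68],
})

print(toy.isnull().sum())   # number of nulls per column

# Option 1: remove every row that contains at least one null
dropped = toy.dropna()

# Option 2: fill each null with the median of its column
imputed = toy.fillna(toy.median(numeric_only=True))
print(imputed)
```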

### Data visualization

Data visualization is another important part of every machine learning project. In this part, we will be using graphs to better understand the patterns and trends in the data. These patterns and trends can easily be understood through visualization especially if they are hard to discern from the raw data. It condenses quite complex information into an easily digestible format. We can also discover what features correlate strongly with the target variable (stress level). Ultimately, it is a powerful tool for effectively communicating insights on data to a wider audience.

#### Distribution of stress level

We want to visualize the distribution of the stress level feature, which is our target variable, in our dataset. The code block below is from [Tanaya Tipre's project](https://www.kaggle.com/code/tanayatipre/stress-level-detection#5.-Data-Visualization), with the labels modified for clarity.

In [None]:
# Distribution of stress level
sns.countplot(x='Stress Level', data=dataset)

plt.xlabel('Stress Level')

plt.ylabel('Count')

plt.title('Distribution of Stress Level')

# Displaying the plot
plt.show()

From the figure above, we can observe that the stress level is not distributed evenly, and a stress level of 6 is found across many records in the dataset.
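The imbalance seen in a count plot can also be quantified. The following sketch uses a hypothetical series `stress` (made-up values, not taken from the dataset) and `value_counts(normalize=True)` to express each level as a share of all records.

```python
import pandas as pd

# Hypothetical stress levels; the real distribution will differ
stress = pd.Series([6, 6, 6, 4, 4, 8, 3, 6], name="Stress Level")

# Share of records per stress level, ordered by level
proportions = stress.value_counts(normalize=True).sort_index()
print(proportions)
```

Running the same call on `dataset['Stress Level']` would give the actual class proportions.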

#### Distribution of numerical features

We want to visualize the distribution of the numerical features of the dataset. We might be able to identify outliers and potential patterns across these features.

In [None]:
# List of features to plot histograms for
features = [col for col in dataset.columns if col not in ['Stress Level', 'Person ID']]

# Plot histograms for each feature
dataset.hist(column=features, bins=10, figsize=(10, 10))
plt.suptitle("Histograms of Features", fontsize=20)
plt.show()

#### Distribution of categorical features

Visualizing the distribution of stress levels by Gender, BMI Category, and Sleep Disorder allows for a visual examination of trends and relationships in the data.


In [None]:
sns.boxplot(x='Gender', y='Stress Level', data=dataset)
plt.title('Distribution of Stress Levels by Gender')
plt.show()

In [None]:
sns.boxplot(x='BMI Category', y='Stress Level', data=dataset)
plt.title('Distribution of Stress Levels by BMI Category')
plt.show()

In [None]:
sns.boxplot(x='Sleep Disorder', y='Stress Level', data=dataset)
plt.title('Distribution of Stress Levels by Sleep Disorder')
plt.show()

#### Solving and visualizing the correlation matrix

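We want to identify relationships among the variables. Features that correlate strongly with the target are likely to be useful predictors, while features that correlate strongly with each other may be redundant. Knowing both helps in avoiding redundancy and building more accurate models.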


In [None]:
# Keep only the numerical columns, excluding the Person ID identifier
numerical_features = dataset.select_dtypes(include=['number']).drop(columns='Person ID')

# Calculate the correlation matrix
correlation_matrix = numerical_features.corr()

# Print the correlation matrix
print(correlation_matrix)

In [None]:
plt.figure(figsize = (10, 10))
sns.heatmap(correlation_matrix, cmap = 'crest', annot = True)
plt.show()
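When the main interest is the target variable, it can help to pull out a single column of the correlation matrix and sort it. The sketch below does this on synthetic data: the frame `toy`, its coefficients, and the noise levels are all assumptions chosen so that the signs of the correlations are obvious.

```python
import numpy as np
import pandas as pd

# Synthetic data: stress falls as sleep rises, heart rate rises with stress
rng = np.random.default_rng(0)
sleep = rng.normal(7, 1, 200)
stress = 12 - sleep + rng.normal(0, 0.1, 200)
heart = 60 + 2 * stress + rng.normal(0, 0.1, 200)

toy = pd.DataFrame({
    "Sleep Duration": sleep,
    "Heart Rate": heart,
    "Stress Level": stress,
})

# Correlation of every feature with the target, sorted from most negative to most positive
target_corr = toy.corr()["Stress Level"].drop("Stress Level").sort_values()
print(target_corr)
```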

##### Positively correlated columns
"Sleep Duration" and "Quality of Sleep"\
"Physical Activity Level" and "Daily Steps"\
"Stress Level" and "Heart Rate"

##### Negatively correlated columns
"Stress Level" and "Sleep Duration"\
"Stress Level" and "Quality of Sleep"

##### No correlation with target column (stress level)
"Physical Activity Level"

## 4. Data preprocessing

Now that we have uncovered the possible trends and patterns in the dataset, we will now perform preprocessing. This involves renaming of columns or column values (if applicable), dropping of unused features, normalization of numerical features, label encoding (converting categorical data to numerical data), and dealing with null values.

### Importing libraries for preprocessing

In [None]:
# Importing the necessary libraries to conduct preprocessing
from sklearn.preprocessing import LabelEncoder, StandardScaler

There are categorical features in the dataset such as `Gender`, `Occupation`, `BMI Category`, `Sleep Disorder`, and also the target variable `Stress Level`. Label encoding is utilized here to convert categorical values into numerical values by assigning a unique numerical label for each categorical value. This is done by importing `LabelEncoder`.

On the other hand, standardizing numerical features such as `Age`, `Sleep Duration`, `Physical Activity Level`, `Blood Pressure`, `Heart Rate`, and `Daily Steps` is essential. This process ensures that variables with different units and scales contribute equally to the model, preventing biases and allowing machine learning algorithms to effectively learn patterns, leading to improved model performance.

Reference: https://www.kaggle.com/code/alnourabdalrahman9/sleeping-disorders-detection-outliers-removal#Splitting-the-Data:
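As a minimal sketch of how standardization with `StandardScaler` is commonly applied, the example below fits the scaler on the training portion only and reuses it on the test portion, so that test statistics do not leak into training. The array `X_toy` and the variable names are hypothetical and stand in for the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (e.g. age and daily steps)
rng = np.random.default_rng(22)
X_toy = np.column_stack([
    rng.normal(45, 10, 100),
    rng.normal(7000, 1500, 100),
])

X_tr, X_te = train_test_split(X_toy, test_size=0.2, random_state=22)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)   # learn mean and std from training data
X_te_scaled = scaler.transform(X_te)       # apply the same transformation to test data

print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
```

After scaling, each training column has a mean of about 0 and a standard deviation of about 1; the test columns will be close to, but not exactly, those values.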

### Splitting blood pressure

The `Blood Pressure` feature in our dataset is currently in the format of systolic pressure/diastolic pressure. We want to split them up such that there are separate columns for the systolic pressure and diastolic pressure to avoid inconsistencies when training the MLP.

Systolic pressure is the maximum blood pressure during contraction of the ventricles; diastolic pressure is the minimum pressure recorded just prior to the next contraction.

Reference: https://www.ncbi.nlm.nih.gov/books/NBK268/

In [None]:
# Split the blood pressure feature into systolic blood pressure and diastolic blood pressure
dataset[['Systolic Blood Pressure', 'Diastolic Blood Pressure']] = dataset['Blood Pressure'].str.split('/', n=1, expand=True)

# Convert the data type of the newly created columns to numeric
dataset[['Systolic Blood Pressure', 'Diastolic Blood Pressure']] = dataset[['Systolic Blood Pressure', 'Diastolic Blood Pressure']].apply(pd.to_numeric)

# Verify that the data type conversion is successful
print(dataset[['Systolic Blood Pressure', 'Diastolic Blood Pressure']].dtypes)
dataset.head() # Verify that the splitting is successful

We can see from the output of the code block above that the splitting of blood pressure data into systolic and diastolic blood pressures was successful. The original `Blood Pressure` column is still present at this point; it is dropped later together with the other unnecessary features.

### Setting stress level as the last column

To streamline the process of getting the training and testing data later, we will set the `Stress Level` feature as the last column of `dataset`. 

In [None]:
# Move the target column (stress level) to the last column
dataset['Stress Level'] = dataset.pop('Stress Level')
dataset.columns

### Dropping of unnecessary features

Some of the features in the dataset were deemed unnecessary and will be dropped/removed. Recall that we want to predict the stress level of an individual given their sleep health and lifestyle habits.

In [None]:
# Remove some features because they are unnecessary for analysis
features_to_remove = ['Person ID', 'Occupation', 'Blood Pressure']

dataset.drop(features_to_remove, axis=1, inplace=True)
dataset.head()  # Verify that the features in the features_to_remove list have been dropped from the dataset

### Label encoding

Now we will convert the categorical features (`Gender`, `BMI Category`, `Sleep Disorder`) to numerical data through the use of the `LabelEncoder` class.

In [None]:
label_encoder = LabelEncoder()  # Instantiate the label encoder

categorical_features = dataset.select_dtypes(include=['object']).columns.tolist()
for cat in categorical_features:
    dataset[cat] = label_encoder.fit_transform(dataset[cat])

dataset.head()  # Verify that the categorical features have been encoded
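Reusing a single `LabelEncoder` works for encoding, but each call to `fit_transform` overwrites the mapping learned for the previous column. If the original labels need to be recovered later, one option is to keep one encoder per column. The sketch below assumes a hypothetical frame `toy` and a dictionary named `encoders`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical columns
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "BMI Category": ["Normal", "Obese", "Overweight"],
})

# One fitted encoder per column, so every mapping stays available
encoders = {}
for col in toy.columns:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# classes_ lists the labels in the order of their integer codes
print(list(encoders["Gender"].classes_))

# Recover the original labels from the encoded values
original = encoders["Gender"].inverse_transform(toy["Gender"])
print(list(original))
```

`LabelEncoder` assigns codes in sorted label order, so `Female` becomes 0 and `Male` becomes 1 here.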

## 5. Importing methods needed for training, modelling, and metrics

In [None]:
# For generating the training and testing sets (80% training, 20% testing)
from sklearn.model_selection import train_test_split

# For evaluating the model's performance
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# For the main model (Multilayer Perceptron)
from sklearn.neural_network import MLPClassifier

# For the other models that will be used for comparison
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm 
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

## 6. Generating the training and testing sets

After performing EDA and preprocessing the dataset, we are now ready to generate the data that will be used in training the model. For this project, we will use an 80-20 split (80% training, 20% testing).

In [None]:
# Initializing the features and target variables
X = dataset.drop('Stress Level', axis=1)
y = dataset['Stress Level']

# Generating the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)

# Displaying the dimensions of the training and testing sets
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")

In [None]:
# Print X_train
X_train
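Because the stress levels are not evenly distributed, a plain random split may give the training and testing sets slightly different class proportions. One common option, shown here only as a sketch on hypothetical arrays `X_toy` and `y_toy`, is to pass `stratify` to `train_test_split` so that both sets keep the original class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 samples of class 0, 20 of class 1
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy preserves the 80/20 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=22, stratify=y_toy
)

print(np.bincount(y_tr), np.bincount(y_te))
```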

## 7. Comparing to other models


### Logistic Regression

In [None]:
# Create a logistic regression model
log_reg = LogisticRegression()

# Fit the model to the training data
log_reg.fit(X_train, y_train)

# Predict on the testing data
y_predict = log_reg.predict(X_test)

# Confusion matrix
conf_matr = confusion_matrix(y_test, y_predict)
print("Confusion Matrix : \n", conf_matr)

classif_rep = classification_report(y_test, y_predict)
print("Classification Report:\n", classif_rep)

# Printing the test accuracy
print("The test accuracy of Logistic Regression is : ", accuracy_score(y_test, y_predict) * 100, "%")

### Naive Bayes

In [None]:
# Create a naive bayes model
naive_bayes = GaussianNB()

# Fit the model to the training data
naive_bayes.fit(X_train,y_train)

# Predict on the testing data
y_predict = naive_bayes.predict(X_test)

# Confusion matrix
conf_matr = confusion_matrix(y_test, y_predict)
print("Confusion Matrix: \n", conf_matr)

classif_rep = classification_report(y_test, y_predict)
print("Classification Report: \n", classif_rep)

# Printing the test accuracy
print("The test accuracy of Naive Bayes is : ", naive_bayes.score(X_test,y_test) * 100, "%")

### Support Vector Machine

In [None]:
# Create an svm model
svm_classifier = svm.SVC(kernel='linear')

# Fit the model to the training data
svm_classifier.fit(X_train, y_train)

# Predict on the testing data
y_predict = svm_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_predict)

# Confusion matrix
conf_matr = confusion_matrix(y_test, y_predict)
print("Confusion Matrix: \n", conf_matr)

classif_rep = classification_report(y_test, y_predict)
print("Classification Report: \n", classif_rep)

# Printing the test accuracy
print("The test accuracy of SVM is : ", accuracy * 100, "%")

### KNN Classifier

In [None]:
# Create knn model
knn = KNeighborsClassifier()

# Fit the model to the training data
knn.fit(X_train, y_train)

# Predict on the testing data
y_predict = knn.predict(X_test)

# Confusion matrix
conf_matr = confusion_matrix(y_test, y_predict)
print("Confusion Matrix: \n", conf_matr)

classif_rep = classification_report(y_test, y_predict)
print("Classification Report: \n", classif_rep)

# Printing the test accuracy
print("The test accuracy of KNN Classifier is : ", knn.score(X_test,y_test) * 100, "%")

### Random Forest Classifier

In [None]:
# Create a random forest classifier model
random_forest = RandomForestClassifier(n_estimators=13)

# Fit the model to the training data
random_forest.fit(X_train,y_train)

# Predict on the testing data
y_predict = random_forest.predict(X_test)

# Confusion matrix
conf_matr = confusion_matrix(y_test, y_predict)
print("Confusion Matrix: \n", conf_matr)

classif_rep = classification_report(y_test, y_predict)
print("Classification Report: \n", classif_rep)

# Printing the test accuracy
print("The test accuracy of Random Forest Classifier is : ", random_forest.score(X_test,y_test) * 100, "%")

### Decision Tree

In [None]:
# Create a decision tree model
decision_tree = DecisionTreeClassifier()

# Fit the model to the training data
decision_tree.fit(X_train,y_train)

# Predict on the testing data
y_predict = decision_tree.predict(X_test)

# Confusion matrix
conf_matr = confusion_matrix(y_test, y_predict)
print("Confusion Matrix: \n", conf_matr)

classif_rep = classification_report(y_test, y_predict)
print("Classification Report: \n", classif_rep)

# Printing the test accuracy
print("The test accuracy of Decision Tree is : ", decision_tree.score(X_test,y_test) * 100, "%")
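The per-model cells above repeat the same fit, predict, and report steps. One possible way to reduce the repetition, and the risk of printing a stale variable, is a small helper that is called in a loop. The following is a minimal sketch and not the project's code: the function `evaluate`, the dictionary `models`, and the synthetic data from `make_classification` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate(model, X_tr, X_te, y_tr, y_te):
    """Fit the model, predict on the test set, and return accuracy and confusion matrix."""
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    return accuracy_score(y_te, y_pred), confusion_matrix(y_te, y_pred)


# Synthetic three-class data standing in for the preprocessed dataset
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=22
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=22
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=22),
}

results = {}
for name, model in models.items():
    acc, cm = evaluate(model, X_tr, X_te, y_tr, y_te)
    results[name] = (acc, cm)
    print(f"{name}: test accuracy = {acc * 100:.2f}%")
```

With the real data, `X_train`, `X_test`, `y_train`, and `y_test` would take the place of the synthetic splits, and the remaining classifiers could be added to `models`.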