# Case 1. Heart Disease Classification
**Neural Networks for Machine Learning Applications**<br>
23.01.2023<br>
## Team X<br>
## Amal Kayed
## Julian Marco Soliveres
## Mateusz Czarnecki
[Information Technology, Bachelor's Degree](https://www.metropolia.fi/en/academics/bachelors-degrees/information-technology)<br>
[Metropolia University of Applied Sciences](https://www.metropolia.fi/en)

## 1. Introduction

The main objective of this notebook is to use neural networks to build an **expert system** that supports diagnostic decision making. Along the way, we deepen our deep learning knowledge by testing different model architectures and using visualization tools and metrics.

Our task is to first read and preprocess the **Heart Disease Health Indicators Dataset** and then create a neural network that predicts the presence of heart disease among the patients in the dataset.

## 2. Setup

We need several Python libraries to complete this assignment.

The **pandas** and **numpy** libraries allow us to create and operate on DataFrames, read the data from a .csv file, and perform array operations such as concatenation.

The **imblearn** library will become useful when resampling (balancing) the data.

The **sklearn** library is going to help us with **normalizing** the data and splitting it into training, validation and testing sets.

We need the **matplotlib** and **seaborn** libraries to visualize how often each value occurs in the dataset.

The **tensorflow** library is going to help us when building the machine learning model.

In [1]:
import pandas as pd
import numpy as np
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN

from sklearn import preprocessing
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sns

import tensorflow as tf
from tensorflow.keras import models
from tensorflow.keras import layers

ModuleNotFoundError: No module named 'tensorflow'

In [None]:
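# Shell commands in a notebook cell need the "!" prefix
!python3.10 -m pip install --upgrade pip
# Install the package that the import above could not find
!python3.10 -m pip install tensorflow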

## 3. Dataset

The dataset consists of **253,680** survey responses collected by the Behavioral Risk Factor Surveillance System (BRFSS) in 2015. Its primary purpose is binary classification of heart disease. **229,787 of the respondents never had heart disease**, while **23,893 have or had heart disease**.

A detailed description of the dataset can be found at: https://www.kaggle.com/datasets/alexteboul/heart-disease-health-indicators-dataset

In [None]:
df = pd.read_csv("heart_disease_health_indicators_BRFSS2015.csv")
df.sample(10)

In [None]:
df.shape

## 4. Preprocessing

Instructions: Describe:

- how the missing values are handled
- conversion of textual and categorical data into numerical values (if needed)
- how the data is split into train, validation and test sets
- the features (=input) and labels (=output), and 
- how the features are normalized or scaled

To start, let's inspect the column types, then check whether there are any missing values and whether the classes in our dataset need balancing.

In [None]:
df.dtypes

Let's start the preprocessing by converting all values from float to integer. We don't need the float type because all the numbers are in fact whole numbers.

In [None]:
df = df.astype(int)

In [None]:
df.dtypes

The next thing we are going to do is split the dataframe columns into features (inputs) and labels (outputs). 

The counts printed below also show a large **imbalance** between the disease cases and the healthy cases.

In [None]:
# Split the columns into features (inputs) and the target label (output)
features = df.drop(columns='HeartDiseaseorAttack')
labels = df['HeartDiseaseorAttack']

# In HeartDiseaseorAttack, 1 marks a disease case and 0 a healthy case
print(f'Disease cases: {sum(labels == 1)}')
print(f'Healthy cases: {sum(labels == 0)}')

To give the neural network enough examples of both classes, we need to balance our data. We are going to make the number of disease cases and healthy cases equal using the **RandomOverSampler** class from the imblearn library, which duplicates randomly chosen samples of the minority class.

In [None]:
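# A fixed random_state makes the resampling reproducible
random_sampler = RandomOverSampler(random_state=48)
features, labels = random_sampler.fit_resample(features, labels)

print('Resampled data')
# 1 marks a disease case and 0 a healthy case
print(f'Disease cases: {sum(labels == 1)}')
print(f'Healthy cases: {sum(labels == 0)}')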
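
To illustrate what random oversampling does, here is a minimal NumPy sketch on toy data. The helper `random_oversample` and the arrays `X_toy` and `y_toy` are hypothetical names made up for this example; this is not imblearn's implementation, only the same idea in a few lines.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate random minority-class rows until all classes have equal size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts_X, parts_y = [X], [y]
    for cls, count in zip(classes, counts):
        if count < target:
            # Indices of the rows that belong to the smaller class
            idx = np.flatnonzero(y == cls)
            # Draw the missing number of rows with replacement
            extra = rng.choice(idx, size=target - count, replace=True)
            parts_X.append(X[extra])
            parts_y.append(y[extra])
    return np.concatenate(parts_X), np.concatenate(parts_y)

# Toy data (assumption): 8 healthy rows (label 0) and 2 disease rows (label 1)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.bincount(y_bal))  # number of rows per class after balancing
```

Because the added rows are exact copies, resampling before the train/test split can put the same row into both sets. A common alternative is to resample only the training split.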

After separating the features from the labels and resampling the data, let's study the dataset in more detail.

In [None]:
features.isnull().sum()

We can see that there are no null values in the dataset.

Let's take a brief look at the dataset's statistics.

In [None]:
features.describe()

In [None]:
features.columns

In [None]:
catcol = ['HighBP', 'HighChol', 'CholCheck', 'BMI',
       'Smoker', 'Stroke', 'Diabetes', 'PhysActivity', 'Fruits', 'Veggies',
       'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 'GenHlth',
       'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age', 'Education',
       'Income']

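n_cols = 2
n_rows = (len(catcol) + n_cols - 1) // n_cols  # just enough rows for every column

plt.figure(figsize=(15,40))
for i,column in enumerate(catcol):
    plt.subplot(n_rows, n_cols, i+1)
    sns.countplot(data=features, x=column)
    plt.title(f"{column}")

# Set the figure title and fix the layout once, after all subplots are drawn
plt.suptitle("Plot Value Count", fontsize=20, x=0.5, y=1)
plt.tight_layout()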

Thanks to the statistics, we can clearly divide the features into three sets:
* binary features having values 0 or 1
* numerical features having a range of values
* categorical features having numerical categorical values

In [None]:
bin_features = ['HighBP', 'HighChol', 'CholCheck', 'Smoker', 'Stroke', 
                'PhysActivity', 'Fruits', 'Veggies', 'HvyAlcoholConsump',
               'AnyHealthcare', 'NoDocbcCost', 'DiffWalk', 'Sex']

num_features = ['BMI','MentHlth', 'PhysHlth']

cat_features = ['Diabetes', 'GenHlth', 'Age', 'Education', 'Income']

The binary features need no transformation, so we simply collect them into one array.

In [None]:
bin_values = features[bin_features].values
bin_values

For the numerical values, we will use the **RobustScaler** class from sklearn's preprocessing module. It subtracts the median of each column and divides by its interquartile range, so outliers influence the scaling less and the values are easier for our future model to learn from.

In [None]:
transformer = preprocessing.RobustScaler().fit(features[num_features])
num_values = transformer.transform(features[num_features])

num_values
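
As a sketch of the scaling described above, the following NumPy snippet computes (value - median) / IQR by hand, which is what RobustScaler does with its default settings. The function `robust_scale` and the toy column `bmi_toy` are hypothetical names for illustration only.

```python
import numpy as np

def robust_scale(column):
    """Center a 1-D array on its median and divide by its interquartile range."""
    median = np.median(column)
    q1, q3 = np.percentile(column, [25, 75])
    return (column - median) / (q3 - q1)

# Toy BMI-like values (assumption) with one extreme outlier
bmi_toy = np.array([20.0, 22.0, 24.0, 26.0, 28.0, 90.0])
scaled = robust_scale(bmi_toy)
print(scaled)
```

The outlier still stands out after scaling, but it does not distort the scale of the remaining values, as it would with min-max scaling.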

For the categorical values, we will use the **OneHotEncoder** class from sklearn's preprocessing module, which turns each numerical category code into a binary array with a single 1 marking the given category.

In [None]:
encoder = preprocessing.OneHotEncoder().fit(features[cat_features])
cat_values = encoder.transform(features[cat_features]).toarray()

cat_values
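
To show what one-hot encoding produces, here is a minimal NumPy sketch for a single column. The helper `one_hot` and the toy column `gen_hlth_toy` are hypothetical; sklearn's OneHotEncoder handles several columns at once and returns a sparse matrix by default, which is why the cell above calls `.toarray()`.

```python
import numpy as np

def one_hot(column):
    """Return the sorted categories and a one-hot matrix with one row per value."""
    categories, codes = np.unique(column, return_inverse=True)
    # Row k of the identity matrix is the one-hot vector for category index k
    return categories, np.eye(len(categories), dtype=int)[codes]

# Toy GenHlth-like column (assumption): only the categories 1, 2, 3 and 5 occur
gen_hlth_toy = np.array([1, 3, 5, 3, 2])
categories, encoded = one_hot(gen_hlth_toy)
print(categories)
print(encoded)
```

Each row contains exactly one 1, and the number of output columns equals the number of distinct categories found in the data.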

After transforming each group of features, we concatenate them all into one DataFrame.

In [None]:
all_values = np.concatenate((bin_values, num_values, cat_values), axis = 1)
features = pd.DataFrame(all_values)

Here is what our features look like after preprocessing:

In [None]:
features

The last step of the preprocessing is to divide the features and labels into training, testing and validation sets.

In [None]:
train_data, test_data, train_labels, test_labels = train_test_split(features, labels, test_size=0.2, stratify=labels, random_state=48)

In [None]:
train_data, val_data, train_labels, val_labels = train_test_split(train_data, train_labels, test_size=0.2, stratify=train_labels, random_state=48)

After applying the **train_test_split()** function twice, each time with an 80/20 split, the data is divided into training, validation and testing sets in the following proportions:
* Training set   - 64%
* Validation set - 16%
* Testing set    - 20%
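
The proportions above follow from applying a 20% split twice. A small arithmetic sketch (the variable names are made up for this example):

```python
test_fraction = 0.2          # first split: 20% of all rows go to the test set
val_fraction_of_rest = 0.2   # second split: 20% of the remaining rows go to validation

remaining = 1.0 - test_fraction
val_share = remaining * val_fraction_of_rest
train_share = remaining - val_share

print(f"train {train_share:.0%}, val {val_share:.0%}, test {test_fraction:.0%}")
```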

In [None]:
train_data = train_data.to_numpy()
train_labels = train_labels.to_numpy()

val_data = val_data.to_numpy()
val_labels = val_labels.to_numpy()

test_data = test_data.to_numpy()
test_labels = test_labels.to_numpy()

# Counting the data %
sum_length = len(train_data) + len(val_data) + len(test_data)
train_percent =  len(train_data) / sum_length * 100
val_percent = len(val_data) / sum_length * 100
test_percent = len(test_data) / sum_length * 100

print('train data %:\t',train_percent)
print('val data %:\t', val_percent)
print('test data %:\t', test_percent)

## 5. Modeling

Instructions: Write a short description of the model: 

- selected loss, optimizer and metrics settings, and 
- the summary of the selected model architecture. 

In [None]:
print(train_data.shape, train_data.dtype)
print(train_labels.shape, train_labels.dtype)
print(val_data.shape, val_data.dtype)
print(val_labels.shape, val_labels.dtype)

In [None]:
model = models.Sequential()
model.add(layers.Dense(5, activation='relu', input_shape=(51,))) # 51 input features after encoding
model.add(layers.Dense(1, activation='sigmoid'))

model.summary()
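
As a rough cross-check of the parameter counts that `model.summary()` reports, the number of trainable parameters of a dense layer is inputs × units weights plus one bias per unit. The helper `dense_params` is a hypothetical name used only for this sketch; the layer sizes are the ones from the model above.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(51, 5)  # Dense(5) on 51 input features
output = dense_params(5, 1)   # Dense(1) on the 5 hidden activations
total = hidden + output

print(hidden, output, total)
```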

## 6. Training

Instructions: Write a short description of the training process, and document the code for training and the total time spent on it. 

In [None]:
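# The string 'Accuracy' can resolve to exact-match accuracy, which fails on sigmoid
# probabilities, so we pass explicit metric objects; the names keep the history keys used below.
model.compile(optimizer='rmsprop', loss='binary_crossentropy',
              metrics=[tf.keras.metrics.BinaryAccuracy(name='Accuracy'),
                       tf.keras.metrics.Recall(name='recall')])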

hist = model.fit(train_data, train_labels, epochs=1, batch_size=16,
                 validation_data = (val_data, val_labels),verbose=0)

print('Accuracy:',hist.history['Accuracy'][-1])
print('Recall:',hist.history['recall'][-1])
print('Val Accuracy:',hist.history['val_Accuracy'][-1])
print('Val Recall:',hist.history['val_recall'][-1])
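
For reference, the loss and the two metrics reported above can be computed by hand. The following NumPy sketch uses made-up toy labels and probabilities (`y_true`, `y_prob`) and a 0.5 decision threshold, which is the default threshold of the Keras binary accuracy and recall metrics. It is an illustration of the definitions, not the Keras implementation.

```python
import numpy as np

# Toy labels and predicted probabilities (assumption, not model output)
y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.7, 0.6])

# Binary cross-entropy: clip the probabilities to avoid log(0)
eps = 1e-7
p = np.clip(y_prob, eps, 1 - eps)
bce = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Threshold the probabilities to get class predictions
y_pred = (y_prob > 0.5).astype(int)

# Accuracy: share of correct predictions
accuracy = np.mean(y_pred == y_true)
# Recall: share of actual disease cases that the model finds
recall = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)

print(bce, accuracy, recall)
```

Recall matters for a diagnostic system because a missed disease case (false negative) is usually more costly than a false alarm.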

In [None]:
plt.plot(hist.history['loss'], label='Loss')
plt.plot(hist.history['val_loss'], label='Val loss')
plt.xlabel('Epochs')
plt.ylabel('Loss value')
plt.legend()
plt.show()

In [None]:
plt.plot(hist.history['Accuracy'], label='Acc')
plt.plot(hist.history['val_Accuracy'], label='Val acc')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

## 7. Performance and evaluation

Instructions: 

- Show the training and validation loss and accuracy plots
- Interpret the loss and accuracy plots (e.g. is there under- or over-fitting)
- Describe the final performance of the model with test set 

In [None]:
# Your code

## 8. Discussion and conclusions

Instructions: Write

- What settings and models were tested before the best model was found
    - What were the results of these experiments 
- Summary of  
    - What was your best model and its settings 
    - What was the final achieved performance 
- What are your main observations and learning points
- Discussion of how the model could be improved in the future 

**Note:** Remember to evaluate the final metrics using the test set. 
