# Case 1. Heart Disease Classification

#### Joonas Lehikoinen, Przemyslaw Zuchmanski
##### 31.01.2020
### Helsinki Metropolia University of Applied Sciences

The main objective is to create and train a dense neural network that predicts the presence of heart disease, based on the Cleveland heart disease data downloaded from https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/ .

## Data
The data contains values of various health factors useful in detecting heart disease. There are 13 factors, stored in 13 columns. The 14th column indicates whether the patient suffers from heart disease. The number of records is 303. Missing values (found in 6 rows) were replaced with 0.

In [1]:
%pylab inline
import pandas as pd
import numpy
from sklearn import preprocessing
import tensorflow as tf
from sklearn.model_selection import train_test_split


# names of the columns
names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
         "thalach", "examg", "oldpeak", "slope", "ca", "thal", "num"]

# read the data, assign the column names and treat '?' as a missing value (NaN)
df = pd.read_csv("processed.cleveland.data",
                 names=names,
                 header=None,
                 index_col=None,
                 na_values='?')

# replace missing values (NaN) with 0
df = df.fillna(0)

Populating the interactive namespace from numpy and matplotlib
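The loading step above turns every `?` into NaN and then replaces it with 0. A minimal sketch of how the affected rows could be counted before the replacement is shown below. The three rows and the four-column subset are made up for illustration and are not taken from `processed.cleveland.data`.

```python
import io
import pandas as pd

# Three made-up rows in the same comma-separated layout; '?' marks a missing value.
raw = io.StringIO(
    "63.0,1.0,145.0,0.0\n"
    "67.0,1.0,160.0,?\n"
    "41.0,0.0,?,1.0\n"
)
toy = pd.read_csv(raw, names=["age", "sex", "trestbps", "ca"],
                  header=None, na_values="?")

# Number of rows that contain at least one missing value
rows_with_nan = int(toy.isna().any(axis=1).sum())
# Number of missing values in each column
nan_per_column = toy.isna().sum()

print("rows with missing values:", rows_with_nan)
print(nan_per_column)

# Same strategy as in the notebook: missing values become 0
filled = toy.fillna(0)
```

Running the same two `isna()` checks on `df` before the replacement would show which columns the missing values come from.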


Basic statistics of the data set are as follows:

In [2]:
print('shape of data set: ', df.shape)
df.describe()

shape of data set:  (303, 14)


| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | examg | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 |
| mean | 54.438944 | 0.679868 | 3.158416 | 131.689769 | 246.693069 | 0.148515 | 0.990099 | 149.607261 | 0.326733 | 1.039604 | 1.60066 | 0.663366 | 4.70297 | 0.937294 |
| std | 9.038662 | 0.467299 | 0.960126 | 17.599748 | 51.776918 | 0.356198 | 0.994971 | 22.875003 | 0.469794 | 1.161075 | 0.616226 | 0.934375 | 1.971038 | 1.228536 |
| min | 29.0 | 0.0 | 1.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 48.0 | 0.0 | 3.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.5 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 |
| 50% | 56.0 | 1.0 | 3.0 | 130.0 | 241.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 2.0 | 0.0 | 3.0 | 0.0 |
| 75% | 61.0 | 1.0 | 4.0 | 140.0 | 275.0 | 0.0 | 2.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 | 7.0 | 2.0 |
| max | 77.0 | 1.0 | 4.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 3.0 | 3.0 | 7.0 | 4.0 |


The data set is divided into two parts: <br>
data - all health factors <br>
labels - indicate whether the person is sick (1) or healthy (0)

In [3]:
# split the set into data (features) and labels
data = df.drop(['num'], axis=1)
# convert the labels to a binary attribute: 1.0 if num > 0, otherwise 0.0
label = 1.0*(df['num'] >0)
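According to the statistics table, `num` ranges from 0 to 4, so the comparison above collapses the values 1 to 4 into a single positive class. A small sketch of that mapping on made-up values, using NumPy instead of a DataFrame column:

```python
import numpy as np

# Made-up values of the 'num' column (the real column ranges from 0 to 4)
num = np.array([0, 2, 1, 0, 4, 3])

# Same expression as in the notebook: True/False scaled to 1.0/0.0
label = 1.0 * (num > 0)

print(label)
```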

## Models and training
Splitting the data. We use 80% of the samples for training and validation. The remaining 20% are used for testing.

In [4]:
train_data, test_data, train_labels, test_labels = train_test_split(
                    data,
                    label,
                    test_size = 0.2,
                    random_state = 39,
                    shuffle = True)
# standardisation, using the mean and standard deviation of the training set only
mean = train_data.mean(axis=0)
std = train_data.std(axis=0)
train_data = (train_data - mean) / std
test_data = (test_data - mean) / std
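The standardisation above takes its statistics from the training set only and reuses them for the test set, so no information from the test samples influences the scaling. A minimal NumPy sketch of the same idea on random data follows. The array shapes and the random generator seed are arbitrary assumptions, and `ddof=1` is used to match the default of the pandas `std` method.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

# Toy 80/20 split
train, test = features[:80], features[80:]

# Statistics come from the training part only
mean = train.mean(axis=0)
std = train.std(axis=0, ddof=1)  # ddof=1 matches the pandas default

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuses the training statistics

print(train_scaled.mean(axis=0))         # close to 0 in every column
print(train_scaled.std(axis=0, ddof=1))  # close to 1 in every column
```

The scaled test set will generally not have exactly zero mean and unit standard deviation, which is expected.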

Creating the model:

In [5]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dropout, Dense
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras import regularizers

# two hidden layers of 34 units, each with L2 regularisation and followed by dropout
model = Sequential()

model.add(Dense(34, input_shape=(13,), activation='relu', kernel_regularizer=regularizers.l2(0.05)))
model.add(Dropout(0.20))
model.add(Dense(34, activation='relu', kernel_regularizer=regularizers.l2(0.05)))
model.add(Dropout(0.20))
# single output unit without an activation (linear output)
model.add(Dense(1))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# checkpoint: save the model whenever the training loss improves
checkpoint_name = 'Disease3x128Batch16.hdf5'
checkpoint = ModelCheckpoint(checkpoint_name, monitor='loss', verbose=0, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
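The model above ends in `Dense(1)` with no activation, while the Keras `binary_crossentropy` loss by default treats the model outputs as probabilities between 0 and 1. A common choice for binary classification is therefore a sigmoid output, for example `Dense(1, activation='sigmoid')`. The NumPy sketch below illustrates what the sigmoid and the loss compute. The function names and the example values are illustrative assumptions and are not the Keras implementation.

```python
import numpy as np

def sigmoid(z):
    """Map raw network outputs (logits) to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    """Mean binary cross-entropy for labels y_true and probabilities p."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))

# Made-up raw outputs of the last layer and the matching labels
logits = np.array([2.0, -1.0, 0.5, -3.0])
labels = np.array([1.0, 0.0, 1.0, 0.0])

probs = sigmoid(logits)
loss = binary_crossentropy(labels, probs)

print(probs)
print(loss)
```

Without the sigmoid, raw outputs can fall outside the interval from 0 to 1, where the logarithms in the loss are not meaningful.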



### Functions
To make testing and showing the results easier, we wrote the following functions.

###### Build and fit the model
Fitting the model. We use 20% of the training data (the 80% left after the test split) for validation.

In [6]:
history = None

def fitModel(num_epo, batch, neuron_amount):
    # The trained model and its training history are stored in global
    # variables so that later cells can evaluate and plot them.
    global model, history

    # layer parameters
    activation_f = "relu"
    dropout = 0.2

    # two hidden layers of equal size, each with L2 regularisation and followed by dropout
    model = Sequential()
    model.add(Dense(neuron_amount, input_shape=(13,), activation=activation_f, kernel_regularizer=regularizers.l2(0.05)))
    model.add(Dropout(dropout))
    model.add(Dense(neuron_amount, activation=activation_f, kernel_regularizer=regularizers.l2(0.05)))
    model.add(Dropout(dropout))
    model.add(Dense(1))

    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    # checkpoint: save the model whenever the training loss improves
    checkpoint_name = 'Disease3x128Batch16.hdf5'
    checkpoint = ModelCheckpoint(checkpoint_name, monitor='loss', verbose=0, save_best_only=True, mode='min')
    callbacks_list = [checkpoint]

    # fit the model; 20% of the training samples are held out for validation
    history = model.fit(train_data.values,
                        train_labels.values,
                        epochs=num_epo,
                        batch_size=batch,
                        callbacks=callbacks_list,
                        verbose=0,
                        validation_split=0.20,
                        shuffle=True)
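With a 20% test split followed by a 20% validation split, the number of samples that actually reach the optimiser is noticeably smaller than 303. The sketch below works through the arithmetic. It assumes that scikit-learn rounds the test size up and that Keras holds out the last 20% of the training arrays, rounding the training part down. The variable names are chosen for this example only.

```python
import math

n_samples = 303
test_fraction = 0.20        # test_size in train_test_split
validation_fraction = 0.20  # validation_split in model.fit

# Test set: rounded up (assumed scikit-learn behaviour)
n_test = math.ceil(n_samples * test_fraction)
n_train_val = n_samples - n_test

# Validation set: whatever remains after the training part is rounded down
# (assumed Keras behaviour)
n_train = int(n_train_val * (1.0 - validation_fraction))
n_val = n_train_val - n_train

print("test:", n_test)
print("training + validation:", n_train_val)
print("training:", n_train)
print("validation:", n_val)
```

Under these assumptions roughly 64% of all records are used for fitting the weights, 16% for validation and 20% for testing.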


##### Plotting results

In [7]:
def plot():
    # Plot the loss and the accuracy for both the training and validation sets

    # collect data from history
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    # the accuracy key is 'acc' in older Keras versions and 'accuracy' in newer ones
    acc_key = 'acc' if 'acc' in history.history else 'accuracy'
    acc = history.history[acc_key]
    val_acc = history.history['val_' + acc_key]

    # define the epoch axis
    time = range(1, len(loss) + 1)

    # plot loss vs epochs
    # training set is blue, validation set is red
    plt.plot(time, loss, 'b-')
    plt.plot(time, val_loss, 'r-')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.show()

    # plot accuracy vs epochs
    # training set is blue, validation set is red
    plt.plot(time, acc, 'b-')
    plt.plot(time, val_acc, 'r-')
    plt.xlabel('Epochs')
    plt.ylabel('ACC')
    plt.show()

In [8]:
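def loadWeights(weightFile):
    # load saved weights from the given file path into the current model
    model.load_weights(weightFile)
    # recompile with the same loss and metric that were used for training
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])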

## Results

##### Main combinations tried

##### #1 network setup:
model.add(Dense(64, input_shape =(13,),activation='relu',kernel_regularizer=regularizers.l2(0.05)))<br>
model.add(Dropout(0.20))<br>
model.add(Dense(64, activation='relu',kernel_regularizer=regularizers.l2(0.05)))<br>
model.add(Dropout(0.20))<br>
model.add(Dense(1))<br>
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics=['accuracy'])

Different batch sizes were checked for 400 epochs. In this setup, a batch size of 30 seems to be the best. <br>
Best results for:

number of epochs: 400, batch size: 30 <br>
loss: 0.4978 - acc: 0.8852 <br>
plots of loss and accuracy:

##### #2 network setup:
model.add(Dense(34, input_shape =(13,),activation='relu',kernel_regularizer=regularizers.l2(0.05)))<br>
model.add(Dropout(0.20))<br>
model.add(Dense(34, activation='relu',kernel_regularizer=regularizers.l2(0.05)))<br>
model.add(Dropout(0.20))<br>
model.add(Dense(1))<br>
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics=['accuracy'])

Different batch sizes were checked for 1000 epochs. In this setup, a batch size of 40 seems to be the best. <br>
Best results for:

number of epochs: 1100, batch size: 40 <br>
loss: 0.4852 - acc: 0.8852

In [9]:
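# grid search over layer size, batch size and number of epochs
los = 10                    # lowest test loss found so far
accu = 0                    # highest test accuracy found so far
result = [0, 0, 0]          # best [neurons, batch size, epochs]
hist = [[0, 0, 0, 0, 0]]    # rows of [neurons, batch size, epochs, loss, accuracy]; the first row is a placeholder
for neur in range(30, 60, 4):
    for bat in range(20, 50, 10):
        for epo in range(200, 1200, 100):
            fitModel(epo, bat, neur)
            a = model.evaluate(test_data.values, test_labels.values)
            hist = np.vstack((hist, [[neur, bat, epo, a[0], a[1]]]))
            print(bat)

            # keep the combination only if it improves both the loss and the accuracy
            if a[0] < los and a[1] > accu:
                print('neur:', neur)
                print('batch', bat)
                print('epo', epo)
                print('a:', a[0], a[1])
                los = a[0]
                accu = a[1]
                result = [neur, bat, epo]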
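The three nested loops above can also be written as a single loop over all combinations. The sketch below uses the same three ranges but replaces the training step with `evaluate_config`, a hypothetical stand-in that returns a made-up loss and accuracy, so that the example runs without TensorFlow. It also differs from the loop above in how the best combination is chosen: all results are stored and the one with the lowest loss is picked at the end.

```python
from itertools import product

def evaluate_config(neurons, batch_size, epochs):
    """Hypothetical stand-in for fitting and evaluating one model.

    Returns a made-up (loss, accuracy) pair; it is not a real training run.
    """
    loss = (0.5 + abs(neurons - 42) / 100
            + abs(batch_size - 30) / 100
            + abs(epochs - 600) / 10000)
    accuracy = 1.0 - loss / 2
    return loss, accuracy

neuron_grid = range(30, 60, 4)
batch_grid = range(20, 50, 10)
epoch_grid = range(200, 1200, 100)

# One record per combination of the three hyperparameters
results = []
for neurons, batch_size, epochs in product(neuron_grid, batch_grid, epoch_grid):
    loss, accuracy = evaluate_config(neurons, batch_size, epochs)
    results.append({"neurons": neurons, "batch_size": batch_size,
                    "epochs": epochs, "loss": loss, "accuracy": accuracy})

# Pick the combination with the lowest loss
best = min(results, key=lambda r: r["loss"])

print("combinations tried:", len(results))
print("best:", best)
```

The count printed by the sketch also shows the cost of the search: each combination corresponds to one full training run of up to 1100 epochs.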
                

10
10
10


KeyboardInterrupt: 

### Results

Final evaluation of the model on the test set:

In [None]:
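# evaluate the model on the test set; a[0] is the loss, a[1] is the accuracy
a = model.evaluate(test_data.values, test_labels.values)
print('test loss:', a[0])
print('test accuracy:', a[1])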
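The accuracy reported by `model.evaluate` is a single number. For a medical classifier it can help to look at the individual error types as well. The sketch below shows how accuracy and the four confusion-matrix counts could be derived from predicted probabilities. The probabilities, the labels and the 0.5 threshold are made-up assumptions and are not output of the model above.

```python
import numpy as np

# Made-up predicted probabilities and true labels for six patients
probs = np.array([0.91, 0.20, 0.65, 0.40, 0.08, 0.77])
y_true = np.array([1, 0, 1, 1, 0, 0])

# Threshold the probabilities to get class predictions
y_pred = (probs >= 0.5).astype(int)

accuracy = float(np.mean(y_pred == y_true))

# Confusion-matrix counts
tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # sick, predicted sick
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # healthy, predicted healthy
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # healthy, predicted sick
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # sick, predicted healthy

print("accuracy:", accuracy)
print("TP, TN, FP, FN:", tp, tn, fp, fn)
```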