<p style="font-family: Arial; font-size:2em;color:purple; font-style:bold"><br>
<b>Deep Neural Networks to Predict <br><br>Self-perception of Cardiovascular Disease
<br><br>~ A Technical Demo</b><br></p>
<p style="font-family: Arial; font-size:1.5em;color:#2462C0; font-style:bold"><br>David Zeng Ph.D.</p>
<p style="font-family: Arial; font-size:1.5em;color:#2462C0; font-style:bold">Assistant Professor/Researcher, Masters of Science in Analytics Program</p>
<p style="font-family: Arial; font-size:1.5em;color:#2462C0; font-style:bold">College of Business and Information Systems, Dakota State University</p>

<br><br>
### The objective of this research is three-fold:
#### * Better understanding of how much a DNN improves prediction accuracy for self-perception of cardiovascular disease
#### * A framework for developing more sophisticated DNN models to predict medical outcomes
#### * A foundation for learning multi-dimensional/distributed representations of healthcare concepts that are both interpretable and scalable

### I train a deep neural network with multiple hidden layers on a large data set of 1729 features and about 30,000 samples. The data come from the CDC Demographics, Dietary Data, Examination, Laboratory, and Questionnaire datasets collected from 1999 to 2016. Substantial data cleaning and pre-processing are done with the Python pandas library.
    
#### Technical highlights:
* Open-source libraries/models
* Keras, a Python library for deep learning
* TensorFlow as the backend
* Multiclass classification with softmax output layer and cross-entropy loss
* One 128-dimensional hidden layer with ReLU activations
* Two 64-dimensional hidden layers with ReLU activations
* Dropout of 0.5 following each hidden layer
* Widely used optimizer: SGD with Nesterov momentum
* L2 weight regularization
* 1729 input features
* 10-class target: from patient questionnaire data
* Data transformations: 1729 - 128 - 64 - 64 - 10
* Mini-batch size = 512
* Optimal number of epochs = 100
* Input features are standardized
* Missing values are filled with means
* Base models (LogReg, kNN, DT, Shallow NNET, and SVM) are implemented with scikit-learn
* Models are trained on both CPU and GPU
* GPU environment: two GTX 1080 Ti 8GB cards
* A subset of the data with about 3,200 examples is used for this demonstration
* About 600 iterations to complete 100 epochs
* About 235,000 parameters in the DNN (234,506 for the 1729 - 128 - 64 - 64 - 10 architecture)
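
As a rough check on model size, the trainable parameter count of the fully connected stack listed above (1729 - 128 - 64 - 64 - 10) can be worked out by hand: each dense layer contributes one weight per input-output pair plus one bias per output, and dropout layers add nothing. The helper name `dense_param_count` is made up for this sketch.

```python
# Layer widths taken from the "Data transformations" line above.
layer_sizes = [1729, 128, 64, 64, 10]

def dense_param_count(sizes):
    """Count weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

total_params = dense_param_count(layer_sizes)
print(total_params)  # 234506
```

Almost all of these parameters (221,440) sit in the first layer, because it maps 1729 inputs to 128 units.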

## Merging Data Files into the training dataset
#### 1. Load all files into a list
#### 2. Merge them into a big pandas DataFrame with a for loop

In [68]:
import os
import glob
import pandas as pd
import numpy as np

path = "C:\\Users\\dzeng\\DL Research\\Data\\1999\\"
allFiles = glob.glob(os.path.join(path,"*.xpt"))


# Start from a frame holding every respondent sequence number (SEQN) from 1 to 10000.
x_train = pd.DataFrame({'SEQN': np.arange(1, 10001)})

# Left-join each file on SEQN, keeping one row per respondent.
for infile in allFiles:
    df = pd.read_sas(infile)
    df['SEQN'] = df['SEQN'].astype(int)
    x_train = x_train.merge(df.drop_duplicates(subset=['SEQN']), on='SEQN', how='left')
    

In [69]:
x_train.shape

(10000, 1729)

In [None]:
x_train.tail(20)
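
To make the merge step concrete, here is a small sketch on made-up tables. The frames `base`, `demo`, and `labs` and the columns `AGE` and `HDL` are hypothetical stand-ins for the real `.xpt` files. It shows that a left join on `SEQN` keeps every row of the base frame and fills unmatched respondents with `NaN`.

```python
import pandas as pd

# Hypothetical miniature tables standing in for two .xpt files.
base = pd.DataFrame({'SEQN': [1, 2, 3, 4]})
demo = pd.DataFrame({'SEQN': [1, 2, 2, 4], 'AGE': [34, 51, 51, 67]})
labs = pd.DataFrame({'SEQN': [2, 3], 'HDL': [48.0, 61.0]})

merged = base
for df in (demo, labs):
    # drop_duplicates keeps one row per respondent so the left join cannot add rows.
    merged = merged.merge(df.drop_duplicates(subset=['SEQN']), on='SEQN', how='left')

print(merged.shape)  # (4, 3)
print(merged)
```

This is why the merged frame keeps its 10,000 rows regardless of how many respondents appear in each file.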

### Removing empty rows from DataFrame in pandas

In [79]:
# Keep rows with at least 20 non-missing values; recent pandas versions reject thresh combined with how.
x_train_noemptyrows = x_train.dropna(axis=0, thresh=20)

In [80]:
x_train_noemptyrows.shape

(9965, 1729)

In [None]:
x_train_noemptyrows.tail(20)
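
The `thresh` argument of `dropna` is a minimum count of non-missing values per row, not a maximum count of missing ones. A toy sketch with a hypothetical three-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; SEQN is always present, a and b may be missing.
toy = pd.DataFrame({
    'SEQN': [1, 2, 3],
    'a': [0.5, np.nan, np.nan],
    'b': [1.5, 2.5, np.nan],
})

# thresh=2 keeps rows with at least two non-missing values (SEQN counts as one).
kept = toy.dropna(axis=0, thresh=2)
print(kept['SEQN'].tolist())  # [1, 2]
```

Because `SEQN` is never missing, a row that has only its sequence number still counts one non-missing value, which is why the threshold needs to be larger than 1.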

### Working with missing data

In [84]:
train_data = x_train_noemptyrows.fillna(x_train_noemptyrows.mean())

### Normalizing the data

In [None]:
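# Standardize every feature column, leaving the SEQN merge key untouched
# so that the later merge on SEQN still matches respondents.
feature_cols = train_data.columns.drop('SEQN')
mean = train_data[feature_cols].mean(axis=0)
std = train_data[feature_cols].std(axis=0)
train_data[feature_cols] = (train_data[feature_cols] - mean) / std
# Apply the same training statistics to any held-out data:
# test_data[feature_cols] = (test_data[feature_cols] - mean) / std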
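
The two preprocessing steps (mean imputation, then standardization) can be checked on a small example. This is a sketch under assumptions: the frame `toy` and the column names `f1` and `f2` are invented, and the key column `SEQN` is deliberately excluded from both steps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: SEQN is the respondent key, f1/f2 are features.
toy = pd.DataFrame({
    'SEQN': [1, 2, 3, 4],
    'f1': [1.0, np.nan, 3.0, 5.0],
    'f2': [10.0, 20.0, np.nan, 30.0],
})
feature_cols = ['f1', 'f2']

# Fill each missing value with its column mean, then standardize.
filled = toy.copy()
filled[feature_cols] = filled[feature_cols].fillna(filled[feature_cols].mean())
mean = filled[feature_cols].mean()
std = filled[feature_cols].std()
filled[feature_cols] = (filled[feature_cols] - mean) / std

print(filled)
```

After this, each feature column has mean 0 and standard deviation 1, while `SEQN` still holds the original identifiers. A column with zero variance would produce a division by zero and needs to be dropped or handled separately.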

### Adding target into the DataFrame

In [86]:
y_train = pd.read_sas("C:\\Users\\dzeng\\DL Research\\Data\\target\\CDQ.xpt")
#full_dataset = full_data.merge(y_train,on='SEQN', how='right')

In [106]:
y_train_clean = y_train.drop(['CDQ030', 'CDQ040','CDQ060','CDQ090'], axis=1)

In [107]:
y_train_clean.shape

(3185, 6)


### Merging into a full dataset

In [None]:
full_dataset = pd.merge(y_train_clean, train_data, on='SEQN', how='left')

### Splitting the dataset into training (75%) and test (25%) sets

In [None]:
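from sklearn.model_selection import train_test_split

# Select features and targets by name; a DataFrame cannot be sliced with [:, a:b].
feature_cols = train_data.columns.drop('SEQN')
target_cols = y_train_clean.columns.drop('SEQN')
X_train, X_test, y_train, y_test = train_test_split(
    full_dataset[feature_cols], full_dataset[target_cols],
    test_size=0.25, random_state=0)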

### Encoding the target
One-hot encoding, also called categorical encoding, is a widely used format for categorical data: each label becomes a vector of zeros with a single 1 at the index of its class.

In [None]:
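import numpy as np

def to_one_hot(labels, dimension=10):
    # One row per label, with a 1 in the column given by the label's class index.
    results = np.zeros((len(labels), dimension))
    for i, label in enumerate(labels):
        results[i, int(label)] = 1.
    return results

# train_labels / test_labels: integer class labels in the range [0, dimension).
one_hot_train_labels = to_one_hot(train_labels)
one_hot_test_labels = to_one_hot(test_labels)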
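
For reference, the same encoding can be written without an explicit loop by indexing an identity matrix. This is an alternative sketch, not the cell above; the function name `one_hot_vectorized` and the sample labels are made up for illustration.

```python
import numpy as np

def one_hot_vectorized(labels, dimension=10):
    """Return a (len(labels), dimension) array with a single 1.0 per row."""
    labels = np.asarray(labels, dtype=int)
    # Row k of the identity matrix is the one-hot vector for class k.
    return np.eye(dimension)[labels]

example = one_hot_vectorized([0, 3, 9])
print(example.shape)  # (3, 10)
print(example[1])
```

Taking `argmax` along each row recovers the original integer labels, which is a convenient sanity check before training.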

## The Model: Deep Neural Networks for multi-class softmax classification

In [None]:
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.optimizers import SGD
from keras import regularizers

# Generate dummy data
import numpy as np
x_train = np.random.random((1000, 200))
y_train = keras.utils.to_categorical(np.random.randint(10, size=(1000, 1)), num_classes=10)
x_test = np.random.random((100, 200))
y_test = keras.utils.to_categorical(np.random.randint(10, size=(100, 1)), num_classes=10)

model = Sequential()
# Dense(128) is a fully-connected layer with 128 hidden units.
# In the first layer, you must specify the expected input data shape:

model.add(Dense(128, kernel_regularizer=regularizers.l2(0.001),
    activation='relu', input_shape=(x_train.shape[1],)))
model.add(Dropout(0.5))
model.add(Dense(64, kernel_regularizer=regularizers.l2(0.001),
    activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, kernel_regularizer=regularizers.l2(0.001),
    activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))

sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])
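
To show what the softmax output layer and the `categorical_crossentropy` loss compute, here is a minimal NumPy sketch on hypothetical logits for a 3-class problem. It illustrates the formulas only and is not the Keras implementation; the function names and the clipping constant `eps` are assumptions of this sketch.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-(y_true * np.log(y_prob)).sum(axis=1).mean())

# Hypothetical raw scores for two samples and their one-hot targets.
logits = np.array([[2.0, 1.0, 0.1], [0.5, 0.5, 0.5]])
targets = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])

probs = softmax(logits)
loss = categorical_crossentropy(targets, probs)
print(probs.sum(axis=1))
print(loss)
```

Each row of `probs` sums to 1, and the loss only looks at the probability assigned to the true class of each sample. For the second sample, which has equal scores, that probability is 1/3 and its loss term is log(3).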

### Adding dropout
Dropout, developed by Geoff Hinton and his students at the University of Toronto, is one of the most effective and most commonly used regularization techniques for neural networks. Applied to a layer, it randomly drops out (sets to zero) a number of the layer’s output features during training. Suppose a given layer would normally return the vector [0.2, 0.5, 1.3, 0.8, 1.1] for a given input sample during training. After dropout, this vector has a few zero entries distributed at random: for example, [0, 0.5, 1.3, 0, 1.1]. The dropout rate is the fraction of the features that are zeroed out; it’s usually set between 0.2 and 0.5. At test time, no units are dropped out; instead, the layer’s output values are scaled down by the fraction of units kept during training (1 minus the dropout rate, which equals the dropout rate itself for the rate of 0.5 used here), to balance for the fact that more units are active than at training time. Keras implements the equivalent "inverted" form: it scales the surviving outputs up by 1 / (1 - rate) during training and leaves the outputs unchanged at test time.
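
A minimal NumPy sketch of the "inverted" variant of dropout, which is the form the Keras `Dropout` layer uses: surviving activations are scaled up by 1 / (1 - rate) during training, so nothing needs rescaling at test time. The function name `inverted_dropout` and the seed are assumptions of this sketch.

```python
import numpy as np

def inverted_dropout(activations, rate, rng, training=True):
    """Zero a random fraction `rate` of entries and rescale the survivors."""
    if not training or rate == 0.0:
        return activations
    # Each entry is kept with probability (1 - rate).
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
layer_output = np.array([0.2, 0.5, 1.3, 0.8, 1.1])

train_out = inverted_dropout(layer_output, rate=0.5, rng=rng)
test_out = inverted_dropout(layer_output, rate=0.5, rng=rng, training=False)
print(train_out)
print(test_out)
```

With a rate of 0.5, every entry of the training output is either 0 or twice the original value, so the expected activation matches the unchanged test-time output.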

### Setting aside a validation set

In [None]:
x_val = x_train[:500]
partial_x_train = x_train[500:]
y_val = one_hot_train_labels[:500]
partial_y_train = one_hot_train_labels[500:]

### Training the model

In [None]:
history = model.fit(partial_x_train,partial_y_train,
    epochs=100,
    batch_size=512,
    validation_data=(x_val, y_val))

### Plotting the training and validation loss

In [None]:
import matplotlib.pyplot as plt
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

### Plotting the training and validation accuracy

In [None]:
plt.clf()
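# Older Keras versions log accuracy as 'acc'; newer ones use 'accuracy'.
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
acc = history.history[acc_key]
val_acc = history.history['val_' + acc_key]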
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
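
One common way to choose the epoch count for a final training run is to take the epoch with the lowest validation loss. The sketch below assumes a list of per-epoch validation losses; the numbers are invented for illustration and stand in for `history.history['val_loss']`.

```python
import numpy as np

# Hypothetical validation losses, one per epoch (stand-in for history.history['val_loss']).
val_loss = [1.92, 1.41, 1.18, 1.07, 1.03, 1.05, 1.11, 1.20]

# Epochs are 1-indexed, so add 1 to the position of the minimum.
best_epoch = int(np.argmin(val_loss)) + 1
print(best_epoch)  # 5
```

Training past this point lowers the training loss further while the validation loss rises, which is the usual sign of overfitting.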

### Retraining a model from scratch with the optimal number of epochs

In [None]:
model = Sequential()



model.add(Dense(128, activation='relu', input_shape=(x_train.shape[1],)))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))

sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])

# Train for the 100 epochs listed as optimal in the technical highlights.
model.fit(partial_x_train, partial_y_train, epochs=100,
    batch_size=512, validation_data=(x_val, y_val))

results = model.evaluate(x_test, one_hot_test_labels,batch_size=512)

### Here are the final results:

In [None]:
results

### This approach reaches an accuracy of about 80%, while the best accuracy reached by a base-model classifier is 50%.