This project is taken from https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/ and is a starter project for gaining experience with deep neural networks.

In [2]:
from numpy import loadtxt
from keras.models import Sequential
from keras.layers import Dense

The dataset contains medical information from Pima people and records whether or not each patient had an onset of diabetes within five years. A 0 means no onset of diabetes, while a 1 means an onset. All inputs are numerical.

Input Variables (X):

- Number of times pregnant
- Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- Diastolic blood pressure (mm Hg)
- Triceps skin fold thickness (mm)
- 2-Hour serum insulin (mu U/ml)
- Body mass index (weight in kg/(height in m)^2)
- Diabetes pedigree function
- Age (years)

Output Variables (y):

- Class variable (0 or 1)


In [3]:
# load the dataset
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=',')
# split into input (X) and output (y) variables
X = dataset[:,0:8]
y = dataset[:,8]

In Keras, models are built as a sequence of layers using the **Sequential** class. The first layer must be told the shape of the input, which here is 8 dimensions for the 8 input variables. This is done with `input_dim=8`.

The number of layers to use can be found by trial and error; for this problem we use a fully connected network structure with 3 layers. The first two layers use the rectified linear unit (ReLU) activation function, while the output layer uses the sigmoid function. The sigmoid ensures the output is between 0 and 1, so it can be read as the probability of class 1 or snapped to a hard classification of either class with a default threshold of 0.5.

Altogether, it looks like this:
- The model expects rows of data with 8 variables (the `input_dim=8` argument).
- The first hidden layer has 12 nodes and uses the ReLU activation function.
- The second hidden layer has 8 nodes and uses the ReLU activation function.
- The output layer has one node and uses the sigmoid activation function.
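
As a quick sanity check on this architecture, the number of trainable parameters can be counted by hand: each fully connected layer has one weight per input-node pair plus one bias per node. The helper name `dense_params` is made up for this sketch and is not part of Keras.

```python
# Count trainable parameters for a stack of fully connected (Dense) layers.
# dense_params is an illustrative helper, not a Keras function.
def dense_params(n_inputs, n_nodes):
    # one weight per (input, node) pair, plus one bias per node
    return n_inputs * n_nodes + n_nodes

layer_sizes = [8, 12, 8, 1]  # inputs, hidden layer 1, hidden layer 2, output
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)  # [108, 104, 9] 221
```

Once the model is defined, `model.summary()` should report the same total.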


In [8]:
# define the keras model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

**ReLU** returns the positive part of its argument: $0$ if $x\leq0$ and $x$ if $x>0$, i.e. $\max(0, x)$.

The **sigmoid** function follows the equation $$\frac{1}{1+e^{-x}}.$$ 

For large negative values ($x<-5$) the output is close to 0, while for large positive values ($x>5$) the output approaches 1.
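
The two activation functions can be sketched in plain Python to see this behaviour. This is only an illustration using the standard `math` module; Keras applies its own vectorised implementations internally.

```python
import math

def relu(x):
    # positive part of x: 0 for x <= 0, x otherwise
    return max(0.0, x)

def sigmoid(x):
    # squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

for x in (-5.0, 0.0, 5.0):
    print(x, relu(x), round(sigmoid(x), 4))
```

At $x=-5$ the sigmoid is already below 0.01, and at $x=5$ it is above 0.99.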

In [9]:
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

The **loss** function evaluates how well a given set of weights maps inputs to outputs in the dataset, where each node computes the weighted sum $Y=\sum(weight\times input)+bias$ before applying its activation. The loss should penalise bad predictions: when the probability assigned to the true class is low the loss should be high, and when that probability is 1.0 the loss should be 0. In this problem we use binary cross-entropy as our loss function because it is a binary classification problem. The equation for this is
$$H_p(q)=-\frac{1}{N}\sum^N_{i=1}\left[y_i\log(p(y_i))+(1-y_i)\log(1-p(y_i))\right]$$
where $y_i$ is 1 if there was an onset of diabetes and 0 if not, and $p(y_i)$ is the predicted probability of an onset of diabetes. The loss should be 0 for a perfect model and grow as the probability assigned to the true class shrinks, which is exactly what the negative log of that probability does. The binary cross-entropy is then the mean of these $-\log(\text{probability})$ terms over all $N$ samples.
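
The formula can be written out directly as a minimal sketch. The function name, the made-up labels and probabilities, and the `eps` clipping (added here only to avoid taking $\log(0)$) are assumptions of this example, not taken from the Keras implementation.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    # eps clipping is an assumption of this sketch, used to avoid log(0)
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    # negative mean of the per-sample log-probabilities
    return -total / len(y_true)

y_true = [1, 0, 1, 0]               # made-up labels
confident = [0.9, 0.1, 0.8, 0.2]    # high probability on the true class
poor = [0.4, 0.6, 0.3, 0.7]         # low probability on the true class

print(binary_cross_entropy(y_true, confident))
print(binary_cross_entropy(y_true, poor))
```

The second value is larger than the first, since the poor predictions put less probability on the true class.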

The **optimizer** searches through different weights for the network, and the **metrics** are any optional values we want to collect and report during training (here, accuracy). In this case, "adam" is chosen because it automatically tunes itself and gives good results for a wide variety of problems. Rather than having a single fixed learning rate (alpha) for all weight updates, Adam effectively uses a different rate for each network weight, which is adapted as learning unfolds. Adam uses several parameters:
- $\alpha$: Learning rate, the proportion by which weights are updated. Larger values lead to faster initial learning, smaller values to slower learning.
- $\beta_1$: The exponential decay rate of the first moment estimates (the mean)
- $\beta_2$: The exponential decay rate of the second moment estimates (the uncentred variance)
- $\epsilon$: A small number to prevent division-by-zero errors.
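
To show how these four parameters interact, here is a minimal sketch of a single-weight Adam update in plain Python. The function `adam_step`, the toy objective $f(w)=(w-3)^2$, and the learning rate of 0.01 used in the loop are assumptions for illustration; the default values in the signature are those suggested in the Adam paper, and Keras may use slightly different defaults.

```python
import math

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # update the running estimates of the first and second moments
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # correct the bias caused by initialising m and v at zero
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # per-weight step: the learning rate is scaled by the moment estimates
    w = w - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# toy problem: minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, alpha=0.01)
print(w)
```

After the loop, `w` should be close to the minimum at 3.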

In [14]:
# fit the keras model on the dataset
model.fit(X, y, epochs=150, batch_size=10, verbose=0)
# evaluate the keras model (note: on the same data it was trained on)
_, accuracy = model.evaluate(X, y)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy: 80.47


In [16]:
# make class predictions with the model
predictions = (model.predict(X) > 0.5).astype(int)
# summarize the first 5 cases
for i in range(5):
    # predictions has shape (n_samples, 1), so take the single value in each row
    print('%s => %d (expected %d)' % (X[i].tolist(), predictions[i, 0], y[i]))

[6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0] => 1 (expected 1)
[1.0, 85.0, 66.0, 29.0, 0.0, 26.6, 0.351, 31.0] => 0 (expected 0)
[8.0, 183.0, 64.0, 0.0, 0.0, 23.3, 0.672, 32.0] => 1 (expected 1)
[1.0, 89.0, 66.0, 23.0, 94.0, 28.1, 0.167, 21.0] => 0 (expected 0)
[0.0, 137.0, 40.0, 35.0, 168.0, 43.1, 2.288, 33.0] => 1 (expected 1)
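
The thresholding and the accuracy figure can also be reproduced by hand. The sketch below uses made-up probabilities and labels rather than the model's actual outputs, and the helper names `to_classes` and `accuracy` are hypothetical.

```python
def to_classes(probabilities, threshold=0.5):
    # snap each probability to class 1 if it exceeds the threshold, else class 0
    return [1 if p > threshold else 0 for p in probabilities]

def accuracy(y_true, y_pred):
    # fraction of predictions that match the expected class
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

probs = [0.91, 0.12, 0.55, 0.48, 0.73]  # made-up sigmoid outputs
labels = [1, 0, 1, 0, 0]                # made-up expected classes
classes = to_classes(probs)
print(classes, accuracy(labels, classes))  # [1, 0, 1, 0, 1] 0.8
```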
