# Keras Applied Examples
## Multiclass Classification Of Flower Species

In [26]:
# Import libraries and data

import numpy as np
from pandas import read_csv
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import Pipeline

seed = 7
np.random.seed(seed)

data = read_csv("data/iris.csv", header = None)
dataset = data.values
X = dataset[:, :-1].astype(float)
Y = dataset[:, -1]

### Data cleaning and preprocessing 

Encode the categorical output variable using one-hot encoding. scikit-learn's LabelBinarizer does this.

In [32]:
from sklearn.preprocessing import LabelBinarizer

# 1. One hot encode Y
lb = LabelBinarizer()
encoded_Y = lb.fit_transform(Y)
print(encoded_Y[:5, :])

[[1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]
 [1 0 0]]
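To make the encoding concrete, here is a minimal plain-Python sketch of what a one-hot encoder does when there are three or more classes. The helper `one_hot` and the label strings are illustrative assumptions, not scikit-learn code; like LabelBinarizer, the sketch orders the classes by sorting them.

```python
def one_hot(labels):
    """Return (classes, rows) where each row has a single 1 at the class index."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    rows = [[1 if index[label] == i else 0 for i in range(len(classes))]
            for label in labels]
    return classes, rows

# Hypothetical label strings standing in for the last column of iris.csv
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]
classes, rows = one_hot(labels)

# Decoding picks the class at the position of the 1
decoded = [classes[row.index(1)] for row in rows]
```

Decoding by position is the same idea the model uses later, when the largest of its three outputs is taken as the predicted class.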


### Define Neural Network Model

There is a KerasClassifier class in Keras that can be used as an Estimator in scikit-learn, the base type of model in the library. The KerasClassifier takes the name of a function as an argument. This function must return the constructed neural network model, ready for training.

Because we used a one hot encoding for our iris dataset, the output layer must create 3 output values, one for each class. The output value with the largest value will be taken as the class predicted by the model. 

We use a softmax activation function in the output layer. This is to ensure the output values are in the range of 0 and 1 and may be used as predicted probabilities. Finally, the network uses the efficient Adam gradient descent optimization algorithm with a logarithmic loss function, which is called categorical crossentropy in Keras.
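As a rough illustration of the two ideas in this paragraph, the following plain-Python sketch computes a softmax and the categorical crossentropy for one sample. The function names and the input values are made up for the example; Keras computes the same quantities internally on whole batches.

```python
import math

def softmax(logits):
    """Map raw scores to positive values that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(y_true, y_prob):
    """Negative log-probability assigned to the true (one-hot) class."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_prob) if t)

logits = [2.0, 1.0, 0.1]              # made-up raw outputs of the last layer
probs = softmax(logits)
predicted = probs.index(max(probs))   # class with the largest probability
loss = categorical_crossentropy([1, 0, 0], probs)
```

The loss shrinks toward 0 as the probability assigned to the true class approaches 1.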

Model structure:

    4 inputs -> [8 hidden nodes] -> 3 outputs

In [33]:
# 1. define baseline model
def baseline_model():
    # 1. create model
    model = Sequential()
    model.add(Dense(8, input_dim = 4, activation= 'relu'))
    model.add(Dense(3, activation='softmax'))
    
    # 2. compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model  

In [35]:
# 2. Wrap model in KerasClassifier
estimator = KerasClassifier(build_fn=baseline_model, epochs=200, batch_size=5, verbose = 0)

### Evaluate Model with k-fold Cross-Validation

In [36]:
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)

In [37]:
results = cross_val_score(estimator, X, encoded_Y, cv = kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Accuracy: 97.33% (4.42%)
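For readers new to k-fold cross-validation, here is a minimal sketch of how the rows are divided. It assumes 150 rows (the usual size of the iris dataset) and uses a hypothetical helper, `kfold_indices`; it will not reproduce scikit-learn's exact shuffle order, only the structure of the split.

```python
import random

def kfold_indices(n_samples, n_splits, seed):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread any remainder over the first folds so sizes differ by at most 1
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(n_samples=150, n_splits=10, seed=7))
```

Each of the 10 models is trained on 135 rows and scored on the remaining 15, and the reported accuracy is the mean of those 10 scores.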


## Binary Classification of Sonar Returns

### Background
The dataset we will use in this tutorial is the Sonar dataset. This is a dataset that describes sonar chirp returns bouncing off different surfaces. The 60 input variables are the strength of the returns at different angles. It is a binary classification problem that requires a model to differentiate rocks from metal cylinders.

This dataset is a standard benchmark problem. Using cross-validation, a neural network should be able to achieve an accuracy of around 84%, with an upper bound for custom models at around 88%.

### Data loading and Preparation

In [21]:
import numpy as np
import pandas as pd

seed = 7
np.random.seed(seed)

data = pd.read_csv("data/sonar.csv", header = None)
dataset = data.values 

X = dataset[:, :-1].astype(float)
Y = dataset[:,-1]

In [22]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
encoded_Y = le.fit_transform(Y)

### Define Base Model

The weights are initialized using a small Gaussian random number. The Rectifier activation function is used. The output layer contains a single neuron in order to make predictions. It uses the sigmoid activation function in order to produce a probability output in the range of 0 to 1 that can easily and automatically be converted to crisp class values. Finally, we are using the logarithmic loss function (binary crossentropy) during training, the preferred loss function for binary classification problems. The model also uses the efficient Adam optimization algorithm for gradient descent and accuracy metrics will be collected when the model is trained.
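The output-layer behaviour described above can be sketched in a few lines of plain Python. The input values are made up, and the 0.5 cut-off used to obtain crisp class values is an assumption for this example.

```python
import math

def sigmoid(z):
    """Squash a raw score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p):
    """Log loss for one sample with label 0 or 1."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

raw_outputs = [-2.0, 0.3, 4.0]                    # made-up pre-activation values
probs = [sigmoid(z) for z in raw_outputs]
classes = [1 if p >= 0.5 else 0 for p in probs]   # crisp class values

# The same confident prediction is cheap when right and expensive when wrong
loss_confident_right = binary_crossentropy(1, probs[2])
loss_confident_wrong = binary_crossentropy(0, probs[2])
```

The asymmetry between the two losses is what pushes the network away from confident mistakes during training.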

In [17]:
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

def create_baseline():
    # Create model
    model = Sequential()
    model.add(Dense(60, input_dim=60, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    
    # Compile Model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics =['accuracy'])
    return model

### Model Evaluation
Running this code produces the following output showing the mean and standard deviation of the estimated accuracy of the model on unseen data.

In [18]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

estimator = KerasClassifier(build_fn=create_baseline, epochs=100, batch_size=5, verbose=0)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(estimator, X, encoded_Y, cv = kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Baseline: 81.28% (5.93%)
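The evaluation above uses stratified folds, which keep the class balance roughly equal across folds. A minimal sketch of the idea, using toy labels rather than the real Sonar class counts and a hypothetical helper `stratified_folds`, deals each class out to the folds in turn. Unlike the cell above, this sketch does not shuffle.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits):
    """Assign each sample index to a fold, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        for position, i in enumerate(by_class[label]):
            folds[position % n_splits].append(i)
    return folds

# Toy labels: 60 of class 0 and 40 of class 1 (not the real Sonar counts)
labels = [0] * 60 + [1] * 40
folds = stratified_folds(labels, n_splits=10)
fold_counts = [Counter(labels[i] for i in fold) for fold in folds]
```

Every fold ends up with the same 60/40 mix as the full toy dataset, so no fold is scored on an unrepresentative class balance.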


### Normalizing Input Data

Neural network models tend to train best when their input values are consistent in both scale and distribution. An effective data preparation scheme for tabular data when building neural network models is standardization. This is where the data is rescaled such that the mean value for each attribute is 0 and the standard deviation is 1. This preserves Gaussian and Gaussian-like distributions whilst normalizing the central tendencies for each attribute.
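As a small worked example of standardization, the sketch below rescales one made-up attribute column by hand. It uses the population standard deviation (dividing by n), which is also what StandardScaler uses.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = fmean(column)
    std = pstdev(column)   # population std (divide by n)
    return [(x - mean) / std for x in column]

column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up attribute values
scaled = standardize(column)
```

Here the column has mean 5 and standard deviation 2, so the value 9.0 becomes 2.0 and the value 2.0 becomes -1.5.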

We can use scikit-learn to perform the standardization of our Sonar dataset using the StandardScaler class. Rather than performing the standardization on the entire dataset, it is good practice to train the standardization procedure on the training data within the pass of a cross-validation run and to use the trained standardization instance to prepare the unseen test fold. This makes standardization a step in model preparation in the cross-validation process and it prevents the algorithm having knowledge of unseen data during evaluation, knowledge that might be passed from the data preparation scheme like a crisper distribution.

We can achieve this in scikit-learn using a Pipeline class. The pipeline is a wrapper that executes one or more models within a pass of the cross-validation procedure. Here, we can define a pipeline with the StandardScaler followed by our neural network model.
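To see why the scaler should be fitted inside each fold, consider this minimal sketch. `ColumnScaler` is a hypothetical one-column stand-in for a standardization step, and the values are made up; the point is only where the mean and standard deviation come from.

```python
from statistics import fmean, pstdev

class ColumnScaler:
    """Hypothetical one-column stand-in for a fitted standardization step."""
    def fit(self, values):
        self.mean_ = fmean(values)
        self.std_ = pstdev(values)
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

train_fold = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up training values
test_fold = [10.0]                        # made-up held-out value

# Statistics come from the training fold only; the test fold is just transformed
scaler = ColumnScaler().fit(train_fold)
test_scaled = scaler.transform(test_fold)

# Fitting on all the data instead lets the held-out value shift the statistics
leaky = ColumnScaler().fit(train_fold + test_fold)
```

In the leaky version the mean moves from 3.0 to about 4.17 because of the held-out value, which is exactly the kind of information the model should not have during evaluation.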

In [23]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

estimators = [ ('standardize', StandardScaler() ), 
               ( 'mlp', KerasClassifier(build_fn=create_baseline, epochs=100, batch_size=5, verbose=0)) ]

pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X, encoded_Y, cv = kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Baseline: 84.13% (8.09%)


### Tuning Layers and Neurons in the Model

There are many things to tune on a neural network, such as the weight initialization, activation functions, optimization procedure and so on. One aspect that may have an outsized effect is the structure of the network itself, called the network topology.

### Evaluating a Smaller Network

The data describes the same signal from different angles. Perhaps some of those angles are more relevant than others. We can force a type of feature extraction by the network by restricting the representational space in the first hidden layer.

In this experiment we take our baseline model with 60 neurons in the hidden layer and reduce it by half to 30. This will put pressure on the network during training to pick out the most important structure in the input data to model. 

In [24]:
def create_baseline():
    # Create model
    model = Sequential()
    model.add(Dense(30, input_dim=60, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    
    # Compile Model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics =['accuracy'])
    return model

estimators = [ ('standardize', StandardScaler() ), 
               ( 'mlp', KerasClassifier(build_fn=create_baseline, epochs=100, batch_size=5, verbose=0)) ]

pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X, encoded_Y, cv = kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Baseline: 85.04% (7.38%)


We can see a slight boost in the mean estimated accuracy (85.04% versus 84.13%) and a reduction in the standard deviation (spread) of the accuracy scores (7.38% versus 8.09%). This is a good result because we are doing slightly better with a network that has half as many hidden neurons, which should also make it cheaper to train.
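One way to check how much smaller the network is, is to count its trainable parameters. The helper below is an illustrative sketch (not a Keras function) that counts weights and biases for a stack of fully connected layers.

```python
def dense_params(layer_sizes):
    """Count weights and biases for fully connected layers of the given sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

baseline = dense_params([60, 60, 1])   # 60 inputs -> [60] -> 1 output
smaller = dense_params([60, 30, 1])    # 60 inputs -> [30] -> 1 output
ratio = smaller / baseline
```

The baseline has 3,721 parameters and the smaller network 1,861, almost exactly half. Training time does not have to scale in proportion to the parameter count, so any speed-up is best measured rather than assumed.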

### Evaluating a Larger Network

A neural network topology with more layers offers more opportunity for the network to extract key features and recombine them in useful nonlinear ways. We can evaluate whether adding more layers to the network improves the performance easily by making another small tweak to the function used to create our model. Here, we add one new layer (one line) to the network that introduces another hidden layer with 30 neurons after the first hidden layer. Our network now has the topology:

    60 inputs -> [60 -> 30] -> 1 output


The idea here is that the network is given the opportunity to model all input variables before being bottlenecked and forced to halve the representational capacity, much like we did in the experiment above with the smaller network. Instead of squeezing the representation of the inputs themselves, we have an additional hidden layer to aid in the process.
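The following sketch traces a single made-up input row through this topology in plain Python, to show how the representation narrows from 60 to 30 values before the final output. The weights are random, so the output is meaningless as a prediction, and the 0.05 standard deviation for the initial weights is an assumption.

```python
import math
import random

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W @ inputs + b)."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

rng = random.Random(7)
sizes = [60, 60, 30, 1]                      # 60 inputs -> [60 -> 30] -> 1 output
activations = [relu, relu, sigmoid]

# Small Gaussian initial weights; the 0.05 standard deviation is an assumption
layers = [([[rng.gauss(0.0, 0.05) for _ in range(n_in)] for _ in range(n_out)],
           [0.0] * n_out)
          for n_in, n_out in zip(sizes, sizes[1:])]

signal = [rng.random() for _ in range(60)]   # made-up input row
widths = []
for (weights, biases), activation in zip(layers, activations):
    signal = dense(signal, weights, biases, activation)
    widths.append(len(signal))
```

After the loop, `widths` records the size of the representation after each layer and `signal` holds the single sigmoid output.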

In [25]:
def create_baseline():
    # Create model
    model = Sequential()
    model.add(Dense(60, input_dim=60, kernel_initializer='normal', activation='relu'))
    model.add(Dense(30, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    
    # Compile Model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics =['accuracy'])
    return model

estimators = [ ('standardize', StandardScaler() ), 
               ( 'mlp', KerasClassifier(build_fn=create_baseline, epochs=100, batch_size=5, verbose=0)) ]

pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X, encoded_Y, cv = kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Baseline: 84.11% (7.80%)


Running this example produces the results above. We can see that we do not get a lift in the model performance. This may be statistical noise or a sign that further training is needed.

## Boston Housing Prices Regression

The Boston Housing prices dataset describes properties of houses in Boston suburbs and is concerned with modeling the price of houses in those suburbs in thousands of dollars. As such, this is a regression predictive modeling problem. There are 13 input variables that describe the properties of a given Boston suburb. The full list of attributes in this dataset is as follows (the last one, MEDV, is the output variable):
1. CRIM: per capita crime rate by town.
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: proportion of non-retail business acres per town.
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
5. NOX: nitric oxides concentration (parts per 10 million).
6. RM: average number of rooms per dwelling.
7. AGE: proportion of owner-occupied units built prior to 1940.
8. DIS: weighted distances to five Boston employment centers.
9. RAD: index of accessibility to radial highways.
10. TAX: full-value property-tax rate per $10,000.
11. PTRATIO: pupil-teacher ratio by town.
12. B: 1000(Bk − 0.63)^2 where Bk is the proportion of blacks by town.
13. LSTAT: % lower status of the population.
14. MEDV: Median value of owner-occupied homes in $1000s.


In [53]:
# Load Data
data = pd.read_csv("data/housing.csv", header=None, sep=r"\s+")
display(data.head())
dataset = data.values

# feature/target split
X = dataset[:,0:13].astype(float)
Y = dataset[:,13].astype(float)

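|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|----|----|----|----|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |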


### Define Model

We will use a simple model that has a single fully connected hidden layer with the same number of neurons as input attributes (13). The network uses good practices such as the rectifier activation function for the hidden layer. No activation function is used for the output layer because it is a regression problem and we are interested in predicting numerical values directly without transform.

The efficient Adam optimization algorithm is used and a mean squared error loss function is optimized. This will be the same metric that we will use to evaluate the performance of the model. It is a desirable metric because taking the square root of the error gives a result that we can directly understand in the context of the problem, in units of thousands of dollars.

Reasonable performance for models evaluated using mean squared error (MSE) is around 20 in squared thousands of dollars (or about $4,500 if you take the square root). This is a nice target to aim for with our neural network model.
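The arithmetic behind these figures can be sketched as follows. The true values are the first four MEDV entries shown earlier, while the predictions are made up for illustration.

```python
import math

def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Prices in thousands of dollars; the predictions are made up
y_true = [24.0, 21.6, 34.7, 33.4]
y_pred = [26.0, 20.6, 30.7, 36.4]
mse = mean_squared_error(y_true, y_pred)   # squared thousands of dollars
rmse = math.sqrt(mse)                      # back in thousands of dollars

# The target quoted above: an MSE of 20 is a typical error of roughly $4,500
target_rmse_dollars = math.sqrt(20) * 1000
```

The errors here are 2, 1, 4 and 3 thousand dollars, giving an MSE of 7.5 and an RMSE of about 2.7 thousand dollars.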

**Note:** cross_val_score returns the MSE as a negative score here. Take the absolute value to read the results properly.
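A short sketch of how to read such scores, using made-up per-fold values in place of real results:

```python
from statistics import fmean, pstdev

# Made-up per-fold scores in the negated form described in the note
fold_scores = [-18.0, -25.0, -40.0, -13.0]

mse_per_fold = [-s for s in fold_scores]   # flip the sign back to a plain MSE
mean_mse = fmean(mse_per_fold)
spread = pstdev(mse_per_fold)              # population std, like NumPy's default std()
```

Flipping the sign changes the mean but not the standard deviation, which is why only the mean needs the absolute value in the cells below.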

In [58]:
def baseline_model():
    # Create model
    model = Sequential()
    model.add(Dense(13, input_dim=13, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    
    # Compile Model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

In [64]:
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import KFold

estimator = KerasRegressor(build_fn=baseline_model, epochs=100, batch_size=5, verbose=0)
kfold = KFold(n_splits=10)  # no shuffling, so random_state is not needed (newer scikit-learn rejects it)
results = cross_val_score(estimator, X, Y, cv=kfold)
print("Baseline: %f (%f) MSE" % (np.abs(results.mean()), results.std()))

Baseline: 30.613162 (25.280203) MSE


### Model Tuning: Standardizing input data

An important concern with the Boston house price dataset is that the input attributes all vary in their scales because they measure different quantities. We can use scikit-learn’s Pipeline framework to perform the standardization during the model evaluation process, within each fold of the cross-validation. This ensures that there is no data leakage from each held-out test fold into the training data.

In [65]:
estimators = [ ('standardize', StandardScaler()),
               ('mlp', KerasRegressor(build_fn=baseline_model, epochs=100, batch_size=5, verbose=0)) ]

pipeline = Pipeline(estimators)
kfold = KFold(n_splits=10)  # no shuffling, so random_state is not needed (newer scikit-learn rejects it)
results = cross_val_score(pipeline, X, Y, cv=kfold)
print("Standardized: %.2f (%.2f) MSE" % (np.abs(results.mean()), results.std()))

Standardized: 24.18 (28.23) MSE


### Model Tuning: Experimenting with wider network topology

Another way to increase the representational capacity of a neural network is to make it wider, that is, to put more neurons in a hidden layer. Here we widen the single hidden layer from 13 to 20 neurons.

The topology for the new network is:

    13 inputs -> [20] -> 1 output

In [66]:
def wider_model():
    model = Sequential()
    model.add(Dense(20, input_dim=13, kernel_initializer = 'normal', activation = 'relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model
              
estimators = [ ('standardize', StandardScaler()),
               ('mlp', KerasRegressor(build_fn=wider_model, epochs=100, batch_size=5, verbose=0)) ]

pipeline = Pipeline(estimators)
kfold = KFold(n_splits=10)  # no shuffling, so random_state is not needed (newer scikit-learn rejects it)
results = cross_val_score(pipeline, X, Y, cv=kfold)
print("Standardized: %.2f (%.2f) MSE" % (np.abs(results.mean()), results.std()))

Standardized: 23.08 (25.85) MSE


Building the wider model gives a further small drop in error, to about 23 in squared thousands of dollars.