# Sample Project: Sonar Return Data

In this notebook, we will explore a popular classification problem and several potential solutions using neural networks.

## Classification of Sonar Returns

This dataset describes sonar chirp returns bouncing off different surfaces. The 60 input variables are the strength of the returns at different angles. The goal is to distinguish returns that bounced off rocks from returns that bounced off metal cylinders. All of the variables are continuous and generally fall in the range 0 to 1. The target variable is a string, with `M` for metal cylinders and `R` for rocks.

## Imports

In [1]:
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold

## Loading and Preparing Data

In [2]:
df = pd.read_csv('sonar.csv', header=None)
dataset = df.values

# split into input and output variables
X = dataset[:,:60].astype(float)
y = dataset[:,60]

In [3]:
X[:3]

array([[0.02  , 0.0371, 0.0428, 0.0207, 0.0954, 0.0986, 0.1539, 0.1601,
        0.3109, 0.2111, 0.1609, 0.1582, 0.2238, 0.0645, 0.066 , 0.2273,
        0.31  , 0.2999, 0.5078, 0.4797, 0.5783, 0.5071, 0.4328, 0.555 ,
        0.6711, 0.6415, 0.7104, 0.808 , 0.6791, 0.3857, 0.1307, 0.2604,
        0.5121, 0.7547, 0.8537, 0.8507, 0.6692, 0.6097, 0.4943, 0.2744,
        0.051 , 0.2834, 0.2825, 0.4256, 0.2641, 0.1386, 0.1051, 0.1343,
        0.0383, 0.0324, 0.0232, 0.0027, 0.0065, 0.0159, 0.0072, 0.0167,
        0.018 , 0.0084, 0.009 , 0.0032],
       [0.0453, 0.0523, 0.0843, 0.0689, 0.1183, 0.2583, 0.2156, 0.3481,
        0.3337, 0.2872, 0.4918, 0.6552, 0.6919, 0.7797, 0.7464, 0.9444,
        1.    , 0.8874, 0.8024, 0.7818, 0.5212, 0.4052, 0.3957, 0.3914,
        0.325 , 0.32  , 0.3271, 0.2767, 0.4423, 0.2028, 0.3788, 0.2947,
        0.1984, 0.2341, 0.1306, 0.4182, 0.3835, 0.1057, 0.184 , 0.197 ,
        0.1674, 0.0583, 0.1401, 0.1628, 0.0621, 0.0203, 0.053 , 0.0742,
        0.0409, 0.0061,

In [4]:
y[:3]

array(['R', 'R', 'R'], dtype=object)

The output variable is a string datatype. Previously we converted this to a one-hot encoded representation using `pd.get_dummies().to_numpy()`. This time we will use the `LabelEncoder` class from scikit-learn, which maps each class label to an integer.

In [5]:
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(y)
encoded_Y = encoder.transform(y)
encoded_Y[:3]

array([1, 1, 1])
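
The integer assigned to each class follows the sorted order of the labels, so it can help to inspect the mapping directly. The sketch below uses a small hypothetical label array (`sample_labels`, not rows from the dataset) to show how `classes_` and `inverse_transform` recover the original strings.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# hypothetical labels for illustration only (not rows from sonar.csv)
sample_labels = np.array(['R', 'M', 'M', 'R'], dtype=object)

sample_encoder = LabelEncoder()
sample_encoded = sample_encoder.fit_transform(sample_labels)

# classes_ is sorted, so index 0 is 'M' and index 1 is 'R'
label_mapping = {str(label): int(index)
                 for index, label in enumerate(sample_encoder.classes_)}
print(label_mapping)

# inverse_transform maps integer codes back to the original strings
decoded = sample_encoder.inverse_transform(sample_encoded)
print(decoded)
```

With this ordering, a sigmoid output near 1 corresponds to `R` (rock) and an output near 0 corresponds to `M` (metal cylinder).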

## Defining the Model

In [6]:
def create_baseline():
    # create model
    model = Sequential()
    model.add(Dense(60, input_shape=(60,), activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    
    # compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model

## Evaluating the Model

In [7]:
# evaluate model with dataset
estimator = KerasClassifier(model=create_baseline, epochs=100, batch_size=5, verbose=0)
kfold = StratifiedKFold(n_splits=10, shuffle=True)
results = cross_val_score(estimator, X, encoded_Y, cv=kfold)
print(f"Baseline: {results.mean()*100}% (Std: {results.std()*100}%)")

Baseline: 78.21428571428571% (Std: 9.952979020535507%)


## Improving Performance: Standardization

While there is an enormous number of possible hyperparameter settings, a more efficient route to improving the model is data preparation. Neural networks tend to train best when their input values are consistent in both scale and distribution, so standardizing the data can improve performance. We can use the `StandardScaler` class from scikit-learn to perform the standardization.
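
As a rough sketch of what standardization computes, each column is shifted to zero mean and scaled to unit variance: `z = (x - mean) / std`. The matrix `toy_X` below is a made-up example, not sonar data; the code checks that `StandardScaler` agrees with the manual formula.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up feature matrix with columns on very different scales
toy_X = np.array([[0.01, 10.0],
                  [0.02, 20.0],
                  [0.03, 60.0]])

# manual z-score per column (population standard deviation, ddof=0)
manual = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0)

# StandardScaler learns the same per-column mean and standard deviation in fit()
scaled = StandardScaler().fit_transform(toy_X)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```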

> Note: Rather than performing the standardization on the entire dataset, it is good practice to train the standardization procedure on the training data **within the pass of a cross-validation run** and to use the trained standardization instance to prepare the unseen test fold. This ensures that standardization is a step in model preparation during cross-validation and prevents learning from out-of-sample data. 

We can achieve this using the `Pipeline` class. A pipeline chains one or more steps, such as a transformer followed by a model, and runs them together within each pass of the cross-validation procedure. Here we define a pipeline with the `StandardScaler` followed by our neural network model.
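
To illustrate what happens inside each fold, here is a minimal sketch that performs the same steps by hand. It uses synthetic data from `make_classification` and substitutes scikit-learn's `LogisticRegression` for the Keras model so that it runs quickly; both are assumptions for illustration and are not part of this notebook's experiments.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the sonar data (60 features, binary target)
X_demo, y_demo = make_classification(n_samples=200, n_features=60, random_state=0)

fold_scores = []
demo_kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in demo_kfold.split(X_demo, y_demo):
    # fit the scaler on the training fold only
    scaler = StandardScaler().fit(X_demo[train_idx])
    X_train = scaler.transform(X_demo[train_idx])
    # reuse the training-fold statistics on the held-out fold
    X_test = scaler.transform(X_demo[test_idx])

    # stand-in classifier; the notebook uses KerasClassifier at this step
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_demo[train_idx])
    fold_scores.append(clf.score(X_test, y_demo[test_idx]))

print(f"Manual folds: {np.mean(fold_scores)*100:.2f}% ({np.std(fold_scores)*100:.2f}%)")
```

Passing a `Pipeline` to `cross_val_score` automates this loop: the scaler is refit on each training fold and is only applied to, never fit on, the held-out fold.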

In [8]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

In [9]:
# evaluate baseline model with a standardized dataset
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(model=create_baseline, epochs=100, batch_size=5, verbose=0)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True)
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold)
print("Standardized: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Standardized: 85.57% (5.27%)


After standardization, mean accuracy improves by about 7 percentage points (from 78.21% to 85.57%), and the standard deviation has been cut roughly in half (from 9.95% to 5.27%).

## Improving Performance: Network Topology

We will run two experiments on the structure of the network. First, making it smaller, then making it larger. 

### A Smaller Network

In the previous experiment, our network topology included 60 neurons in the hidden layer. In a sense, there is one hidden neuron for each input neuron. What if there were fewer? In this case, the network would be pressured to pick out the most important structure in the input data. Perhaps some angles are more important than others, and with fewer neurons to handle those patterns, we can end up with a better result. There's no way to know until we try it out. 

In [10]:
def create_smaller():
    # create a smaller model
    model = Sequential()
    model.add(Dense(30, input_shape=(60,), activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    
    # compile the model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model

In [11]:
# evaluate the smaller model with a standardized dataset
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(model=create_smaller, epochs=100, batch_size=5, verbose=0)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True)
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold)
print("Standardized/Smaller: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Standardized/Smaller: 84.69% (7.55%)


In this case, the smaller network does not seem to make a meaningful difference: 84.69% versus 85.57% is well within one standard deviation, at least when the other hyperparameters stay the same. One advantage of the smaller network is its shorter training time. If the two models are comparable in performance, then the smaller topology may be the better choice when evaluating other hyperparameters.

### A Larger Network

Now we can consider the impact of a larger network. Additional layers may better extract key features in non-linear ways. There are several ways to increase the size of the network topology. In this experiment, we will add another hidden layer to the original design. 

In [12]:
def create_larger():
    # create a larger model
    model = Sequential()
    model.add(Dense(60, input_shape=(60,), activation='relu'))
    model.add(Dense(30, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    
    # compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model

In [13]:
# evaluate the larger model with a standardized dataset
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(model=create_larger, epochs=100, batch_size=5, verbose=0)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True)
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold)
print("Standardized/Larger: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Standardized/Larger: 86.50% (6.83%)


As in the previous experiment, the additional layer does not appear to make a meaningful difference: 86.50% versus 85.57% is well within one standard deviation. Nevertheless, many more variations in network topology could be explored.
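
One rough way to compare the three topologies is by their number of trainable parameters. The sketch below counts weights and biases for a stack of `Dense` layers using the layer widths from `create_baseline`, `create_smaller`, and `create_larger`; the helper names `dense_params` and `count_params` are hypothetical and introduced only for this illustration.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters in a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units


def count_params(layer_sizes, n_inputs=60):
    """Sum the parameters of a stack of Dense layers fed by n_inputs features."""
    total = 0
    for n_units in layer_sizes:
        total += dense_params(n_inputs, n_units)
        n_inputs = n_units  # this layer's output feeds the next layer
    return total


# layer widths taken from create_baseline, create_smaller and create_larger
topologies = {
    'baseline': [60, 1],
    'smaller': [30, 1],
    'larger': [60, 30, 1],
}
param_counts = {name: count_params(sizes) for name, sizes in topologies.items()}
for name, count in param_counts.items():
    print(f"{name}: {count} trainable parameters")
```

The smaller network has about half the parameters of the baseline and the larger one about 1.5 times as many, which gives a rough sense of relative training cost. Calling `model.summary()` on a built Keras model reports the same counts.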