In [13]:
import plotly.plotly as py
import plotly.graph_objs as go
import json

import numpy as np

In [22]:
with open('data.json', 'r') as f:
    modeldata = json.load(f)

In another notebook (run on the machine learning Docker container), I ran the emotion detection neural net a few times and brought the results here for analysis.

To test how local minima and/or saddle points might be affecting my accuracy, I decided to try a few of the optimizer algorithms that ship with Keras on the same model. I would then plot my training and test accuracy over time and see what sort of patterns emerged.

I'm using [this post](http://www.turingfinance.com/misconceptions-about-neural-networks/#algo) for inspiration on how to pick the optimizers I'm testing. It shows how optimizers like the ones in Keras behave around a saddle point of a function.

I picked the Keras default, the RMSProp algorithm, along with the Nesterov Adam and Adadelta algorithms. The Nesterov Adam optimizer appears to be a momentum-based strategy. In that post, momentum-based strategies either get permanently stuck in the saddle point or spend a few more cycles before eventually escaping down a gradient in another dimension.

The Adadelta algorithm, on the other hand, didn't even hit the saddle point; it curved toward the other minimum right away.
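To make the difference between these families more concrete, here is a minimal sketch of two textbook update rules, classical momentum and RMSProp, applied to the one-dimensional function f(w) = w². These are simplified illustrations, not the Keras implementations (and not Nesterov Adam or Adadelta themselves); the function names, learning rates, and step counts are assumptions chosen for the example.

```python
import numpy as np

def momentum_descent(grad_fn, w0, lr=0.1, mu=0.9, steps=200):
    """Classical momentum: keep a running velocity and move along it."""
    w, v = float(w0), 0.0
    for _ in range(steps):
        v = mu * v - lr * grad_fn(w)   # velocity remembers past gradients
        w = w + v
    return w

def rmsprop_descent(grad_fn, w0, lr=0.1, rho=0.9, eps=1e-8, steps=200):
    """RMSProp: scale each step by a running average of squared gradients."""
    w, cache = float(w0), 0.0
    for _ in range(steps):
        g = grad_fn(w)
        cache = rho * cache + (1.0 - rho) * g * g
        w = w - lr * g / (np.sqrt(cache) + eps)
    return w

# Gradient of the toy function f(w) = w**2 (illustrative only).
grad = lambda w: 2.0 * w

w_momentum = momentum_descent(grad, 5.0)
w_rmsprop = rmsprop_descent(grad, 5.0)
print(w_momentum, w_rmsprop)
```

Both runs should end up near the minimum at w = 0. The point of the sketch is only that the two rules use gradient history differently: momentum accumulates direction, while RMSProp normalizes the step size.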

Image recognition is high-dimensional, so I don't expect to see these saddle points and/or local minima directly in this way. However, by looking at the training and test accuracy over time, I should be able to see "flat" periods and interpret them as the optimizer getting caught at one of these points.
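One way to make "flat" less subjective is to flag stretches where accuracy barely moves. The sketch below is an assumption about how that could be done, not part of the original analysis: `find_plateaus`, its tolerance, and the minimum length are hypothetical, and the accuracy curve is made up.

```python
import numpy as np

def find_plateaus(acc, tol=0.005, min_len=5):
    """Return (start, end) epoch pairs where accuracy stays within `tol`
    of the value at the start of the run for at least `min_len` epochs."""
    acc = np.asarray(acc, dtype=float)
    plateaus = []
    start = 0
    for i in range(1, len(acc) + 1):
        # Close the current run at the end of the series or once the value drifts.
        if i == len(acc) or abs(acc[i] - acc[start]) > tol:
            if i - start >= min_len:
                plateaus.append((start, i - 1))
            start = i
    return plateaus

# Made-up curve: a climb, a flat stretch near 0.88, then another climb.
toy_acc = [0.70, 0.80, 0.86] + [0.88] * 8 + [0.90, 0.92, 0.93]
plateaus = find_plateaus(toy_acc)
print(plateaus)
```

The same function could be pointed at any of the `acc` or `val_acc` lists loaded from `data.json`; the thresholds would need tuning, since real curves are noisier than this toy one.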

In [30]:
data = [
    go.Scatter(
        x = np.arange(0,50),
        y = modeldata['model']['history']['acc'],
        name = 'RMSProp Training Accuracy'
    ),
    go.Scatter(
        x = np.arange(0,50),
        y = modeldata['model']['history']['val_acc'],
        name = 'RMSProp Test Accuracy'
    ),
    go.Scatter(
        x = np.arange(0,50),
        y = modeldata['model2']['history']['acc'],
        name = 'Adadelta Training Accuracy'
    ),
    go.Scatter(
        x = np.arange(0,50),
        y = modeldata['model2']['history']['val_acc'],
        name = 'Adadelta Test Accuracy'
    ),
    go.Scatter(
        x = np.arange(0,50),
        y = modeldata['model3']['history']['acc'],
        name = 'Nesterov Adam Optimizer Training Accuracy'
    ),
    go.Scatter(
        x = np.arange(0,50),
        y = modeldata['model3']['history']['val_acc'],
        name = 'Nesterov Adam Optimizer Test Accuracy'
    )
]

layout = go.Layout(
    updatemenus = [
        dict(
            x = -0.10,
            y = 1,
            yanchor = 'top',
            buttons = [
                dict(
                    args = ['visible', [True, True, False, False, False, False]],
                    label = 'RMSProp',
                    method = 'restyle'
                ),
                dict(
                    args = ['visible', [False, False, True, True, False, False]],
                    label = 'Adadelta',
                    method = 'restyle'
                ),
                dict(
                    args = ['visible', [False, False, False, False, True, True]],
                    label = 'Nesterov Adam Optimizer',
                    method = 'restyle'
                ),
                dict(
                    args = ['visible', [True, True, True, True, True, True]],
                    label = 'All',
                    method = 'restyle'
                )
            ],
            active = 3
        )
    ]
)

fig = go.Figure(data = data, layout = layout)

py.iplot(fig, filename = 'week14')

This does indeed show some interesting behavior. There appears to be a local minimum or saddle point that holds my accuracy at about 0.88. Interestingly, the RMSProp algorithm seems to be the quickest to escape this point, very closely followed by Adadelta. The Nesterov Adam optimizer, however, doesn't appear to be able to escape it.

My intuition about these optimizers (drawn partly from the blog post above, as well as [this](http://sebastianruder.com/optimizing-gradient-descent/) description of the algorithms) revolves around trying to imagine saddle points in higher dimensions. In three dimensions, we think of a saddle point as a point that is a minimum in one dimension and a maximum in another. Optimizers work in part by moving toward regions where the gradient is 0. Because we're "descending", it's difficult to end up at a local maximum, but it is possible to descend into a saddle point. Once there, a gradient descent algorithm might be fooled into thinking it has found a minimum, rather than finding the dimension in which it can still descend.
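The three-dimensional picture can be sketched with the classic saddle f(x, y) = x² − y², which is a minimum along x and a maximum along y. This is an illustrative toy with plain gradient descent, not one of the Keras optimizers; the starting points, learning rate, and step count are assumptions.

```python
import numpy as np

def grad(p):
    """Gradient of f(x, y) = x**2 - y**2."""
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

def descend(start, lr=0.1, steps=100):
    """Plain gradient descent with a fixed learning rate."""
    p = np.array(start, dtype=float)
    for _ in range(steps):
        p = p - lr * grad(p)
    return p

# Starting exactly on the x-axis: the y-gradient is always 0,
# so descent slides into the saddle at (0, 0) and stays there.
stuck = descend([1.0, 0.0])

# A tiny nudge off the axis is enough to find the downhill y direction.
escaped = descend([1.0, 1e-6])

print(stuck, escaped)
```

The first run is "fooled" in exactly the sense described above: the gradient vanishes at the saddle even though the function still decreases along y.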

Extending this thinking to higher dimensions, one could imagine points where the gradient is 0 and the curvature varies between positive and negative from one dimension to the next. If we have an 11-dimensional surface, for example, one could imagine a saddle point that is a local minimum in 9 dimensions and a local maximum in the 10th dimension.
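A standard way to express this is through the eigenvalues of the Hessian at a point where the gradient is 0: all positive means a local minimum, all negative a local maximum, and mixed signs a saddle point. The sketch below uses a made-up 10-parameter quadratic with 9 upward-curving directions and 1 downward-curving one; `classify_critical_point` is a hypothetical helper, not something from the original pipeline.

```python
import numpy as np

def classify_critical_point(hessian, eps=1e-12):
    """Classify a zero-gradient point from the signs of the Hessian eigenvalues."""
    eig = np.linalg.eigvalsh(hessian)  # eigenvalues of a symmetric matrix
    if np.all(eig > eps):
        return "local minimum"
    if np.all(eig < -eps):
        return "local maximum"
    if np.any(eig > eps) and np.any(eig < -eps):
        return "saddle point"
    return "degenerate"

# Hessian of a toy quadratic: curves up in 9 directions, down in the 10th.
hessian = np.diag([1.0] * 9 + [-1.0])
kind = classify_critical_point(hessian)
print(kind)
```

For a real network the Hessian is far too large to inspect this way, so this only illustrates the definition.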

We can't be sure what exactly this point looks like, but it does seem that the Nesterov Adam optimizer is being fooled by a point that is not fooling the RMSProp or Adadelta algorithms.

Below I'll include the code I used to create this model. I reused my image collection pipeline, and the only real changes I made were defining the model callback and exporting my results to JSON for analysis here (something to protect my sanity given the amount of time these models take to run).

In [None]:
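# Keras 1.x API (Convolution2D, dim_ordering, fit_generator)
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D
from keras.layers import Activation, Dropout, Flatten, Dense
from keras.preprocessing.image import ImageDataGenerator

model = Sequential()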
model.add(Convolution2D(32, 3, 3, input_shape=(150, 150, 3), dim_ordering='tf'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Convolution2D(32, 3, 3, dim_ordering='tf'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), dim_ordering='tf'))

model.add(Convolution2D(64, 3, 3, dim_ordering='tf'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), dim_ordering='tf'))

The compile step below is where I define the optimizer (here it's RMSProp).

In [None]:
model.add(Flatten())  
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(8))
model.add(Activation('softmax'))

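# The runs analyzed above used binary_crossentropy; for an 8-way softmax,
# categorical_crossentropy is the more usual pairing.
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])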

In [None]:
# this is the augmentation configuration we will use for training
train_datagen = ImageDataGenerator(
        rescale=1./255,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True)

# this is the augmentation configuration we will use for testing:
# only rescaling
test_datagen = ImageDataGenerator(rescale=1./255)

# this is a generator that will read pictures found in
# subfolders of the Train directory, and indefinitely generate
# batches of augmented image data
train_generator = train_datagen.flow_from_directory(
        '/root/sharedfolder/facial_expressions/Test2/Train/',  # this is the target directory
        target_size=(150, 150),  # all images will be resized to 150x150
        batch_size=32,
        class_mode='categorical')  # one-hot labels, one column per class subfolder

# this is a similar generator, for validation data
validation_generator = test_datagen.flow_from_directory(
        '/root/sharedfolder/facial_expressions/Test2/Validation/',
        target_size=(150, 150),
        batch_size=32,
        class_mode='categorical')

In [None]:
history = model.fit_generator(
        train_generator,
        samples_per_epoch=1000,
        nb_epoch=50,
        validation_data=validation_generator,
        nb_val_samples=800)

Once training finishes and my history is defined, I add it to my model JSON below:

In [None]:
modelData = dict()

modelData['model1'] = dict(loss = 'binary_crossentropy', 
                          optimizer = 'rmsprop', 
                          history = history.history, 
                          config = history.model.get_config())
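The export step itself isn't shown above, so here is a minimal sketch of how a dictionary like this could be written to and read back from JSON. It is an assumption about the approach: `fake_history` stands in for `history.history`, `to_builtin` is a hypothetical helper, the `config` entry is left out, and the file is written to a temporary directory rather than the real `data.json`.

```python
import json
import os
import tempfile

def to_builtin(values):
    """Convert metric values to plain floats so json can serialize them
    (guards against NumPy scalar types, which json.dump rejects)."""
    return [float(v) for v in values]

# Hypothetical stand-in for `history.history` from a finished run.
fake_history = {'acc': [0.71, 0.80], 'val_acc': [0.69, 0.78]}

export = {
    'model1': {
        'loss': 'binary_crossentropy',
        'optimizer': 'rmsprop',
        'history': {k: to_builtin(v) for k, v in fake_history.items()},
    }
}

# Write to a throwaway location, then read it back the way the analysis cell does.
path = os.path.join(tempfile.mkdtemp(), 'data.json')
with open(path, 'w') as f:
    json.dump(export, f)

with open(path, 'r') as f:
    reloaded = json.load(f)

print(reloaded['model1']['history']['acc'])
```

Adding one entry per optimizer run to the same dictionary before dumping would give a single file holding every run's history.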