In [1]:
# general imports needed by functions
import errno    
import os

# import python scientific libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# import needed keras objects into current namespace
from keras import layers
from keras import models
from keras import optimizers


Using TensorFlow backend.


In [2]:
# set plotting visual style and parameters for all plotted figures
%matplotlib inline
sns.set_style('darkgrid') # use seaborn style to improve visual presentation
sns.set_context('notebook')
plt.rcParams['figure.figsize'] = (12.0 , 8.0)

In [3]:
# check which devices tensorflow has recognized and is using
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 3516041434571516116
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 2393453564272176661
physical_device_desc: "device: XLA_CPU device"
]


# 7 Advanced deep-learning best practices

This notebook explores a number of powerful tools
that are useful when you get beyond building
straightforward deep networks for classic
regression or classification problems.

## 7.1 Going beyond the Sequential model: the Keras functional API

We have mostly been using the Keras `Sequential`
model to this point in the class/textbook.
This model makes the simplifying assumption that the
deep network has exactly one input (tensor) and
exactly one output (tensor).

Some use cases and network designs require more
flexibility.  We may have several independent
inputs and/or need to make several predictions
in parallel for a given task.  Further, some networks
have internal branching between layers that makes
them look like directed graphs of layers rather
than linear stacks of layers.

For *multimodal* inputs, we want to merge data
coming from different input sources. We often want
to process such data using different types of
neural networks, before merging the extracted
features for our task.

If we had multiple inputs, a naive approach would be
to train multiple networks separately, and then do
a weighted average of their predictions.  But this
is usually suboptimal, because the information
extracted by the models may be redundant.

A better way is to *jointly* learn a more accurate
model of the data by using a single model that
can see all available input modalities
simultaneously.

Similarly, some tasks need to predict multiple
target attributes.  Again we could train
separate networks.  But the target attributes are
usually not statistically independent,
and thus we can usually do better by building a
single model that learns the multiple outputs
jointly.

Finally, more recent neural network architectures
are beginning to require nonlinear network topology:
networks structured as directed acyclic graphs.
Because these graphs are **acyclic**, they are different from the
recurrent network layers we discussed previously.
But the graphs of the layers are more complex than
the linear sequence of sequential layers we have
mainly seen to this point.

The **Inception** family of networks, and the
**ResNet** architectures are examples of these
more complex processing networks.

For these three important use cases -- multi-input,
multi-output, and more complex acyclic directed graph-like
models -- we cannot use Keras' simple `Sequential`
model class.

Keras has a more flexible interface: the *functional API*.
The functional API uses concepts from functional
programming, and allows for more flexible specifications
of networks like the ones we just introduced.

### 7.1.1 Introduction to the functional API

In the Keras functional API you directly manipulate
tensors, and you use layers as *functions* that take
tensors and return tensors (hence the name
*functional API*):

In [4]:
from keras import Input, layers

In [5]:
# A tensor
input_tensor = Input(shape=(32,))

# because we are using the TensorFlow backend, this is
# actually a tensor object from TensorFlow
print(type(input_tensor))
# notice the shape is 2D, so this example expects
# tensors shaped (samples, features)
print(input_tensor.shape)

<class 'tensorflow.python.framework.ops.Tensor'>
(None, 32)


In [6]:
# a layer is a function
dense = layers.Dense(32, activation='relu')
print(type(dense))

<class 'keras.layers.core.Dense'>


In [7]:
# a layer may be called on a tensor, and it returns
# a tensor
output_tensor = dense(input_tensor)
print(type(output_tensor))
print(output_tensor.shape)

<class 'tensorflow.python.framework.ops.Tensor'>
(None, 32)
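
To make the "layer as a function" idea concrete without needing Keras at all, here is a minimal NumPy sketch.  `TinyDense` is a hypothetical stand-in written only for this illustration (it is not a Keras class): an object that holds weights and can be *called* on a batch to return a transformed batch.

```python
import numpy as np


class TinyDense:
    """Hypothetical stand-in for a dense layer (not a Keras class)."""

    def __init__(self, units, seed=0):
        self.units = units
        self.rng = np.random.RandomState(seed)
        self.kernel = None  # created on first call, once the input width is known
        self.bias = None

    def __call__(self, inputs):
        if self.kernel is None:
            in_features = inputs.shape[-1]
            self.kernel = self.rng.normal(size=(in_features, self.units))
            self.bias = np.zeros(self.units)
        # affine transform followed by relu
        return np.maximum(inputs @ self.kernel + self.bias, 0.0)


batch = np.ones((4, 64))      # (samples, features)
dense_fn = TinyDense(32)      # a layer object ...
output = dense_fn(batch)      # ... used as a function on a tensor
print(output.shape)           # (4, 32)
```

Calling `dense_fn` a second time reuses the weights created on the first call, which is the same behavior that makes layer sharing possible later in this notebook.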


Let's start with a minimal example that shows
side by side a simple `Sequential` model
and its equivalent in the functional API:

In [8]:
# first using Sequential model
from keras.models import Sequential, Model
from keras import layers
from keras import Input

In [9]:
seq_model = Sequential()
seq_model.add(layers.Dense(32, activation='relu', input_shape=(64,)))
seq_model.add(layers.Dense(32, activation='relu'))
seq_model.add(layers.Dense(10, activation='softmax'))
seq_model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_2 (Dense)              (None, 32)                2080      
_________________________________________________________________
dense_3 (Dense)              (None, 32)                1056      
_________________________________________________________________
dense_4 (Dense)              (None, 10)                330       
Total params: 3,466
Trainable params: 3,466
Non-trainable params: 0
_________________________________________________________________


In [10]:
# building the equivalent using functional API
input_tensor = Input(shape=(64,))
first_layer_function = layers.Dense(32, activation='relu')
x = first_layer_function(input_tensor)
x = layers.Dense(32, activation='relu')(x)
output_tensor = layers.Dense(10, activation='softmax')(x)

# the Model class turns an input tensor and output
# tensor into a model
model = Model(input_tensor, output_tensor)

model.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 64)                0         
_________________________________________________________________
dense_5 (Dense)              (None, 32)                2080      
_________________________________________________________________
dense_6 (Dense)              (None, 32)                1056      
_________________________________________________________________
dense_7 (Dense)              (None, 10)                330       
Total params: 3,466
Trainable params: 3,466
Non-trainable params: 0
_________________________________________________________________


Behind the scenes, there are hooks so that
Keras can retrieve every layer that connects
the provided `input_tensor` to the `output_tensor`.
This creates a graph-like data structure:
a `Model`.  It is an error to provide an
`output_tensor` that was not derived by repeatedly
transforming from the `input_tensor`.



In [11]:
unrelated_input = Input(shape=(32,))
bad_model = Model(unrelated_input, output_tensor)

ValueError: Graph disconnected: cannot obtain value for tensor Tensor("input_2:0", shape=(None, 64), dtype=float32) at layer "input_2". The following previous layers were accessed without issue: []

This error means that Keras couldn't reach the
output tensor from the provided input tensor.
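
The idea behind this check can be sketched in plain Python.  This is only an illustration of the concept, not how Keras implements it: the `parents` dictionary and both helper functions are hypothetical.  We walk backwards from the output and complain if we arrive at a source node that was not declared as an input.

```python
def reachable_inputs(output, parents):
    """Walk backwards from `output` and collect the source nodes it depends on."""
    sources, stack, seen = set(), [output], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        upstream = parents.get(node, [])
        if not upstream:          # no parents: this node is an input
            sources.add(node)
        stack.extend(upstream)
    return sources


def check_connected(declared_inputs, output, parents):
    """Raise if `output` depends on an input that was not declared."""
    missing = reachable_inputs(output, parents) - set(declared_inputs)
    if missing:
        raise ValueError("Graph disconnected: needs " + ", ".join(sorted(missing)))


# toy graph mirroring the example: input_2 -> dense_5 -> dense_6 -> dense_7
parents = {
    "dense_5": ["input_2"],
    "dense_6": ["dense_5"],
    "dense_7": ["dense_6"],
}

check_connected(["input_2"], "dense_7", parents)      # fine, no error
try:
    check_connected(["input_3"], "dense_7", parents)  # unrelated input
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```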

Once you create the network and encapsulate it in
a `Model` in Keras, compiling, training and
evaluating such a `Model` instance is done in
exactly the same way as we have been doing it for
the `Sequential` model.

In [12]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

# generates dummy NumPy data to train on
x_train = np.random.random((1000, 64))
y_train = np.random.random((1000, 10))

# train the model for 10 epochs
model.fit(x_train, y_train, epochs=10, batch_size=128)

# evaluate the model
score = model.evaluate(x_train, y_train)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### 7.1.2 Multi-Input model

In this section we have an example of building
a model with multiple inputs.  Typically
such models at some point merge the different
input branches using a layer that can combine
several tensors: by adding them, concatenating them,
and so on.  This is done with operations such as
`keras.layers.add` and `keras.layers.concatenate`.
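
The difference between the two merge styles is mostly a matter of shapes, which a small NumPy sketch can show.  The array names and sizes here are made up for illustration: concatenation joins branches along the feature axis, so their widths may differ, while element-wise addition needs identical shapes.

```python
import numpy as np

features_a = np.ones((8, 32))   # 8 samples, 32 features from one branch
features_b = np.ones((8, 16))   # 8 samples, 16 features from another branch

# concatenation joins along the feature axis, so widths may differ
merged = np.concatenate([features_a, features_b], axis=-1)
print(merged.shape)             # (8, 48)

# element-wise addition needs branches of identical shape
features_c = np.ones((8, 32))
summed = features_a + features_c
print(summed.shape)             # (8, 32)
```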

A typical question-answering model has two inputs:
a natural-language question and a text snippet
(such as a news article) providing information
to be used for answering the question.  The model
then must produce an answer: in the simplest possible
setup, this is a one-word answer obtained via
softmax over some predefined vocabulary.

Following is an example of how we can build such
a model using the Keras functional API.
We use two independent branches for the input,
LSTMs that each encode one of the sequences.  These
representations are concatenated and a softmax
classifier is added on top.

In [13]:
from keras.models import Model
from keras import layers
from keras import Input

In [14]:
text_vocabulary_size = 10000
question_vocabulary_size = 10000
answer_vocabulary_size = 500

# the text input is a variable length sequence
# of integers, note we name the input tensor
text_input = Input(shape=(None,), dtype='int32', name='text')

# embeds the inputs into a sequence of vectors of
# size 64
embedded_text = layers.Embedding(text_vocabulary_size,
                                 64)(text_input)

# encodes the vectors in a single vector via an LSTM
encoded_text = layers.LSTM(32)(embedded_text)

# same process for the question
question_input = Input(shape=(None,), dtype='int32', name='question')

embedded_question = layers.Embedding(question_vocabulary_size,
                                     32)(question_input)
encoded_question = layers.LSTM(16)(embedded_question)

# concatenates the encoded question and encoded text
concatenated = layers.concatenate([encoded_text, encoded_question],
                                  axis=-1)

# adds a softmax classifier on top
answer = layers.Dense(answer_vocabulary_size,
                      activation='softmax')(concatenated)

# at model instantiation, you specify the two inputs
# and the output
model = Model([text_input, question_input], answer)

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['acc'])

In [15]:
model.summary()

Model: "model_3"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
text (InputLayer)               (None, None)         0                                            
__________________________________________________________________________________________________
question (InputLayer)           (None, None)         0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 64)     640000      text[0][0]                       
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, None, 32)     320000      question[0][0]                   
____________________________________________________________________________________________

Now how do you train this two-input model?  There are
two possible APIs: you can feed the model a list
of Numpy arrays as inputs, or you can feed it a
dictionary that maps input names to Numpy arrays.

In [16]:
num_samples = 1000
max_length = 100

# generates dummy numpy data
# notice that the input text is integers from 1 up to
# (but not including) text_vocabulary_size, and each
# sample is a sequence of 100 (dummy) words
text = np.random.randint(1, text_vocabulary_size,
                         size=(num_samples, max_length))

# also dummy questions, again sequences of 100 words
question = np.random.randint(1, question_vocabulary_size,
                             size=(num_samples, max_length))

# answers are one-hot encoded, not integers: draw one
# random class per sample, then pick the matching row
# of an identity matrix
answer_indices = np.random.randint(0, answer_vocabulary_size,
                                   size=num_samples)
answers = np.eye(answer_vocabulary_size)[answer_indices]



In [17]:
print(text.shape)
print(question.shape)
print(answers.shape)

(1000, 100)
(1000, 100)
(1000, 500)


In [18]:
# method 1, as a list of numpy arrays for inputs
model.fit([text, question], answers, epochs=10, batch_size=128)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.callbacks.History at 0x7ffa400e5750>

In [19]:
# method 2: fitting using a dictionary of inputs
# only works if input functions/layers are named
model.fit({'text': text, 'question': question}, 
          answers,
          epochs=10, batch_size=128)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.callbacks.History at 0x7ffa80075fd0>

### 7.1.3 Multi-output models

In the same way, you can use the functional API
to build models with multiple outputs
(or multiple *heads*).  A simple example is a network
that attempts to simultaneously predict different
properties of the data, such as one that takes social
media posts as input and tries to predict several attributes
of the poster, like age, gender and income level.

In [20]:
vocabulary_size = 50000
num_income_groups = 10

posts_input = Input(shape=(None,), dtype='int32', name='posts')
embedded_posts = layers.Embedding(vocabulary_size, 256)(posts_input)
x = layers.Conv1D(128, 5, activation='relu')(embedded_posts)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation='relu')(x)

# note that the output layers are given names
age_prediction = layers.Dense(1, name='age')(x)
income_prediction = layers.Dense(num_income_groups,
                                 activation='softmax',
                                 name='income')(x)
gender_prediction = layers.Dense(1, activation='sigmoid', 
                                 name='gender')(x)

model = Model(posts_input, 
              [age_prediction, income_prediction, gender_prediction])


Training a model with multiple outputs does have
one complication we didn't see before.  We
need to specify different loss functions for
the different output "heads" of the network:
for instance, age prediction is scalar
regression, but gender prediction is a binary
classification task and we have split
income into 10 levels, so it is a multi-category
classification task.  These require different
loss functions.

Gradient descent requires us to minimize a
single scalar value, so we must combine these
losses into a single value in order to train
the model.  The simplest way is to sum them all.
In Keras, you can use either a list or a dictionary
of losses in `compile` to specify different
loss functions for different outputs; the resulting
loss values are summed into a global loss, which
is minimized during training.

In [21]:
# again example of using a list
model.compile(optimizer='rmsprop',
              loss=['mse', 'categorical_crossentropy', 'binary_crossentropy'])

In [22]:
# or if you name output layers, can be more
# precise and use an explicit mapping with a dict
model.compile(optimizer='rmsprop',
              loss={'age': 'mse',
                    'income': 'categorical_crossentropy',
                    'gender': 'binary_crossentropy'})

Note that very imbalanced loss contributions will cause
the model to optimize the task with the largest
loss preferentially, at the expense of the other
tasks.  To remedy this, we can assign different levels
of importance to the loss values in their contribution
to the final loss.  This is especially useful
if the losses use different scales: for example, the
MSE loss for the age-regression task usually has values around
3-5, whereas the cross-entropy loss used for the
gender-classification task can be as low as 0.1.

In such a situation, to balance, you can assign a weight of 10 to the crossentropy loss and a weight
of 0.25 to the MSE loss.

In [23]:
model.compile(optimizer='rmsprop',
              loss=['mse', 'categorical_crossentropy', 'binary_crossentropy'],
              loss_weights=[0.25, 1.0, 10.0])

In [24]:
# or if you name output layers, can be more
# precise and use an explicit mapping with a dict
model.compile(optimizer='rmsprop',
              loss={'age': 'mse',
                    'income': 'categorical_crossentropy',
                    'gender': 'binary_crossentropy'},
              loss_weights={'age': 0.25,
                            'income': 1.0,
                            'gender': 10.0})
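
To see what these weights do, here is the arithmetic in plain Python.  The per-output loss values are made up for illustration (4.0 and 0.1 follow the rough magnitudes quoted above, 1.5 is arbitrary): the global loss is the weighted sum of the individual losses.

```python
# illustrative per-output loss values (made up for this sketch)
losses = {"age": 4.0, "income": 1.5, "gender": 0.1}
loss_weights = {"age": 0.25, "income": 1.0, "gender": 10.0}

# contribution of each head to the global loss
contributions = {name: loss_weights[name] * losses[name] for name in losses}
total_loss = sum(contributions.values())

print(contributions)
print(total_loss)
```

With these weights the three heads contribute on a comparable scale, instead of the age loss dominating the sum.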

As with multi-inputs, you can pass NumPy data to
the model for training either via a list of
arrays or via a dictionary of arrays.

In [25]:
# need to create some dummy data as before 
# if want to fit the multi-output model

In [26]:
#model.fit(posts, [age_targets, income_targets, gender_targets],
#          epochs=10, batch_size=64)

In [27]:
#model.fit(posts,
#          {'age': age_targets,
#           'income': income_targets,
#           'gender': gender_targets},
#          epochs=10, batch_size=64)
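
One way to create the dummy data mentioned above is sketched here with NumPy.  The variable names match the commented-out `fit` calls; the sample count, the sequence length of 500 (chosen so the sequences stay long enough after the convolution and pooling layers) and the age range are assumptions made for this sketch.

```python
import numpy as np

num_samples = 1000
max_length = 500          # assumed length of each (dummy) post
vocabulary_size = 50000
num_income_groups = 10

rng = np.random.RandomState(0)

# integer word indices, one row per post
posts = rng.randint(1, vocabulary_size, size=(num_samples, max_length))

# scalar regression target (ages drawn uniformly, purely illustrative)
age_targets = rng.randint(18, 80, size=(num_samples, 1)).astype("float32")

# one-hot encoded income group
income_labels = rng.randint(0, num_income_groups, size=num_samples)
income_targets = np.eye(num_income_groups, dtype="float32")[income_labels]

# binary target
gender_targets = rng.randint(0, 2, size=(num_samples, 1)).astype("float32")

print(posts.shape, age_targets.shape, income_targets.shape, gender_targets.shape)
```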

### 7.1.4 Directed acyclic graphs of layers

We can use the functional API to build
more complex arbitrary *directed acyclic graphs*
of layers. The qualifier *acyclic* is important:
the graphs can't have cycles.  The only
processing *loops* that are allowed are those
internal to recurrent layers.

**Inception Modules**

This is a layer that itself looks like a small
stack of parallel branches.  

(Figure 7.8)

Here is an implementation by hand of the Inception
module shown in figure 7.8.

In [None]:
# create 4D tensor named x for input of correct shape
x = Input(shape=(100, 100, 100))

# every branch has the same stride value (2) and
# uses padding='same', which is necessary to keep
# all branch outputs the same size
branch_a = layers.Conv2D(128, 1, activation='relu',
                         strides=2, padding='same')(x)

# in this branch, the striding occurs in the spatial convolution layer
branch_b = layers.Conv2D(128, 1, activation='relu')(x)
branch_b = layers.Conv2D(128, 3, activation='relu',
                         strides=2, padding='same')(branch_b)

# in this branch, the striding occurs in the average pooling layer
branch_c = layers.AveragePooling2D(3, strides=2, padding='same')(x)
branch_c = layers.Conv2D(128, 3, activation='relu',
                         padding='same')(branch_c)

# here the striding occurs again in the last spatial
# convolution layer
branch_d = layers.Conv2D(128, 1, activation='relu')(x)
branch_d = layers.Conv2D(128, 3, activation='relu',
                         padding='same')(branch_d)
branch_d = layers.Conv2D(128, 3, activation='relu',
                         strides=2, padding='same')(branch_d)

# concatenates the branch outputs to obtain the module output
output = layers.concatenate(
    [branch_a, branch_b, branch_c, branch_d], axis=-1)

Note that the full Inception V3 architecture is 
available in Keras as `keras.applications.inception_v3.InceptionV3`
including weights pretrained on the ImageNet dataset.
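
Concatenating branches along the channel axis only works if every branch produces the same height and width.  The arithmetic can be checked with a small helper; `conv_output_size` is a hypothetical function written for this sketch, using the usual output-size rules for `'valid'` and `'same'` padding, and the 100-pixel input size is an assumption.

```python
import math


def conv_output_size(size, kernel, stride=1, padding="valid"):
    """Spatial output size of a convolution or pooling window."""
    if padding == "same":
        return math.ceil(size / stride)
    if padding == "valid":
        return (size - kernel) // stride + 1
    raise ValueError("padding must be 'valid' or 'same'")


size = 100  # assumed input height/width
for padding in ("valid", "same"):
    # a 1x1 convolution with stride 2
    one_by_one = conv_output_size(size, 1, 2, padding)
    # a 1x1 convolution (stride 1) followed by a 3x3 convolution with stride 2
    three_by_three = conv_output_size(
        conv_output_size(size, 1, 1, padding), 3, 2, padding)
    print(padding, one_by_one, three_by_three)
```

With `'valid'` padding the two branches come out at 50 and 49, which cannot be concatenated; with `'same'` padding both are 50.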



**Residual Connections**

*Residual connections* are a common graph-like
network component found in many post-2015
network architectures.  

They tackle two common problems that plague any
large-scale deep-learning model:
vanishing gradients and representational
bottlenecks.

In general, adding residual connections to any
model that has more than 10 layers is likely to be
beneficial.

At its simplest, a residual connection is simply
connecting the output of an earlier layer to
a later layer.  Rather than being concatenated, the
earlier output is summed with the later activation.
This implies that both activations have to be
the same size.  If they're different, you can
use a linear transformation to reshape the
earlier activation into the target shape.

Here's an example of implementing a residual connection
in Keras when the feature-map sizes are the same, using identity residual connections.

In [None]:
# create 4D tensor
# x = ...

# applies transformation to x
y = layers.Conv2D(128, 3, activation='relu', padding='same')(x)
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)

# adds the original x back to the output features
y = layers.add([y, x])


And the following implements a residual connection
when the feature-map sizes are different, using a
linear residual connection.

In [None]:
# create 4D tensor
# x = ....
y = layers.Conv2D(128, 3, activation='relu', padding='same')(x)
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)
y = layers.MaxPooling2D(2, strides=2)(y)

# uses a 1x1 convolution to linearly downsample the
# original x tensor to the same shape as y
residual = layers.Conv2D(128, 1, strides=2, padding='same')(x)

# adds the residual tensor back to the output features
y = layers.add([y, residual])
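
The same two cases can be sketched with NumPy on plain feature vectors.  This is a simplified illustration, not the convolutional version above: `block` is a hypothetical stand-in for a stack of layers, and the random `projection` matrix plays the role of the learned linear transformation.

```python
import numpy as np

rng = np.random.RandomState(0)


def block(x, weights):
    """Stand-in for a stack of layers: one linear map followed by relu."""
    return np.maximum(x @ weights, 0.0)


x = rng.normal(size=(4, 8))                      # 4 samples, 8 features

# identity residual: the block keeps the width, so x can be added directly
y_same = block(x, rng.normal(size=(8, 8))) + x

# linear residual: the block changes the width, so x is projected first
projection = rng.normal(size=(8, 16))            # learned in a real network
y_proj = block(x, rng.normal(size=(8, 16))) + x @ projection

print(y_same.shape, y_proj.shape)                # (4, 8) (4, 16)
```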

### 7.1.5 Layer weight sharing

One more useful feature of the functional API
is the ability to reuse a layer instance several
times.  When you call a layer instance twice, it
reuses the same weights.  You can build models
with several branches, all sharing the same
knowledge.

For example, consider input from two cameras set
a few inches apart, as in binocular vision.  The
two streams need to perform basically the same task,
so it is much better to train a common set of weights.

Or consider a model that measures semantic
similarity between two sentences.  The model
has two input sentences (the two sentences to
compare).  The model outputs a (probability)
score between 0 and 1, where 0 means unrelated, and
1 means the same sentence (or a rewording).

In this setup the two input sentences are interchangeable,
because semantic similarity is a symmetrical
relationship: the similarity of A to B is identical
to the similarity of B to A.

We want to process both inputs with a single LSTM layer,
using two streams for the two separate input sentences.

Here is an example using Keras functional API:

In [33]:
from keras import layers
from keras import Input
from keras.models import Model

In [34]:
# create left_data and right_data dummy data
# for example training

In [35]:
# instantiate a single LSTM layer
lstm = layers.LSTM(32)

# build the left branch: inputs are variable-length
# sequences of vectors of size 128
left_input = Input(shape=(None, 128))
left_output = lstm(left_input)

# build the right branch: here is an example of
# reusing the existing lstm layer again
right_input = Input(shape=(None, 128))
right_output = lstm(right_input)

# builds a classifier on top
merged = layers.concatenate([left_output, right_output], axis=-1)
predictions = layers.Dense(1, activation='sigmoid')(merged)



In [None]:
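# instantiating and training the model
# when you train, the weights of the LSTM layer are
# updated based on both inputs
model = Model([left_input, right_input], predictions)
# the model must be compiled before it can be fit
model.compile(optimizer='rmsprop', loss='binary_crossentropy')
model.fit([left_data, right_data], targets)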
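
What "sharing weights" means can be shown with a toy NumPy encoder.  This is not an LSTM: `encode` is a hypothetical function that simply averages the timesteps and applies one weight matrix, but like the shared `lstm` layer it uses a single set of weights for both branches.

```python
import numpy as np

rng = np.random.RandomState(0)
shared_weights = rng.normal(size=(128, 32))   # one set of weights


def encode(sequence):
    """Toy encoder: average the timesteps, then apply the shared weights."""
    return np.tanh(sequence.mean(axis=0) @ shared_weights)


sentence_a = rng.normal(size=(7, 128))    # 7 timesteps of 128-d vectors
sentence_b = rng.normal(size=(12, 128))   # a different length is fine

left = encode(sentence_a)     # left branch
right = encode(sentence_b)    # right branch, same weights

# the same sentence gets the same encoding whichever branch it enters
print(np.allclose(encode(sentence_a), left))
print(left.shape, right.shape)
```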

### 7.1.6 Models as layers

It is also useful to know that in Keras, a `Model`
can be used as a `layer`.  You can effectively
think of a `Model` as encapsulating several layers
into a conceptually single layer that takes an
input tensor and produces an output tensor.

This means you can call a model on an input
tensor and retrieve an output tensor (i.e.
it performs a forward pass on the input
batch of data):

In [None]:
y = model(x)

If the model is multi-input and/or multi-output, you
should call it with a list of tensors, and it will
return a list/tuple:

In [None]:
y1, y2 = model([x1, x2])

Another example of reusing a model instance
is the vision example we mentioned: a dual
camera setup with two parallel cameras, a few
centimeters apart.

Here is an example implementation of a Siamese vision
model with a shared convolutional base built
in Keras (using the Xception model/layers):

In [None]:
from keras import layers
from keras import applications
from keras import Input

In [None]:
# the base image-processing model is the Xception network
# (convolutional base only)
xception_base = applications.Xception(weights=None,
                                     include_top=False)

# the inputs are 250 x 250 RGB images
left_input = Input(shape=(250, 250, 3))
right_input = Input(shape=(250, 250, 3))

# calls the same vision model twice
left_features = xception_base(left_input)
right_features = xception_base(right_input)

# the merged features contain information from
# the right and left visual fields
merged_features = layers.concatenate(
 [left_features, right_features], axis=-1)

### 7.1.7 Wrapping up

- You can use the Keras functional API when you need
  multi-input, multi-output, or more complex
  directed acyclic graph-like networks.
- Multi-input and/or multi-output models will usually
  outperform building separate networks, training them
  in isolation, then combining the results.
- You can reuse weights of a layer or even a model
  (a collection of layers).  This is useful to
  reuse knowledge in several parallel streams in many
  types of tasks.

## 7.2 Inspecting and monitoring deep-learning models using Keras callbacks and TensorBoard

We can use callbacks to gain greater control over
what goes on during training.  We can get greater
insights into what is being learned using TensorBoard
visualization tools.


### 7.2.1 Using callbacks to act on a model during training

We have mostly been using a simple fixed number of epochs for training.  But when training, there are many things we don't know in advance: for example, how many epochs we will
need to reach an optimal validation loss.

So far our approach has been to first train and see at what epoch the model starts to overfit, then train again only for
that number of epochs.  This is an example of
a fixed training-epoch schedule.

A much better way to handle this is to stop training
when you measure that the validation loss is no
longer improving (for some value of "no longer improving").
This can be achieved using a Keras callback.
A *callback* is an object that is passed to the
model in the call to `fit` and that is called by
the model at various points during training.
It has access to all the available data about the state
of the model and its performance.  And it can
take action: interrupt training, save a model,
load a different set of weights, or otherwise alter the
state of the model.

Here are some examples of ways you can use callbacks:
- *Model checkpointing* - Saving the current weights
  of the model at different points during training.
- *Early stopping* - Interrupting training when the
  validation loss is no longer improving.
- *Dynamically adjusting the value of certain parameters during training* - such as the learning rate of the optimizer.
- *Logging training and validation metrics during training, or visualizing the representations learned by the model as they're updated* - The Keras progress bar is a callback, but you can do more.

The `keras.callbacks` module includes a number of
built-in (predefined) callbacks:

- `keras.callbacks.ModelCheckpoint`
- `keras.callbacks.EarlyStopping`
- `keras.callbacks.LearningRateScheduler`
- `keras.callbacks.ReduceLROnPlateau`
- `keras.callbacks.CSVLogger`




**ModelCheckpoint and EarlyStopping Callbacks**

You can use the `EarlyStopping` callback to interrupt
training once a target metric has stopped improving
for a fixed number of epochs.

This callback is typically used in combination with 
`ModelCheckpoint`, which lets you continually
save the model during training.

In [42]:
import keras

In [None]:
callbacks_list = [
    # interrupts training when improvement stops
    keras.callbacks.EarlyStopping(
        monitor='acc', # monitors the model's (training) accuracy
        patience=1, # number of epochs with no improvement after which training stops
    ),
    
    # saves the current weights after every epoch
    keras.callbacks.ModelCheckpoint(
        filepath='my_model.h5', # path to destination model file
        monitor='val_loss', # only overwrite the file when the
        save_best_only=True, # validation loss has improved, keeping the best model
    )
]

# we monitor accuracy, so it must be part of the metrics
# (this is needed because EarlyStopping uses it)
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])

# need to provide validation data because we use
# validation loss in the example for model checkpointing
model.fit(x, y,
          epochs=10, # will stop at 10 epochs if we don't stop early
          batch_size=32,
          callbacks=callbacks_list,
          validation_data=(x_val, y_val))
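
The logic behind combining the two callbacks can be sketched in plain Python.  This is a simplified illustration under stated assumptions, not the Keras implementation: `train_with_early_stopping` is a hypothetical helper, the loss values are made up, and "saving a checkpoint" is reduced to remembering the best epoch.

```python
def train_with_early_stopping(val_losses, patience):
    """Return (stopped_epoch, best_epoch) for a list of per-epoch losses."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    stopped_epoch = len(val_losses) - 1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # improvement: remember it (this is where a checkpoint would be saved)
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                stopped_epoch = epoch
                break
    return stopped_epoch, best_epoch


# made-up validation losses for illustration
history = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.6]
print(train_with_early_stopping(history, patience=2))   # (4, 2)
```

Training stops at epoch 4, but the best weights are the ones from epoch 2, which is why early stopping is usually paired with checkpointing of the best model.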


**ReduceLROnPlateau Callback**

You can use this callback to reduce the learning rate when
the validation loss has stopped improving.

Reducing or increasing the learning rate when the loss stops
improving (a *loss plateau*) is
an effective strategy to get out of local minima.

In [None]:
callbacks_list = [
    keras.callbacks.ReduceLROnPlateau(
        monitor='val_loss', # monitor the model validation loss
        factor=0.1, # multiplies lr by 0.1 when triggered
        patience=10, # only trigger when validation loss stops improving for 10 epochs
    )
]

# again we are monitoring validation loss in the callback
# so we need to pass in validation data so it can be computed
model.fit(x, y,
          epochs=10,
          batch_size=32,
          callbacks=callbacks_list,
          validation_data=(x_val, y_val))
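
The plateau rule itself is easy to sketch in plain Python.  `lr_schedule_on_plateau` is a hypothetical helper and the loss values are made up; it only illustrates the idea of multiplying the learning rate by `factor` once the monitored loss has failed to improve for `patience` epochs, and is not the Keras implementation.

```python
def lr_schedule_on_plateau(val_losses, lr, factor, patience):
    """Return the learning rate in effect after each epoch."""
    best_loss = float("inf")
    wait = 0
    rates = []
    for loss in val_losses:
        if loss < best_loss:
            best_loss, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor      # shrink the learning rate
                wait = 0          # and start counting again
        rates.append(lr)
    return rates


# made-up validation losses for illustration
rates = lr_schedule_on_plateau([1.0, 0.9, 0.9, 0.9, 0.8],
                               lr=0.1, factor=0.5, patience=2)
print(rates)   # [0.1, 0.1, 0.1, 0.05, 0.05]
```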

**Writing your own Callback**

You can write your own specialized callback.
Callbacks are implemented by simply 
subclassing the base class `keras.callbacks.Callback`.
You can then implement any of the following
functions, which are called during training:

- on_epoch_begin()
- on_epoch_end()
- on_batch_begin()
- on_batch_end()
- on_train_begin()
- on_train_end()

These methods are called with a `logs` argument,
which is a dictionary containing information about
the previous batch, epoch or training run.
In addition, since the callback is a class, there are
several attributes that are accessible including:

- `self.model` - the model instance being fit/trained
- `self.validation_data` - The value of what was passed to `fit` as validation data

Here is a simple example.  This saves to disk
the activations of every layer, computed on the first
sample of the validation data, at the end
of every epoch.

In [None]:
import keras

In [None]:
# we subclass the base class keras.callbacks.Callback
class ActivationLogger(keras.callbacks.Callback):
    
    # called by parent model before training
    def set_model(self, model):
        self.model = model
        layer_outputs = [layer.output for layer in model.layers]
        # model instance that returns the activations of every layer
        self.activations_model = keras.models.Model(model.input, layer_outputs)
        
    # called at end of each epoch, duh
    # we want to save activations of all layers
    # at end of each epoch
    def on_epoch_end(self, epoch, logs=None):
        if self.validation_data is None:
            raise RuntimeError('Requires validation_data.')
            
        # obtain the first input sample of the validation data
        validation_sample = self.validation_data[0][0:1]
        
        activations = self.activations_model.predict(validation_sample)
        
        # save arrays to disk (np.savez needs a file opened in binary mode)
        with open('activations_at_epoch' + str(epoch) + '.npz', 'wb') as f:
            # one array per layer
            np.savez(f, *activations)
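
To see when the hook methods fire, here is a toy training loop in plain Python.  Everything in it is hypothetical and written only for this illustration (the `Callback` base class, `EventRecorder`, and the `fit` function are not the Keras ones, and the loss is faked): it just drives the callbacks in the order described above.

```python
class Callback:
    """Hypothetical base class with no-op hooks (not the Keras class)."""

    def on_train_begin(self, logs=None):
        pass

    def on_epoch_begin(self, epoch, logs=None):
        pass

    def on_epoch_end(self, epoch, logs=None):
        pass

    def on_train_end(self, logs=None):
        pass


class EventRecorder(Callback):
    """Records the order in which the hooks fire."""

    def __init__(self):
        self.events = []

    def on_train_begin(self, logs=None):
        self.events.append("train_begin")

    def on_epoch_end(self, epoch, logs=None):
        self.events.append("epoch_end:%d loss=%.1f" % (epoch, logs["loss"]))

    def on_train_end(self, logs=None):
        self.events.append("train_end")


def fit(epochs, callbacks):
    """Toy training loop that only drives the callbacks."""
    for cb in callbacks:
        cb.on_train_begin()
    for epoch in range(epochs):
        for cb in callbacks:
            cb.on_epoch_begin(epoch)
        logs = {"loss": 1.0 / (epoch + 1)}   # fake metric
        for cb in callbacks:
            cb.on_epoch_end(epoch, logs)
    for cb in callbacks:
        cb.on_train_end()


recorder = EventRecorder()
fit(epochs=2, callbacks=[recorder])
print(recorder.events)
```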

### 7.2.2 Introduction to TensorBoard: the TensorFlow visualization framework

If you are working on tough or cutting-edge problems,
you need to know how to do "experiments" on your
model, to understand and visualize what is going
on and to understand how it is performing its
work.

Making progress (like experimental science) is
an iterative process.  You start with an idea,
formulate a hypothesis, design an experiment to
test the hypothesis, and run it to validate or
invalidate your idea.

The key purpose of TensorBoard is to help you
visually monitor everything that goes on inside
of your model during training.  TensorBoard
is browser based.  It gives you access to:

- Visually monitoring metrics during training
- Visualizing your model architecture
- Visualizing histograms of activations and
  gradients
- Exploring embeddings in 3D
  
  
Let's demonstrate on a simple example.  You'll train
a 1D convnet on the IMDB sentiment-analysis task.
We consider only the top 2,000 words in the IMDB
vocabulary, to make visualizing word embeddings
more tractable.

In [None]:
import keras
from keras import layers
from keras.datasets import imdb
from keras.preprocessing import sequence

In [None]:
# number of words to consider as features
max_features = 2000
# cuts off texts after this number of words
max_len = 500

# load the imdb data
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = sequence.pad_sequences(x_train, maxlen=max_len)
x_test = sequence.pad_sequences(x_test, maxlen=max_len)
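
As a rough illustration of what the padding step does to a single review, here is a pure-Python sketch.  `pad_or_truncate` is a hypothetical helper written for this note, not part of Keras; it assumes the behaviour of `pad_sequences` with its default settings (zeros added at the front, and overly long sequences cut from the front).

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Hypothetical helper: left-pad or left-truncate one sequence to maxlen."""
    if len(seq) >= maxlen:
        # keep only the last maxlen items
        return list(seq[-maxlen:])
    # prepend enough padding values to reach maxlen
    return [value] * (maxlen - len(seq)) + list(seq)


short_review = [11, 12, 13]
long_review = [1, 2, 3, 4, 5, 6, 7]

print(pad_or_truncate(short_review, maxlen=5))  # [0, 0, 11, 12, 13]
print(pad_or_truncate(long_review, maxlen=5))   # [3, 4, 5, 6, 7]
```

After this step every review has exactly `max_len` entries, so the data can be stacked into a single 2D integer tensor.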

In [None]:
model = keras.models.Sequential()
model.add(layers.Embedding(max_features, 128,
            input_length=max_len,
            name='embed'))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.MaxPooling1D(5))
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(1, activation='sigmoid'))  # sigmoid output to match binary_crossentropy

model.summary()
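
The sequence lengths that `model.summary()` reports can be checked by hand.  The sketch below is plain arithmetic, not Keras code; the two helper functions are hypothetical and assume the layers use stride-1 convolutions with no padding and a pooling stride equal to the pool size.

```python
def conv1d_out_len(length, kernel_size):
    """Output length of a stride-1 convolution without padding."""
    return length - kernel_size + 1


def maxpool1d_out_len(length, pool_size):
    """Output length of max pooling with stride equal to pool_size, no padding."""
    return (length - pool_size) // pool_size + 1


max_len = 500
after_conv1 = conv1d_out_len(max_len, 7)        # first Conv1D(32, 7)
after_pool = maxpool1d_out_len(after_conv1, 5)  # MaxPooling1D(5)
after_conv2 = conv1d_out_len(after_pool, 7)     # second Conv1D(32, 7)

print(after_conv1, after_pool, after_conv2)  # 494 98 92
```

The final `GlobalMaxPooling1D` layer then collapses the remaining time axis, leaving one value per filter.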

In [None]:
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])

**NOTE**: I created a directory called logfiles
one level up from the current directory, to
hold the TensorFlow logs.

In [None]:
callbacks = [
    keras.callbacks.TensorBoard(
        log_dir='../logfiles',
        histogram_freq=1, # records activation histogram every 1 epoch
        embeddings_freq=1, # records embedding data every 1 epoch
    )
]

history = model.fit(x_train, y_train,
                    epochs=20,
                    batch_size=128,
                    validation_split=0.2,
                    callbacks=callbacks)

Start the TensorBoard server while training is
running.  Run the command from the directory that
contains `logfiles` (one level up from this notebook):

```
$ tensorboard --logdir=logfiles
```

The graphs show all of the actual TensorFlow
structures/layers that were created from the
high-level Keras API.  Keras also provides
a cleaner way to plot models as graphs of
layers, rather than graphs of TensorFlow
operations: the `keras.utils.plot_model` utility.
This may not work unless you have the Python `pydot`
and `pydot-ng` libraries installed, as well
as the `graphviz` library.

(Try using `conda install` for pydot and graphviz.)

In [None]:
from keras.utils import plot_model

plot_model(model, to_file='figures/model.png')

<img src="figures/model.png">

In [None]:
plot_model(model, show_shapes=True, to_file='figures/model-shapes.png')

<img src="figures/model-shapes.png">

### 7.2.3 Wrapping up

- Keras callbacks allow you to monitor models during
  training and automatically take actions to stop
  training or modify training parameters while learning.
- When using the TensorFlow backend, you can use
  TensorBoard, a built-in browser-based tool, to
  visualize model activity (via the `TensorBoard`
  callback in Keras).

## 7.3 Getting the most out of your models

Most materials up to this point have been helping
you to build basic intuitions on what works, to begin
specifying hyperparameters and model architectures.
This section has some discussions of newer promising
trends.

### 7.3.1 Advanced architecture patterns

Residual connections, covered in 7.1, are one 
example of advanced architecture patterns to keep
in mind.

**Batch Normalization**

Another is *batch normalization*.  You are familiar
with normalization techniques.  We have always, to
this point, exhorted you to normalize your data.
We have always normalized the data in preprocessing,
before using it for training in a DNN.

But even if data entering a network has been
normalized, there is no guarantee, given the
current settings of the weights of the layer, that
the output tensors will also be normalized.  
This can again be a problem, for the same reason
we have been normalizing data before training with it:
if data tensors output from layers have values that
are of different scales or large ranges, it can make
it more difficult for the network to learn.

Batch normalization is a type of layer
(`BatchNormalization` in Keras) that can
adaptively normalize data, even as the mean and
variance change over time during training.  Basically
it does the same thing we did by hand: it shifts
the mean to 0 and scales the data to have a
standard deviation of 1.  But since the mean and
variance can and will change during training, the
batch normalization layer maintains an exponentially
weighted moving average of the batch-wise mean and
variance of the data it sees.  It also has a set of
parameters to shift and scale the output, which it
learns at the same time that gradient descent
learning occurs.
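
A minimal NumPy sketch of one training-time batch normalization step is shown below, to make the computation concrete.  It is not the Keras implementation: `batch_norm_train_step` is a hypothetical function, `gamma` and `beta` stand in for the learned scale and shift parameters (held fixed here), and the `momentum` and `epsilon` values are illustrative assumptions.

```python
import numpy as np


def batch_norm_train_step(x, gamma, beta, moving_mean, moving_var,
                          momentum=0.99, epsilon=1e-3):
    """Normalize a (batch, features) array and update the moving statistics."""
    batch_mean = x.mean(axis=0)
    batch_var = x.var(axis=0)

    # shift to zero mean and scale to unit variance, per feature
    x_hat = (x - batch_mean) / np.sqrt(batch_var + epsilon)

    # apply the scale (gamma) and shift (beta) parameters
    out = gamma * x_hat + beta

    # exponentially weighted moving averages, intended for use at inference time
    new_mean = momentum * moving_mean + (1.0 - momentum) * batch_mean
    new_var = momentum * moving_var + (1.0 - momentum) * batch_var
    return out, new_mean, new_var


rng = np.random.RandomState(0)
x = rng.normal(loc=5.0, scale=3.0, size=(128, 4))  # un-normalized layer output

gamma, beta = np.ones(4), np.zeros(4)
moving_mean, moving_var = np.zeros(4), np.ones(4)

out, moving_mean, moving_var = batch_norm_train_step(
    x, gamma, beta, moving_mean, moving_var)

print(out.mean(axis=0).round(3))  # close to 0 for every feature
print(out.std(axis=0).round(3))   # close to 1 for every feature
```

In a real layer `gamma` and `beta` would be updated by gradient descent along with the other weights, while the moving averages are updated separately on every batch.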

The main effect of batch normalization is that it
helps with gradient propagation.  By normalizing
layer outputs, it helps preserve the gradient signal
as it flows back through the network, so it is
also a technique for addressing vanishing gradient
problems.  This allows deeper networks to be
created and trained effectively.

The `BatchNormalization` layer is typically
used after a convolutional or densely connected
layer.  You can have multiple `BatchNormalization`
layers in a network, but it doesn't make sense to
put batch normalization layers everywhere, for
example after max-pooling layers.  It is less clear
whether they are useful after recurrent layers (it
will depend on the sequence/series data).

In [None]:
# after a conv layer
conv_model.add(layers.Conv2D(32, 3, activation='relu'))
conv_model.add(layers.BatchNormalization())

# after a Dense layer
dense_model.add(layers.Dense(32, activation='relu'))
dense_model.add(layers.BatchNormalization())

The `BatchNormalization` layer takes an `axis`
argument, which specifies the feature axis that
should be normalized.  The default is -1 (which
normalizes over the last axis of the input tensor).
This is correct for `Dense` layers, `Conv1D` and
`Conv2D` when the channels are last.  When
channels are first in a `Conv2D` you should set
the `axis` to 1.
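
The role of the feature axis can be illustrated with NumPy.  In this sketch (a hypothetical helper, not Keras code) statistics are computed per feature by reducing over every axis except the chosen one, which is the idea behind the `axis` argument.

```python
import numpy as np


def per_feature_stats(x, axis=-1):
    """Mean and variance per feature, reducing over every axis except `axis`."""
    feature_axis = axis % x.ndim
    reduce_axes = tuple(i for i in range(x.ndim) if i != feature_axis)
    return x.mean(axis=reduce_axes), x.var(axis=reduce_axes)


rng = np.random.RandomState(0)
channels_last = rng.normal(size=(8, 16, 16, 3))   # (batch, height, width, channels)
channels_first = rng.normal(size=(8, 3, 16, 16))  # (batch, channels, height, width)

mean_last, var_last = per_feature_stats(channels_last, axis=-1)
mean_first, var_first = per_feature_stats(channels_first, axis=1)
print(mean_last.shape, mean_first.shape)  # (3,) (3,)

# using the default axis on channels-first data treats the width as the feature axis
mean_wrong, _ = per_feature_stats(channels_first, axis=-1)
print(mean_wrong.shape)  # (16,)
```

With the correct axis there is one mean and one variance per channel; with the wrong axis the statistics are computed per image column instead, which is not what we want.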

*Batch renormalization* is a recent improvement
to be aware of, from one of the original designers of
batch normalization, that may become common and
replace batch normalization.  It appears to offer
clear benefits over batch normalization at no apparent cost.

**Depthwise Separable Convolution**

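The depthwise separable convolution layer
(`SeparableConv2D` in Keras) is a drop-in replacement
for `Conv2D` that has fewer trainable parameters, is
faster, and usually performs better than a simple
`Conv2D`.

Structurally it is a small acyclic directed graph
of operations.  A spatial convolution is applied to
each input channel independently (in parallel), the
per-channel results are concatenated, and the result
is then processed through a 1x1 pointwise
convolution (Figure 7.16).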
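
The saving in trainable parameters can be checked with a little arithmetic.  The helper functions below are hypothetical and written for this note; they assume a square kernel, one depthwise filter per input channel, and a single bias vector per layer.

```python
def conv2d_params(kernel_size, in_channels, out_channels):
    """Weights of a regular convolution plus one bias per output channel."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


def separable_conv2d_params(kernel_size, in_channels, out_channels):
    """Depthwise spatial filters, then a 1x1 pointwise convolution, plus biases."""
    depthwise = kernel_size * kernel_size * in_channels  # one filter per input channel
    pointwise = in_channels * out_channels               # 1x1 convolution mixing channels
    return depthwise + pointwise + out_channels


# a 3x3 layer going from 32 to 64 channels
print(conv2d_params(3, 32, 64))            # 18496
print(separable_conv2d_params(3, 32, 64))  # 2400
```

For this layer size the separable version needs roughly one eighth of the parameters, because the spatial filtering and the channel mixing are learned separately rather than jointly.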

This may be especially advantageous for small models
trained on limited data (either image classification
or using convolutions for natural language sequences).

Here is an example of an image-classification network
that uses these separable convolutions with a
softmax categorical classification output.


In [None]:
from keras.models import Sequential, Model
from keras import layers

height = 64
width = 64
channels = 3
num_classes = 10

In [None]:
model = Sequential()
model.add(layers.SeparableConv2D(32, 3,
                                activation='relu',
                                input_shape=(height, width, channels)))
model.add(layers.SeparableConv2D(64, 3, activation='relu'))
model.add(layers.MaxPooling2D(2))

model.add(layers.SeparableConv2D(64, 3, activation='relu'))
model.add(layers.SeparableConv2D(128, 3, activation='relu'))
model.add(layers.MaxPooling2D(2))

model.add(layers.SeparableConv2D(64, 3, activation='relu'))
model.add(layers.SeparableConv2D(128, 3, activation='relu'))
model.add(layers.GlobalAveragePooling2D())

model.add(layers.Dense(32, activation='relu'))
model.add(layers.Dense(num_classes, activation='softmax'))

model.compile(optimizer='rmsprop', loss='categorical_crossentropy')


Separable convolutions are also used heavily in
more recent larger models, like Xception.

### 7.3.2 Hyperparameter optimization

When you do ML long enough, you build up intuition
about good hyperparameters to start from when building
architectures for different types of tasks.
But your initial settings are almost certainly not
going to be optimal.  We can iterate by hand,
exploring the hyperparameter space.
But there is no good principled way to
explore the hyperparameter space (unlike the
normal weight/parameter space), because:

- Computing the feedback signal can be extremely
  expensive (imagine training a ResNet from scratch
  to explore different numbers of layers or lr settings).
- The hyperparameter space is typically made up of
  discrete decisions.  Not being continuous, we can't
  do gradient descent optimization on the
  hyperparameters, and must rely on gradient-free
  optimization techniques.
  
Random (space-covering, and informed) search is still
often the best option.
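
The idea of random search can be sketched without any deep-learning library.  In the sketch below `toy_validation_loss` is a made-up stand-in for "train a model with these settings and return its validation loss", and the search ranges are arbitrary assumptions chosen only for illustration.

```python
import random


def toy_validation_loss(learning_rate, num_layers):
    """Hypothetical objective standing in for an expensive training run."""
    return (learning_rate - 0.01) ** 2 * 1e4 + (num_layers - 3) ** 2


def random_search(num_trials, seed=0):
    """Sample hyperparameters at random and keep the best-scoring setting."""
    rng = random.Random(seed)
    best = None
    for _ in range(num_trials):
        params = {
            'learning_rate': 10 ** rng.uniform(-4, -1),  # log-uniform sample
            'num_layers': rng.randint(1, 6),             # discrete choice
        }
        loss = toy_validation_loss(**params)
        if best is None or loss < best[0]:
            best = (loss, params)
    return best


best_loss, best_params = random_search(num_trials=50)
print(best_loss, best_params)
```

No gradients are needed: each trial only requires evaluating the objective, which is why this approach works even when some hyperparameters are discrete.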

Tools to try out, if you have a cluster and need to
really methodically explore a hyperparameter
space for a network: Hyperopt, a Python library for
hyperparameter optimization in general, and
Hyperas (Hyperopt for Keras).

### 7.3.3 Model ensembling

Ensembling consists of pooling together the
predictions of a set of different models, to
produce better predictions.

This assumes that different models trained
independently are likely to be good for
*different reasons*.

When combining models, you can take the simple average.
For example, if you have 4 softmax classifiers, you could
add their predictions and take the average to get
'ensembled' predictions.

In [None]:
preds_a = model_a.predict(x_val)
preds_b = model_b.predict(x_val)
preds_c = model_c.predict(x_val)
preds_d = model_d.predict(x_val)

final_preds = 0.25 * (preds_a + preds_b + preds_c + preds_d)

This only works well if all 4 classifiers are more
or less equally good.

If not, use a weighted average.  Determining
the optimal weights for an ensemble of models
could itself be treated as an ML/optimization task.

In [None]:
preds_a = model_a.predict(x_val)
preds_b = model_b.predict(x_val)
preds_c = model_c.predict(x_val)
preds_d = model_d.predict(x_val)

final_preds = 0.5 * preds_a + 0.25 * preds_b + 0.1 * preds_c + 0.15 * preds_d
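
One simple way to choose such weights is a random search on validation data, sketched below with NumPy.  Everything here is an assumption made for illustration: the predictions are synthetic (`fake_preds` fabricates softmax-like outputs with different noise levels), and `search_ensemble_weights` is a hypothetical helper, not a Keras function.

```python
import numpy as np


def log_loss(y_true, probs, eps=1e-7):
    """Mean categorical cross-entropy for one-hot labels."""
    probs = np.clip(probs, eps, 1.0)
    return float(-np.mean(np.sum(y_true * np.log(probs), axis=1)))


def search_ensemble_weights(preds, y_true, num_trials=200, seed=0):
    """Random search over non-negative weight vectors that sum to 1."""
    rng = np.random.RandomState(seed)
    stacked = np.stack(preds)  # (num_models, samples, classes)
    best_w = np.full(len(preds), 1.0 / len(preds))  # start from the simple average
    best_loss = log_loss(y_true, np.tensordot(best_w, stacked, axes=1))
    for _ in range(num_trials):
        w = rng.dirichlet(np.ones(len(preds)))  # random point on the simplex
        loss = log_loss(y_true, np.tensordot(w, stacked, axes=1))
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss


# synthetic validation labels and four fake classifiers of decreasing quality
rng = np.random.RandomState(1)
labels = rng.randint(0, 3, size=200)
y_val = np.eye(3)[labels]


def fake_preds(noise):
    """Fabricated softmax outputs: noisier logits mean a weaker classifier."""
    logits = 2.0 * y_val + noise * rng.normal(size=y_val.shape)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


preds = [fake_preds(0.5), fake_preds(1.0), fake_preds(2.0), fake_preds(3.0)]
weights, loss = search_ensemble_weights(preds, y_val)
print(weights.round(2), round(loss, 4))
```

Because the search starts from the simple average, the returned weights are never worse than it on the validation data; a final check on held-out test data is still needed, since the weights were tuned on the validation set.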


To emphasize: the goal is for the ensembled models to be
*as good as possible* while being *as different as possible*.

### 7.3.4 Wrapping up

