# 7. Advanced Deep Learning Best Practices

### The Keras Functional API

So far, the neural networks have been implemented using the `Sequential` model. This assumes that the model has <u>one and only one input</u> and <u>one and only one output</u>. Also, there is a linear stack of layers. Think of it as only 1 path with multiple layers.

This is not ideal for some cases. Some networks have multiple independent inputs and some produce multiple outputs. Furthermore, some models have internal branching between layers that makes them look like graphs rather than linear stacks of layers.

Some tasks require <b>multimodal</b> inputs: they merge data from different input sources, processing each type of data with a different kind of neural layer. Predicting jointly from different types of input (e.g. images & text) is usually better than training a separate model for each input type. Similarly, some models predict multiple target attributes of the input data, for example jointly predicting the year of release and the genre of a piece of writing.

<img src="img71.png" width="600">
<img src="img72.png" width="600">

The following are 3 examples of recent architectures that also don't obey the 1-input, 1-output, 1-stack architecture:

- <b>Wide & Deep</b> neural network - This architecture connects all or part of the inputs directly to the output layer. With this architecture, it is possible to learn both deep patterns  (using the deep path) and simple rules (using the short path). More at [Wide & Deep Learning: Better Together with TensorFlow](https://ai.googleblog.com/2016/06/wide-deep-learning-better-together-with.html)
<img src="img3.png" width="900"/>

- <b>Inception Family</b> - relies on inception modules, where the input is processed by several parallel convolutional branches, and their outputs are merged to a single tensor. More at [Going Deeper with Convolutions](https://arxiv.org/abs/1409.4842)

- <b>Adding Residual Connections</b> - A residual connection reinjects an earlier representation into the downstream flow by adding a past output tensor to a later output tensor. More at [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385)

<img src="img73.png" width="450"/>

To handle these and similar use cases, we cannot use the `Sequential` model. Keras offers a more flexible alternative: the <b>functional API</b>.

In [20]:
from tensorflow.keras.datasets import boston_housing

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold, train_test_split

from tensorflow.keras.utils import to_categorical
from tensorflow.keras import Input, layers, models, backend, applications

In [4]:
# Ingestion
###########
(train_data, y_train), (test_data, y_test) = boston_housing.load_data()

# Preprocessing
###############
sc = StandardScaler()
x_train = sc.fit_transform(train_data)
x_test = sc.transform(test_data)

x_train__train, x_train__val, y_train__train, y_train__val = train_test_split(x_train, y_train, test_size=0.15,
                                                                             random_state=0)
NUM_FEATURES = x_train.shape[1:]

### Introduction to the Functional API

In the functional API, you directly manipulate tensors, and use layers as <u>functions</u> that take tensors and return tensors (hence, functional).
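
To make the "layers as functions" idea concrete, here is a minimal NumPy sketch. `TinyDense` is a hypothetical stand-in written for illustration, not the Keras implementation; it only shows that a layer object holds its own weights and is applied by calling it on an array.

```python
import numpy as np


class TinyDense:
    """Hypothetical stand-in for a dense layer: a callable that owns its weights."""

    def __init__(self, in_dim, units, rng):
        self.w = rng.normal(scale=0.1, size=(in_dim, units))
        self.b = np.zeros(units)

    def __call__(self, x):
        # ReLU(x @ w + b): takes an array and returns an array.
        return np.maximum(x @ self.w + self.b, 0.0)


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 13))      # 4 samples with 13 features, like the housing data

layer1 = TinyDense(13, 32, rng)
layer2 = TinyDense(32, 32, rng)

h1 = layer1(x)    # a layer is applied like a function...
h2 = layer2(h1)   # ...and layers are chained by passing outputs along
print(h1.shape, h2.shape)
```

The functional API follows the same calling pattern, except that the tensors are symbolic and Keras records which layer produced which tensor.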

#### Single Input, Single Output, One Linear Stack

Let's build a side-by-side comparison of a simple model to tackle the **housing prices** regression problem.

In [None]:
# Using models.Sequential()
###########################
backend.clear_session()
m11 = models.Sequential() # Model
m11.add(layers.Dense(32, activation='relu', 
                     input_shape=(NUM_FEATURES)))
m11.add(layers.Dense(32, activation='relu'))
m11.add(layers.Dense(1))
print(m11.summary())

m11.compile(optimizer='rmsprop', loss='mse', metrics=['mae']) # Compile & Fit
m11.fit(x_train__train, y_train__train, 
        epochs=20, batch_size=4,
        validation_data= (x_train__val, y_train__val),
       verbose=0)

In [None]:
# Using Functional API
######################
backend.clear_session()
m12_input = Input(shape=NUM_FEATURES) # Model
m12_l1 = layers.Dense(32, activation='relu')(m12_input)
m12_l2 = layers.Dense(32, activation='relu')(m12_l1)
m12_output = layers.Dense(1)(m12_l2)
m12 = models.Model(m12_input, m12_output)
print(m12.summary())
m12.compile(optimizer='rmsprop', loss='mse', metrics=['mae']) # Compile & Fit
m12.fit(x_train__train, y_train__train, 
       epochs=20, batch_size=4,
       validation_data=(x_train__val, y_train__val),
       verbose=0)

In [None]:
# Predict step
print(m11.predict(x_train__val[:3]))
print(m12.predict(x_train__val[:3]))

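Behind the scenes, Keras retrieves every layer involved in going from the inputs to the outputs and gathers them into a graph-like data structure, a `Model`. This only works because the outputs were obtained by repeatedly transforming the inputs; if the outputs cannot be traced back to the inputs, Keras raises an error when the `Model` is created.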

<hr>

#### Multiple Inputs, Single Output
Now, we shall build a model that has multiple inputs. Typically, such models merge their input branches at some point using a layer that can combine several tensors, for example by adding or concatenating them.

<b>Example 1</b> - The **housing prices problem** now requires we use a subset of the features for one input and another subset of features for another. To do this, we need to make changes on <u>both the architecture</u> and the <u>input data</u>.

For the architecture, the key features are:
- 2 input layers
- concatenate layer

In [None]:
# Instantiate Model
###################
# Inputs
input_layera = layers.Input(shape=(10,))
input_layerb = layers.Input(shape=(7,))

# Dense layers on input B, a Concatenate layer joining input A with that deep path, then the output layer
hidden_layer1 = layers.Dense(30, activation='relu')(input_layerb)
hidden_layer2 = layers.Dense(30, activation='relu')(hidden_layer1)
concat_layer = layers.Concatenate()([input_layera, hidden_layer2])
output_layer = layers.Dense(1)(concat_layer)
m21 = models.Model(inputs=[input_layera, input_layerb], outputs=output_layer)
print(m21.summary())

For the input data, the key features are:

- split the data to different subsets of features
- in the `fit()` step, pass the inputs as a list (or tuple) of the 2 arrays, in the same order as the inputs given to `Model(inputs=...)`. The same order must be used in the `evaluate()` and `predict()` steps
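
The column split and the shapes flowing through the two-input architecture can be traced with plain NumPy. This is a sketch on random stand-in data, not the Keras model itself; `dense_params` is a helper introduced here, using the standard count of one weight per (input, unit) pair plus one bias per unit.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 13))              # stand-in for 5 scaled housing rows

inputa_cols = list(range(0, 10))          # same column choices as in this section
inputb_cols = [1, 5, 6, 7, 8, 11, 12]
x_a, x_b = x[:, inputa_cols], x[:, inputb_cols]


def dense_params(in_dim, units):
    # One weight per (input, unit) pair plus one bias per unit.
    return in_dim * units + units


hidden2 = np.zeros((5, 30))               # shape of the second hidden layer's output
concat = np.concatenate([x_a, hidden2], axis=-1)   # joins along the feature axis

total_params = dense_params(7, 30) + dense_params(30, 30) + dense_params(40, 1)
print(x_a.shape, x_b.shape, concat.shape, total_params)
```

The concatenated tensor has 10 + 30 = 40 features, so the output layer sees both the raw features of input A and the learned features from input B. The parameter total computed this way should agree with what `m21.summary()` reports.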

In [None]:
# Prepare data for training model
#################################
inputa_cols = list(range(0,10))
inputb_cols = [1,5,6,7,8,11,12]
x_train__trainA = x_train__train[:,inputa_cols]
x_train__trainB = x_train__train[:,inputb_cols]
x_train__val_A = x_train__val[:,inputa_cols]
x_train__val_B = x_train__val[:,inputb_cols]

In [None]:
# For testing
# print(x_train__trainA.shape)
# print(x_train__trainB.shape)

In [None]:
# Train & Tune Model
####################
m21.compile(optimizer='sgd', loss='mean_squared_error', metrics=['mae'])
m21.fit((x_train__trainA, x_train__trainB), y_train__train, epochs=20,
           validation_data=((x_train__val_A, x_train__val_B), y_train__val), verbose=0)

In [None]:
# Prepare test data
###################
x_testA = x_test[:,inputa_cols]
x_testB = x_test[:,inputb_cols]

# Evaluation
m21.evaluate((x_testA, x_testB), y_test)

# Prediction
m21.predict((x_testA[:2], x_testB[:2]))

<b>Example 2</b> - Consider a **Q&A problem** where a reference text and a question are the inputs, and the output is a one-word answer. Concretely, the reference text is a news article, the question is one of "country/person/incident", and the output is a one-word answer.

In [None]:
TEXT_VOCAB_SIZE, QUESTION_VOCAB_SIZE, ANSWER_VOCAB_SIZE = 10000, 25, 500
max_length, max_qn_length, max_ans_length = 100, 25, 5
max_samples = 1000

text_corpus = np.random.randint(1, TEXT_VOCAB_SIZE,
                               size=(max_samples, max_length))
questions_corpus = np.random.randint(1, QUESTION_VOCAB_SIZE,
                               size=(max_samples, max_qn_length))
answers_corpus = np.random.randint(0,ANSWER_VOCAB_SIZE,
                                  size=(max_samples,))
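# Fix the width explicitly: without num_classes, the one-hot width is inferred
# from the largest sampled label and may be smaller than ANSWER_VOCAB_SIZE
answers_corpus = to_categorical(answers_corpus, num_classes=ANSWER_VOCAB_SIZE)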

In [None]:
print(text_corpus.shape)
# print(text_corpus[:2])
# print()
print(questions_corpus.shape)
# print(questions_corpus[:2])
# print()
print(answers_corpus.shape)
# print(answers_corpus[:2])

In [None]:
backend.clear_session()
m31_corpus_input = Input(shape=(max_length,), dtype='int32')
m31_qn_input = Input(shape=(max_qn_length,), dtype='int32')

m31_corpus_emb = layers.Embedding(TEXT_VOCAB_SIZE, 64)(m31_corpus_input)
m31_qn_emb = layers.Embedding(QUESTION_VOCAB_SIZE, 64)(m31_qn_input)

m31_corpus_lstm = layers.LSTM(32)(m31_corpus_emb)
m31_qn_lstm = layers.LSTM(32)(m31_qn_emb)

m31_concat = layers.Concatenate()([m31_corpus_lstm, m31_qn_lstm])
m31_ans = layers.Dense(ANSWER_VOCAB_SIZE, activation='softmax')(m31_concat)
m31 = models.Model(inputs=[m31_corpus_input, m31_qn_input], outputs=m31_ans)
print(m31.summary())

In [None]:
m31.compile(optimizer='rmsprop', 
            loss='categorical_crossentropy',
            metrics=['acc'])

In [None]:
m31.fit([text_corpus, questions_corpus], answers_corpus, 
        epochs=10, batch_size=128)

In [None]:
m31_pred = m31.predict([text_corpus[:2], questions_corpus[:2]])
print(np.argmax(m31_pred[0]))
print(np.argmax(m31_pred[1]))
print(np.argmax(answers_corpus[0]))

<hr>

#### Single Input, Multiple Outputs
There are some other models that take one input and simultaneously predict different properties of the data.

<b>Example 1</b> - Consider a **social media problem** where the network takes in a social media post as the input and predicts 3 outputs: the age, gender and income level of the poster.

In [35]:
# Properties
TEXT_VOCAB_SIZE = 50000
NUM_GENDER_GROUPS, NUM_AGE_GROUPS = 2, 5
text_length = 512 
num_samples=1000

# Prepare data
text_corpus = np.random.randint(1, TEXT_VOCAB_SIZE,
                               size=(num_samples, text_length))
income_outcomes = np.random.random((num_samples,))
gender_outcomes = np.random.randint(0, NUM_GENDER_GROUPS,
                                   (num_samples,))
age_outcomes = np.random.randint(0, NUM_AGE_GROUPS, 
                                (num_samples,))
# IMPORTANT: 'categorical_crossentropy' expects one-hot encoded targets
# (with integer labels, use 'sparse_categorical_crossentropy' instead)
age_outcomes = to_categorical(age_outcomes, num_classes=NUM_AGE_GROUPS)

In [36]:
# Model
backend.clear_session()
input_layer = Input(shape=(None,), dtype='int32', name='posts')

embed_layer = layers.Embedding(
    TEXT_VOCAB_SIZE, 256, input_length=text_length)(input_layer)
stacked_layer = layers.Conv1D(16, 8, activation='relu')(embed_layer)
stacked_layer = layers.MaxPooling1D(4)(stacked_layer)
stacked_layer = layers.Conv1D(32, 8, activation='relu')(stacked_layer)
stacked_layer = layers.GlobalMaxPooling1D()(stacked_layer)
stacked_layer = layers.Dense(128, activation='relu')(stacked_layer)

age_layer = layers.Dense(NUM_AGE_GROUPS, 
                         activation='softmax',
                         name='age')(stacked_layer)
income_layer = layers.Dense(1, name='income')(stacked_layer)
gender_layer = layers.Dense(1, activation='sigmoid',
                            name='gender')(stacked_layer)
m41 = models.Model(input_layer, 
                   [age_layer, 
                       income_layer, gender_layer])
print(m41.summary())

In [38]:
m41.compile(optimizer='rmsprop', 
            loss={'age' : 'categorical_crossentropy', 
                  'income' : 'mse', 
                  'gender' : 'binary_crossentropy'},
            metrics={'age' : 'acc', 
                     'income' : 'mae', 
                     'gender' : 'acc'})
h41 = m41.fit(text_corpus, 
        {'age': age_outcomes, 
         'income' : income_outcomes, 
         'gender' : gender_outcomes},
        epochs=20, batch_size=64, validation_split=0.25)


In [40]:
display(pd.DataFrame(h41.history).tail())

| epoch | loss | age_loss | income_loss | gender_loss | age_acc | income_mae | gender_acc | val_loss | val_age_loss | val_income_loss | val_gender_loss | val_age_acc | val_income_mae | val_gender_acc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15 | 0.036835 | 0.002021 | 0.033107 | 0.001707 | 1.0 | 0.144874 | 1.0 | 3.094037 | 1.913367 | 0.2289 | 0.951769 | 0.204 | 0.40229 | 0.492 |
| 16 | 0.044678 | 0.001954 | 0.041126 | 0.001598 | 1.0 | 0.163551 | 1.0 | 3.072343 | 1.924612 | 0.201363 | 0.946367 | 0.212 | 0.376508 | 0.492 |
| 17 | 0.033193 | 0.001859 | 0.029634 | 0.0017 | 1.0 | 0.140448 | 1.0 | 3.063819 | 1.914781 | 0.190971 | 0.958067 | 0.216 | 0.366486 | 0.492 |
| 18 | 0.030628 | 0.001592 | 0.02782 | 0.001216 | 1.0 | 0.132962 | 1.0 | 3.046853 | 1.963621 | 0.168141 | 0.915091 | 0.208 | 0.344419 | 0.492 |
| 19 | 0.028199 | 0.001462 | 0.025508 | 0.001229 | 1.0 | 0.128252 | 1.0 | 3.031599 | 1.947253 | 0.222498 | 0.861848 | 0.172 | 0.396601 | 0.492 |
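
When no `loss_weights` argument is passed to `compile()`, the total `loss` that Keras reports for a multi-output model is the plain sum of the per-output losses. The short check below uses the training columns of the logged history above (values copied by hand) to confirm this.

```python
# (epoch, loss, age_loss, income_loss, gender_loss) copied from the history above
rows = [
    (15, 0.036835, 0.002021, 0.033107, 0.001707),
    (16, 0.044678, 0.001954, 0.041126, 0.001598),
    (17, 0.033193, 0.001859, 0.029634, 0.001700),
    (18, 0.030628, 0.001592, 0.027820, 0.001216),
    (19, 0.028199, 0.001462, 0.025508, 0.001229),
]

matches = []
for epoch, total, age, income, gender in rows:
    summed = age + income + gender
    matches.append(abs(summed - total) < 1e-5)
    print(epoch, round(summed, 6), matches[-1])
```

Because the three losses live on different scales (cross-entropy versus mean squared error), the unweighted sum can be dominated by one task. The `loss_weights` argument of `compile()` is the usual way to rebalance the contributions.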


<hr>

#### DAG of Layers / Complex Architecture
Beyond multiple inputs and multiple outputs, we can also build models with complex internal topology. Neural networks in Keras are allowed to be arbitrary directed acyclic graphs (DAGs) of layers. 

<b>Example 1</b> - Let's build a wide & deep network to tackle the **housing prices** problem. Take note of the comments describing each layer.

In [18]:
# Instantiate Model
###################
backend.clear_session()
# Input object. This is needed as we might have multiple inputs.
m51_input_layer = layers.Input(shape=NUM_FEATURES)

# Dense layer with 30 neurons & RELU activation. Notice it is called like a function,
# passing in the input layer. 
m51_x = layers.Dense(30, activation='relu')(m51_input_layer)
# Another Dense layer. Now, the first hidden layer is passed in.
m51_x = layers.Dense(30, activation='relu')(m51_x)

# Concatenate layer. concatenates the input & the output of the 2nd hidden layer
m51_concat_layer = layers.Concatenate()([m51_input_layer, m51_x])

# Output layer. Single neuron and no activation function.
m51_output_layer = layers.Dense(1)(m51_concat_layer)

# Finally, create the Keras model with this architecture.
m51 = models.Model(inputs=[m51_input_layer], outputs=m51_output_layer)

In [9]:
# Train & Tune Model
####################
m51.compile(optimizer='sgd', loss='mean_squared_error', metrics=['mae'])
# Keep the History object in its own variable so that m51 still refers to the model
h51 = m51.fit(x_train, y_train, epochs=10, verbose=0)

In [10]:
# Save model
# m51.save('model0.h5')

<b>Example 2</b> - Inception is a popular type of network architecture for CNNs, developed at Google in 2013-2014. More at [Going Deeper with Convolutions](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43022.pdf). It consists of a stack of modules that themselves look like small independent networks, split into several parallel branches. The most basic form of an Inception module has three to four branches, each starting with a 1x1 convolution, followed by a 3x3 convolution, and ending with the concatenation of the resulting features. This allows the network to learn spatial features and channel-wise features separately, which is more efficient than learning them jointly.

There are several implementations of inception, but here is one:

In [12]:
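m61_input = Input(shape=(128, 128, 4))

# Branch a: a single 1x1 convolution, downsampling with stride 2
m61_bch_a = layers.Conv2D(128, 1, activation='relu', strides=2, padding='same')(m61_input)

# Branch b: 1x1 convolution, then a strided 3x3 convolution
m61_bch_b = layers.Conv2D(128, 1, activation='relu', padding='same')(m61_input)
m61_bch_b = layers.Conv2D(128, 3, activation='relu', strides=2, padding='same')(m61_bch_b)

# Branch c: strided average pooling, then a 3x3 convolution
m61_bch_c = layers.AveragePooling2D(3, strides=2, padding='same')(m61_input)
m61_bch_c = layers.Conv2D(128, 3, activation='relu', padding='same')(m61_bch_c)

# Branch d: 1x1 convolution, then two 3x3 convolutions (the last one strided)
m61_bch_d = layers.Conv2D(128, 1, activation='relu', padding='same')(m61_input)
m61_bch_d = layers.Conv2D(128, 3, activation='relu', padding='same')(m61_bch_d)
m61_bch_d = layers.Conv2D(128, 3, activation='relu', strides=2, padding='same')(m61_bch_d)

# padding='same' gives every branch the same spatial size (64x64), which concatenation
# along the channel axis requires. Concatenate is a layer: instantiate it, then call it.
output = layers.Concatenate(axis=-1)([m61_bch_a, m61_bch_b, m61_bch_c, m61_bch_d])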
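
Concatenating branches along the channel axis only works if every branch ends with the same height and width. The sketch below computes the output size along one spatial axis for the two Keras padding modes, assuming a 128-pixel input; `out_size` is a helper written for this illustration, not a Keras function.

```python
import math


def out_size(n, kernel, stride=1, padding="valid"):
    """Output size along one spatial axis for a convolution or pooling layer."""
    if padding == "same":
        return math.ceil(n / stride)
    return (n - kernel) // stride + 1      # 'valid': no padding is added


n = 128
# A branch made of one 1x1 convolution with stride 2
a_valid = out_size(n, 1, 2, "valid")
a_same = out_size(n, 1, 2, "same")
# A branch made of a 1x1 convolution followed by a 3x3 convolution with stride 2
b_valid = out_size(out_size(n, 1), 3, 2, "valid")
b_same = out_size(out_size(n, 1, padding="same"), 3, 2, "same")

print(a_valid, b_valid)   # sizes under 'valid' padding
print(a_same, b_same)     # sizes under 'same' padding
```

Under `'valid'` padding the two branches end at different sizes, so they cannot be concatenated; under `'same'` padding the size depends only on the strides, which makes the branches line up.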

<b>Example 3</b> - Residual connections are a common graph-like network component found in many post-2015 network architectures. They were introduced by Microsoft in their winning entry in the ILSVRC ImageNet challenge in 2015. They tackle two common problems that plague any large-scale deep learning model: vanishing gradients and representational bottlenecks. 

A residual connection consists of making the output of an earlier layer available as input to a later layer, effectively creating a shortcut in a sequential network. Rather than being concatenated to the later activation, the earlier output is summed with the later activation, which assumes that both activations are the same size.
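
The summation can be sketched in NumPy. This is an illustration on dense arrays with made-up sizes, not the convolutional version: `block` stands in for the layers being skipped, and when the shapes differ a linear projection (here `w_proj`, an assumed name) brings the earlier activation to the right size before it is added.

```python
import numpy as np

rng = np.random.default_rng(0)


def block(x, w):
    # Stand-in for the skipped layers: one linear map followed by ReLU.
    return np.maximum(x @ w, 0.0)


x = rng.normal(size=(4, 16))

# Same size: add the earlier activation back unchanged (identity shortcut).
w_same = rng.normal(scale=0.1, size=(16, 16))
y_identity = block(x, w_same) + x

# Different size: project x linearly so the shapes agree before summing.
w_wide = rng.normal(scale=0.1, size=(16, 32))
w_proj = rng.normal(scale=0.1, size=(16, 32))
y_projected = block(x, w_wide) + x @ w_proj

print(y_identity.shape, y_projected.shape)
```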

In [13]:
# When feature map sizes are the same, using identity residual connections
###
# x = ...
# y = layers.Conv2D(128, 3, activation='relu', padding='same')(x)
# y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)
# y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)
# y = layers.add([y,x]) # Add the original x back to the output features

In [14]:
# When feature map sizes differ, using linear residual connection
###
# x = ...
# y = layers.Conv2D(128, 3, activation='relu', padding='same')(x)
# y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)
# y = layers.MaxPooling2D(2, strides=2)(y)
# residual = layers.Conv2D(128, 1, strides=2, padding='same')(x)
# y = layers.add([y, residual])

<b>Representational Bottlenecks in Deep Learning</b> - In a `Sequential` model, each successive representation layer is built on top of the previous one, which means it only has access to information contained in the activation of the previous layer. If one layer is too small, then the model will be constrained by how much information can be crammed into the activations of this layer. Residual connections, by reinjecting earlier information downstream, partially solve this issue for deep learning models.

<b>Vanishing Gradients</b> - Backpropagation works by propagating a feedback signal from the output loss down to earlier layers. If this feedback signal has to be propagated through a deep stack of layers, the signal may become weak or lost entirely, rendering the network untrainable. This is known as vanishing gradients.

This problem occurs both with deep networks and with recurrent networks over very long sequences: in both cases, a feedback signal must be propagated through a long series of operations. In recurrent networks, the `LSTM` layer addresses it with a carry track that propagates information parallel to the main processing track. Residual connections work in a similar way in feedforward deep networks, but they are even simpler: they introduce a purely linear information carry track parallel to the main layer stack, thus helping to propagate gradients through arbitrarily deep stacks of layers.
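
A scalar toy example shows the effect. It assumes a stack of identical one-unit layers `h -> tanh(w * h)` with a fixed weight (a simplification chosen for illustration) and multiplies the local derivatives along the stack, once without and once with a shortcut around each layer.

```python
import math


def input_gradient(depth, w=0.5, x=1.0, residual=False):
    """d(output)/d(input) for `depth` stacked scalar layers h -> tanh(w * h)."""
    h, grad = x, 1.0
    for _ in range(depth):
        out = math.tanh(w * h)
        local = w * (1.0 - out ** 2)   # derivative of tanh(w * h) with respect to h
        if residual:
            h = h + out                # shortcut: h_next = h + f(h)
            grad *= 1.0 + local        # the "+ 1" is contributed by the shortcut
        else:
            h = out
            grad *= local
    return grad


plain = input_gradient(50)
with_shortcut = input_gradient(50, residual=True)
print(plain, with_shortcut)
```

Without shortcuts every factor is at most `w`, so the product shrinks geometrically with depth. With shortcuts every factor is at least 1, so the signal reaching the first layer cannot vanish in this toy setting.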

<hr>

#### Multiple Inputs, Multiple Outputs

To combine multiple inputs with multiple outputs, extend the earlier two-input model with a second output layer. The snippet uses the `layers` and `models` modules imported at the top of this notebook.

```python
input_layera = layers.Input(shape=(10,))
input_layerb = layers.Input(shape=(7,))

hidden_layer1 = layers.Dense(30, activation='relu')(input_layerb)
hidden_layer2 = layers.Dense(30, activation='relu')(hidden_layer1)
concat_layer = layers.Concatenate()([input_layera, hidden_layer2])
output_layer1 = layers.Dense(1)(concat_layer)
output_layer2 = layers.Dense(1)(hidden_layer2) # Add this
model3 = models.Model(inputs=[input_layera, input_layerb],
                      outputs=[output_layer1, output_layer2]) # Change this
```

When compiling the model, you can use different metrics for different outputs by passing one list of metrics per output:

```python
model3.compile(optimizer='sgd', loss='mean_squared_error', metrics=[['mae'], ['mse']])
```

When evaluating the model, Keras returns the total loss as well as the individual losses. Targets must be supplied for every output; here the same `y_test` is assumed for both:
```python
model3.evaluate((x_testA, x_testB), (y_test, y_test))
```

<hr>

#### Layer Weight Sharing

There are other uses of the functional API. One is the ability to reuse a layer instance several times. When you call a layer instance twice, instead of instantiating a new layer for each call, you reuse the same weights with every call. This allows you to build models with shared branches: several branches that all share the same knowledge and perform the same operations. That is, they share the same representations and learn these representations simultaneously for different sets of inputs.

<b>Example 1</b> - Consider a model that assesses the semantic similarity between two sentences. The model has two inputs and outputs a score between 0 and 1, where 0 means no similarity and 1 means complete similarity.

Here, the two input sentences are interchangeable, so it wouldn't make sense to learn two independent models for processing each input sentence. Rather, you want to process both with a single LSTM layer. The representations of this LSTM layer are learned based on both inputs simultaneously. This is called the <b>Siamese LSTM</b> or <b>shared LSTM</b> model.
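
Sharing has a visible effect on model size. The arithmetic below uses the standard LSTM parameter count (four weight blocks: the input, forget, and output gates plus the cell candidate) for the sizes used in this example, 128-dimensional inputs and 32 units. `lstm_params` is a helper defined here for illustration.

```python
def lstm_params(input_dim, units):
    # Each of the four blocks has input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units + 1) * units)


shared = lstm_params(128, 32)      # one LSTM instance called on both inputs
separate = 2 * shared              # two independent LSTM instances
print(shared, separate)
```

The shared count equals the 20,608 parameters that `model.summary()` lists for the single `lstm` layer in this example, even though that layer is connected to both inputs.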

In [19]:
backend.clear_session()
lstm = layers.LSTM(32)

left_input = Input(shape=(None, 128))
left_output = lstm(left_input)
right_input = Input(shape=(None, 128))
right_output = lstm(right_input)

merged = layers.Concatenate()([left_output, right_output])
predictions = layers.Dense(1, activation='sigmoid')(merged)
model = models.Model([left_input, right_input], predictions)
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, None, 128)]  0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, None, 128)]  0                                            
__________________________________________________________________________________________________
lstm (LSTM)                     (None, 32)           20608       input_1[0][0]                    
                                                                 input_2[0][0]                    
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 64)           0           lstm[0][0]                   

<hr>

#### Models as Layers

Models can be used as you'd use layers. This means you can call a model on an input tensor and retrieve an output tensor:

```python
y = model(x)
```

If the model has multiple input tensors and multiple output tensors, it should be called with a list of tensors:
```python
y1, y2 = model([x1, x2])
```

When you call a model instance, you're reusing the weights of the model, exactly like what happens when you call a layer instance. Calling an instance, whether it's a layer instance or a model instance, always reuses the existing learned representations of the instance, which is intuitive.

We have seen something like this before, using the convolutional base of trained networks:

In [22]:
# xception_base = applications.Xception(weights=None, include_top=False)

# left_input = Input(shape=(250,250,3))
# right_input = Input(shape=(250,250,3))

# left_features = xception_base(left_input)
# right_features = xception_base(right_input)

# merged_features = layers.concatenate([left_features, right_features])

To wrap up, the functional API achieves the following:

- use complex architectures, beyond the `Sequential` model
- build architectures with multiple inputs or with multiple outputs
- reuse the weights of a layer or model across different processing branches, by calling the same layer or model instance several times

<hr>

### Building Dynamic Models Using the Subclassing API

For more flexibility, we can use the Subclassing API: subclass `Model`, create the layers you need in `__init__()`, and use them in `call()`.

This separates the creation of the layers from their usage.

In [None]:
class WideAndDeepModel(models.Model):
    def __init__(self, units=30, activation='relu', **kwargs):
        super().__init__(**kwargs)
        # Layers are created once here...
        self.hidden_layer1 = layers.Dense(units, activation=activation)
        self.hidden_layer2 = layers.Dense(units, activation=activation)
        self.concat_layer = layers.Concatenate()
        self.output_layer = layers.Dense(1)

    def call(self, inputs):
        # ...and used here, on every forward pass
        inputa, inputb = inputs
        hidden1 = self.hidden_layer1(inputb)
        hidden2 = self.hidden_layer2(hidden1)
        concat = self.concat_layer([inputa, hidden2])
        return self.output_layer(concat)
        

In [None]:
# Load & Train model
model3 = WideAndDeepModel(30, 'relu')
model3.compile(optimizer='sgd', loss='mean_squared_error', metrics=['mae'])
model3.fit((x_train__trainA, x_train__trainB), y_train__train, epochs=20,
           validation_data=((x_train__val_A, x_train__val_B), y_train__val), verbose=0)

In [None]:
# Evaluate & Predict
model3.evaluate((x_testA, x_testB), y_test)
model3.predict((x_testA[:2], x_testB[:2]))

### Saving & Restoring a Model

This is useful when models take a long time to train or when you need access to a previously trained model.

In [None]:
# Saving a model
# model1.save('model3.h5')

In [None]:
# Load & Predict
# model1ld = tf_keras.models.load_model('model3.h5')
# model1ld.predict((x_testA[10:15], x_testB[10:15]))

Additional Readings:

- (1)  https://ai.googleblog.com/2016/06/wide-deep-learning-better-together-with.html
- (2)  https://github.com/lutzroeder/Netron