------

### Keras Functional API

------

The Keras functional API lets us build graph-like models by calling layers as if they were Python functions. Keras callbacks and TensorBoard allow us to monitor the performance of a model and to use features such as early stopping (`EarlyStopping`) whenever necessary.

The chapter also discusses important research results in deep learning such as batch normalization layers, residual networks, hyperparameter tuning, and ensemble modelling.


Until now, all the models discussed have been Sequential. Sequential models are simple single-input, single-output stacks of layers. Here we discuss models that have multiple inputs or multiple outputs. Some models receive data from different sources and need to merge it (multimodal inputs).

For example, consider a scenario where we have to predict the cost of an item given some text, metadata, and an image. Training three separate models makes little sense here, because each would see only part of the information. Instead, the three branches can be combined by a merging module, and this merging module is trained together with them.

Also, many architectures, such as Google's Inception architecture, require several parallel branches whose outputs are merged into a single tensor.

There is also the ResNet architecture, which reinjects earlier information into the downstream flow by adding a past output tensor to a later output tensor. These architectures are discussed in more depth later in this chapter, so skip ahead if the connection is not clear yet.

---------------------------

With the functional API, layers can essentially be treated as functions: each layer takes a tensor and returns a tensor.

--------

***Introduction to Functional API***


---------

For the sake of simplicity, let's start with a model we already know how to write with the Sequential API and then rebuild it with the functional API. This helps us understand the syntax better. Here we first build a dense model using the Keras Sequential API.


In [7]:
from keras import layers
from keras.models import Sequential, Model

In [8]:
seq_model = Sequential()
seq_model.add(layers.Dense(32, activation = 'relu', input_shape = (64,)))
seq_model.add(layers.Dense(32, activation = 'relu'))
seq_model.add(layers.Dense(10, activation = 'softmax'))

In [9]:
seq_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 32)                2080      
_________________________________________________________________
dense_2 (Dense)              (None, 32)                1056      
_________________________________________________________________
dense_3 (Dense)              (None, 10)                330       
Total params: 3,466.0
Trainable params: 3,466
Non-trainable params: 0.0
_________________________________________________________________
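The parameter counts in the summary can be checked by hand. The following is a quick sketch, assuming each `Dense` layer stores one weight per input-output pair plus one bias per unit (the helper name `dense_params` is made up for this note):

```python
def dense_params(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units

# (inputs, units) for the three Dense layers defined above
layer_sizes = [(64, 32), (32, 32), (32, 10)]
counts = [dense_params(i, u) for i, u in layer_sizes]

print(counts)       # [2080, 1056, 330]
print(sum(counts))  # 3466
```
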


***Writing its functional equivalent***

In [10]:
input_tensor = layers.Input(shape=(64,))

In [14]:
x = layers.Dense(32, activation='relu')(input_tensor)
x = layers.Dense(32, activation='relu')(x)
output_tensor = layers.Dense(10, activation = 'softmax')(x)

In [15]:
model = Model(input_tensor, output_tensor)

In [16]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 64)                0         
_________________________________________________________________
dense_7 (Dense)              (None, 32)                2080      
_________________________________________________________________
dense_8 (Dense)              (None, 32)                1056      
_________________________________________________________________
dense_9 (Dense)              (None, 10)                330       
Total params: 3,466.0
Trainable params: 3,466.0
Non-trainable params: 0.0
_________________________________________________________________


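***Basically, a Model is a graph-like data structure that Keras builds from just the input and output tensors. This works because the output_tensor was obtained by repeatedly transforming the input_tensor, so Keras can retrieve every layer that lies between them.***

If we try to create a Model object from inputs and outputs that are not connected, Keras raises an error when the model is constructed (a `ValueError` reporting that the graph is disconnected).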
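A minimal sketch of the disconnected case, reusing the layer sizes from the example above. The name `unrelated_input` is introduced only for this illustration, and the exception is caught broadly because the exact class and message may differ between Keras versions:

```python
from keras import layers
from keras.models import Model

input_tensor = layers.Input(shape=(64,))
unrelated_input = layers.Input(shape=(32,))  # never feeds into the output below

x = layers.Dense(32, activation='relu')(input_tensor)
output_tensor = layers.Dense(10, activation='softmax')(x)

try:
    # output_tensor was not computed from unrelated_input,
    # so Keras cannot trace a path from this input to the output.
    Model(unrelated_input, output_tensor)
    error_name = None
except Exception as err:  # assumption: caught broadly to stay version-agnostic
    error_name = type(err).__name__
    print(error_name)
```

Building the model from `input_tensor` instead succeeds, since the output was derived from it.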

----

***Multi-input Models - Question-Answer System*** 



-----------

Multi-input models have to merge their input branches at some point, and different transformations can be used to do so. The merge is usually performed with *keras.layers.add* (element-wise sum of tensors of the same shape) or *keras.layers.concatenate* (joining tensors along an axis).
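The difference between the two merge operations is easiest to see on shapes. The sketch below uses plain NumPy arrays as stand-ins for the branch outputs; it is an analogy for what the Keras merge layers compute, not the Keras implementation itself, and the array names are made up for the example:

```python
import numpy as np

batch = 4
encoded_a = np.ones((batch, 32))        # stand-in for the output of one branch
encoded_b = np.full((batch, 32), 2.0)   # stand-in for the output of another branch

# add: element-wise sum, both inputs must have the same shape
added = encoded_a + encoded_b

# concatenate: the feature axes are placed side by side
joined = np.concatenate([encoded_a, encoded_b], axis=1)

print(added.shape)   # (4, 32)
print(joined.shape)  # (4, 64)
```
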



In [17]:
from keras.models import Model
from keras import layers

In [18]:
text_vocabulary_size = 10000
question_vocabulary_size = 10000
answer_vocabulary_size = 500

In [19]:
### text_input is a variable-length sequence of integers. Layers can be named using the name argument.
text_input = layers.Input(shape = (None,), dtype = 'int32', name = 'text')

In [20]:
### Embeds the input into a sequence of vectors of size 64.
### Embedding takes (input_dim, output_dim): vocabulary size first, vector size second.

embedded_text = layers.Embedding(text_vocabulary_size, 64)(text_input)

In [21]:
### Encodes the sequence of vectors into a single vector using an LSTM

encoded_data = layers.LSTM(32)(embedded_text)


***Same process for question part***

In [22]:
### question_input is also a variable-length sequence of integers.
question_input = layers.Input(shape = (None,), dtype = 'int32', name = 'question')

In [23]:
### Embeds the input into a sequence of vectors of size 32.

embedded_question = layers.Embedding(question_vocabulary_size, 32)(question_input)

In [24]:
### Encodes the sequence of vectors into a single vector using an LSTM

encoded_question = layers.LSTM(32)(embedded_question)

***Merging the encoded question and text data***

In [26]:
concatenated = layers.concatenate([encoded_data, encoded_question], axis = 1)

In [27]:
answer = layers.Dense(answer_vocabulary_size, activation = 'softmax')(concatenated)

In [28]:
model = Model([text_input, question_input], answer)

In [29]:
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
text (InputLayer)                (None, None)          0                                            
____________________________________________________________________________________________________
question (InputLayer)            (None, None)          0                                            
____________________________________________________________________________________________________
embedding_1 (Embedding)          (None, None, 64)      640000                                       
____________________________________________________________________________________________________
embedding_2 (Embedding)          (None, None, 32)      320000                                       
___________________________________________________________________________________________

In [31]:
model.compile(optimizer = 'rmsprop', loss = 'categorical_crossentropy', metrics = ['acc'])



***Feeding data to multi-input model***


In [32]:
import numpy as np

num_samples = 1000
max_length = 100

In [33]:
text = np.random.randint(1, text_vocabulary_size, size = (num_samples, max_length))

In [36]:
question = np.random.randint(1, question_vocabulary_size, size = (num_samples, max_length))

In [37]:
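### One-hot encoded answers: one random class index per sample, then one-hot encoding.
### Note that the upper bound of np.random.randint is exclusive.
from keras.utils import to_categorical

answers = np.random.randint(0, answer_vocabulary_size, size = (num_samples,))
answers = to_categorical(answers, num_classes = answer_vocabulary_size)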

In [38]:
model.fit([text, question], answers, epochs = 10, batch_size = 128)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f24585538d0>

In [39]:
### The inputs can also be passed as a dict keyed by input name (this requires named Input layers).

model.fit({'text': text, 'question': question}, answers, epochs = 10, batch_size = 128)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f24309a4240>

-----
***Multi-output models***

------


A simple example is a model that predicts several properties from the same data. For example, consider predicting the age, gender, and income group of a person from their social media posts.

In [20]:
from keras import layers
from keras.models import Model 

In [35]:
vocabulary_size = 10000
num_income_groups = 10

In [36]:
posts_input = layers.Input(shape = (None,), dtype = 'int32', name = 'posts')
### Embedding takes (input_dim, output_dim): vocabulary size first, vector size second.
embedded_posts = layers.Embedding(vocabulary_size, 256)(posts_input)

In [37]:
x = layers.Conv1D(256, 5, activation = 'relu')(embedded_posts)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(256, 5, activation = 'relu')(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(256, 5, activation = 'relu')(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation = 'relu')(x)

In [38]:
age_prediction = layers.Dense(1, name = 'age')(x)
income_prediction = layers.Dense(num_income_groups, 
                                activation = 'softmax', 
                                name = 'income')(x)
gender_prediction = layers.Dense(1, activation = 'sigmoid', name = 'gender')(x)

In [39]:
model = Model(posts_input, [age_prediction, income_prediction, gender_prediction])

In [40]:
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
posts (InputLayer)               (None, None)          0                                            
____________________________________________________________________________________________________
embedding_3 (Embedding)          (None, None, 256)     2560000                                      
____________________________________________________________________________________________________
conv1d_15 (Conv1D)               (None, None, 256)     327936                                       
____________________________________________________________________________________________________
max_pooling1d_7 (MaxPooling1D)   (None, None, 256)     0                                            
___________________________________________________________________________________________

***Each output needs its own loss function, because the output layers solve different tasks***

In [41]:
model.compile(optimizer = 'rmsprop', loss = ['mse','categorical_crossentropy', 'binary_crossentropy'])

The model can also be compiled with a dict of losses (make sure every output layer has a name if you use a dict):

```python 
model.compile(optimizer = 'rmsprop', 
              loss = { 'age' : 'mse',
                      'income' : 'categorical_crossentropy',
                      'gender' : 'binary_crossentropy' })
```

Since the three outputs focus on different tasks, their losses can have very different scales. The combined loss is then dominated by the task with the largest individual loss, and the shared representations are optimized preferentially for that task at the expense of the others. Loss weighting avoids this.

The MSE loss for age is usually larger than the binary cross-entropy loss for gender, so appropriate weighting is necessary.


In [45]:
model.compile(optimizer = 'rmsprop', 
              loss = { 'age' : 'mse',
                      'income' : 'categorical_crossentropy',
                      'gender' : 'binary_crossentropy' },
              loss_weights = { 'age' : 0.025,
                              'income' : 1.,
                              'gender' : 10. })
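To see what the weights do: the quantity minimized during training is the weighted sum of the individual losses. The sketch below uses the weights from the cell above, but the per-output loss values are hypothetical numbers chosen only to illustrate the difference in scale:

```python
# Hypothetical per-output losses (not measured values)
losses = {'age': 4.0, 'income': 2.3, 'gender': 0.7}

# Weights taken from the compile call above
loss_weights = {'age': 0.025, 'income': 1.0, 'gender': 10.0}

# Contribution of each output to the total loss
weighted = {name: loss_weights[name] * value for name, value in losses.items()}
total_loss = sum(weighted.values())

for name in losses:
    print(name, round(weighted[name], 3))
print('total', round(total_loss, 3))
```

With these numbers the age loss, which is the largest before weighting, ends up contributing the least, while the small gender loss is scaled up.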

In [53]:
### Some random data to test dimensionality 

from keras.utils import to_categorical
import numpy as np

num_samples = 1000
max_length = 200

posts = np.random.randint(1, vocabulary_size, size = (num_samples, max_length))
age_targets = np.random.random(size = (num_samples, ))

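### The upper bound of np.random.randint is exclusive, so this draws classes 0..9
income_targets = np.random.randint(0, num_income_groups, size = (num_samples, ))
income_targets = to_categorical(income_targets, num_classes = num_income_groups)

### Binary targets: 0 or 1
gender_targets = np.random.randint(0, 2, size = (num_samples, ))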

In [54]:
### The data can be fed in following way

model.fit(posts, [age_targets, income_targets, gender_targets], epochs = 10, batch_size = 64)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f30779bd358>

-----------

***Directed Acyclic Graphs of layers***

-----

With the Keras functional API we can also build directed acyclic graphs of layers. Several architectures rely on complex internal topologies that can only be expressed with the functional API. Two important building blocks of this kind are the Inception module and residual connections.

------

Inception Module

-------

Inception is a popular architecture for convolutional networks inspired by the earlier network-in-network architecture. It consists of a stack of modules that themselves look like small independent networks, split into several parallel branches. The most basic Inception module has 3-4 branches, each starting with a 1x1 convolution, followed by a 3x3 convolution, and ending with the concatenation of the resulting features.

This setup helps the network learn spatial features and channel-wise features separately, which is more efficient than learning them jointly. More complex versions of the module, involving pooling operations and convolutions of different spatial sizes, have also been used.

---------

***Purpose of 1x1 Convolution***

-------

The purpose of using bigger kernels is to encode spatial information in the weights during training. A 1x1 kernel does not look at spatial neighbourhoods at all and instead focuses on the correlation between channels. Suppose the input has 64 channels and we use a single 1x1 kernel. What we are essentially doing is taking, at each pixel, the slice of 64 values across all channels, passing it through a single neuron, and applying the activation function. In this way the layer learns relations between different channels. This helps the model a lot, as channels are often highly correlated.
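The "single neuron applied at every pixel" view can be checked numerically. The sketch below uses NumPy rather than Keras, with arbitrary sizes and random values chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(5, 5, 64))  # height, width, channels (arbitrary sizes)
weights = rng.normal(size=(64,))           # one 1x1 kernel: one weight per input channel
bias = 0.1

# 1x1 convolution with ReLU: a weighted sum over channels at every spatial position
conv_output = np.maximum(feature_map @ weights + bias, 0.0)

# The same computation for one pixel, treating the kernel as a single dense neuron
pixel = feature_map[2, 3]                  # the 64 channel values at position (2, 3)
neuron_output = max(float(pixel @ weights + bias), 0.0)

print(conv_output.shape)                               # (5, 5)
print(np.isclose(conv_output[2, 3], neuron_output))    # True
```
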

A 1x1 convolution is also used to decrease the number of parameters and increase the training speed of the model, since it is a cheap way of reducing the number of channels (dimensionality reduction) before a more expensive convolution.
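A rough count shows the saving. The channel sizes below (256 in, 64 after the reduction, 128 out) are assumptions picked for illustration, and `conv2d_params` is a helper written for this note:

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Weights of a square 2D convolution plus one bias per filter."""
    return kernel_size * kernel_size * in_channels * filters + filters

# A 3x3 convolution applied directly to 256 channels
direct = conv2d_params(3, 256, 128)

# A 1x1 convolution reducing 256 channels to 64, then the same 3x3 convolution
reduced = conv2d_params(1, 256, 64) + conv2d_params(3, 64, 128)

print(direct)   # 295040
print(reduced)  # 90304
```
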

***Inception model can be implemented in following way*** 

Assume `x` is a 4D input tensor (batch, height, width, channels).
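```python
from keras import layers

# x is assumed to be a 4D feature-map tensor: (batch, height, width, channels).
# padding = 'same' keeps the spatial sizes of all branches equal so they can be concatenated.

# Branch a: a single strided 1x1 convolution
branch_a = layers.Conv2D(128, 1, activation = 'relu', strides = 2, padding = 'same')(x)

# Branch b: 1x1 convolution followed by a strided 3x3 convolution
branch_b = layers.Conv2D(128, 1, activation = 'relu')(x)
branch_b = layers.Conv2D(128, 3, activation = 'relu', strides = 2, padding = 'same')(branch_b)

# Branch c: the striding happens in the average pooling layer
branch_c = layers.AveragePooling2D(3, strides = 2, padding = 'same')(x)
branch_c = layers.Conv2D(128, 3, activation = 'relu', padding = 'same')(branch_c)

# Branch d: 1x1 convolution followed by two 3x3 convolutions, the last one strided
branch_d = layers.Conv2D(128, 1, activation = 'relu')(x)
branch_d = layers.Conv2D(128, 3, activation = 'relu', padding = 'same')(branch_d)
branch_d = layers.Conv2D(128, 3, activation = 'relu', strides = 2, padding = 'same')(branch_d)

# Concatenate along the channel axis (the last axis in the default channels_last format)
output = layers.concatenate([branch_a, branch_b, branch_c, branch_d], axis = -1)
```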
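Concatenating branches along the channel axis only works if every branch produces the same height and width. The sketch below applies the standard output-size formulas for `'valid'` and `'same'` padding to a four-branch layout in which each branch downsamples once by a factor of 2. The input size of 32 and the helper names are assumptions made for this example:

```python
import math

def out_size(size, kernel, stride=1, padding='valid'):
    """Spatial output size of a convolution or pooling layer."""
    if padding == 'same':
        return math.ceil(size / stride)
    return (size - kernel) // stride + 1

def branch_sizes(size, padding):
    # a: strided 1x1 convolution
    a = out_size(size, 1, 2, padding)
    # b: 1x1 convolution, then strided 3x3 convolution
    b = out_size(out_size(size, 1, 1, padding), 3, 2, padding)
    # c: strided 3x3 pooling, then 3x3 convolution
    c = out_size(out_size(size, 3, 2, padding), 3, 1, padding)
    # d: 1x1 convolution, 3x3 convolution, then strided 3x3 convolution
    d = out_size(out_size(out_size(size, 1, 1, padding), 3, 1, padding), 3, 2, padding)
    return [a, b, c, d]

print(branch_sizes(32, 'valid'))  # [16, 15, 13, 14]
print(branch_sizes(32, 'same'))   # [16, 16, 16, 16]
```

With `'valid'` padding the branches disagree on the spatial size and cannot be concatenated, whereas `'same'` padding makes them line up.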

-------

***Xception***

-----

Xception is an extreme version of the Inception model. It takes the idea of separating spatial and channel-wise features to its logical extreme and replaces the Inception modules with depthwise separable convolutions, each consisting of a depthwise convolution (a spatial convolution where every channel is handled separately) followed by a 1x1 (pointwise) convolution. The model has approximately the same number of parameters as the original Inception model but performs better on ImageNet because it uses its parameters more efficiently.
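The parameter efficiency of a depthwise separable convolution can be seen from a simple weight count (biases left out). The channel sizes are assumptions chosen for illustration and both helpers are written for this note:

```python
def conv2d_weights(kernel_size, in_channels, filters):
    """Regular convolution: every filter spans all input channels."""
    return kernel_size * kernel_size * in_channels * filters

def separable_conv2d_weights(kernel_size, in_channels, filters):
    """Depthwise step (one spatial kernel per channel) plus 1x1 pointwise step."""
    depthwise = kernel_size * kernel_size * in_channels
    pointwise = in_channels * filters
    return depthwise + pointwise

regular = conv2d_weights(3, 128, 256)
separable = separable_conv2d_weights(3, 128, 256)

print(regular)    # 294912
print(separable)  # 33920
```
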

-------------

***Residual Connections***

----

Residual connections, introduced with the ResNet architecture, were a breakthrough in computer vision because they tackle two common problems that plague any large-scale deep learning model: vanishing gradients and representational bottlenecks. In general, it is beneficial to add residual connections to any network that is 10 or more layers deep.

A residual connection makes the output of an earlier layer available as input to a later layer, effectively creating a shortcut in a sequential network. **Rather than being concatenated to the later activation, the earlier output is summed with the later activation, which assumes that both activations are the same size.** If they have different shapes, the earlier activation can be reshaped with a linear transformation: a Dense layer without activation or, for convolutional feature maps, a 1x1 convolution without activation. The following basic block shows how residual connections can be implemented.

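```python 
from keras import layers

# x is assumed to be a 4D tensor with 128 channels, so it can be summed with y
x = ...
y = layers.Conv2D(128, 3, activation = 'relu', padding = 'same')(x)
y = layers.Conv2D(128, 3, activation = 'relu', padding = 'same')(y)
y = layers.Conv2D(128, 3, activation = 'relu', padding = 'same')(y)

# Element-wise sum of the original x and the transformed features (add takes no axis argument)
y = layers.add([y, x])
```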

The following snippet covers the case where the shapes of the two tensors differ.


```python 
from keras import layers

x = ...
y = layers.Conv2D(128, 3, activation = 'relu', padding = 'same')(x)
y = layers.Conv2D(128, 3, activation = 'relu', padding = 'same')(y)
y = layers.MaxPooling2D(2, strides = 2)(y)

### A strided 1x1 convolution without activation downsamples x to the shape of y.
residual = layers.Conv2D(128, 1, strides = 2, padding = 'same')(x)

### Element-wise sum of the two tensors (add takes no axis argument)
y = layers.add([y, residual])
```

***Representational Bottlenecks*** - In a sequential model, each successive representation layer is built on top of the previous one, which means that it only has access to the information contained in the activations of the previous layer. **If one layer is too small, the model is constrained by how much information can be crammed into the activations of this layer.** Residual connections pass the earlier information on to later layers even if there is a representational bottleneck in between.

***Vanishing Gradients*** - Backpropagation, the backbone of model training, works by propagating a feedback signal from the output loss down to earlier layers. If the feedback signal has to be propagated through a deep stack of layers, it may become tenuous or even be lost entirely, rendering the network untrainable. Residual connections act as a parallel shortcut along which the gradients can propagate.
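A toy calculation makes the effect visible. It assumes a chain of scalar "layers" that each scale the gradient by the same hypothetical factor, which is a deliberate simplification rather than a model of a real network. Without a shortcut the factors multiply; with a residual connection `y = x + f(x)`, each layer contributes `1 + f'(x)` instead:

```python
depth = 20
local_gradient = 0.1  # hypothetical derivative of each layer's transformation f

# Plain stack: the chain rule multiplies the local gradients together
plain_gradient = local_gradient ** depth

# Residual stack: the identity path adds 1 to every local gradient
residual_gradient = (1.0 + local_gradient) ** depth

print(plain_gradient)     # vanishingly small
print(residual_gradient)  # stays above 1
```
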

-------

***Layer Weight Sharing***

-----

One of the important features of the functional API is the ability to reuse a layer instance several times. When we call a layer twice without instantiating a new object, we are working with the same layer and the same weights in both calls. This means the layer can be shared by different branches.

Example application: suppose we want to compute a similarity score between two sentences. The inputs are two sentences and the output is a score that measures how semantically similar they are. Semantic similarity is symmetric, i.e. the output should be the same whether the input is (A, B) or (B, A). So instead of processing the two inputs with separate LSTM instances, the same LSTM layer can be shared to learn the representations of both. This is called a **Siamese LSTM model** or **shared LSTM**.

Python pseudo code: 

```python
from keras import layers
from keras.models import Model

# Instantiate the LSTM layer once
lstm = layers.LSTM(32)

left_input = layers.Input(shape = (None, 128))
left_output = lstm(left_input)

# Calling the same instance again reuses its weights
right_input = layers.Input(shape = (None, 128))
right_output = lstm(right_input)

merged = layers.concatenate([left_output, right_output], axis = 1)
predictions = layers.Dense(1, activation = 'sigmoid')(merged)

model = Model([left_input, right_input], predictions)

# left_data, right_data and targets are placeholders for your own arrays.
# The model has to be compiled (model.compile) before fit is called.
model.fit([left_data, right_data], targets)
```



--------

***Models as layers***

-------

Models can also be treated as layers. Calling a model on a tensor returns an output tensor, which can in turn be passed to another layer or model. Everything discussed for layers translates directly to models: the weights are reused on every call, and a model can take multiple inputs and return multiple outputs.
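A minimal sketch of a model used as a layer. The small `encoder` model, its sizes, and all variable names are assumptions introduced for this example. The encoder is called on two inputs, and because both calls share the same weights, its parameters are counted only once in the outer model:

```python
from keras import layers
from keras.models import Model

# A small model that will be reused like a layer
encoder_input = layers.Input(shape=(16,))
encoder_output = layers.Dense(8, activation='relu')(encoder_input)
encoder = Model(encoder_input, encoder_output)

# Call the same model instance on two different inputs
left_input = layers.Input(shape=(16,))
right_input = layers.Input(shape=(16,))
merged = layers.concatenate([encoder(left_input), encoder(right_input)], axis=-1)
prediction = layers.Dense(1, activation='sigmoid')(merged)

siamese = Model([left_input, right_input], prediction)

encoder_params = 16 * 8 + 8   # shared Dense layer inside the encoder
head_params = 16 * 1 + 1      # final Dense layer on the 16 concatenated features
print(siamese.count_params() == encoder_params + head_params)  # True
```
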
