In [0]:
%tensorflow_version 1.x

TensorFlow 1.x selected.


In [0]:
!pip install keras==2.0.8


## 7.1.1 Introduction to the functional API

In [0]:
from keras import Input, layers

input_tensor = Input(shape=(32, ))
dense = layers.Dense(32, activation='relu')
output_tensor = dense(input_tensor)

Using TensorFlow backend.







In [0]:
from keras.models import Sequential, Model
from keras import layers
from keras import Input

# a model
seq_model = Sequential()
seq_model.add(layers.Dense(32, activation='relu', input_shape=(64, )))
seq_model.add(layers.Dense(32, activation='relu'))
seq_model.add(layers.Dense(10, activation='softmax'))

# its equivalent
input_tensor = Input(shape=(64, ))
x = layers.Dense(32, activation='relu')(input_tensor)
x = layers.Dense(32, activation='relu')(x)
output_tensor = layers.Dense(10, activation='softmax')(x)

# the Model class creates a model from an input tensor and an output tensor
model = Model(input_tensor, output_tensor)

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         (None, 64)                0         
_________________________________________________________________
dense_11 (Dense)             (None, 32)                2080      
_________________________________________________________________
dense_12 (Dense)             (None, 32)                1056      
_________________________________________________________________
dense_13 (Dense)             (None, 10)                330       
Total params: 3,466
Trainable params: 3,466
Non-trainable params: 0
_________________________________________________________________


Let's see what happens when we try to create a model from two unrelated tensors:

In [0]:
unrelated_input = Input(shape=(32, ))
bad_model = Model(unrelated_input, output_tensor)

RuntimeError: ignored

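Keras couldn't connect the two tensors: `output_tensor` was not computed from `unrelated_input`, so there is no path of layers leading from one to the other.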
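
One way to picture the failure: a functional model can only be built if its output can be traced back to the given input through the layers that produced it. The toy sketch below is not how Keras implements this check; the `parents` dictionary and the `depends_on` helper are hypothetical names used only to illustrate the idea.

```python
# Toy dependency graph: each tensor name maps to the tensors it was computed from.
# These names are illustrative stand-ins, not real Keras objects.
parents = {
    "input_tensor": [],
    "hidden_1": ["input_tensor"],
    "hidden_2": ["hidden_1"],
    "output_tensor": ["hidden_2"],
    "unrelated_input": [],
}

def depends_on(tensor, source):
    """Return True if `tensor` can be traced back to `source`."""
    if tensor == source:
        return True
    return any(depends_on(parent, source) for parent in parents[tensor])

print(depends_on("output_tensor", "input_tensor"))     # True
print(depends_on("output_tensor", "unrelated_input"))  # False
```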

Compiling, training, and evaluating the model works the same way as with `Sequential`:

In [0]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

import numpy as np
x_train = np.random.random((1000, 64))
y_train = np.random.random((1000, 10))

model.fit(x_train, y_train, 
          epochs=10, 
          batch_size=128)

score = model.evaluate(x_train, y_train)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
  32/1000 [..............................] - ETA: 1s

## 7.1.2 Multi-input models

To merge the different branches of a multi-input model, we can use Keras merge operations such as `keras.layers.add` and `keras.layers.concatenate`.
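
To make the difference between the two merge operations concrete, here is a NumPy-only sketch of what they do to array shapes. It is an analogy rather than Keras code, and the array names and sizes are made up for illustration.

```python
import numpy as np

# Two branch outputs for a batch of 4 samples (shapes chosen for illustration).
branch_a = np.ones((4, 32))
branch_b = np.full((4, 32), 2.0)
branch_c = np.zeros((4, 16))

# Element-wise addition needs identical shapes and keeps that shape.
added = branch_a + branch_b
# Concatenation along the last axis only needs the other axes to match.
concatenated = np.concatenate([branch_a, branch_c], axis=-1)

print(added.shape)         # (4, 32)
print(concatenated.shape)  # (4, 48)
```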

Let's create a *question-answering model* with two inputs and one output. The inputs are a question and a text snippet that provides the information needed to answer it. In the simplest setup, the answer is a single word, obtained via a softmax activation.

### L7.1 Functional API implementation of a two-input question-answering model

In [0]:
from keras.models import Model
from keras import layers
from keras import Input

text_vocabulary_size = 10000
question_vocabulary_size = 10000
answer_vocabulary_size = 500

text_input = Input(shape=(None, ), dtype='int32', name='text')
embedded_text = layers.Embedding(text_vocabulary_size, 64)(text_input) # Embedding(input_dim, output_dim)
encoded_text = layers.LSTM(32)(embedded_text)

question_input = Input(shape=(None,), dtype='int32', name='question')
embedded_question = layers.Embedding(question_vocabulary_size, 32)(question_input)
encoded_question = layers.LSTM(16)(embedded_question)

concatenated = layers.concatenate([encoded_text, encoded_question], axis=-1)

answer = layers.Dense(answer_vocabulary_size, activation='softmax')(concatenated)

model = Model([text_input, question_input], answer)
model.compile(optimizer='rmsprop', 
              loss='categorical_crossentropy', 
              metrics=['acc'])

### L7.2 Feeding data to a multi-input model

In [0]:
import numpy as np
import keras

num_samples = 1000
max_length = 100

text = np.random.randint(1, text_vocabulary_size, size=(num_samples, max_length))

question = np.random.randint(1, 
                             question_vocabulary_size, 
                             size=(num_samples, max_length))

# answers = np.random.randint(0, 1, size=(num_samples, answer_vocabulary_size))
answers = np.random.randint(answer_vocabulary_size, size=(num_samples))
answers = keras.utils.to_categorical(answers, answer_vocabulary_size)

model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
text (InputLayer)                (None, None)          0                                            
____________________________________________________________________________________________________
question (InputLayer)            (None, None)          0                                            
____________________________________________________________________________________________________
embedding_5 (Embedding)          (None, None, 64)      640000      text[0][0]                       
____________________________________________________________________________________________________
embedding_6 (Embedding)          (None, None, 32)      320000      question[0][0]                   
___________________________________________________________________________________________

In [0]:
# one way to train the model by using a list of inputs
print('training with the first method: giving a list of inputs')
model.fit([text, question], answers, 
          epochs=10, 
          batch_size=128)

# second way is to give a dictionary (this only works if inputs are named using `name=`)
print()
print('training with the second method: giving a dictionary')
model.fit({'text': text, 'question': question}, answers,
          epochs=10, 
          batch_size=128)

training with the first method: giving a list of inputs
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

training with the second method: giving a dictionary
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f184dab5a58>
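
The targets in listing 7.2 are integer class indices converted to one-hot vectors with `keras.utils.to_categorical`, which is the format `categorical_crossentropy` expects. As a rough NumPy-only sketch of that encoding (the small `num_classes` and `labels` values are made up for readability):

```python
import numpy as np

num_classes = 5                   # small value for readability
labels = np.array([0, 3, 1, 4])   # integer class indices

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype="float32")[labels]

print(one_hot.shape)  # (4, 5)
print(one_hot[1])     # [0. 0. 0. 1. 0.]
```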

## 7.1.3 Multi-output models

We can also create models with multiple outputs. For example, we could predict multiple properties from a single data input.    
Let's see an example where we predict attributes of a person, such as age, gender, and income level, from social media posts. 

### L7.3 Functional API implementation of a three-output model

```python
from keras import layers
from keras import Input
from keras.models import Model

vocabulary_size = 50000
num_income_groups = 10

posts_input = Input(shape=(None, ), dtype='int32', name='posts')
embedded_posts = layers.Embedding(vocabulary_size, 256)(posts_input)  # Embedding(input_dim, output_dim)

x = layers.Conv1D(128, 5, activation='relu')(embedded_posts)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation='relu')(x)

age_prediction = layers.Dense(1, 
                              name='age')(x)
income_prediction = layers.Dense(num_income_groups, 
                                 activation='softmax', 
                                 name='income')(x)
gender_prediction = layers.Dense(1, 
                                 activation='sigmoid', 
                                 name='gender')(x)

model = Model(posts_input, [age_prediction, income_prediction, gender_prediction])  
```                              

An important difference when training a model with multiple outputs is how the loss is specified. Each output needs its own loss function, chosen according to its task (for instance, regression for age and binary classification for gender). But gradient descent requires a single *scalar* value to minimize, so the individual losses must be combined, and the simplest way is to sum them. In Keras, you pass a list or a dictionary of losses to `compile` to assign a different loss to each output. Internally, the resulting loss values are summed into a single global loss, which is minimized during training.
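
The sketch below shows the idea of combining per-output losses into one scalar, using NumPy only. The loss functions are simplified stand-ins written for this example, not the Keras implementations, and the targets and predictions are made-up values.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=-1))

# Made-up targets and predictions for two samples.
age_true, age_pred = np.array([30.0, 45.0]), np.array([28.0, 50.0])
income_true = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
income_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
gender_true, gender_pred = np.array([1.0, 0.0]), np.array([0.9, 0.2])

losses = {
    "age": mse(age_true, age_pred),
    "income": categorical_crossentropy(income_true, income_pred),
    "gender": binary_crossentropy(gender_true, gender_pred),
}
total_loss = sum(losses.values())  # the single scalar that would be minimized
print(losses, total_loss)
```

Note how the regression loss is much larger than the two classification losses in this example, so it dominates the sum.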


### L7.4 Compilation options of a multi-output model: multiple losses
First method: Giving a list of loss functions:
```python
model.compile(optimizer='rmsprop',
              loss=['mse', 'categorical_crossentropy', 'binary_crossentropy'])
```

Second method: Giving a dictionary of loss functions keyed by output layer name (possible only if the output layers are named):
```python
model.compile(optimizer='rmsprop', 
              loss={'age': 'mse', 
                    'income': 'categorical_crossentropy', 
                    'gender': 'binary_crossentropy'})               
```

It is also important to consider the range of values each loss takes: very imbalanced loss contributions cause the shared representations to be optimized mostly for the task with the largest loss, at the expense of the others. To counter this, you can weight the losses so that each contributes on a comparable scale to the total.
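
Here is the arithmetic behind loss weighting as a plain-Python sketch. The per-output loss values are hypothetical, chosen only to show how weights can bring the contributions onto the same scale; real values depend on the data and the model.

```python
# Hypothetical per-output loss values; real values depend on the data and model.
losses = {"age": 4.0, "income": 1.0, "gender": 0.1}
loss_weights = {"age": 0.25, "income": 1.0, "gender": 10.0}

unweighted_total = sum(losses.values())
weighted = {name: loss_weights[name] * value for name, value in losses.items()}
weighted_total = sum(weighted.values())

# Share of the total contributed by each output, before and after weighting.
for name in losses:
    print(name, losses[name] / unweighted_total, weighted[name] / weighted_total)
```

With these assumed values, the age loss accounts for most of the unweighted total, while after weighting each output contributes one third.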

### L7.5 Compilation options of a multi-output model: loss weighting
List method:
```python
model.compile(optimizer='rmsprop', 
              loss=['mse', 'categorical_crossentropy', 'binary_crossentropy'], 
              loss_weights=[0.25, 1., 10.])
```
Dictionary method:
```python
model.compile(optimizer='rmsprop',
              loss={'age': 'mse',
                    'income': 'categorical_crossentropy',
                    'gender': 'binary_crossentropy'},
              loss_weights={'age': 0.25,
                            'income': 1.,
                            'gender': 10.})
```

For training, you can pass NumPy arrays of targets to the model as a list or as a dictionary, just as with multi-input models:

### L7.6 Feeding data to a multi-output model

List method:
```python
model.fit(posts, [age_targets, income_targets, gender_targets],
          epochs=10, batch_size=64)
```
Dictionary method:
```python
model.fit(posts, {'age': age_targets,
                  'income': income_targets,
                  'gender': gender_targets},
          epochs=10, batch_size=64)
```

## 7.1.4 Directed acyclic graphs of layers

### Inception modules

This is a popular type of network architecture for convolutional neural networks, developed by Christian Szegedy and colleagues at Google in 2013-2014 [(publication)](https://arxiv.org/abs/1409.4842). It is inspired by the *network-in-network* architecture [(publication)](https://arxiv.org/abs/1312.4400). 

An Inception network consists of a stack of modules that themselves look like small networks, each split into several parallel branches. The branches use 1x1 convolutions, 3x3 convolutions, average pooling, and so on, and they end with a concatenation of the resulting features. This setup helps the network learn channel-wise and spatial features separately, which is more efficient than learning them jointly. Modules can vary in complexity.

#### *The purpose of 1x1 convolutions (pointwise convolutions)*  
A 1x1 convolution is equivalent to running the channel vector at each tile (spatial position) through a `Dense` layer: it mixes information across the channels of the input tensor, but it doesn't mix information across space.
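
This equivalence can be checked numerically. The following NumPy-only sketch (no bias term, random made-up shapes) computes a 1x1 convolution as a per-position channel mix and compares it with a dense matrix product applied to every position independently.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 5, 3))   # (batch, height, width, channels)
kernel = rng.normal(size=(3, 4))    # maps 3 input channels to 4 output channels

# 1x1 convolution: the same channel-mixing matrix is applied at every position.
conv_1x1 = np.einsum("bhwc,cf->bhwf", x, kernel)

# Dense applied to each spatial position independently.
dense_per_position = (x.reshape(-1, 3) @ kernel).reshape(2, 5, 5, 4)

print(np.allclose(conv_1x1, dense_per_position))  # True
```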

There is another model called [Xception](https://arxiv.org/abs/1610.02357) (extreme inception), which takes the idea of separating the learning of spatial and channel-wise features to its logical extreme. It has roughly the same number of parameters as Inception V3, but it shows better runtime performance and higher accuracy on ImageNet and other large-scale datasets.

### Residual Connections

Residual connections are a common component of many post-2015 network architectures. They were introduced by [He et al.](https://arxiv.org/abs/1512.03385) They tackle two common problems in deep-learning models: vanishing gradients and representational bottlenecks. In general, they are beneficial in models with more than 10 layers.

A residual connection consists of making the output of an earlier layer available as input to a later layer.

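The earlier output is summed with the later activation to produce a single activation, which assumes both have the same size. If they do not, you can use a linear transformation to reshape the earlier activation into the target shape (for example, a `Dense` layer without an activation or, for convolutional feature maps, a 1x1 convolution without an activation).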

Here's an example when they are the same size (assuming a 4D input tensor `x`):

```python
from keras import layers

x = ...
y = layers.Conv2D(128, 3, activation='relu', padding='same')(x)
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)

y = layers.add([y, x])
```

And now a residual connection when the feature-map sizes differ:

```python
from keras import layers

x = ...
y = layers.Conv2D(128, 3, activation='relu', padding='same')(x)
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)
y = layers.MaxPooling2D(2, strides=2)(y)

residual = layers.Conv2D(128, 1, strides=2, padding='same')(x) # Uses a 1x1 conv. to linearly downsample 
                                                               # the original x tensor to the same shape as y

y = layers.add([y, residual])
```

### Important concepts
### - Representational bottlenecks in deep learning:  
This concept can be understood in the context of a `Sequential` stack of `Dense` layers. Each layer only has access to the activations of the previous one, so a layer with too few units can represent less information than a wider layer. This limits what reaches the following layers, because information lost at one layer can never be recovered afterwards. Residual connections partially solve this issue in deep-learning models by reinjecting earlier information further down the network.
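
A rough NumPy illustration of the bottleneck, assuming purely linear layers (the array names and sizes are made up for this sketch): once the data passes through a 2-unit layer, later layers see at most a rank-2 signal, while adding the original input back keeps the remaining information available.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 8))        # 100 samples with 8 features
w_narrow = rng.normal(size=(8, 2))   # bottleneck layer with only 2 units
w_wide = rng.normal(size=(2, 8))     # later layer, back to 8 units

h = x @ w_narrow @ w_wide            # output after passing through the bottleneck

rank_plain = np.linalg.matrix_rank(h)         # limited by the 2-unit layer
rank_residual = np.linalg.matrix_rank(x + h)  # input reinjected by addition

print(rank_plain, rank_residual)
```
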
### - Vanishing gradients in deep learning:  
Backpropagation is the core algorithm used to train neural networks. It works by propagating a feedback signal from the output back to the earlier layers. Because this signal is a gradient that must pass through every intermediate layer, it may become very faint or effectively zero after many layers, which means the earlier layers cannot be trained.  
This vanishing signal occurs in very deep networks as well as in recurrent neural networks over very long sequences. LSTM layers address it with a carry track that propagates information in parallel to the main processing track. Residual connections work similarly but in a simpler way: they introduce a linear information carry track parallel to the main layer stack, thus helping to propagate gradients through arbitrarily deep stacks of layers.
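
As a toy illustration of why the linear carry track helps, consider a stack of scalar linear layers. This is a deliberately simplified sketch, and the `depth` and `weight` values are assumptions: for a plain layer `y = weight * x` the gradient is multiplied by `weight` at every layer, while for a residual layer `y = x + weight * x` it is multiplied by `1 + weight`.

```python
depth = 20
weight = 0.1  # assumed small per-layer slope

plain_grad = 1.0
residual_grad = 1.0
for _ in range(depth):
    plain_grad *= weight            # layer: y = weight * x
    residual_grad *= 1.0 + weight   # layer: y = x + weight * x

print(plain_grad)     # shrinks toward zero as depth grows
print(residual_grad)  # never drops below 1 for a non-negative weight
```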

## 7.1.5 Layer weight sharing

Another important feature of the functional API is the ability to reuse a layer instance. When you call the same layer instance twice, you reuse the same weights on every call instead of creating a new layer. This allows you to build models with shared branches, i.e. branches that share and learn the same representations simultaneously for different sets of inputs.

For example, a model that assesses the semantic similarity between two sentences can use the same layer to process both input sentences in parallel. We can use a single LSTM layer in what is called a *Siamese* LSTM model, or *shared* LSTM.

Let's look at the implementation:

```python
from keras import layers
from keras import Input
from keras.models import Model

lstm = layers.LSTM(32) # We instantiate a single LSTM layer, once

left_input = Input(shape=(None, 128))
left_output = lstm(left_input)

right_input = Input(shape=(None, 128))
right_output = lstm(right_input) # We call the same layer

# We build the classifier on top
merged = layers.concatenate([left_output, right_output], axis=-1)
predictions = layers.Dense(1, activation='sigmoid')(merged)

# We instantiate and train the model 
# The weights of the LSTM layer are updated based on both inputs
model = Model([left_input, right_input], predictions)
model.fit([left_data, right_data], targets)
```
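
To see what "updated based on both inputs" means, here is a NumPy-only sketch with a toy linear layer standing in for the LSTM (all names and shapes are made up). Because the same weights process both inputs, the gradient with respect to those weights is the sum of one contribution per branch, which the code checks against finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
left = rng.normal(size=(5, 4))
right = rng.normal(size=(5, 4))
w = rng.normal(size=4)  # shared weights of a toy linear "layer"

def loss(weights):
    # The same weights process both inputs.
    return np.sum(left @ weights) + np.sum(right @ weights)

# Analytic gradient: one contribution per branch, summed.
grad_left = left.sum(axis=0)
grad_right = right.sum(axis=0)
shared_grad = grad_left + grad_right

# Numerical check with central finite differences.
eps = 1e-6
numeric = np.array([
    (loss(w + eps * np.eye(4)[i]) - loss(w - eps * np.eye(4)[i])) / (2 * eps)
    for i in range(4)
])
print(np.allclose(shared_grad, numeric, atol=1e-5))
```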

## 7.1.6 Models as layers

When using the functional API, models can be used as layers! This is true for both `Sequential` and `Model` classes. You can call a model on an input tensor and get an output tensor:
```python
y = model(x)
```
With a multi-input/output model, you should use lists of tensors:
```python
y1, y2 = model([x1, x2])
```
As with calling a layer instance more than once, calling a model instance reuses the model's weights: the same learned representations are applied on every call.

An example of a *shared* model is a vision model that takes a dual camera as its input in order to perceive depth. You don't need two independent models to extract visual features from each camera before merging them. That processing can be shared across the two inputs by reusing one model, i.e. by sharing its *layers*. Here is an implementation of this Siamese vision model based on the Xception network (convolutional base only):

```python
from keras import layers
from keras import applications
from keras import Input

xception_base = applications.Xception(weights=None,
                                      include_top=False)

# The inputs are 250x250 RGB images
left_input = Input(shape=(250, 250, 3))
right_input = Input(shape=(250, 250, 3))

# We call the same vision model twice
left_features = xception_base(left_input)
right_features = xception_base(right_input)

merged_features = layers.concatenate([left_features, right_features], axis=-1)
```

## 7.1.7 Wrapping up

Concepts covered in the introduction to the Keras functional API:
- Using `Model` when `Sequential` can't express the required model
- How to build models with several inputs, several outputs, or a complex internal network topology
- How to reuse the weights of layers or models across different processing branches, by calling the same layer or model instance more than once