[Keras](https://keras.io/) is an API that provides high-level building blocks for developing machine learning models.

Keras does not implement low-level operations such as tensor manipulation and differentiation itself; instead, it delegates them to a backend engine.

Several different backend engines can be plugged into Keras:

 * [TensorFlow (Google)](https://www.tensorflow.org/)
 * [Theano (MILA lab, Université de Montréal)](http://deeplearning.net/software/theano/)
 * [Microsoft Cognitive Toolkit (CNTK)](https://github.com/Microsoft/CNTK)

Keras models can be run with any of these backends without having to change the code.

Keras is able to run seamlessly on both CPUs and GPUs.

A typical deployment stack looks like this: 
 
<img src="images/keras_stack.png" height="250" width="400"/> 

 * [CUDA](https://developer.nvidia.com/cuda-toolkit) is a parallel computing API for Nvidia devices
 * [cuDNN (Deep Neural Network library)](https://developer.nvidia.com/cudnn) is a library that provides primitives for neural networks 
 * [BLAS (Basic Linear Algebra Subprograms)](http://www.netlib.org/blas/) is a library with basic vector and matrix operations
 * [Eigen](http://eigen.tuxfamily.org/index.php?title=Main_Page) is a library for linear algebra
 


## Anatomy of a Keras model

A Keras model contains the following objects:

 * Layers, which are combined into a model
 * The input data and labels
 * The loss function, which defines the feedback signal used for learning
 * The optimizer, which determines how learning proceeds

<img src="images/keras_model.png" height="250" width="400"/> 
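
To make the roles of these four pieces concrete, here is a minimal plain-Python sketch of a training loop for a model with a single weight. It is not how Keras implements training. The names `predict`, `mse_loss`, and `sgd_step` are hypothetical and exist only for this illustration.

```python
# Illustrative only: a one-weight "model" trained with plain gradient descent.
# None of these names are Keras APIs.

def predict(w, x):
    """The 'layer': a single multiplication by the weight w."""
    return w * x

def mse_loss(y_true, y_pred):
    """The loss function: mean squared error over a batch."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def sgd_step(w, xs, ys, lr):
    """The optimizer: one gradient descent update of w."""
    # d/dw mean((y - w*x)^2) = mean(-2 * x * (y - w*x))
    grad = sum(-2 * x * (y - predict(w, x)) for x, y in zip(xs, ys)) / len(xs)
    return w - lr * grad

# Input data and labels, generated by the rule y = 3 * x.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]

w = 0.0
for _ in range(100):
    w = sgd_step(w, xs, ys, lr=0.05)

final_loss = mse_loss(ys, [predict(w, x) for x in xs])
print('learned weight:', round(w, 4), 'loss:', final_loss)
```

The loss turns the mismatch between predictions and labels into a single number, and the optimizer uses its gradient to move the weight towards 3.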

## Layers

A layer is a function that takes as input one or more tensors and that outputs one or more tensors.

Some layers are stateless, but more frequently layers have a state: the layer’s weights, one or several tensors learned with stochastic gradient descent.

Examples of stateless layers:
 * Dropout: regularization to reduce overfitting in models
 * Merge layers: concatenate, sum, mean, min, max etc.

Examples of stateful layers:
 * Dense layers
 * Recurrent layers
 * Convolution layers
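
The difference between the two kinds of layers can be sketched in plain Python. This is an illustration, not Keras code: `dropout` and `DenseUnit` are hypothetical names, and the dropout function assumes the common "inverted dropout" formulation, in which the kept values are scaled up during training.

```python
import random

def dropout(values, rate, training=True, rng=random):
    """Stateless: the output depends only on the inputs and random draws."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    # Zero each entry with probability `rate`, scale the rest by 1 / keep.
    return [v / keep if rng.random() < keep else 0.0 for v in values]

class DenseUnit:
    """Stateful: holds weights and a bias that training would update."""

    def __init__(self, weights, bias):
        self.weights = list(weights)
        self.bias = bias

    def __call__(self, values):
        return sum(w * v for w, v in zip(self.weights, values)) + self.bias

x = [1.0, 2.0, 3.0]
unit = DenseUnit(weights=[0.5, 0.5, 0.5], bias=1.0)
print(unit(x))                               # 4.0
print(dropout(x, rate=0.5, training=False))  # [1.0, 2.0, 3.0]
print(dropout(x, rate=0.5))                  # random: entries are 0.0 or doubled
```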

Different layers are appropriate for different types of data processing:

 * Vector data, stored in 2D tensors of shape (batch_size, features), is usually processed by dense layers
 * Sequence data, stored in 3D tensors of shape (batch_size, timesteps, features), is usually processed by recurrent layers
 * Image data, stored in 4D tensors of shape (batch_size, height, width, colors), is usually processed by convolution layers

You can think of layers as LEGO bricks.

Models are built by clipping together compatible layers to form useful data-transformation pipelines.

Layer compatibility here means that every layer only accepts input tensors of a certain shape and returns output tensors of a certain shape.
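
As a rough sketch of this rule, the following plain-Python helpers propagate a shape through a stack of dense layers and fail when two neighbours do not fit. The function names are hypothetical. Keras performs a comparable check itself when layers are connected, so this only illustrates the idea.

```python
def dense_output_shape(input_shape, weight_shape):
    """Output shape of a dense layer, or ValueError if the shapes do not fit."""
    batch_size, features = input_shape
    in_features, out_features = weight_shape
    if features != in_features:
        raise ValueError(
            'layer expects %d input features but received %d'
            % (in_features, features))
    return (batch_size, out_features)

def stack_output_shape(input_shape, weight_shapes):
    """Propagate a shape through a linear stack of dense layers."""
    shape = input_shape
    for weight_shape in weight_shapes:
        shape = dense_output_shape(shape, weight_shape)
    return shape

# Compatible: 2 features -> 5 features -> 3 features
print(stack_output_shape((32, 2), [(2, 5), (5, 3)]))  # (32, 3)

# Incompatible: the second layer expects 4 features but receives 5
try:
    stack_output_shape((32, 2), [(2, 5), (4, 3)])
except ValueError as err:
    print('incompatible:', err)
```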

A model is a directed, acyclic graph of layers. 

The most common instance is a linear stack of layers, mapping a single input to a single output. 

More complex models will have multiple inputs/outputs or short-cut connections.
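
A linear stack and a short-cut connection can be sketched with ordinary function composition. The toy layers `scale` and `shift` and the helpers `sequential` and `with_shortcut` are hypothetical names for this illustration and are not part of Keras.

```python
def scale(factor):
    """Toy layer: multiply every element by a constant."""
    return lambda xs: [factor * x for x in xs]

def shift(offset):
    """Toy layer: add a constant to every element."""
    return lambda xs: [x + offset for x in xs]

def sequential(layers):
    """Linear stack: feed the output of each layer into the next one."""
    def model(xs):
        for layer in layers:
            xs = layer(xs)
        return xs
    return model

def with_shortcut(block):
    """Short-cut connection: add the block's input to its output."""
    def model(xs):
        return [x + y for x, y in zip(xs, block(xs))]
    return model

stack = sequential([scale(2.0), shift(1.0)])
print(stack([1.0, 2.0]))                 # [3.0, 5.0]
print(with_shortcut(stack)([1.0, 2.0]))  # [4.0, 7.0]
```

In the second model the input reaches the output along two paths, which is why the general structure is a graph rather than a chain.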

For each problem class there are usually one or more standard model architectures.

It is a good idea to start with one of these models.

In general, picking the right model architecture is more an art than a science.


## Layers in Keras

The following sections demonstrate the function and behavior of some Keras layers. 

By building a single-layer model and running a forward pass (e.g. calling `predict()`), it is possible to inspect the behavior of a layer in isolation.

For more information see [Keras layers](https://keras.io/layers/about-keras-layers/).

### Dense layer

A dense layer performs the computation `output = activation(dot(input, W) + b)`.

A dense layer takes a tensor of shape (batch_size, input_size) as input and returns a tensor of shape (batch_size, output_size).

 * `W` is an (input_size, output_size) weight matrix
 * `b` is a bias vector with output_size elements

Some frequently used [activation functions](https://keras.io/activations) are:
 * `linear`: identity function, i.e. no activation is applied
 * `relu`: rectified linear unit
 * `sigmoid`: sigmoid function, used in binary classification
 * `softmax`: softmax function, used in multi-class classification
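
For intuition, the three element-wise activations from this list can be written in a few lines of plain Python. This is a sketch of the mathematical definitions, not the Keras implementations, and it is not hardened against overflow for very large negative inputs.

```python
import math

def linear(x):
    """Identity: the value passes through unchanged."""
    return x

def relu(x):
    """Rectified linear unit: negative values are clipped to zero."""
    return max(0.0, x)

def sigmoid(x):
    """Squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

for x in (-2.0, 0.0, 2.0):
    print(x, linear(x), relu(x), round(sigmoid(x), 4))
```

Softmax is different in that it operates on a whole vector rather than on single values.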

In [2]:
import numpy as np
from keras.layers import Input, Dense
from keras.models import Model

W = np.array([
    [1,2,3,4,5],
    [1,2,3,4,5]])
b = np.array([0,0,0,0,0])
weights_and_bias = (W, b)

inputs = Input(shape=(2,))
outputs = Dense(5, activation='linear', weights=weights_and_bias)(inputs) 
model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='sgd', loss='mse')

print('Input shape', model.input.shape)
print('Output shape', model.output.shape)

x = np.array([1,2])
print('Output:', model.predict(np.expand_dims(x, 0)))

np_result = np.dot(x, W) + b
print('Numpy result:', np_result)

Input shape (?, 2)
Output shape (?, 5)
Output: [[ 3.  6.  9. 12. 15.]]
Numpy result: [ 3  6  9 12 15]


### Softmax activation

For a vector $x = [x_0,...,x_n]$ the softmax function calculates:

$ softmax(x_i) = \frac{e^{x_i}}{\sum_j {e^{x_j}}} $

**Note:** In Keras any [activation function](https://keras.io/activations/) can either be used with an Activation layer, or through the activation argument in the layer constructor.

In [26]:
import numpy as np
from keras.layers import Input, Activation
from keras.models import Model

nb_classes = 4
inputs = Input(shape=(nb_classes,), dtype='float32')
softmax = Activation('softmax')(inputs)
model = Model(inputs=inputs, outputs=softmax)
model.compile(optimizer='sgd', loss='mse')

# simulate the output of the last layer
logits = np.array([[8.0, 2.0, 9.0, 3.0]])
probs = model.predict(logits)

for i, prob in enumerate(probs[0]):
    print('Probability for label %d: %f' % (i, prob))
print('Predicted label:', np.argmax(probs))

Probability for label 0: 0.268276
Probability for label 1: 0.000665
Probability for label 2: 0.729251
Probability for label 3: 0.001808
Predicted label: 2


You only have to calculate the softmax output if you are interested in the probabilities. If you are only interested in the predicted label just determine the index of the largest logit value.
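
The reason is that softmax preserves the ordering of its inputs, so the largest logit always receives the largest probability. A minimal plain-Python sketch, assuming the usual trick of subtracting the maximum logit for numerical stability (the helper names are hypothetical):

```python
import math

def softmax(logits):
    """Softmax of a list of logits, shifted by the maximum for stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [8.0, 2.0, 9.0, 3.0]
probs = softmax(logits)
print([round(p, 6) for p in probs])
print(argmax(logits), argmax(probs))  # 2 2
```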

### Cross Entropy loss

Input to the cross entropy function **must** be a probability distribution.

$ cross\_entropy(y\_true, y\_pred) = -log(y\_pred_{y\_true}) $

There are multiple implementations of cross entropy:
 * categorical vs. binary
 * sparse vs. one-hot encoded
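
The sparse and one-hot variants compute the same value and differ only in how the true label is encoded. A plain-Python sketch with the hypothetical helpers `sparse_ce` and `one_hot_ce` follows. Unlike a production implementation, it does not guard against probabilities of exactly zero.

```python
import math

def sparse_ce(label, probs):
    """Cross entropy with the true class given as an integer index."""
    return -math.log(probs[label])

def one_hot_ce(one_hot, probs):
    """Cross entropy with the true class given as a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

probs = [0.01, 0.01, 0.96, 0.01, 0.01]
sparse_loss = sparse_ce(1, probs)
one_hot_loss = one_hot_ce([0, 1, 0, 0, 0], probs)
print(sparse_loss, one_hot_loss)

# A confident, correct prediction yields a much smaller loss.
print(sparse_ce(2, probs))
```

Both calls evaluate $-log(0.01)$, roughly 4.605, because the model assigns a probability of only 0.01 to the true class 1.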
 
In TensorFlow you usually use optimized functions that combine softmax and cross entropy and therefore take the logits, not the probabilities, as input.
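
One motivation for such combined functions is numerical stability: working on the logits directly avoids exponentiating large values. The sketch below shows the idea in plain Python using the log-sum-exp identity. It is an illustration under that assumption, not TensorFlow's actual implementation, and `cross_entropy_from_logits` is a hypothetical name.

```python
import math

def cross_entropy_from_logits(label, logits):
    """Sparse cross entropy computed directly from logits.

    Uses -log(softmax(logits)[label]) = logsumexp(logits) - logits[label].
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]

logits = [8.0, 2.0, 9.0, 3.0]
fused = cross_entropy_from_logits(2, logits)

# Two-step reference: softmax first, then the negative log of the true class.
exps = [math.exp(v) for v in logits]
two_step = -math.log(exps[2] / sum(exps))
print(fused, two_step)

# With very large logits, math.exp(1000.0) alone would overflow,
# but the combined form still works.
print(cross_entropy_from_logits(0, [1000.0, 0.0]))  # 0.0
```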

In [27]:
import numpy as np
import keras
from keras import backend as K

# Sparse labels: y_true holds the class index (here class 1), not a one-hot vector
y_true = K.variable(value=np.array([1]), dtype='float32')
# The predicted distribution assigns a probability of only 0.01 to the true class
y_pred = K.variable(value=np.array([[0.01, 0.01, 0.96, 0.01, 0.01]]), dtype='float32')
loss_fn = keras.losses.sparse_categorical_crossentropy(y_true, y_pred)

loss = K.eval(loss_fn)
print('cross entropy loss:', loss[0])

cross entropy loss: 4.6051702
