## Keras: Deep Learning Framework Based on Theano
- [github](https://github.com/fchollet/keras)
- [website](http://keras.io/)
- [examples showing off modelling capability](http://keras.io/examples/)
- [example code on image/text data](https://github.com/fchollet/keras/tree/master/examples)

### Philosophy
- fast prototyping with flexible and minimal configuration: a Torch-like interface within Python that also supports an sklearn-like training/prediction interface, e.g., `fit`, `train_on_batch`, `evaluate`, `predict_classes`, `predict_proba`.
- runs on both CPU and GPU
- supports both convolutional networks and recurrent networks
- easy to extend

### Basic Usage
Like almost all other deep learning APIs, the main components of keras are (1) different types of layers, (2) a model (net) consisting of layers and a loss function, and (3) optimizers, plus some optional data-processing facilities, e.g., for image/text/sequence data.

### Main APIs

#### A. [Data Processing](http://keras.io/preprocessing/sequence/) - mostly helper functions and helper processors
- packages: 
    - `keras.preprocessing.sequence` for sequence data
    - `keras.preprocessing.text` for text data
    - `keras.preprocessing.image` for image data

#### B. [Layers](http://keras.io/layers/core/)
- packages:
    - `keras.layers.core` for core layers
    - `keras.layers.convolutional` for convolution/pooling layers
    - `keras.layers.recurrent` for recurrent layers
    - `keras.layers.advanced_activations` as its name suggests
    - `keras.layers.normalization` for normalizations
    - `keras.layers.embeddings` for text embedding (vector representation)
    - `keras.layers.noise` for noise-adding
    - `keras.layers.containers` for ensemble/composite layers, e.g. sequentially stacked multilayers
- activation functions: the activation of a layer can be specified either (1) via a separate activation layer or (2) through the `activation` argument supported by all forward layers. Existing activations are
    - softmax: expect shape to be either (nsamples, ntimesteps, ndims) or (nsamples, ndims)
    - softplus
    - relu
    - tanh
    - sigmoid
    - hard_sigmoid
    - linear
- initialization of layer weights can be specified by the `init` param in the layer constructor; out-of-box initializations include
    - uniform
    - lecun_uniform (uniform initialization scaled by the square root of the number of inputs)
    - normal
    - identity 
    - orthogonal
    - zero
    - glorot_normal (Gaussian initialization scaled by the number of inputs plus the number of outputs)
    - glorot_uniform
    - he_normal
    - he_uniform
- regularization: penalties can be applied to layer weights and/or layer activations. This is done via three parameters to a layer, each of which takes a regularizer instance from the `keras.regularizers` package.
    - `W_regularizer`: l1(l=0.01), l2(l=0.01), l1l2(l1=0.01, l2=0.01)
    - `b_regularizer`: l1(l=0.01), l2(l=0.01), l1l2(l1=0.01, l2=0.01)
    - `activity_regularizer`: activity_l1(l=0.01), activity_l2(l=0.01), activity_l1l2(l1=0.01, l2=0.01)
- constraints: some layers need constraints, see [doc](http://keras.io/constraints/) for details
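
To make a few of the options above concrete, here is a minimal pure-Python sketch of three of the listed activations and of the scale factors behind some of the uniform initializers. It illustrates the commonly cited formulas and is not Keras's own code; the helper `uniform_limit` is a hypothetical name, and `fan_in`/`fan_out` are assumed to mean the number of inputs and outputs of a layer.

```python
import math

def relu(x):
    """Rectified linear unit: keep positive values, zero out negative ones."""
    return max(0.0, x)

def hard_sigmoid(x):
    """Piecewise-linear approximation of the sigmoid, clipped to [0, 1]."""
    return min(1.0, max(0.0, 0.2 * x + 0.5))

def softmax(scores):
    """Turn the scores of one sample into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def uniform_limit(name, fan_in, fan_out):
    """Half-width s of the uniform(-s, s) range (hypothetical helper)."""
    if name == "lecun_uniform":
        return math.sqrt(3.0 / fan_in)
    if name == "glorot_uniform":
        return math.sqrt(6.0 / (fan_in + fan_out))
    if name == "he_uniform":
        return math.sqrt(6.0 / fan_in)
    raise ValueError("unknown initializer: %s" % name)

probs = softmax([1.0, 2.0, 3.0])
print(probs)
print(uniform_limit("glorot_uniform", 28 * 28, 128))
```

The larger the layer, the narrower the sampling range, which is the point of the scaled initializers: they try to keep the magnitude of the signal roughly constant from layer to layer.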

#### C. [Objective Functions](http://keras.io/objectives/)
Objective functions can be specified either by name (see the out-of-box objective function names below) or as a Theano symbolic function that returns a scalar for each data point. Examples can be found in the [source code](https://github.com/fchollet/keras/blob/master/keras/objectives.py). Available functions include
- mean_squared_error / mse
- mean_absolute_error / mae
- mean_absolute_percentage_error / mape
- mean_squared_logarithmic_error / msle
- squared_hinge: only for binary classification
- hinge: only for binary classification
- binary_crossentropy: Also known as logloss.
- categorical_crossentropy: also known as multiclass logloss, typically paired with a softmax output. ***It requires the labels to be one-hot encoded, i.e., binary arrays of shape (nsamples, nclasses).***

Note that keras follows the Theano convention where the final output function (e.g., softmax) is treated as an activation instead of as part of the loss function (as in Caffe).
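
The one-hot requirement is easier to see in a small worked example. The sketch below is plain Python rather than Theano; `to_one_hot` is a hypothetical stand-in for the label-encoding step, and the clipping value `eps` is an assumption added only to keep `log` away from zero.

```python
import math

def to_one_hot(labels, nb_classes):
    """Encode integer class labels as binary rows of length nb_classes."""
    rows = []
    for label in labels:
        row = [0.0] * nb_classes
        row[label] = 1.0
        rows.append(row)
    return rows

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """One scalar per sample: -sum_k y_true[k] * log(y_pred[k])."""
    losses = []
    for t_row, p_row in zip(y_true, y_pred):
        loss = -sum(t * math.log(max(p, eps)) for t, p in zip(t_row, p_row))
        losses.append(loss)
    return losses

y_true = to_one_hot([2, 0], nb_classes=3)
# Each row is assumed to be a softmax output, i.e. it already sums to 1.
y_pred = [[0.1, 0.2, 0.7],
          [0.5, 0.25, 0.25]]
losses = categorical_crossentropy(y_true, y_pred)
print(y_true)
print(losses)
```

Because only the entry of the true class is non-zero in a one-hot row, each per-sample loss reduces to the negative log of the probability assigned to the correct class.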

#### D. [Optimizers](http://keras.io/optimizers/) 
Existing optimizers and their parameters can be found in the [doc](http://keras.io/optimizers/).
- [comparison of different optimization e.g. rmsprop](http://www.erogol.com/comparison-sgd-vs-momentum-vs-rmsprop-vs-momentumrmsprop/)

#### E. [Callback functors](http://keras.io/callbacks/)
Callback functors are subclasses of `keras.callbacks.Callback` with specific event slots such as `on_train_begin/end(logs={})`, `on_epoch_begin/end(epoch, logs={})`, `on_batch_begin/end(batch, logs={})`. The commonly used out-of-box callbacks are 
- `ModelCheckpoint(filepath, verbose = 0, save_best_only=False)`: Save the model after every epoch. If save_best_only=True, the latest best model according to the validation loss will not be overwritten.
- `EarlyStopping(monitor='val_loss', patience=0, verbose=0)`: Stop training after no improvement of the metric monitor is seen for patience epochs. The parameter of monitor is a key in the `logs` dictionary passed into event listeners.
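
The `patience` rule can be sketched without keras. The function below is a hypothetical re-implementation of the idea, not the library's code, and the exact off-by-one convention may differ between keras versions; the sample values are the `val_loss` figures printed by the MNIST MLP run in the Examples section of this notebook.

```python
def early_stopping_epoch(val_losses, patience=0):
    """Return the epoch at which training would stop, or None if it never does."""
    best = float("inf")
    wait = 0  # number of consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait > patience:
                return epoch
    return None

# val_loss of epochs 0..10 from the MLP training log in this notebook
val_losses = [0.2357, 0.1697, 0.1301, 0.1228, 0.1122, 0.1049,
              0.0964, 0.0997, 0.1000, 0.0940, 0.0906]

for patience in (0, 1, 2):
    print("patience=%d -> stop at epoch %s"
          % (patience, early_stopping_epoch(val_losses, patience)))
```

With these numbers a patience of 0 or 1 would have cut training short at epoch 7 or 8, while a patience of 2 rides out the temporary plateau and reaches the better losses at epochs 9 and 10.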

#### F. [Models](http://keras.io/models/)
- it is the main access point for training/evaluating. 
- it assembles other components such as layers, objective functions and optimizers, e.g.,
    - add layer by `model.add`
    - set loss function and optimizer in `model.compile`
    - set callback functions in `model.fit`
- it lets you specify callback functions that fire at different stages of training
- typical steps to build a keras model
    - [optionally] massage the data into the right format via the data-processing helpers
    - create a model via constructors (most of the time `Sequential`, sometimes `Graph`)
    - create layers and add them to the model by `model.add`
    - specify loss function and optimizer by `model.compile`
    - [optionally] specify callback functions for house-keeping
    - train the model with data by `model.fit` or `model.train_on_batch`
    - evaluate the performance and go back to tune the parameters and models
    - make predictions on new data
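
The training step hides two pieces of bookkeeping: holding out a validation fraction and cutting the remaining samples into mini-batches. The sketch below shows only that index arithmetic in plain Python. The helper names are hypothetical, and taking the held-out samples from the end of the arrays is an assumption about how `validation_split` behaves.

```python
def split_counts(n_samples, validation_split):
    """Number of training and validation samples for a given split fraction."""
    n_val = int(round(n_samples * validation_split))
    return n_samples - n_val, n_val

def batch_slices(n_samples, batch_size):
    """(start, end) index pairs covering n_samples; the last batch may be smaller."""
    return [(start, min(start + batch_size, n_samples))
            for start in range(0, n_samples, batch_size)]

# Numbers used by the MNIST MLP example: 60000 images, 30% held out, batches of 128
n_train, n_val = split_counts(60000, 0.3)
batches = batch_slices(n_train, 128)

print("train on %d samples, validate on %d samples" % (n_train, n_val))
print("%d batches per epoch, last batch covers %s" % (len(batches), batches[-1]))
```

Each `(start, end)` pair corresponds to one parameter update, which is what a manual loop around `train_on_batch` would iterate over.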

## Read and visualize the inner layers of a model
keras exposes a model's layers, along with their parameters and activations, via the `.layers` member.

And that is pretty much everything to know about keras for a good start. It is highly recommended to read its well written [source code](https://github.com/fchollet/keras).

In [1]:
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

## Typical Model Structures in keras

## Examples

In [43]:
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD, RMSprop
from keras.callbacks import ModelCheckpoint
from keras.datasets import mnist
from keras.utils import np_utils ## utility functions

### MNIST with vanilla MLP

In [50]:
## load MNIST raw images
(train_X, train_y), (test_X, test_y) = mnist.load_data()
print train_X.shape, test_X.shape, train_y.shape, test_y.shape

## massage the data: flatten each image into a vector and scale pixels to [0, 1]
## flattening images this way discards their spatial structure, in contrast to a cnn
def process_mnist_input(images):
    return images.reshape((-1, 28 * 28)).astype(np.float32) / 255.
## one-hot encoding of class labels, required by softmax-crossentropy loss
def process_mnist_output(labels):
    return np_utils.to_categorical(labels, nb_classes=10)
train_X, test_X = process_mnist_input(train_X), process_mnist_input(test_X)
print train_X.shape, test_X.shape
train_y, test_y = process_mnist_output(train_y), process_mnist_output(test_y)
print train_y.shape, test_y.shape

## build the model - vanilla mlp
model = Sequential()
model.add(Dense(28 * 28, 128, activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(128, 128, activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(128, 10, activation="softmax"))
rms = RMSprop()
model.compile(loss = "categorical_crossentropy", optimizer = rms)

## train model under an sklearn interface
## model snapshot callback
save_model = ModelCheckpoint("../data/tmp/keras_mnist_nlp.h5")
model.fit(train_X, train_y, batch_size=128, nb_epoch=20, 
          show_accuracy=True, verbose=2, validation_split=0.3, callbacks=[save_model])

## evaluate on test data
print model.evaluate(test_X, test_y, show_accuracy=True, verbose=0)
np.mean(model.predict_classes(test_X) == test_y.argmax(axis = 1))

(60000, 28, 28) (10000, 28, 28) (60000,) (10000,)
(60000, 784) (10000, 784)
(60000, 10) (10000, 10)
Train on 42000 samples, validate on 18000 samples
Epoch 0
2s - loss: 0.5160 - acc: 0.8490 - val_loss: 0.2357 - val_acc: 0.9284
Epoch 1
2s - loss: 0.2342 - acc: 0.9300 - val_loss: 0.1697 - val_acc: 0.9491
Epoch 2
2s - loss: 0.1745 - acc: 0.9480 - val_loss: 0.1301 - val_acc: 0.9604
Epoch 3
2s - loss: 0.1407 - acc: 0.9577 - val_loss: 0.1228 - val_acc: 0.9639
Epoch 4
2s - loss: 0.1172 - acc: 0.9637 - val_loss: 0.1122 - val_acc: 0.9666
Epoch 5
2s - loss: 0.1062 - acc: 0.9674 - val_loss: 0.1049 - val_acc: 0.9684
Epoch 6
2s - loss: 0.0925 - acc: 0.9715 - val_loss: 0.0964 - val_acc: 0.9717
Epoch 7
2s - loss: 0.0832 - acc: 0.9742 - val_loss: 0.0997 - val_acc: 0.9713
Epoch 8
2s - loss: 0.0757 - acc: 0.9778 - val_loss: 0.1000 - val_acc: 0.9724
Epoch 9
2s - loss: 0.0702 - acc: 0.9781 - val_loss: 0.0940 - val_acc: 0.9733
Epoch 10
2s - loss: 0.0670 - acc: 0.9786 - val_loss: 0.0906 - val_acc: 0.9749
Ep

0.97940000000000005
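
As a rough check on how small this MLP is, its trainable parameters can be counted by hand. The sketch below assumes every `Dense` layer holds a weight matrix plus one bias per output unit and that `Dropout` adds no parameters; `dense_params` is a hypothetical helper, not a keras function.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# (inputs, outputs) of the three Dense layers in the model above
layer_sizes = [(28 * 28, 128), (128, 128), (128, 10)]
per_layer = [dense_params(n_in, n_out) for n_in, n_out in layer_sizes]
total = sum(per_layer)

print(per_layer)
print(total)
```

Most of the parameters sit in the first layer, because it connects every one of the 784 pixels to every hidden unit.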

### MNIST with cnn - utilizing the spatial correlation

In [76]:
from keras.datasets import mnist
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D

## load data
(train_X, train_y), (test_X, test_y) = mnist.load_data()

## massage data for cnn - add a channel axis so each grayscale image is a 3D tensor, normalize, and convert to float32 for gpu
## sometimes you also need to resize or crop images to the right shape
## one-hot-encode labels
def process_mnist_input(images):
    return images[:, np.newaxis, :, :].astype(np.float32) / 255.
def process_mnist_output(labels):
    return np_utils.to_categorical(labels)
train_X = process_mnist_input(train_X)
test_X = process_mnist_input(test_X)
print train_X.shape, test_X.shape
train_y, test_y = process_mnist_output(train_y), process_mnist_output(test_y)
print train_y.shape, test_y.shape

## build cnn - the stride is not specified here, so the Convolution2D layers use their default stride (the `subsample` argument)
model = Sequential()
model.add(Convolution2D(nb_filter = 32, stack_size = 1, nb_row = 3, nb_col = 3, 
                        border_mode="full", activation="relu"))
model.add(Convolution2D(nb_filter = 32, stack_size = 32, nb_row = 3, nb_col = 3, 
                        activation="relu"))
model.add(MaxPooling2D(poolsize=(2, 2)))
model.add(Dropout(.25))

model.add(Flatten()) ## flatten to vectors - from convolution layer to vector layer
## the "full" 3x3 conv grows 28x28 to 30x30, the next 3x3 conv shrinks it back to 28x28, and (2, 2)-pooling gives (14, 14)
model.add(Dense(32 * 14 * 14, 128, activation="relu"))
model.add(Dropout(0.5))

model.add(Dense(128, 10, activation="softmax"))

## compile with loss function and optimizer
model.compile(loss = "categorical_crossentropy", optimizer = "adadelta")

## train the model
model.fit(train_X, train_y, batch_size=100, nb_epoch=10, 
          validation_split=0.3, show_accuracy=True, verbose=2)
print model.evaluate(test_X, test_y, show_accuracy=True)

(60000, 1, 28, 28) (10000, 1, 28, 28)
(60000, 10) (10000, 10)
Train on 42000 samples, validate on 18000 samples
Epoch 0
1067s - loss: 0.2822 - acc: 0.9115 - val_loss: 0.0785 - val_acc: 0.9767
Epoch 1
1065s - loss: 0.1020 - acc: 0.9698 - val_loss: 0.0541 - val_acc: 0.9835
Epoch 2
1062s - loss: 0.0764 - acc: 0.9772 - val_loss: 0.0461 - val_acc: 0.9866
Epoch 3
1063s - loss: 0.0625 - acc: 0.9817 - val_loss: 0.0737 - val_acc: 0.9797
Epoch 4
1064s - loss: 0.0517 - acc: 0.9841 - val_loss: 0.0498 - val_acc: 0.9860
Epoch 5
1060s - loss: 0.0487 - acc: 0.9850 - val_loss: 0.0436 - val_acc: 0.9882
Epoch 6
1066s - loss: 0.0456 - acc: 0.9864 - val_loss: 0.0428 - val_acc: 0.9888
Epoch 7
1062s - loss: 0.0383 - acc: 0.9882 - val_loss: 0.0454 - val_acc: 0.9887
Epoch 8
1068s - loss: 0.0354 - acc: 0.9885 - val_loss: 0.0433 - val_acc: 0.9888
Epoch 9
1062s - loss: 0.0321 - acc: 0.9898 - val_loss: 0.0421 - val_acc: 0.9895
0.0280298104641


In [80]:
print model.evaluate(test_X, test_y, show_accuracy=True)
#print np.mean(model.predict_classes(test_X) == test_y.argmax(axis = 1))

[0.028029810464144322, 0.99129999999999996]
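
The input size `32 * 14 * 14` of the first `Dense` layer in the cnn follows from simple shape arithmetic. The sketch below retraces it in plain Python, assuming a stride of 1, a default border mode of `"valid"` for the second convolution, and non-overlapping pooling; both helper functions are hypothetical.

```python
def conv_output_size(size, kernel, border_mode):
    """Spatial size after a stride-1 convolution along one dimension."""
    if border_mode == "full":
        return size + kernel - 1
    if border_mode == "valid":
        return size - kernel + 1
    raise ValueError("unknown border mode: %s" % border_mode)

def pool_output_size(size, pool):
    """Spatial size after non-overlapping pooling along one dimension."""
    return size // pool

sizes = [28]                                             # MNIST images are 28x28
sizes.append(conv_output_size(sizes[-1], 3, "full"))     # first Convolution2D
sizes.append(conv_output_size(sizes[-1], 3, "valid"))    # second Convolution2D
sizes.append(pool_output_size(sizes[-1], 2))             # MaxPooling2D
flattened = 32 * sizes[-1] * sizes[-1]                   # 32 filters after Flatten

print(sizes)
print(flattened)
```

If the first convolution used `"valid"` instead of `"full"`, the same arithmetic would give a 12x12 map after pooling, and the `Dense` input size would have to change accordingly.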
