## Recognizing Handwritten Digits

For this goal, we'll use MNIST (see http://yann.lecun.com/exdb/mnist/), a database of handwritten digits made up of a training set of 60,000 examples and a test set of 10,000 examples. Each MNIST image is grayscale and consists of 28 x 28 pixels.

Keras provides suitable libraries to load the dataset and split it into a training set and a test set, the latter used for assessing performance. Data is converted to `float32` to support GPU computation and normalized to `[0, 1]`. In addition, we load the true labels into `Y_train` and `Y_test` respectively and perform a one-hot encoding on them.

* The input layer has one neuron for each pixel in the MNIST images, for a total of 28 x 28 = 784 neurons;
* Typically, the values associated with each pixel are normalized in the range [0, 1] (which means that the intensity of each pixel is divided by 255, the maximum intensity value);
* The final layer has 10 neurons (one per digit class) with activation function `softmax`, which is a generalization of the `sigmoid` function.
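The preprocessing and output activation described above can be sketched in plain NumPy. This is a minimal illustration, not the Keras implementation: the helper names `one_hot`, `softmax`, and `sigmoid` and the sample values are assumptions made for the example.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) matrix with a single 1 per row."""
    encoded = np.zeros((len(labels), num_classes), dtype="float32")
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

def softmax(z):
    """Numerically stable softmax over the last axis."""
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Pixel intensities in [0, 255] scaled to [0, 1]
pixels = np.array([0, 128, 255], dtype="float32") / 255

# Integer digit labels turned into one-hot rows of length 10
labels = one_hot(np.array([3, 0, 9]), num_classes=10)

# Softmax turns arbitrary scores into probabilities that sum to 1
probs = softmax(np.array([2.0, 1.0, 0.1]))

# With two classes, softmax reduces to the sigmoid of the score difference
two_class = softmax(np.array([1.5, 0.0]))
print(two_class[0], sigmoid(1.5))
```

The last two values printed agree, which is the sense in which `softmax` generalizes `sigmoid` to more than two classes.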

Once we have defined the model, we have to compile it so that it can be executed by the Keras backend (either Theano or TensorFlow). There are a few choices to be made during compilation:

* We need to select the `optimizer`, that is, the algorithm used to update weights while we train our model;
* We need to select the `objective function` that is used by the optimizer to navigate the space of weights (objective functions are frequently called `loss functions`, and the process of optimization is defined as a process of loss minimization);
* We need to select the `metrics` used to evaluate the trained model.

Some common choices for metrics (a complete list of Keras metrics is at https://keras.io/metrics/) are as follows:

* **Accuracy**: This is the proportion of correct predictions with respect to the targets;
* **Precision**: This denotes how many of the selected items are relevant;
* **Recall**: This denotes how many of the relevant items are selected.

Metrics are similar to objective functions, with the only difference that they are not used for training a model but only for evaluating a model.
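To make the three definitions concrete, here is a small NumPy sketch for a binary case. The label and prediction arrays are hypothetical values chosen for illustration, and the computation follows the standard textbook definitions rather than any particular Keras metric class.

```python
import numpy as np

# Hypothetical ground truth (1 = relevant) and predictions (1 = selected)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1, 1, 0])

true_positives = np.sum((y_pred == 1) & (y_true == 1))
false_positives = np.sum((y_pred == 1) & (y_true == 0))
false_negatives = np.sum((y_pred == 0) & (y_true == 1))

# Proportion of predictions that match the targets
accuracy = np.mean(y_pred == y_true)
# Of the items we selected, how many were relevant
precision = true_positives / (true_positives + false_positives)
# Of the relevant items, how many we selected
recall = true_positives / (true_positives + false_negatives)

print(accuracy, precision, recall)
```

On this toy data the three numbers differ, which shows why a single metric rarely tells the whole story.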

Once the model is compiled, it can then be trained with the `fit()` function, which specifies a few parameters:

* **epochs**: This is the number of times the model is exposed to the training set. At each iteration, the optimizer tries to adjust the weights so that the objective function is minimized;
* **batch_size**: This is the number of training instances observed before the optimizer performs a weight update.
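The interplay between the two parameters can be checked with simple arithmetic. The sketch below assumes the values used in the first listing of this section (60,000 training images, a validation split of 0.2, a batch size of 128, and 200 epochs) and counts how many weight updates the optimizer performs.

```python
import math

TRAIN_SAMPLES = 60000
VALIDATION_SPLIT = 0.2
BATCH_SIZE = 128
NB_EPOCH = 200

# Samples held out for validation are not used for weight updates
validation_samples = round(TRAIN_SAMPLES * VALIDATION_SPLIT)
fit_samples = TRAIN_SAMPLES - validation_samples

# One weight update per batch; the last batch may be smaller
updates_per_epoch = math.ceil(fit_samples / BATCH_SIZE)
total_updates = updates_per_epoch * NB_EPOCH

print(fit_samples, updates_per_epoch, total_updates)
```

A smaller batch size means more updates per epoch, while more epochs means more passes over the same data.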

In [None]:
from __future__ import print_function
import numpy as np
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.optimizers import SGD
from keras.utils import np_utils
np.random.seed(1671) # for reproducibility

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

# network and training
NB_EPOCH = 200
BATCH_SIZE = 128
VERBOSE = 1
NB_CLASSES = 10 # number of outputs
OPTIMIZER = SGD()
N_HIDDEN = 128
VALIDATION_SPLIT = 0.2 # how much training data is reserved for validation

(X_train, Y_train), (X_test, Y_test) = mnist.load_data()
RESHAPED = 784

# X_train is 60000 rows of 28x28 values --> reshaped in 60000 x 784
X_train = X_train.reshape(60000, RESHAPED)
X_test = X_test.reshape(10000, RESHAPED)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

# Normalize
X_train /= 255
X_test /= 255
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# Convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(Y_train, NB_CLASSES)
Y_test = np_utils.to_categorical(Y_test, NB_CLASSES)

# Creates the model
model = Sequential()
model.add(Dense(NB_CLASSES, input_shape=(RESHAPED,)))
model.add(Activation('softmax'))
model.summary()

# Selects the optimizer and the evaluation metrics.
model.compile(loss='categorical_crossentropy', optimizer=OPTIMIZER, metrics=['accuracy'])

# Trains the model
history = model.fit(X_train, Y_train,
                    batch_size=BATCH_SIZE,
                    epochs=NB_EPOCH,
                    verbose=VERBOSE,
                    validation_split=VALIDATION_SPLIT)

# Evaluates the model
score = model.evaluate(X_test, Y_test, verbose=VERBOSE)
print("Test score:", score[0])
print('Test accuracy:', score[1])

**Insights**
* The network is trained on 48,000 samples, and 12,000 are reserved for validation;
* Once the neural model is built, it is then tested on 10,000 samples;
* We can notice that the program runs for 200 epochs, and each time, the accuracy improves.

This means that a bit less than one handwritten character out of ten is not correctly recognized. We can certainly do better than that.

### Improving our neural network

* A first improvement is to add additional layers to our network;
* So, after the input layer, we have a first dense layer with the `N_HIDDEN` neurons and an activation function `relu`;
* This layer is called _hidden_ because it is not directly connected to either the input or the output;
* After the first hidden layer, we have a second hidden layer, again with the `N_HIDDEN` neurons, followed by an output layer with 10 neurons.

In [None]:
from __future__ import print_function
import numpy as np
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.optimizers import SGD
from keras.utils import np_utils
np.random.seed(1671) # for reproducibility

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

# Network and training
NB_EPOCH = 20
BATCH_SIZE = 128
VERBOSE = 1
NB_CLASSES = 10 # number of outputs
OPTIMIZER = SGD()
N_HIDDEN = 128
VALIDATION_SPLIT = 0.2 # how much training data is reserved for validation

(X_train, Y_train), (X_test, Y_test) = mnist.load_data()
RESHAPED = 784

# X_train is 60000 rows of 28x28 values --> reshaped in 60000 x 784
X_train = X_train.reshape(60000, RESHAPED)
X_test = X_test.reshape(10000, RESHAPED)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

# Normalize
X_train /= 255
X_test /= 255
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# Convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(Y_train, NB_CLASSES)
Y_test = np_utils.to_categorical(Y_test, NB_CLASSES)

model = Sequential()
model.add(Dense(N_HIDDEN, input_shape=(RESHAPED,)))
model.add(Activation('relu'))
model.add(Dense(N_HIDDEN))
model.add(Activation('relu'))
model.add(Dense(NB_CLASSES))
model.add(Activation('softmax'))
model.summary()

# Selects the optimizer and the evaluation metrics.
model.compile(loss='categorical_crossentropy', optimizer=OPTIMIZER, metrics=['accuracy'])

# Trains the model
history = model.fit(X_train, Y_train,
                    batch_size=BATCH_SIZE, epochs=NB_EPOCH,
                    verbose=VERBOSE, validation_split=VALIDATION_SPLIT)

# Evaluates the model
score = model.evaluate(X_test, Y_test, verbose=VERBOSE)
print("Test score:", score[0])
print('Test accuracy:', score[1])

### Further improving our neural network

* The second improvement is to randomly drop, with a given dropout probability, some of the values propagated inside our internal dense network of hidden layers;
* In Machine Learning, this is a well known form of regularization;
* It has been frequently observed that networks with random dropout in internal hidden layers can generalize better on unseen examples contained in test sets;
* One can think of this as each neuron becoming more capable because it knows it cannot depend on its neighbors;
* During testing, there is no dropout, so we are now using all our highly tuned neurons;
* It is generally a good approach to test how a net performs when some dropout function is adopted.
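The behavior described above (random dropping during training, no dropping during testing) can be sketched in NumPy. This is a common formulation often called inverted dropout, written here as an assumption for illustration; the function name `dropout` and its arguments are hypothetical and this is not the Keras source.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Zero a random fraction `rate` of values and rescale the survivors."""
    if not training or rate == 0.0:
        return activations  # at test time every neuron is used
    keep_prob = 1.0 - rate
    # Each value is kept independently with probability keep_prob
    mask = rng.random(activations.shape) < keep_prob
    # Rescaling keeps the expected activation unchanged
    return activations * mask / keep_prob

rng = np.random.default_rng(1671)
hidden = np.ones((4, 128), dtype="float32")

train_out = dropout(hidden, rate=0.3, training=True, rng=rng)
test_out = dropout(hidden, rate=0.3, training=False, rng=rng)

print("fraction dropped in training:", np.mean(train_out == 0))
print("fraction dropped in testing:", np.mean(test_out == 0))
```

Because a different random mask is drawn for every batch, no neuron can rely on a specific neighbor being present.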

**OBS:** try first training the network with `NB_EPOCH` set to 20. Note that training accuracy should be above test accuracy, otherwise we're not training long enough. After testing it with 20, set the `NB_EPOCH` value to 250 and see the results.

In [None]:
from __future__ import print_function
import numpy as np
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD
from keras.utils import np_utils
np.random.seed(1671) # for reproducibility

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

# Network and training
NB_EPOCH = 250
BATCH_SIZE = 128
VERBOSE = 1
NB_CLASSES = 10 # number of outputs
OPTIMIZER = SGD()
N_HIDDEN = 128
VALIDATION_SPLIT = 0.2 # how much training data is reserved for validation
DROPOUT = 0.3

(X_train, Y_train), (X_test, Y_test) = mnist.load_data()
RESHAPED = 784

# X_train is 60000 rows of 28x28 values --> reshaped in 60000 x 784
X_train = X_train.reshape(60000, RESHAPED)
X_test = X_test.reshape(10000, RESHAPED)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

# Normalize
X_train /= 255
X_test /= 255
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# Convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(Y_train, NB_CLASSES)
Y_test = np_utils.to_categorical(Y_test, NB_CLASSES)

model = Sequential()
model.add(Dense(N_HIDDEN, input_shape=(RESHAPED,)))
model.add(Activation('relu'))
model.add(Dropout(DROPOUT))
model.add(Dense(N_HIDDEN))
model.add(Activation('relu'))
model.add(Dropout(DROPOUT))
model.add(Dense(NB_CLASSES))
model.add(Activation('softmax'))
model.summary()

# Selects the optimizer and the evaluation metrics.
model.compile(loss='categorical_crossentropy', optimizer=OPTIMIZER, metrics=['accuracy'])

# Trains the model
history = model.fit(X_train, Y_train,
                    batch_size=BATCH_SIZE, epochs=NB_EPOCH,
                    verbose=VERBOSE, validation_split=VALIDATION_SPLIT)

# Evaluates the model
score = model.evaluate(X_test, Y_test, verbose=VERBOSE)
print("Test score:", score[0])
print('Test accuracy:', score[1])

### Testing different optimizers

* Let's focus on one popular training technique known as gradient descent (GD);
* The gradient descent can be seen as a hiker who aims at climbing down a mountain into a valley;
* Imagine a generic cost function `C(w)` in one single variable `w`;
* At each step `r`, the gradient is the direction of maximum increase, so the hiker moves in the opposite direction;
* At each step, the hiker can decide how long the next stride should be, which is the `learning rate` in gradient descent jargon;
* If the learning rate is too small, the hiker will move slowly, but if it's too high, the hiker will possibly miss the valley;
* In practice, we just choose the activation function, and Keras uses its backend (TensorFlow or Theano) for computing its derivative on our behalf;
* When we discuss backpropagation, we will discover that the minimization game is a bit more complex than our toy example;
* Keras implements a fast variant of gradient descent known as stochastic gradient descent (`SGD`) and two more advanced optimization techniques known as `RMSprop` and `Adam`.
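The hiker analogy can be tried on a toy cost function in one variable. The sketch below assumes `C(w) = (w - 3)^2`, a function chosen only for illustration, with its derivative written by hand, and compares three learning rates.

```python
def cost(w):
    """Toy cost function with its minimum (the valley) at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Derivative of the toy cost: points toward increasing cost."""
    return 2.0 * (w - 3.0)

def gradient_descent(w_start, learning_rate, steps):
    w = w_start
    for _ in range(steps):
        # Step against the gradient, scaled by the learning rate
        w = w - learning_rate * gradient(w)
    return w

w_small = gradient_descent(0.0, learning_rate=0.001, steps=50)  # too slow
w_good = gradient_descent(0.0, learning_rate=0.1, steps=50)     # converges
w_large = gradient_descent(0.0, learning_rate=1.1, steps=50)    # overshoots

print(w_small, w_good, w_large)
```

With the small rate the hiker has barely left the starting point after 50 steps, with the moderate rate the hiker reaches the valley, and with the large rate each step overshoots further than the last.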

In [None]:
from __future__ import print_function
import numpy as np
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import RMSprop, Adam
from keras.utils import np_utils
np.random.seed(1671) # for reproducibility

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

# Network and training
NB_EPOCH = 20
BATCH_SIZE = 128
VERBOSE = 1
NB_CLASSES = 10 # number of outputs
OPTIMIZER = Adam()
N_HIDDEN = 128
VALIDATION_SPLIT = 0.2 # how much training data is reserved for validation
DROPOUT = 0.3

(X_train, Y_train), (X_test, Y_test) = mnist.load_data()
RESHAPED = 784

# X_train is 60000 rows of 28x28 values --> reshaped in 60000 x 784
X_train = X_train.reshape(60000, RESHAPED)
X_test = X_test.reshape(10000, RESHAPED)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

# Normalize
X_train /= 255
X_test /= 255
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# Convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(Y_train, NB_CLASSES)
Y_test = np_utils.to_categorical(Y_test, NB_CLASSES)

model = Sequential()
model.add(Dense(N_HIDDEN, input_shape=(RESHAPED,)))
model.add(Activation('relu'))
model.add(Dropout(DROPOUT))
model.add(Dense(N_HIDDEN))
model.add(Activation('relu'))
model.add(Dropout(DROPOUT))
model.add(Dense(NB_CLASSES))
model.add(Activation('softmax'))
model.summary()

# Selects the optimizer and the evaluation metrics.
model.compile(loss='categorical_crossentropy', optimizer=OPTIMIZER, metrics=['accuracy'])

# Trains the model
history = model.fit(X_train, Y_train,
                    batch_size=BATCH_SIZE, epochs=NB_EPOCH,
                    verbose=VERBOSE, validation_split=VALIDATION_SPLIT)

# Evaluates the model
score = model.evaluate(X_test, Y_test, verbose=VERBOSE)
print("Test score:", score[0])
print('Test accuracy:', score[1])

* So far, we made progressive improvements; however, the gains are now more and more difficult;
* Note that we are optimizing with a dropout of 30%;
* For the sake of completeness, it could be useful to report the accuracy on the test only for other dropout values with `Adam` chosen as optimizer.

### Increasing the number of epochs

* We can make another attempt and increase the number of epochs used for training from 20 to 200;
* Unfortunately, this choice increases our computation time by a factor of 10, but it gives us no gain;
* **Learning is more about adopting smart techniques and not necessarily about the time spent in computations.**

### Controlling the optimizer learning rate

* There is another attempt we can make, which is changing the learning parameter for our optimizer;
* If you plot different values, you'll see that the optimal value is somewhere close to 0.001.

### Increasing the number of internal hidden neurons

* We can make yet another attempt, that is, changing the number of internal hidden neurons;
* We report the results of the experiments with an increasing number of hidden neurons;
* By increasing the complexity of the model, the run time increases significantly because there are more and more parameters to optimize.

### Increasing the size of batch computation

* Gradient descent tries to minimize the cost function on all the examples provided in the training sets;
* Stochastic gradient descent considers only `BATCH_SIZE` examples at a time;
* If we check how the behavior changes with this parameter, we notice that the best accuracy is reached for `BATCH_SIZE=128`.

### Adopting regularization for avoiding overfitting

* A model can become excessively complex in order to capture all the relations inherently expressed by the training data, which can bring two problems:
  - First, a complex model might require a significant amount of time to be executed;
  - Second, a complex model can achieve good performance on training data and not be able to generalize to unseen data.
* As a rule of thumb, if during training we see that the validation loss starts to increase after an initial decrease, then the model is complex enough to overfit the training data;
* In order to solve the overfitting problem, we need a way to capture the complexity of a model, that is, how complex a model can be;
* A model is nothing more than a vector of weights. Therefore the complexity of a model can be conveniently represented as the number of nonzero weights;
* If we have two models, M1 and M2, achieving pretty much the same performance in terms of loss function, then we should choose the simplest model that has the minimum number of nonzero weights;
* Playing with regularization can be a good way to increase the performance of a network, in particular when there is an evident situation of overfitting.
* There are three types of regularization in machine learning:
  - **L1 regularization** (also known as **lasso**): The complexity of the model is expressed as the sum of the absolute values of the weights;
  - **L2 regularization** (also known as **ridge**): The complexity of the model is expressed as the sum of the squares of the weights;
  - **Elastic net regularization**: The complexity of the model is captured by a combination of the two preceding techniques.
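The three penalties can be written directly from their definitions. In this NumPy sketch the weight vector, the data loss of 0.30, and the coefficients `lam`, `lam1`, and `lam2` are hypothetical values chosen for illustration.

```python
import numpy as np

def l1_penalty(weights, lam):
    """Lasso: sum of the absolute values of the weights."""
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam):
    """Ridge: sum of the squares of the weights."""
    return lam * np.sum(weights ** 2)

def elastic_net_penalty(weights, lam1, lam2):
    """Combination of the L1 and L2 penalties."""
    return l1_penalty(weights, lam1) + l2_penalty(weights, lam2)

weights = np.array([0.5, -1.0, 0.0, 2.0])
data_loss = 0.30  # hypothetical loss measured on the training data

# The regularized objective is the data loss plus the complexity penalty
total_loss = data_loss + elastic_net_penalty(weights, lam1=0.01, lam2=0.01)
print(l1_penalty(weights, 0.01), l2_penalty(weights, 0.01), total_loss)
```

In Keras 2, a penalty of this kind can be attached to a layer through its `kernel_regularizer` argument, for example with `regularizers.l2(0.01)`.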

### Hyperparameter tuning

* For a given net, there are indeed multiple parameters that can be optimized (such as the number of hidden neurons, BATCH_SIZE, number of epochs, and many more);
* Hyperparameter tuning is the process of finding the optimal combination of those parameters that minimizes the cost function.
* In other words, the parameters are divided into buckets, and different combinations of values are checked via a brute force approach.
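The brute force approach amounts to a grid search. In the sketch below, `validation_loss` is a hypothetical stand-in for training a network and measuring its loss on the validation set; in a real experiment it would build, compile, and fit a model. The grid values are also assumptions.

```python
from itertools import product

def validation_loss(n_hidden, batch_size, dropout):
    """Hypothetical stand-in for training a model and returning its validation loss."""
    return (abs(n_hidden - 128) / 128
            + abs(batch_size - 128) / 128
            + abs(dropout - 0.3))

# Each hyperparameter is divided into a small bucket of candidate values
grid = {
    "n_hidden": [64, 128, 256],
    "batch_size": [64, 128, 256],
    "dropout": [0.1, 0.3, 0.5],
}

best_params, best_loss = None, float("inf")
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    loss = validation_loss(**params)
    if loss < best_loss:
        best_params, best_loss = params, loss

print(best_params, best_loss)
```

The number of combinations grows multiplicatively with every added hyperparameter (here 3 x 3 x 3 = 27), which is why exhaustive search quickly becomes expensive.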

### Predicting Output

* You can use the following methods for predicting the output with Keras:
  - `model.predict(X)`: This is used to predict the Y values;
  - `model.evaluate()`: This is used to compute the loss values;
  - `model.predict_classes()`: This is used to compute category outputs;
  - `model.predict_proba()`: This is used to compute class probabilities.
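For a classifier ending in a softmax layer, class probabilities and category outputs are closely related: the category is the index of the largest probability. The sketch below illustrates that relationship in NumPy, using a hypothetical probability matrix in place of a real model output.

```python
import numpy as np

# Hypothetical probabilities for three samples and four classes,
# standing in for the output of a trained softmax classifier
probabilities = np.array([
    [0.10, 0.70, 0.15, 0.05],
    [0.25, 0.25, 0.40, 0.10],
    [0.90, 0.05, 0.03, 0.02],
])

# Category output: index of the most probable class for each sample
predicted_classes = np.argmax(probabilities, axis=1)

# Probability assigned to the chosen class
confidence = probabilities[np.arange(len(probabilities)), predicted_classes]

print(predicted_classes, confidence)
```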

## Getting Started with Keras

### What is a tensor?

* A tensor is nothing but a multidimensional array or matrix;
* Keras uses either Theano or TensorFlow to perform very efficient computations on tensors;
* Both the backends are capable of efficient symbolic computations on tensors, which are the fundamental building blocks for creating neural networks.

### Predefined Neural Network Layers

* **Regular dense**: A dense model is a fully connected neural network layer;
* **Recurrent neural networks -- simple LSTM and GRU**: Recurrent neural networks are a class of neural networks that exploit the sequential nature of their input. Such inputs could be text, speech, time series, and anything else where the occurrence of an element in the sequence depends on the elements that appeared before it;
* **Convolutional and pooling layers**: ConvNets are a class of neural networks that use convolutional and pooling operations to progressively learn sophisticated models based on increasing levels of abstraction. This resembles vision models that have evolved over millions of years inside the human brain. A few years ago, networks with 3-5 layers were called deep; now the number has gone up to 100-200;
* **Regularization**: Regularization is a way to prevent overfitting. Multiple layers have parameters for regularization. One example is `Dropout`, but there are others;
* **Batch normalization**: It's a way to accelerate learning and generally achieve better accuracy;

### Loss functions

Loss functions (also called objective functions or optimization score functions) can be classified into four categories:

* **Accuracy**, which is used for classification problems;
* **Error loss**, which measures the difference between the values predicted and the values actually observed. There are multiple choices: `mse` (mean square error), `rmse` (root mean square error), `mae` (mean absolute error), `mape` (mean absolute percentage error) and `msle` (mean squared logarithmic error);
* **Hinge loss**, which is generally used for training classifiers;
* **Class loss**, which is used to calculate the cross-entropy for classification problems (see https://en.wikipedia.org/wiki/Cross_entropy).
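A few of these losses are short enough to write out by hand. The NumPy sketch below follows the standard definitions of mean square error, mean absolute error, and categorical cross-entropy; the sample targets and predictions are hypothetical, and the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean square error."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return np.mean(-np.sum(y_true * np.log(y_pred), axis=1))

# Regression-style example
targets = np.array([1.0, 2.0, 3.0])
predictions = np.array([1.5, 2.0, 2.0])
print(mse(targets, predictions), mae(targets, predictions))

# Classification-style example with one-hot targets
one_hot_targets = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
class_probs = np.array([[0.1, 0.8, 0.1], [0.6, 0.3, 0.1]])
print(categorical_crossentropy(one_hot_targets, class_probs))
```

Cross-entropy only looks at the probability assigned to the correct class, so it drops toward zero as that probability approaches 1.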

### Metrics

A metric function is similar to an objective function. The only difference is that the results from evaluating a metric are not used when training the model.

### Optimizers

Optimizers include `SGD`, `RMSprop`, and `Adam`.

## Deep learning with Convolutional Networks (ConvNets)

* Leverage spatial information and are suited for classifying images;
* Inspired by how our vision relies on multiple cortex levels, with each one recognizing more and more structured information;
* Two different types of layers, convolutional and pooling, are typically alternated.

### Local receptive fields

* To preserve spatial information, we represent each image with a matrix of pixels;
* A simple way to encode the local structure is to connect a submatrix of adjacent input neurons into one single hidden neuron (which is the **local receptive field**) belonging to the next layer;
* Of course, we can encode more information by having overlapping submatrices;
* In Keras, the size of each single submatrix is the kernel size, and the number of pixels by which the submatrix is shifted is called the _stride length_; both are hyperparameters that can be fine-tuned during the construction of our nets;
* Of course, we can have multiple feature maps that learn independently from each hidden layer.

![ConvNet example](ConvNet.gif)

* Rather than focus on one pixel at a time, ConvNets take in square patches of pixels and pass them through a _filter_ (or _kernel_), and the job of the filter is to find patterns in the pixels;
* We are going to take the dot product of the filter with this patch of the image channel. If the two matrices have high values in the same positions, the dot product’s output will be high. If they don’t, it will be low.
* We start in the upper lefthand corner of the image and we move the filter across the image step by step until it reaches the upper righthand corner. The size of the step is known as `stride`. You can move the filter to the right 1 column at a time, or you can choose to make larger steps;
* At each step, you take another dot product, and you place the results of that dot product in a third matrix known as an `activation map`;
* The width, or number of columns, of the activation map is equal to the number of steps the filter takes to traverse the underlying image;
* Since larger strides lead to fewer steps, a big stride will produce a smaller activation map.
* This is important, because the size of the matrices that convolutional networks process and produce at each layer is directly proportional to how computationally expensive they are and how much time they take to train.
* **A larger stride means less time and compute.**
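The sliding dot product and the effect of the stride can be sketched with two nested loops. This is a minimal NumPy illustration without padding, not how Keras computes convolutions internally; the function name `convolve2d`, the synthetic image, and the averaging filter are assumptions made for the example.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the activation map."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    activation_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = image[r:r + kh, c:c + kw]
            # Dot product between the filter and the current patch
            activation_map[i, j] = np.sum(patch * kernel)
    return activation_map

# Synthetic 28 x 28 "image" and a hypothetical 5 x 5 averaging filter
image = np.arange(28 * 28, dtype="float64").reshape(28, 28)
kernel = np.ones((5, 5)) / 25.0

map_stride1 = convolve2d(image, kernel, stride=1)
map_stride2 = convolve2d(image, kernel, stride=2)

print(map_stride1.shape, map_stride2.shape)
```

Doubling the stride halves each side of the activation map, so the number of dot products falls by roughly a factor of four.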

### Max Pooling/Downsampling

![Max Pool example](MaxPool.png)

* The activation maps are fed into a downsampling layer, and like convolutions, this method is applied one patch at a time;
* In this case, max pooling simply takes the largest value from one patch of an image;
* Much information is lost in this step, which has spurred research into alternative methods. But downsampling has the advantage, precisely because information is lost, of decreasing the amount of storage and processing required;

### Average Pooling

* The alternative method to Max Pooling is simply taking the average of the regions, which is called _average pooling_.
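Both pooling variants can be expressed with a reshape that groups the activation map into non-overlapping patches. The sketch below is a NumPy illustration under that assumption (patch size equal to the stride); the function name `pool2d` and the sample values are hypothetical.

```python
import numpy as np

def pool2d(activation_map, size=2, mode="max"):
    """Downsample with non-overlapping size x size patches."""
    h, w = activation_map.shape
    h, w = h - h % size, w - w % size  # drop rows/columns that do not fill a patch
    patches = activation_map[:h, :w].reshape(h // size, size, w // size, size)
    if mode == "max":
        return patches.max(axis=(1, 3))
    return patches.mean(axis=(1, 3))

activation_map = np.array([
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 2, 1, 0],
    [4, 6, 3, 5],
], dtype="float64")

max_pooled = pool2d(activation_map, mode="max")
avg_pooled = pool2d(activation_map, mode="average")

print(max_pooled)
print(avg_pooled)
```

Either way, a 4 x 4 map shrinks to 2 x 2, which is the storage and processing saving that downsampling buys.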

### ConvNets Summary

![ConvNets Summary](ConvNetSummary.png)

In the image above you can see:

* The actual input image that is scanned for features;
* Activation maps stacked atop one another, one for each filter you employ;
* The activation maps condensed through downsampling;
* A new set of activation maps created by passing filters over the first downsampled stack;
* The second downsampling, which condenses the second set of activation maps;
* A fully connected layer that classifies output with one label per node.
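The pipeline in the list can be traced numerically by following the tensor shape through each stage. The sketch below assumes a 28 x 28 single-channel input, two convolution stages with 5 x 5 filters and padding that preserves the spatial size (20 and 50 filters), 2 x 2 downsampling after each, a 500-unit dense layer, and 10 output labels. All helper names are hypothetical.

```python
def conv_same(shape, filters, kernel):
    """Convolution with size-preserving padding: returns new shape and parameter count."""
    h, w, channels = shape
    params = kernel * kernel * channels * filters + filters  # weights + biases
    return (h, w, filters), params

def pool(shape, size=2):
    """Downsampling divides height and width by the pool size."""
    h, w, channels = shape
    return (h // size, w // size, channels)

def dense(units_in, units_out):
    """Fully connected layer: returns output size and parameter count."""
    return units_out, units_in * units_out + units_out

shape = (28, 28, 1)
total_params = 0

shape, params = conv_same(shape, filters=20, kernel=5)
total_params += params
shape = pool(shape)

shape, params = conv_same(shape, filters=50, kernel=5)
total_params += params
shape = pool(shape)

flat = shape[0] * shape[1] * shape[2]  # flatten before the dense layers
units, params = dense(flat, 500)
total_params += params
units, params = dense(units, 10)
total_params += params

print(shape, flat, total_params)
```

Most of the parameters sit in the first dense layer, not in the convolutional stages, because the filters are shared across all positions of the image.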

Several CNN architectures have been key in building the algorithms that power AI today and are likely to keep doing so in the foreseeable future:

1. LeNet
2. AlexNet
3. VGGNet
4. GoogLeNet
5. ResNet
6. ZFNet

### LeNet code in Keras



In [None]:
from keras import backend as K
from keras.models import Sequential
from keras.layers.convolutional import Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras.layers.core import Activation
from keras.layers.core import Flatten
from keras.layers.core import Dense
from keras.optimizers import Adam
from keras.utils import np_utils
import numpy as np
import matplotlib.pyplot as plt

from keras.datasets import mnist

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

#define the ConvNet
class LeNet:
    @staticmethod
    def build(input_shape, classes):
        model = Sequential()
        # CONV => RELU => POOL
        # Here, 20 is the number of convolution kernels/filters to use, each one with the size 5x5 and padding='same' means that padding is used.
        # Output dimension is the same one of the input shape, so it will be 28 x 28
        # pool_size=(2, 2) represents the factors by which the image is vertically and horizontally downscaled
        model.add(Conv2D(20, kernel_size=5, padding="same", input_shape=input_shape))
        model.add(Activation("relu"))
        model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
        # CONV => RELU => POOL
        # A second convolutional stage with ReLU activations follows
        # In this case, we increase the number of convolutional filters learned to 50
        # Increasing the number of filters in deeper layers is a common technique used in deep learning
        model.add(Conv2D(50, kernel_size=5, padding="same"))
        model.add(Activation("relu"))
        model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
        # Flatten => RELU layers
        # Pretty standard flattening and a dense network of 500 neurons
        model.add(Flatten())
        model.add(Dense(500))
        model.add(Activation("relu"))
        # Softmax classifier
        model.add(Dense(classes))
        model.add(Activation("softmax"))
        return model

# Training parameters
NB_EPOCH = 20
BATCH_SIZE = 128
VERBOSE = 1
OPTIMIZER = Adam()
VALIDATION_SPLIT = 0.2
IMG_ROWS, IMG_COLS = 28, 28 # input image dimensions
NB_CLASSES = 10 # number of outputs
INPUT_SHAPE = (1, IMG_ROWS, IMG_COLS)

# data: shuffled and split between train and test sets
(X_train, Y_train), (X_test, Y_test) = mnist.load_data()
K.set_image_dim_ordering("th")

# consider them as float and normalize
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

# we need a 60K x [1 x 28 x 28] shape as input to the CONVNET
X_train = X_train[:, np.newaxis, :, :]
X_test = X_test[:, np.newaxis, :, :]

print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(Y_train, NB_CLASSES)
Y_test = np_utils.to_categorical(Y_test, NB_CLASSES)

# initialize the optimizer and model
model = LeNet.build(input_shape=INPUT_SHAPE, classes=NB_CLASSES)
model.summary()
model.compile(loss="categorical_crossentropy", optimizer=OPTIMIZER, metrics=["accuracy"])
history = model.fit(X_train, Y_train, batch_size=BATCH_SIZE, epochs=NB_EPOCH, verbose=VERBOSE, validation_split=VALIDATION_SPLIT)
score = model.evaluate(X_test, Y_test, verbose=VERBOSE)

print("Test score:", score[0])
print('Test accuracy:', score[1])

# list all data in history
print(history.history.keys())

# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

* Another test that we can run to better understand the power of deep learning and ConvNet is to reduce the size of the training set and observe the consequent decay in performance;
* The training set used for our model is progressively reduced to 5,900, 3,000, 1,800, 600, and 300 examples;
* Our test set is always fixed and it consists of 10,000 examples;
* Our deep network always outperforms the simple network and the gap is more and more evident when the number of examples provided for training is progressively reduced:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Scatter plot
X      = np.array([5900, 3000, 1800, 600, 300])
Y_conv = np.array([96.68, 92.32, 90.00, 79.14, 72.44])
Y      = np.array([85.56, 81.76, 76.65, 60.26, 48.26])
plt.ylim((0, 100))
plt.xlabel('Training samples')
plt.ylabel('Accuracy')
plt.plot(X, Y_conv, 'b-', X, Y, 'r-')
plt.show()

## Recognizing CIFAR-10 images with deep learning

* The CIFAR-10 dataset contains 60,000 color images of 32 x 32 pixels in 3 channels divided into 10 classes. Each class contains 6000 images;
* The training set contains 50,000 images, while the test set provides 10,000 images;

In [None]:
from keras.datasets import cifar10
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Conv2D, MaxPooling2D
from keras.optimizers import RMSprop
import matplotlib.pyplot as plt

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

# CIFAR_10 is a set of 60K images 32x32 pixels on 3 channels
IMG_CHANNELS = 3
IMG_ROWS = 32
IMG_COLS = 32

#constant
BATCH_SIZE = 128
NB_EPOCH = 20
NB_CLASSES = 10
VERBOSE = 1
VALIDATION_SPLIT = 0.2
OPTIM = RMSprop()

#load dataset
(X_train, Y_train), (X_test, Y_test) = cifar10.load_data()
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# Convert to categorical
Y_train = np_utils.to_categorical(Y_train, NB_CLASSES)
Y_test = np_utils.to_categorical(Y_test, NB_CLASSES)

# Float and normalization
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

# Our net will learn 32 convolutional filters, each of which with a 3 x 3 size.
model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same', input_shape=(IMG_ROWS, IMG_COLS, IMG_CHANNELS)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
# Dense layer of 512 units and ReLU activation + dropout at 50% + softmax layer with 10 classes (one for each category)
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(NB_CLASSES))
model.add(Activation('softmax'))
model.summary()


# Training
model.compile(loss='categorical_crossentropy', optimizer=OPTIM, metrics=['accuracy'])
model.fit(X_train, Y_train, batch_size=BATCH_SIZE, epochs=NB_EPOCH, validation_split=VALIDATION_SPLIT, verbose=VERBOSE)
score = model.evaluate(X_test, Y_test, batch_size=BATCH_SIZE, verbose=VERBOSE)
print("Test score:", score[0])
print('Test accuracy:', score[1])

# Save the model
model_json = model.to_json()
with open('cifar10_architecture.json', 'w') as json_file:
    json_file.write(model_json)
model.save_weights('cifar10_weights.h5', overwrite=True)

### Improving the CIFAR-10 performance with a deeper network

* One way to improve the performance is to define a deeper network with multiple convolutional operations;

In [None]:
from keras.datasets import cifar10
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Conv2D, MaxPooling2D
from keras.optimizers import RMSprop

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

# CIFAR_10 is a set of 60K images 32x32 pixels on 3 channels
IMG_CHANNELS = 3
IMG_ROWS = 32
IMG_COLS = 32

#constant
BATCH_SIZE = 128
NB_EPOCH = 20
NB_CLASSES = 10
VERBOSE = 1
VALIDATION_SPLIT = 0.2
OPTIM = RMSprop()

#load dataset
(X_train, Y_train), (X_test, Y_test) = cifar10.load_data()
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# Convert to categorical
Y_train = np_utils.to_categorical(Y_train, NB_CLASSES)
Y_test = np_utils.to_categorical(Y_test, NB_CLASSES)

# Float and normalization
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

# conv + conv + maxpool + dropout + conv + conv + maxpool + dense + dropout + dense
model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same', input_shape=(IMG_ROWS, IMG_COLS, IMG_CHANNELS)))
model.add(Activation('relu'))
model.add(Conv2D(32, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(NB_CLASSES))
model.add(Activation('softmax'))
model.summary()

# Training
model.compile(loss='categorical_crossentropy', optimizer=OPTIM, metrics=['accuracy'])
model.fit(X_train, Y_train, batch_size=BATCH_SIZE, epochs=NB_EPOCH, validation_split=VALIDATION_SPLIT, verbose=VERBOSE)
score = model.evaluate(X_test, Y_test, batch_size=BATCH_SIZE, verbose=VERBOSE)
print("Test score:", score[0])
print('Test accuracy:', score[1])

# Save the model
model_json = model.to_json()
open('cifar10_architecture.json', 'w').write(model_json)
model.save_weights('cifar10_weights.h5', overwrite=True)
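
To see why this stack fits together, it can help to trace the spatial size of the feature maps from layer to layer. The following is a minimal sketch in plain Python, not part of Keras: the helper names `conv_output_size` and `pool_output_size` are hypothetical, and it assumes that every convolution uses a 3x3 kernel with stride 1 and that the fourth convolution uses the default `'valid'` padding.

```python
def conv_output_size(size, kernel, padding, stride=1):
    """Spatial output size of a convolution along one dimension."""
    if padding == 'same':
        # 'same' pads the input so that the output is ceil(size / stride)
        return -(-size // stride)
    # 'valid' uses no padding, so the kernel must fit entirely inside the input
    return (size - kernel) // stride + 1


def pool_output_size(size, pool):
    """Spatial output size of non-overlapping max pooling (stride == pool size)."""
    return size // pool


size = 32                                  # CIFAR-10 images are 32x32
size = conv_output_size(size, 3, 'same')   # first conv, padding='same'  -> 32
size = conv_output_size(size, 3, 'same')   # second conv, padding='same' -> 32
size = pool_output_size(size, 2)           # 2x2 max pooling             -> 16
size = conv_output_size(size, 3, 'same')   # third conv, padding='same'  -> 16
size = conv_output_size(size, 3, 'valid')  # fourth conv, assumed 'valid' -> 14
size = pool_output_size(size, 2)           # 2x2 max pooling             -> 7
flattened = size * size * 64               # 64 filters in the last conv block
print(size, flattened)  # 7 3136
```

Under these assumptions, the `Flatten` layer hands a vector of 3,136 values to the 512-unit dense layer.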

### Improving the CIFAR-10 performance with data augmentation

* Another way to improve the performance is to generate more images for our training;
* We can take the CIFAR training set and augment it with multiple transformations including rotation, rescaling, horizontal/vertical flip, zooming, channel shift, and many more:

In [None]:
from keras.datasets import cifar10
from keras.preprocessing.image import ImageDataGenerator
from keras.utils import np_utils
import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Conv2D, MaxPooling2D
from keras.optimizers import RMSprop

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

# CIFAR_10 is a set of 60K images 32x32 pixels on 3 channels
IMG_CHANNELS = 3
IMG_ROWS = 32
IMG_COLS = 32

# Constants
NUM_TO_AUGMENT=5
BATCH_SIZE = 128
NB_EPOCH = 20
NB_CLASSES = 10
VERBOSE = 1
VALIDATION_SPLIT = 0.2
OPTIM = RMSprop()

#load dataset
(X_train, Y_train), (X_test, Y_test) = cifar10.load_data()

# Augmenting
# The rotation_range is a value in degrees (0 - 180) for randomly rotating pictures
# width_shift and height_shift are ranges for randomly translating pictures horizontally or vertically
# zoom_range is for randomly zooming pictures
# horizontal_flip is for randomly flipping half of the images horizontally
# fill_mode is the strategy used for filling in new pixels that can appear after a rotation or a shift
print("Augmenting training set images...")
datagen = ImageDataGenerator(
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest'
)
datagen.fit(X_train)

# Convert to categorical
Y_train = np_utils.to_categorical(Y_train, NB_CLASSES)
Y_test = np_utils.to_categorical(Y_test, NB_CLASSES)

# Float and normalization
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

# conv + conv + maxpool + dropout + conv + conv + maxpool + dense + dropout + dense
model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same', input_shape=(IMG_ROWS, IMG_COLS, IMG_CHANNELS)))
model.add(Activation('relu'))
model.add(Conv2D(32, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(NB_CLASSES))
model.add(Activation('softmax'))
model.summary()

# Training
model.compile(loss='categorical_crossentropy', optimizer=OPTIM, metrics=['accuracy'])
history = model.fit_generator(
    datagen.flow(X_train, Y_train, batch_size=BATCH_SIZE),
    steps_per_epoch=len(X_train) // BATCH_SIZE,
    epochs=NB_EPOCH,
    verbose=VERBOSE
)

score = model.evaluate(X_test, Y_test, batch_size=BATCH_SIZE, verbose=VERBOSE)
print("Test score:", score[0])
print('Test accuracy:', score[1])

# Save the model
model_json = model.to_json()
open('cifar10_architecture.json', 'w').write(model_json)
model.save_weights('cifar10_weights.h5', overwrite=True)
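
To make two of the transformations above concrete, here is a toy sketch in plain Python that applies a horizontal flip and a one-pixel shift with nearest-pixel filling to a tiny 3x3 "image". It only illustrates the idea and is not how `ImageDataGenerator` is implemented; the function names are hypothetical.

```python
def horizontal_flip(img):
    """Mirror each row of a 2-D image stored as a list of rows."""
    return [row[::-1] for row in img]


def shift_right_nearest(img, pixels):
    """Shift every row to the right, filling new pixels with the nearest edge value."""
    shifted = []
    for row in img:
        width = len(row)
        # Pixel `col` takes the value that was `pixels` positions to its left;
        # positions that fall off the left edge reuse the first pixel of the row
        shifted.append([row[max(col - pixels, 0)] for col in range(width)])
    return shifted


img = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
print(horizontal_flip(img))         # [[3, 2, 1], [6, 5, 4], [9, 8, 7]]
print(shift_right_nearest(img, 1))  # [[1, 1, 2], [4, 4, 5], [7, 7, 8]]
```

Each transformed copy keeps the label of the original image, which is why augmentation can enlarge the training set without new annotations.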

### Predicting with CIFAR-10

* Suppose that we want to use the deep learning model we just trained for CIFAR-10 for a bulk evaluation of images:

In [None]:
import numpy as np
import scipy.misc
from keras.models import model_from_json
from keras.optimizers import SGD

# Load the model
model_architecture = 'cifar10_architecture.json'
model_weights = 'cifar10_weights.h5'
model = model_from_json(open(model_architecture).read())
model.load_weights(model_weights)

# Load images
img_names = ['cat-standing.jpg', 'dog.jpg']
imgs = [np.transpose(scipy.misc.imresize(scipy.misc.imread(img_name), (32, 32)), (1, 0, 2)).astype('float32') for img_name in img_names]
imgs = np.array(imgs) / 255

# Compile the loaded model
optim = SGD()
model.compile(loss='categorical_crossentropy', optimizer=optim, metrics=['accuracy'])

# Predict
predictions = model.predict_classes(imgs)
print(predictions)
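
The predictions are integer class IDs. A small lookup table turns them into readable names; the sketch below uses the standard CIFAR-10 class order, while the helper name `to_label_names` and the example IDs are made up for illustration.

```python
# The ten CIFAR-10 classes, in the order of their integer labels (0-9)
CIFAR10_LABELS = ['airplane', 'automobile', 'bird', 'cat', 'deer',
                  'dog', 'frog', 'horse', 'ship', 'truck']


def to_label_names(class_ids):
    """Map predicted class IDs to human-readable CIFAR-10 class names."""
    return [CIFAR10_LABELS[i] for i in class_ids]


example_predictions = [3, 5]  # hypothetical model output for two images
print(to_label_names(example_predictions))  # ['cat', 'dog']
```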

## Very deep convolutional networks for large-scale image recognition

* In 2014, an interesting contribution to image recognition was presented in the paper _Very Deep Convolutional Networks for Large-Scale Image Recognition_ by K. Simonyan and A. Zisserman;
* The paper shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers;
* One model in the paper, denoted as D or **VGG-16**, has 16 weight layers.

### Utilizing Keras built-in VGG-16 net module

* Keras _applications_ are pre-built and pre-trained deep learning models;
* Weights are downloaded automatically when instantiating a model and stored at `~/.keras/models/`;

In [None]:
from keras.models import Model
from keras.preprocessing import image
from keras.optimizers import SGD
from keras.applications.vgg16 import VGG16
import matplotlib.pyplot as plt
import numpy as np
import cv2

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

# Pre-built model with pre-trained weights on ImageNet
model = VGG16(weights='imagenet', include_top=True)
sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(optimizer=sgd, loss='categorical_crossentropy')

# Resize into VGG16 trained images' format
im = cv2.resize(cv2.imread('steam-locomotive.png'), (224, 224))
im = np.expand_dims(im, axis=0)

# Predict
out = model.predict(im)
plt.plot(out.ravel())
plt.show()
print(np.argmax(out)) # this should print 820 for steam locomotive

### Recycling pre-built deep learning models for extracting features

* One very simple idea is to use VGG-16 and, more generally, deep convolutional neural networks (DCNNs), for feature extraction.
* Why would we want to extract the features from an intermediate layer in a DCNN?
  - As the network learns to classify images into categories, each layer learns to identify the features that are necessary to do the final classification;
  - Lower layers identify lower order features such as color and edges;
  - Higher layers compose these lower order features into higher order features such as shapes or objects.
* This has many advantages:
  - We can rely on publicly available large-scale training and transfer this learning to novel domains;
  - We can save time for expensive large training.

In [None]:
from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
import numpy as np

# Pre-built and pre-trained deep learning VGG16 model
base_model = VGG16(weights='imagenet', include_top=True)

for i, layer in enumerate(base_model.layers):
    print(i, layer.name, layer.output_shape)

# Extract features from block4_pool block
model = Model(inputs=base_model.input, outputs=base_model.get_layer('block4_pool').output)
img_path = 'cat.png'
img = image.load_img(img_path, target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)
# Get the features from this block
features = model.predict(x)

### Very deep inception-v3 net used for transfer learning

* Transfer learning is a very powerful deep learning technique which has many applications in different domains;
* It works like using your knowledge of English to learn Spanish;
* Computer vision researchers commonly use pre-trained CNNs to generate representations for novel tasks, where the dataset may not be large enough to train an entire CNN from scratch;
* Another common tactic is to take the pretrained ImageNet network and then to fine-tune the entire network to the novel task;
* **Inception-v3** net is a very deep ConvNet developed by Google;
* The default input size for this model is 299 x 299 on three channels;

![Inception-v3](Inception-v3.png)

* Suppose we have a training dataset D in a domain different from ImageNet. D has 1,024 features in input and 200 categories in output;
* The new top of the network is a dense layer with 1,024 units, followed by a softmax dense layer with 200 output classes;
* `x = GlobalAveragePooling2D()(x)` is used to convert the input to the correct shape for the dense layer to handle;

The `base_model.output` tensor has the shape `(samples, channels, rows, cols)` for `dim_ordering="th"` or `(samples, rows, cols, channels)` for `dim_ordering="tf"`, but the dense layer needs its input as `(samples, channels)`, so `GlobalAveragePooling2D` averages across `(rows, cols)`. If you look at the last four layers (where `include_top=True`), you see these shapes:

```
# layer.name, layer.input_shape, layer.output_shape
('mixed10', [(None, 8, 8, 320), (None, 8, 8, 768), (None, 8, 8, 768),
(None, 8, 8, 192)], (None, 8, 8, 2048))
('avg_pool', (None, 8, 8, 2048), (None, 1, 1, 2048))
('flatten', (None, 1, 1, 2048), (None, 2048))
('predictions', (None, 2048), (None, 1000))
```

When you do `include_top=False`, you are removing the last three layers and exposing the `mixed10` layer so the `GlobalAveragePooling2D` layer converts the `(None, 8, 8, 2048)` to `(None, 2048)`, where each element in the `(None, 2048)` tensor is the average value for each corresponding `(8, 8)` subtensor in the `(None, 8, 8, 2048)` tensor.
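
The averaging itself is simple. As a minimal sketch in plain Python (not the Keras implementation), the hypothetical function below pools a single sample stored as a nested `(rows, cols, channels)` list down to one value per channel:

```python
def global_average_pooling_2d(feature_map):
    """Average a (rows, cols, channels) nested list over rows and cols."""
    rows = len(feature_map)
    cols = len(feature_map[0])
    channels = len(feature_map[0][0])
    pooled = []
    for c in range(channels):
        # Sum the values of channel c over every spatial position
        total = sum(feature_map[r][k][c] for r in range(rows) for k in range(cols))
        pooled.append(total / (rows * cols))
    return pooled


# A toy 2x2 feature map with 2 channels (values made up for illustration)
feature_map = [
    [[1.0, 10.0], [2.0, 20.0]],
    [[3.0, 30.0], [4.0, 40.0]],
]
print(global_average_pooling_2d(feature_map))  # [2.5, 25.0]
```

For Inception-v3, the same operation reduces each `(8, 8, 2048)` sample to a vector of 2,048 averages.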

* We'll then have a new deep network that reuses the standard Inception-v3 network, but it is trained on a new domain D via transfer learning;
* Even though there are many parameters to fine-tune for achieving good accuracy, we are now reusing a very large pretrained network as a starting point via transfer learning;

In [None]:
from keras.applications.inception_v3 import InceptionV3
from keras.preprocessing import image
from keras.models import Model
from keras.layers import Dense, GlobalAveragePooling2D
from keras import backend as K
from keras.optimizers import SGD

# Create the base pre-trained model
# We don't include the top model because we want to finetune on D
base_model = InceptionV3(weights='imagenet', include_top=False)

# Add a global spatial average pooling layer
x = base_model.output
x = GlobalAveragePooling2D()(x)
# Add a fully-connected layer as the first new layer
x = Dense(1024, activation='relu')(x)
# Add a softmax layer with 200 classes as the last layer
predictions = Dense(200, activation='softmax')(x)
# This is the model to train
model = Model(inputs=base_model.input, outputs=predictions)

# All the convolutional levels are pre-trained, so we freeze them during the training of the full model
for layer in base_model.layers:
    layer.trainable = False

# compile the model (should be done after setting layers to nontrainable)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

# Train the model on the new data for a few epochs so that the new top layers are trained
model.fit_generator(...)

# Then we freeze the bottom layers of Inception and fine-tune the top ones
# In this example, we decide to freeze the first 172 layers (a hyperparameter to tune)
for layer in model.layers[:172]:
    layer.trainable = False
for layer in model.layers[172:]:
    layer.trainable = True

# The model is then recompiled for fine-tune optimization. We use SGD with a low learning rate
model.compile(optimizer=SGD(lr=0.0001, momentum=0.9), loss='categorical_crossentropy')

# We train our model again (this time fine-tuning the top 2 inception blocks) alongside the top Dense layers
model.fit_generator(...)

## Generative adversarial networks (GAN) and WaveNet

* Yann LeCun has described GANs as the most interesting idea in the last 10 years of machine learning;
* The key intuition of GAN can be easily considered as analogous to art forgery, which is the process of creating works of art that are falsely credited to other, usually more famous, artists;
* GANs are able to learn how to reproduce synthetic data that looks real;
* Computers can learn how to paint and create realistic images;
* _WaveNet_ is a deep generative network proposed by Google DeepMind to teach computers how to reproduce human voices and musical instruments, both with impressive quality;
* GANs train two neural nets simultaneously:
  - The generator G(Z) makes the forgery;
  - The discriminator D(Y) can judge how realistic the reproductions are;
* G(Z) takes an input from a random noise, Z, and trains itself to fool D into thinking that whatever G(Z) produces is real;
* So, G and D play opposing games, hence the name adversarial training; their objectives are expressed as loss functions optimized via gradient descent;
* The generative model learns how to forge more successfully, and the discriminative model learns how to recognize forgery more successfully.
* At the end, the generator will learn how to produce forged images that are indistinguishable from real ones;
* GANs require finding the equilibrium in a game with two players;
* Sometimes the two players eventually reach an equilibrium, but this is not always guaranteed and the two players can continue playing for a long time;
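
The opposing objectives can be written down for a single real sample and a single forged one. The sketch below is a minimal illustration in plain Python, assuming the discriminator outputs a probability that its input is real; the function names are hypothetical, and the generator loss shown is the commonly used non-saturating variant, which is one choice among several.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy for D: real samples should score 1, forgeries 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_loss(d_fake):
    """Non-saturating loss for G: it wants D to score its forgeries as real."""
    return -math.log(d_fake)


# D is confident and correct: its own loss is low, while G pays a high loss
print(discriminator_loss(0.9, 0.1), generator_loss(0.1))

# D cannot tell real from forged and outputs 0.5 for everything
print(discriminator_loss(0.5, 0.5), generator_loss(0.5))
```

When the discriminator outputs 0.5 for every input, its loss equals `2 * log(2)`, which corresponds to the equilibrium where forgeries are indistinguishable from real data.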

## Word Embeddings

* It's a way to transform words in text to numerical vectors so that they can be analyzed by machine learning algorithms;
* _One-hot encoding_ is the most basic embedding approach. It represents a word by a vector of the size of the vocabulary, where only the entry corresponding to that word is 1 and all the others are 0;
* The main problem with _one-hot encoding_ is that there's no way to represent the similarity between words;
* Similarity between vectors is computed using the dot product, and the dot product between the one-hot vectors of two different words is always zero;
* The NLP community has borrowed techniques such as TF-IDF, _latent semantic analysis (LSA)_ and topic modeling to use the documents as the context;
* However, these representations capture a slightly different document-centric idea of semantic similarity;
* Today, word embedding is the technique of choice for vectorizing text for all kinds of NLP tasks, such as text classification, document clustering, part of speech tagging, named entity recognition, sentiment analysis and so on;
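
The limitation of one-hot encoding is easy to check by hand. In the toy sketch below (three-word vocabulary chosen only for illustration), the dot product cannot tell that two city names are more alike than a city and a fruit:

```python
vocabulary = ['paris', 'berlin', 'banana']


def one_hot(word):
    """Vector of vocabulary size with a single 1 at the word's position."""
    return [1 if w == word else 0 for w in vocabulary]


def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


print(one_hot('berlin'))                         # [0, 1, 0]
print(dot(one_hot('paris'), one_hot('berlin')))  # 0
print(dot(one_hot('paris'), one_hot('banana')))  # 0
print(dot(one_hot('paris'), one_hot('paris')))   # 1
```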

### Distributed representations

* _Distributed representations_ attempt to capture the meaning of a word by considering its relations with other words in its context;

For example, consider the following pair of sentences:

1. _Paris is the capital of France_
2. _Berlin is the capital of Germany_

Even with no knowledge of world geography, you would conclude without too much effort that the word pairs `(Paris, Berlin)` and `(France, Germany)` are related in some way:

`Paris : France :: Berlin : Germany`

Thus, the aim of distributed representations is to find a general transformation function φ to convert each word to its associated vector such that relations of the following form hold true:

`φ("Paris") - φ("France") ≈ φ("Berlin") - φ("Germany")`
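
As a toy illustration of this relation, the sketch below uses hand-made 2-D vectors (not learned embeddings; the values are chosen only so that the arithmetic is easy to follow). The first coordinate loosely encodes the country and the second one encodes "capital versus country":

```python
# Hypothetical embedding table standing in for the transformation φ
phi = {
    'Paris':   [1, 1],
    'France':  [1, 0],
    'Berlin':  [2, 1],
    'Germany': [2, 0],
}


def subtract(u, v):
    """Element-wise difference of two vectors."""
    return [a - b for a, b in zip(u, v)]


print(subtract(phi['Paris'], phi['France']))    # [0, 1]
print(subtract(phi['Berlin'], phi['Germany']))  # [0, 1]
```

Both differences point in the same "is the capital of" direction. Real embeddings satisfy the relation only approximately.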

### word2vec

* Created in 2013 by a team of researchers at Google led by Tomas Mikolov;
* The models are unsupervised, taking as input a large corpus of text and producing a vector space of words;
* The dimensionality of the word2vec embedding space is usually lower than the dimensionality of the one-hot embedding space;
* It has 2 architectures:
  - Continuous bag of words (CBOW);
  - Skip-gram;
* In CBOW, the model predicts the current word given a window of surrounding words;
* In the Skip-gram architecture, the model predicts the surrounding words given the center word;
* According to authors, CBOW is faster, but Skip-gram does a better job at predicting _infrequent words_;
* It's interesting to note that both flavors of word2vec are shallow neural networks;

#### Skip-gram word2vec model

* The skip-gram model is trained to predict the surrounding words given the current word;

Consider this example:

`I love green eggs and ham.`

Assuming a window size of three, we can break it into the following set of `(context, word)` pairs:

```
([I, green], love)
([love, eggs], green)
([green, and], eggs)
([eggs, ham], and)
```

* Since the skip-gram model predicts a context word given the center word, we can convert the preceding dataset to one of (input, output) pairs;
* We then generate positive examples by combining correct predictions with a result of 1 and negative examples by combining random words with a result of 0:

```
((love, I), 1)
((love, green), 1)
...
((love, ham), 0)
((love, and), 0)
```
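
The construction of these labeled pairs can be sketched in a few lines of plain Python. This is a simplified illustration, not the Keras `skipgrams` implementation: the function name and its parameters are hypothetical, one word on each side counts as context, and negatives are drawn uniformly at random, so a "negative" can occasionally coincide with a true context word.

```python
import random


def skipgram_examples(tokens, window=1, negatives_per_positive=1, seed=0):
    """Build ((center, other), label) examples: 1 for context words, 0 for random words."""
    rng = random.Random(seed)
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            # Positive example: a word that really occurs next to the center word
            examples.append(((center, tokens[j]), 1))
            # Negative example(s): the center word paired with a random word
            for _ in range(negatives_per_positive):
                examples.append(((center, rng.choice(tokens)), 0))
    return examples


tokens = ['I', 'love', 'green', 'eggs', 'and', 'ham']
examples = skipgram_examples(tokens)
positives = [pair for pair, label in examples if label == 1]
print(positives[:3])  # [('I', 'love'), ('love', 'I'), ('love', 'green')]
print(len(positives), len(examples))  # 10 20
```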

* We can now train a classifier that takes in a word vector and a context vector and learns to predict one or zero depending on whether it sees a positive or negative sample;
* The deliverables from this trained network are the weights of the word embedding layer;
* The skip-gram model can be built in Keras as follows. Assume that the vocabulary size is set at 5000, the output embedding size is 300 and the window size is 1 (a window size of one means that the context for a word is the words immediately to the left and right);

In [None]:
from keras.layers import Dot
from keras.layers.core import Dense, Reshape
from keras.layers.embeddings import Embedding
from keras.models import Sequential, Model
from keras.engine.input_layer import Input

vocab_size = 5000
embed_size = 300

# The input to this model is the word ID in the vocabulary
# The embedding weights are initially set to small random values
# The next layer reshapes the input to the embedding size
word_model = Sequential()
word_model.add(Embedding(vocab_size, embed_size, embeddings_initializer="glorot_uniform", input_length=1))
word_model.add(Reshape((embed_size, )))

# The other model that we need is a sequential model for the context words
# For each of our skip-gram pairs, we have a single context word corresponding to the target word
context_model = Sequential()
context_model.add(Embedding(vocab_size, embed_size, embeddings_initializer="glorot_uniform", input_length=1))
context_model.add(Reshape((embed_size,)))

# The outputs of the two models are each a vector of size (embed_size). They're both merged into one
# using a dot product and fed into a dense layer.
# The sigmoid activation function squashes the output into the range (0, 1): large positive inputs
# tend rapidly to 1 and large negative inputs tend rapidly to 0.
merged_output = Dot(axes=1)([word_model.output, context_model.output])
dot_product_output = Dense(1, kernel_initializer="glorot_uniform", activation="sigmoid")(merged_output)
model = Model([word_model.input, context_model.input], dot_product_output)

model.compile(loss="mean_squared_error", optimizer="adam")
model.summary()

# The loss function used is the mean_squared_error. The idea is to maximize the dot product for positive
# examples and minimize it for negative examples. The dot product multiplies corresponding elements of vectors
# and sums up the result, which causes similar vectors to have higher dot products than dissimilar vectors,
# since the former have more overlapping elements.

* Keras provides a convenience function to extract skip-grams for a text that has been converted to a list of word indices;
* This is an example of using this function to extract the first 10 of 56 skip-grams generated (both positive and negative);
* The tokenizer creates a dictionary mapping each unique word to an integer ID and makes it available in the `word_index` attribute;
* The `skipgrams` function randomly samples the results from the pool of possibilities for the positive examples;
* The process of negative sampling, used for generating the negative examples, consists of randomly pairing up arbitrary tokens from the text. As the size of the input increases, it is more likely to pick up unrelated word pairs, but in this small example it can end up generating positive examples as well;

In [None]:
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import skipgrams

#text = "I love green eggs and ham ."
text = "My life has been getting more and more complicated as the size of the input has increased with time."

# Declare the tokenizer and run the text against it.
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])

# Extract the {word: id} dictionary and create a two-way lookup table.
# The example values in the comments below correspond to the shorter, commented-out text above.
word2id = tokenizer.word_index # {'i': 1, 'love': 2, 'green': 3, 'eggs': 4, 'and': 5, 'ham': 6}
id2word = { v:k for k, v in word2id.items() } # {1: 'i', 2: 'love', 3: 'green', 4: 'eggs', 5: 'and', 6: 'ham'}

# Convert our input list of words to a list of IDs and pass it to the skipgrams function.
wids = [word2id[w] for w in text_to_word_sequence(text)] # [1, 2, 3, 4, 5, 6]
pairs, labels = skipgrams(wids, len(word2id))
#print('pairs', pairs) # [[6, 4], [6, 5], [5, 3], ..., [3, 5], [1, 3]]
#print('labels', labels) # [0, 1, 0, ..., 0, 1]

# Print the first 10 skip-gram pairs from the pool of possibilities
for i in range(10):
    print("(({:s} ({:d}), {:s} ({:d})), {:d})".format(
        id2word[pairs[i][0]], pairs[i][0],
        id2word[pairs[i][1]], pairs[i][1],
        labels[i])
    )

#### CBOW word2vec model

* The CBOW model predicts the center word given the context words;
* In the first tuple in the following example, the CBOW model needs to predict the output word `love`, given the context words `I` and `green`;

```
([I, green], love)
([love, eggs], green)
([green, and], eggs)
...
```

* Like the skip-gram model, the CBOW model is also a classifier that takes the context words as input and predicts the target word;
* The input to the model is the word IDs for the context words;
* These word IDs are fed into a common embedding layer that is initialized with small random weights;
* Each word ID is transformed into a vector of size `(embed_size)` by the embedding layer;
* Thus, each row of the input context is transformed into a matrix of size (2 * window_size, embed_size) by this layer;
* This is then fed into a lambda layer, which computes an average of all the embeddings;
* This average is then fed to a dense layer, which creates a dense vector of size `(vocab_size)` for each row;
* The activation function on the dense layer is a `softmax`, which converts the output vector into a probability distribution over the vocabulary;
* The ID with the maximum probability corresponds to the target word;

In [None]:
from keras.models import Sequential
from keras.layers.core import Dense, Lambda
from keras.layers.embeddings import Embedding
import keras.backend as K

vocab_size = 5000
embed_size = 300
window_size = 1

# Note that the input_length of this embedding layer is equal to the number of context words.
model = Sequential()
model.add(
    Embedding(
        input_dim=vocab_size,
        output_dim=embed_size,
        embeddings_initializer='glorot_uniform',
        input_length=window_size*2
    )
)
model.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(embed_size,)))
model.add(Dense(vocab_size, kernel_initializer='glorot_uniform', activation='softmax'))

# The loss function used here is categorical_crossentropy, which is a common choice for cases where there
# are two or more (in our case, vocab_size) categories.
model.compile(loss='categorical_crossentropy', optimizer="adam")

model.summary()
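
The forward pass described above (embedding lookup, averaging, dense layer, softmax) can be traced by hand on a tiny example. The sketch below is plain Python with made-up weights, a vocabulary of four words, embeddings of size two and no bias term; the function names are hypothetical and it only mirrors the structure of the Keras model, not its implementation.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cbow_forward(context_ids, embeddings, dense_weights):
    """Average the context embeddings, apply a dense layer, then softmax."""
    embed_size = len(embeddings[0])
    vocab_size = len(dense_weights[0])
    # Embedding lookup followed by the averaging done by the lambda layer
    mean = [sum(embeddings[i][d] for i in context_ids) / len(context_ids)
            for d in range(embed_size)]
    # Dense layer: one score per word in the vocabulary
    scores = [sum(mean[d] * dense_weights[d][w] for d in range(embed_size))
              for w in range(vocab_size)]
    return softmax(scores)


embeddings = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [1.0, 1.0]]  # (vocab_size, embed_size)
dense_weights = [[1.0, 0.0, 2.0, 0.0],
                 [0.0, 1.0, 2.0, 0.0]]                          # (embed_size, vocab_size)
probs = cbow_forward([0, 1], embeddings, dense_weights)
print(probs)
print(probs.index(max(probs)))  # 2
```

With these weights, the averaged context vector is `[0.5, 0.5]` and the word with ID 2 receives the highest probability, so it would be the predicted target word.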

### Extracting word2vec embeddings from the model

* Although word2vec models are trained as classifiers, we are more interested in the side effect of this classification process, namely the learned embedding weights;
* There are many examples of how these distributed representations exhibit often surprising syntactic and semantic information;
* Vectors connecting words that have similar meanings but opposite genders are approximately parallel in the reduced 2D space, and we can often get very intuitive results by doing arithmetic with the word vectors;
* Intuitively, the training process imparts enough information to the internal encoding to predict an output word that occurs in the context of an input word;

Keras provides a way to extract weights from trained models. For the skip-gram example, the embedding weights can be extracted as follows:

```
word_embed_layer = word_model.layers[0]
weights = word_embed_layer.get_weights()[0]
```

Similarly, the embedding weights for the CBOW example can be extracted using the following one-liner:

```
weights = model.layers[0].get_weights()[0]
```

* In both cases, the shape of the weights matrix is `(vocab_size, embed_size)`;
* In order to compute the distributed representation for a word in the vocabulary, you will need to construct a one-hot vector by setting the position of the word index to one in a zero vector of size `(vocab_size)` and multiply it with the matrix to get the embedding vector of size `(embed_size)`.
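
Multiplying a one-hot vector with the weights matrix simply selects one row of that matrix. The sketch below shows this with a made-up `(4, 3)` matrix in plain Python; the function name `embed` is hypothetical.

```python
# Toy weights matrix of shape (vocab_size, embed_size) = (4, 3); values are made up
weights = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]
vocab_size, embed_size = len(weights), len(weights[0])


def embed(word_index):
    """Multiply the one-hot vector of word_index with the weights matrix."""
    one_hot = [1.0 if i == word_index else 0.0 for i in range(vocab_size)]
    return [sum(one_hot[i] * weights[i][d] for i in range(vocab_size))
            for d in range(embed_size)]


print(embed(2))                # [0.7, 0.8, 0.9]
print(embed(2) == weights[2])  # True
```

In practice, indexing the row directly (`weights[word_index]`) gives the same vector without the multiplication.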

### Use 3rd-party implementations of word2vec

* Although you can implement word2vec models on your own, third-party implementations are readily available, and unless your use case is very complex or different, it makes sense to just use one such implementation instead of rolling your own;
* The `gensim` library provides an implementation of word2vec.
* Since Keras does not provide any support for word2vec, integrating the `gensim` implementation into Keras code is very common practice;
* The following code shows how to build a word2vec model using `gensim` and train it with the text from the `text8` corpus (available for download at http://mattmahoney.net/dc/text8.zip), a file containing about 17 million words derived from Wikipedia. Wikipedia text was cleaned to remove markup, punctuation, and non-ASCII text, and the first 100 million characters of this cleaned text became the text8 corpus. This corpus is commonly used as an example for word2vec because it is quick to train and produces good results.

The steps go as follows:

* We read in the words from the text8 corpus, and split up the words into sentences of 50 words each (the `gensim` library provides a built-in text8 handler that does something similar);
* Since we want to illustrate how to generate a model with any (preferably large) corpus that may or may not fit into memory, we will show you how to generate these sentences using a Python generator;
* The Text8Sentences class will generate sentences of maxlen words each from the text8 file;
* In this case, we do ingest the entire file into memory, but when traversing through directories of files, generators allow us to load parts of the data into memory at a time, process them, and yield them to the caller;
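
The chunking step on its own can be sketched as a small generator in plain Python. The function name `chunk_words` and the sample text are made up for illustration; unlike a list, the generator produces one sentence at a time, so the caller never has to hold all of them in memory.

```python
def chunk_words(words, maxlen):
    """Yield consecutive lists of at most maxlen words."""
    chunk = []
    for word in words:
        chunk.append(word)
        if len(chunk) == maxlen:
            yield chunk
            chunk = []
    if chunk:  # emit the shorter final chunk, if any words are left
        yield chunk


words = "anarchism originated as a term of abuse".split()
print(list(chunk_words(words, 3)))
# [['anarchism', 'originated', 'as'], ['a', 'term', 'of'], ['abuse']]
```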

In [None]:
from gensim.models import word2vec, Word2Vec
import logging
import os

class Text8Sentences(object):

    def __init__(self, fname, maxlen):
        self.fname = fname
        self.maxlen = maxlen

    def __iter__(self):
        # Read the file given at construction time as text, so that split(" ") works on str
        with open(self.fname, "r") as ftext:
            text = ftext.read().split(" ")
        words = []
        for word in text:
            # Emit a sentence once maxlen words have been collected
            if len(words) >= self.maxlen:
                yield words
                words = []
            words.append(word)
        # Emit the remaining words as the last sentence
        yield words

#The gensim word2vec uses Python logging to report on progress, so we first enable it.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

DATA_DIR = "../data/"

# Declares an instance of the Text8Sentences class, and the line after that trains the model with
# the sentences from the dataset.
sentences = Text8Sentences(os.path.join(DATA_DIR, "text8"), 50)

# We have chosen the size of the embedding vectors to be 300 and we only consider words that appear
# a minimum of 30 times in the corpus.
# The default window size is 5, so we will consider 5 words before and after the current word.
# By default, the word2vec model created is CBOW, but you can change that by setting sg=1 in the parameters.
model = word2vec.Word2Vec(sentences, size=300, min_count=30)

# The word2vec implementation will make two passes over the data, first to generate a vocabulary and then
# to build the actual model.

# Once the model is created, we should normalize the resulting vectors. According to the documentation,
# this saves lots of memory. Once the model is trained, we can optionally save it to disk:
model.init_sims(replace=True)
model.save("word2vec_gensim.bin")

# The model can be brought back into memory using the following call:
model = Word2Vec.load("word2vec_gensim.bin")

# We can now query the model to find all the words it knows about:
list(model.wv.vocab.keys())[0:4] # ['homomorphism', 'woods', 'spiders', 'hanging']

# We can find the actual vector embedding for a given word:
model["woman"] # array([ -3.13099056e-01, -1.85702944e+00, ..., -1.30940580e+00], dtype="float32")

# We can also find words that are most similar to a certain word:
model.most_similar("woman") # [('child', 0.706), ('girl', 0.702), ..., ('daughter', 0.587)]

# We can provide hints for finding word similarity. For example, the following command returns the
# top 10 words that are like woman and king but unlike man:
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=10) # [('queen', 0.624), ('prince', 0.564), ..., ('matilda', 0.517)]

# We can also find similarities between individual words:
model.similarity("girl", "woman") # 0.702182479574
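
The similarity reported between two words is the cosine of the angle between their vectors. As a minimal sketch in plain Python (the function name and the toy vectors are made up for illustration), it can be computed as follows:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
print(cosine_similarity([3.0, 4.0], [4.0, 3.0]))  # 0.96
```

Because the measure depends only on direction, vectors that have been normalized to unit length give the same result from a plain dot product.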