# Model Training

This notebook defines and trains a Keras network to predict, for each order, which of the items in the user's _previous_ order will reappear in it.

This is slightly different from the goal of the Kaggle contest, which is to predict which items from _any_ point in the user's past history will reappear in this order.

Make sure you run the [Data Preparation](1-DataPrep.ipynb) notebook before this one.

Training will be performed on a GPU automatically if your machine has one, or CPU otherwise (assuming you haven't messed with Keras's configuration).

## Problem definition

Conceptually, we're trying to predict a number of items for each order -- a multilabel classification problem.

However, since the candidate items for each order come from a restricted set (the _n_ items in the previous order), this decomposes nicely into _n_ individual binary classification problems. This is generally an easier problem to solve, especially when _n_ is much smaller than the total number of items in the data.

So, each training (or test) instance is actually an item in an order, and the label is 1 if that item also appears in the next order, and 0 otherwise.

This means each order yields _n_ separate training instances, one for each item it contains.
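As a minimal sketch of this decomposition (the toy item IDs and the helper name `expand_order_pair` are illustrative assumptions, not part of the notebook's pipeline), one order pair expands into labelled instances like this:

```python
# Hypothetical toy data: item IDs in a user's previous and next orders
previous_order = [13, 43, 91]
next_order = [43, 7, 91]

def expand_order_pair(previous_order, next_order):
    """Return one (item_id, label) instance per item in the previous order."""
    next_items = set(next_order)
    return [(item, 1 if item in next_items else 0) for item in previous_order]

instances = expand_order_pair(previous_order, next_order)
print(instances)  # [(13, 0), (43, 1), (91, 1)]
```

Item 7 appears in the next order but never becomes an instance, because only items from the previous order are candidates.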

## Network structure

The inputs to the network, for each training instance, are:

* (1) The order that this instance came from
 * This remains the same for all the _n_ instances from this order
* (2) The items in that order which were themselves reorders
 * This also remains the same for all instances from this order
* (3) The user
 * This also remains the same for all instances from this order
* (4) The item itself
 * This is the only input that differs between instances from the same order

The label, for each instance, is a 1 or 0 indicating whether or not it reappears in the user's next order.

All items are represented by an _embedding_, i.e. a vector of floats. These values are learnt during training, along with all the other weights of the network, and items which appear in similar contexts will result in similar embeddings.

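A collection of items (i.e. (1) and (2)) is represented by looking up the embeddings for those items and combining them into a single vector. Conceptually this is a simple sum, so an empty collection is just a vector of zeros; the code further down averages the embeddings instead (see `mask_aware_mean`), which also maps an empty collection to zeros.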

Each user is also represented by an embedding, again learnt during training.

The final input to the network, then, is a concatenation of four vectors:

* The order vector (sum of item embeddings in order)
* The reorder vector (sum of reordered item embeddings in order)
* The current item's embedding
* The current user's embedding

Or more visually (each row is a single training instance):

```
|---Order-Vector-6---||--Reorder-Vector-6--||---Item-Vector-13---||---User-Vector-22---|
|---Order-Vector-6---||--Reorder-Vector-6--||---Item-Vector-43---||---User-Vector-22---|
|---Order-Vector-6---||--Reorder-Vector-6--||---Item-Vector-91---||---User-Vector-22---|
...
|---Order-Vector-9---||--Reorder-Vector-9--||---Item-Vector-10---||---User-Vector-71---|
|---Order-Vector-9---||--Reorder-Vector-9--||---Item-Vector-13---||---User-Vector-71---|
...
```
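The construction of these rows can be sketched in NumPy. This is only an illustration under assumptions: the tables are random rather than learnt, the embedding size is a toy value, and `collection_vector` and `input_row` are hypothetical helper names. It uses the plain sum described above.

```python
import numpy as np

rng = np.random.RandomState(0)
embedding_size = 4  # toy size, chosen only for readability

# Hypothetical lookup tables: one row of floats per ID
item_embeddings = rng.normal(size=(100, embedding_size))
user_embeddings = rng.normal(size=(100, embedding_size))

def collection_vector(item_ids):
    """Sum the embeddings of a collection; an empty collection gives zeros."""
    if len(item_ids) == 0:
        return np.zeros(embedding_size)
    return item_embeddings[item_ids].sum(axis=0)

def input_row(order_items, reordered_items, item_id, user_id):
    """Concatenate order, reorder, item and user vectors for one instance."""
    return np.concatenate([
        collection_vector(order_items),
        collection_vector(reordered_items),
        item_embeddings[item_id],
        user_embeddings[user_id],
    ])

order_items = [13, 43, 91]
reordered_items = [43]
rows = np.stack([input_row(order_items, reordered_items, item, 22)
                 for item in order_items])
print(rows.shape)  # (3, 16)
```

Within one order, only the item slice (columns 8 to 11 here) differs between rows; the order, reorder and user slices repeat.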

These vectors are fed through a series of fully-connected layers, and then to a single output node which predicts the label. This is scored against the true label via [cross-entropy loss](https://jamesmccaffrey.wordpress.com/2013/11/05/why-you-should-use-cross-entropy-error-instead-of-classification-error-or-mean-squared-error-for-neural-network-classifier-training/). The exact structure of the network is discussed below.
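As a minimal illustration of the loss for a single prediction (a sketch, not the Keras implementation; the function name is ours), binary cross-entropy penalizes confident wrong answers far more heavily than it rewards confident right ones:

```python
import math

def binary_cross_entropy(label, prediction):
    """Cross-entropy for one instance; label is 0 or 1, prediction in (0, 1)."""
    return -(label * math.log(prediction)
             + (1 - label) * math.log(1 - prediction))

print("%0.4f" % binary_cross_entropy(1, 0.9))  # 0.1054
print("%0.4f" % binary_cross_entropy(1, 0.1))  # 2.3026
print("%0.4f" % binary_cross_entropy(0, 0.5))  # 0.6931
```

A model that outputs roughly 0.5 for everything therefore scores about 0.69, which is a handy baseline when reading loss values.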

## Evaluation metrics

Before training, 5000 randomly-selected order pairs will be set aside as validation data. After each training epoch -- i.e. each pass through the training data -- we'll evaluate the model's predictive power over the validation set, reporting [precision, recall and F1 score](https://en.wikipedia.org/wiki/Precision_and_recall) (balanced F-measure) averaged over these orders.
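To make "averaged over these orders" concrete, here is a small sketch with made-up predictions (the helper `order_scores` and the toy orders are assumptions for illustration). Each order gets its own precision, recall and F1, and only then are the three metrics averaged:

```python
def order_scores(predicted, actual):
    """Precision, recall and F1 for one order, given collections of item IDs."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    precision = hits / float(len(predicted)) if predicted else 0.0
    recall = hits / float(len(actual)) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical validation orders: (predicted reorders, actual reorders)
orders = [([43, 91], [43]), ([13], [13, 43]), ([], [91])]
scores = [order_scores(p, a) for p, a in orders]
mean_p, mean_r, mean_f1 = [sum(col) / len(col) for col in zip(*scores)]
print("P = %0.4f, R = %0.4f, F = %0.4f" % (mean_p, mean_r, mean_f1))
# P = 0.5000, R = 0.5000, F = 0.4444
```

Note that the mean of the per-order F1 scores (0.4444) is not the same as the F1 of the mean precision and recall (0.5).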

N.B. This isn't necessarily a perfect predictor of how well the model will do in production, as the training data can potentially contain order pairs that appear chronologically later in the input data than order pairs in the validation data. This is a violation of the "no time machines" rule. But, it's convenient for demonstration purposes. A thorough evaluation would involve a held-out test set consisting of each user's most recent order.
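A chronological split of that kind could look roughly like the following. This is a sketch under assumptions: the `(user_id, order_number)` record layout is hypothetical and is not how this notebook's HDF5 datasets are organised.

```python
# Hypothetical records: (user_id, order_number) for the later order in each pair
order_pairs = [(1, 2), (1, 3), (1, 4), (2, 2), (2, 3)]

# Find each user's most recent order number
latest = {}
for user_id, order_number in order_pairs:
    latest[user_id] = max(order_number, latest.get(user_id, order_number))

# Hold out the most recent order per user; train on everything earlier
test_pairs = [p for p in order_pairs if p[1] == latest[p[0]]]
train_pairs = [p for p in order_pairs if p[1] != latest[p[0]]]

print(test_pairs)   # [(1, 4), (2, 3)]
print(train_pairs)  # [(1, 2), (1, 3), (2, 2)]
```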

## RAM usage

Beware! This notebook requires over 4GB of RAM to run.

In [1]:
import numpy as np
import keras as k
import keras.backend as K
from keras.engine.topology import Layer
import sklearn as sk
import tensorflow as tf
import h5py
import os

h5_dir = 'h5/'

Using TensorFlow backend.


## Reopen saved datasets

Pull all the data out of the HDF5 file and back into memory. This is much faster than doing random access directly on the files, particularly if you're using a cloud server with network-attached storage.

In [2]:
datafile = h5py.File(os.path.join(h5_dir, 'training_data.h5'), 'r')
orders_dataset = datafile['orders'][:]
reorders_dataset = datafile['reorders'][:]
users_dataset = datafile['users'][:]
items_dataset = datafile['items'][:]
labels_dataset = datafile['labels'][:]
num_rows = datafile['num_rows'][0]
biggest_order_size = datafile['biggest_order_size'][0]
max_product_id = datafile['max_product_id'][0]
max_user_id = datafile['max_user_id'][0]
datafile.close()
del datafile

## Network helpers

From: https://github.com/fchollet/keras/issues/2728

In [3]:
class ZeroMaskedEntries(k.engine.topology.Layer):
    """
    This layer is called after an Embedding layer.
    It zeros out all of the masked-out embeddings.
    It also swallows the mask without passing it on.
    You can change this to default pass-on behavior as follows:

    def compute_mask(self, x, mask=None):
        if not self.mask_zero:
            return None
        else:
            return K.not_equal(x, 0)
    """

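    def __init__(self, **kwargs):
        super(ZeroMaskedEntries, self).__init__(**kwargs)
        # The Keras attribute is `supports_masking`; set it after the base
        # initializer so that it is not reset
        self.supports_masking = True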

    def build(self, input_shape):
        self.output_dim = input_shape[1]
        self.repeat_dim = input_shape[2]

    def call(self, x, mask=None):
        mask = K.cast(mask, 'float32')
        mask = K.repeat(mask, self.repeat_dim)
        mask = K.permute_dimensions(mask, (0, 2, 1))
        return x * mask

    def compute_mask(self, input_shape, input_mask=None):
        return None


def mask_aware_mean(x):
    # recreate the masks - all zero rows have been masked
    mask = K.not_equal(K.sum(K.abs(x), axis=2, keepdims=True), 0)

    # number of rows that are not all zeros
    n = K.sum(K.cast(mask, 'float32'), axis=1, keepdims=False)
    
    # compute mask-aware mean of x, or all zeroes if no rows present
    x_mean = K.sum(x, axis=1, keepdims=False) / n
    x_mean = tf.where(tf.is_nan(x_mean), tf.zeros_like(x_mean), x_mean)
    x_mean = tf.verify_tensor_all_finite(x_mean, 'mask_aware_mean produced a non-finite value')

    return x_mean


def mask_aware_mean_output_shape(input_shape):
    shape = list(input_shape)
    assert len(shape) == 3 
    return (shape[0], shape[2])
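To see what `mask_aware_mean` computes without building a graph, here is a NumPy analogue. It is a sketch, not a drop-in replacement: the name `mask_aware_mean_np` is ours, and it guards the division with `np.maximum` instead of replacing NaNs afterwards.

```python
import numpy as np

def mask_aware_mean_np(x):
    """NumPy analogue; x has shape (batch, max_items, embedding_size)."""
    # A row counts as present if any of its components is non-zero
    mask = np.abs(x).sum(axis=2, keepdims=True) != 0
    # Number of present rows per instance, shape (batch, 1)
    n = mask.astype(np.float32).sum(axis=1)
    # Guard the division so empty collections give zeros rather than NaN
    return x.sum(axis=1) / np.maximum(n, 1.0)

x = np.zeros((2, 3, 2), dtype=np.float32)
x[0, 0] = [1.0, 2.0]
x[0, 1] = [3.0, 4.0]  # the third row of instance 0 is padding
# instance 1 is an empty collection (all rows are zero)
print(mask_aware_mean_np(x).tolist())  # [[2.0, 3.0], [0.0, 0.0]]
```

The mean for the first instance is taken over its two non-zero rows only, so the padding row does not drag the average towards zero.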

## Network structure

The following code defines the structure of the neural network, as described in the notes at the top of this file.

The constants at the top define the dimensionality of the user and product embeddings, and the hidden layer sizes. The final dimensionality of the concatenated input vector is:

`(product_embedding_size * 3) + user_embedding_size + cont_features_size`

because we concatenate the order, reorder and current item vectors described above with the user vector and the continuous features.
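As a quick check of that arithmetic, using the constant values from the cell below and counting the two continuous features that the code also concatenates (the name `full_input_size` is ours):

```python
product_embedding_size = 50
user_embedding_size = 50
cont_features_size = 2

# Order, reorder and item vectors, plus the user vector and continuous features
full_input_size = (3 * product_embedding_size
                   + user_embedding_size
                   + cont_features_size)
print(full_input_size)  # 202
```

This matches the width of the `hidden` layer in the cell below, which is `2 + user_embedding_size + 3 * product_embedding_size`.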

Note that the design allows for two separate product embedding tables, one for the context (previous order and reorder) and one for the target item, which would give the model more room to manoeuvre. We call these the 'in' and 'out' embeddings, as in word2vec. In the cell below the 'out' table is commented out, so the target item is looked up in `product_embedding_in` as well. The user embedding is drawn from a different table altogether.

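As configured in the cell below, each hidden layer is a bias-free dense layer followed by batch normalization and a ReLU activation, and the network is trained using stochastic gradient descent with learning-rate decay. The `dropout` constant is defined but not applied to any layer. [Exponential linear units](https://arxiv.org/abs/1511.07289), [dropout](http://jmlr.org/papers/v15/srivastava14a.html) (to reduce the effects of overfitting) and the [ADAM optimizer](https://arxiv.org/abs/1412.6980) are alternatives that this cell does not currently use.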

### Note

When you run the following cell, no data actually gets processed.

The Keras API calls here just _define_ the network structure -- think of it like a declarative DSL for describing a network.

We'll actually pass data into the network, and evaluate its output, later on.

In [18]:
# Layer size constants
user_embedding_size = 50
product_embedding_size = 50

# Don't change this -- number of continuous features
cont_features_size = 2

# Activation function for the hidden layers
activation = 'relu'

# Dropout rate for the hidden layers
dropout = 0.1

# Initial learning rate etc.
learning_rate = 0.1
decay_rate = 1e-5

def bn(input, scale=False):
  return k.layers.Activation(activation)(k.layers.normalization.BatchNormalization(scale=scale)(input))

# Input layers for the four datasets, and the continuous features

user_input = k.layers.Input(shape=(1,), name='user_input')

order_input = k.layers.Input(shape=(biggest_order_size,), name='order_input')

reorder_input = k.layers.Input(shape=(biggest_order_size,), name='reorder_input')

item_input = k.layers.Input(shape=(1,), name='item_input')

cont_features_input = k.layers.Input(shape=(cont_features_size,), name='cont_features_input')

# Set up the embeddings tables -- these map the order, reorder, user, and
# individual item IDs into embeddings
# Think of them as lookup tables mapping IDs to vectors of floats

# Similar to word2vec, for products we use separate 'in' embeddings
# (for the context) and 'out' embeddings (for the item we're trying to predict)

product_embedding_in = k.layers.Embedding(
  input_dim=max_product_id + 1, output_dim=product_embedding_size,
  name='product_embedding_in')

#product_embedding_out = k.layers.Embedding(
#  input_dim=max_product_id + 1, output_dim=product_embedding_size,
#  name='product_embedding_out')

# Set up a separate user embedding table

user_embedding = k.layers.Embedding(
  input_dim=max_user_id + 1, output_dim=user_embedding_size, name='user_embedding')

# Deep averaging networks for the order and reorder sets

order_embedding = product_embedding_in(order_input)

order_vector = k.layers.Lambda(
  mask_aware_mean, mask_aware_mean_output_shape)(order_embedding)

order_hidden_1 = bn((k.layers.Dense(
  product_embedding_size, name='order_hidden_1', use_bias=False)(order_vector)))

order_hidden_2 = bn((k.layers.Dense(
  product_embedding_size, name='order_hidden_2', use_bias=False)(order_hidden_1)))

order_hidden_3 = bn((k.layers.Dense(
  product_embedding_size, name='order_hidden_3', use_bias=False)(order_hidden_2)))

reorder_embedding = product_embedding_in(reorder_input)

reorder_vector = k.layers.Lambda(
  mask_aware_mean, mask_aware_mean_output_shape)(reorder_embedding)

reorder_hidden_1 = bn((k.layers.Dense(
  product_embedding_size, name='reorder_hidden_1', use_bias=False)(reorder_vector)))

reorder_hidden_2 = bn((k.layers.Dense(
  product_embedding_size, name='reorder_hidden_2', use_bias=False)(reorder_hidden_1)))

reorder_hidden_3 = bn((k.layers.Dense(
  product_embedding_size, name='reorder_hidden_3', use_bias=False)(reorder_hidden_2)))

# Flatten the single item embedding into a vector

item_vector = k.layers.Flatten(name='item_vector')(
  product_embedding_in(item_input))

# Flatten the single user embedding into a vector

user_vector = k.layers.Flatten(name='user_vector')(
  user_embedding(user_input))

# Concatenate embeddings with explicitly-supplied features

full_input_vector = k.layers.concatenate(
  [order_hidden_3, reorder_hidden_3, item_vector, user_vector, cont_features_input],
  name="full_input_vector")

# Define the hidden layer

hidden = bn(k.layers.Dense(
  2 + user_embedding_size + 3 * product_embedding_size,
  name='hidden', use_bias=False)(full_input_vector), scale=False)

output = k.layers.Dense(1, activation='sigmoid', name='output')(hidden)

# Compile the model

model = k.models.Model(
  inputs=[order_input, reorder_input, user_input, item_input, cont_features_input],
  outputs=output)

model.compile(
  optimizer=k.optimizers.SGD(lr=learning_rate, decay=decay_rate), loss='binary_crossentropy')

model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
order_input (InputLayer)         (None, 145)           0                                            
____________________________________________________________________________________________________
reorder_input (InputLayer)       (None, 145)           0                                            
____________________________________________________________________________________________________
product_embedding_in (Embedding) multiple              2484450     order_input[0][0]                
                                                                   reorder_input[0][0]              
                                                                   item_input[0][0]                 
___________________________________________________________________________________________

## Training

The following sections set up the training process.

We read the data in batches of 1000 order pairs at a time, although each of these gets expanded into one training instance per item in the earlier order, as described above. The label for each instance is whether that item appears in the later order or not.

The data is shuffled before training starts, and again at the end of each training epoch, because data with non-random ordering can cause batch-based optimization algorithms to behave in unexpected ways and perform badly.

Note that shuffling the data is done indirectly, via a numeric index into the input datasets, rather than by trying to shuffle them in-place but in a synchronized manner. This makes the code much simpler, and is faster too.
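A minimal sketch of index-based shuffling, with toy arrays standing in for the real datasets (the array contents and the fixed seed are assumptions for illustration):

```python
import numpy as np

rng = np.random.RandomState(42)

# Two toy "datasets" whose rows must stay aligned
items = np.array([10, 11, 12, 13, 14])
labels = np.array([0, 1, 0, 1, 1])

# Shuffle a single index array instead of the datasets themselves
index = np.arange(len(items))
rng.shuffle(index)

# Indexing both datasets with the same index keeps them in sync
shuffled_items = items[index]
shuffled_labels = labels[index]
```

The underlying arrays are never modified, so reshuffling for the next epoch only means shuffling `index` again.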

We set aside 5000 order pairs as the validation set. If you change these values, make sure `num_valid` is a multiple of `batch_size` so it doesn't skew the average loss calculation in the `validate` function later.

Also, we use a helper function from scikit-learn to calculate weights for the classes, based on the frequencies of these classes in the training data. This is necessary because class 0 (not a reorder) is much more common than class 1 (reorder). Without adjusting for this, the model would just get stuck in a local optimum where it predicts 0 for every item. By reweighting them, we make false negatives (predicting 0s for items that should be 1s) much more costly to the model, so it avoids this trap.
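The "balanced" heuristic behind those weights is `n_samples / (n_classes * count_per_class)`. The sketch below recomputes it by hand in NumPy on a toy label vector (the labels are made up; the notebook itself uses the scikit-learn helper):

```python
import numpy as np

# Toy labels: 8 negatives (not reordered), 2 positives (reordered)
labels = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

counts = np.bincount(labels, minlength=2)
weights = len(labels) / (2.0 * counts)
print(weights.tolist())  # [0.625, 2.5]

# Per-instance weights, looked up by label
instance_weights = weights[labels]
```

With these weights each class contributes the same total weight (8 × 0.625 = 2 × 2.5 = 5), so the rarer positive class is no longer drowned out.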

In [5]:
batch_size = 1000
num_valid = 5000
assert num_valid % batch_size == 0

index = np.arange(num_rows, dtype=np.uint32)
np.random.shuffle(index)
index_valid = index[:num_valid]
index_train = index[num_valid:]

class_weights = sk.utils.class_weight.compute_class_weight(
  'balanced', [0, 1], np.hstack(labels_dataset[index_train]))

Now we define a couple of helper functions.

`make_training_data` is called once per batch, on a list of indices into the training data. It retrieves the records corresponding to those indices from the underlying datasets, performs some minor adjustments and returns them as Numpy arrays.

`validate` is called once per epoch, i.e. after one whole run through the training data. It calculates some metrics on the validation data -- batch by batch as if it were training data -- and returns them.

`validate` is a bit inefficient as it regenerates the input arrays (via `make_training_data`) every time it's called, even though these don't change and could be cached. This would be easy to implement at the cost of higher memory usage.
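The core of `make_training_data` is the expansion from one row per order to one row per item, which relies on `np.repeat` with a per-row repeat count. A toy sketch of that step (the IDs and lengths are made up):

```python
import numpy as np

# Two zero-padded orders, their users, and the true order lengths
orders = np.array([[13, 43, 91, 0],
                   [10, 13,  0, 0]])
user_ids = np.array([22, 71])
order_lengths = [3, 2]

# Repeat each order row and user ID once per item in that order
orders_data = np.repeat(orders, order_lengths, axis=0)
users_data = np.repeat(user_ids, order_lengths)

# Item IDs are already one per instance, so they are just concatenated
item_ids = np.hstack([[13, 43, 91], [10, 13]])

print(orders_data.shape)    # (5, 4)
print(users_data.tolist())  # [22, 22, 22, 71, 71]
```

All three arrays end up with one entry per training instance, so row `i` of each describes the same item.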

In [6]:
def make_training_data(indices):
  
  # Get labels for this batch's instances, and stack them into a single vector
  labels_separate = labels_dataset[indices]
  labels = np.hstack(labels_separate)
  
  # Get the arrays of ordered item IDs and reordered item IDs for each order
  # (these are already zero-padded to the same length)
  orders_separate = orders_dataset[indices]
  reorders_separate = reorders_dataset[indices]
  
  # From the length of each individual label array, we know how many items
  # are in each of the orders
  order_lengths = [len(items) for items in labels_separate]

  # Repeat each order array and each reorder array that many times
  orders_data = np.repeat(orders_separate, order_lengths, axis=0)
  reorders_data = np.repeat(reorders_separate, order_lengths, axis=0)
  
  # For the reorder lengths we have to actually count the IDs
  reorder_lengths = [np.count_nonzero(reorder) for reorder in reorders_separate]
  
  # Also repeat each user ID the appropriate number of times
  user_ids = np.repeat(users_dataset[indices], order_lengths)
  
  # We don't need to repeat the item IDs: each training instance corresponds
  # to a single item, so we just concatenate them into a single array
  item_ids = np.hstack(items_dataset[indices])
  
  # Calculate continuous features: currently just order size and reorder size
  # (both normalized by biggest_order_size), repeated over all items in order
  order_size_vec = np.array(order_lengths, dtype=np.float32) / biggest_order_size
  reorder_size_vec = np.array(reorder_lengths, dtype=np.float32) / biggest_order_size
  order_size_data = np.repeat(order_size_vec, order_lengths, axis=0)
  reorder_size_data = np.repeat(reorder_size_vec, order_lengths, axis=0)
  cont_features = np.vstack([reorder_size_data, order_size_data]).T
  
  # Now we return a bunch of things:
  #   orders_data is a zero-padded matrix of IDs, num instances x biggest_order_size
  #   reorders_data is a zero-padded matrix of IDs, same size as orders_data
  #   user_ids is a vector of IDs, length = num instances
  #   item_ids is a vector of IDs, length = num instances
  #   cont_features is a matrix of floats, num instances x 2
  #   labels is a vector of 0/1 values, length = num instances
  #   order_lengths is a vector of ints, length = batch size
  # where "batch size" is the number of orders in the batch, and
  # "num instances" is the total number of items in the batch
  # (the sum of all order sizes in the batch)
  return (orders_data, reorders_data, user_ids, item_ids, cont_features, labels, order_lengths)


def validate(model, class_weights):
  
  losses = []
  labels = []
  outputs = []
  split_intervals = []
  
  # Process validation data one batch at a time
  steps = num_valid // batch_size
  for chunk in np.array_split(np.arange(num_valid, dtype=np.uint16), steps):
    
    indices = index_valid[chunk]
    orders_valid, reorders_valid, users_valid, items_valid, cont_features_valid, labels_valid, lengths_valid = \
      make_training_data(indices)
    
    # This is so we weight the loss differently per class, so it's
    # comparable with training loss
    instance_weights = class_weights[labels_valid]
      
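    # Offset each batch's cumulative lengths by the number of instances
    # already processed, so the split points stay valid across batches
    offset = sum(len(batch_labels) for batch_labels in labels)
    split_intervals.append(np.cumsum(lengths_valid) + offset)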
    
    # Feed the validation data into the model, with labels, and get the
    # value of the loss function defined by the model (i.e. cross-entropy)
    loss = model.test_on_batch({'order_input': orders_valid,
                                'reorder_input': reorders_valid,
                                'item_input': items_valid,
                                'user_input': users_valid,
                                'cont_features_input': cont_features_valid},
                               labels_valid, sample_weight=instance_weights)
    losses.append(loss)
    labels.append(labels_valid)

    # Feed the validation data into the model WITHOUT labels, and get
    # the output of the final layer of the model -- this is a vector
    # with the same length as the number of instances (total items)
    # in the validation set
    output = model.predict_on_batch({'order_input': orders_valid,
                                     'reorder_input': reorders_valid,
                                     'item_input': items_valid,
                                     'user_input': users_valid,
                                     'cont_features_input': cont_features_valid})
    outputs.append(output)
    
  # Flatten per-batch outputs and labels into single arrays, then
  # re-split labels into one array per order
  outputs_flat = np.vstack(outputs)
  labels_flat = np.hstack(labels)
  split_intervals_flat = np.hstack(split_intervals)
  labels_split = np.split(labels_flat, split_intervals_flat)[:-1]
  
  # Convert the continuous output values (0-1 range) into 1s and 0s
  thresholded = np.where(outputs_flat > 0.5, 1, 0)

  # Split these binary predictions into individual arrays, one for
  # each order in the validation data, so they align with the labels
  preds_split = np.split(thresholded, split_intervals_flat)[:-1]
  
  # Calculate precision and recall for each order, taking care to
  # avoid divide-by-zero errors where there are no 1s in the
  # predictions or labels for an order
  precision = np.zeros(num_valid)
  recall = np.zeros(num_valid)
  for i, (preds, labels) in enumerate(zip(preds_split, labels_split)):
    matches = float(sum(preds.T[0] * labels))
    if sum(preds) > 0:
      precision[i] = matches / sum(preds)
    if sum(labels) > 0:
      recall[i] = matches / sum(labels)
    
  # Calculate f-measure for each order, and return all these metrics,
  # averaged across all the orders in the validation set
  f1 = np.nan_to_num((2 * precision * recall) / (precision + recall))
  return (np.mean(losses), precision.mean(), recall.mean(), f1.mean())

Finally it's time to train the model.

We do this by defining a generator that Keras runs in a separate thread -- this is invoked repeatedly by Keras, and each time, it grabs the next batch of indices and yields the result of calling `make_training_data` on these.

The `epochs` constant configures how many whole passes through the data we do, scoring the model's predictions on the validation set after each one. After the last epoch you can re-run the cell manually -- it will continue training from where it left off, although the epoch number reported will reset to "1" when you restart it. If you want to totally reset the model, rerun the model definition cell above.

### Note about warnings

You may see a warning about a `StopIteration` exception at the end of each epoch. It's safe to ignore this; it's an artefact of how Keras invokes the generator.

You might also see a warning about invalid values in a divide, which occurs if one or more orders have both a precision _and_ recall of 0. It's also fine to ignore this, as the resulting NaNs are converted to zeros before averaging across the validation data.
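The learning rate printed at the start of each epoch follows the time-based decay expression used in the training cell, `lr / (1 + decay * iterations)`. As a sketch, assuming the `learning_rate` and `decay_rate` constants from the model definition and 2933 batches per epoch (the helper name `decayed_lr` is ours):

```python
learning_rate = 0.1
decay_rate = 1e-5
steps_per_epoch = 2933

def decayed_lr(iterations):
    """Learning rate after the given number of batch updates."""
    return learning_rate / (1.0 + decay_rate * iterations)

for epoch in range(3):
    print("epoch %d: %0.5f" % (epoch, decayed_lr(epoch * steps_per_epoch)))
# epoch 0: 0.10000
# epoch 1: 0.09715
# epoch 2: 0.09446
```

The decay is driven by the number of batches processed, not the number of epochs, so changing `batch_size` also changes how quickly the rate falls.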

In [19]:
steps_per_epoch = num_rows // batch_size
epochs = 30

print("Batch size: %d" % batch_size)
print("Training examples: %d" % num_rows)
print("Batches per epoch: %d" % steps_per_epoch)

print("Initial validation: loss = %0.5f, P = %0.5f, R = %0.5f, F = %0.5f"
      % validate(model, class_weights))

for epoch in range(epochs):
  
  def train_data_generator():
    for chunk in np.array_split(index_train, steps_per_epoch):
      orders_data, reorders_data, user_ids, item_ids, cont_features, labels, item_lengths = \
        make_training_data(chunk)
      yield ({'order_input': orders_data,
              'reorder_input': reorders_data,
              'item_input': item_ids,
              'user_input': user_ids,
              'cont_features_input': cont_features},
             {'output': labels})
    return

  # Train the model on the entire training set
  print("Starting epoch %d" % epoch)
  print("Learning rate: %0.5f" %
        K.eval(model.optimizer.lr * (1. / (1. + model.optimizer.decay * model.optimizer.iterations))))
  
  model.fit_generator(
    train_data_generator(), steps_per_epoch=steps_per_epoch, epochs=1, max_q_size=1,
    class_weight={0: class_weights[0], 1: class_weights[1]})
  
  # Score it against the validation set
  print("Validation: loss = %0.5f, P = %0.5f, R = %0.5f, F = %0.5f"
        % validate(model, class_weights))
  
  # Shuffle the training data (actually just the indices) for next epoch
  print("Shuffling training data")
  np.random.shuffle(index_train)

Batch size: 1000
Training examples: 2933665
Batches per epoch: 2933




Initial validation: loss = 0.69894, P = 0.23398, R = 0.26780, F = 0.21331
Starting epoch 0
Learning rate: 0.10000
Epoch 1/1

Exception in thread Thread-19:
Traceback (most recent call last):
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/andrew/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration



Validation: loss = 0.68261, P = 0.25949, R = 0.25388, F = 0.21882
Shuffling training data
Starting epoch 1
Learning rate: 0.09715
Epoch 1/1

Validation: loss = 0.71745, P = 0.28483, R = 0.65826, F = 0.36576
Shuffling training data
Starting epoch 2
Learning rate: 0.09446
Epoch 1/1

Validation: loss = 0.65243, P = 0.30303, R = 0.47586, F = 0.33244
Shuffling training data
Starting epoch 3
Learning rate: 0.09191
Epoch 1/1

Validation: loss = 0.64833, P = 0.30967, R = 0.47487, F = 0.33947
Shuffling training data
Starting epoch 4
Learning rate: 0.08950
Epoch 1/1

Validation: loss = 0.67272, P = 0.30506, R = 0.61044, F = 0.37256
Shuffling training data
Starting epoch 5
Learning rate: 0.08721
Epoch 1/1

Validation: loss = 0.83987, P = 0.16502, R = 0.12683, F = 0.12289
Shuffling training data
Starting epoch 6
Learning rate: 0.08504
Epoch 1/1

Validation: loss = 0.64423, P = 0.33139, R = 0.51350, F = 0.36693
Shuffling training data
Starting epoch 7
Learning rate: 0.08297
Epoch 1/1

Validation: loss = 0.64666, P = 0.31895, R = 0.36724, F = 0.30403
Shuffling training data
Starting epoch 8
Learning rate: 0.08100
Epoch 1/1

Validation: loss = 0.78119, P = 0.14097, R = 0.11667, F = 0.10754
Shuffling training data
Starting epoch 9
Learning rate: 0.07912
Epoch 1/1

Validation: loss = 0.63881, P = 0.32184, R = 0.43813, F = 0.33632
Shuffling training data
Starting epoch 10
Learning rate: 0.07732
Epoch 1/1

Validation: loss = 0.71805, P = 0.30267, R = 0.69618, F = 0.39227
Shuffling training data
Starting epoch 11
Learning rate: 0.07561
Epoch 1/1

Validation: loss = 0.69067, P = 0.20317, R = 0.26232, F = 0.19880
Shuffling training data
Starting epoch 12
Learning rate: 0.07397
Epoch 1/1

Validation: loss = 0.64433, P = 0.33608, R = 0.40492, F = 0.33146
Shuffling training data
Starting epoch 13
Learning rate: 0.07240
Epoch 1/1

Exception in thread Thread-32:
Traceback (most recent call last):
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/andrew/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration



Validation: loss = 0.72096, P = 0.24003, R = 0.21870, F = 0.20165
Shuffling training data
Starting epoch 14
Learning rate: 0.07089
Epoch 1/1

Exception in thread Thread-33:
Traceback (most recent call last):
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/andrew/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration



Validation: loss = 0.67594, P = 0.30216, R = 0.23507, F = 0.23549
Shuffling training data
Starting epoch 15
Learning rate: 0.06945
Epoch 1/1

Exception in thread Thread-34:
Traceback (most recent call last):
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/andrew/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration



Validation: loss = 0.71317, P = 0.30293, R = 0.67961, F = 0.38913
Shuffling training data
Starting epoch 16
Learning rate: 0.06806
Epoch 1/1

Exception in thread Thread-35:
Traceback (most recent call last):
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/andrew/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration



Validation: loss = 0.71551, P = 0.18084, R = 0.17432, F = 0.15315
Shuffling training data
Starting epoch 17
Learning rate: 0.06673
Epoch 1/1

Exception in thread Thread-36:
Traceback (most recent call last):
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/andrew/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration



Validation: loss = 0.66185, P = 0.32074, R = 0.32816, F = 0.29157
Shuffling training data
Starting epoch 18
Learning rate: 0.06545
Epoch 1/1

Exception in thread Thread-37:
Traceback (most recent call last):
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/andrew/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration



Validation: loss = 0.63642, P = 0.32486, R = 0.43264, F = 0.33404
Shuffling training data
Starting epoch 19
Learning rate: 0.06421
Epoch 1/1

Exception in thread Thread-38:
Traceback (most recent call last):
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/andrew/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration



Validation: loss = 0.67092, P = 0.30290, R = 0.24306, F = 0.24054
Shuffling training data
Starting epoch 20
Learning rate: 0.06303
Epoch 1/1

Exception in thread Thread-39:
Traceback (most recent call last):
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/andrew/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration



Validation: loss = 0.64407, P = 0.29671, R = 0.29686, F = 0.26484
Shuffling training data
Starting epoch 21
Learning rate: 0.06188
Epoch 1/1

Exception in thread Thread-40:
Traceback (most recent call last):
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/andrew/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration



Validation: loss = 0.63764, P = 0.32707, R = 0.53924, F = 0.37387
Shuffling training data
Starting epoch 22
Learning rate: 0.06078
Epoch 1/1

Exception in thread Thread-41:
Traceback (most recent call last):
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/andrew/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration



Validation: loss = 0.64388, P = 0.29643, R = 0.30101, F = 0.26665
Shuffling training data
Starting epoch 23
Learning rate: 0.05972
Epoch 1/1

Exception in thread Thread-42:
Traceback (most recent call last):
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/andrew/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration



Validation: loss = 0.67338, P = 0.31089, R = 0.64096, F = 0.38779
Shuffling training data
Starting epoch 24
Learning rate: 0.05869
Epoch 1/1

Exception in thread Thread-43:
Traceback (most recent call last):
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/andrew/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration



Validation: loss = 0.63103, P = 0.31964, R = 0.55892, F = 0.37491
Shuffling training data
Starting epoch 25
Learning rate: 0.05770
Epoch 1/1

Exception in thread Thread-44:
Traceback (most recent call last):
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/andrew/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration



Validation: loss = 0.63780, P = 0.33166, R = 0.40182, F = 0.33028
Shuffling training data
Starting epoch 26
Learning rate: 0.05674
Epoch 1/1

Exception in thread Thread-45:
Traceback (most recent call last):
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/andrew/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration



Validation: loss = 0.62052, P = 0.33769, R = 0.47950, F = 0.36358
Shuffling training data
Starting epoch 27
Learning rate: 0.05581
Epoch 1/1

Exception in thread Thread-46:
Traceback (most recent call last):
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/andrew/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration



Validation: loss = 0.63116, P = 0.32525, R = 0.59191, F = 0.38922
Shuffling training data
Starting epoch 28
Learning rate: 0.05491
Epoch 1/1

Exception in thread Thread-47:
Traceback (most recent call last):
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/andrew/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration



Validation: loss = 0.64119, P = 0.31644, R = 0.32839, F = 0.29019
Shuffling training data
Starting epoch 29
Learning rate: 0.05404
Epoch 1/1

Exception in thread Thread-48:
Traceback (most recent call last):
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/andrew/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/andrew/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration



Validation: loss = 0.61642, P = 0.33592, R = 0.50748, F = 0.37070
Shuffling training data


## TODO

* Decay the learning rate less rapidly
* Maybe don't apply batch normalization (BN) over the concatenated vector
* Put the BN before the ReLU
* Add dropout over the input items, as in the DAN paper, "Deep Unordered Composition Rivals Syntactic Methods for Text Classification" (Iyyer et al., 2015)

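One way to picture the input-item dropout mentioned in the TODO list: before the item embeddings of an order are combined, each item is independently discarded with some probability, so the network can't lean too heavily on any single item. The sketch below is a minimal, framework-free illustration. The names `drop_items` and `drop_rate`, and the fallback that keeps at least one item, are assumptions made for this example; they are not part of this notebook or of Keras.

```python
import random


def drop_items(item_ids, drop_rate, rng=random):
    """Return a copy of item_ids with each item dropped with probability drop_rate.

    Assumption for this sketch: if every item happens to be dropped, one
    randomly chosen item is kept so the order never becomes empty.
    """
    kept = [item for item in item_ids if rng.random() >= drop_rate]
    if not kept and item_ids:
        kept = [rng.choice(item_ids)]
    return kept


# Hypothetical item ids for a single order.
order_items = [101, 205, 330, 412]
print(drop_items(order_items, drop_rate=0.3))
```

This kind of dropout would normally be applied only while building training batches; validation and test instances would keep all of their items.
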
## Remaining work

There are a few outstanding TODOs that would make this training loop more useful for real work:

* Make the generator 'greedier' by using multiple workers and a longer queue; this should speed up training
* See if it's possible to push some of the work of `make_training_data` into TensorFlow on the GPU instead
* Save the model to disk after every epoch
* Terminate training early if validation F-measure stagnates
* Reduce the learning rate manually if the training loss stagnates
* Test on a held-out set from the end of the time period (see notes at top of file)
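
The early-stopping item above could be handled with a small helper that sits outside the per-epoch training call. Keras does ship `EarlyStopping` and `ModelCheckpoint` callbacks, but the validation F-measure here appears to be computed outside Keras, once per epoch, so a plain Python tracker may be simpler. The class name `EarlyStopper` and its `patience` and `min_delta` parameters are assumptions made for this sketch.

```python
class EarlyStopper(object):
    """Track the best validation F-measure and signal when it has stagnated."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience    # epochs to wait without improvement
        self.min_delta = min_delta  # smallest change that counts as improvement
        self.best = None
        self.best_epoch = None
        self.bad_epochs = 0

    def update(self, epoch, f_measure):
        """Record this epoch's score; return True if training should stop."""
        if self.best is None or f_measure > self.best + self.min_delta:
            self.best = f_measure
            self.best_epoch = epoch
            self.bad_epochs = 0
            # A new best is also the natural point to save the model to disk.
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# F-measures copied from the validation lines logged above for epochs 9 to 13.
logged = [(9, 0.33632), (10, 0.39227), (11, 0.19880), (12, 0.33146), (13, 0.20165)]

stopper = EarlyStopper(patience=3)
for epoch, f_measure in logged:
    if stopper.update(epoch, f_measure):
        break

print("stop at epoch %d, best was epoch %d" % (epoch, stopper.best_epoch))
# stop at epoch 13, best was epoch 10
```

A `patience` of 3 is deliberately small for the example. With validation scores as noisy as the ones logged here, a larger value is probably safer.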

Also, the process of experimenting with the model structure or constants (size of layers, dropout rate etc.), or the optimizer parameters, is pretty tedious to do by hand.

In real life, you'd want to write a hyperparameter tuning script that automatically generates a bunch of different model variants, trains them, tests them on the same validation data, and reports the results. You can do this in parallel over multiple servers to speed up the process.
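
As a rough sketch of what such a tuning script might look like, the code below enumerates every combination in a small search space, scores each one, and ranks the results. Everything here is hypothetical: `grid`, `tune`, the hyperparameter names, and especially `fake_score`, which stands in for a real function that would build a model variant, train it, and return its validation F-measure.

```python
import itertools


def grid(search_space):
    """Yield one dict of hyperparameters per combination in search_space."""
    names = sorted(search_space)
    for values in itertools.product(*(search_space[name] for name in names)):
        yield dict(zip(names, values))


def tune(search_space, train_and_evaluate):
    """Score every variant and return (score, params) pairs, best first."""
    results = [(train_and_evaluate(params), params) for params in grid(search_space)]
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results


def fake_score(params):
    """Stand-in for real training; returns a made-up score, not a real result."""
    return params["layer_size"] / 1000.0 - params["dropout_rate"] / 10.0


# Hypothetical search space over two of the constants mentioned above.
search_space = {"layer_size": [64, 128, 256], "dropout_rate": [0.1, 0.3, 0.5]}

for score, params in tune(search_space, fake_score)[:3]:
    print("%.3f %s" % (score, sorted(params.items())))
```

Because each variant is scored independently, the loop inside `tune` is the part that could be farmed out to several servers. A full grid grows quickly as hyperparameters are added, so sampling a random subset of combinations is a common alternative.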