# Word Embedded Bidirectional LSTM Siamese Network

This notebook shows how to train a sentence similarity (sentence embedding) model using word embeddings and a bidirectional LSTM in a siamese network.

## Import the needed modules

In [None]:
import os
import tensorflow as tf
import kotoba as kt
import narau as nr
from time import time

## Define data related constants

*   ```DST_*```: Paths to the TFRecord files of the SNLI data
*   ```EMBEDDING_FILE```: Path to the GloVe embedding file
*   ```MODEL_PATH```: Path where the model checkpoints and training progress are saved. It is formatted with the timestamp of the run
*   ```DATA_COUNT```: Number of examples to use. Set to ```None``` to use all data

In [None]:
DST_DATA_PATH = os.path.join('data', 'glove.6B')
DST_TRAIN_PATH = os.path.join(DST_DATA_PATH, 'train.tfrecord')
DST_DEV_PATH = os.path.join(DST_DATA_PATH, 'dev.tfrecord')
DST_TEST_PATH = os.path.join(DST_DATA_PATH, 'test.tfrecord')

EMBEDDING_FILE = os.path.expanduser('~/Documents/data/glove.6B/glove.6B.100d.txt')

MODEL_PATH = os.path.join('word_bilstm_glove_siamese', 'model', '{}')

DATA_COUNT = 50000
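
GloVe text files such as the one `EMBEDDING_FILE` points to store one token per line, followed by its space-separated vector components. The sketch below shows how such a file could be parsed. `parse_glove_lines` is a hypothetical helper written for illustration and is not part of narau, and the sample vectors are made up.

```python
import io

def parse_glove_lines(lines):
    """Parse GloVe-style text lines into a token list and a list of vectors."""
    vocab, vectors = [], []
    for line in lines:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vocab.append(parts[0])                          # first field is the token
        vectors.append([float(v) for v in parts[1:]])   # the rest is the vector
    return vocab, vectors

# Tiny in-memory stand-in for a GloVe file (3 dimensions, made-up values).
sample = io.StringIO("the 0.1 0.2 0.3\ncat -0.4 0.5 0.6\n")
vocab, vectors = parse_glove_lines(sample)
print(vocab, len(vectors[0]))
```

With the real `glove.6B.100d.txt` file, each vector would have 100 components instead of 3.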

## Define model related constants

*   ```EMBEDDING_WEIGHTS```: Initial weights of the embedding layer
*   ```EMBEDDING_SIZE```: Number of tokens in the embedding
*   ```EMBEDDING_DIMENSION```: Number of dimensions in the dense representation
*   ```EMBEDDING_SPECIAL_TOKENS```: Number of special tokens
*   ```EMBEDDING_WITH_PAD```: Whether the embedding includes a padding token
*   ```EMBEDDING_TRAINABLE```: Whether the embedding is trainable
*   ```EMBEDDING_UNITS```: List of units in the embedding transform function
*   ```LSTM_UNITS```: List of the LSTM units
*   ```LSTM_DROPOUT```: LSTM output dropout probability
*   ```PROJECTION_UNITS```: List of units of the projection. Last item determines embedding dimensions
*   ```LOSS_MARGIN```: Target distance between different sentences
*   ```LEARNING_RATE```: Rate of gradient application

In [None]:
EMBEDDING_WEIGHTS = nr.embedding.load_glove_weights(EMBEDDING_FILE)
EMBEDDING_SIZE = EMBEDDING_WEIGHTS.shape[0]
EMBEDDING_DIMENSION = EMBEDDING_WEIGHTS.shape[1]
EMBEDDING_SPECIAL_TOKENS = 2
EMBEDDING_WITH_PAD = True
EMBEDDING_TRAINABLE = False

EMBEDDING_UNITS = [128]

LSTM_UNITS = [128]
LSTM_DROPOUT = 0.5

PROJECTION_UNITS = [1024]

LOSS_MARGIN = 1.0
LEARNING_RATE = 0.1
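
To make the role of `LOSS_MARGIN` concrete, here is a minimal sketch of a contrastive loss on a pair of sentence embeddings. This notebook does not show the loss that narau implements, so the formula below (squared distance for similar pairs, squared hinge on the margin for different pairs) is an assumption. `euclidean_distance` and `contrastive_loss` are illustrative names only.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(emb1, emb2, is_similar, margin=1.0):
    """Assumed contrastive loss for one pair of embeddings.

    Similar pairs are pulled together (loss grows with distance).
    Different pairs are pushed apart until they are at least
    `margin` away from each other, after which the loss is zero.
    """
    d = euclidean_distance(emb1, emb2)
    if is_similar:
        return d ** 2
    return max(0.0, margin - d) ** 2

# A different pair that is already farther apart than the margin costs nothing.
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], is_similar=False, margin=1.0))
# A different pair with identical embeddings is penalised by margin squared.
print(contrastive_loss([0.0, 0.0], [0.0, 0.0], is_similar=False, margin=1.0))
```

Under this assumption, a larger margin asks the model to separate different sentences more strongly before their loss vanishes.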

## Define training data constants

*   ```BATCH_SIZE```: Number of examples per gradient update
*   ```EPOCHS```: Number of training epochs

In [None]:
BATCH_SIZE = 128
EPOCHS = 100
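
These two constants, together with `DATA_COUNT`, determine roughly how many gradient updates a full run performs. The arithmetic below is a sketch that assumes the training file holds at least `DATA_COUNT` examples and that the dataset is repeated for all epochs before it is batched, so only the very last batch can be partial.

```python
import math

DATA_COUNT = 50000
BATCH_SIZE = 128
EPOCHS = 100

# Total number of examples the model sees over the whole run.
total_examples = DATA_COUNT * EPOCHS

# Number of gradient updates; the final batch may be smaller than BATCH_SIZE.
total_steps = math.ceil(total_examples / BATCH_SIZE)

# Average number of updates per pass over the data.
steps_per_epoch = DATA_COUNT / BATCH_SIZE

print(total_steps)       # 39063
print(steps_per_epoch)   # 390.625
```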

## Convert the TFRecords to a tensorflow Dataset

*   ```parse_example```: Converts TFRecord to a dictionary defined by ```_feature_def```
*   ```preprocess_text```: Converts a text to a dense tensor while calculating the length
*   ```preprocess_elements```: Processes the text and labels and converts them to the format the model requires
*   ```input_fn```: Converts the path of a TFRecord to a dataset. The dataset is also configured as shown below
    1.   Load the TFRecords
    2.   Parse each TFRecord
    3.   Unbatch the TFRecord
    4.   Preprocess each item
    5.   Shuffle if training
    6.   Repeat based on epochs
    7.   Perform padded batching

In [None]:
_feature_def = {
    'x1': tf.VarLenFeature(tf.int64),
    'x2': tf.VarLenFeature(tf.int64),
    'y': tf.FixedLenSequenceFeature([], tf.float32)
}

def parse_example(example):
    context, features = tf.parse_single_sequence_example(example, sequence_features=_feature_def)
    return features

def preprocess_text(x):
    x = tf.sparse_reset_shape(x)
    x = tf.sparse_tensor_to_dense(x)
    return x, tf.size(x)

def preprocess_elements(features):
    x1, l1 = preprocess_text(features['x1'])
    x2, l2 = preprocess_text(features['x2'])
    y = features['y']
    inputs = {
        'x1': x1,
        'len1': l1, 
        'x2': x2,
        'len2': l2
    }
    return inputs, y

def input_fn(filenames, batch_size, epochs, is_training, buffer_multiplier=100):
    ds = tf.data.TFRecordDataset(filenames, buffer_size=100, num_parallel_reads=8)
    ds = ds.map(parse_example, num_parallel_calls=8)
    ds = ds.flat_map(tf.data.Dataset.from_tensor_slices)
    if DATA_COUNT:
        ds = ds.take(DATA_COUNT)
    ds = ds.map(preprocess_elements, num_parallel_calls=8)
    if is_training:
        ds = ds.shuffle(batch_size * buffer_multiplier)
    ds = ds.repeat(epochs)
    # The padded-shape keys must match the keys returned by preprocess_elements
    ds = ds.padded_batch(batch_size, ({'x1': [None], 'len1': [], 'x2': [None], 'len2': []}, []))
    ds = ds.prefetch(buffer_multiplier)
    return ds
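
Padded batching is the step that lets sentences of different lengths share one batch. The pure-Python sketch below illustrates the idea without TensorFlow. `padded_batch` here is an illustrative stand-in, not the `tf.data` method, and the token ids are made up.

```python
def padded_batch(sequences, pad_value=0):
    """Pad every sequence to the length of the longest one in the batch.

    Returns the padded sequences together with the original lengths,
    which a model can use to ignore the padded positions.
    """
    max_len = max(len(s) for s in sequences)
    padded = [list(s) + [pad_value] * (max_len - len(s)) for s in sequences]
    lengths = [len(s) for s in sequences]
    return padded, lengths

# Three token-id sequences of different lengths (made-up ids).
batch, lengths = padded_batch([[5, 8, 2], [7], [4, 9]])
print(batch)    # [[5, 8, 2], [7, 0, 0], [4, 9, 0]]
print(lengths)  # [3, 1, 2]
```

This is why `preprocess_text` returns the length alongside the tokens: once a batch is padded, the length is the only record of where the real tokens end.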

## Create a tensorflow session configuration
This can be set to ```None``` to use the default. Sessions are part of TensorFlow's low-level APIs. See https://www.tensorflow.org/guide/graphs to learn more about sessions.

In [None]:
session_config = tf.ConfigProto()
session_config.allow_soft_placement = True
session_config.gpu_options.allow_growth = True

## Create an estimator run configuration
Define the run behavior of the estimator using the following parameters:
*   ```tf_random_seed```: Seed to allow reproducible results
*   ```save_summary_steps```: Number of steps before the summary is recorded in tensorboard
*   ```save_checkpoints_steps```: Number of steps before a checkpoint is saved
*   ```keep_checkpoint_max```: Number of latest checkpoints to keep. ```None``` means do not delete any
*   ```session_config```: Configuration of the session used to run the model

In [None]:
config = tf.estimator.RunConfig(tf_random_seed=None,
                                save_summary_steps=50,
                                save_checkpoints_steps=400,
                                keep_checkpoint_max=None,
                                session_config=session_config)

## Create a model
Pass the model constants, the model directory and the run configuration to an instance of a narau estimator model.

In [None]:
model_dir = MODEL_PATH.format(int(time()))

clf = nr.estimators.SiameseBiLSTMEmbedding(EMBEDDING_SIZE, 
                                           EMBEDDING_DIMENSION, 
                                           EMBEDDING_SPECIAL_TOKENS,
                                           EMBEDDING_WITH_PAD, 
                                           EMBEDDING_WEIGHTS,
                                           EMBEDDING_TRAINABLE, 
                                           EMBEDDING_UNITS,
                                           LSTM_UNITS,
                                           LSTM_DROPOUT,
                                           PROJECTION_UNITS,
                                           LOSS_MARGIN,
                                           LEARNING_RATE,
                                           model_dir,
                                           config)

## Defining training and evaluation behavior
Training and evaluation each require an input function: a function that takes no arguments and returns a dataset. The ```input_fn``` defined earlier is therefore wrapped in a lambda expression, and the wrappers are passed to the ```TrainSpec``` and ```EvalSpec``` objects. For the ```EvalSpec```, ```steps``` is set to ```None``` so that the whole dataset is used for evaluation. The delay and throttle values are set low because there is no multi-server configuration that would need them.

In [None]:
train_input_fn = lambda: input_fn([DST_TRAIN_PATH], BATCH_SIZE, EPOCHS, True)
dev_input_fn = lambda: input_fn([DST_DEV_PATH], BATCH_SIZE, 1, False)

train_spec = tf.estimator.TrainSpec(train_input_fn)
eval_spec = tf.estimator.EvalSpec(dev_input_fn, None, start_delay_secs=0.1, throttle_secs=0.1)
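
A lambda is one way to turn a parameterised function into the zero-argument callable that the specs expect. `functools.partial` is an alternative. The sketch below uses a stand-in `input_fn` that returns its arguments instead of a dataset, so it runs without TensorFlow. The file name and values are placeholders.

```python
from functools import partial

def input_fn(filenames, batch_size, epochs, is_training):
    """Stand-in for the real input_fn: echoes its configuration."""
    return {
        'filenames': filenames,
        'batch_size': batch_size,
        'epochs': epochs,
        'is_training': is_training,
    }

# Bind all arguments now; the result can be called with no arguments later.
train_input_fn = partial(input_fn, ['train.tfrecord'], 128, 100, True)

result = train_input_fn()
print(result['batch_size'], result['is_training'])
```

One difference worth knowing: `partial` captures the argument values when it is created, while a lambda looks up names such as `BATCH_SIZE` only when it is called, so later changes to those globals affect the lambda but not the partial.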

## Training and Evaluating

Call ```train_and_evaluate``` to run training and evaluation. To view the results, run TensorBoard. The command ```tensorboard --logdir /path/containing/model_dir --host localhost``` serves TensorBoard at localhost:6006.

In [None]:
tf.estimator.train_and_evaluate(clf, train_spec, eval_spec)