Date: 07/11/2017
Author: Ankit Mundada


This is an implementation of an RNN in TensorFlow.

Dataset: Sample text corpus taken from http://www.taleswithmorals.com/. 
The corpus contains a vocabulary of only 112 distinct words, but it can be extended to a larger one as well.

There are three datasets: Train.txt, Val.txt and Test.txt.
Currently, all of them are the same, as the goal is not to improve the model but to understand the implementation.

In [17]:
from select import select
import sys
import tensorflow as tf
import utils

tf.logging.set_verbosity(tf.logging.INFO)
IS_DEBUG = True
IS_GPU_AVL = False
DO_PREDICTION = True

# parameters to tune
PARAMS = {
    'NUM_ITER': 2 if IS_DEBUG else 50,
    'STEPS_PER_ITER': 100,
    'LEARNING_RATE': 1e-3,
    'CONTEXT_SIZE': 3,  # context of words to predict next word
    'LSTM_SIZE': 32,  # number of hidden units in LSTM cell
    'BATCH_SIZE': 16,
}

FLAGS = {
    'MODEL_DIR': './logs/rnn_1', # directory where model will be saved
    'TRAINING_DATA': './datasets/train.txt',
    'VAL_DATA': './datasets/val.txt',
    'TEST_DATA': './datasets/test.txt'
}

vocab_size = utils.initiate_vocabs()
output_size = vocab_size

This input function is an implementation of a standard input pipeline for text data, using TensorFlow's Dataset API.

1. The input file has to be in CSV format, with features at the start of each row and the label at the end.
2. `TextLineDataset` reads the CSV row by row. Hence it does not load the entire file into memory at once.
3. `decode_csv` preprocesses each row and returns a dictionary of `feature tensors`, keyed by feature name, together with a `target tensor`.
4. Parameters like `repeat_count`, `shuffle` and `batch_size` make it easy to tweak the function without affecting any other code.
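
The CSV layout from item 1 can be illustrated with a short sketch. `utils.make_csv` is not shown in this notebook, so the helpers below (`make_context_rows`, `rows_to_csv`) and the word ids are hypothetical stand-ins. They assume a sliding window of `CONTEXT_SIZE` word ids followed by the id of the next word:

```python
import csv
import io


def make_context_rows(token_ids, context_size):
    """Slide a window over the ids: `context_size` features, then one label."""
    rows = []
    for start in range(len(token_ids) - context_size):
        window = token_ids[start:start + context_size]
        label = token_ids[start + context_size]
        rows.append(window + [label])
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text, one training example per line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up word ids standing in for a five-token sentence (assumption).
token_ids = [7, 3, 0, 12, 9]
rows = make_context_rows(token_ids, context_size=3)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Each printed line has three feature columns and one label column, which is the shape that `decode_csv` expects when `CONTEXT_SIZE` is 3.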

In [18]:
def input_fn(filepath, mode=None):
    """
    Implements the recommended input pipeline architecture of TensorFlow.
    :param filepath: File to be loaded into memory line by line. (MUST be a CSV)
    :param mode: One of the tf.estimator.ModeKeys (Train, Eval, Predict)
    :return: The input features and target values for the current step
    """

    is_training = mode == tf.estimator.ModeKeys.TRAIN
    repeat_count = None if is_training else 1

    default_val = [[0.0] for _ in range(PARAMS['CONTEXT_SIZE'])]
    default_val.append([0])  # output class should have data-type tf.int32, for accuracy calculations in model_fn

    def decode_csv(line):
        line = tf.decode_csv(line, default_val)
        return {'context': line[:-1]}, line[-1]

    dataset = tf.contrib.data\
        .TextLineDataset(utils.make_csv(filepath, PARAMS['CONTEXT_SIZE']))\
        .map(decode_csv, num_threads=4 if IS_GPU_AVL else 2)  # preprocessing
    # shuffle input
    if is_training:
        dataset = dataset.shuffle(buffer_size=PARAMS['BATCH_SIZE'] * 2)
    dataset = dataset.repeat(repeat_count)
    dataset = dataset.batch(PARAMS['BATCH_SIZE'])
    iterator = dataset.make_one_shot_iterator()
    next_feature, next_label = iterator.get_next()
    return next_feature, next_label

This is the heart of the Estimator that we will build next. The function takes the `features` and `labels` returned by `input_fn` above and feeds them into the graph that we create.

1. Based on the value of `mode`, the LSTM model performs different operations: training, evaluation or inference.
2. The model architecture can easily be modified here, for example by adding a new `LSTMCell` or adding `Dropout`.
3. The function returns an `EstimatorSpec` object, whose contents depend on the `mode`.
4. To view different types of summaries in `Tensorboard` or on the `console`, or to write them to a file, different kinds of hooks can be used.
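
As a rough illustration of what the output layer computes for a single example, here is a plain-Python sketch of the softmax, the argmax prediction and the sparse cross-entropy loss. The logits and the label are made-up values, and the helper functions are illustrative only. They are not the TensorFlow ops themselves:

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def sparse_cross_entropy(logits, label):
    """Negative log-probability assigned to the true class index."""
    return -math.log(softmax(logits)[label])


logits = [2.0, 0.5, -1.0]  # made-up scores for a 3-word vocabulary
label = 0                  # made-up index of the true next word
probs = softmax(logits)
pred = argmax(probs)
loss = sparse_cross_entropy(logits, label)
print(pred, round(loss, 4))
```

Because softmax preserves the ordering of the logits, taking the argmax of the logits directly would select the same class. The loss shrinks as the probability assigned to the true class grows.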

In [19]:
def model_fn(features, labels, mode, params):
    """
    Required to be passed into the Estimator. This function models a simple LSTM using TensorFlow's built-in,
    efficient implementation of the LSTM cell.
    :param features: features as returned from input_fn
    :param labels: labels as returned from input_fn
    :param mode: mode set by the different method calls of the Estimator Object
    :param params: params passed with the Estimator
    :return: an EstimatorSpec which stores the important params needed to analyze the model
    """
    # break the context words aka features into 'time-steps'
    x = tf.split(features['context'], PARAMS['CONTEXT_SIZE'], 1)

    # setting up LSTM
    lstm_cell = tf.contrib.rnn.BasicLSTMCell(PARAMS['LSTM_SIZE'])
    output, _ = tf.contrib.rnn.static_rnn(lstm_cell, x, dtype=tf.float32)

    # only last time-step matters
    output = output[-1]

    # get linear predictions from LSTM outputs
    weight = tf.Variable(tf.random_normal([PARAMS['LSTM_SIZE'], output_size]))
    bias = tf.Variable(tf.random_normal([output_size]))
    logits = tf.matmul(output, weight) + bias

    # final predictions
    preds = tf.argmax(tf.nn.softmax(logits), axis=1, output_type=tf.int32)
    preds_dict = {"predictions": preds}
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(
            mode=mode,
            predictions=preds_dict,
        )

    # using a cross-entropy loss for this multiclass classification problem
    loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
    if mode == tf.estimator.ModeKeys.EVAL:
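        # setting up different metrics to be monitored using tensorboard
        # NOTE: tf.metrics.precision and tf.metrics.recall cast their inputs to bool, so with
        # multiclass labels they only distinguish class index 0 from every other index
        eval_metric_ops = {
            'accuracy': tf.metrics.accuracy(labels=labels, predictions=preds),
            'precision': tf.metrics.precision(labels=labels, predictions=preds),
            'recall': tf.metrics.recall(labels=labels, predictions=preds)
        }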
        return tf.estimator.EstimatorSpec(
            mode=mode,
            predictions=preds_dict,
            loss=loss,
            eval_metric_ops=eval_metric_ops
        )

    correct_pred = tf.equal(preds, labels)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32), name='accuracy')
    # creating a summary to be picked up by the SummarySaverHook
    with tf.name_scope('summaries'):
        tf.summary.scalar('accuracy', accuracy)

    # define the training operation/optimizer
    optimizer = tf.train.AdamOptimizer(learning_rate=PARAMS['LEARNING_RATE'])
    train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())

    tensors_to_save = {'training_accuracy': 'accuracy'}
    # Hooks to monitor the performance of the model during training
    hook_logging = tf.train.LoggingTensorHook(tensors_to_save, every_n_iter=25)
    hook_summary = tf.train.SummarySaverHook(save_steps=25, output_dir=FLAGS['MODEL_DIR'], scaffold=None,
                                             summary_op=tf.summary.merge_all())

    # return EstimatorSpec
    return tf.estimator.EstimatorSpec(
        mode=mode,
        predictions=preds_dict,
        loss=loss,
        train_op=train_op,
        training_hooks=[hook_logging, hook_summary]
    )

Initialize the estimator. It will automatically load the most recent checkpoint saved in the 'model_dir'.

In [20]:
rnn_regressor = tf.estimator.Estimator(model_fn=model_fn, model_dir=FLAGS['MODEL_DIR'], params=PARAMS)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': './logs/rnn_1', '_log_step_count_steps': 100, '_save_checkpoints_secs': 600, '_keep_checkpoint_every_n_hours': 10000, '_session_config': None, '_save_checkpoints_steps': None, '_save_summary_steps': 100, '_tf_random_seed': 1, '_keep_checkpoint_max': 5}


Train the model. In debug mode, it runs for only two iterations of two steps each. These parameters can be changed at the beginning of this file.

In [21]:
TIMEOUT = 5  # seconds to wait after each iteration for a request to stop training, in case more tuning is needed
for i in range(PARAMS['NUM_ITER']):
    rnn_regressor.train(input_fn=lambda: input_fn(FLAGS['TRAINING_DATA'], mode=tf.estimator.ModeKeys.TRAIN),
                        steps=2 if IS_DEBUG else PARAMS['STEPS_PER_ITER'])

    # Evaluate at the end of each iteration
    eval_results = rnn_regressor.evaluate(input_fn=lambda : input_fn(FLAGS['VAL_DATA'], mode=tf.estimator.ModeKeys.EVAL))
    print('Evaluation loss after %d iters is: %f' % (i, eval_results['loss']))

    # Stop training between iterations if more tuning is required.
    print("Stop training now?\nType y for Yes")
    rlist = select([sys.stdin], [], [], TIMEOUT)[0]
    feedback = None
    if rlist:
        feedback = sys.stdin.readline().strip()
        # compare by value; `is` checks object identity and is unreliable for strings
        if feedback in ('y', 'yes'):
            print('Finishing the training')
            break
    else:
        print('Training for the next iteration')
        continue

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Restoring parameters from ./logs/rnn_1/model.ckpt-2050
INFO:tensorflow:Saving checkpoints for 2051 into ./logs/rnn_1/model.ckpt.
INFO:tensorflow:step = 2051, loss = 1.40404
INFO:tensorflow:training_accuracy = 0.6875
INFO:tensorflow:Saving checkpoints for 2052 into ./logs/rnn_1/model.ckpt.
INFO:tensorflow:Loss for final step: 1.17907.
INFO:tensorflow:Starting evaluation at 2017-11-08-02:05:56
INFO:tensorflow:Restoring parameters from ./logs/rnn_1/model.ckpt-2052
INFO:tensorflow:Finished evaluation at 2017-11-08-02:05:57
INFO:tensorflow:Saving dict for global step 2052: accuracy = 0.26, global_step = 2052, loss = 2.45616, precision = 0.95288, recall = 0.973262
Evaluation loss after 0 iters is: 2.456157
Stop training now?
Type y for Yes
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Restoring parameters from ./logs/rnn_1/model.ckpt-2052
INFO:tensorflow:Saving checkpoints for 2053 into ./logs/rnn_1/model.ckpt.
INFO:te

# Make predictions on the test dataset!

In each sample output, the first three words are the input given to the model and the last (i.e. fourth) word is the one _predicted_ by the model.

Note: The model can be tweaked a lot. This is only an implementation example. The test-set outputs are not great.
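
The way each output line is assembled can be sketched as follows. The vocabulary and the `format_prediction` helper are hypothetical, since `utils.convert_index_to_word` is not shown here. The sketch assumes each CSV row holds three context ids followed by the true label, and that the label column is swapped for the predicted index before decoding:

```python
# Hypothetical vocabulary; the real mapping lives in the `utils` module.
index_to_word = {0: ",", 1: "long", 2: "ago", 3: "the", 4: "mice"}


def format_prediction(csv_row, predicted_index):
    """Drop the true label (last column), append the prediction, decode to words."""
    fields = csv_row.strip().split(",")[:-1]
    context = [int(float(field)) for field in fields]
    words = [index_to_word[i] for i in context + [predicted_index]]
    return " ".join(words)


line = format_prediction("1,2,0,3\n", predicted_index=3)
print(0, " ", line)
```

The first three words of each printed line therefore always come from the test file, and only the fourth reflects the model.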

In [22]:
if DO_PREDICTION:
    # get results on the test data after all the training
    predictions = rnn_regressor.predict(input_fn=lambda: input_fn(FLAGS['TEST_DATA'], mode=tf.estimator.ModeKeys.PREDICT))
    with open(utils.make_csv(FLAGS['TEST_DATA'], PARAMS['CONTEXT_SIZE']), 'r') as infile:
        for i, pred in enumerate(predictions):
            list_idx = infile.readline().strip().split(sep=',')[:-1] + [pred['predictions'].tolist()]
            pred_word = utils.convert_index_to_word(list_idx)
            print(i, ' ', ' '.join(pred_word))

INFO:tensorflow:Restoring parameters from ./logs/rnn_1/model.ckpt-2054
0   long ago , the
1   ago , the mice
2   , the mice had
3   the mice had a
4   mice had a general
5   had a general to
6   a general council to
7   general council to consider
8   council to consider what
9   to consider what danger
10   consider what measures when
11   what measures they could
12   measures they could consider
13   they could take to
14   could take to to
15   take to outwit what
16   to outwit their the
17   outwit their common the
18   their common enemy got
19   common enemy , cat
20   enemy , the cat
21   , the cat had
22   the cat . some
23   cat . some said
24   . some said this
25   some said this .
26   said this , and
27   this , and some
28   , and some had
29   and some said this
30   some said that .
31   said that but at
32   that but at last
33   but at last ,
34   at last a young
35   last a young to
36   a young mouse got
37   young mouse got up
38   mouse got up last
39   got up a

To see the entire code in this file, please check out __simple_rnn.py__ from _Jupyter Home_.
Thank you. Please send your feedback and suggestions to ankit.mundada93@gmail.com or +91-7076607420.