### Text Prediction using LSTM
I've seen similar projects try to generate text that reads like Shakespeare.

How is this different?
* I wanted text that was easier to judge, so I scraped project abstracts from the NSF
* To better learn and demonstrate the foundational parts of TensorFlow, I didn't use tflearn

We start with some simple imports and define how the training/test data will be read and displayed.


In [1]:
import os
import numpy as np
import tensorflow as tf
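from tensorflow.contrib.rnn import LSTMCell, DropoutWrapper, MultiRNNCell  # TensorFlow 1.x API

# Character vocabulary, filled in by read_textfile_to_arrays
char_to_index = {}
index_to_char = {}
char_count = 0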

def array_info(description, x):
    """ Returns a string with the description and shape of the provided numpy array
    """
    return "Array " + description + " " + str(x.shape)
    
def array_to_str(char_array):
    """ Converts an array of character indices to a Python string
    """
    return "".join([index_to_char[c] for c in char_array])

def display_strs(chars_2d):
    """ Displays an array of strings
    """
    count = chars_2d.shape[0]
    split = np.split(chars_2d, count)
    for chars in split:
        print(array_to_str(chars[0]))
        
def read_textfile_to_arrays(path, x_len=25, y_len=2, skip_factor=3, to_lower_case=False, train_percent=99):
    """ Reads a text file and splits it into train/test arrays of character indices
    """
    global char_to_index
    global index_to_char
    global char_count
    
    total_len = x_len+y_len
    
    # Read the whole text file
    with open(path) as f:
        string = f.read()
    if to_lower_case:
        string = string.lower()

    chars = sorted(list(set(string)))
    char_count = len(chars)
    char_to_index = {chars[i]: i for i in range(char_count)}  # maps each distinct character to an index
    index_to_char = {v: k for k, v in char_to_index.items()}

    length = len(string)
    count = (length-total_len)//skip_factor

    x_all = np.zeros((count, x_len), dtype=np.int32)
    y_all = np.zeros((count, y_len), dtype=np.int32)

    for s in range(0, count):
        i = s * skip_factor
        for t, char in enumerate(string[i: i + x_len]):
            x_all[s][t] = char_to_index[char]
        for t, char in enumerate(string[i+x_len: i + total_len]):
            y_all[s][t] = char_to_index[char]

    # Hold out the first (100 - train_percent)% of the sequences for testing
    test_index = count - count * train_percent//100

    x_train = x_all[test_index:]
    y_train = y_all[test_index:]
    x_test = x_all[:test_index]
    y_test = y_all[:test_index]

    print("Text total length: " + str(length))
    print("Distinct chars: " + str(len(chars)))
    print("Total sequences: " + str(count))

    print(array_info("x_train", x_train))
    print(array_info("y_train", y_train))
    print(array_info("x_test", x_test))
    print(array_info("y_test", y_test))

    i = 2
    print("Example at row: %d X[%d]: '%s'  Y[%d]: '%s'" % (i, i, array_to_str(x_train[i]),
                                       i, array_to_str(y_train[i])))
    return x_train, y_train, x_test, y_test
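
To make the roles of `x_len`, `y_len` and `skip_factor` concrete, here is a minimal sketch of the same windowing on a short string. The helper name `make_windows` and the sample text are made up for illustration and are not part of the notebook; the slicing mirrors what `read_textfile_to_arrays` does above.

```python
import numpy as np

def make_windows(text, x_len, y_len, skip_factor):
    """Slice text into (input, target) index windows, one every skip_factor characters."""
    chars = sorted(set(text))
    char_to_index = {c: i for i, c in enumerate(chars)}
    total_len = x_len + y_len
    count = (len(text) - total_len) // skip_factor
    x = np.zeros((count, x_len), dtype=np.int32)
    y = np.zeros((count, y_len), dtype=np.int32)
    for s in range(count):
        i = s * skip_factor  # start of this window in the text
        x[s] = [char_to_index[c] for c in text[i:i + x_len]]
        y[s] = [char_to_index[c] for c in text[i + x_len:i + total_len]]
    return x, y, chars

# Hypothetical toy input; the notebook reads a large text file instead
x, y, chars = make_windows("hello world", x_len=4, y_len=1, skip_factor=2)
for row_x, row_y in zip(x, y):
    print(repr("".join(chars[i] for i in row_x)), "->", repr("".join(chars[i] for i in row_y)))
```

Each row of `x` holds `x_len` consecutive characters and the matching row of `y` holds the `y_len` characters that follow them. A larger `skip_factor` produces fewer, less overlapping sequences.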

### Load the Data
Defines the length of input and output that the text will be divided into

Defines the location of where the data is stored

Loads the data

Prints information on the data

In [2]:
seq_len = 40
y_len = 1

data_dir = "/home/eric/Documents/data/NSF/"
data_file_name = "nsf_file_clean_10000.txt" 
data_path = data_dir + data_file_name

# Actually load the data
x_train, y_train, x_test, y_test = read_textfile_to_arrays(data_path, x_len=seq_len, y_len=y_len, skip_factor=100)

# We are only dealing with the first character of the expected output
y_train = y_train[:, 0]
y_test = y_test[:, 0]

print("index_to_char")
for i in range(char_count):
    c = str(index_to_char[i])
    if c[0]=='\t': c = '\\t'
    if c[0]=='\n': c = '\\n'
    print("%d->'%s' %c" % (i, c, '\n' if (i%10==9) else ' '), end='')


Text total length: 18062680
Distinct chars: 98
Total sequences: 180626
Array x_train (178819, 40)
Array y_train (178819, 1)
Array x_test (1807, 40)
Array y_test (1807, 1)
Example at row: 2 X[2]: 'research at the Field Museum. The PI giv'  Y[2]: 'e'
index_to_char
0->'\t'  1->'\n'  2->' '  3->'!'  4->'"'  5->'#'  6->'$'  7->'%'  8->'&'  9->''' 
10->'('  11->')'  12->'*'  13->'+'  14->','  15->'-'  16->'.'  17->'/'  18->'0'  19->'1' 
20->'2'  21->'3'  22->'4'  23->'5'  24->'6'  25->'7'  26->'8'  27->'9'  28->':'  29->';' 
30->'<'  31->'='  32->'>'  33->'?'  34->'@'  35->'A'  36->'B'  37->'C'  38->'D'  39->'E' 
40->'F'  41->'G'  42->'H'  43->'I'  44->'J'  45->'K'  46->'L'  47->'M'  48->'N'  49->'O' 
50->'P'  51->'Q'  52->'R'  53->'S'  54->'T'  55->'U'  56->'V'  57->'W'  58->'X'  59->'Y' 
60->'Z'  61->'['  62->'\'  63->']'  64->'^'  65->'_'  66->'`'  67->'a'  68->'b'  69->'c' 
70->'d'  71->'e'  72->'f'  73->'g'  74->'h'  75->'i'  76->'j'  77->'k'  78->'l'  79->'m' 
80->'n'  81->'o'  82->'p'

### Define the Model
Uses a fully connected layer as input to the LSTM layers

Four LSTM layers in the middle provide the recurrence, so that each subsequent character is processed with a state based on the previous characters

Final layer reduces size to match the number of possible characters
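
Because the input layer multiplies a 1-hot vector by a weight matrix with no bias or activation, it simply selects one row of that matrix per character. The NumPy sketch below illustrates this with toy sizes (5 characters and 3 neurons rather than the notebook's values); the random weights and index array are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
char_count, num_neurons = 5, 3  # toy sizes chosen for illustration
w_0 = rng.normal(scale=0.01, size=(char_count, num_neurons))

indices = np.array([[0, 3, 3], [4, 1, 2]])  # shape (N=2, seq_len=3)
one_hot = np.eye(char_count)[indices]       # shape (2, 3, char_count)

# Same steps as the model: flatten, multiply by the weights, restore the sequence axis
projected = one_hot.reshape(-1, char_count) @ w_0
projected = projected.reshape(2, 3, num_neurons)

# Equivalent shortcut: look up the weight row for each character index
looked_up = w_0[indices]

print(np.allclose(projected, looked_up))  # True
```

In other words, the first layer behaves like a learned per-character embedding of size `num_neurons`.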

In [3]:
num_neurons = 512
num_layers = 4

tf.reset_default_graph()

with tf.variable_scope("TextModel"):
    # define input placeholders
    x_input = tf.placeholder("int32", [None, seq_len])
    target_output = tf.placeholder("int32", [None])
    keep_prob = tf.placeholder("float32")

    # convert to 1-hot
    x_input_1hot = tf.one_hot(x_input, char_count, on_value=1.0, off_value=0.0, dtype=tf.float32)
    ideal_output_1hot = tf.one_hot(target_output, char_count, on_value=1.0, off_value=0.0, dtype=tf.float32)

    # transform x_input_1hot which is (N, seq_len, char_count) => (N, seq_len, num_neurons)
    result = tf.reshape(x_input_1hot, [-1, char_count])
    w_0 = tf.Variable(tf.random_normal([char_count, num_neurons], stddev=0.01))
    result = tf.matmul(result, w_0)
    result = tf.reshape(result, [-1, seq_len, num_neurons])

    cells = []
    for _ in range(num_layers):
        mycell = LSTMCell(num_neurons, state_is_tuple=True)  # or BasicLSTMCell
        mycell = DropoutWrapper(mycell, output_keep_prob=keep_prob)
        cells.append(mycell)

    multi_cell = MultiRNNCell(cells, state_is_tuple=True)  # simple way of stacking multiple identical layers

    output, _ = tf.nn.dynamic_rnn(multi_cell, result, dtype=tf.float32)

    # transpose so that the sequence is the last dimension (N, num_neurons, seq_len)
    output = tf.transpose(output, [0, 2, 1])
    # truncate so we only get the last output from the sequence
    output = output[:, :, seq_len -1]

    # define transforming matrix with bias to convert (N, num_neurons) => (N, char_count)
    bias_1 = tf.Variable(tf.random_normal([char_count], stddev=0.01))
    w_1 = tf.Variable(tf.random_normal([num_neurons, char_count], stddev=0.01))

    # note that we don't take the softmax at the end because our cost fn does that for us
    output = tf.add(tf.matmul(output, w_1), bias_1)

    # compute costs
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=output, labels=ideal_output_1hot))
    
    # use an optimizer that has goal of reducing cost
    train_op = tf.train.AdamOptimizer().minimize(cost)
    
    # Samples a single output character with some randomness
    # large values (e.g. 10) effectively select the argmax and low values (e.g. .1) are effectively random
    randomness = 2  
    predict_op = tf.multinomial(output * randomness, 1)[:,0]  # keeps the one sample drawn for each row

    init_op = tf.global_variables_initializer()
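
The effect of the `randomness` scale in `predict_op` can be seen without TensorFlow. The sketch below is a NumPy approximation of the idea: it applies a softmax to scaled logits and draws one sample. The three logit values are made up for illustration, and `softmax` is a local helper, not a notebook function.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.0])  # hypothetical scores for three characters

probs_by_scale = {}
for randomness in (0.1, 2, 10):
    probs_by_scale[randomness] = softmax(logits * randomness)
    print(randomness, np.round(probs_by_scale[randomness], 3))

# Draw one character index using the scale from the notebook
rng = np.random.default_rng(0)
sample = rng.choice(len(logits), p=probs_by_scale[2])
print("sampled index:", sample)
```

Despite its name, a larger `randomness` value makes the distribution sharper, so sampling gets closer to always picking the most likely character, while a small value flattens the distribution toward uniform.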

### Showing Predictions
Defines a helper function that shows a few lines of predicted text.
Each line is a prediction that continues the initial text provided.
The provided text comes from the test data set, so it was not seen during training.

In [9]:
def show_predictions(sess):
    prediction_row_offset = 1000
    prediction_row_count = 10
    display_row_count = 10  # not all of the predicted lines need to be displayed
    prediction_char_count = 60

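    # Seed rows followed by empty slots for the predictions (integer dtype to match x_input)
    chars = np.append(x_test[prediction_row_offset:prediction_row_offset + prediction_row_count, :], 
                      np.zeros(shape=(prediction_row_count, prediction_char_count), dtype=x_test.dtype), axis=1)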

    for t in range(prediction_char_count):
        subchar = chars[:, t:t+seq_len]
        predictions = sess.run(predict_op, feed_dict={x_input: subchar, keep_prob: 1.0})

        # Take the generated output and add it on to the end of the array
        chars[:, t + seq_len] = predictions
    # Display a header line to indicate which characters are input (X) and which are predictions (Y)
    print("X"*seq_len + "Y"*prediction_char_count)
    display_strs(chars[0:display_row_count, 
                       0:seq_len + prediction_char_count])
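
The loop in `show_predictions` is a sliding window: each new character is predicted from the most recent `seq_len` characters, which include earlier predictions. Here is a minimal sketch of that mechanism with a stand-in predictor instead of the trained model. `generate` and `toy_predictor` are hypothetical names used only for this example.

```python
import numpy as np

def generate(seed, n_new, predict_next):
    """Extend each row of seed by n_new symbols, feeding predictions back in."""
    n_rows, seq_len = seed.shape
    chars = np.concatenate([seed, np.zeros((n_rows, n_new), dtype=seed.dtype)], axis=1)
    for t in range(n_new):
        window = chars[:, t:t + seq_len]  # the most recent seq_len symbols
        chars[:, t + seq_len] = predict_next(window)
    return chars

def toy_predictor(window):
    """Stand-in for the model: the next symbol is (last symbol + 1) mod 10."""
    return (window[:, -1] + 1) % 10

seed = np.array([[1, 2, 3], [7, 8, 9]], dtype=np.int32)
out = generate(seed, n_new=4, predict_next=toy_predictor)
print(out)
# [[1 2 3 4 5 6 7]
#  [7 8 9 0 1 2 3]]
```

After `seq_len` steps the window contains only generated characters, so any early mistake is carried forward into every later prediction.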

### Run Training
Loads state from any previous executions (so we don't start from scratch every time)

After each epoch, it:
* Measures accuracy on the test set
* Shows examples of predicted strings (e.g. adds 60 predicted characters)
* Saves its state

In [11]:
dropout = .5
batch_size = 128
test_size = 2048

# Define where the saver is going to save/load the state
name = "text_prediction_NSF"
saver_data_root_dir = "/home/eric/Documents/models/"
saver_data_dir = saver_data_root_dir + name
if not os.path.exists(saver_data_dir):
    os.makedirs(saver_data_dir)

train_count = len(x_train)
test_count = len(x_test)

with tf.Session() as sess:
    sess.run(init_op)

    # Try to load a previously saved state
    saver = tf.train.Saver()
    if os.path.exists(saver_data_dir + ".index"):
        saver.restore(sess, saver_data_dir)
        print("Model loaded from: '%s'" % saver_data_dir)
    else:
        print("No Model found at: '%s'" % saver_data_dir)
    
    # Shows the predictions at the beginning (before doing the training)
    print("\nInitial Predictions Before Training")
    show_predictions(sess)
    
    epoch_count = 2
    for i in range(epoch_count):
        # Do training based on batches that are created by forming tuples from both ranges
        for start, end in zip(range(0, train_count, batch_size), range(batch_size, train_count + 1, batch_size)):
            print("Cycling %d/%d: %d%%\r" % (i+1, epoch_count, (start*100//train_count)), end='', flush=True)
            sess.run(train_op, feed_dict={x_input: x_train[start:end], target_output: y_train[start:end],
                                          keep_prob: dropout})
        
        # Measure accuracy based on testing data set
        y_predicted = sess.run(predict_op, feed_dict={x_input: x_test[0:test_size], keep_prob: 1.0})
        prediction_accuracy = np.mean(np.equal(y_test[0:test_size], y_predicted))
        print("Completed %d/%d with prediction accuracy of %.3f" % (i+1, epoch_count, prediction_accuracy))

        show_predictions(sess)

        # Save the current state
        save_path = saver.save(sess, saver_data_dir)
        print("Model saved in file: %s" % save_path)

INFO:tensorflow:Restoring parameters from /home/eric/Documents/models/text_prediction_NSF
Model loaded from: '/home/eric/Documents/models/text_prediction_NSF'

Initial Predictions Before Training
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
inhibitor of the sugar transport protein information new development of the proposed research is a t
sugars via pores acclimate to cold or high second molecular processing between the initial organic g
a transport protein component. The focus of the proposed project will have experimental and properti
haracterizing a range of additional structures and the process need and a research in the both stude
 capacity, including the number and cross-scale clears to an accordition and the control of this wor
f pore connections, and membrane features and the proposed study of the proposed students will be co
se structural features may be augmented by the system in collective systems are the research in t
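
One detail of the training loop above is easy to miss: zipping the two ranges yields only full batches, so any rows left over at the end of the training set are skipped in every epoch. The sketch below reproduces just that bookkeeping; `batch_bounds` is a hypothetical helper name, not part of the notebook.

```python
def batch_bounds(train_count, batch_size):
    """Return the (start, end) pairs the way the training loop builds them."""
    return list(zip(range(0, train_count, batch_size),
                    range(batch_size, train_count + 1, batch_size)))

bounds = batch_bounds(train_count=10, batch_size=4)
print(bounds)  # [(0, 4), (4, 8)]; rows 8 and 9 are never used

# With the shapes printed earlier (178819 training rows, batch size 128)
full = batch_bounds(train_count=178819, batch_size=128)
print(len(full), 178819 - full[-1][1])  # number of batches, rows left over
```

With the sizes printed earlier, this gives 1397 full batches and leaves 3 rows unused, which is negligible here. Shuffling the rows each epoch would be one way to make sure every row is eventually seen.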

### Final Comments
This shows how a very basic 4-layer LSTM can generate text that looks rather scientific at first glance.  Although the sample above shows only two epochs, this analysis was performed using a model that was reloaded after having previously been trained for 100 epochs.

<b>Reading the output</b>

After every epoch, 10 text strings are displayed.  The portion of each string underneath "X" is the text that was given to the system.  The portion underneath "Y" is the text that was generated by the system.  Notice in which ways the generated text appears to make sense and in which ways it is gibberish.  When read carefully, it can be seen that the model is just following simple patterns that it has seen repeated many times in the training text.

<b>Future improvement through Reinforcement Learning (RL)</b>

One fundamental limitation of this approach is that training only ever tries to reduce a cost function that is assessed solely on the next individual character.  To encourage more logical text, the system would need reinforcement based on a longer-term view (e.g. Reinforcement Learning).  This would also match how a human assesses a sample of text: only after seeing many words, not after reading a single character.

<b>Future improvement through Generative Adversarial Network (GAN)</b>

A second fundamental limitation of this approach is that the system is trying to mimic what other humans wrote.  Even if the generated text is very logical, it is penalized when it doesn't match what was written in one specific text.  Even a human would have a very difficult time guessing what was previously written.  A better approach would be a Generative Adversarial Network (GAN), where the generated text is judged on whether it can be identified as computer generated.  This technique has worked well for generating images.