# TensorFlow Workshop - LSTM for Text Sequences
2019/08/14

> [ Reference ] :
1. Tom Hope, Yehezkel S. Resheff, and Itay Lieder, "**`Learning TensorFlow : A Guide to Building Deep Learning Systems`**", Chapter 5, O'Reilly, 2017.
2. Victor Zhou, "**`An Introduction to Recurrent Neural Networks for Beginners`**", Towards Data Science, 2019/07/25. https://towardsdatascience.com/an-introduction-to-recurrent-neural-networks-for-beginners-664d717adbd
3. Wikipedia, "**`Long short-term memory`**", 2019. https://en.wikipedia.org/wiki/Long_short-term_memory

## 1. Text Sequences
+ Our simulated data consists of two classes of very short “sentences,” one composed of odd digits and the other of even digits (with numbers written in English). 
+ We generate sentences built of words representing even and odd numbers. 


+ **Our goal** is *to learn to classify each sentence as either odd or even in a supervised text-classification task*.

In [1]:
import numpy as np
import tensorflow as tf

# Log errors only, hiding the deprecation warnings TensorFlow 1.x prints for old-style APIs such as tensorflow.examples.tutorials.mnist
old_v = tf.logging.get_verbosity()          
tf.logging.set_verbosity(tf.logging.ERROR)

batch_size = 128
embedding_dimension = 64
num_classes = 2
hidden_layer_size = 32
times_steps = 6
element_size = 1



+ Next, we create sentences. We sample random digits and map them to the corresponding
“words” (e.g., 1 is mapped to “One,” 7 to “Seven,” etc.).

In [2]:
digit_to_word_map = {1:"One",2:"Two", 3:"Three", 4:"Four", 5:"Five",
                     6:"Six",7:"Seven",8:"Eight",9:"Nine"}

+ Text sequences typically have variable lengths, which is of course the case for all real natural language data (such as in the sentences appearing on this page).

> + To make our simulated sentences have different lengths, we sample for each sentence a random length between 3 and 6 with **`np.random.choice(range(3, 7))`**—the lower bound is inclusive, and the upper bound is exclusive.


> + Now, to put all our input sentences in one tensor (per batch of data instances), we need them to somehow be of the same size—so we pad sentences with a length shorter than 6 with zeros (or PAD symbols) to make all sentences equally sized (artificially). This pre-processing step is known as **zero-padding**. 
    
The following code accomplishes all of this:

In [3]:
digit_to_word_map[0]="PAD"
even_sentences = []
odd_sentences = []
seqlens = []

for i in range(10000):
    rand_seq_len = np.random.choice(range(3,7))
    seqlens.append(rand_seq_len)
    rand_odd_ints = np.random.choice(range(1,10,2),rand_seq_len)
    rand_even_ints = np.random.choice(range(2,10,2),rand_seq_len)
    
    # Padding
    if rand_seq_len<6:
        rand_odd_ints = np.append(rand_odd_ints,
                                  [0]*(6-rand_seq_len))
        rand_even_ints = np.append(rand_even_ints,
                                  [0]*(6-rand_seq_len))
    even_sentences.append(" ".join([digit_to_word_map[r] 
                                    for r in rand_even_ints]))
    odd_sentences.append(" ".join([digit_to_word_map[r] 
                                   for r in rand_odd_ints]))
    
data = even_sentences+odd_sentences
# Same seq lengths for even, odd sentences
seqlens*=2

Let’s take a look at our sentences, each padded to length 6:

In [4]:
even_sentences[0:6]   

['Eight Four Six Six Four PAD',
 'Two Six Six Two Six Two',
 'Six Eight Six Two PAD PAD',
 'Four Two Two PAD PAD PAD',
 'Two Eight Six Two PAD PAD',
 'Eight Six Six PAD PAD PAD']

In [5]:
odd_sentences[0:6]   

['Five Nine Nine One One PAD',
 'One Seven Seven One Three Seven',
 'Three One Nine Three PAD PAD',
 'Five Seven One PAD PAD PAD',
 'Seven Five Three Nine PAD PAD',
 'Nine Three Nine PAD PAD PAD']

> + Notice that we add the **PAD** word (token) to our data and `digit_to_word_map` dictionary, and separately store even and odd sentences and their original lengths (before padding).

Let’s take a look at the original sequence lengths for the sentences we printed:

In [6]:
seqlens[0:6]  # Same seq lengths for even, odd sentences

[5, 6, 4, 3, 4, 3]
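
The padding rule used above can be isolated into a small helper. The following is a minimal sketch and not part of the original notebook; `pad_sequence`, `MAX_LEN`, and the reduced `digit_to_word` dictionary are hypothetical names introduced only for illustration.

```python
MAX_LEN = 6  # assumed maximum sentence length, matching the notebook

def pad_sequence(ids, max_len=MAX_LEN, pad_id=0):
    """Right-pad a list of digit IDs with pad_id until it has max_len entries."""
    ids = list(ids)
    if len(ids) > max_len:
        raise ValueError("sequence is longer than max_len")
    return ids + [pad_id] * (max_len - len(ids))

# Reduced word map (even digits plus the PAD token), for illustration only
digit_to_word = {0: "PAD", 2: "Two", 4: "Four", 6: "Six", 8: "Eight"}

padded = pad_sequence([8, 4, 6])
sentence = " ".join(digit_to_word[d] for d in padded)
print(padded)    # [8, 4, 6, 0, 0, 0]
print(sentence)  # Eight Four Six PAD PAD PAD
```

The original length (3 in this example) is exactly what `seqlens` records before the padding is appended.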

### Q : Why keep the original sentence lengths? 
> + By zero-padding, we solved one technical
problem but created another: if we naively pass these padded sentences through our
RNN model as they are, it will process useless **PAD** symbols. 


> + This would both harm model correctness by processing “*noise*” and increase computation time. We resolve this issue by first storing the original lengths in the seqlens array and then telling TensorFlow’s **`tf.nn.dynamic_rnn()`** where each sentence ends.
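
To make the effect of passing sequence lengths concrete, here is a conceptual NumPy sketch of the masking idea. It is not TensorFlow's implementation; `mask_padded_steps` is a hypothetical helper, and the all-ones `outputs` array merely stands in for real RNN outputs.

```python
import numpy as np

def mask_padded_steps(outputs, seqlens):
    """Zero out every time step at or beyond each sequence's true length.

    outputs: array of shape [batch, time, hidden]
    seqlens: unpadded lengths, shape [batch]
    """
    time_steps = outputs.shape[1]
    # mask[b, t] is True while position t is a real (non-PAD) word of sequence b
    mask = np.arange(time_steps)[None, :] < np.asarray(seqlens)[:, None]
    return outputs * mask[:, :, None]

outputs = np.ones((2, 6, 3), dtype=np.float32)   # stand-in for RNN outputs
masked = mask_padded_steps(outputs, seqlens=[5, 3])
print(masked[0, :, 0].tolist())  # [1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
print(masked[1, :, 0].tolist())  # [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
```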

+ Our data is simulated (generated by us). In real applications, we would start off by getting a collection of documents (e.g., one-sentence tweets) and then mapping each word to an integer ID.


+ So, we now **map words to indices**—word `identifiers`—by simply creating a dictionary with words as keys and indices as values. 
+ We also create the **inverse map**.

In [7]:
# Map from words to indices
word2index_map ={}
index=0
for sent in data:
    for word in sent.lower().split():
        if word not in word2index_map:
            word2index_map[word] = index
            index+=1
            
# Inverse map
index2word_map = {index: word for word, index in word2index_map.items()}
vocabulary_size = len(index2word_map)
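
As a quick illustration of how the two maps are used, the sketch below builds them for a tiny stand-in corpus and round-trips one sentence. The `sentences` list and the shortened names `word2index` and `index2word` are assumptions made for this example, not the notebook's actual data.

```python
sentences = ["Two Six PAD", "One Three PAD"]  # tiny stand-in for `data`

word2index = {}
for sent in sentences:
    for word in sent.lower().split():
        # setdefault assigns the next free index only to words not seen before
        word2index.setdefault(word, len(word2index))

index2word = {index: word for word, index in word2index.items()}

encoded = [word2index[w] for w in "Six Two PAD".lower().split()]
decoded = " ".join(index2word[i] for i in encoded)
print(word2index)  # {'two': 0, 'six': 1, 'pad': 2, 'one': 3, 'three': 4}
print(encoded)     # [1, 0, 2]
print(decoded)     # six two pad
```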

### This is a supervised classification task—we need an array of labels in the `one-hot` format, train and test sets, a function to generate batches of instances, and placeholders, as usual.

+ First, we create the labels and split the data into train and test sets:

In [8]:
labels = [1]*10000 + [0]*10000
for i in range(len(labels)):
    label = labels[i]
    one_hot_encoding = [0]*2
    one_hot_encoding[label] = 1
    labels[i] = one_hot_encoding
    
data_indices = list(range(len(data)))
np.random.shuffle(data_indices)
data = np.array(data)[data_indices]

labels = np.array(labels)[data_indices]
seqlens = np.array(seqlens)[data_indices]
train_x = data[:10000]
train_y = labels[:10000]
train_seqlens = seqlens[:10000]

test_x = data[10000:]
test_y = labels[10000:]
test_seqlens = seqlens[10000:]
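
The explicit loop above can also be written as a single indexing operation. This is an alternative sketch rather than the notebook's method: rows of an identity matrix are the one-hot vectors, and applying one shared permutation keeps labels aligned with their instances. The four-element `class_ids` array is illustrative only.

```python
import numpy as np

num_classes = 2
class_ids = np.array([1, 1, 0, 0])   # 1 = even sentence, 0 = odd sentence

# Row k of the identity matrix is the one-hot vector for class k
one_hot = np.eye(num_classes, dtype=int)[class_ids]
print(one_hot.tolist())  # [[0, 1], [0, 1], [1, 0], [1, 0]]

# Shuffle with a single permutation so every array stays aligned
rng = np.random.RandomState(0)
perm = rng.permutation(len(class_ids))
shuffled_ids = class_ids[perm]
shuffled_one_hot = one_hot[perm]
```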

+ Next, we create a function that generates batches of sentences. Each sentence in a
batch is simply a list of integer IDs corresponding to words:

In [9]:
def get_sentence_batch(batch_size,data_x, data_y, data_seqlens):
    instance_indices = list(range(len(data_x)))
    np.random.shuffle(instance_indices)
    batch = instance_indices[:batch_size]
    x = [[word2index_map[word] for word in data_x[i].lower().split()]
         for i in batch]
    y = [data_y[i] for i in batch]
    seqlens = [data_seqlens[i] for i in batch]
    return x,y,seqlens
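
Shuffling all indices and keeping the first `batch_size` of them amounts to sampling without replacement. A possible shorthand, shown here only as a sketch, draws the indices directly; `sample_batch_indices` is a hypothetical helper name.

```python
import numpy as np

def sample_batch_indices(num_instances, batch_size, rng=np.random):
    """Draw batch_size distinct instance indices uniformly at random."""
    return rng.choice(num_instances, size=batch_size, replace=False)

batch = sample_batch_indices(num_instances=10000, batch_size=128)
print(len(batch), len(set(batch.tolist())))  # 128 128
```

Either way, each call draws a fresh random batch, so an instance can appear in several batches while another is never seen; there is no notion of an epoch here.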

+ Finally, we create placeholders for data:

In [10]:
_inputs = tf.placeholder(tf.int32, shape=[batch_size,times_steps])
_labels = tf.placeholder(tf.float32, shape=[batch_size, num_classes])

# seqlens for dynamic calculation
_seqlens = tf.placeholder(tf.int32, shape=[batch_size])

## 2. Supervised Word Embeddings

+ Our text data is now encoded as lists of word IDs—each sentence is a sequence of integers corresponding to words.
+ **We could end up with millions of such word IDs, each encoded in `one-hot` (binary) categorical form, leading to great data sparsity and computational issues.**
#### [ NOTE ]:
> + A powerful approach to work around this issue is to use **word embeddings**, *which makes the high-dimensional one-hot vectors “embedded” into a continuous vector space with a much lower dimensionality*.
> + In **Ref. 1's Chapter 6** we dive deeper into word embeddings, exploring a popular method to train them in an “*unsupervised*” manner known as **`word2vec`**.

+ Here, our end goal is *to solve a text classification problem, and we will train word vectors in a supervised framework, tuning the embedded word vectors to solve the downstream classification task*.


+ Previously, we gave each word an integer index, and sentences are then represented as sequences of these indices. 


#### `tf.nn.embedding_lookup()` function
+ It is helpful to think of word embeddings as *basic hash tables* or *lookup tables*, mapping words to their dense vector values. These vectors are optimized as part of the training process.
+ **Now, to obtain a word’s vector, we use the built-in `tf.nn.embedding_lookup()` function, which efficiently retrieves the vectors for each word in a given sequence of word indices**:

In [11]:
with tf.name_scope("embeddings"):
    embeddings = tf.Variable(
            tf.random_uniform([vocabulary_size,
                               embedding_dimension],
                               -1.0, 1.0),name='embedding')
    embed = tf.nn.embedding_lookup(embeddings, _inputs)
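
Conceptually, the lookup is just row selection from the embedding matrix. The NumPy sketch below mimics that behavior with small, made-up sizes; it illustrates the idea and is not how TensorFlow implements the op.

```python
import numpy as np

vocabulary_size, embedding_dimension = 10, 4   # small illustrative sizes
rng = np.random.RandomState(0)
embeddings = rng.uniform(-1.0, 1.0,
                         size=(vocabulary_size, embedding_dimension))

# A batch of 2 sentences, each a sequence of 6 word IDs
inputs = np.array([[3, 1, 4, 1, 0, 0],
                   [2, 7, 1, 8, 2, 8]])

# Integer-array indexing picks one embedding row per word ID
embed = embeddings[inputs]
print(embed.shape)  # (2, 6, 4)
```

The result has shape `[batch, time, embedding_dimension]`, which is the layout `dynamic_rnn()` expects as input by default.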

## 3. LSTM and Using Sequence Length

+ A very popular recurrent network is the **`long short-term memory (LSTM) network`**. 
    + It differs from a vanilla RNN by having special *memory mechanisms* that enable the recurrent cells to better store information for long periods of time, thus allowing them to capture long-term dependencies better than a plain RNN.
    + **These memory mechanisms simply consist of some more parameters added to each recurrent cell, enabling the RNN to overcome optimization issues and propagate information.** 
    + **These trainable parameters act as filters that select what information is worth “*remembering*” and passing on, and what is worth “*forgetting*.”**
+ **They are trained in exactly the same way as any other parameter in a network, with gradient-descent algorithms and backpropagation.**
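
To show what these "filters" look like in practice, here is a minimal NumPy sketch of a single LSTM time step using the standard gate equations. The function name `lstm_step`, the random weights, and the order in which the four gate blocks are laid out in `kernel` are assumptions made for illustration; they are not taken from TensorFlow's source.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, kernel, bias, forget_bias=1.0):
    """One LSTM time step for a single example.

    x: [input_size]; h_prev, c_prev: [hidden_size]
    kernel: [input_size + hidden_size, 4 * hidden_size]; bias: [4 * hidden_size]
    """
    z = np.concatenate([x, h_prev]) @ kernel + bias
    i, g, f, o = np.split(z, 4)   # input gate, candidate, forget gate, output gate
    # The forget gate scales the old memory; the input gate scales the new candidate
    c = sigmoid(f + forget_bias) * c_prev + sigmoid(i) * np.tanh(g)
    # The output gate decides how much of the memory is exposed as the output
    h = sigmoid(o) * np.tanh(c)
    return h, c

input_size, hidden_size = 64, 32   # embedding and hidden sizes used in this notebook
rng = np.random.RandomState(0)
kernel = rng.normal(scale=0.1, size=(input_size + hidden_size, 4 * hidden_size))
bias = np.zeros(4 * hidden_size)

h = c = np.zeros(hidden_size)
for x in rng.normal(size=(6, input_size)):   # six time steps of random input
    h, c = lstm_step(x, h, c, kernel, bias)
print(h.shape, c.shape)  # (32,) (32,)
```

Every entry of `kernel` and `bias` is an ordinary trainable parameter, which is why the gates can be learned with backpropagation like the rest of the network.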

**Create an LSTM cell with `tf.contrib.rnn.BasicLSTMCell()` and feed it to `tf.nn.dynamic_rnn()`.** 

+ We also give **`dynamic_rnn()`** the length of each sequence in a batch of examples, using the **`_seqlens`** placeholder we created earlier. 
    + TensorFlow uses this to stop all RNN steps beyond the last real sequence element. 
    + It also returns all output vectors over time (in the outputs tensor), which are all zero-padded beyond the true end of the sequence.


+ So, for example, if the length of our original sequence is 5 and we zero-pad it to a sequence of length 15, the output for all time steps beyond 5 will be zero:

In [12]:
with tf.variable_scope("lstm"):
    lstm_cell = tf.contrib.rnn.BasicLSTMCell(hidden_layer_size,
                                              forget_bias=1.0)
    outputs, states = tf.nn.dynamic_rnn(lstm_cell, embed,
                                        sequence_length = _seqlens,
                                        dtype=tf.float32)
weights = {
'linear_layer': tf.Variable(tf.truncated_normal([hidden_layer_size,
                            num_classes], mean=0,stddev=.01))
}

biases = {
'linear_layer':tf.Variable(tf.truncated_normal([num_classes],
                           mean=0,stddev=.01))
}

# Extract the last relevant output and use in a linear layer
final_output = tf.matmul(states[1],
                         weights["linear_layer"]) + biases["linear_layer"]
softmax = tf.nn.softmax_cross_entropy_with_logits(logits = final_output,
                                                  labels = _labels)
cross_entropy = tf.reduce_mean(softmax)





> **[ NOTE ] :  We take the last valid output vector — in this case conveniently available for us in the `states` tensor returned by `dynamic_rnn()` — and pass it through a linear layer (and the softmax function), using it as our final prediction.**
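
Note that the tensor named `softmax` in the cell above holds per-example cross-entropy losses rather than probabilities. The NumPy sketch below shows, under the usual definition of softmax cross-entropy with one-hot labels, what that quantity is; the helper name and the toy logits are illustrative assumptions.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Per-example cross-entropy between one-hot labels and softmax(logits)."""
    # Subtracting the row maximum leaves softmax unchanged and avoids overflow
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)

logits = np.array([[2.0, 0.0],    # confident in class 0
                   [0.0, 0.0]])   # undecided
labels = np.array([[1.0, 0.0],
                   [0.0, 1.0]])

losses = softmax_cross_entropy(logits, labels)
cross_entropy = losses.mean()     # mirrors tf.reduce_mean over the batch
print(losses, cross_entropy)
```

The undecided example costs `log(2)`, while the confident and correct one costs much less, which is the signal the optimizer uses.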

## 4. Training Embeddings and the LSTM Classifier

In [13]:
train_step = tf.train.RMSPropOptimizer(0.001, 0.9).minimize(cross_entropy)

correct_prediction = tf.equal(tf.argmax(_labels,1), tf.argmax(final_output,1))
accuracy = (tf.reduce_mean(tf.cast(correct_prediction, tf.float32)))*100
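
The accuracy computation compares the index of the largest logit with the index of the 1 in each one-hot label. A small NumPy sketch of the same arithmetic, using made-up values:

```python
import numpy as np

labels = np.array([[0, 1], [1, 0], [0, 1], [1, 0]])   # one-hot targets
logits = np.array([[0.2, 1.5],
                   [2.0, -1.0],
                   [0.9, 0.1],    # predicts class 0, but the target is class 1
                   [0.3, 0.1]])

correct = labels.argmax(axis=1) == logits.argmax(axis=1)
accuracy = correct.astype(np.float32).mean() * 100
print(accuracy)  # 75.0
```

Because `argmax` is unaffected by the softmax, the raw logits in `final_output` are enough for prediction.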

In [14]:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    
    for step in range(1000):
        x_batch, y_batch,seqlen_batch = get_sentence_batch(batch_size,
                                         train_x,train_y, train_seqlens)
        sess.run(train_step,feed_dict={_inputs:x_batch, _labels:y_batch,
                                       _seqlens:seqlen_batch})
        
        if step % 100 == 0:
            acc = sess.run(accuracy,feed_dict={_inputs:x_batch,
                                               _labels:y_batch,
                                               _seqlens:seqlen_batch})
            print("Accuracy at %d: %.5f" % (step, acc))
        
    for test_batch in range(5):
        x_test, y_test,seqlen_test = get_sentence_batch(batch_size,
                                                        test_x,test_y,
                                                        test_seqlens)
        batch_pred,batch_acc = sess.run([tf.argmax(final_output,1), accuracy],
                                         feed_dict={_inputs:x_test,
                                                    _labels:y_test,
                                                    _seqlens:seqlen_test})
        print("Test batch accuracy %d: %.5f" % (test_batch, batch_acc))
    
    output_example = sess.run([outputs],feed_dict={_inputs:x_test,
                                                   _labels:y_test,
                                                   _seqlens:seqlen_test})
    states_example = sess.run([states[1]],feed_dict={_inputs:x_test,
                                                     _labels:y_test,
                                                     _seqlens:seqlen_test})

Accuracy at 0: 32.03125
Accuracy at 100: 100.00000
Accuracy at 200: 100.00000
Accuracy at 300: 100.00000
Accuracy at 400: 100.00000
Accuracy at 500: 100.00000
Accuracy at 600: 100.00000
Accuracy at 700: 100.00000
Accuracy at 800: 100.00000
Accuracy at 900: 100.00000
Test batch accuracy 0: 100.00000
Test batch accuracy 1: 100.00000
Test batch accuracy 2: 100.00000
Test batch accuracy 3: 100.00000
Test batch accuracy 4: 100.00000


+ Let’s take a look at one example of these outputs, for a sentence that was zero-padded (in your random batch of data you may see different output, of course—look for a sentence whose seqlen was lower than the maximal 6):

In [15]:
seqlen_test[1]

5

In [16]:
output_example[0][1].shape

(6, 32)

+ This output has, as expected, six time steps, each a vector of size 32. 

Let’s take a glimpse at its values (printing only the first few dimensions to avoid clutter):

In [17]:
output_example[0][1][:6,0:3]

array([[-0.45283183,  0.18316785, -0.41409463],
       [-0.759822  ,  0.3627894 , -0.73591554],
       [-0.86810327,  0.5210694 , -0.8492956 ],
       [-0.8953353 ,  0.59556055, -0.88158625],
       [-0.9370094 ,  0.6916338 , -0.86893153],
       [ 0.        ,  0.        ,  0.        ]], dtype=float32)

+ We see that for this sentence, whose original length was 5, the last time step is a zero vector due to padding.

Finally, we look at the states vector returned by `dynamic_rnn()`:

In [18]:
states_example[0][1][0:3]

array([-0.9370094 ,  0.6916338 , -0.86893153], dtype=float32)

+ **We can see that it conveniently stores for us the last relevant output vector — its values match the last relevant output vector before zero-padding.**
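
If only the padded `outputs` array were available, the same vector could be recovered by indexing each sequence at its own last real step. The sketch below shows this with toy data; `last_relevant` is a hypothetical helper, not a TensorFlow function.

```python
import numpy as np

def last_relevant(outputs, seqlens):
    """Pick outputs[b, seqlens[b] - 1] for every sequence b in the batch."""
    batch_idx = np.arange(outputs.shape[0])
    return outputs[batch_idx, np.asarray(seqlens) - 1]

# Toy outputs: step t of sequence b holds 10*b + t + 1, and zeros after its end
seqlens = np.array([5, 3])
outputs = np.zeros((2, 6, 1))
for b, n in enumerate(seqlens):
    outputs[b, :n, 0] = 10 * b + np.arange(1, n + 1)

last = last_relevant(outputs, seqlens)
print(last.ravel().tolist())  # [5.0, 13.0]
```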

### At this point, you may be wondering how to access and manipulate the word vectors and explore the trained representation. 
> [ Hint ] : For how to do so, including interactive embedding visualization, please refer to **Ref. 1's Chapter 6**.

--------------------------------
## [ EXERCISE ] : Stacking multiple LSTMs

+ Earlier, we focused on a one-layer LSTM network for ease of exposition. 
+ Adding more layers is straightforward, using the **`MultiRNNCell()`** wrapper that combines multiple RNN cells into one multilayer cell.

Say, for example, we wanted to stack two LSTM layers in the preceding example. We
can do this as follows:

In [None]:
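num_LSTM_layers = 2
with tf.variable_scope("lstm"):
    # Build one cell object per layer. Repeating a single object
    # ([lstm_cell]*num_LSTM_layers) makes the layers share one set of weights
    # in later TF 1.x releases, which breaks when layer input sizes differ
    # (64 for the embeddings vs. 32 for the hidden layer here).
    lstm_cells = [tf.contrib.rnn.BasicLSTMCell(hidden_layer_size,
                                               forget_bias=1.0)
                  for _ in range(num_LSTM_layers)]
    cell = tf.contrib.rnn.MultiRNNCell(cells=lstm_cells,
                                       state_is_tuple=True)
    outputs, states = tf.nn.dynamic_rnn(cell, embed,
                                        sequence_length = _seqlens,
                                        dtype=tf.float32)

+ We first define one LSTM cell per layer, each configured as before, and then feed the list of cells into the `tf.contrib.rnn.MultiRNNCell()` wrapper.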

+ Now our network has two LSTM layers, so `states` holds one state tuple per layer rather than a single tuple. To get the final state of the second layer, we simply adapt our indexing a bit:

In [None]:
# Extract the final state and use in a linear layer
final_output = tf.matmul(states[num_LSTM_layers-1][1],
                         weights["linear_layer"]) + biases["linear_layer"]
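
To see why the extra index is needed, the following plain-Python sketch mimics the nested structure of `states` for a stacked LSTM: one `(c, h)` pair per layer. The `LSTMState` namedtuple and the string placeholders are stand-ins introduced for illustration; in TensorFlow the fields hold tensors.

```python
from collections import namedtuple

# Stand-in for the (c, h) pair each LSTM layer returns as its final state
LSTMState = namedtuple("LSTMState", ["c", "h"])

num_LSTM_layers = 2
states = tuple(LSTMState(c="c_layer%d" % k, h="h_layer%d" % k)
               for k in range(num_LSTM_layers))

top = states[num_LSTM_layers - 1]   # final state of the last (top) layer
print(top[1])            # h_layer1
print(top.h == top[1])   # True
```

Index `[1]` (or the field `h`) selects the hidden state, which is the vector passed on to the linear layer.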