# Long Short-Term Memory (LSTM)
An LSTM is a special kind of recurrent neural network (RNN). The major difference between an LSTM and a plain RNN is that the LSTM has a special mechanism called the **forget gate**. Like RNNs, LSTMs have a **hidden state** that is passed on to the next time step. Unlike in an RNN, however, information is filtered by the forget gate, which lets through only the information that is useful. Let's see how this helps the LSTM avoid vanishing and exploding gradients.

### Input gate in LSTM
First, the input is squashed between -1 and 1 using a tanh activation function. This can be expressed by:

$$g = \tanh(b^g+x_tU^g+y_{t-1}V^g)$$

Where **$U^g$** and **$V^g$** are the weights for the input and previous cell output, respectively, and **$b^g$** is the input bias. Note that the superscripts **g** are not powers; they signify that these are the weights and bias of the squashed input (as opposed to those of the input gate, forget gate, output gate, etc.).

This squashed input is then multiplied element-wise by the output of the input gate. The input gate is a hidden layer of sigmoid-activated nodes that takes weighted **$x_t$** and **$y_{t-1}$** values as input and outputs values between 0 and 1. Multiplied element-wise by the squashed input, these values determine which inputs are switched on and which are switched off. In other words, it is a kind of input filter or gate. The expression for the input gate is:

$$i = \sigma(b^i + x_tU^i+ y_{t-1}V^i)$$

### The hidden state and the forget gate
The forget gate is again a set of sigmoid-activated nodes. Its output is multiplied element-wise by the hidden state of the previous time step **$s_{t-1}$** to determine which previous states should be remembered (forget gate output close to 1) and which should be forgotten (forget gate output close to 0). This allows the LSTM cell to learn appropriate context. The expression for the forget gate is:

$$f = \sigma(b^f + x_tU^f + y_{t-1}V^f)$$

So the hidden state of the current moment is:

$$s_t = s_{t-1}\circ f + g \circ i$$

Where $\circ$ denotes element-wise multiplication.
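
One way to see why this update helps against vanishing gradients is sketched below. It relies on a simplifying assumption that is not part of the equations above: the gates $f$ and $i$ and the squashed input $g$ are treated as constants with respect to $s_{t-1}$ (in the full cell they also depend on it indirectly through $y_{t-1}$). The subscript on $f$ marks the time step at which the forget gate is evaluated.

$$\frac{\partial s_t}{\partial s_{t-1}} \approx \mathrm{diag}(f_t), \qquad \frac{\partial s_t}{\partial s_{t-k}} \approx \prod_{j=0}^{k-1} \mathrm{diag}(f_{t-j})$$

Along this direct path the gradient is only rescaled by forget gate values between 0 and 1. It is not repeatedly multiplied by a weight matrix and an activation derivative as in a plain RNN, so a forget gate close to 1 lets the gradient pass through many time steps almost unchanged.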

### The output gate in LSTM

The final stage of the LSTM cell is the output gate. The output gate has two components: another tanh squashing function and an output sigmoid gating function. The output of the sigmoid gating function, like that of the other gates in the cell, is multiplied element-wise by the squashed state $\tanh(s_t)$ to determine which values of the state are output from the cell.

The expression for the output gate is:

$$o = \sigma(b^o + x_tU^o + y_{t-1}V^o)$$

So the final output of the cell is:

$$y_t = \tanh(s_t)\circ o$$
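
The six equations above can be put together in a few lines of NumPy. The following is a minimal sketch of one LSTM time step, not the implementation TensorFlow uses internally. The function names `lstm_step` and `init_params`, the dictionary layout of the parameters, and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, maps values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, y_prev, s_prev, params):
    """One LSTM time step following the equations above.

    x_t:    [batch, input_size]   current input
    y_prev: [batch, hidden_size]  previous cell output y_{t-1}
    s_prev: [batch, hidden_size]  previous state s_{t-1}
    params: dict holding U, V and b for each of g, i, f, o (hypothetical layout)
    """
    def affine(k):
        return params["b" + k] + x_t @ params["U" + k] + y_prev @ params["V" + k]

    g = np.tanh(affine("g"))      # squashed input
    i = sigmoid(affine("i"))      # input gate
    f = sigmoid(affine("f"))      # forget gate
    o = sigmoid(affine("o"))      # output gate
    s_t = s_prev * f + g * i      # new state
    y_t = np.tanh(s_t) * o        # new output
    return y_t, s_t

def init_params(input_size, hidden_size, seed=0):
    """Small random weights and zero biases for the four parameter sets."""
    rng = np.random.default_rng(seed)
    params = {}
    for k in "gifo":
        params["U" + k] = rng.normal(0, 0.1, (input_size, hidden_size))
        params["V" + k] = rng.normal(0, 0.1, (hidden_size, hidden_size))
        params["b" + k] = np.zeros(hidden_size)
    return params

params = init_params(input_size=4, hidden_size=3)
x = np.ones((2, 4))        # batch of 2 inputs
y0 = np.zeros((2, 3))      # no previous output yet
s0 = np.zeros((2, 3))      # no previous state yet
y1, s1 = lstm_step(x, y0, s0, params)
print(y1.shape, s1.shape)
```

Both returned arrays have shape `[batch, hidden_size]`, so they can be fed straight back in as `y_prev` and `s_prev` for the next time step.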

In [1]:
import numpy as np
import tensorflow as tf
import collections
import math
import datetime as dt

# Preparing the corpus

Before we start to set up our LSTM network, we first need to create the word vector representations (embeddings) of the words in the corpus. Those word embeddings are the inputs to our LSTM network.

## Text processing
This part mainly does two things:
* Read the corpus file.
* Divide the corpus into individual words.

The corpus used in this tutorial can be found in my GitHub repository, or you can download it from mattmahoney.net/dc/text8.zip (not available in mainland China).
    
    

In [2]:
# Read the file
with open("text8", "r") as f:
    rawData = f.read()
print("The first 100 letters in the corpus are:")
print(rawData[:100])

The first 100 letters in the corpus are:
 anarchism originated as a term of abuse first used against early working class radicals including t


In [3]:
# Convert the raw data to a native string with TensorFlow's compat helper,
# then split it into individual words
dataStr = tf.compat.as_str(rawData)  # Convert to string
print('The first 10 words in the corpus')
data = dataStr.split()  # Split on whitespace
print(data[:10])

The first 10 words in the corpus
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']


## Build Word2Index & Index2Word dictionary
Words need to be represented by something the computer understands, namely numbers. So we use numbers (we call them indices) to represent words. In this part, the following three tasks are accomplished:
* Select the 9999 most frequent words (plus one 'less frequent words' class).
* Use indices to represent each word: build the Word2Index dictionary.
* Build the inverse dictionary, the Index2Word dictionary.

In [4]:
counted_words = collections.Counter(data)  #counted_words is a dictionary {'word1': num_1, 'word2': num_2, ...}
print('%d different words were found in the corpus'%(len(counted_words)))
print()
print('Part of the dictionary:')
dict(list(counted_words.items())[0:10])    #print 10 of them

253854 different words were found in the corpus

Part of the dictionary:


{'anarchism': 303,
 'originated': 572,
 'as': 131815,
 'a': 325873,
 'term': 7219,
 'of': 593677,
 'abuse': 563,
 'first': 28810,
 'used': 22737,
 'against': 8432}

In [5]:
freq_counted_words = dict(counted_words.most_common(9999))
print('There are %d words in the frequent words dictionary'%(len(freq_counted_words)))

There are 9999 words in the frequent words dictionary


In [6]:
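# Count the distinct words that fall outside the 9999 most frequent ones
lfw_count = 0
for word in counted_words:
    if word not in freq_counted_words:
        lfw_count += 1
# 'lfw' (less frequent words) goes first so that it receives index 0
word_count_dict = {'lfw': lfw_count}
word_count_dict.update(freq_counted_words)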
print('There are %d words in the dictionary(plus less frequent words class)'%(len(word_count_dict)))
print()
print('The first 5 entries in the dictionary is:')
dict(list(word_count_dict.items())[:5])

There are 10000 words in the dictionary(plus less frequent words class)

The first 5 entries in the dictionary is:


{'lfw': 243855, 'the': 1061396, 'of': 593677, 'and': 416629, 'one': 411764}

In [7]:
#index words with numbers aka establish the word2number projection
ind = 0
word_dict = {}
for word in word_count_dict:
    word_dict.update({word: ind})
    ind += 1
print('The first 5 entries in the dictionary "word_dict" is: ')
dict(list(word_dict.items())[:5])

The first 5 entries in the dictionary "word_dict" is: 


{'lfw': 0, 'the': 1, 'of': 2, 'and': 3, 'one': 4}

In [8]:
# Build reverse dictionary --- Index2Word dictionary
index_dict = dict(zip(word_dict.values(), word_dict.keys()))
print('The first 10 entries in dictionary index_dict is: ')
dict(list(index_dict.items())[:10])

The first 10 entries in dictionary index_dict is: 


{0: 'lfw',
 1: 'the',
 2: 'of',
 3: 'and',
 4: 'one',
 5: 'in',
 6: 'a',
 7: 'to',
 8: 'zero',
 9: 'nine'}

## Convert the text sequence into an index sequence

Now that we have the Word2Index dictionary, we can represent the corpus as a sequence of indices.

In [9]:
num_corpus = []
for word in data:
    # Words outside the vocabulary map to the 'lfw' index
    num_corpus.append(word_dict.get(word, word_dict['lfw']))
print('The first 10 words in the corpus represented by their corresponding indices are:')
print(num_corpus[:10])

train_data = num_corpus[:80000]

The first 10 words in the corpus represented by their corresponding indices are:
[5234, 3081, 12, 6, 195, 2, 3134, 46, 59, 156]


Now we restore the corpus from its index representation to check that the two are identical.

In [10]:
restored_cor = []
for i in range(10):
    restored_cor.append(index_dict[num_corpus[i]])
print('The first 10 words restored from the index representation are:')
print(restored_cor)

The first 10 words restored from the index representation are:
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']


# Set up Word2Vec neural network with TensorFlow

We use a three-layer autoencoder to calculate the embedding (vector representation) for each word. Details of this neural network can be found in my GitHub notebook, the **Softmax Word2Vec** tutorial.

In [11]:
data_index = 0
# generate batch data
def generate_batch(data, batch_size, skip, sub_gram):
    global data_index
    assert batch_size % skip == 0
    assert skip <= 2 * sub_gram
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    context = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * sub_gram + 1  # [ sub_gram input_word sub_gram]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // skip):
        target = sub_gram  # input word at the center of the buffer
        targets_to_avoid = [sub_gram]
        for j in range(skip):
            while target in targets_to_avoid:
                target = np.random.randint(0, span)  # upper bound is exclusive in NumPy
            targets_to_avoid.append(target)
            batch[i * skip + j] = buffer[sub_gram]  # this is the input word
            context[i * skip + j, 0] = buffer[target]  # these are the context words
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    # Backtrack a little bit to avoid skipping words in the end of a batch
    data_index = (data_index + len(data) - span) % len(data)
    return batch, context

In [12]:
vocabulary_size = 10000
batch_size = 650
embedding_size = 650  # Dimension of the embedding vector aka the number of units in the hidden layer.
sub_gram = 2          # (Span_of_gram-1)/2
skip = 2              # How many times to reuse an input to generate a context.

train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_context = tf.placeholder(tf.int32, shape=[batch_size, 1])

# weight matrix between input layer and hidden layer
embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))

# get the corresponding embedding of input word.
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

# weight matrix between hidden layer and the softmax layer
weights = tf.Variable(tf.truncated_normal([embedding_size, vocabulary_size],
                          stddev=1.0 / math.sqrt(embedding_size)))

# biases of softmax layer
biases = tf.Variable(tf.zeros([vocabulary_size]))


hidden_out = tf.matmul(embed, weights) + biases

train_one_hot = tf.one_hot(train_context, vocabulary_size)

cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=hidden_out, 
    labels=train_one_hot))

# Construct the SGD optimizer using a learning rate of 1.0.
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(cross_entropy)

init = tf.global_variables_initializer()

W1017 20:33:16.672229  1888 deprecation.py:323] From <ipython-input-12-169572810c5f>:30: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See `tf.nn.softmax_cross_entropy_with_logits_v2`.



In [13]:
num_steps = 10000
model_path = 'models'
model_save_name = 'embedding_model'

with tf.Session() as session:
    saver = tf.train.Saver()
    # We must initialize all variables before we use them.
    init.run()
    print('Initialized')

    average_loss = 0
    for step in range(num_steps):
        batch_inputs, batch_context = generate_batch(num_corpus,
            batch_size, skip, sub_gram)
        feed_dict = {train_inputs: batch_inputs, train_context: batch_context}

        # We perform one update step by evaluating the optimizer op (including it
        # in the list of returned values for session.run()
        _, loss_val = session.run([optimizer, cross_entropy], feed_dict=feed_dict)
        average_loss += loss_val

        if (step + 1) % 2000 == 0:
            if step > 0:
                average_loss /= 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print('Average loss at step ', (step + 1), ': ', average_loss)
            average_loss = 0
    # Normalize embeddings: divide each row by its L2 norm
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    final_embeddings = normalized_embeddings.eval()
    # Save a checkpoint tagged with the number of completed training steps
    saver.save(session, model_path + '\\' + model_save_name, global_step=num_steps)

Initialized
Average loss at step  2000 :  6.614294956207275
Average loss at step  4000 :  6.04705004632473
Average loss at step  6000 :  6.078072331190109
Average loss at step  8000 :  6.123249652862548


W1017 21:06:52.049563  1888 deprecation.py:506] From <ipython-input-13-ab29e2220d4f>:30: calling reduce_sum_v1 (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead


Average loss at step  10000 :  6.086918895125389

In [14]:
print('The size of the embedding matrix is:')
print(final_embeddings.shape)
print('The embedding of word "the" is:')
print(final_embeddings[1][:100])
print('...')

The size of the embedding matrix is:
(10000, 650)
The embedding of word "the" is:
[-2.6655966e-02 -6.6376761e-02 -1.9247079e-02 -1.2780695e-02
  1.2078389e-02  7.5808614e-02  4.4842158e-02  2.5679041e-02
 -5.2528664e-02  2.2594307e-02  1.8927136e-02  1.7888596e-02
  7.0898436e-02 -6.7119035e-03  3.8841940e-02 -2.4650207e-02
  2.5805410e-03  3.4857657e-02  3.9091509e-02 -1.5377350e-03
  7.8662165e-02  1.9272333e-05 -4.3310743e-02  1.0005997e-02
  6.5957971e-02 -1.6092813e-02  3.5364520e-02  4.2544510e-02
  2.1110817e-03 -1.9665316e-02  2.4460493e-02 -1.4648095e-02
 -2.0771667e-02  4.1408874e-02 -5.8948997e-02 -4.7584381e-02
  4.8056729e-02 -1.5734762e-02  4.3267258e-02 -5.9940714e-02
 -2.5871905e-02  1.7673658e-02  5.0935992e-03 -7.4655600e-02
 -3.6489550e-02 -1.5665328e-02 -3.4987353e-02  3.7250951e-02
 -5.3153817e-02 -1.4909114e-02  6.0377330e-02  2.2778971e-02
  4.4647507e-02 -3.8469475e-02 -7.0016189e-03 -1.7846381e-02
 -5.8639278e-03  2.3296148e-02 -6.1928079e-02  1.3785200e-02
  1

# Create batches for LSTM
In this experiment, we perform next-word prediction using a **MIMO** (multiple-in, multiple-out) LSTM. For example, given a subsentence with **num_steps = 6** such as **"anarchism originated as a term of"**, we want the LSTM to predict the subsentence **"originated as a term of abuse"**.

We need to create batches with inputs and their corresponding labels. Both inputs and labels are subsentences of the same length. Labels are shifted one step ahead of their inputs, as shown above.

### Batch example
For a fake corpus with 18 words (**data_len = 18**), the numerical representation is [0, 1, 2, ..., 15, 16, 17]. Let's say we want to create batches with **batch_size = 3** and **num_steps = 2**.
* **batch_size** is the number of input-label pairs that are fed into the LSTM to calculate one loss value.
* **num_steps** is the length of the subsentences that the **MIMO** LSTM processes at a time, as demonstrated above.

To create batches from the corpus data, we first reshape the corpus into a matrix of size **[batch_size, batch_len]**, where **batch_len = data_len // batch_size = 18 // 3 = 6**.

In [15]:
a = [i for i in range(18)]
print(a)
b = tf.reshape(a, [3, 6])
print()
with tf.Session() as sess:  print(b.eval())

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]

[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]
 [12 13 14 15 16 17]]


After the corpus data has been reshaped into a matrix of size **[batch_size, batch_len]**, we can easily generate batches by slicing the matrix. For example, to generate a batch with **batch_size = 3** and subsentence length **num_steps = 2**:

In [16]:
print('Batch 1:')
with tf.Session() as sess:  print(b[:, 0:2].eval())
print()
print('Its corresponding label:')
with tf.Session() as sess:  print(b[:, 1:3].eval())

Batch 1:
[[ 0  1]
 [ 6  7]
 [12 13]]

Its corresponding label:
[[ 1  2]
 [ 7  8]
 [13 14]]


In [17]:
print('Batch 2:')
with tf.Session() as sess:  print(b[:, 1:3].eval())
print()
print('Its corresponding label:')
with tf.Session() as sess:  print(b[:, 2:4].eval())

Batch 2:
[[ 1  2]
 [ 7  8]
 [13 14]]

Its corresponding label:
[[ 2  3]
 [ 8  9]
 [14 15]]


In [18]:
print('Batch 3:')
with tf.Session() as sess:  print(b[:, 2:4].eval())
print()
print('Its corresponding label:')
with tf.Session() as sess:  print(b[:, 3:5].eval())

Batch 3:
[[ 2  3]
 [ 8  9]
 [14 15]]

Its corresponding label:
[[ 3  4]
 [ 9 10]
 [15 16]]


In [19]:
print('Batch 4:')
with tf.Session() as sess:  print(b[:, 3:5].eval())
print()
print('Its corresponding label:')
with tf.Session() as sess:  print(b[:, 4:6].eval())

Batch 4:
[[ 3  4]
 [ 9 10]
 [15 16]]

Its corresponding label:
[[ 4  5]
 [10 11]
 [16 17]]


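As shown above, slicing the reshaped matrix makes it easy to generate batches. The following function builds batches in the same way, with one difference: it steps through the matrix in non-overlapping windows of **num_steps** columns, so one epoch contains **epoch_size = (batch_len - 1) // num_steps** batches. For the example above that is (6 - 1) // 2 = 2 batches, which correspond to Batch 1 and Batch 3. **x** and **y** are the aforementioned input and label, respectively.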

In [11]:
def batch_producer(raw_data, batch_size, num_steps):
    raw_data = tf.convert_to_tensor(raw_data, name="raw_data", dtype=tf.int32)

    data_len = tf.size(raw_data)
    batch_len = data_len // batch_size
    data = tf.reshape(raw_data[0: batch_size * batch_len],
                      [batch_size, batch_len])

    epoch_size = (batch_len - 1) // num_steps

    i = tf.train.range_input_producer(epoch_size, shuffle=False).dequeue()
    x = data[:, i * num_steps:(i + 1) * num_steps]
    x.set_shape([batch_size, num_steps])
    y = data[:, i * num_steps + 1: (i + 1) * num_steps + 1]
    y.set_shape([batch_size, num_steps])
    return x, y
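
The slicing logic of `batch_producer` can be checked without a TensorFlow input queue. The following is a minimal NumPy sketch under the assumption that a plain Python generator may stand in for `range_input_producer`. The name `batch_windows` is hypothetical, and the data is the 18-word toy corpus from the batch example.

```python
import numpy as np

def batch_windows(raw_data, batch_size, num_steps):
    """Yield (x, y) pairs using the same slicing as batch_producer."""
    raw_data = np.asarray(raw_data, dtype=np.int32)
    batch_len = len(raw_data) // batch_size
    # Drop the tail that does not fill a complete row, then reshape.
    data = raw_data[:batch_size * batch_len].reshape(batch_size, batch_len)
    epoch_size = (batch_len - 1) // num_steps
    for i in range(epoch_size):
        x = data[:, i * num_steps:(i + 1) * num_steps]
        # Labels are the inputs shifted one position to the right.
        y = data[:, i * num_steps + 1:(i + 1) * num_steps + 1]
        yield x, y

batches = list(batch_windows(range(18), batch_size=3, num_steps=2))
for x, y in batches:
    print(x.tolist(), y.tolist())
```

With these toy settings the generator yields two windows per epoch, starting at columns 0 and 2, because the windows advance by `num_steps` columns and the last column is reserved for the shifted labels.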

# Implementing word prediction using LSTM

In the following section, we will create a two-layer LSTM followed by a decoding layer and a softmax layer.

# Building LSTM model

### Input class

In [12]:
class Input(object):
    def __init__(self, batch_size, num_steps, data):
        self.batch_size = batch_size  #number of samples in a batch
        self.num_steps = num_steps    #length of sample subsentences
        self.epoch_size = ((len(data) // batch_size) - 1) // num_steps  #number of batches in an epoch
        self.input_data, self.targets = batch_producer(data, batch_size, num_steps)  #generate a batch

### Model parameters

In [47]:
# create the main model
class Model(object):
    def __init__(self, input, is_training, hidden_size, vocab_size, num_layers,
                 dropout=0.5, init_scale=0.05):
        self.is_training = is_training        # flag: whether to back-propagate (training mode)
        self.input_obj = input
        self.batch_size = input.batch_size
        self.num_steps = input.num_steps
        self.hidden_size = hidden_size        # used below for the state placeholder

### Add dropout to the input
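**Dropout** reduces the chance of overfitting by adding noise to the inputs: it randomly discards a fraction of the input elements.

* Discarded elements are set to 0.
* Retained elements are scaled by $\frac{1}{keep\_prob}$, where $keep\_prob$ is the probability of keeping an element.

As a result, the expected sum of the inputs remains unchanged. Note that in TensorFlow 1.x the second positional argument of **tf.nn.dropout** is the keep probability, so the value called $dropout$ in this code is the fraction of elements that is kept, not the fraction that is discarded. With the default of 0.5 the two coincide.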
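
The scaling rule can be illustrated with a small NumPy sketch of inverted dropout. This is an illustration of the idea, not TensorFlow's implementation. The function name `inverted_dropout` and the toy input are assumptions.

```python
import numpy as np

def inverted_dropout(x, keep_prob, rng):
    """Zero each element with probability 1 - keep_prob and rescale the rest."""
    mask = rng.random(x.shape) < keep_prob   # True where the element is kept
    return np.where(mask, x / keep_prob, 0.0)

rng = np.random.default_rng(0)
x = np.ones((1000, 100))
out = inverted_dropout(x, keep_prob=0.5, rng=rng)

# Kept elements are doubled, dropped elements are zero.
print(sorted(np.unique(out).tolist()))
# The mean stays close to the original mean of 1.0.
print(out.mean())
```

Because roughly half of the elements are doubled and the other half are zeroed, the average over many elements stays near its original value, which is what keeps the scale of the inputs stable between training and inference.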

In [52]:
if self.is_training and dropout < 1:
    inputs = tf.nn.dropout(inputs, dropout)        #add drop out to inputs


### Set previous output/inner state
Create a placeholder for the previous inner state **$s_{t-1}$** and the previous output **$h_{t-1}$** (written **$y_{t-1}$** in the equations above). The **init_state** placeholder has the shape **[num_layers, 2, batch_size, hidden_size]**.
* **num_layers** is the number of stacked LSTM layers (one LSTM layer stacked on top of another).
* **2** means that each layer takes two inputs besides the current word embedding: the previous inner state **$s_{t-1}$** (index 0) and the previous output **$h_{t-1}$** (index 1), matching the argument order of **LSTMStateTuple(c, h)**.
* **batch_size** is the number of samples passed through the network to calculate one loss value. (Different samples in a batch have different previous outputs and inner states.)
* **hidden_size** is the size of the output **$h$** and of the inner state **$s$**. It is also the size of a word embedding.
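
The layout of this state tensor can be explored with plain NumPy. The sketch below uses toy sizes that are assumptions, not the tutorial's settings. It builds an all-zero state and splits it into one pair of arrays per layer, which mirrors what unstacking along axis 0 does.

```python
import numpy as np

num_layers, batch_size, hidden_size = 2, 3, 4   # toy sizes, not the tutorial's

# Zero initial state with the same layout as the init_state placeholder.
init_state = np.zeros((num_layers, 2, batch_size, hidden_size), dtype=np.float32)

# One (inner state, output) pair per layer, each [batch_size, hidden_size].
per_layer = [(layer[0], layer[1]) for layer in init_state]

print(len(per_layer))          # number of layers
print(per_layer[0][0].shape)   # shape of one state array
```

An all-zero array of this shape is also what a training loop can feed at the start of each epoch, before any real state has been produced.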

In [49]:
# set up the state storage / extraction
self.init_state = tf.placeholder(tf.float32, [num_layers, 2, self.batch_size, self.hidden_size])


### Transform the state into a structure suitable for TensorFlow

In [51]:
state_per_layer_list = tf.unstack(self.init_state, axis=0)
rnn_tuple_state = tuple(
            [tf.contrib.rnn.LSTMStateTuple(state_per_layer_list[idx][0], state_per_layer_list[idx][1])
             for idx in range(num_layers)]
        )


### Create LSTM cell followed by dropout

In [None]:
# create an LSTM cell to be unrolled
cell = tf.contrib.rnn.LSTMCell(hidden_size, forget_bias=1.0)
# add a dropout wrapper if training
if is_training and dropout < 1:
    cell = tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=dropout)

### Stack LSTM cell to two layer

In [None]:
if num_layers > 1:
    cell = tf.contrib.rnn.MultiRNNCell([cell for _ in range(num_layers)], state_is_tuple=True)

### Unrolling LSTM cell
All **num_steps** inputs flow through the same stacked LSTM cells, one time step after another. This is done by the following function, which takes inputs of size [batch_size, num_steps, hidden_size] and outputs a tensor of the same size.
For each sample in a batch, the **dynamic_rnn** function sequentially feeds the **num_steps** time steps to the stacked LSTM cell.
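
Conceptually, unrolling is a loop over the time axis that carries the state from one step to the next. The following NumPy sketch shows that loop under simplifying assumptions: `toy_step` is a hypothetical stand-in for the stacked LSTM cell (a plain tanh recurrence), and `unroll` is an illustrative name, not a TensorFlow function.

```python
import numpy as np

def unroll(step_fn, inputs, state):
    """Apply step_fn along the time axis of inputs [batch, num_steps, size]."""
    outputs = []
    for t in range(inputs.shape[1]):
        # The same cell is reused at every time step; only the state changes.
        out, state = step_fn(inputs[:, t, :], state)
        outputs.append(out)
    # Stack the per-step outputs back into [batch, num_steps, size].
    return np.stack(outputs, axis=1), state

def toy_step(x_t, state):
    """Stand-in for the stacked LSTM cell: a plain tanh recurrence."""
    new_state = np.tanh(x_t + state)
    return new_state, new_state

batch_size, num_steps, hidden_size = 3, 5, 4
inputs = np.random.default_rng(0).normal(size=(batch_size, num_steps, hidden_size))
outputs, final_state = unroll(toy_step, inputs, np.zeros((batch_size, hidden_size)))
print(outputs.shape, final_state.shape)
```

The output keeps the `[batch_size, num_steps, hidden_size]` layout of the input, while the final state has one row per sample and can seed the next batch.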

In [None]:
output, self.state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32, initial_state=rnn_tuple_state)

### Flattening LSTM outputs and decoding
To perform the decoding, the LSTM output tensor of size [batch_size, num_steps, hidden_size] is first reshaped to size [batch_size*num_steps, hidden_size].

Then, each output vector of size **hidden_size** is transformed into a vector of **vocab_size** logits by a linear layer with weights **softmax_w** and biases **softmax_b**.

In [None]:
# reshape to (batch_size * num_steps, hidden_size)
output = tf.reshape(output, [-1, hidden_size])

softmax_w = tf.Variable(tf.random_uniform([hidden_size, vocab_size], -init_scale, init_scale))
softmax_b = tf.Variable(tf.random_uniform([vocab_size], -init_scale, init_scale))
logits = tf.nn.xw_plus_b(output, softmax_w, softmax_b)
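
The shape bookkeeping of this step can be verified with NumPy. The sketch below uses toy sizes and random values, both assumptions. The matrix product plus bias plays the role of `xw_plus_b`.

```python
import numpy as np

batch_size, num_steps, hidden_size, vocab_size = 2, 3, 4, 7   # toy sizes
rng = np.random.default_rng(0)

lstm_output = rng.normal(size=(batch_size, num_steps, hidden_size))
softmax_w = rng.uniform(-0.05, 0.05, size=(hidden_size, vocab_size))
softmax_b = rng.uniform(-0.05, 0.05, size=(vocab_size,))

flat = lstm_output.reshape(-1, hidden_size)   # [batch_size * num_steps, hidden_size]
logits = flat @ softmax_w + softmax_b         # [batch_size * num_steps, vocab_size]
print(flat.shape, logits.shape)
```

Row `k` of the flattened matrix corresponds to sample `k // num_steps` at time step `k % num_steps`, which is why the logits can later be reshaped back to three dimensions without mixing up samples.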

### Loss function
Raw logits, without softmax, are used directly to calculate the loss, because **sequence_loss** applies the softmax cross-entropy internally.

In [None]:
# Reshape logits to be a 3-D tensor for sequence loss
logits = tf.reshape(logits, [self.batch_size, self.num_steps, vocab_size])

# Use the contrib sequence loss and average over the batches
loss = tf.contrib.seq2seq.sequence_loss(
            logits,
            self.input_obj.targets,
            tf.ones([self.batch_size, self.num_steps], dtype=tf.float32),
            average_across_timesteps=False,
            average_across_batch=True)
# Update the cost
self.cost = tf.reduce_sum(loss)
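
With all weights equal to one, averaging across the batch but not across time steps, followed by a sum, amounts to adding up the batch-averaged cross-entropy of every time step. The NumPy sketch below reproduces that arithmetic under this reading. The function name `sequence_cost` and the toy sizes are assumptions.

```python
import numpy as np

def sequence_cost(logits, targets):
    """Sum over time steps of the batch-averaged softmax cross-entropy.

    logits:  [batch_size, num_steps, vocab_size]
    targets: [batch_size, num_steps] integer word indices
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability assigned to each target word.
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    per_step = (-picked).mean(axis=0)   # average across the batch -> [num_steps]
    return per_step.sum()               # sum across time steps -> scalar

batch_size, num_steps, vocab_size = 4, 5, 10
logits = np.zeros((batch_size, num_steps, vocab_size))   # uniform predictions
targets = np.zeros((batch_size, num_steps), dtype=np.int64)
print(sequence_cost(logits, targets), num_steps * np.log(vocab_size))
```

For uniform predictions the two printed numbers agree: the cost equals `num_steps * ln(vocab_size)`. This gives a rough reference value for the cost of a model that has not yet learned anything.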

### Softmax normalization and accuracy calculation
To calculate accuracy, a softmax activation normalizes the logits into probabilities, and the most probable word is taken as the prediction.

In [None]:
# get the prediction accuracy
self.softmax_out = tf.nn.softmax(tf.reshape(logits, [-1, vocab_size]))
self.predict = tf.cast(tf.argmax(self.softmax_out, axis=1), tf.int32)
correct_prediction = tf.equal(self.predict, tf.reshape(self.input_obj.targets, [-1]))
self.accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
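
The accuracy computation can be traced on a tiny hand-made example. The NumPy sketch below uses made-up logits and targets, both assumptions for illustration. It also shows that softmax does not change which entry is the largest.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Four predictions over a three-word vocabulary (made-up numbers).
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0],
                   [1.0, 4.0, 0.0],
                   [0.3, 0.2, 0.1]])
targets = np.array([0, 2, 0, 0])

probs = softmax(logits)
predict = probs.argmax(axis=1)            # most probable word per row
accuracy = (predict == targets).mean()    # fraction of correct predictions
print(predict, accuracy)
```

Three of the four predictions match their targets, so the accuracy is 0.75. Taking the argmax of the raw logits gives the same predictions, since softmax preserves the ordering within each row.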

### Optimization

In [None]:
if not is_training:
    return
self.learning_rate = tf.Variable(0.0, trainable=False)

tvars = tf.trainable_variables()
grads, _ = tf.clip_by_global_norm(tf.gradients(self.cost, tvars), 5)
optimizer = tf.train.GradientDescentOptimizer(self.learning_rate)
self.train_op = optimizer.apply_gradients(
            zip(grads, tvars),
            global_step=tf.contrib.framework.get_or_create_global_step())
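
Clipping by global norm rescales all gradients by a common factor whenever their joint L2 norm exceeds the threshold (5 in the code above). The following NumPy sketch illustrates the rule on made-up gradients. It mirrors the idea behind `tf.clip_by_global_norm`, but it is an illustration and not the library code.

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)   # 1.0 when already within the limit
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]      # joint norm is 13
clipped, norm = clip_by_global_norm(grads, clip_norm=5.0)
print(norm, [c.tolist() for c in clipped])
```

Because every gradient is multiplied by the same factor, the direction of the overall update is preserved and only its length is limited. This helps keep exploding gradients from destabilizing training.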

### Update learning rate

In [None]:
self.new_lr = tf.placeholder(tf.float32, shape=[])
self.lr_update = tf.assign(self.learning_rate, self.new_lr)
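
The value fed through `new_lr` can follow any schedule. The training function in this tutorial keeps the rate constant for the first `max_lr_epoch` epochs and then decays it exponentially. The sketch below reproduces that formula in plain Python, using the default values from the `train` signature. The function name `decayed_lr` is an assumption.

```python
def decayed_lr(epoch, learning_rate=1.0, max_lr_epoch=10, lr_decay=0.93):
    """Learning rate for a zero-based epoch: constant at first, then exponential decay."""
    return learning_rate * lr_decay ** max(epoch + 1 - max_lr_epoch, 0)

for epoch in (0, 9, 10, 11):
    print(epoch, decayed_lr(epoch))
```

With these defaults the rate stays at 1.0 up to and including epoch 9 and is multiplied by 0.93 for every epoch after that.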

## Complete LSTM model

In [13]:
# create the main model
class Model(object):
    def __init__(self, input, is_training, hidden_size, vocab_size, num_layers,
                 dropout=0.5, init_scale=0.05):
        self.is_training = is_training
        self.input_obj = input
        self.batch_size = input.batch_size
        self.num_steps = input.num_steps
        self.hidden_size = hidden_size

#         # create the word embeddings
#         with tf.device("/cpu:0"):
        embedding = tf.Variable(tf.random_uniform([vocab_size, self.hidden_size], -init_scale, init_scale))
#         embedding = tf.Variable(embeddings, trainable = False)
        inputs = tf.nn.embedding_lookup(embedding, self.input_obj.input_data)

        if is_training and dropout < 1:
            inputs = tf.nn.dropout(inputs, dropout)

        # set up the state storage / extraction
        self.init_state = tf.placeholder(tf.float32, [num_layers, 2, self.batch_size, self.hidden_size])

        state_per_layer_list = tf.unstack(self.init_state, axis=0)
        rnn_tuple_state = tuple(
            [tf.contrib.rnn.LSTMStateTuple(state_per_layer_list[idx][0], state_per_layer_list[idx][1])
             for idx in range(num_layers)]
        )

        # create an LSTM cell to be unrolled
        cell = tf.contrib.rnn.LSTMCell(hidden_size, forget_bias=1.0)
        # add a dropout wrapper if training
        if is_training and dropout < 1:
            cell = tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=dropout)
        if num_layers > 1:
            cell = tf.contrib.rnn.MultiRNNCell([cell for _ in range(num_layers)], state_is_tuple=True)

        output, self.state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32, initial_state=rnn_tuple_state)
        # reshape to (batch_size * num_steps, hidden_size)
        output = tf.reshape(output, [-1, hidden_size])

        softmax_w = tf.Variable(tf.random_uniform([hidden_size, vocab_size], -init_scale, init_scale))
        softmax_b = tf.Variable(tf.random_uniform([vocab_size], -init_scale, init_scale))
        logits = tf.nn.xw_plus_b(output, softmax_w, softmax_b)
        # Reshape logits to be a 3-D tensor for sequence loss
        logits = tf.reshape(logits, [self.batch_size, self.num_steps, vocab_size])

        # Use the contrib sequence loss and average over the batches
        loss = tf.contrib.seq2seq.sequence_loss(
            logits,
            self.input_obj.targets,
            tf.ones([self.batch_size, self.num_steps], dtype=tf.float32),
            average_across_timesteps = False,
            average_across_batch = True)

        # Update the cost
        self.cost = tf.reduce_sum(loss)

        # get the prediction accuracy
        self.softmax_out = tf.nn.softmax(tf.reshape(logits, [-1, vocab_size]))
        self.predict = tf.cast(tf.argmax(self.softmax_out, axis=1), tf.int32)
        correct_prediction = tf.equal(self.predict, tf.reshape(self.input_obj.targets, [-1]))
        self.accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

        if not is_training:
            return
        self.learning_rate = tf.Variable(0.0, trainable=False)

        tvars = tf.trainable_variables()
        grads, _ = tf.clip_by_global_norm(tf.gradients(self.cost, tvars), 5)
        optimizer = tf.train.GradientDescentOptimizer(self.learning_rate)
        # optimizer = tf.train.AdamOptimizer(self.learning_rate)
        self.train_op = optimizer.apply_gradients(
            zip(grads, tvars),
            global_step=tf.contrib.framework.get_or_create_global_step())
        # self.optimizer = tf.train.GradientDescentOptimizer(self.learning_rate).minimize(self.cost)

        self.new_lr = tf.placeholder(tf.float32, shape=[])
        self.lr_update = tf.assign(self.learning_rate, self.new_lr)
        
    def assign_lr(self, session, lr_value):
        session.run(self.lr_update, feed_dict={self.new_lr: lr_value})

In [13]:
# create the main model
class Model_pre_embedding(object):
    def __init__(self, input, embeddings, is_training, hidden_size, vocab_size, num_layers,
                 dropout=0.5, init_scale=0.05):
        self.is_training = is_training
        self.input_obj = input
        self.batch_size = input.batch_size
        self.num_steps = input.num_steps
        self.hidden_size = hidden_size

#         # create the word embeddings
#         with tf.device("/cpu:0"):
            #embedding = tf.Variable(tf.random_uniform([vocab_size, self.hidden_size], -init_scale, init_scale))
        embedding = tf.Variable(embeddings, trainable = False)
        inputs = tf.nn.embedding_lookup(embedding, self.input_obj.input_data)

        if is_training and dropout < 1:
            inputs = tf.nn.dropout(inputs, dropout)

        # set up the state storage / extraction
        self.init_state = tf.placeholder(tf.float32, [num_layers, 2, self.batch_size, self.hidden_size])

        state_per_layer_list = tf.unstack(self.init_state, axis=0)
        rnn_tuple_state = tuple(
            [tf.contrib.rnn.LSTMStateTuple(state_per_layer_list[idx][0], state_per_layer_list[idx][1])
             for idx in range(num_layers)]
        )

        # create an LSTM cell to be unrolled
        cell = tf.contrib.rnn.LSTMCell(hidden_size, forget_bias=1.0)
        # add a dropout wrapper if training
        if is_training and dropout < 1:
            cell = tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=dropout)
        if num_layers > 1:
            cell = tf.contrib.rnn.MultiRNNCell([cell for _ in range(num_layers)], state_is_tuple=True)

        output, self.state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32, initial_state=rnn_tuple_state)
        # reshape to (batch_size * num_steps, hidden_size)
        output = tf.reshape(output, [-1, hidden_size])

        softmax_w = tf.Variable(tf.random_uniform([hidden_size, vocab_size], -init_scale, init_scale))
        softmax_b = tf.Variable(tf.random_uniform([vocab_size], -init_scale, init_scale))
        logits = tf.nn.xw_plus_b(output, softmax_w, softmax_b)
        # Reshape logits to be a 3-D tensor for sequence loss
        logits = tf.reshape(logits, [self.batch_size, self.num_steps, vocab_size])

        # Use the contrib sequence loss and average over the batches
        loss = tf.contrib.seq2seq.sequence_loss(
            logits,
            self.input_obj.targets,
            tf.ones([self.batch_size, self.num_steps], dtype=tf.float32),
            average_across_timesteps = False,
            average_across_batch = True)

        # Update the cost
        self.cost = tf.reduce_sum(loss)

        # get the prediction accuracy
        self.softmax_out = tf.nn.softmax(tf.reshape(logits, [-1, vocab_size]))
        self.predict = tf.cast(tf.argmax(self.softmax_out, axis=1), tf.int32)
        correct_prediction = tf.equal(self.predict, tf.reshape(self.input_obj.targets, [-1]))
        self.accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

        if not is_training:
            return
        self.learning_rate = tf.Variable(0.0, trainable=False)

        tvars = tf.trainable_variables()
        grads, _ = tf.clip_by_global_norm(tf.gradients(self.cost, tvars), 5)
        optimizer = tf.train.GradientDescentOptimizer(self.learning_rate)
        # optimizer = tf.train.AdamOptimizer(self.learning_rate)
        self.train_op = optimizer.apply_gradients(
            zip(grads, tvars),
            global_step=tf.contrib.framework.get_or_create_global_step())
        # self.optimizer = tf.train.GradientDescentOptimizer(self.learning_rate).minimize(self.cost)

        self.new_lr = tf.placeholder(tf.float32, shape=[])
        self.lr_update = tf.assign(self.learning_rate, self.new_lr)
        
    def assign_lr(self, session, lr_value):
        session.run(self.lr_update, feed_dict={self.new_lr: lr_value})

# Training the LSTM model

In [None]:
def train(train_data, vocabulary, num_layers, num_epochs, batch_size, model_path, model_save_name,
          learning_rate=1.0, max_lr_epoch=10, lr_decay=0.93, print_iter=50):
    # setup data and models
    training_input = Input(batch_size=batch_size, num_steps=35, data=train_data)
    m = Model(training_input, is_training=True, hidden_size=650, vocab_size=vocabulary,
              num_layers=num_layers)
    init_op = tf.global_variables_initializer()
    orig_decay = lr_decay
    with tf.Session() as sess:
        # initialise variables, then start the input queue threads
        sess.run([init_op])
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)
        saver = tf.train.Saver()
        for epoch in range(num_epochs):
            # keep the initial learning rate for the first max_lr_epoch epochs, then decay it geometrically
            new_lr_decay = orig_decay ** max(epoch + 1 - max_lr_epoch, 0.0)
            m.assign_lr(sess, learning_rate * new_lr_decay)
            current_state = np.zeros((num_layers, 2, batch_size, m.hidden_size))
            curr_time = dt.datetime.now()
            for step in range(training_input.epoch_size):
                # cost, _ = sess.run([m.cost, m.optimizer])
                if step % print_iter != 0:
                    cost, _, current_state = sess.run([m.cost, m.train_op, m.state],
                                                      feed_dict={m.init_state: current_state})
                else:
                    # average wall-clock seconds per step since the last report
                    seconds = (dt.datetime.now() - curr_time).total_seconds() / print_iter
                    curr_time = dt.datetime.now()
                    cost, _, current_state, acc = sess.run([m.cost, m.train_op, m.state, m.accuracy],
                                                           feed_dict={m.init_state: current_state})
                    print("Epoch {}, Step {}, cost: {:.3f}, accuracy: {:.3f}, Seconds per step: {:.3f}".format(epoch,
                            step, cost, acc, seconds))

            # save a model checkpoint
            saver.save(sess, model_path + '\\' + model_save_name, global_step=epoch)
        # do a final save
        saver.save(sess, model_path + '\\' + model_save_name + '-final')
        # close threads
        coord.request_stop()
        coord.join(threads)
        
train(train_data, vocabulary=10000, num_layers=2, num_epochs=15, batch_size=20,
      model_path='models',
      model_save_name='two-layer-lstm-medium-config-60-epoch-0p93-lr-decay-10-max-lr',
      learning_rate=1.0, max_lr_epoch=10, lr_decay=0.93, print_iter=50)

W1020 22:38:11.208746 13476 deprecation.py:323] From <ipython-input-11-aec0b10157ce>:11: range_input_producer (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.range(limit).shuffle(limit).repeat(num_epochs)`. If `shuffle=False`, omit the `.shuffle(...)`.
W1020 22:38:11.208746 13476 deprecation.py:323] From C:\Users\lenovo\Anaconda3\lib\site-packages\tensorflow\python\training\input.py:320: input_producer (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.from_tensor_slices(input_tensor).shuffle(tf.shape(input_tensor, out_type=tf.int64)[0]).repeat(num_epochs)`. If `shuffle=False`, omit the `.shuffle(...)`.
W1020 22:38:11.208746 13476 deprecation.py:323] From C:\Users\lenovo\Anaconda3\l

Epoch 0, Step 0, cost: 322.346, accuracy: 0.000, Seconds per step: 0.000
Epoch 0, Step 50, cost: 247.691, accuracy: 0.077, Seconds per step: 0.900
Epoch 0, Step 100, cost: 232.159, accuracy: 0.067, Seconds per step: 0.900
Epoch 1, Step 0, cost: 218.557, accuracy: 0.101, Seconds per step: 0.000
Epoch 1, Step 50, cost: 225.501, accuracy: 0.086, Seconds per step: 0.900
Epoch 1, Step 100, cost: 221.384, accuracy: 0.109, Seconds per step: 0.900
Epoch 2, Step 0, cost: 209.334, accuracy: 0.110, Seconds per step: 0.000
Epoch 2, Step 50, cost: 218.158, accuracy: 0.086, Seconds per step: 0.880
Epoch 2, Step 100, cost: 215.280, accuracy: 0.104, Seconds per step: 0.780
Epoch 3, Step 0, cost: 203.081, accuracy: 0.124, Seconds per step: 0.000
Epoch 3, Step 50, cost: 213.844, accuracy: 0.103, Seconds per step: 0.860
Epoch 3, Step 100, cost: 209.897, accuracy: 0.120, Seconds per step: 0.840
Epoch 4, Step 0, cost: 197.990, accuracy: 0.137, Seconds per step: 0.000
Epoch 4, Step 50, cost: 206.438, accura
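
The logged cost is easier to read once it is converted to a per-word figure. Since the loss is averaged over the batch but summed over the 35 time steps, dividing by `num_steps` should give the per-word cross-entropy in nats, and its exponential the perplexity. The sketch below assumes exactly that reading of the loss; `per_word_stats` is a hypothetical helper, not part of the notebook.

```python
import math

num_steps = 35        # time steps per sequence, as passed to Input above
vocab_size = 10000    # vocabulary size used in the train(...) call

def per_word_stats(cost, num_steps):
    """Convert the summed-over-time cost into per-word cross-entropy and perplexity."""
    cross_entropy = cost / num_steps
    return cross_entropy, math.exp(cross_entropy)

# First logged cost (Epoch 0, Step 0)
ce, ppl = per_word_stats(322.346, num_steps)
print("cross-entropy per word:", round(ce, 3))
print("log(vocab_size):       ", round(math.log(vocab_size), 3))
print("perplexity:            ", round(ppl))
```

The first logged cost works out to about 9.21 nats per word, which matches `log(10000)`. In other words, the untrained model starts out roughly as uncertain as a uniform guess over the whole vocabulary, and the later costs can be judged against that baseline.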

W1020 22:48:06.228312 13476 deprecation.py:323] From C:\Users\lenovo\Anaconda3\lib\site-packages\tensorflow\python\training\saver.py:960: remove_checkpoint (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to delete files with this prefix.


Epoch 6, Step 0, cost: 183.495, accuracy: 0.197, Seconds per step: 0.000
Epoch 6, Step 50, cost: 193.490, accuracy: 0.164, Seconds per step: 0.820
Epoch 6, Step 100, cost: 200.197, accuracy: 0.149, Seconds per step: 0.780
Epoch 7, Step 0, cost: 182.846, accuracy: 0.166, Seconds per step: 0.000
Epoch 7, Step 50, cost: 190.050, accuracy: 0.161, Seconds per step: 0.800
Epoch 7, Step 100, cost: 195.189, accuracy: 0.156, Seconds per step: 0.840
Epoch 8, Step 0, cost: 180.570, accuracy: 0.189, Seconds per step: 0.000
Epoch 8, Step 50, cost: 185.783, accuracy: 0.160, Seconds per step: 0.860
Epoch 8, Step 100, cost: 191.009, accuracy: 0.170, Seconds per step: 0.820
Epoch 9, Step 0, cost: 173.134, accuracy: 0.213, Seconds per step: 0.000
Epoch 9, Step 50, cost: 182.976, accuracy: 0.156, Seconds per step: 0.760
Epoch 9, Step 100, cost: 186.692, accuracy: 0.167, Seconds per step: 0.780
Epoch 10, Step 0, cost: 170.524, accuracy: 0.214, Seconds per step: 0.000
Epoch 10, Step 50, cost: 179.188, accu
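
The learning-rate schedule used in `train` can be checked in isolation. This small sketch reproduces the expression `orig_decay ** max(epoch + 1 - max_lr_epoch, 0.0)` with the default arguments from the function signature; `learning_rate_for_epoch` is a hypothetical helper introduced only for illustration.

```python
def learning_rate_for_epoch(epoch, learning_rate=1.0, max_lr_epoch=10, lr_decay=0.93):
    """Learning rate fed to assign_lr for a given zero-based epoch."""
    return learning_rate * lr_decay ** max(epoch + 1 - max_lr_epoch, 0.0)

# One value per epoch for the 15-epoch run above
schedule = [learning_rate_for_epoch(epoch) for epoch in range(15)]
for epoch, lr in enumerate(schedule):
    print(epoch, round(lr, 4))
```

With these defaults the rate stays at 1.0 through epoch 9 and is multiplied by 0.93 once per epoch from epoch 10 onwards, so the decay only starts near the end of a 15-epoch run.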

In [20]:
def train_pre_embedding(train_data, embeddings, vocabulary, num_layers, num_epochs, batch_size, model_path, model_save_name,
          learning_rate=1.0, max_lr_epoch=10, lr_decay=0.93, print_iter=50):
    # setup data and models
    training_input = Input(batch_size=batch_size, num_steps=35, data=train_data)
    m = Model_pre_embedding(training_input, embeddings, is_training=True, hidden_size=650, vocab_size=vocabulary,
              num_layers=num_layers)
    init_op = tf.global_variables_initializer()
    orig_decay = lr_decay
    with tf.Session() as sess:
        # initialise variables, then start the input queue threads
        sess.run([init_op])
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)
        saver = tf.train.Saver()
        for epoch in range(num_epochs):
            # keep the initial learning rate for the first max_lr_epoch epochs, then decay it geometrically
            new_lr_decay = orig_decay ** max(epoch + 1 - max_lr_epoch, 0.0)
            m.assign_lr(sess, learning_rate * new_lr_decay)
            current_state = np.zeros((num_layers, 2, batch_size, m.hidden_size))
            curr_time = dt.datetime.now()
            for step in range(training_input.epoch_size):
                # cost, _ = sess.run([m.cost, m.optimizer])
                if step % print_iter != 0:
                    cost, _, current_state = sess.run([m.cost, m.train_op, m.state],
                                                      feed_dict={m.init_state: current_state})
                else:
                    # average wall-clock seconds per step since the last report
                    seconds = (dt.datetime.now() - curr_time).total_seconds() / print_iter
                    curr_time = dt.datetime.now()
                    cost, _, current_state, acc = sess.run([m.cost, m.train_op, m.state, m.accuracy],
                                                           feed_dict={m.init_state: current_state})
                    print("Epoch {}, Step {}, cost: {:.3f}, accuracy: {:.3f}, Seconds per step: {:.3f}".format(epoch,
                            step, cost, acc, seconds))

            # save a model checkpoint
            saver.save(sess, model_path + '\\' + model_save_name, global_step=epoch)
        # do a final save
        saver.save(sess, model_path + '\\' + model_save_name + '-final')
        # close threads
        coord.request_stop()
        coord.join(threads)
        
# Note: the ValueError shown below ("Variable ... already exists") most likely arises because a model
# was already built in the same default graph. Calling tf.reset_default_graph() before building a new
# model, or restarting the kernel, should avoid the variable-name clash.
train_pre_embedding(num_corpus, final_embeddings, 10000, 2, 60, 20, 'models',
                    'two-layer-lstm-medium-config-60-epoch-0p93-lr-decay-10-max-lr', 1.0, 10, 0.93, 50)

ValueError: in converted code:
    relative to C:\Users\lenovo\Anaconda3\lib\site-packages\tensorflow\python:

    ops\rnn_cell_impl.py:1719 call *
        cur_inp, new_state = cell(cur_inp, cur_state)
    ops\rnn_cell_impl.py:1159 __call__
        inputs, state, cell_call_fn=self.cell.__call__, scope=scope)
    ops\rnn_cell_impl.py:1436 _call_wrapped_cell
        output, new_state = cell_call_fn(inputs, state, **kwargs)
    ops\rnn_cell_impl.py:385 __call__
        self, inputs, state, scope=scope, *args, **kwargs)
    layers\base.py:537 __call__
        outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
    keras\engine\base_layer.py:591 __call__
        self._maybe_build(inputs)
    keras\engine\base_layer.py:1881 _maybe_build
        self.build(input_shapes)
    keras\utils\tf_utils.py:295 wrapper
        output_shape = fn(instance, input_shape)
    ops\rnn_cell_impl.py:957 build
        partitioner=maybe_partitioner)
    keras\engine\base_layer.py:1484 add_variable
        return self.add_weight(*args, **kwargs)
    layers\base.py:450 add_weight
        **kwargs)
    keras\engine\base_layer.py:384 add_weight
        aggregation=aggregation)
    training\tracking\base.py:663 _add_variable_with_custom_getter
        **kwargs_for_getter)
    ops\variable_scope.py:1496 get_variable
        aggregation=aggregation)
    ops\variable_scope.py:1239 get_variable
        aggregation=aggregation)
    ops\variable_scope.py:545 get_variable
        return custom_getter(**custom_getter_kwargs)
    ops\rnn_cell_impl.py:251 _rnn_get_variable
        variable = getter(*args, **kwargs)
    ops\variable_scope.py:514 _true_getter
        aggregation=aggregation)
    ops\variable_scope.py:864 _get_single_variable
        (err_msg, "".join(traceback.format_list(tb))))

    ValueError: Variable rnn/multi_rnn_cell/cell_0/lstm_cell/kernel already exists, disallowed. Did you mean to set reuse=True or reuse=tf.AUTO_REUSE in VarScope? Originally defined at:
    
      File "C:\Users\lenovo\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 2005, in __init__
        self._traceback = tf_stack.extract_stack()
      File "C:\Users\lenovo\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 3616, in create_op
        op_def=op_def)
      File "C:\Users\lenovo\Anaconda3\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
        return func(*args, **kwargs)
      File "C:\Users\lenovo\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
        op_def=op_def)
      File "C:\Users\lenovo\Anaconda3\lib\site-packages\tensorflow\python\ops\gen_state_ops.py", line 2023, in variable_v2
        shared_name=shared_name, name=name)
    


# Validate LSTM model

In [None]:

def test(model_path, test_data, reversed_dictionary):
    test_input = Input(batch_size=20, num_steps=35, data=test_data)
    m = Model(test_input, is_training=False, hidden_size=650, vocab_size=vocabulary,
              num_layers=2)
    saver = tf.train.Saver()
    with tf.Session() as sess:
        # start threads
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)
        current_state = np.zeros((2, 2, m.batch_size, m.hidden_size))
        # restore the trained model
        saver.restore(sess, model_path)
        # get an average accuracy over num_acc_batches
        num_acc_batches = 30
        check_batch_idx = 25
        acc_check_thresh = 5
        accuracy = 0
        for batch in range(num_acc_batches):
            if batch == check_batch_idx:
                true_vals, pred, current_state, acc = sess.run([m.input_obj.targets, m.predict, m.state, m.accuracy],
                                                               feed_dict={m.init_state: current_state})
                pred_string = [reversed_dictionary[x] for x in pred[:m.num_steps]]
                true_vals_string = [reversed_dictionary[x] for x in true_vals[0]]
                print("True values (1st line) vs predicted values (2nd line):")
                print(" ".join(true_vals_string))
                print(" ".join(pred_string))
            else:
                acc, current_state = sess.run([m.accuracy, m.state], feed_dict={m.init_state: current_state})
            if batch >= acc_check_thresh:
                accuracy += acc
        print("Average accuracy: {:.3f}".format(accuracy / (num_acc_batches-acc_check_thresh)))
        # close threads
        coord.request_stop()
        coord.join(threads)


if args.data_path:
    data_path = args.data_path
train_data, valid_data, test_data, vocabulary, reversed_dictionary = load_data()
if args.run_opt == 1:
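    # model_path is a required argument of train(); checkpoints are written under data_path
    train(train_data, vocabulary, num_layers=2, num_epochs=60, batch_size=20,
          model_path=data_path,
          model_save_name='two-layer-lstm-medium-config-60-epoch-0p93-lr-decay-10-max-lr')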
else:
    trained_model = args.data_path + "\\two-layer-lstm-medium-config-60-epoch-0p93-lr-decay-10-max-lr-38"
    test(trained_model, test_data, reversed_dictionary)
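
The `test` function above runs 30 batches but only counts accuracy from batch 5 onwards, presumably because the LSTM state starts from zeros and the first few batches are predicted with little context. The averaging step can be sketched on its own as follows; `average_after_warmup` and the accuracy values are hypothetical and only illustrate the arithmetic.

```python
def average_after_warmup(batch_accuracies, warmup):
    """Mean accuracy over all batches except the first `warmup` ones."""
    scored = batch_accuracies[warmup:]
    return sum(scored) / len(scored)

# Made-up per-batch accuracies: 5 warm-up batches followed by 25 scored batches
accuracies = [0.05] * 5 + [0.20] * 25
print("Average accuracy: {:.3f}".format(average_after_warmup(accuracies, warmup=5)))
```

Dividing by the number of scored batches rather than the total mirrors the `num_acc_batches - acc_check_thresh` denominator in `test`.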