# Part II: Sequence to Sequence Models in TensorFlow

 

## Task: Language Modeling
Today we will use TensorFlow to build a very simple seq2seq model for language modeling.
### Definition
We input a sequence of words/characters to an RNN so that it can learn the probability distribution of the next word/character in the sequence given the history of previous characters. This will then allow us to generate text one unit at a time.
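
To make the definition concrete, here is a minimal sketch of the same idea without any neural network: a count-based model that estimates the probability of the next character given only the previous one, then generates text one character at a time. The function names and the toy corpus are made up for illustration; the RNN in this codelab conditions on a much longer history.

```python
import random
from collections import Counter, defaultdict

def train_bigram(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts

def next_char_distribution(counts, prev):
    """Turn the raw counts for `prev` into P(next | prev)."""
    total = sum(counts[prev].values())
    return {ch: n / total for ch, n in counts[prev].items()}

def generate(counts, start, length, seed=0):
    """Generate text one character at a time by sampling from P(next | prev)."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        dist = next_char_distribution(counts, out[-1])
        if not dist:  # dead end: this character was never followed by anything
            break
        chars, probs = zip(*dist.items())
        out.append(rng.choices(chars, weights=probs)[0])
    return "".join(out)

toy_corpus = "abababac"  # made-up training text
counts = train_bigram(toy_corpus)
print(next_char_distribution(counts, "a"))
print(generate(counts, "a", 8))
```

In the toy corpus, "a" is followed by "b" three times and by "c" once, so the estimated distribution after "a" puts 0.75 on "b" and 0.25 on "c".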

We will use 全唐诗 (*Quan Tangshi*, the Complete Tang Poems) as our training data, and try to generate new poems later!

## Check the content

I have already uploaded the poem text to this server. We need to first do some preprocessing.

**Confirm the correctness of preprocessing**

When you deal with your own dataset, you have to write your own preprocessing procedure. There are all kinds of noise in text data. Remember to check the correctness of the preprocessing!
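
One possible way to check the result is to count every character that does not look like a normal poem character. The sketch below assumes that normal characters fall in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF); the helper name `suspicious_chars` is hypothetical, and the sample lines are the special cases discussed in this section.

```python
import re
from collections import Counter

# Assumption: "normal" poem characters are basic CJK ideographs.
CJK_CHAR = re.compile(r'[\u4e00-\u9fff]')

def suspicious_chars(poems):
    """Count every character that is not a basic CJK ideograph."""
    counter = Counter()
    for poem in poems:
        for ch in poem:
            if not CJK_CHAR.match(ch):
                counter[ch] += 1
    return counter

sample = ["河鱼未上冻，江蛰已闻雷。（见《纬略》）", "□□□□□"]
for ch, n in suspicious_chars(sample).most_common():
    print(repr(ch), n)
```

Running the same check on the cleaned list should print nothing; anything that does show up is a hint that the preprocessing missed a case.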

The format of our input data is like this:

`(optional title + ":")poem`

We will use only the poem part and not the title.

However, there are some special cases, such as:

* 河鱼未上冻，江蛰已闻雷。（见《纬略》）
* □□□□□

We need some preprocessing, as mentioned in our last codelab:

* Remove title
* Remove spaces
* Remove empty symbols
* Replace other symbols

Finally, we will shuffle the poems.

In [0]:
import re
import numpy as np

data_filename ='poetry.txt'
poems = []
with open(data_filename, "r", encoding="utf-8") as in_file:
  for line in in_file:
    line = line.strip()
    # split off the title; only lines of the form "title:poem" are kept
    parts = line.split(':')
    # some poems are empty or malformed
    if len(parts) == 2:
      poem = parts[1]
    else:
      continue
    # discard if contains special symbols
    if re.search(r'[(（《_□]', poem):
      continue
    # discard if too short or too long
    if len(poem) < 5 or len(poem) > 40:
      continue
    # remove symbols
    poem = re.sub(u'[，。]','',poem)
    poems.append(poem)

poems = np.random.permutation(poems)

We select 5 poems as our test set.

In [0]:
poems_train, poems_test = poems[:-5], poems[-5:]
len(poems_train), len(poems_test)

## Word to ID

As mentioned last time, we can use the tokenizer in Keras to help us. We set
```python
Tokenizer(num_words=None, lower=False, char_level=True)
```
to leave the dictionary size unlimited and to use characters as our unit.

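We also need to correct the dictionary: Keras assigns IDs starting from 1, so without an entry for ID 0 the dictionary size does not cover the full range of IDs. That becomes a problem later, when we use `len(word_index)` as the vocabulary size!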
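
The off-by-one issue can be seen with a small stand-in for the tokenizer. This is only a sketch: `build_char_index` is a hypothetical helper that mimics the 1-based numbering, not the Keras implementation, and the two input strings are made up.

```python
from collections import Counter

def build_char_index(texts):
    """Assign IDs 1, 2, 3, ... to characters, most frequent first."""
    counts = Counter(ch for text in texts for ch in text)
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

toy_index = build_char_index(["aab", "abc"])
max_id = max(toy_index.values())
print(len(toy_index), max_id)

# A table with len(toy_index) rows only has rows 0 .. len - 1,
# so the largest ID would fall outside of it.
assert max_id >= len(toy_index)

# Reserving ID 0 makes the dictionary size equal to max ID + 1.
toy_index["<PAD>"] = 0
assert max(toy_index.values()) < len(toy_index)
```

After adding the entry for ID 0, every ID is a valid row index into a table with `len(toy_index)` rows, which is what the embedding and softmax layers later rely on.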

In [0]:
import time
from tensorflow.keras.preprocessing.text import Tokenizer

poem_tokenizer = Tokenizer(num_words=None, lower=False, char_level=True)
# Create word to ID dictionary
poem_tokenizer.fit_on_texts(poems)
# Get dictionary
word_index = poem_tokenizer.word_index

# Note that ID starts from 1!!
# We need to add special ID 0
word_index["<PAD>"] = 0
# Create ID to word 
reverse_word_index = dict([(v, k) for (k, v) in word_index.items()])
print("Number of unique chars: {}".format(len(word_index)))

Again, always check whether there are any strange symbols in the dictionary. Here we only print the first and last parts.

In [0]:
# sort word index by ID
for (w,i) in sorted(word_index.items(), key=lambda w: w[1]):
# print some words to check if there are errors!
  if i > 10 and i < len(word_index)-5: continue
  print("{} {}".format(w,i))

In [0]:
# Apply word to ID on training and test set
poems_train = poem_tokenizer.texts_to_sequences(poems_train)
poems_test = poem_tokenizer.texts_to_sequences(poems_test)
# Check and see if there is any error
print(poems_train[0])
print(''.join([reverse_word_index[w] for w in poems_train[0]]))

## Prepare the data for input

We flatten the input to a long list.

In [0]:
# flatten the training poems into one long list of character IDs
poems_train = [w for po in poems_train for w in po]

# flatten the test poems into one long list of character IDs
poems_test = [w for po in poems_test for w in po]

## Define an input object

We need to put the input into batches:
* Reshape the input data into a rectangular matrix and crop the remainder
* Calculate the shape of each batch
* Generate batches in which the output is the input shifted by one time step
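
The three steps above can be sketched in plain Python before looking at the TensorFlow version. Nested lists stand in for tensors here, and `make_batches` and the toy IDs are made up for illustration; the arithmetic follows the same formulas as the input object defined next.

```python
def make_batches(data, batch_size, num_steps):
    """Yield (x, y) pairs shaped [batch_size][num_steps], with y = x shifted by one."""
    batch_len = len(data) // batch_size          # length of each row
    # crop the remainder and cut the data into batch_size rows
    rows = [data[r * batch_len:(r + 1) * batch_len] for r in range(batch_size)]
    epoch_size = (batch_len - 1) // num_steps    # -1 leaves room for the shifted target
    if epoch_size <= 0:
        raise ValueError("epoch_size == 0, decrease batch_size or num_steps")
    for i in range(epoch_size):
        x = [row[i * num_steps:(i + 1) * num_steps] for row in rows]
        y = [row[i * num_steps + 1:(i + 1) * num_steps + 1] for row in rows]
        yield x, y

toy_ids = list(range(1, 16))   # 15 made-up token IDs
batches = list(make_batches(toy_ids, batch_size=2, num_steps=3))
for x, y in batches:
    print("x =", x, " y =", y)
```

With 15 IDs and `batch_size=2`, each row holds 7 IDs and the last ID is cropped. Each target row is the corresponding input row moved one position ahead, so the model is always asked to predict the next ID.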


In [0]:
import tensorflow as tf

In [0]:
class PoemInput(object):
  def __init__(self, data, config, name=None):
    self.batch_size = batch_size = config.batch_size
    self.num_steps = num_steps = config.num_steps
    self.epoch_size = ((len(data) // batch_size) - 1) // num_steps
    self.sources, self.targets = self.input_producer(
        data, batch_size, num_steps, name=name)

  def input_producer(self, raw_data, batch_size, num_steps, name=None):
    """Reshape the poem data to form input and output.
    This chunks the raw_data into batches of examples and returns Tensors that
    are drawn from these batches.
    Args:
      raw_data: a list of words
      batch_size: int, the batch size.
      num_steps: int, the sequence length.
      name: the name of this operation (optional).
    Returns:
      A pair of Tensors, each shaped [batch_size, num_steps]. The second element
      of the tuple is the same data time-shifted to the right by one.
    """
    raw_data = tf.convert_to_tensor(raw_data, name="raw_data", dtype=tf.int32)
    # get size of the 1-d tensor
    data_len = tf.size(raw_data)
    # calculate the length of each of the batch_size rows
    batch_len = data_len // batch_size
    # crop data that does not fit and reshape to [batch_size, batch_len]
    data = tf.reshape(raw_data[0:batch_size*batch_len],
                      [batch_size, batch_len])
    # calculate how many batches there are in an epoch
    epoch_size = (batch_len - 1) // num_steps
    # make sure there is at least one batch
    assertion = tf.assert_positive(epoch_size,
        message="epoch_size == 0, decrease batch_size or num_steps")
    with tf.control_dependencies([assertion]):
      epoch_size = tf.identity(epoch_size, name="epoch_size")

    # start generating slices
    # range_input_producer returns a sequence of IDs 
    i = tf.train.range_input_producer(epoch_size, shuffle=False).dequeue()
    x = data[:, i*num_steps  :(i+1)*num_steps]
    y = data[:, i*num_steps+1:(i+1)*num_steps+1]
    return x, y

## Define hyperparameters

In [0]:
# Define hyperparameters
class Hparam(object):
  learning_rate = 1.0
  max_grad_norm = 5
  num_layers = 1
  num_steps = 10
  vocab_size = len(word_index)
  embedding_size = 100
  hidden_size = 100
  warmup_epochs = 2
  num_epochs_to_train = 5
  keep_prob = 0.8
  lr_decay = 0.9
  batch_size = 100

config = Hparam()

## Construct model
In this step, the entire model structure must be defined completely, including:
* Input
* Size of layers
* Connection between layers
* Variables in layers
* Output
* Loss
* Operations that apply the gradients (optimizer)
* Placeholder for feeding special values
* Properties that can be read from outside

Note that we will use CudnnLSTM to speed up our training if available. However, I will provide two versions of LSTM here in case you cannot find a machine with GPUs.

In [0]:
from tensorflow.contrib.cudnn_rnn import CudnnLSTM
from tensorflow.contrib.rnn import BasicLSTMCell, MultiRNNCell
from tensorflow.nn import embedding_lookup, dropout

# Build our model
class MyModel(object):
  def __init__(self, is_training, config, input_):
    self._is_training = is_training
    self._input = input_
    self._rnn_params = None
    self._cell = None
    self.batch_size = input_.batch_size
    self.num_steps = input_.num_steps
    rnn_size = config.hidden_size
    vocab_size = config.vocab_size
    embedding_size = config.embedding_size

    # Place the embedding table and lookup on the CPU
    with tf.device("/cpu:0"):
      embedding_weights = tf.get_variable("embedding", \
                     [vocab_size, embedding_size])
      embed_inputs = tf.nn.embedding_lookup(embedding_weights, input_.sources)

    if is_training and config.keep_prob < 1.:
      embed_inputs = tf.nn.dropout(embed_inputs, config.keep_prob)

    # build RNN using CudnnLSTM
    output, _ = self._build_rnn(embed_inputs, config, is_training)
    # build RNN using basic LSTM
    # output, _ = self._build_rnn_old_lstm(embed_inputs, config, is_training)

    # Remember RNN output is [batch_size x time, rnnsize]
    # Dense layer for projecting onto vocabulary size
    softmax_w = tf.get_variable("softmax_w", [rnn_size, vocab_size])
    softmax_b = tf.get_variable("softmax_b", [vocab_size])
    logits = tf.nn.xw_plus_b(output, softmax_w, softmax_b)
    # Reshape logits to be a 3-D tensor for sequence loss
    logits = tf.reshape(logits, [self.batch_size, self.num_steps, vocab_size])
    self._logits = logits

    # Use the contrib sequence loss and average over the batches
    loss = tf.contrib.seq2seq.sequence_loss(
        logits,
        input_.targets,
        tf.ones([self.batch_size, self.num_steps]),
        average_across_timesteps=False,
        average_across_batch=True)

    # Update the cost
    self._cost = tf.reduce_sum(loss)

    if not is_training:
      return

    # A variable to store learning rate
    self._lr = tf.Variable(0.0, trainable=False)

    # Calculate gradients
    # Get a list of trainable variables
    tvars = tf.trainable_variables()
    # Get gradient and clip by norm
    grads, _ = tf.clip_by_global_norm(\
                 tf.gradients(self._cost, tvars),
                 config.max_grad_norm)
    # Define an optimizer
    # Note that the optimizer reads the value of learning rate from variable
    optimizer = tf.train.GradientDescentOptimizer(self._lr)
    # Define an operation that actually applies the gradients
    self._train_op = optimizer.apply_gradients(
        zip(grads, tvars),
        global_step=tf.train.get_or_create_global_step())
    # A placeholder for feeding new learning rates
    self._new_lr = tf.placeholder(
         tf.float32, shape=[], name="new_learning_rate")
    self._lr_update_op = tf.assign(self._lr, self._new_lr)
  
  def _build_rnn(self, inputs, config, is_training):
    # RNN requires time-major
    inputs = tf.transpose(inputs, [1, 0, 2])
    self._cell = CudnnLSTM(
        num_layers=config.num_layers,
        num_units=config.hidden_size,
        )
    self._cell.build(inputs.get_shape())
    outputs, state = self._cell(inputs)
    # Transpose from time-major to batch-major
    outputs = tf.transpose(outputs, [1, 0, 2])
    # Reshape from [batch, time, rnnsize] to [batch x time, rnnsize]
    # For computing softmax later
    outputs = tf.reshape(outputs, [-1, config.hidden_size])
    return outputs, state

  def _build_rnn_old_lstm(self, inputs, config, is_training):
    def make_cell():
      cell = BasicLSTMCell(
        config.hidden_size, forget_bias=0.0, state_is_tuple=True,
        reuse=not is_training)
      if is_training and config.keep_prob < 1:
        cell = tf.contrib.rnn.DropoutWrapper(
            cell, output_keep_prob=config.keep_prob)
      return cell

    cell = tf.contrib.rnn.MultiRNNCell(
        [make_cell() for _ in range(config.num_layers)], state_is_tuple=True)

    self._initial_state = cell.zero_state(config.batch_size, tf.float32)
    state = self._initial_state
    outputs = []
    inputs = tf.unstack(inputs, num=self.num_steps, axis=1)
    outputs, state = tf.nn.static_rnn(cell, inputs,
                                      initial_state=self._initial_state)
    output = tf.reshape(tf.concat(outputs, 1), [-1, config.hidden_size])
    return output, state
  
  def assign_lr(self, session, lr_value):
    session.run(self._lr_update_op, feed_dict={self._new_lr: lr_value})

  @property
  def input(self):
    return self._input

  @property
  def cost(self):
    return self._cost

  @property
  def lr(self):
    return self._lr

  @property
  def train_op(self):
    return self._train_op

  @property
  def logits(self):
    return self._logits

## Define a training operation for an epoch
This procedure runs the model once per batch and collects its output.
We pass `session.run()` a dictionary with these keys:

* "cost": reads the property `model.cost` that we defined above.
* "do_op": performs the operation `model.train_op`, which applies the gradients.

`session.run()` returns a dictionary with the same keys, each holding the corresponding result.

We can add any key to the dictionary that corresponds to a `@property` in the model!

In [0]:
def run_epoch(session, model, do_op=None, verbose=False):
  start_time = time.time()
  costs = 0.0
  iters = 0
  feed_to_model_dict = {
      "cost": model.cost,
  }
  # if an operation is provided, put that in the feed
  if do_op is not None:
    feed_to_model_dict["do_op"] = do_op

  for step in range(model.input.epoch_size):
    # use the session to run, feed the dictionary
    s_out = session.run(feed_to_model_dict)
    # The returned dictionary will contain the information we need
    cost = s_out["cost"]
    # Accumulate cost
    costs += cost
    # Accumulate number of training steps
    iters += model.input.num_steps
    # Print loss about ten times per epoch (max() guards tiny epochs
    # against a modulo-by-zero error)
    if verbose and step % max(model.input.epoch_size // 10, 1) == 0:
      print("%.0f%% ppl: %.3f, speed: %.0f char/sec" %
            (step * 100.0 / model.input.epoch_size, \
             np.exp(costs/iters), \
             iters * model.input.batch_size/(time.time() - start_time)))

  return np.exp(costs / iters)


## Define an operation for printing test data
In the end, we want to see the generated poems. We will modify the decode procedure from last week's codelab to show Chinese characters.

Here, we will again use `feed_to_model_dict` to get the output from the last layer (`logits`). Since the logits are unnormalized scores over the vocabulary, we take the `argmax` to get the ID of the most likely character at each step.
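
As a small illustration of this decoding step, the sketch below picks the highest-scoring ID at each time step and maps it back to a character. The vocabulary and the score values are made up; the `softmax` helper is only there to show that normalizing the scores does not change which ID wins.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

toy_vocab = {0: "<PAD>", 1: "春", 2: "风", 3: "月"}   # made-up ID-to-char table
toy_logits = [[0.1, 2.0, 0.3, -1.0],                   # scores at time step 0
              [0.0, 0.2, 0.1, 3.5]]                    # scores at time step 1

ids = [argmax(step) for step in toy_logits]
print(ids, "".join(toy_vocab[i] for i in ids))

# Softmax is monotonic, so the argmax of the probabilities is the same ID.
assert ids == [argmax(softmax(step)) for step in toy_logits]
```

Always taking the `argmax` is a greedy choice; it shows what the model considers most likely but produces the same output every time for the same input.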

In [0]:
def run_test(session, model):

  def decode_text(text):
    words = [reverse_word_index.get(i, "<UNK>") for i in text]
    fixed_width_string = []
    # limit max length = 5
    for w_pos in range(len(words)):
      fixed_width_string.append(words[w_pos])
      if (w_pos+1) % 5 == 0:
        fixed_width_string.append('\n')
    return ''.join(fixed_width_string)

  feed_to_model_dict = {
      "logits": model.logits,
  }

  vals = session.run(feed_to_model_dict)
  logits = vals["logits"]
  # Transform to a list of IDs
  logits = logits.squeeze().argmax(axis=-1).tolist()
  print(decode_text(logits))
  return

## Main training controller
Finally, we define a controller that:
* Creates the model for training
* Creates the model for testing, which reuses the variables of the training model
* Prepares the input data
* Defines what to log during training
* Creates a `session` that communicates with the computation graph
* Optionally changes the learning rate
* Gets the test set results
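
The learning-rate change in the controller follows a simple schedule: the rate stays constant for `warmup_epochs` epochs and is then multiplied by `lr_decay` once per epoch. The sketch below reproduces that formula outside TensorFlow, using the default values from `Hparam`; the function name is hypothetical.

```python
def learning_rate_for_epoch(epoch_index, base_lr=1.0, lr_decay=0.9,
                            warmup_epochs=2):
    """Learning rate used in epoch `epoch_index` (0-based)."""
    exponent = max(epoch_index + 1 - warmup_epochs, 0)
    return base_lr * lr_decay ** exponent

schedule = [learning_rate_for_epoch(i) for i in range(5)]
for i, lr in enumerate(schedule):
    print("Epoch: %d Learning rate: %.3f" % (i + 1, lr))
```

With these values the first two epochs run at 1.0, and the following epochs at 0.9, 0.81 and 0.729.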


In [0]:
def main(_):

  eval_config = Hparam()
  eval_config.batch_size = 1
  eval_config.num_steps = 20

  with tf.Graph().as_default():
    initializer = tf.random_uniform_initializer(-0.1, 0.1)

    with tf.name_scope("Train"):
      # Create input producer
      train_input = PoemInput(poems_train, config, name="TrainInput")
      # Create the model instance
      with tf.variable_scope("Model", reuse=None, initializer=initializer):
        m = MyModel(is_training=True, config=config, input_=train_input)
      # Add information to logs
      tf.summary.scalar("Training_Loss", m.cost)
      tf.summary.scalar("Learning_Rate", m.lr)

    with tf.name_scope("Test"):
      # Create another input for test data
      # Note that eval_config was set locally
      test_input = PoemInput(poems_test, eval_config, name="TestInput")
      # Create another model but reuse the variables in the training model
      with tf.variable_scope("Model", reuse=True, initializer=initializer):
        mtest = MyModel(is_training=False, config=eval_config,
                         input_=test_input)
    # Hardware settings
    config_proto = tf.ConfigProto(allow_soft_placement=True)
    # Create a session that controls the training process
    # Also automatically logs and reports 
    # Note the `checkpoint_dir` setting
    with tf.train.MonitoredTrainingSession(checkpoint_dir="logs", \
                                           config=config_proto, \
                                           log_step_count_steps=-1) as session:
      for i in range(config.num_epochs_to_train):
        # Calculate learning rate decay
        lr_decay = config.lr_decay ** max(i + 1 - config.warmup_epochs, 0.0)
        # Set learning rate
        m.assign_lr(session, config.learning_rate * lr_decay)
        # Print new learning rate
        print("Epoch: %d Learning rate: %.3f" % (i + 1, session.run(m.lr)))
        # Train one epoch and report loss
        train_perplexity = run_epoch(session, m, do_op=m.train_op,
                                     verbose=True)
        print("Epoch: %d Train Perplexity: %.3f" % (i + 1, train_perplexity))
      
      # End of training
      # Evaluate test set performance
      test_perplexity = run_epoch(session, mtest)
      print("Test Perplexity: %.3f" % test_perplexity)
      # Print some examples from test
      run_test(session, mtest)


## Start training
We can actually start training by calling the controller.

In [0]:
main(1)

We can see the training process shown here. Observe that training loss keeps decreasing, which means that the model is actually learning. 
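
The numbers reported during training are perplexities, computed in `run_epoch` as the exponential of the accumulated cost divided by the number of time steps. A minimal sketch of that arithmetic, with a made-up helper name and made-up inputs, shows how to read the value: a model that guesses uniformly among V characters has a perplexity of V.

```python
import math

def perplexity(per_batch_costs, num_steps):
    """exp(total cost / total time steps), as accumulated in run_epoch."""
    total_cost = sum(per_batch_costs)
    total_steps = num_steps * len(per_batch_costs)
    return math.exp(total_cost / total_steps)

# Assumption for the demo: a vocabulary of 5000 characters and a model that
# assigns each of them probability 1/5000, so the loss per character is
# log(5000) and the cost of one batch with 10 time steps is 10 * log(5000).
vocab = 5000
uniform_cost = 10 * math.log(vocab)
print(perplexity([uniform_cost] * 3, num_steps=10))
```

A perplexity well below the vocabulary size therefore means that the model has narrowed down the plausible next characters considerably.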

Also, thanks to the speedup from CudnnLSTM, training can be very fast (around 100,000 chars/sec). The basic LSTM only reaches about 6,000 chars/sec.

If you are running this script locally, starting `tensorboard` and pointing it at the `logs` directory will let you see the loss plot over time. We will not be able to show that easily in the Colab environment.

You can also continue training by calling the controller again. Try this later and see if the generated poems get better over time.

In [0]:
main(2)

## Clear previous output

TensorFlow will automatically load a previous model if you specify a checkpoint path for the `session`. However, that becomes a problem if you change parts of the model, e.g., the embedding size, LSTM size, or number of layers.

You will see something like 
```
INFO:tensorflow:Restoring parameters from logs/model.ckpt-4465
...
InvalidArgumentError: Assign requires shapes of both tensors to match.
```
Always remember to clear the output directory if you are experimenting with different model structures!
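
Outside a notebook, where the `!` shell escape is not available, the same cleanup can be done in plain Python. This is a sketch that runs against a throwaway directory so nothing real is deleted; `clear_checkpoints` is a hypothetical helper and the checkpoint file name is made up.

```python
import os
import shutil
import tempfile

def clear_checkpoints(log_dir):
    """Delete log_dir and everything in it; do nothing if it does not exist."""
    shutil.rmtree(log_dir, ignore_errors=True)

# Build a fake "logs" directory inside a temporary folder for the demo.
demo_root = tempfile.mkdtemp()
demo_logs = os.path.join(demo_root, "logs")
os.makedirs(demo_logs)
open(os.path.join(demo_logs, "model.ckpt-0.index"), "w").close()

clear_checkpoints(demo_logs)
print(os.path.exists(demo_logs))

# Calling it again on the missing directory is harmless.
clear_checkpoints(demo_logs)
shutil.rmtree(demo_root, ignore_errors=True)
```

In the codelab itself the directory to pass would be `"logs"`, the `checkpoint_dir` used by the controller.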

In [0]:
!rm -R logs

# Summary
What we learned today:
1. Preprocessing for language modeling data
    * Create a dictionary that maps words to unique IDs
    * Convert words to ID
    * Reshape sequences to unified lengths
    * Create a helper to produce data
2. Building a model using TensorFlow
    * Hyperparameters
    * Training operation
    * Testing operation
    * Control function
3. Training and evaluation
    * Observe loss
    * Evaluate on test set

You are now capable of building a deep learning model for a basic seq2seq task using **TensorFlow**!

However, TensorFlow is extremely complicated (but powerful). There are numerous examples online for you to explore.

## Extension

Can you think of anything else that may also be learned using this model? 

# Appendix: connect your Google Drive to Colab for uploading your data

First, copy the file into Google Drive. Then run the following code to link your Drive to this notebook. Follow the link in the output.

In [0]:
from google.colab import drive
drive.mount('/gdrive')

Copy (`cp`) the file from `/gdrive` to this server.

In [0]:
!cp /gdrive/My\ Drive/Colab\ Notebooks/poetry.txt ./

In [0]:
!cp /gdrive/My\ Drive/Colab\ Notebooks/Book*.txt ./

# Part III: Attention


## Task: Translation
This part gives hints on how to add attention to a seq2seq model in order to perform translation.
### CWMT corpus
This is a Chinese-English translation dataset.

Visit the source website to download it manually:
http://nlp.nju.edu.cn/cwmt-wmt/

Take a look at some examples:

In [0]:
import time
import numpy as np
import tensorflow as tf
from tensorflow.contrib.cudnn_rnn import CudnnLSTM
from tensorflow.contrib.rnn import BasicLSTMCell, MultiRNNCell
from tensorflow.nn import embedding_lookup, dropout

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [0]:
c_sents = [ss.strip() for ss in open('Book14_cn.txt').readlines()]

In [0]:
c_sents[0]

In [0]:
e_sents=[ss.strip() for ss in open('Book14_en.txt').readlines()]


In [0]:
e_sents[0]

In [0]:
c_tokenizer = Tokenizer(num_words=None, lower=False, char_level=True)
# Create word to ID dictionary
c_tokenizer.fit_on_texts(c_sents)
# Get dictionary
c_word_index = c_tokenizer.word_index
# Fix word to ID
c_word_index = {c:i+1 for c, i in c_word_index.items()}
c_word_index["<PAD>"] = 0
c_word_index["<UNK>"] = 1
c_tokenizer.word_index = c_word_index
c_reverse_word_index = dict([(v, k) for (k, v) in c_word_index.items()])

In [0]:
# sort word index by ID
for (w,i) in sorted(c_word_index.items(), key=lambda w: w[1]):
# print some words to check if there are errors!
  if i > 10 and i < len(c_word_index)-5: continue
  print("{} {}".format(w,i))

In [0]:
e_vocab_size = 20000
e_tokenizer = Tokenizer(num_words=e_vocab_size, lower=True, oov_token="<UNK>")
# Create word to ID dictionary
e_tokenizer.fit_on_texts(e_sents)
# Get dictionary
e_word_index = e_tokenizer.word_index
# Fix word to ID
e_word_index = {e:i+1 for e, i in e_word_index.items() if i < e_vocab_size-1}
e_word_index["<PAD>"] = 0
e_word_index["<UNK>"] = 1
e_tokenizer.word_index = e_word_index
e_reverse_word_index = dict([(v, k) for (k, v) in e_word_index.items()])
# sort word index by ID
for (w,i) in sorted(e_word_index.items(), key=lambda w: w[1]):
# print some words to check if there are errors!
  if i > 10 and i < len(e_word_index)-5: continue
  print("{} {}".format(w,i))

In [0]:
c_sents = c_tokenizer.texts_to_sequences(c_sents)
e_sents = e_tokenizer.texts_to_sequences(e_sents)
c_sents = pad_sequences(c_sents,value=c_word_index["<PAD>"], padding='post', truncating='post', maxlen=10)
e_sents = pad_sequences(e_sents,value=e_word_index["<PAD>"], padding='post', truncating='post', maxlen=10)
train_data = (c_sents[:-5], e_sents[:-5])
test_data = (c_sents[-5:], e_sents[-5:])

In [0]:
train_data[0][0], train_data[0][1]

### Change hyperparameters
* Add separate vocabulary sizes for English and Chinese

In [0]:
# Define hyperparameters
class Hparam(object):
  # ...
  source_vocab_size = len(c_word_index)
  target_vocab_size = len(e_word_index)
  # ...

### Prepare input for translation

* Modify `class PoemInput(object)` to create different source and target sentences. Most importantly, change the final part.
* Use `pad_sequences` to pad both Chinese and English sentences
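
What the padding step does to each sentence can be sketched in plain Python. The helper below is a made-up stand-in that mirrors the behaviour of `pad_sequences` with `padding='post'` and `truncating='post'`, the settings used earlier in this part; it is not the Keras implementation, and the input IDs are toy values.

```python
def pad_post(sequences, maxlen, value=0):
    """Cut each sequence to maxlen from the end, then pad at the end with `value`."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                             # truncating='post'
        padded.append(seq + [value] * (maxlen - len(seq)))   # padding='post'
    return padded

toy_sentences = [[5, 6], [1, 2, 3, 4, 5]]   # made-up word IDs
print(pad_post(toy_sentences, maxlen=4))
```

After this step every source and target sentence has the same length, so a batch can be cut out of the data as a rectangular slice.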



In [0]:
class TranslationInput(object):
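  def __init__(self, data, config, name=None):
    self.batch_size = batch_size = config.batch_size
    self.num_steps = config.num_steps
    # number of batches per epoch; run_epoch reads this attribute
    self.epoch_size = len(data[0]) // batch_size
    self.sources, self.targets = self.input_producer(
        data, batch_size, name=name)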

  def input_producer(self, raw_data, batch_size, name=None):
    source_data = tf.convert_to_tensor(raw_data[0], name="source_data", dtype=tf.int32)
    target_data = tf.convert_to_tensor(raw_data[1], name="target_data", dtype=tf.int32)

    num_batches = len(raw_data[0]) // self.batch_size
    i = tf.train.range_input_producer(num_batches, shuffle=False).dequeue()
    x = source_data[i*self.batch_size:(i+1)*self.batch_size, :]
    y = target_data[i*self.batch_size:(i+1)*self.batch_size, :]
    return x, y

## Build Translation Model
This is also a seq2seq model, with some major differences:
* Use one LSTM as the encoder
* Add another as the decoder

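```python
# Methods of the translation model; the argument lists and default names
# are suggestions, adapt them to your own class.
def _build_rnn_encoder(self, inputs, config, name="encoder"):
    # RNN requires time-major
    inputs = tf.transpose(inputs, [1, 0, 2])
    self._enccell = CudnnLSTM(
        num_layers=config.num_layers,
        num_units=config.hidden_size,
        name=name)
    outputs, state = self._enccell(inputs)
    return outputs, state

def _build_rnn_decoder(self, inputs, config, initial_state=None,
                       name="decoder"):
    # inputs are expected to be time-major here as well;
    # pass the encoder state as initial_state to condition the decoder
    self._deccell = CudnnLSTM(
            num_layers=config.num_layers,
            num_units=config.hidden_size,
            name=name)

    outputs, state = self._deccell(inputs, initial_state=initial_state)
    # Transpose from time-major to batch-major
    outputs = tf.transpose(outputs, [1, 0, 2])
    return outputs, state
```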

### Modify training controller

Use `TranslationInput` and `MyTranslationModel`.

```python
train_input = TranslationInput(train_data, config, name="TrainInput")

m = MyTranslationModel(is_training=True, config=config, input_=train_input)

```

### Snippet for adding attention mechanism in the model
* Calculate attention score
* Normalize score
* Calculate context vector = *attention weighted sum*
* Concatenate context vector with input
* Use decoder to decode next step
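
The steps above can be traced on a single example with plain Python lists instead of tensors. This is only a sketch under stated assumptions: it uses the additive score `v . tanh(W1 * enc + W2 * hidden)`, and all weights, vectors, and helper names are made-up toy values rather than part of the model in this codelab.

```python
import math

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(enc_outputs, hidden, W1, W2, v):
    """enc_outputs: one vector per source position; hidden: decoder state."""
    projected_hidden = matvec(W2, hidden)
    scores = []
    for enc in enc_outputs:
        # 1. attention score for this source position
        energy = [math.tanh(a + b)
                  for a, b in zip(matvec(W1, enc), projected_hidden)]
        scores.append(sum(vi * ei for vi, ei in zip(v, energy)))
    # 2. normalize the scores over the source positions
    weights = softmax(scores)
    # 3. context vector = attention-weighted sum of encoder outputs
    size = len(enc_outputs[0])
    context = [sum(w * enc[d] for w, enc in zip(weights, enc_outputs))
               for d in range(size)]
    return context, weights

identity = [[1.0, 0.0], [0.0, 1.0]]            # toy projection matrices
enc_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [1.0, 0.0]
context, weights = additive_attention(enc_outputs, hidden,
                                      identity, identity, v=[1.0, 1.0])
# 4. concatenate the context vector with the embedded decoder input
embedded_input = [0.5, -0.5, 0.25]             # toy embedding
decoder_input = context + embedded_input
print(weights, decoder_input)
```

The weights always sum to one, so the context vector is a weighted average of the encoder outputs; `decoder_input` is what would be fed to the decoder for the next step.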

In [0]:
# Require:
# hidden: decoder hidden (memory)
# enc_output: encoder output
        
# hidden shape == (batch_size, hidden size)
# hidden_with_time_axis shape == (batch_size, 1, hidden size)
# we are doing this to perform addition to calculate the score
hidden_with_time_axis = tf.expand_dims(hidden, 1)

# enc_output shape == (batch_size, max_length, hidden_size)
# score shape == (batch_size, max_length, hidden_size)
score = tf.nn.tanh(W1(enc_output) + W2(hidden_with_time_axis))

# attention_weights shape == (batch_size, max_length, 1)
# we get 1 at the last axis because we are applying score to V
attention_weights = tf.nn.softmax(V(score), axis=1)

# context_vector shape after sum == (batch_size, hidden_size)
context_vector = attention_weights * enc_output
context_vector = tf.reduce_sum(context_vector, axis=1)

# x shape after passing through embedding == (batch_size, 1, embedding_dim)
x = self.embedding(input_)

# x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

# passing the concatenated vector to the decoder
output, state = self.decoder(x)

# output shape == (batch_size * 1, hidden_size)
output = tf.reshape(output, (-1, output.shape[2]))

# output shape == (batch_size * 1, vocab)
x = SoftmaxLayer(output)

# Output:
# x: decoder output
# state: decoder state
# attention_weights: weights over encoder output at one time