##### Copyright 2018 The TensorFlow Authors.

Licensed under the Apache License, Version 2.0 (the "License").

# Neural Machine Translation with Attention

<table class="tfo-notebook-buttons" align="left"><td>
<a target="_blank"  href="https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>  
</td><td>
<a target="_blank"  href="https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb"><img width=32px src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a></td></table>

This notebook trains a sequence to sequence (seq2seq) model for Spanish to English translation using TF 2.0 APIs. This is an advanced example that assumes some knowledge of sequence to sequence models.

After training the model in this notebook, you will be able to input a Spanish sentence, such as *"¿todavia estan en casa?"*, and return the English translation: *"are you still at home?"*

The translation quality is reasonable for a toy example, but the generated attention plot is perhaps more interesting. It shows which parts of the input sentence have the model's attention while translating:

<img src="https://tensorflow.org/images/spanish-english.png" alt="spanish-english attention plot">

Note: This example takes approximately 10 minutes to run on a single P100 GPU.

In [None]:
!pip install tf-nightly-2.0-preview

In [None]:
from __future__ import absolute_import, division, print_function

import collections
import itertools
import os
import random
import re
import time
import unicodedata

import numpy as np
import tensorflow as tf
# TODO(brianklee): remove when summaryv2 ops are exported
from tensorflow.python.ops import summary_ops_v2

import matplotlib.pyplot as plt

print(tf.__version__)

## Download and prepare the dataset

We'll use a language dataset provided by http://www.manythings.org/anki/. This dataset contains language translation pairs in the format:

```
May I borrow this book?	¿Puedo tomar prestado este libro?
```

There are a variety of languages available, but we'll use the English-Spanish dataset. For convenience, we've hosted a copy of this dataset on Google Cloud, but you can also download your own copy. After downloading the dataset, here are the steps we'll take to prepare the data:

1. Add a *start* and *end* token to each sentence.
2. Clean the sentences by removing special characters.
3. Create a word index and reverse word index (dictionaries mapping from word → id and id → word).
4. Pad each sentence to a maximum length.
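
As a rough illustration of steps 3 and 4, the sketch below builds a word index for a tiny, made-up corpus and pads the resulting id lists to a common length. The helper names `build_index` and `pad_post` are hypothetical and used only here; the notebook itself relies on `LanguageIndex` and `tf.keras.preprocessing.sequence.pad_sequences`.

```python
import collections

def build_index(sentences):
    """Map each distinct token to an integer id, reserving 0 for unknowns."""
    vocab = sorted(set(token for sentence in sentences for token in sentence))
    word2idx = collections.defaultdict(int)  # unseen words map to 0
    idx2word = {0: '<OOV>'}
    for i, word in enumerate(vocab):
        word2idx[word] = i + 1
        idx2word[i + 1] = word
    return word2idx, idx2word

def pad_post(ids, length, pad_id=0):
    """Right-pad a list of ids with pad_id up to `length`."""
    return ids + [pad_id] * (length - len(ids))

# Toy corpus (hypothetical), already cleaned and wrapped in start/end tokens.
sentences = [
    ['<start>', 'go', '.', '<end>'],
    ['<start>', 'may', 'i', 'go', '?', '<end>'],
]
word2idx, idx2word = build_index(sentences)
max_len = max(len(s) for s in sentences)
padded = [pad_post([word2idx[w] for w in s], max_len) for s in sentences]
for row in padded:
    print(row)
```

Note that in this scheme id 0 doubles as both the padding value and the out-of-vocabulary id.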

In [None]:
class LanguageIndex(object):
  """A bidirectional mapping from word <=> integer index."""

  def __init__(self, vocab):
    self.word2idx = collections.defaultdict(int)  # If not in vocab, return 0.
    self.idx2word = {}
    for i, word in enumerate(vocab):
      self.word2idx[word] = i + 1
      self.idx2word[i + 1] = word
    self.idx2word[0] = '<OOV>'  # "Out of Vocab"

  def __len__(self):
    return len(self.idx2word)


In [None]:
# Converts the unicode file to ascii
# https://stackoverflow.com/a/518232/2809427
def unicode_to_ascii(s):
  return ''.join(c for c in unicodedata.normalize('NFD', s)
                 if unicodedata.category(c) != 'Mn')

START_TOKEN = '<start>'
END_TOKEN = '<end>'

def preprocess_sentence(w):
  w = unicode_to_ascii(w.lower().strip())

  # creating a space between a word and the punctuation following it
  # eg: "he is a boy." => "he is a boy ."
  # https://stackoverflow.com/a/3645931/3645946
  w = re.sub(r'([?.!,¿])', r' \1 ', w)

  # replacing everything with space except (a-z, A-Z, '.', '?', '!', ',', '¿')
  w = re.sub(r'[^a-zA-Z?.!,¿]+', ' ', w)

  # adding a start and an end token to the sentence
  # so that the model knows when to start and stop predicting.
  return [START_TOKEN] + w.split() + [END_TOKEN]
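
To see what this cleaning produces, here is a self-contained copy of the two functions above applied to one sample sentence. The sample sentence is chosen only for illustration; accents are stripped, punctuation becomes its own token, and the start and end tokens are added.

```python
import re
import unicodedata

START_TOKEN = '<start>'
END_TOKEN = '<end>'

def unicode_to_ascii(s):
    # Decompose accented characters and drop the combining marks.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

def preprocess_sentence(w):
    w = unicode_to_ascii(w.lower().strip())
    # Put spaces around punctuation so each mark becomes its own token.
    w = re.sub(r'([?.!,¿])', r' \1 ', w)
    # Collapse anything that is not a letter or kept punctuation into a space.
    w = re.sub(r'[^a-zA-Z?.!,¿]+', ' ', w)
    return [START_TOKEN] + w.split() + [END_TOKEN]

tokens = preprocess_sentence('¿Todavía están en casa?')
print(tokens)
# ['<start>', '¿', 'todavia', 'estan', 'en', 'casa', '?', '<end>']
```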


Training on the complete dataset of >100,000 sentences will take a long time. To train faster, we can limit the size of the dataset (of course, translation quality degrades with less data).


In [None]:
# TODO(brianklee): This preprocessing should ideally be implemented in TF
# because preprocessing should be exported as part of the SavedModel.
# In other words, the NmtTranslator's interface should take a list of strings,
# not a list of integers.
def load_anki_data(num_examples=None):
  # Download the file
  path_to_zip = tf.keras.utils.get_file(
      'spa-eng.zip', origin='http://download.tensorflow.org/data/spa-eng.zip',
      extract=True)

  path_to_file = os.path.dirname(path_to_zip) + '/spa-eng/spa.txt'
  with open(path_to_file, 'rb') as f:
    lines = f.read().decode('utf8').strip().split('\n')
  # Clean the sentences
  eng_spa_pairs = [line.split('\t') for line in lines]
  eng_spa_pairs = [(preprocess_sentence(eng), preprocess_sentence(spa))
                   for eng, spa in eng_spa_pairs]
  # The translations file is ordered from shortest to longest, so slicing from
  # the front will select the shorter examples. This speeds up training.
  if num_examples is not None:
    eng_spa_pairs = eng_spa_pairs[:num_examples]
  eng_sentences, spa_sentences = zip(*eng_spa_pairs)
  # Construct a vocabulary and integer mapping for both Spanish/English.
  eng_vocab = sorted(set(itertools.chain.from_iterable(eng_sentences)))
  spa_vocab = sorted(set(itertools.chain.from_iterable(spa_sentences)))
  eng_index = LanguageIndex(eng_vocab)
  spa_index = LanguageIndex(spa_vocab)
  return eng_spa_pairs, eng_index, spa_index


In [None]:
NUM_EXAMPLES = 30000
sentence_pairs, english_index, spanish_index = load_anki_data(NUM_EXAMPLES)


In [None]:
# Turn our english/spanish pairs into TF Datasets by mapping words -> integers.
def make_dataset(eng_spa_pairs, eng_index, spa_index):
  eng_sentences, spa_sentences = zip(*eng_spa_pairs)
  eng_ints = [[eng_index.word2idx[word] for word in sentence] for sentence in eng_sentences]
  spa_ints = [[spa_index.word2idx[word] for word in sentence] for sentence in spa_sentences]

  max_eng_len = max(map(len, eng_sentences))
  max_spa_len = max(map(len, spa_sentences))
  padded_eng_ints = tf.keras.preprocessing.sequence.pad_sequences(eng_ints, maxlen=max_eng_len, padding='post')
  padded_spa_ints = tf.keras.preprocessing.sequence.pad_sequences(spa_ints, maxlen=max_spa_len, padding='post')

  dataset = tf.data.Dataset.from_tensor_slices((padded_eng_ints, padded_spa_ints))
  return dataset


In [None]:
# Train/test split
train_size = int(len(sentence_pairs) * 0.8)
random.shuffle(sentence_pairs)
train_sentence_pairs, test_sentence_pairs = sentence_pairs[:train_size], sentence_pairs[train_size:]
# Show length
len(train_sentence_pairs), len(test_sentence_pairs)


In [None]:
train_sentence_pairs[:10]

In [None]:
# Set up datasets
BATCH_SIZE = 32

train_ds = make_dataset(train_sentence_pairs, english_index, spanish_index)
test_ds = make_dataset(test_sentence_pairs, english_index, spanish_index)
train_ds = train_ds.shuffle(len(train_sentence_pairs)).batch(BATCH_SIZE, drop_remainder=True)
test_ds = test_ds.batch(BATCH_SIZE, drop_remainder=True)


In [None]:
first_batch = next(iter(train_ds))
print("Dataset outputs elements with shape ({}, {})".format(first_batch[0].shape, first_batch[1].shape))

## Write the encoder and decoder model

Here, we'll implement an encoder-decoder model with attention, which you can read about in the TensorFlow [Neural Machine Translation (seq2seq) tutorial](https://www.tensorflow.org/tutorials/seq2seq). This example uses a more recent set of APIs. This notebook implements the [attention equations](https://www.tensorflow.org/tutorials/seq2seq#background_on_the_attention_mechanism) from the seq2seq tutorial. The following diagram shows that each input word is assigned a weight by the attention mechanism, which is then used by the decoder to predict the next word in the sentence.

<img src="https://www.tensorflow.org/images/seq2seq/attention_mechanism.jpg" width="500" alt="attention mechanism">

The input is put through an encoder model which gives us the encoder output of shape *(batch_size, max_length, hidden_size)* and the encoder hidden state of shape *(batch_size, hidden_size)*. 

Here are the equations that are implemented:

<img src="https://www.tensorflow.org/images/seq2seq/attention_equation_0.jpg" alt="attention equation 0" width="800">
<img src="https://www.tensorflow.org/images/seq2seq/attention_equation_1.jpg" alt="attention equation 1" width="800">

We're using *Bahdanau attention*. Let's decide on notation before writing the simplified form:

* FC = Fully connected (dense) layer
* EO = Encoder output
* H = hidden state
* X = input to the decoder

And the pseudo-code:

* `score = FC(tanh(FC(EO) + FC(H)))`
* `attention weights = softmax(score, axis = 1)`. Softmax is applied on the last axis by default, but here we want to apply it on *axis 1*, since the shape of score is *(batch_size, max_length, 1)*. `max_length` is the length of our input. Since we are trying to assign a weight to each input position, softmax should be applied on that axis.
* `context vector = sum(attention weights * EO, axis = 1)`. Same reason as above for choosing axis as 1.
* `embedding output` = The input to the decoder X is passed through an embedding layer.
* `merged vector = concat(embedding output, context vector)`
* This merged vector is then given to the GRU
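
The softmax and context-vector steps above can be sketched in plain Python for a single sentence (no batch dimension). The scores and encoder outputs below are made-up toy values, and `softmax` and `attend` are hypothetical helper names; the notebook computes the same quantities with `tf.nn.softmax` and `tf.reduce_sum` on batched tensors.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, encoder_outputs):
    """Return (weights, context) for one sentence.

    scores: one float per input position (length max_length).
    encoder_outputs: max_length vectors, each of length hidden_size.
    """
    weights = softmax(scores)  # normalise across input positions
    hidden_size = len(encoder_outputs[0])
    # Weighted sum of encoder outputs over the input positions.
    context = [
        sum(w * eo[d] for w, eo in zip(weights, encoder_outputs))
        for d in range(hidden_size)
    ]
    return weights, context

# Toy values (hypothetical): 3 input positions, hidden_size of 2.
scores = [2.0, 0.0, 0.0]
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
weights, context = attend(scores, encoder_outputs)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

Because the first position has the highest score, it receives the largest weight, and the context vector is pulled toward that position's encoder output.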
  
The shapes of all the vectors at each step have been specified in the comments in the code:

In [None]:
def gru(units):
  # If you have a GPU, we recommend using CuDNNGRU (about 3x faster than GRU).
  # The code below selects it automatically.
  if tf.test.is_gpu_available():
    return tf.keras.layers.CuDNNGRU(units,
                                    return_sequences=True,
                                    return_state=True,
                                    recurrent_initializer='glorot_uniform')
  else:
    return tf.keras.layers.GRU(units,
                               return_sequences=True,
                               return_state=True,
                               recurrent_activation='sigmoid',
                               recurrent_initializer='glorot_uniform')


In [None]:
ENCODER_SIZE = DECODER_SIZE = 128
EMBEDDING_DIM = 32
MAX_OUTPUT_LENGTH = 100

In [None]:

class Encoder(tf.keras.Model):
  def __init__(self, vocab_size):
    super(Encoder, self).__init__()
    self.embedding = tf.keras.layers.Embedding(vocab_size, EMBEDDING_DIM)
    self.gru = gru(ENCODER_SIZE)

  def call(self, x, hidden):
    x = self.embedding(x)
    output, state = self.gru(x, initial_state = hidden)
    return output, state

  def initialize_hidden_state(self, batch_size):
    return tf.zeros((batch_size, ENCODER_SIZE))



In [None]:
class Decoder(tf.keras.Model):
  def __init__(self, vocab_size):
    super(Decoder, self).__init__()
    self.vocab_size = vocab_size
    self.embedding = tf.keras.layers.Embedding(vocab_size, EMBEDDING_DIM)
    self.gru = gru(DECODER_SIZE)
    self.fc = tf.keras.layers.Dense(vocab_size)

    # used for attention
    self.W1 = tf.keras.layers.Dense(DECODER_SIZE)
    self.W2 = tf.keras.layers.Dense(DECODER_SIZE)
    self.V = tf.keras.layers.Dense(1)

  def call(self, x, hidden, enc_output):
    # enc_output shape == (batch_size, max_length, hidden_size)

    # hidden shape == (batch_size, hidden size)
    # hidden_with_time_axis shape == (batch_size, 1, hidden size)
    # we are doing this to perform addition to calculate the score
    hidden_with_time_axis = tf.expand_dims(hidden, 1)

    # score shape == (batch_size, max_length, 1)
    # we get 1 at the last axis because we are applying self.V to tanh(FC(EO) + FC(H))
    score = self.V(tf.nn.tanh(self.W1(enc_output) + self.W2(hidden_with_time_axis)))

    # attention_weights shape == (batch_size, max_length, 1)
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * enc_output
    context_vector = tf.reduce_sum(context_vector, axis=1)

    # x shape after passing through embedding == (batch_size, 1, EMBEDDING_DIM)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, EMBEDDING_DIM + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU
    output, state = self.gru(x)

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size * 1, vocab)
    x = self.fc(output)

    return x, state, attention_weights

  def initialize_hidden_state(self, batch_size):
    return tf.zeros((batch_size, DECODER_SIZE))


## Define a translate function

1. Pass the *input* through the *encoder*, which returns the *encoder output* and the *encoder hidden state*.
2. The encoder output, the encoder hidden state, and the decoder's initial input (i.e. the &lt;START&gt; token) are passed to the decoder.
3. The decoder returns the *predictions* and the *decoder hidden state*.
4. The next token is then fed back into the decoder repeatedly. This has two different behaviors under training and inference:
  - during training, we use *teacher forcing*, where the correct next word is fed into the decoder, regardless of what the decoder emitted.
  - during inference, we use `tf.argmax` to select the most likely continuation and feed it back into the decoder. This repeats until either the decoder emits an &lt;END&gt; token, indicating that it's done translating, or we run into a hardcoded length limit. 
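
The difference between the two modes can be sketched with a toy stand-in for the decoder. Everything below is hypothetical: `TOY_NEXT_TOKEN` is a lookup table that plays the role of the decoder's most likely next token, whereas a real decoder returns logits over the whole vocabulary.

```python
START, END = '<start>', '<end>'

# Hypothetical stand-in for the decoder: previous token -> predicted next token.
TOY_NEXT_TOKEN = {
    START: 'hola',
    'hola': 'mundo',
    'mundo': END,
    'adios': 'amigo',
    'amigo': END,
}

def toy_decoder_step(prev_token):
    return TOY_NEXT_TOKEN.get(prev_token, END)

def decode(target=None, max_length=10):
    """Decode with teacher forcing if `target` is given, else greedily."""
    outputs = [START]
    if target is not None:
        # Teacher forcing: feed the correct previous token at every step,
        # regardless of what the decoder emitted.
        for step in range(1, len(target)):
            outputs.append(toy_decoder_step(target[step - 1]))
        return outputs
    # Greedy decoding: feed the decoder's own previous output back in,
    # stopping at <end> or at the length limit.
    while outputs[-1] != END and len(outputs) < max_length:
        outputs.append(toy_decoder_step(outputs[-1]))
    return outputs

print(decode())
print(decode(target=[START, 'adios', 'amigo', END]))
```

With teacher forcing, the first prediction (`hola`) is wrong with respect to the target, but the next step is still fed the correct token `adios`, so the remaining predictions are conditioned on the true sentence rather than on the mistake.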


In [None]:
class NmtTranslator(tf.keras.Model):
  def __init__(self, encoder, decoder, start_token_id):
    super(NmtTranslator, self).__init__()
    self.encoder = encoder
    self.decoder = decoder
    # Uses start_token_id to initialize the decoder. (The token ID should
    # match the decoder's language.)
    self.start_token_id = start_token_id

  def call(self, inp, targ=None):
    '''Translate an input.

    If targ is provided, teacher forcing is used to generate the translation.
    '''
    batch_size = inp.shape[0]
    hidden = self.encoder.initialize_hidden_state(batch_size)
    enc_output, enc_hidden = self.encoder(inp, hidden)
    dec_hidden = enc_hidden

    if targ is not None:
      output_length = targ.shape[1]
    else:
      output_length = MAX_OUTPUT_LENGTH
    predictions_array = tf.TensorArray(tf.float32, output_length)
    attention_array = tf.TensorArray(tf.float32, output_length)
    # Initialize predictions array with <START> token
    predictions = tf.one_hot([self.start_token_id] * batch_size, self.decoder.vocab_size)
    predictions_array = predictions_array.write(0, predictions)
    attention_array = attention_array.write(0, tf.zeros((batch_size, enc_output.shape[1], 1)))
    for i in tf.range(1, output_length):
      if targ is not None:
        # if target is known, use teacher forcing
        dec_input = targ[:, i - 1]
      else:
        # Otherwise, pick the most likely continuation
        dec_input = tf.argmax(predictions, axis=1)
      # Keras RNNs want a sequence rather than a single time step, so add a time dimension.
      dec_input = tf.expand_dims(dec_input, 1)
      # passing enc_output to the decoder
      predictions, dec_hidden, attention_weights = self.decoder(dec_input, dec_hidden, enc_output)
      # TODO: Mask predictions so that if <END> is emitted, no further outputs are recorded
      # TODO: Check if all sequences have reached <END>, stop running decoder.
      predictions_array = predictions_array.write(i, predictions)
      attention_array = attention_array.write(i, attention_weights)

    # Transpose from [time, batch, predictions] -> [batch, time, predictions]
    return tf.transpose(predictions_array.stack(), [1, 0, 2]), tf.transpose(attention_array.stack(), [1, 0, 2, 3])
    
  
    

In [None]:
encoder = Encoder(len(english_index))
decoder = Decoder(len(spanish_index))
start_token_id = spanish_index.word2idx[START_TOKEN]
model = NmtTranslator(encoder, decoder, start_token_id)

## Define the optimizer and the loss function

In [None]:
optimizer = tf.keras.optimizers.Adam()

def loss_fn(real, pred):
  # The target output is a batch of sentences of varying length,
  # and the prediction is a batch of sentences of varying length,
  # which aren't necessarily the same length as the target output.

  # Metrics like BLEU try to handle this, but we'll hack a simpler loss here.

  # First, cut down the prediction to the correct shape
  pred = pred[:, :real.shape[1], :]
  # then mask the prediction so that we're comparing word-for-word with
  # the true answer, and ignoring any extra words
  # i.e. 
  # ['This', 'is', 'the', 'correct', 'answer', '.', '<end>', '<padding>', '<padding>', '<padding>']
  # ['This', 'is', 'what', 'the', 'model', 'emitted', 'blah', 'blah', '.', '<end>']
  # results in comparing
  # This/This, is/is, the/what, correct/the, answer/model, ./emitted, <end>/blah, and ignoring the rest.
  # 1.0 for real tokens, 0.0 for padding (id 0)
  mask = tf.cast(tf.not_equal(real, 0), tf.float32)
  loss_ = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=real, logits=pred) * mask
  return tf.reduce_mean(loss_)

# Create a step variable to track how much we've trained.
global_step = tf.Variable(0, dtype=tf.int64)
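
To make the masking concrete, here is a plain-Python sketch of the same idea for a single sentence. The function name `masked_mean_loss` and the toy inputs are hypothetical; the notebook's `loss_fn` does this on batched tensors with `tf.nn.sparse_softmax_cross_entropy_with_logits`.

```python
import math

def masked_mean_loss(target_ids, logits, pad_id=0):
    """Mean cross-entropy over a sequence, zeroing out padded positions."""
    total = 0.0
    for target, step_logits in zip(target_ids, logits):
        if target == pad_id:
            continue  # padded positions contribute zero loss
        # Cross-entropy of the target id: log-sum-exp minus the target logit.
        peak = max(step_logits)
        log_sum_exp = peak + math.log(
            sum(math.exp(v - peak) for v in step_logits))
        total += log_sum_exp - step_logits[target]
    # As in loss_fn, the mean is taken over the full padded length.
    return total / len(target_ids)

# Toy example: vocabulary of 3 ids, two real tokens followed by two pads.
target_ids = [1, 2, 0, 0]
logits = [[0.0, 0.0, 0.0]] * 4  # uniform predictions at every step
loss = masked_mean_loss(target_ids, logits)
print(round(loss, 4))
```

Only the two real tokens contribute to the total, so whatever the model emits at padded positions has no effect on the loss.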

In [None]:
def train(model, optimizer, dataset, global_step):
  """Trains model on `dataset` using `optimizer`."""
  start = time.time()
  avg_loss = tf.metrics.Mean('loss', dtype=tf.float32)
  for (batch, (inp, targ)) in enumerate(dataset):
    with tf.GradientTape() as tape:
      predictions, _ = model(inp, targ=targ)
      loss = loss_fn(targ, predictions)

    avg_loss(loss)
    gradients = tape.gradient(loss, model.variables)
    optimizer.apply_gradients(zip(gradients, model.variables))
    # Count this batch so that summaries and the epoch report see the step.
    global_step.assign_add(1)
    if batch % 10 == 0:
      summary_ops_v2.scalar('loss', avg_loss.result(), step=global_step)
      avg_loss.reset_states()
      rate = 10 / (time.time() - start)
      print('Step #%d\tLoss: %.6f (%.2f steps/sec)' % (batch, loss, rate))
      start = time.time()


In [None]:
def test(model, dataset, global_step):
  """Perform an evaluation of `model` on the examples from `dataset`."""
  avg_loss = tf.metrics.Mean('loss', dtype=tf.float32)
  for (batch, (inp, targ)) in enumerate(dataset):
    predictions, _ = model(inp)
    loss = loss_fn(targ, predictions)
    avg_loss(loss)

  print('Model test set loss: {:0.4f}'.format(avg_loss.result()))
  summary_ops_v2.scalar('loss', avg_loss.result(), step=global_step)


## Configure model directory

We'll use one directory to save all of our relevant artifacts (summary logs, checkpoints, SavedModel exports, etc.)

In [None]:
# Where to save checkpoints, tensorboard summaries, etc.
MODEL_DIR = '/tmp/tensorflow/nmt_attention'

def apply_clean():
  if tf.io.gfile.exists(MODEL_DIR):
    print('Removing existing model dir: {}'.format(MODEL_DIR))
    tf.io.gfile.rmtree(MODEL_DIR)


In [None]:
# Optional: remove directory
apply_clean()

In [None]:
train_summary_writer = tf.summary.create_file_writer(
  os.path.join(MODEL_DIR, 'summaries', 'train'), flush_millis=10000)
test_summary_writer = tf.summary.create_file_writer(
  os.path.join(MODEL_DIR, 'summaries', 'eval'), flush_millis=10000, name='test')

checkpoint_dir = os.path.join(MODEL_DIR, 'checkpoints')
checkpoint_prefix = os.path.join(checkpoint_dir, 'ckpt')
checkpoint = tf.train.Checkpoint(
    encoder=encoder, decoder=decoder, optimizer=optimizer,
    global_step=global_step)
# Restore variables on creation if a checkpoint exists.
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

In [None]:
NUM_TRAIN_EPOCHS = 10
for i in range(NUM_TRAIN_EPOCHS):
  start = time.time()
  with train_summary_writer.as_default():
    train(model, optimizer, train_ds, global_step)
  end = time.time()
  print('\nTrain time for epoch #{} ({} total steps): {}'.format(
      i + 1, global_step.numpy(), end - start))
  with test_summary_writer.as_default():
    test(model, test_ds, global_step)
  checkpoint.save(checkpoint_prefix)


In [None]:
# function for plotting the attention weights
def plot_attention(attention, sentence, predicted_sentence):
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(1, 1, 1)
    ax.matshow(attention, cmap='viridis')
    
    fontdict = {'fontsize': 14}
    
    ax.set_xticklabels([''] + sentence, fontdict=fontdict, rotation=90)
    ax.set_yticklabels([''] + predicted_sentence, fontdict=fontdict)

    plt.show()

def translate_and_plot(model, sentence, english_index, spanish_index):
    """Run translation on a sentence and plot an attention matrix.
    
    Sentence should be passed in as list of tokenized words.
    """
    english_ints = tf.constant([[english_index.word2idx[word] for word in sentence]])
    # Feed the integer ids (not the raw tokens) to the model.
    predictions, attention = model(english_ints)
    prediction_ids = tf.squeeze(tf.argmax(predictions, axis=-1))
    attention = tf.squeeze(attention)
    predicted_sentence = [spanish_index.idx2word[idx.numpy()] for idx in prediction_ids]
    print('Input: {}'.format(' '.join(sentence)))
    print('Predicted translation: {}'.format(' '.join(predicted_sentence)))
    plot_attention(attention, sentence, predicted_sentence)

In [None]:
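# Each pair is (english_tokens, spanish_tokens); translate the English side.
translate_and_plot(model, train_sentence_pairs[0][0], english_index, spanish_index)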

## Restore the latest checkpoint and test

In [None]:
# restoring the latest checkpoint in checkpoint_dir
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

In [None]:
# The model built above encodes English (english_index) and decodes Spanish
# (spanish_index), so the inputs here are tokenized English sentences.
translate_and_plot(model, preprocess_sentence('it is very cold here.'), english_index, spanish_index)

In [None]:
translate_and_plot(model, preprocess_sentence('this is my life.'), english_index, spanish_index)

In [None]:
translate_and_plot(model, preprocess_sentence('are you still at home?'), english_index, spanish_index)

In [None]:
# an example of a sentence the model may translate incorrectly
translate_and_plot(model, preprocess_sentence('try to find out.'), english_index, spanish_index)

## Next steps

* [Download a different dataset](http://www.manythings.org/anki/) to experiment with translations, for example, English to German, or English to French.
* Experiment with training on a larger dataset, or training for more epochs.
