*This assignment is based on the example provided by TensorFlow.*


In this assignment, you will implement a seq2seq model with different configurations for the machine translation task in NLP. You will:
- Customize two different seq2seq architectures.
- Implement a simple decoder for the machine translation task using your customized layers.

Try to keep your model as simple as possible (minimum number of layers and neurons with acceptable performance).

**Notes:** 
- When you submit your assignment, the output of every cell should be visible.
- You may not change any part of the code outside the predefined sections.
- You can add your implementation only in the predefined sections.

 

In [1]:
import tensorflow as tf
from sklearn.model_selection import train_test_split

import unicodedata
import re
import numpy as np
import os
import io
import time

## Download and prepare the dataset

We'll use a language dataset provided by http://www.manythings.org/anki/. This dataset contains language translation pairs in the format:

```
May I borrow this book?	¿Puedo tomar prestado este libro?
```

There are a variety of languages available, but we'll use the English-Spanish dataset. After downloading the dataset, here are the steps we'll take to prepare the data:

1. Add a *start* and *end* token to each sentence.
2. Clean the sentences by removing special characters.
3. Create a word index and reverse word index (dictionaries mapping from word → id and id → word).
4. Pad each sentence to a maximum length.
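
Steps 3 and 4 can be sketched in plain Python as follows. This is a minimal illustration on a made-up two-sentence corpus, not the notebook's tokenizer: ids here are assigned in order of first appearance, whereas the Keras `Tokenizer` used later orders words by frequency. Id 0 is reserved for padding in both cases.

```python
# Toy corpus (illustrative only, not taken from the dataset file).
sentences = ["<start> es muy dulce . <end>", "<start> hola . <end>"]

# Step 3: build word -> id and id -> word maps. Id 0 is reserved for padding.
word_index = {}
for sentence in sentences:
    for word in sentence.split():
        if word not in word_index:
            word_index[word] = len(word_index) + 1
index_word = {idx: word for word, idx in word_index.items()}

# Step 4: convert to id sequences and pad on the right to the longest length.
sequences = [[word_index[w] for w in s.split()] for s in sentences]
max_len = max(len(seq) for seq in sequences)
padded = [seq + [0] * (max_len - len(seq)) for seq in sequences]

print(padded)  # [[1, 2, 3, 4, 5, 6], [1, 7, 5, 6, 0, 0]]
```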

In [2]:
# Download the file
path_to_zip = tf.keras.utils.get_file(
    'spa-eng.zip', origin='http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip',
    extract=True)

path_to_file = os.path.dirname(path_to_zip)+"/spa-eng/spa.txt"

In [3]:
# Converts the unicode file to ascii
def unicode_to_ascii(s):
  return ''.join(c for c in unicodedata.normalize('NFD', s)
                 if unicodedata.category(c) != 'Mn')


def preprocess_sentence(w):
  w = unicode_to_ascii(w.lower().strip())

  # creating a space between a word and the punctuation following it
  # eg: "he is a boy." => "he is a boy ."
  # Reference:- https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
  w = re.sub(r"([?.!,¿])", r" \1 ", w)
  w = re.sub(r'[" "]+', " ", w)

  # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",", "¿")
  w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)

  w = w.strip()

  # adding a start and an end token to the sentence
  # so that the model knows when to start and stop predicting.
  w = '<start> ' + w + ' <end>'
  return w

In [4]:
en_sentence = u"May I borrow this book?"
sp_sentence = u"¿Puedo tomar prestado este libro?"
print(preprocess_sentence(en_sentence))
print(preprocess_sentence(sp_sentence).encode('utf-8'))

<start> may i borrow this book ? <end>
b'<start> \xc2\xbf puedo tomar prestado este libro ? <end>'


In [5]:
# 1. Remove the accents
# 2. Clean the sentences
# 3. Return word pairs in the format: [ENGLISH, SPANISH]
def create_dataset(path, num_examples):
  lines = io.open(path, encoding='UTF-8').read().strip().split('\n')

  word_pairs = [[preprocess_sentence(w) for w in line.split('\t')]
                for line in lines[:num_examples]]

  return zip(*word_pairs)

In [6]:
en, sp = create_dataset(path_to_file, None)
print(en[-1])
print(sp[-1])

<start> if you want to sound like a native speaker , you must be willing to practice saying the same sentence over and over in the same way that banjo players practice the same phrase over and over until they can play it correctly and at the desired tempo . <end>
<start> si quieres sonar como un hablante nativo , debes estar dispuesto a practicar diciendo la misma frase una y otra vez de la misma manera en que un musico de banjo practica el mismo fraseo una y otra vez hasta que lo puedan tocar correctamente y en el tiempo esperado . <end>


In [7]:
def tokenize(lang):
  lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
  lang_tokenizer.fit_on_texts(lang)

  tensor = lang_tokenizer.texts_to_sequences(lang)

  tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor,
                                                         padding='post')

  return tensor, lang_tokenizer

In [8]:
def load_dataset(path, num_examples=None):
  # creating cleaned input, output pairs
  targ_lang, inp_lang = create_dataset(path, num_examples)

  input_tensor, inp_lang_tokenizer = tokenize(inp_lang)
  target_tensor, targ_lang_tokenizer = tokenize(targ_lang)

  return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer

### Limit the size of the dataset to experiment faster (optional)

Training on the complete dataset of >100,000 sentences will take a long time. To train faster, we can limit the size of the dataset to 20,000 sentences (of course, translation quality degrades with less data):

In [9]:
# Try experimenting with the size of that dataset
num_examples = 20000
input_tensor, target_tensor, inp_lang, targ_lang = load_dataset(path_to_file,
                                                                num_examples)

# Calculate max_length of the target and input tensors
max_length_targ, max_length_inp = target_tensor.shape[1], input_tensor.shape[1]

In [10]:
# Creating training and validation sets using an 80-20 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)

# Show length
print(len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val))

16000 16000 4000 4000


In [11]:
def convert(lang, tensor):
  for t in tensor:
    if t != 0:
      print(f'{t} ----> {lang.index_word[t]}')

In [12]:
print("Input Language; index to word mapping")
convert(inp_lang, input_tensor_train[0])
print()
print("Target Language; index to word mapping")
convert(targ_lang, target_tensor_train[0])

Input Language; index to word mapping
1 ----> <start>
7 ----> es
39 ----> muy
519 ----> dulce
3 ----> .
2 ----> <end>

Target Language; index to word mapping
1 ----> <start>
9 ----> it
10 ----> s
55 ----> very
503 ----> sweet
3 ----> .
2 ----> <end>


### Create a tf.data dataset

In [13]:
BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
steps_per_epoch = len(input_tensor_train)//BATCH_SIZE
embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word_index)+1
vocab_tar_size = len(targ_lang.word_index)+1

dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

In [14]:
example_input_batch, example_target_batch = next(iter(dataset))
example_input_batch.shape, example_target_batch.shape

(TensorShape([64, 16]), TensorShape([64, 10]))

## Write the encoder and decoder model

This is a simple encoder-decoder model for machine translation.

<img src="https://www.guru99.com/images/1/111318_0848_seq2seqSequ1.png" width="500" alt="attention mechanism">

The input is put through an encoder model which gives us the encoder output of shape *(batch_size, max_length, hidden_size)* and the encoder hidden state of shape *(batch_size, hidden_size)*.


<font color="red"><b>What is the main drawback of this architecture?</b></font>
<br>
<font color="red" size=4><div dir=ltr>The bottleneck of this architecture is its use of a single fixed-length vector, which makes translation difficult for long sentences, especially sentences longer than those in the training text. The longer the sentence, the more sharply the performance of the basic encoder-decoder degrades.</div></font>

In [15]:
class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
    super(Encoder, self).__init__()
    self.batch_sz = batch_sz
    self.enc_units = enc_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.enc_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')

  def call(self, x, hidden):
    x = self.embedding(x)
    output, state = self.gru(x, initial_state=hidden)
    return output, state

  def initialize_hidden_state(self):
    return tf.zeros((self.batch_sz, self.enc_units))

In [16]:
encoder = Encoder(vocab_inp_size, embedding_dim, units,BATCH_SIZE)

# sample input
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(example_input_batch, sample_hidden)
print('Encoder output shape: (batch size, sequence length, units)', sample_output.shape)
print('Encoder Hidden state shape: (batch size, units)', sample_hidden.shape)

Encoder output shape: (batch size, sequence length, units) (64, 16, 1024)
Encoder Hidden state shape: (batch size, units) (64, 1024)


Now it's your turn. Implement a layer that, instead of passing only the last encoder output to the decoder, averages the encoder outputs over all time steps. We call the result the *context vector*.
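
As a shape check, the averaging can be sketched with NumPy on a small stand-in array (the sizes 2, 3, and 4 are arbitrary assumptions). Note that an unmasked mean also averages over padded time steps.

```python
import numpy as np

# Stand-in for encoder outputs: (batch_size=2, time_steps=3, hidden_size=4).
enc_outputs = np.arange(24, dtype=np.float32).reshape(2, 3, 4)

# Average over the time axis (axis=1) to get one vector per sentence.
context = enc_outputs.mean(axis=1)

print(context.shape)  # (2, 4)
print(context[0])     # [4. 5. 6. 7.]
```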

In [17]:
class Context_vector(tf.keras.layers.Layer):
  def __init__(self):
    super(Context_vector, self).__init__()

  def call(self, inputs):
    ############################################
    ########put your implementation here########
    ############################################
    context_vector = tf.reduce_mean(inputs, axis=1)
    # context_vector = tf.nn.softmax(context_vector)
    return context_vector

Implement another layer that makes a weighted average of encoder outputs as the context vector. The weights should be learned during training. 
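
One way to picture a learned weighted average is the NumPy sketch below. It is a simplified variant under stated assumptions: a hypothetical scoring vector `w` produces one scalar score per time step, and a softmax over the time axis turns the scores into weights. In a real layer `w` would be a trainable variable; here it is just a fixed array.

```python
import numpy as np

def weighted_context(enc_outputs, w):
    """Softmax-weighted average of enc_outputs over the time axis.

    enc_outputs: (batch, time, hidden); w: (hidden,) hypothetical scoring vector.
    """
    scores = enc_outputs @ w                              # (batch, time)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax over time
    context = (weights[..., None] * enc_outputs).sum(axis=1)  # (batch, hidden)
    return context, weights

rng = np.random.default_rng(0)
enc_outputs = rng.normal(size=(2, 3, 4))
context, weights = weighted_context(enc_outputs, rng.normal(size=(4,)))
print(context.shape, weights.sum(axis=1))
```

With `w` set to all zeros the weights become uniform and the result reduces to the plain average, which makes the relationship between the two context vectors explicit.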

In [18]:
class Weighted_Context_vector(tf.keras.layers.Layer):
  def __init__(self, units):
    ############################################
    ########put your implementation here########
    ############################################
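    super(Weighted_Context_vector, self).__init__()
    # learned scoring layer; `units` must match the encoder hidden size
    # so that the element-wise product in call() is well defined
    self.Weight = tf.keras.layers.Dense(units)

  def call(self, inputs):
    # score shape == (batch_size, max_length, units)
    score = self.Weight(inputs)

    # normalize the scores over the time axis
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    weighted_context_vector = attention_weights * inputs
    weighted_context_vector = tf.reduce_sum(weighted_context_vector, axis=1)

    return weighted_context_vector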

In [19]:
context_layer = Context_vector()
context_result = context_layer(sample_output)

print("Context layer result shape:", context_result.shape)


Context layer result shape: (64, 1024)


In [20]:
weighted_context_layer = Weighted_Context_vector(sample_output.shape[-1])
weighted_context_result = weighted_context_layer(sample_output)

print("Weighted Context layer result shape: ", weighted_context_result.shape)

Weighted Context layer result shape:  (64, 1024)


<font color="red"><b>Explain the advantages of the (weighted) context vector you have just implemented.

How could these architectures solve the vanishing gradient problem?</b></font>

<font color="red" size=3><div dir=ltr><b>The advantage of weighted context vectors is that they create an attention mechanism. That is, the model decides which words of the source sentence matter more when translating a word and which words do not matter at all.</b> </div></font>
<br/>
<font color="red" size=3><div dir=ltr><b>The vanishing gradient problem is addressed by using GRU recurrent layers. This layer uses a weighted averaging mechanism that decides which information from the past is important and should be remembered. The parameters of the layer used for this purpose are obtained while training the model. In a sense, this layer decides which information should be given importance and which should not, which is reminiscent of the attention mechanism.</b> </div></font>

Implement **two** different decoders, one using the context vector and one using the weighted context vector, then try both of them by building two different models.

In [21]:
class DecoderVN(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
    super(DecoderVN, self).__init__()
    self.batch_sz = batch_sz
    self.dec_units = dec_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.dec_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.fc = tf.keras.layers.Dense(vocab_size)

    ############################################
    ########put your implementation here########
    ############################################
    self.context_layer = Context_vector()

  def call(self, x, hidden, enc_output):
    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    ############################################
    ########put your implementation here########
    ############################################
    context_vector = self.context_layer(enc_output)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
   
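    # passing the concatenated vector to the GRU
    # NOTE: `hidden` is not passed as initial_state here, so the GRU
    # starts from a zero state at every decoding step
    output, state = self.gru(x)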

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc(output)

    return x, state

In [22]:
class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
    super(Decoder, self).__init__()
    self.batch_sz = batch_sz
    self.dec_units = dec_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.dec_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.fc = tf.keras.layers.Dense(vocab_size)

    ############################################
    ########put your implementation here########
    ############################################
    self.weighted_context_layer = Weighted_Context_vector(self.dec_units)

  def call(self, x, hidden, enc_output):
    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    ############################################
    ########put your implementation here########
    ############################################
    weighted_context_vector = self.weighted_context_layer(enc_output)
    x = tf.concat([tf.expand_dims(weighted_context_vector, 1), x], axis=-1)
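    # passing the concatenated vector to the GRU
    # NOTE: `hidden` is not passed as initial_state here, so the GRU
    # starts from a zero state at every decoding step
    output, state = self.gru(x)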

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc(output)

    return x, state

In [23]:
decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)

sample_decoder_output, _= decoder(tf.random.uniform((BATCH_SIZE, 1)),
                                      sample_hidden, sample_output)

print('Decoder output shape: ', sample_decoder_output.shape)

Decoder output shape:  (64, 3728)


## Define the optimizer and the loss function

In [24]:
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True,
                                                            reduction='none')


def loss_function(real, pred):
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)

  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask

  return tf.reduce_mean(loss_)
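
The masking logic in `loss_function` can be checked by hand with a small NumPy sketch. The helper name `masked_cross_entropy` and the toy inputs are assumptions made for illustration. As in the cell above, the mean divides by the full batch size, so padded positions lower the average rather than being excluded from it.

```python
import numpy as np

def masked_cross_entropy(targets, logits):
    """Mean cross-entropy from logits, with padding targets (id 0) zeroed out."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    per_example = -log_probs[np.arange(len(targets)), targets]
    mask = (targets != 0).astype(per_example.dtype)
    return (per_example * mask).mean()

targets = np.array([2, 0])   # the second entry is padding
logits = np.zeros((2, 4))    # uniform prediction over 4 tokens
loss = masked_cross_entropy(targets, logits)
print(loss)                  # log(4) / 2, since only one of two entries counts
```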

## Checkpoints (Object-based saving)

In [25]:
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)

## Training

1. Pass the *input* through the *encoder*, which returns the *encoder output* and the *encoder hidden state*.
2. The encoder output, the encoder hidden state, and the decoder input (which is the *start token*) are passed to the decoder.
3. The decoder returns the *predictions* and the *decoder hidden state*.
4. The decoder hidden state is then passed back into the model and the predictions are used to calculate the loss.
5. The final step is to calculate the gradients by backpropagation and apply them to the model variables with the optimizer.



<font color="red"><b>Explain the *teacher forcing* technique and tell us why it is necessary for training seq2seq models.

(It is used in training step function below)</b></font>

<font color="red" size=3><div dir=ltr><b>
Teacher forcing is a fast and efficient method for training recurrent neural network models that use the output of the previous step as input. It is a training method that plays a vital role in developing deep-learning-based language models used in various areas, including machine translation, text summarization, and image captioning.
<br/> Using the output as input in sequence prediction: there are sequence prediction models in which the output generated at the last time step, o_{t-1}, is used as the input to the model at the current time step (X_t). Such models are common among language models where the output of each step is a word, and this output is then used as the input of the next time step to generate the next word in the sequence. For example, such language models are used in encoder-decoder recurrent neural network architectures for sequence-to-sequence generation problems such as:
<br/>Machine Translation
<br/>Caption Generation
<br/>Text Summarization
<br/>After the model has been trained, the "start of sequence" token can be used to start the process and generate a word at the first time step. This output (the new word) is then used as the input for the second time step, its output as the input for the next time step, and so on.
<br/>
In sequence-to-sequence models, if we do not use teacher forcing, the model evaluates many irrelevant sequences when predicting the next sequence, which slows down learning and makes the model unstable.
 </b> </div></font>
<br/>
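
To make the input/target alignment under teacher forcing concrete, here is a small pure-Python sketch. The helper name `teacher_forced_pairs` is hypothetical; the ids follow the target-language mapping printed earlier (`<start>` = 1, `it` = 9, `s` = 10, `very` = 55, `<end>` = 2). At each step the decoder receives the ground-truth token from the previous position, regardless of what it predicted.

```python
def teacher_forced_pairs(target_ids):
    """Return (decoder_input, expected_output) pairs for one target sequence."""
    return list(zip(target_ids[:-1], target_ids[1:]))

target = [1, 9, 10, 55, 2]  # <start> it s very <end>
pairs = teacher_forced_pairs(target)
print(pairs)  # [(1, 9), (9, 10), (10, 55), (55, 2)]
```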

In [26]:
@tf.function
def train_step(inp, targ, enc_hidden):
  loss = 0

  with tf.GradientTape() as tape:
    enc_output, enc_hidden = encoder(inp, enc_hidden)

    dec_hidden = enc_hidden

    dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)

    # Teacher forcing - feeding the target as the next input
    for t in range(1, targ.shape[1]):
      # passing enc_output to the decoder
      predictions, dec_hidden = decoder(dec_input, dec_hidden, enc_output)
      
      loss += loss_function(targ[:, t], predictions)

      # using teacher forcing
      dec_input = tf.expand_dims(targ[:, t], 1)

  batch_loss = (loss / int(targ.shape[1]))

  variables = encoder.trainable_variables + decoder.trainable_variables

  gradients = tape.gradient(loss, variables)

  optimizer.apply_gradients(zip(gradients, variables))

  return batch_loss

In [27]:
EPOCHS = 10

for epoch in range(EPOCHS):
  start = time.time()

  enc_hidden = encoder.initialize_hidden_state()
  total_loss = 0

  for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
    batch_loss = train_step(inp, targ, enc_hidden)
    
    total_loss += batch_loss

    if batch % 100 == 0:
      print(f'Epoch {epoch+1} Batch {batch} Loss {batch_loss.numpy():.4f}')
  # saving (checkpoint) the model every 2 epochs
  if (epoch + 1) % 2 == 0:
    checkpoint.save(file_prefix=checkpoint_prefix)

  print(f'Epoch {epoch+1} Loss {total_loss/steps_per_epoch:.4f}')
  print(f'Time taken for 1 epoch {time.time()-start:.2f} sec\n')

Epoch 1 Batch 0 Loss 4.7033
Epoch 1 Batch 100 Loss 2.1907
Epoch 1 Batch 200 Loss 1.8123
Epoch 1 Loss 2.1210
Time taken for 1 epoch 29.02 sec

Epoch 2 Batch 0 Loss 1.6417
Epoch 2 Batch 100 Loss 1.4445
Epoch 2 Batch 200 Loss 1.4028
Epoch 2 Loss 1.5025
Time taken for 1 epoch 19.46 sec

Epoch 3 Batch 0 Loss 1.3228
Epoch 3 Batch 100 Loss 1.1745
Epoch 3 Batch 200 Loss 1.1431
Epoch 3 Loss 1.1753
Time taken for 1 epoch 19.08 sec

Epoch 4 Batch 0 Loss 0.8595
Epoch 4 Batch 100 Loss 0.9559
Epoch 4 Batch 200 Loss 0.8244
Epoch 4 Loss 0.8969
Time taken for 1 epoch 19.55 sec

Epoch 5 Batch 0 Loss 0.6129
Epoch 5 Batch 100 Loss 0.6306
Epoch 5 Batch 200 Loss 0.5939
Epoch 5 Loss 0.6465
Time taken for 1 epoch 19.21 sec

Epoch 6 Batch 0 Loss 0.4401
Epoch 6 Batch 100 Loss 0.4689
Epoch 6 Batch 200 Loss 0.4392
Epoch 6 Loss 0.4460
Time taken for 1 epoch 19.56 sec

Epoch 7 Batch 0 Loss 0.2723
Epoch 7 Batch 100 Loss 0.3810
Epoch 7 Batch 200 Loss 0.3477
Epoch 7 Loss 0.2991
Time taken for 1 epoch 19.25 sec

Epoch 

## Translate

* The evaluate function is similar to the training loop, except we don't use *teacher forcing* here. The input to the decoder at each time step is its previous predictions along with the hidden state and the encoder output.
* Stop predicting when the model predicts the *end token*.


Note: The encoder output is calculated only once for one input.
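
The decoding loop described above is greedy decoding. The sketch below shows its control flow in plain Python with a hypothetical stand-in for the decoder (`toy_step`), so it illustrates the feedback loop and the stopping rule rather than the actual model.

```python
def greedy_decode(step_fn, start_id, end_id, max_len):
    """Feed each prediction back in until end_id is produced or max_len is hit."""
    token, state, output = start_id, None, []
    for _ in range(max_len):
        scores, state = step_fn(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        output.append(token)
        if token == end_id:
            break
    return output

# Hypothetical toy "decoder": always prefers the id after the current one.
def toy_step(token, state):
    vocab_size = 5
    scores = [0.0] * vocab_size
    scores[(token + 1) % vocab_size] = 1.0
    return scores, state

decoded = greedy_decode(toy_step, start_id=1, end_id=4, max_len=10)
print(decoded)  # [2, 3, 4]
```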

In [28]:
def evaluate(sentence):
  sentence = preprocess_sentence(sentence)

  inputs = [inp_lang.word_index[i] for i in sentence.split(' ')]
  inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs],
                                                         maxlen=max_length_inp,
                                                         padding='post')
  inputs = tf.convert_to_tensor(inputs)

  result = ''

  hidden = [tf.zeros((1, units))]
  
  enc_out, enc_hidden = encoder(inputs, hidden)

  dec_hidden = enc_hidden
  dec_input = tf.expand_dims([targ_lang.word_index['<start>']], 0)

  for t in range(max_length_targ):
    predictions, dec_hidden = decoder(dec_input, dec_hidden, enc_out)


    predicted_id = tf.argmax(predictions[0]).numpy()

    result += targ_lang.index_word[predicted_id] + ' '

    if targ_lang.index_word[predicted_id] == '<end>':
      return result, sentence

    # the predicted ID is fed back into the model
    dec_input = tf.expand_dims([predicted_id], 0)

  return result, sentence

In [29]:
def translate(sentence):
  result, sentence = evaluate(sentence)

  print('Input:', sentence)
  print('Predicted translation:', result)


## Restore the latest checkpoint and test

In [30]:
# restoring the latest checkpoint in checkpoint_dir
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f3db2940850>

In [31]:
translate(u'hace mucho frio aqui.')

Input: <start> hace mucho frio aqui . <end>
Predicted translation: it s very cold here . <end> 


In [32]:
translate(u'esta es mi vida.')

Input: <start> esta es mi vida . <end>
Predicted translation: this is my life . <end> 


In [34]:
decoder_nv = DecoderVN(vocab_tar_size, embedding_dim, units, BATCH_SIZE)

sample_decoder_nv_output, _= decoder_nv(tf.random.uniform((BATCH_SIZE, 1)),
                                      sample_hidden, sample_output)

print('Decoder output shape: ', sample_decoder_nv_output.shape)

Decoder output shape:  (64, 3728)


## Define the optimizer and the loss function

In [35]:
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True,
                                                            reduction='none')


def loss_function(real, pred):
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)

  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask

  return tf.reduce_mean(loss_)

## Checkpoints (Object-based saving)

In [36]:
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder_nv)

## Training

1. Pass the *input* through the *encoder*, which returns the *encoder output* and the *encoder hidden state*.
2. The encoder output, the encoder hidden state, and the decoder input (which is the *start token*) are passed to the decoder.
3. The decoder returns the *predictions* and the *decoder hidden state*.
4. The decoder hidden state is then passed back into the model and the predictions are used to calculate the loss.
5. The final step is to calculate the gradients by backpropagation and apply them to the model variables with the optimizer.

In [37]:
@tf.function
def train_step(inp, targ, enc_hidden):
  loss = 0

  with tf.GradientTape() as tape:
    enc_output, enc_hidden = encoder(inp, enc_hidden)

    dec_hidden = enc_hidden

    dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)

    # Teacher forcing - feeding the target as the next input
    for t in range(1, targ.shape[1]):
      # passing enc_output to the decoder
      predictions, dec_hidden = decoder_nv(dec_input, dec_hidden, enc_output)
      
      loss += loss_function(targ[:, t], predictions)

      # using teacher forcing
      dec_input = tf.expand_dims(targ[:, t], 1)

  batch_loss = (loss / int(targ.shape[1]))

  variables = encoder.trainable_variables + decoder_nv.trainable_variables

  gradients = tape.gradient(loss, variables)

  optimizer.apply_gradients(zip(gradients, variables))

  return batch_loss

In [38]:
EPOCHS = 10

for epoch in range(EPOCHS):
  start = time.time()

  enc_hidden = encoder.initialize_hidden_state()
  total_loss = 0

  for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
    batch_loss = train_step(inp, targ, enc_hidden)
    
    total_loss += batch_loss

    if batch % 100 == 0:
      print(f'Epoch {epoch+1} Batch {batch} Loss {batch_loss.numpy():.4f}')
  # saving (checkpoint) the model every 2 epochs
  if (epoch + 1) % 2 == 0:
    checkpoint.save(file_prefix=checkpoint_prefix)

  print(f'Epoch {epoch+1} Loss {total_loss/steps_per_epoch:.4f}')
  print(f'Time taken for 1 epoch {time.time()-start:.2f} sec\n')

Epoch 1 Batch 0 Loss 4.7781
Epoch 1 Batch 100 Loss 1.4148
Epoch 1 Batch 200 Loss 1.0604
Epoch 1 Loss 1.5359
Time taken for 1 epoch 22.49 sec

Epoch 2 Batch 0 Loss 0.7608
Epoch 2 Batch 100 Loss 0.7716
Epoch 2 Batch 200 Loss 0.6361
Epoch 2 Loss 0.6641
Time taken for 1 epoch 13.84 sec

Epoch 3 Batch 0 Loss 0.3840
Epoch 3 Batch 100 Loss 0.3629
Epoch 3 Batch 200 Loss 0.3448
Epoch 3 Loss 0.3775
Time taken for 1 epoch 13.40 sec

Epoch 4 Batch 0 Loss 0.2758
Epoch 4 Batch 100 Loss 0.2223
Epoch 4 Batch 200 Loss 0.2483
Epoch 4 Loss 0.2380
Time taken for 1 epoch 13.82 sec

Epoch 5 Batch 0 Loss 0.1300
Epoch 5 Batch 100 Loss 0.1675
Epoch 5 Batch 200 Loss 0.2078
Epoch 5 Loss 0.1727
Time taken for 1 epoch 13.37 sec

Epoch 6 Batch 0 Loss 0.1739
Epoch 6 Batch 100 Loss 0.1351
Epoch 6 Batch 200 Loss 0.1607
Epoch 6 Loss 0.1333
Time taken for 1 epoch 13.79 sec

Epoch 7 Batch 0 Loss 0.0845
Epoch 7 Batch 100 Loss 0.0988
Epoch 7 Batch 200 Loss 0.0933
Epoch 7 Loss 0.1083
Time taken for 1 epoch 13.40 sec

Epoch 

## Translate

* The evaluate function is similar to the training loop, except we don't use *teacher forcing* here. The input to the decoder at each time step is its previous predictions along with the hidden state and the encoder output.
* Stop predicting when the model predicts the *end token*.


Note: The encoder output is calculated only once for one input.

In [39]:
def evaluate(sentence):
  sentence = preprocess_sentence(sentence)

  inputs = [inp_lang.word_index[i] for i in sentence.split(' ')]
  inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs],
                                                         maxlen=max_length_inp,
                                                         padding='post')
  inputs = tf.convert_to_tensor(inputs)

  result = ''

  hidden = [tf.zeros((1, units))]
  
  enc_out, enc_hidden = encoder(inputs, hidden)

  dec_hidden = enc_hidden
  dec_input = tf.expand_dims([targ_lang.word_index['<start>']], 0)

  for t in range(max_length_targ):
    predictions, dec_hidden = decoder_nv(dec_input, dec_hidden, enc_out)


    predicted_id = tf.argmax(predictions[0]).numpy()

    result += targ_lang.index_word[predicted_id] + ' '

    if targ_lang.index_word[predicted_id] == '<end>':
      return result, sentence

    # the predicted ID is fed back into the model
    dec_input = tf.expand_dims([predicted_id], 0)

  return result, sentence

In [40]:
def translate(sentence):
  result, sentence = evaluate(sentence)

  print('Input:', sentence)
  print('Predicted translation:', result)


## Restore the latest checkpoint and test

In [41]:
# restoring the latest checkpoint in checkpoint_dir
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f3db2f65fd0>

In [42]:
translate(u'hace mucho frio aqui.')

Input: <start> hace mucho frio aqui . <end>
Predicted translation: it s very cold . <end> 


In [43]:
translate(u'esta es mi vida.')

Input: <start> esta es mi vida . <end>
Predicted translation: this is my life . <end> 
