<a href="https://colab.research.google.com/github/aakhterov/ML_algorithms_from_scratch/blob/master/machine_translation_using_bahdanau_attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 0. Description

We're going to build an NN model to translate from Russian to English. This notebook is committed to the implementation of the encoder-decoder network with Bahdanau attention (Additive Attention) mechanism.

We will use the following terms:
- source language - the language from which the model translates
- target language - the language to which the model translates
- token = word


Dataset: https://www.kaggle.com/datasets/hijest/englishrussian-dictionary-for-machine-translate/

References:
- https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/21_Machine_Translation.ipynb
- https://www.youtube.com/watch?v=vI2Y3I-JI2Q
- https://medium.com/analytics-vidhya/encoder-decoder-seq2seq-models-clearly-explained-c34186fbf49b
- https://blog.floydhub.com/attention-mechanism/#bahdanau-att-step2
- https://towardsdatascience.com/implementing-neural-machine-translation-with-attention-using-tensorflow-fc9c6f26155f



In [2]:
from typing import List
import numpy as np
import pandas as pd
import tensorflow as tf
import seaborn as sns
import matplotlib.pyplot as plt
import pickle
from tqdm import tqdm
from string import punctuation
from collections import Counter
from tensorflow.keras.layers import Embedding, LSTM, Dense, Input
from tensorflow.keras.models import Model
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.utils import pad_sequences
from sklearn.model_selection import train_test_split

from google.colab import drive
drive.mount('/content/drive')
base_path = '/content/drive/MyDrive/Colab Notebooks/'

Mounted at /content/drive


# 1. Vectorization

In [3]:
UNKNOWN_TOKEN = '[UNK]' # Out of vocabulary token
START_TOKEN = '[START]' # The token that denotes the beginning of the target language phrase
END_TOKEN = '[END]' # The token that denotes the end of the target language phrase

In [4]:
class Vectorization:
  '''
    Vectorization text class.
    Main goals:
     - make a vocabulary
     - convert the list of strings to the list of integer tokens
     - convert the list of integer tokens to the list of strings
  '''

  def __init__(self,
               max_tokens,
               max_length=None,
               unknown_token=UNKNOWN_TOKEN,
               start_token=START_TOKEN,
               end_token=END_TOKEN
               ):
    '''
      :param max_tokens: length of the vocabulary
      :param max_length: max length of the phrases
      :param unknown_token: out of vocabulary token
      :param start_token: token that denotes the beginning of the phrase
      :param end_token: token that denotes the end of the phrase
    '''
    self.max_tokens = max_tokens
    self.max_length = max_length
    self.unknown_token = unknown_token
    self.start_token = start_token
    self.end_token=end_token
    # add to the vocabulary:
    #  (1) padding token (we're going to pad using 0, so padding token index is 0)
    #  (2) out of vocabulary token
    #  (3) start token
    #  (4) end token
    self.vocabulary = ['', self.unknown_token, self.start_token, self.end_token]

  def __preprocessing(self, input: str) -> str:
    '''
      Preprocess of the string (convert to lowcase and remove punctuation).
      ex.: I'm going! -> i m going
      :param input - input string
      :return preprocessed string
    '''
    output = ''.join(map(lambda ch: ch if ch not in punctuation else ' ', input.lower())).strip()
    return output

  def token_to_text(self, tokens: List) -> str:
    '''
      Convert the list of the integer tokens to the string
      :param tokens: list of the integer tokens
      :return string contains words that correspond to the integer tokens
    '''
    words = [self.vocabulary[token] for token in tokens]
    return " ".join(words)

  def fit(self, X: List):
    '''
      Make the vocabulary and calculate the max length of the phrase
      :param X: corpus - list of the strings
      :return the instance of the current class
    '''
    lens = []
    for x in X:
      # Make preprocessing and get the list of the words.
      # Ex. I'm going! -> ['i', 'm', 'going']
      words = self.__preprocessing(x).split()

      # Collect phrases lengths
      lens.append(len(words))

      # Make the vocabulary
      for word in words:
        token = word.strip()
        # Add the word to the vocabulary if it usn't "full"
        if token not in self.vocabulary and self.max_tokens is not None and len(self.vocabulary)<self.max_tokens:
          self.vocabulary.append(token)

    # Calculate the max length of the phrases if it isn't set in the __init__
    # max_length = Average length + two standard devations
    lens = np.array(lens)
    if self.max_length is None:
      self.max_length = int(np.mean(lens) + 2 * np.std(lens))
    return self

  def predict(self,
              X: List[str],
              is_padding=True,
              is_add_start_token=False,
              is_add_end_token=False
              ) -> List[List]:
    '''
      :param X - corpus - list of the strings
      :param is_padding - whether to pad the list of tokens to the max. length with 0s
      :param is_add_start_token - whether to add the start_token to the list of tokens
      :param is_add_end_token - whether to add the end_token to the list of tokens
      :return list of the lists of tokens
    '''
    output = []
    self.max_length += int(is_add_start_token) + int(is_add_end_token)

    for x in X:
      # If nedded add the index of the start_token to the beginning of the list of tokens
      vector = [self.vocabulary.index(self.start_token)] if is_add_start_token else []

      # Make preprocessing and get the list of the words.
      words = self.__preprocessing(x).split()

      # If the current word is in the vocabulary add its index to the list else add the index of the unknown_token
      for word in words:
        token = word.strip()
        vector.append(self.vocabulary.index(token) if token in self.vocabulary else self.vocabulary.index(self.unknown_token))

      # Truncate the vector to the max. length
      vector = vector[:self.max_length-1]

      # If needed add the index of the end_token
      if is_add_end_token:
        vector.append(self.vocabulary.index(self.end_token))

      output.append(vector)

    # If needed pad the vector to the max. length with 0s
    return pad_sequences(output,
                         maxlen=self.max_length,
                         padding='post',
                         truncating='post') if is_padding else output

In [51]:
# Read from N to M samples
N = 199_000
M = 200_000

input_phrases, output_phrases = [], []
with open('/content/drive/MyDrive/Colab Notebooks/Data/rus.txt') as f:
  for line in f.readlines()[N:M]:
    eng, rus = line.split('CC-BY')[0].strip().split('\t')
    input_phrases.append(rus)
    output_phrases.append(eng)

In [52]:
input_vocab = 2000 # size of the source language vocaulary
output_vocab = 2000 # size of the target language vocaulary

In [53]:
# Make vectorization of the source language phrases
encoder_vec = Vectorization(max_tokens=input_vocab)
encoder_vec.fit(input_phrases)
X_encoder = encoder_vec.predict(input_phrases, is_padding=False)

# Make vectorization of the target language phrases
decoder_vec = Vectorization(max_tokens=output_vocab)
decoder_vec.fit(output_phrases)
# For the reason of the sequence model training we need decoder input contains the start_token and
# the decoder output which is without the start_token
X_decoder = decoder_vec.predict(output_phrases, is_add_start_token=True, is_add_end_token=True)
Y_decoder = decoder_vec.predict(output_phrases, is_add_end_token=True)

In [54]:
idx = 100
print(f"Index: {idx}")
print("======= Encoder =======")
print(f"Input phrase: {input_phrases[idx]}")
print(f"Vector: {X_encoder[idx]}")
print(f"Max. length: {encoder_vec.max_length}")
print("======= Decoder =======")
print(f"Input phrase: {output_phrases[idx]}")
print(f"Vector: {X_decoder[idx]}")
print(f"Output phrase: {output_phrases[idx]}")
print(f"Vector: {Y_decoder[idx]}")
print(f"Max. length: {decoder_vec.max_length}")
print("==============")
print(f"Start phrase token index: {decoder_vec.vocabulary.index(START_TOKEN)}")
print(f"End phrase token index: {decoder_vec.vocabulary.index(END_TOKEN)}")

Index: 100
Input phrase: Том пошёл к гадалке.
Vector: [8, 197, 148, 220]
Max. length: 7
Input phrase: Tom went to a fortune teller.
Vector: [  2   4 168  20  18 183 184   3   0]
Output phrase: Tom went to a fortune teller.
Vector: [  4 168  20  18 183 184   3   0   0   0]
Max. length: 10
Start phrase token index: 2
End phrase token index: 3


# 2. Construct Encoder-Decoder NN with Bahdanau attention

In [9]:
class Encoder(tf.keras.Model):
  '''
    Encoder for using with Bahdanau attention
  '''
  def __init__(self, input_vocab: int, embedding_dim: int, lstm_hidden_units: int):
    '''
      :param input_vocab - vocabluary dimension of the source language
      :param embedding_dim - dimension of the source language words embeddings
      :param lstm_hidden_units - the number of the LSTM cell units
    '''
    super(Encoder, self).__init__()
    self.lstm_hidden_units = lstm_hidden_units
    self.emedding = Embedding(input_dim=input_vocab,
                              output_dim=embedding_dim,
                              mask_zero=True,
                              name='encoder_embedding')
    self.lstm = LSTM(units=lstm_hidden_units,
                     return_sequences=True,
                     return_state=True,
                     name='encoder_lstm')

  def __call__(self, x):
    '''
      Calculate forward propagation through the Encoder
      :param x - input sequence (batch_size, sequence_length)
    '''
    # Get embeddings.
    # 'x' dimension is (batch_size, sequence_length)
    # 'out' dimension is (batch_size, sequence_length, embedding_dim)
    out = self.emedding(x)

    # Hence we don't need LSTM output, we get only LSTM states (hidden state and cell state)
    # One of the problems here is that dispite the return_sequences parameter is True,
    # we get only last (after propagation a whole sequence) values of the states and
    # didn't get the states after each timestep. We will struggle with this later.
    _, h, c = self.lstm(out)
    return h, c

In [10]:
class BahdanauAttention(tf.keras.layers.Layer):
  '''
    Layers implements a Bahdanau attention mechanism
  '''
  def __init__(self, units: int, name=None):
    '''
      :param units - the number of the encoder and decoder hidden units.
      This value can be obtained from the inputs dimensions but for the purpose of simplicity we will set it here.
      :param name - tne name of the layer
    '''
    super(BahdanauAttention, self).__init__(name=name)
    self.units = units
    self.fc_encoder_states = Dense(units=units, activation='linear')
    self.fc_decoder_states = Dense(units=units, activation='linear')
    self.fc_combined = Dense(units=1, activation='linear')

  def __call__(self, encoder_states, decoder_hidden_state):
    '''
      Calculate forward propagation through the Layer
      :param encoder_states - encoder hidden states (batch_size, sequence_length, encoder_lstm_hidden_units)
      :param decoder_hidden_state - decoder hidden state (batch_size, decoder_lstm_hidden_units)
    '''

    # Linear layer for the encoder hidden states (it has its own trainable weights).
    fc_encoder_out = self.fc_encoder_states(encoder_states) # fc_encoder_out dimension is (batch_size, sequence_length, encoder_lstm_hidden_units)

    # Linear layer for the decoder hidden state from the previous timestep (it has its own trainable weights).
    fc_decoder_out = self.fc_decoder_states(decoder_hidden_state) # fc_decoder_out dimension is (batch_size, decoder_lstm_hidden_units)

    # Add additional dimension to the fc_decoder_out
    fc_decoder_out = tf.expand_dims(fc_decoder_out, axis=1) # fc_decoder_out dimension is (batch_size, 1, decoder_lstm_hidden_units)

    # Calculate alignment score using linear layer  (it has its own trainable weights).
    # Alignment_scores dimension is (batch_size, sequence_length)
    alignment_scores = self.fc_combined(tf.math.tanh(fc_encoder_out + fc_decoder_out))

    # Calculate attention weights of the each encoder hidden state within a batch
    # softmax_alignment_scores dimension is (batch_size, sequence_length)
    softmax_alignment_scores = tf.nn.softmax(alignment_scores)

    # Calculate context vector. Its dimension is (batch_size, encoder_lstm_hidden_units)
    context_vector = tf.reduce_sum(softmax_alignment_scores * encoder_states, axis=1)
    return context_vector, softmax_alignment_scores

In [11]:
class Decoder(tf.keras.Model):
  '''
    Decoder with Bahdanau attention mechanism
  '''
  def __init__(self, output_vocab, embedding_dim, lstm_hidden_units):
    '''
      :param output_vocab - vocabluary dimension of the target language
      :param embedding_dim - dimension of the target language words embeddings
      :param lstm_hidden_units - the number of the LSTM cell units
    '''
    super(Decoder, self).__init__()
    self.emedding = Embedding(input_dim=output_vocab,
                              output_dim=embedding_dim,
                              mask_zero=True,
                              name='decoder_embedding')

    self.lstm = LSTM(units=lstm_hidden_units,
                     return_sequences=True,
                     return_state=True,
                     name='decoder_lstm')

    self.attention = BahdanauAttention(units=lstm_hidden_units,
                                       name='decoder_attention')

    # Dense layer with softmax activation function
    self.output_dense = Dense(units=output_vocab,
                              activation='softmax',
                              name='decoder_output')

  def __call__(self, x, decoder_states, encoder_states):
    '''
      Calculate forward propagation through the Decoder
      : param x - input sequence (batch_size, sequence_length_of_target_lang)
      :param decoder_states - hidden and cell decoder states from the previous timestep (or last encoder states for the first timestep)
      Dimension ((batch_size, lstm_hidden_units), (batch_size, lstm_hidden_units))
      :param encoder_states - hidden encoder states from the each timesteps  (batch_size, sequence_length, lstm_hidden_units)
    '''
    # Unpack decoder states
    hidden_state, cell_state = decoder_states

    # Calculate contect vector based on Bahdanau attention mechanism
    # context_vector dimension is (batch_size, lstm_hidden_units)
    context_vector, _ = self.attention(encoder_states, hidden_state)

    # Get target language embedding
    out = self.emedding(x)

    # Concatenate context_vector with the next embedded token
    input = tf.expand_dims(tf.concat([context_vector, out], axis=-1), 1)

    # Get LSTM outputs
    out, h, c = self.lstm(input, initial_state=decoder_states)

    # Propagate LSTM output through dense layer with softmax activation function
    out = self.output_dense(out)
    return out, h, c

In [58]:
class Seq2SeqBahdanauAttention(tf.keras.Model):
  '''
  Encoder-Decoder network implements the Bahdanau attention mechanism
  '''
  def __init__(self,
               input_vocab,
               output_vocab,
               encoder_embd_dim,
               decoder_embd_dim,
               encoder_lstm_units,
               decoder_lstm_units,
               max_output_length,
               start_token_index,
               end_token_index):
    '''
      :param input_vocab - vocabluary dimension of the source language
      :param output_vocab - vocabluary dimension of the target language
      :param encoder_embd_dim - dimension of the source language words embeddings
      :param decoder_embd_dim - dimension of the target language words embeddings
      :param encoder_lstm_units - the number of the LSTM cell units
      :param decoder_lstm_units - the number of the LSTM cell units
      :param max_output_length - the maximum length of the output sequence
      :param start_token_index - index of the start token in the output vocabulary
      :param end_token_index - index of the end token in the output vocabulary

    '''
    super(Seq2SeqBahdanauAttention, self).__init__()
    self.encoder = Encoder(input_vocab=input_vocab,
                           embedding_dim=encoder_embd_dim,
                           lstm_hidden_units=encoder_lstm_units)
    self.decoder = Decoder(output_vocab=output_vocab,
                           embedding_dim=decoder_embd_dim,
                           lstm_hidden_units=decoder_lstm_units)
    self.max_output_length = max_output_length
    self.start_token_index = start_token_index
    self.end_token_index = end_token_index
    self.__loss_object = tf.keras.losses.SparseCategoricalCrossentropy(reduction='none')


  def __loss_function(self, true, pred):
    '''
    '''
    mask = tf.math.logical_not(tf.math.equal(true, 0))
    loss_ = self.__loss_object(true, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_mean(loss_)

  def __forward(self, X_encoder, X_decoder=None, Y_decoder=None):
    '''
      Forward propagation
      :param X_encoder - input encoder sequence (source language) (batch_size, sequence_length_in_the_source_lang)
      :param X_decoder - input decoder sequence (target language) (batch_size, sequence_length_in_the_target_lang + 2)
      (ex. [start_token_index, word1_index, woprd2_index, end_token_index)
      :param Y_decoder - output decoder sequence (target language) (batch_size, sequence_length_in_the_target_lang + 1)
      (ex. [word1_index, woprd2_index, end_token_index)

      We use X_decoder=None and Y_decoder=None for the purpose of prediction
      We use X_decoder=None and Y_decoder is not None for the purpose of validation during training
    '''
    output = []
    batch_size = X_encoder.shape[0]
    encoder_states = []
    loss = 0
    accuracy = np.array([])

    # Here we deal with the mentioned earlier problem of getting encoder hidden states on each timestep.
    # As we mentoined before LSTM parameter 'return_sequences' doesn't affect on the hidden and cell states, i.e.
    # we can't get LSTM states after every timestep. Therefore we need do it manually. It means the following:
    # 1) we take the first token of the input sequence, propagate it through the encoder and save the LSTM states.
    # 2) we take the two first tokens of the input sequence, propagate them through the encoder and save the last LSTM states.
    # 3) We repeat step 2 adding the next token and save the last LSTM states.

    for t in range(X_encoder.shape[1]):
      h, c = self.encoder(X_encoder[:, :t+1])
      encoder_states.append(h)
    encoder_states = tf.stack(encoder_states, axis=1) # make tensor from the list

    # save the last encoder hidden and cell states, since they are the initial decoder states
    hidden_state = encoder_states[:, -1, :]
    cell_state = c

    if X_decoder is not None and Y_decoder is not None: # if we train network
      # for every timestep (i.e. every token) of the target language sequence
      for t in range(X_decoder.shape[1]):
        # Set the decoder_input to the t-th token of the decoder input sequence.
        # We use here the teacher forcing method for faster and efficient decoder training.
        # The method uses the ground true as the decoder input instead of the prediction
        # from the previous timestep.
        decoder_input = X_decoder[:, t]
        # print(decoder_input.shape)

        # Calculate decoder output and states. We
        out, hidden_state, cell_state = self.decoder(x=decoder_input,
                                                     decoder_states=(hidden_state, cell_state),
                                                     encoder_states=encoder_states)
        # Collect output token
        output.append(out)
        loss += self.__loss_function(Y_decoder[:, t], out) # Calculate loss function
        # Calculate accuracy
        accuracy = np.hstack((accuracy, tf.keras.metrics.sparse_categorical_accuracy(Y_decoder[:, t], np.squeeze(out))))
    else: # if we validate (calculate loss and accuracy on the test set) the network or make prediction
        current_step = 0 # current timestep

        # Set the first decoder input to the start token index. decoder_input dimension is (batch_size, )
        decoder_input = np.full((batch_size, ), self.start_token_index)

        while current_step<self.max_output_length:
          out, hidden_state, cell_state = self.decoder(x=decoder_input,
                                                       decoder_states=(hidden_state, cell_state),
                                                       encoder_states=encoder_states)
          if Y_decoder is not None: # if we validate the network
            true, pred = Y_decoder[:, current_step], np.squeeze(out)
            loss += self.__loss_function(true, pred)
            accuracy = np.hstack((accuracy, tf.keras.metrics.sparse_categorical_accuracy(true, pred)))

          tokens = np.argmax(np.squeeze(out), axis=1)
          decoder_input = tokens
          output.append(out)
          current_step += 1

    batch_loss = tf.reduce_sum(loss) / Y_decoder.shape[0] if Y_decoder is not None else None
    return output, loss, batch_loss, accuracy

  def __train_step(self, X_encoder, X_decoder, Y_decoder, learning_rate):
    '''
    '''
    optimizer = tf.keras.optimizers.RMSprop(learning_rate=learning_rate)
    with tf.GradientTape() as tape:
      _, loss, batch_loss, accuracy = self.__forward(X_encoder, X_decoder, Y_decoder)
      variables = self.encoder.trainable_variables + self.decoder.trainable_variables
      gradients = tape.gradient(loss, variables)
      optimizer.apply_gradients(zip(gradients, variables))
    return batch_loss, accuracy

  def __align_to_length(self, X, batch_size=64):
    '''
    '''
    c = Counter([len(x[0]) for x in X])
    X_new = []
    for length, count in c.most_common():
      if count >= batch_size:
        batches_count = count//batch_size
        X_new += list(filter(lambda x: len(x[0])==length, X))[:batches_count*batch_size]
    return X_new

  def __generate_batches(self, X, Y, batch_size=64):
    '''
    '''
    X_encoder, X_decoder = zip(*X)
    ds = list(zip(X_encoder, X_decoder, Y))
    X_new = self.__align_to_length(ds, batch_size)
    X_encoder, X_decoder, Y_decoder = zip(*X_new)
    X_encoder, X_decoder, Y_decoder = list(X_encoder), list(X_decoder), list(Y_decoder)
    batches_count = len(X_encoder) // batch_size
    for i in range(batches_count):
      lower_idx, upper_idx = i*batch_size, (i+1)*batch_size
      yield np.array(X_encoder[lower_idx:upper_idx]), \
            np.array(X_decoder[lower_idx:upper_idx]), \
            np.array(Y_decoder[lower_idx:upper_idx])

  def fit(self, X_encoder, X_decoder, Y_decoder, epoch=20, batch_size=64, train_size=0.8, learning_rate=0.001):
    '''
      :param learning_rate - optimizer learning rate
    '''
    X_train, X_test, Y_decoder_train, Y_decoder_test = train_test_split(list(zip(X_encoder, X_decoder)),
                                                                        Y_decoder,
                                                                        train_size=train_size)
    # X_encoder_train, X_decoder_train = zip(*X_train)
    # X = list(zip(X_encoder_train, X_decoder_train, Y_decoder_train))
    # X_new = self.align_to_length(X, batch_size)
    # X_encoder_train, X_decoder_train, Y_decoder_train = zip(*X_new)

    # train_ds = tf.data.Dataset.from_tensor_slices((X_encoder_train, X_decoder_train, Y_decoder_train)).batch(batch_size)
    # train_ds_size = train_ds.cardinality()

    # X_encoder_test, X_decoder_test = zip(*X_test)
    # test_ds = tf.data.Dataset.from_tensor_slices((X_encoder_test, X_decoder_test, Y_decoder_test))

    # M = len(X_encoder)
    # train_sample_size = int(train_size*M)
    # full_dataset = tf.data.Dataset.from_tensor_slices((X_encoder, X_decoder, Y_decoder))
    # train_ds = full_dataset.take(train_sample_size)
    # train_ds_size = train_ds.cardinality()
    # train_ds = train_ds.batch(batch_size)
    # test_ds = full_dataset.skip(train_sample_size)

    X_new_train = self.__align_to_length(X_train, batch_size)
    X_new_test = self.__align_to_length(X_test, batch_size)
    total_train_batches = len(X_new_train) // batch_size
    total_test_batches = len(X_new_test) // batch_size
    train_ds_size = len(X_new_train)
    test_ds_size = len(X_new_test)

    history = {"train_loss": [], "train_accuracy": [], "test_loss": [], "test_accuracy": []}

    print(f"Train dataset: {total_train_batches} batches, {train_ds_size} samples")
    print(f"Test dataset: {total_test_batches} batches, {test_ds_size} samples")
    print(f"{'='*10}")
    for ep in range(1, epoch+1):
      print(f"Epoch {ep}/{epoch}")
      total_loss = 0
      accuracy = np.array([])
      for batch, (X_batch_encoder,
                  X_batch_decoder,
                  Y_batch_decoder) in tqdm(enumerate(self.__generate_batches(X_train,
                                                                             Y_decoder_train,
                                                                             batch_size=batch_size)),
                                           desc=f"Train dataset"):
        batch_loss, batch_accuracy = self.__train_step(X_batch_encoder, X_batch_decoder, Y_batch_decoder, learning_rate)

        total_loss += batch_loss
        accuracy = np.hstack((accuracy, batch_accuracy))

        # if batch%1000==0:
        #   print(f"Loss: {total_loss.numpy()/batch_size}. Accuracy: {np.mean(accuracy)}")

      total_loss /= batch_size
      history["train_loss"].append(total_loss.numpy())
      history["train_accuracy"].append(np.mean(accuracy))
      print(f"Loss on train: {total_loss.numpy():.4f} Accuracy on train: {np.mean(accuracy):.4f}")

      total_loss = 0
      accuracy = np.array([])
      for batch, (X_batch_encoder,
                  _,
                  Y_batch_decoder) in tqdm(enumerate(self.__generate_batches(X_test,
                                                                             Y_decoder_test,
                                                                             batch_size=batch_size)),
                                           desc=f"Test dataset"):
        _, _, batch_loss, batch_accuracy = self.__forward(X_batch_encoder,
                                                          None,
                                                          Y_batch_decoder)
        total_loss += batch_loss
        accuracy = np.hstack((accuracy, batch_accuracy))

      total_loss /= batch_size
      history["test_loss"].append(total_loss.numpy())
      history["test_accuracy"].append(np.mean(accuracy))
      print(f"Loss on test: {total_loss.numpy():.4f} Accuracy on test: {np.mean(accuracy):.4f}")
    return history

  def predict(self, X_encoder):
    '''
    '''
    out, _, _, _ = self.__forward(X_encoder)
    return out

In [55]:
embedding_dim = 64
lstm_hidden_units = 64

In [59]:
model = Seq2SeqBahdanauAttention(input_vocab=input_vocab,
                                 output_vocab=output_vocab,
                                 encoder_embd_dim=embedding_dim,
                                 decoder_embd_dim=embedding_dim,
                                 encoder_lstm_units=lstm_hidden_units,
                                 decoder_lstm_units=lstm_hidden_units,
                                 max_output_length=decoder_vec.max_length,
                                 start_token_index=decoder_vec.vocabulary.index(START_TOKEN),
                                 end_token_index=decoder_vec.vocabulary.index(END_TOKEN))

In [60]:
history = model.fit(X_encoder, X_decoder, Y_decoder, epoch=20, batch_size=32, train_size=0.8, learning_rate=0.01)

Train dataset: 23 batches, 736 samples
Test dataset: 5 batches, 160 samples
Epoch 1/20


Train dataset: 23it [00:13,  1.75it/s]


Loss on train: 0.9193 Accuracy on train: 0.1940


Test dataset: 0it [00:00, ?it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)


Test dataset: 1it [00:00,  1.98it/s]

(28,) (28, 2000)
(17,) (17, 2000)
(2,) (2, 2000)
(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)


Test dataset: 2it [00:00,  2.04it/s]

(17,) (17, 2000)
(2,) (2, 2000)
(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)


Test dataset: 3it [00:01,  2.12it/s]

(28,) (28, 2000)
(14,) (14, 2000)
(1,) (1, 2000)
(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(29,) (29, 2000)


Test dataset: 4it [00:01,  2.16it/s]

(10,) (10, 2000)
(0,) (0, 2000)
(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(29,) (29, 2000)
(19,) (19, 2000)
(4,) (4, 2000)


Test dataset: 5it [00:02,  2.16it/s]


(0,) (0, 2000)
(0,) (0, 2000)
Loss on test: nan Accuracy on test: 0.0905
Epoch 2/20


Train dataset: 23it [00:12,  1.91it/s]


Loss on train: 0.7189 Accuracy on train: 0.3978


Test dataset: 0it [00:00, ?it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(28,) (28, 2000)
(17,) (17, 2000)
(2,) (2, 2000)


Test dataset: 1it [00:00,  2.28it/s]

(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(17,) (17, 2000)
(2,) (2, 2000)


Test dataset: 2it [00:00,  2.32it/s]

(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(28,) (28, 2000)
(14,) (14, 2000)
(1,) (1, 2000)


Test dataset: 3it [00:01,  2.35it/s]

(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(29,) (29, 2000)
(10,) (10, 2000)
(0,) (0, 2000)


Test dataset: 4it [00:01,  2.45it/s]

(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(29,) (29, 2000)
(19,) (19, 2000)


Test dataset: 5it [00:02,  2.44it/s]


(4,) (4, 2000)
(0,) (0, 2000)
(0,) (0, 2000)
Loss on test: nan Accuracy on test: 0.1974
Epoch 3/20


Train dataset: 23it [00:13,  1.68it/s]


Loss on train: 0.6282 Accuracy on train: 0.4701


Test dataset: 0it [00:00, ?it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)


Test dataset: 1it [00:00,  3.34it/s]

(32,) (32, 2000)
(28,) (28, 2000)
(17,) (17, 2000)
(2,) (2, 2000)
(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(17,) (17, 2000)
(2,) (2, 2000)


Test dataset: 2it [00:00,  3.01it/s]

(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)


Test dataset: 3it [00:01,  2.77it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(28,) (28, 2000)
(14,) (14, 2000)
(1,) (1, 2000)
(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(29,) (29, 2000)
(10,) (10, 2000)


Test dataset: 4it [00:01,  2.71it/s]

(0,) (0, 2000)
(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)


Test dataset: 5it [00:01,  2.79it/s]


(32,) (32, 2000)
(32,) (32, 2000)
(29,) (29, 2000)
(19,) (19, 2000)
(4,) (4, 2000)
(0,) (0, 2000)
(0,) (0, 2000)
Loss on test: nan Accuracy on test: 0.1750
Epoch 4/20


Train dataset: 23it [00:14,  1.59it/s]


Loss on train: 0.5799 Accuracy on train: 0.5038


Test dataset: 0it [00:00, ?it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(28,) (28, 2000)
(17,) (17, 2000)
(2,) (2, 2000)
(0,) (0, 2000)


Test dataset: 2it [00:00,  3.21it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(17,) (17, 2000)
(2,) (2, 2000)
(0,) (0, 2000)


Test dataset: 3it [00:00,  3.36it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(28,) (28, 2000)
(14,) (14, 2000)
(1,) (1, 2000)
(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(29,) (29, 2000)
(10,) (10, 2000)
(0,) (0, 2000)
(0,) (0, 2000)


Test dataset: 5it [00:01,  3.38it/s]


(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(29,) (29, 2000)
(19,) (19, 2000)
(4,) (4, 2000)
(0,) (0, 2000)
(0,) (0, 2000)
Loss on test: nan Accuracy on test: 0.2414
Epoch 5/20


Train dataset: 23it [00:14,  1.62it/s]


Loss on train: 0.5314 Accuracy on train: 0.5447


Test dataset: 0it [00:00, ?it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)


Test dataset: 1it [00:00,  3.26it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(28,) (28, 2000)
(17,) (17, 2000)
(2,) (2, 2000)
(0,) (0, 2000)


Test dataset: 2it [00:00,  3.19it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(17,) (17, 2000)
(2,) (2, 2000)
(0,) (0, 2000)


Test dataset: 3it [00:00,  3.37it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(28,) (28, 2000)
(14,) (14, 2000)
(1,) (1, 2000)
(0,) (0, 2000)


Test dataset: 4it [00:01,  3.40it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(29,) (29, 2000)
(10,) (10, 2000)
(0,) (0, 2000)
(0,) (0, 2000)


Test dataset: 5it [00:01,  3.41it/s]


(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(29,) (29, 2000)
(19,) (19, 2000)
(4,) (4, 2000)
(0,) (0, 2000)
(0,) (0, 2000)
Loss on test: nan Accuracy on test: 0.2284
Epoch 6/20


Train dataset: 23it [00:14,  1.59it/s]


Loss on train: 0.4903 Accuracy on train: 0.5759


Test dataset: 0it [00:00, ?it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)


Test dataset: 1it [00:00,  3.30it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(28,) (28, 2000)
(17,) (17, 2000)
(2,) (2, 2000)
(0,) (0, 2000)


Test dataset: 2it [00:00,  3.15it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(17,) (17, 2000)
(2,) (2, 2000)
(0,) (0, 2000)


Test dataset: 3it [00:00,  3.31it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(28,) (28, 2000)
(14,) (14, 2000)
(1,) (1, 2000)
(0,) (0, 2000)


Test dataset: 4it [00:01,  3.40it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(29,) (29, 2000)
(10,) (10, 2000)
(0,) (0, 2000)
(0,) (0, 2000)


Test dataset: 5it [00:01,  3.40it/s]


(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(29,) (29, 2000)
(19,) (19, 2000)
(4,) (4, 2000)
(0,) (0, 2000)
(0,) (0, 2000)
Loss on test: nan Accuracy on test: 0.2603
Epoch 7/20


Train dataset: 23it [00:14,  1.62it/s]


Loss on train: 0.4554 Accuracy on train: 0.6030


Test dataset: 0it [00:00, ?it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)


Test dataset: 1it [00:00,  3.28it/s]

(32,) (32, 2000)
(28,) (28, 2000)
(17,) (17, 2000)
(2,) (2, 2000)
(0,) (0, 2000)
(32,) (32, 2000)


Test dataset: 2it [00:00,  3.23it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(17,) (17, 2000)
(2,) (2, 2000)
(0,) (0, 2000)


Test dataset: 3it [00:00,  3.32it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(28,) (28, 2000)
(14,) (14, 2000)
(1,) (1, 2000)
(0,) (0, 2000)


Test dataset: 4it [00:01,  3.42it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(29,) (29, 2000)
(10,) (10, 2000)
(0,) (0, 2000)
(0,) (0, 2000)


Test dataset: 5it [00:01,  3.43it/s]


(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(29,) (29, 2000)
(19,) (19, 2000)
(4,) (4, 2000)
(0,) (0, 2000)
(0,) (0, 2000)
Loss on test: nan Accuracy on test: 0.2569
Epoch 8/20


Train dataset: 23it [00:13,  1.65it/s]


Loss on train: 0.4198 Accuracy on train: 0.6327


Test dataset: 0it [00:00, ?it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)


Test dataset: 1it [00:00,  1.96it/s]

(28,) (28, 2000)
(17,) (17, 2000)
(2,) (2, 2000)
(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)


Test dataset: 2it [00:01,  1.99it/s]

(17,) (17, 2000)
(2,) (2, 2000)
(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(28,) (28, 2000)


Test dataset: 3it [00:01,  2.10it/s]

(14,) (14, 2000)
(1,) (1, 2000)
(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)


Test dataset: 4it [00:01,  2.47it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(29,) (29, 2000)
(10,) (10, 2000)
(0,) (0, 2000)
(0,) (0, 2000)
(32,) (32, 2000)


Test dataset: 5it [00:02,  2.49it/s]


(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(29,) (29, 2000)
(19,) (19, 2000)
(4,) (4, 2000)
(0,) (0, 2000)
(0,) (0, 2000)
Loss on test: nan Accuracy on test: 0.2853
Epoch 9/20


Train dataset: 23it [00:14,  1.56it/s]


Loss on train: 0.3866 Accuracy on train: 0.6587


Test dataset: 0it [00:00, ?it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(28,) (28, 2000)


Test dataset: 1it [00:00,  1.97it/s]

(17,) (17, 2000)
(2,) (2, 2000)
(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)


Test dataset: 2it [00:01,  1.97it/s]

(17,) (17, 2000)
(2,) (2, 2000)
(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(28,) (28, 2000)


Test dataset: 3it [00:01,  2.05it/s]

(14,) (14, 2000)
(1,) (1, 2000)
(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)


Test dataset: 4it [00:01,  2.01it/s]

(29,) (29, 2000)
(10,) (10, 2000)
(0,) (0, 2000)
(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)


Test dataset: 5it [00:02,  1.94it/s]


(29,) (29, 2000)
(19,) (19, 2000)
(4,) (4, 2000)
(0,) (0, 2000)
(0,) (0, 2000)
Loss on test: nan Accuracy on test: 0.2716
Epoch 10/20


Train dataset: 23it [00:15,  1.48it/s]


Loss on train: 0.3619 Accuracy on train: 0.6807


Test dataset: 0it [00:00, ?it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(28,) (28, 2000)


Test dataset: 1it [00:00,  1.99it/s]

(17,) (17, 2000)
(2,) (2, 2000)
(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)


Test dataset: 2it [00:01,  1.96it/s]

(32,) (32, 2000)
(17,) (17, 2000)
(2,) (2, 2000)
(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(28,)

Test dataset: 3it [00:01,  2.02it/s]

 (28, 2000)
(14,) (14, 2000)
(1,) (1, 2000)
(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(29,) (29, 2000)


Test dataset: 4it [00:01,  2.08it/s]

(10,) (10, 2000)
(0,) (0, 2000)
(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(29,) (29, 2000)
(19,) (19, 2000)
(4,) (4, 2000)

Test dataset: 5it [00:02,  2.09it/s]



(0,) (0, 2000)
(0,) (0, 2000)
Loss on test: nan Accuracy on test: 0.2957
Epoch 11/20


Train dataset: 23it [00:15,  1.50it/s]


Loss on train: 0.3348 Accuracy on train: 0.7040


Test dataset: 0it [00:00, ?it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(28,) (28, 2000)


Test dataset: 1it [00:00,  1.98it/s]

(17,) (17, 2000)
(2,) (2, 2000)
(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)


Test dataset: 2it [00:01,  1.98it/s]

(17,) (17, 2000)
(2,) (2, 2000)
(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(28,) (28, 2000)


Test dataset: 3it [00:01,  2.06it/s]

(14,) (14, 2000)
(1,) (1, 2000)
(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(29,) (29, 2000)


Test dataset: 4it [00:01,  2.11it/s]

(10,) (10, 2000)
(0,) (0, 2000)
(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(29,) (29, 2000)
(19,) (19, 2000)
(4,) (4, 2000)


Test dataset: 5it [00:02,  2.12it/s]


(0,) (0, 2000)
(0,) (0, 2000)
Loss on test: nan Accuracy on test: 0.2759
Epoch 12/20


Train dataset: 23it [00:13,  1.65it/s]


Loss on train: 0.3121 Accuracy on train: 0.7283


Test dataset: 0it [00:00, ?it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(28,) (28, 2000)
(17,) (17, 2000)
(2,) (2, 2000)


Test dataset: 1it [00:00,  2.20it/s]

(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(17,) (17, 2000)
(2,) (2, 2000)


Test dataset: 2it [00:00,  2.27it/s]

(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(28,) (28, 2000)
(14,) (14, 2000)
(1,) (1, 2000)


Test dataset: 3it [00:01,  2.35it/s]

(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(29,) (29, 2000)
(10,) (10, 2000)
(0,) (0, 2000)


Test dataset: 4it [00:01,  2.28it/s]

(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)


Test dataset: 5it [00:02,  2.33it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(29,) (29, 2000)
(19,) (19, 2000)
(4,) (4, 2000)
(0,) (0, 2000)
(0,) (0, 2000)


Test dataset: 5it [00:02,  2.30it/s]


Loss on test: nan Accuracy on test: 0.3052
Epoch 13/20


Train dataset: 23it [00:13,  1.66it/s]


Loss on train: 0.2946 Accuracy on train: 0.7375


Test dataset: 0it [00:00, ?it/s]

(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(28,) (28, 2000)
(17,) (17, 2000)
(2,) (2, 2000)


Test dataset: 1it [00:00,  2.23it/s]

(0,) (0, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(32,) (32, 2000)
(17,) (17, 2000)


Test dataset: 1it [00:00,  1.16it/s]

(2,) (2, 2000)





KeyboardInterrupt: ignored

In [None]:
# model.save_weights(base_path + 'Data/machine_translation_bahdanau_attention_weights.h5')
with open(base_path + 'Data/machine_translation_bahdanau_attention_history.pickle', 'wb') as f:
    pickle.dump(history, f)

In [17]:
# with open(base_path + 'Data/machine_translation_bahdanau_attention_history.pickle', 'rb') as f:
#     history = pickle.load(f)

In [None]:
df = pd.DataFrame(data=history)
# Plot the learning curves (the loss function and the accuracy metric)
# which calculated on the training and validation datasets
_, axs = plt.subplots(1, 2, figsize=(10, 5))
sns.lineplot(data=df[['train_loss', 'test_loss']], ax=axs[0])
sns.lineplot(data=df[['train_accuracy', 'test_accuracy']], ax=axs[1])
axs[0].set_title('Loss function')
axs[1].set_title('Accuracy')
plt.show()