<a href="https://colab.research.google.com/github/argenis-gomez/Traductor/blob/master/training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Phase 1: Importing the Dependencies

**Original paper**: Attention Is All You Need https://arxiv.org/pdf/1706.03762.pdf

In [1]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
!pip install -q tensorflow_text_nightly
!pip install -q tf-nightly

In [63]:
import numpy as np
import pandas as pd
import math
import re
import time
import os

import tensorflow as tf
from tensorflow.keras import layers

from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab
import tensorflow_text as text

# Phase 2: Data Preprocessing

## Loading the Files

We load the files from our personal Google Drive.

In [4]:
PATH = '/content/drive/My Drive/Traductor'

In [None]:
with open(f"{PATH}/data/europarl-v7.es-en.en", 
          mode = "r", encoding = "utf-8") as f:
    europarl_en = f.read()
with open(f"{PATH}/data/europarl-v7.es-en.es", 
          mode = "r", encoding = "utf-8") as f:
    europarl_es = f.read()
with open(f"{PATH}/data/nonbreaking_prefix.en", 
          mode = "r", encoding = "utf-8") as f:
    non_breaking_prefix_en = f.read()
with open(f"{PATH}/data/nonbreaking_prefix.es", 
          mode = "r", encoding = "utf-8") as f:
    non_breaking_prefix_es = f.read()

## Data Cleaning

We turn the non_breaking_prefixes into a list of clean words, each with a leading space and a trailing period, so that they are easier to use.

In [None]:
non_breaking_prefix_en = non_breaking_prefix_en.split("\n")
non_breaking_prefix_en = [' ' + pref + '.' for pref in non_breaking_prefix_en]
non_breaking_prefix_es = non_breaking_prefix_es.split("\n")
non_breaking_prefix_es = [' ' + pref + '.' for pref in non_breaking_prefix_es]

We need every word, and any other symbol we want to keep, in lowercase and separated by spaces so that we can tokenize them.

In [None]:
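def clean_data(corpus, prefixes):
  # Mark each non-breaking prefix so that its period is removed below.
  for prefix in prefixes:
    corpus = corpus.replace(prefix, prefix + '$$$')
  # Mark periods that are directly followed by a letter or digit.
  corpus = re.sub(r"\.(?=[0-9]|[a-z]|[A-Z])", ".$$$", corpus)
  # Delete every marked period together with its marker.
  corpus = re.sub(r"\.\$\$\$", '', corpus)
  # Collapse runs of spaces into a single space.
  corpus = re.sub(r"  +", " ", corpus)
  return corpus.split('\n')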
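
To see what the cleaning step does, here is a minimal standalone sketch on a two-line sample. It assumes the prefix replacement is assigned back to `corpus`; the name `clean_data_sketch`, the sample text, and the one-item prefix list are made up for illustration.

```python
import re

def clean_data_sketch(corpus, prefixes):
    # Mark every known abbreviation so its period can be dropped later.
    for prefix in prefixes:
        corpus = corpus.replace(prefix, prefix + "$$$")
    # Mark periods directly followed by a letter or digit (e.g. "U.S").
    corpus = re.sub(r"\.(?=[0-9a-zA-Z])", ".$$$", corpus)
    # Delete every marked period together with its marker.
    corpus = re.sub(r"\.\$\$\$", "", corpus)
    # Collapse runs of spaces into a single space.
    corpus = re.sub(r"  +", " ", corpus)
    return corpus.split("\n")

sample = "I met Mr. Smith at the U.S. embassy.\nIt  was late."
cleaned = clean_data_sketch(sample, [" Mr."])
print(cleaned)
# ['I met Mr Smith at the US. embassy.', 'It was late.']
```

Only periods that end a sentence survive, so the later tokenization does not treat abbreviations as sentence boundaries.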

In [None]:
%%time

print('Creating corpus...')
corpus_en = clean_data(europarl_en, non_breaking_prefix_en)
print('English corpus created...')
corpus_es = clean_data(europarl_es, non_breaking_prefix_es)
print('Spanish corpus created...')
print('Done', end='\n\n')

Creating corpus...
English corpus created...
Spanish corpus created...
Done

CPU times: user 58.7 s, sys: 1.98 s, total: 1min
Wall time: 1min


## Vocabulary

In [5]:
VOCAB_SIZE = 2**13

In [6]:
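def write_vocab_file(filepath, vocab):
  # Write UTF-8 explicitly so non-ASCII tokens do not depend on the platform default.
  with open(filepath, 'w', encoding='utf-8') as f:
    for token in vocab:
      print(token, file=f)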

In [7]:
bert_tokenizer_params = dict()
reserved_tokens = ["[PAD]", "[START]", "[END]"]

bert_vocab_args = dict(
    vocab_size = VOCAB_SIZE,
    reserved_tokens=reserved_tokens,
    bert_tokenizer_params=bert_tokenizer_params,
    learn_params={},
)

In [8]:
%%time

vocab_en_path = f'{PATH}/vocab_en.txt'

if not os.path.isfile(vocab_en_path):
  print('Creating English vocabulary...')
  vocab = bert_vocab.bert_vocab_from_dataset(
      tf.data.Dataset.from_tensor_slices(corpus_en).batch(1000).prefetch(tf.data.experimental.AUTOTUNE),
      **bert_vocab_args)

  write_vocab_file(vocab_en_path, vocab)
  print('Done...', end='\n\n')
  
else:
  print('English vocabulary already exists...', end='\n\n')

English vocabulary already exists...

CPU times: user 233 µs, sys: 48 µs, total: 281 µs
Wall time: 654 µs


In [9]:
%%time

vocab_es_path = f'{PATH}/vocab_es.txt'

if not os.path.isfile(vocab_es_path):
  print('Creating Spanish vocabulary...')
  vocab = bert_vocab.bert_vocab_from_dataset(
      tf.data.Dataset.from_tensor_slices(corpus_es).batch(1000).prefetch(tf.data.experimental.AUTOTUNE),
      **bert_vocab_args)

  write_vocab_file(vocab_es_path, vocab)
  print('Done...', end='\n\n')

else:
  print('Spanish vocabulary already exists...', end='\n\n')

Spanish vocabulary already exists...

CPU times: user 635 µs, sys: 132 µs, total: 767 µs
Wall time: 803 µs


## Tokenizing the Text

In [10]:
tokenizer = tf.Module()
tokenizer.en = text.BertTokenizer(vocab_en_path)
tokenizer.es = text.BertTokenizer(vocab_es_path)

In [None]:
tokenizer_path = f"{PATH}/ckpt/Tokenizer"
tf.saved_model.save(tokenizer, tokenizer_path)

INFO:tensorflow:Assets written to: /content/drive/My Drive/Traductor/ckpt/Tokenizer/assets


In [None]:
inputs = tokenizer.en.tokenize(corpus_en)
outputs = tokenizer.es.tokenize(corpus_es)

inputs = inputs.merge_dims(-2,-1)
outputs = outputs.merge_dims(-2,-1)

In [11]:
START = tf.argmax(tf.constant(reserved_tokens) == "[START]")
END = tf.argmax(tf.constant(reserved_tokens) == "[END]")

def add_start_end(ragged):
  count = ragged.bounding_shape()[0]
  starts = tf.fill([count,1], START)
  ends = tf.fill([count,1], END)
  return tf.concat([starts, ragged, ends], axis=1)

In [None]:
inputs  = add_start_end(inputs).to_list()
outputs = add_start_end(outputs).to_list()

## Removing Sentences That Are Too Long

In [12]:
MAX_LENGTH = 20

In [None]:
%%time

print('Removing sentences that are too long...')

idx_to_remove = [count for count, sent in enumerate(inputs)
                 if len(sent) > MAX_LENGTH]
for idx in reversed(idx_to_remove):
    del inputs[idx]
    del outputs[idx]
idx_to_remove = [count for count, sent in enumerate(outputs)
                 if len(sent) > MAX_LENGTH]
for idx in reversed(idx_to_remove):
    del inputs[idx]
    del outputs[idx]

print('Done...', end='\n\n')

Removing sentences that are too long...
Done...

CPU times: user 5min 32s, sys: 654 ms, total: 5min 33s
Wall time: 5min 34s
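
Deleting an element from the middle of a Python list shifts every later element, which is one plausible reason the cell above takes minutes on a large corpus. The sketch below shows an alternative that builds new lists in a single pass. It is an illustration on toy data, not the code that produced the timing above; `MAX_LEN` and the `*_demo` lists are hypothetical.

```python
MAX_LEN = 5  # hypothetical small limit for the demo

# Toy token-id sequences; each input is paired with the output at the same index.
inputs_demo = [[1, 7, 2], [1, 7, 8, 9, 10, 11, 2], [1, 4, 2]]
outputs_demo = [[1, 5, 2], [1, 5, 6, 2], [1, 3, 3, 3, 3, 3, 2]]

# Keep a pair only when BOTH sides fit within the limit.
kept = [(src, tgt) for src, tgt in zip(inputs_demo, outputs_demo)
        if len(src) <= MAX_LEN and len(tgt) <= MAX_LEN]
inputs_kept = [src for src, _ in kept]
outputs_kept = [tgt for _, tgt in kept]

print(inputs_kept)   # [[1, 7, 2]]
print(outputs_kept)  # [[1, 5, 2]]
```

The second pair is dropped because its input is too long, and the third because its output is too long, so both lists stay aligned.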


## Creating the Inputs and Outputs

Since we train in batches, every input must have the same length. We pad with the appropriate token, and later we make sure that this padding token does not interfere with training.

In [None]:
inputs = tf.keras.preprocessing.sequence.pad_sequences(inputs,
                                                       value=0,
                                                       padding='post',
                                                       maxlen=MAX_LENGTH)
outputs = tf.keras.preprocessing.sequence.pad_sequences(outputs,
                                                        value=0,
                                                        padding='post',
                                                        maxlen=MAX_LENGTH)
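
As a rough picture of what post-padding does, the following pure-Python sketch appends the padding id to the end of each sequence. The helper `pad_post` is hypothetical and assumes every sequence is already at most `maxlen` tokens long, which the filtering step is meant to guarantee.

```python
def pad_post(sequences, maxlen, pad_id=0):
    # Assumes every sequence is already at most maxlen tokens long.
    return [list(seq) + [pad_id] * (maxlen - len(seq)) for seq in sequences]

batch = pad_post([[1, 7, 2], [1, 4, 5, 6, 2]], maxlen=6)
print(batch)
# [[1, 7, 2, 0, 0, 0], [1, 4, 5, 6, 2, 0]]
```

The id 0 corresponds to `[PAD]`, the first entry of `reserved_tokens`, which is why the masks and the loss later test for equality with 0.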

In [None]:
pd.DataFrame(inputs).to_csv(f'{PATH}/inputs.csv', index=False)
pd.DataFrame(outputs).to_csv(f'{PATH}/outputs.csv', index=False)

In [13]:
inputs  = pd.read_csv(f'{PATH}/inputs.csv').values
outputs = pd.read_csv(f'{PATH}/outputs.csv').values

In [14]:
BATCH_SIZE = 64
BUFFER_SIZE = 20000

dataset = tf.data.Dataset.from_tensor_slices((inputs, outputs))

dataset = dataset.cache()
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)

# Phase 3: Building the Model

## Embedding

Positional encoding formula:

$PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}})$

$PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}})$

In [15]:
class PositionalEncoding(layers.Layer):

    def __init__(self):
        super(PositionalEncoding, self).__init__()
    
    def get_angles(self, pos, i, d_model): # pos: (seq_length, 1) i: (1, d_model)
        angles = 1 / np.power(10000., (2*(i//2)) / np.float32(d_model))
        return pos * angles # (seq_length, d_model)

    def call(self, inputs):
        seq_length = inputs.shape.as_list()[-2]
        d_model = inputs.shape.as_list()[-1]
        angles = self.get_angles(np.arange(seq_length)[:, np.newaxis],
                                 np.arange(d_model)[np.newaxis, :],
                                 d_model)
        angles[:, 0::2] = np.sin(angles[:, 0::2])
        angles[:, 1::2] = np.cos(angles[:, 1::2])
        pos_encoding = angles[np.newaxis, ...]
        return inputs + tf.cast(pos_encoding, tf.float32)
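
The same table can be computed with NumPy alone, which makes it easy to inspect. This is a small sketch of the formula, with the function name `positional_encoding` and the sizes chosen only for illustration.

```python
import numpy as np

def positional_encoding(seq_length, d_model):
    pos = np.arange(seq_length)[:, np.newaxis]   # (seq_length, 1)
    i = np.arange(d_model)[np.newaxis, :]        # (1, d_model)
    # Columns 2i and 2i+1 share the same frequency 1 / 10000^(2i / d_model).
    angles = pos / np.power(10000.0, (2 * (i // 2)) / float(d_model))
    angles[:, 0::2] = np.sin(angles[:, 0::2])    # even columns: sine
    angles[:, 1::2] = np.cos(angles[:, 1::2])    # odd columns: cosine
    return angles

pe = positional_encoding(seq_length=4, d_model=6)
print(pe.shape)  # (4, 6)
print(pe[0])     # position 0: every sine is 0 and every cosine is 1
```

Each row is added to the embedding of the token at that position, so two identical tokens at different positions end up with different vectors.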

## Attention

### Computing Attention

$\text{Attention}(Q, K, V) = \text{softmax}\left(\dfrac{QK^T}{\sqrt{d_k}}\right)V$

In [16]:
def scaled_dot_product_attention(queries, keys, values, mask):
    product = tf.matmul(queries, keys, transpose_b=True)
    
    keys_dim = tf.cast(tf.shape(keys)[-1], tf.float32)
    scaled_product = product / tf.math.sqrt(keys_dim)
    
    if mask is not None:
        scaled_product += (mask * -1e9)
    
    attention = tf.matmul(tf.nn.softmax(scaled_product, axis=-1), values)
    
    return attention
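
To make the effect of the mask concrete, here is a NumPy sketch of the same computation for a single query and three keys. The names `softmax` and `attention_np` and all values are illustrative. A mask entry of 1 adds -1e9 to the score, so that key receives essentially zero weight.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention_np(q, k, v, mask=None):
    scores = q @ k.T / np.sqrt(k.shape[-1])      # scaled dot products
    if mask is not None:
        scores = scores + mask * -1e9            # masked keys get a huge negative score
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

q = np.array([[1.0, 0.0]])
k = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
v = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])
mask = np.array([[0.0, 0.0, 1.0]])               # hide the third key

out, weights = attention_np(q, k, v, mask)
print(weights)
print(out)
```

The weights still sum to 1, but they are shared only between the first two keys.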

### Multi-Head Attention Sublayer

In [17]:
class MultiHeadAttention(layers.Layer):
    
    def __init__(self, nb_proj):
        super(MultiHeadAttention, self).__init__()
        self.nb_proj = nb_proj
        
    def build(self, input_shape):
        self.d_model = input_shape[-1]
        assert self.d_model % self.nb_proj == 0
        
        self.d_proj = self.d_model // self.nb_proj
        
        self.query_lin = layers.Dense(units=self.d_model)
        self.key_lin = layers.Dense(units=self.d_model)
        self.value_lin = layers.Dense(units=self.d_model)
        
        self.final_lin = layers.Dense(units=self.d_model)
        
    def split_proj(self, inputs, batch_size): # inputs: (batch_size, seq_length, d_model)
        shape = (batch_size,
                 -1,
                 self.nb_proj,
                 self.d_proj)
        splited_inputs = tf.reshape(inputs, shape=shape) # (batch_size, seq_length, nb_proj, d_proj)
        return tf.transpose(splited_inputs, perm=[0, 2, 1, 3]) # (batch_size, nb_proj, seq_length, d_proj)
    
    def call(self, queries, keys, values, mask):
        batch_size = tf.shape(queries)[0]
        
        queries = self.query_lin(queries)
        keys = self.key_lin(keys)
        values = self.value_lin(values)
        
        queries = self.split_proj(queries, batch_size)
        keys = self.split_proj(keys, batch_size)
        values = self.split_proj(values, batch_size)
        
        attention = scaled_dot_product_attention(queries, keys, values, mask)
        
        attention = tf.transpose(attention, perm=[0, 2, 1, 3])
        
        concat_attention = tf.reshape(attention,
                                      shape=(batch_size, -1, self.d_model))
        
        outputs = self.final_lin(concat_attention)
        
        return outputs
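
The reshape and transpose in `split_proj`, and their inverse before `final_lin`, can be checked with NumPy. This is a sketch with small, arbitrary sizes. It only shows how the feature axis is cut into heads and put back together, without the dense projections.

```python
import numpy as np

batch_size, seq_length, d_model, nb_proj = 2, 5, 8, 4
d_proj = d_model // nb_proj

x = np.arange(batch_size * seq_length * d_model, dtype=np.float32)
x = x.reshape(batch_size, seq_length, d_model)

# Split: (batch, seq, d_model) -> (batch, nb_proj, seq, d_proj)
heads = x.reshape(batch_size, -1, nb_proj, d_proj).transpose(0, 2, 1, 3)

# Merge: (batch, nb_proj, seq, d_proj) -> (batch, seq, d_model)
merged = heads.transpose(0, 2, 1, 3).reshape(batch_size, -1, d_model)

print(heads.shape)                # (2, 4, 5, 2)
print(np.array_equal(merged, x))  # True
```

Head `h` sees features `h * d_proj` to `(h + 1) * d_proj - 1` of every position, and merging restores the original layout exactly.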

## Encoder

In [18]:
class EncoderLayer(layers.Layer):
    
    def __init__(self, FFN_units, nb_proj, dropout_rate):
        super(EncoderLayer, self).__init__()
        self.FFN_units = FFN_units
        self.nb_proj = nb_proj
        self.dropout_rate = dropout_rate
    
    def build(self, input_shape):
        self.d_model = input_shape[-1]
        
        self.multi_head_attention = MultiHeadAttention(self.nb_proj)
        self.dropout_1 = layers.Dropout(rate=self.dropout_rate)
        self.norm_1 = layers.LayerNormalization(epsilon=1e-6)
        
        self.dense_1 = layers.Dense(units=self.FFN_units, activation="relu")
        self.dense_2 = layers.Dense(units=self.d_model)
        self.dropout_2 = layers.Dropout(rate=self.dropout_rate)
        self.norm_2 = layers.LayerNormalization(epsilon=1e-6)
        
    def call(self, inputs, mask, training):
        attention = self.multi_head_attention(inputs,
                                              inputs,
                                              inputs,
                                              mask)
        attention = self.dropout_1(attention, training=training)
        attention = self.norm_1(attention + inputs)
        
        outputs = self.dense_1(attention)
        outputs = self.dense_2(outputs)
        outputs = self.dropout_2(outputs, training=training)
        outputs = self.norm_2(outputs + attention)
        
        return outputs

In [19]:
class Encoder(layers.Layer):
    
    def __init__(self,
                 nb_layers,
                 FFN_units,
                 nb_proj,
                 dropout_rate,
                 vocab_size,
                 d_model,
                 name="encoder"):
        super(Encoder, self).__init__(name=name)
        self.nb_layers = nb_layers
        self.d_model = d_model
        
        self.embedding = layers.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding()
        self.dropout = layers.Dropout(rate=dropout_rate)
        self.enc_layers = [EncoderLayer(FFN_units,
                                        nb_proj,
                                        dropout_rate) 
                           for _ in range(nb_layers)]
    
    def call(self, inputs, mask, training):
        outputs = self.embedding(inputs)
        outputs *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        outputs = self.pos_encoding(outputs)
        outputs = self.dropout(outputs, training)
        
        for i in range(self.nb_layers):
            outputs = self.enc_layers[i](outputs, mask, training)

        return outputs

## Decoder

In [20]:
class DecoderLayer(layers.Layer):
    
    def __init__(self, FFN_units, nb_proj, dropout_rate):
        super(DecoderLayer, self).__init__()
        self.FFN_units = FFN_units
        self.nb_proj = nb_proj
        self.dropout_rate = dropout_rate
    
    def build(self, input_shape):
        self.d_model = input_shape[-1]
        
        # Self multi head attention
        self.multi_head_attention_1 = MultiHeadAttention(self.nb_proj)
        self.dropout_1 = layers.Dropout(rate=self.dropout_rate)
        self.norm_1 = layers.LayerNormalization(epsilon=1e-6)
        
        # Multi-head attention combined with the encoder output
        self.multi_head_attention_2 = MultiHeadAttention(self.nb_proj)
        self.dropout_2 = layers.Dropout(rate=self.dropout_rate)
        self.norm_2 = layers.LayerNormalization(epsilon=1e-6)
        
        # Feed forward
        self.dense_1 = layers.Dense(units=self.FFN_units,
                                    activation="relu")
        self.dense_2 = layers.Dense(units=self.d_model)
        self.dropout_3 = layers.Dropout(rate=self.dropout_rate)
        self.norm_3 = layers.LayerNormalization(epsilon=1e-6)
        
    def call(self, inputs, enc_outputs, mask_1, mask_2, training):
        attention = self.multi_head_attention_1(inputs,
                                                inputs,
                                                inputs,
                                                mask_1)
        attention = self.dropout_1(attention, training)
        attention = self.norm_1(attention + inputs)
        
        attention_2 = self.multi_head_attention_2(attention,
                                                  enc_outputs,
                                                  enc_outputs,
                                                  mask_2)
        attention_2 = self.dropout_2(attention_2, training)
        attention_2 = self.norm_2(attention_2 + attention)
        
        outputs = self.dense_1(attention_2)
        outputs = self.dense_2(outputs)
        outputs = self.dropout_3(outputs, training)
        outputs = self.norm_3(outputs + attention_2)
        
        return outputs

In [21]:
class Decoder(layers.Layer):
    
    def __init__(self,
                 nb_layers,
                 FFN_units,
                 nb_proj,
                 dropout_rate,
                 vocab_size,
                 d_model,
                 name="decoder"):
        super(Decoder, self).__init__(name=name)
        self.d_model = d_model
        self.nb_layers = nb_layers
        
        self.embedding = layers.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding()
        self.dropout = layers.Dropout(rate=dropout_rate)
        
        self.dec_layers = [DecoderLayer(FFN_units,
                                        nb_proj,
                                        dropout_rate) 
                           for _ in range(nb_layers)]
    
    def call(self, inputs, enc_outputs, mask_1, mask_2, training):
        outputs = self.embedding(inputs)
        outputs *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        outputs = self.pos_encoding(outputs)
        outputs = self.dropout(outputs, training)
        
        for i in range(self.nb_layers):
            outputs = self.dec_layers[i](outputs,
                                         enc_outputs,
                                         mask_1,
                                         mask_2,
                                         training)

        return outputs

## Transformer

In [22]:
class Transformer(tf.keras.Model):
    
    def __init__(self,
                 vocab_size_enc,
                 vocab_size_dec,
                 d_model,
                 nb_layers,
                 FFN_units,
                 nb_proj,
                 dropout_rate,
                 name="transformer"):
        super(Transformer, self).__init__(name=name)
        
        self.encoder = Encoder(nb_layers,
                               FFN_units,
                               nb_proj,
                               dropout_rate,
                               vocab_size_enc,
                               d_model)
        self.decoder = Decoder(nb_layers,
                               FFN_units,
                               nb_proj,
                               dropout_rate,
                               vocab_size_dec,
                               d_model)
        self.last_linear = layers.Dense(units=vocab_size_dec, name="lin_ouput")
    
    def create_padding_mask(self, seq): #seq: (batch_size, seq_length)
        mask = tf.cast(tf.math.equal(seq, 0), tf.float32)
        return mask[:, tf.newaxis, tf.newaxis, :]

    def create_look_ahead_mask(self, seq):
        seq_len = tf.shape(seq)[1]
        look_ahead_mask = 1 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
        return look_ahead_mask
    
    def call(self, enc_inputs, dec_inputs, training):
        enc_mask = self.create_padding_mask(enc_inputs)
        dec_mask_1 = tf.maximum(
            self.create_padding_mask(dec_inputs),
            self.create_look_ahead_mask(dec_inputs)
        )
        dec_mask_2 = self.create_padding_mask(enc_inputs)
        
        enc_outputs = self.encoder(enc_inputs, enc_mask, training)
        dec_outputs = self.decoder(dec_inputs,
                                   enc_outputs,
                                   dec_mask_1,
                                   dec_mask_2,
                                   training)
        
        outputs = self.last_linear(dec_outputs)
        
        return outputs
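
The two masks and their combination for the decoder can be reproduced in NumPy, where 1 means "do not attend". The sketch below uses `np.tril` as the counterpart of `tf.linalg.band_part(..., -1, 0)`. The function names and the sample sequence are illustrative.

```python
import numpy as np

def padding_mask(seq):
    # 1 where the token is padding (id 0); shape (batch, 1, 1, seq_len)
    return (seq == 0).astype(np.float32)[:, np.newaxis, np.newaxis, :]

def look_ahead_mask(seq_len):
    # 1 strictly above the diagonal: position t cannot see positions after t
    return 1.0 - np.tril(np.ones((seq_len, seq_len), dtype=np.float32))

seq = np.array([[1, 5, 6, 2, 0]])  # the last token is padding
combined = np.maximum(padding_mask(seq), look_ahead_mask(seq.shape[1]))

print(combined.shape)  # (1, 1, 5, 5)
print(combined[0, 0])
```

Broadcasting turns the `(batch, 1, 1, seq_len)` padding mask and the `(seq_len, seq_len)` look-ahead mask into one `(batch, 1, seq_len, seq_len)` mask, which then broadcasts over the heads.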

# Training

In [56]:
tf.keras.backend.clear_session()

# Hyperparameters (the trailing comments give the values of the paper's base model)
D_MODEL = 128 # 512
NB_LAYERS = 4 # 6
FFN_UNITS = 512 # 2048
NB_PROJ = 8 # 8
DROPOUT_RATE = 0.1 # 0.1

transformer = Transformer(vocab_size_enc=VOCAB_SIZE,
                          vocab_size_dec=VOCAB_SIZE,
                          d_model=D_MODEL,
                          nb_layers=NB_LAYERS,
                          FFN_units=FFN_UNITS,
                          nb_proj=NB_PROJ,
                          dropout_rate=DROPOUT_RATE)

In [27]:
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True,
                                                            reduction="none")

def loss_function(target, pred):
    mask = tf.math.logical_not(tf.math.equal(target, 0))
    loss_ = loss_object(target, pred)
    
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    
    return tf.reduce_mean(loss_)

train_loss = tf.keras.metrics.Mean(name="train_loss")
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name="train_accuracy")
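
Note that `loss_function` zeroes the loss at padded positions but `tf.reduce_mean` still divides by every position, padded or not. A common alternative, shown here only as a sketch and not as what this notebook trained with, is to divide by the number of real tokens. The numbers below are made up.

```python
import numpy as np

per_token_loss = np.array([[2.0, 1.0, 3.0, 0.7, 0.9]])  # made-up losses
targets = np.array([[5, 9, 2, 0, 0]])                   # the last two are padding

mask = (targets != 0).astype(np.float32)
masked = per_token_loss * mask

mean_over_all = masked.mean()                # divides by 5 positions
mean_over_real = masked.sum() / mask.sum()   # divides by 3 real tokens

print(mean_over_all, mean_over_real)
```

With the first variant, the reported loss shrinks as a batch contains more padding, so values are comparable only between runs that use the same normalization.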

In [28]:
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    
    def __init__(self, d_model, warmup_steps=4000):
        super(CustomSchedule, self).__init__()
        
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps
    
    def __call__(self, step):
        # The optimizer may pass an integer step; rsqrt needs a float.
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps**-1.5)
        
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

learning_rate = CustomSchedule(D_MODEL)

optimizer = tf.keras.optimizers.Adam(learning_rate,
                                     beta_1=0.9,
                                     beta_2=0.98,
                                     epsilon=1e-9)
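
The shape of this schedule is easier to see in plain Python. The sketch below evaluates the same expression, `d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)`, at a few steps. The function name `transformer_lr` is made up for illustration.

```python
def transformer_lr(step, d_model=128, warmup_steps=4000):
    step = float(step)
    # Linear warm-up while step < warmup_steps, then decay proportional to 1/sqrt(step).
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1, 1000, 4000, 16000):
    print(s, transformer_lr(s))
```

The rate rises linearly up to `warmup_steps`, where both arguments of `min` coincide, and then halves each time the step count quadruples.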
        

In [52]:
checkpoint_path = f"{PATH}/ckpt/Traductor"

ckpt = tf.train.Checkpoint(transformer=transformer,
                           optimizer=optimizer)

ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

if ckpt_manager.latest_checkpoint:
    ckpt.restore(ckpt_manager.latest_checkpoint)
    print("Latest checkpoint restored!")

Latest checkpoint restored!


In [None]:
EPOCHS = 5
for epoch in range(EPOCHS):
    print(f"Start of epoch {epoch+1}")
    start = time.time()
    
    train_loss.reset_states()
    train_accuracy.reset_states()
    
    for (batch, (enc_inputs, targets)) in enumerate(dataset):
        # Teacher forcing: the decoder sees tokens 0..n-1 and must predict tokens 1..n.
        dec_inputs = targets[:, :-1]
        dec_outputs_real = targets[:, 1:]
        with tf.GradientTape() as tape:
            predictions = transformer(enc_inputs, dec_inputs, True)
            loss = loss_function(dec_outputs_real, predictions)
        
        gradients = tape.gradient(loss, transformer.trainable_variables)
        optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))
        
        train_loss(loss)
        train_accuracy(dec_outputs_real, predictions)
        
        if batch % 50 == 0:
            print(f"\rEpoch {epoch+1} Batch {batch} Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}",
                  end='')
            
    ckpt_save_path = ckpt_manager.save()
    print(f"\nSaving checkpoint for epoch {epoch+1} at {ckpt_save_path}")

    print(f"Time taken by the epoch: {time.time() - start:6.2f} secs\n")

Start of epoch 1
Epoch 1 Batch 6650 Loss 0.9202 Accuracy 0.4751
Saving checkpoint for epoch 1 at /content/drive/My Drive/Transformer/ckpt/ckpt-16
Time taken by the epoch: 6819.25 secs

Start of epoch 2
Epoch 2 Batch 6650 Loss 0.9124 Accuracy 0.4763
Saving checkpoint for epoch 2 at /content/drive/My Drive/Transformer/ckpt/ckpt-17
Time taken by the epoch: 6877.75 secs

Start of epoch 3
Epoch 3 Batch 6650 Loss 0.9055 Accuracy 0.4775
Saving checkpoint for epoch 3 at /content/drive/My Drive/Transformer/ckpt/ckpt-18
Time taken by the epoch: 6887.22 secs

Start of epoch 4
Epoch 4 Batch 6650 Loss 0.8985 Accuracy 0.4786
Saving checkpoint for epoch 4 at /content/drive/My Drive/Transformer/ckpt/ckpt-19
Time taken by the epoch: 6768.95 secs

Start of epoch 5
Epoch 5 Batch 6650 Loss 0.8922 Accuracy 0.4797
Saving checkpoint for epoch 5 at /content/drive/My Drive/Transformer/ckpt/ckpt-20
Time taken by the e
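
The slicing at the top of the training loop shifts the target by one position. A tiny NumPy sketch with made-up token ids (1 for `[START]`, 2 for `[END]`, 0 for `[PAD]`) shows which token the decoder reads and which one it has to predict at each step.

```python
import numpy as np

targets = np.array([[1, 11, 12, 13, 2, 0]])  # [START] w1 w2 w3 [END] [PAD]

dec_inputs = targets[:, :-1]        # what the decoder reads
dec_outputs_real = targets[:, 1:]   # what it must predict at each position

print(dec_inputs)        # [[ 1 11 12 13  2]]
print(dec_outputs_real)  # [[11 12 13  2  0]]
```

At position 0 the decoder reads `[START]` and must output the first word, and so on until it outputs `[END]`. The trailing padding target is the one `loss_function` masks out.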

In [62]:
transformer.save_weights(f"{PATH}/ckpt/model_weights/model")

# Evaluation

In [59]:
transformer.load_weights(f"{PATH}/ckpt/model_weights/model")

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7fc2ea571390>

In [60]:
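def translate(sentence):

  sentence  = tf.convert_to_tensor([sentence])
  encoder_input = tokenizer.en.tokenize(sentence)
  encoder_input = encoder_input.merge_dims(-2,-1)
  encoder_input = add_start_end(encoder_input).to_tensor()

  output = tf.convert_to_tensor([START])
  output = tf.expand_dims(output, 0)

  # Greedy decoding: append the most likely token until [END] or MAX_LENGTH.
  for i in range(MAX_LENGTH):
    predictions = transformer(encoder_input, output, False)
    predictions = predictions[: ,-1:, :]
    predicted_id = tf.argmax(predictions, axis=-1)

    output = tf.concat([output, predicted_id], axis=-1)
    
    if predicted_id == END:
        break
  
  # Named `words` so the local variable does not shadow the tensorflow_text module imported as `text`.
  words = tokenizer.es.detokenize(output)[0]
  translated = tf.strings.reduce_join(words, separator=' ', axis=-1)
  translated = translated.numpy().decode('utf-8')

  # Remove the markers by name; slicing with [8:-6] would cut real text when [END] was never produced.
  return translated.replace('[START]', '').replace('[END]', '').strip()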
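
The decoding loop in `translate` is a greedy search: at every step it keeps only the single most likely next token. The sketch below isolates that control flow with a hypothetical lookup table (`NEXT_SCORES`) standing in for the Transformer, so it runs without a trained model. All ids and scores are invented.

```python
START_ID, END_ID, MAX_STEPS = 1, 2, 10

# Hypothetical stand-in for the model: scores for the next token given the last one.
NEXT_SCORES = {
    1: {7: 0.9, 8: 0.1},
    7: {8: 0.6, 2: 0.4},
    8: {2: 0.8, 7: 0.2},
}

def greedy_decode(start_id=START_ID):
    output = [start_id]
    for _ in range(MAX_STEPS):
        scores = NEXT_SCORES[output[-1]]
        next_id = max(scores, key=scores.get)  # argmax over the candidate tokens
        output.append(next_id)
        if next_id == END_ID:                  # stop as soon as [END] is produced
            break
    return output

decoded = greedy_decode()
print(decoded)  # [1, 7, 8, 2]
```

Greedy search is simple and fast, but it never revisits an earlier choice. Beam search is the usual alternative when translation quality matters more than speed.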

In [61]:
traduction = translate("This is a problem we have to solve.")
traduction

'Este es un problema que tenemos que solucionar .'

In [None]:
traduction = translate("This is a really powerful tool!")
traduction

'¡ Es una herramienta realmente poder !'

In [None]:
traduction = translate("This is an interesting course about Natural Language Processing")
traduction

'Es un modo interesante sobre el Proceso Nandelguad .'

In [None]:
traduction = translate("Imagine all the citizens")
traduction

'Imaginaléndolo a todos los ciudadanos'

In [None]:
traduction = translate("Alejandra is the best woman in the whole world")
traduction

'Alejara es la mejor mujer en todo el mundo'