<a href="https://colab.research.google.com/github/emrllh/My_works/blob/main/Attention_Mechanisms.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Attention Mechanisms

The decoder combines the target language input with the context from the encoder to generate its output sequence. The encoder state acts as a guide, providing information about the source language input that helps the decoder make informed predictions about the target language sequence.

All of the encoder's outputs need to be fed to the Attention layer, so the encoder must be created with `return_sequences=True`.
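
To see why the Attention layer needs one encoder output per source timestep, here is a minimal NumPy sketch of dot-product attention, which is the default scoring used by `tf.keras.layers.Attention`. The function name, the toy shapes, and the random inputs are assumptions made for illustration; this is not the layer's actual implementation.

```python
import numpy as np

def dot_product_attention(queries, values):
    """Dot-product attention for one sequence pair.

    queries: (T_dec, d) decoder outputs; values: (T_enc, d) encoder outputs.
    Returns context vectors (T_dec, d) and attention weights (T_dec, T_enc).
    """
    scores = queries @ values.T                          # similarity of each decoder step to each encoder step
    scores = scores - scores.max(axis=-1, keepdims=True) # stabilise the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ values, weights

rng = np.random.default_rng(0)
encoder_outputs = rng.normal(size=(4, 8))  # 4 source timesteps, 8 features (toy sizes)
decoder_outputs = rng.normal(size=(3, 8))  # 3 target timesteps
context, weights = dot_product_attention(decoder_outputs, encoder_outputs)
print(context.shape, weights.shape)
```

Each decoder timestep gets its own weighting over all encoder timesteps, which would not be possible if the encoder returned only its final output.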

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
import os
import tensorflow_hub as hub


from pathlib import Path

In [None]:
url = "https://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip"
path = tf.keras.utils.get_file('spa-eng.zip', origin=url, cache_dir='datasets',
                               extract=True)

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip
2638744/2638744 ━━━━━━━━━━━━━━━━━━━━ 1s 0us/step


In [None]:
text = (Path(path).with_name('spa-eng')/ 'spa.txt').read_text()

In [None]:

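# Strip the Spanish inverted marks '¡' and '¿'. The recorded run replaced the
# letter 'i' instead of '¡', which is why the outputs below are missing every 'i'.
text = text.replace('¡', '').replace('¿', '')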
print(text[:100])
pairs = [line.split('\t') for line in text.splitlines()]
#pairs[:10]
np.random.seed(42)
np.random.shuffle(pairs)
sentences_en, sentences_es = zip(*pairs)

Go.	Ve.
Go.	Vete.
Go.	Vaya.
Go.	Váyase.
H.	Hola.
Run!	¡Corre!
Run.	Corred.
Who?	Quén?
Fre!	¡Fuego!
F


In [None]:
for i in range(3):
  print(sentences_en[i], '=>', sentences_es[i])

How borng! => ¡Qué aburrmento!
I love sports. => Adoro el deporte.
Would you lke to swap jobs? => Te gustaría que ntercambemos los trabajos?


In [None]:
vocab_size = 1000  # only the 1000 most frequent words are kept during text vectorization
max_length = 50  # maximum length of the output sequence
text_vec_layer_en = tf.keras.layers.TextVectorization(
    vocab_size, output_sequence_length=max_length)
text_vec_layer_es = tf.keras.layers.TextVectorization(
    vocab_size, output_sequence_length=max_length)
text_vec_layer_en.adapt(sentences_en)
# adapt() builds the vocabulary in place and returns None, so there is nothing to assign or print
text_vec_layer_es.adapt([f"startofseq {s} endofseq" for s in sentences_es])

`tf.keras.layers.TextVectorization`: This is a preprocessing layer that converts raw text into fixed-length sequences of integer token IDs.
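
As a rough illustration of what such a layer does, the following pure-Python sketch builds a small vocabulary and maps a sentence to padded token IDs. The helper names (`standardize`, `build_vocab`, `vectorize`) are hypothetical, and the standardization is a simplification of the layer's default behaviour.

```python
import string
from collections import Counter

def standardize(text):
    # Lowercase and drop ASCII punctuation (a simplification of the layer's default)
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(sentences, max_tokens):
    counts = Counter(w for s in sentences for w in standardize(s).split())
    # IDs 0 and 1 are reserved for padding and out-of-vocabulary words
    return ["", "[UNK]"] + [w for w, _ in counts.most_common(max_tokens - 2)]

def vectorize(sentence, vocab, length):
    index = {w: i for i, w in enumerate(vocab)}
    ids = [index.get(w, 1) for w in standardize(sentence).split()][:length]
    return ids + [0] * (length - len(ids))  # pad with 0 up to the fixed length

toy_vocab = build_vocab(["I love sports.", "I love music."], max_tokens=4)
print(toy_vocab)                                      # ['', '[UNK]', 'i', 'love']
print(vectorize("I love jazz!", toy_vocab, length=5)) # [2, 3, 1, 0, 0]
```

Words outside the vocabulary map to ID 1 (`[UNK]`) and short sentences are padded with ID 0, which matches the first two entries of the vocabularies printed below.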

In [None]:
text_vec_layer_en.get_vocabulary()[:10]

['', '[UNK]', 'the', 'i', 'to', 'you', 'tom', 'a', 's', 'he']

In [None]:
text_vec_layer_es.get_vocabulary()[:10]

['', '[UNK]', 'startofseq', 'endofseq', 'de', 'que', 'a', 'no', 'tom', 'la']

- Encoder: Processes the English sentences (X_train, X_valid).
- Decoder: Processes the Spanish sentences (X_train_dec, X_valid_dec) and aims to predict the correct translation. The "startofseq" token helps the decoder understand the beginning of the target sequence.
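
The relationship between the decoder input and the target can be sketched in plain Python. The helper `make_decoder_pair` is a hypothetical name used only for this illustration; the sentence is taken from the sample pairs printed earlier.

```python
def make_decoder_pair(sentence_es):
    # The decoder input is the target shifted right by one token
    decoder_input = f"startofseq {sentence_es}"
    target = f"{sentence_es} endofseq"
    return decoder_input, target

dec_in, target = make_decoder_pair("Adoro el deporte.")
for step, (seen, expected) in enumerate(zip(dec_in.split(), target.split())):
    print(step, seen, "->", expected)
# 0 startofseq -> Adoro
# 1 Adoro -> el
# 2 el -> deporte.
# 3 deporte. -> endofseq
```

At every timestep the decoder sees the previous target word and is trained to predict the next one.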

In [None]:
X_train = tf.constant(sentences_en[:100_000])
print(X_train[:5])
X_valid = tf.constant(sentences_en[100_000:])

X_train_dec = tf.constant([f'startofseq {s}' for s in sentences_es[:100_000]])
X_valid_dec = tf.constant([f'startofseq {s}' for s in sentences_es[100_000:]])

# Targets: append " endofseq" to each Spanish sentence, then convert it to token IDs.

Y_train = text_vec_layer_es([f'{s} endofseq' for s in sentences_es[:100_000]])
print(Y_train[:1])
Y_valid = text_vec_layer_es([f'{s} endofseq' for s in sentences_es[100_000:]])
print(Y_valid[:1])

tf.Tensor(
[b'How borng!' b'I love sports.' b'Would you lke to swap jobs?'
 b'My mother dd nothng but weep.'
 b'Croata s n the southeastern part of Europe.'], shape=(5,), dtype=string)
tf.Tensor(
[[437   1   3   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0]], shape=(1, 50), dtype=int64)
tf.Tensor(
[[ 14  37   1 141   1   3   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0]], shape=(1, 50), dtype=int64)


In [None]:
tf.random.set_seed(42)
encoder_inputs = tf.keras.layers.Input(shape=[], dtype=tf.string)
decoder_inputs = tf.keras.layers.Input(shape=[], dtype=tf.string)

`text_vec_layer_en`: This is an instance of the `TextVectorization` layer. It converts the English text input into numerical token IDs.

`vocab_size`: This argument specifies the size of the vocabulary. It indicates the total number of unique words or tokens that the embedding layer needs to handle.

`embed_size`: This argument defines the dimensionality of the word embeddings. It sets the size of the vector that will represent each word. In our case, it's 128, meaning each word will be represented by a 128-dimensional vector.

https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/#Word%20Embeddings

https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/

`mask_zero=True`: This optional argument matters for handling padding. When set to `True`, the embedding layer marks input values of 0 as padding so that downstream layers can skip them. This is useful when variable-length sequences are padded with 0s to a common length: masking the padding prevents it from influencing the model's learning. Note that the code in this notebook passes `mask_zero=False`, so padded positions are not masked and they count toward the loss and the accuracy.
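
The following NumPy sketch illustrates the two ideas above: an embedding is a row lookup in a `vocab_size × embed_size` matrix, and a padding mask marks which positions hold real tokens. The random matrix stands in for trained weights and the token IDs are an arbitrary padded example, both assumptions for illustration.

```python
import numpy as np

vocab_size, embed_size = 1000, 128
rng = np.random.default_rng(42)
embedding_matrix = rng.normal(size=(vocab_size, embed_size))  # stands in for trained weights

token_ids = np.array([[14, 37, 1, 3, 0, 0]])  # one sequence, padded with 0s
embeddings = embedding_matrix[token_ids]       # row lookup per token ID
mask = token_ids != 0                          # the kind of mask that mask_zero=True would produce

print(embeddings.shape)  # (1, 6, 128)
print(mask)              # [[ True  True  True  True False False]]
```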

In [None]:

embed_size = 128
encoder_input_ids = text_vec_layer_en(encoder_inputs)
decoder_input_ids = text_vec_layer_es(decoder_inputs)
encoder_embedding_layer = tf.keras.layers.Embedding(vocab_size, embed_size,
                                                    mask_zero=False)
decoder_embedding_layer = tf.keras.layers.Embedding(vocab_size, embed_size,
                                                    mask_zero=False)
encoder_embeddings = encoder_embedding_layer(encoder_input_ids)
decoder_embeddings = decoder_embedding_layer(decoder_input_ids)

In [None]:
"""
tf.random.set_seed(42)
encoder = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(256, return_sequences=True, return_state=True)
)

In [None]:
"""
#return_sequences=True ensures that the encoder outputs a sequence of hidden states,
#one for each input timestep.

encoder_outputs, *encoder_state = encoder(encoder_embeddings)

encoder_state = [tf.concat(encoder_state[::2], axis=-1),  # short-term (0 & 2)
                 tf.concat(encoder_state[1::2], axis=-1)] # long-term (1 & 3)



ValueError: TypeError: object of type 'KerasTensor' has no len()


In [None]:
# Unpacking the encoder outputs at the top level raised
# "NotImplementedError: Iterating over a symbolic KerasTensor is not supported."
# Moving the same logic into a custom layer, as below, worked fine.
class EncoderLayer(tf.keras.layers.Layer):

  def __init__(self, units, **kwargs):
    super().__init__(**kwargs)
    self.units = units
    self.bidirectional = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(units, return_sequences=True, return_state=True)
    )

  def call(self, inputs):
    # The layer returns the outputs followed by four states:
    # forward h, forward c, backward h, backward c
    encoder_outputs, *encoder_state = self.bidirectional(inputs)

    encoder_state = [
        tf.concat(encoder_state[::2], axis=-1),  # short-term states (0 & 2)
        tf.concat(encoder_state[1::2], axis=-1)  # long-term states (1 & 3)
    ]
    return encoder_outputs, encoder_state


In [None]:
encoder = EncoderLayer(256)
encoder_outputs, encoder_state = encoder(encoder_embeddings)
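
The state slicing used in the encoder can be checked with plain NumPy arrays. In this sketch the four states are filled with the constants 0 to 3 so that their order stays visible after concatenation; the batch size of 2 is an arbitrary assumption.

```python
import numpy as np

batch, units = 2, 256
# Stand-ins for the four states: forward h, forward c, backward h, backward c
states = [np.full((batch, units), float(i)) for i in range(4)]

short_term = np.concatenate(states[::2], axis=-1)   # states 0 and 2
long_term = np.concatenate(states[1::2], axis=-1)   # states 1 and 3

print(short_term.shape, long_term.shape)    # (2, 512) (2, 512)
print(short_term[0, 0], short_term[0, -1])  # 0.0 2.0
print(long_term[0, 0], long_term[0, -1])    # 1.0 3.0
```

Each concatenated state has 512 features, which is why a decoder initialized from these states needs 512 units when the encoder uses 256 units per direction.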

In [None]:
"""
decoder = tf.keras.layers.LSTM(512, return_sequences=True)

decoder_outputs = decoder(decoder_embeddings, initial_state=encoder_state)


In [None]:
"""
# Now let's add the Attention layer and the output layer:
attention_layer = tf.keras.layers.Attention()

attention_outputs = attention_layer([decoder_outputs, encoder_outputs])

output_layer = tf.keras.layers.Dense(vocab_size, activation='softmax')

Y_proba = output_layer(attention_outputs)


ValueError: A KerasTensor cannot be used as input to a TensorFlow function. A KerasTensor is a symbolic placeholder for a shape and dtype, used when constructing Keras Functional models or Keras Functions. You can only use it as input to a Keras layer or a Keras operation (from the namespaces `keras.layers` and `keras.operations`). You are likely doing something like:

```
x = Input(...)
...
tf_fn(x)  # Invalid.
```

What you should do instead is wrap `tf_fn` in a layer:

```
class MyLayer(Layer):
    def call(self, x):
        return tf_fn(x)

x = MyLayer()(x)
```


In [None]:
embed_size = 128
encoder_input_ids = text_vec_layer_en(encoder_inputs)
decoder_input_ids = text_vec_layer_es(decoder_inputs)
encoder_embedding_layer = tf.keras.layers.Embedding(vocab_size, embed_size,
                                                    mask_zero=False)
decoder_embedding_layer = tf.keras.layers.Embedding(vocab_size, embed_size,
                                                    mask_zero=False)
encoder_embeddings = encoder_embedding_layer(encoder_input_ids)
decoder_embeddings = decoder_embedding_layer(decoder_input_ids)

`tf.keras.Model` is a subclass of `tf.keras.layers.Layer` that represents a complete model. It groups layers together and provides methods for training, evaluation, and prediction.

The main difference between the two is that a Layer is a single unit of computation, while a Model is a collection of layers that work together to perform a specific task.

In [None]:
# Define the encoder, decoder, and attention layer within a Keras Model
# This ensures that all operations are treated symbolically

class AttentionLayer(tf.keras.Model):

  def __init__(self, units, **kwargs):
    super().__init__(**kwargs)
    self.encoder = EncoderLayer(units)
    self.decoder = tf.keras.layers.LSTM(units * 2, return_sequences=True)
    self.attention_layer = tf.keras.layers.Attention()
    self.output_layer = tf.keras.layers.Dense(vocab_size, activation='softmax')

  def call(self, inputs):
    encoder_embeddings, decoder_embeddings = inputs
    encoder_outputs, encoder_state = self.encoder(encoder_embeddings)
    decoder_outputs = self.decoder(decoder_embeddings, initial_state=encoder_state)
    attention_outputs = self.attention_layer([decoder_outputs, encoder_outputs])
    Y_proba = self.output_layer(attention_outputs)
    return Y_proba

In [None]:
model = AttentionLayer(256)
Y_proba = model([encoder_embeddings, decoder_embeddings])

In [None]:
# Combine the encoder and the attention decoder in a single model class
class Layers(tf.keras.Model):

  def __init__(self, units, **kwargs):
    super().__init__(**kwargs)

    self.encoder = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(units, return_sequences=True, return_state=True))
    # The decoder is twice as wide because the encoder states are concatenated
    self.decoder = tf.keras.layers.LSTM(units * 2, return_sequences=True)
    self.attention_layer = tf.keras.layers.Attention()
    self.output_layer = tf.keras.layers.Dense(vocab_size, activation='softmax')

  def call(self, inputs):
    # inputs is the pair (encoder_embeddings, decoder_embeddings)
    encoder_embeddings, decoder_embeddings = inputs
    encoder_outputs, *encoder_state = self.encoder(encoder_embeddings)
    encoder_state = [
        tf.concat(encoder_state[::2], axis=-1),
        tf.concat(encoder_state[1::2], axis=-1)
    ]
    decoder_outputs = self.decoder(decoder_embeddings, initial_state=encoder_state)
    attention_outputs = self.attention_layer([decoder_outputs, encoder_outputs])
    Y_proba = self.output_layer(attention_outputs)

    return Y_proba

In [None]:
model_comb = tf.keras.Model(inputs= [encoder_inputs, decoder_inputs],
                       outputs = [Y_proba])

model_comb.compile(loss='sparse_categorical_crossentropy', optimizer='nadam',
                                  metrics=['accuracy'])

model_comb.fit((X_train, X_train_dec), Y_train, epochs=10,
                    validation_data=((X_valid, X_valid_dec), Y_valid))

Epoch 1/10
3125/3125 ━━━━━━━━━━━━━━━━━━━━ 91s 28ms/step - accuracy: 0.9744 - loss: 0.0936 - val_accuracy: 0.9586 - val_loss: 0.1914
Epoch 2/10
3125/3125 ━━━━━━━━━━━━━━━━━━━━ 88s 28ms/step - accuracy: 0.9759 - loss: 0.0876 - val_accuracy: 0.9585 - val_loss: 0.1965
Epoch 3/10
3125/3125 ━━━━━━━━━━━━━━━━━━━━ 141s 28ms/step - accuracy: 0.9770 - loss: 0.0826 - val_accuracy: 0.9584 - val_loss: 0.1997
Epoch 4/10
3125/3125 ━━━━━━━━━━━━━━━━━━━━ 86s 28ms/step - accuracy: 0.9779 - loss: 0.0782 - val_accuracy: 0.9582 - val_loss: 0.2047
Epoch 5/10
3125/3125 ━━━━━━━━━━━━━━━━━━━━ 85s 27ms/step - accuracy: 0.9789 - loss: 0.0743 - val_accuracy: 0.9580 - val_loss: 0.2093
Epoch 6/10
3125/3125 ━━━━━━━━━━━━━━━━━━━━ 142s 27ms/step - accuracy: 0.9797 - loss: 0.0708 - val_accuracy: 0.9578 - val_loss: 0.2138
Ep

<keras.src.callbacks.history.History at 0x79b16459f5b0>

In [None]:
model_attention = tf.keras.Model(inputs= [encoder_inputs, decoder_inputs],
                       outputs = [Y_proba])

model_attention.compile(loss='sparse_categorical_crossentropy', optimizer='nadam',
                                  metrics=['accuracy'])

model_attention.fit((X_train, X_train_dec), Y_train, epochs=10,
                    validation_data=((X_valid, X_valid_dec), Y_valid))




Epoch 1/10
3125/3125 ━━━━━━━━━━━━━━━━━━━━ 91s 27ms/step - accuracy: 0.9004 - loss: 0.6324 - val_accuracy: 0.9415 - val_loss: 0.2718
Epoch 2/10
3125/3125 ━━━━━━━━━━━━━━━━━━━━ 140s 27ms/step - accuracy: 0.9458 - loss: 0.2454 - val_accuracy: 0.9534 - val_loss: 0.2016
Epoch 3/10
3125/3125 ━━━━━━━━━━━━━━━━━━━━ 146s 29ms/step - accuracy: 0.9552 - loss: 0.1904 - val_accuracy: 0.9567 - val_loss: 0.1835
Epoch 4/10
3125/3125 ━━━━━━━━━━━━━━━━━━━━ 140s 28ms/step - accuracy: 0.9597 - loss: 0.1665 - val_accuracy: 0.9583 - val_loss: 0.1761
Epoch 5/10
3125/3125 ━━━━━━━━━━━━━━━━━━━━ 157s 33ms/step - accuracy: 0.9630 - loss: 0.1496 - val_accuracy: 0.9588 - val_loss: 0.1745
Epoch 6/10
3125/3125 ━━━━━━━━━━━━━━━━━━━━ 132s 30ms/step - accuracy: 0.9658 - loss: 0.1359 - val_accuracy: 0.9586 - val_loss: 0.1769

<keras.src.callbacks.history.History at 0x79b1dc66e7d0>

In [None]:
model_attention.history.history

{'accuracy': [0.91878741979599,
  0.9491721987724304,
  0.9564043879508972,
  0.9605749845504761,
  0.9636824131011963,
  0.966299295425415,
  0.9684415459632874,
  0.970363974571228,
  0.972104549407959,
  0.9735676646232605],
 'loss': [0.4432101845741272,
  0.22556480765342712,
  0.18337608873844147,
  0.16158923506736755,
  0.14565828442573547,
  0.13279157876968384,
  0.12202385812997818,
  0.11265404522418976,
  0.10467331111431122,
  0.0978928804397583],
 'val_accuracy': [0.9414669275283813,
  0.9534192681312561,
  0.95668625831604,
  0.9582793116569519,
  0.9588291645050049,
  0.9586192965507507,
  0.9591705799102783,
  0.9589720368385315,
  0.9587769508361816,
  0.9587332606315613],
 'val_loss': [0.2718157470226288,
  0.201589435338974,
  0.1834685206413269,
  0.1761321723461151,
  0.1744513213634491,
  0.17688174545764923,
  0.1773817241191864,
  0.18084217607975006,
  0.18504217267036438,
  0.1875397115945816]}
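
A quick way to read this history is to find the epoch with the lowest validation loss. The lists below are the `loss` and `val_loss` values above, rounded to four decimals; the variable names are chosen for this sketch.

```python
train_loss = [0.4432, 0.2256, 0.1834, 0.1616, 0.1457, 0.1328, 0.1220, 0.1127, 0.1047, 0.0979]
val_loss = [0.2718, 0.2016, 0.1835, 0.1761, 0.1745, 0.1769, 0.1774, 0.1808, 0.1850, 0.1875]

# Epochs are 1-based, list indices are 0-based
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(best_epoch)  # 5

# After the best epoch the training loss keeps falling while the validation loss rises
for epoch in range(best_epoch, len(val_loss)):
    print(epoch + 1, train_loss[epoch], val_loss[epoch])
```

The validation loss bottoms out at epoch 5 and then climbs, which suggests overfitting in the later epochs. One possible remedy, not used in this notebook, would be the `tf.keras.callbacks.EarlyStopping` callback with `restore_best_weights=True`.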

In [None]:
model_attention.save('/content/drive/MyDrive/my_model.keras')

In [None]:
def translate(sentence_en):
    # model_attention takes raw strings: its TextVectorization layers tokenize internally
    vocabulary_es = text_vec_layer_es.get_vocabulary()
    translation = ""
    for word_idx in range(max_length):
        X = tf.constant([sentence_en])  # encoder input
        X_dec = tf.constant(["startofseq " + translation])  # decoder input
        # probabilities for the next token, at position word_idx
        y_proba = model_attention.predict((X, X_dec), verbose=0)[0, word_idx]
        predicted_word_id = np.argmax(y_proba)
        predicted_word = vocabulary_es[predicted_word_id]
        if predicted_word == "endofseq":
            break
        translation += " " + predicted_word
    return translation.strip()
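
The loop in `translate` is a greedy decoder: at each step it keeps only the most probable next word and feeds the growing translation back in. The sketch below isolates that loop with a toy stand-in for the model. `toy_predict`, its scripted output, and the small vocabulary are all hypothetical and exist only to make the loop runnable without a trained model.

```python
import numpy as np

vocab = ["", "[UNK]", "startofseq", "endofseq", "me", "gusta", "el", "fútbol"]

def toy_predict(decoder_tokens):
    """Hypothetical stand-in for a model: one probability row per decoder position."""
    script = ["me", "gusta", "el", "fútbol", "endofseq"]
    probas = np.zeros((len(decoder_tokens), len(vocab)))
    for position in range(len(decoder_tokens)):
        word = script[min(position, len(script) - 1)]
        probas[position, vocab.index(word)] = 1.0
    return probas

def greedy_decode(max_length=50):
    translation = []
    for word_idx in range(max_length):
        decoder_tokens = ["startofseq"] + translation
        y_proba = toy_predict(decoder_tokens)[word_idx]  # distribution for the next word
        predicted_word = vocab[int(np.argmax(y_proba))]  # greedy choice
        if predicted_word == "endofseq":
            break
        translation.append(predicted_word)
    return " ".join(translation)

print(greedy_decode())  # me gusta el fútbol
```

Note that each step re-runs the model on the whole prefix, so decoding a sentence of n words costs n forward passes.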