<a href="https://colab.research.google.com/github/cagBRT/promptEngineering/blob/main/Basic_Embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A notebook to help with understanding basic embedding <br>


This notebook focuses on vectorizing text sequences and positional embedding

**Import the libraries**

In [None]:
import tensorflow as tf
from tensorflow import convert_to_tensor, string
from keras.layers import TextVectorization, Embedding, Layer
from tensorflow.data import Dataset
import tensorflow_datasets as tfds
import numpy as np
import matplotlib.pyplot as plt

- Decide on the output sequence length<br>
- The vocabulary size<br>
- The text you want to embed using the text vectorization layer from Keras

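Output sequence length is the number of tokens in each text sequence that goes to the transformer. <br>
Vocabulary size is the maximum number of entries in the dictionary: the unique words in the training text sequences plus the reserved padding and out-of-vocabulary tokens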
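
As a quick check on the vocabulary size, the unique words in the example sentences of this notebook can be counted in plain Python. This is only a minimal sketch: the helper `count_vocabulary` is hypothetical (it is not part of Keras), and it assumes the default `TextVectorization` behaviour of lowercasing the text and reserving two extra entries, one for padding and one for out-of-vocabulary words.

```python
# Sketch: count the vocabulary entries needed for the example sentences.
# Assumes lowercasing and two reserved tokens (padding and out-of-vocabulary).
RESERVED_TOKENS = 2  # index 0 = padding, index 1 = out-of-vocabulary


def count_vocabulary(texts):
    """Return the number of unique lowercase words plus the reserved tokens."""
    unique_words = set()
    for text in texts:
        unique_words.update(text.lower().split())
    return len(unique_words) + RESERVED_TOKENS


example_sentences = [
    "I am a robot",
    "you are a robot",
    "you are not a robot in my mind",
]
needed_vocab_size = count_vocabulary(example_sentences)
print(needed_vocab_size)  # 10 unique words + 2 reserved tokens = 12
```

Under these assumptions, any vocabulary size of at least 12 keeps every word of the three sentences. A smaller value would push the least frequent words to the out-of-vocabulary token.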

In [None]:
output_sequence_length = 5
vocab_size = 15
#note the sentences do not have punctuation
sentences = [["I am a robot"], ["you are a robot"], ["you are not a robot in my mind"]]
sentence_data = Dataset.from_tensor_slices(sentences)

**The text vectorization layer**<br>
The text vectorization layer creates a dictionary of words and replaces each word with its corresponding index in the dictionary.

The output of the layer is a tensor of shape:<br>

(number of sentences, output sequence length)
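
The dictionary lookup described above can be approximated without Keras. The sketch below is illustrative only: `build_vocabulary` and `vectorize` are hypothetical helpers, and the tie-breaking rule for equally frequent words (reverse alphabetical order) is an assumption chosen to agree with the indices listed later in this notebook, not a documented guarantee of `TextVectorization`.

```python
from collections import Counter

PAD_TOKEN = ""       # occupies index 0
OOV_TOKEN = "[UNK]"  # occupies index 1


def build_vocabulary(texts, max_tokens):
    """Reserved tokens first, then words from most to least frequent."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Assumption: ties are broken in reverse alphabetical order.
    # Python's sort is stable, so the second sort keeps that order within ties.
    by_name = sorted(counts, reverse=True)
    ranked = sorted(by_name, key=lambda word: -counts[word])
    return [PAD_TOKEN, OOV_TOKEN] + ranked[: max_tokens - 2]


def vectorize(text, vocabulary):
    """Replace each word with its index; unknown words map to index 1."""
    index_of = {word: i for i, word in enumerate(vocabulary)}
    return [index_of.get(word, 1) for word in text.lower().split()]


texts = ["I am a robot", "you are a robot", "you are not a robot in my mind"]
vocabulary = build_vocabulary(texts, max_tokens=15)
ids = vectorize("you are a human", vocabulary)
print(vocabulary[:5])  # ['', '[UNK]', 'robot', 'a', 'you']
print(ids)             # [4, 5, 3, 1]  ("human" is not in the dictionary)
```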



In [None]:
# Create the TextVectorization layer
vectorize_layer = TextVectorization(
                  output_sequence_length=output_sequence_length,
                  max_tokens=vocab_size)
# Train the layer to create a dictionary
vectorize_layer.adapt(sentence_data)
# Convert all sentences to tensors
word_tensors = convert_to_tensor(sentences, dtype=tf.string)
word_tensors

Each sentence is converted to a sequence of exactly 5 indices, the output sequence length.<br>
Shorter sentences are padded with zeros at the end and longer sentences are truncated
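
Padding and truncation to a fixed length can be sketched in a few lines of plain Python. The function name `pad_or_truncate` and the token ids below are made up for illustration; the sketch assumes padding with zeros at the end and cutting from the end.

```python
def pad_or_truncate(token_ids, output_sequence_length, pad_value=0):
    """Return exactly output_sequence_length ids: cut extras, pad with zeros."""
    clipped = token_ids[:output_sequence_length]
    padding = [pad_value] * (output_sequence_length - len(clipped))
    return clipped + padding


short_ids = [7, 8, 3, 2]               # a 4-word sentence (made-up ids)
long_ids = [4, 5, 6, 3, 2, 9, 10, 11]  # an 8-word sentence (made-up ids)
print(pad_or_truncate(short_ids, 5))  # [7, 8, 3, 2, 0]
print(pad_or_truncate(long_ids, 5))   # [4, 5, 6, 3, 2]
```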

In [None]:
# Use the word tensors to get vectorized phrases
vectorized_words = vectorize_layer(word_tensors)
print(sentences[0])
vectorized_words[0]
#the sequence is 4 words, add a pad to make it length 5

In [None]:
print(sentences[1])
vectorized_words[1]
#the sequence is 4 words, add a pad to make it length 5

Notice in the following vectorization, the sentence is truncated to fit the length limit

In [None]:
print(sentences[2])
vectorized_words[2]
#the sequence is 8 words, truncate to make it length 5

The shape of the vectorized output is (number of sentences, output sequence length).<br>
The vocabulary assigns an index to each word, for example:

>robot=2<br>
a=3<br>
you=4<br>
...


In [None]:
print("Vocabulary: ", vectorize_layer.get_vocabulary())
print("Vectorized words: ", vectorized_words)

**Assignment 1:**<br>
Try changing the sentences, the number of vocabulary words, and the output sequence length.




---





**Embeddings**<br>
The Keras Embedding layer converts integers to dense vectors.<br>
This layer maps these integers to random numbers, which are later tuned during the training phase. However, you also have the option to set the mapping to some predefined weight values.<br>

To initialize this layer, you need to specify the size of the vocabulary (one more than the largest integer to map), along with the number of dimensions of each output vector.
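
Conceptually, an embedding layer is a lookup table with one row per integer. The sketch below shows the idea in plain Python under stated assumptions: `make_embedding_table` and `embed` are hypothetical names, and the small uniform range used for the random values is chosen for illustration only.

```python
import random


def make_embedding_table(vocab_size, embedding_dim, seed=None):
    """Create a vocab_size x embedding_dim table of small random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
            for _ in range(vocab_size)]


def embed(token_ids, table):
    """Look up one row of the table for every integer id."""
    return [table[token_id] for token_id in token_ids]


table = make_embedding_table(vocab_size=15, embedding_dim=6, seed=0)
embedded = embed([4, 5, 3, 2, 0], table)
print(len(embedded), len(embedded[0]))  # 5 6
```

Fixing the seed makes the table reproducible. Without a seed the values change on every run, which matches the behaviour of an untrained embedding layer.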

In [None]:
output_length = 6 #number of dimensions
word_embedding_layer = Embedding(vocab_size, output_length)
embedded_words = word_embedding_layer(vectorized_words)
print(sentences[0],"\n",vectorized_words[0], "\n", embedded_words[0])

**Note in the above tensor**, the vectorized sentence is padded with one zero to give it a length of 5. <br>
Also notice that each vector has six elements, one for each dimension. <br>


Each time you run this code the embedded words will change, because the weights are randomly initialized. They are tuned later, during training. <br>

In [None]:
print(sentences[1],"\n",vectorized_words[1], "\n", embedded_words[1])

In [None]:
print(sentences[2],"\n",vectorized_words[2], "\n", embedded_words[2])

**Position encoding**<br>
You also need the embeddings for the corresponding positions.



In [None]:
position_embedding_layer = Embedding(output_sequence_length, output_length)
position_indices = tf.range(output_sequence_length)
print(position_indices)

In [None]:
embedded_indices = position_embedding_layer(position_indices)

The output sequence length is five.

In a transformer model, the final output is the sum of both the word embeddings and the position embeddings.<br>

When you set up both embedding layers, you need to make sure that the output_length is the same for both.
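
The reason both layers need the same output_length is that the sum is taken element by element. A minimal sketch with made-up numbers follows; `add_position_embeddings` is a hypothetical helper that handles a single sentence.

```python
def add_position_embeddings(word_embeddings, position_embeddings):
    """Add the position vector for slot p to the word vector in slot p."""
    if len(word_embeddings) != len(position_embeddings):
        raise ValueError("sequence lengths differ")
    summed = []
    for word_vec, pos_vec in zip(word_embeddings, position_embeddings):
        if len(word_vec) != len(pos_vec):
            raise ValueError("embedding dimensions differ")
        summed.append([w + p for w, p in zip(word_vec, pos_vec)])
    return summed


# Two positions, three dimensions (made-up numbers).
words = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
positions = [[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]]
print(add_position_embeddings(words, positions))
# [[1.5, 2.5, 3.5], [5.0, 6.0, 7.0]]
```

With a batch of sentences, the same position embeddings are added to every sentence in the batch.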

In [None]:
for i in range(output_sequence_length):
  print("index",[i], ": ",embedded_indices[i])

The final output embedding is the sum of the embedded words and the embedded indices

In [None]:
final_output_embedding = embedded_words + embedded_indices

In [None]:
for i in range(len(sentences)):
  print(sentences[i],"\n" "Final output: ",  final_output_embedding[i])
  print("\n")
#there are 5 rows - one for each position in the sequence
#there are 6 columns, to match the embedding dimension (output_length)




---

**Using fixed weights**
---



---



**The output of the positional encoding layer in transformers**<br>
In a transformer model, the final output is the sum of both the word embeddings and the position embeddings. Hence, when you set up both embedding layers, you need to make sure that the output_length is the same for both.

The weights are initialized randomly and tuned during the training phase.


The following example shows how you can subclass the Keras Layer class to combine both embeddings in a single layer. You can add more methods to it as you require.

In [None]:
#Note: KerasNLP offers ready-made layers such as PositionEmbedding that can be used instead of writing your own
class PositionEmbeddingLayer(Layer):
    def __init__(self, sequence_length, vocab_size, output_dim, **kwargs):
        super(PositionEmbeddingLayer, self).__init__(**kwargs)
        self.word_embedding_layer = Embedding(
            input_dim=vocab_size, output_dim=output_dim
        )
        self.position_embedding_layer = Embedding(
            input_dim=sequence_length, output_dim=output_dim
        )

    def call(self, inputs):
        position_indices = tf.range(tf.shape(inputs)[-1])
        embedded_words = self.word_embedding_layer(inputs)
        embedded_indices = self.position_embedding_layer(position_indices)
        return embedded_words + embedded_indices

In [None]:

my_embedding_layer = PositionEmbeddingLayer(output_sequence_length,
                                            vocab_size, output_length)
embedded_layer_output = my_embedding_layer(vectorized_words)
for i in range(len(sentences)):
  print(sentences[i],"Output from my_embedded_layer: ", embedded_layer_output[i], "\n")

Note that the above class creates an embedding layer with trainable weights. Hence, the weights are initialized randomly and tuned during the training phase.

To use fixed weights instead, as in the paper Attention Is All You Need, provide the positional encoding matrix as the weights of the Embedding layer along with trainable=False. Let’s create another positional embedding class that does exactly this.
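
For reference, the positional encoding matrix is the sinusoidal encoding computed by the `get_position_encoding` method of the following class. Written out, with $k$ the position, $d$ the embedding dimension, $i = 0, \dots, d/2 - 1$, and $n$ a constant (10000 in the code):

$$
P(k, 2i) = \sin\left(\frac{k}{n^{2i/d}}\right), \qquad
P(k, 2i+1) = \cos\left(\frac{k}{n^{2i/d}}\right)
$$

Even columns hold sines and odd columns hold cosines, so every entry lies between -1 and 1.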

In [None]:
class PositionEmbeddingFixedWeights(Layer):
    def __init__(self, sequence_length, vocab_size, output_dim, **kwargs):
        super(PositionEmbeddingFixedWeights, self).__init__(**kwargs)
        word_embedding_matrix = self.get_position_encoding(vocab_size, output_dim)
        position_embedding_matrix = self.get_position_encoding(sequence_length, output_dim)
        self.word_embedding_layer = Embedding(
            input_dim=vocab_size, output_dim=output_dim,
            weights=[word_embedding_matrix],
            trainable=False
        )
        self.position_embedding_layer = Embedding(
            input_dim=sequence_length, output_dim=output_dim,
            weights=[position_embedding_matrix],
            trainable=False
        )

    def get_position_encoding(self, seq_len, d, n=10000):
        P = np.zeros((seq_len, d))
        for k in range(seq_len):
            for i in np.arange(int(d/2)):
                denominator = np.power(n, 2*i/d)
                P[k, 2*i] = np.sin(k/denominator)
                P[k, 2*i+1] = np.cos(k/denominator)
        return P


    def call(self, inputs):
        position_indices = tf.range(tf.shape(inputs)[-1])
        embedded_words = self.word_embedding_layer(inputs)
        embedded_indices = self.position_embedding_layer(position_indices)
        return embedded_words + embedded_indices
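
The sinusoidal table built by `get_position_encoding` can be checked outside of Keras. The sketch below repeats the same loop using only the standard library; the function name `position_encoding` is hypothetical, and the sizes (5 positions, 6 dimensions) mirror the small example used earlier in this notebook.

```python
import math


def position_encoding(seq_len, d, n=10000):
    """Sinusoidal table: sine in even columns, cosine in odd columns."""
    table = [[0.0] * d for _ in range(seq_len)]
    for k in range(seq_len):
        for i in range(d // 2):
            denominator = n ** (2 * i / d)
            table[k][2 * i] = math.sin(k / denominator)
            table[k][2 * i + 1] = math.cos(k / denominator)
    return table


encoding = position_encoding(seq_len=5, d=6)
print(encoding[0])  # position 0: [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
print([round(value, 4) for value in encoding[1][:2]])  # [0.8415, 0.5403]
```

Position 0 always alternates between 0 and 1 because sin(0) = 0 and cos(0) = 1. The rows for later positions differ from each other, which is what lets the model tell positions apart.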

Attention mechanisms are deep learning techniques that let a model place additional focus on specific parts of its input. In deep learning, attention means focusing on something in particular and weighting its importance

In [None]:
attnisallyouneed_embedding = PositionEmbeddingFixedWeights(output_sequence_length,
                                            vocab_size, output_length)
attnisallyouneed_output = attnisallyouneed_embedding(vectorized_words)
#print("Output from attnisallyouneed_embedding: ", attnisallyouneed_output)
# Print the fixed-weight output, not the output of the trainable layer created earlier
for i in range(len(sentences)):
  print(sentences[i],"Output from attnisallyouneed_embedding: ", attnisallyouneed_output[i], "\n")



---






# An example with fixed weights

The vocabulary is built from two phrases, a technical one and a wise one<br>
The sequence length is 20 <br>
The embedding dimension (the length of each output vector) is 50<br>

In [None]:
total_vocabulary = 200
sequence_length = 20
final_output_len = 50

Our corpus comes from these two phrases

In [None]:
technical_phrase = "to understand machine learning algorithms you need" +\
                   " to understand concepts such as gradient of a function "+\
                   "Hessians of a matrix and optimization etc"
wise_phrase = "patrick henry said give me liberty or give me death "+\
              "when he addressed the second virginia convention in march"

Vectorize the two phrases

In [None]:
phrase_vectorization_layer = TextVectorization(
                  output_sequence_length=sequence_length,
                  max_tokens=total_vocabulary)
# Learn the dictionary
phrase_vectorization_layer.adapt([technical_phrase, wise_phrase])

In [None]:
# Convert all sentences to tensors
phrase_tensors = convert_to_tensor([technical_phrase, wise_phrase],
                                   dtype=tf.string)

phrase_tensors
#tech phrase = 23 words
#wise phrase = 19 words

In [None]:
# Use the word tensors to get vectorized phrases
vectorized_phrases = phrase_vectorization_layer(phrase_tensors)
vectorized_phrases
#notice there is no BOS or EOS

In [None]:
random_weights_embedding_layer = PositionEmbeddingLayer(sequence_length,
                                                        total_vocabulary,
                                                        final_output_len)
fixed_weights_embedding_layer = PositionEmbeddingFixedWeights(sequence_length,
                                                        total_vocabulary,
                                                        final_output_len)
random_embedding = random_weights_embedding_layer(vectorized_phrases)
fixed_embedding = fixed_weights_embedding_layer(vectorized_phrases)

Note the sequence length is 20<br>
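
Because every phrase must occupy exactly 20 positions, one phrase is cut and the other is padded. The sketch below counts the words to show by how much; `fit_report` is a hypothetical helper, and splitting on whitespace is assumed to match the tokenization for these punctuation-free phrases.

```python
def fit_report(phrase, sequence_length):
    """Describe how a phrase is fitted to a fixed number of positions."""
    n_words = len(phrase.split())
    return {
        "words": n_words,
        "truncated": max(n_words - sequence_length, 0),
        "padded": max(sequence_length - n_words, 0),
    }


technical_phrase = ("to understand machine learning algorithms you need"
                    " to understand concepts such as gradient of a function "
                    "Hessians of a matrix and optimization etc")
wise_phrase = ("patrick henry said give me liberty or give me death "
               "when he addressed the second virginia convention in march")

print(fit_report(technical_phrase, 20))  # {'words': 23, 'truncated': 3, 'padded': 0}
print(fit_report(wise_phrase, 20))       # {'words': 19, 'truncated': 0, 'padded': 1}
```

The last three words of the technical phrase ("and optimization etc") do not reach the embedding layer, and the wise phrase ends with a single padding zero.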


In [None]:
print(technical_phrase,"\nVectorized phrase: ", vectorized_phrases[0], "\n")

**The vectorized phrase**<br>
3 - to <br>
2  - understand<br>
21 - machine <br>
23 - learning <br>
36 - algorithms <br>
8 - you<br>
18 - need<br>
3  - to <br>
2 - understand<br>
33 - concepts<br>
12 - such<br>
34 - as<br>
28 - gradient<br>
4  - of<br>
7 - a<br>
29 - function<br>
25 - Hessians<br>
4  - of <br>
7 - a <br>
19 - matrix

In [None]:
print(wise_phrase,"\nVectorized phrase: ", vectorized_phrases[1], "\n")

Now we take our vectorized text sequence and embed it with the position information

In [None]:
fixed_embedding[0]

In [None]:
#fig = plt.figure(figsize=(15, 5))
#title = ["Tech Phrase", "Wise Phrase"]
#for i in range(2):
#    ax = plt.subplot(1, 2, 1+i)
#    matrix = tf.reshape(random_embedding[i, :, :], (sequence_length, final_output_len))
#    cax = ax.matshow(matrix)
#    plt.gcf().colorbar(cax)
#    plt.title(title[i], y=1.2)
#fig.suptitle("Random Embedding")
#plt.show()

In [None]:
#fig = plt.figure(figsize=(15, 5))
#title = ["Tech Phrase", "Wise Phrase"]
#for i in range(2):
#    ax = plt.subplot(1, 2, 1+i)
#    matrix = tf.reshape(fixed_embedding[i, :, :], (sequence_length, final_output_len))
#    cax = ax.matshow(matrix)
#    plt.gcf().colorbar(cax)
#    plt.title(title[i], y=1.2)
#fig.suptitle("Fixed Weight Embedding from Attention is All You Need")
#plt.show()