<a href="https://colab.research.google.com/github/cagBRT/promptEngineering/blob/main/Basic_Embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Import the libraries**

In [None]:
import tensorflow as tf
from tensorflow import convert_to_tensor, string
from keras.layers import TextVectorization, Embedding, Layer
from tensorflow.data import Dataset
import tensorflow_datasets as tfds
import numpy as np
import matplotlib.pyplot as plt

- Decide on the output sequence length<br>
- Decide on the vocabulary size<br>
- Provide the text you want to embed using the text vectorization layer from Keras

In [None]:
output_sequence_length = 5
vocab_size = 15
#note the sentences do not have punctuation
sentences = [["I am a robot"], ["you are a robot"], ["you are not a robot in my mind"]]
sentence_data = Dataset.from_tensor_slices(sentences)

**The text vectorization layer**<br>
The text vectorization layer creates a dictionary of words and replaces each word with its corresponding index in the dictionary.

The output of the layer is a tensor of shape:<br>

(number of sentences, output sequence length)
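As a rough illustration of the dictionary step, the plain-Python sketch below builds a frequency-ranked vocabulary and replaces each word with its index. It is not the Keras implementation: `build_vocab` is a hypothetical helper, the reserved entries `''` (padding) and `'[UNK]'` (unknown word) follow the layer's default vocabulary layout, and the order of equally frequent words is an assumption that may differ from what `get_vocabulary()` returns.

```python
from collections import Counter

# Same toy corpus as the notebook, flattened to plain strings.
sentences = ["I am a robot", "you are a robot", "you are not a robot in my mind"]

def build_vocab(texts, max_tokens):
    """Hypothetical helper: lowercase, split on whitespace, rank words by frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Indices 0 and 1 are reserved for padding and unknown words.
    vocab = ["", "[UNK]"] + [word for word, _ in counts.most_common()]
    return vocab[:max_tokens]

vocab = build_vocab(sentences, max_tokens=15)
word_to_index = {word: i for i, word in enumerate(vocab)}

# Replace every word with its index; words missing from the vocabulary map to 1.
encoded = [[word_to_index.get(w, 1) for w in s.lower().split()] for s in sentences]

print(vocab)
print(encoded)
```

The encoded sentences still have different lengths at this point. Padding or truncating them to a common length is what gives the layer output its fixed second dimension.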



In [None]:
# Create the TextVectorization layer
vectorize_layer = TextVectorization(
                  output_sequence_length=output_sequence_length,
                  max_tokens=vocab_size)
# Train the layer to create a dictionary
vectorize_layer.adapt(sentence_data)
# Convert all sentences to tensors
word_tensors = convert_to_tensor(sentences, dtype=tf.string)
word_tensors

Each sentence is padded with zeros or truncated so that it has exactly `output_sequence_length` tokens (5 here).
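The padding and truncation rule can be sketched in a few lines of plain Python. `pad_or_truncate` is a hypothetical helper and the token ids are made-up values for illustration. The sketch assumes padding with 0 at the end of the sequence and dropping tokens from the end.

```python
def pad_or_truncate(token_ids, length, pad_id=0):
    """Hypothetical helper: force a list of token ids to exactly `length` items."""
    clipped = token_ids[:length]                          # drop anything past `length`
    return clipped + [pad_id] * (length - len(clipped))   # fill the rest with pad_id

short = [6, 7, 2, 3]                # 4 tokens: one pad value is appended
long = [4, 5, 8, 2, 3, 9, 10, 11]   # 8 tokens: the last three are dropped

print(pad_or_truncate(short, 5))
print(pad_or_truncate(long, 5))
```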

In [None]:
# Use the word tensors to get vectorized phrases
vectorized_words = vectorize_layer(word_tensors)
print(sentences[0])
vectorized_words[0]

In [None]:
print(sentences[1])
vectorized_words[1]

In [None]:
print(sentences[2])
vectorized_words[2]

The output shape is (number of sentences, output sequence length), which is (3, 5) here.

In [None]:
print("Vocabulary: ", vectorize_layer.get_vocabulary())
print("Vectorized words: ", vectorized_words)

**Embeddings**<br>
The Keras Embedding layer converts integers to dense vectors.<br>
This layer maps these integers to random numbers, which are later tuned during the training phase. However, you also have the option to set the mapping to some predefined weight values.<br>

To initialize this layer, you need to specify the size of the vocabulary (one more than the largest integer index to map), along with the number of dimensions of each output vector.
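Conceptually, an embedding layer is a lookup table with one row per integer index. The sketch below imitates that with a list of random vectors in plain Python. It is only an illustration, not the Keras layer: `table` and `embed` are hypothetical names, the value range is an assumption, and nothing here is trainable.

```python
import random

vocab_size = 15      # number of rows: one per possible integer index
embedding_dim = 6    # number of columns: length of each dense vector

random.seed(0)  # fixed seed only so the sketch is repeatable
# One small random vector per index, standing in for the layer's weights.
table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
         for _ in range(vocab_size)]

def embed(token_ids):
    """Hypothetical helper: look up one row of the table per token id."""
    return [table[i] for i in token_ids]

vectors = embed([6, 7, 2, 3, 0])   # made-up token ids for a 5-token sentence
print(len(vectors), len(vectors[0]))
```

A sentence of 5 token ids becomes 5 vectors of 6 numbers each, and the same token id always returns the same row.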

In [None]:
output_length = 6 #number of dimensions
word_embedding_layer = Embedding(vocab_size, output_length)
embedded_words = word_embedding_layer(vectorized_words)
print(sentences[0],"\n",vectorized_words[0], "\n", embedded_words[0])

Each time you run this code, the embedded words will change. This is because the weights are randomly initialized. They are tuned later, during training.

In [None]:
print(sentences[1],"\n",vectorized_words[1], "\n", embedded_words[1])

In [None]:
print(sentences[2],"\n",vectorized_words[2], "\n", embedded_words[2])

**Position encoding**<br>
You also need an embedding for each position in the sequence. The number of positions equals the output sequence length of the TextVectorization layer.



In [None]:
position_embedding_layer = Embedding(output_sequence_length, output_length)
position_indices = tf.range(output_sequence_length)
print(position_indices)

In [None]:
embedded_indices = position_embedding_layer(position_indices)

The output sequence length is five, so there are five position embeddings, one per position.

In a transformer model, the final embedding is the sum of the word embeddings and the position embeddings.<br>

When you set up both embedding layers, you need to make sure that the output_length is the same for both.
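The sum works because the same position vectors are reused for every sentence: a word tensor of shape (sentences, positions, dimensions) is added to a position tensor of shape (positions, dimensions). The plain-Python sketch below spells that out with toy sizes and made-up numbers. `add_positions` is a hypothetical helper, not part of Keras.

```python
# Toy sizes: 2 sentences, 3 positions, 2 embedding dimensions.
word_embeddings = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],   # sentence 0
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],   # sentence 1
]
# One vector per position, shared by all sentences.
position_embeddings = [[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]]

def add_positions(words, positions):
    """Hypothetical helper: add position vector p to the word vector at position p."""
    return [
        [[w + p for w, p in zip(word_vec, pos_vec)]
         for word_vec, pos_vec in zip(sentence, positions)]
        for sentence in words
    ]

combined = add_positions(word_embeddings, position_embeddings)
print(combined[1])
```

If the two embeddings had different numbers of dimensions, the element-wise addition would have no matching partner for some values, which is why both layers need the same `output_length`.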

In [None]:
# One embedding vector per position index
for i in range(output_sequence_length):
  print("index", i, ": ", embedded_indices[i])

In [None]:
final_output_embedding = embedded_words + embedded_indices
for i in range(len(sentences)):
  print(sentences[i],"\n" "Final output: ",  final_output_embedding[i])
  print("\n") 
# Each sentence has 5 rows, one for each position in the padded sequence
# Each row has 6 columns, one for each embedding dimension (output_length)