# Text Vectorization Example

This notebook demonstrates how to build a simple text vectorizer from scratch. It covers:
- Creating a vocabulary
- Encoding text to numerical tokens
- Decoding tokens back to text
- Handling unknown (out-of-vocabulary) words


## 1. Importing the Vectorizer

We first prepare a small dataset to work with. Then we import our custom `TextVectorization` module and initialize its `Vectorizer` class.

In [11]:

# Sample dataset of short sentences
dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms."
]

# Choose the first sentence from the dataset
text = dataset[0]
# Sentence with unknown words ("have" and "book" are not in training data)
test = "I have a book."

In [None]:
# Import the vectorizer module
import TextVectorization as tv

# Initialize the vectorizer
vectorizer = tv.Vectorizer()


## 2. Creating Vocabulary and Encoding/Decoding a Sentence

We build the vocabulary from the dataset, encode a sentence into tokens (numbers), and then decode it back into text.


In [None]:
# Build vocabulary from the dataset
vectorizer.make_vocabulary(dataset)

# Encode the sentence to tokens
encoded = vectorizer.encode(text)
print(f'Encoded: {encoded}')  # Example: [2, 3, 4, 5]

# Decode the tokens back to words
print(f'Decoded: {vectorizer.decode(encoded)}')  # Example: i write erase rewrite


Encoded: [2, 3, 4, 5]
Decoded: i write erase rewrite
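
The custom `TextVectorization` module itself is not shown in this notebook. As a rough illustration of what could happen behind `make_vocabulary`, `encode`, and `decode`, here is a minimal sketch. The class name `SketchVectorizer`, the lowercasing and punctuation stripping, and the reservation of index 0 for padding and index 1 for `[UNK]` are assumptions inferred from the output above, not the module's actual implementation.

```python
import string


class SketchVectorizer:
    """Hypothetical stand-in for tv.Vectorizer; not the real module."""

    def __init__(self):
        # Assumption: index 0 is reserved (e.g. padding), index 1 is the unknown token.
        self.vocabulary = ["", "[UNK]"]
        self.word_index = {}

    def standardize(self, text):
        # Lowercase and drop punctuation so "Erase," and "erase" share one token.
        text = text.lower()
        return "".join(ch for ch in text if ch not in string.punctuation)

    def tokenize(self, text):
        # Split the cleaned text on whitespace.
        return self.standardize(text).split()

    def make_vocabulary(self, dataset):
        # Add each new word in order of first appearance.
        for text in dataset:
            for token in self.tokenize(text):
                if token not in self.vocabulary:
                    self.vocabulary.append(token)
        self.word_index = {word: i for i, word in enumerate(self.vocabulary)}

    def encode(self, text):
        # Words missing from the vocabulary fall back to index 1 ([UNK]).
        return [self.word_index.get(token, 1) for token in self.tokenize(text)]

    def decode(self, tokens):
        # Look each index up in the vocabulary list and join with spaces.
        return " ".join(self.vocabulary[i] for i in tokens)


sketch = SketchVectorizer()
sketch.make_vocabulary([
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
])
sketch_encoded = sketch.encode("I write, erase, rewrite")
print(sketch_encoded)                 # [2, 3, 4, 5]
print(sketch.decode(sketch_encoded))  # i write erase rewrite
```

Note that decoding returns the standardized text, not the original: capitalization and punctuation are lost in the round trip.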


## 3. Handling Unknown Words

Now we try encoding a sentence with words not in the original vocabulary. The unknown words are replaced by a special `[UNK]` token.


In [None]:
# Encode and decode
encoded = vectorizer.encode(test)
print(f'Encoded: {encoded}')  # Example: [2, 1, 9, 1]
print(f'Decoded: {vectorizer.decode(encoded)}')  # Example: i [UNK] a [UNK]


Encoded: [2, 1, 9, 1]
Decoded: i [UNK] a [UNK]
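
The fallback to `[UNK]` can be pictured as a dictionary lookup with a default value. The snippet below is a minimal sketch of that idea: the `word_index` mapping is written out by hand to mirror the indices in the output above, and `encode_with_report` is a hypothetical helper, not part of the custom module.

```python
UNK_INDEX = 1  # assumption: index 1 is the [UNK] token, as in the output above

# Hand-written excerpt of a vocabulary mapping (illustrative only).
word_index = {"i": 2, "write": 3, "erase": 4, "rewrite": 5, "a": 9}


def encode_with_report(words, word_index):
    """Return token ids plus the list of words that were out of vocabulary."""
    ids, unknown = [], []
    for word in words:
        # dict.get returns the default when the word is not a key.
        index = word_index.get(word, UNK_INDEX)
        if index == UNK_INDEX:
            unknown.append(word)
        ids.append(index)
    return ids, unknown


ids, unknown = encode_with_report(["i", "have", "a", "book"], word_index)
print(ids)      # [2, 1, 9, 1]
print(unknown)  # ['have', 'book']
```

Because every unknown word maps to the same index, decoding cannot recover "have" or "book"; both come back as `[UNK]`.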


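## 4. Comparison with the Keras `TextVectorization` Layer

Keras ships a built-in `TextVectorization` layer that performs the same steps. Here we adapt it to the same dataset and encode a sentence that contains one unknown word ("still").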


In [15]:
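from tensorflow.keras.layers import TextVectorization

# Create a vectorizer from the built-in Keras layer.
# Named keras_vectorizer so it does not shadow the `tv` module alias imported earlier.
keras_vectorizer = TextVectorization(output_mode="int")

# Build the vocabulary from the dataset
keras_vectorizer.adapt(dataset)


In [16]:
# Vocabulary as a list: the position of a word is its token index
voc = keras_vectorizer.get_vocabulary()
test_sentence = "I write, rewrite, and still rewrite again"
encoded = keras_vectorizer(test_sentence)
print(f'Encoded: {encoded.numpy()}')  # .numpy() prints only the array of token ids

# Map each index back to its word
inverse_voc = dict(enumerate(voc))

decoded = " ".join(inverse_voc[int(i)] for i in encoded)

print(f'Decoded: {decoded}')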

Encoded: [ 7  3  5  9  1  5 10]
Decoded: i write rewrite and [UNK] rewrite again
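
The indices differ from those of the custom vectorizer because Keras orders its vocabulary by descending word frequency after the reserved padding and `[UNK]` entries, whereas the custom vectorizer appears to assign indices by first appearance. A plain-Python sketch of the frequency-ordering idea follows. `frequency_vocabulary` is a hypothetical helper, and the order among equally frequent words here follows Python's `Counter`, which may differ from what Keras produces.

```python
import string
from collections import Counter


def frequency_vocabulary(dataset):
    """Build a vocabulary with the most frequent words first (illustrative only)."""
    counts = Counter()
    for text in dataset:
        # Same assumed standardization as before: lowercase, strip punctuation.
        cleaned = "".join(ch for ch in text.lower() if ch not in string.punctuation)
        counts.update(cleaned.split())
    # Indices 0 and 1 are reserved for padding and unknown words.
    return ["", "[UNK]"] + [word for word, _ in counts.most_common()]


freq_vocab = frequency_vocabulary([
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
])
print(freq_vocab[:3])  # ['', '[UNK]', 'erase']
```

"erase" is the only word that occurs twice in the dataset, so it takes the first free index (2) under frequency ordering.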
