# Text Vectorization Example

This notebook demonstrates how to build a simple text vectorizer from scratch. It covers:
- Creating a vocabulary
- Encoding text to numerical tokens
- Decoding tokens back to text
- Handling unknown (out-of-vocabulary) words


## 1. Importing the Vectorizer

We start by importing our custom `TextVectorization` module and initializing the `Vectorizer` class. Then, we prepare a small dataset to work with.


First, initialize the dataset:

In [4]:
# Sample dataset of short sentences
dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms."
]

In [2]:
# Import the vectorizer module
import TextVectorization as tv

# Initialize the vectorizer
vectorizer = tv.Vectorizer()
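The `TextVectorization` module itself is not shown in this notebook. The following is a minimal sketch of a class with the same interface (`make_vocabulary`, `encode`, `decode`). It is not the module's actual code. It assumes lowercasing, removal of ASCII punctuation, index 0 reserved for padding, index 1 reserved for `[UNK]`, and indices assigned in order of first appearance. The helper names `standardize` and `tokenize` are hypothetical.

```python
import string


class Vectorizer:
    """Hypothetical stand-in for tv.Vectorizer; the real module is not shown here."""

    def standardize(self, text):
        # Assumption: lowercase the text and drop ASCII punctuation
        text = text.lower()
        return "".join(ch for ch in text if ch not in string.punctuation)

    def tokenize(self, text):
        # Split the standardized text on whitespace
        return self.standardize(text).split()

    def make_vocabulary(self, dataset):
        # Assumption: index 0 is padding, index 1 is the unknown-word token
        self.vocabulary = {"": 0, "[UNK]": 1}
        for text in dataset:
            for token in self.tokenize(text):
                if token not in self.vocabulary:
                    # New words get the next free index (first-appearance order)
                    self.vocabulary[token] = len(self.vocabulary)
        self.inverse_vocabulary = {i: t for t, i in self.vocabulary.items()}

    def encode(self, text):
        # Words missing from the vocabulary map to index 1 ([UNK])
        return [self.vocabulary.get(token, 1) for token in self.tokenize(text)]

    def decode(self, int_sequence):
        return " ".join(self.inverse_vocabulary.get(i, "[UNK]") for i in int_sequence)
```

Under these assumptions, the sketch can be used exactly like `tv.Vectorizer()` in the cells of this notebook.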


## 2. Creating Vocabulary

**Encoding/Decoding a Sentence**

We build the vocabulary from the dataset, encode a sentence into tokens (numbers), and then decode it back into text.


In [3]:
# Build vocabulary from the dataset
vectorizer.make_vocabulary(dataset)
test_sentence = "I write, rewrite, and still rewrite again"

# Encode the sentence into integer tokens
encoded = vectorizer.encode(test_sentence)
print(f'Encoded: {encoded}')

# Decode the tokens back into words
print(f'Decoded: {vectorizer.decode(encoded)}')


Encoded: [2, 3, 5, 7, 1, 5, 6]
Decoded: i write rewrite and [UNK] rewrite again


## 3. Handling Unknown Words

Now we try encoding a sentence with words not in the original vocabulary. The unknown words are replaced by a special `[UNK]` token.


In [4]:
sample_text = "I am a book lover"
encoded = vectorizer.encode(sample_text)
print(f'Encoded: {encoded}') 
print(f'Decoded: {vectorizer.decode(encoded)}')


Encoded: [2, 1, 9, 1, 1]
Decoded: i [UNK] a [UNK] [UNK]


The result contains three `[UNK]` tokens, one for each out-of-vocabulary word ("am", "book", "lover"). Each of them is encoded with index `1`.
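As a small illustration, the positions of the unknown words can be recovered from the encoded list by looking for index `1`. This sketch hard-codes the encoded output shown above and assumes one token per whitespace-separated word, which holds for this sentence because it has no punctuation. The names `UNK_INDEX`, `unknown_positions`, and `oov_rate` are illustrative only.

```python
UNK_INDEX = 1  # index used for [UNK] in the output above

sample_text = "I am a book lover"
encoded = [2, 1, 9, 1, 1]  # copied from the encoded output above

words = sample_text.lower().split()

# Positions in the sequence where the token is [UNK]
unknown_positions = [pos for pos, token in enumerate(encoded) if token == UNK_INDEX]

# The original words at those positions
unknown_words = [words[pos] for pos in unknown_positions]

# Share of tokens that were out of vocabulary
oov_rate = len(unknown_positions) / len(encoded)

print(unknown_positions)  # [1, 3, 4]
print(unknown_words)      # ['am', 'book', 'lover']
print(f"{oov_rate:.0%}")  # 60%
```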

## 4. TextVectorization by Keras

To build the vocabulary in Keras, create a `TextVectorization` layer (named `tv` below) and call `tv.adapt(dataset)`.


In [2]:
%pip install tensorflow


In [1]:
from tensorflow.keras.layers import TextVectorization

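# Create a text vectorizer from the Keras layer, with integer token output.
# Note: this rebinds `tv`, which earlier referred to the custom module.
tv = TextVectorization(output_mode="int")

# Build the vocabulary from the dataset
tv.adapt(dataset)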


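Running this cell before `dataset` is defined raises `NameError: name 'dataset' is not defined`, so run the dataset cell from Section 1 first.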

**Encoding/Decoding using `keras`**

In [6]:
voc = tv.get_vocabulary()
test_sentence = "I write, rewrite, and still rewrite again"
encoded = tv(test_sentence)
print(f'Encoded: {encoded}')

inverse_voc = dict(enumerate(voc))

decoded = " ".join(inverse_voc[int(i)] for i in encoded)

print(f'Decoded: {decoded}')

Encoded: [ 7  3  5  9  1  5 10]
Decoded: i write rewrite and [UNK] rewrite again


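The vector generated by `keras`: [7, 3, 5, 9, 1, 5, 10]

The vector generated from scratch: [2, 3, 5, 7, 1, 5, 6]

The results differ because the two vectorizers index the vocabulary differently. Both map unknown words to index 1, but Keras orders the vocabulary by descending token frequency, while the from-scratch indices are consistent with the order in which words first appear in the dataset.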
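The two indexing schemes can be compared in plain Python without TensorFlow. This is a sketch under stated assumptions, not the code of either vectorizer: first-appearance order is inferred from the from-scratch output, and the tie-breaking rule for equally frequent words (reverse alphabetical) is inferred from the Keras output above. The names `first_seen`, `by_frequency`, and `encode` are illustrative.

```python
import string
from collections import Counter

dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]


def tokenize(text):
    # Assumption: lowercase, drop ASCII punctuation, split on whitespace
    text = text.lower()
    return "".join(ch for ch in text if ch not in string.punctuation).split()


tokens = [token for sentence in dataset for token in tokenize(sentence)]

# Scheme A: indices follow order of first appearance (0 = padding, 1 = [UNK])
first_seen = ["", "[UNK]"] + list(dict.fromkeys(tokens))

# Scheme B: descending frequency; ties broken in reverse alphabetical order
counts = Counter(tokens)
by_frequency = ["", "[UNK]"] + sorted(
    counts, key=lambda t: (counts[t], t), reverse=True
)


def encode(text, vocab):
    index = {token: i for i, token in enumerate(vocab)}
    return [index.get(token, 1) for token in tokenize(text)]


test_sentence = "I write, rewrite, and still rewrite again"
print(encode(test_sentence, first_seen))    # [2, 3, 5, 7, 1, 5, 6]
print(encode(test_sentence, by_frequency))  # [7, 3, 5, 9, 1, 5, 10]
```

Both schemes reproduce the vectors shown above for this dataset. Only "erase" occurs twice, so it receives index 2 under frequency ordering.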