In [2]:
import tensorflow as tf

# TextVectorization Layer Practice

A TextVectorization layer is a text preprocessing layer. This layer maps text features to integer sequences.

Here is the constructor for the TextVectorization layer, shown with its default arguments:

In [4]:
tf.keras.layers.TextVectorization(
    max_tokens=None,
    standardize='lower_and_strip_punctuation',
    split='whitespace',
    ngrams=None,
    output_mode='int',
    output_sequence_length=None,
    pad_to_max_tokens=False,
    vocabulary=None,
    idf_weights=None,
    sparse=False,
    ragged=False,
    # **kwargs
)


<keras.layers.preprocessing.text_vectorization.TextVectorization at 0x7fdaaa2f2ee0>

This layer transforms a batch of strings (one observation = one string) into either a list of token indices (one observation = 1D tensor of integer token indices) or a dense representation (one observation = 1D tensor of float values representing data about the observation's tokens).
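To make the two output forms concrete, here is a minimal sketch that runs the same sentence through an `'int'` layer and a `'count'` layer. The two-word vocabulary and the sample sentence are made up for illustration, and the values in the comments assume the layer's default padding and out-of-vocabulary handling.

In [ ]:
```python
import tensorflow as tf

# Hypothetical two-word vocabulary, supplied at construction time.
vocab = ["cat", "dog"]
sample = tf.constant(["cat cat bird"])

# 'int' mode: one integer index per token.
# Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens.
int_layer = tf.keras.layers.TextVectorization(output_mode="int", vocabulary=vocab)
int_ids = int_layer(sample).numpy().tolist()
print(int_ids)  # [[2, 2, 1]] -> cat, cat, [UNK]

# 'count' mode: one dense vector per observation, one slot per vocabulary entry.
# Here slot 0 counts out-of-vocabulary tokens; there is no padding slot.
count_layer = tf.keras.layers.TextVectorization(output_mode="count", vocabulary=vocab)
counts = count_layer(sample).numpy().tolist()
print(counts)  # one [UNK], two "cat", zero "dog"
```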

This layer is meant to handle natural language inputs.

The vocabulary for the TextVectorization layer must be either supplied on construction or learned via `adapt()`. When `adapt()` is called, the layer analyzes the dataset, determines the frequency of individual string values, and creates a vocabulary from them.

The vocabulary can be unlimited in size or capped with `max_tokens`.

If the input contains more unique values than the maximum vocabulary size, only the most frequent terms are used to create the vocabulary; the remaining terms are treated as out-of-vocabulary.
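As a small sketch of learning a capped vocabulary, the cell below adapts a layer on a three-sentence toy corpus (invented for this example) with `max_tokens=4`. In `'int'` mode the cap includes the padding and out-of-vocabulary entries, so only the two most frequent words are kept.

In [ ]:
```python
import tensorflow as tf

# Toy corpus, made up for illustration: "the" appears 3 times, "cat" twice,
# and every other word once.
corpus = ["the cat sat", "the cat ran", "the dog"]
dataset = tf.data.Dataset.from_tensor_slices(corpus).batch(2)

# max_tokens=4 leaves room for padding, [UNK], and the two most frequent words.
layer = tf.keras.layers.TextVectorization(max_tokens=4, output_mode="int")
layer.adapt(dataset)

learned_vocab = layer.get_vocabulary()
print(learned_vocab)  # ['', '[UNK]', 'the', 'cat']

# "barked" was never seen and "sat"/"ran"/"dog" were cut by the cap,
# so any of them would map to the [UNK] index 1.
encoded = layer(tf.constant(["the cat barked"])).numpy().tolist()
print(encoded)  # [[2, 3, 1]]
```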

Here are the preprocessing steps:

1. Standardize each observation (lowercase & punctuation stripping)
2. Split each observation into substrings (words)
3. Recombine substrings into tokens (ngrams)
4. Index tokens (assign unique integer value to each token)
5. Transform each observation using this index, either into a vector of integers or a dense float vector.
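The five steps above can be mimicked in plain Python to see what each one contributes. This is only an illustrative sketch, not how the layer is implemented: the helper names (`standardize`, `split_words`, `make_ngrams`, `build_index`) are hypothetical, and the alphabetical tie-break for equally frequent tokens is an assumption made to keep the result deterministic.

In [ ]:
```python
import string
from collections import Counter

def standardize(text):
    # Step 1: lowercase and strip ASCII punctuation.
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def split_words(text):
    # Step 2: split on whitespace.
    return text.split()

def make_ngrams(words, n):
    # Step 3: build all contiguous groups of 1..n words, joined by a space.
    grams = []
    for size in range(1, n + 1):
        for start in range(len(words) - size + 1):
            grams.append(" ".join(words[start:start + size]))
    return grams

def build_index(tokenized_docs):
    # Step 4: most frequent tokens get the lowest indices.
    # Indices 0 and 1 are left free for padding and unknown tokens.
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: i + 2 for i, tok in enumerate(ordered)}

corpus = ["The cat sat.", "The cat ran!"]
tokens = [make_ngrams(split_words(standardize(doc)), 1) for doc in corpus]
index = build_index(tokens)

# Step 5: replace each token with its integer index (1 for unknown tokens).
vectors = [[index.get(tok, 1) for tok in doc] for doc in tokens]
print(tokens)
print(vectors)
```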

#### Some notes on passing callables to customize splitting and normalization for this layer.
1. When using a custom callable for `standardize`, the data received by the callable will be exactly as passed to this layer. The callable should return a tensor of the same shape as the input.
2. When using a custom callable for `split`, the data received by the callable will have the 1st dimension squeezed out: instead of [['string to split'], ['another string to split']], the callable will see ['string to split', 'another string to split']. The callable should return a Tensor with the first dimension containing the split tokens; in this example, we should see something like [['string', 'to', 'split'], ['another', 'string', 'to', 'split']]. This makes the callable site natively compatible with `tf.strings.split()`.
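The cell below is a minimal sketch of both kinds of callable. The function names `custom_standardize` and `custom_split`, the comma-separated input, and the three-letter vocabulary are all assumptions chosen for illustration.

In [ ]:
```python
import tensorflow as tf

def custom_standardize(x):
    # Receives the strings as passed to the layer; returns a tensor of the same shape.
    lowered = tf.strings.lower(x)
    return tf.strings.regex_replace(lowered, "!", "")

def custom_split(x):
    # Receives a 1D tensor of strings; returns one row of tokens per input string.
    return tf.strings.split(x, sep=",")

layer = tf.keras.layers.TextVectorization(
    standardize=custom_standardize,
    split=custom_split,
    vocabulary=["a", "b", "c"],  # hypothetical vocabulary: a -> 2, b -> 3, c -> 4
)

result = layer(tf.constant(["A,b!", "c,a,zzz"])).numpy().tolist()
print(result)  # [[2, 3, 0], [4, 2, 1]]
```

The shorter first row is padded with 0, and the unseen token `zzz` maps to the out-of-vocabulary index 1.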