# Text Preprocessing with TensorFlow Data Pipeline and Keras Text Vectorization Layer Implementation

We will proceed in five steps:

1. **Creating a TF Data Pipeline**:
   - We will set up a TensorFlow data pipeline to efficiently process and manage the text data. This pipeline will handle tasks such as loading the data, preprocessing it, and batching it for training.

2. **Configuring a Keras `TextVectorization` Layer**:
   - The Keras `TextVectorization` layer handles text preprocessing and tokenization. We will configure it to lowercase the text, remove punctuation, and split the text into individual words.

3. **Adapting the Keras `TextVectorization` Layer to the Train Dataset**:
   - We will adapt (fit) the configured `TextVectorization` layer on the training dataset only, so that its vocabulary is learned from training texts and nothing else.

4. **Applying the Keras `TextVectorization` Layer to Train, Validation, and Test Datasets**:
   - After adapting the `TextVectorization` layer on the training dataset, we will apply it to the training, validation, and test datasets. This ensures that identical preprocessing and the same vocabulary are used for all three splits, which keeps model evaluation fair.

5. **Finalizing the TF Data Pipeline by Configuring It**:
   - We will complete the configuration of the TensorFlow data pipeline by specifying additional settings such as shuffling the data, setting the batch size, and enabling prefetching. These settings optimize the performance and efficiency of the data pipeline during training.

After these steps, the train, validation, and test datasets are preprocessed, tokenized, and batched in a format suitable for training and evaluating text classification models. In the subsequent parts of the project, we will use these datasets to train and evaluate various deep learning models.

## Build the Train TensorFlow Datasets



In [None]:
same_elements = train_features.equals(val_features)

print('Same elements:', same_elements)

Same elements: False
time: 5.65 ms (started: 2024-05-02 07:14:45 +00:00)


In [None]:
train_text_ds_raw = tf.data.Dataset.from_tensor_slices(
            tf.cast(train_features.values, tf.string)
)

time: 397 ms (started: 2024-05-02 07:14:57 +00:00)


In [None]:
train_sentiment_ds_raw = tf.data.Dataset.from_tensor_slices(
            tf.cast(train_targets.values, tf.int64),
)

time: 7.95 ms (started: 2024-05-02 07:14:58 +00:00)


In [None]:
train_text_ds_raw.element_spec

TensorSpec(shape=(), dtype=tf.string, name=None)

time: 6.21 ms (started: 2024-05-02 07:15:00 +00:00)


In [None]:
data.describe()

|       | words      | sentiment_id |
|-------|------------|--------------|
| count | 150583.0   | 150583.0     |
| mean  | 65.869813  | 1.067783     |
| std   | 43.405512  | 0.823377     |
| min   | 8.0        | 0.0          |
| 25%   | 33.0       | 0.0          |
| 50%   | 55.0       | 1.0          |
| 75%   | 87.0       | 2.0          |
| max   | 289.0      | 2.0          |


time: 64.5 ms (started: 2024-05-02 07:15:01 +00:00)


In [None]:
data.shape

(150583, 4)

time: 3.55 ms (started: 2024-05-02 07:15:02 +00:00)


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 150583 entries, 129408 to 123256
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype   
---  ------        --------------   -----   
 0   text          150583 non-null  object  
 1   sentiment     150583 non-null  category
 2   words         150583 non-null  int64   
 3   sentiment_id  150583 non-null  int8    
dtypes: category(1), int64(1), int8(1), object(1)
memory usage: 3.7+ MB
time: 64.8 ms (started: 2024-05-02 07:15:04 +00:00)


## Dictionary size and the review size

For preprocessing the text, we need to decide the **dictionary (vocab) size** and the **maximum review (text) size**.

We will just take the `min` and `max`.

In [None]:
# `vocab_size` and `max_review_size` are assumed to be defined earlier in the notebook
max_len = max_review_size

time: 543 µs (started: 2024-05-02 07:15:08 +00:00)
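The notebook does not show how the two values were chosen. As a minimal sketch of one possible approach, the sequence length could be derived from a percentile of the per-text word counts (the `words` column summarized above). The helper name `suggest_max_len` and the choice of percentile are assumptions made for illustration, not part of the original code.

```python
import statistics


def suggest_max_len(word_counts, percentile=75):
    """Return the given percentile of the word counts as an integer length.

    Hypothetical helper: texts longer than the returned value would be
    truncated by the vectorization layer, shorter ones would be padded.
    """
    # quantiles(n=100) returns the 99 cut points between the 100 percentiles
    cuts = statistics.quantiles(word_counts, n=100, method="inclusive")
    return int(cuts[percentile - 1])


# Toy data: word counts 0, 1, ..., 100
toy_counts = list(range(101))
toy_max_len = suggest_max_len(toy_counts, percentile=75)
```

A lower percentile gives shorter sequences and faster training at the cost of truncating more texts; a higher one keeps more of each text but adds padding.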


## Text Vectorization layer



### Custom Standardization
As a first step of text preprocessing, we will standardize the text by using the below function.

In [None]:
import re
import string

@tf.keras.utils.register_keras_serializable()
def custom_standardization(input_string):
    """Standardize Russian text before tokenization."""
    # Convert text to lowercase
    no_uppercased = tf.strings.lower(input_string, encoding='utf-8')
    # Remove HTML tags
    no_html_tags = tf.strings.regex_replace(no_uppercased, "<[^>]+>", " ")
    # Remove email addresses
    no_emails = tf.strings.regex_replace(no_html_tags, r'\S+@\S+', ' ')
    # Remove digits
    no_digits = tf.strings.regex_replace(no_emails, r"\d", " ")
    # Remove punctuation; re.escape keeps characters such as ']' and '\' literal inside the class
    no_punctuations = tf.strings.regex_replace(no_digits, f"[{re.escape(string.punctuation)}]", " ")
    # Remove newlines
    no_newlines = tf.strings.regex_replace(no_punctuations, "\n", " ")
    # Collapse repeated spaces into a single space
    no_extra_space = tf.strings.regex_replace(no_newlines, " +", " ")
    # Replace 'ё' with 'е'
    no_Ё = tf.strings.regex_replace(no_extra_space, "ё", "е")
    # Replace 'й' with 'и'
    no_Й = tf.strings.regex_replace(no_Ё, "й", "и")
    return no_Й

time: 1.93 ms (started: 2024-05-02 07:34:38 +00:00)


In [None]:
input_string = """
<html>
<head>
<title>Прймер -_=заголовка</title>
</head>
<body>
<p>Это пример текста с некоторыми знаками пунктуации: тест, тест, тест.</p>
<p>Также в нем содержатся цифры, напримёр 123456.</p>
<p>И некоторые образцы - _ для удаления: образец1 и образец2.</p>
<p>Контактный адрес электронной почты: example@example.com</p>
</body>
</html>
"""

print("input:  ", input_string)
output_string= custom_standardization(input_string)
print("output: ", output_string.numpy().decode("utf-8"))

input:   
<html>
<head>
<title>Прймер -_=заголовка</title>
</head>
<body>
<p>Это пример текста с некоторыми знаками пунктуации: тест, тест, тест.</p>
<p>Также в нем содержатся цифры, напримёр 123456.</p>
<p>И некоторые образцы - _ для удаления: образец1 и образец2.</p>
<p>Контактный адрес электронной почты: example@example.com</p>
</body>
</html>

output:   пример заголовка это пример текста с некоторыми знаками пунктуации тест тест тест также в нем содержатся цифры например и некоторые образцы для удаления образец и образец контактныи адрес электроннои почты 
time: 16.2 ms (started: 2024-05-02 07:35:19 +00:00)
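For quick experiments or unit tests outside a TensorFlow graph, the same sequence of replacements can be approximated with Python's standard `re` module. This is only a sketch: the function name `standardize_py` is hypothetical, and Python's regex engine is not identical to the RE2 engine used by `tf.strings.regex_replace`, so edge cases may differ.

```python
import re
import string


def standardize_py(text: str) -> str:
    """Plain-Python approximation of custom_standardization (assumption, not the layer itself)."""
    text = text.lower()                            # lowercase
    text = re.sub(r"<[^>]+>", " ", text)           # HTML tags
    text = re.sub(r"\S+@\S+", " ", text)           # email addresses
    text = re.sub(r"[0-9]", " ", text)             # ASCII digits
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)  # punctuation
    text = text.replace("\n", " ")                 # newlines
    text = re.sub(r" +", " ", text)                # repeated spaces
    text = text.replace("ё", "е").replace("й", "и")  # letter normalization
    return text


sample = standardize_py("<p>Тест, напримёр 123: a@b.com</p>")
```

As in the TensorFlow version, leading and trailing spaces are collapsed to one space but not stripped.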


### Build a TextVectorization layer

Let's build our `TextVectorization` layer:

In [None]:
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size - 1,
    output_mode="int",
    output_sequence_length=max_len,
)

time: 16.5 ms (started: 2024-05-02 07:37:59 +00:00)


### Adapt the Text Vectorization layer

The `TextVectorization` preprocessing layer maintains an internal state: a vocabulary that maps text tokens to integer indices. This state is learned from data by calling `adapt()`, so we adapt the layer on the training data only.

To avoid data leakage, we do not adapt the layer on the entire dataset: the validation and test texts must not influence the vocabulary. The adapted layer is then applied unchanged to all three splits.

In [None]:
vectorize_layer.adapt(train_text_ds_raw)
vocab = vectorize_layer.get_vocabulary()

time: 3min 5s (started: 2024-05-02 07:38:45 +00:00)
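To make the idea of the layer's internal state more concrete, here is a simplified, plain-Python sketch of what adapting and then applying a vocabulary amounts to. It is an illustration under assumptions, not the Keras implementation: the helper names `build_vocab` and `encode` are hypothetical, tokens are split on whitespace only, and tie-breaking between equally frequent tokens may differ from Keras. The layout with `''` (padding) at index 0 and `'[UNK]'` at index 1 follows the vocabulary printed by `get_vocabulary()` in this notebook.

```python
from collections import Counter


def build_vocab(train_texts, max_tokens):
    """Learn a vocabulary from the training texts only, most frequent tokens first."""
    counts = Counter(token for text in train_texts for token in text.split())
    # Two slots are reserved: index 0 for padding, index 1 for unknown tokens
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


def encode(text, vocab, seq_len):
    """Map tokens to indices, then truncate or zero-pad to seq_len."""
    index = {token: i for i, token in enumerate(vocab)}
    ids = [index.get(token, 1) for token in text.split()][:seq_len]
    return ids + [0] * (seq_len - len(ids))


toy_vocab = build_vocab(["санкции против россии", "санкции в силе"], max_tokens=5)
toy_ids = encode("санкции неизвестно", toy_vocab, seq_len=4)
```

A token that never occurred in the training texts is mapped to `[UNK]`, which is exactly what should happen to unseen validation and test words when the vocabulary is built from the training split alone.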


### Saving and Loading a Customized TextVectorization Layer in TensorFlow Data Pipeline and Keras

In [None]:
vectorizer_model = tf.keras.models.Sequential()
vectorizer_model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
vectorizer_model.add(vectorize_layer)
vectorizer_model.summary()

filepath = "vectorize_layer_model"
vectorizer_model.save(filepath, save_format="tf")

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_4 (Text  (None, 60)                0         
 Vectorization)                                                  
                                                                 
Total params: 0 (0.00 Byte)
Trainable params: 0 (0.00 Byte)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________




time: 1 s (started: 2024-05-02 07:47:24 +00:00)


In [None]:
loaded_vectorizer_model = tf.keras.models.load_model(filepath)
loaded_vectorizer_layer = loaded_vectorizer_model.layers[0]



time: 343 ms (started: 2024-05-02 07:47:37 +00:00)


### Check the loaded Text Vectorization layer

In [None]:
loaded_vocab = loaded_vectorizer_layer.get_vocabulary()
print("original vocab has the ", len(vocab)," entries")
print("loaded_vectorizer_layer vocab has the ", len(loaded_vocab)," entries")
print("original vocab: ", vocab[:10])
print("loaded vocab  : ", loaded_vocab[:10])

original vocab has the  8260  entries
loaded_vectorizer_layer vocab has the  8260  entries
original vocab:  ['', '[UNK]', 'в', 'и', 'на', 'с', 'из', 'санкции', 'россии', 'по']
loaded vocab  :  ['', '[UNK]', 'в', 'и', 'на', 'с', 'из', 'санкции', 'россии', 'по']
time: 38.8 ms (started: 2024-05-02 07:47:54 +00:00)


## Preprocessing Text Data with TensorFlow: A Detailed Explanation

Let's define a function that vectorizes a single text with `vectorize_layer` (the same works with `loaded_vectorizer_layer`).

1. We use a lambda, an anonymous function that takes a single argument `text`.

2. `tf.expand_dims(text, -1)` adds a new axis at the end of the input `text` tensor. Each dataset element is a scalar string of shape `()`, so the result has shape `(1,)`: a batch containing one string, which is the batched form the layer is called with.

3. Next, we apply `vectorize_layer` to the expanded `text` tensor, which converts the text into a sequence of integer token indices, padded or truncated to `max_len`. The result has shape `(1, max_len)`.

4. Finally, `tf.squeeze(...)` removes the size-1 batch axis, so each text yields one sequence of shape `(max_len,)`. Since `tf.squeeze` is called without an `axis` argument, TensorFlow may be unable to infer the static output shape; the `element_spec` output further down indeed reports `shape=<unknown>`.

In [None]:
prepare_lm_inputs_labels = lambda text: tf.squeeze(vectorize_layer(tf.expand_dims(text, -1)))

time: 889 µs (started: 2024-05-02 07:48:20 +00:00)


### Process the Train Data

In [None]:
train_text_ds = train_text_ds_raw.map(prepare_lm_inputs_labels,
                                  num_parallel_calls=tf.data.experimental.AUTOTUNE)

time: 143 ms (started: 2024-05-02 07:48:25 +00:00)


**Let's check the result**

In [None]:
train_text_ds.element_spec

TensorSpec(shape=<unknown>, dtype=tf.int64, name=None)

time: 15.2 ms (started: 2024-05-02 07:48:36 +00:00)


In [None]:
train_ds = tf.data.Dataset.zip(
    (train_text_ds,train_sentiment_ds_raw)
)

time: 16.5 ms (started: 2024-05-02 07:48:38 +00:00)


In [None]:
train_size = train_ds.cardinality().numpy()
print("Train size: ", train_size)

Train size:  102997
time: 10.6 ms (started: 2024-05-02 07:48:40 +00:00)


### Process the Validation Data

Let's create the input (text) and output TF Datasets:

In [None]:
val_text_ds_raw = tf.data.Dataset.from_tensor_slices(
            tf.cast(val_features.values, tf.string)
)

time: 57.6 ms (started: 2024-05-02 07:48:48 +00:00)


In [None]:
val_sentiment_ds_raw = tf.data.Dataset.from_tensor_slices(
            tf.cast(val_targets.values, tf.int64),
)

time: 8.48 ms (started: 2024-05-02 07:48:51 +00:00)


Let's apply the same function `prepare_lm_inputs_labels` to the text in the validation data:

In [None]:
val_text_ds = val_text_ds_raw.map(prepare_lm_inputs_labels,
                                  num_parallel_calls=tf.data.experimental.AUTOTUNE)

time: 70.5 ms (started: 2024-05-02 07:48:56 +00:00)


In [None]:
val_ds = tf.data.Dataset.zip(
    (val_text_ds, val_sentiment_ds_raw)
)

time: 26.9 ms (started: 2024-05-02 07:49:10 +00:00)


### Process the Test Data

In [None]:
test_text_ds_raw = tf.data.Dataset.from_tensor_slices(
            tf.cast(test_features.values, tf.string)
)

time: 210 ms (started: 2024-05-02 07:49:14 +00:00)


In [None]:
test_sentiment_ds_raw = tf.data.Dataset.from_tensor_slices(
            tf.cast(test_targets.values, tf.int64),
)

time: 9.96 ms (started: 2024-05-02 07:49:18 +00:00)


In [None]:
test_text_ds = test_text_ds_raw.map(prepare_lm_inputs_labels,
                                  num_parallel_calls=tf.data.experimental.AUTOTUNE)

time: 63.3 ms (started: 2024-05-02 07:49:22 +00:00)


In [None]:
test_ds = tf.data.Dataset.zip(
    (test_text_ds, test_sentiment_ds_raw)
)

time: 3.79 ms (started: 2024-05-02 07:49:26 +00:00)


In [None]:
test_size = test_ds.cardinality().numpy()
print("Test size: ", test_size)

Test size:  30117
time: 2.02 ms (started: 2024-05-02 07:49:27 +00:00)


## TensorFlow Data Pipeline

In [None]:
batch_size = 64
AUTOTUNE = tf.data.experimental.AUTOTUNE

# Train: cache the vectorized examples first, then shuffle, so that every
# epoch is reshuffled instead of replaying one cached order.
train_ds = train_ds.cache()
train_ds = train_ds.shuffle(buffer_size=train_size)
train_ds = train_ds.batch(batch_size=batch_size, drop_remainder=True)
train_ds = train_ds.prefetch(AUTOTUNE)

# Validation: shuffling is not needed for evaluation.
# drop_remainder=True discards the final partial batch.
val_ds = val_ds.batch(batch_size=batch_size, drop_remainder=True)
val_ds = val_ds.cache()
val_ds = val_ds.prefetch(AUTOTUNE)

# Test: same configuration as validation.
test_ds = test_ds.batch(batch_size=batch_size, drop_remainder=True)
test_ds = test_ds.cache()
test_ds = test_ds.prefetch(AUTOTUNE)

time: 56.9 ms (started: 2024-05-02 07:49:31 +00:00)
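When a dataset is batched with `drop_remainder=True`, the final partial batch is discarded. The small sketch below works out how many full batches result and how many examples are left out, using the train and test sizes printed earlier (102997 and 30117) and the batch size of 64. The helper name `batch_counts` is hypothetical and only illustrates the arithmetic.

```python
def batch_counts(n_examples, batch_size, drop_remainder=True):
    """Return (number of batches, number of discarded examples)."""
    full_batches, leftover = divmod(n_examples, batch_size)
    if drop_remainder:
        return full_batches, leftover
    # Without drop_remainder the leftover examples form one extra, smaller batch
    return full_batches + (1 if leftover else 0), 0


train_batches, train_dropped = batch_counts(102997, 64)
test_batches, test_dropped = batch_counts(30117, 64)
```

For the training set this gives 1609 batches with 21 examples discarded; a test set of 30117 examples batched the same way gives 470 batches with 37 examples discarded, which is worth keeping in mind when reporting evaluation metrics.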


Notice that each element is now a batch of vectorized texts paired with a batch of sentiment labels:

In [None]:
train_ds.element_spec

(TensorSpec(shape=<unknown>, dtype=tf.int64, name=None),
 TensorSpec(shape=(64,), dtype=tf.int64, name=None))

time: 7.86 ms (started: 2024-05-02 07:49:35 +00:00)
