# Using Pretrained Models from KerasNLP for the Semantic Similarity Task

**Introduction Summary:**

Semantic similarity involves measuring how closely two sentences align in meaning. Previously, we explored using the SNLI (Stanford Natural Language Inference) corpus with the HuggingFace Transformers library for this task. This tutorial introduces KerasNLP, an extension of the Keras API, to perform the same task. KerasNLP aims to reduce boilerplate code and streamline the process of building and deploying models. The guide is organized into the following sections:

1. Setup, loading the SNLI dataset, and preprocessing.
2. Establishing a baseline with BERT.
3. Saving and reloading the model.
4. Performing inference with the model.
5. Improving accuracy with RoBERTa.

**[BERT (Bidirectional Encoder Representations from Transformers)](https://arxiv.org/pdf/1810.04805)** is a pretrained Transformer encoder used for many natural language processing tasks, including sentence-pair classification such as semantic similarity. It is pre-trained on a large corpus and learns language context by considering both the left and the right context in all layers, which is what makes it "bidirectional."

BERT's architecture consists of a stack of Transformer encoder layers, each using attention to focus on different parts of the text. For a classification task such as this one, a classification layer is added on top of the BERT model, typically a simple dense layer that outputs one score per class (here: contradiction, entailment, and neutral).

![](https://miro.medium.com/v2/resize:fit:1400/1*Qww2aaIdqrWVeNmo3AS0ZQ.png)


## Setup

In [1]:
!pip install -q --upgrade keras-nlp
!pip install -q --upgrade tensorflow
!pip install -q --upgrade keras

In [2]:
import numpy as np
import tensorflow as tf
import keras_nlp
import keras
import tensorflow_datasets as tfds

## Load the SNLI Dataset

- **Components of Each Sample:**
  - **Hypothesis**: The hypothesis caption created by the author.
  - **Premise**: The original caption provided to the author.
  - **Label**: Indicates the similarity between the hypothesis and premise.

- **Label Values:**
  - **Contradiction**: Represents completely dissimilar sentences.
  - **Entailment**: Indicates sentences with similar meanings.
  - **Neutral**: Refers to sentences where no clear similarity or dissimilarity can be established.
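
The dataset stores these labels as integers, so a small lookup makes outputs easier to read. The sketch below assumes the TFDS `snli` label order (0 = entailment, 1 = neutral, 2 = contradiction), which this notebook does not state; `SNLI_LABEL_NAMES` and `label_name` are hypothetical names used only for illustration.

```python
# Assumed label order for the TFDS "snli" dataset (not confirmed in this notebook).
SNLI_LABEL_NAMES = ("entailment", "neutral", "contradiction")


def label_name(label_id: int) -> str:
    """Return a readable name for an SNLI integer label."""
    # -1 must be handled explicitly: as a Python index it would silently
    # select the last entry ("contradiction").
    if label_id == -1:
        return "unlabeled"
    return SNLI_LABEL_NAMES[label_id]


print([label_name(i) for i in (1, 2, -1, 1)])
# ['neutral', 'contradiction', 'unlabeled', 'neutral']
```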

In [3]:
snli_train = tfds.load("snli", split="train[:20%]")
snli_val = tfds.load("snli", split="validation")
snli_test = tfds.load("snli", split="test")

Downloading and preparing dataset 90.17 MiB (download: 90.17 MiB, generated: 87.00 MiB, total: 177.17 MiB) to /root/tensorflow_datasets/snli/1.1.0...
Dataset snli downloaded and prepared to /root/tensorflow_datasets/snli/1.1.0. Subsequent calls will reuse this data.


In [4]:
sample = snli_train.batch(4).take(1).get_single_element()
sample

{'hypothesis': <tf.Tensor: shape=(4,), dtype=string, numpy=
 array([b'Washing clothes on a camping trip.', b'A woman walking alone.',
        b'The woman is thinking.',
        b'The woman is practicing to enter a roller derby.'], dtype=object)>,
 'label': <tf.Tensor: shape=(4,), dtype=int64, numpy=array([ 1,  2, -1,  1])>,
 'premise': <tf.Tensor: shape=(4,), dtype=string, numpy=
 array([b'A man washes or dies clothes in a primitive setting.',
        b'A woman is walking her baby with a stroller at the local park.',
        b'This is a pensive women in a grassy setting.',
        b'Woman in red shirt and white cap rollerblading on gray surface.'],
       dtype=object)>}

## Data Preprocessing

### **Handling Missing or Incorrectly Labeled Data:**

- **Issue Identified**: Some samples in the dataset have missing or incorrectly labeled data, marked by a value of `-1`.
- **Solution**: To maintain the accuracy and reliability of the model, these samples are filtered out from the dataset before training or evaluation.
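
As a rough illustration of what this filter does, here is the same predicate applied to plain Python dictionaries instead of a `tf.data.Dataset`. The records mirror the sample batch shown earlier; `has_valid_label` is a hypothetical stand-in for the notebook's `filter_labels`.

```python
# Plain-Python stand-in for the dataset filter; records copied from the sample batch.
samples = [
    {"hypothesis": "Washing clothes on a camping trip.", "label": 1},
    {"hypothesis": "A woman walking alone.", "label": 2},
    {"hypothesis": "The woman is thinking.", "label": -1},
    {"hypothesis": "The woman is practicing to enter a roller derby.", "label": 1},
]


def has_valid_label(sample: dict) -> bool:
    """Keep only samples whose label is not the -1 placeholder."""
    return sample["label"] != -1


kept = [s for s in samples if has_valid_label(s)]
print(len(samples), "->", len(kept))  # 4 -> 3
```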

In [5]:
def filter_labels(sample):
    return sample["label"] != -1

### Utility Function for Data Preparation

- **Purpose**: Splits each dataset example into an `(x, y)` tuple for use with `model.fit()` in Keras.
- **KerasNLP's BERT Classifier**: Automatically tokenizes and packs raw strings using a "[SEP]" token during training.
- **Key Step**: This label splitting is the only data preparation required before training.
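
To make the packing step more concrete, the following toy sketch shows how a sentence pair could be laid out as one sequence with `[CLS]`, `[SEP]`, and padding. It is not KerasNLP's implementation: whitespace splitting stands in for WordPiece tokenization, truncation is a naive cut, and `pack_pair` is a hypothetical helper.

```python
def pack_pair(first: str, second: str, sequence_length: int = 16):
    """Toy sentence-pair packing: [CLS] first [SEP] second [SEP] + padding."""
    # Whitespace split is only a stand-in for a real subword tokenizer.
    tokens_a = first.lower().split()
    tokens_b = second.lower().split()

    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], the first sentence, and its [SEP];
    # segment 1 covers the second sentence and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)

    # Naive truncation; real preprocessors truncate more carefully.
    tokens = tokens[:sequence_length]
    segment_ids = segment_ids[:sequence_length]

    padding = sequence_length - len(tokens)
    padding_mask = [1] * len(tokens) + [0] * padding
    tokens = tokens + ["[PAD]"] * padding
    segment_ids = segment_ids + [0] * padding
    return tokens, segment_ids, padding_mask


tokens, segment_ids, padding_mask = pack_pair(
    "A woman walking alone.", "A woman is walking her baby."
)
print(tokens)
print(segment_ids)
print(padding_mask)
```

The segment IDs are what let the model tell the two sentences apart once they share a single input sequence.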

In [6]:
def split_labels(sample):
    x = (sample["hypothesis"], sample["premise"])
    y = sample["label"]
    return x, y

train_ds = (
    snli_train.filter(filter_labels)
    .map(split_labels, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(16)
)

val_ds = (
    snli_val.filter(filter_labels)
    .map(split_labels, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(16)
)

test_ds = (
    snli_test.filter(filter_labels)
    .map(split_labels, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(16)
)


## Establishing Baseline with BERT

- **Model Setup**: We use the `BertClassifier` from KerasNLP to establish a baseline for the semantic similarity task.
- **Classification Head**: The `BertClassifier` attaches a classification head to the BERT backbone, mapping its outputs to logits suitable for classification, reducing the need for custom code.
- **Built-in Tokenization**: KerasNLP models automatically handle tokenization, concatenating strings with a "[SEP]" separator when provided as input.
- **Pretrained Weights**: `from_preset()` loads the pretrained weights for the chosen preset and can also be given a custom preprocessor.
- **Task Configuration**: For the SNLI dataset, we set `num_classes` to 3 to match the classification labels (Contradiction, Entailment, Neutral).
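
Because the classification head outputs raw logits rather than probabilities, the loss has to apply softmax itself (this is what `from_logits=True` expresses in the compile calls in this notebook). The pure-Python sketch below illustrates the two quantities reported during training for three classes; the function names are chosen for readability and are not the Keras implementations.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(label, logits):
    """Negative log-probability assigned to the true integer label."""
    return -math.log(softmax(logits)[label])


def sparse_categorical_accuracy(labels, batch_logits):
    """Fraction of rows whose largest logit sits at the true label."""
    hits = sum(
        1 for y, row in zip(labels, batch_logits) if row.index(max(row)) == y
    )
    return hits / len(labels)


# Uniform logits give a loss of ln(3), about 1.0986: the value expected
# from a three-class model that has not learned anything yet.
print(sparse_categorical_crossentropy(1, [0.0, 0.0, 0.0]))

labels = [0, 2, 1]
batch_logits = [[2.0, 0.1, -1.0], [0.0, 1.5, 0.3], [0.2, 0.9, 0.1]]
print(sparse_categorical_accuracy(labels, batch_logits))
```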

In [7]:
# BERT Tiny model has only 4,386,307 trainable parameters
bert_classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_tiny_en_uncased",
    num_classes=3,
)

In [8]:
# Train the model
bert_classifier.fit(
    train_ds,
    validation_data=val_ds,
    epochs=1,
)

6867/6867 ━━━━━━━━━━━━━━━━━━━━ 189s 25ms/step - loss: 0.8553 - sparse_categorical_accuracy: 0.6059 - val_loss: 0.5803 - val_sparse_categorical_accuracy: 0.7657


<keras.src.callbacks.history.History at 0x7b89d63b6440>

### Evaluate the performance on test data

In [9]:
bert_classifier.evaluate(test_ds)

614/614 ━━━━━━━━━━━━━━━━━━━━ 5s 7ms/step - loss: 0.5748 - sparse_categorical_accuracy: 0.7697


[0.5837080478668213, 0.7662866711616516]

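### Try a custom learning rate

Here the model is recompiled with `Adam` at a learning rate of `1e-5`. In this run the change does not help: after one epoch, test accuracy is about 0.713, compared with about 0.766 for the default compile settings.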

In [10]:
bert_classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_tiny_en_uncased",
    num_classes=3,
)

bert_classifier.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(1e-5),
    metrics=[keras.metrics.SparseCategoricalAccuracy()],
)

In [11]:
bert_classifier.fit(
    train_ds,
    validation_data=val_ds,
    epochs=1,
)

6867/6867 ━━━━━━━━━━━━━━━━━━━━ 184s 24ms/step - loss: 0.9850 - sparse_categorical_accuracy: 0.5126 - val_loss: 0.6955 - val_sparse_categorical_accuracy: 0.7157


<keras.src.callbacks.history.History at 0x7b89d6396a70>

In [12]:
bert_classifier.evaluate(test_ds)

614/614 ━━━━━━━━━━━━━━━━━━━━ 5s 8ms/step - loss: 0.6990 - sparse_categorical_accuracy: 0.7176


[0.702507495880127, 0.7131514549255371]

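### Try a triangular learning rate schedule

Next, the model is trained with `AdamW` and a schedule that warms up linearly to a peak learning rate and then decays linearly to zero.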

In [13]:
class TriangularScheduler(keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, rate, warmup, total):
        self.rate = rate
        self.warmup = warmup
        self.total = total

    def get_config(self):
        return {
            "rate": self.rate,
            "warmup": self.warmup,
            "total": self.total,
        }

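    def __call__(self, step):
        # Cast everything to float32 so the arithmetic below uses one dtype
        step = keras.ops.cast(step, dtype="float32")
        rate = keras.ops.cast(self.rate, dtype="float32")
        warmup = keras.ops.cast(self.warmup, dtype="float32")
        total = keras.ops.cast(self.total, dtype="float32")

        # Linear ramp from 0 up to `rate` over the warmup steps
        warmup_rate = rate * (step / warmup)
        # Linear decay from `rate` back down to 0 at the final step
        decay_rate = rate * (total - step) / (total - warmup)
        # Taking the lower of the two lines gives the triangular shape
        triangular_rate = keras.ops.minimum(warmup_rate, decay_rate)
        return keras.ops.maximum(triangular_rate, 0.0)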

bert_classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_tiny_en_uncased",
    num_classes=3,
)

# Get total number of training batches
epochs = 1
total_steps = sum(1 for _ in train_ds.as_numpy_iterator()) * epochs
warmup_steps = int(total_steps * 0.2)

bert_classifier.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.AdamW(
        TriangularScheduler(1e-4, warmup_steps, total_steps)
    ),
    metrics=[keras.metrics.SparseCategoricalAccuracy()],
)

bert_classifier.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs,
)

6867/6867 ━━━━━━━━━━━━━━━━━━━━ 188s 25ms/step - loss: 0.8847 - sparse_categorical_accuracy: 0.5649 - val_loss: 0.5753 - val_sparse_categorical_accuracy: 0.7657


<keras.src.callbacks.history.History at 0x7b89c7033a30>
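
To see the shape that `TriangularScheduler` produces, the same formula can be evaluated in plain Python. This is a sketch for inspection only: `triangular_rate` is a hypothetical helper, and the step count of 6867 is taken from the training log above.

```python
def triangular_rate(step: int, peak: float, warmup: int, total: int) -> float:
    """Linear warmup to `peak`, then linear decay to zero at `total`."""
    warmup_rate = peak * step / warmup
    decay_rate = peak * (total - step) / (total - warmup)
    return max(min(warmup_rate, decay_rate), 0.0)


total_steps = 6867                      # batches per epoch in the log above
warmup_steps = int(total_steps * 0.2)   # 1373

# Print the rate at the start, at the peak, mid-training, and at the end.
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, triangular_rate(step, 1e-4, warmup_steps, total_steps))
```

The rate starts at zero, reaches the peak of `1e-4` at the end of warmup, and returns to zero at the last step.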

In [14]:
# Evaluate the final model on the test set
bert_classifier.evaluate(test_ds)

614/614 ━━━━━━━━━━━━━━━━━━━━ 6s 7ms/step - loss: 0.5763 - sparse_categorical_accuracy: 0.7711


[0.5830331444740295, 0.7655740976333618]

## Save and Reload the Model

In [15]:
bert_classifier.save("bert_classifier.keras")
restored_model = keras.models.load_model("bert_classifier.keras")
restored_model.evaluate(test_ds)

614/614 ━━━━━━━━━━━━━━━━━━━━ 8s 8ms/step - loss: 0.5763 - sparse_categorical_accuracy: 0.7711


[0.5830331444740295, 0.7655740976333618]

## Inference with the Model

In [16]:
sample

{'hypothesis': <tf.Tensor: shape=(4,), dtype=string, numpy=
 array([b'Washing clothes on a camping trip.', b'A woman walking alone.',
        b'The woman is thinking.',
        b'The woman is practicing to enter a roller derby.'], dtype=object)>,
 'label': <tf.Tensor: shape=(4,), dtype=int64, numpy=array([ 1,  2, -1,  1])>,
 'premise': <tf.Tensor: shape=(4,), dtype=string, numpy=
 array([b'A man washes or dies clothes in a primitive setting.',
        b'A woman is walking her baby with a stroller at the local park.',
        b'This is a pensive women in a grassy setting.',
        b'Woman in red shirt and white cap rollerblading on gray surface.'],
       dtype=object)>}

In [17]:
# Convert to Hypothesis-Premise pair, for forward pass through model
sample = (
    sample["hypothesis"],
    sample["premise"],
)

sample

(<tf.Tensor: shape=(4,), dtype=string, numpy=
 array([b'Washing clothes on a camping trip.', b'A woman walking alone.',
        b'The woman is thinking.',
        b'The woman is practicing to enter a roller derby.'], dtype=object)>,
 <tf.Tensor: shape=(4,), dtype=string, numpy=
 array([b'A man washes or dies clothes in a primitive setting.',
        b'A woman is walking her baby with a stroller at the local park.',
        b'This is a pensive women in a grassy setting.',
        b'Woman in red shirt and white cap rollerblading on gray surface.'],
       dtype=object)>)

In [18]:
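# `predict` returns one row of logits per (hypothesis, premise) pair
predictions = bert_classifier.predict(sample)

def softmax(x):
    # Subtract the max for numerical stability, then normalize over the class axis
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Class probabilities for the first pair only
predictions = softmax(predictions[0])
predictions

1/1 ━━━━━━━━━━━━━━━━━━━━ 3s 3s/step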


array([0.08363279, 0.67912424, 0.23724292], dtype=float32)
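
To turn such a probability vector into a readable prediction, one option is to take the index of the largest value and look up its name. The sketch below again assumes the TFDS `snli` label order (0 = entailment, 1 = neutral, 2 = contradiction); `top_label` is a hypothetical helper, and the input is the probability vector printed above.

```python
# Assumed label order for the TFDS "snli" dataset (not confirmed in this notebook).
LABEL_NAMES = ("entailment", "neutral", "contradiction")


def top_label(probabilities):
    """Return (label name, probability) for the most likely class."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return LABEL_NAMES[best], probabilities[best]


# Probabilities for the first pair, copied from the output above.
print(top_label([0.08363279, 0.67912424, 0.23724292]))
# ('neutral', 0.67912424)
```

Index 1 is the most likely class, which matches the label `1` of the first pair in the sample batch.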

## Improving accuracy with RoBERTa

In [19]:
# Initializing a RoBERTa from preset
roberta_classifier = keras_nlp.models.RobertaClassifier.from_preset(
    "roberta_base_en",
    num_classes=3,
)

roberta_classifier.fit(
    train_ds,
    validation_data=val_ds,
    epochs=1,   # Need to run more epochs to get better results
)

roberta_classifier.evaluate(test_ds)

6867/6867 ━━━━━━━━━━━━━━━━━━━━ 10546s 2s/step - loss: 1.1020 - sparse_categorical_accuracy: 0.3347 - val_loss: 1.0990 - val_sparse_categorical_accuracy: 0.3382
614/614 ━━━━━━━━━━━━━━━━━━━━ 308s 493ms/step - loss: 1.0981 - sparse_categorical_accuracy: 0.3470


[1.09872305393219, 0.34283387660980225]

In this run, RoBERTa stays at chance level for three classes after one epoch (test accuracy of about 0.34, with a loss close to ln 3 ≈ 1.0986), so the expected gain over the BERT Tiny baseline is not visible here.

In [20]:
# Inference with the RoBERTa classifier
predictions = roberta_classifier.predict(sample)
print(tf.math.argmax(predictions, axis=1).numpy())

1/1 ━━━━━━━━━━━━━━━━━━━━ 11s 11s/step
[0 0 0 0]
