<a href="https://colab.research.google.com/github/ashaduzzaman-sarker/Text-Semantic-Similarity-Search/blob/main/Sentence_embeddings_using_Siamese_RoBERTa_networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tune a RoBERTa model to generate sentence embeddings using KerasNLP.

## Introduction
![](https://www.labellerr.com/blog/content/images/2023/06/Roberta.png)

### Introduction: Semantic Textual Similarity and Sentence Embeddings

BERT and RoBERTa are powerful models for natural language processing tasks, including semantic textual similarity (STS), where the goal is to determine how similar two sentences are. Typically, these models take two sentences as input and predict their similarity. However, when dealing with large collections of sentences, finding the most similar pairs becomes computationally expensive, because every pair needs its own forward pass. For a collection of `n` sentences, the number of pairwise comparisons is `n*(n-1)/2`, which grows quadratically. For instance, comparing 10,000 sentences requires about 50 million comparisons, which would take approximately 65 hours on a V100 GPU.
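To make the quadratic growth concrete, the short calculation below counts the pairs for the 10,000-sentence case. The per-pair time is only back-of-the-envelope arithmetic derived from the 65-hour figure quoted above, not a measurement.

```python
def num_pairs(n: int) -> int:
    """Number of unordered sentence pairs in a collection of n sentences."""
    return n * (n - 1) // 2


n = 10_000
pairs = num_pairs(n)
print(f"{n:,} sentences -> {pairs:,} pairwise comparisons")

# Implied average cost per pair if the whole job takes about 65 hours
# (illustrative arithmetic only).
seconds_per_pair = 65 * 3600 / pairs
print(f"about {seconds_per_pair * 1000:.2f} ms per pair")
```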

### Reducing Computational Overhead with Sentence Embeddings

To overcome the computational overhead, one approach is to generate sentence embeddings. This involves passing each sentence through the model individually, and then either averaging the model's output or using the [CLS] token as a representative embedding for the sentence. These embeddings can then be compared using vector similarity measures like cosine similarity, Manhattan distance, or Euclidean distance. This method drastically reduces the time required for finding similar sentence pairs—bringing it down from 65 hours to just a few seconds for a collection of 10,000 sentences.
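The three similarity measures mentioned above can be sketched in a few lines of NumPy. The vectors here are made-up toy values standing in for sentence embeddings, and the helper names are illustrative rather than part of any library.

```python
import numpy as np


def cosine_similarity(a, b):
    # Angle-based similarity in [-1, 1]; ignores vector length.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def manhattan_distance(a, b):
    # Sum of absolute coordinate differences (L1 distance).
    return float(np.sum(np.abs(a - b)))


def euclidean_distance(a, b):
    # Straight-line (L2) distance.
    return float(np.linalg.norm(a - b))


# Toy 4-dimensional "embeddings" (not real model output).
u = np.array([1.0, 2.0, 0.0, 1.0])
v = np.array([2.0, 4.0, 0.0, 2.0])  # same direction as u, twice the length
w = np.array([-1.0, 0.0, 3.0, 0.0])

print("cosine(u, v)    =", cosine_similarity(u, v))
print("cosine(u, w)    =", cosine_similarity(u, w))
print("manhattan(u, v) =", manhattan_distance(u, v))
print("euclidean(u, v) =", euclidean_distance(u, v))
```

Note that `u` and `v` point in the same direction, so their cosine similarity is maximal even though their Manhattan and Euclidean distances are non-zero. This is one reason cosine similarity is a common choice for comparing embeddings.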

### Fine-Tuning RoBERTa for Better Sentence Embeddings

While directly using RoBERTa for sentence embeddings might not yield the best results, fine-tuning RoBERTa using a Siamese network can significantly improve the quality of these embeddings. A Siamese network is a type of neural network that processes two input sentences through identical subnetworks (sharing the same weights) and then learns to measure the similarity between these embeddings. Fine-tuning RoBERTa in this way produces semantically meaningful embeddings that can be used for various downstream tasks, such as:

- **Large-Scale Semantic Similarity Comparison:** Efficiently finding the most similar sentences in large datasets.
- **Clustering:** Grouping similar sentences together based on their embeddings.
- **Information Retrieval via Semantic Search:** Retrieving the most relevant documents or sentences based on a query by comparing their embeddings.

### Example: Fine-Tuning RoBERTa with a Siamese Network

In this example, we'll demonstrate how to fine-tune a RoBERTa model using a Siamese network architecture. This process will enable the model to generate high-quality sentence embeddings that can be utilized for tasks such as semantic search and clustering. The method of fine-tuning that we’ll use is inspired by **Sentence-BERT**, a well-known technique for creating sentence embeddings that preserve semantic meaning.

# Imports

In [1]:
!pip install -q --upgrade keras_nlp
!pip install -q --upgrade keras
!pip install -q --upgrade tensorflow

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m572.2/572.2 kB[0m [31m6.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.2/5.2 MB[0m [31m16.7 MB/s[0m eta [36m0:00:00[0m
[?25h

In [2]:
import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import keras
import keras_nlp
import tensorflow as tf
import numpy as np
import pandas as pd
import tensorflow_datasets as tfds
import sklearn.cluster as cluster

keras.mixed_precision.set_global_policy("mixed_float16")

### Fine-Tuning RoBERTa Using Siamese Networks for Semantic Similarity

To fine-tune RoBERTa using a Siamese network, we will create a network architecture where two identical RoBERTa models share weights. The outputs of these models will be pooled to produce sentence embeddings, which will then be compared using cosine similarity. The goal is to train the model to generate embeddings that are close in vector space for semantically similar sentences and distant for dissimilar ones.

![](https://miro.medium.com/v2/resize:fit:1400/1*LwOBbwGXMZUy6OzkFAPTzw.png)
### Steps to Fine-Tune the Model

#### 1. **Define the Siamese Network Architecture**
   - We'll use two identical subnetworks (RoBERTa models) to process pairs of input sentences.
   - We'll add a pooling layer on top of each RoBERTa model to extract sentence embeddings.
   - The network outputs the cosine similarity between the two embeddings, and training fits this score to the labeled similarity of each pair.

#### 2. **Pooling Strategies**
   - **Mean Pooling:** Averages the token embeddings across all tokens in the sentence.
   - **Max Pooling:** Takes the maximum value for each dimension across all tokens.
   - **CLS Pooling:** Uses the embedding of the [CLS] token as the sentence embedding.
   - We'll use mean pooling in this example, as it gave the best results of the three in the Sentence-BERT experiments.

#### 3. **Regression Objective Function**
   - The cosine similarity between the two sentence embeddings will be calculated.
   - The network will be trained so that this cosine similarity matches the gold similarity label of each sentence pair, using mean squared error as the loss.
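The three pooling strategies listed above can be illustrated on a toy example. The sketch below uses NumPy and made-up token vectors rather than real RoBERTa outputs; the variable names are illustrative assumptions.

```python
import numpy as np

# Toy backbone output for one sentence: 4 token positions x 3 hidden dims.
# The values are made up; the last position is padding.
token_embeddings = np.array([
    [0.5, 1.0, -1.0],  # first token (<s>, RoBERTa's counterpart of [CLS])
    [1.5, 0.0, 2.0],
    [1.0, 2.0, 2.0],
    [9.0, 9.0, 9.0],   # padding position, must be ignored
])
padding_mask = np.array([1, 1, 1, 0], dtype=bool)

# Keep only real tokens before pooling so padding does not skew the result.
real_tokens = token_embeddings[padding_mask]

mean_pooled = real_tokens.mean(axis=0)  # average over real tokens
max_pooled = real_tokens.max(axis=0)    # per-dimension maximum
cls_pooled = token_embeddings[0]        # embedding of the first token

print("mean:", mean_pooled)
print("max: ", max_pooled)
print("cls: ", cls_pooled)
```

Without the mask, the padding row would pull the mean and the max toward 9.0, which is why the encoder built later passes the padding mask to its pooling layer.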


In [3]:
max_length = 128  # Maximum length of input sentence
batch_size = 32
epochs = 2

# NLI class labels; not used by the STSB regression or triplet objectives below
labels = ["contradiction", "entailment", "neutral"]

## Load the Dataset

- **Dataset**: The STSB (Semantic Textual Similarity Benchmark) dataset is used for fine-tuning the model with a regression objective.

- **Label Range**: STSB labels range from 0 to 5, where 0 indicates the least similarity and 5 indicates the highest.

- **Cosine Similarity**: The output range of cosine similarity from the Siamese network is [-1, 1].

- **Normalization**: To align the dataset labels with cosine similarity, labels are divided by 2.5 and 1 is subtracted.
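As a quick check of the normalization rule above, this small sketch maps a few STSB scores onto the cosine-similarity range. It mirrors the `change_range` helper used in the data pipeline; the sample scores are chosen only for illustration.

```python
def change_range(score: float) -> float:
    """Map an STSB score in [0, 5] onto the cosine-similarity range [-1, 1]."""
    return score / 2.5 - 1


for score in (0.0, 2.5, 4.8, 5.0):
    print(f"STSB {score} -> target {change_range(score):+.2f}")
```

The end points 0 and 5 land on -1 and 1, and the midpoint 2.5 lands on 0, so the targets cover exactly the range the Siamese network can output.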

In [4]:
TRAIN_BATCH_SIZE = 6
VALIDATION_BATCH_SIZE = 6

TRAIN_NUM_BATCHES = 300
VALIDATION_NUM_BATCHES = 40

AUTOTUNE = tf.data.AUTOTUNE  # non-experimental alias of tf.data.experimental.AUTOTUNE

In [5]:
def change_range(x):
    return (x / 2.5) - 1

def prepare_dataset(dataset, num_batches, batch_size):
    dataset = dataset.map(
        lambda z: (
            [z["sentence1"], z["sentence2"]],
            [tf.cast(change_range(z["label"]), tf.float32)],
        ),
        num_parallel_calls=AUTOTUNE,
    )
    dataset = dataset.batch(batch_size)
    dataset = dataset.take(num_batches)
    dataset = dataset.prefetch(AUTOTUNE)
    return dataset

stsb_ds = tfds.load(
    "glue/stsb",
)

stsb_train, stsb_valid = stsb_ds["train"], stsb_ds["validation"]

stsb_train = prepare_dataset(stsb_train, TRAIN_NUM_BATCHES, TRAIN_BATCH_SIZE)
stsb_valid = prepare_dataset(stsb_valid, VALIDATION_NUM_BATCHES, VALIDATION_BATCH_SIZE)

Downloading and preparing dataset 784.05 KiB (download: 784.05 KiB, generated: 1.58 MiB, total: 2.34 MiB) to /root/tensorflow_datasets/glue/stsb/2.0.0...


Dataset glue downloaded and prepared to /root/tensorflow_datasets/glue/stsb/2.0.0. Subsequent calls will reuse this data.


In [6]:
## Let's see examples from the dataset
for x, y in stsb_train.take(1):
    for i, example in enumerate(x):
        print(f"Sentence 1: {example[0].numpy().decode('utf-8')}")
        print(f"Sentence 2: {example[1].numpy().decode('utf-8')}")
        print(f"Similarity: {y[i].numpy()} \n")
    break

Sentence 1: A young girl is sitting on Santa's lap.
Sentence 2: A little girl is sitting on Santa's lap
Similarity: [0.9200001] 

Sentence 1: A women sitting at a table drinking with a basketball picture in the background.
Sentence 2: A woman in a sari drinks something while sitting at a table.
Similarity: [0.03999996] 

Sentence 1: Norway marks anniversary of massacre
Sentence 2: Norway Marks Anniversary of Breivik's Massacre
Similarity: [0.52] 

Sentence 1: US drone kills six militants in Pakistan: officials
Sentence 2: US missiles kill 15 in Pakistan: officials
Similarity: [-0.03999996] 

Sentence 1: On Tuesday, the central bank left interest rates steady, as expected, but also declared that overall risks were weighted toward weakness and warned of deflation risks.
Sentence 2: The central bank's policy board left rates steady for now, as widely expected, but surprised the market by declaring that overall risks were weighted toward weakness.
Similarity: [0.6] 

Sentence 1: At one of 

## Build the Encoder Model
To build the encoder model for producing sentence embeddings, the components include:

1. **Preprocessor Layer**: Tokenizes sentences and generates padding masks.

2. **Backbone Model**: Generates contextual representations for each token in the sentence.

3. **Mean Pooling Layer**: Uses `keras.layers.GlobalAveragePooling1D` to create embeddings by averaging the backbone outputs, excluding padded tokens.

4. **Normalization Layer**: Normalizes the embeddings for cosine similarity calculations.

In [7]:
preprocessor = keras_nlp.models.RobertaPreprocessor.from_preset("roberta_base_en")
backbone = keras_nlp.models.RobertaBackbone.from_preset("roberta_base_en")

inputs = keras.Input(shape=(1,), dtype="string", name="sentence")
x = preprocessor(inputs)
h = backbone(x)

embedding = keras.layers.GlobalAveragePooling1D(name="pooling_layer")(
    h, x["padding_mask"]
)
n_embedding = keras.layers.UnitNormalization(axis=1)(embedding)

roberta_normal_encoder = keras.Model(inputs=inputs, outputs=n_embedding)

roberta_normal_encoder.summary()



## Build the Siamese network with a regression objective function

1. **Single Encoder Usage**: We have one encoder model that processes both sentences, ensuring shared weights between the two paths.
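2. **Dual Paths**: Each sentence is passed through the encoder separately to generate its embedding.
3. **Normalization**: The embeddings from both paths are normalized to unit length (this happens inside the encoder's `UnitNormalization` layer).
4. **Cosine Similarity Calculation**: The normalized embeddings are multiplied (`u · vᵀ`). Because the vectors have unit length, each dot product is a cosine similarity. For a batch, the result is a matrix of scores between every first sentence and every second sentence, with each pair's own score on the diagonal; this matrix is the output of the Siamese network.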
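The relationship between the matrix product and per-pair cosine similarity can be checked with a small NumPy sketch. Random vectors stand in for encoder outputs here, and `unit_normalize` is an illustrative helper, not a Keras function.

```python
import numpy as np

rng = np.random.default_rng(0)


def unit_normalize(x):
    # Scale every row to length 1, as a unit-normalization layer would.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


batch, dim = 3, 8
u = unit_normalize(rng.normal(size=(batch, dim)))  # first sentences
v = unit_normalize(rng.normal(size=(batch, dim)))  # second sentences

scores = u @ v.T                  # all-vs-all similarities, shape (batch, batch)
per_pair = np.sum(u * v, axis=1)  # similarity of each (u_i, v_i) pair only

print(scores.shape)
print(np.allclose(np.diag(scores), per_pair))
```

The diagonal of `scores` holds the similarity of each labeled pair, while the off-diagonal entries compare sentences that belong to different pairs.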

In [13]:
class RegressionSiamese(keras.Model):
    def __init__(self, encoder, **kwargs):
        inputs = keras.Input(shape=(2,), dtype="string", name="sentences")
        sen1, sen2 = keras.ops.split(inputs, 2, axis=1)
        u = encoder(sen1)
        v = encoder(sen2)
        cosine_similarity_scores = keras.ops.matmul(u, keras.ops.transpose(v))

        super().__init__(
            inputs=inputs,
            outputs=cosine_similarity_scores,
            **kwargs,
        )

        self.encoder = encoder

    def get_encoder(self):
        return self.encoder

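## Try the encoder before training

Before fine-tuning, we embed a few sentences with the pretrained encoder and score them against a query, so the results can be compared with the scores after training.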


In [9]:
sentences = [
    "Knowledge is power.",
    "Andre Kapathy, is a great ai researcher",
    "Nvidia made it possible for ai revolution",
]

query = ["Robot will save the humanity"]

encoder = roberta_normal_encoder

sentence_embeddings = encoder(tf.constant(sentences))
query_embedding = encoder(tf.constant(query))

cosine_similarity_scores = tf.matmul(
    query_embedding,
    tf.transpose(sentence_embeddings),
)

# cosine similarity score between sentences and the query
for i, sim in enumerate(cosine_similarity_scores.numpy()[0]):
    print(f"{sentences[i]}:{sim}")

Knowledge is power.:0.9599609375
Andre Kapathy, is a great ai researcher:0.9609375
Nvidia made it possible for ai revolution:0.96728515625


In [10]:
## Train the Model
roberta_regression_siamese = RegressionSiamese(roberta_normal_encoder)

roberta_regression_siamese.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-5),
    loss=keras.losses.MeanSquaredError(),
    jit_compile=False,
)

roberta_regression_siamese.fit(
    stsb_train,
    epochs=1,
    validation_data=stsb_valid,
)

300/300 ━━━━━━━━━━━━━━━━━━━━ 326s 789ms/step - loss: 0.5186 - val_loss: 0.4272


<keras.src.callbacks.history.History at 0x7be9cc394670>

In [27]:
## Let's try the model after training
sentences = [
    "Knowledge is power.",
    "Andre Kapathy, is a great ai researcher",
    "Nvidia made it possible for ai revolution",
]

query = ["Robot will save the humanity"]

encoder = roberta_regression_siamese.get_encoder()

sentence_embeddings = encoder(tf.constant(sentences))
query_embedding = encoder(tf.constant(query))

cosine_similarity_scores = tf.matmul(
    query_embedding,
    tf.transpose(sentence_embeddings),
)

# cosine similarity score between sentences and the query
for i, sim in enumerate(cosine_similarity_scores.numpy()[0]):
    print(f"{sentences[i]}:{sim}")

## Fine-tuning with the Triplet Objective Function

1. **Triplet Objective Function**:
   - **Sentences**: Three sentences (anchor, positive, and negative) are passed to the Siamese network.
   - **Similarity**: Anchor and positive sentences are semantically similar, while anchor and negative sentences are dissimilar.
   - **Objective**: Minimize the distance between anchor and positive sentences, and maximize the distance between anchor and negative sentences.

2. **Dataset**:
   - **Dataset Used**: Wikipedia-sections-triplets dataset.
   - **Data Structure**: Contains triplets of sentences—anchor, positive, and negative.
   - **Source**: Anchor and positive sentences are from the same section; anchor and negative sentences are from different sections.
   - **Size**: 1.8 million training triplets and 220,000 test triplets. For this example, we'll use 1,200 triplets for training (200 batches of 6) and 450 for testing (75 batches of 6).

In [17]:
!wget https://sbert.net/datasets/wikipedia-sections-triplets.zip -q
!unzip wikipedia-sections-triplets.zip -d wikipedia-sections-triplets

Archive:  wikipedia-sections-triplets.zip
  inflating: wikipedia-sections-triplets/validation.csv  
  inflating: wikipedia-sections-triplets/Readme.txt  
  inflating: wikipedia-sections-triplets/test.csv  
  inflating: wikipedia-sections-triplets/train.csv  


In [20]:
NUM_TRAIN_BATCHES = 200
NUM_TEST_BATCHES = 75
AUTOTUNE = tf.data.AUTOTUNE  # non-experimental alias of tf.data.experimental.AUTOTUNE


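def prepare_wiki_data(dataset, num_batches):
    # The label is a dummy 0: TripletLoss ignores y_true and works only on
    # the two distances returned by the model.
    dataset = dataset.map(
        lambda z: ((z["Sentence1"], z["Sentence2"], z["Sentence3"]), 0)
    )
    dataset = dataset.batch(6)  # 6 triplets per batch
    dataset = dataset.take(num_batches)
    dataset = dataset.prefetch(AUTOTUNE)
    return dataset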


wiki_train = tf.data.experimental.make_csv_dataset(
    "wikipedia-sections-triplets/train.csv",
    batch_size=1,
    num_epochs=1,
)
wiki_test = tf.data.experimental.make_csv_dataset(
    "wikipedia-sections-triplets/test.csv",
    batch_size=1,
    num_epochs=1,
)

wiki_train = prepare_wiki_data(wiki_train, NUM_TRAIN_BATCHES)
wiki_test = prepare_wiki_data(wiki_test, NUM_TEST_BATCHES)

## Build the encoder model

To build the encoder model using RoBERTa with mean pooling (without normalization), the components include:

1. **Preprocessor Layer**:
   - Tokenizes sentences and generates padding masks.

2. **Backbone Model**:
   - Uses the RoBERTa model to generate contextual representations for each token in the sentence.

3. **Mean Pooling Layer**:
   - Applies mean pooling to the output of the backbone model to produce sentence embeddings.

In [21]:
preprocessor = keras_nlp.models.RobertaPreprocessor.from_preset("roberta_base_en")
backbone = keras_nlp.models.RobertaBackbone.from_preset("roberta_base_en")
# Named `inputs` to avoid shadowing Python's built-in `input`
inputs = keras.Input(shape=(1,), dtype="string", name="sentence")

x = preprocessor(inputs)
h = backbone(x)
embedding = keras.layers.GlobalAveragePooling1D(name="pooling_layer")(
    h, x["padding_mask"]
)

roberta_encoder = keras.Model(inputs=inputs, outputs=embedding)


roberta_encoder.summary()

## Build the Siamese network with the triplet objective function

### Step-by-Step Implementation:

1. **Encoder Model**: Use the previously defined RoBERTa-based encoder model.
2. **Three Inputs**: Create inputs for the anchor, positive, and negative sentences.
3. **Pass Through Encoder**: Pass each of these inputs through the shared encoder to obtain embeddings.
4. **Distance Calculation**: Calculate the Euclidean distance between the anchor-positive and anchor-negative pairs.
5. **Model Definition**: Combine these steps into a single model.
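The distance step can be sketched with NumPy on made-up 2-D embeddings. The array names mirror those used in the model class that follows, but the values are purely illustrative.

```python
import numpy as np

# Made-up 2-D embeddings for a batch of two triplets.
ea = np.array([[0.0, 0.0], [1.0, 1.0]])  # anchors
ep = np.array([[3.0, 4.0], [1.0, 2.0]])  # positives
en = np.array([[6.0, 8.0], [1.0, 1.5]])  # negatives

# Euclidean distance per triplet: sqrt of the summed squared differences.
positive_dist = np.sqrt(np.sum(np.square(ea - ep), axis=1))
negative_dist = np.sqrt(np.sum(np.square(ea - en), axis=1))

# Stack along a new leading axis so the result has shape (2, batch):
# row 0 holds anchor-positive distances, row 1 anchor-negative distances.
output = np.stack([positive_dist, negative_dist], axis=0)

print(output.shape)
print(output)
```

In the second triplet the negative is closer to the anchor than the positive is, which is exactly the situation the triplet objective penalizes.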

In [22]:
class TripletSiamese(keras.Model):
    def __init__(self, encoder, **kwargs):
        anchor = keras.Input(shape=(1,), dtype="string")
        positive = keras.Input(shape=(1,), dtype="string")
        negative = keras.Input(shape=(1,), dtype="string")

        ea = encoder(anchor)
        ep = encoder(positive)
        en = encoder(negative)

        positive_dist = keras.ops.sum(keras.ops.square(ea - ep), axis=1)
        negative_dist = keras.ops.sum(keras.ops.square(ea - en), axis=1)

        positive_dist = keras.ops.sqrt(positive_dist)
        negative_dist = keras.ops.sqrt(negative_dist)

        output = keras.ops.stack([positive_dist, negative_dist], axis=0)

        super().__init__(
            inputs=[anchor, positive, negative],
            outputs=output,
            **kwargs,
        )

        self.encoder = encoder

    def get_encoder(self):
        return self.encoder

## Custom Triplet Loss Function
To implement the custom loss function for the triplet objective, we'll create a function that calculates the loss based on the distances between the anchor-positive and anchor-negative embeddings. The goal is to ensure that the positive distance is smaller than the negative distance by at least a specified margin.

The loss function will be defined as:

`loss = max(positive_dist - negative_dist + margin, 0)`

This loss function encourages the model to make the positive distance smaller than the negative distance by a margin, ensuring that similar sentences are closer in the embedding space while dissimilar sentences are farther apart.
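A worked example may help to see how the margin behaves. The sketch below evaluates the formula in NumPy on three made-up distance pairs; `triplet_loss` is an illustrative helper, separate from the Keras loss class defined next.

```python
import numpy as np


def triplet_loss(positive_dist, negative_dist, margin=1.0):
    # Hinge on the gap between distances: zero once the negative is
    # at least `margin` farther from the anchor than the positive.
    per_triplet = np.maximum(positive_dist - negative_dist + margin, 0.0)
    return per_triplet, float(per_triplet.mean())


# Made-up distances for three triplets.
positive_dist = np.array([0.5, 2.0, 1.0])
negative_dist = np.array([3.0, 1.0, 1.5])

per_triplet, loss = triplet_loss(positive_dist, negative_dist)
print("per-triplet:", per_triplet)
print("mean loss:  ", loss)
```

The first triplet already satisfies the margin and contributes nothing. The second has the negative closer than the positive and is penalized most. The third is ordered correctly but still falls inside the margin, so it contributes a small loss.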

In [23]:
class TripletLoss(keras.losses.Loss):
    def __init__(self, margin=1, **kwargs):
        super().__init__(**kwargs)
        self.margin = margin

    def call(self, y_true, y_pred):
        positive_dist, negative_dist = tf.unstack(y_pred, axis=0)

        losses = keras.ops.relu(positive_dist - negative_dist + self.margin)

        return keras.ops.mean(losses, axis=0)

## Fit the model

In [24]:
roberta_triplet_siamese = TripletSiamese(roberta_encoder)

roberta_triplet_siamese.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-5),
    loss=TripletLoss(),
    jit_compile=False,
)

roberta_triplet_siamese.fit(
    wiki_train,
    epochs=1,
    validation_data=wiki_test,
)

200/200 ━━━━━━━━━━━━━━━━━━━━ 373s 1s/step - loss: 0.7751 - val_loss: 0.6735


<keras.src.callbacks.history.History at 0x7be9b11010c0>

In [26]:
## Let's try this model in a clustering example
questions = [
    "What should I do to improve my English writting?",
    "How to be good at speaking English?",
    "How can I improve my English?",
    "How to earn money online?",
    "How do I earn money online?",
    "How to work and earn money through internet?",
]

encoder = roberta_triplet_siamese.get_encoder()

embeddings = encoder(tf.constant(questions))

kmeans = cluster.KMeans(n_clusters=2, random_state=0, n_init="auto").fit(embeddings)

for i, label in enumerate(kmeans.labels_):
    print(f"Sentence [{questions[i]}] belongs to cluster: {label}")

Sentence [What should I do to improve my English writting?] belongs to cluster: 1
Sentence [How to be good at speaking English?] belongs to cluster: 1
Sentence [How can I improve my English?] belongs to cluster: 1
Sentence [How to earn money online?] belongs to cluster: 0
Sentence [How do I earn money online?] belongs to cluster: 0
Sentence [How to work and earn money through internet?] belongs to cluster: 0
