# BERT for Sentence Similarity

Transformers work using word or token-level embeddings, not sentence-level embeddings.

To calculate accurate sentence similarity with BERT, the original approach was a cross-encoder structure. We pass both sentences to BERT together, add a classification head on top of BERT, and use it to output a similarity score.

![](https://cdn.sanity.io/images/vr8gru94/production/9a89f1b7dddd4c78da8b9ba0311c2ffd1ff18ffe-1920x1080.png)

The BERT cross-encoder architecture consists of a BERT model which consumes sentences A and B. Both are processed in the same sequence, separated by a [SEP] token. All of this is followed by a feedforward NN classifier that outputs a similarity score.

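The cross-encoder network is very accurate but not scalable: every pair of sentences needs its own forward pass through BERT, so nothing can be computed ahead of time.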
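To get a feel for the scaling problem, it helps to count forward passes. The sketch below is a back-of-the-envelope calculation, not a benchmark; the function names are illustrative and not part of any library. It compares scoring every unordered pair with a cross-encoder against embedding each sentence once.

```python
from math import comb


def cross_encoder_passes(n_sentences: int) -> int:
    """Forward passes needed to score every unordered sentence pair."""
    return comb(n_sentences, 2)  # n * (n - 1) / 2


def sentence_encoder_passes(n_sentences: int) -> int:
    """Forward passes needed to embed every sentence exactly once."""
    return n_sentences


for n in (10, 1_000, 10_000):
    print(n, cross_encoder_passes(n), sentence_encoder_passes(n))
```

For 10,000 sentences this is roughly 50 million pair-wise passes versus 10,000 single-sentence passes, after which the embeddings can be compared with a cheap vector operation.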

The solution to this lack of an accurate model with reasonable latency was designed by Nils Reimers and Iryna Gurevych in 2019 with the introduction of sentence-BERT (SBERT) and the sentence-transformers library.



# Sentence Transformers

SBERT is fine-tuned on sentence pairs using a siamese architecture. It uses mean pooling on the final output layer to produce a sentence embedding.
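Mean pooling can be sketched in a few lines of plain Python. This is a minimal illustration, assuming padded positions are excluded through an attention mask; the function name and the toy numbers are made up for the example and are not taken from the library.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    n_real = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            n_real += 1
            for k, value in enumerate(vector):
                summed[k] += value
    if n_real == 0:
        raise ValueError("attention_mask contains no real tokens")
    return [total / n_real for total in summed]


# Toy input: three real tokens and one padding token, two dimensions each.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [3.0, 4.0]
```

However many tokens the sentence has, the result has the same dimension as a single token vector, which is what makes sentences of different lengths directly comparable.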

![](https://cdn.sanity.io/images/vr8gru94/production/2425dc0efd3f73a0bf57b3bf85a091c78619ec2c-1920x1110.png)

## Siamese BERT Pre-Training

The softmax-loss approach used the ‘siamese’ architecture fine-tuned on the Stanford Natural Language Inference (SNLI) and Multi-Genre NLI (MNLI) corpora.

SNLI contains 570K sentence pairs, and MNLI contains 430K. The pairs in both corpora include a premise and a hypothesis. Each pair is assigned one of three labels:

- 0: entailment, i.e. the premise suggests the hypothesis.
- 1: neutral, the premise and hypothesis could both be true, but they are not necessarily related.
- 2: contradiction, the premise and hypothesis contradict each other.

Given this data, we feed sentence A (let’s say the premise) into siamese BERT A and sentence B (hypothesis) into siamese BERT B.

The siamese BERT outputs our pooled sentence embeddings. The mean-pooling approach was best performing for both NLI and STSb datasets.

There are now two sentence embeddings. We will call embedding A `u` and embedding B `v`. The next step is to concatenate `u` and `v`. Several concatenation approaches were tested, and the highest performing was the `(u, v, |u-v|)` operation:




![](https://cdn.sanity.io/images/vr8gru94/production/c78a83baccb40c331a92ddb25d8a1e4c97e397ed-1920x840.png)

`|u-v|` is the element-wise absolute difference between the two vectors. The `u`, `v`, and `|u-v|` vectors are all fed into a feedforward neural network (FFNN) with three outputs, which align with the NLI labels 0, 1, and 2.
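The concatenation step can be illustrated without any deep learning library. The following is a small sketch with made-up three-dimensional vectors; `nli_features` is a hypothetical helper name, not a function from sentence-transformers.

```python
def nli_features(u, v):
    """Concatenate u, v and the element-wise absolute difference |u - v|."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same dimension")
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    return list(u) + list(v) + abs_diff


u = [0.5, -1.0, 2.0]
v = [1.5, -1.0, 0.0]
features = nli_features(u, v)
print(features)  # [0.5, -1.0, 2.0, 1.5, -1.0, 0.0, 1.0, 0.0, 2.0]
```

The classifier input is therefore three times the embedding dimension.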

Next, we apply a softmax to the FFNN outputs, which is done within the cross-entropy loss function. The softmax probabilities and the labels are then used to optimize this ‘softmax-loss’.
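As a rough illustration of what happens inside that loss, here is a plain-Python sketch of softmax followed by cross-entropy for a single training pair. The logits are invented for the example; a real training loop would use a framework's built-in loss rather than this hand-written version.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities (max subtracted for numerical stability)."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])


# Hypothetical FFNN outputs for the classes (entailment, neutral, contradiction).
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)
print("probabilities:", [round(p, 4) for p in probs])
print("loss if the true label is 0:", round(cross_entropy(logits, 0), 4))
print("loss if the true label is 2:", round(cross_entropy(logits, 2), 4))
```

The loss is small when the network puts most of its probability on the correct label and large otherwise, which is the signal used to update the shared BERT weights.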

![](https://cdn.sanity.io/images/vr8gru94/production/a7bc429139dfb58998cee4fe84341ef5b66f2019-1920x990.png)

# Sentence Transformer Quickstart

In [1]:
!pip install -U sentence-transformers


In [2]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

# The sentences we want to encode
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.',
    'The quick brown fox jumps over the lazy dog.']

#Sentences are encoded by calling model.encode()
sentence_embeddings = model.encode(sentences)

#Print the embeddings
for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")


Sentence: This framework generates embeddings for each input sentence
Embedding: [-1.37173552e-02 -4.28515449e-02 -1.56286024e-02  1.40537303e-02
  3.95537727e-02  1.21796280e-01  2.94334106e-02 -3.17524187e-02
  3.54959629e-02 -7.93139935e-02  1.75878741e-02 -4.04369719e-02
  4.97259349e-02  2.54912246e-02 -7.18700588e-02  8.14968869e-02
  1.47069141e-03  4.79626991e-02 -4.50336412e-02 -9.92174670e-02
 -2.81769745e-02  6.45046085e-02  4.44670543e-02 -4.76217009e-02
 -3.52952331e-02  4.38671783e-02 -5.28566055e-02  4.33063833e-04
  1.01921506e-01  1.64072234e-02  3.26996595e-02 -3.45986746e-02
  1.21339476e-02  7.94870779e-02  4.58345609e-03  1.57777797e-02
 -9.68206208e-03  2.87625659e-02 -5.05805984e-02 -1.55793717e-02
 -2.87906546e-02 -9.62280575e-03  3.15556750e-02  2.27349028e-02
  8.71449187e-02 -3.85027491e-02 -8.84718448e-02 -8.75498448e-03
 -2.12343335e-02  2.08923239e-02 -9.02077407e-02 -5.25732562e-02
 -1.05638904e-02  2.88310610e-02 -1.61455162e-02  6.17837207e-03
 -1.23234

In [3]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    'A woman is cooking dinner.',
    'A child is eating a slice of pizza.',
    'The boy is holding a puppy.',
    'A girl is riding a bicycle.',
    'A man is playing the guitar.',
    'Three friends are hiking in the mountains.',
    'A woman is riding a black horse in an open field.',
    'A parrot is singing on a perch.',
    'Someone in a superhero costume is playing a keyboard.',
    'A group of friends is having a barbecue by the beach.'
]
#Encode all sentences
embeddings = model.encode(sentences)

#Compute cosine similarity between all pairs
cos_sim = util.cos_sim(embeddings, embeddings)


In [4]:
#Add all pairs to a list with their cosine similarity score
all_sentence_combinations = []
for i in range(len(cos_sim)-1):
    for j in range(i+1, len(cos_sim)):
        all_sentence_combinations.append([cos_sim[i][j], i, j])

#Sort list by the highest cosine similarity score
all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)

print("Top-5 most similar pairs:")
for score, i, j in all_sentence_combinations[0:5]:
    print("{} \t {} \t {:.4f}".format(sentences[i], sentences[j], cos_sim[i][j]))


Top-5 most similar pairs:
Three friends are hiking in the mountains. 	 A group of friends is having a barbecue by the beach. 	 0.3594
A man is playing the guitar. 	 Someone in a superhero costume is playing a keyboard. 	 0.2693
A man is playing the guitar. 	 A parrot is singing on a perch. 	 0.2659
A child is eating a slice of pizza. 	 A girl is riding a bicycle. 	 0.2262
A woman is cooking dinner. 	 A woman is riding a black horse in an open field. 	 0.1560
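The scores above are cosine similarities: the dot product of two vectors divided by the product of their lengths. The sketch below reproduces the same pair-ranking idea with the standard library only. The two-dimensional vectors and the keys `s0`, `s1`, `s2` are toy values chosen for illustration, and the helper assumes non-zero vectors.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" standing in for model output.
toy_embeddings = {
    "s0": [1.0, 0.0],
    "s1": [1.0, 1.0],
    "s2": [0.0, 1.0],
}

# Score every unordered pair once, then sort from most to least similar.
pairs = sorted(
    (
        (cosine_similarity(toy_embeddings[a], toy_embeddings[b]), a, b)
        for a, b in combinations(toy_embeddings, 2)
    ),
    reverse=True,
)
for score, a, b in pairs:
    print(f"{a}\t{b}\t{score:.4f}")
```

`itertools.combinations` visits each unordered pair once, which is the same set of pairs the nested `i`/`j` loops above enumerate.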


### Pre-Trained Models
https://www.sbert.net/docs/pretrained_models.html

# Computing Sentence Embeddings

In [5]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

# The sentences we want to encode
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.',
    'The quick brown fox jumps over the lazy dog.']

#Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)

#Print the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")


Sentence: This framework generates embeddings for each input sentence
Embedding: [-1.37173552e-02 -4.28515449e-02 -1.56286024e-02  1.40537303e-02
  3.95537727e-02  1.21796280e-01  2.94334106e-02 -3.17524187e-02
  3.54959629e-02 -7.93139935e-02  1.75878741e-02 -4.04369719e-02
  4.97259349e-02  2.54912246e-02 -7.18700588e-02  8.14968869e-02
  1.47069141e-03  4.79626991e-02 -4.50336412e-02 -9.92174670e-02
 -2.81769745e-02  6.45046085e-02  4.44670543e-02 -4.76217009e-02
 -3.52952331e-02  4.38671783e-02 -5.28566055e-02  4.33063833e-04
  1.01921506e-01  1.64072234e-02  3.26996595e-02 -3.45986746e-02
  1.21339476e-02  7.94870779e-02  4.58345609e-03  1.57777797e-02
 -9.68206208e-03  2.87625659e-02 -5.05805984e-02 -1.55793717e-02
 -2.87906546e-02 -9.62280575e-03  3.15556750e-02  2.27349028e-02
  8.71449187e-02 -3.85027491e-02 -8.84718448e-02 -8.75498448e-03
 -2.12343335e-02  2.08923239e-02 -9.02077407e-02 -5.25732562e-02
 -1.05638904e-02  2.88310610e-02 -1.61455162e-02  6.17837207e-03
 -1.23234

## Input Sequence Length

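Each model has a maximum sequence length measured in word pieces, and longer inputs are truncated. The library documentation cites a default limit of 128 word pieces, although `all-MiniLM-L6-v2` reports 256 in the output below. You can get and set the maximum sequence length like this: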



In [6]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

print("Max Sequence Length:", model.max_seq_length)

#Change the length to 200
model.max_seq_length = 200

print("Max Sequence Length:", model.max_seq_length)


Max Sequence Length: 256
Max Sequence Length: 200


## Storing & Loading Embeddings

from sentence_transformers import SentenceTransformer
import pickle

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.',
    'The quick brown fox jumps over the lazy dog.']


embeddings = model.encode(sentences)

#Store sentences & embeddings on disc
with open('embeddings.pkl', "wb") as fOut:
    pickle.dump({'sentences': sentences, 'embeddings': embeddings}, fOut, protocol=pickle.HIGHEST_PROTOCOL)

#Load sentences & embeddings from disc
with open('embeddings.pkl', "rb") as fIn:
    stored_data = pickle.load(fIn)
    stored_sentences = stored_data['sentences']
    stored_embeddings = stored_data['embeddings']


In [7]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Single list of sentences
sentences = [
    'The dog runs in the park.',
    'A group of friends is having a picnic.',
    'I enjoy reading books in my free time.',
    'The sunsets are beautiful in this city.',
    'A little girl is playing with her toys.',
    'We are going to the beach this weekend.',
    'The coffee at the new cafe is delicious.',
    'Have you ever tried sushi?',
    'A man is jogging in the morning.',
    'The concert last night was incredible.'
]

#Compute embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)

#Compute cosine-similarities for each sentence with each other sentence
cosine_scores = util.cos_sim(embeddings, embeddings)

#Find the pairs with the highest cosine similarity scores
pairs = []
for i in range(len(cosine_scores)-1):
    for j in range(i+1, len(cosine_scores)):
        pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})

#Sort scores in decreasing order
pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)

for pair in pairs[0:10]:
    i, j = pair['index']
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], pair['score']))


The dog runs in the park. 		 A man is jogging in the morning. 		 Score: 0.2394
The sunsets are beautiful in this city. 		 We are going to the beach this weekend. 		 Score: 0.2189
The sunsets are beautiful in this city. 		 The concert last night was incredible. 		 Score: 0.1737
The sunsets are beautiful in this city. 		 The coffee at the new cafe is delicious. 		 Score: 0.1624
We are going to the beach this weekend. 		 Have you ever tried sushi? 		 Score: 0.1570
Have you ever tried sushi? 		 The concert last night was incredible. 		 Score: 0.1497
We are going to the beach this weekend. 		 The concert last night was incredible. 		 Score: 0.1474
The dog runs in the park. 		 The coffee at the new cafe is delicious. 		 Score: 0.1213
The coffee at the new cafe is delicious. 		 The concert last night was incredible. 		 Score: 0.1205
A group of friends is having a picnic. 		 We are going to the beach this weekend. 		 Score: 0.1200


## Semantic Search

Semantic search seeks to improve search accuracy by understanding the content of the search query. In contrast to traditional search engines which only find documents based on lexical matches, semantic search can also find synonyms.
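At its core, the retrieval step is a nearest-neighbour lookup: embed the query, score it against every corpus embedding, and keep the best few. The sketch below shows only that ranking step, using the standard library and invented two-dimensional vectors in place of real embeddings; `top_k` is a hypothetical helper, not a library function.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, corpus_vecs, k):
    """Return the k best (score, corpus_index) tuples, highest score first."""
    scored = ((cosine(query_vec, vec), idx) for idx, vec in enumerate(corpus_vecs))
    return heapq.nlargest(k, scored)


# Toy vectors standing in for corpus and query embeddings.
corpus_vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
query_vec = [0.9, 0.1]

hits = top_k(query_vec, corpus_vecs, k=2)
for score, idx in hits:
    print(f"corpus[{idx}] score={score:.4f}")
```

Because the corpus embeddings do not depend on the query, they can be computed once and reused for every search.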



In [8]:
"""
This is a simple application for sentence embeddings: semantic search

We have a corpus with various sentences. Then, for a given query sentence,
we want to find the most similar sentence in this corpus.

This script outputs for various queries the top 5 most similar sentences in the corpus.
"""
from sentence_transformers import SentenceTransformer, util
import torch

embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Corpus with example sentences
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.'
          ]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = ['A man is eating pasta.', 'Someone in a gorilla costume is playing a set of drums.', 'A cheetah chases prey on across a field.']


# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # We use cosine-similarity and torch.topk to find the highest 5 scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))

    """
    # Alternatively, we can also use util.semantic_search to perform cosine similarity + topk
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=5)
    hits = hits[0]      #Get the hits for the first query
    for hit in hits:
        print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))
    """






Query: A man is eating pasta.

Top 5 most similar sentences in corpus:
A man is eating food. (Score: 0.7035)
A man is eating a piece of bread. (Score: 0.5272)
A man is riding a horse. (Score: 0.1889)
A man is riding a white horse on an enclosed ground. (Score: 0.1047)
A cheetah is running behind its prey. (Score: 0.0980)




Query: Someone in a gorilla costume is playing a set of drums.

Top 5 most similar sentences in corpus:
A monkey is playing drums. (Score: 0.6433)
A woman is playing violin. (Score: 0.2564)
A man is riding a horse. (Score: 0.1389)
A man is riding a white horse on an enclosed ground. (Score: 0.1191)
A cheetah is running behind its prey. (Score: 0.1080)




Query: A cheetah chases prey on across a field.

Top 5 most similar sentences in corpus:
A cheetah is running behind its prey. (Score: 0.8253)
A man is eating food. (Score: 0.1399)
A monkey is playing drums. (Score: 0.1292)
A man is riding a white horse on an enclosed ground. (Score: 0.1097)
A man is riding a 