# Semantic Search with Word Embeddings using BERT Transformers

This notebook shows how to run semantic similarity search with pretrained sentence embeddings from the `sentence-transformers` library, which wraps BERT-style transformer models. The examples compute sentence embeddings, compare them with cosine similarity, and use the resulting scores to retrieve related content.


In [None]:
%pip install sentence-transformers

### Code Snippet 1: Sentence Embeddings and Similarity Calculation
This snippet uses the pretrained `all-MiniLM-L6-v2` model to convert two lists of sentences into embeddings, then computes the cosine similarity between them. `util.cos_sim` returns a score for every pair across the two lists; the loop prints only the score for each sentence in `sentences1` against the sentence at the same position in `sentences2`. A higher score means the two sentences are closer in meaning.

In [2]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

# Two lists of sentences
sentences1 = ['The cat sits outside',
              'A man is playing guitar',
              'The new movie is awesome']

sentences2 = ['The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']

# Compute embeddings for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

# Compute cosine similarities for every pair (sentences1[i], sentences2[j])
cosine_scores = util.cos_sim(embeddings1, embeddings2)

# Output the pairs at matching positions with their score
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i]))

The cat sits outside 		 The dog plays in the garden 		 Score: 0.2838
A man is playing guitar 		 A woman watches TV 		 Score: -0.0327
The new movie is awesome 		 The new movie is so great 		 Score: 0.8939
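
To make the scores above easier to interpret, here is a minimal sketch of the cosine similarity formula in plain Python. It is not how `util.cos_sim` is implemented; the function name `cosine_similarity` and the 3-dimensional vectors are made up for illustration, whereas real sentence embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Assumes neither vector is all zeros; otherwise this divides by zero
    return dot / (norm_u * norm_v)

# Made-up toy vectors, for illustration only
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])

print(round(same_direction, 4), round(orthogonal, 4), round(opposite, 4))
```

Because the measure depends only on the angle between the vectors, scores fall between -1 and 1 regardless of vector length. That is why a slightly negative score such as -0.0327 is possible: it indicates two sentences that are essentially unrelated.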


### Code Snippet 2: Semantic Search for Similar Queries

This snippet uses sentence embeddings for semantic search. It embeds a small corpus once, then embeds each query and ranks the corpus sentences by their cosine similarity to that query. The same pattern applies to retrieving relevant documents or articles for user-entered queries.

In [4]:
from sentence_transformers import SentenceTransformer, util
import torch

embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Corpus with example sentences
corpus = ['is tiktok getting banned.',
          'is tiktok shutting down.',
          'is tiktok banned in us.',
          'is tiktok getting deleted.',
          'is tiktok a competitor for instagram.',
          'is instagram a useful app',
          'something unrelevant'
          ]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = ['is tiktok']


# Find the top_k closest corpus sentences for each query, based on cosine similarity.
# With this 7-sentence corpus, top_k = 7 returns every sentence in ranked order.
top_k = min(7, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    # Use cosine similarity and torch.topk to find the top_k highest scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop {} most similar sentences in corpus:".format(top_k))

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: {:.4f})".format(score))

Query: is tiktok

Top 7 most similar sentences in corpus:
is tiktok banned in us. (Score: 0.7735)
is tiktok getting banned. (Score: 0.7716)
is tiktok getting deleted. (Score: 0.7651)
is tiktok a competitor for instagram. (Score: 0.7560)
is tiktok shutting down. (Score: 0.7308)
is instagram a useful app (Score: 0.1481)
something unrelevant (Score: 0.0925)
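
A ranked list returns as many results as requested, even when some are poor matches, as the last two scores above show. One possible way to handle this is a minimum-score cutoff. The sketch below is a hedged illustration rather than part of the original notebook: it reuses the scores printed above, and the name `MIN_SCORE` and the value 0.5 are assumptions that would need tuning on real data.

```python
# (sentence, score) pairs copied from the output above
results = [
    ("is tiktok banned in us.", 0.7735),
    ("is tiktok getting banned.", 0.7716),
    ("is tiktok getting deleted.", 0.7651),
    ("is tiktok a competitor for instagram.", 0.7560),
    ("is tiktok shutting down.", 0.7308),
    ("is instagram a useful app", 0.1481),
    ("something unrelevant", 0.0925),
]

MIN_SCORE = 0.5  # hypothetical cutoff, chosen only for this example

# Keep only the results whose similarity meets the cutoff
relevant = [(sentence, score) for sentence, score in results if score >= MIN_SCORE]

for sentence, score in relevant:
    print(f"{sentence} (Score: {score:.4f})")
```

With these scores, the cutoff keeps the five TikTok-related sentences and drops the two weak matches. The input order is preserved, so the filtered list is still ranked from most to least similar.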
