In [1]:
!pip install -U sentence-transformers
!pip install datasets
!pip install hnswlib



In [2]:
import datasets
from sentence_transformers import SentenceTransformer, CrossEncoder, util
from datetime import datetime
import pandas as pd
import numpy as np

# Semantic Search

One extension is to use a **transformer** model to compute **embeddings** for an entire corpus of sentences once, and then reuse those embeddings to answer any number of queries.

A straightforward algorithm: encode the query, compute a pairwise score (such as **cosine similarity**) between the query embedding and every embedding in the sentence corpus, then return the **k nearest neighbors**, i.e. the k sentences with the highest scores.
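To make the scoring step concrete, here is a minimal, dependency-free sketch of brute-force cosine-similarity search. The helper names (`cosine_similarity`, `top_k_neighbors`) and the 2-d toy vectors are made up for illustration; the cells below rely on `util.semantic_search` from `sentence_transformers` rather than on this code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors score 0
    return dot / (norm_a * norm_b)

def top_k_neighbors(query_vec, corpus_vecs, k):
    """Return (corpus_id, score) pairs for the k highest-scoring corpus vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 2-d "embeddings" (hypothetical values, not produced by any model)
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_query = [2.0, 0.0]

neighbors = top_k_neighbors(toy_query, toy_corpus, k=2)
for corpus_id, score in neighbors:
    print(f"{corpus_id}: {score:.2f}")
```

Because cosine similarity ignores vector length, the query `[2.0, 0.0]` matches `[1.0, 0.0]` perfectly even though their magnitudes differ. The search cost grows linearly with the corpus size, since every corpus vector is scored.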

In [3]:
stsb = datasets.load_dataset('mteb/stsbenchmark-sts')
transformer = SentenceTransformer('stsb-distilroberta-base-v2')
cross_encoder = CrossEncoder('cross-encoder/stsb-distilroberta-base', num_labels=1)

print(f"Train samples: {len(stsb['train'])}")
stsb['train'][0].keys()




Train samples: 5749


dict_keys(['split', 'genre', 'dataset', 'year', 'sid', 'score', 'sentence1', 'sentence2'])

In [4]:
sentences = np.unique(stsb['train']['sentence1'] + stsb['train']['sentence2'])
print(f"Total number of sentences: {len(sentences)}")

Total number of sentences: 10566


In [5]:
sentences_embeddings = transformer.encode(sentences, convert_to_tensor=True, show_progress_bar=True)
sentences_embeddings.shape


torch.Size([10566, 768])

In [6]:
query = "Obama is eating an icecream"
embedded_query = transformer.encode(query, convert_to_tensor=True)

In [7]:
# util.semantic_search returns one list of results per query
t_start = datetime.now()
results = pd.DataFrame(util.semantic_search(
    embedded_query, 
    sentences_embeddings, 
    score_function=util.cos_sim, top_k=10
)[0])
t_stop = datetime.now()
print(f"Search time: {t_stop - t_start}")

Search time: 0:00:00.032681


In [8]:
print(f"Query: \"{query}\"")
print("---------------------------------------")
for idx, row in results.iterrows():
    print(f"{idx + 1}) {row['score']:.2f} - {sentences[int(row['corpus_id'])]}")

Query: "Obama is eating an icecream"
---------------------------------------
1) 0.50 - Obama signs up for Obamacare
2) 0.48 - A man is eating.
3) 0.47 - A man is eating a bowl of cereal.
4) 0.46 - A woman is eating something.
5) 0.45 - A man is eating a banana.
6) 0.44 - The man is eating cereal.
7) 0.44 - Obama jokes about himself at reporters' dinner
8) 0.44 - A girl is eating a cupcake.
9) 0.44 - A man is eating his food.
10) 0.44 - Obama to sign up for Obamacare


## Cross Encoders

Alternatively, we can use a **CrossEncoder**, a model that takes two sentences as input and directly predicts their similarity score. Passing the query paired with each sentence in the corpus yields an **array of scores**, from which we can extract the **top k**.

This approach tends to produce better rankings, but it is slower: the model must run once per (query, sentence) pair, and nothing can be precomputed for the corpus ahead of time. A practical compromise is to retrieve the top k results with the previous algorithm and then **re-rank** only those candidates with the cross encoder.
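The retrieve-then-re-rank pattern can be sketched independently of any model. In the sketch below, `fast_score` and `slow_score` are hypothetical stand-ins (word overlap and Jaccard similarity) for the bi-encoder and the cross encoder; they are not what the notebook's models compute, and `retrieve_and_rerank` is an illustrative name, not a library function.

```python
def fast_score(query, sentence):
    """Cheap stand-in for the bi-encoder: number of shared words."""
    return len(set(query.split()) & set(sentence.split()))

def slow_score(query, sentence):
    """Stand-in for the cross encoder: Jaccard similarity of the word sets."""
    q, s = set(query.split()), set(sentence.split())
    return len(q & s) / len(q | s) if q | s else 0.0

def retrieve_and_rerank(query, corpus, k):
    """Keep the k best candidates by fast_score, then order them by slow_score."""
    # Stage 1: cheap scoring over the whole corpus
    candidates = sorted(corpus, key=lambda s: fast_score(query, s), reverse=True)[:k]
    # Stage 2: expensive scoring over the k candidates only
    rescored = [(slow_score(query, s), s) for s in candidates]
    return sorted(rescored, key=lambda pair: pair[0], reverse=True)

toy_docs = [
    "a man is eating a banana in the park",
    "a man is eating",
    "a woman is running",
    "the stock market fell",
]
reranked = retrieve_and_rerank("a man is eating", toy_docs, k=2)
for score, sentence in reranked:
    print(f"{score:.2f} - {sentence}")
```

The expensive scorer runs only `k` times instead of once per corpus sentence, which is the point of the two-stage design. The trade-off is that a relevant sentence missed by the first stage can never be recovered by the second.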

In [9]:
# Prepare model inputs
result_sentences = [sentences[int(row['corpus_id'])] for idx, row in results.iterrows()]
model_inputs = [[query, s] for s in result_sentences]

# Predict similarity score
scores = cross_encoder.predict(model_inputs)
# Print the result
print(f"Query: \"{query}\"")
print("---------------------------------------")
for i, idx in enumerate(np.argsort(-scores)):
    print(f"{i + 1}) {scores[idx]:.2f} - {result_sentences[idx]}")

Query: "Obama is eating an icecream"
---------------------------------------
1) 0.51 - A man is eating.
2) 0.44 - A man is eating his food.
3) 0.29 - The man is eating cereal.
4) 0.24 - A woman is eating something.
5) 0.24 - A man is eating a bowl of cereal.
6) 0.24 - A man is eating a banana.
7) 0.23 - A girl is eating a cupcake.
8) 0.11 - Obama jokes about himself at reporters' dinner
9) 0.08 - Obama signs up for Obamacare
10) 0.07 - Obama to sign up for Obamacare
