<h2>Sentence Transformers</h2>

**Resources:**

*   https://www.sbert.net/index.html
*   https://www.sbert.net/docs/pretrained_models.html



**Use cases:**



*   Sentence Embedding
*   Sentence Similarity
*   Semantic Search
*   Clustering

In [2]:
# !pip install -U sentence-transformers

**Generate Embeddings**

In [3]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

In [4]:
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.']


embeddings = model.encode(sentences)


for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Sentence: This framework generates embeddings for each input sentence
Embedding: [-1.37173282e-02 -4.28515524e-02 -1.56285744e-02  1.40537396e-02
  3.95537578e-02  1.21796273e-01  2.94333789e-02 -3.17523926e-02
  3.54959629e-02 -7.93139860e-02  1.75878331e-02 -4.04369682e-02
  4.97259609e-02  2.54912544e-02 -7.18700513e-02  8.14968869e-02
  1.47067406e-03  4.79627252e-02 -4.50336449e-02 -9.92174745e-02
 -2.81769522e-02  6.45046309e-02  4.44670394e-02 -4.76217046e-02
 -3.52952555e-02  4.38671894e-02 -5.28565869e-02  4.33016277e-04
  1.01921484e-01  1.64072458e-02  3.26996297e-02 -3.45987044e-02
  1.21339448e-02  7.94870928e-02  4.58343793e-03  1.57778151e-02
 -9.68207140e-03  2.87625752e-02 -5.05806245e-02 -1.55793792e-02
 -2.87906956e-02 -9.62282438e-03  3.15556824e-02  2.27348953e-02
  8.71449336e-02 -3.85027155e-02 -8.84718820e-02 -8.75501242e-03
 -2.12343279e-02  2.08923873e-02 -9.02077779e-02 -5.25732338e-02
 -1.05638886e-02  2.88310777e-02 -1.61454976e-02  6.17838465e-03
 -1.23234

**Cosine-Similarity**

In [5]:
emb1 = model.encode("I am eating Apple")
emb2 = model.encode("I like fruits")
cos_sim = util.cos_sim(emb1, emb2)
print("Cosine-Similarity:", cos_sim)

Cosine-Similarity: tensor([[0.5398]])
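
To make the score above concrete, here is a minimal sketch of the cosine similarity formula itself (the dot product divided by the product of the vector norms) written with NumPy. The helper name `cosine_similarity` and the toy vectors are illustrative assumptions, not part of the notebook or of the `sentence_transformers` API.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors standing in for real sentence embeddings.
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([2.0, 4.0, 6.0])   # same direction as v1, so similarity is close to 1
v3 = np.array([3.0, 0.0, -1.0])  # orthogonal to v1, so similarity is close to 0

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

The score depends only on direction, not on vector length, which is why `v2` (a scaled copy of `v1`) scores as a near-perfect match.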


**Compute cosine similarity between all pairs**

In [6]:
# Compute cosine similarity between all pairs

sentences = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'Someone in a gorilla costume is playing a set of drums.'
          ]

# Encode all sentences
embeddings = model.encode(sentences)

# Compute cosine similarity between all pairs
cos_sim = util.cos_sim(embeddings, embeddings)

cos_sim



tensor([[ 1.0000,  0.7553, -0.1050,  0.2474, -0.0704, -0.0333,  0.1707,  0.0476,
          0.0630],
        [ 0.7553,  1.0000, -0.0610,  0.1442, -0.0809, -0.0216,  0.1157,  0.0362,
          0.0216],
        [-0.1050, -0.0610,  1.0000, -0.1088,  0.0217, -0.0413, -0.0928,  0.0231,
          0.0247],
        [ 0.2474,  0.1442, -0.1088,  1.0000, -0.0348,  0.0362,  0.7369,  0.0821,
          0.1389],
        [-0.0704, -0.0809,  0.0217, -0.0348,  1.0000, -0.1654, -0.0592,  0.1961,
          0.2564],
        [-0.0333, -0.0216, -0.0413,  0.0362, -0.1654,  1.0000,  0.0769, -0.0380,
         -0.0895],
        [ 0.1707,  0.1157, -0.0928,  0.7369, -0.0592,  0.0769,  1.0000,  0.0495,
          0.1191],
        [ 0.0476,  0.0362,  0.0231,  0.0821,  0.1961, -0.0380,  0.0495,  1.0000,
          0.6433],
        [ 0.0630,  0.0216,  0.0247,  0.1389,  0.2564, -0.0895,  0.1191,  0.6433,
          1.0000]])

In [7]:
# Add all pairs to a list with their cosine similarity score
all_sentence_combinations = []
for i in range(len(cos_sim)-1):
    for j in range(i+1, len(cos_sim)):
        all_sentence_combinations.append((cos_sim[i][j], i, j))
all_sentence_combinations

[(tensor(0.7553), 0, 1),
 (tensor(-0.1050), 0, 2),
 (tensor(0.2474), 0, 3),
 (tensor(-0.0704), 0, 4),
 (tensor(-0.0333), 0, 5),
 (tensor(0.1707), 0, 6),
 (tensor(0.0476), 0, 7),
 (tensor(0.0630), 0, 8),
 (tensor(-0.0610), 1, 2),
 (tensor(0.1442), 1, 3),
 (tensor(-0.0809), 1, 4),
 (tensor(-0.0216), 1, 5),
 (tensor(0.1157), 1, 6),
 (tensor(0.0362), 1, 7),
 (tensor(0.0216), 1, 8),
 (tensor(-0.1088), 2, 3),
 (tensor(0.0217), 2, 4),
 (tensor(-0.0413), 2, 5),
 (tensor(-0.0928), 2, 6),
 (tensor(0.0231), 2, 7),
 (tensor(0.0247), 2, 8),
 (tensor(-0.0348), 3, 4),
 (tensor(0.0362), 3, 5),
 (tensor(0.7369), 3, 6),
 (tensor(0.0821), 3, 7),
 (tensor(0.1389), 3, 8),
 (tensor(-0.1654), 4, 5),
 (tensor(-0.0592), 4, 6),
 (tensor(0.1961), 4, 7),
 (tensor(0.2564), 4, 8),
 (tensor(0.0769), 5, 6),
 (tensor(-0.0380), 5, 7),
 (tensor(-0.0895), 5, 8),
 (tensor(0.0495), 6, 7),
 (tensor(0.1191), 6, 8),
 (tensor(0.6433), 7, 8)]

In [8]:
# Sort list by the highest cosine similarity score
all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)

print("Top-5 most similar pairs:")
for score, i, j in all_sentence_combinations[0:5]:
    print("{} \t {} \t {:.4f}".format(sentences[i], sentences[j], cos_sim[i][j]))

Top-5 most similar pairs:
A man is eating food. 	 A man is eating a piece of bread. 	 0.7553
A man is riding a horse. 	 A man is riding a white horse on an enclosed ground. 	 0.7369
A monkey is playing drums. 	 Someone in a gorilla costume is playing a set of drums. 	 0.6433
A woman is playing violin. 	 Someone in a gorilla costume is playing a set of drums. 	 0.2564
A man is eating food. 	 A man is riding a horse. 	 0.2474
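
The nested loop and sort above can also be written in vectorized form. The sketch below is one possible alternative, assuming the similarity matrix is available as a NumPy array (a CPU tensor like `cos_sim` can typically be converted with `.numpy()`). The function name `top_k_pairs` and the toy matrix are hypothetical and only for illustration.

```python
import numpy as np

def top_k_pairs(sim, k):
    """Return the k highest-scoring (score, i, j) tuples with i < j."""
    sim = np.asarray(sim, dtype=float)
    # Indices of the strict upper triangle: every unordered pair exactly once,
    # with the diagonal (self-similarity) excluded.
    rows, cols = np.triu_indices(sim.shape[0], k=1)
    scores = sim[rows, cols]
    order = np.argsort(-scores)[:k]  # positions of the largest scores first
    return [(float(scores[n]), int(rows[n]), int(cols[n])) for n in order]

# Toy symmetric matrix; the values are illustrative, not model output.
toy_sim = np.array([
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.7],
    [0.3, 0.4, 0.7, 1.0],
])

for score, i, j in top_k_pairs(toy_sim, k=3):
    print(f"{i} \t {j} \t {score:.4f}")
```

For nine sentences the loop version is perfectly adequate; the vectorized form mainly helps when the number of sentences grows.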


**Semantic search**

In [9]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('clips/mfaq')



In [10]:

question = "<Q>How many models can I host on HuggingFace?"
answer_1 = "<A>All plans come with unlimited private models and datasets."
answer_2 = "<A>AutoNLP is an automatic way to train and deploy state-of-the-art NLP models, seamlessly integrated with the Hugging Face ecosystem."
answer_3 = "<A>Based on how much training data and model variants are created, we send you a compute cost and payment link - as low as $10 per job."

query_embedding = model.encode(question)
corpus_embeddings = model.encode([answer_1, answer_2, answer_3])

print(util.semantic_search(query_embedding, corpus_embeddings))

[[{'corpus_id': 0, 'score': 0.5646324753761292}, {'corpus_id': 2, 'score': 0.5142340064048767}, {'corpus_id': 1, 'score': 0.4730038344860077}]]
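
The result above is a list of hits per query, each with a `corpus_id` and a `score`, ordered from best to worst. As a rough illustration of the idea, the sketch below ranks a toy corpus by cosine similarity using NumPy only. It is a simplified stand-in, not the library's implementation; the function `rank_by_cosine`, its `top_k` parameter, and the 2-dimensional vectors are assumptions made for the example.

```python
import numpy as np

def rank_by_cosine(query_vec, corpus_vecs, top_k=10):
    """Rank corpus rows by cosine similarity to a single query, best first."""
    q = np.asarray(query_vec, dtype=float)
    c = np.asarray(corpus_vecs, dtype=float)
    # Normalize so that a plain dot product equals the cosine similarity.
    q = q / np.linalg.norm(q)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)[:top_k]
    return [{"corpus_id": int(i), "score": float(scores[i])} for i in order]

# Toy vectors standing in for the question and answer embeddings.
toy_query = np.array([1.0, 0.0])
toy_corpus = np.array([
    [0.0, 1.0],  # orthogonal to the query
    [2.0, 0.0],  # same direction as the query
    [1.0, 1.0],  # 45 degrees away from the query
])

hits = rank_by_cosine(toy_query, toy_corpus, top_k=2)
print(hits)
```

With real embeddings, `toy_query` and `toy_corpus` would be replaced by the outputs of `model.encode`.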


In [11]:
from transformers import pipeline

In [12]:
qa_model = pipeline("question-answering")
question = "How many models can I host on HuggingFace?"
context = "All plans come with unlimited private models and datasets."
qa_model(question = question, context = context)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.7017184495925903, 'start': 20, 'end': 29, 'answer': 'unlimited'}


**Clustering**

In [14]:
from sklearn.cluster import KMeans
import numpy as np

embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Corpus with example sentences
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'Horse is eating grass.',
          'A man is eating pasta.',
          'A Woman is eating Biryani.',
          'The girl is carrying a baby.',
          'The baby is carried by the woman',
          'A man is riding a horse.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'Someone in a gorilla costume is playing a set of drums.',
          'A cheetah is running behind its prey.',
          'A cheetah chases prey on across a field.',
          'The cheetah is chasing a man who is riding the horse.',
          'man and women with their baby are watching cheetah in zoo'
          ]
corpus_embeddings = embedder.encode(corpus)

# Normalize the embeddings to unit length
corpus_embeddings = corpus_embeddings /  np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
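
One likely reason for normalizing before K-Means: K-Means works with Euclidean distance, and for unit-length vectors the squared Euclidean distance equals 2 - 2 * cosine similarity, so clustering by distance then follows cosine similarity. The sketch below checks that identity numerically on random stand-in vectors; the array `toy` is an assumption for illustration and is not model output.

```python
import numpy as np

rng = np.random.default_rng(0)
toy = rng.normal(size=(5, 8))  # random stand-in for a batch of embeddings

# Same normalization as in the cell above: divide each row by its L2 norm.
unit = toy / np.linalg.norm(toy, axis=1, keepdims=True)

a, b = unit[0], unit[1]
squared_distance = float(np.sum((a - b) ** 2))
cosine = float(a @ b)

# For unit vectors these two quantities agree up to floating-point error.
print(squared_distance, 2 - 2 * cosine)
```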

In [15]:
corpus_embeddings[0]

array([ 3.32415737e-02,  4.40607546e-03, -6.27697585e-03,  4.83787097e-02,
       -1.38702288e-01, -3.36174145e-02,  1.01128317e-01, -5.43849505e-02,
       -4.32477817e-02, -3.99411060e-02,  7.78632099e-03, -1.27489204e-02,
       -6.68302476e-02, -1.73866171e-02,  4.74506095e-02, -5.77242747e-02,
        1.01888381e-01, -9.11665440e-04,  8.22614506e-02, -5.03416061e-02,
        6.77303225e-02,  4.08765338e-02, -3.58018801e-02, -1.00682430e-01,
       -6.69354992e-03, -5.31686470e-02,  1.00335173e-01, -5.46136573e-02,
       -2.28481703e-02,  1.38387196e-02,  7.48658553e-02, -6.17879778e-02,
        6.39215931e-02,  1.62387192e-02, -5.32299727e-02, -3.86084020e-02,
        3.15276235e-02, -8.11530128e-02, -3.31432074e-02, -5.38519525e-04,
       -3.96066438e-03, -1.52734043e-02, -9.86409956e-04,  9.57987458e-02,
       -5.42920344e-02,  1.84572507e-02, -1.07143581e-01,  1.38884652e-02,
        3.94072421e-02, -2.69243475e-02, -9.15989056e-02, -1.14195151e-02,
        3.38137224e-02, -

In [16]:
# source: https://stackoverflow.com/questions/55619176/how-to-cluster-similar-sentences-using-bert

clustering_model = KMeans(n_clusters=4)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_
print(cluster_assignment)


[2 2 2 2 2 0 0 2 2 3 3 1 1 1 1]


In [17]:
clustered_sentences = {}
for sentence_id, cluster_id in enumerate(cluster_assignment):
    if cluster_id not in clustered_sentences:
        clustered_sentences[cluster_id] = []

    clustered_sentences[cluster_id].append(corpus[sentence_id])
clustered_sentences

{2: ['A man is eating food.',
  'A man is eating a piece of bread.',
  'Horse is eating grass.',
  'A man is eating pasta.',
  'A Woman is eating Biryani.',
  'A man is riding a horse.',
  'A man is riding a white horse on an enclosed ground.'],
 0: ['The girl is carrying a baby.', 'The baby is carried by the woman'],
 3: ['A monkey is playing drums.',
  'Someone in a gorilla costume is playing a set of drums.'],
 1: ['A cheetah is running behind its prey.',
  'A cheetah chases prey on across a field.',
  'The cheetah is chasing a man who is riding the horse.',
  'man and women with their baby are watching cheetah in zoo']}
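
The grouping loop above can be written a little more compactly with `collections.defaultdict`. This is a sketch of an equivalent approach, not a change to the notebook's result; the helper `group_by_cluster` and the toy labels and items are hypothetical.

```python
from collections import defaultdict

def group_by_cluster(labels, items):
    """Collect items under their cluster label, with labels in ascending order."""
    groups = defaultdict(list)
    for label, item in zip(labels, items):
        # int() turns NumPy integer labels into plain Python ints.
        groups[int(label)].append(item)
    return dict(sorted(groups.items()))

# Toy labels and items standing in for cluster_assignment and corpus.
toy_labels = [2, 2, 0, 1, 0]
toy_items = ["a", "b", "c", "d", "e"]

print(group_by_cluster(toy_labels, toy_items))
```

Sorting the keys makes the printed dictionary easier to scan, since the cluster ids assigned by K-Means carry no meaning beyond identifying a group.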
