In this notebook, I use the Sentence Transformers library (plus a Hugging Face `transformers` pipeline for question answering) to perform the following NLP tasks:


*   Sentence Similarity
*   Semantic Search
*   Question Answering
*   Sentence Clustering



In [2]:
%%capture
!pip install -U sentence-transformers

In [3]:
import numpy as np
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline
from sklearn.cluster import KMeans

In [4]:
checkpoint = "all-MiniLM-L6-v2"
model = SentenceTransformer(checkpoint)

Let's first pass a couple of sentences to the model and inspect the embeddings it generates.

In [5]:
sentences = [
    'Australia is a wonderful tourist destination in December.',
    'I love when it rains in summer'
]

embeddings = model.encode(sentences)

for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding shape: ", embedding.shape)
    print("Embedding generated:\n", embedding, "\n")

Sentence: Australia is a wonderful tourist destination in December.
Embedding shape:  (384,)
Embedding generated:
 [ 8.04836154e-02  1.72129553e-02  2.09936518e-02  2.00939141e-02
  7.66017754e-03  3.73811200e-02  1.46356961e-02 -4.95522879e-02
 -4.56029698e-02  1.00384377e-01 -1.00349579e-02  7.08437059e-03
  4.33888063e-02  1.04353271e-01  4.61061858e-02 -3.71732637e-02
  4.61936404e-04 -8.76981467e-02  7.85304233e-02 -3.18054147e-02
  3.06846276e-02  1.30699296e-02 -4.72417064e-02 -1.33039141e-02
  1.23845302e-02  4.15663496e-02 -3.43382396e-02 -1.78184081e-02
 -1.25135835e-02  3.47135849e-02 -7.33558685e-02  8.99465755e-02
  4.75993322e-04 -4.14230255e-03  8.94850679e-03  5.84964640e-02
 -8.15250725e-03 -1.15181692e-01  2.41105023e-04 -4.13065106e-02
  6.67267218e-02  1.29336389e-02  1.07203551e-01 -6.36152104e-02
 -2.57131569e-02 -5.89676872e-02  4.88390960e-02  4.19631861e-02
  2.96290312e-02  9.87798423e-02  7.11039156e-02  1.01576380e-01
 -6.86481670e-02 -4.35450934e-02 -4.7503

### Sentence Similarity

In [6]:
# Score every pair of sentences with cosine similarity

sentences = [
    'The pizza tastes great with extra cheeze today.',
    'The iPhone 12 have amazing new features, will be sold out quickly!',
    'The man is carrying a baby in his arms, he should not go to the woods alone.',
    'My new Honda City got scratched yesterday while passing through that street.',
    'I ordered two family pan pizza which extra cheeze for my pizza.',
    'The woods are extremenly scary due to wild animals.',
    'The face unlock feature on my iPhone is simply astonishing.',
    'I tried to warn you not to pass through that scary street with your car.',
    'Let us rock and roll tonight'
]

# Encode all sentences
embeddings = model.encode(sentences)

# Compute the cosine similarity matrix for all pairs
cos_sim = util.cos_sim(embeddings, embeddings)

# Collect each unordered pair (i < j) with its cosine similarity score
sentence_pairs = []
for i in range(len(cos_sim)-1):
    for j in range(i+1, len(cos_sim)):
        sentence_pairs.append((cos_sim[i][j], i, j))

# Let's see the first 20 values in the list
sentence_pairs[:20]

[(tensor(0.1103), 0, 1),
 (tensor(-0.1328), 0, 2),
 (tensor(0.1347), 0, 3),
 (tensor(0.7315), 0, 4),
 (tensor(-0.0893), 0, 5),
 (tensor(0.0011), 0, 6),
 (tensor(0.0675), 0, 7),
 (tensor(0.1830), 0, 8),
 (tensor(-0.0758), 1, 2),
 (tensor(0.0458), 1, 3),
 (tensor(0.0823), 1, 4),
 (tensor(0.0303), 1, 5),
 (tensor(0.3701), 1, 6),
 (tensor(0.0195), 1, 7),
 (tensor(0.0172), 1, 8),
 (tensor(-0.0884), 2, 3),
 (tensor(-0.1041), 2, 4),
 (tensor(0.2956), 2, 5),
 (tensor(0.0347), 2, 6),
 (tensor(0.0982), 2, 7)]
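
For reference, the score that `util.cos_sim` reports for a pair of embeddings is the dot product of the two vectors divided by the product of their norms. Here is a minimal sketch in plain Python, using made-up 3-dimensional vectors rather than real 384-dimensional embeddings. The helper name `cosine_similarity` is my own and not part of the library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (assumptions for illustration, not real embeddings)
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])
print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way score close to 1, unrelated (orthogonal) vectors score 0, and opposite vectors score close to -1, which is how the values in the list above should be read.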

In [7]:
# Sorting the list by descending order of cosine similarity score and checking the top-5 most similar pairs

sentence_pairs = sorted(sentence_pairs, key=lambda x: x[0], reverse=True)

print("Top-5 most similar pairs of sentences:\n")
for score, i, j in sentence_pairs[0:5]:
    print("First Sentence:", sentences[i])
    print("Second Sentence:", sentences[j])
    print("Similarity Score:", score.item(), "\n")

Top-5 most similar pairs of sentences:

First Sentence: The pizza tastes great with extra cheeze today.
Second Sentence: I ordered two family pan pizza which extra cheeze for my pizza.
Similarity Score: 0.7314900159835815 

First Sentence: The iPhone 12 have amazing new features, will be sold out quickly!
Second Sentence: The face unlock feature on my iPhone is simply astonishing.
Similarity Score: 0.3700793981552124 

First Sentence: The woods are extremenly scary due to wild animals.
Second Sentence: I tried to warn you not to pass through that scary street with your car.
Similarity Score: 0.35832250118255615 

First Sentence: My new Honda City got scratched yesterday while passing through that street.
Second Sentence: I tried to warn you not to pass through that scary street with your car.
Similarity Score: 0.3440636694431305 

First Sentence: The man is carrying a baby in his arms, he should not go to the woods alone.
Second Sentence: The woods are extremenly scary due to wild anim

### Semantic Search

In [8]:
checkpoint = "clips/mfaq"
model = SentenceTransformer(checkpoint)    

In [9]:
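# Rank the candidate answers in the corpus by similarity to the question
# Note: the clips/mfaq model card asks for questions to be prefixed with "<Q>"
# and answers with "<A>". This cell passes raw text, so the scores shown
# below reflect unprefixed inputs.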

question = "Which are the best countries to visit in December for tourism?"
answers = [
    "Goa must be avoided in the months of July August due to heavy rains.",
    "Australia, Sri Lanka, West Indies are very popular tourist destinations in the winter months till January.",
    "I can complete this job for $500 in a week."
]

question_embedding = model.encode(question)
corpus_embeddings = model.encode(answers)

scores = util.semantic_search(question_embedding, corpus_embeddings)
print(scores, "\n")
print("Question:", question)
print("\nThe best answers in descending order of scores are:\n")

for d in scores[0]:
    print(f"\tAnswer: {answers[d['corpus_id']]}\n\tscore: {d['score']}\n")

[[{'corpus_id': 1, 'score': 0.694810152053833}, {'corpus_id': 0, 'score': 0.6513177752494812}, {'corpus_id': 2, 'score': 0.6036452651023865}]] 

Question: Which are the best countries to visit in December for tourism?

The best answers in descending order of scores are:

	Answer: Australia, Sri Lanka, West Indies are very popular tourist destinations in the winter months till January.
	score: 0.694810152053833

	Answer: Goa must be avoided in the months of July August due to heavy rains.
	score: 0.6513177752494812

	Answer: I can complete this job for $500 in a week.
	score: 0.6036452651023865
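
Conceptually, the search above scores every corpus entry against the query and keeps the best hits. The sketch below mimics that in plain Python with toy 2-D vectors standing in for embeddings. `rank_corpus` is a hypothetical helper of mine; the real `util.semantic_search` works on batched tensors and, as far as I know, defaults to cosine similarity with `top_k=10`.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_corpus(query_vec, corpus_vecs, top_k=10):
    """Return up to top_k hits as dicts, sorted by descending score."""
    hits = [
        {"corpus_id": idx, "score": cosine_similarity(query_vec, vec)}
        for idx, vec in enumerate(corpus_vecs)
    ]
    hits.sort(key=lambda hit: hit["score"], reverse=True)
    return hits[:top_k]

# Toy vectors (assumptions for illustration, not real embeddings)
query = [1.0, 0.0]
corpus = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]
hits = rank_corpus(query, corpus, top_k=2)
print([hit["corpus_id"] for hit in hits])
```

The hits use the same `corpus_id` and `score` keys as the output printed above, so the lookup `answers[d['corpus_id']]` would work the same way on them.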



### Question Answering

In [10]:
# Name the model and revision explicitly instead of relying on the pipeline default
qa_model = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad",
    revision="626af31",
)


In [11]:
# Let's provide the context for the question-answering

context = """Massive Beast in a small size! 
Just upgraded to this from a Samsung Galaxy S10e a couple days ago, 
and let me tell you something. For me, this was a huge jump because I've had very little experience
with iOS prior to buying this phone. And to start off, I love every aspect of this phone, 
especially Siri and the cameras! Speaking of which, the cameras on this thing are insane! 
Way less noisy and grainy in the background than my Samsung. Another feature I absolutely love is the Face ID, 
which detects your face almost instantly, unlike my S10e. And the last feature I really love on this thing 
is setting Do Not Disturb based on location. This was a feature that my Samsung did not have and I 
cannot tell you how helpful this is! One being that I don't have to readjust the time schedule of Do Not Disturb 
and the second being that it will not turn of until you're a certain distance away from that 
location (which can be adjusted in Settings). Overall, as a former Android user, 
I cannot tell you how much I love my new iPhone 13, I highly recomment this to everyone!
"""

# Now let's ask some questions

question = "Which phone was being user earlier?"
answer = qa_model(question=question, context=context)['answer']
print("Question:", question)
print("Answer:", answer)

Question: Which phone was being user earlier?
Answer: Samsung Galaxy S10e


In [12]:
question = "Which is the last feature mentioned by the user that she liked?"
answer = qa_model(question=question, context=context)['answer']
print("Question:", question)
print("Answer:", answer)

Question: Which is the last feature mentioned by the user that she liked?
Answer: Do Not Disturb


In [13]:
question = "Which version of iPhone is she taking about?"
answer = qa_model(question=question, context=context)['answer']
print("Question:", question)
print("Answer:", answer)

Question: Which version of iPhone is she taking about?
Answer: iPhone 13
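
The cells above keep only the `'answer'` field. The question-answering pipeline returns a dict that also carries a confidence `score` and the `start`/`end` character offsets of the answer inside the context. Below is a small sketch of how those fields could be used. The result dict is hand-written to stand in for a real pipeline call, so its score is made up, and `describe_answer` and `min_score` are hypothetical names of mine.

```python
def describe_answer(result, context, min_score=0.5):
    """Format a QA result, flagging answers below a confidence threshold."""
    # The offsets index into the context string, so slicing recovers the span
    span = context[result["start"]:result["end"]]
    flag = "" if result["score"] >= min_score else " (low confidence)"
    return f"{span}{flag}"

context = "Just upgraded to this from a Samsung Galaxy S10e a couple days ago."
start = context.index("Samsung Galaxy S10e")

# Hand-written stand-in for qa_model(question=..., context=...); score is illustrative
result = {
    "score": 0.9,
    "start": start,
    "end": start + len("Samsung Galaxy S10e"),
    "answer": "Samsung Galaxy S10e",
}

summary = describe_answer(result, context)
low_summary = describe_answer({**result, "score": 0.1}, context)
print(summary)
print(low_summary)
```

Checking the score before trusting an answer is worthwhile because an extractive model always returns some span, even when the context does not really contain the answer.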


### Sentence Clustering

In [16]:
# Reload the general-purpose model; `model` now points to the mfaq checkpoint
checkpoint = "all-MiniLM-L6-v2"
embedder = SentenceTransformer(checkpoint)

In [47]:
# Corpus with example sentences

sentence_corpus = [
    'The pizza tastes great with extra cheeze today.',
    'Denmark, Sweden, Norway are called nordic countries',
    'The iPhone 12 have amazing new features, will be sold out quickly!',
    'The man is carrying a baby in his arms, he should not go to the woods alone.',
    'I ordered two family pan pizza which extra cheeze for my pizza.',
    'The woods are extremenly scary due to wild animals.',
    'The face unlock feature on my iPhone is simply astonishing.',
    'I tried to warn you not to pass through that scary street with your car.',
    'Maiana offers the best Italian cuisine in town.',
    'Australia is a country which is also a continent',
    'Florida is a state of USA where you can find crocodiles'
]

corpus_embeddings = embedder.encode(sentence_corpus)

# Normalize the embeddings to unit length
corpus_embeddings = corpus_embeddings/np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
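
The normalization matters because k-means groups points by Euclidean distance. For unit-length vectors, the squared distance satisfies ‖a − b‖² = 2 − 2·cos(a, b), so clustering normalized embeddings by distance ranks pairs the same way cosine similarity does. Here is a quick check of that identity in plain Python, with toy vectors as stand-ins for embeddings (all helper names are my own):

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy vectors (assumptions for illustration, not real embeddings)
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

cosine = dot(a, b)  # for unit vectors, the dot product is the cosine similarity
lhs = squared_distance(a, b)
rhs = 2 - 2 * cosine
print(lhs, rhs)
```

The two printed values agree up to floating-point rounding.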

In [51]:
# Let's check out the first embedding

corpus_embeddings[0]

array([-7.37408251e-02,  3.93676758e-02,  3.42675894e-02,  5.99684827e-02,
       -5.52224405e-02, -4.06604335e-02,  7.64607936e-02,  7.43516758e-02,
        2.81906072e-02, -9.85224321e-02,  6.79197628e-03,  3.00994590e-02,
        2.20176224e-02, -8.34922865e-02,  8.52260590e-02, -1.24110907e-01,
        1.37485564e-01, -7.51159638e-02, -3.83998305e-02, -5.66198342e-02,
       -6.11542724e-02, -1.06979415e-01,  6.66869655e-02,  2.34215446e-02,
       -6.28352538e-02,  4.09877636e-02,  3.28017287e-02, -8.71219765e-03,
       -4.48306836e-02, -4.93518375e-02, -7.37910066e-03,  8.48614052e-02,
       -1.68028788e-03, -5.23484163e-02,  1.56562738e-02, -5.22746556e-02,
        2.79843882e-02, -6.77175820e-02,  4.12989594e-02,  4.27923836e-02,
        5.50004281e-02,  1.61397196e-02,  3.55118774e-02, -2.60924213e-02,
       -7.62159899e-02, -1.32712824e-02, -1.12092597e-02, -2.54522916e-02,
        5.21579199e-02,  2.27567833e-02,  1.53705720e-02,  2.50659906e-03,
       -2.62966119e-02, -

In [49]:
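# Cluster the normalized embeddings into 4 groups with k-means
# (n_init='auto' needs scikit-learn 1.2+; no random_state is set,
# so the cluster numbering can differ between runs)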

cluster_model = KMeans(n_clusters=4, n_init='auto')
cluster_model.fit(corpus_embeddings)
clusters = cluster_model.labels_
list(clusters)

[1, 3, 2, 0, 1, 0, 2, 0, 1, 3, 3]

In [50]:
# Let's see the sentences in each cluster

clustered_sentences = {}
for sentence_id, cluster_id in enumerate(clusters):
    # setdefault creates the list the first time a cluster label is seen
    clustered_sentences.setdefault(cluster_id, []).append(sentence_corpus[sentence_id])

for cluster_id, cluster_sentences in clustered_sentences.items():
    print("\nCluster:", cluster_id)
    for sentence in cluster_sentences:
        print(sentence)


Cluster: 1
The pizza tastes great with extra cheeze today.
I ordered two family pan pizza which extra cheeze for my pizza.
Maiana offers the best Italian cuisine in town.

Cluster: 3
Denmark, Sweden, Norway are called nordic countries
Australia is a country which is also a continent
Florida is a state of USA where you can find crocodiles

Cluster: 2
The iPhone 12 have amazing new features, will be sold out quickly!
The face unlock feature on my iPhone is simply astonishing.

Cluster: 0
The man is carrying a baby in his arms, he should not go to the woods alone.
The woods are extremenly scary due to wild animals.
I tried to warn you not to pass through that scary street with your car.
