In this notebook, I use Sentence Transformers (plus a Hugging Face `transformers` pipeline for question answering) to perform the following NLP tasks:


*   Sentence Similarity
*   Semantic Search
*   Question Answering
*   Sentence Clustering



In [1]:
%%capture
!pip install -U sentence-transformers

In [2]:
import numpy as np
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline
from sklearn.cluster import KMeans

In [3]:
checkpoint = "all-MiniLM-L6-v2"
model = SentenceTransformer(checkpoint)

Let's first pass a couple of sentences to the model and inspect the embeddings it generates.

In [4]:
sentences = [
    'Australia is a wonderful tourist destination in December.',
    'I love when it rains in summer'
]

embeddings = model.encode(sentences)

# Let's check out the first embedding
print("Sentence:", sentences[0])
print("Embedding shape: ", embeddings[0].shape)
print("First 50 values of the Embedding generated:\n", embeddings[0][:50])

Sentence: Australia is a wonderful tourist destination in December.
Embedding shape:  (384,)
First 50 values of the Embedding generated:
 [ 0.08048362  0.01721296  0.02099365  0.02009391  0.00766018  0.03738112
  0.0146357  -0.04955229 -0.04560297  0.10038438 -0.01003496  0.00708437
  0.04338881  0.10435327  0.04610619 -0.03717326  0.00046194 -0.08769815
  0.07853042 -0.03180541  0.03068463  0.01306993 -0.04724171 -0.01330391
  0.01238453  0.04156635 -0.03433824 -0.01781841 -0.01251358  0.03471358
 -0.07335587  0.08994658  0.00047599 -0.0041423   0.00894851  0.05849646
 -0.00815251 -0.11518169  0.00024111 -0.04130651  0.06672672  0.01293364
  0.10720355 -0.06361521 -0.02571316 -0.05896769  0.0488391   0.04196319
  0.02962903  0.09877984]


### Sentence Similarity

In [5]:
# We will use cosine similarity

sentences = [
    'The pizza tastes great with extra cheeze today.',
    'The iPhone 12 have amazing new features, will be sold out quickly!',
    'The man is carrying a baby in his arms, he should not go to the woods alone.',
    'My new Honda City got scratched yesterday while passing through that street.',
    'I ordered two family pan pizza which extra cheeze for my pizza.',
    'The woods are extremenly scary due to wild animals.',
    'The face unlock feature on my iPhone is simply astonishing.',
    'I tried to warn you not to pass through that scary street with your car.',
    'Let us rock and roll tonight'
]

# Encoding the sentences
embeddings = model.encode(sentences)

# Calculating cosine similarity between all pairs of sentences
cos_sim = util.cos_sim(embeddings, embeddings)

# Adding all pairs to a list with their cosine similarity score
sentence_pairs = []
for i in range(len(cos_sim)-1):
    for j in range(i+1, len(cos_sim)):
        sentence_pairs.append((cos_sim[i][j], i, j))

# Let's see the first 20 values in the list
sentence_pairs[:20]

[(tensor(0.1103), 0, 1),
 (tensor(-0.1328), 0, 2),
 (tensor(0.1347), 0, 3),
 (tensor(0.7315), 0, 4),
 (tensor(-0.0893), 0, 5),
 (tensor(0.0011), 0, 6),
 (tensor(0.0675), 0, 7),
 (tensor(0.1830), 0, 8),
 (tensor(-0.0758), 1, 2),
 (tensor(0.0458), 1, 3),
 (tensor(0.0823), 1, 4),
 (tensor(0.0303), 1, 5),
 (tensor(0.3701), 1, 6),
 (tensor(0.0195), 1, 7),
 (tensor(0.0172), 1, 8),
 (tensor(-0.0884), 2, 3),
 (tensor(-0.1041), 2, 4),
 (tensor(0.2956), 2, 5),
 (tensor(0.0347), 2, 6),
 (tensor(0.0982), 2, 7)]
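
To make it clearer what `util.cos_sim` is computing, here is a minimal NumPy sketch of cosine similarity: each row is scaled to unit length, and the similarity is then the dot product between rows. The function name `cosine_similarity_matrix` and the toy 2-D vectors are my own illustration and are not part of the library.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Return the matrix of cosine similarities between rows of a and rows of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Scale every row to unit length so the dot product equals the cosine
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

# Hypothetical 2-D "embeddings", chosen so the result is easy to check by hand
toy_embeddings = np.array([
    [1.0, 0.0],
    [2.0, 0.0],   # same direction as row 0, different length
    [0.0, 3.0],   # orthogonal to row 0
    [-1.0, 0.0],  # opposite direction to row 0
])

toy_cos_sim = cosine_similarity_matrix(toy_embeddings, toy_embeddings)
print(np.round(toy_cos_sim, 2))
```

Rows 0 and 1 point the same way, so their similarity is 1 regardless of length; orthogonal rows score 0 and opposite rows score -1. That is why the scores in the list above all fall between -1 and 1.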

In [6]:
# Sorting the list by descending order of cosine similarity score and checking the top-5 most similar pairs

sentence_pairs = sorted(sentence_pairs, key=lambda x: x[0], reverse=True)

print("Top-5 most similar pairs of sentences:\n")
for score, i, j in sentence_pairs[0:5]:
    print("First Sentence:", sentences[i])
    print("Second Sentence:", sentences[j])
    print("Similarity Score:", cos_sim[i][j].item(), "\n")

Top-5 most similar pairs of sentences:

First Sentence: The pizza tastes great with extra cheeze today.
Second Sentence: I ordered two family pan pizza which extra cheeze for my pizza.
Similarity Score: 0.7314899563789368 

First Sentence: The iPhone 12 have amazing new features, will be sold out quickly!
Second Sentence: The face unlock feature on my iPhone is simply astonishing.
Similarity Score: 0.3700794577598572 

First Sentence: The woods are extremenly scary due to wild animals.
Second Sentence: I tried to warn you not to pass through that scary street with your car.
Similarity Score: 0.3583225607872009 

First Sentence: My new Honda City got scratched yesterday while passing through that street.
Second Sentence: I tried to warn you not to pass through that scary street with your car.
Similarity Score: 0.3440636992454529 

First Sentence: The man is carrying a baby in his arms, he should not go to the woods alone.
Second Sentence: The woods are extremenly scary due to wild animals.
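
The nested loop plus `sorted` used above can also be expressed with NumPy indexing. This is only a sketch under my own naming (`top_k_pairs` is a hypothetical helper, and the 3x3 matrix is made-up data): `np.triu_indices` enumerates every pair `i < j` exactly once, and `np.argsort` ranks them.

```python
import numpy as np

def top_k_pairs(sim_matrix, k):
    """Return the k highest-scoring (score, i, j) pairs with i < j."""
    sim = np.asarray(sim_matrix, dtype=float)
    # Indices of the strict upper triangle: each unordered pair appears once
    rows, cols = np.triu_indices(sim.shape[0], k=1)
    pair_scores = sim[rows, cols]
    # Sort by descending score and keep the first k positions
    order = np.argsort(-pair_scores)[:k]
    return [(float(pair_scores[n]), int(rows[n]), int(cols[n])) for n in order]

# Made-up symmetric similarity matrix for three sentences
toy_sim = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.5],
    [0.9, 0.5, 1.0],
])

toy_top_pairs = top_k_pairs(toy_sim, k=2)
print(toy_top_pairs)
```

For a handful of sentences the plain loop is just as good; the vectorized form mainly helps when the corpus grows to thousands of sentences.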

### Semantic Search

In [7]:
checkpoint = "clips/mfaq"
model = SentenceTransformer(checkpoint)


In [8]:
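# Search for the best answers among the corpus of answers for the given question
# Note: the clips/mfaq model card recommends prefixing questions with "<Q>" and
# answers with "<A>"; the strings below are passed without those prefixes.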

question = "Which are the best countries to visit in December for tourism?"
answers = [
    "Goa must be avoided in the months of July August due to heavy rains.",
    "Australia, Sri Lanka, West Indies are very popular tourist destinations in the winter months till January.",
    "I can complete this job for $500 in a week."
]

question_embedding = model.encode(question)
corpus_embeddings = model.encode(answers)

scores = util.semantic_search(question_embedding, corpus_embeddings)
print(scores, "\n")
print("Question:", question)
print("\nThe best answers in descending order of scores are:\n")

for d in scores[0]:
    print(f"\tAnswer: {answers[d['corpus_id']]}\n\tscore: {d['score']}\n")

[[{'corpus_id': 1, 'score': 0.694810152053833}, {'corpus_id': 0, 'score': 0.6513177752494812}, {'corpus_id': 2, 'score': 0.6036452054977417}]] 

Question: Which are the best countries to visit in December for tourism?

The best answers in descending order of scores are:

	Answer: Australia, Sri Lanka, West Indies are very popular tourist destinations in the winter months till January.
	score: 0.694810152053833

	Answer: Goa must be avoided in the months of July August due to heavy rains.
	score: 0.6513177752494812

	Answer: I can complete this job for $500 in a week.
	score: 0.6036452054977417
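
As a rough picture of what `util.semantic_search` does with its default cosine-similarity scoring, here is a small NumPy sketch: score every corpus embedding against the query, then return the best hits in descending order. The function `toy_semantic_search` and the 2-D vectors are hypothetical stand-ins for illustration, and the sketch ignores the chunking the library uses for large corpora.

```python
import numpy as np

def toy_semantic_search(query_embedding, corpus_embeddings, top_k=10):
    """Rank corpus rows by cosine similarity to a single query vector."""
    query = np.asarray(query_embedding, dtype=float)
    corpus = np.asarray(corpus_embeddings, dtype=float)
    # Normalize so that the dot product is the cosine similarity
    query = query / np.linalg.norm(query)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = corpus @ query
    # Highest score first, truncated to top_k results
    ranked = np.argsort(-scores)[:top_k]
    return [{"corpus_id": int(i), "score": float(scores[i])} for i in ranked]

# Made-up 2-D vectors standing in for real answer embeddings
toy_corpus = np.array([
    [0.0, 1.0],   # unrelated to the query direction
    [1.0, 0.1],   # almost parallel to the query
    [1.0, 1.0],   # halfway in between
])
toy_query = np.array([1.0, 0.0])

toy_hits = toy_semantic_search(toy_query, toy_corpus, top_k=2)
for hit in toy_hits:
    print(hit)
```

The result mirrors the list-of-dicts shape printed above, with `corpus_id` pointing back into the list of answers.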



### Question Answering

In [9]:
qa_model = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [10]:
# Let's provide the context for the question-answering

context = """Massive Beast in a small size! 
Just upgraded to this from a Samsung Galaxy S10e a couple days ago, 
and let me tell you something. For me, this was a huge jump because I've had very little experience
with iOS prior to buying this phone. And to start off, I love every aspect of this phone, 
especially Siri and the cameras! Speaking of which, the cameras on this thing are insane! 
Way less noisy and grainy in the background than my Samsung. Another feature I absolutely love is the Face ID, 
which detects your face almost instantly, unlike my S10e. And the last feature I really love on this thing 
is setting Do Not Disturb based on location. This was a feature that my Samsung did not have and I 
cannot tell you how helpful this is! One being that I don't have to readjust the time schedule of Do Not Disturb 
and the second being that it will not turn of until you're a certain distance away from that 
location (which can be adjusted in Settings). Overall, as a former Android user, 
I cannot tell you how much I love my new iPhone 13, I highly recomment this to everyone!
"""

# Now let's ask some questions

question = "Which phone was being user earlier?"
answer = qa_model(question = question, context = context)['answer']
print("Question:", question)
print("Answer:", answer)

Question: Which phone was being user earlier?
Answer: Samsung Galaxy S10e


In [11]:
question = "Which is the last feature mentioned by the user that she liked?"
answer = qa_model(question = question, context = context)['answer']
print("Question:", question)
print("Answer:", answer)

Question: Which is the last feature mentioned by the user that she liked?
Answer: Do Not Disturb


In [12]:
question = "Which version of iPhone is she taking about?"
answer = qa_model(question = question, context = context)['answer']
print("Question:", question)
print("Answer:", answer)

Question: Which version of iPhone is she taking about?
Answer: iPhone 13
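
The answers above are always spans copied out of the context, because this kind of model is extractive: it scores each token as a possible start and as a possible end of the answer. The sketch below is a simplified illustration of that span selection, not the pipeline's actual implementation; the function `best_answer_span`, the token list, and the logit values are all invented for the example.

```python
import numpy as np

def best_answer_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end) maximizing start score + end score, with start <= end."""
    best_score, best_span = -np.inf, (0, 0)
    for start, start_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score, best_span = score, (start, end)
    return best_span

# Invented tokens and scores; a real model would produce the logits
toy_tokens = ["I", "upgraded", "from", "a", "Samsung", "Galaxy", "S10e", "yesterday"]
toy_start_logits = np.array([0.1, 0.0, 0.2, 0.3, 4.0, 1.0, 0.5, 0.0])
toy_end_logits = np.array([0.0, 0.1, 0.0, 0.2, 0.5, 1.5, 3.5, 0.3])

start, end = best_answer_span(toy_start_logits, toy_end_logits)
toy_answer = " ".join(toy_tokens[start:end + 1])
print(toy_answer)
```

Because the answer must be a contiguous span, such a model cannot paraphrase or combine facts from different parts of the review.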


### Sentence Clustering

In [21]:
checkpoint = "all-MiniLM-L6-v2"
embedder = SentenceTransformer(checkpoint)

In [26]:
# Corpus with example sentences

sentence_corpus = [
    'The pizza tastes great with extra cheeze today.',
    'Denmark, Sweden, Norway are called nordic countries',
    'The iPhone 12 have amazing new features, will be sold out quickly!',
    'The man is carrying a baby in his arms, he should not go to the woods alone.',
    'I ordered two family pan pizza which extra cheeze for my pizza.',
    'The woods are extremenly scary due to wild animals.',
    'The face unlock feature on my iPhone is simply astonishing.',
    'I tried to warn you not to pass through that scary street with your car.',
    'Maiana offers the best Italian cuisine in town.',
    'Australia is a country which is also a continent',
    'Florida is a state of USA where you can find crocodiles'
]

corpus_embeddings = embedder.encode(sentence_corpus)

# Normalizing the embeddings to unit length
corpus_embeddings = corpus_embeddings/np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)

In [27]:
# Let's check out the first 20 values of the first embedding

print(corpus_embeddings[0][:20])
print("\nEmbedding shape:", corpus_embeddings[0].shape)

[-0.07374083  0.03936768  0.03426759  0.05996848 -0.05522244 -0.04066043
  0.07646079  0.07435168  0.02819061 -0.09852243  0.00679198  0.03009946
  0.02201762 -0.08349229  0.08522606 -0.12411091  0.13748556 -0.07511596
 -0.03839983 -0.05661983]

Embedding shape: (384,)
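
A note on why the embeddings were normalized before clustering: k-means works with Euclidean distance, and for unit-length vectors the squared Euclidean distance equals `2 - 2 * cosine_similarity`. So clustering normalized embeddings groups sentences by cosine similarity. The quick check below uses random vectors (an assumption for illustration, not real sentence embeddings).

```python
import numpy as np

# Random stand-in vectors; real embeddings would come from the model
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(5, 8))

# Normalize each row to unit length, as done for corpus_embeddings
toy_unit = toy_vectors / np.linalg.norm(toy_vectors, axis=1, keepdims=True)

# Cosine similarity of unit vectors is just their dot product
cosine = toy_unit @ toy_unit.T

# Pairwise squared Euclidean distances via broadcasting
diff = toy_unit[:, None, :] - toy_unit[None, :, :]
squared_dist = (diff ** 2).sum(axis=-1)

# The identity ||u - v||^2 = 2 - 2 * cos(u, v) holds for unit vectors
print(np.allclose(squared_dist, 2 - 2 * cosine))
```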


In [28]:
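# Cluster the normalized embeddings into 4 groups with k-means.
# No random_state is set, so the cluster numbering may differ between runs.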

cluster_model = KMeans(n_clusters=4, n_init='auto')
cluster_model.fit(corpus_embeddings)
clusters = cluster_model.labels_
list(clusters)

[0, 1, 2, 3, 0, 3, 2, 3, 0, 1, 1]

In [29]:
# Let's see the sentences in each cluster

clustered_sentences = {}
for sentence_id, cluster_id in enumerate(clusters):
    # setdefault creates the list the first time a cluster id is seen
    clustered_sentences.setdefault(cluster_id, []).append(sentence_corpus[sentence_id])

for cluster_id, cluster_sentences in clustered_sentences.items():
    print("\nCluster:", cluster_id)
    for sentence in cluster_sentences:
        print(sentence)


Cluster: 0
The pizza tastes great with extra cheeze today.
I ordered two family pan pizza which extra cheeze for my pizza.
Maiana offers the best Italian cuisine in town.

Cluster: 1
Denmark, Sweden, Norway are called nordic countries
Australia is a country which is also a continent
Florida is a state of USA where you can find crocodiles

Cluster: 2
The iPhone 12 have amazing new features, will be sold out quickly!
The face unlock feature on my iPhone is simply astonishing.

Cluster: 3
The man is carrying a baby in his arms, he should not go to the woods alone.
The woods are extremenly scary due to wild animals.
I tried to warn you not to pass through that scary street with your car.
