<a href="https://colab.research.google.com/github/abogutalan/machineLearning-and-AI/blob/master/semantic_similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Compute Semantic Textual Similarity between two texts using PyTorch and SentenceTransformers

**The main objective** of semantic similarity is to measure the distance between the meanings of a pair of words, phrases, sentences, or documents.

```
Applications:
- information retrieval
- text summarization
- sentiment analysis, etc.

Keywords:
- NLP
- Semantic Textual Similarity
- PyTorch
- cosine similarity
- SentenceTransformers:
    - a library that provides an easy way to compute dense vector representations (i.e. embeddings) for texts.
    - requires PyTorch and Transformers to be installed.
    - recommends Python 3.6 or higher, PyTorch 1.6.0 or higher, and transformers v3.1.0 or higher.
- stsb-roberta-large: uses RoBERTa-large as the base model with mean pooling; a strong choice for the task of semantic similarity.
- semantic search: finding the most relevant sentences in a corpus for a given query sentence.


```
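
The keywords above mention mean pooling. As a rough illustration of the idea only (not the library's internal code), the sketch below averages per-token vectors into a single sentence vector while skipping padding positions. The names `mean_pool`, `token_embeddings`, and `attention_mask` are assumptions chosen for this example, and the input is toy data.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask value is 1 (padding is skipped)."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for k in range(dim):
                totals[k] += vector[k]
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in totals]


# Toy input: three token vectors, the last one is padding and must be ignored.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 3.0]
```

In the real model the token vectors have many more dimensions, but the averaging step follows the same pattern.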

**Install Transformers**


In [25]:
pip install transformers


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


**Install SentenceTransformers**


In [26]:
pip install sentence-transformers


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
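
After installing, it can be useful to compare the installed versions with the recommendations listed in the keywords above. The following is a minimal sketch that assumes Python 3.8 or newer, where `importlib.metadata` is part of the standard library. The helper name `installed_version` is hypothetical.

```python
from importlib import metadata  # standard library from Python 3.8 onward


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Print whatever is installed in the current environment.
for name in ("torch", "transformers", "sentence-transformers"):
    print(name, "->", installed_version(name) or "not installed")
```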


# **Import Library**


In [27]:
# import library

from sentence_transformers import SentenceTransformer, util
import numpy as np

# **Model Selection and Initialization**

In [28]:
# List of models optimized for semantic textual similarity can be found at:
# https://docs.google.com/spreadsheets/d/14QplCdTCDwEmTqrn1LH4yrbKvdogK4oQvYO1K1aPR5M/edit#gid=0
model = SentenceTransformer('stsb-roberta-large')

# **Calculate semantic similarity between two sentences**

In [29]:
sentence1 = "I like Python because I can build AI applications"
sentence2 = "I like Python because I can do data analytics"

# encode sentences to get their embeddings
# convert the final embeddings to tensor so that they can be 
# processed faster by the GPU. Not required for small data.
embedding1 = model.encode(sentence1, convert_to_tensor=True)
embedding2 = model.encode(sentence2, convert_to_tensor=True)

# compute similarity scores of two embeddings, using 
# the pytorch_cos_sim function provided by the util, 
# thanks to Sentence-Transformers.
cosine_scores = util.pytorch_cos_sim(embedding1, embedding2)

print("embedding 1:", embedding1)
print("embedding 2:", embedding2)
print("\nSentence 1:", sentence1)
print("Sentence 2:", sentence2)
print("\nSemantic Similarity score:", cosine_scores.item())

embedding 1: tensor([-0.4627,  0.7407, -0.2662,  ...,  1.6758, -2.6873, -0.2177])
embedding 2: tensor([-0.3860,  0.6502, -0.3014,  ...,  1.5001, -2.2585,  0.7606])

Sentence 1: I like Python because I can build AI applications
Sentence 2: I like Python because I can do data analytics

Semantic Similarity score: 0.8015284538269043
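
To make the score easier to interpret, here is a small plain-Python sketch of the cosine similarity formula: the dot product of two vectors divided by the product of their lengths. It is an illustration of the metric on toy vectors, not the implementation used by `util.pytorch_cos_sim`. The function name `cosine_similarity` is an assumption for this example.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors: same direction, perpendicular, and opposite direction.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

The result always lies between -1 and 1: values near 1 mean the vectors point the same way, values near 0 mean they are unrelated, and negative values mean they point in opposing directions.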


# **Calculate semantic similarity between two lists of sentences**

In [30]:
sentence1 = ["I like Python because I can build AI applications",
             "The cat sits on the ground"]
sentence2 = ["I like Python because I can do data analytics",
             "The cat walks on the sidewalk"]             

# encode list of sentences to get their embeddings
embedding1 = model.encode(sentence1, convert_to_tensor=True)            
embedding2 = model.encode(sentence2, convert_to_tensor=True)            

# compute similarity scores of two embeddings
cosine_scores = util.pytorch_cos_sim(embedding1, embedding2)

for i in range(len(sentence1)):
  for j in range(len(sentence2)):
    print("Sentence 1:", sentence1[i])
    print("Sentence 2:", sentence2[j])
    print("Similarity Score:", cosine_scores[i][j].item())
    print()


Sentence 1: I like Python because I can build AI applications
Sentence 2: I like Python because I can do data analytics
Similarity Score: 0.8015284538269043

Sentence 1: I like Python because I can build AI applications
Sentence 2: The cat walks on the sidewalk
Similarity Score: -0.031109800562262535

Sentence 1: The cat sits on the ground
Sentence 2: I like Python because I can do data analytics
Similarity Score: 0.11328643560409546

Sentence 1: The cat sits on the ground
Sentence 2: The cat walks on the sidewalk
Similarity Score: 0.4038149118423462



> Conclusion: The similarity score is higher for sentence pairs that are closer in meaning.
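
The scores printed above form a matrix in which row `i` belongs to `sentence1[i]` and column `j` to `sentence2[j]`. The sketch below copies those scores (rounded to four decimals) into a plain nested list and shows two common ways to read such a matrix. The variable names `aligned` and `best_match` are chosen for illustration.

```python
# Scores copied from the output above, rounded to four decimals.
# Row i corresponds to sentence1[i], column j to sentence2[j].
scores = [
    [0.8015, -0.0311],
    [0.1133, 0.4038],
]

# Aligned pairs (first with first, second with second) sit on the diagonal.
aligned = [scores[i][i] for i in range(len(scores))]

# For each sentence in the first list, the index of its best match in the second.
best_match = [max(range(len(row)), key=row.__getitem__) for row in scores]

print(aligned)     # [0.8015, 0.4038]
print(best_match)  # [0, 1]
```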



# **Retrieve Top K most similar sentences from a corpus given a sentence**

A popular use case of semantic similarity is finding the most relevant sentences in a corpus for a given query sentence. This is also called **semantic search**.

In [31]:
corpus = ["I like Python because I can build AI applications",
          "I like Python because I can do data analytics",
          "The cat sits on the ground",
          "The cat walks on the sidewalk"]

# encode corpus to get corpus embeddings
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

sentence = "I like Javascript because I can build web applications"

# encode sentence to get sentence embeddings
sentence_embedding = model.encode(sentence, convert_to_tensor=True)

# top_k results to return
top_k=2

# compute similarity scores of the sentence with the corpus
cos_scores = util.pytorch_cos_sim(sentence_embedding, corpus_embeddings)[0]

# Sort the scores in decreasing order and keep the indices of the first top_k.
# The scores are moved to the CPU first so that NumPy can read them
# even when the model runs on a GPU.
top_results = np.argpartition(-cos_scores.cpu().numpy(), range(top_k))[0:top_k]

print("Sentence:", sentence, "\n")
print("Top", top_k, "most similar sentences in corpus:")
for idx in top_results:
  print(corpus[idx], "(Score: %.4f)" % cos_scores[idx])


Sentence: I like Javascript because I can build web applications 

Top 2 most similar sentences in corpus:
I like Python because I can build AI applications (Score: 0.6696)
I like Python because I can do data analytics (Score: 0.5455)
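
The selection step above amounts to picking the indices of the `top_k` largest scores, ordered best first. As a library-free illustration of that step, the sketch below uses `heapq.nlargest` from the standard library. The function name `top_k_indices` and the example scores are hypothetical.

```python
import heapq


def top_k_indices(scores, k):
    """Return the indices of the k largest scores, best first."""
    return heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])


# Hypothetical similarity scores for a four-sentence corpus.
example_scores = [0.67, 0.55, 0.02, 0.10]
print(top_k_indices(example_scores, 2))  # [0, 1]
```

For a small corpus any of these approaches works. For a large one, avoiding a full sort of all scores keeps the selection step cheap.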
