
Semantic Text Similarity and Paraphrase [Mining](https://github.com/UKPLab/sentence-transformers)

In [3]:
# 1 - Import libraries (requires: pip install sentence-transformers)
import torch
from sentence_transformers import SentenceTransformer
from sentence_transformers import util 

In [4]:
#We will find similarity for these two sentences
sentence1 = "This is a sentence"
sentence2 = "This is also a sentence"

In [None]:
#Download model
model = SentenceTransformer('all-MiniLM-L6-v2')

In [6]:
#Find sentence embeddings for sentences using SBERT model
embedding_sen1 = model.encode(sentence1)
embedding_sen2 = model.encode(sentence2)

#Find cosine similarity between the two sentence embeddings
cos_sim = util.cos_sim(embedding_sen1, embedding_sen2)
print("Cosine-Similarity:", cos_sim.numpy()[0][0])
print("Percentage Similarity: ", cos_sim.numpy()[0][0]*100, "%")

Cosine-Similarity: 0.87709093
Percentage Similarity:  87.70909309387207 %
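
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal pure-Python sketch of that formula on small hand-made vectors; the helper name `cosine_similarity` is introduced here for illustration only and is not part of sentence-transformers.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-dimensional vectors (not real sentence embeddings)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_similarity([3.0, 4.0], [4.0, 3.0]))  # 24 / (5 * 5)
```

Real sentence embeddings have a few hundred dimensions, but the formula is the same.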


In [7]:
# Single list of sentences
sentences = ['The cat sits outside',
             'A man is playing guitar',
             'I love pasta',
             'The new movie is awesome',
             'The cat plays in the garden',
             'A woman watches TV',
             'The new movie is so great',
             'Do you like pizza?']

In [8]:
#Compute embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)

#Compute cosine-similarities for each sentence with each other sentence
cosine_scores = util.cos_sim(embeddings, embeddings)

#Find the pairs with the highest cosine similarity scores
pairs = []
for i in range(len(cosine_scores)-1):
    for j in range(i+1, len(cosine_scores)):
        pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})

#Sort scores in decreasing order
pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)

In [9]:
for pair in pairs[0:10]:
    i, j = pair['index']
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], pair['score']))

The new movie is awesome 		 The new movie is so great 		 Score: 0.8939
The cat sits outside 		 The cat plays in the garden 		 Score: 0.6788
I love pasta 		 Do you like pizza? 		 Score: 0.5096
I love pasta 		 The new movie is so great 		 Score: 0.2560
I love pasta 		 The new movie is awesome 		 Score: 0.2440
A man is playing guitar 		 The cat plays in the garden 		 Score: 0.2105
The new movie is awesome 		 Do you like pizza? 		 Score: 0.1969
The new movie is so great 		 Do you like pizza? 		 Score: 0.1692
The cat sits outside 		 A woman watches TV 		 Score: 0.1310
The cat plays in the garden 		 Do you like pizza? 		 Score: 0.0900


The code above takes a brute-force approach: it scores and ranks all pairs. However, because this has quadratic runtime, it fails to scale to large collections (10,000 sentences or more).
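
To make the quadratic growth concrete: n sentences produce n(n-1)/2 unordered pairs. A small sketch of that arithmetic follows; the helper name `count_pairs` is an illustrative assumption, not a library function.

```python
def count_pairs(n: int) -> int:
    """Number of unordered sentence pairs (i, j) with i < j among n sentences."""
    return n * (n - 1) // 2

for n in (8, 1_000, 10_000, 100_000):
    print(f"{n:>7} sentences -> {count_pairs(n):>13,} pairs")
```

The 8 example sentences above give only 28 pairs, while 10,000 sentences already give roughly 50 million.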

For larger collections, util offers the paraphrase_mining function that can be used like this:

In [10]:
sentences = ['The cat sits outside',
             'A man is playing guitar',
             'I love pasta',
             'The new movie is awesome',
             'The cat plays in the garden',
             'A woman watches TV',
             'The new movie is so great',
             'Do you like pizza?']

paraphrases = util.paraphrase_mining(model, sentences)

for paraphrase in paraphrases[0:10]:
    score, i, j = paraphrase
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], score))

The new movie is so great 		 The new movie is awesome 		 Score: 0.8939
The cat sits outside 		 The cat plays in the garden 		 Score: 0.6788
Do you like pizza? 		 I love pasta 		 Score: 0.5096
I love pasta 		 The new movie is so great 		 Score: 0.2560
I love pasta 		 The new movie is awesome 		 Score: 0.2440
A man is playing guitar 		 The cat plays in the garden 		 Score: 0.2105
The new movie is awesome 		 Do you like pizza? 		 Score: 0.1969
The new movie is so great 		 Do you like pizza? 		 Score: 0.1692
The cat sits outside 		 A woman watches TV 		 Score: 0.1310
The cat plays in the garden 		 Do you like pizza? 		 Score: 0.0900


Instead of computing all pairwise cosine scores and ranking all possible combinations, this approach is a bit more complex (and more efficient). We chunk our corpus into smaller pieces, whose sizes are defined by query_chunk_size and corpus_chunk_size. For example, if we set query_chunk_size=1000, we search for paraphrases of 1,000 sentences at a time in the remaining corpus (all other sentences). The remaining corpus is also chunked: if we set corpus_chunk_size=10000, we look for paraphrases in 10k sentences at a time.

If we pass a list of 20k sentences, it is split into 20 query chunks of 1,000 sentences each, and each query chunk is compared first against sentences 0-10k and then against sentences 10k-20k.

This is done to reduce the memory requirement. Increasing both values improves speed but also increases the memory requirement.
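
The chunking arithmetic for the 20k example can be sketched as follows. This only illustrates how the index ranges and block sizes work out; it is not the library's actual implementation, and `chunk_ranges` is a hypothetical helper.

```python
def chunk_ranges(n_items: int, chunk_size: int):
    """Yield (start, end) index ranges that cover range(n_items) chunk by chunk."""
    for start in range(0, n_items, chunk_size):
        yield start, min(start + chunk_size, n_items)

n_sentences = 20_000
query_chunk_size = 1_000
corpus_chunk_size = 10_000

query_chunks = list(chunk_ranges(n_sentences, query_chunk_size))
corpus_chunks = list(chunk_ranges(n_sentences, corpus_chunk_size))
print(len(query_chunks), "query chunks,", len(corpus_chunks), "corpus chunks")

# One (query chunk, corpus chunk) combination needs a score block of at most
# query_chunk_size x corpus_chunk_size entries, instead of the full n x n matrix.
largest_block = query_chunk_size * corpus_chunk_size
full_matrix = n_sentences * n_sentences
print("largest block:", largest_block, "scores; full matrix:", full_matrix, "scores")
```

In this setting each block holds 10 million scores, compared with 400 million for the full matrix.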

The next critical step is finding the pairs with the highest similarities. Instead of computing and sorting all n^2 pairwise scores, we keep only the top_k scores for each query. So with top_k=100, we find at most 100 paraphrases per sentence per chunk. You can adjust top_k to control this behaviour.
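
The idea of keeping only the best few scores per query can be sketched with the standard library. This is a simplified illustration on a hand-made score matrix, assuming a single chunk; `top_k_per_query` and `toy_scores` are hypothetical names and the code does not reproduce the library's internals.

```python
import heapq

def top_k_per_query(scores, k):
    """For each query row, keep the k best (score, query_idx, corpus_idx) entries,
    skipping the comparison of a sentence with itself."""
    kept = []
    for i, row in enumerate(scores):
        candidates = [(s, i, j) for j, s in enumerate(row) if j != i]
        kept.extend(heapq.nlargest(k, candidates))
    return kept

# Hand-made symmetric similarity matrix for 4 sentences (illustrative values)
toy_scores = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.4],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.4, 0.8, 1.0],
]

kept = top_k_per_query(toy_scores, k=1)
print(kept)
```

With k=1, only 4 candidate entries are kept instead of all 12 off-diagonal scores.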

For example, with

paraphrases = util.paraphrase_mining(model, sentences, corpus_chunk_size=len(sentences), top_k=1)

you get, for each sentence, only its single most similar other sentence. Note that if B is the most similar sentence for A, A need not be the most similar sentence for B. The returned list can therefore contain entries like (A, B) and (B, C).
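
That "most similar" is not a symmetric relation can be seen on three items with hand-picked scores. The values and the helpers `score` and `nearest` below are made up for illustration.

```python
names = ["A", "B", "C"]

# Hypothetical symmetric similarity scores, chosen by hand
sim = {
    ("A", "B"): 0.6,
    ("A", "C"): 0.3,
    ("B", "C"): 0.9,
}

def score(x, y):
    """Look up the score for an unordered pair."""
    return sim[(x, y)] if (x, y) in sim else sim[(y, x)]

def nearest(x):
    """Return the item most similar to x, excluding x itself."""
    others = [y for y in names if y != x]
    return max(others, key=lambda y: score(x, y))

for x in names:
    print(x, "->", nearest(x))
```

Here B is the best match for A, but the best match for B is C, so a top-1 search yields both (A, B) and (B, C).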

The final relevant parameter is max_pairs, which determines the maximum number of paraphrase pairs to return. If you set, e.g., max_pairs=100, you will get no more than 100 paraphrase pairs. Usually you get fewer, because the list is cleaned of duplicates: if it contains both (A, B) and (B, A), only one of them is returned.
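
A rough sketch of why the result can be shorter than max_pairs follows. It assumes that the candidate list is first cut to max_pairs entries and mirrored duplicates are dropped afterwards, which matches the description above but is not taken from the library's source. The name `truncate_then_dedupe` and the candidate values are hypothetical.

```python
def truncate_then_dedupe(candidates, max_pairs):
    """Keep the max_pairs best (score, i, j) candidates, then drop mirrored
    duplicates so that (i, j) and (j, i) count as the same pair."""
    top = sorted(candidates, reverse=True)[:max_pairs]
    seen = set()
    result = []
    for score, i, j in top:
        key = (min(i, j), max(i, j))
        if key not in seen:
            seen.add(key)
            result.append((score, key[0], key[1]))
    return result

# Candidates as a top_k=1 search might produce them: each pair appears twice
candidates = [(0.9, 0, 1), (0.9, 1, 0), (0.8, 2, 3), (0.8, 3, 2), (0.4, 1, 3)]

pairs = truncate_then_dedupe(candidates, max_pairs=4)
print(pairs)
```

In this toy case, asking for up to 4 pairs returns only 2, because the 4 best candidates contain each pair in both orders.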