https://enjoymachinelearning.com/blog/finding-semantic-similarity-between-sentences-in-python/

In [1]:
# comparing the two sentences using SBERT and Cosine Similarity

# here's the install command
#!pip install -U sentence-transformers
import pandas as pd
from sentence_transformers import SentenceTransformer, util


# load a pre-trained Sentence Transformers model
model = SentenceTransformer('all-MiniLM-L6-v2')


# as always, we will get sentences from a
# public kaggle dataset
# https://www.kaggle.com/datasets/columbine/imdb-dataset-sentiment-analysis-in-csv-format?resource=download
df = pd.read_csv('../../dataset/similarity/train.csv')

# while this data has lots of good info, we just need the reviews
# let's grab 2000

# in real life, you should not clean data like this;
# this isn't a data-cleaning tutorial, so I didn't want to bloat
# the code

# this is not production-ready data!!
# note: .replace('br', '') strips the leftover "<br />" tags, but it
# also removes "br" from inside ordinary words
sentences = [sentence.lower()
             .replace('br', '')
             .replace('<', '')
             .replace('>', '')
             .replace('\\', '')
             .replace('\\/', '')
             for sentence in df.text.sample(n=2000)]


# see one sentence, and the length of our data
print(sentences[5:6], f'\n\nLength Of Data {len(sentences)}')

["will they ever make movies without nudity and sex? this came on at 3:00 on sunday afternoon and i couldn't believe what they showed. thank god my son was outside or i would have been freaked out if he had seen the soft/medium porn! do people who make movies not care who they offend or corrupt? kids could have been watching after church and that is what they show???!!! the acting was good and i enjoyed the suspense but gee! there was violence and bad guys but that is to be expected in a western movie. randy travis was really good in his role. if the writers, directors and producers would just quit putting on so much uncalled for sex scenes. what has to happen to get them to quit going in that direction? where can i complain?"] 

Length Of Data 2000
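
The chained `.replace()` calls above are deliberately quick and dirty. As a rough sketch of a slightly safer cleanup, assuming the only markup in the reviews is HTML line-break tags such as `<br />`, a regular expression can target those tags without touching words that happen to contain "br". The helper name `clean_review` is hypothetical and not part of the original notebook.

```python
import re

# Hypothetical helper, not from the original notebook.
# Assumption: the only markup in the reviews is <br>, <br/> or <br /> tags.
_BR_TAG = re.compile(r'<br\s*/?>', flags=re.IGNORECASE)
_WHITESPACE = re.compile(r'\s+')


def clean_review(text):
    """Lowercase a review and replace HTML line-break tags with spaces."""
    text = _BR_TAG.sub(' ', text)       # drop the tags, keep word boundaries
    text = _WHITESPACE.sub(' ', text)   # collapse runs of whitespace
    return text.strip().lower()


print(clean_review('A brilliant film.<br /><br />Bruce was great!'))
# prints: a brilliant film. bruce was great!
```

If this were swapped in, the list comprehension would become `[clean_review(s) for s in df.text.sample(n=2000)]`.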


In [2]:
# let's find the sentence in our dataset that is semantically
# closest to a sentence that we come up with

# i like action movies, mission impossible is one of my favorites
our_sentence = 'I really love action movies, huge tom cruise fan!'

# let's embed our sentence
my_embedding = model.encode(our_sentence)

# let's embed the corpus
embeddings = model.encode(sentences)

# compute cosine similarity between my sentence and each one in the corpus
cos_sim = util.cos_sim(my_embedding, embeddings)

# let's go through the scores and find our best matches!
# remember, we want the highest value here (highest cosine similarity)
winners = []
for arr in cos_sim:
    for i, each_val in enumerate(arr):
        # convert the 0-dim tensor to a plain Python float
        winners.append([sentences[i], float(each_val)])

# sort by score, highest first, so we can take the top 2 sentences
final_winners = sorted(winners, key=lambda x: x[1], reverse=True)



for arr in final_winners[0:2]:
    print(f'\nScore : \n\n  {arr[1]}')
    print(f'\nSentence : \n\n {arr[0]}')


Score : 

  0.4957503080368042

Sentence : 

 gone in 60 sec. where do i began, it keeps you in the movie with some good action and some cool cars. people say its not a good movie i disagree sure it has some cheesy parts but what action movie doesn't. i gave it an 8 out of 10 cause of the action and the comic relief if you like the rock or face off than this movie is right up your alley cage dose a good job along with one of the most under rated actors in my mind del-roy lindo. i think sometimes people look to far into movies some times you need to sit back enjoy the movie and after words ask yourself did they achieve what they where showing. meaning if they where going for action was it action pact. if they where trying to make a movie to change how movies are made and trying to win every award out their well did they? i think they made the action movie they set out to make, give it a chance and you wont be sorry.

Score : 

  0.48675280809402466

Sentence : 

 this movie really rock
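
To make the ranking step less of a black box, here is a minimal pure-Python sketch of the same idea on toy vectors. Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between embeddings rather than their magnitude. The function names `cosine_similarity` and `top_k` and the 2-dimensional vectors are illustrative assumptions. Real embeddings from the model have many more dimensions, and this is not how `util.cos_sim` is implemented internally.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, corpus_vectors, k=2):
    """Return the k highest (score, index) pairs, best first."""
    scored = ((cosine_similarity(query, vec), i)
              for i, vec in enumerate(corpus_vectors))
    return heapq.nlargest(k, scored)


# Toy stand-ins for sentence embeddings (assumed values, for illustration only)
query = [1.0, 0.0]
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

print([index for _, index in top_k(query, corpus)])
# prints: [0, 2]
```

Using `heapq.nlargest` avoids sorting the whole list when only the top few matches are needed, which matters more as the corpus grows.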