https://enjoymachinelearning.com/blog/finding-semantic-similarity-between-sentences-in-python/

In [2]:
# comparing the two sentences using SBERT and Cosine Similarity

# here's the install command
#!pip install -U sentence-transformers
import pandas as pd
from sentence_transformers import SentenceTransformer, util


# load a pretrained Sentence Transformers model
model = SentenceTransformer('all-MiniLM-L6-v2')


# as always, we will get sentences from a
# public kaggle dataset
# https://www.kaggle.com/datasets/columbine/imdb-dataset-sentiment-analysis-in-csv-format?resource=download
df = pd.read_csv('../../dataset/similarity/train.csv')

# while this data has lots of good info, we just need the reviews
# let's grab a random sample of 2000

# in real life, you should not clean data like this;
# this isn't a data cleaning tutorial, so I didn't want to bloat
# the code

# this is not production-ready data!!
# note: .replace('br', '') also strips "br" inside ordinary words
# (e.g. "brilliant" becomes "illiant"), not just <br /> tags
sentences = [sentence.lower()
             .replace('br', '')
             .replace('<', '')
             .replace('>', '')
             .replace('\\', '')
             for sentence in df.text.sample(n=2000)]


# look at one sentence and check the size of our sample
print(sentences[5:6], f'\n\nLength Of Data {len(sentences)}')

['i see quite a few positive reviews on this board, trying to revive this film from its lackluster status and starting a cult following. i see the usual ranting--"i guess this movie is just not for the easily offended," "this movie is not shakespeare," etc. guess what? neither was "road trip"! and i laughed my a** off during that movie! there\'s a way to make a crude, tasteless comedy and deliver laughs; and there\'s a way to...just make it crude and tasteless. "whipped" tries to be "swingers" without the wit or intelligence. it seems to have been written through the puerile eyes of a 14-year-old boy. for god\'s sake, the characters in this movie are supposed to be white-collar, upright citizens--and they talk like some of the idiots i knew in freshman year of high school! the dialogue is laced--more like drowned--with four-letter words. you would think that people of their status would have some degree of intelligence--and a more extensive vocabulary. just watch a whit stillman film a
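
The chained `.replace()` calls above are deliberately quick and dirty, and they remove the letters "br" wherever they occur. As a rough sketch of a gentler cleanup using only the standard library, the helper below strips HTML tags with regular expressions instead. The name `clean_review` and both patterns are assumptions made for illustration; they are not part of the original notebook, and the generic tag pattern can still misfire on text that uses `<` and `>` as comparison signs.

```python
import re

# Hypothetical helper, not part of the original notebook.
_BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)  # <br>, <br/>, <br />
_ANY_TAG = re.compile(r"<[^>]+>")                        # any other HTML-like tag


def clean_review(text: str) -> str:
    """Lowercase a review and strip HTML tags without touching ordinary words."""
    text = _BR_TAG.sub(" ", text)    # line-break tags become a single space
    text = _ANY_TAG.sub("", text)    # drop any remaining tags
    text = text.replace("\\", "")    # drop stray backslashes, as the cell above does
    return re.sub(r"\s+", " ", text).strip().lower()


print(clean_review("John Waters is brilliant.<br /><br />Rent it today!"))
# prints: john waters is brilliant. rent it today!
```

With a helper like this, the list comprehension could become `[clean_review(s) for s in df.text.sample(n=2000)]`, assuming the same `df` as in the cell above.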

In [3]:
# let's find the sentence in our dataset that is semantically
# closest to a sentence we come up with

# I like action movies; Mission Impossible is one of my favorites
our_sentence = 'I really love action movies, huge tom cruise fan!'

# embed our sentence
my_embedding = model.encode(our_sentence)

# embed the corpus
embeddings = model.encode(sentences)

# compute cosine similarity between our sentence and each one in the corpus
# (the result has shape [1, len(sentences)])
cos_sim = util.cos_sim(my_embedding, embeddings)

# go through the scores and pair each sentence with its score
# remember, we want the highest value here (highest cosine similarity)
winners = []
for arr in cos_sim:
    for i, each_val in enumerate(arr):
        winners.append([sentences[i], each_val])

# sort by score, highest first; the top 2 are printed below
final_winners = sorted(winners, key=lambda x: x[1], reverse=True)



for arr in final_winners[0:2]:
    print(f'\nScore : \n\n  {arr[1]}')
    print(f'\nSentence : \n\n {arr[0]}')


Score : 

  0.4778851270675659

Sentence : 

 this movie was pure genius. john waters is illiant. it is hilarious and i am not sick of it even after seeing it about 20 times since i bought it a few months ago. the acting is great, although ricki lake could have been better. and johnny depp is magnificent. he is such a beautiful man and a very talented actor. and seeing most of johnny's movies, this is probably my favorite. i give it 9.5/10. rent it today!

Score : 

  0.4539894759654999

Sentence : 

 i've seen this film literally over 100 times...it's absolutely jam-packed with entertainment!!! powers boothe gives a stellar performance. as a fan of actors such as william shatner (impulse, 1974) and ron liebmann (up the academy, 1981)i never thought an actor could capture the "intensity" like shatner and liebmann in those roles, until i saw boothe as jim jones! as far as i'm concerned, powers boothe is jim jones...this film captures his best performance!!!
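
To make the ranking step less of a black box, here is a minimal pure-Python sketch of what the cell above does conceptually: score every corpus vector against the query by cosine similarity, then keep the k highest. The function names `cosine_similarity` and `top_k` and the toy 3-dimensional vectors are assumptions made for illustration; real embeddings from `all-MiniLM-L6-v2` have 384 dimensions, and `util.cos_sim` performs the same calculation on tensors.

```python
import heapq
import math


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|) for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, corpus_vecs, texts, k=2):
    """Return the k (score, text) pairs with the highest cosine similarity."""
    scored = ((cosine_similarity(query_vec, vec), text)
              for vec, text in zip(corpus_vecs, texts))
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Toy vectors standing in for real sentence embeddings (illustrative only).
toy_texts = ["action film", "romance film", "cooking show"]
toy_vecs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.0, 0.0]

for score, text in top_k(query, toy_vecs, toy_texts, k=2):
    print(f"{score:.2f}  {text}")
```

Using `heapq.nlargest` avoids sorting the whole corpus when only a few results are needed. Note that the sketch has no guard against zero-length vectors, which would cause a division by zero.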
