This notebook gives an overview of cases where embedding-based similarity detection works well (good) and cases where it fails (bad).

In [2]:
!pip install sentence_transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence_transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0 (from sentence_transformers)
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting sentencepiece (from sentence_transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0 (from sentence_transformers)
  Downloading huggingface_hub-0.15.1-py3-

In [5]:
from sentence_transformers import SentenceTransformer, util

SENTENCE_MODEL = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2') # 90M

def check_message_similarity(message, messages_to_compare, threshold=0.7):
    """Return (True, best_match) if the closest candidate scores above threshold, else (False, None)."""
    if not messages_to_compare:
        # Nothing to compare against; max() on an empty tensor would raise
        return False, None
    # Embed the query message and every candidate message
    message_embedding = SENTENCE_MODEL.encode(message)
    messages_to_compare_embeddings = SENTENCE_MODEL.encode(messages_to_compare)
    # cos_sim returns a 1 x N tensor: one score per candidate
    cos_similarities = util.cos_sim(message_embedding, messages_to_compare_embeddings)
    max_similarity = cos_similarities.max().item()
    max_similarity_index = cos_similarities.argmax().item()
    message_with_max_similarity = messages_to_compare[max_similarity_index]
    print('message:', message, '\n message with max similarity:', message_with_max_similarity, '\n similarity score:', max_similarity)
    # Only report a match when the best score clears the threshold
    if max_similarity > threshold:
        return True, message_with_max_similarity
    return False, None
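
For reference, the score reported by `util.cos_sim` is the cosine of the angle between two embedding vectors: their dot product divided by the product of their lengths. The block below is a minimal pure-Python sketch of that formula. The helper name `cosine_similarity` and the toy 2-D vectors are assumptions made for illustration; real embeddings from the model have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical, not real embeddings)
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors
print(same_direction, orthogonal)
```

Parallel vectors score close to 1 and perpendicular vectors score 0, which is why a threshold such as 0.7 acts as a cut-off on how closely two messages point in the same direction.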

Good: Varying sentence structure

In [6]:
check_message_similarity('you\'re hated by everyone here', ['everyone here hates you'])

message: you're hated by everyone here 
 message with max similarity: everyone here hates you 
 similarity score: 0.8098553419113159


(True, 'everyone here hates you')

Good: Abbreviations

In [7]:
check_message_similarity('you are stupid', ['u r stupid'])

message: you are stupid 
 message with max similarity: u r stupid 
 similarity score: 0.7140724658966064


(True, 'u r stupid')

Bad: Character substitutions (e.g. a digit in place of a letter)

In [14]:
check_message_similarity('you are stupid', ['you are stup1d'])

message: you are stupid 
 message with max similarity: you are stup1d 
 similarity score: 0.3606286346912384


(False, None)
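
One possible mitigation for this case is to normalize look-alike characters before embedding. The sketch below is not part of the original notebook: the substitution table `LEET_MAP` and the helper `normalize_leetspeak` are hypothetical, and the table is deliberately small.

```python
# Hypothetical substitution table mapping common look-alike characters to letters
LEET_MAP = str.maketrans({
    '0': 'o', '1': 'i', '3': 'e', '4': 'a',
    '5': 's', '7': 't', '@': 'a', '$': 's',
})

def normalize_leetspeak(text):
    """Lowercase the text and replace look-alike characters with letters."""
    return text.lower().translate(LEET_MAP)

print(normalize_leetspeak('you are stup1d'))  # you are stupid
```

The normalized string could then be passed to `check_message_similarity` in place of the raw message. The trade-off is that genuine digits are rewritten too (for example, a room number), so this kind of normalization is probably best limited to the similarity check rather than applied to stored text.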

Bad: Typos

In [15]:
check_message_similarity('you are stupid', ['you are stoopid'])

message: you are stupid 
 message with max similarity: you are stoopid 
 similarity score: 0.35633277893066406


(False, None)
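
Misspellings like this change only a few characters, so a character-level comparison may catch what the embedding misses. The sketch below uses the standard library's `difflib.SequenceMatcher` as a complementary check. The helper names and the `char_threshold` value of 0.85 are assumptions for illustration and were not tuned on real data.

```python
from difflib import SequenceMatcher

def char_similarity(a, b):
    """Character-level similarity ratio in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def has_near_duplicate(message, messages_to_compare, char_threshold=0.85):
    """Return True if any candidate is nearly identical to message at the character level."""
    return any(char_similarity(message, other) >= char_threshold
               for other in messages_to_compare)

print(has_near_duplicate('you are stupid', ['you are stoopid']))  # True
```

Character overlap says nothing about meaning, so a check like this would only complement the embedding comparison, not replace it.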

Bad: Similar sentence structures with different meanings

In [16]:
check_message_similarity('you are an extremely stupid person', ['you are an extremely smart person'])

message: you are an extremely stupid person 
 message with max similarity: you are an extremely smart person 
 similarity score: 0.7837221026420593


(True, 'you are an extremely smart person')

Bad: Trivial rewording (here, expanding a contraction) causes a noticeable drop in similarity score

In [18]:
check_message_similarity('you\'re stupid', ['you are stupid'])

message: you're stupid 
 message with max similarity: you are stupid 
 similarity score: 0.8898153305053711


(True, 'you are stupid')
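
Since the two messages here differ only by a contraction, one option is to expand contractions on both sides before embedding. The sketch below is an assumption-laden illustration: the `CONTRACTIONS` table and the `expand_contractions` helper are hypothetical, cover only a handful of forms, and handle only the straight apostrophe.

```python
import re

# Small hypothetical table; a real one would need many more entries
CONTRACTIONS = {
    "you're": "you are",
    "i'm": "i am",
    "don't": "do not",
    "can't": "cannot",
}

# Match any listed contraction as a whole word, ignoring case
_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(text):
    """Replace each known contraction with its lowercase expanded form."""
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("you're stupid"))  # you are stupid
```

After expansion the two example messages become identical strings, so they would receive identical embeddings.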