**This notebook shows how to measure the similarity between two or more sentences using Hugging Face's [sentence_transformers](https://huggingface.co/sentence-transformers) and cosine similarity. I use the "all-mpnet-base-v2" model to embed the sentences; many other pretrained models are available to [explore](https://www.sbert.net/docs/pretrained_models.html).**

Author: Ankit Kumar

In [1]:
pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125938 sha256=06fe28d6c5ca789b34de8a49b465cc6e776bfbabcb2c44f447fdbbbf420be9e9
  Stored in directory: /root/.cache/pip/wheels/62/f2/10/1e606fd5f02395388f74e7462910fe851042f97238cbbd902f
Successfully built sentence-transformers
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-2.2.2
Note: you may need to restart the kernel to use updated packages.


In [2]:
# Import the required packages
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

In [3]:
# Sentences to be checked for similarity
sentences = [
    "Watching Messi play in the worldcup was pure joy.",
    "I loved the way Messi performed in the worldcup.",
    "Man! PSG makes him look like just another player.",
]

# Load the pretrained embedding model
model = SentenceTransformer('all-mpnet-base-v2')

Here I have used short sentences, but the model can also handle longer texts. According to the model card, input longer than 384 word pieces is truncated by default. Find out more [here](https://huggingface.co/sentence-transformers/all-mpnet-base-v2).

In [4]:
# Embed the sentences into numeric vectors
embeddings = model.encode(sentences)

Let's compare how similar the 2nd and 3rd sentences are to the 1st sentence.

In [5]:
# Cosine similarity of the 1st sentence against the 2nd and 3rd
cosine_similarity([embeddings[0]], embeddings[1:])

array([[0.8030752 , 0.38781318]], dtype=float32)

As expected, "Watching Messi play in the worldcup was pure joy" has a cosine similarity of about 0.80 with "I loved the way Messi performed in the worldcup", while "Man! PSG makes him look like just another player" scores only about 0.39.
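
To make the scores less of a black box, here is a minimal sketch of the arithmetic behind cosine similarity: the dot product of two vectors divided by the product of their lengths. The function name `cosine` and the toy 3-dimensional vectors are illustrative assumptions, not part of the notebook; real embeddings from this model have many more dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for real embeddings (illustrative values only).
reference = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]  # parallel to reference, so the score is close to 1
orthogonal = [0.0, 0.0, 5.0]      # perpendicular to reference, so the score is 0

print(cosine(reference, same_direction))
print(cosine(reference, orthogonal))
```

The score depends only on the direction of the vectors, not their length, which is why it is a common choice for comparing embeddings.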

Below is an example with longer texts.

In [6]:
# Adjacent string literals are concatenated, so no indentation leaks into the text
large_sentences = [
    "Modular programming is a software design technique that emphasizes separating the functionality "
    "of a program into independent, interchangeable modules, such that each contains everything necessary to execute "
    "only one aspect of the desired functionality.",

    "A key component of the software design method known as modular programming is the division of a program's "
    "functionality into separate, interchangeable modules, each of which has all the components required to "
    "carry out just one particular aspect of the desired capability.",

    "Tokenization is essentially splitting a phrase, sentence, paragraph, or "
    "an entire text document into smaller units, such as individual words or terms. "
    "Each of these smaller units are called tokens.",
]
embeddings_2 = model.encode(large_sentences)
cosine_similarity([embeddings_2[0]], embeddings_2[1:])

array([[0.9375599 , 0.27826467]], dtype=float32)

It worked like a charm! The first two texts mean the same thing (a definition of modular programming), just paraphrased, and score about 0.94, while the 3rd text defines tokenization and scores only about 0.28.
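
The comparisons in this notebook are always against the first text. To compare every text with every other one, a small NumPy sketch like the following could be used. The helper name `pairwise_cosine` and the toy 2-dimensional vectors are assumptions for illustration; in the notebook, the array returned by `model.encode(...)` would take their place, and calling scikit-learn's `cosine_similarity` with a single argument should give the same matrix.

```python
import numpy as np

def pairwise_cosine(embeddings):
    """Return the n x n matrix of cosine similarities between the rows."""
    matrix = np.asarray(embeddings, dtype=np.float64)
    # Scale each row to unit length; the dot product of unit vectors is their cosine.
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return unit @ unit.T

# Toy vectors standing in for real embeddings (illustrative values only).
toy_embeddings = [
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 2.0],
]

sims = pairwise_cosine(toy_embeddings)
print(np.round(sims, 3))
```

Entry `sims[i, j]` is the similarity between text `i` and text `j`, so the matrix is symmetric and its diagonal is 1.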