# Semantic Similarity

A key technology or *"idea"* required to power our app is *semantic similarity*: comparing items based on their human meaning.

We'll start by applying this concept to text, as it is one of the more intuitive modalities for this idea.

A traditional search or comparison between texts looks at words or sub-words and compares the frequency of important words across the items being compared. Consider these three phrases:

🏦 "The **Bank** of England"

🌾 "A grassy **bank**"

🛩 "A plane **bank**s"

The word **"bank"** being shared by each phrase means a traditional keyword comparison may view these phrases as similar. However, from a human *semantic* perspective they are not similar at all. Each has a completely different meaning.

A *semantic similarity* technology would be able to recognize this and understand the difference between each phrase based on the surrounding words (the context).
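To make the keyword problem concrete, here is a minimal sketch of a keyword-overlap score. It assumes a simple Jaccard overlap on lowercase word sets, which is only one of many traditional approaches; the function name `keyword_overlap` is made up for this illustration.

```python
import re

def keyword_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the sets of lowercase words in a and b.

    Illustrative helper only: a stand-in for a traditional keyword comparison.
    """
    tokens_a = set(re.findall(r"[a-z]+", a.lower()))
    tokens_b = set(re.findall(r"[a-z]+", b.lower()))
    if not tokens_a or not tokens_b:
        return 0.0
    # shared words divided by all distinct words across both phrases
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# The shared word "bank" gives these two unrelated phrases a non-zero score
print(keyword_overlap("The Bank of England", "A grassy bank"))

# A phrase with a related meaning but no shared words scores zero
print(keyword_overlap("The Bank of England", "A British financial institution"))
```

Under this scoring, the grassy bank looks closer to the Bank of England than a British financial institution does, which is the opposite of what a human reader would say.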

## Embedding Models

To find "semantic similarity" we need models that have been trained to understand patterns in language. A popular example of this is BERT, a well-known transformer model that has been trained on huge amounts of text data.

Thanks to the huge training datasets used by BERT and other language models, they are able to grasp linguistic patterns surprisingly well.

From here we can further fine-tune (i.e. train) the language model on pairs of similar and dissimilar text. We call this *contrastive learning*.

A model fine-tuned with contrastive learning methods is able to transform sentences into *vector embeddings*.

<img src="https://github.com/jamescalam/applied-ml-minicourse/raw/main/images/encoder-vector-space.png" style="width:70%">

During the contrastive learning process, the model learns to place similar sentences close together in the vector space and dissimilar sentences far apart.

<img src="https://github.com/jamescalam/applied-ml-minicourse/raw/main/images/vector-space.png" style="width:70%">
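As a rough illustration of the idea, the sketch below implements one classic pairwise contrastive loss in plain Python. This is an assumption chosen for clarity and not necessarily the objective used to train any particular model mentioned here; the function name, the `margin` parameter, and the toy 2-D vectors are all hypothetical.

```python
import math

def contrastive_loss(u, v, is_similar, margin=1.0):
    """Pairwise contrastive loss on two vectors (illustrative sketch).

    Similar pairs are penalized for being far apart; dissimilar pairs are
    penalized only when they sit closer together than `margin`.
    """
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    if is_similar:
        return distance ** 2
    return max(0.0, margin - distance) ** 2

origin = [0.0, 0.0]

# Similar pair placed far apart: large loss, training pulls them together
print(contrastive_loss(origin, [3.0, 4.0], is_similar=True))   # 25.0

# Dissimilar pair placed close together: non-zero loss, training pushes them apart
print(contrastive_loss(origin, [0.5, 0.0], is_similar=False))  # 0.25

# Dissimilar pair already far apart: no loss
print(contrastive_loss(origin, [3.0, 4.0], is_similar=False))  # 0.0
```

Minimizing a loss of this shape over many labeled pairs is what moves similar sentences together and dissimilar sentences apart.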

Using this logic, we can transform sentences into *meaningful* vectors and compare them with *similarity metrics*. These metrics are simply calculations that measure the *similarity or distance* between vectors.

<div style="display:flex">
  <div style="flex:50%;padding:5px">
    <img src="https://github.com/jamescalam/applied-ml-minicourse/raw/main/images/vec-search-distance.png" alt="Euclidean distance" style="width:100%">
  </div>
  <div style="flex:50%;padding:5px">
    <img src="https://github.com/jamescalam/applied-ml-minicourse/raw/main/images/vec-search-cosine.png" alt="Cosine similarity" style="width:100%">
  </div>
</div>
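The two metrics pictured above can be sketched in a few lines of plain Python. The vectors here are hypothetical 2-D examples made up for illustration; real sentence embeddings have hundreds of dimensions, but the calculations are the same.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two vectors (smaller means closer)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, not real embeddings
a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, twice the length
c = [0.0, 1.0]  # perpendicular to a

print(cosine_similarity(a, b))   # 1.0
print(euclidean_distance(a, b))  # 1.0
print(cosine_similarity(a, c))   # 0.0
print(euclidean_distance(a, c))  # about 1.414
```

Note that cosine similarity ignores vector length and looks only at direction, whereas Euclidean distance is affected by both.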

The `sentence-transformers` library is a popular way of creating vector embeddings from text. Let's take a look at how to use it.

## Implementation

We start by installing the `sentence-transformers` library:

In [None]:
!pip install -qq sentence-transformers

Initialize an existing sentence transformer model called [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) like so:

In [1]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

We can encode text into vectors like so:

In [2]:
phrases = [
    "the bank of england",  # phrase A
    "a british financial institution",  # phrase B
    "a grassy bank"  # phrase C
]

embeddings = model.encode(phrases)

We will compare them using cosine similarity. Again, `sentence-transformers` provides utilities for this:

In [5]:
from sentence_transformers.util import cos_sim

ab = cos_sim(embeddings[0], embeddings[1]).item()
ac = cos_sim(embeddings[0], embeddings[2]).item()

print(f"""
    {round(ab, 2)}: "{phrases[0]}" vs "{phrases[1]}"
    {round(ac, 2)}: "{phrases[0]}" vs "{phrases[2]}"
""")


    0.73: "the bank of england" vs "a british financial institution"
    0.54: "the bank of england" vs "a grassy bank"



The similarity score between phrases *A* and *B* is much greater than that between *A* and *C*, even though *A* and *B* share no words while *A* and *C* share the word "bank". This is because we are measuring *semantic similarity* rather than the traditional keyword overlap.