# Sentence Embeddings + Other Comparisons
### Authored by Hilary Bayer

**Overview**<br>
Like the sentence-embedding and cosine-similarity approach we tested earlier, this approach begins by embedding the test phrases. An embedding is a numeric vector representation of a phrase, designed to capture its semantic meaning rather than surface features such as shared letters or tokens, which is what the token overlap approach measured. We use the `SentenceTransformer` class from the `sentence_transformers` Python library for this task.<br>

Next, we compare the embeddings to assess how similar the topics of the two phrases are. In the second approach, we used cosine similarity for this comparison. In this approach, for each pair of embeddings we calculate the Euclidean and Manhattan distances with `scipy`, derive the angular distance from `scipy`'s cosine distance, and compute the dot product with `torch.dot`.<br>

**Comparison**<br>
The dot product $\mathbf{A} \cdot \mathbf{B}$ is the simplest metric. Unlike the distance measures, it increases with the similarity of the phrases rather than with their dissimilarity. It reflects both the magnitude and the direction of the vectors.
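
As a small illustration of the magnitude sensitivity, the sketch below computes the dot product in plain Python. The vectors and the `dot` helper are made up for illustration and are not taken from the evaluation data. Doubling the length of one vector doubles the dot product even though its direction is unchanged.

```python
def dot(a, b):
    """Dot product of two equal-length sequences of numbers."""
    return sum(x * y for x, y in zip(a, b))

# Illustrative vectors, not real embeddings
a = [1.0, 2.0]
b = [3.0, 1.0]
b_scaled = [2 * x for x in b]  # same direction as b, twice the length

base = dot(a, b)           # 1*3 + 2*1 = 5.0
scaled = dot(a, b_scaled)  # 1*6 + 2*2 = 10.0
```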

The Euclidean distance for embeddings $A = (a_1, a_2, \dots, a_n)$ and $B = (b_1, b_2, \dots, b_n)$ is $D_{\text{Euclidean}}(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$. It is a straightforward and common distance metric that is affected by the magnitude of the embedding vectors. 

The Manhattan distance for embeddings $A = (a_1, a_2, \dots, a_n)$ and $B = (b_1, b_2, \dots, b_n)$ is $D_{\text{Manhattan}}(A, B) = \sum_{i=1}^{n} |a_i - b_i|$. It is simple to compute and, because the per-dimension differences are not squared, it is less influenced than the Euclidean distance by a large difference in a single dimension. Like the Euclidean distance, it is sensitive to the magnitude of the embedding vectors.
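
To make the two formulas concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and the helper names `euclidean` and `manhattan` are illustrative assumptions; the notebook itself uses `scipy` for these calculations. The last lines show the magnitude sensitivity: scaling both vectors by 2 doubles both distances.

```python
from math import sqrt

def euclidean(a, b):
    """Square root of the summed squared per-dimension differences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute per-dimension differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Illustrative vectors, not real embeddings
a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]

d_euc = euclidean(a, b)  # sqrt(9 + 16 + 0) = 5.0
d_man = manhattan(a, b)  # 3 + 4 + 0 = 7.0

# Scale both vectors by 2: directions are unchanged, distances double
a2 = [2 * x for x in a]
b2 = [2 * x for x in b]
d_euc_scaled = euclidean(a2, b2)  # 10.0
d_man_scaled = manhattan(a2, b2)  # 14.0
```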

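The angular distance for embeddings $A$ and $B$ is $D_{\text{angular}}(A, B) = \frac{ \arccos \left( \frac{ A \cdot B }{ \|A\| \, \|B\| } \right) }{\pi}$. This measure is calculated from cosine similarity, but it is a distance rather than a similarity: it ranges from 0 for vectors pointing in the same direction to 1 for vectors pointing in opposite directions. Of the three distance metrics, it is the most involved to compute and the only one that is unaffected by vector magnitude, because it depends only on the angle between the vectors.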
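
The formula can be sketched in plain Python as follows. The helper name `angular_distance` and the two-dimensional vectors are illustrative assumptions. The cosine similarity is clamped to $[-1, 1]$ before calling `acos`, since floating-point rounding can otherwise produce a value slightly outside that range and raise a `ValueError`.

```python
from math import acos, pi, sqrt

def angular_distance(a, b):
    """Angle between a and b, divided by pi so the result lies in [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    cos_sim = dot / (norm_a * norm_b)
    # Guard against rounding error pushing the value outside acos's domain
    cos_sim = max(-1.0, min(1.0, cos_sim))
    return acos(cos_sim) / pi

# Illustrative vectors with different lengths; only the direction matters
same = angular_distance([1.0, 0.0], [3.0, 0.0])        # same direction
orthogonal = angular_distance([1.0, 0.0], [0.0, 2.0])  # right angle
opposite = angular_distance([1.0, 0.0], [-4.0, 0.0])   # opposite direction
```

The three results are 0, 0.5, and 1 respectively, regardless of the lengths of the input vectors.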

In [3]:
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer, util
from scipy.spatial import distance
from math import acos, pi

In [None]:
# load evaluation phrase pairs
df = pd.read_csv('evaluation_cases.csv')

# load model
model = SentenceTransformer('all-MiniLM-L6-v2')

# create dataframe to store distances
distances = pd.DataFrame(columns=['euclidean', 'manhattan', 'dot_product', 'angular'])

# loop through the evaluation cases
for index, row in df.iterrows():
    text_a = row['sent1']
    text_b = row['sent2']

    # encode sentences
    emb1 = model.encode(text_a, convert_to_tensor=True)
    emb2 = model.encode(text_b, convert_to_tensor=True)

    # convert to NumPy arrays on the CPU, as scipy expects
    vec1 = emb1.cpu().numpy()
    vec2 = emb2.cpu().numpy()

    # compute the three distances and the dot product (a similarity, not a distance)
    distances.at[index, 'euclidean'] = distance.euclidean(vec1, vec2)
    distances.at[index, 'manhattan'] = distance.cityblock(vec1, vec2)
    distances.at[index, 'dot_product'] = torch.dot(emb1, emb2).item()
    # clamp cosine similarity to [-1, 1] so rounding error cannot push acos out of its domain
    cos_sim = max(-1.0, min(1.0, 1 - distance.cosine(vec1, vec2)))
    distances.at[index, 'angular'] = acos(cos_sim) / pi

In [6]:
# normalize each metric by its column sum so that each normalized column sums to 1
distances['euclidean_normalized'] = distances['euclidean'] / distances['euclidean'].sum()
distances['manhattan_normalized'] = distances['manhattan'] / distances['manhattan'].sum()
distances['dot_product_normalized'] = distances['dot_product'] / distances['dot_product'].sum()
distances['angular_normalized'] = distances['angular'] / distances['angular'].sum()

In [7]:
distances

| index | euclidean | manhattan | dot_product | angular | euclidean_normalized | manhattan_normalized | dot_product_normalized | angular_normalized |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.935487 | 14.680565 | 0.562432 | 0.309866 | 0.293376 | 0.294925 | 0.254214 | 0.28556 |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.45199 | 0.0 |
| 2 | 0.842431 | 13.078042 | 0.645155 | 0.276793 | 0.264193 | 0.262731 | 0.291604 | 0.255082 |
| 3 | 1.41078 | 22.018671 | 0.00485 | 0.498456 | 0.442431 | 0.442344 | 0.002192 | 0.459358 |
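
As a quick check on the normalization step, the sketch below divides each Euclidean value by the column sum and compares the result with the `euclidean_normalized` column. It uses plain Python lists rather than pandas, and the values are copied from the output above.

```python
# Euclidean distances for pairs 0-3, copied from the output above
euclidean = [0.935487, 0.0, 0.842431, 1.41078]

# Divide every value by the column sum so the normalized values sum to 1
total = sum(euclidean)
normalized = [d / total for d in euclidean]

# Round to six decimals for comparison with the euclidean_normalized column
rounded = [round(v, 6) for v in normalized]
```

Because every value in a column is divided by the same positive constant, this normalization rescales the column without changing the ranking of the pairs within it.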


**Results**<br>
All three distance measures (Euclidean, Manhattan, and angular) agree on the following order of phrase pairs, from smallest to largest distance:
- "he ran quickly to the store" and "he ran quickly to the store"
- "domestic unrest" and "political instability in the country"
- "the cat sat on the mat" and "a feline rested atop a rug"
- "turn left at the traffic light" and "photosynthesis occurs in plant cells"

Ranking these phrase pairs in order of decreasing dot product produces the same order.
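
Part of this agreement may be expected rather than coincidental. The output is consistent with embeddings of unit length: the identical pair has a dot product of exactly 1.0. This is an inference from the reported numbers, not something checked in the notebook. If $\|A\| = \|B\| = 1$, then $\|A - B\|^2 = 2 - 2\,(A \cdot B)$ and $D_{\text{angular}}(A, B) = \arccos(A \cdot B) / \pi$, so the Euclidean and angular distances both decrease as the dot product increases and the three rankings must coincide. The sketch below checks these two identities against the reported values, in plain Python.

```python
from math import acos, pi, sqrt

# (dot_product, euclidean, angular) for pairs 0-3, copied from the output table
rows = [
    (0.562432, 0.935487, 0.309866),
    (1.0, 0.0, 0.0),
    (0.645155, 0.842431, 0.276793),
    (0.00485, 1.41078, 0.498456),
]

max_euclidean_gap = 0.0
max_angular_gap = 0.0
for dot, euclidean, angular in rows:
    # Assuming unit vectors: ||A - B||^2 = ||A||^2 + ||B||^2 - 2 A.B = 2 - 2 A.B
    predicted_euclidean = sqrt(max(0.0, 2 - 2 * dot))
    # Assuming unit vectors, the cosine similarity equals the dot product
    predicted_angular = acos(max(-1.0, min(1.0, dot))) / pi
    max_euclidean_gap = max(max_euclidean_gap, abs(predicted_euclidean - euclidean))
    max_angular_gap = max(max_angular_gap, abs(predicted_angular - angular))
```

Both gaps come out below $10^{-4}$, which is within the rounding of the reported values. The Manhattan distance is not determined by the dot product in this way, so its agreement with the other metrics is an empirical result for these four pairs.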

These rankings differ from the reference rankings, which rank the cat and feline pair as more similar than the domestic unrest and political instability pair.


