<div style="background-color: #ffffff; color: #000000; padding: 10px;">
<img src="../media/img/kisz_logo.png" width="192" height="69"> 
<h1> Working with embeddings:</h1>
<h2> An introductory workshop with applications on Semantic Search</h2>
</div>

<div style="background-color: #f6a800; color: #ffffff; padding: 10px;">
<h2>Part 1.2 - Metrics</h2>
</div>

In this section we will briefly talk about the metrics that are commonly used for measuring the similarity of vector embeddings, and the tools we will use for implementing them.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

import nb_config


from src.plotting import Plotter

<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>1. Overview</h3>
</div>

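We will look at three metrics: Euclidean distance (the straight-line distance between two points), dot product similarity (which depends on both the angle between the vectors and their magnitudes), and cosine similarity (which depends only on the angle between the vectors).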

To illustrate the different methods, we will use the following vectors:

In [None]:
# Define vectors A, B, and C in two dimensions
A = np.array([1, 7])
B = np.array([3, 1])
C = np.array([9, 5])

Plotter.points([A, B, C])

<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>2. Euclidean Distance</h3>
</div>

Euclidean distance measures the straight-line distance between two points in Euclidean space. It is the notion of distance that we use in our daily life.

Key points for this metric are:

- **Dimensionality Impact**: Sensitive to the dimensionality of the embeddings. Higher-dimensional spaces may result in larger distances.
- **Magnitude Information**: Retains information about the magnitude of the vectors.
- **Distance Range**: The Euclidean distance is always a non-negative real number. The smaller the distance, the more similar the vectors are; the larger the distance, the more dissimilar they are.
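To get a feeling for the dimensionality point above, here is a minimal sketch that measures the average Euclidean distance between pairs of random vectors for a few dimensions. The random standard-normal vectors are only an assumption standing in for real embeddings, and the helper name `mean_pairwise_distance` is ours, not part of the workshop code.

```python
import numpy as np

# Assumption: random standard-normal vectors stand in for real embeddings
rng = np.random.default_rng(seed=0)

def mean_pairwise_distance(dim, n_pairs=1000):
    """Average Euclidean distance between random pairs of `dim`-dimensional vectors."""
    x = rng.standard_normal((n_pairs, dim))
    y = rng.standard_normal((n_pairs, dim))
    # axis=1 computes one norm per row, i.e. one distance per pair
    return np.linalg.norm(x - y, axis=1).mean()

mean_dist_by_dim = {dim: mean_pairwise_distance(dim) for dim in (2, 32, 512)}

for dim, dist in mean_dist_by_dim.items():
    print(f"dim={dim:4d}  mean distance={dist:.2f}")
```

The printed averages should grow with the dimension, which is why Euclidean distances from embedding spaces of different sizes are not directly comparable.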

For calculating the Euclidean distance we will use the <kbd>norm</kbd> function from the <kbd>linalg</kbd> module in NumPy. That function returns the norm of a vector, so for calculating the distance between two vectors we just need to subtract them and find the norm of the resulting vector.

In [None]:
euclid_dist_AB = np.linalg.norm(B - A)
euclid_dist_AC = np.linalg.norm(C - A)
euclid_dist_BC = np.linalg.norm(C - B)

euclid_dist_AB, euclid_dist_AC, euclid_dist_BC
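If you want to see what <kbd>norm</kbd> is doing here, the following sketch writes the formula out by hand for A and B (the square root of the sum of the squared coordinate differences) and compares it with the result of <kbd>np.linalg.norm</kbd>. The variable names `manual_dist_AB` and `norm_dist_AB` are only illustrative.

```python
import numpy as np

A = np.array([1, 7])
B = np.array([3, 1])

# Coordinate-wise difference between the two vectors: [2, -6]
diff = B - A

# Written-out formula: sqrt(2**2 + (-6)**2) = sqrt(40)
manual_dist_AB = np.sqrt(np.sum(diff ** 2))

# Same value through the norm of the difference vector
norm_dist_AB = np.linalg.norm(diff)

print(manual_dist_AB, norm_dist_AB)
```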

We have provided a visual aid for comparing the vectors with this metric.

In [None]:
Plotter.euclid_dist([A, B, C])

As you can see, with this metric the first two vectors (A and B) are the closest.

<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>3. Dot product similarity</h3>
</div>

Dot product similarity is used to determine how similar two vectors are by measuring the inner product (also called dot product) of both vectors. From a geometric perspective, it equals the norm of one vector multiplied by the signed length of the projection of the other vector onto it. This metric is used extensively in recommender systems, where the length of the vector can represent the popularity of the object.

Key points for this metric are:
- **Magnitude Impact**: Similar to Euclidean distance, the dot product similarity includes information about the magnitudes of the embeddings.
- **Semantic Information**: Captures the semantic information through the cosine of the angle.
- **Similarity Range**: The dot product similarity can take any real value, and the higher the value, the more similar the vectors are.

For calculating the dot product similarity we will use the <kbd>dot</kbd> function from NumPy. That function returns the dot product of two vectors.

In [None]:
dotprod_AB = np.dot(A, B)
dotprod_AC = np.dot(A, C)
dotprod_BC = np.dot(B, C)

dotprod_AB, dotprod_AC, dotprod_BC
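The geometric reading of the dot product can be checked numerically. The sketch below obtains the angle between A and C from their coordinates (a shortcut that only works for two-dimensional vectors like ours), projects C onto A, and multiplies by the norm of A. The names `theta`, `proj_C_on_A` and `geometric_dot_AC` are only illustrative.

```python
import numpy as np

A = np.array([1, 7])
C = np.array([9, 5])

# Angle between A and C, from the angle each vector makes with the x-axis (2D only)
theta = np.arctan2(A[1], A[0]) - np.arctan2(C[1], C[0])

# Signed length of the projection of C onto the direction of A
proj_C_on_A = np.linalg.norm(C) * np.cos(theta)

# Norm of A times that projection should reproduce np.dot(A, C)
geometric_dot_AC = np.linalg.norm(A) * proj_C_on_A

print(geometric_dot_AC, np.dot(A, C))
```

Both values should agree up to floating-point rounding. The projection is signed, so vectors pointing in roughly opposite directions give a negative dot product.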

We have provided a visual aid for comparing the vectors with this metric.

In [None]:
Plotter.dotprod_dist([A, B, C])

As you can see, with this metric the first and the last vectors (A and C) are the most similar.

<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>4. Cosine similarity</h3>
</div>

Cosine similarity is used to determine how similar two vectors are by measuring the cosine of the angle between them. The resulting similarity ranges from -1 (the vectors point in opposite directions) to 1 (the vectors point in the same direction). A cosine similarity of 0 indicates orthogonality, meaning the vectors are perpendicular.

Key points for this metric are:
- **Scale-Invariance**: Being insensitive to the scale of embeddings, cosine similarity is well-suited for comparing vectors with different magnitudes.
- **Directional Measure**: Focuses on the directional aspect, capturing the angle between vectors rather than their absolute magnitudes.
- **Range**: Cosine similarity outputs values between -1 (indicating opposite directions) and 1 (indicating identical directions), with 0 denoting orthogonality or dissimilarity.
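The scale-invariance point can be seen with a small experiment: we stretch B by a factor of 10 and compare how each metric reacts. This is a minimal sketch; the helper `cosine` and the vector `B_scaled` are our own names, introduced only for this illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

A = np.array([1, 7])
B = np.array([3, 1])
B_scaled = 10 * B  # same direction as B, ten times longer

cos_before, cos_after = cosine(A, B), cosine(A, B_scaled)
dot_before, dot_after = np.dot(A, B), np.dot(A, B_scaled)
dist_before, dist_after = np.linalg.norm(B - A), np.linalg.norm(B_scaled - A)

print(f"cosine:    {cos_before:.3f} -> {cos_after:.3f}")
print(f"dot:       {dot_before} -> {dot_after}")
print(f"euclidean: {dist_before:.3f} -> {dist_after:.3f}")
```

Only the cosine similarity stays the same; the dot product grows by the scaling factor and the Euclidean distance changes as well.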

For calculating the cosine similarity we can use the two functions that we have used for the other metrics, or the <kbd>cosine_similarity</kbd> function from the module <kbd>metrics.pairwise</kbd> in Scikit-Learn (we will prefer it here for its simplicity). This last function computes the cosine similarity between the rows of two matrices, so we have to wrap each vector in a list in order to pass it as a one-row matrix. The output is also a two-dimensional array, which is why we call <kbd>.item()</kbd> to extract the single value.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# With the functions we used for the other metrics
cos_simil_AB_manual = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

# Our preferred choice, with cosine_similarity from scikit-learn
cos_simil_AB = cosine_similarity([A], [B])
cos_simil_AC = cosine_similarity([A], [C])
cos_simil_BC = cosine_similarity([B], [C])

cos_simil_AB.item(), cos_simil_AC.item(), cos_simil_BC.item()
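Because the pairwise function works on matrices, all three comparisons can also be obtained at once by stacking the vectors as rows. The sketch below reproduces that idea with NumPy only (normalize each row, then take all row-by-row dot products), so it is an illustration of what such a pairwise matrix looks like rather than Scikit-Learn's actual implementation. The names `X`, `X_unit` and `cos_matrix` are ours.

```python
import numpy as np

# Stack A, B and C as the rows of one 2D array (shape: 3 x 2)
X = np.array([[1, 7], [3, 1], [9, 5]])

# Normalize each row to unit length
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# All row-by-row dot products: entry [i, j] compares row i with row j
cos_matrix = X_unit @ X_unit.T  # shape: 3 x 3

print(np.round(cos_matrix, 3))
```

The diagonal is 1 (every vector compared with itself) and the matrix is symmetric, so the three values we are interested in sit above the diagonal.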

We have provided a visual aid for comparing the vectors with this metric.

In [None]:
Plotter.cosine_dist([A, B, C])

As you can see, with this metric the last two vectors (B and C) are the most similar.
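The three metrics pick a different "most similar" pair for the same vectors. As a compact recap, the following sketch computes every metric for every pair and reports the winning pair per metric. The dictionaries `vectors`, `results` and `closest` are only illustrative names.

```python
import numpy as np
from itertools import combinations

vectors = {"A": np.array([1, 7]), "B": np.array([3, 1]), "C": np.array([9, 5])}

# One entry per metric, mapping a pair name such as "AB" to its value
results = {"euclidean": {}, "dot": {}, "cosine": {}}
for (name_u, u), (name_v, v) in combinations(vectors.items(), 2):
    pair = name_u + name_v
    results["euclidean"][pair] = np.linalg.norm(v - u)
    results["dot"][pair] = np.dot(u, v)
    results["cosine"][pair] = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Smallest value wins for the distance; largest value wins for the two similarities
closest = {
    "euclidean": min(results["euclidean"], key=results["euclidean"].get),
    "dot": max(results["dot"], key=results["dot"].get),
    "cosine": max(results["cosine"], key=results["cosine"].get),
}
print(closest)  # → {'euclidean': 'AB', 'dot': 'AC', 'cosine': 'BC'}
```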

<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>5. Try it yourself...</h3>
</div>

Try the metrics with your own vectors to get a feeling for how they work.

In [None]:
# Define your own vectors A, B, and C
A = np.array([1, 7])
B = np.array([3, 1])
C = np.array([9, 5])

Calculate the different metrics with

> ```python
> euclid_dist_AB = np.linalg.norm(B - A) # Calculates the Euclidean distance
> dotprod_AB = np.dot(A, B) # Calculates the dot product similarity
> cos_simil_AB = cosine_similarity([A], [B]) # Calculates the cosine similarity
> ```

Visualize the metrics with

> ```python
> Plotter.points([A, B, C]) # Shows the vectors
> Plotter.euclid_dist([A, B, C]) # Shows the Euclidean distance
> Plotter.dotprod_dist([A, B, C]) # Shows the dot product similarity
> Plotter.cosine_dist([A, B, C]) # Shows the cosine similarity
> ```


In [None]:
# your code here