# Local vs Global Semantics in Latent Space
### Authored by Adam Visokay

To make any similarity or distance metric globally interpretable, you need to anchor it to an empirical distribution that represents “the space of all language” (or at least, a large enough sample of it). Without such an anchor, the distances you compute in latent space are only meaningful in a local context, relative to the specific data points you are comparing.

Without calibration, your scores are contextual: they only mean something relative to the other items in your dataset. For instance, a cosine similarity of 0.75 could be:

- “very high” in a corpus of random, unrelated sentences, but
- only “average” in a corpus of paraphrase pairs.

So to make a similarity score global, you have to answer: “High relative to what?” That’s what a reference distribution provides.
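
To make “high relative to what?” concrete, here is a minimal sketch that ranks the same score against two reference distributions. Both reference samples are synthetic and purely illustrative (the means and spreads are assumptions, not measurements from any corpus), and `percentile_rank` is a helper name introduced only for this example.

```python
import numpy as np

def percentile_rank(score, reference):
    """Percent of reference scores strictly below `score`."""
    reference = np.asarray(reference, dtype=float)
    return 100.0 * np.mean(reference < score)

rng = np.random.default_rng(0)

# Hypothetical reference distributions of cosine similarity (illustrative only).
random_pairs = np.clip(rng.normal(loc=0.10, scale=0.12, size=5000), -1.0, 1.0)
paraphrase_pairs = np.clip(rng.normal(loc=0.75, scale=0.10, size=5000), -1.0, 1.0)

score = 0.75
rank_vs_random = percentile_rank(score, random_pairs)
rank_vs_paraphrase = percentile_rank(score, paraphrase_pairs)

print(f"vs random pairs:     {rank_vs_random:.1f}th percentile")
print(f"vs paraphrase pairs: {rank_vs_paraphrase:.1f}th percentile")
```

The raw score is identical in both calls; only the reference distribution changes, and with it the interpretation.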

## Methods for Calibration - Brief Lit Review

Several NLP papers address embedding-space calibration, the isotropy or anisotropy of embedding distributions, and, to some extent, the idea of grounding similarity scores in more global distributions.

#### On the Sentence Embeddings from Pre-trained Language Models (Li et al., 2020)
**Summary:** They show that embeddings from models like BERT, used without special processing, tend to occupy a narrow “cone” (i.e., an anisotropic space), which hurts similarity tasks.  
**Links:** [arXiv](https://arxiv.org/abs/2012.14538)  

**Relevance:** This supports the claim that absolute distances/similarities are not straightforwardly calibrated because of the geometry of the embedding space.
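
As a rough illustration of why a narrow cone inflates similarity scores, the sketch below compares the average pairwise cosine similarity of two toy point clouds. The vectors are random NumPy data, not real model embeddings, and the shared offset is an assumption chosen to mimic the effect; `mean_pairwise_cosine` is a hypothetical helper.

```python
import numpy as np

def mean_pairwise_cosine(X):
    """Average cosine similarity over all distinct pairs of rows in X."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ X.T
    n = len(X)
    # Drop the diagonal (each vector compared with itself).
    off_diag_sum = sims.sum() - np.trace(sims)
    return off_diag_sum / (n * (n - 1))

rng = np.random.default_rng(0)
dim = 64

# Isotropic toy vectors: directions spread evenly around the origin.
isotropic = rng.normal(size=(500, dim))

# Anisotropic toy vectors: the same noise plus a large shared offset,
# which pushes every vector into a narrow cone.
shared_offset = 5.0 * np.ones(dim)
anisotropic = isotropic + shared_offset

iso_mean = mean_pairwise_cosine(isotropic)
aniso_mean = mean_pairwise_cosine(anisotropic)

print(f"isotropic mean cosine:   {iso_mean:.3f}")
print(f"anisotropic mean cosine: {aniso_mean:.3f}")
```

In the anisotropic cloud, even unrelated vectors score close to 1, so a raw cosine value says little without knowing the baseline.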

#### Whitening Sentence Representations for Better Semantics and Faster Retrieval (Su et al., 2021)
**Summary:** They apply a “whitening” transformation to sentence embeddings (making mean zero, covariance ≈ identity) to improve retrieval and similarity tasks.  
**Links:** [CatalyzeX](https://www.catalyzex.com/paper/arxiv:2103.15316)  

**Relevance:** The whitening step effectively calibrates the embedding space — a key idea if you want global or interpretable similarity scores rather than dataset-relative ones.
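
A minimal NumPy sketch of a whitening transform of this kind follows. It uses the standard SVD-based construction (subtract the mean, then rotate and rescale so the covariance becomes the identity); the toy data and the function names `fit_whitening` and `whiten` are assumptions for illustration, not code from the paper.

```python
import numpy as np

def fit_whitening(X):
    """Return (mu, W) such that (X - mu) @ W has identity covariance."""
    mu = X.mean(axis=0, keepdims=True)
    cov = np.cov(X - mu, rowvar=False)
    # cov = U @ diag(S) @ U.T, so W = U @ diag(1/sqrt(S)) whitens the data.
    U, S, _ = np.linalg.svd(cov)
    W = U @ np.diag(1.0 / np.sqrt(S))
    return mu, W

def whiten(X, mu, W):
    """Apply a fitted whitening transform to rows of X."""
    return (X - mu) @ W

rng = np.random.default_rng(0)

# Toy "embeddings": correlated dimensions plus a shared offset (illustrative only).
mixing = rng.normal(size=(8, 8))
X = rng.normal(size=(2000, 8)) @ mixing + 3.0

mu, W = fit_whitening(X)
Xw = whiten(X, mu, W)
cov_after = np.cov(Xw, rowvar=False)

print(np.round(cov_after, 3))
```

In practice `mu` and `W` would be fit on a large reference sample and then reused for new embeddings, which is what ties the transform to a global distribution.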

#### Semantic Alignment with Calibrated Similarity for Multilingual Sentence Embedding (Ham & Kim, 2021)
**Summary:** Introduces a “calibrated similarity” for multilingual sentence embeddings. The goal is to align embeddings across languages in a globally comparable way.  
**Links:** [ACL Anthology](https://aclanthology.org/2021.emnlp-main.381/)  

**Relevance:** This provides direct precedent for introducing calibration of similarity metrics across large populations of embeddings — the idea of global rather than local similarity.

## Borrowing from Hilary's work (`03-embedding_other.ipynb`) as an example


In [None]:
import pandas as pd
import numpy as np
import random
import torch
from tqdm import tqdm
from sentence_transformers import SentenceTransformer, util
from scipy.spatial import distance
from math import acos, pi
from datasets import load_dataset

In [None]:
df = pd.read_csv('evaluation_cases.csv')

# load model (this might take a couple minutes the first time)
model = SentenceTransformer('all-MiniLM-L6-v2')

# create dataframe to store distances
distances = pd.DataFrame(columns=['euclidean', 'manhattan', 'dot_product', 'angular'])

In [None]:
# loop through the evaluation cases
for index, row in df.iterrows():
    text_a = row['sent1']
    text_b = row['sent2']

    # encode sentences
    emb1 = model.encode(text_a, convert_to_tensor=True)
    emb2 = model.encode(text_b, convert_to_tensor=True)

    # compute distances
    distances.at[index, 'euclidean'] = distance.euclidean(emb1.cpu().numpy(), emb2.cpu().numpy())
    distances.at[index, 'manhattan'] = distance.cityblock(emb1.cpu().numpy(), emb2.cpu().numpy()) 
    distances.at[index, 'dot_product'] = torch.dot(emb1, emb2).item()
    distances.at[index, 'angular'] = acos(1 - distance.cosine(emb1.cpu().numpy(), emb2.cpu().numpy()))/pi

In [4]:
distances

| | euclidean | manhattan | dot_product | angular |
|---|---|---|---|---|
| 0 | 0.935487 | 14.680563 | 0.562432 | 0.690134 |
| 1 | 0.0 | 0.0 | 1.0 | 1.0 |
| 2 | 0.842431 | 13.078041 | 0.645155 | 0.723207 |
| 3 | 1.41078 | 22.018671 | 0.00485 | 0.501544 |


## Option 1: Normalize distances using anchors

To calibrate the distance metrics, I created arbitrary anchor pairs representing different levels of semantic similarity:
- **Identical**: Same sentence (distance = 0)
- **Paraphrase**: Semantically equivalent but different wording
- **Related**: Topically related but different meaning
- **Unrelated**: Completely different topics

I'll use these anchors to normalize the distances into a 0-1 scale where:
- 0 = identical (like our "identical" anchor)
- 1 = maximally different (like our "unrelated" anchor)

In [None]:
# Define anchor sentence pairs representing different levels of semantic similarity
anchors = {
    'identical': ('the quick brown fox jumps over the lazy dog', 
                  'the quick brown fox jumps over the lazy dog'),       
    'paraphrase': ('the economy is growing rapidly', 
                   'economic growth is accelerating quickly'),
    'related': ('the stock market crashed today', 
                'investors are worried about the economy'),
    'unrelated': ('machine learning algorithms are improving', 
                  'the chef prepared a delicious pasta dish')
}

# Compute distances for anchor pairs
anchor_distances = {}
for label, (sent1, sent2) in anchors.items():
    emb1 = model.encode(sent1, convert_to_tensor=True)
    emb2 = model.encode(sent2, convert_to_tensor=True)
    
    anchor_distances[label] = {
        'euclidean': distance.euclidean(emb1.cpu().numpy(), emb2.cpu().numpy()),
        'manhattan': distance.cityblock(emb1.cpu().numpy(), emb2.cpu().numpy()),
        'dot_product': torch.dot(emb1, emb2).item(),
        'angular': acos(1 - distance.cosine(emb1.cpu().numpy(), emb2.cpu().numpy()))/pi
    }

# Display anchor distances
anchor_df = pd.DataFrame(anchor_distances).T
print("Anchor distances:")
anchor_df

Anchor distances:


| | euclidean | manhattan | dot_product | angular |
|---|---|---|---|---|
| identical | 0.0 | 0.0 | 1.0 | 1.0 |
| paraphrase | 0.558132 | 8.714251 | 0.844245 | 0.819951 |
| related | 1.010194 | 15.630564 | 0.489754 | 0.662913 |
| unrelated | 1.359313 | 20.842813 | 0.076134 | 0.524258 |


In [6]:
# Create normalized distances using min-max normalization
# We use 'identical' as min (0) and 'unrelated' as max (1)

distances_norm = pd.DataFrame(index=distances.index)

for metric in ['euclidean', 'manhattan', 'angular']:
    min_val = anchor_distances['identical'][metric]
    max_val = anchor_distances['unrelated'][metric]
    
    # Min-max normalization: (x - min) / (max - min)
    distances_norm[f'{metric}_norm'] = (distances[metric] - min_val) / (max_val - min_val)

# For dot_product, higher is more similar, so we need to invert
# Normalize using max (identical) and min (unrelated)
max_val = anchor_distances['identical']['dot_product']
min_val = anchor_distances['unrelated']['dot_product']
distances_norm['dot_product_norm'] = (max_val - distances['dot_product']) / (max_val - min_val)

print("Normalized distances (0 = identical, 1 = maximally different):")
distances_norm

Normalized distances (0 = identical, 1 = maximally different):


| | euclidean_norm | manhattan_norm | angular_norm | dot_product_norm |
|---|---|---|---|---|
| 0 | 0.688206 | 0.704347 | 0.651331 | 0.4736273 |
| 1 | 0.0 | 0.0 | -0.0 | -1.290332e-07 |
| 2 | 0.619748 | 0.62746 | 0.581812 | 0.384087 |
| 3 | 1.037863 | 1.056415 | 1.047745 | 1.077159 |


In [7]:
# Combine and reorder columns so raw and normalized metrics are side by side
combined = pd.concat([distances, distances_norm], axis=1)

# Reorder columns to group raw and normalized versions together
column_order = [
    'euclidean', 'euclidean_norm',
    'manhattan', 'manhattan_norm',
    'dot_product', 'dot_product_norm',
    'angular', 'angular_norm'
]

combined = combined[column_order]
combined

| | euclidean | euclidean_norm | manhattan | manhattan_norm | dot_product | dot_product_norm | angular | angular_norm |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.935487 | 0.688206 | 14.680563 | 0.704347 | 0.562432 | 0.4736273 | 0.690134 | 0.651331 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1.290332e-07 | 1.0 | -0.0 |
| 2 | 0.842431 | 0.619748 | 13.078041 | 0.62746 | 0.645155 | 0.384087 | 0.723207 | 0.581812 |
| 3 | 1.41078 | 1.037863 | 22.018671 | 1.056415 | 0.00485 | 1.077159 | 0.501544 | 1.047745 |


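Including manually curated extremes helps ground the distances in a more interpretable way, but the choice of anchors is arbitrary and ad hoc. Note that pair 3 scores slightly above 1 on every metric, meaning it is farther apart than the “unrelated” anchor pair.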
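
The anchor normalization can be written as a single formula that handles both orientations (distances, where lower means more similar, and the dot product, where higher means more similar). The sketch below is a worked example using values from the tables above; `anchor_normalize` is a hypothetical helper name, not a function from the notebook.

```python
def anchor_normalize(value, identical, unrelated):
    """Map `value` to a scale where the 'identical' anchor is 0 and the 'unrelated' anchor is 1."""
    return (value - identical) / (unrelated - identical)

# Euclidean distance for pair 0 (lower = more similar): anchors 0.0 and 1.359313.
euclidean_norm = anchor_normalize(0.935487, identical=0.0, unrelated=1.359313)

# Dot product for pair 0 (higher = more similar): anchors 1.0 and 0.076134.
# The same formula works because both numerator and denominator change sign.
dot_norm = anchor_normalize(0.562432, identical=1.0, unrelated=0.076134)

print(round(euclidean_norm, 4), round(dot_norm, 4))
```

Nothing clamps the result, so a pair that is closer than the “identical” anchor or farther than the “unrelated” anchor lands outside the 0-1 range.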

## Option 2: Background calibration with data-driven global baseline

If you want something statistically grounded instead of manually curated, use a background corpus. Here I use random pairs of sentences from the AG News dataset as a background distribution of “typical” distances, which lets me compare my target pairs against a more global baseline.

<u>Note: In practice, you would want a much larger sample of background pairs (thousands) to get a stable estimate of the distribution; I use a small sample here for demonstration. Also, my target cases include exact literal string matches, while the background sample is unlikely to contain any identical pairs, so the background minimum distance will be > 0. As a result, very similar pairs in my target set can receive negative normalized distances. In a real application, make sure the background corpus is large and diverse enough to cover the full range of similarities, including identical pairs if possible.</u>
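
One way to scale up to thousands of background pairs is to sample index pairs directly, so that any two items in the corpus can be paired. The sketch below is an assumption about how this could be done, not part of the original notebook; `sample_random_pairs` is a hypothetical helper, and the fixed seed is there only to make the draw repeatable.

```python
import numpy as np

def sample_random_pairs(n_items, n_pairs, seed=0):
    """Draw `n_pairs` index pairs (i, j) with i != j, uniformly at random (requires n_items >= 2)."""
    rng = np.random.default_rng(seed)
    first = rng.integers(0, n_items, size=n_pairs)
    # Shift by 1..n_items-1 (mod n_items) so the second index never equals the first.
    offset = rng.integers(1, n_items, size=n_pairs)
    second = (first + offset) % n_items
    return np.stack([first, second], axis=1)

pairs = sample_random_pairs(n_items=1000, n_pairs=5000, seed=42)
print(pairs[:5])
```

Each row can then index into a list of sentences (or precomputed embeddings) to build the background set. Pairs are drawn with replacement, so an occasional repeated pair is possible.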

In [None]:
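# Using random sentences from AG News - estimate background distribution of distances
news = load_dataset("ag_news", split="train[:1000]")
# keep only texts with more than 5 whitespace-separated tokens
sentences = [s['text'] for s in news if len(s['text'].split()) > 5]
# pair each text with the next one in the dataset, then sample 50 of those pairs
random_pairs = random.sample(list(zip(sentences[:-1], sentences[1:])), 50)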

# Compute all distance metrics for random pairs
bg_distances = {
    'euclidean': [],
    'manhattan': [],
    'dot_product': [],
    'angular': []
}

print(f"Computing distances for {len(random_pairs)} random sentence pairs...")
for s1, s2 in tqdm(random_pairs):
    emb1 = model.encode(s1, convert_to_tensor=True)
    emb2 = model.encode(s2, convert_to_tensor=True)
    
    bg_distances['euclidean'].append(distance.euclidean(emb1.cpu().numpy(), emb2.cpu().numpy()))
    bg_distances['manhattan'].append(distance.cityblock(emb1.cpu().numpy(), emb2.cpu().numpy()))
    bg_distances['dot_product'].append(torch.dot(emb1, emb2).item())
    bg_distances['angular'].append(acos(1 - distance.cosine(emb1.cpu().numpy(), emb2.cpu().numpy()))/pi)

# Compute statistics for each metric
bg_stats = {}
for metric in bg_distances.keys():
    bg_stats[metric] = {
        'mean': np.mean(bg_distances[metric]),
        'std': np.std(bg_distances[metric]),
        'min': np.min(bg_distances[metric]),
        'max': np.max(bg_distances[metric])
    }

print("\nBackground distribution statistics:")
pd.DataFrame(bg_stats).T

Computing distances for 50 random sentence pairs...




Background distribution statistics:





| | mean | std | min | max |
|---|---|---|---|---|
| euclidean | 1.327357 | 0.106252 | 0.828798 | 1.468788 |
| manhattan | 20.643732 | 1.642879 | 12.95701 | 23.467381 |
| dot_product | 0.113417 | 0.127836 | -0.07867 | 0.656547 |
| angular | 0.53678 | 0.042826 | 0.474933 | 0.727984 |


In [10]:
# Create distribution-based calibrations using min-max normalization
distances_distribution = pd.DataFrame(index=distances.index)

for metric in ['euclidean', 'manhattan', 'angular', 'dot_product']:
    bg_vals = np.array(bg_distances[metric])
    min_val = bg_vals.min()
    max_val = bg_vals.max()

    # Min-max normalization based on background distribution
    if metric == 'dot_product':
        # Higher dot product = more similar, so invert: (max - value) / (max - min)
        distances_distribution[f'{metric}_background'] = (max_val - distances[metric]) / (max_val - min_val)
    else:
        # Standard min-max: (value - min) / (max - min)
        distances_distribution[f'{metric}_background'] = (distances[metric] - min_val) / (max_val - min_val)
    
    # Percentile calibration: % of background pairs that are more similar than this pair
    if metric == 'dot_product':
        # Higher dot product = more similar, so count background values > current
        distances_distribution[f'{metric}_percentile'] = [
            (np.sum(bg_vals > val) / len(bg_vals)) * 100 for val in distances[metric]
        ]
    else:
        # Lower distance = more similar, so count background values < current
        distances_distribution[f'{metric}_percentile'] = [
            (np.sum(bg_vals < val) / len(bg_vals)) * 100 for val in distances[metric]
        ]

print("Distribution-based calibrations:")
print("- background: 0 = most similar in background, 1 = most dissimilar in background")
print("- percentile: % of background pairs that are more similar than this pair")
distances_distribution

Distribution-based calibrations:
- background: 0 = most similar in background, 1 = most dissimilar in background
- percentile: % of background pairs that are more similar than this pair


| | euclidean_background | euclidean_percentile | manhattan_background | manhattan_percentile | angular_background | angular_percentile | dot_product_background | dot_product_percentile |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.541703 | 2.0 | 0.542567 | 2.0 | 0.504191 | 98.0 | 0.375968 | 2.0 |
| 1 | -0.24897 | 0.0 | -0.241088 | 0.0 | 1.216255 | 100.0 | -0.041381 | 0.0 |
| 2 | 0.463052 | 2.0 | 0.457024 | 2.0 | 0.580192 | 98.0 | 0.297067 | 2.0 |
| 3 | 0.94342 | 88.0 | 0.934279 | 90.0 | 0.070815 | 12.0 | 0.907785 | 88.0 |


In [None]:
# Combine all three approaches: raw, anchor-normalized, and distribution-calibrated
combined_all = pd.concat([distances, distances_norm, distances_distribution], axis=1)

# Reorder columns to group each metric with its calibrations
column_order = []
for metric in ['euclidean', 'manhattan', 'dot_product', 'angular']:
    column_order.extend([
        metric,                   # raw
        f'{metric}_norm',         # anchor-normalized
        f'{metric}_background'    # background min-max normalized
    ])

combined_all = combined_all[column_order]
combined_all

| | euclidean | euclidean_norm | euclidean_background | manhattan | manhattan_norm | manhattan_background | dot_product | dot_product_norm | dot_product_background | angular | angular_norm | angular_background |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.935487 | 0.688206 | 0.541703 | 14.680563 | 0.704347 | 0.542567 | 0.562432 | 0.4736273 | 0.375968 | 0.690134 | 0.651331 | 0.504191 |
| 1 | 0.0 | 0.0 | -0.24897 | 0.0 | 0.0 | -0.241088 | 1.0 | -1.290332e-07 | -0.041381 | 1.0 | -0.0 | 1.216255 |
| 2 | 0.842431 | 0.619748 | 0.463052 | 13.078041 | 0.62746 | 0.457024 | 0.645155 | 0.384087 | 0.297067 | 0.723207 | 0.581812 | 0.580192 |
| 3 | 1.41078 | 1.037863 | 0.94342 | 22.018671 | 1.056415 | 0.934279 | 0.00485 | 1.077159 | 0.907785 | 0.501544 | 1.047745 | 0.070815 |


You can see that the background distribution calibration offers another way to ground the distances in a more global context, rather than relying on a few manually chosen anchors. This is more statistically principled, but it also depends on having a representative background corpus. In this case, my target set contains values more extreme than anything in the background sample (the identical pair, for example), so some calibrated scores fall outside the 0-1 range and the normalization is less effective at spreading out the scores.
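
The background statistics table above also reports a mean and standard deviation per metric, which the notebook computes but does not use. As a hedged alternative to min-max scaling (an addition for illustration, not the method used in this notebook), a z-score expresses each target value in background standard deviations and does not hinge on the single most extreme background pair. `background_zscore` is a hypothetical helper; the numbers are copied from the tables above.

```python
def background_zscore(value, bg_mean, bg_std):
    """Number of background standard deviations between `value` and the background mean."""
    return (value - bg_mean) / bg_std

# Euclidean distance for target pair 0 against the background mean and std.
z_euclidean = background_zscore(0.935487, bg_mean=1.327357, bg_std=0.106252)

# Dot product for the same pair (higher = more similar, so the sign flips).
z_dot = background_zscore(0.562432, bg_mean=0.113417, bg_std=0.127836)

print(f"euclidean z-score:   {z_euclidean:.2f}")
print(f"dot product z-score: {z_dot:.2f}")
```

A strongly negative distance z-score (or strongly positive dot-product z-score) marks a pair as much more similar than a typical background pair. The scale is unbounded, so it trades the tidy 0-1 range for robustness to the background extremes.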