# Text Embedding Comparison

This notebook creates and visualizes 200 text embeddings at 512 dimensions each, projected into 2D, for four popular open-weight text embedding models from Google, Qwen, IBM, and Tencent. Even with identical inputs and dimensionality, each model induces its own embedding space, with different clusters, separations, and neighborhood relationships. This is why production systems need explicit embedding-model versioning and a full re-embedding plus re-indexing step whenever the underlying model changes.
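
The versioning point can be made concrete with a small guard. The following is a minimal sketch and not part of the notebook: `EmbeddingManifest` and `assert_compatible` are hypothetical names, and the revision strings are placeholders. It assumes each stored index carries a manifest recording which model produced its vectors, and it refuses to mix vectors across manifests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingManifest:
    """Hypothetical record stored alongside an embedding index."""

    model_id: str  # e.g. a Hugging Face repository name
    revision: str  # placeholder for a pinned model revision
    dimensions: int  # output dimensionality after any truncation


def assert_compatible(index: EmbeddingManifest, query: EmbeddingManifest) -> None:
    """Raise if query vectors come from a different embedding space than the index."""
    if index != query:
        raise ValueError(
            f"Embedding space mismatch: index built with {index}, "
            f"query encoded with {query}. Re-embed and re-index first."
        )


index_manifest = EmbeddingManifest("google/embeddinggemma-300m", "rev-a", 512)
same_model = EmbeddingManifest("google/embeddinggemma-300m", "rev-a", 512)
other_model = EmbeddingManifest("Qwen/Qwen3-Embedding-0.6B", "rev-a", 512)

assert_compatible(index_manifest, same_model)  # passes silently

try:
    assert_compatible(index_manifest, other_model)
except ValueError as err:
    print(f"Rejected: {err}")
```

Matching dimensionality alone is not enough for compatibility, which is why the model identifier and revision are part of the comparison.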

## Install Dependencies

In [1]:
%pip install pip -Uq

Note: you may need to restart the kernel to use updated packages.


In [28]:
%pip install -r requirements.txt -Uq

Note: you may need to restart the kernel to use updated packages.


In [1]:
import os

# Disable tokenizers parallelism warnings in notebook contexts
os.environ["TOKENIZERS_PARALLELISM"] = "false"

## Authenticate with Hugging Face

**IMPORTANT: Complete these steps in order:**

1. **Request model access**: Visit https://huggingface.co/google/embeddinggemma-300m and click "Request access to this model" (requires a free Hugging Face account)
2. **Wait for approval**: Access is usually granted immediately or within a few minutes
3. **Get your token**: Go to https://huggingface.co/settings/tokens and create a new token (read permission is sufficient)
4. **Run the login cell below**: Execute the next cell and paste your token in the text box that appears
5. **Verify login**: Run the verification cell to confirm you're authenticated

In [2]:
from huggingface_hub import notebook_login

# Login to Hugging Face (this will show a widget for entering your token)
notebook_login()

In [3]:
# Verify authentication status
from huggingface_hub import whoami

try:
    user_info = whoami()
    print(f"✓ Successfully logged in as: {user_info['name']}")
    print(f"✓ Authentication token is valid")
except Exception as e:
    print("✗ Not logged in or token is invalid")
    print(f"Error: {e}")

✓ Successfully logged in as: garystafford
✓ Authentication token is valid


## Load Quotes for Embedding

In [4]:
# Load the entire JSONL file into a list of quote strings and print the total.
import json


def load_all_quotes(file_path):
    quotes = []
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            quote = json.loads(line)
            quotes.append(quote["inputs"])
    return quotes


file_path = "quotes/quotes_200.jsonl"
quotes = load_all_quotes(file_path)
print(f"Total number of quotes: {len(quotes)}")
print(f"First quote in list: {quotes[0]}")

Total number of quotes: 200
First quote in list: I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best.
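
The loader assumes one JSON object per line, with the quote text stored under an `inputs` key. The sketch below writes a two-line file in that shape to a temporary directory and reads it back with the same logic. The sample quotes and the skipping of blank lines are illustrative additions, not part of the notebook's dataset or loader.

```python
import json
import tempfile
from pathlib import Path


def load_all_quotes(file_path):
    """Read one JSON object per line and collect its "inputs" field."""
    quotes = []
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            if line.strip():  # tolerate blank lines (an addition to the original loader)
                quotes.append(json.loads(line)["inputs"])
    return quotes


# Illustrative records only; the real file holds 200 quotes.
sample_records = [
    {"inputs": "First sample quote."},
    {"inputs": "Second sample quote."},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "quotes_sample.jsonl"
    with open(sample_path, "w", encoding="utf-8") as out:
        for record in sample_records:
            out.write(json.dumps(record) + "\n")
    sample_quotes = load_all_quotes(sample_path)

print(sample_quotes)
```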


## Common Methods

In [5]:
from sentence_transformers import SentenceTransformer
from torch import Tensor


def compute_similarity(model: SentenceTransformer) -> Tensor:
    """Compute the similarity between queries and answers using the given model.

    Args:
        model (SentenceTransformer): The sentence transformer model to use for encoding and similarity computation.

    Returns:
        Tensor: A tensor containing the similarity scores between each query and answer.
    """
    # The queries and answers to embed
    queries = [
        "What is the capital of China?",
        "Explain gravity",
    ]
    answers = [
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
    ]

    # Encode the queries and answers. Queries benefit from a prompt: here we use
    # the prompt named "query" stored under `model.prompts`, but you can also
    # pass your own prompt via the `prompt` argument.
    query_embeddings = model.encode_query(queries, prompt_name="query")
    answer_embeddings = model.encode_document(answers)

    # Compute the (cosine) similarity between the query and answer embeddings
    similarity = model.similarity(query_embeddings, answer_embeddings)

    return similarity
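
For reference, the similarity matrix returned here can be sketched in plain Python. This assumes the model's similarity function is cosine similarity, as the comment in the function states. The three-dimensional vectors are made up for illustration and are not model outputs.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(queries, documents):
    """Entry [i][j] is the cosine similarity of query i and document j."""
    return [[cosine_similarity(q, d) for d in documents] for q in queries]


# Toy vectors standing in for query and answer embeddings.
toy_queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_answers = [[2.0, 0.0, 0.0], [0.0, 3.0, 4.0]]

matrix = similarity_matrix(toy_queries, toy_answers)
for row in matrix:
    print([round(value, 4) for value in row])
```

Row `i` of the result corresponds to query `i` and column `j` to answer `j`, so a well-behaved model should put its largest values on the diagonal for this test.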

In [6]:
from numpy import ndarray


def generate_embeddings(model: SentenceTransformer, quotes: list[str]) -> ndarray:
    """Embed a list of quotes using the given model and report the time taken.

    Args:
        model (SentenceTransformer): The sentence transformer model to use for encoding.
        quotes (list[str]): A list of quotes to embed.

    Returns:
        ndarray: The quote embeddings, truncated to 512 dimensions (shape: [len(quotes), 512]).
    """
    import time

    start_time = time.time()
    quote_embeddings = model.encode(
        quotes,
        batch_size=32,
        show_progress_bar=True,
        truncate_dim=512,  # <- desired output dim
    )
    end_time = time.time()

    print(f"Time taken: {end_time - start_time} seconds")
    print(f"Time per embedding: {(end_time - start_time) / len(quotes)} seconds")

    return quote_embeddings
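
As a sketch of what `truncate_dim` implies: truncation is assumed here to keep the leading components of each vector, which shortens the vector without re-normalizing it. The helper names and the toy vector are hypothetical, and how much meaning a truncated prefix retains depends on how the model was trained.

```python
import math


def truncate(vector: list[float], dim: int) -> list[float]:
    """Keep only the first `dim` components."""
    return vector[:dim]


def l2_norm(vector: list[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def renormalize(vector: list[float]) -> list[float]:
    """Rescale to unit length so dot products equal cosine similarities again."""
    norm = l2_norm(vector)
    return [x / norm for x in vector] if norm else vector


full = [0.5, 0.5, 0.5, 0.5]  # toy unit-length vector
short = truncate(full, 2)

print(l2_norm(full))  # unit length before truncation
print(l2_norm(short))  # shorter than 1 after truncation
print(l2_norm(renormalize(short)))  # back to (approximately) unit length
```

Cosine similarity is unaffected by vector length, but code that uses raw dot products on truncated vectors may want the re-normalization step.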

In [7]:
import numpy as np


def save_embeddings(quote_embeddings: ndarray, file_name: str) -> str:
    """Save the quote embeddings to a file.

    Args:
        quote_embeddings (ndarray): The quote embeddings to save.
        file_name (str): The name of the file to save the embeddings to.

    Returns:
        str: The path to the saved embeddings file.
    """
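    # Create the output directory if it does not exist yet; np.save will not create it
    os.makedirs("embeddings", exist_ok=True)
    embeddings_path = os.path.join("embeddings", file_name)
    np.save(embeddings_path, quote_embeddings)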

    return embeddings_path

In [8]:
def load_embeddings(embeddings_path: str) -> ndarray:
    """Load embeddings from a file.

    Args:
        embeddings_path (str): Path to the file containing the saved embeddings.

    Returns:
        ndarray: The loaded embeddings as a NumPy array.
    """
    embeddings = np.load(embeddings_path)

    return embeddings

## Model 1: EmbeddingGemma

`google/embeddinggemma-300m`

https://huggingface.co/google/embeddinggemma-300m

In [11]:
# Load the model
model = SentenceTransformer("google/embeddinggemma-300m")
print(f"Model loaded: {model.model_card_data}")

Model loaded: tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
base_model: google/embeddinggemma-300m
pipeline_tag: sentence-similarity
library_name: sentence-transformers


In [12]:
# Run test inference with queries and answers
similarities = compute_similarity(model)
print(f"Similarities: {similarities}")

Similarities: tensor([[0.6092, 0.0413],
        [0.0383, 0.5682]])


In [13]:
# Embed the quotes
embeddings = generate_embeddings(model, quotes)
print(f"Shape: {embeddings.shape}")  # (200, 512)

Time taken: 1.947321891784668 seconds
Time per embedding: 0.00973660945892334 seconds
Shape: (200, 512)


In [14]:
# Save the quote embeddings to a file for later use
embeddings_path = save_embeddings(embeddings, "google-embedding-gemma-300m-512.npy")
print(f"Embeddings saved to: {embeddings_path}")

Embeddings saved to: embeddings/google-embedding-gemma-300m-512.npy


In [15]:
# Load the quote embeddings from the file
loaded_embeddings = load_embeddings(embeddings_path)
print(f"Shape: {loaded_embeddings.shape}")

Shape: (200, 512)


## Model 2: Qwen3 Embedding 0.6B

`Qwen/Qwen3-Embedding-0.6B`

https://huggingface.co/Qwen/Qwen3-Embedding-0.6B

In [16]:
# Load the model
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
print(f"Model loaded: {model.model_card_data}")

Model loaded: tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
base_model: Qwen/Qwen3-Embedding-0.6B
pipeline_tag: sentence-similarity
library_name: sentence-transformers


In [17]:
# Run test inference with queries and answers
similarities = compute_similarity(model)
print(f"Similarities: {similarities}")

Similarities: tensor([[0.7646, 0.1414],
        [0.1355, 0.6000]])


In [18]:
# Embed the quotes
embeddings = generate_embeddings(model, quotes)
print(f"Shape: {embeddings.shape}")  # (200, 512)

Time taken: 3.042691230773926 seconds
Time per embedding: 0.015213456153869629 seconds
Shape: (200, 512)


In [19]:
# Save the quote embeddings to a file for later use
embeddings_path = save_embeddings(embeddings, "qwen-qwen3-embedding-0.6b-512.npy")
print(f"Embeddings saved to: {embeddings_path}")

Embeddings saved to: embeddings/qwen-qwen3-embedding-0.6b-512.npy


In [20]:
# Load the quote embeddings from the file
loaded_embeddings = load_embeddings(embeddings_path)
print(f"Shape: {loaded_embeddings.shape}")

Shape: (200, 512)


## Model 3: IBM Granite Embedding 125m English

`ibm-granite/granite-embedding-125m-english`

https://huggingface.co/ibm-granite/granite-embedding-125m-english

In [21]:
# Load the model
model = SentenceTransformer("ibm-granite/granite-embedding-125m-english")
print(f"Model loaded: {model.model_card_data}")

Model loaded: tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
base_model: ibm-granite/granite-embedding-125m-english
pipeline_tag: sentence-similarity
library_name: sentence-transformers


In [22]:
# Run test inference with queries and answers
similarities = compute_similarity(model)
print(f"Similarities: {similarities}")

Similarities: tensor([[0.9442, 0.6652],
        [0.7050, 0.9081]])


In [23]:
# Embed the quotes
embeddings = generate_embeddings(model, quotes)
print(f"Shape: {embeddings.shape}")  # (200, 512)

Time taken: 1.8521819114685059 seconds
Time per embedding: 0.00926090955734253 seconds
Shape: (200, 512)


In [24]:
# Save the quote embeddings to a file for later use
embeddings_path = save_embeddings(
    embeddings, "ibm-granite-embedding-125m-english-512.npy"
)
print(f"Embeddings saved to: {embeddings_path}")

Embeddings saved to: embeddings/ibm-granite-embedding-125m-english-512.npy


In [25]:
# Load the quote embeddings from the file
loaded_embeddings = load_embeddings(embeddings_path)
print(f"Shape: {loaded_embeddings.shape}")

Shape: (200, 512)


## Model 4: TencentBAC Conan Embedding v1

`TencentBAC/Conan-embedding-v1`

https://huggingface.co/TencentBAC/Conan-embedding-v1

In [12]:
# Load the model
model = SentenceTransformer("TencentBAC/Conan-embedding-v1")
print(f"Model loaded: {model.model_card_data}")

Model loaded: tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
base_model: TencentBAC/Conan-embedding-v1
pipeline_tag: sentence-similarity
library_name: sentence-transformers


In [13]:
# Run test inference with queries and answers
similarities = compute_similarity(model)
print(f"Similarities: {similarities}")

Similarities: tensor([[0.8930, 0.6899],
        [0.6826, 0.8692]])
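
The four test matrices use the same queries and answers, yet their raw scores sit on different scales. The sketch below uses only the values printed in this notebook and computes, for each model, the gap between the matching pair (diagonal) and the best non-matching pair (off-diagonal). The helper name `row_margins` is illustrative.

```python
# Similarity matrices copied from the notebook outputs (rows: queries, columns: answers).
reported = {
    "google/embeddinggemma-300m": [[0.6092, 0.0413], [0.0383, 0.5682]],
    "Qwen/Qwen3-Embedding-0.6B": [[0.7646, 0.1414], [0.1355, 0.6000]],
    "ibm-granite/granite-embedding-125m-english": [[0.9442, 0.6652], [0.7050, 0.9081]],
    "TencentBAC/Conan-embedding-v1": [[0.8930, 0.6899], [0.6826, 0.8692]],
}


def row_margins(matrix):
    """For each query, matching score minus the best non-matching score."""
    margins = []
    for i, row in enumerate(matrix):
        best_other = max(score for j, score in enumerate(row) if j != i)
        margins.append(round(row[i] - best_other, 4))
    return margins


for model_id, matrix in reported.items():
    print(f"{model_id}: margins={row_margins(matrix)}")
```

Every model ranks the correct answer first, but a fixed threshold would behave very differently across them: for the IBM and Tencent models even the non-matching pairs score above 0.66, higher than the matching pairs of EmbeddingGemma. Similarity thresholds therefore need to be re-tuned whenever the embedding model changes.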


In [14]:
# Embed the quotes
embeddings = generate_embeddings(model, quotes)
print(f"Shape: {embeddings.shape}")  # (200, 512)

Time taken: 2.1788878440856934 seconds
Time per embedding: 0.010894439220428466 seconds
Shape: (200, 512)


In [15]:
# Save the quote embeddings to a file for later use
embeddings_path = save_embeddings(
    embeddings, "tencentbac-conan-embedding-v1-512.npy"
)
print(f"Embeddings saved to: {embeddings_path}")

Embeddings saved to: embeddings/tencentbac-conan-embedding-v1-512.npy


In [16]:
# Load the quote embeddings from the file
loaded_embeddings = load_embeddings(embeddings_path)
print(f"Shape: {loaded_embeddings.shape}")

Shape: (200, 512)


## Visualization Methods

**`normalize_embeddings()`**
- L2 normalizes embeddings to unit length
- Standardizes to zero mean and unit variance
- Ensures all models are on comparable scales

**`visualize_multiple_embeddings_improved()`**
- Normalizes each model's embeddings separately before combining
- Reports PCA explained variance ratio
- Supports both PCA and t-SNE
- Includes hover text with quote content
- Better for direct comparison when normalization is appropriate

**`visualize_embeddings_separately()`**
- Applies PCA/t-SNE independently to each model
- Shows each embedding space's own structure, as far as a 2D projection can capture it
- No cross-contamination between models
- Side-by-side subplots for comparison
- Better for understanding individual model characteristics

### When to Use Which:
- **Separate visualization** (`visualize_embeddings_separately`): Best for understanding each model's embedding space structure independently
- **Combined normalized** (`visualize_multiple_embeddings_improved`): Best for direct comparison when you want to see relative positions across models

In [77]:
import pandas as pd
import plotly.express as px
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from plotly.subplots import make_subplots
import plotly.graph_objects as go

In [78]:
def normalize_embeddings(embeddings: ndarray) -> ndarray:
    """Normalize embeddings using L2 normalization followed by standardization.

    This ensures embeddings from different models are on comparable scales:
    1. L2 normalization: Scale each embedding vector to unit length
    2. Standardization: Zero mean and unit variance per dimension

    Args:
        embeddings (ndarray): The embeddings to normalize (shape: [n_samples, n_dimensions])

    Returns:
        ndarray: Normalized embeddings with the same shape
    """
    # Step 1: L2 normalize each embedding vector to unit length
    # This makes all vectors lie on a hypersphere
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Avoid division by zero
    norms = np.where(norms == 0, 1, norms)
    l2_normalized = embeddings / norms

    # Step 2: Standardize to zero mean and unit variance per dimension
    # This ensures different models have comparable variance structures
    mean = l2_normalized.mean(axis=0)
    std = l2_normalized.std(axis=0)
    # Avoid division by zero for constant dimensions
    std = np.where(std == 0, 1, std)
    standardized = (l2_normalized - mean) / std

    return standardized
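
A small plain-Python mirror of the two steps, on made-up 2D vectors, shows one consequence worth keeping in mind: after standardization each dimension has zero mean and unit variance, but the rows are generally no longer unit length. This is a sketch for illustration only; the function name `normalize_rows_then_columns` and the data are hypothetical.

```python
import math
import statistics


def normalize_rows_then_columns(rows):
    """L2-normalize each row, then standardize each column (population std)."""
    unit_rows = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row)) or 1.0  # guard zero vectors
        unit_rows.append([x / norm for x in row])

    columns = list(zip(*unit_rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]  # guard constant columns

    return [
        [(x - m) / s for x, m, s in zip(row, means, stds)]
        for row in unit_rows
    ]


toy = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
result = normalize_rows_then_columns(toy)

column_means = [statistics.fmean(col) for col in zip(*result)]
column_stds = [statistics.pstdev(col) for col in zip(*result)]
row_norms = [math.sqrt(sum(x * x for x in row)) for row in result]

print([round(m, 6) for m in column_means])
print([round(s, 6) for s in column_stds])
print([round(n, 3) for n in row_norms])
```

The population standard deviation is used because it matches the default of NumPy's `std`, which the function above relies on.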

In [108]:
def visualize_multiple_embeddings_improved(
    embeddings_list: list[ndarray],
    model_names: list[str],
    quotes: list[str] = None,
    method: str = "pca",
):
    """Visualize multiple sets of embeddings in 2D space with proper normalization.

    This function addresses methodological issues by:
    1. Normalizing each model's embeddings separately before combining
    2. Reporting explained variance for PCA
    3. Supporting both PCA and t-SNE
    4. Adding hover text with quote content

    Args:
        embeddings_list (list[ndarray]): A list of embeddings to visualize.
        model_names (list[str]): A list of model names corresponding to the embeddings.
        quotes (list[str], optional): Original quote texts for hover display.
        method (str): Dimensionality reduction method - "pca" or "tsne"
    """
    print("Normalizing embeddings for each model separately...")
    normalized_embeddings = []
    for i, emb in enumerate(embeddings_list):
        norm_emb = normalize_embeddings(emb)
        normalized_embeddings.append(norm_emb)
        print(f"  {model_names[i]}: normalized {emb.shape[0]} embeddings")

    # Combine normalized embeddings
    combined_embeddings = np.vstack(normalized_embeddings)
    print(f"\nCombined shape: {combined_embeddings.shape}")

    # Apply dimensionality reduction
    if method.lower() == "pca":
        reducer = PCA(n_components=2)
        reduced_embeddings = reducer.fit_transform(combined_embeddings)

        # Report explained variance - critical for understanding information loss
        print(f"\nPCA Explained Variance:")
        print(f"  PC1: {reducer.explained_variance_ratio_[0]:.2%}")
        print(f"  PC2: {reducer.explained_variance_ratio_[1]:.2%}")
        print(f"  Total: {reducer.explained_variance_ratio_.sum():.2%}")

        axis_labels = {"x": "Principal Component 1", "y": "Principal Component 2"}
        title_method = "PCA"
    elif method.lower() == "tsne":
        print("\nApplying t-SNE (this may take a moment)...")
        reducer = TSNE(n_components=2, random_state=42, perplexity=30)
        reduced_embeddings = reducer.fit_transform(combined_embeddings)
        axis_labels = {"x": "t-SNE Dimension 1", "y": "t-SNE Dimension 2"}
        title_method = "t-SNE"
    else:
        raise ValueError(f"Unknown method: {method}. Use 'pca' or 'tsne'")

    # Create a DataFrame for Plotly
    df = pd.DataFrame(reduced_embeddings, columns=["dim1", "dim2"])
    df["Model"] = np.repeat(model_names, [emb.shape[0] for emb in embeddings_list])

    # Add quote text for hover if provided
    if quotes is not None:
        # Repeat quotes for each model
        all_quotes = quotes * len(embeddings_list)
        df["Quote"] = all_quotes
        hover_data = {"Quote": True, "Model": True, "dim1": ":.3f", "dim2": ":.3f"}
    else:
        hover_data = {"Model": True, "dim1": ":.3f", "dim2": ":.3f"}

    # Create scatter plot
    fig = px.scatter(
        df,
        x="dim1",
        y="dim2",
        color="Model",
        color_discrete_sequence=px.colors.qualitative.Set2,
        title=f"2D Visualization of Embeddings ({title_method}, Normalized)",
        labels={"dim1": axis_labels["x"], "dim2": axis_labels["y"]},
        hover_data=hover_data,
    )

    fig.update_traces(marker=dict(size=6, opacity=0.7))

    # Make title bold and centered, set Arial font, and increase resolution
    fig.update_layout(
        title={
            "text": f"<b>2D Visualization of Embeddings ({title_method}, Normalized)</b>",
            "x": 0.5,
            "xanchor": "center",
            "font": {"size": 18, "family": "Arial, sans-serif"},
        },
        font={"family": "Arial, sans-serif", "size": 12},
        width=1200,
        height=600,
    )

    # Show with high resolution
    fig.show(config={"toImageButtonOptions": {"format": "png", "scale": 3}})
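
The row labelling in this function relies on an ordering assumption: the stacked matrix holds all rows of the first model, then all rows of the second, and so on, and every model embedded the same quotes in the same order. A plain-Python sketch of that alignment, with toy names and counts standing in for the real inputs:

```python
from itertools import chain, repeat

model_names = ["model-a", "model-b"]  # toy names
quotes = ["quote one", "quote two", "quote three"]  # toy quotes
rows_per_model = [len(quotes), len(quotes)]  # each model embedded every quote

# The tiling below is only valid if every model contributed exactly len(quotes) rows.
if any(n != len(quotes) for n in rows_per_model):
    raise ValueError("Each model must contribute one row per quote")

# Equivalent of np.repeat(model_names, rows_per_model): one label per stacked row.
labels = list(
    chain.from_iterable(repeat(name, n) for name, n in zip(model_names, rows_per_model))
)

# Equivalent of `quotes * len(embeddings_list)`: the quote list tiled once per model.
tiled_quotes = quotes * len(model_names)

for label, quote in zip(labels, tiled_quotes):
    print(label, "->", quote)
```

If one model's embeddings were generated from a different or reordered quote list, the hover text would silently point at the wrong quotes, so an explicit length check like the one shown is a cheap safeguard.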

In [109]:
def visualize_embeddings_separately(
    embeddings_list: list[ndarray],
    model_names: list[str],
    quotes: list[str] = None,
    method: str = "pca",
    shared_axes: bool = True,
):
    """Visualize embeddings with separate dimensionality reduction per model.

    This approach applies PCA/t-SNE independently to each model's embeddings,
    showing the true structure of each embedding space without cross-contamination.
    Models are displayed side-by-side in subplots for comparison.

    Args:
        embeddings_list (list[ndarray]): A list of embeddings to visualize.
        model_names (list[str]): A list of model names corresponding to the embeddings.
        quotes (list[str], optional): Original quote texts for hover display.
        method (str): Dimensionality reduction method - "pca" or "tsne"
        shared_axes (bool): If True, all subplots use the same x/y axis ranges for direct comparison
    """
    n_models = len(embeddings_list)
    fig = make_subplots(
        rows=1, cols=n_models, subplot_titles=model_names, horizontal_spacing=0.1
    )

    # First pass: compute all reductions and find global ranges if needed
    all_reduced = []
    all_explained_vars = []

    for embeddings in embeddings_list:
        # Apply dimensionality reduction independently
        if method.lower() == "pca":
            reducer = PCA(n_components=2)
            reduced = reducer.fit_transform(embeddings)
            explained_var = reducer.explained_variance_ratio_.sum()
            all_explained_vars.append(explained_var)
        elif method.lower() == "tsne":
            reducer = TSNE(n_components=2, random_state=42)
            reduced = reducer.fit_transform(embeddings)
            all_explained_vars.append(None)
        else:
            raise ValueError(f"Unknown method: {method}")

        all_reduced.append(reduced)

    # Compute shared axis ranges if requested
    if shared_axes:
        all_x = np.concatenate([r[:, 0] for r in all_reduced])
        all_y = np.concatenate([r[:, 1] for r in all_reduced])
        x_min, x_max = all_x.min(), all_x.max()
        y_min, y_max = all_y.min(), all_y.max()
        # Add small padding (5%)
        x_padding = (x_max - x_min) * 0.05
        y_padding = (y_max - y_min) * 0.05
        x_range = [x_min - x_padding, x_max + x_padding]
        y_range = [y_min - y_padding, y_max + y_padding]

    # Second pass: create plots
    for i, (reduced, model_name, explained_var) in enumerate(
        zip(all_reduced, model_names, all_explained_vars)
    ):
        # Prepare subtitle with explained variance
        if explained_var is not None:
            subtitle_suffix = f"<br>(Explained var: {explained_var:.1%})"
        else:
            subtitle_suffix = ""

        # Prepare hover text
        if quotes is not None:
            hover_text = [
                f"Quote: {q[:100]}..." if len(q) > 100 else f"Quote: {q}"
                for q in quotes
            ]
        else:
            hover_text = None

        # Add scatter trace
        fig.add_trace(
            go.Scatter(
                x=reduced[:, 0],
                y=reduced[:, 1],
                mode="markers",
                marker=dict(size=5, opacity=0.6, color=px.colors.qualitative.Set2[i]),
                text=hover_text,
                hovertemplate="%{text}<br>x: %{x:.3f}<br>y: %{y:.3f}<extra></extra>",
                showlegend=False,
            ),
            row=1,
            col=i + 1,
        )

        # Update subplot title with explained variance and font
        fig.layout.annotations[i].update(
            text=model_name + subtitle_suffix,
            font=dict(family="Arial, sans-serif", size=14),
        )

        # Set axis ranges
        if shared_axes:
            fig.update_xaxes(range=x_range, row=1, col=i + 1)
            fig.update_yaxes(range=y_range, row=1, col=i + 1)

    method_name = "PCA" if method.lower() == "pca" else "t-SNE"
    axes_note = " (Shared Axes)" if shared_axes else " (Independent Axes)"

    # Make title bold and centered, set Arial font, and increase resolution
    fig.update_layout(
        title={
            "text": f"<b>Separate {method_name} per Model{axes_note}</b>",
            "x": 0.5,
            "xanchor": "center",
            "font": {"size": 18, "family": "Arial, sans-serif"},
        },
        font={"family": "Arial, sans-serif", "size": 12},
        width=1600,
        height=400,
        showlegend=False,
    )

    # Show with high resolution
    fig.show(config={"toImageButtonOptions": {"format": "png", "scale": 3}})
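
The shared-axis logic reduces to taking the global minimum and maximum over all models' projected coordinates and widening that interval by 5% of its span on each side. A plain-Python sketch, with a hypothetical helper name and toy coordinates:

```python
def padded_range(values_per_model, padding_fraction=0.05):
    """Global [min, max] across all models, widened by a fraction of the span."""
    all_values = [v for values in values_per_model for v in values]
    low, high = min(all_values), max(all_values)
    padding = (high - low) * padding_fraction
    return [low - padding, high + padding]


# Toy x-coordinates for two models after separate 2D projections.
x_by_model = [[-1.0, 0.5, 2.0], [-3.0, 0.0, 1.0]]
x_range = padded_range(x_by_model)
print(x_range)
```

Because each model's projection is fitted independently, shared ranges only make the spreads visually comparable. The axes themselves correspond to different directions in each panel.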

In [110]:
# Load the embeddings from all four models
embeddings_google = load_embeddings("embeddings/google-embedding-gemma-300m-512.npy")
embeddings_qwen = load_embeddings("embeddings/qwen-qwen3-embedding-0.6b-512.npy")
embeddings_ibm = load_embeddings(
    "embeddings/ibm-granite-embedding-125m-english-512.npy"
)
embeddings_tencent = load_embeddings(
    "embeddings/tencentbac-conan-embedding-v1-512.npy"
)

print(f"Loaded embeddings:")
print(f"  Google: {embeddings_google.shape}")
print(f"  Qwen: {embeddings_qwen.shape}")
print(f"  IBM: {embeddings_ibm.shape}")
print(f"  Tencent: {embeddings_tencent.shape}")

Loaded embeddings:
  Google: (200, 512)
  Qwen: (200, 512)
  IBM: (200, 512)
  Tencent: (200, 512)


In [111]:
# Example 1: Combined visualization with normalization (PCA)
print("=" * 80)
print("METHOD 1: Combined PCA with Normalization")
print("=" * 80)

visualize_multiple_embeddings_improved(
    [embeddings_google, embeddings_qwen, embeddings_ibm, embeddings_tencent],
    ["Google EmbeddingGemma", "Qwen3 Embedding", "IBM Granite", "Tencent Conan"],
    quotes=quotes,
    method="pca",
)

METHOD 1: Combined PCA with Normalization
Normalizing embeddings for each model separately...
  Google EmbeddingGemma: normalized 200 embeddings
  Qwen3 Embedding: normalized 200 embeddings
  IBM Granite: normalized 200 embeddings
  Tencent Conan: normalized 200 embeddings

Combined shape: (800, 512)

PCA Explained Variance:
  PC1: 2.25%
  PC2: 1.89%
  Total: 4.13%
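
To put these numbers in perspective, a short calculation using only the values printed in this output: the two plotted components retain about 4% of the variance of the combined, normalized data, so roughly 96% is not visible in the 2D scatter plot.

```python
# Explained variance ratios as printed in the notebook output.
pc1 = 0.0225
pc2 = 0.0189
reported_total = 0.0413

# The sum of the rounded values can differ from the reported total in the last digit,
# because the notebook sums the unrounded ratios before formatting.
sum_of_rounded = pc1 + pc2
discarded = 1.0 - reported_total

print(f"Sum of rounded components: {sum_of_rounded:.2%}")
print(f"Variance not shown in the 2D plot: {discarded:.2%}")
```

With so little variance captured, apparent overlap or separation between models in the combined PCA plot should be read cautiously.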


In [112]:
# Example 2: Separate PCA per model (shows true structure of each space)
print("=" * 80)
print("METHOD 2: Separate PCA per Model")
print("=" * 80)
print("Each model gets its own PCA transformation - no cross-contamination\n")

visualize_embeddings_separately(
    [embeddings_google, embeddings_qwen, embeddings_ibm, embeddings_tencent],
    ["Google EmbeddingGemma", "Qwen3 Embedding", "IBM Granite", "Tencent Conan"],
    quotes=quotes,
    method="pca",
    shared_axes=False,
)

METHOD 2: Separate PCA per Model
Each model gets its own PCA transformation - no cross-contamination



In [113]:
# Example 3: Combined t-SNE with normalization (better for local structure)
print("=" * 80)
print("METHOD 3: Combined t-SNE with Normalization")
print("=" * 80)
print("t-SNE preserves local structure better than PCA\n")

visualize_multiple_embeddings_improved(
    [embeddings_google, embeddings_qwen, embeddings_ibm, embeddings_tencent],
    ["Google EmbeddingGemma", "Qwen3 Embedding", "IBM Granite", "Tencent Conan"],
    quotes=quotes,
    method="tsne",
)

METHOD 3: Combined t-SNE with Normalization
t-SNE preserves local structure better than PCA

Normalizing embeddings for each model separately...
  Google EmbeddingGemma: normalized 200 embeddings
  Qwen3 Embedding: normalized 200 embeddings
  IBM Granite: normalized 200 embeddings
  Tencent Conan: normalized 200 embeddings

Combined shape: (800, 512)

Applying t-SNE (this may take a moment)...
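
t-SNE's `perplexity` roughly sets how many neighbours each point attends to, and recent scikit-learn versions are understood to reject a value that is not smaller than the number of samples. The 800 combined rows here are well above the chosen value of 30, but a much smaller quote set would need a guard. A sketch with a hypothetical helper, `safe_perplexity`:

```python
def safe_perplexity(n_samples: int, preferred: float = 30.0) -> float:
    """Return a perplexity strictly below n_samples, preferring the given value."""
    if n_samples < 2:
        raise ValueError("t-SNE needs at least two samples")
    return min(preferred, float(n_samples - 1))


print(safe_perplexity(800))  # 30.0
print(safe_perplexity(20))  # 19.0
```

The result could be passed as the `perplexity` argument when constructing `TSNE`. Keep in mind that changing it alters the layout, so plots made with different values are not directly comparable.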


## Save Visualizations as High-Resolution PNGs

Now let's save all three visualizations as PNG files for external use.

In [85]:
def save_visualization_as_png(
    embeddings_list: list[ndarray],
    model_names: list[str],
    quotes: list[str],
    method: str,
    filename: str,
    viz_type: str = "combined",
    shared_axes: bool = True,
):
    """Save a visualization as a high-resolution PNG file.

    Args:
        embeddings_list (list[ndarray]): A list of embeddings to visualize.
        model_names (list[str]): A list of model names corresponding to the embeddings.
        quotes (list[str]): Original quote texts.
        method (str): Dimensionality reduction method - "pca" or "tsne"
        filename (str): Output filename (without extension)
        viz_type (str): "combined" or "separate"
        shared_axes (bool): If True and viz_type="separate", use shared axes
    """

    if viz_type == "combined":
        # Normalize embeddings
        normalized_embeddings = []
        for emb in embeddings_list:
            normalized_embeddings.append(normalize_embeddings(emb))

        combined_embeddings = np.vstack(normalized_embeddings)

        # Apply dimensionality reduction
        if method.lower() == "pca":
            reducer = PCA(n_components=2)
            reduced_embeddings = reducer.fit_transform(combined_embeddings)
            axis_labels = {"x": "Principal Component 1", "y": "Principal Component 2"}
            title_method = "PCA"
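        elif method.lower() == "tsne":
            reducer = TSNE(n_components=2, random_state=42, perplexity=30)
            reduced_embeddings = reducer.fit_transform(combined_embeddings)
            axis_labels = {"x": "t-SNE Dimension 1", "y": "t-SNE Dimension 2"}
            title_method = "t-SNE"
        else:
            # Fail early instead of hitting an undefined variable further down
            raise ValueError(f"Unknown method: {method}. Use 'pca' or 'tsne'")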

        df = pd.DataFrame(reduced_embeddings, columns=["dim1", "dim2"])
        df["Model"] = np.repeat(model_names, [emb.shape[0] for emb in embeddings_list])
        all_quotes = quotes * len(embeddings_list)
        df["Quote"] = all_quotes

        fig = px.scatter(
            df,
            x="dim1",
            y="dim2",
            color="Model",
            color_discrete_sequence=px.colors.qualitative.Vivid,
            labels={"dim1": axis_labels["x"], "dim2": axis_labels["y"]},
            hover_data={"Quote": True, "Model": True, "dim1": ":.3f", "dim2": ":.3f"},
        )

        fig.update_traces(marker=dict(size=6, opacity=0.7))
        fig.update_layout(
            title={
                "text": f"<b>2D Visualization of Embeddings ({title_method}, Normalized)</b>",
                "x": 0.5,
                "xanchor": "center",
                "font": {"size": 18, "family": "Arial, sans-serif"},
            },
            font={"family": "Arial, sans-serif", "size": 12},
            width=1200,
            height=600,
        )

    else:  # separate
        n_models = len(embeddings_list)
        fig = make_subplots(
            rows=1, cols=n_models, subplot_titles=model_names, horizontal_spacing=0.1
        )

        # Compute all reductions
        all_reduced = []
        all_explained_vars = []

        for embeddings in embeddings_list:
            if method.lower() == "pca":
                reducer = PCA(n_components=2)
                reduced = reducer.fit_transform(embeddings)
                explained_var = reducer.explained_variance_ratio_.sum()
                all_explained_vars.append(explained_var)
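            elif method.lower() == "tsne":
                reducer = TSNE(n_components=2, random_state=42)
                reduced = reducer.fit_transform(embeddings)
                all_explained_vars.append(None)
            else:
                # Fail early instead of reusing a stale or undefined `reduced`
                raise ValueError(f"Unknown method: {method}. Use 'pca' or 'tsne'")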

            all_reduced.append(reduced)

        # Compute shared axis ranges if requested
        if shared_axes:
            all_x = np.concatenate([r[:, 0] for r in all_reduced])
            all_y = np.concatenate([r[:, 1] for r in all_reduced])
            x_min, x_max = all_x.min(), all_x.max()
            y_min, y_max = all_y.min(), all_y.max()
            x_padding = (x_max - x_min) * 0.05
            y_padding = (y_max - y_min) * 0.05
            x_range = [x_min - x_padding, x_max + x_padding]
            y_range = [y_min - y_padding, y_max + y_padding]

        # Build hover text once; every model embeds the same quotes
        hover_text = [
            f"Quote: {q[:100]}..." if len(q) > 100 else f"Quote: {q}"
            for q in quotes
        ]

        # Create plots
        for i, (reduced, model_name, explained_var) in enumerate(
            zip(all_reduced, model_names, all_explained_vars)
        ):
            if explained_var is not None:
                subtitle_suffix = f"<br>(Explained var: {explained_var:.1%})"
            else:
                subtitle_suffix = ""

            fig.add_trace(
                go.Scatter(
                    x=reduced[:, 0],
                    y=reduced[:, 1],
                    mode="markers",
                    marker=dict(size=5, opacity=0.6),
                    text=hover_text,
                    hovertemplate="%{text}<br>x: %{x:.3f}<br>y: %{y:.3f}<extra></extra>",
                    showlegend=False,
                ),
                row=1,
                col=i + 1,
            )

            fig.layout.annotations[i].update(
                text=model_name + subtitle_suffix,
                font=dict(family="Arial, sans-serif", size=14),
            )

            if shared_axes:
                fig.update_xaxes(range=x_range, row=1, col=i + 1)
                fig.update_yaxes(range=y_range, row=1, col=i + 1)

        method_name = "PCA" if method.lower() == "pca" else "t-SNE"
        axes_note = " (Shared Axes)" if shared_axes else " (Independent Axes)"

        fig.update_layout(
            title={
                "text": f"<b>Separate {method_name} per Model{axes_note}</b>",
                "x": 0.5,
                "xanchor": "center",
                "font": {"size": 18, "family": "Arial, sans-serif"},
            },
            font={"family": "Arial, sans-serif", "size": 12},
            width=1200,
            height=425,
            showlegend=False,
        )

    # Save as a high-resolution PNG (scale=3 triples the pixel dimensions).
    # Note: fig.write_image requires the kaleido package for static image export.
    os.makedirs("visualizations", exist_ok=True)
    output_path = f"visualizations/{filename}.png"
    fig.write_image(output_path, scale=3)
    print(f"Saved: {output_path}")

    return fig
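
The shared-axis computation in the function above pads the pooled minimum and maximum of all projections by 5% of their span, so every subplot uses the same limits. A minimal standalone sketch of that step follows; the helper names `padded_range` and `shared_ranges` and the toy points are hypothetical and are not part of the notebook, and the inputs are assumed to be sequences of (x, y) pairs.

```python
from typing import Iterable, List, Sequence, Tuple

Point = Tuple[float, float]


def padded_range(values: Iterable[float], padding: float = 0.05) -> List[float]:
    """Return [min - pad, max + pad], where pad is a fraction of the span."""
    values = [float(v) for v in values]
    lo, hi = min(values), max(values)
    pad = (hi - lo) * padding
    return [lo - pad, hi + pad]


def shared_ranges(
    point_sets: Sequence[Sequence[Point]], padding: float = 0.05
) -> Tuple[List[float], List[float]]:
    """Pool the points of every model and compute one x range and one y range."""
    xs = [x for points in point_sets for x, _ in points]
    ys = [y for points in point_sets for _, y in points]
    return padded_range(xs, padding), padded_range(ys, padding)


# Toy 2D projections for two hypothetical models
model_a = [(0.0, 0.0), (1.0, 2.0)]
model_b = [(-1.0, 1.0), (3.0, 4.0)]
x_range, y_range = shared_ranges([model_a, model_b])
print("x range:", x_range)
print("y range:", y_range)
```

Shared limits make the spread of each model's projection visually comparable, but because each PCA is fitted separately, the axes themselves still mean different things in each subplot.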

In [86]:
# Save all three visualizations as high-resolution PNGs
print("Saving visualizations to the 'visualizations' folder...\n")

# 1. Combined PCA with normalization
print("1. Combined PCA (Normalized)...")
save_visualization_as_png(
    [embeddings_google, embeddings_qwen, embeddings_ibm, embeddings_tencent],
    ["Google EmbeddingGemma", "Qwen3 Embedding", "IBM Granite", "Tencent Conan"],
    quotes,
    method="pca",
    filename="combined_pca_normalized",
    viz_type="combined",
)

# 2. Separate PCA per model (with shared axes)
print("\n2. Separate PCA per Model (Shared Axes)...")
save_visualization_as_png(
    [embeddings_google, embeddings_qwen, embeddings_ibm, embeddings_tencent],
    ["Google EmbeddingGemma", "Qwen3 Embedding", "IBM Granite", "Tencent Conan"],
    quotes,
    method="pca",
    filename="separate_pca_shared_axes",
    viz_type="separate",
    shared_axes=True,
)

# 3. Combined t-SNE with normalization
print("\n3. Combined t-SNE (Normalized)...")
save_visualization_as_png(
    [embeddings_google, embeddings_qwen, embeddings_ibm, embeddings_tencent],
    ["Google EmbeddingGemma", "Qwen3 Embedding", "IBM Granite", "Tencent Conan"],
    quotes,
    method="tsne",
    filename="combined_tsne_normalized",
    viz_type="combined",
)

print("\n" + "=" * 80)
print("All visualizations saved successfully!")
print("=" * 80)
print("\nOutput files:")
print("  - visualizations/combined_pca_normalized.png")
print("  - visualizations/separate_pca_shared_axes.png")
print("  - visualizations/combined_tsne_normalized.png")
print("\nAll images are saved at 3× resolution for high quality.")

Saving visualizations to the 'visualizations' folder...

1. Combined PCA (Normalized)...
Saved: visualizations/combined_pca_normalized.png

2. Separate PCA per Model (Shared Axes)...
Saved: visualizations/separate_pca_shared_axes.png

3. Combined t-SNE (Normalized)...
Saved: visualizations/combined_tsne_normalized.png

All visualizations saved successfully!

Output files:
  - visualizations/combined_pca_normalized.png
  - visualizations/separate_pca_shared_axes.png
  - visualizations/combined_tsne_normalized.png

All images are saved at 3× resolution for high quality.
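
As an optional sanity check, the saved files can be verified on disk. The sketch below is an assumption-laden example, not part of the original notebook: the helper `find_invalid_pngs` is hypothetical, and it only checks that each expected file exists and starts with the standard 8-byte PNG signature, not that the image content is correct.

```python
from pathlib import Path
from typing import Iterable, List

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def find_invalid_pngs(directory: str, stems: Iterable[str]) -> List[str]:
    """Return expected file names that are missing or lack a PNG signature."""
    problems = []
    for stem in stems:
        path = Path(directory) / f"{stem}.png"
        if not path.is_file():
            problems.append(path.name)
            continue
        with path.open("rb") as handle:
            if handle.read(len(PNG_SIGNATURE)) != PNG_SIGNATURE:
                problems.append(path.name)
    return problems


# File stems used in the save cell above
expected = [
    "combined_pca_normalized",
    "separate_pca_shared_axes",
    "combined_tsne_normalized",
]
problems = find_invalid_pngs("visualizations", expected)
print(problems if problems else "All PNG files look valid.")
```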
