# Why we cannot rely on text deduplication alone

While working on a custom **topic modeling repository**, I designed a complete pipeline, from text preprocessing to clustering, including a custom meta-clustering method and final **automatic annotation using LLMs**. This pipeline is meant to be robust, modular, and efficient. It is a unique piece of work I hope to release soon, inshallah. I'll share the full technical details at the right time.

In the preprocessing stage, I made sure to remove all **duplicated texts**. However, when running `cuml.UMAP` (from RAPIDS.ai, which speeds up dimensionality reduction with GPU support), I encountered the following error:

> `RuntimeError: At least one row does not have any neighbor with non-zero distance.`

This means that some rows (i.e., embeddings) are **identical**, resulting in **zero-distance neighbors**, which breaks UMAP's internal graph construction.

But how is that possible if no duplicate texts are present?

After investigating the embeddings, the **surprise** was clear: some **different texts were mapped to exactly the same vector**.
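
One way to catch this before calling UMAP is to look for duplicates in the embedding matrix itself rather than in the raw texts. The following is a minimal sketch, assuming the embeddings are available as a 2-D NumPy array; the helper name `find_identical_embeddings` and the toy data are hypothetical and not part of the pipeline described here.

```python
import numpy as np

def find_identical_embeddings(embeddings: np.ndarray) -> list[list[int]]:
    """Return groups of row indices whose vectors are exactly equal."""
    # With axis=0, np.unique treats each row as one item; `inverse` maps
    # every original row to the index of its unique representative.
    _, inverse = np.unique(embeddings, axis=0, return_inverse=True)
    inverse = np.asarray(inverse).ravel()
    groups: dict[int, list[int]] = {}
    for row_idx, unique_idx in enumerate(inverse):
        groups.setdefault(int(unique_idx), []).append(row_idx)
    # Keep only groups with more than one member (actual collisions).
    return [idxs for idxs in groups.values() if len(idxs) > 1]

# Toy data: rows 0 and 2 collide, as if two different texts shared one vector.
toy = np.array([[0.1, 0.2], [0.3, 0.4], [0.1, 0.2], [0.5, 0.6]])
collisions = find_identical_embeddings(toy)
print(collisions)
```

Keeping one representative per group before dimensionality reduction, then copying its result back to the other members, is one possible way to avoid zero-distance neighbors.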

This raised a critical question:  
**Are popular embedding models robust enough to handle fine-grained textual differences, especially when used in downstream tasks like UMAP, clustering, or sentiment analysis?**

That's what this analysis aims to explore, by testing:
- `all-MiniLM-L6-v2`
- `jina-embeddings-v2-base-en`
- `text-embedding-3-large` (OpenAI)

We'll dive into how these models react to surface-level variations and contextual shifts, and how these behaviors can directly affect tasks such as **semantic clustering** and **sentiment analysis**.


# Embedding Robustness: Why Basic Text Deduplication Isn't Enough

When working with NLP, the **first critical step** is transforming raw text into numerical representations: **embeddings**. Unlike traditional methods like Bag-of-Words or TF-IDF, modern embedding models capture **semantic meaning** in dense vector form.

However, not all embedding models behave equally in terms of **robustness to minor variations**. To evaluate this, we ran a small test comparing three widely used models:

- `all-MiniLM-L6-v2` (Sentence Transformers)
- `jina-embeddings-v2-base-en` (Jina AI)
- `text-embedding-3-large` (OpenAI's latest commercial model)

We tested each model on a set of **15 English sentences** featuring **light surface variations**, such as:

- Punctuation differences (e.g., `"."` vs `"!"`)
- Case changes (e.g., `"hello"` vs `"HELLO"`)
- Repetitions of nearly identical or identical texts

## ➤ Text Sample Used:

```text
1. "The weather is nice today."
2. "The weather is nice today!"
3. "The weather is nice today."
4. "The Weather is nice today."
5. "It is hot in Paris this summer."
6. "It is hot in Paris this summer"
7. "It is HOT in Paris this summer."
8. "Hello world."
9. "Hello world!"
10. "HELLO world."
11. "Machine learning is fascinating."
12. "Machine learning is fascinating!"
13. "Artificial intelligence is fascinating."
14. "Football is a popular sport."
15. "Switching to English is sometimes useful."
```

```python
import os

import numpy as np
import plotly.graph_objects as go
import torch
from dotenv import load_dotenv
from openai import OpenAI
from plotly.subplots import make_subplots
from sentence_transformers import SentenceTransformer

# ------------------ Setup ------------------
load_dotenv()  # loads OPENAI_API_KEY from a local .env file, if present
api_key = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=api_key)
device = "cuda" if torch.cuda.is_available() else "cpu"

# ------------------ Helper functions ------------------

def compute_similarity_and_hover(model: SentenceTransformer, texts: list):
    """
    Encode and normalise the texts, then compute the cosine similarity
    matrix and a rich hover text for every cell.
    """
    embeddings = model.encode(texts, normalize_embeddings=True)
    # Clip to [-1, 1] to absorb floating-point overshoot.
    sim_matrix = np.clip(np.matmul(embeddings, embeddings.T), -1, 1)
    hover = [
        [
            f"<b>Index X:</b> {j}<br><b>Text X:</b> {texts[j]}<br>"
            f"<b>Index Y:</b> {i}<br><b>Text Y:</b> {texts[i]}<br>"
            f"<b>Cosine Similarity:</b> {sim_matrix[i, j]:.3f}"
            for j in range(len(texts))
        ]
        for i in range(len(texts))
    ]
    return sim_matrix, hover

def get_openai_embeddings(texts: list):
    """
    Fetch OpenAI embeddings and L2-normalise them.
    """
    response = client.embeddings.create(model="text-embedding-3-large", input=texts)
    vecs = np.vstack([np.array(d.embedding) for d in response.data])
    # Normalise explicitly so that dot products are cosine similarities.
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
```



```python
# ------------------------------------------------------------------
# Texts to compare
# ------------------------------------------------------------------
texts = [
    "The weather is nice today.",
    "The weather is nice today!",
    "The weather is nice today.",
    "The Weather is nice today.",
    "It is hot in Paris this summer.",
    "It is hot in Paris this summer",
    "It is HOT in Paris this summer.",
    "Hello world.",
    "Hello world!",
    "HELLO world.",
    "Machine learning is fascinating.",
    "Machine learning is fascinating!",
    "Artificial intelligence is fascinating.",
    "Football is a popular sport.",
    "Switching to English is sometimes useful."
]

# Load both local models on the selected device.
model_mini = SentenceTransformer("all-MiniLM-L6-v2", device=device)
model_jina = SentenceTransformer("jinaai/jina-embeddings-v2-base-en", trust_remote_code=True, device=device)

sim_mini, hover_mini = compute_similarity_and_hover(model_mini, texts)
sim_jina, hover_jina = compute_similarity_and_hover(model_jina, texts)
```


```python
# Side-by-side heatmaps for the two local models.
fig_comparison = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("all-MiniLM-L6-v2", "jina-embeddings-v2-base-en"),
    horizontal_spacing=0.08
)

fig_comparison.add_trace(
    go.Heatmap(
        z=sim_mini,
        x=list(range(len(texts))),
        y=list(range(len(texts))),
        text=hover_mini,
        hoverinfo="text",
        colorscale="Viridis",
        zmin=0,
        zmax=1,
        colorbar=dict(title="Cosine<br>Similarity"),
    ),
    row=1,
    col=1,
)

fig_comparison.add_trace(
    go.Heatmap(
        z=sim_jina,
        x=list(range(len(texts))),
        y=list(range(len(texts))),
        text=hover_jina,
        hoverinfo="text",
        colorscale="Viridis",
        zmin=0,
        zmax=1,
        showscale=False,
    ),
    row=1,
    col=2,
)

fig_comparison.update_layout(
    title="Cosine Similarity Matrix – all-MiniLM-L6-v2 vs jina-embeddings-v2-base-en",
    width=1300,
    height=650,
)
fig_comparison.update_yaxes(autorange="reversed", row=1, col=1)
fig_comparison.update_yaxes(autorange="reversed", row=1, col=2)
fig_comparison.show()

# ------------------------------------------------------------------
# OpenAI embeddings and heatmap
# ------------------------------------------------------------------
emb_openai = get_openai_embeddings(texts)
# Clip to [-1, 1], as done for the local models.
sim_openai = np.clip(np.matmul(emb_openai, emb_openai.T), -1, 1)

hover_openai = [
    [
        f"<b>Index X:</b> {j}<br><b>Text X:</b> {texts[j]}<br>"
        f"<b>Index Y:</b> {i}<br><b>Text Y:</b> {texts[i]}<br>"
        f"<b>Cosine Similarity:</b> {sim_openai[i, j]:.3f}"
        for j in range(len(texts))
    ]
    for i in range(len(texts))
]

fig_openai = go.Figure(
    data=go.Heatmap(
        z=sim_openai,
        x=list(range(len(texts))),
        y=list(range(len(texts))),
        text=hover_openai,
        hoverinfo="text",
        colorscale="Viridis",
        zmin=0,
        zmax=1,
        colorbar=dict(title="Cosine<br>Similarity"),
    )
)
fig_openai.update_layout(
    title="Cosine Similarity – OpenAI text-embedding-3-large",
    width=600,
    height=600,
)
fig_openai.update_yaxes(autorange="reversed")
fig_openai.show()
```

# Comparative Analysis of Embedding Models: Jina vs MiniLM vs OpenAI

We conducted a small-scale evaluation to compare how three different embedding models handle **minor text variations**:

- `jina-embeddings-v2-base-en`
- `all-MiniLM-L6-v2` (from Sentence Transformers)
- `text-embedding-3-large` (OpenAI)

## 1. Jina Embeddings: Not Sensitive and Inconsistent

- A deeper issue arises when comparing **unrelated texts**: almost every pairwise similarity appears to fall between 0.6 and 1.
  - For example, `"It is HOT in Paris this summer."` vs `"Hello world."` → similarity ≈ **0.68**!
- In contrast, **case changes** (e.g., `"Hello"` vs `"HELLO"`) had **no effect**: these variations produced **identical embeddings**.
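
To turn this impression into a number, the off-diagonal entries of a similarity matrix can be summarised directly. This is a small sketch, assuming `sim` is a square cosine-similarity matrix such as `sim_jina`; `offdiag_summary` is a hypothetical helper and the toy matrix is made up for illustration.

```python
import numpy as np

def offdiag_summary(sim: np.ndarray) -> dict:
    """Min, mean and max of the pairwise similarities, ignoring the diagonal."""
    mask = ~np.eye(sim.shape[0], dtype=bool)  # True everywhere except i == j
    values = sim[mask]
    return {
        "min": float(values.min()),
        "mean": float(values.mean()),
        "max": float(values.max()),
    }

# Made-up 3x3 similarity matrix, only to show the call.
toy_sim = np.array([
    [1.0, 0.9, 0.6],
    [0.9, 1.0, 0.7],
    [0.6, 0.7, 1.0],
])
summary = offdiag_summary(toy_sim)
print(summary)
```

Running the same helper on each model's matrix would make the ranges comparable across models.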

## 2. all-MiniLM-L6-v2: Better Semantic Consistency

- This model produced **more stable embeddings** across punctuation and casing.
- Example:
  - `"Hello world."` vs `"Hello world!"` → similarity remained **high and logical**
- It also yielded **low similarities** between clearly unrelated texts:
  - `"It is HOT in Paris this summer."` vs `"Artificial intelligence is fascinating."` → similarity ≈ **0.0**
- A slightly unexpected similarity (~0.15) was found between `"Hello world."` and `"Artificial intelligence is fascinating."`, which might be explained by **shared associations with programming or computer science**, a **plausible semantic overlap**.

## 3. OpenAI `text-embedding-3-large`: Most Coherent, but Slight Variance Remains

- OpenAI's model showed **robust handling** of both punctuation and case changes.
- It returned **high but logically decreasing similarities** for variants of the same sentence:
  - `"Machine learning is fascinating."` vs `"Machine learning is fascinating!"` → **0.95**
  - `"Hello world."` vs `"Hello world!"` → **0.882**
- Even though the textual difference is just `"!"` in both cases, the **similarity values differ**, suggesting some **context-sensitive weighting** in the embedding process.
- Overall, this model provided the **most balanced behavior** across all cases tested.


## Key Takeaways

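- **Jina-v2-base-en** embeddings turned out to be **too unreliable** for tasks requiring robustness in deduplication or semantic similarity: case variants collapse onto identical vectors, while unrelated texts still score as fairly similar.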
  
- **MiniLM** represents a **strong and efficient compromise**: the model is lightweight, yet delivers semantically consistent results. Given its small size, the quality of its embeddings was **unexpectedly impressive**.

- **OpenAI's `text-embedding-3-large`** remains the most **robust and production-ready** solution among the three. It handles variations intelligently, although it may still introduce some **inconsistent similarity scores** across seemingly equivalent transformations, a nuance worth monitoring.




# Can This Impact Sentiment Analysis?

After the previous experiments, a natural question arose:  
**Could embedding sensitivity (or lack thereof) affect sentiment analysis outcomes?**

To explore this, we designed a simple yet illustrative test using two semantically distinct variants of the same sentence:

- `"Give me coffee please ☕😊"`: polite, friendly, and calm tone.
- `"GIVE ME COFFEE PLEASE 😠💢"`: aggressive and angry tone, using uppercase and angry emojis.

Although the textual content is nearly identical, the **intent and sentiment** are completely opposite.


```python
texts = [
    "Give me coffee please ☕😊",
    "GIVE ME COFFEE PLEASE 😠💢"
]


# Cosine similarity (a plain dot product works because the vectors are normalised)
def cosine_similarity(a, b):
    return float(np.dot(a, b))

# Collect all results
results = []

# 1. MiniLM
mini_emb = model_mini.encode(texts, normalize_embeddings=True)
sim_mini = cosine_similarity(mini_emb[0], mini_emb[1])
results.append(("all-MiniLM-L6-v2", sim_mini))

# 2. Jina v2
jina_emb = model_jina.encode(texts, normalize_embeddings=True)
sim_jina = cosine_similarity(jina_emb[0], jina_emb[1])
results.append(("jina-embeddings-v2", sim_jina))

# 3. OpenAI
openai_emb = get_openai_embeddings(texts)
# Re-normalise here so the dot product is a true cosine similarity.
openai_emb = openai_emb / np.linalg.norm(openai_emb, axis=1, keepdims=True)
sim_openai = cosine_similarity(openai_emb[0], openai_emb[1])
results.append(("text-embedding-3-large (OpenAI)", sim_openai))

# Display
for name, sim in results:
    print(f"{name:35s} → Cosine Similarity: {sim:.6f}")
```


```text
all-MiniLM-L6-v2                    → Cosine Similarity: 1.000000
jina-embeddings-v2                  → Cosine Similarity: 1.000000
text-embedding-3-large (OpenAI)     → Cosine Similarity: 0.742812
```


## Observations on Sentiment Encoding

We compared the embeddings of two emotionally distinct messages:

- `"Give me coffee please ☕😊"`
- `"GIVE ME COFFEE PLEASE 😠💢"`

The cosine similarities between the two embeddings are:

| Model                              | Cosine Similarity |
|-----------------------------------|-------------------|
| all-MiniLM-L6-v2                  | **1.000000**      |
| jina-embeddings-v2-base-en        | **1.000000**      |
| text-embedding-3-large (OpenAI)   | **0.742812**      |

This means that for **MiniLM** and this version of **Jina-v2-base-en**, the two texts are **interpreted as semantically identical**, despite conveying opposite emotional tones.

**Implication:**  
The upstream embedding model may completely **neutralize emotional contrast**, leading to misleading predictions.

This highlights a **critical limitation**:  
> Embedding models not designed with affective signals in mind (e.g., case, punctuation, emojis) can **erase sentiment cues** altogether.
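
One plausible mechanism, not verified here for either model, is that an uncased tokenizer lowercases the input and maps symbols missing from its vocabulary to a single unknown token, so both sentences reach the network as the same token sequence. The toy normaliser below only illustrates that idea; `toy_tokenize` and its tiny vocabulary are made up and do not reproduce any real tokenizer.

```python
def toy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Lowercase, split on whitespace, and replace out-of-vocabulary tokens with [UNK]."""
    return [tok if tok in vocab else "[UNK]" for tok in text.lower().split()]

# Hypothetical vocabulary that knows the words but none of the emojis.
vocab = {"give", "me", "coffee", "please"}

calm = toy_tokenize("Give me coffee please ☕😊", vocab)
angry = toy_tokenize("GIVE ME COFFEE PLEASE 😠💢", vocab)

print(calm)
print(calm == angry)
```

Comparing the token ids produced by each model's own tokenizer for the two strings would confirm or rule out this explanation.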

**Takeaway:**  
For sentiment analysis, or any task involving emotional nuance, choosing a robust embedding model is **non-negotiable**.

*And indeed, as the saying goes: "A picture is worth a thousand words."*  

*Note:* We also tested with additional punctuation/emojis. While the results varied slightly, **Jina-v2** and **MiniLM** still failed to capture the emotion.


# Final Thoughts on Embedding Model Selection

There is **no universal embedding model** that works best for all tasks.

Just because a model performs well on a benchmark or in one specific use case doesn't mean it will generalize across all contexts. Our experiments clearly show that even widely used models can fail under seemingly simple variations (punctuation, casing, emojis...).

From a practical and scientific standpoint, **the choice of embedding model should be treated as a hyperparameter**, one that must be tuned and validated based on the specific downstream task.
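
As a rough illustration of that idea, candidate embedders can be scored on a small labelled validation split and the best one kept. The sketch below uses a nearest-centroid classifier purely as a cheap stand-in for the real downstream task; `select_embedder`, the two toy embedders and the toy data are hypothetical and were not part of the experiments above.

```python
import numpy as np

def nearest_centroid_accuracy(train_x, train_y, val_x, val_y):
    """Accuracy of a nearest-centroid classifier, used as a cheap downstream proxy."""
    labels = sorted(set(train_y))
    train_y = np.asarray(train_y)
    centroids = np.stack([train_x[train_y == lab].mean(axis=0) for lab in labels])
    # Assign each validation vector to the closest centroid (Euclidean distance).
    dists = np.linalg.norm(val_x[:, None, :] - centroids[None, :, :], axis=2)
    preds = [labels[i] for i in dists.argmin(axis=1)]
    return float(np.mean([p == y for p, y in zip(preds, val_y)]))

def select_embedder(candidates, train_texts, train_y, val_texts, val_y):
    """Score every candidate embedder on the validation split; return best name and all scores."""
    scores = {
        name: nearest_centroid_accuracy(
            np.asarray(embed(train_texts)), train_y, np.asarray(embed(val_texts)), val_y
        )
        for name, embed in candidates.items()
    }
    return max(scores, key=scores.get), scores

# Toy embedders: one ignores casing, the other encodes the share of uppercase characters.
def case_blind(batch):
    return [[float(len(t.lower()))] for t in batch]

def case_aware(batch):
    return [[sum(c.isupper() for c in t) / max(len(t), 1)] for t in batch]

train_texts = ["thanks a lot", "THANKS A LOT", "see you soon", "SEE YOU SOON"]
train_y = ["calm", "angry", "calm", "angry"]
val_texts = ["good morning", "GOOD MORNING"]
val_y = ["calm", "angry"]

best, scores = select_embedder(
    {"case_blind": case_blind, "case_aware": case_aware},
    train_texts, train_y, val_texts, val_y,
)
print(best, scores)
```

In practice the proxy classifier would be replaced by the actual task metric, and each candidate would wrap a real model's encoding call.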

> Blindly trusting benchmark scores or community popularity can lead to underperforming pipelines.

Instead:
- Always **test embedding robustness** for your use case (sentiment analysis, classification, retrieval, etc.).
- Evaluate whether the model captures the **semantic and stylistic nuances** required by your task.
- Stay aware that performance may vary **across text types** â€” what works for tweets might fail on scientific abstracts, and vice versa.
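
A robustness check of this kind can be scripted in a few lines. The sketch below assumes an `embed` callable that turns a list of strings into an array of L2-normalised vectors (for example a thin wrapper around `model.encode(..., normalize_embeddings=True)`); `surface_variants`, `robustness_report` and the toy embedder are hypothetical names used only for illustration.

```python
import numpy as np

def surface_variants(text: str) -> list[str]:
    """A few light perturbations: casing and final punctuation."""
    stripped = text.rstrip(".!?")
    return [text.upper(), text.lower(), stripped, stripped + "!"]

def robustness_report(texts, embed):
    """For each text, cosine similarity between it and each of its variants."""
    report = {}
    for text in texts:
        variants = surface_variants(text)
        vecs = np.asarray(embed([text] + variants))
        sims = vecs[1:] @ vecs[0]  # dot product equals cosine for unit vectors
        report[text] = dict(zip(variants, (float(s) for s in sims)))
    return report

def toy_embed(batch):
    """Stand-in embedder: case-insensitive character counts, L2-normalised."""
    vecs = np.zeros((len(batch), 128))
    for i, s in enumerate(batch):
        for ch in s.lower():
            vecs[i, ord(ch) % 128] += 1.0
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

report = robustness_report(["Hello world."], toy_embed)
for variant, sim in report["Hello world."].items():
    print(f"{variant!r}: {sim:.3f}")
```

A similarity of exactly 1.0 for a variant that should matter (or a large drop for one that should not) is the signal to look for; which variants matter depends on the task.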

Finally, note that this analysis is based on a small set of crafted sentences. It **does not claim universal conclusions**, and **certain models may be specialized** (e.g., trained on formal language, code, or web content). More comprehensive testing is required for production-grade deployment.
