In [5]:
import sys
from pathlib import Path

# Automatically add the project root directory to sys.path
root_dir = Path().resolve().parent  # go up to the project root
if str(root_dir) not in sys.path:
    sys.path.insert(0, str(root_dir))


# Standard imports
from sentence_transformers import SentenceTransformer
from utils.helper_functions import clean_text
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import warnings
import umap.umap_ as umap

warnings.filterwarnings("ignore")

file_path = root_dir/Path("data/importation-635-focus-AI.csv")
df = pd.read_csv(file_path, sep=";")
df = df[["sentences"]]
texts = (df.sentences).to_list()

# Context

This dataset was extracted from another dataset collected from the Twitter/X platform as part of a study aimed at analyzing trends at the intersection of **AI and climate**. The goal is to gain deeper insights into the specific themes and narratives emerging from posts that relate to both domains.

The data was retrieved using the **official X API**, ensuring compliance with platform constraints and metadata integrity.

In summary, this is a **real-world, multilingual, and noisy dataset**, making it a valuable benchmark to demonstrate the impact of duplicated texts on the performance of an NLP model.

## Objective

The goal here is **not to develop a new NLP method or model**, but rather to **assess the impact of duplicate data** on the final results of our pipeline.

To do so, we will proceed as follows:
1. Apply **sentence embeddings** using the `"all-mpnet-base-v2"` model.
2. Perform **dimensionality reduction** using **UMAP** and **PCA**.
3. **Visualize, comment, and compare** the results with and without duplicates.

This experiment will help us understand whether the presence of duplicate entries significantly alters the structure or quality of the low-dimensional representation.


# Vectorization


In [None]:
# Vectorization
model_embedding = "all-mpnet-base-v2"
model = SentenceTransformer(model_embedding, device="cuda")  

def quantize(v, scale=1e5):
    a = np.asarray(v, dtype=np.float32)
    return tuple(np.rint(a * scale).astype(np.int32))


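# Encode the texts
embeddings = model.encode(texts, device="cuda", show_progress_bar=True, batch_size=64)
df["embeddings"] = embeddings.tolist()

# Lists are unhashable, so drop_duplicates cannot compare them directly.
# Deduplicate on a quantized tuple key instead; this also merges vectors
# that differ only by floating-point noise.
df["embedding_key"] = pd.Series([quantize(v) for v in embeddings], index=df.index)
df_deduplicated = df.drop_duplicates(subset="embedding_key", keep="first").copy()

embeddings = np.vstack(df["embeddings"].to_numpy())
embeddings_deduplicated = np.vstack(df_deduplicated["embeddings"].to_numpy())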

Batches: 100%|██████████| 58/58 [00:11<00:00,  4.88it/s]


Even when two texts are exact duplicates, their embeddings can differ by about 10^-8. This tiny difference is enough for exact-match duplicate detection to treat the two vectors as different.

With cuml/UMAP, this consistently leads to the error `RuntimeError: RAFT failure at file=/__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh line=284: At least one row does not have any neighbor with non-zero distance.` A likely explanation: UMAP scales each point's neighbor distances relative to its nearest non-zero distance, so a point whose k nearest neighbors all sit at distance zero (copies of itself, or copies so close that the distance rounds to zero) leaves nothing to scale against.

To work around this on cuml/UMAP, we can remove the vectors whose cosine similarity to another vector equals exactly 1 (for cosine similarity, a difference at the 8th or 10th decimal place has no impact).
On the CPU version, where this level of precision is not required, we can simply remove the duplicated texts or, for better precision, remove the vectors whose similarity is close to 1.
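
The similarity-based filtering described above can be sketched as follows. This is a minimal illustration rather than the method used in this notebook: the helper name `near_duplicate_mask` and the threshold `0.9999` are assumptions chosen for the example, and building the full similarity matrix is only practical for a few thousand vectors.

```python
import numpy as np


def near_duplicate_mask(vectors, threshold=0.9999):
    """Return a boolean mask that keeps the first vector of each near-duplicate group.

    Hypothetical helper: two vectors count as duplicates when their cosine
    similarity is >= threshold (an assumed value, to be tuned per dataset).
    """
    x = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x = x / np.clip(norms, 1e-12, None)  # unit-normalise; guard against zero vectors
    sim = x @ x.T                        # full cosine-similarity matrix, O(n^2) memory

    keep = np.ones(len(x), dtype=bool)
    for i in range(len(x)):
        if keep[i]:
            # Drop every later row that is too similar to this kept row.
            keep[i + 1:] &= sim[i, i + 1:] < threshold
    return keep


# Toy data: row 1 differs from row 0 by 1e-8, row 3 is row 0 scaled by 2
# (cosine similarity ignores magnitude), row 2 is orthogonal.
toy = np.array([[1.0, 0.0], [1.0 + 1e-8, 0.0], [0.0, 1.0], [2.0, 0.0]])
mask = near_duplicate_mask(toy)
print(mask)
```

Assuming `embeddings` is the 2-D array produced by `model.encode`, `df[near_duplicate_mask(embeddings)]` would give a filtered frame. A threshold of exactly `1.0` corresponds to the strict variant, values slightly below it to the more tolerant one.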

# Dimensionality Reduction
## UMAP

In [9]:
# 1) Fit and transform both datasets
reducer = umap.UMAP(n_components=2, random_state=123)
reducer_dedup = umap.UMAP(n_components=2, random_state=123)

# fit_transform returns the embedding of the training data directly,
# so no separate transform pass over the same points is needed
df['x'], df['y'] = reducer.fit_transform(embeddings).T
df_deduplicated['x'], df_deduplicated['y'] = reducer_dedup.fit_transform(embeddings_deduplicated).T

# 2) Create a 1×2 subplot figure
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("Original Data (with duplicated data)", "Deduplicated Data")
)

# 3) Add original UMAP scatter
fig.add_trace(
    go.Scattergl(
        x=df['x'], y=df['y'],
        mode='markers',
        marker=dict(opacity=0.6, size=5),
        name="Original"
    ),
    row=1, col=1
)

# 4) Add deduplicated UMAP scatter
fig.add_trace(
    go.Scattergl(
        x=df_deduplicated['x'], y=df_deduplicated['y'],
        mode='markers',
        marker=dict(opacity=0.6, size=5, color='firebrick'),
        name="Dedup"
    ),
    row=1, col=2
)

# 5) Axis titles & layout
for i in [1, 2]:
    fig.update_xaxes(title_text="UMAP Dim 1", row=1, col=i)
    fig.update_yaxes(title_text="UMAP Dim 2", row=1, col=i)

fig.update_layout(
    width=1000, height=500,
    title_text="UMAP Projection: Original vs. Deduplicated",
    showlegend=False
)

fig.show()

## PCA

In [7]:
from sklearn.decomposition import PCA
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# 1. Fit a separate PCA on the original and on the deduplicated embeddings
pca_orig = PCA(n_components=2, random_state=123)
pca_dedup = PCA(n_components=2, random_state=123)

reduced_pca_orig = pca_orig.fit_transform(embeddings)
reduced_pca_dedup = pca_dedup.fit_transform(embeddings_deduplicated)

print(f"PCA shape (original): {reduced_pca_orig.shape}")
print(f"PCA shape (deduplicated): {reduced_pca_dedup.shape}")

# 2. Add to DataFrames
df['pca_x'] = reduced_pca_orig[:, 0]
df['pca_y'] = reduced_pca_orig[:, 1]

df_deduplicated['pca_x'] = reduced_pca_dedup[:, 0]
df_deduplicated['pca_y'] = reduced_pca_dedup[:, 1]

# 3. Create a 1×2 subplot figure
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("Original Data (with duplicated data)", "Deduplicated Data")
)

# 4. Add original PCA scatter
fig.add_trace(
    go.Scattergl(
        x=df['pca_x'], y=df['pca_y'],
        mode='markers',
        marker=dict(opacity=0.6, size=5),
        name="Original"
    ),
    row=1, col=1
)

# 5. Add deduplicated PCA scatter
fig.add_trace(
    go.Scattergl(
        x=df_deduplicated['pca_x'], y=df_deduplicated['pca_y'],
        mode='markers',
        marker=dict(opacity=0.6, size=5, color='firebrick'),
        name="Deduplicated"
    ),
    row=1, col=2
)

# 6. Axis titles & layout
for i in [1, 2]:
    fig.update_xaxes(title_text="PC1", row=1, col=i)
    fig.update_yaxes(title_text="PC2", row=1, col=i)

fig.update_layout(
    width=1000, height=500,
    title_text="PCA Projection: Original vs. Deduplicated",
    showlegend=False,
    template="simple_white"
)

fig.show()


PCA shape (original): (1853, 2)
PCA shape (deduplicated): (1290, 2)


# Impact of Duplicate Texts on Dimensionality Reduction

## 1. Why Use Dimensionality Reduction to Compare Duplication Effects?

- **No labeled ground truth:** We lack supervised labels; visual embeddings (UMAP/PCA) act as proxies to measure how duplicates distort the representation.

- **Local density sensitivity:** Duplicates create artificial “point clouds” that skew nearest‑neighbor graphs (k‑NN) in UMAP, altering each point’s 2D projection.

- **Quantifying deformation:** By comparing “original” vs. “deduplicated” projections, we detect whether duplicates cause significant structural shifts.
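
One way to put a number on this comparison is neighbor overlap: for the points present in both projections, measure what fraction of each point's k nearest neighbors in one layout are also its neighbors in the other. The sketch below uses NumPy only. `knn_overlap` and `k=10` are illustrative assumptions, not part of the original analysis, and the dense distance matrix is only suitable for a few thousand points.

```python
import numpy as np


def knn_indices(points, k):
    """Indices of the k nearest neighbours of each row (Euclidean, self excluded)."""
    p = np.asarray(points, dtype=np.float64)
    dist = np.linalg.norm(p[:, None, :] - p[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    return np.argsort(dist, axis=1)[:, :k]


def knn_overlap(layout_a, layout_b, k=10):
    """Mean fraction of shared k-NN between two layouts of the SAME points.

    Row i of layout_a and row i of layout_b must describe the same item.
    1.0 means identical neighbourhoods; values near k/n mean unrelated layouts.
    """
    nn_a = knn_indices(layout_a, k)
    nn_b = knn_indices(layout_b, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_a, nn_b)]
    return float(np.mean(shared)) / k


# Toy check: a rotation preserves neighbourhoods, a row shuffle destroys them.
rng = np.random.default_rng(0)
layout = rng.normal(size=(200, 2))
rotated = layout @ np.array([[0.0, -1.0], [1.0, 0.0]])
shuffled = rng.permutation(layout)
print(knn_overlap(layout, rotated), knn_overlap(layout, shuffled))
```

In this notebook, the two layouts could be `df.loc[df_deduplicated.index, ["x", "y"]].to_numpy()` and `df_deduplicated[["x", "y"]].to_numpy()`, since `drop_duplicates` preserves the original index. The score ignores rotations and rescaling, which matters because UMAP axes carry no absolute meaning.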


## 2. Observations on UMAP

- **Shape changes**  
  - Duplicated data: clusters appear overly compact or stretched.  
  - Deduplicated data: clusters relax, and thematic groupings become easier to distinguish.

- **Noise sensitivity:** UMAP’s reliance on k‑NN graphs makes it particularly vulnerable to density artifacts introduced by duplicates.

- **Stability improvement:** Deduplication reduces topological noise, yielding more consistent and semantically meaningful embeddings.
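
To check how much of the k-NN graph is taken up by duplicates, one can count, for each point, how many of its k nearest neighbors lie at zero distance. A point whose k neighbors are all at distance zero is the situation reported by the cuML error quoted earlier in this notebook. The helper below is a hypothetical diagnostic, not part of the original pipeline; `k=15` is chosen because it is UMAP's default `n_neighbors`, and `tol` is an assumed tolerance.

```python
import numpy as np


def zero_distance_neighbor_counts(vectors, k=15, tol=0.0):
    """For each row, count how many of its k nearest neighbours lie within tol.

    Requires k < number of rows. Distances are computed row by row to keep
    memory usage proportional to the size of the input.
    """
    x = np.asarray(vectors, dtype=np.float64)
    counts = np.empty(len(x), dtype=int)
    for i in range(len(x)):
        dist = np.linalg.norm(x - x[i], axis=1)
        dist[i] = np.inf                         # exclude the point itself
        nearest = np.partition(dist, k - 1)[:k]  # the k smallest distances, unordered
        counts[i] = int(np.sum(nearest <= tol))
    return counts


# Toy data: 5 exact copies of one vector followed by 20 distinct random vectors.
rng = np.random.default_rng(0)
toy = np.vstack([np.tile(rng.normal(size=(1, 8)), (5, 1)), rng.normal(size=(20, 8))])
counts = zero_distance_neighbor_counts(toy, k=3)
print(counts)
```

Rows where the count equals `k` have no neighbor at non-zero distance within their neighborhood; rows with a smaller but non-zero count still have part of their neighborhood filled by copies rather than by semantically related texts.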


## 3. Why This Matters for NLP Pipelines

- **Cascade effect:** Biased projections feed into clustering (e.g. HDBSCAN), producing spurious topics or hiding genuine ones.

- **Embedding quality:** A distorted low‑dimensional representation degrades downstream unsupervised tasks such as topic modeling or anomaly detection.


## 4. Why PCA Is Less Sensitive

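- **Global, linear projection:** PCA maximizes overall variance without constructing a local neighbor graph; duplicates only reweight the covariance matrix, so they disturb the principal axes far less than they disturb a k‑NN graph.

- **Insensitivity to local extremes:** Identical points only marginally affect the covariance matrix, preserving the global shape, unless they make up a large share of the data and are concentrated in one region.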
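
How far duplicates move the principal axes can be checked directly by comparing the first principal direction with and without them. The sketch below runs on assumed toy data with NumPy only; `first_pc` and `axis_angle_deg` are hypothetical helpers, not functions from this notebook or from scikit-learn.

```python
import numpy as np


def first_pc(data):
    """Unit vector of the first principal component (SVD of the centred data)."""
    x = np.asarray(data, dtype=np.float64)
    x = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return vt[0]


def axis_angle_deg(u, v):
    """Angle between two axes in degrees, ignoring sign (a PC's sign is arbitrary)."""
    c = abs(float(np.dot(u, v))) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(c, 0.0, 1.0))))


# Toy data: 500 points stretched along the first axis, then the same data
# with one point repeated 10 times and 2000 times.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 5)) * np.array([3.0, 1.0, 1.0, 1.0, 1.0])
few_dups = np.vstack([base, np.repeat(base[:1], 10, axis=0)])
many_dups = np.vstack([base, np.repeat(base[:1], 2000, axis=0)])

print("angle with 10 copies:  ", axis_angle_deg(first_pc(base), first_pc(few_dups)))
print("angle with 2000 copies:", axis_angle_deg(first_pc(base), first_pc(many_dups)))
```

Applied to this notebook, the same comparison could use `first_pc(embeddings)` and `first_pc(embeddings_deduplicated)`. An angle close to zero would support the claim that PCA is largely unaffected here; a large angle would indicate that the duplicates are numerous or concentrated enough to matter.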

## 5. Recommendations

- **Visual reduction** is essential for diagnosing duplication effects in unsupervised pipelines.  
- **Best practices**: always **deduplicate** before neighbor‑based steps (UMAP, clustering).  
    