In [1]:
import sys
from pathlib import Path

# Automatically add the project root directory to sys.path
root_dir = Path().resolve().parent  # go up to the project root
if str(root_dir) not in sys.path:
    sys.path.insert(0, str(root_dir))


# Third-party and project imports
from sentence_transformers import SentenceTransformer
from utils.helper_functions import clean_text
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import matplotlib.pyplot as plt
from openai import OpenAI
import pandas as pd
import numpy as np
import warnings

import os

from dotenv import load_dotenv

load_dotenv()         # reads the local .env file
api_key = os.getenv("api_key")


warnings.filterwarnings("ignore")

file_path = root_dir/Path("data/importation-635-focus-AI.csv")
df = pd.read_csv(file_path, sep=";")
df["sentences"] = df.sentences.apply(lambda x: clean_text(x))



NameError: name 'clean_text' is not defined

# Context

This dataset was extracted from another dataset collected from the Twitter/X platform as part of a study aimed at analyzing trends at the intersection of **AI and climate**. The goal is to gain deeper insights into the specific themes and narratives emerging from posts that relate to both domains.

The data was retrieved using the **official X API**, ensuring compliance with platform constraints and metadata integrity.

In summary, this is a **real-world, multilingual, and noisy dataset**, making it a valuable benchmark to demonstrate the impact of duplicated texts on the performance of an NLP model.


We start with a first cleaning pass that removes emojis and superfluous whitespace. The dataset contains 1,853 texts, of which 639 are duplicates (34% of the dataset), and it has no missing values.
The goal here is to visualize the clusters obtained with six different embedding methods.

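As a minimal sketch of how the duplicate count above can be read, the snippet below counts every row that repeats an earlier row, which is the same convention as pandas `duplicated().sum()` with its default `keep="first"`. The helper `count_duplicates` and the toy posts are hypothetical; only the figures 639 and 1,853 come from this notebook.

```python
def count_duplicates(texts):
    """Count rows that repeat an earlier row (the first occurrence is kept)."""
    seen = set()
    n_dup = 0
    for t in texts:
        if t in seen:
            n_dup += 1
        else:
            seen.add(t)
    return n_dup

# Hypothetical toy posts, assumed to be already cleaned
toy_posts = [
    "ai helps climate",
    "ai helps climate",
    "carbon cost of ai",
    "ai helps climate",
]
n_dup = count_duplicates(toy_posts)
n_unique = len(toy_posts) - n_dup
print(n_dup, n_unique)

# Share of duplicates reported for this dataset: 639 of 1,853 rows
dup_share = 639 / 1853
remaining = 1853 - 639
print(f"{dup_share:.1%} duplicated, {remaining} texts kept")
```

After deduplication, 1,214 texts remain for the embedding steps.
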
## Embedding Comparison Pipeline

In this notebook, we will explore six different embedding techniques:

- **BoW (Bag of Words)**
- **TF-IDF (Term Frequency–Inverse Document Frequency)**
- **Word2Vec**
- **FastText**
- **all-mpnet-base-v2**
- **OpenAI text-embedding-3-large**

## Step-by-step Process:

1. **Text Embedding**  
   Convert the textual data into vector representations using each method.

2. **Dimensionality Reduction**  
   Apply **UMAP** to reduce the embeddings to 2D for visualization and clustering.

3. **Clustering**  
   Use **HDBSCAN** to identify clusters in the reduced space.

4. **Evaluation**  
   Compare clustering results using metrics such as:
   - Silhouette Score
   - DBCV (Density-Based Clustering Validation)
   - HDBSCAN cluster persistence

This pipeline will help us understand how different embedding strategies affect clustering quality and structure.


In [3]:
display(df.shape)
display(df.sentences.apply(lambda x: clean_text(x)).duplicated().sum())
display(df.sentences.apply(lambda x: clean_text(x)).isna().sum())
df = df.drop_duplicates(subset="sentences", keep="first") 
texts = (df.sentences).to_list()

(1853, 31)

639

0

## BoW

In [4]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import normalize
import time 
start_bow = time.time()
bow_vec = CountVectorizer(
    strip_accents='unicode',
    lowercase=True,
    ngram_range=(1,1),
    min_df=5,            # tune to corpus size (e.g., 1 for a small corpus, 2-5 for a large one)
    max_df=0.9,          # drop terms that appear in more than 90% of documents
    dtype=np.float32,
)
X_bow = bow_vec.fit_transform(texts)      # sparse CSR (n_docs x n_terms)
bow_vocab = bow_vec.get_feature_names_out()
X = normalize(X_bow, norm='l2')
embeddings_bow = X.toarray()
end_bow = time.time()
print(f"BoW - Embedding computation time: {end_bow - start_bow:.2f} seconds")


BoW - Embedding computation time: 0.04 seconds


## Tf-Idf

In [5]:
start_idf = time.time()
tfidf_vec = TfidfVectorizer(
    strip_accents='unicode',
    lowercase=True,
    ngram_range=(1,1),
    min_df=5,
    max_df=0.9,
    norm='l2',           # unit-length rows, so dot products equal cosine similarity
    use_idf=True,
    smooth_idf=True,
    sublinear_tf=False,  # set to True to replace tf with 1 + log(tf)
    dtype=np.float32,
)
X_tfidf = tfidf_vec.fit_transform(texts)
tfidf_vocab = tfidf_vec.get_feature_names_out()
embeddings_tfidf = X_tfidf.toarray()
end_idf = time.time()
print(f"TF-IDF - Embedding computation time: {end_idf - start_idf:.2f} seconds")

TF-IDF - Embedding computation time: 0.03 seconds
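
Both sparse vectorizers end with L2-normalized rows. The sketch below illustrates why that matters downstream, assuming UMAP is left on its default Euclidean metric as in this notebook: for unit-length vectors the dot product is the cosine similarity, and the squared Euclidean distance equals `2 - 2 * cosine`, so both measures rank neighbors identically. The helper `l2_normalize` and the toy counts are hypothetical.

```python
import numpy as np

def l2_normalize(X):
    """Scale each row to unit Euclidean length; all-zero rows stay zero."""
    X = np.asarray(X, dtype=np.float64)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # avoid dividing by zero for empty documents
    return X / norms

# Hypothetical term counts: 3 documents, 4 terms
counts = np.array([
    [2, 0, 1, 0],
    [4, 0, 2, 0],  # same direction as row 0, twice as long
    [0, 3, 0, 1],  # no term in common with rows 0 and 1
])
U = l2_normalize(counts)

# Dot products of unit vectors are cosine similarities
cosine = U @ U.T

# Pairwise squared Euclidean distances between the normalized rows
sq_dist = ((U[:, None, :] - U[None, :, :]) ** 2).sum(axis=-1)

print(np.round(cosine, 3))
print(np.allclose(sq_dist, 2.0 - 2.0 * cosine))
```

Rows 0 and 1 differ only in length, so after normalization they coincide, which is the behavior wanted when document length should not drive the clustering.
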


## Word2vec

In [6]:
# pip install gensim
from gensim.utils import simple_preprocess
from gensim.models import Word2Vec
start_wv = time.time()
# 1) Simple tokenization
tokens = [simple_preprocess(t, deacc=True, min_len=2) for t in texts]

# 2) Train Word2Vec (skip-gram)
w2v = Word2Vec(
    sentences=tokens, vector_size=300, window=5,
    min_count=2, sg=1, negative=10, sample=1e-3,
    epochs=10, workers=20
)
wv = w2v.wv  # KeyedVectors

# 3) Document embedding = mean of its word vectors
def doc_embed(tok):
    V = [wv[w] for w in tok if w in wv]
    return np.mean(V, axis=0) if V else np.zeros(wv.vector_size, dtype=np.float32)

embeddings_w2v = np.vstack([doc_embed(t) for t in tokens]).astype(np.float32)
embeddings_w2v = normalize(embeddings_w2v, norm='l2')  # (n_docs, 300)
end_wv = time.time()
print(f"Word2Vec - Embedding computation time: {end_wv - start_wv:.2f} seconds")


Word2Vec - Embedding computation time: 1.45 seconds
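
The document embedding used above is a plain average of word vectors. Here is a small sketch of that pooling step on hypothetical 3-dimensional vectors (`toy_vectors` stands in for the trained `wv`, and `mean_pool` is an illustrative name), including the zero-vector fallback for documents with no known token.

```python
import numpy as np

def mean_pool(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0).astype(np.float32)

# Hypothetical word vectors, not taken from the trained model
toy_vectors = {
    "climate": np.array([1.0, 0.0, 0.0], dtype=np.float32),
    "ai": np.array([0.0, 1.0, 0.0], dtype=np.float32),
}

# Unknown tokens are skipped, so only "climate" and "ai" contribute
doc_a = mean_pool(["climate", "ai", "unknownword"], toy_vectors, dim=3)

# A document made only of unknown tokens falls back to the zero vector
doc_b = mean_pool(["unknownword"], toy_vectors, dim=3)

print(doc_a, doc_b)
```

Averaging ignores word order and weights every token equally, which is one plausible reason these embeddings separate short posts less sharply than sentence-level models.
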


## FastText

In [7]:
# pip install gensim scikit-learn
from gensim.models import FastText

start_ft = time.time()  # start the timer inside this cell
texts_tok = [simple_preprocess(t) for t in texts]

ft_local = FastText(
    sentences=texts_tok, vector_size=300, window=5,
    min_count=2, sg=1, negative=10, sample=1e-3,
    epochs=10, workers=4, min_n=3
)

def doc_embed(tok):
    # Out-of-vocabulary words are handled through character n-grams
    vecs = [ft_local.wv.get_vector(w) for w in tok]  # get_vector also works for OOV words
    return np.mean(vecs, axis=0) if vecs else np.zeros(ft_local.wv.vector_size, np.float32)

embeddings_fasttext = np.vstack([doc_embed(t) for t in texts_tok]).astype(np.float32)
embeddings_fasttext = normalize(embeddings_fasttext, norm='l2')
end_ft = time.time()
print(f"FastText - Embedding computation time: {end_ft - start_ft:.2f} seconds")


FastText - Embedding computation time: 4.37 seconds


## all-mpnet-base-v2

In [8]:
start_mnpnet = time.time()
model_embedding = "all-mpnet-base-v2"
model = SentenceTransformer(model_embedding, device="cuda")  
# Encode the texts
embeddings_all_mpnet = model.encode(texts, device="cuda", show_progress_bar=True, batch_size=256)
end_mnpnet = time.time()
print(f"MPNet - Embedding computation time: {end_mnpnet - start_mnpnet:.2f} seconds")


Batches: 100%|██████████| 5/5 [00:03<00:00,  1.39it/s]

MPNet - Embedding computation time: 6.79 seconds





## OpenAI: text-embedding-3-large

In [None]:
start_large3 = time.time()
client = OpenAI(api_key = api_key)
res = client.embeddings.create(model="text-embedding-3-large", input=texts)
embedding_large3 = np.array([np.array(d.embedding, dtype=np.float32) for d in res.data])
end_large3 = time.time()
print(f"Large 3 - Embedding computation time: {end_large3 - start_large3:.2f} seconds")

Large 3 - Embedding computation time: 12.05 seconds


## UMAP (default parameters)

In [11]:
import umap

def umap2d(X, random_state=42):
    reducer = umap.UMAP(n_components=2, random_state=random_state, n_jobs=22)  # defaults
    # UMAP also accepts sparse CSR/CSC input; here the dense arrays are cast to float32.
    return reducer.fit_transform(np.asarray(X, dtype=np.float32))

# 1) BoW
Z_bow = umap2d(embeddings_bow)

# 2) TF-IDF
Z_tfidf = umap2d(embeddings_tfidf)

# 3) Word2Vec
Z_w2v = umap2d(embeddings_w2v)

# 4) FastText
Z_fasttext = umap2d(embeddings_fasttext)

# 5) all-mpnet-base-v2
Z_all_mpnet = umap2d(embeddings_all_mpnet)

# 6) text-embedding-3-large
Z_large3 = umap2d(embedding_large3)

## HDBSCAN (min_cluster_size=50, min_samples=5)

In [44]:
import hdbscan

def run_hdbscan(Z, min_cluster_size=50, min_samples=5):
    cl = hdbscan.HDBSCAN(
        min_cluster_size=min_cluster_size,
        min_samples=min_samples,
        cluster_selection_method='eom',
        prediction_data=True,
        core_dist_n_jobs=22,
    )
    labels = cl.fit_predict(Z)          # -1 = noise
    probs  = cl.probabilities_          # membership strength (0..1)
    return cl, labels, probs, cl.cluster_persistence_.mean()

cl_bow,      labels_bow,      probs_bow,      cluster_persistence_bow      = run_hdbscan(Z_bow)
cl_tfidf,    labels_tfidf,    probs_tfidf,    cluster_persistence_tfidf    = run_hdbscan(Z_tfidf)
cl_w2v,      labels_w2v,      probs_w2v,      cluster_persistence_w2v      = run_hdbscan(Z_w2v)
cl_fasttext, labels_fasttext, probs_fasttext, cluster_persistence_fasttext = run_hdbscan(Z_fasttext)
cl_mpnet,    labels_mpnet,    probs_mpnet,    cluster_persistence_mpnet    = run_hdbscan(Z_all_mpnet)
cl_large3,   labels_large3,   probs_large3,   cluster_persistence_large3   = run_hdbscan(Z_large3)

# (optional) short summary
def summarize(name, labels, persistence):
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    noise = (labels == -1).mean()
    print(f"{name}: clusters={n_clusters}, noise={noise:.1%}, persistence={persistence:.1%}")

summarize("BoW", labels_bow, cluster_persistence_bow)
summarize("TF-IDF", labels_tfidf, cluster_persistence_tfidf)
summarize("Word2Vec", labels_w2v, cluster_persistence_w2v)
summarize("FastText", labels_fasttext, cluster_persistence_fasttext)
summarize("all-mpnet-base-v2", labels_mpnet, cluster_persistence_mpnet)
summarize("text-embedding-3-large", labels_large3, cluster_persistence_large3)

BoW: clusters=2, noise=1.7%, persistence=17.5%
TF-IDF: clusters=2, noise=9.1%, persistence=28.7%
Word2Vec: clusters=3, noise=13.3%, persistence=6.1%
FastText: clusters=2, noise=0.2%, persistence=28.0%
all-mpnet-base-v2: clusters=3, noise=0.0%, persistence=56.7%
text-embedding-3-large: clusters=3, noise=6.9%, persistence=54.6%


## Visualization

In [46]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import numpy as np

# (name, 2D coordinates, original HDBSCAN labels)
plots = [
    ("BoW", Z_bow, labels_bow),
    ("Word2Vec", Z_w2v, labels_w2v),
    ("MPNet (all-mpnet-base-v2)", Z_all_mpnet, labels_mpnet),
    ("TF-IDF", Z_tfidf, labels_tfidf),
    ("FastText", Z_fasttext, labels_fasttext),
    ("text-embedding-3-large", Z_large3, labels_large3),
]

fig = make_subplots(
    rows=2, cols=3,
    subplot_titles=[name for name, _, _ in plots],
    horizontal_spacing=0.06, vertical_spacing=0.10
)

PALETTE = px.colors.qualitative.D3 + px.colors.qualitative.Set1 + px.colors.qualitative.Dark24
NOISE_COLOR = "#6E6E6E"  # readable gray

# --- helpers ---
def remap_labels_to_1K(labels):
    labels = np.asarray(labels)
    uniq = sorted([l for l in np.unique(labels) if l != -1])
    mapping = {old: i+1 for i, old in enumerate(uniq)}
    return np.array([mapping.get(l, -1) for l in labels], dtype=int)

def colors_for(new_labels):
    labs = np.asarray(new_labels).tolist()
    uniq = sorted(set(labs) - {-1})
    lut = {lab: PALETTE[(lab-1) % len(PALETTE)] for lab in uniq}
    cols = np.array([NOISE_COLOR if l == -1 else lut[l] for l in labs], dtype=object)
    return cols, lut

def add_legend_box(fig, r, c, new_labels, lut, pos="top-right",
                   width=0.34, font_size=10, alpha=0.90):
    total = len(new_labels)
    items = []
    for k in sorted(lut.keys()):      # cluster 1, 2, ...
        cnt = int(np.sum(new_labels == k))
        pct = 100.0 * cnt / total if total else 0.0
        items.append((f"cluster {k} ({pct:.1f}%)", lut[k]))
    if -1 in set(new_labels.tolist()):
        cnt = int(np.sum(new_labels == -1))
        pct = 100.0 * cnt / total if total else 0.0
        items.append((f"noise ({pct:.1f}%)", NOISE_COLOR))
    if not items:
        return

    pad = 0.016
    line_h = 0.062          # compact line height
    height = pad*2 + line_h*len(items)

    # compact placement in the corner (in axis-domain coordinates)
    if pos == "top-right":
        x1, y1 = 0.98, 0.98
        x0 = x1 - width
        y0 = max(0.02, y1 - height)
    elif pos == "bottom-left":
        x0, y0 = 0.02, 0.02
        x1, y1 = x0 + width, y0 + height
    else:  # top-left
        x0, y1 = 0.02, 0.98
        x1, y0 = x0 + width, max(0.02, y1 - height)

    fig.add_shape(
        type="rect", xref="x domain", yref="y domain",
        x0=x0, y0=y0, x1=x1, y1=y1,
        line=dict(color="rgba(0,0,0,0.25)", width=1),
        fillcolor=f"rgba(255,255,255,{alpha})",
        layer="above",
        row=r, col=c
    )

    sw = 0.019  # color swatch size
    # anchored at the top-left of the box
    base_y = y1 - pad
    for i, (txt, color) in enumerate(items):
        y = base_y - (i + 0.5) * line_h
        # color swatch
        fig.add_shape(
            type="rect", xref="x domain", yref="y domain",
            x0=x0 + pad, x1=x0 + pad + sw, y0=y - 0.018, y1=y + 0.018,
            line=dict(color=color, width=1), fillcolor=color, layer="above",
            row=r, col=c
        )
        # label text
        fig.add_annotation(
            xref="x domain", yref="y domain",
            x=x0 + pad + sw + 0.012, y=y,
            text=txt, showarrow=False,
            xanchor="left", yanchor="middle", align="left",
            font=dict(size=font_size, color="#0F0F0F"),
            row=r, col=c
        )

# --- plots ---
for k, (name, Z, labels_raw) in enumerate(plots, start=1):
    r = 1 if k <= 3 else 2
    c = k if k <= 3 else k - 3

    lab = remap_labels_to_1K(labels_raw)
    cols, lut = colors_for(lab)

    fig.add_trace(
        go.Scattergl(
            x=Z[:, 0], y=Z[:, 1],
            mode="markers",
            marker=dict(size=5, color=cols, opacity=0.8),
            hoverinfo="skip",
            showlegend=False
        ),
        row=r, col=c
    )
    # compact legend in the corner, small footprint
    add_legend_box(fig, r, c, lab, lut, pos="top-right",
                   width=0.28, font_size=10, alpha=0.7)

# light global style
fig.update_layout(
    height=900, width=1400,
    title="HDBSCAN clusters (UMAP 2D) — BoW | W2V | MPNet / TF-IDF | FastText | OAI-large3",
    template="plotly_white",
    margin=dict(l=0, r=0, t=56, b=0),
)
fig.update_xaxes(showticklabels=False, showgrid=True, gridcolor="rgba(0,0,0,0.04)", zeroline=False)
fig.update_yaxes(showticklabels=False, showgrid=True, gridcolor="rgba(0,0,0,0.04)", zeroline=False)

fig.show(config={
    "displayModeBar": True,
    "toImageButtonOptions": {"format": "png", "scale": 3}  # png, svg, jpeg, or webp
})


## Comparison

In [49]:
from sklearn.metrics import silhouette_score
from hdbscan.validity import validity_index as dbcv  # DBCV (density-based clustering validation)

def metrics_from_hdbscan(Z, labels, probs):
    Z = np.asarray(Z, dtype=np.float64, order='C')   # cast to contiguous float64 before calling validity_index
    labels = np.asarray(labels, dtype=int)

    mask = labels != -1
    n_clusters = len(set(labels[mask]))
    noise_ratio = float((~mask).mean())
    mean_prob = float(probs[mask].mean()) if mask.sum() else np.nan
    sil = float(silhouette_score(Z[mask], labels[mask])) if n_clusters >= 2 else np.nan
    try:
        v = float(dbcv(Z, labels)) if n_clusters >= 1 else np.nan
    except Exception:
        v = np.nan
    return n_clusters, noise_ratio, mean_prob, sil, v

rows = []
rows.append(("BoW",               *metrics_from_hdbscan(Z_bow,      labels_bow,      probs_bow), cluster_persistence_bow, end_bow-   start_bow))
rows.append(("TF-IDF",            *metrics_from_hdbscan(Z_tfidf,    labels_tfidf,    probs_tfidf), cluster_persistence_tfidf, end_idf - start_idf))
rows.append(("Word2Vec",          *metrics_from_hdbscan(Z_w2v,      labels_w2v,      probs_w2v), cluster_persistence_w2v, end_wv -   start_wv))
rows.append(("FastText",          *metrics_from_hdbscan(Z_fasttext, labels_fasttext, probs_fasttext), cluster_persistence_fasttext, end_ft -   end_wv))
rows.append(("MPNet (all-mpnet)", *metrics_from_hdbscan(Z_all_mpnet,labels_mpnet,    probs_mpnet), cluster_persistence_mpnet, end_mnpnet - start_mnpnet))
rows.append(("text-emb-3-large",  *metrics_from_hdbscan(Z_large3,   labels_large3,   probs_large3), cluster_persistence_large3, end_large3 - start_large3))

__ = pd.DataFrame(rows, columns=["Embedding","n_clusters","noise_ratio","mean_prob","silhouette","DBCV", "cluster_persistence",  "time_seconds"])
__[["Embedding", "n_clusters","silhouette","DBCV", "cluster_persistence",  "time_seconds"]] 

| | Embedding | n_clusters | silhouette | DBCV | cluster_persistence | time_seconds |
|---|---|---|---|---|---|---|
| 0 | BoW | 2 | 0.155953 | -0.244443 | 0.174862 | 0.040604 |
| 1 | TF-IDF | 2 | 0.314243 | -0.36731 | 0.287143 | 0.031778 |
| 2 | Word2Vec | 3 | 0.208206 | -0.4085 | 0.061306 | 1.450235 |
| 3 | FastText | 2 | 0.265163 | -0.148796 | 0.280291 | 4.371775 |
| 4 | MPNet (all-mpnet) | 3 | 0.749627 | 0.706091 | 0.567071 | 6.791794 |
| 5 | text-emb-3-large | 3 | 0.68485 | 0.490799 | 0.546145 | 12.046897 |


### Joint Analysis of Metrics and Visualization

#### Key Indicators
- **Silhouette**: Measures the balance between intra-cluster cohesion and inter-cluster separation (values close to 1 are ideal).
- **DBCV**: Assesses clustering quality based on density (positive values indicate reliable structure).
- **HDBSCAN Persistence**: Reflects hierarchical stability of clusters (higher values = more robust groups).
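
To make the Silhouette definition concrete, here is a from-scratch sketch with Euclidean distances. It is an illustration only: the notebook itself relies on scikit-learn's `silhouette_score`, applied to the non-noise points. The function name `silhouette` and the toy points are hypothetical. For each point, `a` is the mean distance to the other members of its cluster, `b` is the smallest mean distance to any other cluster, and the score is `(b - a) / max(a, b)`.

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient (Euclidean); expects at least 2 clusters."""
    X = np.asarray(X, dtype=np.float64)
    labels = np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # convention: a singleton cluster scores 0
        # dist[i, i] is 0, so dividing by n_same - 1 excludes the point itself
        a = dist[i, same].sum() / (n_same - 1)
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return scores.mean()

# Two tight, well-separated pairs of points
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = silhouette(points, [0, 0, 1, 1])  # labels follow the geometry
bad = silhouette(points, [0, 1, 0, 1])   # labels cut across the geometry
print(round(good, 3), round(bad, 3))
```

A labeling that follows the geometry scores close to 1, while one that cuts across it scores below 0.
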


#### Number of Clusters and Noise
- The reported number of **clusters** refers to HDBSCAN groups excluding noise.
- Points shown in **gray** in the figure are considered **noise** (isolated or ambiguous observations).

#### Table Insights: Method Comparison
Transformer-based embeddings stand out clearly:
- **MPNet (all-mpnet)**: Best overall performance (Silhouette 0.75, DBCV 0.71, Persistence 0.57).
- **OpenAI text-embedding-3-large**: Strong results, slightly behind MPNet (Silhouette 0.68, DBCV 0.49, Persistence 0.55).
- **BoW**, **TF–IDF**, **Word2Vec**, **FastText**: Negative DBCV and low-to-moderate persistence, indicating less coherent density-based structures despite intermediate Silhouette scores (e.g., TF–IDF = 0.31).
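
The comparison can also be read as rankings. The sketch below copies the three quality metrics from the table above into a dictionary (the structure and the helper `rank` are illustrative) and orders the embeddings by each metric.

```python
# Values copied from the comparison table above
results = {
    "BoW": {"silhouette": 0.155953, "DBCV": -0.244443, "cluster_persistence": 0.174862},
    "TF-IDF": {"silhouette": 0.314243, "DBCV": -0.36731, "cluster_persistence": 0.287143},
    "Word2Vec": {"silhouette": 0.208206, "DBCV": -0.4085, "cluster_persistence": 0.061306},
    "FastText": {"silhouette": 0.265163, "DBCV": -0.148796, "cluster_persistence": 0.280291},
    "MPNet (all-mpnet)": {"silhouette": 0.749627, "DBCV": 0.706091, "cluster_persistence": 0.567071},
    "text-emb-3-large": {"silhouette": 0.68485, "DBCV": 0.490799, "cluster_persistence": 0.546145},
}

def rank(metric):
    """Embedding names sorted from best to worst on one metric (higher is better)."""
    return sorted(results, key=lambda name: results[name][metric], reverse=True)

# Embeddings whose density-based validity is positive
positive_dbcv = [name for name, m in results.items() if m["DBCV"] > 0]

for metric in ("silhouette", "DBCV", "cluster_persistence"):
    print(f"{metric:>20}: {' > '.join(rank(metric))}")
print("positive DBCV:", positive_dbcv)
```

The two transformer models lead on all three metrics, while the classical methods swap places depending on the metric: TF-IDF is third on Silhouette but fifth on DBCV.
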

#### Visualization Insights
The UMAP–HDBSCAN figure confirms these findings:
- **BoW / TF–IDF**: Two dominant clusters with blurry boundaries and significant noise (especially TF–IDF).
- **Word2Vec**: Three discernible groups, but with diffuse transitions and notable gray areas.
- **FastText**: Two dense clusters, but limited granularity (risk of under-segmentation).
- **MPNet**: Three compact, well-separated clusters with minimal noise; fine semantic separation.
- **text-embedding-3-large**: Three coherent groups with slight vertical dispersion and some noise.

Transformer embeddings provide **finer semantic detail** and **clearer cluster separation** than traditional approaches.

#### Analyst Validation
Client-side analysts confirmed the **readability** and **usability** of partitions produced by **MPNet**, which was selected as the reference embedding for the remainder of the project.

#### Why MPNet over OpenAI Large 3?
Model size and popularity don’t guarantee optimal structuring for a **specific corpus**. Performance depends on:
- Text type and length (e.g., short posts)
- Lexical register
- Topic distribution
- Distance metrics and UMAP reduction

MPNet (all-mpnet-base-v2) produces 768-dimensional sentence embeddings, against 3,072 dimensions for text-embedding-3-large. On these short, varied messages it yielded **better density boundaries** with HDBSCAN after UMAP reduction.

#### Unsupervised Evaluation: Use with Caution
Internal metrics offer **guidance**, not definitive answers:
- **Example 1 (Elbow method)**: The “elbow” value doesn’t always reflect the most meaningful number of clusters. In this study, granularity was defined through **iterations with analysts** to align with business needs.
- **Example 2 (Business constraints)**: In retail segmentation, a small number of actionable segments (e.g., 4–5) is often required. A clustering algorithm might suggest **12 statistically optimal groups**, but these are **not operationally viable** due to catalog, journey, or CRM constraints.

Metrics are **decision aids**, not decisions themselves.

#### Execution Time and Interpretability
- **BoW / TF–IDF** offer **very low latency** (≤ 0.05 s on this corpus of 1,214 deduplicated texts), which makes them suitable for large-scale datasets (e.g., **10 million tweets**) or **near-real-time processing**.
- They also provide **immediate interpretability** (dimensions ↔ terms), useful for diagnostics, pre-filtering, or audits.
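
The "dimensions ↔ terms" property can be illustrated with a tiny bag-of-words built by hand. This is a simplified stand-in for `CountVectorizer` (whitespace tokenization, no frequency filtering), and the function `bag_of_words` and the two sample documents are hypothetical.

```python
from collections import Counter

def bag_of_words(docs):
    """Return a sorted vocabulary and one term-count vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {term: j for j, term in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for term, n in Counter(toks).items():
            row[index[term]] = n
        vectors.append(row)
    return vocab, vectors

docs = ["AI can cut emissions", "AI training emissions emissions"]
vocab, vectors = bag_of_words(docs)

# Every dimension is a term, so a vector can be read back directly
top_index = max(range(len(vocab)), key=vectors[1].__getitem__)
top_term = vocab[top_index]

print(vocab)
print(vectors)
print(top_term)
```

With dense transformer embeddings no such direct reading exists, which is why the sparse methods remain useful for diagnostics even when they cluster less well.
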

#### Conclusion
Based on metrics, visual structure, and business validation, the report will proceed with **transformer-based embeddings**, with MPNet (all-mpnet-base-v2) as the reference, for theme construction and annotation.
