# Notebook Objective

The primary objective of this notebook is to perform semantic clustering of social-media images.
For this reason, we exclusively import and use CLIP embeddings, which are designed to capture semantic patterns (the meaning of an image).

# Test Datasets - recap

We will test the clustering on two datasets:

1. **Recognition of the Palestinian State (08/13 – 08/29)**  
   This dataset was collected following France’s recognition of the State of Palestine.  
   It includes both **texts and their associated images**, totaling **5,055 images**.

2. **September 10 Demonstrations (08/20 – ~10/10)**  
   This dataset was collected during the period surrounding the **September 10 protest in France**.  
   It also contains **texts and their related images**, totaling **41,942 images**.

Only the **image URLs** are stored — the images themselves will be **downloaded dynamically** during processing.

In [1]:
from helpers import *

import umap.umap_ as umap
import hdbscan
from pathlib import Path 



import pandas as pd 
import numpy as np 
import warnings 
import logging  
import os


from dash import Dash, dcc, html, Input, Output
import plotly.express as px
from io import BytesIO
import base64


warnings.filterwarnings("ignore")
logging.getLogger("transformers").setLevel(logging.ERROR)  # the Hugging Face logger is named "transformers"
logging.getLogger("PIL").setLevel(logging.WARNING)



from dotenv import load_dotenv
load_dotenv()        
path = os.getenv("path_folder")


import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)




Device: cuda


## Import embeddings

- **`path_folder_images_embeddings`** → the **directory path** where the embeddings are saved as Parquet files (to be adapted as needed).
- **`subject`** → defines the dataset used, either `"10_septembre"` or `"reconnaissance_france_palestine"` (to be adapted).  

In [2]:
path_folder_images_embeddings = os.path.join(path, "img_embeddings")
subject_10_septembre = '10_septembre' 
subject_reconnaissance_france_palestine = 'reconnaissance_france_palestine'

path_10_septembre = f"{path_folder_images_embeddings}/{subject_10_septembre}_CLIP.parquet"
path_reconnaissance_france_palestine = f"{path_folder_images_embeddings}/{subject_reconnaissance_france_palestine}_CLIP.parquet"

# Load data
df_10_septembre = pd.read_parquet(path_10_septembre)
df_reconnaissance_france_palestine = pd.read_parquet(path_reconnaissance_france_palestine)
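
Before reducing dimensionality, it can be worth checking that the loaded column stacks into a clean matrix. The sketch below is only an illustration: `stack_embeddings` is a hypothetical helper (it is not part of `helpers`), and the toy list stands in for `df["embedding"].values`.

```python
import numpy as np

def stack_embeddings(vectors, expected_dim=None):
    """Stack per-image vectors into an (n_images, dim) float32 matrix and sanity-check it.

    Hypothetical helper for illustration; `vectors` plays the role of
    df["embedding"].values in the notebook.
    """
    X = np.vstack([np.asarray(v, dtype=np.float32) for v in vectors])
    if expected_dim is not None and X.shape[1] != expected_dim:
        raise ValueError(f"expected dimension {expected_dim}, got {X.shape[1]}")
    if not np.isfinite(X).all():
        raise ValueError("embeddings contain NaN or infinite values")
    # Cosine distance is undefined for all-zero vectors.
    if (np.linalg.norm(X, axis=1) == 0).any():
        raise ValueError("embeddings contain all-zero vectors")
    return X

# Toy stand-in: three 4-dimensional vectors instead of real CLIP embeddings.
toy_vectors = [
    np.array([1.0, 0.0, 0.0, 0.0]),
    np.array([0.0, 2.0, 0.0, 0.0]),
    np.array([0.5, 0.5, 0.5, 0.5]),
]
X_toy = stack_embeddings(toy_vectors, expected_dim=4)
print("Shape:", X_toy.shape)
```

In the notebook, the same check could be applied to each dataset with `expected_dim` set to the embedding size reported by the `Shape:` output.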

## 10 septembre  
We begin with the **“10 septembre”** dataset before moving on to the second one.  
This approach helps avoid overlapping or duplicated code blocks that perform the same operations with only minor name changes.


### Dimensionality reduction with UMAP

In [3]:
%%time
X_10_septembre = np.vstack(df_10_septembre["embedding"].values)
print("Shape:", X_10_septembre.shape)  # (nb_images, 1024)
X_10_septembre_reduced = umap.UMAP(n_components=2, random_state=42, n_neighbors=20, metric='cosine', min_dist=0, spread=1, n_jobs=-1).fit_transform(X_10_septembre)
df_10_septembre["x"] = X_10_septembre_reduced[:,0]
df_10_septembre["y"] = X_10_septembre_reduced[:,1]

Shape: (39839, 1024)
CPU times: total: 3min 14s
Wall time: 52.9 s


### HDBSCAN

In [16]:
test_performance(X_10_septembre_reduced, df_10_septembre)

i=2
Clusters: 1945, Bruit: 31.76%, Persistance: 0.011, Silhouette: 0.578
i=7
Clusters: 1145, Bruit: 28.88%, Persistance: 0.022, Silhouette: 0.596
i=12
Clusters: 729, Bruit: 29.05%, Persistance: 0.025, Silhouette: 0.593
i=17
Clusters: 538, Bruit: 30.18%, Persistance: 0.024, Silhouette: 0.598
i=22
Clusters: 419, Bruit: 28.72%, Persistance: 0.025, Silhouette: 0.540
i=27
Clusters: 334, Bruit: 29.44%, Persistance: 0.022, Silhouette: 0.529
i=32
Clusters: 293, Bruit: 31.53%, Persistance: 0.021, Silhouette: 0.560
i=37
Clusters: 243, Bruit: 31.95%, Persistance: 0.021, Silhouette: 0.538
i=42
Clusters: 214, Bruit: 33.27%, Persistance: 0.020, Silhouette: 0.527
i=47
Clusters: 198, Bruit: 34.34%, Persistance: 0.020, Silhouette: 0.520
i=52
Clusters: 173, Bruit: 33.46%, Persistance: 0.022, Silhouette: 0.511
i=57
Clusters: 140, Bruit: 29.66%, Persistance: 0.026, Silhouette: 0.457
i=62
Clusters: 130, Bruit: 29.57%, Persistance: 0.018, Silhouette: 0.454
i=67
Clusters: 125, Bruit: 30.20%, Persistance: 0.0

The best compromise was obtained with a minimum cluster size of 167 (noise of 36.91%, a good silhouette value of 0.523, and the highest persistence, 0.064).
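
The implementation of `test_performance` lives in `helpers` and is not shown here. As a rough sketch of how such a sweep could be organized, the code below loops over candidate sizes and summarizes each labelling. All names (`summarize_labels`, `sweep_min_cluster_size`, `toy_cluster_fn`) are hypothetical, and the clustering step is passed in as a callable so the example runs without `hdbscan`.

```python
import numpy as np

def summarize_labels(labels):
    """Return (n_clusters, noise_pct) for cluster labels where -1 means noise."""
    labels = np.asarray(labels)
    n_clusters = len(set(labels.tolist()) - {-1})  # the noise label is not a cluster
    noise_pct = 100.0 * float(np.mean(labels == -1)) if labels.size else 0.0
    return n_clusters, noise_pct

def sweep_min_cluster_size(X, cluster_fn, sizes):
    """Run cluster_fn(X, size) for each size and collect one summary row per run.

    cluster_fn is any callable returning integer labels (-1 for noise); in the notebook
    it could wrap hdbscan.HDBSCAN(min_cluster_size=size).fit_predict.
    """
    rows = []
    for size in sizes:
        n_clusters, noise_pct = summarize_labels(cluster_fn(X, size))
        rows.append({"min_cluster_size": size, "clusters": n_clusters, "noise_pct": noise_pct})
    return rows

def toy_cluster_fn(X, size):
    """Toy clusterer: groups of `size` consecutive points, incomplete last group is noise."""
    n = len(X)
    labels = np.arange(n) // size
    labels[(n // size) * size:] = -1
    return labels

X_toy = np.zeros((10, 2))
for row in sweep_min_cluster_size(X_toy, toy_cluster_fn, sizes=range(2, 8, 2)):
    print(row)
```

Persistence and silhouette could be added to each row in the same way; they are left out here to keep the sketch free of extra dependencies.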

In [4]:
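# Final clustering on the 2D UMAP projection with the selected min_cluster_size
clusterer = hdbscan.HDBSCAN(min_cluster_size=167, min_samples=5, metric="euclidean")
labels = clusterer.fit_predict(X_10_septembre_reduced)
df_10_septembre["cluster"] = (labels).astype(str)
# Note: cluster.unique() includes the noise label '-1', so the printed count is clusters + 1 whenever noise points exist
print(f"Cluster: {len(df_10_septembre.cluster.unique())}, Bruit: {len(df_10_septembre[df_10_septembre.cluster == '-1'])/len(df_10_septembre)*100}%, Persistance: {(clusterer.cluster_persistence_).mean()}")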

Cluster: 53, Bruit: 36.906046838525064%, Persistance: 0.0635862653560839


### Visualisation

In [None]:
path_folder_images_name = os.path.join(path, "img_data")
image_dir_10_septembre = os.path.join(path_folder_images_name, subject_10_septembre)
IMAGE_DIR = Path(image_dir_10_septembre)

# ====== DataFrame preparation ======
# (df_10_septembre must already contain x, y, cluster, filename)
from PIL import Image  # explicit import in case `helpers` does not re-export Image

def img_to_base64(path):
    """Return a PNG thumbnail of the image as a base64 data URI, or None if it cannot be read."""
    try:
        img = Image.open(path).convert("RGB")
        img.thumbnail((400, 400))
        buffer = BytesIO()
        img.save(buffer, format="PNG")
        return "data:image/png;base64," + base64.b64encode(buffer.getvalue()).decode()
    except Exception:
        return None

df_10_septembre["img_b64"] = [img_to_base64(IMAGE_DIR / fn) for fn in df_10_septembre["filename"]]

# ====== Scatter plot creation ======
fig = px.scatter(
    df_10_septembre,
    x="x",
    y="y",
    color="cluster",
    hover_data=["filename"],
    custom_data=["filename"],
    title="Visual clustering of images (CLIP + HDBSCAN)",
)
fig.update_traces(marker=dict(size=6, opacity=0.8))

# ====== Dash application creation ======
app = Dash(__name__)
app.layout = html.Div([
    html.H3("Interactive exploration of image clusters (CLIP)"),
    html.Div([
        dcc.Graph(
            id="scatter",
            figure=fig,
            style={"width": "65vw", "height": "85vh", "display": "inline-block"},
        ),
        html.Div(
            id="image-display",
            style={"width": "30vw", "display": "inline-block", "verticalAlign": "top", "padding": "20px"},
        ),
    ]),
])

# ====== Callback: clicking a point displays the image ======
@app.callback(
    Output("image-display", "children"),
    Input("scatter", "clickData"),
)
def show_image(clickData):
    if clickData is None:
        return html.P("Click on a point to display the corresponding image.")
    
    filename = clickData["points"][0]["customdata"][0]
    path = IMAGE_DIR / filename
    if not path.exists():
        return html.P(f"Image not found: {filename}")
    
    img_b64 = img_to_base64(path)
    return html.Div([
        html.H4(filename),
        html.Img(src=img_b64, style={"maxWidth": "100%", "border": "2px solid #444"}),
    ])


# ====== Start the server ======
if __name__ == "__main__":
    app.run(debug=True, port=8053)

print("To open the visualization in a browser: http://127.0.0.1:8053")

To open the visualization in a browser: http://127.0.0.1:8053
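
The cell above encodes every image up front and then encodes the clicked image again inside the callback. One alternative, sketched here under assumptions, is to encode lazily and cache the result per path. `file_to_data_uri` is a hypothetical function using only the standard library; unlike `img_to_base64`, it does not resize the image.

```python
import base64
import mimetypes
import tempfile
from functools import lru_cache
from pathlib import Path

@lru_cache(maxsize=256)
def file_to_data_uri(path_str):
    """Read a file and return it as a base64 data URI, or None if it cannot be read.

    Results are cached per path string, so clicking the same point twice does not
    re-read the file. A None result is cached too, until the cache is cleared.
    """
    path = Path(path_str)
    try:
        raw = path.read_bytes()
    except OSError:
        return None
    mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    return f"data:{mime};base64," + base64.b64encode(raw).decode("ascii")

# Small demonstration with a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.png"
    demo.write_bytes(b"not a real image")  # placeholder bytes, only to exercise the function
    uri = file_to_data_uri(str(demo))
    print(uri[:22])
```

In a callback, `file_to_data_uri(str(IMAGE_DIR / filename))` could then replace the second call to `img_to_base64`, at the cost of sending full-size images to the browser.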


## Recognition of the Palestinian State

In [6]:
%%time
X_rfp = np.vstack(df_reconnaissance_france_palestine["embedding"].values)  # CLIP embeddings
print("Shape:", X_rfp.shape)  # (nb_images, 1024)
X_rfp_reduced = umap.UMAP(n_components=2, random_state=42, n_neighbors=20, metric='cosine', min_dist=0, spread=1, n_jobs=-1).fit_transform(X_rfp)
df_reconnaissance_france_palestine["x"] = X_rfp_reduced[:,0]
df_reconnaissance_france_palestine["y"] = X_rfp_reduced[:,1]

Shape: (5025, 1024)
CPU times: total: 7 s
Wall time: 7.54 s


### HDBSCAN

In [23]:
test_performance(X_rfp_reduced, df_reconnaissance_france_palestine)

i=2
Clusters: 293, Bruit: 28.02%, Persistance: 0.017, Silhouette: 0.698
i=7
Clusters: 199, Bruit: 23.88%, Persistance: 0.024, Silhouette: 0.653
i=12
Clusters: 121, Bruit: 23.20%, Persistance: 0.027, Silhouette: 0.617
i=17
Clusters: 87, Bruit: 24.60%, Persistance: 0.009, Silhouette: 0.604
i=22
Clusters: 67, Bruit: 29.03%, Persistance: 0.012, Silhouette: 0.588
i=27
Clusters: 49, Bruit: 25.85%, Persistance: 0.018, Silhouette: 0.539
i=32
Clusters: 45, Bruit: 25.01%, Persistance: 0.056, Silhouette: 0.537
i=37
Clusters: 39, Bruit: 23.66%, Persistance: 0.056, Silhouette: 0.498
i=42
Clusters: 30, Bruit: 27.60%, Persistance: 0.082, Silhouette: 0.480
i=47
Clusters: 28, Bruit: 29.39%, Persistance: 0.083, Silhouette: 0.497
i=52
Clusters: 26, Bruit: 31.28%, Persistance: 0.082, Silhouette: 0.511
i=57
Clusters: 23, Bruit: 34.51%, Persistance: 0.077, Silhouette: 0.533
i=62
Clusters: 20, Bruit: 38.09%, Persistance: 0.113, Silhouette: 0.536
i=67
Clusters: 20, Bruit: 38.09%, Persistance: 0.091, Silhouett

The best compromise was obtained with a minimum cluster size of 37 (noise of 23.66%, a good silhouette value of 0.498, and a persistence of 0.056).
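
The notebook does not state how this compromise was chosen. One possible rule, shown purely as an assumption, is to keep the runs whose noise stays under a threshold and then take the one with the highest persistence. The rows below are transcribed from the sweep output above (the truncated `i=67` line is left out); `pick_compromise` and the 25% threshold are hypothetical.

```python
# (min_cluster_size, clusters, noise %, persistence, silhouette)
sweep = [
    (2, 293, 28.02, 0.017, 0.698),
    (7, 199, 23.88, 0.024, 0.653),
    (12, 121, 23.20, 0.027, 0.617),
    (17, 87, 24.60, 0.009, 0.604),
    (22, 67, 29.03, 0.012, 0.588),
    (27, 49, 25.85, 0.018, 0.539),
    (32, 45, 25.01, 0.056, 0.537),
    (37, 39, 23.66, 0.056, 0.498),
    (42, 30, 27.60, 0.082, 0.480),
    (47, 28, 29.39, 0.083, 0.497),
    (52, 26, 31.28, 0.082, 0.511),
    (57, 23, 34.51, 0.077, 0.533),
    (62, 20, 38.09, 0.113, 0.536),
]

def pick_compromise(rows, max_noise_pct):
    """Among runs with noise below max_noise_pct, return the one with the highest persistence."""
    eligible = [r for r in rows if r[2] < max_noise_pct]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r[3])

best = pick_compromise(sweep, max_noise_pct=25.0)
print(best)
```

With this threshold the rule selects the row for a minimum cluster size of 37, but a different threshold or a different ranking metric would lead to another choice.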

In [7]:
# Final clustering on the 2D UMAP projection with the selected min_cluster_size
clusterer = hdbscan.HDBSCAN(min_cluster_size=37, min_samples=5, metric="euclidean")
labels = clusterer.fit_predict(X_rfp_reduced)
df_reconnaissance_france_palestine["cluster"] = (labels).astype(str)
# Note: cluster.unique() includes the noise label '-1', so the printed count is clusters + 1 whenever noise points exist
print(f"Cluster: {len(df_reconnaissance_france_palestine.cluster.unique())}, Bruit: {len(df_reconnaissance_france_palestine[df_reconnaissance_france_palestine.cluster == '-1'])/len(df_reconnaissance_france_palestine)*100}%, Persistance: {(clusterer.cluster_persistence_).mean()}")

Cluster: 40, Bruit: 23.66169154228856%, Persistance: 0.05601262146102711


### Visualisation

In [None]:
path_folder_images_name = os.path.join(path, "img_data")
image_dir_reconnaissance_france_palestine = os.path.join(path_folder_images_name, subject_reconnaissance_france_palestine)
IMAGE_DIR = Path(image_dir_reconnaissance_france_palestine)

# ====== DataFrame preparation ======
# (df_reconnaissance_france_palestine must already contain x, y, cluster, filename)
from PIL import Image  # explicit import in case `helpers` does not re-export Image

def img_to_base64(path):
    """Return a PNG thumbnail of the image as a base64 data URI, or None if it cannot be read."""
    try:
        img = Image.open(path).convert("RGB")
        img.thumbnail((400, 400))
        buffer = BytesIO()
        img.save(buffer, format="PNG")
        return "data:image/png;base64," + base64.b64encode(buffer.getvalue()).decode()
    except Exception:
        return None

df_reconnaissance_france_palestine["img_b64"] = [img_to_base64(IMAGE_DIR / fn) for fn in df_reconnaissance_france_palestine["filename"]]

# ====== Scatter plot creation ======
fig = px.scatter(
    df_reconnaissance_france_palestine,
    x="x",
    y="y",
    color="cluster",
    hover_data=["filename"],
    custom_data=["filename"],
    title="Visual clustering of images (CLIP + HDBSCAN)",
)
fig.update_traces(marker=dict(size=6, opacity=0.8))

# ====== Dash application creation ======
app = Dash(__name__)
app.layout = html.Div([
    html.H3("Interactive exploration of image clusters (CLIP)"),
    html.Div([
        dcc.Graph(
            id="scatter",
            figure=fig,
            style={"width": "65vw", "height": "85vh", "display": "inline-block"},
        ),
        html.Div(
            id="image-display",
            style={"width": "30vw", "display": "inline-block", "verticalAlign": "top", "padding": "20px"},
        ),
    ]),
])

# ====== Callback: clicking a point displays the image ======
@app.callback(
    Output("image-display", "children"),
    Input("scatter", "clickData"),
)
def show_image(clickData):
    if clickData is None:
        return html.P("Click on a point to display the corresponding image.")
    
    filename = clickData["points"][0]["customdata"][0]
    path = IMAGE_DIR / filename
    if not path.exists():
        return html.P(f"Image not found: {filename}")
    
    img_b64 = img_to_base64(path)
    return html.Div([
        html.H4(filename),
        html.Img(src=img_b64, style={"maxWidth": "100%", "border": "2px solid #444"}),
    ])

# ====== Start the server ======
if __name__ == "__main__":
    app.run(debug=True, port=8054)

print("To open the visualization in a browser: http://127.0.0.1:8054")

To open the visualization in a browser: http://127.0.0.1:8054


# Conclusion

This semantic clustering experiment highlights how **CLIP embeddings** organize images based on their **conceptual meaning** rather than purely visual features.  

In the example above:
- Images referring to similar **themes, entities, or narratives** (e.g., *Zionism*, *political leaders*, *war scenes*) are grouped together in the embedding space.  
- Each cluster reflects a **shared semantic context**, even when the visual appearance differs significantly.  

This demonstrates the strength of **multimodal models** like CLIP, which align visual and textual information into a unified representation.  
By mapping both images and language into the same latent space, CLIP enables **semantic grouping, retrieval, and narrative analysis** — allowing us to explore how visual content connects to the broader discourse it represents.
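
To make the retrieval idea concrete, here is a minimal sketch of ranking images against a text query by cosine similarity. It uses toy 3-dimensional vectors; this notebook only loads image embeddings, so the text vector is an assumption standing in for the output of CLIP's text encoder, and `rank_by_cosine` is a hypothetical helper.

```python
import numpy as np

def rank_by_cosine(query, image_embeddings, top_k=3):
    """Return (indices, scores) of the top_k rows most similar to query by cosine similarity."""
    q = query / np.linalg.norm(query)
    X = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    scores = X @ q                       # cosine similarity of each image with the query
    order = np.argsort(-scores)[:top_k]  # highest similarity first
    return order, scores[order]

# Toy vectors standing in for CLIP image embeddings.
image_embeddings = np.array([
    [1.0, 0.1, 0.0],  # image 0
    [0.0, 1.0, 0.2],  # image 1
    [0.9, 0.0, 0.1],  # image 2
])
text_query = np.array([1.0, 0.0, 0.0])  # assumed to come from the text encoder

indices, scores = rank_by_cosine(text_query, image_embeddings, top_k=2)
print(indices, scores)
```

Here images 0 and 2 rank first because their vectors point in nearly the same direction as the query, which is the same notion of closeness that UMAP uses above through `metric='cosine'`.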
