# Notebook Objective

The primary objective of this notebook is to perform visual clustering of social-media images.
For this reason, we will exclusively import and use DINO embeddings, which are specifically designed to capture visual patterns, shapes, and textures rather than semantic meaning.

# Test Datasets - recap

We will test the clustering on two datasets:

1. **Recognition of the Palestinian State (08/13 – 08/29)**  
   This dataset was collected following France’s recognition of the State of Palestine.  
   It includes both **texts and their associated images**, totaling **5,055 images**.

2. **September 10 Demonstrations (08/20 – ~10/10)**  
   This dataset was collected during the period surrounding the **September 10 protest in France**.  
   It also contains **texts and their related images**, totaling **41,942 images**.

Only the **image URLs** are stored — the images themselves will be **downloaded dynamically** during processing.
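
Since only URLs are stored, each image has to be fetched before it can be processed. Below is a minimal sketch of such a fetch step, which assumes nothing about the helpers actually used in this project. `fetch_image_bytes` and its `opener` parameter are hypothetical names, and a failed download simply returns `None` so that one dead link does not stop the run.

```python
import io
import urllib.request
from contextlib import closing


def fetch_image_bytes(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return the raw bytes behind `url`, or None if the download fails.

    `opener` is injectable so the function can be exercised offline.
    """
    try:
        with closing(opener(url, timeout=timeout)) as response:
            return response.read()
    except (OSError, ValueError):
        # URLError, HTTPError and timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return None


# Offline demonstration with stand-in openers (no network access needed).
def fake_opener(url, timeout):
    return io.BytesIO(b"fake-image-bytes")


def failing_opener(url, timeout):
    raise OSError("simulated dead link")


ok = fetch_image_bytes("https://example.org/a.jpg", opener=fake_opener)
missing = fetch_image_bytes("https://example.org/b.jpg", opener=failing_opener)
print(ok, missing)
```

The returned bytes could then be wrapped in `BytesIO` and opened with `PIL.Image.open`, in the same way the visualisation cells open local files.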

In [None]:
from helpers import *

import umap.umap_ as umap
import hdbscan
from PIL import Image  # Library for opening, converting, and manipulating local images
from pathlib import Path 



from tqdm import tqdm  
import pandas as pd 
import numpy as np 
import warnings 
import logging  
import csv
import os


from dash import Dash, dcc, html, Input, Output
import plotly.express as px
from io import BytesIO
import base64


warnings.filterwarnings("ignore")
logging.getLogger("transformers").setLevel(logging.ERROR)
logging.getLogger("PIL").setLevel(logging.WARNING)



from dotenv import load_dotenv
load_dotenv()        
path = os.getenv("path_folder")


import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)




Device: cuda


## Import embeddings

- **`path_folder_images_embeddings`** → the **directory path** where the embeddings are saved as Parquet files (to be adapted as needed).
- **`subject`** → defines the dataset used, either `"10_septembre"` or `"reconnaissance_france_palestine"` (to be adapted).  

In [2]:
path_folder_images_embeddings = os.path.join(path, "img_embeddings")
subject_10_septembre = '10_septembre' 
subject_reconnaissance_france_palestine = 'reconnaissance_france_palestine'

path_10_septembre = f"{path_folder_images_embeddings}/{subject_10_septembre}_DINO.parquet"
path_reconnaissance_france_palestine = f"{path_folder_images_embeddings}/{subject_reconnaissance_france_palestine}_DINO.parquet"

# Load the embeddings
df_10_septembre = pd.read_parquet(path_10_septembre)
df_reconnaissance_france_palestine = pd.read_parquet(path_reconnaissance_france_palestine)

## 10 septembre  
We begin with the **“10 septembre”** dataset before moving on to the second one.  
This approach helps avoid overlapping or duplicated code blocks that perform the same operations with only minor name changes.


### Dimensionality reduction with UMAP

In [3]:
%%time
X_10_septembre_dino = np.vstack(df_10_septembre["embedding"].values)
print("Shape:", X_10_septembre_dino.shape)  # (nb_images, 1280)
X_10_septembre_reduced = umap.UMAP(n_components=2, random_state=42, n_neighbors=20, metric='cosine', min_dist=0, spread=1, n_jobs=-1).fit_transform(X_10_septembre_dino)
df_10_septembre["x"] = X_10_septembre_reduced[:,0]
df_10_septembre["y"] = X_10_septembre_reduced[:,1]

Shape: (39839, 1280)
CPU times: total: 3min 4s
Wall time: 52.9 s
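
UMAP is run here with `metric='cosine'`, so only the direction of each embedding matters, not its norm. The NumPy sketch below checks that property on random stand-in vectors (not the real embeddings). It also shows that after L2 normalization the squared Euclidean distance equals twice the cosine distance, so normalizing first and using a Euclidean metric would rank neighbors the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 1280))  # stand-in for DINO embeddings


def cosine_distance_matrix(X):
    """Pairwise cosine distances (1 - cosine similarity) between rows of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - Xn @ Xn.T


D_cos = cosine_distance_matrix(X)

# Rescaling each row by a different positive factor leaves cosine distances unchanged.
scales = rng.uniform(0.5, 5.0, size=(5, 1))
D_cos_scaled = cosine_distance_matrix(X * scales)

# On unit-norm vectors: ||a - b||^2 = 2 * (1 - cos(a, b)).
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
diff = Xn[:, None, :] - Xn[None, :, :]
D_euc_sq = (diff ** 2).sum(axis=-1)

print(np.allclose(D_cos, D_cos_scaled), np.allclose(D_euc_sq, 2 * D_cos))
```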


### HDBSCAN

In [4]:
test_performance(X_10_septembre_reduced, df_10_septembre)

i=2
Clusters: 1949, Noise: 28.67%, Persistence: 0.011, Silhouette: 0.615
i=7
Clusters: 1289, Noise: 24.66%, Persistence: 0.021, Silhouette: 0.631
i=12
Clusters: 797, Noise: 22.67%, Persistence: 0.023, Silhouette: 0.629
i=17
Clusters: 565, Noise: 21.71%, Persistence: 0.023, Silhouette: 0.602
i=22
Clusters: 448, Noise: 22.31%, Persistence: 0.023, Silhouette: 0.600
i=27
Clusters: 366, Noise: 22.48%, Persistence: 0.020, Silhouette: 0.593
i=32
Clusters: 333, Noise: 23.52%, Persistence: 0.019, Silhouette: 0.605
i=37
Clusters: 288, Noise: 24.19%, Persistence: 0.018, Silhouette: 0.594
i=42
Clusters: 258, Noise: 24.65%, Persistence: 0.015, Silhouette: 0.588
i=47
Clusters: 231, Noise: 25.30%, Persistence: 0.015, Silhouette: 0.584
i=52
Clusters: 214, Noise: 25.52%, Persistence: 0.015, Silhouette: 0.581
i=57
Clusters: 198, Noise: 27.26%, Persistence: 0.014, Silhouette: 0.587
i=62
Clusters: 183, Noise: 27.99%, Persistence: 0.011, Silhouette: 0.596
i=67
Clusters: 170, Noise: 29.34%, Persistence: 0.0

The best compromise was obtained with `min_cluster_size=177`: about 27% noise, a good silhouette value of 0.444, and the highest persistence (0.11).
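
To compare settings more systematically than by reading the log, the sweep output can be parsed into records and sorted. The sketch below assumes the log layout shown above, where an `i=<min_cluster_size>` line is followed by a metrics line. `parse_sweep` is a hypothetical name, the three sample entries reuse values from the first three settings, and the regex does not depend on how the noise and persistence labels are worded.

```python
import re

LOG = """\
i=2
Clusters: 1949, Noise: 28.67%, Persistence: 0.011, Silhouette: 0.615
i=7
Clusters: 1289, Noise: 24.66%, Persistence: 0.021, Silhouette: 0.631
i=12
Clusters: 797, Noise: 22.67%, Persistence: 0.023, Silhouette: 0.629
"""

# One "i=..." line followed by one metrics line; label names are matched loosely.
PATTERN = re.compile(
    r"i=(\d+)\s*\n"
    r"Clusters: (\d+), \w+: ([\d.]+)%, \w+: ([\d.]+), Silhouette: ([\d.]+)"
)


def parse_sweep(text):
    """Turn the sweep log into a list of dicts, one per min_cluster_size."""
    rows = []
    for size, n_clusters, noise, persistence, silhouette in PATTERN.findall(text):
        rows.append({
            "min_cluster_size": int(size),
            "n_clusters": int(n_clusters),
            "noise_pct": float(noise),
            "persistence": float(persistence),
            "silhouette": float(silhouette),
        })
    return rows


rows = parse_sweep(LOG)
best_by_silhouette = max(rows, key=lambda r: r["silhouette"])
lowest_noise = min(rows, key=lambda r: r["noise_pct"])
print(best_by_silhouette["min_cluster_size"], lowest_noise["min_cluster_size"])
```

The list of dicts can be passed directly to `pd.DataFrame(rows)` to sort or plot the trade-off between noise, persistence, and silhouette.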

In [7]:
clusterer = hdbscan.HDBSCAN(min_cluster_size=177, min_samples=5, metric="euclidean")
labels = clusterer.fit_predict(X_10_septembre_reduced)
df_10_septembre["cluster"] = (labels).astype(str)
print(f"Clusters: {len(df_10_septembre.cluster.unique())}, Noise: {len(df_10_septembre[df_10_septembre.cluster == '-1'])/len(df_10_septembre)*100}%, Persistence: {(clusterer.cluster_persistence_).mean()}")

Clusters: 45, Noise: 26.986119129496224%, Persistence: 0.10968416775996738
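
HDBSCAN marks unassigned points with the label `-1`, so counting unique labels, as the cell above does, includes the noise label whenever noise is present. The sketch below summarizes toy labels while keeping the two apart. `summarize_labels` is a hypothetical helper, not part of `helpers`.

```python
import numpy as np


def summarize_labels(labels):
    """Return (number of real clusters, noise percentage) for HDBSCAN-style labels."""
    labels = np.asarray(labels)
    noise_mask = labels == -1
    n_clusters = len(np.unique(labels[~noise_mask]))  # the noise label is excluded
    noise_pct = 100.0 * noise_mask.mean()
    return n_clusters, noise_pct


toy_labels = [0, 0, 1, -1, 2, 2, -1, 1]
n_clusters, noise_pct = summarize_labels(toy_labels)
print(f"Clusters: {n_clusters}, Noise: {noise_pct:.1f}%")
```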


### Visualisation

In [None]:
path_folder_images_name = os.path.join(path, "img_data")
image_dir_10_septembre = os.path.join(path_folder_images_name, subject_10_septembre)
IMAGE_DIR = Path(image_dir_10_septembre)

# ====== Prepare the dataframe ======
# (df_10_septembre must already contain x, y, cluster, filename)
def img_to_base64(path):
    """Return the image at `path` as a PNG thumbnail data URI, or None if it cannot be read."""
    try:
        img = Image.open(path).convert("RGB")
        img.thumbnail((400, 400))
        buffer = BytesIO()
        img.save(buffer, format="PNG")
        return "data:image/png;base64," + base64.b64encode(buffer.getvalue()).decode()
    except Exception:
        # Missing or unreadable image: skip it rather than abort
        return None

df_10_septembre["img_b64"] = [img_to_base64(IMAGE_DIR / fn) for fn in df_10_septembre["filename"]]

# ====== Create the scatter plot ======
fig = px.scatter(
    df_10_septembre,
    x="x",
    y="y",
    color="cluster",
    hover_data=["filename"],
    custom_data=["filename"],
    title="Visual clustering of images (DINOv2 + HDBSCAN)",
)
fig.update_traces(marker=dict(size=6, opacity=0.8))

# ====== Create the Dash application ======
app = Dash(__name__)
app.layout = html.Div([
    html.H3("Interactive exploration of image clusters (DINOv2)"),
    html.Div([
        dcc.Graph(
            id="scatter",
            figure=fig,
            style={"width": "65vw", "height": "85vh", "display": "inline-block"},
        ),
        html.Div(
            id="image-display",
            style={"width": "30vw", "display": "inline-block", "verticalAlign": "top", "padding": "20px"},
        ),
    ]),
])

# ====== Callback: clicking a point displays the image ======
@app.callback(
    Output("image-display", "children"),
    Input("scatter", "clickData"),
)
def show_image(clickData):
    if clickData is None:
        return html.P("Click on a point to display the corresponding image.")
    
    filename = clickData["points"][0]["customdata"][0]
    path = IMAGE_DIR / filename
    if not path.exists():
        return html.P(f"Image not found: {filename}")
    
    img_b64 = img_to_base64(path)
    return html.Div([
        html.H4(filename),
        html.Img(src=img_b64, style={"maxWidth": "100%", "border": "2px solid #444"}),
    ])


# ====== Start the server ======
if __name__ == "__main__":
    app.run(debug=True, port=8050)

print("To open the visualization in a browser: http://127.0.0.1:8050")
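
The callback passes the image to `html.Img` as a base64 data URI. The standard-library sketch below shows how such a URI is assembled and decoded back, using placeholder bytes rather than a real PNG. `to_data_uri` and `from_data_uri` are hypothetical names.

```python
import base64


def to_data_uri(raw, mime="image/png"):
    """Encode raw bytes as a data URI usable in an <img src=...> attribute."""
    return f"data:{mime};base64," + base64.b64encode(raw).decode("ascii")


def from_data_uri(uri):
    """Recover the raw bytes from a base64 data URI."""
    _header, payload = uri.split(",", 1)
    return base64.b64decode(payload)


raw = b"\x89PNG placeholder bytes"
uri = to_data_uri(raw)
print(uri[:22])
```

Because base64 inflates the payload by roughly a third, thumbnailing before encoding, as `img_to_base64` does, keeps the response sent to the browser small.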

## Recognition of the Palestinian State

### Dimensionality reduction with UMAP

In [None]:
%%time
X_rfp_dino = np.vstack(df_reconnaissance_france_palestine["embedding"].values)
print("Shape:", X_rfp_dino.shape)  # (nb_images, 1280)
X_rfp_reduced = umap.UMAP(n_components=2, random_state=42, n_neighbors=20, metric='cosine', min_dist=0, spread=1, n_jobs=-1).fit_transform(X_rfp_dino)
df_reconnaissance_france_palestine["x"] = X_rfp_reduced[:,0]
df_reconnaissance_france_palestine["y"] = X_rfp_reduced[:,1]

### HDBSCAN

In [None]:
test_performance(X_rfp_reduced, df_reconnaissance_france_palestine)

The best compromise was obtained with `min_cluster_size=44` (noise xx%, a good silhouette value of xx, and the highest persistence, xx).

In [None]:
clusterer = hdbscan.HDBSCAN(min_cluster_size=44, min_samples=5, metric="euclidean")  # 44, as selected above for this dataset
labels = clusterer.fit_predict(X_rfp_reduced)
df_reconnaissance_france_palestine["cluster"] = (labels).astype(str)
print(f"Clusters: {len(df_reconnaissance_france_palestine.cluster.unique())}, Noise: {len(df_reconnaissance_france_palestine[df_reconnaissance_france_palestine.cluster == '-1'])/len(df_reconnaissance_france_palestine)*100}%, Persistence: {(clusterer.cluster_persistence_).mean()}")

### Visualisation

In [None]:
path_folder_images_name = os.path.join(path, "img_data")
image_dir_reconnaissance_france_palestine = os.path.join(path_folder_images_name, subject_reconnaissance_france_palestine)
IMAGE_DIR = Path(image_dir_reconnaissance_france_palestine)

# ====== Prepare the dataframe ======
# (df_reconnaissance_france_palestine must already contain x, y, cluster, filename)
def img_to_base64(path):
    """Return the image at `path` as a PNG thumbnail data URI, or None if it cannot be read."""
    try:
        img = Image.open(path).convert("RGB")
        img.thumbnail((400, 400))
        buffer = BytesIO()
        img.save(buffer, format="PNG")
        return "data:image/png;base64," + base64.b64encode(buffer.getvalue()).decode()
    except Exception:
        # Missing or unreadable image: skip it rather than abort
        return None

df_reconnaissance_france_palestine["img_b64"] = [img_to_base64(IMAGE_DIR / fn) for fn in df_reconnaissance_france_palestine["filename"]]

# ====== Create the scatter plot ======
fig = px.scatter(
    df_reconnaissance_france_palestine,
    x="x",
    y="y",
    color="cluster",
    hover_data=["filename"],
    custom_data=["filename"],
    title="Visual clustering of images (DINOv2 + HDBSCAN)",
)
fig.update_traces(marker=dict(size=6, opacity=0.8))

# ====== Create the Dash application ======
app = Dash(__name__)
app.layout = html.Div([
    html.H3("Interactive exploration of image clusters (DINOv2)"),
    html.Div([
        dcc.Graph(
            id="scatter",
            figure=fig,
            style={"width": "65vw", "height": "85vh", "display": "inline-block"},
        ),
        html.Div(
            id="image-display",
            style={"width": "30vw", "display": "inline-block", "verticalAlign": "top", "padding": "20px"},
        ),
    ]),
])

# ====== Callback: clicking a point displays the image ======
@app.callback(
    Output("image-display", "children"),
    Input("scatter", "clickData"),
)
def show_image(clickData):
    if clickData is None:
        return html.P("Click on a point to display the corresponding image.")
    
    filename = clickData["points"][0]["customdata"][0]
    path = IMAGE_DIR / filename
    if not path.exists():
        return html.P(f"Image not found: {filename}")
    
    img_b64 = img_to_base64(path)
    return html.Div([
        html.H4(filename),
        html.Img(src=img_b64, style={"maxWidth": "100%", "border": "2px solid #444"}),
    ])


# ====== Start the server ======
if __name__ == "__main__":
    app.run(debug=True, port=8050)

print("To open the visualization in a browser: http://127.0.0.1:8050")

# Conclusion

This visual clustering experiment demonstrates how **DINO embeddings** can effectively group images according to **visual similarity** rather than textual or semantic content.  

In the example above:
- Images sharing similar **visual patterns**—such as **maps of Palestine** or **portraits of political figures**—naturally cluster together in the 2D projection.  
- Each color represents a distinct **visual cluster**, learned purely from pixel-level and structural features.  

This suggests that **self-supervised visual representations** (like DINO) are powerful tools for exploring large collections of images, detecting **recurring motifs**, and revealing **visual narratives** that may not be captured by text-aligned models such as CLIP.  
