# Roadmap of this notebook
    Image URLs -> Download images -> Apply embedding model -> Store embeddings

## Notebook Objective

This notebook presents a complete pipeline for **collecting, encoding, and storing visual embeddings** from social-media images related to two key topics:  
- **“10 septembre”** (September 10th protests)  
- **“Reconnaissance France–Palestine”** (recognition of Palestine by France)

The objective is to build **reusable, high-quality image embeddings** with state-of-the-art deep-learning models so that they can be reused in other analyses in this repository.

# Test Datasets

We will test the search engine on two datasets:

1. **Recognition of the Palestinian State (08/13 – 08/29)**  
   This dataset was collected following France’s recognition of the State of Palestine.  
   It includes both **texts and their associated images**, totaling **5,055 images**.

2. **September 10 Demonstrations (08/20 – ~10/10)**  
   This dataset was collected during the period surrounding the **September 10 protest in France**.  
   It also contains **texts and their related images**, totaling **41,942 images**.

Only the **image URLs** are stored — the images themselves will be **downloaded dynamically** during processing.

In [None]:
from helpers import *
import pandas as pd 


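import warnings
import logging
# Silence Python warnings and reduce library log noise
warnings.filterwarnings("ignore")
logging.getLogger("transformers").setLevel(logging.ERROR)  # the Hugging Face logger is named "transformers"
logging.getLogger("PIL").setLevel(logging.WARNING)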


from dotenv import load_dotenv
import os
load_dotenv()        
path = os.getenv("path_folder")


from huggingface_hub import login
token = os.getenv("hugging_face_token")
# Authenticate this Python environment with a Hugging Face account.
# This is required to download gated models such as DINOv3.
# The access token is created in the Hugging Face account settings.
login(token=token)


import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)

## Parameters
These paths define where the image URL files are read from and where the images and embeddings are stored.

In [None]:
path_folder_images_name = os.path.join(path, "img_data")
path_folder_images_urls = os.path.join(path, "img_urls")  # Folder containing the image URL files
path_folder_images_embeddings = os.path.join(path, "img_embeddings")

subject_10_septembre = '10_septembre' 
subject_reconnaissance_france_palestine = 'reconnaissance_france_palestine'

image_dir_10_septembre = os.path.join(path_folder_images_name, subject_10_septembre)
image_dir_reconnaissance_france_palestine = os.path.join(path_folder_images_name, subject_reconnaissance_france_palestine)

Device: cuda


- **`path_folder_images_name`** → the **directory path** where the downloaded images are saved (adapt as needed).
- **`path_folder_images_urls`** → the **directory path** containing the CSV files that list the **image URLs** (adapt as needed).
- **`path_folder_images_embeddings`** → the **directory path** where the embeddings are saved as Parquet files (adapt as needed).
- **`subject_10_septembre`**, **`subject_reconnaissance_france_palestine`** → the dataset names, `"10_septembre"` and `"reconnaissance_france_palestine"`. They are used to build the URL file names, the image subfolders, and the embedding file names.
- **`image_dir_10_septembre`**, **`image_dir_reconnaissance_france_palestine`** → the **subfolders** of `path_folder_images_name` into which the images of each dataset are downloaded.
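
The folders listed above have to exist before anything is read or written. The following is a minimal sketch, not part of `helpers`, of how they could be validated and created up front. The function name `prepare_folders` is hypothetical; the subfolder names are the ones used in this notebook.

```python
import os

def prepare_folders(base_path, subfolders=("img_data", "img_urls", "img_embeddings")):
    """Return {name: path} for each subfolder, creating any that are missing."""
    if not base_path:
        # os.getenv returns None when the variable is absent from the .env file
        raise ValueError("base path is empty; check `path_folder` in the .env file")
    folders = {}
    for name in subfolders:
        folder = os.path.join(base_path, name)
        os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
        folders[name] = folder
    return folders
```

Calling `prepare_folders(path)` right after `load_dotenv()` would fail early with a readable message if `path_folder` is not set, instead of failing later inside `os.path.join`.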

## Loading URLs

In [None]:
# 10 Septembre dataset
# os.path.join keeps the path portable (a hard-coded backslash only works on Windows)
df_urls_10_septembre = pd.read_csv(os.path.join(path_folder_images_urls, f"{subject_10_septembre}.csv"))
urls_10_septembre = df_urls_10_septembre["url"].to_list()

# Reconnaissance France Palestine dataset
df_urls_reconnaissance_france_palestine = pd.read_csv(os.path.join(path_folder_images_urls, f"{subject_reconnaissance_france_palestine}.csv"))
urls_reconnaissance_france_palestine = df_urls_reconnaissance_france_palestine["url"].to_list()
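
URL lists collected from social media often contain empty cells and repeated links. The sketch below shows one possible way to clean such a list before downloading. `clean_urls` is a hypothetical helper, not a function from `helpers`, and the example URLs are placeholders.

```python
def clean_urls(urls):
    """Drop missing, non-HTTP, and duplicate entries while keeping the original order."""
    seen = set()
    cleaned = []
    for url in urls:
        if not isinstance(url, str):
            continue  # pandas represents empty CSV cells as float NaN
        url = url.strip()
        if not url.startswith(("http://", "https://")):
            continue
        if url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned

raw = [
    "https://example.com/a.jpg",
    float("nan"),
    " https://example.com/a.jpg ",
    "ftp://example.com/b.jpg",
    "https://example.com/c.png",
]
print(clean_urls(raw))  # ['https://example.com/a.jpg', 'https://example.com/c.png']
```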

## Downloading images (run once)

Once the URLs are loaded, the images can be downloaded.

In [None]:
# 10 Septembre Dataset
results_10_s, failed_10_s = parallel_download(urls_10_septembre, path_folder_images_name, subject_10_septembre, return_failed_csv=True, csv_name="failed_urls_10_septembre.csv",timeout=2) # Parallel image downloading with error and retry handling

# Reconnaissance France Palestine Dataset
results_r_f_p, failed_r_f_p = parallel_download(urls_reconnaissance_france_palestine, path_folder_images_name, subject_reconnaissance_france_palestine, return_failed_csv=True, csv_name="failed_urls_reconnaissance_france_palestine.csv", timeout=2) # Parallel image downloading with error and retry handling

Since some images on social media may **no longer exist** or may cause **request errors**, setting **`return_failed_csv = True`** enables the creation of a **CSV file** that lists all **URLs that could not be downloaded** during the process.

The code above should be executed **only once**.  
After running it, all images will be **downloaded locally**, and can then be **reused directly** in subsequent steps without re-downloading.
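
The implementation of `parallel_download` lives in `helpers` and is not shown here. As an illustration of the general technique only, the sketch below downloads URLs concurrently with a thread pool and collects the URLs that failed. The names `fetch_bytes` and `download_all`, the hashed file names, and the `.img` extension are assumptions made for this example. Unlike the helper, this sketch does not retry.

```python
import hashlib
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_bytes(url, timeout=2):
    """Download one URL and return its raw bytes (raises on any network error)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()

def download_all(urls, out_dir, fetch=fetch_bytes, max_workers=8):
    """Download URLs concurrently; return (saved_paths, failed_urls)."""
    os.makedirs(out_dir, exist_ok=True)

    def task(url):
        data = fetch(url)
        # Hash the URL to get a stable, filesystem-safe file name
        name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".img"
        target = os.path.join(out_dir, name)
        with open(target, "wb") as handle:
            handle.write(data)
        return target

    saved, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(task, url): url for url in urls}
        for future in as_completed(futures):
            try:
                saved.append(future.result())
            except Exception:
                # Deleted posts, timeouts, and HTTP errors all end up here
                failed.append(futures[future])
    return saved, failed
```

Threads suit this task because the work is dominated by waiting on the network. Passing the `fetch` function as a parameter makes it possible to exercise the logic without network access.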

## Applying state-of-the-art models

We apply the model **`laion/CLIP-ViT-H-14-laion2B-s32B-b79K`** to generate embeddings for semantic comparison and retrieval.

We apply the model **`facebook/dinov3-vith16plus-pretrain-lvd1689m`** to generate embeddings for visual comparison.


- **CLIP (semantic model):** encodes the meaning of an image, which is useful for text–image retrieval or topic clustering.
- **DINO (visual model):** encodes purely visual similarity (color, texture, layout, etc.), which is ideal for clustering similar pictures, memes, or logos.

### Watch out
`facebook/dinov3-vith16plus-pretrain-lvd1689m` is gated on Hugging Face, meaning you need to be logged in to access it. Alternatively, you can use an openly available model such as `facebook/dinov2-large`.
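
Both models return one vector per image, and comparing two images then usually comes down to comparing their vectors. A common choice is cosine similarity. The sketch below is a generic illustration on toy vectors, not code taken from `helpers`. The function name `cosine_similarity_matrix` is introduced here for the example.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Return the pairwise cosine similarity between the rows of `embeddings`."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T

# Toy vectors: the first two point in the same direction, the third is orthogonal
toy = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
sim = cosine_similarity_matrix(toy)
print(sim[0, 1])  # 1.0
print(sim[0, 2])  # 0.0
```

Because each row is normalized first, the result depends only on the direction of the vectors, not on their length.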

In [4]:
# Load model directly
processor_sem, model_sem = upload_clip("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
processor_vis, model_vis = upload_dino("facebook/dinov3-vith16plus-pretrain-lvd1689m")  # Default model: "facebook/dinov2-large"

 Modèle laion/CLIP-ViT-H-14-laion2B-s32B-b79K chargé sur cuda
Modèle: facebook/dinov3-vith16plus-pretrain-lvd1689m chargé sur cuda


## Encoding “reconnaissance france palestine” Images

### CLIP

In [None]:
dict_clip_reconnaissance_france_palestine = encode_with_clip(image_dir=image_dir_reconnaissance_france_palestine, processor=processor_sem, model=model_sem, batch_size=16)
df_sem_reconnaissance_france_palestine = dict_clip_reconnaissance_france_palestine['image_embeddings']

# Store embeddings as a Parquet file
df_sem_reconnaissance_france_palestine.to_parquet(fr"{path_folder_images_embeddings}/{subject_reconnaissance_france_palestine}_CLIP.parquet", index=False)

Preparing image paths...


Encoding image batches: 100%|██████████| 315/315 [02:11<00:00,  2.40it/s]


### DINO

In [13]:
df_vis_reconnaissance_france_palestine = encode_with_dino(image_dir=image_dir_reconnaissance_france_palestine, processor=processor_vis, model=model_vis, batch_size=16)

# Store embeddings as a Parquet file
df_vis_reconnaissance_france_palestine.to_parquet(fr"{path_folder_images_embeddings}/{subject_reconnaissance_france_palestine}_DINO.parquet", index=False)

Encoding batches: 100%|██████████| 315/315 [02:15<00:00,  2.32it/s]
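
Later analyses will usually need the stored embeddings as a numeric matrix rather than a table. The sketch below shows one way to do the conversion. It assumes that each row holds its vector in a single column named `embedding`. That column name is hypothetical, since the actual schema is defined by `encode_with_clip` and `encode_with_dino` in `helpers`.

```python
import numpy as np
import pandas as pd

def embeddings_to_matrix(df, column="embedding"):
    """Stack a column of per-image vectors into an (n_images, dim) float32 array."""
    return np.array(df[column].tolist(), dtype=np.float32)

# Toy table standing in for a DataFrame loaded with pd.read_parquet(...)
toy = pd.DataFrame(
    {
        "image": ["a.jpg", "b.jpg"],
        "embedding": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    }
)
matrix = embeddings_to_matrix(toy)
print(matrix.shape)  # (2, 3)
```

If the helpers instead store one column per embedding dimension, selecting those columns and calling `.to_numpy()` would serve the same purpose.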


## Encoding “10 septembre” Images

### CLIP

In [14]:
dict_clip_10_septembre = encode_with_clip(image_dir=image_dir_10_septembre, processor=processor_sem, model=model_sem, batch_size=32)
df_sem_10_septembre = dict_clip_10_septembre['image_embeddings']

# Store embeddings as a Parquet file
df_sem_10_septembre.to_parquet(fr"{path_folder_images_embeddings}/{subject_10_septembre}_CLIP.parquet", index=False)

Preparing image paths...


Encoding image batches: 100%|██████████| 1245/1245 [17:47<00:00,  1.17it/s]


### DINO

In [15]:
df_vis_10_septembre = encode_with_dino(image_dir=image_dir_10_septembre, processor=processor_vis, model=model_vis, batch_size=16)

# Store embeddings as a Parquet file
df_vis_10_septembre.to_parquet(fr"{path_folder_images_embeddings}/{subject_10_septembre}_DINO.parquet", index=False)

Encoding batches: 100%|██████████| 2490/2490 [17:23<00:00,  2.39it/s]
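
The progress bars give a quick sanity check on how many images were actually encoded. The sketch below assumes that each progress-bar step is one batch and that every batch except the last is full. Neither assumption is confirmed by the `helpers` code. The function name `image_count_bounds` is introduced here for illustration.

```python
def image_count_bounds(n_batches, batch_size):
    """Return the (min, max) number of images that fit in `n_batches` batches."""
    lowest = (n_batches - 1) * batch_size + 1  # last batch holds a single image
    highest = n_batches * batch_size           # last batch is full
    return lowest, highest

print(image_count_bounds(315, 16))   # (5025, 5040)
print(image_count_bounds(2490, 16))  # (39825, 39840)
```

Under these assumptions, both ranges fall below the dataset totals quoted earlier (5,055 and 41,942 images). This would be consistent with some URLs failing to download or some files being unreadable.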
