
"Uncovering Hidden Bias in Image Datasets Using CLIP-based Semantic Clustering"

🧠 Project Goal:
Use CLIP embeddings to cluster images by semantic content, then inspect the clusters for imbalance, stereotype reinforcement, or redundancy in large-scale image datasets, such as the image-caption pairs used to train or evaluate vision-language models.

💡 Use Case Examples:
Stock photo datasets (e.g., Unsplash, OpenImages): Do all “doctors” appear as men?

Social media image posts: Are certain activities (e.g., cooking, driving) associated with only one gender or ethnicity?

AI-generated images: Do text-to-image models reproduce biased visual stereotypes?

🧰 Tools & Libraries:
🤖 CLIP (OpenAI or HuggingFace version)

📊 scikit-learn (for KMeans or DBSCAN clustering)

📈 UMAP/t-SNE (for visualization)

🖼️ Matplotlib / Plotly (interactive visuals)

🧠 (Optional) DINOv2 or BLIP for comparison

✅ Workflow (with Code Skeletons):
1. Load & Preprocess Images


In [None]:
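import os
from PIL import Image
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image_folder = "data/images"
# Keep only image files, in a stable order, so indices line up with cluster labels later.
image_exts = (".jpg", ".jpeg", ".png", ".webp", ".bmp")
image_paths = sorted(
    os.path.join(image_folder, f)
    for f in os.listdir(image_folder)
    if f.lower().endswith(image_exts)
)

# This loads every image into memory at once; process in batches for large datasets.
images = [preprocess(Image.open(path)).unsqueeze(0) for path in image_paths]
image_tensor = torch.cat(images, dim=0).to(device)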

2. Generate CLIP Embeddings

In [None]:
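with torch.no_grad():
    image_features = model.encode_image(image_tensor).float()  # half precision on GPU; cast to float32
    # L2-normalize so Euclidean distances in KMeans track cosine similarity.
    image_features = (image_features / image_features.norm(dim=-1, keepdim=True)).cpu().numpy()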
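
The project goal also mentions redundancy. One way to surface it directly from the embeddings is to flag pairs of images whose cosine similarity is very high. The sketch below is a minimal illustration, not part of the original pipeline: the helper name `find_near_duplicates` and the 0.95 threshold are assumptions, and the full similarity matrix it builds only suits datasets of a few thousand images.

```python
import numpy as np

def find_near_duplicates(features, threshold=0.95):
    """Return (i, j, cosine_similarity) for image pairs at or above threshold."""
    feats = np.asarray(features, dtype=np.float64)
    # Normalize rows so the dot product below is cosine similarity.
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = feats @ feats.T  # full n x n matrix; memory grows quadratically
    rows, cols = np.triu_indices(len(feats), k=1)  # visit each unordered pair once
    pair_sims = sims[rows, cols]
    keep = pair_sims >= threshold
    return [
        (int(i), int(j), float(s))
        for i, j, s in zip(rows[keep], cols[keep], pair_sims[keep])
    ]
```

Calling `find_near_duplicates(image_features)` returns index pairs, which map back to files through `image_paths[i]` and `image_paths[j]`. The threshold needs tuning per dataset.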

3. Cluster Images by Semantic Similarity

In [None]:
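from sklearn.cluster import KMeans

n_clusters = 10  # starting guess; tune for your dataset
# n_init is set explicitly because its default changed across scikit-learn versions.
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
labels = kmeans.fit_predict(image_features)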
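
The value `n_clusters = 10` is a starting guess. A common way to compare alternatives is the silhouette score, which rewards tight, well-separated clusters. The sketch below is one hedged option rather than the notebook's method: `pick_n_clusters` and its candidate list are hypothetical, and every candidate must be at least 2.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_n_clusters(features, candidates=(2, 3, 4, 5, 8, 10), seed=42):
    """Return (best_k, scores), where scores maps each k to its silhouette score."""
    features = np.asarray(features)
    scores = {}
    for k in candidates:
        if k >= len(features):
            continue  # silhouette needs fewer clusters than samples
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    # Higher silhouette means tighter, better separated clusters.
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

Silhouette scores on real CLIP embeddings are often low and close together, so treat the result as a hint and confirm it by looking at the clusters.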

4. Visualize Clusters

In [None]:
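from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# perplexity must be smaller than the number of images; the seed keeps the layout repeatable.
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
projected = tsne.fit_transform(image_features)

plt.figure(figsize=(10, 8))
scatter = plt.scatter(projected[:, 0], projected[:, 1], c=labels, cmap='tab10', s=10)
plt.colorbar(scatter, label="Cluster ID")
plt.title("CLIP-based Image Clusters")
plt.show()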

5. Investigate Bias or Imbalance
Manually inspect samples from each cluster:

In [None]:
import shutil

# Copy each image into a folder named after its cluster for manual review.
for path, label in zip(image_paths, labels):
    cluster_folder = f"clusters/cluster_{label}"
    os.makedirs(cluster_folder, exist_ok=True)
    shutil.copy(path, os.path.join(cluster_folder, os.path.basename(path)))


Then open the folders. Are some clusters dominated by certain demographics, contexts, or aesthetics?
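
Before opening the folders, it can help to see how unevenly the images spread across clusters. The sketch below is an illustrative add-on with a hypothetical helper, `cluster_balance`. It reports each cluster's share of the dataset plus a normalized entropy, where 1.0 means perfectly even cluster sizes. This only shows which visual themes dominate; judging demographics or stereotypes still requires looking at the images.

```python
import math
from collections import Counter

def cluster_balance(labels):
    """Return (shares, evenness): per-cluster fraction and normalized entropy in [0, 1]."""
    counts = Counter(int(label) for label in labels)
    total = sum(counts.values())
    shares = {c: n / total for c, n in sorted(counts.items())}
    if len(shares) < 2:
        return shares, 1.0  # a single cluster is trivially "even"
    entropy = -sum(p * math.log(p) for p in shares.values())
    # Divide by the maximum possible entropy so the score is comparable across k.
    return shares, entropy / math.log(len(shares))
```

For example, `shares, evenness = cluster_balance(labels)` followed by sorting `shares` by value shows the largest clusters first.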

📈 Bonus Ideas for Expansion:
Pair with image captions and check language patterns (e.g., are certain clusters described with gendered or emotional language?)

Run Diversity Scores: compute intra-cluster visual diversity

Use this pipeline to clean, rebalance, or augment datasets for fine-tuning VLMs
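
The diversity score idea can be sketched as the mean pairwise cosine distance between embeddings in the same cluster. This is one possible definition, not an established metric from the notebook, and `intra_cluster_diversity` is a hypothetical helper. Values near 0 suggest a cluster of near-identical images; higher values suggest more visual variety.

```python
import numpy as np

def intra_cluster_diversity(features, labels):
    """Mean pairwise cosine distance within each cluster (0 = identical directions)."""
    feats = np.asarray(features, dtype=np.float64)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    labels = np.asarray(labels)
    scores = {}
    for cluster_id in np.unique(labels):
        members = feats[labels == cluster_id]
        n = len(members)
        if n < 2:
            scores[int(cluster_id)] = 0.0  # no pairs to compare
            continue
        sims = members @ members.T
        # Average the off-diagonal entries only, excluding self-similarity.
        mean_sim = (sims.sum() - np.trace(sims)) / (n * (n - 1))
        scores[int(cluster_id)] = float(1.0 - mean_sim)
    return scores
```

Running `intra_cluster_diversity(image_features, labels)` gives one score per cluster, which can be compared against cluster size to spot large but visually homogeneous groups.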


“I used CLIP-based clustering to surface latent semantic patterns in multimodal training data. For example, I found that certain clusters overrepresented Western-centric business settings when prompted with 'CEO'. Based on this, I proposed rebalancing samples and flagged stereotype-enforcing patterns before model fine-tuning.”