CMMD (CLIP Maximum Mean Discrepancy): This metric is based on richer CLIP embeddings and the maximum mean discrepancy (MMD) distance with the Gaussian RBF kernel. Unlike FID, CMMD is an unbiased estimator that makes no assumptions about the probability distribution of the embeddings and is sample-efficient. Extensive experiments and analyses have shown that FID-based evaluation of text-to-image models may be unreliable, whereas CMMD offers a more robust and reliable assessment of image quality.
https://www.mind-verse.de/news/qualitaetsmessung-bildgenerierung-herausforderungen-neue-wege
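
For reference, the squared MMD between two distributions and the Gaussian RBF kernel named in the description are usually written as below. The symbols P, Q (the two embedding distributions), k (the kernel) and σ (the kernel bandwidth) are notation chosen here for illustration, not taken from the quoted source.

```latex
\mathrm{MMD}^2(P, Q)
  = \mathbb{E}_{x, x' \sim P}\bigl[k(x, x')\bigr]
  + \mathbb{E}_{y, y' \sim Q}\bigl[k(y, y')\bigr]
  - 2\,\mathbb{E}_{x \sim P,\; y \sim Q}\bigl[k(x, y)\bigr],
\qquad
k(x, y) = \exp\!\left(-\frac{\lVert x - y \rVert^2}{2\sigma^2}\right)
```

The value is zero when P and Q coincide and grows as the two sets of embeddings drift apart.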

https://github.com/openai/CLIP


CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and 3. We found CLIP matches the performance of the original ResNet50 on ImageNet “zero-shot” without using any of the original 1.28M labeled examples, overcoming several major challenges in computer vision.
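
To make the "zero-shot" idea concrete, here is a minimal sketch of the scoring step only: cosine similarity between one image embedding and several text embeddings, followed by a softmax. The function name `zero_shot_probs`, the 3-dimensional toy vectors and the captions in the comments are hypothetical stand-ins; real embeddings would come from `model.encode_image` and `model.encode_text`. The scale factor of 100 is an assumption modeled on the usage example in the CLIP repository.

```python
import numpy as np

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and several texts."""
    image_emb = np.asarray(image_emb, dtype=np.float64)
    text_embs = np.asarray(text_embs, dtype=np.float64)
    # L2-normalize so that the dot product equals the cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = logit_scale * (text_embs @ image_emb)
    logits = logits - logits.max()  # subtract the maximum for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy stand-ins for CLIP embeddings (hypothetical values, not real model output)
text_embs = np.array([[1.0, 0.0, 0.0],   # e.g. "a photo of a pizza"
                      [0.0, 1.0, 0.0]])  # e.g. "a photo of a cake"
image_emb = np.array([0.9, 0.1, 0.0])

probs = zero_shot_probs(image_emb, text_embs)
print(probs.argmax())  # index of the best-matching caption
```

The predicted label is simply the caption with the highest probability; no classifier is trained on the target classes.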

In [3]:
# Import modules
import numpy as np
import torch
import clip
from PIL import Image
from scipy.spatial.distance import cdist
import glob

# Load the CLIP model and its image preprocessing transform
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)


In [4]:
# Define a function to compute CLIP embeddings for an image
def get_clip_embeddings(image):
    # Convert image to tensor
    image = preprocess(image).unsqueeze(0).to(device)
    # Apply model and normalize output
    with torch.no_grad():
        image_features = model.encode_image(image)
        image_features /= image_features.norm(dim=-1, keepdim=True)
    # Return embeddings as numpy array
    return image_features.cpu().numpy()

In [5]:
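# Define a function to compute the Maximum Mean Discrepancy (MMD) between two embedding sets.
# Note: as written, the kernel is exp(-||x - y|| / sigma) on plain Euclidean distances,
# i.e. an exponential (Laplacian-type) kernel. A Gaussian RBF kernel would instead use
# exp(-||x - y||^2 / (2 * sigma^2)).
def mmd_rbf(source, target, sigma=1.0):
    # Compute pairwise Euclidean distances (cdist defaults to metric="euclidean")
    xx = cdist(source, source)
    yy = cdist(target, target)
    xy = cdist(source, target)
    # Compute kernel values
    k_xx = np.exp(-xx / sigma)
    k_yy = np.exp(-yy / sigma)
    k_xy = np.exp(-xy / sigma)
    # Compute squared MMD as means over all pairs, including i == j (the biased estimate)
    mmd = k_xx.mean() + k_yy.mean() - 2 * k_xy.mean()
    return mmd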
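
The cell above exponentiates the plain Euclidean distance and averages over all pairs, including each point with itself. For comparison, here is a minimal sketch of the unbiased squared-MMD estimate with a Gaussian RBF kernel, which is closer to the description at the top of the notebook. The names `sq_dists` and `mmd2_gaussian_unbiased`, the bandwidth default `sigma=1.0` and the random test data are assumptions for illustration; this is not the reference CMMD implementation, and its values are not comparable to the numbers printed below.

```python
import numpy as np

def sq_dists(a, b):
    """Pairwise squared Euclidean distances between the rows of a and the rows of b."""
    a2 = (a ** 2).sum(axis=1)[:, None]
    b2 = (b ** 2).sum(axis=1)[None, :]
    # Clamp tiny negative values caused by floating-point rounding
    return np.maximum(a2 + b2 - 2.0 * (a @ b.T), 0.0)

def mmd2_gaussian_unbiased(x, y, sigma=1.0):
    """Unbiased estimate of squared MMD with kernel exp(-||x - y||^2 / (2 sigma^2))."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    m, n = len(x), len(y)
    if m < 2 or n < 2:
        raise ValueError("each set needs at least two samples")
    gamma = 1.0 / (2.0 * sigma ** 2)
    k_xx = np.exp(-gamma * sq_dists(x, x))
    k_yy = np.exp(-gamma * sq_dists(y, y))
    k_xy = np.exp(-gamma * sq_dists(x, y))
    # Within-set sums exclude the i == j terms; this is what removes the bias
    sum_xx = k_xx.sum() - np.trace(k_xx)
    sum_yy = k_yy.sum() - np.trace(k_yy)
    return sum_xx / (m * (m - 1)) + sum_yy / (n * (n - 1)) - 2.0 * k_xy.mean()

# Synthetic stand-ins for embeddings (hypothetical data, not CLIP output)
rng = np.random.default_rng(0)
same_a = rng.normal(size=(200, 8))
same_b = rng.normal(size=(200, 8))
shifted = rng.normal(loc=1.0, size=(200, 8))

near = mmd2_gaussian_unbiased(same_a, same_b)
far = mmd2_gaussian_unbiased(same_a, shifted)
print(f"same distribution: {near:.4f}, shifted distribution: {far:.4f}")
```

Because the self-pairs are left out, the unbiased estimate can come out slightly negative when the two sets are very similar, whereas the all-pairs average used in the cell above returns exactly 0 for identical sets.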


4 AI pizzas vs. all real pizzas

In [32]:
# Load real and generated images
real_images = [Image.open(filename) for filename in glob.glob("input/pizza_not_pizza/pizza/*.jpg")] 
gen_images = [Image.open(filename) for filename in glob.glob("generated/*.png")] 

# Compute CLIP embeddings for both image sets
real_embeddings = np.vstack([get_clip_embeddings(image) for image in real_images])
gen_embeddings = np.vstack([get_clip_embeddings(image) for image in gen_images])

# Compute CMMD between the two embedding distributions
cmmd = mmd_rbf(real_embeddings, gen_embeddings)
print(f"CMMD: {cmmd:.4f}")


CMMD: 0.2772


4 AI pizzas vs. all real non-pizzas

In [34]:
# Load real and generated images
real_images = [Image.open(filename) for filename in glob.glob("input/pizza_not_pizza/not_pizza/*.jpg")] 
gen_images = [Image.open(filename) for filename in glob.glob("generated/*.png")] 

# Compute CLIP embeddings for both image sets
real_embeddings = np.vstack([get_clip_embeddings(image) for image in real_images])
gen_embeddings = np.vstack([get_clip_embeddings(image) for image in gen_images])

# Compute CMMD between the two embedding distributions
cmmd = mmd_rbf(real_embeddings, gen_embeddings)
print(f"CMMD: {cmmd:.4f}")

CMMD: 0.3740


4 AI pizzas vs. 4 random real pizzas

In [35]:
# Load real and generated images
real_images = [Image.open(filename) for filename in glob.glob("real_pizza/*.jpg")] 
gen_images = [Image.open(filename) for filename in glob.glob("generated/*.png")] 

# Compute CLIP embeddings for both image sets
real_embeddings = np.vstack([get_clip_embeddings(image) for image in real_images])
gen_embeddings = np.vstack([get_clip_embeddings(image) for image in gen_images])

# Compute CMMD between the two embedding distributions
cmmd = mmd_rbf(real_embeddings, gen_embeddings)
print(f"CMMD: {cmmd:.4f}")

CMMD: 0.4123


4 AI pizzas vs. 4 random real non-pizzas

In [36]:
# Load real and generated images
real_images = [Image.open(filename) for filename in glob.glob("real_nopizza/*.jpg")] 
gen_images = [Image.open(filename) for filename in glob.glob("generated/*.png")] 

# Compute CLIP embeddings for both image sets
real_embeddings = np.vstack([get_clip_embeddings(image) for image in real_images])
gen_embeddings = np.vstack([get_clip_embeddings(image) for image in gen_images])

# Compute CMMD between the two embedding distributions
cmmd = mmd_rbf(real_embeddings, gen_embeddings)
print(f"CMMD: {cmmd:.4f}")

CMMD: 0.5093


4 AI non-pizzas vs. 4 random real pizzas

In [38]:
# Load real and generated images
real_images = [Image.open(filename) for filename in glob.glob("real_pizza/*.jpg")] 
gen_images = [Image.open(filename) for filename in glob.glob("generated_nonPizza/*.jpg")] 

# Compute CLIP embeddings for both image sets
real_embeddings = np.vstack([get_clip_embeddings(image) for image in real_images])
gen_embeddings = np.vstack([get_clip_embeddings(image) for image in gen_images])

# Compute CMMD between the two embedding distributions
cmmd = mmd_rbf(real_embeddings, gen_embeddings)
print(f"CMMD: {cmmd:.4f}")

CMMD: 0.6730


4 AI pizzas vs. the same 4 AI pizzas

In [6]:
# Load the same generated images twice (sanity check: identical sets should give 0)
real_images = [Image.open(filename) for filename in glob.glob("generated/*.png")] 
gen_images = [Image.open(filename) for filename in glob.glob("generated/*.png")] 

# Compute CLIP embeddings for both image sets
real_embeddings = np.vstack([get_clip_embeddings(image) for image in real_images])
gen_embeddings = np.vstack([get_clip_embeddings(image) for image in gen_images])

# Compute CMMD between the two embedding distributions
cmmd = mmd_rbf(real_embeddings, gen_embeddings)
print(f"CMMD: {cmmd:.4f}")

CMMD: 0.0000


4 AI pizzas vs. 4 AI strawberry cakes

In [7]:
# Load two sets of generated images (pizzas and strawberry cakes)
real_images = [Image.open(filename) for filename in glob.glob("generated/*.png")] 
gen_images = [Image.open(filename) for filename in glob.glob("generated_Torte/*.jpg")] 

# Compute CLIP embeddings for both image sets
real_embeddings = np.vstack([get_clip_embeddings(image) for image in real_images])
gen_embeddings = np.vstack([get_clip_embeddings(image) for image in gen_images])

# Compute CMMD between the two embedding distributions
cmmd = mmd_rbf(real_embeddings, gen_embeddings)
print(f"CMMD: {cmmd:.4f}")

CMMD: 0.6621


All real pizzas vs. all real non-pizzas

In [8]:
# Load two sets of real images (pizzas and non-pizzas)
real_images = [Image.open(filename) for filename in glob.glob("input/pizza_not_pizza/pizza/*.jpg")] 
gen_images = [Image.open(filename) for filename in glob.glob("input/pizza_not_pizza/not_pizza/*.jpg")] 

# Compute CLIP embeddings for both image sets
real_embeddings = np.vstack([get_clip_embeddings(image) for image in real_images])
gen_embeddings = np.vstack([get_clip_embeddings(image) for image in gen_images])

# Compute CMMD between the two embedding distributions
cmmd = mmd_rbf(real_embeddings, gen_embeddings)
print(f"CMMD: {cmmd:.4f}")


CMMD: 0.0986
