# 6.3 CLIP and Stable Diffusion


## Exploring CLIP: Bridging Vision and Language

### Introduction to CLIP

CLIP (Contrastive Language-Image Pre-training), developed by OpenAI, represents a significant advancement in multimodal learning, bridging the gap between computer vision and natural language processing.

#### Key Features of CLIP:

1. **Joint Understanding**: CLIP learns a joint representation space for images and text, enabling direct comparison between the two modalities.

2. **Zero-Shot Capabilities**: It can perform various tasks without task-specific training, including image classification and image-text matching.

3. **Versatility**: CLIP can be applied to a wide range of vision tasks with minimal adaptation.

4. **Large-Scale Pre-training**: Trained on 400 million (image, text) pairs from the internet, giving it broad knowledge.

### CLIP Architecture and Training

- **Dual Encoder**: CLIP uses separate encoders for images (a Vision Transformer or ResNet) and text (a Transformer).
- **Contrastive Learning**: During training, CLIP learns to maximize the cosine similarity between correct image-text pairs while minimizing it for incorrect pairs.
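- **Temperature-Scaled Cross-Entropy Loss**: The similarities between every image and every text in a batch are scaled by a learned temperature and passed through a symmetric cross-entropy loss, so each image must pick out its own caption and each caption its own image. Large batches supply many negative pairs per training step.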
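
The contrastive objective described above can be sketched in a few lines of NumPy. This is a minimal illustration, not CLIP's training code: the function names `log_softmax` and `clip_contrastive_loss` are hypothetical, the embeddings are random, and the temperature is fixed at 0.07 as an assumption (CLIP learns it during training).

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an N x N image-text similarity matrix."""
    # L2-normalize so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # logits[i, j] = similarity of image i and text j, sharpened by the temperature
    logits = image_emb @ text_emb.T / temperature

    # Matching pairs sit on the diagonal: image i belongs with text i
    idx = np.arange(len(logits))
    loss_image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_image_to_text + loss_text_to_image) / 2

# Random stand-in embeddings (assumption: 8 pairs, 16 dimensions)
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
aligned_loss = clip_contrastive_loss(emb, emb)         # every image matches its own text
shuffled_loss = clip_contrastive_loss(emb, emb[::-1])  # pairs deliberately mismatched
print(f"aligned: {aligned_loss:.3f}, shuffled: {shuffled_loss:.3f}")
```

The aligned batch yields a much lower loss than the shuffled one, which is the signal that pulls matching image and text embeddings together during training.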

### Code Breakdown

Let's explore how the provided code leverages CLIP's capabilities:

1. **Model Setup** (`setup_clip_model`):
   - Initializes the CLIP model and processor.
   - Uses the "openai/clip-vit-base-patch32" variant, which employs a Vision Transformer for image encoding.

2. **Image Loading** (`get_image_from_url`):
   - Fetches images from URLs, allowing for dynamic testing with various images.

3. **Similarity Computation** (`clip_image_text_similarity`):
   - Core function that demonstrates CLIP's ability to compare images and text.
   - Processes both the image and a list of text prompts.
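   - Computes the similarity scores (logits) between the image and each text prompt; these are cosine similarities multiplied by the model's learned scale factor.
   - Applies softmax to convert logits to probabilities. The probabilities are relative to the prompts supplied: they sum to 1 across the list, even if none of the prompts describes the image.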

4. **Visualization** (`visualize_clip_results`):
   - Creates a side-by-side display of the input image and a bar chart of text-image similarity scores.
   - Helps in interpreting CLIP's understanding of the image content.
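
To make the similarity step concrete, here is a small NumPy sketch of how scaled cosine similarities become probabilities. The helper `similarity_to_probs`, the hand-picked 3-dimensional embeddings, and the scale of 100 are assumptions for illustration; the real model produces the embeddings and stores its own learned scale.

```python
import numpy as np

def similarity_to_probs(image_emb, text_embs, logit_scale=100.0):
    """Turn one image embedding and K text embeddings into K probabilities."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = logit_scale * (text_embs @ image_emb)  # scaled cosine similarities
    exp = np.exp(logits - logits.max())             # numerically stable softmax
    return exp / exp.sum()

# Hypothetical embeddings chosen by hand, not produced by CLIP
image = np.array([1.0, 0.2, 0.0])
texts = np.array([
    [0.9, 0.3, 0.1],  # points in nearly the same direction as the image
    [0.0, 1.0, 0.0],  # unrelated
    [0.1, 0.0, 1.0],  # unrelated
])
probs = similarity_to_probs(image, texts)
print(probs.round(3))

# Remove the best-matching prompt: the remaining probabilities still sum to 1
probs_without_best = similarity_to_probs(image, texts[1:])
print(probs_without_best.round(3))
```

The second call shows why the bar chart should be read as a ranking among the given prompts rather than as an absolute confidence.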

### Applications and Implications

This code setup allows for exploration of several key CLIP capabilities:

1. **Zero-Shot Image Classification**: By providing class names as text prompts, CLIP can classify images without specific training.

2. **Image-Text Retrieval**: Can be used to find the most relevant text for an image or vice versa.

3. **Concept Understanding**: By testing various descriptive phrases, we can probe CLIP's understanding of image content and concepts.

4. **Bias Detection**: Analyzing CLIP's responses can reveal potential biases in its training data or learned representations.
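
As a rough sketch of the zero-shot classification workflow, class names can be turned into prompts and the highest-probability prompt taken as the predicted label. The helpers `build_prompts` and `predict_label` are hypothetical, and the probabilities below are placeholders standing in for the `(1, num_texts)` array that `clip_image_text_similarity` returns, not real model output.

```python
import numpy as np

def build_prompts(class_names, template="a photo of {}"):
    # One text prompt per class; the template wording is a tunable choice
    return [template.format(name) for name in class_names]

def predict_label(class_names, probs):
    # probs has shape (1, num_classes), one row for the single input image
    best = int(np.argmax(probs[0]))
    return class_names[best], float(probs[0][best])

class_names = ["a cat", "a dog", "a giraffe", "a zebra"]
prompts = build_prompts(class_names)
print(prompts)

# Placeholder probabilities for illustration only
example_probs = np.array([[0.90, 0.07, 0.02, 0.01]])
label, confidence = predict_label(class_names, example_probs)
print(label, confidence)
```

In the notebook, `prompts` would be passed as the `texts` argument and the returned array used in place of `example_probs`.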

### Limitations and Considerations

- CLIP's performance can vary based on the complexity of the image and the phrasing of text prompts.
- The model may exhibit biases present in its internet-derived training data.
- While powerful, CLIP's zero-shot capabilities may not always match task-specific fine-tuned models in specialized domains.

### Conclusion

This CLIP exploration code provides a powerful tool for understanding and leveraging CLIP's capabilities in bridging visual and textual understanding. By experimenting with different images and text prompts, users can gain insights into both the strengths and limitations of this groundbreaking model.


In [None]:
!pip install -qqq transformers accelerate diffusers

In [None]:
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler
import matplotlib.pyplot as plt
from tqdm import tqdm
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import requests
import numpy as np

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")


In [None]:
# CLIP Exploration Section

def setup_clip_model():
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    return model.to(device), processor

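def get_image_from_url(url):
    # Some hosts reject requests that lack a User-Agent header
    headers = {"User-Agent": "clip-demo-notebook/0.1"}
    response = requests.get(url, headers=headers, stream=True, timeout=30)
    response.raise_for_status()  # Fail loudly instead of handing PIL an error page
    return Image.open(response.raw).convert("RGB")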

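def clip_image_text_similarity(model, processor, image, texts):
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True).to(device)
    # Inference only: skip gradient tracking to save memory
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (num_images, num_texts); softmax runs over the texts
    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=1)
    return probs.cpu().numpy()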

def visualize_clip_results(image, texts, probs):
    plt.figure(figsize=(16, 7))
    plt.subplot(1, 2, 1)
    plt.imshow(image)
    plt.axis('off')
    plt.title("Input Image")

    plt.subplot(1, 2, 2)
    y_pos = np.arange(len(texts))
    plt.barh(y_pos, probs[0])
    plt.yticks(y_pos, texts)
    plt.xlabel('Probability')
    plt.title('CLIP Predictions')
    plt.tight_layout()
    plt.show()



In [None]:
def im2txt_sim():
    model, processor = setup_clip_model()

    # Example 1: Image-Text Similarity
    image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = get_image_from_url(image_url)
    texts = ["a photo of a cat", "a photo of a dog", "a photo of a giraffe", "a photo of a zebra"]

    probs = clip_image_text_similarity(model, processor, image, texts)
    visualize_clip_results(image, texts, probs)

In [None]:
im2txt_sim()

In [None]:
def zero_shot_classify():
    # Example 2: Zero-shot Image Classification
    model, processor = setup_clip_model()
    image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/1/1d/Taj_Mahal_%28Edited%29.jpeg/1200px-Taj_Mahal_%28Edited%29.jpeg"
    image = get_image_from_url(image_url)
    texts = ["a photo of the Eiffel Tower", "a photo of the Statue of Liberty", "a photo of the Great Wall of China", "a photo of the Taj Mahal"]

    probs = clip_image_text_similarity(model, processor, image, texts)
    visualize_clip_results(image, texts, probs)

In [None]:
zero_shot_classify()

In [None]:
def txt2img_retrieval():
    # Example 3: Text-to-Image Retrieval
    model, processor = setup_clip_model()
    image_urls = [
        "https://upload.wikimedia.org/wikipedia/commons/thumb/1/1d/Taj_Mahal_%28Edited%29.jpeg/1200px-Taj_Mahal_%28Edited%29.jpeg",
        "http://images.cocodataset.org/val2017/000000039769.jpg",
    ]
    images = [get_image_from_url(url) for url in image_urls]
    text = "a photo of a bridge at night"

    # Score all images against the query in one batch. A softmax over a single
    # text per image always returns 1.0, so normalize across the images instead:
    # logits_per_text has shape (num_texts, num_images).
    inputs = processor(text=[text], images=images, return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    probs_list = outputs.logits_per_text.softmax(dim=1)[0].cpu().numpy()

    # Size the figure and grid from the number of images
    plt.figure(figsize=(5 * len(images), 5))
    for i, (img, prob) in enumerate(zip(images, probs_list)):
        plt.subplot(1, len(images), i+1)
        plt.imshow(img)
        plt.axis('off')
        plt.title(f"Probability: {prob:.4f}")
    plt.suptitle(f"Text Query: {text}", fontsize=16)
    plt.tight_layout()
    plt.show()

In [None]:
txt2img_retrieval()



---



## Hugging Face Diffusers


### Introduction to Hugging Face Diffusers

Hugging Face Diffusers is a powerful library that provides a unified interface for working with various diffusion models, including Stable Diffusion. It simplifies the process of using state-of-the-art generative AI models, making them accessible to developers and researchers.

### Key Features of Diffusers

1. **Pre-trained Models**:
   - Offers a wide range of pre-trained diffusion models, including multiple versions of Stable Diffusion.
   - Enables easy experimentation with different model architectures and training datasets.

2. **Pipeline Abstraction**:
   - Encapsulates the entire generation process in a user-friendly pipeline.
   - Handles text encoding, noise generation, and the iterative denoising process internally.

3. **Customization Options**:
   - Allows fine-tuning of various parameters like guidance scale and number of inference steps.
   - Supports advanced techniques such as prompt weighting and negative prompts.

4. **Hardware Optimization**:
   - Provides built-in support for GPU acceleration and mixed-precision inference.
   - Enables efficient use of computational resources, crucial for real-time applications.

5. **Interoperability**:
   - Seamlessly integrates with other PyTorch-based libraries and workflows.
   - Facilitates easy incorporation of diffusion models into larger AI systems or applications.

### Working with Stable Diffusion

Stable Diffusion, accessed through the Diffusers library, offers several advantages:

1. **Text-to-Image Generation**:
   - Converts textual descriptions into high-quality images.
   - Supports a wide range of concepts, styles, and compositions.

2. **Controllable Generation**:
   - Guidance scale parameter allows balancing between creativity and prompt adherence.
   - Number of inference steps can be adjusted to trade off between quality and speed.

3. **Extensibility**:
   - Supports advanced techniques like img2img, inpainting, and outpainting.
   - Can be fine-tuned on custom datasets for specialized applications.

### Practical Applications

The combination of Hugging Face Diffusers and Stable Diffusion enables a wide range of applications:

- **Content Creation**: Assisting artists and designers in generating visual concepts.
- **Data Augmentation**: Creating diverse datasets for machine learning tasks.
- **Prototyping**: Rapidly visualizing ideas in fields like product design or architecture.
- **Educational Tools**: Illustrating complex concepts or historical scenes.
- **Entertainment**: Powering creative tools in gaming or interactive media.


In [None]:
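def setup_diffusion_model():
    model_id = "runwayml/stable-diffusion-v1-5"
    # Half precision is intended for GPU inference; use float32 on CPU
    dtype = torch.float16 if device.type == "cuda" else torch.float32
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=dtype)
    return pipe.to(device)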

def generate_image_diffusers(pipe, prompt, num_images=1, guidance_scale=7.5, num_inference_steps=50):
    images = pipe(
        [prompt] * num_images,
        guidance_scale=guidance_scale,
        num_inference_steps=num_inference_steps
    ).images
    return images


In [None]:
pipe = setup_diffusion_model()

# Generate a single image
prompt = "A beautiful sunset over mountains, digital art"
generated_images = generate_image_diffusers(pipe, prompt)
plt.figure(figsize=(10, 10))
plt.imshow(generated_images[0])
plt.axis('off')
plt.title(prompt)
plt.show()

### Generating Multiple Images

In [None]:
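def generate_multiple_images(pipe, prompt, num_images=4):
    images = generate_image_diffusers(pipe, prompt, num_images)

    # Size the grid from the number of images instead of assuming 2x2
    cols = min(2, num_images)
    rows = (num_images + cols - 1) // cols
    fig, axes = plt.subplots(rows, cols, figsize=(10 * cols, 10 * rows), squeeze=False)
    for ax in axes.flat:
        ax.axis('off')  # Also hides any unused cells
    for ax, img in zip(axes.flat, images):
        ax.imshow(img)

    fig.suptitle(prompt, fontsize=16)
    plt.tight_layout()
    plt.show()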

In [None]:
# Generate multiple images
prompt = "A futuristic city skyline at night, cyberpunk style"
generate_multiple_images(pipe, prompt)

### Exploring Guidance Scales in Stable Diffusion

The `compare_guidance_scales` function allows us to visualize the effects of different classifier-free guidance scales in the Stable Diffusion model. This parameter plays a crucial role in the balance between unconditional and conditional generation.

#### Technical Background:

1. **Classifier-Free Guidance**:
   Stable Diffusion uses a technique called classifier-free guidance to steer the diffusion process towards the desired output. The guidance scale (ω) determines the strength of this steering.

2. **Mathematical Formulation**:
   Given an unconditional score $u_\theta(x_t, t)$ and a conditional score $c_\theta(x_t, t)$, the guided score is computed as:

   $$s_\theta(x_t, t) = u_\theta(x_t, t) + \omega \, \bigl(c_\theta(x_t, t) - u_\theta(x_t, t)\bigr)$$

   Where:
   - $x_t$ is the noisy image at timestep $t$
   - $\omega$ is the guidance scale

3. **Impact on Generation**:
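   - When ω = 0, the prompt is ignored and the model samples unconditionally. When ω = 1, the guided score equals the conditional score, so the prompt is used but no extra guidance is applied.
   - As ω increases beyond 1, the model extrapolates further away from the unconditional score, adhering more closely to the prompt.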
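
The guidance formula above is a simple linear combination, which a short NumPy sketch can make tangible. The function name `guided_prediction` and the toy three-element arrays are assumptions for illustration; in the real pipeline the two inputs are the model's full-size predictions at a given timestep.

```python
import numpy as np

def guided_prediction(uncond, cond, guidance_scale):
    """Combine unconditional and conditional predictions with scale omega."""
    return uncond + guidance_scale * (cond - uncond)

# Toy stand-ins for the two predictions at one timestep
uncond = np.array([0.0, 1.0, 2.0])
cond = np.array([1.0, 1.0, 0.0])

for omega in (0.0, 1.0, 7.5):
    print(omega, guided_prediction(uncond, cond, omega))
```

With ω = 0 the result equals `uncond`, with ω = 1 it equals `cond`, and with ω = 7.5 it overshoots `cond` along the direction from `uncond` to `cond`.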

#### Function Implementation Details:

1. **Image Generation**:
   For each guidance scale, the function calls `generate_image_diffusers`, which internally uses the Stable Diffusion pipeline with the specified guidance_scale parameter.

2. **Sampling Process**:
   Higher guidance scales push each denoising step more strongly toward the prompt-conditioned prediction, which tends to produce sharper but less diverse outputs. Very high values can lead to oversaturated colors or visible artifacts.

3. **Visualization**:
   The function creates a matplotlib figure with subplots, each showing an image generated with a different guidance scale. This allows for direct visual comparison of the guidance scale's effects.

#### Key Technical Considerations:

- **Non-linearity**: The effect of the guidance scale is not linear. Doubling the scale doesn't necessarily double the "adherence" to the prompt.
- **Model Architecture Interaction**: The impact of guidance scales can vary depending on the specific architecture and training of the Stable Diffusion model.
- **Computational Implications**: A higher guidance scale does not increase computation time. The cost comes from enabling guidance at all: each denoising step then needs both a conditional and an unconditional prediction, whatever the value of ω.


By systematically varying the guidance scale, we can gain insights into the model's behavior and find optimal settings for different applications, balancing prompt adherence with image quality and diversity.

In [None]:
def compare_guidance_scales(pipe, prompt, scales=(5.0, 7.5, 10.0, 15.0), seed=0):
    images = []
    for scale in scales:
        # Reset the global RNG so every scale starts from the same initial noise
        torch.manual_seed(seed)
        img = generate_image_diffusers(pipe, prompt, guidance_scale=scale)[0]
        images.append(img)

    # squeeze=False keeps axes two-dimensional even for a single scale
    fig, axes = plt.subplots(1, len(scales), figsize=(5 * len(scales), 5), squeeze=False)
    for ax, img, scale in zip(axes[0], images, scales):
        ax.imshow(img)
        ax.set_title(f"Guidance Scale: {scale}")
        ax.axis('off')

    fig.suptitle(prompt, fontsize=16)
    plt.tight_layout()
    plt.show()

In [None]:
# Compare guidance scales
prompt = "A magical forest with glowing mushrooms and fairies"
compare_guidance_scales(pipe, prompt)

### Visualizing the Stable Diffusion Generation Process

One of the most fascinating aspects of Stable Diffusion is how it gradually refines random noise into a coherent image based on a text prompt. The `visualize_generation_steps` function gives a view of this process by running the pipeline several times on the same prompt with an increasing number of inference steps. Each panel is therefore a separate, complete generation at a given step budget, not an intermediate snapshot of a single run.

#### Key Points:

- We start with a minimum of 10 inference steps to ensure meaningful initial results.
- The number of steps increases gradually, up to the maximum specified (default is 50).
- The resulting visualization helps us understand how additional inference steps contribute to image quality and fidelity to the prompt.


By examining this visualization, we can gain insights into:
- How quickly the overall composition emerges
- At what point fine details start to appear
- How the image quality improves with additional computation

This function is a valuable tool for understanding the inner workings of Stable Diffusion and can help in fine-tuning the balance between generation speed and image quality.
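
One way to choose the step counts for such a comparison is to space them evenly between the minimum and the maximum. The helper below, `step_schedule`, is a hypothetical sketch of that idea and uses the defaults mentioned above (minimum 10, maximum 50); the number of panels, 8, is an assumption.

```python
import numpy as np

def step_schedule(min_steps=10, max_steps=50, num_images=8):
    """Evenly spaced integer step counts from min_steps to max_steps."""
    steps = np.linspace(min_steps, max_steps, num_images).round().astype(int)
    # Drop duplicates that appear when the range is narrower than num_images
    return sorted(set(steps.tolist()))

schedule = step_schedule()
print(schedule)  # [10, 16, 21, 27, 33, 39, 44, 50]
```

Using `np.linspace` returns exactly the requested number of values and always includes both endpoints, which integer stride arithmetic does not guarantee.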

In [None]:
def visualize_generation_steps(prompt, num_inference_steps=50, num_images=8, seed=0):
    model_id = "runwayml/stable-diffusion-v1-5"
    scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
    # Half precision is intended for GPU inference; use float32 on CPU
    dtype = torch.float16 if device.type == "cuda" else torch.float32
    pipe = StableDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler, torch_dtype=dtype)
    pipe = pipe.to(device)

    min_steps = 10  # Minimum number of inference steps
    max_steps = max(min_steps, num_inference_steps)
    # Evenly spaced step counts; linspace returns exactly num_images values and
    # np.unique drops duplicates when the range is too narrow
    step_sizes = np.unique(np.linspace(min_steps, max_steps, num_images).round().astype(int)).tolist()

    step_images = []
    for steps in tqdm(step_sizes, desc="Generating steps"):
        # Re-seed so every run starts from the same initial noise
        generator = torch.Generator(device=device).manual_seed(seed)
        image = pipe(prompt, num_inference_steps=steps, generator=generator).images[0]
        step_images.append(image)

    # Determine the grid size based on the actual number of generated images
    num_images = len(step_images)
    cols = min(4, num_images)
    rows = (num_images + cols - 1) // cols

    # squeeze=False always returns a 2D array, which flattens safely
    fig, axes = plt.subplots(rows, cols, figsize=(5*cols, 5*rows), squeeze=False)
    axes = axes.flatten()

    for i, (img, steps) in enumerate(zip(step_images, step_sizes)):
        ax = axes[i]
        ax.imshow(img)
        ax.set_title(f"Steps: {steps}")
        ax.axis('off')

    # Turn off any unused subplots
    for j in range(num_images, len(axes)):
        axes[j].axis('off')

    fig.suptitle(prompt, fontsize=16)
    plt.tight_layout()
    plt.show()

In [None]:
# Visualize generation steps
prompt = "A very funky cyberpunk themed Japanese garden with a koi pond and cherry blossoms"
visualize_generation_steps(prompt)