<a href="https://colab.research.google.com/github/bachaudhry/paper_implementations/blob/main/generative_modeling/diffusion/vision/implementing_diffedit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Implementing DiffEdit**

Key Steps:

1. Provide paper summary with relevant links.
2. List the key steps involved in the DiffEdit implementation.
3. Set up Stable Diffusion (SD) pipelines based on NB 9b.
4. Edit segments to generate masks and outputs.
5. Write fleshed-out prose that explains what actually happened at each step, building understanding along the way.
6. Provide a series of examples with properly animated interpolation of the diffusion steps.
7. Reference other implementations.
8. Publish as a blog post.

## Paper Summary

The paper describes its key idea as follows:

> Semantic image editing is an extension of image generation, with the additional constraint that the generated image should be as similar as possible to a given input image. Current editing methods based on diffusion models usually require to provide a mask, making the task much easier by treating it as a conditional inpainting task. In contrast, our main contribution is able to automatically generate a mask highlighting regions of the input image that need to be edited, by contrasting predictions of a diffusion model conditioned on different text prompts. Moreover, we rely on latent inference to preserve content in those regions of interest and show excellent synergies with mask-based diffusion. DIFFEDIT achieves state-of-the-art editing performance on ImageNet.

> In our DIFFEDIT approach, a mask generation module determines which part of the image should be edited, and an encoder infers the latents, to provide inputs to a text-conditional diffusion model which produces the image edit.

> ## The three steps of DIFFEDIT.

> Step 1: we add noise to the input image, and denoise it: once conditioned on the query text, and once conditioned on a reference text (or unconditionally). We derive a mask based on the difference in the denoising results.

> Step 2: we encode the input image with DDIM, to estimate the latents corresponding to the input image.

> Step 3: we perform DDIM decoding conditioned on the text query, using the inferred mask to replace the background with pixel values coming from the encoding process at the corresponding timestep.

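The three steps quoted above can be sketched on stand-in arrays. This is a minimal NumPy illustration, not the implementation used in this notebook: `diffedit_mask` and `blend_latents` are hypothetical helper names, the random arrays stand in for UNet noise predictions and latents, and the 0.5 threshold and the averaging over ten noise samples are assumptions modeled on the defaults the paper describes.

```python
import numpy as np

def diffedit_mask(noise_query, noise_ref, threshold=0.5):
    """Estimate an edit mask from two stacks of noise predictions.

    Both inputs have shape (n_samples, channels, height, width): one
    prediction per noised copy of the input, conditioned on the query
    text and on the reference text respectively.
    """
    # Step 1: absolute difference, averaged over samples and channels.
    diff = np.abs(noise_query - noise_ref).mean(axis=(0, 1))
    # Rescale to [0, 1]; the epsilon guards against a constant map.
    diff = (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)
    # Binarize: 1 marks regions to edit, 0 marks background to keep.
    return (diff > threshold).astype(np.float32)

def blend_latents(edited, encoded, mask):
    """Step 3: keep edited latents inside the mask, encoded ones outside."""
    return mask * edited + (1.0 - mask) * encoded

# Stand-in data: the two predictions differ only in a 3x3 patch.
rng = np.random.default_rng(0)
noise_ref = rng.normal(size=(10, 4, 8, 8))
noise_query = noise_ref.copy()
noise_query[:, :, 2:5, 2:5] += 1.0

mask = diffedit_mask(noise_query, noise_ref)
edited = np.ones((4, 8, 8))    # stand-in for latents denoised with the query text
encoded = np.zeros((4, 8, 8))  # stand-in for latents from the encoding pass (Step 2)
blended = blend_latents(edited, encoded, mask)
print(mask.sum(), blended.shape)
```

In a real pipeline the blend would be applied at every decoding timestep, with `encoded` taken from the encoding pass at the matching timestep, so the background follows the original image while the masked region follows the query text.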

In [1]:
# Install requirements for Colab (version pins dropped)
! pip install -q --upgrade transformers diffusers ftfy accelerate


In [4]:
# Import libraries
from base64 import b64encode
from google.colab import userdata
from huggingface_hub import notebook_login
# Read the Hugging Face token stored as a Colab secret named 'HFtoken'
hf_token = userdata.get('HFtoken')
# Primary
import numpy as np
import torch
from torch import autocast
from diffusers import AutoencoderKL, LMSDiscreteScheduler, UNet2DConditionModel
from torchvision import transforms
from transformers import CLIPTextModel, CLIPTokenizer, logging
# Secondary
from IPython.display import HTML
from matplotlib import pyplot as plt
from pathlib import Path
from PIL import Image
from tqdm.auto import tqdm
import os

torch.manual_seed(42)
#if not (Path.home()/'.huggingface'/'token').exists(): notebook_login()
logging.set_verbosity_error()

torch_device = "cuda" if torch.cuda.is_available() else "cpu"

## Loading Model Pipelines

In [5]:
# Autoencoder (VAE): encodes images into latents and decodes latents back into image space
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

# Tokenizer and the text encoder to tokenize and encode the text
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# The UNet model that predicts the noise residual in the latents
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")

# The noise scheduler
scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012,
                                 beta_schedule="scaled_linear", num_train_timesteps=1000)

# Load the components onto device
vae = vae.to(torch_device)
text_encoder = text_encoder.to(torch_device)
unet = unet.to(torch_device)

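To make the scheduler arguments above concrete, the sketch below recomputes the `scaled_linear` beta schedule by hand with NumPy. It assumes that `scaled_linear` interpolates linearly between the square roots of `beta_start` and `beta_end` and then squares the result, and that the sigma-style noise levels used by `LMSDiscreteScheduler` are derived from the cumulative product of `1 - beta`. Treat it as an illustration of the arithmetic rather than a replacement for the library; the variable names are illustrative.

```python
import numpy as np

beta_start, beta_end, num_train_timesteps = 0.00085, 0.012, 1000

# "scaled_linear" (assumed): linear in sqrt(beta), then squared.
betas = np.linspace(beta_start**0.5, beta_end**0.5, num_train_timesteps) ** 2

# Fraction of the original signal variance that survives up to step t.
alphas_cumprod = np.cumprod(1.0 - betas)

# Noise-to-signal ratio expressed as a standard deviation (sigma).
sigmas = np.sqrt((1.0 - alphas_cumprod) / alphas_cumprod)

print(f"beta range:  {betas[0]:.5f} .. {betas[-1]:.5f}")
print(f"sigma range: {sigmas[0]:.4f} .. {sigmas[-1]:.4f}")
```

The sigmas grow monotonically with the training timestep, so a sampler that starts from pure noise walks them from the largest value down to the smallest.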
