# Context-Aware Object Insertion Using Stable Diffusion and Vision-Language Models

**This Colab notebook is a collaborative effort by:**

* Deekshita Sriyaa K
* Anagha S Bharadwaj
* Aditi S Kulkarni
* Akshat Singh Jaswal


# Installing dependencies
This cell sets up the environment by installing the libraries the project depends on:
* **transformers:** Loads and runs pre-trained models; here, the LLaVA vision-language model and its processor.
* **diffusers:** Provides pipelines for diffusion models, a class of generative models used for image generation. Here it supplies Stable Diffusion.
* **bitsandbytes:** Quantizes model weights to lower precision (8-bit or 4-bit), which reduces GPU memory use at some cost in accuracy.
* **accelerate:** Handles device placement and speeds up training or inference, especially across multiple GPUs or TPUs. It is also required for `device_map="auto"` when loading the model.
* **gradio:** Builds web interfaces for machine learning models.


In [1]:
!pip install transformers diffusers
!pip install bitsandbytes
!pip install accelerate
!pip install gradio

Collecting diffusers
  Downloading diffusers-0.29.2-py3-none-any.whl (2.2 MB)
Installing collected packages: diffusers
Successfully installed diffusers-0.29.2
Collecting bitsandbytes
  Downloading bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl (119.8 MB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch->bitsandbytes)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch->bitsandbytes)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch->bitsandbytes)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidi
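
Because pip installs whatever versions are current at install time, it can help to record what actually ended up in the environment. The following is a minimal sketch that assumes only the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical and not part of the notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# The five packages installed in the cell above.
for name, ver in installed_versions(
    ["transformers", "diffusers", "bitsandbytes", "accelerate", "gradio"]
).items():
    print(f"{name}: {ver or 'not installed'}")
```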

# Importing all essential libraries
This cell imports everything used in the rest of the notebook:

* **transformers:** Provides the pre-trained vision-language model and its supporting classes.
    * `AutoProcessor`: Loads the processor that matches a model checkpoint; for LLaVA it prepares both the text prompt and the image.
    * `LlavaForConditionalGeneration`: The LLaVA model class, which generates text conditioned on an image and a prompt.
    * `BitsAndBytesConfig`: Configures model quantization with bitsandbytes.
* **diffusers:** Provides tools for diffusion models.
    * `StableDiffusionInpaintPipeline`: Performs image inpainting with Stable Diffusion, which is how objects are inserted into an existing image.
* **torch:** Provides tensor and deep learning operations; here it is mainly used to set data types such as `torch.float16`.
* **PIL (Python Imaging Library):** Provides image processing tools.
    * `Image`, `ImageDraw`, `ImageFont`: `Image` and `ImageDraw` are used to create and draw the inpainting mask; `ImageFont` is imported but not used in the cells shown.
* **matplotlib.pyplot:** Plots and displays images.
* **gradio:** Builds the web interface for the project.
* **io:** Provides input/output streams; it is imported here but not used in the cells shown.

In [2]:
from transformers import AutoProcessor, LlavaForConditionalGeneration, BitsAndBytesConfig
from diffusers import StableDiffusionInpaintPipeline
import torch
from PIL import Image, ImageDraw, ImageFont
import matplotlib.pyplot as plt
import gradio as gr
import io

# Prompt Generation using the Llava model with quantization

This cell builds a quantization configuration (`quantization_config`) with the `BitsAndBytesConfig` class: weights are loaded in 4-bit precision, computation runs in `torch.float16`, and double quantization is enabled to save additional memory.
It then sets the model ID to `"llava-hf/llava-1.5-7b-hf"`, loads the matching processor, and loads the pre-trained LLaVA model (`LlavaForConditionalGeneration`) from the transformers library with `quantization_config` applied to reduce its memory footprint.
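
To see why 4-bit loading matters on a Colab GPU, the memory needed for the weights can be estimated with simple arithmetic. This is a rough sketch only: the 7-billion parameter count is an assumption read off the `7b` in the model ID, the helper `weight_memory_gib` is hypothetical, and the estimate ignores quantization constants, activations, and any layers kept in higher precision.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3


NUM_PARAMS = 7e9  # assumption: "7b" in the model ID means roughly 7 billion parameters

for label, bits in [("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions, 4-bit weights take about a quarter of the space of `float16` weights, which is the main reason the quantized model can fit alongside the Stable Diffusion pipelines.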

In [3]:
# Step 1: Generate prompt using Llava model
#!pip install accelerate
# Load Llava model and processor with quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,  # Enable double quantization for further memory optimization
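    llm_int8_enable_fp32_cpu_offload=True  # Allow offloading parts of the model to CPU (the original `load_in_8bit_fp32_cpu_offload` kwarg is not recognized and was ignored; see the warning below)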
)
model_id1 = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id1)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id1,
    quantization_config=quantization_config,
    device_map="auto"  # Let Transformers automatically decide the device mapping
)


Unused kwargs: ['load_in_8bit_fp32_cpu_offload']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



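**This cell imports the additional libraries used by the Stable Diffusion pipelines and selects the GPU as the target device. The Hugging Face Hub login (`notebook_login()`) is left commented out, since the public models used here can be downloaded without authentication.**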

In [9]:
import os
from PIL import Image
import cv2
import numpy as np
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionInpaintPipeline
from huggingface_hub import notebook_login
import matplotlib.pyplot as plt

device = 'cuda'

# Log in to Hugging Face
#notebook_login()
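
The cell above hard-codes `device = 'cuda'`, so the later `.to(device)` calls fail on a runtime without a GPU. Below is a minimal sketch of a fallback; the helper name `select_device` is hypothetical. It calls `torch.cuda.is_available()` when PyTorch is installed and otherwise reports `'cpu'`.

```python
import importlib.util


def select_device() -> str:
    """Return 'cuda' when PyTorch can see a GPU, otherwise 'cpu'."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed, so no GPU can be used
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = select_device()
print(f"Using device: {device}")
```

The half-precision and 4-bit settings in this notebook are aimed at GPU execution, so a `'cpu'` result is best treated as a signal to switch the Colab runtime type.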

# Loading Stable Diffusion pipelines for image inpainting
This cell loads two pipelines from the Hugging Face Hub with `from_pretrained`: a standard text-to-image pipeline (`StableDiffusionPipeline`) and a separate inpainting pipeline (`StableDiffusionInpaintPipeline`). Only the inpainting pipeline is used in the later cells. It is designed for inpainting tasks, which fill in a masked region of an image while keeping the result coherent with the surrounding content.

In [5]:
# Load the Stable Diffusion pipeline
pipe = StableDiffusionPipeline.from_pretrained(
    'CompVis/stable-diffusion-v1-4', revision='fp16',
    torch_dtype=torch.float16  # `use_auth_token=True` dropped: the pipeline ignores it (see the warning below)
).to(device)

# Load the inpainting pipeline
inpaint_pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to(device)

safety_checker/model.safetensors not found

Keyword arguments {'use_auth_token': True} are not expected by StableDiffusionPipeline and will be ignored.


An error occurred while trying to fetch /root/.cache/huggingface/hub/models--CompVis--stable-diffusion-v1-4/snapshots/2880f2ca379f41b0226444936bb7a6766a227587/unet: Error no file named diffusion_pytorch_model.safetensors found in directory /root/.cache/huggingface/hub/models--CompVis--stable-diffusion-v1-4/snapshots/2880f2ca379f41b0226444936bb7a6766a227587/unet.
Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.
An error occurred while trying to fetch /root/.cache/huggingface/hub/models--CompVis--stable-diffusion-v1-4/snapshots/2880f2ca379f41b0226444936bb7a6766a227587/vae: Error no file named diffusion_pytorch_model.safetensors found in directory /root/.cache/huggingface/hub/models--CompVis--stable-diffusion-v1-4/snapshots/2880f2ca379f41b0226444936bb7a6766a227587/vae.
Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.


safety_checker/model.safetensors not found

An error occurred while trying to fetch /root/.cache/huggingface/hub/models--runwayml--stable-diffusion-inpainting/snapshots/51388a731f57604945fddd703ecb5c50e8e7b49d/unet: Error no file named diffusion_pytorch_model.safetensors found in directory /root/.cache/huggingface/hub/models--runwayml--stable-diffusion-inpainting/snapshots/51388a731f57604945fddd703ecb5c50e8e7b49d/unet.
Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.
An error occurred while trying to fetch /root/.cache/huggingface/hub/models--runwayml--stable-diffusion-inpainting/snapshots/51388a731f57604945fddd703ecb5c50e8e7b49d/vae: Error no file named diffusion_pytorch_model.safetensors found in directory /root/.cache/huggingface/hub/models--runwayml--stable-diffusion-inpainting/snapshots/51388a731f57604945fddd703ecb5c50e8e7b49d/vae.
Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.


**In essence, the `create_mask` function takes an image and a rectangular area, given as `(left, top, right, bottom)` pixel coordinates, and generates a binary mask with the same size as the image. The white rectangle on the mask marks the region that the inpainting pipeline will fill in, guided by the surrounding context and the text prompt.**

In [6]:
# Function to create a mask for the inpainting
def create_mask(image, mask_area):
    mask = Image.new("L", image.size, 0)
    draw = ImageDraw.Draw(mask)
    draw.rectangle(mask_area, fill=255)
    return mask
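
`create_mask` expects `mask_area` as `(left, top, right, bottom)` pixel coordinates. As an illustration of how such a box might be chosen for scenes of different sizes, here is a small sketch. `centered_mask_area` is a hypothetical helper, not part of the notebook, and centering the box is only one possible placement policy.

```python
def centered_mask_area(image_size, box_size):
    """Return a (left, top, right, bottom) box centered in the image.

    The box is shrunk to fit if it is larger than the image.
    """
    img_w, img_h = image_size
    box_w = min(box_size[0], img_w)
    box_h = min(box_size[1], img_h)
    left = (img_w - box_w) // 2
    top = (img_h - box_h) // 2
    return (left, top, left + box_w, top + box_h)


# A 250x250 box in a 512x512 scene.
print(centered_mask_area((512, 512), (250, 250)))
```

The result can be passed directly as `mask_area`, for example `create_mask(scene_img, centered_mask_area(scene_img.size, (250, 250)))`.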

# Inpainting object into scene
This cell defines `generate_combined_image`, which ties the two models together to insert an object into a scene.
- The function takes a scene image and an object image as inputs (both as PIL Image objects).
- It asks the LLaVA model to describe the object image and to write a prompt requesting that the object be placed in the scene with a similar perspective and size; the text after `ASSISTANT:` in the decoded output becomes the inpainting prompt.
- It builds a mask with `create_mask` over a fixed rectangle, `(150, 150, 400, 400)`, which marks the region where inpainting should occur.
- It passes the prompt, the scene image, and the mask to the inpainting pipeline, so the model fills in the designated area while taking the surrounding image context into account.
- It returns the generated image, or `None` if any step raises an exception.


In [7]:
def generate_combined_image(scene_img, object_img):
    try:
        # Generate prompt using the object image
        prompts = [
            "USER: <image>\nWrite me a prompt describing the object in the focus of this image and asking it to be put in the image with similar perspective and size.\nASSISTANT:"
        ]
        inputs = processor(prompts, images=[object_img], padding=True, return_tensors="pt").to("cuda")
        output = model.generate(**inputs, max_new_tokens=45)
        generated_text = processor.batch_decode(output, skip_special_tokens=True)
        prompt = generated_text[0].split("ASSISTANT:")[-1].strip()

        #print(f"Generated Prompt: {prompt}")

        # Define the mask area (you might want to customize this)
        mask_area = (150, 150, 400, 400)

        # Create the mask
        mask = create_mask(scene_img, mask_area)

        # Perform inpainting
        generated_image = inpaint_pipe(prompt=prompt, image=scene_img, mask_image=mask).images[0]
        # Return the result; Gradio displays it in the output component

        return generated_image

    except Exception as e:
        print(f"Error: {e}")
        return None
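
The prompt extraction step relies on the decoded text containing the original conversation followed by the model's reply. The sketch below isolates that step so it can be checked without loading the model. `extract_assistant_reply` is a hypothetical name, and the sample string is made up for illustration rather than real model output.

```python
def extract_assistant_reply(decoded: str, marker: str = "ASSISTANT:") -> str:
    """Return the text after the last marker, stripped of surrounding whitespace."""
    return decoded.split(marker)[-1].strip()


# Made-up example of a decoded sequence (not actual LLaVA output).
sample = (
    "USER: \nWrite me a prompt describing the object in this image.\n"
    "ASSISTANT: A red ceramic mug placed on the table, matching the scene's perspective."
)
print(extract_assistant_reply(sample))
```

If the marker is missing, `split` returns the whole string as a single element, so the function falls back to the full decoded text instead of raising an error.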

**Setting up the UI**

In [8]:
iface = gr.Interface(
    fn=generate_combined_image,
    inputs=[
        gr.Image(type="pil", label="Scene Image"),
        gr.Image(type="pil", label="Object Image")
    ],
    outputs=gr.Image(type="pil", label="Generated Image"),
    title="Inpainting with Stable Diffusion",
    description="Upload a scene image and an object image to generate a new image with the object placed in the scene."
)

# Launch the Gradio app
iface.launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://7ec43a6b2ca2ab933b.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


