# Virtual Fashion Styling With Stable Diffusion In-painting 🧨 Using SageMaker

This blog post details text-based semantic image segmentation with the pre-trained CLIPSeg model, fine-tuning of a Stable Diffusion in-painting model, and deployment for inference using Amazon SageMaker.


## Semantic Segmentation with CLIPSeg
CLIPSeg introduced a semantic segmentation method that lets users identify fashion items in pictures with simple text commands. It uses CLIP's text or image encoder to embed a text prompt or a reference image into a multimodal embedding space, and then segments the target object described by that prompt. It builds on CLIP, which was trained on a vast amount of data using natural language supervision and multimodal contrastive learning, and which supports zero-shot transfer. This means that users can use the pre-trained model made publicly available by Timo Lüddecke et al.[1] without further customization.

![image](./imgs/clipseg.png)

## Stable Diffusion

Stable Diffusion is a generative model that lets fashion designers produce realistic imagery in large quantities from text descriptions alone, without lengthy and expensive customization. This benefits designers who want to create vogue styles quickly and manufacturers who want to produce personalized products at a lower cost. Compared to traditional GAN-based methods, Stable Diffusion is capable of producing more stable and photo-realistic images that match the distribution of the original image. The model can be conditioned on a wide range of inputs, such as text for text-to-image generation, bounding boxes for layout-to-image generation, masked images for in-painting, and lower-resolution images for super-resolution. Diffusion models have a wide range of business applications, and their practical uses continue to evolve in industries such as fashion, retail and e-commerce, entertainment, social media, and marketing.

![image](./imgs/sdv2.png)

## Setup

Install needed packages and toolkits

In [None]:
!pip install -q sagemaker transformers --upgrade

## Upload a fashion image

We begin by having the user upload a fashion image, and then download and extract the pre-trained CLIPSeg model. The image is normalized and resized to comply with the size limit: Stable Diffusion V2 supports image resolutions up to 768x768, while V1 supports up to 512x512.

In [None]:
import PIL.Image
import matplotlib
from torchvision import transforms

# Use Stable Diffusion V2 size 
img_width = 768
img_height = 768

# Image transformation: convert the PIL image to a (C, H, W) tensor, normalize it
# with the ImageNet mean and standard deviation, then resize it.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    transforms.Resize((img_height, img_width)),
])

# Open an image file
orig_image_filename = "./imgs/fashion_model.png"
image_filename = "./imgs/fashion_model_resize.png"

# PIL's resize expects (width, height)
orig_image = PIL.Image.open(orig_image_filename).convert("RGB").resize((img_width, img_height))

# Save the resized image for inference
orig_image.save(image_filename)

# Convert the PIL.Image.Image to a torch.Tensor and add a batch dimension.
img = transform(orig_image).unsqueeze(0)

# Display the resized image
orig_image
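
The cell above forces every upload to 768x768, which stretches photos that are not square. As a rough sketch of an alternative, the hypothetical helper `fit_size` below (not part of the original notebook) shrinks an image so that its longest side fits the limit and rounds both sides down to a multiple of 8, which Stable Diffusion pipelines in diffusers expect. The rest of this post passes a single `image_length` for both height and width, so using non-square sizes would require adapting the inference handler as well.

```python
def fit_size(width, height, max_side=768, multiple=8):
    """Shrink (width, height) to fit within max_side, keeping the aspect ratio.

    Hypothetical helper for illustration; both sides are rounded down to a
    multiple of `multiple` and never drop below one multiple.
    """
    longest = max(width, height)
    if longest > max_side:
        # Integer arithmetic avoids floating-point rounding surprises.
        width = width * max_side // longest
        height = height * max_side // longest
    new_w = max(multiple, width // multiple * multiple)
    new_h = max(multiple, height // multiple * multiple)
    return new_w, new_h


# A portrait photo is scaled down; a small landscape photo is left as is.
print(fit_size(1200, 1800))
print(fit_size(640, 480))
```

The returned pair can be passed to `PIL.Image.Image.resize`, which takes `(width, height)`.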

## Semantic segmentation using CLIPSeg

CLIPSeg builds on CLIP, whose text and image encoders embed textual and visual information into a shared multimodal space, to perform semantic segmentation based on a text prompt. The architecture has two main parts: the CLIP encoders and a lightweight transformer-based decoder. The text encoder converts the prompt into a text embedding, while the image encoder processes the image and exposes its intermediate activations. The decoder combines those activations with the prompt embedding, which acts as a conditioning vector, to produce the final segmentation mask.

In terms of data flow, the model is trained on a dataset of images and corresponding text prompts, where the text prompts describe the target object to be segmented. During training, only the decoder is optimized to map a prompt and an image to a segmentation mask; the CLIP weights stay frozen, which is why the checkpoint loaded below stores decoder weights only. Once the model is trained, it can take in a new text prompt and image and produce a segmentation mask for the object described in the prompt.


In [None]:
from models.clipseg import CLIPDensePredT
#from matplotlib import pyplot as plt

# Clean up any old files
import os
for myfile in ["weights/rd16-uni.pth", "weights/rd64-uni.pth", "weights/rd64-uni-refined.pth"]:
    if os.path.isfile(myfile):
        os.remove(myfile)

# Download pre-trained CLIPSeg model
! wget https://owncloud.gwdg.de/index.php/s/ioHbRzFx6th32hn/download -O weights.zip
! unzip -d weights -j weights.zip

In [None]:
import torch

# Load the model. Available CLIP backbones = ['RN50', 'RN101', 'RN50x4', 'RN50x16', 'RN50x64', 'ViT-B/32', 'ViT-B/16', 'ViT-L/14', 'ViT-L/14@336px']
model = CLIPDensePredT(version='ViT-B/16', reduce_dim=64)
model.eval();

# Non-strict, because the checkpoint stores only the decoder weights (not the CLIP weights).
# Load onto the CPU so the cell also runs without a GPU; the model and the input tensor both stay on the CPU.
model.load_state_dict(torch.load('weights/rd64-uni.pth', map_location=torch.device('cpu')), strict=False);

Use a text prompt to guide image segmentation.

In [None]:
from matplotlib import pyplot as plt

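# Text prompt for segmentation
segmentation_prompt = 'Get the skirt only.'

# File that will hold the predicted mask
mask_image_filename = './imgs/a_dress.png'

# Predict; only the first prediction in the batch is saved below
with torch.no_grad():
    preds = model(img.repeat(4,1,1,1), segmentation_prompt)[0]
# Map the logits to [0, 1] with the standard normal CDF and save the result as the mask
#plt.imsave(mask_image_filename,torch.mul(torch.sigmoid(preds[0][0]), 5))
plt.imsave(mask_image_filename,torch.special.ndtr(preds[0][0]))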

In [None]:
# Display the mask image
mask_image = PIL.Image.open(mask_image_filename).resize((img_width, img_height))
mask_image
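
The segmentation cell maps raw logits to the range [0, 1] with the standard normal CDF (`torch.special.ndtr`) and keeps a sigmoid variant commented out. The sketch below, written with plain Python floats rather than tensors, compares the two squashing functions and shows one way to turn the soft mask into a hard one. The 0.5 threshold and the helper names are assumptions made for illustration; white (255) marks the region to repaint, which is the convention used by Stable Diffusion in-painting.

```python
import math


def sigmoid(x):
    """Logistic function, the scalar counterpart of torch.sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def ndtr(x):
    """Standard normal CDF, the scalar counterpart of torch.special.ndtr."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def binarize(rows, squash=ndtr, threshold=0.5):
    """Turn a 2-D list of logits into a 0/255 mask (hypothetical helper)."""
    return [[255 if squash(v) >= threshold else 0 for v in row] for row in rows]


# Toy 2x2 "logit map"; real CLIPSeg output has one logit per pixel.
logits = [[-3.0, -0.5], [0.5, 3.0]]
for v in (-3.0, -0.5, 0.5, 3.0):
    print(f"{v:+.1f}  sigmoid={sigmoid(v):.3f}  ndtr={ndtr(v):.3f}")
print(binarize(logits))
```

Both functions cross 0.5 at a logit of zero, so they yield the same hard mask at that threshold. The normal CDF saturates faster, which gives the soft mask crisper edges.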

## Fine tune a pre-trained Stable Diffusion Inpainting model

Fine-tuning the text encoder of a pre-trained in-painting model together with the UNet requires about 22 GB of VRAM for 512x512 images, and more for 768x768. Ideally, fine-tuning samples should be resized to match the desired output resolution to avoid performance degradation. Training the text encoder produces more accurate details, such as model faces. One option is to run on a single Amazon EC2 g5.2xlarge instance, now available in 8 regions[9]; another is to use Hugging Face Accelerate to run the fine-tuning code across a distributed configuration. For additional memory savings, you can choose a sliced version of attention that performs the computation in steps instead of all at once: modify DreamBooth's training script train_dreambooth_inpaint.py to call the pipeline's enable_attention_slicing() function.

Accelerate is a library that lets the same fine-tuning code run across any distributed configuration. Hugging Face and Amazon introduced Hugging Face Deep Learning Containers (DLCs) to scale fine-tuning tasks across multiple GPUs and nodes. You can set up the launch configuration for Amazon SageMaker with a single CLI command.


In [None]:
# From your AWS account, install Accelerate with SageMaker support
!pip install "accelerate[sagemaker]" --upgrade

# Configure the launch configuration for Amazon SageMaker 
!accelerate config

# List and verify Accelerate configuration
!accelerate env

# If needed, modify the training script as follows so that the
# output is saved to S3:
#  - torch.save('/opt/ml/model')
#  + accelerator.save('/opt/ml/model')

To launch a fine-tuning job, verify Accelerate's configuration using the CLI[9], provide the necessary training arguments, and then run the following shell script.

In [None]:
%%writefile ./scripts/fine_tune_dreambooth.sh
# Instance images: custom images that represent the specific
#          concept for DreamBooth training. You should collect
#          high-quality images based on your use case.
# Class images: regularization images for the prior-preservation
#          loss to prevent overfitting. You should generate these
#          images directly from the base pre-trained model,
#          either on your own ahead of time or on the fly
#          when running the training script.
#
# You can get train_dreambooth_inpaint.py from the huggingface/diffusers repository.

export MODEL_NAME="stabilityai/stable-diffusion-2-inpainting"
export INSTANCE_DIR="/data/fashion/gowns/highres/"
export CLASS_DIR="/opt/data/fashion/generated_gowns/imgs"
export OUTPUT_DIR="/opt/model/diffuser/outputs/inpainting/"

accelerate launch train_dreambooth_inpaint.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --train_text_encoder \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --output_dir=$OUTPUT_DIR \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="A supermodel poses in long summer travel skirt, photorealistic" \
  --class_prompt="A supermodel poses in skirt, photorealistic" \
  --resolution=512 \
  --train_batch_size=1 \
  --use_8bit_adam \
  --gradient_checkpointing \
  --learning_rate=2e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=200 \
  --max_train_steps=800


The fine-tuned in-painting model generates images that are more specific to the fashion class described by the text prompt. Because it has been fine-tuned with a set of high-resolution images and text prompts, the model can generate images that are more tailored to the class, such as formal evening gowns. In general, the more specific the class and the more data used for fine-tuning, the more accurate and realistic the output images tend to be.
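
The `--with_prior_preservation` and `--prior_loss_weight` flags in the training script combine two terms: a loss on the instance images and a loss on the class (regularization) images. The sketch below illustrates that weighting on plain Python lists. It is a simplified stand-in, not the training script itself, which computes the same kind of sum on noise-prediction tensors; the function names are hypothetical.

```python
def mse(pred, target):
    """Mean squared error between two equally long lists of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def dreambooth_loss(instance_pred, instance_target,
                    class_pred, class_target, prior_loss_weight=1.0):
    """Instance loss plus the weighted prior-preservation loss (illustrative)."""
    instance_loss = mse(instance_pred, instance_target)
    prior_loss = mse(class_pred, class_target)
    return instance_loss + prior_loss_weight * prior_loss


# Toy numbers: the class term pulls the model back toward the base model's behavior.
with_prior = dreambooth_loss([0.2, 0.4], [0.0, 0.0], [1.0, 1.0], [0.5, 0.5], 1.0)
without_prior = dreambooth_loss([0.2, 0.4], [0.0, 0.0], [1.0, 1.0], [0.5, 0.5], 0.0)
print(with_prior, without_prior)
```

A weight of 0 reduces the objective to the instance loss alone, which is where overfitting to the few instance images becomes more likely.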

In [None]:
!tree -d ./jumpstart-examples-finetuned-stable-diffusion-v2-1-inpainting

    jumpstart-examples-finetuned-stable-diffusion-v2-1-inpainting
    ├── 512-inpainting-ema.ckpt
    ├── feature_extractor
    ├── code
    │   ├── inference.py
    │   └── requirements.txt
    ├── scheduler
    ├── text_encoder 
    ├── tokenizer
    ├── unet
    └── vae


## Deploy fine-tuned in-painting model using SageMaker for inference

Using Amazon SageMaker, the fine-tuned Stable Diffusion models can be deployed for real-time inference. To upload the model to Amazon S3 for deployment, a model.tar.gz archive must be created. Ensure the archive directly includes all files, not a folder that contains them. After the intermediate checkpoints are removed, the DreamBooth fine-tuning output folder should look like the directory listing above.

The first step in creating our inference handler is to write the "inference.py" file. This file loads the model and handles all incoming inference requests. The "model_fn" function is called once, when the endpoint starts, to load the model. The "predict_fn" function is called for every inference request. Additionally, the "decode_base64" function converts a base64-encoded string from the JSON payload into a PIL image.
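
Before wiring the handler to a GPU, the request-parsing part can be tried in isolation. The sketch below mirrors the payload keys and default values used by the handler in this post ("inputs", "input_img", "mask_img", and the generation parameters), but `parse_request` is a hypothetical helper and the image bytes are placeholders. Unlike the handler, it raises a `KeyError` when a required key is missing.

```python
import base64
import json


def parse_request(data):
    """Split a request dict into prompt, image bytes, and generation parameters."""
    data = dict(data)  # copy so the caller's dict is not mutated
    prompt = data.pop("inputs")
    input_img = base64.b64decode(data.pop("input_img"))
    mask_img = base64.b64decode(data.pop("mask_img"))
    params = {
        "num_inference_steps": data.pop("num_inference_steps", 25),
        "guidance_scale": data.pop("guidance_scale", 6.5),
        "num_images_per_prompt": data.pop("num_images_per_prompt", 2),
    }
    return prompt, input_img, mask_img, params


# Stand-in bytes; a real request would carry PNG or JPEG file contents.
body = json.dumps({
    "inputs": "a long summer skirt",
    "input_img": base64.b64encode(b"fake-image-bytes").decode(),
    "mask_img": base64.b64encode(b"fake-mask-bytes").decode(),
    "guidance_scale": 7.0,
})
prompt, input_img, mask_img, params = parse_request(json.loads(body))
print(prompt, params)
```

Keys that the client omits fall back to the defaults, so a request only needs to send the prompt and the two images.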


In [None]:
%%writefile scripts/inference.py
import base64
import torch
from PIL import Image
from io import BytesIO
from diffusers import EulerDiscreteScheduler, StableDiffusionInpaintPipeline

def decode_base64(base64_string):
    decoded_string = BytesIO(base64.b64decode(base64_string))
    img = Image.open(decoded_string)
    return img

def model_fn(model_dir):
    # Load stable diffusion and move it to the GPU
    scheduler = EulerDiscreteScheduler.from_pretrained(model_dir, subfolder="scheduler")
    pipe = StableDiffusionInpaintPipeline.from_pretrained(model_dir, 
                                                   scheduler=scheduler,
                                                   revision="fp16",
                                                   torch_dtype=torch.float16)
    pipe = pipe.to("cuda")
    pipe.enable_xformers_memory_efficient_attention()
    #pipe.enable_attention_slicing()
    return pipe


def predict_fn(data, pipe):
    # get prompt & parameters
    prompt = data.pop("inputs", data) 
    # The images must arrive as base64-encoded strings in the JSON payload.
    input_img = data.pop("input_img", data)
    mask_img = data.pop("mask_img", data)
    # set valid hyperparameters for Stable Diffusion
    num_inference_steps = data.pop("num_inference_steps", 25)
    guidance_scale = data.pop("guidance_scale", 6.5)
    num_images_per_prompt = data.pop("num_images_per_prompt", 2)
    image_length = data.pop("image_length", 512)
    # run generation with parameters
    generated_images = pipe(
        prompt,
        image = decode_base64(input_img),
        mask_image = decode_base64(mask_img),
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale,
        num_images_per_prompt=num_images_per_prompt,
        height=image_length,
        width=image_length,
    #)["images"] # for Stable Diffusion v1.x
    ).images
    
    # create response
    encoded_images = []
    for image in generated_images:
        buffered = BytesIO()
        image.save(buffered, format="JPEG")
        encoded_images.append(base64.b64encode(buffered.getvalue()).decode())
        
    return {"generated_images": encoded_images}


Next, create the model.tar.gz archive and upload it to an Amazon S3 bucket. As noted above, the archive must contain the files directly rather than a folder that holds them. The helper below adds each item in the model directory at the root of the archive:

In [None]:
import tarfile
import os
from pathlib import Path
import sagemaker

# Assumption: the SageMaker session and the local model directory are not
# defined earlier in this post; adjust both to your environment.
sess = sagemaker.Session()
model_tar = Path("./jumpstart-examples-finetuned-stable-diffusion-v2-1-inpainting")

# helper to create the model.tar.gz
def compress(tar_dir=None, output_file="model.tar.gz"):
    # Add every item at the root of the archive, without changing the working directory
    with tarfile.open(output_file, "w:gz") as tar:
        for item in os.listdir(tar_dir):
            print(item)
            tar.add(os.path.join(tar_dir, item), arcname=item)

compress(str(model_tar))

# After creating the model.tar.gz archive, upload it to Amazon S3. We use the
# sagemaker SDK to upload the model to our sagemaker session bucket.
from sagemaker.s3 import S3Uploader

# upload model.tar.gz to s3
s3_model_uri = S3Uploader.upload(
    local_path="model.tar.gz",
    desired_s3_uri=f"s3://{sess.default_bucket()}/jumpstart-examples-finetuned-stable-diffusion-v2-1-inpainting",
)
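
Because a wrongly nested archive only fails once the endpoint starts, it can be worth checking the layout locally before uploading. The sketch below builds a tiny stand-in model directory in a temporary folder, archives it the same way as the helper above, and lists the top-level entries. The file names are placeholders and `top_level_entries` is a hypothetical helper.

```python
import os
import tarfile
import tempfile


def top_level_entries(archive_path):
    """Return the sorted first path components found in a tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted({member.name.split("/")[0] for member in tar.getmembers()})


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the fine-tuned model folder (placeholder files only).
    model_dir = os.path.join(tmp, "model")
    os.makedirs(os.path.join(model_dir, "code"))
    os.makedirs(os.path.join(model_dir, "unet"))
    open(os.path.join(model_dir, "code", "inference.py"), "w").close()
    open(os.path.join(model_dir, "model_index.json"), "w").close()

    # Archive each item at the root, as the compress helper does.
    archive = os.path.join(tmp, "model.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        for item in os.listdir(model_dir):
            tar.add(os.path.join(model_dir, item), arcname=item)

    entries = top_level_entries(archive)
    print(entries)
```

If the list contains a single folder named after the model directory, the archive was built one level too high and should be recreated.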

Once the model archive is uploaded, we can deploy it on Amazon SageMaker using HuggingFaceModel for real-time inference. You can host the endpoint on a g4dn.xlarge instance, which is equipped with a single NVIDIA Tesla T4 GPU with 16 GB of VRAM. Autoscaling can be activated to handle varying traffic demands. For information on incorporating autoscaling in your endpoint, see the article "Going Production: Auto-scaling Hugging Face Transformers with Amazon SageMaker".

In [None]:
from sagemaker.huggingface.model import HuggingFaceModel
import sagemaker

# Get role Arn which has proper sagemaker execution permissions
role = sagemaker.get_execution_role()

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data=s3_model_uri,      # path to your model and script
   role=role,                    # iam role with permissions to create an Endpoint
   transformers_version="4.17",  # transformers version used
   pytorch_version="1.10",       # pytorch version used
   py_version='py38',            # python version used
)

# deploy the endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge"
    )
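
For the autoscaling mentioned above, SageMaker endpoints are scaled through the Application Auto Scaling service. The sketch below only builds the two request dictionaries; the capacity limits, the target of 10 invocations per instance, the policy name, and the helper name are illustrative assumptions. "AllTraffic" is assumed to be the variant name, which is the SageMaker SDK default.

```python
def autoscaling_requests(endpoint_name, variant_name="AllTraffic",
                         min_capacity=1, max_capacity=4,
                         invocations_per_instance=10.0):
    """Build the scalable-target and scaling-policy requests (hypothetical helper)."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }
    policy = {
        "PolicyName": f"{endpoint_name}-invocations-scaling",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
        },
    }
    return target, policy


# "my-endpoint" is a placeholder; in this notebook it would be predictor.endpoint_name.
target, policy = autoscaling_requests("my-endpoint")
print(target["ResourceId"])

# To apply the configuration (not executed here):
#   client = boto3.client("application-autoscaling")
#   client.register_scalable_target(**target)
#   client.put_scaling_policy(**policy)
```

Image generation requests are slow compared to typical inference calls, so a low invocations-per-instance target is a reasonable starting point to tune from.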

The huggingface_model.deploy() method returns a HuggingFacePredictor object that can be used to request inference. The endpoint requires a JSON payload with an "inputs" key, which holds the text prompt, plus "input_img" and "mask_img" keys, which hold the original image and the mask as base64-encoded strings. You can also control the generation with parameters such as "num_inference_steps", "guidance_scale", and "num_images_per_prompt". The predictor.predict() function returns a JSON with a "generated_images" key, which holds the generated images as base64-encoded strings. We added helper functions to work with the payload and the response: encode_base64 converts an image file to a base64 string, decode_base64_image decodes a base64-encoded string and returns a PIL.Image object, and display_images and image_grid display a list of PIL.Image objects.

In [None]:
import base64
from io import BytesIO

# Encode an image file as a base64 string for the JSON payload
def encode_base64(file_name):
    with open(file_name, "rb") as image:
        image_string = base64.b64encode(bytearray(image.read())).decode()
    return image_string

# Decode a base64 string from the JSON response into an image
def decode_base64_image(base64_string):
    decoded_string = BytesIO(base64.b64decode(base64_string))
    img = PIL.Image.open(decoded_string)
    return img

# display PIL images as grid
def display_images(images=None,columns=3, width=100, height=100):
    plt.figure(figsize=(width, height))
    for i, image in enumerate(images):
        plt.subplot(int(len(images) / columns + 1), columns, i + 1)
        plt.axis('off')
        plt.imshow(image)
       
# Display images in a row/col grid
def image_grid(imgs, rows, cols):
    assert len(imgs) == rows*cols
    w, h = imgs[0].size
    grid = PIL.Image.new('RGB', size=(cols*w, rows*h))
    grid_w, grid_h = grid.size
    
    for i, img in enumerate(imgs):
        grid.paste(img, box=(i%cols*w, i//cols*h))
    return grid
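
The paste position in `image_grid` comes from simple index arithmetic: the column is `i % cols` and the row is `i // cols`. The sketch below reproduces that arithmetic without PIL so the layout can be checked by hand. `grid_boxes` and `rows_needed` are hypothetical helpers, not part of the notebook.

```python
def grid_boxes(n_images, cols, tile_w, tile_h):
    """Top-left (x, y) paste position of each image in a row-major grid."""
    return [((i % cols) * tile_w, (i // cols) * tile_h) for i in range(n_images)]


def rows_needed(n_images, cols):
    """Smallest number of rows that fits n_images in a grid with cols columns."""
    return (n_images + cols - 1) // cols


# Six 100x50 tiles in three columns fill two rows.
print(grid_boxes(6, 3, 100, 50))
print(rows_needed(6, 3))
```

Note that `image_grid` asserts that the number of images equals `rows * cols` exactly, so a partially filled last row needs padding or a different row count.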

Let's move forward with the in-painting task, using the input image and the mask created with CLIPSeg from the text prompt discussed previously. Producing three images is estimated to take roughly 15 seconds; the cell below requests four.

In [None]:
import base64

num_images_per_prompt = 4
prompt = "A female super-model poses in a casual long vacation skirt, with full body length, bright multiple striped colors,  photorealistic, high quality, highly detailed, elegant, sharp focus"

# Convert both original and mask images to string variables
encoded_input_image = encode_base64(image_filename)
encoded_mask_image = encode_base64(mask_image_filename)


# Set in-painting parameters
guidance_scale = 6.7
num_inference_steps = 45

# run prediction, passing the parameters set above to the endpoint
response = predictor.predict(data={
  "inputs": prompt,
  "input_img": encoded_input_image,
  "mask_img": encoded_mask_image,
  "num_inference_steps": num_inference_steps,
  "guidance_scale": guidance_scale,
  "num_images_per_prompt" : num_images_per_prompt,
  "image_length": img_width
  }
)

# decode images
decoded_images = [decode_base64_image(image) for image in response["generated_images"]]

# visualize generation
display_images(decoded_images, columns=num_images_per_prompt, width=100, height=100)

(Optional) The in-painted images can be displayed along with the original image for visual comparison. Additionally, the in-painting process can be tuned with parameters such as guidance_scale, which controls how strongly the generated image follows the text prompt. This allows the user to adjust the output image and achieve the desired results.
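
In Stable Diffusion pipelines, guidance_scale is the weight used in classifier-free guidance: at each denoising step the model predicts noise twice, once with the text prompt and once without it, and the two predictions are combined. The sketch below shows that combination on plain lists of floats. It is a simplified illustration with a hypothetical function name; the real pipeline applies the same formula to tensors.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward (and past) the text-conditioned one."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


uncond = [0.0, 1.0]  # toy prediction without the prompt
text = [1.0, 3.0]    # toy prediction with the prompt

print(apply_guidance(uncond, text, 1.0))  # equals the text-conditioned prediction
print(apply_guidance(uncond, text, 2.0))  # extrapolates beyond it
```

Higher values push the result closer to the prompt at the cost of variety; very high values tend to produce artifacts.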

In [None]:
# insert initial image in the list so we can compare side by side
image = PIL.Image.open(image_filename).convert("RGB")
decoded_images.insert(0, image)
                       
# Display inpainting images in grid
image_grid(decoded_images, 1, num_images_per_prompt + 1)


## Clean up

Finally, delete the inference endpoint when it is no longer needed to avoid incurring costs.

In [None]:
predictor.delete_endpoint()