# Building a Personalized Avatar Using Generative AI

Generative AI has become a popular tool for enhancing and accelerating the creative process across various industries, including entertainment, advertising, and art. It enables more personalized experiences for audiences and improves the overall quality of the final products. 

In this notebook, we will demonstrate how you can use generative AI models like Stable Diffusion (SD) to build a personalized avatar generator on an Amazon SageMaker notebook instance.

<div class="alert alert-warning">
<b>Warning</b>: You should run this notebook on a SageMaker Notebook Instance. A GPU instance such as `ml.g5.2xlarge` is recommended. This notebook was tested on the `conda_python_p310` kernel. 
</div>
---

The entire example takes about 1 hour to complete. Here is the cost breakdown:

- `ml.g5.2xlarge` instance is $1.52 per hour

## Set up the environment
Install the dependencies required to package the model and test the fine-tuned model.

In [None]:
!pip install -Uq diffusers==0.21.4
!pip install -Uq accelerate==0.22.0
!pip install -Uq peft==0.4.0
!pip install -Uq conda-pack==0.7.1
!pip install -Uq gradio==3.41.2
!pip install -Uq autocrop==1.3.0
!pip install -Uq datasets
!pip install -Uq bitsandbytes

Check the diffusers version

In [None]:
import diffusers

# Check the diffusers version; it should be 0.21.4
diffusers.__version__

## Prepare the images

The sample images are provided in the `data` folder. You can replace them with your own images. Include photos with different facial expressions, such as smiling, frowning, and a neutral expression; a mix of expressions allows the model to better reproduce your unique facial features. The quality of the input images determines the quality of the avatars you can generate.

The accepted formats are `.jpg` or `.png`. 
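As a minimal sketch of the format rule above, the helper below lists only the files in a folder whose extension is accepted. The name `list_input_images` is hypothetical, and treating `.jpeg` the same as `.jpg` is an assumption; the notebook itself selects files with glob patterns in the preprocessing cell.

```python
from pathlib import Path

# Assumption: ".jpeg" is treated the same as ".jpg"; matching is case-insensitive.
ACCEPTED_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_input_images(folder):
    """Return the files in `folder` with an accepted extension, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in ACCEPTED_SUFFIXES
    )
```

Calling `list_input_images("data")` before preprocessing is a quick way to confirm which of your own photos will be picked up.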

<img src="statics/input_examples.jpg">

To help the model focus on the facial features, we implement a preprocessing step that uses computer vision techniques to detect faces and center-crop them from the images. This relieves the user of the burden of curating perfect images for the model. 

The preprocessing code is in `utils.py` where we first use a face detection model to isolate the largest face in each image. Then we crop and pad the image to the required size of 512 x 512 pixels for our model. 


<img src="statics/prepare_images.jpg" alt="image" width="300" height="auto">
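The crop step described above can be sketched in a few lines of plain Python. This is not the code in `utils.py`; `pick_largest_face`, `square_crop_box`, and the `margin` parameter are hypothetical names used only for illustration, and the face boxes are assumed to come from a face detector as `(left, top, right, bottom)` tuples.

```python
def pick_largest_face(face_boxes):
    """Return the detection with the largest area."""
    return max(face_boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))


def square_crop_box(face_box, image_size, margin=0.5):
    """Return a square (left, top, right, bottom) crop centred on the face.

    face_box   -- (left, top, right, bottom) of the detected face
    image_size -- (width, height) of the source image
    margin     -- extra context around the face, as a fraction of its size
    """
    left, top, right, bottom = face_box
    width, height = image_size
    cx, cy = (left + right) / 2, (top + bottom) / 2
    side = max(right - left, bottom - top) * (1 + margin)
    side = min(side, width, height)  # the square cannot exceed the image
    half = side / 2
    # Shift the centre so the square stays inside the image bounds.
    cx = min(max(cx, half), width - half)
    cy = min(max(cy, half), height - half)
    return (round(cx - half), round(cy - half), round(cx + half), round(cy + half))
```

The resulting square region would then be resized (and padded if necessary) to 512 x 512 pixels before training.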

In [None]:
from pathlib import Path
from itertools import chain
import utils
import shutil

imag_dir=Path("data")
dest_dir = Path("cropped")
dest_dir.mkdir(parents=True, exist_ok=True)

for n,img_path in enumerate(chain(imag_dir.glob("*.[jJ][pP]*[Gg]"),imag_dir.glob("*.[Pp][Nn][Gg]"))):
    try:
        cropped = utils.detect_face_and_resize(img_path.as_posix())
        cropped.save(dest_dir / f"image_{n}.png")
    except ValueError:
        print(f"Could not detect face in {img_path}. Skipping.")
        continue

print("Here are the preprocessed images ==========")
[x.as_posix() for x in dest_dir.iterdir() if x.is_file()]

## Training A Stable Diffusion Model

[DreamBooth](https://arxiv.org/abs/2208.12242) is a method to personalize text-to-image models like Stable Diffusion given just a few images of a subject. 

[Low-Rank Adaptation (LoRA)](https://arxiv.org/abs/2106.09685) is a parameter-efficient fine-tuning technique that adapts pretrained models by adding pairs of rank-decomposition matrices to existing weights and training only those newly added weights. This has several advantages:

- Previous pretrained weights are kept frozen so that the model is not prone to catastrophic forgetting
- Rank-decomposition matrices have significantly fewer parameters than the original model, which means that trained LoRA weights are easily portable.
- LoRA attention layers let you control the extent to which the model is adapted toward the new training images via a scale parameter.
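To make these points concrete, here is a toy sketch in plain Python. It counts the parameters of a full weight matrix versus its LoRA pair, and shows how a scale parameter blends the low-rank update `B @ A` into the frozen weight `W`. The layer sizes and matrices are illustrative assumptions, not values taken from Stable Diffusion, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_param_counts(d_out, d_in, rank):
    """Return (full weight parameters, LoRA parameters) for one linear layer."""
    return d_out * d_in, rank * (d_out + d_in)


def merged_weight(w, b, a, scale):
    """Return W + scale * (B @ A), the effective weight with the adapter applied."""
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Frozen 2x2 weight and a rank-1 adapter (B is 2x1, A is 1x2).
w = [[1, 0], [0, 1]]
b = [[1], [2]]
a = [[3, 4]]
print(lora_param_counts(320, 320, 4))   # full layer vs. rank-4 adapter
print(merged_weight(w, b, a, 0.5))      # adapter applied at half strength
```

With a scale of 0 the merged weight equals the frozen weight, which is why the pretrained behavior can always be recovered.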

The `train_dreambooth_lora.py` script shows how to implement the training procedure with DreamBooth and LoRA for Stable Diffusion. More implementations are available in the [diffusers examples](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth).

In [None]:
import shutil

output_dir = Path("output")
output_dir.mkdir(exist_ok=True)
lora_dir = output_dir / "lora"
lora_dir.mkdir(exist_ok=True)

Here are example parameters you can configure for fine-tuning. More parameters and their definitions are available in `train_dreambooth_lora.py`.

| Parameter | Definition |
|-|-|
| base_model | Path to pretrained model or model identifier from huggingface.co/models. |
| max_train_steps | Total number of training steps to perform. If provided, overrides num_train_epochs. |
| instance_prompt | The prompt with the identifier specifying the instance. |
| validation_prompt | A prompt that is used during validation to verify that the model is learning. |
| learning_rate | Initial learning rate to use (after the potential warmup period). |
| class_prompt | The prompt to specify images in the same class as the provided instance images. |
| class_data_dir | A folder containing the training data of class images. |

In [None]:
base_model = "stabilityai/stable-diffusion-2-1"
n_steps = 1000
instance_prompt = "photo of <<TOK>>"
validation_prompt = "photo of <<TOK>> sleeping on the couch"
learning_rate = 1e-4
class_prompt = "a photo of person"
class_data_dir = Path("/tmp/priors")

Remove `.ipynb_checkpoints` files

In [None]:
!rm -rf `find -type d -name .ipynb_checkpoints`

Here is the command to kick off the training. When training large AI models like Stable Diffusion, GPU memory becomes a key constraint. Here are some ways to help reduce memory usage during training:

- Use gradient checkpointing: this method reduces memory by only storing a subset of activations during the forward pass, and recomputing them as needed during backpropagation. This trades off compute for memory. 

- Use quantization: 8-bit optimizers like those from bitsandbytes store the optimizer state (for example, Adam's moment estimates) in 8-bit instead of 32-bit during training. This reduces memory usage but may hurt precision. To help with that, bitsandbytes' optimizers automatically keep the state of small, sensitive parameters in 32-bit. 

- Use xFormers, a library for memory-efficient attention calculation. Attention is often the memory bottleneck in transformer-based models such as Stable Diffusion. Memory-efficient attention avoids materializing the full attention matrix, which reduces the memory of self-attention from O(n^2) to roughly O(n) in the sequence length and often improves speed as well.
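To illustrate how attention can avoid storing every score at once, here is a plain-Python sketch for a single query. It is not the xFormers implementation, which uses fused GPU kernels; it only shows the running-softmax idea, and the usual scaling by the square root of the dimension is omitted for brevity. Both function names are hypothetical.

```python
import math


def naive_attention(q, keys, values):
    """Attention for one query: materializes every score before the softmax."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def streaming_attention(q, keys, values):
    """Same result, keeping only a running max, running sum, and running output."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        score = sum(a * b for a, b in zip(q, k))
        new_max = max(running_max, score)
        # Rescale everything accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        weight = math.exp(score - new_max)
        running_sum = running_sum * correction + weight
        acc = [a * correction + weight * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

The streaming version still visits every key, so the amount of computation is unchanged; what shrinks is the memory held per query, which no longer grows with the number of keys.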

In [None]:
%%time
import subprocess
import shlex

command = f"""
accelerate launch train_dreambooth_lora.py \
  --pretrained_model_name_or_path={base_model}  \
  --train_text_encoder \
  --instance_data_dir={dest_dir} \
  --class_data_dir={class_data_dir} \
  --output_dir={output_dir / "lora"} \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="{instance_prompt}" \
  --class_prompt="{class_prompt}" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --learning_rate={learning_rate} \
  --lr_scheduler="constant" \
  --lr_warmup_steps=100 \
  --max_train_steps={n_steps} \
  --num_class_images=200 \
  --mixed_precision fp16 \
  --use_8bit_adam \
  --gradient_checkpointing \
  --validation_prompt="{validation_prompt}" \
  --validation_epochs=50 \
  --seed=0 \
"""

print(command)

with open(output_dir / "lora/train.sh", "w") as f:
    command_s = " ".join(command.split())
    f.write(command_s)

res = subprocess.run(shlex.split(command))

print(res)
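As a rough sanity check on the settings above, the sketch below works out the effective batch size and the number of instance samples seen during training. It assumes that `max_train_steps` counts optimizer updates and that a single process is used; `training_budget` is a hypothetical helper, not part of the training script.

```python
def training_budget(max_train_steps, train_batch_size,
                    gradient_accumulation_steps, num_processes=1):
    """Return the effective batch size and total instance samples seen.

    Assumes max_train_steps counts optimizer updates, so each step consumes
    train_batch_size * gradient_accumulation_steps samples per process.
    """
    effective = train_batch_size * gradient_accumulation_steps * num_processes
    return {
        "effective_batch_size": effective,
        "instance_samples_seen": effective * max_train_steps,
    }


# Values from the training command above.
print(training_budget(max_train_steps=1000, train_batch_size=1,
                      gradient_accumulation_steps=4))
```

Dividing the samples seen by the number of cropped input images gives an approximate count of passes over your photos, which can help when deciding whether to raise or lower the step count.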

## Prior preservation in DreamBooth

Prior preservation is a technique used to avoid overfitting and language-drift. For prior preservation, you use other images of the same class as part of the training process. The nice thing is that you can generate those images using the Stable Diffusion model itself! The training script will save the generated images to a local path you specify.
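A minimal sketch of how the two terms combine, assuming a mean-squared-error loss on the model's predictions: the loss on the instance images is added to the loss on the class images, weighted by the value passed as `--prior_loss_weight`. The short lists below are toy stand-ins for real predictions and targets, and the function names are hypothetical.

```python
def mse(pred, target):
    """Mean squared error between two equally sized lists of numbers."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def dreambooth_loss(instance_pred, instance_target,
                    class_pred, class_target, prior_loss_weight=1.0):
    """Instance loss plus the weighted prior-preservation loss."""
    instance_loss = mse(instance_pred, instance_target)
    prior_loss = mse(class_pred, class_target)
    return instance_loss + prior_loss_weight * prior_loss


# Toy numbers only: instance loss 0.5, prior loss 2.0, weight 0.5.
print(dreambooth_loss([1.0, 2.0], [0.0, 2.0], [0.0, 0.0], [2.0, 0.0], 0.5))
```

A larger weight pulls the model harder toward what it already knows about the class, at the cost of fitting the subject less closely.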

Let's check out some of the class images generated by Stable Diffusion.

In [None]:
# Check the prior preservation images that were generated

from IPython.display import Image, display
import random

img_paths = [x for x in class_data_dir.iterdir() if x.is_file()]

random_img = random.choice(img_paths)

display(Image(filename=random_img))

## Test the fine-tuned model locally

---
Load the base Stable Diffusion model.

In [None]:
import diffusers
import torch 
from peft import PeftModel
import os

device="cuda"

pipe = diffusers.StableDiffusionPipeline.from_pretrained(base_model,
                                                         cache_dir='hf_cache',
                                                         torch_dtype=torch.float16,
                                                         revision="fp16")


pipe.to(device)

Generate an image using the base SD model, then attach the LoRA adapter and generate an image using the fine-tuned model.

In [None]:
images = []
prompt = """
photo of <<TOK>> front portrait, Pixar character, smiling, zoomed out, smooth skin, fun expression, tantalizing eyes, young and handsome, 4k
"""

negative_prompt = """
ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, extra limbs, disfigured, deformed, body out of frame, blurry, bad anatomy, blurred, 
watermark, grainy, signature, cut off, draft, amateur, multiple, gross, weird, uneven, furnishing, decorating, decoration, furniture, text, poor, low, basic, worst, juvenile, 
unprofessional, failure, crayon, oil, label, thousand hands
"""

seed = random.randint(1, 1000000000)
generator = [torch.Generator(device="cuda").manual_seed(seed)]

print(seed)

# Generate an image with base model
             
image = pipe(prompt, 
             num_inference_steps=50, 
             guidance_scale=7, 
             negative_prompt=negative_prompt,
             generator=generator).images[0]

images.append(image)


# Attach LoRA weights

pipe.load_lora_weights(output_dir / "lora", weight_name="pytorch_lora_weights.safetensors")

generator = [torch.Generator(device="cuda").manual_seed(seed)]

# Generate an image using fine tuned model with the same seed.

image = pipe(prompt, 
             num_inference_steps=50, 
             guidance_scale=7, 
             negative_prompt=negative_prompt,
             generator=generator).images[0]

images.append(image)

Render the images side by side

In [None]:
import matplotlib.pyplot as plt

# Plot images side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))
fig.suptitle(f"Prompt\n{prompt}")
ax1.imshow(images[0])
ax1.set_title("Base Model")
ax2.imshow(images[1])
ax2.set_title("Fine-tuned Model")
plt.show()

**Prompt Engineering**: We recommend starting with just `<<TOK>>` or `photo of <<TOK>>`, the identifier used to fine-tune the model. SD should associate your facial features with this identifier and produce an image that resembles you. If it does not, you may need to provide additional or better-quality images, or adjust the fine-tuning parameters.

In [None]:
# prompt = "<<TOK>>"
prompt = """photo of <<TOK>> epic portrait, young and handsome, with glasses, zoomed out, blurred background cityscape, bokeh, perfect symmetry, by artgem, artstation ,concept art,cinematic lighting, highly detailed, 
octane, concept art, sharp focus, rockstar games,
post processing, picture of the day, ambient lighting, epic composition"""
# prompt = """
# photo of <<TOK>> front portrait, Pixar character, shocked, zoomed out, smooth skin, fun expression, tantalizing eyes, young and handsome, 4k
# """

negative_prompt = """
ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, extra limbs, disfigured, deformed, body out of frame, blurry, bad anatomy, blurred, 
watermark, grainy, signature, cut off, draft, amateur, multiple, gross, weird, uneven, furnishing, decorating, decoration, furniture, text, poor, low, basic, worst, juvenile, 
unprofessional, failure, crayon, oil, label, thousand hands
"""
image = pipe(prompt, num_inference_steps=50, guidance_scale=7, negative_prompt=negative_prompt).images[0]
image

## Fuse the LoRA Adapter and Base Model

In [None]:
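fused_model_dir = output_dir / "fused_model"

# Create the directory if needed, then remove it so output from an earlier run is cleared;
# save_pretrained recreates it below
fused_model_dir.mkdir(parents=True, exist_ok=True)

shutil.rmtree(fused_model_dir)

# Merge the LoRA weights into the base model weights, then save the full pipeline
pipe.fuse_lora()
pipe.save_pretrained(fused_model_dir)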

Delete the original pipeline and free up GPU memory

In [None]:
del pipe
torch.cuda.empty_cache()

Load the fused model back and make sure it produces comparable results.

We again do a side-by-side comparison with the base Stable Diffusion model.

In [None]:
device="cuda"

pipe = diffusers.StableDiffusionPipeline.from_pretrained(base_model,
                                                         cache_dir='hf_cache',
                                                         torch_dtype=torch.float16,
                                                         revision="fp16")

pipe.to(device)

In [None]:
pipe2 = diffusers.StableDiffusionPipeline.from_pretrained(fused_model_dir,
                                                         cache_dir='hf_cache',
                                                         torch_dtype=torch.float16,
                                                         revision="fp16")

pipe2.to(device)

In [None]:
images = []

prompt = """
photo of <<TOK>> pencil sketch, handsome, face front, centered
"""

negative_prompt = """
ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, extra limbs, disfigured, deformed, body out of frame, blurry, bad anatomy, blurred, 
watermark, grainy, signature, cut off, draft, amateur, multiple, gross, weird, uneven, furnishing, decorating, decoration, furniture, text, poor, low, basic, worst, juvenile, 
unprofessional, failure, crayon, oil, label, thousand hands
"""

seed = random.randint(1, 1000000000)
generator = [torch.Generator(device="cuda").manual_seed(seed)]

print(seed)
             
image = pipe(prompt, 
             num_inference_steps=50, 
             guidance_scale=7, 
             negative_prompt=negative_prompt,
             generator=generator).images[0]

images.append(image)

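# Re-seed the generator so the fused model starts from the same noise as the base model
generator = [torch.Generator(device="cuda").manual_seed(seed)]

image = pipe2(prompt, 
             num_inference_steps=50, 
             guidance_scale=7, 
             negative_prompt=negative_prompt,
             generator=generator).images[0]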

images.append(image)

In [None]:
import matplotlib.pyplot as plt

# Plot images side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))
fig.suptitle(f"Prompt\n{prompt}")
ax1.imshow(images[0])
ax1.set_title("Base Model")
ax2.imshow(images[1])
ax2.set_title("Fused Model")
plt.show()

## Clean up

In [None]:
del pipe, pipe2

In [None]:
torch.cuda.empty_cache()