# Text to Image generation on SageMaker

In this notebook, you will learn how to fine-tune an existing Stable Diffusion model on SageMaker and deploy it for inference.

## 0. Setup

In [None]:
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.huggingface import HuggingFace
from sagemaker import get_execution_role

This notebook is purely educational and shows how to fine-tune latent-stable-diffusion on Amazon SageMaker. Neither the images produced nor the code represents Amazon or its views in any way. Before using this codebase, read the corresponding licenses from [CompVis](https://huggingface.co/spaces/CompVis/stable-diffusion-license) (the model) and [Conceptual Captions](https://huggingface.co/datasets/conceptual_captions) (the dataset, from Google, which you will load through Hugging Face).

This demo requires a g4dn.12xlarge or more powerful instance.

Model weights were provided by CompVis/stable-diffusion-v1-4. You can find the license, README, and more [here](https://huggingface.co/CompVis/stable-diffusion-v1-4). To download the weights, you need a Hugging Face account, must accept the terms at that link, and must then generate a user access token. These steps are beyond the scope of this notebook. Note that the finetune.py script is a lightly modified version of the one in [this pull request](https://github.com/huggingface/diffusers/pull/356).
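
Rather than pasting the token into a cell, one option is to read it from an environment variable. The sketch below assumes a variable named `HF_TOKEN` (a name chosen here for illustration, not something this notebook defines) and falls back to an interactive prompt when it is unset. The helper name `read_hf_token` is also hypothetical.

```python
import os
from getpass import getpass

def read_hf_token(env_var="HF_TOKEN"):
    """Return the token from the environment, or prompt for it if unset."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        # getpass hides the typed value so it does not end up in the cell output.
        token = getpass(f"{env_var} is not set; paste your Hugging Face token: ").strip()
    return token
```

You could then set `token_value = read_hf_token()` in the download cell instead of editing the placeholder string.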

You will install some libraries so that you can use stable-diffusion locally.

In [None]:
!pip install diffusers -q

In [None]:
!pip install transformers==4.21.0 -q
!pip install ftfy spacy -q

## 1. Download Model and Data
First, download the model.

In [None]:
import os
from diffusers import DiffusionPipeline
from transformers import CLIPFeatureExtractor, CLIPTextModel, CLIPTokenizer
token_value = 'INSERT TOKEN HERE'
force = False
if os.path.exists('sd-base-model') and (not force):
    d = './sd-base-model/'
else:
    d = "CompVis/stable-diffusion-v1-4"
    
model = DiffusionPipeline.from_pretrained(d, cache_dir=os.path.join(os.getcwd(),'base-model'),use_auth_token=token_value)

if d == "CompVis/stable-diffusion-v1-4":
    d = './sd-base-model'
    !rm -rf sd-base-model
    model.save_pretrained('./sd-base-model/')
    !rm -rf ./base-model
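
After the cell above runs, `./sd-base-model` should hold the saved pipeline. As a quick sanity check, the sketch below looks for `model_index.json`, the file that `save_pretrained` writes for a diffusers pipeline, and returns the component names it lists. The helper name `describe_saved_pipeline` is made up for this example.

```python
import json
from pathlib import Path

def describe_saved_pipeline(model_dir):
    """Return the component names listed in model_index.json, or None if the file is missing."""
    index_file = Path(model_dir) / "model_index.json"
    if not index_file.is_file():
        return None
    with index_file.open() as f:
        index = json.load(f)
    # Keys starting with "_" are metadata (for example "_class_name"); the rest are components.
    return sorted(key for key in index if not key.startswith("_"))
```

Calling `describe_saved_pipeline('./sd-base-model')` and getting `None` back would suggest the download or save step did not complete.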

And the dataset.

In [None]:
!pip install datasets -q

In [None]:
from datasets import load_dataset

data_name = 'conceptual_captions'
!rm -rf conceptual_captions
dataset = load_dataset("conceptual_captions")
!mkdir {data_name}
!cp -r ./sd-base-model {data_name}

In [None]:
#feel free to visualize the dataset below

In [None]:
df = dataset['train'].to_pandas()
df.head()

The following cell downloads the images (which the previous download did not include) and extracts a subset for training.

In [None]:
import pandas as pd
from PIL import Image

df = dataset['train'].to_pandas()
def download_file(url,index):
    import urllib.request
    urllib.request.urlretrieve(url,f'./{data_name}/{index}.jpg')
j = 0
indexes = []
images = set()
for i in range(len(df)):
    try:
        df.loc[i,'sm_key'] = f'/opt/ml/input/data/training/{j}.jpg'
        df.loc[i,'local_key'] = f'{j}.jpg'
        download_file(df.loc[i,'image_url'],j)
        img = Image.open(f'./{data_name}/{j}.jpg')  # raises if the file is not a readable image
        j += 1
        indexes.append(i)
    except Exception as e:
        print(f"file didn't download will continue {i}")
        print(e)
            
    if (j % 100 == 0) and (j>0):#You can change this to train on a larger dataset
        break
df = df.iloc[indexes,:]
df.to_parquet(f'./{data_name}/dataset.parquet')
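
`urlretrieve` takes no timeout argument, so a single unresponsive host can stall the loop above. A possible alternative, sketched here with a hypothetical helper `download_with_timeout`, streams the response through `urllib.request.urlopen` with an explicit timeout. The 10-second default is an arbitrary choice.

```python
import shutil
import urllib.request

def download_with_timeout(url, dest_path, timeout=10):
    """Download url to dest_path; blocking network operations give up after timeout seconds."""
    with urllib.request.urlopen(url, timeout=timeout) as response, open(dest_path, "wb") as out:
        # Copy the response body to disk in chunks instead of loading it all into memory.
        shutil.copyfileobj(response, out)
    return dest_path
```

Because a timeout raises an exception, this helper would slot into the existing `try`/`except` block in place of `download_file` without further changes.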

In [None]:
#Again feel free to run the following two cells to visualize a sample from the dataset.

In [None]:
d = load_dataset('parquet',data_dir=f'./{data_name}',data_files='dataset.parquet')

In [None]:
idx = 0
img,text = d['train'][idx]['local_key'],d['train'][idx]['caption']
print(img)
print(text)
Image.open(f'./{data_name}/{img}')

As an alternative to Conceptual Captions, you can use MS COCO. The captions and image URLs are available [here](https://huggingface.co/datasets/ChristophSchuhmann/MS_COCO_2017_URL_TEXT), built from the dataset [here](https://academictorrents.com/details/74dec1dd21ae4994dfd9069f9cb0443eb960c962), and [img2dataset](https://github.com/rom1504/img2dataset) can download the image files quickly. For the purposes of this notebook, the small sample downloaded above is sufficient.

## 2. Training
You will use distributed training, which means using every GPU available on the instance. The first cell checks how many GPUs the current system has.

In [None]:
import subprocess
processes_per_host = subprocess.Popen("nvidia-smi -q | awk '/Attached GPUs/ {print $4}'",
                              shell=True,
                              stdout=subprocess.PIPE)
processes_per_host = int(processes_per_host.stdout.read().decode().strip())
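
If `nvidia-smi` is missing or prints nothing, the `int(...)` call above fails with a `ValueError` on the empty string. A more defensive variant, offered as a sketch, separates the parsing from the subprocess call and falls back to a default. The function names and the fallback value of 1 are assumptions for illustration.

```python
import re
import shutil
import subprocess

def parse_attached_gpus(smi_output):
    """Extract the GPU count from `nvidia-smi -q` output, or return None if absent."""
    match = re.search(r"Attached GPUs\s*:\s*(\d+)", smi_output)
    return int(match.group(1)) if match else None

def count_gpus(default=1):
    """Return the number of attached GPUs, or default when nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return default
    result = subprocess.run(["nvidia-smi", "-q"], capture_output=True, text=True)
    count = parse_attached_gpus(result.stdout)
    return count if count is not None else default
```

With this in place, `processes_per_host = count_gpus()` would replace the `Popen` pipeline.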

The following cell will enable you to build an estimator for training locally, and fit on the local dataset you previously built.

In [None]:
import os
from sagemaker.huggingface import HuggingFace
from sagemaker.local import LocalSession
from sagemaker import get_execution_role

est = HuggingFace(
    entry_point='finetune.py',
    source_dir='src',
    image_uri='763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker',
    #pytorch_version="1.10.2",
    #transformers_version="4.17.0",
    sagemaker_session=LocalSession(),
    role=get_execution_role(),
    instance_type='local_gpu',
    output_path='file://{}'.format(os.path.join(os.getcwd(),'model')),
    py_version='py38',
    base_job_name='test',
    instance_count=1,
    hyperparameters={
        'pretrained_model_name_or_path':'/opt/ml/input/data/training/sd-base-model',
        'dataset_name':'/opt/ml/input/data/training/dataset.parquet',
        'caption_column':'caption',
        'image_column':'sm_key',
        'resolution':256,
        'mixed_precision':'fp16',
        'train_batch_size':2,
        'learning_rate': '1e-10',
        'max_train_steps':100,
        'num_train_epochs':1,
        'output_dir':'/opt/ml/model/sd-output-final',   
    },
    distribution={"mpi":{"enabled":True,"processes_per_host":processes_per_host}}
)
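
SageMaker passes each entry in `hyperparameters` to the entry point as a `--name value` command-line argument. The sketch below shows how a training script could read a few of them with `argparse`. It is an illustration only and is not taken from the actual `finetune.py`, whose argument handling may differ; the function name and the default values are assumptions.

```python
import argparse

def parse_training_args(argv=None):
    """Parse a subset of the hyperparameters used in this notebook."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--pretrained_model_name_or_path", type=str, required=True)
    parser.add_argument("--dataset_name", type=str, required=True)
    parser.add_argument("--resolution", type=int, default=512)
    parser.add_argument("--train_batch_size", type=int, default=1)
    parser.add_argument("--learning_rate", type=float, default=1e-4)
    parser.add_argument("--max_train_steps", type=int, default=None)
    # parse_known_args ignores hyperparameters this sketch does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args
```

Note that values arrive as strings on the command line, which is why `learning_rate` can be given as `'1e-10'` in the estimator and still be read as a float by the script.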

In [None]:
#Please note training can take upwards of 25 minutes (13 minutes for saving the model). 

In [None]:
est.fit(f'file://./{data_name}/')

The "Aborting on container exit" line may hang for up to 15 minutes due to the size of the model being compressed, saved, and uploaded.

In [None]:
print(est.model_data) #In case you have to restart kernel.
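
In local mode, `est.model_data` is expected to be a `file://` URI pointing at a `model.tar.gz` under the directory set in `output_path`. To confirm that the archive holds the fine-tuned weights before deploying, a sketch like the following lists its members. The helper name is hypothetical, and the layout inside the archive depends on what `finetune.py` writes to `/opt/ml/model`.

```python
import tarfile
from urllib.parse import urlparse

def list_model_archive(model_data, limit=20):
    """Return up to `limit` member names from a local model.tar.gz given a path or file:// URI."""
    path = urlparse(model_data).path if model_data.startswith("file://") else model_data
    with tarfile.open(path, "r:gz") as archive:
        return archive.getnames()[:limit]
```

For example, `list_model_archive(est.model_data)` should show entries under the `output_dir` folder configured in the hyperparameters if training saved the model as intended.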

## 3. Inference
Before running inference, you need to extend an existing Deep Learning Container. See Dockerfile-Inf under the src directory for the details of that extension. The following cell builds a local container for use in this notebook.

In [None]:
!DOCKER_BUILDKIT=1 docker build ./src -f ./src/Dockerfile-Inf -t local:latest -q

Define your model for deployment. (This step can be skipped if the estimator from the previous training job is still in memory.)

In [None]:
from sagemaker import get_execution_role
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.local import LocalSession

# This cell is optional: you could deploy the estimator from the training job directly.
# If you restart the kernel, `est` no longer exists, so replace est.model_data below
# with the path printed by the earlier print(est.model_data) cell.
est = HuggingFaceModel(
    role=get_execution_role(),
    py_version='py38',
    model_data=est.model_data,
    image_uri='local:latest',
    sagemaker_session=LocalSession(),
    model_server_workers=4,
)

Deploy your model for inference!

In [None]:
pred = est.deploy(instance_type='local_gpu',
                  initial_instance_count=1)

Provide prompts for inference. The first prompt is a caption taken from the current dataset.

In [None]:
prompts = [text,'A photo of an astronaut riding a horse on mars', 
           'A dragonfruit wearing karate belt in the snow.', 
           'Teddy bear swimming at the Olympics 400m Butter-fly event.',
           'A cute sloth holding a small glowing treasure chest.']

In [None]:
#Get the outputs

In [None]:
outputs = [pred.predict({'inputs':prompt}) for prompt in prompts]

In [None]:
outputs = [output['images'][0] for output in outputs]

In [None]:
def process_result(out):
    from PIL import Image
    from io import BytesIO
    import base64
    return Image.open(BytesIO(base64.b64decode(out)))
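
`process_result` implies that the endpoint returns each image as a base64 string. If you would rather keep the files than hold PIL objects in memory, the decoded bytes can be written straight to disk. This is a sketch: the function name, output directory, and `.png` extension are assumptions, since the actual image format depends on the inference handler.

```python
import base64
from pathlib import Path

def save_encoded_images(encoded_images, out_dir="generated", extension="png"):
    """Decode base64 strings and write them as numbered files; return the paths written."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for i, encoded in enumerate(encoded_images):
        target = out_path / f"{i}.{extension}"
        target.write_bytes(base64.b64decode(encoded))
        written.append(target)
    return written
```

Passing the `outputs` list of base64 strings to this helper would produce one file per prompt, numbered in prompt order.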

In [None]:
images = [[process_result(output),prompt] for output,prompt in zip(outputs,prompts)]

In [None]:
#Visualize the results from the inference

In [None]:
import matplotlib.pyplot as plt

for i in range(len(images)):
    plt.figure()
    plt.title(images[i][1])
    plt.imshow(images[i][0])

In [None]:
# clean up your endpoint
pred.delete_endpoint()

## 4. (Bonus) Compare against the original model

Using the previous model that you defined as a base model, you can also evaluate it locally.

In [None]:
import torch
if torch.cuda.is_available():
    model = model.to('cuda:0')

In [None]:
outs = [model(prompt) for prompt in prompts]

In [None]:
#visualize output

In [None]:
for i in range(len(outs)):
    plt.figure()
    plt.title(prompts[i])
    plt.imshow(outs[i].images[0])

In [None]:
#You will need to restart the kernel to free GPU memory if you wish to run more local training.