# Deploy Stable Diffusion XL on AWS Inferentia2   

In this notebook, we deploy a [Stable Diffusion XL](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) model on Amazon SageMaker using an Inferentia2 instance and optimum-neuron. [Optimum Neuron](https://huggingface.co/docs/optimum-neuron/en/index) is the interface between the Transformers library and AWS purpose-built accelerators: AWS Trainium and AWS Inferentia.

## Install required libraries

In [None]:
!pip install --upgrade --quiet "optimum-neuron" "sagemaker"

Note: you may need to restart the kernel to use updated packages.

## Download and save the compiled model to a local directory

In [None]:
from huggingface_hub import snapshot_download
 
# compiled model id
compiled_model_id = "aws-neuron/stable-diffusion-xl-base-1-0-1024x1024"
 
# save compiled model to local directory
save_directory = "/tmp/sdxl_neuron"

# Download the compiled model from the Hugging Face Hub,
# using the revision as the Neuron version reference,
# and exclude symlinks and "hidden" files such as .DS_Store and .gitignore.
snapshot_download(compiled_model_id, revision="2.15.0", local_dir=save_directory, local_dir_use_symlinks=False, allow_patterns=["[!.]*.*"])
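
Before packaging, it can help to confirm that the snapshot landed where expected. The following is a minimal sketch using only the standard library. `summarize_directory` is a hypothetical helper introduced for illustration, not part of `huggingface_hub`, and the path simply mirrors `save_directory` from the cell above.

```python
from pathlib import Path


def summarize_directory(root):
    """Return (file_count, total_bytes) for all regular files under root."""
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


# "/tmp/sdxl_neuron" mirrors save_directory from the download cell
target = Path("/tmp/sdxl_neuron")
if target.exists():
    count, size = summarize_directory(target)
    print(f"{count} files, {size / 1e9:.2f} GB")
```

A file count of zero would suggest the download did not complete or the `allow_patterns` filter matched nothing.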

## Create `code` directory and `inference.py` file

In [None]:
# create code directory in our model directory (-p avoids an error on re-runs)
!mkdir -p {save_directory}/code

In [None]:
%%writefile {save_directory}/code/inference.py

import os
# Assign two neuron cores per worker
os.environ["NEURON_RT_NUM_CORES"] = "2"
import torch
import torch_neuronx
import base64
from io import BytesIO
from optimum.neuron import NeuronStableDiffusionXLPipeline
 
 
def model_fn(model_dir):
    # Load local converted model into pipeline
    pipeline = NeuronStableDiffusionXLPipeline.from_pretrained(model_dir, device_ids=[0, 1])
    return pipeline
 
 
def predict_fn(data, pipeline):
    # Extract prompt from data
    prompt = data.pop("inputs", data)
 
    parameters = data.pop("parameters", None)
 
    if parameters is not None:
        generated_images = pipeline(prompt, **parameters)["images"]
    else:
        generated_images = pipeline(prompt)["images"]
 
    # Convert image into base64 string
    encoded_images = []
    for image in generated_images:
        buffered = BytesIO()
        image.save(buffered, format="JPEG")
        encoded_images.append(base64.b64encode(buffered.getvalue()).decode())
 
    # Return all generated images as base64-encoded strings
    return {"generated_images": encoded_images}
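
The request and response shapes handled by `predict_fn` can be checked locally without Neuron hardware. The sketch below repeats the handler logic so that it runs standalone and swaps in a stub for the pipeline. `FakeImage` and `fake_pipeline` are hypothetical stand-ins introduced here, and the stub's handling of `num_images_per_prompt` is an assumption for illustration, not the behavior of the real pipeline.

```python
import base64
from io import BytesIO


class FakeImage:
    """Stand-in for a PIL image; only implements save()."""

    def __init__(self, payload):
        self.payload = payload

    def save(self, buffer, format="JPEG"):
        # Write the raw payload instead of real JPEG data
        buffer.write(self.payload)


def fake_pipeline(prompt, **parameters):
    # Hypothetical stub: returns one fake image per requested image
    n = parameters.get("num_images_per_prompt", 1)
    return {"images": [FakeImage(f"{prompt}-{i}".encode()) for i in range(n)]}


def predict_fn(data, pipeline):
    # Same logic as inference.py, repeated so this block runs standalone
    prompt = data.pop("inputs", data)
    parameters = data.pop("parameters", None)
    if parameters is not None:
        images = pipeline(prompt, **parameters)["images"]
    else:
        images = pipeline(prompt)["images"]
    encoded = []
    for image in images:
        buffered = BytesIO()
        image.save(buffered, format="JPEG")
        encoded.append(base64.b64encode(buffered.getvalue()).decode())
    return {"generated_images": encoded}


response = predict_fn(
    {"inputs": "a dog", "parameters": {"num_images_per_prompt": 2}},
    fake_pipeline,
)
print(len(response["generated_images"]))
```

The response always carries a list under `generated_images`, so callers index into it even when a single image is requested.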

## Configure SageMaker resources

In [None]:
import sagemaker
import boto3

sess = sagemaker.Session()
# Create an Amazon SageMaker session bucket for uploading data, models and logs
# Amazon SageMaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # If a bucket name is not provided, set to default bucket
    sagemaker_session_bucket = sess.default_bucket()
 
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']
 
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
 
print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
assert sess.boto_region_name in ["us-east-1", "us-east-2"], "region must be us-east-1 or us-east-2, due to instance availability"

## Create a tar file (`model.tar.gz`) with model artifacts and scripts

**Note to reviewer**: We are working on storing the `model.tar.gz` in a shared location to avoid this step

In [None]:
%%time
# Create a model.tar.gz archive with all the model artifacts and the inference.py script.
%cd {save_directory}
!tar zcvf model.tar.gz *
%cd ..

Creating the `model.tar.gz` archive takes around 10 minutes. If you are at an AWS event, your instructor will cover other aspects of the workshop while you wait.
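
Because the archive is created from inside the model directory, `code/inference.py` ends up at the top level of `model.tar.gz`. The sketch below shows one way to build and check that layout with Python's `tarfile` module instead of the shell. `build_model_archive` and `archive_has_inference_script` are hypothetical helper names, and this is an illustration under those assumptions rather than a replacement for the cell above.

```python
import tarfile
from pathlib import Path


def build_model_archive(model_dir, archive_path):
    """Pack each non-hidden entry of model_dir at the archive root,
    similar to running `tar zcvf model.tar.gz *` inside model_dir."""
    model_dir = Path(model_dir)
    archive_path = Path(archive_path)
    with tarfile.open(archive_path, "w:gz") as tar:
        for entry in sorted(model_dir.iterdir()):
            # The shell glob `*` skips dotfiles, so do the same here
            if entry.name.startswith("."):
                continue
            # Skip the archive itself when it is written inside model_dir
            if entry.resolve() == archive_path.resolve():
                continue
            tar.add(entry, arcname=entry.name)
    return archive_path


def archive_has_inference_script(archive_path):
    """Check that code/inference.py sits at the top level of the archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return "code/inference.py" in tar.getnames()
```

Running `archive_has_inference_script` before uploading is a cheap way to catch an archive that was built from the wrong directory.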

## Upload the tar file to an S3 bucket

In [None]:
from sagemaker.s3 import S3Uploader
 
# Create s3 uri
s3_model_path = f"s3://{sess.default_bucket()}/neuronx/sdxl"
 
# Upload model.tar.gz
s3_model_uri = S3Uploader.upload(local_path=f"{save_directory}/model.tar.gz", desired_s3_uri=s3_model_path)
print(f"model artifacts uploaded to {s3_model_uri}")

## Deploy the model

In [None]:
from sagemaker.huggingface.model import HuggingFaceModel
 
# Create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data=s3_model_uri,        # path to your model.tar.gz on s3
   role=role,                      # iam role with permissions to create an Endpoint
   transformers_version="4.34.1",  # transformers version used
   pytorch_version="1.13.1",       # pytorch version used
   py_version='py310',             # python version used
   model_server_workers=1,         # number of workers for the model server
)

In [None]:
# Deploy the endpoint
predictor = huggingface_model.deploy(
    endpoint_name="Stable-Diffusion-XL",
    initial_instance_count=1,      # number of instances
    instance_type="ml.inf2.xlarge",  # AWS Inferentia2 instance
    volume_size=128,               # size of the attached storage volume in GB
)


**Note**: You can ignore the "Your model is not compiled. Please compile your model before using Inferentia." warning. The model we downloaded is already compiled.

### The above step takes about 10-15 minutes.

While you wait for the model to be deployed, you can read the resources below:
- [AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html)
- [AWS Inferentia2](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/inf2-arch.html)
- [Amazon SageMaker Real Time Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html)
- [Amazon SageMaker with HuggingFace Optimum Neuron](https://huggingface.co/docs/optimum-neuron/en/guides/sagemaker)
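
The `deploy` call above normally blocks until the endpoint is ready. If you want to check on the deployment from a separate session, the endpoint status can be polled through the SageMaker `describe_endpoint` API. The following is a minimal sketch: `wait_for_endpoint` is a hypothetical helper, the polling interval and timeout are arbitrary assumptions, and the client is passed in so the function can be exercised without AWS credentials.

```python
import time


def wait_for_endpoint(client, endpoint_name, poll_seconds=30, timeout_seconds=1800):
    """Poll describe_endpoint until the endpoint is InService or has failed."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Endpoint {endpoint_name} still {status} after {timeout_seconds}s"
            )
        time.sleep(poll_seconds)


# Usage (requires AWS credentials):
# import boto3
# wait_for_endpoint(boto3.client("sagemaker"), "Stable-Diffusion-XL")
```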

## Invoke the model with a sample prompt

In [None]:
from PIL import Image
from io import BytesIO
from IPython.display import display
import base64
 
# Helper decoder
def decode_base64_image(image_string):
  base64_image = base64.b64decode(image_string)
  buffer = BytesIO(base64_image)
  return Image.open(buffer)
 
# Resize a PIL image and display it
def display_image(image, width=500, height=500):
    img = image.resize((width, height))
    display(img)

In [None]:
prompt = "A dog trying to catch a flying pizza at a street corner, comic book, well lit, night time"
 
# Run prediction
response = predictor.predict(data={
  "inputs": prompt,
  "parameters": {
    "num_inference_steps" : 20,
    "negative_prompt" : "disfigured, ugly, deformed"
    }
  }
)
 
# Decode and display image
display_image(decode_base64_image(response["generated_images"][0]))

<div class="alert alert-block alert-warning"> 

<b>DO NOT DELETE THE ENDPOINT</b>

The endpoints will be used to invoke the models when building our application.
</div>
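
An application outside this notebook will not have the `predictor` object, but it can call the same endpoint through the SageMaker runtime `invoke_endpoint` API. The sketch below assumes the request and response format defined in `inference.py`. `generate_image_bytes` is a hypothetical helper, and the client is injected so the function can be tried with a stub.

```python
import base64
import json


def generate_image_bytes(runtime_client, endpoint_name, prompt, parameters=None):
    """Call the endpoint and return the first generated image as raw JPEG bytes."""
    payload = {"inputs": prompt}
    if parameters:
        payload["parameters"] = parameters
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(payload),
    )
    # The response body is a stream containing the JSON returned by predict_fn
    result = json.loads(response["Body"].read())
    return base64.b64decode(result["generated_images"][0])


# Usage (requires AWS credentials and a deployed endpoint):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# jpeg = generate_image_bytes(runtime, "Stable-Diffusion-XL", "a dog", {"num_inference_steps": 20})
# with open("output.jpg", "wb") as f:
#     f.write(jpeg)
```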

## Clean up the environment

In [None]:
# Run this cell only once the endpoint is no longer needed
predictor.delete_model()
predictor.delete_endpoint()