# Deploy Stable Diffusion using Triton Inference Server

In this notebook, we will host LoRA fine-tuned Stable Diffusion models on NVIDIA's Triton Inference Server.

<div class="alert alert-warning">
<b>Warning</b>: You should run this notebook on a SageMaker Notebook Instance. A GPU instance such as `ml.g5.2xlarge` is recommended. This notebook was tested on the `conda_python_p310` kernel.
</div>

### Installs and imports

In [None]:
!pip install nvidia-pyindex
!pip install tritonclient[http]
!pip install -U sagemaker ipywidgets numpy Pillow
!pip install -Uq conda-pack==0.7.1

In [None]:
import boto3

import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype
import time
from PIL import Image
import numpy as np
from io import BytesIO
import base64

# variables
s3_client = boto3.client("s3")
ts = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

## What is Triton Inference Server

**Triton Inference Server** is an open source inference serving toolkit from NVIDIA that supports high-performance inferencing for deep learning models. It provides a framework-agnostic platform to deploy trained AI models from any framework, including TensorFlow, PyTorch, and ONNX. Triton allows multiple models to be served from the same server, optimizing hardware utilization.

**The Triton backend for Python.** The goal of the Python backend is to let you serve models written in Python with Triton Inference Server without having to write any C++ code. See the [python_backend repository](https://github.com/triton-inference-server/python_backend) for more information.

In this example, two fine-tuned Stable Diffusion models are already prepared for you. Take a look at the `model_repository` folder structure.

```
model_repository
├── james                                       # model folder
│   ├── 1                                       # model version
│   │   ├── model.py                            # inference handler functions must be defined in this Python file
│   │   └── pytorch_lora_weights.safetensors    # LoRA adapter weights
│   ├── config.pbtxt                            # model configuration
│   └── sd_env.tar.gz                           # custom execution environment (created in the next section)
└── diwakar
    ├── 1
    │   ├── model.py
    │   └── pytorch_lora_weights.safetensors
    ├── config.pbtxt
    └── sd_env.tar.gz
```
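Before starting the server, it can help to confirm that every model folder matches this layout. The following is a minimal sketch and not part of the original notebook: the helper name `find_layout_problems` is hypothetical, and it assumes every model uses version folder `1` and the file names shown in the tree. The `sd_env.tar.gz` archive is left out of the check on purpose, because it is only created in the next section.

```python
from pathlib import Path

# Files each model folder is expected to contain, relative to the model folder.
# This list mirrors the tree above; the helper itself is a hypothetical sketch.
REQUIRED_FILES = [
    "config.pbtxt",
    "1/model.py",
    "1/pytorch_lora_weights.safetensors",
]


def find_layout_problems(repo_dir):
    """Return a list of problems; an empty list means the layout looks complete."""
    repo = Path(repo_dir)
    if not repo.is_dir():
        return [f"{repo} is not a directory"]
    problems = []
    # Skip hidden folders such as .ipynb_checkpoints.
    model_dirs = sorted(
        p for p in repo.iterdir() if p.is_dir() and not p.name.startswith(".")
    )
    if not model_dirs:
        problems.append(f"{repo} contains no model folders")
    for model_dir in model_dirs:
        for rel in REQUIRED_FILES:
            if not (model_dir / rel).is_file():
                problems.append(f"{model_dir.name}: missing {rel}")
    return problems
```

Calling `find_layout_problems("model_repository")` from the notebook's working directory returns one message per missing file, so an empty list suggests the repository is ready to be mounted into the container.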

### Packaging a conda environment

When using the Triton Python backend, you can include your own environment and dependencies. The recommended way to do this is to use [conda pack](https://conda.github.io/conda-pack/) to generate a conda environment archive in `tar.gz` format, include it in your model repository, and point to it in the `config.pbtxt` file of each Python model that should use it by adding the snippet:

```
parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/your_env.tar.gz"}
}

```
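To see where that snippet sits in a full model configuration, here is a hedged sketch that renders an illustrative `config.pbtxt` as a string. It is not the configuration shipped with the prepared models: the function name `render_config` is hypothetical, the tensor names are taken from the client code later in this notebook, and the `TYPE_STRING` data types, `dims`, `max_batch_size`, and `instance_group` values are assumptions. Compare it against the real `config.pbtxt` files in `model_repository` before relying on it.

```python
def render_config(model_name, env_archive="sd_env.tar.gz", max_batch_size=1):
    """Render an illustrative config.pbtxt for a Python-backend model.

    Assumptions (not confirmed by the notebook): all tensors are TYPE_STRING
    with dims [ 1 ], and the model runs on a GPU instance group.
    """

    def tensor(name):
        return f'  {{ name: "{name}" data_type: TYPE_STRING dims: [ 1 ] }}'

    # Tensor names match the inputs and output used by the client code below.
    inputs = ",\n".join(tensor(n) for n in ("prompt", "negative_prompt", "gen_args"))
    outputs = tensor("generated_image")
    return (
        f'name: "{model_name}"\n'
        'backend: "python"\n'
        f"max_batch_size: {max_batch_size}\n"
        f"input [\n{inputs}\n]\n"
        f"output [\n{outputs}\n]\n"
        "instance_group [ { kind: KIND_GPU } ]\n"
        "parameters: {\n"
        '  key: "EXECUTION_ENV_PATH",\n'
        f'  value: {{string_value: "$$TRITON_MODEL_DIRECTORY/{env_archive}"}}\n'
        "}\n"
    )
```

For example, `print(render_config("james"))` shows the `parameters` block at the end of the file, pointing at the archive that sits next to `config.pbtxt` in the model folder.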
Let's define the conda environment in `environment.yml`. We will then pack it and place the archive in each model's folder.

In [None]:
%%writefile environment.yml
name: sd_env
dependencies:
  - python=3.10
  - pip
  - pip:
      - numpy
      - --extra-index-url https://download.pytorch.org/whl/cu118
      - torch
      - accelerate==0.22.0
      - transformers==4.26
      - diffusers==0.21.4
      - xformers
      - bitsandbytes
      - conda-pack==0.7.1

In [None]:
!conda env create -f environment.yml

We will use the same conda environment for both models. In practice, each model can have its own environment.

In [None]:
!conda pack -n sd_env -o model_repository/james/sd_env.tar.gz

In [None]:
!cp model_repository/james/sd_env.tar.gz model_repository/diwakar/

Let's check out the `model.py` inference script.

In [None]:
!pygmentize model_repository/james/1/model.py

## Test the Triton model repository
You can test the model repository locally to validate that it works. Let's run the Triton Docker container locally and send it a request.

In [None]:
!rm -rf `find -type d -name .ipynb_checkpoints`

In [None]:
repo_name = "model_repository"

We run the Triton container in detached mode with the `-d` flag so that it runs in the background.

In [None]:
!docker run --gpus=all -d --shm-size=4G --rm -p8000:8000 -p8001:8001 -p8002:8002 -v$(pwd)/$repo_name:/model_repository nvcr.io/nvidia/tritonserver:23.10-py3 tritonserver --model-repository=/model_repository --exit-on-error=false
time.sleep(90)

In [None]:
CONTAINER_ID=!docker container ls -q
FIRST_CONTAINER_ID = CONTAINER_ID[0]

In [None]:
!echo $FIRST_CONTAINER_ID

In [None]:
!docker logs $FIRST_CONTAINER_ID

<div class="alert alert-warning">
<b>Warning</b>: Rerun the cell above to check the container logs until you have verified that Triton has loaded all models successfully; otherwise, inference requests will fail.
</div>
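As an alternative to rereading the logs by hand, readiness can be polled from Python. The sketch below is not part of the original notebook: `wait_until` is a hypothetical helper, and the timeout and interval defaults are arbitrary. The commented usage relies on `is_model_ready`, which the Triton HTTP client provides, and assumes the container started above is listening on `localhost:8000`.

```python
import time


def wait_until(predicate, timeout_s=300.0, interval_s=5.0):
    """Call predicate() until it returns True or timeout_s elapses.

    Exceptions raised by predicate (for example, connection errors while the
    server is still starting) are treated as "not ready yet".
    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if predicate():
                return True
        except Exception:
            pass  # server not reachable yet; retry after the interval
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Usage with the Triton HTTP client (assumes the container above is running):
#   import tritonclient.http as httpclient
#   client = httpclient.InferenceServerClient(url="localhost:8000")
#   ready = wait_until(
#       lambda: all(client.is_model_ready(m) for m in ("james", "diwakar"))
#   )
```

If `ready` comes back `False`, the container logs remain the best place to find out which model failed to load.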

#### Now we will invoke the script locally

We will use Triton's HTTP client and its utility functions to send a request to `localhost:8000`, where the server is listening. We send the text inputs as binary (`BYTES`) data and receive an array containing the base64-encoded image, which we decode on the client. Check out the code in `model_repository/james/1/model.py` to understand how the input data is decoded and the output data returned, and see the Triton Python backend [docs](https://github.com/triton-inference-server/python_backend) and [examples](https://github.com/triton-inference-server/python_backend/tree/main/examples) to learn how to handle other data types.

In [None]:
client = httpclient.InferenceServerClient(url="localhost:8000")

In [None]:
import random
import json

prompt = """photo of <<TOK>> epic portrait, handsome, zoomed out, blurred background cityscape, bokeh, perfect symmetry, by artgem, artstation ,concept art,cinematic lighting, highly detailed, 
octane, concept art, sharp focus, rockstar games,
post processing, picture of the day, ambient lighting, epic composition"""

negative_prompt = """
beard, goatee, ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, extra limbs, disfigured, deformed, body out of frame, blurry, bad anatomy, blurred, 
watermark, grainy, signature, cut off, draft, amateur, multiple, gross, weird, uneven, furnishing, decorating, decoration, furniture, text, poor, low, basic, worst, juvenile, 
unprofessional, failure, crayon, oil, label, thousand hands
"""

seed = 233571759 #random.randint(1, 1000000000)
gen_args = json.dumps(dict(num_inference_steps=50, guidance_scale=7, seed=seed))

input_dict = dict(prompt = prompt,
              negative_prompt = negative_prompt,
              gen_args = gen_args)
# Wrap each string in a (1, 1) object array, which Triton treats as a BYTES tensor
inputs = []
for name, data in input_dict.items():
    obj = np.array([data], dtype="object").reshape((-1, 1))
    i = httpclient.InferInput(name, obj.shape, np_to_triton_dtype(obj.dtype))
    i.set_data_from_numpy(obj)
    inputs.append(i)

# Name of the output tensor to request from the model
output_img = httpclient.InferRequestedOutput("generated_image")

Choose your target model. Available models: `james`, `diwakar`.

In [None]:
target_model = "diwakar"

In [None]:
start = time.time()
query_response = client.infer(model_name=target_model, inputs=inputs, outputs=[output_img])

print(f"took {time.time()-start} seconds")

# The output tensor holds the generated image as a base64-encoded string
image = query_response.as_numpy("generated_image")

encoded_image = np.squeeze(image).tolist()
Image.open(BytesIO(base64.b64decode(encoded_image)))
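The cell above only displays the image inline. To keep a result on disk, the base64 payload can be decoded and written to a file. This is a minimal sketch rather than part of the original notebook: `save_generated_image` is a hypothetical helper, and because the notebook does not state which image format the model returns, the extension is guessed from the file signature.

```python
import base64
from pathlib import Path

# Leading bytes that identify common image formats.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"


def save_generated_image(payload, stem):
    """Decode a base64 payload (bytes or str) and write it to <stem>.<ext>.

    The extension is guessed from the file signature; unknown formats get ".bin".
    Returns the path that was written.
    """
    raw = base64.b64decode(payload)
    if raw.startswith(PNG_SIGNATURE):
        ext = ".png"
    elif raw.startswith(JPEG_SIGNATURE):
        ext = ".jpg"
    else:
        ext = ".bin"
    path = Path(f"{stem}{ext}")
    path.write_bytes(raw)
    return path
```

With the variables from the cells above, `save_generated_image(np.squeeze(image).tolist(), f"{target_model}-{seed}")` would write one file per model and seed, which makes it easier to compare the two LoRA adapters on the same prompt.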

Check GPU memory utilization with `nvidia-smi`, a command-line utility for monitoring and managing NVIDIA GPU devices.

You can fit up to four Stable Diffusion 2.1 models on a single A10G GPU.

In [None]:
!nvidia-smi
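To track memory programmatically, for example after loading each additional model, the query mode of `nvidia-smi` can be parsed from Python. The sketch below is not from the original notebook: the function names are hypothetical, and it assumes the `--query-gpu=memory.used,memory.total` and `--format=csv,noheader,nounits` options, which report one row of MiB values per GPU.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse `memory.used, memory.total` rows (MiB, no header, no units)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        gpus.append(
            {"used_mib": used, "total_mib": total, "used_fraction": used / total}
        )
    return gpus


def query_gpu_memory():
    """Run nvidia-smi and return one dict per GPU (requires an NVIDIA driver)."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_gpu_memory(out)
```

Calling `query_gpu_memory()` before and after loading a model gives a rough per-model footprint, which helps when estimating how many models fit on one GPU.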

## Clean up

In [None]:
!docker kill $FIRST_CONTAINER_ID