# Deploy Stable Diffusion using Triton

In this notebook, we host LoRA fine-tuned Stable Diffusion models on NVIDIA's Triton Inference Server.

<div class="alert alert-warning">
<b>Warning</b>: This notebook was tested with the `torch-neuronx` kernel on an Inf2 instance (`inf2.8xlarge` or larger).
</div>

### Installs and imports

In [None]:
!pip install nvidia-pyindex
!pip install tritonclient[http]
!pip install -U sagemaker ipywidgets numpy Pillow
!pip install -Uq conda-pack==0.7.1

In [None]:
import boto3

import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype
import time
from PIL import Image
import numpy as np
from io import BytesIO
import base64

# variables
s3_client = boto3.client("s3")
ts = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

### Setup Triton on Inf2

On the Inf2 instance, run the following commands to install the runtime tools.

```bash
 $ chmod +x setup-pre-container.sh
 $ sudo ./setup-pre-container.sh -inf2
```
This installs the following runtime tools on the instance (your DLAMI may already have them pre-installed):

```
aws-neuronx-dkms=2.*
aws-neuronx-tools=2.*
aws-neuronx-collectives=2.*
aws-neuronx-runtime-lib=2.*
```

Then build the custom container, which installs the requirements and, most importantly, runs the `setup.sh` script to install the Neuron compiler and Neuron framework packages.

In [None]:
!cat docker/Dockerfile

In [None]:
new_image_name = "tritonserver-pt-inf2"
base_image = "nvcr.io/nvidia/tritonserver:23.03-py3"

In [None]:
%%capture build_output
!cd docker && docker build  -t {new_image_name} . --build-arg BASE_IMAGE={base_image}

In [None]:
print(build_output)

List the Docker images:

In [None]:
!docker images

## What is Triton Inference Server

**Triton Inference Server** is an open source inference serving toolkit from NVIDIA that supports high-performance inferencing for deep learning models. It provides a framework-agnostic platform to deploy trained AI models from any framework, including TensorFlow, PyTorch, and ONNX. Triton allows multiple models to be served from the same server, optimizing hardware utilization.

**The Triton backend for Python.** The goal of the Python backend is to let you serve models written in Python with Triton Inference Server without having to write any C++ code. See the [python_backend repository](https://github.com/triton-inference-server/python_backend) for more information.

In this example, a fine-tuned Stable Diffusion model has already been prepared for you. Take a look at the `model_repository` folder structure.

```
model_repository
└── james                                       # model folder
    ├── 1                                       # model version
    │   ├── model.py                            # inference handler; must be saved in this Python file
    │   └── sd2_compile_dir_512                 # compiled SD model (generated by another notebook)
    └── config.pbtxt                            # model configuration
```
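Before starting the server, it can help to confirm that the folder matches the layout above. The following is a minimal sketch, not part of the original notebook: `check_model_repository` is a hypothetical helper, and it assumes every model in the repository uses the Python backend (so each version folder should contain a `model.py`).

```python
from pathlib import Path


def check_model_repository(repo_root):
    """Return a list of human-readable problems found in a Triton-style model repository."""
    root = Path(repo_root)
    if not root.is_dir():
        return [f"{root} is not a directory"]

    problems = []
    # Skip hidden folders such as .ipynb_checkpoints; they are not models.
    model_dirs = sorted(
        p for p in root.iterdir() if p.is_dir() and not p.name.startswith(".")
    )
    for model_dir in model_dirs:
        # Each model folder is expected to carry its own config.pbtxt.
        if not (model_dir / "config.pbtxt").is_file():
            problems.append(f"{model_dir.name}: missing config.pbtxt")

        # Version folders are named with integers, e.g. "1".
        versions = sorted(
            v for v in model_dir.iterdir() if v.is_dir() and v.name.isdigit()
        )
        if not versions:
            problems.append(f"{model_dir.name}: no numeric version folder")

        for version in versions:
            # Assumption: Python-backend models keep their handler in model.py.
            if not (version / "model.py").is_file():
                problems.append(f"{model_dir.name}/{version.name}: missing model.py")
    return problems
```

Calling `check_model_repository("model_repository")` returns an empty list when nothing is missing; any returned strings name the model and the file that was not found.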

In [None]:
!rm -rf `find -type d -name .ipynb_checkpoints`

In [None]:
repo_name = "model_repository"

In [None]:
!docker run --device /dev/neuron0 -d --shm-size=4G --rm -p8000:8000 -p8001:8001 -p8002:8002 -v$(pwd)/$repo_name:/model_repository tritonserver-pt-inf2:latest tritonserver --model-repository=/model_repository --exit-on-error=false
time.sleep(90)
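The fixed 90-second sleep above is a simple way to wait for the model to load. As an alternative, the sketch below polls a readiness check until it succeeds or a timeout expires. `wait_until_ready` is a hypothetical helper, not part of the original notebook, and the timeout and interval values are arbitrary assumptions.

```python
import time


def wait_until_ready(is_ready, timeout_s=300.0, interval_s=5.0):
    """Poll is_ready() until it returns True or timeout_s elapses.

    Returns True if the check succeeded, False if the timeout was reached.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if is_ready():
                return True
        except Exception:
            # The server may refuse connections while it is still starting.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

With the Triton HTTP client, the check could be the client's `is_server_ready` method, for example `wait_until_ready(httpclient.InferenceServerClient(url="localhost:8000").is_server_ready)`.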

In [None]:
CONTAINER_ID=!docker container ls -q
FIRST_CONTAINER_ID = CONTAINER_ID[0]

In [None]:
!echo $FIRST_CONTAINER_ID

In [None]:
!docker logs $FIRST_CONTAINER_ID

#### Now we will invoke the model locally

We will use Triton's HTTP client and its utility functions to send a request to `localhost:8000`, where the server is listening. Each text input is sent as binary data, and the response is an array that we read with NumPy and then base64-decode into an image. Check out the code in `model_repository/james/1/model.py` to understand how the input data is decoded and the output data returned, and see the Triton Python backend [docs](https://github.com/triton-inference-server/python_backend) and [examples](https://github.com/triton-inference-server/python_backend/tree/main/examples) to understand how to handle other data types.

In [None]:
client = httpclient.InferenceServerClient(url="localhost:8000")

In [None]:
import random
import json

prompt = """
photo of <<TOK>> pencil sketch, young and handsome, face front, centered
"""

negative_prompt = """
ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, extra limbs, disfigured, deformed, body out of frame, blurry, bad anatomy, blurred, 
watermark, grainy, signature, cut off, draft, amateur, multiple, gross, weird, uneven, furnishing, decorating, decoration, furniture, text, poor, low, basic, worst, juvenile, 
unprofessional, failure, crayon, oil, label, thousand hands
"""

seed = random.randint(1, 1000000000)
gen_args = json.dumps(dict(num_inference_steps=50, guidance_scale=7, seed=seed))

input_dict = dict(prompt=prompt, negative_prompt=negative_prompt, gen_args=gen_args)
inputs = []
for name, data in input_dict.items():
    # Wrap each string in a (1, 1) object array, which Triton treats as a BYTES tensor
    obj = np.array([data], dtype="object").reshape((-1, 1))

    i = httpclient.InferInput(name, obj.shape, np_to_triton_dtype(obj.dtype))
    i.set_data_from_numpy(obj)
    inputs.append(i)

output_img = httpclient.InferRequestedOutput("generated_image")
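Because `gen_args` travels as a JSON string, the handler has to parse it back into keyword arguments on the server side. The sketch below shows one way that could look. It is not the code in `model.py`: `parse_gen_args` and `DEFAULT_GEN_ARGS` are hypothetical names, and the defaults simply mirror the values used in this notebook.

```python
import json

# Illustrative defaults only; the real handler may use different ones.
DEFAULT_GEN_ARGS = {"num_inference_steps": 50, "guidance_scale": 7.0, "seed": None}


def parse_gen_args(raw):
    """Decode a JSON-encoded gen_args value (bytes or str) and fill in defaults."""
    # String tensors reach the Python backend as bytes objects.
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")

    overrides = json.loads(raw) if raw and raw.strip() else {}
    if not isinstance(overrides, dict):
        raise ValueError("gen_args must be a JSON object")

    # Reject misspelled keys early instead of silently ignoring them.
    unknown = set(overrides) - set(DEFAULT_GEN_ARGS)
    if unknown:
        raise ValueError(f"unsupported gen_args keys: {sorted(unknown)}")

    return {**DEFAULT_GEN_ARGS, **overrides}
```

Rejecting unknown keys is a design choice in this sketch: a typo such as `num_steps` then fails loudly rather than producing an image generated with default settings.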

In [None]:
target_model = "james"

In [None]:
start = time.time()
query_response = client.infer(model_name=target_model, inputs=inputs, outputs=[output_img])

print(f"took {time.time()-start} seconds")

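# The output tensor holds a single base64-encoded image
image = query_response.as_numpy("generated_image")

# Drop the singleton dimensions and convert the element to a Python bytes object
test = np.squeeze(image).tolist()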
Image.open(BytesIO(base64.b64decode(test)))
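To keep a generated image rather than only displaying it, the base64 payload can be decoded and written to disk. The following is a minimal sketch using only the standard library. `save_base64_image` is a hypothetical helper, and the notebook does not state which image format the handler encodes, so the file extension you choose is an assumption.

```python
import base64
from pathlib import Path


def save_base64_image(payload, path):
    """Decode a base64 payload (bytes or str) and write the raw image bytes to path."""
    raw = base64.b64decode(payload)
    out = Path(path)
    out.write_bytes(raw)
    return out
```

For example, passing the decoded response element and a timestamped file name such as `f"generated-{ts}.png"` would store one file per run, assuming the handler returns PNG data.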

To check whether the Neuron device is being used, run the `neuron-top` command on the instance.

(Note: Device Memory should be non-zero.)

### Clean Up

In [None]:
!docker kill $FIRST_CONTAINER_ID