# Hugging Face LLM Chatbot application using Gradio  

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/inference|generativeai|huggingfacetgi|open-assistant|open-assistant-chatbot.ipynb)

---

[AWS Deep Learning Containers (DLC)](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-dlc.html) let you quickly deploy deep learning environments with optimized, prepackaged container images. This sample notebook is a quick start for deploying open source LLMs to Amazon SageMaker for inference using the [Hugging Face LLM Inference Container](https://huggingface.co/docs/sagemaker/index), which is powered by Text Generation Inference (TGI). [TGI](https://github.com/huggingface/text-generation-inference) is a toolkit for deploying and serving Large Language Models (LLMs).

This sample notebook uses an open source chat LLM trained as part of the [Open Assistant](https://open-assistant.io/bye) initiative, with [Gradio](https://www.gradio.app/) as a demo chat app to validate inference.


# Environment


- `Amazon SageMaker Studio JupyterLab` `(JupyterLab 3.0, 5GB storage and ml.t3.medium instance)` with `public internet access`
- `Amazon SageMaker` Python SDK version `2.163.0 +`
- `Gradio` Python package version `4.16.0 +`

# Step-1: Uninstall packages

This optional step minimizes the chance of package dependency conflicts when running the [Gradio](https://www.gradio.app/) chat apps later in this notebook.

In [None]:
# cell 01 : Optional step to reduce dependency conflicts with packages

!pip uninstall sagemaker -y
!pip uninstall gradio -y

# Step-2: Install packages and environment  

Install `Amazon SageMaker Python SDK` and `Gradio` 

In [None]:
# cell 02 : Install packages required for this notebook

!pip install sagemaker --quiet
!pip install gradio  --upgrade

In [None]:
# cell 03 : Get SageMaker session details (region and execution role) using the SageMaker Python SDK

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
import time

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()

print(f"Amazon SageMaker role arn: {role}")
print(f"Amazon SageMaker session region: {region}")

# Step-3: Retrieve LLM image URI

Use the Amazon SageMaker SDK helper function `get_huggingface_llm_image_uri()` to retrieve the appropriate image URI from `Amazon ECR` for the Hugging Face Large Language Model (LLM). The function returns the URI of the desired Hugging Face LLM DLC based on the specified `backend`, `session`, `region`, and `version`. Refer to the [Amazon SageMaker SDK](https://github.com/aws/sagemaker-python-sdk) for details.

- `backend` : Valid values include "huggingface" and "lmi". "lmi" selects the SageMaker LMI inference backend, and "huggingface" selects the Hugging Face TGI inference backend
- `session` : The SageMaker Session to use. (Default: None)
- `region`  : The AWS region to use for the image URI. (Default: None)
- `version` : The framework version for which to retrieve an image URI
  

In [None]:
# cell 04 : Get the image URI for the Hugging Face Large Language Model (LLM) inference

# from sagemaker.huggingface import get_huggingface_llm_image_uri

llm_image = get_huggingface_llm_image_uri(backend="huggingface", region=region, version="0.8.2")

# print Amazon ECR image uri

print(f"llm image uri: {llm_image}")

# Step-4: Configure and create SageMaker Endpoint



In this section:

- Configure the instance type for the SageMaker deployment that will be used in the `deploy()` function in `Step-5`
- Configure the `env` variables for the `HuggingFaceModel` class. Environment variables include
    - `HF_MODEL_ID`, which identifies the model from the Hugging Face Hub that will be deployed
    - `SM_NUM_GPUS`, which is set to the number of available GPUs on the selected instance type
    - `HF_MODEL_QUANTIZE`, an optional variable that reduces the memory footprint of the model
- Configure the Hugging Face `model` object with the `image_uri`, the Amazon SageMaker execution `role`, and the `env` variables specified above
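
The environment variables above are passed to the container as strings. The following is a minimal sketch of how the `env` dictionary could be assembled and sanity-checked before it is handed to `HuggingFaceModel`. The helper name `build_tgi_env` and its defaults are hypothetical and only illustrate the idea; the variable names are the ones used in this notebook.

```python
import json


def build_tgi_env(model_id, num_gpus, max_input_length=1024,
                  max_total_tokens=2048, quantize=None):
    """Build the env dict for the container (hypothetical helper).

    Every value is serialized to a string because environment
    variables cannot hold integers.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if max_input_length >= max_total_tokens:
        # The total budget covers prompt plus generated tokens,
        # so the prompt limit has to leave room for generation.
        raise ValueError("max_input_length must be smaller than max_total_tokens")

    env = {
        "HF_MODEL_ID": model_id,
        "SM_NUM_GPUS": json.dumps(num_gpus),
        "MAX_INPUT_LENGTH": json.dumps(max_input_length),
        "MAX_TOTAL_TOKENS": json.dumps(max_total_tokens),
    }
    if quantize is not None:
        # Optional: for example "bitsandbytes", as in the cell below.
        env["HF_MODEL_QUANTIZE"] = quantize
    return env


hub_example = build_tgi_env("OpenAssistant/pythia-12b-sft-v8-7k-steps", num_gpus=4)
print(hub_example)
```

Validating the token limits locally surfaces a misconfiguration before the deployment is started, rather than after waiting for the container to fail its health check.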


In [None]:
# cell 05: Configure the model with the ideal sagemaker inference instance & create SageMaker endpoint

import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.g4dn.12xlarge"
number_of_gpu = 4
health_check_timeout = 300

# Define Model and Endpoint configuration parameter

model_name = "pythia-12b-sft-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

hub = {
    "HF_MODEL_ID": "OpenAssistant/pythia-12b-sft-v8-7k-steps",  # model_id from hf.co/models
    "SM_NUM_GPUS": json.dumps(number_of_gpu),  # Number of GPU used per replica
    "MAX_INPUT_LENGTH": json.dumps(1024),  # Max length of input text
    "MAX_TOTAL_TOKENS": json.dumps(2048),  # Max length of the generation (including input text)
    # 'HF_MODEL_QUANTIZE': "bitsandbytes", # uncomment to quantize
}

# create HuggingFaceModel with the image uri

llm_model = HuggingFaceModel(name=model_name, role=role, image_uri=llm_image, env=hub)

# Step-5: Deploy model for real-time inference

After creating the `HuggingFaceModel`, deploy the model by invoking the `deploy()` function. Here the model is deployed to an `ml.g4dn.12xlarge` instance. TGI automatically distributes and shards the model across all GPUs.
Deployment may take `15 to 20 minutes` to complete.


In [None]:
# cell 06: Deploy model to an endpoint

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=model_name,
    # volume_size=400, # If using an instance with local SSD storage, volume_size must be None, e.g. p4 but not p3
    container_startup_health_check_timeout=health_check_timeout,  # seconds allowed for the container to load the model and pass health checks
)

print(f"Amazon SageMaker endpoint name : {model_name}")

# Step-6: Evaluation 

Once the endpoint status is `InService`, run the following cell to validate it.
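
`deploy()` waits for the endpoint by default, but if the kernel was restarted or the endpoint was created elsewhere, the status can be checked explicitly. The following is a minimal sketch that polls the SageMaker `describe_endpoint` API until the endpoint reports `InService`. The function name `wait_until_in_service` and its timing defaults are hypothetical; `sm_client` is assumed to be a `boto3.client("sagemaker")` instance.

```python
import time


def wait_until_in_service(sm_client, endpoint_name, poll_seconds=30, timeout_seconds=1800):
    """Poll the endpoint status until it is InService (hypothetical helper)."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} is still {status}")
        time.sleep(poll_seconds)


# Usage (assumes AWS credentials and the `model_name` variable from cell 05):
# import boto3
# wait_until_in_service(boto3.client("sagemaker"), model_name)
```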

We use the predictor's `predict` method to evaluate the inference endpoint. Different generation parameters can be passed in the `parameters` attribute of the payload to influence the output. Refer to the [OpenAPI specification of TGI](https://huggingface.github.io/text-generation-inference/) in the Swagger documentation for the parameters that TGI supports.

`OpenAssistant/pythia-12b-sft-v8-7k-steps` is a conversational chat model, so we can chat with it using the following prompt format:

```
<|prompter|>[Instruction]<|endoftext|>
<|assistant|>
```
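
The prompt format above extends naturally to multi-turn conversations: each earlier exchange is written as a prompter turn followed by a completed assistant turn, and the final assistant tag is left open for the model to continue. The following is a minimal sketch of that assembly; the helper name `build_prompt` and the `(user, assistant)` pair layout for `history` are illustrative assumptions.

```python
PROMPTER = "<|prompter|>"
ASSISTANT = "<|assistant|>"
END = "<|endoftext|>"


def build_prompt(message, history=()):
    """Join earlier (user, assistant) turns and the new message into one prompt."""
    parts = []
    for user_turn, assistant_turn in history:
        parts.append(f"{PROMPTER}{user_turn}{END}{ASSISTANT}{assistant_turn}{END}")
    # The trailing assistant tag is left open so the model writes the reply.
    parts.append(f"{PROMPTER}{message}{END}{ASSISTANT}")
    return "".join(parts)


print(build_prompt("What is Amazon SageMaker?"))
# prints: <|prompter|>What is Amazon SageMaker?<|endoftext|><|assistant|>
```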




In [None]:
# cell 07: Validate the model using a simple prompt

chat = llm.predict(
    {
        "inputs": """<|prompter|>what are federal reserve's responsibilities ?<|endoftext|><|assistant|>"""
    }
)

print(chat[0]["generated_text"])

In [None]:
# cell 08: Define a payload and run inference; generation parameters go in the payload's "parameters" attribute


prompt = """<|prompter|>give me 3 business ideas that will help when federal reserve decreases interest rate<|endoftext|><|assistant|>"""

# hyperparameters for llm
payload = {
    "inputs": prompt,
    "parameters": {
        "do_sample": True,
        "top_p": 0.7,
        "temperature": 0.7,
        "top_k": 50,
        "max_new_tokens": 256,
        "repetition_penalty": 1.03,
        "stop": ["<|endoftext|>"],
    },
}

# send request to endpoint
response = llm.predict(payload)

# print(response[0]["generated_text"][:-len("<human>:")])

print(response[0]["generated_text"])

# Step-7: Create a chatbot dummy app using Gradio

[Gradio](https://www.gradio.app/) helps you build and share demo applications quickly. Here we use its [chat application](https://www.gradio.app/guides/creating-a-chatbot-fast) features to demonstrate the chatbot. In this section, a dummy chatbot is created to validate that the Gradio dependencies are imported and working.

If the result shows a successful dummy chatbot app with `local URL` and `public URL`, proceed to `Step-8`. 

In [None]:
# cell 09 : Create a dummy chat application to validate that the Gradio dependencies are imported and working

import gradio as gr
import random


def random_response(message, history):
    return random.choice(["Yes", "No"])


gr.ChatInterface(random_response).launch()

# Step-8: Chatbot application 

Once the dummy chatbot application is working, let's create a customized chat app that invokes the Hugging Face chat LLM through the SageMaker inference endpoint that we deployed and validated with prompts in the previous steps.

In this chatbot application, Gradio’s [low-level Blocks APIs](https://www.gradio.app/guides/creating-a-custom-chatbot-with-blocks) are used.

A successful result shows a responsive chatbot with a `local URL` and a `public URL`.
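
The chat handler in the next cell slices the prompt off the front of the generated text before showing the reply. The following is a slightly more defensive sketch of that step, assuming the endpoint may or may not echo the prompt and that a stop sequence can appear in the output. The helper name `extract_reply` is hypothetical.

```python
def extract_reply(generated_text, prompt, stop_sequences=("<|endoftext|>",)):
    """Return only the assistant's reply from the raw generated text."""
    # Drop the echoed prompt only when it is actually present.
    if generated_text.startswith(prompt):
        reply = generated_text[len(prompt):]
    else:
        reply = generated_text
    # Cut at the first stop sequence, if one made it into the output.
    for stop in stop_sequences:
        cut = reply.find(stop)
        if cut != -1:
            reply = reply[:cut]
    return reply.strip()


example_prompt = "<|prompter|>Is the sky blue?<|endoftext|><|assistant|>"
example_output = example_prompt + "Yes, on a clear day.<|endoftext|>"
print(extract_reply(example_output, example_prompt))
# prints: Yes, on a clear day.
```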




In [None]:
# cell 10: Gradio chatbot that invokes the chat LLM

import gradio as gr

# hyperparameters for llm
parameters = {
    "do_sample": True,
    "top_p": 0.7,
    "temperature": 0.7,
    "top_k": 50,
    "max_new_tokens": 256,
    "repetition_penalty": 1.03,
    "stop": ["<|endoftext|>"],
}

with gr.Blocks() as demo:
    gr.Markdown("## Gradio ChatBot for HuggingFace LLM")
    with gr.Column():
        chatbot = gr.Chatbot()
        with gr.Row():
            with gr.Column():
                message = gr.Textbox(
                    label="Chat Message Box", placeholder="Chat Message Box", show_label=False
                )
            with gr.Column():
                with gr.Row():
                    submit = gr.Button("Submit")
                    clear = gr.Button("Clear")

    def respond(message, chat_history):
        # convert chat history to prompt
        converted_chat_history = ""
        if len(chat_history) > 0:
            for c in chat_history:
                converted_chat_history += (
                    f"<|prompter|>{c[0]}<|endoftext|><|assistant|>{c[1]}<|endoftext|>"
                )
        prompt = f"{converted_chat_history}<|prompter|>{message}<|endoftext|><|assistant|>"

        # send request to endpoint
        llm_response = llm.predict({"inputs": prompt, "parameters": parameters})

        # remove prompt from response
        parsed_response = llm_response[0]["generated_text"][len(prompt) :]
        chat_history.append((message, parsed_response))
        return "", chat_history

    submit.click(respond, [message, chatbot], [message, chatbot], queue=False)
    clear.click(lambda: None, None, chatbot, queue=False)

demo.launch(share=True)

# Step-9: Cleaning Up

As a best practice, and to avoid incurring costs, delete the Amazon SageMaker model and endpoint once you are done.
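
If deleting the model raises an error, the endpoint, which is the resource that accrues instance charges, would be left running. The following is a minimal sketch that still attempts the endpoint deletion in that case. The helper name `clean_up` is hypothetical, and `predictor` is assumed to be the object returned by `deploy()` (`llm` in this notebook); the call order mirrors the cell below.

```python
def clean_up(predictor):
    """Delete the model, then always attempt to delete the endpoint."""
    try:
        predictor.delete_model()
    finally:
        # Runs even if delete_model() raised, so the endpoint is not left behind.
        predictor.delete_endpoint()


# Usage (assumes the `llm` predictor from cell 06):
# clean_up(llm)
```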

In [None]:
llm.delete_model()
llm.delete_endpoint()

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.


![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/inference|generativeai|huggingfacetgi|open-assistant|open-assistant-chatbot.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/inference|generativeai|huggingfacetgi|open-assistant|open-assistant-chatbot.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/inference|generativeai|huggingfacetgi|open-assistant|open-assistant-chatbot.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/inference|generativeai|huggingfacetgi|open-assistant|open-assistant-chatbot.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/inference|generativeai|huggingfacetgi|open-assistant|open-assistant-chatbot.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/inference|generativeai|huggingfacetgi|open-assistant|open-assistant-chatbot.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/inference|generativeai|huggingfacetgi|open-assistant|open-assistant-chatbot.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/inference|generativeai|huggingfacetgi|open-assistant|open-assistant-chatbot.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/inference|generativeai|huggingfacetgi|open-assistant|open-assistant-chatbot.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/inference|generativeai|huggingfacetgi|open-assistant|open-assistant-chatbot.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/inference|generativeai|huggingfacetgi|open-assistant|open-assistant-chatbot.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/inference|generativeai|huggingfacetgi|open-assistant|open-assistant-chatbot.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/inference|generativeai|huggingfacetgi|open-assistant|open-assistant-chatbot.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/inference|generativeai|huggingfacetgi|open-assistant|open-assistant-chatbot.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/inference|generativeai|huggingfacetgi|open-assistant|open-assistant-chatbot.ipynb)
