# Real-Time Inference Streaming with HuggingFace TGI and the OpenChat Model in SageMaker


This tutorial will guide you through using the [Hugging Face Large Language Model](https://huggingface.co/blog/sagemaker-huggingface-llm) Inference Container on Amazon SageMaker to deploy OpenChat, which is a conversational AI assistant that uses large language models to engage in open-ended dialogue and assist with a variety of tasks. You'll run inference streaming with this container, which is powered by [Text Generation Inference](https://github.com/huggingface/text-generation-inference) (TGI) - an open-source, purpose-built solution for deploying and serving LLMs.

TGI enables high-performance text generation by leveraging Tensor Parallelism and dynamic batching. It supports the most popular open-source LLMs, including StarCoder, BLOOM, GPT-NeoX, Llama, and T5.

Make sure the following permissions are granted before running the notebook:

- S3 bucket push access
- SageMaker access

# Introduction

- OpenChat is an innovative library of open-source language models, fine-tuned with [C-RLFT](https://arxiv.org/pdf/2309.11235.pdf) - a strategy inspired by offline reinforcement learning.

- The models learn from mixed-quality data without preference labels, delivering exceptional performance on par with `ChatGPT`, even with a `7B` model which can be run on a consumer GPU (e.g. RTX 3090).

- Despite this simple approach, the OpenChat team is committed to developing a high-performance, commercially viable, open-source large language model, and continues to make significant strides toward this vision.

For more information:

- [OpenChat Git repo](https://github.com/imoneoi/openchat)
- [Hugging Face - OpenChat](https://huggingface.co/openchat)
- [Research Paper](https://arxiv.org/pdf/2309.11235.pdf)

## Step 1: Installing and importing dependencies

We begin by installing and importing the necessary libraries, then create a SageMaker session and look up the IAM role that the notebook runs under.

In [2]:
%pip install sagemaker pip boto3 botocore --upgrade  --quiet

Note: you may need to restart the kernel to use updated packages.


In [13]:
from sagemaker.huggingface import get_huggingface_llm_image_uri
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.session import Session
from sagemaker.base_deserializers import StreamDeserializer
import sagemaker
import json
import boto3
import logging
import io

sagemaker_session = Session()
role = sagemaker_session.get_caller_identity_arn()

Deploying an LLM with the Hugging Face LLM container in Amazon SageMaker is slightly different from deploying regular Hugging Face models. You first retrieve the container URI and then pass it to the `HuggingFaceModel` class as the `image_uri` argument, which points to the image you want to use.

To retrieve the URI for the new Hugging Face Large Language Model (LLM) Deep Learning Containers (DLC) in Amazon SageMaker, you can use the `get_huggingface_llm_image_uri()` method provided by the SageMaker SDK. This method allows you to retrieve the URI for the desired Hugging Face LLM DLC based on the specified backend, session, AWS region, and version.

By using this method, you can easily obtain the necessary container URI to deploy your Hugging Face LLM model in Amazon SageMaker, without having to manually look up and manage the URI information. You can find the available versions [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-text-generation-inference-containers).

In [4]:
# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri("huggingface")

# print ecr image uri
print(f"llm image uri: {llm_image}")

llm image uri: 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.1-tgi1.4.2-gpu-py310-cu121-ubuntu22.04
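
The image URI printed above packs several pieces of information into one string. As a rough sketch (the `parse_ecr_image_uri` helper and its regular expression are illustrative assumptions, not part of the SageMaker SDK, and the pattern only covers the standard `amazonaws.com` domain), it can be split into its parts to check which region and TGI version were resolved:

```python
import re

# Assumed layout: <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_image_uri(uri: str) -> dict:
    """Split an ECR image URI into account, region, repository, and tag."""
    match = ECR_URI.match(uri)
    if match is None:
        raise ValueError(f"not an ECR image URI: {uri!r}")
    return match.groupdict()


# The URI printed by the cell above.
example_uri = (
    "763104351884.dkr.ecr.us-east-1.amazonaws.com/"
    "huggingface-pytorch-tgi-inference:2.1.1-tgi1.4.2-gpu-py310-cu121-ubuntu22.04"
)
parts = parse_ecr_image_uri(example_uri)

# The TGI version is embedded in the tag after the "tgi" marker.
tgi_version = re.search(r"tgi([\d.]+)", parts["tag"]).group(1)
print(parts["region"], parts["repository"], tgi_version)
```

A check like this can be handy when the notebook runs in several regions and you want to confirm that every run resolved the same TGI version.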


In [6]:
hf_model_id = "openchat/openchat-3.5-0106"  # model id from huggingface.co/models
number_of_gpus = 4  # number of gpus to use for inference and tensor parallelism
health_check_timeout = 300  # 5 minutes, to allow time for downloading the model
instance_type = "ml.g5.12xlarge"  # the instance type ml.g5.12xlarge has 4 GPUs

## Step 2: Deploy OpenChat using the TGI image
To deploy the `HuggingFaceModel` on Amazon SageMaker, we'll use the `deploy` method. We'll be deploying the model on the `ml.g5.12xlarge` instance type, as specified earlier. The environment variables used below are described in the [TGI launcher documentation](https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/launcher).

In [7]:
endpoint_name = sagemaker.utils.name_from_base("tgi-model-openchat")
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env={
        "HF_MODEL_ID": hf_model_id,
        "SM_NUM_GPUS": str(number_of_gpus),
        "MAX_INPUT_LENGTH": "1024",
        "MAX_TOTAL_TOKENS": "2048",
        "HF_MODEL_REVISION": "dfcf6be1e44eb54db7af0d05d2760fb1d4969845",
    },
)
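
The environment dictionary above has two easy-to-miss requirements: every value must be a string, and the input limit has to stay below the total token limit. The following is a minimal sketch of a helper that enforces both before deployment; `build_tgi_env` is a hypothetical name introduced here for illustration, and the validation rules reflect my understanding of the TGI launcher rather than an official check.

```python
def build_tgi_env(model_id: str, num_gpus: int, max_input_length: int,
                  max_total_tokens: int) -> dict:
    """Build the container environment, converting every value to a string."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    # Assumption: the prompt limit must be strictly smaller than the total
    # limit, otherwise no tokens would be left for generation.
    if not 0 < max_input_length < max_total_tokens:
        raise ValueError("max_input_length must be positive and below max_total_tokens")
    return {
        "HF_MODEL_ID": model_id,
        "SM_NUM_GPUS": str(num_gpus),
        "MAX_INPUT_LENGTH": str(max_input_length),
        "MAX_TOTAL_TOKENS": str(max_total_tokens),
    }


env = build_tgi_env("openchat/openchat-3.5-0106", 4, 1024, 2048)

# Tokens still available for generation when a prompt uses the full input limit.
min_generation_budget = int(env["MAX_TOTAL_TOKENS"]) - int(env["MAX_INPUT_LENGTH"])
print(env)
print(min_generation_budget)
```

With the values used in this notebook, a prompt that fills all 1024 input tokens still leaves 1024 tokens for the generated answer.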

In [8]:
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
    endpoint_name=endpoint_name,
)

----------!

## Step 3: Initiate streaming inference requests to the deployed SageMaker model endpoint

In [9]:
boto3.set_stream_logger("", logging.INFO)
llm.deserializer = StreamDeserializer()
smr = boto3.client("sagemaker-runtime")
stop_token = "<|endoftext|>"
prompt = "How to study GenAI"

In [10]:
body = {
    "inputs": prompt,
    "parameters": {"max_new_tokens": 2041, "return_full_text": False},
    "stream": True,
}
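
The request body above asks for up to 2041 new tokens, which only fits under `MAX_TOTAL_TOKENS=2048` if the prompt is at most 7 tokens long. Here is a small sketch of a builder that makes this budget explicit. The `build_stream_request` helper and its `prompt_tokens` argument are assumptions for illustration; the token count would have to come from a tokenizer, which is not shown here.

```python
import json


def build_stream_request(prompt, max_new_tokens, max_total_tokens=2048, prompt_tokens=None):
    """Return the JSON-encoded body for a streaming text-generation request.

    prompt_tokens is an optional, externally computed token count for the
    prompt. When it is given, the total budget is checked up front.
    """
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens must be at least 1")
    if prompt_tokens is not None and prompt_tokens + max_new_tokens > max_total_tokens:
        raise ValueError(
            f"{prompt_tokens} prompt tokens + {max_new_tokens} new tokens "
            f"exceed the limit of {max_total_tokens}"
        )
    body = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "return_full_text": False},
        "stream": True,
    }
    return json.dumps(body)


payload = build_stream_request("How to study GenAI", max_new_tokens=512, prompt_tokens=7)
print(payload)
```

Failing early on the client side tends to give a clearer error than a validation failure returned by the endpoint.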

In [11]:
class LineIterator:
    """
    A helper class for parsing the byte stream input from a TGI (Text Generation Inference) container.

    The output of the model will be in the following format:
    ```
    b'data:{"token": {"text": " a"}}\n\n'
    b'data:{"token": {"text": " challenging"}}\n\n'
    b'data:{"token": {"text": " problem"'
    b'}}'
    ...
    ```

    While usually each PayloadPart event from the event stream will contain a complete JSON object,
    this is not guaranteed, and some of the JSON objects may be split across multiple PayloadPart events.
    For example:
    ```
    {'PayloadPart': {'Bytes': b'data:{"token": {"text": '}}
    {'PayloadPart': {'Bytes': b'" problem"}}\n\n'}}
    ```

    This class accounts for this by appending the bytes of each PayloadPart event to an internal
    buffer and returning one line (ending with a '\n' character) per iteration step. It tracks
    the position of the last read to ensure that previous bytes are not returned again, and it
    keeps any pending bytes that do not yet end with a '\n' until the rest of the line arrives,
    so that truncated lines are concatenated.
    """

    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line and line[-1] == ord("\n"):
                self.read_pos += len(line)
                return line[:-1]
            try:
                chunk = next(self.byte_iterator)
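            except StopIteration:
                if self.read_pos < self.buffer.getbuffer().nbytes:
                    # The stream ended without a trailing newline: return the
                    # remaining bytes once instead of looping forever.
                    self.buffer.seek(self.read_pos)
                    line = self.buffer.read()
                    self.read_pos += len(line)
                    return line
                raise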
            if "PayloadPart" not in chunk:
                print("Unknown event type:", chunk)
                continue
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk["PayloadPart"]["Bytes"])

In [14]:
resp = smr.invoke_endpoint_with_response_stream(
    EndpointName=llm.endpoint_name, Body=json.dumps(body), ContentType="application/json"
)
event_stream = resp["Body"]
start_json = b"{"
for line in LineIterator(event_stream):
    if line != b"" and start_json in line:
        data = json.loads(line[line.find(start_json) :].decode("utf-8"))
        if data["token"]["text"] != stop_token:
            print(data["token"]["text"], end="")



# How to study GenAI

GenAI is a complex and rapidly evolving field, and studying it requires a multidisciplinary approach. Here are some steps you can take to study GenAI effectively:

1. Understand the basics: Start by learning the fundamentals of artificial intelligence, machine learning, and natural language processing. This will provide you with a solid foundation for understanding GenAI.

2. Learn programming languages: Familiarize yourself with programming languages such as Python, which is commonly used in AI and machine learning. This will enable you to work with AI tools and frameworks.

3. Study AI models and algorithms: Learn about different AI models and algorithms, such as neural networks, deep learning, and reinforcement learning. This will help you understand how AI systems work and how they can be applied to various tasks.

4. Explore GenAI applications: Study the different applications of GenAI, such as natural language understanding, computer vision, and robotics. 
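
The streaming loop above prints each token as it arrives. If you also want to keep the full response as a string, a sketch along these lines could work; `collect_text` is a hypothetical helper, and it assumes every data line has the `data:{"token": {"text": ...}}` shape shown in the `LineIterator` docstring.

```python
import json


def collect_text(lines, stop_token="<|endoftext|>"):
    """Join the token texts from 'data:' lines, skipping the stop token."""
    pieces = []
    for line in lines:
        if not line.startswith(b"data:"):
            continue  # ignore blank separators and non-data lines
        event = json.loads(line[len(b"data:"):].decode("utf-8"))
        text = event["token"]["text"]
        if text != stop_token:
            pieces.append(text)
    return "".join(pieces)


# Hand-written sample lines standing in for a real event stream.
sample_lines = [
    b'data:{"token": {"text": "Gen"}}',
    b"",
    b'data:{"token": {"text": "AI"}}',
    b'data:{"token": {"text": "<|endoftext|>"}}',
]
print(collect_text(sample_lines))
```

With a live endpoint, `collect_text(LineIterator(event_stream))` would return the whole answer once the stream finishes, at the cost of losing the incremental display.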

## Clean up

In [None]:
# llm.delete_model()
# llm.delete_endpoint()
