## Deploy Hugging Face models with ezsmdeploy and streaming responses

Text Generation Inference (TGI) is an open source, purpose-built solution for deploying and serving large language models (LLMs). Its optimizations include tensor parallelism for faster multi-GPU inference, dynamic batching to increase overall throughput, and optimized transformers code using flash-attention for popular model architectures such as BLOOM, T5, GPT-NeoX, StarCoder, and LLaMA.

TGI deep learning containers support token streaming using Server-Sent Events (SSE). With token streaming, the server can start returning tokens right after the first prefill pass, without waiting for the whole generation to finish. For very long generations, this means clients start seeing output long before the full response is complete. The following diagram shows a high-level end-to-end request/response workflow for hosting LLMs on a SageMaker endpoint using the TGI container.
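To make the streaming format concrete, here is a minimal sketch of how a client might decode SSE lines into token text. It assumes each event arrives as a line of the form `data:{...}` whose JSON carries a `token` object with a `text` field, which is the shape the parsing code later in this notebook relies on. The helper name `parse_sse_line` and the sample lines are illustrative and were not captured from a real endpoint.

```python
import json
from typing import Optional


def parse_sse_line(line: bytes) -> Optional[str]:
    """Return the token text carried by one SSE line, or None if there is none.

    Assumption: data lines look like b'data:{"token": {"text": " a"}}'.
    """
    text = line.decode("utf-8").strip()
    if not text.startswith("data:"):
        return None  # blank separator lines or other SSE fields
    event = json.loads(text[len("data:"):])
    token = event.get("token")
    return token["text"] if token else None


# Hand-written sample lines (illustrative only).
sample_lines = [
    b'data:{"token": {"id": 1, "text": " France", "special": false}}',
    b"",
    b'data:{"token": {"id": 2, "text": ".", "special": false}}',
]
tokens = [t for t in (parse_sse_line(l) for l in sample_lines) if t is not None]
print("".join(tokens))
```

Because each line is decoded on its own, the client can print a token as soon as its line arrives instead of waiting for the whole response.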

In [1]:
%pip uninstall -y ezsmdeploy --quiet
# !pip install --upgrade pip
%pip install ezsmdeploy==2.0.5 --quiet

Note: you may need to restart the kernel to use updated packages.

## Testing foundation model 

In [2]:
import sagemaker
sagemaker.__version__

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


'2.196.0'

In [3]:
# !pip show ezsmdeploy

In [5]:
import ezsmdeploy

In [6]:
ezonsm = ezsmdeploy.Deploy(model = "tiiuae/falcon-7b-instruct",
                           huggingface_model=True,
                           foundation_model=True,
                           instance_type='ml.g5.2xlarge'
                           )


### Normal response

In [16]:
%%time
ezonsm.predict({"inputs":"Paris is the capital of "})

CPU times: user 3.18 ms, sys: 0 ns, total: 3.18 ms
Wall time: 649 ms


[{'generated_text': 'Paris is the capital of “La Belle France” and is a city that is full of surprises. It is a city that'}]

In [135]:
body = {
    "inputs":"Paris is the capital of ",
    "parameters":{
        "max_new_tokens":50,
        "return_full_text": False,
        
    },
    "stream": True
}
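The request body above combines a prompt, generation parameters, and a streaming flag. As a small sketch, the same structure could be built with a helper so that streaming and non-streaming calls share one code path. The function name `build_tgi_request` is hypothetical; only the keys `inputs`, `parameters`, and `stream` and the two parameter names come from the cell above.

```python
import json


def build_tgi_request(prompt: str, stream: bool = False, **parameters) -> dict:
    """Assemble a request body in the shape used in this notebook (hypothetical helper)."""
    body = {"inputs": prompt, "parameters": parameters}
    if stream:
        body["stream"] = True  # only added when streaming is requested
    return body


streaming_body = build_tgi_request(
    "Paris is the capital of ",
    stream=True,
    max_new_tokens=50,
    return_full_text=False,
)
print(json.dumps(streaming_body, indent=2))
```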

In [136]:
import boto3

client = boto3.client('sagemaker-runtime')

In [137]:
import io
class LineIterator:
    """
    A helper class for parsing the byte stream input. 
    
    The output of the model will be in the following format:
    ```
    b'{"outputs": [" a"]}\n'
    b'{"outputs": [" challenging"]}\n'
    b'{"outputs": [" problem"]}\n'
    ...
    ```
    
    While usually each PayloadPart event from the event stream will contain a byte array 
    with a full json, this is not guaranteed and some of the json objects may be split across
    PayloadPart events. For example:
    ```
    {'PayloadPart': {'Bytes': b'{"outputs": '}}
    {'PayloadPart': {'Bytes': b'[" problem"]}\n'}}
    ```
    
    This class accounts for this by appending the bytes of each incoming event to an
    internal buffer and returning complete lines (ending with a '\n' character) through
    the iterator protocol ('__iter__' and '__next__'). It tracks the last read position
    to ensure that bytes already returned are not returned again.
    """
    
    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line and line[-1] == ord('\n'):
                self.read_pos += len(line)
                return line[:-1]
            try:
                chunk = next(self.byte_iterator)
            except StopIteration:
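                # Stream exhausted: return any trailing bytes that lack a final
                # newline once, instead of looping on them forever.
                if line:
                    self.read_pos += len(line)
                    return line
                raise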
            if 'PayloadPart' not in chunk:
                print('Unknown event type:', chunk)  # chunk is a dict, so it cannot be concatenated to a str
                continue
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk['PayloadPart']['Bytes'])
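The buffering idea in `LineIterator` can be checked without a live endpoint. The following sketch is an alternative, generator-based formulation (not the class above) that reassembles lines from simulated `PayloadPart` events, including one JSON object split across two events as described in the docstring. The name `iter_lines` and the fake events are assumptions made for illustration.

```python
from typing import Iterable, Iterator


def iter_lines(events: Iterable[dict]) -> Iterator[bytes]:
    """Yield lines (without the trailing newline) reassembled from PayloadPart events."""
    pending = b""
    for event in events:
        part = event.get("PayloadPart")
        if part is None:
            continue  # this sketch ignores non-payload events
        pending += part["Bytes"]
        # Emit every complete line currently held in the buffer.
        while b"\n" in pending:
            line, pending = pending.split(b"\n", 1)
            yield line
    if pending:
        yield pending  # trailing bytes without a final newline


# Simulated event stream in which the second JSON object is split across events.
fake_events = [
    {"PayloadPart": {"Bytes": b'{"outputs": [" a"]}\n{"outputs": '}},
    {"PayloadPart": {"Bytes": b'[" problem"]}\n'}},
]
lines = list(iter_lines(fake_events))
print(lines)
```

A generator keeps the state in local variables, which is shorter; the class-based version is useful when the read position needs to be inspected or the iterator is passed around as an object.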

In [138]:
import json
from sagemaker.base_deserializers import StreamDeserializer

ezonsm.deserializer=StreamDeserializer()

resp = client.invoke_endpoint_with_response_stream(EndpointName=ezonsm.endpoint_name,
                                                   Body=json.dumps(body),
                                                   ContentType="application/json")

event_stream = resp['Body']

In [139]:
import time
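for line in LineIterator(event_stream):
    text = line.decode("utf-8").strip()
    # Skip blank separator lines and anything that is not an SSE data record
    if not text.startswith("data:"):
        continue
    # Parse the JSON payload directly instead of patching the raw string
    event = json.loads(text[len("data:"):])
    token = event.get("token")
    if token:
        print(token["text"], end='')
        time.sleep(0.01)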

“La Belle France” and is a city that is full of surprises. It is a city that is full of surprises, and it is a city that is full of surprises.
Paris is a city that is full of surprises, and it is
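One way to quantify the benefit of streaming is to compare the time until the first token with the time until the full response. The sketch below measures both over any iterable of token strings. It uses a fake token source with an artificial delay, so the numbers it prints say nothing about a real endpoint; `timed_stream` and `fake_token_stream` are hypothetical names.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def timed_stream(tokens: Iterable[str]) -> Tuple[str, Optional[float], float]:
    """Consume a token stream and return (text, seconds to first token, total seconds)."""
    start = time.perf_counter()
    first: Optional[float] = None
    pieces = []
    for tok in tokens:
        if first is None:
            first = time.perf_counter() - start  # latency until the first token
        pieces.append(tok)
    total = time.perf_counter() - start
    return "".join(pieces), first, total


def fake_token_stream() -> Iterator[str]:
    """Stand-in for a streamed response: one token every 20 ms."""
    for tok in ["Paris", " is", " big", "."]:
        time.sleep(0.02)
        yield tok


text, ttft, total = timed_stream(fake_token_stream())
print(f"{text!r}: first token after {ttft:.3f}s, full response after {total:.3f}s")
```

With a real endpoint, the iterable would be the token texts extracted from the event stream in the loop above.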

In [140]:
ezonsm.predictor.delete_endpoint()