# SageMaker Real-Time Inference for the Llama 2 7B Chat LMI Model Using Response Streaming


## Set up development environment
Upgrade `pip` and install the latest versions of the `sagemaker` and `boto3` packages.

In [2]:
!pip install -Uq pip
!pip install -Uq boto3 sagemaker

Restore the `endpoint_name` from the deployment notebook.

In [3]:
%store -r endpoint_name

## Set up code for real-time inference with response streaming

Amazon SageMaker's `InvokeEndpointWithResponseStream` API lets developers stream responses back from SageMaker models as they are generated, which can improve customer satisfaction by reducing perceived latency. This matters most for applications built on generative AI models, where showing the first tokens immediately is more valuable than waiting for the entire response.

The `print_response_stream` helper function parses the response stream returned by an inference request made through the `InvokeEndpointWithResponseStream` API.
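
To make the perceived-latency point concrete, the sketch below measures time to first chunk against total time for any iterable of stream events. It is only an illustration: `measure_stream_latency` and `fake_event_stream` are hypothetical names that are not part of this notebook, and the fake generator stands in for the `Body` of a real response so that the snippet runs without an endpoint.

```python
import time


def measure_stream_latency(event_stream):
    """Return (seconds_to_first_chunk, total_seconds, chunk_count) for an iterable of events."""
    start = time.perf_counter()
    first_chunk_at = None
    chunk_count = 0
    for event in event_stream:
        if "PayloadPart" not in event:
            continue  # ignore events that carry no payload bytes
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter() - start
        chunk_count += 1
    total = time.perf_counter() - start
    return first_chunk_at, total, chunk_count


def fake_event_stream(chunks, delay_seconds=0.01):
    """Hypothetical stand-in for response['Body']: yields one event per chunk after a delay."""
    for chunk in chunks:
        time.sleep(delay_seconds)
        yield {"PayloadPart": {"Bytes": chunk}}


first, total, count = measure_stream_latency(fake_event_stream([b"Hello", b" ", b"world"]))
print(f"first chunk after {first:.3f}s, all {count} chunks after {total:.3f}s")
```

Against a real endpoint, the same function could be given `response_stream['Body']`; the gap between the two timings is the wait that streaming hides from the user.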

In [5]:
import sagemaker
import boto3
import botocore
import json
sagemaker_runtime = boto3.client('sagemaker-runtime')

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


In [6]:
def get_realtime_response_stream(sagemaker_runtime, endpoint_name, payload):
    """Invoke the endpoint and return the streaming response."""
    response_stream = sagemaker_runtime.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        Body=json.dumps(payload),
        ContentType="application/json",
        CustomAttributes='accept_eula=true'
    )
    return response_stream

def print_response_stream(response_stream):
    """Print the generated text chunk by chunk as it arrives."""
    event_stream = response_stream['Body']
    start_json = b'{"generated_text": "'
    stop_json = b'}'
    encoding = 'utf8'
    for event in event_stream:
        line = event['PayloadPart']['Bytes']
        # Skip the chunk that opens the JSON envelope.
        if start_json in line:
            continue

        # A chunk holding only the escaped sequence \n becomes a real line break.
        if b'\\n' == line:
            print('')
            continue

        # Drop the final byte of a chunk that contains the closing brace.
        if stop_json in line:
            line = line[:-1]

        line = line.decode(encoding)
        print(line, end='')
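
Because `print_response_stream` prints raw JSON fragments, escaped characters such as `\ud83d\ude80` appear literally in the output. As a hedged alternative, the sketch below buffers the whole stream and parses it once with `json.loads`, which decodes those escapes. It assumes the concatenated chunks form a single JSON object with a `generated_text` key, as the `start_json` marker in the helper suggests; verify this against your container version. `collect_generated_text` is a hypothetical name, and the events are simulated so that the snippet runs without an endpoint.

```python
import json


def collect_generated_text(event_stream):
    """Join all payload chunks, then parse the result as one JSON document."""
    buffer = bytearray()
    for event in event_stream:
        if "PayloadPart" in event:
            buffer.extend(event["PayloadPart"]["Bytes"])
    # Assumption: the full body looks like {"generated_text": "..."}.
    return json.loads(buffer.decode("utf-8"))["generated_text"]


# Simulated events; a real call would pass response_stream['Body'] instead.
fake_events = [
    {"PayloadPart": {"Bytes": b'{"generated_text": "'}},
    {"PayloadPart": {"Bytes": b"Hello"}},
    {"PayloadPart": {"Bytes": b" Alice \\ud83d\\ude80"}},
    {"PayloadPart": {"Bytes": b'"}'}},
]
text = collect_generated_text(fake_events)
print(text)
```

Buffering gives up incremental display, so this variant suits post-processing (logging, storing the final email) rather than live rendering.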

## Prepare Prompt and instructions

To prompt Llama 2, use the following prompt format:

```
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST]
```
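
As a rough illustration of the template, the sketch below fills it in for one system prompt and one user message. `format_single_turn` is a hypothetical helper that is not part of this notebook, and the two example strings are made up for the demonstration.

```python
def format_single_turn(system_prompt, user_message):
    """Fill the Llama 2 chat template for one system prompt and one user message."""
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system_prompt}\n"
        "<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )


example_prompt = format_single_turn(
    "You are a concise assistant.",
    "Summarize response streaming in one sentence.",
)
print(example_prompt)
```

The model's reply is expected to follow the closing `[/INST]` tag, which is why the prompt ends there.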

In [7]:
def build_llama2_prompt(instructions):
    stop_token = "</s>"
    start_token = "<s>"
    startPrompt = f"{start_token}[INST] "
    endPrompt = " [/INST]"
    conversation = []
    for index, instruction in enumerate(instructions):
        if instruction["role"] == "system" and index == 0:
            conversation.append(f"<<SYS>>\n{instruction['content']}\n<</SYS>>\n\n")
        elif instruction["role"] == "user":
            conversation.append(instruction["content"].strip())
        else:
            conversation.append(f"{endPrompt} {instruction['content'].strip()} {stop_token}{startPrompt}")

    return startPrompt + "".join(conversation) + endPrompt

def get_instructions(user_content):
    
    '''
    Note: The instructions list is re-initialized for every user_content,
    so content from earlier requests does not carry over when you run inference multiple times with a new ask each time.
    '''
    
    system_content = '''
    You are a friendly and knowledgeable email marketing agent, Mr.MightyMark, working at AnyCompany. 
    Your goal is to send email to subscribers to help them understand the value of the new product and generate excitement for the launch.

    Here are some tips on how to achieve your goal:

    Be Professional. Address each subscriber by name and use a Professional tone.
    Be informative. Explain the key features and benefits of the new product in a clear and concise way.
    Be persuasive. Highlight how the new product can solve the subscriber's problems or improve their lives.

    By following these tips, you can use email marketing to help your company launch a successful software product.
    '''

    instructions = [
        { "role": "system","content": f"{system_content} "},
    ]
    
    instructions.append({"role": "user", "content": f"{user_content}"})
    
    return instructions


## Run inference against the Llama 2 Chat MPI SageMaker endpoint with streaming response

### Inference example 1

In [13]:
user_ask_1 = f'''
AnyCompany recently announced new service launch named AnyCloud Internet Service.
Write a short email about the product launch with Call to action to Alice Smith, whose email is alice.smith@example.com
Mention the Coupon Code: EARLYB1RD to get 20% for 1st 3 months.
'''

instructions = get_instructions(user_ask_1)
prompt = build_llama2_prompt(instructions)
print(prompt)

<s>[INST] <<SYS>>

    You are a friendly and knowledgeable email marketing agent, Mr.MightyMark, working at AnyCompany. 
    Your goal is to send email to subscribers to help them understand the value of the new product and generate excitement for the launch.

    Here are some tips on how to achieve your goal:

    Be Professional. Address each subscriber by name and use a Professional tone.
    Be informative. Explain the key features and benefits of the new product in a clear and concise way.
    Be persuasive. Highlight how the new product can solve the subscriber's problems or improve their lives.

    By following these tips, you can use email marketing to help your company launch a successful software product.
     
<</SYS>>

AnyCompany recently announced new service launch named AnyCloud Internet Service.
Write a short email about the product launch with Call to action to Alice Smith, whose email is alice.smith@example.com
Mention the Coupon Code: EARLYB1RD to get 20% for 1st 

In [17]:
inference_params = {
        "do_sample": True,
        "top_p": 0.6,
        "temperature": 0.9,
        "top_k": 50,
        "max_new_tokens": 350,
        "repetition_penalty": 1.03,
        "stop": ["</s>"],
        "return_full_text": False
    }
payload = {
    "inputs":  prompt,
    "parameters": inference_params,
    "stream": True ## <-- to have response stream.
}

Because we want a streaming response, the request payload must include the key-value pair **"stream": True**.
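
One way to avoid repeating the payload boilerplate across examples is a small builder that always sets the streaming flag. This is a sketch only: `build_stream_payload` and `DEFAULT_PARAMS` are hypothetical names, and the default values are simply copied from the first inference example.

```python
import json

# Defaults copied from the notebook's first inference example.
DEFAULT_PARAMS = {
    "do_sample": True,
    "top_p": 0.6,
    "temperature": 0.9,
    "top_k": 50,
    "max_new_tokens": 350,
    "return_full_text": False,
}


def build_stream_payload(prompt, **overrides):
    """Return a request payload with "stream": True and merged generation parameters."""
    parameters = {**DEFAULT_PARAMS, **overrides}
    return {"inputs": prompt, "parameters": parameters, "stream": True}


payload_example = build_stream_payload("<s>[INST] Hello [/INST]", max_new_tokens=512)
# json.dumps turns Python's True into JSON true, which is what the endpoint receives.
print(json.dumps(payload_example, indent=2))
```

Keyword overrides such as `max_new_tokens=512` replace only the named defaults, so each example states just what differs.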

In [18]:
resp = get_realtime_response_stream(sagemaker_runtime, endpoint_name, payload)
print_response_stream(resp)

 Subject: Introducing AnyCloud Internet Service - Revolutionize Your Online Experience! \ud83d\ude80

Dear Alice,

I hope this email finds you well! \ud83d\ude0a As a valued subscriber to our newsletter, we're thrilled to announce the launch of our latest innovation - AnyCloud Internet Service! \ud83c\udf10

AnyCloud is designed to provide you with a seamless and secure online experience, offering a range of features that will revolutionize the way you browse the internet. With our cutting-edge technology, you'll enjoy:

\ud83d\udd0d Lightning-fast speeds: Say goodbye to slow loading times and enjoy a seamless browsing experience.

\ud83d\udcbb Advanced security: Protect your personal information and data with our state-of-the-art encryption.

\ud83d\udcf1 Accessibility: Use AnyCloud on any device, at any time, from anywhere.

But that's not all! As an early bird, you can enjoy a special offer of 20% off your first three months of service with the code EARLYB1RD. \ud83c\udf81

Don't mi

### Inference example 2

In [20]:
user_ask_2 = f'''
AnyCompany recently announced new service launch named AnyCloud Streaming Service.
Write a short email about the product launch with Call to action to Alice Smith, whose email is alice.smith@example.com
Mention the Coupon Code: STREAM2DREAM to get 15% for 1st 6 months.
'''

instructions = get_instructions(user_ask_2)
prompt = build_llama2_prompt(instructions)
inference_params = {
        "do_sample": True,
        "top_p": 0.6,
        "temperature": 0.9,
        "top_k": 50,
        "max_new_tokens": 512,
        "return_full_text": False,
    }
payload = {
    "inputs":  prompt,
    "parameters": inference_params,
    "stream": True ## <-- to have response stream.
}

resp = get_realtime_response_stream(sagemaker_runtime, endpoint_name, payload)
print_response_stream(resp)

 Subject: \ud83d\ude80 Introducing AnyCloud Streaming Service - Revolutionize Your Entertainment Experience! \ud83c\udfa5

Dear Alice,

We hope this email finds you well! \ud83d\ude0a We are thrilled to announce the launch of AnyCloud Streaming Service, the ultimate entertainment solution for the modern era! \ud83d\udca5

As a valued subscriber, we're excited to share the key features and benefits of our new service:

\ud83c\udf1f Unlimited access to a vast library of movies, TV shows, and music.
\ud83d\udcbb Stream your favorite content on any device, at any time, and on any screen.
\ud83d\udd25 Enjoy seamless playback and crystal-clear quality, with no buffering or lag.
\ud83c\udfa7 Exclusive content and original series, only available on AnyCloud.

But that's not all! As a special offer for our early adopters, use the coupon code STREAM2DREAM to get 15% off your first 6 months of subscription! \ud83d\udcb0

Don't miss out on this incredible opportunity to elevate your entertainment 

## Cleanup

In [None]:
sm_client = boto3.client('sagemaker')
endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
endpoint_config_name = endpoint['EndpointConfigName']
endpoint_config = sm_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)
model_name = endpoint_config['ProductionVariants'][0]['ModelName']

print(f"""
About to delete the following sagemaker resources:
Endpoint: {endpoint_name}
Endpoint Config: {endpoint_config_name}
Model: {model_name}
""")

In [None]:
# delete endpoint
#sm_client.delete_endpoint(EndpointName=endpoint_name)
# delete endpoint config
#sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
# delete model
#sm_client.delete_model(ModelName=model_name)
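
If you prefer a single call over uncommenting lines, the deletions can be wrapped in a function that takes the client as an argument. This is a sketch under assumptions: `delete_endpoint_resources` is a hypothetical helper, and `RecordingClient` is a stub used only so that the example runs without touching AWS. The three `delete_*` calls are the same ones shown in the commented cell.

```python
def delete_endpoint_resources(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, then its config, then the model, in that order."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)


class RecordingClient:
    """Stub that records calls instead of contacting SageMaker."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only reached for attributes that do not exist, i.e. the delete_* methods.
        def record(**kwargs):
            self.calls.append((name, kwargs))
        return record


stub = RecordingClient()
delete_endpoint_resources(stub, "my-endpoint", "my-endpoint-config", "my-model")
print(stub.calls)
```

With a real client, pass `sm_client` together with the names resolved in the first cleanup cell. Deletion is irreversible, so print and check those names first.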