# 5: Prompting with Llama 2 Chat

- SageMaker Notebook Kernel: `conda_python3`
- SageMaker Notebook Instance Type: ml.m5d.large | ml.t3.large

In this notebook, you will learn how to structure prompts for Llama 2 Chat. You'll get hands-on experience constructing chat prompts, utilizing system messages to guide the model's behavior, and handling different forms of instruction tokens. The notebook also explores how to include conversation history for context-aware responses and illustrates the use of in-context learning. It further allows for experimentation with different hyperparameters, such as temperature, top_p, and max_tokens_to_sample, offering insights into how these affect the model's output. By the end, you'll have a comprehensive understanding of how to interact effectively with Llama 2 Chat models using Amazon SageMaker.

## Runtime 

This notebook takes approximately 10 minutes to run.

## Contents

1. [Prerequisites](#prerequisites)
1. [Setup](#setup)
1. [Llama 2 Chat prompt format](#llama-2-chat-prompt-format)
1. [Test the endpoint](#test-the-endpoint)
1. [Chat prompts](#chat-prompts)
1. [System messages](#system-messages)
1. [In-context learning](#in-context-learning)
1. [Model hyperparameters](#model-hyperparameters)
1. [Playground](#playground)


## Prerequisites

Deployed LLM endpoint (created by 03 deploy model)



## Setup

Let's start by installing and importing the required packages for this notebook. 

<div class="alert alert-block alert-warning"><b>Note:</b> Verify that the notebook kernel is `conda_python3`. Also, if you run into an issue where a module can't be imported after installation, restart the notebook kernel, then rerun the import notebook cell.</div>

In [None]:
%pip install --upgrade sagemaker --quiet
%pip install ipywidgets --quiet

In [None]:
import os
import io
import json
import boto3
import sagemaker
from IPython.display import display
from ipywidgets import widgets
from lib.llama2 import Llama2

***

Next, we will initialize the SageMaker session and our helper class, `Llama2`, which contains the stream handling implementation from the previous notebook.

***

In [None]:
sagemaker_session = sagemaker.Session()
smr = sagemaker_session.sagemaker_runtime_client
role = sagemaker_session.get_caller_identity_arn()
region = sagemaker_session.boto_region_name

endpoint_name = "llama-2-7b"

print(f"SageMaker version: {sagemaker.__version__}")
print(f"SageMaker role arn: {role}")
print(f"SageMaker session region: {region}")
print(f"Llama 2 SageMaker endpoint name: {endpoint_name}")

## Llama 2 Chat prompt format

Before we get prompting, let's go over the special tokens we need to use for Llama 2 Chat.

### Special tokens

- Sequence Tokens
    - BOS = `<s>`
    - EOS = `</s>`
- Instruction Tokens
    - B_INST = `[INST]`
    - E_INST = `[/INST]`
- System Instruction Tokens
    - B_SYS = `<<SYS>>\n`
    - E_SYS = `\n<</SYS>>\n\n`


These tokens format conversations so that the model can recognize the dialog between the human user and the AI model. The sequence tokens, `<s>` and `</s>`, denote the beginning and end of each dialog message pair (human, AI). The `[INST]` and `[/INST]` tokens delineate a human user message. AI assistant messages do not require special tokens, because Llama 2 Chat models are generally trained with strict user/assistant/user/assistant message ordering. The system tokens, `<<SYS>>` and `<</SYS>>`, delineate special instructions for the model to adhere to, and are optional. System tokens are embedded within the first user message, as seen in the prompt format examples below. With Llama 2 Chat, the prompt should always end with `[/INST]`, which denotes the end of the last message from the human user. Read the prompt format examples below to better understand the format and structure.

### Single prompt

```text
<s>[INST] {{ user_message_1 }} [/INST]
```

### Single prompt with system instruction

```text
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message_1 }} [/INST]
```

### Dialog prompt with system instruction

```text
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message_1 }} [/INST]
{{ ai_response }} </s>
<s>[INST] {{ user_message_2 }} [/INST]
```

<div class="alert alert-block alert-info"><b>Note: </b> When working with Llama 2 Chat outside of this workshop, you may not need the leading <b>&lt;s&gt;</b> token, as the tokenizer might automatically add it. We turned this option off by setting the parameter <b>add_special_tokens</b> to <b>False</b> when calling the tokenizer in the model's <b>inference.py</b> file to improve readability of the input. </div>
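
The three formats above follow one pattern, so they can be assembled by a small function. The sketch below is only an illustration: `build_llama2_prompt` and its parameters are hypothetical names, not part of the workshop's `Llama2` helper class, and it assumes the leading `<s>` is written explicitly as it is in this notebook.

```python
# Special tokens as listed in this notebook.
BOS, EOS = "<s>", "</s>"
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def build_llama2_prompt(user_messages, ai_responses=(), system_prompt=None):
    """Render a dialog in the Llama 2 Chat format (hypothetical helper).

    user_messages: user turns, oldest first; the last one is still unanswered.
    ai_responses:  model replies to every user turn except the last.
    """
    if len(user_messages) != len(ai_responses) + 1:
        raise ValueError("expected exactly one more user message than AI responses")

    turns = []
    for i, user_message in enumerate(user_messages):
        if i == 0 and system_prompt:
            # The system block is embedded inside the first user message.
            user_message = f"{B_SYS}{system_prompt}{E_SYS}{user_message}"
        turn = f"{BOS}{B_INST} {user_message} {E_INST}"
        if i < len(ai_responses):
            # A completed (human, AI) pair is closed with the EOS token.
            turn += f"\n{ai_responses[i]} {EOS}"
        turns.append(turn)
    return "\n".join(turns)


prompt = build_llama2_prompt(
    user_messages=["What is re:Invent?", "When was it started?"],
    ai_responses=[
        "re:Invent is an annual conference and exhibition organized by "
        "Amazon Web Services (AWS)."
    ],
    system_prompt="Answer concisely.",
)
print(prompt)
```

Because the last user message never gets an AI response appended, the rendered prompt always ends with `[/INST]`, as the format requires.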

Let's try some prompts to get a feel for how the model behaves.


## Test the endpoint

Now, let's test the model endpoint we deployed in lab 1. The model endpoint supports two response modes: streaming and non-streaming. Real-time inference response streaming is a new feature of SageMaker (September 2023) that enables a continuous stream of responses back to the client to help build interactive experiences for generative AI applications such as chatbots, virtual assistants, and music generators. To invoke the endpoint with streaming, we will use the `sagemaker_runtime_client.invoke_endpoint_with_response_stream` method. The response is a stream of newline-terminated byte strings (for example, `b'{"outputs": [" a"]}\n'`) that need to be deserialized into a dictionary using `json.loads` before we can print the words on the screen. Let's create a `Parser` class that manages the input stream with a buffer and handles the case where the bytes of a response message are split across chunks. Let's also create a function called `run` to invoke the endpoint and display the content as it arrives.

In [None]:
class Parser:
    def __init__(self):
        self.buff = io.BytesIO()
        self.read_pos = 0

    def write(self, content):
        self.buff.seek(0, io.SEEK_END)
        self.buff.write(content)

    def scan_lines(self):
        self.buff.seek(self.read_pos)
        for line in self.buff.readlines():
            if line and line[-1] == ord("\n"):
                self.read_pos += len(line)
                yield line[:-1]

    def reset(self):
        self.read_pos = 0


def run(prompt, temperature=0.6, top_p=0.9, max_tokens_to_sample=200):
    temperature = float(temperature)
    top_p = float(top_p)
    max_tokens_to_sample = int(max_tokens_to_sample)
    body = {
        "prompt": prompt,
        "temperature": temperature
        if temperature >= 0.0 and temperature <= 1.0
        else 0.6,
        "top_p": top_p if top_p >= 0 and top_p <= 1.0 else 0.9,
        "max_tokens_to_sample": max_tokens_to_sample
        if max_tokens_to_sample < 513
        else 512,
    }
    body = json.dumps(body)
    resp = smr.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name, Body=body, ContentType="application/json"
    )
    event_stream = resp["Body"]
    parser = Parser()
    output = ""
    for event in event_stream:
        parser.write(event["PayloadPart"]["Bytes"])
        for line in parser.scan_lines():
            resp = json.loads(line)
            resp_output = resp.get("outputs")[0]
            if resp_output in ["", " "]:
                continue
            output += resp_output
            print(resp_output, end="")
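
To see why buffering matters without calling the endpoint, the sketch below feeds hand-made byte chunks through the same idea: hold incomplete bytes until a newline arrives, then decode each complete line. The function `iter_json_lines` and the simulated chunks are illustrative assumptions. It is a simplified stand-in for the `Parser` class, not a replacement for it.

```python
import json


def iter_json_lines(chunks):
    """Yield one decoded JSON object per complete, newline-terminated line."""
    pending = b""
    for chunk in chunks:
        pending += chunk
        # Everything before the last newline is complete; the rest must wait
        # for the next chunk.
        *complete, pending = pending.split(b"\n")
        for line in complete:
            if line:
                yield json.loads(line)


# Simulated payload parts: both messages are split across chunk boundaries.
simulated_chunks = [
    b'{"outputs": [" Hel',
    b'lo"]}\n{"outputs"',
    b': [" world"]}\n',
]

tokens = [message["outputs"][0] for message in iter_json_lines(simulated_chunks)]
print("".join(tokens))
```

Calling `json.loads` on each raw chunk instead would fail on the first chunk, because it does not contain a complete JSON document.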

***
Let's ask the model a question:

`What is Amazon Bedrock?`
***

In [None]:
run("<s>[INST] What is Amazon Bedrock? [/INST]")

Do you notice something strange with the model's answer? If you didn't know, you might assume the answer is correct. But, in reality, [Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/what-is-bedrock.html) is a fully managed service that makes base large language models from Amazon and third-party model providers accessible through an API. Llama 2 was trained with data that was gathered before Amazon Bedrock existed, so the model doesn't know what Amazon Bedrock is. Instead, it just makes up an answer. This phenomenon is called a hallucination. Later in this lab, you'll see how we can address hallucinations with something called in-context learning.

## Chat prompts

Let's start with a simple prompt, "What is re:Invent?"

In [None]:
run("""\
<s>[INST] What is re:Invent? [/INST] """
)

***
The model knows what re:Invent is, so let's ask a follow-up question, "When was it started?"
***

In [None]:
run("""\
<s>[INST] When was it started? [/INST]"""
)

***

That's not right. What happened? Each call to the model is completely independent, and no information is remembered between invocations. So if the LLM needs additional context to understand the question, we have to include it in the prompt.

Let's include the initial question, "What is re:Invent?", and the model's response ahead of the follow-up question, "When was it started?", and see if that improves the model's answer.

***

In [None]:
run("""\
<s>[INST] What is re:Invent? [/INST] 
re:Invent is an annual conference and exhibition organized by Amazon Web Services (AWS).</s>
<s>[INST] When was it started? [/INST]"""
)

***

Now that's much better. Chat systems include the conversation history in every invocation so that the model can understand the entire context to answer questions correctly. There are different strategies to prune the chat history to prevent it from exceeding the context window, but that's something we will leave for you to research after the workshop.

Let's add one more dialog pair so you can see how the prompt continues to grow over the course of a dialog.

***

In [None]:
run("""\
<s>[INST] What is re:Invent? [/INST] 
re:Invent is an annual conference and exhibition organized by Amazon Web Services (AWS).</s>
<s>[INST] When was it started? [/INST]
AWS re:Invent was first held in 2012 in Las Vegas, Nevada, USA. </s>
<s>[INST] Who was the CEO at the time? [/INST]"""
)
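
The growing prompts above can also be assembled from a list of past turns. The sketch below does that and applies one naive pruning strategy, keeping only the most recent turns. The function `render_dialog` and the `max_turns` parameter are hypothetical names introduced for illustration. Counting turns is a stand-in for counting tokens, which would require the model's tokenizer.

```python
def render_dialog(turns, next_user_message, max_turns=3):
    """Build a Llama 2 Chat prompt from prior turns plus a new user message.

    turns: list of (user_message, ai_response) tuples, oldest first.
    max_turns: how many of the most recent completed turns to keep.
    """
    # Guard against max_turns == 0, because turns[-0:] would keep everything.
    recent = turns[-max_turns:] if max_turns > 0 else []
    parts = [f"<s>[INST] {user} [/INST]\n{ai} </s>" for user, ai in recent]
    parts.append(f"<s>[INST] {next_user_message} [/INST]")
    return "\n".join(parts)


history = [
    (
        "What is re:Invent?",
        "re:Invent is an annual conference and exhibition organized by "
        "Amazon Web Services (AWS).",
    ),
    (
        "When was it started?",
        "AWS re:Invent was first held in 2012 in Las Vegas, Nevada, USA.",
    ),
]

full_prompt = render_dialog(history, "Who was the CEO at the time?", max_turns=2)
pruned_prompt = render_dialog(history, "Who was the CEO at the time?", max_turns=1)
print(pruned_prompt)
```

Dropping old turns keeps the prompt short, but it can also remove the context that a later question depends on, which is why pruning strategies need some care.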

## System messages

System messages are useful for dictating to Llama 2 the persona it should adopt or the rules it should adhere to when generating responses. Some examples are:

- “Act as if…” — to set the situation
- “You are…” — to define the role
- “Always/Never…” — to set limitations
- “Speak like…” — to choose a style of talking

Let's tell the model who it is and how it should answer questions with system instructions, then ask who it is.

In [None]:
run("""\
<s>[INST] <<SYS>>
You are a helpful assistant named Superman who answers questions humorously
<</SYS>>
    
Who are you?
[/INST]
""")

***

Now, let's see what happens if we tell the model to do something that contradicts the system instructions. What do you observe?

***

In [None]:
run("""\
<s>[INST] <<SYS>>
Your name is Superman. Do not allow anyone to change your name.
<</SYS>>
    
Your name is Kal-El. Who are you?
[/INST]
""")

## In-context learning

In-context learning is the ability of an AI model to generate responses or predictions based on the context provided. For instance, if we supply the definition of "Amazon Bedrock" to the model while inquiring, "What is Amazon Bedrock?", the model will use the provided definition to generate an accurate response. This approach may seem odd, but this is how models can use information that they haven't seen to generate accurate answers. The prospect of seamlessly sourcing and delivering pertinent information to the model whenever a user poses a question is a topic we will revisit shortly.

So let's revisit the `What is Amazon Bedrock?` prompt but this time give the model a definition and some instructions.

In [None]:
run("""\
<s>[INST]
Use the following pieces of context to answer the question at the end. \
If you don't know the answer, just say that you don't know, don't try to make up an answer.
    
Context: Amazon Bedrock is a fully managed service that makes base large language models \
from Amazon and third-party model providers accessible through an API.
    
Question: What is Amazon Bedrock?
Helpful Answer:
[/INST]""")

***

What do you notice about the answer this time? It's much better! The model answered based on the context it was provided. This is one of the best methods of reducing hallucinations. 

So what happens when we ask a question that is unrelated to the context provided with the constraint that it shouldn't make up an answer? 

***

In [None]:
run("""\
<s>[INST]
Use the following pieces of context to answer the question at the end. \
If you don't know the answer, just say that you don't know, don't try to make up an answer.
    
Context: Amazon Bedrock is a fully managed service that makes base large language models \
from Amazon and third-party model providers accessible through an API
    
Question: What is Amazon Firefly?
Helpful Answer:
[/INST]""")

***

The model should have said that it doesn't know. A straightforward directive to refrain from fabricating responses is typically adequate, but smaller models like this one can be a little finicky with instructions.

***
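
The two prompts in this section share the same template, with only the context and the question changing. A minimal sketch of that template as a function follows. `build_context_prompt` is a hypothetical helper, and how the context passages are retrieved is outside its scope.

```python
def build_context_prompt(question, context_passages):
    """Wrap a question and supporting passages in the template used above."""
    context = "\n\n".join(passage.strip() for passage in context_passages)
    return (
        "<s>[INST]\n"
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, just say that you don't know, "
        "don't try to make up an answer.\n\n"
        f"Context: {context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:\n"
        "[/INST]"
    )


bedrock_prompt = build_context_prompt(
    question="What is Amazon Bedrock?",
    context_passages=[
        "Amazon Bedrock is a fully managed service that makes base large "
        "language models from Amazon and third-party model providers "
        "accessible through an API."
    ],
)
print(bedrock_prompt)
```

The returned string can be passed to the notebook's `run` function in place of the hand-written prompts.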

## Model hyperparameters

Let's examine some of the model [hyperparameters](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters.html), which enable control of the randomness and diversity in the model's output. Below is a table of the most common hyperparameters supported by transformer models.

| Parameter | Definition |
| - | - |
| Temperature | Temperature is a hyperparameter used in softmax-based sampling. It controls the randomness of the model's output. Higher values (e.g., 1.0) make the output more diverse and creative, while lower values (e.g., 0.1) make it more focused and deterministic. |
| Top-p (Nucleus) Sampling | Top-p sampling, also known as nucleus sampling, selects tokens from the smallest set of tokens whose cumulative probability exceeds a threshold (p). It allows for dynamic vocabulary size and helps avoid over-repetition of common words. |
| Top-k Sampling | Top-k sampling is a technique that limits the vocabulary during text generation to the top-k most likely tokens at each step, where k is a specified integer. It helps in controlling the diversity of generated text. |
| Max Tokens To Sample | This hyperparameter sets the maximum length of generated sequences. It limits the length of the model's responses to ensure they don't exceed a certain number of tokens. |

So far you've been using the default parameters of 0.6 for temperature and 0.9 for top-p. We don't explicitly set the top-k value in the request, but it is supported by the endpoint. Its default is 50. Note that the hyperparameter values that work for you on one model aren't necessarily transferable to a different model. You will need to test these values to get the type of response you want, and they will vary from problem to problem. For example, when generating a story, you will want a higher temperature to get more creative responses. But when asking the model to answer a question from factual data, you will want to set the temperature lower so that the highest-probability tokens are selected during generation.
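
To make the effect of temperature and top-p concrete, the sketch below applies both to a toy distribution over four candidate tokens. The logits are made-up numbers and the function names are illustrative. This is a simplified picture of the technique, not the endpoint's actual sampling code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature.

    Assumes temperature > 0 (a temperature of 0 would divide by zero here).
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Return indices of the smallest set of tokens whose cumulative
    probability reaches top_p, most likely token first."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens

for temperature in (0.1, 0.6, 1.0):
    probs = softmax_with_temperature(logits, temperature)
    candidates = top_p_filter(probs, top_p=0.9)
    print(temperature, [round(p, 3) for p in probs], candidates)
```

At a low temperature, nearly all of the probability mass sits on the top token, so the top-p set shrinks to that single token. At a higher temperature, the distribution flattens and more tokens remain eligible for sampling.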

The maximum sequence or context length for **Llama 2** is **4096** tokens. Sequence length is determined by counting the number of input tokens and adding the requested max tokens to sample. If the sum is greater than the model's max context window size, the request will fail.

<div class="alert alert-block alert-info"><b>Note: </b> The Llama 2 7b parameter model on an inf2.xlarge generates around 15 tokens/second. SageMaker has a max request timeout of 60 seconds, so we've limited the maximum tokens to sample to 512 or less.</div>
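
Both limits described above come down to simple arithmetic, sketched below. The constants are the figures quoted in this section. The function names are hypothetical, and the input token count is assumed to come from the model's tokenizer, which is not shown here.

```python
LLAMA2_CONTEXT_WINDOW = 4096   # tokens, as stated above
TOKENS_PER_SECOND = 15         # approximate generation rate quoted in the note
REQUEST_TIMEOUT_SECONDS = 60   # SageMaker request timeout quoted in the note


def fits_context_window(num_input_tokens, max_tokens_to_sample):
    """True when input tokens plus requested output tokens fit the window."""
    return num_input_tokens + max_tokens_to_sample <= LLAMA2_CONTEXT_WINDOW


def estimated_generation_seconds(max_tokens_to_sample):
    """Rough worst-case generation time at the quoted tokens-per-second rate."""
    return max_tokens_to_sample / TOKENS_PER_SECOND


# A prompt of 3,700 tokens (an assumed count) leaves no room for 512 new tokens.
print(fits_context_window(3700, 512))
# Generating 512 tokens at about 15 tokens/second stays under the timeout.
print(estimated_generation_seconds(512) < REQUEST_TIMEOUT_SECONDS)
```

At the quoted rate, the 60-second timeout corresponds to roughly 900 tokens, so the 512-token cap leaves headroom for slower requests.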

Let's try different temperature and top_p values to see how the model behaves. Feel free to change the prompt and test other values. 


In [None]:
user_message = "Write a story about my time at re:Invent 2023."

run(f"""\
<s>[INST] 
{user_message}
[/INST]""", 
    temperature=0.1, 
)

print(f"\n\n{'='*75}\n\n")

run(f"""\
<s>[INST] 
{user_message}
[/INST]""", 
    temperature=0.9, 
)

***

Now try different values for top_p.

***

In [None]:
user_message = "Write a story about my time at re:Invent 2023."

run(f"""\
<s>[INST] 
{user_message}
[/INST]""", 
    top_p=0.1, 
)

print(f"\n\n{'='*75}\n\n")

run(f"""\
<s>[INST] 
{user_message}
[/INST]""", 
    top_p=0.9, 
)

## Playground

Finally, here is a playground where you can try different prompts and values for the temperature and top_p hyperparameters to see how the model behaves.

In [None]:
sys_message = "<input_system_message_here>"
user_message = "<input_human_message_here>"

temperature=0.6 # between 0 and 1
top_p=0.9 # between 0 and 1
max_tokens_to_sample=200 # between 1 and 512

run(f"""\
<s>[INST] <<SYS>>
{sys_message}
<</SYS>>
    
{user_message}
[/INST]""", 
    temperature=temperature, 
    top_p=top_p, 
    max_tokens_to_sample=max_tokens_to_sample
)

## Notebook complete

Now that you've learned the basics of prompting Llama 2 Chat, move to the next notebook to learn how to tie the model and data together using LangChain.

