# Prompt Engineering - Chatbot
## A Builder's Guide

Hello!

In this notebook we will look at a few prompt engineering techniques.  We will experiment by loading a - relatively small - 3 billion parameter Large Language Model (LLM) within the notebook environment itself and throwing some prompts its way to see what we can make it do.

After trying a few different prompts, we will run a simple chatbot using the prompt engineering techniques we explored. 

At the end, I will list some pointers in case you would like to build on this code, by dropping in other LLMs.

### Working Environment 

[![Open In Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/build-on-aws/generative-ai-prompt-engineering/prompt-engineering-chatbot/prompt-engineering-chatbot.ipynb)


This notebook has been designed, written and tested to run for free on [Amazon SageMaker Studio Lab](https://studiolab.sagemaker.aws/) with GPU.  Studio Lab is a free machine learning (ML) development environment that provides compute and storage (up to 15GB) at no cost with NO credit card required.

You can sign up for Amazon SageMaker Studio Lab here: <https://studiolab.sagemaker.aws/>

> Whatever environment you end up using, make sure you have at least 12 GB of disk space available, and access to a GPU to run this code.

### Libraries
First, if needed, install some libraries.  The main libraries are: `torch` - the PyTorch framework, and `transformers` - a library from [Hugging Face](https://huggingface.co/), a great open source set of libraries for working and experimenting with the underlying technology of generative AI.   

In [None]:
%%capture
%pip install torch==1.13.1
%pip install transformers==4.27.4
%pip install accelerate==0.19.0
%pip install scipy==1.10.1
%pip install bitsandbytes==0.39.0
%pip install ipywidgets==7.5.1

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

In [None]:
import os
os.environ['LD_LIBRARY_PATH'] = '/usr/local/cuda/lib64:' + os.getenv('LD_LIBRARY_PATH', '')
os.environ['BITSANDBYTES_NOWELCOME'] = '1'
os.environ['CUDA_RUNTIME_LIB'] = '112'

### Setting up an LLM to work with (not the focus of this notebook)

The Hugging Face Hub has many pre-trained, ready-to-go transformer networks to play with.  Let's set a variable `model_name` so we can reference it throughout the code.

(If you want to drop in another model, see the notes towards the end of this notebook.)

In [None]:
model_name = "google/flan-t5-xl"

The following lines will download assets from Hugging Face.



In [None]:
# We use a tokenizer to convert words into tokens.
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Free memory on the current GPU, in GB. Leave roughly 2 GB of headroom.
free_in_GB = int(torch.cuda.mem_get_info()[0]/1024**3)
max_memory = f'{free_in_GB-2}GB'

n_gpus = torch.cuda.device_count()
max_memory = {i: max_memory for i in range(n_gpus)}

# Now let's load the model itself.
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name, 
    torch_dtype=torch.float16, 
    device_map='auto', 
    load_in_8bit=True,
    max_memory=max_memory
)

 

This next function is a basic wrapper for `model.generate` that we will use to generate tokens given some input tokens. In other words, we can use this function to generate text from a prompt.

> Note: This is a simple and easy implementation, and we won't focus on the details as we want to spend our time experimenting with prompts...

In [None]:
def causal_prediction(prompt, max_length_addition=30):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
    max_length = len(input_ids[0]) + max_length_addition
    outputs = model.generate(
        input_ids, 
        do_sample=True, 
        max_length = max_length,
        pad_token_id=tokenizer.eos_token_id, 
        num_beams=1,
    )
    output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return output

# Prompting

Now that we have an LLM ready to go, let's send it some prompts.  A prompt is a simple text string; there really isn't anything more to it than that. By structuring the text string in certain ways, using clear natural language, and using a few tricks, we can cause the LLM to generate the output we want.  This is prompt engineering.

## A Simple Prompt

First, let's send in a simple text string with no other instruction.  What will the LLM do here?

Large Language Models, in a basic sense, are probabilistic token machines :). In other words, they predict the next word (more precisely, the next token) in a sequence.  Keep this in mind, because it is the case for every example in this notebook.  All the LLM does is look at the sequence of words you pass in via the prompt, and then loop, making next-word predictions.

In [None]:
print(causal_prediction("My name is Mike."))

In the case above, notice how the text generated makes sense. However if you run the cell several times, you will get completely different outputs.  Interesting, but not very useful. 
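
To make the "probabilistic token machine" idea concrete, here is a toy sketch in plain Python. It is not how `flan-t5` works internally: the tiny `NEXT_WORD_PROBS` table and the `toy_generate` function are invented purely for illustration. It shows the loop of next-word predictions, and why sampling (as with `do_sample=True` in our wrapper) can give a different output on each run, while always taking the most likely word cannot.

```python
import random

# Hypothetical next-word probabilities, invented for this illustration.
NEXT_WORD_PROBS = {
    "My": {"name": 0.9, "dog": 0.1},
    "name": {"is": 1.0},
    "dog": {"is": 1.0},
    "is": {"Mike.": 0.5, "Chatbot.": 0.3, "unknown.": 0.2},
}

def toy_generate(prompt, max_new_words=5, sample=True, rng=None):
    """Extend the prompt one word at a time using the toy table."""
    rng = rng or random.Random()
    words = prompt.split()
    for _ in range(max_new_words):
        options = NEXT_WORD_PROBS.get(words[-1]) if words else None
        if not options:
            break  # The toy table has no prediction for this word.
        if sample:
            # Draw the next word at random, weighted by its probability.
            next_word = rng.choices(list(options), weights=list(options.values()), k=1)[0]
        else:
            # Greedy decoding: always take the most likely word.
            next_word = max(options, key=options.get)
        words.append(next_word)
    return " ".join(words)

print(toy_generate("My name", sample=False))  # Same output on every run.
print(toy_generate("My name"))                # May change from run to run.
```

A real LLM does the same kind of loop, but the probabilities come from a neural network that looks at the whole sequence so far, not just the last word.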

## Summarization Task - Zero Shot

Base LLMs, or Foundation Models are trained on language.  Good models have a good understanding of language, but may not work so well to 'understand' the 'intent' of what we want them to do.  Some foundation models have been further trained or 'instruction tuned' to understand instructions.  This can be as simple as understanding that they may be asked to complete some task, and expected to give a response (such as being a general chatbot), but may also include specific natural language tasks such as translation, summarization or entity extraction.

The model we are using, `flan-t5` has been instruction tuned.  So we should be able to ask, in natural language, for it to complete a task such as summarization.  

The following prompt includes the 'instruction' to `Summarize my text` and then contains the text we want the model to summarize.  By adding our own data to the context (via the prompt) like this, we are using what is called 'in-context learning'.  The text we want the model to summarize was (obviously) not part of the dataset it was originally trained on, but the model can still access this data as it is now part of the prompt.

In [None]:
print(causal_prediction("""Summarize my text:
Today I went to the beach and saw a whale swimming in the sea. I had a great day and ate ice cream.  
"""))

This is a very simple example, but the flan-T5 model does a reasonable job of summarizing the text.  This prompt is an example of zero-shot learning, or zero-shot inference.  The reason for this name will become clearer in a moment.

If the model we're using was smaller, with even fewer parameters, we would probably find that the model would do a poor job of summarizing the text.  In order to keep this notebook simple, we will stick with just one loaded model, but let's see how we could support a smaller model to make a better summarization. 

One-shot, or few-shot, learning is the method of providing 'examples' of the generation we want the LLM to produce.  Providing these examples gives the LLM a pattern to follow.  Remember the LLM is just in the business of 'next word prediction', so the more help we can provide the better. This can also help larger, more capable models to 'understand' our intentions.  For example, we can imply the length of the summary we want through the examples we provide.


### One-Shot Example

In this one-shot example, we provide one example of the kind of summary we want to get, then we include the text we want to be summarized.  Notice how the example is provided in exactly the same format.

In [None]:
print(causal_prediction("""Summarize my text:
I have two dogs and a cat. They are my best friends.
Summary: My pets are my best friends.

Summarize my text:
Today I went to the beach and saw a whale swimming in the sea. I had a great day and ate ice cream.  
Summary:"""))

### Few-Shot Example

In this few-shot example, we provide two examples of the kind of summary we want to get.

In [None]:
print(causal_prediction("""Summarize my text:
I have two dogs and a cat. They are my best friends.
Summary: My pets are my best friends.

Summarize my text:
I visited Paris and saw the Eiffel Tower. The city's rich history made it a truly unforgettable experience.
Summary: My visit to Paris was unforgettable

Summarize my text:
Today I went to the beach and saw a whale swimming in the sea. I had a great day and ate ice cream.  
Summary:"""))

Now we have seen examples of 'zero-shot' learning, where we provide no in-context examples, and of 'one-shot' and 'few-shot' learning, where we provide one or more examples to the LLM.
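
Because the one-shot and few-shot prompts follow a fixed pattern, they can be assembled in code. The following is a minimal sketch, assuming the same `Summarize my text:` / `Summary:` layout used in the cells above; the helper name `build_few_shot_prompt` is hypothetical and not part of any library.

```python
def build_few_shot_prompt(examples, text, instruction="Summarize my text:"):
    """Build a prompt from (text, summary) example pairs plus the new text.

    An empty `examples` list gives a zero-shot prompt, one pair gives a
    one-shot prompt, and several pairs give a few-shot prompt.
    """
    parts = []
    for example_text, example_summary in examples:
        parts.append(f"{instruction}\n{example_text}\nSummary: {example_summary}\n")
    # End with an open "Summary:" so the model's next words are the summary.
    parts.append(f"{instruction}\n{text}\nSummary:")
    return "\n".join(parts)

examples = [
    ("I have two dogs and a cat. They are my best friends.",
     "My pets are my best friends."),
]
prompt = build_few_shot_prompt(
    examples,
    "Today I went to the beach and saw a whale swimming in the sea. "
    "I had a great day and ate ice cream.",
)
print(prompt)
```

The resulting string could then be passed to `causal_prediction`. Note that, unlike the zero-shot cell earlier, this helper always ends the prompt with a `Summary:` cue, even when no examples are given.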

How many 'shots' can we include?  We are limited by the context size of the model.  The context of the model is the memory space that we have to store the model's output or 'completion'.  The completion includes the prompt, plus the generated output. So how many examples we can use depends on the size of the examples, and the space left in the context.
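
As a rough sketch of that trade-off, the function below keeps as many examples as fit within a token budget. Everything here is an assumption for illustration: `select_examples` is a hypothetical helper, `max_prompt_tokens` is a number you would choose yourself (leaving room for the generated output), and the default counter simply counts whitespace-separated words rather than real tokens.

```python
def count_tokens_roughly(text):
    # Assumption: whitespace-separated words stand in for real tokens.
    return len(text.split())

def select_examples(examples, query, max_prompt_tokens, count_tokens=count_tokens_roughly):
    """Keep as many examples, in order, as fit in the budget alongside the query."""
    budget = max_prompt_tokens - count_tokens(query)
    kept = []
    for example in examples:
        cost = count_tokens(example)
        if cost > budget:
            break  # This example would overflow the budget, so stop here.
        kept.append(example)
        budget -= cost
    return kept

shots = ["one two three", "four five", "six seven eight nine"]
print(select_examples(shots, "my query", max_prompt_tokens=8))
```

For a more realistic count you could pass the real tokenizer instead, for example `count_tokens=lambda t: len(tokenizer.encode(t))`.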

## Chatbot

Let's make a simple chatbot.  There is no special library to include and no setting to apply to the LLM, all we need is prompt engineering!

Let's use what we know about prompts and in-context learning to create a simple chatbot.

First let's use zero-shot.  This is unlikely to yield good results, but `flan-t5-xl` is big enough that it will probably get the general idea.

In [None]:
print(causal_prediction("""
Mike:My name is Mike. What is your name?
Chatbot:"""))

### Few-Shot

Now let's include some few-shot in context learning. With our in context learning we can help the LLM understand what kind of chatbot we want it to be, and start the cadence of the conversation. 

In [None]:
print(causal_prediction("""
This is a friendly, safe chatbot session between a user and a computer called Chatbot. 
Mike:My name is Mike.
Chatbot:Hello Mike, I am Chatbot and I offer wise advice on technical issues.
Mike:I have a problem with my computer, it's a Mac laptop.
Chatbot:What is wrong with your computer?
Mike:My computer keeps crashing."""))

Each time you run the code you will get a different output.  You can also experiment with the prompt to see what difference that makes.

But this chat experience is not very fluid. If we want to maintain a conversation with the chatbot we need to maintain a chat history.  We need a simple application layer that can keep a history of everything that has been said in the chat so far.  When we add our new input (chat line), it needs to be combined with the history, and passed to the model to make a few next word predictions.  The model does not maintain state, the state is maintained by the chat history in our simple app. 
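
The idea of keeping state in the application rather than in the model can be sketched without loading an LLM at all. In the minimal example below, `ChatSession` and `fake_model` are hypothetical names made up for this illustration: `fake_model` is a stand-in for the LLM that just reports how many lines of prompt it received, which makes it easy to see the history growing with each turn.

```python
class ChatSession:
    """Keeps the chat history and rebuilds the full prompt on every turn."""

    def __init__(self, generate_fn, preamble=None):
        self.generate_fn = generate_fn       # Any function: prompt text -> reply text.
        self.history = list(preamble or [])  # Copy so the caller's list is not modified.

    def send(self, user_text):
        self.history.append(f"User: {user_text}")
        # The model is stateless, so it is sent the entire history each time.
        prompt = "\n".join(self.history + ["Chatbot:"])
        reply = self.generate_fn(prompt).strip()
        self.history.append(f"Chatbot: {reply}")
        return reply

def fake_model(prompt):
    # Stand-in for the LLM: reports how much history it was shown.
    return f"I can see {len(prompt.splitlines())} lines of history."

session = ChatSession(fake_model, ["This is a friendly chat session."])
print(session.send("Hello"))
print(session.send("Do you remember me?"))
```

Swapping `fake_model` for a function that calls a real model leaves the history handling unchanged.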

The next cell does some memory cleanup while we get ready to load the model again within our simple app.  Also follow the instructions to restart the notebook kernel so we have the resources we need to run the chatbot.

In [None]:
del model
del tokenizer 

### Restart your Kernel...

If you're running this notebook within JupyterLab, or the free Amazon SageMaker Studio Lab environment, please restart your kernel to make sure you have the resources to run this code.

From the Jupyter menu, select:
`Kernel` -> `Restart Kernel...`

## Super Simple Simple Chatbot (S3C?)

Let's use what we know about prompt engineering to create a simple chatbot.  All the code, including the LLM is running within this notebook environment, so it will be simple, cheap to run, and a good playground for working in. 

Don't expect great things from this smallish model/chatbot, but have fun! 

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

In [None]:
import os
os.environ['LD_LIBRARY_PATH'] = '/usr/local/cuda/lib64:' + os.getenv('LD_LIBRARY_PATH', '')
os.environ['BITSANDBYTES_NOWELCOME'] = '1'
os.environ['CUDA_RUNTIME_LIB'] = '112'

In [None]:
model_name = "google/flan-t5-xl"

This first class wraps the calls to the LLM, extracts just the parts of the completion we want, and maintains state in a conversation history.  

In [None]:
class ChatBot:
    def __init__(self, model_name, init_prompt=None):
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.tokenizer = AutoTokenizer.from_pretrained(model_name, truncation=True, max_length=512)
        # Free memory on the current GPU, in GB. Leave roughly 2 GB of headroom.
        free_in_GB = int(torch.cuda.mem_get_info()[0]/1024**3)
        max_memory = f'{free_in_GB-2}GB'
        n_gpus = torch.cuda.device_count()
        max_memory = {i: max_memory for i in range(n_gpus)}
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map='auto', load_in_8bit=True, max_memory=max_memory)
        # Copy the starting prompt so each ChatBot keeps its own history
        # (a mutable default argument would be shared between instances).
        self.conversation_history = list(init_prompt or [])

    def generate_response(self, input_text):
        # Add the new user line, then send the whole history as the prompt.
        self.conversation_history.append(f"User: {input_text}")
        input_with_history = "\n".join(self.conversation_history)
        input_tokens = self.tokenizer.encode(input_with_history, return_tensors='pt').to(self.device)
        generated_tokens = self.model.generate(input_tokens, max_length=500, num_return_sequences=1)
        response = self.tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
        # Store the reply so it becomes part of the next prompt.
        self.conversation_history.append(response)

        return response

This next cell contains our starting prompt.  Feel free to experiment with this and see what impact your experiments have on the output of the chatbot.  Note that the prompt is a Python list, and our code will use this as the start of the chat history. 

In [None]:
initial_prompt = [
    "This is a friendly and safe chat session between a user and a computer called Chatbot.",
    "Start of chat:",
    "User: I am a real person with a question to ask. Who are you?",
    "Chatbot: I am a chatbot, and I am here to help.",
    "User: Where do you live?",
    "Chatbot: I am hosted on AWS.",
    "User: What is a car?",
    "Chatbot: A car is a four-wheeled road vehicle that is powered by an engine and is able to carry a small number of people.",
    "User: Do you have a dog?",
    "Chatbot: As a chatbot, I am incapable of owning a dog.",
    "User: What color is the Tardis?",
    "Chatbot: The Tardis is Dr Who's time travel machine, and is blue.", 
    "User: What is the Millennium Falcon?",
    "Chatbot: The Millennium Falcon is a spaceship from Star Wars.  It has had various captains over the years."
]

Let's create the chatbot, which also loads the model.



In [None]:
chatbot = ChatBot(model_name, initial_prompt)

 

Finally, so that we can interact with the chatbot, we use `ipywidgets` to render a simple chatbot interface right here in the notebook.  When we run this code, we should see a simple text box and `Send` button.

As a start, ask the chatbot: `What is the capital of Australia?`

And our chatbot has 'memory' (in the chat history).  Try something like this: 

```
User: What is the capital of Australia?
Chatbot: ...
User: Which river runs through there?
Chatbot: ...
```

These are just examples, try anything you like!


In [None]:
import ipywidgets as widgets
from IPython.display import display, clear_output

def on_send_button_click(button):
    input_text = input_field.value
    response = chatbot.generate_response(input_text)

    with output:
        print("User:", input_text)
        print(response.strip())

    input_field.value = ""

def on_input_field_submit(text):
    on_send_button_click(None)

# Create the input field and send button
input_field = widgets.Text(placeholder='Type your message here...')
send_button = widgets.Button(description='Send')
output = widgets.Output()

# Assign the function to the button click event and the input field submit event
send_button.on_click(on_send_button_click)
input_field.on_submit(on_input_field_submit)

# Display the chat interface
display(output, input_field, send_button)

## Want to use a different model with this notebook?

Hugging Face have many models that you can drop in to code like this, but you may need to make modifications depending on the model you choose.

You should be able to drop other T5 based models in by just changing the `model_name` variable.  

However, decoder-only transformer models, such as BLOOM and the GPT family, use a different model class from the same library.  Look to replace `AutoModelForSeq2SeqLM` with `AutoModelForCausalLM`, both in the `import` statements and in the code that loads the models using `.from_pretrained`.

For example:

- from: `from transformers import AutoTokenizer, AutoModelForSeq2SeqLM`
- to: `from transformers import AutoTokenizer, AutoModelForCausalLM`

and 

- from: `model = AutoModelForSeq2SeqLM.from_pretrained(model_name, ...`
- to: `model = AutoModelForCausalLM.from_pretrained(model_name, ...`
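
One way to keep that switch in a single place is sketched below. The helper names (`auto_class_name`, `load_model`, `strip_prompt`) are hypothetical, and you need to know for yourself whether the model you pick is an encoder-decoder (like T5) or decoder-only. The `strip_prompt` helper is included because decoder-only models typically return the prompt followed by the new text, so you may want to remove the prompt before displaying the reply.

```python
def auto_class_name(is_encoder_decoder):
    """Name of the transformers class to use for this kind of model."""
    return "AutoModelForSeq2SeqLM" if is_encoder_decoder else "AutoModelForCausalLM"

def load_model(model_name, is_encoder_decoder, **kwargs):
    """Load a model with the matching class. Extra kwargs go to from_pretrained."""
    import transformers  # Imported here so the helpers above work without it.
    auto_class = getattr(transformers, auto_class_name(is_encoder_decoder))
    return auto_class.from_pretrained(model_name, **kwargs)

def strip_prompt(completion, prompt):
    """Remove the echoed prompt from the start of a completion, if present."""
    if completion.startswith(prompt):
        return completion[len(prompt):]
    return completion  # The decoded text did not start with the prompt.

print(auto_class_name(is_encoder_decoder=True))
print(auto_class_name(is_encoder_decoder=False))
print(strip_prompt("User: Hi\nChatbot: Hello!", "User: Hi\nChatbot:"))
```

For example, `load_model("google/flan-t5-xl", is_encoder_decoder=True, device_map='auto')` would load the model used in this notebook.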

Keep in mind when you load other models that some LLMs are huge, and you can very quickly run out of capacity on the system you are running on.

Watch out for disk space.  On Linux-based systems Hugging Face will store a cached version of the model (and the tokenizer, but typically this is much smaller) in `~/.cache/huggingface`.  When you change model, you might want to delete this cache to make room for the new download.  You can use `rm -rf ~/.cache/huggingface` as needed.

Also, when the model loads, it will load into the GPU of the system.  If the model is too big for the GPU you will typically see a kernel crash, and things stop working.
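
Before downloading another model, it can help to check how much disk space is free and how much the cache is already using. This is a minimal sketch using only the Python standard library; it assumes the cache is at the path mentioned above (`~/.cache/huggingface`), and the helper name `directory_size_bytes` is made up for this example.

```python
import os
import shutil

def directory_size_bytes(path):
    """Total size of the regular files under `path` (0 if it does not exist)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so that linked files are not counted twice.
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total

# Assumption: the cache lives in the location mentioned in the text above.
cache_dir = os.path.expanduser("~/.cache/huggingface")
free_gb = shutil.disk_usage(os.path.expanduser("~")).free / 1024**3
cache_gb = directory_size_bytes(cache_dir) / 1024**3

print(f"Free disk space: {free_gb:.1f} GB")
print(f"Hugging Face cache: {cache_gb:.1f} GB")
```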

### Thanks for working through this notebook!

My name is Mike Chambers, and I am an AI/ML Specialist for Amazon Web Services.

You can connect with me on [LinkedIn](https://linkedin.com/in/mikegchambers).

```
License Info

MIT No Attribution

Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
```