<div style="background-color: #FFDDDD; border-left: 5px solid red; padding: 10px; color: black;">
    <strong>Kernel: Python 3 (ipykernel)</strong>
</div>

## Lab 1: Set up an LLM Playground on SageMaker Studio

---

__Large Language Model (LLM) with `Llama2`, `LangChain`, and `Streamlit`.__

In this lab, we learn how to use SageMaker to download, provision, and send prompts to a Large Language Model, `Llama 2`. We create an agent using `LangChain`, and tie everything together by creating a UI and text input using `Streamlit` to make our own hosted chatbot interface.

## Contents

- [Model License information](#Model-License-information)
- [Connect to a Hosted Llama2 Model](#Connect-to-a-Hosted-Llama2-Model)
  - [Set up](#Set-up)
- [Sending prompts](#Sending-prompts)
  - [Supported Parameters](#Supported-Parameters)
  - [Notes](#Notes)
- [Building an agent with LangChain](#Building-an-agent-with-LangChain)
- [LangChain Tools](#LangChain-Tools)
- [Developing and deploying the UI with Streamlit](#Developing-and-deploying-the-UI-with-Streamlit)
- [Tearing down resources](#Tearing-down-resources)

### Model License information
---

To perform inference on these models, you need to pass `custom_attributes='accept_eula=true'` as part of the header. Doing so confirms that you have read and accepted the end-user license agreement (EULA) of the model. The EULA can be found in the model card description or at https://ai.meta.com/resources/models-and-libraries/llama-downloads/. By default, this notebook sets `custom_attribute='accept_eula=false'`, so all inference requests will fail until you explicitly change this custom attribute.

Note: Custom_attributes used to pass EULA are key/value pairs. The key and value are separated by `'='` and pairs are separated by `';'`. If the user passes the same key more than once, the last value is kept and passed to the script handler (i.e., in this case, used for conditional logic). For example, if `'accept_eula=false; accept_eula=true'` is passed to the server, then `'accept_eula=true'` is kept and passed to the script handler.
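
The last-value-wins rule can be illustrated with a short sketch. The helper `parse_custom_attributes` is hypothetical and only mimics the behavior described above; it is not the code that runs on the endpoint.

```python
def parse_custom_attributes(raw: str) -> dict:
    """Parse 'k1=v1; k2=v2' into a dict; a repeated key keeps its last value.

    Hypothetical helper for illustration only, not the server-side handler.
    """
    attributes = {}
    for pair in raw.split(";"):
        key, sep, value = pair.strip().partition("=")
        if sep:  # skip empty or malformed fragments that have no '='
            attributes[key.strip()] = value.strip()
    return attributes


print(parse_custom_attributes("accept_eula=false; accept_eula=true"))
```

Because later assignments overwrite earlier ones in the dictionary, the example string resolves to `accept_eula` set to `true`.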

---

### Connect to a Hosted Llama2 Model
---

#### Set up

We begin by installing and upgrading necessary packages.

In [None]:
%pip install "langchain==0.2.16" "streamlit==1.38.0" wikipedia "numexpr==2.8.7" faiss-cpu opensearch-py==2.3.2 -q

In [3]:
import os
import boto3
import sagemaker
from sagemaker import serializers, deserializers
from ipywidgets import Dropdown
from IPython.display import display, Markdown
from typing import Dict, List
from datetime import datetime

import sys
sys.path.append(os.path.dirname(os.getcwd()))
from utils.helpers import pretty_print_html 

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


In [4]:
eula_dropdown = Dropdown(
    options=["True", "False"],
    value="False",
    description="**Please accept Llama2 EULA to continue:**",
    style={"description_width": "initial"},
    layout={"width": "max-content"},
)
display(eula_dropdown)

Dropdown(description='**Please accept Llama2 EULA to continue:**', index=1, layout=Layout(width='max-content')…

In [5]:
custom_attribute = f'accept_eula={eula_dropdown.value.lower()}'
pretty_print_html(f"Your Llama2 EULA attribute is set to: {custom_attribute}")

#### Connect to a Hosted Llama Model

In [6]:
endpoint_name = "meta-llama31-8b-instruct-tg-ep" 
boto_region = boto3.Session().region_name

pretty_print_html(f"Using Llama Endpoint: {endpoint_name} in region: {boto_region}")

In [7]:
sess = sagemaker.session.Session(
    boto_session=boto3.Session(region_name=boto_region)
)
smr_client = boto3.client(
    "sagemaker-runtime", 
    region_name=boto_region
)

pretrained_predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
)
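
The cell above also creates `smr_client`, a low-level `sagemaker-runtime` client. As a rough sketch of how such a client could be used instead of the `Predictor`, the helper below wraps `invoke_endpoint`. The function name `invoke_with_runtime_client` is an assumption made for illustration, and the payload shape (`inputs` plus `parameters`) is the one used later in this notebook.

```python
import json


def invoke_with_runtime_client(client, endpoint_name, payload, custom_attributes):
    """Send a JSON payload through a `sagemaker-runtime` client and decode the reply.

    Illustrative helper (not part of the lab code); `client` is expected to be
    something like boto3.client("sagemaker-runtime").
    """
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(payload),
        CustomAttributes=custom_attributes,
    )
    # The response body is a stream, so it has to be read and decoded first.
    return json.loads(response["Body"].read().decode("utf-8"))
```

With the objects defined in this notebook, a call might look like `invoke_with_runtime_client(smr_client, endpoint_name, payload, custom_attribute)`. The `Predictor` used in the rest of the lab does the same serialization for us.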

### Sending prompts
---

Next, we invoke the endpoint hosting our Llama 2 LLM with some queries. To get the best results, however, it is important to be aware of the adjustable parameters of this model.

#### Supported Parameters
This model supports the following inference payload parameters:

* **max_new_tokens:** Model generates text until the output length (excluding the input context length) reaches max_new_tokens. If specified, it must be a positive integer.
* **temperature:** Controls the randomness in the output. Higher temperature results in output sequence with low-probability words and lower temperature results in output sequence with high-probability words. If `temperature` -> 0, it results in greedy decoding. If specified, it must be a positive float.
* **top_p:** In each step of text generation, sample from the smallest possible set of words with cumulative probability `top_p`. If specified, it must be a float between 0 and 1.

We'll begin with `max_new_tokens=512`, `top_p=0.9`, and `temperature=0.6`, though feel free to alter these as we go to see how they affect the LLM output.
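
To make the effect of `temperature` and `top_p` concrete, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. The function names and the toy scores are illustrative assumptions; the hosted model's actual sampling code is not shown in this notebook and may differ in detail.

```python
import math


def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(score - peak) for score in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]


def top_p_candidates(probs, top_p):
    """Return indices of the smallest set of tokens whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    return kept


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
for temperature in (0.2, 0.6, 1.5):
    probs = apply_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs], top_p_candidates(probs, top_p=0.9))
```

At low temperature nearly all probability mass sits on the top-scoring token, so the candidate set shrinks toward greedy decoding; at higher temperature more tokens survive the `top_p` cut.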

You may specify any subset of the parameters mentioned above while invoking an endpoint. 
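
A small client-side check can catch invalid values before a request is sent. The sketch below encodes the constraints listed above; `validate_params` is a hypothetical helper, and treating the `top_p` bounds as exclusive is an assumption, since the text only says "between 0 and 1".

```python
def validate_params(params: dict) -> dict:
    """Check any subset of max_new_tokens, temperature, and top_p; return params unchanged."""
    if "max_new_tokens" in params:
        value = params["max_new_tokens"]
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            raise ValueError("max_new_tokens must be a positive integer")
    if "temperature" in params and not params["temperature"] > 0:
        raise ValueError("temperature must be a positive float")
    if "top_p" in params and not 0 < params["top_p"] < 1:
        # assumption: both bounds are treated as exclusive
        raise ValueError("top_p must be a float between 0 and 1")
    return params


print(validate_params({"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6}))
print(validate_params({"top_p": 0.4}))  # any subset is accepted
```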

#### Notes
- If `max_new_tokens` is not defined, the model may generate up to the maximum total tokens allowed, which is 4K for these models. This may result in endpoint query timeout errors, so it is recommended to set `max_new_tokens` when possible. For 7B, 13B, and 70B models, we recommend setting `max_new_tokens` no greater than 1500, 1000, and 500 respectively, while keeping the total number of tokens less than 4K.
- In order to support a 4k context length, this model has restricted query payloads to only utilize a batch size of 1. Payloads with larger batch sizes will receive an endpoint error prior to inference.
- This model only supports 'system', 'user' and 'assistant' roles, starting with 'system', then 'user' and alternating (u/a/u/a/u...).
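
The role-ordering rule in the last note can be checked before sending a dialog. The function below is a hypothetical helper written for this explanation; it only verifies the order described above (a leading `system` message followed by alternating `user` and `assistant` turns).

```python
def check_role_order(messages):
    """Return True if roles follow system, user, assistant, user, ... order."""
    if not messages or messages[0]["role"] != "system":
        return False
    expected = "user"
    for message in messages[1:]:
        if message["role"] != expected:
            return False
        # flip between the two conversational roles
        expected = "assistant" if expected == "user" else "user"
    return True


dialog = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi!"},
    {"role": "user", "content": "How are you?"},
]
print(check_role_order(dialog))
```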

In [8]:
def set_meta_llama_params(
    max_new_tokens=512,
    top_p=0.9,
    temperature=0.6,
):
    """ set Llama parameters """
    llama_params = {}
    llama_params['max_new_tokens'] = max_new_tokens
    llama_params['top_p'] = top_p
    llama_params['temperature'] = temperature
    return llama_params

In [25]:
def print_dialog(inputs, payload, response):
    dialog_output = []
    for msg in inputs:
        dialog_output.append(f"**{msg['role'].upper()}**: {msg['content']}\n")
    dialog_output.append(f"**ASSISTANT**: {response['generated_text']}")
    dialog_output.append("\n---\n")
    
    display(Markdown('\n'.join(dialog_output)))

In [38]:
def format_messages(messages: List[Dict[str, str]]) -> str:
    """
    Format messages for Llama 3+ chat models.
    
    The model only supports 'system', 'user' and 'assistant' roles, starting with 'system', then 'user' and 
    alternating (u/a/u/a/u...). The last message must be from 'user'.
    """
    # auto assistant suffix
    # messages.append({"role": "assistant"})
    
    output = "<|begin_of_text|>"
    # Adding the inferred prefix
    _system_prefix = f"\n\nCutting Knowledge Date: December 2023\nToday Date: {datetime.now().strftime('%d %b %Y')}\n\n"
    for i, entry in enumerate(messages):
        output += f"<|start_header_id|>{entry['role']}<|end_header_id|>"
        if i == 0:
            output += f"{_system_prefix}{entry['content']}<|eot_id|>"
        elif i >= 1 and 'content' in entry:
            output += f"\n\n{entry['content']}<|eot_id|>"
    output += "<|start_header_id|>assistant<|end_header_id|>\n"
    return output


def send_prompt(params, prompt, instruction=""):

    pre_instruction = "You are a helpful ai assistant. Keep your answers short and less than 5 sentences and only talkative when required!"
    if not instruction:
        instruction = pre_instruction
    
    # default 'system', 'user' and 'assistant' prompt format
    base_input = [
        {"role": "system", "content": instruction},
        {"role": "user", "content": prompt},
    ]
    # convert s/u/a format 
    optz_input = format_messages(base_input)
    payload = {
        "inputs": optz_input,
        "parameters": params
    }
    response = pretrained_predictor.predict(payload, custom_attributes=custom_attribute)
    print_dialog(base_input, payload, response)
    return payload, response

With functions defined for the printing of the dialog and the prompt sending, let's begin sending queries to our Llama 2 LLM!

Note that we can also adjust the parameters supplied to the model, which we do here by altering the `top_p` parameter.

In [39]:
%%time
params = set_meta_llama_params(top_p=0.4)
payload, response = send_prompt(params, prompt="What is the recipe of a pumpkin pie?")

**SYSTEM**: You are a helpful ai assistant. Keep your answers short and less than 5 sentences and only talkative when required!

**USER**: What is the recipe of a pumpkin pie?

**ASSISTANT**: To make a classic pumpkin pie, you'll need:

- 1 cup of pumpkin puree
- 1 cup of heavy cream
- 1/2 cup of sugar
- 1/2 teaspoon of salt
- 1/2 teaspoon of ground cinnamon
- 1/4 teaspoon of ground nutmeg
- 1/4 teaspoon of ground ginger
- 2 large eggs
- 1 pie crust (homemade or store-bought)

Mix the pumpkin puree, heavy cream, sugar, salt, cinnamon, nutmeg, and ginger in a bowl. Add the eggs and mix well. Pour the mixture into the pie crust and bake at 425°F (220°C) for 15 minutes, then reduce heat to 350°F (180°C) and bake for an additional 30-40 minutes.

---


CPU times: user 5.16 ms, sys: 690 μs, total: 5.85 ms
Wall time: 6.05 s


In [40]:
%%time
params = set_meta_llama_params(top_p=0.6)
payload, response = send_prompt(params, prompt="How do I learn to play the guitar?", instruction="always answer with Haiku")

**SYSTEM**: always answer with Haiku

**USER**: How do I learn to play the guitar?

**ASSISTANT**: Fingers on the frets
Practice every single day
Music in my soul

---


CPU times: user 4.67 ms, sys: 0 ns, total: 4.67 ms
Wall time: 609 ms


In [41]:
%%time
params = set_meta_llama_params(top_p=0.8)
payload, response = send_prompt(params, prompt="What's a good strategy for chess?", instruction="always answer with emojis")

**SYSTEM**: always answer with emojis

**USER**: What's a good strategy for chess?

**ASSISTANT**: 🤔

1. **Control the Center**: The center of the board is the most important area in chess. Try to control as many central squares as possible with your pawns and pieces. This gives you more mobility and makes it harder for your opponent to maneuver. 📍

2. **Develop Your Pieces**: Move your pieces out from their starting positions and develop them towards the center of the board. This helps you control more squares and attack your opponent's position. 🚀

3. **Protect Your King**: Keep your king safe by castling (moving your king to the side of the board and placing your rook in front) and placing pieces in front to block potential attacks. 🏰

4. **Pawn Structure**: Manage your pawns carefully, as they can either support or weaken your position. Try to create pawn chains (rows of pawns) to block your opponent's pieces and create barriers. 🚧

5. **Analyze Your Opponent's Moves**: Observe your opponent's strategy and adjust your plan accordingly. Look for weaknesses in their position and try to exploit them. 🔍

6. **Be Patient**: Chess is a game of strategy, not a sprint. Take your time to think and plan your moves carefully. Avoid making impulsive decisions that might put you at risk. ⏰

7. **Endgame Play**: In the endgame, focus on promoting your pawns to queens and rooks, as they are the most powerful pieces. Try to create a passed pawn (a pawn that has no opposing pawn on the same file) to promote it to a queen. 🏆

---


CPU times: user 5.17 ms, sys: 0 ns, total: 5.17 ms
Wall time: 11.9 s


In [42]:
%%time
params = set_meta_llama_params(top_p=0.6)
tokyo_payload, tokyo_response = send_prompt(params, prompt="What are the top 5 things to do in Tokyo?")

**SYSTEM**: You are a helpful ai assistant. Keep your answers short and less than 5 sentences and only talkative when required!

**USER**: What are the top 5 things to do in Tokyo?

**ASSISTANT**: Tokyo offers a wide range of activities. Here are the top 5 things to do in Tokyo:

1. **Visit the Tokyo Skytree**: At 634 meters tall, it's the tallest tower in the world, offering breathtaking views of the city.
2. **Explore the Shibuya Crossing**: This famous intersection is known for its busiest and most colorful street scene in the world.
3. **Visit the Meiji Shrine**: Dedicated to the deified spirits of Emperor Meiji and his wife, Empress Shoken, this shrine is a serene oasis in the midst of the bustling city.
4. **Walk through the Asakusa district**: This historic district is home to the famous Senso-ji Temple, a colorful and lively area filled with traditional shops and food stalls.
5. **Visit the Tsukiji Outer Market**: While the inner market has moved to a new location, the outer market still offers a fascinating glimpse into Tokyo's seafood culture and fresh sushi.

---


CPU times: user 6.27 ms, sys: 0 ns, total: 6.27 ms
Wall time: 6.96 s


Because we are interacting with the Llama 2 **chat** LLM, we can follow up on a previous prompt with a further question in a conversational manner.

Also, because we are capturing the payload and response for each inference to our endpoint, we can feed this back into our LLM as part of our next prompt, in order to continue the conversation. In the following output we can see the requests and responses from the user and the assistant:

In [45]:
%%time
base_input = [
    {
        "role": "system", 
        "content": "You are a helpful ai assistant. Keep your answers short and less than 5 sentences and only talkative when required!"
    },
    {
        "role": "user", 
        "content": "What are the top 5 things to do in Tokyo?"},
    {
        "role": "assistant",
        "content": tokyo_response['generated_text'],
    },
    {
        "role": "user", 
        "content": "What is so great about #1?"  # <<---- Your follow up question here!
    },
]
optz_input = format_messages(base_input)

payload = {
    "inputs": optz_input,
    "parameters": {
        "max_new_tokens": 512, 
        "top_p": 0.9, 
        "temperature": 0.6
    }
}
response = pretrained_predictor.predict(payload, custom_attributes=custom_attribute)
print_dialog(base_input, payload, response)

**SYSTEM**: You are a helpful ai assistant. Keep your answers short and less than 5 sentences and only talkative when required!

**USER**: What are the top 5 things to do in Tokyo?

**ASSISTANT**: Tokyo offers a wide range of activities. Here are the top 5 things to do in Tokyo:

1. **Visit the Tokyo Skytree**: At 634 meters tall, it's the tallest tower in the world, offering breathtaking views of the city.
2. **Explore the Shibuya Crossing**: This famous intersection is known for its busiest and most colorful street scene in the world.
3. **Visit the Meiji Shrine**: Dedicated to the deified spirits of Emperor Meiji and his wife, Empress Shoken, this shrine is a serene oasis in the midst of the bustling city.
4. **Walk through the Asakusa district**: This historic district is home to the famous Senso-ji Temple, a colorful and lively area filled with traditional shops and food stalls.
5. **Visit the Tsukiji Outer Market**: While the inner market has moved to a new location, the outer market still offers a fascinating glimpse into Tokyo's seafood culture and fresh sushi.

**USER**: What is so great about #1?

**ASSISTANT**: **Tokyo Skytree** is great because it offers:

- **Panoramic views**: From its observation decks, you can see the entire city, including famous landmarks like the Tokyo Tower and Mount Fuji on a clear day.
- **Interactive exhibits**: The tower has interactive exhibits and a museum that showcase Tokyo's history, culture, and technology.
- **Shopping and dining**: The complex surrounding the tower has a variety of shops, restaurants, and cafes to explore.

---


CPU times: user 16.4 ms, sys: 594 μs, total: 17 ms
Wall time: 3.5 s
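
The manual bookkeeping in the cell above can be wrapped in a small helper. This is a sketch only: `extend_dialog` is a hypothetical name, and it simply appends the last assistant reply and the next user question to a copy of the message list.

```python
def extend_dialog(history, assistant_reply, follow_up):
    """Return a new message list with the last reply and the next question appended."""
    return history + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": follow_up},
    ]


first_turn = [
    {"role": "system", "content": "You are a helpful ai assistant."},
    {"role": "user", "content": "What are the top 5 things to do in Tokyo?"},
]
# "placeholder reply" stands in for a real model response such as tokyo_response['generated_text']
second_turn = extend_dialog(first_turn, "placeholder reply", "What is so great about #1?")
print([message["role"] for message in second_turn])
```

The result can be passed to `format_messages` in the same way as the hand-written `base_input` list.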


---
### Building an agent with LangChain
---

We now have an LLM that can continue conversations in a chat interface! However, there is a more effective option than manually capturing the request and response for each inference request and feeding this back into the model.

[LangChain](https://www.langchain.com/) is a framework that helps us simplify this process. We can use LangChain to send prompts to our LLM, store chat history, and feed this back into the model in order to have a conversation.

LangChain also allows us to define a content handler to transform the inputs and outputs to the LLM, which we will do in the next cell.

In [46]:
import json
from typing import Dict
from langchain.llms import SagemakerEndpoint
from langchain.llms.sagemaker_endpoint import LLMContentHandler

class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt, model_kwargs):
        base_input = [{"role" : "user", "content" : prompt}]
        optz_input = format_messages(base_input)
        input_str = json.dumps({
            "inputs" : optz_input, 
            "parameters" : {**model_kwargs}
        })
        return input_str.encode('utf-8')
    
    def transform_output(self, output):
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json["generated_text"]

We can then pass the SageMaker endpoint we previously provisioned into a LangChain `SagemakerEndpoint` object, which allows LangChain to interact with our Llama 2 LLM. We also pass in the parameters we defined previously.

In [47]:
import json
from sagemaker import session

content_handler = ContentHandler()

llm=SagemakerEndpoint(
     endpoint_name=pretrained_predictor.endpoint_name, 
     region_name=session.Session().boto_region_name, 
     model_kwargs={
         "max_new_tokens": 400, 
         "top_p": 0.9, 
         "temperature": 0.6
     },
     endpoint_kwargs={
         "CustomAttributes": custom_attribute
     },
     content_handler=content_handler
 )

We can now create a chat prompt template that LangChain will pass to our LLM. The LangChain [ChatPromptTemplate](https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/#chatprompttemplate) object allows us to do this.

We also have [ConversationBufferMemory](https://api.python.langchain.com/en/latest/memory/langchain.memory.buffer.ConversationBufferMemory.html) and [LLMChain](https://docs.langchain.com/docs/components/chains/llm-chain) objects. The former allows us to store the conversation memory (the code below uses its windowed variant, `ConversationBufferWindowMemory`, with `k=2`, which keeps only the two most recent exchanges), and the latter brings together the Chat Prompt Template, LLM, and Conversation Buffer Memory. We also set `verbose` to `True`, allowing us, in this case, to see the conversation history up until this point.

In [52]:
from langchain.prompts import (
    ChatPromptTemplate,
    MessagesPlaceholder,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferWindowMemory

# Prompt 
prompt = ChatPromptTemplate(
    messages=[
        SystemMessagePromptTemplate.from_template(
            "Assistant is a chatbot having a conversation with a human. Assistant is informative and polite, and only answers the question asked."
        ),
        # The `variable_name` here is what must align with memory
        MessagesPlaceholder(variable_name="chat_history"),
        HumanMessagePromptTemplate.from_template("{question}")
    ]
)

# Notice that we set `return_messages=True` to fit into the MessagesPlaceholder
# Notice that `"chat_history"` aligns with the MessagesPlaceholder name
memory = ConversationBufferWindowMemory(
    memory_key="chat_history",
    return_messages=True, 
    k=2
)

conversation = LLMChain(
    llm=llm,
    prompt=prompt,
    verbose=True,
    memory=memory
)
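
To see what a window of `k` exchanges means in practice, here is a minimal stand-in built on `collections.deque`. The `WindowMemory` class is an illustrative assumption and not LangChain's implementation; it only mirrors the idea that older exchanges drop out once the window is full.

```python
from collections import deque


class WindowMemory:
    """Keep only the last `k` (human, ai) exchanges. Illustrative stand-in, not LangChain code."""

    def __init__(self, k: int):
        # a deque with maxlen silently discards the oldest item when full
        self.exchanges = deque(maxlen=k)

    def save(self, human: str, ai: str) -> None:
        self.exchanges.append((human, ai))

    def history(self) -> list:
        lines = []
        for human, ai in self.exchanges:
            lines.append(f"Human: {human}")
            lines.append(f"AI: {ai}")
        return lines


memory_sketch = WindowMemory(k=2)
for turn in range(1, 4):
    memory_sketch.save(f"question {turn}", f"answer {turn}")
print(memory_sketch.history())
```

After three exchanges with `k=2`, only the second and third remain, so the first question is no longer part of the prompt sent to the model.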

Now that we have our conversation LLM Chain, LangChain will pass our query, as well as the history and chat prompt template, to the LLM. This is great for a chatbot interface, as we'll demonstrate now. For each of the following three cells' output, you'll notice the conversation history after the text `Entering new LLMChain chain...`, and the query response after `Finished chain.`

In [53]:
# Notice that we just pass in the `question` variables - `chat_history` gets populated by memory
def simple_conversation(question):
    print(conversation({"question": question})['text'])

In [54]:
simple_conversation('hi!')

> Entering new LLMChain chain...
Prompt after formatting:
System: Assistant is a chatbot having a conversation with a human. Assistant is informative and polite, and only answers the question asked.
Human: hi!

> Finished chain.
Hello! It's nice to meet you. Is there something I can help you with or would you like to chat?


In [55]:
simple_conversation("How can I travel from New York to Los Angeles?")

> Entering new LLMChain chain...
Prompt after formatting:
System: Assistant is a chatbot having a conversation with a human. Assistant is informative and polite, and only answers the question asked.
Human: hi!
AI: Hello! It's nice to meet you. Is there something I can help you with or would you like to chat?
Human: How can I travel from New York to Los Angeles?

> Finished chain.
There are several ways to travel from New York to Los Angeles, depending on your budget, time constraints, and personal preferences. Here are a few options:

1. **Flying:** The fastest way to reach Los Angeles from New York is by flying. You can take a direct flight from one of New York's three major airports (JFK, LGA, or EWR) to Los Angeles International Airport (LAX). Flight duration is approximately 5 hours.

2. **Driving:** If you prefer a road trip, you can drive from New York to Los Angeles. The distance is approximately 2,796 miles and takes around 40-50 hours with no

In [56]:
simple_conversation("Can you tell me more about the first option?")

> Entering new LLMChain chain...
Prompt after formatting:
System: Assistant is a chatbot having a conversation with a human. Assistant is informative and polite, and only answers the question asked.
Human: hi!
AI: Hello! It's nice to meet you. Is there something I can help you with or would you like to chat?
Human: How can I travel from New York to Los Angeles?
AI: There are several ways to travel from New York to Los Angeles, depending on your budget, time constraints, and personal preferences. Here are a few options:

1. **Flying:** The fastest way to reach Los Angeles from New York is by flying. You can take a direct flight from one of New York's three major airports (JFK, LGA, or EWR) to Los Angeles International Airport (LAX). Flight duration is approximately 5 hours.

2. **Driving:** If you prefer a road trip, you can drive from New York to Los Angeles. The distance is approximately 2,796 miles and takes around 40-50 hours with normal traffic conditions. Yo

___
### LangChain Tools
___

LangChain also has [tools](https://python.langchain.com/docs/modules/agents/tools/) which it can use to send API requests to perform tasks that the LLM could not do in isolation, such as making a search request or checking the weather. Today, we will be using a math tool and a Wikipedia tool, though please see a more complete list [here](https://js.langchain.com/docs/api/tools/). It is also possible to [create your own tool](https://python.langchain.com/docs/modules/agents/tools/custom_tools).

We also use LangChain [agents](https://docs.langchain.com/docs/components/agents/). Agents are especially powerful where there is not a predetermined chain of calls, like we've had above so far. It is possible to have an unknown chain that depends on the user's input. In these types of chains, there is an “agent” which has access to a suite of tools. Depending on the user input, the agent can then decide which, if any, of these tools to call. An agent could call multiple LLM Chains that we defined above, each with their own tools. They can also be extended with custom logic to allow for retries and error handling.

Defining our two tools, as well as the LangChain agent, will give us a model that will be able to determine whether it needs to use Wikipedia or a math tool, or whether it is able to answer a question on its own. If it needs the tool, it will make a request to the tool, receive the response, and then return that response to the user.

We also define an Output Parser, which is a method of parsing the LLM's response to the prompt. If the LLM produces output that uses certain headers, we can enable complex interactions where variables are generated by the LLM in its response and passed into the next step of the chain.

In [57]:
from langchain.agents import AgentOutputParser
from langchain.agents.conversational_chat.prompt import FORMAT_INSTRUCTIONS
from langchain.output_parsers.json import parse_json_markdown
from langchain.schema import AgentAction, AgentFinish

class OutputParser(AgentOutputParser):
    def get_format_instructions(self) -> str:
        return FORMAT_INSTRUCTIONS

    def parse(self, text: str) -> AgentAction | AgentFinish:
        try:
            # this will work IF the text is a valid JSON with "step" and "step_input" keys
            response = parse_json_markdown(text)
            action, action_input = response["step"], response["step_input"]
            if action == "Final Answer":
                # this means the agent is finished so we call AgentFinish
                return AgentFinish({"output": action_input}, text)
            else:
                # otherwise the agent wants to use an action, so we call AgentAction
                return AgentAction(action, action_input, text)
        except Exception:
            # sometimes the agent will return a string that is not a valid JSON
            # often this happens when the agent is finished
            # so we just return the text as the output
            return AgentFinish({"output": text}, text)

    @property
    def _type(self) -> str:
        return "conversational_chat"

# initialize output parser for agent
parser = OutputParser()
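
To see the parsing decision in isolation, the following standard-library sketch mirrors the same logic without LangChain. The function `parse_step` and its tuple return values are assumptions made for illustration; the real parser above returns `AgentAction` and `AgentFinish` objects and relies on `parse_json_markdown`.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built programmatically to keep this example easy to embed
# Match a JSON object inside an optional fence; tolerate a missing closing fence.
STEP_PATTERN = re.compile(FENCE + r"(?:json)?\s*(\{.*?\})\s*(?:" + FENCE + r"|$)", re.DOTALL)


def parse_step(text: str):
    """Return ('finish', answer) or ('tool', name, tool_input) from a model reply."""
    match = STEP_PATTERN.search(text)
    candidate = match.group(1) if match else text
    try:
        data = json.loads(candidate)
        step, step_input = data["step"], data["step_input"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Not valid JSON in the expected shape: treat the raw text as the final answer.
        return ("finish", text)
    if step == "Final Answer":
        return ("finish", step_input)
    return ("tool", step, step_input)


reply = FENCE + 'json\n{"step": "Calculator",\n "step_input": "sqrt(4)"}\n' + FENCE
print(parse_step(reply))
print(parse_step("Just a plain sentence."))
```

Falling back to the raw text when parsing fails matches the behavior of the `except` branch in the class above.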

We initialize the agent with our two tools, the [agent type](https://python.langchain.com/docs/modules/agents/agent_types/), the LLM, a fresh memory, and the output parser we defined above. Again we set `verbose` to `True`, which in this case will allow us to see if and how the agent calls a tool it has access to.

In [58]:
memory = ConversationBufferWindowMemory(
    memory_key="chat_history", 
    k=8, 
    return_messages=True, 
    output_key="output"
)

In [86]:
from langchain.agents import initialize_agent
from langchain.agents import AgentOutputParser, load_tools

llm=SagemakerEndpoint(
     endpoint_name=pretrained_predictor.endpoint_name, 
     region_name=session.Session().boto_region_name, 
     model_kwargs={
         "max_new_tokens": 400, 
         "top_p": 0.1, 
         "temperature": 0.2
     },
     endpoint_kwargs={"CustomAttributes": custom_attribute},
     content_handler=content_handler
 )

# equip agents with tools
tools = load_tools(["llm-math", "wikipedia"], llm=llm)

# initialize agent
agent = initialize_agent(
    agent="chat-conversational-react-description",
    memory=memory,
    max_iterations=2,
    llm=llm,
    handle_parsing_errors="Check your output and make sure it conforms. It must be entirely in JSON!",
    tools=tools,
    verbose=True,
    agent_kwargs={
        "output_parser": parser
    }
)
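
Conceptually, the agent alternates between asking the model for a step and running the requested tool, up to `max_iterations` times. The sketch below is a simplified stand-in written for this explanation, not LangChain's `AgentExecutor`; `run_agent`, `scripted_llm`, and `toy_tools` are hypothetical names, and the "model" is a scripted function rather than a real endpoint.

```python
def run_agent(llm, tools, question, max_iterations=2):
    """Alternate between the model and its tools until a final answer or the limit."""
    scratchpad = question
    for _ in range(max_iterations):
        step, step_input = llm(scratchpad)
        if step == "Final Answer":
            return step_input
        if step not in tools:
            return f"Unknown tool: {step}"
        observation = tools[step](step_input)
        # Feed the tool result back so the model can phrase the final answer.
        scratchpad = f"{scratchpad}\nObservation: {observation}"
    return "Agent stopped due to iteration limit or time limit."


def scripted_llm(scratchpad):
    """Stand-in model: asks for the calculator once, then reports the observation."""
    if "Observation: " in scratchpad:
        return "Final Answer", scratchpad.rsplit("Observation: ", 1)[1]
    return "Calculator", "9**3"


# A lookup table instead of eval(), so the toy calculator cannot run arbitrary code.
toy_tools = {"Calculator": lambda expression: {"9**3": "729"}.get(expression, "unknown")}

print(run_agent(scripted_llm, toy_tools, "What is 9 cubed?"))
print(run_agent(scripted_llm, toy_tools, "What is 9 cubed?", max_iterations=1))
```

With a budget of two iterations the scripted model has room for one tool call and one final answer; with a budget of one it runs out after the tool call and the loop returns the stop message.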

We also provide a background prompt to the model. This provides the LLM with instructions on which tools it has access to, when to use each, and how to use it. This tells the LLM when to use a tool (as opposed to answering by itself), and also allows the LangChain agent to create a request to the tool the LLM has identified, before returning to the LLM to respond in natural language.

In [87]:
# create the system message
system_message = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nAssistant is an expert JSON builder designed to assist with a wide range of tasks.

Assistant is able to respond to the User and use tools using JSON strings that contain "step" and "step_input" parameters.

All of Assistant's communication is performed using this JSON format.

Assistant can also use tools by responding to the user with tool use instructions in the same "step" and "step_input" JSON format. Tools available to Assistant are:

- "Calculator": Useful for when you need to answer questions about math.
  - To use the calculator tool, Assistant should write like so:
    ```json
    {{"step": "Calculator",
      "step_input": "sqrt(4)"}}
    ```

- "wikipedia": Useful when you need a summary of a person, place, historical event, or other subject. Input is typically a noun, like a person, place, historical event, or another subject.
  - To use the wikipedia tool, Assistant should format the JSON like the following before getting the response and returning to the user:
    ```json
    {{"step": "wikipedia",
      "step_input": "Statue of Liberty"}}
    ```

When Assistant responds with JSON they make sure to enclose the JSON with three back ticks.

Here are some previous conversations between the Assistant and User:

User: Hey how are you today?
Assistant: ```json
{{"step": "Final Answer",
 "step_input": "I'm good thanks, how are you?"}}
```
User: I'm great, what is the square root of 4?
Assistant: ```json
{{"step": "Calculator",
 "step_input": "sqrt(4)"}}
```
User: Who is the President of the United States of America?
Assistant: ```json
{{"step": "wikipedia",
 "step_input": "President of United States of America"}}
```
User: What is 9 cubed?
Assistant: ```json
{{"step": "Calculator",
 "step_input": "9**3"}}
```
User: 729
Assistant: ```json
{{"step": "Final Answer",
 "step_input": "The answer to your question is 729."}}
```
User: Can you tell me about the Statue of Liberty?
Assistant: ```json
{{"step": "wikipedia",
 "step_input": "Statue of Liberty"}}
```
User: What is the square root of 81?
Assistant: ```json
{{"step": "Calculator",
 "step_input": "sqrt(81)"}}
```
User: 9
Assistant: ```json
{{"step": "Final Answer",
 "step_input": "The answer to your question is 9."}}
```

Here is the latest conversation between Assistant and User.<|eot_id|>"""

few_shot = agent.agent.create_prompt(
    system_message=system_message,
    tools=tools
)
agent.agent.llm_chain.prompt = few_shot

human_msg = "<|start_header_id|>user<|end_header_id|>\nRespond to the following in JSON with 'step' and 'step_input' values\nUser: {input}"

agent.agent.llm_chain.prompt.messages[2].prompt.template = human_msg
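
The doubled braces in the system message are there because the prompt is a format template: `{{` and `}}` render as literal braces, while single braces such as `{input}` mark variables to fill in. The snippet below demonstrates this with plain `str.format`, assuming the template is rendered with the same escaping rules; the `{answer}` placeholder is made up for the example.

```python
import json

# Doubled braces survive formatting as literal JSON braces; {answer} is substituted.
template = '{{"step": "Final Answer", "step_input": "{answer}"}}'
rendered = template.format(answer="The answer to your question is 729.")

print(rendered)
print(json.loads(rendered)["step"])
```

Writing the JSON examples with single braces would make the formatter look for variables named after the JSON keys and fail.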

We can now send some prompts to the LLM and see when/how it uses the tools!

In [88]:
def agent_conversation(question):
    print(agent(question))

In [89]:
agent_conversation("hey how are you today?")

> Entering new AgentExecutor chain...
```json
{"step": "Final Answer",
 "step_input": "I'm good thanks, how are you?"}
```

> Finished chain.
{'input': 'hey how are you today?', 'chat_history': [HumanMessage(content='hey how are you today?'), AIMessage(content="I'm good thanks, how are you?"), HumanMessage(content='Tell me about the Empire Statue Building'), AIMessage(content='Agent stopped due to iteration limit or time limit.'), HumanMessage(content='hey how are you today?'), AIMessage(content="I'm good thanks, how are you?"), HumanMessage(content='Tell me about the Empire Statue Building'), AIMessage(content='Agent stopped due to iteration limit or time limit.')], 'output': "I'm good thanks, how are you?"}


In [90]:
agent_conversation("Tell me about the Empire Statue Building")

> Entering new AgentExecutor chain...
```json
{"step": "wikipedia",
 "step_input": "Empire State Building"}
```
Observation: Page: Empire State Building
Summary: The Empire State Building is a 102-story Art Deco skyscraper in the Midtown South neighborhood of Manhattan in New York City. The building was designed by Shreve, Lamb & Harmon and built from 1930 to 1931. Its name is derived from "Empire State", the nickname of the state of New York. The building has a roof height of 1,250 feet (380 m) and stands a total of 1,454 feet (443.2 m) tall, including its antenna. The Empire State Building was the world's tallest building until the first tower of the World Trade Center was topped out in 1970; following the September 11 attacks in 2001, the Empire State Building was New York City's tallest building until it was surpassed in 2012 by One World Trade Center. As of 2022, the building is the seventh-tallest building in New York City, the ninth-talles

In [91]:
agent_conversation("what is 4 to the power of 2.1?")

> Entering new AgentExecutor chain...
```json
{"step": "Calculator",
 "step_input": "4**2.1"}
```
Observation: Answer:  55.78856902770393
Thought:```json
{"step": "Final Answer",
 "step_input": "The answer to your question is 55.78856902770393."}

> Finished chain.
{'input': 'what is 4 to the power of 2.1?', 'chat_history': [HumanMessage(content='hey how are you today?'), AIMessage(content="I'm good thanks, how are you?"), HumanMessage(content='Tell me about the Empire Statue Building'), AIMessage(content='Agent stopped due to iteration limit or time limit.'), HumanMessage(content='hey how are you today?'), AIMessage(content="I'm good thanks, how are you?"), HumanMessage(content='Tell me about the Empire Statue Building'), AIMessage(content='Agent stopped due to iteration limit or time limit.'), HumanMessage(content='hey how are you today?'), AIMessage(content="I'm good thanks, how are you?"), HumanMessage(content='Te

In [92]:
agent_conversation("What is the square root of 64?")

> Entering new AgentExecutor chain...
```json
{"step": "Calculator",
 "step_input": "sqrt(64)"}
Observation: Answer:  8.0
Thought:```json
{"step": "Final Answer",
 "step_input": "The square root of 64 is 8."}
```

> Finished chain.
{'input': 'What is the square root of 64?', 'chat_history': [HumanMessage(content='hey how are you today?'), AIMessage(content="I'm good thanks, how are you?"), HumanMessage(content='Tell me about the Empire Statue Building'), AIMessage(content='Agent stopped due to iteration limit or time limit.'), HumanMessage(content='hey how are you today?'), AIMessage(content="I'm good thanks, how are you?"), HumanMessage(content='Tell me about the Empire Statue Building'), AIMessage(content='Agent stopped due to iteration limit or time limit.'), HumanMessage(content='hey how are you today?'), AIMessage(content="I'm good thanks, how are you?"), HumanMessage(content='Tell me about the Empire Statue Buil
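
In the traces above, the model wraps each step in a `json` code fence and sometimes leaves the fence unclosed. A minimal sketch of how such text could be turned into a `step` / `step_input` pair while tolerating a missing closing fence. This is not LangChain's own output parser; `parse_step` is a hypothetical helper, and the sample strings are copied from the traces above.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built programmatically for the sample inputs


def parse_step(text: str) -> dict:
    """Extract the JSON object holding 'step' and 'step_input' from model output."""
    # Match from the first '{' to the last '}', ignoring any surrounding fence.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    step = json.loads(match.group(0))
    if not {"step", "step_input"} <= set(step):
        raise ValueError("expected 'step' and 'step_input' keys")
    return step


closed = FENCE + 'json\n{"step": "Calculator",\n "step_input": "sqrt(64)"}\n' + FENCE
unclosed = FENCE + 'json\n{"step": "Final Answer",\n "step_input": "The square root of 64 is 8."}'

print(parse_step(closed))
print(parse_step(unclosed))
```

Searching for the braces rather than the fence is a deliberate choice here: the fence is the part of the output that varies between runs, while the JSON object itself is consistent.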

___
### Developing and deploying the UI with Streamlit
___

Let's bring all of this together and host our chatbot interface!

For this we will use [Streamlit](https://streamlit.io/). Streamlit is an open-source Python library that allows you to create and deploy web applications. A Streamlit app can be deployed from a local machine or from the cloud. Today, we will deploy ours directly from SageMaker Studio.

The file `chat_app.py` (in `../studio-local-ui/`) brings together everything we have discussed so far. It initializes a LangChain agent with the tools and conversation memory covered previously, and it connects to the same Llama 2 LLM.

Most of this code will be familiar from the notebook so far. The rest uses the [Streamlit library](https://docs.streamlit.io/library/api-reference), as well as [LangChain Streamlit packages](https://python.langchain.com/docs/integrations/memory/streamlit_chat_message_history).

One of the last lines of the file, `response = agent(prompt, callbacks=[st_cb])`, sends the prompt to the agent and specifies the [StreamlitCallbackHandler](https://python.LangChain.com/docs/integrations/callbacks/streamlit), which can display the agent's reasoning and actions in the Streamlit app. By default we do not show this in the conversation: a regex filters the excess conversation history and thought process out of the response. To see that detail, comment out the line near the end of the file, `response = re.sub("\{.*?\}","",response["output"])`.
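
To make the filtering step concrete, here is a minimal sketch of what that `re.sub` pattern does to a response string. The sample strings are hypothetical and are not taken from `chat_app.py`; only the pattern itself comes from the line quoted above (written here as a raw string).

```python
import re

# Non-greedy match from a '{' to the nearest '}' on the same line.
PATTERN = r"\{.*?\}"

single_line = 'Sure. {"step": "Final Answer"} The answer is 9.'
cleaned = re.sub(PATTERN, "", single_line)
print(repr(cleaned))

# Without re.DOTALL, '.' does not match a newline, so a JSON object
# that spans several lines is left untouched by this pattern.
multi_line = 'Sure. {"step": "Final Answer",\n "step_input": "9"} done'
print(re.sub(PATTERN, "", multi_line) == multi_line)
```

The second case suggests the filter mainly targets single-line JSON fragments; whether multi-line fragments ever reach this line in practice depends on how the agent returns its final answer.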

We are also using [st.chat_message](https://docs.streamlit.io/library/api-reference/chat/st.chat_message) to handle the chat message container, and [st.write](https://docs.streamlit.io/library/api-reference/write-magic/st.write) to return this, along with the previous conversation, back to the UI.

We can [build Streamlit apps in SageMaker Studio](https://aws.amazon.com/blogs/machine-learning/build-streamlit-apps-in-amazon-sagemaker-studio/). We will do this by hosting the app on the Jupyter Server. 

First, let's write the name of our SageMaker endpoint, and then the custom attribute, to text files so they can be read by `chat_app.py`:

In [93]:
# Save the endpoint name where the Streamlit app can read it
with open("../studio-local-ui/endpoint_name.txt", "w") as f:
    f.write(pretrained_predictor.endpoint_name)

In [94]:
# Save the custom attribute (the EULA flag) next to the endpoint name
with open("../studio-local-ui/custom_attribute.txt", "w") as f:
    f.write(custom_attribute)
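
How `chat_app.py` consumes these files is not shown in this notebook. A minimal sketch of the round trip, assuming the app simply reads each file and strips surrounding whitespace. `read_setting` is a hypothetical helper, the endpoint name is a placeholder, and a temporary directory stands in for `../studio-local-ui/`.

```python
import tempfile
from pathlib import Path


def read_setting(path: Path) -> str:
    """Return the file's contents without leading or trailing whitespace."""
    return path.read_text().strip()


with tempfile.TemporaryDirectory() as tmp:
    endpoint_file = Path(tmp) / "endpoint_name.txt"
    attribute_file = Path(tmp) / "custom_attribute.txt"

    # Placeholder values; the notebook writes the real ones in the cells above.
    endpoint_file.write_text("my-llama-endpoint\n")
    attribute_file.write_text("accept_eula=true")

    endpoint_name = read_setting(endpoint_file)
    custom_attribute_value = read_setting(attribute_file)

print(endpoint_name, custom_attribute_value)
```

Passing the values through small text files keeps the Streamlit process independent of the notebook kernel's variables.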

Run the following cells marked with `%%bash`. They install the `jq` command-line tool and spin up a new Streamlit UI that is accessible from the URL described below.

In [95]:
%%bash
sudo apt-get install -yq jq

Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  libjq1 libonig5
The following NEW packages will be installed:
  jq libjq1 libonig5
0 upgraded, 3 newly installed, 0 to remove and 0 not upgraded.
Need to get 357 kB of archives.
After this operation, 1087 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 libonig5 amd64 6.9.7.1-2build1 [172 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/main amd64 libjq1 amd64 1.6-2.1ubuntu3 [133 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/main amd64 jq amd64 1.6-2.1ubuntu3 [52.5 kB]


debconf: delaying package configuration, since apt-utils is not installed


Fetched 357 kB in 1s (611 kB/s)
Selecting previously unselected package libonig5:amd64.
(Reading database ... 13790 files and directories currently installed.)
Preparing to unpack .../libonig5_6.9.7.1-2build1_amd64.deb ...
Unpacking libonig5:amd64 (6.9.7.1-2build1) ...
Selecting previously unselected package libjq1:amd64.
Preparing to unpack .../libjq1_1.6-2.1ubuntu3_amd64.deb ...
Unpacking libjq1:amd64 (1.6-2.1ubuntu3) ...
Selecting previously unselected package jq.
Preparing to unpack .../jq_1.6-2.1ubuntu3_amd64.deb ...
Unpacking jq (1.6-2.1ubuntu3) ...
Setting up libonig5:amd64 (6.9.7.1-2build1) ...
Setting up libjq1:amd64 (1.6-2.1ubuntu3) ...
Setting up jq (1.6-2.1ubuntu3) ...
Processing triggers for libc-bin (2.35-0ubuntu3.8) ...


In [None]:
%%bash
cd ../studio-local-ui
# Look up the Studio domain and space this notebook is running in
DOMAIN_ID=$(jq -r '.DomainId' /opt/ml/metadata/resource-metadata.json)
SPACE_NAME=$(jq -r '.SpaceName' /opt/ml/metadata/resource-metadata.json)
# Resolve the space URL, which is the base for the proxied Streamlit port
STREAMLIT_URL=$(aws sagemaker describe-space --domain-id "$DOMAIN_ID" --space-name "$SPACE_NAME" | jq -r '.Url')

echo "=====>  Launch Streamlit: $STREAMLIT_URL/proxy/8501/"

# Keeps running until the cell is interrupted
streamlit run chat_app.py --server.runOnSave true --server.port 8501 > /dev/null

=====>  Launch Streamlit: https://tu6mkavngfsynef.studio.us-east-1.sagemaker.aws/jupyterlab/default/proxy/8501/



>> from langchain.callbacks import StreamlitCallbackHandler

with new imports of:

>> from langchain_community.callbacks.streamlit import StreamlitCallbackHandler
You can use the langchain cli to **automatically** upgrade many imports. Please see documentation here <https://python.langchain.com/v0.2/docs/versions/v0_2/>
  from langchain.callbacks import StreamlitCallbackHandler

>> from langchain.memory.chat_message_histories import StreamlitChatMessageHistory

with new imports of:

>> from langchain_community.chat_message_histories import StreamlitChatMessageHistory
You can use the langchain cli to **automatically** upgrade many imports. Please see documentation here <https://python.langchain.com/v0.2/docs/versions/v0_2/>
  from langchain.memory.chat_message_histories import StreamlitChatMessageHistory

`from langchain_community.llms import SagemakerEndpoint`.

To install langchain-community run `pip install -U langchain-community`.

<div style="background-color: #6bb07e; border-left: 5px solid #6bb07e; padding: 10px; color: black;">
    - Navigate to: https://example.studio.us-east-1.sagemaker.aws/jupyterlab/default/proxy/8501/
</div>

<div style="background-color: #6bb07e; border-left: 5px solid #6bb07e; padding: 10px; color: black;">
    <i>- Replace "example" with your current URL host `https://use_this_host.studio.us-east-1...`</i>
</div>
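
The host substitution described above can be sketched in a few lines. This assumes the space serves JupyterLab under `/jupyterlab/default`, as in the launch URL printed earlier; `streamlit_proxy_url` is a hypothetical helper and the input URL is a placeholder.

```python
from urllib.parse import urlsplit, urlunsplit


def streamlit_proxy_url(studio_url: str, port: int = 8501) -> str:
    """Keep the scheme and host of the current Studio URL, swap in the proxy path."""
    parts = urlsplit(studio_url)
    proxy_path = f"/jupyterlab/default/proxy/{port}/"
    return urlunsplit((parts.scheme, parts.netloc, proxy_path, "", ""))


current = "https://use_this_host.studio.us-east-1.sagemaker.aws/jupyterlab/default/lab"
print(streamlit_proxy_url(current))
```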

<div style="background-color: #FFDDDD; border-left: 5px solid red; padding: 10px; color: black;">
    Please *interrupt* the above cell to stop the Streamlit app
</div>

Navigate to `Kernel` > `Interrupt Kernel` 

OR

Use the `Stop` button in the toolbar to interrupt your kernel