# LLM Agents

## Setup

In [35]:
%config InlineBackend.figure_format = "retina"

# I am hosting Llama 3.1 8B Instruct using vLLM for my class. You can do so
# yourself, or use an OpenAI model such as gpt-4o-mini.
import os
BASE_URL = os.getenv("CS4973_BASE_URL")
API_KEY = os.getenv("CS4973_API_KEY")

from openai import OpenAI


MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"
CLIENT = OpenAI(base_url=BASE_URL, api_key=API_KEY)

def query_llama(messages):
    """
    Sends a list of chat messages to the model and prints its reply.
    """
    resp = CLIENT.chat.completions.create(
        messages = messages,
        model = MODEL,
        temperature=0, # Deterministic
    )
    print(resp.choices[0].message.content)


## What Do We Mean by LLM Agents?

WebShop ([Yao et al, 2022](https://proceedings.neurips.cc/paper_files/paper/2022/hash/82ad13ec01f9fe44c01cb91814fd7b8c-Abstract-Conference.html)) is an older example of an agent benchmark and
AgentBench ([Liu et al, 2024](https://openreview.net/forum?id=zAdUB0aCTQ)) is more recent. There are many more benchmarks.

An example from Tau-Bench ([Yao et al, 2024](https://arxiv.org/abs/2406.12045)):

<img src="images/Tau Bench Example.png" style="width: 50%;">

The LLM interacts with both a database backend and the user (i.e., outputs from the LLM are marked appropriately).


## Tools

### A "Calculator" Tool

When we say that LLMs "use tools", we mean that we use the LLM to produce
some kind of structured output that we parse and use to take some action.

An early example of tool use from [Cobbe et al,
2021](https://arxiv.org/abs/2110.14168) uses Python to evaluate mathematical
expressions, because LLMs are bad at arithmetic:

<img src="images/GSM8K Calculator.png" style="width: 50%;">

What is shown above is a hack to the decoding algorithm: we stop generating when
we see the `=` token and use Python. We can take a different approach with a
conversational model and instruct it to produce structured output.
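
A minimal sketch of that idea, assuming the model wraps its code in a Markdown fence tagged `python`: we pull out the first such block and run it, capturing whatever it prints. The helper names `extract_python_block` and `run_python` are hypothetical, and calling `exec` on model-generated code is unsafe outside a sandbox.

```python
import contextlib
import io
import re

FENCE = "`" * 3  # three backticks, built here to avoid nesting fences

# Matches the body of the first fenced block tagged "python".
CODE_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def extract_python_block(reply):
    """Return the code in the first python-tagged fence, or None."""
    match = CODE_BLOCK.search(reply)
    return match.group(1) if match else None


def run_python(code):
    """Run code and return what it printed. Not sandboxed: illustration only."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue()


# A made-up reply in the shape we hope the model produces.
reply = "Sure:\n" + FENCE + "python\nprint(10 * 83)\n" + FENCE + "\nDone."
code = extract_python_block(reply)
result = run_python(code)
print(result)
```

If the reply contains no fenced code, `extract_python_block` returns `None`, which is the cue to treat the reply as an ordinary answer.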

First, let's see what we get out of Llama 3.1 8B Instruct without a tool. 

In [3]:
query_llama([{"role": "user", "content": "What is factorial of 48? Be concise."}])

The factorial of 48 (48!) is:

1,679,616,000,000,000,000,000


(It is the wrong answer.)
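
For reference, Python integers have arbitrary precision, so the standard library's `math.factorial` computes the exact value. The short sketch below compares it with the number the model produced above.

```python
import math

exact = math.factorial(48)
model_answer = 1_679_616_000_000_000_000_000  # the number the model produced above

print(exact)
print(len(str(exact)), "digits, versus", len(str(model_answer)), "in the model's answer")
```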

In [4]:
SYSTEM_PROMPT = "You are a helpful assistant. If you're given a math problem, write Python code to solve."

query_llama([
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "What is factorial of 48?"}])

The factorial of a number is the product of all positive integers less than or equal to that number. 

Here's a Python function to calculate the factorial of a given number:

```python
import math

def calculate_factorial(n):
    """
    Calculate the factorial of a given number.
    
    Args:
        n (int): The number to calculate the factorial of.
    
    Returns:
        int: The factorial of the given number.
    """
    if n < 0:
        raise ValueError("Factorial is not defined for negative numbers.")
    elif n == 0 or n == 1:
        return 1
    else:
        return math.factorial(n)

# Test the function
n = 48
result = calculate_factorial(n)
print(f"The factorial of {n} is: {result}")
```

However, since the factorial of 48 is a very large number, it might exceed the maximum limit of an integer in Python. In that case, we can use the `math.prod` function to calculate the factorial:

```python
import math

def calculate_factorial(n):
    """
    Calculate the factorial of
```

But, this has to be done carefully. Not every problem requires calculation.
The following response is a little ridiculous.

In [5]:
query_llama([
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Who was John Hancock?"}])

John Hancock was an American patriot, statesman, and Founding Father of the United States. He is best known for being the first signer of the United States Declaration of Independence in 1776. His signature on the document is large and bold, which has become an iconic symbol of American independence.

Hancock was born in 1737 in Braintree, Massachusetts, and he went on to become a successful merchant and politician. He served as the president of the Continental Congress from 1775 to 1777 and again from 1785 to 1786. He also served as the governor of Massachusetts from 1780 to 1785.

Hancock was known for his strong leadership and his commitment to American independence. He was a key figure in the American Revolution and played a significant role in shaping the country's early history.

Interestingly, the phrase "put your John Hancock on it" originated from Hancock's signature on the Declaration of Independence. It is now commonly used to refer to signing one's name to a document.

Here

It is possible to do better!

### A "Web Search" Tool

Imagine we are building a chatbot that can answer questions by searching the
web. Under the hood, we may have functions such as the following available.

In [43]:
TOOLS = """
You are a helpful assistant. Write Python code to solve math problems. To search the
web you can use the following function:

```
web_search(query: str) -> str
```

This returns the text of the first Google search result for `query`.
"""

EXAMPLES = [
    { "role": "user", "content": "What is 10 * 83"},
    { 
        "role": "assistant", 
        "content": """```python
print(10 * 83)
```"""
    },
    { "role": "user", "content": "Who is the current president of the United States?"},
    { 
        "role": "assistant", 
        "content": """```python
print(web_search('current president of the United States'))
```"""
    }
]

The key idea is to prompt the model to use the functions (the examples help a lot).

In [47]:
query_llama([
    { "role": "system", "content": TOOLS},
    *EXAMPLES,
    { "role": "user", "content": "How cold is it today in Boston?" }
])

```python
weather = web_search("Boston weather today")
print(weather)
```

This will print the current weather in Boston. However, please note that this is a web search and the result may not be accurate or up-to-date.

Alternatively, you can use a weather API to get the current temperature in Boston. 

```python
import requests

def get_weather(city):
    api_key = "YOUR_API_KEY"
    base_url = f"http://api.openweathermap.org/data/2.5/weather?q={city}&appid={api_key}"
    response = requests.get(base_url)
    data = response.json()
    return data["main"]["temp"] - 273.15  # Convert Kelvin to Celsius

print(get_weather("Boston"))
```

Replace "YOUR_API_KEY" with your actual OpenWeatherMap API key.


This is the first result that I got:

<img src="images/Boston Weather.png" style="width: 50%;">

I can copy-paste some text from it. (It should be easy to automate.)
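
One way this could be automated, as a rough sketch: run the model's code with a `web_search` function in scope, capture what it prints, and build the follow-up prompt from the result. Here `fake_web_search` is a stand-in that returns canned text (a real version would need to fetch and clean a search result), and `run_with_tools` and `build_followup` are hypothetical helper names. Running model-written code with `exec` is unsafe without a sandbox.

```python
import contextlib
import io

FENCE = "`" * 3  # three backticks, built here to avoid nesting fences


def fake_web_search(query):
    """Stand-in for a real search tool: always returns the same canned text."""
    return "49° 94% SSE 14mph"


def run_with_tools(code, tools):
    """Execute code with the given tool functions in scope; return its stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, dict(tools))
    return buffer.getvalue()


def build_followup(code, output, question):
    """Build a user message that reports the code's output back to the model."""
    content = (
        "I ran the following code:\n\n"
        f"{FENCE}python\n{code}\n{FENCE}\n\n"
        "The code output the following:\n\n"
        f"{FENCE}\n{output}\n{FENCE}\n\n"
        f"Now respond to this: {question}\n"
    )
    return {"role": "user", "content": content}


code = 'weather = web_search("Boston weather today")\nprint(weather)'
output = run_with_tools(code, {"web_search": fake_web_search})
message = build_followup(code, output.strip(), "How cold is it today in Boston?")
print(message["content"])
```

The resulting `message` has the same shape as the hand-written prompt in the next cell, so it could be passed to `query_llama([message])`.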

In [51]:
query_llama([
    { 
        "role": "user",
        "content": """I ran the following code:

```python
weather = web_search("Boston weather today")
print(weather)
```

The code output the following:

```
49° 94% SSE 14mph

Rain showers early will evolve into a more steady rain overnight. Low 49F. Winds SSE at 10 to 20 mph. Chance of rain 90%. Rainfall around a quarter of an inch.

Humidity 70%    UV Index 0 of 11
Moonrise 4:21 am     Moonset 4:25 pm
Waning Crescent
```

Now respond to this: How cold is it today in Boston?
"""
    }
])

According to the output, it is 49°F (49 degrees Fahrenheit) in Boston today.


It is a little lame that it refers to the "output". But, making it sound more natural is a prompting exercise.

### A Travel Agent Example

This was a homework assignment in a class I taught:

https://pages.github.khoury.northeastern.edu/arjunguha/engineering-llm-systems/hw/thomas_the_travel_agent/

## Managing State, Memory, and Context

LLMs are stateless. Every stateful conversation involves sending progressively longer input. The *context length* is the maximum number of input + output tokens the LLM can handle. For GPT-4o, it is 128,000 tokens total. A token is approximately 3.1 English characters. So, in kilobytes:

In [54]:
(128_000 * 3.1) / 1_024

387.5

Models are being trained to handle millions of tokens. But, it doesn't seem to be going very well.

<img src="images/Same Task, More Tokens.png" style="width:75%;">

So, in practical terms, you need to manage context in some way, even if the model
hypothetically supports very long context. The simplest approach is to "forget"
earlier parts of the conversation. This may not work with most agent designs:
e.g., if the problem statement is the first message, after a while the model
won't see the problem statement.
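
A minimal sketch of one variation on forgetting, assuming we always keep the first message (so the problem statement survives) and then keep as many of the most recent messages as fit in a budget. The budget is measured in characters as a crude proxy for tokens, and `trim_history` is a hypothetical helper name.

```python
def trim_history(messages, max_chars, keep_first=1):
    """Keep the first keep_first messages plus the newest ones that fit in max_chars."""
    head = messages[:keep_first]
    tail = messages[keep_first:]
    budget = max_chars - sum(len(m["content"]) for m in head)

    kept = []
    # Walk backwards from the newest message and stop once the budget runs out.
    for message in reversed(tail):
        cost = len(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost

    return head + list(reversed(kept))


history = [
    {"role": "user", "content": "Problem statement: plan a trip to Boston."},
    {"role": "assistant", "content": "A" * 50},
    {"role": "user", "content": "B" * 50},
    {"role": "assistant", "content": "C" * 50},
]
trimmed = trim_history(history, max_chars=150)
print([m["content"][:7] for m in trimmed])
```

Dropping messages from the middle can leave a user turn without the reply that followed it, so a real agent may want to drop messages in pairs.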

There is an approach discussed here that is used in practice:

https://lexfridman.com/cursor-team-transcript#chapter8_prompt_engineering

Other approaches involve applying information retrieval techniques to the context
itself.
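
As a rough illustration of retrieval over the context, the sketch below scores each earlier message by word overlap with the current query and keeps the top few, in their original order. Word overlap is only a stand-in for a real ranking method, and `words` and `select_relevant` are hypothetical helper names.

```python
import re


def words(text):
    """Lowercase a string and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_relevant(messages, query, k=2):
    """Return the k messages sharing the most words with query, in original order."""
    query_words = words(query)
    scored = [
        (len(words(m["content"]) & query_words), index)
        for index, m in enumerate(messages)
    ]
    # Highest overlap first; ties go to the earlier message.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:k]
    return [messages[index] for index in sorted(index for _, index in top)]


history = [
    {"role": "user", "content": "I want to fly from Boston to Denver."},
    {"role": "assistant", "content": "Which date works for you?"},
    {"role": "user", "content": "Also, my seat preference is aisle."},
    {"role": "assistant", "content": "Noted: aisle seat."},
]
picked = select_relevant(history, "What seat preference did the user give?", k=2)
print([m["content"] for m in picked])
```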

## "Self-Reflection"

[Shinn et al, 2023](https://proceedings.neurips.cc/paper_files/paper/2023/hash/1b44b878bb782e6954cd888628510e90-Abstract-Conference.html) show that sufficiently capable models can correct errors in their own reasoning *given some external signal.* For example, suppose you ask the model to write code and then run it. If execution produces an error, the model may be able to generate an alternative solution conditioned on the error message. This allows agents to make progress when they would otherwise get stuck.
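
The loop described above might look like the following sketch. The `generate` argument is a hypothetical callable standing in for an LLM call (it takes a list of chat messages and returns Python source), and `fake_generate` is a canned stand-in so the example runs without a model. This is not the method from the paper, only an illustration of retrying with the error message in context.

```python
def solve_with_retries(task, generate, max_attempts=3):
    """Ask generate for code, run it, and feed any error back for another try."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_attempts):
        code = generate(messages)
        try:
            exec(code, {})  # Unsafe outside a sandbox: illustration only.
            return code
        except Exception as exc:
            error = f"{type(exc).__name__}: {exc}"
            messages.append({"role": "assistant", "content": code})
            messages.append(
                {"role": "user", "content": f"That failed with:\n{error}\nPlease fix it."}
            )
    return None


def fake_generate(messages):
    """Canned stand-in for a model: buggy first, fixed once it has seen an error."""
    if len(messages) == 1:
        return "total = sum(['1', '2'])"
    return "total = sum(int(x) for x in ['1', '2'])"


final_code = solve_with_retries("Sum the numbers in ['1', '2'].", fake_generate)
print(final_code)
```

The attempt limit matters: without it, a model that keeps producing the same broken code would loop forever.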

Here is an example from ChatGPT: https://chatgpt.com/share/672156e1-8a78-800d-9599-e135411dffb5


*User:* I've attached a JSONL file. Write code to turn it into latex formatted as follows. For each item:

```
\textbf{Problem:} \lstinline|item["problem_name"]|
\textbf{Signature:} \lstinline|item["signature"]|
\textbf{Clues:} 
\begin{enumerate}
\item item["tags:][0]
...
\end{enumerate}
```

*Assistant*: 

```python
import json

# Load the JSONL file
file_path = '/mnt/data/problem_clues.jsonl'

# Read the JSONL file and parse it
data = []
with open(file_path, 'r') as f:
    for line in f:
        data.append(json.loads(line))

# Create the LaTeX formatted string based on the provided format
latex_output = ""
for item in data:
    latex_output += "\\textbf{Problem:} \\lstinline|{}|\n".format(item.get("problem_name", ""))
    latex_output += "\\textbf{Signature:} \\lstinline|{}|\n".format(item.get("signature", ""))
    latex_output += "\\textbf{Clues:}\n\\begin{enumerate}\n"
    for tag in item.get("tags", []):
        latex_output += "\\item {}\n".format(tag)
    latex_output += "\\end{enumerate}\n\n"

# Display the result
latex_output
```

```
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[1], line 15
     13 latex_output = ""
     14 for item in data:
---> 15     latex_output += "\\textbf{Problem:} \\lstinline|{}|\n".format(item.get("problem_name", ""))
     16     latex_output += "\\textbf{Signature:} \\lstinline|{}|\n".format(item.get("signature", ""))
     17     latex_output += "\\textbf{Clues:}\n\\begin{enumerate}\n"

KeyError: 'Problem'
```

```python
# Let's first inspect the structure of the JSON data to understand the issue.
data[:2]  # Inspect first two items for structure
```

```
[{'signature': 'def add_int(lst, num):',
  'tags': ['edge case of list in list',
   'concatenate num to strings',
   'add num to integers',
   'return list'],
  'tests': [{'input': "['hello', 'aaa', 'bbb', 'ccc', 1, 4], 5",
    'output': "['hello5', 'aaa5', 'bbb5', 'ccc5', 6, 9]"},
   {'input': "[8, 56, 8, '5', '4'], 1", 'output': "[9, 57, 9, '51', '41']"},
   {'input': "['hello', 'aaa', 'bbb', ['list'], 1, 4], 5",
    'output': "['hello5', 'aaa5', 'bbb5', ['list'], 6, 9]"}],
  'working_description': 'Given a list lst of strings and integers and integer num, concatenates num to the end of strings in lst and adds it to integers in lst.',
  'problem_name': 'add_int'},
 {'signature': 'def add_up(arr):',
  'tags': ['2D array',
   'sum integer',
   'sum float',
   'return the sum of all elements',
   'mention 0 base case',
   'misdirection - add number within string'],
  'tests': [{'input': "[ [ 'one', 2, 3, 4 ] ]", 'output': '9'},
   {'input': "[ [ 3, 'hello', 3.4 ] ]", 'output': '6.4'},
   {'input': '[ [ ] ]', 'output': '0'}],
  'working_description': 'Add up all the elements in 2D array only if they are integer or float',
  'problem_name': 'add_up'}]
```

...

The error message might be misleading. Since the dictionary key is confirmed to be correct, the issue may lie in how I am formatting or processing the data. I will try an alternative approach to debug it further.


...

```python
# Clean up the formatting process and add more clarity
latex_output = ""
for idx, item in enumerate(data):
    # Use get to avoid KeyErrors
    problem_name = item.get("problem_name", f"Problem_{idx}")
    signature = item.get("signature", "Unknown Signature")
    tags = item.get("tags", [])
...
```