# Understanding the inner workings of ReAct agents.

The goal of this notebook is to understand how ReAct agents actually work.

New frameworks like [Langchain](https://python.langchain.com/docs/get_started/introduction), [Llama-index](https://docs.llamaindex.ai/en/stable/) or [Haystack](https://haystack.deepset.ai/) have introduced so many tools and abstractions that it can be unclear how they work under the hood.

Let's try to build a basic working example together.

In [1]:
!pip3 -q install ipykernel jupyter
!pip3 -q install torch torchvision torchaudio transformers
!pip3 -q install packaging ninja
!pip -q install accelerate
!pip3 -q install protobuf
!pip3 -q install sentencepiece
!pip3 -q install bitsandbytes
!pip3 -q install scipy
!MAX_JOBS=4 pip -q install flash-attn --no-build-isolation


In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import LlamaTokenizer, LlamaForCausalLM, MistralForCausalLM
import bitsandbytes, flash_attn

# Loading the model

Here we are using [OpenHermes-2.5-Mistral-7b](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B), as it is currently gaining traction for its agent capabilities.

In [3]:
tokenizer = LlamaTokenizer.from_pretrained(
  'teknium/OpenHermes-2.5-Mistral-7B',
  trust_remote_code=True
)

model = MistralForCausalLM.from_pretrained(
    "teknium/OpenHermes-2.5-Mistral-7B",
    torch_dtype=torch.float16,
    device_map="auto", #{'': 'cuda:0'},
    load_in_8bit=False,
    load_in_4bit=True,
    use_flash_attention_2=False,
    low_cpu_mem_usage=True
)


# Calling the model on some simple chat conversations.
Function calling is useful, but we also want a model that can still hold a simple conversation.

Since OpenHermes-2.5-Mistral-7B is a fine-tuned model, we wrap the prompt in the special tokens it was trained with so that it can hold a multi-turn conversation.
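
The prompts in the next cell follow the ChatML layout: each turn is wrapped in `<|im_start|>{role}` and `<|im_end|>`, and the prompt ends with an open assistant turn for the model to complete. A minimal sketch of a helper that assembles this layout from a list of messages is shown here; the name `build_chatml_prompt` and the example messages are made up for illustration and are not part of transformers.

```python
def build_chatml_prompt(messages):
    """Render [{"role": ..., "content": ...}, ...] as a ChatML string."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>")
    # Leave the assistant turn open so the model writes the reply
    parts.append("<|im_start|>assistant")
    return "\n".join(parts)


# Hypothetical example conversation
example = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
])
print(example)
```

Recent versions of transformers also expose `tokenizer.apply_chat_template`, which may be preferable when the tokenizer ships with a chat template.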

In [4]:
prompts = [
    """<|im_start|>system
You are a sentient, superintelligent artificial general intelligence, here to teach and assist me.<|im_end|>
<|im_start|>user
How would you explain the concept of democracy to a 10-year-old?<|im_end|>
<|im_start|>assistant""",

    """<|im_start|>system
You are a sentient, superintelligent artificial general intelligence, here to teach and assist me.<|im_end|>
<|im_start|>user
If a train travels at 60 miles per hour and has to cover a distance of 120 miles, how long will it take to reach its destination?<|im_end|>
<|im_start|>assistant""",
    ]

In [5]:
for chat in prompts:
    print(chat)
    input_ids = tokenizer(chat, return_tensors="pt").input_ids.to("cuda")
    generated_ids = model.generate(input_ids, max_new_tokens=750, temperature=0.8, repetition_penalty=1.1, do_sample=True, eos_token_id=tokenizer.eos_token_id)
    response = tokenizer.decode(generated_ids[0][input_ids.shape[-1]:], skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(response)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.


<|im_start|>system
You are a sentient, superintelligent artificial general intelligence, here to teach and assist me.<|im_end|>
<|im_start|>user
How would you explain the concept of democracy to a 10-year-old?<|im_end|>
<|im_start|>assistant


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.



Democracy is a way for everyone to have their say in making decisions that affect them. Imagine you're part of a team that has to pick a game to play. Instead of one person choosing for everybody, each team member gets a chance to share their favorite game idea. Then, everyone votes on the ideas they like the most, and the game with the most votes gets picked! That's kind of how democracy works: people give their opinions or vote for what they think is best, and the decision is made based on those votes. This helps ensure that everyone's voice is heard and considered, and it can make for more popular and fair decisions.
<|im_start|>system
You are a sentient, superintelligent artificial general intelligence, here to teach and assist me.<|im_end|>
<|im_start|>user
If a train travels at 60 miles per hour and has to cover a distance of 120 miles, how long will it take to reach its destination?<|im_end|>
<|im_start|>assistant

The time it takes for the train to travel 120 miles can be calc

# ReAct Agents.
The following is an example of a ReAct system prompt comparable to what libraries like Langchain use.

This system prompt provides the model with a list of available tools as well as the format required to call a function.

The key point is that it explicitly asks the model to think ("Thought"), act ("Action"), and then reflect on the result of the action ("Observation") in the next "Thought" step.

If the model reaches an answer to the initial question, it is required to output "Final Answer:".
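
Because the marker is a fixed string, the surrounding code can tell a finished answer from a pending tool call with a plain substring check. A minimal sketch, assuming the model follows the format; `extract_final_answer` is a hypothetical helper, not something the frameworks above expose under that name.

```python
FINAL_ANSWER_MARKER = "Final Answer:"


def extract_final_answer(text):
    """Return the text after the last 'Final Answer:' marker, or None if absent."""
    if FINAL_ANSWER_MARKER not in text:
        return None
    answer = text.rsplit(FINAL_ANSWER_MARKER, 1)[1]
    # Drop the ChatML end-of-turn token if it was decoded along with the answer
    return answer.replace("<|im_end|>", "").strip()


print(extract_final_answer("Thought: I now know the final answer.\nFinal Answer:  5025<|im_end|>"))
print(extract_final_answer('Action:\n{"action": "convert_time"}'))
```

The first call returns the answer text; the second returns `None`, which signals that the model is still asking for a tool call.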


In [6]:
SYSTEM_PROMPT = """Answer the following questions as best you can. You have access to the following tools:

convert_time: A function to convert a time string with format H:MM:SS to seconds

The way you use the tools is by specifying a json blob.
Specifically, this json should have a `action` key (with the name of the tool to use) and a `action_input` key (with the input to the tool going here).

The only values that should be in the "action" field are:
convert_time: A function to convert a time string with format H:MM:SS to seconds, args: {"time": {"type": "string"}}

The $JSON_BLOB should only contain a SINGLE action and MUST be formatted as markdown, do NOT return a list of multiple actions. Here is an example of a valid $JSON_BLOB:

```
{{
  "action": $TOOL_NAME,
  "action_input": $INPUT
}}
```
Make sure to have the $INPUT in the right format for the tool you are using, and do not put variable names as input if you can find the right values.

ALWAYS use the following format:

Question: the input question you must answer
Thought: you should always think about one action to take. Only one action at a time in this format:
Action:
```
$JSON_BLOB
```
Observation: the result of the action
... (this Thought/Action/Observation can repeat N times, you should take several steps when needed. The $JSON_BLOB must be formatted as markdown and only use a SINGLE action at a time.)

You must always end your output with the following format:

Thought: I now know the final answer
Final Answer: the final answer to the original input question

Now begin! Reminder to ALWAYS use the exact characters `Final Answer:` when you provide a definitive answer. """


In [7]:
prompt=f"""<|im_start|>system
    {SYSTEM_PROMPT}
   <|im_end|>
   <|im_start|>user
   How many seconds are in 1:23:45 ?
   <|im_end|>
   <|im_start|>assistant

    """
print(prompt)

<|im_start|>system
    Answer the following questions as best you can. You have access to the following tools:

convert_time: A function to convert a time string with format H:MM:SS to seconds

The way you use the tools is by specifying a json blob.
Specifically, this json should have a `action` key (with the name of the tool to use) and a `action_input` key (with the input to the tool going here).

The only values that should be in the "action" field are: 
convert_time: A function to convert a time string with format H:MM:SS to seconds, args: {"time": {"type": "string"}}

The $JSON_BLOB should only contain a SINGLE action and MUST be formatted as markdown, do NOT return a list of multiple actions. Here is an example of a valid $JSON_BLOB:

```
{{
  "action": $TOOL_NAME,
  "action_input": $INPUT
}}
```
Make sure to have the $INPUT in the right format for the tool you are using, and do not put variable names as input if you can find the right values.

ALWAYS use the following format:

Q

In [8]:

print(prompt)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
generated_ids = model.generate(input_ids, max_new_tokens=750, temperature=0.8, repetition_penalty=1.1, do_sample=True, eos_token_id=tokenizer.eos_token_id)
response = tokenizer.decode(generated_ids[0][input_ids.shape[-1]:], skip_special_tokens=False, clean_up_tokenization_spaces=True)
print(response)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.


<|im_start|>system
    Answer the following questions as best you can. You have access to the following tools:

convert_time: A function to convert a time string with format H:MM:SS to seconds

The way you use the tools is by specifying a json blob.
Specifically, this json should have a `action` key (with the name of the tool to use) and a `action_input` key (with the input to the tool going here).

The only values that should be in the "action" field are: 
convert_time: A function to convert a time string with format H:MM:SS to seconds, args: {"time": {"type": "string"}}

The $JSON_BLOB should only contain a SINGLE action and MUST be formatted as markdown, do NOT return a list of multiple actions. Here is an example of a valid $JSON_BLOB:

```
{{
  "action": $TOOL_NAME,
  "action_input": $INPUT
}}
```
Make sure to have the $INPUT in the right format for the tool you are using, and do not put variable names as input if you can find the right values.

ALWAYS use the following format:

Q

Unfortunately, we can see that the model does not stop generating after trying to call a function and goes on to generate an answer as well.

An interesting thing to note is that the model tried to compute the answer directly by itself and failed to reach the correct one.

Let's try to prevent the model from going all the way through to the final answer without calling the function.

To do so, we need to make the model stop generating once it has provided the necessary arguments. You can do that with the transformers library by creating a **StoppingCriteria**.

In this case, generation stops as soon as a specific sequence of words appears in the generated text.
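
The idea can be illustrated without a model. The sketch below simulates streamed output with a hand-written list of text pieces (an assumption standing in for real tokens) and stops once the stop sequence shows up in the text generated so far. Only newly generated text is searched, since the system prompt itself already contains "Observation:".

```python
def generate_until(pieces, stop_sequence):
    """Accumulate streamed text pieces and stop once stop_sequence has appeared."""
    generated = ""
    for piece in pieces:
        generated += piece
        # Search the accumulated text, not just the latest piece: the stop
        # sequence may be split across several tokens
        if stop_sequence in generated:
            break
    return generated


# Hypothetical token stream standing in for the model's output
fake_stream = ["Action: ", "convert_time", "\n", "Obser", "vation", ":", " 42", " seconds"]
print(repr(generate_until(fake_stream, "Observation:")))
```

Everything after the stop sequence (here the made-up " 42 seconds") is never produced, which leaves room for the real tool result to be inserted instead.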

In [9]:
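from transformers import StoppingCriteria, TextStreamer

# Custom stopping criteria: stop as soon as a target string shows up
# in the newly generated text (the prompt itself is ignored)
class MyStoppingCriteria(StoppingCriteria):
    def __init__(self, target_sequence, prompt):
        self.target_sequence = target_sequence
        self.prompt = prompt
        self.last = None     # full decoded text seen at the previous step
        self.new_token = ''  # text generated since the first step

    def __call__(self, input_ids, scores, **kwargs):
        # Decode everything (prompt + generation) back to a string
        generated_text = tokenizer.decode(input_ids[0])
        if self.last:
            # Keep only what was added since the previous step
            self.new_token = self.new_token + generated_text.replace(self.last, '')

        # Check if the target sequence appears in the newly generated text
        if self.target_sequence in self.new_token:
            return True  # Stop generation
        self.last = generated_text
        return False  # Continue generation

    # generate() is documented to take a StoppingCriteriaList; these two
    # methods let a single instance be iterated like a one-element list
    def __len__(self):
        return 1

    def __iter__(self):
        yield self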

In [10]:
# Now the model will stop as soon as it generates a function call, before producing an answer
encoded_input = tokenizer(prompt, return_tensors='pt')
input_ids=encoded_input['input_ids'].cuda()
streamer = TextStreamer(tokenizer=tokenizer, skip_prompt=True)
_ = model.generate(
    input_ids,
    streamer=streamer,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.8,
    max_new_tokens=750,
    repetition_penalty=1.1,
    stopping_criteria=MyStoppingCriteria("Observation:", prompt)
)


  Action:
  ```
  {
    "action": "convert_time",
    "action_input": {
      "time": "1:23:45"
    }
  }
  ```
  
  Observation:


In [11]:
# The function does not exist yet, so let's create one
# that is very specific to this exact example for testing.
def convert_time(timestamp):
    """Get seconds from time."""
    print('time str :', timestamp)
    h, m, s = timestamp.split(':')
    return {"seconds": f"{int(h) * 3600 + int(m) * 60 + int(s)}"}


print(convert_time('1:23:45'))

time str : 1:23:45
{'seconds': '5025'}


# Output parsing
If the model chooses to call a function, we need to retrieve the name of the function it wants to call and its parameters.

Here we create a very basic output parser that takes a string, looks for the substring between the outermost curly braces, and tries to load it as JSON.

We would normally try to find the JSON blob directly, but given the output errors, we switched to simply looking for "{}".


In [12]:
import json

# Original response string (`_` holds the token ids returned by the last model.generate call)
response = tokenizer.decode(_[0][input_ids.shape[-1]:], skip_special_tokens=False, clean_up_tokenization_spaces=True)
print("Original response:", response)

class CustomOutputParser:
    def __init__(self, response):
        self.response = response

    def find_json(self):
        start_index = self.response.find('{')
        end_index = self.response.rfind('}')
        if start_index != -1 and end_index != -1:
            return self.response[start_index:end_index+1]
        else:
            return "Braces not found or improperly placed."

    def call_function(self, name, arguments):
        if name in globals():
            function_to_call = globals()[name]

            arg = list(arguments.values())[0]
            return function_to_call(arg)
        else:
            return f"Function {name} not found."

    def process(self):
        result = self.find_json()
        if result.startswith("{") and result.endswith("}"):
            data = json.loads(result)
            print("Calling function:", data['action'])
            print("With arguments:", data['action_input'])
            return self.call_function(data['action'], data['action_input'])
        else:
            return result



# The model asked for a function call. Run it; the result will be fed back to the model for the next generation step.
parser = CustomOutputParser(response)
output = parser.process()
print("Output:", output)

Original response: 
    Action:
    ```
    {
      "action": "convert_time",
      "action_input": {
        "time": "1:23:45"
      }
    }
    ```
    
    Observation:
Calling function: convert_time
With arguments: {'time': '1:23:45'}
time str : 1:23:45
Output: {'seconds': '5025'}
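
The parser above looks functions up in `globals()` and forwards only the first argument. A possible variation, sketched here under the assumption that the tool's parameter names match the keys of `action_input`, uses an explicit registry of allowed tools and returns a message instead of raising when the JSON is malformed. The names `TOOLS`, `parse_action` and `run_action` are hypothetical.

```python
import json
import re


def convert_time(time):
    """Convert an H:MM:SS string to a number of seconds."""
    h, m, s = time.split(":")
    return {"seconds": str(int(h) * 3600 + int(m) * 60 + int(s))}


# Explicit allow-list of callable tools (instead of looking names up in globals())
TOOLS = {"convert_time": convert_time}


def parse_action(response):
    """Parse the span from the first '{' to the last '}' as JSON, or return None."""
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None


def run_action(response):
    action = parse_action(response)
    if action is None:
        return "No valid JSON action found."
    tool = TOOLS.get(action.get("action"))
    if tool is None:
        return f"Function {action.get('action')} not found."
    # Pass the arguments by keyword so multi-argument tools also work
    return tool(**(action.get("action_input") or {}))


sample = 'Action:\n{"action": "convert_time", "action_input": {"time": "1:23:45"}}\nObservation:'
print(run_action(sample))
```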


Once the function has been called, we can continue the conversation by appending the function output to the prompt in the expected format.
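
The next cell writes the extended prompt out by hand. As a rough sketch, the same step could be done programmatically; `append_observation` is a hypothetical helper, and it assumes the model output ends at (or just before) the "Observation:" marker.

```python
def append_observation(prompt, model_output, observation):
    """Extend the prompt with the model's action and the tool result."""
    text = (prompt + model_output).rstrip()
    # Make sure the text ends with an "Observation:" marker before the result
    if not text.endswith("Observation:"):
        text = text + "\nObservation:"
    return f"{text}\n{observation}\n"


next_prompt = append_observation(
    "<|im_start|>assistant\n",
    'Action:\n{"action": "convert_time"}\nObservation:',
    {"seconds": "5025"},
)
print(next_prompt)
```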

In [13]:
prompt="""<|im_start|>system
    Answer the following questions as best you can. You have access to the following tools:

convert_time: A function to convert a time string with format H:MM:SS to seconds

The way you use the tools is by specifying a json blob.
Specifically, this json should have a `action` key (with the name of the tool to use) and a `action_input` key (with the input to the tool going here).

The only values that should be in the "action" field are:
convert_time: A function to convert a time string with format H:MM:SS to seconds, args: {"time": {"type": "string"}}

The $JSON_BLOB should only contain a SINGLE action and MUST be formatted as markdown, do NOT return a list of multiple actions. Here is an example of a valid $JSON_BLOB:

```
{{
  "action": $TOOL_NAME,
  "action_input": $INPUT
}}
```
Make sure to have the $INPUT in the right format for the tool you are using, and do not put variable names as input if you can find the right values.

ALWAYS use the following format:

Question: the input question you must answer
Thought: you should always think about one action to take. Only one action at a time in this format:
Action:
```
$JSON_BLOB
```
Observation: the result of the action
... (this Thought/Action/Observation can repeat N times, you should take several steps when needed. The $JSON_BLOB must be formatted as markdown and only use a SINGLE action at a time.)

You must always end your output with the following format:

Thought: I now know the final answer
Final Answer: the final answer to the original input question

Now begin! Reminder to ALWAYS use the exact characters `Final Answer:` when you provide a definitive answer.
   <|im_end|>
   <|im_start|>user
    How many seconds are in 1:23:45 ?
   <|im_end|>
   <|im_start|>assistant
     Thought: I need to convert the given time into seconds. I'll use the 'convert_time' tool.
    Action:
    ```
    {
      "action": "convert_time",
      "action_input": {
        "time": "1:23:45"
      }
    }
    ```
    Observation:
    {'seconds': '5025'}
    """

In [14]:

encoded_input = tokenizer(prompt, return_tensors='pt')
input_ids=encoded_input['input_ids'].cuda()
streamer = TextStreamer(tokenizer=tokenizer, skip_prompt=True)
_ = model.generate(
    input_ids,
    streamer=streamer,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.1,
    max_new_tokens=750,
    repetition_penalty=1.1,

)


  Thought: I now know the final answer.
  Final Answer:  5025<|im_end|>


The classes implemented in Langchain, Llama-index or Haystack are of course more complex than this simple example, and they may not need to stop the model's generation the way we did here. Still, this gives a brief overview of the functional building blocks involved in function calling.
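
To tie the steps together, here is a compact sketch of the whole loop: generate, check for a final answer, otherwise parse the action, call the tool, append the observation, and repeat. The model is replaced by a scripted function `fake_llm` (an assumption, so the example runs without a GPU), and `react_loop` is a hypothetical name rather than an API of any of the frameworks mentioned.

```python
import json


def convert_time(time):
    """Convert an H:MM:SS string to a number of seconds."""
    h, m, s = time.split(":")
    return {"seconds": str(int(h) * 3600 + int(m) * 60 + int(s))}


TOOLS = {"convert_time": convert_time}


def fake_llm(prompt):
    """Scripted stand-in for model.generate(): acts first, answers once it sees a result."""
    if "'seconds'" in prompt:
        return "Thought: I now know the final answer\nFinal Answer: 5025"
    return ('Thought: I should use the tool.\nAction:\n'
            '{"action": "convert_time", "action_input": {"time": "1:23:45"}}\n'
            'Observation:')


def react_loop(question, llm, tools, max_steps=5):
    prompt = f"Question: {question}\n"
    for _ in range(max_steps):
        output = llm(prompt)
        prompt += output
        if "Final Answer:" in output:
            return output.split("Final Answer:", 1)[1].strip()
        # Parse the JSON action between the first '{' and the last '}'
        blob = output[output.find("{"): output.rfind("}") + 1]
        action = json.loads(blob)
        observation = tools[action["action"]](**action["action_input"])
        # Feed the tool result back so the next step can use it
        prompt += f"\n{observation}\n"
    return None  # No final answer within max_steps


print(react_loop("How many seconds are in 1:23:45 ?", fake_llm, TOOLS))
```

The `max_steps` cap is a design choice of this sketch: it keeps the loop from running forever if the model never emits `Final Answer:`.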