---
title: "License to Call: Introducing Transformers Agents 2.0"
thumbnail: /blog/assets/agents/thumbnail.png
authors:
  - user: m-ric
  - user: lysandre
  - user: pcuenq
---

# License to Call: Introducing Transformers Agents 2.0

## TL;DR

We are releasing Transformers Agents 2.0!

⇒ 🎁 On top of our existing agent type, we introduce two new agents that **can iterate based on past observations to solve complex tasks**.

⇒ 💡 We aim for the code to be **clear and modular, and for common attributes like the final prompt and tools to be transparent**.

⇒ 🤝 We add **sharing options** to boost community agents.

⇒ 💪 **Extremely performant new agent framework**, allowing a Llama-3-70B-Instruct agent to outperform GPT-4 based agents in the GAIA Leaderboard!

🚀 Go try it out and climb ever higher on the GAIA leaderboard!

## Table of Contents

- [What is an agent?](#what-is-an-agent)
- [The Transformers Agents approach](#the-transformers-agents-approach)
    - [Main elements](#main-elements)
- [Example use-cases](#example-use-cases)
    - [Self-correcting Retrieval-Augmented-Generation](#self-correcting-retrieval-augmented-generation)
    - [Using a simple multi-agent setup 🤝 for efficient web browsing](#using-a-simple-multi-agent-setup-for-efficient-web-browsing)
- [Testing our agents](#testing-our-agents)
    - [Benchmarking LLM engines](#benchmarking-llm-engines)
    - [Climbing up the GAIA Leaderboard with a multi-modal agent](#climbing-up-the-gaia-leaderboard-with-a-multi-modal-agent)
- [Conclusion](#conclusion)

## What is an agent?

Large Language Models (LLMs) can tackle a wide range of tasks, but they often struggle with specific tasks like logic, calculation, and search. When prompted in domains where they do not perform well, they frequently fail to generate a correct answer.

One approach to overcome this weakness is to create an **agent**, which is just a program driven by an LLM. The agent is empowered by **tools** to help it perform actions. When the agent needs a specific skill to solve a particular problem, it relies on an appropriate tool from its toolbox.

Experimentally, agent frameworks generally work very well, achieving state-of-the-art performance on several benchmarks. For instance, have a look at [the top submissions for HumanEval](https://paperswithcode.com/sota/code-generation-on-humaneval): they are agent systems.

## The Transformers Agents approach

Building agent workflows is complex, and we feel these systems need a lot of clarity and modularity. We launched Transformers Agents one year ago, and we’re doubling down on our core design goals.

Our framework strives for:

- **Clarity through simplicity:** we reduce abstractions to the minimum. Simple error logs and accessible attributes let you easily inspect what’s happening and give you more clarity.
- **Modularity:** We prefer to propose building blocks rather than full, complex feature sets. You are free to choose whatever building blocks are best for your project.
    - For instance, since any agent system is just a vehicle powered by an LLM engine, we decided to conceptually separate the two, which lets you create any agent type from any underlying LLM.

On top of that, we have **sharing features** that let you build on the shoulders of giants!

### Main elements

- `Tool`: this is the class that lets you use a tool or implement a new one. It is composed mainly of a callable `forward` method that executes the tool action, and a few essential attributes: `name`, `description`, `inputs` and `output_type`. These attributes are used to dynamically generate a usage manual for the tool and insert it into the LLM’s prompt.
- `Toolbox`: It's a set of tools that are provided to an agent as resources to solve a particular task. For performance reasons, tools in a toolbox are already instantiated and ready to go. This is because some tools take time to initialize, so it’s usually better to re-use an existing toolbox and just swap one tool, rather than re-building a set of tools from scratch at each agent initialization.
- `CodeAgent`: a very simple agent that generates its actions as one single blob of Python code. It will not be able to iterate on previous observations.
- `ReactAgent`: ReAct agents follow a cycle of Thought ⇒ Action ⇒ Observation until they’ve solved the task. We propose two classes of ReactAgent:
    - `ReactCodeAgent` generates its actions as Python blobs.
    - `ReactJsonAgent` generates its actions as JSON blobs.
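
As a rough illustration of how a tool's attributes could be turned into a usage manual for the prompt, here is a minimal sketch. `SketchTool` and `usage_manual` are hypothetical names made up for this example; this is not the library's `Tool` implementation, and the rendered layout is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class SketchTool:
    """Hypothetical stand-in for a tool; not the library's `Tool` class."""

    name: str
    description: str
    inputs: Dict[str, Dict[str, str]]
    output_type: str
    forward: Callable[..., object]  # the callable that performs the tool action

    def usage_manual(self) -> str:
        # Render the attributes as text that could be inserted into an LLM prompt.
        return (
            f"- {self.name}: {self.description}\n"
            f"    Takes inputs: {self.inputs}\n"
            f"    Returns: {self.output_type}"
        )


def add(a: float, b: float) -> float:
    return a + b


adder = SketchTool(
    name="adder",
    description="Adds two numbers.",
    inputs={
        "a": {"type": "number", "description": "The first number."},
        "b": {"type": "number", "description": "The second number."},
    },
    output_type="number",
    forward=add,
)

manual = adder.usage_manual()
print(manual)
```

The point of the design is that the LLM never sees the tool's code, only this generated description, so clear `name`, `description`, and `inputs` values matter.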

Check out [the documentation](https://huggingface.co/docs/transformers/en/main_classes/agent) to learn how to use each component!

How do agents work under the hood?

In essence, what an agent does is “allowing an LLM to use tools”. Agents have a key `agent.run()` method that:

- Provides information about tool usage to your LLM in a **specific prompt**. This way, the LLM can select tools to run to solve the task.
- **Parses** the tool calls from the LLM output (can be via code, JSON format, or any other format).
- **Executes** the calls.
- If the agent is designed to iterate on previous outputs, it **keeps a memory** with previous tool calls and observations. This memory can be more or less fine-grained depending on how long-term you want it to be.
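
The four steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the actual `agent.run()` code: `run_agent`, `scripted_llm`, and `calculator` are hypothetical names, the LLM is replaced by a scripted function, and the `action` / `action_input` JSON keys are an assumed call format.

```python
import json
from typing import Callable, Dict, List


def run_agent(
    task: str,
    llm: Callable[[str], str],
    tools: Dict[str, Callable[..., str]],
    max_iterations: int = 5,
) -> str:
    memory: List[str] = []  # previous tool calls and their observations
    for _ in range(max_iterations):
        # 1. Build a prompt that lists the tools, the task, and the memory so far.
        prompt = f"Tools: {sorted(tools)}\nTask: {task}\n" + "\n".join(memory)
        # 2. Parse the tool call from the LLM output (a JSON blob in this sketch).
        call = json.loads(llm(prompt))
        if call["action"] == "final_answer":
            return call["action_input"]["answer"]
        # 3. Execute the call.
        observation = tools[call["action"]](**call["action_input"])
        # 4. Store the call and its observation for the next iteration.
        memory.append(f"Called {call['action']}; observation: {observation}")
    return "No answer found within the iteration budget."


def scripted_llm(prompt: str) -> str:
    # Stand-in for a real LLM: call the calculator first, then report its result.
    if "observation" not in prompt:
        return json.dumps(
            {"action": "calculator", "action_input": {"expression": "6 * 7"}}
        )
    result = prompt.rsplit("observation: ", 1)[1]
    return json.dumps({"action": "final_answer", "action_input": {"answer": result}})


def calculator(expression: str) -> str:
    # Toy tool that only supports "<int> * <int>".
    left, operator, right = expression.split()
    assert operator == "*", "this toy calculator only multiplies"
    return str(int(left) * int(right))


answer = run_agent("What is 6 times 7?", scripted_llm, {"calculator": calculator})
print(answer)
```

An agent that does not iterate would stop after the first execution step; the memory list is what lets a multi-step agent build on earlier observations.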

<p align="center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/agents/agent_single_multistep.png" alt="graph of agent workflows" width=90%>
</p>


For more general context about agents, you could read [this excellent blog post](https://lilianweng.github.io/posts/2023-06-23-agent/) by Lilian Weng or [our earlier blog post](https://huggingface.co/blog/open-source-llms-as-agents) about building agents with LangChain.


For a deeper dive into our package, take a look at the [agents documentation](https://huggingface.co/docs/transformers/en/transformers_agents).


## Example use cases

To get early access to this feature, first install `transformers` from its `main` branch:
```
pip install "git+https://github.com/huggingface/transformers.git#egg=transformers[agents]"
```
Agents 2.0 will be released in the v4.41.0 version, landing mid-May.


### Self-correcting Retrieval-Augmented-Generation

Quick definition: Retrieval-Augmented-Generation (RAG) is “using an LLM to answer a user query, but basing the answer on information retrieved from a knowledge base”. It has many advantages over using a vanilla or fine-tuned LLM: to name a few, it grounds the answer in true facts and reduces confabulations, it provides the LLM with domain-specific knowledge, and it allows fine-grained control of access to information from the knowledge base.

Let’s say we want to perform RAG, and some parameters must be dynamically generated. For example, depending on the user query, we might want to restrict the search to specific subsets of the knowledge base, or adjust the number of documents retrieved. The difficulty is: how do we dynamically adjust these parameters based on the user query?

Well, we can do this by giving our agent access to these parameters!
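
To make these two parameters concrete, here is a toy retriever written in plain Python. Everything in it is an assumption for illustration: the three-entry `KNOWLEDGE_BASE` is made up, `retrieve` is a hypothetical function, and word overlap stands in for the embedding similarity used in the rest of this post.

```python
from typing import Dict, List, Optional

# Made-up knowledge base: each entry has a source and some text.
KNOWLEDGE_BASE: List[Dict[str, str]] = [
    {"source": "peft", "text": "LoRA fine-tuning adds low-rank adapters to a model."},
    {"source": "datasets", "text": "Load a dataset with the load_dataset function."},
    {"source": "peft", "text": "Prompt tuning learns soft prompts for a frozen model."},
]


def retrieve(query: str, source: Optional[str] = None, k: int = 2) -> List[str]:
    """Return up to k texts, optionally restricted to a single source."""
    query_words = set(query.lower().split())
    # Parameter 1: restrict the search to a subset of the knowledge base.
    candidates = [
        doc for doc in KNOWLEDGE_BASE if source is None or doc["source"] == source
    ]
    # Rank by the number of words shared with the query (toy similarity).
    ranked = sorted(
        candidates,
        key=lambda doc: len(query_words & set(doc["text"].lower().split())),
        reverse=True,
    )
    # Parameter 2: adjust the number of documents retrieved.
    return [doc["text"] for doc in ranked[:k]]


print(retrieve("LoRA fine-tuning", source="peft", k=1))
```

An agent that is handed `source` and `k` as tool inputs can choose their values per query instead of relying on fixed defaults.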

Let's set up this system.

Run the line below to install the required dependencies:
```
pip install langchain sentence-transformers faiss-cpu
```

We first load a knowledge base on which we want to perform RAG: this dataset is a compilation of the documentation pages for many `huggingface` packages, stored as markdown.

```python
import datasets

knowledge_base = datasets.load_dataset("m-ric/huggingface_doc", split="train")
```

Now we prepare the knowledge base by processing the dataset and storing it into a vector database to be used by the retriever. We are going to use LangChain, since it features excellent utilities for vector databases:

```python
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

# Wrap each page in a Document, keeping the package name as its "source".
source_docs = [
    Document(page_content=doc["text"], metadata={"source": doc["source"].split("/")[1]})
    for doc in knowledge_base
]

# Split the pages into chunks and keep only the first 1000 chunks.
docs_processed = RecursiveCharacterTextSplitter(chunk_size=500).split_documents(
    source_docs
)[:1000]

# "mps" targets Apple Silicon; use "cuda" or "cpu" on other hardware.
embedding_model = HuggingFaceEmbeddings(
    model_name="thenlper/gte-small",
    model_kwargs={"device": "mps"},
    encode_kwargs={"device": "mps"},
)
vectordb = FAISS.from_documents(documents=docs_processed, embedding=embedding_model)
```
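
To give an intuition for what the retriever does with this vector database, here is a toy nearest-neighbor search in plain Python. It is only a sketch under stated assumptions: the three-dimensional vectors and chunk names are made up, `cosine_similarity` and `top_k` are hypothetical helpers, and cosine similarity is used as an illustrative measure that may differ from the distance the vector store actually uses.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vector: Sequence[float], index: Dict[str, Sequence[float]], k: int = 2) -> List[str]:
    # Rank every stored chunk by its similarity to the query vector.
    ranked = sorted(
        index,
        key=lambda name: cosine_similarity(query_vector, index[name]),
        reverse=True,
    )
    return ranked[:k]


# Made-up embeddings: real ones have hundreds of dimensions.
toy_index = {
    "chunk about LoRA": [0.9, 0.1, 0.0],
    "chunk about datasets": [0.0, 1.0, 0.0],
    "chunk about Gradio": [0.0, 0.1, 0.9],
}

closest = top_k([1.0, 0.2, 0.0], toy_index, k=2)
print(closest)
```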



```python
all_sources = list(set([doc.metadata["source"] for doc in docs_processed]))
print(all_sources)
```

```
['hub-docs', 'diffusers', 'datasets-server', 'blog', 'transformers', 'deep-rl-class', 'peft', 'hf-endpoints-documentation', 'datasets', 'course', 'optimum', 'gradio', 'evaluate', 'pytorch-image-models']
```


```python
import json
from transformers.agents import Tool
from langchain_core.vectorstores import VectorStore

class RetrieverTool(Tool):
    name = "retriever"
    description = "Retrieves some documents from the knowledge base that have the closest embeddings to the input query."
    inputs = {
        "query": {
            "type": "text",
            "description": "The query to perform. This should be semantically close to your target documents. Use the affirmative form rather than a question.",
        },
        "source": {
            "type": "text",
            "description": ""
        },
    }
    output_type = "text"

    def __init__(self, vectordb: VectorStore, all_sources: list, **kwargs):
        super().__init__(**kwargs)
        self.vectordb = vectordb
        # Tell the LLM which sources exist by filling in the input description.
        self.inputs["source"]["description"] = (
            f"The source of the documents to search, as a str representation of a list. Possible values in the list are: {all_sources}. If this argument is not provided, all sources will be searched."
        )

    def forward(self, query: str, source: str = None) -> str:
        assert isinstance(query, str), "Your search query must be a string"

        if source:
            if isinstance(source, str) and "[" not in str(source): # if the source is not representing a list
                source = [source]
            # Turn the str representation of a list into an actual list.
            source = json.loads(str(source).replace("'", '"'))

        docs = self.vectordb.similarity_search(query, filter=({"source": source} if source else None), k=3)

        if len(docs) == 0:
            return "No documents found with this filtering. Try removing the source filter."
        return "Retrieved documents:\n\n" + "\n===Document===\n".join(
            [doc.page_content for doc in docs]
        )
```

```python
import os

from dotenv import load_dotenv

# Load environment variables (such as API keys) from a local .env file.
load_dotenv()
```

```python
from langchain_groq import ChatGroq

llm_engine = ChatGroq(temperature=0, model_name="llama3-70b-8192")
```

```python
from transformers.agents import HfEngine, ReactJsonAgent

# llm_engine = HfEngine("meta-llama/Meta-Llama-3-70B-Instruct")

agent = ReactJsonAgent(
    tools=[RetrieverTool(vectordb, all_sources)],
    llm_engine=llm_engine
)

agent_output = agent.run("Please show me a LORA finetuning script")

print("Final output:")
print(agent_output)
```


```
Please show me a LORA finetuning script
Error in generating llm output: Got unknown type {'role': <MessageRole.SYSTEM: 'system'>, 'content': 'You will be given a task to solve as best you can. You have access to the following tools:\n\n- retriever: Retrieves some documents from the knowledge base that have the closest embeddings to the input query.\n    Takes inputs: {\'query\': {\'type\': \'text\', \'description\': \'The query to perform. This should be semantically close to your target documents. Use the affirmative form rather than a question.\'}, \'source\': {\'type\': \'text\', \'description\': "The source of the documents to search, as a str representation of a list. Possible values in the list are: [\'hub-docs\', \'diffusers\', \'datasets-server\', \'blog\', \'transformers\', \'deep-rl-class\', \'peft\', \'hf-endpoints-documentation\', \'datasets\', \'course\', \'optimum\', \'gradio\', \'evaluate\', \'pytorch-image-models\']. If this argument is not provided, 

Final output:
Error in generating final llm output: Got unknown type {'role': <MessageRole.SYSTEM: 'system'>, 'content': "An agent tried to answer a user query but it failed to do so. You are tasked with providing an answer instead. Here is the agent's memory:"}.
```


```python
llm_engine = HfEngine("meta-llama/Meta-Llama-3-70B-Instruct")
```