# Agentic RAG with Hugging Face smolagents vs Vanilla RAG

Author: [@MariaKhalusova](https://x.com/mariaKhalusova)

Last updated: Jan 7th, 2025

## What you'll learn:

1. Parsing PDF documents from S3 into DataStax AstraDB with Unstructured Platform
2. Building Vanilla RAG in pure Python without using specialized frameworks
3. Differences between Vanilla RAG and Agentic RAG
4. Creating Agentic RAG with the Hugging Face `smolagents` library
5. Whether Agentic RAG can produce better answers (spoiler: it can!)

In Vanilla RAG, your system uses the user's question to perform a single retrieval step and get a batch of documents that are meant to be relevant to the query. These documents are then passed on to the LLM to generate an answer grounded in the context of those documents.

If the retrieval results are inadequate (irrelevant or incomplete), generation suffers directly. There are many ways to improve retrieval quality: choosing a better embedding model, switching to a different retrieval method (e.g. BM25, hybrid search, or metadata filtering), increasing the number of retrieved documents and adding a reranker, and so on. However, there may still be situations where a single retrieval step, or retrieving based on the user query "as is", does not produce optimal results.

In this tutorial, we will build a simple Agentic RAG application that uses a retriever as a tool and is able to a) reformulate the user query to improve the retrieval results, b) review the results, and c) retrieve more context if needed. This should allow the RAG application to better answer complex questions, for example, the ones that require query decomposition and multiple retrieval steps.
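
To make the a) / b) / c) loop concrete before introducing any framework, here is a minimal toy sketch in plain Python. It is not how `smolagents` is implemented: `run_agent_loop`, `toy_decide`, `toy_retrieve`, and `TOY_INDEX` are hypothetical names, and `toy_decide` is a hard-coded stand-in for the decisions an LLM would make.

```python
def run_agent_loop(question, decide, retrieve, max_steps=4):
    """Toy agent loop: `decide` stands in for the LLM and picks the next action."""
    context, tried = [], []
    for _ in range(max_steps):
        action = decide(question, context, tried)
        if action["type"] == "retrieve":
            # a) the query may be a reformulation, not the raw question
            tried.append(action["query"])
            # c) every retrieval adds to the context gathered so far
            context.extend(retrieve(action["query"]))
        else:
            # b) the policy reviewed the context and judged it sufficient
            return action["answer"], tried
    # step budget exhausted without a final answer
    return "I do not have enough context to answer the question.", tried


# Hypothetical stand-ins so the sketch runs without an LLM or a database.
TOY_INDEX = {"merchandise categories": ["Toy passage listing merchandise categories."]}

def toy_retrieve(query):
    return TOY_INDEX.get(query.lower(), [])

def toy_decide(question, context, tried):
    if context:
        return {"type": "final", "answer": f"Answer based on: {context[0]}"}
    if not tried:
        return {"type": "retrieve", "query": question}  # first attempt: query "as is"
    return {"type": "retrieve", "query": "merchandise categories"}  # reformulated query

answer, tried = run_agent_loop("What does the company sell?", toy_decide, toy_retrieve)
print(answer)
print(tried)
```

In this toy run the first retrieval returns nothing, so the policy reformulates the query and tries again before answering. A single-step pipeline would have stopped after the first, empty retrieval.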

There are several frameworks available for building agentic RAG. In this tutorial, we'll be using the latest library from Hugging Face called [`smolagents`](https://github.com/huggingface/smolagents). The library is lightweight and makes it easy to start building agentic applications, including but not limited to Agentic RAG.

## Preparing the data

Every RAG application starts with data, and most of the time - unstructured data (PDFs, Word documents, SharePoint files, emails, etc.). Preprocessing this type of data to make it available for retrieval can be a challenging task. [Unstructured Platform](https://unstructured.io/) significantly simplifies this process - it can connect to any data sources you may have in your organization, preprocess the data from those sources making it RAG-ready, and upload the results into your database of choice.

To start transforming your data with Unstructured Platform, you'll need to [sign up on the Unstructured For Developers page](https://unstructured.io/developers). Once you do, you can log into the Platform and process up to 1000 pages per day for free for the first 14 days.

In this tutorial, our data will consist of annual 10-K SEC filings from Walmart Inc., Chevron Corporation, and Costco Wholesale Corporation for the 2023 fiscal year. These reports offer a deep insight into each company's financial performance that year. The documents are originally in PDF format and we have them stored in an Amazon S3 bucket. After preprocessing, we'll store the document chunks with their embeddings in DataStax AstraDB for retrieval. Here is what we need to do to prepare the data:
* Create an S3 _source connector_ in Unstructured Platform to connect it to the documents
* Create an AstraDB _destination connector_ in Unstructured Platform to upload the processed documents
* Create a _workflow_ that starts with a source connector, adds data transformation steps (such as extracting content of the PDFs with Anthropic Claude Sonnet, enriching the documents with metadata, chunking the text, and generating embedding vectors for the similarity search), and then ends with uploading the results into the destination

Let's briefly go over these steps.

### Create an S3 source connector in Unstructured Platform

Log in to your Unstructured Platform account, click `Connectors` on the left side bar, make sure you have `Sources` selected, and click `New` to create a new source connector. Choose S3, and enter the required info about your bucket.

<img src="https://framerusercontent.com/images/I1hhUk4xRAheCxMOLgrXZZiO0.png" alt="S3 connector settings" width="500"/>

### Create an AstraDB destination connector in Unstructured Platform

Create an account on [datastax.com](https://www.datastax.com/), and create a new Serverless (Vector) Database. Once it's instantiated, grab your credentials (the API endpoint and an application token) and save them.

In the database, create a collection. Give it a name, then in the embedding generation method choose `Bring my own`, as we will generate the embeddings with Unstructured Platform. The dimensions value should be set to 3072 in this example, as we'll be using the `"text-embedding-3-large"` model from OpenAI.
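
If the collection's dimension and the embedding model's output size disagree, queries against the collection will not work. As a minimal sketch, a small guard like the one below could be applied to query vectors before they are sent to the database. `check_dimension` and `EXPECTED_DIMENSION` are hypothetical names introduced for illustration, not part of any library.

```python
EXPECTED_DIMENSION = 3072  # the collection dimension chosen above for "text-embedding-3-large"

def check_dimension(vector, expected=EXPECTED_DIMENSION):
    """Return the vector unchanged, or raise if its length does not match the collection."""
    if len(vector) != expected:
        raise ValueError(
            f"Embedding has {len(vector)} dimensions, but the collection expects {expected}"
        )
    return vector

# A vector of the right size passes through untouched.
vector = check_dimension([0.0] * 3072)
print(len(vector))
```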

Now you can create a destination connector for AstraDB in Unstructured Platform, similar to how you created the source connector.

<img src="https://framerusercontent.com/images/Szq022IHqD04mAjyIUlYgdNcVNM.png" alt="S3 connector settings" width="500"/>

<img src="https://framerusercontent.com/images/sJmB9GJ8JhZrPwIm6NccnP82GM.png" alt="S3 connector settings" width="500"/>





### Create a workflow in Unstructured Platform

Navigate to the `Workflows` tab in Unstructured Platform, and click `New workflow`. Choose the `Build it with Me` option to set up the workflow with pre-configured options.

First, choose your source and destination using the connectors that you've just created.

Next, select the "Platinum" workflow that will use Anthropic Claude Sonnet to preprocess the files:

<img src="https://framerusercontent.com/images/TRUyuKsfDzmjY5YSE76cdmwreLI.png" alt="S3 connector settings" width="500"/>

Optionally, set a schedule. In this example we don't need it.

That's it! Once the workflow is configured, run it, and wait for the job to finish. The documents will be processed, and written into AstraDB, where we can retrieve them from.

Now, let's build RAG!


## Setup

Run the line below to install required dependencies:

* smolagents: to configure agentic RAG
* astrapy: to connect to AstraDB and query it
* python-dotenv: to manage environment variables


In [1]:
!pip install --upgrade -q smolagents astrapy python-dotenv


Create a local `.env` file that contains the following environment variables, and upload it to your notebook's directory.

* `ASTRA_DB_APPLICATION_TOKEN`
* `ASTRA_DB_API_ENDPOINT`
* `ASTRA_DB_COLLECTION_NAME`
* `ASTRA_DB_NAMESPACE`
* `OPENAI_API_KEY`

In this tutorial, we've generated embeddings for the data using a model from OpenAI, so we need the key to embed the user queries. For convenience, we'll also use an LLM from OpenAI for generation, and it will be the same for Vanilla RAG and Agentic RAG.

In [3]:
import os
from dotenv import load_dotenv

def load_environment_variables(path_to_dot_env_file) -> None:
    """
    Load environment variables from .env file.
    Raises an error if critical environment variables are missing.
    """
    load_dotenv(path_to_dot_env_file)
    required_vars = [
        "ASTRA_DB_APPLICATION_TOKEN",
        "ASTRA_DB_API_ENDPOINT",
        "ASTRA_DB_COLLECTION_NAME",
        "ASTRA_DB_NAMESPACE",
        "OPENAI_API_KEY"
    ]

    for var in required_vars:
        if not os.getenv(var):
            raise ValueError(f"Missing required environment variable: {var}")

load_environment_variables('/content/.env')

## Set up AstraDB collection and OpenAI client

In [4]:
from openai import OpenAI
from astrapy import DataAPIClient

In [5]:
OPENAI_CLIENT = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
EMBEDDING_MODEL = "text-embedding-3-large"
GENERATION_MODEL = "gpt-3.5-turbo-0125"

In [6]:
def get_collection(collection_name: str, keyspace: str):
    """
    Establish connection to Astra DB and return the specified collection.
    Args:
        collection_name (str): Name of the collection to retrieve
        keyspace (str): Database keyspace
    Returns:
        Collection object from Astra DB
    """

    astra_client = DataAPIClient(os.getenv("ASTRA_DB_APPLICATION_TOKEN"))
    database = astra_client.get_database(os.getenv("ASTRA_DB_API_ENDPOINT"))

    astradb_collection = database.get_collection(name=collection_name,
                                                 keyspace=keyspace)

    print(f"Collection: {astradb_collection.full_name}\n")
    return astradb_collection


In [7]:
COLLECTION = get_collection(os.getenv("ASTRA_DB_COLLECTION_NAME"), os.getenv("ASTRA_DB_NAMESPACE"))

Collection: default_keyspace.pdf_vlm_collection



In [8]:
def get_embedding(text: str):
    """
    Generate embedding for given text using OpenAI's embedding model.

    Args:
        text (str): Input text to embed

    Returns:
        Embedding vector for the input text
    """
    return OPENAI_CLIENT.embeddings.create(
        input=text, model=EMBEDDING_MODEL
    ).data[0].embedding


## Vanilla RAG

For the Vanilla RAG we'll create a simple retriever that will use similarity search based on the query and return top N documents:  

In [18]:
def simple_retriever(query: str, n: int = 5) -> str:
    """
    Retrieve documents based on the given query using similarity search

    Args:
        query (str): query to pass to the DB
        n (int): Number of documents to retrieve

    Returns:
        A single string with the texts of the retrieved documents,
        each preceded by a "===== Document i =====" header
    """

    query_embedding = get_embedding(query)

    # sorting by "$vector" returns the documents closest to the query embedding first
    results = COLLECTION.find(sort={"$vector": query_embedding}, limit=n)
    docs = [doc["content"] for doc in results]

    return "\nRetrieved documents:\n" + "".join(
            [
                f"\n\n===== Document {i} =====\n" + doc
                for i, doc in enumerate(docs)
            ]
        )
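
To build intuition for what "similarity search" means here, the following is a minimal brute-force sketch in plain Python. It assumes cosine similarity as the metric (the metric is configurable per collection) and uses made-up three-dimensional vectors instead of real embeddings. A vector database would typically rely on an index rather than scanning every document, so treat this as an illustration of the ranking idea only. `cosine_similarity`, `top_n`, and `toy_documents` are hypothetical names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n(query_vector, documents, n=2):
    """documents: list of (text, vector) pairs; returns the n most similar texts."""
    ranked = sorted(
        documents,
        key=lambda pair: cosine_similarity(query_vector, pair[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:n]]

# Made-up vectors: the first axis loosely means "revenue", the second "suppliers".
toy_documents = [
    ("doc about revenue", [1.0, 0.0, 0.0]),
    ("doc about suppliers", [0.0, 1.0, 0.0]),
    ("doc about revenue and suppliers", [0.7, 0.7, 0.0]),
]

print(top_n([0.9, 0.1, 0.0], toy_documents))
# → ['doc about revenue', 'doc about revenue and suppliers']
```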

Now, the whole RAG can be described in one simple function:

In [19]:
def vanilla_rag(question: str) -> str:
    """
    Generate an answer based on retrieved documents and user question.

    Args:
        question (str): User's input question
    Returns:
        LLM-generated answer
    """

    prompt = (
        "You are an assistant that can answer user questions given provided context. "
        "Provide a conversational answer. "
        "If you don't know the answer, or no documents are provided, "
        "say 'I do not have enough context to answer the question.'"
    )

    # retrieve documents using the simple retriever, 5 documents by default
    relevant_documents = simple_retriever(question)

    # add user question and the docs to the prompt
    augmented_prompt = (
        f"{prompt}\n\n"
        f"User question: {question}\n\n"
        f"Retrieved documents to use as context:\n\n{relevant_documents}"
    )

    # pass everything to the LLM to generate an answer
    response = OPENAI_CLIENT.chat.completions.create(
        messages=[
            {'role': 'system', 'content': 'You answer users questions.'},
            {'role': 'user', 'content': augmented_prompt},
        ],
        model=GENERATION_MODEL,
        temperature=0,
    )

    return response.choices[0].message.content


Let's try it out with a simple question:

In [43]:
question = "What are Costco's merchandise categories?"
vanilla_rag(question)

"Costco's merchandise categories include Foods and Sundries, Non-Foods, Fresh Foods, Warehouse Ancillary, and Other Businesses. These categories encompass a wide range of products such as groceries, electronics, health and beauty items, furniture, apparel, and more. Additionally, they offer services like gasoline, pharmacy, optical, and travel to complement their core warehouse operations."

This worked just fine, because an answer to this question is located in a single paragraph that can be reliably retrieved with similarity search. Now that you've seen how Vanilla RAG works, let's talk about what's different in Agentic RAG.

## Agentic RAG

There are many definitions of what an "AI agent" is, for example:

* "An AI agent is meant to accomplish tasks typically provided by the users. In an AI agent, AI is the brain that processes the task, plans a sequence of actions to achieve this task, and determines whether the task has been accomplished." by Chip Huyen
* "AI Agents are programs where LLM outputs control the workflow." by Hugging Face smolagents team

An Agent typically has access to Tools which help it get additional information and/or perform actions. A tool can be a retriever, a function that performs a web search, a calculator, an image generator, and so on.

Tools help the LLM agent overcome some of its limitations. For example, a retriever tool, just like in Vanilla RAG, can help get additional information, and a calculator might be useful, since AI models aren't great at math.
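
As an illustration of how small a useful tool can be, here is a minimal sketch of a calculator function that an agent framework could expose as a tool. It parses the expression with Python's `ast` module rather than calling `eval()`, so only basic arithmetic is accepted. `calculate` is a hypothetical helper written for this example and is not part of `smolagents`.

```python
import ast
import operator

# Only these operators are allowed; anything else is rejected.
_ALLOWED_OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def calculate(expression: str):
    """Evaluate a basic arithmetic expression without calling eval()."""
    def _eval(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _ALLOWED_OPERATORS:
            return _ALLOWED_OPERATORS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _ALLOWED_OPERATORS:
            return _ALLOWED_OPERATORS[type(node.op)](_eval(node.operand))
        raise ValueError("Unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)

print(calculate("2 * (3 + 4) - 5"))  # → 9
print(calculate("7 / 2"))            # → 3.5
```

Function calls, names, and attribute access all raise `ValueError`, which keeps the tool limited to arithmetic even when the expression is written by a model.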

We'll build an agent that can rephrase a query if needed and call the same simple retriever as before as a tool, with the option to call this tool more than once to retrieve additional information and improve the answer.
Let's see how we can do this with `smolagents`:

## Agentic RAG with `smolagents`


First, we'll create a RetrieverTool class.

At the core of the tool is a function that an LLM can use in an agentic system.
However, to use this function, the LLM needs to be given its API:
* `name`: the name of the tool to give the LLM.
* `description`: used to populate the agent's system prompt and inform the LLM about the tool's capabilities.
* `inputs`: what inputs can be given to the tool.
* `output_type`: the type of the value the tool returns.
* `forward` method: the "main" function to be executed.


Note that here we take the same functions as before (`get_embedding` & `simple_retriever`), and simply wrap them into a RetrieverTool class to make the same simple retriever usable as a tool for the Agent.
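
To see why these attributes matter, the sketch below shows one way a tool's metadata could be turned into text for a system prompt. The layout is an assumption made for illustration and is not the template `smolagents` actually uses. `ToolSpec` and `render_tool_for_prompt` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class ToolSpec:
    """Plain container for the metadata an agent needs to know about a tool."""
    name: str
    description: str
    inputs: dict
    output_type: str

def render_tool_for_prompt(spec: ToolSpec) -> str:
    """Format a tool's metadata as text that could be placed in a system prompt."""
    lines = [f"- {spec.name}: {spec.description}", "  Takes inputs:"]
    for arg_name, arg in spec.inputs.items():
        lines.append(f"    {arg_name} ({arg['type']}): {arg['description']}")
    lines.append(f"  Returns: {spec.output_type}")
    return "\n".join(lines)

spec = ToolSpec(
    name="retriever_tool",
    description="Uses semantic search to retrieve documents that could be relevant to answer the query.",
    inputs={"query": {"type": "string", "description": "The query to perform."}},
    output_type="string",
)
print(render_tool_for_prompt(spec))
```

The model never sees the Python code of the tool, only text like this, so a precise `description` and clear input descriptions directly affect how well the agent uses the tool.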

In [47]:
from smolagents import Tool

class RetrieverTool(Tool):
    name = "retriever_tool"
    description = "Uses semantic search to retrieve documents that could be relevant to answer the query."
    inputs = {
        "query": {
            "type": "string",
            "description": "The query to perform. This should be semantically close to the target documents. Use the affirmative form rather than a question.",
        }
    }
    output_type = "string"

    def __init__(self, collection, openai_client, **kwargs):
        super().__init__(**kwargs)
        self.retriever = collection
        self.embedder = openai_client

    def get_embedding(self, text: str):
        return self.embedder.embeddings.create(
            input=text, model=EMBEDDING_MODEL
            ).data[0].embedding

    def simple_retriever(self, query: str, n: int = 5) -> str:
        # use the instance's own embedder and collection, not the module-level helpers
        query_embedding = self.get_embedding(query)
        results = self.retriever.find(sort={"$vector": query_embedding}, limit=n)
        docs = [doc["content"] for doc in results]
        return "\nRetrieved documents:\n" + "".join(
            [
                f"\n\n===== Document {i} =====\n" + doc
                for i, doc in enumerate(docs)
            ]
        )

    def forward(self, query: str) -> str:
        assert isinstance(query, str), "Your search query must be a string"

        # call the method defined on this class, so the tool is self-contained
        docs = self.simple_retriever(query)

        return docs

retriever_tool = RetrieverTool(COLLECTION, OPENAI_CLIENT)

To create an Agent, we have two options:
* `ToolCallingAgent`, which generates tool calls as JSON under the hood.
* `CodeAgent`, a type of agent that generates its tool calls as blobs of code, which works really well for LLMs that have strong coding performance.
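
The difference between the two styles can be sketched in plain Python. The JSON shape (`name` and `arguments` keys), `dispatch_tool_call`, and `fake_retriever` are assumptions made for this illustration; the exact format `smolagents` exchanges with the model may differ.

```python
import json

def dispatch_tool_call(raw_call: str, tools: dict) -> str:
    """Parse a JSON tool call emitted by a model and run the matching tool."""
    call = json.loads(raw_call)
    tool_name = call["name"]
    if tool_name not in tools:
        raise KeyError(f"Unknown tool: {tool_name}")
    return tools[tool_name](**call["arguments"])

def fake_retriever(query: str) -> str:
    # stand-in for the real retriever so the sketch runs offline
    return f"documents for: {query}"

TOOLS = {"retriever_tool": fake_retriever}

# JSON style: the model emits structured data, and the framework makes the call.
json_call = '{"name": "retriever_tool", "arguments": {"query": "Chevron employee incentive plans"}}'
print(dispatch_tool_call(json_call, TOOLS))

# Code style: the model emits code that calls the tool directly.
code_call = 'result = retriever_tool(query="Chevron employee incentive plans")'
namespace = {"__builtins__": {}, **TOOLS}
exec(code_call, namespace)  # illustration only: never exec untrusted model output like this
print(namespace["result"])
```

Both paths end up calling the same function with the same argument. The code style lets a model chain several calls and intermediate variables in one step, which is why it tends to suit models with strong coding ability.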

In this example, we're using a rather old-fashioned LLM (`"gpt-3.5-turbo-0125"`), which works better with `ToolCallingAgent`.

smolagents lets you use models from the Hugging Face Hub, and also integrates with OpenAI and Anthropic models via `LiteLLMModel`.

In [49]:
from smolagents import ToolCallingAgent, LiteLLMModel

model = LiteLLMModel(model_id=GENERATION_MODEL)

agent = ToolCallingAgent(tools=[retriever_tool], model=model, verbose=True)

That's all it takes to set it up! Now let's see how it performs compared to Vanilla RAG on some slightly trickier questions.

## Question 1: How do employee incentive plans differ between Chevron and Walmart?

In [44]:
question = "How do employee incentive plans differ between Chevron and Walmart?"

First, let's get an answer from Vanilla RAG:

In [28]:
vanilla_rag(question)

"Employee incentive plans at Chevron and Walmart differ in their structure and focus. Walmart's incentive plan includes stock options, restricted stock, and other equity compensation awards to align associates' interests with shareholders. On the other hand, Chevron emphasizes attracting, developing, and retaining talent through long-term employment models, leadership development programs, and succession planning. While Walmart offers incentives like Deferred Compensation and Deferred Bonuses for continuous employment, Chevron focuses on inclusive leadership development programs and employee networks to foster a diverse and inclusive work environment. Both companies prioritize employee engagement and growth, but their incentive plans reflect their unique organizational strategies and values."

Now, let's see what the Agent does and how it answers:

In [50]:
agent_output = agent.run(question)

print("Final answer:")
print(agent_output)

Final answer:
Employee incentive plans at Chevron include the Chevron Incentive Plan, which is an annual cash bonus plan for eligible employees linked to corporate and individual performance. The plan also includes the LTIP (Long-Term Incentive Plan) for officers and other regular salaried employees with significant positions. Awards under the LTIP consist of stock options and other share-based compensation. On the other hand, Walmart's Employee Incentive Plan grants stock options, restricted stock, restricted stock units, performance share units, and other equity compensation. The Walmart Inc. Stock Incentive Plan of 2015 aligns associates' interests with shareholders' interests by granting these awards.


As you can see, the Agent decomposed the original question into two individual queries - `'Employee incentive plans at Chevron'` and `'Employee incentive plans at Walmart'`, and retrieved documents for each of them to generate the final answer that compares all of the information.  
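
The decomposition pattern itself is easy to picture outside of any framework. The sketch below runs a retriever once per sub-query and merges the results while dropping duplicates. `retrieve_for_subqueries`, `toy_retriever`, and the chunk names in `TOY_RESULTS` are hypothetical placeholders, not content from the filings.

```python
def retrieve_for_subqueries(subqueries, retriever, n=5):
    """Run the retriever once per sub-query and merge results, dropping duplicates."""
    seen = set()
    merged = []
    for subquery in subqueries:
        for doc in retriever(subquery, n):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

# Placeholder results keyed by query, standing in for a real vector search.
TOY_RESULTS = {
    "Employee incentive plans at Chevron": ["chunk-chevron-1", "chunk-shared-glossary"],
    "Employee incentive plans at Walmart": ["chunk-walmart-1", "chunk-shared-glossary"],
}

def toy_retriever(query, n):
    return TOY_RESULTS.get(query, [])[:n]

merged = retrieve_for_subqueries(
    ["Employee incentive plans at Chevron", "Employee incentive plans at Walmart"],
    toy_retriever,
)
print(merged)
# → ['chunk-chevron-1', 'chunk-shared-glossary', 'chunk-walmart-1']
```

Each sub-query gets its own full retrieval budget, so neither company's documents crowd the other's out of the top results, which can happen when a single query has to cover both.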

Let's take a closer look at the final answers.

The agentic answer is more faithful to the source because it provides specific details about the incentive plans at each company, whereas the Vanilla RAG answer makes generalizations and includes information that is **not directly stated** in the sources. Here's a breakdown of why the agentic answer is more accurate:

* Specific plan names: Agentic RAG references the "Chevron Incentive Plan" and the "Long-Term Incentive Plan (LTIP)" at Chevron, and it specifically names the "Stock Incentive Plan of 2015" at Walmart. These precise names match the source. Vanilla RAG provides a general overview of incentive plans without using the specific names from the source.

* Specific types of compensation: Agentic RAG specifies that Chevron's LTIP includes "stock options and other share-based compensation," and that Walmart's Stock Incentive Plan of 2015 includes "stock options, restricted stock, restricted stock units, performance share units, and other equity compensation". These details are in the source, which lists the types of awards offered in each plan, while Vanilla RAG only mentions the general types of compensation, such as "stock options" and "restricted stock".

* Unsupported generalizations: Vanilla RAG makes broad statements that are not directly stated in the sources, e.g. "Chevron emphasizes attracting, developing, and retaining talent through long-term employment models, leadership development programs, and succession planning" and "Chevron focuses on inclusive leadership development programs and employee networks to foster a diverse and inclusive work environment". Agentic RAG represents the plans as they are described in the source, avoiding generalization.

## Question 2: How does Costco manage its supply chain and what are the potential vulnerabilities and risks in this process?

Let's try another question.

In [45]:
another_question = "How does Costco manage its supply chain and what are the potential vulnerabilities and risks in this process?"

In [34]:
vanilla_rag(another_question)

"Costco manages its supply chain by utilizing a global network of suppliers, both domestic and international, to purchase merchandise for its stores, clubs, and online platforms. By establishing efficient relationships with suppliers and ensuring compliance with local laws and regulations, Costco is able to offer a wide range of high-quality products at competitive prices to its customers. However, potential vulnerabilities and risks in this process may include disruptions in the supply chain, such as delays in receiving products, logistical challenges, and compliance issues with regulations in different jurisdictions. These factors can impact Costco's in-stock levels, the attractiveness of its merchandise assortment, and ultimately, customer satisfaction."

In [35]:
agent_output = agent.run(another_question)

print("Final answer:")
print(agent_output)

Final answer:
Costco manages its supply chain by purchasing quality merchandise from numerous domestic and foreign suppliers and importers, including both U.S. and international suppliers. They have relationships with suppliers to efficiently sell significant quantities of products and offer low prices to customers. The company faces risks and vulnerabilities in its supply chain, including uncertainties in supply, pricing, and access to new products, supplier compliance issues, risks related to labor disputes, natural disasters, public health emergencies, and economic and political conditions, as well as commodity price fluctuations and disruptions in transportation and delivery processes.


In this example, the only thing Agentic RAG did was modify the query for the retriever.

While both answers are quite good, just by modifying the query, Agentic RAG managed to pull out docs that are more relevant even in a single retrieval step, and generated a slightly better response as a result.

The agentic answer states that Costco buys from "numerous domestic and foreign suppliers and importers". This directly reflects the source's emphasis on not being reliant on any single supplier and seeking alternatives when needed. The Vanilla RAG response only mentions a "global network of suppliers", which is less precise.

The agentic answer lists specific supply chain risks that are directly stated in the source, such as "uncertainties in supply, pricing, and access to new products, supplier compliance issues, risks related to labor disputes, natural disasters, public health emergencies, and economic and political conditions". It also mentions "commodity price fluctuations and disruptions in transportation and delivery processes". Vanilla RAG only mentions general "disruptions in the supply chain," "logistical challenges," and "compliance issues," which is much less specific and not as faithful to the information in the source.

## Conclusion

As you have seen in this tutorial, by rephrasing the query, or decomposing it, and making multiple uses of the retriever, Agentic RAG is able to provide higher quality answers compared to Vanilla RAG, even though they use the exact same retriever over the exact same documents, and the exact same generation LLM.

Agentic RAG also requires a more powerful LLM compared to what you can get away with in non-agentic RAG, for the following reasons:
* A long context is needed to hold the multiple steps.
* An agent may need to perform multiple steps, and mistakes in steps can propagate and compound.

You may also notice that Vanilla RAG is faster and uses fewer tokens. An AI agent can take multiple steps and deliver a good result, but it can also quickly burn through your API credits. It's always a good idea to consider all the methods available to you, the actual use case you're trying to solve, and which approach will yield the best result for your set of requirements.
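
A rough back-of-the-envelope sketch shows why multi-step agents use more tokens. It assumes that every step re-sends the full history so far to the model, and all of the numbers are made up for illustration rather than measured in this tutorial. `total_input_tokens` is a hypothetical helper.

```python
def total_input_tokens(prompt_tokens: int, tokens_added_per_step: int, steps: int) -> int:
    """Sum the input tokens over all steps when each step re-sends the full history."""
    total = 0
    history = prompt_tokens
    for _ in range(steps):
        total += history                  # this step's model call sees everything so far
        history += tokens_added_per_step  # tool output and model reply join the history
    return total

# Hypothetical sizes: a 500-token prompt and 2000 tokens of retrieved documents per retrieval.
vanilla = 500 + 2000  # one call containing the prompt and the retrieved documents
agent = total_input_tokens(prompt_tokens=500, tokens_added_per_step=2000, steps=3)
print(vanilla, agent)  # → 2500 7500
```

Under these assumptions, three agent steps cost three times the input tokens of the single Vanilla RAG call, and the gap widens with every additional step because the history keeps growing.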

Next steps:

Check out the default system prompt that `smolagents` uses for tool calling: [here on GitHub](https://github.com/huggingface/smolagents/blob/681758ae84a8075038dc676d8af7262077bd00c3/src/smolagents/prompts.py#L114C1-L219C4). You can provide a custom `system_prompt` when initializing your agent. Try it with examples that are more representative of your use case, and modified instructions.

