# LlamaIndex: Building an Agent Reasoning Loop

This example demonstrates reasoning over multiple steps using an agent that integrates with function tooling.

For more details, see this [DeepLearning.AI](https://learn.deeplearning.ai/courses/building-agentic-rag-with-llamaindex/lesson/ix5w5/building-an-agent-reasoning-loop) course.

### Set Environment

In [2]:
from dotenv import load_dotenv
import nest_asyncio
import llama_index
from llama_index.llms.openai import OpenAI

from tools_util import get_doc_tools


load_dotenv()
nest_asyncio.apply()
llama_index.core.__version__

llm = OpenAI(model="o4-mini", temperature=0)

### Load Data

In [None]:
import httpx
import os

files = ['https://arxiv.org/pdf/2505.10543']

os.makedirs('./data', exist_ok=True)

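for url in files:
    # Build the local path from the last URL segment (the arXiv ID).
    file_name = f"./data/{url.split('/')[-1]}.pdf"
    if not os.path.exists(file_name):
        # httpx does not follow redirects by default; enabled here as a precaution.
        r = httpx.get(url, timeout=20, follow_redirects=True)
        r.raise_for_status()
        with open(file_name, 'wb') as fh:
            fh.write(r.content)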

### Create Agent Runner and Worker

The agent runner is created here. It is the orchestrator and task dispatcher for the pipeline: tasks are dispatched to agent workers, which use the tools provided. The function-calling worker is aware of the state built up to this point and decides when the result should be returned to the user.
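
To make the runner and worker roles concrete, here is a minimal toy sketch of the pattern. It is not LlamaIndex's implementation: `Task`, `ToyWorker`, and `ToyRunner` are hypothetical names, and a fixed "call each tool once, then answer" rule stands in for the LLM's decision about when to stop.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Task:
    """State the runner keeps for one user request."""
    query: str
    history: List[str] = field(default_factory=list)  # tool outputs so far
    answer: Optional[str] = None


class ToyWorker:
    """Executes one step: call the next tool or produce the final answer."""

    def __init__(self, tools: Dict[str, Callable[[str], str]]):
        self.tools = tools

    def run_step(self, task: Task) -> bool:
        """Run one step; return True once the task is finished."""
        names = list(self.tools)
        next_index = len(task.history)
        if next_index < len(names):
            # Stand-in for the LLM choosing a tool based on accumulated state.
            name = names[next_index]
            task.history.append(f"{name}: {self.tools[name](task.query)}")
            return False
        # Every tool has been consulted: synthesize the answer from the history.
        task.answer = " | ".join(task.history)
        return True


class ToyRunner:
    """Orchestrator: creates a task and dispatches steps to the worker."""

    def __init__(self, worker: ToyWorker, max_steps: int = 10):
        self.worker = worker
        self.max_steps = max_steps

    def query(self, text: str) -> str:
        task = Task(query=text)
        for _ in range(self.max_steps):
            if self.worker.run_step(task):
                return task.answer
        raise RuntimeError("step budget exhausted before the task finished")


tools = {
    "vector": lambda q: f"top chunks for '{q}'",
    "summary": lambda q: f"summary for '{q}'",
}
runner = ToyRunner(ToyWorker(tools))
print(runner.query("results"))
```

The runner owns the loop and the step budget, while the worker owns the per-step decision. That split is what makes the step-by-step control shown later in this notebook possible.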

In [9]:
from llama_index.core.agent.workflow import FunctionAgent

vector_tool, summary_tool = get_doc_tools("./data/2505.10543.pdf", '2505_10543')

agent_worker = FunctionAgent(
    name="2505_10543_search_agent",
    tools=[vector_tool, summary_tool], 
    llm=llm, 
    verbose=True
)
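# NOTE: FunctionAgent belongs to the newer workflow-based API. The
# agent.query() / agent.chat() calls below were run with the legacy
# AgentRunner, which wraps a FunctionCallingAgentWorker
# (see "Task Debugging and Steerability").
#agent = AgentRunner(agent_worker)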

### Submitting Different Requests

#### Queries

Below is a query chosen to initiate a multi-step chain-of-thought process by the agent worker.

Notice that the debug output shows which query engine the agent worker chose to synthesize a response.

In [7]:
response = agent.query(
    "From the paper, describe how the results drive the conclusions."
)

Added user message to memory: From the paper, describe how the results drive the conclusions.
=== Calling Function ===
Calling function: summary_query_engine_my_doc with args: {"input": "Results section"}
=== Function Output ===
The results section likely delves into the performance analysis of different prompting strategies on various tasks using open-source large language models. It may discuss how larger models generally outperform smaller ones, but strategic prompting can help bridge the performance gap for smaller models. The importance of a dense, task-aligned reward signal in enhancing decision-making for agents is likely emphasized. Furthermore, the section may touch upon how advanced prompting techniques can notably boost performance for smaller models on complex tasks, although the effectiveness of these strategies can vary, resulting in significant gains or potential drops in performance. Additionally, the limitations of current benchmarks in capturing reasoning complexity a

Since the summary query engine was used, we'd expect all the nodes to be used. Prior analysis shows that 20 nodes were created from the document, and two queries were made, so 40 nodes being used to create a response makes sense.

In [9]:
print(f"Nodes used: {len(response.source_nodes)}")

Nodes used: 40
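
One way to check the "20 nodes, two queries" explanation is to count how often each node ID appears among the sources. The sketch below uses a hypothetical `FakeSourceNode` stand-in rather than the real `response.source_nodes` entries, so it only illustrates the counting logic.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FakeSourceNode:
    """Hypothetical stand-in for one entry of response.source_nodes."""
    node_id: str


def summarize_sources(source_nodes: List[FakeSourceNode]) -> Dict[str, int]:
    """Report total, unique, and maximum repeat counts of source nodes."""
    counts = Counter(node.node_id for node in source_nodes)
    return {
        "total": len(source_nodes),
        "unique": len(counts),
        "max_repeats": max(counts.values(), default=0),
    }


# Simulate two summary queries that each touch the same 20 document nodes.
doc_nodes = [FakeSourceNode(f"node-{i}") for i in range(20)]
stats = summarize_sources(doc_nodes + doc_nodes)
print(stats)  # {'total': 40, 'unique': 20, 'max_repeats': 2}
```

With the real response object, the identifier would have to be read from each source node entry; the exact attribute path is an assumption to verify against the installed LlamaIndex version.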


Let's access the metadata for the first node.

In [13]:
import pprint
pprint.pprint(response.source_nodes[0].metadata)

{'creation_date': '2025-05-19',
 'file_name': 'file0.pdf',
 'file_path': 'data\\file0.pdf',
 'file_size': 328049,
 'file_type': 'application/pdf',
 'last_modified_date': '2025-05-19',
 'page_label': '1'}


#### Chat

Now let's access the agent using the chat API so that conversational memory is maintained. The size of the memory buffer depends on the context window of the LLM.
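
The idea of a memory buffer bounded by a token budget can be sketched as follows. This is a toy illustration, not LlamaIndex's memory class: `ToyChatMemory` is a hypothetical name, and whitespace splitting is an assumed stand-in for a real tokenizer.

```python
from typing import List, Tuple


class ToyChatMemory:
    """Keeps the full chat history but only returns what fits in a token budget."""

    def __init__(self, token_limit: int):
        self.token_limit = token_limit
        self.messages: List[Tuple[str, str]] = []  # (role, content) pairs

    @staticmethod
    def _count_tokens(text: str) -> int:
        # Assumption: one whitespace-separated word counts as one token.
        return len(text.split())

    def put(self, role: str, content: str) -> None:
        self.messages.append((role, content))

    def get(self) -> List[Tuple[str, str]]:
        """Return the most recent messages whose total size fits the budget."""
        kept: List[Tuple[str, str]] = []
        used = 0
        for role, content in reversed(self.messages):
            cost = self._count_tokens(content)
            if used + cost > self.token_limit:
                break  # older messages no longer fit
            kept.append((role, content))
            used += cost
        kept.reverse()  # restore chronological order
        return kept


memory = ToyChatMemory(token_limit=8)
memory.put("user", "Describe the analysis done for this paper.")
memory.put("assistant", "Four broad strokes.")
memory.put("user", "What are the conclusions?")
print(memory.get())
```

With a budget of 8 tokens, the oldest message (7 tokens) is dropped and the two most recent ones (3 and 4 tokens) are kept. A larger context window allows a larger budget, so more of the conversation survives between turns.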

In [14]:
response = agent.chat(
    "Describe the analysis done for this paper."
)

Added user message to memory: Describe the analysis done for this paper.
=== LLM Response ===
The paper’s analysis proceeds in four broad strokes: (1) systematic empirical evaluation, (2) prompt-ablation studies, (3) error-mode and statistical breakdowns, and (4) qualitative case studies.  

1. Systematic empirical evaluation  
   • Models and Benchmarks  
     – Open-source LLMs ranging from ~125 M to 30 B parameters  
     – A diverse task suite: arithmetic and quantitative reasoning, chain-of-thought benchmarks, text-based games, StarCraft II micromanagement, instruction following, interactive decision-making, automated planning, meta-heuristic generation, RL-agent mental modeling  
   • Metrics  
     – Task-specific scores (accuracy, win‐rate, game points, planning cost)  
     – Aggregate measures (mean task accuracy, macro F1)  

2. Prompt-ablation studies  
   • Prompt formats  
     – Zero-shot vs. few-shot exemplars vs. chain-of-thought vs. self-ask vs. automatically generate

In [15]:
response = agent.chat(
    "What are the conclusions of the analysis?"
)

Added user message to memory: What are the conclusions of the analysis?
=== LLM Response ===
From the empirical analyses—spanning ablations of prompt style, scale‐curve regressions, error breakdowns, and targeted case studies—the authors draw four core conclusions:  

1. Model scale remains the dominant driver of raw performance.  
   - Larger LLMs (10 B+ parameters) consistently achieve higher scores across all tasks.  
   - Prompt engineering can help, but it cannot fully close the gap that sheer parameter count provides.  

2. Prompting yields conditional gains—and can even harm—especially for smaller models.  
   - On complex reasoning benchmarks, few-shot and chain-of-thought prompts often boost small and medium models by 10–30%.  
   - On simpler tasks, however, those same reasoning‐heavy prompts introduce overhead that sometimes reduces accuracy.  
   - Performance under advanced prompting is brittle: gains vary widely by task, exemplar choice, and random seed.  

3. There is no

#### Task Debugging and Steerability

Now let's experiment with agent control to debug the agent and steer its results.

In [5]:
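# Legacy agent API; this import path is assumed from older llama-index releases.
from llama_index.core.agent import AgentRunner, FunctionCallingAgentWorker

agent_worker = FunctionCallingAgentWorker.from_tools(
    [vector_tool, summary_tool],
    llm=llm,
    verbose=True
)
agent = AgentRunner(agent_worker)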

In [6]:
task = agent.create_task(
    "Tell me about the analysis results, and how they are reflected in the conclusions?"
)

In [7]:
step_output = agent.run_step(task.task_id)

Added user message to memory: Tell me about the analysis results, and how they are reflected in the conclusions?
=== Calling Function ===
Calling function: vector_query_engine_my_doc with args: {"input": "analysis results"}
=== Function Output ===
The analysis results presented in the provided context show the performance metrics of different models across various tasks. The table displays the minimum, median, and maximum average scores over three runs for each model and task. It highlights how different model sizes and methodologies impact performance in tasks such as Bandit, Rock Paper Scissors, Hanoi, and Messenger. Additionally, the text discusses the background of using LLMs in complex text-based games, the emergence of prompt-level techniques to enhance reasoning capabilities, and the application of evolutionary strategies to optimize LLM performance.


In [17]:
import textwrap

completed_steps = agent.get_completed_steps(task.task_id)
print(f"Num completed for task {task.task_id}: {len(completed_steps)}")
wrapped_text = textwrap.fill(str(completed_steps[0].output.sources[0].raw_output), width=150, replace_whitespace=False)
print(wrapped_text)

Num completed for task 2bb8adf1-a87f-425c-a61f-8e3554b8eeae: 1
The analysis results presented in the provided context show the performance metrics of different models across various tasks. The table displays the
minimum, median, and maximum average scores over three runs for each model and task. It highlights how different model sizes and methodologies impact
performance in tasks such as Bandit, Rock Paper Scissors, Hanoi, and Messenger. Additionally, the text discusses the background of using LLMs in
complex text-based games, the emergence of prompt-level techniques to enhance reasoning capabilities, and the application of evolutionary strategies
to optimize LLM performance.


In [18]:
upcoming_steps = agent.get_upcoming_steps(task.task_id)
print(f"Num of upcoming steps for task {task.task_id}: {len(upcoming_steps)}")
wrapped_text = textwrap.fill(str(upcoming_steps[0]), width=150, replace_whitespace=False)
print(wrapped_text)

Num of upcoming steps for task 2bb8adf1-a87f-425c-a61f-8e3554b8eeae: 1
task_id='2bb8adf1-a87f-425c-a61f-8e3554b8eeae' step_id='8b463beb-4201-4122-b729-0d6d3649bafb' input=None step_state={} next_steps={} prev_steps={}
is_ready=True


The agent worker autogenerates actions from the conversation history. Let's add user input to the intermediate results, asking which model had the best performance. Injecting user input this way helps produce a final result that meets a user need that wasn't apparent from the intermediate results. The results will be added to the conversation memory.
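
The step-wise control flow with injected input can be sketched as a toy loop. The names `SteppableTask`, `StepOutput`, `create_task`, and `run_step` here are hypothetical and only mirror the shape of the calls used in this notebook; a fake `lookup(...)` string stands in for a real tool call.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StepOutput:
    output: str
    is_last: bool


@dataclass
class SteppableTask:
    memory: List[str] = field(default_factory=list)   # conversation so far
    pending: List[str] = field(default_factory=list)  # questions not yet looked up


def create_task(query: str) -> SteppableTask:
    """Start a task with the user's query recorded and queued for lookup."""
    return SteppableTask(memory=[f"user: {query}"], pending=[query])


def run_step(task: SteppableTask, user_input: Optional[str] = None) -> StepOutput:
    """Run one step, optionally injecting extra user input first."""
    if user_input is not None:
        # Injected input joins the memory and becomes the next thing to look up.
        task.memory.append(f"user: {user_input}")
        task.pending.append(user_input)
    if task.pending:
        question = task.pending.pop(0)
        result = f"lookup({question})"  # stand-in for a tool call
        task.memory.append(f"tool: {result}")
        return StepOutput(output=result, is_last=False)
    # Nothing left to look up: answer from everything gathered in memory.
    tool_results = [m for m in task.memory if m.startswith("tool: ")]
    return StepOutput(
        output=f"answer based on {len(tool_results)} tool results", is_last=True
    )


task = create_task("analysis results")
first = run_step(task)
second = run_step(task, user_input="best model")
final = run_step(task)
print(first.output, second.output, final.output, final.is_last, sep="\n")
```

Because the injected question lands in the same memory as the original query, the final step can draw on both tool results.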

In [19]:
step_output = agent.run_step(
    task.task_id, input="Which model had the best overall performance?"
)

Added user message to memory: Which model had the best overall performance?
=== Calling Function ===
Calling function: vector_query_engine_my_doc with args: {"input": "best overall performance model"}
=== Function Output ===
LLAMA 3.3-70 B Base


Run the next step and check whether it's the last.

In [20]:
step_output = agent.run_step(task.task_id)
print(step_output.is_last)

=== LLM Response ===
The LLAMA 3.3-70 B Base model achieved the best overall performance. Its minimum, median and maximum average scores across the Bandit, Rock-Paper-Scissors, Hanoi and Messenger tasks were higher than those of all other models evaluated. This superior performance underpins the paper’s conclusion that scaling to larger LLMs—when combined with prompt-level enhancements and evolutionary strategies—yields the most robust gains in complex text-based reasoning tasks.
True


It is. Now finalize the results.

In [21]:
response = agent.finalize_response(task.task_id)

In [23]:
wrapped_text = textwrap.fill(str(response), width=150, replace_whitespace=False)
print(wrapped_text)

The LLAMA 3.3-70 B Base model achieved the best overall performance. Its minimum, median and maximum average scores across the Bandit, Rock-Paper-
Scissors, Hanoi and Messenger tasks were higher than those of all other models evaluated. This superior performance underpins the paper’s conclusion
that scaling to larger LLMs—when combined with prompt-level enhancements and evolutionary strategies—yields the most robust gains in complex text-
based reasoning tasks.
