# LlamaIndex: Tooling Examples

This example demonstrates multi-step reasoning using an agent that integrates with function tooling.

See this [DeepLearning.AI](https://learn.deeplearning.ai/courses/building-agentic-rag-with-llamaindex/lesson/ix5w5/building-an-agent-reasoning-loop) course for more details.

### Set Environment

In [1]:
from dotenv import load_dotenv
import nest_asyncio
import llama_index
from llama_index.llms.openai import OpenAI
import textwrap
from llama_index.core.tools import FunctionTool
from pydantic import BaseModel
from llama_index.core.output_parsers import PydanticOutputParser
from llama_index.core.agent.workflow import FunctionAgent

from tools_util import get_doc_tools

load_dotenv()
nest_asyncio.apply()
llama_index.core.__version__

llm = OpenAI(model="o4-mini", temperature=0)

def long_print(msg: str):
    wrapped_text = textwrap.fill(msg, width=140, replace_whitespace=False)
    print(wrapped_text)

### Load Data

In [3]:
import httpx
import os

files = ['https://arxiv.org/pdf/2505.10543']

os.makedirs('./data', exist_ok=True)

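for url in files:
    # Single quotes inside the f-string keep this valid on Python versions before 3.12.
    file_name = f"./data/{url.split('/')[-1]}.pdf"
    if not os.path.exists(file_name):
        # Follow redirects and fail loudly on HTTP errors instead of saving an error page.
        r = httpx.get(url, timeout=20, follow_redirects=True)
        r.raise_for_status()
        with open(file_name, 'wb') as out_file:
            out_file.write(r.content)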

### Creating Tools

#### Function Agents

Below, a `FunctionAgent` is given two simple arithmetic tools and decides which one to call.

Note that I purposely implemented the `multiply` tool incorrectly so that I can verify the tool is actually being called. I have noticed that the LLM sometimes skips the tool and produces its own answer, which does not reflect the tool implementation.

To address this, I added a system prompt and an output schema to get usable responses. These additions did not completely resolve the problem: the asserts in the script below sometimes still fail because the agent does not call the tools it was given. They have also added considerably to the response time.
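A less invasive way to confirm that a tool was invoked, rather than breaking its implementation, is to wrap the function so that every call is logged. The following is a minimal sketch: `record_calls` and `call_log` are hypothetical names that are not part of LlamaIndex, and it assumes the tool framework derives the tool's name, description, and parameters from the function's name, docstring, and signature, all of which `functools.wraps` preserves.

```python
import functools
import inspect


def record_calls(fn, call_log):
    """Wrap `fn` so each invocation is appended to `call_log` (hypothetical helper)."""
    @functools.wraps(fn)  # keeps __name__, __doc__, and the reported signature
    def wrapper(*args, **kwargs):
        call_log.append((fn.__name__, args, kwargs))
        return fn(*args, **kwargs)
    return wrapper


def multiply(x: int, y: int) -> int:
    """A tool to multiply two integers together."""
    return x * y  # correct implementation; the log proves the call instead


call_log = []
tracked_multiply = record_calls(multiply, call_log)

# Calling the wrapper behaves like the original and leaves a trace in call_log.
product = tracked_multiply(4, 5)
print(product, call_log)
print(inspect.signature(tracked_multiply))
```

With this approach the wrapped function would be passed to the tool constructor in place of the original, and an empty `call_log` after a run indicates that the model answered without using the tool.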




In [7]:
class IntegerResult(BaseModel):
    """Data model for a tool output."""
    result: int

def add(x: int, y: int) -> int:
    """A tool to add two integers together."""
    print('Hit add numbers.')
    return x + y

def multiply(x: int, y: int) -> int:
    """A tool to multiply two integers together."""
    print('Hit multiply numbers.')
    # Deliberately wrong (exponentiation instead of multiplication) so tool use is detectable.
    return x ** y

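# fn_schema describes a tool's *input* arguments, so it is left to be inferred
# from each function's signature; IntegerResult is only used by the output parser.
add_tool = FunctionTool.from_defaults(fn=add)
multiply_tool = FunctionTool.from_defaults(fn=multiply)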

agent1 = FunctionAgent(
    name="add_multiply_agent",
    description="An agent that can add or multiply two numbers.",
    tools=[add_tool, multiply_tool], 
    llm=llm, 
    system_prompt=("Only use the available tools to answer questions. "
                   "Do not evaluate output. "
                   "The final response should only contain tool output."),
    output_parser=PydanticOutputParser(output_cls=IntegerResult),
    verbose=True
)
response = await agent1.run('Add the numbers 4 and 5.')
print(f'Addition response: {str(response)}')
assert int(str(response)) == 9
response = await agent1.run('Multiply the numbers 4 and 5.')
print(f'Multiply response: {str(response)}')
assert int(str(response)) == 4 ** 5


Addition response: 9
Multiply response: 1024
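The asserts above rely on `int(str(response))`, which raises if the model adds any surrounding text or returns JSON. A minimal sketch of a more tolerant parser is shown here; `parse_int_result` is a hypothetical helper, and the accepted response shapes (a bare integer, a JSON object with a `result` key, or a sentence containing exactly one integer) are assumptions about what the agent may return.

```python
import json
import re


def parse_int_result(text: str) -> int:
    """Extract a single integer from an agent's final response text."""
    text = text.strip()

    # Case 1: the response is JSON, e.g. '9' or '{"result": 9}'.
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        payload = None
    if isinstance(payload, dict):
        payload = payload.get("result")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(payload, int) and not isinstance(payload, bool):
        return payload

    # Case 2: free text; accept it only if exactly one integer appears.
    matches = re.findall(r"-?\d+", text)
    if len(matches) == 1:
        return int(matches[0])
    raise ValueError(f"Expected exactly one integer in response, got: {text!r}")
```

An assert could then be written as `assert parse_int_result(str(response)) == 9`, which fails with a descriptive error instead of an opaque `int()` conversion error when the response is ambiguous.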


#### Query Engine Tools

The `get_doc_tools` function from the `tools_util` module creates two query engine tools: one that specializes in specific questions about a document and another that specializes in summaries.

In [None]:
vector_tool, summary_tool = get_doc_tools("./data/2505.10543.pdf", '2505_10543')
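The `tools_util` module is not shown in this notebook. The following is a minimal sketch of what such a helper might look like, assuming it follows the common LlamaIndex pattern of building a vector index and a summary index over the same nodes. The function name `get_doc_tools_sketch`, the chunk size, the `similarity_top_k` value, the vector tool name, and the tool descriptions are all assumptions; only the `summary_query_engine_<name>` naming is taken from the tool-call output later in this example.

```python
def get_doc_tools_sketch(file_path: str, name: str):
    """Build a (vector_tool, summary_tool) pair for one document (hypothetical)."""
    # Imported lazily so that defining this sketch does not require llama-index.
    from llama_index.core import SimpleDirectoryReader, SummaryIndex, VectorStoreIndex
    from llama_index.core.node_parser import SentenceSplitter
    from llama_index.core.tools import QueryEngineTool

    # Load the document and split it into chunks (chunk size is an assumption).
    documents = SimpleDirectoryReader(input_files=[file_path]).load_data()
    nodes = SentenceSplitter(chunk_size=1024).get_nodes_from_documents(documents)

    # Vector index: retrieves the few most similar chunks for specific questions.
    vector_tool = QueryEngineTool.from_defaults(
        query_engine=VectorStoreIndex(nodes).as_query_engine(similarity_top_k=2),
        name=f"vector_query_engine_{name}",  # assumed name
        description=f"Answers specific questions about the document {name}.",
    )

    # Summary index: reads every chunk and synthesizes a holistic summary.
    summary_tool = QueryEngineTool.from_defaults(
        query_engine=SummaryIndex(nodes).as_query_engine(
            response_mode="tree_summarize", use_async=True
        ),
        name=f"summary_query_engine_{name}",
        description=f"Provides summaries of the document {name}.",
    )
    return vector_tool, summary_tool
```

Building the indexes calls the configured embedding model and LLM, so running this sketch needs the same credentials as the rest of the example.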

The summary tool is used to get a summary of the document.

In [11]:
summarization_response = summary_tool.call("Please provide a concise sentence that summarizes the document.")
long_print(str(summarization_response))

The document discusses the evaluation of large language models on dynamic decision-making tasks and the effectiveness of strategic prompting
techniques in closing the performance gap between larger and smaller models, emphasizing the limitations of current models in emergent
reasoning and self-learning.


Now let's ask a more specific question.

In [13]:
detailed_response = vector_tool.call("What are the conclusions of the document?")
long_print(str(detailed_response))

The conclusions of the document are as follows:
- Larger language models generally achieve higher scores in line with scaling laws, but
carefully designed prompts can enable smaller models to match or surpass the baseline performance of larger models.
- Advanced prompting
techniques primarily benefit smaller models on complex planning and reasoning tasks, offering less value to already high-performing large
language models.
- Advanced reasoning techniques can significantly improve performance when reasoning and decision-making align, but they
can also introduce instability and lead to significant performance drops.
- Transforming a sparse reward into a dense, task-aligned
quantitative reward can improve the learning effectiveness of large language model agents within complex environments.
- There is little
evidence for self-learning or emergent reasoning in dynamic tasks that require planning and spatial coordination, highlighting the
limitations of current static approaches.


### Multi-Tool Agent

Now let's combine all the tools created above into one agent to see whether the correct tools are used.

In [21]:
agent2 = FunctionAgent(
    name="A multi-tool agent",
    description="An agent that can perform limited arithmetic and answer certain LLM questions.",
    tools=[add_tool, multiply_tool, vector_tool, summary_tool], 
    llm=llm, 
    system_prompt=("Only use your available tools to answer questions. "
                   "Do not evaluate output. "
                   "The final response should only contain tool output."),
    verbose=True
)
response = await agent2.run("What conclusions can be drawn about LLM reasoning. Also, multiply 4 and 5.")

In [22]:
long_print(str(response))

The conclusions about LLM reasoning:
- Larger models generally exhibit better reasoning performance.
- Strategic prompting techniques can
help smaller models close the gap but may also cause overthinking in simple tasks.
- Advanced prompting can boost smaller models on complex
tasks, though results are inconsistent and can sometimes worsen performance.
- LLMs currently show limitations in emergent reasoning or
self-learning, often failing by producing invalid action sequences or getting stuck in loops.
- There is a need for more robust solutions
and critical assessment of prompting methods that claim to enable advanced LLM capabilities.

Result of multiplying 4 and 5: 1024


The response above reflects both the content of the document and the purposely defective multiply tool.

Below are the tools called by the agent.

In [26]:
for tool in response.tool_calls:
    print(tool.tool_name)

summary_query_engine_2505_10543
multiply


Note that the response is not deterministic: on some runs the agent did not call the multiply tool to answer the arithmetic part of the question.
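One way to cope with this is to inspect the tool calls after each run and retry when an expected tool is missing. The following is a minimal sketch: `run_until_tools_called` is a hypothetical helper, and it assumes only what the cell above shows, namely that the awaited response exposes `tool_calls` whose items have a `tool_name` attribute.

```python
async def run_until_tools_called(run, prompt, required_tools, max_attempts=3):
    """Await `run(prompt)` repeatedly until every required tool name was called."""
    required = set(required_tools)
    missing = required
    for attempt in range(1, max_attempts + 1):
        response = await run(prompt)
        called = {call.tool_name for call in response.tool_calls}
        missing = required - called
        if not missing:
            return response
        print(f"Attempt {attempt}: missing tool calls {sorted(missing)}")
    raise RuntimeError(
        f"Tools not called after {max_attempts} attempts: {sorted(missing)}"
    )

# Possible usage in a notebook cell:
# response = await run_until_tools_called(
#     agent2.run, "Multiply 4 and 5.", required_tools=["multiply"]
# )
```

Retrying raises cost and latency, so a small `max_attempts` is a reasonable default; the final `RuntimeError` makes a persistent failure explicit instead of silently returning an answer the model produced on its own.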