# Extending AI agents with self-prompting

* Prompts as operations (computation)
    * Synthetic data generation (e.g., generating file paths)
    * Data organization (e.g., sorting a list of files into directories by type or by work/leisure activity)
    * Policy understanding and compliance reasoning (e.g., an uploaded document and receipt)

### Self-Prompting Agents: Harnessing LLMs for Specialized Tasks

Large language models have emerged as remarkably versatile computational tools, capable of tasks ranging from sophisticated analysis to creative generation. Through careful prompting, we can use these models to perform natural language processing tasks like sentiment analysis and text classification, engage in analytical reasoning for problem-solving, generate structured data from unstructured text, and even act as domain experts in fields like software architecture or cybersecurity.

This computational flexibility stems from the models’ ability to understand and follow complex instructions, maintain context through multi-turn interactions, and adapt their outputs to specific formats and requirements. For example, the same underlying model can transform unstructured data into JSON, generate visualizations through tools like Graphviz, provide expert analysis in specialized domains, and even create interactive experiences like educational games or simulated systems.

#### The Challenge of Clean Architecture

However, this power presents a challenge. If we simply tell our agent to “think like a marketing expert” or “analyze like a data scientist,” we risk muddying its decision-making process. The agent’s primary job is to coordinate actions and achieve goals, not to become entangled in domain-specific reasoning. We need to maintain a clear separation between the agent’s strategic thinking and the specialized analytical capabilities we want to leverage.

Consider a company’s organizational structure: A CEO doesn’t need to be an expert in marketing, engineering, and finance. Instead, they need to understand when to consult their experts in each department and how to coordinate their inputs toward company goals. Our agent should work the same way – maintaining clear decision-making while having access to specialized capabilities through well-defined interfaces.

The solution is to isolate these prompts that are focused on specific tasks and expose them as tools. By doing this, we can keep the agent’s architecture clean and focused on its primary role of coordinating actions, while still leveraging the full power of LLM-based computation when needed.

#### Understanding Self-Dialog

When we expose “prompting” as a tool to the agent, we are allowing it to engage in “self-dialog.” Essentially, it is using its own capabilities to prompt itself for specialized tasks. This pattern enables the agent to dynamically adopt expert personas, perform complex analysis, and generate structured content, all while maintaining a clear separation between strategic decision-making and specialized processing.

In this tutorial, we’ll explore how to implement this pattern effectively, create different types of LLM-based tools for specialized tasks, and combine these tools to solve complex problems. By treating LLMs as tools within our agent’s toolkit, we can extend its capabilities while keeping its architecture clean and focused.

Through self-dialog, the LLM can:

* Transform unstructured data into structured formats by thinking through the patterns and relationships
* Analyze sentiment and emotion by carefully considering language nuances and context
* Generate creative solutions by exploring possibilities from multiple perspectives
* Extract key insights by systematically examining information through different analytical frameworks
* Clean and normalize data by applying consistent rules and handling edge cases thoughtfully

For example, when analyzing customer feedback, the LLM might first organize the raw text into structured categories, then analyze sentiment in each category, and finally synthesize insights about customer satisfaction trends. Each step involves the LLM engaging in a different type of analytical thinking.
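The three-step feedback analysis described above can be sketched as a chain of focused prompts, each consuming the previous step's output. This is a minimal illustration rather than the tutorial's implementation: `analyze_feedback` and the `LLM` callable type are hypothetical names, and the prompt wording is only an assumption.

```python
from typing import Callable

# Hypothetical LLM interface: any callable that maps a prompt string to reply text.
LLM = Callable[[str], str]


def analyze_feedback(llm: LLM, raw_feedback: str) -> dict:
    """Run three focused prompts in sequence, feeding each result into the next."""
    # Step 1: organize the raw text into categories.
    categories = llm(
        "Group the following customer feedback into named categories:\n" + raw_feedback
    )
    # Step 2: analyze sentiment within each category.
    sentiment = llm(
        "For each category below, state whether sentiment is positive, "
        "negative, or mixed:\n" + categories
    )
    # Step 3: synthesize overall trends from the sentiment analysis.
    summary = llm(
        "Summarize customer satisfaction trends from this analysis:\n" + sentiment
    )
    return {"categories": categories, "sentiment": sentiment, "summary": summary}
```

Because the model is passed in as a plain callable, the chain can be exercised with a stub before it is connected to a real LLM.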

#### The Tool-Based Solution

We can achieve this balance by exposing the LLM’s capabilities as tools. This approach allows us to:

* Keep the agent’s decision-making process clean and focused on coordinating actions
* Access the full power of LLM-based analysis and transformation when needed
* Maintain clear boundaries between strategic coordination and specialized processing
* Create reusable, well-defined interfaces for common analytical tasks

Think of these tools as specialized departments in our organization. Each has a clear purpose and interface, but the internal workings – the specific prompts and chains of reasoning – are encapsulated within the tool itself. The agent doesn’t need to know how the sentiment analysis tool works internally; it just needs to know when to use it and what to expect from it.

#### Building a Toolkit

We can create different types of LLM-based tools to handle various specialized tasks:

* **Transformation Tools:** Converting between different data formats and structures
* **Analysis Tools:** Providing expert insight in specific domains
* **Generation Tools:** Creating structured content from specifications
* **Validation Tools:** Checking if content meets specific criteria
* **Extraction Tools:** Pulling specific information from larger contexts

Each tool type encapsulates a specific kind of self-dialog, making it available to the agent through a clean interface. This allows us to leverage the LLM’s sophisticated reasoning capabilities while maintaining architectural clarity.
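As one illustration of an analysis tool, the sketch below wraps a persona prompt behind a plain function. The names `make_expert_tool` and `ChatLLM` are hypothetical, and the message format simply assumes a chat-style interface that takes a list of role/content dictionaries and returns text.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]
# Hypothetical LLM interface: takes chat messages, returns the reply text.
ChatLLM = Callable[[List[Message]], str]


def make_expert_tool(llm: ChatLLM, persona: str) -> Callable[[str], str]:
    """Return a tool function that answers questions in the given persona."""

    def expert_tool(question: str) -> str:
        # The persona lives in the tool's system message, not in the agent's goals.
        messages = [
            {"role": "system", "content": f"Act as the following expert: {persona}"},
            {"role": "user", "content": question},
        ]
        return llm(messages)

    return expert_tool
```

The agent only sees a function that takes a question and returns an answer; the persona prompt stays encapsulated inside the tool.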

#### Tools vs. Unstructured Data and Prompting

* A prompt acts as a bridge between the unstructured real world and computer tools.
* An LLM can extract information from unstructured input (a photo of a room, a website) into structured form (JSON).
* Prompt-based tools reshape or convert unstructured information into structured information.

#### A Self-Prompting Example

Imagine we’re building an agent to automate accounts payable processing. Every day, the agent receives dozens of emails with attached invoices from different vendors, each using their own unique format and layout. Some are PDFs, others are scanned images that have been converted to text, and a few even arrive as plain text in the email body. Our agent needs to understand each invoice, extract key information like the invoice number, date, amount, and line items, and then insert this data into the company’s accounting database. Without automation, this would be a tedious manual task requiring someone to read each invoice and transcribe the important details.

This is where the computational power of large language models becomes transformative. Through self-prompting, our agent can use an LLM as a universal parser that understands the natural structure of invoices, regardless of their format. The LLM can read an invoice like a human would, identifying key information based on context and common patterns, then output that information in a structured format our agent can process. We don’t need to write separate parsers for each vendor’s invoice format or maintain complex rules about where to find specific information — the LLM can handle all of that complexity for us.

Here’s how we can implement this capability as a reusable tool for our agent:

````python
import json

@register_tool()
def prompt_llm_for_json(action_context: ActionContext, schema: dict, prompt: str):
    """
    Have the LLM generate JSON in response to a prompt. Always use this tool when you need structured data out of the LLM.
    This function takes a JSON schema that specifies the structure of the expected JSON response.
    
    Args:
        schema: JSON schema defining the expected structure
        prompt: The prompt to send to the LLM
        
    Returns:
        A dictionary matching the provided schema with extracted information
    """
    generate_response = action_context.get("llm")
    
    # Try up to 3 times to get valid JSON
    for i in range(3):
        try:
            # Send prompt with schema instruction and get response
            response = generate_response(Prompt(messages=[
                {"role": "system", 
                 "content": f"You MUST produce output that adheres to the following JSON schema:\n\n{json.dumps(schema, indent=4)}. Output your JSON in a ```json markdown block."},
                {"role": "user", "content": prompt}
            ]))

            # Check if the response has json inside of a markdown code block
            if "```json" in response:
                # Search from the front and then the back
                start = response.find("```json")
                end = response.rfind("```")
                response = response[start+7:end].strip()

            # Parse the JSON response (this checks syntax only, not schema conformance)
            return json.loads(response)

        except Exception as e:
            if i == 2:  # On the last try, re-raise with the original traceback
                raise
            print(f"Error generating response: {e}")
            print("Retrying...")
````

You may have noticed a new `action_context` parameter above. Don’t worry about it for now; we will discuss this architectural choice in a later section. For now, just know that it is a context object that contains the LLM and other useful information.

With this tool in place, we can extract structured data from any text input. For example, an agent processing business documents could extract information in standardized formats:

From an invoice:

```python
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "date": {"type": "string"},
        "amount": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                    "unit_price": {"type": "number"}
                }
            }
        }
    }
}

extracted_data = prompt_llm_for_json(
    action_context=context,
    schema=invoice_schema,
    prompt="Extract invoice details from this text: 'INVOICE #1234...'"
)
```

The tool works by combining several key elements:

* A system message that firmly instructs the LLM to adhere to the provided JSON schema
* Retry logic to handle potential parsing failures
* Support for both direct JSON output and markdown-formatted JSON
* Error handling to ensure we either get valid JSON or fail explicitly
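The handling of fenced versus bare JSON can also be isolated into a small helper that is easy to test on its own. The following is a sketch, not part of the tool above: `parse_json_reply` is a hypothetical name, and it additionally tolerates a reply whose closing fence is missing, which is an assumption about how lenient we want to be.

```python
import json

FENCE = "`" * 3  # three backticks, built programmatically so this listing stays fence-safe


def parse_json_reply(reply: str):
    """Parse JSON from a reply that is either bare JSON or wrapped in a json code fence."""
    opener = FENCE + "json"
    start = reply.find(opener)
    if start != -1:
        end = reply.rfind(FENCE)
        if end > start:
            # Normal case: keep only the text between the opening and closing fence.
            reply = reply[start + len(opener):end]
        else:
            # Only the opening fence was found: take everything after it.
            reply = reply[start + len(opener):]
    # json.loads raises json.JSONDecodeError if the remaining text is not valid JSON.
    return json.loads(reply.strip())
```

Any text the model writes before or after the fenced block is discarded, so a reply that reasons first and then emits JSON still parses.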

It is also possible to implement this using function calling, where the LLM generates a function call with the extracted data. However, we are going to implement this using plain prompting to show you how it can be done with any LLM.

The function is particularly powerful because it requests a specific structure through the schema while allowing flexibility in the prompt. This means we can use it for a wide variety of extraction tasks just by defining appropriate schemas. The agent can use this tool to:

* Extract meeting details from emails, converting them to calendar-ready formats
* Transform unstructured report data into structured analytics inputs
* Convert web page content into structured product or contact information
* Process customer communications into actionable support tickets

What makes this approach especially valuable is that it creates a reliable bridge between unstructured text and structured data processing. The agent can use this tool whenever it needs to convert natural language information into a format that other systems can process programmatically. This enables workflows where the agent can:

* Receive unstructured text input from various sources
* Use `prompt_llm_for_json` to extract relevant information in a structured format
* Pass the structured data to other APIs or services for further processing
* Make decisions based on the processed results

The tool’s retry logic and error handling make it robust enough to handle a lot of LLM variability in output, while its flexibility through schema definition makes it adaptable to new use cases without requiring changes to the core implementation.

#### Balancing Flexibility and Reliability in Data Extraction

When designing our agent’s toolset, we face an important architectural decision regarding data extraction. We can either rely on the agent to use the general-purpose `prompt_llm_for_json` tool with dynamically generated schemas, or we can create specialized extraction tools with fixed schemas for specific types of documents. This choice reflects a classic tradeoff between flexibility and reliability.

Consider our invoice processing example. With the general-purpose approach, we’re trusting our agent to construct appropriate schemas and prompts for each situation. This provides maximum flexibility — the agent can adapt its extraction approach based on the specific context, potentially extracting different fields for different types of invoices or even handling entirely new document types without requiring new tool implementations. However, this flexibility comes with risks. The agent might generate inconsistent schemas over time, leading to data inconsistency in our database. It might miss critical fields that should always be extracted, or it might structure the data in ways that make downstream processing difficult.

Let’s look at how we could create a specialized invoice extraction tool that provides more guarantees about the extracted data:

```python
@register_tool(tags=["document_processing", "invoices"])
def extract_invoice_data(action_context: ActionContext, document_text: str) -> dict:
    """
    Extract standardized invoice data from document text. This tool enforces a consistent
    schema for invoice data extraction across all documents.
    
    Args:
        document_text: The text content of the invoice to process
        
    Returns:
        A dictionary containing extracted invoice data in a standardized format
    """
    # Define a fixed schema for invoice data
    invoice_schema = {
        "type": "object",
        "required": ["invoice_number", "date", "amount"],  # These fields must be present
        "properties": {
            "invoice_number": {"type": "string"},
            "date": {"type": "string", "format": "date"},
            "amount": {
                "type": "object",
                "properties": {
                    "value": {"type": "number"},
                    "currency": {"type": "string"}
                },
                "required": ["value", "currency"]
            },
            "vendor": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "tax_id": {"type": "string"},
                    "address": {"type": "string"}
                },
                "required": ["name"]
            },
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "quantity": {"type": "number"},
                        "unit_price": {"type": "number"},
                        "total": {"type": "number"}
                    },
                    "required": ["description", "total"]
                }
            }
        }
    }

    # Create a focused prompt that guides the LLM in invoice extraction
    extraction_prompt = f"""
    Extract invoice information from the following document text. 
    Focus on identifying:
    - Invoice number (usually labeled as 'Invoice #', 'Reference', etc.)
    - Date (any dates labeled as 'Invoice Date', 'Issue Date', etc.)
    - Amount (total amount due, including currency)
    - Vendor information (company name, tax ID if present, address)
    - Line items (individual charges and their details)

    Document text:
    {document_text}
    """
    
    # Use our general extraction tool with the specialized schema and prompt
    return prompt_llm_for_json(
        action_context=action_context,
        schema=invoice_schema,
        prompt=extraction_prompt
    )
```

This specialized approach offers several advantages:

* **Data Consistency:** The fixed schema asks for invoice data in the same structure every time, making it easier to work with downstream systems like databases or accounting software.
* **Required Fields:** We can specify which fields are required. Note that `prompt_llm_for_json` only parses the reply and does not check it against the schema, so raising an error when a critical field is missing requires an explicit validation step after parsing.
* **Field Validation:** The schema can include format specifications (like the expected date format) and field-specific constraints. As with required fields, these guide the LLM and become guarantees only once the output is validated.
* **Focused Prompting:** We can provide detailed guidance to the LLM about where to look for specific information, improving extraction accuracy.
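A schema sent in a prompt is only a request, so a check after parsing is what turns `required` into a guarantee. The sketch below is a minimal, hypothetical helper (`missing_required` is not part of the tutorial's code) that walks a schema and reports required fields absent from the extracted data. It understands only `required`, `properties`, and `items`; a full validator such as the `jsonschema` package covers far more of the specification.

```python
def missing_required(data, schema: dict, path: str = "") -> list:
    """Return dotted paths of required fields that are absent from data."""
    missing = []
    if schema.get("type") == "object" and isinstance(data, dict):
        # Report required keys that are missing at this level.
        for name in schema.get("required", []):
            if name not in data:
                missing.append(f"{path}{name}")
        # Recurse into the properties that are present.
        for name, subschema in schema.get("properties", {}).items():
            if name in data:
                missing += missing_required(data[name], subschema, f"{path}{name}.")
    elif schema.get("type") == "array" and isinstance(data, list):
        # Check every array element against the item schema.
        for index, item in enumerate(data):
            missing += missing_required(item, schema.get("items", {}), f"{path}{index}.")
    return missing
```

An extraction tool could call this on the parsed result and raise an error, or retry, whenever the returned list is not empty.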

However, this specialization also means we need to create and maintain separate extraction tools for each type of document we want to process. If we later need to handle purchase orders, receipts, or contracts, we’ll need to implement new tools for each.

The choice between these approaches often depends on your specific needs:

Use specialized tools when:

* Data consistency is critical
* You have a well-defined set of document types
* You need to enforce specific validation rules
* The extracted data feeds into other systems with strict requirements

Use the general-purpose approach when:

* You need to handle a wide variety of document types
* Document formats and requirements change frequently
* You’re prototyping or exploring new use cases
* The downstream systems are flexible about data format

In practice, many systems use a combination of both approaches: specialized tools for common, critical document types where consistency is important, and the general-purpose tool as a fallback for handling edge cases or new document types. This hybrid approach gives you the best of both worlds — reliability where you need it most, and flexibility where it matters.
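One way to picture the hybrid approach is a small dispatcher that prefers a specialized extractor when one exists and otherwise falls back to the general-purpose one. This is only a sketch under stated assumptions: `make_router` and the `Extractor` type are hypothetical, and in a real agent the choice may instead be made by the agent itself when it selects a tool.

```python
from typing import Callable, Dict

# Hypothetical extractor interface: document text in, structured data out.
Extractor = Callable[[str], dict]


def make_router(specialized: Dict[str, Extractor], fallback: Extractor) -> Callable[[str, str], dict]:
    """Build a function that routes a document to the matching extractor."""

    def route(document_type: str, document_text: str) -> dict:
        # Use the specialized extractor if one is registered for this type,
        # otherwise fall back to the general-purpose extractor.
        extractor = specialized.get(document_type, fallback)
        return extractor(document_text)

    return route
```

Adding support for a new critical document type then means registering one more entry in the `specialized` mapping, while everything else keeps flowing through the fallback.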


#### A Complete Example of Prompting for Structured Data

Let’s create an invoice processing system that combines specialized extraction with a simple storage mechanism. The system will use the LLM’s capabilities to understand invoice content while maintaining strict data consistency through a fixed schema.

First, let’s create our specialized invoice extraction tool:

```python
@register_tool(tags=["document_processing", "invoices"])
def extract_invoice_data(action_context: ActionContext, document_text: str) -> dict:
    """
    Extract standardized invoice data from document text.

    This tool ensures consistent extraction of invoice information by using a fixed schema
    and specialized prompting for invoice understanding. It will identify key fields like
    invoice numbers, dates, amounts, and line items from any invoice format.

    Args:
        document_text: The text content of the invoice to process

    Returns:
        A dictionary containing the extracted invoice data in a standardized format
    """
    invoice_schema = {
        "type": "object",
        "required": ["invoice_number", "date", "total_amount"],
        "properties": {
            "invoice_number": {"type": "string"},
            "date": {"type": "string"},
            "total_amount": {"type": "number"},
            "vendor": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "address": {"type": "string"}
                }
            },
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "quantity": {"type": "number"},
                        "unit_price": {"type": "number"},
                        "total": {"type": "number"}
                    }
                }
            }
        }
    }

    # Create a focused prompt for invoice extraction
    extraction_prompt = f"""
            You are an expert invoice analyzer. Extract invoice information accurately and 
            thoroughly. Pay special attention to:
            - Invoice numbers (look for 'Invoice #', 'No.', 'Reference', etc.)
            - Dates (focus on invoice date or issue date)
            - Amounts (ensure you capture the total amount correctly)
            - Line items (capture all individual charges)
            
            Stop and think step by step. Then, extract the invoice data from:
            
            <invoice>
            {document_text}
            </invoice>
    """

    # Use prompt_llm_for_json with our specialized prompt
    return prompt_llm_for_json(
        action_context=action_context,
        schema=invoice_schema,
        prompt=extraction_prompt
    )
```

```python
@register_tool(tags=["storage", "invoices"])
def store_invoice(action_context: ActionContext, invoice_data: dict) -> dict:
    """
    Store an invoice in our invoice database. If an invoice with the same number
    already exists, it will be updated.
    
    Args:
        invoice_data: The processed invoice data to store
        
    Returns:
        A dictionary containing the storage result and invoice number
    """
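    # Get our invoice storage from context. The fallback {} is a throwaway dictionary,
    # so invoices are only kept if "invoice_storage" was placed in the context beforehand.
    storage = action_context.get("invoice_storage", {})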
    
    # Extract invoice number for reference
    invoice_number = invoice_data.get("invoice_number")
    if not invoice_number:
        raise ValueError("Invoice data must contain an invoice number")
    
    # Store the invoice
    storage[invoice_number] = invoice_data
    
    return {
        "status": "success",
        "message": f"Stored invoice {invoice_number}",
        "invoice_number": invoice_number
    }
```

Our agent includes two specialized tools that integrate to manage the invoice processing workflow:

* **extract_invoice_data** acts as our intelligent document analyzer.
  This function uses self-prompting to take raw document text and transform it into structured data following a consistent schema.
  It uses a prompt that guides the LLM to identify crucial invoice elements like invoice numbers, dates, and line items.
  By enforcing a fixed JSON schema with required fields, the tool ensures data consistency regardless of the original invoice format.
  (In production, further safeguards against hallucination may be required, but this demonstrates the core functionality.)

* **store_invoice** provides a simple storage mechanism backed by a dictionary.
  Once an invoice has been properly extracted and structured, this function saves it to our invoice database, using the invoice number as a unique identifier.
  The invoices are stored separately from the agent’s memory, so they can be persisted across runs of the agent once the dictionary is backed by durable storage.

To use this system, we would set up our agent with these tools and configure it to handle invoice processing tasks:

```python
def create_invoice_agent():
    # Create action registry with our invoice tools
    action_registry = PythonActionRegistry()
    
    # Create our base environment
    environment = PythonEnvironment()
    
    # Define our invoice processing goals
    goals = [
        Goal(
            name="Persona",
            description="You are an Invoice Processing Agent, specialized in handling and storing invoice data."
        ),
        Goal(
            name="Process Invoices",
            description="""
            Your goal is to process invoices by extracting their data and storing it properly.
            For each invoice:
            1. Extract all important information including numbers, dates, amounts, and line items
            2. Store the extracted data indexed by invoice number
            3. Provide confirmation of successful processing
            4. Handle any errors appropriately
            """
        )
    ]

    # Create the agent
    return Agent(
        goals=goals,
        agent_language=AgentFunctionCallingActionLanguage(),
        action_registry=action_registry,
        generate_response=generate_response,
        environment=environment
    )
```

This implementation provides several key benefits:

* **Consistent Data Structure:**
  The fixed schema in `extract_invoice_data` ensures all invoices are processed into a consistent format.
  The prompting and logic for how to extract invoice data is separate from the agent’s core reasoning, making it easier to modify and maintain.

* **Modular Design:**
  Each tool has a single, clear responsibility, making the system easy to maintain and extend.
  Details for how the tools are implemented are hidden from the overall goals of the agent.

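* **Error Handling:**
  `prompt_llm_for_json` retries when the reply is not parseable JSON, and `store_invoice` rejects data that lacks an invoice number.
  Checking the remaining required fields and formats would need an explicit schema validation step.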

* **Persistent Storage:**
  The simple dictionary-based storage can be easily replaced with a database or other persistence mechanism by modifying the storage tools.
  With a durable backend in place, the work that the agent does can be persisted across runs.

The specialized schema and focused prompting help ensure accurate extraction, while the storage tools maintain data organization. You can extend this system by adding more specialized tools for different types of invoices or additional processing capabilities.
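To make the storage point concrete, here is a minimal sketch of a file-backed replacement for the in-memory dictionary. `JsonFileInvoiceStore` is a hypothetical class, not part of the tutorial's framework. It supports the same `storage[invoice_number] = invoice_data` assignment that `store_invoice` uses, and it rewrites the whole file on every save, which is acceptable for a demo but not for large volumes.

```python
import json
from pathlib import Path


class JsonFileInvoiceStore:
    """Dictionary-like invoice store that saves every write to a JSON file."""

    def __init__(self, path):
        self.path = Path(path)
        # Load previously stored invoices if the file already exists.
        self._data = json.loads(self.path.read_text()) if self.path.exists() else {}

    def __setitem__(self, invoice_number, invoice_data):
        self._data[invoice_number] = invoice_data
        # Persist the full store after each write.
        self.path.write_text(json.dumps(self._data, indent=2))

    def __getitem__(self, invoice_number):
        return self._data[invoice_number]

    def __contains__(self, invoice_number):
        return invoice_number in self._data
```

Placing an instance of this class in the context under the `"invoice_storage"` key would, under these assumptions, let stored invoices survive a restart without changing `store_invoice` itself.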

#### Horizontal Scaling of Agents Through Tools

One of the most powerful aspects of this tool-based approach is how it enables horizontal scaling of agent capabilities. Rather than constantly expanding the core goals or system prompt of an agent — which can lead to prompt bloat and conflicting instructions — we can encapsulate specific functionality in well-defined tools that the agent can access as needed.

#### Encapsulating Complexity in Tools

Tools serve as specialized modules that hide implementation complexity from the agent’s core reasoning. Consider our invoice processing example:

* **Abstraction of Domain Knowledge:**
  The `extract_invoice_data` tool encapsulates specialized knowledge about invoice formats, field identification, and data extraction.
  The agent doesn’t need to understand these details — it just needs to know when to use the tool.

* **Separation of Concerns:**
  Each tool handles a specific function (extraction, storage), allowing the agent to focus on high-level coordination rather than implementation specifics.
  This separation makes the entire system more maintainable and easier to reason about.

* **Focused Prompting:**
  By moving specialized prompting inside tools, we keep the agent’s core goals simple and focused.
  The extraction tool handles its own specialized prompt engineering, freeing the agent from needing to generate perfect prompts for every task.

#### Maintainability and Adaptability

Tools create a modular architecture that offers significant maintenance advantages:

* **Independent Development:**
  Tools can be developed, tested, and improved independently of the agent’s core logic.
  Specialized teams can work on different tools without needing to understand or modify the entire agent system.

* **Versioning and Updates:**
  Individual tools can be updated without changing the agent’s core goals.
  For example, we could improve the invoice extraction algorithm without touching any other part of the system.

* **Plug-and-Play Functionality:**
  New capabilities can be added by simply registering new tools with the action registry.
  The agent automatically gains access to these capabilities through its function-calling abilities.

#### Adapting Agents Through Tool Management

This architecture makes it remarkably easy to adapt agents for different use cases:

* **Tool Composition:**
  Create specialized agents by selecting which tools they have access to.
  An invoice processing agent might have document tools, while a customer service agent might have access to CRM tools.

* **Capability Evolution:**
  Start with simple implementations and gradually enhance capabilities by upgrading tools.
  For example, our simple dictionary-based invoice storage could be replaced with a database connector without changing the agent’s core logic.

* **Context Management:**
  Tools can manage their own state and context, reducing the cognitive load on the agent.
  In our example, the storage tool manages its own data structure, allowing the agent to focus on process flow rather than data management.

#### Practical Implementation Considerations

When implementing a tool-based architecture for horizontal scaling:

* **Tool Discoverability:**
  Ensure tools have clear descriptions and tags so the agent can understand when to use them.
  Well-documented tool interfaces help both human developers and AI agents.

* **Error Handling:**
  Build robust error handling into tools to prevent failures from cascading through the system.
  Tools should provide clear error messages that guide the agent toward resolution.

* **Instrumentation:**
  Add logging and monitoring to tools to track their usage and performance.
  This provides valuable insights for improving both the tools and the agent’s decision-making about when to use them.

* **Contextual Awareness:**
  Design tools to preserve and utilize context when appropriate.
  For example, our invoice storage tool could be enhanced to track modification history or flag unusual changes.
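The error-handling and instrumentation points can be combined in a single wrapper. The decorator below is a sketch under assumptions: `instrumented` is a hypothetical name, and the result format with `status`, `result`, and `message` keys is one possible convention rather than something the tutorial's framework prescribes.

```python
import functools
import logging
import time

logger = logging.getLogger("agent.tools")


def instrumented(func):
    """Wrap a tool so calls are logged and failures become structured results."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception as exc:
            # Log the traceback, then return a message the agent can act on
            # instead of letting the exception propagate through the system.
            logger.exception("tool %s failed", func.__name__)
            return {"status": "error", "tool": func.__name__, "message": str(exc)}
        elapsed = time.perf_counter() - started
        logger.info("tool %s finished in %.3fs", func.__name__, elapsed)
        return {"status": "success", "result": result}

    return wrapper
```

Returning the error as data gives the agent a clear message to reason about, while the log records capture usage and timing for later analysis.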

#### Conclusion

Horizontal scaling through tools represents a paradigm shift in how we build and evolve agent systems. Rather than creating monolithic agents with ever-expanding capabilities encoded in their core prompts, we can build modular, adaptable systems that grow through the addition of specialized tools.

This approach mirrors successful software engineering practices — encapsulation, modularity, and separation of concerns — applied to the unique challenges of LLM-based agents. By focusing complexity in tools rather than core agent reasoning, we create systems that are more maintainable, more adaptable, and ultimately more capable of solving complex, real-world problems.
