# Document Parse as a Tool w/ LLMs

In this tutorial we are going to explore how to implement a triage agentic system using Tool(Function) Calls with LLMs from scratch. To proceed with this tutorial, you'll need API keys for OpenAI, Serper, and Upstage. We'll be building an intelligent document parsing system that can triage and route documents, extract relevant information from PDFs, and use that information to answer questions appropriately.
 
We'll cover:
1. Setting up the necessary tools and APIs
2. Creating function schemas for tool calling and triage routing
3. Implementing document parsing and classification functionality
4. Building the triage agent architecture
5. Integrating these tools with LLMs for intelligent routing
 
This approach demonstrates how LLMs can be enhanced with specialized tools and agentic capabilities to handle complex document workflow tasks. The triage system will intelligently analyze documents, determine their type and priority, extract key information, and route them to appropriate downstream processes. By the end, you'll understand how to create your own tool-augmented LLM systems with intelligent triage capabilities.

# Import dependencies

Let's highlight some key dependencies with why we need them:

- PyPDF2: For reading and parsing PDF documents
- OpenAI: This is a standard SDK for any LLM API calls. In this tutorial, we are going to use it to access GPT models from OpenAI and Solar models from Upstage.
- Pydantic: For data validation and schema generation
- dotenv: For loading environment variables containing API keys

> ❓ Why do we include GPT models in this tutorial?

While our main goal is to integrate Document Parse in Solar model, it is quite convenient to test things first with GPT models. GPT models are known to be reliable and performant on various tasks, so they serve as a good upperbound for evaluating our document parsing system. Then we can safely move to replace GPT model with Solar model. This is a common strategy for every engineering project to not being lost in the middle of developing the entire system.

> ❓ What is the role of dotenv?

dotenv is a Python library that helps manage environment variables by loading them from a .env file into your application's environment. This is particularly useful for handling sensitive information like API keys that shouldn't be hardcoded in your source code. To use dotenv, first create a .env file in your project's root directory and add your environment variables in KEY=VALUE format. Then, in your Python code, import load_dotenv from python-dotenv and call load_dotenv() at the start of your program. After that, you can access the environment variables using os.getenv("KEY_NAME"). This keeps your sensitive credentials secure and makes it easier to manage different configurations across development and production environments.
# 


In [2]:
import os
import json
import inspect
import requests
from io import BytesIO

import PyPDF2
from openai import OpenAI
from pydantic import BaseModel

from dotenv import load_dotenv

## Setting up the API Keys

Once `load_dotenv()` is called successfully, you will see `True` is returned and printed out. At this point, all the variables from `.env` file is loaded up as environment variable. Hence, you can access them with `os.getenv()` function.

In [3]:
load_dotenv()

True

Basically, we are going to need the following three API keys:
- `SERPER_API_KEY`: API key for Google Search API service from [Serper.dev](https://serper.dev/)
- `OPENAI_API_KEY`: API key for accessing [OpenAI](https://openai.com/)'s GPT models. 
- `UPSTAGE_API_KEY`: API key for accessing [Upstage](https://www.upstage.ai/)'s Solar models.

In [4]:
SERPER_API_KEY = os.getenv("SERPER_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
UPSTAGE_API_KEY = os.getenv("UPSTAGE_API_KEY")

## Defining Helper Functions

Here, we are defining two helper functions necessary for tool(function) calling:
- `function_to_schema()`: Converts a Python function into a JSON schema format that can be used for function calling with LLMs. This can be thought as just structured string that is going to be injected into LLM as context so that LLM can understand what kind of functions are available to call
- `execute_tool_call()`: Executes the function call based on the LLM's response and returns the result

In [5]:
def function_to_schema(func) -> dict:
    """
    Converts a Python function into a JSON schema format for LLM function calling.
    
    Args:
        func: The Python function to convert to schema
        
    Returns:
        dict: JSON schema describing the function's interface
        
    The schema includes:
    - Function name
    - Description from docstring
    - Parameters with their types
    - Required parameters list
    """
    # Map Python types to JSON schema types
    type_map = {
        str: "string",
        int: "integer", 
        float: "number",
        bool: "boolean",
        list: "array",
        dict: "object",
        type(None): "null",
    }

    try:
        # Get function signature using inspect
        signature = inspect.signature(func)
    except ValueError as e:
        raise ValueError(
            f"Failed to get signature for function {func.__name__}: {str(e)}"
        )

    # Build parameters dictionary
    parameters = {}
    for param in signature.parameters.values():
        try:
            # Get JSON type for parameter, default to string if type not found
            param_type = type_map.get(param.annotation, "string")
        except KeyError as e:
            raise KeyError(
                f"Unknown type annotation {param.annotation} for parameter {param.name}: {str(e)}"
            )
        parameters[param.name] = {"type": param_type}

    # Get list of required parameters (those without default values)
    required = [
        param.name
        for param in signature.parameters.values()
        if param.default == inspect._empty
    ]

    # Return complete schema
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": (func.__doc__ or "").strip(),
            "parameters": {
                "type": "object",
                "properties": parameters,
                "required": required,
            },
        },
    }
    
def execute_tool_call(tool_call, tools, agent_name):
    """
    Executes a function call based on the LLM's response.
    
    Args:
        tool_call: Object containing function call details from LLM
        tools: Dictionary mapping function names to actual functions
        agent_name: Name of the agent making the call, for logging
        
    Returns:
        The result of executing the specified function with given arguments
        
    This function:
    1. Extracts function name and arguments from tool_call
    2. Logs the function call
    3. Executes the function with provided arguments
    """
    # Extract function name and parse arguments from JSON
    name = tool_call.function.name
    args = json.loads(tool_call.function.arguments)

    # Log the function call
    print(f"{agent_name}:", f"{name}({args})")

    # Execute the function with unpacked arguments
    return tools[name](**args)  # call corresponding function with provided arguments

## Testing the Tool Calling

Tool(function) calling is going to happen essentially in real environment, hence it would actually cost some resources (money, time, etc.,). That means we should be able to test if LLM can really understand well about the functions that we are going to provide without actually executing them. 


We can validate the LLM's understanding of the tools by defining functions with only signatures and docstrings, without actual implementations. Tool calling is essentially prompt engineering, so we should validate the LLM's understanding before making real API calls.

### Define a simple wrapper class and initial user prompts

In [6]:
class Agent(BaseModel):
    name: str = "Agent"
    model: str = "gpt-4o"
    instructions: str = "You are a helpful Agent"
    tools: list = []
    
client = OpenAI(api_key=OPENAI_API_KEY)

messages = [
    {
        "role": "user",
        "content": "Provide a comprehensive summary of the paper, "
                   "'ChunkKV - Semantic-Preserving KV Cache Compression "
                   "for Efficient Long-Context LLM Inference' on arXiv. "
    },
]    

#### Implement Triage Agentic system with dummy functions

Let's define triage agentic system with the following three components:
- `to_paper_search_agent()`: A dummy function that does nothing but with appropriate docstring for LLMs to understand this function is for searching paper information. This dummy function will be implemented in detail with Serper.dev API.
- `to_download_and_parse_paper_agent()`: A dummy function that does nothing but with appropriate docstring for LLMs to understand this function is for downloading and parsing paper PDF. This dummy function will be implemented in detail with Upstage's Document Parse API.
- `supervisor_agent`: The center of triage agentic system. While it could looks too fancy, this is just LLM with designated tools(functions) with appropriate system prompt.

In [7]:
def to_paper_search_agent():
    """Use this to search for paper URL on arXiv only when paper URL is not found yet."""
    return ""

def to_download_and_parse_paper_agent():
    """Use this to download and parse paper only when paper URL is found."""
    return ""

supervisor_agent = Agent(
    name="Supervisor Agent",
    instructions=(
        "You are a academic paper analyzer. "
        "- Basiclly, you don't have knowledge of the requested paper."
        "- Hence, you need to use the provided tools to get the paper information from the internet. "
        "- Your job is to find appropriate tool to transfer to based on the user's request and results of tool calls. "
        "- If enough information is collected to complete the user request, you should say directly answer to the user request. "
    ),
    tools=[to_paper_search_agent, to_download_and_parse_paper_agent]
)

tool_schemas = [function_to_schema(tool) for tool in supervisor_agent.tools]
tools = {tool.__name__: tool for tool in supervisor_agent.tools}

#### Call LLM with Function schemas (1st conversation)

Let's make a call to LLM with the pre-defined functions. The JSON schema of the pre-defined functions are already prepared in `tool_schemas` with `function_to_schema()`. 

> ❓ Why making a manual chat complemetion call?

While we can make chat completion calls iteratively in a while loop, this would make us hard to analyze any potential issues. This is especially true when you are dealing with LLMs since LLM itself is already very high-level entity. We don't know what is going on inside of them, hence we have to make sure if everything works as expected one by one before moving on to the fully autonomous system.

In [8]:
# Initial trial
response = client.chat.completions.create(
    model=supervisor_agent.model,
    messages=[{"role": "system", "content": supervisor_agent.instructions}] + messages,
    tools=tool_schemas or None,
    tool_choice="auto",
)
print(response.choices[0].message.content)
print(response.choices[0].message.tool_calls)

None
[ChatCompletionMessageToolCall(id='call_7zbELPSWR4JXMbjR5OQoB4p2', function=Function(arguments='{}', name='to_paper_search_agent'), type='function')]


Let's understand the outputs
- `response.choices[0].message.content`: when LLM makes a decision to make tool(function) calls, `response.choices[0].message.content` should contain nothing (hence, `None`).
- `response.choices[0].message.tool_calls`: instead, when LLM makes a decision to make tool(function) calls, `response.choices[0].message.tool_calls` should contain `ChatCompletionMessageToolCall` object instance. Inside `ChatCompletionMessageToolCall`, there are all the information what we need to do next such as:
  - `.function.name`: what tool(function) has to be called
  - `.function.arguments`: what arguments should be passed to the tool(function) when called

We see the LLM correctly made a decision to call `to_paper_search_agent` function. This is correct because we don't have the paper information yet, so we are going to have to go through searching process before analyzing it.

#### Call LLM with Function schemas (2nd conversation)

Let's make the 2nd call to LLM. At this time, we are going to inject a dummy message with the tool(function) calling output. It has to follow the below format:

```python
    {
        "role": "tool", 
        "tool_call_id": response.choices[0].message.tool_calls[0].id,
        "content": "Paper URL: https://arxiv.org/abs/2502.00299"
    }
```

> ❓ here, you have to assign the ID of the tool(function) that LLM has previoulsy decided to call. That ID can be found in the previous message. If these IDs do not match, it will cause an error. 

We are pretending that we have successfully obtained the URL of the paper as the result of the `to_paper_search_agent` function call.

In [9]:
# Subsequent trial (Second)
messages.append(response.choices[0].message)
messages.append(
    {
        "role": "tool", 
        "tool_call_id": response.choices[0].message.tool_calls[0].id,
        "content": "Paper URL: https://arxiv.org/abs/2502.00299"
    }
)

response = client.chat.completions.create(
    model=supervisor_agent.model,
    messages=[{"role": "system", "content": supervisor_agent.instructions}] + messages,
    tools=tool_schemas or None,
    tool_choice="auto",
)
print(response.choices[0].message.content)
print(response.choices[0].message.tool_calls)

None
[ChatCompletionMessageToolCall(id='call_TX7Y8E6mwqjri6tDVbvbRZY3', function=Function(arguments='{}', name='to_download_and_parse_paper_agent'), type='function')]


At this time, we see the LLM correctly made a decision to call `to_download_and_parse_paper_agent` function. This is correct because we have the target URL of the paper, so it is time to download and pare the content of it.

#### Call LLM with Function schemas (3rd conversation)

Let's make the 3rd call to LLM. At this time, we are going to inject a dummy message with the tool(function) calling output again. It has to follow the below format:

```python
    {
        "role": "tool", 
        "tool_call_id": response.choices[0].message.tool_calls[0].id, 
        "content": "Retrieved Paper Content\n"
        "--------------------------------\n"
        "Title: ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference\n"
        "To reduce memory costs in long-context inference with Large Language Models (LLMs), many recent works focus on compressing the key-value (KV) cache of different tokens. However, we identify that the previous KV cache compression methods measure token importance individually, neglecting the dependency between different tokens in the real-world language characterics. In light of this, we introduce ChunkKV, grouping the tokens in a chunk as a basic compressing unit, and retaining the most informative semantic chunks while discarding the less important ones. Furthermore, observing that ChunkKV exhibits higher similarity in the preserved indices across different layers, we propose layer-wise index reuse to further reduce computational overhead. We evaluated ChunkKV on cutting-edge long-context benchmarks including LongBench and Needle-In-A-HayStack, as well as the GSM8K and JailbreakV in-context learning benchmark. Our experiments with instruction tuning and multi-step reasoning (O1 and R1) LLMs, achieve up to 10\% performance improvement under aggressive compression ratios compared to existing methods."
    }
```

To simulate function callings and their results, I simply grasped the actual content of the abtract of the paper from arXiv. 

In [10]:
# Subsequent trial (Third)
messages.append(response.choices[0].message)
messages.append(
    {
        "role": "tool", 
        "tool_call_id": response.choices[0].message.tool_calls[0].id, 
        "content": "Retrieved Paper Content\n"
        "--------------------------------\n"
        "Title: ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference\n"
        "To reduce memory costs in long-context inference with Large Language Models (LLMs), many recent works focus on compressing the key-value (KV) cache of different tokens. However, we identify that the previous KV cache compression methods measure token importance individually, neglecting the dependency between different tokens in the real-world language characterics. In light of this, we introduce ChunkKV, grouping the tokens in a chunk as a basic compressing unit, and retaining the most informative semantic chunks while discarding the less important ones. Furthermore, observing that ChunkKV exhibits higher similarity in the preserved indices across different layers, we propose layer-wise index reuse to further reduce computational overhead. We evaluated ChunkKV on cutting-edge long-context benchmarks including LongBench and Needle-In-A-HayStack, as well as the GSM8K and JailbreakV in-context learning benchmark. Our experiments with instruction tuning and multi-step reasoning (O1 and R1) LLMs, achieve up to 10\% performance improvement under aggressive compression ratios compared to existing methods."
    }
)

response = client.chat.completions.create(
    model=supervisor_agent.model,
    messages=[{"role": "system", "content": supervisor_agent.instructions}] + messages,
    tools=tool_schemas or None,
    tool_choice="auto",
)
print(response.choices[0].message.content)
print(response.choices[0].message.tool_calls)

The paper "ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference" addresses the challenge of reducing memory costs in long-context inference using Large Language Models (LLMs). It identifies limitations in previous key-value (KV) cache compression methods which treat token importance in isolation, overlooking inter-token dependencies inherent in language.

To counter this, ChunkKV introduces a novel approach where tokens are grouped into chunks, serving as the fundamental compression unit. This method prioritizes retaining chunks that hold significant semantic information and discards less important ones. An additional innovation is the proposal of layer-wise index reuse, leveraging the observed similarity in preserved indices across different layers to reduce computational demands further.

The effectiveness of ChunkKV is demonstrated through evaluations on advanced long-context benchmarks like LongBench, Needle-In-A-HayStack, GSM8K, and JailbreakV

At this time, we see the LLM correctly made a decision not to call any function. This is correct because we already have the actual content of the paper. Hence, instead of making an another tool(function) calling, the LLM decided to answer the question from the initial user prompt: `"Provide a comprehensive summary of the paper, 'ChunkKV - Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference' on arXiv."`

## Filling in the Dummy Tools (Functions)

Now, it is time to fill in the dummy functions with actual behaviour! We are starting off with the same setup as below:

In [54]:
class Agent(BaseModel):
    name: str = "Agent"
    model: str = "gpt-4o"
    instructions: str = "You are a helpful Agent"
    tools: list = []
    
client = OpenAI(api_key=OPENAI_API_KEY)

messages = [
    {
        "role": "user",
        "content": "Provide a comprehensive summary of the paper, "
                   "'ChunkKV - Semantic-Preserving KV Cache Compression "
                   "for Efficient Long-Context LLM Inference' on arXiv. "
    },
]    

Let's define triage agentic system with the following three components:
- `to_paper_search_agent(paper_title: str)`: It has the same docstring as we had in the dummy function along with additional argument of `paper_title`. The information of this argument will be passed to the LLM after applying `function_to_schema()` function to it.
  - what this function does is very simple. It makes an API call to Serper.dev for google search, then return the first searched item whose link is starting with `https://arxiv.org`.
  - how the value of `paper_title` argument is assigned even though we gave a mission in natural language? This is totally up to the LLM itself. The LLM decides what value to assign to the argument.

- `to_download_and_parse_paper_agent(paper_url: str)`: It has the same docstring as we had in the dummy function along with additiona argument of `paper_url`. The information of this argument will be passed to the LLM after applying `function_to_schema()` function to it.
  - this function simply download the PDF of the paper with the given URL, then it returns the first 10,000 characters of it.
  - as this function expects paper_url as argument (required), LLM will make a decision to call it only when URL info. is provided. 

- `supervisor_agent`: The center of triage agentic system. This is exactly the same as in our dummy example.

In [55]:
def to_paper_search_agent(paper_title: str):
    """Use this to search for paper URL on arXiv only when paper URL is not found yet."""
    url = "https://google.serper.dev/search"

    payload = json.dumps({"q": f"{paper_title} on arXiv"})
    headers = {
        'X-API-KEY': SERPER_API_KEY,
        'Content-Type': 'application/json'
    }

    response = requests.request("POST", url, headers=headers, data=payload)
    search_results = response.json()['organic']
    
    if len(search_results) == 0:
        return "Count not find the URL to download the paper"
    
    first_result = search_results[0]
    if not first_result['link'].startswith("https://arxiv.org"):
        return "Could not find the URL to download the paper"
    
    return f"URL to download '{paper_title}': {first_result['link'].replace('abs', 'pdf')}"

def to_download_and_parse_paper_agent(paper_url: str):
    """Use this to download and parse paper only when paper URL is found."""
    response = requests.get(paper_url)
    pdf_reader = PyPDF2.PdfReader(BytesIO(response.content))
    text = "Retrieved Paper Content\n-----------------------------------\n"
    for page in pdf_reader.pages:
        text += page.extract_text() + "\n"
    return text.strip()[:10000]

supervisor_agent = Agent(
    name="Supervisor Agent",
    instructions=(
        "You are a academic paper analyzer. "
        "- Basiclly, you don't have knowledge of the requested paper."
        "- Hence, you need to use the provided tools to get the paper information from the internet. "
        "- Your job is to find appropriate tool to transfer to based on the user's request and results of tool calls. "
        "- If enough information is collected to complete the user request, you should say directly answer to the user request. "
    ),
    tools=[to_paper_search_agent, to_download_and_parse_paper_agent]#, to_paper_analysis_agent]#, to_triage, to_end_agent],
)


Now, instead of making a sequential LLM API calls manually, let's create a simple `run` function that does the same thing with `While` loop. The loop is escaped when there is no need for tool(function) call anymore.

In [56]:
def run(client, messages, supervisor_agent):
    # Loop through the conversation steps
    while True:
        # Prepare tools for the current step
        tool_schemas = [function_to_schema(tool) for tool in supervisor_agent.tools]
        tools = {tool.__name__: tool for tool in supervisor_agent.tools}
        
        # Get model response
        response = client.chat.completions.create(
            model=supervisor_agent.model,
            messages=[{"role": "system", "content": supervisor_agent.instructions}] + messages,
            tools=tool_schemas or None,
            tool_choice="auto",
        )
        
        if response.choices[0].message.tool_calls:
            print(response.choices[0].message.tool_calls)
        else:
            print("--------------------------------")
            print(response.choices[0].message.content)
            print("--------------------------------")
            break # escape the loop when there is no need for tool(function) call anymore
        
        # Add model response to messages
        messages.append(response.choices[0].message)
        
        # Add tool response to messages
        if response.choices[0].message.tool_calls:
            for tool_call in response.choices[0].message.tool_calls:
                tool_response = execute_tool_call(tool_call, tools, supervisor_agent.name)
                
                messages.append({
                    "role": "tool", 
                    "tool_call_id": tool_call.id, 
                    "content": tool_response
                })

In [57]:
run(client, messages, supervisor_agent)

[ChatCompletionMessageToolCall(id='call_td4d3gSzmIjXLckS2b5m0D8z', function=Function(arguments='{"paper_title":"ChunkKV - Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference"}', name='to_paper_search_agent'), type='function')]
Supervisor Agent: to_paper_search_agent({'paper_title': 'ChunkKV - Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference'})
[ChatCompletionMessageToolCall(id='call_zg8EX15mTG0LBnGf4Op3EcrA', function=Function(arguments='{"paper_url":"https://arxiv.org/pdf/2502.00299"}', name='to_download_and_parse_paper_agent'), type='function')]
Supervisor Agent: to_download_and_parse_paper_agent({'paper_url': 'https://arxiv.org/pdf/2502.00299'})
--------------------------------
The paper titled "ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference" presents a new method for reducing memory costs in long-context inference with Large Language Models (LLMs). Here’s a comprehensive summa

As you can see, everything now works as expected with the actual function implementations.

Even though this tutorial is going super smooth, it is worth noting that this tutorial was written on top of lots of tials and errors of prompt engineering on system prompt, docstrings, etc,. Hence, keep in mind to start with dummy implementation to secure your budget. After you are confident with your dummy example, you can safely move on to the actual implementation. This will significantly reduce the cost.

## Try the same with Upstage's Solar-Pro 

We tried out the Triage Agentic system with OpenAI's GPT-4o. We can easily switch to Upstage's Solar model. The only thing you should make changes is the API endpoint and the API key. Other than that, all remains the same. Check out the below code snippet and the outputs.

In [59]:
supervisor_agent.model = "solar-pro"

client = OpenAI(
    base_url="https://api.upstage.ai/v1",
    api_key=UPSTAGE_API_KEY
)

messages = [
    {
        "role": "user",
        "content": "Provide a comprehensive summary of the paper, "
                   "'ChunkKV - Semantic-Preserving KV Cache Compression "
                   "for Efficient Long-Context LLM Inference' on arXiv. "
    },
]

run(client, messages, supervisor_agent)

[ChatCompletionMessageToolCall(id='3e45e34a-a053-4b8a-9425-2fb99aa4ff95', function=Function(arguments='{"paper_title":"ChunkKV - Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference"}', name='to_paper_search_agent'), type='function')]
Supervisor Agent: to_paper_search_agent({'paper_title': 'ChunkKV - Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference'})
[ChatCompletionMessageToolCall(id='82640c4a-0263-4b0f-b646-2d52a50080ec', function=Function(arguments='{"paper_url":"https://arxiv.org/pdf/2502.00299"}', name='to_download_and_parse_paper_agent'), type='function')]
Supervisor Agent: to_download_and_parse_paper_agent({'paper_url': 'https://arxiv.org/pdf/2502.00299'})
--------------------------------
The paper 'ChunkKV - Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference' presents a novel approach to KV cache compression that aims to preserve semantic information in the cache. The authors propose a 

## Using Document Parse as a Tool

Previously, the `to_download_and_parse_paper_agent()` function simply truncated the content of PDF by 10,000 characters. Since PDF has some meta information to keep the document as PDF format, this would have some impact on quality and token usage. 

Instead, we are going to modify the `to_download_and_parse_paper_agent()` to use Upstage's Document Parse. Document Parse is known to be the best document analyzer, and we can extract information of any kind of documents into various formats (HTML, Markdown, etc,.). We can hope to lower the token usage and to increase accuracy of the document analysis with Document Parse.

Additionally, in this section, we are going to have `truncate_tokens_if_needed()` helpful function. Since some LLMs have limited context length support (Solar offers 32K context length), a full length of a single PDF can't be fit. If we pass more tokens than what model could handle, that would crash the program. Hence, we have to manually fill up the full context length by calculating the number of tokens in a given prompt. 

In [11]:
# Run this cell if you are using macos

import os
os.environ["PATH"] = "/opt/homebrew/bin/:" + os.environ["PATH"]

Here is what `truncate_tokens_if_needed()` function does:
- It applies the `apply_chat_template()` function of the given tokenizer to the messages. `apply_chat_template()` function correctly translates(encodes) the whole messages into tokens with special tokens. Upstage provides the tokenizer as open source on Hugging Face, hence we can simply load it up with `transformers` library as below:

```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("upstage/solar-pro-preview-instruct")
```

- It truncates the encoded tokens up to the allowed `max_token_limit` which is 32,000 for Solar model.
- It decodes the tokens back to string and returns it.

In [61]:
def truncate_tokens_if_needed(tokenizer, messages, content, max_token_limit=32000):
    """
    Truncate the markdown content if the total tokens exceed the maximum limit.
    
    Args:
        tokenizer: The tokenizer to use for encoding/decoding
        messages: List of message dictionaries for the conversation
        content: The markdown content to potentially truncate
        max_token_limit: Maximum token limit (default: 32000)
        
    Returns:
        truncated_markdown: The potentially truncated markdown
        base_token_numbers: Number of tokens in the base conversation
        paper_token_numbers: Number of tokens in the paper after potential truncation
    """
    inputs = tokenizer.apply_chat_template(
        [
            {"role": "system", "content": supervisor_agent.instructions}
        ] + messages
    )
    base_token_numbers = len(inputs)
    encoded_content = tokenizer.encode(content)
    paper_token_numbers = len(encoded_content)

    print(f"Base token numbers: {base_token_numbers}")
    print(f"Paper token numbers: {paper_token_numbers}")
    print(f"Total token numbers: {base_token_numbers + paper_token_numbers}")

    total_token_numbers = base_token_numbers + paper_token_numbers

    if total_token_numbers > max_token_limit:
        # Calculate how many tokens we need to truncate
        tokens_to_keep = max_token_limit - base_token_numbers
        # Truncate the encoded markdown
        encoded_content = encoded_content[:tokens_to_keep]
        # Update the paper token count
        paper_token_numbers = len(encoded_content)
        # Update the markdown string by decoding the truncated tokens
        truncated_content = tokenizer.decode(encoded_content)
        print(f"Truncated paper tokens to: {paper_token_numbers}")
    else:
        print("No truncation needed")
        truncated_content = content

    print(f"Total token numbers: {base_token_numbers + paper_token_numbers}")
    return truncated_content, base_token_numbers, paper_token_numbers

When using Upstage's Document Parse, we need to deal with some edge cases. That is, sometime Document parse fails at parsing a given document, and it does not return any intermediate results but just failure message.

Imagine that you are makeing an API call to Document Parse with 30 pages long document. If that process fails, you are not only going to lose $0.01 dollars per page but also lose your valuable time. To prevent that akward moment happening, we can simply split the whole document into separate files which contain a single page each. `split_pdf_by_pages()` is a helper function for that purpose.

Additionally, we have defined `get_document_parse_response()` wrapper function to make an API call to Document Parse easily. This is to keep the logic as simple as possible.

Finally, `to_download_and_parse_paper_agent()` is modified to call `split_pdf_by_pages()` to split whole document into series of multiple files, `get_document_parse_response()` to parse the document each file by file, then `truncate_tokens_if_needed()` to truncate the contents up to the context length that Solar model supports.

In [65]:
import json
import shutil
import requests
from PyPDF2 import PdfReader, PdfWriter

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("upstage/solar-pro-preview-instruct")
message_template = [
    {
        "role": "user",
        "content": "Provide a comprehensive summary of the paper below, \n"
    },
]

def split_pdf_by_pages(input_pdf_path, root_path, pages_per_pdf=10):
    # Open the PDF
    pdf = PdfReader(input_pdf_path)
    total_pages = len(pdf.pages)
    
    # Calculate number of output PDFs needed
    num_pdfs = (total_pages + pages_per_pdf - 1) // pages_per_pdf
    
    output_paths = []
    
    # Create output directory using input filename
    base_name = os.path.splitext(input_pdf_path)[0]
    os.makedirs(base_name, exist_ok=True)
    
    # Split into multiple PDFs
    for i in range(num_pdfs):
        writer = PdfWriter()
        
        # Calculate start and end pages for this split
        start_page = i * pages_per_pdf
        end_page = min((i + 1) * pages_per_pdf, total_pages)
        
        # Add pages to writer
        for page_num in range(start_page, end_page):
            writer.add_page(pdf.pages[page_num])
            
        # Save the split PDF
        output_path = f"{root_path}/{i+1}.pdf"
        with open(output_path, "wb") as output_file:
            writer.write(output_file)
        output_paths.append(output_path)
        
    return output_paths

def get_document_parse_response(filename, api_key):
    url = "https://api.upstage.ai/v1/document-ai/document-parse"

    headers = {"Authorization": f"Bearer {api_key}"}
    files = {"document": open(filename, "rb")}
    data = {"output_formats": "['markdown']"}

    response = requests.post(url, headers=headers, files=files, data=data)
    upstage_response = json.loads(response.text)
    return upstage_response

def to_download_and_parse_paper_agent(paper_url: str):
    """Use this to download and parse paper only when paper URL is found."""
    response = requests.get(paper_url)
    # Save the PDF to a temporary file
    root_path = "tmp"
    temp_pdf_path = "temp_paper.pdf"
    with open(temp_pdf_path, "wb") as f:
        f.write(response.content)

    shutil.rmtree(root_path, ignore_errors=True)
    os.makedirs(root_path, exist_ok=True)

    split_factor = 1
    split_pdfs = split_pdf_by_pages(temp_pdf_path, root_path, split_factor) # by 10

    markdown = ""
    total_responses = []
    for i, split_pdf in enumerate(split_pdfs):
        upstage_response = get_document_parse_response(split_pdf, UPSTAGE_API_KEY)
        
        # Append the response to the total_responses list
        total_responses.append({f"page_{i+1 * split_factor}": upstage_response})        
        # Also write the response to a JSON file for persistence
        json_output_path = f"{root_path}/response_{i+1}.json"
        with open(json_output_path, "w") as json_file:
            json.dump(upstage_response, json_file, indent=2)

        try:
            markdown += upstage_response['content']['markdown']
        except KeyError:
            pass

    markdown = "Retrieved Paper Content\n-----------------------------------\n" + markdown
    markdown, _, _ = truncate_tokens_if_needed(tokenizer, message_template, markdown)
    return markdown

Let's see the whole process in action with Upstage's Solar and Document Parse!

In [66]:
supervisor_agent = Agent(
    name="Supervisor Agent",
    instructions=(
        "You are a academic paper analyzer. "
        "- Basiclly, you don't have knowledge of the requested paper."
        "- Hence, you need to use the provided tools to get the paper information from the internet. "
        "- Your job is to find appropriate tool to transfer to based on the user's request and results of tool calls. "
        "- If enough information is collected to complete the user request, you should say directly answer to the user request. "
    ),
    tools=[to_paper_search_agent, to_download_and_parse_paper_agent]#, to_paper_analysis_agent]#, to_triage, to_end_agent],
)

In [67]:
supervisor_agent.model = "solar-pro"

client = OpenAI(
    base_url="https://api.upstage.ai/v1",
    api_key=UPSTAGE_API_KEY
)

messages = [
    {
        "role": "user",
        "content": "Provide a comprehensive summary of the paper, "
                   "'ChunkKV - Semantic-Preserving KV Cache Compression "
                   "for Efficient Long-Context LLM Inference' on arXiv. "
    },
]

run(client, messages, supervisor_agent)

[ChatCompletionMessageToolCall(id='9521a151-6c08-463b-a377-2e2c0df1563d', function=Function(arguments='{"paper_title":"ChunkKV - Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference"}', name='to_paper_search_agent'), type='function')]
Supervisor Agent: to_paper_search_agent({'paper_title': 'ChunkKV - Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference'})
[ChatCompletionMessageToolCall(id='f85864f5-f429-46dd-a900-cc97f7773ed8', function=Function(arguments='{"paper_url":"https://arxiv.org/pdf/2502.00299"}', name='to_download_and_parse_paper_agent'), type='function')]
Supervisor Agent: to_download_and_parse_paper_agent({'paper_url': 'https://arxiv.org/pdf/2502.00299'})


Token indices sequence length is longer than the specified maximum sequence length for this model (53672 > 4096). Running this sequence through the model will result in indexing errors


Base token numbers: 115
Paper token numbers: 53672
Total token numbers: 53787
Truncated paper tokens to: 31885
Total token numbers: 32000
--------------------------------
The assistant summarizes the key findings from the experiments and analysis conducted on the ChunkKV method, which is a novel KV cache compression technique for efficient long-context inference in large language models (LLMs). The method retains the most informative semantic chunks from the original KV cache, leading to improved performance compared to existing methods. The experiments were conducted on various LLMs and benchmarks, demonstrating the effectiveness of ChunkKV in preserving essential contextual information for complex reasoning tasks, long-context understanding, and safety evaluations. The method’s chunk-based approach maintains crucial contextual information, leading to superior performance in challenging scenarios and benchmarks. The proposed layer-wise index reuse technique provides significant comput