## GPT OSS test agent  
  
This notebook uses `ollama` to run the OpenAI model `gpt-oss:20b` locally on a Windows system with an NVIDIA RTX 5090 GPU.  
  
**Objective**  
The goal is to test the agentic capability of the model by prompting it to use a predefined tool and observing the outcome.  

**Outcome**  
 ...ongoing 

**Setup**  
1. Load the local model by running the following in a terminal: `ollama run gpt-oss:20b`
2. Run this notebook in a venv with the libraries from `requirements.txt` installed.  
  
**Workflow**  
1. Load the model and have it generate a system prompt. Evaluate the prompt and propose a revised version.
2. Let the model use tools: define a tool and let the model call it.
  
**References**  
Source 1: [Ollama Docs](https://docs.ollama.com/capabilities/tool-calling#python-2)  
Source 2: [OpenAI Cookbook](https://cookbook.openai.com/articles/gpt-oss/run-locally-ollama)  

In [1]:
import sys
print(sys.executable)

c:\repo\LLM\venv\Scripts\python.exe


In [None]:
# !pip install -r requirements.txt

In [2]:
!python --version

Python 3.13.7


### 1. Ollama SDK 
In this section we use the `ollama` Python SDK.

In [3]:
from ollama import chat

#### 1.1 Ask the model to define the optimal prompt

In [43]:
# Define the message to send to the model
message = """
        Propose the optimal prompt for the gpt-oss:20b model.
        The prompt should be useful for agentic tasks with tool calling.
        The response should be in markdown format.
        """
print(message)


        Propose the optimal prompt for the gpt-oss:20b model.
        The prompt should be useful for agentic tasks with tool calling.
        The response should be in markdown format.
        


#### Call the model and activate thinking  
Thinking is required for agentic capabilities, e.g. tool usage.

In [None]:
# model call 1

# Define the chat interaction with the model. Tool calling needs thinking enabled.
response = chat(
    model='gpt-oss:20b',
    messages=[{'role': 'user', 'content': message}],
    think='high',  # usually True/False; gpt-oss takes a level instead: "low", "medium", "high"
    stream=False,
)

In [None]:
# ... Print outputs 1. These can be massive!

# print('Thinking: \n', response.message.thinking, '\n')
# print('Answer: \n', response.message.content)
print(len(response.message.thinking))
print(len(response.message.content))

13216
3939


#### 1.2 Evaluate initial results  
Call the model again and have it evaluate the first response and point out what could be improved.  

In [51]:
# Call the model to evaluate the initial system prompt
eval_message = f"""
        Evaluate the following prompt for the gpt-oss:20b model.
        The prompt is delimited by triple backticks.
        Provide a critique of the prompt and suggest improvements.
        Respond in markdown format.
        
        ```{response.message.content}```
        """
# print(eval_message)

In [52]:
# model call 2

# Keep the prompt from model call 1; `response` is overwritten by the call below
initial_prompt = response.message.content

# Evaluate the prompt generated in model call 1
response = chat(
    model='gpt-oss:20b',
    messages=[{'role': 'user', 'content': eval_message}],
    think='high',  # usually True/False; gpt-oss takes a level instead: "low", "medium", "high"
    stream=False,
)

In [None]:
# ... Print outputs 2.

# print('Thinking: \n', response.message.thinking, '\n')
# print('Answer: \n', response.message.content)
print(len(response.message.thinking))
print(len(response.message.content))

14949
5864


In [55]:
# Store the evaluation result as a Python variable
eval_result = response.message.content

#### 1.3 Improve the initial results using the evaluation information  
Call the model once more and have it improve the original prompt using the evaluation.

In [57]:
# Call the model to improve the original prompt based on the evaluation
improve_message = f"""
        Improve the following prompt for the gpt-oss:20b model based on the evaluation.
        The prompt is delimited by triple backticks.
        The evaluation is delimited by triple tildes.
        Respond in markdown format.
        
        ```{initial_prompt}```
        
        ~~~
        {eval_result}
        ~~~
        """

In [58]:
# model call 3

# Improve the original prompt generated in model call 1 with the evaluation result from model call 2
response = chat(
    model='gpt-oss:20b',
    messages=[{'role': 'user', 'content': improve_message}],
    think='high',  # usually True/False; gpt-oss takes a level instead: "low", "medium", "high"
    stream=False,
)

In [59]:
# ... Print outputs 3.

# print('Thinking: \n', response.message.thinking, '\n')
# print('Answer: \n', response.message.content)
print(len(response.message.thinking))
print(len(response.message.content))

16871
3366
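The three calls above (generate, evaluate, improve) can also be folded into one function that keeps each intermediate result under its own name. This is only a sketch: `refine_prompt`, `ask` and `fake_ask` are hypothetical names, and `ask` stands for any callable that sends a prompt to the model and returns the answer text (with Ollama it could wrap `chat(...)` and return `response.message.content`).

```python
from typing import Callable


def refine_prompt(ask: Callable[[str], str], task: str) -> dict:
    """Run generate -> evaluate -> improve and keep every intermediate result."""
    # Step 1: let the model draft a prompt
    draft = ask(task)
    # Step 2: let the model critique its own draft
    critique = ask(
        "Evaluate the following prompt and suggest improvements.\n\n"
        f"PROMPT:\n{draft}"
    )
    # Step 3: feed both the draft and the critique back in
    improved = ask(
        "Improve the prompt using the evaluation.\n\n"
        f"PROMPT:\n{draft}\n\nEVALUATION:\n{critique}"
    )
    return {"draft": draft, "critique": critique, "improved": improved}


# Stub model for a dry run without a running Ollama server (assumption for this example)
def fake_ask(prompt: str) -> str:
    return f"reply to a prompt of {len(prompt)} characters"


steps = refine_prompt(fake_ask, "Propose a prompt for agentic tool calling.")
print(list(steps))
```

Keeping `draft`, `critique` and `improved` separate means no step depends on a variable that a later call overwrites.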


#### 1.4 Store the improved prompt as a file

In [None]:
# Store the final output with the revised prompt as a .txt file
with open('system_prompt_revised.txt', 'w', encoding='utf-8') as f:
    f.write(response.message.content)

#### 1.5 Conclusion of Evaluation and Improvement of System Prompt  
The evaluation and improvement steps generate a different result on every run, so it is not clear what the optimal system prompt is. `response.message.thinking` and `response.message.content` can be quite large (thousands of characters in the runs above), which is why the printouts are commented out and only the lengths are printed.
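A middle ground between printing everything and printing only the length is a truncated preview. The helper below is a small sketch; `preview` and its `limit` parameter are names made up for this example.

```python
from typing import Optional


def preview(text: Optional[str], limit: int = 300) -> str:
    """Return at most `limit` characters of `text` and note how much was cut."""
    if not text:
        return "<empty>"
    if len(text) <= limit:
        return text
    omitted = len(text) - limit
    return f"{text[:limit]}... [{omitted} more characters]"


# Dry run with a dummy string; in the notebook this could be response.message.thinking
sample = "x" * 1000
print(preview(sample, limit=20))
```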

### 2. Model Using Tools  
  
Define tools for the LLM.  
The following factors improve tool usage accuracy, ranked by importance:
1. Clear parameter names
2. Clear docstring
3. Simple signature
4. Stable return format (JSON-serializable)
5. Clean error handling
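
As a rough illustration of points 4 and 5, the sketch below returns a JSON-serializable dict on both success and failure instead of raising. The name `safe_square` and the `status`/`result`/`error` keys are assumptions made for this example; they are not part of the tool defined in this notebook or of the Ollama SDK.

```python
import json


def safe_square(number: int) -> dict:
    """Square an integer and report the outcome as a JSON-serializable dict.

    Args:
        number (int): The integer to square.

    Returns:
        dict: {"status": "success", "result": int} or
              {"status": "error", "error": str}.
    """
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(number, bool) or not isinstance(number, int):
        return {"status": "error", "error": "number must be an integer"}
    return {"status": "success", "result": number * number}


# Both outcomes serialize cleanly and could be used as the content of a tool message
print(json.dumps(safe_square(12)))
print(json.dumps(safe_square("12")))
```

Returning the error as data lets the model see what went wrong and retry, whereas an uncaught exception would stop the notebook cell.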

#### 2.1 Define Tool

In [9]:
# Define tool 1
def square_number(number: int) -> int:
    """Compute the square of a single integer.
    Args:
        number (int): An integer value to be squared.

    Returns:
        result (int): The square of the input number.
    """
    if not isinstance(number, int):
        raise ValueError("number must be an integer.")
    result = number * number
    
    return result

square_number(7612)

57942544
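
Clear parameter names, type hints and a docstring matter because a tool description for the model can be derived directly from them. The sketch below shows the general idea using only the standard library; `describe_tool` and `JSON_TYPES` are hypothetical names, and this is not the SDK's actual implementation or its exact schema.

```python
import inspect

# Minimal mapping from Python annotations to JSON Schema type names (assumption)
JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}


def describe_tool(func) -> dict:
    """Build a simple tool description from a function's signature and docstring."""
    signature = inspect.signature(func)
    properties = {}
    required = []
    for name, param in signature.parameters.items():
        # Fall back to "string" for annotations the mapping does not know
        properties[name] = {"type": JSON_TYPES.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    doc = inspect.getdoc(func) or ""
    return {
        "name": func.__name__,
        "description": doc.splitlines()[0] if doc else "",
        "parameters": {"type": "object", "properties": properties, "required": required},
    }


def square_number(number: int) -> int:
    """Compute the square of a single integer."""
    return number * number


print(describe_tool(square_number))
```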

In [None]:
# use revised system prompt
SYSTEM_PROMPT = """
You are the GPT‑Oss:20b Agent. Follow the workflow and rules below exactly.

## Tool‑Calling Format

**Tool Call** (sent to system) – a JSON object with the exact schema.  
Output this JSON object as your entire response to a tool call request.

    {
      "name": "search",
      "arguments": { ... },
      "id": "<UUID>"
    }

**Tool Response** (received from system) – the system will reply with JSON of this shape.

    {
      "name": "search",
      "return": <result or null>,
      "id": "<UUID>",
      "status": "success" | "error",
      "error": "<error message or null>"
    }

## Interaction Flow

1. **Plan** – Provide a concise plan in natural language, preceded by `Plan:`.  
   • Max 5 steps.  
   • Max 256 tokens for the entire plan.  

2. **Tool Call** – Output the JSON call defined above.  
   • Output the JSON object only; no surrounding text.  

3. **Observation** – Await the system’s JSON response.  

4. **Iterate** – Update your plan based on the observation.  
   • If a tool returned `status: "error"` or insufficient data, retry with a different query or a different tool.  

5. **Final Answer** – Output the final answer prefixed by `Final Answer:` and terminate.  
   • Do not output any JSON or tool calls after this point.

## Rules & Best Practices

- **Avoid hallucination** – never fabricate information.  
- **Confidence check** – if internal confidence > 0.8, skip tool calls.  
- **Tool efficiency** – call a tool only when confidence ≤ 0.8.  
- **Retry logic** – on `status: "error"` or insufficient data, re‑query or switch tools.  
- **Safety (execute_command)** – allowed commands: `ls`, `cat`, `echo`. Any other command is disallowed.  
- **No redundancy** – do not repeat the same tool call with identical arguments.  
- **Unique ID usage** – store each tool response keyed by its `id` for future reference.  
- **Adding tools** – if a required tool is missing, request it with:
    {
      "name": "request_add_tool",
      "tool": "<name>",
      "description": "<description>"
    }
  and wait for system approval.  
- **Plan size** – max 5 steps, 256 tokens.

## Glossary

- **Plan** – the assistant’s short‑term strategy, numbered or bulleted.  
- **Tool Call** – JSON object sent to invoke a tool.  
- **Observation** – JSON response received after a tool call.  
- **Confidence** – internal estimate (0–1) of knowledge accuracy.

## Tools

| Tool                | Purpose                                      | Argument Schema                          |
|---------------------|----------------------------------------------|------------------------------------------|
| `square_number`     | Compute the square of an integer             | `{ "number": <int> }`                    |
| `search`            | Retrieve information from an external source | `{ "query": "<string>" }` |
| `retrieve`          | Fetch a specific document by URL             | `{ "url": "<string>" }` |
| `retrieve_all`      | Fetch all documents for a query              | `{ "query": "<string>" }` |
| `execute_command`   | Run a safe shell command                     | `{ "command": "<string>" }` |

(If you need a tool not listed, use the `request_add_tool` schema described above.)

## Example Interaction

**User**  
What’s the current price of Bitcoin?

**Assistant**  
Plan:  
1. Retrieve the latest Bitcoin price.

    {
      "name": "search",
      "arguments": {
        "query": "latest Bitcoin price"
      },
      "id": "c1"
    }

**System**  
{
  "name": "search",
  "return": [
    {
      "url": "https://example.com/price",
      "snippet": "Bitcoin is currently $24,500"
    }
  ],
  "id": "c1",
  "status": "success",
  "error": null
}

**Assistant**  
Plan updated:  
1. Use the snippet to extract the numeric price and unit.  
2. Format the final answer.

Final Answer: The current price of Bitcoin is $24,500 USD.
"""

In [21]:
# user prompt
USER_PROMPT = """
Compute the square of 7612 using the available tool.
"""

#### 2.2 Perform First Model Call

In [None]:
# Define messages
messages=[
        {
            'role': 'system',
            'content': SYSTEM_PROMPT
        },
        {
            'role': 'user',
            'content': USER_PROMPT
        }
    ]

# Initial model call with revised system prompt and user prompt
response = chat(
    model='gpt-oss:20b',
    messages=messages,
    tools=[square_number],
    think="high",  # "low", "medium", "high" (gpt-oss specific)
    stream=False)
    
# Add the model's response to the messages list to keep the chat context up to date
messages.append(response.message)


# Debug: internal reasoning (optional)
if response.message.thinking:
    print("Thinking:\n", response.message.thinking, "\n")  # print the thinking process before tool calling

Thinking:
 The user: "Compute the square of 7612 using the available tool."

We have tool square_number. According to instructions: we must produce plan, then tool call, then final answer.

We should provide plan: step 1 compute square of 7612. Then tool call.

The tool call must be JSON:

{
  "name": "square_number",
  "arguments": {
    "number": 7612
  },
  "id": "<UUID>"
}

We need to generate a UUID. Probably we can create a placeholder. But we need to follow the tool call format exactly. It's an entire JSON object, no extra text. We need to produce unique ID. We can just produce e.g. "1" or something. But it's required to be unique across calls. We can produce e.g. "uuid-1". Use random. But let's generate a random-looking string. We can use "1234-5678-90ab-cdef". But the system might not check the actual format. But they say "unique ID usage – store each tool response keyed by its id for future reference." So we need a unique ID.

Let's produce "0f7a6b93-1c4b-4c3e-9e7b-1f2e3c4d5f

#### 2.3 Call Tool, Inject Results and Do Final Model Call

In [None]:
# Tool execution
if response.message.tool_calls:
    call = response.message.tool_calls[0] # get function call - function=Function(name='name', arguments={'number': 1234})
    # print(response.message.tool_calls[0]) # debug print
    
    result = square_number(**call.function.arguments) # run the tool with the arguments

    # Inject tool result
    messages.append({
        "role": "tool",
        "tool_name": call.function.name,
        "content": str(result)
    })

    # Final model call and response
    final_response = chat(
        model="gpt-oss:20b",
        messages=messages,
        tools=[square_number],
        think="high"
    )

    print(final_response.message.content)

Final Answer: 57,942,544


In [None]:
# Check what tool call was made
print(response.message.tool_calls[0])

function=Function(name='square_number', arguments={'number': 7612})


#### 2.4 Conclusion of Tool Calling  
The model completed the task with one example tool: it called `square_number` with `{'number': 7612}` and returned the correct result, 57,942,544. Calling multiple tools would be a natural next step.
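
A possible starting point for the multi-tool case is a registry that maps tool names to functions and loops over every requested call. This is a sketch under assumptions: `add_numbers`, `TOOLS` and `run_tool_calls` are hypothetical, and the `SimpleNamespace` objects only imitate the `call.function.name` / `call.function.arguments` shape printed above so the example runs without a model.

```python
from types import SimpleNamespace


def square_number(number: int) -> int:
    """Compute the square of a single integer."""
    return number * number


def add_numbers(a: int, b: int) -> int:
    """Add two integers (hypothetical second tool)."""
    return a + b


# Registry: tool name -> callable
TOOLS = {func.__name__: func for func in (square_number, add_numbers)}


def run_tool_calls(tool_calls) -> list:
    """Execute each requested tool and return tool messages to append to the chat."""
    tool_messages = []
    for call in tool_calls or []:
        name = call.function.name
        func = TOOLS.get(name)
        if func is None:
            content = f"error: unknown tool '{name}'"
        else:
            try:
                content = str(func(**call.function.arguments))
            except Exception as exc:  # report the failure to the model instead of crashing
                content = f"error: {exc}"
        tool_messages.append({"role": "tool", "tool_name": name, "content": content})
    return tool_messages


# Stand-ins shaped like the tool call printed above (assumption for a dry run)
fake_calls = [
    SimpleNamespace(function=SimpleNamespace(name="square_number", arguments={"number": 7612})),
    SimpleNamespace(function=SimpleNamespace(name="add_numbers", arguments={"a": 2, "b": 3})),
]
for tool_message in run_tool_calls(fake_calls):
    print(tool_message)
```

In the notebook, the returned messages would be added with `messages.extend(...)` before the final `chat(...)` call, with both functions passed in `tools=[...]`.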