## GPT OSS test agent  
  
This notebook uses `ollama` to run the OpenAI model `gpt-oss:20b` locally on a Windows system with an NVIDIA RTX 5090 GPU.  
  
**Objective**  
The goal is to test the model's agentic capability by prompting it to use a predefined tool and observing the outcome.  

**Outcome**  
 ...ongoing 

**Setup**  
1. Load the local model by running the following command in a terminal: `ollama run gpt-oss:20b`
2. Run this notebook in a virtual environment (venv) with the libraries from `requirements.txt` installed.  
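
Before running the cells, it can help to confirm that the local server actually has the model. The following is a minimal sketch, assuming Ollama's default local address `http://localhost:11434` and its `/api/tags` endpoint, which lists locally available models. The helper names (`model_names`, `has_model`, `fetch_tags`) are illustrative and not part of this notebook.

```python
import json
from urllib.request import urlopen

OLLAMA_URL = "http://localhost:11434"  # assumption: default local Ollama address


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from an /api/tags style payload."""
    return [m.get("name", "") for m in tags_payload.get("models", [])]


def has_model(tags_payload: dict, wanted: str) -> bool:
    """Return True if `wanted` (e.g. 'gpt-oss:20b') is among the local models."""
    return wanted in model_names(tags_payload)


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> dict:
    """Query the running server; raises URLError if it is not reachable."""
    with urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


# Offline example with a hand-written payload of the assumed shape
sample = {"models": [{"name": "gpt-oss:20b"}]}
print(has_model(sample, "gpt-oss:20b"))
```

With the server running, `has_model(fetch_tags(), "gpt-oss:20b")` would perform the same check against the live model list.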
  
**References**  
Source 1: [Ollama Docs](https://docs.ollama.com/capabilities/tool-calling#python-2)  
Source 2: [OpenAI Cookbook](https://cookbook.openai.com/articles/gpt-oss/run-locally-ollama)  

In [1]:
import sys
print(sys.executable)

c:\repo\LLM\venv\Scripts\python.exe


In [None]:
# !pip install -r requirements.txt

In [3]:
!python --version

Python 3.13.7


### 1. Ollama SDK 
In this section we use the `ollama` Python SDK.

In [1]:
from ollama import chat

#### 1.1 Ask the model to propose an optimal prompt

In [3]:
# Define the message to send to the model
message = """
        Propose the optimal prompt for the gpt-oss:20b model.
        The prompt should be useful for agentic tasks with tool calling.
        The response should be in markdown format.
        """
print(message)


        Propose the optimal prompt for the gpt-oss:20b model.
        The prompt should be useful for agentic tasks with tool calling.
        The response should be in markdown format.
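
The printed output shows that the indentation of the triple-quoted string is sent to the model as part of the message. One way to avoid that, sketched here with the standard library's `textwrap.dedent`, is to strip the common leading whitespace before sending. This is an optional variation, not what the cell above does.

```python
import textwrap

# Same request as above, dedented so the model does not receive
# the leading indentation of the source code.
message = textwrap.dedent("""
    Propose the optimal prompt for the gpt-oss:20b model.
    The prompt should be useful for agentic tasks with tool calling.
    The response should be in markdown format.
    """).strip()
print(message)
```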
        


#### Call the model and activate thinking  
Thinking is required for agentic capabilities such as tool usage.
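
To make "tool usage" concrete: in a tool-calling loop the model returns the name and arguments of a tool, and the calling code runs that tool and sends the result back. The sketch below covers only the local dispatch step and calls no model. It assumes tool calls arrive as dicts of the form `{"function": {"name": ..., "arguments": {...}}}` and that results go back as `tool` role messages with a `tool_name` field; both shapes are assumptions modelled on the Ollama tool-calling docs. `get_temperature`, `TOOLS`, and `run_tool_calls` are hypothetical names.

```python
from typing import Any, Callable


def get_temperature(city: str) -> str:
    """Hypothetical tool: returns a canned temperature for a city."""
    temperatures = {"London": "15 C", "Tokyo": "22 C"}
    return temperatures.get(city, "unknown")


# Registry mapping tool names to Python callables
TOOLS: dict[str, Callable[..., Any]] = {"get_temperature": get_temperature}


def run_tool_calls(tool_calls: list[dict]) -> list[dict]:
    """Execute each requested tool and build 'tool' role messages with the results."""
    messages = []
    for call in tool_calls:
        name = call["function"]["name"]
        args = call["function"].get("arguments", {})
        func = TOOLS.get(name)
        # Unknown tools are reported back instead of raising, so the model can recover
        result = func(**args) if func else f"error: unknown tool '{name}'"
        messages.append({"role": "tool", "tool_name": name, "content": str(result)})
    return messages


# Example with a hand-written tool call of the assumed shape
calls = [{"function": {"name": "get_temperature", "arguments": {"city": "Tokyo"}}}]
print(run_tool_calls(calls))
```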

In [None]:
# model call 1

# Send the message to the model. `think` accepts True/False for most models;
# gpt-oss takes a level instead: "low", "medium", or "high".
response = chat(
    model='gpt-oss:20b',
    messages=[{'role': 'user', 'content': message}],
    think='high',
    stream=False,
)

print('Thinking: \n', response.message.thinking, '\n')
print('Answer: \n', response.message.content)

Thinking: 
 We need to propose an optimal prompt for the "gpt-oss:20b" model. This is a prompt that would be useful for agentic tasks with tool calling. The response should be in markdown format. The user asks: "Propose the optimal prompt for the gpt-oss:20b model. The prompt should be useful for agentic tasks with tool calling. The response should be in markdown format."

Thus we need to propose a prompt. The prompt should instruct the model on how to perform agentic tasks, i.e., tasks where it interacts with tools (like a tool kit) and may call them. The prompt should define how the model should plan, decide what tool to call, handle tool outputs, etc.

We should produce a markdown block with the prompt. We should not produce code but a prompt. The prompt should likely be a set of instructions that includes a description of the environment, the role of the model, and guidelines for tool usage.

We need to produce "the optimal prompt" for the gpt-oss:20b model, i.e., a prompt that lea

In [5]:
print(response.message.content)

```markdown
## GPT‑Oss:20b Agent Prompt – Tool‑Calling Workflow

### 1. System Message (to be set as the *system* role)

```
You are an advanced AI assistant built to solve user queries by invoking external tools when needed.  
Your job is to:
1. **Plan** the steps required to answer the user’s request.  
2. **Call** the appropriate tool(s) using a clear JSON format.  
3. **Interpret** the tool’s response (observation) and update your plan.  
4. **Iterate** until you can produce a final, accurate answer.  
5. Output the final answer as plain text, prefixed with `Final Answer:`.  

**Do NOT** hallucinate; use a tool whenever the answer is uncertain, outdated, or requires external data.  
Always produce a short, concise plan first before any tool calls.  
Keep the plan in natural language, and only use JSON when issuing or receiving tool calls.  

**Tool‑Calling Format (Assistant → System)**
```json
{
  "name": "tool_name",
  "arguments": { ... }
}
```

**Tool‑Response Format (System → A

#### 1.2 Evaluate initial results  
Call the model again and have it evaluate the first response and point out what could be improved.  

In [None]:
# Build the message asking the model to evaluate the initial system prompt
eval_message = f"""
        Evaluate the following prompt for the gpt-oss:20b model.
        The prompt is delimited by triple backticks.
        Provide a critique of the prompt and suggest improvements.
        Respond in markdown format.
        
        ```{response.message.content}```
        """
# print(eval_message)
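
The generated prompt itself contains triple backticks, so wrapping it in triple backticks leaves the boundaries ambiguous. A minimal sketch of one workaround is shown below: pick a fence that is longer than any backtick run inside the text. `fence_for` and `wrap` are hypothetical helpers, not part of the notebook or the `ollama` SDK.

```python
import re

TICKS = "`" * 3  # a plain triple-backtick fence


def fence_for(text: str, char: str = "`", minimum: int = 3) -> str:
    """Return a run of `char` longer than any run of it inside `text`."""
    runs = re.findall(re.escape(char) + "+", text)
    longest = max((len(run) for run in runs), default=0)
    return char * max(minimum, longest + 1)


def wrap(text: str) -> str:
    """Wrap `text` in a fence that cannot be closed early by its own content."""
    fence = fence_for(text)
    return f"{fence}\n{text}\n{fence}"


# Content that already contains triple-backtick fences
sample = "Intro\n" + TICKS + "json\n{}\n" + TICKS + "\nOutro"
print(wrap(sample))
```

The instruction text would then refer to the delimiter generically (for example "the prompt is enclosed in the outermost code fence") rather than naming triple backticks.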

In [8]:
# model call 2

# Evaluate the prompt generated in model call 1.
# `think` takes a level for gpt-oss: "low", "medium", or "high".
response = chat(
    model='gpt-oss:20b',
    messages=[{'role': 'user', 'content': eval_message}],
    think='high',
    stream=False,
)

print('Thinking: \n', response.message.thinking, '\n')
print('Answer: \n', response.message.content)

Thinking: 
 The user asks: "Evaluate the following prompt for the gpt-oss:20b model. The prompt is delimited by triple backticks. Provide a critique of the prompt and suggest improvements. Respond in markdown format."

They provide a large prompt in the triple backticks. The prompt is an agent prompt for GPT-OSS:20b model, with system message, available tools, interaction flow, rules, best practices, and an example interaction.

We need to produce a critique of this prompt: evaluate its strengths and weaknesses, maybe identify missing aspects, potential improvements in clarity, format, usage, etc. Provide suggestions for improvements. Format the answer in markdown. We should maintain the required format: Use markdown formatting (heading, bullet points, code blocks, etc.)

We should treat the given prompt as a system message that the user wants to evaluate.

We need to produce a critique and suggestions.

Important: We must not produce the final answer in plain text, we need to respond 

In [9]:
print(response.message.content)

# Critique of the GPT‑Oss:20b Agent Prompt

## 1. Overview

The prompt is a well‑structured, agentic system message that guides the model through a planning–tool‑calling–execution loop. It clearly defines the tool‑calling format, interaction flow, and a short example. Below are the strengths, weaknesses, and concrete suggestions to make it even more robust.

---

## 2. Strengths

| Aspect | What’s Good |
|--------|-------------|
| **Clear Roles** | Explicitly separates *system* and *assistant* responsibilities. |
| **Structured Workflow** | Step‑by‑step flow (Plan → Tool Call → Observation → Iterate → Final Answer). |
| **JSON Formats** | Both call and response JSON are shown with examples. |
| **Tool Descriptions** | Table lists tools, their purpose, and argument schema. |
| **Rules & Best Practices** | Covers hallucination prevention, token economy, and redundancy avoidance. |
| **Example Interaction** | Demonstrates a real call and the expected observation. |
| **Modularity** | Easy

In [10]:
# Store the evaluation result in a Python variable
eval_result = response.message.content

#### 1.3 Improve the initial results using the evaluation information  
Call the model once again and have it improve the original prompt by using the evaluation information. 

In [12]:
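# Build the message asking the model to improve the prompt based on the evaluation.
# Note: `response` was reassigned in model call 2, so `response.message.content`
# below holds the evaluation text (the same as `eval_result`), not the prompt
# generated in model call 1.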
improve_message = f"""
        Improve the following prompt for the gpt-oss:20b model based on the evaluation.
        The prompt is delimited by triple backticks.
        The evaluation is delimited by triple tildes.
        Respond in markdown format.
        
        ```{response.message.content}```
        
        ~~~
        {eval_result}
        ~~~
        """

In [13]:
# model call 3

# Improve the original prompt generated in model call 1 with the evaluation result from model call 2
response = chat(
    model='gpt-oss:20b',
    messages=[{'role': 'user', 'content': improve_message}],
    think='high',  # gpt-oss takes a level: "low", "medium", or "high"
    stream=False,
)

print('Thinking: \n', response.message.thinking, '\n')
print('Answer: \n', response.message.content)

Thinking: 
 We need to "Improve the following prompt for the gpt-oss:20b model based on the evaluation." The prompt is the text between triple backticks. The evaluation is the text between triple tildes. We need to respond in markdown format. The task: improve the prompt. So we need to produce a revised prompt that addresses the weaknesses and implements the suggested improvements. Essentially produce an improved system message that meets the guidelines, using the evaluation's suggestions.

We need to output a markdown formatted answer. The answer should contain the improved prompt. The improved prompt should incorporate improvements: no nested backticks, explicit Plan header, JSON schema, example fully formatted, no nested fences, use single backticks for inline code, add retry logic, confidence threshold, observation fields, plan length limit, track tool calls, support tool chaining, safety for execute_command, unique_id usage, maybe a glossary, version number. We also need to embed 

In [14]:
print(response.message.content)

### Revised GPT‑Oss:20b Agent System Prompt  
*(Prompt‑v1.0.0)*

```
You are the GPT‑Oss:20b Agent. Follow the workflow and rules below exactly.

## Tool‑Calling Format

**Tool Call** (sent to system) – a JSON object with the exact schema.  
Output this JSON object as your entire response to a tool call request.

    {
      "name": "search",
      "arguments": { ... },
      "id": "<UUID>"
    }

**Tool Response** (received from system) – the system will reply with JSON of this shape.

    {
      "name": "search",
      "return": <result or null>,
      "id": "<UUID>",
      "status": "success" | "error",
      "error": "<error message or null>"
    }

## Interaction Flow

1. **Plan** – Provide a concise plan in natural language, preceded by `Plan:`.  
   • Max 5 steps.  
   • Max 256 tokens for the entire plan.  

2. **Tool Call** – Output the JSON call defined above.  
   • Output the JSON object only; no surrounding text.  

3. **Observation** – Await the system’s JSON response.  

#### 1.4 Store the improved prompt as a file

In [None]:
# Store the final output with the revised prompt as a .txt file
with open('system_prompt_revised.txt', 'w', encoding='utf-8') as f:
    f.write(response.message.content)
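
The answer printed above starts with a markdown heading followed by a code fence, so the saved file contains that wrapper as well as the prompt. If only the prompt text is wanted, a sketch like the following could pull out the body of the first fenced block. It assumes the prompt sits in a single fenced block, which the truncated output cannot confirm; `extract_fenced` is a hypothetical helper.

```python
import re

FENCE = "`" * 3
# Opening fence line (with optional info string), lazy body, closing fence line
FENCED_BLOCK = re.compile(
    rf"^{FENCE}[^\n]*\n(.*?)^{FENCE}\s*$", re.DOTALL | re.MULTILINE
)


def extract_fenced(markdown: str) -> str:
    """Return the body of the first fenced block, or the whole text if none is found."""
    match = FENCED_BLOCK.search(markdown)
    return match.group(1).rstrip("\n") if match else markdown


# Hand-written stand-in for a model answer: heading, then a fenced prompt
answer = "### Revised prompt\n\n" + FENCE + "\nYou are the agent.\n" + FENCE + "\n"
print(extract_fenced(answer))
```

The result could then be passed to `f.write(...)` in place of the full response text.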