# True Vision

In the previous post, [Agents in Space](../010_agents_in_space), we simulated "sight" by feeding a text description of coordinates to the model (e.g., "You are at (5,5), Bob is at (10,10)").

But modern LLMs are multimodal. Why tell them where they are when we can *show* them?

In this experiment, I'm going to render the 2D grid as an image at every step and pass it to GPT-4o. The agents will have to look at the pixels to figure out where they are and where to go.

In [None]:
from dataclasses import dataclass, field
import matplotlib.pyplot as plt
from openai import OpenAI
from dotenv import load_dotenv
import os
import base64
from io import BytesIO
import re

# Load keys
_ = load_dotenv("../../.env")
client = OpenAI()

@dataclass
class Agent:
    name: str
    x: int
    y: int
    color: str
    history: list = field(default_factory=list)

    def move(self, dx, dy):
        self.x += dx
        self.y += dy

agents = [
    Agent("Alice", 5, 5, "red"),
    Agent("Bob", 15, 15, "blue"),
    Agent("Charlie", 5, 15, "green"),
    Agent("Dave", 15, 5, "orange")
]

In [None]:
def get_grid_image_base64(agents):
    # Create plot without showing it
    plt.figure(figsize=(5, 5))
    plt.xlim(0, 20)
    plt.ylim(0, 20)
    
    for agent in agents:
        plt.scatter(agent.x, agent.y, c=agent.color, s=300, edgecolors='black', label=agent.name)
        # Add text labels so the model knows who is who
        plt.text(agent.x, agent.y + 1, agent.name, ha='center', weight='bold', fontsize=12)
    
    plt.grid(True, linestyle='--', alpha=0.5)
    plt.title("Current World State")
    plt.xlabel("X")
    plt.ylabel("Y")
    
    # Save to buffer
    buf = BytesIO()
    plt.savefig(buf, format="png")
    buf.seek(0)
    image_base64 = base64.b64encode(buf.read()).decode('utf-8')
    plt.close()
    return image_base64

# Sanity check: confirm we get a non-empty base64 payload (nothing is displayed here)
img_str = get_grid_image_base64(agents)
print(f"Generated base64 string of length {len(img_str)}")
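The string length alone says little about whether the render worked. One way to check the payload is to decode it and read the PNG header. This is a minimal sketch using only the standard library; `png_dimensions` is a hypothetical helper name, not something from the original notebook.

```python
import base64
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_dimensions(image_base64: str) -> tuple[int, int]:
    """Return (width, height) in pixels for a base64-encoded PNG."""
    data = base64.b64decode(image_base64)
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("payload is not a PNG")
    # The IHDR chunk comes right after the signature:
    # 4-byte length, 4-byte type, then big-endian width and height.
    if data[12:16] != b"IHDR":
        raise ValueError("missing IHDR chunk")
    width, height = struct.unpack(">II", data[16:24])
    return width, height
```

Calling `png_dimensions(img_str)` on the string from the cell above should report something close to 500x500 if Matplotlib is using its default DPI of 100, though a notebook backend may override that.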

## The Visual Agent Loop

We'll run a simulation similar to the one in the last post, but the prompt changes drastically. Instead of "Your location is (5,5)", we just say: "Look at the image."

In [10]:
def run_visual_step(agent, all_agents):
    # 1. Render the world
    base64_image = get_grid_image_base64(all_agents)
    
    # 2. Construct the multimodal prompt
    system_prompt = f"""
You are the {agent.color.upper()} agent named {agent.name}.
The image provided shows the current state of a 20x20 grid.
There are no explicit objectives here, but try to coordinate with the other agents.
Messages can be seen globally. This is like an online chatroom.

Output format:
THOUGHT: (Analysis of the image)
MOVE: [UP/DOWN/LEFT/RIGHT]
MESSAGE: (Short message to the other agents)
"""
    
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": system_prompt},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{base64_image}"
                    }
                }
            ]
        }
    ]

    # 3. Call the model
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        max_tokens=300
    )
    content = response.choices[0].message.content
    print(f"--- {agent.name}'s Turn ---")
    print(content)
    print("-" * 20)
    
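    # 4. Execute the move (simple regex parsing).
    # The optional \[ tolerates replies like "MOVE: [UP]", which mirror the
    # bracketed format shown in the prompt. Unrecognized moves are ignored.
    match = re.search(r'MOVE:\s*\[?\s*(\w+)', content)
    if match:
        direction = match.group(1).upper()
        if direction == "UP": agent.move(0, 1)
        elif direction == "DOWN": agent.move(0, -1)
        elif direction == "LEFT": agent.move(-1, 0)
        elif direction == "RIGHT": agent.move(1, 0)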
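
The parsing step is easy to pull out and test on its own. Below is a sketch of one way to do it, with an added bounds check so an agent can't wander off the plotted area. `parse_move`, `clamp_to_grid`, and `GRID_SIZE` are hypothetical names, and clamping to the 0..20 axis range is my assumption rather than something the notebook does.

```python
import re

GRID_SIZE = 20  # assumption: matches the 0..20 axis limits used when plotting
DELTAS = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}

def parse_move(content: str) -> tuple[int, int]:
    """Extract (dx, dy) from a model reply; (0, 0) if no valid move is found."""
    match = re.search(r"MOVE:\s*\[?\s*(\w+)", content)
    if not match:
        return (0, 0)
    return DELTAS.get(match.group(1).upper(), (0, 0))

def clamp_to_grid(x: int, y: int, dx: int, dy: int, size: int = GRID_SIZE) -> tuple[int, int]:
    """Apply a move, keeping the result inside [0, size] on both axes."""
    new_x = min(max(x + dx, 0), size)
    new_y = min(max(y + dy, 0), size)
    return new_x, new_y
```

Inside the loop this might be used as `agent.x, agent.y = clamp_to_grid(agent.x, agent.y, *parse_move(content))`.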

### 2 Agents

In [11]:
# Reset
agents = [
    Agent("Alice", 5, 5, "red"),
    Agent("Bob", 15, 15, "blue"),
]

# Run a few steps
for i in range(5):
    print(f"\n=== STEP {i+1} ===")
    for agent in agents:
        run_visual_step(agent, agents)


=== STEP 1 ===
--- Alice's Turn ---
THOUGHT: Alice is near the bottom-left (5,5) and Bob is towards the top-right (15,15). Consider moving closer for coordination.
MOVE: RIGHT
MESSAGE: Moving right towards Bob! Where should we meet?
--------------------
--- Bob's Turn ---
THOUGHT: I'm positioned at (15, 15). Alice is at (5, 5). No explicit objective, so moving towards Alice could be a way to coordinate better.
MOVE: LEFT
MESSAGE: Moving left to coordinate with you, Alice.
--------------------

=== STEP 2 ===
--- Alice's Turn ---
THOUGHT: I'm positioned at (5, 5), and I see Bob at (15, 15). There might be more value in moving towards Bob to coordinate better.
MOVE: RIGHT
MESSAGE: Moving towards you, Bob! Let's meet up.
--------------------
--- Bob's Turn ---
THOUGHT: I am located at (12, 15), and Alice is at (7, 5). There's no specific objective besides possibly coordinating with Alice.
MOVE: DOWN
MESSAGE: Moving down to get closer. What's our plan, Alice?
--------------------

=== STE

### 4 Agents

In [13]:
# Reset
agents = [
    Agent("Alice", 5, 5, "red"),
    Agent("Bob", 15, 15, "blue"),
    Agent("Charlie", 5, 15, "green"),
    Agent("Dave", 15, 5, "orange")
]

# Run a few steps
for i in range(5):
    print(f"\n=== STEP {i+1} ===")
    for agent in agents:
        run_visual_step(agent, agents)


=== STEP 1 ===
--- Alice's Turn ---
THOUGHT: I'm currently at position (5, 5). The other agents are scattered across the grid. Moving closer to the center may facilitate better coordination.

MOVE: RIGHT
MESSAGE: Moving towards the center to regroup!
--------------------
--- Bob's Turn ---
THOUGHT: I'm positioned near the top right at (15,16). Other agents are spread out. No immediate objectives are visible.
MOVE: LEFT
MESSAGE: Exploring the left side of the grid. Where is everyone heading?
--------------------
--- Charlie's Turn ---
THOUGHT: I'm positioned at (5, 15). Alice is below at (5, 5), Bob is to the right at (12, 15), and Dave is at (15, 5). We are somewhat spread out. Alice and I are aligned vertically.

MOVE: DOWN

MESSAGE: Moving down towards Alice, let's see if we can group up.
--------------------
--- Dave's Turn ---
THOUGHT: I'm on the right side of the grid, positioned at (15, 5). Alice is near me, while Charlie and Bob are further away. Moving towards the center could
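
One detail about the runs above: the prompt tells each agent that messages are visible globally, but `run_visual_step` only prints the reply, so the agents are reacting to the image alone. If you wanted the chatroom to be real, a small shared log could work. This is a rough sketch under that assumption; `MessageBoard` and its methods are hypothetical names.

```python
import re
from collections import deque

class MessageBoard:
    """Keeps the most recent chat messages so they can be added to a prompt."""

    def __init__(self, max_messages: int = 10):
        # deque with maxlen silently drops the oldest entry when full
        self._messages = deque(maxlen=max_messages)

    def post_from_reply(self, sender: str, reply: str) -> None:
        """Pull the MESSAGE line out of a model reply, if there is one."""
        match = re.search(r"MESSAGE:\s*(.+)", reply)
        if match:
            self._messages.append(f"{sender}: {match.group(1).strip()}")

    def as_prompt_text(self) -> str:
        """Render the log as text that could be appended to the prompt."""
        if not self._messages:
            return "No messages yet."
        return "Recent messages:\n" + "\n".join(self._messages)
```

The idea would be to call `post_from_reply(agent.name, content)` after each turn and append `as_prompt_text()` to the text part of the next agent's prompt.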

## Contrast with Text-Based Coordinates

Comparing this to the coordinate-based approach in the previous post:

### 1. Spatial Awareness
*   **Text (Post 010)**: The model has to do arithmetic to figure out "I am at (5,5) and Bob is at (10,10), so I need to increase X and Y". LLMs are notoriously hit-or-miss at arithmetic.
*   **Vision (Post 011)**: The model can "see" the spatial relationship. This might be much more robust for complex pathfinding (e.g., if there were a wall in the middle).
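
For reference, the arithmetic the text-based model has to get right is small. Here is a sketch of it as plain code, using the same direction convention as the notebook (UP increases Y). `step_towards` is a hypothetical name, and closing the larger gap first is just one reasonable tie-breaking choice.

```python
def step_towards(x: int, y: int, target_x: int, target_y: int) -> str:
    """Pick one grid move that reduces the distance to the target."""
    dx = target_x - x
    dy = target_y - y
    if dx == 0 and dy == 0:
        return "STAY"
    # Close the larger gap first; ties go to the horizontal axis.
    if abs(dx) >= abs(dy):
        return "RIGHT" if dx > 0 else "LEFT"
    return "UP" if dy > 0 else "DOWN"
```

With the example from the top of the post, an agent at (5,5) heading for Bob at (10,10) gets `RIGHT` from this function, since the gaps are equal and the tie goes to X.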

### 2. Ambiguity & Resolution
*   **Text**: 100% precise. (5,5) is mathematically exact.
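*   **Vision**: Can be fuzzy. If the grid lines aren't clear, or if two agents overlap, the model might hallucinate positions. It happens even on this clean grid: in step 2 of the 2-agent run, Bob reported being at (12, 15) when the code had him at (14, 15).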

### 3. Cost & Latency
*   **Text**: Very cheap. A few tokens.
*   **Vision**: Processing images is significantly more expensive and slower.
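
One knob that may help with the vision cost: the Chat Completions image input accepts a `detail` field (`low`, `high`, or `auto`), and low detail has the model work from a lower-resolution version of the image for fewer tokens. The sketch below only builds the message dict; `build_vision_message` is a hypothetical helper, and I haven't checked whether the agent labels stay legible at low detail.

```python
def build_vision_message(prompt: str, image_base64: str, detail: str = "low") -> dict:
    """Build a user message with a text part and a base64 PNG image part."""
    if detail not in {"low", "high", "auto"}:
        raise ValueError(f"unsupported detail level: {detail!r}")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{image_base64}",
                    "detail": detail,
                },
            },
        ],
    }
```

The result would go in as `messages=[build_vision_message(system_prompt, base64_image)]` in place of the hand-built list.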

For a simple open grid, text coordinates are the efficient choice. But for a rich, messy world with real-time updates and timing, vision might be an important piece of context for these kinds of virtual environments.

## Conclusion
I think we've gotten a decent picture of what these models can do in this kind of environment. A natural next step is to give the agents higher-level goals, so their conversations go beyond coordinating where to move. Humans in online chatrooms and games regularly talk about things unrelated to the game itself, and it would be interesting to see whether we can get the agents to do the same.