Harbor agent adapter for pi coding agent to run Terminal-Bench evaluations.
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install harbor
uv tool install harbor
# Install this package (for development)
cd ~/workspaces/pi-terminal-bench
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"

There is a bug in Harbor's upload_dir function that causes verifier failures when the agent creates a /tests directory during task execution. The fix must be applied before running evaluations.
The Problem: When using docker cp /path/to/source container:/target:
- If /target does NOT exist: copies contents of source into /target ✓
- If /target ALREADY exists: copies source as a subdirectory /target/source/ ✗
The pi agent (and other agents) may create /tests during task execution, causing the verifier's test files to end up at /tests/tests/test.sh instead of /tests/test.sh.
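The two cases above can be modelled in a few lines of Python. This is only an illustrative sketch of the documented docker cp path rules, not Harbor code: the function names and the /tmp/task/tests source path are hypothetical, and nothing here talks to Docker.

```python
from pathlib import PurePosixPath


def copy_contents_source(source_dir: str) -> str:
    """Append '/.' so docker cp copies the directory's contents, not the directory."""
    return source_dir.rstrip("/") + "/."


def landed_path(source: str, target: str, target_exists: bool, filename: str) -> str:
    """Model where `filename` ends up after `docker cp source container:target`."""
    copies_contents = source.endswith("/.")
    if target_exists and not copies_contents:
        # Existing target: the source directory itself is nested inside it.
        nested = PurePosixPath(source).name
        return str(PurePosixPath(target) / nested / filename)
    # Missing target, or a source ending in '/.': contents land directly in target.
    return str(PurePosixPath(target) / filename)


# Hypothetical source directory holding the verifier's test.sh
src = "/tmp/task/tests"
print(landed_path(src, "/tests", False, "test.sh"))  # /tests/test.sh
print(landed_path(src, "/tests", True, "test.sh"))   # /tests/tests/test.sh
print(landed_path(copy_contents_source(src), "/tests", True, "test.sh"))  # /tests/test.sh
```

With the '/.' suffix the result no longer depends on whether the agent created /tests beforehand, which is the property the fix relies on.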
Apply the fix:
# Find Harbor's docker.py location
HARBOR_DOCKER=$(python -c "import harbor.environments.docker.docker as m; print(m.__file__)")
# Apply the patch (backs up original first)
cp "$HARBOR_DOCKER" "${HARBOR_DOCKER}.bak"
# Edit the upload_dir function - change this:
#     async def upload_dir(self, source_dir: Path | str, target_dir: str):
#         await self._run_docker_compose_command(
#             ["cp", str(source_dir), f"main:{target_dir}"],
#             check=True,
#         )
#
# To this:
#     async def upload_dir(self, source_dir: Path | str, target_dir: str):
#         # Append /. to source to copy contents, not the directory itself
#         source = str(source_dir).rstrip('/') + '/.'
#         await self._run_docker_compose_command(
#             ["cp", source, f"main:{target_dir}"],
#             check=True,
#         )

Or run this command to apply the patch automatically:
python -c "
import harbor.environments.docker.docker as m
p = m.__file__
t = open(p).read()
old = '''    async def upload_dir(self, source_dir: Path | str, target_dir: str):
        await self._run_docker_compose_command(
            [
                \"cp\",
                str(source_dir),
                f\"main:{target_dir}\",
            ],
            check=True,
        )'''
new = '''    async def upload_dir(self, source_dir: Path | str, target_dir: str):
        # Append /. to source to copy contents, not the directory itself
        source = str(source_dir).rstrip('/') + '/.'
        await self._run_docker_compose_command(
            [
                \"cp\",
                source,
                f\"main:{target_dir}\",
            ],
            check=True,
        )'''
if old in t:
    open(p, 'w').write(t.replace(old, new))
    print('✓ Patch applied successfully')
else:
    print('✗ Already patched or file structure changed')
"

Verify the patch:
python -c "from harbor.environments.docker.docker import DockerEnvironment; import inspect; print('PATCHED' if 'rstrip' in inspect.getsource(DockerEnvironment.upload_dir) else 'NOT PATCHED')"

See ERROR.md for detailed investigation of this issue.
- Docker running
- API key for your chosen provider:
# Anthropic (OAuth token preferred)
export ANTHROPIC_OAUTH_TOKEN="..."
# OR
export ANTHROPIC_API_KEY="..."

# OpenAI
export OPENAI_API_KEY="..."

# Google
export GEMINI_API_KEY="..."
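Before starting a long run, it can help to confirm that at least one credential is actually exported. The following is a minimal preflight sketch and not part of this package: the variable names are the ones listed above, while the provider grouping and the configured_providers helper are assumptions made for illustration.

```python
import os
from typing import Mapping

# Variable names come from the list above; the grouping by provider is illustrative.
PROVIDER_KEYS = {
    "anthropic": ("ANTHROPIC_OAUTH_TOKEN", "ANTHROPIC_API_KEY"),
    "openai": ("OPENAI_API_KEY",),
    "google": ("GEMINI_API_KEY",),
}


def configured_providers(env: Mapping[str, str]) -> list[str]:
    """Return providers for which at least one credential variable is non-empty."""
    return [
        provider
        for provider, names in PROVIDER_KEYS.items()
        if any(env.get(name, "").strip() for name in names)
    ]


if __name__ == "__main__":
    found = configured_providers(os.environ)
    print("Configured providers:", ", ".join(found) or "none")
```

The check only looks at whether a variable is set and non-blank; it does not validate the key against the provider.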
# Run locally with Docker
harbor run \
-d terminal-bench@2.0 \
--agent-import-path pi_terminal_bench:PiAgent \
-m anthropic/claude-sonnet-4-5 \
-n 4
# Run on cloud (Daytona)
export DAYTONA_API_KEY="..."
harbor run \
-d terminal-bench@2.0 \
--agent-import-path pi_terminal_bench:PiAgent \
-m anthropic/claude-sonnet-4-5 \
--env daytona \
-n 32

# Run the oracle agent
harbor run -d terminal-bench@2.0 -a oracle

# Run a single task
harbor run \
-d terminal-bench@2.0 \
--agent-import-path pi_terminal_bench:PiAgent \
-m anthropic/claude-sonnet-4-5 \
--task-ids <task-id>

To submit results to the Terminal-Bench leaderboard:
harbor run \
-d terminal-bench@2.0 \
--agent-import-path pi_terminal_bench:PiAgent \
-m anthropic/claude-sonnet-4-5 \
--k 5 \
--jobs-dir "./pi-tbench-results"

Then email the jobs directory to:
After running evaluations, use show-results.js to display results with leaderboard comparison:
./show-results.js

This will parse the latest results from pi-tbench-results/ and show where pi ranks on the Terminal-Bench 2.0 leaderboard.
# Run tests
pytest
# Lint
ruff check src/
ruff format src/

MIT