Skip to content

feat: add local model evaluation harness#71

Merged
chocks merged 2 commits into
mainfrom
feat/local-model-eval
Apr 3, 2026
Merged

feat: add local model evaluation harness#71
chocks merged 2 commits into
mainfrom
feat/local-model-eval

Conversation

@chocks
Copy link
Copy Markdown
Owner

@chocks chocks commented Apr 3, 2026

Summary

  • add locode eval-local-models for repeatable tool-calling evaluation across local Ollama models
  • document the evaluation methodology and scoring so Gemma 4 can be compared against llama3.1:8b
  • restore the conservative product stance: keep llama3.1:8b as the default and position gemma4:9b as the recommended upgrade to evaluate

Testing

  • npm test -- --run src/cli/eval-local-models.test.ts src/config/loader.test.ts src/cli/setup.test.ts
  • npm run build

Notes

  • this adds the evaluation harness but does not check in model results; run locode eval-local-models locally after pulling the desired Ollama models

@chocks
Copy link
Copy Markdown
Owner Author

chocks commented Apr 3, 2026

Review: solid eval harness, one safety concern

Must fix

Auto-approving run_command in the safety gate (src/cli/eval-local-models.ts:219-222)

const safetyGate = new SafetyGate({
  auto_approve: registry.list().map(tool => tool.name),
})

This auto-approves every registered tool including run_command — a confused or adversarial model could run destructive shell commands during eval without confirmation. Consider auto-approving only read-safe tools (read_file, list_files, git_query, search_code) and leaving run_command behind the normal approval gate, or at minimum scoping it to read-only commands.

Suggestions (non-blocking)

  1. Fragile content checktoken_threshold regex (eval-local-models.ts:138) checks exact underscore form, but models may output "token threshold" or tokenThreshold. A looser /token.?threshold/i would be more resilient without weakening signal.

  2. No task listing--task <id> filter exists but there's no way to discover available IDs without reading source. A --list-tasks flag or documenting IDs in --help would help.

  3. Determinism — consider setting temperature: 0 (or Ollama seed) for eval runs to reduce variance across trials.

  4. Per-task timeout — if a model loops, the eval hangs indefinitely. A per-task timeout (e.g. 60s) with a "timed out" assessment would make the harness more robust.

What looks good

  • Assessment logic in assessTaskRun is well-designed — tool validity, repeated-failure streaks, content correctness, empty response detection
  • parseVariantSpec cleanly handles both bare names and structured key-value specs with good test coverage
  • Repeated-failure-streak detection is a smart metric for tool-calling eval
  • Default model change (qwen3:8bllama3.1:8b) is consistent across locode.yaml, setup.ts, index.ts, and docs
  • Documentation is clear and actionable

@chocks
Copy link
Copy Markdown
Owner Author

chocks commented Apr 3, 2026

Addressed in 0b50a64.

Changes made:

  • replaced registry.list().map(...) with an explicit allowlist for the eval harness tools so future registry additions do not become auto-approved by accident
  • kept the allowlist limited to the read-only eval tool set registered by this command
  • loosened the fallback-threshold content check from exact token_threshold to /token.?threshold/i
  • added --list-tasks so task ids are discoverable from the CLI

Verification:

  • npm test -- --run src/cli/eval-local-models.test.ts
  • npm run build

@chocks
Copy link
Copy Markdown
Owner Author

chocks commented Apr 3, 2026

Follow-up looks good. All three items addressed cleanly:

  • Safety gate: explicit AUTO_APPROVED_EVAL_TOOLS allowlist decoupled from registry — correct fix
  • Content check: loosened to /token.?threshold/i
  • --list-tasks: new flag + getEvalTaskIds() export + test

One minor nit: the comment on the safety gate says "read-only tools" but the list includes run_command, which isn't strictly read-only (it's constrained by the shell allow-list, so it's safe in practice — just slightly misleading as a comment).

Tests pass (7/7), build clean. LGTM — approve.

@chocks chocks merged commit ceb03bc into main Apr 3, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant