feat: add local model evaluation harness by chocks · Pull Request #71 · chocks/locode

chocks · 2026-04-03T02:29:37Z

Summary

add locode eval-local-models for repeatable tool-calling evaluation across local Ollama models
document the evaluation methodology and scoring so Gemma 4 can be compared against llama3.1:8b
restore the conservative product stance: keep llama3.1:8b as the default and position gemma4:9b as the recommended upgrade to evaluate

Testing

npm test -- --run src/cli/eval-local-models.test.ts src/config/loader.test.ts src/cli/setup.test.ts
npm run build

Notes

this adds the evaluation harness but does not check in model results; run locode eval-local-models locally after pulling the desired Ollama models

chocks · 2026-04-03T02:32:49Z

Review: solid eval harness, one safety concern

Must fix

Auto-approving run_command in the safety gate (src/cli/eval-local-models.ts:219-222)

const safetyGate = new SafetyGate({
  auto_approve: registry.list().map(tool => tool.name),
})

This auto-approves every registered tool including run_command — a confused or adversarial model could run destructive shell commands during eval without confirmation. Consider auto-approving only read-safe tools (read_file, list_files, git_query, search_code) and leaving run_command behind the normal approval gate, or at minimum scoping it to read-only commands.

Suggestions (non-blocking)

Fragile content check — token_threshold regex (eval-local-models.ts:138) checks exact underscore form, but models may output "token threshold" or tokenThreshold. A looser /token.?threshold/i would be more resilient without weakening signal.
No task listing — --task <id> filter exists but there's no way to discover available IDs without reading source. A --list-tasks flag or documenting IDs in --help would help.
Determinism — consider setting temperature: 0 (or Ollama seed) for eval runs to reduce variance across trials.
Per-task timeout — if a model loops, the eval hangs indefinitely. A per-task timeout (e.g. 60s) with a "timed out" assessment would make the harness more robust.

What looks good

Assessment logic in assessTaskRun is well-designed — tool validity, repeated-failure streaks, content correctness, empty response detection
parseVariantSpec cleanly handles both bare names and structured key-value specs with good test coverage
Repeated-failure-streak detection is a smart metric for tool-calling eval
Default model change (qwen3:8b → llama3.1:8b) is consistent across locode.yaml, setup.ts, index.ts, and docs
Documentation is clear and actionable

chocks · 2026-04-03T02:35:06Z

Addressed in 0b50a64.

Changes made:

replaced registry.list().map(...) with an explicit allowlist for the eval harness tools so future registry additions do not become auto-approved by accident
kept the allowlist limited to the read-only eval tool set registered by this command
loosened the fallback-threshold content check from exact token_threshold to /token.?threshold/i
added --list-tasks so task ids are discoverable from the CLI

Verification:

npm test -- --run src/cli/eval-local-models.test.ts
npm run build

chocks · 2026-04-03T02:35:57Z

Follow-up looks good. All three items addressed cleanly:

✅ Safety gate: explicit AUTO_APPROVED_EVAL_TOOLS allowlist decoupled from registry — correct fix
✅ Content check: loosened to /token.?threshold/i
✅ --list-tasks: new flag + getEvalTaskIds() export + test

One minor nit: the comment on the safety gate says "read-only tools" but the list includes run_command, which isn't strictly read-only (it's constrained by the shell allow-list, so it's safe in practice — just slightly misleading as a comment).

Tests pass (7/7), build clean. LGTM — approve.

feat: add local model evaluation harness

9d7a1cb

fix: tighten eval tool approvals

0b50a64

chocks merged commit ceb03bc into main Apr 3, 2026
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add local model evaluation harness#71

feat: add local model evaluation harness#71
chocks merged 2 commits into
mainfrom
feat/local-model-eval

chocks commented Apr 3, 2026

Uh oh!

chocks commented Apr 3, 2026

Uh oh!

chocks commented Apr 3, 2026

Uh oh!

chocks commented Apr 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

chocks commented Apr 3, 2026

Summary

Testing

Notes

Uh oh!

chocks commented Apr 3, 2026

Review: solid eval harness, one safety concern

Must fix

Suggestions (non-blocking)

What looks good

Uh oh!

chocks commented Apr 3, 2026

Uh oh!

chocks commented Apr 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant