
feat: add LabelModelGrader support for OpenAI Evals backend#137

Open
mesutoezdil wants to merge 1 commit into agentevals-dev:main from mesutoezdil:feat/label-model-grader

Conversation


@mesutoezdil mesutoezdil commented May 5, 2026

Closes #97

Adds `label_model` as a second grader type alongside `text_similarity`.

The config validates the `model`, `input`, `labels`, and `passing_labels` fields. Items sent to the Evals API include only `actual_response` for `label_model`, since the expected behavior is encoded in the grader config. The `expected_invocations` check is now type-aware, so `label_model` runs work without a golden set. Result details include `model` and `passing_labels` instead of `evaluation_metric`.

Example config:

```yaml
type: openai_eval
name: quality-check
grader:
  type: label_model
  model: gpt-4o-mini
  input:
    - role: user
      content: "Rate this response: {{ item.actual_response }}"
  labels: [good, bad]
  passing_labels: [good]
```

The diff shows 184 insertions, but 137 of those are the new test file; the actual code change is about 33 lines across `config.py` and `openai_eval_backend.py`.

Tests cover config validation, criteria shape, JSONL item structure, schema selection, and the evaluate flow with and without expected invocations.
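The config-validation part of that coverage might look something like the sketch below, written with plain asserts. The validator and its field set are hypothetical stand-ins; the real checks in `config.py` may be structured differently:

```python
# Hypothetical stand-in for the validation in config.py: a label_model
# grader must declare model, input, labels, and passing_labels.
REQUIRED_LABEL_MODEL_FIELDS = {"model", "input", "labels", "passing_labels"}


def validate_grader(config: dict) -> None:
    """Raise ValueError if a label_model grader is missing required fields."""
    if config.get("type") != "label_model":
        return
    missing = REQUIRED_LABEL_MODEL_FIELDS - config.keys()
    if missing:
        raise ValueError(f"label_model grader missing fields: {sorted(missing)}")


valid = {
    "type": "label_model",
    "model": "gpt-4o-mini",
    "input": [{"role": "user", "content": "Rate: {{ item.actual_response }}"}],
    "labels": ["good", "bad"],
    "passing_labels": ["good"],
}
validate_grader(valid)  # complete config: no error

# Dropping passing_labels should be rejected with a clear message.
invalid = {k: v for k, v in valid.items() if k != "passing_labels"}
try:
    validate_grader(invalid)
    raise AssertionError("expected ValueError")
except ValueError as err:
    assert "passing_labels" in str(err)
```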

Adds label_model grader type alongside text_similarity.

Config validates model, input, labels, and passing_labels fields.
Items sent to OpenAI only include actual_response for label_model,
since the expected behavior is encoded in labels and passing_labels.
Details in results include model and passing_labels instead of
evaluation_metric.

Closes agentevals-dev#97


Development

Successfully merging this pull request may close these issues.

Add LabelModelGrader OpenAI Grader
