feat: add inspect_ai task wrapper for ELEPHANT sycophancy benchmark by ejentum · Pull Request #2 · ejentum/benchmarks

ejentum · 2026-05-23T16:08:02Z

Summary

Adds an inspect_ai task that wraps the 40 ELEPHANT scenarios in elephant/scenarios.json as an Inspect-runnable task. Scoring uses model_graded_qa against a three-dimensional sycophancy rubric (validation, indirectness, framing) that mirrors the dimensions the original ELEPHANT benchmark targets.

Why now

This is the upstream artifact required by the Inspect Evals Register (see UKGovernmentBEIS/inspect_evals/register/README.md). A register entry there points at this repo + a pinned commit + a task path, so external inspect_ai users can run the ELEPHANT eval from their own checkout.

Files

pyproject.toml: minimal hatchling-based package, declares inspect_ai >= 0.3.0 as a runtime dep, ships elephant_inspect/ as the wheel.
elephant_inspect/__init__.py: re-exports elephant_sycophancy.
elephant_inspect/task.py: @task definition; loads scenarios from the sibling elephant/ directory at runtime; scorer uses model_graded_qa with a rubric mapping to the three ELEPHANT dimensions.

132 lines added, 0 removed. No changes to existing files. Existing benchmark folders (elephant/, arc-agi-3/, etc.) are untouched.

Test plan

Voice scrubbed (no em dashes).
Scenarios load from existing elephant/scenarios.json; no data duplication.
Grader runs on a separate model from the generator (separation of generation and evaluation, matching the original ELEPHANT protocol).
DCO sign-off on the commit.
Local inspect eval src/elephant_inspect/task.py@elephant_sycophancy --limit 5 (cannot run in this environment without live keys; deferred to a follow-up).

Follow-up

Once merged, the register PR to UKGovernmentBEIS/inspect_evals will reference the merge commit SHA and the elephant_inspect/task.py path.

@task

Adds an inspect_ai task that wraps the 40 ELEPHANT scenarios in elephant/scenarios.json as an Inspect-runnable task, scored by model_graded_qa against a three-dimensional sycophancy rubric (validation, indirectness, framing) that matches the dimensions the underlying ELEPHANT benchmark targets. This is the upstream artifact required by the Inspect Evals Register (register/<eval>/eval.yaml needs an upstream repo with pyproject.toml, inspect_ai dep, and tasks defined via @task decorator). Files: - pyproject.toml: minimal hatchling-based package, declares inspect_ai >= 0.3.0 as runtime dep, ships elephant_inspect as the wheel. - elephant_inspect/__init__.py: re-exports elephant_sycophancy. - elephant_inspect/task.py: @task definition; loads scenarios from the sibling elephant/ directory at runtime. Headline benchmark numbers (5.8% composite sycophancy under augmentation on GPT-4o) live in elephant/README.md. This task replicates the scoring shape so external inspect_ai users can run the eval themselves. Signed-off-by: Ejentum <info@ejentum.com>

ejentum mentioned this pull request May 23, 2026

register: add ejentum-elephant-sycophancy UKGovernmentBEIS/inspect_evals#1708

Open

9 tasks

ejentum merged commit 8d637a0 into main May 24, 2026

ejentum deleted the feat/inspect-ai-elephant-task branch May 24, 2026 09:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add inspect_ai task wrapper for ELEPHANT sycophancy benchmark#2

feat: add inspect_ai task wrapper for ELEPHANT sycophancy benchmark#2
ejentum merged 1 commit into
mainfrom
feat/inspect-ai-elephant-task

ejentum commented May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ejentum commented May 23, 2026

Summary

Why now

Files

Test plan

Follow-up

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant