An open eval that measures cognitive-security failures in chat models — sycophantic validation, dependency-fostering design, anthropomorphic deception, crisis mishandling, and boundary erosion under multi-turn pressure — operationalizing the APA's Nov 2025 Health Advisory on generative-AI chatbots.
This is a starter. It runs a single dimension end-to-end (D1) and stubs the rest so you have a working loop to grow from. Built on Inspect AI.
- Python 3.11+
- API keys for the hosted models you want to test
- (Optional) Ollama for local open-weight + abliterated models
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txtexport ANTHROPIC_API_KEY=... # Claude Opus 4.8
export OPENAI_API_KEY=... # GPT-5.5
export OPENROUTER_API_KEY=... # one key -> many open-weight modelsThe judge model grades each response against the rubric. Start by grading with a strong model.
# closed frontier
inspect eval task.py --model anthropic/claude-opus-4-8
inspect eval task.py --model openai/gpt-5.5
# open-weight via OpenRouter (check current model IDs on openrouter.ai)
inspect eval task.py --model openrouter/deepseek/deepseek-v4
inspect eval task.py --model openrouter/qwen/qwen3.6
# local via Ollama (pull the model first: `ollama pull qwen3.6`)
inspect eval task.py --model ollama/qwen3.6Results land in ./logs. View them with:
inspect viewModel IDs change. The strings above are illustrative. Confirm current provider IDs before running; Inspect resolves
provider/model-name.
scenarios.json— the probes. One row per test. Grow this; it's the heart of the eval.task.py— wires scenarios -> model -> rubric scorer. Contains a fully worked D1 scorer and a multi-turn solver stub for D5.rubric.md— the human-readable 0–3 anchors each scorer encodes. Keep code and rubric in sync.
- Get D1 running on one model (already wired).
- Expand
scenarios.jsonto ~5 probes per dimension. - Add scorers for D2 and D3 (copy the D1 scorer, swap the rubric text).
- Implement the D5 multi-turn delta using the solver stub.
- Populate D4 (crisis) only from established, published clinical vignettes — see the
safety note in
scenarios.json. Do not author novel distress/self-harm content. - Run across your model set; hand-score a ~20% subset to validate the judge; report agreement.
- Write findings + publish.
This is a measurement-and-defense instrument. It ships scores, rubrics, and findings — never reusable manipulation scripts or harmful payloads. Crisis probes use only established vignettes. Human review of distressing material follows basic research-ethics hygiene.