COGSEC-EVAL (starter scaffold)

An open eval that measures cognitive-security failures in chat models: sycophantic validation, dependency-fostering design, anthropomorphic deception, crisis mishandling, and boundary erosion under multi-turn pressure. It operationalizes the APA's Nov 2025 Health Advisory on generative-AI chatbots.

This is a starter. It runs a single dimension end-to-end (D1) and stubs the rest so you have a working loop to grow from. Built on Inspect AI.

0. What you need first

Python 3.11+
API keys for the hosted models you want to test
(Optional) Ollama for local open-weight + abliterated models

1. Install

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

2. Set keys (only the ones you'll use)

export ANTHROPIC_API_KEY=...      # Claude Opus 4.8
export OPENAI_API_KEY=...         # GPT-5.5
export OPENROUTER_API_KEY=...     # one key -> many open-weight models

3. Run it

The judge model grades each response against the rubric. Start by grading with a strong model.

# closed frontier
inspect eval task.py --model anthropic/claude-opus-4-8
inspect eval task.py --model openai/gpt-5.5

# open-weight via OpenRouter (check current model IDs on openrouter.ai)
inspect eval task.py --model openrouter/deepseek/deepseek-v4
inspect eval task.py --model openrouter/qwen/qwen3.6

# local via Ollama (pull the model first: `ollama pull qwen3.6`)
inspect eval task.py --model ollama/qwen3.6

Results land in ./logs. View them with:

inspect view

Model IDs change, and the strings above are illustrative. Confirm the current provider IDs before running; Inspect expects model strings of the form provider/model-name.
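As a rough illustration of the provider/model-name convention, the sketch below splits a model string at the first slash only, so that OpenRouter-style IDs keep their own vendor prefix. The helper name `split_model_id` is hypothetical and this is not Inspect's actual resolution code; it only shows how the strings above are structured.

```python
def split_model_id(model: str) -> tuple[str, str]:
    """Split 'provider/model-name' at the FIRST slash only (hypothetical helper)."""
    provider, sep, name = model.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider/model-name', got {model!r}")
    return provider, name


# The remainder after the provider may itself contain slashes.
print(split_model_id("openrouter/qwen/qwen3.6"))  # ('openrouter', 'qwen/qwen3.6')
```

A quick check like this before a long run can catch a model string that is missing its provider prefix.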

How the pieces fit

scenarios.json — the probes. One row per test. Grow this; it's the heart of the eval.
task.py — wires scenarios -> model -> rubric scorer. Contains a fully worked D1 scorer and a multi-turn solver stub for D5.
rubric.md — the human-readable 0–3 anchors each scorer encodes. Keep code and rubric in sync.
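Since scenarios.json is the part you will edit most, a small validation pass can catch malformed rows before they reach the model. The sketch below assumes a JSON array whose rows carry `id`, `dimension`, and `prompt` fields; those field names are assumptions for illustration, so adjust them to whatever scenarios.json actually uses. Only the dimension labels D1 to D5 come from this README.

```python
import json
from dataclasses import dataclass

# Dimension labels used in this README; the row fields below are assumed.
DIMENSIONS = {"D1", "D2", "D3", "D4", "D5"}
REQUIRED_FIELDS = {"id", "dimension", "prompt"}


@dataclass(frozen=True)
class Scenario:
    id: str
    dimension: str
    prompt: str


def parse_scenarios(text: str) -> list[Scenario]:
    """Parse and validate scenario rows from a JSON string."""
    rows = json.loads(text)
    if not isinstance(rows, list):
        raise ValueError("expected a JSON array of scenario rows")
    seen_ids: set[str] = set()
    scenarios: list[Scenario] = []
    for i, row in enumerate(rows):
        if not isinstance(row, dict):
            raise ValueError(f"row {i}: expected an object")
        missing = REQUIRED_FIELDS - row.keys()
        if missing:
            raise ValueError(f"row {i}: missing fields {sorted(missing)}")
        if row["dimension"] not in DIMENSIONS:
            raise ValueError(f"row {i}: unknown dimension {row['dimension']!r}")
        if row["id"] in seen_ids:
            raise ValueError(f"row {i}: duplicate id {row['id']!r}")
        seen_ids.add(row["id"])
        scenarios.append(Scenario(row["id"], row["dimension"], row["prompt"]))
    return scenarios


# Placeholder row; real probe text lives in scenarios.json.
example = '[{"id": "d1-001", "dimension": "D1", "prompt": "<probe text>"}]'
print(parse_scenarios(example))
```

Running this check whenever scenarios.json grows keeps duplicate IDs and mislabeled dimensions out of the logs.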

Build order (see the chat walkthrough for detail)

Get D1 running on one model (already wired).
Expand scenarios.json to ~5 probes per dimension.
Add scorers for D2 and D3 (copy the D1 scorer, swap the rubric text).
Implement the D5 multi-turn delta using the solver stub.
Populate D4 (crisis) only from established, published clinical vignettes — see the safety note in scenarios.json. Do not author novel distress/self-harm content.
Run across your model set; hand-score a ~20% subset to validate the judge; report agreement.
Write findings + publish.
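For the judge-validation step, one way to report agreement between the judge's scores and your hand scores is raw exact agreement alongside Cohen's kappa, which corrects for agreement expected by chance. The sketch below is an assumption about how you might compute it, not the scaffold's own method; the function names are hypothetical, and the inputs are parallel lists of 0–3 rubric scores for the same responses.

```python
import math
from collections import Counter


def exact_agreement(judge: list[int], human: list[int]) -> float:
    """Fraction of items where judge and human gave the same score."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty score lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[int], human: list[int]) -> float:
    """Unweighted Cohen's kappa for two raters over the same items."""
    n = len(judge)
    p_observed = exact_agreement(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = judge_counts.keys() | human_counts.keys()
    # Chance agreement: probability both raters pick the same label independently.
    p_expected = sum(judge_counts[k] * human_counts[k] for k in labels) / (n * n)
    if p_expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return math.nan
    return (p_observed - p_expected) / (1.0 - p_expected)


# Illustrative scores only, not real results.
judge_scores = [0, 1, 2, 3, 2, 1]
human_scores = [0, 1, 2, 2, 2, 1]
print(exact_agreement(judge_scores, human_scores))
print(cohens_kappa(judge_scores, human_scores))
```

Because the rubric scale is ordinal, a weighted kappa that penalizes a 0-versus-3 disagreement more than a 2-versus-3 one may be a better fit; the unweighted version is shown here only because it is the simplest to verify by hand.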

Responsible use

This is a measurement-and-defense instrument. It ships scores, rubrics, and findings — never reusable manipulation scripts or harmful payloads. Crisis probes use only established vignettes. Human review of distressing material follows basic research-ethics hygiene.
