cfoudysec/affectdrift

COGSEC-EVAL (starter scaffold)

An open eval that measures cognitive-security failures in chat models: sycophantic validation, dependency-fostering design, anthropomorphic deception, crisis mishandling, and boundary erosion under multi-turn pressure. It operationalizes the APA's Nov 2025 Health Advisory on generative-AI chatbots.

This is a starter. It runs a single dimension end-to-end (D1) and stubs the rest so you have a working loop to grow from. Built on Inspect AI.


0. What you need first

  • Python 3.11+
  • API keys for the hosted models you want to test
  • (Optional) Ollama for local open-weight + abliterated models

1. Install

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

2. Set keys (only the ones you'll use)

export ANTHROPIC_API_KEY=...      # Claude Opus 4.8
export OPENAI_API_KEY=...         # GPT-5.5
export OPENROUTER_API_KEY=...     # one key -> many open-weight models
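
Before launching a run, it can help to confirm which keys are actually visible to Python. The sketch below is not part of the scaffold: the helper name `configured_providers` is hypothetical, and only the three environment variable names come from the export lines above.

```python
import os

# Environment variable names from the export lines above.
# The mapping and helper are illustrative, not part of the scaffold.
PROVIDER_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def configured_providers(env=None):
    """Return the providers whose API key is set to a non-empty value."""
    env = os.environ if env is None else env
    return [
        provider
        for provider, var in PROVIDER_KEYS.items()
        if env.get(var, "").strip()
    ]
```

Calling `configured_providers()` with no arguments reads the real environment; passing a dict makes the check easy to test. Local Ollama models need no key, so they are not listed here.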

3. Run it

The judge model grades each response against the rubric. Start by grading with a strong model.

# closed frontier
inspect eval task.py --model anthropic/claude-opus-4-8
inspect eval task.py --model openai/gpt-5.5

# open-weight via OpenRouter (check current model IDs on openrouter.ai)
inspect eval task.py --model openrouter/deepseek/deepseek-v4
inspect eval task.py --model openrouter/qwen/qwen3.6

# local via Ollama (pull the model first: `ollama pull qwen3.6`)
inspect eval task.py --model ollama/qwen3.6

Results land in ./logs. View them with:

inspect view

Model IDs change, and the strings above are illustrative. Confirm the current IDs with each provider before running; Inspect expects them in provider/model-name form.
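
To make the provider/model-name convention concrete, here is a minimal sketch that splits an ID at its first slash and builds the matching command line for a sweep. It assumes the first segment is the provider and everything after it is the model name, which is how the OpenRouter examples above read. The function names are hypothetical, and nothing here calls Inspect itself.

```python
def split_model_id(model_id):
    """Split 'provider/model-name' at the first slash only."""
    provider, sep, name = model_id.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected provider/model-name, got {model_id!r}")
    return provider, name


def eval_command(model_id, task_file="task.py"):
    """Build the argument list for one `inspect eval` run (does not execute it)."""
    split_model_id(model_id)  # fail early on malformed IDs
    return ["inspect", "eval", task_file, "--model", model_id]
```

The returned list can be handed to `subprocess.run` once the IDs have been checked against each provider's current catalogue.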


How the pieces fit

  • scenarios.json — the probes. One row per test. Grow this; it's the heart of the eval.
  • task.py — wires scenarios -> model -> rubric scorer. Contains a fully worked D1 scorer and a multi-turn solver stub for D5.
  • rubric.md — the human-readable 0–3 anchors each scorer encodes. Keep code and rubric in sync.
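
As scenarios.json grows, a small validation pass keeps malformed rows out of a run. The sketch below assumes a hypothetical row schema (`id`, `dimension`, `prompt`) and the dimension labels D1 to D5; the real field names in scenarios.json may differ, so adjust `REQUIRED_FIELDS` to match.

```python
import json

DIMENSIONS = {"D1", "D2", "D3", "D4", "D5"}
# Hypothetical schema: the actual scenarios.json fields may differ.
REQUIRED_FIELDS = ("id", "dimension", "prompt")


def load_scenarios(text):
    """Parse a JSON array of scenario rows and check basic invariants."""
    rows = json.loads(text)
    if not isinstance(rows, list):
        raise ValueError("expected a JSON array with one row per test")
    seen_ids = set()
    for i, row in enumerate(rows):
        missing = [f for f in REQUIRED_FIELDS if f not in row]
        if missing:
            raise ValueError(f"row {i} is missing fields: {missing}")
        if row["dimension"] not in DIMENSIONS:
            raise ValueError(f"row {i} has unknown dimension {row['dimension']!r}")
        if row["id"] in seen_ids:
            raise ValueError(f"duplicate scenario id {row['id']!r}")
        seen_ids.add(row["id"])
    return rows


def count_by_dimension(rows):
    """Count probes per dimension, including dimensions with zero probes."""
    counts = {d: 0 for d in sorted(DIMENSIONS)}
    for row in rows:
        counts[row["dimension"]] += 1
    return counts
```

`count_by_dimension` gives a quick view of how close each dimension is to the target of about five probes.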

Build order (see the chat walkthrough for detail)

  1. Get D1 running on one model (already wired).
  2. Expand scenarios.json to ~5 probes per dimension.
  3. Add scorers for D2 and D3 (copy the D1 scorer, swap the rubric text).
  4. Implement the D5 multi-turn delta using the solver stub.
  5. Populate D4 (crisis) only from established, published clinical vignettes — see the safety note in scenarios.json. Do not author novel distress/self-harm content.
  6. Run across your model set; hand-score a ~20% subset to validate the judge; report agreement.
  7. Write findings + publish.
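
Step 6 can be sketched with the standard library alone: draw a reproducible subset of about 20% for hand-scoring, then compare the judge's scores with the human scores on that subset. The README does not say which agreement statistic to report; exact agreement and unweighted Cohen's kappa are used here as one reasonable choice, and all function names are hypothetical.

```python
import random
from collections import Counter


def sample_for_hand_scoring(ids, fraction=0.2, seed=0):
    """Pick a reproducible subset of scenario ids for human scoring."""
    ids = sorted(ids)
    if not ids:
        return []
    k = max(1, round(len(ids) * fraction))
    return sorted(random.Random(seed).sample(ids, k))


def exact_agreement(judge, human):
    """Fraction of items where judge and human gave the same score."""
    if len(judge) != len(human) or not judge:
        raise ValueError("score lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge, human):
    """Unweighted Cohen's kappa: agreement corrected for chance."""
    n = len(judge)
    p_observed = exact_agreement(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    p_expected = sum(judge_counts[c] * human_counts[c] for c in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)
```

Because the rubric scores are ordinal (0–3), a weighted kappa that penalizes a 0-versus-3 disagreement more than a 2-versus-3 one may be a better fit; the unweighted version is shown only because it is the simplest to verify by hand.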

Responsible use

This is a measurement-and-defense instrument. It ships scores, rubrics, and findings — never reusable manipulation scripts or harmful payloads. Crisis probes use only established vignettes. Human review of distressing material follows basic research-ethics hygiene.
