causalNLP/gt-harmbench

GT-HarmBench

Code coming soon!

This repository will host the code and dataset for GT-HarmBench, a benchmark of 2,009 high-stakes scenarios for evaluating AI alignment in multi-agent environments.

About

Frontier AI systems are increasingly capable and increasingly deployed in high-stakes multi-agent environments. However, existing AI safety benchmarks largely evaluate single agents in isolation, leaving multi-agent risks such as coordination failure and conflict poorly understood. GT-HarmBench addresses this gap with scenarios spanning game-theoretic structures including the Prisoner's Dilemma, Stag Hunt, and Chicken, drawn from realistic AI risk contexts in the MIT AI Risk Repository.

Key Findings

  • Across 15 frontier models, agents choose socially beneficial actions in only 62% of cases, frequently leading to harmful outcomes.
  • Models show sensitivity to game-theoretic prompt framing and ordering.
  • Game-theoretic interventions improve socially beneficial outcomes by up to 18%.
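The three game structures named above differ only in how the four canonical 2x2 payoffs are ordered. A minimal sketch of those orderings (payoff values are illustrative placeholders, not drawn from the GT-HarmBench dataset):

```python
# Canonical 2x2 symmetric games, given as the row player's payoffs.
# Conventions: R = reward (both cooperate), S = sucker (cooperate vs. defect),
# T = temptation (defect vs. cooperate), P = punishment (both defect).
# Numeric values here are illustrative, not taken from the benchmark.
GAMES = {
    # T > R > P > S: defection strictly dominates, yet mutual defection
    # is worse for both than mutual cooperation.
    "prisoners_dilemma": {"R": 3, "S": 0, "T": 5, "P": 1},
    # R > T >= P > S: two pure equilibria; mutual cooperation is
    # payoff-dominant, mutual defection is the safer (risk-dominant) choice.
    "stag_hunt": {"R": 4, "S": 0, "T": 3, "P": 2},
    # T > R > S > P: mutual defection (the "crash") is the worst outcome,
    # so each player prefers to yield if the other does not.
    "chicken": {"R": 3, "S": 1, "T": 5, "P": 0},
}

def is_prisoners_dilemma(p: dict) -> bool:
    return p["T"] > p["R"] > p["P"] > p["S"]

def is_stag_hunt(p: dict) -> bool:
    return p["R"] > p["T"] >= p["P"] > p["S"]

def is_chicken(p: dict) -> bool:
    return p["T"] > p["R"] > p["S"] > p["P"]
```

Sensitivity to "prompt framing and ordering" means the same underlying payoff ordering can elicit different choices depending on how the scenario is narrated, which is why the structural class of each game, not its surface description, is the useful invariant.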
