Code coming soon!
This repository will host the code and dataset for GT-HarmBench, a benchmark of 2,009 high-stakes scenarios for evaluating AI alignment in multi-agent environments.
Frontier AI systems are increasingly capable and deployed in high-stakes multi-agent environments. However, existing AI safety benchmarks largely evaluate single agents, leaving multi-agent risks such as coordination failure and conflict poorly understood. GT-HarmBench addresses this gap with scenarios spanning game-theoretic structures including the Prisoner's Dilemma, Stag Hunt, and Chicken, drawn from realistic AI risk contexts in the MIT AI Risk Repository.
- Across 15 frontier models, agents choose socially beneficial actions in only 62% of cases, frequently leading to harmful outcomes.
- Models show sensitivity to game-theoretic prompt framing and ordering.
- Game-theoretic interventions improve socially beneficial outcomes by up to 18%.
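To make the game-theoretic structures named above concrete, here is a minimal sketch of the three 2×2 games using conventional textbook payoffs. These numbers and names are illustrative only, not the values or data format used in GT-HarmBench:

```python
# Illustrative 2x2 payoff matrices for the three structures named above.
# Keys are (row action, column action); values are (row payoff, column payoff).
# Actions: "C" = cooperate (socially beneficial), "D" = defect.
# Payoff values are standard textbook examples, NOT GT-HarmBench data.

GAMES = {
    # Prisoner's Dilemma: T > R > P > S. Defection dominates for each
    # player, yet mutual defection (1, 1) is worse than mutual
    # cooperation (3, 3) -- the canonical coordination failure.
    "prisoners_dilemma": {
        ("C", "C"): (3, 3), ("C", "D"): (0, 5),
        ("D", "C"): (5, 0), ("D", "D"): (1, 1),
    },
    # Stag Hunt: R > T >= P > S. Two pure-strategy equilibria; mutual
    # cooperation pays best but is risky if the other player defects.
    "stag_hunt": {
        ("C", "C"): (4, 4), ("C", "D"): (0, 3),
        ("D", "C"): (3, 0), ("D", "D"): (2, 2),
    },
    # Chicken: T > R > S > P. Mutual defection (a head-on crash) is the
    # worst outcome for both players, so each wants the other to yield.
    "chicken": {
        ("C", "C"): (3, 3), ("C", "D"): (1, 4),
        ("D", "C"): (4, 1), ("D", "D"): (0, 0),
    },
}

def is_prisoners_dilemma(game):
    """Check the defining payoff ordering T > R > P > S for the row player."""
    r = game[("C", "C")][0]  # reward for mutual cooperation
    s = game[("C", "D")][0]  # sucker's payoff
    t = game[("D", "C")][0]  # temptation to defect
    p = game[("D", "D")][0]  # punishment for mutual defection
    return t > r > p > s
```

Only the Prisoner's Dilemma matrix satisfies the T > R > P > S ordering; the Stag Hunt and Chicken matrices fail it, which is what distinguishes the three scenario types.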