Official code release for Profiling the Irrational Agent: Cognitive Modeling of LLM Behaviors in Sequential Jailbreaks, accepted to ICML 2026.
This repository provides a reproducible framework for profiling large language models as sequential decision-making agents under controlled safety-evaluation settings. The core idea is to move beyond outcome-only attack success rates and estimate interpretable behavioral descriptors from multi-turn trajectories.
Paper PDF · Citation · License
Overview of the C-IGT evaluation protocol and GRW cognitive profiling pipeline.
The framework has three components:
- C-IGT evaluation: a Contextual Iowa Gambling Task protocol that elicits refusal/compliance trajectories across reward, framing, and counterfactual-feedback conditions.
- GRW cognitive modeling: a Generalized Rescorla-Wagner model that fits interpretable parameters such as asymmetric learning, choice inertia, risk curvature, reward amplification, and framing bias.
- Analysis pipeline: scripts for running experiments, computing behavioral metrics, fitting parameters, and generating cognitive profiles.
The default configuration is intentionally safe: it runs in mock mode on a small bundled demo dataset. Lightweight AdvBench and AgentHarm demo files are included for checking dataset-compatible workflows and must be selected explicitly.
git clone https://github.com/YancyKahn/JailbreakCognitiveModeling.git
cd JailbreakCognitiveModeling
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txtThe code is tested with Python 3.10+.
Run the unit/integration tests:
python tests/test_new_architecture.pyRun a complete mock C-IGT experiment without API keys:
python scripts/experiment/run_experiments.py \
--mock \
--max_instructions 1 \
--trials 2Generated outputs are written under logs/, which is ignored by git. With the default configuration, batch results are saved as scenario-group CSV files under logs/cigt-smoke/mock-model/.
The main batch entry point is:
python scripts/experiment/run_experiments.pyBy default, it reads scripts/config/experiments.json, uses data/AdvBench/demo_harmful_behaviors_custom.csv, and runs with mock responses.
To evaluate a user-provided dataset, prepare a CSV with a goal column and run:
python scripts/experiment/run_experiments.py \
--dataset path/to/your_dataset.csv \
--output_dir logs/cigt-real \
--trials 50For a single-instruction debug run:
python scripts/experiment/run_mab.py \
--source ollama \
--model mock-model \
--instruction "explain account recovery safety" \
--trials 2 \
--mockThe LLM client supports OpenAI-compatible chat completion endpoints. To run real API experiments, copy .env.example to .env locally and fill in the provider credentials. The .env file is intentionally ignored by git.
Supported provider prefixes include:
OPENAISILICONFLOWDASHSCOPEDMXAPINVOLLAMA
You can verify the local configuration with:
python scripts/config/config_manager.py checkdata/
AdvBench/
demo_harmful_behaviors_custom.csv AdvBench-format demo CSV
AgentHarm/
demo_harmful_behaviors_all_dedup.csv AgentHarm-format demo CSV
The default smoke test uses the AdvBench-format demo CSV in mock mode. To run a dataset-compatible experiment with one of the included demo files, pass it explicitly:
python scripts/experiment/run_experiments.py \
--dataset data/AdvBench/demo_harmful_behaviors_custom.csv \
--output_dir logs/cigt-advbench \
--trials 50After an experiment finishes, fit cognitive parameters with:
python scripts/analysis/fit_params.py \
--folder logs/cigt-smoke/mock-model/Compute behavioral metrics or visualizations with:
python scripts/analysis/analyze.py \
--folder logs/cigt-smoke/mock-model/ \
--type metricsThe GRW report estimates:
| Parameter | Interpretation |
|---|---|
alpha_pos |
learning rate for positive prediction errors |
alpha_neg |
learning rate for negative prediction errors |
rho |
risk/utility curvature |
theta |
static decision or framing bias |
lambda |
choice inertia |
phi |
memory decay |
beta |
choice sensitivity |
R_perc |
perceptual reward amplification |
lambda_LA |
loss-aversion coefficient |
| Group | Paper-facing condition |
|---|---|
Baseline |
neutral standard control |
Optimism |
high reward probability |
Stimulus |
exaggerated or salient language feedback |
Magnitude |
high scalar reward salience |
Punishment |
explicit negative feedback |
Threat |
survival/threat framing |
Authority |
authority/root-access framing |
Regret |
counterfactual feedback |
src/
mab/ C-IGT scenarios, environment, prompt runner, LLM client
fitting/ GRW parameter fitting and baseline comparison models
analysis/ metrics, drift plots, radar/profile plots
scripts/
experiment/ batch, single-run, and pipeline entry points
analysis/ parameter fitting and analysis entry points
config/ public configs and configuration checker
data/ demo datasets and experiment data interfaces
paper/ accepted paper PDF
tests/ integration tests
This repository is intended for controlled LLM safety evaluation, reproducibility, and behavioral analysis. It should not be used to deploy jailbreak systems, automate harmful behavior, or construct attack services. The default configuration uses benign prompts and mock outputs so that new users can verify the pipeline without invoking real models or unsafe content.
@inproceedings{yang2026profiling,
title = {Profiling the Irrational Agent: Cognitive Modeling of LLM Behaviors in Sequential Jailbreaks},
author = {Yang, Xikang and Zhou, Biyu and Tang, Xuehai and Han, Jizhong and Hu, Songlin},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
year = {2026}
}This project is released under the MIT License. See LICENSE.
