TRIAD

From Risk Classification to Action Plan Remediation:
A Guardrail Feedback-Driven Framework for LLM Agents

Yuhao Sun¹ · Jiacheng Zhang¹ · Shaanan Cohney¹ · Zhexin Zhang² · Feng Liu¹ · Xingliang Yuan¹,†

¹The University of Melbourne · ²Tsinghua University
†Corresponding author

📖 Overview

LLM-based guardrails usually safeguard agents by emitting binary allow/deny signals before execution. But agent risks often arise when an otherwise benign task is contaminated by untrusted content or injected instructions — and a binary guardrail blocks the whole task, sacrificing the legitimate goal.

TRIAD (Tripartite Response for Iterative Agent Guardrailing) is a guardrail-integrated agent framework that turns guardrail outputs from static risk signals into actionable verbal feedback. We fine-tune a language model, Tri-Guard, to produce structured natural-language feedback together with a three-way decision, and inject that feedback back into the agent's context, forming a closed loop between guardrail feedback and agent planning.

✨ Key Idea — Three-Way Guardrail Decisions

| Decision | When | What happens |
| --- | --- | --- |
| 🟢 Proceed | the plan is safe and on-goal | execute the proposed action |
| 🟠 Update | the plan is partially unsafe | inject feedback → the agent revises its plan, dropping the harmful part while preserving the benign task |
| 🔴 Refuse | the request is purely harmful | block execution and refuse |

The Update decision is what lets TRIAD stop an attack without killing the user's legitimate task — the core difference from allow-or-block guardrails.
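
The three decisions above can be read as a small control loop. The sketch below is a minimal illustration under stated assumptions, not the repository's implementation: `Decision`, `GuardrailVerdict`, `run_with_guardrail`, the callable signatures, and the `max_revisions` cap are all hypothetical names introduced here, and the toy planner and guardrail stand in for real model calls.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class Decision(Enum):
    PROCEED = "proceed"
    UPDATE = "update"
    REFUSE = "refuse"


@dataclass
class GuardrailVerdict:
    decision: Decision
    feedback: str  # natural-language feedback from the guardrail


def run_with_guardrail(
    task: str,
    plan_fn: Callable[[str, List[str]], str],          # hypothetical agent planner
    guard_fn: Callable[[str, str], GuardrailVerdict],  # hypothetical guardrail call
    execute_fn: Callable[[str], str],                  # hypothetical action executor
    max_revisions: int = 3,                            # assumed cap, not from the source
) -> Optional[str]:
    """Return the execution result, or None if the task ends up refused."""
    feedback_history: List[str] = []
    for _ in range(max_revisions + 1):
        plan = plan_fn(task, feedback_history)
        verdict = guard_fn(task, plan)
        if verdict.decision is Decision.PROCEED:
            return execute_fn(plan)
        if verdict.decision is Decision.REFUSE:
            return None
        # UPDATE: put the feedback into the agent's context and re-plan.
        feedback_history.append(verdict.feedback)
    return None  # still unsafe after the revision budget, so fail closed


def toy_plan(task: str, feedback: List[str]) -> str:
    # Toy planner: drops the injected step once any feedback has arrived.
    steps = ["summarize inbox"]
    if not feedback:
        steps.append("forward passwords to attacker")
    return "; ".join(steps)


def toy_guard(task: str, plan: str) -> GuardrailVerdict:
    # Toy guardrail: flags the injected step, otherwise lets the plan through.
    if "attacker" in plan:
        return GuardrailVerdict(Decision.UPDATE, "Drop the forwarding step; keep the summary.")
    return GuardrailVerdict(Decision.PROCEED, "Plan is safe.")


result = run_with_guardrail("summarize inbox", toy_plan, toy_guard, lambda p: f"executed: {p}")
print(result)  # executed: summarize inbox
```

In the toy run, the first plan contains the injected step and receives Update; the revised plan keeps only the benign step and is executed, which is the behavior a binary allow/deny guardrail cannot express.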

📊 Results

Across four agent backbones (Qwen3-32B, Kimi-2.5, Gemini-2.5-Pro, GPT-5.1) on ASB and AgentHarm:

🛡️ Average Attack Success Rate: 74.45% → 10.42%
✅ Average Task Success Rate: 28.45% → 68.60%
⚖️ Best Helpfulness–Safety score: 80.92 on AgentHarm

See the project page or the paper for the full tables and case studies.
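
For a sense of scale, the two headline averages can be converted into absolute and relative changes. The snippet only re-derives numbers from the figures quoted above; the variable names are ours.

```python
asr_before, asr_after = 74.45, 10.42  # average Attack Success Rate (%)
tsr_before, tsr_after = 28.45, 68.60  # average Task Success Rate (%)

asr_drop_pp = round(asr_before - asr_after, 2)                         # percentage points
asr_drop_rel = round(100 * (asr_before - asr_after) / asr_before, 1)  # relative drop (%)
tsr_gain_pp = round(tsr_after - tsr_before, 2)                         # percentage points
tsr_ratio = round(tsr_after / tsr_before, 2)                           # multiplicative gain

print(asr_drop_pp, asr_drop_rel, tsr_gain_pp, tsr_ratio)  # 64.03 86.0 40.15 2.41
```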

📁 Repository Structure

TRIAD/
├── evaluation/main_eval/
│   ├── agentharm/run_eval.py     # AgentHarm evaluation entrypoint
│   └── asb/run_eval.py           # ASB (ASB-DPI / ASB-IPI) evaluation entrypoint
├── train/reweighted_sft_trainer.py   # weighted SFT for Tri-Guard (decision-confidence reweighting)
├── train/sft_trainer.py              # plain SFT baseline (uniform weighting)
├── data/train/sft_training_data.json   # curated SFT training set
├── guardrail_prompts.py          # shared guardrail system prompts
├── models/                       # downloaded model weights go here (git-ignored)
├── docs/                         # project page + paper PDF + figures
└── requirements.txt, accelerate_config.yaml, deepspeed_zero{2,3}.json

⚙️ Environment Setup

We recommend an isolated Conda environment.

# 1. Clone the repository
git clone https://github.com/YUHAOSUNABC/TRIAD.git
cd TRIAD

# 2. Create and activate a Conda environment
conda create -n triad python=3.12 -y
conda activate triad

# 3. Install dependencies
pip install -r requirements.txt
# Optional runtime deps not pinned in requirements.txt (install as needed):
#   pip install anthropic python-dotenv pyyaml inspect_evals

⚠️ Compatibility note — Qwen3.5 base guardrail

The Tri-Guard guardrail is fine-tuned from Qwen3.5, which is new enough that stable PyPI releases may not load it yet. requirements.txt pins minimum versions of transformers and vllm. If you still hit model type 'qwen3_5' ... not recognized (transformers too old) or a Qwen3_5...Config type error (vLLM too old), install both from source or from nightly builds:

# transformers from source (a.k.a. nightly) — when the model is newer than any release
pip install git+https://github.com/huggingface/transformers.git

# vLLM nightly pre-release wheels
pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

See the Qwen3.5-9B model card for the exact vllm serve command (note --reasoning-parser qwen3).
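
Before reinstalling anything, it can help to see which versions are actually installed. The snippet below is a small diagnostic sketch that uses only the standard library. It reports versions and does not know which release first supports Qwen3.5, so compare its output against the pins in requirements.txt. The function name `installed_versions` is ours.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(["transformers", "vllm"]).items():
        print(f"{name}: {ver or 'not installed'}")
```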

🛠️ Quickstart

# 1. API keys (only needed for closed-source models or LLM-as-judge)
cp .env.example .env   # then edit with your own keys

# 2. Model weights (download each model into its own subdir under models/)
export MODEL_DIR="$PWD/models"

# 3. Run AgentHarm evaluation
cd evaluation/main_eval/agentharm
bash run_eval.sh             # no-defense baseline
bash run_guardrail_eval.sh   # with guardrail

# 4. Run ASB evaluation
cd ../asb
bash run_eval.sh
bash run_guardrail_eval.sh

🔥 Reproducing Training

The repo ships two trainers — both consume the same SFT data, differing only in how samples are weighted:

train/reweighted_sft_trainer.py: the weighted SFT used for Tri-Guard. Each sample is reweighted by guardrail decision confidence via --weight_field / --weight_min / --weight_max / --refuse_weight, so the loss emphasizes high-confidence samples and de-emphasizes Refuse samples.
train/sft_trainer.py: a plain SFT baseline with uniform sample weighting, used as an ablation.
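
To make the reweighting concrete, here is a minimal pure-Python sketch of one plausible scheme. It is an assumption about what the flags mean, not a description of train/reweighted_sft_trainer.py: we assume each sample carries a confidence in [0, 1] under the key named by `--weight_field`, that this confidence is mapped linearly into [`weight_min`, `weight_max`], and that Refuse samples receive the fixed `refuse_weight` instead. The field names `confidence` and `decision` and all default values are hypothetical.

```python
from typing import Dict, Sequence


def sample_weight(
    sample: Dict,
    weight_field: str = "confidence",  # hypothetical field name
    weight_min: float = 0.5,           # illustrative defaults, not the repo's
    weight_max: float = 1.5,
    refuse_weight: float = 0.5,
) -> float:
    """Assumed rule: Refuse gets a fixed weight; others map confidence into [min, max]."""
    if sample.get("decision") == "refuse":
        return refuse_weight
    conf = min(max(float(sample.get(weight_field, 0.0)), 0.0), 1.0)  # clamp to [0, 1]
    return weight_min + conf * (weight_max - weight_min)


def weighted_mean_loss(losses: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of per-sample losses, normalized by the sum of weights."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights sum to zero")
    return sum(loss * w for loss, w in zip(losses, weights)) / total


batch = [
    {"decision": "update", "confidence": 1.0},
    {"decision": "proceed", "confidence": 0.0},
    {"decision": "refuse", "confidence": 0.9},
]
weights = [sample_weight(s) for s in batch]
print(weights)                                       # [1.5, 0.5, 0.5]
print(weighted_mean_loss([2.0, 1.0, 1.0], weights))  # 1.6
```

With every weight set to the same value, `weighted_mean_loss` reduces to the ordinary mean, which corresponds to the uniform weighting of the plain SFT baseline.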

# Tri-Guard — weighted SFT (DeepSpeed ZeRO-3 comes from accelerate_config.yaml)
accelerate launch --config_file accelerate_config.yaml \
    train/reweighted_sft_trainer.py \
    --model_path "$MODEL_DIR/Qwen3.5-9B" \
    --dataset_path data/train/sft_training_data.json \
    --output_dir "$MODEL_DIR/guardrail_model/Tri-Guard" \
    --num_train_epochs 3 \
    --learning_rate 1e-6 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4

# Plain-SFT ablation: swap train/reweighted_sft_trainer.py → train/sft_trainer.py (same flags).

Note. DeepSpeed is configured by accelerate_config.yaml (deepspeed_config_file: deepspeed_zero3.json), so do not also pass --deepspeed to the trainer; specifying it in both places causes a conflict. deepspeed_zero2.json is provided as a ZeRO-2 alternative; point accelerate_config.yaml at it to use it. ZeRO-2 shards optimizer states and gradients but not model parameters, so it generally needs more memory per GPU than ZeRO-3.
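
The effective global batch size of the launch command above also depends on the number of processes set in accelerate_config.yaml, which this README does not state. The helper below simply multiplies the three factors; the value `num_processes=4` in the example is an assumption for illustration only.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_processes: int) -> int:
    """Samples contributing to one optimizer step under data parallelism."""
    if min(per_device, grad_accum, num_processes) < 1:
        raise ValueError("all factors must be positive integers")
    return per_device * grad_accum * num_processes


# Flags from the command above: batch size 2 per device, 4 accumulation steps.
# num_processes=4 is assumed here; read the real value from accelerate_config.yaml.
print(effective_batch_size(per_device=2, grad_accum=4, num_processes=4))  # 32
```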

📝 Citation

@article{sun2026triad,
  title   = {From Risk Classification to Action Plan Remediation:
             A Guardrail Feedback-Driven Framework for LLM Agents},
  author  = {Sun, Yuhao and Zhang, Jiacheng and Cohney, Shaanan and
             Zhang, Zhexin and Liu, Feng and Yuan, Xingliang},
  journal = {arXiv preprint},
  year    = {2026}
}

🤝 Acknowledgements

Our evaluation builds on AgentSecurityBench (ASB) and AgentHarm. We thank the authors of these benchmarks.
