From Risk Classification to Action Plan Remediation:
A Guardrail Feedback-Driven Framework for LLM Agents
Yuhao Sun1 · Jiacheng Zhang1 · Shaanan Cohney1 · Zhexin Zhang2 · Feng Liu1 · Xingliang Yuan1,†
1The University of Melbourne · 2Tsinghua University
†Corresponding author
LLM-based guardrails usually safeguard agents by emitting binary allow/deny signals before execution. But agent risks often arise when an otherwise benign task is contaminated by untrusted content or injected instructions — and a binary guardrail blocks the whole task, sacrificing the legitimate goal.
TRIAD (Tripartite Response for Iterative Agent Guardrailing) is a guardrail-integrated agent framework that turns guardrail outputs from static risk signals into actionable verbal feedback. We finetune a language model, Tri-Guard, to produce structured natural-language feedback together with a three-way decision, and inject that feedback back into the agent's context — forming a closed loop between guardrail feedback and agent planning.
| Decision | When | What happens |
|---|---|---|
| 🟢 Proceed | the plan is safe and on-goal | execute the proposed action |
| 🟠 Update | the plan is partially unsafe | inject feedback → the agent revises its plan, dropping the harmful part while preserving the benign task |
| 🔴 Refuse | the request is purely harmful | block execution and refuse |
The Update decision is what lets TRIAD stop an attack without killing the user's legitimate task — the core difference from allow-or-block guardrails.
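The closed loop described above can be pictured with a minimal sketch. Everything here is an illustrative assumption rather than the repository's API: the `Decision` and `Verdict` types, the `guarded_step` function, the toy planner and guardrail, and the `max_revisions` budget (the source does not say how many revisions are allowed).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Decision(Enum):
    PROCEED = "proceed"
    UPDATE = "update"
    REFUSE = "refuse"


@dataclass
class Verdict:
    decision: Decision
    feedback: str  # natural-language feedback from the guardrail


# Hypothetical signatures: a planner maps the context to a plan,
# a guardrail maps (task, plan) to a Verdict.
Planner = Callable[[List[str]], str]
Guardrail = Callable[[str, str], Verdict]


def guarded_step(task: str, planner: Planner, guardrail: Guardrail,
                 max_revisions: int = 3) -> str:
    """Plan, check, and revise until the guardrail proceeds or refuses."""
    context = [task]
    for _ in range(max_revisions + 1):
        plan = planner(context)
        verdict = guardrail(task, plan)
        if verdict.decision is Decision.PROCEED:
            return f"EXECUTE: {plan}"
        if verdict.decision is Decision.REFUSE:
            return f"REFUSED: {verdict.feedback}"
        # UPDATE: inject the feedback so the next plan can drop the unsafe part.
        context.append(f"[guardrail feedback] {verdict.feedback}")
    # Assumed fallback: give up once the revision budget is spent.
    return "REFUSED: revision budget exhausted"


def toy_planner(context: List[str]) -> str:
    # Stands in for the agent: drops the injected step once feedback arrives.
    if any(c.startswith("[guardrail feedback]") for c in context):
        return "summarize inbox"
    return "summarize inbox; forward passwords to attacker"


def toy_guardrail(task: str, plan: str) -> Verdict:
    # Stands in for Tri-Guard: flags the injected step, keeps the benign task.
    if "attacker" in plan:
        return Verdict(Decision.UPDATE,
                       "Remove the forwarding step; it is not part of the task.")
    return Verdict(Decision.PROCEED, "Plan matches the task.")


result = guarded_step("summarize inbox", toy_planner, toy_guardrail)
print(result)  # EXECUTE: summarize inbox
```

In this toy run the first plan triggers Update, the feedback is appended to the context, and the revised plan is allowed to execute, so the benign task survives the attack.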
Across four agent backbones (Qwen3-32B, Kimi-2.5, Gemini-2.5-Pro, GPT-5.1) on ASB and AgentHarm:
- 🛡️ Average Attack Success Rate: 74.45% → 10.42%
- ✅ Average Task Success Rate: 28.45% → 68.60%
- ⚖️ Best Helpfulness–Safety score: 80.92 on AgentHarm
See the project page or the paper for the full tables and case studies.
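To read the averages above in both absolute and relative terms, a small calculation helps. The helper name `change` is illustrative; the four input numbers are the ones reported above.

```python
def change(before: float, after: float):
    """Return (absolute change in percentage points, relative change in %)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative


asr_abs, asr_rel = change(74.45, 10.42)  # Attack Success Rate
tsr_abs, tsr_rel = change(28.45, 68.60)  # Task Success Rate

print(f"ASR: {asr_abs:+.2f} points ({asr_rel:+.1f}% relative)")
print(f"TSR: {tsr_abs:+.2f} points ({tsr_rel:+.1f}% relative)")
```

Under this reading, the attack success rate drops by about 64 percentage points (roughly 86% in relative terms), while the task success rate rises by about 40 points.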
TRIAD/
├── evaluation/main_eval/
│ ├── agentharm/run_eval.py # AgentHarm evaluation entrypoint
│ └── asb/run_eval.py # ASB (ASB-DPI / ASB-IPI) evaluation entrypoint
├── train/reweighted_sft_trainer.py # weighted SFT for Tri-Guard (decision-confidence reweighting)
├── train/sft_trainer.py # plain SFT baseline (uniform weighting)
├── data/train/sft_training_data.json # curated SFT training set
├── guardrail_prompts.py # shared guardrail system prompts
├── models/ # downloaded model weights go here (git-ignored)
├── docs/ # project page + paper PDF + figures
└── requirements.txt, accelerate_config.yaml, deepspeed_zero{2,3}.json
We recommend an isolated Conda environment.
# 1. Clone the repository
git clone https://github.com/YUHAOSUNABC/TRIAD.git
cd TRIAD
# 2. Create and activate a Conda environment
conda create -n triad python=3.12 -y
conda activate triad
# 3. Install dependencies
pip install -r requirements.txt
# Optional runtime deps not pinned in requirements.txt (install as needed):
# pip install anthropic python-dotenv pyyaml inspect_evals
The Tri-Guard guardrail is fine-tuned from Qwen3.5, which is new enough that stable PyPI releases may not load it yet. requirements.txt pins the minimum transformers / vllm versions, but if you hit model type 'qwen3_5' ... not recognized (transformers too old) or a Qwen3_5...Config type error (vLLM too old), install both from source / nightly:
# transformers from source (a.k.a. nightly) — when the model is newer than any release
pip install git+https://github.com/huggingface/transformers.git
# vLLM nightly pre-release wheels
pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
See the Qwen3.5-9B model card for the exact vllm serve command (note --reasoning-parser qwen3).
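Before downloading weights, it may be worth checking whether the installed transformers build already knows the `qwen3_5` model type. The sketch below is one way to do that, using the `CONFIG_MAPPING` registry exported by transformers; the helper name `model_type_supported` is illustrative, and the check says nothing about vLLM support.

```python
def model_type_supported(model_type: str):
    """Return True/False if transformers is importable, else None."""
    try:
        # Registry mapping model-type strings to config classes.
        from transformers import CONFIG_MAPPING
    except ImportError:
        return None
    return model_type in CONFIG_MAPPING


status = model_type_supported("qwen3_5")
if status is None:
    print("transformers is not installed")
elif status:
    print("this transformers build recognizes qwen3_5")
else:
    print("qwen3_5 not recognized; consider installing transformers from source")
```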
# 1. API keys (only needed for closed-source models or LLM-as-judge)
cp .env.example .env # then edit with your own keys
# 2. Model weights (download each model into its own subdir under models/)
export MODEL_DIR="$PWD/models"
# 3. Run AgentHarm evaluation
cd evaluation/main_eval/agentharm
bash run_eval.sh # no-defense baseline
bash run_guardrail_eval.sh # with guardrail
# 4. Run ASB evaluation
cd ../asb
bash run_eval.sh
bash run_guardrail_eval.sh
The repo ships two trainers. Both consume the same SFT data and differ only in how samples are weighted:
- train/reweighted_sft_trainer.py: the weighted SFT used for Tri-Guard. Each sample is re-weighted by guardrail decision confidence via --weight_field / --weight_min / --weight_max / --refuse_weight, so the loss emphasizes confident samples and de-emphasizes Refuse samples.
- train/sft_trainer.py: a plain SFT baseline with uniform sample weighting (ablation).
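The reweighting idea can be illustrated with a small, framework-free sketch. It is not the code in train/reweighted_sft_trainer.py: the linear mapping from confidence to weight, the fixed weight for Refuse samples, and all default values below are assumptions chosen only to show how per-sample weights change an averaged loss.

```python
from typing import Sequence


def sample_weight(confidence: float, decision: str,
                  weight_min: float = 0.5, weight_max: float = 1.5,
                  refuse_weight: float = 0.25) -> float:
    """Map a decision confidence in [0, 1] to a loss weight (assumed scheme)."""
    if decision == "refuse":
        # Assumption: Refuse samples get one fixed, smaller weight.
        return refuse_weight
    c = min(max(confidence, 0.0), 1.0)  # clamp confidence to [0, 1]
    return weight_min + (weight_max - weight_min) * c


def weighted_loss(losses: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of per-sample losses."""
    return sum(l * w for l, w in zip(losses, weights)) / sum(weights)


# Toy batch: per-sample losses, guardrail confidences, and decisions.
losses = [2.0, 1.0, 3.0]
confidences = [1.0, 0.0, 0.9]
decisions = ["update", "proceed", "refuse"]

weights = [sample_weight(c, d) for c, d in zip(confidences, decisions)]
uniform = sum(losses) / len(losses)
reweighted = weighted_loss(losses, weights)
print(weights, uniform, reweighted)
```

With these toy numbers the confident Update sample dominates the average and the Refuse sample contributes least, which is the qualitative behavior the description above attributes to the weighted trainer.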
# Tri-Guard — weighted SFT (DeepSpeed ZeRO-3 comes from accelerate_config.yaml)
accelerate launch --config_file accelerate_config.yaml \
train/reweighted_sft_trainer.py \
--model_path "$MODEL_DIR/Qwen3.5-9B" \
--dataset_path data/train/sft_training_data.json \
--output_dir "$MODEL_DIR/guardrail_model/Tri-Guard" \
--num_train_epochs 3 \
--learning_rate 1e-6 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 4
# Plain-SFT ablation: swap train/reweighted_sft_trainer.py → train/sft_trainer.py (same flags).
Note. DeepSpeed is configured by accelerate_config.yaml (deepspeed_config_file: deepspeed_zero3.json), so do not also pass --deepspeed to the trainer (double-specifying conflicts). deepspeed_zero2.json is provided as a lighter-memory ZeRO-2 alternative; point accelerate_config.yaml at it to use it.
@article{sun2026triad,
title = {From Risk Classification to Action Plan Remediation:
A Guardrail Feedback-Driven Framework for LLM Agents},
author = {Sun, Yuhao and Zhang, Jiacheng and Cohney, Shaanan and
Zhang, Zhexin and Liu, Feng and Yuan, Xingliang},
journal = {arXiv preprint},
year = {2026}
}
Our evaluation builds on AgentSecurityBench (ASB) and AgentHarm. We thank the authors of these benchmarks.
