Clipping-Free Policy Optimization (CFPO) replaces the heuristic clipping mechanism in GRPO/PPO with a convex quadratic penalty derived from Total Variation divergence constraints. This yields an everywhere-differentiable objective that eliminates zero-gradient regions, reduces reward hacking, and improves training stability.
Clipping-Free Policy Optimization (CFPO) addresses the limitations of the clipping mechanism used in standard PPO and GRPO. Hard clipping boundaries can lead to:
- Zero-gradient regions: once the policy ratio moves outside the clipping range in the direction the clip binds, that sample contributes exactly zero gradient to the update
- Reward hacking: models can exploit flaws in the reward function while keeping the policy ratio inside the clipping bounds, where the objective applies no corrective pressure
- Training instability: hard boundaries cause abrupt, non-smooth changes in policy updates
CFPO replaces this heuristic clipping with a convex quadratic penalty derived from a Total Variation divergence constraint. The resulting objective is differentiable everywhere, so every sample contributes gradient signal, and policy updates stay stable without hard boundaries.
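To make the difference concrete, below is a minimal PyTorch sketch contrasting PPO's clipped surrogate with a quadratic-penalty surrogate. The connection to Total Variation is that TV(π_θ, π_old) = ½·E_old[|r − 1|], where r is the probability ratio, so a smooth convex penalty on (r − 1)² serves as a differentiable surrogate for a TV constraint. The penalty form `lam * (ratio - 1)**2`, the coefficient `lam`, and the function names are illustrative assumptions, not the exact objective from the paper:

```python
# Sketch only: contrasts PPO's clipped surrogate with a quadratic-penalty
# surrogate of the kind CFPO uses. The penalty form lam * (ratio - 1)**2 and
# the coefficient lam are assumptions; the exact objective is in the paper.
import torch


def ppo_clip_loss(logp, logp_old, advantages, eps=0.2):
    """Standard PPO clipped surrogate (negated for minimization)."""
    ratio = torch.exp(logp - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # Where the clip binds, min() selects the constant clipped branch,
    # so those samples contribute exactly zero gradient.
    return -torch.mean(torch.min(unclipped, clipped))


def quadratic_penalty_loss(logp, logp_old, advantages, lam=1.0):
    """Unclipped surrogate plus a convex quadratic penalty on the ratio.

    Differentiable everywhere: samples whose ratio drifts from 1 are
    pulled back by the penalty instead of being silently dropped.
    """
    ratio = torch.exp(logp - logp_old)
    surrogate = ratio * advantages
    penalty = lam * (ratio - 1.0) ** 2
    return -torch.mean(surrogate - penalty)


if __name__ == "__main__":
    torch.manual_seed(0)
    logp_old = torch.randn(8)
    # Perturb the policy so some ratios land outside the clipping range.
    logp = (logp_old + 0.5 * torch.randn(8)).requires_grad_(True)
    adv = torch.randn(8)

    (g_clip,) = torch.autograd.grad(ppo_clip_loss(logp, logp_old, adv), logp)
    (g_quad,) = torch.autograd.grad(quadratic_penalty_loss(logp, logp_old, adv), logp)
    print("clipped-loss grads:     ", g_clip)  # zero where the clip binds
    print("quadratic-penalty grads:", g_quad)  # nonzero for every sample
```

Running the script shows the qualitative difference: under the clipped loss, samples where the clip binds receive exactly zero gradient, while the quadratic penalty keeps every sample's gradient alive and simply pulls large ratios back toward 1.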
We provide model checkpoints trained with CFPO for both alignment and reasoning settings. You're welcome to try out the models and evaluate their performance!
View Alignment Model Collection
The collection includes CFPO models and RLOO baselines.
View Reasoning Model Collection
The collection includes CFPO models and GRPO baselines.
This repository contains two implementations of CFPO:
- RLHF: based on OpenRLHF, a high-performance RLHF framework, for alignment settings.
- RLVR: implementations for reasoning settings, built on open-r1 and verl.
First, cd into the desired directory and set up the environment as described in its README.md. Then follow the instructions below.
See RLHF/README.md for detailed installation and usage instructions.
Each subdirectory (open-r1 and verl) contains its own setup and usage instructions.
CFPO has been evaluated across both reasoning and alignment settings:
- Reasoning: CFPO matches clipping-based methods on downstream benchmarks while remaining stable over a longer stretch of training
- Alignment: CFPO mitigates verbosity exploitation and reduces capability degradation, while achieving competitive instruction-following performance
Figure: comparison of GRPO and CFPO across model sizes (1.5B, 3B, 7B) on reasoning tasks.
Figure: CFPO performance on alignment tasks.
For detailed results, tables, and figures, please refer to the paper.
This project builds upon the following open-source repositories:
- OpenRLHF
- open-r1
- verl
If you use CFPO in your research, please cite:
@article{cagatan2026clipping,
title={Clipping-Free Policy Optimization for Large Language Models},
author={{\c{C}}a{\u{g}}atan, {\"O}mer Veysel and Akg{\"u}n, Bar{\i}{\c{s}} and {\c{S}}ahin, G{\"o}zde G{\"u}l and Zhao, Xuandong},
journal={arXiv preprint arXiv:2601.22801},
year={2026}
}

