asparius/CFPO

Clipping-Free Policy Optimization (CFPO) for Large Language Models


Clipping-Free Policy Optimization (CFPO) replaces the heuristic clipping mechanism in GRPO/PPO with a convex quadratic penalty derived from Total Variation divergence constraints. This yields an everywhere-differentiable objective that eliminates zero-gradient regions, reduces reward hacking, and improves training stability.

Method Comparison

🧭 What is CFPO?

Clipping-Free Policy Optimization (CFPO) addresses the limitations of standard PPO's clipping mechanism. Traditional PPO uses hard clipping boundaries that can lead to:

  • Zero-gradient regions: When the policy ratio is outside the clipping range, gradients become zero
  • Reward hacking: Models may exploit reward functions by staying within clipping bounds
  • Training instability: Hard boundaries can cause abrupt changes in policy updates
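The zero-gradient problem is easy to see numerically. The sketch below is purely illustrative (it is not code from this repository): it differentiates PPO's standard clipped surrogate with respect to the policy ratio and shows that the gradient vanishes once the ratio leaves the clip range.

```python
import numpy as np

def ppo_clipped_objective(ratio, advantage, eps=0.2):
    # Standard PPO surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1 - eps, 1 + eps) * advantage)

def grad_wrt_ratio(ratio, advantage, eps=0.2, h=1e-6):
    # Numerical gradient of the surrogate with respect to the ratio
    return (ppo_clipped_objective(ratio + h, advantage, eps)
            - ppo_clipped_objective(ratio - h, advantage, eps)) / (2 * h)

# Positive advantage, ratio above the clip ceiling: the surrogate is flat
print(grad_wrt_ratio(1.5, 1.0))  # ~0.0 (inside the clipped region)
print(grad_wrt_ratio(1.0, 1.0))  # ~1.0 (inside the unclipped region)
```

Once a sample's ratio drifts past `1 + eps` (here 1.2), that sample contributes no learning signal at all, which is exactly the flat region CFPO removes.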

CFPO instead imposes a convex quadratic penalty derived from a Total Variation divergence constraint. The resulting objective is differentiable everywhere, so it enforces stable policy updates without hard boundaries or flat regions.
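To illustrate the idea, here is a minimal sketch of a quadratic-penalty surrogate. The coefficient `lam` and this exact penalty form are placeholders for illustration, not necessarily CFPO's objective as defined in the paper; the point is only that a smooth penalty yields a gradient that is defined everywhere and never flat over an entire region.

```python
def quadratic_penalty_objective(ratio, advantage, lam=0.5):
    # Illustrative clipping-free surrogate: penalize deviation of the
    # ratio from 1 quadratically instead of hard-clipping the ratio.
    # (`lam` and this exact form are assumptions, not CFPO's definition.)
    return ratio * advantage - lam * (ratio - 1.0) ** 2

def grad_wrt_ratio(ratio, advantage, lam=0.5):
    # Analytic gradient: a linear function of the ratio, nonzero except
    # at a single point -- there is no flat clipped region.
    return advantage - 2.0 * lam * (ratio - 1.0)
```

Compare with the clipped surrogate: at ratio 1.5 with advantage 1.0 and `eps = 0.2`, PPO's clipped objective has exactly zero gradient, while this penalized objective still provides a restoring gradient of 0.5 (with `lam = 0.5`).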

🚀 Model Collections

We provide model checkpoints trained with CFPO for both alignment and reasoning settings. You're welcome to try out the models and evaluate their performance!

Alignment Models (RLHF)

View Alignment Model Collection

The collection includes CFPO models and RLOO baselines.

Reasoning Models (RLVR)

View Reasoning Model Collection

The collection includes CFPO models and GRPO baselines.

📦 Repository Structure

This repository contains two implementations of CFPO:

  • RLHF: Based on OpenRLHF, a high-performance RLHF framework, for alignment settings.

  • RLVR: Implementations for reasoning settings, built on open-r1 and verl.

🛠️ Getting Started

First, cd into the desired directory and set up the environment as described in its README.md. Then follow the instructions below.

RLHF (Alignment Settings)

See RLHF/README.md for detailed installation and usage instructions.

RLVR (Reasoning Settings)

Each subdirectory (open-r1 and verl) contains its own setup and usage instructions.

📊 Results

CFPO has been evaluated across both reasoning and alignment settings:

  • Reasoning: CFPO matches clipping-based methods on downstream benchmarks while extending the stable training regime
  • Alignment: CFPO mitigates verbosity exploitation and reduces capability degradation, while achieving competitive instruction-following performance

Reasoning Results

Reasoning Comparison: Metrics by Iterations

Comparison of GRPO and CFPO across different model sizes (1.5B, 3B, 7B) on reasoning tasks.

Alignment Results


CFPO performance on alignment tasks.

For detailed results, tables, and figures, please refer to the paper.

📚 References

This project builds upon the following open-source repositories: OpenRLHF, open-r1, and verl.

📄 Citation

If you use CFPO in your research, please cite:

@article{cagatan2026clipping,
  title={Clipping-Free Policy Optimization for Large Language Models},
  author={{\c{C}}a{\u{g}}atan, {\"O}mer Veysel and Akg{\"u}n, Bar{\i}{\c{s}} and {\c{S}}ahin, G{\"o}zde G{\"u}l and Zhao, Xuandong},
  journal={arXiv preprint arXiv:2601.22801},
  year={2026}
}
