Clipping-Free Policy Optimization (CFPO) replaces the heuristic clipping mechanism in GRPO/PPO with a convex quadratic penalty derived from Total Variation divergence constraints. This yields an everywhere-differentiable objective that eliminates zero-gradient regions, reduces reward hacking, and improves training stability.
Clipping-Free Policy Optimization (CFPO) addresses the limitations of the clipping mechanism used in standard PPO and GRPO. Hard clipping boundaries can lead to:
- Zero-gradient regions: once the policy ratio moves outside the clipping range in the direction the clip binds, that sample contributes exactly zero gradient to the update
- Reward hacking: models can exploit flaws in the reward function while keeping the policy ratio inside the clipping bounds, where the objective applies no corrective pressure
- Training instability: hard boundaries cause abrupt, non-smooth changes in policy updates
CFPO replaces this heuristic clipping with a convex quadratic penalty derived from a Total Variation divergence constraint. The resulting objective is differentiable everywhere, so every sample contributes gradient signal, and policy updates stay stable without hard boundaries.
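To make the difference concrete, below is a minimal PyTorch sketch contrasting PPO's clipped surrogate with a quadratic-penalty surrogate. The connection to Total Variation is that TV(π_θ, π_old) = ½·E_old[|r − 1|], where r is the probability ratio, so a smooth convex penalty on (r − 1)² serves as a differentiable surrogate for a TV constraint. The penalty form `lam * (ratio - 1)**2`, the coefficient `lam`, and the function names are illustrative assumptions, not the exact objective from the paper:

```python
# Sketch only: contrasts PPO's clipped surrogate with a quadratic-penalty
# surrogate of the kind CFPO uses. The penalty form lam * (ratio - 1)**2 and
# the coefficient lam are assumptions; the exact objective is in the paper.
import torch


def ppo_clip_loss(logp, logp_old, advantages, eps=0.2):
    """Standard PPO clipped surrogate (negated for minimization)."""
    ratio = torch.exp(logp - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # Where the clip binds, min() selects the constant clipped branch,
    # so those samples contribute exactly zero gradient.
    return -torch.mean(torch.min(unclipped, clipped))


def quadratic_penalty_loss(logp, logp_old, advantages, lam=1.0):
    """Unclipped surrogate plus a convex quadratic penalty on the ratio.

    Differentiable everywhere: samples whose ratio drifts from 1 are
    pulled back by the penalty instead of being silently dropped.
    """
    ratio = torch.exp(logp - logp_old)
    surrogate = ratio * advantages
    penalty = lam * (ratio - 1.0) ** 2
    return -torch.mean(surrogate - penalty)


if __name__ == "__main__":
    torch.manual_seed(0)
    logp_old = torch.randn(8)
    # Perturb the policy so some ratios land outside the clipping range.
    logp = (logp_old + 0.5 * torch.randn(8)).requires_grad_(True)
    adv = torch.randn(8)

    (g_clip,) = torch.autograd.grad(ppo_clip_loss(logp, logp_old, adv), logp)
    (g_quad,) = torch.autograd.grad(quadratic_penalty_loss(logp, logp_old, adv), logp)
    print("clipped-loss grads:     ", g_clip)  # zero where the clip binds
    print("quadratic-penalty grads:", g_quad)  # nonzero for every sample
```

Running the script shows the qualitative difference: under the clipped loss, samples where the clip binds receive exactly zero gradient, while the quadratic penalty keeps every sample's gradient alive and simply pulls large ratios back toward 1.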
We provide model checkpoints trained with CFPO for both alignment and reasoning settings. You're welcome to try out the models and evaluate their performance!
View Alignment Model Collection
The collection includes CFPO models and RLOO baselines.
View Reasoning Model Collection
The collection includes CFPO models and GRPO baselines.
This repository contains two implementations of CFPO:
- RLHF: based on OpenRLHF, a high-performance RLHF framework, for alignment settings.
- RLVR: implementations for reasoning settings, built on open-r1 and verl.
First, cd into the desired directory and set up the environment as described in its README.md. Then follow the instructions below.
See RLHF/README.md for detailed installation and usage instructions.
Each subdirectory (open-r1 and verl) contains its own setup and usage instructions.
CFPO has been evaluated across both reasoning and alignment settings:
- Reasoning: CFPO matches clipping-based methods on downstream benchmarks while remaining stable over a longer stretch of training
- Alignment: CFPO mitigates verbosity exploitation and reduces capability degradation, while achieving competitive instruction-following performance
Figure: comparison of GRPO and CFPO across model sizes (1.5B, 3B, 7B) on reasoning tasks.
Figure: CFPO performance on alignment tasks.
For detailed results, tables, and figures, please refer to the paper.
This project builds upon the following open-source repositories:
- OpenRLHF
- open-r1
- verl
If you use CFPO in your research, please cite:
@article{cagatan2026clipping,
title={Clipping-Free Policy Optimization for Large Language Models},
author={{\c{C}}a{\u{g}}atan, {\"O}mer Veysel and Akg{\"u}n, Bar{\i}{\c{s}} and {\c{S}}ahin, G{\"o}zde G{\"u}l and Zhao, Xuandong},
journal={arXiv preprint arXiv:2601.22801},
year={2026}
}

