Yuchen Yan1,2,*,
Liang Jiang2,
Jin Jiang3,
Shuaicheng Li2,
Zujie Wen2,
Zhiqiang Zhang2,
Jun Zhou2,
Jian Shao1,
Yueting Zhuang1,
Yongliang Shen1,†
1Zhejiang University,
2Ant Group,
3Peking University
Preprint. Under review.
*Contribution during internship at Ling Team, Ant Group. †Corresponding Author
- 2026.02.09: We release our paper.
Building upon our previous work InftyThink, we introduce InftyThink+, an end-to-end reinforcement learning framework that directly optimizes the complete iterative reasoning trajectory. Retaining InftyThink’s paradigm of model-controlled iteration boundaries and explicit summarization, our approach proceeds in two stages: a cold-start stage that uses supervised fine-tuning to establish the basic iterative reasoning format, followed by an RL stage that optimizes strategic decisions through trajectory-level learning. We carefully design the rollout strategy, reward formulation, and policy gradient estimation for InftyThink’s single-trajectory, multi-inference structure. This design separates format acquisition from strategy optimization, so the model learns not only how to produce iterative reasoning, but also when to summarize, what to preserve, and how to leverage its self-generated summaries across iterations.
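To make the iterative paradigm concrete, here is a minimal sketch of an InftyThink-style inference loop: each round sees only the question plus the previous round's summary, and the model itself decides when to summarize and when to answer. This is an illustration under stated assumptions, not the paper's implementation; `generate`, the tag names, and `MAX_ITERATIONS` are all hypothetical stand-ins.

```python
# Illustrative sketch of iterative reasoning with explicit summarization.
# `generate`, the tags, and the iteration cap are assumed, not from the paper.

MAX_ITERATIONS = 8            # assumed cap on reasoning rounds
ANSWER_TAG = "<answer>"       # assumed tag the model emits with its final answer
SUMMARY_TAG = "<summary>"     # assumed tag wrapping the model's own summary


def generate(prompt: str) -> str:
    """Hypothetical LLM call; returns a canned response for demonstration."""
    return "partial reasoning ... <answer>42"


def iterative_reason(question: str) -> str:
    """Run bounded-context reasoning rounds, carrying a summary across them."""
    summary = ""
    for _ in range(MAX_ITERATIONS):
        # Context stays short regardless of total trajectory length:
        # only the question and the latest self-generated summary are fed back.
        prompt = question if not summary else f"{question}\nPrevious progress: {summary}"
        output = generate(prompt)
        if ANSWER_TAG in output:
            return output.split(ANSWER_TAG, 1)[1].strip()
        if SUMMARY_TAG in output:
            summary = output.split(SUMMARY_TAG, 1)[1].strip()
    return summary  # fallback: best progress so far
```

The RL stage described above would assign a trajectory-level reward to the full sequence of rounds produced by such a loop, rather than to any single generation in isolation.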
Code and documentation are on the way.
If you find our work helpful, please consider citing it:
@misc{yan2026inftythinkplus,
title={InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning},
author={Yuchen Yan and Liang Jiang and Jin Jiang and Shuaicheng Li and Zujie Wen and Zhiqiang Zhang and Jun Zhou and Jian Shao and Yueting Zhuang and Yongliang Shen},
year={2026},
eprint={2602.06960},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2602.06960},
}
If you have any questions, please contact us by email: yanyuchen@zju.edu.cn
