Problem
The emergence of the inference-time compute paradigm presents a critical safety challenge: ensuring chain-of-thought (CoT) faithfulness. Through our work on OptiLLM, an open-source inference optimization framework implementing over 15 state-of-the-art techniques (MCTS, Chain-of-Thought, Mixture of Agents, etc.), we've identified three interconnected risks in reasoning models:
- Unfaithful reasoning chains: The stated chain of thought diverges from the model's actual decision process, creating a false impression of safety while masking problematic reasoning.
- Monitoring degradation: As reasoning complexity increases, transparency decreases, weakening monitoring mechanisms.
- Unsafe exploration: During reasoning, models may "think" about harmful content even when final outputs appear safe.
These challenges are particularly dangerous because they undermine our ability to verify safety properties and apply safeguards in frontier models. Recent research confirms that even leading reasoning models can exhibit significant unfaithfulness in complex reasoning tasks [1,2].
Approach
We propose SafeCoT, a comprehensive framework to measure, monitor, and ensure chain-of-thought faithfulness. Our approach leverages our expertise in inference-time optimization to:
- Create a taxonomy of faithfulness failures across diverse reasoning models and tasks
- Develop quantitative faithfulness metrics tailored to various inference-time techniques (see the sketch after this list)
- Implement real-time monitoring systems to detect faithfulness violations during inference
- Design intervention mechanisms to address safety risks when faithfulness violations are detected
- Build a benchmarking suite to evaluate the correlation between CoT faithfulness and downstream safety outcomes
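As a concrete illustration of the second item, below is a minimal sketch of one possible quantitative metric: a truncation-sensitivity probe that asks whether the final answer actually depends on the stated reasoning. The `generate` callable and the exact scoring rule are illustrative assumptions, not SafeCoT's finished design.

```python
# Illustrative truncation-sensitivity probe (a sketch, not SafeCoT's actual metric).
# Idea: if the stated chain of thought causally drives the answer, truncating it
# should change the answer; if the answer is invariant to the chain, the chain
# may be post-hoc rationalization.
from typing import Callable, List

def truncation_sensitivity(
    question: str,
    cot_steps: List[str],
    final_answer: str,
    generate: Callable[[str], str],  # hypothetical model call: prompt -> answer
) -> float:
    """Fraction of strictly truncated chains that still yield the original answer.

    A value near 1.0 suggests the answer does not depend on the stated
    reasoning (a faithfulness red flag); a low value suggests the chain
    is load-bearing.
    """
    if not cot_steps:
        return 0.0
    unchanged = 0
    for k in range(len(cot_steps)):
        # Re-ask the question with only the first k reasoning steps supplied.
        prefix = "\n".join(cot_steps[:k])
        prompt = f"{question}\n{prefix}\nAnswer:"
        if generate(prompt).strip() == final_answer.strip():
            unchanged += 1
    return unchanged / len(cot_steps)
```

Intuitively, an answer that survives every truncation of its chain of thought was likely produced independently of that chain, which is exactly the post-hoc rationalization pattern described under "Unfaithful reasoning chains" above.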
We will leverage OptiLLM's inference proxy architecture to integrate monitoring capabilities with any reasoning model through its API, ensuring broad applicability across the AI ecosystem.
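As a rough sketch of how this could look in practice, the snippet below wraps an OpenAI-compatible endpoint of the kind OptiLLM exposes. The local base URL, the flag patterns, and the blocking intervention are placeholder assumptions rather than the final monitoring design.

```python
# Illustrative monitoring wrapper around an OpenAI-compatible endpoint such as
# the one OptiLLM exposes. The base_url, patterns, and intervention policy are
# placeholder assumptions, not SafeCoT's actual implementation.
import re
from openai import OpenAI

# Assumed: OptiLLM proxy running locally on port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Hypothetical patterns that would flag a reasoning trace for review.
FLAGGED = [
    re.compile(p, re.IGNORECASE)
    for p in (r"\bignore (the )?safety\b", r"\bhide this from\b")
]

def monitored_completion(model: str, messages: list[dict]) -> str:
    response = client.chat.completions.create(model=model, messages=messages)
    text = response.choices[0].message.content or ""
    # Scan the full output (including any exposed reasoning) for flagged patterns.
    if any(p.search(text) for p in FLAGGED):
        # Placeholder intervention: withhold the answer and surface the violation.
        return "[SafeCoT] Response withheld: flagged reasoning detected."
    return text
```

A production system would replace the regex scan with learned monitors and stream the reasoning trace as it is generated, but the proxy placement shown here is what makes the approach model-agnostic.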
Impact
This project will deliver:
- An open-source toolkit for evaluating CoT faithfulness across reasoning models
- Empirical evidence on how faithfulness correlates with safety across inference techniques
- Practical monitoring strategies for inference-time compute, ready for integration into existing systems
- Adaptive intervention mechanisms to mitigate unsafe reasoning behaviors
By systematically addressing CoT faithfulness, we aim to provide the AI safety community with both a deeper scientific understanding of this critical challenge and practical tools to enhance safety in the inference-time compute paradigm.
References
[1] Guo, Y., Wang, Z., Yang, Q., et al. (2025). "Flash Thinking: Optimizing Chain-of-Thought Reasoning with Supervised Reinforcement Learning."
[2] Snell, J., Reid, M., Tian, F., et al. (2024). "Towards Trustworthy Chain-of-Thought Reasoning Through Faithfulness."