Describe the bug
I know this sounds very weird, but when I use DeepSpeed to optimize a "Qwen/Qwen2.5-3B" model, the model does not update at all: after training, the parameters remain exactly the same. The same exact training code works with "Qwen/Qwen2.5-1.5B". I also checked "meta-llama/Llama-3.2-3B", and optimizing it does not work either. However, simply setting "torch_adam" to true makes the issue go away.
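For reference, the workaround is the `torch_adam` flag in the DeepSpeed optimizer config, which switches from DeepSpeed's fused Adam implementation to `torch.optim.Adam`. A minimal sketch of the relevant config section (the surrounding values like `lr` are placeholders, not my actual settings):

```json
{
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 1e-5,
      "torch_adam": true
    }
  }
}
```

With `torch_adam: true`, the 3B models train normally; with the default fused optimizer, their parameters stay frozen.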