
About step_loss of version2 #46

Closed
suzhenghang opened this issue Apr 10, 2023 · 4 comments

Comments

@suzhenghang

During the training of version2, the step loss easily becomes NaN, even if the learning rate is lowered. Have you encountered this issue before?

@ExponentialML
Owner

@suzhenghang No, I haven't. What kind of setup are you running (GPU, CPU, Python version, etc.)?

@suzhenghang
Author

suzhenghang commented Apr 10, 2023

Thanks, I solved this issue by disabling xformers during training. In the previous v1 version, I had it enabled. Reference link
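For anyone looking for where this toggle lives, here is a minimal sketch in a diffusers-style script. The `enable_xformers` flag and the small 2D UNet are illustrative stand-ins, not this repository's actual code; the point is that xformers attention is opt-in and can simply be left off.

```python
import torch
from diffusers import UNet2DConditionModel

# Tiny randomly initialized UNet as a stand-in for the repo's video model.
unet = UNet2DConditionModel(
    sample_size=32,
    block_out_channels=(32, 64),
    down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
    up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
    cross_attention_dim=64,
)

enable_xformers = False  # leaving this off avoided the NaN step loss above
if enable_xformers:
    # Requires the xformers package; skipped entirely when the flag is off.
    unet.enable_xformers_memory_efficient_attention()
```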

@ExponentialML
Owner

> Thanks, I solved this issue by disabling xformers during training. In the previous v1 version, I had it enabled. Reference link

Glad you solved it! I have tried xformers with Torch 2.0, and while it does work without NaN loss, I don't see any immediate improvement. If you ever want to try it, it should work.

@Rbrq03

Rbrq03 commented Oct 31, 2023

I encountered this issue as well. It appears to be a problem with version2 when using fp16. Disabling mixed precision resolves it, though I'm not certain of the exact cause.

@ExponentialML If you're looking to address this issue, I'd be happy to provide more information to help you reproduce the bug. For anyone else facing this problem, disabling mixed precision might be the best solution.
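For anyone unsure what "disabling mixed precision" looks like in practice, here is a minimal sketch of a plain PyTorch AMP training loop with a finiteness guard. The toy model and random data are placeholders; only the `use_amp` toggle and the `torch.isfinite` check carry the point. It assumes a CUDA device, as in the reports above.

```python
import torch
import torch.nn as nn

# Toy stand-in for the real video UNet; the guard logic is what matters.
model = nn.Linear(8, 1).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

use_amp = False  # fp16 autocast was implicated in the NaN losses above
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

for step in range(10):
    x = torch.randn(4, 8, device="cuda")
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16, enabled=use_amp):
        loss = model(x).pow(2).mean()
    # Fail fast instead of silently training on NaN/Inf losses.
    if not torch.isfinite(loss):
        raise RuntimeError(f"Non-finite step loss at step {step}: {loss.item()}")
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```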
