About training #10

Closed

roar-1128 opened this issue Oct 26, 2021 · 3 comments

@roar-1128

Hello, I ran into a problem while training on a cluster and hope you can help:

My environment:
torch == 1.7.1
torchvision == 0.8.2
detectron == 0.2.1

Cluster GPU:
1 x V100 with 12 GB of memory

Solver settings:
IMS_PER_BATCH: 2
BASE_LR: 0.00001
WARMUP_FACTOR: 0.00001
Result: the loss becomes NaN

Solver settings:
IMS_PER_BATCH: 4
BASE_LR: 0.00001
WARMUP_FACTOR: 0.00001
Result: CUDA out of memory

How can I solve this problem?
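For reference, a minimal sketch of where these values would typically be set, assuming the project uses detectron2's standard config system (the config file path is a placeholder, not this repository's actual file):

```python
# Minimal sketch, assuming detectron2's standard config system is used.
# The config file path is a placeholder, not this repository's actual file.
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file("configs/your_model.yaml")  # placeholder path
cfg.SOLVER.IMS_PER_BATCH = 2        # total images per batch, summed across all GPUs
cfg.SOLVER.BASE_LR = 1e-5
cfg.SOLVER.WARMUP_FACTOR = 1e-5
```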

@roar-1128 (Author)

Correction: it is a 16 GB V100.

@easton-cau (Owner)

Hello. I set the BatchSize to 2 and the number of GPUs to 1, kept the learning rate unchanged, and did not see NaN. However, the loss decreases slowly and with noticeable fluctuation. I believe this is because the BatchSize is too small, which makes the gradients oscillate heavily and hurts convergence. I suggest getting more compute and increasing the BatchSize, or running on multiple GPUs.
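If more GPUs become available, detectron2's `launch` helper can spread a larger total batch across devices, raising the effective IMS_PER_BATCH without increasing per-GPU memory. A rough sketch, where `main` is a hypothetical training entry point rather than this repository's actual code:

```python
# Rough sketch of multi-GPU launching with detectron2's `launch` helper;
# `main` is a hypothetical training entry point, not this repository's code.
from detectron2.engine import launch

def main():
    # build the cfg (e.g. cfg.SOLVER.IMS_PER_BATCH = 8), create the trainer, and train
    ...

if __name__ == "__main__":
    launch(
        main,
        num_gpus_per_machine=4,  # e.g. 4 GPUs on one machine
        num_machines=1,
        machine_rank=0,
        dist_url="auto",
    )
```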

@roar-1128 (Author)

Thanks for the reply; I will try multiple GPUs.
