Hello, I ran into a problem while training on a cluster and hope you can help:
My environment:
torch == 1.7.1
torchvision == 0.8.2
detectron == 0.2.1
Cluster GPU:
1× V100 with 12 GB of memory
Learning-rate settings:
IMS_PER_BATCH: 2
BASE_LR: 0.00001
WARMUP_FACTOR: 0.00001
Result: the loss becomes NaN
Learning-rate settings:
IMS_PER_BATCH: 4
BASE_LR: 0.00001
WARMUP_FACTOR: 0.00001
Error: CUDA out of memory
How can I solve this?
Correction: it is a 16 GB V100.
Hello. I set the batch size to 2 on a single GPU with the learning rate unchanged, and the loss did not become NaN, but it decreases slowly and with noticeable fluctuation. I believe the batch size is too small, which causes severe gradient oscillation and hurts convergence. I suggest moving to more compute and increasing the batch size, or training on multiple GPUs.
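If you do increase IMS_PER_BATCH as suggested, one common heuristic (the linear scaling rule, not something specific to this repository) is to scale BASE_LR in proportion to the batch size. A minimal sketch, assuming the original settings from the question (IMS_PER_BATCH = 2, BASE_LR = 0.00001):

```python
def scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear scaling rule: scale the learning rate in proportion
    to the change in total batch size."""
    return base_lr * new_batch / base_batch

# Example: moving from IMS_PER_BATCH = 2 on one GPU to a total
# batch of 8 across 4 GPUs (hypothetical setup).
new_lr = scaled_lr(base_lr=0.00001, base_batch=2, new_batch=8)
print(new_lr)
```

Since NaN losses already appeared at the original learning rate, keeping WARMUP_FACTOR small (or lengthening warmup) while scaling the rate is usually the safer combination.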
Thanks for the reply. I will try training on multiple GPUs.