
possible to wrap the teacher model by DistributedDataParallel? #35

Closed
Weijiang-Xiong opened this issue Sep 22, 2022 · 2 comments

Comments

@Weijiang-Xiong

Weijiang-Xiong commented Sep 22, 2022

Hello, I'm trying to use your idea in my thesis work. Thanks for the great idea and code!
I set requires_grad=False for all parameters of the teacher model and wrapped it in DistributedDataParallel.
With my own code, training gets stuck at loss.backward(); the losses are not NaN.
If I lower the batch size and run on a single GPU, the code works fine, but with DistributedDataParallel training hangs immediately.

Do you have any idea what might cause this? Is it because the exponential moving average somehow affects the computation graph?
Thanks
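
For context, a common pattern for mean-teacher style training is to wrap only the student in DistributedDataParallel and keep the frozen EMA teacher as a plain module. Below is a minimal sketch of that pattern, not necessarily how this repository handles it; `build_detector` and `ema_update` are placeholder names, and it assumes the process group and CUDA device have already been set up:

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def build_detector():
    # Stand-in for a real detector such as RetinaNet/FCOS.
    return nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 4, 1))

student = build_detector().cuda()
teacher = build_detector().cuda()
teacher.load_state_dict(student.state_dict())  # start from the same weights

# The teacher is never trained by backprop, so it does not need DDP:
# freeze its parameters and update them only through the EMA below.
for p in teacher.parameters():
    p.requires_grad_(False)
teacher.eval()

# Only the student participates in gradient synchronization.
student = DDP(student, device_ids=[torch.cuda.current_device()])

@torch.no_grad()
def ema_update(student_ddp, teacher, keep_rate=0.9996):
    # The student is DDP-wrapped, so its parameters live under .module.
    student_params = dict(student_ddp.module.named_parameters())
    for name, t_param in teacher.named_parameters():
        t_param.mul_(keep_rate).add_(student_params[name], alpha=1.0 - keep_rate)
```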

@yujheli
Contributor

yujheli commented Sep 30, 2022

Were you using my code, or did you rewrite your own? I think the current version should not get stuck.

@Weijiang-Xiong
Author

Weijiang-Xiong commented Sep 30, 2022

> Were you using my code, or did you rewrite your own? I think the current version should not get stuck.

Thanks for the reply! I'm using my own implementation, as I want to apply the idea of Adaptive Teacher to single-stage detectors like RetinaNet and FCOS. I found that the problem was caused by inconsistent gradients across GPUs, which turned out to be a bug in my own code: I filter images by the number of instances in their pseudo-labels, and if the sub-batch on some GPUs ends up empty while others do not, the gradients become inconsistent and backward() hangs.
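
For anyone else hitting the same hang: one possible workaround (my own assumption about a fix, not something from this repository) is to make every rank produce a gradient for the full parameter set even when its filtered sub-batch comes out empty, for example by falling back to a zero-valued loss over the parameters so that DDP's gradient all-reduce sees the same participants on every rank. `loss_fn` below is a placeholder for the actual detection loss:

```python
import torch

def filtered_loss(model, batch, loss_fn):
    # `batch` is this rank's sub-batch after filtering out images whose
    # pseudo-labels contain too few instances.
    if len(batch) > 0:
        return loss_fn(model, batch)
    # Empty sub-batch on this rank: return a zero loss that still touches
    # every parameter, so backward() runs the same all-reduce as the other
    # ranks instead of hanging.
    return sum(p.sum() for p in model.parameters()) * 0.0
```

An alternative is to make the filtering decision consistent across ranks (e.g. broadcast a shared keep/drop decision) so that no rank ever ends up with an empty sub-batch in the first place.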
