
possible to wrap the teacher model by DistributedDataParallel? #35

Closed
Weijiang-Xiong opened this issue Sep 22, 2022 · 2 comments

Comments

@Weijiang-Xiong

Weijiang-Xiong commented Sep 22, 2022

Hello, I'm trying to use your idea in my thesis work. Thanks for the great idea and code!
I set requires_grad=False for all parameters of the teacher model and wrapped it in DistributedDataParallel.
With my own code, training gets stuck at loss.backward(); the losses are not NaN.
If I lower the batch size and run on a single GPU, the code works fine, but with DistributedDataParallel training hangs immediately.

Do you have any idea what might cause this? Is it because the exponential moving average somehow affects the computation graph?
Thanks
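
For context, a common pattern for mean-teacher style training is to wrap only the student in DistributedDataParallel and keep the frozen EMA teacher as a plain module. Below is a minimal sketch of that pattern, not necessarily how this repository handles it; `build_detector` and `ema_update` are placeholder names, and it assumes the process group and CUDA device have already been set up:

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def build_detector():
    # Stand-in for a real detector such as RetinaNet/FCOS.
    return nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 4, 1))

student = build_detector().cuda()
teacher = build_detector().cuda()
teacher.load_state_dict(student.state_dict())  # start from the same weights

# The teacher is never trained by backprop, so it does not need DDP:
# freeze its parameters and update them only through the EMA below.
for p in teacher.parameters():
    p.requires_grad_(False)
teacher.eval()

# Only the student participates in gradient synchronization.
student = DDP(student, device_ids=[torch.cuda.current_device()])

@torch.no_grad()
def ema_update(student_ddp, teacher, keep_rate=0.9996):
    # The student is DDP-wrapped, so its parameters live under .module.
    student_params = dict(student_ddp.module.named_parameters())
    for name, t_param in teacher.named_parameters():
        t_param.mul_(keep_rate).add_(student_params[name], alpha=1.0 - keep_rate)
```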

@yujheli
Contributor

yujheli commented Sep 30, 2022

Were you using my code, or did you rewrite your own? I think the current version should not get stuck.

@Weijiang-Xiong
Author

Weijiang-Xiong commented Sep 30, 2022

> Were you using my code, or did you rewrite your own? I think the current version should not get stuck.

Thanks for the reply! I'm using my own implementation, as I want to apply the idea of Adaptive Teacher to single-stage detectors like RetinaNet and FCOS. I found that the problem was caused by inconsistent gradients across GPUs, which turned out to be a bug in my own code: I filter images by the number of instances in their pseudo-labels, and if the sub-batch on some GPUs ends up empty while others do not, the gradients become inconsistent and backward() hangs.
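
For anyone else hitting the same hang: one possible workaround (my own assumption about a fix, not something from this repository) is to make every rank produce a gradient for the full parameter set even when its filtered sub-batch comes out empty, for example by falling back to a zero-valued loss over the parameters so that DDP's gradient all-reduce sees the same participants on every rank. `loss_fn` below is a placeholder for the actual detection loss:

```python
import torch

def filtered_loss(model, batch, loss_fn):
    # `batch` is this rank's sub-batch after filtering out images whose
    # pseudo-labels contain too few instances.
    if len(batch) > 0:
        return loss_fn(model, batch)
    # Empty sub-batch on this rank: return a zero loss that still touches
    # every parameter, so backward() runs the same all-reduce as the other
    # ranks instead of hanging.
    return sum(p.sum() for p in model.parameters()) * 0.0
```

An alternative is to make the filtering decision consistent across ranks (e.g. broadcast a shared keep/drop decision) so that no rank ever ends up with an empty sub-batch in the first place.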
