Hello, I'm trying to use your idea in my thesis work. Thanks for the great idea and code!
I set requires_grad=False for all the parameters in the teacher model and wrapped it in DistributedDataParallel.
But with my own code the training gets stuck at loss.backward(), even though the losses are not NaN.
If I lower the batch size and run on just 1 GPU, the code works fine. But if I use DistributedDataParallel, the training hangs immediately.
Do you have any idea what might cause this? Could it be that the exponential moving average somehow affects the computation graph?
Thanks
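For reference, here is a minimal sketch of the setup I describe above (not the repo's code; `build_detector` is a placeholder for whatever constructs the model): only the student is wrapped in DistributedDataParallel, while the frozen teacher is updated by EMA outside the autograd graph.

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def build_teacher_student(build_detector, device):
    # Student: trained normally, wrapped in DDP so gradients are synchronized.
    student = build_detector().to(device)
    student = DDP(student, device_ids=[device.index])

    # Teacher: same architecture, updated only by EMA, never by backprop.
    teacher = build_detector().to(device)
    teacher.load_state_dict(student.module.state_dict())
    for p in teacher.parameters():
        p.requires_grad = False
    teacher.eval()
    return teacher, student

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # The EMA update runs under no_grad, so it does not touch the computation graph.
    for t_p, s_p in zip(teacher.parameters(), student.module.parameters()):
        t_p.mul_(momentum).add_(s_p.detach(), alpha=1 - momentum)
```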
Were you using my code, or did you rewrite your own? I don't think the current version should get stuck.
Thanks for the reply! I'm using my own implementation, since I want to apply the adaptive teacher idea to single-stage detectors like RetinaNet and FCOS. I found that the problem was caused by inconsistent gradients across GPUs, which turned out to be a bug in my own rough code: I filter images by the number of instances in their pseudo-labels, and if the sub-batch on some GPUs ends up empty while others are not, the gradients become inconsistent and backward() hangs.
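In case it helps anyone else, here is a minimal sketch of one way I keep the ranks in sync (again, not from this repo; `unsup_losses` is a hypothetical loss function): all ranks agree via an all_reduce whether any rank has an empty pseudo-label batch, so either every rank calls backward() or none does.

```python
import torch
import torch.distributed as dist

def safe_unsup_step(model, optimizer, pseudo_batch, unsup_losses):
    device = next(model.parameters()).device
    # Flag is 1 if this rank has at least one pseudo-labeled instance.
    has_data = torch.tensor([1.0 if len(pseudo_batch) > 0 else 0.0], device=device)
    # MIN-reduce: becomes 0 if *any* rank is empty, so all ranks take the same branch.
    dist.all_reduce(has_data, op=dist.ReduceOp.MIN)

    optimizer.zero_grad()
    if has_data.item() > 0:
        loss = unsup_losses(model, pseudo_batch)
        loss.backward()   # every rank calls backward, so DDP's all-reduce matches
        optimizer.step()
    # else: skip the step on every rank, so no rank waits alone in backward()
```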