FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged. #5
Could you please provide the training log? It may be caused by the start iteration of stage 2 or by the setting of the sampling threshold.
Did you solve it, @793761775?
I got this same error, but while training the model on the COCO dataset; the error appeared at 12k iterations, which is the starting iteration of VOS.
This could be related, but I got Infs and NaNs when trying to run inference on my custom images. The root cause is probably the box scaling and clipping, which produces predicted boxes with zero width or height:

predicted_boxes.scale(scale_x, scale_y)

(UnSniffer/detection/inference/inference_utils.py, lines 149 to 150 at commit dad023f)
In my case, predicted_boxes contains coordinates that appear to already be at the output image scale. Scaling them up again then makes the coordinates overflow the image boundaries, and after clipping, both x (or y) coordinates of a box end up equal to the image width (or height), so the box collapses to zero width or height.
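A minimal sketch of that failure mode, assuming predicted_boxes is a Detectron2 Boxes object (the scale/clip calls in inference_utils.py suggest it is); the image size, box coordinates, and scale factors below are made up for illustration:

```python
import torch
from detectron2.structures import Boxes

# Hypothetical image size and rescale factors, purely for illustration.
img_h, img_w = 480, 640
scale_x, scale_y = 2.0, 2.0

# A box that is already in output-image coordinates, near the right edge.
boxes = Boxes(torch.tensor([[600.0, 100.0, 630.0, 200.0]]))

boxes.scale(scale_x, scale_y)   # -> [1200, 200, 1260, 400]: overflows the 640-px-wide image
boxes.clip((img_h, img_w))      # -> [640, 200, 640, 400]: x1 == x2, i.e. a zero-width box

print(boxes.tensor)             # degenerate box; the thread reports these later produce NaNs
print(boxes.nonempty())         # tensor([False]) -- Detectron2 can flag such boxes
```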
The matmul then produces NaNs, and torch.linalg.eig(A) crashes with: Intel MKL ERROR: Parameter 3 was incorrect on entry to DGEBAL. A workaround that worked for me at inference time was to turn off resizing by setting MIN_SIZE_TEST to 0 in UnSniffer.yaml. But it would be better to skip the scaling when the output is already scaled, or, if it isn't, to filter out boxes with zero width or height.
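If the fix is to filter rather than to disable resizing, a hedged sketch of such a filter (again assuming Detectron2's Boxes API; the names predicted_boxes/predicted_scores are placeholders for whatever the inference code actually carries) could look like:

```python
import torch
from detectron2.structures import Boxes


def drop_degenerate_boxes(boxes: Boxes, scores: torch.Tensor):
    """Remove boxes whose width or height collapsed to zero after clipping.

    The argument names are placeholders; UnSniffer's actual inference code may
    carry more per-box fields (classes, logits, ...) that need the same mask.
    """
    keep = boxes.nonempty(threshold=0.0)  # keeps boxes with width > 0 and height > 0
    return boxes[keep], scores[keep]


# Example: the collapsed box from the scale/clip sketch above gets filtered out.
predicted_boxes = Boxes(torch.tensor([[640.0, 200.0, 640.0, 400.0],    # zero width
                                      [100.0, 100.0, 200.0, 200.0]]))  # valid
predicted_scores = torch.tensor([0.9, 0.8])
predicted_boxes, predicted_scores = drop_degenerate_boxes(predicted_boxes, predicted_scores)
print(predicted_boxes.tensor, predicted_scores)  # only the valid box and its score remain
```

Filtering before any downstream covariance or eigendecomposition keeps degenerate boxes from ever reaching the computation that crashes.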
Excuse me, I have set up the environment and the dataset. Then I started training, but it shows "FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged."
Can you please tell me how to solve this problem? Is it a learning rate issue?