
FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged. #5

Open
793761775 opened this issue Jul 6, 2023 · 5 comments

Comments

@793761775

Excuse me, I have set up the environment and the dataset. When I start training, it shows "FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged."
Can you please tell me how to solve this problem? Is it a learning rate issue?

@Went-Liang
Owner

Could you please provide the training log? It may be caused by the start iteration of stage 2 or by the sampling-threshold setting.

@793761775
Author

(screenshot: 2023-07-28 11-33-09)
Excuse me, this is the problem I had running the code. The command I used was "python train_net.py --dataset-dir /home/yangyz/UnSniffer/VOC --num-gpus 1 --config-file VOC-Detection/faster-rcnn/UnSniffer.yaml --random-seed 0 --resume". I did not change the file "UnSniffer.yaml".

@YH-2023

YH-2023 commented Jul 30, 2023

Did you solve it, please? @793761775

@rohit901

I got this same error, but when training the model on the COCO dataset; the error appeared at iteration 12k, which is the starting iteration of VOS.

@balazon

balazon commented Sep 15, 2023

This could be related: I got Infs and NaNs when trying to run inference on my custom images. The root cause of the issue is probably the box scaling and clipping, which produced predicted boxes with zero width or height:

predicted_boxes.scale(scale_x, scale_y)
predicted_boxes.clip(result.image_size)


In my case, predicted_boxes contains coordinates that are either:

  • already scaled, or
  • larger than the input image size for some reason.

I think the first case applies: the boxes get scaled up a second time, causing coordinates to overflow the image boundaries. After clipping, both x coordinates (or both y coordinates) collapse to the same value, the image width (or height), leaving a zero-width (or zero-height) box.
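The collapse described above can be sketched in plain Python (an illustration only: the tuples and the scale_box/clip_box helpers below are stand-ins for detectron2's Boxes.scale and Boxes.clip, and the 800×600 image size is a made-up example):

```python
def scale_box(box, sx, sy):
    """Scale (x0, y0, x1, y1) coordinates, mirroring Boxes.scale."""
    x0, y0, x1, y1 = box
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)

def clip_box(box, w, h):
    """Clamp coordinates into [0, w] x [0, h], mirroring Boxes.clip."""
    x0, y0, x1, y1 = box
    return (min(max(x0, 0.0), w), min(max(y0, 0.0), h),
            min(max(x1, 0.0), w), min(max(y1, 0.0), h))

# A box already in output coordinates, near the right edge of an 800x600 image.
box = (750.0, 100.0, 790.0, 140.0)
# Scaling it a second time pushes both x coordinates past the image boundary...
scaled = scale_box(box, 2.0, 2.0)       # (1500.0, 200.0, 1580.0, 280.0)
# ...so clipping collapses x0 and x1 to the image width: a zero-width box.
clipped = clip_box(scaled, 800.0, 600.0)
assert clipped == (800.0, 200.0, 800.0, 280.0)
assert clipped[2] - clipped[0] == 0.0   # zero width
```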
This later causes torch_ncut_detection's pairwise function to produce all-zero rows, and then _ncut_relabel's reciprocal converts those zeros to Infs.
d2 here contains those Infs:

d2 = torch.diag(torch.reciprocal(torch.sqrt(torch.diag(d2))))

The subsequent matmul then produces NaNs, and torch.linalg.eig(A) crashes with:
Intel MKL ERROR: Parameter 3 was incorrect on entry to DGEBAL.
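The Inf-to-NaN step can be reproduced in miniature with plain IEEE-754 floats (a sketch: I'm assuming torch.reciprocal follows IEEE-754 and returns Inf for a zero input, whereas Python's 1/0 would raise, so the Inf is written explicitly here):

```python
import math

d = 0.0  # a diagonal entry coming from an all-zero affinity row
# torch.reciprocal(0.) yields inf under IEEE-754; emulate that explicitly.
recip = math.inf if d == 0.0 else 1.0 / d
assert math.isinf(recip)

# Once an inf enters a matmul it meets zeros elsewhere in the matrix,
# and inf * 0 is nan under IEEE-754 -- the NaNs the eig call then chokes on.
assert math.isnan(recip * 0.0)
```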

A workaround for inference that worked for me was to turn off resizing by setting MIN_SIZE_TEST to 0 in UnSniffer.yaml.

But it would be better to skip the scaling when the output is already scaled, or, if it isn't, to filter out boxes with zero width or height before the normalized-cut step.
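A minimal sketch of the filtering idea (filter_degenerate is a hypothetical helper, not part of UnSniffer; boxes are (x0, y0, x1, y1) tuples, and the eps threshold is an assumption):

```python
def filter_degenerate(boxes, eps=1e-3):
    """Drop boxes whose width or height is at most eps, so all-zero
    affinity rows (and the resulting Inf/NaN) can't arise downstream."""
    return [b for b in boxes
            if (b[2] - b[0]) > eps and (b[3] - b[1]) > eps]

boxes = [
    (0.0, 0.0, 10.0, 10.0),      # a normal box
    (800.0, 200.0, 800.0, 280.0) # zero width after double scaling + clipping
]
kept = filter_degenerate(boxes)
assert kept == [(0.0, 0.0, 10.0, 10.0)]
```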
