FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged. #5
Could you please provide the training log? It may be caused by the start iteration of stage 2 or by the setting of the sampling threshold.
Did you solve it, @793761775?
I got this same error, but while training the model on the COCO dataset; the error appeared at 12k iterations, which is the starting iteration of VOS.
This could be related, but I got Infs and NaNs when trying to run inference on my custom images. The root cause is probably the box scaling and clipping, which produces predicted boxes with zero width or height:

predicted_boxes.scale(scale_x, scale_y)

(UnSniffer/detection/inference/inference_utils.py, lines 149 to 150 at commit dad023f)
In my case, predicted_boxes contains coordinates that appear to already be at the output image scale. Scaling them up again then makes the coordinates overflow the image boundaries, and after clipping, both x (or y) coordinates of a box end up equal to the image width (or height), so the box collapses to zero width or height.
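A minimal sketch of that failure mode, assuming predicted_boxes is a Detectron2 Boxes object (the scale/clip calls in inference_utils.py suggest it is); the image size, box coordinates, and scale factors below are made up for illustration:

```python
import torch
from detectron2.structures import Boxes

# Hypothetical image size and rescale factors, purely for illustration.
img_h, img_w = 480, 640
scale_x, scale_y = 2.0, 2.0

# A box that is already in output-image coordinates, near the right edge.
boxes = Boxes(torch.tensor([[600.0, 100.0, 630.0, 200.0]]))

boxes.scale(scale_x, scale_y)   # -> [1200, 200, 1260, 400]: overflows the 640-px-wide image
boxes.clip((img_h, img_w))      # -> [640, 200, 640, 400]: x1 == x2, i.e. a zero-width box

print(boxes.tensor)             # degenerate box; the thread reports these later produce NaNs
print(boxes.nonempty())         # tensor([False]) -- Detectron2 can flag such boxes
```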
The matmul then produces NaNs, and torch.linalg.eig(A) crashes with: Intel MKL ERROR: Parameter 3 was incorrect on entry to DGEBAL. A workaround that worked for me at inference time was to turn off resizing by setting MIN_SIZE_TEST to 0 in UnSniffer.yaml. But it would be better to skip the scaling when the output is already scaled, or, if it isn't, to filter out boxes with zero width or height.
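If the fix is to filter rather than to disable resizing, a hedged sketch of such a filter (again assuming Detectron2's Boxes API; the names predicted_boxes/predicted_scores are placeholders for whatever the inference code actually carries) could look like:

```python
import torch
from detectron2.structures import Boxes


def drop_degenerate_boxes(boxes: Boxes, scores: torch.Tensor):
    """Remove boxes whose width or height collapsed to zero after clipping.

    The argument names are placeholders; UnSniffer's actual inference code may
    carry more per-box fields (classes, logits, ...) that need the same mask.
    """
    keep = boxes.nonempty(threshold=0.0)  # keeps boxes with width > 0 and height > 0
    return boxes[keep], scores[keep]


# Example: the collapsed box from the scale/clip sketch above gets filtered out.
predicted_boxes = Boxes(torch.tensor([[640.0, 200.0, 640.0, 400.0],    # zero width
                                      [100.0, 100.0, 200.0, 200.0]]))  # valid
predicted_scores = torch.tensor([0.9, 0.8])
predicted_boxes, predicted_scores = drop_degenerate_boxes(predicted_boxes, predicted_scores)
print(predicted_boxes.tensor, predicted_scores)  # only the valid box and its score remain
```

Filtering before any downstream covariance or eigendecomposition keeps degenerate boxes from ever reaching the computation that crashes.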
Excuse me, I have set up the environment and the dataset. Then I started training, but it shows "FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged."
Can you please tell me how to solve this problem? Is it a learning rate issue?