'assert (boxes1[:, 2:] >= boxes1[:, :2]).all()' happened when training #101
Comments
@LovPe Thank you for your interest in DETR. This assertion is not a fluke; if you're getting it, it means that something is going wrong in your training. Here are some potential things you can look into:
Hope this helps, good luck with the debugging. |
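A first thing worth ruling out is bad ground-truth data. The sketch below is not from this thread; it is a minimal, hypothetical sanity check (the helper name `check_targets` is my own) for DETR-style targets, which are expected to be normalized (cx, cy, w, h) boxes in [0, 1]:

```python
import torch

def check_targets(targets):
    """Sanity-check DETR-style targets: each t["boxes"] should be
    normalized (cx, cy, w, h) with values in [0, 1] and w, h > 0."""
    for i, t in enumerate(targets):
        boxes = t["boxes"]
        if torch.isnan(boxes).any():
            raise ValueError(f"sample {i}: NaN box coordinates")
        if (boxes < 0).any() or (boxes > 1).any():
            raise ValueError(f"sample {i}: coordinates outside [0, 1]")
        if (boxes[:, 2:] <= 0).any():
            raise ValueError(f"sample {i}: non-positive width/height")

# A valid sample passes silently; a zero-width box would raise ValueError.
check_targets([{"boxes": torch.tensor([[0.5, 0.5, 0.2, 0.3]])}])
print("targets OK")
```

Running this over a full dataset before training can catch annotation problems before they surface as the assertion above.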
@LovPe I was getting this error when the learning rate was too high. |
I was getting this error and found that all the boxes were NaN. The cause was a fully padded mask produced in the interpolation step (backbone forward), mainly because of very large zero-padding when batching the images and masks. If so, the variables below ( Lines 65 to 68 in 10a2c75
|
I think @raviv is correct. On my dataset, LR = 2e-4 works well, but I ran into this error when I set LR = 2e-3. |
Sorry for my wrong answer. Setting a small LR actually only delays the error popping up. I've already set the LR to the 1e-6 level, but still got the error.... |
Hi @zlyin I believe we went over most of the debugging tips to identify where the root cause of this issue might be in #28. In particular, I would look to see if there are other error messages that appear in your code before the assert from the beginning, such as
which could be a different issue, caused by a wrong number of classes. |
Hi @fmassa, thanks for your reply. I solved this issue by changing the bbox format into the normalized coco format. |
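For anyone hitting the same format mismatch: DETR expects training targets as normalized (cx, cy, w, h), while many annotation tools export absolute (x0, y0, x1, y1). A minimal conversion sketch (the function name is my own, for illustration):

```python
def xyxy_abs_to_cxcywh_norm(box, img_w, img_h):
    """Convert an absolute (x0, y0, x1, y1) box to the normalized
    (cx, cy, w, h) format used for DETR training targets."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2 / img_w,
            (y0 + y1) / 2 / img_h,
            (x1 - x0) / img_w,
            (y1 - y0) / img_h)

# A 100x100 box at (100, 50)-(200, 150) in a 640x480 image:
print(xyxy_abs_to_cxcywh_norm((100, 50, 200, 150), 640, 480))
```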
I am training DETR on the COCO panoptic dataset, and I also hit this error. |
@fmassa |
@LovPe @fmassa Hi, could you give us some advice? I have experimented with both the COCO dataset and a custom dataset, and I didn't change any hyperparameters except the GPU count. I hit this problem every time. |
I also seem to be running into this error when training on custom data. It seems to happen with certain combinations of batch and image sizes, with no pattern that I can determine. I am feeding the bbox values into the network in normalized xywh format, and I don't seem to be running into any other errors. |
I met the same issue when training with the default setting on COCO detection. python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path /path/to/coco |
any solutions on this? |
I met the same issue; any solutions on this? |
Hi, recently I met a similar problem, and when I set the key_padding_mask in multihead attention to None, the bug was gone. So I wonder whether a large zero-padding mask is enough to trigger the issue, or whether the mask must be fully zero-padded. Thank you. |
Hi, I found multiple uses of key_padding_mask in multihead attention calls in transformer.py; do you know which line numbers you changed? Edit for posterity: the lines I tried changing were transformer.py 227 and 251, each from "key_padding_mask=memory_padding_mask" to "key_padding_mask=None". It did not fix the box assertion error that all of us are having. Is this the same change that you made? |
I am sharing my experience; maybe it will be helpful for you. |
I have changed num_classes and the LR, but the error is still there. |
I'm wondering whether this is sufficient. For example, if |
I still get this error. All I did was set the batch size to 1. Did someone get a solution for this? |
Set num_classes = number of classes + 1 in the config; then this error no longer pops up. |
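On the num_classes point: DETR's classification head outputs num_classes + 1 logits, the extra one being the "no object" class, so a ground-truth label greater than or equal to the configured num_classes makes the loss index out of range. A small illustrative sketch (the dimensions here are my own assumptions, not the repo's defaults):

```python
import torch
import torch.nn as nn

num_classes = 20                               # labels must lie in [0, num_classes - 1]
hidden_dim = 256                               # assumed transformer width for this sketch
class_embed = nn.Linear(hidden_dim, num_classes + 1)   # +1 logit for "no object"

logits = class_embed(torch.randn(3, hidden_dim))
print(logits.shape)                            # (3, 21): 20 classes + "no object"

# A label >= num_classes would crash CrossEntropyLoss; check your annotations:
labels = torch.tensor([0, 7, 19])
assert labels.max() < num_classes
```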
In the bounding box loss, I changed Line 156 in 8a144f8
to: losses['loss_bbox'] = loss_bbox.sum() / num_boxes if num_boxes > 0 else loss_bbox.sum()
and Line 161 in 8a144f8
to: losses['loss_giou'] = loss_giou.sum() / num_boxes if num_boxes > 0 else loss_giou.sum() |
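The change above guards against dividing by zero when a batch contains no ground-truth boxes, which would otherwise turn the loss into NaN. A self-contained sketch of the same idea (the helper name is my own):

```python
import torch

def normalize_loss(loss, num_boxes):
    """Divide a summed loss by the box count, but avoid 0/0 -> NaN
    when the batch has no ground-truth boxes."""
    total = loss.sum()
    return total / num_boxes if num_boxes > 0 else total

print(normalize_loss(torch.tensor([0.5, 1.5]), 2))   # tensor(1.)
print(normalize_loss(torch.tensor([]), 0))           # tensor(0.) instead of NaN
```

Once any loss term goes NaN, the gradients and then the predicted boxes follow, which is one route to the assertion in this issue.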
Hi, I also have the same error. I tried all the suggestions above (changing num_classes, and only dividing by num_boxes when it's > 0). |
I found where the problem was! It was because of the
before the line
add
before the line
and add
before the line
|
Thanks for the amazing work.
I have a question when training with your code:
assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
happened in the function generalized_box_iou.
After reading the code, I found that boxes1 is the predicted bbox from an MLP layer, so I think the above assertion may fire early in training and then break the run.
I wonder if there is a mechanism that can make sure this does not happen.
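One observation that helps narrow this down: the conversion from (cx, cy, w, h) to (x0, y0, x1, y1) always satisfies the assertion as long as w and h are non-negative, which they are after a sigmoid. So when `generalized_box_iou` trips, the predictions are typically NaN (diverged training) rather than merely misordered. A sketch, mirroring the conversion in DETR's util/box_ops.py:

```python
import torch

def box_cxcywh_to_xyxy(b):
    """Convert (cx, cy, w, h) boxes to (x0, y0, x1, y1)."""
    cx, cy, w, h = b.unbind(-1)
    return torch.stack([cx - 0.5 * w, cy - 0.5 * h,
                        cx + 0.5 * w, cy + 0.5 * h], dim=-1)

# With w, h >= 0 (e.g. sigmoid outputs) the result always has x1 >= x0 and
# y1 >= y0, so the assert in generalized_box_iou can only fail on NaN inputs.
pred = torch.sigmoid(torch.randn(4, 4))   # stand-in for the MLP head output
xyxy = box_cxcywh_to_xyxy(pred)
assert (xyxy[:, 2:] >= xyxy[:, :2]).all()
assert not torch.isnan(pred).any(), "NaN predictions: check LR, masks, labels"
```

In other words, checking `torch.isnan` on the raw predictions just before the IoU computation usually points at the real failure (exploding LR, fully padded masks, or bad labels) faster than the assertion itself.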
My Environment:
Environment information collected with python -m torch.utils.collect_env:
Collecting environment information...
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.0
OS: Ubuntu 16.04.5 LTS
GCC version: (Ubuntu 4.9.3-13ubuntu2) 4.9.3
CMake version: version 3.16.2
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: TITAN Xp
GPU 1: TITAN Xp
GPU 2: TITAN Xp
GPU 3: TITAN Xp
Nvidia driver version: 410.48
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
Versions of relevant libraries:
[pip3] numpy==1.17.1
[pip3] torch==1.4.0
[pip3] torchfile==0.1.0
[pip3] torchvision==0.5.0
[conda] mkl 2019.4 243 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] pytorch 1.4.0 py3.6_cuda10.0.130_cudnn7.6.3_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
[conda] torch 1.0.0
[conda] torchfile 0.1.0
[conda] torchvision 0.5.0 py36_cu100 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch