
Error when computing the loss on a custom dataset #28

Closed
zhangvia opened this issue Jun 18, 2021 · 4 comments
@zhangvia

Hi, I want to train this model on a custom dataset. I converted the annotations to COCO format as shown below, and I changed the dataset path in the config file, but an error is still raised when computing the loss:
```json
"images": [...],
"categories": [...],
"annotations": [
    {
        "area": 3366.6824649275077,
        "category_id": 3,
        "segmentation": [
            [
                356.7924715069118,
                54.362633790420304,
                337.6467271719448,
                29.497233060353793,
                252.64484276405284,
                94.9465843308558,
                271.79058709901983,
                119.81198506092231
            ]
        ],
        "iscrowd": 0,
        "bbox": [
            304.7186571354823,
            74.65460906063805,
            31.3822828934849,
            107.27971818858487
        ],
        "image_id": 18,
        "id": 1
    }
]
```
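As a quick sanity check of a custom COCO-format file (a sketch added here, not part of the original thread; `check_coco` is a hypothetical helper), one can verify that every annotation references a known image and category id and carries a bbox in `[x, y, width, height]` order with positive size:

```python
import json

def check_coco(path):
    """Sanity-check a COCO-format annotation file and return a list
    of (annotation id, problem) tuples; an empty list means OK."""
    with open(path) as f:
        coco = json.load(f)
    image_ids = {img["id"] for img in coco["images"]}
    cat_ids = {cat["id"] for cat in coco["categories"]}
    problems = []
    for ann in coco["annotations"]:
        if ann["image_id"] not in image_ids:
            problems.append((ann["id"], "unknown image_id"))
        if ann["category_id"] not in cat_ids:
            problems.append((ann["id"], "unknown category_id"))
        # COCO bboxes are [x, y, width, height], not [x1, y1, x2, y2]
        x, y, w, h = ann["bbox"]
        if w <= 0 or h <= 0:
            problems.append((ann["id"], "non-positive bbox size"))
    return problems
```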

```
Traceback (most recent call last):
  File "tools/train.py", line 95, in <module>
    main()
  File "tools/train.py", line 91, in main
    logger=logger)
  File "/home/localchao/LHX/jyz/ReDet/mmdet/apis/train.py", line 61, in train_detector
    _non_dist_train(model, dataset, cfg, validate=validate)
  File "/home/localchao/LHX/jyz/ReDet/mmdet/apis/train.py", line 197, in _non_dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/localchao/anaconda3/envs/CG-Net/lib/python3.7/site-packages/mmcv-0.2.13-py3.7-linux-x86_64.egg/mmcv/runner/runner.py", line 358, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/localchao/anaconda3/envs/CG-Net/lib/python3.7/site-packages/mmcv-0.2.13-py3.7-linux-x86_64.egg/mmcv/runner/runner.py", line 264, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/home/localchao/LHX/jyz/ReDet/mmdet/apis/train.py", line 39, in batch_processor
    losses = model(**data)
  File "/home/localchao/anaconda3/envs/CG-Net/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/localchao/anaconda3/envs/CG-Net/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/localchao/anaconda3/envs/CG-Net/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/localchao/LHX/jyz/ReDet/mmdet/models/detectors/base_new.py", line 95, in forward
    return self.forward_train(img, img_meta, **kwargs)
  File "/home/localchao/LHX/jyz/ReDet/mmdet/models/detectors/ReDet.py", line 143, in forward_train
    *rpn_loss_inputs, gt_bboxes_ignore=gt_bboxes_ignore)
  File "/home/localchao/LHX/jyz/ReDet/mmdet/models/anchor_heads/rpn_head.py", line 51, in loss
    gt_bboxes_ignore=gt_bboxes_ignore)
  File "/home/localchao/LHX/jyz/ReDet/mmdet/models/anchor_heads/anchor_head.py", line 177, in loss
    sampling=self.sampling)
  File "/home/localchao/LHX/jyz/ReDet/mmdet/core/anchor/anchor_target.py", line 63, in anchor_target
    unmap_outputs=unmap_outputs)
  File "/home/localchao/LHX/jyz/ReDet/mmdet/core/utils/misc.py", line 24, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/home/localchao/LHX/jyz/ReDet/mmdet/core/anchor/anchor_target.py", line 108, in anchor_target_single
    cfg.allowed_border)
  File "/home/localchao/LHX/jyz/ReDet/mmdet/core/anchor/anchor_target.py", line 173, in anchor_inside_flags
    (flat_anchors[:, 2] < img_w + allowed_border) &
RuntimeError: Expected object of scalar type Byte but got scalar type Bool for argument #2 'other' in call to _th_and
```

@csuhan
Owner

csuhan commented Jun 18, 2021

  1. Use an older PyTorch version,
     or
  2. Apply the fix from: RuntimeError in: mmdet/core/anchor/anchor_target.py #15
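The error comes from mixing `uint8` and `bool` tensors: in this codebase `valid_flags` is `uint8`, while comparisons like `<` return `bool` tensors on PyTorch >= 1.2, and combining them with `&` raises the "Byte ... Bool" RuntimeError. A minimal sketch of the commonly applied patch for `anchor_inside_flags` (casting each comparison back to the flags' dtype with `.type_as()`; the anchor values below are made up for illustration):

```python
import torch

# Two example anchors in [x1, y1, x2, y2] format: the first lies
# inside a 100x100 image, the second sticks out on both sides.
flat_anchors = torch.tensor([[0.0, 0.0, 10.0, 10.0],
                             [-5.0, 0.0, 200.0, 10.0]])
valid_flags = torch.ones(2, dtype=torch.uint8)
img_h, img_w, allowed_border = 100, 100, 0

# Patched check: cast each bool comparison to uint8 before '&',
# so all operands share the dtype of valid_flags.
inside_flags = valid_flags & \
    (flat_anchors[:, 0] >= -allowed_border).type_as(valid_flags) & \
    (flat_anchors[:, 1] >= -allowed_border).type_as(valid_flags) & \
    (flat_anchors[:, 2] < img_w + allowed_border).type_as(valid_flags) & \
    (flat_anchors[:, 3] < img_h + allowed_border).type_as(valid_flags)
print(inside_flags)  # tensor([1, 0], dtype=torch.uint8)
```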

@zhangvia
Author

  1. Use an older PyTorch version,
     or
  2. Apply the fix from: RuntimeError in: mmdet/core/anchor/anchor_target.py #15

I tried both approaches, and both end with the same error: segmentation fault.

@zhangvia
Author

zhangvia commented Jun 19, 2021

I fixed the segmentation fault by updating gcc, then recompiled mmdet. I changed the data path in the config file, set the dataset type to CocoDataset, and adjusted the image scale, num_classes, and lr (since I only have a single 1080 Ti). Now the model starts but terminates on its own without printing anything. What could cause this? The output of running train.py is:
```
[localchao@localhost ReDet]$ python tools/train.py configs/ReDet/ReDet_re50_refpn_1x_dota1.py
ReResNet Orientation: 8 Fix Params: False
2021-06-20 00:37:38,683 - INFO - Distributed training: False
2021-06-20 00:38:11,890 - INFO - load model from: work_dirs/ReResNet_pretrain/re_resnet50_c8_batch256-25b16846.pth
2021-06-20 00:38:11,940 - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: head.fc.weight, head.fc.bias

missing keys in source state_dict: layer1.2.conv3.filter, layer3.1.conv1.filter, layer4.2.conv2.filter, layer3.1.conv3.filter, layer4.2.conv3.filter, layer1.1.conv1.filter, layer3.4.conv1.filter, layer4.2.conv1.filter, layer2.0.conv2.filter, layer1.2.conv2.filter, layer3.4.conv2.filter, conv1.filter, layer2.1.conv2.filter, layer3.5.conv1.filter, layer1.1.conv2.filter, layer2.3.conv1.filter, layer1.0.conv2.filter, layer2.3.conv2.filter, layer2.2.conv1.filter, layer2.2.conv2.filter, layer2.1.conv1.filter, layer3.2.conv1.filter, layer3.3.conv3.filter, layer1.2.conv1.filter, layer3.5.conv2.filter, layer3.1.conv2.filter, layer4.1.conv1.filter, layer4.0.conv3.filter, layer2.2.conv3.filter, layer3.0.downsample.0.filter, layer1.0.conv1.filter, layer2.3.conv3.filter, layer4.1.conv3.filter, layer3.2.conv2.filter, layer4.1.conv2.filter, layer3.0.conv2.filter, layer1.1.conv3.filter, layer3.4.conv3.filter, layer4.0.downsample.0.filter, layer3.3.conv1.filter, layer3.0.conv1.filter, layer2.0.downsample.0.filter, layer2.0.conv3.filter, layer3.5.conv3.filter, layer1.0.conv3.filter, layer3.0.conv3.filter, layer3.3.conv2.filter, layer2.1.conv3.filter, layer3.2.conv3.filter, layer4.0.conv1.filter, layer4.0.conv2.filter, layer2.0.conv1.filter, layer1.0.downsample.0.filter

loading annotations into memory...
Done (t=0.01s)
creating index...
index created!
2021-06-20 00:38:13,770 - INFO - Start running, host: localchao@localhost.localdomain, work_dir: /home/localchao/LHX/jyz/ReDet/work_dirs/ReDet_re50_refpn_1x_dota1
2021-06-20 00:38:13,770 - INFO - workflow: [('train', 1)], max: 12 epochs
[localchao@localhost ReDet]$
```
@csuhan

@csuhan
Owner

csuhan commented Jun 19, 2021

You can set a breakpoint inside the model's forward_train to check whether it is actually being executed.
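A silent exit right after "Start running", with no Python traceback, often means a crash inside native code (CUDA/C++ ops), which a breakpoint alone cannot explain. As a complement to the suggestion above (my addition, not from the thread), enabling `faulthandler` near the top of tools/train.py makes such crashes print the Python frames they were called from:

```python
import faulthandler

# Dump the Python stack to stderr if the interpreter dies from a
# native-code fault (SIGSEGV, SIGABRT, ...). A crash inside a
# CUDA/C++ extension then shows which Python call triggered it
# instead of the process exiting silently.
faulthandler.enable()
print(faulthandler.is_enabled())  # True
```

For a pure-Python hang or early return, `import pdb; pdb.set_trace()` placed at the start of forward_train (as suggested above) confirms whether training ever reaches the model.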

@csuhan csuhan closed this as completed Jun 22, 2021