RuntimeError: [enforce fail at operator.cc:75] blob != nullptr. op Conv: Encountered a non-existing input blob: gpu_0/old_res3_7_sum #3

carryyu · 2019-09-17T12:54:06Z

I don't have 8 GPUS, so I chang3 Num_GPUS to 2 and it raise this error. How can I fix it?

I use e2e_cascade_rcnn_X-101-64x4d-FPN_1x.yaml. I change it like:
MODEL:
TYPE: generalized_rcnn
CONV_BODY: FPN.add_fpn_ResNet101_conv5_body
NUM_CLASSES: 21
FASTER_RCNN: True
CASCADE_ON: True
CLS_AGNOSTIC_BBOX_REG: True # default: False
NUM_GPUS: 2
SOLVER:
WEIGHT_DECAY: 0.0001
LR_POLICY: steps_with_decay
BASE_LR: 0.01
GAMMA: 0.1
MAX_ITER: 180000
STEPS: [0, 120000, 160000]
FPN:
FPN_ON: True
MULTILEVEL_ROIS: True
MULTILEVEL_RPN: True
RESNETS:
STRIDE_1X1: False # default True for MSRA; False for C2 or Torch models
TRANS_FUNC: bottleneck_transformation
NUM_GROUPS: 64
WIDTH_PER_GROUP: 4
FAST_RCNN:
ROI_BOX_HEAD: fast_rcnn_heads.add_roi_2mlp_head
ROI_XFORM_METHOD: RoIAlign
ROI_XFORM_RESOLUTION: 7
ROI_XFORM_SAMPLING_RATIO: 2
CASCADE_RCNN:
ROI_BOX_HEAD: cascade_rcnn_heads.add_roi_2mlp_head
NUM_STAGE: 3
TEST_STAGE: 3
TEST_ENSEMBLE: True
TRAIN:
WEIGHTS: https://dl.fbaipublicfiles.com/detectron/ImageNetPretrained/FBResNeXt/X-101-64x4d.pkl
DATASETS: ('coco_2014_train', 'coco_2014_valminusminival')
SCALES: (800,)
MAX_SIZE: 1333
IMS_PER_BATCH: 1
BATCH_SIZE_PER_IM: 512
RPN_PRE_NMS_TOP_N: 2000 # Per FPN level
TEST:
DATASETS: ('coco_2014_valminusminival',)
SCALE: 800
MAX_SIZE: 1333
NMS: 0.5
RPN_PRE_NMS_TOP_N: 1000 # Per FPN level
RPN_POST_NMS_TOP_N: 1000
OUTPUT_DIR: .

the error:

[W workspace.cc:170] Blob gpu_0/old_res3_7_sum not in the workspace.
WARNING workspace.py: 222: Original python traceback for operator 383 in network generalized_rcnn in exception above (most recent call last):
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/tools/train_net.py", line 133, in
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/tools/train_net.py", line 115, in main
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/utils/train.py", line 53, in train_model
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/utils/train.py", line 145, in create_model
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/model_builder.py", line 127, in create
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/model_builder.py", line 91, in generalized_rcnn
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/model_builder.py", line 259, in build_generic_detection_model
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/optimizer.py", line 40, in build_data_parallel_model
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/optimizer.py", line 63, in _build_forward_graph
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/model_builder.py", line 189, in _single_gpu_build_func
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/FPN.py", line 64, in add_fpn_ResNet101_conv5_body
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/FPN.py", line 112, in add_fpn_onto_conv_body
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/ResNet.py", line 48, in add_ResNet101_conv5_body
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/ResNet.py", line 145, in add_ResNet_convX_body
Traceback (most recent call last):
File "/home/lzy/diverse/CBNet/tools/train_net.py", line 133, in
main()
File "/home/lzy/diverse/CBNet/tools/train_net.py", line 115, in main
checkpoints = detectron.utils.train.train_model()
File "/home/lzy/diverse/CBNet/detectron/utils/train.py", line 58, in train_model
setup_model_for_training(model, weights_file, output_dir)
File "/home/lzy/diverse/CBNet/detectron/utils/train.py", line 179, in setup_model_for_training
workspace.CreateNet(model.net)
File "/home/lzy/pytorch/build/caffe2/python/workspace.py", line 181, in CreateNet
StringifyProto(net), overwrite,
File "/home/lzy/pytorch/build/caffe2/python/workspace.py", line 215, in CallWithExceptionIntercept
return func(args, kwargs)
RuntimeError: [enforce fail at operator.cc:75] blob != nullptr. op Conv: Encountered a non-existing input blob: gpu_0/old_res3_7_sum
frame #0: c10::ThrowEnforceNotMet(char const, int, char const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, void const) + 0x76 (0x7f916475ed36 in /home/lzy/pytorch/build/lib/libc10.so)
frame #1: caffe2::OperatorBase::OperatorBase(caffe2::OperatorDef const&, caffe2::Workspace*) + 0x3ff (0x7f9144b7bd2f in /home/lzy/pytorch/build/lib/libtorch.so)
frame #2: + 0x3f68805 (0x7f914635b805 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #3: + 0x3f868eb (0x7f91463798eb in /home/lzy/pytorch/build/lib/libtorch.so)
frame #4: + 0x3f8841e (0x7f914637b41e in /home/lzy/pytorch/build/lib/libtorch.so)
frame #5: std::_Function_handler<std::unique_ptr<caffe2::OperatorBase, std::default_deletecaffe2::OperatorBase > (caffe2::OperatorDef const&, caffe2::Workspace*), std::unique_ptr<caffe2::OperatorBase, std::default_deletecaffe2::OperatorBase > ()(caffe2::OperatorDef const&, caffe2::Workspace)>::_M_invoke(std::_Any_data const&, caffe2::OperatorDef const&, caffe2::Workspace*&&) + 0x23 (0x7f9164bf96a3 in /home/lzy/pytorch/build/caffe2/python/caffe2_pybind11_state_gpu.so)
frame #6: + 0x2786301 (0x7f9144b79301 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #7: caffe2::CreateOperator(caffe2::OperatorDef const&, caffe2::Workspace*, int) + 0x32a (0x7f9144b7a60a in /home/lzy/pytorch/build/lib/libtorch.so)
frame #8: caffe2::dag_utils::prepareOperatorNodes(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*) + 0x17f3 (0x7f9144b74b93 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #9: caffe2::AsyncNetBase::AsyncNetBase(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*) + 0x246 (0x7f9144b8c026 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #10: caffe2::AsyncSchedulingNet::AsyncSchedulingNet(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*) + 0x9 (0x7f9144bb6989 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #11: + 0x27c5e2e (0x7f9144bb8e2e in /home/lzy/pytorch/build/lib/libtorch.so)
frame #12: std::_Function_handler<std::unique_ptr<caffe2::NetBase, std::default_deletecaffe2::NetBase > (std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*), std::unique_ptr<caffe2::NetBase, std::default_deletecaffe2::NetBase > ()(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace)>::_M_invoke(std::_Any_data const&, std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*&&) + 0x23 (0x7f9144bb8ce3 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #13: caffe2::CreateNet(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*) + 0x847 (0x7f9144bc3117 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #14: caffe2::Workspace::CreateNet(std::shared_ptr<caffe2::NetDef const> const&, bool) + 0x13c (0x7f9144bdf24c in /home/lzy/pytorch/build/lib/libtorch.so)
frame #15: caffe2::Workspace::CreateNet(caffe2::NetDef const&, bool) + 0x9f (0x7f9144be094f in /home/lzy/pytorch/build/lib/libtorch.so)
frame #16: + 0x51f70 (0x7f9164beef70 in /home/lzy/pytorch/build/caffe2/python/caffe2_pybind11_state_gpu.so)
frame #17: + 0x521de (0x7f9164bef1de in /home/lzy/pytorch/build/caffe2/python/caffe2_pybind11_state_gpu.so)
frame #18: + 0x99160 (0x7f9164c36160 in /home/lzy/pytorch/build/caffe2/python/caffe2_pybind11_state_gpu.so)

frame #36: __libc_start_main + 0xf0 (0x7f9168059830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #37: + 0x107f (0x55e423b0507f in /home/lzy/anaconda2/envs/lzy/bin/python)

What's more, I can train model on the original detectron.

The text was updated successfully, but these errors were encountered:

carryyu · 2019-09-17T13:22:58Z

Your detectron version is a bit low。

PKUbahuangliuhe · 2019-09-17T15:41:27Z

I don't have 8 GPUS, so I chang3 Num_GPUS to 2 and it raise this error. How can I fix it?

I use e2e_cascade_rcnn_X-101-64x4d-FPN_1x.yaml. I change it like:
MODEL:
TYPE: generalized_rcnn
CONV_BODY: FPN.add_fpn_ResNet101_conv5_body
NUM_CLASSES: 21
FASTER_RCNN: True
CASCADE_ON: True
CLS_AGNOSTIC_BBOX_REG: True # default: False
NUM_GPUS: 2
SOLVER:
WEIGHT_DECAY: 0.0001
LR_POLICY: steps_with_decay
BASE_LR: 0.01
GAMMA: 0.1
MAX_ITER: 180000
STEPS: [0, 120000, 160000]
FPN:
FPN_ON: True
MULTILEVEL_ROIS: True
MULTILEVEL_RPN: True
RESNETS:
STRIDE_1X1: False # default True for MSRA; False for C2 or Torch models
TRANS_FUNC: bottleneck_transformation
NUM_GROUPS: 64
WIDTH_PER_GROUP: 4
FAST_RCNN:
ROI_BOX_HEAD: fast_rcnn_heads.add_roi_2mlp_head
ROI_XFORM_METHOD: RoIAlign
ROI_XFORM_RESOLUTION: 7
ROI_XFORM_SAMPLING_RATIO: 2
CASCADE_RCNN:
ROI_BOX_HEAD: cascade_rcnn_heads.add_roi_2mlp_head
NUM_STAGE: 3
TEST_STAGE: 3
TEST_ENSEMBLE: True
TRAIN:
WEIGHTS: https://dl.fbaipublicfiles.com/detectron/ImageNetPretrained/FBResNeXt/X-101-64x4d.pkl
DATASETS: ('coco_2014_train', 'coco_2014_valminusminival')
SCALES: (800,)
MAX_SIZE: 1333
IMS_PER_BATCH: 1
BATCH_SIZE_PER_IM: 512
RPN_PRE_NMS_TOP_N: 2000 # Per FPN level
TEST:
DATASETS: ('coco_2014_valminusminival',)
SCALE: 800
MAX_SIZE: 1333
NMS: 0.5
RPN_PRE_NMS_TOP_N: 1000 # Per FPN level
RPN_POST_NMS_TOP_N: 1000
OUTPUT_DIR: .

the error:

[W workspace.cc:170] Blob gpu_0/old_res3_7_sum not in the workspace.
WARNING workspace.py: 222: Original python traceback for operator 383 in network generalized_rcnn in exception above (most recent call last):
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/tools/train_net.py", line 133, in
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/tools/train_net.py", line 115, in main
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/utils/train.py", line 53, in train_model
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/utils/train.py", line 145, in create_model
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/model_builder.py", line 127, in create
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/model_builder.py", line 91, in generalized_rcnn
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/model_builder.py", line 259, in build_generic_detection_model
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/optimizer.py", line 40, in build_data_parallel_model
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/optimizer.py", line 63, in _build_forward_graph
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/model_builder.py", line 189, in _single_gpu_build_func
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/FPN.py", line 64, in add_fpn_ResNet101_conv5_body
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/FPN.py", line 112, in add_fpn_onto_conv_body
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/ResNet.py", line 48, in add_ResNet101_conv5_body
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/ResNet.py", line 145, in add_ResNet_convX_body
Traceback (most recent call last):
File "/home/lzy/diverse/CBNet/tools/train_net.py", line 133, in
main()
File "/home/lzy/diverse/CBNet/tools/train_net.py", line 115, in main
checkpoints = detectron.utils.train.train_model()
File "/home/lzy/diverse/CBNet/detectron/utils/train.py", line 58, in train_model
setup_model_for_training(model, weights_file, output_dir)
File "/home/lzy/diverse/CBNet/detectron/utils/train.py", line 179, in setup_model_for_training
workspace.CreateNet(model.net)
File "/home/lzy/pytorch/build/caffe2/python/workspace.py", line 181, in CreateNet
StringifyProto(net), overwrite,
File "/home/lzy/pytorch/build/caffe2/python/workspace.py", line 215, in CallWithExceptionIntercept
return func(_args, kwargs) RuntimeError: [enforce fail at operator.cc:75] blob != nullptr. op Conv: Encountered a non-existing input blob: gpu_0/old_res3_7_sum frame #0: c10::ThrowEnforceNotMet(char const, int, char const, std::cxx11::basic_string<char, std::char_traits, std::allocator > const&, void const) + 0x76 (0x7f916475ed36 in /home/lzy/pytorch/build/lib/libc10.so)
frame #1: caffe2::OperatorBase::OperatorBase(caffe2::OperatorDef const&, caffe2::Workspace*) + 0x3ff (0x7f9144b7bd2f in /home/lzy/pytorch/build/lib/libtorch.so)
frame #2: + 0x3f68805 (0x7f914635b805 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #3: + 0x3f868eb (0x7f91463798eb in /home/lzy/pytorch/build/lib/libtorch.so)
frame #4: + 0x3f8841e (0x7f914637b41e in /home/lzy/pytorch/build/lib/libtorch.so)
frame #5: std::Function_handler<std::unique_ptr<caffe2::OperatorBase, std::default_deletecaffe2::OperatorBase > (caffe2::OperatorDef const&, caffe2::Workspace*), std::unique_ptr<caffe2::OperatorBase, std::default_deletecaffe2::OperatorBase > ()(caffe2::OperatorDef const&, caffe2::Workspace)>::_M_invoke(std::Any_data const&, caffe2::OperatorDef const&, caffe2::Workspace*&&) + 0x23 (0x7f9164bf96a3 in /home/lzy/pytorch/build/caffe2/python/caffe2_pybind11_state_gpu.so)
frame #6: + 0x2786301 (0x7f9144b79301 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #7: caffe2::CreateOperator(caffe2::OperatorDef const&, caffe2::Workspace*, int) + 0x32a (0x7f9144b7a60a in /home/lzy/pytorch/build/lib/libtorch.so)
frame #8: caffe2::dag_utils::prepareOperatorNodes(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*) + 0x17f3 (0x7f9144b74b93 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #9: caffe2::AsyncNetBase::AsyncNetBase(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*) + 0x246 (0x7f9144b8c026 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #10: caffe2::AsyncSchedulingNet::AsyncSchedulingNet(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*) + 0x9 (0x7f9144bb6989 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #11: + 0x27c5e2e (0x7f9144bb8e2e in /home/lzy/pytorch/build/lib/libtorch.so)
frame #12: std::Function_handler<std::unique_ptr<caffe2::NetBase, std::default_deletecaffe2::NetBase > (std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*), std::unique_ptr<caffe2::NetBase, std::default_deletecaffe2::NetBase > ()(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace)>::_M_invoke(std::_Any_data const&, std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*&&) + 0x23 (0x7f9144bb8ce3 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #13: caffe2::CreateNet(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*) + 0x847 (0x7f9144bc3117 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #14: caffe2::Workspace::CreateNet(std::shared_ptr<caffe2::NetDef const> const&, bool) + 0x13c (0x7f9144bdf24c in /home/lzy/pytorch/build/lib/libtorch.so)
frame #15: caffe2::Workspace::CreateNet(caffe2::NetDef const&, bool) + 0x9f (0x7f9144be094f in /home/lzy/pytorch/build/lib/libtorch.so)
frame #16: + 0x51f70 (0x7f9164beef70 in /home/lzy/pytorch/build/caffe2/python/caffe2_pybind11_state_gpu.so)
frame #17: + 0x521de (0x7f9164bef1de in /home/lzy/pytorch/build/caffe2/python/caffe2_pybind11_state_gpu.so)
frame #18: + 0x99160 (0x7f9164c36160 in /home/lzy/pytorch/build/caffe2/python/caffe2_pybind11_state_gpu.so)

frame #36: __libc_start_main + 0xf0 (0x7f9168059830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #37: + 0x107f (0x55e423b0507f in /home/lzy/anaconda2/envs/lzy/bin/python)

What's more, I can train model on the original detectron.

I note that you are using x101 instead of x152, the node name needs changed. res3_7 and res4_22 should be rewrited as res3_3 and res3_5

PKUbahuangliuhe · 2019-09-17T15:42:28Z

The lr should be changed linearly according to detectron if you reduce the number of gpu

carryyu · 2019-09-17T17:04:01Z

The lr should be changed linearly according to detectron if you reduce the number of gpu

Thank you very much, can u show me how to change the node name? extremely grateful！

PKUbahuangliuhe · 2019-09-18T01:54:21Z

The lr should be changed linearly according to detectron if you reduce the number of gpu

Thank you very much, can u show me how to change the node name? extremely grateful！

In detectron/modeling/ResNet.py, line 134, 'old_res3_7_sum'-->'old_res3_3_sum', line 158,'old_res4_35_sum'-->'old_res4_22_sum'. lr : 0.00125(since you use two gpus). train iter:180000*4

carryyu · 2019-09-18T02:30:35Z

The lr should be changed linearly according to detectron if you reduce the number of gpu

Thank you very much, can u show me how to change the node name? extremely grateful！

In detectron/modeling/ResNet.py, line 134, 'old_res3_7_sum'-->'old_res3_3_sum', line 158,'old_res4_35_sum'-->'old_res4_22_sum'. lr : 0.00125(since you use two gpus). train iter:180000*4

Thank u very much, it works!

David-19940718 · 2019-09-19T03:05:57Z

@PKUbahuangliuhe Thanks for your nice great work!
Hi, author, can you tell me how to set the lr?
if i have got one gpu, the lr should set how much?
if i have got two? I want to know why we set this value can better for trianing.
Looking forward to your replying. tks.

PKUbahuangliuhe · 2019-09-19T03:31:17Z

@PKUbahuangliuhe Thanks for your nice great work!
Hi, author, can you tell me how to set the lr?
if i have got one gpu, the lr should set how much?
if i have got two? I want to know why we set this value can better for trianing.
Looking forward to your replying. tks.

Firstly, we reduce the lr by half compared to the baseline. And you also need to reduce the lr linearly if you change the gpu number （according to the original detectron）. For example, the baseline in Cascade R-CNN-X152 utilizes 8 gpus and lr is 0.01. And if you train Dual-X152 with 2 gpus, lr should be set as 0.01/2/(8/2). Note that the train iter also needs changed when the number of gpus is reduced due to the reduction of batch size.

lironghua318 · 2019-11-27T05:38:26Z

I don't have 8 GPUS, so I chang3 Num_GPUS to 2 and it raise this error. How can I fix it?

I use e2e_cascade_rcnn_X-101-64x4d-FPN_1x.yaml. I change it like:
MODEL:
TYPE: generalized_rcnn
CONV_BODY: FPN.add_fpn_ResNet101_conv5_body
NUM_CLASSES: 21
FASTER_RCNN: True
CASCADE_ON: True
CLS_AGNOSTIC_BBOX_REG: True # default: False
NUM_GPUS: 2
SOLVER:
WEIGHT_DECAY: 0.0001
LR_POLICY: steps_with_decay
BASE_LR: 0.01
GAMMA: 0.1
MAX_ITER: 180000
STEPS: [0, 120000, 160000]
FPN:
FPN_ON: True
MULTILEVEL_ROIS: True
MULTILEVEL_RPN: True
RESNETS:
STRIDE_1X1: False # default True for MSRA; False for C2 or Torch models
TRANS_FUNC: bottleneck_transformation
NUM_GROUPS: 64
WIDTH_PER_GROUP: 4
FAST_RCNN:
ROI_BOX_HEAD: fast_rcnn_heads.add_roi_2mlp_head
ROI_XFORM_METHOD: RoIAlign
ROI_XFORM_RESOLUTION: 7
ROI_XFORM_SAMPLING_RATIO: 2
CASCADE_RCNN:
ROI_BOX_HEAD: cascade_rcnn_heads.add_roi_2mlp_head
NUM_STAGE: 3
TEST_STAGE: 3
TEST_ENSEMBLE: True
TRAIN:
WEIGHTS: https://dl.fbaipublicfiles.com/detectron/ImageNetPretrained/FBResNeXt/X-101-64x4d.pkl
DATASETS: ('coco_2014_train', 'coco_2014_valminusminival')
SCALES: (800,)
MAX_SIZE: 1333
IMS_PER_BATCH: 1
BATCH_SIZE_PER_IM: 512
RPN_PRE_NMS_TOP_N: 2000 # Per FPN level
TEST:
DATASETS: ('coco_2014_valminusminival',)
SCALE: 800
MAX_SIZE: 1333
NMS: 0.5
RPN_PRE_NMS_TOP_N: 1000 # Per FPN level
RPN_POST_NMS_TOP_N: 1000
OUTPUT_DIR: .

the error:

[W workspace.cc:170] Blob gpu_0/old_res3_7_sum not in the workspace.
WARNING workspace.py: 222: Original python traceback for operator 383 in network generalized_rcnn in exception above (most recent call last):
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/tools/train_net.py", line 133, in
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/tools/train_net.py", line 115, in main
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/utils/train.py", line 53, in train_model
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/utils/train.py", line 145, in create_model
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/model_builder.py", line 127, in create
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/model_builder.py", line 91, in generalized_rcnn
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/model_builder.py", line 259, in build_generic_detection_model
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/optimizer.py", line 40, in build_data_parallel_model
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/optimizer.py", line 63, in _build_forward_graph
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/model_builder.py", line 189, in _single_gpu_build_func
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/FPN.py", line 64, in add_fpn_ResNet101_conv5_body
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/FPN.py", line 112, in add_fpn_onto_conv_body
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/ResNet.py", line 48, in add_ResNet101_conv5_body
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/ResNet.py", line 145, in add_ResNet_convX_body
Traceback (most recent call last):
File "/home/lzy/diverse/CBNet/tools/train_net.py", line 133, in
main()
File "/home/lzy/diverse/CBNet/tools/train_net.py", line 115, in main
checkpoints = detectron.utils.train.train_model()
File "/home/lzy/diverse/CBNet/detectron/utils/train.py", line 58, in train_model
setup_model_for_training(model, weights_file, output_dir)
File "/home/lzy/diverse/CBNet/detectron/utils/train.py", line 179, in setup_model_for_training
workspace.CreateNet(model.net)
File "/home/lzy/pytorch/build/caffe2/python/workspace.py", line 181, in CreateNet
StringifyProto(net), overwrite,
File "/home/lzy/pytorch/build/caffe2/python/workspace.py", line 215, in CallWithExceptionIntercept
return func(_args, kwargs) RuntimeError: [enforce fail at operator.cc:75] blob != nullptr. op Conv: Encountered a non-existing input blob: gpu_0/old_res3_7_sum frame #0: c10::ThrowEnforceNotMet(char const, int, char const, std::cxx11::basic_string<char, std::char_traits, std::allocator > const&, void const) + 0x76 (0x7f916475ed36 in /home/lzy/pytorch/build/lib/libc10.so)
frame #1: caffe2::OperatorBase::OperatorBase(caffe2::OperatorDef const&, caffe2::Workspace*) + 0x3ff (0x7f9144b7bd2f in /home/lzy/pytorch/build/lib/libtorch.so)
frame #2: + 0x3f68805 (0x7f914635b805 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #3: + 0x3f868eb (0x7f91463798eb in /home/lzy/pytorch/build/lib/libtorch.so)
frame #4: + 0x3f8841e (0x7f914637b41e in /home/lzy/pytorch/build/lib/libtorch.so)
frame #5: std::Function_handler<std::unique_ptr<caffe2::OperatorBase, std::default_deletecaffe2::OperatorBase > (caffe2::OperatorDef const&, caffe2::Workspace*), std::unique_ptr<caffe2::OperatorBase, std::default_deletecaffe2::OperatorBase > ()(caffe2::OperatorDef const&, caffe2::Workspace)>::_M_invoke(std::Any_data const&, caffe2::OperatorDef const&, caffe2::Workspace*&&) + 0x23 (0x7f9164bf96a3 in /home/lzy/pytorch/build/caffe2/python/caffe2_pybind11_state_gpu.so)
frame #6: + 0x2786301 (0x7f9144b79301 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #7: caffe2::CreateOperator(caffe2::OperatorDef const&, caffe2::Workspace*, int) + 0x32a (0x7f9144b7a60a in /home/lzy/pytorch/build/lib/libtorch.so)
frame #8: caffe2::dag_utils::prepareOperatorNodes(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*) + 0x17f3 (0x7f9144b74b93 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #9: caffe2::AsyncNetBase::AsyncNetBase(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*) + 0x246 (0x7f9144b8c026 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #10: caffe2::AsyncSchedulingNet::AsyncSchedulingNet(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*) + 0x9 (0x7f9144bb6989 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #11: + 0x27c5e2e (0x7f9144bb8e2e in /home/lzy/pytorch/build/lib/libtorch.so)
frame #12: std::Function_handler<std::unique_ptr<caffe2::NetBase, std::default_deletecaffe2::NetBase > (std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*), std::unique_ptr<caffe2::NetBase, std::default_deletecaffe2::NetBase > ()(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace)>::_M_invoke(std::_Any_data const&, std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*&&) + 0x23 (0x7f9144bb8ce3 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #13: caffe2::CreateNet(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*) + 0x847 (0x7f9144bc3117 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #14: caffe2::Workspace::CreateNet(std::shared_ptr<caffe2::NetDef const> const&, bool) + 0x13c (0x7f9144bdf24c in /home/lzy/pytorch/build/lib/libtorch.so)
frame #15: caffe2::Workspace::CreateNet(caffe2::NetDef const&, bool) + 0x9f (0x7f9144be094f in /home/lzy/pytorch/build/lib/libtorch.so)
frame #16: + 0x51f70 (0x7f9164beef70 in /home/lzy/pytorch/build/caffe2/python/caffe2_pybind11_state_gpu.so)
frame #17: + 0x521de (0x7f9164bef1de in /home/lzy/pytorch/build/caffe2/python/caffe2_pybind11_state_gpu.so)
frame #18: + 0x99160 (0x7f9164c36160 in /home/lzy/pytorch/build/caffe2/python/caffe2_pybind11_state_gpu.so)
frame #36: __libc_start_main + 0xf0 (0x7f9168059830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #37: + 0x107f (0x55e423b0507f in /home/lzy/anaconda2/envs/lzy/bin/python)

What's more, I can train model on the original detectron.

I note that you are using x101 instead of x152, the node name needs changed. res3_7 and res4_22 should be rewrited as res3_3 and res3_5

how about e2e_cascade_rcnn_R-50-FPN_1x.yaml? my gpu is 12G, can`t run 101

hw446 mentioned this issue Oct 26, 2019

RuntimeError: [enforce fail at operator.cc:75] blob != nullptr. op Conv: Encountered a non-existing input blob: gpu_0/old_res3_7_sum facebookresearch/Detectron#941

Open

This was referenced Nov 27, 2019

HTTP Error 301: Moved Permanently #13

Closed

RuntimeError: [enforce fail at pybind_state.h:425] . Exception encountered running PythonOp function: ValueError: min() arg is an empty sequence #14

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RuntimeError: [enforce fail at operator.cc:75] blob != nullptr. op Conv: Encountered a non-existing input blob: gpu_0/old_res3_7_sum #3

RuntimeError: [enforce fail at operator.cc:75] blob != nullptr. op Conv: Encountered a non-existing input blob: gpu_0/old_res3_7_sum #3

carryyu commented Sep 17, 2019

carryyu commented Sep 17, 2019

PKUbahuangliuhe commented Sep 17, 2019

I don't have 8 GPUS, so I chang3 Num_GPUS to 2 and it raise this error. How can I fix it?

the error:

What's more, I can train model on the original detectron.

PKUbahuangliuhe commented Sep 17, 2019

carryyu commented Sep 17, 2019

PKUbahuangliuhe commented Sep 18, 2019

carryyu commented Sep 18, 2019

David-19940718 commented Sep 19, 2019

PKUbahuangliuhe commented Sep 19, 2019

lironghua318 commented Nov 27, 2019

I don't have 8 GPUS, so I chang3 Num_GPUS to 2 and it raise this error. How can I fix it?

the error:

What's more, I can train model on the original detectron.

RuntimeError: [enforce fail at operator.cc:75] blob != nullptr. op Conv: Encountered a non-existing input blob: gpu_0/old_res3_7_sum #3

RuntimeError: [enforce fail at operator.cc:75] blob != nullptr. op Conv: Encountered a non-existing input blob: gpu_0/old_res3_7_sum #3

Comments

carryyu commented Sep 17, 2019

I don't have 8 GPUS, so I chang3 Num_GPUS to 2 and it raise this error. How can I fix it?

the error:

What's more, I can train model on the original detectron.

carryyu commented Sep 17, 2019

PKUbahuangliuhe commented Sep 17, 2019

I don't have 8 GPUS, so I chang3 Num_GPUS to 2 and it raise this error. How can I fix it?

the error:

What's more, I can train model on the original detectron.

PKUbahuangliuhe commented Sep 17, 2019

carryyu commented Sep 17, 2019

PKUbahuangliuhe commented Sep 18, 2019

carryyu commented Sep 18, 2019

David-19940718 commented Sep 19, 2019

PKUbahuangliuhe commented Sep 19, 2019

lironghua318 commented Nov 27, 2019

I don't have 8 GPUS, so I chang3 Num_GPUS to 2 and it raise this error. How can I fix it?

the error:

What's more, I can train model on the original detectron.