RuntimeError: [enforce fail at operator.cc:75] blob != nullptr. op Conv: Encountered a non-existing input blob: gpu_0/old_res3_7_sum #3

Open
carryyu opened this issue Sep 17, 2019 · 9 comments

Comments

@carryyu

carryyu commented Sep 17, 2019

I don't have 8 GPUs, so I changed NUM_GPUS to 2, and it raises this error. How can I fix it?

I use e2e_cascade_rcnn_X-101-64x4d-FPN_1x.yaml, changed as follows:
MODEL:
  TYPE: generalized_rcnn
  CONV_BODY: FPN.add_fpn_ResNet101_conv5_body
  NUM_CLASSES: 21
  FASTER_RCNN: True
  CASCADE_ON: True
  CLS_AGNOSTIC_BBOX_REG: True  # default: False
NUM_GPUS: 2
SOLVER:
  WEIGHT_DECAY: 0.0001
  LR_POLICY: steps_with_decay
  BASE_LR: 0.01
  GAMMA: 0.1
  MAX_ITER: 180000
  STEPS: [0, 120000, 160000]
FPN:
  FPN_ON: True
  MULTILEVEL_ROIS: True
  MULTILEVEL_RPN: True
RESNETS:
  STRIDE_1X1: False  # default True for MSRA; False for C2 or Torch models
  TRANS_FUNC: bottleneck_transformation
  NUM_GROUPS: 64
  WIDTH_PER_GROUP: 4
FAST_RCNN:
  ROI_BOX_HEAD: fast_rcnn_heads.add_roi_2mlp_head
  ROI_XFORM_METHOD: RoIAlign
  ROI_XFORM_RESOLUTION: 7
  ROI_XFORM_SAMPLING_RATIO: 2
CASCADE_RCNN:
  ROI_BOX_HEAD: cascade_rcnn_heads.add_roi_2mlp_head
  NUM_STAGE: 3
  TEST_STAGE: 3
  TEST_ENSEMBLE: True
TRAIN:
  WEIGHTS: https://dl.fbaipublicfiles.com/detectron/ImageNetPretrained/FBResNeXt/X-101-64x4d.pkl
  DATASETS: ('coco_2014_train', 'coco_2014_valminusminival')
  SCALES: (800,)
  MAX_SIZE: 1333
  IMS_PER_BATCH: 1
  BATCH_SIZE_PER_IM: 512
  RPN_PRE_NMS_TOP_N: 2000  # Per FPN level
TEST:
  DATASETS: ('coco_2014_valminusminival',)
  SCALE: 800
  MAX_SIZE: 1333
  NMS: 0.5
  RPN_PRE_NMS_TOP_N: 1000  # Per FPN level
  RPN_POST_NMS_TOP_N: 1000
OUTPUT_DIR: .

the error:

[W workspace.cc:170] Blob gpu_0/old_res3_7_sum not in the workspace.
WARNING workspace.py: 222: Original python traceback for operator 383 in network generalized_rcnn in exception above (most recent call last):
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/tools/train_net.py", line 133, in <module>
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/tools/train_net.py", line 115, in main
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/utils/train.py", line 53, in train_model
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/utils/train.py", line 145, in create_model
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/model_builder.py", line 127, in create
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/model_builder.py", line 91, in generalized_rcnn
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/model_builder.py", line 259, in build_generic_detection_model
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/optimizer.py", line 40, in build_data_parallel_model
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/optimizer.py", line 63, in _build_forward_graph
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/model_builder.py", line 189, in _single_gpu_build_func
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/FPN.py", line 64, in add_fpn_ResNet101_conv5_body
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/FPN.py", line 112, in add_fpn_onto_conv_body
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/ResNet.py", line 48, in add_ResNet101_conv5_body
WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/ResNet.py", line 145, in add_ResNet_convX_body
Traceback (most recent call last):
  File "/home/lzy/diverse/CBNet/tools/train_net.py", line 133, in <module>
    main()
  File "/home/lzy/diverse/CBNet/tools/train_net.py", line 115, in main
    checkpoints = detectron.utils.train.train_model()
  File "/home/lzy/diverse/CBNet/detectron/utils/train.py", line 58, in train_model
    setup_model_for_training(model, weights_file, output_dir)
  File "/home/lzy/diverse/CBNet/detectron/utils/train.py", line 179, in setup_model_for_training
    workspace.CreateNet(model.net)
  File "/home/lzy/pytorch/build/caffe2/python/workspace.py", line 181, in CreateNet
    StringifyProto(net), overwrite,
  File "/home/lzy/pytorch/build/caffe2/python/workspace.py", line 215, in CallWithExceptionIntercept
    return func(*args, **kwargs)
RuntimeError: [enforce fail at operator.cc:75] blob != nullptr. op Conv: Encountered a non-existing input blob: gpu_0/old_res3_7_sum
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void const*) + 0x76 (0x7f916475ed36 in /home/lzy/pytorch/build/lib/libc10.so)
frame #1: caffe2::OperatorBase::OperatorBase(caffe2::OperatorDef const&, caffe2::Workspace*) + 0x3ff (0x7f9144b7bd2f in /home/lzy/pytorch/build/lib/libtorch.so)
frame #2: <unknown function> + 0x3f68805 (0x7f914635b805 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #3: <unknown function> + 0x3f868eb (0x7f91463798eb in /home/lzy/pytorch/build/lib/libtorch.so)
frame #4: <unknown function> + 0x3f8841e (0x7f914637b41e in /home/lzy/pytorch/build/lib/libtorch.so)
frame #5: std::_Function_handler<std::unique_ptr<caffe2::OperatorBase, std::default_delete<caffe2::OperatorBase> > (caffe2::OperatorDef const&, caffe2::Workspace*), std::unique_ptr<caffe2::OperatorBase, std::default_delete<caffe2::OperatorBase> > (*)(caffe2::OperatorDef const&, caffe2::Workspace*)>::_M_invoke(std::_Any_data const&, caffe2::OperatorDef const&, caffe2::Workspace*&&) + 0x23 (0x7f9164bf96a3 in /home/lzy/pytorch/build/caffe2/python/caffe2_pybind11_state_gpu.so)
frame #6: <unknown function> + 0x2786301 (0x7f9144b79301 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #7: caffe2::CreateOperator(caffe2::OperatorDef const&, caffe2::Workspace*, int) + 0x32a (0x7f9144b7a60a in /home/lzy/pytorch/build/lib/libtorch.so)
frame #8: caffe2::dag_utils::prepareOperatorNodes(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*) + 0x17f3 (0x7f9144b74b93 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #9: caffe2::AsyncNetBase::AsyncNetBase(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*) + 0x246 (0x7f9144b8c026 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #10: caffe2::AsyncSchedulingNet::AsyncSchedulingNet(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*) + 0x9 (0x7f9144bb6989 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #11: <unknown function> + 0x27c5e2e (0x7f9144bb8e2e in /home/lzy/pytorch/build/lib/libtorch.so)
frame #12: std::_Function_handler<std::unique_ptr<caffe2::NetBase, std::default_delete<caffe2::NetBase> > (std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*), std::unique_ptr<caffe2::NetBase, std::default_delete<caffe2::NetBase> > (*)(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*)>::_M_invoke(std::_Any_data const&, std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*&&) + 0x23 (0x7f9144bb8ce3 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #13: caffe2::CreateNet(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*) + 0x847 (0x7f9144bc3117 in /home/lzy/pytorch/build/lib/libtorch.so)
frame #14: caffe2::Workspace::CreateNet(std::shared_ptr<caffe2::NetDef const> const&, bool) + 0x13c (0x7f9144bdf24c in /home/lzy/pytorch/build/lib/libtorch.so)
frame #15: caffe2::Workspace::CreateNet(caffe2::NetDef const&, bool) + 0x9f (0x7f9144be094f in /home/lzy/pytorch/build/lib/libtorch.so)
frame #16: <unknown function> + 0x51f70 (0x7f9164beef70 in /home/lzy/pytorch/build/caffe2/python/caffe2_pybind11_state_gpu.so)
frame #17: <unknown function> + 0x521de (0x7f9164bef1de in /home/lzy/pytorch/build/caffe2/python/caffe2_pybind11_state_gpu.so)
frame #18: <unknown function> + 0x99160 (0x7f9164c36160 in /home/lzy/pytorch/build/caffe2/python/caffe2_pybind11_state_gpu.so)
<omitting python frames>
frame #36: __libc_start_main + 0xf0 (0x7f9168059830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #37: <unknown function> + 0x107f (0x55e423b0507f in /home/lzy/anaconda2/envs/lzy/bin/python)

What's more, I can train the model on the original Detectron.

@carryyu
Author

carryyu commented Sep 17, 2019

Your Detectron version is a bit old.

@PKUbahuangliuhe
Collaborator

I note that you are using X-101 instead of X-152, so the node names need to be changed: res3_7 and res4_35 should be rewritten as res3_3 and res4_22.

@PKUbahuangliuhe
Collaborator

The learning rate (lr) should be scaled linearly, as in the original Detectron, if you reduce the number of GPUs.

@carryyu
Author

carryyu commented Sep 17, 2019

Thank you very much. Can you show me how to change the node name? I'd be extremely grateful!

@PKUbahuangliuhe
Collaborator

In detectron/modeling/ResNet.py, change 'old_res3_7_sum' to 'old_res3_3_sum' on line 134 and 'old_res4_35_sum' to 'old_res4_22_sum' on line 158. Set the lr to 0.00125 (since you use two GPUs) and the number of training iterations to 180000*4.
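
For readers wondering where these indices come from, here is a minimal sketch. It assumes the blob names follow the pattern visible in this thread (`old_res<stage>_<index>_sum`, with a zero-based block index) and uses the standard residual block counts per stage for each depth. The helper `last_block_sum_blob` and the table `BLOCK_COUNTS` are hypothetical names for illustration only; they are not part of Detectron or CBNet.

```python
# Residual blocks per stage (res2, res3, res4, res5) for the standard
# ResNet/ResNeXt depths.
BLOCK_COUNTS = {
    50: (3, 4, 6, 3),
    101: (3, 4, 23, 3),
    152: (3, 8, 36, 3),
}


def last_block_sum_blob(depth, stage, prefix="old_"):
    """Return the assumed output blob name of the last block in a stage.

    `stage` is 2..5. The "old_" prefix and the "_sum" suffix are taken from
    the blob names quoted in this thread, not from a documented API.
    """
    n_blocks = BLOCK_COUNTS[depth][stage - 2]
    # Blocks are numbered from zero, so the last one is n_blocks - 1.
    return "{}res{}_{}_sum".format(prefix, stage, n_blocks - 1)


for depth in (101, 152):
    print(depth, last_block_sum_blob(depth, 3), last_block_sum_blob(depth, 4))
```

Under these assumptions the 152-layer names match the ones in the error message and the 101-layer names match the replacement suggested above. The 50-layer entry is included only for reference; the thread does not confirm that the same edit is sufficient for an R-50 config.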

@carryyu
Author

carryyu commented Sep 18, 2019

Thank you very much, it works!

@David-19940718

@PKUbahuangliuhe Thanks for your great work!
Hi, author, can you tell me how to set the lr?
If I have one GPU, what should the lr be?
What if I have two? I would also like to know why this value works better for training.
Looking forward to your reply. Thanks.

@PKUbahuangliuhe
Collaborator

First, we halve the lr relative to the baseline. You also need to scale the lr linearly with the number of GPUs (following the original Detectron). For example, the Cascade R-CNN-X152 baseline uses 8 GPUs with an lr of 0.01, so if you train Dual-X152 with 2 GPUs, the lr should be set to 0.01/2/(8/2) = 0.00125. Note that the number of training iterations also needs to increase when you use fewer GPUs, because the batch size is smaller.
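
The arithmetic above can be sketched as a small helper. This is only an illustration of the rule described in this thread, not code from CBNet or Detectron: the function name `scale_schedule` and its parameters are hypothetical, and scaling the STEPS boundaries by the same factor as MAX_ITER is an assumption (the thread only mentions the iteration count).

```python
def scale_schedule(base_lr, max_iter, steps, baseline_gpus, num_gpus,
                   dual_backbone=True):
    """Scale lr and schedule when training on fewer GPUs than the baseline.

    Hypothetical helper following the rule described in this thread:
    lr shrinks linearly with the GPU count, iterations grow by the same factor.
    """
    factor = float(baseline_gpus) / num_gpus
    lr = base_lr / factor
    if dual_backbone:
        # The thread halves the baseline lr for the dual-backbone model.
        lr /= 2
    scaled_iter = int(round(max_iter * factor))
    # Assumption: the decay steps stretch by the same factor as MAX_ITER.
    scaled_steps = [int(round(s * factor)) for s in steps]
    return lr, scaled_iter, scaled_steps


lr, max_iter, steps = scale_schedule(
    0.01, 180000, [0, 120000, 160000], baseline_gpus=8, num_gpus=2)
print(lr, max_iter, steps)
```

With the values from the config in this issue (BASE_LR 0.01, MAX_ITER 180000, 8 GPUs in the baseline, 2 GPUs available), this reproduces the lr of 0.00125 and the 180000*4 iterations suggested earlier in the thread. For a single GPU the same rule would give 0.01/2/8.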

@lironghua318

How about e2e_cascade_rcnn_R-50-FPN_1x.yaml? My GPU has 12 GB of memory and can't run the 101 model.
