Converted detection model for Caffe2 runs too slow on CPU #427

Open · drcege opened this issue May 11, 2018 · 10 comments

drcege commented May 11, 2018

I trained a detector for electricity meters based on e2e_faster_rcnn_R-50-C4_1x.yaml. The trained model works very well with Detectron on GPU, but we have to deploy it on CPU, so I converted it to Caffe2 format with convert_pkl_to_pb.py.

However, workspace.RunNet takes approximately 100 seconds per image, which is far too slow.

The attachment is my test code. ammeter_det.pdf
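
For context, a minimal timing sketch of the kind of test the attachment contains (not the attached script itself): it assumes the converter wrote model_init.pb / model.pb and that the detection net takes the usual data and im_info input blobs produced by convert_pkl_to_pb.py.

import time
import numpy as np
from caffe2.proto import caffe2_pb2
from caffe2.python import workspace

# Load the converted weight-initialization and prediction nets.
init_net = caffe2_pb2.NetDef()
with open('model_init.pb', 'rb') as f:
    init_net.ParseFromString(f.read())
predict_net = caffe2_pb2.NetDef()
with open('model.pb', 'rb') as f:
    predict_net.ParseFromString(f.read())

workspace.RunNetOnce(init_net)

# Dummy 800x800 input; a real test would feed a preprocessed image and its scale.
im = np.random.rand(1, 3, 800, 800).astype(np.float32)
im_info = np.array([[800.0, 800.0, 1.0]], dtype=np.float32)
workspace.FeedBlob('data', im)
workspace.FeedBlob('im_info', im_info)

workspace.CreateNet(predict_net)
start = time.time()
workspace.RunNet(predict_net.name)
print('RunNet took %.1f s' % (time.time() - start))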

System information

  • Operating system: Ubuntu 16.04
  • Compiler version:
  • CUDA version: 8
  • cuDNN version: 7
  • NVIDIA driver version: 384.111
  • python --version output: 2.7.12
gadcam (Contributor) commented May 28, 2018

Your code looks good.
I don't think we can tell whether the performance is good or not: we don't know what the running time was before, on which GPU it was measured, or how many cores you are running on now. On its own, the number doesn't tell us much.

However, this figure is not shocking to me if you ran it on 4 cores or fewer. I converted a few other models and the order of magnitude was the same. That could make sense, since NN inference can be heavily parallelized, so core count matters a lot.
Could you give us more data so we can tell whether there is something to worry about?

drcege (Author) commented May 29, 2018

@gadcam
Here is my cpuinfo:

$ nproc
56
$ tail -n 27 /proc/cpuinfo
processor       : 55
vendor_id       : GenuineIntel
cpu family      : 6
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
stepping        : 1
microcode       : 0xb00002c
cpu MHz         : 1202.578
cache size      : 35840 KB
physical id     : 1
siblings        : 28
core id         : 14
cpu cores       : 14
apicid          : 61
initial apicid  : 61
fpu             : yes
fpu_exception   : yes
cpuid level     : 20
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single intel_pt kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts
bugs            :
bogomips        : 4001.73
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

I don't know how to specify the number of cores (or threads) used for inference; is it automatic?
I also tested the officially converted model, and yes, the running time is of the same order.
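
(For what it's worth, the only knobs I know of for pinning the CPU thread count are the OMP_NUM_THREADS environment variable and Caffe2's --caffe2_omp_num_threads GlobalInit flag; whether they take effect depends on Caffe2 being built with OpenMP/MKL, so treat this as a guess rather than a confirmed fix:)

import os
# Assumption: this only helps if Caffe2 was built with OpenMP/MKL support;
# otherwise most operators run single-threaded and the setting is a no-op.
os.environ['OMP_NUM_THREADS'] = '16'  # set before Caffe2/MKL is loaded

from caffe2.python import workspace
workspace.GlobalInit(['caffe2', '--caffe2_omp_num_threads=16'])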

But the problem is:
A similar implementation from the TensorFlow object detection repository (faster_rcnn_resnet50_coco) takes only 2 or 3 seconds per image for CPU inference on the same machine.
We'd like to use Detectron because it gives higher precision, especially with FPN (it seems FPN export could also be supported soon). I think the detection models need to be better optimized for CPU inference.

JaosonMa commented

@drcege how did you convert the pkl to Caffe2 pb files? My conversion failed. My python command:
python tools/convert_pkl_to_pb.py
--cfg /e2e_mask_rcnn_X-101-64x4d-FPN_1x.yaml
--net_name detectron_detect_box
--out_dir /caffe2_model
--test_img /src_img/46391_46391_1531292865862_D54A2C49.jpg
--use_nnpack 0
--device 'gpu'

The first time, I got this error:
Traceback (most recent call last):
File "tools/convert_pkl_to_pb.py", line 574, in <module>
main()
File "tools/convert_pkl_to_pb.py", line 528, in main
assert not cfg.FPN.FPN_ON, "FPN not supported."
AssertionError: FPN not supported.
so I changed FPN_ON from True to False:
FPN:
  FPN_ON: False
  MULTILEVEL_ROIS: True
  MULTILEVEL_RPN: True
Then I tried again, and this error came out:

Traceback (most recent call last):
File "tools/convert_pkl_to_pb.py", line 574, in <module>
main()
File "tools/convert_pkl_to_pb.py", line 532, in main
model, blobs = load_model(args)
File "tools/convert_pkl_to_pb.py", line 341, in load_model
model = test_engine.initialize_model_from_cfg(cfg.TEST.WEIGHTS)
File "/home/jansonm/detectron/detectron/core/test_engine.py", line 328, in initialize_model_from_cfg
model = model_builder.create(cfg.MODEL.TYPE, train=False, gpu_id=gpu_id)
File "/home/jansonm/detectron/detectron/modeling/model_builder.py", line 124, in create
return get_func(model_type_func)(model)
File "/home/jansonm/detectron/detectron/modeling/model_builder.py", line 89, in generalized_rcnn
freeze_conv_body=cfg.TRAIN.FREEZE_CONV_BODY
File "/home/jansonm/detectron/detectron/modeling/model_builder.py", line 229, in build_generic_detection_model
optim.build_data_parallel_model(model, _single_gpu_build_func)
File "/home/jansonm/detectron/detectron/modeling/optimizer.py", line 54, in build_data_parallel_model
single_gpu_build_func(model)
File "/home/jansonm/detectron/detectron/modeling/model_builder.py", line 189, in _single_gpu_build_func
model, blob_conv, dim_conv, spatial_scale_conv
File "/home/jansonm/detectron/detectron/modeling/rpn_heads.py", line 49, in add_generic_rpn_outputs
add_single_scale_rpn_outputs(model, blob_in, dim_in, spatial_scale_in)
File "/home/jansonm/detectron/detectron/modeling/rpn_heads.py", line 58, in add_single_scale_rpn_outputs
stride=1. / spatial_scale,
TypeError: unsupported operand type(s) for /: 'float' and 'list'


@gadcam @daquexian
Please help me.

drcege (Author) commented Aug 23, 2018

@JaosonMa I converted R50-C4. I don't know whether FPN is supported now.

gadcam (Contributor) commented Aug 23, 2018

@drcege @JaosonMa #449 is needed to export any model.
But beware that you also need commit 2a21855, as exporting everything into one model is currently a WIP: the code runs, but I do not get the same keypoints in the end and I do not know why :/

obendidi commented Sep 7, 2018

Hello, about the performance: I also exported an e2e_faster_rcnn_R-101-FPN_2x to .pb format to run on CPU (specifically on AWS Lambda).

I compiled Caffe2 without NNPACK (it is not supported on Lambda).

On my own CPU (4 cores), inference time for an 800x800 image is between 10 and 13 seconds; on AWS Lambda it's a bit slower, around 17-20 seconds.

Another thing that may be slowing it down is that I tried to optimize the model to minimize memory consumption (Lambda is limited to 3 GB) using:

from caffe2.python import memonger

# Rewrite the inference net so intermediate blobs are reused, cutting peak memory.
optimization = memonger.optimize_inference_fast(
        net,
        [b for b in net.external_input] +
        [b for b in net.external_output])
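
(As a rough usage note, and an assumption on my side: I believe optimize_inference_fast returns a rewritten NetDef rather than modifying net in place, so the optimized proto has to replace the original before the net is created, something like:)

from caffe2.python import workspace

# Sketch only: create and run the memonger-rewritten net instead of the original predict net.
workspace.CreateNet(optimization, overwrite=True)
workspace.RunNet(optimization.name)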

I'm also looking for an easy way to share_conv_buffers in caffe2 if you have any suggestions :)

When I tested locally with NNPACK enabled, inference time was around 5-7 seconds.
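
(If it helps anyone reproduce the NNPACK numbers, this is roughly how one could force the NNPACK engine on the convolutions of an already converted predict net; it is a sketch that assumes Caffe2 was built with NNPACK, and it mirrors what the --use_nnpack option of convert_pkl_to_pb.py does at conversion time:)

# Force the NNPACK engine on convolution ops in the loaded predict net (NetDef).
for op in predict_net.op:
    if op.type == 'Conv':
        op.engine = 'NNPACK'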

drcege (Author) commented Sep 7, 2018

@Bendidi So FPN export is now officially supported? Could you please also test the speed performance without FPN?

gadcam (Contributor) commented Sep 7, 2018

@Bendidi

another thing that my be slowing it down is that I tried to optimize the model to minimize memory consumption (lambda is limited to 3Gb ) using :
[...] memonger.optimize_inference_fast [...]

Did you measure any improvement? If so, could you share your tests please? :)

obendidi commented Sep 9, 2018

Improvement in terms of memory?
I didn't do any extensive testing, I was just trying to make it work first ^^'

But in terms of memory, I saw a reduction of about 0.9 GB (4.2 GB -> 3.3 GB) using memonger.optimize_inference_fast.

Walid-Ahmed commented

@Bendidi

I am stuck at converting the detection model for Caffe2. I am using the cpu flag for conversion, since as far as I know Caffe2 inference cannot be done on GPU here:
python convert_pkl_to_pb.py --cfg e2e_faster_rcnn_R-101-FPN_1x.yaml --out_dir . --device cpu

The conversion starts and then gets stuck at the first layer after loading the weights.
Can you please help?

Thanks
Walid
