vgg16 model batch size 64 OOM on old docker image #74

Open · dzhwinter opened this issue Jan 24, 2018 · 3 comments

dzhwinter commented Jan 24, 2018

export CUDA_VISIBLE_DEVICES=7; FLAGS_benchmark=true GLOG_vmodule=executor=2,memory=10 GLOG_v=10 GLOG_logtostderr=1 python vgg16.py --device=GPU --batch_size=64 --data_set=flowers
export CUDA_VISIBLE_DEVICES=7; FLAGS_benchmark=true GLOG_vmodule=executor=2,memory=10 GLOG_v=10 GLOG_logtostderr=1 python resnet50.py --device=GPU --batch_size=64 --data_set=flowers > log/resnet64.log 2>&1
dzhwinter commented

Some conclusions to note here:

  1. The Paddle memory-optimize module really does reuse a lot of memory, in both vgg16 and resnet50 (see the sketch after this list).
  2. batch_norm_grad looks like a bottleneck and deserves early attention, but improving it alone will not make a big difference to the framework.
  3. The conv2d workspace cache costs a lot of memory and should be used carefully.
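
A minimal sketch of the reuse idea behind point 1. This is not Fluid's actual memory-optimize pass (which works on the program desc); the op/variable structures below are made up for illustration. The idea: once the last consumer of a variable has run, its buffer goes back into a pool and later outputs of the same size can take it over.

```python
# Toy lifetime-based buffer reuse (illustration only, not Fluid's real pass).

def plan_buffers(ops, var_size):
    """ops: list of (inputs, outputs) tuples of variable names, in execution
    order. var_size: bytes required by each variable.
    Returns var -> buffer id, reusing buffers of dead variables."""
    last_use = {}
    for t, (inputs, outputs) in enumerate(ops):
        for var in inputs + outputs:
            last_use[var] = t                    # final op that touches var

    free_pool = {}                               # size -> recyclable buffer ids
    assignment, next_id = {}, 0
    for t, (inputs, outputs) in enumerate(ops):
        for var in outputs:
            pool = free_pool.get(var_size[var])
            if pool:
                assignment[var] = pool.pop()     # reuse a dead tensor's memory
            else:
                assignment[var] = next_id        # fresh allocation
                next_id += 1
        for var in set(inputs + outputs):
            # no later op touches var: its buffer can be recycled
            if var in assignment and last_use[var] == t:
                free_pool.setdefault(var_size[var], []).append(assignment[var])
    return assignment

# "c" lands in the buffer that held "a", so two buffers serve three tensors.
ops = [(["x"], ["a"]), (["a"], ["b"]), (["b"], ["c"])]
print(plan_buffers(ops, {v: 4 for v in "xabc"}))  # {'a': 0, 'b': 1, 'c': 0}
```
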

dzhwinter commented Jan 24, 2018

A lesson learned from the vgg16 model. I vaguely remember that the previous V2 implementation could only support batch size 128 on four 12 GB cards. So when it comes to Fluid, it seems plausible that batch size 32 pushes the system to its memory peak.

However, a crucial fact is that MXNet costs only 7 GB of GPU memory even with 200 layers:
https://arxiv.org/pdf/1604.06174.pdf
He only uses the dependency engine; should we go that way too?
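
For context, the core trick in that paper (arXiv:1604.06174) is recomputation: keep only about sqrt(n) checkpointed activations and recompute the rest during backward. A back-of-envelope sketch of the saving; the 100 MB-per-layer figure is a toy value I picked for illustration, not a measurement:

```python
import math

def activation_memory(n_layers, act_bytes_per_layer):
    """Compare storing every activation against sqrt(n) checkpointing
    (the scheme from arXiv:1604.06174). Uniform layer sizes assumed."""
    store_all = n_layers * act_bytes_per_layer
    seg = math.ceil(math.sqrt(n_layers))          # checkpoint segment length
    # one checkpoint per segment, plus one segment recomputed at a time
    checkpointed = (n_layers // seg + seg) * act_bytes_per_layer
    return store_all, checkpointed

full, sub = activation_memory(200, 100 * 2**20)   # 200 layers, 100 MiB each
print(full / 2**30, sub / 2**30)                  # ~19.5 GiB vs ~2.7 GiB
```
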

To be more concrete, let's do some math.
Say we have vgg16 with batch size = 128; this goal is not unreachable.
We can mark an image batch as N C H W:

128 * 3 * 224 * 224

Before doing the convolution, we do im2col. Its column buffer shape equals (assume same padding, kernel = 3):

(224 * 224) * (3 * 3 * 3)

This im2col buffer is used for only one image at a time, so the memory cost will be

128 * 3 * 224 * 224 * 4 (float) + 224 * 224 * (3 * 3 * 3) * 4 + 64 (Cout) * 3 * 3 * 3 * 4 + ...

We can roughly calculate the result; it comes nowhere near a horrific 1.5 GB.
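
Plugging the numbers above in (4-byte floats, sizes in MiB):

```python
# Rough memory estimate for vgg16's first conv layer, batch size 128.
MiB = 2**20
N, C, H, W = 128, 3, 224, 224
K, C_out = 3, 64

inputs  = N * C * H * W * 4           # input batch
im2col  = H * W * (C * K * K) * 4     # column buffer for a single image
weights = C_out * C * K * K * 4       # conv1 filters

print(inputs / MiB)                   # ~73.5 MiB
print(im2col / MiB)                   # ~5.2 MiB
print(weights / MiB)                  # ~0.007 MiB
print((inputs + im2col + weights) / MiB)  # ~78.7 MiB, far below 1.5 GB
```
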

dzhwinter commented

Did a bisect rollback to the 1.11 image / CI build, but it makes no sense.
