This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Is it normal that MXNet consumes much more system memory than Caffe during training in GPU mode? #2111

Closed
diPDew opened this issue May 11, 2016 · 11 comments

Comments

@diPDew

diPDew commented May 11, 2016

I just switched from Caffe to MXNet. When training a GoogleNet model using the provided python scripts, I observe that MXNet always consumes much more system memory than Caffe in GPU mode. For instance, MXNet can easily eat 10GB RAM during training, while Caffe only takes less than 1GB.

I'm not sure if I didn't compile the MXNet code correctly. But before compilation, the only change I made in the config.mk is to enable CUDA.

Anyone could comment on that? Is there anything I need to set properly for MXNet in order to reduce the memory usage?

@antinucleon
Contributor

Memory cost is related to batch size. It is not normal; at the same batch size, MXNet should use much less memory.

@diPDew
Author

diPDew commented May 11, 2016

Thanks for the reply @antinucleon. Forgot to mention that I set the batch_size as 20 for both MXNet and Caffe.

@tqchen
Member

tqchen commented May 11, 2016

I think @EasonD3 means the CPU memory consumption. This could be due to the memory needed by the recordIO pipeline, given the current size of its caching queues.

We tuned the queue sizes for faster prefetching and decoding speed; that setting may be a bit large and eat up extra RAM.
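For readers following along: the trade-off described above can be sketched with a generic bounded-queue prefetcher. This is plain Python threading, not MXNet's actual C++ pipeline; `load_batch` is a hypothetical decode function standing in for the recordIO reader.

```python
import threading
import queue

def prefetch(load_batch, num_batches, max_capacity):
    """Decode batches on a background thread into a bounded queue.

    A larger max_capacity lets the decoder run further ahead of the
    trainer (better throughput), but every queued slot holds a fully
    decoded batch in RAM, so peak memory grows with max_capacity.
    """
    q = queue.Queue(maxsize=max_capacity)  # bounds RAM: at most
                                           # max_capacity batches cached

    def worker():
        for i in range(num_batches):
            q.put(load_batch(i))  # blocks while the queue is full
        q.put(None)               # sentinel: no more batches

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = q.get()
        if batch is None:
            break
        yield batch
```

Shrinking `max_capacity` caps how many decoded batches sit in memory at once, at the cost of the trainer occasionally waiting on the decoder.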

@antinucleon
Contributor

antinucleon commented May 11, 2016

@tqchen Yes. I just checked: the GPU feature memory for Inception-BN is 861 MB when the batch size is 20.
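As a side note, the 861 MB figure implies roughly 43 MB of feature (activation) memory per image. A simple linear model extrapolates to other batch sizes; this ignores the fixed weight and workspace memory a real run would also need.

```python
# Back-of-envelope: activation ("feature") memory scales linearly with
# batch size, anchored on the 861 MB at batch size 20 reported above.
feature_mb_at_20 = 861
per_image_mb = feature_mb_at_20 / 20   # ~43 MB of activations per image

def feature_memory_mb(batch_size):
    """Estimated activation memory in MB (weights/workspace excluded)."""
    return per_image_mb * batch_size

print(round(feature_memory_mb(32), 1))  # roughly 1.4 GB at batch size 32
```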

@diPDew
Author

diPDew commented May 11, 2016

@antinucleon Thanks for the number. GPU-wise, the memory consumption is roughly the same on my side. But my issue is with the system memory; I should have made that clearer in my post.

@tqchen Thanks. If I'd like to tune the RAM usage, can you advise how to do that?

@diPDew
Author

diPDew commented May 11, 2016

@tqchen Speaking of the queue size you pointed out, I also observe that the latest MXNet code consumes about 50% more RAM than a version from 2~3 months ago.

@antinucleon
Contributor

Thanks for pointing it out. There was recently a refactor of the IO code. I will check it after I finish work today.

@tqchen
Member

tqchen commented May 11, 2016

Some quick things to try

@diPDew
Author

diPDew commented May 11, 2016

@tqchen Thanks for the hints. I just tested by training 15000 color images of size 224x224 with the Inception-BN model, setting preprocess_threads=1 and prefetch_threads=1 in ImageRecordIter. I then varied iter_.set_max_capacity(...) in iter_image_recordio.cc, where the default is iter_.set_max_capacity(4); and consumes >15GB of RAM:

i) iter_.set_max_capacity(1); drops RAM consumption to ~7.5GB
ii) iter_.set_max_capacity(2); drops RAM consumption to ~10.5GB

So the RAM consumption is still quite high.
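A rough linear fit of the three measurements above (illustrative arithmetic only, not a claim about where the memory actually goes) suggests each queue slot costs about 3GB on top of a ~4.5GB baseline, which is consistent with the >15GB seen at the default capacity of 4:

```python
# RAM observed at two queue capacities, from the experiment above:
# capacity 1 -> ~7.5 GB, capacity 2 -> ~10.5 GB, capacity 4 -> >15 GB.
ram_gb = {1: 7.5, 2: 10.5}

per_slot_gb = ram_gb[2] - ram_gb[1]      # ~3 GB per extra queue slot
baseline_gb = ram_gb[1] - per_slot_gb    # ~4.5 GB independent of the queue

def predicted_ram_gb(capacity):
    """Linear RAM estimate from the two measured points."""
    return baseline_gb + per_slot_gb * capacity

print(predicted_ram_gb(4))  # 16.5, consistent with the >15 GB observed
```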

@diPDew diPDew changed the title Is it normal that MXNet consumes much more system memory than Caffe in GPU mode? Is it normal that MXNet consumes much more system memory than Caffe during training in GPU mode? May 11, 2016
@KeyKy
Contributor

KeyKy commented Aug 12, 2017

I get the same problem when training SSD or ImageNet using a .rec file!

@szha
Member

szha commented Nov 12, 2017

This issue is closed due to lack of activity in the last 90 days. Feel free to ping me to reopen if this is still an active issue. Thanks!
Also, please check out our forum (and its Chinese version) for general "how-to" questions.

@szha szha closed this as completed Nov 12, 2017