Is it normal that MXNet consumes much more system memory than Caffe during training in GPU mode? #2111

Open · EasonD3 opened this issue May 11, 2016 · 9 comments

@EasonD3
EasonD3 commented May 11, 2016

I just switched from Caffe to MXNet. When training a GoogleNet model using the provided Python scripts, I observe that MXNet consistently consumes much more system memory than Caffe in GPU mode. For instance, MXNet can easily eat 10 GB of RAM during training, while Caffe takes less than 1 GB.

I'm not sure whether I compiled MXNet correctly, but the only change I made to config.mk before compiling was to enable CUDA.

Could anyone comment on this? Is there anything I need to configure in MXNet to reduce the memory usage?
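
For reference, this is roughly how I read the host-side numbers — a minimal sketch assuming psutil is installed, with the training loop itself elided:

```python
# Minimal sketch of measuring the host-side memory of the training process;
# psutil is assumed to be installed (pip install psutil).
import os
import psutil

proc = psutil.Process(os.getpid())

def log_rss(tag):
    # Resident set size (physical RAM) of this process, in GB.
    rss_gb = proc.memory_info().rss / 1024.0 ** 3
    print("[%s] host RSS: %.2f GB" % (tag, rss_gb))

log_rss("before training")
# ... run training here ...
log_rss("after training")
```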

@antinucleon
Member

Memory cost is related to batch size. This is not normal; at the same batch size, MXNet should use much less memory.
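
As a rough illustration of why batch size dominates (the layer shape below is hypothetical, not an Inception-specific number):

```python
# Back-of-envelope: the activation memory of a single conv output scales
# linearly with batch size. Hypothetical layer: 64 channels, 112x112
# spatial, float32 (4 bytes per element).
def feature_map_bytes(batch, channels=64, height=112, width=112, dtype_bytes=4):
    return batch * channels * height * width * dtype_bytes

for batch in (20, 40):
    print("batch %d -> %.1f MB" % (batch, feature_map_bytes(batch) / 1024.0 ** 2))
# Doubling the batch doubles the activation cost of every layer.
```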

@EasonD3
EasonD3 commented May 11, 2016

Thanks for the reply @antinucleon. I forgot to mention that I set batch_size to 20 for both MXNet and Caffe.

@tqchen
Member
tqchen commented May 11, 2016

I think @EasonD3 means the CPU memory consumption. This could be due to the memory needed by the RecordIO pipeline, given the current setting of the caching queues.

We tuned the queue size for faster prefetching and decoding; the setting may be a bit large, which would eat up more RAM.
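
For reference, a minimal sketch of the user-facing knobs that shrink those queues — parameter availability depends on the MXNet version, and the .rec path, data shape, and batch size below are placeholders:

```python
import mxnet as mx

# Minimal sketch: shrinking the prefetch/decode queues to trade speed for RAM.
train_iter = mx.io.ImageRecordIter(
    path_imgrec="train.rec",    # placeholder path to the RecordIO file
    data_shape=(3, 224, 224),   # channels, height, width
    batch_size=20,
    preprocess_threads=1,       # fewer decode threads -> fewer in-flight images
    prefetch_buffer=1,          # smaller prefetch queue -> fewer cached batches
)
```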

@antinucleon
Member
antinucleon commented May 11, 2016

@tqchen Yes. I just checked: the GPU feature-map memory for Inception-BN is 861 MB at batch size 20.

@EasonD3
EasonD3 commented May 11, 2016

@antinucleon Thanks for the number. GPU-wise, the memory consumption roughly matches what I see on my side. But my issue is with the system memory; I should have made that clearer in my post.

@tqchen Thanks. Could you advise how to tune the RAM usage?

@EasonD3
EasonD3 commented May 11, 2016

@tqchen Speaking of the queue size you pointed out: I also observe that the latest MXNet code consumes about 50% more RAM than a version from 2–3 months ago.

@antinucleon
Member

Thanks for pointing that out. There was recently a refactor of the IO code; I will check it after I finish my work today.

@tqchen
Member
tqchen commented May 11, 2016

Some quick things to try:

- Reduce preprocess_threads and prefetch_threads in ImageRecordIter.
- Lower the prefetch queue capacity, i.e. the default iter_.set_max_capacity(4); in iter_image_recordio.cc, and recompile.

@EasonD3
EasonD3 commented May 11, 2016

@tqchen Thanks for the hints. I just tested training on 15,000 color images of size 224x224 with the Inception-BN model, with preprocess_threads=1 and prefetch_threads=1 set in ImageRecordIter. Changing the line in iter_image_recordio.cc from the default iter_.set_max_capacity(4); to i) iter_.set_max_capacity(1); or ii) iter_.set_max_capacity(2); reduced the RAM consumption from >15GB down to i) ~7.5GB and ii) ~10.5GB, respectively. So the RAM consumption is still quite high.
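
Reading those numbers as roughly linear in the queue capacity (a back-of-envelope fit, treating ">15GB" as about 15 GB, not a measurement):

```python
# Reported: capacity 1 -> ~7.5 GB, capacity 2 -> ~10.5 GB, capacity 4 -> ~15 GB.
per_slot = (15.0 - 7.5) / (4 - 1)   # ~2.5 GB of RAM per queue slot
baseline = 7.5 - 1 * per_slot       # ~5 GB that queue capacity cannot explain
print("per slot: %.1f GB, baseline: %.1f GB" % (per_slot, baseline))
# Sanity check: predicted capacity-2 usage is 5 + 2 * 2.5 = 10 GB, close to
# the observed ~10.5 GB. The ~5 GB baseline suggests the prefetch queue is
# not the only large host-side consumer.
```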

EasonD3 changed the title from "Is it normal that MXNet consumes much more system memory than Caffe in GPU mode?" to "Is it normal that MXNet consumes much more system memory than Caffe during training in GPU mode?" on May 11, 2016.