Any changes possible to reduce GPU memory usage? #10

Open
sjscotti opened this issue Jan 26, 2022 · 5 comments

@sjscotti

Hi
I think that apollo is a great contribution, and I have used it with great success for "small" (about half a billion) parameter models. However, in trying it with a 3 billion parameter model, I have hit GPU memory limits that crash my training run. I've been looking at the code to see if there was something I could do, like deleting variables after they are used in case Python is not freeing them efficiently, but I don't see a path to major reductions in memory usage. Do you have any suggestions on modifications to try? Maybe a version that stores optimizer information on the CPU rather than the GPU, brings it onto the GPU only when needed for calculations, and then releases the GPU memory when done with it?
Thanks in advance!

@XuezheMax
Owner

Hi,

For a "large: model with around 3 billion parameters, I guess the optimizer is probably not the bottleneck of memory comparing with gradient calculation in back-propagation. Can I ask how large is your batch size and have you tried to use gradient accumulation?

@sjscotti
Author

sjscotti commented Jan 26, 2022

Hi!
I am using a batch size of 1 with gradient accumulation of 32. I am testing the idea I suggested above of defining the optimizer states (for apollo, these are state['exp_avg_grad'], state['approx_hessian'], and state['update']) on the CPU instead of the GPU and transferring them to the GPU only when needed. For this test I am using an optimizer (adamp) that will not crash with this large model (note: adamp only stores 2 states compared with 3 for apollo). The test is ongoing, but it appears to have saved almost 20 GB of "shared GPU memory" (I am on a Windows machine whose drivers allow CPU memory to be used as "shared GPU memory" when you exceed the GPU hardware memory), and for some reason it is running much faster too (e.g., the time for 16 steps dropped to 40% of the case where the GPU did all the memory management for shared GPU memory). Only a few places in the code needed to change. If the test completes successfully, I will make similar mods to apollo and see how well it works. I'll keep you informed of how it goes.

@sjscotti
Author

Hello again
I have good news. The code seems to be working well, and I am running the large training problem with only about 8 GB of "shared GPU memory", whereas before I couldn't run within the limit of my available 32 GB of "shared GPU memory". I can't comment on whether explicitly managing the location of the optimizer states provides efficiency gains, since I couldn't run at all previously. (By the way, I also found that some of the efficiency gains mentioned above were related to fewer checkpoint saves to disk, though there was still an improvement after accounting for this when doing my own memory management.) FYI, below are the code changes I made...

  1. Replace lines 76 through 81 with ...
                    # Exponential moving average of gradient values
                    state['exp_avg_grad'] = torch.zeros_like(p.data, memory_format=torch.preserve_format, device='cpu')
                    # Exponential moving average of squared gradient values
                    state['approx_hessian'] = torch.zeros_like(p.data, memory_format=torch.preserve_format, device='cpu')
                    # Previous update direction
                    state['update'] = torch.zeros_like(p.data, memory_format=torch.preserve_format, device='cpu')
  2. Replace lines 98 through 100 with ...
                exp_avg_grad = state['exp_avg_grad'].to('cuda')
                B = state['approx_hessian'].to('cuda')
                d_p = state['update'].to('cuda')
  3. Add this after line 143 ...
                state['exp_avg_grad'] = exp_avg_grad.to('cpu')
                state['approx_hessian'] = B.to('cpu')
                state['update'] = d_p.to('cpu')
                del exp_avg_grad, B, d_p

I suspect that the last line (del exp_avg_grad, B, d_p) isn't really needed, but I included it for good measure.
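
To illustrate the pattern outside of apollo, here is a minimal self-contained sketch using a toy SGD-with-momentum optimizer (all names here are hypothetical, not apollo's actual code): keep each state tensor on the CPU between steps, copy it to the parameter's device inside step(), and copy it back afterwards.

```python
import torch
from torch.optim import Optimizer

class CpuStateSGD(Optimizer):
    """Toy SGD with momentum that keeps its optimizer state on the CPU between steps."""

    def __init__(self, params, lr=0.01, momentum=0.9):
        super().__init__(params, dict(lr=lr, momentum=momentum))

    @torch.no_grad()
    def step(self, closure=None):
        loss = closure() if closure is not None else None
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                state = self.state[p]
                if len(state) == 0:
                    # State is allocated on the CPU, so it uses no GPU memory between steps.
                    state['momentum_buffer'] = torch.zeros_like(p, device='cpu')
                # Bring the state onto the parameter's device only for this update...
                buf = state['momentum_buffer'].to(p.device)
                buf.mul_(group['momentum']).add_(p.grad)
                p.add_(buf, alpha=-group['lr'])
                # ...then push it back to the CPU so the GPU copy can be freed.
                state['momentum_buffer'] = buf.to('cpu')
        return loss
```

The .to(device)/.to('cpu') pair adds host-device copies every step, so this trades step time for GPU memory; pinning the CPU tensors and using non_blocking=True transfers may hide some of that cost.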

@XuezheMax
Owner

Thanks for the updates! If I understand correctly, storing the parameters together with the optimizer states is indeed the memory bottleneck. Since apollo keeps one more state than adam (3 vs. 2), you could not train the large model with apollo. What you did is transfer some optimizer states to the CPU to save memory, and you found it runs even faster! I guess one possible reason is that the GPU may slow down when its memory is close to running out.

@XuezheMax
Owner

Please let me know if you find apollo obtains better results on the large model. Thanks!
