Any changes possible to reduce GPU memory usage? #10

Open
sjscotti opened this issue Jan 26, 2022 · 5 comments

@sjscotti

Hi
I think that apollo is a great contribution, and I have used it with great success for "small" (about half a billion) parameter models. However, in trying it with a 3 billion parameter model, I have hit GPU memory limits that crash my training run. I've been looking at the code to see if there was something I could do, like deleting variables after they are used in case Python is not freeing them efficiently, but I don't see a path to major reductions in memory usage. Do you have any suggestions on modifications to try? Maybe a version that stores optimizer information on the CPU rather than the GPU, brings it onto the GPU only when needed for calculations, and then releases the GPU memory when done with it?
Thanks in advance!

@XuezheMax
Owner

Hi,

For a "large: model with around 3 billion parameters, I guess the optimizer is probably not the bottleneck of memory comparing with gradient calculation in back-propagation. Can I ask how large is your batch size and have you tried to use gradient accumulation?

@sjscotti
Author

sjscotti commented Jan 26, 2022

Hi!
I am using a batch size of 1 with gradient accumulation of 32. I am testing the idea I suggested above of defining the optimizer states (for apollo, these are state['exp_avg_grad'], state['approx_hessian'], and state['update']) on the CPU instead of the GPU and transferring them to the GPU only when needed. For this test I am using an optimizer (adamp) that will not crash with this large model (note: adamp only stores 2 states compared with 3 for apollo). The test is ongoing, but it appears to have saved almost 20 GB of "shared GPU memory" (I am on a Windows machine whose drivers allow CPU memory to be used as "shared GPU memory" when you exceed the GPU hardware memory), and for some reason it is running much faster too (e.g., the time for 16 steps dropped to 40% of the case where the GPU did all the memory management for shared GPU memory). Only a few places in the code needed to change. If the test completes successfully, I will make similar mods to apollo and see how well it works. I'll keep you informed of how it goes.

@sjscotti
Author

Hello again
I have good news. The code seems to be working well, and I am running the large training problem with only about 8 GB of "shared GPU memory", whereas before I couldn't run within the limit of my available 32 GB of "shared GPU memory". I can't comment on whether explicitly managing the location of the optimizer states provides efficiency gains, since I couldn't run at all previously. (By the way, I also found that some of the efficiency gains mentioned above were related to fewer checkpoint saves to disk, though there was still an improvement after accounting for this when doing my own memory management.) FYI, below are the code changes I made...

  1. Replace lines 76 through 81 with ...
                    # Exponential moving average of gradient values
                    state['exp_avg_grad'] = torch.zeros_like(p.data, memory_format=torch.preserve_format, device='cpu')
                    # Exponential moving average of squared gradient values
                    state['approx_hessian'] = torch.zeros_like(p.data, memory_format=torch.preserve_format, device='cpu')
                    # Previous update direction
                    state['update'] = torch.zeros_like(p.data, memory_format=torch.preserve_format, device='cpu')
  2. Replace lines 98 through 100 with ...
                exp_avg_grad = state['exp_avg_grad'].to('cuda')
                B = state['approx_hessian'].to('cuda')
                d_p = state['update'].to('cuda')
  3. Add this after line 143 ...
                state['exp_avg_grad'] = exp_avg_grad.to('cpu')
                state['approx_hessian'] = B.to('cpu')
                state['update'] = d_p.to('cpu')
                del exp_avg_grad, B, d_p

I suspect that the last line (del exp_avg_grad, B, d_p) isn't really needed, but I included it for good measure.
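
To illustrate the pattern outside of apollo, here is a minimal self-contained sketch using a toy SGD-with-momentum optimizer (all names here are hypothetical, not apollo's actual code): keep each state tensor on the CPU between steps, copy it to the parameter's device inside step(), and copy it back afterwards.

```python
import torch
from torch.optim import Optimizer

class CpuStateSGD(Optimizer):
    """Toy SGD with momentum that keeps its optimizer state on the CPU between steps."""

    def __init__(self, params, lr=0.01, momentum=0.9):
        super().__init__(params, dict(lr=lr, momentum=momentum))

    @torch.no_grad()
    def step(self, closure=None):
        loss = closure() if closure is not None else None
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                state = self.state[p]
                if len(state) == 0:
                    # State is allocated on the CPU, so it uses no GPU memory between steps.
                    state['momentum_buffer'] = torch.zeros_like(p, device='cpu')
                # Bring the state onto the parameter's device only for this update...
                buf = state['momentum_buffer'].to(p.device)
                buf.mul_(group['momentum']).add_(p.grad)
                p.add_(buf, alpha=-group['lr'])
                # ...then push it back to the CPU so the GPU copy can be freed.
                state['momentum_buffer'] = buf.to('cpu')
        return loss
```

The .to(device)/.to('cpu') pair adds host-device copies every step, so this trades step time for GPU memory; pinning the CPU tensors and using non_blocking=True transfers may hide some of that cost.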

@XuezheMax
Owner

Thanks for the updates! If I understand correctly, storing the parameters together with the optimizer states is indeed the memory bottleneck. Since apollo keeps one more state than adam (3 vs. 2), you could not train the large model with apollo. What you did is transfer some optimizer states to the CPU to save memory, and you found it runs even faster! I guess one possible reason is that the GPU may slow down when its memory is close to running out.

@XuezheMax
Owner

Please let me know if you find apollo obtains better results on the large model. Thanks!
