paged optimizers doc? #962
Comments
Hi @stas00! Great to e-see you!
That's exciting news! Thank you for working on this, @younesbelkada and the team! But perhaps we could start discussing the nuances of the paged AdamW here first? E.g., we want to know the cost/overhead of using paged AdamW.
I'm waiting for it too! @younesbelkada, do you also plan to compare the paged optimizer with the 8-bit optimizer and the normal 32-bit optimizer?
Hi Stas! Thanks for your interest in this.

Paged optimizers are built on top of the unified memory feature of CUDA. This feature is not supported by PyTorch, so we added it to bitsandbytes. It works like regular CPU paging, which means that it only becomes active if you run out of GPU memory. Only then is memory transferred, page-by-page, from GPU to CPU. The memory is mapped, meaning that pages are preallocated on the CPU, but they are not updated automatically. They are only updated if the memory is accessed or a swapping operation is launched.

The unified memory feature is less efficient than regular asynchronous memory transfers, so you usually will not be able to get full PCIe memory bandwidth utilization. If you do a manual prefetch, transfer speeds can be high, but still about half or worse than the full PCIe memory bandwidth (tested on 16x lanes PCIe 3.0).

This all means performance depends highly on the particular use-case. If you evict, say, 1 GB of memory per forward-backward-optimizer loop, you can expect about 50% of the PCIe bandwidth as time in the best case. So 1 GB over PCIe 3.0 with 16x lanes, which runs at 16 GB/s, costs 1/(16*0.5) = 1/8 s = 125 ms of overhead per optimizer step. Other overhead can be estimated for the particular use-case given the PCIe interface, lanes, and the memory that is evicted in each iteration.

Compared to CPU offloading, this has the advantage that you have zero overhead if all the memory fits into the device, and only some overhead if some of the memory needs to be evicted. With offloading, you usually offload fixed parts of the model and you need to offload and onload all of this memory with each iteration through the model (sometimes twice, for both the forward and the backward pass).

This is the basic functionality so far. Let me know if you have any other questions.
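To make the arithmetic above reusable: a back-of-envelope helper (my own sketch, not bitsandbytes code) that just restates Tim's estimate. The defaults are the assumptions from his comment (PCIe 3.0 x16 at roughly 16 GB/s, about 50% effective utilization with unified memory).

```python
def paged_eviction_overhead_ms(evicted_gb: float,
                               pcie_gbps: float = 16.0,
                               utilization: float = 0.5) -> float:
    """Rough per-step overhead from paging, in milliseconds.

    evicted_gb:  GB of memory evicted per forward-backward-optimizer loop.
    pcie_gbps:   theoretical one-way PCIe bandwidth in GB/s
                 (PCIe 3.0 x16 is about 16 GB/s).
    utilization: fraction of that bandwidth unified memory actually reaches
                 (~0.5 in the measurements described above).
    """
    return evicted_gb / (pcie_gbps * utilization) * 1000.0


# Tim's example: 1 GB evicted per step -> ~125 ms of overhead.
print(paged_eviction_overhead_ms(1.0))  # 125.0
```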
Hi Tim, your explanation is very helpful. I especially appreciate the numerical walkthroughs! I discovered this via @pacman100's https://huggingface.co/blog/ram-efficient-pytorch-fsdp

Given the math, there is no way a 70B-param model can fit onto 640 GB of GPU memory for normal training/finetuning, if one has 4+2 bytes per param for weights, 2 (or 4) bytes for grads, and 8 bytes for optim states.

It looks like the eviction algo is automatic, in which case how does it know how much memory should be paged out? The missing part for me is how it can predict how much peak memory will be used by activations and thus avoid OOM. And if I follow your math...
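For reference, the back-of-envelope memory math I'm referring to (my own sketch; the per-parameter byte counts assume mixed-precision training with full-precision Adam states):

```python
params = 70e9  # 70B parameters

# bytes per parameter, assuming mixed-precision AdamW:
weights     = 4 + 2   # fp32 master weights + bf16 working copy
grads       = 2       # bf16 grads (4 if kept in fp32)
optim_state = 8       # two fp32 Adam moments

total_gib = params * (weights + grads + optim_state) / 1024**3
print(f"{total_gib:.0f} GiB")  # ~1043 GiB, well above 640 GB of GPU memory
```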
The eviction is done page-by-page, which is usually a couple of kilobytes. With prefetching, the page size increases to 2 MB, which is much more efficient (but not as efficient as GB-sized chunks).

How it works under the hood: when memory is allocated, a caching algorithm determines which pieces of memory to evict based on the caching policy (this is configurable). When the memory is accessed by a kernel and it is detected that the memory is only located on the CPU, GPU memory will be allocated (which may again evict some other memory) and the transfer of the page will be issued. In the meantime, the other thread blocks will take other instructions on data that is already on the GPU and continue computation in parallel.

The slowness in that case makes sense. Paged optimizers work without the need to program any paging, but they are not very efficient for large amounts of transferred data. In that case offloading would be faster (but very difficult to program).

Note that 8-bit optimizers only use 2 bytes per param, and 8-bit paged optimizers exist as well. There was a recent paper on how to do 4-bit optimizers. Using those insights, it would be easy to design an 8-bit optimizer that performs just as well as 32-bit optimizers. This would be pretty useful in conjunction with paged optimizers. I do not have time to implement this, but if it is interesting to you I can give you pointers to papers and the bitsandbytes code for how to implement it.
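For anyone landing here: switching to a paged (and optionally 8-bit) optimizer is essentially a one-line change. A minimal sketch, assuming a recent bitsandbytes release where `bnb.optim.PagedAdamW8bit` is available:

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4096, 4096).cuda()

# Paged 8-bit AdamW: optimizer states live in unified memory and are
# evicted to CPU RAM page-by-page only when the GPU runs out of memory.
optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=1e-4)

x = torch.randn(8, 4096, device="cuda")
loss = model(x).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```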
Oh, I see, so it can evict any GPU memory, not just optimizer pages. In other words, this is not really a paged optimizer, but a paged GPU-memory usage mode, where any allocation of any memory (weights, grads, optim states, activations) may page out some of the previously allocated GPU memory to CPU and will page it back in when accessed? Or did I get it wrong?

CPU-offload is typically per component, i.e. one configures it to offload either weights or optim states, but I have never seen a global any-component CPU-offload before.

Sourab's setup was using
what does "regular asynchronous memory transfers" constitutes here? are these within device (gpu) only? |
Would taking b&b function get_paged work out of box for using as model weights and activations (assuming one makes custom layer, which produces paged result)? Or would one also need to free the used memory? |
I guess CUDA paging somehow works out of the box in WSL2+CUDA. I was puzzled for a few hours about why I can keep increasing batch size with VRAM usage hovering around 100%, iterations becoming slower, and no CUDA OOM - until I noticed increasing CPU RAM usage. Running exactly the same trainer with the same "too large" batch size in bare Linux produced CUDA OOM as expected.
Closing due to Tim's comment and doc release. Feel free to keep discussing, I just don't want this issue to stay in open status, since it's been addressed. Thanks for the valuable input @stas00!
Great, I found this: https://github.com/TimDettmers/bitsandbytes/pull/1012/files#diff-d5e4dfc13d7418754c3cf9af9feeaba211f7d6eaf5b2fd8ac4d230e614c84877R109 - but where is it rendered? I don't see it on https://bitsandbytes.readthedocs.io/. If possible, please link to these new docs from the README.md of https://github.com/TimDettmers/bitsandbytes. Thank you!
Hi @stas00!
Oh, my apologies, for some reason I read it as having been merged. Usually issues get closed when the corresponding PR is merged, hence the confusion.
Yes, sorry, you're right, I should have waited to resolve this until the PR is merged. Sorry for the confusion. It's just that this is the last thing before my vacation and I wanted to get this todo item (issue closure) checkmarked already :D The docs were still pending a final polish + review. Sure, of course we'll put the link in the README.
There is a handy trick to make this easy in the future: in the OP of the PR you add a line like `Fixes #962` (the issue number), and once the PR is merged the linked issue gets automatically closed by GitHub with a comment noting which PR closed it.
Wow, yes, that's a good trick. Thanks a lot for the hint! Just merged the docs PR. Thanks for your valuable input!
For posterity, for those who read this Issue - you will find the docs here: |
For posterity, the more detailed background info on the topic can now be found here, as we split the docs on the topic for gradual disclosure of complexity. |
Feature request
Hi Tim, I have just accidentally discovered that you added paged optimizers to this library - so very awesome! But there is absolutely zero documentation - would you consider adding at least a basic doc entry on these?
For example, I'd like to know the cost of the tradeoff of CPU offloading of optim states.
Also, did you cover these in any of your recent papers? I can't find any documentation.
Thank you!