paged optimizers doc? #962

Closed
stas00 opened this issue Jan 11, 2024 · 20 comments

Comments

@stas00
Contributor

stas00 commented Jan 11, 2024

Feature request

Hi Tim, I have just accidentally discovered that you added paged optimizers to this library - so very awesome! But there is absolutely zero documentation - would you consider adding at least a basic doc entry on these?

For example, I'd like to know the cost of the tradeoff involved in CPU offloading of the optim states.

Also, did you cover these in some recent papers of yours? I can't find any documentation.

Thank you!

@younesbelkada
Collaborator

Hi @stas00! Great to e-see you!
@Titus-von-Koeller is adding a v1 of the docs here: #965, meaning we're going to have extensive docs soon! 👀

@stas00
Contributor Author

stas00 commented Jan 12, 2024

That's exciting news! Thank you for working on this, @younesbelkada and the team!

But perhaps we could start discussing the nuances of paged AdamW here first? e.g., what is the cost/overhead of using paged AdamW?

@JH-lee95

I'm waiting for it too!

@younesbelkada do you also plan to compare the paged optimizer with the 8-bit optimizer and the normal 32-bit optimizer?

@TimDettmers
Collaborator

Hi Stas! Thanks for your interest in this. Paged optimizers are built on top of the unified memory feature of CUDA. This feature is not supported by PyTorch, and we added it to bitsandbytes.

It works like regular CPU paging, which means that it only becomes active if you run out of GPU memory. Only then will the memory be transferred, page-by-page, from GPU to CPU. The memory is mapped, meaning that pages are preallocated on the CPU, but they are not updated automatically. They are only updated if the memory is accessed, or a swapping operation is launched.

The unified memory feature is less efficient than regular asynchronous memory transfers. This means you usually will not be able to get full PCIe memory bandwidth utilization. If you do a manual prefetch, transfer speeds can be high, but still about half of the full PCIe memory bandwidth or worse (tested on PCIe 3.0 with 16x lanes).

This all means performance depends highly on the particular use-case. If you evict, say, 1 GB of memory per forward-backward-optimizer loop, you can expect about 50% of the PCIe bandwidth in the best case. So 1 GB over PCIe 3.0 with 16x lanes, which runs at 16 GB/s, gives 1/(16*0.5) = 1/8 s = 125 ms of overhead per optimizer step. Other overheads can be estimated for a particular use-case given the PCIe interface, the number of lanes, and the memory that is evicted in each iteration.
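
For illustration, here is a minimal sketch of that estimate; the function name and the 0.5 effective-bandwidth fraction are assumptions for illustration, not anything from bitsandbytes:

```python
# Back-of-the-envelope paging overhead per optimizer step, following the
# numbers above (PCIe 3.0 x16 at ~16 GB/s, ~50% effective utilization).
def paging_overhead_s(evicted_gb, pcie_gbps=16.0, effective_fraction=0.5):
    """Estimated seconds spent swapping pages per optimizer step."""
    return evicted_gb / (pcie_gbps * effective_fraction)

print(paging_overhead_s(1.0))  # -> 0.125 s for 1 GB evicted per step
```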

Compared to CPU offloading, this has the advantage that you have zero overhead if all the memory fits into the device, and only some overhead if some of the memory needs to be evicted. For offloading, you usually offload fixed parts of the model, and you need to offload and onload all of this memory with each iteration through the model (sometimes twice, for both the forward and backward pass).

This is the basic functionality so far.

Let me know if you have any other questions.
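
For context, a minimal usage sketch of a paged optimizer, assuming the `bnb.optim.PagedAdamW` class name (check the installed version's docs for the exact API):

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4096, 4096).cuda()

# Paged optimizer: its state lives in CUDA unified memory, so pages are only
# evicted to CPU RAM when the GPU actually runs out of memory.
optimizer = bnb.optim.PagedAdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 4096, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```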

@stas00
Contributor Author

stas00 commented Jan 16, 2024

Hi Tim,

Your explanation is very helpful. I especially appreciate the numerical walkthroughs!

I discovered this via @pacman100's https://huggingface.co/blog/ram-efficient-pytorch-fsdp
where he managed to squeeze a 70B Llama-2 into an 8x80GB node for normal finetuning (no LoRA, no quantization). I was able to reproduce it - it was very slow, but it worked, and the secret sauce was your paged AdamW.

Doing the math, there is no way a 70B-param model can fit into 640GB of GPU memory for normal training/finetuning: with 4+2 bytes per param for weights, 2 (or 4) bytes for grads, and 8 bytes for optim states, that's 16 bytes per param, i.e. 70*16=1120GB. But if all of the optim states are paged out, that leaves just 8 bytes per param, 70*8=560GB, and then it can fit onto a single node. So looking at the numbers, it seems all of the optim states got paged out (with perhaps a small portion remaining), and a tiny bs=1 was able to fit with the help of flash attn.

It looks like the eviction algo is automatic, in which case how does it know how much memory should be paged out? The missing part for me is how it can predict how much peak memory will be used by activations and thus avoid OOM.

And if I follow your math, with 70B*8=560GB of optim states being fully evicted, at 0.125s of overhead per 1GB that is 70s of overhead for a single step. (Not complaining - this is not the intended use case, just calculating. :)
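
Spelling out that arithmetic in a small sketch (the byte counts are the assumptions from this thread, so the numbers are approximate):

```python
# Per-parameter memory budget for mixed-precision Adam finetuning of a 70B model.
params_b = 70              # billions of parameters
weights = 4 + 2            # bytes/param: fp32 master weights + bf16 working copy
grads = 2                  # bytes/param: bf16 gradients
optim = 8                  # bytes/param: Adam fp32 momentum + fp32 variance

total_gb = params_b * (weights + grads + optim)  # 1120 GB -> exceeds 640 GB of GPU memory
on_gpu_gb = params_b * (weights + grads)         # 560 GB if all optim states are paged out
overhead_s = params_b * optim / (16 * 0.5)       # ~70 s/step if fully evicted over PCIe 3.0 x16
print(total_gb, on_gpu_gb, overhead_s)
```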

@TimDettmers
Collaborator

The eviction is done page-by-page, where a page is usually a couple of kilobytes. With prefetching, the page size increases to 2MB, which is much more efficient (but not as efficient as GB-sized chunks).

How it works under the hood: when memory is allocated, a caching algorithm determines which pieces of memory to evict based on the caching policy (this is configurable). When the memory is accessed by a kernel and it is detected that the memory is only located on the CPU, GPU memory will be allocated (which may again evict some other memory) and the transfer of the page will be issued. In the meantime, the other thread blocks will execute other instructions on data that is already on the GPU and continue computation in parallel.

The slowness in that case makes sense. Paged optimizers work without the need to program any paging, but they are not very efficient when large amounts of data are transferred. In this case offloading would be faster (but very difficult to program). Note that 8-bit optimizers only use 2 bytes per param, and 8-bit paged optimizers exist as well.

There was a recent paper on how to do 4-bit optimizers. Using those insights, it would be easy to design an 8-bit optimizer that performs just as well as 32-bit optimizers. This would be pretty useful in conjunction with paged optimizers. I do not have time to implement this, but if it is interesting to you, I can give you pointers to the papers and to the bitsandbytes code for how to implement it.
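
For a rough sense of scale, a sketch comparing optimizer-state sizes for an Adam-style optimizer with two state tensors per parameter (the figures ignore quantization block metadata, so they are approximate):

```python
# Approximate optimizer-state memory for a 70B-parameter model with two Adam
# states (momentum and variance) per parameter, at different state precisions.
params_b = 70
for name, bits_per_state in [("32-bit", 32), ("8-bit", 8), ("4-bit", 4)]:
    bytes_per_param = 2 * bits_per_state / 8
    print(f"{name}: {bytes_per_param:g} bytes/param -> ~{params_b * bytes_per_param:g} GB")
# 32-bit: 8 bytes/param -> ~560 GB
# 8-bit: 2 bytes/param -> ~140 GB
# 4-bit: 1 bytes/param -> ~70 GB
```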

@stas00
Contributor Author

stas00 commented Jan 16, 2024

Oh, I see, so it can evict any GPU memory - not just optimizer pages. In other words, this is not really a paged optimizer, but a paged GPU-memory usage mode, where any allocation of any memory (weights, grads, optim states, activations) may page out some of the previously allocated GPU memory to CPU and page it back in when accessed? Or did I get it wrong?

CPU-offload is typically per component - i.e. one configures it to offload either weights or optim states, but I have never seen a global any-component CPU-offload before.

> Note that 8-bit optimizers only use 2 bytes per param, and 8-bit paged optimizers exist as well.

Sourab's setup was using --optim paged_adamw_32bit, but I'm not sure what it means - is it 32 bits per state (for a total of 64 bits / 8 bytes per param), or is it 32 bits for both states (for a total of 4 bytes per param)?

@nullhook

> The unified memory feature is less efficient than regular asynchronous memory transfers

what does "regular asynchronous memory transfers" constitutes here? are these within device (gpu) only?

@usamec

usamec commented Jan 24, 2024

Would taking b&b function get_paged work out of box for using as model weights and activations (assuming one makes custom layer, which produces paged result)? Or would one also need to free the used memory?
And also, using manual offloading is probably faster right?

@dimitry12

I guess CUDA paging somehow works out of the box in WSL2+CUDA. I was puzzled for a few hours about why I could keep increasing the batch size with VRAM usage hovering around 100%, iterations becoming slower, and no CUDA OOM - until I noticed increasing CPU RAM usage.

Running exactly the same trainer with the same "too large" batch-size in bare Linux produced CUDA OOM as expected.

@Titus-von-Koeller
Collaborator

Closing due to Tim's comment and doc release.

Feel free to keep discussing, I just don't want this issue to stay in open status, since it's been addressed.

Thanks for the valuable input @stas00!

@stas00
Contributor Author

stas00 commented Feb 4, 2024

Great, I found this https://github.com/TimDettmers/bitsandbytes/pull/1012/files#diff-d5e4dfc13d7418754c3cf9af9feeaba211f7d6eaf5b2fd8ac4d230e614c84877R109 but where is it rendered?

I don't see it on https://bitsandbytes.readthedocs.io/

If possible please link to these new docs from the README.md of https://github.com/TimDettmers/bitsandbytes

Thank you!

@younesbelkada
Collaborator

younesbelkada commented Feb 4, 2024

Hi @stas00 !
For that PR, everything is rendered here right now: https://moon-ci-docs.huggingface.co/docs/bitsandbytes/pr_1012/en/index - after being merged it will be reflected here: https://huggingface.co/docs/bitsandbytes/main/en/

@stas00
Contributor Author

stas00 commented Feb 4, 2024

Oh, my apologies, for some reason I read it as having been merged. Usually issues get closed when the corresponding PR is merged, hence the confusion.

@Titus-von-Koeller
Collaborator

Yes, sorry, you're right, I should have waited to resolve this until the PR was merged. Sorry for the confusion. It's just that this is the last thing before my vacation and I wanted to get this todo item (issue closure) checked off already :D The docs were still pending a final polish + review.

Sure, of course we'll put the link in the README.

@stas00
Contributor Author

stas00 commented Feb 4, 2024

There is a handy trick to make this easy in the future:

in the OP of the PR you add:

Fixes: https://github.com/TimDettmers/bitsandbytes/issues/962

and once the PR is merged, the linked issue gets automatically closed by GitHub with a comment noting which PR closed it.

@Titus-von-Koeller
Collaborator

Wow, yes, that's a good trick. Thanks a lot for the hint! Just merged the docs PR.

Thanks for your valuable input!

@stas00
Contributor Author

stas00 commented Feb 5, 2024

For posterity, for those who read this Issue - you will find the docs here:
https://huggingface.co/docs/bitsandbytes/main/en/optimizers#paged-optimizers

@Titus-von-Koeller
Collaborator

For posterity, the more detailed background info can now be found here, as we split the docs on this topic for gradual disclosure of complexity.
