paged optimizers doc? #962

Closed
stas00 opened this issue Jan 11, 2024 · 20 comments

Comments

@stas00
Contributor

stas00 commented Jan 11, 2024

Feature request

Hi Tim, I have just accidentally discovered that you added paged optimizers to this library - so very awesome! But there is absolutely zero documentation - would you consider adding at least a basic doc entry on these?

For example, I'd like to know the cost of the tradeoff involved in CPU offloading of the optim states.

Also, did you cover these in some recent papers of yours? I can't find any documentation.

Thank you!

@younesbelkada
Collaborator

Hi @stas00! Great to e-see you!
@Titus-von-Koeller is adding a v1 of the docs here: #965, meaning we're going to have extensive docs soon! 👀

@stas00
Contributor Author

stas00 commented Jan 12, 2024

That's exciting news! Thank you for working on this, @younesbelkada and the team!

But perhaps we could start discussing the nuances of paged AdamW here first? e.g., what is the cost/overhead of using paged AdamW?

@JH-lee95

I'm waiting for it too!

@younesbelkada do you also plan to compare the paged optimizer with the 8-bit optimizer and the normal 32-bit optimizer?

@TimDettmers
Collaborator

Hi Stas! Thanks for your interest in this. Paged optimizers are built on top of the unified memory feature of CUDA. This feature is not supported by PyTorch, and we added it to bitsandbytes.

It works like regular CPU paging, which means that it only becomes active if you run out of GPU memory. Only then will the memory be transferred, page-by-page, from GPU to CPU. The memory is mapped, meaning that pages are preallocated on the CPU, but they are not updated automatically. They are only updated if the memory is accessed, or a swapping operation is launched.

The unified memory feature is less efficient than regular asynchronous memory transfers. This means you usually will not be able to get full PCIe memory bandwidth utilization. If you do a manual prefetch, transfer speeds can be high, but still about half of the full PCIe memory bandwidth or worse (tested on PCIe 3.0 with 16x lanes).

This all means performance depends highly on the particular use-case. If you evict, say, 1 GB of memory per forward-backward-optimizer loop, you can expect about 50% of the PCIe bandwidth in the best case. So 1 GB over PCIe 3.0 with 16x lanes, which runs at 16 GB/s, gives 1/(16*0.5) = 1/8 s = 125 ms of overhead per optimizer step. Other overheads can be estimated for a particular use-case given the PCIe interface, the number of lanes, and the memory that is evicted in each iteration.
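
For illustration, here is a minimal sketch of that estimate; the function name and the 0.5 effective-bandwidth fraction are assumptions for illustration, not anything from bitsandbytes:

```python
# Back-of-the-envelope paging overhead per optimizer step, following the
# numbers above (PCIe 3.0 x16 at ~16 GB/s, ~50% effective utilization).
def paging_overhead_s(evicted_gb, pcie_gbps=16.0, effective_fraction=0.5):
    """Estimated seconds spent swapping pages per optimizer step."""
    return evicted_gb / (pcie_gbps * effective_fraction)

print(paging_overhead_s(1.0))  # -> 0.125 s for 1 GB evicted per step
```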

Compared to CPU offloading, this has the advantage that you have zero overhead if all the memory fits into the device, and only some overhead if some of the memory needs to be evicted. For offloading, you usually offload fixed parts of the model, and you need to offload and onload all of this memory with each iteration through the model (sometimes twice, for both the forward and backward pass).

This is the basic functionality so far.

Let me know if you have any other questions.
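
For context, a minimal usage sketch of a paged optimizer, assuming the `bnb.optim.PagedAdamW` class name (check the installed version's docs for the exact API):

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4096, 4096).cuda()

# Paged optimizer: its state lives in CUDA unified memory, so pages are only
# evicted to CPU RAM when the GPU actually runs out of memory.
optimizer = bnb.optim.PagedAdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 4096, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```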

@stas00
Contributor Author

stas00 commented Jan 16, 2024

Hi Tim,

Your explanation is very helpful. I especially appreciate the numerical walkthroughs!

I discovered this via @pacman100's https://huggingface.co/blog/ram-efficient-pytorch-fsdp
where he managed to squeeze a 70B Llama-2 into an 8x80GB node for normal finetuning (no LoRA, no quantization). I was able to reproduce it - it was very slow, but it worked, and the secret sauce was your paged AdamW.

Doing the math, there is no way a 70B-param model can fit into 640GB of GPU memory for normal training/finetuning: with 4+2 bytes per param for weights, 2 (or 4) bytes for grads, and 8 bytes for optim states, that's 16 bytes per param, i.e. 70*16=1120GB. But if all of the optim states are paged out, that leaves just 8 bytes per param, 70*8=560GB, and then it can fit onto a single node. So looking at the numbers, it seems all of the optim states got paged out (with perhaps a small portion remaining), and a tiny bs=1 was able to fit with the help of flash attn.

It looks like the eviction algo is automatic, in which case how does it know how much memory should be paged out? The missing part for me is how it can predict how much peak memory will be used by activations and thus avoid OOM.

And if I follow your math, with 70B*8=560GB of optim states being fully evicted, at 0.125s of overhead per 1GB that is 70s of overhead for a single step. (Not complaining - this is not the intended use case, just calculating. :)
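
Spelling out that arithmetic in a small sketch (the byte counts are the assumptions from this thread, so the numbers are approximate):

```python
# Per-parameter memory budget for mixed-precision Adam finetuning of a 70B model.
params_b = 70              # billions of parameters
weights = 4 + 2            # bytes/param: fp32 master weights + bf16 working copy
grads = 2                  # bytes/param: bf16 gradients
optim = 8                  # bytes/param: Adam fp32 momentum + fp32 variance

total_gb = params_b * (weights + grads + optim)  # 1120 GB -> exceeds 640 GB of GPU memory
on_gpu_gb = params_b * (weights + grads)         # 560 GB if all optim states are paged out
overhead_s = params_b * optim / (16 * 0.5)       # ~70 s/step if fully evicted over PCIe 3.0 x16
print(total_gb, on_gpu_gb, overhead_s)
```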

@TimDettmers
Collaborator

The eviction is done page-by-page, where a page is usually a couple of kilobytes. With prefetching, the page size increases to 2MB, which is much more efficient (but not as efficient as GB-sized chunks).

How it works under the hood: when memory is allocated, a caching algorithm determines which pieces of memory to evict based on the caching policy (this is configurable). When the memory is accessed by a kernel and it is detected that the memory is only located on the CPU, GPU memory will be allocated (which may again evict some other memory) and the transfer of the page will be issued. In the meantime, the other thread blocks will execute other instructions on data that is already on the GPU and continue computation in parallel.

The slowness in that case makes sense. Paged optimizers work without the need to program any paging, but they are not very efficient when large amounts of data are transferred. In this case offloading would be faster (but very difficult to program). Note that 8-bit optimizers only use 2 bytes per param, and 8-bit paged optimizers exist as well.

There was a recent paper on how to do 4-bit optimizers. Using those insights, it would be easy to design an 8-bit optimizer that performs just as well as 32-bit optimizers. This would be pretty useful in conjunction with paged optimizers. I do not have time to implement this, but if it is interesting to you, I can give you pointers to the papers and to the bitsandbytes code for how to implement it.
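
For a rough sense of scale, a sketch comparing optimizer-state sizes for an Adam-style optimizer with two state tensors per parameter (the figures ignore quantization block metadata, so they are approximate):

```python
# Approximate optimizer-state memory for a 70B-parameter model with two Adam
# states (momentum and variance) per parameter, at different state precisions.
params_b = 70
for name, bits_per_state in [("32-bit", 32), ("8-bit", 8), ("4-bit", 4)]:
    bytes_per_param = 2 * bits_per_state / 8
    print(f"{name}: {bytes_per_param:g} bytes/param -> ~{params_b * bytes_per_param:g} GB")
# 32-bit: 8 bytes/param -> ~560 GB
# 8-bit: 2 bytes/param -> ~140 GB
# 4-bit: 1 bytes/param -> ~70 GB
```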

@stas00
Contributor Author

stas00 commented Jan 16, 2024

Oh, I see, so it can evict any GPU memory - not just optimizer pages. In other words, this is not really a paged optimizer, but a paged GPU-memory usage mode, where any allocation of any memory (weights, grads, optim states, activations) may page out some of the previously allocated GPU memory to CPU and page it back in when accessed? Or did I get it wrong?

CPU-offload is typically per component - i.e. one configures it to offload either weights or optim states, but I have never seen a global any-component CPU-offload before.

> Note that 8-bit optimizers only use 2 bytes per param, and 8-bit paged optimizers exist as well.

Sourab's setup was using --optim paged_adamw_32bit, but I'm not sure what it means - is it 32 bits per state (for a total of 64 bits / 8 bytes per param), or is it 32 bits for both states (for a total of 4 bytes per param)?

@nullhook

> The unified memory feature is less efficient than regular asynchronous memory transfers

what does "regular asynchronous memory transfers" constitutes here? are these within device (gpu) only?

@usamec

usamec commented Jan 24, 2024

Would taking b&b function get_paged work out of box for using as model weights and activations (assuming one makes custom layer, which produces paged result)? Or would one also need to free the used memory?
And also, using manual offloading is probably faster right?

@dimitry12

I guess CUDA paging somehow works out of the box in WSL2+CUDA. I was puzzled for a few hours about why I could keep increasing the batch size with VRAM usage hovering around 100%, iterations becoming slower, and no CUDA OOM - until I noticed increasing CPU RAM usage.

Running exactly the same trainer with the same "too large" batch-size in bare Linux produced CUDA OOM as expected.

@Titus-von-Koeller
Collaborator

Closing due to Tim's comment and doc release.

Feel free to keep discussing, I just don't want this issue to stay in open status, since it's been addressed.

Thanks for the valuable input @stas00!

@stas00
Contributor Author

stas00 commented Feb 4, 2024

Great, I found this https://github.com/TimDettmers/bitsandbytes/pull/1012/files#diff-d5e4dfc13d7418754c3cf9af9feeaba211f7d6eaf5b2fd8ac4d230e614c84877R109 but where is it rendered?

I don't see it on https://bitsandbytes.readthedocs.io/

If possible please link to these new docs from the README.md of https://github.com/TimDettmers/bitsandbytes

Thank you!

@younesbelkada
Collaborator

younesbelkada commented Feb 4, 2024

Hi @stas00 !
For that PR, everything is rendered here right now: https://moon-ci-docs.huggingface.co/docs/bitsandbytes/pr_1012/en/index - after being merged it will be reflected here: https://huggingface.co/docs/bitsandbytes/main/en/

@stas00
Contributor Author

stas00 commented Feb 4, 2024

Oh, my apologies, for some reason I read it as having been merged. Usually issues get closed when the corresponding PR is merged, hence the confusion.

@Titus-von-Koeller
Collaborator

Yes, sorry, you're right, I should have waited to resolve this until the PR was merged. Sorry for the confusion. It's just that this is the last thing before my vacation and I wanted to get this todo item (issue closure) checked off already :D The docs were still pending a final polish + review.

Sure, of course we'll put the link in the README.

@stas00
Contributor Author

stas00 commented Feb 4, 2024

There is a handy trick to make this easy in the future:

in the OP of the PR you add:

Fixes: https://github.com/TimDettmers/bitsandbytes/issues/962

and once the PR is merged, the linked issue gets automatically closed by GitHub with a comment noting which PR closed it.

@Titus-von-Koeller
Collaborator

Wow, yes, that's a good trick. Thanks a lot for the hint! Just merged the docs PR.

Thanks for your valuable input!

@stas00
Contributor Author

stas00 commented Feb 5, 2024

For posterity, for those who read this Issue - you will find the docs here:
https://huggingface.co/docs/bitsandbytes/main/en/optimizers#paged-optimizers

@Titus-von-Koeller
Collaborator

For posterity, the more detailed background info can now be found here, as we split the docs on this topic for gradual disclosure of complexity.
