
cpu fp32i8 for low RAM usage on CPU? #30

Closed · Crataco opened this issue Mar 10, 2023 · 10 comments

Crataco commented Mar 10, 2023

Hi. This isn't an issue, but I didn't know where else to put this, haha.

I've been watching the progress of ChatRWKV (which is awesome; thank you so much for developing this), and I'm a user of oobabooga's Web UI (so I'm aware of the thread on RWKV support). I like to tinker with text generation models that can be used for chatbot tasks and on CPUs, with little system memory.

I heard about int8 quantization, and not having a good enough GPU (but plenty of RAM) on my main PC, I gave it a try via cpu fp32i8. To my surprise, it works! I still needed swap space to load the model, but after that I was able to run 7B in under 8.8 GiB of RAM (with spikes to around 10.5 GiB while generating), and, to the best of my recollection, it loaded and generated faster than plain bf16. It was a few days back, but these were the results I remember writing down:

# MODEL MEMORY USAGE
MODEL    fp32        bf16        fp32i8      bf16i8
169M     1.0 GiB     743.0 MiB   856.7 MiB   877.9 MiB
430M     2.1 GiB     1.3 GiB     1.2 GiB     2.4 GiB
1.5B     6.3 GiB     3.4 GiB     5.7 GiB     4.5 GiB
3B       ??.? GiB    6.1 GiB     4.3 GiB     ~9.1 GiB
7B       ??.? GiB    ??.? GiB    8.8 GiB     ??.? GiB

I do notice that memory usage fluctuates a bit (I tried 1.5B just now, and after it was done loading it ended up idling at around 2.5 GiB the first time, 5.8 GiB the second time, and back to 2.4 GiB the third time); I'm not sure why.

But yeah, I don't see any mention of cpu fp32i8 anywhere, not even in any Discord servers, only mentions of cuda fp16i8, so I was wondering whether this was intended or just a nice side effect?

BlinkDL (Owner) commented Mar 10, 2023

Yes cpu fp32i8 is a hidden feature :)
It's because I am using a silly slow method to dynamically convert INT8 -> FPXX
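
For anyone curious what that dynamic INT8 -> FPXX conversion can look like, here's a minimal generic sketch in plain PyTorch. The function names are made up and this is not ChatRWKV's actual code (which may use per-channel scales or a different layout); it just illustrates the idea of keeping weights in int8 and expanding them on the fly:

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Per-tensor affine quantization to uint8. Generic sketch only.
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / 255.0
    q = torch.clamp(((w - w_min) / scale).round(), 0, 255).to(torch.uint8)
    return q, scale, w_min

def dequantize_to_fp32(q: torch.Tensor, scale, zero) -> torch.Tensor:
    # "Dynamic" dequantization: expand int8 back to fp32 right before the
    # matmul, so only the int8 copy lives in memory between calls.
    return q.to(torch.float32) * scale + zero

# Weights stay int8 in RAM; they are expanded per forward pass.
w = torch.randn(1024, 1024)
q, scale, zero = quantize_int8(w)
x = torch.randn(1, 1024)
y = x @ dequantize_to_fp32(q, scale, zero)
```

The point is that only the uint8 copy plus two scalars stays resident, which is where the RAM savings come from, and the per-call dequantization is the extra work that makes it slower.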

Crataco changed the title from "about cpu fp32i8" to "cpu fp32i8 for low RAM usage on CPU?" on Mar 11, 2023
BlinkDL (Owner) commented Mar 12, 2023

Please update ChatRWKV v2 & pip rwkv package (0.3.1) for 2x faster f16i8 (and less VRAM) and fast f16i8+ streaming.
I think cpu fp32i8 might be faster too. Please test :)
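
For anyone following along, the strategy string is passed when constructing the model with the rwkv pip package. A minimal sketch, with a placeholder model path and arbitrary token ids:

```python
import os
os.environ['RWKV_JIT_ON'] = '1'  # optional: enable the TorchScript path

from rwkv.model import RWKV

# 'cpu fp32i8' keeps the weights quantized to int8 in RAM and computes in fp32.
model = RWKV(model='/path/to/RWKV-4-Pile-7B-20230109-ctx4096', strategy='cpu fp32i8')

# Feed some token ids and carry the recurrent state between calls.
out, state = model.forward([187, 510, 1563, 310, 247], None)
out, state = model.forward([310], state)
print(out.shape)  # logits over the vocabulary
```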

Crataco (Author) commented Mar 12, 2023

I've git pulled the repository and upgraded the rwkv package via pip.

First I used oobabooga's Text Generation Web UI. After 7B loaded and I disabled swap space, memory usage sat at 7.8 GiB and spiked to 9.0 GiB once (after that, the highest spike was 8.4 GiB). I was able to generate at what felt like a token every ~15 seconds (which isn't a problem for me, but still). So far it looks better than my previous test of 7B on standalone ChatRWKV.

I've decided to give it a try with standalone ChatRWKV again, going back to my 1.5B testing.

  • First attempt: 1.5B takes 6.7 GiB to load, and then idles at 4.6 GiB (previously it idled at 5.8 GiB).
  • Second attempt: 1.5B takes 6.8 GiB to load, and idles at 4.6 GiB.
  • Third attempt: 1.5B takes ~4 GiB (I forgot the exact number?) to load, and it idles at 2.5 GiB (which spikes to 4.0 GiB during generation, and after generation idles at 3.3 GiB).

It seems the optimization for f16i8 might've benefited fp32i8 as well, as the memory requirements appear to have been cut down slightly further than last time.

The memory fluctuation still seems to be there, though; aside from the 1.5B tests, quick tests with 169M gave me results ranging from 663.6 MiB to 976.3 MiB for fp32i8. I'm unsure if this is on RWKV's end or my operating system's end (I'm using Void Linux, if that helps).

I haven't kept an eye on whether there was a difference in speed. I'm also considering checking whether cpu fp32 and cpu fp32i8 are compatible with streaming (I need to look into how streaming works soon).

Either way, thank you again for considering lower-end and CPU users with this project! I feel like these optimization efforts help make LLMs more accessible to those with weaker hardware.

BlinkDL (Owner) commented Mar 12, 2023

> I've git pulled the repository and upgraded the rwkv package via pip.

Cool, please update to the latest code and 0.3.1 (better generation quality).

Crataco (Author) commented Mar 12, 2023

Mm, I think that's what I've done, actually! I did it again just to make sure; both the code and the pip package are up-to-date.

KerfuffleV2 (Contributor) commented Mar 12, 2023

This is very unscientific, but I've been playing with different strategies on my 6G GPU (GeForce GTX 1060), including with the latest release, 0.3.1. As far as I can tell, it just doesn't seem worth it to use i8 at all, at least on this GPU, even though it's possible to fit twice as many layers. (I know this isn't really relevant to pure CPU-based calculation, but there was already a discussion going on about the changes.)

The best performance I've been able to achieve is using cuda fp16 *8 -> cuda fp16 *0+ -> cpu fp32 *1. This generates about a token a second. (Other applications are already using about 1.2G of VRAM, so on a server one could probably go up to *10.)

  1. cuda fp16i8 *16 -> cuda fp16 *0+ -> cpu fp32 *1 — Noticeably slower. (Seems about 2sec/token)
  2. cuda fp16 *9 -> cuda fp16i8 *0+ -> cpu fp32 *1 — Also slower.
  3. cuda fp16 *4 -> cuda fp16 *0+ -> cuda fp16 *4 -> cpu fp32 *1 — Seems about the same as the best strategy I already mentioned.

I didn't think the third one would be better; I just tried it out of curiosity. I don't know if there's any advantage to running specific layers in dedicated memory rather than streaming them, aside from the 33rd.
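
For reference, splits like these are expressed as a single strategy string when loading the model. A rough sketch with a placeholder path; the comments reflect my reading of the syntax rather than an authoritative description:

```python
from rwkv.model import RWKV

MODEL = '/path/to/RWKV-4-Pile-14B'  # placeholder path

# First 8 layers resident on the GPU in fp16, the rest streamed through the
# GPU in fp16 (the '+' marks streaming), and the final part on the CPU in fp32.
model = RWKV(model=MODEL, strategy='cuda fp16 *8 -> cuda fp16 *0+ -> cpu fp32 *1')

# Alternative: everything as int8 on the GPU, streaming layers beyond the first 14.
# model = RWKV(model=MODEL, strategy='cuda fp16i8 *14+')
```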

BlinkDL (Owner) commented Mar 12, 2023

> This is very unscientific, but I've been playing with different strategies on my 6G GPU (GeForce GTX 1060), including with the latest release, 0.3.1. As far as I can tell, it just doesn't seem worth it to use i8 at all, at least on this GPU, even though it's possible to fit twice as many layers. (I know this isn't really relevant to pure CPU-based calculation, but there was already a discussion going on about the changes.)

you can try 'cuda fp16i8 *14+' (and increase 14) to stream fp16i8 layers with 0.3.1

KerfuffleV2 (Contributor) commented

> you can try 'cuda fp16i8 *14+' (and increase 14) to stream fp16i8 layers with 0.3.1

This is much slower than cuda fp16 *8 -> cuda fp16 *0+ -> cpu fp32 *1

I tried:

  1. cuda fp16i8 *14+ -> cpu fp32 *1
  2. cuda fp16i8 *14+

I also tried cuda fp16 *7+ -> cpu fp32 *1 (not i8) and it seemed either about the same or maybe a little slower than cuda fp16 *8 -> cuda fp16 *0+ -> cpu fp32 *1. Definitely much faster than either of the two above though.

With my hardware, I haven't seen any case where using fp16i8 was faster than just using half the number of layers with fp16.

BlinkDL (Owner) commented Mar 13, 2023

Update ChatRWKV v2 & pip rwkv package (0.4.2) for 2x speed in all modes @KerfuffleV2

Please join the RWKV Discord, where you can share your thoughtful benchmark results.

BlinkDL (Owner) commented Mar 15, 2023

Update ChatRWKV v2 & pip rwkv package (0.5.0) and set os.environ["RWKV_CUDA_ON"] = '1' for 1.5x f16i8 speed (and 10% less VRAM: now 14686 MB for 14B instead of 16462 MB, so you can put more layers on the GPU).
PLEASE VERIFY THE GENERATION QUALITY IS UNCHANGED.
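
A minimal sketch of the usage described above, assuming the environment variable needs to be set before importing rwkv.model so the custom CUDA kernel takes effect (the model path is a placeholder):

```python
import os

# Enable the custom CUDA kernel before the rwkv import.
os.environ['RWKV_CUDA_ON'] = '1'
os.environ['RWKV_JIT_ON'] = '1'

from rwkv.model import RWKV

model = RWKV(model='/path/to/RWKV-4-Pile-14B', strategy='cuda fp16i8')
```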

BlinkDL closed this as completed on Mar 17, 2023
BlinkDL reopened this on Mar 17, 2023
BlinkDL closed this as completed on Mar 19, 2023