cpu fp32i8 for low RAM usage on CPU? #30
Comments
Yes, `cpu fp32i8` is a hidden feature :)
Please update ChatRWKV v2 & pip rwkv package (0.3.1) for 2x faster f16i8 (and less VRAM) and fast f16i8+ streaming.
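For anyone landing here: with the pip rwkv package, the strategy string is simply passed to the model constructor. A minimal sketch, assuming a placeholder checkpoint path:

```python
# Minimal sketch of loading a model with the 'cpu fp32i8' strategy via the
# pip rwkv package. The checkpoint path is a placeholder; point it at any
# downloaded RWKV .pth file (path given without the extension).
import os
os.environ["RWKV_JIT_ON"] = "1"  # must be set before rwkv.model is imported

from rwkv.model import RWKV

# int8-quantized weights held in CPU RAM, computed in fp32: roughly half the
# memory of plain 'cpu fp32'.
model = RWKV(model="/path/to/RWKV-4-Pile-7B-20230109-ctx4096",
             strategy="cpu fp32i8")
```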
I've git pulled the repository and upgraded the rwkv package. First I used oobabooga's Text Generation Web UI. After 7B loaded and I disabled swap space, memory usage sat at 7.8 GiB and spiked to 9.0 GiB once (after that, the highest spike was 8.4 GiB). I was able to generate at what felt like one token every ~15 seconds (which isn't a problem for me, but still). So far this looks better than my previous test with 7B on standalone ChatRWKV. I've decided to give standalone ChatRWKV another try, going back to my 1.5B testing.
It seems the optimization for f16i8 might've benefited fp32i8 as well; the memory requirements appear to have been cut down slightly further than last time. The memory fluctuation still seems to be there, though; aside from the 1.5B tests, quick tests with 169M gave me results ranging from 663.6 MiB to 976.3 MiB for fp32i8. I'm unsure if this is on RWKV's end or my operating system's end (I'm using Void Linux, if that helps). I haven't kept an eye on whether there was a difference in speed. I'm considering seeing if …. Either way, thank you again for considering lower-end and CPU users with this project! I feel like these optimization efforts help make LLMs more accessible to those with weaker hardware.
Cool, please update to the latest code and 0.3.1 (better generation quality).
This is very unscientific, but I've been playing with different strategies on my 6 GB GPU (GeForce GTX 1060), including with the latest release, 0.3.1. As far as I can tell, it just doesn't seem worth it to use …. The best performance I've been able to achieve is using ….
I didn't think the third one would be better; I just tried it out of curiosity. I don't know if there's any advantage to running specific layers in dedicated memory rather than streaming, aside from the 33rd.
You can try 'cuda fp16i8 *14+' (and increase the 14) to stream fp16i8 layers with 0.3.1.
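In that strategy syntax, the number before the '+' is how many layers stay pinned on the GPU; the remaining layers are streamed in on demand. A sketch with an illustrative layer count and a placeholder model path:

```python
from rwkv.model import RWKV

# 'cuda fp16i8 *14+': the first 14 layers stay resident on the GPU as int8;
# the '+' streams each remaining layer to the GPU on demand, keeping fixed
# VRAM use low at some speed cost. Raise 14 as far as your VRAM allows.
model = RWKV(model="/path/to/RWKV-4-Pile-7B-20230109-ctx4096",
             strategy="cuda fp16i8 *14+")

# Non-streaming alternative: pin 14 int8 layers on the GPU and run the rest
# on the CPU in fp32.
# strategy="cuda fp16i8 *14 -> cpu fp32"
```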
This is much slower than …. I tried: …
I also tried …. With my hardware, I haven't seen any case where using ….
Update ChatRWKV v2 & pip rwkv package (0.4.2) for 2x speed in all modes. @KerfuffleV2 Please join the RWKV Discord, where you can share your thoughtful benchmark results.
Update ChatRWKV v2 & pip rwkv package (0.5.0) and set `os.environ["RWKV_CUDA_ON"] = '1'`.
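For context, the flag has to be set before rwkv.model is imported, because the custom CUDA kernel is built at import time. A sketch, assuming a working CUDA toolchain and a placeholder model path:

```python
import os

# Set before importing rwkv.model; the custom CUDA kernel is compiled at
# import time and needs a working CUDA toolchain (nvcc) to build.
os.environ["RWKV_CUDA_ON"] = "1"
os.environ["RWKV_JIT_ON"] = "1"

from rwkv.model import RWKV

model = RWKV(model="/path/to/RWKV-4-Pile-7B-20230109-ctx4096",
             strategy="cuda fp16i8 *14+")
```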
Hi. This isn't an issue, but I didn't know where else to put this, haha.
I've been watching the progress of ChatRWKV (which is awesome; thank you so much for developing this), and I'm a user of oobabooga's Web UI (so I'm aware of the thread on RWKV support). I like to tinker with text generation models that can be used for chatbot tasks and run on CPUs with little system memory.
I heard about int8 quantization, and not having a good enough GPU (but plenty of RAM) on my main PC, I gave it a try via `cpu fp32i8`. To my surprise, it works! I still needed swap space to load the model, but after that I was able to run 7B under 8.8 GiB of RAM (with spikes to around 10.5 GiB while generating), and it loaded and generated faster than plain bf16, to the best of my recollection. It was a few days back, but these were the results I remember writing down: …

I do notice that it fluctuates a bit (I tried 1.5B just now, and after it was done loading it ended up idling at around 2.5 GiB the first time, 5.8 GiB the second time, and back to 2.4 GiB the third time); I'm not sure why.
But yeah, I don't see any mentions of `cpu fp32i8` anywhere, not even in any Discord servers, only mentions of `cuda fp16i8`, so I was wondering if this was intended or if it's just a nice side effect?
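A rough way to reproduce the memory figures above, sketched with the rwkv pipeline helper; psutil is an extra dependency, the checkpoint path is a placeholder, and 20B_tokenizer.json is the tokenizer file shipped with ChatRWKV:

```python
# Load with 'cpu fp32i8', generate a few tokens, and watch resident memory.
import os, psutil

os.environ["RWKV_JIT_ON"] = "1"

from rwkv.model import RWKV
from rwkv.utils import PIPELINE, PIPELINE_ARGS

def rss_gib() -> float:
    # Resident set size of this process, in GiB.
    return psutil.Process().memory_info().rss / 2**30

model = RWKV(model="/path/to/RWKV-4-Pile-1B5-20220903-8040",
             strategy="cpu fp32i8")
print(f"after load: {rss_gib():.2f} GiB")

pipeline = PIPELINE(model, "20B_tokenizer.json")
out = pipeline.generate("Hello, my name is", token_count=32,
                        args=PIPELINE_ARGS(temperature=1.0, top_p=0.85))
print(out)
print(f"after generating 32 tokens: {rss_gib():.2f} GiB")
```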