ggml : Implementations for Q4_0_8_8 quantization based functions - RISC-V vector version #10029
Conversation
Model: https://huggingface.co/CobraMamba/mamba-gpt-3b-v4
Is there anything wrong with this PR? Should I provide more test results like llama-perplexity? @ggerganov
* ggml : RISC-V vector gemv for q4_0_8x8
* ggml : Added WIP rvv q4_0_8x8 gemm
* ggml : Added initial implementation of rvv gemm
* ggml : optimize gemm to avoid register spillover
* ggml : Fix GCC rvv load alignment issue
* ggml : Format gemm rvv code
* ggml : Fix a typo in RVV q4_0_8_8 GEMM
Questions about Testing and Model Configuration

Hello! I have two questions regarding your project:

1. Functional Testing Coverage

I notice that only performance tests are visible in the documentation/repository. How do you ensure functional correctness without explicit functional tests? Could you provide more details about your testing strategy for functional validation?

2. Model Variants Clarification

You reference the model: https://huggingface.co/CobraMamba/mamba-gpt-3b-v4 However, in your test results, I see references to:
These variants don't appear to be directly available in the linked model repository. Could you clarify:
Thank you for your time and assistance!
My apologies for the confusion; my test was a bit coarse. Functional correctness was validated by examining the model's output both intuitively (whether the responses are coherent) and statistically (by comparing the perplexity metric before and after the optimization). I remember that the perplexity results were very close, but the original test data was lost. P.S. The reference perplexity metric can be calculated on other, more performant devices, so you don't need to wait another painful 400 hours.
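For reference, a minimal sketch of how such a before/after perplexity comparison can be run with the llama-perplexity tool on a faster machine; the model and dataset file names here are hypothetical:

```sh
# Run the same command on the pre- and post-optimization builds
# and compare the reported final perplexity values.
./llama-perplexity \
    -m mamba-gpt-3b-v4-Q4_0_8_8.gguf \
    -f wikitext-2-raw/wiki.test.raw \
    -t 8
```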
These quantized versions were obtained using the quantize tool in this llama.cpp project. The steps are converting the original model to GGUF format and then quantizing it to the desired formats. You might need to check out an older version of llama.cpp, because the manual Q4_0 8x8 repack format was deprecated afterwards and the repacking process was made automatic as a new backend. I haven't checked the Q4_0 format after the automatic repacking patch was merged, and have focused on K-quant types instead.
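A rough sketch of those two steps is below. It assumes a checkout from before the Q4_0_N_M repack types were removed; the conversion script and binary names (convert_hf_to_gguf.py vs. convert-hf-to-gguf.py, llama-quantize vs. quantize) vary between versions, and the file names are only examples:

```sh
# 1. Convert the original Hugging Face model to GGUF (F16).
python convert_hf_to_gguf.py ./mamba-gpt-3b-v4 --outfile mamba-gpt-3b-v4-f16.gguf --outtype f16

# 2. Quantize to the manually repacked Q4_0_8_8 format (only on older checkouts).
./llama-quantize mamba-gpt-3b-v4-f16.gguf mamba-gpt-3b-v4-Q4_0_8_8.gguf Q4_0_8_8
```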
I've tested Mistral 7B in QEMU, and it just worked. I'm still choosing suitable 3B models for my dev board with only 4 GB of RAM (Banana Pi BPI-F3), so I can't give any performance evaluation as of now, and any help is welcome! BTW, Mistral 7B could run on the 4 GB BPI-F3 with another 4 GB of swap enabled, but it was much slower than even QEMU.
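For anyone trying the same on a 4 GB board, a minimal sketch of setting up the extra 4 GB swap file on Linux (standard commands, nothing specific to this PR):

```sh
# Create and enable a 4 GB swap file (temporary; add an entry to /etc/fstab to persist).
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
```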