
Conversation

xctan (Collaborator) commented on Oct 24, 2024

I've tested Mistral 7B in qemu, and it just worked. I'm still choosing a suitable 3B model for my dev board with only 4 GB of RAM (Banana Pi BPI-F3), so I can't give a performance evaluation yet; any help is welcome! BTW, Mistral 7B could run on the BPI-F3 with an additional 4 GB of swap enabled, but it was much slower than even qemu.
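For anyone who wants to reproduce the qemu setup, here is a minimal sketch of a cross-build plus user-mode run; the toolchain triple, CMake flags, VLEN, and model filename are assumptions about a typical setup, not the exact commands I used.

```sh
# Sketch of a RISC-V cross-build of llama.cpp run under qemu user-mode emulation.
# Toolchain name, flags, VLEN and model path are illustrative, not the exact setup.
cmake -B build-riscv \
    -DCMAKE_SYSTEM_NAME=Linux -DCMAKE_SYSTEM_PROCESSOR=riscv64 \
    -DCMAKE_C_COMPILER=riscv64-linux-gnu-gcc \
    -DCMAKE_CXX_COMPILER=riscv64-linux-gnu-g++ \
    -DCMAKE_C_FLAGS="-march=rv64gcv" -DCMAKE_CXX_FLAGS="-march=rv64gcv" \
    -DGGML_NATIVE=OFF
cmake --build build-riscv -j

# Run with the vector extension enabled (VLEN=256 here, matching the BPI-F3's RVV unit).
qemu-riscv64 -cpu rv64,v=true,vlen=256 -L /usr/riscv64-linux-gnu \
    ./build-riscv/bin/llama-cli -m mistral-7b.Q4_0_8_8.gguf -p "Hello" -n 64
```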

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Oct 24, 2024
xctan (Collaborator, Author) commented on Oct 24, 2024

Model: https://huggingface.co/CobraMamba/mamba-gpt-3b-v4
Compiler: GCC 13.2.0

| model | size | params | backend | threads | test | t/s | speedup | commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | pp512 | 1.82 ± 0.00 | 271% | 78c78e2 |
| llama 3B Q4_0 | 1.84 GiB | 3.43 B | CPU | 8 | pp512 | 1.04 ± 0.00 | 112% | 66c2c93 |
| llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | pp512 | 0.49 ± 0.00 | | 66c2c93 |
| llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | tg128 | 2.25 ± 0.10 | 350% | 78c78e2 |
| llama 3B Q4_0 | 1.84 GiB | 3.43 B | CPU | 8 | tg128 | 1.27 ± 0.03 | 154% | 66c2c93 |
| llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | tg128 | 0.50 ± 0.01 | | 66c2c93 |

xctan (Collaborator, Author) commented on Oct 30, 2024

Is there anything wrong with this PR? Should I provide more test results, e.g. from llama-perplexity? @ggerganov
I'm trying `./llama-perplexity -m <model_name> -f wikitext-2-raw/wiki.test.raw` for the Q4_0_8_8 GEMM implementation both with and without this PR, but it would take ~400 hours in total to finish.
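For reference, the run can be shortened by evaluating only a slice of the test set; something along these lines (the `--chunks` value and model filename are illustrative) should be enough to compare the two builds:

```sh
# Sketch: compare perplexity of the two builds on a small slice of wikitext-2.
# --chunks limits how many chunks of the test file are evaluated; the model name is an example.
./llama-perplexity -m mamba-gpt-3b-v4.Q4_0_8_8.gguf \
    -f wikitext-2-raw/wiki.test.raw --chunks 32 -t 8
```

Running the same command on both commits should report matching perplexity within noise if the new kernels are numerically sound.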

ggerganov merged commit fc83a9e into ggml-org:master on Oct 30, 2024 (51 of 52 checks passed)
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
* ggml : RISC-V vector gemv for q4_0_8x8

* ggml : Added WIP rvv q4_0_8x8 gemm

* ggml : Added initial implementation of rvv gemm

* ggml : optimize gemm to avoid register spillover

* ggml : Fix GCC rvv load alignment issue

* ggml : Format gemm rvv code

* ggml : Fix a typo in RVV q4_0_8_8 GEMM
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
* ggml : RISC-V vector gemv for q4_0_8x8

* ggml : Added WIP rvv q4_0_8x8 gemm

* ggml : Added initial implementation of rvv gemm

* ggml : optimize gemm to avoid register spillover

* ggml : Fix GCC rvv load alignment issue

* ggml : Format gemm rvv code

* ggml : Fix a typo in RVV q4_0_8_8 GEMM
ixgbe commented on Aug 14, 2025

> Model: https://huggingface.co/CobraMamba/mamba-gpt-3b-v4
> Compiler: GCC 13.2.0
>
> | model | size | params | backend | threads | test | t/s | speedup | commit |
> | --- | --- | --- | --- | --- | --- | --- | --- | --- |
> | llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | pp512 | 1.82 ± 0.00 | 271% | 78c78e2 |
> | llama 3B Q4_0 | 1.84 GiB | 3.43 B | CPU | 8 | pp512 | 1.04 ± 0.00 | 112% | 66c2c93 |
> | llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | pp512 | 0.49 ± 0.00 | | 66c2c93 |
> | llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | tg128 | 2.25 ± 0.10 | 350% | 78c78e2 |
> | llama 3B Q4_0 | 1.84 GiB | 3.43 B | CPU | 8 | tg128 | 1.27 ± 0.03 | 154% | 66c2c93 |
> | llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | tg128 | 0.50 ± 0.01 | | 66c2c93 |

Questions about Testing and Model Configuration

Hello! I have two questions regarding your project:

1. Functional Testing Coverage

I notice that only performance tests are visible in the documentation/repository. How do you ensure functional correctness without explicit functional tests? Could you provide more details about your testing strategy for functional validation?

2. Model Variants Clarification

You reference the model: https://huggingface.co/CobraMamba/mamba-gpt-3b-v4

However, in your test results, I see references to:

  • llama 3B Q4_0_8_8
  • llama 3B Q4_0

These variants don't appear to be directly available in the linked model repository. Could you clarify:

  • How these specific quantized versions were obtained?
  • Are these custom quantizations you performed?
  • If so, could you share the quantization process or provide links to these model variants?

Thank you for your time and assistance!

xctan (Collaborator, Author) commented on Aug 14, 2025

> > Model: https://huggingface.co/CobraMamba/mamba-gpt-3b-v4
> > Compiler: GCC 13.2.0
> >
> > | model | size | params | backend | threads | test | t/s | speedup | commit |
> > | --- | --- | --- | --- | --- | --- | --- | --- | --- |
> > | llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | pp512 | 1.82 ± 0.00 | 271% | 78c78e2 |
> > | llama 3B Q4_0 | 1.84 GiB | 3.43 B | CPU | 8 | pp512 | 1.04 ± 0.00 | 112% | 66c2c93 |
> > | llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | pp512 | 0.49 ± 0.00 | | 66c2c93 |
> > | llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | tg128 | 2.25 ± 0.10 | 350% | 78c78e2 |
> > | llama 3B Q4_0 | 1.84 GiB | 3.43 B | CPU | 8 | tg128 | 1.27 ± 0.03 | 154% | 66c2c93 |
> > | llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | tg128 | 0.50 ± 0.01 | | 66c2c93 |
>
> Questions about Testing and Model Configuration
>
> Hello! I have two questions regarding your project:
>
> 1. Functional Testing Coverage
>
> I notice that only performance tests are visible in the documentation/repository. How do you ensure functional correctness without explicit functional tests? Could you provide more details about your testing strategy for functional validation?

My apologies for the confusion; my test was a bit coarse. Functional correctness was validated by examining the model's output both intuitively (whether the generated text is coherent) and statistically (by comparing the perplexity metric before and after the optimization). I remember the perplexity results were very close, but the original test data has been lost.

P.S. The reference perplexity can be computed on a faster machine, so you don't need to wait another painful ~400 hours.

> 2. Model Variants Clarification
>
> You reference the model: https://huggingface.co/CobraMamba/mamba-gpt-3b-v4
>
> However, in your test results, I see references to:
>
>   • llama 3B Q4_0_8_8
>   • llama 3B Q4_0
>
> These variants don't appear to be directly available in the linked model repository. Could you clarify:
>
>   • How these specific quantized versions were obtained?
>   • Are these custom quantizations you performed?
>   • If so, could you share the quantization process or provide links to these model variants?
>
> Thank you for your time and assistance!

These quantized versions were produced with the quantization tool in this llama.cpp project: convert the original model to GGUF format, then quantize it to the desired formats. You might need to check out an older version of llama.cpp, because the manual Q4_0_8_8 repack format was deprecated later and the repacking process was made automatic as a new backend. I haven't checked the Q4_0 format since the automatic-repacking patch was merged, and have focused on K-quant types instead.
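Roughly, the steps look like this; the clone URL and tool names follow current llama.cpp conventions, while the checked-out revision, paths, and output filenames are placeholders rather than the exact commands I ran:

```sh
# Sketch of producing the Q4_0 and Q4_0_8_8 GGUF files used in the benchmark above.
# The checked-out revision, paths and output names are placeholders.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git checkout <revision-before-the-automatic-repack-change>   # Q4_0_8_8 was removed later
cmake -B build && cmake --build build -j

# 1. Convert the original Hugging Face model to a GGUF file.
python convert_hf_to_gguf.py /path/to/mamba-gpt-3b-v4 \
    --outtype f16 --outfile mamba-gpt-3b-v4.f16.gguf

# 2. Quantize it to the plain and repacked Q4_0 formats.
./build/bin/llama-quantize mamba-gpt-3b-v4.f16.gguf mamba-gpt-3b-v4.Q4_0.gguf     Q4_0
./build/bin/llama-quantize mamba-gpt-3b-v4.f16.gguf mamba-gpt-3b-v4.Q4_0_8_8.gguf Q4_0_8_8
```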
