
Conversation

xctan (Collaborator) commented on Oct 24, 2024

I've tested Mistral 7B in qemu, and it just worked. I'm still choosing a suitable 3B model for my dev board with only 4 GB of RAM (Banana Pi BPI-F3), so I can't give a performance evaluation yet; any help is welcome! BTW, Mistral 7B could run on the BPI-F3 with an additional 4 GB of swap enabled, but it was much slower than even qemu.
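For anyone who wants to reproduce the qemu setup, here is a minimal sketch of a cross-build plus user-mode run; the toolchain triple, CMake flags, VLEN, and model filename are assumptions about a typical setup, not the exact commands I used.

```sh
# Sketch of a RISC-V cross-build of llama.cpp run under qemu user-mode emulation.
# Toolchain name, flags, VLEN and model path are illustrative, not the exact setup.
cmake -B build-riscv \
    -DCMAKE_SYSTEM_NAME=Linux -DCMAKE_SYSTEM_PROCESSOR=riscv64 \
    -DCMAKE_C_COMPILER=riscv64-linux-gnu-gcc \
    -DCMAKE_CXX_COMPILER=riscv64-linux-gnu-g++ \
    -DCMAKE_C_FLAGS="-march=rv64gcv" -DCMAKE_CXX_FLAGS="-march=rv64gcv" \
    -DGGML_NATIVE=OFF
cmake --build build-riscv -j

# Run with the vector extension enabled (VLEN=256 here, matching the BPI-F3's RVV unit).
qemu-riscv64 -cpu rv64,v=true,vlen=256 -L /usr/riscv64-linux-gnu \
    ./build-riscv/bin/llama-cli -m mistral-7b.Q4_0_8_8.gguf -p "Hello" -n 64
```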

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Oct 24, 2024
xctan (Collaborator, Author) commented on Oct 24, 2024

Model: https://huggingface.co/CobraMamba/mamba-gpt-3b-v4
Compiler: GCC 13.2.0

| model | size | params | backend | threads | test | t/s | speedup | commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | pp512 | 1.82 ± 0.00 | 271% | 78c78e2 |
| llama 3B Q4_0 | 1.84 GiB | 3.43 B | CPU | 8 | pp512 | 1.04 ± 0.00 | 112% | 66c2c93 |
| llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | pp512 | 0.49 ± 0.00 | | 66c2c93 |
| llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | tg128 | 2.25 ± 0.10 | 350% | 78c78e2 |
| llama 3B Q4_0 | 1.84 GiB | 3.43 B | CPU | 8 | tg128 | 1.27 ± 0.03 | 154% | 66c2c93 |
| llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | tg128 | 0.50 ± 0.01 | | 66c2c93 |

xctan (Collaborator, Author) commented on Oct 30, 2024

Is there anything wrong with this PR? Should I provide more test results, e.g. from llama-perplexity? @ggerganov
I'm trying `./llama-perplexity -m <model_name> -f wikitext-2-raw/wiki.test.raw` for the Q4_0_8_8 GEMM implementation both with and without this PR, but it would take ~400 hours in total to finish.
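For reference, the run can be shortened by evaluating only a slice of the test set; something along these lines (the `--chunks` value and model filename are illustrative) should be enough to compare the two builds:

```sh
# Sketch: compare perplexity of the two builds on a small slice of wikitext-2.
# --chunks limits how many chunks of the test file are evaluated; the model name is an example.
./llama-perplexity -m mamba-gpt-3b-v4.Q4_0_8_8.gguf \
    -f wikitext-2-raw/wiki.test.raw --chunks 32 -t 8
```

Running the same command on both commits should report matching perplexity within noise if the new kernels are numerically sound.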

ggerganov merged commit fc83a9e into ggml-org:master on Oct 30, 2024 (51 of 52 checks passed)
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
* ggml : RISC-V vector gemv for q4_0_8x8

* ggml : Added WIP rvv q4_0_8x8 gemm

* ggml : Added initial implementation of rvv gemm

* ggml : optimize gemm to avoid register spillover

* ggml : Fix GCC rvv load alignment issue

* ggml : Format gemm rvv code

* ggml : Fix a typo in RVV q4_0_8_8 GEMM
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
* ggml : RISC-V vector gemv for q4_0_8x8

* ggml : Added WIP rvv q4_0_8x8 gemm

* ggml : Added initial implementation of rvv gemm

* ggml : optimize gemm to avoid register spillover

* ggml : Fix GCC rvv load alignment issue

* ggml : Format gemm rvv code

* ggml : Fix a typo in RVV q4_0_8_8 GEMM
ixgbe commented on Aug 14, 2025

> Model: https://huggingface.co/CobraMamba/mamba-gpt-3b-v4
> Compiler: GCC 13.2.0
>
> | model | size | params | backend | threads | test | t/s | speedup | commit |
> | --- | --- | --- | --- | --- | --- | --- | --- | --- |
> | llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | pp512 | 1.82 ± 0.00 | 271% | 78c78e2 |
> | llama 3B Q4_0 | 1.84 GiB | 3.43 B | CPU | 8 | pp512 | 1.04 ± 0.00 | 112% | 66c2c93 |
> | llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | pp512 | 0.49 ± 0.00 | | 66c2c93 |
> | llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | tg128 | 2.25 ± 0.10 | 350% | 78c78e2 |
> | llama 3B Q4_0 | 1.84 GiB | 3.43 B | CPU | 8 | tg128 | 1.27 ± 0.03 | 154% | 66c2c93 |
> | llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | tg128 | 0.50 ± 0.01 | | 66c2c93 |

Questions about Testing and Model Configuration

Hello! I have two questions regarding your project:

1. Functional Testing Coverage

I notice that only performance tests are visible in the documentation/repository. How do you ensure functional correctness without explicit functional tests? Could you provide more details about your testing strategy for functional validation?

2. Model Variants Clarification

You reference the model: https://huggingface.co/CobraMamba/mamba-gpt-3b-v4

However, in your test results, I see references to:

  • llama 3B Q4_0_8_8
  • llama 3B Q4_0

These variants don't appear to be directly available in the linked model repository. Could you clarify:

  • How these specific quantized versions were obtained?
  • Are these custom quantizations you performed?
  • If so, could you share the quantization process or provide links to these model variants?

Thank you for your time and assistance!

xctan (Collaborator, Author) commented on Aug 14, 2025

> > Model: https://huggingface.co/CobraMamba/mamba-gpt-3b-v4
> > Compiler: GCC 13.2.0
> >
> > | model | size | params | backend | threads | test | t/s | speedup | commit |
> > | --- | --- | --- | --- | --- | --- | --- | --- | --- |
> > | llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | pp512 | 1.82 ± 0.00 | 271% | 78c78e2 |
> > | llama 3B Q4_0 | 1.84 GiB | 3.43 B | CPU | 8 | pp512 | 1.04 ± 0.00 | 112% | 66c2c93 |
> > | llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | pp512 | 0.49 ± 0.00 | | 66c2c93 |
> > | llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | tg128 | 2.25 ± 0.10 | 350% | 78c78e2 |
> > | llama 3B Q4_0 | 1.84 GiB | 3.43 B | CPU | 8 | tg128 | 1.27 ± 0.03 | 154% | 66c2c93 |
> > | llama 3B Q4_0_8_8 | 1.84 GiB | 3.43 B | CPU | 8 | tg128 | 0.50 ± 0.01 | | 66c2c93 |
>
> Questions about Testing and Model Configuration
>
> Hello! I have two questions regarding your project:
>
> 1. Functional Testing Coverage
>
> I notice that only performance tests are visible in the documentation/repository. How do you ensure functional correctness without explicit functional tests? Could you provide more details about your testing strategy for functional validation?

My apologies for the confusion; my test was a bit coarse. Functional correctness was validated by examining the model's output both intuitively (whether the generated text is coherent) and statistically (by comparing the perplexity metric before and after the optimization). I remember the perplexity results were very close, but the original test data has been lost.

P.S. The reference perplexity can be computed on a faster machine, so you don't need to wait another painful ~400 hours.

> 2. Model Variants Clarification
>
> You reference the model: https://huggingface.co/CobraMamba/mamba-gpt-3b-v4
>
> However, in your test results, I see references to:
>
>   • llama 3B Q4_0_8_8
>   • llama 3B Q4_0
>
> These variants don't appear to be directly available in the linked model repository. Could you clarify:
>
>   • How these specific quantized versions were obtained?
>   • Are these custom quantizations you performed?
>   • If so, could you share the quantization process or provide links to these model variants?
>
> Thank you for your time and assistance!

These quantized versions were produced with the quantization tool in this llama.cpp project: convert the original model to GGUF format, then quantize it to the desired formats. You might need to check out an older version of llama.cpp, because the manual Q4_0_8_8 repack format was deprecated later and the repacking process was made automatic as a new backend. I haven't checked the Q4_0 format since the automatic-repacking patch was merged, and have focused on K-quant types instead.
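Roughly, the steps look like this; the clone URL and tool names follow current llama.cpp conventions, while the checked-out revision, paths, and output filenames are placeholders rather than the exact commands I ran:

```sh
# Sketch of producing the Q4_0 and Q4_0_8_8 GGUF files used in the benchmark above.
# The checked-out revision, paths and output names are placeholders.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git checkout <revision-before-the-automatic-repack-change>   # Q4_0_8_8 was removed later
cmake -B build && cmake --build build -j

# 1. Convert the original Hugging Face model to a GGUF file.
python convert_hf_to_gguf.py /path/to/mamba-gpt-3b-v4 \
    --outtype f16 --outfile mamba-gpt-3b-v4.f16.gguf

# 2. Quantize it to the plain and repacked Q4_0 formats.
./build/bin/llama-quantize mamba-gpt-3b-v4.f16.gguf mamba-gpt-3b-v4.Q4_0.gguf     Q4_0
./build/bin/llama-quantize mamba-gpt-3b-v4.f16.gguf mamba-gpt-3b-v4.Q4_0_8_8.gguf Q4_0_8_8
```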
