
kompute : llama-bench support and ggml_cpu_has_kompute() #5226

Merged: 2 commits into master from ceb/kompute-llama-bench on Jan 31, 2024

Conversation

cebtenzzre (Collaborator)

I didn't realize that the Kompute backend should have been added in these places.
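For context, the ggml_cpu_has_* feature checks are simple compile-time switches; a minimal sketch of what ggml_cpu_has_kompute() could look like under that pattern, assuming the GGML_USE_KOMPUTE build flag selects the Kompute backend, is:

```c
// Sketch only: mirrors the existing ggml_cpu_has_* helpers,
// assuming GGML_USE_KOMPUTE is defined when the Kompute backend is built in.
int ggml_cpu_has_kompute(void) {
#if defined(GGML_USE_KOMPUTE)
    return 1;   // built with the Kompute backend
#else
    return 0;
#endif
}
```

llama-bench and common can then report Kompute availability the same way they report the other backends.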

slaren (Collaborator) left a comment

It doesn't really need to be added to ggml.c, since eventually all of the backend code will be removed from there. The llama.cpp change does nothing, because the CPU backend no longer runs at the same time as the GPU backends; it's just a leftover from the pre-ggml-backend implementation that I forgot to remove. Anyway, it doesn't really matter; the changes to llama-bench and common are good.

cebtenzzre (Collaborator, Author)

@slaren I removed that code from llama.cpp, does that seem right?

cebtenzzre removed the request for review from ggerganov on Jan 31, 2024 at 00:04
cebtenzzre merged commit e8dc55d into master on Jan 31, 2024
54 checks passed
cebtenzzre deleted the ceb/kompute-llama-bench branch on Jan 31, 2024 at 00:04
stduhpf (Contributor) commented Jan 31, 2024

Somehow this PR hurts performance for the other Vulkan backend? This doesn't make sense when I look at the changes, but the difference from the previous commit on master is very significant. (It's not just with llama-bench.)

Before merge:

ggml_vulkan: Using AMD Radeon RX 5700 XT | uma: 0 | fp16: 1 | warp size: 64

| model                  |     size |  params | backend | ngl | test   |           t/s |
| ---------------------- | -------: | ------: | ------- | --: | ------ | ------------: |
| llama 7B Q4_K - Medium | 4.07 GiB |  7.24 B | Vulkan  |  33 | pp 512 | 225.55 ± 1.52 |
| llama 7B Q4_K - Medium | 4.07 GiB |  7.24 B | Vulkan  |  33 | tg 128 |  43.33 ± 0.35 |

build: e0085fd (2026)

After merge:

ggml_vulkan: Using AMD Radeon RX 5700 XT | uma: 0 | fp16: 1 | warp size: 64

| model                  |     size |  params | backend | ngl | test   |            t/s |
| ---------------------- | -------: | ------: | ------- | --: | ------ | -------------: |
| llama 7B Q4_K - Medium | 4.07 GiB |  7.24 B | Vulkan  |  33 | pp 512 | 197.63 ± 16.47 |
| llama 7B Q4_K - Medium | 4.07 GiB |  7.24 B | Vulkan  |  33 | tg 128 |   21.54 ± 2.20 |

build: e8dc55d (2027)

EDIT: reverting 3536cf6 fixes it.

slaren (Collaborator) commented Jan 31, 2024

The only way I can imagine this could make a difference is if there is a large overhead for launching the extra threads for the get_rows operation that still runs on the CPU. Are you on Windows?

stduhpf (Contributor) commented Jan 31, 2024

I observe a similar drop in performance with the Kompute backend.

Latest master:

| model         |     size |  params | backend | ngl | test   |          t/s |
| ------------- | -------: | ------: | ------- | --: | ------ | -----------: |
| llama 7B Q4_0 | 3.56 GiB |  6.74 B | Kompute |  33 | pp 512 | 68.04 ± 1.70 |
| llama 7B Q4_0 | 3.56 GiB |  6.74 B | Kompute |  33 | tg 128 | 15.42 ± 2.18 |

With 3536cf6 reverted:

| model         |     size |  params | backend | ngl | test   |          t/s |
| ------------- | -------: | ------: | ------- | --: | ------ | -----------: |
| llama 7B Q4_0 | 3.56 GiB |  6.74 B | Kompute |  33 | pp 512 | 71.44 ± 0.25 |
| llama 7B Q4_0 | 3.56 GiB |  6.74 B | Kompute |  33 | tg 128 | 20.79 ± 2.75 |

stduhpf (Contributor) commented Jan 31, 2024

> The only way I can imagine this could make a difference is if there is a large overhead for launching the extra threads for the get_rows operation that still runs on the CPU. Are you on Windows?

Yes, on Windows 10

stduhpf (Contributor) commented Jan 31, 2024

Setting -t 1 manually works too.
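For example, a hypothetical invocation is `llama-bench -m <model.gguf> -ngl 33 -t 1`; the `-m` and `-ngl` values here are placeholders, and only the `-t 1` part is from the report above.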
