server: run sampling in a threadpool by ngxson · Pull Request #24914 · ggml-org/llama.cpp

ngxson · 2026-06-22T17:42:31Z

Overview

Ref discussion: #24843

Add server_threadpool, which spawns n-1 threads (plus the main thread)
Add a --threads-sampling argument to select the number of sampling threads
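
A minimal sketch of the "n-1 threads plus the main thread" idea, for readers who have not opened the diff. This is not the PR's server_threadpool: the class name sketch_threadpool and the helper square_all are hypothetical, and only the run_all<T>(tasks, fn) call shape is borrowed from the snippet quoted in the review. For brevity it spawns the helper threads on every call, whereas a real pool would presumably keep them alive between calls.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for the PR's server_threadpool (illustration only).
class sketch_threadpool {
public:
    explicit sketch_threadpool(int n_threads) : n_threads(n_threads) {
        assert(n_threads >= 1);
    }

    // Apply fn to every task; returns only after all tasks have been processed.
    template <typename T>
    void run_all(std::vector<T> & tasks, const std::function<void(T &)> & fn) {
        std::atomic<size_t> next{0};

        // Each thread repeatedly claims the next unprocessed index.
        auto worker = [&]() {
            for (size_t i = next.fetch_add(1); i < tasks.size(); i = next.fetch_add(1)) {
                fn(tasks[i]);
            }
        };

        // n-1 helper threads ...
        std::vector<std::thread> helpers;
        for (int t = 0; t < n_threads - 1; ++t) {
            helpers.emplace_back(worker);
        }
        // ... plus the calling ("main") thread, which takes its share too.
        worker();
        for (auto & th : helpers) {
            th.join();
        }
    }

private:
    int n_threads;
};

// Toy usage: square each value in place with 4 threads (3 helpers + caller).
inline std::vector<int> square_all(std::vector<int> values) {
    sketch_threadpool pool(4);
    pool.run_all<int>(values, [](int & v) { v *= v; });
    return values;
}
```

Letting the caller participate means that a pool of size 1 degenerates to plain sequential execution with no thread creation at all.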

Benchmark: RTX 5060 Ti + 12th Gen Intel(R) Core(TM) i7-12700KF

10 concurrent requests --> roughly 5% speed improvement

Baseline:

0.24.072.468 I slot print_timing: id 25 | task 4 | n_decoded =    481, tg =  25.27 t/s, tg_3s =  25.04 t/s
0.24.072.839 I slot print_timing: id 26 | task 5 | n_decoded =    481, tg =  25.27 t/s, tg_3s =  25.04 t/s
0.24.073.215 I slot print_timing: id 27 | task 3 | n_decoded =    481, tg =  25.28 t/s, tg_3s =  25.04 t/s
0.24.073.574 I slot print_timing: id 28 | task 2 | n_decoded =    481, tg =  25.28 t/s, tg_3s =  25.04 t/s

PR:

0.24.029.817 I slot print_timing: id 25 | task 5 | n_decoded =    502, tg =  26.75 t/s, tg_3s =  26.52 t/s
0.24.029.818 I slot print_timing: id 26 | task 4 | n_decoded =    502, tg =  26.75 t/s, tg_3s =  26.52 t/s
0.24.029.820 I slot print_timing: id 27 | task 3 | n_decoded =    502, tg =  26.75 t/s, tg_3s =  26.52 t/s
0.24.029.821 I slot print_timing: id 28 | task 1 | n_decoded =    502, tg =  26.75 t/s, tg_3s =  26.52 t/s
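
As a quick sanity check of the quoted improvement, the per-slot tg rates from the two logs above (25.27 t/s baseline, 26.75 t/s with the PR) give:

```latex
\frac{26.75}{25.27} \approx 1.059
\quad\Longrightarrow\quad
\text{about } 5.9\% \text{ faster token generation per slot}
```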

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: no

ngxson · 2026-06-22T18:07:31Z

                slot.task->params.sampling.preserved_tokens.find(token) != slot.task->params.sampling.preserved_tokens.end();
        };

+        std::vector<sampling_task> smpl_tasks;


Point for discussion: should sampling_task be a member of server_slot?

ggerganov · 2026-06-24T11:44:32Z

+    int sampling_n_threads = -1; // number of threads for sampling, used by server
+


This can go in common_params_sampling.

ggerganov · 2026-06-24T12:12:45Z

+            threadpool.run_all<sampling_task>(smpl_tasks, [](sampling_task & task) {
+                if (task.slot) {
+                    task.sampled_id = common_sampler_sample(task.slot->smpl.get(),
+                                            task.slot->ctx_tgt, task.tok_idx);
+                }
+            });


This is problematic because common_sampler_sample calls llama_context APIs that are not thread-safe.

We probably need a new overload for sampling from multiple samplers: common_sampler_sample(std::vector<common_sampler *> smpls, ...), and should multi-thread just the llama_sampler_apply parts.

ngxson · 2026-06-25

Hmm, OK. I think that will require a common_threadpool that can be shared across multiple sampling calls, so that we can also reuse it for MTP sampling if needed.
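
One way to picture the split suggested in this thread is a three-step batch: read everything that touches the context on a single thread, then fan out only the per-sampler work. The sketch below is a self-contained illustration under stated assumptions, not llama.cpp code: mock_context, mock_sampler, sample_request and sample_many are all hypothetical stand-ins, and the "apply" step is reduced to a greedy argmax rather than a real sampler chain.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <thread>
#include <vector>

// Hypothetical stand-ins, NOT the llama.cpp API.
struct mock_context {
    std::vector<std::vector<float>> logits; // one row per output index
    // Assumed to be unsafe to call concurrently, like the context accessors.
    const std::vector<float> & get_logits(int idx) const { return logits.at(idx); }
};

struct mock_sampler {
    // Stand-in for the "apply" step: pure CPU work on a private copy (greedy argmax here).
    int apply(const std::vector<float> & cand) const {
        assert(!cand.empty());
        return (int) std::distance(cand.begin(), std::max_element(cand.begin(), cand.end()));
    }
};

struct sample_request {
    const mock_sampler * smpl;
    int                  tok_idx;
};

std::vector<int> sample_many(const mock_context & ctx,
                             const std::vector<sample_request> & reqs,
                             int n_threads) {
    assert(n_threads >= 1);

    // Phase 1 (single thread): copy out everything that needs the context.
    std::vector<std::vector<float>> candidates;
    candidates.reserve(reqs.size());
    for (const auto & r : reqs) {
        candidates.push_back(ctx.get_logits(r.tok_idx));
    }

    // Phase 2 (parallel): thread t handles requests t, t + n_threads, ...
    // Each thread writes to distinct elements of `sampled`, so no locking is needed.
    std::vector<int> sampled(reqs.size(), -1);
    auto worker = [&](size_t first) {
        for (size_t i = first; i < reqs.size(); i += (size_t) n_threads) {
            sampled[i] = reqs[i].smpl->apply(candidates[i]);
        }
    };
    std::vector<std::thread> helpers;
    for (int t = 1; t < n_threads; ++t) {
        helpers.emplace_back(worker, (size_t) t);
    }
    worker(0); // the calling thread takes stride 0
    for (auto & th : helpers) {
        th.join();
    }

    // Phase 3 (single thread): results are returned in request order.
    return sampled;
}
```

The design point is that the context is never touched from a helper thread, which is the property the review asks for. Whether the real overload copies logits or shares a pool across calls is left open in the discussion.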

ngxson added 3 commits June 22, 2026 19:05

server: run sampling in a threadpool

fe03cce

wip

41ed530

working

c62fdd5

ngxson requested a review from a team as a code owner June 22, 2026 17:42

ngxson changed the title ~~Xsn/server multithread sampling~~ server: run sampling in a threadpool Jun 22, 2026

github-actions Bot added examples server labels Jun 22, 2026

add arg --threads-sampling

095058c

ngxson requested a review from a team as a code owner June 22, 2026 18:03

ngxson commented Jun 22, 2026

ggerganov reviewed Jun 24, 2026

ngxson marked this pull request as draft June 25, 2026 17:06

ngxson wants to merge 4 commits into master from xsn/server_multithread_sampling

2 participants
