server: run sampling in a threadpool #24914

Draft
ngxson wants to merge 4 commits into master from xsn/server_multithread_sampling
@ngxson ngxson commented Jun 22, 2026

Collaborator

Overview

Ref discussion: #24843

  • Add server_threadpool that spawns n-1 threads (plus the main thread)
  • Add --threads-sampling argument to select the number of threads
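The "n-1 threads plus the main thread" arrangement can be pictured with a minimal sketch. Everything below is hypothetical: `toy_threadpool` and its members are illustrative names, not the PR's `server_threadpool`, and only the `run_all<T>(tasks, fn)` call shape is borrowed from the diff excerpt in this PR. It assumes one owning thread calls `run_all`, and that the caller joins the workers in draining each batch.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical sketch: n_threads - 1 persistent workers, plus the calling
// thread, which also processes items during run_all().
class toy_threadpool {
public:
    explicit toy_threadpool(int n_threads) {
        assert(n_threads >= 1);
        for (int i = 0; i < n_threads - 1; ++i) {
            workers.emplace_back([this] { worker_loop(); });
        }
    }

    ~toy_threadpool() {
        {
            std::lock_guard<std::mutex> lock(mtx);
            stop = true;
        }
        cv_start.notify_all();
        for (auto & w : workers) w.join();
    }

    // Apply fn to every element of tasks; returns once all are processed.
    template <typename T, typename F>
    void run_all(std::vector<T> & tasks, F fn) {
        {
            std::lock_guard<std::mutex> lock(mtx);
            job     = [&tasks, &fn](std::size_t i) { fn(tasks[i]); };
            n_items = tasks.size();
            next.store(0);
            n_busy  = workers.size();
            ++generation; // wakes every worker exactly once per batch
        }
        cv_start.notify_all();
        drain(); // the calling thread is the n-th worker
        std::unique_lock<std::mutex> lock(mtx);
        cv_done.wait(lock, [this] { return n_busy == 0; });
        job = nullptr; // drop references to the caller's stack
    }

private:
    // Claim item indices from a shared atomic counter until none are left.
    void drain() {
        for (std::size_t i = next.fetch_add(1); i < n_items; i = next.fetch_add(1)) {
            job(i);
        }
    }

    void worker_loop() {
        std::size_t seen = 0;
        for (;;) {
            {
                std::unique_lock<std::mutex> lock(mtx);
                cv_start.wait(lock, [&] { return stop || generation != seen; });
                if (stop) return;
                seen = generation;
            }
            drain();
            std::lock_guard<std::mutex> lock(mtx);
            if (--n_busy == 0) cv_done.notify_one();
        }
    }

    std::vector<std::thread>         workers;
    std::mutex                       mtx;
    std::condition_variable          cv_start, cv_done;
    std::function<void(std::size_t)> job;
    std::atomic<std::size_t>         next{0};
    std::size_t                      n_items    = 0;
    std::size_t                      n_busy     = 0;
    std::size_t                      generation = 0;
    bool                             stop       = false;
};
```

With `toy_threadpool pool(4);`, a call such as `pool.run_all<int>(values, [](int & x) { x *= 2; });` would update every element in place before returning. Handing out work through one atomic counter keeps the sketch short. A real pool may prefer a different scheduling scheme.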

Benchmark: RTX 5060 Ti + 12th Gen Intel(R) Core(TM) i7-12700KF

10 concurrent requests: roughly 5-6% faster token generation (tg goes from about 25.27 t/s to 26.75 t/s per slot in the logs below)

Baseline:

0.24.072.468 I slot print_timing: id 25 | task 4 | n_decoded =    481, tg =  25.27 t/s, tg_3s =  25.04 t/s
0.24.072.839 I slot print_timing: id 26 | task 5 | n_decoded =    481, tg =  25.27 t/s, tg_3s =  25.04 t/s
0.24.073.215 I slot print_timing: id 27 | task 3 | n_decoded =    481, tg =  25.28 t/s, tg_3s =  25.04 t/s
0.24.073.574 I slot print_timing: id 28 | task 2 | n_decoded =    481, tg =  25.28 t/s, tg_3s =  25.04 t/s

PR:

0.24.029.817 I slot print_timing: id 25 | task 5 | n_decoded =    502, tg =  26.75 t/s, tg_3s =  26.52 t/s
0.24.029.818 I slot print_timing: id 26 | task 4 | n_decoded =    502, tg =  26.75 t/s, tg_3s =  26.52 t/s
0.24.029.820 I slot print_timing: id 27 | task 3 | n_decoded =    502, tg =  26.75 t/s, tg_3s =  26.52 t/s
0.24.029.821 I slot print_timing: id 28 | task 1 | n_decoded =    502, tg =  26.75 t/s, tg_3s =  26.52 t/s

Requirements

@ngxson ngxson requested a review from a team as a code owner June 22, 2026 17:42
@ngxson ngxson changed the title Xsn/server multithread sampling server: run sampling in a threadpool Jun 22, 2026
@ngxson ngxson requested a review from a team as a code owner June 22, 2026 18:03
slot.task->params.sampling.preserved_tokens.find(token) != slot.task->params.sampling.preserved_tokens.end();
};

std::vector<sampling_task> smpl_tasks;

@ngxson ngxson Jun 22, 2026

Collaborator Author

Point for discussion: should sampling_task be a member of server_slot?

Comment thread common/common.h
Comment on lines +474 to +475
int sampling_n_threads = -1; // number of threads for sampling, used by server

Member

Can put it in common_params_sampling

Comment on lines +3804 to +3809
threadpool.run_all<sampling_task>(smpl_tasks, [](sampling_task & task) {
    if (task.slot) {
        task.sampled_id = common_sampler_sample(task.slot->smpl.get(),
                                                task.slot->ctx_tgt, task.tok_idx);
    }
});

Member

This is problematic because common_sampler_sample calls llama_context API that is not thread-safe.

Probably need a new overload for sampling from multiple samplers: common_sampler_sample(std::vector<common_sampler *> smpls, ...). And multi-thread just the llama_sampler_apply parts.
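One way to picture this suggestion is a two-phase batch: a serial phase that does all context access on the calling thread, then a parallel phase where each sampler works only on its own private copy of the candidates. The sketch below is an illustration under stated assumptions, not the proposed overload. `toy_context`, `toy_sampler`, `toy_request`, and `sample_batch` are hypothetical stand-ins, greedy argmax stands in for the per-sampler apply step, and plain `std::thread` is used instead of a shared pool.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-ins; none of these are llama.cpp types.
struct toy_context {
    std::vector<std::vector<float>> logits; // one row per output index
    // Treated as NOT thread-safe, mirroring the concern raised above.
    const std::vector<float> & get_logits(int idx) const { return logits.at(idx); }
};

struct toy_sampler {
    // Greedy pick over a private candidate buffer; touches no shared state.
    int apply(const std::vector<float> & cand) const {
        return int(std::max_element(cand.begin(), cand.end()) - cand.begin());
    }
};

struct toy_request {
    const toy_sampler * smpl;
    int                 tok_idx;
};

std::vector<int> sample_batch(const toy_context & ctx,
                              const std::vector<toy_request> & reqs,
                              int n_threads) {
    assert(n_threads >= 1);

    // Phase 1 (serial): every context access happens on the calling thread.
    std::vector<std::vector<float>> cand;
    cand.reserve(reqs.size());
    for (const auto & r : reqs) {
        cand.push_back(ctx.get_logits(r.tok_idx));
    }

    // Phase 2 (parallel): each thread takes a strided subset of requests and
    // writes only to its own result slots, so no locking is needed.
    std::vector<int>  out(reqs.size(), -1);
    const std::size_t stride = std::size_t(n_threads);
    auto work = [&](std::size_t first) {
        for (std::size_t i = first; i < reqs.size(); i += stride) {
            out[i] = reqs[i].smpl->apply(cand[i]);
        }
    };

    std::vector<std::thread> threads;
    for (int t = 1; t < n_threads; ++t) {
        threads.emplace_back(work, std::size_t(t));
    }
    work(0); // the calling thread handles the first stride
    for (auto & th : threads) th.join();
    return out;
}
```

The serial copy in phase 1 is the price of keeping the context on one thread. Whether the parallel phase wins that cost back depends on how expensive the sampler chain is per slot, which this sketch does not measure.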

Collaborator Author

Hmm, OK. I think that will require a common_threadpool that can be shared across multiple sampling calls, so that we can also reuse it for MTP sampling if needed.

@ngxson ngxson marked this pull request as draft June 25, 2026 17:06