Conversation

@danbev (Member) commented Nov 4, 2025

This is a work in progress to add support for backend (like GPU) sampling.

The motivation for this feature is to enable sampling to be performed directly on the backend as part of the computation graph being executed, so that some or all of the sampling can be done on the device.

For example, the backend sampler chain might select/sample a token directly, in which case only the sampled token needs to be transferred from device memory to host memory.

It is also possible for the backend samplers to perform filtering of the logits, or to compute and filter the probability distribution, in which case only the filtered logits or probabilities need to be transferred back to system memory for further processing by CPU samplers.

Currently, backend sampling works in a similar manner to pooling: a function is called by build_graph and the sampler operations become part of the model's computation graph.

Backend samplers can be configured by creating sampler chains, where each sampler chain is associated with a specific sequence id:

    struct llama_sampler_chain_params params = llama_sampler_chain_default_params();
    struct llama_sampler * chain = llama_sampler_chain_init(params);
    llama_sampler_chain_add(chain, llama_sampler_backend_init_greedy());
    std::vector<llama_sampler_seq_config> sampler_configs = {
        { 0, chain }
    };

The struct is defined as:

    struct llama_sampler_seq_config {
        llama_seq_id           seq_id;
        struct llama_sampler * sampler;
    };

These sampler configs are then passed as context params:

    llama_context_params cparams = llama_context_default_params();
    cparams.samplers = sampler_configs.data();
    cparams.n_samplers = sampler_configs.size();
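
A hedged sketch of how these params are then used: the sampler configs travel with the context params when the context is created (model loading omitted; llama_init_from_model is the current context-creation API and is assumed unchanged by this PR):

    // sketch: create the context with backend sampler configs attached
    llama_context * ctx = llama_init_from_model(model, cparams);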

When the model graph is built, the backend samplers are called so that they can add their operations to the graph:

    ggml_cgraph * llama_model::build_graph(const llm_graph_params & params) const {
        std::unique_ptr<llm_graph_context> llm;
        ...

        // add backend sampling layers (if any)
        llm->build_sampling(*this, params);

The llama_sampler_i interface has been extended with four new methods in the API, all currently named with a _ggml suffix to indicate that they are for backend sampling:

        void (*init_ggml)     (struct llama_sampler * smpl,
                               ggml_backend_buffer_type_t buft);

        void (*set_input_ggml)(struct llama_sampler * smpl,
                               ggml_context * ctx,
                               ggml_cgraph * gf);

        void (*apply_ggml)    (struct llama_sampler * smpl,
                               ggml_context * ctx,
                               ggml_cgraph * gf,
                               llama_sampler_ggml_data * ggml_data);

        void (*accept_ggml)   (struct llama_sampler * smpl,
                               ggml_context * ctx,
                               ggml_cgraph * gf,
                               struct ggml_tensor * selected_token);

The init_ggml function allows backend samplers to create any input tensors that they might need. The provided ggml_backend_buffer_type_t should be used when creating these tensors so that they are allocated with the same backend buffer type as the output logits. This avoids splits in the computation graph that would require data transfers between different backends.
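
Below is a minimal sketch of what an init_ggml implementation could look like. The example_sampler_ctx struct and its members are hypothetical illustrations, not the names used in this PR:

    // hypothetical per-sampler state (not the struct used in this PR)
    struct example_sampler_ctx {
        int32_t                 n_vocab;
        ggml_context          * ctx    = nullptr;
        ggml_backend_buffer_t   buf    = nullptr;
        ggml_tensor           * t_bias = nullptr;
    };

    static void example_sampler_init_ggml(struct llama_sampler * smpl, ggml_backend_buffer_type_t buft) {
        auto * sctx = (example_sampler_ctx *) smpl->ctx;

        // tiny ggml context that only holds tensor metadata (no_alloc = true)
        ggml_init_params ip = { ggml_tensor_overhead() * 4, nullptr, true };
        sctx->ctx = ggml_init(ip);

        // input tensor for per-token bias values; allocating it with the same
        // buffer type as the output logits avoids graph splits
        sctx->t_bias = ggml_new_tensor_1d(sctx->ctx, GGML_TYPE_F32, sctx->n_vocab);
        ggml_set_input(sctx->t_bias);
        sctx->buf = ggml_backend_alloc_ctx_tensors_from_buft(sctx->ctx, buft);
    }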

The set_input_ggml function is called after the computation graph has been scheduled but before it is computed. This allows the backend sampler to set any input for the tensors it created in init_ggml.

The apply_ggml function is where a backend sampler adds its operations to the graph: when the graph is built, each configured sampler's apply_ggml function is called, allowing it to add operations/nodes to the computation graph.

The accept_ggml function allows backend samplers to update their tensor state if needed.
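
To make the flow concrete, here is a hedged sketch of an apply_ggml implementation for a greedy sampler. The llama_sampler_ggml_data field names used below (logits, sampled_token) are assumptions for illustration, not necessarily the names used in this PR:

    static void example_greedy_apply_ggml(
            struct llama_sampler    * smpl,
            ggml_context            * ctx,
            ggml_cgraph             * gf,
            llama_sampler_ggml_data * ggml_data) {
        GGML_UNUSED(smpl);

        // argmax over the (possibly already filtered) logits selects the token
        struct ggml_tensor * selected = ggml_argmax(ctx, ggml_data->logits); // assumed field
        ggml_build_forward_expand(gf, selected);

        // hand the selected token tensor back so that only it needs to be
        // copied from device to host after the graph has been computed
        ggml_data->sampled_token = selected; // assumed field
    }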

This enables sampling to happen fully or partially on the backend. The samplers could sample a single token, in which case only that token will be transferred from device memory to host memory after llama_decode has been called. The sampled token can then be retrieved using:

    llama_token id = llama_get_backend_sampled_token_ith(test_ctx.ctx, index);
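
As a minimal (hedged) sketch, a decode-and-retrieve step with full backend sampling could look like this, assuming a single sequence and output index 0:

    if (llama_decode(ctx, batch) == 0) {
        // only the sampled token was copied from device memory to host memory
        llama_token id = llama_get_backend_sampled_token_ith(ctx, 0);
        // feed `id` back into the next batch as usual
    }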

It is also possible to run a backend sampler that only filters the logits; then only the filtered logits are transferred back to the host, and sampling can proceed on the CPU with the normal (CPU) sampler chain. In this case the CPU samplers are configured as usual, but they will now operate on already-filtered logits.

Similar to the handling of logits above, it is possible for a backend sampler to compute the full probability distribution and transfer that to the host. The CPU samplers can then operate on those probabilities.
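
A hedged sketch of such a hybrid setup is shown below. The name llama_sampler_backend_init_top_k is assumed by analogy with the greedy initializer above:

    // backend chain: only filter the logits on the device (top-k)
    struct llama_sampler * backend_chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(backend_chain, llama_sampler_backend_init_top_k(40)); // assumed name

    // regular CPU chain: finishes the sampling on the filtered logits
    struct llama_sampler * cpu_chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(cpu_chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));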

Building and running the tests

Download a model for testing:

$ cd models && wget https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories15M-q4_0.gguf

Building the test:

$ cmake --build build --target test-backend-sampler -j8

Running all tests:

$ env LLAMACPP_TEST_MODELFILE=../models/stories15M-q4_0.gguf \
    ctest --test-dir build -R '^test-backend-sampler$' -V

The following individual tests are available:

$ ctest --test-dir build-gpu-sampler/ -N -R test-backend-sampler-
Internal ctest changing into directory: /home/danbev/work/ai/llama.cpp-debug/build-gpu-sampler
Test project /home/danbev/work/ai/llama.cpp-debug/build-gpu-sampler
  Test #36: test-backend-sampler-greedy
  Test #37: test-backend-sampler-temp
  Test #38: test-backend-sampler-top_k
  Test #39: test-backend-sampler-dist
  Test #40: test-backend-sampler-dist-and-cpu
  Test #41: test-backend-sampler-logit-bias
  Test #42: test-backend-sampler-mul_seq
  Test #43: test-backend-sampler-set-sampler

Total Tests: 8

These can be run individually, for example:

$ env LLAMACPP_TEST_MODELFILE=../models/stories15M-q4_0.gguf \
    ctest --test-dir build -R 'test-backend-sampler-temp' -V

llama-cli

Initial support for llama-cli has been added and can be used as follows:

$ export GGML_SCHED_DEBUG=2
$ ./build/bin/llama-cli -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \
    -p "What is the Capital of Sweden?" \
    --backend-sampling \
    --backend-dist \
    -ngl 99 \
    -no-cnv \
    -n 20 \
    --no-warmup

(To print the backend scheduler's assignments, add -v/--verbose to the above command in combination with GGML_SCHED_DEBUG.)

llama-server

Backend sampling can be enabled using the following global configuration command-line options:

$ ./build/bin/llama-server --help
...
----- sampling params -----
...
--backend-sampling                      enable backend sampling (default: disabled)
--backend-dist                          perform final (distribution) sampling on backend (default: disabled)

Usage:

$ export GGML_SCHED_DEBUG=2
$ ./build/bin/llama-server \
      -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \
      --backend-sampling \
      --temp 0.8 \
      --top-k 40 \
      -ngl 50

(To print the backend scheduler's assignments, add -v/--verbose to the above command in combination with GGML_SCHED_DEBUG.)

It is then possible to send backend sampling request parameters as follows:

curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "What is the capital of Sweden?","n_predict": 20, "top_k": 40, "backend_dist": true}'

The backend_dist option will cause the dist backend sampler to sample a token. Without setting it, the CPU samplers will process the filtered logits that the backend samplers produced.

To enable testing with the webui, the following setting has been added:
backend-sampling

TODO

  • Allocate backend sampler tensors on the same backend as the logits (dev_output.dev)
  • Allow backend samplers to pre-allocate state tensors
  • Integrate backend samplers with llama-cli
  • Set/unset backend samplers
  • Integrate backend samplers with llama-server
  • Add more tests/assertions for the backend samplers to check more cases
  • Rename from sampling to sampler.
  • Consistent and clearer naming of backend (backend sampling) functions and data types. Perhaps something like llama_get_backend_sampled_token_ith?
  • Penalties samplers (to figure out/verify how accept_ggml should work). Will be done in a follow-up PR.
  • Add ggml_cumsum operation to CUDA backend. This operation exists for Metal and CPU already.

Implemented GPU samplers

  • temp
  • logit_bias
  • top_k (Not fully supported on all backends, see note below regarding argsort)
  • greedy
  • dist sampler

Remaining backend samplers

The list below shows the CPU samplers that currently exist. Not all of these may be appropriate as backend samplers. They will be implemented in separate follow-up PRs.

  • top_p
  • min_p
  • typical
  • temp_ext
  • xtc
  • top_n_sigma
  • mirostat/mirostat_v2
  • penalties
  • dry
  • infill

@github-actions bot added the testing (Everything test related) label Nov 4, 2025
@am17an (Collaborator) commented Nov 5, 2025

One place this would be useful immediately is the diffusion-cli. I'm happy to test this when it's ready

@danbev force-pushed the gpu-sampling branch 2 times, most recently from 71b0e3d to c82b67b on November 6, 2025 06:14
@github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Nov 6, 2025
@danbev force-pushed the gpu-sampling branch 2 times, most recently from 56bca5e to 5d18032 on November 6, 2025 06:27
@danbev force-pushed the gpu-sampling branch 7 times, most recently from f49a857 to 7c6dc02 on November 11, 2025 12:05
@danbev force-pushed the gpu-sampling branch 4 times, most recently from 1168c22 to 9609e7e on November 12, 2025 13:10
@ORippler (Contributor) left a comment

Not sure if I have a strong opinion on this, but removing hybrid sampling would reduce the complexity a bit I think (basically, if we always set --gpu-dist we only have two states: either full GPU sampling or full CPU sampling, and no in-between).

@danbev (Member, Author) commented Nov 13, 2025

Not sure if I have a strong opinion on this, but removing hybrid sampling would reduce the complexity a bit I think (basically, if we always set --gpu-dist we only have two states: either full GPU sampling or full CPU sampling, and no in-between).

My thinking is that we should keep the hybrid approach even though it does come with some additional complexity, like you say. I think there could be use cases where one might want to perform some sampling, like temp/logit_bias/top-k, on the device, and then only have a smaller set of logits copied to host memory, while still enabling other CPU samplers, including grammars, to process the logits.

This might turn out to be an incorrect assumption and not something anyone wants to use, but it feels safer to have the ability to do hybrid sampling.

@ggerganov (Member) commented:

@danbev Let's rebase on latest master to pick up the recent changes.

@danbev force-pushed the gpu-sampling branch 2 times, most recently from 0730c19 to b2370c7 on November 16, 2025 07:16
@danbev changed the title from "sampling : add support for backend sampling (wip)" to "sampling : add support for backend sampling" on Nov 17, 2025
This commit enables all existing backend sampler tests in
test-backend-sampler. Previously, some tests were disabled because some
ggml operation implementations were missing.
@danbev marked this pull request as ready for review November 18, 2025 08:31
    }
} else {
    for (llama_token token_id = 0; token_id < (int) sampled_probs_count; token_id++) {
        cur.emplace_back(llama_token_data{token_id, 0.0f, sampled_probs[token_id]});
Member:
Should we populate the logits here too?

Member Author:
I was not sure if we should do that or not. My reasoning for not populating them is that this could indicate to the CPU samplers that probabilities have already been generated, and that it would be possible for them to skip that step. But if I recall correctly the CPU samplers will actually need the logits as they will recompute the probabilities. This is something brought up in #16241.

But I'll update this so we populate the logits as well so this does not break the CPU samplers (at the moment we don't have any backend samplers that produce probabilities).

Member Author:
Added 51fee29 to address this.

Comment on lines 145 to 153
if (sampled_ids != nullptr) {
    for (uint32_t i = 0; i < sampled_logits_count; i++) {
        cur.emplace_back(llama_token_data{sampled_ids[i], sampled_logits[i], 0.0f});
    }
} else {
    for (llama_token token_id = 0; token_id < (int) sampled_logits_count; token_id++) {
        cur.emplace_back(llama_token_data{token_id, sampled_logits[token_id], 0.0f});
    }
}
Member:
Can we simplify the logic to always have sampled_ids defined?

When the vocabulary is not filtered, we don't want to copy this buffer from the device to the host, so it should probably be initialized by default to contain the full vocab.

Member Author:
I've added 82957a9 to address this.

Comment on lines 1299 to 1304
std::unordered_map<llama_seq_id, int32_t> seq_to_idx;
for (uint32_t i = 0; i < ubatch.n_tokens; i++) {
    if (ubatch.output[i]) {
        llama_seq_id seq_id = ubatch.seq_id[i][0];
        seq_to_idx[seq_id] = i;
    }
Member:
Here we assume that there is only one output token per sequence. We should assert this. Maybe the batch allocator has to throw an error if we try to run multi-output batches with backend sampling.

Member Author:
I've added 311c1a3 to address this.

Comment on lines 1307 to 1342
// extract sampled tokens
for (const auto & [seq_id, t_token] : res->t_sampled_tokens) {
auto idx_it = seq_to_idx.find(seq_id);
GGML_ASSERT(idx_it != seq_to_idx.end());
const int32_t idx = idx_it->second;
ggml_backend_t backend = ggml_backend_sched_get_tensor_backend(sched.get(), t_token);
ggml_backend_tensor_get_async(backend, t_token, &sampled_tokens_map[idx], 0, sizeof(llama_token));
}

for (const auto & [seq_id, t_ids] : res->t_sampled_token_ids) {
auto idx_it = seq_to_idx.find(seq_id);
GGML_ASSERT(idx_it != seq_to_idx.end());
const int32_t idx = idx_it->second;
ggml_backend_t backend = ggml_backend_sched_get_tensor_backend(sched.get(), t_ids);
sampled_token_ids_map[idx].resize(ggml_nelements(t_ids));
ggml_backend_tensor_get_async(backend, t_ids, sampled_token_ids_map[idx].data(), 0, ggml_nbytes(t_ids));
}

if (res->t_sampled_tokens.empty()) {
for (const auto & [seq_id, t_logits] : res->t_sampled_logits) {
auto idx_it = seq_to_idx.find(seq_id);
GGML_ASSERT(idx_it != seq_to_idx.end());
const int32_t idx = idx_it->second;
ggml_backend_t backend = ggml_backend_sched_get_tensor_backend(sched.get(), t_logits);
sampled_logits_map[idx].resize(ggml_nelements(t_logits));
ggml_backend_tensor_get_async(backend, t_logits, sampled_logits_map[idx].data(), 0, ggml_nbytes(t_logits));
}

if (n_outputs) {
GGML_ASSERT( n_outputs_prev + n_outputs <= n_outputs_all);
GGML_ASSERT((n_outputs_prev + n_outputs)*n_vocab <= (int64_t) logits_size);
ggml_backend_tensor_get_async(backend_res, t_logits, logits_out, 0, n_outputs*n_vocab*sizeof(float));
// extract sampled probabilities
for (const auto & [seq_id, t_probs] : res->t_sampled_probs) {
auto idx_it = seq_to_idx.find(seq_id);
GGML_ASSERT(idx_it != seq_to_idx.end());
const int32_t idx = idx_it->second;
ggml_backend_t backend = ggml_backend_sched_get_tensor_backend(sched.get(), t_probs);
sampled_probs_map[idx].resize(ggml_nelements(t_probs));
ggml_backend_tensor_get_async(backend, t_probs, sampled_probs_map[idx].data(), 0, ggml_nbytes(t_probs));
Member:
This logic is a bit cumbersome to read - we need to express it in a simpler way.

Member Author:
I'll take a look at simplifying this.

Member Author:
I've made an attempt at simplifying this in 7e98ebc.

    }

    if (ggml_data.filtered_ids != nullptr) {
        res->t_sampled_token_ids[seq_id] = ggml_data.filtered_ids;
Member:
These names are inconsistent:

  • sampled_tokens <> sampled_token
  • sampled_token_ids <> filtered_ids

I'll do a rename pass of the llama_sampler_ggml_data and related.

danbev and others added 3 commits November 18, 2025 15:11
This commit precomputes and caches the full-vocab token id list in
llama_context's constructor, so llama_get_backend_sampled_token_ids_ith
always returns a valid pointer.

The motivation for this is that it enables both common/sampling.cpp
and src/llama-sampling.cpp to simplify their logic.

Not all backend samplers that process logits need to set the sampled
token ids, as they may not change the order of the logits; for example,
the temperature sampler only scales the logits but does not change
their order. Similarly, the logit bias sampler only adds a bias to
specific token ids but does not change the order of the logits. In
these cases there will not be a device-to-host copy of the sampled
token ids, and this is the use case where having this precomputed
list is useful.
This commit adds a check in the batch allocator to ensure that when
backend sampling is enabled, at most one output token is specified per
sequence.
Argsort is currently used for top-k. We optimize argsort in two ways:

1. Use `DeviceRadixSort` for single-row/sequence to parallelize it
   across our SMs
2. Use `DeviceSegmentedSort` for multi-row/sequence as this is the
   correct entry point (the function chooses between different execution
   paths; it contains `DeviceSegmentedRadixSort` as one of the paths and
   will choose the best one according to heuristics).
   https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceSegmentedSort.html#overview
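
A hedged sketch of this dispatch (not the actual llama.cpp CUDA kernel; the usual two-pass CUB temp-storage sizing is omitted, and d_offsets is assumed to hold nrows+1 precomputed row offsets, i.e. i * ncols):

```
#include <cub/cub.cuh>

static void argsort_rows_desc(const float * d_keys_in, float * d_keys_out,
                              const int   * d_vals_in, int   * d_vals_out,
                              const int   * d_offsets,
                              int ncols, int nrows,
                              void * d_temp, size_t temp_bytes,
                              cudaStream_t stream) {
    if (nrows == 1) {
        // single row: DeviceRadixSort spreads the work across all SMs
        cub::DeviceRadixSort::SortPairsDescending(
            d_temp, temp_bytes, d_keys_in, d_keys_out,
            d_vals_in, d_vals_out, ncols, 0, sizeof(float) * 8, stream);
    } else {
        // multiple rows: one segment per row, CUB picks the best path internally
        cub::DeviceSegmentedSort::SortPairsDescending(
            d_temp, temp_bytes, d_keys_in, d_keys_out,
            d_vals_in, d_vals_out, ncols * nrows, nrows,
            d_offsets, d_offsets + 1, stream);
    }
}
```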

Some perf numbers for a RTX PRO 6000:

On the kernel level, tested with
`GGML_CUDA_DISABLE_GRAPHS=1 ./test-backend-ops -o ARGSORT perf`
Before:
```
  ARGSORT(type=f32,ne=[65000,16,1,1],order=0):                  4130 runs -   359.24 us/run
  ARGSORT(type=f32,ne=[200000,1,1,1],order=0):                  8192 runs -   861.34 us/run
  ARGSORT(type=f32,ne=[200000,16,1,1],order=0):                 1343 runs -  1020.01 us/run
```

After:
```
  ARGSORT(type=f32,ne=[65000,16,1,1],order=0):                  4130 runs -   312.41 us/run
  ARGSORT(type=f32,ne=[200000,1,1,1],order=0):                 16384 runs -    63.48 us/run
  ARGSORT(type=f32,ne=[200000,16,1,1],order=0):                 1343 runs -   874.36 us/run
```

---
On the model level, tested with
`llama-cli -m gpt-oss-20b-mxfp4.gguf -n 200 -p "What is
the Capital of Sweden?" -no-cnv -fa 1 --backend-sampling`

Before:
```
llama_perf_sampler_print:    sampling time =       0.25 ms /   207 runs   (    0.00 ms per token, 824701.20 tokens per second)
llama_perf_context_print:        load time =   18215.58 ms
llama_perf_context_print: prompt eval time =      28.20 ms /     7 tokens (    4.03 ms per token,   248.19 tokens per second)
llama_perf_context_print:        eval time =     714.79 ms /   199 runs   (    3.59 ms per token,   278.40 tokens per second)
llama_perf_context_print:       total time =     857.62 ms /   206 tokens
```

After
```
llama_perf_sampler_print:    sampling time =       0.25 ms /   207 runs   (    0.00 ms per token, 828000.00 tokens per second)
llama_perf_context_print:        load time =   18366.92 ms
llama_perf_context_print: prompt eval time =      35.92 ms /     7 tokens (    5.13 ms per token,   194.87 tokens per second)
llama_perf_context_print:        eval time =     532.79 ms /   199 runs   (    2.68 ms per token,   373.50 tokens per second)
llama_perf_context_print:       total time =     683.65 ms /   206 tokens
```
@ORippler (Contributor) commented:

Based on some llama-cli-based benchmarking I did in 26be108, I feel the timings reported by llama_perf_context_print may be off.

For optimized argsort, we get

llama_perf_sampler_print:    sampling time =       0.25 ms /   207 runs   (    0.00 ms per token, 828000.00 tokens per second)
llama_perf_context_print:        load time =   18366.92 ms
llama_perf_context_print: prompt eval time =      35.92 ms /     7 tokens (    5.13 ms per token,   194.87 tokens per second)
llama_perf_context_print:        eval time =     532.79 ms /   199 runs   (    2.68 ms per token,   373.50 tokens per second)
llama_perf_context_print:       total time =     683.65 ms /   206 tokens
llama_perf_context_print:    graphs reused =        198

For non-optimized argsort

llama_perf_sampler_print:    sampling time =       0.25 ms /   207 runs   (    0.00 ms per token, 824701.20 tokens per second)
llama_perf_context_print:        load time =   18215.58 ms
llama_perf_context_print: prompt eval time =      28.20 ms /     7 tokens (    4.03 ms per token,   248.19 tokens per second)
llama_perf_context_print:        eval time =     714.79 ms /   199 runs   (    3.59 ms per token,   278.40 tokens per second)
llama_perf_context_print:       total time =     857.62 ms /   206 tokens
llama_perf_context_print:    graphs reused =        198

and for CPU-sampling

llama_perf_sampler_print:    sampling time =      19.57 ms /   207 runs   (    0.09 ms per token, 10579.58 tokens per second)
llama_perf_context_print:        load time =   18254.54 ms
llama_perf_context_print: prompt eval time =      23.96 ms /     7 tokens (    3.42 ms per token,   292.10 tokens per second)
llama_perf_context_print:        eval time =     529.06 ms /   199 runs   (    2.66 ms per token,   376.14 tokens per second)
llama_perf_context_print:       total time =     914.23 ms /   206 tokens
llama_perf_context_print:    graphs reused =        198

Basically, total time is behaving as expected, but I'd have thought sampling time + prompt eval time + eval time would come somewhat close to it. This gap is especially large for CPU-based sampling.

This commit removes the version field from the sampler chain and instead
uses the sampler pointer itself for change detection.
This commit updates common/sampler.cpp set_logits and
src/llama-sampling.cpp llama_sampler_sample to always populate the
logits field when backend sampled probabilities are available.

The motivation for this is that it ensures that CPU samplers always have
access to the logit values even when probabilities have been produced by
backend samplers.
This commit tries to simplify the backend sampling logic in
llama_context::decode.
@ggerganov (Member) commented:

@danbev 7e98ebc might have introduced a bug - I'm getting gibberish with backend sampling disabled.

I'd have thought sampling time + prompt eval time + eval time would come somewhat close to it.

@ORippler They should. Is the CPU-sampling gap so large even on master?

@danbev (Member, Author) commented Nov 19, 2025

@danbev 7e98ebc might have introduced a bug - I'm getting gibberish with backend sampling disabled.

Sorry about that, I'll look into it.

It should be producing normal output now, but I think I found another bug. Sometimes llama-cli will output [end of text] directly without sampling anything, and this can happen with and without backend sampling enabled. I'm looking into this now. Update: this also happens on master, so it might not be directly related to this PR.

Fix condition to check if backend actually sampled tokens, not just that
backend samplers are available.
@ORippler (Contributor) commented Nov 19, 2025

@ORippler They should. Is the CPU-sampling gap so large even on master?

The order in the numbers below is: total, eval, prompt eval, sampling.
p=7, n=200 on 26be108

>>> 914 - 529 - 24 - 19
342 (37%)

p=7,n=1000 on 26be108

>>> 3991 - 2631 - 23 - 92
1245 (31%)

p=7, n=200 on 6fd4f9536

>>> 713 - 527 - 24 - 18
144 (20%)

p=7, n=1000 on 6fd4f9536

>>> 3039 - 2640 - 23.6 - 94
281.4 (9%)

Timings are consistent across llama-cli invocations. Feels like we are missing something on both master and this PR (though for this PR it scales linearly).

The commit fixes a variable shadowing issue in the
`llama_context::decode` function which was introduced in a previous
refactoring.
@ggerganov mentioned this pull request Nov 19, 2025
…ring sampling

Apply the same changes to llama_sampler_sample in llama-sampling.cpp as
were applied in commit 38f408c.