llama : move the sampling API from common into llama lib #5214

Open
ggerganov opened this issue Jan 30, 2024 · 8 comments
Labels
refactoring Refactoring

Comments

@ggerganov
Owner

There is functionality around llama_sampling_context that is currently part of common. We should move it into llama. Pretty much the entire API from common/sampling.h, except llama_sampling_params and llama_sampling_sample, can be integrated into the library.

This would probably also require merging the grammar parser into the llama lib implementation.

The llama_sampling_params and llama_sampling_sample will stay in common since they are very example-specific and not general-purpose enough to be merged.
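For reference, the rough shape this could take in llama.h - the function names and signatures below are only approximate, not the exact contents of common/sampling.h:

// create / free a sampling context that owns the sampling state
struct llama_sampling_context * llama_sampling_init(/* sampling parameters */);
void llama_sampling_free (struct llama_sampling_context * ctx_s);

// reset the state, e.g. when starting a new generation
void llama_sampling_reset(struct llama_sampling_context * ctx_s);

// record an accepted token - updates the repetition history and the grammar state
void llama_sampling_accept(struct llama_sampling_context * ctx_s, struct llama_context * ctx, llama_token id);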

@ggerganov ggerganov added the refactoring Refactoring label Jan 30, 2024
@slaren
Collaborator

slaren commented Jan 30, 2024

Is this meant as a short-term stopgap measure? If we are going to add a new sampling API to llama.cpp, it would be good to do this from the ground up with the possibility of GPU sampling in mind. The implementation in sampling.h does not seem flexible enough to do this.

@ggerganov
Owner Author

This change is more relevant for CPU-based sampling. There are many use cases that require managing sampling state (e.g. previously sampled tokens, grammar state, etc.), so it makes sense to add support for that directly in the core library.

I haven't thought deeply about GPU sampling support. Wouldn't it make more sense to have a limited number of GPU sampling options (such as greedy and top-k) as part of llama_context, since this would require changing the compute graph in the first place? I don't expect we can ever support GPU grammar sampling, for example, or even GPU repeat-penalty - is that a correct assumption?
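To illustrate what "sampling state" means here, a minimal sketch of the data such a context has to carry (the field names are illustrative only):

struct llama_sampling_context {
    // ring of previously sampled tokens, used by the repetition/frequency penalties
    std::vector<llama_token> prev;

    // current grammar state, when grammar-constrained sampling is enabled
    struct llama_grammar * grammar = nullptr;

    // scratch buffer holding the candidate tokens for the current step
    std::vector<llama_token_data> cur;
};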

@slaren
Collaborator

slaren commented Jan 30, 2024

It's clear that some samplers cannot have GPU implementations; however, this doesn't mean that we need two different APIs for GPU and CPU sampling. We could define samplers as abstract objects that may or may not hold state, and that may have ggml or CPU implementations. Then we would need to assemble a pipeline of sampler objects that can be run at the end of the model evaluation. If all the samplers have ggml implementations, the pipeline can run on the GPU; otherwise at least some parts would still run on the CPU. I think it is mostly a matter of designing a flexible enough interface.
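For illustration only, the kind of interface this suggests could look roughly like this (names and layout are made up, not a concrete proposal):

// a sampler stage: may hold state, and may provide a CPU and/or a ggml implementation
struct llama_sampler_i {
    // apply the sampler on the CPU, modifying the candidates in place
    void (*apply_cpu)(void * state, llama_token_data_array * candidates);

    // build the equivalent ggml ops into the graph; NULL if no GPU implementation exists
    struct ggml_tensor * (*apply_ggml)(void * state, struct ggml_context * gctx, struct ggml_tensor * logits);
};

// a sampling pipeline is an ordered list of such samplers; it runs fully on the
// GPU only if every stage provides apply_ggml, otherwise the tail runs on the CPU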

@ggerganov
Owner Author

Ok, I will give it further thought.

One way that comes to mind is something like this:

int32_t llama_decode_with_sampling(
            struct llama_context * ctx,
   struct llama_sampling_context * ctx_s,
              struct llama_batch   batch,
                     llama_token * result);

The llama_sampling_context can hold the information about the sampling pipeline together with the sampling state. So in that sense, merging llama_sampling_context into llama seems compatible with future GPU support - it just has to be extended with the sampler info.
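In terms of usage, this would reduce a typical generation loop to something like the sketch below (hypothetical, since llama_decode_with_sampling is only a proposal; ctx, ctx_s, batch, model, n_cur and n_len are assumed to be set up as in the existing simple example):

llama_token id = 0;

while (n_cur < n_len) {
    // decode the batch and sample the next token in a single call;
    // ctx_s carries the sampler configuration and state across iterations
    if (llama_decode_with_sampling(ctx, ctx_s, batch, &id) != 0) {
        break;
    }

    if (id == llama_token_eos(model)) {
        break;
    }

    // prepare the next single-token batch (llama_batch_clear/add are the existing common helpers)
    llama_batch_clear(batch);
    llama_batch_add  (batch, id, n_cur, { 0 }, true);

    n_cur += 1;
}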

@slaren
Collaborator

slaren commented Jan 30, 2024

It's also important to allow multiple evaluations to be queued together; that's one of the biggest advantages of GPU sampling. That can be done by making llama_decode_with_sampling asynchronous; however, the token output result would need to be removed, since obtaining it requires flushing the queue.
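One way to reconcile that - again only a sketch, none of these functions exist - is to make the decode+sample call asynchronous with no token output, and fetch sampled tokens through a separate call that synchronizes:

// enqueue decoding + sampling for the whole batch; does not synchronize
int32_t llama_decode_with_sampling_async(
            struct llama_context * ctx,
   struct llama_sampling_context * ctx_s,
              struct llama_batch   batch);

// fetch the token sampled for a given sequence; this is the synchronization
// point where the queued evaluations are flushed
llama_token llama_sampling_get_token(
   struct llama_sampling_context * ctx_s,
                    llama_seq_id   seq_id);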

@ggerganov
Owner Author

ggerganov commented Jan 30, 2024

Yes, this might get tricky when considering multiple sequences in the batch, but it seems doable.

Let me know if you have other concerns about merging llama_sampling_context - it seems like a step in the right direction even when considering GPU support.

If we do that, then the llama_sample_... API that currently exists in llama.h can be updated to take llama_sampling_context instead of candidates and last_tokens. This API will remain as a utility for users to do manual sampling, and can potentially be removed once llama_decode_with_sampling is fully implemented.
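Concretely, using top-k as an example, that could mean going from roughly the current declaration to something like the second one (the first line loosely mirrors the existing llama.h function; the second is a hypothetical name and shape):

// current style (simplified) - the caller owns the candidates buffer
void llama_sample_top_k(struct llama_context * ctx, llama_token_data_array * candidates, int32_t k, size_t min_keep);

// possible updated style - the sampling context owns the candidates and the token history
void llama_sampling_top_k(struct llama_context * ctx, struct llama_sampling_context * ctx_s, int32_t k, size_t min_keep);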

@github-actions
Contributor

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Mar 18, 2024
@cebtenzzre
Collaborator

Not stale.
