llama : move the sampling API from common into llama lib #5214

Open
ggerganov opened this issue Jan 30, 2024 · 8 comments
Labels
refactoring Refactoring

Comments

@ggerganov
Owner

There is functionality around llama_sampling_context that is currently part of common. We should move it into llama. Pretty much the entire API from common/sampling.h, except llama_sampling_params and llama_sampling_sample, can be integrated into the library.

This would probably also require merging the grammar parser into the llama lib implementation.

The llama_sampling_params and llama_sampling_sample will stay in common since they are very example-specific and not general-purpose enough to be merged.
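For reference, the rough shape this could take in llama.h - the function names and signatures below are only approximate, not the exact contents of common/sampling.h:

// create / free a sampling context that owns the sampling state
struct llama_sampling_context * llama_sampling_init(/* sampling parameters */);
void llama_sampling_free (struct llama_sampling_context * ctx_s);

// reset the state, e.g. when starting a new generation
void llama_sampling_reset(struct llama_sampling_context * ctx_s);

// record an accepted token - updates the repetition history and the grammar state
void llama_sampling_accept(struct llama_sampling_context * ctx_s, struct llama_context * ctx, llama_token id);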

@ggerganov ggerganov added the refactoring Refactoring label Jan 30, 2024
@slaren
Collaborator

slaren commented Jan 30, 2024

Is this meant as a short-term stopgap measure? If we are going to add a new sampling API to llama.cpp, it would be good to do this from the ground up with the possibility of GPU sampling in mind. The implementation in sampling.h does not seem flexible enough to do this.

@ggerganov
Owner Author

This change is more relevant for CPU-based sampling. There are many use cases that require managing sampling state (e.g. previously sampled tokens, grammar state, etc.), so it makes sense to add support for that directly in the core library.

I haven't thought deeply about GPU sampling support. Wouldn't it make more sense to have a limited number of GPU sampling options (such as greedy and top-k) as part of llama_context, since this would require changing the compute graph in the first place? I don't expect we can ever support GPU grammar sampling, for example, or even GPU repeat-penalty - is that a correct assumption?
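To illustrate what "sampling state" means here, a minimal sketch of the data such a context has to carry (the field names are illustrative only):

struct llama_sampling_context {
    // ring of previously sampled tokens, used by the repetition/frequency penalties
    std::vector<llama_token> prev;

    // current grammar state, when grammar-constrained sampling is enabled
    struct llama_grammar * grammar = nullptr;

    // scratch buffer holding the candidate tokens for the current step
    std::vector<llama_token_data> cur;
};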

@slaren
Collaborator

slaren commented Jan 30, 2024

It's clear that some samplers cannot have GPU implementations; however, this doesn't mean that we need two different APIs for GPU and CPU sampling. We could define samplers as abstract objects that may or may not hold state, and that may have ggml or CPU implementations. Then we would need to assemble a pipeline of sampler objects that can be run at the end of the model evaluation. If all the samplers have ggml implementations, the pipeline can run on the GPU; otherwise at least some parts would still run on the CPU. I think it is mostly a matter of designing a flexible enough interface.
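For illustration only, the kind of interface this suggests could look roughly like this (names and layout are made up, not a concrete proposal):

// a sampler stage: may hold state, and may provide a CPU and/or a ggml implementation
struct llama_sampler_i {
    // apply the sampler on the CPU, modifying the candidates in place
    void (*apply_cpu)(void * state, llama_token_data_array * candidates);

    // build the equivalent ggml ops into the graph; NULL if no GPU implementation exists
    struct ggml_tensor * (*apply_ggml)(void * state, struct ggml_context * gctx, struct ggml_tensor * logits);
};

// a sampling pipeline is an ordered list of such samplers; it runs fully on the
// GPU only if every stage provides apply_ggml, otherwise the tail runs on the CPU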

@ggerganov
Owner Author

Ok, I will give it further thought.

One way that comes to mind is something like this:

int32_t llama_decode_with_sampling(
            struct llama_context * ctx,
   struct llama_sampling_context * ctx_s,
              struct llama_batch   batch,
                     llama_token * result);

The llama_sampling_context can hold the information about the sampling pipeline together with the sampling state. So in that sense, merging llama_sampling_context into llama seems compatible with future GPU support - it just has to be extended with the sampler info.
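In terms of usage, this would reduce a typical generation loop to something like the sketch below (hypothetical, since llama_decode_with_sampling is only a proposal; ctx, ctx_s, batch, model, n_cur and n_len are assumed to be set up as in the existing simple example):

llama_token id = 0;

while (n_cur < n_len) {
    // decode the batch and sample the next token in a single call;
    // ctx_s carries the sampler configuration and state across iterations
    if (llama_decode_with_sampling(ctx, ctx_s, batch, &id) != 0) {
        break;
    }

    if (id == llama_token_eos(model)) {
        break;
    }

    // prepare the next single-token batch (llama_batch_clear/add are the existing common helpers)
    llama_batch_clear(batch);
    llama_batch_add  (batch, id, n_cur, { 0 }, true);

    n_cur += 1;
}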

@slaren
Collaborator

slaren commented Jan 30, 2024

It's also important to allow multiple evaluations to be queued together; that's one of the biggest advantages of GPU sampling. That can be done by making llama_decode_with_sampling asynchronous; however, the token output result would need to be removed, since obtaining it requires flushing the queue.
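One way to reconcile that - again only a sketch, none of these functions exist - is to make the decode+sample call asynchronous with no token output, and fetch sampled tokens through a separate call that synchronizes:

// enqueue decoding + sampling for the whole batch; does not synchronize
int32_t llama_decode_with_sampling_async(
            struct llama_context * ctx,
   struct llama_sampling_context * ctx_s,
              struct llama_batch   batch);

// fetch the token sampled for a given sequence; this is the synchronization
// point where the queued evaluations are flushed
llama_token llama_sampling_get_token(
   struct llama_sampling_context * ctx_s,
                    llama_seq_id   seq_id);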

@ggerganov
Owner Author

ggerganov commented Jan 30, 2024

Yes, this might get tricky when considering multiple sequences in the batch, but it seems doable.

Let me know if you have other concerns about merging llama_sampling_context - it seems like a step in the right direction even when considering GPU support.

If we do that, then the llama_sample_... API that currently exists in llama.h can be updated to take llama_sampling_context instead of candidates and last_tokens. This API will remain as a utility for users to do manual sampling, and can potentially be removed once llama_decode_with_sampling is fully implemented.
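Concretely, using top-k as an example, that could mean going from roughly the current declaration to something like the second one (the first line loosely mirrors the existing llama.h function; the second is a hypothetical name and shape):

// current style (simplified) - the caller owns the candidates buffer
void llama_sample_top_k(struct llama_context * ctx, llama_token_data_array * candidates, int32_t k, size_t min_keep);

// possible updated style - the sampling context owns the candidates and the token history
void llama_sampling_top_k(struct llama_context * ctx, struct llama_sampling_context * ctx_s, int32_t k, size_t min_keep);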

@github-actions
Contributor

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Mar 18, 2024
@cebtenzzre
Collaborator

Not stale.
