
server : fix crash when system prompt is bigger than batch size #5714

Merged
merged 1 commit into ggerganov:master on Feb 25, 2024

Conversation

compilade (Collaborator) commented:

The system prompt is now decoded in batches.

Maybe llama_decode should eventually have some kind of built-in auto-batching, since forgetting to split a batch seems to be a common mistake in the examples.

I also fixed a problem where n_past would skip a pos value when the prefix of a prompt fully matches the tokens in the cache.
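Not the actual diff from this PR, just a minimal sketch of the idea, with a hypothetical helper name (decode_in_batches): feed a long token sequence to llama_decode in chunks of at most n_batch tokens instead of a single oversized batch.

```cpp
// Minimal sketch (not the PR diff): decode a long token list in chunks of at
// most n_batch tokens so llama_decode never receives an oversized batch.
// decode_in_batches is a hypothetical helper name; it assumes the tokens are
// decoded into sequence 0 starting from an empty KV cache (pos_0 == i).
#include "llama.h"

#include <algorithm>
#include <cstdio>
#include <vector>

static bool decode_in_batches(llama_context * ctx,
                              std::vector<llama_token> & tokens,
                              int32_t n_batch) {
    for (int32_t i = 0; i < (int32_t) tokens.size(); i += n_batch) {
        const int32_t n_eval = std::min(n_batch, (int32_t) tokens.size() - i);

        // chunk of n_eval tokens, starting at position i, in sequence 0
        llama_batch batch = llama_batch_get_one(tokens.data() + i, n_eval, i, 0);

        if (llama_decode(ctx, batch) != 0) {
            fprintf(stderr, "llama_decode failed on chunk at token %d\n", i);
            return false;
        }
    }
    return true;
}
```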

Commit message:

server : fix crash when system prompt is bigger than batch size

The system prompt is now decoded in batches.

* server : fix off-by-one n_past when start of prompt matches whole cache
  The tokens right after the matching part would otherwise skip a pos value.
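To make the off-by-one concrete, here is an illustration (not the server's actual code; common_prefix_len is a made-up helper): n_past should equal the number of prompt tokens already in the KV cache, so that the first token still to be decoded is evaluated at pos == n_past, even when the whole cached prefix matches.

```cpp
// Illustration only (not the server code): n_past should equal the number of
// prompt tokens already present in the KV cache, so that the next token is
// decoded at pos == n_past and no position is skipped.
#include "llama.h"

#include <cstddef>
#include <vector>

static size_t common_prefix_len(const std::vector<llama_token> & cache_tokens,
                                const std::vector<llama_token> & prompt_tokens) {
    size_t n = 0;
    while (n < cache_tokens.size() && n < prompt_tokens.size() &&
           cache_tokens[n] == prompt_tokens[n]) {
        n++;
    }
    return n;
}

// usage sketch:
//   size_t n_past = common_prefix_len(cache_tokens, prompt_tokens);
//   // the first token still to decode is prompt_tokens[n_past], and it must be
//   // evaluated at position n_past (not n_past + 1), even on a full prefix match
```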
ggerganov (Owner) left a comment:

> Maybe llama_decode should eventually have some kind of built-in auto-batching, since forgetting to split a batch seems to be a common mistake in the examples.

It should be added to common, and eventually the new llamax library will abstract this (see below).

slaren (Collaborator) commented on Feb 25, 2024:

I plan to extend llama_decode to automatically split batches that exceed n_batch, for pipeline parallelism. It is already implemented in the demo PR.
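For reference, a sketch of the splitting that callers currently do by hand and that built-in auto-batching would internalize: decode an already-filled llama_batch in views of at most n_batch tokens each. The field layout of the view follows the llama_batch struct of this period, and decode_split is a hypothetical name.

```cpp
// Sketch of caller-side splitting (the loop that built-in auto-batching would
// internalize): decode an already-filled llama_batch in views of at most
// n_batch tokens each, without copying the underlying arrays.
#include "llama.h"

#include <algorithm>
#include <cstdio>

static bool decode_split(llama_context * ctx, const llama_batch & batch, int32_t n_batch) {
    for (int32_t i = 0; i < batch.n_tokens; i += n_batch) {
        const int32_t n_tokens = std::min(n_batch, batch.n_tokens - i);

        // view into the original batch: same arrays, offset by i tokens
        llama_batch batch_view = {
            n_tokens,
            batch.token    + i,
            nullptr,            // embd (token batch, so unused)
            batch.pos      + i,
            batch.n_seq_id + i,
            batch.seq_id   + i,
            batch.logits   + i,
            0, 0, 0,            // all_pos_0, all_pos_1, all_seq_id (unused here)
        };

        if (llama_decode(ctx, batch_view) != 0) {
            fprintf(stderr, "llama_decode failed on view at token %d\n", i);
            return false;
        }
    }
    return true;
}
```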

phymbert (Collaborator) commented:

@slaren Hi, thanks for the fix. Would it be possible to add a simple test?

slaren (Collaborator) commented on Feb 25, 2024:

After it is implemented, sure, but it is not merged yet. It still needs more work, but it should be done soon.

ggerganov merged commit f762501 into ggerganov:master on Feb 25, 2024
59 of 108 checks passed
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request on Mar 13, 2024: server : fix crash when system prompt is bigger than batch size (ggerganov#5714)
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request on Apr 1, 2024: server : fix crash when system prompt is bigger than batch size (ggerganov#5714)