Replies: 2 comments 1 reply
-
It's not implemented - the main problem is that a deterministic string can be represented in many different ways in terms of tokens (e.g. "hello" -> ["h", "ello"], ["he", "llo"], ["hel", "l", "o"], ...), so it is not clear beforehand which representation the LLM would generate. The closest thing to this functionality that exists today is to use a fast draft model which you also constrain with the same grammar (see …)
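To make the ambiguity concrete, here is a toy enumeration of all valid segmentations of a fixed string (the vocabulary below is made up; a real BPE vocabulary has tens of thousands of entries):

```python
# Toy illustration: a fixed string has many valid token segmentations.
vocab = {"h", "e", "l", "o", "he", "el", "ll", "lo", "hel", "ello", "llo"}

def segmentations(s: str) -> list[list[str]]:
    """Return every way to split s into tokens from vocab."""
    if not s:
        return [[]]
    results = []
    for i in range(1, len(s) + 1):
        if s[:i] in vocab:
            for rest in segmentations(s[i:]):
                results.append([s[:i]] + rest)
    return results

for seg in segmentations("hello"):
    print(seg)
# ['h', 'ello'], ['he', 'llo'], ['hel', 'l', 'o'], ... 13 segmentations in total
```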
-
I'm talking to someone who works on coalescence (I posted an issue about it) - the same problem happens in prompt tokenization, right? I think it's worth trying at least; I'd expect an LLM to be reasonably robust to non-canonical token splits, given that the same ambiguity exists for regular prompt tokenization. For the application I'm working on, this would save approximately 80% of the tokens generated!

I wrote a post about a specific use case for grammar constraints: using an LLM to interpret spoken voice commands and turn them into JSON commands. It's actually possible to run this on a cheap Azure CPU server, because the entire few-shot prompt is cached, the additional input is quite short, and the output is just a small JSON object with never more than a dozen fields. It would be amazing to have a valid use case for LLMs that can run on cheap CPU-only VMs in production. I'm using Zephyr 3B at 5-bit quants and it works absolutely wonderfully for this use case.

Edit: Here's the link to the coalescence issue - the people from dottxt are working on this specifically: #5292
-
Let's say I have a simple JSON grammar with which I want to generate a structure like this:
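```json
{ "brand": "Mercedes" }
```

(where "brand" can be either "Mercedes" or "Mazda")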
Invoking the LLM to predict one token is expensive, and many tokens are deterministic given the current state of the output. For instance, at the beginning the grammar HAS to start with:
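```
{ "brand": "
```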
(Note: depending on how restrictively you define your JSON grammar, there can be more than one valid output at the beginning, e.g. if you allow arbitrary whitespace, newlines, etc.)
Next, depending on the first token it generates, it can easily end up in a situation where a lot of tokens are again deterministic. Say the value is constrained like this (GBNF):
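```
root ::= "{ \"brand\": \"" ("Mercedes" | "Mazda") "\" }"
```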
Now it can either go into the "Mercedes" branch or the "Mazda" branch - this is the first point where the LLM actually needs to be invoked. After that one token, we can go all the way to
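```json
{ "brand": "Mercedes" }
```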
before it needs to be invoked again.
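In pseudocode, the optimization I'm imagining looks roughly like this (the `model`, `tokenizer` and `grammar` interfaces here are hypothetical, for illustration only - not llama.cpp's actual API):

```python
# Hypothetical sketch: whenever the grammar admits exactly one continuation
# string, append its canonical tokenization directly instead of running the
# model. NOTE: all interfaces below are assumed, not a real library API.

def sample_constrained(logits, grammar, tokenizer):
    # Keep only tokens whose decoded text the grammar accepts,
    # then pick the most likely one (greedy, for simplicity).
    allowed = [t for t in range(len(logits))
               if grammar.accepts(tokenizer.decode([t]))]
    return max(allowed, key=lambda t: logits[t])

def generate(model, tokenizer, grammar, tokens):
    while not grammar.is_complete():
        forced = grammar.forced_prefix()  # longest span with no alternatives
        if forced:
            # One tokenizer call, zero model evaluations for the whole span.
            # This uses the canonical tokenization - the same one a prompt
            # would get - which may differ from what the model would emit.
            tokens.extend(tokenizer.encode(forced))
            grammar.advance(forced)
        else:
            # A real branch point ("Mercedes" vs "Mazda"): invoke the model.
            logits = model.eval(tokens)
            t = sample_constrained(logits, grammar, tokenizer)
            tokens.append(t)
            grammar.advance(tokenizer.decode([t]))
    return tokens
```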
Does this kind of optimization exist in llama.cpp?