Replies: 2 comments 1 reply
-
It's not implemented - the main problem is that a deterministic string can be represented in many different ways in terms of tokens (e.g. "hello" -> ["h", "ello"], ["he", "llo"], ["hel", "l", "o"], ...), so it is not clear beforehand which representation the LLM would generate. The closest thing to this functionality that exists today is to use a fast draft model which you also constrain with the same grammar (see …)
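To make the ambiguity concrete, here is a toy enumeration of all valid segmentations of a fixed string (the vocabulary below is made up; a real BPE vocabulary has tens of thousands of entries):

```python
# Toy illustration: a fixed string has many valid token segmentations.
vocab = {"h", "e", "l", "o", "he", "el", "ll", "lo", "hel", "ello", "llo"}

def segmentations(s: str) -> list[list[str]]:
    """Return every way to split s into tokens from vocab."""
    if not s:
        return [[]]
    results = []
    for i in range(1, len(s) + 1):
        if s[:i] in vocab:
            for rest in segmentations(s[i:]):
                results.append([s[:i]] + rest)
    return results

for seg in segmentations("hello"):
    print(seg)
# ['h', 'ello'], ['he', 'llo'], ['hel', 'l', 'o'], ... 13 segmentations in total
```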
-
I'm talking to someone who works on coalescence (I posted an issue about it) - the same problem happens in prompt tokenization, right? I think it's worth trying at least; I'd expect an LLM to be reasonably robust to non-canonical token splits, given that the same ambiguity exists for regular prompt tokenization. For the application I'm working on, this would save approximately 80% of the tokens generated!

I wrote a post about a specific use case for grammar constraints: using an LLM to interpret spoken voice commands and turn them into JSON commands. It's actually possible to run this on a cheap Azure CPU server, because the entire few-shot prompt is cached, the additional input is quite short, and the output is just a small JSON object with never more than a dozen fields. It would be amazing to have a valid use case for LLMs that can run on cheap CPU-only VMs in production. I'm using Zephyr 3B at 5-bit quants and it works absolutely wonderfully for this use case.

Edit: Here's the link to the coalescence issue - the people from dottxt are working on this specifically: #5292
-
Let's say I have a simple JSON grammar with which I want to generate a structure like this:
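```json
{ "brand": "Mercedes" }
```

(where "brand" can be either "Mercedes" or "Mazda")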
Invoking the LLM to predict one token is expensive, and many tokens are deterministic given the current state of the output. For instance, at the beginning the grammar HAS to start with:
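```
{ "brand": "
```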
(Note: depending on how restrictively you define your JSON grammar, there can be more than one valid output at the beginning, e.g. if you allow arbitrary whitespace, newlines, etc.)
Next, depending on the first token it generates, it can easily end up in a situation where a lot of tokens are again deterministic. Say the value is constrained like this (GBNF):
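```
root ::= "{ \"brand\": \"" ("Mercedes" | "Mazda") "\" }"
```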
Now it can either go into the "Mercedes" branch or the "Mazda" branch - this is the first point where the LLM actually needs to be invoked. After that one token, we can go all the way to
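```json
{ "brand": "Mercedes" }
```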
before it needs to be invoked again.
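In pseudocode, the optimization I'm imagining looks roughly like this (the `model`, `tokenizer` and `grammar` interfaces here are hypothetical, for illustration only - not llama.cpp's actual API):

```python
# Hypothetical sketch: whenever the grammar admits exactly one continuation
# string, append its canonical tokenization directly instead of running the
# model. NOTE: all interfaces below are assumed, not a real library API.

def sample_constrained(logits, grammar, tokenizer):
    # Keep only tokens whose decoded text the grammar accepts,
    # then pick the most likely one (greedy, for simplicity).
    allowed = [t for t in range(len(logits))
               if grammar.accepts(tokenizer.decode([t]))]
    return max(allowed, key=lambda t: logits[t])

def generate(model, tokenizer, grammar, tokens):
    while not grammar.is_complete():
        forced = grammar.forced_prefix()  # longest span with no alternatives
        if forced:
            # One tokenizer call, zero model evaluations for the whole span.
            # This uses the canonical tokenization - the same one a prompt
            # would get - which may differ from what the model would emit.
            tokens.extend(tokenizer.encode(forced))
            grammar.advance(forced)
        else:
            # A real branch point ("Mercedes" vs "Mazda"): invoke the model.
            logits = model.eval(tokens)
            t = sample_constrained(logits, grammar, tokenizer)
            tokens.append(t)
            grammar.advance(tokenizer.decode([t]))
    return tokens
```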
Does this kind of optimization exist in llama.cpp?