Conversation
Thank you, this is so awesome! I have been meaning to look into llama support myself, but many other things are important right now, too. I will hopefully look into this more closely tomorrow. I think we can find a good fix for the tokenizer, but the HF workaround is also fine to get things running for now. Thank you so much.
Just a heads up, I just bumped main to
@lbeurerkellner I did bump it to that version. For this PR, though, does it matter? It looked like with 3736fcb you were moving towards a different design where this "proxy" tokenizer would not be needed anymore.
@lbeurerkellner Since ggml is making great progress every week and is already providing stellar performance on Apple Silicon and GPUs: is there any update on finalizing and merging this PR?
Yes. We have recently simplified the backend infrastructure greatly, which makes an integration of llama.cpp much easier. Before that, however, we first need to fix the integration of a Llama tokenizer: the HF one seems to behave in a non-standard way that needs special handling, see e.g. #95. In general, though, I am quite excited about adding support and will definitely try to push this forward.
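To illustrate the kind of non-standard behavior this refers to, here is a minimal sketch; it is an illustrative guess rather than the exact issue tracked in #95, it assumes the standard `transformers` Llama tokenizer, and the checkpoint name is only a placeholder:

```python
# Illustrative sketch: the SentencePiece-based Llama tokenizer in transformers
# behaves differently from most HF tokenizers when decoding tokens one by one.
# The checkpoint name below is a placeholder, not the one used by this PR.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=False)

ids = tok("a list of items", add_special_tokens=False)["input_ids"]

full = tok.decode(ids)                              # decode the whole sequence
piecewise = "".join(tok.decode([i]) for i in ids)   # decode token by token

# The two can disagree, because decoding a single token drops the leading
# space marker ("▁"), which breaks naive incremental detokenization.
print(repr(full))
print(repr(piecewise))
```

Any tokenizer integration on the LMQL side presumably has to account for quirks of this sort when detokenizing incrementally.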
I just merged llama.cpp support in #111. It leverages the new backend infrastructure, which makes it rather easy to support now. This gives us an initial implementation; however, I think work here could definitely continue, to improve caching and to enable batch_size > 1 on the backend. Closing this now. Nevertheless, thanks so much for the proposal and the initial implementation. It was of great help in doing this port to the updated infrastructure.
Description:
This is a rough attempt at adding support for llama.cpp. It adds a separate model serve while breaking out a couple of shared interfaces between the llama.cpp serve and the HF serve. Though far from perfect, with some suggestions I can hopefully get this into a proper state if desired.
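To make the split concrete, here is a rough sketch of what such a shared interface between the two serves could look like; the class and method names are hypothetical, chosen for illustration, and are not the actual identifiers in this PR:

```python
# Hypothetical sketch of a shared serving interface; names are illustrative
# and do not correspond to the actual classes introduced in this PR.
from abc import ABC, abstractmethod
from typing import List


class ModelServe(ABC):
    """Minimal surface the HF serve and the llama.cpp serve could both implement."""

    @abstractmethod
    def generate(self, input_ids: List[int], max_tokens: int) -> List[int]:
        """Continue a token sequence and return the newly generated token ids."""


class LlamaCppServe(ModelServe):
    def __init__(self, model_path: str, n_threads: int = 8):
        # llama-cpp-python exposes a Llama class that loads a ggml file directly.
        from llama_cpp import Llama
        self.llm = Llama(model_path=model_path, n_threads=n_threads)

    def generate(self, input_ids: List[int], max_tokens: int) -> List[int]:
        out: List[int] = []
        # Llama.generate yields one token id at a time; the caller decides
        # when to stop (EOS or token budget reached).
        for tok in self.llm.generate(input_ids, top_k=40, top_p=0.95, temp=0.8):
            if tok == self.llm.token_eos() or len(out) >= max_tokens:
                break
            out.append(tok)
        return out
```

The HF serve would implement the same interface on top of transformers generation, so the rest of the serving code does not need to know which backend it is talking to.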
What the changes amount to:
An example command to run with llama.cpp looks like the following:

```
lmql serve-model AlekseyKorshuk/vicuna-7b --llama.cpp --model_path=../ggml-vicuna-7b-1.0-uncensored-q4_0.bin --n_threads=8
```

The current problem I have not gotten around yet is that `lmql.runtime.tokenizer` requires some tokenizer to load alongside the serve, and I'm not sure what the best way would be to build a tokenizer, compatible with the `LMQLTokenizer`, from the ggml alone. So, for now, I have just bootstrapped the `model` arg to require a HF tokenizer that can pair with the llama.cpp ggml.
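As a rough illustration of that workaround (the helper below is hypothetical and only shows the pairing, it is not the code of this PR): the HF name passed as `model` is used solely to obtain a compatible tokenizer on the client side, while the ggml file given via --model_path is what llama.cpp actually loads for generation.

```python
# Hypothetical illustration of the workaround described above: the HF model
# name only supplies the tokenizer; generation itself runs on the ggml file
# that the llama.cpp serve loaded via --model_path.
from transformers import AutoTokenizer


def load_client_tokenizer(model_name: str):
    # e.g. model_name = "AlekseyKorshuk/vicuna-7b", matching the serve-model call;
    # its vocabulary must match the ggml weights served by llama.cpp.
    return AutoTokenizer.from_pretrained(model_name, use_fast=False)


tokenizer = load_client_tokenizer("AlekseyKorshuk/vicuna-7b")
ids = tokenizer("Say 'this is a test'")["input_ids"]
# These ids are what gets sent to the llama.cpp serve for generation.
```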