Conversation
Thank you, this is so awesome! I have been meaning to look into llama support myself, but many other things are important right now, too. I will hopefully look into this more closely tomorrow. I think we can find a good fix for the tokenizer, but the HF workaround is also fine to get things running for now. Thank you so much.
Just a heads up, I just bumped main to
@lbeurerkellner I did bump it to that version. For this PR, though, does it matter? It looked like with 3736fcb you were moving towards a different design where this "proxy" tokenizer would not be needed anymore.
@lbeurerkellner Since ggml is making great progress every week and is already providing stellar performance on Apple Silicon and GPUs: is there any update on finalizing and merging this PR?
Yes. We have recently simplified the backend infrastructure greatly, which makes an integration of llama.cpp much easier. Before that, however, we first need to fix the integration of a Llama tokenizer: the HF one seems to behave in a non-standard way that needs special handling, see e.g. #95. In general, though, I am quite excited about adding support and will definitely try to push this forward.
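To illustrate the kind of non-standard behavior this refers to, here is a minimal sketch; it is an illustrative guess rather than the exact issue tracked in #95, it assumes the standard `transformers` Llama tokenizer, and the checkpoint name is only a placeholder:

```python
# Illustrative sketch: the SentencePiece-based Llama tokenizer in transformers
# behaves differently from most HF tokenizers when decoding tokens one by one.
# The checkpoint name below is a placeholder, not the one used by this PR.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=False)

ids = tok("a list of items", add_special_tokens=False)["input_ids"]

full = tok.decode(ids)                              # decode the whole sequence
piecewise = "".join(tok.decode([i]) for i in ids)   # decode token by token

# The two can disagree, because decoding a single token drops the leading
# space marker ("▁"), which breaks naive incremental detokenization.
print(repr(full))
print(repr(piecewise))
```

Any tokenizer integration on the LMQL side presumably has to account for quirks of this sort when detokenizing incrementally.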
I just merged llama.cpp support in #111. It leverages the new backend infrastructure, which makes it rather easy to support now. This gives us an initial implementation; however, I think work here could definitely continue, to improve caching and to enable batch_size > 1 on the backend. Closing this now. Nevertheless, thanks so much for the proposal and the initial implementation. It was of great help in doing this port to the updated infrastructure.
Description:
This is a rough attempt at adding support for llama.cpp. It adds a separate model serve while breaking out a couple of shared interfaces between the llama.cpp serve and the HF serve. Though far from perfect, with some suggestions I can hopefully get this into a proper state if desired.
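To make the split concrete, here is a rough sketch of what such a shared interface between the two serves could look like; the class and method names are hypothetical, chosen for illustration, and are not the actual identifiers in this PR:

```python
# Hypothetical sketch of a shared serving interface; names are illustrative
# and do not correspond to the actual classes introduced in this PR.
from abc import ABC, abstractmethod
from typing import List


class ModelServe(ABC):
    """Minimal surface the HF serve and the llama.cpp serve could both implement."""

    @abstractmethod
    def generate(self, input_ids: List[int], max_tokens: int) -> List[int]:
        """Continue a token sequence and return the newly generated token ids."""


class LlamaCppServe(ModelServe):
    def __init__(self, model_path: str, n_threads: int = 8):
        # llama-cpp-python exposes a Llama class that loads a ggml file directly.
        from llama_cpp import Llama
        self.llm = Llama(model_path=model_path, n_threads=n_threads)

    def generate(self, input_ids: List[int], max_tokens: int) -> List[int]:
        out: List[int] = []
        # Llama.generate yields one token id at a time; the caller decides
        # when to stop (EOS or token budget reached).
        for tok in self.llm.generate(input_ids, top_k=40, top_p=0.95, temp=0.8):
            if tok == self.llm.token_eos() or len(out) >= max_tokens:
                break
            out.append(tok)
        return out
```

The HF serve would implement the same interface on top of transformers generation, so the rest of the serving code does not need to know which backend it is talking to.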
What the changes amount to:
An example command to run with llama.cpp looks like the following:

```
lmql serve-model AlekseyKorshuk/vicuna-7b --llama.cpp --model_path=../ggml-vicuna-7b-1.0-uncensored-q4_0.bin --n_threads=8
```

The current problem I have not gotten around yet is that `lmql.runtime.tokenizer` requires some tokenizer to load alongside the serve, and I'm not sure what the best way would be to build a tokenizer, compatible with the `LMQLTokenizer`, from the ggml alone. So, for now, I have just bootstrapped the `model` arg to require a HF tokenizer that can pair with the llama.cpp ggml.
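As a rough illustration of that workaround (the helper below is hypothetical and only shows the pairing, it is not the code of this PR): the HF name passed as `model` is used solely to obtain a compatible tokenizer on the client side, while the ggml file given via --model_path is what llama.cpp actually loads for generation.

```python
# Hypothetical illustration of the workaround described above: the HF model
# name only supplies the tokenizer; generation itself runs on the ggml file
# that the llama.cpp serve loaded via --model_path.
from transformers import AutoTokenizer


def load_client_tokenizer(model_name: str):
    # e.g. model_name = "AlekseyKorshuk/vicuna-7b", matching the serve-model call;
    # its vocabulary must match the ggml weights served by llama.cpp.
    return AutoTokenizer.from_pretrained(model_name, use_fast=False)


tokenizer = load_client_tokenizer("AlekseyKorshuk/vicuna-7b")
ids = tokenizer("Say 'this is a test'")["input_ids"]
# These ids are what gets sent to the llama.cpp serve for generation.
```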