
llama : create llamax library #5215

Open
ggerganov opened this issue Jan 30, 2024 · 12 comments
Labels
refactoring Refactoring

Comments

@ggerganov
Owner

ggerganov commented Jan 30, 2024

Depends on: #5214

The llamax library will wrap llama and expose common high-level functionality. The main goal is to ease the integration of llama.cpp into 3rd party projects. Ideally, most projects would interface through the llamax API for all common use cases, while still having the option to use the low-level llama API for more uncommon applications that require finer control of the state.

A simple way to think about llamax is that it will simplify all of the existing examples in llama.cpp by hiding the low-level stuff, such as managing the KV cache and batching requests.

Roughly, llamax will require its own state object and a run-loop function.

The specifics of the API are yet to be determined - suggestions are welcome.
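Purely for illustration, a minimal sketch of what such a state object and run-loop could look like (every name below is hypothetical; nothing here is a committed design):

// hypothetical llamax surface - none of these symbols exist yet
#include <cstdint>

typedef struct llamax_context llamax_context;    // opaque state object owned by llamax

typedef struct llamax_params {
    const char * model_path;    // model to load
    int32_t      n_ctx;         // context size
    int32_t      n_parallel;    // max number of concurrent sequences
} llamax_params;

llamax_context * llamax_init(llamax_params params);    // load the model, set up KV cache and batching
void             llamax_free(llamax_context * ctx);

// run-loop building blocks: the caller submits work, llamax hides the KV cache and batching
int32_t llamax_add_request(llamax_context * ctx, const char * prompt);   // returns a request id
bool    llamax_process    (llamax_context * ctx);                        // advance all pending requests one step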

@ggerganov ggerganov added the refactoring Refactoring label Jan 30, 2024
@ngxson
Collaborator

ngxson commented Jan 30, 2024

I recently did the same thing in my personal project, so I'd like to share the list of functions that I decided to keep on my side:

function          purpose
load              load the model, taking into account the specified cparams, mparams and sparams
lookup_token      look up special tokens, like [INST] or <<SYS>>
tokenize          convert a string into a list of tokens
eval              take a list of tokens, create a batch and evaluate it via llama_decode
decode_logits     (admittedly a bad name) return the token sampled from the logits
session_save      save the session to a file
session_load      load a session
sampling_accept   accept one token
exit              free everything, then exit

More info on my project: My implementation is basically a web server that takes a JSON as input, for example: { "action": "load", "model_path": "...", ... }. That's why I never (and cannot) return any pointers. It's more of a "low-level" API that is accessible via the web.

@wsxiaoys
Contributor

It would be great if we could generalize grammar-based sampling into a callback-based approach. This would open up downstream use cases to adjust the logic in arbitrary ways. (In Tabby's case, we would really like to integrate a tree-sitter grammar for a similar goal.)

Something like void (*apply_logits_override)(logits, void * data) should do the job.
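For instance, a rough sketch of such a hook (the typedef name and signature are just an assumption, not an existing llama.cpp API):

#include <cmath>      // INFINITY
#include <cstdint>

// hypothetical hook: may rewrite the logits in place right before sampling;
// user_data carries arbitrary caller state (e.g. a tree-sitter based checker in Tabby's case)
typedef void (*llamax_logits_override)(float * logits, int32_t n_vocab, void * user_data);

// example callback: ban a single token id by pushing its logit to -infinity
struct ban_state {
    int32_t banned_token;
};

static void ban_token_override(float * logits, int32_t n_vocab, void * user_data) {
    const ban_state * st = static_cast<const ban_state *>(user_data);
    if (st->banned_token >= 0 && st->banned_token < n_vocab) {
        logits[st->banned_token] = -INFINITY;
    }
}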

@ggerganov
Owner Author

Ok, thanks for the suggestions - these are useful.

I'm thinking the API should also support the multi-sequence batched use case, where the user can dynamically insert new requests for processing (something like the current slots in the server example, but better). In that sense, calls such as eval and decode_logits won't be very suitable. Something more like:

llamax_context * ctx = llamax_context_init(...);

// thread 0 or 1: dynamically add new requests at any time
ctx->add_request(...);
ctx->add_request(...);
...

// main thread: run-loop that batches and processes all pending requests
while (true) {
	ctx->process(...);
}

llamax_context_free(ctx);

@ngxson
Collaborator

ngxson commented Jan 31, 2024

Yeah, I haven't considered multi-sequence support in my implementation yet.

As a first step, I designed my API to be readable from top to bottom, something like:

load(...)
eos_token = lookup_token("</s>")
input_tokens = tokenize("Hello, my name is")
eval(input_tokens)
while True:
  next_token, next_piece = decode_logits()
  if next_token == eos_token:
    break
  print(next_piece)
  sampling_accept(next_token)
  eval([next_token])
exit()

With multi-sequence, it may become:

...
input_tokens = tokenize("Hello, my name is")
seq_id = new_seq() # added
eval(input_tokens, seq_id) # add seq_id
while True:
  next_token, next_piece = decode_logits()
  if next_token == eos_token:
    break
  print(next_piece)
  sampling_accept(next_token, seq_id) # add seq_id
  eval([next_token], seq_id) # add seq_id
delete_seq(seq_id)
exit()

It would be nice if llamax could be thread-safe. For example, in the code above (a rough sketch of this pattern follows the list):

  • I can spawn 10 threads, each one with its own seq_id
  • Whenever I call eval, the call blocks (maybe via a mutex) and the request is sent to the llamax thread
  • The llamax thread processes the request (maybe batched together with requests from other threads), then sends back the result
  • The mutex is unlocked and eval returns the value to my thread
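A minimal sketch of that blocking, batched-eval pattern (illustrative only: no llama.cpp calls are made, and all names here are made up):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

struct eval_request {
    int              seq_id;
    std::vector<int> tokens;  // tokens to evaluate for this sequence
    int              result;  // sampled token, filled in by the worker
    bool             done;    // set to true once processed
};

static std::mutex                 g_mutex;
static std::condition_variable    g_cv;
static std::queue<eval_request *> g_queue;

// called from any client thread: blocks until the worker has processed the request
int eval_blocking(int seq_id, std::vector<int> tokens) {
    eval_request req{seq_id, std::move(tokens), -1, false};
    {
        std::lock_guard<std::mutex> lock(g_mutex);
        g_queue.push(&req);
    }
    g_cv.notify_all();
    std::unique_lock<std::mutex> lock(g_mutex);
    g_cv.wait(lock, [&] { return req.done; });
    return req.result;
}

// worker ("llamax") thread: drains the queue and evaluates pending requests as one batch
// usage: std::thread(worker_loop).detach(); then call eval_blocking(...) from any thread
void worker_loop() {
    for (;;) {
        std::vector<eval_request *> batch;
        {
            std::unique_lock<std::mutex> lock(g_mutex);
            g_cv.wait(lock, [] { return !g_queue.empty(); });
            while (!g_queue.empty()) {
                batch.push_back(g_queue.front());
                g_queue.pop();
            }
        }
        for (eval_request * req : batch) {
            // placeholder: a real implementation would build one llama_batch from
            // all requests, call llama_decode once, then sample per sequence
            req->result = 0;
        }
        {
            std::lock_guard<std::mutex> lock(g_mutex);
            for (eval_request * req : batch) {
                req->done = true;
            }
        }
        g_cv.notify_all();
    }
}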

@wsxiaoys I'm not quite sure whether modifying logits is suitable for a high-level API, but maybe llamax can simply expose the underlying llama_context, so you can use the low-level API to interact with the low-level context.

@AshD

AshD commented Feb 6, 2024

I think this is a great idea.

Currently, I am using llama.cpp through LlamaSharp, but it does not work with the latest version of llama.cpp because of upstream changes. Ideally, I would like to drop the latest llama.dll directly into my .NET project.

It would be great to have a Super High level API that does NOT have breaking changes, something like a subset of the llama.cpp Python API https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#high-level-api

A Super High level API contract that does not change from version to version - I think this should cover 90% of use cases:

  • LoadModel
  • CreateCompletion
  • GetEmbedding
  • UnloadModel

There can be a second level of API (Tokenize, etc.) and a low-level API below that.
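To make the contract concrete, a hypothetical sketch (loosely mirroring the llama-cpp-python high-level API; none of these functions exist in llama.cpp today):

#include <cstdint>

// hypothetical stable top-level contract - names follow the list above
struct llamax_model;    // opaque handle

llamax_model * LoadModel  (const char * model_path);
void           UnloadModel(llamax_model * model);

// blocking completion: prompt in, generated text out;
// the returned string would remain owned by the model and be freed on unload
const char *   CreateCompletion(llamax_model * model, const char * prompt, int32_t max_tokens);

// embedding for a single input string; *n_embd would receive the vector length
const float *  GetEmbedding(llamax_model * model, const char * text, int32_t * n_embd);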

@ngxson
Collaborator

ngxson commented Feb 7, 2024

@AshD yeah right, it seems like create_chat_completion in llama-cpp-python is the sweet spot between using llama.cpp as a "server" and as a "library".

The chat_format param used by create_chat_completion can replace my lookup_token. The reason I have lookup_token in my implementation is that not all models use the ChatML format.

However, create_chat_completion creates a tricky situation when I want to cache the prompt. Prompt caching can be useful when you have a very long system prompt.

@AshD

AshD commented Feb 8, 2024

@ngxson I was thinking of prompt caching. In our app, Fusion Quill, calls to llama.cpp are either chat-type calls or one-off calls for things like summarization.

For the chat use case, the messages list is [system, usermsg1] for the 1st call, then [system, usermsg1, assistant1, usermsg2, ...] for subsequent calls.
Maybe llama.cpp can cache the tokens for the messages it has already seen.

For the other use case, caching the tokens for the system message would make sense.

This way, the Super High Level API is kept simple.

@ngxson
Collaborator

ngxson commented Feb 8, 2024

@AshD yeah, I actually have a bonus idea, but I haven't had time to implement it:

In the chat API, some systems remove the oldest messages in order to fit the history into the context window. On the server side, we can detect this change and use llama_kv_cache_seq_rm and llama_kv_cache_seq_shift to shift the KV cache instead of re-calculating it.

This kind of behavior already exists in main, but it is done at the token level instead of the "message" level.

My idea is to detect the change and calculate the number of KV cells to shift just by comparing the list of messages from the last request with the new request. This is plain bookkeeping logic and has nothing to do with inference, though.
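A rough sketch of that comparison logic (the chat_message type and the helper are made up; the two llama_kv_cache_* calls in the usage comment use the signatures they had around the time of this discussion):

#include <string>
#include <vector>

struct chat_message {
    std::string role;
    std::string content;
    int         n_tokens;    // how many KV cells this message occupies
};

// returns how many leading messages of the previous request were dropped in the
// new request, and (via n_dropped_tokens) how many KV cells they covered;
// assumes the remaining messages are unchanged
static int count_dropped_prefix(const std::vector<chat_message> & prev,
                                const std::vector<chat_message> & next,
                                int & n_dropped_tokens) {
    n_dropped_tokens = 0;
    size_t dropped = 0;
    while (dropped < prev.size() &&
           (next.empty() ||
            prev[dropped].role    != next[0].role ||
            prev[dropped].content != next[0].content)) {
        n_dropped_tokens += prev[dropped].n_tokens;
        dropped++;
    }
    return (int) dropped;
}

// usage (pseudo):
//   int n_drop = 0;
//   count_dropped_prefix(prev_msgs, new_msgs, n_drop);
//   llama_kv_cache_seq_rm   (ctx, seq_id, 0,      n_drop);           // drop the removed cells
//   llama_kv_cache_seq_shift(ctx, seq_id, n_drop, n_past, -n_drop);  // slide the rest back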

Contributor

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Mar 18, 2024
@cebtenzzre
Collaborator

Not stale.

@slaren slaren removed the stale label Mar 18, 2024
@AshD

AshD commented Mar 20, 2024

I hope someone picks this up soon. Our app, Fusion Quill, uses llama.cpp via LlamaSharp.
There is a big time lag between llama.cpp supporting a new model and LlamaSharp supporting the new version of llama.cpp.

With a stable high level API, this problem should go away and it would simplify downstream llama.cpp libraries.

@amakropoulos

Thanks a lot for this awesome project!
Just to +1 that this feature would be tremendously helpful!
