llama : create llamax library #5215
I recently did the same thing in a personal project, so I'd like to share the list of functions that I decided to keep on my side: …

More info on my project: my implementation is basically a web server that takes JSON as input, for example: …
It would be great if we could generalize grammar-based sampling into a callback-based approach. This would open up downstream use cases to adjust the logic in arbitrary ways. (In Tabby's case, we would really like to integrate a tree-sitter grammar for a similar goal.) Something like: …
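As a rough illustration of what such a callback hook could look like — a minimal Python sketch with hypothetical names, not an existing llama.cpp API — the sampler would hand the logits to a user-supplied processor before picking the next token:

```python
from typing import Callable, List, Optional

# Hypothetical sketch, not an existing llama.cpp API: a sampler that lets the
# caller adjust the logits via a callback before the next token is picked
# (greedy argmax here for simplicity).
LogitsProcessor = Callable[[List[int], List[float]], List[float]]

def sample_next_token(past_tokens: List[int], logits: List[float],
                      processor: Optional[LogitsProcessor] = None) -> int:
    if processor is not None:
        logits = processor(past_tokens, logits)
    # Greedy sampling: return the index of the highest logit.
    return max(range(len(logits)), key=lambda i: logits[i])

# Example callback: forbid token id 2 by masking out its logit.
def forbid_token_2(past_tokens: List[int], logits: List[float]) -> List[float]:
    masked = list(logits)
    masked[2] = float("-inf")
    return masked
```

A grammar-based (or tree-sitter-based) constraint would then just be a particular `processor` that masks every token the grammar does not allow at the current position.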
Ok, thanks for the suggestions - these are useful. I'm thinking the API should also support the multi-sequence batched use case, where the user can dynamically insert new requests for processing (something like the current slots in the server example):

```cpp
llamax_context * ctx = llamax_context_init(...);

// thread 0 or 1
ctx->add_request(...);
ctx->add_request(...);
...

// main thread
while (true) {
    ctx->process(...);
}

llamax_context_free(ctx);
```
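The add_request/process pattern above could be sketched in Python roughly as follows (all names are hypothetical; a real implementation would batch and decode instead of the placeholder work):

```python
import queue

# Hypothetical sketch of the add_request/process pattern (all names made up):
# requests may be enqueued from any thread, while one main loop drains the
# queue and does the work.
class LlamaxContext:
    def __init__(self) -> None:
        self._requests: "queue.Queue[str]" = queue.Queue()  # thread-safe
        self.results: list = []

    def add_request(self, prompt: str) -> None:
        # Safe to call from thread 0 or 1, as in the snippet above.
        self._requests.put(prompt)

    def process(self) -> bool:
        # One step of the main run loop; returns False when there is no work.
        try:
            prompt = self._requests.get_nowait()
        except queue.Empty:
            return False
        # Placeholder: a real implementation would batch-decode here.
        self.results.append(prompt.upper())
        return True

ctx = LlamaxContext()
ctx.add_request("hello")
ctx.add_request("world")
while ctx.process():
    pass
```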
Yeah, I haven't yet considered multi-sequence in my implementation. As a first step, I designed my API to be readable from top to bottom, something like:

```python
load(...)
eos_token = lookup_token("</s>")
input_tokens = tokenize("Hello, my name is")
eval(input_tokens)
while True:
    next_token, next_piece = decode_logits()
    if next_token == eos_token:
        break
    print(next_piece)
    sampling_accept(next_token)
    eval([next_token])
exit()
```

With multi-sequence, it may become:

```python
...
input_tokens = tokenize("Hello, my name is")
seq_id = new_seq()                       # added
eval(input_tokens, seq_id)               # add seq_id
while True:
    next_token, next_piece = decode_logits()
    if next_token == eos_token:
        break
    print(next_piece)
    sampling_accept(next_token, seq_id)  # add seq_id
    eval([next_token], seq_id)           # add seq_id
delete_seq(seq_id)
exit()
```

It would also be nice if llamax could be thread-safe.
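One common way to get that kind of thread safety — a minimal sketch, assuming a single shared context mutated from several threads, with all names hypothetical — is to serialize access with a lock:

```python
import threading

# Hypothetical sketch: serialize access to one shared context with a lock so
# that calls such as eval/sampling from different threads cannot interleave.
class ThreadSafeContext:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._tokens: list = []

    def eval(self, tokens, seq_id: int = 0) -> None:
        with self._lock:
            # Placeholder for real decoding; the lock keeps state consistent.
            self._tokens.extend(tokens)

    def token_count(self) -> int:
        with self._lock:
            return len(self._tokens)

ctx = ThreadSafeContext()
threads = [threading.Thread(target=ctx.eval, args=([1, 2],)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```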
@wsxiaoys I'm not quite sure whether modifying logits is suitable for a high-level API, but maybe llamax can just expose the underlying llama_context, so you can use the low-level API to interact with the low-level context.
I think this is a great idea. Currently, I am using llama.cpp with LlamaSharp, but it does not work with the latest version of llama.cpp because of breaking changes in llama.cpp. Ideally, I would like to drop the latest llama.dll directly into my .NET project.

It would be great to have a super-high-level API that does NOT have breaking changes, something like a subset of the llama-cpp-python high-level API: https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#high-level-api

A super-high-level API contract that does not change from version to version should cover 90% of use cases. There can be a second level of API for things like tokenization, etc., and a low-level API.
@AshD yeah right, seems like the … The … However, the …
@ngxson I was thinking of prompt caching. In our app, Fusion Quill, calls to llama.cpp are either chat-type calls or one-off calls for things like summarization. For the chat use case, the message list is [system, usermsg1] for the 1st call, then [system, usermsg1, assistant1, usermsg2, ...] for the next. For the other use case, caching the tokens for the system message will make sense. This way the super-high-level API is kept simple.
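The prefix-reuse idea can be sketched in a few lines: only the tokens after the longest common prefix with the previous call need to be evaluated. Token IDs here are toy values, and `tokens_to_eval` is a hypothetical helper, not an existing API:

```python
# Hypothetical sketch of prompt caching: reuse the KV cache for the longest
# common token prefix between the previous call and the new one, and only
# evaluate the suffix that changed.
def common_prefix_len(a, b) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def tokens_to_eval(cached, new):
    # Returns (number of cached tokens to keep, tokens that still need eval).
    keep = common_prefix_len(cached, new)
    return keep, new[keep:]

cached = [1, 2, 3, 4]     # tokens of [system, usermsg1] from the 1st call
new = [1, 2, 3, 4, 5, 6]  # tokens of [system, usermsg1, assistant1, usermsg2]
keep, suffix = tokens_to_eval(cached, new)
```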
@AshD yeah, I actually have a bonus idea, but haven't got time to implement it: in the chat API, some systems may remove the oldest messages to be able to fit the history into the context window. On the server side, we can detect this change, then use … This kind of behavior already exists in … My idea is to detect the change and calculate the number of KV cache positions to shift just by comparing the list of messages from the last request vs the new request. This is just plain logic code and has nothing to do with inference, though.
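That comparison logic might look roughly like this — a toy sketch where per-message character counts stand in for real token counts, and `compute_kv_shift` is a hypothetical helper:

```python
# Hypothetical sketch: find how many of the oldest messages were dropped
# between requests, and how many tokens of KV cache that corresponds to.
# token_len is a stand-in for a real per-message token counter.
def compute_kv_shift(old_msgs, new_msgs, token_len):
    for dropped in range(len(old_msgs) + 1):
        remainder = old_msgs[dropped:]
        # The new request should start with whatever survived from the old one.
        if new_msgs[:len(remainder)] == remainder:
            shift = sum(token_len(m) for m in old_msgs[:dropped])
            return dropped, shift
    # No overlap at all: the whole old history was discarded.
    return len(old_msgs), sum(token_len(m) for m in old_msgs)

old = ["hi", "hello!", "how are you"]
new = ["how are you", "fine"]          # the two oldest messages were dropped
dropped, shift = compute_kv_shift(old, new, len)  # len() as a toy token count
```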
This issue is stale because it has been open for 30 days with no activity. |
Not stale. |
I hope someone picks this up soon. Our app, Fusion Quill, uses llama.cpp via LlamaSharp. With a stable high-level API, this problem should go away, and it would simplify downstream llama.cpp libraries.
Thanks a lot for this awesome project! |
Depends on: #5214
The `llamax` library will wrap `llama` and expose common high-level functionality. The main goal is to ease the integration of `llama.cpp` into 3rd-party projects. Ideally, most projects would interface through the `llamax` API for all common use cases, while still having the option to use the low-level `llama` API for more uncommon applications that require finer control of the state.

A simple way to think about `llamax` is that it will simplify all of the existing examples in `llama.cpp` by hiding the low-level stuff, such as managing the KV cache and batching requests.

Roughly, `llamax` will require its own state object and a run-loop function.

The specifics of the API are yet to be determined - suggestions are welcome.