Roadmap May 2023 #1220
-
If I'm understanding the idea of `llama_state` correctly -- that it would allow multiple "inference threads" from a single loaded model -- then it definitely seems worth implementing, since it opens up a lot of possibilities. Is the idea that we can get a lot of the same gains by just quickly swapping out stored contexts? A lot of LLM applications benefit from having multiple instances that can build on one another, or different instances that receive diverse queries. I haven't started using it yet, because I've been waiting for someone to post an example. It's a bit hard for me to parse the API.
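For illustration, here is a purely hypothetical sketch of what such an API could look like -- none of these functions exist in llama.cpp; the names (`llama_new_state`, `llama_eval_with_state`) are invented just to show the "one model, many states" idea:

```c
// Hypothetical sketch only - llama_state was never merged; all names here
// (llama_new_state, llama_eval_with_state) are made up for illustration.
// The idea: one set of read-only weights shared by several independent
// inference states (each with its own KV cache and logits).

llama_model * model = llama_load_model("ggml-model.bin");   // hypothetical

llama_state * chat_state = llama_new_state(model);          // hypothetical
llama_state * summ_state = llama_new_state(model);          // hypothetical

// The two states evaluate independently while sharing the weights,
// so no second copy of the model is loaded into memory:
llama_eval_with_state(model, chat_state, chat_tokens, n_chat, n_past_chat, n_threads);
llama_eval_with_state(model, summ_state, summ_tokens, n_summ, n_past_summ, n_threads);
```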
-
Do we have model weight support for EnCodec? If yes, could you please tell me how to build the ggml-model.bin?
-
## High-prio
**Refactoring pass**

There is a lot of code duplication in `ggml.c` which can probably be simplified with a good set of macros. The goal is to keep the code size manageable while avoiding "macro hell" (see the sketch below).
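As a rough illustration of the direction (the macro name and ops here are made up, not actual `ggml.c` code), a single template macro can generate the repetitive per-type loops instead of hand-writing each one:

```c
// Illustrative only - not actual ggml code. One template macro generates
// the per-op vector loops that are currently duplicated by hand.
#define GGML_DEFINE_VEC_UNARY_OP(name, type, expr)             \
    static void name(const int n, type * y, const type * x) { \
        for (int i = 0; i < n; ++i) {                         \
            y[i] = (expr);                                    \
        }                                                     \
    }

// one definition per op, instead of a hand-written copy of the loop each time
GGML_DEFINE_VEC_UNARY_OP(ggml_vec_relu_f32, float, x[i] > 0.0f ? x[i] : 0.0f)
GGML_DEFINE_VEC_UNARY_OP(ggml_vec_neg_f32,  float, -(x[i]))
```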
**Optimize the AVX / AVX2 implementations of the quantization methods and add WASM SIMD**

Make sure we have optimal implementations for these instruction sets.
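For a flavor of the work involved (a minimal sketch, not ggml's actual kernels), here is the first step of block-wise quantization -- finding the maximum absolute value of a 32-float block -- written with AVX2 intrinsics:

```c
// Sketch only - not ggml's actual code. Computes max|x| over a block of
// 32 floats using AVX2, which is the first step of block quantization.
#include <immintrin.h>

static float block_absmax_avx2(const float * x) {
    const __m256 sign_mask = _mm256_set1_ps(-0.0f); // only the sign bit set
    __m256 acc = _mm256_setzero_ps();

    for (int i = 0; i < 32; i += 8) {
        __m256 v = _mm256_loadu_ps(x + i);
        // andnot clears the sign bit, giving |v|
        acc = _mm256_max_ps(acc, _mm256_andnot_ps(sign_mask, v));
    }

    // horizontal max of the 8 lanes
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 m  = _mm_max_ps(lo, hi);
    m = _mm_max_ps(m, _mm_movehl_ps(m, m));
    m = _mm_max_ss(m, _mm_shuffle_ps(m, m, 0x55));
    return _mm_cvtss_f32(m);
}
```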
**Apply the new integer quantization methods to whisper.cpp**

Will backport the latest `ggml` version to `whisper.cpp` and add support for quantized models. Will also update all WASM examples to be able to run with the quantized models.

*Update:* whisper.cpp v1.4.0 has been released. It includes integer quantization and GPU support via cuBLAS - all thanks to the great work done here.
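For reference, here is a simplified scalar sketch of the block-wise 4-bit scheme in the spirit of ggml's Q4_0 (assumptions: a plain `float` scale instead of ggml's fp16, and sequential rather than ggml's actual packing order):

```c
// Simplified sketch in the spirit of Q4_0 - not the exact ggml format.
#include <math.h>
#include <stdint.h>

#define QK 32  // elements per block

typedef struct {
    float   d;        // scale
    uint8_t qs[QK/2]; // two 4-bit values per byte
} block_q4;

static void quantize_block(const float * x, block_q4 * y) {
    // find the value with the largest magnitude, keeping its sign
    float max = 0.0f, amax = 0.0f;
    for (int i = 0; i < QK; ++i) {
        if (fabsf(x[i]) > amax) { amax = fabsf(x[i]); max = x[i]; }
    }

    const float d  = max / -8.0f;                 // maps values into [0, 15] after the +8 offset
    const float id = d != 0.0f ? 1.0f/d : 0.0f;
    y->d = d;

    for (int i = 0; i < QK/2; ++i) {
        int q0 = (int)(x[2*i + 0]*id + 8.5f);
        int q1 = (int)(x[2*i + 1]*id + 8.5f);
        if (q0 > 15) q0 = 15; if (q0 < 0) q0 = 0;
        if (q1 > 15) q1 = 15; if (q1 < 0) q1 = 0;
        y->qs[i] = (uint8_t)(q0 | (q1 << 4));
    }
}
```

Each 32-float block collapses to 20 bytes here (4-byte scale + 16 bytes of nibbles), versus 128 bytes in f32.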
**Add support for "batch inference"**

Recently, the bert.cpp project (by @skeskinen) demonstrated BERT inference using `ggml`. This model gains a lot from batch inference, which is currently not supported by `ggml`. We will extend all operators to support it. The `bert.cpp` example will serve as a playground to achieve this.

*Update:* batched forward passes have been demonstrated in the baby-llama example (thanks to @xaedes Implement backward passes for llama with small training llama from scratch example #1360). It will be great to apply the demonstrated approach to `bert.cpp` and `whisper.cpp`'s beam-search decoding in order to gain extra speed-up. A sketch of the packing idea is shown below.
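The core idea can already be sketched with the existing 2-D ops (a minimal sketch, assuming the `ggml` API as of mid-2023): pack the batch as extra columns of the input tensor, so one `ggml_mul_mat` call processes all batch elements in a single graph evaluation:

```c
// Sketch of batching via column packing, using the ggml API circa mid-2023.
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        .mem_size   = 16*1024*1024,
        .mem_buffer = NULL,
        .no_alloc   = false,
    };
    struct ggml_context * ctx = ggml_init(params);

    const int n_embd  = 256;
    const int n_batch = 8; // 8 inputs evaluated together

    struct ggml_tensor * W = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_embd);
    struct ggml_tensor * x = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_batch);

    // y has shape [n_embd, n_batch]: one output column per batch element,
    // computed in a single mul_mat instead of n_batch separate passes
    struct ggml_tensor * y = ggml_mul_mat(ctx, W, x);

    struct ggml_cgraph gf = ggml_build_forward(y);
    ggml_graph_compute(ctx, &gf);

    ggml_free(ctx);
    return 0;
}
```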
**Implement inference of new models**

There are already some very interesting models that should be supported by `ggml`:

- **Bark**
  There is huge interest in adding `ggml` support for this model (see speeding up inference suno-ai/bark#30 (comment)). The main blocker seems to be the dependency on Facebook's EnCodec codec. Still not sure how difficult it would be, but this codec is probably another model that we should try to support via `ggml`.

I'll use this section to add a note regarding new model implementations by contributors - I recommend always trying to add a very basic example implementation to the ggml repo. Having a basic example there would make long-term support much easier.
**Proof-of-concept for 100% inference on the GPU**

The goal is to make a demonstration of the idea discussed in Add GPU support to ggml #914. Very preliminary work has been started in ggml : cgraph export/import/eval example + GPU support ggml#108. Will try to get a working example using the MNIST inference.

*Update:* the MNIST inference on Apple Silicon GPU using Metal is now fully demonstrated: ggml : cgraph export/import/eval example + GPU support ggml#108 -- this is the way
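Schematically, the flow demonstrated there looks like this (a fragment, assuming the `ggml_graph_export` / `ggml_graph_import` API introduced in ggml#108):

```c
// Exporter side (CPU): build the graph for the model output and save it.
struct ggml_cgraph gf = ggml_build_forward(output);
ggml_graph_export(&gf, "mnist.ggml");

// Importer side (e.g. a Metal-backed runner): load and evaluate the graph
// without needing the original model-construction code.
struct ggml_context * ctx_data = NULL;
struct ggml_context * ctx_eval = NULL;

struct ggml_cgraph gfi = ggml_graph_import("mnist.ggml", &ctx_data, &ctx_eval);
ggml_graph_compute(ctx_eval, &gfi);
```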
## Low-prio
**Project ggml : improve threading implementation**

Better utilization of the available CPU resources via improved thread management. There have been a few efforts during April, but they remained in the "background" - need to put more focus this time. A toy sketch of the general direction is shown below.
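One common direction (a toy sketch, not ggml's actual scheduler) is a persistent worker pool that spin-waits on a generation counter, avoiding the cost of spawning and joining threads for every graph evaluation:

```c
// Toy sketch of a persistent worker pool - not ggml's implementation.
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define N_WORKERS 4

static atomic_int  g_generation = 0; // bumped by the main thread to publish work
static atomic_int  g_n_done     = 0; // workers report completion here
static atomic_bool g_stop       = false;

static void do_work(int ith) {
    printf("worker %d: processing its slice\n", ith);
}

static void * worker(void * arg) {
    const int ith = (int)(intptr_t)arg;
    int seen = 0;
    while (!atomic_load(&g_stop)) {
        const int gen = atomic_load(&g_generation);
        if (gen == seen) continue; // spin-wait for a new generation of work
        seen = gen;
        do_work(ith);
        atomic_fetch_add(&g_n_done, 1);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[N_WORKERS];
    for (int i = 0; i < N_WORKERS; ++i) {
        pthread_create(&threads[i], NULL, worker, (void *)(intptr_t)i);
    }

    // threads are created once and reused across evaluations
    for (int iter = 0; iter < 3; ++iter) {
        atomic_store(&g_n_done, 0);
        atomic_fetch_add(&g_generation, 1);         // publish one round of work
        while (atomic_load(&g_n_done) < N_WORKERS); // wait for all workers
    }

    atomic_store(&g_stop, true);
    for (int i = 0; i < N_WORKERS; ++i) {
        pthread_join(threads[i], NULL);
    }
    return 0;
}
```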
**Having second thoughts about adding llama_state**

Our experience: we added `whisper_state` in `whisper.cpp` with this PR: Added whisper state + default state on the whisper_context whisper.cpp#523. However, I haven't seen a lot of use of it, and at the same time it doubled the C API. Let me know if you think this is worth implementing.

ref: IMPORTANT: Introduce C-style API - Major Refactoring #370 (comment)
**Add 3-bit integer quantization**

It has been shown that 2-bit integer quantization does not look really useful for anything: Q2 and Q3 quantization #1004. In this case, we can probably add 3-bit integer quantization. Probably not "officially supported", but rather in a state where we can run experiments and see if we can find some application. A packing sketch is shown below.
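The awkward part of Q3 is that 3 bits do not divide a byte evenly; eight values pack exactly into three bytes. A small sketch of the packing arithmetic only (this is not an actual ggml format):

```c
// Illustration of 3-bit packing arithmetic - not an actual ggml format.
#include <stdint.h>

// pack eight 3-bit values (0..7) into exactly three bytes
static void pack8_q3(const uint8_t v[8], uint8_t out[3]) {
    uint32_t bits = 0;
    for (int i = 0; i < 8; ++i) {
        bits |= (uint32_t)(v[i] & 0x7) << (3*i);
    }
    out[0] = (uint8_t)(bits);
    out[1] = (uint8_t)(bits >> 8);
    out[2] = (uint8_t)(bits >> 16);
}

static void unpack8_q3(const uint8_t in[3], uint8_t v[8]) {
    const uint32_t bits = in[0] | ((uint32_t)in[1] << 8) | ((uint32_t)in[2] << 16);
    for (int i = 0; i < 8; ++i) {
        v[i] = (bits >> (3*i)) & 0x7;
    }
}
```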
**Add "training" support to ggml**

There is an interesting ongoing effort to add "training" support to `ggml`: How to fine tune it? ggml#8 (comment). It would be really impressive if this actually works. There might be conflicts with the refactoring pass - need to coordinate with @xaedes.

*Update:* this has been successfully completed and there is now a simple example demonstrating baby-LLaMA training: https://github.com/ggerganov/llama.cpp/blob/master/examples/baby-llama/baby-llama.cpp#L759-L770
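For a taste of the API (a minimal sketch, assuming ggml's experimental `ggml_set_param` / `ggml_opt` interface as of mid-2023), fitting a small linear map with the built-in ADAM optimizer:

```c
// Sketch using ggml's experimental training API circa mid-2023:
// minimize sum((W*x - t)^2) over the trainable parameter W.
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        .mem_size   = 16*1024*1024,
        .mem_buffer = NULL,
        .no_alloc   = false,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * W = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 8, 8);
    ggml_set_param(ctx, W); // mark W as trainable

    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8);
    struct ggml_tensor * t = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8);
    ggml_set_f32(x, 1.0f);
    ggml_set_f32(t, 0.5f);

    // loss = sum((W*x - t)^2) -- a scalar tensor
    struct ggml_tensor * y    = ggml_mul_mat(ctx, W, x);
    struct ggml_tensor * loss = ggml_sum(ctx, ggml_sqr(ctx, ggml_sub(ctx, y, t)));

    // run the built-in ADAM optimizer; it builds the backward graph internally
    enum ggml_opt_result res = ggml_opt(ctx, ggml_opt_default_params(GGML_OPT_ADAM), loss);

    ggml_free(ctx);
    return res == GGML_OPT_OK ? 0 : 1;
}
```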