
server : improvements and maintenance #4216

Open
6 of 10 tasks
ggerganov opened this issue Nov 25, 2023 · 106 comments
Labels
help wanted (Extra attention is needed) · refactoring (Refactoring server/webui)

Comments

@ggerganov
Owner

ggerganov commented Nov 25, 2023

The server example has been growing in functionality, but unfortunately I feel it is not very stable at the moment and some important features are still missing. I'm creating this issue to keep track of these points and to try to draw more attention from the community. Some of the tasks are relatively big and would require significant effort to complete.

This is likely not a complete list of things - if you think an important feature should be improved or supported, drop a comment.

Have a look at the issues labelled with server/webui.

@ggerganov ggerganov added help wanted Extra attention is needed refactoring Refactoring server/webui labels Nov 25, 2023
@ggerganov ggerganov pinned this issue Nov 25, 2023
@IridiumMaster

IridiumMaster commented Nov 25, 2023

Would love it if the server could get lookahead decoding and contrastive search. A collection of common presets would be very helpful for fast model evaluation. The ability to edit responses and replies in the UI would be very useful for rapidly testing prompt branches, especially if combined with batching capabilities. I would also appreciate a simple implementation of request queuing and a server interface for the model training example. Edit: discussion of contrastive search: #3450 - other related topics / potential substitutes are mentioned in the thread.

@ruped

ruped commented Nov 25, 2023

Thanks for raising this issue and looking into the server example.

I think this #4201 could be relevant - although it sounds like the fix will be in the core code rather than in the server.

Since the addition of support for batching, llama.cpp could become a viable competitor to vLLM for large-scale deployments. This is also helpful for individual hobbyists who are using or building AI agents, because these may make multiple requests to the LLM in parallel to construct answers. So I think your suggestions around improving the stability of and refactoring the server example would be very valuable. It would also be worth focusing on throughput, particularly for batched requests, and benchmarking it against vLLM.

@mudler
Contributor

mudler commented Nov 25, 2023

It would be lovely to also see speculative sampling added to the server - it would be a really great addition.

@tobi
Sponsor Collaborator

tobi commented Nov 25, 2023

Very excited about this! I think that the server should increasingly be thought of as the main deliverable of this repo.

There are 100s of libraries and tools that integrate different subsets of backends and inference libraries, especially in the Python world. This doesn't make sense. We need a simple convention by which everything can interoperate. The solution is to use OpenAI's API as a protocol on localhost. Could there be better standards? Maybe. But this is the one we have, and it works really well.

My suggestion is that we clean up the server and treat it and the /chat/completions endpoint as the main deliverable of this repository. We can easily switch the web interface to use that as well. ./server -m ~/model should boot with the ideal default parameters read from the gguf, like context size and (if we can pull it off) chat template style.

This means that existing code only needs the base URL overridden to work locally.


from openai import OpenAI

# any placeholder key works here - the local llama.cpp server does not require one
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-no-key-required")

completion = client.chat.completions.create(
  model="llama!",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ]
)

print(completion.choices[0].message.content)

This works already - at least as long as you are loading a model that conforms to ChatML and are OK with the default context size. I find that a much better vision for how LLM interoperability will work in the open-source space: different servers, different backends, all speaking the same protocol.

@FSSRepo
Collaborator

FSSRepo commented Nov 26, 2023

@ggerganov

Batched decoding endpoint?

Generating multiple alternatives for the same prompt requires the ability to change the seed, and the truth is I've been struggling a bit with this while adding parallel decoding, as it raises questions about how the seed should be managed.
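One possible approach (just a sketch, not how the server currently handles it - the struct and function names below are made up): derive a distinct seed for each parallel sequence from the request seed, so the alternatives differ while the request as a whole stays reproducible.

// Sketch only - helper and field names here are hypothetical, not llama.cpp API.
#include <cstdint>
#include <random>
#include <vector>

struct seq_sampler {
    uint32_t     seed;
    std::mt19937 rng;
};

// request_seed < 0 means "pick a random seed", mirroring the usual -1 convention.
static std::vector<seq_sampler> make_samplers(int64_t request_seed, size_t n_sequences) {
    std::random_device rd;
    const uint32_t base = request_seed < 0 ? rd() : (uint32_t) request_seed;

    std::vector<seq_sampler> samplers;
    samplers.reserve(n_sequences);
    for (size_t i = 0; i < n_sequences; ++i) {
        const uint32_t s = base + (uint32_t) i; // simple per-sequence offset
        samplers.push_back({s, std::mt19937(s)});
    }
    return samplers;
}

The response could then report the seed actually used for each sequence, so a client can reproduce any individual alternative.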

@spirobel

spirobel commented Nov 26, 2023

@tobi

Very excited about this! I think that the server should increasingly be thought of as the main deliverable of this repo.

LocalAI serves this usecase quite well already and has lots of traction. It is better to not compete with your customers and delight them instead. An easy to link library with a C API should be the main deliverable of this project.

@studiotatsu

The OAI API included with the server is great - I love it.
Please include the llama_params "repeat_penalty" and "min_p".

These params are much needed. Thanks.

@antcodd

antcodd commented Nov 27, 2023

I think it would be good if the OAI endpoint supported the same set of parameters and defaults as the regular endpoint, with sensible or argument-driven defaults, given that many clients won't supply all parameters.

One issue is that the seed defaults to 0 instead of -1, so every regeneration is identical if the client doesn't specify a seed.

@IridiumMaster

@tobi

Very excited about this! I think that the server should increasingly be thought of as the main deliverable of this repo.

LocalAI serves this usecase quite well already and has lots of traction. It is better to not compete with your customers and delight them instead. An easy to link library with a C API should be the main deliverable of this project.

With respect, I think the server endpoint is a different audience. LocalAI seems to be going for an "everything and the kitchen sink" approach. That's cool, and I respect the project, but what I would like from the server example is something different: Raw inference with the greatest number of capabilities at the fastest possible speed, along with tooling specifically designed to allow for large scale prompt testing of different model variants quickly and easily. This is what I would view as more of a "production" workflow as opposed to more of a hobbyist workflow. I agree with the upthread sentiment around making the server api a solid standard @tobi.

@mudler
Contributor

mudler commented Nov 27, 2023

@tobi

Very excited about this! I think that the server should increasingly be thought of as the main deliverable of this repo.

LocalAI serves this usecase quite well already and has lots of traction. It is better to not compete with your customers and delight them instead. An easy to link library with a C API should be the main deliverable of this project.

With respect, I think the server endpoint is a different audience. LocalAI seems to be going for an "everything and the kitchen sink" approach. That's cool, and I respect the project, but what I would like from the server example is something different: Raw inference with the greatest number of capabilities at the fastest possible speed, along with tooling specifically designed to allow for large scale prompt testing of different model variants quickly and easily.

Sorry to jump in off-topic, but you are not sacrificing any speed or capabilities with LocalAI - in the end the engine is always the same (llama.cpp, vLLM, or you name it). However, I see the value of having a server in llama.cpp; in the end it's people's choice what suits their needs best. Also, the LocalAI server implementation is heavily based on this one ;)

This is what I would view as more of a "production" workflow as opposed to more of a hobbyist workflow. I agree with the upthread sentiment around making the server api a solid standard @tobi.

For production there are quite a few issues that are blockers, imho, rather than this. I've had several bugs in LocalAI with llama.cpp which still make it difficult to move in that direction, and I hope they get addressed with this ticket. Things like #3969 are quite scary for production users.

@ruped

ruped commented Nov 27, 2023

Just a thought as a user of the llama.cpp server: I imagine it's quite common for the llama.cpp server to be used by developers who are able to add non-core functionality in their own code (e.g. devs create their own application, library, or REST server that wraps/orchestrates llama.cpp). Naturally the llama.cpp server is very convenient for this and works with any programming language. It also has a smaller, self-contained API to learn.

I think some of the following can be done in dev's own code outside of llama.cpp:

  • basic templating
  • Additional interfaces (e.g. OpenAI compatibility) by setting up an intermediary server that calls llama.cpp server.
  • Making batch requests (by using multiple HTTP calls to llama.cpp server)

(Disclaimer: These are just examples, I haven't fully evaluated the pros/cons of implementing them outside of llama.cpp)

It's excellent if this project has the mission and bandwidth to provide functionalities like these. But if it sounds like it's becoming too much work or feature creep, then focusing on the bits that are impossible to do outside of llama.cpp is one way to prioritise.

@dongxiaolong

Hi @ggerganov, the vLLM project has a PR under construction for chat templates that can be used as a reference: vllm-project/vllm#1756

@ggerganov
Owner Author

Regarding chat templates: I see they are using something called Jinja.

We are not going to implement Jinja in llama.cpp.

The best we can do are 2 things:

  • add an API endpoint for the clients to get the Jinja string and do whatever they want with it
  • add hardcoded implementations of common templates, where we string match the template and if it is something we know, we call a simple C++ function to tokenize as expected
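As a rough illustration of the second option (a sketch only - the enum and function below are hypothetical, and "tokenizer.chat_template" is the GGUF key the Jinja string is assumed to live under):

// Sketch only: map the Jinja string from the model metadata to one of a few
// hardcoded formatters by simple substring matching.
#include <string>

enum chat_template_type {
    CHAT_TEMPLATE_CHATML,   // <|im_start|>role\n ... <|im_end|>
    CHAT_TEMPLATE_LLAMA2,   // [INST] ... [/INST]
    CHAT_TEMPLATE_UNKNOWN,  // fall back to a default or report an error
};

// tmpl is the raw Jinja string, e.g. read from the "tokenizer.chat_template" key.
static chat_template_type detect_chat_template(const std::string & tmpl) {
    if (tmpl.find("<|im_start|>") != std::string::npos) return CHAT_TEMPLATE_CHATML;
    if (tmpl.find("[INST]")       != std::string::npos) return CHAT_TEMPLATE_LLAMA2;
    return CHAT_TEMPLATE_UNKNOWN;
}

The detected type would then select a small C++ formatting function, like the ChatML example further down in this thread.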

@Tostino

Tostino commented Nov 28, 2023

@ggerganov If you are going to hardcode templates, this server will be totally unusable for a large number of users. I am experimenting with new templates and would really rather the models trained with them be widely supported. Hell, there are so many variations of the ChatML template floating around with no indication of which is the correct version.

I mentioned on the other ticket that there is: https://github.com/jinja2cpp/Jinja2Cpp

Maybe that could be an optional component to add support for chat templates from the tokenizer, with hardcoding as the default code path - I understand not wanting to add additional dependencies.

Exposing the Jinja string to the client via an API endpoint is not helpful unless there is a client-side compatibility layer between the chat/completions and completions endpoints.

I had opened an issue for chat template support a while ago, when I started working on it for vLLM: #3810

I implemented this for vLLM, and after going through a few rounds of testing, I had to rework things and add additional parameters and CLI arguments to support the API properly.
We should very much stay on the same page for our implementations.

Here is the diff for my chat/completions endpoint changes: https://github.com/vllm-project/vllm/pull/1756/files#diff-38318677b76349044192bf70161371c88fb2818b85279d8fc7f2c041d83a9544

The important points from the vLLM pull request:

1. Addition of the `--chat-template` command-line argument to specify a chat template file or single-line template for the model.
2. Implementation of the `--response-role` command-line argument for defining the role name in chat responses when `add_generation_prompt` is set to true.
3. Update to the chat API request handling to support finishing a partial response correctly, and echoing input portions of messages (request.add_generation_prompt, and request.echo).

request.echo is an extension of the API. It exists because open-source LLMs are able to finish the last role/content pair in the messages list when request.add_generation_prompt=false (itself an extension of the API, needed to support this HF feature) and the template/model supports that feature.

We should treat add_generation_prompt as default=true, because that is the behavior of the OpenAI API. This simply allows users to override that behavior if they need it, and gives them all the required knobs to use the feature properly.
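To make the knob concrete, here is a minimal sketch (illustrative names only, not the vLLM or llama.cpp implementation) of how add_generation_prompt could affect a ChatML-style prompt when the template supports finishing the last message:

// Sketch only: render a ChatML-style prompt from a message list.
#include <string>
#include <vector>

struct chat_msg { std::string role, content; };

static std::string render_prompt(const std::vector<chat_msg> & msgs, bool add_generation_prompt) {
    std::string out;
    for (size_t i = 0; i < msgs.size(); ++i) {
        out += "<|im_start|>" + msgs[i].role + "\n" + msgs[i].content;
        // with add_generation_prompt == false, the last message is left open
        // so the model continues/finishes it instead of starting a new turn
        if (i + 1 < msgs.size() || add_generation_prompt) {
            out += "<|im_end|>\n";
        }
    }
    if (add_generation_prompt) {
        out += "<|im_start|>assistant\n"; // cue a brand-new assistant turn
    }
    return out;
}

With echo=true the server would then return the completed last message (original content plus the generated continuation) rather than only the newly generated tokens.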

@mudler
Contributor

mudler commented Nov 28, 2023

Regarding chat templates: I see they are using something called Jinja.

We are not going to implement Jinja in llama.cpp.

The best we can do are 2 things:

* add an API endpoint for the clients to get the Jinja string and do whatever they want with it

* add hardcoded implementations of common templates, where we string match the template and if it is something we know, we call a simple C++ function to tokenize as expected

My personal thoughts here, but C++ probably isn't the best language for that - templating is much easier to implement in scripting languages than in C++, and in my opinion it would undermine the maintainability and flexibility of a lean server.cpp implementation.

Just my 2c, but maybe templating fits better on top of llama-cpp-python - which might be easier to implement and maintain (while keeping the core small and extensible)?

@ggerganov
Owner Author

@Tostino

All templates that I've seen so far are so basic that I don't understand why we need an entire scripting language to express them. Is there a more advanced use case other than a basic for loop over the messages + add prefix/suffix?

How many templates do we expect to ever have? 10s, 100s? Even if it is 1000s, I prefer to have them hardcoded instead of building jinja2cpp (it takes 10 minutes!! just to run the cmake config).

Here is a sample ChatML template in a few lines of C++ that we currently use (and this is not even the best way to do it):

// json is nlohmann::json; json_value() is a small server helper that reads a
// key from the message object, falling back to a default value.
#include <sstream>
#include <string>
#include <vector>

std::string format_chatml(std::vector<json> messages)
{
    std::ostringstream chatml_msgs;

    for (auto it = messages.begin(); it != messages.end(); ++it) {
        chatml_msgs << "<|im_start|>"
                    << json_value(*it, "role",    std::string("user")) << '\n';
        chatml_msgs << json_value(*it, "content", std::string(""))
                    << "<|im_end|>\n";
    }

    chatml_msgs << "<|im_start|>assistant" << '\n';

    return chatml_msgs.str();
}

I could be missing something, but for the moment I don't see a good reason to add Jinja support. Let's see how it goes - I'm open to reconsidering, but I need to see some reasonable examples and use cases that justify this dependency.


request.echo is an extension of the API. It exists because open-source LLMs are able to finish the last role/content pair in the messages list when request.add_generation_prompt=false (itself an extension of the API, needed to support this HF feature) and the template/model supports that feature.

We should treat add_generation_prompt as default=true, because that is the behavior of the OpenAI API. This simply allows users to override that behavior if they need it, and gives them all the required knobs to use the feature properly.

I think I understand the request.add_generation_prompt parameter, but I don't understand request.echo - can you clarify / give an example?

@mudler

Yes, I agree.

@Tostino

Tostino commented Nov 28, 2023

The fact is, if the rest of the ecosystem standardizes on these templates as "the way" to format messages, they will proliferate into new and unexpected use cases.

python3 -m vllm.entrypoints.openai.api_server --model teknium/OpenHermes-2.5-Mistral-7B --chat-template ./examples/template_inkbot.jinja

Here is an example call using my inkbot template which uses echo:

curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer " \
  -d '{
    "model": "teknium/OpenHermes-2.5-Mistral-7B",
    "stream": false,
    "stop": ["\n<#bot#>","\n<#user#>"],
    "add_generation_prompt": false,
    "echo": true,
    "temperature": 0.0,
    "n": 1,
    "messages": [
	{"role": "meta-current_date", "content": "2023-10-20"},
	{"role": "meta-task_name", "content": "general"},
	{"role": "system", "content": "You are a helpful assistant."},
	{"role": "user", "content": "Hello!"},
	{"role": "assistant", "content": "Hello, how are you?"},
	{"role": "user", "content": "Great, thank you! Now i would like some help planning my weekend in Asheville, I would like some hike suggestions. Please list out a few moderate"}	
	]
  }'

Which returns:

{"id":"cmpl-bb73e8eefb164c3194bb2b450369e1c6","object":"chat.completion","created":195778,"model":"teknium/OpenHermes-2.5-Mistral-7B","choices":[{"index":0,"message":{"role":"user","content":"Great, thank you! Now i would like some help planning my weekend in Asheville, I would like some hike suggestions. Please list out a few moderate to difficult hikes in the area."},"finish_reason":"stop"}],"usage":{"prompt_tokens":107,"total_tokens":121,"completion_tokens":14}}

vs with "echo": false:

{"id":"cmpl-86ba4dd235a84b8e9a7361b46b04ac79","object":"chat.completion","created":195723,"model":"teknium/OpenHermes-2.5-Mistral-7B","choices":[{"index":0,"message":{"role":"user","content":" to difficult hikes in the area."},"finish_reason":"stop"}],"usage":{"prompt_tokens":107,"total_tokens":121,"completion_tokens":14}}

Since the official OpenAI API for chat/completions doesn't allow you to complete an incomplete message, there was no point for them to implement echo in the chat/completions endpoint. The HF chat_template spec explicitly supports that feature with the add_generation_prompt parameter, so it made sense to implement echo for ease of use. It is an extension of the API, which is why I was calling it out though. I tried to choose the most likely behavior / keywords if OpenAI ever did expand their API to add echo.

Edit:
Yeah, 10 min for a cmake is painful... Unsure what the best way forward is to be honest.
But without actual support for the chat template that the model creator defined, this isn't usable for me (and many others).

@FSSRepo
Collaborator

FSSRepo commented Nov 28, 2023

In my opinion, most of these ggml-based projects share the characteristic of being very lightweight with few dependencies (header-only libraries: httplib.h, json.hpp, stb_image.h, and others), making them portable compared to having to download a 2 GB library like PyTorch and an entire Python environment that pulls in packages that will never be used.

Adding overly heavy dependencies, especially those dependent on an external language like Python, seems to go against the idea of this type of project.

@Tostino

Tostino commented Nov 28, 2023

In my opinion, most of these ggml-based projects share the characteristic of being very lightweight with few dependencies (header-only libraries: httplib.h, json.hpp, stb_image.h, and others), making them portable compared to having to download a 2 GB library like PyTorch and an entire Python environment that pulls in packages that will never be used.

Adding overly heavy dependencies, especially those dependent on an external language like Python, seems to go against the idea of this type of project.

Absolutely no one is advocating for a whole pytorch dependency chain. There just may be other options for running the jinja that don't bloat the dependency chain too badly, and I very much think it's worth discussing further to see if there is an acceptable solution that can be found.

Even if it's something like transpiling Jinja to another language that we can run directly, or providing hooks for users to run a Python interpreter with the Jinja dependency and feed the results back to the main C++ program - that way it can be optional and fall back to the hardcoded options when unavailable.

Just some thoughts, take them for what you will, I am not a cpp dev.

@FSSRepo
Collaborator

FSSRepo commented Nov 28, 2023

I would suggest creating a small utility in C++ that implements just the functionality we are interested in (i.e. porting it).

Taking a quick look at the Jinja2Cpp library, it has Boost as a dependency, which explains the long CMake configuration time. It could be beneficial to decouple from that library and include only the functions needed for the templates to work, making it more lightweight.

@psugihara
Contributor

psugihara commented Nov 28, 2023

@tobi completely agree that server.cpp should be a first-class focus of this repo. My macOS app uses exactly the architecture you describe, hitting server on localhost. I would note however that iOS apps cannot include executables so server.cpp won't work in at least that case. Tangentially, it might make sense to pull some of the common completion/tokenizing/batching/parallelization functionality being added to server.cpp into the llama.cpp core so that each platform doesn't have to rewrite completion_loop, etc.


I also wanted to throw in an example of some ugly code I'd love to kill with built-in server.cpp templating. I'm guessing every server.cpp client has some version of this and I'm sure they all have slightly different bugs: https://github.com/psugihara/FreeChat/blob/main/mac/FreeChat/Models/NPC/PromptTemplates/Templates.swift

@Tostino After understanding more of the background here, I agree that ideally we'd want to support the jinja templates included in GGUFs. I didn't even know these were added to GGUF, that's so cool! Unfortunately I'm not seeing a ton of existing work in cpp besides the relatively heavyweight jinja2cpp you found as well. Implementing a minimal jinja2 parser seems out of scope for v1 of template support but perhaps a more incremental compromise could work...

  1. add an endpoint for retrieving the jinja template, allowing clients to skip parsing the gguf themselves if they want to run the template directly
  2. this endpoint could indicate whether the template is supported by server.cpp itself (server.cpp could hardcode a cpp template implementation + hash of a corresponding jinja template for example).
  3. when requesting a chat completion, the client could indicate whether they've already templated their input

I agree with @ggerganov that the templates are pretty trivial to implement in c++ or whatever and I'd first and foremost just like to have them all in one place (ideally llama.cpp) rather than bespoke implementations in each client. A mapping from jinja template hashes to c++ functions would be the most performant setup too, even if it's a bit ugly conceptually.

If templates are added here, I can delete my implementation in FreeChat so we'll have net 0 fragmentation :)

@Tostino

Tostino commented Nov 28, 2023

when requesting a chat completion, the client could indicate whether they've already templated their input

That isn't possible. You can template your response on the client side, but then you need to hit the legacy completion endpoint, because the payload for chat/completion doesn't support a formatted string, just a list of messages with role/content.

@psugihara
Contributor

then you need to hit the legacy completion endpoint

For my use-case that would be fine. Though it does look like there are some non-standard args supported by server's chat/completion already (e.g. mirostat).

@wizzard0
Contributor

wizzard0 commented Nov 29, 2023

May I add my 2c?

I'd very much prefer to keep OAI -> llama and tokens -> tokens parts separate and convert on the "proxy".
IMHO Jinja is a terrible middle ground which is both complicated and not flexible enough. See below for examples.

What would be useful on the server.cpp side is more APIs for the "OAI -> llama" converter service:

  • cache usage; current request details;
  • aborting requests-in-progress
  • model metadata as parsed by the server (even if "converter" starts the server binary itself it has no way to read it apart from parsing the stderr)
  • command line args/build args (is batching enabled? LORAs? RoPE params? context? slots? etc);

I can't imagine trying to cram all the hacks I've found useful on the "OAI -> llama" side into the C++ binary without it devolving into an unmaintainable mess.

Some examples:

  1. Models are very finicky re chat templates. Often adding/removing whitespace, BOS/EOS etc, using a different template, repeating the system prompt etc improves the output drastically (tested with temperature 0 and fixed seed ofc)
  2. so I often find myself generating multiple outputs with different templates/samplers (!) and returning a single result later
  3. or re-prompting to return a single long reply on small-context models
  4. server-side templating is a black box, and watching the raw formatted prompt is very, very useful, e.g. if you're trying to debug why the grammar doesn't match anything
  5. Another hack I've found useful is to combine chat templating with a custom prefix for the last response when the reply without that prefix is bad, and converting the results with custom code
  6. ...in particular, the grammar + pre/post-conversion enables quite OK function calling. But it's prompt-dependent as well, so even full Jinja won't help, not to mention debugging.

@cztomsik
Contributor

cztomsik commented Nov 29, 2023

you need to hit the legacy completion endpoint

I am doing exactly that in my project; everything is "client-side" because I can then easily "complete" messages (the model refuses to answer, so I edit the message to start with "Sure, here's how to" and let the model fill in the rest).

And I know nobody asked, but adding Jinja to a C++ project is a terrible idea.

@ggerganov
Owner Author

Yes, let's add it

What is the plan to add it back please ?

I would start with cleaning up the clip/llava libs and improving the API. I felt that in the old implementation there were many internal objects exposed to the server and the memory management was dubious. Also, there was no obvious path for supporting parallel multimodal slots.

Before working on the server, we should improve llava-cli and the API for using LLaVA.

@teleprint-me
Contributor

teleprint-me commented Mar 12, 2024

Would it be alright to decouple the client from the build process?

Building everything I need to test a new update to the UI/UX is painful because I have to go through extra steps and the CORS implementation blocks me from locally developing since it's designed to prevent exactly that kind of thing.

I really don't see a valid rationale for downloading the preact scripts and deps, converting them to a C/C++ source file, and then building it, when you could just open the file, read in the text, and pass it to the lambda function handling it (honestly, this entire process is incredibly convoluted).

I was playing around with the source code that's used to concatenate the scripts together and it seems to be using unpkg in the background. esm.sh seems like it might be a viable alternative as I've been using it to prototype a new UI which allows me to skip building at all. The only caveat to esm.sh I see is that it's a remote dependency.

Regardless, using tools like preact, marked, etc. relies on npm and creates a build dependency as a result. That's usually why I stick with vanilla HTML, CSS, and JS unless it would be extremely cumbersome to do otherwise (e.g. syntax highlighting is really complicated).

Any feedback is appreciated.

@ngxson
Collaborator

ngxson commented Mar 12, 2024

Building everything I need to test a new update to the UI/UX is painful because I have to go through extra steps and the CORS implementation blocks me from locally developing since it's designed to prevent exactly that kind of thing.

@teleprint-me As I remember, there's a --path option in the server to specify the path to static assets (I didn't test it though). For example, if you have a file at ./test/myfile.html and you run the server with --path ./test, the file can be accessed at http://localhost:8080/myfile.html

Of course, it would be ideal to simply use everything from a specified --path. I tried to do just that in #5939. My idea was: if --path is specified, we read the files on disk and don't use the embedded html/js. However, it's blocked by the fact that there's a guide in the README on extending the embedded completion.js. So in the end, I couldn't change this behavior.

Also, in #5762 I noted that the hpp generated from deps.sh is quite redundant. Instead, it could be generated at build time, like how we generate the build-info.cpp file. In fact, I was halfway through implementing that idea when I realized it won't work if we build on Windows 😢 - basically I'd need to write another set of scripts just for Windows, and I'm very bad at using Windows...

vanilla html, css, and js

Exactly - that's what I initially thought. I don't think the current frontend implementation is that complicated; plain vanilla is enough. We don't even need to break it into multiple files - one giant index.html still works fine.

Personally, for anything more complicated than that, I prefer using vitejs since it can be built into one single html file, which can then be embedded into other projects easily. A while ago I even did a visual demo for my company (to be used in a showroom). The idea of having a single html file turns out to be very handy - easy to send via email, works offline, ...

@Azeirah
Contributor

Azeirah commented Mar 12, 2024

I'm not entirely certain what kind of role the frontend should take on. If it's really really basic, then it's not desirable to use for anything other than just quickly testing the model. If it's complex then people will be asking for all kinds of features, like multiple chats, templates, characters or what have you.

There are already many clients, even ones specifically written for llama.cpp. I'm afraid that whatever we make will never be good enough, unless we offer something that's clearly the absolute minimum so that people don't get big expectations for a built-in client.

What do we want to achieve with an html frontend?

@psugihara
Contributor

What do we want to achieve with an html frontend?

Personally I like what the current frontend achieves... an extremely minimal example of interacting with the server and getting (impressively fast) streaming responses. Seeing the speed of responses in the server example is what convinced me to use it in my project. In my mind it's not really intended to be used as a daily driver, more as a test harness.

I 100% agree that this is not the right place to make yet another front-end, and feature creep in the UI/UX layer of the chat front-end is a distraction from building the fastest, most bulletproof llama.cpp server. I maintain a SwiftUI front-end, so I have some first-hand experience with the amount of work it entails (much!).

@teleprint-me
Contributor

teleprint-me commented Mar 12, 2024

I think it's important to focus on the fact that the /public path should be completely decoupled from the build process.

A user should be able to substitute any source files within this path. This is how a server operates in practice; I don't need to rebuild the server/container every time I update the index.html.

EDIT: I was finally able to test the CLI option recommended by @ngxson and it does work as expected.

server -m /mnt/valerie/models/mistralai/Mistral-7B-Instruct-v0.2/ggml-model-q4_0.gguf --ctx-size 8192 --n-gpu-layers 32 --path ~/Local/llama/llama_cpp_client/app/

Me offering to flesh out the UI/UX is just a bonus. I can fix it, improve it, and keep it lightweight, with a less complex pipeline that uses less code and is still just as fast.

The proposed interface would use vanilla HTML, CSS, and JavaScript, with the following libraries as dependencies: Marked, Highlight, and MathJax. These libraries would be hot-linked via a CDN to keep the codebase lightweight and bypass the need for a build process. The interface's design would be minimalistic, focusing on usability, functionality, and aesthetics.

I have a couple of UIs I prototyped and they're so much faster than anything else I've used so far. The UX would be more intuitive as well. I can do it in about the same amount of SLoC that preact used and reduce the pipeline dependencies and processes as a result, i.e. the server build and client become agnostic and decoupled from one another.

This allows users to do their own thing while giving llama.cpp an enhanced, minimalistic UI/UX out of the box. It reduces cognitive load, developer overhead, feature creep, and more.

The only caveat is I won't be able to focus on it completely until after mid April. This should be enough time to discover some form of consensus.

@teleprint-me
Contributor

As an aside, unless someone beats me to it (they will, because it's low-hanging fruit), enabling the completions API to receive either a list of completions or a single completion should be fairly trivial. I think it hasn't been done yet mostly due to the hype and excitement around chat models.

Completions are incredibly powerful and useful and I use them as much as I use chat, if not more than chat.

Implementing FIM has proven to be more difficult and less intuitive.

For example, I can use Llama, Mistral, or any other well trained/fine-tuned model to perform a code completion, and none of these models utilize FIM. Where FIM proves its usefulness is in including the "surrounding" context and directing the model towards the text you're interested in modifying, e.g. the Refact model. Utilizing these capabilities requires full control over the model's prompt, though. I found that using completions aids in this endeavor.

@FSSRepo
Collaborator

FSSRepo commented Apr 8, 2024

@ggerganov Is there any way to do the following with llama.cpp?

(diagram omitted)

Sorry for my poor use of Paint, but I wanted to convey my idea for improving the way we handle requests from different clients more efficiently and conveniently, at least for applications like chatbots. When a client sends a PDF document for processing, other clients shouldn't get stuck and should continue to receive tokens continuously.

@slaren
Collaborator

slaren commented Apr 8, 2024

@FSSRepo to elaborate on what I mentioned before. The way you can do this is by splitting all the work into small batches. If the server receives a 2048 token prompt from one client, you don't process all of it in a single evaluation. For example evaluate the first 512 tokens only. If it receives another request from another client for 256 tokens in the meanwhile, then in the next batch evaluate 256 tokens from the first client and 256 from the new client. And so on until all the requests have been processed, with every new batch distributing the queued work fairly across the different clients.
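A minimal sketch of this idea (illustrative only - the struct and function below are not the actual server code): each pending request keeps a cursor into its prompt, and every iteration the n_batch token budget is divided among the requests that still have prompt tokens left.

// Sketch only: decide how many prompt tokens each slot contributes to the next batch.
#include <algorithm>
#include <cstddef>
#include <vector>

struct slot_state {
    size_t n_prompt    = 0; // total prompt tokens for this client
    size_t n_processed = 0; // prompt tokens already evaluated
    size_t remaining() const { return n_prompt - n_processed; }
};

std::vector<size_t> plan_next_batch(const std::vector<slot_state> & slots, size_t n_batch) {
    std::vector<size_t> plan(slots.size(), 0);

    size_t n_active = 0;
    for (const auto & s : slots) {
        if (s.remaining() > 0) n_active++;
    }
    if (n_active == 0) return plan;

    // even split of the batch budget across the slots that still have work
    const size_t fair_share = std::max<size_t>(1, n_batch / n_active);

    size_t budget = n_batch;
    for (size_t i = 0; i < slots.size() && budget > 0; ++i) {
        const size_t take = std::min({slots[i].remaining(), fair_share, budget});
        plan[i]  = take;
        budget  -= take;
        // any leftover budget could be redistributed in a second pass (omitted)
    }
    return plan;
}

Each slot's planned tokens would then be appended to a single llama_batch under that slot's sequence id before calling llama_decode.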

@ggerganov
Owner Author

Yes, what @slaren said. The API is flexible and the server implements just one possible way of batching the inputs.

@phymbert
Collaborator

phymbert commented Apr 9, 2024

If the server receives a 2048 token prompt from one client, you don't process all of it in a single evaluation. For example evaluate the first 512 tokens only. If it receives another request from another client for 256 tokens in the meanwhile, then in the next batch evaluate 256 tokens from the first client and 256 from the new client. And so on until all the requests have been processed

Maybe we can divide the batch size by the number of slots at the LOAD_PROMPT command to determine the maximum number of prompt tokens a slot can include in the current batch. I will have a look.

@c608345

c608345 commented Apr 13, 2024

Server HTTP API CORS is broken: #6544.

@jboero
Contributor

jboero commented Apr 23, 2024

I put together a PR with some themes and an example of how people can skin/re-theme it. Anyone interested in approving? Note that graphic design is NOT my specialty.

#6848

@jboero
Contributor

jboero commented Apr 23, 2024

(screenshot of the re-themed server UI omitted)

@Kartoffelsaft
Contributor

I noticed that the OpenAI semi-compatible API defaults the temperature to 0, but this differs from OpenAI's actual API, which defaults to 1. This can make applications that assume a non-zero default (such as Oatmeal) much more frustrating to use. Curious if there's a reason the default differs? If it's worth changing, I've already set up a PR: #7226

@jukofyork
Contributor

jukofyork commented May 22, 2024

I've thought of something that I can't tell whether it has already been suggested, since I'm not sure what words best describe it:

--override-kv KEY=TYPE:VALUE
                            advanced option to override model metadata by key. may be specified multiple times.
                            types: int, float, bool, str. example: --override-kv tokenizer.ggml.add_bos_token=bool:false

It would be very helpful if we could somehow send a set of overrides for the parameters sent via the API (possibly using a JSON file as input). I originally wanted to use the logit-bias option via the shell script I use to run the server, but realized that unless the front-ends know about all the options, it's not possible to use them.

This would also solve the above person's problem of the temperature defaulting to zero, and would let them override it.

I think it would be pretty straightforward to implement: you could just have a JSON file as input that follows the exact same format as the API uses, and then use it to set new default values.
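Something along these lines (a sketch only, assuming a hypothetical --request-defaults-file flag; none of this exists in llama.cpp today):

// Sketch only: load a JSON file that uses the same keys as the API request
// body and use it for default (or frozen) values when building the
// per-request parameters.
#include <fstream>
#include <string>
#include "json.hpp"   // nlohmann::json, already vendored by the server example

using json = nlohmann::json;

static json load_request_defaults(const std::string & path) {
    std::ifstream f(path);
    return f ? json::parse(f) : json::object();
}

// Merge order: built-in defaults < file defaults < client request.
// If freeze is true, the file values win over whatever the client sends.
static json merge_request(const json & defaults, const json & request, bool freeze) {
    json out = request;
    for (auto it = defaults.begin(); it != defaults.end(); ++it) {
        if (freeze || !out.contains(it.key())) {
            out[it.key()] = it.value();
        }
    }
    return out;
}

The freeze flag corresponds to the second usage mode described further below.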

It's also possible this CLI option could be adapted:

  -spf FNAME, --system-prompt-file FNAME
                            set a file to load a system prompt (initial prompt of all slots), this is useful for chat applications. 

as it already loads and parses a JSON file for a few parameters....

If anybody is interested then I can probably take a look over the weekend at adding this?

It could work in two ways:

  1. As an override of the default parameters, which can then subsequently be changed by setting them in the API calls as normal. This fixes the problem the person above and I have.
  2. As a frozen parameter, to stop, say, family members from doing daft stuff like setting -ngl too high and crashing the server with OOM errors, etc. In this case the parameter sent from the API would be ignored.

@bionicles

bionicles commented May 28, 2024

Well, I suppose it's a breaking change and not really germane to the server per se, but I propose the leadership consider renaming the project to something other than "llama", because that's one model from one company, and this framework can surely be useful for other models besides Llama from Meta, right?

I also think it's distasteful, but impossible to avoid, how the following rules and prohibitions on learning exist purely to stifle innovation in the AI industry:

https://ai.meta.com/llama/license/

"""
v. You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Llama 2 or derivative works thereof).

  1. Additional Commercial Terms. If, on the Llama 2 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights.

... non-offensive parts omitted for brevity ...

c. If you institute litigation or other proceedings against Meta or any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Llama Materials or Llama 2 outputs or results, or any portion of any of the foregoing, constitutes infringement of intellectual property or other rights owned or licensable by you, then any licenses granted to you under this Agreement shall terminate as of the date such litigation or claim is filed or instituted. You will indemnify and hold harmless Meta from and against any claim by any third party arising out of or related to your use or distribution of the Llama Materials.
"""

What makes us confident it's a good idea to tie the name of this project to Meta's open-weights but not OSI-approved-licensed LLM? E.g. with an eye on future options to use other AI models, are we sure we ought to sink hundreds of engineer-years into a project named after a corporate AI model with explicitly monopolistic business terms?

I don't mean to mount a crusade here; it just seems reasonable to rename llama.cpp to something that sounds useful for more than just talking to Llama (which produces outputs we can't even use for other future AI!), and to post this here because this is where the feedback link leads. Bottom line for me personally: I wouldn't ever use this for Llama, but I'd consider it for Mistral.
