
server : improvements and maintenance #4216

Open
6 of 10 tasks
ggerganov opened this issue Nov 25, 2023 · 106 comments
Labels
help wanted (Extra attention is needed) · refactoring (Refactoring server/webui)

Comments

@ggerganov
Owner

ggerganov commented Nov 25, 2023

The server example has been growing in functionality, but unfortunately I feel it is not very stable at the moment and some important features are still missing. I'm creating this issue to keep track of these points and to try to draw more attention from the community. Some of the tasks are relatively big and would require significant effort to complete.

This is likely not a complete list of things - if you think an important feature should be improved or supported, drop a comment.

Have a look at the issues labelled with server/webui.

@ggerganov ggerganov added help wanted Extra attention is needed refactoring Refactoring server/webui labels Nov 25, 2023
@ggerganov ggerganov pinned this issue Nov 25, 2023
@IridiumMaster

IridiumMaster commented Nov 25, 2023

Would love it if the server could get lookahead decoding and contrastive search. A collection of common presets would be very helpful for fast model evaluation. The ability to edit responses and replies in the UI would be very useful for rapidly testing prompt branches, especially if combined with batching capabilities. I would also appreciate a simple implementation of request queuing and a server interface for the model training example. Edit: discussion of contrastive search: #3450 - other related topics / potential substitutes are mentioned in the thread.

@ruped

ruped commented Nov 25, 2023

Thanks for raising this issue and looking into the server example.

I think this #4201 could be relevant - although it sounds like the fix will be in the core code rather than in the server.

Since the addition of support for batching, llama.cpp could become a viable competitor to vLLM for large-scale deployments. This is also helpful for individual hobbyists who are using or building AI agents, because these may make multiple requests to the LLM in parallel to construct answers. So I think your suggestions around improving the stability of and refactoring the server example would be very valuable. It would also be worth focusing on throughput, particularly for batched requests, and benchmarking it against vLLM.

@mudler
Contributor

mudler commented Nov 25, 2023

It would be lovely to also see speculative sampling added to the server - it would be a really great addition.

@tobi
Sponsor Collaborator

tobi commented Nov 25, 2023

Very excited about this! I think that the server should increasingly be thought of as the main deliverable of this repo.

There are 100s of libraries and tools that integrate different subsets of backends and inference libraries, especially in the Python world. This doesn't make sense. We need a simple convention by which everything can interoperate. The solution is to use OpenAI's API as a protocol on localhost. Could there be better standards? Maybe. But this is the one we have, and it works really well.

My suggestion is that we clean up the server and treat it and the /chat/completions endpoint as the main deliverable of this repository. We can easily switch the web interface to use that as well. ./server -m ~/model should boot with the ideal default parameters read from the gguf, like context size and (if we can pull it off) chat template style.

This means that existing code only needs the base URL overridden to work locally.


from openai import OpenAI

# any placeholder key works here - the local llama.cpp server does not require one
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-no-key-required")

completion = client.chat.completions.create(
  model="llama!",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ]
)

print(completion.choices[0].message.content)

This works already - at least as long as you are loading a model that conforms to ChatML and are OK with the default context size. I find that a much better vision for how LLM interoperability will work in the open-source space: different servers, different backends, all speaking the same protocol.

@FSSRepo
Collaborator

FSSRepo commented Nov 26, 2023

@ggerganov

Batched decoding endpoint?

Generating multiple alternatives for the same prompt requires the ability to change the seed, and the truth is I've been struggling a bit with this while adding parallel decoding, as it raises questions about how the seed should be managed.
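One possible approach (just a sketch, not how the server currently handles it - the struct and function names below are made up): derive a distinct seed for each parallel sequence from the request seed, so the alternatives differ while the request as a whole stays reproducible.

// Sketch only - helper and field names here are hypothetical, not llama.cpp API.
#include <cstdint>
#include <random>
#include <vector>

struct seq_sampler {
    uint32_t     seed;
    std::mt19937 rng;
};

// request_seed < 0 means "pick a random seed", mirroring the usual -1 convention.
static std::vector<seq_sampler> make_samplers(int64_t request_seed, size_t n_sequences) {
    std::random_device rd;
    const uint32_t base = request_seed < 0 ? rd() : (uint32_t) request_seed;

    std::vector<seq_sampler> samplers;
    samplers.reserve(n_sequences);
    for (size_t i = 0; i < n_sequences; ++i) {
        const uint32_t s = base + (uint32_t) i; // simple per-sequence offset
        samplers.push_back({s, std::mt19937(s)});
    }
    return samplers;
}

The response could then report the seed actually used for each sequence, so a client can reproduce any individual alternative.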

@spirobel

spirobel commented Nov 26, 2023

@tobi

Very excited about this! I think that the server should increasingly be thought of as the main deliverable of this repo.

LocalAI serves this usecase quite well already and has lots of traction. It is better to not compete with your customers and delight them instead. An easy to link library with a C API should be the main deliverable of this project.

@studiotatsu

The OAI API included with the server is great - I love it.
Please include the llama_params "repeat_penalty" and "min_p".

These params are much needed. Thanks.

@antcodd

antcodd commented Nov 27, 2023

I think it would be good if the OAI endpoint supported the same set of parameters and defaults as the regular endpoint, with sensible or argument-driven defaults, given that many clients won't supply all parameters.

One issue is that the seed defaults to 0 instead of -1, so every regeneration is identical if the client doesn't specify a seed.

@IridiumMaster

@tobi

Very excited about this! I think that the server should increasingly be thought of as the main deliverable of this repo.

LocalAI serves this usecase quite well already and has lots of traction. It is better to not compete with your customers and delight them instead. An easy to link library with a C API should be the main deliverable of this project.

With respect, I think the server endpoint is a different audience. LocalAI seems to be going for an "everything and the kitchen sink" approach. That's cool, and I respect the project, but what I would like from the server example is something different: Raw inference with the greatest number of capabilities at the fastest possible speed, along with tooling specifically designed to allow for large scale prompt testing of different model variants quickly and easily. This is what I would view as more of a "production" workflow as opposed to more of a hobbyist workflow. I agree with the upthread sentiment around making the server api a solid standard @tobi.

@mudler
Contributor

mudler commented Nov 27, 2023

@tobi

Very excited about this! I think that the server should increasingly be thought of as the main deliverable of this repo.

LocalAI serves this usecase quite well already and has lots of traction. It is better to not compete with your customers and delight them instead. An easy to link library with a C API should be the main deliverable of this project.

With respect, I think the server endpoint is a different audience. LocalAI seems to be going for an "everything and the kitchen sink" approach. That's cool, and I respect the project, but what I would like from the server example is something different: Raw inference with the greatest number of capabilities at the fastest possible speed, along with tooling specifically designed to allow for large scale prompt testing of different model variants quickly and easily.

Sorry to jump in off-topic, but you are not sacrificing any speed or capabilities with LocalAI - in the end the engine is always the same (llama.cpp, vLLM, or you name it). However, I see the value of having a server in llama.cpp; in the end it's people's choice what suits their needs best. Also, the LocalAI server implementation is heavily based on this one ;)

This is what I would view as more of a "production" workflow as opposed to more of a hobbyist workflow. I agree with the upthread sentiment around making the server api a solid standard @tobi.

For production there are quite a few issues that are blockers, imho, rather than this. I've had several bugs in LocalAI with llama.cpp which still make it difficult to move in that direction, and I hope they get addressed with this ticket. Things like #3969 are quite scary for production users.

@ruped

ruped commented Nov 27, 2023

Just a thought as a user of the llama.cpp server: I imagine it's quite common for the llama.cpp server to be used by developers who are able to add non-core functionality in their own code (e.g. devs create their own application, library, or REST server that wraps/orchestrates llama.cpp). Naturally the llama.cpp server is very convenient for this and works with any programming language. It also has a smaller, self-contained API to learn.

I think some of the following can be done in dev's own code outside of llama.cpp:

  • basic templating
  • Additional interfaces (e.g. OpenAI compatibility) by setting up an intermediary server that calls llama.cpp server.
  • Making batch requests (by using multiple HTTP calls to llama.cpp server)

(Disclaimer: These are just examples, I haven't fully evaluated the pros/cons of implementing them outside of llama.cpp)

It's excellent if this project has the mission and bandwidth to provide functionalities like these. But if it sounds like it's becoming too much work or feature creep, then focusing on the bits that are impossible to do outside of llama.cpp is one way to prioritise.

@dongxiaolong

Hi @ggerganov, the vLLM project has a PR under construction for chat templates that can be used as a reference: vllm-project/vllm#1756

@ggerganov
Owner Author

Regarding chat templates: I see they are using something called Jinja.

We are not going to implement Jinja in llama.cpp.

The best we can do are 2 things:

  • add an API endpoint for the clients to get the Jinja string and do whatever they want with it
  • add hardcoded implementations of common templates, where we string match the template and if it is something we know, we call a simple C++ function to tokenize as expected
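As a rough illustration of the second option (a sketch only - the enum and function below are hypothetical, and "tokenizer.chat_template" is the GGUF key the Jinja string is assumed to live under):

// Sketch only: map the Jinja string from the model metadata to one of a few
// hardcoded formatters by simple substring matching.
#include <string>

enum chat_template_type {
    CHAT_TEMPLATE_CHATML,   // <|im_start|>role\n ... <|im_end|>
    CHAT_TEMPLATE_LLAMA2,   // [INST] ... [/INST]
    CHAT_TEMPLATE_UNKNOWN,  // fall back to a default or report an error
};

// tmpl is the raw Jinja string, e.g. read from the "tokenizer.chat_template" key.
static chat_template_type detect_chat_template(const std::string & tmpl) {
    if (tmpl.find("<|im_start|>") != std::string::npos) return CHAT_TEMPLATE_CHATML;
    if (tmpl.find("[INST]")       != std::string::npos) return CHAT_TEMPLATE_LLAMA2;
    return CHAT_TEMPLATE_UNKNOWN;
}

The detected type would then select a small C++ formatting function, like the ChatML example further down in this thread.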

@Tostino

Tostino commented Nov 28, 2023

@ggerganov If you are going to hardcode templates, this server will be totally unusable for a large number of users. I am experimenting with new templates and would really rather the models trained with them be widely supported. Hell, there are so many variations of the ChatML template floating around with no indication of which is the correct version.

I mentioned on the other ticket that there is: https://github.com/jinja2cpp/Jinja2Cpp

Maybe that could be an optional component to add support for chat templates from the tokenizer, with hardcoding as the default code path - I understand not wanting to add additional dependencies.

Exposing the Jinja string to the client via an API endpoint is not helpful unless there is a client-side compatibility layer between the chat/completions and completions endpoints.

I had opened an issue for chat template support a while ago, when I started working on it for vLLM: #3810

I implemented this for vLLM, and after going through a few rounds of testing, I had to rework things and add additional parameters and CLI arguments to support the API properly.
We should very much stay on the same page for our implementations.

Here is the diff for my chat/completions endpoint changes: https://github.com/vllm-project/vllm/pull/1756/files#diff-38318677b76349044192bf70161371c88fb2818b85279d8fc7f2c041d83a9544

The important points from the vLLM pull request:

1. Addition of the `--chat-template` command-line argument to specify a chat template file or single-line template for the model.
2. Implementation of the `--response-role` command-line argument for defining the role name in chat responses when `add_generation_prompt` is set to true.
3. Update to the chat API request handling to support finishing a partial response correctly, and echoing input portions of messages (request.add_generation_prompt, and request.echo).

request.echo is an extension of the API. It exists because open-source LLMs are able to finish the last role/content pair in the messages list when request.add_generation_prompt=false (itself an extension of the API, needed to support this HF feature) and the template/model supports that feature.

We should treat add_generation_prompt as default=true, because that is the behavior of the OpenAI API. This simply allows users to override that behavior if they need it, and gives them all the required knobs to use the feature properly.
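To make the knob concrete, here is a minimal sketch (illustrative names only, not the vLLM or llama.cpp implementation) of how add_generation_prompt could affect a ChatML-style prompt when the template supports finishing the last message:

// Sketch only: render a ChatML-style prompt from a message list.
#include <string>
#include <vector>

struct chat_msg { std::string role, content; };

static std::string render_prompt(const std::vector<chat_msg> & msgs, bool add_generation_prompt) {
    std::string out;
    for (size_t i = 0; i < msgs.size(); ++i) {
        out += "<|im_start|>" + msgs[i].role + "\n" + msgs[i].content;
        // with add_generation_prompt == false, the last message is left open
        // so the model continues/finishes it instead of starting a new turn
        if (i + 1 < msgs.size() || add_generation_prompt) {
            out += "<|im_end|>\n";
        }
    }
    if (add_generation_prompt) {
        out += "<|im_start|>assistant\n"; // cue a brand-new assistant turn
    }
    return out;
}

With echo=true the server would then return the completed last message (original content plus the generated continuation) rather than only the newly generated tokens.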

@mudler
Contributor

mudler commented Nov 28, 2023

Regarding chat templates: I see they are using something called Jinja.

We are not going to implement Jinja in llama.cpp.

The best we can do are 2 things:

* add an API endpoint for the clients to get the Jinja string and do whatever they want with it

* add hardcoded implementations of common templates, where we string match the template and if it is something we know, we call a simple C++ function to tokenize as expected

My personal thoughts here, but C++ probably isn't the best language for that - templating is much easier to implement in scripting languages than in C++, and in my opinion it would undermine the maintainability and flexibility of a lean server.cpp implementation.

Just my 2c, but maybe templating fits better on top of llama-cpp-python - which might be easier to implement and maintain (while keeping the core small and extensible)?

@ggerganov
Owner Author

@Tostino

All templates that I've seen so far are so basic that I don't understand why we need an entire scripting language to express them. Is there a more advanced use case other than a basic for loop over the messages + add prefix/suffix?

How many templates do we expect to ever have? 10s, 100s? Even if it is 1000s, I prefer to have them hardcoded instead of building jinja2cpp (it takes 10 minutes!! just to run the cmake config).

Here is a sample ChatML template in a few lines of C++ that we currently use (and this is not even the best way to do it):

// json is nlohmann::json; json_value() is a small server helper that reads a
// key from the message object, falling back to a default value.
#include <sstream>
#include <string>
#include <vector>

std::string format_chatml(std::vector<json> messages)
{
    std::ostringstream chatml_msgs;

    for (auto it = messages.begin(); it != messages.end(); ++it) {
        chatml_msgs << "<|im_start|>"
                    << json_value(*it, "role",    std::string("user")) << '\n';
        chatml_msgs << json_value(*it, "content", std::string(""))
                    << "<|im_end|>\n";
    }

    chatml_msgs << "<|im_start|>assistant" << '\n';

    return chatml_msgs.str();
}

I could be missing something, but for the moment I don't see a good reason to add Jinja support. Let's see how it goes - I'm open to reconsidering, but I need to see some reasonable examples and use cases that justify this dependency.


request.echo is an extension of the API. It exists because open-source LLMs are able to finish the last role/content pair in the messages list when request.add_generation_prompt=false (itself an extension of the API, needed to support this HF feature) and the template/model supports that feature.

We should treat add_generation_prompt as default=true, because that is the behavior of the OpenAI API. This simply allows users to override that behavior if they need it, and gives them all the required knobs to use the feature properly.

I think I understand the request.add_generation_prompt parameter, but I don't understand request.echo - can you clarify / give an example?

@mudler

Yes, I agree.

@Tostino

Tostino commented Nov 28, 2023

The fact is, if the rest of the ecosystem standardizes on these templates as "the way" to format messages, they will proliferate into new and unexpected use cases.

python3 -m vllm.entrypoints.openai.api_server --model teknium/OpenHermes-2.5-Mistral-7B --chat-template ./examples/template_inkbot.jinja

Here is an example call using my inkbot template which uses echo:

curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer " \
  -d '{
    "model": "teknium/OpenHermes-2.5-Mistral-7B",
    "stream": false,
    "stop": ["\n<#bot#>","\n<#user#>"],
    "add_generation_prompt": false,
    "echo": true,
    "temperature": 0.0,
    "n": 1,
    "messages": [
	{"role": "meta-current_date", "content": "2023-10-20"},
	{"role": "meta-task_name", "content": "general"},
	{"role": "system", "content": "You are a helpful assistant."},
	{"role": "user", "content": "Hello!"},
	{"role": "assistant", "content": "Hello, how are you?"},
	{"role": "user", "content": "Great, thank you! Now i would like some help planning my weekend in Asheville, I would like some hike suggestions. Please list out a few moderate"}	
	]
  }'

Which returns:

{"id":"cmpl-bb73e8eefb164c3194bb2b450369e1c6","object":"chat.completion","created":195778,"model":"teknium/OpenHermes-2.5-Mistral-7B","choices":[{"index":0,"message":{"role":"user","content":"Great, thank you! Now i would like some help planning my weekend in Asheville, I would like some hike suggestions. Please list out a few moderate to difficult hikes in the area."},"finish_reason":"stop"}],"usage":{"prompt_tokens":107,"total_tokens":121,"completion_tokens":14}}

vs with "echo": false:

{"id":"cmpl-86ba4dd235a84b8e9a7361b46b04ac79","object":"chat.completion","created":195723,"model":"teknium/OpenHermes-2.5-Mistral-7B","choices":[{"index":0,"message":{"role":"user","content":" to difficult hikes in the area."},"finish_reason":"stop"}],"usage":{"prompt_tokens":107,"total_tokens":121,"completion_tokens":14}}

Since the official OpenAI API for chat/completions doesn't allow you to complete an incomplete message, there was no point for them to implement echo in the chat/completions endpoint. The HF chat_template spec explicitly supports that feature with the add_generation_prompt parameter, so it made sense to implement echo for ease of use. It is an extension of the API, which is why I was calling it out though. I tried to choose the most likely behavior / keywords if OpenAI ever did expand their API to add echo.

Edit:
Yeah, 10 min for a cmake is painful... Unsure what the best way forward is to be honest.
But without actual support for the chat template that the model creator defined, this isn't usable for me (and many others).

@FSSRepo
Collaborator

FSSRepo commented Nov 28, 2023

In my opinion, most of these ggml-based projects share the characteristic of being very lightweight with few dependencies (header-only libraries: httplib.h, json.hpp, stb_image.h, and others), making them portable compared to having to download a 2 GB library like PyTorch and an entire Python environment that pulls in packages that will never be used.

Adding overly heavy dependencies, especially those dependent on an external language like Python, seems to go against the idea of this type of project.

@Tostino

Tostino commented Nov 28, 2023

In my opinion, most of these ggml-based projects share the characteristic of being very lightweight with few dependencies (header-only libraries: httplib.h, json.hpp, stb_image.h, and others), making them portable compared to having to download a 2 GB library like PyTorch and an entire Python environment that pulls in packages that will never be used.

Adding overly heavy dependencies, especially those dependent on an external language like Python, seems to go against the idea of this type of project.

Absolutely no one is advocating for a whole pytorch dependency chain. There just may be other options for running the jinja that don't bloat the dependency chain too badly, and I very much think it's worth discussing further to see if there is an acceptable solution that can be found.

Even if it's something like transpiling Jinja to another language that we can run directly, or providing hooks for users to run a Python interpreter with the Jinja dependency and feed the results back to the main C++ program - that way it can be optional and fall back to the hardcoded options when unavailable.

Just some thoughts, take them for what you will, I am not a cpp dev.

@FSSRepo
Collaborator

FSSRepo commented Nov 28, 2023

I would suggest creating a small utility in C++ that implements just the functionality we are interested in (i.e. porting it).

Taking a quick look at the Jinja2Cpp library, it has Boost as a dependency, which explains the long CMake configuration time. It could be beneficial to decouple from that library and include only the functions needed for the templates to work, making it more lightweight.

@psugihara
Contributor

psugihara commented Nov 28, 2023

@tobi completely agree that server.cpp should be a first-class focus of this repo. My macOS app uses exactly the architecture you describe, hitting server on localhost. I would note however that iOS apps cannot include executables so server.cpp won't work in at least that case. Tangentially, it might make sense to pull some of the common completion/tokenizing/batching/parallelization functionality being added to server.cpp into the llama.cpp core so that each platform doesn't have to rewrite completion_loop, etc.


I also wanted to throw in an example of some ugly code I'd love to kill with built-in server.cpp templating. I'm guessing every server.cpp client has some version of this and I'm sure they all have slightly different bugs: https://github.com/psugihara/FreeChat/blob/main/mac/FreeChat/Models/NPC/PromptTemplates/Templates.swift

@Tostino After understanding more of the background here, I agree that ideally we'd want to support the jinja templates included in GGUFs. I didn't even know these were added to GGUF, that's so cool! Unfortunately I'm not seeing a ton of existing work in cpp besides the relatively heavyweight jinja2cpp you found as well. Implementing a minimal jinja2 parser seems out of scope for v1 of template support but perhaps a more incremental compromise could work...

  1. add an endpoint for retrieving the jinja template, allowing clients to skip parsing the gguf themselves if they want to run the template directly
  2. this endpoint could indicate whether the template is supported by server.cpp itself (server.cpp could hardcode a cpp template implementation + hash of a corresponding jinja template for example).
  3. when requesting a chat completion, the client could indicate whether they've already templated their input

I agree with @ggerganov that the templates are pretty trivial to implement in c++ or whatever and I'd first and foremost just like to have them all in one place (ideally llama.cpp) rather than bespoke implementations in each client. A mapping from jinja template hashes to c++ functions would be the most performant setup too, even if it's a bit ugly conceptually.

If templates are added here, I can delete my implementation in FreeChat so we'll have net 0 fragmentation :)

@Tostino

Tostino commented Nov 28, 2023

when requesting a chat completion, the client could indicate whether they've already templated their input

That isn't possible. You can template your response on the client side, but then you need to hit the legacy completion endpoint, because the payload for chat/completion doesn't support a formatted string, just a list of messages with role/content.

@psugihara
Contributor

then you need to hit the legacy completion endpoint

For my use-case that would be fine. Though it does look like there are some non-standard args supported by server's chat/completion already (e.g. mirostat).

@wizzard0
Contributor

wizzard0 commented Nov 29, 2023

May I add my 2c?

I'd very much prefer to keep OAI -> llama and tokens -> tokens parts separate and convert on the "proxy".
IMHO Jinja is a terrible middle ground which is both complicated and not flexible enough. See below for examples.

What would be useful on the server.cpp side is more APIs for the "OAI -> llama" converter service:

  • cache usage; current request details;
  • aborting requests-in-progress
  • model metadata as parsed by the server (even if "converter" starts the server binary itself it has no way to read it apart from parsing the stderr)
  • command line args/build args (is batching enabled? LORAs? RoPE params? context? slots? etc);

I can't imagine trying to cram all the hacks I've found useful on the "OAI -> llama" side into the C++ binary without it devolving into an unmaintainable mess.

Some examples:

  1. Models are very finicky re chat templates. Often adding/removing whitespace, BOS/EOS etc, using a different template, repeating the system prompt etc improves the output drastically (tested with temperature 0 and fixed seed ofc)
  2. so I often find myself generating multiple outputs with different templates/samplers (!) and returning a single result later
  3. or re-prompting to return a single long reply on small-context models
  4. server-side templating is a black box, and watching the raw formatted prompt is very, very useful, e.g. if you're trying to debug why the grammar doesn't match anything
  5. Another hack I've found useful is to combine chat templating with a custom prefix for the last response when the reply without that prefix is bad, and converting the results with custom code
  6. ...in particular, the grammar + pre/post-conversion enables quite OK function calling. But it's prompt-dependent as well, so even full Jinja won't help, not to mention debugging.

@cztomsik
Contributor

cztomsik commented Nov 29, 2023

you need to hit the legacy completion endpoint

I am doing exactly that in my project; everything is "client-side" because I can then easily "complete" messages (the model refuses to answer, so I edit the message to start with "Sure, here's how to" and let the model fill in the rest).

And I know nobody asked, but adding Jinja to a C++ project is a terrible idea.

@ggerganov
Owner Author

Yes, let's add it

What is the plan to add it back please ?

I would start with cleaning up the clip/llava libs and improving the API. I felt that in the old implementation there were many internal objects exposed to the server and the memory management was dubious. Also, there was no obvious path for supporting parallel multimodal slots.

Before working on the server, we should improve llava-cli and the API for using LLaVA.

@teleprint-me
Contributor

teleprint-me commented Mar 12, 2024

Would it be alright to decouple the client from the build process?

Building everything I need to test a new update to the UI/UX is painful because I have to go through extra steps and the CORS implementation blocks me from locally developing since it's designed to prevent exactly that kind of thing.

I really don't see a valid rationale for downloading the preact scripts and deps, converting them to a C/C++ source file, and then building it, when you could just open the file, read in the text, and pass it to the lambda function handling it (honestly, this entire process is incredibly convoluted).

I was playing around with the source code that's used to concatenate the scripts together and it seems to be using unpkg in the background. esm.sh seems like it might be a viable alternative as I've been using it to prototype a new UI which allows me to skip building at all. The only caveat to esm.sh I see is that it's a remote dependency.

Regardless, using tools like preact, marked, etc. relies on npm and creates a build dependency as a result. That's usually why I stick with vanilla HTML, CSS, and JS unless it would be extremely cumbersome to do otherwise (e.g. syntax highlighting is really complicated).

Any feedback is appreciated.

@ngxson
Collaborator

ngxson commented Mar 12, 2024

Building everything I need to test a new update to the UI/UX is painful because I have to go through extra steps and the CORS implementation blocks me from locally developing since it's designed to prevent exactly that kind of thing.

@teleprint-me As I remember, there's a --path option in the server to specify the path to static assets (I didn't test it though). For example, if you have a file at ./test/myfile.html and you run the server with --path ./test, the file can be accessed at http://localhost:8080/myfile.html

Of course, it would be ideal to simply use everything from a specified --path. I tried to do just that in #5939. My idea was: if --path is specified, we read the files on disk and don't use the embedded html/js. However, it's blocked by the fact that there's a guide in the README on extending the embedded completion.js. So in the end, I couldn't change this behavior.

Also, in #5762 I noted that the hpp generated from deps.sh is quite redundant. Instead, it could be generated at build time, like how we generate the build-info.cpp file. In fact, I was halfway through implementing that idea when I realized it won't work if we build on Windows 😢 - basically I'd need to write another set of scripts just for Windows, and I'm very bad at using Windows...

vanilla html, css, and js

Exactly - that's what I initially thought. I don't think the current frontend implementation is that complicated; plain vanilla is enough. We don't even need to break it into multiple files - one giant index.html still works fine.

Personally, for anything more complicated than that, I prefer using vitejs since it can be built into one single html file, which can then be embedded into other projects easily. A while ago I even did a visual demo for my company (to be used in a showroom). The idea of having a single html file turns out to be very handy - easy to send via email, works offline, ...

@Azeirah
Contributor

Azeirah commented Mar 12, 2024

I'm not entirely certain what kind of role the frontend should take on. If it's really really basic, then it's not desirable to use for anything other than just quickly testing the model. If it's complex then people will be asking for all kinds of features, like multiple chats, templates, characters or what have you.

There are already many clients, even ones specifically written for llama.cpp. I'm afraid that whatever we make will never be good enough, unless we offer something that's clearly the absolute minimum so that people don't get big expectations for a built-in client.

What do we want to achieve with an html frontend?

@psugihara
Contributor

What do we want to achieve with an html frontend?

Personally I like what the current frontend achieves... an extremely minimal example of interacting with the server and getting (impressively fast) streaming responses. Seeing the speed of responses in the server example is what convinced me to use it in my project. In my mind it's not really intended to be used as a daily driver, more as a test harness.

I 100% agree that this is not the right place to make yet another front-end, and feature creep in the UI/UX layer of the chat front-end is a distraction from building the fastest, most bulletproof llama.cpp server. I maintain a SwiftUI front-end, so I have some first-hand experience with the amount of work it entails (much!).

@teleprint-me
Contributor

teleprint-me commented Mar 12, 2024

I think it's important to focus on the fact that the /public path should be completely decoupled from the build process.

A user should be able to substitute any source files within this path. This is how a server operates in practice; I don't need to rebuild the server/container every time I update the index.html.

EDIT: I was finally able to test the CLI option recommended by @ngxson and it does work as expected.

server -m /mnt/valerie/models/mistralai/Mistral-7B-Instruct-v0.2/ggml-model-q4_0.gguf --ctx-size 8192 --n-gpu-layers 32 --path ~/Local/llama/llama_cpp_client/app/

Me offering to flesh out the UI/UX is just a bonus. I can fix it, improve it, and keep it lightweight, with a less complex pipeline that uses less code and is still just as fast.

The proposed interface would use vanilla HTML, CSS, and JavaScript, with the following libraries as dependencies: Marked, Highlight, and MathJax. These libraries would be hot-linked via a CDN to keep the codebase lightweight and bypass the need for a build process. The interface's design would be minimalistic, focusing on usability, functionality, and aesthetics.

I have a couple of UIs I prototyped and they're so much faster than anything else I've used so far. The UX would be more intuitive as well. I can do it in about the same amount of SLoC that preact used and reduce the pipeline dependencies and processes as a result, i.e. the server build and client become agnostic and decoupled from one another.

This allows users to do their own thing while giving llama.cpp an enhanced, minimalistic UI/UX out of the box. It reduces cognitive load, developer overhead, feature creep, and more.

The only caveat is I won't be able to focus on it completely until after mid April. This should be enough time to discover some form of consensus.

@teleprint-me
Contributor

As an aside, unless someone beats me to it (they will, because it's low-hanging fruit), enabling the completions API to receive either a list of completions or a single completion should be fairly trivial. I think it hasn't been done yet mostly due to the hype and excitement around chat models.

Completions are incredibly powerful and useful and I use them as much as I use chat, if not more than chat.

Implementing FIM has proven to be more difficult and less intuitive.

For example, I can use Llama, Mistral, or any other well trained/fine-tuned model to perform a code completion, and none of these models utilize FIM. Where FIM proves its usefulness is in including the "surrounding" context and directing the model towards the text you're interested in modifying, e.g. the Refact model. Utilizing these capabilities requires full control over the model's prompt, though. I found that using completions aids in this endeavor.

@FSSRepo
Collaborator

FSSRepo commented Apr 8, 2024

@ggerganov Is there any way to do the following with llama.cpp?

(diagram omitted)

Sorry for my poor use of Paint, but I wanted to convey my idea for improving the way we handle requests from different clients more efficiently and conveniently, at least for applications like chatbots. When a client sends a PDF document for processing, other clients shouldn't get stuck and should continue to receive tokens continuously.

@slaren
Collaborator

slaren commented Apr 8, 2024

@FSSRepo to elaborate on what I mentioned before. The way you can do this is by splitting all the work into small batches. If the server receives a 2048 token prompt from one client, you don't process all of it in a single evaluation. For example evaluate the first 512 tokens only. If it receives another request from another client for 256 tokens in the meanwhile, then in the next batch evaluate 256 tokens from the first client and 256 from the new client. And so on until all the requests have been processed, with every new batch distributing the queued work fairly across the different clients.
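A minimal sketch of this idea (illustrative only - the struct and function below are not the actual server code): each pending request keeps a cursor into its prompt, and every iteration the n_batch token budget is divided among the requests that still have prompt tokens left.

// Sketch only: decide how many prompt tokens each slot contributes to the next batch.
#include <algorithm>
#include <cstddef>
#include <vector>

struct slot_state {
    size_t n_prompt    = 0; // total prompt tokens for this client
    size_t n_processed = 0; // prompt tokens already evaluated
    size_t remaining() const { return n_prompt - n_processed; }
};

std::vector<size_t> plan_next_batch(const std::vector<slot_state> & slots, size_t n_batch) {
    std::vector<size_t> plan(slots.size(), 0);

    size_t n_active = 0;
    for (const auto & s : slots) {
        if (s.remaining() > 0) n_active++;
    }
    if (n_active == 0) return plan;

    // even split of the batch budget across the slots that still have work
    const size_t fair_share = std::max<size_t>(1, n_batch / n_active);

    size_t budget = n_batch;
    for (size_t i = 0; i < slots.size() && budget > 0; ++i) {
        const size_t take = std::min({slots[i].remaining(), fair_share, budget});
        plan[i]  = take;
        budget  -= take;
        // any leftover budget could be redistributed in a second pass (omitted)
    }
    return plan;
}

Each slot's planned tokens would then be appended to a single llama_batch under that slot's sequence id before calling llama_decode.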

@ggerganov
Owner Author

Yes, what @slaren said. The API is flexible and the server implements just one possible way of batching the inputs.

@phymbert
Collaborator

phymbert commented Apr 9, 2024

If the server receives a 2048 token prompt from one client, you don't process all of it in a single evaluation. For example evaluate the first 512 tokens only. If it receives another request from another client for 256 tokens in the meanwhile, then in the next batch evaluate 256 tokens from the first client and 256 from the new client. And so on until all the requests have been processed

Maybe we can divide the batch size by the number of slots at the LOAD_PROMPT command to determine the maximum number of prompt tokens a slot can include in the current batch. I will have a look.

@c608345

c608345 commented Apr 13, 2024

Server HTTP API CORS is broken: #6544.

@jboero
Contributor

jboero commented Apr 23, 2024

I put together a PR with some themes and an example of how people can skin/re-theme it. Anyone interested in approving? Note that graphic design is NOT my specialty.

#6848

@jboero
Contributor

jboero commented Apr 23, 2024

(screenshot of the re-themed server UI omitted)

@Kartoffelsaft
Contributor

I noticed that the OpenAI semi-compatible API defaults the temperature to 0, but this differs from OpenAI's actual API, which defaults to 1. This can make applications that assume a non-zero default (such as Oatmeal) much more frustrating to use. Curious if there's a reason the default differs? If it's worth changing, I've already set up a PR: #7226

@jukofyork
Contributor

jukofyork commented May 22, 2024

I've thought of something that I can't tell whether it has already been suggested, since I'm not sure what words best describe it:

--override-kv KEY=TYPE:VALUE
                            advanced option to override model metadata by key. may be specified multiple times.
                            types: int, float, bool, str. example: --override-kv tokenizer.ggml.add_bos_token=bool:false

It would be very helpful if we could somehow send a set of overrides for the parameters sent via the API (possibly using a JSON file as input). I originally wanted to use the logit-bias option via the shell script I use to run the server, but realized that unless the front-ends know about all the options, it's not possible to use them.

This would also solve the above person's problem of the temperature defaulting to zero, and would let them override it.

I think it would be pretty straightforward to implement: you could just have a JSON file as input that follows the exact same format as the API uses, and then use it to set new default values.
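Something along these lines (a sketch only, assuming a hypothetical --request-defaults-file flag; none of this exists in llama.cpp today):

// Sketch only: load a JSON file that uses the same keys as the API request
// body and use it for default (or frozen) values when building the
// per-request parameters.
#include <fstream>
#include <string>
#include "json.hpp"   // nlohmann::json, already vendored by the server example

using json = nlohmann::json;

static json load_request_defaults(const std::string & path) {
    std::ifstream f(path);
    return f ? json::parse(f) : json::object();
}

// Merge order: built-in defaults < file defaults < client request.
// If freeze is true, the file values win over whatever the client sends.
static json merge_request(const json & defaults, const json & request, bool freeze) {
    json out = request;
    for (auto it = defaults.begin(); it != defaults.end(); ++it) {
        if (freeze || !out.contains(it.key())) {
            out[it.key()] = it.value();
        }
    }
    return out;
}

The freeze flag corresponds to the second usage mode described further below.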

It's also possible this CLI option could be adapted:

  -spf FNAME, --system-prompt-file FNAME
                            set a file to load a system prompt (initial prompt of all slots), this is useful for chat applications. 

as it already loads and parses a JSON file for a few parameters....

If anybody is interested then I can probably take a look over the weekend at adding this?

It could work in two ways:

  1. As an override of the default parameters, which can then subsequently be changed by setting them in the API calls as normal. This fixes the problem the person above and I have.
  2. As a frozen parameter, to stop, say, family members from doing daft stuff like setting -ngl too high and crashing the server with OOM errors, etc. In this case the parameter sent from the API would be ignored.

@bionicles

bionicles commented May 28, 2024

Well, I suppose it's a breaking change and not really germane to the server per se, but I propose the leadership consider renaming the project to something other than "llama", because that's one model from one company, and this framework can surely be useful for other models besides Llama from Meta, right?

I also think it's distasteful, but impossible to avoid, how the following rules and prohibitions on learning exist purely to stifle innovation in the AI industry:

https://ai.meta.com/llama/license/

"""
v. You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Llama 2 or derivative works thereof).

  1. Additional Commercial Terms. If, on the Llama 2 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights.

... non-offensive parts omitted for brevity ...

c. If you institute litigation or other proceedings against Meta or any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Llama Materials or Llama 2 outputs or results, or any portion of any of the foregoing, constitutes infringement of intellectual property or other rights owned or licensable by you, then any licenses granted to you under this Agreement shall terminate as of the date such litigation or claim is filed or instituted. You will indemnify and hold harmless Meta from and against any claim by any third party arising out of or related to your use or distribution of the Llama Materials.
"""

What makes us confident it's a good idea to tie the name of this project to Meta's open-weights but not OSI-approved-licensed LLM? E.g. with an eye on future options to use other AI models, are we sure we ought to sink hundreds of engineer-years into a project named after a corporate AI model with explicitly monopolistic business terms?

I don't mean to mount a crusade here; it just seems reasonable to rename llama.cpp to something that sounds useful for more than just talking to Llama (which produces outputs we can't even use for other future AI!), and to post this here because this is where the feedback link leads. Bottom line for me personally: I wouldn't ever use this for Llama, but I'd consider it for Mistral.
