server : refactor #5882

Merged
26 commits merged into master from gg/refactor-server on Mar 7, 2024

Conversation

@ggerganov (Owner) commented Mar 5, 2024

ref #4216

Moved the code around to bring logically similar things closer together and did some renaming. The cache_tokens management has been improved - it is now updated only when params.cache_tokens == true. It was also being misused to count the number of tokens that have been processed - n_past is used for that instead.

The context shift and self-extend logic is still very ugly and needs to be simplified, but that will be done in another PR because the changes in this one are already getting too large. I'm thinking about merging it into llama_sampling_context, but the system_prompt is making things difficult. A good first step would probably be to simplify the system_prompt management and then refactor the rest.

  • Remove multimodal capabilities - I don't like the existing implementation. Better to completely remove it and implement it properly in the future
  • (future PR) Refactor context shift + self-extend into common and reuse across examples
  • Improve main loop batching - do not process more than n_batch tokens per iter. Need better slot managing logic (a rough sketch of the idea is included after this list). This will fix batched embeddings and long-blocking requests
  • Fix the code style
  • Disable prompt caching when self-extend is enabled
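
For the batching item above, here is a rough illustrative sketch of the idea - capping the work scheduled per iteration at n_batch tokens across slots. This is Python pseudocode under assumed names (Slot, pending_tokens), not the server's actual C++ code:

```python
from dataclasses import dataclass, field

@dataclass
class Slot:
    # Hypothetical stand-in for a server slot with tokens waiting to be processed.
    pending_tokens: list = field(default_factory=list)

def fill_batch(slots: list[Slot], n_batch: int) -> list:
    """Greedily schedule at most n_batch tokens for this iteration."""
    batch = []
    for slot in slots:
        room = n_batch - len(batch)
        if room <= 0:
            break  # batch is full; remaining slots wait for the next iteration
        batch.extend(slot.pending_tokens[:room])
        slot.pending_tokens = slot.pending_tokens[room:]
    return batch
```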

@phymbert (Collaborator) commented Mar 5, 2024

Nice @ggerganov, it is worth the effort. Please tell me which additional tests would help you secure it.

@ggerganov (Owner, Author) commented:

Here is a test case that fails on master and that I am looking to fix in this PR:

./server -m models/bert-bge-small/ggml-model-f16.gguf --embedding --port 6900 -c 4096 -b 512 -np 8
diff --git a/examples/server-embd.py b/examples/server-embd.py
index c5c4ea87..118e0427 100644
--- a/examples/server-embd.py
+++ b/examples/server-embd.py
@@ -13,7 +13,7 @@ async def main():
     model_url = "http://127.0.0.1:6900"
     responses: list[requests.Response] = await asyncio.gather(*[requests_post_async(
         url= f"{model_url}/embedding",
-        json= {"content": str(i)*1024}
+        json= {"content": str(0)*1024}
     ) for i in range(n)])
 
     for response in responses:
python3 ./examples/server-embd.py
[-0.691986083984375, -0.6994509100914001, -0.34556347131729126, -0.16304560005664825, -0.6425938606262207, 0.6009216904640198, 0.8178988099098206, -0.2026521861553192]
[-0.033909425139427185, -0.11941321194171906, -0.7864583730697632, -0.24582073092460632, -0.17017485201358795, 0.0760011374950409, 0.3244125545024872, -0.38041409850120544]
[-0.0339287631213665, -0.11942094564437866, -0.7864440083503723, -0.24585771560668945, -0.17014935612678528, 0.07596983015537262, 0.32442623376846313, -0.38041090965270996]
[-0.03392859548330307, -0.11942102015018463, -0.7864437699317932, -0.24585801362991333, -0.17014899849891663, 0.07596953213214874, 0.324426531791687, -0.380410760641098]
[-0.46300897002220154, -0.14666011929512024, -0.646159291267395, 0.10496828705072403, -0.528754711151123, 0.09720712900161743, 0.7127938866615295, 0.019480813294649124]
[-0.033925510942935944, -0.1194162666797638, -0.7864490151405334, -0.24582067131996155, -0.17019721865653992, 0.07600662112236023, 0.32440948486328125, -0.38042333722114563]
[-0.03392315283417702, -0.1194247156381607, -0.786466121673584, -0.2458271086215973, -0.1701907217502594, 0.07600513100624084, 0.3244108259677887, -0.38043472170829773]
[-0.35006847977638245, -0.09708067774772644, -0.7097396850585938, 0.1826569139957428, -0.36089810729026794, 0.19932880997657776, 0.8620818853378296, 0.2980068325996399]
Similarity between 0 and 1: 0.41
Similarity between 0 and 2: 0.41
Similarity between 0 and 3: 0.41
Similarity between 0 and 4: 0.76
Similarity between 0 and 5: 0.41
Similarity between 0 and 6: 0.41
Similarity between 0 and 7: 0.69
Similarity between 1 and 2: 1.00
Similarity between 1 and 3: 1.00
Similarity between 1 and 4: 0.47
Similarity between 1 and 5: 1.00
Similarity between 1 and 6: 1.00
Similarity between 1 and 7: 0.45
Similarity between 2 and 3: 1.00
Similarity between 2 and 4: 0.47
Similarity between 2 and 5: 1.00
Similarity between 2 and 6: 1.00
Similarity between 2 and 7: 0.45
Similarity between 3 and 4: 0.47
Similarity between 3 and 5: 1.00
Similarity between 3 and 6: 1.00
Similarity between 3 and 7: 0.45
Similarity between 4 and 5: 0.47
Similarity between 4 and 6: 0.47
Similarity between 4 and 7: 0.95
Similarity between 5 and 6: 1.00
Similarity between 5 and 7: 0.45
Similarity between 6 and 7: 0.45

It should return a similarity of 1.00 for all pairs, but currently it fails because the logic in update_slots() truncates and splits the input prompts. This is incorrect - they have to be processed whole when using embedding models such as BERT. If the prompt does not fit in the batch size (n_batch), we should return an error.
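
For reference, the pairwise check boils down to cosine similarity between the returned embeddings. A minimal illustrative sketch with toy vectors (not the exact server-embd.py code):

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for the embeddings returned by the server;
# identical prompts should give a similarity of exactly 1.00.
embeddings = [[0.1, -0.2, 0.3], [0.1, -0.2, 0.3], [0.5, 0.4, -0.1]]
for i in range(len(embeddings)):
    for j in range(i + 1, len(embeddings)):
        sim = cosine_similarity(embeddings[i], embeddings[j])
        print(f"Similarity between {i} and {j}: {sim:.2f}")
```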

@chigkim commented Mar 6, 2024

@ggerganov Could you please not "remove multimodal capabilities" from the server? Many blind users rely on it for image descriptions.
Thanks so much!

@phymbert (Collaborator) commented Mar 6, 2024

./server -m models/bert-bge-small/ggml-model-f16.gguf --embedding --port 6900 -c 4096 -b 512 -np 8

@ggerganov Could you please upload a bert-bge-small to ggml-org/models?

The one I found at CompendiumLabs/embedding-models-english can no longer be tokenized:

tokenize error to_lower(104)
__pthread_kill_implementation 0x00007fffeee99a1b
__pthread_kill_internal 0x00007fffeee99a1b
__GI___pthread_kill 0x00007fffeee99a1b
__GI_raise 0x00007fffeee428e6
__GI_abort 0x00007fffeee268b7
<unknown> 0x00007fffef2a4f06
<unknown> 0x00007fffef2b6e6c
std::terminate() 0x00007fffef2b6ed7
__cxa_throw 0x00007fffef2b7138
std::__throw_runtime_error(char const*) 0x00007fffef2a8338
std::locale::facet::_S_create_c_locale(__locale_struct*&, char const*, __locale_struct*) 0x00007fffef2db108
std::locale::_Impl::_Impl(char const*, unsigned long) 0x00007fffef2cc9a5
std::locale::locale(char const*) 0x00007fffef2cd585
llm_tokenizer_wpm::to_lower llama.cpp:9355
llm_tokenizer_wpm::preprocess llama.cpp:9319
llm_tokenizer_wpm::tokenize llama.cpp:9246
llama_tokenize_internal llama.cpp:9593
llama_tokenize llama.cpp:13309
llama_tokenize common.cpp:1407
llama_tokenize common.cpp:1396
llama_server_context::tokenize server.cpp:515
llama_server_context::update_slots server.cpp:1719
std::__invoke_impl<…> invoke.h:74
std::__invoke<…> invoke.h:96
std::_Bind<bool (llama_server_context::*(llama_server_context*))()>::__call<bool, , 0ul>(std::tuple<>&&, std::_Index_tuple<0ul>) functional:506
std::_Bind<bool (llama_server_context::*(llama_server_context*))()>::operator()<, bool>() functional:591
std::__invoke_impl<…>(std::__invoke_other, std::_Bind<…> &) invoke.h:61
std::__invoke_r<…>(std::_Bind<…> &) invoke.h:150
std::_Function_handler::_M_invoke(const std::_Any_data &) std_function.h:290
std::function::operator()() const std_function.h:591
llama_server_queue::start_loop utils.hpp:315
main server.cpp:3552
__libc_start_call_main 0x00007fffeee28150
__libc_start_main_impl 0x00007fffeee28209
_start 0x00005555555678a5

@ggerganov (Owner, Author) commented:

Here is one: https://huggingface.co/ggml-org/models/tree/main/bert-bge-small

@ggerganov (Owner, Author) commented:

@phymbert The prompt str(0)*1024 tokenizes to 513 tokens, so I bumped the batch size in the embeddings tests to 1024; otherwise the prompts are not processed because they don't fit entirely in the batch. The tests pass on my local machine - let's see if it works in the CI.
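
If you want to double-check the token count yourself, a quick sketch (assuming the server's /tokenize endpoint and the port 6900 used in the example above):

```python
import requests

prompt = str(0) * 1024  # the same prompt used in server-embd.py
resp = requests.post("http://127.0.0.1:6900/tokenize", json={"content": prompt})
n_tokens = len(resp.json()["tokens"])
print(n_tokens)  # expected to exceed a batch size of 512, hence the bump to 1024
```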

@phymbert (Collaborator) commented Mar 6, 2024

@phymbert The prompt str(0)*1024 tokenizes to 513 tokens, so I bumped the batch size in the embeddings tests to 1024; otherwise the prompts are not processed because they don't fit entirely in the batch. The tests pass on my local machine - let's see if it works in the CI.

@ggerganov Thanks, yes - but if you increase the KV cache size to 2048, it does not pass anymore.

@ggerganov (Owner, Author) commented:

Can you find out why? It works for me.

Review thread on examples/server/server.cpp (outdated, resolved):
snprintf(buf, sizeof(buf), fmt, "Unknown Exception");
}

res.set_content(buf, "text/plain; charset=utf-8");
A collaborator commented:

Maybe off topic, but errors should be wrapped inside an object to exactly match the OpenAI format:

{
    "error": {
        "message": "You didn't provide an API key. .......",
        "type": "invalid_request_error",
        "param": null,
        "code": null
    }
}

I can do that in another PR, though.
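
Until that lands, a hypothetical client-side helper (not part of the server) could tolerate both the current plain-text errors and an OpenAI-style error object, e.g.:

```python
import requests

def extract_error_message(resp: requests.Response) -> str:
    # Accept either a plain-text body (current behaviour) or an
    # OpenAI-style {"error": {"message": ...}} object.
    try:
        body = resp.json()
    except ValueError:
        return resp.text
    if isinstance(body, dict) and isinstance(body.get("error"), dict):
        return body["error"].get("message", str(body["error"]))
    return resp.text
```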

@ggerganov merged commit 2002bc9 into master on Mar 7, 2024
56 of 61 checks passed
@ggerganov deleted the gg/refactor-server branch on March 7, 2024 at 09:41
slot.sparams.mirostat_tau = json_value(data, "mirostat_tau", default_sparams.mirostat_tau);
slot.sparams.mirostat_eta = json_value(data, "mirostat_eta", default_sparams.mirostat_eta);
slot.sparams.penalize_nl = json_value(data, "penalize_nl", default_sparams.penalize_nl);
slot.params.n_keep = json_value(data, "n_keep", slot.params.n_keep);
@phymbert (Collaborator) commented Mar 7, 2024

@ggerganov For OAI Completions, "n_keep" is not set in the JSON data, so it's always 0 and it triggers context shifting with n_keep=1:

const int n_keep = slot.params.n_keep + add_bos_token;

if (slot.params.n_keep < 0) {

Or is there something I don't understand?

@ggerganov (Owner, Author) commented:

Hm, yes - likely this has to default to default_params.n_keep.
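
As a possible client-side stopgap in the meantime, a sketch assuming the native /completion endpoint on the default port 8080, where passing n_keep = -1 should retain the whole prompt:

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={"prompt": "Hello", "n_predict": 16, "n_keep": -1},
)
print(resp.json()["content"])
```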

@ggerganov (Owner, Author) commented:

It's likely that I broke something, but I'm merging this to unblock other PRs. We should focus on fixing embeddings tokenization and normalization next (#5801 (comment)).
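
For context, "normalization" here refers to L2-normalizing the embedding vectors, so that identical inputs yield a cosine similarity of exactly 1.00. A minimal illustrative sketch (not the server's implementation):

```python
import numpy as np

def l2_normalize(v):
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

print(l2_normalize([3.0, 4.0]))  # -> [0.6 0.8]
```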

@ngxson mentioned this pull request on Mar 8, 2024
@sorasoras commented Mar 8, 2024

It's likely that I broke something, but I'm merging this to unblock other PRs. We should focus on fixing embeddings tokenization and normalization next (#5801 (comment)).

It looks like server.exe won't open after this PR with ROCm on Windows - there is no response.

cmake .. -G "Ninja" -DCMAKE_BUILD_TYPE=Release -DLLAMA_HIPBLAS=ON  -DCMAKE_C_COMPILER="C:/Program Files/AMD/ROCm/5.7/bin/clang.exe" -DCMAKE_CXX_COMPILER="C:/Program Files/AMD/ROCm/5.7/bin/clang++.exe" -DAMDGPU_TARGETS="gfx1100;gfx1030;gfx1031"

 cmake --build . --config Release

hazelnutcloud pushed a commit to hazelnutcloud/llama.cpp that referenced this pull request Mar 10, 2024
* server : refactoring (wip)

* server : remove llava/clip objects from build

* server : fix empty prompt handling + all slots idle logic

* server : normalize id vars

* server : code style

* server : simplify model chat template validation

* server : code style

* server : minor

* llama : llama_chat_apply_template support null buf

* server : do not process embedding requests when disabled

* server : reorganize structs and enums + naming fixes

* server : merge oai.hpp in utils.hpp

* server : refactor system prompt update at start

* server : disable cached prompts with self-extend

* server : do not process more than n_batch tokens per iter

* server: tests: embeddings use a real embeddings model (ggerganov#5908)

* server, tests : bump batch to fit 1 embedding prompt

* server: tests: embeddings fix build type Debug is randomly failing (ggerganov#5911)

* server: tests: embeddings, use different KV Cache size

* server: tests: embeddings, fixed prompt do not exceed n_batch, increase embedding timeout, reduce number of concurrent embeddings

* server: tests: embeddings, no need to wait for server idle as it can timout

* server: refactor: clean up http code (ggerganov#5912)

* server : avoid n_available var

ggml-ci

* server: refactor: better http codes

* server : simplify json parsing + add comment about t_last

* server : rename server structs

* server : allow to override FQDN in tests

ggml-ci

* server : add comments

---------

Co-authored-by: Pierrick Hymbert <pierrick.hymbert@gmail.com>
@whoreson (Contributor) commented:

Remove multimodal capabilities - I don't like the existing implementation. Better to completely remove it and implement it properly in the future

...aaand my library of archived llama.cpp versions just gained another entry

NeoZhangJianyu pushed a commit to NeoZhangJianyu/llama.cpp that referenced this pull request Mar 12, 2024

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024

@Kreijstal commented:

  • Remove multimodal capabilities - I don't like the existing implementation. Better to completely remove it and implement it properly in the future

Hmm, wouldn't it be wiser to implement it properly before removing it?

@Dampfinchen commented:

Remove multimodal capabilities - I don't like the existing implementation. Better to completely remove it and implement it properly in the future

...aaand my library of archived llama.cpp versions just gained another entry

I'm also not getting the thought process behind this. It's better to remove something working in place of... nothing at all, because...?

@phymbert (Collaborator) commented:

I'm also not getting the thought process behind this. It's better to remove something working in place of... nothing at all, because...?

Sorry about that, but the previous implementation made it impossible to refactor the code properly and was causing performance issues, so this temporary removal is for the good of server adoption. The removal is tracked in:

Feel free to contribute.

@cartertemm commented Mar 25, 2024

For @chigkim and anyone else with a use case requiring multimodal capabilities (e.g. image descriptions for blind users): not exactly a workaround, but the last release supporting this seems to be b2356, so you can link to that in your app's documentation, which is what I'm having to do in the meantime.

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024

@GrigoryEvko commented:

Really looking forward to having multimodal capabilities back!

Also, please update the example docs - they still mention --mmproj.
