server : refactor #5882

Merged
26 commits merged into master from gg/refactor-server on Mar 7, 2024

Conversation

@ggerganov (Owner) commented Mar 5, 2024

ref #4216

Moved the code around to bring logically similar things closer together and did some renaming. The cache_tokens management has been improved - it is now updated only when params.cache_tokens == true. It was also being misused to count the number of tokens that have been processed - n_past is used for that instead.

The context shift and self-extend logic is still very ugly and needs to be simplified, but that will be done in another PR because the changes in this one are already getting too large. I'm thinking about merging it into llama_sampling_context, but the system_prompt is making things difficult. A good first step would probably be to simplify the system_prompt management and then refactor the rest.

  • Remove multimodal capabilities - I don't like the existing implementation. Better to completely remove it and implement it properly in the future
  • (future PR) Refactor context shift + self-extend into common and reuse across examples
  • Improve main loop batching - do not process more than n_batch tokens per iter. Need better slot managing logic (a rough sketch of the idea is included after this list). This will fix batched embeddings and long-blocking requests
  • Fix the code style
  • Disable prompt caching when self-extend is enabled
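
For the batching item above, here is a rough illustrative sketch of the idea - capping the work scheduled per iteration at n_batch tokens across slots. This is Python pseudocode under assumed names (Slot, pending_tokens), not the server's actual C++ code:

```python
from dataclasses import dataclass, field

@dataclass
class Slot:
    # Hypothetical stand-in for a server slot with tokens waiting to be processed.
    pending_tokens: list = field(default_factory=list)

def fill_batch(slots: list[Slot], n_batch: int) -> list:
    """Greedily schedule at most n_batch tokens for this iteration."""
    batch = []
    for slot in slots:
        room = n_batch - len(batch)
        if room <= 0:
            break  # batch is full; remaining slots wait for the next iteration
        batch.extend(slot.pending_tokens[:room])
        slot.pending_tokens = slot.pending_tokens[room:]
    return batch
```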

@phymbert (Collaborator) commented Mar 5, 2024

Nice @ggerganov, it is worth the effort. Please tell me which additional tests would help you secure it.

@ggerganov (Owner, Author) commented:

Here is a test case that fails on master and that I am looking to fix in this PR:

./server -m models/bert-bge-small/ggml-model-f16.gguf --embedding --port 6900 -c 4096 -b 512 -np 8
diff --git a/examples/server-embd.py b/examples/server-embd.py
index c5c4ea87..118e0427 100644
--- a/examples/server-embd.py
+++ b/examples/server-embd.py
@@ -13,7 +13,7 @@ async def main():
     model_url = "http://127.0.0.1:6900"
     responses: list[requests.Response] = await asyncio.gather(*[requests_post_async(
         url= f"{model_url}/embedding",
-        json= {"content": str(i)*1024}
+        json= {"content": str(0)*1024}
     ) for i in range(n)])
 
     for response in responses:
python3 ./examples/server-embd.py
[-0.691986083984375, -0.6994509100914001, -0.34556347131729126, -0.16304560005664825, -0.6425938606262207, 0.6009216904640198, 0.8178988099098206, -0.2026521861553192]
[-0.033909425139427185, -0.11941321194171906, -0.7864583730697632, -0.24582073092460632, -0.17017485201358795, 0.0760011374950409, 0.3244125545024872, -0.38041409850120544]
[-0.0339287631213665, -0.11942094564437866, -0.7864440083503723, -0.24585771560668945, -0.17014935612678528, 0.07596983015537262, 0.32442623376846313, -0.38041090965270996]
[-0.03392859548330307, -0.11942102015018463, -0.7864437699317932, -0.24585801362991333, -0.17014899849891663, 0.07596953213214874, 0.324426531791687, -0.380410760641098]
[-0.46300897002220154, -0.14666011929512024, -0.646159291267395, 0.10496828705072403, -0.528754711151123, 0.09720712900161743, 0.7127938866615295, 0.019480813294649124]
[-0.033925510942935944, -0.1194162666797638, -0.7864490151405334, -0.24582067131996155, -0.17019721865653992, 0.07600662112236023, 0.32440948486328125, -0.38042333722114563]
[-0.03392315283417702, -0.1194247156381607, -0.786466121673584, -0.2458271086215973, -0.1701907217502594, 0.07600513100624084, 0.3244108259677887, -0.38043472170829773]
[-0.35006847977638245, -0.09708067774772644, -0.7097396850585938, 0.1826569139957428, -0.36089810729026794, 0.19932880997657776, 0.8620818853378296, 0.2980068325996399]
Similarity between 0 and 1: 0.41
Similarity between 0 and 2: 0.41
Similarity between 0 and 3: 0.41
Similarity between 0 and 4: 0.76
Similarity between 0 and 5: 0.41
Similarity between 0 and 6: 0.41
Similarity between 0 and 7: 0.69
Similarity between 1 and 2: 1.00
Similarity between 1 and 3: 1.00
Similarity between 1 and 4: 0.47
Similarity between 1 and 5: 1.00
Similarity between 1 and 6: 1.00
Similarity between 1 and 7: 0.45
Similarity between 2 and 3: 1.00
Similarity between 2 and 4: 0.47
Similarity between 2 and 5: 1.00
Similarity between 2 and 6: 1.00
Similarity between 2 and 7: 0.45
Similarity between 3 and 4: 0.47
Similarity between 3 and 5: 1.00
Similarity between 3 and 6: 1.00
Similarity between 3 and 7: 0.45
Similarity between 4 and 5: 0.47
Similarity between 4 and 6: 0.47
Similarity between 4 and 7: 0.95
Similarity between 5 and 6: 1.00
Similarity between 5 and 7: 0.45
Similarity between 6 and 7: 0.45

It should return a similarity of 1.00 for all pairs, but currently it fails because the logic in update_slots() truncates and splits the input prompts. This is incorrect - they have to be processed whole when using embedding models such as BERT. If the prompt does not fit in the batch size (n_batch), we should return an error.
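
For reference, the pairwise check boils down to cosine similarity between the returned embeddings. A minimal illustrative sketch with toy vectors (not the exact server-embd.py code):

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for the embeddings returned by the server;
# identical prompts should give a similarity of exactly 1.00.
embeddings = [[0.1, -0.2, 0.3], [0.1, -0.2, 0.3], [0.5, 0.4, -0.1]]
for i in range(len(embeddings)):
    for j in range(i + 1, len(embeddings)):
        sim = cosine_similarity(embeddings[i], embeddings[j])
        print(f"Similarity between {i} and {j}: {sim:.2f}")
```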

@chigkim commented Mar 6, 2024

@ggerganov Could you please not "remove multimodal capabilities" from the server? Many blind users rely on it for image descriptions.
Thanks so much!

@phymbert (Collaborator) commented Mar 6, 2024

./server -m models/bert-bge-small/ggml-model-f16.gguf --embedding --port 6900 -c 4096 -b 512 -np 8

@ggerganov Could you please upload a bert-bge-small to ggml-org/models?

The one I found at CompendiumLabs/embedding-models-english can no longer be tokenized:

tokenize error to_lower(104)
__pthread_kill_implementation 0x00007fffeee99a1b
__pthread_kill_internal 0x00007fffeee99a1b
__GI___pthread_kill 0x00007fffeee99a1b
__GI_raise 0x00007fffeee428e6
__GI_abort 0x00007fffeee268b7
<unknown> 0x00007fffef2a4f06
<unknown> 0x00007fffef2b6e6c
std::terminate() 0x00007fffef2b6ed7
__cxa_throw 0x00007fffef2b7138
std::__throw_runtime_error(char const*) 0x00007fffef2a8338
std::locale::facet::_S_create_c_locale(__locale_struct*&, char const*, __locale_struct*) 0x00007fffef2db108
std::locale::_Impl::_Impl(char const*, unsigned long) 0x00007fffef2cc9a5
std::locale::locale(char const*) 0x00007fffef2cd585
llm_tokenizer_wpm::to_lower llama.cpp:9355
llm_tokenizer_wpm::preprocess llama.cpp:9319
llm_tokenizer_wpm::tokenize llama.cpp:9246
llama_tokenize_internal llama.cpp:9593
llama_tokenize llama.cpp:13309
llama_tokenize common.cpp:1407
llama_tokenize common.cpp:1396
llama_server_context::tokenize server.cpp:515
llama_server_context::update_slots server.cpp:1719
std::__invoke_impl<…> invoke.h:74
std::__invoke<…> invoke.h:96
std::_Bind<bool (llama_server_context::*(llama_server_context*))()>::__call<bool, , 0ul>(std::tuple<>&&, std::_Index_tuple<0ul>) functional:506
std::_Bind<bool (llama_server_context::*(llama_server_context*))()>::operator()<, bool>() functional:591
std::__invoke_impl<…>(std::__invoke_other, std::_Bind<…> &) invoke.h:61
std::__invoke_r<…>(std::_Bind<…> &) invoke.h:150
std::_Function_handler::_M_invoke(const std::_Any_data &) std_function.h:290
std::function::operator()() const std_function.h:591
llama_server_queue::start_loop utils.hpp:315
main server.cpp:3552
__libc_start_call_main 0x00007fffeee28150
__libc_start_main_impl 0x00007fffeee28209
_start 0x00005555555678a5

@ggerganov (Owner, Author) commented:

Here is one: https://huggingface.co/ggml-org/models/tree/main/bert-bge-small

@ggerganov (Owner, Author) commented:

@phymbert The prompt str(0)*1024 tokenizes to 513 tokens, so I bumped the batch size in the embeddings tests to 1024; otherwise the prompts are not processed because they don't fit entirely in the batch. The tests pass on my local machine - let's see if it works in the CI.
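
If you want to double-check the token count yourself, a quick sketch (assuming the server's /tokenize endpoint and the port 6900 used in the example above):

```python
import requests

prompt = str(0) * 1024  # the same prompt used in server-embd.py
resp = requests.post("http://127.0.0.1:6900/tokenize", json={"content": prompt})
n_tokens = len(resp.json()["tokens"])
print(n_tokens)  # expected to exceed a batch size of 512, hence the bump to 1024
```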

@phymbert (Collaborator) commented Mar 6, 2024

@phymbert The prompt str(0)*1024 tokenizes to 513 tokens, so I bumped the batch size in the embeddings tests to 1024; otherwise the prompts are not processed because they don't fit entirely in the batch. The tests pass on my local machine - let's see if it works in the CI.

@ggerganov Thanks, yes - but if you increase the KV cache size to 2048, it does not pass anymore.

@ggerganov (Owner, Author) commented:

Can you find out why? It works for me.

Review thread on examples/server/server.cpp (outdated, resolved):
snprintf(buf, sizeof(buf), fmt, "Unknown Exception");
}

res.set_content(buf, "text/plain; charset=utf-8");
A collaborator commented:

Maybe off topic, but errors should be wrapped inside an object to exactly match the OpenAI format:

{
    "error": {
        "message": "You didn't provide an API key. .......",
        "type": "invalid_request_error",
        "param": null,
        "code": null
    }
}

I can do that in another PR, though.
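
Until that lands, a hypothetical client-side helper (not part of the server) could tolerate both the current plain-text errors and an OpenAI-style error object, e.g.:

```python
import requests

def extract_error_message(resp: requests.Response) -> str:
    # Accept either a plain-text body (current behaviour) or an
    # OpenAI-style {"error": {"message": ...}} object.
    try:
        body = resp.json()
    except ValueError:
        return resp.text
    if isinstance(body, dict) and isinstance(body.get("error"), dict):
        return body["error"].get("message", str(body["error"]))
    return resp.text
```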

@ggerganov merged commit 2002bc9 into master on Mar 7, 2024
56 of 61 checks passed
@ggerganov deleted the gg/refactor-server branch on March 7, 2024 at 09:41
slot.sparams.mirostat_tau = json_value(data, "mirostat_tau", default_sparams.mirostat_tau);
slot.sparams.mirostat_eta = json_value(data, "mirostat_eta", default_sparams.mirostat_eta);
slot.sparams.penalize_nl = json_value(data, "penalize_nl", default_sparams.penalize_nl);
slot.params.n_keep = json_value(data, "n_keep", slot.params.n_keep);
@phymbert (Collaborator) commented Mar 7, 2024

@ggerganov For OAI Completions, "n_keep" is not set in the JSON data, so it's always 0 and it triggers context shifting with n_keep=1:

const int n_keep = slot.params.n_keep + add_bos_token;

if (slot.params.n_keep < 0) {

Or is there something I don't understand?

@ggerganov (Owner, Author) commented:

Hm, yes - likely this has to default to default_params.n_keep.
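
As a possible client-side stopgap in the meantime, a sketch assuming the native /completion endpoint on the default port 8080, where passing n_keep = -1 should retain the whole prompt:

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={"prompt": "Hello", "n_predict": 16, "n_keep": -1},
)
print(resp.json()["content"])
```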

@ggerganov (Owner, Author) commented:

It's likely that I broke something, but I'm merging this to unblock other PRs. We should focus on fixing embeddings tokenization and normalization next (#5801 (comment)).
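
For context, "normalization" here refers to L2-normalizing the embedding vectors, so that identical inputs yield a cosine similarity of exactly 1.00. A minimal illustrative sketch (not the server's implementation):

```python
import numpy as np

def l2_normalize(v):
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

print(l2_normalize([3.0, 4.0]))  # -> [0.6 0.8]
```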

@ngxson mentioned this pull request on Mar 8, 2024
@sorasoras commented Mar 8, 2024

It's likely that I broke something, but I'm merging this to unblock other PRs. We should focus on fixing embeddings tokenization and normalization next (#5801 (comment)).

It looks like server.exe won't open after this PR with ROCm on Windows - there is no response.

cmake .. -G "Ninja" -DCMAKE_BUILD_TYPE=Release -DLLAMA_HIPBLAS=ON  -DCMAKE_C_COMPILER="C:/Program Files/AMD/ROCm/5.7/bin/clang.exe" -DCMAKE_CXX_COMPILER="C:/Program Files/AMD/ROCm/5.7/bin/clang++.exe" -DAMDGPU_TARGETS="gfx1100;gfx1030;gfx1031"

 cmake --build . --config Release

hazelnutcloud pushed a commit to hazelnutcloud/llama.cpp that referenced this pull request Mar 10, 2024
* server : refactoring (wip)

* server : remove llava/clip objects from build

* server : fix empty prompt handling + all slots idle logic

* server : normalize id vars

* server : code style

* server : simplify model chat template validation

* server : code style

* server : minor

* llama : llama_chat_apply_template support null buf

* server : do not process embedding requests when disabled

* server : reorganize structs and enums + naming fixes

* server : merge oai.hpp in utils.hpp

* server : refactor system prompt update at start

* server : disable cached prompts with self-extend

* server : do not process more than n_batch tokens per iter

* server: tests: embeddings use a real embeddings model (ggerganov#5908)

* server, tests : bump batch to fit 1 embedding prompt

* server: tests: embeddings fix build type Debug is randomly failing (ggerganov#5911)

* server: tests: embeddings, use different KV Cache size

* server: tests: embeddings, fixed prompt do not exceed n_batch, increase embedding timeout, reduce number of concurrent embeddings

* server: tests: embeddings, no need to wait for server idle as it can timout

* server: refactor: clean up http code (ggerganov#5912)

* server : avoid n_available var

ggml-ci

* server: refactor: better http codes

* server : simplify json parsing + add comment about t_last

* server : rename server structs

* server : allow to override FQDN in tests

ggml-ci

* server : add comments

---------

Co-authored-by: Pierrick Hymbert <pierrick.hymbert@gmail.com>
@whoreson (Contributor) commented:

Remove multimodal capabilities - I don't like the existing implementation. Better to completely remove it and implement it properly in the future

...aaand my library of archived llama.cpp versions just gained another entry

NeoZhangJianyu pushed a commit to NeoZhangJianyu/llama.cpp that referenced this pull request Mar 12, 2024

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024

@Kreijstal commented:

  • Remove multimodal capabilities - I don't like the existing implementation. Better to completely remove it and implement it properly in the future

Hmm, wouldn't it be wiser to implement it properly before removing it?

@Dampfinchen commented:

Remove multimodal capabilities - I don't like the existing implementation. Better to completely remove it and implement it properly in the future

...aaand my library of archived llama.cpp versions just gained another entry

I'm also not getting the thought process behind this. It's better to remove something working in place of... nothing at all, because...?

@phymbert (Collaborator) commented:

I'm also not getting the thought process behind this. It's better to remove something working in place of... nothing at all, because...?

Sorry about that, but the previous implementation made it impossible to refactor the code properly and was causing performance issues, so this temporary removal is for the good of server adoption. The removal is tracked in:

Feel free to contribute.

@cartertemm commented Mar 25, 2024

For @chigkim and anyone else with a use case requiring multimodal capabilities (e.g. image descriptions for blind users): not exactly a workaround, but the last release supporting this seems to be b2356, so you can link to that in your app's documentation, which is what I'm having to do in the meantime.

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024

@GrigoryEvko commented:

Really looking forward to having multimodal capabilities back!

Also, please update the example docs - they still mention --mmproj.
