
Bug: crash in tokenizer while using embedding endpoint #7589

Open
skoulik opened this issue May 28, 2024 · 3 comments
Labels
bug-unconfirmed, high severity (used to report high severity bugs in llama.cpp where a malfunction hinders an important workflow)

Comments

skoulik commented May 28, 2024

What happened?

server --n-gpu-layers 13 --model D:\code\test_llm\models\embedding\nomic-embed-text-v1.5.f16.gguf --ctx-size 8192 --batch-size 8192 --rope-scaling yarn --rope-freq-scale 0.75 --port 8081 --embeddings --verbose

Exception thrown at 0x00007FFA7F76BA99 in server.exe: Microsoft C++ exception: nlohmann::json_abi_v3_11_3::detail::type_error at memory location 0x0000001B981275F0.
Unhandled exception at 0x00007FFA7F76BA99 in server.exe: Microsoft C++ exception: nlohmann::json_abi_v3_11_3::detail::type_error at memory location 0x0000001B981275F0.

 	KernelBase.dll!00007ffa7f76ba99()	Unknown
 	vcruntime140d.dll!00007ffa3a28b460()	Unknown
	server.exe!nlohmann::json_abi_v3_11_3::detail::from_json<nlohmann::basic_json<nlohmann::ordered_map>,int,0>(const nlohmann::json_abi_v3_11_3::basic_json<nlohmann::ordered_map,std::vector,std::string,bool,long long,unsigned long long,double,std::allocator,adl_serializer,std::vector<unsigned char,std::allocator<unsigned char>>,void> & j={...}, int & val=0) Line 4999	C++
 	server.exe!nlohmann::json_abi_v3_11_3::detail::from_json_fn::operator()<nlohmann::basic_json<nlohmann::ordered_map>,int &>(const nlohmann::json_abi_v3_11_3::basic_json<nlohmann::ordered_map,std::vector,std::string,bool,long long,unsigned long long,double,std::allocator,adl_serializer,std::vector<unsigned char,std::allocator<unsigned char>>,void> & j={...}, int & val=0) Line 5105	C++
 	server.exe!nlohmann::json_abi_v3_11_3::adl_serializer<int,void>::from_json<const nlohmann::basic_json<nlohmann::ordered_map> &,int>(const nlohmann::json_abi_v3_11_3::basic_json<nlohmann::ordered_map,std::vector,std::string,bool,long long,unsigned long long,double,std::allocator,adl_serializer,std::vector<unsigned char,std::allocator<unsigned char>>,void> & j={...}, int & val=0) Line 5843	C++
 	server.exe!nlohmann::json_abi_v3_11_3::basic_json<nlohmann::ordered_map,std::vector,std::string,bool,long long,unsigned long long,double,std::allocator,adl_serializer,std::vector<unsigned char,std::allocator<unsigned char>>,void>::get_impl<int,0>(nlohmann::json_abi_v3_11_3::detail::priority_tag<0>={...}) Line 20914	C++
 	server.exe!nlohmann::json_abi_v3_11_3::basic_json<nlohmann::ordered_map,std::vector,std::string,bool,long long,unsigned long long,double,std::allocator,adl_serializer,std::vector<unsigned char,std::allocator<unsigned char>>,void>::get<int,int>() Line 21056	C++
 	server.exe!server_context::tokenize(const nlohmann::json_abi_v3_11_3::basic_json<nlohmann::ordered_map,std::vector,std::string,bool,long long,unsigned long long,double,std::allocator,adl_serializer,std::vector<unsigned char,std::allocator<unsigned char>>,void> & json_prompt={...}, bool add_special=true) Line 799	C++
 	server.exe!server_context::update_slots() Line 1948	C++
 	server.exe!std::invoke<void (server_context::*&)(),server_context *&>(void(server_context::*)() & _Obj=0x00007ff773ea4350, server_context * & _Arg1=0x0000001b9812e178) Line 1540	C++
 	server.exe!std::_Invoker_ret<std::_Unforced,0>::_Call<void (server_context::*&)(),server_context *&>(void(server_context::*)() & _Func=0x00007ff773ea4350, server_context * & _Vals=0x0000001b9812e178) Line 670	C++
 	server.exe!std::_Call_binder<std::_Unforced,0,void (server_context::*)(),std::tuple<server_context *>,std::tuple<>>(std::_Invoker_ret<std::_Unforced,0>={...}, std::integer_sequence<unsigned long long,0>={...}, void(server_context::*)() & _Obj=0x00007ff773ea4350, std::tuple<server_context *> & _Tpl={...}, std::tuple<> && _Ut={...}) Line 1307	C++
 	server.exe!std::_Binder<std::_Unforced,void (server_context::*)(),server_context *>::operator()<>() Line 1344	C++
 	server.exe!std::invoke<std::_Binder<std::_Unforced,void (server_context::*)(),server_context *> &>(std::_Binder<std::_Unforced,void (server_context::*)(),server_context *> & _Obj={...}) Line 1524	C++
 	server.exe!std::_Invoker_ret<void,1>::_Call<std::_Binder<std::_Unforced,void (server_context::*)(),server_context *> &>(std::_Binder<std::_Unforced,void (server_context::*)(),server_context *> & _Func={...}) Line 652	C++
 	server.exe!std::_Func_impl_no_alloc<std::_Binder<std::_Unforced,void (server_context::*)(),server_context *>,void>::_Do_call() Line 822	C++
 	server.exe!std::_Func_class<void>::operator()() Line 869	C++
 	server.exe!server_queue::start_loop() Line 512	C++
 	server.exe!main(int argc=17, char * * argv=0x00000162afa057e0) Line 3837	C++
 	server.exe!invoke_main() Line 79	C++
 	server.exe!__scrt_common_main_seh() Line 288	C++
 	server.exe!__scrt_common_main() Line 331	C++
 	server.exe!mainCRTStartup(void * __formal=0x0000001b9824e000) Line 17	C++
 	kernel32.dll!BaseThreadInitThunk()	Unknown
 	ntdll.dll!RtlUserThreadStart()	Unknown

Client code:

from langchain_openai.embeddings import OpenAIEmbeddings

# Point langchain's OpenAI-compatible embeddings client at the local server.
embeddings_model = OpenAIEmbeddings(
    model="",
    deployment="",
    openai_api_key="0",          # dummy key; the client requires a value
    openai_api_base="http://localhost:8081",
    embedding_ctx_length=8192,
)

embeddings_model.embed_query("Test")
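
For reference, OpenAIEmbeddings tokenizes queries client-side (via tiktoken) by default before calling the API, so the server receives token IDs rather than raw text; that matches the get<int>() coercion in the server_context::tokenize frame above. Below is a rough way to poke the same path without langchain. It is a sketch only: the /v1/embeddings route and both payload shapes are assumptions on my part, and the token IDs are placeholders (the real request bodies are in the captures attached in the comments below).

import requests

# Sketch of a direct repro without langchain. The /v1/embeddings route
# and both payload shapes are assumptions, and the token IDs are
# placeholders; see the attached captures for the real request bodies.
url = "http://localhost:8081/v1/embeddings"
for payload in (
    {"input": [[101, 3231, 102]]},  # single pre-tokenized query, batched form
    {"input": [101, 3231, 102]},    # single pre-tokenized query, flat form
):
    r = requests.post(url, json=payload, timeout=30)
    print(r.status_code, r.text[:200])

If either shape reproduces the throw, that isolates the crash from the langchain client entirely.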

Name and Version

edc2943

What operating system are you seeing the problem on?

Windows

Relevant log output

.......................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 8192
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000.0
llama_new_context_with_model: freq_scale = 0.75
llama_kv_cache_init:  CUDA_Host KV buffer size =   288.00 MiB
llama_new_context_with_model: KV self size  =  288.00 MiB, K (f16):  144.00 MiB, V (f16):  144.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 23.00 MiB
ggml_gallocr_reserve_n: reallocating CUDA_Host buffer from size 0.00 MiB to 3.50 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =    23.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     3.50 MiB
llama_new_context_with_model: graph nodes  = 453
llama_new_context_with_model: graph splits = 2
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
{"tid":"8792","timestamp":1716901289,"level":"INFO","function":"init","line":715,"msg":"initializing slots","n_slots":1}{"tid":"8792","timestamp":1716901289,"level":"INFO","function":"init","line":727,"msg":"new slot","id_slot":0,"n_ctx_slot":8192}
{"tid":"8792","timestamp":1716901289,"level":"INFO","function":"main","line":3040,"msg":"model loaded"}
{"tid":"8792","timestamp":1716901289,"level":"VERB","function":"format_chat","line":156,"msg":"formatted_chat","text":"<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n"}
{"tid":"8792","timestamp":1716901289,"level":"INFO","function":"main","line":3065,"msg":"chat template","chat_example":"<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n","built_in":true}
{"tid":"8792","timestamp":1716901289,"level":"INFO","function":"main","line":3793,"msg":"HTTP server listening","hostname":"127.0.0.1","port":"8081","n_threads_http":"31"}
{"tid":"8792","timestamp":1716901289,"level":"VERB","function":"start_loop","line":478,"msg":"new task may arrive"}
{"tid":"8792","timestamp":1716901289,"level":"VERB","function":"start_loop","line":493,"msg":"update_multitasks"}
{"tid":"8792","timestamp":1716901289,"level":"VERB","function":"start_loop","line":510,"msg":"callback_update_slots"}
{"tid":"8792","timestamp":1716901289,"level":"INFO","function":"update_slots","line":1812,"msg":"all slots are idle"}
{"tid":"8792","timestamp":1716901289,"level":"VERB","function":"kv_cache_clear","line":1052,"msg":"clearing KV cache"}
{"tid":"8792","timestamp":1716901289,"level":"VERB","function":"start_loop","line":514,"msg":"wait for new task"}
{"tid":"18200","timestamp":1716901316,"level":"VERB","function":"get_new_id","line":431,"msg":"new task id","new_id":0}
{"tid":"18200","timestamp":1716901316,"level":"VERB","function":"add_waiting_task_id","line":570,"msg":"waiting for task id","id_task":0}
{"tid":"8792","timestamp":1716901316,"level":"VERB","function":"start_loop","line":478,"msg":"new task may arrive"}
{"tid":"8792","timestamp":1716901316,"level":"VERB","function":"start_loop","line":489,"msg":"callback_new_task","id_task":0}
{"tid":"8792","timestamp":1716901316,"level":"INFO","function":"launch_slot_with_task","line":1046,"msg":"slot is processing task","id_slot":0,"id_task":0}
{"tid":"8792","timestamp":1716901316,"level":"VERB","function":"start_loop","line":493,"msg":"update_multitasks"}
{"tid":"8792","timestamp":1716901316,"level":"VERB","function":"start_loop","line":510,"msg":"callback_update_slots"}
{"tid":"8792","timestamp":1716901316,"level":"VERB","function":"update_slots","line":1822,"msg":"posting NEXT_RESPONSE"}{"tid":"8792","timestamp":1716901316,"level":"VERB","function":"post","line":414,"msg":"new task id","new_id":1}
{"tid":"8792","timestamp":1716901316,"level":"VERB","function":"update_slots","line":1921,"msg":"tokenizing prompt","id_slot":0,"id_task":0}
skoulik added the bug-unconfirmed and high severity labels May 28, 2024

skoulik commented May 28, 2024

I've been experimenting further and found that if you request embeddings for multiple queries in a single request, the server works fine, but if you request just one, it crashes. See the attached HTTP captures.
bad.txt
good.txt
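
For anyone reading without the attachments: the two requests presumably differ only in how many entries "input" carries. A hypothetical reconstruction with placeholder token IDs (the real bodies are in good.txt / bad.txt):

# Hypothetical reconstruction with placeholder token IDs; see the
# attached captures for the real request bodies.
good = {"input": [[101, 3231, 102], [101, 7592, 102]]}  # several queries: works
bad  = {"input": [[101, 3231, 102]]}                    # a single query: crashes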


skoulik commented May 28, 2024

BTW, I'm not sure why the input looks pre-tokenized. Must be a langchain thing; investigating.
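
If the client-side tokenization turns out to be the trigger, langchain's OpenAIEmbeddings exposes a check_embedding_ctx_length flag; with it set to False the client skips its local tiktoken pass and sends raw strings. A possible workaround sketch (verify that the flag exists in your installed langchain_openai version):

from langchain_openai.embeddings import OpenAIEmbeddings

# Possible workaround sketch: with check_embedding_ctx_length=False the
# client sends raw text instead of token IDs. Assumes the flag is
# available in the installed langchain_openai version.
embeddings_model = OpenAIEmbeddings(
    openai_api_key="0",
    openai_api_base="http://localhost:8081",
    check_embedding_ctx_length=False,
)
embeddings_model.embed_query("Test")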


stygmate commented Jun 8, 2024

This issue may be related: #7221 (to check)
