
Bug: crash in tokenizer while using embedding endpoint #7589

Open
skoulik opened this issue May 28, 2024 · 3 comments
Labels
bug-unconfirmed, high severity (used to report high severity bugs in llama.cpp where a malfunction hinders an important workflow)

Comments

skoulik commented May 28, 2024

What happened?

server --n-gpu-layers 13 --model D:\code\test_llm\models\embedding\nomic-embed-text-v1.5.f16.gguf --ctx-size 8192 --batch-size 8192 --rope-scaling yarn --rope-freq-scale 0.75 --port 8081 --embeddings --verbose

Exception thrown at 0x00007FFA7F76BA99 in server.exe: Microsoft C++ exception: nlohmann::json_abi_v3_11_3::detail::type_error at memory location 0x0000001B981275F0.
Unhandled exception at 0x00007FFA7F76BA99 in server.exe: Microsoft C++ exception: nlohmann::json_abi_v3_11_3::detail::type_error at memory location 0x0000001B981275F0.

 	KernelBase.dll!00007ffa7f76ba99()	Unknown
 	vcruntime140d.dll!00007ffa3a28b460()	Unknown
	server.exe!nlohmann::json_abi_v3_11_3::detail::from_json<nlohmann::basic_json<nlohmann::ordered_map>,int,0>(const nlohmann::json_abi_v3_11_3::basic_json<nlohmann::ordered_map,std::vector,std::string,bool,long long,unsigned long long,double,std::allocator,adl_serializer,std::vector<unsigned char,std::allocator<unsigned char>>,void> & j={...}, int & val=0) Line 4999	C++
 	server.exe!nlohmann::json_abi_v3_11_3::detail::from_json_fn::operator()<nlohmann::basic_json<nlohmann::ordered_map>,int &>(const nlohmann::json_abi_v3_11_3::basic_json<nlohmann::ordered_map,std::vector,std::string,bool,long long,unsigned long long,double,std::allocator,adl_serializer,std::vector<unsigned char,std::allocator<unsigned char>>,void> & j={...}, int & val=0) Line 5105	C++
 	server.exe!nlohmann::json_abi_v3_11_3::adl_serializer<int,void>::from_json<const nlohmann::basic_json<nlohmann::ordered_map> &,int>(const nlohmann::json_abi_v3_11_3::basic_json<nlohmann::ordered_map,std::vector,std::string,bool,long long,unsigned long long,double,std::allocator,adl_serializer,std::vector<unsigned char,std::allocator<unsigned char>>,void> & j={...}, int & val=0) Line 5843	C++
 	server.exe!nlohmann::json_abi_v3_11_3::basic_json<nlohmann::ordered_map,std::vector,std::string,bool,long long,unsigned long long,double,std::allocator,adl_serializer,std::vector<unsigned char,std::allocator<unsigned char>>,void>::get_impl<int,0>(nlohmann::json_abi_v3_11_3::detail::priority_tag<0>={...}) Line 20914	C++
 	server.exe!nlohmann::json_abi_v3_11_3::basic_json<nlohmann::ordered_map,std::vector,std::string,bool,long long,unsigned long long,double,std::allocator,adl_serializer,std::vector<unsigned char,std::allocator<unsigned char>>,void>::get<int,int>() Line 21056	C++
 	server.exe!server_context::tokenize(const nlohmann::json_abi_v3_11_3::basic_json<nlohmann::ordered_map,std::vector,std::string,bool,long long,unsigned long long,double,std::allocator,adl_serializer,std::vector<unsigned char,std::allocator<unsigned char>>,void> & json_prompt={...}, bool add_special=true) Line 799	C++
 	server.exe!server_context::update_slots() Line 1948	C++
 	server.exe!std::invoke<void (server_context::*&)(),server_context *&>(void(server_context::*)() & _Obj=0x00007ff773ea4350, server_context * & _Arg1=0x0000001b9812e178) Line 1540	C++
 	server.exe!std::_Invoker_ret<std::_Unforced,0>::_Call<void (server_context::*&)(),server_context *&>(void(server_context::*)() & _Func=0x00007ff773ea4350, server_context * & _Vals=0x0000001b9812e178) Line 670	C++
 	server.exe!std::_Call_binder<std::_Unforced,0,void (server_context::*)(),std::tuple<server_context *>,std::tuple<>>(std::_Invoker_ret<std::_Unforced,0>={...}, std::integer_sequence<unsigned long long,0>={...}, void(server_context::*)() & _Obj=0x00007ff773ea4350, std::tuple<server_context *> & _Tpl={...}, std::tuple<> && _Ut={...}) Line 1307	C++
 	server.exe!std::_Binder<std::_Unforced,void (server_context::*)(),server_context *>::operator()<>() Line 1344	C++
 	server.exe!std::invoke<std::_Binder<std::_Unforced,void (server_context::*)(),server_context *> &>(std::_Binder<std::_Unforced,void (server_context::*)(),server_context *> & _Obj={...}) Line 1524	C++
 	server.exe!std::_Invoker_ret<void,1>::_Call<std::_Binder<std::_Unforced,void (server_context::*)(),server_context *> &>(std::_Binder<std::_Unforced,void (server_context::*)(),server_context *> & _Func={...}) Line 652	C++
 	server.exe!std::_Func_impl_no_alloc<std::_Binder<std::_Unforced,void (server_context::*)(),server_context *>,void>::_Do_call() Line 822	C++
 	server.exe!std::_Func_class<void>::operator()() Line 869	C++
 	server.exe!server_queue::start_loop() Line 512	C++
 	server.exe!main(int argc=17, char * * argv=0x00000162afa057e0) Line 3837	C++
 	server.exe!invoke_main() Line 79	C++
 	server.exe!__scrt_common_main_seh() Line 288	C++
 	server.exe!__scrt_common_main() Line 331	C++
 	server.exe!mainCRTStartup(void * __formal=0x0000001b9824e000) Line 17	C++
 	kernel32.dll!BaseThreadInitThunk()	Unknown
 	ntdll.dll!RtlUserThreadStart()	Unknown

Client code:

from langchain_openai.embeddings import OpenAIEmbeddings

# Point langchain's OpenAI-compatible embeddings client at the local server.
embeddings_model = OpenAIEmbeddings(
    model="",
    deployment="",
    openai_api_key="0",          # dummy key; the client requires a value
    openai_api_base="http://localhost:8081",
    embedding_ctx_length=8192,
)

embeddings_model.embed_query("Test")
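
For reference, OpenAIEmbeddings tokenizes queries client-side (via tiktoken) by default before calling the API, so the server receives token IDs rather than raw text; that matches the get<int>() coercion in the server_context::tokenize frame above. Below is a rough way to poke the same path without langchain. It is a sketch only: the /v1/embeddings route and both payload shapes are assumptions on my part, and the token IDs are placeholders (the real request bodies are in the captures attached in the comments below).

import requests

# Sketch of a direct repro without langchain. The /v1/embeddings route
# and both payload shapes are assumptions, and the token IDs are
# placeholders; see the attached captures for the real request bodies.
url = "http://localhost:8081/v1/embeddings"
for payload in (
    {"input": [[101, 3231, 102]]},  # single pre-tokenized query, batched form
    {"input": [101, 3231, 102]},    # single pre-tokenized query, flat form
):
    r = requests.post(url, json=payload, timeout=30)
    print(r.status_code, r.text[:200])

If either shape reproduces the throw, that isolates the crash from the langchain client entirely.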

Name and Version

edc2943

What operating system are you seeing the problem on?

Windows

Relevant log output

.......................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 8192
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000.0
llama_new_context_with_model: freq_scale = 0.75
llama_kv_cache_init:  CUDA_Host KV buffer size =   288.00 MiB
llama_new_context_with_model: KV self size  =  288.00 MiB, K (f16):  144.00 MiB, V (f16):  144.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 23.00 MiB
ggml_gallocr_reserve_n: reallocating CUDA_Host buffer from size 0.00 MiB to 3.50 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =    23.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     3.50 MiB
llama_new_context_with_model: graph nodes  = 453
llama_new_context_with_model: graph splits = 2
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture
{"tid":"8792","timestamp":1716901289,"level":"INFO","function":"init","line":715,"msg":"initializing slots","n_slots":1}{"tid":"8792","timestamp":1716901289,"level":"INFO","function":"init","line":727,"msg":"new slot","id_slot":0,"n_ctx_slot":8192}
{"tid":"8792","timestamp":1716901289,"level":"INFO","function":"main","line":3040,"msg":"model loaded"}
{"tid":"8792","timestamp":1716901289,"level":"VERB","function":"format_chat","line":156,"msg":"formatted_chat","text":"<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n"}
{"tid":"8792","timestamp":1716901289,"level":"INFO","function":"main","line":3065,"msg":"chat template","chat_example":"<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n","built_in":true}
{"tid":"8792","timestamp":1716901289,"level":"INFO","function":"main","line":3793,"msg":"HTTP server listening","hostname":"127.0.0.1","port":"8081","n_threads_http":"31"}
{"tid":"8792","timestamp":1716901289,"level":"VERB","function":"start_loop","line":478,"msg":"new task may arrive"}
{"tid":"8792","timestamp":1716901289,"level":"VERB","function":"start_loop","line":493,"msg":"update_multitasks"}
{"tid":"8792","timestamp":1716901289,"level":"VERB","function":"start_loop","line":510,"msg":"callback_update_slots"}
{"tid":"8792","timestamp":1716901289,"level":"INFO","function":"update_slots","line":1812,"msg":"all slots are idle"}
{"tid":"8792","timestamp":1716901289,"level":"VERB","function":"kv_cache_clear","line":1052,"msg":"clearing KV cache"}
{"tid":"8792","timestamp":1716901289,"level":"VERB","function":"start_loop","line":514,"msg":"wait for new task"}
{"tid":"18200","timestamp":1716901316,"level":"VERB","function":"get_new_id","line":431,"msg":"new task id","new_id":0}
{"tid":"18200","timestamp":1716901316,"level":"VERB","function":"add_waiting_task_id","line":570,"msg":"waiting for task id","id_task":0}
{"tid":"8792","timestamp":1716901316,"level":"VERB","function":"start_loop","line":478,"msg":"new task may arrive"}
{"tid":"8792","timestamp":1716901316,"level":"VERB","function":"start_loop","line":489,"msg":"callback_new_task","id_task":0}
{"tid":"8792","timestamp":1716901316,"level":"INFO","function":"launch_slot_with_task","line":1046,"msg":"slot is processing task","id_slot":0,"id_task":0}
{"tid":"8792","timestamp":1716901316,"level":"VERB","function":"start_loop","line":493,"msg":"update_multitasks"}
{"tid":"8792","timestamp":1716901316,"level":"VERB","function":"start_loop","line":510,"msg":"callback_update_slots"}
{"tid":"8792","timestamp":1716901316,"level":"VERB","function":"update_slots","line":1822,"msg":"posting NEXT_RESPONSE"}{"tid":"8792","timestamp":1716901316,"level":"VERB","function":"post","line":414,"msg":"new task id","new_id":1}
{"tid":"8792","timestamp":1716901316,"level":"VERB","function":"update_slots","line":1921,"msg":"tokenizing prompt","id_slot":0,"id_task":0}
skoulik added the bug-unconfirmed and high severity labels May 28, 2024

skoulik commented May 28, 2024

I've been experimenting further and found that if you request embeddings for multiple queries in a single request, the server works fine, but if you request just one, it crashes. See the attached HTTP captures.
bad.txt
good.txt
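
For anyone reading without the attachments: the two requests presumably differ only in how many entries "input" carries. A hypothetical reconstruction with placeholder token IDs (the real bodies are in good.txt / bad.txt):

# Hypothetical reconstruction with placeholder token IDs; see the
# attached captures for the real request bodies.
good = {"input": [[101, 3231, 102], [101, 7592, 102]]}  # several queries: works
bad  = {"input": [[101, 3231, 102]]}                    # a single query: crashes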


skoulik commented May 28, 2024

BTW, I'm not sure why the input looks pre-tokenized. Must be a langchain thing; investigating.
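
If the client-side tokenization turns out to be the trigger, langchain's OpenAIEmbeddings exposes a check_embedding_ctx_length flag; with it set to False the client skips its local tiktoken pass and sends raw strings. A possible workaround sketch (verify that the flag exists in your installed langchain_openai version):

from langchain_openai.embeddings import OpenAIEmbeddings

# Possible workaround sketch: with check_embedding_ctx_length=False the
# client sends raw text instead of token IDs. Assumes the flag is
# available in the installed langchain_openai version.
embeddings_model = OpenAIEmbeddings(
    openai_api_key="0",
    openai_api_base="http://localhost:8081",
    check_embedding_ctx_length=False,
)
embeddings_model.embed_query("Test")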


stygmate commented Jun 8, 2024

This issue may be related: #7221 (to check)
