cpp client core when meta closes connection of query config if connection threshold is hit #286
Comments
Change the client log level to LOG_LEVEL_INFORMATION to see more log info.
There are two different core stacks. The first one is as below:
Core was generated by `./pegasus_cpp_sample onebox temp'.
client log:
The other one is as below:
Core was generated by `./pegasus_cpp_sample onebox temp'.
client log:
The problem probably happens because partition_resolver_simple::call sets up a timer for query_config according to the client request timeout, while the internal rpc request timeout is not set in partition_resolver_simple::query_config and therefore defaults to 5 sec. In some cases the timer is triggered before the rpc times out and returns, and the client then cores. I haven't found the exact root cause yet.
Found the root cause of the 2nd core:
bad case
data/log/log.3.txt:E2019-02-22 19:55:48.331 (1550836548331735179 738f) mimic.io-thrd.29583: wss: i am here in partition_resolver_simple::call
good case
data/log/log.5.txt:E2019-02-22 19:58:09.528 (1550836689528468694 74f4) mimic.io-thrd.29940: wss: i am here in partition_resolver_simple::call
Temporarily fixed the problem by passing the client request timeout into partition_resolver_simple::query_config, so that the corresponding timeout is set for the actual rpc call.
E2019-02-26 18:59:30.886 (1551178770886220913 4ca2) mimic.io-thrd.19618: wss: i am here in partition_resolver_simple::call
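The effect of the temporary fix can be sketched roughly as follows. This is only an illustration, not the actual client code: `query_config_request`, `make_request_without_timeout`, `make_request_with_timeout`, `timer_can_fire_first` and `kDefaultRpcTimeout` are hypothetical names, and the 5000 ms value is taken from the "5 sec as default" remark above.

```cpp
#include <cassert>
#include <chrono>

using std::chrono::milliseconds;

// Assumed default for the internal rpc, per the "5 sec as default" remark.
constexpr milliseconds kDefaultRpcTimeout{5000};

// Hypothetical description of one query_config rpc; not the real message type.
struct query_config_request
{
    milliseconds rpc_timeout;
};

// Before the fix: the caller's timeout is ignored and the default applies.
inline query_config_request make_request_without_timeout()
{
    return query_config_request{kDefaultRpcTimeout};
}

// After the fix: the client request timeout is forwarded to the rpc.
// A non-positive value falls back to the default.
inline query_config_request make_request_with_timeout(milliseconds client_timeout)
{
    if (client_timeout.count() <= 0)
        return query_config_request{kDefaultRpcTimeout};
    return query_config_request{client_timeout};
}

// True when the rpc may still be in flight when the client-side timer fires.
inline bool timer_can_fire_first(milliseconds client_timeout,
                                 const query_config_request &req)
{
    assert(req.rpc_timeout.count() > 0);
    return client_timeout < req.rpc_timeout;
}
```

With a 1000 ms client timeout, the request built without a timeout leaves a window in which the timer can fire first, while the request built with the forwarded timeout does not. When the two deadlines are equal the ordering is still not guaranteed, which matches the fix being described as temporary.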
Found the root cause of core 1. It is a different problem, so the solution needs to be reconsidered to fix both problems.
The second problem occurs because end_request is called when the timeout will happen within 100 us, and the timer task then tries to enqueue shortly afterwards. I am considering not ending the request in this case and waiting for the timer to clean up instead, but we still need to make sure the rpc in query_config returns before the timer fires.
E2019-02-28 12:07:52.541 (1551326872541616898 455c) mimic.default1.0101000000000001: temp.client: query config reply, gpid = -1.-1, err = ERR_TIMEOUT
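The idea of leaving cleanup to the timer when the timeout is less than 100 us away could look roughly like the single-threaded sketch below. It is not the real client code: `request_context`, `maybe_end_request`, `on_timer` and `kNearTimeout` are hypothetical names, and the `end_request` here is only a counter standing in for the real function, so that double completion can be checked.

```cpp
#include <cassert>
#include <chrono>

using std::chrono::microseconds;

// Threshold from the "within 100 us" remark; the constant name is hypothetical.
constexpr microseconds kNearTimeout{100};

// Hypothetical request state; the real client keeps more than this.
struct request_context
{
    bool timer_armed = false; // a timeout timer task is still pending
    int end_count = 0;        // how many times the request was ended
};

// Stand-in for the real end_request: only counts completions.
inline void end_request(request_context &rc) { ++rc.end_count; }

// Called on the request path once the remaining time is known.
// Returns true if the request was ended here, false if it is left to the timer.
inline bool maybe_end_request(request_context &rc, microseconds remaining)
{
    if (rc.timer_armed && remaining < kNearTimeout)
        return false; // the timer is about to run and will clean up
    end_request(rc);
    return true;
}

// Timer task: ends the request only if nobody has ended it yet.
inline void on_timer(request_context &rc)
{
    rc.timer_armed = false;
    if (rc.end_count == 0)
        end_request(rc);
    assert(rc.end_count == 1); // the request must be ended exactly once
}
```

In this sketch a request that is 50 us from its deadline is left alone and ended once by the timer, while a request with plenty of time left is ended immediately and the timer then does nothing. Real code would also need thread-safe state, and it would still have to guarantee that the rpc in query_config has returned before the timer runs.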
The problem can be reproduced on the temp branch https://github.com/XiaoMi/pegasus/tree/timeout_core
Steps:
[network]
; how many network threads for network library(used by asio)
io_service_worker_count = 4
connection_threshold_endpoint = 7