New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bitcoin-cli: print useful error if bitcoind rpc work queue exceeded #18335
bitcoin-cli: print useful error if bitcoind rpc work queue exceeded #18335
Conversation
To reproduce the problem, start the client:
This submits this trivial RPC 10 time in the background (the
But most users wouldn't think to look there, or wouldn't have access. With this PR, the error is more understandable. The failed request in this case is retryable, so maybe that should be indicated in the text, but this text ( |
src/bitcoin-cli.cpp
Outdated
@@ -404,7 +404,7 @@ static UniValue CallRPC(BaseRequestHandler *rh, const std::string& strMethod, co | |||
// Parse reply | |||
UniValue valReply(UniValue::VSTR); | |||
if (!valReply.read(response.body)) | |||
throw std::runtime_error("couldn't parse reply from server"); | |||
throw std::runtime_error(strprintf("server response: %s", response.body)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's strange that it ends up here. It's trying to parse an error message as JSON! Isn't there some error code (either HTTP or JSON-RPC) that could be checked for instead? If not, that seems like a bug in the server.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does seem wrong that my proposed fix retains the practice of trying to decode a non-JSON string as JSON, and depending on that to fail. Yes, perhaps this is a bug in the server -- should it be returning JSON in body
here rather than a plain string? But as I said at the end of the comment above, I was hesitant to change the server because I'm worried that there may be other RPC clients besides bitcoin-cli
that depend on the current (plain string) response. What do you think? If the fix is better in the server, I can do that.
Just for further information, the response returned in this work queue depth exceeded is (status 500 is HTTP_INTERNAL_SERVER_ERROR
):
(gdb) f 0
Breakpoint 1, CallRPC (rh=0x555555845e90, strMethod="getblockhash", args=std::vector of length 1, capacity 2 = {...})
at bitcoin-cli.cpp:406
406 if (!valReply.read(response.body))
(gdb) p response
$2 = {
status = 500,
error = -1,
body = "Work queue depth exceeded"
}
(gdb)
Here's a successful reply:
(gdb) p response
$6 = {
status = 200,
error = -1,
body = "{\"result\":\"000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f\",\"error\":null,\"id\":1}\n"
}
Here's the response when requesting a nonsense method:
(gdb) p response
$2 = {
status = 404,
error = -1,
body = "{\"result\":null,\"error\":{\"code\":-32601,\"message\":\"Method not found\"},\"id\":1}\n"
}
As part of this PR, I could also systematically search for and fix any other places where body
is returned as plain string (not JSON); so far I see only one:
req->WriteReply(HTTP_BAD_METHOD, "JSONRPC server handles only POST requests")
The second commit undoes the first, and fixes the problem in the server by having it return JSON in the body of the response:
(Here's where that error code,
|
Ok, looks like his is actually non-trivial.
So in this case, possibly you could detect this error code in bitcoin-cli, and print an error message instead of trying to parse as JSON. Or are there cases in which it returns 500 but the response needs to be parsed as JSON? I must admit we're sometimes inconsistent about whether to report errors at the HTTP layer or the JSON-RPC, so it's very possible that your solution of changing the server to return a JSON error is the most consistent one! (I'm not personally very worried about clients depending on a specific exact error message text, though we might want to mention it briefly in the release notes) |
I checked this and yes, we do return JSON errors with static void JSONErrorReply(HTTPRequest* req, const UniValue& objError, const UniValue& id)
{
// Send error reply from json-rpc error object
int nStatus = HTTP_INTERNAL_SERVER_ERROR;
int code = find_value(objError, "code").get_int();
if (code == RPC_INVALID_REQUEST)
nStatus = HTTP_BAD_REQUEST;
else if (code == RPC_METHOD_NOT_FOUND)
nStatus = HTTP_NOT_FOUND;
std::string strReply = JSONRPCReply(NullUniValue, objError, id);
req->WriteHeader("Content-Type", "application/json");
req->WriteReply(nStatus, strReply);
} So we have no way to distinguish "parse this as JSON?" based on just the status code. I think the current change of changing the server to always return JSON errors is OK. I think it's slightly more elegant than doing "try to parse as JSON and print the raw text otherwise" in the client. But I don't have a strong opinion. |
@LarryRuane If you find this approach simpler you may grab it :) |
912e4a5
to
4bb892c
Compare
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ConflictsNo conflicts as of last run. |
4bb892c
to
b5a80fa
Compare
Rebase (refresh) |
Concept ACK. Could you please add functional tests for this in one of the |
test/functional/interface_rpc.py
Outdated
for _ in range(400): | ||
if got_exceeded_error: | ||
break | ||
try: | ||
node.cli('getrpcinfo').send_cli() | ||
except subprocess.CalledProcessError as e: | ||
if e.output != 'error: Server response: Work queue depth exceeded\n': | ||
raise AssertionError("Unexpected cli error: :{}:".format(e.output)) | ||
got_exceeded_error = True | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This 400 is, of course, arbitrary, as is the number of threads this test starts (5, see below). These numbers are the result of simple trial-and-error. We want each thread to do enough work that it's still running by the time all 5 threads are started. Still, the test only takes about 1.3 seconds to run on my system. Increasing the number of threads allows the iteration count here to be reduced, but I don't want to start too many threads, because each one probably uses a nontrivial amount of resources and I don't want to overburden the CI workers.
Tested almost-ACK. Thanks for adding the test! Runs quickly, but one of the Travis jobs failed at assert got_exceeded_error
so it may need to be more reliable. Some suggested changes below: add a line of logging, attempt to remove perhaps-unneeded breaks, and PEP8 suggestions from pycodestyle
and black
-- feel free to ignore the style nits. Tested that the test fails on master and passes on this branch.
some suggested edits to the test
index e0e6fc57aa..cdbed53dd6 100755
--- a/test/functional/interface_rpc.py
+++ b/test/functional/interface_rpc.py
@@ -11,6 +11,7 @@ from test_framework.util import assert_equal, assert_greater_than_or_equal
from threading import Thread
import subprocess
+
def expect_http_status(expected_http_status, expected_rpc_code,
fcn, *args):
try:
@@ -20,7 +21,10 @@ def expect_http_status(expected_http_status, expected_rpc_code,
assert_equal(exc.error["code"], expected_rpc_code)
assert_equal(exc.http_status, expected_http_status)
+
got_exceeded_error = False
+
+
def test_work_queue_getblock(node):
global got_exceeded_error
for _ in range(400):
@@ -29,10 +33,9 @@ def test_work_queue_getblock(node):
try:
node.cli('getrpcinfo').send_cli()
except subprocess.CalledProcessError as e:
- if e.output != 'error: Server response: Work queue depth exceeded\n':
- raise AssertionError("Unexpected cli error: :{}:".format(e.output))
+ assert_equal(e.output, 'error: Server response: Work queue depth exceeded\n')
got_exceeded_error = True
- break
+
class RPCInterfaceTest(BitcoinTestFramework):
def set_test_params(self):
@@ -84,19 +87,18 @@ class RPCInterfaceTest(BitcoinTestFramework):
expect_http_status(500, -8, self.nodes[0].getblockhash, 42)
def test_work_queue_exceeded(self):
+ self.log.info("Test work queue exceeded")
global got_exceeded_error
- self.restart_node(0,['-rpcworkqueue=1'])
+ self.restart_node(0, ['-rpcworkqueue=1'])
threads = []
for _ in range(5):
- t = Thread(target=test_work_queue_getblock, args=(self.nodes[0], ))
+ t = Thread(target=test_work_queue_getblock, args=(self.nodes[0],))
t.start()
threads.append(t)
- if got_exceeded_error:
- break
for t in threads:
t.join()
# This likely just means more threads are needed.
- assert(got_exceeded_error)
+ assert got_exceeded_error
def run_test(self):
self.test_getrpcinfo()
Concept ACK Code looks good at a glance but the test fails on Travis, so I will come back for a more in-depth review when that is fixed :) |
c067589
to
db0d460
Compare
Addressed review suggestions, force-pushed (diff), thank you. The test will run until it succeeds (there's no longer an arbitrary loop iteration count), or CI will timeout the test (which is extremely unlikely). Starting three threads should be enough, even two works, but the test runs faster with three (greater chance of exceeded the work queue depth limit).
def test_work_queue_exceeded(self): | ||
self.log.info("Test work queue exceeded") | ||
self.restart_node(0, ['-rpcworkqueue=1']) | ||
got_exceeded_error = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a list, rather than just a boolean, because a list is a mutable object, and this eliminates the need for the global variable, which seems cleaner. Is it worth a comment explaining that? I tend to think not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
Is it worth a comment explaining that? I tend to think not.
I don't think so, but it's a bit awkward for the list to be named like a boolean, maybe exceeded_errors
if you have to retouch ?
Thanks @LarryRuane for removing the global and the update. The tests appear to be timing out, though it's unclear. |
ACK db0d460, tested on Linux Mint 20 (x86_64).
https://cirrus-ci.com/task/5913926858702848 timeout seems irrelevant (could be restarted though).
test/functional/interface_rpc.py
Outdated
@@ -67,10 +80,23 @@ def test_http_status_codes(self): | |||
expect_http_status(404, -32601, self.nodes[0].invalidmethod) | |||
expect_http_status(500, -8, self.nodes[0].getblockhash, 42) | |||
|
|||
def test_work_queue_exceeded(self): | |||
self.log.info("Test work queue exceeded") | |||
self.restart_node(0, ['-rpcworkqueue=1']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would make CI testing more reliable:
self.restart_node(0, ['-rpcworkqueue=1']) | |
self.restart_node(0, ["-rpcworkqueue=1", "-rpcthreads=1"]) |
test/functional/interface_rpc.py
Outdated
@@ -67,10 +80,23 @@ def test_http_status_codes(self): | |||
expect_http_status(404, -32601, self.nodes[0].invalidmethod) | |||
expect_http_status(500, -8, self.nodes[0].getblockhash, 42) | |||
|
|||
def test_work_queue_exceeded(self): | |||
self.log.info("Test work queue exceeded") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Mind rewording to make it consistent with other log messages
self.log.info("Test work queue exceeded") | |
self.log.info("Testing work queue exceeded...") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Force-pushed to implement review suggestions.
db0d460
to
8dd5946
Compare
Travis failure seems related to #19654. |
tACK 8dd5946 Confirmed that the test fails without the fix. Ran test multiple times and seemed stable. mini-nit if you have to retouch: The commit message of the test commit is too general for our usual conventions. |
Github-Pull: bitcoin#18335 Rebased-From: 8dd5946
@luke-jr as you've incorporated this PR into Bitcoin Knots, mind refreshing your previous ACK to get things done? |
utACK 8dd5946 (no changes since previous utACK) |
@@ -18,6 +21,16 @@ def expect_http_status(expected_http_status, expected_rpc_code, | |||
assert_equal(exc.error["code"], expected_rpc_code) | |||
assert_equal(exc.http_status, expected_http_status) | |||
|
|||
|
|||
def test_work_queue_getblock(node, got_exceeded_error): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this is called queue_getblock
but queues a getrpcinfo
? Maybe queue_rpc_call
def test_work_queue_exceeded(self): | ||
self.log.info("Test work queue exceeded") | ||
self.restart_node(0, ['-rpcworkqueue=1']) | ||
got_exceeded_error = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
Is it worth a comment explaining that? I tend to think not.
I don't think so, but it's a bit awkward for the list to be named like a boolean, maybe exceeded_errors
if you have to retouch ?
@@ -269,7 +269,7 @@ static void http_request_cb(struct evhttp_request* req, void* arg) | |||
item.release(); /* if true, queue took ownership */ | |||
else { | |||
LogPrintf("WARNING: request rejected because http work queue depth exceeded, it can be increased with the -rpcworkqueue= setting\n"); | |||
item->req->WriteReply(HTTP_INTERNAL_SERVER_ERROR, "Work queue depth exceeded"); | |||
item->req->WriteReply(HTTP_SERVICE_UNAVAILABLE, "Work queue depth exceeded"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need release notes to explain the changed http status code? Just in case a user relies on this?
Just asking, you don't have to invalidate review. Release note can be added after merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I think you're right, I'll do it separately, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can add them directly to doc/release-notes.md
on the master
branch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a "needs release note" label.
51eef4a doc: Add release notes for #18335 (rpc work queue exceeded error) (Larry Ruane) Pull request description: Follow-on to #18335 ACKs for top commit: darosior: ACK 51eef4a Tree-SHA512: 863d92cb1c23493d9c8c42ed89b30ebd59092e44f159de4cefbabfe4101e7d7d40f24776ff3fcf39dedf90b45fc25845cf7a2177af38729ce2118d75c3cd779b
51eef4a doc: Add release notes for bitcoin#18335 (rpc work queue exceeded error) (Larry Ruane) Pull request description: Follow-on to bitcoin#18335 ACKs for top commit: darosior: ACK 51eef4a Tree-SHA512: 863d92cb1c23493d9c8c42ed89b30ebd59092e44f159de4cefbabfe4101e7d7d40f24776ff3fcf39dedf90b45fc25845cf7a2177af38729ce2118d75c3cd779b
If
bitcoind
is processing 16 RPC requests, attempting to submit another request usingbitcoin-cli
produces this less-than-helpful error message:error: couldn't parse reply from server
. This PR changes the error to:error: server response: Work queue depth exceeded
.