Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bitcoin-cli: print useful error if bitcoind rpc work queue exceeded #18335

Merged
merged 2 commits into from Mar 19, 2021

Conversation

LarryRuane
Copy link
Contributor

@LarryRuane LarryRuane commented Mar 12, 2020

If bitcoind is processing 16 RPC requests, attempting to submit another request using bitcoin-cli produces this less-than-helpful error message: error: couldn't parse reply from server. This PR changes the error to: error: server response: Work queue depth exceeded.

@LarryRuane
Copy link
Contributor Author

LarryRuane commented Mar 12, 2020

To reproduce the problem, start the client: bitcoind -rpcworkqueue=1. This reduces the maximum number of in-progress RPCs from 16 (DEFAULT_HTTP_WORKQUEUE) to 1. Then run:

seq 10 | while read i;do src/bitcoin-cli getblockhash 0 >/dev/null&done

This submits this trivial RPC 10 time in the background (the & character) and usually produces one or more of the error: couldn't parse reply from server errors. If it doesn't, try again, or increase the load to more than 10. You will see a reasonable error message in debug.log:

2020-03-12T17:32:28Z WARNING: request rejected because http work queue depth exceeded, it can be increased with the -rpcworkqueue= setting

But most users wouldn't think to look there, or wouldn't have access.

With this PR, the error is more understandable. The failed request in this case is retryable, so maybe that should be indicated in the text, but this text (Work queue depth exceeded) comes from bitcoind, which probably shouldn't be changed, especially since there could be other RPC clients that depend on that exact text. We could have bitcoin-cli include additional text like please try again but it doesn't really know for certain in every case that the request is retryable.

@@ -404,7 +404,7 @@ static UniValue CallRPC(BaseRequestHandler *rh, const std::string& strMethod, co
// Parse reply
UniValue valReply(UniValue::VSTR);
if (!valReply.read(response.body))
throw std::runtime_error("couldn't parse reply from server");
throw std::runtime_error(strprintf("server response: %s", response.body));
Copy link
Member

@laanwj laanwj Mar 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's strange that it ends up here. It's trying to parse an error message as JSON! Isn't there some error code (either HTTP or JSON-RPC) that could be checked for instead? If not, that seems like a bug in the server.

Copy link
Contributor Author

@LarryRuane LarryRuane Mar 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does seem wrong that my proposed fix retains the practice of trying to decode a non-JSON string as JSON, and depending on that to fail. Yes, perhaps this is a bug in the server -- should it be returning JSON in body here rather than a plain string? But as I said at the end of the comment above, I was hesitant to change the server because I'm worried that there may be other RPC clients besides bitcoin-cli that depend on the current (plain string) response. What do you think? If the fix is better in the server, I can do that.

Just for further information, the response returned in this work queue depth exceeded is (status 500 is HTTP_INTERNAL_SERVER_ERROR):

(gdb) f 0
Breakpoint 1, CallRPC (rh=0x555555845e90, strMethod="getblockhash", args=std::vector of length 1, capacity 2 = {...})
    at bitcoin-cli.cpp:406
406	    if (!valReply.read(response.body))
(gdb) p response
$2 = {
  status = 500, 
  error = -1, 
  body = "Work queue depth exceeded"
}
(gdb) 

Here's a successful reply:

(gdb) p response
$6 = {
  status = 200, 
  error = -1, 
  body = "{\"result\":\"000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f\",\"error\":null,\"id\":1}\n"
}

Here's the response when requesting a nonsense method:

(gdb) p response
$2 = {
  status = 404, 
  error = -1, 
  body = "{\"result\":null,\"error\":{\"code\":-32601,\"message\":\"Method not found\"},\"id\":1}\n"
}

As part of this PR, I could also systematically search for and fix any other places where body is returned as plain string (not JSON); so far I see only one:

req->WriteReply(HTTP_BAD_METHOD, "JSONRPC server handles only POST requests")

@LarryRuane
Copy link
Contributor Author

LarryRuane commented Mar 12, 2020

The second commit undoes the first, and fixes the problem in the server by having it return JSON in the body of the response:

error code: -32603
error message:
Work queue depth exceeded

(Here's where that error code, RPC_INTERNAL_ERROR is defined.) It makes sense that this is similar to:

$ src/bitcoin-cli not-a-valid-rpc
error code: -32601
error message:
Method not found
$ 

@laanwj
Copy link
Member

laanwj commented Mar 17, 2020

Ok, looks like his is actually non-trivial.

HTTP_INTERNAL_SERVER_ERROR

$2 = {
  status = 500, 
  error = -1, 
  body = "Work queue depth exceeded"
}

So in this case, possibly you could detect this error code in bitcoin-cli, and print an error message instead of trying to parse as JSON. Or are there cases in which it returns 500 but the response needs to be parsed as JSON?

I must admit we're sometimes inconsistent about whether to report errors at the HTTP layer or the JSON-RPC, so it's very possible that your solution of changing the server to return a JSON error is the most consistent one!

(I'm not personally very worried about clients depending on a specific exact error message text, though we might want to mention it briefly in the release notes)

@laanwj
Copy link
Member

laanwj commented Mar 27, 2020

I must admit we're sometimes inconsistent about whether to report errors at the HTTP layer or the JSON-RPC

I checked this and yes, we do return JSON errors with HTTP_INTERNAL_SERVER_ERROR:

static void JSONErrorReply(HTTPRequest* req, const UniValue& objError, const UniValue& id)
{
    // Send error reply from json-rpc error object
    int nStatus = HTTP_INTERNAL_SERVER_ERROR;
    int code = find_value(objError, "code").get_int();

    if (code == RPC_INVALID_REQUEST)
        nStatus = HTTP_BAD_REQUEST;
    else if (code == RPC_METHOD_NOT_FOUND)
        nStatus = HTTP_NOT_FOUND;

    std::string strReply = JSONRPCReply(NullUniValue, objError, id);

    req->WriteHeader("Content-Type", "application/json");
    req->WriteReply(nStatus, strReply);
}

So we have no way to distinguish "parse this as JSON?" based on just the status code.

I think the current change of changing the server to always return JSON errors is OK. I think it's slightly more elegant than doing "try to parse as JSON and print the raw text otherwise" in the client. But I don't have a strong opinion.

@hebasto
Copy link
Member

hebasto commented Apr 4, 2020

@LarryRuane
Just was noted about this PR by GH :)

If you find this approach simpler you may grab it :)

@LarryRuane LarryRuane force-pushed the cli-work-queue-error-msg branch from 912e4a5 to 4bb892c Compare Apr 4, 2020
@DrahtBot
Copy link
Contributor

DrahtBot commented Apr 4, 2020

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

No conflicts as of last run.

@LarryRuane
Copy link
Contributor Author

LarryRuane commented Apr 4, 2020

@hebasto

If you find this approach simpler you may grab it :)

Yes, your solution is much better, thank you! I tested your fix, and cherry-picked your commit to this PR (and force-push-removed my 3 commits).

luke-jr
luke-jr approved these changes May 5, 2020
Copy link
Member

@luke-jr luke-jr left a comment

utACK 4bb892c

@LarryRuane LarryRuane force-pushed the cli-work-queue-error-msg branch from 4bb892c to b5a80fa Compare Jul 12, 2020
@LarryRuane
Copy link
Contributor Author

LarryRuane commented Jul 12, 2020

Rebase (refresh)

Copy link
Member

@hebasto hebasto left a comment

self-ACK b5a80fa

@jonatack
Copy link
Member

jonatack commented Jul 17, 2020

Concept ACK. Could you please add functional tests for this in one of the test/functional/interface* files? Say, interface_bitcoin_cli.py (which has a lot of error response testing) or interface_rpc.py (which has a batch request test).

for _ in range(400):
if got_exceeded_error:
break
try:
node.cli('getrpcinfo').send_cli()
except subprocess.CalledProcessError as e:
if e.output != 'error: Server response: Work queue depth exceeded\n':
raise AssertionError("Unexpected cli error: :{}:".format(e.output))
got_exceeded_error = True
break
Copy link
Contributor Author

@LarryRuane LarryRuane Jul 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This 400 is, of course, arbitrary, as is the number of threads this test starts (5, see below). These numbers are the result of simple trial-and-error. We want each thread to do enough work that it's still running by the time all 5 threads are started. Still, the test only takes about 1.3 seconds to run on my system. Increasing the number of threads allows the iteration count here to be reduced, but I don't want to start too many threads, because each one probably uses a nontrivial amount of resources and I don't want to overburden the CI workers.

Copy link
Member

@jonatack jonatack left a comment

Tested almost-ACK. Thanks for adding the test! Runs quickly, but one of the Travis jobs failed at assert got_exceeded_error so it may need to be more reliable. Some suggested changes below: add a line of logging, attempt to remove perhaps-unneeded breaks, and PEP8 suggestions from pycodestyle and black -- feel free to ignore the style nits. Tested that the test fails on master and passes on this branch.

some suggested edits to the test

index e0e6fc57aa..cdbed53dd6 100755
--- a/test/functional/interface_rpc.py
+++ b/test/functional/interface_rpc.py
@@ -11,6 +11,7 @@ from test_framework.util import assert_equal, assert_greater_than_or_equal
 from threading import Thread
 import subprocess
 
+
 def expect_http_status(expected_http_status, expected_rpc_code,
                        fcn, *args):
     try:
@@ -20,7 +21,10 @@ def expect_http_status(expected_http_status, expected_rpc_code,
         assert_equal(exc.error["code"], expected_rpc_code)
         assert_equal(exc.http_status, expected_http_status)
 
+
 got_exceeded_error = False
+
+
 def test_work_queue_getblock(node):
     global got_exceeded_error
     for _ in range(400):
@@ -29,10 +33,9 @@ def test_work_queue_getblock(node):
         try:
             node.cli('getrpcinfo').send_cli()
         except subprocess.CalledProcessError as e:
-            if e.output != 'error: Server response: Work queue depth exceeded\n':
-                raise AssertionError("Unexpected cli error: :{}:".format(e.output))
+            assert_equal(e.output, 'error: Server response: Work queue depth exceeded\n')
             got_exceeded_error = True
-            break
+
 
 class RPCInterfaceTest(BitcoinTestFramework):
     def set_test_params(self):
@@ -84,19 +87,18 @@ class RPCInterfaceTest(BitcoinTestFramework):
         expect_http_status(500, -8, self.nodes[0].getblockhash, 42)
 
     def test_work_queue_exceeded(self):
+        self.log.info("Test work queue exceeded")
         global got_exceeded_error
-        self.restart_node(0,['-rpcworkqueue=1'])
+        self.restart_node(0, ['-rpcworkqueue=1'])
         threads = []
         for _ in range(5):
-            t = Thread(target=test_work_queue_getblock, args=(self.nodes[0], ))
+            t = Thread(target=test_work_queue_getblock, args=(self.nodes[0],))
             t.start()
             threads.append(t)
-            if got_exceeded_error:
-                break
         for t in threads:
             t.join()
         # This likely just means more threads are needed.
-        assert(got_exceeded_error)
+        assert got_exceeded_error
 
     def run_test(self):
         self.test_getrpcinfo()

@fjahr
Copy link
Contributor

fjahr commented Jul 22, 2020

Concept ACK

Code looks good at a glance but the test fails on Travis, so I will come back for a more in-depth review when that is fixed :)

@LarryRuane LarryRuane force-pushed the cli-work-queue-error-msg branch from c067589 to db0d460 Compare Jul 22, 2020
Copy link
Contributor Author

@LarryRuane LarryRuane left a comment

Addressed review suggestions, force-pushed (diff), thank you. The test will run until it succeeds (there's no longer an arbitrary loop iteration count), or CI will timeout the test (which is extremely unlikely). Starting three threads should be enough, even two works, but the test runs faster with three (greater chance of exceeded the work queue depth limit).

def test_work_queue_exceeded(self):
self.log.info("Test work queue exceeded")
self.restart_node(0, ['-rpcworkqueue=1'])
got_exceeded_error = []
Copy link
Contributor Author

@LarryRuane LarryRuane Jul 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a list, rather than just a boolean, because a list is a mutable object, and this eliminates the need for the global variable, which seems cleaner. Is it worth a comment explaining that? I tend to think not.

Copy link
Member

@darosior darosior Oct 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Is it worth a comment explaining that? I tend to think not.

I don't think so, but it's a bit awkward for the list to be named like a boolean, maybe exceeded_errors if you have to retouch ?

@jonatack
Copy link
Member

jonatack commented Jul 24, 2020

Thanks @LarryRuane for removing the global and the update. The tests appear to be timing out, though it's unclear.

hebasto
hebasto approved these changes Aug 4, 2020
Copy link
Member

@hebasto hebasto left a comment

ACK db0d460, tested on Linux Mint 20 (x86_64).

https://cirrus-ci.com/task/5913926858702848 timeout seems irrelevant (could be restarted though).

@@ -67,10 +80,23 @@ def test_http_status_codes(self):
expect_http_status(404, -32601, self.nodes[0].invalidmethod)
expect_http_status(500, -8, self.nodes[0].getblockhash, 42)

def test_work_queue_exceeded(self):
self.log.info("Test work queue exceeded")
self.restart_node(0, ['-rpcworkqueue=1'])
Copy link
Member

@hebasto hebasto Aug 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would make CI testing more reliable:

Suggested change
self.restart_node(0, ['-rpcworkqueue=1'])
self.restart_node(0, ["-rpcworkqueue=1", "-rpcthreads=1"])

@@ -67,10 +80,23 @@ def test_http_status_codes(self):
expect_http_status(404, -32601, self.nodes[0].invalidmethod)
expect_http_status(500, -8, self.nodes[0].getblockhash, 42)

def test_work_queue_exceeded(self):
self.log.info("Test work queue exceeded")
Copy link
Member

@hebasto hebasto Aug 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Mind rewording to make it consistent with other log messages

Suggested change
self.log.info("Test work queue exceeded")
self.log.info("Testing work queue exceeded...")

Copy link
Contributor Author

@LarryRuane LarryRuane Aug 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Force-pushed to implement review suggestions.

@LarryRuane LarryRuane force-pushed the cli-work-queue-error-msg branch from db0d460 to 8dd5946 Compare Aug 4, 2020
@hebasto
Copy link
Member

hebasto commented Aug 4, 2020

Travis failure seems related to #19654.

hebasto
hebasto approved these changes Aug 4, 2020
Copy link
Member

@hebasto hebasto left a comment

re-ACK 8dd5946, only suggested changes since my previous review.

@fjahr
Copy link
Contributor

fjahr commented Aug 5, 2020

tACK 8dd5946

Confirmed that the test fails without the fix. Ran test multiple times and seemed stable.

mini-nit if you have to retouch: The commit message of the test commit is too general for our usual conventions.

luke-jr pushed a commit to bitcoinknots/bitcoin that referenced this issue Aug 15, 2020
Github-Pull: bitcoin#18335
Rebased-From: 8dd5946
@hebasto
Copy link
Member

hebasto commented Aug 22, 2020

@luke-jr as you've incorporated this PR into Bitcoin Knots, mind refreshing your previous ACK to get things done?

@luke-jr
Copy link
Member

luke-jr commented Oct 13, 2020

utACK 8dd5946 (no changes since previous utACK)

Copy link
Member

@darosior darosior left a comment

ACK 8dd5946

Thanks for doing this, going to be helpful downstream :-)
Left a few nits only if you have to retouch

@@ -18,6 +21,16 @@ def expect_http_status(expected_http_status, expected_rpc_code,
assert_equal(exc.error["code"], expected_rpc_code)
assert_equal(exc.http_status, expected_http_status)


def test_work_queue_getblock(node, got_exceeded_error):
Copy link
Member

@darosior darosior Oct 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this is called queue_getblock but queues a getrpcinfo ? Maybe queue_rpc_call

def test_work_queue_exceeded(self):
self.log.info("Test work queue exceeded")
self.restart_node(0, ['-rpcworkqueue=1'])
got_exceeded_error = []
Copy link
Member

@darosior darosior Oct 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Is it worth a comment explaining that? I tend to think not.

I don't think so, but it's a bit awkward for the list to be named like a boolean, maybe exceeded_errors if you have to retouch ?

@@ -269,7 +269,7 @@ static void http_request_cb(struct evhttp_request* req, void* arg)
item.release(); /* if true, queue took ownership */
else {
LogPrintf("WARNING: request rejected because http work queue depth exceeded, it can be increased with the -rpcworkqueue= setting\n");
item->req->WriteReply(HTTP_INTERNAL_SERVER_ERROR, "Work queue depth exceeded");
item->req->WriteReply(HTTP_SERVICE_UNAVAILABLE, "Work queue depth exceeded");
Copy link
Member

@MarcoFalke MarcoFalke Mar 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need release notes to explain the changed http status code? Just in case a user relies on this?

Just asking, you don't have to invalidate review. Release note can be added after merge.

Copy link
Contributor Author

@LarryRuane LarryRuane Mar 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I think you're right, I'll do it separately, thanks.

Copy link
Member

@MarcoFalke MarcoFalke Mar 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can add them directly to doc/release-notes.md on the master branch

Copy link
Member

@jonatack jonatack Mar 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a "needs release note" label.

@MarcoFalke MarcoFalke merged commit 3530d5d into bitcoin:master Mar 19, 2021
4 checks passed
LarryRuane added a commit to LarryRuane/bitcoin that referenced this issue Mar 20, 2021
sidhujag pushed a commit to syscoin/syscoin that referenced this issue Mar 20, 2021
LarryRuane added a commit to LarryRuane/bitcoin that referenced this issue Mar 22, 2021
LarryRuane added a commit to LarryRuane/bitcoin that referenced this issue Mar 22, 2021
LarryRuane added a commit to LarryRuane/bitcoin that referenced this issue Mar 23, 2021
MarcoFalke added a commit that referenced this issue Mar 29, 2021
51eef4a doc: Add release notes for #18335 (rpc work queue exceeded error) (Larry Ruane)

Pull request description:

  Follow-on to #18335

ACKs for top commit:
  darosior:
    ACK 51eef4a

Tree-SHA512: 863d92cb1c23493d9c8c42ed89b30ebd59092e44f159de4cefbabfe4101e7d7d40f24776ff3fcf39dedf90b45fc25845cf7a2177af38729ce2118d75c3cd779b
sidhujag pushed a commit to syscoin/syscoin that referenced this issue Mar 29, 2021
51eef4a doc: Add release notes for bitcoin#18335 (rpc work queue exceeded error) (Larry Ruane)

Pull request description:

  Follow-on to bitcoin#18335

ACKs for top commit:
  darosior:
    ACK 51eef4a

Tree-SHA512: 863d92cb1c23493d9c8c42ed89b30ebd59092e44f159de4cefbabfe4101e7d7d40f24776ff3fcf39dedf90b45fc25845cf7a2177af38729ce2118d75c3cd779b
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Aug 16, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants