Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

http: Fix HTTP server shutdown #14670

Merged
merged 6 commits into from Dec 6, 2018

Conversation

Projects
None yet
8 participants
@promag
Copy link
Member

commented Nov 6, 2018

Fixes #11777. Reverts #11006. Replaces #13501.

With this change the HTTP server will exit gracefully, meaning that all requests will finish processing and sending the response, even if this means to wait more than 2 seconds (current time allowed to exit the event loop).

Another small change is that connections are accepted even when the server is stopping, but HTTP requests are rejected. This can be improved later, especially if chunked replies are implemented.

Briefly, before this PR, this is the order or events when a request arrives (RPC stop):

  1. bufferevent_disable(..., EV_READ)
  2. StartShutdown()
  3. evhttp_del_accept_socket(...)
  4. ThreadHTTP terminates (event loop exits) because there are no active or pending events thanks to 1. and 3.
  5. client doesn't get the response thanks to 4.

This can be verified by applying

     // Event loop will exit after current HTTP requests have been handled, so
     // this reply will get back to the client.
     StartShutdown();
+    MilliSleep(2000);
     return "Bitcoin server stopping";
 }

and checking the log output:

    Received a POST request for / from 127.0.0.1:62443
    ThreadRPCServer method=stop user=__cookie__
    Interrupting HTTP server
**  Exited http event loop
    Interrupting HTTP RPC server
    Interrupting RPC
    tor: Thread interrupt
    Shutdown: In progress...
    torcontrol thread exit
    Stopping HTTP RPC server
    addcon thread exit
    opencon thread exit
    Unregistering HTTP handler for / (exactmatch 1)
    Unregistering HTTP handler for /wallet/ (exactmatch 0)
    Stopping RPC
    RPC stopped.
    Stopping HTTP server
    Waiting for HTTP worker threads to exit
    msghand thread exit
    net thread exit

    ... sleep 2 seconds ...

    Waiting for HTTP event thread to exit
    Stopped HTTP server

For this reason point 3. is moved right after all HTTP workers quit. In that moment HTTP replies are queued in the event loop which keeps spinning util all connections are closed. In order to trigger the server side close with keep alive connections (implicit in HTTP/1.1) the header Connection: close is sent if shutdown was requested. This can be tested by

bitcoind -regtest
nc localhost 18443
POST / HTTP/1.1
Authorization: Basic ...
Content-Type: application/json
Content-Length: 44

{"jsonrpc": "2.0","method":"stop","id":123}

Summing up, this PR:

  • removes explicit event loop exit — event loop exits once there are no active or pending events
  • changes the moment the listening sockets are removed — explained above
  • sends header Connection: close on active requests when shutdown was requested which is relevant when it's a persistent connection (default in HTTP 1.1) — libevent is aware of this header and closes the connection gracefully
  • removes event loop explicit break after 2 seconds timeout
@jnewbery

This comment has been minimized.

Copy link
Member

commented Nov 6, 2018

@promag can you explain how this is different from #13501 and what the reasoning was for closing that PR?

@MarcoFalke

This comment has been minimized.

Copy link
Member

commented Nov 6, 2018

This is now the trillionth attempt in fixing the issue. Could you provide some rationale how this is different from the other approaches and why it should be preferred over the other ones?

@ryanofsky

This comment has been minimized.

Copy link
Contributor

commented Nov 6, 2018

This is simpler than #13501, and if it works I think it is preferable. Instead of trying to wait for all reply data to be sent before stopping the event loop and calling evhttp_free, it lets evhttp_free worry about this and stop the event loop itself. Assuming http_free does the right thing, this is a cleaner and less invasive.

I am curious what exactly the behavior of evhttp_free is. For example if we are in the middle of sending a huge reply to a slow client, and the server is shut down, how long does this wait (if it waits at all) to send all the data before shutting down? IIUC #13501 would wait as long as it took, and currently we wait for 2 seconds.

@ryanofsky

This comment has been minimized.

Copy link
Contributor

commented Nov 6, 2018

It seems like evhttp_free just immediately calls shutdown and close on all open connections, so I would expect this change not to always fix #11777.

https://github.com/libevent/libevent/blob/9afe7a6c12c112a6cbc49cf1bd684781275b2579/http.c#L1243
https://github.com/libevent/libevent/blob/9afe7a6c12c112a6cbc49cf1bd684781275b2579/http.c#L3721

But maybe the WriteHeader("Connection", "close") call added to HTTPRequest::WriteReply improves things in some way, so even if there is a race condition, it won't trigger in practice?

@DrahtBot

This comment has been minimized.

Copy link
Contributor

commented Nov 6, 2018

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #14726 (Use RPCHelpMan for all RPCs by MarcoFalke)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@promag promag force-pushed the promag:2018-11-httpserver branch 3 times, most recently Nov 6, 2018

@promag

This comment has been minimized.

Copy link
Member Author

commented Nov 7, 2018

@ryanofsky right, can't call evhttp_free with ongoing requests.

Updated code and PR description, should be more clear of the current problem and the approach used.

@promag promag force-pushed the promag:2018-11-httpserver branch 2 times, most recently Nov 7, 2018

@promag

This comment has been minimized.

Copy link
Member Author

commented Nov 7, 2018

Split in multiple commits for easy review.

It can be arguable to remove the timeout but IMO it's safe because:

  • if an HTTP worker is with a long task then calling event_base_loopbreak is no good
  • active connections will timeout and close if no request/response happens (30 seconds by default).

This can be observed by:

bitcoind -regtest
time nc localhost 18443
bitcoin-cli -regtest stop
nc localhost 18443  0.00s user 0.01s system 0% cpu 30.025 total
                                                   ^^^^^^

The only way to improve this in a correct way is to lower the connection timeout (that I know of)

@ken2812221
Copy link
Member

left a comment

Concept ACK.

Show resolved Hide resolved src/httpserver.cpp
@ryanofsky

This comment has been minimized.

Copy link
Contributor

commented Nov 7, 2018

Can you update the PR description to describe the change in external behavior instead of diving immediately into details of the code?

It would also help if each individual commit stated how it changes behavior, and why the change makes sense by itself or in conjunction with other changes.

@promag

This comment has been minimized.

Copy link
Member Author

commented Nov 7, 2018

describe the change in external behavior

@ryanofsky sorry I don't understand what you mean with external behavior.

and why the change makes sense by itself or in conjunction with other changes

I think I did?

It would also help if each individual commit stated how it changes behavior

Will do.

@MarcoFalke

This comment has been minimized.

Copy link
Member

commented Nov 7, 2018

I think external behaviour means the behaviour users or json-rpc-wrapper libraries see.

@ryanofsky

This comment has been minimized.

Copy link
Contributor

commented Nov 7, 2018

What will happen if bitcoin is in the middle of sending a large response to a slow client and it gets shutdown? Will it try to send as much as of the reply as possible but give up after a timeout? Will it stop sending immediately? Will it wait forever? Previous versions of this PR have done a bunch of different things, so what externally observable changes is this PR is intending to make? I think this would be the most useful information to know and include in the description.

@promag promag force-pushed the promag:2018-11-httpserver branch 2 times, most recently Nov 7, 2018

@promag

This comment has been minimized.

Copy link
Member Author

commented Nov 7, 2018

@ryanofsky still testing but I believe it waits until HTTP request finishes or timeouts due to inactivity on the connection. In detail this means:

  • completes reading the request headers and body
  • worker computes full response and adds event to send the response
  • the response is sent and then the connection is finally closed
  • at this point, if it's the last connection, the event loop has no registered events and exists.

Updated commits and PR description. Added @ken2812221 suggestion.

@promag promag force-pushed the promag:2018-11-httpserver branch Nov 8, 2018

@laanwj laanwj added this to Blockers in High-priority for review Nov 8, 2018

@laanwj

This comment has been minimized.

Copy link
Member

commented Nov 20, 2018

Can you update the PR description to describe the change in external behavior instead of diving immediately into details of the code?

Also, can we please add a test that fails without this, but passes with it? That's the best way to "prove" that this fixes something.

@promag promag force-pushed the promag:2018-11-httpserver branch 3 times, most recently Nov 20, 2018

@promag

This comment has been minimized.

Copy link
Member Author

commented Nov 20, 2018

  • ef7c54c1e - rpc: Add wait argument to stop
  • 8589891ea - http: Send "Connection: close" header if shutdown is requested
  • e10898267 - http: Unlisten sockets after all workers quit
  • b1f241a0f - http: Remove unnecessary event_base_loopexit call
  • a72fbc8aa - http: Remove timeout to exit event loop
  • c76cde3c9 - qa: Test shutdown

@laanwj if you build only the marked commits then test test/functional/feature_shutdown.py fails.

@promag promag force-pushed the promag:2018-11-httpserver branch Nov 20, 2018

Show resolved Hide resolved test/functional/feature_shutdown.py Outdated
@promag

This comment has been minimized.

Copy link
Member Author

commented Nov 22, 2018

Sorry @ken2812221, just pushed what @laanwj suggested. I'll squash later.

@promag promag force-pushed the promag:2018-11-httpserver branch Nov 23, 2018

@promag

This comment has been minimized.

Copy link
Member Author

commented Nov 23, 2018

Rebased.

@DrahtBot DrahtBot removed the Needs rebase label Nov 23, 2018

Show resolved Hide resolved src/rpc/server.cpp Outdated

promag added some commits Nov 7, 2018

http: Send "Connection: close" header if shutdown is requested
Sending the header "Connection: close" makes libevent close persistent
connections (implicit with HTTP 1.1) which cleans the event base when
shutdown is requested.
http: Unlisten sockets after all workers quit
This (almost) move only ensures the event base loop doesn't exit before
HTTP worker threads exit. This way events registered by HTTP workers are
processed and not discarded.
http: Remove unnecessary event_base_loopexit call
Let event base loop exit cleanly by processing all active and pending
events. The call is no longer necessary because closing persistent
connections is now properly handled.
http: Remove timeout to exit event loop
Let HTTP connections to timeout due to inactivity.
Let all remaning connections finish sending the response and close.

@promag promag force-pushed the promag:2018-11-httpserver branch to 28479f9 Nov 23, 2018

@ken2812221
Copy link
Member

left a comment

tACK 28479f9

@laanwj

This comment has been minimized.

Copy link
Member

commented Dec 6, 2018

utACK 28479f9

1 similar comment
@laanwj

This comment has been minimized.

Copy link
Member

commented Dec 6, 2018

utACK 28479f9

@laanwj laanwj merged commit 28479f9 into bitcoin:master Dec 6, 2018

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

laanwj added a commit that referenced this pull request Dec 6, 2018

Merge #14670: http: Fix HTTP server shutdown
28479f9 qa: Test bitcond shutdown (João Barbosa)
8d3f46e http: Remove timeout to exit event loop (João Barbosa)
e98a9ee http: Remove unnecessary event_base_loopexit call (João Barbosa)
6b13580 http: Unlisten sockets after all workers quit (João Barbosa)
18e9685 http: Send "Connection: close" header if shutdown is requested (João Barbosa)
02e1e4e rpc: Add wait argument to stop (João Barbosa)

Pull request description:

  Fixes #11777. Reverts #11006. Replaces #13501.

  With this change the HTTP server will exit gracefully, meaning that all requests will finish processing and sending the response, even if this means to wait more than 2 seconds (current time allowed to exit the event loop).

  Another small change is that connections are accepted even when the server is stopping, but HTTP requests are rejected. This can be improved later, especially if chunked replies are implemented.

  Briefly, before this PR, this is the order or events when a request arrives (RPC `stop`):
   1. `bufferevent_disable(..., EV_READ)`
   2. `StartShutdown()`
   3. `evhttp_del_accept_socket(...)`
   4. `ThreadHTTP` terminates (event loop exits) because there are no active or pending events thanks to 1. and 3.
   5. client doesn't get the response thanks to 4.

  This can be verified by applying
  ```diff
       // Event loop will exit after current HTTP requests have been handled, so
       // this reply will get back to the client.
       StartShutdown();
  +    MilliSleep(2000);
       return "Bitcoin server stopping";
   }
  ```
  and checking the log output:
  ```
      Received a POST request for / from 127.0.0.1:62443
      ThreadRPCServer method=stop user=__cookie__
      Interrupting HTTP server
  **  Exited http event loop
      Interrupting HTTP RPC server
      Interrupting RPC
      tor: Thread interrupt
      Shutdown: In progress...
      torcontrol thread exit
      Stopping HTTP RPC server
      addcon thread exit
      opencon thread exit
      Unregistering HTTP handler for / (exactmatch 1)
      Unregistering HTTP handler for /wallet/ (exactmatch 0)
      Stopping RPC
      RPC stopped.
      Stopping HTTP server
      Waiting for HTTP worker threads to exit
      msghand thread exit
      net thread exit

      ... sleep 2 seconds ...

      Waiting for HTTP event thread to exit
      Stopped HTTP server
  ```

  For this reason point 3. is moved right after all HTTP workers quit. In that moment HTTP replies are queued in the event loop which keeps spinning util all connections are closed. In order to trigger the server side close with keep alive connections (implicit in HTTP/1.1) the header `Connection: close` is sent if shutdown was requested. This can be tested by
  ```
  bitcoind -regtest
  nc localhost 18443
  POST / HTTP/1.1
  Authorization: Basic ...
  Content-Type: application/json
  Content-Length: 44

  {"jsonrpc": "2.0","method":"stop","id":123}
  ```

  Summing up, this PR:
   - removes explicit event loop exit — event loop exits once there are no active or pending events
   - changes the moment the listening sockets are removed — explained above
   - sends header `Connection: close` on active requests when shutdown was requested which is relevant when it's a persistent connection (default in HTTP 1.1) — libevent is aware of this header and closes the connection gracefully
   - removes event loop explicit break after 2 seconds timeout

Tree-SHA512: 4dac1e86abe388697c1e2dedbf31fb36a394cfafe5e64eadbf6ed01d829542785a8c3b91d1ab680d3f03f912d14fc87176428041141441d25dcb6c98a1e069d8

@jnewbery jnewbery removed this from Blockers in High-priority for review Dec 6, 2018

@ken2812221

This comment has been minimized.

Copy link
Member

commented Dec 14, 2018

Is this another HTTP server bug?
https://travis-ci.org/bitcoin/bitcoin/jobs/467883032

stdout:
2018-12-14T08:34:22.452000Z TestFramework (INFO): Initializing test directory /tmp/test_runner_₿_🏃_20181214_082355/feature_shutdown_0
2018-12-14T08:34:24.197000Z TestFramework (INFO): Stopping nodes
2018-12-14T08:34:24.197000Z TestFramework (INFO): Cleaning up /tmp/test_runner_₿_🏃_20181214_082355/feature_shutdown_0 on exit
2018-12-14T08:34:24.197000Z TestFramework (INFO): Tests successful
stderr:
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/travis/build/bitcoin/bitcoin/build/bitcoin-x86_64-unknown-linux-gnu/test/functional/feature_shutdown.py", line 12, in test_long_call
    block = node.waitfornewblock()
  File "/home/travis/build/bitcoin/bitcoin/build/bitcoin-x86_64-unknown-linux-gnu/test/functional/test_framework/coverage.py", line 47, in __call__
    return_val = self.auth_service_proxy_instance.__call__(*args, **kwargs)
  File "/home/travis/build/bitcoin/bitcoin/build/bitcoin-x86_64-unknown-linux-gnu/test/functional/test_framework/authproxy.py", line 134, in __call__
    response = self._request('POST', self.__url.path, postdata.encode('utf-8'))
  File "/home/travis/build/bitcoin/bitcoin/build/bitcoin-x86_64-unknown-linux-gnu/test/functional/test_framework/authproxy.py", line 105, in _request
    return self._get_response()
  File "/home/travis/build/bitcoin/bitcoin/build/bitcoin-x86_64-unknown-linux-gnu/test/functional/test_framework/authproxy.py", line 166, in _get_response
    'code': -342, 'message': 'non-JSON HTTP response with \'%i %s\' from server' % (http_response.status, http_response.reason)})
test_framework.authproxy.JSONRPCException: non-JSON HTTP response with '503 Service Unavailable' from server (-342)
@promag

This comment has been minimized.

Copy link
Member Author

commented Dec 14, 2018

At least it doesn't look. Looks more a race between the thread that runs test_long_call and stop_node — if stop_node runs before then RPC would be unavailable. It seems unlikely because of the 1 second wait but it should be possible considering the probable load.

@promag

This comment has been minimized.

Copy link
Member Author

commented Dec 14, 2018

Tested the change:

diff --git a/test/functional/feature_shutdown.py b/test/functional/feature_shutdown.py
index 5a0c88546..b0df81736 100755
--- a/test/functional/feature_shutdown.py
+++ b/test/functional/feature_shutdown.py
@@ -7,8 +7,11 @@
 from test_framework.test_framework import BitcoinTestFramework
 from test_framework.util import assert_equal, get_rpc_proxy
 from threading import Thread
+from time import sleep
+

 def test_long_call(node):
+    sleep(1)
     block = node.waitfornewblock()
     assert_equal(block['height'], 0)

which gives the following error:

ConnectionRefusedError: [Errno 61] Connection refused

Then on top of that:

@@ -20,6 +23,8 @@ class ShutdownTest(BitcoinTestFramework):

     def run_test(self):
         node = get_rpc_proxy(self.nodes[0].url, 1, timeout=600, coveragedir=self.nodes[0].coverage_dir)
+        # force connection establishment
+        node.getblockcount()

the error is now

test_framework.authproxy.JSONRPCException: non-JSON HTTP response with '503 Service Unavailable' from server (-342)

This could be easily fixed if we had a way to inspect how many RPC workers are running, for instance, by having a getrpcinfo.

Meanwhile this can be "prevented" by adding some relaxation:

@@ -26,6 +26,7 @@ class ShutdownTest(BitcoinTestFramework):
         # force connection establishment
         node.getblockcount()
         Thread(target=test_long_call, args=(node,)).start()
+        sleep(1)

@promag promag deleted the promag:2018-11-httpserver branch Dec 15, 2018

laanwj added a commit that referenced this pull request Jan 16, 2019

Merge #14958: qa: Remove race between connecting and shutdown on sepa…
…rate connections

4412a59 qa: Remove race between connecting and shutdown on separate connections (João Barbosa)

Pull request description:

  Fixes the error #14670 (comment) reported by @ken2812221.

  There is a race between RPC stop and another concurrent call in the test framework. The connection must be established and the command `waitfornewblock` running before calling `stop`.

  See also #14670 (comment).

Tree-SHA512: 77feb8628d3b9c025ec0cf83565d4d6680cad4fb182fc93a65df8b573f3e799ba4c44e06d9001dd8a375ca0b1ee17f10e66c3902b6256d0ae2acbc64539185d7

scravy added a commit to dtr-org/unit-e that referenced this pull request Feb 28, 2019

Merge pull request #690 from dsaveliev/fix-node-stop-rpc-call
Ignore disconnects when stopping node - fixes #367
This is a port of bitcoin/bitcoin#14670

This problem is related to the "stop" RPC call.
It turns out that sometimes the node stops before it sends the response.
That causes the client to retry the call and lead to "ConnectionRefusedError".

I found out that there is a fix in bitcoin repo (in master branch, actually), related to this bug.
In the nutshell, this changes removes forced exit of libevent loop, which allows
to process all the requests gracefully.

Test feature_shutdown.py wasn't ported because it introduces a bug bitcoin/bitcoin#14670 (comment)
The "improvement" bitcoin/bitcoin#14958 requires RPC method "getrpcinfo" which isn't ported from the bitcoin codebase yet.

There are other floating tests with a similar error (ConnectionRefusedError):
fixes #373
fixes #484
fixes #544

Signed-off-by: Dmitry Saveliev <dima@thirdhash.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.