http: speed up shutdown #6990

Merged
merged 1 commit into from Nov 13, 2015

Conversation

Projects
None yet
6 participants
@laanwj
Member

laanwj commented Nov 11, 2015

Pardon a bit of iteration. This continues/fixes #6719.

event_base_loopexit was not doing what I expected it to. What I expected was that it sets a timeout, given that no other pending events it would exit in N seconds. However, what it does was delay the event loop exit with N seconds, even if nothing is pending.

Solve it in a different way: give the event loop thread time to exit out of itself, and if it doesn't, send loopbreak (and then join it).

This speeds up the RPC tests a lot, each exit incurred a 10 second overhead, with this change there should be no shutdown overhead in the common case and up to two seconds if the event loop is blocking.

As a bonus this breaks dependency on the global boost::thread_group, as the HTTP server minds its own offspring.

Time for rpctests with this patch:

real    7m50.215s
user    0m30.152s
sys     0m11.752s

Without:

real    13m12.069s
user    0m30.412s
sys     0m11.168s

@laanwj laanwj added the RPC/REST/ZMQ label Nov 11, 2015

@dcousens

This comment has been minimized.

Show comment
Hide comment
@dcousens

dcousens Nov 11, 2015

Contributor

event_base_loopbreak was not doing what I expected it to. What I expected was that it sets a timeout, given that no other pending events it would exit in N seconds. However, what it does was delay the event loop exit with N seconds, even if nothing is pending.

Sounds like you wanted event_base_loopexit? Ref:

The next event_base_loop() iteration after the given timer expires will complete normally (handling all queued events) then exit without blocking for events again.

In any case, event_base_loopbreak should still exit immediately, stopping any loops after the next event:

event_base_loop() will abort the loop after the next event is completed; event_base_loopbreak() is typically invoked from this event's callback. This behavior is analogous to the "break;" statement.

Are you sure this isn't symptomatic of something else?

Reference: http://www.wangafu.net/~nickm/libevent-2.0/doxygen/html/event_8h.html#a07a7599e478e4031fa8cf52e26d8aa1e

edit: Right, never mind, you were using event_base_loopexit, hence the confusion. You're OP (before it was edited) makes it sound like you switched from event_base_loopbreak.

Contributor

dcousens commented Nov 11, 2015

event_base_loopbreak was not doing what I expected it to. What I expected was that it sets a timeout, given that no other pending events it would exit in N seconds. However, what it does was delay the event loop exit with N seconds, even if nothing is pending.

Sounds like you wanted event_base_loopexit? Ref:

The next event_base_loop() iteration after the given timer expires will complete normally (handling all queued events) then exit without blocking for events again.

In any case, event_base_loopbreak should still exit immediately, stopping any loops after the next event:

event_base_loop() will abort the loop after the next event is completed; event_base_loopbreak() is typically invoked from this event's callback. This behavior is analogous to the "break;" statement.

Are you sure this isn't symptomatic of something else?

Reference: http://www.wangafu.net/~nickm/libevent-2.0/doxygen/html/event_8h.html#a07a7599e478e4031fa8cf52e26d8aa1e

edit: Right, never mind, you were using event_base_loopexit, hence the confusion. You're OP (before it was edited) makes it sound like you switched from event_base_loopbreak.

src/httpserver.cpp
+ if (eventBase) {
+ LogPrint("http", "Waiting for HTTP event thread to exit\n");
+ // Give event loop a few seconds to exit (to send back last RPC responses), then break it
+ if (!threadHTTP.try_join_for(boost::chrono::milliseconds(2000))) {

This comment has been minimized.

@dcousens

dcousens Nov 11, 2015

Contributor

Out of interest, why do we have to wait [a hard coded timeout]? Why can't we wait for all events to finish (and maybe just ignore all new incoming connections?)

@dcousens

dcousens Nov 11, 2015

Contributor

Out of interest, why do we have to wait [a hard coded timeout]? Why can't we wait for all events to finish (and maybe just ignore all new incoming connections?)

This comment has been minimized.

@MarcoFalke

MarcoFalke Nov 11, 2015

Member

all events to finish

may take a long time (minutes) depending what was going in in the past?

@MarcoFalke

MarcoFalke Nov 11, 2015

Member

all events to finish

may take a long time (minutes) depending what was going in in the past?

This comment has been minimized.

@dcousens

dcousens Nov 11, 2015

Contributor

@MarcoFalke but this doesn't stop new connections (or is that done elsewhere?) in that waiting time. So conceptually, the status quo could be the same?
That is, new connections coming in, old ones still finishing.

@dcousens

dcousens Nov 11, 2015

Contributor

@MarcoFalke but this doesn't stop new connections (or is that done elsewhere?) in that waiting time. So conceptually, the status quo could be the same?
That is, new connections coming in, old ones still finishing.

This comment has been minimized.

@theuni

theuni Nov 12, 2015

Member

New connections are stopped just above, see where the accept sockets are torn down.

Though it does look like we should close connected sockets in http_reject_request_cb as well for good measure. @laanwj Is there a reason to keep them open once they've had a req rejected?

@theuni

theuni Nov 12, 2015

Member

New connections are stopped just above, see where the accept sockets are torn down.

Though it does look like we should close connected sockets in http_reject_request_cb as well for good measure. @laanwj Is there a reason to keep them open once they've had a req rejected?

This comment has been minimized.

@laanwj

laanwj Nov 12, 2015

Member
  • Waiting for all events to finish could take forever. There's tons of ways that someone could keep a connection option by slowly sending data. Also, timing events (such as the walletpassphrase timeout) may hold up the event queue.
  • Just breaking off everything in progress immediately prevents the answer to stop from getting back. We could just declare this a feature of course and be done with it... But it was a race condition, sometimes you would get a reply, sometimes not. We could define stop to cease all RPC activity immediately. But I think I like this better.
  • There AFAIK is no way to ask libevent, nor libevent_http if the events in the queue are important enough to wait for or not. And specially tagging the connection that sends 'stop' is quite ugly, not sure if even possible without diving into internal data structures.

For context read #6719.

I'm not 100% happy with this solution either, but I wasn't able to find another way to do it, and it's better than the previous one.

@cfields sure, you can close the connection afterwards there, but that doesn't solve this issue.

@laanwj

laanwj Nov 12, 2015

Member
  • Waiting for all events to finish could take forever. There's tons of ways that someone could keep a connection option by slowly sending data. Also, timing events (such as the walletpassphrase timeout) may hold up the event queue.
  • Just breaking off everything in progress immediately prevents the answer to stop from getting back. We could just declare this a feature of course and be done with it... But it was a race condition, sometimes you would get a reply, sometimes not. We could define stop to cease all RPC activity immediately. But I think I like this better.
  • There AFAIK is no way to ask libevent, nor libevent_http if the events in the queue are important enough to wait for or not. And specially tagging the connection that sends 'stop' is quite ugly, not sure if even possible without diving into internal data structures.

For context read #6719.

I'm not 100% happy with this solution either, but I wasn't able to find another way to do it, and it's better than the previous one.

@cfields sure, you can close the connection afterwards there, but that doesn't solve this issue.

This comment has been minimized.

@laanwj

laanwj Nov 12, 2015

Member

@MarcoFalke but this doesn't stop new connections (or is that done elsewhere?) in that waiting time. So conceptually, the status quo could be the same?

It does:

        // Unlisten sockets
        BOOST_FOREACH (evhttp_bound_socket *socket, boundSockets) {
            evhttp_del_accept_socket(eventHTTP, socket);
        }
        // Reject requests on current connections
        evhttp_set_gencb(eventHTTP, http_reject_request_cb, NULL);

This closes the port and makes new requests error (so that no new work items are created). But that's best-effort. Current connections could still do e.g. a slowloris attack. The two seconds are a belt and suspenders, nothing more.

@laanwj

laanwj Nov 12, 2015

Member

@MarcoFalke but this doesn't stop new connections (or is that done elsewhere?) in that waiting time. So conceptually, the status quo could be the same?

It does:

        // Unlisten sockets
        BOOST_FOREACH (evhttp_bound_socket *socket, boundSockets) {
            evhttp_del_accept_socket(eventHTTP, socket);
        }
        // Reject requests on current connections
        evhttp_set_gencb(eventHTTP, http_reject_request_cb, NULL);

This closes the port and makes new requests error (so that no new work items are created). But that's best-effort. Current connections could still do e.g. a slowloris attack. The two seconds are a belt and suspenders, nothing more.

@dcousens

This comment has been minimized.

Show comment
Hide comment
@dcousens

dcousens Nov 11, 2015

Contributor

utACK

Contributor

dcousens commented Nov 11, 2015

utACK

@jonasschnelli

This comment has been minimized.

Show comment
Hide comment
@jonasschnelli

jonasschnelli Nov 12, 2015

Member

utACK.

Member

jonasschnelli commented Nov 12, 2015

utACK.

@laanwj

This comment has been minimized.

Show comment
Hide comment
@laanwj

laanwj Nov 12, 2015

Member

Some more information: it appears that with libevent master, this issue doesn't exist.
event_base_loopexit does actually do what I expect it to in the OP. As that is the version I'm normally using, that is why I didn't notice before.

Edit: so I think we should still do this for backward compat, but at least I'm not crazy :)

Member

laanwj commented Nov 12, 2015

Some more information: it appears that with libevent master, this issue doesn't exist.
event_base_loopexit does actually do what I expect it to in the OP. As that is the version I'm normally using, that is why I didn't notice before.

Edit: so I think we should still do this for backward compat, but at least I'm not crazy :)

@dcousens

This comment has been minimized.

Show comment
Hide comment
@dcousens

dcousens Nov 13, 2015

Contributor

@laanwj maybe just inline a comment to mention this might be fixed in latest version of libevent? Could be useful for some future refactoring, save someone trying to establish whether there was a reason we moved away from it that they aren't aware of.

Contributor

dcousens commented Nov 13, 2015

@laanwj maybe just inline a comment to mention this might be fixed in latest version of libevent? Could be useful for some future refactoring, save someone trying to establish whether there was a reason we moved away from it that they aren't aware of.

@laanwj

This comment has been minimized.

Show comment
Hide comment
@laanwj

laanwj Nov 13, 2015

Member

Will do

Edit: OK, added comment and updated commit message.

Member

laanwj commented Nov 13, 2015

Will do

Edit: OK, added comment and updated commit message.

@laanwj

This comment has been minimized.

Show comment
Hide comment
@laanwj

laanwj Nov 13, 2015

Member

Huh! I pushed this branch to bitcoin/bitcoin instead of my own repository. Oh well, no harm done, I'll just remove it again when it's merged.

Member

laanwj commented Nov 13, 2015

Huh! I pushed this branch to bitcoin/bitcoin instead of my own repository. Oh well, no harm done, I'll just remove it again when it's merged.

http: speed up shutdown
This continues/fixes #6719.

`event_base_loopbreak` was not doing what I expected it to, at least in
libevent 2.0.21.
What I expected was that it sets a timeout, given that no other pending
events it would exit in N seconds. However, what it does was delay the
event loop exit with 10 seconds, even if nothing is pending.

Solve it in a different way: give the event loop thread time to exit
out of itself, and if it doesn't, send loopbreak.

This speeds up the RPC tests a lot, each exit incurred a 10 second
overhead, with this change there should be no shutdown overhead in the
common case and up to two seconds if the event loop is blocking.

As a bonus this breaks dependency on boost::thread_group, as the HTTP
server minds its own offspring.
@laanwj

This comment has been minimized.

Show comment
Hide comment
@laanwj

laanwj Nov 13, 2015

Member

@theuni can you take a look here as well?
Edit: nm, you already did, but didn't ack

Member

laanwj commented Nov 13, 2015

@theuni can you take a look here as well?
Edit: nm, you already did, but didn't ack

@gmaxwell

This comment has been minimized.

Show comment
Hide comment
@gmaxwell

gmaxwell Nov 13, 2015

Member

ACK.

Member

gmaxwell commented Nov 13, 2015

ACK.

@gmaxwell gmaxwell merged commit a264c32 into master Nov 13, 2015

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details

gmaxwell added a commit that referenced this pull request Nov 13, 2015

Merge pull request #6990
a264c32 http: speed up shutdown (Wladimir J. van der Laan)
@dcousens

This comment has been minimized.

Show comment
Hide comment
@dcousens

dcousens Nov 14, 2015

Contributor

Awesome, thanks @laanwj :)

Contributor

dcousens commented Nov 14, 2015

Awesome, thanks @laanwj :)

@MarcoFalke

This comment has been minimized.

Show comment
Hide comment
@MarcoFalke

MarcoFalke Nov 15, 2015

Member

Post merge tested ACK!

@laanwj this makes travis a lot faster. (runtime reduced for some rpc tests by 90%)

Before:

$ time for i in {1..10};do qa/pull-tester/rpc-tests.py disablewallet;done
real    1m53.003s
user    0m1.135s
sys 0m0.437s

$ time for i in {1..10};do qa/pull-tester/rpc-tests.py wallet;done
real    10m4.888s
user    0m21.829s
sys 0m8.239s

After:

disablewallet:
real    0m13.809s
user    0m1.166s
sys 0m0.510s

wallet:
real    5m7.703s
user    0m22.128s
sys 0m8.222s
Member

MarcoFalke commented Nov 15, 2015

Post merge tested ACK!

@laanwj this makes travis a lot faster. (runtime reduced for some rpc tests by 90%)

Before:

$ time for i in {1..10};do qa/pull-tester/rpc-tests.py disablewallet;done
real    1m53.003s
user    0m1.135s
sys 0m0.437s

$ time for i in {1..10};do qa/pull-tester/rpc-tests.py wallet;done
real    10m4.888s
user    0m21.829s
sys 0m8.239s

After:

disablewallet:
real    0m13.809s
user    0m1.166s
sys 0m0.510s

wallet:
real    5m7.703s
user    0m22.128s
sys 0m8.222s

@laanwj laanwj deleted the 2015_11_threadexit branch Nov 30, 2015

@jnewbery jnewbery referenced this pull request Aug 8, 2017

Merged

Improve shutdown process #11006

zkbot added a commit to zcash/zcash that referenced this pull request Oct 23, 2017

Auto merge of #2555 - jasondavies:fix-2554, r=str4d
Fix various thread assertion errors caused during shutdown.

Cherry-picked from the following upstream PRs:

- bitcoin/bitcoin#6719
- bitcoin/bitcoin#6990
- bitcoin/bitcoin#8421
  - Second commit only in this PR
- bitcoin/bitcoin#11006

I've cherry-picked the relevant commits, along with a note in each commit referring to the original Bitcoin commit ID (and the Zcash issue numbers where applicable).  I've tested each issue with/without these patches applied.

Closes #2214, #2334, and #2554.

@dagurval dagurval referenced this pull request in bitcoinxt/bitcoinxt Jan 27, 2018

Merged

HTTP Server cherries #315

sickpig added a commit to sickpig/BitcoinUnlimited that referenced this pull request Mar 12, 2018

Port XT #315 from dagurval/28-01-httpserver
HTTP Server cherries from Core:

bitcoin/bitcoin#6719 - Make HTTP server shutdown more graceful
bitcoin/bitcoin#6859 - http: Restrict maximum size of http + headers
bitcoin/bitcoin#6990 - http: speed up shutdown
bitcoin/bitcoin#7966 - http: Do a pending c++11 simplification handling work items
bitcoin/bitcoin#8421 - httpserver: drop boost (#8023 dependency)
bitcoin/bitcoin#11006 - Improve shutdown process

@sickpig sickpig referenced this pull request in BitcoinUnlimited/BitcoinUnlimited Mar 12, 2018

Merged

HTTP servers Core ports #1005

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment