Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Android] Sporadic failures to shutdown the server (gracefully) #106

Closed
birdofpreyru opened this issue Apr 30, 2024 · 17 comments
Closed

[Android] Sporadic failures to shutdown the server (gracefully) #106

birdofpreyru opened this issue Apr 30, 2024 · 17 comments
Labels
P1 High priority issue. Ready Ready for release.

Comments

@birdofpreyru
Copy link
Owner

No description provided.

@birdofpreyru birdofpreyru added the P1 High priority issue. label Apr 30, 2024
@birdofpreyru
Copy link
Owner Author

birdofpreyru commented Apr 30, 2024

Does not seem to be a problem in the example app when I move it between background and foreground; but it does happen in my app when Google.Play purchase flow is started (the RN app itself moves to background state then, and sometimes the server fails to stop, never exiting the main loop on Lighttpd site). I wonder whether it is related to a bug fixed in the latest Lighttpd v1.4.76.

@birdofpreyru birdofpreyru changed the title Double-check the auto-stop of the server works on Android when started/stopped on the same port [Android] Sporadic failures to shutdown the server May 1, 2024
@birdofpreyru birdofpreyru reopened this May 1, 2024
@birdofpreyru
Copy link
Owner Author

No, v1.4.76 does not help.

A note for myself, on what I figured out so far happens during a failed shutdown attempt, in Lighttpd core:

  • To shut it down I set graceful_shutdown to 1.
  • The main loop executes server_graceful_state() till the end, including the last branch of its last if which does server_sockets_close() and remove_pid_file().
  • Back in the main loop it does not go into if (NULL == srv->conns && graceful_shutdown) { condition, so the main loop continues.
  • Then my understanding is it does not shutdown gracefully because some connections stay alive, for some reason; and my expectation is that after server.graceful-shutdown-timeout (the default 8 seconds) it is supposed to force-terminate any connection and shutdown anyway — does not happen to me.
  • I see connection_graceful_shutdown_maint() is repeatedly called by the main loop, and I see that after 8 seconds it does hit connection_set_state_error() because of the graceful shutdown timeout... though the main loop still keeps on running.
  • Haven't looked further yet, but I guess I it is supposed to collect connections that are marked as error-ed-out, thus making srv->conns null, and that way terminate the main loop?

@birdofpreyru
Copy link
Owner Author

Hi @gstrauss , if you have a free moment, will you help me out here, please?

After tracing stuff a bit further I figured out this:

  • Once a connection is marked as errored out because of the graceful shutdown timeout, the connection stat machine calls connection_handle_response_end_state(), which in turns calls connection_handle_shutdown(), and there it repeatedly goes into the first branch of if-statement for me (in case of shutdown failure) — it sets close_timeout_ts to the current time and marks connection state as close.
  • Then I see connection_handle_close_state() is called by the state machine, but it never arrives to call connection_close() because log_monotonic_secs - con->close_timeout_ts is always 0, because close_timeout_ts is repeatedly reset to current time in the previous point (and HTTP_LINGER_TIMEOUT is 5).

At this point, it feels like a bug in Lighttpd to me, I guess here once connection is marked as errored because of graceful_expire timeout, it should not be repeated for the same connection in the next main loop iteration? Or, perhaps, it should not be marked as errored if it is already in CON_STATE_CLOSE state? Then it will allow the connection to reach HTTP_LINGER_TIMEOUT and, presumably to be closed.

Or... do I misunderstand something here?

@birdofpreyru
Copy link
Owner Author

Yeah, I have verified that changing the if-block from

        if (graceful_expire) {
            connection_set_state_error(r, CON_STATE_ERROR);
            changed = 1;
        }

into

        if (graceful_expire && r->state != CON_STATE_CLOSE) {
            connection_set_state_error(r, CON_STATE_ERROR);
            changed = 1;
        }

solves the problem for me — in case when the server does not gracefully shutdown rapidly, it successfully shuts down after the expected 8 + 5 seconds timeout.

@gstrauss
Copy link

gstrauss commented May 1, 2024

Thank you for the detailed troubleshooting. I have to ruminate on this a bit.

@gstrauss
Copy link

gstrauss commented May 1, 2024

Instead of your suggested patch, my initial take (untested) is that connection_handle_shutdown() should not assign con->close_timeout_ts = log_monotonic_secs; if the state is already r->state == CON_STATE_CLOSE, but I have to read through the code to also assure myself that the state won't bounce between CON_STATE_CLOSE and CON_STATE_ERROR.

@gstrauss
Copy link

gstrauss commented May 1, 2024

connection_handle_response_end_state() is called for CON_STATE_RESPONSE_END and for CON_STATE_ERROR. Calling connection_handle_shutdown() changes the state to CON_STATE_CLOSE if the connection is not yet closed, or cleans up the connection and goes to the reset state of CON_STATE_CONNECT if all finished.

Since the intent is that connection_handle_shutdown() should not be called more than once, and it is clearly being called more than once, that appears to be a bug. My hunch is that I need to look carefully to find where CON_STATE_CLOSE might transition to CON_STATE_ERROR.

@gstrauss
Copy link

gstrauss commented May 1, 2024

h1_check_timeout(), h2_check_timeout() and connection_revents_err() might play roles, but connection_graceful_shutdown_maint() definitely transitions CON_STATE_CLOSE to CON_STATE_ERROR, as you pointed out. However, we still want changed = 1, so please try this:

--- a/src/connections.c
+++ b/src/connections.c
@@ -908,7 +908,8 @@ connection_graceful_shutdown_maint (server * const srv)
         }
 
         if (graceful_expire) {
-            connection_set_state_error(r, CON_STATE_ERROR);
+            if (r->state != CON_STATE_CLOSE)
+                connection_set_state_error(r, CON_STATE_ERROR);
             changed = 1;
         }
 

@gstrauss
Copy link

gstrauss commented May 2, 2024

h1_check_timeout(), h2_check_timeout() and connection_revents_err() do not change the state if r->state == CON_STATE_CLOSE, so it looks like you correctly identified the problem in connection_graceful_shutdown_maint(). Does my modified patch above work for you?

@birdofpreyru
Copy link
Owner Author

@gstrauss your variant sure works as well, though I am not sure why to set changed if we haven't modified the state? I saw it works without it, thus the connection state machine still is called by the main loop regardless?

Another thought I got — in this scenario, when we arrive to connection_handle_close_state() after expiring the graceful shutdown timeout, perhaps there is no need to additionally wait for HTTP_LINGER_TIMEOUT? I mean, once we already expired the graceful shutdown one, perhaps it makes sense to terminate ASAP, without any further delays?

Also, I noticed that graceful-shutdown-timeout option is never mentioned in Lighttpd docs at https://redmine.lighttpd.net/projects/lighttpd/wiki/Docs_ConfigurationOptions

@gstrauss
Copy link

gstrauss commented May 2, 2024

though I am not sure why to set changed if we haven't modified the state?

CON_STATE_ERROR triggers various request/connection cleanups prior to transition to CON_STATE_CLOSE. If already in CON_STATE_CLOSE, setting changed and calling connection_state_machine() optimistically attempts to read from the connection, looking for EOF. As you suggest, maybe that is unnecessary and can wait for read event notification. On the other hand, the optimistic read does sink (read and discard) a bit of extra data from the socket if the kernel socket buffers were full.

Near the top of connection_graceful_shutdown_maint(), if the connection is in CON_STATE_CLOSE and the HTTP_LINGER_TIMEOUT has expired, changed = 1 is set, so as you said, maybe it is unnecessary to set changed = 1 if graceful_expire is set but the state is already CON_STATE_CLOSE.

Another thought I got — in this scenario, when we arrive to connection_handle_close_state() after expiring the graceful shutdown timeout, perhaps there is no need to additionally wait for HTTP_LINGER_TIMEOUT?

As above, CON_STATE_ERROR does some things prior to transition to CON_STATE_CLOSE, e.g. calling shutdown() on the socket and then setting con->close_timeout_ts. One second later, connection_graceful_shutdown_maint() will be run again, will find the connection in CON_STATE_CLOSE, and will adjust con->close_timeout_ts further in the past so that lighttpd will wait at most one more second before HTTP_LINGER_TIMEOUT expires, giving the cilent a chance to respond to the shutdown() on the socket. Therefore, if the default server.graceful-shutdown-timeout is 8 seconds, everything should finish shutting down in 8 seconds + a maximum of 2 more seconds following shutdown() on the sockets on which the graceful shutdown timeout expired. Are you seeing that? With that patch, are you seeing an extra second or two, or are you seeing an extra 5 seconds (HTTP_LINGER_TIMEOUT)?

Also, I noticed that graceful-shutdown-timeout option is never mentioned in Lighttpd docs at https://redmine.lighttpd.net/projects/lighttpd/wiki/Docs_ConfigurationOptions

server.feature-flags documents "server.graceful-shutdown-timeout"

@birdofpreyru
Copy link
Owner Author

birdofpreyru commented May 2, 2024

Are you seeing that? With that patch, are you seeing an extra second or two, or are you seeing an extra 5 seconds (HTTP_LINGER_TIMEOUT)?

You are right, with the patch in place, the server shuts down within a second after the graceful shutdown expiration. My bad.

@gstrauss
Copy link

gstrauss commented May 2, 2024

Thanks for confirming. I have committed the following patch with your patch to fix the issue here and a small change to more quickly close connections after the graceful shutdown timeout has expired and the connection state is CON_STATE_CLOSE. There may still be a 1 second delay after graceful shutdown timeout has expired. This allows for lighttpd to set CON_STATE_ERROR to trigger sending shutdown(fd, SHUT_WR) and transitioning to CON_STATE_CLOSE, and then a second later to adjust the con->close_timeout_ts so that HTTP_LINGER_TIMEOUT is expired, too.

--- a/src/connections.c
+++ b/src/connections.c
@@ -884,6 +884,7 @@ connection_graceful_shutdown_maint (server * const srv)
              * (from zero) *up to* one more second, but no more */
             if (HTTP_LINGER_TIMEOUT > 1)
                 con->close_timeout_ts -= (HTTP_LINGER_TIMEOUT - 1);
+            con->close_timeout_ts -= (graceful_expire << 1); /*(-2 if expired)*/
             if (log_monotonic_secs - con->close_timeout_ts > HTTP_LINGER_TIMEOUT)
                 changed = 1;
         }
@@ -907,7 +908,7 @@ connection_graceful_shutdown_maint (server * const srv)
             changed = 1;
         }
 
-        if (graceful_expire) {
+        if (graceful_expire && r->state != CON_STATE_CLOSE) {
             connection_set_state_error(r, CON_STATE_ERROR);
             changed = 1;
         }

@gstrauss
Copy link

gstrauss commented May 2, 2024

I should note that the issue you are seeing here in #106 is a bug in lighttpd, but that bug was exposed due to your clients ignoring the shutdown() sent by lighttpd. Is there something particular about the way the clients are handling shutdown() that could be improved, too?

@birdofpreyru
Copy link
Owner Author

birdofpreyru commented May 2, 2024

By clients do you mean the HTTP client on the other side of connection that fails to close gracefully? I should investigate it… though, in my case the client connecting to the server is the Android’s WebView, not some custom client I’ve implemented myself; but the problem occurs when I’m moving my app into background, sounds like it may interfere with normal handling of connections by WebView.

@gstrauss
Copy link

gstrauss commented May 2, 2024

By clients do you mean the HTTP client on the other side of connection that fails to close gracefully? I should investigate it… though, in my case the client connecting to the server is the Android’s WebView, not some custom client I’ve implemented myself; but the problem occurs when I’m moving my app into background, sounds like it may interfere with normal handling of connections by WebView.

Yes, the HTTP client on the other side of the connection. In the scenario you describe, are the clients receiving the shutdown() sent by lighttpd? Are the clients subsequently calling close() on the socket prior to expiration of the graceful shutdown timeout? Is lighttpd receiving a socket event indicating that the client closed the connection? Is lighttpd processing that socket event? Somewhere along the line, something seems not to be happening. It might be WebView. It might be backgrounding your app on Android. It might be lighttpd. It might be something else.

Overall, my question is: What are the reason(s) that the client connections are not being closed prior to the graceful shutdown timeout in lighttpd?

@birdofpreyru
Copy link
Owner Author

birdofpreyru commented May 2, 2024

I have no idea. I guess, some bug on the client side, but for now just falling back to non-graceful shutdown is good enough for me. I'll return to investigating it at a better time later 🤷‍♂️

@birdofpreyru birdofpreyru changed the title [Android] Sporadic failures to shutdown the server [Android] Sporadic failures to shutdown the server gracefully May 2, 2024
@birdofpreyru birdofpreyru changed the title [Android] Sporadic failures to shutdown the server gracefully [Android] Sporadic failures to shutdown the server (gracefully) May 2, 2024
@birdofpreyru birdofpreyru added In Progress The issue is actively worked upon Ready Ready for release. and removed In Progress The issue is actively worked upon labels May 3, 2024
birdofpreyru added a commit to birdofpreyru/lighttpd1.4 that referenced this issue May 3, 2024
lighttpd-git pushed a commit to lighttpd/lighttpd1.4 that referenced this issue May 9, 2024
avoid changing connection state to back to CON_STATE_ERROR if the state
has already reached CON_STATE_CLOSE, which may result in the graceful
shutdown timeout not taking effect for that connection if the client
does not notice the socket shutdown(fd, SHUT_WR).

(thx birdofpreyru)

x-ref:
  "[Android] Sporadic failures to shutdown the server"
  birdofpreyru/react-native-static-server#106
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 High priority issue. Ready Ready for release.
Projects
None yet
Development

No branches or pull requests

2 participants