Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wait for Nginx restart to complete when using Nginx plugin #7740

Closed
wants to merge 11 commits into from

Conversation

mustanggb
Copy link

@mustanggb mustanggb commented Feb 3, 2020

Fixes #7422.

@schoen
Copy link
Contributor

schoen commented Feb 5, 2020

Hi @mustanggb,

I remember having had some code that depended on psutil in the past which we were subsequently able to remove (which I think may have helped improve Certbot's compatibility because psutil may be unavailable on some platforms). I'd like to check in with others about whether there's a consensus about whether it's OK to re-introduce that dependency.

@schoen
Copy link
Contributor

schoen commented Feb 5, 2020

The history of our removing psutil deliberately is found at

#3106

Is there any other way to do this without reintroducing this dependency? Perhaps we could use psutil if it happens to be present, but fall back to a fixed delay or another method if it's not?

@mustanggb
Copy link
Author

mustanggb commented Feb 5, 2020

@schoen thanks for commenting, I did try to look for why psutil was removed from "core", but didn't notice any particular reason, so assumed it was just to remove some dependencies for easier maintainability/portability.

I deliberately didn't squash my commits so people could have a look at the approaches I tried.

psutil.pid_exists(worker_pid) could be replaced with subprocess.call(["kill", "-0", worker_pid]) as I had originally, or potentially even something like os.kill(worker_pid, 0) (untested), so that one's not too much of an issue.

However the issue I had was the tests complaining that pidof wasn't available on Azure/Windows, hence why I gave psutil a try.

Also I didn't want to spend too much time on it if the overall approach wasn't going to be acceptable.

If people are happy with the approach a cross-platform method of achieving pidof could potentially be sought. I'm no Python expert, so suggestions welcome.

@schoen
Copy link
Contributor

schoen commented Feb 6, 2020

Thanks! I realize this is a thorny problem but so far I haven't come across a consensus about what to do about it. It does seem like the maintainers don't want to re-add the psutil dependency as discussed back in #3106; that may leave us without a clear path forward for this issue.

@adferrand
Copy link
Collaborator

Hello! If you want, I can search for Windows native alternatives for pidof and associated operations on pids.

If there is a solution, we will need to implement it in the certbot.compat package, in order to expose to the rest of the codebase several platform agnostic methods.

@schoen
Copy link
Contributor

schoen commented Feb 7, 2020

@adferrand, I think that would probably be helpful. A related question is whether there is a portable Unix equivalent that works on non-Linux Unix systems (which I think was a problem with psutil as well, if I'm remembering correctly from a few years ago).

@bmw, do you have a thought on an appropriate solution to #7422 or what characteristics such a solution ought to have?

@adferrand
Copy link
Collaborator

OK. My feedback so far after some digging. I think that without psutil, it will be hard to have something working on every system. psutil is doing a hard work to find the native solution, but as @schoen said, failed to do so in the past. Maybe it is better now, but even in this case, we would add a new dependency on the nginx plugin, that will make integration more difficult.

On Windows specifically, we can have a decent solution thanks to pywin32, and reproduce what psutil does, but it still a significant complexity: it consistents essentially in trying to open the process associated to PID, and assert on if we can retrieve the process handle or not, and what is its status if it is retrieved.

I would like to propose to study another approach, that would avoid to manipulate PIDs and so to require a deep knowledge of the actual platform that is running Nginx.

Why not trying to make HTTP requests to the local Nginx instance ? If the results of these requests are usable to know when Nginx is effectively restarted, I think this solution will be way more maintainable in the long term.

@mustanggb mustanggb removed the request for review from ohemorange February 7, 2020 14:25
@mustanggb
Copy link
Author

Appreciate the replies.

Why not trying to make HTTP requests to the local Nginx instance ?

Mainly just because I thought the original method would be a more robust solution, but I've updated the pull request to do it this way instead, please let me know your feedback.

@bmw
Copy link
Member

bmw commented Feb 10, 2020

@bmw, do you have a thought on an appropriate solution to #7422 or what characteristics such a solution ought to have?

I think this is tricky and I'm not sure there is a solution that will work 100% of the time.

While it's probably rare, the server might not be listening on all interfaces and will not be accessible through localhost.

As for the psutil approach, the concerns I have are:

  • The issues we've had in the past with the library
  • The reliability of the string "nginx: worker process" across all nginx packages and forks like Openresty.
  • Complexity in making the dependency conditional and/or adding another dependency that requires compilation.

If we thought it was significantly more reliable than connecting to localhost, I'd be potentially open to adding the psutil dependency again, but I personally suspect this is not the case.

If we cannot do this 100% reliably, I personally think the most important thing here in general is to make sure we have good behavior when this is not going to work. A couple suggestions I have related to this are:

  1. If nginx is not accessible over localhost, the current code raises an exception from requests.get which I think it should not do.
  2. If we resolve the issue of the exception and nginx is not accessible over localhost, we now wait 2 minutes instead of 3 seconds. I think this is pretty bad UX for someone running Certbot interactively where in almost all cases the 3 second sleep is long enough. What if we checked if the server was accessible over localhost before reloading nginx and only wait for the challenge response to become available if we could previously connect to it? This would mean this logic is skipped for servers that weren't listening on port 80 before we set up the challenge, but I personally think it's worth it to avoid waiting for 2 minutes when we'll never be able to connect.

@mustanggb
Copy link
Author

  1. If nginx is not accessible over localhost, the current code raises an exception from requests.get which I think it should not do.

I'm not familiar with such a setup, do you mean a non-default port, listening on a socket (I don't think Nginx has this feature), or something else; you could elaborate? If it's just a different port could we add an argument to set this if it needs to be changed?

  1. If we resolve the issue of the exception and nginx is not accessible over localhost, we now wait 2 minutes instead of 3 seconds. I think this is pretty bad UX for someone running Certbot interactively where in almost all cases the 3 second sleep is long enough. What if we checked if the server was accessible over localhost before reloading nginx and only wait for the challenge response to become available if we could previously connect to it? This would mean this logic is skipped for servers that weren't listening on port 80 before we set up the challenge, but I personally think it's worth it to avoid waiting for 2 minutes when we'll never be able to connect.

Agree.

@bmw
Copy link
Member

bmw commented Feb 10, 2020

Most people's listen directives look like:

listen 80;

This configures nginx to listen to port 80 on all interfaces.

Using the output of a tool like ifconfig, if I configure nginx to listen on the IP address of a specific interface (other than the loopback interface of 127.0.0.1), it will no longer be accessible through localhost but still accessible to users of the site. The modified directive would look something like:

listen 1.2.3.4:80;

I wasn't thinking of a problem of using a non-default port, but we should probably make that work too. The nginx plugin should currently respect the value of --http-01-port when setting up the challenge so we should probably direct our request there as well.

I also wasn't thinking of listening on a socket, although that is also supported by nginx: https://nginx.org/en/docs/http/ngx_http_core_module.html#listen

@alexzorin
Copy link
Collaborator

alexzorin commented Feb 10, 2020

Just a shower thought: you don't need to verify that the challenge path is accessible (and FWIW I think this is a lot harder than it seems due to all the weird and wonderful things you can do in vhosts), you only need to verify that nginx reloaded.

To that end, one could just get the configurator to add a sentinel vhost as part of the challenge solver, such as:

server {
    listen unix:/var/lib/letsencrypt/nginx-sentinel.sock;
    location / { return 200 "<random sentinel value>"; }
}

This might dodge some of the issues that @bmw is worried about.

@mustanggb
Copy link
Author

mustanggb commented Feb 11, 2020

one could just get the configurator to add a sentinel vhost as part of the challenge solver

Bingo! This seems like the best of both worlds, I like this idea better as well as it's a closer check of an actual reload, like you say, a vhost could be giving a 200 response even before a reload has completed.

EDIT: Actually, will a new server/listen directive be picked up on reload, or does it require a full restart, because that would be a deal breaker I'd say.

@mustanggb
Copy link
Author

Well it seems a reload does work fine.

Updated with the sentinel approach; not expecting tests to pass yet.

@mustanggb
Copy link
Author

mustanggb commented Feb 11, 2020

Okay, tests did better than expected, however I have no idea why the integration one is failing.

@mustanggb
Copy link
Author

Any further input from maintainers, possible to have a milestone?

@schoen
Copy link
Contributor

schoen commented Feb 19, 2020

@bmw, could you opine on this? It seems that @alexzorin has had a clever idea and @mustanggb has implemented it.

(One thing I wonder about is whether this could reduce our Windows compatibility, since I didn't think Windows supported Unix domain sockets.)

@bmw
Copy link
Member

bmw commented Feb 20, 2020

That is a clever idea from alexzorin.

In addition to making Windows support harder though, I'm a little concerned about file system permissions or mandatory access control issues like those described in #4716. We used the directory more in the past, but we currently don't configure nginx to use our working directory at all except when the config previously wasn't listening on necessary port for the challenge, we conditionally reference the nonexistent directory http_01_nonexistent.

I'm sorry I keep popping in here trying to poke holes in the proposed approaches, but we currently have nearly half a million nginx users for whom what we currently have is working. That's a lot of nginx configurations doing who knows what and I don't want to break their ability to automatically renew certificates. While it'd be great for us to also support the few configurations who have reported hitting #7422, landing anything here with any significant risk of regressions personally makes me a little nervous.

I don't really care what approach is taken here if we're confident it'll work, but out of what's been proposed so far, I personally think I like using localhost the best with the changes I described at #7740 (comment).

Two other ideas are:

  1. Offer a configurable flag like --nginx-sleep-seconds. It's ugly and there's a discoverability problem for new users who hit Nginx installer not properly reloading configuration  #7422, but it gives a way for the plugin to work reliably for any nginx configuration.
  2. Don't change anything here and close Nginx installer not properly reloading configuration  #7422 as wontfix. People can still do things like using the nginx installer with the webroot authenticator if they want.

@adferrand
Copy link
Collaborator

I would like to give back my reasoning here, trying to avoid further nginx configuration, to include windows compatibility, and various situations. Do not hesitate to explain me why I am wrong.

So, given the port used for the HTTP-01 challenge, as defined by the --http-01-port. The nginx plugin is supposed to ensure that a vhost is listening to this port, and define a path in it to the challenge. The vhost may pre-exist (usually the case, because it is the standard 80 HTTP port), or not pre-exist, and the socket may be available for every interface, or only specific ones.

I suppose that we could test the availability of the socket (and so state that nginx is currently running or not) by trying to bind from certbot to 0.0.0.0 on the HTTP-01 challenge port, and see if it possible or not. If it is not, then the socket is currently used at one or more interfaces accessible to the system, and so nginx is running.

Now, the process to see that nginx restarted. We can be in the situation where nginx was listening to this port before the restart or not.

  1. (pre-perform) check for socket availability
  2. reconfigure nginx
  3. restart it
  4. if the socket was not available, wait for it to be released
  5. wait for the socket to not available again
  6. assume nginx has restarted, proceed to perform the challenge

In theory, we avoid any further configuration to Nginx, like the unix socket, increasing compatibility for Windows.

Considering that the current situation (the plugin does not wait) is mostly working, I think the all process should be wrapped by a global short timeout, typically 30s. If timeout is reached, the plugin assume nginx has restarted and continue.

@mustanggb
Copy link
Author

@bmw
I'm not a Windows user, but from https://devblogs.microsoft.com/commandline/af_unix-comes-to-windows/ it looks like unix sockets are supported?

If the directory permissions are a concern could always put it somewhere else, /var/run perhaps, or even /tmp?

@adferrand
Copy link
Collaborator

adferrand commented Feb 24, 2020

@bmw
I'm not a Windows user, but from https://devblogs.microsoft.com/commandline/af_unix-comes-to-windows/ it looks like unix sockets are supported?

If the directory permissions are a concern could always put it somewhere else, /var/run perhaps, or even /tmp?

This is available since Windows 10 1803. I could not find information about the Server flavor of Windows 10, which is Windows Server 2019. And we need to cover all non-EOL versions of Windows, so down to Windows Server 2012.

@bmw bmw added the priority: unplanned Work that we believe should be done, but does not have a higher priority. label Mar 25, 2020
@bmw
Copy link
Member

bmw commented Apr 1, 2020

@adferrand, how would you propose that your recent suggestion be implemented in a platform independent way and why do you like it better than your previous suggestion of trying to connect to nginx on localhost?

@adferrand
Copy link
Collaborator

adferrand commented Apr 1, 2020

Well, about trying to connect to Nginx on localhost, it may not work depending on how Nginx is configured: it may just not listen to the loopback interface (although very unlikely in a standard configuration, but possible). Given VirtualHost features, the behavior of an HTTP request could also be very different depending on the actual Host header shipped with the request.

However, I may be wrong, but I would expect that trying to bind a port to all interfaces will fail if any interface is already bound for this port: this failure would be a good indication that Nginx has actually restarted, independantly from the config, so whatever the interface is and the server name set up for the associated virtual host.

About how to implement that, the Python socket.bind() is very likely platform independant, as it is already used by Certbot for the standalone plugin for instance, and is working both on Linux, Mac OS and Windows.

@bmw
Copy link
Member

bmw commented Apr 1, 2020

I also could be mistaken, but the two problems I see there are:

  1. If nginx was already listening on port 80, because we reload nginx rather than restart it nginx will never release the port so we won't be able to see any change here.
  2. If nginx was not already listening on port 80 and we regularly try to bind to it, we may prevent nginx from binding it when it restarts causing it to crash.

@mustanggb
Copy link
Author

Won't fix isn't an option, because as it stands the nginx plugin is useless.

So what solution would you like to use?

@bmw
Copy link
Member

bmw commented May 6, 2020

@mustanggb, please keep your comments civil and constructive if you are going to continue posting to this repo. Calling an open source project that has been worked on by hundreds of people and whose nginx plugin is successfully being used by hundreds of thousands of people useless is not appropriate.

As for a path forward, I'm personally just leaning towards an --nginx-sleep-seconds flag for now. It's not at all elegant, however, it's a simple fix that provides a way to resolve this problem 100% of the time.

If other people who have offered suggestions in this thread have differing opinions, I'd be interested to hear them.

alexzorin added a commit to alexzorin/certbot that referenced this pull request Jul 22, 2020
As described in certbot#7422, reloading nginx is an asynchronous process and
Certbot does not know when it is complete. In an environment where this
reload takes a long time, the nginx plugin suffers from an issue where
it responds to and fails the ACME challenge before the nginx server is
ready to serve it.

Following the discussion in a previous PR certbot#7740, this commit introduces
a new flag, --nginx-sleep-seconds, which may be used to increase the
duration that Certbot will wait for nginx to reload, from its previously
hard-coded value of 1s.

Fixes certbot#7422
bmw pushed a commit that referenced this pull request Jul 27, 2020
* nginx: add --nginx-sleep-seconds

As described in #7422, reloading nginx is an asynchronous process and
Certbot does not know when it is complete. In an environment where this
reload takes a long time, the nginx plugin suffers from an issue where
it responds to and fails the ACME challenge before the nginx server is
ready to serve it.

Following the discussion in a previous PR #7740, this commit introduces
a new flag, --nginx-sleep-seconds, which may be used to increase the
duration that Certbot will wait for nginx to reload, from its previously
hard-coded value of 1s.

Fixes #7422

* update CHANGELOG

* nginx: update docstring for nginx_restart
@bmw
Copy link
Member

bmw commented Jul 27, 2020

--nginx-sleep-seconds was implemented in #8163 and will be included in our release next week so I'm closing this PR.

Thanks for getting things started here and triggering the discussion around this issue.

@bmw bmw closed this Jul 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: unplanned Work that we believe should be done, but does not have a higher priority.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Nginx installer not properly reloading configuration
5 participants