
Add lockfile #4449

Merged: 25 commits from the lock branch merged into master on May 1, 2017
Conversation

@bmw bmw (Member) commented Mar 31, 2017

Built on #4554. Fixes #933.

This is a large PR, but over 60% of it is testing.

Design decisions

  • I came full circle and decided to put a lock in the top level of each core directory we use. The biggest problem with a global, world-writable lockfile is that arbitrary users can lock the file and prevent your program from running, or remove the protections entirely.
  • Since we're creating the lockfile in our directories, which have permissions 755 (source), give it 600 permissions.
  • Try to delete the lockfile on exit. With the OS-level locking mechanisms we're using, the lock is automatically released when the process exits, but let's not leave .certbot.lock files everywhere. Users will complain about us littering files all over their system or wonder if Certbot crashed. The lock_file module, which the LockFile class is based on, is the only Python lockfile module I found that securely deletes the lockfile.
  • lock_file is no longer maintained and isn't packaged anywhere, so I added it to our tree and cleaned it up. There's a slight benefit here because we've promised Debian we wouldn't be adding any new (non-optional) dependencies to Certbot. lock_file doesn't work on Windows, but neither does Certbot right now, so we can deal with that problem when/if it comes up.
  • Ensure a consistent locking order through testing so that if two Certbot instances run concurrently, at least one of them will be successful. We don't block on locks because blocking increases the chance of deadlock, we'd have to write our own timeout code (if we wanted a timeout), and we may block on a lock we don't need in the case of plugins. (A sketch of how this fits together follows this list.)
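
A minimal sketch of how these decisions combine, assuming the POSIX fcntl-based locking this PR uses (illustrative only; the real implementation is the LockFile class mentioned above, and lock_dir is a hypothetical helper):

    import errno
    import fcntl
    import os

    def lock_dir(directory):
        # The lockfile lives in the top level of the directory and gets 600
        # permissions inside our 755 directories, so only root can open (and
        # therefore lock or delete) it.
        lock_path = os.path.join(directory, ".certbot.lock")
        fd = os.open(lock_path, os.O_CREAT | os.O_WRONLY, 0o600)
        try:
            # Non-blocking: if another process holds the lock, error out
            # immediately instead of waiting, since blocking with multiple
            # locks risks deadlock and would need hand-rolled timeout code.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except (IOError, OSError) as err:
            os.close(fd)
            if err.errno in (errno.EACCES, errno.EAGAIN):
                raise SystemExit("Another instance of Certbot is already running.")
            raise
        # The OS releases the lock automatically if the process dies; the real
        # code also registers cleanup to delete the file on normal exit.
        return fd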

@bmw bmw added this to the 0.13.0 milestone Mar 31, 2017
@bmw bmw requested review from pde, ohemorange and joohoi March 31, 2017 04:56
@ohemorange (Contributor)

Arbitrary users can also lock files placed in that particular directory, and many lockfiles come with a time-expiration mechanism. I think if we're going to do this, it should be global. We can talk about this Monday though.

@bmw bmw (Member, Author) commented Mar 31, 2017

Arbitrary users can also lock files placed in that particular directory

I don't think they can. You need an open file descriptor and they can't create files in those directories or read the one we created. Additionally, the lockf specification says the opened file should be open for writing.
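
A quick illustration of both points, assuming a root-owned core directory like /etc/letsencrypt and the 600-permission lockfile described above (each call below fails):

    import fcntl
    import os

    # An unprivileged user can't even open the root-owned, mode-600 lockfile,
    # so the lockf call is never reachable:
    fd = os.open("/etc/letsencrypt/.certbot.lock", os.O_WRONLY)  # PermissionError

    # And even for a world-readable file (e.g. /etc/hostname), lockf requires
    # a descriptor open for writing to take an exclusive lock:
    fd = os.open("/etc/hostname", os.O_RDONLY)
    fcntl.lockf(fd, fcntl.LOCK_EX)  # OSError/IOError with errno EBADF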

@ohemorange (Contributor)

I see now that "in each directory" means "in each core Certbot directory", not "in each subdirectory of those", which makes more sense.

I'm a little worried about Certbot having multiple locks, because I can see us getting into a state where two different instances of Certbot have acquired different locks. It's possible that this isn't actually a valid state given what code is touched where, but that's complex to check and complex to keep as a valid property going forward.

Any reason not to lock just one of the directories that Certbot needs, and block on that? I think it makes sense for the plugins to have their own locks, but for main Certbot, why not block on just one of the directories so there's a single lock per process rather than directory?

@ohemorange ohemorange assigned bmw and unassigned ohemorange Apr 3, 2017
@pde pde modified the milestones: 0.13.0, 0.14.0 Apr 5, 2017
@bmw bmw (Member, Author) commented Apr 7, 2017

I see now that "in each directory" means "in each core Certbot directory", not "in each subdirectory of those", which makes more sense.

I updated my original post to make this more clear. Good catch.

I'm a little worried about Certbot having multiple locks, because I can see us getting into a state where two different instances of Certbot have acquired different locks. It's possible that this isn't actually a valid state given what code is touched where, but that's complex to check and complex to keep as a valid property going forward.

I'm assuming you're worried about something like deadlock, correct? That shouldn't be possible because we don't block when trying to acquire a lock. If another process has acquired the lock we need, we exit, which releases any locks we hold. If both this behavior and the order in which we lock directories change in the future, though, that could be a problem. I'm happy to add a comment to the Lock class saying that us not blocking is important.

Any reason not to lock just one of the directories that Certbot needs, and block on that? I think it makes sense for the plugins to have their own locks, but for main Certbot, why not block on just one of the directories so there's a single lock per process rather than directory?

While it might be slightly overkill, I think a lock per directory is most correct. For example, imagine these two commands are run on a system at once:

certbot
certbot --lock-file-or-dir-containing-lock somewhere_not_default

Other directories are still shared, but the lock is set elsewhere so Certbot doesn't complain. As described above and in places like Apache's documentation, we should ideally place our lock file at a restricted path. If we do this, I see two options:

Use an existing core directory

The problem with this is that it's not clear to a user that running certbot --config/work/logs-dir somewhere_else is a potential problem: two instances of Certbot might still try to modify your certs, backups, or logs at the same time.

Use a separate path

This will break integrations for anyone currently running Certbot without root. They'll update and suddenly Certbot has insufficient permissions, because they also need to set --lock-file to somewhere else.

I'm not necessarily opposed to a single lock, but there are potential problems with every approach to it I'm aware of that can be avoided by using a lock per directory. Also, due to the nature of this PR, it might be a good idea to get two reviews on it.

@bmw bmw assigned ohemorange and unassigned bmw Apr 7, 2017
@ohemorange (Contributor)

Hm. I didn't realize that it quits on failing to acquire. And good point about certbot / certbot --lock-file-or-dir-containing-lock somewhere_not_default.

So, the main use case for this is that someone had already put a certbot command in their crontab, but then the packaging system goes and puts it in there again.

If we quit, we definitely won't deadlock, but we could easily always fail to run either of them, or at least one of them.

We probably want both to complete successfully, no? Which means we do want to block on the lock.

And you're right, the no-sudo case is a hard problem. Which is why I liked having a particular file in /tmp. Because while other users can maliciously change things, we're not doing this for security; we're doing it to avoid unintentionally getting into an invalid state.

@bmw bmw (Member, Author) commented Apr 14, 2017

So, the main use case for this is that someone had already put a certbot command in their crontab, but then the packaging system goes and puts it in there again.

That's definitely the biggest use case. Another one is a user trying to manually run Certbot at the same time the cronjob happens to fire. Even less likely, two users on the same system might try to run Certbot at the same time using some of the same files.

If we quit, we definitely won't deadlock, but we could easily always fail to run either of them, or at least one of them.

We probably want both to complete successfully, no? Which means we do want to block on the lock.

Hm. While this might be a nice property, I personally think achieving it has other consequences that outweigh the benefits. Having a single instance or even all concurrently running Certbot instances fail isn't a big deal in my opinion. We tell people to run the cronjob twice a day, and this advice is followed by our package maintainers, so I think a single failure due to lock contention hardly matters at all. In fact, if people have e-mail from cron set up properly, this will help notify them that there is a problem and that two cron jobs are trying to run Certbot on their system.

If we block and have more than one lock (including any additional locks in the plugins), we have to ensure the locking order never changes or risk deadlock. While we could definitely do this, I'm not sure it's worth the extra complexity and risk of bugs. If we have a single lock file, the risk of deadlock goes away, but we now either break people who are running Certbot without root, have an insecure lockfile, or put a lockfile in a single core directory, allowing users to easily shoot themselves in the foot without knowing it.

And you're right, the no-sudo case is a hard problem. Which is why I liked having a particular file in /tmp. Because while other users can maliciously change things, we're not doing this for security; we're doing it to avoid unintentionally getting into an invalid state.

You're right that the purpose of this PR is to prevent multiple instances of Certbot from putting any of the files it modifies into an invalid state. Putting a lockfile in /tmp, though, allows arbitrary users to circumvent or abuse these protections. They can acquire the lock themselves and prevent Certbot from running and renewing your certs. They can delete the lockfile after it's created, removing these protections entirely. While maybe neither of these things is a big deal, it opens Certbot up to other problems that we do not have with another approach.

Despite me standing my ground a bit here, I'm certainly open to changing things if we come up with a better solution. The only potential problem I'm aware of with the current approach is that we don't block, but changing this comes with other potential problems I'd like to avoid.

@ohemorange (Contributor)

Having a single instance or even all concurrently running Certbot instances fail isn't a big deal in my opinion. We tell people to run the cronjob twice a day, and this advice is followed by our package maintainers, so I think a single failure due to lock contention hardly matters at all. In fact, if people have e-mail from cron set up properly, this will help notify them that there is a problem and that two cron jobs are trying to run Certbot on their system.

This is actually exactly what worries me -- if a job runs at the same time, it's probably not from randomness; it's because they were both set up to run at the same time, and they will continue to do so consistently. So they'll continue failing in the exact same way each time. And I'm not comfortable relying on having cron email set up.

That being said, maybe only one of the two completing successfully is fine... except obviously for the cases where it's always the wrong cron job that wins out, which could totally happen if it starts, say, 30 seconds before the other one and Certbot always takes longer than that to complete. Oh, and even worse, the cases where (possible since we have multiple locks) neither job completes successfully.

If we block and have more than one lock (including any additional locks in the plugins), we have to ensure the locking order never changes or risk deadlock. While we could definitely do this, I'm not sure it's worth the extra complexity and risk of bugs.

I definitely hear this, especially given that we're trying to protect people from an edge case to begin with. I just worry that for the exact people we're trying to protect, we'll accidentally end up failing them even more.

Putting a lockfile in /tmp though allows arbitrary users to circumvent or abuse these protections. They can acquire the lock themselves and prevent Certbot from running and renewing your certs. They can delete the lockfile after it's created removing these protections entirely. While maybe neither of these things are a big deal, it opens Certbot up to other problems that we do not have with another approach.

Ugh. You're totally right about these. These are new problems that we'd be opening Certbot up to. /tmp doesn't solve our problems.

What really worries me is going from a "maybe would have worked fine" to a "definitely always breaks things for this set of users" due to an update we push. And the "both jobs fail" case is the worst, I think. If that's not possible, I'd be much more comfortable with this. So question: how likely is that to happen? How do we handle threading? What about plugins? Are our instructions always actually run in order?

@ohemorange ohemorange assigned bmw and unassigned ohemorange Apr 15, 2017
@bmw bmw (Member, Author) commented Apr 18, 2017

Talked about this a bit with Erica out of band.

I'm planning on trying to ensure the order we acquire locks doesn't change through comments/tests.

@bmw bmw (Member, Author) commented Apr 20, 2017

Partially implemented plan here:

  1. Add a unit test that asserts that when plugins are prepared, they are prepared in sorted order (sketched after this list).
  2. Add a system test requesting a specific plugin where the testing code acquires locks and asserts Certbot errors out about the correct lock.
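
A rough sketch of what the unit test in step 1 could look like (prepare_all is an illustrative stand-in for Certbot's real plugin-preparation code, not its actual API):

    import unittest

    def prepare_all(plugins):
        # Stand-in for the code under test: the property we care about is
        # that plugins are prepared in sorted name order, so every Certbot
        # process acquires plugin locks in the same order.
        for name in sorted(plugins):
            plugins[name].prepare()

    class SortedPrepareTest(unittest.TestCase):
        def test_plugins_prepared_in_sorted_order(self):
            prepared = []

            class FakePlugin(object):
                def __init__(self, name):
                    self.name = name
                def prepare(self):
                    prepared.append(self.name)

            names = ["nginx", "apache", "standalone", "webroot"]
            prepare_all(dict((n, FakePlugin(n)) for n in names))
            self.assertEqual(prepared, sorted(names))

    if __name__ == "__main__":
        unittest.main()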

Unfortunately it's hard to add a system test for plugin lock order because:

  1. Nginx is hidden, so it is ignored unless it is specifically requested (at which point Apache is ignored).
  2. If a specific plugin hasn't been selected, a plugin being unavailable because it can't acquire the lock isn't fatal.

I'm also not planning on blocking on the locks because it increases the chance of deadlock, we'd have to write our own timeout code (if we wanted a timeout), and we may block on a lock we don't need in the case of plugins. At the end of this, we should have reasonable tests to help ensure that if multiple instances of Certbot are run at the same time, at least one instance will always run successfully.

If anyone has a problem with this plan, please let me know so I can change my approach before I spend more time coding.

@bmw bmw force-pushed the lock branch 2 times, most recently from 3204aab to 5e2531b on April 27, 2017 01:38
@bmw bmw (Member, Author) commented Apr 29, 2017

Ah. Gotcha.

Unfortunately, we need a lockfile. Without it, we're just counting down the days until two Certbot instances modify a configuration at once and break everything (if it hasn't happened already). I don't immediately see another solution to the problem you outlined above, though, other than a timeout. Having Certbot's default behavior be non-interactive would also largely mitigate this problem, but we obviously can't do that now. Certbot 2.0 is going to be awesome one day though.

I think I'd like to see this PR reviewed as is so I can make any requested changes ASAP. Again, a solution to endlessly waiting for input should almost certainly go into another PR due to the size of this one. We can wait to merge this PR until we've got a plan for this problem that we're comfortable with though.

@ohemorange (Contributor)

Yeah, I think a timeout is reasonable and shouldn't be too hard. But if we have a timeout anyway, does it then make sense to continue to not block, as we're doing here? If we're going to change the fundamental behavior of the lock based on the results of that, then it makes sense to put it here.

@ohemorange ohemorange (Contributor) left a comment

I actually don't have comments on the code itself, I've read through it a few times and it seems meticulously thought out and careful.

We should definitely run the tests on multiple OSes -- I have Arch in mind in particular, because of that weird subprocess communication thing that it did that one time (or more specifically, failed to do). That was probably specific to Nginx on Arch, but I'd be much more comfortable having checked.

Also, I'd like to see another test: after a process holding a lock has quit unexpectedly (and didn't remove the file), does another process successfully acquire the lock?

@bmw bmw (Member, Author) commented May 1, 2017

Yeah, I think a timeout is reasonable and shouldn't be too hard. But if we have a timeout anyway, does it then make sense to continue to not block, as we're doing here? If we're going to change the fundamental behavior of the lock based on the results of that, then it makes sense to put it here.

OK. I'll write a timeout PR today. My plan is to make the timeout something ridiculous like 24 hours so no user should accidentally hit it if they get distracted while running Certbot or have to stop and Google something.

I do not think we should change the lock behavior with a timeout, for the same reasons identified in my original post. The timeout I refer to there is a timeout waiting for the lock, not a timeout on user input.

We should definitely run the tests on multiple OSes -- I have Arch in mind in particular, because of that weird subprocess communication thing that it did that one time (or more specifically, failed to do). That was probably specific to Nginx on Arch, but I'd be much more comfortable having checked.

Over the course of doing this I've tested on more obscure OSes like FreeBSD, but I'll run a full set of test farm tests today which will include all of our unit tests and the lock_test I wrote.

Also, I'd like to see another test: after a process holding a lock has quit unexpectedly (and didn't remove the file), does another process successfully acquire the lock?

Good idea. I'll add a unit test for this today as well.
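
A minimal sketch of that test's idea, assuming the fcntl-based locking described earlier: kill a lock holder without any cleanup, then check that a second process can still acquire the lock even though the stale file remains on disk:

    import fcntl
    import os
    import subprocess
    import sys
    import tempfile

    lock_path = os.path.join(tempfile.mkdtemp(), ".certbot.lock")

    # Child process: acquire the lock, report readiness, then hang until killed.
    child_code = (
        "import fcntl, sys, time\n"
        "fd = open({path!r}, 'w')\n"
        "fcntl.lockf(fd, fcntl.LOCK_EX)\n"
        "print('locked'); sys.stdout.flush()\n"
        "time.sleep(60)\n"
    ).format(path=lock_path)
    child = subprocess.Popen([sys.executable, "-c", child_code],
                             stdout=subprocess.PIPE)
    child.stdout.readline()  # wait until the child holds the lock
    child.kill()             # die without releasing the lock or removing the file
    child.wait()

    # The stale lockfile is still on disk, but the OS dropped the lock when
    # the child died, so acquiring it now succeeds immediately.
    assert os.path.exists(lock_path)
    fd = open(lock_path, "w")
    fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # no exception raised
    print("stale lock acquired successfully")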

@ohemorange (Contributor)

That all sounds great. Maybe just under 24 hours, so it's gone before the next day's run.

As to the timeout, I suppose many of the same reasons still apply, in which case yup I'm good with another PR.

I saw you added the new test, which just leaves test farming.

Did you want another set of eyes on this before merging? I remember you'd mentioned something about that.

@ohemorange ohemorange (Contributor) left a comment

Code looks good, but wait for the test farm tests before merging.

@bmw bmw (Member, Author) commented May 1, 2017

I ran all our test farm tests and everything passed. I also manually ran our test suite on Arch and FreeBSD on both Python 2 and Python 3 with no issues.

@pde said he'd try to review the code here, which certainly doesn't hurt. I'm also fine with waiting to merge this until the input timeout PR has landed, since we've decided we don't want this without that PR as well.

@@ -162,6 +162,14 @@ def prepare(self):
        if self.version is None:
            self.version = self.get_version()

        # Prevent two Nginx plugins from modifying a config at once
        try:
            util.lock_dir_until_exit(self.conf('server-root'))

@pde pde (Member) commented May 1, 2017

One day, we'd probably want to make this a part of IAuthenticator and IInstaller, so that third-party plugin authors don't have to know to make locking calls themselves (i.e., the API can suggest they should provide a directory to lock).

@pde pde (Member) left a comment

Looks good. Just one suggested change to avoid spinning really fast when we don't have the lock!

finally:
    # Close the file if it is not the required one
    if self._fd is None:
        os.close(fd)

Member

Consider something like:

time.sleep(random.randint(0,200) / 1000.)

to ensure we don't eat lots of CPU if we can't get the lock?

Member Author

I'll add this if you want, but I'm not convinced it's necessary. If we try to lock the file but another process holds the lock, we error out immediately. The only time we repeat this loop is if another process deletes the lock file after we've opened the file but before we've locked it. We only delete the lockfile when releasing an acquired lock, so it's very rare for us to loop at all here.

With that said, if you'd still like me to add something like this, let me know and I will.
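
For reference, a standalone sketch of the loop under discussion (illustrative only; in the real code this logic lives in the LockFile class and stores self._fd, as the diff above shows):

    import errno
    import fcntl
    import os

    def acquire_lock(path):
        while True:
            fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
            acquired = False
            try:
                fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                # Another process may have released the lock and deleted the
                # file between our open() and lockf(); if so, our descriptor
                # points at an unlinked inode and the lock protects nothing.
                # Verify the path still names the file we locked, else retry.
                try:
                    path_stat = os.stat(path)
                except OSError as err:
                    if err.errno == errno.ENOENT:
                        continue  # lockfile deleted out from under us; retry
                    raise
                fd_stat = os.fstat(fd)
                if (fd_stat.st_dev, fd_stat.st_ino) == (path_stat.st_dev, path_stat.st_ino):
                    acquired = True
                    return fd  # this is the required file; keep it open
                # else: the path now names a different file; loop and retry
            except (IOError, OSError) as err:
                if err.errno in (errno.EACCES, errno.EAGAIN):
                    raise SystemExit("Another instance of Certbot is already running.")
                raise
            finally:
                # Close the file if it is not the required one
                if not acquired:
                    os.close(fd)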

Member

SGTM!

@pde pde assigned bmw and unassigned pde, ohemorange and joohoi May 1, 2017
@pde pde merged commit 5ca8f7c into master May 1, 2017
@bmw bmw deleted the lock branch May 1, 2017 21:50
@ajaegers ajaegers commented May 8, 2017

Hello @bmw
We are managing thousands of LE certs.
For renewing multiple certificates or running certonly, we parallelize tasks (5 at most at the same time).
Before this PR it worked like a charm, but now we get lots of errors with the message "Another instance of Certbot is already running."

Is there a way to bypass this .lock feature, or is there another solution? (Maybe locking Certbot based on the domain list specified in the command.)

Also, documenting this behavior would be appreciated.

Thanks!

@bmw bmw (Member, Author) commented May 8, 2017

Hi @ajaegers. I'm sorry this has caused problems for you. Until the latest release, Certbot had no protections preventing concurrently running Certbot instances from corrupting your certs, keys, logs, backups, Apache/Nginx config, etc. I apologize that this behavior being unsafe wasn't documented, and I created #4627 to document the new locking mechanism.

We could potentially implement finer-grained locking, but this would be a significant increase in Certbot's complexity. Unfortunately, it isn't enough to just lock based on the domain list. Unless you are also using --cert-name, Certbot will walk all directories in /etc/letsencrypt/live, some of which may be being modified by other processes. We could implement reader/writer locks here (and on your ACME account, logs, etc.), but again, this is much more complex than our current approach.

Even if you're using --cert-name, which bypasses walking all live certificates, things such as your logs and, depending on the plugin you're using, backups and server configuration may be shared with other processes, making this behavior unsafe. With all that said, I created #4628 to look into the issue.

With Certbot 0.14.0, there isn't a nice and safe way to run multiple instances in parallel. You could have concurrent instances use different directories with the --config-dir, --work-dir, and --logs-dir flags, but two certificates in the same configuration directory (/etc/letsencrypt by default) currently cannot be obtained or renewed in parallel safely. I'm sorry we didn't document this sooner, and I apologize for the trouble!

@pde pde (Member) commented May 11, 2017

@ajaegers please file another ticket if you need help getting multiple Certbots running in parallel with --config-dir, --work-dir and --logs-dir. I think that should be a good solution for your use case; you probably want to seed the --config-dir directory by copying /etc/letsencrypt there, and then selectively deleting renewal config files (the ones in /etc/letsencrypt/renewal/) to ensure that each cert lives in exactly one home. It's probably advisable to also delete the corresponding subdirectories below the live/ and archive/ config subdirectories, but that's not strictly necessary.

The alternative would be supporting an --unsafe-no-locking option for cases like yours, with the understanding that you might have to fix things if you hit a race condition. Again, please file another bug if you need that!

@ohemorange (Contributor)

(You can also use certbot delete --cert-name CERTNAME instead of manually deleting renewal config files)
