Add lockfile #4449
Conversation
Arbitrary users can also lock files placed in that particular directory, and many lockfiles come with a time expiration mechanism. I think if we're going to do this, it should be global. We can talk about this Monday though.
I don't think they can. You need an open file descriptor, and they can't create files in those directories or read the one we created. Additionally, the lockf specification says the opened file should be open for writing.
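For reference, a minimal sketch (my illustration, not Certbot's actual code) of the behavior described here: an exclusive `fcntl.lockf` lock needs a descriptor open for writing, and `LOCK_NB` makes acquisition fail immediately instead of blocking when another process holds the lock.

```python
import errno
import fcntl
import os
import tempfile

# A throwaway lockfile path for demonstration purposes.
path = os.path.join(tempfile.mkdtemp(), ".certbot.lock")

# lockf-style exclusive locks require the file to be open for writing,
# so a user who can't open the file can't lock it either.
fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
try:
    # LOCK_EX | LOCK_NB: fail immediately rather than block if another
    # process already holds the lock.
    fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    print("lock acquired")
except OSError as err:
    if err.errno in (errno.EACCES, errno.EAGAIN):
        print("another process holds the lock")
    else:
        raise
```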
I see now that "in each directory" means "in each core Certbot directory", not "in each subdirectory of those", which makes more sense.

I'm a little worried about Certbot having multiple locks, because I can see us getting into a state where two different instances of Certbot have acquired different locks. It's possible that this isn't actually a valid state given what code is touched where, but that's complex to check and to make sure we keep as a valid property going forward.

Any reason not to lock just one of the directories that Certbot needs, and block on that? I think it makes sense for the plugins to have their own locks, but for main Certbot, why not block on just one of the directories so there's a single lock per process rather than per directory?
I updated my original post to make this more clear. Good catch.
I'm assuming you're worried about something like deadlock, correct? That shouldn't be possible because we don't block on trying to acquire a lock. If another process has acquired the lock we need, we exit, which releases any locks we hold. If both this behavior and the order we lock directories change in the future, though, that could be a problem. I'm happy to add a comment to the
While it might be slightly overkill, I think a lock per directory is most correct. For example, imagine these two commands are run on a system at once:
Other directories are still shared, but the lock is set elsewhere so Certbot doesn't complain. As described above and in places like Apache's documentation, we should ideally place our lock file at a restricted path. If we do this, I see two options:

- **Use an existing core directory.** The problem with this is it's not clear to a user that running
- **Use a separate path.** This will break integrations for anyone currently running Certbot without root. They'll update and suddenly Certbot has insufficient permissions because they also need to set

I'm not necessarily opposed to a single lock, but there are potential problems with any approach to it that I'm aware of that can be avoided by using a lock per directory. Also, due to the nature of this PR, it might be a good idea to get two reviews on it.
Hm. I didn't realize that it quits on failing to acquire. And good point about

So, the main use case for this is that someone has already put a certbot command in their crontab, but then the packaging system goes and puts it in there again. If we quit, we definitely won't deadlock, but we could easily always fail to run either of them, or at least one of them. We probably want both to complete successfully, no? Which means we do want to block on the lock.

And you're right, the no-sudo case is a hard problem. Which is why I liked having a particular file in
That's definitely the biggest use case. Another one is a user manually running Certbot at the same time as the cronjob happens to fire. Even less likely is two users on the same system trying to run Certbot at the same time using some of the same files.
Hm. While this might be a nice property, I personally think achieving it has other consequences that outweigh the benefits. Having a single instance, or even all concurrently running Certbot instances, fail isn't a big deal in my opinion. We tell people to run the cronjob twice a day, and this advice is followed by our package maintainers, so I think a single failure due to lock contention hardly matters at all. In fact, if people have e-mail from cron set up properly, this will help notify them that there is a problem and there are two cron jobs trying to run Certbot on their system.

If we block and have more than one lock (including any additional locks in the plugins), we have to ensure the locking order never changes or risk deadlock. While we could definitely do this, I'm not sure it's worth the extra complexity and risk of bugs. If we have a single lock file, the risk of deadlock goes away, but we now either break people who are running Certbot without root, have an insecure lockfile, or put a lockfile in a single core directory, allowing users to easily shoot themselves in the foot without knowing it.
You're right that the purpose of this PR is to prevent multiple instances of Certbot from putting any of the files it modifies into an invalid state. Putting a lockfile in

Despite me standing my ground a bit here, I'm certainly open to changing things if we come up with a better solution. The only potential problem I'm aware of with the current approach, though, is that we don't block, and changing this comes with other potential problems I'd like to avoid.
This is actually exactly what worries me -- if a job runs at the same time, it's probably not from randomness; it's because they were both set up to run at the same time, and will continue to do so consistently. So they'll continue failing in the exact same way each time. And I'm not comfortable relying on having cron email set up.

That being said, maybe only one of the two completing successfully is fine... except obviously for the cases where it's always the wrong cron job that wins out, which could totally happen if it starts like 30 seconds before the other one and certbot always takes longer than that to complete. Oh, and even worse, the cases where (possible since we have multiple locks) neither job completes successfully.
I definitely hear this, especially given that we're trying to protect people from an edge case to begin with. I just worry that for the exact people we're trying to protect, we'll accidentally end up failing them even more.
Ugh. You're totally right about these. These are new problems that we'd be opening Certbot up to. What really worries me is going from a "maybe would have worked fine" to a "definitely always breaks things for this set of users" due to an update we push. And the "both jobs fail" case is the worst, I think. If that's not possible, I'd be much more comfortable with this.

So question: how likely is that to happen? How do we handle threading? What about plugins? Are our instructions always actually run in order?
Talked about this a bit with Erica out of band. I'm planning on trying to ensure the order we acquire locks doesn't change through comments/tests.
Partially implemented plan here:
Unfortunately it's hard to add a system test for plugin lock order because:
I'm also not planning on blocking on the locks because it increases the chance of deadlock, we'd have to write our own timeout code (if we wanted a timeout), and we may block on a lock we don't need in the case of plugins. At the end of this, we should have reasonable tests to help ensure that if multiple instances of Certbot are run at the same time, at least one instance will always run successfully. If anyone has a problem with this plan, please let me know so I can change my approach before I spend more time coding.
Ah. Gotcha. Unfortunately, we need a lockfile. Without it, we're just counting down the days until two Certbot instances modify a configuration at once and break everything (if it hasn't happened already). I don't immediately see another solution to the problem you outlined above, though, other than a timeout. Having Certbot's default behavior be non-interactive would also largely mitigate this problem, but we obviously can't do that now. Certbot 2.0 is going to be awesome one day though.

I think I'd like to see this PR reviewed as-is so I can make any requested changes ASAP. Again, a solution to endlessly waiting for input should almost certainly go into another PR due to the size of this one. We can wait to merge this PR until we've got a plan for this problem that we're comfortable with though.
Yeah, I think a timeout is reasonable and shouldn't be too hard. But if we have a timeout anyway, does it then make sense to continue to not block, as we're doing here? If we're going to change the fundamental behavior of the lock based on the results of that, then it makes sense to put it here.
I actually don't have comments on the code itself, I've read through it a few times and it seems meticulously thought out and careful.
We should definitely run the tests on multiple OSes -- I have Arch in mind in particular, because of that weird subprocess communication thing that it did that one time (or more specifically, failed to do). That was probably specific to Nginx on Arch, but I'd be much more comfortable having checked.
Also, I'd like to see another test: after a process holding a lock has quit unexpectedly (and didn't remove the file), does another process successfully acquire the lock?
OK. I'll write a timeout PR today. My plan is to make the timeout something ridiculous like 24 hours so no user should accidentally hit it if they get distracted while running Certbot or have to stop and Google something. I do not think we should change the lock behavior with a timeout for the same reasons identified in my original post. The timeout I refer to there is a timeout waiting for the lock, not a timeout on user input.
Over the course of doing this I've tested on more obscure OSes like FreeBSD, but I'll run a full set of test farm tests today which will include all of our unit tests and the
Good idea. I'll add a unit test for this today as well.
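The suggested test could look roughly like this (a hypothetical sketch, POSIX-only since it uses `os.fork`): lockf-style locks die with their holder's descriptors, so a stale lockfile left on disk by a crashed process must not block the next instance.

```python
import fcntl
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), ".certbot.lock")

pid = os.fork()
if pid == 0:
    # Child: take the lock, then exit WITHOUT unlinking the file,
    # simulating a process that quit unexpectedly and left the
    # lockfile behind.
    child_fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    fcntl.lockf(child_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    os._exit(0)

os.waitpid(pid, 0)
assert os.path.exists(path)  # the stale lockfile is still on disk

# Parent: acquiring the lock succeeds because the kernel released the
# child's lock when the child exited and its descriptors were closed.
fd = os.open(path, os.O_WRONLY)
fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
print("stale lock acquired")
```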
That all sounds great. Maybe just under 24 hours, so it's gone before the next day's run.

As to the timeout, I suppose many of the same reasons still apply, in which case yup, I'm good with another PR. I saw you added the new test, which just leaves test farming. Did you want another set of eyes on this before merging? I remember you'd mentioned something about that.
Code looks good, but wait for the test farm tests before merging.
I ran all our test farm tests and everything passed. I also manually ran our test suite on Arch and FreeBSD on both Python 2 and Python 3 with no issues. @pde said he'd try to review the code here, which certainly doesn't hurt. I'm also fine to wait to merge this until the input timeout PR has landed, since we've decided we don't want this without that PR as well.
```
@@ -162,6 +162,14 @@ def prepare(self):
        if self.version is None:
            self.version = self.get_version()

        # Prevent two Nginx plugins from modifying a config at once
        try:
            util.lock_dir_until_exit(self.conf('server-root'))
```
One day, we'd probably want to make this a part of `IAuthenticator` and `IInstaller`, so that third-party plugin authors don't have to know to make locking calls themselves (i.e. the API can suggest they should provide a directory to lock).
Looks good. Just one suggested change to avoid spinning really fast when we don't have the lock!
```
finally:
    # Close the file if it is not the required one
    if self._fd is None:
        os.close(fd)
```
Consider something like:

```python
time.sleep(random.randint(0, 200) / 1000.)
```

to ensure we don't eat lots of CPU if we can't get the lock?
I'll add this if you want but I'm not convinced it's necessary. If we try to lock the file, but another process holds the lock, we error out immediately. The only time we repeat this loop is if another process deletes the lock file after we've opened the file but before we've locked it. We only delete the lockfile when releasing an acquired lock so it's very rare for us to loop at all here.
With that said, if you'd still like me to add something like this, let me know and I will.
SGTM!
Hello @bmw. Is there a way to bypass this .lock feature, or is there another solution? (Maybe locking Certbot based on the domain list specified in the command.) Documenting this behavior would also be appreciated. Thanks!
Hi @ajaegers. I'm sorry this has caused problems for you. Until the latest release, Certbot had no protections to prevent concurrently running Certbot instances from corrupting your certs, keys, logs, backups, Apache/Nginx config, etc. I apologize that this behavior being unsafe wasn't documented, and I created #4627 to document the new locking mechanism.

We could potentially implement finer grained locking, but this would be a significant increase in Certbot's complexity. Unfortunately, it isn't enough to just lock based on the domain list. Unless you are also using

Even if you're using

With Certbot 0.14.0, there isn't a nice and safe way to run multiple instances in parallel. You could have concurrent instances use different directories with the
@ajaegers please file another ticket if you need help getting multiple Certbots running in parallel with

The alternative would be supporting a
(You can also use
Built on #4554. Fixes #933.
This is a large PR, but over 60% of it is testing.
Design decisions
- `.certbot.lock` files everywhere. Users will complain about us littering files all over their system or wonder if Certbot crashed.
- The `lock_file` module which the `LockFile` class is based on is the only Python lock file module I found that securely deletes the lockfile.
- `lock_file` is no longer maintained and isn't packaged anywhere, so I added it to our tree and cleaned it up. There's a slight benefit here because we've promised Debian we wouldn't be adding any new (non-optional) dependencies to Certbot.
- `lock_file` doesn't work on Windows, but neither does Certbot right now, so we can deal with that problem when/if it comes up.
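A sketch of what "securely deletes the lockfile" can mean in practice (my illustration, not `lock_file`'s actual code): unlink the path while the lock is still held, then release the lock by closing the descriptor, so there is no window in which the file exists on disk but is unlocked.

```python
import fcntl
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), ".certbot.lock")

# Acquire: create/open the lockfile and take an exclusive lock.
fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)

# Release: remove the file FIRST, while the lock is still held, THEN
# drop the lock by closing the descriptor. Another process racing us
# either sees the locked file or no file at all -- never an unlocked
# leftover it could mistake for a live lock.
os.remove(path)
os.close(fd)
```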