Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simpler Storage API #5

Closed
DisposaBoy opened this issue Dec 13, 2018 · 13 comments
Closed

Simpler Storage API #5

DisposaBoy opened this issue Dec 13, 2018 · 13 comments
Labels
feature request Request for new feature or functionality

Comments

@DisposaBoy
Copy link
Contributor

It's probably too late to change the API now, so this is more of a question about why the API is designed the way it is than anything.

What would you like to have changed?

Remove Locker from the Storage interface.

Why is this feature a useful, necessary, and/or important addition to this project?

TryLock is hard to implement correctly and the implementation doesn't seem to make use of it, because it just waits anyway.

What alternatives are there, or what are you doing in the meantime to work around the lack of this feature?

The ideal interface would be Storage on its own without Locker, then all methods would be required to be safe for concurrent use.
Apart from being more idiomatic, it makes things a lot simpler because the actual storage implementation has a better idea about how to do locking than CertMagic.

The main motivation was to implement a Storage in SQLite (or other databases) but I was blocked by the requirement that TryLock being non-blocking.
Instead of implementing the equivalent of mu.Lock(), it seemed I'd have to introduce a whole bunch of state that will probably be buggy for unknown reasons. Even FileStorage.TryLock makes me nervous because at a quick glance, it doesn't seem to handle cleanup so if the process crashes, everything is probably borked.

@DisposaBoy DisposaBoy added the feature request Request for new feature or functionality label Dec 13, 2018
@mholt
Copy link
Member

mholt commented Dec 13, 2018

This is a good question, but I am still convinced the two interfaces are jointly required for a workable Storage implementation.

It's not the Storage methods that need to be made safe for concurrent use -- they already need to be, of course -- it's that the ACME operations like the obtaining of a certificate need to be synchronized. This means not only one operation at a time, but one in total. TryLock is the easiest way I could think of that allows other processes to wait while another finishes obtaining the certificate it needs, then simply loading that certificate without obtaining it again. This idempotency is crucial for running correctly in a cluster, and it's why other libraries such as autocert don't do well sharing storage imementations even though it has a Cache interface.

Look closer at CertMagic's code and you'll see that it doesn't always wait -- it only waits if the lock is already held, and then it does some other flow when the waiting is over, whereas with a regular mutex you just wait and then continue, but here we have to alter the program flow.

For database backed storage you can use a transaction to "obtain" a lock or, if already obtained, to know that and simply return a waiter instead.

I hope that makes sense. They used to be separate but I really could not justify totally separating the two because it was not safe to use one but not the other.

FileStorage does need to handle crashes better, and it's on my list to enable expiring the locks, but it's also safe to assume that in a clean shutdown scenario, Unlock() will be called, cleaning up the resources. (Also a WIP, I haven't hit either of these edge cases in the years I've been using it though.)

@mholt
Copy link
Member

mholt commented Dec 13, 2018

Commit fe72205 handles stale locks in FileStorage, and adds a method to the Locker interface to help callers remove all locks before the process exits. (Sorry, it's the opposite of what you wanted -- a simple interface -- but I think it is necessary, since you brought up the idea of cleanup.)

Let me know what you think!

@DisposaBoy
Copy link
Contributor Author

This is a good question, but I am still convinced the two interfaces are jointly required for a workable Storage implementation.

It's not the Storage methods that need to be made safe for concurrent use -- they already need to be, of course -- it's that the ACME operations like the obtaining of a certificate need to be synchronized. This means not only one operation at a time, but one in total. TryLock is the easiest way I could think of that allows other processes to wait while another finishes obtaining the certificate it needs, then simply loading that certificate without obtaining it again. This idempotency is crucial for running correctly in a cluster, and it's why other libraries such as autocert don't do well sharing storage imementations even though it has a Cache interface.

Look closer at CertMagic's code and you'll see that it doesn't always wait -- it only waits if the lock is already held, and then it does some other flow when the waiting is over, whereas with a regular mutex you just wait and then continue, but here we have to alter the program flow.

Going by the 2 uses of TryLock that I saw (Obtain, Renew): IIUC, the flow is TryLock then Wait if it returns a Waiter.

To me this the same as a Lock except the Locker implementation needs to figure out if the lock is held, then figure out how to wait for it to be released without blocking the caller.

Without the need to be asynchronous the implementation can just use a sync.Mutex or database transaction since they already implement this behavior correctly.

e.g. Renew using Lock would be:

lock()

if we still need to renew:
    renew()

unlock()

If I understand the code correctly, it currently does:

trylock()

if someone else has the lock:
    wait()
    # assume they succeeded
else:
    renew()
    unlock()

Again, it's entirely possible that I just don't understand the code but it seems to me that if 2 processes happen to try to renew at the same time but don't block each other, i.e. p1 calls Unlock before p2 calls TryLock then p2 will go on to try and renew as well.

For database backed storage you can use a transaction to "obtain" a lock or, if already obtained, to know that and simply return a waiter instead.

How do you implement said waiter?

I hope that makes sense. They used to be separate but I really could not justify totally separating the two because it was not safe to use one but not the other.

To be clear: I originally thought an external lock might not be needed and it could be simpler to remove Locker, but the main source of complexity is TryLock + Wait.

FileStorage does need to handle crashes better, and it's on my list to enable expiring the locks, but it's also safe to assume that in a clean shutdown scenario, Unlock() will be called, cleaning up the resources. (Also a WIP, I haven't hit either of these edge cases in the years I've been using it though.)

Did you not hit not hit them? or are you just not aware that you hit them? :p

@DisposaBoy
Copy link
Contributor Author

Commit fe72205 handles stale locks in FileStorage, and adds a method to the Locker interface to help callers remove all locks before the process exits. (Sorry, it's the opposite of what you wanted -- a simple interface -- but I think it is necessary, since you brought up the idea of cleanup.)

Let me know what you think!

I don't think this works. If the process crashed, there'd be no chance to clean up the locks at shutdown, otherwise it could just unlock them.

It's the same problem you have with pid files, disk-based locks, (non-abstract) domain sockets, etc.: you have to take care to clean up at program start.

@mholt
Copy link
Member

mholt commented Dec 13, 2018

@DisposaBoy Thanks for your comments.

Also thanks for GoSublime! 😊

e.g. Renew using Lock would be:

lock()

if we still need to renew:
renew()

unlock()

This is probably a better pattern. I'll push a branch soon that changes this, and I would like you to try it if that's okay with you!

How do you implement said waiter?

For FileStorage, we do a simple poll every so often.

To be clear: I originally thought an external lock might not be needed and it could be simpler to remove Locker, but the main source of complexity is TryLock + Wait.

I understand now -- I bet we can remove both.

I don't think this works. If the process crashed, there'd be no chance to clean up the locks at shutdown, otherwise it could just unlock them.

It does work shutdowns that can cleanup (e.g. SIGTERM), but crashes have to be handled by ignoring/removing stale locks, which that commit also does.

@mholt
Copy link
Member

mholt commented Dec 13, 2018

@DisposaBoy Please see and review PR #7 when you have a chance. Hopefully soon. :) I'd like to get this pinned down.

@DisposaBoy
Copy link
Contributor Author

@DisposaBoy Thanks for your comments.

Also thanks for GoSublime! blush

No problem 😄

How do you implement said waiter?

For FileStorage, we do a simple poll every so often.

That's a common solution, but it's the one I was afraid of because you run the risk of polling too quickly, thus overloading the system or not quickly enough, thus introducing more latency than necessary.

I don't think this works. If the process crashed, there'd be no chance to clean up the locks at shutdown, otherwise it could just unlock them.

It does work shutdowns that can cleanup (e.g. SIGTERM), but crashes have to be handled by ignoring/removing stale locks, which that commit also does.

Reading the code: stale is defined as being 2 hours old which means it only works if you try the lock after 2 hours. If you try before this time, it will block forever (or until whatever lock timeout might be in place).

The code below demonstrates the issue for me. I ran it with ctrl.,ctrl+r which is roughly equivalent to sigint existing-instance; go run....

package main

import (
	"log"
	"time"

	"github.com/mholt/certmagic"
)

func main() {
	fs := certmagic.FileStorage{}
	k := "hello/world"
	log.Println("TryLock")
	w, err := fs.TryLock(k)
	if err != nil {
		log.Fatalln("TryLock:", err)
	}
	defer fs.Unlock(k)
	if w != nil {
		log.Println("waiting...")
		w.Wait()
	}
	log.Println("sleeping...")
	// give us some time to kill the process
	time.Sleep(10 * time.Second)
}

@DisposaBoy
Copy link
Contributor Author

@DisposaBoy Please see and review PR #7 when you have a chance. Hopefully soon. :) I'd like to get this pinned down.

It's later here, so I'll add it to my list of things to do tomorrow.

@mholt
Copy link
Member

mholt commented Dec 13, 2018

@DisposaBoy

That's a common solution, but it's the one I was afraid of because you run the risk of polling too quickly, thus overloading the system or not quickly enough, thus introducing more latency than necessary.

What better solution would you propose?

Thanks for your reproduce case, I'd be interested what you think of the new design in #7 after you wake up 👍

@mholt
Copy link
Member

mholt commented Dec 13, 2018

Also:

Reading the code: stale is defined as being 2 hours old which means it only works if you try the lock after 2 hours. If you try before this time, it will block forever (or until whatever lock timeout might be in place).

New in #7, the Wait() will only block until the lock is stale or released, whichever comes first. No more arbitrary timeout.

@DisposaBoy
Copy link
Contributor Author

DisposaBoy commented Dec 14, 2018

@DisposaBoy

That's a common solution, but it's the one I was afraid of because you run the risk of polling too quickly, thus overloading the system or not quickly enough, thus introducing more latency than necessary.

What better solution would you propose?

I don't know of a better solution than to simply delegate the locking to someone else who know how to do it better, where possible i.e. by using *Flock(), db transactions, mutexes, etc.

mholt added a commit that referenced this issue Dec 19, 2018
* Replace TryLock and Wait with Lock, and check for idempotency (issue #5)

* Fix logic of lock waiter creation in FileStorage (+ improve client log)

* Return from Wait() if lock file becomes stale

* Remove racy deletion of empty lock folder

* move all (FileStorage) methods to (*FileStorage) so assignments to fields like fileStorageNameLocks aren't lost

* rework lock acquisition

* Create lockDir just before lock file creation to reduce the chance that another process calls Unlock() and removes lockDir while we were waiting, preventing us from creating the lock file.
* Use the same strategy that Wait() uses to avoid depending on internal state.

* fix unlock of unlocked mutex

* Move fileStorageNameLocksMu into FileStorage struct

* implement new lockfile removal strategy and simplify the lock acquisition loop.

* readme: Add link to full examples

* Rework file lock obtaining and waiting logic

* Remove not-useful optimization to simplify file-locking logic
@mholt mholt closed this as completed Dec 19, 2018
@mholt
Copy link
Member

mholt commented Dec 19, 2018

Woohoo! Thanks for your critiques!

At some point, can we get your test program added to the repo as a test case?

@DisposaBoy
Copy link
Contributor Author

I'll see if I can convert it tomorrow or Friday.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request Request for new feature or functionality
Projects
None yet
Development

No branches or pull requests

2 participants