Support for stronger single-activation guarantee #2428
Sorry about the actions. I was trying to comment but clicked the wrong button.
@sergeybykov have you considered a `TimeSpan` as a way of expressing lease duration? This may be preferable, as it removes any possibility of errors due to clock drift when writing a provider. I notice that the Azure Blob HTTP API uses a duration (https://docs.microsoft.com/en-us/rest/api/storageservices/fileservices/lease-blob).
Perhaps this could be implemented in a placement director. Pros: pluggable, non-invasive. Cons: work needs to be done to ensure composability with other placement strategies. If the interface methods accept batches of grains, we could make the system cheap with chunkier leases on slices of a consistent ring. Exciting!
I'm wary of the lease duration parameter: I see that it correlates with the point on the availability/consistency spectrum the user wishes to achieve, but I worry that it leaks implementation details and conflates policy with mechanism. What if I don't want to implement this policy (strong single activation) using leases as the mechanism? Is it realistic that users would want different lease durations per grain type? If so, perhaps we could use a type parameter on the interface instead, allowing the implementation to decide.
I also wonder if the interface could be simplified:

```csharp
public interface IGrainLeaseProvider
{
    Task<TimeSpan[]> AcquireLeases(GrainReference[] grains);
    Task ReleaseLeases(GrainReference[] grains);
    Task<TimeSpan[]> RenewLeases(GrainReference[] grains);
}
```

where the acquisition of a single lease is achieved by passing an array with a single entry.
@ReubenBond I could imagine the lease time might derive from the business domain (where this conflates with the idea of timers), and one would like to ensure it will always be long enough to cover the business case. Technically that means long enough, and indeed per grain type (if going that direction, would it hurt to have both per type and per ID?). Is renew the same as acquire?
Good point. Wouldn't the provider have to account for the latency of its call to the external lease service, and subtract it from the returned duration then? But then, if we do that, can't we just as easily return an absolute expiration time in the local terms of the machine where the Catalog is running?
I'm not sure I understand. This is meant to be an interface between the Catalog and lease providers, to tell the Catalog when it has to let an activation go if it fails to renew its lease. This would be optional, only for grain types that are explicitly marked with the attribute.
I suspect it is likely that you'll want different lease durations for different types, for different CA tradeoffs. I don't understand what we would get by putting the attribute on the interface. The interface is only a contract; it's the implementation class that gets instantiated and is subject to the CA dilemma, I think.
I don't see why not. Although I think I missed passing the requested lease duration to the Get/Acquire calls. So maybe it should rather be:

```csharp
public interface IGrainLeaseProvider
{
    Task<DateTime[]> AcquireLeases(GrainReference[] grains, TimeSpan period);
    Task ReleaseLeases(GrainReference[] grains);
    Task<DateTime[]> RenewLeases(GrainReference[] grains, TimeSpan period);
}
```
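For illustration only, here is a rough sketch of how the Catalog-side flow might use an interface shaped like the one above. Nothing here is Orleans source; the method, `provider`, and `grains` parameters are assumptions made for the example.

```csharp
// Hypothetical sketch: the Catalog acquires leases before activating grains,
// then attempts renewal when half of the lease period has elapsed, leaving
// slack to deactivate before expiry if renewal fails.
async Task ActivateWithLeases(IGrainLeaseProvider provider, GrainReference[] grains)
{
    var period = TimeSpan.FromSeconds(30);

    // A failure here fails the activation process.
    DateTime[] expirations = await provider.AcquireLeases(grains, period);

    // Renew at roughly the halfway point of the lease.
    await Task.Delay(TimeSpan.FromTicks(period.Ticks / 2));
    DateTime[] renewed = await provider.RenewLeases(grains, period);

    // If RenewLeases throws or a lease cannot be extended, the Catalog would
    // deactivate the affected grains before their leases expire.
}
```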
I meant passing the period into the acquire call.
I understand now. My initial impression was that we could use the placement system instead of the catalog - if you squint, it looks doable.
I'm dubious. I believe that there will almost always be a single value. I also don't believe the majority of users will (or should) understand the implications of the lease duration. We might even see some specifying …
@ReubenBond I might have misunderstood the scope. If there is no chance for user code to run, then there isn't a value meaningful to the business domain. I had in mind a pattern, regrettably frequent in integrations, where an external system is called and should only ever be called once within some period of time; someone might try to implement that in a grain activation.
Placement is mostly stateless logic, invoked only to make a decision when a new activation needs to be created. The Catalog is very stateful, keeping track of all local activations and their collection upon inactivity. Since expiration of a lease is a reason for deactivation, and renewal is a prerequisite for keeping an activation in memory, I thought the Catalog would be the natural component to host this logic.
To stress something that I realized might not have been obvious in my description, and that I heard confused some people: application code will never call the lease provider directly. All interactions with the lease providers will be done by the Catalog, which obtains and renews leases in order to activate grains and keep them activated, and deactivates grains if unable to renew their leases on time. The only reason I thought allowing for an optional …
This sounds promising. The last comment is from 2016. Is this still on the radar? Have viable alternatives been implemented in the meantime? I am very interested in having stronger consistency, and, of course, would like to choose a solution that minimizes the cost in terms of availability. |
Writing to storage is always strongly consistent, so that much is handled. When two instances of the same grain write to storage, only one can win and the other will be terminated. That is the primary reason why this isn't as pressing.

Orleans also has a pluggable grain directory now, eliminating most of the cases where a duplicate could occur (e.g., you can use Azure Table Storage or Redis as a grain directory). The remaining case is when a silo has been declared dead by the others but has not yet learned of that. In that case, grains which are active on that silo are free to also be activated elsewhere.

The only way I can see (feel free to weigh in) to eliminate that without per-operation storage writes would be to acquire leases at some level of granularity. That would trade some availability for a reduced window for duplicate activations. Of course, major clock skew, VM migration, massive GC pauses, etc., could still result in two hosts thinking they own the same grain simultaneously.

So, this is still on the radar, but is not currently top of mind.
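To make the etag point concrete, here is a minimal toy sketch (illustrative only, not Orleans or a real storage provider) of how an etag-guarded write lets only one of two duplicate activations win:

```csharp
// Toy in-memory store demonstrating etag-guarded writes. If two activations
// both read etag "v0" and then both try to write, only the first conditional
// write succeeds; the loser gets an exception and should be terminated.
public class EtagStore
{
    private string _etag = "v0";
    private string _state = "";
    private int _version;

    public string CurrentEtag => _etag;

    // Succeeds and returns a new etag only if the caller holds the current one.
    public string TryWrite(string newState, string expectedEtag)
    {
        if (expectedEtag != _etag)
            throw new InvalidOperationException(
                "Etag mismatch: another activation wrote first; deactivate this one.");
        _state = newState;
        _etag = "v" + (++_version);
        return _etag;
    }
}
```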
I appreciate the feedback, @ReubenBond!
Could you elaborate on what damage such a grain might cause while it is unreachable? I presume the issue is with it processing requests that have already accumulated in its queue? I suppose the practical consequences are limited: mainly writing to storage (which results in an exception) or talking to other grains (which it is unlikely to be able to reach). Would you agree?

Additionally, I'm wondering if it is possible for a silo to get temporarily disconnected, after which a second silo also instantiates a grain present on the silo-in-limbo, and then the latter silo reconnects with the cluster. Is this a way that we might get duplicate grains? I'm interested in the answer for both the in-memory directory and the external directory. My educated guess is that the in-memory one would cause a temporary duplicate grain, whereas the external directory would strictly keep the grain mapped to the silo-in-limbo as long as that isn't declared dead (thus rendering the grain unreachable for a few seconds).
Understood. My two concerns here are stale reads and transient faults. By now, I've reasoned that stale reads are a non-issue, since any scenario involving them either (A) involves writing the grain's own state, or (B) involves exclusively writing to other grains' states and thus are not atomic with the read regardless. Transient faults, however, can lead to bothersome investigative work, which I think is a valid concern.
I'm fond of this solution. It strikes me that strong consistency guarantees often go hand in hand with solid uptime guarantees. Regrettably, at the time of writing, the Azure SLAs of these two products are only 99.9% (for Azure Storage, I'm referring to writes), with the exception of pricey Redis Enterprise. Alternatively, simple Azure SQL databases tend to have at least 99.99%, but they have much worse response times compared to the in-memory options. I'm certainly open to writing another grain directory implementation. Do you know of any product that has all three: an SLA of at least 99.95%, an affordable tier, and great response times? For those already paying for Azure SQL Premium or Business Critical, its 1-2 ms read latency might suffice, and additional infrastructure could be avoided. Beautifully simple, but a bit pricey if not already required.

Edit: The Redis zone redundancy preview announcement (2020) claims that zone redundancy increases the SLA to 99.95%, but the actual SLA page makes no such claim.
Orleans temporarily violates the single-activation guarantee in certain failure cases, preferring availability over strong consistency. In some scenarios, applications are willing to trade availability for the simplicity of strong consistency. Today, strong consistency is achievable via external mechanisms, such as storage with etag support. This proposal is to add a mechanism and an extensibility point to formalize the pattern.
- Add a grain class attribute, e.g. `StrongSingleActivation(TimeSpan leaseTime)`, that would indicate that before creating an activation of a grain of this type, a lease has to be obtained by the runtime. A failure to obtain a lease will fail the activation process. A failure to renew the lease will trigger deactivation of the grain before the lease expires.
- Add a lease provider interface and a config option for defining lease providers. Methods of `IGrainLeaseProvider` would return expiration UTC time(s) for the lease(s).
- The Catalog would be responsible for trying to renew leases, e.g. when half of the original lease time elapses. We could start with a single renewal attempt, and add optional retries later. As part of the deactivation sequence, the Catalog will make a best effort to release the lease.
A first lease provider could simply leverage Azure Blob leases. A more performant/scalable solution could leverage other consistency and leader election mechanisms.
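As a rough sketch of what the proposed attribute could look like (hypothetical; note that C# does not allow `TimeSpan` as an attribute constructor argument, so a seconds value is used here instead):

```csharp
// Hypothetical sketch of the proposed marker attribute. Because TimeSpan is
// not a legal attribute parameter type in C#, the lease time is expressed in
// seconds and converted on construction.
[AttributeUsage(AttributeTargets.Class)]
public sealed class StrongSingleActivationAttribute : Attribute
{
    public TimeSpan LeaseTime { get; }

    public StrongSingleActivationAttribute(double leaseTimeSeconds)
    {
        LeaseTime = TimeSpan.FromSeconds(leaseTimeSeconds);
    }
}

// Hypothetical usage on a grain class requiring the stronger guarantee:
// [StrongSingleActivation(30)]
// public class AccountGrain : Grain, IAccountGrain { ... }
```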