
Broadcast/per-silo grains #610

Closed
ReubenBond opened this issue Jul 15, 2015 · 33 comments

Comments

@ReubenBond
Member

In #587, we discussed the occasional desire to break the location transparency of Virtual Actors and have the ability to message a grain (or something similar) on an individual silo, or to broadcast a message to all silos.

In my case, I would have been happy to kick off a stream consumer inside a bootstrap provider and push messages to it using a stream, but this is not currently supported.

Use cases include:

  • Configuration
  • Cache management
  • Metrics collection
  • Fault injection
  • Process control (eg, live code reload)
  • Implementing certain distributed algorithms, eg Raft & Paxos

So, how should we do this?

@gabikliot
Contributor

@ReubenBond, is it only for "broadcast a message to all silos"? So from an API perspective, you have a way to identify a group of grains and send ONE msg to the whole group, without knowing how many are in the group and without being able to message each one individually?
This is a bit different from "break location transparency of Virtual Actors".

@ReubenBond
Member Author

@gabikliot for most of the use cases, yes. Although that would probably not be the nicest way to perform metrics collection. @jthelin likely has some ideas.

@gabikliot
Contributor

@ReubenBond, why not use the standard solution we recommend for this case:
the bootstrap provider makes a call into a PreferLocalPlacement grain with a random id, and that grain subscribes to the stream?
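
A rough sketch of that pattern, assuming hypothetical grain/provider names and a made-up stream provider name and namespace (exact stream APIs depend on the Orleans version):

```csharp
using System;
using System.Threading.Tasks;
using Orleans;
using Orleans.Placement;
using Orleans.Providers;
using Orleans.Streams;

// Hypothetical per-silo listener grain. [PreferLocalPlacement] places the first
// activation on the calling silo, but this is best-effort, not a pinning guarantee.
public interface ISiloListenerGrain : IGrainWithGuidKey
{
    Task Subscribe();
}

[PreferLocalPlacement]
public class SiloListenerGrain : Grain, ISiloListenerGrain
{
    public async Task Subscribe()
    {
        // The stream provider name ("SMSProvider") and namespace ("Broadcast")
        // are assumptions for this sketch.
        var provider = GetStreamProvider("SMSProvider");
        var stream = provider.GetStream<string>(Guid.Empty, "Broadcast");
        await stream.SubscribeAsync((msg, token) =>
        {
            // Handle the broadcast message locally on this silo.
            return Task.CompletedTask;
        });
    }
}

// Bootstrap provider, configured in silo config so it runs once on every silo at startup.
public class ListenerBootstrap : IBootstrapProvider
{
    public string Name { get; private set; }

    public Task Init(string name, IProviderRuntime runtime, IProviderConfiguration config)
    {
        Name = name;
        // A random id yields a distinct activation, placed (preferably) on this silo.
        var grain = runtime.GrainFactory.GetGrain<ISiloListenerGrain>(Guid.NewGuid());
        return grain.Subscribe();
    }

    public Task Close() { return Task.CompletedTask; }
}
```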

@jthelin
Member

jthelin commented Jul 16, 2015

The other part of the "Per Silo" grain semantics is that those grain instances should never "roam" off their designated silo.

If that silo crashes, that grain is just gone, and would never be recreated on another silo.

A (simplified) concrete scenario example is if you have a "shard" of data physically located on one machine, then the grain responsible for dealing with that shard can only run on that one physical machine.

The current PreferLocalPlacement policy handles the initial "local" placement part, but not the "guaranteed no-roam" part.

I think my scenario is a slightly more specific and constrained version of the general requirement that @ReubenBond is describing above.

@gabikliot
Contributor

Do you guys have concrete ideas of how the new API and semantics (w.r.t. failures) for this new type of grain might look?

@jthelin
Member

jthelin commented Jul 17, 2015

From my perspective, the failure mode can be very simple:

If the silo is gone, then the per-silo grain(s) in that silo are also gone.

I would ideally prefer a specific exception with well-understood semantics (maybe something like GrainNoLongerActiveException?) and [preferably] a fast-track throw (as soon as the cluster membership change is detected), rather than needing to wait the full timeout period for an eventual generic TimeoutException.
That makes it easier for any "coordinator" functions in my app to make better-informed decisions about shard availability, rather than trying to second-guess whether a timeout is from a temporary communications glitch [retriable] or because the silo just crashed / went offline [dead].

When a new silo restarts on a failed host (i.e. same host / IP address but different silo generation number), the Orleans runtime will run the usual app bootstrap provider plugins in that silo, which would recreate a fresh instance of the per-silo grain(s) there.
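
A minimal sketch of what those coordinator semantics could look like from the caller's side. All types here are hypothetical stand-ins (including the proposed GrainNoLongerActiveException); only the exception-handling pattern is the point:

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical exception type, as proposed above.
public class GrainNoLongerActiveException : Exception { }

// Hypothetical app-side contracts, only to make the sketch compile.
public interface IPerSiloShardGrain { Task ProcessShardCommand(string command); }
public interface IShardCoordinator
{
    Task MarkShardDead(int shardId);
    Task RetryLater(int shardId, string command);
}

public static class ShardCalls
{
    public static async Task SendToShard(
        IPerSiloShardGrain shard, IShardCoordinator coordinator, int shardId, string command)
    {
        try
        {
            await shard.ProcessShardCommand(command);
        }
        catch (GrainNoLongerActiveException)
        {
            // Fast-tracked, well-typed failure: the hosting silo is confirmed gone,
            // so the coordinator can reassign the shard immediately.
            await coordinator.MarkShardDead(shardId);
        }
        catch (TimeoutException)
        {
            // Could just be a transient communication glitch: retry rather than reassign.
            await coordinator.RetryLater(shardId, command);
        }
    }
}
```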

@ReubenBond
Member Author

I am on board with @jthelin's suggestion. It still requires us to add some application-specific logic in the bootstrap provider, though.

We could create a [SiloSingleton] marker attribute and automatically activate each marked grain type during bootstrap.
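
A sketch of how such a marker could be wired up. The [SiloSingleton] attribute, the ISiloSingletonGrain interface, and the generic bootstrap provider below are all hypothetical; this only illustrates the activation trigger, not how the grain would be pinned:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using Orleans;
using Orleans.Providers;

// Hypothetical marker attribute (does not exist in Orleans today).
[AttributeUsage(AttributeTargets.Class)]
public class SiloSingletonAttribute : Attribute { }

// Hypothetical common interface that [SiloSingleton] grains would implement.
public interface ISiloSingletonGrain : IGrainWithGuidKey
{
    Task Activate();
}

// Hypothetical generic bootstrap provider that activates every [SiloSingleton]
// grain type once on each silo during startup.
public class SiloSingletonBootstrap : IBootstrapProvider
{
    public string Name { get; private set; }

    public async Task Init(string name, IProviderRuntime runtime, IProviderConfiguration config)
    {
        Name = name;
        var singletonTypes = AppDomain.CurrentDomain.GetAssemblies()
            .SelectMany(a => a.GetTypes())
            .Where(t => t.GetCustomAttributes(typeof(SiloSingletonAttribute), false).Any());

        foreach (var type in singletonTypes)
        {
            // Random id => a new activation, resolved to this implementation class
            // via the grain class name prefix. Pinning it here is the open question.
            var grain = runtime.GrainFactory.GetGrain<ISiloSingletonGrain>(
                Guid.NewGuid(), type.FullName);
            await grain.Activate();
        }
    }

    public Task Close() { return Task.CompletedTask; }
}
```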

@jthelin
Member

jthelin commented Aug 21, 2015

BTW, the new [StatelessWorker(1)] functionality in PR #688 should now provide the "no-roam" guarantee I was talking about above, rather than using [PreferLocalPlacement], which is only best-effort.

Updated: This approach does not work, for subtle reasons that @gabikliot pointed out to me in e-mail and hinted at below. :(

@veikkoeeva
Contributor

@jthelin An excellent point! Would it be a worthy point to document in the FAQ, for instance?

@jthelin
Member

jthelin commented Aug 21, 2015

I am hoping to upload a code sample this weekend, but I will look at the docs too ;)

@gabikliot
Contributor

Just to keep in mind: this is still a StatelessWorker, which means you can have more than one globally. There will be up to 1 on every silo, so each silo can have 1 but globally there may be many.

Also, the above means you cannot address an individual activation of the stateless worker globally. There is no way (currently, unless we change the programming model) to send a message from silo X to a StatelessWorker on silo Y. You can only address the local activation of the StatelessWorker. So, for example, every bootstrap provider will be talking to its own, single copy of the StatelessWorker(1) grain.
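
To make that concrete, a minimal sketch of such a grain (names are hypothetical):

```csharp
using System.Threading.Tasks;
using Orleans;
using Orleans.Concurrency;

public interface IPerSiloWorkerGrain : IGrainWithIntegerKey
{
    Task DoLocalWork();
}

// At most one activation per silo, but one per silo across the cluster:
// a call made inside silo X always lands on silo X's local activation,
// and there is no supported way to target silo Y's copy from silo X.
[StatelessWorker(1)]
public class PerSiloWorkerGrain : Grain, IPerSiloWorkerGrain
{
    public Task DoLocalWork() => Task.CompletedTask;
}
```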

jthelin added a commit to jthelin/orleans that referenced this issue Aug 24, 2015
- Add some minimal example code and scenario test case for implementing PerSilo grains using `[StatelessWorker(1)]`.

Implementation overview:

1. Grain class `TestGrains.PartitionGrain` is the PerSilo grain.
     It is annotated with `[StatelessWorker(1)]` attribute to __guarantee local placement and "no-roaming"__.

2. Grain class `TestGrains.PartitionManagerGrain` is the registry for PerSilo grains.
     Only GrainId = 0 is used, so there will be a single instance of the Partition Manager in the cluster.

3. Class `TestGrains.PartitionStartup` is registered as a bootstrap provider in silo config.

4. During silo startup, the Orleans runtime calls `PartitionStartup.Init` on each silo, which:
4a. Calls `PartitionGrain.Start` to create the local PerSilo grain instance on that silo.
4b. Calls `PartitionManager.RegisterPartition` to declare that PerSilo grain with the registrar.

Xref: Discussion of Per-silo Grain feature request in issue dotnet#610
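
A rough sketch of the bootstrap wiring described in steps 3-4 of that overview. The interface shapes below are guesses for illustration, not the exact code from the PR:

```csharp
using System;
using System.Threading.Tasks;
using Orleans;
using Orleans.Providers;

// Guessed interface shapes; the real ones live in the referenced PR.
public interface IPartitionGrain : IGrainWithGuidKey
{
    Task<string> Start();                      // returns some partition/silo info
}

public interface IPartitionManagerGrain : IGrainWithIntegerKey
{
    Task RegisterPartition(string partitionInfo, IPartitionGrain partition);
}

// Steps 3-4: bootstrap provider that the runtime runs on every silo at startup.
public class PartitionStartup : IBootstrapProvider
{
    public string Name { get; private set; }

    public async Task Init(string name, IProviderRuntime runtime, IProviderConfiguration config)
    {
        Name = name;

        // 4a. Create the local PerSilo grain on this silo (random id => new activation here).
        var partition = runtime.GrainFactory.GetGrain<IPartitionGrain>(Guid.NewGuid());
        var info = await partition.Start();

        // 4b. Register that PerSilo grain with the single cluster-wide registry (GrainId = 0).
        var manager = runtime.GrainFactory.GetGrain<IPartitionManagerGrain>(0);
        await manager.RegisterPartition(info, partition);
    }

    public Task Close() { return Task.CompletedTask; }
}
```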
@jthelin
Member

jthelin commented Aug 24, 2015

Yes, @gabikliot states a very important point to remember when using the [StatelessWorker] attribute.

In my example scenario I use a different Guid (Grain ID) for each silo, so each of the PerSilo grains is independently registered / addressable.

I posted PR #738 with my minimal example code / scenario test case.

There are lots of similar ways to get PerSilo grains working, and lots of extra sophistication that could be added -- including higher-level programming abstractions like "DoPerSilo(Action)" -- but my example is intended to be a minimal usage example of the most basic PerSilo grain scenario.

Comments welcome :)

Updated: This approach does not work, for subtle reasons that @gabikliot pointed out to me in e-mail and hinted at above. :(

@jason-bragg
Contributor

Hi @jthelin,
For the purposes you described in #738, specifically

"Use cases include:
•Configuration
•Cache management
•Metrics collection
•Fault injection
•Process control (eg, live code reload)"

Have you considered the silo control channel Gabi added recently (#718)? It's a new capability wherein one can send commands to providers. You 'should' be able to create a provider that implements IControllable and processes commands to perform the kinds of actions you described. The glitch here is that these commands are executed on all silos, not a specific one. That can be worked around, but I'd not recommend it. Since it's a new capability, I thought I'd bring it up.
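
A sketch of what that could look like, assuming the IControllable/ManagementGrain shape introduced in #718 (the provider name and command numbers here are arbitrary):

```csharp
using System.Threading.Tasks;
using Orleans;
using Orleans.Providers;
using Orleans.Runtime;

// A bootstrap provider that also implements IControllable, so it can receive
// control commands fanned out by the runtime to every silo hosting it.
public class ControlChannelBootstrap : IBootstrapProvider, IControllable
{
    public const int FlushCachesCommand = 1;   // arbitrary command number for this sketch

    public string Name { get; private set; }

    public Task Init(string name, IProviderRuntime runtime, IProviderConfiguration config)
    {
        Name = name;
        return Task.CompletedTask;
    }

    public Task Close() { return Task.CompletedTask; }

    // Executed on each silo that hosts this provider when a control command is sent.
    public Task<object> ExecuteCommand(int command, object arg)
    {
        if (command == FlushCachesCommand)
        {
            // ... flush local caches, reload config, etc. ...
        }
        return Task.FromResult<object>(null);
    }
}

// Sending a command (runs against all silos hosting the named provider),
// assuming the #718 management-grain API shape:
// var mgmt = grainFactory.GetGrain<IManagementGrain>(0);
// await mgmt.SendControlCommandToProvider(
//     typeof(ControlChannelBootstrap).FullName, "ControlChannel",
//     ControlChannelBootstrap.FlushCachesCommand, null);
```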

@gabikliot
Contributor

We can easily extend #718 to send a command to a provider on a specific silo. The list of silos is already exposed via ManagementGrain, so adding such a capability would be trivial, and it would not change any public grain abstractions, their virtual nature, or the APIs.

@ReubenBond
Member Author

Interesting suggestion regarding #718 - I wonder how easily we could use this to implement a distributed consensus algorithm... could be a fun project for someone :)

EDIT: Upon inspection, the IControllable looks like it would be a pain to use for implementing anything non-trivial

@gabikliot
Contributor

@ReubenBond , why is that: "Upon inspection, the IControllable looks like it would be a pain to use for implementing anything non-trivial"?
I modeled it following @yevhen's "uniform interface" - just a weakly typed interface. One can build a strongly typed interface on top of it. It should not be any more painful to implement something non-trivial on top of IControllable than on top of F# or any untyped interface.

@ReubenBond
Member Author

Maybe we could have an IGrainWithSiloKey; what do you think?
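
Something like this hypothetical shape (none of it exists today; the usage is only what such an API might look like):

```csharp
using System.Threading.Tasks;
using Orleans;
using Orleans.Runtime;

// Hypothetical marker interface: a grain keyed by the silo it lives on.
// The activation would be pinned to that silo, and calls would fail fast with a
// well-typed exception once that silo leaves the cluster.
public interface IGrainWithSiloKey : IGrain { }

public interface ICacheManagerGrain : IGrainWithSiloKey
{
    Task FlushLocalCache();
}

// Hypothetical usage (no such GetGrain overload exists today):
// SiloAddress silo = ...;                               // from membership / ManagementGrain
// var cache = grainFactory.GetGrain<ICacheManagerGrain>(silo);
// await cache.FlushLocalCache();
```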

@gabikliot
Contributor

And what happens when this silo dies?

@ReubenBond
Member Author

If the silo is no longer in the set of active silos, the call should fail with a well-typed exception

@gabikliot
Contributor

So a non-virtual, non-location-transparent actor? Plus, a silo id is an IP plus an epoch, so exposing silo addresses would also have to be part of the programming model.

@ReubenBond
Member Author

The silo address itself is opaque. The epoch should probably be included if the epoch represents a new instance of a given silo.

I see the concern with breaking the abstraction. All abstractions are leaky, though, and this helps us create cluster services which we would otherwise have a much harder time creating.

Is the main concern that IGrainWithSiloKey would make it too easy to enter into shark-infested waters?

@gabikliot
Contributor

We can definitely do that. In fact, we had this implemented about 3 years ago and removed afterwards. Totally doable to bring it back.

The question I have is: can we do better? Like you said, indeed, it pushes the developer into shark-infested waters. It makes developers' lives much harder with those grains in the scenarios where they are needed.

Allowing more locality and broadcasting support in general is something I believe we need to do, as I commented multiple times in the past. The question is how.

I think we all, collectively, did not think hard enough about how to solve this problem better. What you are suggesting is easy to do, but I think we can do better. My intuition is that a better solution exists. I might be wrong, and then we are back to what you suggested, but my intuition is that we can do better. There is the idea of a group grain, for example, still fully virtual.

I would suggest first exploring the alternatives before we give up and go with what you are suggesting now.

Can you describe your exact scenario and how you would solve it with your suggested non-virtual actor, including error handling?

Btw, using streams from a bootstrap provider (how this whole issue started) is something I already implemented. There is an open PR doing that.

@ReubenBond
Member Author

I agree with you, @gabikliot, we should look for the best solution possible.

Let's assume we want to implement a strongly consistent database atop Orleans. We intend to use a distributed consensus algorithm which involves persisting an operation log on each node.

When a node becomes available, we start replicating operations to it.
When a node is up-to-date, we allow it to take part in the replication quorum.
When a node fails or falls behind, we remove it from the quorum.
Clients can message an individual node in order to perform an operation (writes go to the leader, reads may be served from followers).

So in this case, I don't need to know the physical location of a node; I just need each node to be on a separate host/fault domain (if I form a quorum on a single host, I have not achieved fault tolerance).
I need nodes to be individually addressable so I can send messages to individuals ("hey, here's a new operation to replicate").
I need notifications for when new nodes become available and when they become unavailable, so that I can perform leader election and decide on the nodes included in the replica set.

Effectively, a reference to a node in this case is like a WeakReference<T>... a WeakGrainReference.
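
A hypothetical shape for that kind of node-level API. None of this exists in Orleans; it only restates the requirements above as interfaces, with Raft-flavored method names picked for illustration:

```csharp
using System.Threading.Tasks;
using Orleans;

// Hypothetical per-node replica endpoint: individually addressable, pinned to one
// host/fault domain, and gone for good when that node is gone.
public interface IReplicaNode : IGrain
{
    Task AppendEntries(long term, byte[][] operations);   // "here's a new operation to replicate"
    Task<long> RequestVote(long term);                     // leader election
}

// Hypothetical membership notifications the consensus layer would need.
public interface IReplicaSetObserver
{
    Task OnNodeUp(IReplicaNode node);       // start replicating; admit to quorum once caught up
    Task OnNodeDown(IReplicaNode node);     // remove from the quorum
}
```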

Tell me more about this "group grain"

@gabikliot
Contributor

I described the group grain above. You send one msg and it arrives at all group members. I didn't think about notifications of group changes, but it might be possible to add them.

But I don't think it will work for your scenario, since the group grain I mean is an application abstraction. I think what you are describing is not an application layer but rather a low-level system layer, and to build it you will need Orleans to expose low-level, non-application data, like silos, pings, ...
Totally doable, maybe even a good idea, but not as an application abstraction. I am of the strong opinion that if we start mixing application abstractions with lower-level system abstractions, we are doomed. We will become the next CORBA, WCF, SF: too many options, too much confusion, too many layers intervening in each other.

To build what you are asking for, we could just expose system targets and our membership interface, with membership notifications. If we do that (not sure we should, but if we do), we just need to be very clear that it is not an application abstraction: a system target is not an actor. It's an endpoint. Different abstractions.

@ReubenBond
Member Author

I agree that we should avoid making these things too readily accessible - they are for building system services and not for application-level use - unlike your group grain suggestion, which satisfies most of the use cases in the initial comment.

Ideally we would expose the required functionality in a way which allows external plugins to take advantage of it while still dissuading application developers from going near it.

jthelin added a commit to jthelin/orleans that referenced this issue May 27, 2016
Add minimal example of PerSilo grains using StatelessWorker(1).

- Add some minimal example code and scenario test case for implementing PerSilo grains using `[StatelessWorker(1)]`.

Implementation overview:

1. Grain class `TestGrains.PartitionGrain` is the PerSilo grain.
     It is annotated with `[StatelessWorker(1)]` attribute to __guarantee local placement and "no-roaming"__.

2. Grain class `TestGrains.PartitionManagerGrain` is the registry for PerSilo grains.
     Only GrainId = 0 is used, so there will be a single instance of the Partition Manager in the cluster.

3. Class `TestGrains.PartitionStartup` is registered as a bootstrap provider in silo config.

4. During silo startup, the Orleans runtime calls `PartitionStartup.Init` on each silo, which:
4a. Calls `PartitionGrain.Start` to create the local PerSilo grain instance on that silo.
4b. Calls `PartitionManager.RegisterPartition` to declare that PerSilo grain with the registrar.

Xref: Discussion of Per-silo Grain feature request in issue dotnet#610

PartitionGrain registers itself with PartitionManager when activated.

OnDeactivate-UnregisterPartition

- Added OnDeactivate method to unregister this partition if we are deactivated.

Added test case PerSiloGrainExample_SendMsg.

Rename test cases, for easier tracking & reporting.

- Rename test cases, for easier tracking & reporting.

- Shut down silos at end of each test case to prevent bleed over of any failure conditions from one test case to another.

- Fix some logging messages.

Add extra CountActivations check to these test cases.

Add Broadcast function to PartitionManagerGrain

- Add Broadcast function to PartitionManagerGrain (requires PM grain to be made [Reentrant] to work correctly)

- Add test case SendMsg_Broadcast_PerSiloGrainExample which uses the new PM.Broadcast functionality.

Tests work ok by switching to use [PreferLocalPlacement] instead.

- Switching from [StatelessWorker(1)] to [PreferLocalPlacement] gets the test cases to work, but loses "no-roaming" guarantee.

- Add test case PreferLocalPlacementGrainShouldMigrateWhenKillSilo to PinnedGrainFailureTests

- Add PinnedGrainFailureTests test cases for behaviour when host silo

- Add [PinnedGrainPlacement] placement policy attribute.
jthelin added a commit to jthelin/orleans that referenced this issue Jun 26, 2016
Add minimal example of PerSilo grains using StatelessWorker(1).

- Add some minimal example code and scenario test case for implementing PerSilo grains using `[StatelessWorker(1)]`.

Implementation overview:

1. Grain class `TestGrains.PartitionGrain` is the PerSilo grain.
     It is annotated with `[StatelessWorker(1)]` attribute to __guarantee local placement and "no-roaming"__.

2. Grain class `TestGrains.PartitionManagerGrain` is the registry for PerSilo grains.
     Only GrainId = 0 is used, so there will be a single instance of the Partition Manager in the cluster.

3. Class `TestGrains.PartitionStartup` is registered as a bootstrap provider in silo config.

4. During silo startup, the Orleans runtime calls `PartitionStartup.Init` on each silo, which:
4a. Calls `PartitionGrain.Start` to create the local PerSilo grain instance on that silo.
4b. Calls `PartitionManager.RegisterPartition` to declare that PerSilo grain with the registrar.

Xref: Discussion of Per-silo Grain feature request in issue dotnet#610

PartitionGrain registers itself with PartitionManager when activated.

OnDeactivate-UnregisterPartition

- Added OnDeactivate method to unregister this partition if we are deactivated.

Added test case PerSiloGrainExample_SendMsg.

Rename test cases, for easier tracking & reporting.

- Rename test cases, for easier tracking & reporting.

- Shut down silos at end of each test case to prevent bleed over of any failure conditions from one test case to another.

- Fix some logging messages.

Add extra CountActivations check to these test cases.

Add Broadcast function to PartitionManagerGrain

- Add Broadcast function to PartitionManagerGrain (requires PM grain to be made [Reentrant] to work correctly)

- Add test case SendMsg_Broadcast_PerSiloGrainExample which uses the new PM.Broadcast functionality.

Tests work ok by switching to use [PreferLocalPlacement] instead.

- Switching from [StatelessWorker(1)] to [PreferLocalPlacement] gets the test cases to work, but loses "no-roaming" guarantee.

- Add PinnedGrainFailureTests test cases for behaviour when host silo

- Add [PinnedGrain] placement policy attribute.

- Assert.Fail if call to grain on dead silo completes successfully.

- Switch to using FluentAssertions for test condition checks

- Add test case combos for silo {Primary|Secondary} and action {Kill|Stop}
jthelin added a commit to jthelin/orleans that referenced this issue Oct 16, 2016
Implementation overview:

1. Grain class `TestGrains.PartitionGrain` is the PerSilo grain.
     It is annotated with `[PinnedGrain]` attribute to __guarantee local placement and "no-roaming"__.

2. Grain class `TestGrains.PartitionManagerGrain` is the registry for PerSilo grains.
     Only GrainId = 0 is used, so there will be a single instance of the Partition Manager in the cluster.

3. Class `TestGrains.PartitionStartup` is registered as a bootstrap provider in silo config.

4. During silo startup, the Orleans runtime calls `PartitionStartup.Init` on each silo, which:
4a. Calls `PartitionGrain.Start` to create the local PerSilo grain instance on that silo.
4b. Calls `PartitionManager.RegisterPartition` to declare that PerSilo grain with the registrar.

Xref: Discussion of Per-silo Grain feature request in issue dotnet#610

PartitionGrain registers itself with PartitionManager when activated and unregisters when deactivated.

Add Broadcast function to PartitionManagerGrain

- Add Broadcast function to PartitionManagerGrain (requires PM grain to be made [Reentrant] to work correctly)

Add PartitionGrain.Stop function

- Centralize all Register / Unregister functions for PartitionGrain.

- Mark activation for GC after Stop()

- Mark activation to never be paged out after Start()
@sergeybykov
Contributor

One-per-silo application system targets, grain services, were implemented for #2444 in #2459/#2531.
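
For anyone landing here later, a grain service ends up looking roughly like this in later Orleans versions. Treat it as an approximation rather than the exact API; namespaces and the constructor shape vary by version:

```csharp
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;
using Orleans.Concurrency;
using Orleans.Core;
using Orleans.Runtime;
using Orleans.Services;

// Contract for the per-silo service.
public interface IDataService : IGrainService
{
    Task DoLocalWork();
}

// One instance runs on every silo; it never "roams" and goes away with its silo.
[Reentrant]
public class DataService : GrainService, IDataService
{
    public DataService(IGrainIdentity id, Silo silo, ILoggerFactory loggerFactory)
        : base(id, silo, loggerFactory)
    {
    }

    public Task DoLocalWork() => Task.CompletedTask;
}

// Registered at silo startup, e.g. siloBuilder.AddGrainService<DataService>();
// grains then reach the local instance through a GrainServiceClient<IDataService>.
```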
