Broadcast/per-silo grains #610
Comments
@ReubenBond, is it only for "broadcast a message to all silos"? So from an API perspective you have a way to identify a group of grains and send ONE msg to the whole group, without knowing how many grains are in the group and without getting to message each one individually?
@gabikliot for most of the use cases, yes. Although that would probably not be the nicest way to perform metrics collection. @jthelin likely has some ideas.
@ReubenBond, why not use the standard solution we recommend for this case: …
The other part of the "Per Silo" grain semantics is that those grain instances should never "roam" off their designated silo. If that silo crashes, that grain is just gone, and would never be recreated on another silo. A (simplified) concrete scenario example is if you have a "shard" of data physically located on one machine, then the grain responsible for dealing with that shard can only run on that one physical machine. I think my scenario is a slightly more specific and constrained version of the general requirement that @ReubenBond is describing above.
Do you guys have concrete ideas about what the new API and semantics (w.r.t. failures) for this new type of grain might look like?
From my perspective, the failure mode can be very simple: if a silo is gone, then the per-silo grain(s) in that silo are also gone. I would ideally prefer a specific exception with well-understood semantics (maybe something like …). When a new silo restarts on a failed host (i.e. same host / IP address but different silo generation number), the Orleans runtime will run the usual app bootstrap provider plugins in that silo, which would recreate a fresh instance of the per-silo grain(s) on that silo.
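As a rough illustration of that recreate-on-restart path, here is a minimal sketch of a bootstrap provider that creates and registers a per-silo grain every time a silo starts. It assumes the Orleans 1.x provider model (`IBootstrapProvider`, `IProviderRuntime` exposing a `GrainFactory`); the `IPartitionGrain` / `IPartitionManagerGrain` interfaces are hypothetical stand-ins that follow the naming used later in this thread, and later comments note this particular approach turned out to have subtle problems.

```csharp
using System;
using System.Threading.Tasks;
using Orleans;
using Orleans.Providers;

// Hypothetical per-silo grain and registry interfaces (names follow the example later in this thread).
public interface IPartitionGrain : IGrainWithGuidKey
{
    Task Start();   // warm up / pin the local per-silo grain instance
}

public interface IPartitionManagerGrain : IGrainWithIntegerKey
{
    Task RegisterPartition(Guid partitionId, IPartitionGrain partition);
}

// Bootstrap provider: the Orleans runtime calls Init on every silo during startup,
// so a fresh per-silo grain is (re)created whenever a silo starts or restarts.
public class PartitionStartup : IBootstrapProvider
{
    public string Name { get; private set; }

    public async Task Init(string name, IProviderRuntime providerRuntime, IProviderConfiguration config)
    {
        Name = name;

        // Each silo picks its own Guid for its local per-silo grain (see the comment below
        // about using a different grain id per silo), so the activation is created locally.
        var partitionId = Guid.NewGuid();
        var partition = providerRuntime.GrainFactory.GetGrain<IPartitionGrain>(partitionId);
        await partition.Start();

        // Register this silo's per-silo grain with the single, cluster-wide registry grain (id 0).
        var manager = providerRuntime.GrainFactory.GetGrain<IPartitionManagerGrain>(0);
        await manager.RegisterPartition(partitionId, partition);
    }

    public Task Close() => Task.CompletedTask;
}
```

Because the provider runs on every silo, a replacement silo on the same host automatically gets a fresh per-silo grain instance, which matches the semantics described above.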
I am on-board with @jthelin's suggestion. It still requires us to add some application-specific logic in the bootstrap provider, though. We could create a …
Updated: This approach does not work, for subtle reasons that @gabikliot pointed out to me in e-mail and hinted at below. :(
@jthelin An excellent point! Would it be worth documenting in the FAQ, for instance?
I am hoping to upload a code sample this weekend, but I will look at the docs too ;)
Just to keep in mind, this is still [StatelessWorker] semantics. Also, the above means you cannot address an individual activation of the stateless worker globally. There is no way (currently, in the current implementation, unless we change the programming model) to send a message from silo X to a particular stateless-worker activation on another silo.
- Add some minimal example code and scenario test case for implementing PerSilo grains using `[StatelessWorker(1)]`.

Implementation overview:

1. Grain class `TestGrains.PartitionGrain` is the PerSilo grain. It is annotated with `[StatelessWorker(1)]` attribute to __guarantee local placement and "no-roaming"__.
2. Grain class `TestGrains.PartitionManagerGrain` is the registry for PerSilo grains. Only GrainId = 0 is used, so there will be a single instance of the Partition Manager in the cluster.
3. Class `TestGrains.PartitionStartup` is registered as a bootstrap provider in silo config.
4. During silo startup, the Orleans runtime calls `PartitionStartup.Init` on each silo, which:
   a. Calls `PartitionGrain.Start` to create the local PerSilo grain instance on that silo.
   b. Calls `PartitionManager.RegisterPartition` to declare that PerSilo grain with the registrar.

Xref: Discussion of Per-silo Grain feature request in issue dotnet#610
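For concreteness, a minimal sketch of the two grain classes this overview describes (simplified; the actual PR code differs in details). It reuses the hypothetical `IPartitionGrain` / `IPartitionManagerGrain` interfaces from the earlier sketch. `[StatelessWorker(1)]` caps the grain at one activation per silo and always activates it on the calling silo, which is why the local bootstrap provider is the one that creates it.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Orleans;
using Orleans.Concurrency;

// Per-silo grain: [StatelessWorker(1)] = at most one activation per silo, always local to the caller.
[StatelessWorker(1)]
public class PartitionGrain : Grain, IPartitionGrain
{
    public Task Start()
    {
        // Keep this activation around for (roughly) the lifetime of the silo.
        DelayDeactivation(TimeSpan.FromDays(365));
        return Task.CompletedTask;
    }
}

// Registry grain: only grain id 0 is used, so there is a single logical Partition Manager in the cluster.
public class PartitionManagerGrain : Grain, IPartitionManagerGrain
{
    private readonly Dictionary<Guid, IPartitionGrain> partitions = new Dictionary<Guid, IPartitionGrain>();

    public Task RegisterPartition(Guid partitionId, IPartitionGrain partition)
    {
        partitions[partitionId] = partition;
        return Task.CompletedTask;
    }
}
```

Note the caveat from the comment above: with [StatelessWorker] the per-silo activations cannot be addressed individually from other silos, which is ultimately why this version of the example was abandoned.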
Yes, @gabikliot states a very important point to remember when using the [StatelessWorker] attribute. In my example scenario I use a different Guid (grain ID) for each silo, so each of the PerSilo grains is independently registered / addressable. I posted PR #738 with my minimal example code / scenario test case. There are lots of similar ways to get PerSilo grains working, and lots of extra sophistication that could be added -- including higher-level programming abstractions like "DoPerSilo(Action)" -- but my example is intended to be a minimal usage example of the most basic PerSilo grain scenario. Comments welcome :)

Updated: This approach does not work, for subtle reasons that @gabikliot pointed out to me in e-mail and hinted at above. :(
Hi Jthelin, "Use cases include: Have you considered the silo control channel Gabi added recently (#718)? It's a new capability wherein one can send commands to providers. You 'should' be able to create a provider that inherits from |
We can easily extend #718 to send a command to a provider on a specific silo. The silos list is already exposed via …
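A rough sketch of what the provider side of that could look like. It assumes the `IControllable` contract from #718 is roughly `Task<object> ExecuteCommand(int command, object arg)` (worth double-checking against the actual interface in the version you are using); the command codes and their handling here are made up for illustration.

```csharp
using System;
using System.Threading.Tasks;
using Orleans;
using Orleans.Providers;
using Orleans.Runtime;

// A bootstrap provider that also implements IControllable, so control commands sent over
// the #718 channel (to all silos, or - per the suggestion above - to one specific silo)
// end up in ExecuteCommand on each targeted silo.
public class PerSiloControlProvider : IBootstrapProvider, IControllable
{
    private const int PingCommand = 1;          // hypothetical command codes
    private const int CollectStatsCommand = 2;

    public string Name { get; private set; }

    public Task Init(string name, IProviderRuntime providerRuntime, IProviderConfiguration config)
    {
        Name = name;
        return Task.CompletedTask;
    }

    public Task Close() => Task.CompletedTask;

    // Invoked on whichever silo receives the control command.
    public Task<object> ExecuteCommand(int command, object arg)
    {
        switch (command)
        {
            case PingCommand:
                return Task.FromResult<object>("pong");
            case CollectStatsCommand:
                return Task.FromResult<object>(Environment.WorkingSet); // stand-in for real per-silo data
            default:
                return Task.FromResult<object>(null);
        }
    }
}
```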
Interesting suggestion regarding #718 - I wonder how easily we could use this to implement a distributed consensus algorithm... could be a fun project for someone :) EDIT: Upon inspection, the IControllable looks like it would be a pain to use for implementing anything non-trivial.
@ReubenBond, why is that: "Upon inspection, the IControllable looks like it would be a pain to use for implementing anything non-trivial"?
Maybe we could have an …
And what happens when this silo dies?
If the silo is no longer in the set of active silos, the call should fail with a well-typed exception.
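To illustrate the semantics being proposed in this exchange (all names here are hypothetical placeholders, not an existing API): a reference to a per-silo grain is only useful while its silo is in the active membership set, and once the silo dies the call fails fast with a well-typed exception instead of the grain being re-activated elsewhere.

```csharp
using System;
using System.Threading.Tasks;
using Orleans;

// Hypothetical per-silo grain contract used only for this illustration.
public interface IPerSiloGrain : IGrainWithGuidKey
{
    Task<long> CollectStats();
}

// Placeholder for the "well-typed exception" discussed above.
public class SiloGoneException : Exception { }

public static class PerSiloCallerExample
{
    public static async Task<long?> TryCollect(IPerSiloGrain perSiloGrain)
    {
        try
        {
            // perSiloGrain targets one specific silo; it never "roams".
            return await perSiloGrain.CollectStats();
        }
        catch (SiloGoneException)
        {
            // The silo left the active membership set: the grain is simply gone,
            // and is deliberately not re-created on another silo.
            return null;
        }
    }
}
```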
So a non-virtual actor, non-location-transparent? Plus, a silo id is IP plus epoch, so exposing silo addresses would also have to be part of the programming model.
The silo address itself is opaque. The epoch should probably be included if the epoch represents a new instance of a given silo. I see the concern with breaking the abstraction. All abstractions are leaky, though, and this helps us to create cluster services which we would otherwise have a much harder time creating. Is the main concern that …
We can definitely do that. In fact, we had this implemented about 3 years ago and removed it afterwards. Totally doable to bring it back. The question I have is: can we do better? Like you said, indeed, it pushes the developer into shark-infested waters. It makes developers' lives much harder with those grains, in the scenarios where they are needed. Allowing more locality and broadcasting support in general is something I believe we need to do, as I have commented multiple times in the past. The question is how. I think we all, collectively, did not think hard enough about how to solve this problem better. What you are suggesting is easy to do, but I think we can do better. My intuition is that a better solution exists. I might be wrong, and then we are back to what you suggested, but my intuition is we can do better. There is an idea of a group grain, for example, still fully virtual. I would suggest first exploring the alternatives before we give up and go with what you suggested now. Can you describe your exact scenario, and how you would solve it with your suggested non-virtual actor, including error handling?

Btw, using streams from a bootstrap provider (how this whole issue started) is something I already implemented. There is an open PR doing that.
I agree with you, @gabikliot, we should look for the best solution possible. Let's assume we want to implement a strongly consistent database atop Orleans. We intend to use a distributed consensus algorithm which involves persisting an operation log on each node. When a node becomes available, we start replicating operations to it. So in this case, I don't need to know the physical location of a node, I just need each node to be on a separate host/fault domain (if I form a quorum on a single host, I have not achieved fault tolerance). Effectively, a reference to a node in this case is like a …

Tell me more about this "group grain".
I described the group grain above. You send one msg and it arrives to all group members. I didn't think about notifications about group changes, but it might be possible to add. But I don't think it will work for your scenario, since the group grain I meant is an application abstraction. I think what you are describing is not an application layer, but rather a low-level system layer, and to build it you will need Orleans to expose low-level, non-application data, like silos, pings, ... To build what you are asking, we can just expose system targets and our membership interface, with membership notifications. If we do that (not sure we should, but if) we just need to be very clear it is not an application abstraction: a system target is not an actor. It's an endpoint. Different abstractions.
I agree that we should avoid making these things too readily accessible - they are for building system services & not application-level use - unlike your group grain suggestion which satisfies most of the use cases in the initial comment. Ideally we would expose the required functionality in a way which allows external plugins to take advantage of it while still dissuading application developers from going near it.
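For reference, a sketch of what the group-grain application abstraction mentioned above might look like (entirely hypothetical; nothing like this exists in Orleans at the time of this discussion): members join a group grain, and a single call to it fans out to every current member, while everything stays a plain virtual actor.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Orleans;

// Hypothetical member contract: anything that wants group messages implements this.
public interface IGroupMember : IGrainWithGuidKey
{
    Task OnGroupMessage(string message);
}

// Hypothetical group grain: still an ordinary virtual actor, addressed by its own key.
public interface IGroupGrain : IGrainWithStringKey
{
    Task Join(IGroupMember member);
    Task Leave(IGroupMember member);
    Task Broadcast(string message);   // one call in, delivered to every current member
}

public class GroupGrain : Grain, IGroupGrain
{
    private readonly HashSet<IGroupMember> members = new HashSet<IGroupMember>();

    public Task Join(IGroupMember member) { members.Add(member); return Task.CompletedTask; }
    public Task Leave(IGroupMember member) { members.Remove(member); return Task.CompletedTask; }

    public Task Broadcast(string message)
    {
        // Fan the single incoming message out to all registered members.
        return Task.WhenAll(members.Select(m => m.OnGroupMessage(message)));
    }
}
```

As noted above, this covers the broadcast-style use cases but not the replication scenario, because it gives no control over where the members are placed.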
Add minimal example of PerSilo grains using StatelessWorker(1).

- Add some minimal example code and scenario test case for implementing PerSilo grains using `[StatelessWorker(1)]`.

Implementation overview:

1. Grain class `TestGrains.PartitionGrain` is the PerSilo grain. It is annotated with `[StatelessWorker(1)]` attribute to __guarantee local placement and "no-roaming"__.
2. Grain class `TestGrains.PartitionManagerGrain` is the registry for PerSilo grains. Only GrainId = 0 is used, so there will be a single instance of the Partition Manager in the cluster.
3. Class `TestGrains.PartitionStartup` is registered as a bootstrap provider in silo config.
4. During silo startup, the Orleans runtime calls `PartitionStartup.Init` on each silo, which:
   a. Calls `PartitionGrain.Start` to create the local PerSilo grain instance on that silo.
   b. Calls `PartitionManager.RegisterPartition` to declare that PerSilo grain with the registrar.

Xref: Discussion of Per-silo Grain feature request in issue dotnet#610

PartitionGrain registers itself with PartitionManager when activated.

OnDeactivate-UnregisterPartiton
- Added OnDeactivate method to unregister this partition if we are deactivated.

Added test case PerSiloGrainExample_SendMsg.

Rename test cases, for easier tracking & reporting.
- Rename test cases, for easier tracking & reporting.
- Shut down silos at end of each test case to prevent bleed over of any failure conditions from one test case to another.
- Fix some logging messages.

Add extra CountActivations check to these test cases.

Add Broadcast function to PartitionManagerGrain
- Add Broadcast function to PartitionManagerGrain (requires PM grain to be made [Reentrant] to work correctly)
- Add test case SendMsg_Broadcast_PerSiloGrainExample which uses the new PM.Broadcast functionality.

Tests work ok by switching to use [PreferLocalPlacement] instead.
- Switching from [StatelessWorker(1)] to [PreferLocalPlacement] gets the test cases to work, but loses "no-roaming" guarantee.
- Add test case PreferLocalPlacementGrainShouldMigrateWhenKillSilo to PinnedGrainFailureTests
- Add PinnedGrainFailureTests test cases for behaviour when host silo
- Add [PinnedGrain] placement policy attribute.
- Assert.Fail if call to grain on dead silo completes successfully.
- Switch to using FluentAssertions for test condition checks
- Add test case combos for silo {Primary|Secondary} and action {Kill|Stop}
Implementation overview:

1. Grain class `TestGrains.PartitionGrain` is the PerSilo grain. It is annotated with `[PinnedGrain]` attribute to __guarantee local placement and "no-roaming"__.
2. Grain class `TestGrains.PartitionManagerGrain` is the registry for PerSilo grains. Only GrainId = 0 is used, so there will be a single instance of the Partition Manager in the cluster.
3. Class `TestGrains.PartitionStartup` is registered as a bootstrap provider in silo config.
4. During silo startup, the Orleans runtime calls `PartitionStartup.Init` on each silo, which:
   a. Calls `PartitionGrain.Start` to create the local PerSilo grain instance on that silo.
   b. Calls `PartitionManager.RegisterPartition` to declare that PerSilo grain with the registrar.

Xref: Discussion of Per-silo Grain feature request in issue dotnet#610

PartitionGrain registers itself with PartitionManager when activated and unregisters when deactivated.

Add Broadcast function to PartitionManagerGrain
- Add Broadcast function to PartitionManagerGrain (requires PM grain to be made [Reentrant] to work correctly)

Add PartitionGrain.Stop function
- Centralize all Register / Unregister functions for PartitionGrain.
- Mark activation for GC after Stop()
- Mark activation to never be paged out after Start()
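Putting that revision together, a rough sketch of the lifecycle-based registration it describes (simplified; not the actual PR code). `[PinnedGrain]` is the placement attribute proposed in the PR rather than a stock Orleans attribute (a placeholder definition is included below so the sketch compiles), and the hypothetical registry interface from the earlier sketch is assumed to have gained an `UnregisterPartition` method alongside `RegisterPartition`.

```csharp
using System;
using System.Threading.Tasks;
using Orleans;

// Placeholder for the placement attribute proposed in the PR (not part of stock Orleans).
[AttributeUsage(AttributeTargets.Class)]
public class PinnedGrainAttribute : Attribute { }

// Sketch of the revised PartitionGrain: placement is pinned to the silo where the
// bootstrap provider activates it, and registration is tied to the activation lifecycle.
[PinnedGrain]
public class PartitionGrain : Grain, IPartitionGrain
{
    private IPartitionManagerGrain manager;

    public override async Task OnActivateAsync()
    {
        manager = GrainFactory.GetGrain<IPartitionManagerGrain>(0);
        // Register this partition with the cluster-wide registry as soon as we are activated.
        await manager.RegisterPartition(this.GetPrimaryKey(), this.AsReference<IPartitionGrain>());
        await base.OnActivateAsync();
    }

    public override async Task OnDeactivateAsync()
    {
        // Unregister when the activation goes away (silo shutdown, explicit Stop, etc.).
        await manager.UnregisterPartition(this.GetPrimaryKey());
        await base.OnDeactivateAsync();
    }

    public Task Start()
    {
        // Keep this activation alive for the lifetime of the silo ("never be paged out").
        DelayDeactivation(TimeSpan.FromDays(365));
        return Task.CompletedTask;
    }

    public Task Stop()
    {
        // Mark the activation for collection; OnDeactivateAsync will unregister it.
        DeactivateOnIdle();
        return Task.CompletedTask;
    }
}
```

Tying registration to OnActivateAsync / OnDeactivateAsync keeps the registry consistent without the bootstrap provider having to know anything about unregistration.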
In #587, we discussed the occasional desire to break the location transparency of Virtual Actors and have the ability to message a grain (or otherwise) on an individual silo, or to broadcast a message to all silos.
In my case, I would have been happy to kick off a stream consumer inside a bootstrap provider and push messages to it using a stream, but this is not currently supported.
Use cases include:
So, how should we do this?