
=cls #17846 Use CRDTs instead of PersistentActor to remember the state of the ShardCoordinator #17871

Merged: 1 commit merged into akka:master on Aug 20, 2015

Conversation

@oseval commented Jun 30, 2015

This is not a real PR yet; it only shows my attempt to change the cluster-sharding dependency from akka-persistence to akka-ddata. I will update it incrementally as I find time for this.

Idea:

  1. The distributed state of the ShardRegions is all we need in cluster sharding.
  2. That distributed state can be handled by the akka-ddata module.
  3. Inconsistency is only possible if we allow two ShardRegions to allocate the same Shard simultaneously.
  4. We can avoid that inconsistency with a singleton Coordinator that grants a ShardRegion permission to allocate a Shard: R1 asks C for permission to allocate S1, C grants it, R1 allocates S1 and then updates DData plus its inner state. The Coordinator receives these changes and updates its own inner state (see the sketch after this list).
  5. If the Coordinator fails, it can be recovered from DData, which is up to date on all nodes in the cluster.
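A rough sketch of the handshake described in point 4, with hypothetical message names (these are illustrative only and do not match the actual cluster-sharding protocol types):

// Hypothetical protocol messages for the allocation handshake; names are
// illustrative only, not the real cluster-sharding internals.
import akka.actor.ActorRef

final case class RequestShardAllocation(shardId: String, region: ActorRef) // R1 -> C
final case class ShardAllocationGranted(shardId: String)                   // C  -> R1
final case class ShardAllocated(shardId: String, region: ActorRef)         // R1 updates DData; C observes the change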

Comments are welcome!

It is quite possible that I am missing something in this concept, so please correct me.

And I am sorry for my English.

@akka-ci commented Jun 30, 2015

Can one of the repo owners verify this patch?

@oseval (Author) commented Jul 2, 2015

@patriknw I have some questions about cluster sharding; I need to understand the underlying idea. Only two questions for now:

  1. Is it critical that the Coordinator reallocates the shards of terminated regions immediately? Could a shard from a terminated region instead be allocated only when the next message is sent to that shard?
  2. Would it be acceptable for a terminating region to remove itself from the distributed state by sending a message to the replicator in postStop?

@patriknw (Member) commented Jul 2, 2015

Hi @SmLin,
I have not had time to look at your code yet; I will try to get to it tomorrow.

There are two things that are persisted.

  1. Coordinator state, the allocations of all shards
  2. The started entities within each shard, see the Remember Entities section in the documentation.

Let's try to figure out the coordinator state first.
For that, the allocation is always done on the first message to a shard, also after rebalance and crash.

It is possible to send a message to the replicator in postStop, but crashes must also be handled, and if the node is removed from the cluster the replicator might not be alive any more.

@oseval (Author) commented Jul 3, 2015

@patriknw The PR is not ready for review yet, so please don't waste your time on it. I will notify you when it is ready.

@patriknw (Member) commented Jul 3, 2015

@SmLin My thinking around the design.

I think the coordinator should remain a singleton, because it is important with a central decision maker of where the shards are to be allocated. We could start with just saving the coordinator state in a LWWRegister, using WriteMajority and ReadMajority.

The LWWRegister is pretty safe here since the singleton failover is not instant, but we can later create a more customized data type that makes use of akka.cluster.Member.isOlderThan, we want to use the data from the newest singleton. That can be rather useful for singletons in general.

That has the small consistency problems that I sketched in the issue, but perhaps I'm looking too much for the bad scenarios, and perhaps we can make it more safe somehow.

The state of the PersistentShard could be stored in one ORSet per shard id.
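A minimal sketch of what those two ideas could look like with the akka.cluster.ddata Replicator API; the key names, state shape and timeouts here are illustrative assumptions, not the actual implementation:

import scala.concurrent.duration._
import akka.actor.{ ActorRef, ActorSystem }
import akka.cluster.Cluster
import akka.cluster.ddata._
import akka.cluster.ddata.Replicator._

object DDataCoordinatorSketch {
  // Illustrative key and state shape; the real PR names and shapes these differently.
  final case class CoordinatorState(shards: Map[String, ActorRef])
  val CoordinatorStateKey = LWWRegisterKey[CoordinatorState]("ShardCoordinatorState")
  def shardKey(shardId: String) = ORSetKey[String]("shard-" + shardId)

  def example(system: ActorSystem, state: CoordinatorState, shardId: String, entityId: String): Unit = {
    implicit val node = Cluster(system)
    val replicator = DistributedData(system).replicator

    // Coordinator state kept in a single LWWRegister, written with majority consistency.
    replicator ! Update(CoordinatorStateKey, LWWRegister(state), WriteMajority(5.seconds))(_.withValue(state))

    // Remembered entities of one shard kept in an ORSet per shard id.
    replicator ! Update(shardKey(shardId), ORSet.empty[String], WriteMajority(5.seconds))(_ + entityId)
  }
}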

I will be on vacation the next 4 weeks, so if I'm unresponsive here that is not because I'm not interested. Thanks for working on this.

@oseval (Author) commented Jul 3, 2015

@patriknw Thank you too.
I have created a first version of the code. It is mostly incomplete and does not work yet, but I wrote a lot of comments everywhere. I have not worked on PersistentShard, because that is much more work in one go - I hope it can be done step by step. I will wait for your response.
Enjoy your holiday!
I will be working on this during my vacation next week :)

case g @ GetSuccess(RegionsKey, _) =>
  val regions: Map[ActorRef, Set[(ShardId, ActorRef)]] = g.get(RegionsKey).entries
  regions.foreach { case (region, shards) =>
    // TODO: Would the Coordinator be notified with a `Terminated` message if the region was terminated before `watch`?
Member: yes, it will

oseval added a commit to oseval/akka that referenced this pull request Jul 8, 2015
@oseval (Author) commented Jul 8, 2015

I have done the first step as @patriknw suggested. I will move the PersistentShard onto CRDTs in the next step, since that is possible.

The PR is now ready for review.
@rkuhn Could somebody review this PR while Patrik is on vacation?

import akka.cluster.ddata.Replicator.Update
import akka.cluster.ddata.Replicator.UpdateSuccess
import akka.cluster.ddata.Replicator.UpdateTimeout
import akka.cluster.ddata.Replicator.WriteMajority
Contributor: please use a wildcard import for these
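For example, collapsing the imports above into the wildcard form:

import akka.cluster.ddata.Replicator._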

@rkuhn (Contributor) commented Jul 9, 2015

Thanks a lot for this idea and PR, @SmLin! There are some aspects that I cannot currently think through in full, revolving around the semantics of shard allocation. Currently you send back the result immediately before knowing that the CRDT update has been disseminated. The old implementation makes sure that the allocation has been persisted before it can be handed out and acted upon, which ensures that two different regions cannot create the same shard when the coordinator’s machine fails and a new coordinator makes a different decision.
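A minimal sketch of the ordering concern, assuming hypothetical message and actor names: the coordinator only answers the requesting region once the Replicator has acknowledged the write, by carrying the pending reply along in the Update's request field.

import scala.concurrent.duration._
import akka.actor.{ Actor, ActorRef }
import akka.cluster.Cluster
import akka.cluster.ddata._
import akka.cluster.ddata.Replicator._

// Hypothetical types for illustration only.
final case class AllocateShard(shardId: String, region: ActorRef)
final case class ShardHomeAllocated(shardId: String, region: ActorRef)

class CoordinatorSketch(replicator: ActorRef, stateKey: LWWRegisterKey[Map[String, ActorRef]]) extends Actor {
  implicit val node = Cluster(context.system)

  def receive = {
    case AllocateShard(shardId, region) =>
      // Do NOT reply yet; attach the pending reply to the Update and wait for the ack.
      replicator ! Update(stateKey, LWWRegister(Map.empty[String, ActorRef]), WriteMajority(5.seconds),
        request = Some(ShardHomeAllocated(shardId, region)))(reg => reg.withValue(reg.value.updated(shardId, region)))

    case UpdateSuccess(_, Some(ShardHomeAllocated(shardId, region))) =>
      // Only now has the allocation been written with majority consistency; hand it out.
      region ! ShardHomeAllocated(shardId, region)
  }
}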

The other point is related: the old implementation pulled shard location information lazily into the regions when needed, but CRDT replication would allow us to just listen to the same data at all regions, saving roundtrips to the coordinator.

One thing I really like about this approach is that the CRDT’s lifetime is coupled to the lifetime of the cluster itself. Persistence means that allocations can outlive the whole cluster, but when starting from scratch we may as well just reallocate things instead. The only missing feature then is that instantiated entities will not automatically be resurrected, but my thinking goes in the direction of making that an optional add-on since not everybody needs that and then we would have completely separated the sharding from the persistence concerns.

@oseval (Author) commented Jul 10, 2015

@rkuhn I see two ways forward.
First: fix the "double shard allocation" issue and write a test for it, and then start the "next step" (in a follow-up PR) in which the distributed data is written only by its owners - the data about allocated shards inside the shards, and the data about started regions inside the regions. And of course I will remove all dependencies on persistence.
Second: do all of this inside the current PR, but that is a lot of changes in one go. Of course I will do whatever you say, but I think the first option is better.

@rkuhn (Contributor) commented Jul 10, 2015

Yes, I agree that the first proposal is the way to go; optimizing the use of the CRDTs can and should be a second step. Please create a ticket so that we have it properly on the radar.

# Settings for the coordinator singleton. Same layout as akka.cluster.singleton.

# Timeout for waiting for the initial distributed state (the initial state will be queried again if the timeout is hit)
waiting-for-state-timeout = 30 s
Member: a more reasonable default is 5 s

@patriknw (Member) commented:

@SmLin I took a first look. I think it is in the right direction. I appreciate the step-wise approach you take on this. Great work!

To set the expectations right, we will probably have to support both this and the persistent sharding and we need to discuss how to package those. Anyway, don't worry about that now. Please continue in this PR, and separate the different steps with separate commits.

@oseval (Author) commented Jul 14, 2015

@rkuhn @patriknw I am stuck on the consistency requirements for the duplicated-shards case. To avoid such duplication, the coordinator must either propagate its state changes to the whole cluster (WriteAll in CRDT terms) or synchronize state with all regions before moving to the active state. In the second step I will change the source of distributed state changes from the coordinator to the regions and shards, so after that step duplication will be avoided automatically "by design".
Now the question: is WriteMajority enough to merge this PR, or must it be implemented with guaranteed consistency?
Currently I think this step is finished and needs review before I start the second step.
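For reference, the two write consistency levels being weighed here; both exist in akka.cluster.ddata.Replicator, and the timeouts below are arbitrary:

import scala.concurrent.duration._
import akka.cluster.ddata.Replicator.{ WriteAll, WriteMajority }

object WriteConsistencyExamples {
  val majority = WriteMajority(timeout = 5.seconds) // acknowledged once a majority of nodes have stored the update
  val all      = WriteAll(timeout = 5.seconds)      // acknowledged only once every node has stored the update
}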

@oseval (Author) commented Aug 2, 2015

I have to point out one case where avoiding shard duplication is impossible. If the cluster becomes fragmented (due to a broken link between two datacenters, or in some other cases), the situation ends up with two identical coordinators, one in each partition.
When the coordinator extends PersistentActor, this crashes both coordinators (due to concurrent writes by two persistent actors) and can only be repaired by cleaning the persistent data from the database.
With CRDTs, this eventually leads to two identical shards/entities being started, and those entities may crash if they are persistent actors.

"The ShardCoordinator was unable to update a distributed state with error {}. Was retrying with event {}.",
error,
evt)
sendUpdate(evt)
Member: ModifyFailure only happens if the modify function in the Update message throws an exception, and that should never happen unless you have made a programming error. There is no point in trying again - it will probably fail again. Instead you should just re-throw the cause.

Author: Could it happen due to message/update reordering, and would it then be fixed after some time?

Member: I don't think so. A typical error would be something like a NullPointerException, i.e. it should never happen unless we have a bug somewhere, and then we don't want to continue anyway.
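A minimal sketch of the suggested handling: ModifyFailure carries the original exception as its cause, so the actor can simply escalate it instead of retrying, while a timeout can still be retried. The sendUpdate helper is hypothetical.

import akka.cluster.ddata.Replicator.{ ModifyFailure, UpdateTimeout }

object UpdateResponseHandling {
  // Hypothetical retry helper assumed by the sketch below.
  def sendUpdate(evt: Any): Unit = ()

  def receiveUpdateResponse: PartialFunction[Any, Unit] = {
    case UpdateTimeout(_, Some(evt)) =>
      // The write may still propagate via gossip; retrying the event is reasonable here.
      sendUpdate(evt)
    case f: ModifyFailure[_] =>
      // A bug in the modify function: don't retry, re-throw the cause and let supervision handle it.
      throw f.cause
  }
}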

@oseval (Author) commented Aug 3, 2015

We are attacking that issue from other angles.

Do I understand correctly that this means the issue does not have to be solved in this PR?

@patriknw (Member) commented Aug 3, 2015

Do I understand correctly that this means the issue does not have to be solved in this PR?

yes

@oseval (Author) commented Aug 8, 2015

@patriknw It turns out that a pure ddata implementation (without direct message exchanges between coordinator, regions and shards) is tremendously slow, since gossip rounds have a timeout. So I see two options:

  1. The Replicator would have to send gossip immediately to the nodes where subscribers exist.
  2. This PR currently seems to work, and I think it is in any case better than saving state to disk.
    Using ddata for more than just the coordinator state would give only slightly more safety, nothing more.

The probability of shard duplication, even at the moments of cluster failure, seems very small. I don't know exactly, but it is possible that such a failure can only happen if auto-down is enabled - and if auto-down is enabled, saving the coordinator state to disk has problems too.

I currently consider this a complete solution. Any better solution would require a completely different idea and implementation; sorry, I don't have time for that.

@@ -155,6 +155,12 @@ object ShardRegion {
@SerialVersionUID(1L) final case object GracefulShutdown extends ShardRegionCommand

/**
* We must be sure that a shard is initialized before we start sending messages to it.
Member: what is the reason for the changes in this file?

Member: as mentioned elsewhere: sometimes the persistent shard can fail during the replay process due to temporary journal unavailability. User messages sent to such a shard just after its creation, but before the failure, would be lost. This code removes that problem.

@patriknw (Member) commented Aug 9, 2015

@SmLin I agree that the solution in this PR is good, and it is a great advantage that it is so similar to the current persistent implementation, since we need to support both approaches in 2.4 (as described in my comment above). Would you be able to wrap up that part, or would you like me to make those adjustments?

@oseval (Author) commented Aug 10, 2015

I will try to do it myself. Thank you for the review.

@patriknw (Member) commented:

Thanks!

@patriknw (Member) commented:

To clarify my idea: we should make it possible to switch between the two implementations with a config setting. The dependency on distributed-data should be in provided scope, so the user must add the dependency and switch the config setting to use this implementation. Does that make sense?
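For illustration only, such a switch could look roughly like the setting that eventually shipped in Akka 2.4; the exact key and values below are an assumption, not taken from this PR:

akka.cluster.sharding {
  # "persistence" (default) or "ddata"; selecting ddata requires the
  # distributed-data module on the classpath.
  state-store-mode = ddata
}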

@oseval (Author) commented Aug 11, 2015

Yes, that makes sense. Perhaps the same could now also be done for the akka-persistence dependency.

@oseval (Author) commented Aug 17, 2015

@patriknw I wrote two implementations of the coordinator, selected by a config parameter. I duplicated the tests for the ddata coordinator - currently I don't know how to do that more elegantly. Perhaps you could help me.

context.watch(region)
sender() ! RegisterAck(self)
region ! RegisterAck(self)
Member: sender() and region are the same, right? But now we use region here and sender() a few lines above.

@patriknw (Member) commented:

@SmLin Excellent separation! I have a few comments, but it looks like this can be merged soon. We are releasing 2.4-RC1 on Friday, so it would be great if we can merge it on Thursday at the latest.

Could you also add a section about the alternative ddata mode to the docs:

  • akka-docs/rst/scala/cluster-sharding.rst
  • akka-docs/rst/java/cluster-sharding.rst

Regarding the tests I think you could split them up into an abstract class and two implementations. The only difference should be the config (and some initial creation of actors). See example of similar split:
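For illustration only (this is not the example referred to above; the spec names and config key are hypothetical), such a split could look roughly like this:

import akka.actor.ActorSystem
import akka.testkit.TestKit
import com.typesafe.config.ConfigFactory
import org.scalatest.WordSpecLike

// One abstract spec holds the shared test body; only the configuration differs.
abstract class ClusterShardingCoordinatorSpec(stateStoreMode: String)
  extends TestKit(ActorSystem("ShardingSpec",
    ConfigFactory.parseString(s"akka.cluster.sharding.state-store-mode = $stateStoreMode")))
  with WordSpecLike {

  "Cluster sharding" must {
    "allocate a shard for the first message sent to it" in {
      // shared assertions, identical for both coordinator implementations
    }
  }
}

// The two concrete specs differ only in the chosen coordinator implementation.
class PersistentCoordinatorSpec extends ClusterShardingCoordinatorSpec("persistence")
class DDataCoordinatorSpec extends ClusterShardingCoordinatorSpec("ddata")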

case evt: ClusterDomainEvent ⇒ receiveClusterEvent(evt)
case state: CurrentClusterState ⇒ receiveClusterState(state)
case msg: CoordinatorMessage ⇒ receiveCoordinatorMessage(msg)
case cmd: ShardRegionCommand ⇒ receiveCommand(cmd)
case msg if extractEntityId.isDefinedAt(msg) ⇒ deliverMessage(msg, sender())
case msg: RestartShard ⇒ deliverMessage(msg, sender())
Author: The RestartShard message handling did not exist before - it is a bug fix.

Member: great catch, it's scheduled by receiveTerminated

@oseval force-pushed the sharding-with-ddata branch 2 times, most recently from fc7c106 to 241f4d8 on August 18, 2015 22:26
@patriknw (Member) commented:

LGTM
@SmLin please squash into one commit and I will merge it. Excellent contribution!

I will add a few notes about experimental and such in the docs, but I can take care of that.

@oseval (Author) commented Aug 20, 2015

@patriknw Squashed.

@patriknw (Member) commented:

thanks!

patriknw added a commit that referenced this pull request Aug 20, 2015
=cls #17846 Use CRDTs instead of PersistentActor to remember the state of the ShardCoordinator
@patriknw merged commit 453a554 into akka:master Aug 20, 2015
@patriknw (Member) commented:

Refs #17846
