Fix ClusterShardingSpec flip rate #4165

Open
IgorFedchenko opened this issue Jan 25, 2020 · 0 comments

This is part of #3786, but dedicated to ClusterShardingSpec (Akka.Cluster.Sharding.Tests.MultiNode).

This test does not seem to be time-sensitive: at least the first failures are not caused by timeouts.

Actually, most failures happen here:

private void CreateCoordinator()
{
    var typeNames = new[]
    {
        TestCounterShardingTypeName, "rebalancingCounter", "RememberCounterEntities", "AnotherRememberCounter",
        "RememberCounter", "RebalancingRememberCounter", "AutoMigrateRememberRegionTest"
    };
    foreach (var typeName in typeNames)
    {
        // Both flags are derived from the type name itself.
        var rebalanceEnabled = typeName.ToLowerInvariant().StartsWith("rebalancing");
        var rememberEnabled = typeName.ToLowerInvariant().Contains("remember");
        // The coordinator runs as the child of a BackoffSupervisor (5s min and max backoff).
        var singletonProps = BackoffSupervisor.Props(
            CoordinatorProps(typeName, rebalanceEnabled, rememberEnabled),
            "coordinator",
            TimeSpan.FromSeconds(5),
            TimeSpan.FromSeconds(5),
            0.1,
            -1).WithDeploy(Deploy.Local);
        // One ClusterSingletonManager per type name hosts the supervised coordinator.
        Sys.ActorOf(ClusterSingletonManager.Props(
                singletonProps,
                PoisonPill.Instance,
                ClusterSingletonManagerSettings.Create(Sys)),
            typeName + "Coordinator");
    }
}

When ClusterSharding_should_work_in_single_node_cluster() is executed and the first node is joining the cluster, it calls CreateCoordinator(), which sometimes fails during the CoordinatorProps call like this:

[Node #2(first)][Node2:first][FAIL] Akka.Cluster.Sharding.Tests.DDataClusterShardingSpec.ClusterSharding_specs
[Node #2(first)][Node2:first][FAIL-EXCEPTION] Type: System.NullReferenceException
[Node #2(first)]--> [Node2:first][FAIL-EXCEPTION] Message: Object reference not set to an instance of an object.
[Node #2(first)]--> [Node2:first][FAIL-EXCEPTION] StackTrace:    at Hocon.Config.WithFallback(Config fallback)
[Node #2(first)]   at Akka.Cluster.Sharding.Tests.ClusterShardingSpec.CoordinatorProps(String typeName, Boolean rebalanceEntities, Boolean rememberEntities) in D:\a\1\s\src\contrib\cluster\Akka.Cluster.Sharding.Tests.MultiNode\ClusterShardingSpec.cs:line 486
[Node #2(first)]   at Akka.Cluster.Sharding.Tests.ClusterShardingSpec.CreateCoordinator() in D:\a\1\s\src\contrib\cluster\Akka.Cluster.Sharding.Tests.MultiNode\ClusterShardingSpec.cs:line 467
[Node #2(first)]   at Akka.Cluster.Sharding.Tests.ClusterShardingSpec.Join(RoleName from, RoleName to) in D:\a\1\s\src\contrib\cluster\Akka.Cluster.Sharding.Tests.MultiNode\ClusterShardingSpec.cs:line 446
[Node #2(first)]   at Akka.Cluster.Sharding.Tests.ClusterShardingSpec.<ClusterSharding_should_work_in_single_node_cluster>b__28_0() in D:\a\1\s\src\contrib\cluster\Akka.Cluster.Sharding.Tests.MultiNode\ClusterShardingSpec.cs:line 587
[Node #2(first)]   at Akka.TestKit.TestKitBase.<>c__DisplayClass150_0.<Within>b__0() in D:\a\1\s\src\core\Akka.TestKit\TestKitBase_Within.cs:line 57
[Node #2(first)]   at Akka.TestKit.TestKitBase.Within[T](TimeSpan min, TimeSpan max, Func`1 function, String hint, Nullable`1 epsilonValue) in D:\a\1\s\src\core\Akka.TestKit\TestKitBase_Within.cs:line 134
[Node #2(first)]   at Akka.TestKit.TestKitBase.Within(TimeSpan min, TimeSpan max, Action action, String hint, Nullable`1 epsilonValue) in D:\a\1\s\src\core\Akka.TestKit\TestKitBase_Within.cs:line 57
[Node #2(first)]   at Akka.TestKit.TestKitBase.Within(TimeSpan max, Action action, Nullable`1 epsilonValue) in D:\a\1\s\src\core\Akka.TestKit\TestKitBase_Within.cs:line 32
[Node #2(first)]   at Akka.Cluster.Sharding.Tests.ClusterShardingSpec.ClusterSharding_should_work_in_single_node_cluster() in D:\a\1\s\src\contrib\cluster\Akka.Cluster.Sharding.Tests.MultiNode\ClusterShardingSpec.cs:line 585
[Node #2(first)]   at Akka.Cluster.Sharding.Tests.ClusterShardingSpec.ClusterSharding_specs() in D:\a\1\s\src\contrib\cluster\Akka.Cluster.Sharding.Tests.MultiNode\ClusterShardingSpec.cs:line 529

Sometimes it fails a little bit later, with:

[Node #6(fifth)]Cause: [akka://DDataClusterShardingSpec/user/rebalancingCounterCoordinator#1327172867]: Akka.Actor.ActorInitializationException: Exception during creation ---> System.TypeLoadException: Error while creating actor instance of type Akka.Cluster.Tools.Singleton.ClusterSingletonManager with 3 args: (Akka.Actor.Props,<PoisonPill>,Akka.Cluster.Tools.Singleton.ClusterSingletonManagerSettings) ---> System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> Hocon.ConfigurationException: min-number-of-hand-over-retries must be >= 1
[Node #6(fifth)]   at Akka.Cluster.Tools.Singleton.ClusterSingletonManager..ctor(Props singletonProps, Object terminationMessage, ClusterSingletonManagerSettings settings) in D:\a\1\s\src\contrib\cluster\Akka.Cluster.Tools\Singleton\ClusterSingletonManager.cs:line 562
[Node #6(fifth)]   --- End of inner exception stack trace ---
[Node #6(fifth)]   at System.RuntimeMethodHandle.InvokeMethod(Object target, Object[] arguments, Signature sig, Boolean constructor, Boolean wrapExceptions)
[Node #6(fifth)]   at System.Reflection.RuntimeConstructorInfo.Invoke(BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
[Node #6(fifth)]   at System.RuntimeType.CreateInstanceImpl(BindingFlags bindingAttr, Binder binder, Object[] args, CultureInfo culture, Object[] activationAttributes)
[Node #6(fifth)]   at Akka.Actor.Props.ActivatorProducer.Produce() in D:\a\1\s\src\core\Akka\Actor\Props.cs:line 639
[Node #6(fifth)]   at Akka.Actor.Props.NewActor() in D:\a\1\s\src\core\Akka\Actor\Props.cs:line 575
[Node #6(fifth)]   --- End of inner exception stack trace ---
[Node #6(fifth)]   at Akka.Actor.Props.NewActor() in D:\a\1\s\src\core\Akka\Actor\Props.cs:line 577
[Node #6(fifth)]   at Akka.Actor.ActorCell.CreateNewActorInstance() in D:\a\1\s\src\core\Akka\Actor\ActorCell.cs:line 351
[Node #6(fifth)]   at Akka.Actor.ActorCell.<>c__DisplayClass117_0.<NewActor>b__0() in D:\a\1\s\src\core\Akka\Actor\ActorCell.cs:line 336
[Node #6(fifth)]   at Akka.Actor.ActorCell.UseThreadContext(Action action) in D:\a\1\s\src\core\Akka\Actor\ActorCell.cs:line 375
[Node #6(fifth)]   at Akka.Actor.ActorCell.NewActor() in D:\a\1\s\src\core\Akka\Actor\ActorCell.cs:line 342
[Node #6(fifth)]   at Akka.Actor.ActorCell.Create(Exception failure) in D:\a\1\s\src\core\Akka\Actor\ActorCell.DefaultMessages.cs:line 422
[Node #6(fifth)]   --- End of inner exception stack trace ---
[Node #6(fifth)]   at Akka.Actor.ActorCell.Create(Exception failure) in D:\a\1\s\src\core\Akka\Actor\ActorCell.DefaultMessages.cs:line 439
[Node #6(fifth)]   at Akka.Actor.ActorCell.SysMsgInvokeAll(EarliestFirstSystemMessageList messages, Int32 currentState) in D:\a\1\s\src\core\Akka\Actor\ActorCell.DefaultMessages.cs:line 256

So in both cases the code is just trying to read one setting or another from Settings.Config, and that fails.
Need to check whether this is the root issue (maybe some HOCON problem). If so, this may change once we move to the standalone HOCON library.
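
For reference, the second failure is thrown from the ClusterSingletonManager constructor when the value it reads for min-number-of-hand-over-retries is below 1. Below is a minimal sketch of the relevant HOCON section, assuming the key sits under akka.cluster.singleton as in the Akka.Cluster.Tools reference configuration. The values are illustrative and are not copied from this spec's config:

```hocon
akka.cluster.singleton {
  # Illustrative values only (assumption), not taken from the test config.
  hand-over-retry-interval = 1s
  # The exception message above requires this value to be >= 1.
  min-number-of-hand-over-retries = 10
}
```

If a section like this is present and the lookup still yields a value below 1, that would point at config resolution rather than at the singleton settings themselves.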

If this is only a consequence of some inner exception, and the settings get cleaned up because of another failure (is that possible? They should be immutable, right?), we need to understand what is going on under the hood.

Here is the part of the log showing what happens when the first node tries to join the cluster and create a ClusterSingletonManager:

[Node #2(first)][INFO][1/23/2020 10:19:56 PM][Thread 0018][Cluster (akka://DDataClusterShardingSpec)] Cluster Node [akka.tcp://DDataClusterShardingSpec@localhost:1752] - Node [akka.tcp://DDataClusterShardingSpec@localhost:1752] is JOINING itself (with roles [backend]) and forming a new cluster
[Node #2(first)][INFO][1/23/2020 10:19:56 PM][Thread 0018][Cluster (akka://DDataClusterShardingSpec)] Cluster Node [akka.tcp://DDataClusterShardingSpec@localhost:1752] - Leader is moving node [akka.tcp://DDataClusterShardingSpec@localhost:1752] to [Up]
[Node #2(first)][INFO][1/23/2020 10:19:57 PM][Thread 0019][akka.tcp://DDataClusterShardingSpec@localhost:1752/user/TestCounterCoordinator] Singleton manager started singleton actor [akka://DDataClusterShardingSpec/user/TestCounterCoordinator/singleton] 
[Node #2(first)][INFO][1/23/2020 10:19:57 PM][Thread 0019][akka.tcp://DDataClusterShardingSpec@localhost:1752/user/TestCounterCoordinator] ClusterSingletonManager state change [Start -> Oldest] Akka.Cluster.Tools.Singleton.Uninitialized
[Node #2(first)]---------------DISPOSING--------------------
[Node #2(first)][INFO][1/23/2020 10:19:58 PM][Thread 0019][akka.tcp://DDataClusterShardingSpec@localhost:1752/user/TestConductorClient] Terminating connection to multi-node test controller due to [Akka.Actor.FSMBase+Shutdown]
[Node #2(first)][INFO][1/23/2020 10:19:58 PM][Thread 0032][PlayerHandler (akka://DDataClusterShardingSpec)] Client: disconnecting [::1]:1758 from [::1]:4711
[Node #2(first)][WARNING][1/23/2020 10:19:58 PM][Thread 0018][akka://DDataClusterShardingSpec/user/rebalancingCounterCoordinator] DeadLetter from [akka://DDataClusterShardingSpec/system/cluster/$b#301104707] to [akka://DDataClusterShardingSpec/user/rebalancingCounterCoordinator#1586511960]: <Received dead letter from [akka://DDataClusterShardingSpec/system/cluster/$b#301104707]: Akka.Cluster.Tools.Singleton.StartOldestChangedBuffer>
[Node #2(first)][WARNING][1/23/2020 10:19:58 PM][Thread 0018][akka://DDataClusterShardingSpec/user/TestCounterCoordinator/singleton/coordinator] DeadLetter from [akka://DDataClusterShardingSpec/deadLetters] to [akka://DDataClusterShardingSpec/user/TestCounterCoordinator/singleton/coordinator#2037937872]: <Received dead letter from [akka://DDataClusterShardingSpec/deadLetters]: Akka.Cluster.Sharding.PersistentShardCoordinator+StateInitialized>
[Node #1(controller)][ERROR][1/23/2020 10:19:58 PM][Thread 0016][akka://DDataClusterShardingSpec/user/controller/barriers] unannounced disconnect of RoleName(first)
[Node #1(controller)]Cause: Akka.Remote.TestKit.BarrierCoordinator+ClientLostException: unannounced disconnect of RoleName(first)
[Node #1(controller)]   at Akka.Remote.TestKit.BarrierCoordinator.<>c__DisplayClass14_0.<InitFSM>b__5(ClientDisconnected disconnected) in D:\a\1\s\src\core\Akka.Remote.TestKit\BarrierCoordinator.cs:line 523
[Node #1(controller)]   at Akka.Case.With[TMessage](Action`1 action) in D:\a\1\s\src\core\Akka\PatternMatch.cs:line 107
[Node #1(controller)]   at Akka.Remote.TestKit.BarrierCoordinator.<InitFSM>b__14_0(Event`1 event) in D:\a\1\s\src\core\Akka.Remote.TestKit\BarrierCoordinator.cs:line 528
[Node #1(controller)]   at Akka.Actor.FSM`2.<>c__DisplayClass52_0.<OrElse>b__0(Event`1 event) in D:\a\1\s\src\core\Akka\Actor\FSM.cs:line 1101
[Node #1(controller)]   at Akka.Actor.FSM`2.ProcessEvent(Event`1 fsmEvent, Object source) in D:\a\1\s\src\core\Akka\Actor\FSM.cs:line 1213
[Node #1(controller)]   at Akka.Actor.FSM`2.Receive(Object message) in D:\a\1\s\src\core\Akka\Actor\FSM.cs:line 1115
[Node #1(controller)]   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message) in D:\a\1\s\src\core\Akka\Actor\ActorBase.cs:line 158
[Node #1(controller)]   at Akka.Actor.ActorCell.ReceiveMessage(Object message) in D:\a\1\s\src\core\Akka\Actor\ActorCell.DefaultMessages.cs:line 177
[Node #1(controller)]   at Akka.Actor.ActorCell.Invoke(Envelope envelope) in D:\a\1\s\src\core\Akka\Actor\ActorCell.DefaultMessages.cs:line 83
[Node #1(controller)][Akka.Remote.TestKit.MsgEncoder][Debug][1/23/2020 10:19:58 PM]Encoding Akka.Remote.TestKit.BarrierResult
[Node #1(controller)][Akka.Remote.TestKit.Proto.ProtobufEncoder][Debug][1/23/2020 10:19:58 PM][[::1]:4711 --> [::1]:1763] Encoding { "barrier": { "name": "first-joined", "op": "Failed" } } into Protobuf
[Node #1(controller)][Akka.Remote.TestKit.MsgEncoder][Debug][1/23/2020 10:19:58 PM]Encoding Akka.Remote.TestKit.BarrierResult
[Node #1(controller)][Akka.Remote.TestKit.Proto.ProtobufEncoder][Debug][1/23/2020 10:19:58 PM][[::1]:4711 --> [::1]:1765] Encoding { "barrier": { "name": "first-joined", "op": "Failed" } } into Protobuf
[Node #1(controller)][Akka.Remote.TestKit.MsgEncoder][Debug][1/23/2020 10:19:58 PM]Encoding Akka.Remote.TestKit.BarrierResult
[Node #1(controller)][Akka.Remote.TestKit.Proto.ProtobufEncoder][Debug][1/23/2020 10:19:58 PM][[::1]:4711 --> [::1]:1761] Encoding { "barrier": { "name": "first-joined", "op": "Failed" } } into Protobuf

So there is some error, but it is not clear what happened or why RoleName(first) is reported as disconnected.

Attaching full log for details:

ClusterShardingSpec_fail_log.txt
