failed: MultiDcSplitBrainSpec #23306

Closed
patriknw opened this issue Jul 6, 2017 · 15 comments
Labels
failed Tickets that indicate a CI failure, likely a flakey test t:cluster:dc

@patriknw
Member

patriknw commented Jul 6, 2017

https://jenkins.akka.io:8498/job/akka-multi-node-repeat/13905

Could be real, but a lot has also changed that is not merged yet.

@johanandren

@patriknw patriknw added 1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted failed Tickets that indicate a CI failure, likely a flakey test pick t:cluster:dc 2 - pick next Used to mark issues which are next up in the queue to be worked on. The tag is non-binding and removed pick labels Jul 6, 2017
@johanandren
Member

Just so I remember it: I saw it fail once or twice locally as well while doing unrelated things.

@johanandren johanandren self-assigned this Jul 7, 2017
@johanandren johanandren added 3 - in progress Someone is working on this ticket and removed 1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted 2 - pick next Used to mark issues which are next up in the queue to be worked on. The tag is non-binding labels Jul 7, 2017
@johanandren
Member

Looking into it, I think it was related to waiting on unreachable becoming empty/nonEmpty, which we do not do anymore. Will keep the ticket open for a while anyway to see if it fails again on the CI server.

@johanandren
Member

Haven't seen it fail after more things got merged, so closing.

@johanandren johanandren added this to the 2.5.x milestone Jul 12, 2017
@raboof
Member

raboof commented Sep 19, 2017

Saw it (or something similar?) at https://jenkins.akka.io:8498/job/akka-nightly-2.12/332/consoleFull

@raboof raboof reopened this Sep 19, 2017
@jrudolph
Member

Another one: https://jenkins.akka.io:8498/job/akka-multi-node-nightly/5294/consoleFull

Stack dump for searchability:

[JVM-1] - must be able to have a data center member join while there is inter data center split (on node 'first', class akka.cluster.MultiDcSplitBrainMultiJvmNode1) *** FAILED *** (32 seconds, 315 milliseconds)
[JVM-1]   java.lang.AssertionError: assertion failed: timeout (25 seconds) during expectMsgClass waiting for class akka.cluster.ClusterEvent$ReachableDataCenter
[JVM-1]   at scala.Predef$.assert(Predef.scala:170)
[JVM-1]   at akka.testkit.TestKitBase$class.expectMsgClass_internal(TestKit.scala:508)
[JVM-1]   at akka.testkit.TestKitBase$class.expectMsgType(TestKit.scala:490)
[JVM-1]   at akka.testkit.TestKit.expectMsgType(TestKit.scala:850)
[JVM-1]   at akka.cluster.MultiDcSplitBrainSpec$$anonfun$unsplitDataCenters$3.apply$mcV$sp(MultiDcSplitBrainSpec.scala:121)
[JVM-1]   at akka.remote.testkit.MultiNodeSpec.runOn(MultiNodeSpec.scala:355)
[JVM-1]   at akka.cluster.MultiDcSplitBrainSpec.unsplitDataCenters(MultiDcSplitBrainSpec.scala:120)
[JVM-1]   at akka.cluster.MultiDcSplitBrainSpec$$anonfun$1$$anonfun$apply$mcV$sp$4$$anonfun$apply$mcV$sp$5.apply$mcV$sp(MultiDcSplitBrainSpec.scala:155)
[JVM-1]   at akka.cluster.MultiDcSplitBrainSpec$$anonfun$1$$anonfun$apply$mcV$sp$4$$anonfun$apply$mcV$sp$5.apply(MultiDcSplitBrainSpec.scala:137)
[JVM-1]   at akka.cluster.MultiDcSplitBrainSpec$$anonfun$1$$anonfun$apply$mcV$sp$4$$anonfun$apply$mcV$sp$5.apply(MultiDcSplitBrainSpec.scala:137)
[JVM-1]   at akka.testkit.TestKitBase$class.within(TestKit.scala:359)
[JVM-1]   at akka.testkit.TestKit.within(TestKit.scala:850)
[JVM-1]   at akka.testkit.TestKitBase$class.within(TestKit.scala:373)
[JVM-1]   at akka.testkit.TestKit.within(TestKit.scala:850)
[JVM-1]   at akka.cluster.MultiDcSplitBrainSpec$$anonfun$1$$anonfun$apply$mcV$sp$4.apply$mcV$sp

@jrudolph jrudolph added 1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted and removed 3 - in progress Someone is working on this ticket labels Sep 25, 2017
@jrudolph
Member

(Cleared the assignee since it was reopened. Maybe let's not reopen failures in the future but create new issues linking to old instances if related.)

@patriknw patriknw self-assigned this Sep 25, 2017
@patriknw patriknw added 3 - in progress Someone is working on this ticket and removed 1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted labels Sep 25, 2017
@patriknw
Member Author

Hmm, I don't think this is a multi-DC issue. It looks more like a remoting issue: third can't connect to second.

[third] [WARN] [09/23/2017 00:25:40.445] [MultiDcSplitBrainSpec-akka.remote.default-remote-dispatcher-19] [akka.trttl.gremlin.tcp://MultiDcSplitBrainSpec@third/system/endpointManager/reliableEndpointWriter-akka.trttl.gremlin.tcp%3A%2F%2FMultiDcSplitBrainSpec%40a5.moxie%3A56104-1] Association with remote system [akka.trttl.gremlin.tcp://MultiDcSplitBrainSpec@second] has failed, address is now gated for [1000] ms. Reason: [Association failed with [akka.trttl.gremlin.tcp://MultiDcSplitBrainSpec@second]] Caused by: [No response from remote for outbound association. Handshake timed out after [5000 ms].]

And it doesn't seem to "heal", even though third is repeatedly sending heartbeat messages to second.

Could it also be something with the test transport?

@patriknw
Member Author

Tried to reproduce without success.

@patriknw patriknw removed their assignment Sep 26, 2017
@patriknw patriknw added 1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted and removed 3 - in progress Someone is working on this ticket labels Sep 26, 2017
@chbatey
Member

chbatey commented Oct 24, 2017

@chbatey chbatey self-assigned this Nov 13, 2017
@chbatey chbatey added the 3 - in progress Someone is working on this ticket label Nov 13, 2017
@chbatey
Member

chbatey commented Nov 20, 2017

Note that the latest failure doesn't have quite the most up-to-date code for this test.

It fails because second (in DC1) doesn't see a MemberUp event for fifth after fifth has restarted.

The leader (third) in DC2 marks the restarted fifth as Up

[third] [INFO] [11/16/2017 17:04:34.551] [MultiDcSplitBrainSpec-akka.actor.default-dispatcher-3] [akka.cluster.Cluster(akka://MultiDcSplitBrainSpec)] Cluster Node [akka://MultiDcSplitBrainSpec@third] dc [dc2] - Leader is moving node [akka://MultiDcSplitBrainSpec@fifth] to [Up]

We can see cross-DC heartbeats happening after that.

You can see second and fifth do establish a connection:

[second] [DEBUG] [11/16/2017 17:04:35.270] [MultiDcSplitBrainSpec-akka.actor.default-dispatcher-18] [akka.remote.artery.Association(akka://MultiDcSplitBrainSpec)] Incarnation 2 of association to [akka://MultiDcSplitBrainSpec@fifth] with new UID [6369290769613153078] (old UID [2515755499573518128])

The last logged event from second before the test fails:

[second] [DEBUG] [11/16/2017 17:04:38.382] [MultiDcSplitBrainSpec-akka.remote.default-remote-dispatcher-8] [InboundHandshake$$anon$2(akka://MultiDcSplitBrainSpec)] # control stream HandshakeReq(akka://MultiDcSplitBrainSpec@third#-6435521115663539303,akka://MultiDcSplitBrainSpec@second)

In the 3 seconds between fifth being marked as Up again and second failing, there are no logs regarding gossip, but the verbose gossip logging is not on.

I also think there is an issue with the barrier: nodes move on before fifth is back up, so they start their asserts early and are more likely to time out.

Will add the gossip logging and fix the barrier.
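
For reference, a minimal sketch of the configuration that would turn on the verbose gossip logging mentioned above. The key names under `akka.cluster.debug` come from the akka-cluster reference.conf as I recall it for 2.5.x, not from this thread. Enabling heartbeat logging and raising the log level to DEBUG are assumptions about what else is useful here, so check the reference.conf of the version under test before relying on it.

```hocon
# Sketch only: verify key names against the akka-cluster reference.conf in use.
akka {
  # Assumption: the verbose cluster logs are emitted at DEBUG level.
  loglevel = DEBUG

  cluster.debug {
    # Log details about gossip sent and received.
    verbose-gossip-logging = on

    # Assumption: also useful for the cross-DC case; very noisy.
    verbose-heartbeat-logging = on
  }
}
```

With this on, a gap like the 3 seconds described above should show whether second received any gossip containing fifth at all, or whether gossip arrived but did not lead to a MemberUp event.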

chbatey added a commit to chbatey/akka that referenced this issue Nov 20, 2017
The last time this failed there was no gossip to or from a node that
didn't see fifth coming back.

Also note that this test doesn't quite test what it says as the split
brain is repaired before starting the second actor system but without
extensions to the multi jvm test kit this can't be improved.

Refs akka#23306
@chbatey
Member

chbatey commented Nov 20, 2017

I can't "fix" the barrier unless we can handle a remote system being restarted in the test kit. Raised:

#24025

ktoso pushed a commit that referenced this issue Dec 14, 2017
…cy (#24024)

The last time this failed there was no gossip to or from a node that
didn't see fifth coming back.

Also note that this test doesn't quite test what it says as the split
brain is repaired before starting the second actor system but without
extensions to the multi jvm test kit this can't be improved.

Refs #23306
@ktoso ktoso removed 1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted 3 - in progress Someone is working on this ticket labels Dec 14, 2017
@ktoso ktoso modified the milestones: 2.5.x, 2.5.9 Dec 14, 2017
@ktoso ktoso closed this as completed Dec 14, 2017