Migrate multi node testkit to Netty 4. #486

Closed
wants to merge 3 commits into from

Conversation

He-Pin
Member

@He-Pin He-Pin commented Jul 15, 2023

refs: #462

I verified it locally with cluster/MultiJvm/test.

@He-Pin He-Pin force-pushed the netty4 branch 2 times, most recently from 22cc854 to ad512eb Compare July 15, 2023 14:39
@He-Pin He-Pin added this to the 1.1.0 milestone Jul 15, 2023
pjfanning
pjfanning previously approved these changes Jul 30, 2023
Contributor

@pjfanning pjfanning left a comment

lgtm - pekko-multi-node-testkit is documented to be 'API may change'

private[pekko] trait RemoteConnection {
  def channel: Channel
  def shutdown(): Unit
}
Member Author

A new helper trait is added.

@He-Pin
Member Author

He-Pin commented Jul 30, 2023

Thanks @pjfanning for the review. I'm waiting for another LGTM before merging it.

@He-Pin He-Pin requested review from raboof and mdedetrich July 31, 2023 04:40
@mdedetrich
Contributor

@He-Pin So I am getting failures when running sbt cluster/MultiJvm/test. I have posted the entire output here: https://gist.github.com/mdedetrich/9a8a1f7d73f3840c6378c3f549c284a2

I ran the test with JDK 11 + export JAVA_8_HOME=/usr/lib/jvm/java-8-openjdk/ and also ran sbt +clean beforehand. I noticed that some exceptions are being thrown regarding connection refused; maybe that's the reason? (I am not that familiar with these tests.)

@He-Pin
Member Author

He-Pin commented Aug 1, 2023

@mdedetrich
https://gist.github.com/He-Pin/c7232c8635a69b16e02b0dfc90b20f01
https://gist.github.com/He-Pin/f1349ef25c80cb4ed822b3a3fab7a4bf

These are my outputs, and I don't see the errors from your gist; not sure why.

I just rebased on the current main.

Contributor

@mdedetrich mdedetrich left a comment

I'm sorry, but I have to block the PR, at least for now, because on my system it's causing a regression in cluster/MultiJvm/test.

I did an additional run with the latest rebase and it's still failing (see https://gist.github.com/mdedetrich/5f68a3e3faa9cf8ee220547618fb89e2); however, if I check out current main, the tests pass fine (see https://gist.github.com/mdedetrich/73ceedc04f51fda4381e9a3af7ec7bb4).

I had a deeper look, and it appears that the core issue is that it's unable to establish a connection, i.e.

[info] [JVM-3-MultiDcMultiJvmNode3] [ERROR] [08/01/2023 09:16:40.161] [MultiDc-pekko.actor.internal-dispatcher-3] [pekko://MultiDc/system/TestConductorClient] Connection refused: localhost/127.0.0.1:4711
[info] [JVM-3-MultiDcMultiJvmNode3] org.apache.pekko.actor.ActorInitializationException: pekko://MultiDc/system/TestConductorClient: exception during creation, root cause message: [Connection refused]
[info] [JVM-3-MultiDcMultiJvmNode3]     at org.apache.pekko.actor.ActorInitializationException$.apply(Actor.scala:206)
[info] [JVM-3-MultiDcMultiJvmNode3]     at org.apache.pekko.actor.ActorCell.create(ActorCell.scala:679)
[info] [JVM-3-MultiDcMultiJvmNode3]     at org.apache.pekko.actor.ActorCell.invokeAll$1(ActorCell.scala:523)
[info] [JVM-3-MultiDcMultiJvmNode3]     at org.apache.pekko.actor.ActorCell.systemInvoke(ActorCell.scala:545)
[info] [JVM-3-MultiDcMultiJvmNode3]     at org.apache.pekko.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:305)
[info] [JVM-3-MultiDcMultiJvmNode3]     at org.apache.pekko.dispatch.Mailbox.run(Mailbox.scala:240)
[info] [JVM-3-MultiDcMultiJvmNode3]     at org.apache.pekko.dispatch.Mailbox.exec(Mailbox.scala:253)
[info] [JVM-3-MultiDcMultiJvmNode3]     at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
[info] [JVM-3-MultiDcMultiJvmNode3]     at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
[info] [JVM-3-MultiDcMultiJvmNode3]     at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
[info] [JVM-3-MultiDcMultiJvmNode3]     at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
[info] [JVM-3-MultiDcMultiJvmNode3]     at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
[info] [JVM-3-MultiDcMultiJvmNode3] Caused by: java.lang.reflect.InvocationTargetException
[info] [JVM-3-MultiDcMultiJvmNode3]     at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
[info] [JVM-3-MultiDcMultiJvmNode3]     at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
[info] [JVM-3-MultiDcMultiJvmNode3]     at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
[info] [JVM-3-MultiDcMultiJvmNode3]     at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
[info] [JVM-3-MultiDcMultiJvmNode3]     at org.apache.pekko.util.Reflect$.instantiate(Reflect.scala:82)
[info] [JVM-3-MultiDcMultiJvmNode3]     at org.apache.pekko.actor.ArgsReflectConstructor.produce(IndirectActorProducer.scala:111)
[info] [JVM-3-MultiDcMultiJvmNode3]     at org.apache.pekko.actor.Props.newActor(Props.scala:236)
[info] [JVM-3-MultiDcMultiJvmNode3]     at org.apache.pekko.actor.ActorCell.newActor(ActorCell.scala:626)
[info] [JVM-3-MultiDcMultiJvmNode3]     at org.apache.pekko.actor.ActorCell.create(ActorCell.scala:653)
[info] [JVM-3-MultiDcMultiJvmNode3]     ... 10 more
[info] [JVM-3-MultiDcMultiJvmNode3] Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:4711
[info] [JVM-3-MultiDcMultiJvmNode3] Caused by: java.net.ConnectException: Connection refused
[info] [JVM-3-MultiDcMultiJvmNode3]     at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
[info] [JVM-3-MultiDcMultiJvmNode3]     at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
[info] [JVM-3-MultiDcMultiJvmNode3]     at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
[info] [JVM-3-MultiDcMultiJvmNode3]     at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
[info] [JVM-3-MultiDcMultiJvmNode3]     at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
[info] [JVM-3-MultiDcMultiJvmNode3]     at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
[info] [JVM-3-MultiDcMultiJvmNode3]     at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
[info] [JVM-3-MultiDcMultiJvmNode3]     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
[info] [JVM-3-MultiDcMultiJvmNode3]     at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
[info] [JVM-3-MultiDcMultiJvmNode3]     at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
[info] [JVM-3-MultiDcMultiJvmNode3]     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[info] [JVM-3-MultiDcMultiJvmNode3]     at java.base/java.lang.Thread.run(Thread.java:829)

I think netty/netty#6865 is related and you got pinged in that.

@mdedetrich
Contributor

@pjfanning @raboof @jrudolph If you have time, do you mind checking out this PR's branch and running sbt cluster/MultiJvm/test to see if you have the same issues? I ran my tests on a Linux system, and I suspect that the tests don't run in general on Linux OpenJDK but do run fine on Windows (which is what @He-Pin runs locally, iirc).

@pjfanning pjfanning self-requested a review August 1, 2023 09:33
@pjfanning
Contributor

I'm seeing lots of test failures - similar to what @mdedetrich reported.

I think we need to go back to the drawing board and start by getting the multi-jvm tests up and running in CI and sorting out the logging. The tests are reporting lots of slf4j binding issues - so we need to ensure that logback or something like it is set up when running the JVMs used in these tests.

@mdedetrich
Contributor

mdedetrich commented Aug 1, 2023

I think we need to go back to the drawing board and start by getting the multi-jvm tests up and running in CI and sorting out the logging. The tests are reporting lots of slf4j binding issues - so we need to ensure that logback or something like it is set up when running the JVMs used in these tests.

IIRC, making these tests run in our current CI is problematic because of high flakiness caused by the noisy-neighbour problem and the VMs being very weak for the cluster tests; there is a reason why the tests were never enabled in CI for Akka. At least with Akka, before Lightbend would make a release or merge a PR like this, they would test on machines like we are. Although this is another topic, this is one of the problems that dedicated hardware was meant to solve.

We can try temporarily enabling the tests only on this PR to see that it won't break main (on the presumption that, once we have fixed the underlying issue, we should be able to get at least a single pass with enough runs).

@He-Pin
Member Author

He-Pin commented Aug 1, 2023

Does multiNodeTestOnly JoinInProgressMultiJvmSpec pass on your box, @mdedetrich?
I am using a Windows box. If this "connection refused" error comes up, then I think the issue is that the bind is not synchronous.
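
One way to picture the "bind is not synchronous" hypothesis, as a minimal sketch that is not taken from this PR: it uses only java.net and scala.concurrent rather than Netty, and the names BindBeforeConnect, canConnect, freePort and demo are made up for illustration. A client that connects before the listener is bound gets a refused connection, while a client that waits for the bind to complete connects fine. In Netty 4 the analogous step would be waiting on the ChannelFuture returned by bind before letting clients connect; whether that is the actual cause here is only an assumption.

```scala
import java.io.IOException
import java.net.{ InetAddress, InetSocketAddress, ServerSocket, Socket }
import scala.concurrent.{ Await, Future }
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical illustration only; not part of the testkit or of this PR.
object BindBeforeConnect {
  private val loopback: InetAddress = InetAddress.getLoopbackAddress

  // Returns true if a TCP connection to the loopback port could be opened.
  def canConnect(port: Int): Boolean = {
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(loopback, port), 1000)
      true
    } catch {
      case _: IOException => false // e.g. "Connection refused"
    } finally socket.close()
  }

  // Asks the OS for a free port, then releases it so nothing listens there.
  def freePort(): Int = {
    val probe = new ServerSocket(0, 50, loopback)
    try probe.getLocalPort
    finally probe.close()
  }

  // Returns (connect result before the bind, connect result after the bind).
  def demo(): (Boolean, Boolean) = {
    val port = freePort()
    val before = canConnect(port) // no listener yet

    // Bind asynchronously, then block until the bind has completed.
    val binding: Future[ServerSocket] = Future(new ServerSocket(port, 50, loopback))
    val server = Await.result(binding, 5.seconds)
    try {
      (before, canConnect(port))
    } finally server.close()
  }
}
```

Calling BindBeforeConnect.demo() should report a failed connect before the bind and a successful one after it, assuming no other process grabs the port in between.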

@He-Pin

This comment was marked as resolved.

@He-Pin He-Pin marked this pull request as draft August 1, 2023 10:11
@mdedetrich
Contributor

Does multiNodeTestOnly JoinInProgressMultiJvmSpec pass on your box, @mdedetrich? I am using a Windows box. If this "connection refused" error comes up, then I think the issue is that the bind is not synchronous.

I am getting issues that I think are related to #486 (comment), see https://gist.github.com/mdedetrich/e4f0cec36d405bd0ba21921849d8a07a

@pjfanning
Contributor

cd1c74a makes org.apache.pekko.cluster.SplitBrainQuarantine work on my laptop - it was failing before this.

@He-Pin
Member Author

He-Pin commented Aug 1, 2023

@mdedetrich @pjfanning Weird, I didn't see this on my Windows box; I only saw it when I tried to run testOnly.

@He-Pin He-Pin marked this pull request as ready for review August 1, 2023 11:27
@mdedetrich
Contributor

@He-Pin So I just re-ran the tests against the latest state of the branch and I got the following: https://gist.github.com/mdedetrich/d91a5d9e1731a20806ce6e04e58aa9ca.

Running multiNodeTestOnly JoinInProgressMultiJvmSpec doesn't actually do anything (you get "No tests run for MultiNode").

@He-Pin
Member Author

He-Pin commented Aug 1, 2023

@mdedetrich Maybe it binds on a different IP (not loopback) and then tries to connect to localhost/127.0.0.1:4711.

I mean, the multinode.server-host config is related.
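
To illustrate the bind-address hypothesis, a minimal sketch (plain java.net, not Netty or the testkit; LoopbackReachability and reachableOnLoopback are hypothetical names): it binds a listener to the given address and checks whether a client connecting to 127.0.0.1 reaches it. A listener bound to the loopback or wildcard address should be reachable this way; one bound only to a specific non-loopback interface address would normally refuse the connection, which would be consistent with the "Connection refused: localhost/127.0.0.1:4711" error if multinode.server-host pointed at such an address.

```scala
import java.io.IOException
import java.net.{ InetAddress, InetSocketAddress, ServerSocket, Socket }

// Hypothetical diagnostic helper; not part of the testkit or of this PR.
object LoopbackReachability {

  // Binds a listener on an OS-chosen port at `bindHost` (an IP literal or
  // host name) and reports whether a client dialing 127.0.0.1 can reach it.
  def reachableOnLoopback(bindHost: String): Boolean = {
    val server = new ServerSocket(0, 50, InetAddress.getByName(bindHost))
    try {
      val client = new Socket()
      try {
        client.connect(new InetSocketAddress("127.0.0.1", server.getLocalPort), 1000)
        true
      } catch {
        case _: IOException => false // refused or timed out
      } finally client.close()
    } finally server.close()
  }
}
```

Passing the value configured for multinode.server-host on the failing machine would show whether a loopback client can reach a listener bound there.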

@mdedetrich
Contributor

@mdedetrich Maybe it binds on a different IP (not loopback) and then tries to connect to localhost/127.0.0.1:4711.

I mean, the multinode.server-host config is related.

Yes, this definitely seems related, although I wonder why it's only occurring when updating to Netty 4. I guess Netty 4 may have changed loopback/resolution defaults?

@He-Pin He-Pin force-pushed the netty4 branch 3 times, most recently from 0f525d8 to aec80d4 Compare August 1, 2023 15:02
@He-Pin He-Pin force-pushed the netty4 branch 2 times, most recently from 37709b3 to f826ddc Compare August 1, 2023 15:49
@mdedetrich
Contributor

@He-Pin Here are the results at commit aec80d4: https://gist.github.com/mdedetrich/1f483dc433b32be6c9199c55c7de0b02

@He-Pin He-Pin marked this pull request as draft August 1, 2023 16:15
@He-Pin
Member Author

He-Pin commented Aug 1, 2023

I can reproduce it 100% of the time on a Linux box; still looking for the root cause.

@He-Pin
Member Author

He-Pin commented Aug 2, 2023

Update: I will try to allocate some time this weekend for this.

@He-Pin He-Pin force-pushed the netty4 branch 2 times, most recently from d868d85 to 41b6152 Compare August 2, 2023 05:29
Signed-off-by: He-Pin <hepin1989@gmail.com>
Signed-off-by: He-Pin <hepin1989@gmail.com>
Signed-off-by: He-Pin <hepin1989@gmail.com>
@He-Pin
Member Author

He-Pin commented Aug 3, 2023

Closing for now; will reopen after I have tested on a Linux box.
