[CI] mixed-cluster:v6.0.0#mixedClusterTestRunner failure #73459

Closed
matriv opened this issue May 27, 2021 · 12 comments · Fixed by #73610
Labels
:Delivery/Build (Build or test infrastructure) · Team:Delivery (Meta label for Delivery team) · >test-failure (Triaged test failures from CI)

Comments

@matriv
Contributor

matriv commented May 27, 2021

Build scan:
https://gradle-enterprise.elastic.co/s/64inre6ljqgcc

Repro line:

./gradlew ':qa:mixed-cluster:v6.0.0#mixedClusterTestRunner' -Dtests.seed=4925E97ABC52AD2B -Dtests.class=org.elasticsearch.backwards.MixedClusterClientYamlTestSuiteIT -Dtests.method="test {p0=nodes.stats/20_response_filtering/Nodes Stats filtered using both includes and excludes filters}" -Dtests.security.manager=true -Dtests.locale=sr -Dtests.timezone=America/Punta_Arenas -Dcompiler.java=11 -Druntime.java=8

Reproduces locally?:
Haven't tried

Applicable branches:
6.8

Failure history:
https://gradle-enterprise.elastic.co/scans/failures?failures.failureClassification=all_failures&failures.failureMessage=Execution%20failed%20for%20task%20*%0A%3E%20Process%20%27kill%20%5B-9,%20*%20finished%20with%20non-zero%20exit%20value%201&search.relativeStartTime=P7D&search.timeZoneId=Europe/Athens

Failure excerpt:

java.lang.AssertionError: expected total memory to be positive, got: -1
	at org.elasticsearch.monitor.os.OsStats$Mem.<init>(OsStats.java:267) ~[elasticsearch-6.8.17-SNAPSHOT.jar:6.8.17-SNAPSHOT]
	at org.elasticsearch.monitor.os.OsStats.<init>(OsStats.java:55) ~[elasticsearch-6.8.17-SNAPSHOT.jar:6.8.17-SNAPSHOT]
	at org.elasticsearch.common.io.stream.StreamInput.readOptionalWriteable(StreamInput.java:777) ~[elasticsearch-6.8.17-SNAPSHOT.jar:6.8.17-SNAPSHOT]
	at org.elasticsearch.action.admin.cluster.node.stats.NodeStats.readFrom(NodeStats.java:227) ~[elasticsearch-6.8.17-SNAPSHOT.jar:6.8.17-SNAPSHOT]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.read(TransportNodesAction.java:199) ~[elasticsearch-6.8.17-SNAPSHOT.jar:6.8.17-SNAPSHOT]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.read(TransportNodesAction.java:195) ~[elasticsearch-6.8.17-SNAPSHOT.jar:6.8.17-SNAPSHOT]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.read(TransportService.java:1107) ~[elasticsearch-6.8.17-SNAPSHOT.jar:6.8.17-SNAPSHOT]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.read(TransportService.java:1094) ~[elasticsearch-6.8.17-SNAPSHOT.jar:6.8.17-SNAPSHOT]
	at org.elasticsearch.transport.TcpTransport.handleResponse(TcpTransport.java:970) ~[elasticsearch-6.8.17-SNAPSHOT.jar:6.8.17-SNAPSHOT]
	at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:952) ~[elasticsearch-6.8.17-SNAPSHOT.jar:6.8.17-SNAPSHOT]
	at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:763) ~[elasticsearch-6.8.17-SNAPSHOT.jar:6.8.17-SNAPSHOT]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:53) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965) ~[?:?]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:656) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:556) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:510) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909) ~[?:?]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_291]

Could be related to these Java-related updates:
14f3273
bb580a4
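
For context, the check that fails in the first frame above (OsStats$Mem.<init>) is a constructor assertion on the deserialized value. A minimal sketch of that pattern follows; the class and field layout are assumptions based on the failure message, not the exact 6.8 source.

// Illustrative sketch (not the actual 6.8 source) of the check that trips above.
// The class and field layout here are assumptions based on the failure message.
public class OsStatsMemSketch {

    static class Mem {
        final long total;
        final long free;

        Mem(long total, long free) {
            // A BWC node that serializes -1 as its "unknown" marker makes this
            // assertion fail on the node that deserializes the stats.
            assert total >= 0 : "expected total memory to be positive, got: " + total;
            this.total = total;
            this.free = free;
        }
    }

    public static void main(String[] args) {
        // Throws AssertionError when assertions are enabled (java -ea), as they are in test runs.
        new Mem(-1L, 0L);
    }
}

Because assertions are enabled in test runs, a -1 total memory coming off the wire is enough to fail the whole nodes.stats call.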

@matriv added the :Delivery/Build (Build or test infrastructure) and >test-failure (Triaged test failures from CI) labels on May 27, 2021
@elasticmachine added the Team:Delivery (Meta label for Delivery team) label on May 27, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-delivery (Team:Delivery)

@mark-vieira
Contributor

I can't get this to reproduce locally with the exact same Java versions. Perhaps it's something to do with the underlying system?

Some of the versions 6.8 is compatible with do not work on cgroups2, so we pick a platform that doesn't come with it.

That stood out to me because we hard-code these jobs to run on centos-7, but the versions don't line up: in this case it's versions 5.6-6.0 that are busted, while the cgroups issue is with 5.0-5.3.

I also find it odd that the stack trace above is being thrown by the current version (6.8.17), yet these failures only occur with certain BWC versions.

@mark-vieira
Contributor

I don't quite understand the stack trace, but is it possible the 6.8.17 node is trying to read this information from the BWC node, which is reporting things wrong, and that's what's blowing up?

@mark-vieira
Contributor

mark-vieira commented May 27, 2021

@williamrandolph I'm picking on you since you seem to be the most recent person to touch some of this code, so you might have some context or a "fresh" memory of what might be happening here.

Is it possible that a minor change to the runtime JDK for these BWC clusters is causing them to report bad stats, which then causes the 6.8.17 node in the cluster to fail to start?

@williamrandolph
Contributor

I do think that's possible, though it would take some digging to confirm. I remember looking into this and poking around in JVM code, but I haven't managed to find my notes.

In our current code, we have guards that will return 0 instead of a negative value if the OS reports used memory higher than free memory or something like that. We also have assertions that catch situations where a negative value is being passed. It's clear how this can be a problem in mixed clusters, where old-version nodes pass bad values to newer nodes, which will then fail to deserialize the NodeStats. It might be better to update the 6.8 code to warn in this situation.

But I'd like to see if we can figure out what change in the JDK code brought this up on our test OSes.

Linking my PR for awareness: #68554
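
To make the guard-and-assert pattern described in the comment above concrete, here is a hedged sketch of the producer-side clamp; the method names are illustrative assumptions, not the real OsProbe API.

// Hedged sketch of the producer-side guard: clamp nonsensical readings to 0
// before they ever reach OsStats. Method names here are illustrative assumptions.
public class OsProbeGuardSketch {

    long readTotalPhysicalMemory() {
        long total = probeTotalMemory();
        if (total < 0) {
            // Patched branches warn and report 0 instead of passing -1 along.
            // Already-released nodes without this guard still serialize the raw -1,
            // which is what the 6.8.17 node rejects during deserialization.
            System.err.println("OS reported a negative total memory value [" + total + "], using 0 instead");
            return 0;
        }
        return total;
    }

    // Stand-in for whatever OS/JVM call actually supplies the value.
    long probeTotalMemory() {
        return -1L; // simulate the bad reading behind this failure
    }

    public static void main(String[] args) {
        System.out.println("total=" + new OsProbeGuardSketch().readTotalPhysicalMemory()); // prints total=0
    }
}

The point is that only nodes carrying the guard emit 0; already-released nodes without it still put the raw -1 on the wire, and the receiving node's assertion is what surfaces the problem.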

@mark-vieira
Contributor

@williamrandolph what area should own this? Core/Infra?

@rjernst
Member

rjernst commented May 28, 2021

While backporting that change to warn and use 0 in place of negative values is good for 6.8, it won't fix the issue for already-released versions of Elasticsearch, such as the 6.0.0 release this CI issue references.

Aside from whatever the underlying JDK change may have been, I think we should change 7.x/master to be lenient in reading from older nodes, so that instead of asserting >=0, they convert negative values from those nodes to 0 when reading (and can then warn).

@mark-vieira
Contributor

mark-vieira commented May 28, 2021

I think we should change 7.x/master to be lenient in reading from older nodes, so that instead of asserting >=0, they convert negative values from those nodes to 0 when reading (and can then warn).

These failures are on the 6.8 branch, though. Don't you mean we should build that leniency in there? There is no wire compatibility between 7.x/master and 6.0.

@rjernst
Member

rjernst commented May 28, 2021

Ah, I misread; I thought it was 7.x testing BWC with 6.x. We should still do the backport William opened, but I just think the change I suggested is also necessary. The change can go back to 6.8 as well.

@mark-vieira
Contributor

mark-vieira commented May 29, 2021

The PR that Will linked was merged back in February. I think we'll need something additional here to implement the leniency you describe.

@rjernst
Member

rjernst commented May 29, 2021

Yes, I meant my suggestion as a new change. Sorry, I didn't mean to conflate the two.

williamrandolph added a commit that referenced this issue Jun 1, 2021
We've had a series of bug fixes for cases where an OsProbe gives negative
values, most often just -1, to the OsStats class. We added assertions to catch
cases where we were initializing OsStats with bad values. Unfortunately, these
fixes turned out not to be backwards-compatible. In this commit, we simply coerce
bad values to 0 when data is coming from nodes that don't have the relevant bug
fixes.

Relevant PRs:
* #42725
* #56435
* #57317

Fixes #73459
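
A minimal sketch of the read-side coercion this commit message describes, with the version check reduced to a boolean flag; all names here are illustrative assumptions rather than the actual change in #73610.

// Minimal sketch of read-side leniency: when stats come from a node that predates
// the producer-side fixes, coerce negative values to 0 (and warn) instead of
// letting the constructor assertion trip. The boolean "senderHasFix" stands in
// for a real version check and is an illustrative assumption.
public class LenientMemReadSketch {

    static long coerceNegative(long value, String what) {
        if (value < 0) {
            System.err.println("negative " + what + " [" + value + "] received from a pre-fix node, using 0 instead");
            return 0;
        }
        return value;
    }

    // Hypothetical deserialization path: raw values arrive off the wire from an older node.
    static long[] readMem(long totalFromWire, long freeFromWire, boolean senderHasFix) {
        long total = senderHasFix ? totalFromWire : coerceNegative(totalFromWire, "total memory");
        long free = senderHasFix ? freeFromWire : coerceNegative(freeFromWire, "free memory");
        return new long[] { total, free };
    }

    public static void main(String[] args) {
        long[] mem = readMem(-1L, -1L, false); // e.g. stats sent by a 6.0.0 node
        System.out.println("total=" + mem[0] + " free=" + mem[1]); // prints total=0 free=0
    }
}

The design choice is to accept that released BWC nodes cannot be fixed, so the reading node tolerates their bad values and logs a warning instead of tripping the assertion.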