[CI] mixed-cluster:v6.0.0#mixedClusterTestRunner failure #73459

Closed
matriv opened this issue May 27, 2021 · 12 comments · Fixed by #73610
Labels
:Delivery/Build (Build or test infrastructure) · Team:Delivery (Meta label for Delivery team) · >test-failure (Triaged test failures from CI)

Comments

@matriv
Contributor

matriv commented May 27, 2021

Build scan:
https://gradle-enterprise.elastic.co/s/64inre6ljqgcc

Repro line:

./gradlew ':qa:mixed-cluster:v6.0.0#mixedClusterTestRunner' -Dtests.seed=4925E97ABC52AD2B -Dtests.class=org.elasticsearch.backwards.MixedClusterClientYamlTestSuiteIT -Dtests.method="test {p0=nodes.stats/20_response_filtering/Nodes Stats filtered using both includes and excludes filters}" -Dtests.security.manager=true -Dtests.locale=sr -Dtests.timezone=America/Punta_Arenas -Dcompiler.java=11 -Druntime.java=8

Reproduces locally?:
Haven't tried

Applicable branches:
6.8

Failure history:
https://gradle-enterprise.elastic.co/scans/failures?failures.failureClassification=all_failures&failures.failureMessage=Execution%20failed%20for%20task%20*%0A%3E%20Process%20%27kill%20%5B-9,%20*%20finished%20with%20non-zero%20exit%20value%201&search.relativeStartTime=P7D&search.timeZoneId=Europe/Athens

Failure excerpt:

java.lang.AssertionError: expected total memory to be positive, got: -1
	at org.elasticsearch.monitor.os.OsStats$Mem.<init>(OsStats.java:267) ~[elasticsearch-6.8.17-SNAPSHOT.jar:6.8.17-SNAPSHOT]
	at org.elasticsearch.monitor.os.OsStats.<init>(OsStats.java:55) ~[elasticsearch-6.8.17-SNAPSHOT.jar:6.8.17-SNAPSHOT]
	at org.elasticsearch.common.io.stream.StreamInput.readOptionalWriteable(StreamInput.java:777) ~[elasticsearch-6.8.17-SNAPSHOT.jar:6.8.17-SNAPSHOT]
	at org.elasticsearch.action.admin.cluster.node.stats.NodeStats.readFrom(NodeStats.java:227) ~[elasticsearch-6.8.17-SNAPSHOT.jar:6.8.17-SNAPSHOT]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.read(TransportNodesAction.java:199) ~[elasticsearch-6.8.17-SNAPSHOT.jar:6.8.17-SNAPSHOT]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.read(TransportNodesAction.java:195) ~[elasticsearch-6.8.17-SNAPSHOT.jar:6.8.17-SNAPSHOT]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.read(TransportService.java:1107) ~[elasticsearch-6.8.17-SNAPSHOT.jar:6.8.17-SNAPSHOT]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.read(TransportService.java:1094) ~[elasticsearch-6.8.17-SNAPSHOT.jar:6.8.17-SNAPSHOT]
	at org.elasticsearch.transport.TcpTransport.handleResponse(TcpTransport.java:970) ~[elasticsearch-6.8.17-SNAPSHOT.jar:6.8.17-SNAPSHOT]
	at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:952) ~[elasticsearch-6.8.17-SNAPSHOT.jar:6.8.17-SNAPSHOT]
	at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:763) ~[elasticsearch-6.8.17-SNAPSHOT.jar:6.8.17-SNAPSHOT]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:53) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965) ~[?:?]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:656) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:556) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:510) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909) ~[?:?]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_291]

Could be related to these Java-related updates:
14f3273
bb580a4
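
For context, the check that fails in the first frame above (OsStats$Mem.<init>) is a constructor assertion on the deserialized value. A minimal sketch of that pattern follows; the class and field layout are assumptions based on the failure message, not the exact 6.8 source.

// Illustrative sketch (not the actual 6.8 source) of the check that trips above.
// The class and field layout here are assumptions based on the failure message.
public class OsStatsMemSketch {

    static class Mem {
        final long total;
        final long free;

        Mem(long total, long free) {
            // A BWC node that serializes -1 as its "unknown" marker makes this
            // assertion fail on the node that deserializes the stats.
            assert total >= 0 : "expected total memory to be positive, got: " + total;
            this.total = total;
            this.free = free;
        }
    }

    public static void main(String[] args) {
        // Throws AssertionError when assertions are enabled (java -ea), as they are in test runs.
        new Mem(-1L, 0L);
    }
}

Because assertions are enabled in test runs, a -1 total memory coming off the wire is enough to fail the whole nodes.stats call.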

@matriv added the :Delivery/Build (Build or test infrastructure) and >test-failure (Triaged test failures from CI) labels on May 27, 2021
@elasticmachine added the Team:Delivery (Meta label for Delivery team) label on May 27, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-delivery (Team:Delivery)

@mark-vieira
Contributor

I can't get this to reproduce locally with the exact same Java versions. Perhaps it's something to do with the underlying system?

Some of the versions 6.8 is compatible with do not work on cgroups2, so we pick a platform that doesn't come with it.

That stood out to me because we hard-code these jobs to run on centos-7, but the versions don't line up: in this case it's versions 5.6-6.0 that are busted, while the cgroups issue is with 5.0-5.3.

I also find it odd that the stack trace above is being thrown by the current version (6.8.17), yet these failures only occur with certain BWC versions.

@mark-vieira
Contributor

I don't quite understand the stack trace, but is it possible the 6.8.17 node is trying to read this information from the BWC node, which is reporting things wrong, and that's what's blowing up?

@mark-vieira
Contributor

mark-vieira commented May 27, 2021

@williamrandolph I'm picking on you since you seem to be the most recent person to touch some of this code, so you might have some context or a "fresh" memory of what might be happening here.

Is it possible that a minor change to the runtime JDK for these BWC clusters is causing them to report bad stats, which then causes the 6.8.17 node in the cluster to fail to start?

@williamrandolph
Contributor

I do think that's possible, though it would take some digging to confirm. I remember looking into this and poking around in JVM code, but I haven't managed to find my notes.

In our current code, we have guards that will return 0 instead of a negative value if the OS reports used memory higher than free memory or something like that. We also have assertions that catch situations where a negative value is being passed. It's clear how this can be a problem in mixed clusters, where old-version nodes pass bad values to newer nodes, which will then fail to deserialize the NodeStats. It might be better to update the 6.8 code to warn in this situation.

But I'd like to see if we can figure out what change in the JDK code brought this up on our test OSes.

Linking my PR for awareness: #68554
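
To make the guard-and-assert pattern described in the comment above concrete, here is a hedged sketch of the producer-side clamp; the method names are illustrative assumptions, not the real OsProbe API.

// Hedged sketch of the producer-side guard: clamp nonsensical readings to 0
// before they ever reach OsStats. Method names here are illustrative assumptions.
public class OsProbeGuardSketch {

    long readTotalPhysicalMemory() {
        long total = probeTotalMemory();
        if (total < 0) {
            // Patched branches warn and report 0 instead of passing -1 along.
            // Already-released nodes without this guard still serialize the raw -1,
            // which is what the 6.8.17 node rejects during deserialization.
            System.err.println("OS reported a negative total memory value [" + total + "], using 0 instead");
            return 0;
        }
        return total;
    }

    // Stand-in for whatever OS/JVM call actually supplies the value.
    long probeTotalMemory() {
        return -1L; // simulate the bad reading behind this failure
    }

    public static void main(String[] args) {
        System.out.println("total=" + new OsProbeGuardSketch().readTotalPhysicalMemory()); // prints total=0
    }
}

The point is that only nodes carrying the guard emit 0; already-released nodes without it still put the raw -1 on the wire, and the receiving node's assertion is what surfaces the problem.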

@mark-vieira
Contributor

@williamrandolph what area should own this? Core/Infra?

@rjernst
Member

rjernst commented May 28, 2021

While backporting that change to warn and use 0 in place of negative values is good for 6.8, it won't fix the issue for already-released versions of Elasticsearch, such as the 6.0.0 release this CI issue references.

Aside from whatever the underlying JDK change may have been, I think we should change 7.x/master to be lenient in reading from older nodes, so that instead of asserting >=0, they convert negative values from those nodes to 0 when reading (and can then warn).

@mark-vieira
Contributor

mark-vieira commented May 28, 2021

I think we should change 7.x/master to be lenient in reading from older nodes, so that instead of asserting >=0, they convert negative values from those nodes to 0 when reading (and can then warn).

These failures are on the 6.8 branch, though. Don't you mean we should build that leniency in there? There is no wire compatibility between 7.x/master and 6.0.

@rjernst
Member

rjernst commented May 28, 2021

Ah, I misread; I thought it was 7.x testing BWC with 6.x. We should still do the backport William opened, but I just think the change I suggested is also necessary. The change can go back to 6.8 as well.

@mark-vieira
Contributor

mark-vieira commented May 29, 2021

The PR that Will linked was merged back in February. I think we'll need something additional here to implement the leniency you describe.

@rjernst
Member

rjernst commented May 29, 2021

Yes, I meant my suggestion as a new change. Sorry, I didn't mean to conflate the two.

williamrandolph added a commit that referenced this issue Jun 1, 2021
We've had a series of bug fixes for cases where an OsProbe gives negative
values, most often just -1, to the OsStats class. We added assertions to catch
cases where we were initializing OsStats with bad values. Unfortunately, these
fixes turned out not to be backwards-compatible. In this commit, we simply coerce
bad values to 0 when data is coming from nodes that don't have the relevant bug
fixes.

Relevant PRs:
* #42725
* #56435
* #57317

Fixes #73459
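
A minimal sketch of the read-side coercion this commit message describes, with the version check reduced to a boolean flag; all names here are illustrative assumptions rather than the actual change in #73610.

// Minimal sketch of read-side leniency: when stats come from a node that predates
// the producer-side fixes, coerce negative values to 0 (and warn) instead of
// letting the constructor assertion trip. The boolean "senderHasFix" stands in
// for a real version check and is an illustrative assumption.
public class LenientMemReadSketch {

    static long coerceNegative(long value, String what) {
        if (value < 0) {
            System.err.println("negative " + what + " [" + value + "] received from a pre-fix node, using 0 instead");
            return 0;
        }
        return value;
    }

    // Hypothetical deserialization path: raw values arrive off the wire from an older node.
    static long[] readMem(long totalFromWire, long freeFromWire, boolean senderHasFix) {
        long total = senderHasFix ? totalFromWire : coerceNegative(totalFromWire, "total memory");
        long free = senderHasFix ? freeFromWire : coerceNegative(freeFromWire, "free memory");
        return new long[] { total, free };
    }

    public static void main(String[] args) {
        long[] mem = readMem(-1L, -1L, false); // e.g. stats sent by a 6.0.0 node
        System.out.println("total=" + mem[0] + " free=" + mem[1]); // prints total=0 free=0
    }
}

The design choice is to accept that released BWC nodes cannot be fixed, so the reading node tolerates their bad values and logs a warning instead of tripping the assertion.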