Set netty system properties in BuildPlugin #45881

Merged (3 commits) on Aug 29, 2019
@@ -884,6 +884,22 @@ class BuildPlugin implements Plugin<Project> {
// TODO: remove this once cname is prepended to transport.publish_address by default in 8.0
test.systemProperty 'es.transport.cname_in_publish_address', 'true'

// Set netty system properties
test.systemProperty('io.netty.noUnsafe', 'true')
test.systemProperty('io.netty.noKeySetOptimization', 'true')
test.systemProperty('io.netty.recycler.maxCapacityPerThread', '0')
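// testSeed is a hex string (optionally '<main seed>:<method seed>'); the choices below derive from the main seed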
def seed = Long.parseUnsignedLong(((String) project.property('testSeed')).tokenize(':').get(0), 16)
if (seed % 4 == 0) {
// Occasionally use the unpooled allocator as we ergonomically choose it on small heaps
test.systemProperty('io.netty.allocator.type', 'unpooled')
Member (rjernst) commented:

We used to do this kind of randomization for ES tests, but I'm not sure we need it. If we want to have a test case that randomizes this, then let's do that, but having this randomized for all tests means we could get unexplainable failures in tests completely unrelated to netty. Could we have netty-specific tests where we do this? Perhaps it would be better to have tests specifically around small heaps, where we would use this allocator type.
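A minimal Gradle sketch of that last suggestion, for illustration only; the task name, heap size, and wiring are assumptions, not from this PR:

```groovy
// Hypothetical dedicated task: run tests on a small heap with the unpooled
// allocator, matching what Elasticsearch chooses ergonomically on small heaps
task smallHeapTest(type: Test) {
    testClassesDirs = sourceSets.test.output.classesDirs
    classpath = sourceSets.test.runtimeClasspath
    maxHeapSize = '512m'
    systemProperty 'io.netty.allocator.type', 'unpooled'
}
check.dependsOn smallHeapTest
```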

original-brownbear (Member) commented on Aug 23, 2019:

@rjernst I think the unpooled path here isn't so important, but the difference between setting io.netty.allocator.numDirectArenas or not is. That setting now triggers very different code paths (whether or not to use our custom CopyBytesSocketChannel).
Admittedly this isn't the best argument, because you could always argue that we need better networking tests, but: we only saw some initial bugs in CopyBytesSocketChannel when other tests started failing, so getting wider coverage of both code paths we have in Netty would be valuable, and in practice already has been. I don't think we'll be able to beef up the networking tests to a point where we could get comparable coverage out of them.

Member (rjernst) commented:

How would someone completely unfamiliar with netty diagnose and debug a failure from CopyBytesSocketChannel? Randomizing these for all tests is similar to what we do in ESIntegTestCase with randomizing various distributed settings. Having a mappings test fail for reasons completely outside of the realm of expertise or the focus of the test is a bug IMO. I can live with it there because I see ESIntegTestCase as sort of distributed-focused tests (testing node interaction), but doing so for all tests extends these failures to places that are difficult to debug (e.g. REST tests).

> I don't think we'll be able to beef up the networking tests to a point where we could get comparable coverage out of them.

I believe in the power of unit tests. :) Could we have randomized, targeted unit tests for CopyBytesSocketChannel?

original-brownbear (Member) commented:

> I believe in the power of unit tests. :) Could we have randomized, targeted unit tests for CopyBytesSocketChannel?

In general, so do I :) But in the case of networking it's tricky to keep the faith. In the example I mentioned above, issues were triggered by larger messages under load at a low rate. It's hard to unit-test that kind of scenario a priori (Tim added a test for it only after the fact), and I still prefer some ITs breaking right away to only seeing this kind of thing fail in the benchmarks (at best) or in production (at worst).

> How would someone completely unfamiliar with netty diagnose and debug a failure from CopyBytesSocketChannel?

Maybe they can't, but the same issue holds true for a number of tests (especially in X-Pack SSL) where we already randomize between Netty and NIO networking anyway. I don't think this change really makes the situation that much worse. As a matter of fact, it actually helps a bit in tracking down issues, because one might be able to narrow things down quickly to one or the other channel implementation used.

> Having a mappings test fail for reasons completely outside of the realm of expertise or the focus of the test is a bug IMO.

I agree, but this change doesn't really affect that scenario much, I think. In the end it only affects the Netty tests themselves and the X-Pack tests for SSL (and ML) that use the Netty codebase. The bulk of the ESIntegTestCase tests use the mock NIO networking anyway, which isn't touched by these changes.

Member (rjernst) commented:

> In the end it only affects the Netty tests themselves and the X-Pack tests for SSL (and ML) that use the Netty codebase

This change affects all integ tests, since they use real Elasticsearch nodes, which will pick up these randomized settings. I meant the mappings example only as something unrelated to the randomization of distributed settings, yet susceptible to random failures.

> In the example I mentioned above, issues were triggered by larger messages under load at a low rate.

Why isn't this something we can mimic in unit tests? Not just this particular pattern, but in general randomizing several axes like message size and rate.

Member (rjernst) commented:

> Still, we do randomize between Netty and our own NIO networking in these already anyway, don't we?

AFAIK that randomization is for ESIntegTestCase. We use the real jvm.options file that is included with the Elasticsearch distribution when running REST integ tests.

> I would still argue that it's generally pretty easy to pinpoint issues to networking

I'm not sure that is the case. I would imagine they can be difficult to distinguish from, e.g., mucked-up bwc version logic. And given that most people doing test triage look quickly at the test, we have gotten lots of issues assigned to core/infra as "the cluster didn't come up" failures, when in fact, most of the time, these have nothing to do with the build.

> But until we have that, doing what this PR suggests is better than not doing it, because without it we simply don't test one possible implementation of the low-level channels

That's a similar slippery slope. Adding this now lowers the incentive to improve the tests.

Contributor (mark-vieira) commented:

This is a big 👎 for me. This kind of thing should be more purposeful. I parrot Ryan's concerns about relying on randomization for test coverage. Something as core as this I almost see as another axis of testing altogether. Something as cross-cutting as our networking stack should not be something we add as another random variable for testing. If this causes funkiness with test failures in unpredictable ways, I suspect we'll have very little chance of detecting and/or fixing these issues. I would propose instead a whole parallel CI branch which runs our test suite w/ the pooled vs. unpooled allocator, similar to how we do platform-specific testing or testing w/ FIPS.

We should get into the habit of: if we think a particular case is worth test coverage, we should exercise it always, either by a) adding explicit test cases for those scenarios instead of relying on randomization, or b) setting up a parallel build which runs existing test cases w/ that feature flag or configuration enabled (see the sketch below).
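A minimal Gradle sketch of option (b), where a parallel CI job flips a single build-wide flag; the tests.netty.allocator.type property name is hypothetical, not from this PR:

```groovy
// Sketch for BuildPlugin: honor an explicit allocator flag so a parallel CI
// job can run the whole suite against the other allocator, e.g.
//   ./gradlew check -Dtests.netty.allocator.type=unpooled
String allocatorType = System.getProperty('tests.netty.allocator.type', 'pooled')
test.systemProperty('io.netty.allocator.type', allocatorType)
```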

The only benefit of randomization is that we don't have to do full coverage on every build. As we bring down build times and make builds more parallelizable, I think we should rely less on this method of test coverage. It makes it really hard to determine the cause of test failures from a statistical point of view, which is where we want to get to, because then we can apply automation and ML in those scenarios.

My thoughts on testing aside, this will completely bork cacheability for tests. At the very least, if we decide to go this route, we need to hide this behind our ignore.tests.seed system property so that a changing test seed doesn't manifest in our intake builds via this system property.
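For illustration, one shape that gating could take; a sketch that assumes intake builds set ignore.tests.seed and should then get fixed, cache-friendly values:

```groovy
if (System.getProperty('ignore.tests.seed') != null) {
    // Seed is excluded from task inputs: pin the Netty settings so the test
    // task stays cacheable across changing seeds
    test.systemProperty('io.netty.allocator.type', 'pooled')
    test.systemProperty('io.netty.allocator.numDirectArenas', '0')
} else {
    def seed = Long.parseUnsignedLong(((String) project.property('testSeed')).tokenize(':').get(0), 16)
    test.systemProperty('io.netty.allocator.type', seed % 4 == 0 ? 'unpooled' : 'pooled')
}
```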

original-brownbear (Member) commented:

@rjernst @mark-vieira ok, you win. Let's see what others think, but I'm fine giving up here and starting work on some unit-test-like coverage across supported Netty configurations.

Contributor (mark-vieira) commented:

Maybe we should sync up and have a discussion about the best way to achieve the desired test coverage here. There are many ways we can do this, each with its own considerations depending on our goal.

original-brownbear (Member) commented:

@mark-vieira sure, we can sync up :)

As an easy solution though: as an alternative to the randomisation here, I wonder if we couldn't simply run all the Netty tests twice, once with each channel implementation (that should be really cheap ... it's ~2min on my machine), and maybe do the same for the security tests that use Netty ... that should also only add 5-10 minutes.
That doesn't give us double coverage on REST and integration tests, but at least we'd have both implementations covered by the existing unit tests (see the sketch below).
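A sketch of that run-twice idea for modules/transport-netty4/build.gradle; the second task's name and wiring are assumptions, not from this PR:

```groovy
// Run the unit tests a second time with direct-arena pooling disabled, which
// (per the discussion above) forces the custom CopyBytesSocketChannel path
task testNoDirectArenas(type: Test) {
    testClassesDirs = sourceSets.test.output.classesDirs
    classpath = sourceSets.test.runtimeClasspath
    systemProperty 'es.set.netty.runtime.available.processors', 'false'
    systemProperty 'io.netty.allocator.numDirectArenas', '0'
}
check.dependsOn testNoDirectArenas
```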

} else {
test.systemProperty('io.netty.allocator.type', 'pooled')
if (seed % 2 != 0) {
// Disable direct buffer pooling most of the time as it is disabled by default in Elasticsearch
test.systemProperty('io.netty.allocator.numDirectArenas', '0')
}
}

test.testLogging { TestLoggingContainer logging ->
logging.showExceptions = true
logging.showCauses = true
modules/transport-netty4/build.gradle (0 additions, 6 deletions)
@@ -51,9 +51,6 @@ test {
* other if we allow them to set the number of available processors as it's set-once in Netty.
*/
systemProperty 'es.set.netty.runtime.available.processors', 'false'

// Disable direct buffer pooling as it is disabled by default in Elasticsearch
systemProperty 'io.netty.allocator.numDirectArenas', '0'
}

@@ -62,9 +59,6 @@
integTestRunner {
* other if we allow them to set the number of available processors as it's set-once in Netty.
*/
systemProperty 'es.set.netty.runtime.available.processors', 'false'

// Disable direct buffer pooling as it is disabled by default in Elasticsearch
systemProperty 'io.netty.allocator.numDirectArenas', '0'
}

thirdPartyAudit {
@@ -242,9 +242,6 @@ private static void setTestSysProps() {

// Enable Netty leak detection and monitor logger for logged leak errors
System.setProperty("io.netty.leakDetection.level", "paranoid");

// Disable direct buffer pooling
System.setProperty("io.netty.allocator.numDirectArenas", "0");
}

protected final Logger logger = LogManager.getLogger(getClass());