Set netty system properties in BuildPlugin #45881
Merged
We used to do this kind of randomization for ES tests, but I'm not sure we need it. If we want a test case that randomizes this, then let's do that, but having this randomized for all tests means we could get unexplainable failures in tests completely unrelated to Netty. Could we have Netty-specific tests where we do this? Perhaps it would be better to have tests specifically around small heaps, where we would use this allocator type?
@rjernst I think the `unpooled` path here isn't so important, but the difference between setting `io.netty.allocator.numDirectArenas` or not is. That setting triggers very different code paths now (whether to use our custom `CopyingBytesChannel` or not). Admittedly this isn't the best argument here, because you could always argue that we need better networking tests then, but:
We only saw some initial bugs in `CopyingBytesChannel` when other tests started failing. So getting wider coverage on both code paths we have in Netty would definitely be, and in practice already was, worth it. I don't think we'll be able to beef up the networking tests to a point where we could get comparable coverage out of them.
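To make the two code paths concrete: a minimal, purely illustrative sketch (not Elasticsearch's actual transport code; the method and class names here are hypothetical) of how the value of `io.netty.allocator.numDirectArenas` selects between the custom copy-bytes channel and Netty's default pooled direct-buffer path, as described in the comment above:

```java
// Hypothetical illustration of the branch the discussion refers to:
// setting io.netty.allocator.numDirectArenas to "0" disables Netty's
// pooled direct arenas, which is the case where the custom copy-bytes
// channel implementation is used instead of the default pooled path.
public class AllocatorPathDemo {
    static String channelPathFor(String numDirectArenas) {
        return "0".equals(numDirectArenas) ? "copy-bytes" : "pooled-direct";
    }

    public static void main(String[] args) {
        System.out.println(channelPathFor("0")); // copy-bytes
        System.out.println(channelPathFor("2")); // pooled-direct
    }
}
```

Randomizing this property across test runs means each run exercises one of these two branches, which is the coverage trade-off being debated.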
How would someone completely unfamiliar with Netty diagnose and debug a failure from `CopyBytesSocketChannel`? Randomizing these for all tests is similar to what we do in `ESIntegTestCase` with randomizing various distributed settings. Having a mappings test fail for reasons completely outside of the realm of expertise or the focus of the test is a bug IMO. I can live with it there because I see `ESIntegTestCase` as sort of distributed-focused tests (testing node interaction), but doing so for all tests is extending these failures to places that are difficult to debug (e.g. REST tests).
I believe in the power of unit tests. :) Could we have randomized, targeted unit tests for `CopyBytesSocketChannel`?
In general so do I :) But in the case of networking it's tricky to keep the faith. In the example I mentioned above, issues were triggered by larger messages under load at a low rate. It's hard to unit-test that kind of scenario a priori (Tim only added a test for it after the fact), and I still prefer some ITs breaking right away over only seeing this kind of thing in failing benchmarks (at best) or production (at worst).
Maybe they can't, but the same issue holds true for a number of tests (especially in X-Pack SSL) where we randomize between Netty and NIO networking already anyway. I don't think this change really makes the situation that much worse. As a matter of fact, it actually helps a bit in tracking down issues, because one might be able to narrow things down quickly to one or the other channel implementation used.
I agree, but this change doesn't really affect that scenario much I think. In the end it only affects the Netty tests themselves and the X-Pack tests for SSL (and ML) that use the Netty codebase. The bulk of `ESIntegTestCase`s use the mock NIO networking anyway, which isn't touched by these changes.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change affects all integ tests since they use real elasticsearch nodes which will pick up these randomized settings. I only meant the mappings example as something unrelated to the randomization of distributed settings, yet susceptible to random failures.
Why isn't this something we can mimic in unit tests? Not just this particular pattern, but in general randomizing several axes like message size and rate.
AFAIK that randomization is for `ESIntegTestCase`. We use the real jvm.options file that is included with the Elasticsearch distribution when running REST integ tests.
I'm not sure that is the case. I would imagine they can be difficult to distinguish from, e.g., mucked-up bwc version logic. And given that most people doing test triage only look at the test quickly, we have got lots of issues assigned to core/infra because they look like "the cluster didn't come up", when in fact most of the time they have nothing to do with the build.
That's a similar slippery slope. Adding this now lowers the incentive to improve the tests.
This is a big 👎 for me. This kind of thing should be more purposeful. I parrot Ryan's concerns about relying on randomization for test coverage. Something as core as this I almost see as another axis of testing altogether. Something as cross-cutting as our networking stack should not be something we add as another random variable for testing. If this causes funkiness with test failures in unpredictable ways, I suspect we'll have very little chance of detecting and/or fixing these issues. I would propose instead a whole parallel CI branch which runs our test suite with the pooled vs. unpooled allocator, similar to how we do platform-specific testing or testing with FIPS.
We should get in the habit of: if we think a particular case is worth test coverage, we should cover it always. Either by a) adding explicit test cases for those scenarios instead of relying on randomization, or b) setting up a parallel build which runs existing test cases with that feature flag or configuration enabled.
The only benefit of randomization is that we don't have to do full coverage on every build. As we bring down build times and make builds more parallelizable, I think we should rely less on this method of test coverage. It makes it really hard to determine the cause of test failures from a statistical point of view, which is where we want to get to, because then we can apply automation/ML in those scenarios.
My thoughts on testing aside, this will completely bork cacheability for tests. At the very least, if we decide to go this route, we need to hide this behind our `ignore.tests.seed` system property, such that a changing test seed doesn't manifest in our intake builds via this system property.
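The cacheability concern can be sketched as follows. Assuming (hypothetically; class and method names here are illustrative, not the actual BuildPlugin code) that the build derives Netty settings from the test seed, gating that derivation on an `ignore.tests.seed`-style flag keeps the test task's inputs stable and therefore cache-friendly:

```java
import java.util.Map;
import java.util.Random;

// Hypothetical sketch: seed-derived Netty settings, gated behind an
// ignore-the-seed flag so that a changing test seed does not change the
// test task's inputs (which would defeat build caching).
public class SeededNettySettings {
    static Map<String, String> settingsFor(long seed, boolean ignoreSeed) {
        if (ignoreSeed) {
            // Stable inputs on intake builds: always the default configuration.
            return Map.of();
        }
        Random random = new Random(seed); // same seed -> same choice, reproducible
        return random.nextBoolean()
            ? Map.of("io.netty.allocator.numDirectArenas", "0")
            : Map.of();
    }

    public static void main(String[] args) {
        System.out.println(settingsFor(12345L, true)); // always {} when the seed is ignored
    }
}
```

With the flag set, every build sees identical system properties regardless of seed; without it, the choice still stays deterministic per seed, which preserves test reproducibility.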
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rjernst @mark-vieira ok, you win. Let's see what others think, but I'm fine giving up here and starting work on some unit-test-like coverage across the supported Netty configurations.
Maybe we should sync up and have a discussion about the best way to achieve the desired test coverage here. There are many ways we can do this that have their own considerations depending on our goal.
@mark-vieira sure, we can sync up :)
As an easy alternative to the randomisation here, though: I wonder if we couldn't simply run all the Netty tests twice, with both channel implementations (that should be really cheap ... it's ~2min on my machine), and maybe do the same for the security tests that use Netty ... that should also only add 5-10 minutes.
That doesn't give us double coverage on REST and integration tests, but at least we'd have both implementations covered by the existing unit tests.
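The "run everything twice" idea can be sketched as a small driver (purely illustrative; the class, method names, and the `"1"` arena value are assumptions, not the actual test infrastructure) that runs the same test body once per channel configuration, restoring the property afterwards:

```java
import java.util.List;

// Hypothetical sketch: instead of randomizing, execute a test body under
// both Netty channel configurations deterministically.
public class BothChannelPaths {
    static void runWith(String numDirectArenas, Runnable testBody) {
        String previous = System.getProperty("io.netty.allocator.numDirectArenas");
        System.setProperty("io.netty.allocator.numDirectArenas", numDirectArenas);
        try {
            testBody.run();
        } finally {
            // Restore the previous value so configurations don't leak between runs.
            if (previous == null) {
                System.clearProperty("io.netty.allocator.numDirectArenas");
            } else {
                System.setProperty("io.netty.allocator.numDirectArenas", previous);
            }
        }
    }

    public static void main(String[] args) {
        // "0" forces the copy-bytes path; any non-zero value keeps pooled arenas.
        for (String arenas : List.of("0", "1")) {
            runWith(arenas, () -> System.out.println("ran with numDirectArenas="
                + System.getProperty("io.netty.allocator.numDirectArenas")));
        }
    }
}
```

In a real suite this role would more naturally be filled by parameterized tests or a second Gradle test task with the property set, but the shape of the coverage is the same: both paths, every build.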