Reduce port range re-use in tests #85777

Merged: 6 commits merged into elastic:master from the 2022-04-11-worker-id-overflow branch on Apr 13, 2022

Conversation

@DaveCTurner (Contributor) commented Apr 11, 2022

Today we select the port range to use in tests by taking the Gradle
worker ID mod 223. We now use significantly more than 223 Gradle workers
in each test run which means port ranges are re-used, resulting in
spurious failures.

This commit replaces the magic number 223 with a computation defined in
terms of the available ports and the number of ports to allocate per
worker. It also reduces the number of ports per worker to 30 to try and
reduce the frequency of port clashes.
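
For illustration, a minimal sketch of that computation in Java follows; the port bounds and the class/method names here are assumptions for the sketch, not the actual Elasticsearch test-framework code:

public final class TestPortRanges {
    // Assumed usable port window for tests; the real bounds live in the test framework.
    private static final int MIN_PORT = 10_300;
    private static final int MAX_PORT = 32_767;

    // Smaller per-worker ranges leave room for more distinct ranges before any re-use.
    private static final int PORTS_PER_WORKER = 30;

    // Replaces the magic 223: the number of distinct ranges follows from the port space.
    private static final int RANGE_COUNT = (MAX_PORT - MIN_PORT + 1) / PORTS_PER_WORKER;

    static int basePortFor(long gradleWorkerId) {
        // Wrapping around keeps very large worker IDs usable, at the cost of possible re-use.
        int rangeIndex = (int) (gradleWorkerId % RANGE_COUNT);
        return MIN_PORT + rangeIndex * PORTS_PER_WORKER;
    }
}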

@DaveCTurner DaveCTurner added >test Issues or PRs that are addressing/adding tests :Distributed/Network Http and internode communication implementations v8.3.0 labels Apr 11, 2022
@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team label Apr 11, 2022
@DaveCTurner DaveCTurner marked this pull request as draft April 11, 2022 09:26
@DaveCTurner DaveCTurner force-pushed the 2022-04-11-worker-id-overflow branch from 596cf03 to 1de1f5a on April 11, 2022 12:51
@DaveCTurner DaveCTurner changed the title Prevent worker ID overflow Forbid port range re-use in tests Apr 11, 2022
@DaveCTurner DaveCTurner force-pushed the 2022-04-11-worker-id-overflow branch from 1de1f5a to 669bb28 on April 11, 2022 12:57
Today we select the port range to use in tests by taking the Gradle
worker ID mod 223. We now use significantly more than 223 Gradle workers
in each test run which means port ranges are re-used, resulting in
spurious failures.

This commit removes this lenience, insisting that every worker now gets
a distinct port range. It reduces the size of the port range available
to each worker to avoid running out of ports. Future changes that add
more workers may need to reduce the port range still further, or find a
different solution to keeping the tests separate, but at least will not
see spurious failures due to cross-test interactions.
@DaveCTurner DaveCTurner force-pushed the 2022-04-11-worker-id-overflow branch from 669bb28 to a71cb0a on April 11, 2022 13:08
@DaveCTurner DaveCTurner marked this pull request as ready for review April 11, 2022 14:28
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

/**
 * Defines the size of the port range assigned to each worker, which must be large enough to supply enough ports to run the tests,
 * but not so large that we run out of ports. See also [NOTE: Port ranges for tests].
 */
private static final int PORTS_PER_WORKER = 30;
@DaveCTurner (Contributor, Author)

Proposing 30 here because empirically it works in CI, whereas 40 didn't.

@henningandersen (Contributor) left a comment

I think this will not work well with the gradle daemon? Perhaps we can still do the modulus and assert something around the worker id in the build somewhere?

// use a non-default base port otherwise some cluster in this JVM might reuse a port

// We rely on Gradle implementation details here: the worker IDs are long values incremented by one for the
// lifespan of the daemon, which means that they can get larger than the allowed port range.
Contributor

If this old comment is still true, I think this change would no longer work when running gradle check multiple times locally without restarting the gradle daemon? I think that could be annoying. Perhaps it works differently now?

I am not sure if CI restarts the daemon either.

@DaveCTurner (Contributor, Author)

Darn it, yes. I've asked @elastic/es-delivery for some input here. I would hope we don't use the daemon in CI (at least we didn't seem to in the runs this PR has had so far), but it'd be good to confirm this.

Contributor

@henningandersen is right. The id generator we rely on here is created as part of a shared service in Gradle (see https://github.com/gradle/gradle/blob/394c3176354b8bb6565a0c92f1ea93175608f163/subprojects/core/src/main/java/org/gradle/internal/service/scopes/GradleUserHomeScopeServices.java#L185). These services are reused across multiple builds and operations, scoped to a particular Gradle user home directory.
My understanding is that we use the daemon locally AND on CI these days.

Contributor

Could we capture an offset very early in the build? I.e., capture the initial_worker_id and use worker_id - initial_worker_id for the port range?

@rjernst (Member) commented Apr 11, 2022

Could we just use an ephemeral port? There are only a handful of places we use getPortRange(); I just replaced those with 0, and the tests that use it appear to pass. Looking at each of these, most of the time we create a transport with the settings and then introspect the port that was bound anyway. I don't see anywhere that we take the port returned from getPortRange and try to use it directly; we always just pass it to the settings and later figure out which port was used.
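
As a self-contained illustration of the ephemeral-port idea (a plain JDK ServerSocket here purely for demonstration; the tests construct a transport and then read its bound address):

import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

public class EphemeralPortExample {
    public static void main(String[] args) throws IOException {
        // Port 0 asks the operating system to pick any free ephemeral port.
        try (ServerSocket socket = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            // Introspect which port was actually bound, much like the tests do with the transport.
            System.out.println("bound to port " + socket.getLocalPort());
        }
    }
}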

@DaveCTurner (Contributor, Author)

I think this falls down (nondeterministically) in tests that have multiple transport services in play if they shut one of them down while the other ones are still trying to reconnect. In that case another test may come along and grab the just-released port, receive a load of unexpected handshakes and/or other traffic, and fail.

@rjernst (Member) commented Apr 11, 2022

Ok, I understand how that might be needed for multi-node tests (i.e. ESIntegTestCase). But it seems most of the places using this getPortRange() method are just constructing a single transport service directly. Could we at least isolate this to ESIntegTestCase, so that test writers are not tempted to use this shared method, which really shouldn't be necessary except in this one special case?

@rjernst (Member) commented Apr 11, 2022

(Note that I understand my request is unrelated to the original issue here; it is just something I noticed while reviewing, and it would at least reduce the re-use a bit.)

@mark-vieira (Contributor)

We've been chatting on Slack a bit about this, but I wanted to share a summary here for posterity.

The concerns around the daemon are somewhat irrelevant; the core problem exists even with a single-use daemon in a single test invocation. That is, we now have a strict upper limit on the number of test workers that can be launched by a single build. Eventually we will hit this limit, and in fact we might already have: our PR checks run subsets of the entire test suite in parallel for efficiency, but later on we have jobs that run all tests in a single build execution, and those might fail. So essentially we're adding a time bomb to our build with no idea when it might go off. Not ideal.

Instead we'll go forward with this change but add back the leniency bit from before. We still get the benefit of narrower port ranges, so the likelihood of overlap is reduced. We can then log when we're re-using a port range that has been used before, at least giving us some insight into why we might be seeing funny failures.

Long term we'll need a better solution here which I've described in #85786.
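
A rough sketch of that lenient approach (illustrative names only, not the actual change): keep the modulus-based wraparound but warn whenever a worker ID wraps past the number of distinct ranges, so re-used ranges are visible when investigating odd failures.

static int basePortFor(long workerId, int minPort, int portsPerWorker, int rangeCount) {
    if (workerId >= rangeCount) {
        // Surface the re-use so that cross-test failures can be traced back to a shared port range.
        System.err.println("[WARN] worker id " + workerId + " wraps past " + rangeCount
            + " distinct port ranges; a previously used range will be re-used");
    }
    return minPort + (int) (workerId % rangeCount) * portsPerWorker;
}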

@DaveCTurner (Contributor, Author)

Ok, I reinstated the wraparound behaviour but left PORTS_PER_WORKER at 30 which should help a bunch.

@DaveCTurner DaveCTurner changed the title Forbid port range re-use in tests Reduce port range re-use in tests Apr 12, 2022
@arteam arteam added v8.1.4 and removed v8.1.3 labels Apr 12, 2022
@henningandersen (Contributor) left a comment

LGTM.

@DaveCTurner DaveCTurner merged commit 4f58cf2 into elastic:master Apr 13, 2022
@DaveCTurner DaveCTurner deleted the 2022-04-11-worker-id-overflow branch April 13, 2022 07:06
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Apr 13, 2022
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Apr 13, 2022
@elasticsearchmachine (Collaborator)

💔 Backport failed

Branch | Result
7.17   | Commit could not be cherry-picked due to conflicts
8.2    |
8.1    |

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 85777

elasticsearchmachine pushed a commit that referenced this pull request Apr 13, 2022
elasticsearchmachine pushed a commit that referenced this pull request Apr 13, 2022
DaveCTurner added a commit that referenced this pull request Apr 13, 2022
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Apr 13, 2022
…n/elasticsearch into datastream-reuse-pipeline-source

* 'datastream-reuse-pipeline-source' of github.com:weizijun/elasticsearch: (28 commits)
  Add JDK 19 to Java testing matrix
  [ML] add nlp config update serialization tests (elastic#85867)
  [ML] A text categorization aggregation that works like ML categorization (elastic#80867)
  [ML] Fix serialisation of text embedding updates (elastic#85863)
  TSDB: fix wrong initial value of tsidOrd in TimeSeriesIndexSearcher (elastic#85713)
  Enforce external id uniqueness during DesiredNode construction (elastic#84227)
  Fix Intellij integration (elastic#85866)
  Upgrade Azure SDK to version 12.14.4 (elastic#83884)
  [discovery-gce] Fix initialisation of transport in FIPS mode (elastic#85817)
  Remove unnecessary docs/changelog/85534.yaml
  Prevent ThreadContext header leak when sending response (elastic#68649)
  Add support for impact_areas to health impacts  (elastic#85830)
  Reduce port range re-use in tests (elastic#85777)
  Fix TranslogTests#testStats (elastic#85828)
  Remove hppc from cat allocation api (elastic#85842)
  Fix BuildTests serialization (elastic#85827)
  Use urgent priority for node shutdown cluster state update (elastic#85838)
  Remove Task classes from HLRC (elastic#85835)
  Remove unused migration classes (elastic#85834)
  Remove uses of Charset name parsing (elastic#85795)
  ...
Labels
:Distributed/Network (Http and internode communication implementations), Team:Distributed (Meta label for distributed team), >test (Issues or PRs that are addressing/adding tests), v7.17.3, v8.1.4, v8.2.1, v8.3.0
8 participants