Reduce port range re-use in tests #85777

Merged: 6 commits merged into elastic:master from the 2022-04-11-worker-id-overflow branch on Apr 13, 2022

Conversation

@DaveCTurner (Contributor) commented Apr 11, 2022

Today we select the port range to use in tests by taking the Gradle
worker ID mod 223. We now use significantly more than 223 Gradle workers
in each test run which means port ranges are re-used, resulting in
spurious failures.

This commit replaces the magic number 223 with a computation defined in
terms of the available ports and the number of ports to allocate per
worker. It also reduces the number of ports per worker to 30 to try and
reduce the frequency of port clashes.
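
For illustration, a minimal sketch of that computation in Java follows; the port bounds and the class/method names here are assumptions for the sketch, not the actual Elasticsearch test-framework code:

public final class TestPortRanges {
    // Assumed usable port window for tests; the real bounds live in the test framework.
    private static final int MIN_PORT = 10_300;
    private static final int MAX_PORT = 32_767;

    // Smaller per-worker ranges leave room for more distinct ranges before any re-use.
    private static final int PORTS_PER_WORKER = 30;

    // Replaces the magic 223: the number of distinct ranges follows from the port space.
    private static final int RANGE_COUNT = (MAX_PORT - MIN_PORT + 1) / PORTS_PER_WORKER;

    static int basePortFor(long gradleWorkerId) {
        // Wrapping around keeps very large worker IDs usable, at the cost of possible re-use.
        int rangeIndex = (int) (gradleWorkerId % RANGE_COUNT);
        return MIN_PORT + rangeIndex * PORTS_PER_WORKER;
    }
}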

@DaveCTurner DaveCTurner added >test Issues or PRs that are addressing/adding tests :Distributed/Network Http and internode communication implementations v8.3.0 labels Apr 11, 2022
@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team label Apr 11, 2022
@DaveCTurner DaveCTurner marked this pull request as draft April 11, 2022 09:26
@DaveCTurner DaveCTurner force-pushed the 2022-04-11-worker-id-overflow branch from 596cf03 to 1de1f5a on April 11, 2022 12:51
@DaveCTurner DaveCTurner changed the title Prevent worker ID overflow Forbid port range re-use in tests Apr 11, 2022
@DaveCTurner DaveCTurner force-pushed the 2022-04-11-worker-id-overflow branch from 1de1f5a to 669bb28 on April 11, 2022 12:57
Today we select the port range to use in tests by taking the Gradle
worker ID mod 223. We now use significantly more than 223 Gradle workers
in each test run which means port ranges are re-used, resulting in
spurious failures.

This commit removes this lenience, insisting that every worker now gets
a distinct port range. It reduces the size of the port range available
to each worker to avoid running out of ports. Future changes that add
more workers may need to reduce the port range still further, or find a
different solution to keeping the tests separate, but at least will not
see spurious failures due to cross-test interactions.
@DaveCTurner DaveCTurner force-pushed the 2022-04-11-worker-id-overflow branch from 669bb28 to a71cb0a on April 11, 2022 13:08
@DaveCTurner DaveCTurner marked this pull request as ready for review April 11, 2022 14:28
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

/**
 * Defines the size of the port range assigned to each worker, which must be large enough to supply enough ports to run the tests,
 * but not so large that we run out of ports. See also [NOTE: Port ranges for tests].
 */
private static final int PORTS_PER_WORKER = 30;
@DaveCTurner (Contributor, Author)

Proposing 30 here because empirically it works in CI, whereas 40 didn't.

@henningandersen (Contributor) left a comment

I think this will not work well with the gradle daemon? Perhaps we can still do the modulus and assert something around the worker id in the build somewhere?

// use a non-default base port otherwise some cluster in this JVM might reuse a port

// We rely on Gradle implementation details here: the worker IDs are long values incremented by one for the
// lifespan of the daemon, which means that they can get larger than the allowed port range.
Contributor

If this old comment is still true, I think this change would no longer work when running gradle check multiple times locally without restarting the gradle daemon? I think that could be annoying. Perhaps it works differently now?

I am not sure if CI restarts the daemon either.

@DaveCTurner (Contributor, Author)

Darn it, yes. I've asked @elastic/es-delivery for some input here. I would hope we don't use the daemon in CI (at least we didn't seem to in the runs this PR has had so far), but it'd be good to confirm this.

Contributor

@henningandersen is right. The id generator we rely on here is created as part of a shared service in Gradle (see https://github.com/gradle/gradle/blob/394c3176354b8bb6565a0c92f1ea93175608f163/subprojects/core/src/main/java/org/gradle/internal/service/scopes/GradleUserHomeScopeServices.java#L185). These services are reused across multiple builds and operations, scoped to a particular Gradle user home directory.
My understanding is that we use the daemon locally AND on CI these days.

Contributor

Could we capture an offset very early in the build? I.e., capture the initial_worker_id and use worker_id - initial_worker_id for the port range?

@rjernst (Member) commented Apr 11, 2022

Could we just use an ephemeral port? There are only a handful of places we use getPortRange(); I just replaced those with 0, and the tests that use it appear to pass. Looking at each of these, most of the time we create a transport with the settings and then introspect the port that was bound anyway. I don't see anywhere that we take the port returned from getPortRange and try to use it directly; we always just pass it to the settings and later figure out which port was used.
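
As a self-contained illustration of the ephemeral-port idea (a plain JDK ServerSocket here purely for demonstration; the tests construct a transport and then read its bound address):

import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

public class EphemeralPortExample {
    public static void main(String[] args) throws IOException {
        // Port 0 asks the operating system to pick any free ephemeral port.
        try (ServerSocket socket = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            // Introspect which port was actually bound, much like the tests do with the transport.
            System.out.println("bound to port " + socket.getLocalPort());
        }
    }
}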

@DaveCTurner (Contributor, Author)

I think this falls down (nondeterministically) in tests that have multiple transport services in play if they shut one of them down while the other ones are still trying to reconnect. In that case another test may come along and grab the just-released port, receive a load of unexpected handshakes and/or other traffic, and fail.

@rjernst (Member) commented Apr 11, 2022

Ok, I understand how that might be needed for multi-node tests (i.e. ESIntegTestCase). But it seems most of the places using this getPortRange() method are just constructing a single transport service directly. Could we at least isolate this to ESIntegTestCase, so that test writers are not tempted to use this shared method, which really shouldn't be necessary except in this one special case?

@rjernst (Member) commented Apr 11, 2022

(Note that I understand my request is unrelated to the original issue here; it is just something I noticed while reviewing, and it would at least reduce the re-use a bit.)

@mark-vieira (Contributor)

We've been chatting on Slack a bit about this, but I wanted to share a summary here for posterity.

The concerns around the daemon are somewhat irrelevant; the core problem exists even with a single-use daemon in a single test invocation. That is, we now have a strict upper limit on the number of test workers that can be launched by a single build. Eventually we will hit this limit, and in fact we might already have: our PR checks run subsets of the entire test suite in parallel for efficiency, but later on we have jobs that run all tests in a single build execution, and those might fail. So essentially we're adding a time bomb to our build with no idea when it might go off. Not ideal.

Instead we'll go forward with this change but add back the leniency bit from before. We still get the benefit of narrower port ranges, so the likelihood of overlap is reduced. We can then log when we're re-using a port range that has been used before, at least giving us some insight into why we might be seeing funny failures.

Long term we'll need a better solution here which I've described in #85786.
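
A rough sketch of that lenient approach (illustrative names only, not the actual change): keep the modulus-based wraparound but warn whenever a worker ID wraps past the number of distinct ranges, so re-used ranges are visible when investigating odd failures.

static int basePortFor(long workerId, int minPort, int portsPerWorker, int rangeCount) {
    if (workerId >= rangeCount) {
        // Surface the re-use so that cross-test failures can be traced back to a shared port range.
        System.err.println("[WARN] worker id " + workerId + " wraps past " + rangeCount
            + " distinct port ranges; a previously used range will be re-used");
    }
    return minPort + (int) (workerId % rangeCount) * portsPerWorker;
}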

@DaveCTurner (Contributor, Author)

Ok, I reinstated the wraparound behaviour but left PORTS_PER_WORKER at 30 which should help a bunch.

@DaveCTurner DaveCTurner changed the title Forbid port range re-use in tests Reduce port range re-use in tests Apr 12, 2022
@arteam arteam added v8.1.4 and removed v8.1.3 labels Apr 12, 2022
@henningandersen (Contributor) left a comment

LGTM.

@DaveCTurner DaveCTurner merged commit 4f58cf2 into elastic:master Apr 13, 2022
@DaveCTurner DaveCTurner deleted the 2022-04-11-worker-id-overflow branch April 13, 2022 07:06
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Apr 13, 2022
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Apr 13, 2022
@elasticsearchmachine (Collaborator)

💔 Backport failed

Branch | Result
7.17   | Commit could not be cherry-picked due to conflicts
8.2    |
8.1    |

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 85777

elasticsearchmachine pushed a commit that referenced this pull request Apr 13, 2022
elasticsearchmachine pushed a commit that referenced this pull request Apr 13, 2022
DaveCTurner added a commit that referenced this pull request Apr 13, 2022
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Apr 13, 2022
…n/elasticsearch into datastream-reuse-pipeline-source

* 'datastream-reuse-pipeline-source' of github.com:weizijun/elasticsearch: (28 commits)
  Add JDK 19 to Java testing matrix
  [ML] add nlp config update serialization tests (elastic#85867)
  [ML] A text categorization aggregation that works like ML categorization (elastic#80867)
  [ML] Fix serialisation of text embedding updates (elastic#85863)
  TSDB: fix wrong initial value of tsidOrd in TimeSeriesIndexSearcher (elastic#85713)
  Enforce external id uniqueness during DesiredNode construction (elastic#84227)
  Fix Intellij integration (elastic#85866)
  Upgrade Azure SDK to version 12.14.4 (elastic#83884)
  [discovery-gce] Fix initialisation of transport in FIPS mode (elastic#85817)
  Remove unnecessary docs/changelog/85534.yaml
  Prevent ThreadContext header leak when sending response (elastic#68649)
  Add support for impact_areas to health impacts  (elastic#85830)
  Reduce port range re-use in tests (elastic#85777)
  Fix TranslogTests#testStats (elastic#85828)
  Remove hppc from cat allocation api (elastic#85842)
  Fix BuildTests serialization (elastic#85827)
  Use urgent priority for node shutdown cluster state update (elastic#85838)
  Remove Task classes from HLRC (elastic#85835)
  Remove unused migration classes (elastic#85834)
  Remove uses of Charset name parsing (elastic#85795)
  ...
Labels
:Distributed/Network (Http and internode communication implementations), Team:Distributed (Meta label for distributed team), >test (Issues or PRs that are addressing/adding tests), v7.17.3, v8.1.4, v8.2.1, v8.3.0
8 participants