[CI] IpFilteringUpdateTests testThatInvalidDynamicIpFilterConfigurationIsRejected failing #102349

Closed
mark-vieira opened this issue Nov 18, 2023 · 15 comments · Fixed by #103894
Labels:

  • low-risk: An open issue or test failure that is a low risk to future releases
  • :Security/Security: Security issues without another label
  • Team:Security: Meta label for security team
  • >test-failure: Triaged test failures from CI

Comments

@mark-vieira
Contributor

Looks like these tests are trying to use an occupied port. We have various logic in our base test classes to ensure this doesn't happen but sounds like this test is doing something unique that's making this condition more likely. This happens across all tests in IpFilteringUpdateTests and I see similar errors in SslMultiPortTests as well.

Build scan:
https://gradle-enterprise.elastic.co/s/jiktat3ht3ye6/tests/:x-pack:plugin:security:internalClusterTest/org.elasticsearch.xpack.security.transport.filter.IpFilteringUpdateTests/testThatInvalidDynamicIpFilterConfigurationIsRejected

Reproduction line:

gradlew ':x-pack:plugin:security:internalClusterTest' --tests "org.elasticsearch.xpack.security.transport.filter.IpFilteringUpdateTests.testThatInvalidDynamicIpFilterConfigurationIsRejected" -Dtests.seed=DDEE06167C7FD912 -Dtests.locale=es-PE -Dtests.timezone=Etc/GMT+10 -Druntime.java=21

Applicable branches:
main, 8.11, 7.17

Reproduces locally?:
Didn't try

Failure history:
https://es-delivery-stats.elastic.dev/app/dashboards#/view/dcec9e60-72ac-11ee-8f39-55975ded9e63?_g=(refreshInterval:(pause:!t,value:60000),time:(from:now-7d%2Fd,to:now))&_a=(controlGroupInput:(chainingSystem:HIERARCHICAL,controlStyle:twoLine,ignoreParentSettings:(ignoreFilters:!f,ignoreQuery:!f,ignoreTimerange:!f,ignoreValidations:!t),panels:('0c0c9cb8-ccd2-45c6-9b13-96bac4abc542':(explicitInput:(dataViewId:fbbdc689-be23-4b3d-8057-aa402e9ed0c5,enhancements:(),fieldName:task.keyword,grow:!t,id:'0c0c9cb8-ccd2-45c6-9b13-96bac4abc542',searchTechnique:wildcard,selectedOptions:!(),singleSelect:!t,title:'Gradle%20Task',width:medium),grow:!t,order:0,type:optionsListControl,width:small),'144933da-5c1b-4257-a969-7f43455a7901':(explicitInput:(dataViewId:fbbdc689-be23-4b3d-8057-aa402e9ed0c5,enhancements:(),fieldName:name.keyword,grow:!t,id:'144933da-5c1b-4257-a969-7f43455a7901',searchTechnique:wildcard,selectedOptions:!('testThatInvalidDynamicIpFilterConfigurationIsRejected'),title:Test,width:medium),grow:!t,order:2,type:optionsListControl,width:medium),'4e6ad9d6-6fdc-4fcc-bf1a-aa6ca79e0850':(explicitInput:(dataViewId:fbbdc689-be23-4b3d-8057-aa402e9ed0c5,enhancements:(),fieldName:className.keyword,grow:!t,id:'4e6ad9d6-6fdc-4fcc-bf1a-aa6ca79e0850',searchTechnique:wildcard,selectedOptions:!('org.elasticsearch.xpack.security.transport.filter.IpFilteringUpdateTests'),title:Suite,width:medium),grow:!t,order:1,type:optionsListControl,width:medium))))

Failure excerpt:

org.elasticsearch.transport.BindTransportException: Failed to bind to [::1]:[51805-51905]

  at __randomizedtesting.SeedInfo.seed([DDEE06167C7FD912:4F2E25CCAC535C51]:0)
  at org.elasticsearch.transport.TcpTransport.bindToPort(TcpTransport.java:453)
  at org.elasticsearch.transport.TcpTransport.bindServer(TcpTransport.java:414)
  at org.elasticsearch.transport.nio.NioTransport.doStart(NioTransport.java:101)
  at org.elasticsearch.xpack.security.transport.nio.SecurityNioTransport.doStart(SecurityNioTransport.java:113)
  at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:48)
  at org.elasticsearch.transport.TransportService.doStart(TransportService.java:318)
  at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:48)
  at org.elasticsearch.node.Node.start(Node.java:1176)
  at org.elasticsearch.test.InternalTestCluster$NodeAndClient.startNode(InternalTestCluster.java:1057)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
  at java.util.concurrent.FutureTask.run(FutureTask.java:317)
  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
  at java.lang.Thread.run(Thread.java:1583)

  Caused by: org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: Failed execution

    at org.elasticsearch.common.util.concurrent.FutureUtils.rethrowExecutionException(FutureUtils.java:80)
    at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:50)
    at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:26)
    at org.elasticsearch.transport.nio.NioTransport.bind(NioTransport.java:76)
    at org.elasticsearch.transport.nio.NioTransport.bind(NioTransport.java:45)
    at org.elasticsearch.transport.TcpTransport.lambda$bindToPort$6(TcpTransport.java:441)
    at org.elasticsearch.common.transport.PortsRange.iterate(PortsRange.java:58)
    at org.elasticsearch.transport.TcpTransport.bindToPort(TcpTransport.java:439)
    at org.elasticsearch.transport.TcpTransport.bindServer(TcpTransport.java:414)
    at org.elasticsearch.transport.nio.NioTransport.doStart(NioTransport.java:101)
    at org.elasticsearch.xpack.security.transport.nio.SecurityNioTransport.doStart(SecurityNioTransport.java:113)
    at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:48)
    at org.elasticsearch.transport.TransportService.doStart(TransportService.java:318)
    at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:48)
    at org.elasticsearch.node.Node.start(Node.java:1176)
    at org.elasticsearch.test.InternalTestCluster$NodeAndClient.startNode(InternalTestCluster.java:1057)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
    at java.util.concurrent.FutureTask.run(FutureTask.java:317)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.lang.Thread.run(Thread.java:1583)

    Caused by: java.util.concurrent.ExecutionException: java.net.BindException: Failed to bind server socket channel {localAddress=/[0:0:0:0:0:0:0:1]:51905}.

      at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:257)
      at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:244)
      at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:75)
      at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:45)
      at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:26)
      at org.elasticsearch.transport.nio.NioTransport.bind(NioTransport.java:76)
      at org.elasticsearch.transport.nio.NioTransport.bind(NioTransport.java:45)
      at org.elasticsearch.transport.TcpTransport.lambda$bindToPort$6(TcpTransport.java:441)
      at org.elasticsearch.common.transport.PortsRange.iterate(PortsRange.java:58)
      at org.elasticsearch.transport.TcpTransport.bindToPort(TcpTransport.java:439)
      at org.elasticsearch.transport.TcpTransport.bindServer(TcpTransport.java:414)
      at org.elasticsearch.transport.nio.NioTransport.doStart(NioTransport.java:101)
      at org.elasticsearch.xpack.security.transport.nio.SecurityNioTransport.doStart(SecurityNioTransport.java:113)
      at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:48)
      at org.elasticsearch.transport.TransportService.doStart(TransportService.java:318)
      at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:48)
      at org.elasticsearch.node.Node.start(Node.java:1176)
      at org.elasticsearch.test.InternalTestCluster$NodeAndClient.startNode(InternalTestCluster.java:1057)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
      at java.util.concurrent.FutureTask.run(FutureTask.java:317)
      at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
      at java.lang.Thread.run(Thread.java:1583)

      Caused by: java.net.BindException: Failed to bind server socket channel {localAddress=/[0:0:0:0:0:0:0:1]:51905}.

        at org.elasticsearch.nio.ServerChannelContext.register(ServerChannelContext.java:76)
        at org.elasticsearch.nio.EventHandler.handleRegistration(EventHandler.java:54)
        at org.elasticsearch.nio.NioSelector.registerChannel(NioSelector.java:443)
        at org.elasticsearch.nio.NioSelector.setUpNewChannels(NioSelector.java:435)
        at org.elasticsearch.nio.NioSelector.preSelect(NioSelector.java:256)
        at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:149)
        at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:125)
        at java.lang.Thread.run(Thread.java:1583)

        Caused by: java.net.BindException: Address already in use: bind

          at sun.nio.ch.Net.bind0(Net.java:-2)
          at sun.nio.ch.Net.bind(Net.java:565)
          at sun.nio.ch.ServerSocketChannelImpl.netBind(ServerSocketChannelImpl.java:344)
          at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:301)
          at java.nio.channels.ServerSocketChannel.bind(ServerSocketChannel.java:224)
          at org.elasticsearch.nio.ServerChannelContext.register(ServerChannelContext.java:73)
          at org.elasticsearch.nio.EventHandler.handleRegistration(EventHandler.java:54)
          at org.elasticsearch.nio.NioSelector.registerChannel(NioSelector.java:443)
          at org.elasticsearch.nio.NioSelector.setUpNewChannels(NioSelector.java:435)
          at org.elasticsearch.nio.NioSelector.preSelect(NioSelector.java:256)
          at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:149)
          at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:125)
          at java.lang.Thread.run(Thread.java:1583)

@mark-vieira mark-vieira added :Security/Security Security issues without another label >test-failure Triaged test failures from CI labels Nov 18, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-security (Team:Security)

@elasticsearchmachine elasticsearchmachine added the Team:Security Meta label for security team label Nov 18, 2023
@slobodanadamovic
Contributor

slobodanadamovic commented Nov 20, 2023

The assessment is the same (low-risk) as in #101591 (comment) and #101615 (comment).

These tests started failing recently with the same cause. I'm wondering if this is somehow related to the recent migration to Buildkite?

@slobodanadamovic slobodanadamovic added low-risk An open issue or test failure that is a low risk to future releases and removed blocker labels Nov 20, 2023
@slobodanadamovic
Contributor

The issue might have persisted even before the migration to Buildkite. Wondering if it surfaced now because many jobs are now grouped together under one pipeline?

@mark-vieira
Contributor Author

The issue might have persisted even before the migration to Buildkite. Wondering if it surfaced now because many jobs are now grouped together under one pipeline?

They still all run on independent hosts though. There's been no change to the number of tests that run on any given agent.

@slobodanadamovic
Contributor

slobodanadamovic commented Dec 19, 2023

This issue is the same in nature as #102870 and #101615: we try to bind to a range of 100 ports and fail.

Some observations:

  • tests started failing after the migration to Buildkite
  • failures are only reproducible on Windows Server
  • SslMultiPortTests, IpFilteringUpdateTests and PkiOptionalClientAuthTests all use the same random range: 49000 - 65500
  • Windows defines its dynamic port range as 49152 to 65535
  • Windows also allows customizing the dynamic port range to exclude certain ports from it

After some searching, I think the issue is the same as the one described in docker/for-win#3171, where some high ports from the dynamic range are excluded on Windows. Attempting to bind to any of these ports fails with an Address already in use: bind error.
To confirm this, we would need to run the following command on Windows: netsh int ipv4 show excludedportrange tcp
@mark-vieira Is it possible to run this command on a Windows test worker? This would help us confirm the above hypothesis.

Potential workarounds/solutions:

  • disable port exclusion on Windows (see this blogpost which explains how to do it)
  • increase the test's port range to e.g. 250 in order to lower the probability of this issue happening again

Another issue, unrelated to this problem:

  • These tests could attempt to bind to an unavailable port in the ranges 49000 - 49151 and 65536 - 65600.
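
To illustrate the last point, here is a minimal sketch of how a test could pick its random port window so that it never leaves the Windows dynamic range. The class and method names are hypothetical and are not taken from the Elasticsearch test framework; the bounds 49152 and 65535 come from the observations above.

```java
import java.util.Random;

// Hypothetical helper, not part of the Elasticsearch test framework.
final class PortWindowSketch {
    static final int DYNAMIC_MIN = 49152; // start of the Windows dynamic range noted above
    static final int MAX_PORT = 65535;    // highest valid TCP port

    /** Returns {first, last} of a window of `size` ports that lies fully inside [DYNAMIC_MIN, MAX_PORT]. */
    static int[] randomWindow(Random random, int size) {
        if (size < 1 || size > MAX_PORT - DYNAMIC_MIN + 1) {
            throw new IllegalArgumentException("window size out of bounds: " + size);
        }
        // Highest start that still keeps first + size - 1 <= MAX_PORT.
        int highestStart = MAX_PORT - size + 1;
        int first = DYNAMIC_MIN + random.nextInt(highestStart - DYNAMIC_MIN + 1);
        return new int[] { first, first + size - 1 };
    }

    public static void main(String[] args) {
        int[] window = randomWindow(new Random(), 100);
        System.out.println("window: " + window[0] + "-" + window[1]);
    }
}
```

Clamping the start rather than the end keeps every window the same size, so the chance of finding a free port does not shrink near the top of the range.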

@mark-vieira
Contributor Author

@brianseeders
Contributor

It does now! I just got my local changes for it fixed up and pushed.

@slobodanadamovic see:
https://github.com/elastic/elasticsearch-infra/tree/master/buildkite-tools#setup
https://github.com/elastic/elasticsearch-infra/tree/master/buildkite-tools#agent-instance

This will create a Windows instance in GCP for you, configured the same way as the Buildkite worker. It will SSH into the box for you as well.

Then, you can:

  • type bash to get a bash prompt, or powershell, if you'd like (instead of cmd.exe)
  • cd dev/elasticsearch
  • Run whatever you want. GRADLE_TASK="checkPart2" .buildkite/scripts/windows-run-gradle.sh will run checkPart2 like CI.

@slobodanadamovic slobodanadamovic self-assigned this Dec 20, 2023
@slobodanadamovic
Contributor

slobodanadamovic commented Dec 20, 2023

@brianseeders

First, I had an issue creating an instance because my username contains a . character. I was able to solve it by hardcoding the instance name prefix instead of using the username in create-instance.js.

Error: Command failed: /Users/slobodan.adamovic/git/elasticsearch-infra/buildkite-tools/agent-instance/temp.sh
ERROR: (gcloud.compute.instances.create) Could not fetch resource:
 - Invalid value for field 'resource.name': 'slobodan.adamovic-1703060975975'. Must be a match of regex '(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)'

    at genericNodeError (node:internal/errors:956:15)
    at wrappedFn (node:internal/errors:510:14)
    at checkExecSyncError (node:child_process:890:11)
    at Object.execSync (node:child_process:962:15)
    at /Users/slobodan.adamovic/git/elasticsearch-infra/buildkite-tools/agent-instance/create-instance.js:74:17

Now ./agent-instance.sh <buildkite_job_url> is stuck trying to SSH into the created instance:

Creation complete.
Waiting until SSH is available...
Trying...
Trying...
Trying...
Trying...
Trying...
Trying...
Trying...
Trying...
Trying...

Any idea what I'm missing here?

@slobodanadamovic
Contributor

OK, after manually executing the SSH command (gcloud compute ssh --project "elastic-elasticsearch" "buildkite-agent@my-instance-name-1703074956565" --zone="us-west1-a" --ssh-flag="-o ConnectTimeout=10 -q") as buildkite-agent and making other small adjustments to the script, I was able to log in.

Running $ netsh int ipv4 show excludedportrange tcp shows two consecutive ranges of 100 ports each being excluded, for a total of 200 excluded ports (49927-50126):

Protocol tcp Port Exclusion Ranges

Start Port    End Port      
----------    --------      
      5985        5985      
      5986        5986      
     47001       47001      
     49927       50026      
     50027       50126      

* - Administered port exclusions.
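
As a rough sketch of how this output could be checked against a test's port window, the snippet below parses the two numeric columns and reports whether every port in a window is excluded. The class and method names are hypothetical, and the parser assumes the column layout shown above, optionally with a trailing `*` on administered rows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for inspecting `netsh int ipv4 show excludedportrange tcp` output.
final class ExcludedPortRanges {
    // A data row is two integers, optionally followed by '*' (administered exclusion).
    private static final Pattern ROW = Pattern.compile("^\\s*(\\d+)\\s+(\\d+)\\s*\\*?\\s*$");

    /** Extracts {start, end} pairs; header, separator and footnote lines are skipped. */
    static List<int[]> parse(String netshOutput) {
        List<int[]> ranges = new ArrayList<>();
        for (String line : netshOutput.split("\\R")) {
            Matcher m = ROW.matcher(line);
            if (m.matches()) {
                ranges.add(new int[] { Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)) });
            }
        }
        return ranges;
    }

    static boolean isExcluded(int port, List<int[]> ranges) {
        for (int[] range : ranges) {
            if (port >= range[0] && port <= range[1]) {
                return true;
            }
        }
        return false;
    }

    /** True when no port in [first, last] can be bound because all of them are excluded. */
    static boolean fullyExcluded(int first, int last, List<int[]> ranges) {
        for (int port = first; port <= last; port++) {
            if (!isExcluded(port, ranges)) {
                return false;
            }
        }
        return true;
    }
}
```

With the ranges listed above, a window of 49927-50026 is fully excluded, while a window that extends past 50126 still contains at least one port that is not excluded.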

After that, I manually hardcoded these port ranges in the IpFilteringIntegrationTests and got the same error:

REPRODUCE WITH: gradlew ':x-pack:plugin:security:internalClusterTest' --tests "org.elasticsearch.xpack.security.transport.filter.IpFilteringIntegrationTests.testThatIpFilteringIsAppliedForProfile" -Dtests.seed=DDEE06167C7FD912 -Dtests.locale=es-PE -Dtests.timezone=Etc/GMT+10 -Druntime.java=21

org.elasticsearch.xpack.security.transport.filter.IpFilteringIntegrationTests > testThatIpFilteringIsAppliedForProfile FAILED
    org.elasticsearch.transport.BindTransportException: Failed to bind to 127.0.0.1:[49927-50026]
        at app//org.elasticsearch.transport.TcpTransport.bindToPort(TcpTransport.java:504)
        at app//org.elasticsearch.transport.TcpTransport.bindServer(TcpTransport.java:465)
        at app//org.elasticsearch.transport.netty4.Netty4Transport.doStart(Netty4Transport.java:154)
        at app//org.elasticsearch.xpack.core.security.transport.netty4.SecurityNetty4Transport.doStart(SecurityNetty4Transport.java:126)
        at app//org.elasticsearch.xpack.security.transport.netty4.SecurityNetty4ServerTransport.doStart(SecurityNetty4ServerTransport.java:62)
        at app//org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:50)
        at app//org.elasticsearch.transport.TransportService.doStart(TransportService.java:332)
        at app//org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:50)
        at app//org.elasticsearch.node.Node.start(Node.java:303)
        at app//org.elasticsearch.test.InternalTestCluster$NodeAndClient.startNode(InternalTestCluster.java:986)
        at app//org.elasticsearch.test.InternalTestCluster.getOrBuildRandomNode(InternalTestCluster.java:647)
        at app//org.elasticsearch.test.InternalTestCluster.client(InternalTestCluster.java:819)
        at app//org.elasticsearch.test.ESIntegTestCase.client(ESIntegTestCase.java:648)
        at app//org.elasticsearch.test.ESIntegTestCase.client(ESIntegTestCase.java:641)
        at app//org.elasticsearch.test.ESIntegTestCase.admin(ESIntegTestCase.java:1550)
        at app//org.elasticsearch.test.ESIntegTestCase.clusterAdmin(ESIntegTestCase.java:1557)
        at app//org.elasticsearch.test.SecurityIntegTestCase.doAssertXPackIsInstalled(SecurityIntegTestCase.java:183)
        at app//org.elasticsearch.test.SecurityIntegTestCase.assertXPackIsInstalled(SecurityIntegTestCase.java:179)
        at java.base@21.0.1/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
        at java.base@21.0.1/java.lang.reflect.Method.invoke(Method.java:580)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:980)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
        at app//org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at app//org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
        at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at app//org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
        at app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
        at app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
        at app//com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
        at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
        at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at app//org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
        at app//com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at app//com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at app//org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
        at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
        at app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
        at app//org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
        at app//com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
        at java.base@21.0.1/java.lang.Thread.run(Thread.java:1583)

        Caused by:
        java.net.BindException: Address already in use: bind
            at java.base/sun.nio.ch.Net.bind0(Native Method)
            at java.base/sun.nio.ch.Net.bind(Net.java:565)
            at java.base/sun.nio.ch.ServerSocketChannelImpl.netBind(ServerSocketChannelImpl.java:344)
            at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:301)
            at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:141)
            at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:562)
            at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
            at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:600)
            at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:579)
            at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
            at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:260)
            at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
            at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
            at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
            at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
            at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
            at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
            at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
            ... 1 more

    org.elasticsearch.transport.BindTransportException: Failed to bind to 127.0.0.1:[49927-50026]
        at app//org.elasticsearch.transport.TcpTransport.bindToPort(TcpTransport.java:504)
        at app//org.elasticsearch.transport.TcpTransport.bindServer(TcpTransport.java:465)
        at app//org.elasticsearch.transport.netty4.Netty4Transport.doStart(Netty4Transport.java:154)
        at app//org.elasticsearch.xpack.core.security.transport.netty4.SecurityNetty4Transport.doStart(SecurityNetty4Transport.java:126)
        at app//org.elasticsearch.xpack.security.transport.netty4.SecurityNetty4ServerTransport.doStart(SecurityNetty4ServerTransport.java:62)
        at app//org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:50)
        at app//org.elasticsearch.transport.TransportService.doStart(TransportService.java:332)
        at app//org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:50)
        at app//org.elasticsearch.node.Node.start(Node.java:303)
        at app//org.elasticsearch.test.InternalTestCluster$NodeAndClient.startNode(InternalTestCluster.java:986)
        at app//org.elasticsearch.test.InternalTestCluster.getOrBuildRandomNode(InternalTestCluster.java:647)
        at app//org.elasticsearch.test.InternalTestCluster.client(InternalTestCluster.java:819)
        at app//org.elasticsearch.test.ESIntegTestCase.client(ESIntegTestCase.java:648)
        at app//org.elasticsearch.test.ESIntegTestCase.client(ESIntegTestCase.java:641)
        at app//org.elasticsearch.test.ESIntegTestCase.admin(ESIntegTestCase.java:1550)
        at app//org.elasticsearch.test.ESIntegTestCase.clusterAdmin(ESIntegTestCase.java:1557)
        at app//org.elasticsearch.test.ESIntegTestCase.afterInternal(ESIntegTestCase.java:576)
        at app//org.elasticsearch.test.ESIntegTestCase.cleanUpCluster(ESIntegTestCase.java:2304)
        at java.base@21.0.1/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
        at java.base@21.0.1/java.lang.reflect.Method.invoke(Method.java:580)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:1004)
        at app//org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at app//org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
        at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at app//org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
        at app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
        at app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
        at app//com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
        at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
        at app//com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
        at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at app//org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
        at app//com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at app//com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at app//org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
        at app//org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at app//org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
        at app//org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
        at app//org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
        at app//com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at app//com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
        at app//com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
        at java.base@21.0.1/java.lang.Thread.run(Thread.java:1583)

        Caused by:
        java.net.BindException: Address already in use: bind
            at java.base/sun.nio.ch.Net.bind0(Native Method)
            at java.base/sun.nio.ch.Net.bind(Net.java:565)
            at java.base/sun.nio.ch.ServerSocketChannelImpl.netBind(ServerSocketChannelImpl.java:344)
            at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:301)
            at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:141)
            at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:562)
            at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
            at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:600)
            at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:579)
            at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
            at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:260)
            at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
            at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
            at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
            at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
            at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
            at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
            at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
            ... 1 more

@slobodanadamovic
Contributor

I'm thinking of taking a naive approach and simply increasing the range to 500 ports (out of the 16384 in the dynamic range) in these tests.
@mark-vieira Do you see a better option?

@mark-vieira
Contributor Author

I think the issue with having larger port ranges is that we increase the likelihood of collisions between tests. Do we know why those ports are being excluded? Is there something we can change on our CI agents such that it no longer reserves these ports?

@slobodanadamovic
Contributor

slobodanadamovic commented Dec 21, 2023

Do we know why those ports are being excluded? Is there something we can change on our CI agents such that it no longer reserves these ports?

Unfortunately, I wasn't able to track down what excluded these ports or why.
What I did notice is that every time I restart the instance, a different range of 200 ports is excluded:

Protocol tcp Port Exclusion Ranges

Start Port    End Port
----------    --------
      5985        5985
      5986        5986
     47001       47001
     55812       55911
     55912       56011

* - Administered port exclusions.
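To make the effect of that exclusion concrete, here is a rough sketch (not code from the Elasticsearch repository) that computes the worst case for a test port range of a given width. It assumes the tests pick a consecutive range somewhere inside the dynamic range 49152-65535 and uses the exclusions from the output above; the class and method names are hypothetical.

```java
/** Hypothetical helper: worst-case number of usable ports in a randomly placed range. */
final class PortRangeCoverage {
    // Dynamic port range discussed in this thread (assumed to be where tests pick ranges).
    static final int DYNAMIC_MIN = 49152;
    static final int DYNAMIC_MAX = 65535;

    // Inclusive {start, end} pairs copied from the exclusion output above.
    static final int[][] EXCLUDED = {
        {5985, 5985}, {5986, 5986}, {47001, 47001}, {55812, 55911}, {55912, 56011}
    };

    /** Counts excluded ports inside the inclusive range [start, end]. */
    static int excludedIn(int start, int end) {
        int count = 0;
        for (int[] r : EXCLUDED) {
            int lo = Math.max(start, r[0]);
            int hi = Math.min(end, r[1]);
            count += Math.max(0, hi - lo + 1);
        }
        return count;
    }

    /** Tries every placement of a range of the given width and returns the fewest usable ports. */
    static int worstCaseUsable(int width) {
        int worst = width;
        for (int start = DYNAMIC_MIN; start + width - 1 <= DYNAMIC_MAX; start++) {
            int usable = width - excludedIn(start, start + width - 1);
            worst = Math.min(worst, usable);
        }
        return worst;
    }

    public static void main(String[] args) {
        for (int width : new int[] {100, 300, 500}) {
            System.out.println(width + " ports -> worst case usable: " + worstCaseUsable(width));
        }
    }
}
```

With these exclusions, a 100-port range can land entirely inside the excluded block (0 usable ports), while a 300-port range always keeps at least 100 usable ports and a 500-port range at least 300.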

I think the issue with having larger port ranges is that we increase the likelihood of collisions between tests.

AFAIK this would not be an issue in these tests. Collisions are anticipated and handled as long as we can find at least one available TCP port. The binding is done with the first available port in the range. If we are unable to find a free port in the range, we fail with BindTransportException (e.g. org.elasticsearch.transport.BindTransportException: Failed to bind to 127.0.0.1:[49927-50026]).

boolean success = portsRange.iterate(portNumber -> {
    try {
        TcpServerChannel channel = bind(name, new InetSocketAddress(hostAddress, portNumber));
        serverChannels.computeIfAbsent(name, k -> new ArrayList<>()).add(channel);
        boundSocket.set(channel.getLocalAddress());
    } catch (Exception e) {
        lastException.set(e);
        return false;
    }
    return true;
});
if (success == false) {
    throw new BindTransportException(
        "Failed to bind to " + NetworkAddress.format(hostAddress, portsRange),
        lastException.get()
    );
}
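
For anyone who wants to see the same "first available port wins" behaviour outside Elasticsearch, here is a minimal standalone sketch using only java.net. It is an illustration under my own assumptions, not the transport code: FirstFreePortBinder and bindFirstFree are made-up names, and a plain ServerSocket stands in for the Netty server channel.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

/** Hypothetical illustration of binding to the first free port in an inclusive range. */
final class FirstFreePortBinder {

    static ServerSocket bindFirstFree(InetAddress host, int from, int to) throws IOException {
        IOException last = null;
        for (int port = from; port <= to; port++) {
            ServerSocket socket = new ServerSocket();
            try {
                socket.bind(new InetSocketAddress(host, port));
                return socket; // first port that binds successfully wins
            } catch (IOException e) {
                last = e;       // remember the failure and try the next port
                socket.close();
            }
        }
        // Every port in the range failed; report the range and keep the last cause.
        throw new IOException(
            "Failed to bind to " + host.getHostAddress() + ":[" + from + "-" + to + "]", last);
    }

    public static void main(String[] args) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Two binds over the same range: the second one skips the port taken by the first.
        try (ServerSocket first = bindFirstFree(loopback, 49927, 50026);
             ServerSocket second = bindFirstFree(loopback, 49927, 50026)) {
            System.out.println(first.getLocalPort() + " and " + second.getLocalPort());
        }
    }
}
```

A collision inside the range only costs one failed bind attempt. The call fails only when every port in the range is unavailable, which is what happens when the whole range is excluded.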

I would even argue that using the full dynamic port range (49152-65535) would be the safest option here and would avoid these failures even with a high number of tests. The downside is that this is not optimal, since all these tests would start iterating from 49152 until they find the first available port. Hence, I think that increasing the range to 500 (and still selecting it randomly) would not be an issue here.

@mark-vieira
Contributor Author

Hence, I think that increasing a range to 500 (and still selecting it randomly) would not be an issue here.

Fair enough. Can we limit this to Windows only? The issue seems to be restricted to that platform, and we also run fewer concurrent tests on Windows than on Linux.
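
A platform-dependent width could look roughly like the sketch below. This is only an illustration: ClientPortRange, rangeSize and randomRange are hypothetical names, the os.name check is a stand-in for whatever platform check the test framework provides, 300 is the Windows width named in the commit message that closed this issue, and 100 for other platforms is an assumption based on the example range 49927-50026.

```java
import java.util.Locale;
import java.util.Random;

/** Hypothetical helper that picks a random client port range, wider on Windows. */
final class ClientPortRange {
    static final int DYNAMIC_MIN = 49152;
    static final int DYNAMIC_MAX = 65535;

    /** Width of the range: 300 on Windows, 100 elsewhere (both values are assumptions). */
    static int rangeSize(String osName) {
        boolean windows = osName.toLowerCase(Locale.ROOT).startsWith("windows");
        return windows ? 300 : 100;
    }

    /** Returns "start-end" for a random range that fits inside the dynamic port range. */
    static String randomRange(Random random, String osName) {
        int size = rangeSize(osName);
        // Number of valid start positions so that start + size - 1 <= DYNAMIC_MAX.
        int positions = DYNAMIC_MAX - DYNAMIC_MIN - size + 2;
        int start = DYNAMIC_MIN + random.nextInt(positions);
        return start + "-" + (start + size - 1);
    }

    public static void main(String[] args) {
        String os = System.getProperty("os.name");
        System.out.println(os + ": " + randomRange(new Random(), os));
    }
}
```

Passing the OS name in as a parameter keeps the helper easy to exercise for both platforms from a single machine.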

@mark-vieira
Contributor Author

@brianseeders any thoughts on what might be reserving these ports on Windows? Have you observed this before?

@brianseeders
Contributor

No idea. I can help figure it out when I get back, though. If it helps, you can also RDP into the machines: open them in the GCP console and go to Connect.

slobodanadamovic added a commit to slobodanadamovic/elasticsearch that referenced this issue Jan 4, 2024
This PR increases client's port ranges for tests which are executed
on Windows in order to avoid failures due to some port ranges being
excluded from use. The larger ports range (300) is chosen based on
the observation where a random consecutive range of 200 ports is excluded.

Closes elastic#102349
slobodanadamovic added a commit that referenced this issue Jan 4, 2024
…03894)

This PR increases client's port ranges for tests which are executed
on Windows in order to avoid failures due to some port ranges being
excluded from use. The larger ports range (300) is chosen based on
the observation where a random consecutive range of 200 ports can
be excluded on Windows test workers.

Closes #102349
slobodanadamovic added a commit to slobodanadamovic/elasticsearch that referenced this issue Jan 4, 2024
…astic#103894)
slobodanadamovic added a commit to slobodanadamovic/elasticsearch that referenced this issue Jan 4, 2024
…astic#103894)

(cherry picked from commit bdf5c7f)

# Conflicts:
#	modules/transport-netty4/src/internalClusterTest/java/org/elasticsearch/transport/netty4/Netty4TransportMultiPortIntegrationIT.java
#	x-pack/plugin/security/src/internalClusterTest/java/org/elasticsearch/xpack/security/transport/filter/IpFilteringIntegrationTests.java
elasticsearchmachine pushed a commit that referenced this issue Jan 4, 2024
…03894) (#103910)
elasticsearchmachine pushed a commit that referenced this issue Jan 4, 2024
…ows (#103894) (#103914)

* [Test] Use larger client ports range for tests running on Windows (#103894)

(cherry picked from commit bdf5c7f)

# Conflicts:
#	modules/transport-netty4/src/internalClusterTest/java/org/elasticsearch/transport/netty4/Netty4TransportMultiPortIntegrationIT.java
#	x-pack/plugin/security/src/internalClusterTest/java/org/elasticsearch/xpack/security/transport/filter/IpFilteringIntegrationTests.java

* Fix compilation error
jbaiera pushed a commit to jbaiera/elasticsearch that referenced this issue Jan 10, 2024
…astic#103894)