Reject connection attempts while closing #92465


DaveCTurner
Contributor

Today, if there is a constant stream of connection attempts, it's possible for the `ClusterConnectionManager` to wait forever in `close()` for `connectingRefCounter` to be fully released. With this commit we reject connection attempts while closing, which avoids this starvation.
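
To illustrate the idea, here is a minimal sketch of a ref-counted gate that rejects new attempts once closing has begun. It is not the actual `ClusterConnectionManager` code: the class `ClosingGateSketch` and its methods `tryStartConnecting`, `release` and `close(timeoutMillis)` are hypothetical names chosen for this example, and only JDK classes are used.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not the Elasticsearch implementation.
final class ClosingGateSketch {
    // Starts at 1: the gate holds one reference of its own until close() drops it.
    private final AtomicInteger refs = new AtomicInteger(1);
    private final AtomicBoolean closing = new AtomicBoolean();
    private final CountDownLatch fullyReleased = new CountDownLatch(1);

    /** Returns true if the attempt is admitted; the caller must then call release(). */
    boolean tryStartConnecting() {
        if (closing.get()) {
            // Without this check, overlapping attempts could keep refs above zero
            // indefinitely, so close() would never observe a full release.
            return false;
        }
        while (true) {
            int current = refs.get();
            if (current == 0) {
                return false; // already fully released
            }
            if (refs.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    /** Releases one reference; the last release unblocks close(). */
    void release() {
        if (refs.decrementAndGet() == 0) {
            fullyReleased.countDown();
        }
    }

    /** Starts closing and waits up to timeoutMillis; returns true once all refs are released. */
    boolean close(long timeoutMillis) {
        if (closing.compareAndSet(false, true)) {
            release(); // drop the gate's own reference exactly once
        }
        try {
            return fullyReleased.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

An attempt that passes the `closing` check just before `close()` flips the flag can still be admitted, but only a bounded number of such in-flight attempts can exist, so `close()` is no longer starved by later arrivals.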

@DaveCTurner DaveCTurner added >bug :Distributed/Network Http and internode communication implementations v8.6.1 v8.7.0 v7.17.9 labels Dec 20, 2022
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team label Dec 20, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

@DaveCTurner
Contributor Author

Spotted this while looking at a test failure which apparently got stuck exactly here: https://gradle-enterprise.elastic.co/s/rdlucvijy2dby. I'm not sure this is actually the reason for the failure: it is kinda implausible that we open connections at a high enough rate to cause this starvation. Still, I don't see any other obvious reason, and this is worth fixing anyway.

Member

@pxsalehi pxsalehi left a comment


LGTM. But a small nit: the test setup does not seem very straightforward. Individual pieces/blocks of it could benefit from some extra comments.

Contributor

@henningandersen henningandersen left a comment


LGTM.

@DaveCTurner
Contributor Author

> LGTM. But a small nit: the test setup does not seem very straightforward. Individual pieces/blocks of it could benefit from some extra comments.

Fair comment :) I added some more detail in 2c3b8c6, hope that helps.

@DaveCTurner DaveCTurner added auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) auto-backport-and-merge Automatically create backport pull requests and merge when ready labels Dec 21, 2022
@elasticsearchmachine elasticsearchmachine merged commit e1c861d into elastic:main Dec 21, 2022
@DaveCTurner DaveCTurner deleted the 2022-12-20-cluster-connection-manager-close-starvation branch December 21, 2022 13:48
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Dec 21, 2022
@elasticsearchmachine
Collaborator

💔 Backport failed

| Status | Branch | Result |
|--------|--------|--------|
|        | 8.6    |        |
|        | 7.17   | Commit could not be cherrypicked due to conflicts |

You can use sqren/backport to manually backport by running `backport --upstream elastic/elasticsearch --pr 92465`

elasticsearchmachine pushed a commit that referenced this pull request Dec 21, 2022
DaveCTurner added a commit that referenced this pull request Dec 21, 2022
@DaveCTurner
Contributor Author

On reflection I think this might fix the test failure mentioned above after all. We don't have to keep opening connections until the suite times out; we only have to do so until we terminate the threadpool. We're using `GENERIC` threads here, which just silently drop work on the floor at shutdown, so a task that was supposed to release a ref might never run and could well leak it.
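
The leak described here can be reproduced in miniature with a plain JDK executor. This is only an analogy under stated assumptions: it uses `java.util.concurrent` rather than Elasticsearch's own thread pool, and the class `DroppedTaskLeakSketch` and method `leakedRefsAfterShutdown` are hypothetical names for this sketch. A ref is acquired on the calling thread and its release is handed to a queued task; when the executor is stopped, that task never runs.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch using a plain JDK executor, not Elasticsearch's ThreadPool.
final class DroppedTaskLeakSketch {
    /** Returns the number of refs still held after the executor has been stopped. */
    static int leakedRefsAfterShutdown() {
        AtomicInteger refs = new AtomicInteger();
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch blocker = new CountDownLatch(1);

        // Occupy the only worker so that the next task has to wait in the queue.
        executor.execute(() -> {
            try {
                blocker.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        refs.incrementAndGet();                  // acquire a ref on the calling thread
        executor.execute(refs::decrementAndGet); // the release happens only if this task runs

        // shutdownNow() interrupts the worker and hands back the queued tasks
        // without running them, so the release above is silently skipped.
        List<Runnable> neverRun = executor.shutdownNow();
        if (neverRun.isEmpty()) {
            throw new IllegalStateException("expected the release task to still be queued");
        }
        return refs.get();
    }

    public static void main(String[] args) {
        System.out.println("leaked refs: " + leakedRefsAfterShutdown());
    }
}
```

If anything waits for that ref to be released, as `close()` does for `connectingRefCounter`, it waits forever unless the attempt is rejected up front.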

5 participants