

nicktindall
Contributor

@nicktindall nicktindall commented Sep 23, 2025

Add a feature flag for the write load decider and change the default of the enabled setting to ENABLED. As a result, the decider and its infrastructure will be enabled in snapshot builds but disabled in production and release builds.

Relates: ES-12881
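A minimal sketch of the gating described above, in plain Java with no Elasticsearch dependencies. All names are hypothetical. The rule that a flag is on for snapshot builds, or when a system property named `es.<name>_feature_flag_enabled` is `true`, is an assumption about how such flags are typically resolved, not a quote of the real `FeatureFlag` class.

```java
// Hypothetical sketch: names are illustrative and not the Elasticsearch API.
enum WriteLoadDeciderStatus { DISABLED, LOW_THRESHOLD_ONLY, ENABLED }

final class FlagGatedDecider {
    // Default of the dynamic setting, as described in the PR.
    static final WriteLoadDeciderStatus SETTING_DEFAULT = WriteLoadDeciderStatus.ENABLED;

    private FlagGatedDecider() {}

    // Assumed resolution rule: on for snapshot builds, or when the system property
    // "es.<name>_feature_flag_enabled" is "true" (Boolean.parseBoolean(null) is false).
    static boolean featureFlagEnabled(String name, boolean snapshotBuild) {
        return snapshotBuild
            || Boolean.parseBoolean(System.getProperty("es." + name + "_feature_flag_enabled"));
    }

    // Flag off: the decider is forced to DISABLED, whatever the setting says.
    // Flag on: an explicitly configured value wins, otherwise the ENABLED default applies.
    static WriteLoadDeciderStatus effectiveStatus(boolean flagEnabled, WriteLoadDeciderStatus configured) {
        if (!flagEnabled) {
            return WriteLoadDeciderStatus.DISABLED;
        }
        return configured != null ? configured : SETTING_DEFAULT;
    }
}
```

With this shape, a release build (flag off) ignores the setting entirely, which is the behaviour the review thread below questions.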

@elasticsearchmachine elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Sep 24, 2025
@nicktindall nicktindall added :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >non-issue labels Sep 24, 2025
@nicktindall nicktindall marked this pull request as ready for review September 24, 2025 03:43
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Sep 24, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@nicktindall nicktindall removed the serverless-linked Added by automation, don't add manually label Sep 24, 2025
Member

@ywangd ywangd left a comment


I left a question

Comment on lines 118 to 122
if (WRITE_LOAD_DECIDER_ENABLED_FF.isEnabled()) {
    clusterSettings.initializeAndWatch(WRITE_LOAD_DECIDER_ENABLED_SETTING, status -> this.writeLoadDeciderStatus = status);
} else {
    writeLoadDeciderStatus = WriteLoadDeciderStatus.DISABLED;
}
Member


I think this means we cannot test by overriding the setting in a serverless QA environment. Is this intentional? Or should we instead set the default based on the feature flag and always allow dynamic updates?
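A rough sketch of the suggested alternative, assuming a hypothetical `DynamicSetting` stand-in for a dynamic cluster setting whose `initializeAndWatch` only loosely mirrors the call in the snippet above. Only the default depends on the feature flag; explicit updates are always applied, so the decider could be switched on in a QA environment without a restart.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: names are illustrative and not the Elasticsearch API.
enum WriteLoadDeciderStatus { DISABLED, LOW_THRESHOLD_ONLY, ENABLED }

// Minimal stand-in for a dynamic setting that notifies listeners on update.
final class DynamicSetting<T> {
    private final T defaultValue;
    private T explicitValue; // null means "not set, fall back to the default"
    private final List<Consumer<T>> listeners = new ArrayList<>();

    DynamicSetting(T defaultValue) {
        this.defaultValue = defaultValue;
    }

    T get() {
        return explicitValue != null ? explicitValue : defaultValue;
    }

    // Push the current value immediately, then every subsequent update.
    void initializeAndWatch(Consumer<T> listener) {
        listener.accept(get());
        listeners.add(listener);
    }

    // Passing null clears the explicit value, reverting to the default.
    void update(T newValue) {
        explicitValue = newValue;
        T current = get();
        listeners.forEach(l -> l.accept(current));
    }
}

final class WriteLoadDeciderSketch {
    private volatile WriteLoadDeciderStatus status;

    // The setting is always watched; there is no flag check here any more.
    WriteLoadDeciderSketch(DynamicSetting<WriteLoadDeciderStatus> setting) {
        setting.initializeAndWatch(s -> this.status = s);
    }

    // Only the default is derived from the feature flag.
    static DynamicSetting<WriteLoadDeciderStatus> newSetting(boolean featureFlagEnabled) {
        return new DynamicSetting<>(
            featureFlagEnabled ? WriteLoadDeciderStatus.ENABLED : WriteLoadDeciderStatus.DISABLED
        );
    }

    WriteLoadDeciderStatus status() {
        return status;
    }
}
```

In a release build the decider starts out DISABLED, yet a dynamic update to ENABLED takes effect immediately, and clearing the update restores the flag-derived default.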

Contributor Author


Good idea! Changed in 149b2ea.

Contributor Author


Originally I thought we'd just have to restart the cluster the first time it was turned on (to enable the feature flag). But your suggested approach is much better.

Member

@ywangd ywangd left a comment


LGTM

Contributor

@DiannaHohensee DiannaHohensee left a comment


looks good 👍

setWriteLoadDeciderEnablement(
    randomBoolean()
        ? WriteLoadConstraintSettings.WriteLoadDeciderStatus.ENABLED
        : WriteLoadConstraintSettings.WriteLoadDeciderStatus.LOW_THRESHOLD_ONLY
Contributor


Would your helper method be appropriate for an explicit DISABLED settings update in the finally block below, and in other tests?

IIUC, we'll still need the finally block's disable setting update, so that these tests will pass when the release build runs?

Contributor Author


ESSingleNodeTestCase checks that there is no persistent metadata (including settings) left behind in teardown, so we need to clear these settings before the test ends.

We need to clear the settings, rather than setting them to a specific value. I've added a helper for that too in 9845c38
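To illustrate why clearing differs from setting a specific value, here is a sketch that models persistent settings as a map. Everything in it is hypothetical, including the helper names and the assumed `cluster.routing.allocation.write_load_decider.enabled` key. In a real test the clear would typically be a settings update that sends `null` for the key (for example via `Settings.Builder#putNull`).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a cluster's persistent settings; not the Elasticsearch API.
final class PersistentSettingsModel {
    private final Map<String, String> persistent = new HashMap<>();

    // Writing a value, even one equal to the default, leaves an entry behind.
    void put(String key, String value) {
        persistent.put(key, value);
    }

    // Clearing removes the entry, so reads fall back to the setting's default.
    void clear(String key) {
        persistent.remove(key);
    }

    // Stand-in for the teardown check that no persistent settings remain.
    boolean isEmpty() {
        return persistent.isEmpty();
    }
}

// Hypothetical test helpers; the key is an assumed name built from SETTING_PREFIX.
final class WriteLoadSettingsHelpers {
    static final String ENABLED_KEY = "cluster.routing.allocation.write_load_decider.enabled";

    private WriteLoadSettingsHelpers() {}

    static void setEnablement(PersistentSettingsModel settings, String status) {
        settings.put(ENABLED_KEY, status);
    }

    static void clearEnablement(PersistentSettingsModel settings) {
        settings.clear(ENABLED_KEY);
    }
}
```

Setting the status to DISABLED in a `finally` block would still leave a persistent entry and fail the teardown check; only clearing it leaves the metadata empty.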

Contributor Author


The changes here are just to stop depending on the default being one way or another, because this will change depending on the feature flag.
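A small sketch of the test-side pattern, with hypothetical helper names: the test picks an explicit status, much like the `randomBoolean()` call above, so its outcome is the same whichever default the feature flag selects.

```java
import java.util.Random;

// Hypothetical sketch: names are illustrative and not the Elasticsearch API.
enum WriteLoadDeciderStatus { DISABLED, LOW_THRESHOLD_ONLY, ENABLED }

final class ExplicitEnablementSketch {
    private ExplicitEnablementSketch() {}

    // Choose one of the two non-DISABLED modes explicitly instead of
    // relying on the flag-dependent default.
    static WriteLoadDeciderStatus pickEnabledMode(Random random) {
        return random.nextBoolean()
            ? WriteLoadDeciderStatus.ENABLED
            : WriteLoadDeciderStatus.LOW_THRESHOLD_ONLY;
    }

    // True when the decider is doing anything at all; an explicit status
    // overrides whatever default the build ships with.
    static boolean deciderActive(WriteLoadDeciderStatus explicitStatus, WriteLoadDeciderStatus buildDefault) {
        WriteLoadDeciderStatus effective = explicitStatus != null ? explicitStatus : buildDefault;
        return effective != WriteLoadDeciderStatus.DISABLED;
    }
}
```

Only the explicit-status path gives a result that is independent of the build type; relying on the default would make the same test pass on snapshot builds and fail on release builds.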

public class WriteLoadConstraintSettings {

    private static final String SETTING_PREFIX = "cluster.routing.allocation.write_load_decider.";
    private static final FeatureFlag WRITE_LOAD_DECIDER_ENABLED_FF = new FeatureFlag("write_load_decider_enabled");
Contributor


I'd probably get rid of 'enabled' in both names, as it's redundant, but it doesn't really matter.

Contributor Author


Good call, improved in 917c617

@nicktindall nicktindall enabled auto-merge (squash) September 30, 2025 01:09
@nicktindall nicktindall merged commit 23e8349 into elastic:main Sep 30, 2025
34 checks passed
@nicktindall nicktindall deleted the enabled_write_load_decider branch September 30, 2025 01:53

Labels

:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >non-issue Team:Distributed Coordination Meta label for Distributed Coordination team v9.2.0

4 participants