Contributor

@edsavage commented Nov 5, 2025

Add a daily maintenance task to roll over .ml-state indices when the index size exceeds a configurable threshold (default 50GB).

This replaces the previous method of using ILM to manage the state indices, as that was not a workable solution for serverless.

This builds on the work done in PR #136065 which provides similar functionality for results indices.
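
As a rough illustration of the size check described above, here is a minimal sketch in plain Java. It is not the code from this PR: the class name, the method name, and the map of index sizes are hypothetical, and the 50GB default is assumed to mean binary gigabytes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the PR's implementation.
public class StateIndexRolloverSketch {
    // Assumption: the 50GB default from the description, read as 50 * 1024^3 bytes.
    static final long DEFAULT_MAX_SIZE_BYTES = 50L * 1024 * 1024 * 1024;

    // Returns the names of the indices whose size strictly exceeds the threshold.
    static List<String> indicesToRollOver(Map<String, Long> indexSizesInBytes, long maxSizeBytes) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> entry : indexSizesInBytes.entrySet()) {
            if (entry.getValue() > maxSizeBytes) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put(".ml-state-000001", 60L * 1024 * 1024 * 1024); // above the threshold
        sizes.put(".ml-state-000002", 10L * 1024 * 1024 * 1024); // below the threshold
        System.out.println(indicesToRollOver(sizes, DEFAULT_MAX_SIZE_BYTES)); // prints [.ml-state-000001]
    }
}
```

In the real task the sizes would come from the cluster rather than a hand-built map. The sketch only shows the threshold comparison.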

WIP
@elasticsearchmachine added the Team:ML label Nov 5, 2025
@elasticsearchmachine
Collaborator

Hi @edsavage, I've created a changelog YAML for you.

@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@benwtrent requested a review from davidkyle November 7, 2025 11:47
Member

@davidkyle left a comment

I have a question: what happens if you try to remove an alias from an index when the alias does not exist? Does that throw an error? I'm wondering whether the various alias requests need to check if there is an alias first.

triggerDeleteJobsInStateDeletingWithoutDeletionTask(continueOnFailureListener("reset-jobs", resetJobs));
}

private ActionListener<AcknowledgedResponse> continueOnFailureListener(String nextTaskName, Runnable next) {
Member

Nice this is a lot easier to read

Map<String, String> variables = new HashMap<>();
variables.put(VERSION_ID_PATTERN, String.valueOf(ML_INDEX_TEMPLATE_VERSION));
// In serverless a different version of "state_index_template.json" is shipped that won't substitute the ILM policy variable
variables.put(INDEX_LIFECYCLE_NAME, ML_SIZE_BASED_ILM_POLICY_NAME);
Member

Assume there is an .ml-state-000001 index created in an earlier version (say 9.0) and that index has the ILM policy. There's now a race between the ILM policy and the ML maintenance service to roll over .ml-state-000001. Can the ML maintenance service ignore indices with an ILM policy? (In stateful but not serverless.)

New ml-state indices created from this template won't have the ILM policy so there's no race once a new index is rolled over.

Contributor Author

I've added a filter for those indices that have an associated ILM policy to avoid the potential race when rolling over. Thanks!
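
For readers following the thread, the idea of skipping ILM-managed indices can be sketched roughly as below. This is an illustrative sketch only, not the filter added in the PR. The "index.lifecycle.name" setting key comes from the diff. The class, the settings-map shape, the `withoutIlm` helper, and the policy name in `main` are hypothetical, and the `hasIlm` in the PR takes an index name rather than a settings map.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch, not the PR's implementation.
public class IlmFilterSketch {
    // Setting key referenced in the PR diff.
    static final String ILM_SETTING = "index.lifecycle.name";

    // True when the index settings carry a non-empty ILM policy name.
    static boolean hasIlm(Map<String, String> indexSettings) {
        String policy = indexSettings.get(ILM_SETTING);
        return policy != null && !policy.isEmpty();
    }

    // Keeps only the indices with no ILM policy, so the maintenance task leaves ILM-managed ones alone.
    static List<String> withoutIlm(Map<String, Map<String, String>> settingsByIndex) {
        return settingsByIndex.entrySet().stream()
            .filter(entry -> !hasIlm(entry.getValue()))
            .map(Map.Entry::getKey)
            .sorted()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> settings = new HashMap<>();
        // "example-ilm-policy" is a placeholder policy name.
        settings.put(".ml-state-000001", Map.of(ILM_SETTING, "example-ilm-policy"));
        settings.put(".ml-state-000002", Map.of());
        System.out.println(withoutIlm(settings)); // prints [.ml-state-000002]
    }
}
```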

@edsavage
Contributor Author

> I have a question: what happens if you try to remove an alias from an index when the alias does not exist? Does that throw an error? I'm wondering whether the various alias requests need to check if there is an alias first.

I think we're good here, no error is thrown when attempting to remove a non-existent alias. There's at least one integration test that exercises this scenario, e.g. https://github.com/edsavage/elasticsearch/blob/df2859c7589fbacdab908aeb8d56c7dd106e294d/x-pack/plugin/ml/src/internalClusterTest/java/org/elasticsearch/xpack/ml/integration/MlDailyMaintenanceServiceRolloverIndicesIT.java#L509

Contributor

@valeriy42 left a comment

Good work! 🚀 I have just some minor comments.

rollAndUpdateAliases(clusterState, index, allIndices, updated);
try {
updated.actionGet();
} catch (Exception ex) {
Contributor

Error message in the next line says "Failed to rollover ML anomalies index" but this method is used for both results and state indices. The message should be more generic or parameterized.
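
One possible shape for a parameterized message, sketched in plain Java. The class, method, and parameter names are hypothetical, and the wording only mirrors the message quoted above. It is not the change that was merged.

```java
import java.util.Locale;

// Hypothetical sketch, not the PR's implementation.
public class RolloverMessageSketch {
    // Builds the failure message from the kind of index (for example "anomalies" or "state").
    static String rolloverFailureMessage(String indexKind, String indexName) {
        return String.format(Locale.ROOT, "Failed to rollover ML %s index [%s]", indexKind, indexName);
    }

    public static void main(String[] args) {
        // prints Failed to rollover ML state index [.ml-state-000001]
        System.out.println(rolloverFailureMessage("state", ".ml-state-000001"));
    }
}
```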

public void triggerRollResultsIndicesIfNecessaryTask(ActionListener<AcknowledgedResponse> finalListener) {
logger.info("[ML] maintenance task: triggerRollResultsIndicesIfNecessaryTask");
// Helper function to check for the "index.lifecycle.name" setting on an index
private boolean hasIlm(String indexName) {
Contributor

Consider adding a unit test or a BWC test to verify that this works as expected.

Member

@davidkyle left a comment

LGTM thanks for the ILM change

@edsavage merged commit 711e445 into elastic:main Nov 14, 2025
34 checks passed
Labels

>enhancement, :ml (Machine learning), Team:ML (Meta label for the ML team), v9.3.0

4 participants