ILM wait for active shards on rolled index in a separate step #50718

andreidan · 2020-01-07T21:52:34Z

After we roll over the index, we wait for the configured number of shards of the rolled index to become active. The number comes from the index.write.wait_for_active_shards setting, which might be present in a template; in the default case we wait only for the primaries to become active.
This wait might be long because disk watermarks have been tripped, because replicas cannot start while cluster nodes are being reconfigured, or for other reasons. In that transient situation the RolloverStep might not complete successfully even though the rolled index has already been created. This PR moves the wait into a separate step.

Relates to #48183 and #44135

elasticmachine · 2020-01-07T21:52:46Z

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

andreidan · 2020-01-07T21:53:02Z

@dakrone what are your thoughts on this?

dakrone · 2020-01-07T22:18:28Z

I took a brief look. I think maybe we should separate this and support the parameter on the regular Rollover API (right now it's only accessible from the Transport layer, I think?). Once that is available, I think splitting the steps on the ILM side sounds like a good idea to me. We need to make sure to default to the current behavior for the Rollover API though, so that we don't accidentally make a breaking change.

andreidan · 2020-01-08T10:26:51Z

@dakrone the regular Rollover API supports `wait_for_active_shards` as a query parameter, and the rollover client API supports it as well ( https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/action/admin/indices/rollover/RolloverRequestBuilder.java#L101 )

Are you thinking of exposing this as a parameter in the ILM Rollover action configuration or something else?
i.e.

"hot": {
  "actions": {
    "rollover": {
      "max_size": "50GB",
      "max_age": "30d",
      "wait_for_active_shards": 2
    }
  }
}

I'm not sure we should do this: we haven't seen interest in something like this, and it would require validation that we can't perform. The value must not be larger than the number of primaries + replicas, but a policy definition can be applied to any index, and the templates for the rolled index can make this even more complex.
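The validation problem can be illustrated with a minimal sketch. It assumes that a numeric `wait_for_active_shards` value is only satisfiable when it does not exceed the number of copies of each shard (one primary plus its replicas); the class and method names are hypothetical and are not part of the Elasticsearch API.

```java
public class WaitForActiveShardsValidationSketch {

    /**
     * Returns true when the requested number of active copies per shard can be reached.
     * Assumption (not taken from the PR): each shard has one primary plus
     * numberOfReplicas replica copies.
     */
    public static boolean isSatisfiable(int waitForActiveShards, int numberOfReplicas) {
        if (waitForActiveShards < 0 || numberOfReplicas < 0) {
            throw new IllegalArgumentException("values must be non-negative");
        }
        return waitForActiveShards <= numberOfReplicas + 1;
    }

    public static void main(String[] args) {
        // The same policy value (2) is fine for one index and unsatisfiable for another.
        System.out.println(isSatisfiable(2, 1)); // index configured with 1 replica
        System.out.println(isSatisfiable(2, 0)); // index configured with 0 replicas
    }
}
```

Because the replica count belongs to the index (or to the template of the rolled index) and not to the policy, a check like this could only run once the policy is attached to a concrete index.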

dakrone · 2020-01-08T17:53:36Z

> Are you thinking of exposing this as a parameter in the ILM Rollover action configuration or something else?

Ah no! I didn't realize we had already exposed it (since only the setter was added in the actual change). If we already support it then splitting the two sounds good to me.

andreidan · 2020-01-15T13:17:19Z

@elasticmachine update branch

andreidan · 2020-01-15T17:46:39Z

@elasticmachine update branch

dakrone

Thanks Andrei, I left a couple of comments!

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/RolloverStep.java

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/WaitForActiveShardsStep.java

andreidan · 2020-01-16T12:47:46Z

@elasticmachine update branch

andreidan · 2020-01-16T13:54:39Z

@elasticmachine update branch

andreidan · 2020-01-16T17:53:39Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/WaitForActiveShardsStep.java

+            // if the rollover was not performed on a write index alias, the alias will be moved to the new index and it will be the only
+            // index this alias points to
+            List<IndexMetaData> indices = alias.getIndices();
+            assert indices.size() == 1 : "when performing rollover on alias with is_write_index = false the alias must point to only " +


This assertion doesn't hold in a CCR environment.

I'm working on a fix that parses the counter from the index name and finds the rolled index by the highest counter ("max number").
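A minimal sketch of the "max number" idea, assuming the alias can point to several indices whose names end in `-<digits>`. The class name, method name, and the choice to skip names without a numeric suffix are illustrative assumptions, not the code from this PR.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RolledIndexLookupSketch {

    // Rollover-style names end in a dash followed by digits, e.g. "logs-000002".
    // The digit run is capped at 18 characters so that Long.parseLong cannot overflow.
    private static final Pattern COUNTER_SUFFIX = Pattern.compile("^.*-(\\d{1,18})$");

    /** Returns the name with the highest numeric suffix, ignoring names without one. */
    public static Optional<String> findRolledIndex(List<String> indexNames) {
        String best = null;
        long bestCounter = -1;
        for (String name : indexNames) {
            Matcher matcher = COUNTER_SUFFIX.matcher(name);
            if (!matcher.matches()) {
                continue; // not a rollover-style name, skip it
            }
            long counter = Long.parseLong(matcher.group(1));
            if (counter > bestCounter) {
                bestCounter = counter;
                best = name;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

For example, `findRolledIndex(List.of("logs-000001", "logs-000003", "logs-000002"))` selects `logs-000003`, regardless of how many indices the alias points to.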

andreidan · 2020-01-17T17:16:45Z

@elasticmachine update branch

dakrone

I left a few more comments, thanks Andrei

dakrone · 2020-01-17T17:25:28Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/WaitForActiveShardsStep.java

+
+        Alias alias = (Alias) aliasOrIndex;
+        IndexMetaData aliasWriteIndex = alias.getWriteIndex();
+        String rolledIndexName;


Suggested change

String rolledIndexName;

final String rolledIndexName;

dakrone · 2020-01-17T17:25:37Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/WaitForActiveShardsStep.java

+        Alias alias = (Alias) aliasOrIndex;
+        IndexMetaData aliasWriteIndex = alias.getWriteIndex();
+        String rolledIndexName;
+        String waitForActiveShardsSettingValue;


Suggested change

String waitForActiveShardsSettingValue;

final String waitForActiveShardsSettingValue;

dakrone · 2020-01-17T17:28:45Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/WaitForActiveShardsStep.java

+                // Index must have been since deleted
+                logger.debug("unable to find the index that was rolled over from [{}] as part of lifecycle action [{}]", index.getName(),
+                    getKey().getAction());
+                return new Result(false, null);


Can we return the above log message as part of the "info" ToXContent object here? Otherwise this could stick in false forever and the user would never know why
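A rough sketch of what surfacing the reason could look like, using simplified stand-ins for the step result and its info. The `Result` class, the `check` method, and the message wording below are hypothetical and do not reproduce the actual ILM classes.

```java
public class StepResultSketch {

    /** Simplified stand-in for a step result: a completion flag plus an explanation. */
    public static final class Result {
        private final boolean complete;
        private final String info;

        Result(boolean complete, String info) {
            this.complete = complete;
            this.info = info;
        }

        public boolean isComplete() {
            return complete;
        }

        public String getInfo() {
            return info;
        }
    }

    /**
     * @param originalIndex      the index the rollover was performed from
     * @param activeShardsOrNull active shard count of the rolled index, or null if it cannot be found
     * @param targetActiveShards number of shards that must be active for the step to complete
     */
    public static Result check(String originalIndex, Integer activeShardsOrNull, int targetActiveShards) {
        if (activeShardsOrNull == null) {
            // Report why the step is not complete instead of only logging it.
            return new Result(false, "unable to find the index that was rolled over from [" + originalIndex + "]");
        }
        if (activeShardsOrNull >= targetActiveShards) {
            return new Result(true, "the target of [" + targetActiveShards + "] active shards has been reached");
        }
        return new Result(false, "waiting for [" + targetActiveShards + "] shards to become active, but only ["
            + activeShardsOrNull + "] are active");
    }
}
```

With something along these lines, a caller that sees `isComplete() == false` can still show the user the reason held in `getInfo()`.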

dakrone · 2020-01-17T17:36:25Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/WaitForActiveShardsStep.java

+     */
+    static int parseIndexNameCounter(String indexName) {
+        int numberIndex = indexName.lastIndexOf("-");
+        assert numberIndex != -1 : "no separator '-' found";


I think we should remove this assert and throw a regular error

dakrone · 2020-01-17T17:37:36Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/WaitForActiveShardsStep.java

+    static int parseIndexNameCounter(String indexName) {
+        int numberIndex = indexName.lastIndexOf("-");
+        assert numberIndex != -1 : "no separator '-' found";
+        return Integer.parseInt(indexName.substring(numberIndex + 1, indexName.endsWith(">") ? indexName.length() - 1 :


This should catch the NumberFormatException and format it into a nicer, more human-readable exception. It's possible to hit this if someone were to take an index foo-000003 and snapshot it, then restore it with a different name (like foo-000003-restored for example). This would flip out, but it wouldn't be clear why.

andreidan · 2020-01-20

I've kept the exception handling light here because this step closely follows the RolloverStep (which would not succeed in the example you illustrated, as the source index would not match the ^.*-\d+$ pattern). Its visibility is package default solely for testing purposes.
I guess renames and such could occur between steps, so I'll add the exception handling and remove the assumptions.
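As an illustration of the two review requests (a regular error instead of the assert, and a readable error instead of a raw NumberFormatException), here is a hedged sketch. It keeps the `parseIndexNameCounter` name and the trailing `>` handling visible in the diff, but the exception type and messages are assumptions and not necessarily what was merged.

```java
public class IndexNameCounterSketch {

    /**
     * Extracts the numeric counter that follows the last '-' in a rollover-style index name.
     * A trailing '>' (as in date-math names such as "<logs-{now/d}-000001>") is ignored.
     */
    public static int parseIndexNameCounter(String indexName) {
        int separator = indexName.lastIndexOf('-');
        if (separator == -1) {
            // A regular exception instead of an assert, so the check also runs in production.
            throw new IllegalArgumentException("no '-' separator found in index name [" + indexName + "]");
        }
        int end = indexName.endsWith(">") ? indexName.length() - 1 : indexName.length();
        String suffix = indexName.substring(separator + 1, end);
        try {
            return Integer.parseInt(suffix);
        } catch (NumberFormatException e) {
            // Wrap the low-level failure in a message that names the offending index.
            throw new IllegalArgumentException(
                "unable to parse the index name [" + indexName + "] to extract the counter", e);
        }
    }
}
```

With this shape, a restored index named `foo-000003-restored` fails with a message that names the index, instead of surfacing a bare NumberFormatException.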

dakrone · 2020-01-17T17:42:18Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/WaitForActiveShardsStep.java

+        static final ParseField TARGET_ACTIVE_SHARDS_COUNT = new ParseField("target_active_shards_count");
+        static final ParseField ENOUGH_SHARDS_ACTIVE = new ParseField("enough_shards_active");
+        static final ParseField MESSAGE = new ParseField("message");
+        static final ConstructingObjectParser<Info, Void> PARSER = new ConstructingObjectParser<>("wait_for_active_shards_step_info",


I believe this PARSER is never actually used?

dakrone · 2020-01-17T17:44:07Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/WaitForActiveShardsStep.java

+            this.enoughShardsActive = enoughShardsActive;
+
+            if (enoughShardsActive) {
+                message = "The target of [" + targetActiveShardsCount + "] are active. Don't need to wait anymore.";


Minor nit, but we should keep these lowercase, and I don't think we need a trailing period (these are more like error messages than sentences)

dakrone · 2020-01-17T17:44:17Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/WaitForActiveShardsStep.java

+            if (enoughShardsActive) {
+                message = "The target of [" + targetActiveShardsCount + "] are active. Don't need to wait anymore.";
+            } else {
+                message = "Waiting for [" + targetActiveShardsCount + "] shards to become active, but only [" + currentActiveShardsCount +


Same about lowercasing/trailing-period here.

andreidan · 2020-01-20T13:24:41Z

@elasticmachine update branch

dakrone

LGTM, thanks Andrei!

…c#50718) After we rollover the index we wait for the configured number of shards for the rolled index to become active (based on the index.write.wait_for_active_shards setting which might be present in a template, or otherwise in the default case, for the primaries to become active). This wait might be long due to disk watermarks being tripped, replicas not being able to spring to life due to cluster nodes reconfiguration and others and, the RolloverStep might not complete successfully due to this inherent transient situation, albeit the rolled index having been created. (cherry picked from commit 457a92f) Signed-off-by: Andrei Dan <andrei.dan@elastic.co>

#51296) After we rollover the index we wait for the configured number of shards for the rolled index to become active (based on the index.write.wait_for_active_shards setting which might be present in a template, or otherwise in the default case, for the primaries to become active). This wait might be long due to disk watermarks being tripped, replicas not being able to spring to life due to cluster nodes reconfiguration and others and, the RolloverStep might not complete successfully due to this inherent transient situation, albeit the rolled index having been created. (cherry picked from commit 457a92f) Signed-off-by: Andrei Dan <andrei.dan@elastic.co>

ILM wait for active shards on rolled index in a separate step

f6290e4

andreidan added :Data Management/ILM+SLM Index and Snapshot lifecycle management WIP labels Jan 7, 2020

Add license header

fccfee5

Fix RolloverActionTests to reflect the new step

456ba48

andreidan added 3 commits January 15, 2020 12:18

WaitForActiveShardsStep uses the alias index

292a8e6

Fix integration test

b2673bd

Merge branch 'master' into ilm-rollover-wait-for-active-shards

77364de

elasticmachine and others added 2 commits January 15, 2020 07:17

Merge branch 'master' into ilm-rollover-wait-for-active-shards

0a98bf0

Don't wait for active shards when rolling over in ILM

4842f39

andreidan added v8.0.0 and removed WIP labels Jan 15, 2020

andreidan requested a review from dakrone January 15, 2020 13:26

andreidan added the v7.7.0 label Jan 15, 2020

Fix TransportPutLifecycleActionTests

eb8f95f

Merge branch 'master' into ilm-rollover-wait-for-active-shards

a4615f8

dakrone requested changes Jan 15, 2020

andreidan added 3 commits January 16, 2020 12:47

Comment to clarify why rollover doesn't wait for active shards

115a131

Guard against the index having been deleted while executing a policy

0924222

Return a meaningful shards state message

bf09bbc

Merge branch 'master' into ilm-rollover-wait-for-active-shards

c33a595

Merge branch 'master' into ilm-rollover-wait-for-active-shards

a27e1cc

andreidan commented Jan 16, 2020

andreidan added 4 commits January 17, 2020 15:47

Drop unused getters

a838925

Find rolled index by finding the max counter in the name.

9f9b842

Escape < and > in javadoc

c3f506e

Skip WaitForActiveShardsStep when lifecycle complete is set

7679143

Merge branch 'master' into ilm-rollover-wait-for-active-shards

3a1cd87

dakrone requested changes Jan 17, 2020

andreidan added 6 commits January 20, 2020 12:20

Use lower case and drop . in error messages

b838f3a

Mark vars as final

4449853

Add explicit error handling to parseIndexNameCounter

64c9f06

Remove Parser

9322665

Add Info object to report various step progress messages

36e07fc

Make constructors default visible

c0ecf31

Merge branch 'master' into ilm-rollover-wait-for-active-shards

f4f2c84

andreidan requested a review from dakrone January 20, 2020 14:13

dakrone approved these changes Jan 21, 2020

andreidan merged commit 457a92f into elastic:master Jan 22, 2020

andreidan added the backport pending label Jan 22, 2020

andreidan mentioned this pull request Jan 22, 2020

[7.x] ILM wait for active shards on rolled index in a separate step (#50718) #51296

Merged

andreidan removed the backport pending label Jan 22, 2020

andreidan mentioned this pull request Jan 22, 2020

Automatic retries for ILM rollover action #44135

Closed

codebrain mentioned this pull request Apr 1, 2020

7.7.0 meta ticket elastic/elasticsearch-net#4525

Closed

38 tasks

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
