Retrying an ILM action that is an AsyncActionStep does not work due to execution model changes #35397

dakrone · 2018-11-08T22:49:35Z

When an index is stuck in the error state of a step which is an AsyncActionStep, like so:

PUT _ilm/policy/bad
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "5s",
        "actions": {
          "shrink": {
            "number_of_shards": 13
          }
        }
      }
    }
  }
}

PUT /foo
{
  "settings": {
    "index.number_of_shards": 2,
    "index.lifecycle.name": "bad"
  }
}

With error:

{
  "indices" : {
    "foo" : {
      "index" : "foo",
      "managed" : true,
      "policy" : "bad",
      "lifecycle_date" : "2018-11-08T22:47:45.865Z",
      "lifecycle_date_millis" : 1541717265865,
      "phase" : "warm",
      "phase_time" : "2018-11-08T22:47:52.601Z",
      "phase_time_millis" : 1541717272601,
      "action" : "shrink",
      "action_time" : "2018-11-08T22:47:52.601Z",
      "action_time_millis" : 1541717272601,
      "step" : "ERROR",
      "step_time" : "2018-11-08T22:47:52.688Z",
      "step_time_millis" : 1541717272688,
      "failed_step" : "shrink",
      "step_info" : {
        "type" : "illegal_argument_exception",
        "reason" : "the number of target shards [13] must be less that the number of source shards [2]"
      },
      "phase_execution" : {
        "policy" : "bad",
        "phase_definition" : {
          "min_age" : "5s",
          "actions" : {
            "shrink" : {
              "number_of_shards" : 13
            }
          }
        },
        "version" : 5,
        "modified_date" : "2018-11-08T22:47:44.230Z",
        "modified_date_in_millis" : 1541717264230
      }
    }
  }
}

When I correct the policy and issue a POST /foo/_ilm/retry...

The step, now at warm/shrink/shrink, never executes, because we only execute AsyncActionSteps when they are transitioned into.
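To illustrate the behavior described above, here is a minimal toy model in Python. It is not Elasticsearch code: `ToyIndexState`, `ToyRunner`, `transition_to`, and `retry_state_only` are hypothetical names, and the model assumes only what this issue states, namely that an async action step runs as a side effect of being transitioned into, while the retry merely rewrites the recorded step.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyIndexState:
    """Hypothetical, heavily reduced stand-in for an index's lifecycle state."""
    step: str
    failed_step: Optional[str] = None


@dataclass
class ToyRunner:
    """Hypothetical runner that records which async action steps it executed."""
    executed: List[str] = field(default_factory=list)

    def transition_to(self, state: ToyIndexState, next_step: str) -> None:
        # A real transition: the state moves AND the async action step runs.
        state.step = next_step
        state.failed_step = None
        self.executed.append(next_step)

    def retry_state_only(self, state: ToyIndexState) -> None:
        # Models the retry as reported here: only the recorded step is
        # rewritten, so nothing triggers execution of the async action step.
        state.step, state.failed_step = state.failed_step, None


state = ToyIndexState(step="ERROR", failed_step="shrink")
runner = ToyRunner()
runner.retry_state_only(state)
print(state.step, runner.executed)  # prints: shrink []
```

The index is reported as being on the shrink step again, yet the step was never run, which matches the symptom in this report.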

The 'retry' API should perform the correct steps to retry any type of operation.


elasticmachine · 2018-11-08T22:49:37Z

Pinging @elastic/es-core-infra

gwbrown · 2018-11-08T22:51:26Z

This looks very similar to #34618; a similar solution might be applicable.

Before, moving to a failed step would only change the step info to be that of the failed step. This means two things:

1. Async steps would never be triggered to execute.
2. If there are inherent problems with the action definition that can be fixed with a policy update, these changes were not being reflected by the new execution info.

Changes now:

1. Async steps are executed after the move to the failed step in cluster state.
2. The lifecycle execution info's phase definition is updated from the current latest policy definition, even though the index isn't moving to a new phase.

Closes elastic#35397.
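The two changes in this commit message can be sketched roughly as follows. This is a simplified Python model under stated assumptions, not the actual implementation: `retry_failed_step` and `run_async_step` are hypothetical names, the state is a plain dict loosely mirroring the explain output in this issue, and the corrected value `number_of_shards: 1` is an assumed example of a valid fix.

```python
import copy


def retry_failed_step(exec_state, latest_policy, run_async_step):
    """Hypothetical model of the fixed retry; returns the new state."""
    phase = exec_state["phase"]
    new_state = dict(exec_state)
    # Move from ERROR back to the step that failed and clear the error info.
    new_state["step"] = exec_state["failed_step"]
    new_state["failed_step"] = None
    new_state["step_info"] = None
    # Refresh the cached phase definition from the latest policy, even
    # though the index stays in the same phase.
    new_state["phase_definition"] = copy.deepcopy(latest_policy["phases"][phase])
    # Execute the async step after the state move instead of waiting for a
    # transition that would never happen.
    run_async_step(new_state)
    return new_state


# Assumed corrected policy: shrink 2 shards down to 1 instead of "down" to 13.
corrected_policy = {
    "phases": {
        "warm": {"min_age": "5s", "actions": {"shrink": {"number_of_shards": 1}}}
    }
}
stuck = {
    "phase": "warm",
    "action": "shrink",
    "step": "ERROR",
    "failed_step": "shrink",
    "step_info": {"type": "illegal_argument_exception"},
    "phase_definition": {
        "min_age": "5s",
        "actions": {"shrink": {"number_of_shards": 13}},
    },
}

ran = []
result = retry_failed_step(
    stuck,
    corrected_policy,
    lambda s: ran.append(
        (s["step"], s["phase_definition"]["actions"]["shrink"]["number_of_shards"])
    ),
)
print(result["step"], ran)  # prints: shrink [('shrink', 1)]
```

In this model the step runs against the refreshed definition, so a policy fix made while the index sat in the error state takes effect on retry.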


dakrone added >bug blocker :Data Management/ILM+SLM Index and Snapshot lifecycle management labels Nov 8, 2018

talevy self-assigned this Nov 8, 2018

talevy mentioned this issue Nov 9, 2018

[ILM] fix retry so it picks up latest policy and executes async action #35406

Merged

talevy closed this as completed in #35406 Nov 12, 2018
