Retrying an ILM action that is an AsyncActionStep does not work due to execution model changes #35397

Closed
dakrone opened this issue Nov 8, 2018 · 2 comments · Fixed by #35406

@dakrone
Member

dakrone commented Nov 8, 2018

When an index is stuck in the error state of a step which is an AsyncActionStep, like so:

PUT _ilm/policy/bad
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "5s",
        "actions": {
          "shrink": {
            "number_of_shards": 13
          }
        }
      }
    }
  }
}

PUT /foo
{
  "settings": {
    "index.number_of_shards": 2,
    "index.lifecycle.name": "bad"
  }
}

With error:

{
  "indices" : {
    "foo" : {
      "index" : "foo",
      "managed" : true,
      "policy" : "bad",
      "lifecycle_date" : "2018-11-08T22:47:45.865Z",
      "lifecycle_date_millis" : 1541717265865,
      "phase" : "warm",
      "phase_time" : "2018-11-08T22:47:52.601Z",
      "phase_time_millis" : 1541717272601,
      "action" : "shrink",
      "action_time" : "2018-11-08T22:47:52.601Z",
      "action_time_millis" : 1541717272601,
      "step" : "ERROR",
      "step_time" : "2018-11-08T22:47:52.688Z",
      "step_time_millis" : 1541717272688,
      "failed_step" : "shrink",
      "step_info" : {
        "type" : "illegal_argument_exception",
        "reason" : "the number of target shards [13] must be less that the number of source shards [2]"
      },
      "phase_execution" : {
        "policy" : "bad",
        "phase_definition" : {
          "min_age" : "5s",
          "actions" : {
            "shrink" : {
              "number_of_shards" : 13
            }
          }
        },
        "version" : 5,
        "modified_date" : "2018-11-08T22:47:44.230Z",
        "modified_date_in_millis" : 1541717264230
      }
    }
  }
}

When I correct the policy and issue a POST /foo/_ilm/retry...

The index moves back to the warm/shrink/shrink step, but that step never executes, because we only execute AsyncActionSteps when they are transitioned into.

The 'retry' API should perform the correct steps to retry any type of operation.
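
To make the failure mode concrete, here is a minimal toy model of it in Python. It is not the Elasticsearch implementation: `ToyIndex`, `ToyRunner`, `transition_to` and `retry_state_only` are hypothetical names invented for this sketch. It only assumes what is described above, namely that an async step runs when an index transitions into it, and that retry changes the recorded step without causing such a transition.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyIndex:
    """Hypothetical stand-in for an index's lifecycle state."""
    step: str = "shrink"
    failed_step: Optional[str] = None
    executed: List[str] = field(default_factory=list)


class ToyRunner:
    """Toy runner: an async step is executed only on a transition into it."""

    def transition_to(self, index: ToyIndex, step: str, ok: bool = True) -> None:
        # The only place in this model where an async step is kicked off.
        index.step = step
        index.executed.append(step)
        if not ok:
            # Record the failure and park the index in the ERROR step.
            index.failed_step, index.step = step, "ERROR"

    def retry_state_only(self, index: ToyIndex) -> None:
        # Mimics the reported behavior: the recorded step is rewritten,
        # but nothing executes the async step again.
        index.step, index.failed_step = index.failed_step, None


runner = ToyRunner()
index = ToyIndex()
runner.transition_to(index, "shrink", ok=False)  # first attempt fails
runner.retry_state_only(index)                   # retry after fixing the policy
print(index.step, index.executed)
```

In this model the index reports the `shrink` step again after the retry, yet `executed` still holds only the first, failed attempt, which matches the stuck state described above.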

@dakrone dakrone added >bug blocker :Data Management/ILM+SLM Index and Snapshot lifecycle management labels Nov 8, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@gwbrown
Contributor

gwbrown commented Nov 8, 2018

This looks very similar to #34618; a similar solution might be applicable.

@talevy talevy self-assigned this Nov 8, 2018
talevy added a commit to talevy/elasticsearch that referenced this issue Nov 9, 2018
Before, moving to a failed step would only change the step info
to be that of the failed step. This means two things.

1. Async Steps would never be triggered to execute
2. If there are inherent problems with the action definition that can
be fixed with a policy update, these changes were not being reflected
by the new execution info.

Changes now

1. Async steps are executed after the move to the failed step in cluster state
2. the lifecycle execution info's phase definition is updated from the current
latest policy definition, even though the index isn't moving to a new phase.

Closes elastic#35397.
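
The two changes described in this commit message can be sketched as a toy Python model. This is an illustration under stated assumptions, not the code from the commit: the dictionary layout, `retry`, and `toy_shrink` are hypothetical, and the shard check is a simplified stand-in for the real shrink validation.

```python
from typing import Callable, Dict, Optional


def toy_shrink(index: Dict) -> Optional[str]:
    """Hypothetical async step: returns an error message, or None on success."""
    target = index["phase_definition"]["actions"]["shrink"]["number_of_shards"]
    source = index["source_shards"]
    if target >= source:
        return f"target shards [{target}] must be fewer than source shards [{source}]"
    return None


def retry(index: Dict, policies: Dict,
          run_async_step: Callable[[Dict], Optional[str]]) -> Dict:
    """Toy retry: refresh the definition, move to the failed step, execute it."""
    if index["step"] != "ERROR":
        raise ValueError("index is not in the ERROR step")
    # Change 2: refresh the cached phase definition from the latest policy,
    # even though the index is not moving to a new phase.
    index["phase_definition"] = policies[index["policy"]]["phases"][index["phase"]]
    # Move back to the failed step in the (toy) cluster state.
    index["step"] = index.pop("failed_step")
    # Change 1: explicitly execute the async step after the move.
    error = run_async_step(index)
    if error is None:
        index.pop("step_info", None)
    else:
        index["failed_step"], index["step"] = index["step"], "ERROR"
        index["step_info"] = error
    return index


# Corrected policy: shrink the 2-shard index to 1 shard instead of 13.
policies = {"bad": {"phases": {"warm": {"actions": {"shrink": {"number_of_shards": 1}}}}}}
index = {
    "policy": "bad", "phase": "warm", "source_shards": 2,
    "step": "ERROR", "failed_step": "shrink",
    "phase_definition": {"actions": {"shrink": {"number_of_shards": 13}}},
}
retry(index, policies, toy_shrink)
print(index["step"], index["phase_definition"])
```

With the corrected policy, the toy index leaves `ERROR` because the step is actually re-run against the refreshed definition. If the policy were left unchanged, the same call would put the index back into `ERROR` with a fresh `step_info`.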
talevy added a commit that referenced this issue Nov 12, 2018
#35406)

pgomulka pushed a commit to pgomulka/elasticsearch that referenced this issue Nov 21, 2018
elastic#35406)
