
ILM: Changing a policy does not have desired effect. #48431

Closed

jordansissel opened this issue Oct 23, 2019 · 14 comments
Assignees
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management

Comments

@jordansissel
Contributor

Tested on Elasticsearch 7.3.0

We ran out of disk today. We noticed that Curator for some reason was not deleting indices, possibly because it detects ILM and ignores indices with ILM enabled.

I updated our existing ILM policy to add:

      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
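
For reference, this fragment belongs under the policy's "phases" object when PUT to _ilm/policy/<name>. A minimal sketch of the full policy body (the hot phase and its rollover criteria here are illustrative, not taken from this issue):

```python
import json

# Hypothetical full ILM policy body; only the "delete" phase comes from the
# snippet above, the "hot" phase is illustrative filler.
policy_body = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_size": "50gb", "max_age": "1d"}
                }
            },
            "delete": {
                "min_age": "30d",
                "actions": {
                    "delete": {}
                }
            }
        }
    }
}

print(json.dumps(policy_body, indent=2))
```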

This policy change never takes effect, due to some ILM behavior that I don't understand.

If index foo has lifecycle policy bar and bar policy is updated to say "Delete when older than 30 days", then foo should be deleted when it is older than 30 days. If foo is already older than 30 days, then it should be deleted by ILM.

Without this, attempting to modify policy when there are hundreds of indices leaves us without much obvious recourse besides doing it ourselves with scripting or with curator.

@dakrone dakrone added :Data Management/ILM+SLM Index and Snapshot lifecycle management team-discuss labels Oct 23, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

@gwbrown
Contributor

gwbrown commented Oct 23, 2019

The discussion on #46357 is highly related.

@jordansissel
Contributor Author

This comment is extremely relevant to my ILM experience: #46357 (comment)

It’s possible the design of ILM is incompatible with what I want to do: describe retention policy for my data.

@jordansissel
Contributor Author

Maybe the broader question is this: Is ILM even for my use case?

ILM applies when the index is created, or something like that? Because we cannot "update" a policy, we're left to handle the "change the policy" scenario very much on our own.

From discussions today, we would need to:

  • Find all affected indices
  • Create a new policy with just delete
  • Apply that policy to those affected indices.

"Find all affected indices" is basically "Query /*/_settings and select all indices with a specific lifecycle name" -- Then apply two steps (create a policy, apply to those indices). Now, if we look at what we're doing:

For each item in /*/_settings: if the index.lifecycle.name is the name of our policy needing updating, remove that policy and set the policy to the new "delete only" policy.

But looking at this specific scenario, "Delete older than 30 days": the result of /*/_settings gives us index.creation_date, which we could use to select indices with index.creation_date older than 30 days and then issue the delete call.

In these two scenarios, it's fewer steps to just implement ILM myself with curl than to create a new policy, apply it to all affected indices, then wait for ILM to execute and hope I did it right.
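
The do-it-yourself loop described above can be sketched as follows (the settings data and index names are made up for illustration; a real run would read them from GET /*/_settings):

```python
import time

# Select indices that (a) use our policy and (b) were created more than
# 30 days ago -- the manual equivalent of "delete after 30 days".
THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000
now_ms = int(time.time() * 1000)

# Fabricated stand-in for the GET /*/_settings response.
all_settings = {
    "auditbeat-000001": {"settings": {"index": {
        "lifecycle": {"name": "infosec-beats"},
        "creation_date": str(now_ms - 40 * 24 * 60 * 60 * 1000),  # 40 days old
    }}},
    "auditbeat-000002": {"settings": {"index": {
        "lifecycle": {"name": "infosec-beats"},
        "creation_date": str(now_ms - 5 * 24 * 60 * 60 * 1000),   # 5 days old
    }}},
    "other-index": {"settings": {"index": {
        "lifecycle": {"name": "some-other-policy"},
        "creation_date": str(now_ms - 90 * 24 * 60 * 60 * 1000),
    }}},
}

to_delete = [
    name
    for name, body in all_settings.items()
    if body["settings"]["index"]["lifecycle"]["name"] == "infosec-beats"
    and now_ms - int(body["settings"]["index"]["creation_date"]) >= THIRTY_DAYS_MS
]
print(to_delete)  # only the 40-day-old infosec-beats index qualifies
```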

This seems backwards.

Compare this scenario to Curator:

  • We have a policy to manage properties of our indices
  • Curator enforces this policy every time it runs
  • If we change the retention period, Curator enforces this on its next run.

With Curator, the "create a policy" and "change a policy" are identical steps:

  • Write some config
  • Deploy it
  • Wait for Curator to run

With ILM, the "create a policy" and "change a policy" are very different steps.

@jordansissel
Contributor Author

This is a blocker for us using ILM beyond "hot phase rollover"

@dakrone
Member

dakrone commented Oct 24, 2019

I believe the problem stems from a difference in execution style between Curator and ILM. I have an illustration that might help in thinking about the problem.

In ILM, you can think of an ILM policy as a factory assembly line (pardon my poor ASCII art):


           ___       ___       ___
Start --> | 1 | --> | 2 | --> | 3 | --> Completed
           ---       ---       ---

When an index is first assigned a policy, it begins at the "start" and starts executing actions (numbered 1, 2, and 3 above). These actions may be quick, or they may block for a long time (such as waiting for the min_age to be met on an index before proceeding). When an index has gone through all of the actions, it moves to the "completed" terminal step.

Curator, on the other hand, can be thought of as a list of if statements that check the age of the index every time it's run. In some bad pseudo-code it would look something like:

index.rolloverIfNeeded()

// 1
if (index.age() > "5d" && index.needForceMerge()) {
  index.forceMerge()
}

// 2
if (index.age() > "30d" && index.isNotFrozen()) {
  index.freeze()
}

// 3
if (index.age() > "60d") {
  index.moveToColdNodes()
}

Now let's tackle the scenario where you want to add a fourth step, a "delete" step (but it could be any step). In Curator's case, this looks like adding an additional if condition to the list of conditions checked every time Curator runs:

// 4
if (index.age() > "90d") {
  index.delete()
}

For ILM, however, it's a little different. If we change the policy to add the delete action, it's like updating a factory assembly line to add a new machine: all new indices will use the new pipeline, but the previous indices have already finished the old pipeline. So the indices already in the terminal/completed step won't be deleted.

In our industrial analogy, if we add a machine at the end of the car assembly line to paint the car yellow, all of the previously assembled cars don't immediately turn yellow.

Curator, on the other hand, goes to all the cars ever created, checks whether they meet the criteria to be painted yellow, and paints them yellow, regardless of what other steps they have or have not previously run.
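
The difference between the two models can be sketched in a toy simulation (the function names and index structures here are hypothetical, not real ILM or Curator code):

```python
# ILM: an index walks a one-way pipeline once; steps added to the policy
# after the index reaches "completed" are never revisited.
def ilm_run(index, pipeline):
    if index.get("completed"):
        return  # terminal step: later policy changes are not picked up
    for step in pipeline:
        index.setdefault("done", []).append(step)
    index["completed"] = True

# Curator: every run re-evaluates all rules against every index.
def curator_run(index, rules):
    for age_days, action in rules:
        if index["age_days"] >= age_days and action not in index.setdefault("done", []):
            index["done"].append(action)

old_index = {"age_days": 100}
ilm_run(old_index, ["forcemerge", "freeze"])            # finishes the pipeline
ilm_run(old_index, ["forcemerge", "freeze", "delete"])  # new step is ignored
print(old_index["done"])  # ['forcemerge', 'freeze']

old_index2 = {"age_days": 100}
curator_run(old_index2, [(5, "forcemerge"), (30, "freeze")])
curator_run(old_index2, [(5, "forcemerge"), (30, "freeze"), (90, "delete")])
print(old_index2["done"])  # ['forcemerge', 'freeze', 'delete']
```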


Okay, hopefully that clarifies the difference in execution model between the two. There are two main ways to tackle the cognitive mismatch this leads to.

  1. We can change ILM's execution model to match something like Curator. This would be a lot of work and a big breaking change, but it's always an option regardless of how good or bad we think it would be.

  2. Make policy execution idempotent. We already started this with ILM: Skip rolling indexes that are already rolled #47324, where we skip rolling over indices that have already been rolled over. This would allow us to "re-apply" a policy to an index that was in the completed state and have it re-run through the steps, skipping the ones that were already executed.

There are probably other ways to tackle this too, so feel free to leave ideas. This is just the start of a discussion to hopefully frame the mismatch in execution models that we've seen cause some incorrect assumptions.

@jordansissel
Contributor Author

This factory/cars metaphor is a good one. I suppose in this model, we've "finished" making the cars, but there's a defect, and now we need to issue a recall to apply a correction of some kind.

To do a recall with ILM, we're on our own (Find all affected indices ourselves, carefully apply a new policy, etc).

To do a recall with Curator, we just update the curator config.

The way I'm thinking about this is that I am not particularly curious about the implementation details: I want a way to describe my retention policy, "Delete data older than 30 days", and implementing this with ILM is only possible if we never make mistakes and never change our minds.

Scenarios:

  • We forgot to set "Delete after 30 days". Our cluster is now full (all disks are rejecting allocation past the low disk watermark). How do we correct this?

    • With ILM: We have to do a bunch of work and discovery. We craft a new policy that only mentions "delete after 30 days", custom-code some way to identify affected indices, delete the old policy on these indices and apply the new policy?
    • With Curator: We tell curator to delete everything older than 30 days, and it does.
  • We change our minds and want to "Delete after 60 days". (was 30)

    • With ILM: Same complex steps as previous scenario, otherwise our existing "Marked for deletion after 30 days" will be deleted after 30 days.
    • With Curator: We tell curator to delete everything older than 60 days, and it does.
  • We change our minds and want to "Delete after 15 days" (was 30)

    • With ILM: Same complex steps as previous scenario, because we need those indices older than 15 days now to be deleted.
    • With Curator: We tell curator to delete everything older than 15 days, and it does.

My illustration is meant to highlight that Curator allows me to describe my objectives, and it implements them. With ILM, unless I describe my objectives perfectly the first time, I have to do a bunch of work that might break depending on the phase ILM is in.

The concept of "phases" (a specific set of constants: hot, warm, delete, etc.) is a state machine I must be completely aware of whenever changing ILM settings -- if I'm in the middle of the machine (between "hot" and "completed"?), the effect of changing ILM settings depends on the current state. If I'm updating a phase configuration after ILM has executed that phase, I have to find-and-replace all instances of that ILM policy myself, based on what we've discussed so far, right?

@jordansissel
Contributor Author

jordansissel commented Oct 25, 2019

adding an additional if condition to the list of conditions to be checked every time curator runs:

The way this is italicized around "every time curator runs" makes me think there's some problem with that?

Back to ILM and thinking out loud about how to make this repair, "I forgot to implement 30 days retention"

# Get a list of indices affected:
curl -s --cacert <(echo $cert) --resolve foo-es:9200:127.0.0.1 -u elastic:$pass https://foo-es:9200/*/_settings | jq -r '.[] | select(.settings.index.lifecycle.name == "infosec-beats") | .settings.index.provided_name'

With this list, we need to update each index's lifecycle name, right?

My list of indices is too long for a single request:

% affected_indices="$(curl -s --cacert <(echo $cert) --resolve foo-es:9200:127.0.0.1 -u elastic:$pass https://foo-es:9200/*/_settings | jq -r '.[] | select(.settings.index.lifecycle.name == "infosec-beats") | .settings.index.provided_name' | tr '\n' ',' | sed -e 's/,$//')"

# Let's try asking for /{index}/_ilm/explain for all indices found above:
% curl -s --cacert <(echo $cert) --resolve foo-es:9200:127.0.0.1 -u elastic:$pass https://foo-es:9200/${affected_indices}/_ilm/explain
{"error":{"root_cause":[{"type":"too_long_frame_exception","reason":"An HTTP line is larger than 4096 bytes."}],"type":"too_long_frame_exception","reason":"An HTTP line is larger than 4096 bytes."},"status":400}

Ok, so we have to make the request lines shorter. Let's make each index list at most 2000 chars long.

% curl -s --cacert <(echo $cert) --resolve foo-es:9200:127.0.0.1 -u elastic:$pass https://foo-es:9200/*/_settings | jq -r '.[] | select(.settings.index.lifecycle.name == "infosec-beats") | .settings.index.provided_name' | fmt -w 2000 | tr ' ' ','
auditbeat-6.5.4-2019.09.10-000127,auditbeat-6.5.4-2019.09.1...
auditbeat-6.5.4-2019.09.10-000119,auditbeat-6.5.4-2019.09.10-000081,auditbea...
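
The fmt -w 2000 chunking above can also be sketched in Python: pack index names into comma-joined groups that stay under a length limit, so each request URL fits inside Elasticsearch's default 4096-byte HTTP line limit (the names below are generated for illustration):

```python
# Pack index names into comma-separated groups whose joined length stays
# under `limit`, mirroring the `fmt -w 2000 | tr ' ' ','` pipeline above.
def chunk_names(names, limit=2000):
    chunks, current = [], []
    for name in names:
        candidate = ",".join(current + [name])
        if current and len(candidate) > limit:
            chunks.append(",".join(current))
            current = [name]
        else:
            current.append(name)
    if current:
        chunks.append(",".join(current))
    return chunks

# Fabricated index names in the same shape as the real ones above.
names = [f"auditbeat-6.5.4-2019.09.10-{n:06d}" for n in range(200)]
groups = chunk_names(names, limit=2000)
assert all(len(g) <= 2000 for g in groups)
print(len(groups))
```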

With this, we replace the policy, right?

# Remove the current policy from each index
curl -s --cacert <(echo $cert) --resolve foo-es:9200:127.0.0.1 -u elastic:$pass https://foo-es:9200/*/_settings | jq -r '.[] | select(.settings.index.lifecycle.name == "infosec-beats") | .settings.index.provided_name' | fmt -w 2000 | tr ' ' ',' | xargs -n1 sh -xc 'curl --whatever-flags-go-here -XPUT https://foo-es:9200/$1/_settings -d "{ \"settings\": { \"index.lifecycle.name\": null } }"' -

# Apply our new policy that has only a "delete" phase to avoid other breakages
curl -s --cacert <(echo $cert) --resolve foo-es:9200:127.0.0.1 -u elastic:$pass https://foo-es:9200/*/_settings | jq -r '.[] | select(.settings.index.lifecycle.name == "infosec-beats") | .settings.index.provided_name' | fmt -w 2000 | tr ' ' ',' | xargs -n1 sh -xc 'curl --whatever-flags-go-here -XPUT https://foo-es:9200/$1/_settings -d "{ \"settings\": { \"index.lifecycle.name\": \"new-policy\" } }"' -

I haven't tested the actual "delete and apply new policy" steps, but the "find affected indices" and "split requests to stay within ES's max request length boundary" parts work.

Is this what you expect for the repair steps?

@jordansissel
Contributor Author

Any thoughts?

@dakrone
Member

dakrone commented Oct 29, 2019

We forgot to set "Delete after 30 days". Our cluster is now full (all disks are rejecting allocation past the low disk watermark). How do we correct this?

With ILM: We have to do a bunch of work and discovery. We craft a new policy that only mentions "delete after 30 days", custom-code some way to identify affected indices, delete the old policy on these indices and apply the new policy?

Not necessarily; in your case, you delete the indices that are already older than 30 days once, and change the policy so future indices get automatically deleted after 30 days.

We change our minds and want to "Delete after 60 days". (was 30)
With ILM: Same complex steps as previous scenario, otherwise our existing "Marked for deletion after 30 days" will be deleted after 30 days.

No, for this you can just change the policy; no need for manual discovery or work.

We change our minds and want to "Delete after 15 days" (was 30)
With ILM: Same complex steps as previous scenario, because we need those indices older than 15 days now to be deleted.

No, for this you can change the policy from "30d" to "15d".

If I'm updating a phase configuration after ILM has executed that phase, I have to find-and-replace all instances of that ILM policy myself, based on what we've discussed so far, right?

If you want to perform an action after ILM has executed that phase, then you'll have to do those actions manually. In your previous examples, however, changing the min_age for deletion doesn't mean you need to take any manual action (other than changing the policy). It's only in the scenario that you previously had no deletion and you want to add deletion that you'd have to take manual action.

adding an additional if condition to the list of conditions to be checked every time curator runs:

The way this is italicized around "every time curator runs" makes me think there's some problem with that?

None other than emphasizing the difference in execution models.

Back to ILM and thinking out loud about how to make this repair, "I forgot to implement 30 days retention"
... lots of steps ...
Is this what you expect for the repair steps?

This is really overcomplicating it, I think. Here's how to get a list of indices to delete and delete them:

curl -u elastic:password -s "localhost:9200/_cat/indices?h=i,cd&format=json" | jq -r '.[] | select(1572362742828 - (.cd|tonumber) >= 2592000000) | .i'

Replace 1572362742828 above with the current epoch milliseconds (the System.currentTimeMillis() equivalent) of whenever you run it (2592000000 is 30 days in milliseconds, so replace it with whatever age you want). That will spit out a list of indices to be deleted. You can pipe it to xargs, or use a shell loop: for i in $(<that command>); do curl -XDELETE localhost:9200/$i; done to delete those indices.
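
As a sanity check on the arithmetic in that one-liner (the index data below is fabricated; a real run would read GET _cat/indices?h=i,cd&format=json):

```python
import time

# 2592000000 ms is exactly 30 days; the hard-coded 1572362742828 stands
# in for "now" in epoch milliseconds.
THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000
assert THIRTY_DAYS_MS == 2592000000

now_ms = int(time.time() * 1000)

# Fabricated stand-in for the _cat/indices JSON rows (i = name, cd = creation date).
cat_indices = [
    {"i": "auditbeat-000001", "cd": str(now_ms - 45 * 86400000)},  # 45 days old
    {"i": "auditbeat-000002", "cd": str(now_ms - 10 * 86400000)},  # 10 days old
]

to_delete = [row["i"] for row in cat_indices
             if now_ms - int(row["cd"]) >= THIRTY_DAYS_MS]
print(to_delete)  # ['auditbeat-000001']
```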

One quick side note (not related to ILM): don't use index.provided_name when gathering a list of indices, since it will include the < and > of date-math index names, which aren't actually part of the index name.

@dakrone
Member

dakrone commented Oct 29, 2019

Okay, that got a little bit off on a tangent focusing solely on deletions. There are other considerations we should address, specifically:

How do we address the disconnect between the execution model of ILM and the model that users expect?

I've been thinking about this recently and have a few half-formed ideas to brainstorm around. One would be to store the previously executed steps somewhere in the index lifecycle state. This would allow us to "re-apply" a policy idempotently, skipping the steps that had already been executed. Then a user could update a policy to add a delete phase and have the other steps that had already been performed be skipped.

There are some downsides to this, especially when it comes to policies without a delete phase: we wouldn't want to allocate a bunch of indices to a warm node, only to turn around and immediately move them to cold nodes, just because a user happened to add both allocation steps to a policy that was already complete.

Another idea would be to execute the policy of an index in the "complete" phase backwards when the policy changes, allowing us to immediately execute a delete action if one were added to a policy whose indices had already finished execution.

Neither of these is fully fleshed out; they're just some ideas to get us started on brainstorming. Feel free to suggest other ideas for our discussion.

@dakrone dakrone self-assigned this Jan 6, 2020
dakrone added a commit to dakrone/elasticsearch that referenced this issue Jan 9, 2020
There are some cases when updating a policy does not change the
structure in a significant way. In these cases, we can reread the
policy definition for any indices using the updated policy.

This commit adds this refreshing to the `TransportPutLifecycleAction`
to allow this. It allows us to do things like change the configuration
values for a particular step, even when on that step (for example,
changing the rollover criteria while on the `check-rollover-ready` step).

There are more cases where the phase definition can be reread than just
the ones checked here (for example, removing an action that has already
been passed), and those will be added in subsequent work.

Relates to elastic#48431
dakrone added a commit that referenced this issue Jan 13, 2020
dakrone added a commit to dakrone/elasticsearch that referenced this issue Jan 13, 2020
dakrone added a commit that referenced this issue Jan 13, 2020
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020
dakrone added a commit to dakrone/elasticsearch that referenced this issue Jan 29, 2020
Currently when an ILM policy finishes its execution, the index moves into the `TerminalPolicyStep`,
denoted by a completed/completed/completed phase/action/step lifecycle execution state.

This commit changes the behavior so that the index lifecycle execution state halts at the last
configured phase's `PhaseCompleteStep`, so for instance, if an index were configured with a policy
containing a `hot` and `cold` phase, the index would stop at the `cold/complete/complete`
`PhaseCompleteStep`. This allows an ILM user to update the policy to add any later phases and have
indices configured to use that policy pick up execution at the newly added "later" phase. For
example, if a `delete` phase were added to the policy specified above, the index would then move
from `cold/complete/complete` into the `delete` phase.

Relates to elastic#48431
dakrone added a commit that referenced this issue Jan 31, 2020
dakrone added a commit to dakrone/elasticsearch that referenced this issue Jan 31, 2020
dakrone added a commit that referenced this issue Jan 31, 2020
@dakrone
Member

dakrone commented Feb 11, 2020

I believe this has been resolved by #50820 and #51631, so I'm going to close this for now. Let me know if I'm mistaken and we can re-open it.

@dakrone dakrone closed this as completed Feb 11, 2020
@JacksonHacker

I believe this has been resolved by #50280 and #51631, so I'm going to close this for now. Let me know if I'm mistaken and we can re-open it.

Not #50280. It's #50820

@dakrone
Member

dakrone commented May 27, 2020

@JacksonHacker whoops yep, thanks for pointing that out, I edited my comment.
