-
Notifications
You must be signed in to change notification settings - Fork 24.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retry downsample
ILM action using a new target index
#94965
Conversation
@@ -81,7 +81,7 @@ public static Builder builder(LifecycleExecutionState state) { | |||
.setSnapshotName(state.snapshotName) | |||
.setShrinkIndexName(state.shrinkIndexName) | |||
.setSnapshotIndexName(state.snapshotIndexName) | |||
.setRollupIndexName(state.downsampleIndexName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took the opportunity and renamed all instances of rollup
to downsample
.
I know this adds a bit of noise to this PR, but I could not resist and did it while I was in the area
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this rename Christos (it's been quite confusing before)
downsample
ILM action retry using a new target index name
@@ -107,6 +107,7 @@ public void refreshAbstractions() { | |||
alias, | |||
policy | |||
); | |||
updateClusterSettings(client(), Settings.builder().put("indices.lifecycle.poll_interval", "5s").build()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: Remove this
I only had it for my testing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it can be removed now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed in b669ad9
Hi @csoulios, I've created a changelog YAML for you. |
downsample
ILM action retry using a new target index namedownsample
ILM action using a new target index name
downsample
ILM action using a new target index namedownsample
ILM action using a new target index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the direction in this pr looks good.
); | ||
// Rollup index has already been created with the generated name but its status is not "success". | ||
// So we delete the index and proceed with executing the rollup step. | ||
DeleteIndexRequest deleteRequest = new DeleteIndexRequest(downsampleIndexName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is removed because cleanupDownsampleIndexKey
will be invoked in case of failure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the step will fail and ILM will go to the cleanup step. So no need to delete the index here
@@ -35,9 +34,20 @@ public class DownsampleStep extends AsyncActionStep { | |||
private static final Logger logger = LogManager.getLogger(DownsampleStep.class); | |||
|
|||
private final DateHistogramInterval fixedInterval; | |||
private final StepKey nextStepOnSuccess; | |||
private final StepKey nextStepOnFailure; | |||
private boolean downsampleFailed; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also use SetOnce here like in CreateSnapshotStep?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to avoid having to set it at every successful outcome. So, boolean value is initially set to false
and on failure I set it to true
. This way, ILM will by default proceed to the next step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I do think we need to make this field volatile
? Since It is being accessed from different threads (line 123?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in c097cad
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @csoulios!
I left a few comments, otherwise looks good.
@@ -107,6 +107,7 @@ public void refreshAbstractions() { | |||
alias, | |||
policy | |||
); | |||
updateClusterSettings(client(), Settings.builder().put("indices.lifecycle.poll_interval", "5s").build()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it can be removed now?
@@ -35,9 +34,20 @@ public class DownsampleStep extends AsyncActionStep { | |||
private static final Logger logger = LogManager.getLogger(DownsampleStep.class); | |||
|
|||
private final DateHistogramInterval fixedInterval; | |||
private final StepKey nextStepOnSuccess; | |||
private final StepKey nextStepOnFailure; | |||
private boolean downsampleFailed; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I do think we need to make this field volatile
? Since It is being accessed from different threads (line 123?).
@@ -9,7 +9,6 @@ | |||
import org.apache.logging.log4j.LogManager; | |||
import org.apache.logging.log4j.Logger; | |||
import org.elasticsearch.action.ActionListener; | |||
import org.elasticsearch.action.admin.indices.delete.DeleteIndexRequest; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It think we need a testNextStepKey()
similar to the one in CreateSnapshotStepTests
too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 65b671f
@elasticsearchmachine generate changelog |
Pinging @elastic/es-analytics-geo (Team:Analytics) |
Pinging @elastic/es-data-management (Team:Data Management) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
Currently, when the ILM downample step is being retried, the same target index is used. This can cause the subsequent downsample API invocation to index rolled up data into shards of the target index that already exists and while the previous downsample API invocation is still partially running (and also rolling up data into the same target shard). Note that, the downsample step may fail in case a cluster is being restarted in a rolling manner (for example for an upgrade) or when the elected master node fails (the downsample action is coordinated from the elected master node). This PR modfies the ILM DownsampleAction so that when DownsampleStep fails, it will retry by going performing the following steps 1. Cleanup existing target index, 2. Generate a new index name for the target index 3. Downsample using the new target index name. Note 1: This change may leave some garbage indices that we must find another way to cleanup. However, the downsample process will become more resilient. Note 2: A similar approach is used by the searchable_snapshot ILM action Closes elastic#93580
Currently, when the ILM downample step is being retried, the same target index is used. This can cause the subsequent downsample API invocation to index rolled up data into shards of the target index that already exists and while the previous downsample API invocation is still partially running (and also rolling up data into the same target shard). Note that, the downsample step may fail in case a cluster is being restarted in a rolling manner (for example for an upgrade) or when the elected master node fails (the downsample action is coordinated from the elected master node). This PR modfies the ILM DownsampleAction so that when DownsampleStep fails, it will retry by going performing the following steps 1. Cleanup existing target index, 2. Generate a new index name for the target index 3. Downsample using the new target index name. Note 1: This change may leave some garbage indices that we must find another way to cleanup. However, the downsample process will become more resilient. Note 2: A similar approach is used by the searchable_snapshot ILM action Closes elastic#93580
Currently, when the ILM downample step is being retried, the same target index is used. This can cause the subsequent downsample API invocation to index rolled up data into shards of the target index that already exists and while the previous downsample API invocation is still partially running (and also rolling up data into the same target shard). Note that, the downsample step may fail in case a cluster is being restarted in a rolling manner (for example for an upgrade) or when the elected master node fails (the downsample action is coordinated from the elected master node). This PR modfies the ILM DownsampleAction so that when DownsampleStep fails, it will retry by going performing the following steps 1. Cleanup existing target index, 2. Generate a new index name for the target index 3. Downsample using the new target index name. Note 1: This change may leave some garbage indices that we must find another way to cleanup. However, the downsample process will become more resilient. Note 2: A similar approach is used by the searchable_snapshot ILM action Closes #93580
Currently, when the ILM downample step is being retried, the same target index is used. This can cause the subsequent downsample API invocation to index rolled up data into shards of the target index that already exists and while the previous downsample API invocation is still partially running (and also rolling up data into the same target shard). Note that, the downsample step may fail in case a cluster is being restarted in a rolling manner (for example for an upgrade) or when the elected master node fails (the downsample action is coordinated from the elected master node). This PR modfies the ILM DownsampleAction so that when DownsampleStep fails, it will retry by going performing the following steps 1. Cleanup existing target index, 2. Generate a new index name for the target index 3. Downsample using the new target index name. Note 1: This change may leave some garbage indices that we must find another way to cleanup. However, the downsample process will become more resilient. Note 2: A similar approach is used by the searchable_snapshot ILM action Closes #93580
Currently, when the ILM downample step is being retried, the same target index is used. This can cause the subsequent
downsample
API invocation to index rolled up data into shards of the target index that already exists and while the previousdownsample
API invocation is still partially running (and also rolling up data into the same target shard).Note that, the downsample step may fail in case a cluster is being restarted in a rolling manner (for example for an upgrade) or when the elected master node fails (the downsample action is coordinated from the elected master node).
This PR modfies the ILM
DownsampleAction
so that whenDownsampleStep
fails, it will retry by going performing the following steps 1. Cleanup existing target index, 2. Generate a new index name for the target index, 3. Downsample using the new target index name.Note 1: This change may leave some garbage indices that we must find another way to cleanup. However, the downsample process will become more resilient.
Note 2: A similar approach is used by the
searchable_snapshot
ILM actionCloses #93580