Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry downsample ILM action using a new target index #94965

Merged
merged 18 commits into from
May 15, 2023

Conversation

csoulios
Copy link
Contributor

@csoulios csoulios commented Mar 31, 2023

Currently, when the ILM downample step is being retried, the same target index is used. This can cause the subsequent downsample API invocation to index rolled up data into shards of the target index that already exists and while the previous downsample API invocation is still partially running (and also rolling up data into the same target shard).

Note that, the downsample step may fail in case a cluster is being restarted in a rolling manner (for example for an upgrade) or when the elected master node fails (the downsample action is coordinated from the elected master node).

This PR modfies the ILM DownsampleAction so that when DownsampleStep fails, it will retry by going performing the following steps 1. Cleanup existing target index, 2. Generate a new index name for the target index, 3. Downsample using the new target index name.

Note 1: This change may leave some garbage indices that we must find another way to cleanup. However, the downsample process will become more resilient.

Note 2: A similar approach is used by the searchable_snapshot ILM action

Closes #93580

@csoulios csoulios added :Data Management/ILM+SLM Index and Snapshot lifecycle management :StorageEngine/Rollup Turn fine-grained time-based data into coarser-grained data v8.7.1 v8.8.0 labels Mar 31, 2023
@@ -81,7 +81,7 @@ public static Builder builder(LifecycleExecutionState state) {
.setSnapshotName(state.snapshotName)
.setShrinkIndexName(state.shrinkIndexName)
.setSnapshotIndexName(state.snapshotIndexName)
.setRollupIndexName(state.downsampleIndexName)
Copy link
Contributor Author

@csoulios csoulios Mar 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took the opportunity and renamed all instances of rollup to downsample.

I know this adds a bit of noise to this PR, but I could not resist and did it while I was in the area

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this rename Christos (it's been quite confusing before)

@csoulios csoulios changed the title Fix ds ilm2 Make downsample ILM action retry using a new target index name Mar 31, 2023
@@ -107,6 +107,7 @@ public void refreshAbstractions() {
alias,
policy
);
updateClusterSettings(client(), Settings.builder().put("indices.lifecycle.poll_interval", "5s").build());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Remove this

I only had it for my testing

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it can be removed now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in b669ad9

@csoulios csoulios added the >bug label Mar 31, 2023
@elasticsearchmachine
Copy link
Collaborator

Hi @csoulios, I've created a changelog YAML for you.

@csoulios csoulios changed the title Make downsample ILM action retry using a new target index name Retry downsample ILM action using a new target index name Mar 31, 2023
@csoulios csoulios changed the title Retry downsample ILM action using a new target index name Retry downsample ILM action using a new target index Mar 31, 2023
@csoulios csoulios added the :StorageEngine/TSDB You know, for Metrics label Apr 4, 2023
Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the direction in this pr looks good.

);
// Rollup index has already been created with the generated name but its status is not "success".
// So we delete the index and proceed with executing the rollup step.
DeleteIndexRequest deleteRequest = new DeleteIndexRequest(downsampleIndexName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is removed because cleanupDownsampleIndexKey will be invoked in case of failure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the step will fail and ILM will go to the cleanup step. So no need to delete the index here

@@ -35,9 +34,20 @@ public class DownsampleStep extends AsyncActionStep {
private static final Logger logger = LogManager.getLogger(DownsampleStep.class);

private final DateHistogramInterval fixedInterval;
private final StepKey nextStepOnSuccess;
private final StepKey nextStepOnFailure;
private boolean downsampleFailed;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also use SetOnce here like in CreateSnapshotStep?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to avoid having to set it at every successful outcome. So, boolean value is initially set to false and on failure I set it to true. This way, ILM will by default proceed to the next step.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I do think we need to make this field volatile? Since It is being accessed from different threads (line 123?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in c097cad

@pquentin pquentin added v8.7.2 and removed v8.7.1 labels Apr 12, 2023
@gmarouli gmarouli added v8.9.0 and removed v8.8.0 labels Apr 26, 2023
Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @csoulios!
I left a few comments, otherwise looks good.

@@ -107,6 +107,7 @@ public void refreshAbstractions() {
alias,
policy
);
updateClusterSettings(client(), Settings.builder().put("indices.lifecycle.poll_interval", "5s").build());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it can be removed now?

@@ -35,9 +34,20 @@ public class DownsampleStep extends AsyncActionStep {
private static final Logger logger = LogManager.getLogger(DownsampleStep.class);

private final DateHistogramInterval fixedInterval;
private final StepKey nextStepOnSuccess;
private final StepKey nextStepOnFailure;
private boolean downsampleFailed;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I do think we need to make this field volatile? Since It is being accessed from different threads (line 123?).

@@ -9,7 +9,6 @@
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.admin.indices.delete.DeleteIndexRequest;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It think we need a testNextStepKey() similar to the one in CreateSnapshotStepTests too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 65b671f

@csoulios csoulios added v8.8.1 auto-backport-and-merge Automatically create backport pull requests and merge when ready >bug and removed >bug labels May 12, 2023
@csoulios
Copy link
Contributor Author

@elasticsearchmachine generate changelog

@csoulios csoulios marked this pull request as ready for review May 12, 2023 13:12
@elasticsearchmachine elasticsearchmachine added Team:Data Management Meta label for data/management team Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) labels May 12, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@csoulios csoulios requested a review from martijnvg May 12, 2023 13:12
Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
8.7
8.8

csoulios added a commit to csoulios/elasticsearch that referenced this pull request May 15, 2023
Currently, when the ILM downample step is being retried, the same target index is used.
This can cause the subsequent downsample API invocation to index rolled up data into
shards of the target index that already exists and while the previous downsample API
invocation is still partially running (and also rolling up data into the same target shard).

Note that, the downsample step may fail in case a cluster is being restarted in a rolling 
manner (for example for an upgrade) or when the elected master node fails (the downsample 
action is coordinated from the elected master node).

This PR modfies the ILM DownsampleAction so that when DownsampleStep fails, it will retry
by going performing the following steps
  1. Cleanup existing target index, 
  2. Generate a new index name for the target index
  3. Downsample using the new target index name.

Note 1: This change may leave some garbage indices that we must find another way
to cleanup. However, the downsample process will become more resilient.

Note 2: A similar approach is used by the searchable_snapshot ILM action

Closes elastic#93580
csoulios added a commit to csoulios/elasticsearch that referenced this pull request May 15, 2023
Currently, when the ILM downample step is being retried, the same target index is used.
This can cause the subsequent downsample API invocation to index rolled up data into
shards of the target index that already exists and while the previous downsample API
invocation is still partially running (and also rolling up data into the same target shard).

Note that, the downsample step may fail in case a cluster is being restarted in a rolling 
manner (for example for an upgrade) or when the elected master node fails (the downsample 
action is coordinated from the elected master node).

This PR modfies the ILM DownsampleAction so that when DownsampleStep fails, it will retry
by going performing the following steps
  1. Cleanup existing target index, 
  2. Generate a new index name for the target index
  3. Downsample using the new target index name.

Note 1: This change may leave some garbage indices that we must find another way
to cleanup. However, the downsample process will become more resilient.

Note 2: A similar approach is used by the searchable_snapshot ILM action

Closes elastic#93580
elasticsearchmachine pushed a commit that referenced this pull request May 15, 2023
Currently, when the ILM downample step is being retried, the same target index is used.
This can cause the subsequent downsample API invocation to index rolled up data into
shards of the target index that already exists and while the previous downsample API
invocation is still partially running (and also rolling up data into the same target shard).

Note that, the downsample step may fail in case a cluster is being restarted in a rolling 
manner (for example for an upgrade) or when the elected master node fails (the downsample 
action is coordinated from the elected master node).

This PR modfies the ILM DownsampleAction so that when DownsampleStep fails, it will retry
by going performing the following steps
  1. Cleanup existing target index, 
  2. Generate a new index name for the target index
  3. Downsample using the new target index name.

Note 1: This change may leave some garbage indices that we must find another way
to cleanup. However, the downsample process will become more resilient.

Note 2: A similar approach is used by the searchable_snapshot ILM action

Closes #93580
elasticsearchmachine pushed a commit that referenced this pull request May 15, 2023
Currently, when the ILM downample step is being retried, the same target index is used.
This can cause the subsequent downsample API invocation to index rolled up data into
shards of the target index that already exists and while the previous downsample API
invocation is still partially running (and also rolling up data into the same target shard).

Note that, the downsample step may fail in case a cluster is being restarted in a rolling 
manner (for example for an upgrade) or when the elected master node fails (the downsample 
action is coordinated from the elected master node).

This PR modfies the ILM DownsampleAction so that when DownsampleStep fails, it will retry
by going performing the following steps
  1. Cleanup existing target index, 
  2. Generate a new index name for the target index
  3. Downsample using the new target index name.

Note 1: This change may leave some garbage indices that we must find another way
to cleanup. However, the downsample process will become more resilient.

Note 2: A similar approach is used by the searchable_snapshot ILM action

Closes #93580
@gmarouli gmarouli added v8.8.0 and removed v8.8.1 labels May 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport-and-merge Automatically create backport pull requests and merge when ready >bug :Data Management/ILM+SLM Index and Snapshot lifecycle management :StorageEngine/Rollup Turn fine-grained time-based data into coarser-grained data :StorageEngine/TSDB You know, for Metrics Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Data Management Meta label for data/management team v8.7.2 v8.8.0 v8.9.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Downsample retry issue
6 participants