
Avoid unintentionally clearing the DataStream.rolloverOnWrite flag #107122

Merged: 8 commits merged into elastic:main from nielsbauman:ds-constructor on Apr 8, 2024

Conversation

nielsbauman (Contributor)

Many places in the code use a `DataStream` constructor that sets the `rolloverOnWrite` flag to `false`. In some places this was intentional, but in others it was erroneous (and for most tests, it didn't matter much).

This PR fixes the erroneous spots and avoids similar unintentional behavior in the future by removing the constructor in question altogether. Most use cases just want to copy the flag over, and if you do want to set the flag to `false`, it makes more sense to do so explicitly yourself rather than letting the constructor do it for you.

An additional small bonus is that we have one less constructor for the `DataStream` class :).

Follow up of this discussion.
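To illustrate the pattern the PR moves to, here is a minimal, hypothetical sketch; the real `DataStream` constructor takes many more arguments, and `copyWithName` is an invented helper, not the actual API:

```java
// A minimal, hypothetical sketch -- the real DataStream class has far more
// fields and constructor parameters than shown here.
public final class DataStream {
    private final String name;
    private final boolean rolloverOnWrite;

    public DataStream(String name, boolean rolloverOnWrite) {
        this.name = name;
        this.rolloverOnWrite = rolloverOnWrite;
    }

    public boolean rolloverOnWrite() {
        return rolloverOnWrite;
    }

    // Before: a convenience constructor silently defaulted the flag to false,
    // e.g. DataStream(String name) { this(name, false); } -- now removed.
    // After: code that derives a new instance copies the flag explicitly,
    // so it can no longer be cleared by accident.
    public DataStream copyWithName(String newName) {
        return new DataStream(newName, this.rolloverOnWrite); // copy, don't reset
    }
}
```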

@nielsbauman nielsbauman added >bug :Data Management/Data streams Data streams and their lifecycles Team:Data Management Meta label for data/management team v8.14.0 v8.13.2 labels Apr 4, 2024
@elasticsearchmachine (Collaborator)
Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine (Collaborator)

Hi @nielsbauman, I've created a changelog YAML for you.

@@ -1791,7 +1791,8 @@ public ClusterState execute(ClusterState currentState) throws Exception {
     original.getLifecycle(),
     original.isFailureStore(),
     original.getFailureIndices(),
-    null
+    original.rolloverOnWrite(),
+    original.getAutoShardingEvent()
nielsbauman (Contributor, Author)

There are a couple of changes like this, where I also copy the auto-sharding event rather than passing null (only in tests). From what I could see, none of these tests actually cared about the auto-sharding event, so we should just copy the original value. This will also simplify the use of the Builder class that's being worked on in another PR.

@@ -810,6 +796,7 @@ public DataStream snapshot(Collection<String> indicesInSnapshot) {
     lifecycle,
     failureStore,
     failureIndices,
+    rolloverOnWrite,
nielsbauman (Contributor, Author)

I wasn't 100% sure about the snapshot and restore ones (see the change in the RestoreService.java). I don't immediately see a reason why we wouldn't want to copy the flag when snapshotting or restoring, so I decided to keep it in (rather than overriding it to false). Let me know if I'm missing something here.

gmarouli (Contributor)

I am also leaning towards copying it. If this flag was added because something changed in the index template, I think it's safer to apply the rollover at the next write to ensure we have the latest index template data.

jbaiera (Member)

If we restore a data stream under a different name that no longer matches the template, we won't be able to roll it over. We currently document this as a concern: https://www.elastic.co/guide/en/elasticsearch/reference/current/restore-snapshot-api.html#restore-snapshot-api-prereqs

Restoring the data stream without a template poses a maintenance problem, but if we restore a data stream in that way with the rollover on write flag set, will it cause us to not be able to write to the data stream at all until it is remedied? That would take that corner case from being just a maintenance problem to a possible availability problem.

I don't think this is a blocker for merging this, but I think we may want to make sure that kind of interaction is what we expect it to be and note it as a "gotcha" in the documentation.

nielsbauman (Contributor, Author)

Ah that's an excellent point. I don't immediately see an easy/sensible way to avoid that "gotcha" while keeping the rolloverOnWrite flag set to true (I see some ways, but they're either very limiting in the snapshot/restore process or they require elaborate changes in the rollover process, both of which I'm not a big fan of right away). I think I'm still on board with copying the flag, but I agree that we should make this "gotcha" (more) clear in the documentation.

nielsbauman (Contributor, Author)

I wanted to include the doc change in this PR, see 59b66a7. Let me know if either of you had anything else in mind. (fingers crossed that this renders like I intended)

gmarouli (Contributor)

Hm, very good point:

That would take that corner case from being just a maintenance problem to a possible availability problem.

I think this is not a very good user experience.

What if we checked whether there is a valid template at the moment of restoration? We have the cluster state available at: https://github.com/elastic/elasticsearch/blob/882b92ab6092a8ea6b53a64e90cb12e812241483/server/src/main/java/org/elasticsearch/snapshots/RestoreService.java#L1409

If there is a template, we copy rolloverOnWrite; if not, we set it to false. We could even add an info log line when restoring a data stream without a template, so we can debug this.

If you think this is too complex, then I am in favour of not copying the flag and giving the user the recommendation to roll over if they want to use the current index template.

What are your thoughts?
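For illustration, a self-contained sketch of that idea; the helper names and the plain map standing in for the cluster state's template metadata are hypothetical, not the actual `RestoreService` code:

```java
import java.util.Map;

// Hypothetical, self-contained sketch of the proposed restore-time check; the
// real RestoreService consults the cluster state's index template metadata,
// for which a plain pattern -> template-name map stands in here.
final class RestoreRolloverCheck {

    /** Decides the rolloverOnWrite value for the data stream being restored. */
    static boolean restoredRolloverOnWrite(
        Map<String, String> templatesByIndexPattern,
        String dataStreamName,
        boolean originalRolloverOnWrite
    ) {
        boolean templateExists = templatesByIndexPattern.keySet().stream()
            .anyMatch(pattern -> matches(pattern, dataStreamName));
        if (templateExists) {
            // A template exists, so a pending lazy rollover can still succeed:
            // keep whatever the snapshotted data stream had.
            return originalRolloverOnWrite;
        }
        // No matching template: clear the flag so the first write is not blocked
        // by an un-satisfiable rollover, and leave a breadcrumb for debugging.
        System.out.printf("INFO restoring data stream [%s] without a matching "
            + "index template; clearing rolloverOnWrite%n", dataStreamName);
        return false;
    }

    // Deliberately simplified glob matching: only a trailing '*' is supported.
    private static boolean matches(String pattern, String name) {
        return pattern.endsWith("*")
            ? name.startsWith(pattern.substring(0, pattern.length() - 1))
            : pattern.equals(name);
    }
}
```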

nielsbauman (Contributor, Author)

If there is a template, we copy rolloverOnWrite; if not, we set it to false. We could even add an info log line when restoring a data stream without a template, so we can debug this.

Ah, I like this idea. I don't have an issue with the complexity myself, but I am trying to figure out what impact this would have on the UX as well. Ideally, I'd like to make this clear to the user in some way other than a log line (to hopefully reduce SDHs coming our way about this).

If we do not copy the rolloverOnWrite flag, and a user creates a matching template afterwards (with a different mapping or whatever), we won't automatically roll over because it isn't considered a "change" (source). I guess this would be a less "blocking" flow than the current behaviour in this PR, but I'm trying to come up with a way to make this clear to the user. Maybe we could add a warn log line along the lines of:

Restoring a data stream that was marked for lazy rollover, without a matching index template. The restored data stream will NOT be marked for lazy rollover, and might thus not include the latest changes in the index template that caused the original data stream to be marked for rollover.

jbaiera (Member)

I think the fact that you can restore a data stream without a template is vaguely bug-like already, considering that templates are so tightly coupled with how data streams are managed.

I think we could potentially surface a warning in the logs as well as via the HeaderWarning internal API, but let's discuss some more off this PR.
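As a reference point, a minimal sketch of surfacing the notice through both channels; it assumes the internal `HeaderWarning.addWarning` API and standard Log4j logging, so treat the exact signatures as assumptions rather than the implementation this thread settled on:

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.common.logging.HeaderWarning;

// Sketch only: HeaderWarning.addWarning is assumed to be the internal API that
// attaches a Warning response header, so the caller of the restore request
// sees the notice, not just whoever reads the server logs.
final class RestoreWarnings {
    private static final Logger logger = LogManager.getLogger(RestoreWarnings.class);

    static void warnRolloverFlagCleared(String dataStreamName) {
        String message = "Restoring data stream ["
            + dataStreamName
            + "] that was marked for lazy rollover without a matching index template; "
            + "the restored data stream will NOT be marked for lazy rollover";
        logger.warn(message);              // operator-facing: server logs
        HeaderWarning.addWarning(message); // client-facing: Warning response header
    }
}
```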

@@ -395,6 +398,7 @@ static DataStream updateLocalDataStream(
     localDataStream.getLifecycle(),
     localDataStream.isFailureStore(),
     localDataStream.getFailureIndices(),
+    localDataStream.rolloverOnWrite(),
nielsbauman (Contributor, Author)

I was a bit on the fence about this one. In theory, localDataStream should already have rolloverOnWrite set to false as per the lines above, which is why I just copied it. But always setting it to false would technically be safer.

gmarouli (Contributor)

Hm, I am curious whether this is something we always want. Maybe it should be handled in the DataStream constructor, either as an assertion or by overriding rolloverOnWrite there. Maybe the invariant has to be replicated == false && rolloverOnWrite. If we decide that a replicated data stream always has rolloverOnWrite set to false, I would prefer that to live in the core data stream code. What do you think?

nielsbauman (Contributor, Author)

I like the idea of doing this.rolloverOnWrite = replicated == false && rolloverOnWrite in the constructor. My only concern right away is how this will work out in randomized tests. That's also why I think an assertion wouldn't be a (good) option here, as it would require all randomized tests to take this requirement into consideration when generating values. I think that we should be fine in randomized tests if we override the rolloverOnWrite flag in the constructor as you suggested. Do you foresee any other potential issues?

gmarouli (Contributor)

Good point. I expect randomised tests to create valid objects: random, but valid. If a certain combination of constructor arguments is not valid, the randomised test should not generate it. Right?

nielsbauman (Contributor, Author)

Ah yeah right, we also have this already in the constructor:

assert system == false || hidden; // system indices must be hidden

Then I'll add an assertion and check/update all constructors (that use randomization).
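A sketch of how that assertion could sit next to the existing one; the trimmed-down constructor and class name are hypothetical, since the real `DataStream` constructor takes many more parameters:

```java
// Sketch mirroring the existing `assert system == false || hidden` pattern;
// not the actual DataStream class.
final class DataStreamInvariantsSketch {
    private final boolean system;
    private final boolean hidden;
    private final boolean replicated;
    private final boolean rolloverOnWrite;

    DataStreamInvariantsSketch(boolean system, boolean hidden, boolean replicated, boolean rolloverOnWrite) {
        assert system == false || hidden; // system indices must be hidden
        // Proposed invariant: replicated data streams cannot be rolled over,
        // so they must never be marked for rollover-on-write.
        assert replicated == false || rolloverOnWrite == false;
        this.system = system;
        this.hidden = hidden;
        this.replicated = replicated;
        this.rolloverOnWrite = rolloverOnWrite;
    }
}
```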

jbaiera (Member)

I'll throw out another corner case based on the one above with restoring a data stream:

If we are going to persist the rollover on write flag for snapshots and restoring, should we do so for CCR? I'm not super well read up on CCR and how frequently we synchronize the data streams between clusters. Is it possible that we end up in a situation where we are promoting a data stream in a disaster recovery case where the data stream should be rolled over on the next write?

My guess is no, since it doesn't look like we require a template on the cluster that is promoting a data stream, and practically speaking you would only want to lazily rollover a data stream if the template has been updated. Plus, if CCR only updates the data stream definition when there's a new index to sync, then we should definitely not allow the flag to be set on the follower.

nielsbauman (Contributor, Author)

For context: we don't allow rolling over a replicated data stream (source), and the docs say:

Whenever the name of a new index on the remote cluster matches the auto-follow pattern, a corresponding follower index is added to the local cluster.
You can also create auto-follow patterns for data streams. When a new backing index is generated on a remote cluster, that index and its data stream are automatically followed if the data stream name matches an auto-follow pattern. If you create a data stream after creating the auto-follow pattern, all backing indices are followed automatically.
The data streams replicated from a remote cluster by CCR are protected from local rollovers. The promote data stream API can be used to turn these data streams into regular data streams.

source

Your scenario is an interesting one, though.

since it doesn't look like we require a template on the cluster that is promoting a data stream

I just had a quick look at the CCR code and it seems like that is indeed the case. That sounds odd to me. There's no point in promoting a DS if you don't have the/a template, right?

gmarouli (Contributor), Apr 15, 2024

@nielsbauman, that's my understanding as well: if you promote the data stream, we end up with a broken data stream, since it cannot roll over anymore (whether by ILM, DSL, or manually). Should we create a health indicator for situations like this?

I never thought about this before because I thought it could not happen, but with these scenarios it's possible. The ILM and DSL health indicators would signal this, but I think that's too late; signalling users as soon as we see this is better.

Thoughts?

gmarouli (Contributor) left a comment

LGTM, thanks for fixing this @nielsbauman. I'll continue the discussion about the combination of the replicated and rolloverOnWrite flags, but I do not see it as blocking this version of the code.


jbaiera (Member) left a comment

I'm also +1 on this change. I posted about possible corner cases in previous comments, but those posts are meant to make sure we're all good with our decisions here; I don't think they are necessarily blockers.

@nielsbauman nielsbauman added auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) auto-backport-and-merge Automatically create backport pull requests and merge when ready labels Apr 8, 2024
@nielsbauman (Contributor, Author)

I'm gonna go ahead and merge (and backport) this when CI passes, as it's been blocking me from continuing with the Builder PR and, therefore, lazy failure store rollover. I'd still like to finish the ongoing discussion(s) on this PR, but we can do that while I work on those other PRs.

@elasticsearchmachine elasticsearchmachine merged commit 887d48d into elastic:main Apr 8, 2024
14 checks passed
@nielsbauman nielsbauman deleted the ds-constructor branch April 8, 2024 10:06
@elasticsearchmachine (Collaborator)

💔 Backport failed

Branch 8.13: Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 107122

nielsbauman added a commit to nielsbauman/elasticsearch that referenced this pull request Apr 8, 2024
…lastic#107122)

A lot of places in the code use a `DataStream` constructor that sets the
`rolloverOnWrite` flag to `false`. For some places, this was
intentional, but for others, this was erroneous (and for most tests, it
didn't matter much).

This PR fixes the erroneous spots and avoids similar unintentional
behavior in the future by removing the constructor in question
altogether. Most use cases just want to copy the flag over and if you
_do_ want to set the flag to false, it makes more sense to do so
explicitly yourself rather than letting the constructor do it for you.

An additional small bonus is that we have one less constructor for the
`DataStream` class :).

Follow up of
[this](elastic#107035 (comment))
discussion.
nielsbauman added a commit that referenced this pull request Apr 8, 2024
…107122) (#107206)
Labels: auto-backport-and-merge (Automatically create backport pull requests and merge when ready) · auto-merge (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!) · >bug · :Data Management/Data streams (Data streams and their lifecycles) · Team:Data Management (Meta label for data/management team) · v8.13.3 · v8.14.0