
Avoid unintentionally clearing the DataStream.rolloverOnWrite flag #107122

Merged: 8 commits merged into elastic:main from nielsbauman:ds-constructor on Apr 8, 2024

Conversation

nielsbauman (Contributor)

Many places in the code use a `DataStream` constructor that sets the `rolloverOnWrite` flag to `false`. In some places this was intentional, but in others it was erroneous (and for most tests, it didn't matter much).

This PR fixes the erroneous spots and avoids similar unintentional behavior in the future by removing the constructor in question altogether. Most use cases just want to copy the flag over, and if you do want to set the flag to `false`, it makes more sense to do so explicitly yourself rather than letting the constructor do it for you.

An additional small bonus is that we have one less constructor for the `DataStream` class :).

Follow up of this discussion.
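To illustrate the pattern the PR moves to, here is a minimal, hypothetical sketch; the real `DataStream` constructor takes many more arguments, and `copyWithName` is an invented helper, not the actual API:

```java
// A minimal, hypothetical sketch -- the real DataStream class has far more
// fields and constructor parameters than shown here.
public final class DataStream {
    private final String name;
    private final boolean rolloverOnWrite;

    public DataStream(String name, boolean rolloverOnWrite) {
        this.name = name;
        this.rolloverOnWrite = rolloverOnWrite;
    }

    public boolean rolloverOnWrite() {
        return rolloverOnWrite;
    }

    // Before: a convenience constructor silently defaulted the flag to false,
    // e.g. DataStream(String name) { this(name, false); } -- now removed.
    // After: code that derives a new instance copies the flag explicitly,
    // so it can no longer be cleared by accident.
    public DataStream copyWithName(String newName) {
        return new DataStream(newName, this.rolloverOnWrite); // copy, don't reset
    }
}
```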

@nielsbauman nielsbauman added >bug :Data Management/Data streams Data streams and their lifecycles Team:Data Management Meta label for data/management team v8.14.0 v8.13.2 labels Apr 4, 2024
@elasticsearchmachine (Collaborator)
Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine (Collaborator)

Hi @nielsbauman, I've created a changelog YAML for you.

@@ -1791,7 +1791,8 @@ public ClusterState execute(ClusterState currentState) throws Exception {
     original.getLifecycle(),
     original.isFailureStore(),
     original.getFailureIndices(),
-    null
+    original.rolloverOnWrite(),
+    original.getAutoShardingEvent()
nielsbauman (Contributor, Author)

There are a couple of changes like this, where I also copy the auto-sharding event rather than passing null (only in tests). From what I could see, none of these tests actually cared about the auto-sharding event, so we should just copy the original value. This will also simplify the use of the Builder class that's being worked on in another PR.

@@ -810,6 +796,7 @@ public DataStream snapshot(Collection<String> indicesInSnapshot) {
     lifecycle,
     failureStore,
     failureIndices,
+    rolloverOnWrite,
nielsbauman (Contributor, Author)

I wasn't 100% sure about the snapshot and restore ones (see the change in the RestoreService.java). I don't immediately see a reason why we wouldn't want to copy the flag when snapshotting or restoring, so I decided to keep it in (rather than overriding it to false). Let me know if I'm missing something here.

gmarouli (Contributor)

I am also leaning towards copying it. If this flag was added because something changed in the index template, I think it's safer to apply the rollover at the next write to ensure we have the latest index template data.

jbaiera (Member)

If we restore a data stream under a different name that no longer matches the template, we won't be able to roll it over. We currently document this as a concern: https://www.elastic.co/guide/en/elasticsearch/reference/current/restore-snapshot-api.html#restore-snapshot-api-prereqs

Restoring the data stream without a template poses a maintenance problem, but if we restore a data stream in that way with the rollover on write flag set, will it cause us to not be able to write to the data stream at all until it is remedied? That would take that corner case from being just a maintenance problem to a possible availability problem.

I don't think this is a blocker for merging this, but I think we may want to make sure that kind of interaction is what we expect it to be and note it as a "gotcha" in the documentation.

nielsbauman (Contributor, Author)

Ah that's an excellent point. I don't immediately see an easy/sensible way to avoid that "gotcha" while keeping the rolloverOnWrite flag set to true (I see some ways, but they're either very limiting in the snapshot/restore process or they require elaborate changes in the rollover process, both of which I'm not a big fan of right away). I think I'm still on board with copying the flag, but I agree that we should make this "gotcha" (more) clear in the documentation.

nielsbauman (Contributor, Author)

I wanted to include the doc change in this PR, see 59b66a7. Let me know if either of you had anything else in mind. (fingers crossed that this renders like I intended)

gmarouli (Contributor)

Hm, very good point:

That would take that corner case from being just a maintenance problem to a possible availability problem.

I think this is not a very good user experience.

What if we checked whether there is a valid template at the moment of restoration? We have the cluster state available at: https://github.com/elastic/elasticsearch/blob/882b92ab6092a8ea6b53a64e90cb12e812241483/server/src/main/java/org/elasticsearch/snapshots/RestoreService.java#L1409

If there is a template, we copy rolloverOnWrite; if not, we set it to false. We could even add an info log line when restoring a data stream without a template, so we can debug this.

If you think this is too complex, then I am in favour of not copying the flag and giving the user the recommendation to roll over if they want to use the current index template.

What are your thoughts?
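For illustration, a self-contained sketch of that idea; the helper names and the plain map standing in for the cluster state's template metadata are hypothetical, not the actual `RestoreService` code:

```java
import java.util.Map;

// Hypothetical, self-contained sketch of the proposed restore-time check; the
// real RestoreService consults the cluster state's index template metadata,
// for which a plain pattern -> template-name map stands in here.
final class RestoreRolloverCheck {

    /** Decides the rolloverOnWrite value for the data stream being restored. */
    static boolean restoredRolloverOnWrite(
        Map<String, String> templatesByIndexPattern,
        String dataStreamName,
        boolean originalRolloverOnWrite
    ) {
        boolean templateExists = templatesByIndexPattern.keySet().stream()
            .anyMatch(pattern -> matches(pattern, dataStreamName));
        if (templateExists) {
            // A template exists, so a pending lazy rollover can still succeed:
            // keep whatever the snapshotted data stream had.
            return originalRolloverOnWrite;
        }
        // No matching template: clear the flag so the first write is not blocked
        // by an un-satisfiable rollover, and leave a breadcrumb for debugging.
        System.out.printf("INFO restoring data stream [%s] without a matching "
            + "index template; clearing rolloverOnWrite%n", dataStreamName);
        return false;
    }

    // Deliberately simplified glob matching: only a trailing '*' is supported.
    private static boolean matches(String pattern, String name) {
        return pattern.endsWith("*")
            ? name.startsWith(pattern.substring(0, pattern.length() - 1))
            : pattern.equals(name);
    }
}
```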

nielsbauman (Contributor, Author)

If there is a template, we copy rolloverOnWrite; if not, we set it to false. We could even add an info log line when restoring a data stream without a template, so we can debug this.

Ah, I like this idea. I don't have an issue with the complexity myself, but I am trying to figure out what impact this would have on the UX as well. Ideally, I'd like to make this clear to the user in some way other than a log line (to hopefully reduce SDHs coming our way about this).

If we do not copy the rolloverOnWrite flag, and a user creates a matching template afterwards (with a different mapping or whatever), we won't automatically roll over because it isn't considered a "change" (source). I guess this would be a less "blocking" flow than the current behaviour in this PR, but I'm trying to come up with a way to make this clear to the user. Maybe we could add a warn log line along the lines of:

Restoring a data stream that was marked for lazy rollover, without a matching index template. The restored data stream will NOT be marked for lazy rollover, and might thus not include the latest changes in the index template that caused the original data stream to be marked for rollover.

jbaiera (Member)

I think the fact that you can restore a data stream without a template is vaguely bug-like already, considering that templates are so tightly coupled with how data streams are managed.

I think we could potentially surface a warning in the logs as well as via the HeaderWarning internal API, but let's discuss some more off this PR.
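As a reference point, a minimal sketch of surfacing the notice through both channels; it assumes the internal `HeaderWarning.addWarning` API and standard Log4j logging, so treat the exact signatures as assumptions rather than the implementation this thread settled on:

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.common.logging.HeaderWarning;

// Sketch only: HeaderWarning.addWarning is assumed to be the internal API that
// attaches a Warning response header, so the caller of the restore request
// sees the notice, not just whoever reads the server logs.
final class RestoreWarnings {
    private static final Logger logger = LogManager.getLogger(RestoreWarnings.class);

    static void warnRolloverFlagCleared(String dataStreamName) {
        String message = "Restoring data stream ["
            + dataStreamName
            + "] that was marked for lazy rollover without a matching index template; "
            + "the restored data stream will NOT be marked for lazy rollover";
        logger.warn(message);              // operator-facing: server logs
        HeaderWarning.addWarning(message); // client-facing: Warning response header
    }
}
```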

@@ -395,6 +398,7 @@ static DataStream updateLocalDataStream(
     localDataStream.getLifecycle(),
     localDataStream.isFailureStore(),
     localDataStream.getFailureIndices(),
+    localDataStream.rolloverOnWrite(),
nielsbauman (Contributor, Author)

I was a bit on the fence about this one. In theory, localDataStream should already have rolloverOnWrite set to false as per the lines above, which is why I just copied it. But always setting it to false would technically be safer.

gmarouli (Contributor)

Hm, I am curious whether this is something we always want. Maybe it should be handled in the DataStream constructor, either as an assertion or by overriding rolloverOnWrite there. Maybe the invariant has to be replicated == false && rolloverOnWrite. If we decide that a replicated data stream always has rolloverOnWrite set to false, I would prefer that to live in the core data stream code. What do you think?

nielsbauman (Contributor, Author)

I like the idea of doing this.rolloverOnWrite = replicated == false && rolloverOnWrite in the constructor. My only concern right away is how this will work out in randomized tests. That's also why I think an assertion wouldn't be a (good) option here, as it would require all randomized tests to take this requirement into consideration when generating values. I think that we should be fine in randomized tests if we override the rolloverOnWrite flag in the constructor as you suggested. Do you foresee any other potential issues?

gmarouli (Contributor)

Good point. I expect randomised tests to create valid objects: random, but valid. If a certain combination of constructor arguments is not valid, the randomised test should not generate it. Right?

nielsbauman (Contributor, Author)

Ah yeah right, we also have this already in the constructor:

assert system == false || hidden; // system indices must be hidden

Then I'll add an assertion and check/update all constructors (that use randomization).
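A sketch of how that assertion could sit next to the existing one; the trimmed-down constructor and class name are hypothetical, since the real `DataStream` constructor takes many more parameters:

```java
// Sketch mirroring the existing `assert system == false || hidden` pattern;
// not the actual DataStream class.
final class DataStreamInvariantsSketch {
    private final boolean system;
    private final boolean hidden;
    private final boolean replicated;
    private final boolean rolloverOnWrite;

    DataStreamInvariantsSketch(boolean system, boolean hidden, boolean replicated, boolean rolloverOnWrite) {
        assert system == false || hidden; // system indices must be hidden
        // Proposed invariant: replicated data streams cannot be rolled over,
        // so they must never be marked for rollover-on-write.
        assert replicated == false || rolloverOnWrite == false;
        this.system = system;
        this.hidden = hidden;
        this.replicated = replicated;
        this.rolloverOnWrite = rolloverOnWrite;
    }
}
```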

jbaiera (Member)

I'll throw out another corner case based on the one above with restoring a data stream:

If we are going to persist the rollover on write flag for snapshots and restoring, should we do so for CCR? I'm not super well read up on CCR and how frequently we synchronize the data streams between clusters. Is it possible that we end up in a situation where we are promoting a data stream in a disaster recovery case where the data stream should be rolled over on the next write?

My guess is no, since it doesn't look like we require a template on the cluster that is promoting a data stream, and practically speaking you would only want to lazily rollover a data stream if the template has been updated. Plus, if CCR only updates the data stream definition when there's a new index to sync, then we should definitely not allow the flag to be set on the follower.

nielsbauman (Contributor, Author)

For context: we don't allow rolling over a replicated data stream (source), and the docs say:

Whenever the name of a new index on the remote cluster matches the auto-follow pattern, a corresponding follower index is added to the local cluster.
You can also create auto-follow patterns for data streams. When a new backing index is generated on a remote cluster, that index and its data stream are automatically followed if the data stream name matches an auto-follow pattern. If you create a data stream after creating the auto-follow pattern, all backing indices are followed automatically.
The data streams replicated from a remote cluster by CCR are protected from local rollovers. The promote data stream API can be used to turn these data streams into regular data streams.

source

Your scenario is an interesting one, though.

since it doesn't look like we require a template on the cluster that is promoting a data stream

I just had a quick look at the CCR code and it seems like that is indeed the case. That sounds odd to me. There's no point in promoting a DS if you don't have the/a template, right?

gmarouli (Contributor), Apr 15, 2024

@nielsbauman, that's my understanding as well: if you promote the data stream, we end up with a broken data stream, since it cannot roll over anymore (whether by ILM, DSL, or manually). Should we create a health indicator for situations like this?

I never thought about this before because I thought it could not happen, but with these scenarios it's possible. The ILM and DSL health indicators would signal this, but I think that's too late; signalling users as soon as we see this is better.

Thoughts?

gmarouli (Contributor) left a comment

LGTM, thanks for fixing this @nielsbauman. I'll continue the discussion about the combination of the replicated and rolloverOnWrite flags, but I do not see it as blocking this version of the code.


jbaiera (Member) left a comment

I'm also +1 on this change. I posted about possible corner cases in previous comments, but those posts are meant to make sure we're all good with our decisions here; I don't think they are necessarily blockers.

@nielsbauman nielsbauman added auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) auto-backport-and-merge Automatically create backport pull requests and merge when ready labels Apr 8, 2024
@nielsbauman (Contributor, Author)

I'm gonna go ahead and merge (and backport) this when CI passes, as it's been blocking me from continuing with the Builder PR and, therefore, lazy failure store rollover. I'd still like to finish the ongoing discussion(s) on this PR, but we can do that while I work on those other PRs.

@elasticsearchmachine elasticsearchmachine merged commit 887d48d into elastic:main Apr 8, 2024
14 checks passed
@nielsbauman nielsbauman deleted the ds-constructor branch April 8, 2024 10:06
@elasticsearchmachine (Collaborator)

💔 Backport failed

Branch 8.13: Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 107122

nielsbauman added a commit to nielsbauman/elasticsearch that referenced this pull request Apr 8, 2024
…lastic#107122)

A lot of places in the code use a `DataStream` constructor that sets the
`rolloverOnWrite` flag to `false`. For some places, this was
intentional, but for others, this was erroneous (and for most tests, it
didn't matter much).

This PR fixes the erroneous spots and avoids similar unintentional
behavior in the future by removing the constructor in question
altogether. Most use cases just want to copy the flag over and if you
_do_ want to set the flag to false, it makes more sense to do so
explicitly yourself rather than letting the constructor do it for you.

An additional small bonus is that we have one less constructor for the
`DataStream` class :).

Follow up of
[this](elastic#107035 (comment))
discussion.
nielsbauman added a commit that referenced this pull request Apr 8, 2024
…107122) (#107206)
Labels: auto-backport-and-merge (Automatically create backport pull requests and merge when ready) · auto-merge (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!) · >bug · :Data Management/Data streams (Data streams and their lifecycles) · Team:Data Management (Meta label for data/management team) · v8.13.3 · v8.14.0