-
Notifications
You must be signed in to change notification settings - Fork 24.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid unintentionally clearing the DataStream.rolloverOnWrite
flag
#107122
Conversation
Pinging @elastic/es-data-management (Team:Data Management) |
Hi @nielsbauman, I've created a changelog YAML for you. |
@@ -1791,7 +1791,8 @@ public ClusterState execute(ClusterState currentState) throws Exception { | |||
original.getLifecycle(), | |||
original.isFailureStore(), | |||
original.getFailureIndices(), | |||
null | |||
original.rolloverOnWrite(), | |||
original.getAutoShardingEvent() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a couple of changes like this, where I also copy the auto-sharding event rather than passing null (only in tests). From what I could see, none of these tests actually cared about the auto-sharding event, so we should just copy the original value. This will also simplify the use of the Builder
class that's being worked on in another PR.
@@ -810,6 +796,7 @@ public DataStream snapshot(Collection<String> indicesInSnapshot) { | |||
lifecycle, | |||
failureStore, | |||
failureIndices, | |||
rolloverOnWrite, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't 100% sure about the snapshot and restore ones (see the change in the RestoreService.java
). I don't immediately see a reason why we wouldn't want to copy the flag when snapshotting or restoring, so I decided to keep it in (rather than overriding it to false
). Let me know if I'm missing something here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am leaning also into copying it. If this flag was added because something changes in the index template, I think it's safer to apply the rollover at the next write to ensure we have the latest index template data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we restore a data stream as a different name that no longer matches the template we wont be able to roll it over. We document that this as a concern currently: https://www.elastic.co/guide/en/elasticsearch/reference/current/restore-snapshot-api.html#restore-snapshot-api-prereqs
Restoring the data stream without a template poses a maintenance problem, but if we restore a data stream in that way with the rollover on write flag set, will it cause us to not be able to write to the data stream at all until it is remedied? That would take that corner case from being just a maintenance problem to a possible availability problem.
I don't think this is a blocker for merging this, but I think we may want to make sure that kind of interaction is what we expect it to be and note it as a "gotcha" in the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah that's an excellent point. I don't immediately see an easy/sensible way to avoid that "gotcha" while keeping the rolloverOnWrite
flag set to true (I see some ways, but they're either very limiting in the snapshot/restore process or they require elaborate changes in the rollover process, both of which I'm not a big fan of right away). I think I'm still on board with copying the flag, but I agree that we should make this "gotcha" (more) clear in the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to include the doc change in this PR, see 59b66a7. Let me know if either of you had anything else in mind. (fingers crossed that this renders like I intended)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, very good point:
That would take that corner case from being just a maintenance problem to a possible availability problem.
I think this is not very good user experience.
What if we checked if there is a valid template at the moment of the restoration, we have the cluster state available at: https://github.com/elastic/elasticsearch/blob/882b92ab6092a8ea6b53a64e90cb12e812241483/server/src/main/java/org/elasticsearch/snapshots/RestoreService.java#L1409
If there is a template we copy the rolloverOnWrite
, if not we set it to false
. We could even add an info log line restoring a data stream without a template, so we can debug this.
If you think this is too complex, then I am in favour of not copying the flag, and give the user the recommendation to rollover if they want to use the current index template.
What are your thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is a template we copy the
rolloverOnWrite
, if not we set it tofalse
. We could even add an info log line restoring a data stream without a template, so we can debug this.
Ah I like this idea. I don't have an issue with the complexity myself, but I am trying to figure out what impact this would have on the UX as well. Ideally, I think I'd like to make this clear to the user in some way other than a log line as well (to hopefully reduce SDHs coming our way about this).
If we do not copy the rolloverOnWrite
flag, and a user creates a matching template afterwards (with a different mapping or w/e), we won't automatically roll over because it isn't considered a "change" (source). I guess this would be a less "blocking" flow than the current behaviour in this PR, but I'm trying to come up with a way to make this clear to the user. Maybe we could add a warn
log line along the lines of
Restoring a data stream that was marked for lazy rollover, without a matching index template. The restored data stream will NOT be marked for lazy rollover, and might thus not include the latest changes in the index template that caused the original data stream to be marked for rollover.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the fact that you can restore a data stream without a template is vaguely bug-like already considering that the templates are so tightly coupled with how data streams are managed.
I think we could potentially surface a warning in the logs as well as via the HeaderWarning
internal API, but lets discuss some more off this PR.
@@ -395,6 +398,7 @@ static DataStream updateLocalDataStream( | |||
localDataStream.getLifecycle(), | |||
localDataStream.isFailureStore(), | |||
localDataStream.getFailureIndices(), | |||
localDataStream.rolloverOnWrite(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was a bit on the fence about this one. In theory, localDataStream
should already have rolloverOnWrite
set to false
as per the lines above, so that's why I just copied it. But, always setting it to false
would technically be safer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I am curious if this is something we always want. Maybe it should be in the DataStream
constructor, either as an assertion or also when setting the rolloverOnWrite
. Maybe there is has to be replicated == false && rolloverOnWrite
. What do you think? If we make the decision that a replicated data stream always has the rolloverOnWrite
false, I would prefer it to be in a the core data stream code. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of doing this.rolloverOnWrite = replicated == false && rolloverOnWrite
in the constructor. My only concern right away is how this will work out in randomized tests. That's also why I think an assertion wouldn't be a (good) option here, as it would require all randomized tests to take this requirement into consideration when generating values. I think that we should be fine in randomized tests if we override the rolloverOnWrite
flag in the constructor as you suggested. Do you foresee any other potential issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I expect randomised tests to create valid objects, random but valid. If a certain combination of constructor arguments is not valid, I think the randomised test should generate it. Right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah right, we also have this already in the constructor:
elasticsearch/server/src/main/java/org/elasticsearch/cluster/metadata/DataStream.java
Line 180 in a28de96
assert system == false || hidden; // system indices must be hidden |
Then I'll add an assertion and check/update all constructors (that use randomization).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll throw out another corner case based on the one above with restoring a data stream:
If we are going to persist the rollover on write flag for snapshots and restoring, should we do so for CCR? I'm not super well read up on CCR and how frequently we synchronize the data streams between clusters. Is it possible that we end up in a situation where we are promoting a data stream in a disaster recovery case where the data stream should be rolled over on the next write?
My guess is no, since it doesn't look like we require a template on the cluster that is promoting a data stream, and practically speaking you would only want to lazily rollover a data stream if the template has been updated. Plus, if CCR only updates the data stream definition when there's a new index to sync, then we should definitely not allow the flag to be set on the follower.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For context: we don't allow rolling over a replicated data stream (source) and
Whenever the name of a new index on the remote cluster matches the auto-follow pattern, a corresponding follower index is added to the local cluster.
You can also create auto-follow patterns for data streams. When a new backing index is generated on a remote cluster, that index and its data stream are automatically followed if the data stream name matches an auto-follow pattern. If you create a data stream after creating the auto-follow pattern, all backing indices are followed automatically.
The data streams replicated from a remote cluster by CCR are protected from local rollovers. The promote data stream API can be used to turn these data streams into regular data streams.
Your scenario is an interesting one, though.
since it doesn't look like we require a template on the cluster that is promoting a data stream
I just had a quick look at the CCR code and it seems like that is indeed the case. That sounds odd to me. There's no point in promoting a DS if you don't have the/a template, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nielsbauman , that's my understanding as well, if you promote the data stream then we end up with a broken data stream since it cannot rollover anymore (either by ILM, DSL or manually). Should we create a health indicator for stuff like this?
I never thought about this before because I thought this could not happen, but with these scenarios it's possible. ILM & DSL health indicators would signal this, but I think this is too late. Signalling users as soon as we see this is better I think.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for fixing this @nielsbauman . I continue the discussion about the combo of replicated
and rolloverOnWrite
flags but I do not see this as blocking this version of the code.
@@ -810,6 +796,7 @@ public DataStream snapshot(Collection<String> indicesInSnapshot) { | |||
lifecycle, | |||
failureStore, | |||
failureIndices, | |||
rolloverOnWrite, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am leaning also into copying it. If this flag was added because something changes in the index template, I think it's safer to apply the rollover at the next write to ensure we have the latest index template data.
@@ -395,6 +398,7 @@ static DataStream updateLocalDataStream( | |||
localDataStream.getLifecycle(), | |||
localDataStream.isFailureStore(), | |||
localDataStream.getFailureIndices(), | |||
localDataStream.rolloverOnWrite(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I am curious if this is something we always want. Maybe it should be in the DataStream
constructor, either as an assertion or also when setting the rolloverOnWrite
. Maybe there is has to be replicated == false && rolloverOnWrite
. What do you think? If we make the decision that a replicated data stream always has the rolloverOnWrite
false, I would prefer it to be in a the core data stream code. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also +1 on this change - I posted about possible corner cases on previous comments but those posts are meant to make sure we're all good with our decisions here. I also don't think those comments are necessarily blockers.
I'm gonna go ahead and merge (and backport) this (when the CI passes), as it's been blocking me to continue with the |
💔 Backport failed
You can use sqren/backport to manually backport by running |
…lastic#107122) A lot of places in the code use a `DataStream` constructor that sets the `rolloverOnWrite` flag to `false`. For some places, this was intentional, but for others, this was erroneous (and for most tests, it didn't matter much). This PR fixes the erroneous spots and avoids similar unintentional behavior in the future by removing the constructor in question altogether. Most use cases just want to copy the flag over and if you _do_ want to set the flag to false, it makes more sense to do so explicitly yourself rather than letting the constructor do it for you. An additional small bonus is that we have one less constructor for the `DataStream` class :). Follow up of [this](elastic#107035 (comment)) discussion.
…107122) (#107206) A lot of places in the code use a `DataStream` constructor that sets the `rolloverOnWrite` flag to `false`. For some places, this was intentional, but for others, this was erroneous (and for most tests, it didn't matter much). This PR fixes the erroneous spots and avoids similar unintentional behavior in the future by removing the constructor in question altogether. Most use cases just want to copy the flag over and if you _do_ want to set the flag to false, it makes more sense to do so explicitly yourself rather than letting the constructor do it for you. An additional small bonus is that we have one less constructor for the `DataStream` class :). Follow up of [this](#107035 (comment)) discussion.
A lot of places in the code use a
DataStream
constructor that sets therolloverOnWrite
flag tofalse
. For some places, this was intentional, but for others, this was erroneous (and for most tests, it didn't matter much).This PR fixes the erroneous spots and avoids similar unintentional behavior in the future by removing the constructor in question altogether. Most use cases just want to copy the flag over and if you do want to set the flag to false, it makes more sense to do so explicitly yourself rather than letting the constructor do it for you.
An additional small bonus is that we have one less constructor for the
DataStream
class :).Follow up of this discussion.