Core: Prevent dropping column which is referenced by active partition… #10352
Conversation
@@ -533,6 +537,34 @@ private static Schema applyChanges(
      }
    }

    Map<Integer, List<Integer>> specToDeletes = Maps.newHashMap();
This is a mapping from a spec to the requested field deletes that are referenced as part of that spec.
It's a reverse mapping, which makes it easier to surface the error details for the case where a field is part of an active partition spec. Probably needs a better name though.
Maybe put this in a separate checkNotDeletingColumnsInActiveSpecs method.
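For illustration, a minimal sketch of what such a helper could look like; the signature, the TableMetadata/Collection parameters, and the use of Preconditions here are assumptions, not the actual patch:

```java
// Hypothetical helper sketch: reject deletes of columns that any known spec
// still references as a partition source field.
private static void checkNotDeletingColumnsInActiveSpecs(
    TableMetadata base, Collection<Integer> deletes) {
  for (PartitionSpec spec : base.specs()) {
    for (PartitionField field : spec.fields()) {
      Preconditions.checkArgument(
          !deletes.contains(field.sourceId()),
          "Cannot delete field %s: partition field %s in spec %s references it",
          field.sourceId(), field.name(), spec.specId());
    }
  }
}
```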
We encountered a similar issue. There might be another option which simplifies this schema update logic, namely:
- Prevent dropping any source column if it's used in any spec in the table's specById
- Introduce the RemoveUnusedSpec from https://github.com/apache/iceberg/pull/3462/files
In that way, SchemaUpdate only needs to check specById instead of reading all the manifest files.
To actually drop an unused partition source field, the user should make sure all the data files written with that spec are rewritten or removed, then call RemoveUnusedSpec to drop that spec.
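A rough sketch of the check under that proposal, consulting only the specs recorded in table metadata rather than reading manifests (identifiers are illustrative, not the actual patch):

```java
// Illustrative sketch: block the delete if any spec tracked in the table metadata
// (the specsById map) uses the column as a partition source field. No manifest reads.
for (int fieldIdToDelete : deletes) {
  for (PartitionSpec spec : base.specs()) {
    boolean referenced =
        spec.fields().stream().anyMatch(field -> field.sourceId() == fieldIdToDelete);
    Preconditions.checkArgument(
        !referenced,
        "Cannot delete field %s: it is a partition source field in spec %s",
        fieldIdToDelete, spec.specId());
  }
}
```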
I think that's a better option @advancedxy. I was having a Slack discussion with someone on removing unused partition specs and unused schemas and was thinking the same thing about this PR. I think there's a legitimate need for those operations, and we could leverage that need here.
With this approach, to drop a column which used to be referenced in a partition spec, a user must first call RemovedUnusedPartitionSpec to remove any such partition specs (this procedure would have to go through the manifests), and only then can they perform the column drop. This seems better because then we don't have potential manifest reads on the schema evolution path, which I can see being problematic for folks.
One aspect is the ordering of these. There are two ways of going about this:
1.) Prevent dropping of columns which are part of the specs, and get in the RemovedUnusedPartitionSpec procedure after that. The downside of this approach is that there are cases where a user could not drop a partition column even though they should be able to. Here's a trivial case: imagine a user just creates the table and they don't write any data. Then they realize they want to drop a partition column, but the drop will fail unexpectedly whereas before it would work. The benefit of this approach, though, is that we will generally prevent users from shooting themselves in the foot, as seen by the reported issues for this.
2.) First get in the RemovedUnusedPartitionSpec procedure and then prevent dropping if the column is part of a spec. The downside of this is that it may take some more time to get the whole API in? Maybe not much more time since it looks mostly there, but I'd have to check. Another downside is that until the procedure is in, users may end up in bad states. The benefit of this approach is that users have a way of dropping historical partition columns that would otherwise be unreachable once the behavior change lands.
Right now, approach 1 seems better to me; even though there are some behavior changes, it seems net better for users.
cc @Fokko @RussellSpitzer @aokolnychyi @rdblue @nastra in case they have any comments.
One aspect is on ordering of these. There are two ways of going about this:
I don't think the ordering of getting these two pieces of functionality into master really matters that much, as long as they are both merged before the next release (assuming we are targeting the Iceberg 1.7 release). I imagine customers should be using officially released versions.
Here's a trivial case: Imagine a user just creates the table and they don't write any data. Then they realize they want to drop a partition column, but the procedure will fail unexpectedly whereas before it would work.
If the prevention of dropping active partition source fields and RemoveUnusedPartitionSpec have both landed, it's still possible to drop the partition column on a just-created table, with some additional but necessary steps:
- first remove the unwanted partition field, which will create a new PartitionSpec
- call RemoveUnusedPartitionSpec, which should be able to remove the previous, wrong partition spec
- finally, remove the unwanted partition source field
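A sketch of those steps with the Java Table API; removeUnusedSpecs() below is only a placeholder for whatever the spec-removal API from PR 3462 ends up being called, and the column name is made up:

```java
// 1. Evolve the spec so the unwanted partition field is no longer in the default spec.
table.updateSpec().removeField("wrong_partition_col").commit();

// 2. Drop the previous, now-unused spec. removeUnusedSpecs() is a placeholder for
//    the API proposed in https://github.com/apache/iceberg/pull/3462.
table.removeUnusedSpecs().commit();

// 3. With no remaining spec referencing the column, the schema change is allowed.
table.updateSchema().deleteColumn("wrong_partition_col").commit();
```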
First get in the RemovedUnusedPartitionSpec procedure and then prevent dropping if it's part of a spec. The downside of this is, it may take some more time in to get the whole API in?
I think we can parallelize these two PRs if others agree that this is the right direction. BTW, I can help work on https://github.com/apache/iceberg/pull/3462/files to get it merged in case @RussellSpitzer is busy and cannot work on it at the moment.
@advancedxy That's reasonable, we can do the two independently. Cool, I discussed this offline with @RussellSpitzer; feel free to go for it and carry forward the PR for removing historical partition specs!
Thanks for the update, filed #10755, please take a look.
Force-pushed from e0e0223 to 184d434
for (int fieldIdToDelete : deletes) {
  for (PartitionSpec spec : base.specs()) {
    if (spec.schema().findField(fieldIdToDelete) != null) {
      List<Integer> deletesForSpec =
          specToDeletes.computeIfAbsent(spec.specId(), k -> Lists.newArrayList());
      deletesForSpec.add(fieldIdToDelete);
      specToDeletes.put(spec.specId(), deletesForSpec);
    }
  }
I think we only need to do this logic if the current snapshot is not null, so we probably want to re-organize the logic around that.
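Something like the following shape, for instance (a sketch, assuming base is the TableMetadata in scope):

```java
// Skip the spec/manifest validation entirely when the table has no current snapshot:
// nothing has been written yet, so no manifest can reference the old specs.
if (base.currentSnapshot() != null) {
  // ... build specToDeletes and check for manifests referencing active partition specs ...
}
```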
        .findAny();
    Preconditions.checkArgument(
        !manifestReferencingActivePartition.isPresent(),
        "Cannot delete field %s as it is used by an active partition spec %s",
Probably use the field name in the error message instead of the ID so it's more useful to a user
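For example, roughly like this (a sketch only; schema, fieldIdToDelete, and manifestReferencingActivePartition are assumed from the PR's surrounding context, with the latter taken to hold a ManifestFile):

```java
if (manifestReferencingActivePartition.isPresent()) {
  // Resolve the human-readable column name so the error is actionable for the user.
  Types.NestedField field = schema.findField(fieldIdToDelete);
  String columnName = field != null ? field.name() : String.valueOf(fieldIdToDelete);
  throw new IllegalArgumentException(
      String.format(
          "Cannot delete column %s: it is used by an active partition spec %s",
          columnName, manifestReferencingActivePartition.get().partitionSpecId()));
}
```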
Force-pushed from 184d434 to 3e454d2
Force-pushed from 3e454d2 to c62b649
@amogh-jahagirdar are you still working on this? It seems like a serious bug.
Sorry about the delay on this, I got busy and forgot I had this open! I've seen more issue reports related to this, so I'm going to prioritize it.
Well understood, and thanks for your effort on this.
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
Fixes #10234