
Core: Prevent dropping column which is referenced by active partition… #10352

Conversation

@amogh-jahagirdar (Contributor) commented May 18, 2024

Fixes #10234

@github-actions github-actions bot added the core label May 18, 2024
@@ -533,6 +537,34 @@ private static Schema applyChanges(
}
}

Map<Integer, List<Integer>> specToDeletes = Maps.newHashMap();
@amogh-jahagirdar (Contributor, Author):

This is a mapping from a spec ID to the requested delete fields that are referenced by that spec.
It's a reverse mapping, which makes it easier to surface the error details when the offending spec is an active partition spec. It probably needs a better name, though.

@amogh-jahagirdar (Contributor, Author):

Maybe put this in a separate checkNotDeletingColumnsInActiveSpecs method.
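
For illustration, a rough sketch of what such an extracted helper might look like with the manifest-reading approach in this PR; the signature, the way the current snapshot's manifests are supplied, and the error wording are assumptions for the sketch, not the PR's actual code:

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.Schema;

class ActiveSpecDeleteCheck {
  // Hypothetical extraction of the validation. Assumes the caller passes in the
  // current snapshot's manifests and the specToDeletes mapping built above
  // (spec id -> field ids requested for deletion that the spec references).
  static void checkNotDeletingColumnsInActiveSpecs(
      Schema schema, List<ManifestFile> currentManifests, Map<Integer, List<Integer>> specToDeletes) {
    for (ManifestFile manifest : currentManifests) {
      List<Integer> deleted = specToDeletes.get(manifest.partitionSpecId());
      if (deleted != null && !deleted.isEmpty()) {
        String names =
            deleted.stream().map(schema::findColumnName).collect(Collectors.joining(", "));
        throw new IllegalArgumentException(
            String.format(
                "Cannot delete field(s) %s: spec %s is still used by live manifest %s",
                names, manifest.partitionSpecId(), manifest.path()));
      }
    }
  }
}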

@advancedxy (Contributor):

We encountered a similar issue. There might be another option which simplifies this schema update logic, namely:

  1. Prevent dropping any source column if it's used in any spec in the table's specsById
  2. Introduce the RemoveUnusedSpec operation from https://github.com/apache/iceberg/pull/3462/files

That way, SchemaUpdate only needs to check specsById instead of reading all the manifest files.

To actually drop an unused partition source field, the user should make sure all the data files written with that spec have been rewritten or removed, and then call RemoveUnusedSpec to drop that spec.
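
As a rough sketch of option 1 (metadata-only validation), assuming the check has access to the table metadata; the method name and error wording here are illustrative, not an existing API:

import org.apache.iceberg.PartitionField;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.TableMetadata;

class PartitionSourceDeleteCheck {
  // Illustrative metadata-only check: reject the delete if any spec tracked in
  // the table metadata (specsById) partitions on the column. No manifests are read.
  static void validateNotPartitionSource(TableMetadata base, int fieldIdToDelete) {
    for (PartitionSpec spec : base.specs()) {
      for (PartitionField field : spec.fields()) {
        if (field.sourceId() == fieldIdToDelete) {
          throw new IllegalArgumentException(
              String.format(
                  "Cannot delete column %s: it is a partition source in spec %s",
                  spec.schema().findColumnName(fieldIdToDelete), spec.specId()));
        }
      }
    }
  }
}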

@amogh-jahagirdar (Contributor, Author) commented Jul 12, 2024:

I think that's a better option, @advancedxy. I was having a Slack discussion with someone about removing unused partition specs and unused schemas and was thinking the same thing about this PR. I think there's a legitimate need for those operations, and we could leverage that need here.

With this approach, to drop a column which used to be referenced in a partition spec, a user must first call RemoveUnusedPartitionSpec to remove any such partition specs (this procedure would have to go through the manifests), and only then can they perform the column drop. This seems better because we then don't need potential manifest reads on the schema evolution path, which I can see being problematic for folks.

One aspect is the ordering of these changes. There are two ways of going about this:

1.) Prevent dropping columns which are part of any spec first, and get the RemoveUnusedPartitionSpec procedure in after that. The downside of this approach is that there are cases where a user could not drop a partition column even though they should be able to. Here's a trivial case: imagine a user just creates the table and doesn't write any data. Then they realize they want to drop a partition column, but the drop will now fail unexpectedly whereas before it would work. The benefit of this approach, though, is that we generally prevent users from shooting themselves in the foot, as seen in the reported issues for this.

2.) First get the RemoveUnusedPartitionSpec procedure in, and then prevent dropping a column if it's part of a spec. The downside of this is that it may take some more time to get the whole API in? Maybe not much more, since it looks mostly there, but I'd have to check. Another downside is that until the procedure is in, users may still end up in bad states. The benefit of this approach is that users have a way of dropping historical partition columns, which would otherwise become unreachable once the behavior change lands.

Right now, approach 1 seems better to me; even though it introduces some behavior changes, it seems net better for users.

cc @Fokko @RussellSpitzer @aokolnychyi @rdblue @nastra in case they have any comments.

@advancedxy (Contributor):

One aspect is the ordering of these changes. There are two ways of going about this:

I don't think the ordering of getting these two pieces of functionality into master really matters that much, as long as they are both merged before the next release (assuming we are talking about the Iceberg 1.7 release). I imagine customers should be using officially released versions.

Here's a trivial case: imagine a user just creates the table and doesn't write any data. Then they realize they want to drop a partition column, but the drop will now fail unexpectedly whereas before it would work.

If preventing drops of active partition source fields and RemoveUnusedPartitionSpec both land, it is still possible to handle the just-created table, with some additional but necessary steps:

  1. First remove the unwanted partition field, which will create a new PartitionSpec.
  2. Call RemoveUnusedPartitionSpec, which should be able to remove the previous, mistakenly created partition spec.
  3. Drop the now-unreferenced partition source column (see the sketch below).
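
A sketch of that flow with the Java Table API, assuming both changes land. Step 2 refers to the RemoveUnusedSpec operation proposed in https://github.com/apache/iceberg/pull/3462/files, which is not an existing API, so it only appears as a comment:

import org.apache.iceberg.Table;

class DropMistakenPartitionColumn {
  // Rough sketch of the three-step flow for an empty, just-created table.
  static void dropMistakenPartitionColumn(Table table, String column) {
    // 1. Remove the unwanted partition field; this creates a new current spec.
    table.updateSpec().removeField(column).commit();

    // 2. Run the proposed RemoveUnusedSpec operation (not yet in the API) to
    //    drop the old spec, which no data files reference on an empty table.

    // 3. Now no spec references the column, so the drop is allowed.
    table.updateSchema().deleteColumn(column).commit();
  }
}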

First get the RemoveUnusedPartitionSpec procedure in, and then prevent dropping a column if it's part of a spec. The downside of this is that it may take some more time to get the whole API in?

I think we can parallelize these two PRs if others agree that's the right direction. BTW, I can help work on https://github.com/apache/iceberg/pull/3462/files to get it merged in case @RussellSpitzer is busy and cannot work on it at the moment.

@amogh-jahagirdar (Contributor, Author):

@advancedxy That's reasonable, we can do the two independently. Cool, I discussed this offline with @RussellSpitzer; feel free to go ahead with carrying forward the PR for removing historical partition specs!

@advancedxy (Contributor):

Thanks for the update. I filed #10755, please take a look.

@amogh-jahagirdar (Contributor, Author):

[Screenshot from 2024-05-18 11-11-56]
Needs tests, but I did some local testing with Spark.

@amogh-jahagirdar force-pushed the prevent-dropping-column-active-spec branch from e0e0223 to 184d434 on May 18, 2024 17:14
Comment on lines 542 to 551
for (int fieldIdToDelete : deletes) {
for (PartitionSpec spec : base.specs()) {
if (spec.schema().findField(fieldIdToDelete) != null) {
List<Integer> deletesForSpec =
specToDeletes.computeIfAbsent(spec.specId(), k -> Lists.newArrayList());
deletesForSpec.add(fieldIdToDelete);
specToDeletes.put(spec.specId(), deletesForSpec);
}
}
@amogh-jahagirdar (Contributor, Author):

I think we only need this logic if the current snapshot is not null, so we probably want to reorganize the code around that.

.findAny();
Preconditions.checkArgument(
!manifestReferencingActivePartition.isPresent(),
"Cannot delete field %s as it is used by an active partition spec %s",
@amogh-jahagirdar (Contributor, Author):

We should probably use the field name in the error message instead of the ID so it's more useful to the user.
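
For instance, something along these lines (just a sketch; the variables mirror the snippet above and the exact message text is illustrative):

import org.apache.iceberg.Schema;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

class FieldNameInError {
  // Sketch: resolve the field id to a column name so the error is readable.
  static void checkNotActivePartitionSource(
      Schema schema, int fieldIdToDelete, boolean referencedByActiveSpec) {
    Preconditions.checkArgument(
        !referencedByActiveSpec,
        "Cannot delete column %s (id %s): it is used by an active partition spec",
        schema.findColumnName(fieldIdToDelete),
        fieldIdToDelete);
  }
}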

@amogh-jahagirdar force-pushed the prevent-dropping-column-active-spec branch from 184d434 to 3e454d2 on May 18, 2024 19:25
@advancedxy (Contributor) left a comment:

@amogh-jahagirdar are you still working on this? It seems like a serious bug.


@amogh-jahagirdar (Contributor, Author):

Sorry about the delay on this, got busy and forgot I had this open! I've seen more related issue reports to this, so I'm going to prioritize it.

@advancedxy (Contributor):

Sorry about the delay on this, got busy and forgot I had this open! I've seen more related issue reports to this, so I'm going to prioritize it.

Well understood, and thanks for your effort on this.


github-actions bot commented Nov 3, 2024

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Nov 3, 2024

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Nov 11, 2024
Linked issue #10234: Spark: Dropping partition column from old partition table corrupts entire table