
Core: Prevent dropping column which is referenced by active partition… #10352

Conversation

@amogh-jahagirdar (Contributor) commented May 18, 2024

Fixes #10234

@github-actions github-actions bot added the core label May 18, 2024
@@ -533,6 +537,34 @@ private static Schema applyChanges(
}
}

Map<Integer, List<Integer>> specToDeletes = Maps.newHashMap();
@amogh-jahagirdar (Contributor, Author):

This is a mapping from a spec ID to the requested delete fields that are referenced by that spec.
It's a reverse mapping, which makes it easier to surface the error details when the offending spec is an active partition spec. It probably needs a better name, though.

@amogh-jahagirdar (Contributor, Author):

Maybe put this in a separate checkNotDeletingColumnsInActiveSpecs method.
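
For illustration, a rough sketch of what such an extracted helper might look like with the manifest-reading approach in this PR; the signature, the way the current snapshot's manifests are supplied, and the error wording are assumptions for the sketch, not the PR's actual code:

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.Schema;

class ActiveSpecDeleteCheck {
  // Hypothetical extraction of the validation. Assumes the caller passes in the
  // current snapshot's manifests and the specToDeletes mapping built above
  // (spec id -> field ids requested for deletion that the spec references).
  static void checkNotDeletingColumnsInActiveSpecs(
      Schema schema, List<ManifestFile> currentManifests, Map<Integer, List<Integer>> specToDeletes) {
    for (ManifestFile manifest : currentManifests) {
      List<Integer> deleted = specToDeletes.get(manifest.partitionSpecId());
      if (deleted != null && !deleted.isEmpty()) {
        String names =
            deleted.stream().map(schema::findColumnName).collect(Collectors.joining(", "));
        throw new IllegalArgumentException(
            String.format(
                "Cannot delete field(s) %s: spec %s is still used by live manifest %s",
                names, manifest.partitionSpecId(), manifest.path()));
      }
    }
  }
}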

@advancedxy (Contributor):

We encountered a similar issue. There might be another option which simplifies this schema update logic, namely:

  1. Prevent dropping any source column if it's used in any spec in the table's specsById
  2. Introduce the RemoveUnusedSpec operation from https://github.com/apache/iceberg/pull/3462/files

That way, SchemaUpdate only needs to check specsById instead of reading all the manifest files.

To actually drop an unused partition source field, the user should make sure all the data files written with that spec have been rewritten or removed, and then call RemoveUnusedSpec to drop that spec.
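
As a rough sketch of option 1 (metadata-only validation), assuming the check has access to the table metadata; the method name and error wording here are illustrative, not an existing API:

import org.apache.iceberg.PartitionField;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.TableMetadata;

class PartitionSourceDeleteCheck {
  // Illustrative metadata-only check: reject the delete if any spec tracked in
  // the table metadata (specsById) partitions on the column. No manifests are read.
  static void validateNotPartitionSource(TableMetadata base, int fieldIdToDelete) {
    for (PartitionSpec spec : base.specs()) {
      for (PartitionField field : spec.fields()) {
        if (field.sourceId() == fieldIdToDelete) {
          throw new IllegalArgumentException(
              String.format(
                  "Cannot delete column %s: it is a partition source in spec %s",
                  spec.schema().findColumnName(fieldIdToDelete), spec.specId()));
        }
      }
    }
  }
}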

@amogh-jahagirdar (Contributor, Author) commented Jul 12, 2024:

I think that's a better option, @advancedxy. I was having a Slack discussion with someone about removing unused partition specs and unused schemas and was thinking the same thing about this PR. I think there's a legitimate need for those operations, and we could leverage that need here.

With this approach, to drop a column which used to be referenced in a partition spec, a user must first call RemoveUnusedPartitionSpec to remove any such partition specs (this procedure would have to go through the manifests), and only then can they perform the column drop. This seems better because we then don't need potential manifest reads on the schema evolution path, which I can see being problematic for folks.

One aspect is the ordering of these changes. There are two ways of going about this:

1.) Prevent dropping columns which are part of any spec first, and get the RemoveUnusedPartitionSpec procedure in after that. The downside of this approach is that there are cases where a user could not drop a partition column even though they should be able to. Here's a trivial case: imagine a user just creates the table and doesn't write any data. Then they realize they want to drop a partition column, but the drop will now fail unexpectedly whereas before it would work. The benefit of this approach, though, is that we generally prevent users from shooting themselves in the foot, as seen in the reported issues for this.

2.) First get the RemoveUnusedPartitionSpec procedure in, and then prevent dropping a column if it's part of a spec. The downside of this is that it may take some more time to get the whole API in? Maybe not much more, since it looks mostly there, but I'd have to check. Another downside is that until the procedure is in, users may still end up in bad states. The benefit of this approach is that users have a way of dropping historical partition columns, which would otherwise become unreachable once the behavior change lands.

Right now, approach 1 seems better to me; even though it introduces some behavior changes, it seems net better for users.

cc @Fokko @RussellSpitzer @aokolnychyi @rdblue @nastra in case they have any comments.

@advancedxy (Contributor):

One aspect is the ordering of these changes. There are two ways of going about this:

I don't think the ordering of getting these two pieces of functionality into master really matters that much, as long as they are both merged before the next release (assuming we are talking about the Iceberg 1.7 release). I imagine customers should be using officially released versions.

Here's a trivial case: imagine a user just creates the table and doesn't write any data. Then they realize they want to drop a partition column, but the drop will now fail unexpectedly whereas before it would work.

If preventing drops of active partition source fields and RemoveUnusedPartitionSpec both land, it is still possible to handle the just-created table, with some additional but necessary steps:

  1. First remove the unwanted partition field, which will create a new PartitionSpec.
  2. Call RemoveUnusedPartitionSpec, which should be able to remove the previous, mistakenly created partition spec.
  3. Drop the now-unreferenced partition source column (see the sketch below).
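
A sketch of that flow with the Java Table API, assuming both changes land. Step 2 refers to the RemoveUnusedSpec operation proposed in https://github.com/apache/iceberg/pull/3462/files, which is not an existing API, so it only appears as a comment:

import org.apache.iceberg.Table;

class DropMistakenPartitionColumn {
  // Rough sketch of the three-step flow for an empty, just-created table.
  static void dropMistakenPartitionColumn(Table table, String column) {
    // 1. Remove the unwanted partition field; this creates a new current spec.
    table.updateSpec().removeField(column).commit();

    // 2. Run the proposed RemoveUnusedSpec operation (not yet in the API) to
    //    drop the old spec, which no data files reference on an empty table.

    // 3. Now no spec references the column, so the drop is allowed.
    table.updateSchema().deleteColumn(column).commit();
  }
}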

First get the RemoveUnusedPartitionSpec procedure in, and then prevent dropping a column if it's part of a spec. The downside of this is that it may take some more time to get the whole API in?

I think we can parallelize these two PRs if others agree that's the right direction. BTW, I can help work on https://github.com/apache/iceberg/pull/3462/files to get it merged in case @RussellSpitzer is busy and cannot work on it at the moment.

@amogh-jahagirdar (Contributor, Author):

@advancedxy That's reasonable, we can do the two independently. Cool, I discussed this offline with @RussellSpitzer; feel free to go ahead with carrying forward the PR for removing historical partition specs!

@advancedxy (Contributor):

Thanks for the update. I filed #10755, please take a look.

@amogh-jahagirdar (Contributor, Author):

[Screenshot from 2024-05-18 11-11-56]
Needs tests, but I did some local testing with Spark.

@amogh-jahagirdar force-pushed the prevent-dropping-column-active-spec branch from e0e0223 to 184d434 on May 18, 2024 17:14
Comment on lines 542 to 551
for (int fieldIdToDelete : deletes) {
for (PartitionSpec spec : base.specs()) {
if (spec.schema().findField(fieldIdToDelete) != null) {
List<Integer> deletesForSpec =
specToDeletes.computeIfAbsent(spec.specId(), k -> Lists.newArrayList());
deletesForSpec.add(fieldIdToDelete);
specToDeletes.put(spec.specId(), deletesForSpec);
}
}
@amogh-jahagirdar (Contributor, Author):

I think we only need this logic if the current snapshot is not null, so we probably want to reorganize the code around that.

.findAny();
Preconditions.checkArgument(
!manifestReferencingActivePartition.isPresent(),
"Cannot delete field %s as it is used by an active partition spec %s",
@amogh-jahagirdar (Contributor, Author):

We should probably use the field name in the error message instead of the ID so it's more useful to the user.
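
For instance, something along these lines (just a sketch; the variables mirror the snippet above and the exact message text is illustrative):

import org.apache.iceberg.Schema;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

class FieldNameInError {
  // Sketch: resolve the field id to a column name so the error is readable.
  static void checkNotActivePartitionSource(
      Schema schema, int fieldIdToDelete, boolean referencedByActiveSpec) {
    Preconditions.checkArgument(
        !referencedByActiveSpec,
        "Cannot delete column %s (id %s): it is used by an active partition spec",
        schema.findColumnName(fieldIdToDelete),
        fieldIdToDelete);
  }
}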

@amogh-jahagirdar force-pushed the prevent-dropping-column-active-spec branch from 184d434 to 3e454d2 on May 18, 2024 19:25
@advancedxy (Contributor) left a comment:

@amogh-jahagirdar are you still working on this? It seems like a serious bug.


@amogh-jahagirdar (Contributor, Author):

Sorry about the delay on this, got busy and forgot I had this open! I've seen more related issue reports to this, so I'm going to prioritize it.

@advancedxy (Contributor):

Sorry about the delay on this, got busy and forgot I had this open! I've seen more related issue reports to this, so I'm going to prioritize it.

Well understood, and thanks for your effort on this.


github-actions bot commented Nov 3, 2024

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Nov 3, 2024

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Nov 11, 2024
Linked issue #10234: Spark: Dropping partition column from old partition table corrupts entire table