[SUPPORT] Hudi partitions not dropped by Hive sync after insert_overwrite_table operation #8114

Closed
Limess opened this issue Mar 7, 2023 · 4 comments
Limess commented Mar 7, 2023

Describe the problem you faced

After overwriting a Hudi table in place with the insert_overwrite_table operation, partitions that no longer exist in the new input data are not removed by Hive sync. The stale partitions cause some query engines (e.g. AWS Athena) to fail until they are removed manually.
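For anyone needing the manual cleanup in the meantime, here is a minimal sketch of generating the DDL. The helper name `drop_partition_ddl` and the database and table names are hypothetical, and it assumes the engine accepts standard Hive-style `ALTER TABLE ... DROP IF EXISTS PARTITION` statements with quoted partition values.

```python
def drop_partition_ddl(database, table, partition_col, stale_values):
    """Build one Hive-style DROP PARTITION statement per stale partition value.

    Hypothetical helper: it only builds strings and does not talk to any metastore.
    """
    return [
        f"ALTER TABLE {database}.{table} "
        f"DROP IF EXISTS PARTITION ({partition_col}='{value}')"
        for value in stale_values
    ]


# Example: partition_col=1 was not part of the overwrite, so it is stale.
statements = drop_partition_ddl("my_db", "my_table", "partition_col", ["1"])
for statement in statements:
    print(statement)
```

The statements would still need to be run through whichever client you use (Athena, Hive, Spark SQL); that part is left out on purpose.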

This is on Hudi 0.12.1, but I'm fairly sure this issue still exists on 0.13.0 - this change: #6662 fixes this behaviour for delete_partition operations, but doesn't add any handling for insert_overwrite_table.

I'd be happy to be proven otherwise if this is fixed in 0.13.0. I don't have an environment to test this easily; I would first need to work out how to upgrade Hudi on EMR without a matching EMR release.

To Reproduce

Steps to reproduce the behavior:

  1. Create a new Hudi table using input data with two partitions, e.g. partition_col=1, partition_col=2
  2. Insert into the table using the operation hoodie.datasource.write.operation=insert_overwrite_table with input data containing only one of the two original partitions, e.g. only partition_col=2
  3. Run Hive sync (the result is the same with the Spark writer's built-in sync and with the standalone HiveSyncTool)
  4. Check the Hive partitions. Both partitions still exist.
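The write in step 2 can be sketched roughly as the options below. Only `hoodie.datasource.write.operation=insert_overwrite_table` comes from this report; the table name, database name, field names, the helper `overwrite_table_options`, and the save mode in the trailing comment are assumptions for illustration, and a real Glue or Hive setup will likely need further sync settings.

```python
def overwrite_table_options(table, database, record_key, precombine, partition_col):
    """Assemble Hudi datasource options for an insert_overwrite_table write
    with Hive sync enabled. Hypothetical helper; all values are placeholders."""
    return {
        "hoodie.table.name": table,
        "hoodie.datasource.write.operation": "insert_overwrite_table",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine,
        "hoodie.datasource.write.partitionpath.field": partition_col,
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.database": database,
        "hoodie.datasource.hive_sync.table": table,
        "hoodie.datasource.hive_sync.partition_fields": partition_col,
    }


options = overwrite_table_options(
    table="my_table",
    database="my_db",
    record_key="id",
    precombine="updated_at",
    partition_col="partition_col",
)

# With a SparkSession and a DataFrame `df` that holds only partition_col=2
# (neither is created here), the write would look something like:
# df.write.format("hudi").options(**options).mode("append").save(base_path)
```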

Expected behavior

I'd expect the partition that was not part of the overwrite to be removed from Hive: only partition_col=2 should remain, and partition_col=1 should be dropped.
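Put as a set difference, the expectation looks like the sketch below. This is only an illustration of the desired end state using the example partitions from the steps above; the function name `stale_partitions` is hypothetical and it does not reflect how HiveSyncTool is actually implemented.

```python
def stale_partitions(synced, written):
    """Partitions registered in the metastore that the overwrite did not write."""
    return sorted(set(synced) - set(written))


synced_before = ["partition_col=1", "partition_col=2"]  # state after the initial load
written_by_overwrite = ["partition_col=2"]              # input of insert_overwrite_table

to_drop = stale_partitions(synced_before, written_by_overwrite)
expected_after = sorted(set(synced_before) - set(to_drop))
```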

Environment Description

  • Hudi version : 0.12.1

  • Spark version : 3.3.1

  • Hive version : AWS Glue

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : no

Additional context

Running on EMR 0.6.9

@danny0405 added the feature-enquiry and meta-sync labels Mar 8, 2023
@danny0405
Contributor

Thanks for the feedback, guess you are right, this should be supported

@zhaobangcai

Has this problem been solved? @Limess

@danny0405
Contributor

cc @codope guess this should have been fixed? #6662

@codope
Member

codope commented Apr 29, 2024

Yes, this was fixed in 0.13.0.

@codope closed this as completed Apr 29, 2024