Describe the problem you faced

After overwriting a Hudi table in place using `insert_overwrite_table`, partitions which no longer exist in the new input data are not removed by the Hive sync. This causes some query engines (e.g. AWS Athena) to fail until the old partitions are removed manually.

This is on Hudi 0.12.1, but I'm fairly sure the issue still exists on 0.13.0 - #6662 fixes this behaviour for `delete_partition` operations, but doesn't add any handling for `insert_overwrite_table`.

I'd be happy to be proven otherwise if this is fixed in 0.13.0 - I don't have an environment to easily test it without working out how to upgrade Hudi on EMR without a release.
To Reproduce

Steps to reproduce the behavior:

1. Create a new Hudi table using input data with two partitions, e.g. `partition_col=1` and `partition_col=2`
2. Insert into the table with `hoodie.datasource.write.operation=insert_overwrite_table`, using input data containing only one of the original partitions, e.g. only `partition_col=2`
3. Run the Hive sync (the stale partition is not removed with either the Spark writer's sync or a standalone `HiveSyncTool` run)
4. Check the Hive partitions: both partitions still exist
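The steps above can be sketched in PySpark. This is a hedged repro sketch, not a verified script: it assumes a Spark session with the Hudi bundle on the classpath and HMS-based Hive sync available; the table name, record key field, and S3 path are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Common Hudi write options; table/path names are placeholders.
hudi_opts = {
    "hoodie.table.name": "overwrite_test",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "partition_col",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",
}
table_path = "s3://my-bucket/tables/overwrite_test"  # placeholder

# Step 1: create the table with two partitions.
df = spark.createDataFrame([(1, 1), (2, 2)], ["id", "partition_col"])
(df.write.format("hudi").options(**hudi_opts)
    .option("hoodie.datasource.write.operation", "insert")
    .mode("overwrite").save(table_path))

# Step 2: overwrite the whole table with data for partition_col=2 only.
df2 = spark.createDataFrame([(3, 2)], ["id", "partition_col"])
(df2.write.format("hudi").options(**hudi_opts)
    .option("hoodie.datasource.write.operation", "insert_overwrite_table")
    .mode("append").save(table_path))

# Steps 3/4: after the sync runs, SHOW PARTITIONS on the Hive/Glue table
# still lists partition_col=1, even though the table no longer contains it.
```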
Expected behavior

I'd expect the partition which was not inserted to be removed, i.e. only `partition_col=2` exists and `partition_col=1` is deleted.
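The expected sync behaviour amounts to a set difference: any partition registered in the metastore but absent from the table after the overwrite should be dropped. A minimal sketch (the function name is hypothetical, not a Hudi API):

```python
def partitions_to_drop(metastore_partitions, table_partitions):
    """Partitions registered in the metastore but no longer in the table."""
    return sorted(set(metastore_partitions) - set(table_partitions))

# After insert_overwrite_table wrote only partition_col=2:
print(partitions_to_drop(
    ["partition_col=1", "partition_col=2"],  # what the metastore has
    ["partition_col=2"],                     # what the table now contains
))
# → ['partition_col=1']
```

Until the sync does this, the stale partition can be dropped manually, e.g. with the Glue `delete_partition` API via boto3, which is what we do today as a workaround.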
Environment Description
Hudi version : 0.12.1
Spark version : 3.3.1
Hive version : AWS Glue
Hadoop version :
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no
Additional context
Running on Amazon EMR 6.9.0