Skip to content

[HUDI-5317] Fix insert overwrite table for partitioned table#8015

Closed
stream2000 wants to merge 3 commits intoapache:masterfrom
stream2000:fix_bulk_insert_fail_for_v2_table
Closed

[HUDI-5317] Fix insert overwrite table for partitioned table#8015
stream2000 wants to merge 3 commits intoapache:masterfrom
stream2000:fix_bulk_insert_fail_for_v2_table

Conversation

@stream2000
Copy link
Contributor

@stream2000 stream2000 commented Feb 22, 2023

Change Logs

fix #7365, when hoodie.sql.bulk.insert.enable = true and hoodie.schema.on.read.enable=true, insert overwrite on non-partition table will fail because the save mode is set Append not Overwrite.

After this pr both insert overwrite table and insert overwrite partition can not use bulk insert

Impact

both insert overwrite table and insert overwrite partition can not use bulk insert

Risk level (write none, low medium or high below)

low

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@stream2000 stream2000 force-pushed the fix_bulk_insert_fail_for_v2_table branch from f0dcc31 to 997d2a9 Compare February 22, 2023 08:56
@stream2000 stream2000 force-pushed the fix_bulk_insert_fail_for_v2_table branch from 997d2a9 to 8aa97ff Compare February 22, 2023 08:58
@boneanxs
Copy link
Contributor

boneanxs commented Feb 23, 2023

@stream2000 Hi, I'm quite confused with the behavior here, for insert_overwrite_table with bulk_insert enabled, we'll directly drop the data first, and then do the write operation, this breaks the ACID transactions, and the weird thing is this only happens when we enable bulk_insert.

if (mode == SaveMode.Overwrite && tableExists && operation != WriteOperationType.INSERT_OVERWRITE_TABLE) {
        // When user set operation as INSERT_OVERWRITE_TABLE,
        // overwrite will use INSERT_OVERWRITE_TABLE operator in doWriteOperation
        log.warn(s"hoodie table at $tablePath already exists. Deleting existing data & overwriting with new data.")
        fs.delete(tablePath, true)
        tableExists = false
      }

while for insert_overwrite, we don't support bulk_insert, and will keep the old files.

I'm thinking we should firstly don't allow for bulk_insert for insert_overwrite and insert_overwrite_table, after we can fully support bulk_insert for insert_overwrite and insert_overwrite_table(which means keep the overwritten data if using bulk_insert to keep the consistent behavior), then we can consider to have tests to cover this.

cc @alexeykudinkin

@stream2000
Copy link
Contributor Author

@stream2000 Hi, I'm quite confused with the behavior here, for insert_overwrite_table with bulk_insert enabled, we'll directly drop the data first, and then do the write operation, this breaks the ACID transactions, and the weird thing is this only happens when we enable bulk_insert.

if (mode == SaveMode.Overwrite && tableExists && operation != WriteOperationType.INSERT_OVERWRITE_TABLE) {
        // When user set operation as INSERT_OVERWRITE_TABLE,
        // overwrite will use INSERT_OVERWRITE_TABLE operator in doWriteOperation
        log.warn(s"hoodie table at $tablePath already exists. Deleting existing data & overwriting with new data.")
        fs.delete(tablePath, true)
        tableExists = false
      }

while for insert_overwrite, we don't support bulk_insert, and will keep the old files.

I'm thinking we should firstly don't allow for bulk_insert for insert_overwrite and insert_overwrite_table, after we can fully support bulk_insert for insert_overwrite and insert_overwrite_table(which means keep the overwritten data if using bulk_insert to keep the consistent behavior), then we can consider to have tests to cover this.

cc @alexeykudinkin

I agree that we should unify the semantics between insert overwrite partition and insert overwrite table when bulk_insert is enabled. I will push a commit later that directly throw exception when bulk_insert is enabled in insert overwrite table/partittion.

cc @leesf

@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

3 participants