[SUPPORT] Bucket index with inserts fails to insert to pre-existing file groups #9742

@jmnatzaganian

Description

Describe the problem you faced

When using the bucket index in insert-only mode, data is only written if it targets a new file group. New records targeting a pre-existing file group are silently dropped instead of inserted.

The settings below configure an insert-only operation, with the expectation that a record whose key already exists is not inserted, while records with new keys are:

"hoodie.datasource.write.operation": "insert",

"hoodie.sql.insert.mode": "strict",
"hoodie.datasource.write.insert.drop.duplicates": True,
"hoodie.datasource.write.payload.class": "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
"hoodie.merge.allow.duplicate.on.inserts": False,
"hoodie.combine.before.insert": True,
"hoodie.payload.ordering.field": "ts",

To Reproduce

See the attached script hudi_bucket_ix_issue.py. The output is here.

Expected behavior

See the attached script. The simple index shows the expected behavior: duplicates are dropped and new records are inserted.
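The expected insert-with-drop-duplicates semantics can be modeled in plain Python. The record key field `key` and ordering field `ts` are placeholder names for illustration:

```python
def insert_dropping_duplicates(table, batch):
    """Model the expected behavior: a record whose key already exists
    in the table is dropped; records with new keys are inserted.

    `table` maps record key -> record; `batch` is a list of records.
    """
    for rec in batch:
        if rec["key"] not in table:
            table[rec["key"]] = rec  # new record key: insert
        # existing key: record is dropped, existing row is untouched
    return table

existing = {"a": {"key": "a", "ts": 1}}
result = insert_dropping_duplicates(
    existing,
    [{"key": "a", "ts": 2},   # duplicate key: should be dropped
     {"key": "b", "ts": 1}],  # new key: should be inserted
)
# "a" keeps its original record; "b" is newly inserted
```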

Environment Description

  • Hudi version: 0.13.1
  • Spark version: 3.1.1
  • Hive version: N/A
  • Hadoop version:
  • Storage (HDFS/S3/GCS..): Local
  • Running on Docker? (yes/no): No

Metadata

Labels: type:feature (New features and enhancements)
Status: ✅ Done