[SUPPORT] PartialUpdateAvroPayload still overwriting undefined columns as NULL #11726

@joelwalden

Describe the problem you faced

I'm trying to use PartialUpdateAvroPayload with Hudi 0.15.0 in AWS Glue. With this payload class, columns that are missing from the incoming record are overwritten with NULL in the table instead of being left unchanged.

To Reproduce

Steps to reproduce the behavior:

  1. Create a Hudi table with the schema:
StructType([
    StructField('_hoodie_commit_time', StringType(), True),
    StructField('_hoodie_commit_seqno', StringType(), True),
    StructField('_hoodie_record_key', StringType(), True),
    StructField('_hoodie_partition_path', StringType(), True),
    StructField('_hoodie_file_name', StringType(), True),
    StructField('_sdc_batched_at', StringType(), True),
    StructField('_sdc_received_at', StringType(), True),
    StructField('_sdc_record_hash', StringType(), True),
    StructField('_sdc_sequence', StringType(), True),
    StructField('_sdc_table_version', StringType(), True),
    StructField('amount_foreign_linked', StringType(), True),
    StructField('amount_linked', StringType(), True),
    StructField('applied_date_posted', StringType(), True),
    StructField('applied_transaction_id', StringType(), True),
    StructField('applied_transaction_line_id', StringType(), True),
    StructField('date_last_modified', StringType(), True),
    StructField('discount', StringType(), True),
    StructField('inventory_number', StringType(), True),
    StructField('link_type', StringType(), True),
    StructField('link_type_code', StringType(), True),
    StructField('original_date_posted', StringType(), True),
    StructField('original_transaction_id', StringType(), True),
    StructField('original_transaction_line_id', StringType(), True),
    StructField('quantity_linked', StringType(), True),
    StructField('filename', StringType(), True),
    StructField('is_deleted', StringType(), True),
    StructField('date_deleted', StringType(), True),
    StructField('deleted_record_name', StringType(), True),
    StructField('deleted_record_base_type', StringType(), True),
    StructField('partition_key', StringType(), True)
])
  2. Send a payload such as the following:
Row(applied_transaction_id='10556594', applied_transaction_line_id='1', original_transaction_id='6496794', original_transaction_line_id='0', link_type_code='Payment', date_deleted='2023-01-31T00:18:24.478441Z', is_deleted='True', date_last_modified='2023-01-31T00:18:24.478441Z', partition_key='649', filename='')
  3. Set the Hudi options to:
{'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
'hoodie.datasource.write.operation': 'upsert',
'hoodie.datasource.write.hive_style_partitioning': 'true',
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
'hoodie.datasource.write.reconcile.schema': 'false',
'hoodie.datasource.write.recordkey.field': 'applied_transaction_id,applied_transaction_line_id,original_transaction_id,original_transaction_line_id,link_type_code',
'hoodie.datasource.write.precombine.field': 'date_last_modified',
'hoodie.parquet.compression.codec': 'gzip',
'hoodie.write.concurrency.mode': 'single_writer',
'hoodie.clean.automatic': 'true',
'hoodie.clean.async': 'true',
'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
'hoodie.cleaner.commits.retained': 15,
'hoodie.cleaner.policy.failed.writes': 'EAGER',
'hoodie.index.type': 'GLOBAL_BLOOM',
'hoodie.bloom.index.update.partition.path': 'false',
'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.PartialUpdateAvroPayload',
'hoodie.compaction.payload.class': 'org.apache.hudi.common.model.PartialUpdateAvroPayload',
'hoodie.payload.ordering.field': 'date_last_modified',
'write.precombine': 'true',
'payload.class': 'org.apache.hudi.common.model.PartialUpdateAvroPayload',
'hoodie.write.set.null.for.missing.columns': 'true',
'hoodie.datasource.write.partitionpath.field': 'partition_key',
'hoodie.datasource.hive_sync.partition_fields': 'partition_key',
'hoodie.table.name': 'netsuite_transaction_links',
'hoodie.datasource.hive_sync.table': 'netsuite_transaction_links'}
  4. Observe that all fields not set in the payload are set to NULL.
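To make the last step concrete, the following plain-Python sketch (no Spark or Hudi involved) lists which data columns of the table the example payload leaves unset. The column names are copied from the schema and Row above, with the `_hoodie_*` meta columns omitted; `missing_columns` is a hypothetical helper used only for illustration.

```python
# Data columns of the table, in schema order (Hudi meta columns omitted).
TABLE_COLUMNS = [
    "_sdc_batched_at", "_sdc_received_at", "_sdc_record_hash",
    "_sdc_sequence", "_sdc_table_version", "amount_foreign_linked",
    "amount_linked", "applied_date_posted", "applied_transaction_id",
    "applied_transaction_line_id", "date_last_modified", "discount",
    "inventory_number", "link_type", "link_type_code",
    "original_date_posted", "original_transaction_id",
    "original_transaction_line_id", "quantity_linked", "filename",
    "is_deleted", "date_deleted", "deleted_record_name",
    "deleted_record_base_type", "partition_key",
]

# The example payload from the reproduction steps, as a plain dict.
PAYLOAD = {
    "applied_transaction_id": "10556594",
    "applied_transaction_line_id": "1",
    "original_transaction_id": "6496794",
    "original_transaction_line_id": "0",
    "link_type_code": "Payment",
    "date_deleted": "2023-01-31T00:18:24.478441Z",
    "is_deleted": "True",
    "date_last_modified": "2023-01-31T00:18:24.478441Z",
    "partition_key": "649",
    "filename": "",
}


def missing_columns(table_columns, payload):
    """Return the table columns the payload does not set, in schema order."""
    return [column for column in table_columns if column not in payload]


missing = missing_columns(TABLE_COLUMNS, PAYLOAD)
for column in missing:
    print(column)
```

These are the columns that end up as NULL after the upsert in this report, for example `amount_linked` and `quantity_linked`.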

Expected behavior

Existing values in columns that are missing from the payload are not overwritten with NULL. Note that this problem occurs whether hoodie.write.set.null.for.missing.columns is set to true or false.
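The expected partial-update semantics can be sketched in plain Python as follows. This is only an illustration of the behavior described above, not Hudi's implementation: `partial_update_merge` is a hypothetical function, and the stored values in `existing` (such as `amount_linked='100.00'`) are made up for the example.

```python
def partial_update_merge(existing, incoming):
    """Overlay incoming values on the stored record, skipping None.

    A None in `incoming` is treated as "column not provided", so the
    stored value is kept instead of being replaced by NULL.
    """
    merged = dict(existing)
    for column, value in incoming.items():
        if value is not None:
            merged[column] = value
    return merged


# Hypothetical stored record (values invented for illustration).
existing = {
    "applied_transaction_id": "10556594",
    "amount_linked": "100.00",
    "is_deleted": "False",
    "date_deleted": None,
}

# Incoming record after missing columns have been filled with None.
incoming = {
    "applied_transaction_id": "10556594",
    "amount_linked": None,
    "is_deleted": "True",
    "date_deleted": "2023-01-31T00:18:24.478441Z",
}

merged = partial_update_merge(existing, incoming)
print(merged)
```

Under these semantics `amount_linked` keeps its stored value while `is_deleted` and `date_deleted` take the incoming values. The reported behavior instead leaves `amount_linked` as NULL.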

Environment Description

  • Hudi version : 0.15.0

  • Spark version : 3.3.0-amzn-1

  • Hive version :

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : no
