Description
Describe the problem you faced
I'm using PartialUpdateAvroPayload with Hudi 0.15.0 in AWS Glue. On upsert, columns that are null in the incoming record overwrite the existing values in the table with NULL instead of being skipped.
To Reproduce
Steps to reproduce the behavior:
- Create Hudi table with schema:
StructType([
    StructField('_hoodie_commit_time', StringType(), True),
    StructField('_hoodie_commit_seqno', StringType(), True),
    StructField('_hoodie_record_key', StringType(), True),
    StructField('_hoodie_partition_path', StringType(), True),
    StructField('_hoodie_file_name', StringType(), True),
    StructField('_sdc_batched_at', StringType(), True),
    StructField('_sdc_received_at', StringType(), True),
    StructField('_sdc_record_hash', StringType(), True),
    StructField('_sdc_sequence', StringType(), True),
    StructField('_sdc_table_version', StringType(), True),
    StructField('amount_foreign_linked', StringType(), True),
    StructField('amount_linked', StringType(), True),
    StructField('applied_date_posted', StringType(), True),
    StructField('applied_transaction_id', StringType(), True),
    StructField('applied_transaction_line_id', StringType(), True),
    StructField('date_last_modified', StringType(), True),
    StructField('discount', StringType(), True),
    StructField('inventory_number', StringType(), True),
    StructField('link_type', StringType(), True),
    StructField('link_type_code', StringType(), True),
    StructField('original_date_posted', StringType(), True),
    StructField('original_transaction_id', StringType(), True),
    StructField('original_transaction_line_id', StringType(), True),
    StructField('quantity_linked', StringType(), True),
    StructField('filename', StringType(), True),
    StructField('is_deleted', StringType(), True),
    StructField('date_deleted', StringType(), True),
    StructField('deleted_record_name', StringType(), True),
    StructField('deleted_record_base_type', StringType(), True),
    StructField('partition_key', StringType(), True),
])
- Upsert a payload such as the following, which sets only some of the columns:
Row(applied_transaction_id='10556594', applied_transaction_line_id='1', original_transaction_id='6496794', original_transaction_line_id='0', link_type_code='Payment', date_deleted='2023-01-31T00:18:24.478441Z', is_deleted='True', date_last_modified='2023-01-31T00:18:24.478441Z', partition_key='649', filename='')
- Set Hudi options to:
{'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
'hoodie.datasource.write.operation': 'upsert',
'hoodie.datasource.write.hive_style_partitioning': 'true',
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
'hoodie.datasource.write.reconcile.schema': 'false',
'hoodie.datasource.write.recordkey.field': 'applied_transaction_id,applied_transaction_line_id,original_transaction_id,original_transaction_line_id,link_type_code',
'hoodie.datasource.write.precombine.field': 'date_last_modified',
'hoodie.parquet.compression.codec': 'gzip',
'hoodie.write.concurrency.mode': 'single_writer',
'hoodie.clean.automatic': 'true',
'hoodie.clean.async': 'true',
'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
'hoodie.cleaner.commits.retained': 15,
'hoodie.cleaner.policy.failed.writes': 'EAGER',
'hoodie.index.type': 'GLOBAL_BLOOM',
'hoodie.bloom.index.update.partition.path': 'false',
'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.PartialUpdateAvroPayload',
'hoodie.compaction.payload.class': 'org.apache.hudi.common.model.PartialUpdateAvroPayload',
'hoodie.payload.ordering.field': 'date_last_modified',
'write.precombine': 'true',
'payload.class': 'org.apache.hudi.common.model.PartialUpdateAvroPayload',
'hoodie.write.set.null.for.missing.columns': 'true',
'hoodie.datasource.write.partitionpath.field': 'partition_key',
'hoodie.datasource.hive_sync.partition_fields': 'partition_key',
'hoodie.table.name': 'netsuite_transaction_links',
'hoodie.datasource.hive_sync.table': 'netsuite_transaction_links'}
- Observe that every field not set in the payload is overwritten with NULL in the table.
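To make the reproduction concrete, here is a minimal plain-Python sketch of why the incoming record carries nulls at all: once the partial payload is aligned to the table's columns, every column it does not set ends up as None. The helper `align_to_schema` is hypothetical (it is not a Hudi or Spark API), and only a subset of the table's columns is used to keep the example short.

```python
# Subset of the table's columns, taken from the schema in this report.
TABLE_COLUMNS = [
    "applied_transaction_id",
    "applied_transaction_line_id",
    "amount_linked",
    "discount",
    "is_deleted",
    "date_last_modified",
    "partition_key",
]


def align_to_schema(payload, columns):
    """Hypothetical helper: return a full-width record.

    Columns that the payload does not set become None.
    """
    return {name: payload.get(name) for name in columns}


# A few of the fields from the payload in the reproduction steps.
payload = {
    "applied_transaction_id": "10556594",
    "applied_transaction_line_id": "1",
    "is_deleted": "True",
    "date_last_modified": "2023-01-31T00:18:24.478441Z",
    "partition_key": "649",
}

aligned = align_to_schema(payload, TABLE_COLUMNS)
missing = [name for name, value in aligned.items() if value is None]
print(missing)  # ['amount_linked', 'discount']
```

These None-valued columns are the ones that end up overwriting existing data in the table.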
Expected behavior
Existing values are not overwritten with NULL. Note that this behavior occurs whether hoodie.write.set.null.for.missing.columns is set to true or false.
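The semantics I expect can be sketched in plain Python as follows. This is only an illustration of the expected outcome, not Hudi's actual implementation: `partial_update_merge` is a hypothetical function, the existing record's values are made up for the example, and the sketch assumes the ordering field is always non-null and compares correctly as a string.

```python
def partial_update_merge(existing, incoming, ordering_field):
    """Hypothetical illustration of the expected partial-update result.

    Non-null incoming fields win; null incoming fields keep the stored value.
    An incoming record older than the stored one is ignored entirely.
    """
    if incoming[ordering_field] < existing[ordering_field]:
        return dict(existing)
    merged = dict(existing)
    for name, value in incoming.items():
        if value is not None:
            merged[name] = value
    return merged


# Stored record (illustrative values, not taken from the real table).
existing = {
    "applied_transaction_id": "10556594",
    "amount_linked": "100.00",
    "is_deleted": None,
    "date_last_modified": "2023-01-30T00:00:00.000000Z",
}

# Incoming record: amount_linked is not set, so it arrives as None.
incoming = {
    "applied_transaction_id": "10556594",
    "amount_linked": None,
    "is_deleted": "True",
    "date_last_modified": "2023-01-31T00:18:24.478441Z",
}

merged = partial_update_merge(existing, incoming, "date_last_modified")
print(merged["amount_linked"])  # 100.00 (kept, not overwritten with None)
print(merged["is_deleted"])     # True (updated from the incoming record)
```

What I observe instead is the equivalent of `amount_linked` becoming None after the upsert.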
Environment Description
- Hudi version : 0.15.0
- Spark version : 3.3.0-amzn-1
- Hive version :
- Hadoop version :
- Storage (HDFS/S3/GCS..) : S3
- Running on Docker? (yes/no) : no