
[SUPPORT] Upsert writing too many records #9860

@szingerpeter

Description


Describe the problem you faced

Upserts are taking a really long time, and I just noticed that Hudi writes far more records during a commit than the number of inserts/updates:

commit.totalRecordsWritten: 48,528,843
commit.totalUpdateRecordsWritten: 437,833
commit.totalInsertRecordsWritten: 212,220

Disclaimer: I tried looking at the Tuning and Performance guides on the Hudi website, but I would need a bit more help if possible.

To Reproduce

The use case is ingesting transactional data related to multiple concepts from source systems using Spark on a daily basis, where we use Hudi to take care of upserts.

  • In most cases, we have a transaction table and an item table, which represents the items assigned to those transactions.
  • Most of the time transactions have a created_datetime; items only occasionally have such a field.
  • Also, most of the time the PK is an integer; however, it is not always auto-incrementing.
  • The precombine field is either an incrementing integer based on when the data was received (e.g. today it would have a value of 20231013) or a timestamp.
  • Most updates arrive within a couple of days or weeks, but an update can happen even after years (a really rare scenario, but a possibility).

Currently these are our write settings for all tables:

```python
'hoodie.datasource.write.table.type': 'COPY_ON_WRITE'
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator'
'hoodie.datasource.write.partitionpath.field': ''
'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor'
'hoodie.parquet.max.file.size': '1500000000'
'hoodie.parquet.small.file.limit': '1000000000'
'hoodie.datasource.hive_sync.enable': True
'hoodie.datasource.hive_sync.support_timestamp': True
'hoodie.embed.timeline.server': False
```
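For context, the options above end up in a dict passed to the Hudi Spark datasource. Roughly how we invoke the writer (the table name, record key, precombine field and S3 path below are placeholders, not our real values):

```python
# Sketch of how options like the ones above are passed to the Hudi writer.
# Record key, precombine field, table name and path are placeholders.
hudi_options = {
    'hoodie.table.name': 'transactions',                         # placeholder
    'hoodie.datasource.write.recordkey.field': 'id',             # placeholder PK
    'hoodie.datasource.write.precombine.field': 'received_date',  # placeholder
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
}

# df.write.format('hudi').options(**hudi_options).mode('append') \
#     .save('s3://bucket/prefix/transactions')  # placeholder path
```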

We have several scenarios, but the behaviour above applies to pretty much all of them.

I believe this could be improved via correct partitioning and indexing of the data; however, I'm a bit uncertain about which partitioning/index to use for some scenarios.

Whenever we have a consistent created_datetime for the PK, I guess it makes the most sense to use created_datetime as the partitioning field (if there is a better way, please let me know). Also, what would be the best indexing strategy here when the PK is auto-incrementing vs. random ints?
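If we did go with created_datetime partitioning, my understanding is that the CustomKeyGenerator we already use supports a timestamp-typed partition field. A hedged sketch of the options involved (the field name and date formats are placeholders, and the exact hoodie.deltastreamer.keygen.timebased.* keys should be double-checked against the Hudi 0.11 key generator docs):

```python
# Hedged sketch: day partitioning on created_datetime via CustomKeyGenerator.
# Field name and date formats are placeholders; verify config keys for 0.11.
datetime_partition_options = {
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
    # 'field:type' syntax; 'timestamp' tells the key generator to parse/format it
    'hoodie.datasource.write.partitionpath.field': 'created_datetime:timestamp',
    'hoodie.deltastreamer.keygen.timebased.timestamp.type': 'DATE_STRING',
    'hoodie.deltastreamer.keygen.timebased.input.dateformat': 'yyyy-MM-dd HH:mm:ss',
    # one partition per day, e.g. 2023/10/13
    'hoodie.deltastreamer.keygen.timebased.output.dateformat': 'yyyy/MM/dd',
    # hive sync then needs a matching extractor instead of NonPartitionedExtractor
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor',
}
```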

(I couldn't really find whether dynamic partitioning is a possibility in Hudi, e.g. records older than 2 years partitioned by year; records older than 1 year partitioned by year and month; records newer than 1 year partitioned by year, month and day?)

Also, do you have recommendations for what to do when we don't have a consistent created_datetime for the PK and nothing in the table seems like a good candidate for partitioning/indexing? (This could apply to both transactions and items as described above.)

  • in some of these cases the PK is auto-incrementing
  • in some other cases the PK is quite random (after a value around 2 billion, the next value is around 400 million)

Also, is there built-in partitioning logic in Hudi where partitions are calculated from a hash of the PK? Alternatively, we could implement logic where a given set of PKs always ends up in the same partition, but that wouldn't automatically scale. Do you have recommendations for that?
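To make the hash-partitioning idea concrete, here is a minimal sketch of deriving a stable bucket from the PK so that the same key always lands in the same partition. The bucket count and naming are placeholders, and in Spark this would instead be a derived column (e.g. pmod of a hash of the key column) used as the Hudi partitionpath field; note that changing the bucket count later would reshuffle records across partitions.

```python
# Hedged sketch: stable hash bucketing of a primary key.
# NUM_BUCKETS is an assumed, fixed bucket count chosen up front.
import zlib

NUM_BUCKETS = 64

def bucket_for(pk: int, num_buckets: int = NUM_BUCKETS) -> str:
    """Stable partition path for a primary key, e.g. 'bucket=17'."""
    h = zlib.crc32(str(pk).encode('utf-8'))  # deterministic across runs
    return f'bucket={h % num_buckets}'

# The 'random' PKs from the example map deterministically to a bucket:
print(bucket_for(2_000_000_000))
print(bucket_for(400_000_000))
```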

Expected behavior

I expected commit.totalRecordsWritten to be roughly in line with the number of insert and update records (~650K), not ~48.5M.

Environment Description

  • Hudi version : 0.11.0-amzn-0

  • Spark version : 3.2.1

  • Hive version : 3.1.3

  • Hadoop version : 3.2.1

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : NO

Metadata

Assignees: No one assigned

Labels: area:performance (Performance optimizations), priority:high (Significant impact; potential bugs), type:feature (New features and enhancements)

Status: ✅ Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions