Describe the problem you faced
Upserts are taking a really long time, and I just noticed that Hudi writes far more records during a commit than the actual number of inserts/updates:
commit.totalRecordsWritten: 48,528,843
commit.totalUpdateRecordsWritten: 437,833
commit.totalInsertRecordsWritten: 212,220
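If I understand COPY_ON_WRITE correctly, every updated record forces Hudi to rewrite the entire parquet file containing it, so totalRecordsWritten also counts all the untouched records in those rewritten files. Doing the arithmetic on the numbers above:

437,833 updates + 212,220 inserts = 650,053 changed records
48,528,843 / 650,053 ≈ 75x write amplification

So each changed record seems to drag roughly 74 unchanged records along with it, which would line up with the large max file size we have configured.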
Disclaimer: I have looked through the Tuning and Performance guides on the Hudi website, but I could use some more help if possible.
To Reproduce
The use case is ingesting transactional data related to multiple concepts from source systems with Spark on a daily basis, where we rely on Hudi to handle the upserts.
- In most cases we have a transaction table and an item table, the latter representing the items assigned to those transactions.
- Transactions usually have a created_datetime; items only occasionally have such a field.
- The PK is usually an integer, though not always auto-incrementing.
- The precombine field is either an incrementing integer based on when the data was received (e.g. today it would have a value of 20231013) or a timestamp.
- Most updates arrive within a couple of days or weeks, but an update can happen even years later (a really rare scenario, but a possibility).
Currently these are our write settings for all tables:
'hoodie.datasource.write.table.type': 'COPY_ON_WRITE'
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator'
'hoodie.datasource.write.partitionpath.field': ''
'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor'
'hoodie.parquet.max.file.size': '1500000000'
'hoodie.parquet.small.file.limit': '1000000000'
'hoodie.datasource.hive_sync.enable': True
'hoodie.datasource.hive_sync.support_timestamp': True
'hoodie.embed.timeline.server': False
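For context, this is roughly how we invoke the writer (a minimal sketch; the table name, record key, precombine field and path are placeholders rather than our real values, and common_settings stands for the dict of settings listed above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet('s3://bucket/staging/transactions/')  # placeholder path for the daily batch

hudi_options = {
    'hoodie.table.name': 'transactions',                        # placeholder name
    'hoodie.datasource.write.recordkey.field': 'id',            # placeholder PK column
    'hoodie.datasource.write.precombine.field': 'received_dt',  # placeholder precombine column
    'hoodie.datasource.write.operation': 'upsert',
    **common_settings,  # the write settings listed above
}

(df.write.format('hudi')
   .options(**hudi_options)
   .mode('append')
   .save('s3://bucket/lake/transactions'))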
We have several scenarios, but the behaviour above applies to pretty much all of them.
I believe this could be improved with proper partitioning and indexing of the data; however, I'm a bit uncertain about which partitioning/index strategy to use in some scenarios.
Whenever we have a consistent created_datetime for the PK, I assume it makes the most sense to use created_datetime as the partitioning field (if there is a better way, please let me know). Also, what would be the best indexing strategy here when the PK is auto-increment vs. random ints? A config sketch of what I mean follows after this paragraph.
I also couldn't find whether dynamic partitioning is possible in Hudi, e.g. records older than 2 years partitioned by year, records older than 1 year by year and month, and records under 1 year old by year, month and day.
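To make the created_datetime question concrete, this is the kind of change I have in mind, based on my reading of the CustomKeyGenerator/TimestampBasedKeyGenerator docs (a sketch; I'm not certain these are the exact config keys for 0.11, and the date formats are just examples):

'hoodie.datasource.write.partitionpath.field': 'created_datetime:TIMESTAMP',
'hoodie.deltastreamer.keygen.timebased.timestamp.type': 'DATE_STRING',
'hoodie.deltastreamer.keygen.timebased.input.dateformat': 'yyyy-MM-dd HH:mm:ss',
'hoodie.deltastreamer.keygen.timebased.output.dateformat': 'yyyy/MM/dd',
'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor',

Even with this, I don't know how the partition granularity interacts with the auto-increment vs. random-int PK question, which is why I'm also asking about the indexing strategy on top of it.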
Also, do you have any recommendations for what to do when we don't have a consistent created_datetime for the PK and nothing in the table looks like a good candidate for partitioning/indexing (this can apply to both transactions and items as described above)?
- in some of these cases the PK is auto-increment
- in other cases the PK is quite random (after a value around 2 billion, the next value may be around 400 million)
Also, is there built-in partitioning logic in Hudi where the partitions are calculated from a hash of the PK? Alternatively, we could implement logic where a given set of PKs always lands under the same partition, but that wouldn't automatically scale (do you have recommendations for that?). A rough sketch of the hash idea follows below.
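To illustrate the hash idea, the workaround I was considering is deriving a synthetic bucket column from a hash of the PK before the write (a sketch only, we don't run this today; the column name and the modulus of 64 are arbitrary):

from pyspark.sql.functions import expr

# the same PK always hashes to the same bucket, so updates find their partition;
# the trade-off is that the bucket count is fixed up front and doesn't scale
df = df.withColumn('pk_bucket', expr('pmod(hash(id), 64)'))

# then write with:
#   'hoodie.datasource.write.partitionpath.field': 'pk_bucket'

(I also saw mentions of a BUCKET index type in the 0.11 release notes, but I'm not sure whether that addresses partitioning or just file layout within partitions.)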
Expected behavior
Upserts complete in a reasonable time, and the number of records written per commit stays close to the number of records actually inserted or updated.
Environment Description
- Hudi version : 0.11.0-amzn-0
- Spark version : 3.2.1
- Hive version : 3.1.3
- Hadoop version : 3.2.1
- Storage (HDFS/S3/GCS..) : S3
- Running on Docker? (yes/no) : no