Describe the problem you faced
Upserts are taking a really long time, and I just noticed that Hudi writes far more records during a commit than the actual number of inserts/updates:
commit.totalRecordsWritten: 48,528,843
commit.totalUpdateRecordsWritten: 437,833
commit.totalInsertRecordsWritten: 212,220
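If I understand COPY_ON_WRITE correctly, every updated record forces Hudi to rewrite the entire parquet file containing it, so totalRecordsWritten also counts all the untouched records in those rewritten files. Doing the arithmetic on the numbers above:

437,833 updates + 212,220 inserts = 650,053 changed records
48,528,843 / 650,053 ≈ 75x write amplification

So each changed record seems to drag roughly 74 unchanged records along with it, which would line up with the large max file size we have configured.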
Disclaimer: I have looked through the Tuning and Performance guides on the Hudi website, but I could use some more help if possible.
To Reproduce
The use case is ingesting transactional data related to multiple concepts from source systems with Spark on a daily basis, where we rely on Hudi to handle the upserts.
- In most cases we have a transaction table and an item table, the latter representing the items assigned to those transactions.
- Transactions usually have a created_datetime; items only occasionally have such a field.
- The PK is usually an integer, though not always auto-incrementing.
- The precombine field is either an incrementing integer based on when the data was received (e.g. today it would have a value of 20231013) or a timestamp.
- Most updates arrive within a couple of days or weeks, but an update can happen even years later (a really rare scenario, but a possibility).
Currently these are our write settings for all tables:
'hoodie.datasource.write.table.type': 'COPY_ON_WRITE'
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator'
'hoodie.datasource.write.partitionpath.field': ''
'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor'
'hoodie.parquet.max.file.size': '1500000000'
'hoodie.parquet.small.file.limit': '1000000000'
'hoodie.datasource.hive_sync.enable': True
'hoodie.datasource.hive_sync.support_timestamp': True
'hoodie.embed.timeline.server': False
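For context, this is roughly how we invoke the writer (a minimal sketch; the table name, record key, precombine field and path are placeholders rather than our real values, and common_settings stands for the dict of settings listed above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet('s3://bucket/staging/transactions/')  # placeholder path for the daily batch

hudi_options = {
    'hoodie.table.name': 'transactions',                        # placeholder name
    'hoodie.datasource.write.recordkey.field': 'id',            # placeholder PK column
    'hoodie.datasource.write.precombine.field': 'received_dt',  # placeholder precombine column
    'hoodie.datasource.write.operation': 'upsert',
    **common_settings,  # the write settings listed above
}

(df.write.format('hudi')
   .options(**hudi_options)
   .mode('append')
   .save('s3://bucket/lake/transactions'))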
We have several scenarios, but the behaviour above applies to pretty much all of them.
I believe this could be improved with proper partitioning and indexing of the data; however, I'm a bit uncertain about which partitioning/index strategy to use in some scenarios.
Whenever we have a consistent created_datetime for the PK, I assume it makes the most sense to use created_datetime as the partitioning field (if there is a better way, please let me know). Also, what would be the best indexing strategy here when the PK is auto-increment vs. random ints? A config sketch of what I mean follows after this paragraph.
I also couldn't find whether dynamic partitioning is possible in Hudi, e.g. records older than 2 years partitioned by year, records older than 1 year by year and month, and records under 1 year old by year, month and day.
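To make the created_datetime question concrete, this is the kind of change I have in mind, based on my reading of the CustomKeyGenerator/TimestampBasedKeyGenerator docs (a sketch; I'm not certain these are the exact config keys for 0.11, and the date formats are just examples):

'hoodie.datasource.write.partitionpath.field': 'created_datetime:TIMESTAMP',
'hoodie.deltastreamer.keygen.timebased.timestamp.type': 'DATE_STRING',
'hoodie.deltastreamer.keygen.timebased.input.dateformat': 'yyyy-MM-dd HH:mm:ss',
'hoodie.deltastreamer.keygen.timebased.output.dateformat': 'yyyy/MM/dd',
'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor',

Even with this, I don't know how the partition granularity interacts with the auto-increment vs. random-int PK question, which is why I'm also asking about the indexing strategy on top of it.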
Also, do you have any recommendations for what to do when we don't have a consistent created_datetime for the PK and nothing in the table looks like a good candidate for partitioning/indexing (this can apply to both transactions and items as described above)?
- in some of these cases the PK is auto-increment
- in other cases the PK is quite random (after a value around 2 billion, the next value may be around 400 million)
Also, is there built-in partitioning logic in Hudi where the partitions are calculated from a hash of the PK? Alternatively, we could implement logic where a given set of PKs always lands under the same partition, but that wouldn't automatically scale (do you have recommendations for that?). A rough sketch of the hash idea follows below.
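To illustrate the hash idea, the workaround I was considering is deriving a synthetic bucket column from a hash of the PK before the write (a sketch only, we don't run this today; the column name and the modulus of 64 are arbitrary):

from pyspark.sql.functions import expr

# the same PK always hashes to the same bucket, so updates find their partition;
# the trade-off is that the bucket count is fixed up front and doesn't scale
df = df.withColumn('pk_bucket', expr('pmod(hash(id), 64)'))

# then write with:
#   'hoodie.datasource.write.partitionpath.field': 'pk_bucket'

(I also saw mentions of a BUCKET index type in the 0.11 release notes, but I'm not sure whether that addresses partitioning or just file layout within partitions.)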
Expected behavior
Upserts complete in a reasonable time, and the number of records written per commit stays close to the number of records actually inserted or updated.
Environment Description
- Hudi version : 0.11.0-amzn-0
- Spark version : 3.2.1
- Hive version : 3.1.3
- Hadoop version : 3.2.1
- Storage (HDFS/S3/GCS..) : S3
- Running on Docker? (yes/no) : no