[SUPPORT] Poor Upsert Performance on COW table due to indexing #6687

jtm437 · 2022-09-16T00:27:07Z

Hello, I am having performance issues when attempting to upsert data into a Hudi COW table. With the below specs it is taking longer than 4 hours to finish upserting (if it ever does finish). In the screenshots below, you can see that it is taking a long time doing the index scan. I have tried disabling hoodie.bloom.index.prune.by.ranges because our record key is random. I've also tried upserting using the "Simple" index type and did not see any performance improvements. Is there anything else I can do to improve the performance?

Specs:
Table Size: 13.6TB (compressed in S3)
Number of partitions: 1135 (hoodie.datasource.hive_sync.partition_fields=year,month)
Upsert dataset size: 68 million records, 6GB compressed
Index type: Default (Bloom)
Number of nodes: 30
Node type: r6g.8xlarge
Average record size: ~40 bytes (calculated by File Size/Num Records: 10MB/250000 records)

Environment Description

Hudi version : 0.9.0
Spark version : 2.4.8
EMR version: 5.34.0
Hive version : 2.3.8
Hadoop version : Amazon 2.10.1
Storage (HDFS/S3/GCS..) : S3
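
For readers following along, the two variations described above (Bloom index with range pruning disabled, and the Simple index) can be sketched as Hudi write options. This is a minimal sketch, not the poster's actual job: the helper name `upsert_options` is hypothetical, and the options would normally be passed to a Spark DataFrame writer, which is only shown in a comment here.

```python
def upsert_options(index_type, prune_by_ranges=None):
    """Build Hudi write options for an upsert into a COW table.

    `upsert_options` is a hypothetical helper for illustration only.
    """
    opts = {
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        "hoodie.index.type": index_type,
    }
    # Range pruning only applies to the Bloom index; with random record
    # keys the min/max key ranges of files overlap, so pruning rarely helps.
    if prune_by_ranges is not None:
        opts["hoodie.bloom.index.prune.by.ranges"] = str(prune_by_ranges).lower()
    return opts


bloom_without_range_pruning = upsert_options("BLOOM", prune_by_ranges=False)
simple_index = upsert_options("SIMPLE")

# Assumed usage with an existing Spark DataFrame `df` and table path:
# df.write.format("hudi").options(**simple_index).mode("append").save(base_path)
```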

scxwhite · 2022-09-29T10:17:18Z

I recommend upgrading to the latest version of Hudi and using the bucket index. Alternatively, on version 0.9.0, try the HBase index with a MOR table.
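
As a rough illustration of the two suggestions, the index-related options might look like the following. This is a sketch under assumptions: the bucket count, hash field name, ZooKeeper quorum, and HBase table name are placeholders, not values from this thread, and the bucket index keys only apply to newer Hudi releases than 0.9.0. An existing table likely needs to be rewritten before a bucket layout takes effect.

```python
# Bucket index (newer Hudi releases): records are hashed into a fixed number
# of buckets per partition, so no file-level index lookup is needed on upsert.
bucket_index_options = {
    "hoodie.index.type": "BUCKET",
    "hoodie.bucket.index.num.buckets": "64",        # placeholder value
    "hoodie.bucket.index.hash.field": "record_id",  # hypothetical key column
}

# HBase index: the record key to file mapping is kept in an external HBase
# table. The connection values below are placeholders.
hbase_index_options = {
    "hoodie.index.type": "HBASE",
    "hoodie.index.hbase.zkquorum": "zk-host-1,zk-host-2",  # placeholder hosts
    "hoodie.index.hbase.zkport": "2181",
    "hoodie.index.hbase.table": "hudi_record_index",       # hypothetical table
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
}
```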

jtm437 · 2022-09-29T10:38:04Z

@scxwhite can you point me at some documentation on implementing bucket or hbase indexes?

scxwhite · 2022-09-30T02:07:21Z

You can see how to use these indexes in the official documentation. If you want to know more about the bucket index, take a look at this document.

nsivabalan · 2022-10-22T23:23:39Z

You can enable clustering to increase the file sizes. By default, the target file size is 120MB, but you can batch small files into larger ones (for example, 500MB), so that the number of files to be checked during index lookup is reduced.
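
To make the effect concrete, here is a back-of-the-envelope sketch of how many base files an index lookup may have to consider at different file sizes, followed by clustering options that target larger files. It assumes the ~10MB file size quoted in the specs is representative of the whole 13.6TB table and uses decimal units; the commit interval and small-file limit are illustrative values, not recommendations from this thread.

```python
MB = 10**6
TB = 10**12


def estimate_file_count(table_bytes, avg_file_bytes):
    """Approximate number of base files, rounded up."""
    return -(-table_bytes // avg_file_bytes)  # ceiling division on integers


table_bytes = 13_600 * 10**9  # 13.6TB, from the specs above

files_at_10mb = estimate_file_count(table_bytes, 10 * MB)    # assumed current size
files_at_500mb = estimate_file_count(table_bytes, 500 * MB)  # after clustering

print(files_at_10mb, files_at_500mb)

# Inline clustering options; the numeric values are illustrative assumptions.
clustering_options = {
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    # Files smaller than this are candidates for clustering.
    "hoodie.clustering.plan.strategy.small.file.limit": str(300 * MB),
    # Target size of the files written by clustering.
    "hoodie.clustering.plan.strategy.target.file.max.bytes": str(500 * MB),
}
```

Under these assumptions the candidate file count drops by a factor of 50, which is the reduction in index lookup work the comment above is pointing at.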

bibhu107 · 2024-05-02T11:53:37Z

Hi,
You can try the record-level index: https://hudi.apache.org/blog/2023/11/01/record-level-index/#metadata-table. It stores the record keys in the metadata table. However, I am not sure whether this index can be applied to COW tables.
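
A minimal sketch of the write options for the record-level index, assuming a Hudi release that ships it (0.14.0 or later, per the linked blog post). Whether it suits this particular table is not confirmed in this thread.

```python
# The record-level index lives in the metadata table, so the metadata table
# must be enabled alongside the index itself.
record_level_index_options = {
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.record.index.enable": "true",
    # Use the record-level index for upsert tagging instead of Bloom filters.
    "hoodie.index.type": "RECORD_INDEX",
}
```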

xushiyan added the labels "performance", "priority:major" (degraded perf; unable to move forward; potential bugs), and "index" on Sep 20, 2022
