[SUPPORT] Poor Upsert Performance on COW table due to indexing #6687

Open
jtm437 opened this issue Sep 16, 2022 · 5 comments
Labels
index, performance, priority:major (degraded perf; unable to move forward; potential bugs)

Comments

jtm437 commented Sep 16, 2022

Hello, I am having performance issues when upserting data into a Hudi COW table. With the specs below, the upsert takes longer than 4 hours to finish (if it finishes at all). In the screenshots below, you can see that most of the time is spent in the index scan. I have tried disabling hoodie.bloom.index.prune.by.ranges because our record key is random. I have also tried upserting with the "Simple" index type and did not see any performance improvement. Is there anything else I can do to improve the performance?

image

image
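
For readers who want to reproduce the two attempts described above, the settings would be passed to the Spark writer roughly as follows. This is a minimal sketch, not the author's actual job: the helper name `tried_index_options` is hypothetical, and the record key, partition path, table name, and base path options that every Hudi write also needs are left out.

```python
def tried_index_options(use_simple_index: bool = False) -> dict:
    """Build the Hudi writer options for the two attempts described in the issue.

    Hypothetical helper: only the index-related options are shown.
    """
    opts = {
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    }
    if use_simple_index:
        # Second attempt: the "Simple" index type.
        opts["hoodie.index.type"] = "SIMPLE"
    else:
        # First attempt: default Bloom index with range pruning disabled,
        # because random record keys make min/max key ranges useless.
        opts["hoodie.index.type"] = "BLOOM"
        opts["hoodie.bloom.index.prune.by.ranges"] = "false"
    return opts


# Usage with an existing DataFrame `df` and a `base_path` (both assumed):
# df.write.format("hudi").options(**tried_index_options()).mode("append").save(base_path)
```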

Specs:
Table Size: 13.6TB (compressed in S3)
Number of partitions: 1135 (hoodie.datasource.hive_sync.partition_fields=year,month)
Upsert dataset size: 68 million records, 6GB compressed
Index type: Default (Bloom)
Number of nodes: 30
Node type: r6g.8xlarge
Average record size: ~40 bytes (calculated by File Size/Num Records: 10MB/250000 records)
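
A rough back-of-the-envelope estimate of how many data files the index has to consider can be derived from these specs. The sketch below assumes decimal units and assumes that the 10MB file from the record-size calculation is typical of the table, which the issue does not confirm; the 120 MB case is included only for comparison.

```python
GB = 1000 ** 3
MB = 1000 ** 2

TABLE_BYTES = 13_600 * GB   # 13.6TB, from the specs above
PARTITIONS = 1135           # from the specs above


def estimate_files(file_bytes: int) -> tuple:
    """Return (total files, average files per partition) for a given file size."""
    total_files = TABLE_BYTES / file_bytes
    return round(total_files), round(total_files / PARTITIONS)


# Assumption: files are about 10 MB, as in the record-size calculation.
print(estimate_files(10 * MB))
# Comparison: files of about 120 MB.
print(estimate_files(120 * MB))
```

Under these assumptions, smaller files mean roughly twelve times as many files per partition, and each of them is a candidate during the index lookup when key ranges cannot be used for pruning.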

Environment Description

  • Hudi version : 0.9.0
  • Spark version : 2.4.8
  • EMR version: 5.34.0
  • Hive version : 2.3.8
  • Hadoop version : Amazon 2.10.1
  • Storage (HDFS/S3/GCS..) : S3

xushiyan added the index, performance, and priority:major (degraded perf; unable to move forward; potential bugs) labels on Sep 20, 2022

scxwhite (Contributor) commented

I recommend upgrading to the latest version of Hudi and using the bucket index. Alternatively, on version 0.9.0, try the HBase index with a MOR table.
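
As a starting point, the two suggestions map to writer options along these lines. This is a sketch under assumptions: the helper names and all argument values are hypothetical placeholders, and the option keys should be checked against the configuration reference for the Hudi version in use. The thread does not say whether an existing table can be switched to these settings in place, so treat them as settings for a newly written table.

```python
def bucket_index_options(num_buckets: int, hash_field: str) -> dict:
    """Writer options for the bucket index (available in newer Hudi releases)."""
    return {
        "hoodie.index.type": "BUCKET",
        # Number of buckets per partition; pick it from the expected data volume.
        "hoodie.bucket.index.num.buckets": str(num_buckets),
        # Field that is hashed to choose the bucket, typically the record key.
        "hoodie.bucket.index.hash.field": hash_field,
    }


def hbase_index_options(zk_quorum: str, zk_port: int, hbase_table: str) -> dict:
    """Writer options for the HBase index on a MOR table."""
    return {
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.index.type": "HBASE",
        "hoodie.index.hbase.zkquorum": zk_quorum,
        "hoodie.index.hbase.zkport": str(zk_port),
        "hoodie.index.hbase.table": hbase_table,
    }


# Usage with an existing DataFrame `df` (assumed); "record_id" is a placeholder:
# df.write.format("hudi").options(**bucket_index_options(64, "record_id"))...
```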

jtm437 (Author) commented Sep 29, 2022

@scxwhite can you point me at some documentation on implementing bucket or hbase indexes?

scxwhite (Contributor) commented

You can see how to use these indexes in the official documentation. If you want to know more about the bucket index, take a look at this document.

nsivabalan (Contributor) commented

You can enable clustering to increase the file sizes. By default, files are about 120 MB, but you can batch small files into larger ones (around 500 MB) so that the index lookup has fewer files to check.
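
A sketch of what enabling inline clustering with a 500 MB target could look like. The helper name and the default arguments are hypothetical: the small-file limit of 120 MB and the clustering frequency of every 4 commits are illustrative choices, not values given in this thread.

```python
MB = 1024 * 1024


def inline_clustering_options(target_mb: int = 500,
                              small_file_mb: int = 120,
                              every_n_commits: int = 4) -> dict:
    """Writer options that schedule and run clustering inline with writes."""
    return {
        "hoodie.clustering.inline": "true",
        # Run clustering after this many commits (illustrative value).
        "hoodie.clustering.inline.max.commits": str(every_n_commits),
        # Files smaller than this are candidates for clustering.
        "hoodie.clustering.plan.strategy.small.file.limit": str(small_file_mb * MB),
        # Upper bound on the size of the files that clustering produces.
        "hoodie.clustering.plan.strategy.target.file.max.bytes": str(target_mb * MB),
    }


# Usage with an existing DataFrame `df` (assumed), merged with the other writer options:
# df.write.format("hudi").options(**inline_clustering_options())...
```

Inline clustering adds work to the write path, so on a table of this size it may be worth testing on a few partitions first.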

bibhu107 (Contributor) commented May 2, 2024

Hi,
You can try the record-level index: https://hudi.apache.org/blog/2023/11/01/record-level-index/#metadata-table. It stores the record keys in the metadata table, but I am not sure whether this index can be applied to COW tables.
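
For anyone trying this, the record-level index is switched on through the metadata table options. The sketch below assumes a Hudi release that ships the record-level index (it is not available in 0.9.0); the helper name is hypothetical, and the snippet does not settle the open question about COW tables.

```python
def record_level_index_options() -> dict:
    """Writer options that build and use the record-level index."""
    return {
        # The record-level index is stored as a partition of the metadata table.
        "hoodie.metadata.enable": "true",
        "hoodie.metadata.record.index.enable": "true",
        # Use the record-level index for tagging incoming records on upsert.
        "hoodie.index.type": "RECORD_INDEX",
    }


# Usage with an existing DataFrame `df` (assumed), merged with the other writer options:
# df.write.format("hudi").options(**record_level_index_options())...
```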
