[SUPPORT] Performance Tuning: Slow stages (Building Workload Profile & Getting Small files from partitions) during Hudi Writes #2620
@nsivabalan is looking into this.
Thanks @bvaradar and @nsivabalan. Please let me know how to improve the performance, or if you need any further details to investigate. What are the downsides of not using the default BLOOM index? In my use case I have late-arriving data, so will performance suffer because of this choice? I would also like to understand why these specific steps take so long. From the Spark web UI it seems the execution of the method below is what takes the time. Any insights into what is happening in the background, please? org.apache.hudi.index.bloom.SparkHoodieBloomIndex.findMatchingFilesForRecordKeys(SparkHoodieBloomIndex.java:266)
Hi @codejoyan: a few clarifying questions on your use case and record keys.
If your record keys are completely random, then using SIMPLE makes sense, since range pruning would not filter anything out. With the default BLOOM index we filter based on min/max key ranges, which may not be needed in your case (and in that step we read parquet footers to parse the min/max ranges). Once you clarify these details, I can look into it further.
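To make the choice above concrete, here is a minimal sketch (the table name and path are hypothetical placeholders; the option keys are from Hudi's write configs) of how the index type could be switched, or range pruning turned off while keeping BLOOM:

```scala
// Hedged sketch: choosing the index strategy discussed above.
// "my_table" and the S3 path are hypothetical placeholders.
df.write.format("hudi").
  option("hoodie.table.name", "my_table").
  // Option A: record keys are completely random, so min/max range
  // pruning buys nothing; use the SIMPLE index instead.
  option("hoodie.index.type", "SIMPLE").
  // Option B (alternative to A): keep the default BLOOM index but skip
  // reading parquet footers for min/max key ranges:
  // option("hoodie.bloom.index.prune.by.ranges", "false").
  mode("append").
  save("s3://bucket/path/my_table")
```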
Apologies for the delay @nsivabalan.
A few additional questions:
Based on the above scenario, do you suggest:
I have a similar issue where bloom index performance is very slow for upsert into a Hudi MOR table.
Obtain key ranges for file slices (range pruning = on)
@nsivabalan, any inputs would be very helpful.
@codejoyan : sorry, this somehow slipped off my radar. Of the three methods you quoted, two are index-related and the third is the actual write operation. The best way to decide a partitioning strategy is to look at what your queries usually filter on. If they are date-based, then you definitely need the date in your partitioning strategy, which you already have. And if adding region would cut down most of the data to be looked up, sure, add it. Note this would also blow up your total number of partitions, since it becomes (number of dates) x (number of regions). Regarding record keys and bloom: @n3nash, do you have any pointers here?
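The date-plus-region partitioning suggested above can be sketched roughly as follows (the field names `event_date` and `region` are hypothetical; the option keys are Hudi's standard datasource write configs):

```scala
// Hedged sketch of a date + region partition layout, assuming queries
// filter on both columns. Field names are hypothetical placeholders.
// Caveat from the discussion: total partitions = #dates x #regions.
df.write.format("hudi").
  option("hoodie.datasource.write.keygenerator.class",
         "org.apache.hudi.keygen.ComplexKeyGenerator").
  option("hoodie.datasource.write.recordkey.field", "order_id").
  option("hoodie.datasource.write.partitionpath.field", "event_date,region").
  mode("append").
  save("s3://bucket/path/my_table")
```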
@kimberlyamandalu : do you have a support ticket for your question? Let's not pollute this issue; we can create a new one for your use case and discuss there.
Hi @nsivabalan, no, I do not have a separate ticket for my question. I thought it might be related to this, so I chimed in. I can open a new ticket for my use case so we can isolate it. Sorry for the confusion. Thanks.
I face the same issue. It usually takes 1-2 minutes for the "getting small files from partitions" stage in one micro-batch (60-second interval). My storage is S3, but it looks like it works fine on HDFS.
@kimberlyamandalu @njalan @codejoyan There are a few problems when using BLOOM_INDEX.
FAQ link on how to configure bloom configs.
Problem Statement: I am using a COW table and receiving roughly 1 GB of incremental data. The batch runs a data quality check and an upsert. Attached is a screenshot of the Spark UI stages.
Snapshot count before the upsert:
Incremental count after the upsert:
Some additional details for the above runs.
I then changed the configs as below to get roughly 100k entries per file, but the performance is worse now; it basically gets stuck. Attached is a Spark web UI screenshot.
Performance is now okay with the BLOOM index when the incremental batch size is around 100 MB (around 4-5 minutes for the upsert), but it gets worse when the batch size increases (> 5 GB), and the countByKey at BaseSparkCommitActionExecutor.java:154 step gets stuck.
Is there any progress on this problem? I have the same issue.
Hi, I have the same problem with slow stages. At first it runs well, but as more and more small files are inserted it slows down.
I also encountered the same problem with 0.14.0; how can it be solved? Change hoodie.parquet.small.file.limit? Set hoodie.bloom.index.prune.by.ranges = false? Change hoodie.memory.merge.max.size? Can this be optimized in Hudi 1.0? This stage is simply too time-consuming.
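For reference, the three knobs mentioned in the question could be set together in the writer like this (a sketch only; the values are illustrative, not recommendations, and would need tuning per workload):

```scala
// Hedged sketch combining the tuning knobs named above.
// Values are illustrative placeholders, not verified recommendations.
df.write.format("hudi").
  // Threshold below which a file is considered "small" and padded on write:
  option("hoodie.parquet.small.file.limit", String.valueOf(100 * 1024 * 1024)).
  // Skip min/max range pruning during bloom index lookup:
  option("hoodie.bloom.index.prune.by.ranges", "false").
  // Max memory (bytes) for the merge step:
  option("hoodie.memory.merge.max.size", String.valueOf(1024L * 1024 * 1024)).
  mode("append").
  save("s3://bucket/path/my_table")
```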
I'm having the same problem on Hudi 0.14.1 and Spark 3.4.1.
Hi,
I am seeing performance issues while upserting data, especially in the two jobs below:
15 (SparkUpsertCommitActionExecutor)
17 (UpsertPartitioner)
Attached are some of the stats regarding the slow jobs/stages.
Configurations used:
--driver-memory 5G --executor-memory 10G --executor-cores 5 --num-executors 10
Upsert config parameters:
option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.ComplexKeyGenerator").
option("hoodie.upsert.shuffle.parallelism","2").
option("hoodie.insert.shuffle.parallelism","2").
option(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES, 128 * 1024 * 1024).
option(HoodieStorageConfig.PARQUET_FILE_MAX_BYTES, 128 * 1024 * 1024).
option("hoodie.copyonwrite.record.size.estimate", "40")
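One knob that often matters in configs like the above is the shuffle parallelism, set to 2 here. A hedged sketch of raising it, not a recommendation; the right value depends on input size and available executor cores (here 10 executors x 5 cores):

```scala
// Illustrative only: higher shuffle parallelism for larger upsert batches.
// 200 is a placeholder value, e.g. a small multiple of total executor cores.
option("hoodie.upsert.shuffle.parallelism", "200").
option("hoodie.insert.shuffle.parallelism", "200").
```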
Can you please guide me on how to approach tuning this performance problem? Let me know if you need any further details.
Below are some of the stats:
Environment Description