Describe the problem you faced
- Our scenario
We have 700 million records in our original offline table, distributed across 10 partitions. Each partition has a different data size, ranging from 10GB to 200GB. We plan to ingest this data into a data lake and test the point query performance after applying Clustering.
- Point query scenario
The original table has a column called "vin", which is used as a filter together with the time partition column for point queries.
- Hudi configuration
hoodie.clustering.plan.strategy.target.file.max.bytes is set to 1GB, consistent with Delta Lake's default value.
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
hoodie.clustering.plan.strategy.sort.columns=vin
hoodie.clustering.rollback.pending.replacecommit.on.conflict=true
hoodie.clustering.plan.strategy.daybased.lookback.partitions=10
hoodie.clustering.plan.partition.filter.mode=SELECTED_PARTITIONS
hoodie.clustering.plan.strategy.cluster.begin.partition=part_dt=20230614
hoodie.clustering.plan.strategy.cluster.end.partition=part_dt=20230623
hoodie.clustering.plan.strategy.max.bytes.per.group=17179869184
hoodie.clustering.plan.strategy.max.num.groups=128
hoodie.layout.optimize.enable=true
hoodie.layout.optimize.strategy=z-order
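For reference, the options above collected into one place as they might be passed to a Spark DataFrame writer (e.g. via `.options(**clustering_opts())`); this is only a sketch of our configuration, and the `clustering_opts` helper name is illustrative, not a Hudi API:

```python
# Sketch only: the clustering options from this issue assembled as a dict.
# The byte sizes are written as expressions to make the intent clear:
# 1024**3 bytes = 1GB target file size, 16 * 1024**3 bytes = 16GB per group.
def clustering_opts():
    return {
        "hoodie.clustering.plan.strategy.target.file.max.bytes": str(1024 ** 3),
        "hoodie.clustering.execution.strategy.class":
            "org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy",
        "hoodie.clustering.plan.strategy.sort.columns": "vin",
        "hoodie.clustering.rollback.pending.replacecommit.on.conflict": "true",
        "hoodie.clustering.plan.strategy.daybased.lookback.partitions": "10",
        "hoodie.clustering.plan.partition.filter.mode": "SELECTED_PARTITIONS",
        "hoodie.clustering.plan.strategy.cluster.begin.partition": "part_dt=20230614",
        "hoodie.clustering.plan.strategy.cluster.end.partition": "part_dt=20230623",
        "hoodie.clustering.plan.strategy.max.bytes.per.group": str(16 * 1024 ** 3),
        "hoodie.clustering.plan.strategy.max.num.groups": "128",
        "hoodie.layout.optimize.enable": "true",
        "hoodie.layout.optimize.strategy": "z-order",
    }
```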
- Phenomena we observed
- After clustering, both Hudi and Delta Lake produce Parquet files of approximately 1GB.
- With clustering applied, point queries in Hudi scan around 10 files in the larger partitions, while Delta Lake typically scans only 1-2 files regardless of partition size.
- We ran performance tests at concurrency levels of 1 and 10, with hundreds of rounds on both Hudi and Delta Lake using different combinations of "vin" values and time partitions. The final result was that Delta Lake's point-query performance is roughly three times better than Hudi's.
After examining Hudi's file-listing code, we found that Hudi primarily uses column statistics (min and max values) to select candidate files. We therefore believe the file-listing logic itself is unlikely to be the cause of the performance gap; the issue most likely lies in the clustering algorithm itself.
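To make the pruning behavior we are describing concrete, here is a toy model (not Hudi's actual implementation): with per-file column statistics, a point query on "vin" only needs files whose [min, max] range contains the queried value, so the number of files scanned depends entirely on how much the per-file vin ranges overlap after clustering. The file names and vin values below are invented for illustration:

```python
# Toy model of min/max file pruning: each file carries (min_vin, max_vin)
# statistics; a point query keeps only files whose range contains the vin.
def candidate_files(file_stats, vin):
    return [f for f, (lo, hi) in file_stats.items() if lo <= vin <= hi]

# Tight layout: clustering produced disjoint vin ranges per file.
sorted_layout = {"f0": ("A000", "C999"), "f1": ("D000", "F999"), "f2": ("G000", "J999")}

# Loose layout: clustering produced wide, overlapping vin ranges per file.
overlapping_layout = {"f0": ("A000", "J999"), "f1": ("A100", "J800"), "f2": ("A050", "J900")}

print(len(candidate_files(sorted_layout, "E123")))       # 1 file scanned
print(len(candidate_files(overlapping_layout, "E123")))  # 3 files scanned
```

If the files produced by clustering carry wide, overlapping vin ranges, min/max pruning degrades even though the statistics machinery itself works correctly, which would be consistent with the ~10 files scanned per large partition that we observe.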
Could you analyze, from a professional perspective, what the reason behind this might be? The answer will determine which data lake technology we ultimately choose.
Expected behavior
Point-query performance after clustering should be comparable to Delta Lake's.
Environment Description
- Hudi version : 0.13.1
- Spark version : 3.3
- Hive version : 2.3.9
- Hadoop version : 2.x
- Storage (HDFS/S3/GCS..) : HDFS
- Running on Docker? (yes/no) : no, on k8s