
[SUPPORT] Point query performance after clustering lags behind Delta Lake. #9351

@zbbkeepgoing

Description


Describe the problem you faced

  • Our scenario

We have 700 million records in our original offline table, distributed across 10 partitions. Partition sizes vary, ranging from 10 GB to 200 GB. We plan to ingest this data into a data lake and test point query performance after applying clustering.

  • Point query scenario

The original table has a column called "vin," which will be used as a filter along with the time partition column for point queries.

  • Hudi configuration
hoodie.clustering.plan.strategy.target.file.max.bytes is set to 1GB, consistent with Delta Lake's default value.
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
hoodie.clustering.plan.strategy.sort.columns=vin
hoodie.clustering.rollback.pending.replacecommit.on.conflict=true
hoodie.clustering.plan.strategy.daybased.lookback.partitions=10
hoodie.clustering.plan.partition.filter.mode=SELECTED_PARTITIONS
hoodie.clustering.plan.strategy.cluster.begin.partition=part_dt=20230614
hoodie.clustering.plan.strategy.cluster.end.partition=part_dt=20230623
hoodie.clustering.plan.strategy.max.bytes.per.group=17179869184
hoodie.clustering.plan.strategy.max.num.groups=128
hoodie.layout.optimize.enable=true
hoodie.layout.optimize.strategy=z-order
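Since `hoodie.layout.optimize.strategy=z-order` is set, records are ordered by interleaving the bits of the (integer-mapped) sort column values rather than by a plain lexical sort. A minimal sketch of the bit-interleaving idea behind Z-ordering (an illustration only, not Hudi's actual implementation; the mapping of column values to integers is assumed):

```python
def interleave_bits(a: int, b: int, bits: int = 8) -> int:
    """Interleave the bits of two integers (Morton / Z-order encoding).

    Bit i of `a` lands at position 2*i+1 and bit i of `b` at position 2*i,
    so neither column fully dominates the resulting sort order.
    """
    z = 0
    for i in range(bits):
        z |= ((a >> i) & 1) << (2 * i + 1)
        z |= ((b >> i) & 1) << (2 * i)
    return z

# Rows sorted by Z-value are no longer strictly ordered by either column,
# so per-file min/max ranges on a single column can end up wider than with
# a linear sort on that column alone.
rows = [(a, b) for a in range(4) for b in range(4)]
rows.sort(key=lambda r: interleave_bits(r[0], r[1]))
```

Note that with only one sort column (`vin`), a linear sort and a Z-order sort would behave quite differently for point-query file pruning, which may be relevant to the comparison below.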
  • Phenomena we observed
  1. After Clustering, both Hudi and Delta Lake produce Parquet files of approximately 1GB.

  2. With clustering applied, for point queries Hudi scans around 10 files in the larger partitions, while Delta Lake typically scans only 1-2 files regardless of partition size.

  3. We ran performance tests at both 10 concurrent queries and a single query, with hundreds of rounds on both Hudi and Delta Lake using different combinations of "vin" and time partition values. The consistent conclusion was that Delta Lake performed about three times better than Hudi.

After examining Hudi's file-listing code, we found that Hudi primarily uses column statistics (min/max values) to select candidate files. We therefore believe the file-listing logic itself is unlikely to be the cause of the performance gap; the issue most likely lies in the clustering algorithm itself.
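The min/max-based candidate selection described above can be sketched as follows (a simplified illustration, not Hudi's actual code; the file names and stats ranges are made up):

```python
from dataclasses import dataclass

@dataclass
class FileStats:
    """Per-file column statistics, as tracked in a column stats index."""
    path: str
    vin_min: str
    vin_max: str

def prune_files(files: list[FileStats], vin: str) -> list[FileStats]:
    """Keep only files whose [min, max] range could contain the queried vin."""
    return [f for f in files if f.vin_min <= vin <= f.vin_max]

files = [
    FileStats("f0.parquet", "VIN000", "VIN199"),
    FileStats("f1.parquet", "VIN150", "VIN399"),  # range overlaps f0
    FileStats("f2.parquet", "VIN400", "VIN599"),
]
candidates = prune_files(files, "VIN180")  # matches both f0 and f1
```

With this kind of pruning, the number of files scanned per point query is driven entirely by how much the per-file min/max ranges overlap, i.e. by how well clustering packed each vin's rows into few files.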

Can you please analyze, from a professional perspective, what the reason behind this is? The answer determines which data lake technology we ultimately choose.

Expected behavior

The point query performance after clustering is comparable to Delta Lake.

Environment Description

  • Hudi version : 0.13.1

  • Spark version : 3.3

  • Hive version : 2.3.9

  • Hadoop version : 2.x

  • Storage (HDFS/S3/GCS..) : HDFS

  • Running on Docker? (yes/no) : no, on k8s
