
[SUPPORT] PartitionedFile's size estimation in FileSourceScanExec#createReadRDD when NewHoodieParquetFileFormat is enabled #12139

@TheR1sing3un



Describe the problem you faced

I have a question that came up while reading the source code of NewHoodieParquetFileFormat and HoodieFileIndex.

When hoodie.datasource.read.use.new.parquet.file.format is enabled, Hudi provides a HadoopFsRelation backed by NewHoodieParquetFileFormat.
FileSourceScanExec#createReadRDD then lists the partitions it needs to read.
In this case, the call to relation.location.listFiles is dispatched to HoodieFileIndex#listFiles.
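There, I found that each file slice is turned into a single PartitionedFile:

  1. If the BaseFile is present, it returns a PartitionedFile with the BaseFile's status.
  2. Otherwise, if the LogFiles are non-empty, it returns a PartitionedFile with the status of an arbitrary (possibly random) LogFile.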
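
To make the two cases concrete, here is a minimal sketch of the selection as I understand it. It is a standalone Python model, not Hudi code: `FileStatus`, `FileSlice`, and `representative_status` are hypothetical names, and taking the first log file is only a stand-in for "whichever log file happens to be picked".

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FileStatus:
    """Hypothetical stand-in for a file status: a path and a length in bytes."""
    path: str
    length: int


@dataclass
class FileSlice:
    """Hypothetical stand-in for a file slice: optional base file plus log files."""
    base_file: Optional[FileStatus] = None
    log_files: List[FileStatus] = field(default_factory=list)


def representative_status(file_slice: FileSlice) -> Optional[FileStatus]:
    """Return the status whose length would be reported for the whole slice."""
    # Case 1: the base file is present, so its status is used.
    if file_slice.base_file is not None:
        return file_slice.base_file
    # Case 2 (assumed to apply only without a base file): some log file is used.
    # Index 0 models "an arbitrary log file", not a deliberate choice.
    if file_slice.log_files:
        return file_slice.log_files[0]
    # A slice with neither a base file nor log files yields nothing.
    return None
```

In this model, a log-only slice with log files of 1 MB and 500 MB reports either size depending only on their order, which is the concern raised here.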

My question is: instead of an arbitrary log file, should we sort the log files by size and use the median one as the PartitionedFile?
Spark packs multiple PartitionedFiles into a FilePartition based on the size of each PartitionedFile, so the size reported for a file slice directly affects how the read is split into tasks.
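
The packing step can be sketched roughly as follows. This is a simplified Python model of the size-based bin packing, written from my reading of Spark and not copied from it: `pack_file_partitions` is a hypothetical name, `max_split_bytes` and `open_cost_in_bytes` are taken as plain inputs, and the splitting of large splittable files into chunks is left out.

```python
from typing import List


def pack_file_partitions(
    sizes: List[int], max_split_bytes: int, open_cost_in_bytes: int
) -> List[List[int]]:
    """Greedily pack reported file sizes into partitions, largest files first."""
    partitions: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in sorted(sizes, reverse=True):
        # Close the current partition if adding this file would overflow it.
        if current and current_size + size > max_split_bytes:
            partitions.append(current)
            current = []
            current_size = 0
        # Each file also pays a fixed "open cost" on top of its reported size.
        current_size += size + open_cost_in_bytes
        current.append(size)
    # Flush the last, partially filled partition.
    if current:
        partitions.append(current)
    return partitions


if __name__ == "__main__":
    # Sizes are given in MB here purely for readability.
    print(len(pack_file_partitions([1] * 10, 128, 4)))
    print(len(pack_file_partitions([60] * 10, 128, 4)))
```

With a 128 MB split size and a 4 MB open cost, ten slices reported as 1 MB each end up in a single partition in this model, while the same ten slices reported as 60 MB each are spread over five partitions. An unrepresentative reported size can therefore change the task layout substantially.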

Since such a PartitionedFile actually stands for a whole FileSlice, I think it should report a more representative file size.
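
As a minimal sketch of the median idea, assuming log files are given as `(path, size)` pairs: `median_log_file` is a hypothetical helper, not an existing Hudi function, and for an even number of log files it picks the upper of the two middle entries.

```python
from typing import List, Tuple


def median_log_file(log_files: List[Tuple[str, int]]) -> Tuple[str, int]:
    """Pick the log file whose size is the median of all log file sizes."""
    if not log_files:
        raise ValueError("file slice has no log files")
    # Sort by size only; the sort is stable, so ties keep their input order.
    ordered = sorted(log_files, key=lambda f: f[1])
    # Integer division selects the middle entry (upper middle for even counts).
    return ordered[len(ordered) // 2]
```

For log files of 10, 300, and 50 bytes, this picks the 50-byte file, regardless of the order in which the log files are listed.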

