Description
Describe the problem you faced
I have a question that came up while reading the source code of NewHoodieParquetFileFormat and HoodieFileIndex.
When hoodie.datasource.read.use.new.parquet.file.format is enabled, Hudi provides a HadoopFsRelation backed by NewHoodieParquetFileFormat.


FileSourceScanExec#createReadRDD then queries the partitions it needs.

In this case, the call to relation.location.listFiles is redirected to HoodieFileIndex#listFiles.

I found that, for each file slice, HoodieFileIndex#listFiles does the following:
- If the BaseFile is present, it returns a PartitionedFile with the BaseFile's status.
- If the LogFiles are non-empty, it returns a PartitionedFile with the status of an arbitrary (possibly random) LogFile.
My question is: should we sort the log files by size and use the median one as the PartitionedFile, rather than an arbitrary log file?
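To make the proposal concrete, here is a minimal Python sketch of the selection rule, not Hudi's actual code. The `FileStatus` and `FileSlice` classes and the `representative_file` helper are hypothetical stand-ins, and the sketch assumes that a present base file takes priority over the log files.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FileStatus:
    """Hypothetical stand-in for a file status: a path and a size in bytes."""
    path: str
    size: int


@dataclass
class FileSlice:
    """Hypothetical stand-in for a file slice: optional base file plus log files."""
    base_file: Optional[FileStatus] = None
    log_files: List[FileStatus] = field(default_factory=list)


def representative_file(file_slice: FileSlice) -> Optional[FileStatus]:
    """Pick the file whose status would stand in for the whole slice."""
    # Assumption: a present base file always wins.
    if file_slice.base_file is not None:
        return file_slice.base_file
    if not file_slice.log_files:
        return None
    # Proposed rule: sort the log files by size and take the (lower) median
    # instead of an arbitrary log file.
    by_size = sorted(file_slice.log_files, key=lambda f: f.size)
    return by_size[(len(by_size) - 1) // 2]


log_only_slice = FileSlice(log_files=[
    FileStatus("log.3", 900),
    FileStatus("log.1", 10),
    FileStatus("log.2", 120),
])
chosen = representative_file(log_only_slice)
print(chosen.path, chosen.size)
```

With an arbitrary pick, this slice could be reported as 10 bytes or 900 bytes depending on which log file happens to come first; the median pick is stable with respect to file order.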
Spark merges multiple PartitionedFiles into a FilePartition based on the size of each PartitionedFile.

I think that when a PartitionedFile actually stands for a whole FileSlice, we should report a more representative file size.
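The effect of the reported size can be illustrated with a simplified Python sketch of size-based packing, in the spirit of what Spark does when it groups files into partitions. This is an approximation for illustration only: `pack_files` is a hypothetical helper, and the sizes, the 128 split limit, and the open cost of 4 (all in MB) are made-up numbers rather than values taken from this issue.

```python
from typing import List


def pack_files(sizes: List[int], max_split_bytes: int, open_cost_bytes: int) -> List[List[int]]:
    """Greedily pack file sizes into partitions, largest files first."""
    partitions: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in sorted(sizes, reverse=True):
        # Close the current partition when the next file would overflow it.
        if current and current_size + size > max_split_bytes:
            partitions.append(current)
            current = []
            current_size = 0
        current.append(size)
        # Each file also adds a fixed "open cost" to the partition's size.
        current_size += size + open_cost_bytes
    if current:
        partitions.append(current)
    return partitions


# Four file slices that each hold about 100 MB of data in total,
# but are each reported through one tiny 1 MB log file.
reported = pack_files([1, 1, 1, 1], max_split_bytes=128, open_cost_bytes=4)
# The same four slices reported with a size closer to their real volume.
realistic = pack_files([100, 100, 100, 100], max_split_bytes=128, open_cost_bytes=4)
print(len(reported), len(realistic))
```

Under these assumptions, the under-reported slices all land in a single partition, while the realistic sizes spread them over four, which is why the size attached to each PartitionedFile matters for read parallelism.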