[SUPPORT] `PartitionedFile`'s size estimation in `FileSourceScanExec#createReadRDD` when `NewHoodieParquetFileFormat` is enabled #12139

**Describe the problem you faced**

I have a question that came up while reading the source code of `NewHoodieParquetFileFormat` and `HoodieFileIndex`.

When `hoodie.datasource.read.use.new.parquet.file.format` is enabled, Hudi provides a `HadoopFsRelation` backed by `NewHoodieParquetFileFormat`.
<img width="919" alt="image" src="https://github.com/user-attachments/assets/8d92d609-bede-4db8-8ab8-d691d625c487">
<img width="1139" alt="image" src="https://github.com/user-attachments/assets/98b19654-3c46-4105-b3c8-543f076379c7">
`FileSourceScanExec#createReadRDD` then lists the partitions it needs to read.
<img width="714" alt="image" src="https://github.com/user-attachments/assets/4efe0f66-e201-4d46-8400-08618a7ab5a1">
In this case, the call to `relation.location.listFiles` is dispatched to `HoodieFileIndex#listFiles`.
<img width="1091" alt="image" src="https://github.com/user-attachments/assets/64cb5bf8-aac7-407b-93d7-50e2153ac023">
I found that each file slice is turned into a single `PartitionedFile`:
1. If a base file is present, the `PartitionedFile` is built from the base file's status.
2. If the log files are non-empty, the `PartitionedFile` is built from the status of an arbitrary (possibly random) log file.
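To make the two cases above concrete, here is a minimal Python model of the selection as I understand it. It is not Hudi code: `FileSlice`, `reported_size`, and `actual_size` are hypothetical names, the log-file case is assumed to apply only when there is no base file, and "any log file" is modelled as simply the first one in the list.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FileSlice:
    """Hypothetical stand-in for a Hudi file slice; all sizes are in bytes."""
    base_file_size: Optional[int] = None
    log_file_sizes: List[int] = field(default_factory=list)


def reported_size(fs: FileSlice) -> Optional[int]:
    """Size of the single file chosen to represent the whole slice."""
    if fs.base_file_size is not None:
        # Case 1: the base file represents the slice.
        return fs.base_file_size
    if fs.log_file_sizes:
        # Case 2 (assumption): "any" log file, modelled here as the first one.
        return fs.log_file_sizes[0]
    return None


def actual_size(fs: FileSlice) -> int:
    """Total bytes a reader would touch for this slice."""
    return (fs.base_file_size or 0) + sum(fs.log_file_sizes)


log_only = FileSlice(log_file_sizes=[1, 50, 200])
print(reported_size(log_only), actual_size(log_only))
```

For a log-only slice, the size Spark sees depends entirely on which log file happens to be picked, and in every case the log files' bytes are not part of the reported size.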

My question is: instead of an arbitrary log file, should we pick the log file of median size (with the log files sorted by size) as the `PartitionedFile`?
Spark packs multiple `PartitionedFile`s into a `FilePartition` based on the size of each `PartitionedFile`.
<img width="929" alt="image" src="https://github.com/user-attachments/assets/3e09e0be-7c2b-458c-9b44-99b6232d62b0">
 
<img width="693" alt="image" src="https://github.com/user-attachments/assets/46c43f87-1bbc-433e-a6ec-a1fb600c6452">
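The size-based packing shown in the screenshots can be sketched roughly as follows. This is a simplified Python model, not Spark's code: it assumes files are not split, that they are visited largest first, and that `max_split_bytes` and `open_cost_in_bytes` are plain integers (in Spark they are derived from settings such as `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes`). The function name `pack_file_partitions` is hypothetical.

```python
from typing import List


def pack_file_partitions(sizes: List[int],
                         max_split_bytes: int,
                         open_cost_in_bytes: int) -> List[List[int]]:
    """Greedily group reported file sizes into partitions (simplified model)."""
    partitions: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    # Largest files first, then fill each partition until the next file
    # would push it past max_split_bytes.
    for size in sorted(sizes, reverse=True):
        if current and current_size + size > max_split_bytes:
            partitions.append(current)
            current = []
            current_size = 0
        # Every file also contributes a fixed open cost to the partition.
        current_size += size + open_cost_in_bytes
        current.append(size)
    if current:
        partitions.append(current)
    return partitions


# Reported sizes that reflect the real data volume are spread over tasks.
print(pack_file_partitions([100, 60, 40, 10], 128, 4))
# Four slices that each report a tiny log file all land in one partition,
# no matter how much data the slices actually contain.
print(pack_file_partitions([1, 1, 1, 1], 128, 4))
```

Because the packing only ever sees the reported length, a file slice represented by an unusually small log file looks cheap, and many such slices can end up in the same task.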

I think that when a `PartitionedFile` actually stands for a whole file slice, we should report a more representative file size.
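As a rough illustration of what "more representative" could mean, the sketch below compares the median log file (the option asked about above) with the total size of the slice. The total is not something proposed in this issue; it is included only as a reference point for comparison. Both helper names are hypothetical, and `median_low` is used so that the result is always the size of a file that really exists.

```python
from statistics import median_low
from typing import List, Optional


def median_log_size(log_file_sizes: List[int]) -> int:
    """Size of the median log file (lower median for an even count)."""
    return median_low(log_file_sizes)


def total_slice_size(base_file_size: Optional[int],
                     log_file_sizes: List[int]) -> int:
    """Reference point only: base file plus all log files."""
    return (base_file_size or 0) + sum(log_file_sizes)


logs = [1, 50, 200]
print(median_log_size(logs))         # a mid-sized log file instead of an arbitrary one
print(total_slice_size(None, logs))  # the full volume the reader has to merge
```

The median removes the worst outliers of an arbitrary pick, but it still describes one file rather than the slice as a whole.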




