Description
Apache Iceberg version
1.10.1 (latest release)
Query engine
Spark
Please describe the bug 🐞
Hi, team!
The SparkSchemaUtil.estimateSize method estimates table size from each field type's default size multiplied by the number of rows, which can differ significantly from the actual on-disk size.
Is there room to improve this estimate?
iceberg/spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/SparkSchemaUtil.java
Line 339 in eb460a5
public static long estimateSize(StructType tableSchema, long totalRecords) {
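To illustrate why this can be far off, here is a minimal, self-contained sketch of the default-size-times-rows approach. The class, the helper names, and the per-type byte sizes (e.g. 8 for long, 20 for string, mirroring Spark's DataType.defaultSize) are assumptions for illustration, not Iceberg's actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EstimateSketch {
    // Assumed per-type default sizes for illustration only
    // (roughly mirroring Spark's DataType.defaultSize).
    static final Map<String, Integer> DEFAULT_SIZES = new LinkedHashMap<>();
    static {
        DEFAULT_SIZES.put("long", 8);
        DEFAULT_SIZES.put("string", 20);
    }

    // Estimate = sum of per-field default sizes * row count.
    static long estimateSize(String[] fieldTypes, long totalRecords) {
        long rowSize = 0;
        for (String t : fieldTypes) {
            rowSize += DEFAULT_SIZES.getOrDefault(t, 16);
        }
        return rowSize * totalRecords;
    }

    public static void main(String[] args) {
        // A (long, string) schema with 1M rows is estimated at 28 MB,
        // no matter how long the actual string values are, and with no
        // credit for compression or encoding in the Parquet files.
        System.out.println(estimateSize(new String[] {"long", "string"}, 1_000_000L)); // 28000000
    }
}
```

The estimate ignores both variable-length data (long strings are undercounted) and columnar compression (small files are overcounted), so it can be wrong in either direction.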
I discovered this issue because Spark produces different execution plans when querying the same data as a raw Parquet source versus as an Iceberg table.
By default, Spark's size-only statistics visitor derives a plan's size-in-bytes from the underlying file size, scaled by the ratio of row sizes:
https://github.com/apache/spark/blob/10dd228d4c09166c2cb744cb0e3e7f15385afae0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L34
Would it be possible to use the manifest files to obtain more accurate statistics?
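A possible shape for this, sketched below: manifests already record each data file's size, so a scan can sum them without touching the data. The Iceberg classes and calls used (Table.newScan().planFiles(), FileScanTask.file().fileSizeInBytes()) are real public APIs, but this is an untested outline of the idea, not a proposed patch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

public class ManifestSizeEstimate {
    // Sum on-disk data file sizes from manifest metadata for the current snapshot.
    static long estimateSizeFromManifests(Table table) {
        long totalBytes = 0;
        try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
            for (FileScanTask task : tasks) {
                // fileSizeInBytes() comes straight from the manifest entry,
                // so this reflects real (compressed) file sizes.
                totalBytes += task.file().fileSizeInBytes();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return totalBytes;
    }
}
```

One design question would be whether to return the raw compressed size or rescale it toward an in-memory size, since Spark's cost model expects sizes comparable to its own defaultSize-based numbers.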
Willingness to contribute
- I can contribute a fix for this bug independently
- I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- I cannot contribute a fix for this bug at this time