The estimated table size is inaccurate #15684

@lurnagao-dahua

Description

Apache Iceberg version

1.10.1 (latest release)

Query engine

Spark

Please describe the bug 🐞

Hi, team!
The SparkSchemaUtil.estimateSize method calculates the table size from the default size of each field type multiplied by the number of rows, which may differ significantly from the actual size.
May I ask whether there are any areas that could be improved?

public static long estimateSize(StructType tableSchema, long totalRecords) {

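For illustration, here is a minimal, self-contained sketch of how an estimate of this shape behaves (the field sizes and the helper below are hypothetical stand-ins, not Iceberg's actual defaults): the per-row width is fixed by the schema, so compression and the real lengths of variable-width values in the data files never enter the result.

```java
public class SchemaEstimateSketch {
    // Hypothetical per-field default sizes in bytes; the real defaults come
    // from the type definitions and may differ.
    static long estimateSize(long[] defaultFieldSizes, long totalRecords) {
        long rowWidth = 0;
        for (long size : defaultFieldSizes) {
            rowWidth += size;
        }
        // Estimate = fixed row width * row count: compression, encoding, and
        // actual string lengths in the data files do not affect the result.
        return rowWidth * totalRecords;
    }

    public static void main(String[] args) {
        // e.g. a schema of (long id, string name), assumed 8 + 20 bytes per row
        System.out.println(estimateSize(new long[] {8, 20}, 1_000_000L));
    }
}
```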
We discovered this issue because Spark produces different execution plans when querying a Parquet-source table versus the same data as an Iceberg table.
Spark derives per-column size fractions from the total file size:
https://github.com/apache/spark/blob/10dd228d4c09166c2cb744cb0e3e7f15385afae0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L34

May I ask whether it would be possible to use the manifest files for more accurate statistics?
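As a hedged sketch of the idea: Iceberg manifests record each data file's size (fileSizeInBytes), so summing those values over the files of a scan would reflect actual on-disk size; in the real API this would iterate something like table.newScan().planFiles() and read task.file().fileSizeInBytes(). The class below only models the summation with plain values and is not the actual Iceberg code.

```java
import java.util.List;

public class ManifestSizeEstimate {
    // Sum of per-file sizes, as a manifest-based estimate would compute them.
    // Plain longs stand in for the fileSizeInBytes values that Iceberg
    // records per data file in its manifest entries.
    static long estimateFromFileSizes(List<Long> fileSizesInBytes) {
        return fileSizesInBytes.stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        // e.g. three Parquet data files of 128 MiB, 64 MiB, and 32 MiB
        List<Long> sizes = List.of(128L << 20, 64L << 20, 32L << 20);
        System.out.println(estimateFromFileSizes(sizes));
    }
}
```

Since these sizes are already tracked in metadata, the estimate would need no extra file I/O beyond reading the manifests.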

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time

Metadata

Labels

bug (Something isn't working)