Skip to content

Optimize re-partitioning for satisfiable schema evolution to minimize creation of too many small files. #16514

@mukund-thakur

Description

@mukund-thakur

Feature Request / Improvement

Problem:
Let’s assume we have a large existing Iceberg table which is currently partitioned by month. After a few years, we would like to evolve the partition to have month and day as well. And now we want to rewrite all old data files using the current partition spec which is (month, day). As per the current algorithm, all the old partition spec data files are grouped in a single big group. If the data to be partitioned is large, a user will want to enable partial-progress. But then, the files are randomly split into multiple spark jobs. Thus, one partition gets processed in multiple spark jobs, which leads to small files in the resulting partition. These small files often require yet another round of compaction.

Why it creates small files:
Suppose there are 15TB of old spec data files. It will get broken into 150 spark shuffle jobs each processing 100GB of data. As the files are random in each group, every job can write files to all new partitions thus potentially leading to max of 150 files in each output spec partition.

Current Solution:
We have to run a separate compaction job to reduce the number of output files in each output partition.

Proposed Solution:
We can optimize the algorithm to create smaller groups of files per old partition even for older spec files if the current spec satisfies the older spec. By satisfies, we mean whether the new partition spec has the same ordering as the old partition spec. For example, the new partition by day on a timestamp field satisfies the old partition by month on the same timestamp field but vice-versa is not true.

Query engine

None

Willingness to contribute

  • I can contribute this improvement/feature independently
  • I would be willing to contribute this improvement/feature with guidance from the Iceberg community
  • I cannot contribute this improvement/feature at this time

Metadata

Metadata

Assignees

No one assigned

    Labels

    improvementPR that improves existing functionality

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions