Optimize re-partitioning for satisfiable schema evolution to minimize creation of too many small files.

### Feature Request / Improvement

**Problem**: 
Let’s assume we have a large existing Iceberg table which is currently partitioned by month. After a few years, we would like to evolve the partition to have month and day as well. And now we want to rewrite all old data files using the current partition spec which is (month, day). As per the current algorithm, all the old partition spec data files are grouped in a single big group.  If the data to be partitioned is large, a user will want to enable partial-progress. But then, the files are randomly split into multiple spark jobs.  Thus, one partition gets processed in multiple spark jobs, which leads to small files in the resulting partition.  These small files often require yet another round of compaction.


**Why it creates small files:**
Suppose there are 15TB of old spec data files. It will get broken into 150 spark shuffle jobs each processing 100GB of data. As the files are random in each group, every job can write files to all new partitions thus potentially leading to max of 150 files in each output spec partition. 

**Current Solution:** 
We have to run a separate compaction job to reduce the number of output files in each output partition. 

**Proposed Solution:**
We can optimize the algorithm to create smaller groups of files per old partition even for older spec files if the current spec satisfies the older spec. By satisfies, we mean whether the new partition spec has the same ordering as the old partition spec. For example, the new partition by day on a timestamp field satisfies the old partition by month on the same timestamp field but vice-versa is not true. 



### Query engine

None

### Willingness to contribute

- [ ] I can contribute this improvement/feature independently
- [x] I would be willing to contribute this improvement/feature with guidance from the Iceberg community
- [ ] I cannot contribute this improvement/feature at this time

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize re-partitioning for satisfiable schema evolution to minimize creation of too many small files. #16514

Feature Request / Improvement

Query engine

Willingness to contribute

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Optimize re-partitioning for satisfiable schema evolution to minimize creation of too many small files. #16514

Description

Feature Request / Improvement

Query engine

Willingness to contribute

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions