Feature Request / Improvement
Problem:
Assume we have a large existing Iceberg table that is currently partitioned by month. After a few years, we want to evolve the partition spec to (month, day), and then rewrite all old data files using that current spec. With the current algorithm, all data files written under the old partition spec are placed in a single big group. If the data to be rewritten is large, a user will want to enable partial-progress. But then the files are split arbitrarily across multiple Spark jobs. As a result, one output partition gets written by multiple Spark jobs, which leads to small files in that partition. These small files often require yet another round of compaction.
Why it creates small files:
Suppose there are 15 TB of old-spec data files. They will be split into 150 Spark shuffle jobs, each processing 100 GB of data. Because the files in each group are chosen arbitrarily, every job can write files to every new partition, potentially producing up to 150 files in each output partition.
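The arithmetic above can be sketched as follows. This is only an illustration of the numbers in this issue, not Iceberg code; the function names are hypothetical, and decimal units (1 TB = 1000 GB) are assumed.

```python
# Illustrative arithmetic only; names are hypothetical, not Iceberg APIs.
GB = 1000 ** 3  # assumption: decimal units
TB = 1000 ** 4


def num_file_groups(total_bytes: int, max_group_bytes: int) -> int:
    """Number of file groups (one Spark job each) needed to cover the data."""
    return -(-total_bytes // max_group_bytes)  # ceiling division


def worst_case_files_per_partition(num_groups: int) -> int:
    """If files are assigned to groups arbitrarily, every group may hold rows
    for every output partition, so each partition can receive at least one
    file from every group."""
    return num_groups


groups = num_file_groups(15 * TB, 100 * GB)
print(groups)                                  # 150
print(worst_case_files_per_partition(groups))  # 150
```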
Current Solution:
We have to run a separate compaction job to reduce the number of output files in each output partition.
Proposed Solution:
We can optimize the algorithm to create smaller file groups, one per old partition, even for files written under an older spec, provided the current spec satisfies the older spec. By "satisfies", we mean that the new partition spec has the same ordering as the old partition spec. For example, partitioning by day on a timestamp field satisfies partitioning by month on the same timestamp field, because every day falls within exactly one month, but the reverse is not true.
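A minimal sketch of the proposed grouping, assuming only time-based transforms and a simplified model in which a spec is a list of (source field, transform) pairs. All names here are hypothetical and do not correspond to Iceberg's actual classes or planner.

```python
from collections import defaultdict

# Assumption: only time transforms are modeled, ordered coarse to fine.
GRANULARITY = {"year": 0, "month": 1, "day": 2, "hour": 3}


def field_satisfies(new_field, old_field):
    """A new (source, transform) pair satisfies an old one if it is on the
    same source column and is at least as fine-grained."""
    new_source, new_transform = new_field
    old_source, old_transform = old_field
    return (
        new_source == old_source
        and GRANULARITY[new_transform] >= GRANULARITY[old_transform]
    )


def spec_satisfies(new_spec, old_spec):
    """The new spec satisfies the old spec if every old partition field is
    satisfied by some field of the new spec."""
    return all(
        any(field_satisfies(n, o) for n in new_spec) for o in old_spec
    )


def plan_groups(files, new_spec, old_spec):
    """files: list of (path, old_partition_value).
    Group by old partition when the new spec satisfies the old one;
    otherwise fall back to a single big group."""
    if not spec_satisfies(new_spec, old_spec):
        return {"all": [path for path, _ in files]}
    groups = defaultdict(list)
    for path, old_partition in files:
        groups[old_partition].append(path)
    return dict(groups)


old_spec = [("ts", "month")]
new_spec = [("ts", "month"), ("ts", "day")]
files = [("a.parquet", "2020-01"), ("b.parquet", "2020-01"), ("c.parquet", "2020-02")]
print(plan_groups(files, new_spec, old_spec))
```

With this grouping, each output (month, day) partition is written by exactly one group, since all files for a given month are processed together.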
Query engine
None
Willingness to contribute