Hi,
I used to run my own auto-compaction method on a legacy system.
It works by calculating the total file size of each Hive partition and consolidating the files within that partition.
Example:
A partition holds 1,000 files of roughly 1 MB each.
The sum is 1 GB; the method divides the sum by 128 MB and takes the ceiling, which is 8 in this case, then repartitions the data into 8 files.
After compaction, the new total size is much smaller than 1 GB — it might be 700 MB.
So I needed to run the function recursively until the partition reached the proper file size (generally two or three passes).
My question is: how does Delta OPTIMIZE deal with this issue?
Thank you.
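For reference, the sizing logic described above can be sketched in a few lines of plain Python (the 128 MB target and the function name `target_partitions` are my own illustration, not code from the post):

```python
import math

TARGET_BYTES = 128 * 1024 * 1024  # 128 MB target file size, as in the post

def target_partitions(total_bytes: int) -> int:
    """Number of output files for a partition: ceil(total size / 128 MB)."""
    return max(1, math.ceil(total_bytes / TARGET_BYTES))

# Example from the post: 1,000 files of ~1 MB each, i.e. a 1 GB partition
first_pass = target_partitions(1000 * 1024 * 1024)   # ceil(1024 / 128) = 8

# After compaction the partition shrank to ~700 MB, so a second pass
# would choose a different file count — hence the recursion.
second_pass = target_partitions(700 * 1024 * 1024)   # ceil(700 / 128) = 6
```

In Spark this count would then feed something like `df.repartition(n)` before the write, and the shrinking total (from better compression of larger files) is what forces the repeated passes.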
ugurkalkavan changed the title to "How does Optimize decide the File Size (Question)" on Jun 14, 2024.
OPTIMIZE targets 1 GB files
Delta Lake targets 1 GB files when OPTIMIZE is run. This can be configured by setting the spark.databricks.delta.optimize.maxFileSize property.
Delta OPTIMIZE uses bin packing to compact the files. In simple terms:
- filter to only the files smaller than maxFileSize (1 GB by default)
- sequentially add them to "bins" until each bin reaches ~1 GB
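The two steps above can be sketched in plain Python. This is a minimal illustration of the greedy bin-packing idea, not Delta's actual implementation (which lives in Scala and works on file metadata objects rather than raw sizes):

```python
def bin_pack(file_sizes, max_file_size=1024 ** 3):
    """Greedy bin packing over a list of file sizes in bytes.

    Files already at or above max_file_size are left alone; the rest are
    filled into bins sequentially until a bin reaches ~max_file_size.
    Each resulting bin is rewritten as one compacted file.
    """
    # Step 1: only consider files smaller than the target (default 1 GB)
    candidates = [s for s in file_sizes if s < max_file_size]

    bins, current, current_size = [], [], 0
    for size in candidates:
        # Step 2: close the bin once adding the next file would overshoot
        if current and current_size + size > max_file_size:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

mb = 1024 ** 2
# Five 400 MB files pack into bins of [400+400], [400+400], [400]
layout = bin_pack([400 * mb] * 5)
```

Because the bins are sized from the input files' total bytes, a compacted output file can still come out somewhat below the target once written, but it is no longer a candidate for repeated small-file passes the way a 1 MB file is.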