How does Optimize decide the File Size (Question) #3272

Open
ugurkalkavan opened this issue Jun 14, 2024 · 1 comment

Comments

@ugurkalkavan

Hi,
I used to use my own auto-compaction method on a legacy system.
Basically, it calculates the total file size for every Hive partition and consolidates the files in each partition.

Example:
For a partition, there are 1,000 files of around 1 MB each.
The sum is about 1 GB. The method divides the sum by 128 MB and takes the ceiling, which is 8 in our case, so it repartitions the data into 8 files.

After compaction, the new total size is much less than 1 GB; it might be 700 MB.
So I needed to run the function recursively until the files reached the proper size (generally two or three times).
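
To make the arithmetic concrete, here is a minimal sketch of the file-count calculation described above. The function name `target_file_count` and the byte-based inputs are hypothetical, chosen only for illustration; the real method presumably reads file sizes from the file system.

```python
import math

MB = 1024 * 1024
TARGET_FILE_BYTES = 128 * MB  # target size per output file, as in the example


def target_file_count(file_sizes_bytes, target_bytes=TARGET_FILE_BYTES):
    """Return how many output files a partition should be repartitioned into.

    Hypothetical helper: sums the input file sizes, divides by the target
    size, and rounds up, with a floor of one file.
    """
    total = sum(file_sizes_bytes)
    return max(1, math.ceil(total / target_bytes))


# 1,000 files of 1 MB each: 1000 / 128 = 7.8125, rounded up to 8.
print(target_file_count([1 * MB] * 1000))  # 8

# If the compacted data shrinks to about 700 MB, the same formula gives 6,
# which is why a second pass was needed.
print(target_file_count([700 * MB]))  # 6
```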

My question is: how does Delta OPTIMIZE deal with this issue?

Thank you.

ugurkalkavan changed the title from "How does Optimize decide the File Size" to "How does Optimize decide the File Size (Question)" on Jun 14, 2024

rishabhchaudha commented Jul 17, 2024

Delta Lake targets 1 GB files when OPTIMIZE is run. This can be configured by setting the spark.databricks.delta.optimize.maxFileSize property.
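
As a rough illustration of setting that property, the sketch below builds the key/value pair for a 128 MB target. The helper name `optimize_max_file_size_conf` is hypothetical, and it assumes the property takes a size in bytes; the commented usage assumes an existing SparkSession named `spark`.

```python
from typing import Tuple

# Property name as given above.
MAX_FILE_SIZE_KEY = "spark.databricks.delta.optimize.maxFileSize"


def optimize_max_file_size_conf(target_mb: int) -> Tuple[str, str]:
    """Return the (key, value) pair for a target file size given in MB.

    Hypothetical helper; assumes the property expects a value in bytes.
    """
    return MAX_FILE_SIZE_KEY, str(target_mb * 1024 * 1024)


key, value = optimize_max_file_size_conf(128)
print(key, value)

# With an existing SparkSession named `spark` (an assumption here):
# spark.conf.set(key, value)
```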

Delta OPTIMIZE uses bin packing to compact the files. In simple terms:

  1. Filter the files, keeping only those smaller than maxFileSize (default 1 GB).
  2. Sequentially add them to "bins" until a bin reaches ~1 GB.
  3. Every time a bin would overflow, start a new bin.
  4. This happens per partition.
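
The steps above can be sketched roughly as follows. This is a simplified illustration of sequential bin packing, not Delta Lake's actual implementation; the names `bin_pack` and `plan_compaction` are hypothetical, and file sizes are plain integers in bytes.

```python
from typing import Dict, List

ONE_GB = 1024 ** 3


def bin_pack(file_sizes: List[int], max_file_size: int = ONE_GB) -> List[List[int]]:
    """Group file sizes into bins whose totals do not exceed max_file_size."""
    # Step 1: keep only files smaller than the target size.
    candidates = [size for size in file_sizes if size < max_file_size]

    bins: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in candidates:
        # Step 3: if adding this file would overflow the bin, start a new one.
        if current and current_size + size > max_file_size:
            bins.append(current)
            current, current_size = [], 0
        # Step 2: add files to the current bin in order.
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins


def plan_compaction(files_by_partition: Dict[str, List[int]],
                    max_file_size: int = ONE_GB) -> Dict[str, List[List[int]]]:
    """Step 4: run the bin packing independently for each partition."""
    return {partition: bin_pack(sizes, max_file_size)
            for partition, sizes in files_by_partition.items()}


# Small numbers for readability: target size 1000, one file already too large.
print(bin_pack([400, 400, 400, 1200, 100], max_file_size=1000))
# [[400, 400], [400, 100]]
```

Each bin would then be rewritten as a single output file, so the file sizes come from the packed input sizes rather than from a fixed file count.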

Reference: https://delta.io/blog/2023-01-25-delta-lake-small-file-compaction-optimize/


3 participants
@ugurkalkavan @rishabhchaudha and others