
@dotsering dotsering commented Dec 23, 2025

What changes were proposed in this pull request?

I am proposing to add a new function to the Dataset class to address the small-file problem. We have noticed that when the source data our Spark job reads contains many small files (kilobytes in size), Spark creates a very large number of partitions. This PR adds a new function named optimizePartition which, when called, creates partitions of 128 MB if no size is passed; you can also pass your own desired partition size.
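The sizing rule described above can be sketched as follows. This is a minimal, hypothetical illustration in Python of how a target partition count could be derived from the total input size and a target partition size; the function name `target_partition_count` and the constant are assumptions for illustration, not the PR's actual implementation.

```python
# Hypothetical sketch of the sizing rule: choose a partition count so
# that each partition holds roughly `target_bytes` of input data.
DEFAULT_TARGET_BYTES = 128 * 1024 * 1024  # 128 MB default, as proposed

def target_partition_count(total_input_bytes: int,
                           target_bytes: int = DEFAULT_TARGET_BYTES) -> int:
    """Return the number of partitions needed so each is ~target_bytes."""
    if total_input_bytes <= 0:
        return 1
    # Ceiling division: round up so no partition exceeds the target size.
    return max(1, -(-total_input_bytes // target_bytes))

# e.g. 1 GB of tiny files collapses to 8 partitions of ~128 MB each
print(target_partition_count(1024 * 1024 * 1024))
```

With a rule like this, thousands of kilobyte-sized input splits collapse into a handful of well-sized partitions instead of one partition per small file.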

Why are the changes needed?

The changes are needed to solve the small-file problem. They also help reduce the number of files that get written back to the sink.

Does this PR introduce any user-facing change?

It does not change any existing Dataset feature or function; optimizePartition is a brand-new function.

How was this patch tested?

I have added a number of unit tests covering the scenario of many small partitions: when the function is called, it either coalesces to reduce the partition count, or repartitions to increase the partition count if the partitions are too large.
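The behaviour the tests cover can be sketched as a simple decision rule: reducing partitions can use coalesce (which avoids a shuffle), while increasing them requires repartition (which shuffles). This Python sketch is illustrative only; the function name `choose_operation` is an assumption and not part of the PR.

```python
# Hypothetical sketch of the coalesce-vs-repartition decision.
# `coalesce` merges partitions without a shuffle, so it is the cheaper
# choice when reducing the count; `repartition` performs a full shuffle
# and is required to increase the count.

def choose_operation(current_partitions: int, target_partitions: int) -> str:
    if target_partitions < current_partitions:
        return "coalesce"      # many small partitions -> merge without shuffle
    if target_partitions > current_partitions:
        return "repartition"   # oversized partitions -> split via shuffle
    return "noop"              # already at the target count

print(choose_operation(10_000, 8))  # many tiny partitions
print(choose_operation(2, 16))      # a few oversized partitions
```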

Was this patch authored or co-authored using generative AI tooling?

I wrote most of the code myself and used Gemini for some research along the way. This PR does not contain a large amount of code change.

@dotsering dotsering marked this pull request as ready for review December 23, 2025 02:32
@dotsering dotsering closed this Dec 24, 2025