
@dotsering dotsering commented Dec 23, 2025

What changes were proposed in this pull request?

I am proposing to add a new function to the Dataset class to address the small-file problem. We have noticed that when the source data our Spark job reads contains many small files (kilobytes in size), Spark creates a very large number of partitions. This PR adds a new function named optimizePartition which, when called, creates partitions of 128 MB if no size is passed; you can also pass your own desired partition size.
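The sizing rule described above can be sketched as follows. This is a minimal, hypothetical illustration in Python of how a target partition count could be derived from the total input size and a target partition size; the function name `target_partition_count` and the constant are assumptions for illustration, not the PR's actual implementation.

```python
# Hypothetical sketch of the sizing rule: choose a partition count so
# that each partition holds roughly `target_bytes` of input data.
DEFAULT_TARGET_BYTES = 128 * 1024 * 1024  # 128 MB default, as proposed

def target_partition_count(total_input_bytes: int,
                           target_bytes: int = DEFAULT_TARGET_BYTES) -> int:
    """Return the number of partitions needed so each is ~target_bytes."""
    if total_input_bytes <= 0:
        return 1
    # Ceiling division: round up so no partition exceeds the target size.
    return max(1, -(-total_input_bytes // target_bytes))

# e.g. 1 GB of tiny files collapses to 8 partitions of ~128 MB each
print(target_partition_count(1024 * 1024 * 1024))
```

With a rule like this, thousands of kilobyte-sized input splits collapse into a handful of well-sized partitions instead of one partition per small file.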

Why are the changes needed?

The changes are needed to solve the small-file problem. They also help reduce the number of files that get written back to the sink.

Does this PR introduce any user-facing change?

It does not change any existing Dataset feature or function; optimizePartition is a brand-new function.

How was this patch tested?

I have added a number of unit tests covering the scenario of many small partitions: when the function is called, it either coalesces to reduce the partition count, or repartitions to increase the partition count if the partitions are too large.
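The behaviour the tests cover can be sketched as a simple decision rule: reducing partitions can use coalesce (which avoids a shuffle), while increasing them requires repartition (which shuffles). This Python sketch is illustrative only; the function name `choose_operation` is an assumption and not part of the PR.

```python
# Hypothetical sketch of the coalesce-vs-repartition decision.
# `coalesce` merges partitions without a shuffle, so it is the cheaper
# choice when reducing the count; `repartition` performs a full shuffle
# and is required to increase the count.

def choose_operation(current_partitions: int, target_partitions: int) -> str:
    if target_partitions < current_partitions:
        return "coalesce"      # many small partitions -> merge without shuffle
    if target_partitions > current_partitions:
        return "repartition"   # oversized partitions -> split via shuffle
    return "noop"              # already at the target count

print(choose_operation(10_000, 8))  # many tiny partitions
print(choose_operation(2, 16))      # a few oversized partitions
```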

Was this patch authored or co-authored using generative AI tooling?

I wrote most of the code myself and used Gemini for some research along the way. This PR does not contain a large amount of code change.

@dotsering dotsering marked this pull request as ready for review December 23, 2025 02:32
@dotsering dotsering closed this Dec 24, 2025