New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-22412][SQL] Fix incorrect comment in DataSourceScanExec #19634
Conversation
Test build #3973 has finished for PR 19634 at commit
|
@@ -469,7 +469,7 @@ case class FileSourceScanExec( | |||
currentSize = 0 | |||
} | |||
|
|||
// Assign files to partitions using "First Fit Decreasing" (FFD) | |||
// Assign files to partitions using "Next Fit Decreasing" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liancheng do you agree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is correct.
LGTM |
Thanks! Merged to master. |
@gatorsmile I also wanted to discuss if we should consider other bin packing algorithms. According to this http://www.math.unl.edu/~s-sjessie1/203Handouts/Bin%20Packing.pdf, next fit decreasing is the least efficient of all but it is easiest to implement and has O(N) run time. |
@vgankidi Does it help the performance of our file reading? |
We will end up having fewer combined splits. That reduces the number of files that the job produces and also reduces the number of tasks in the downstream jobs. In some tests I have noticed about 10% reduction in the combined splits. However, the simple implementation of FFD has O(n^2) run time. |
Fewer combined splits might not matter in this case. |
What changes were proposed in this pull request?
Next fit decreasing bin packing algorithm is used to combine splits in DataSourceScanExec but the comment incorrectly states that first fit decreasing algorithm is used. The current implementation doesn't go back to a previously used bin other than the bin that the last element was put into.