[SPARK-22412][SQL] Fix incorrect comment in DataSourceScanExec #19634

Closed
wants to merge 1 commit

Conversation

@vgankidi commented Nov 1, 2017

What changes were proposed in this pull request?

The next fit decreasing bin packing algorithm is used to combine splits in DataSourceScanExec, but the comment incorrectly states that the first fit decreasing algorithm is used. The current implementation never goes back to a previously used bin other than the one the last element was put into.
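
As a refresher on the distinction, the combining loop behaves roughly like the sketch below. This is an illustrative Scala sketch under simplified assumptions, not the actual FileSourceScanExec code; the names FileSplit, nextFitDecreasing, and maxBytes are made up. The key property of next fit decreasing is that only the currently open partition is ever a candidate, so a closed partition is never revisited.

    // Illustrative sketch of next fit decreasing split combining.
    // Files are taken in decreasing size order; each file either goes into the
    // currently open partition or forces that partition to be closed.
    case class FileSplit(path: String, sizeInBytes: Long)

    def nextFitDecreasing(files: Seq[FileSplit], maxBytes: Long): Seq[Seq[FileSplit]] = {
      import scala.collection.mutable.ArrayBuffer
      val sorted = files.sortBy(-_.sizeInBytes)
      val partitions = ArrayBuffer.empty[Seq[FileSplit]]
      var current = ArrayBuffer.empty[FileSplit]
      var currentSize = 0L

      def closePartition(): Unit = {
        if (current.nonEmpty) partitions += current.toSeq
        current = ArrayBuffer.empty[FileSplit]
        currentSize = 0L
      }

      sorted.foreach { file =>
        // Only the most recently opened partition is considered; earlier
        // partitions are never revisited, which is what makes this "next fit".
        if (currentSize + file.sizeInBytes > maxBytes) closePartition()
        current += file
        currentSize += file.sizeInBytes
      }
      closePartition()
      partitions.toSeq
    }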

@SparkQA commented Nov 2, 2017

Test build #3973 has finished for PR 19634 at commit b8f38a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -469,7 +469,7 @@ case class FileSourceScanExec(
       currentSize = 0
     }

-    // Assign files to partitions using "First Fit Decreasing" (FFD)
+    // Assign files to partitions using "Next Fit Decreasing"
Member commented on the diff:

@liancheng do you agree?

Member replied:

This is correct.

@gatorsmile commented:

LGTM

@gatorsmile commented:

Thanks! Merged to master.

@asfgit closed this in f7f4e9c on Nov 4, 2017
@vgankidi (Author) commented Nov 8, 2017

@gatorsmile I also wanted to discuss whether we should consider other bin packing algorithms. According to http://www.math.unl.edu/~s-sjessie1/203Handouts/Bin%20Packing.pdf, next fit decreasing is the least efficient of these algorithms, but it is the easiest to implement and runs in O(N) time.
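
To make the difference concrete, here is an illustrative example (not from the PR): with a target partition size of 10 and file sizes 7, 6, 4, and 3 (already in decreasing order), next fit decreasing produces three partitions, [7], [6, 4], and [3], because the 3 cannot be placed back into the first partition once it has been closed, whereas first fit decreasing produces two, [7, 3] and [6, 4].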

@gatorsmile commented:

@vgankidi Does it help the performance of our file reading?

@vgankidi (Author) commented Nov 8, 2017

We will end up with fewer combined splits. That reduces the number of files the job produces and also reduces the number of tasks in downstream jobs. In some tests I have noticed about a 10% reduction in combined splits. However, the simple implementation of FFD has O(n^2) run time.
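
For reference, the simple O(n^2) variant being discussed looks roughly like the Scala sketch below. This is an illustrative sketch, not code from Spark or this PR; the names FileSplit, firstFitDecreasing, and maxBytes are made up. The inner scan over all open partitions for every file is what drives the quadratic cost.

    // Illustrative sketch of first fit decreasing: each file goes into the
    // first open partition that has room, and a new partition is opened only
    // when none fits. Scanning every open partition per file is O(n^2) overall.
    case class FileSplit(path: String, sizeInBytes: Long)

    def firstFitDecreasing(files: Seq[FileSplit], maxBytes: Long): Seq[Seq[FileSplit]] = {
      import scala.collection.mutable.ArrayBuffer
      val sorted = files.sortBy(-_.sizeInBytes)
      val partitions = ArrayBuffer.empty[ArrayBuffer[FileSplit]]
      val sizes = ArrayBuffer.empty[Long]

      sorted.foreach { file =>
        // Linear scan for the first partition with enough remaining capacity.
        val idx = sizes.indexWhere(_ + file.sizeInBytes <= maxBytes)
        if (idx >= 0) {
          partitions(idx) += file
          sizes(idx) += file.sizeInBytes
        } else {
          partitions += ArrayBuffer(file)
          sizes += file.sizeInBytes
        }
      }
      partitions.map(_.toSeq).toSeq
    }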

@gatorsmile commented:

Fewer combined splits might not matter in this case.
