[SPARK-22412][SQL] Fix incorrect comment in DataSourceScanExec #19634

Closed
wants to merge 1 commit

Conversation

@vgankidi commented Nov 1, 2017

What changes were proposed in this pull request?

The next fit decreasing bin packing algorithm is used to combine splits in DataSourceScanExec, but the comment incorrectly states that the first fit decreasing algorithm is used. The current implementation never goes back to a previously used bin other than the one the last element was put into.
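
As a refresher on the distinction, the combining loop behaves roughly like the sketch below. This is an illustrative Scala sketch under simplified assumptions, not the actual FileSourceScanExec code; the names FileSplit, nextFitDecreasing, and maxBytes are made up. The key property of next fit decreasing is that only the currently open partition is ever a candidate, so a closed partition is never revisited.

    // Illustrative sketch of next fit decreasing split combining.
    // Files are taken in decreasing size order; each file either goes into the
    // currently open partition or forces that partition to be closed.
    case class FileSplit(path: String, sizeInBytes: Long)

    def nextFitDecreasing(files: Seq[FileSplit], maxBytes: Long): Seq[Seq[FileSplit]] = {
      import scala.collection.mutable.ArrayBuffer
      val sorted = files.sortBy(-_.sizeInBytes)
      val partitions = ArrayBuffer.empty[Seq[FileSplit]]
      var current = ArrayBuffer.empty[FileSplit]
      var currentSize = 0L

      def closePartition(): Unit = {
        if (current.nonEmpty) partitions += current.toSeq
        current = ArrayBuffer.empty[FileSplit]
        currentSize = 0L
      }

      sorted.foreach { file =>
        // Only the most recently opened partition is considered; earlier
        // partitions are never revisited, which is what makes this "next fit".
        if (currentSize + file.sizeInBytes > maxBytes) closePartition()
        current += file
        currentSize += file.sizeInBytes
      }
      closePartition()
      partitions.toSeq
    }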

@SparkQA commented Nov 2, 2017

Test build #3973 has finished for PR 19634 at commit b8f38a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -469,7 +469,7 @@ case class FileSourceScanExec(
       currentSize = 0
     }

-    // Assign files to partitions using "First Fit Decreasing" (FFD)
+    // Assign files to partitions using "Next Fit Decreasing"
Member commented on the diff:

@liancheng do you agree?

Member replied:

This is correct.

@gatorsmile commented:

LGTM

@gatorsmile commented:

Thanks! Merged to master.

@asfgit closed this in f7f4e9c on Nov 4, 2017
@vgankidi (Author) commented Nov 8, 2017

@gatorsmile I also wanted to discuss whether we should consider other bin packing algorithms. According to http://www.math.unl.edu/~s-sjessie1/203Handouts/Bin%20Packing.pdf, next fit decreasing is the least efficient of these algorithms, but it is the easiest to implement and runs in O(N) time.
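
To make the difference concrete, here is an illustrative example (not from the PR): with a target partition size of 10 and file sizes 7, 6, 4, and 3 (already in decreasing order), next fit decreasing produces three partitions, [7], [6, 4], and [3], because the 3 cannot be placed back into the first partition once it has been closed, whereas first fit decreasing produces two, [7, 3] and [6, 4].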

@gatorsmile commented:

@vgankidi Does it help the performance of our file reading?

@vgankidi (Author) commented Nov 8, 2017

We will end up with fewer combined splits. That reduces the number of files the job produces and also reduces the number of tasks in downstream jobs. In some tests I have noticed about a 10% reduction in combined splits. However, the simple implementation of FFD has O(n^2) run time.
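
For reference, the simple O(n^2) variant being discussed looks roughly like the Scala sketch below. This is an illustrative sketch, not code from Spark or this PR; the names FileSplit, firstFitDecreasing, and maxBytes are made up. The inner scan over all open partitions for every file is what drives the quadratic cost.

    // Illustrative sketch of first fit decreasing: each file goes into the
    // first open partition that has room, and a new partition is opened only
    // when none fits. Scanning every open partition per file is O(n^2) overall.
    case class FileSplit(path: String, sizeInBytes: Long)

    def firstFitDecreasing(files: Seq[FileSplit], maxBytes: Long): Seq[Seq[FileSplit]] = {
      import scala.collection.mutable.ArrayBuffer
      val sorted = files.sortBy(-_.sizeInBytes)
      val partitions = ArrayBuffer.empty[ArrayBuffer[FileSplit]]
      val sizes = ArrayBuffer.empty[Long]

      sorted.foreach { file =>
        // Linear scan for the first partition with enough remaining capacity.
        val idx = sizes.indexWhere(_ + file.sizeInBytes <= maxBytes)
        if (idx >= 0) {
          partitions(idx) += file
          sizes(idx) += file.sizeInBytes
        } else {
          partitions += ArrayBuffer(file)
          sizes += file.sizeInBytes
        }
      }
      partitions.map(_.toSeq).toSeq
    }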

@gatorsmile commented:

Fewer combined splits might not matter in this case.
