[SPARK-15420] [SQL] Add repartition and sort to prepare output data #13206
Conversation
This combines Hive's pre-insertion casts (without renames) that handle partitioning with the pre-insertion casts/renames in core. The combined rule, ResolveOutputColumns, will resolve columns by name or by position. Resolving by position detects cases where the number of columns is incorrect, or where the input columns are a permutation of the output columns, and fails the analysis. When resolving by name, each output column is located by name in the child plan. This handles cases where a subset of a data frame is written out.
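The two resolution modes described above can be illustrated with a minimal, standalone Python sketch. This is not Spark's `ResolveOutputColumns` rule itself; the function names and error messages are hypothetical, and columns are modeled as plain name lists.

```python
def resolve_by_position(input_cols, output_cols):
    # By-position resolution: fail when the column count is wrong, or when
    # the input columns look like a permutation of the output columns
    # (a likely sign of a mistaken column order).
    if len(input_cols) != len(output_cols):
        raise ValueError("number of columns is incorrect")
    if input_cols != output_cols and sorted(input_cols) == sorted(output_cols):
        raise ValueError("input columns are a permutation of the output columns")
    return list(zip(input_cols, output_cols))

def resolve_by_name(input_cols, output_cols):
    # By-name resolution: locate each output column by name in the child
    # plan. Extra input columns are ignored, which is what allows a subset
    # of a data frame to be written out.
    missing = [c for c in output_cols if c not in input_cols]
    if missing:
        raise ValueError("missing columns: %s" % missing)
    return [(c, c) for c in output_cols]
```

For example, `resolve_by_name(["a", "b", "c"], ["c", "a"])` succeeds even though the input has an extra column, while `resolve_by_position(["b", "a"], ["a", "b"])` fails with the permutation error.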
This PR now catches this problem during analysis and has a better error message. This commit updates the test for the new message and exception type.
Adding new arguments to InsertIntoTable requires changes to several files. Instead of adding a long list of optional args, this adds an options map, like the one passed to DataSource. Future options can be added and used only where they are needed.
This avoids an extra sort in the WriterContainer when data has already been sorted as part of the query plan. This fixes writes for both HadoopFsRelation and MetastoreRelation.
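The "skip the extra sort" check amounts to asking whether the ordering the query plan already guarantees satisfies the ordering the writer requires. A minimal Python sketch of that idea (the function name and list-of-column-names model are assumptions, not Spark's actual ordering classes):

```python
def needs_sort(required_ordering, child_ordering):
    # The writer's required ordering is satisfied when it is a prefix of
    # the ordering the incoming data already has; only then can the
    # WriterContainer skip its own sort.
    return child_ordering[:len(required_ordering)] != required_ordering
```

So data already sorted by `["bucket", "sort_col"]` needs no extra sort for a writer requiring `["bucket"]`, but unsorted data does.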
Test build #58915 has finished for PR 13206 at commit
This adds an optimizer rule that will add repartition and sort operations to the logical plan. Sort is added when the table has sort or bucketing columns. Repartition is added when writing columnar formats and the option "spark.sql.files.columnar.insertRepartition" is enabled. This also adds a `writersPerPartition(numTasks: Int)` option when writing that controls the number of files in each output table partition. The optimizer rule adds a repartition step that distributes output by partition and a random value in [0, numTasks).
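The distribution key described above, (partition values, random value in [0, numTasks)), is what bounds the number of files per output partition: each table partition's rows land in at most numTasks shuffle buckets. A small Python sketch of that key computation (the helper name is hypothetical; Spark computes this inside the added repartition step):

```python
import random

def output_bucket(partition_values, num_tasks, rng=None):
    # Distribute a row by its partition values plus a random value in
    # [0, num_tasks), so each table partition is spread across at most
    # num_tasks writer tasks, i.e. at most num_tasks files per partition.
    rng = rng or random.Random()
    return (tuple(partition_values), rng.randrange(num_tasks))
```

All rows sharing partition values map to one of only `num_tasks` distinct keys, which is the file-count bound `writersPerPartition` provides.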
(force-pushed from eed85ad to a64be8a)
Test build #58925 has finished for PR 13206 at commit
This may be fixed in #16898.
@rdblue, do you know if this is true?
@HyukjinKwon, that addresses part of what this patch does, but only for writes that go through FileFormatWriter. This patch works for Hive and adds an optimizer rule to add the sort instead of sorting in the writer, which I don't think is a great idea.
@rdblue What is the latest status of this PR?
We still maintain a version of this for our Spark builds to avoid an extra sort in Hive. If someone is willing to review it, I can probably find the time to rebase it on master. I think the year this sat initially was just because the 2.0 release was happening at the same time and there wasn't much bandwidth for reviews.
Test build #83583 has finished for PR 13206 at commit
Test build #95019 has finished for PR 13206 at commit
Test build #95964 has finished for PR 13206 at commit
@rdblue can this be merged?
@tooptoop4, this will be done in the DataSourceV2 work. I don't think that it is going to be done for v1 plans.
This is currently based on SPARK-14543 and includes its commits.
What changes were proposed in this pull request?
- Adds a sort to the plan when writing to Hive tables that have sort (sortBy) or bucketing columns.
- Adds DataFrameWriter#writersPerPartition(Int); when it is set, a repartition step is added to the plan. This enables users to easily control how many files per partition are created.

How was this patch tested?
WIP: adding tests.