
[SPARK-15420] [SQL] Add repartition and sort to prepare output data #13206

Closed

Conversation

@rdblue (Contributor) commented May 19, 2016

This is currently based on SPARK-14543 and includes its commits.

What changes were proposed in this pull request?

  • WriterContainer detects that the incoming logical plan is already sorted and skips sorting a second time when that sort matches the table's partitioning, bucketing, and sorting.
  • A local sort is added by the optimizer for CatalogTables that have bucket or sort columns. This implements sortBy for Hive tables.
  • Repartition and sort operations are added when DataFrameWriter#writersPerPartition(Int) is set, so users can easily control how many files are created per partition (see the usage sketch after this list).
  • Repartition and sort operations are added by the optimizer for columnar formats like Parquet when spark.sql.files.columnar.insertRepartition is enabled, provided they do not conflict with the query's existing distribution and ordering.
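As a usage sketch based on the description above (`writersPerPartition` is the method added by this patch and the config key is the one it introduces; neither exists in stock Spark, and `spark`/`df` are the usual session and DataFrame):

```scala
// Enable the optimizer-inserted repartition for columnar writes (Parquet).
spark.conf.set("spark.sql.files.columnar.insertRepartition", "true")

df.write
  .partitionBy("dt")          // "dt" is an illustrative partition column
  .writersPerPartition(4)     // at most 4 writers, so ~4 files, per partition
  .parquet("/path/to/output")
```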

How was this patch tested?

WIP: adding tests.

This combines Hive's pre-insertion casts (without renames) that handle
partitioning with the pre-insertion casts/renames in core. The combined
rule, ResolveOutputColumns, resolves columns either by name or by
position. Resolving by position fails when the number of columns is
incorrect or when the input columns are a permutation of the output
columns. When resolving by name, each output column is located by name
in the child plan, which handles cases where a subset of a DataFrame is
written out.
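A minimal sketch of the two strategies, using column names in place of Catalyst attributes (the real rule also inserts casts and aliases to match the output schema):

```scala
// By position: the i-th input column feeds the i-th output column.
// Fail fast on a wrong column count or on a likely-unintended permutation.
def resolveByPosition(input: Seq[String], output: Seq[String]): Seq[String] = {
  if (input.size != output.size) {
    sys.error(s"Wrong number of columns: expected ${output.size}, found ${input.size}")
  }
  if (input != output && input.sorted == output.sorted) {
    // A permutation of the output names almost certainly means the caller
    // wanted by-name resolution, so fail rather than silently miswrite data.
    sys.error("Input columns are a permutation of the output columns")
  }
  input
}

// By name: locate each output column in the child plan, so writing a
// subset or a reordering of a DataFrame's columns still lines up correctly.
def resolveByName(input: Seq[String], output: Seq[String]): Seq[String] =
  output.map { name =>
    input.find(_ == name).getOrElse(sys.error(s"Cannot find input column: $name"))
  }
```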
This PR now catches the column-mismatch problem during analysis and
produces a better error message. This commit updates the test for the
new message and exception type.
Adding new arguments to InsertIntoTable requires changes to several
files. Instead of adding a long list of optional args, this adds an
options map, like the one passed to DataSource. Future options can
be added and consumed only where they are needed.
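A sketch of that approach with illustrative field and option names (the actual InsertIntoTable signature in this patch may differ; strings stand in for the LogicalPlan children):

```scala
// Plan nodes carry a free-form options map, mirroring DataSource, so new
// writer options don't change the constructor signature.
case class InsertIntoTable(
    table: String,                          // stands in for the target plan
    partition: Map[String, Option[String]],
    child: String,                          // stands in for the query plan
    options: Map[String, String]) {

  // Each consumer reads just the options it understands.
  def writersPerPartition: Option[Int] =
    options.get("writersPerPartition").map(_.toInt)
}
```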
This avoids an extra sort in the WriterContainer when data has already
been sorted as part of the query plan. This fixes writes for both
HadoopFsRelation and MetastoreRelation.
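A minimal sketch of the skip-sort check, assuming simplified types (the real code compares Catalyst SortOrder expressions from the incoming plan against the ordering the table requires):

```scala
// The required ordering is partition columns, then bucket id, then sort
// columns. If the incoming plan's ordering already starts with it, the
// writer-side sort is redundant and can be skipped.
def needsWriterSort(required: Seq[String], incoming: Seq[String]): Boolean =
  required.nonEmpty && !incoming.startsWith(required)
```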
@SparkQA commented May 20, 2016

Test build #58915 has finished for PR 13206 at commit eed85ad.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class DistributeAndSortOutputData(conf: CatalystConf) extends Rule[LogicalPlan]

This adds an optimizer rule that will add repartition and sort
operations to the logical plan. Sort is added when the table has sort
or bucketing columns. Repartition is added when writing columnar formats
and the option "spark.sql.files.columnar.insertRepartition" is enabled.

This also adds a `writersPerPartition(numTasks: Int)` option when
writing that controls the number of files in each output table
partition. The optimizer rule adds a repartition step that distributes
output by partition and a random value in [0, numTasks).
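As a rough sketch of that distribution using public DataFrame APIs (the actual patch inserts this as an optimizer rule on the logical plan; "dt" is an illustrative partition column and `df` is the DataFrame being written):

```scala
import org.apache.spark.sql.functions.{col, floor, rand}

// Distribute rows by the table's partition columns plus a random bucket in
// [0, numTasks), so each output partition is written by about numTasks
// tasks and therefore produces about numTasks files.
val numTasks = 4
val distributed = df.repartition(col("dt"), floor(rand() * numTasks).cast("int"))
```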
@rdblue force-pushed the SPARK-15420-parquet-repartition-sort branch from eed85ad to a64be8a on May 20, 2016 01:00
@SparkQA commented May 20, 2016

Test build #58925 has finished for PR 13206 at commit a64be8a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class DistributeAndSortOutputData(conf: CatalystConf) extends Rule[LogicalPlan]

@Downchuck commented

May be fixed in #16898.

@HyukjinKwon (Member) commented

@rdblue do you know if that is true? ^

@rdblue (Contributor, Author) commented May 11, 2017

@HyukjinKwon, that addresses part of what this patch does, but only for writes that go through FileFormatWriter. This patch also works for Hive, and it adds an optimizer rule to insert the sort rather than sorting inside the writer, which I don't think is a great idea.

@gatorsmile (Member) commented

@rdblue What is the latest status of this PR?

@rdblue (Contributor, Author) commented Oct 28, 2017

We still maintain a version of this for our Spark builds to avoid an extra sort in Hive. If someone is willing to review it, I can probably find the time to rebase it on master. I think it initially sat for a year mostly because the 2.0 release was happening at the same time and there wasn't much bandwidth for reviews.

@SparkQA commented Nov 8, 2017

Test build #83583 has finished for PR 13206 at commit a64be8a.

  • This patch fails PySpark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • case class DistributeAndSortOutputData(conf: CatalystConf) extends Rule[LogicalPlan]

@SparkQA commented Aug 21, 2018

Test build #95019 has finished for PR 13206 at commit a64be8a.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • case class DistributeAndSortOutputData(conf: CatalystConf) extends Rule[LogicalPlan]

@SparkQA commented Sep 11, 2018

Test build #95964 has finished for PR 13206 at commit a64be8a.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • case class DistributeAndSortOutputData(conf: CatalystConf) extends Rule[LogicalPlan]

@rdblue closed this on Sep 19, 2018
@tooptoop4 (Contributor) commented

@rdblue can this be merged?

@rdblue (Contributor, Author) commented Jul 13, 2019

@tooptoop4, this will be done in the DataSourceV2 work. I don't think that it is going to be done for v1 plans.
