[SPARK-11914] [SQL] Support coalesce and repartition in Dataset APIs #9899

gatorsmile · 2015-11-23T03:45:52Z

This PR adds two common operations, `coalesce` and `repartition`, to the Dataset API.

After reading the comments on SPARK-9999, I am unclear about the plan for supporting repartitioning in the Dataset API. Currently, both the RDD and DataFrame APIs give users the flexibility to control the number of partitions.
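For illustration, here is a minimal sketch of how the two operations behave on a Dataset. It assumes a recent Spark release with `SparkSession` as the entry point (this PR targeted an earlier release, where the entry point differed); the object name `PartitionDemo`, the helper `partitionCounts`, and the partition counts 8, 2, and 16 are hypothetical choices for this example and are not part of the patch.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical demo object; not part of this PR.
object PartitionDemo {

  // Returns the partition counts observed after repartition(8),
  // then coalesce(2), and then coalesce(16) on the 8-partition Dataset.
  def partitionCounts(spark: SparkSession): (Int, Int, Int) = {
    import spark.implicits._ // brings the encoders needed by toDS() into scope

    val ds: Dataset[Int] = (1 to 1000).toDS()

    // repartition shuffles the data into exactly the requested number of partitions.
    val wide = ds.repartition(8)

    // coalesce merges existing partitions without a full shuffle.
    val narrow = wide.coalesce(2)

    // coalesce does not increase the partition count; asking for more
    // partitions than exist leaves the count unchanged.
    val unchanged = wide.coalesce(16)

    (wide.rdd.getNumPartitions,
     narrow.rdd.getNumPartitions,
     unchanged.rdd.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local two-thread master is enough for a demo run.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("partition-demo")
      .getOrCreate()
    try println(partitionCounts(spark))
    finally spark.stop()
  }
}
```

As a rule of thumb, `coalesce` is the cheaper choice when only reducing the number of partitions, while `repartition` is needed to increase it or to rebalance skewed data.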

Most traditional RDBMSs expose the number of partitions, the partitioning columns, and the table partitioning method to DBAs for performance tuning and storage planning. These parameters can significantly affect query performance. Since actual performance depends on the workload type, I think it is almost impossible to automatically discover the best partitioning strategy for every scenario.

Is the plan for the Dataset API to hide these operations from users? Feel free to reject my PR if it does not match the plan.

Thank you for your answers. @marmbrus @rxin @cloud-fan

SparkQA · 2015-11-23T06:02:59Z

Test build #46507 has finished for PR 9899 at commit db4accc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2015-11-24T23:53:30Z

Thanks - I'm merging this in.

This PR is to provide two common `coalesce` and `repartition` in Dataset APIs.

After reading the comments of SPARK-9999, I am unclear about the plan for supporting re-partitioning in Dataset APIs. Currently, both RDD APIs and Dataframe APIs provide users such a flexibility to control the number of partitions.

In most traditional RDBMS, they expose the number of partitions, the partitioning columns, the table partitioning methods to DBAs for performance tuning and storage planning. Normally, these parameters could largely affect the query performance. Since the actual performance depends on the workload types, I think it is almost impossible to automate the discovery of the best partitioning strategy for all the scenarios.

I am wondering if Dataset APIs are planning to hide these APIs from users? Feel free to reject my PR if it does not match the plan.

Thank you for your answers. marmbrus rxin cloud-fan

Author: gatorsmile <gatorsmile@gmail.com>

Closes #9899 from gatorsmile/coalesce.

(cherry picked from commit 238ae51)
Signed-off-by: Reynold Xin <rxin@databricks.com>

gatorsmile added 2 commits November 22, 2015 19:12

Support coalesce and repartition in Dataset APIs

41d3ade

style clean

db4accc

asfgit closed this in 238ae51 Nov 24, 2015

gatorsmile deleted the coalesce branch December 5, 2015 18:51
