Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-11914] [SQL] Support coalesce and repartition in Dataset APIs #9899

Closed
wants to merge 2 commits into from

Conversation

gatorsmile
Copy link
Member

This PR is to provide two common coalesce and repartition in Dataset APIs.

After reading the comments of SPARK-9999, I am unclear about the plan for supporting re-partitioning in Dataset APIs. Currently, both RDD APIs and Dataframe APIs provide users such a flexibility to control the number of partitions.

In most traditional RDBMS, they expose the number of partitions, the partitioning columns, the table partitioning methods to DBAs for performance tuning and storage planning. Normally, these parameters could largely affect the query performance. Since the actual performance depends on the workload types, I think it is almost impossible to automate the discovery of the best partitioning strategy for all the scenarios.

I am wondering if Dataset APIs are planning to hide these APIs from users? Feel free to reject my PR if it does not match the plan.

Thank you for your answers. @marmbrus @rxin @cloud-fan

@SparkQA
Copy link

SparkQA commented Nov 23, 2015

Test build #46507 has finished for PR 9899 at commit db4accc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Nov 24, 2015

Thanks - I'm merging this in.

asfgit pushed a commit that referenced this pull request Nov 24, 2015
This PR is to provide two common `coalesce` and `repartition` in Dataset APIs.

After reading the comments of SPARK-9999, I am unclear about the plan for supporting re-partitioning in Dataset APIs. Currently, both RDD APIs and Dataframe APIs provide users such a flexibility to control the number of partitions.

In most traditional RDBMS, they expose the number of partitions, the partitioning columns, the table partitioning methods to DBAs for performance tuning and storage planning. Normally, these parameters could largely affect the query performance. Since the actual performance depends on the workload types, I think it is almost impossible to automate the discovery of the best partitioning strategy for all the scenarios.

I am wondering if Dataset APIs are planning to hide these APIs from users? Feel free to reject my PR if it does not match the plan.

Thank you for your answers. marmbrus rxin cloud-fan

Author: gatorsmile <gatorsmile@gmail.com>

Closes #9899 from gatorsmile/coalesce.

(cherry picked from commit 238ae51)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@asfgit asfgit closed this in 238ae51 Nov 24, 2015
@gatorsmile gatorsmile deleted the coalesce branch December 5, 2015 18:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants