[SPARK-19426][SQL] Custom coalescer for Dataset #18861

Closed; 5 commits proposed for merge.


maropu
Member

@maropu maropu commented Aug 6, 2017

What changes were proposed in this pull request?

This PR adds a new coalesce API to Dataset: users can specify a custom coalescer that reduces an input Dataset to fewer partitions. The coalescer implementation is the same as the one in RDD#coalesce, added in #11865 (SPARK-14042).

This is a rework of #16766.

How was this patch tested?

Added tests in DatasetSuite.
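
As a rough illustration of what a custom coalescer does, here is a dependency-free sketch. In Spark itself, `PartitionCoalescer.coalesce(maxPartitions: Int, parent: RDD[_])` returns an `Array[PartitionGroup]`. The `Group`, `Coalescer`, and `ContiguousCoalescer` names below are hypothetical stand-ins that work on plain partition indices, so the grouping logic can be read without a Spark dependency. The contiguous-run strategy is an assumption chosen for simplicity, not the strategy used in this PR.

```scala
// Dependency-free model of a custom coalescer; these are NOT Spark's classes.
object CoalescerSketch {
  /** Hypothetical stand-in for a partition group: the parent partition
    * indices that are merged into one output partition. */
  final case class Group(partitions: Vector[Int])

  /** Hypothetical stand-in for the coalescer contract. */
  trait Coalescer {
    def coalesce(maxPartitions: Int, parentPartitions: Int): Vector[Group]
  }

  /** Merges contiguous runs of parent partitions into at most
    * `maxPartitions` groups of near-equal size. */
  object ContiguousCoalescer extends Coalescer {
    def coalesce(maxPartitions: Int, parentPartitions: Int): Vector[Group] = {
      require(maxPartitions > 0,
        s"Number of partitions ($maxPartitions) must be positive.")
      // Coalescing never increases the partition count.
      val n = math.min(maxPartitions, parentPartitions)
      (0 until n).toVector.map { i =>
        val start = (i.toLong * parentPartitions / n).toInt
        val end = ((i + 1).toLong * parentPartitions / n).toInt
        Group((start until end).toVector)
      }
    }
  }
}
```

For example, `CoalescerSketch.ContiguousCoalescer.coalesce(3, 10)` groups ten parent partitions as (0, 1, 2), (3, 4, 5), (6, 7, 8, 9).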

@SparkQA

SparkQA commented Aug 6, 2017

Test build #80306 has finished for PR 18861 at commit bb7f9b0.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class PartitionCoalesce(numPartitions: Int, coalescer: PartitionCoalescer, child: LogicalPlan)
  • case class CoalesceExec(numPartitions: Int, child: SparkPlan, coalescer: Option[PartitionCoalescer])

@SparkQA

SparkQA commented Aug 6, 2017

Test build #80307 has finished for PR 18861 at commit 253bdcc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class PartitionCoalesce(numPartitions: Int, coalescer: PartitionCoalescer, child: LogicalPlan)
  • case class CoalesceExec(numPartitions: Int, child: SparkPlan, coalescer: Option[PartitionCoalescer])

@maropu
Member Author

maropu commented Aug 7, 2017

@gatorsmile ok

@@ -571,7 +570,8 @@ case class UnionExec(children: Seq[SparkPlan]) extends SparkPlan {
* current upstream partitions will be executed in parallel (per whatever
* the current partitioning is).
*/
- case class CoalesceExec(numPartitions: Int, child: SparkPlan) extends UnaryExecNode {
+ case class CoalesceExec(numPartitions: Int, child: SparkPlan, coalescer: Option[PartitionCoalescer])
Member

Could you add a param description for coalescer and also update the function descriptions? Thanks!

Member Author

ok!

addPartition(partition, splitSize)
index += 1
} else {
- updateGroups
+ updateGroups()
Member

All the above changes are not related to this PR, right?

Member Author

Yea, I just kept the original author's changes (probably refactoring?). Better to remove them?

Member

I am fine with this, but it might confuse others. Maybe just remove them from this PR? You can submit a separate PR later.

Member Author

ok, I'll drop these from this pr.

* supplying a `PartitionCoalescer` to control the behavior of the partitioning.
*/
case class PartitionCoalesce(numPartitions: Int, coalescer: PartitionCoalescer, child: LogicalPlan)
extends UnaryNode {
Member

Adding a new logical node also requires updates in multiple components (e.g., the Optimizer).

Is it possible to reuse the existing Repartition node?

Member Author

Yea, I think so. I'll try; please give me a few days.

@SparkQA

SparkQA commented Aug 7, 2017

Test build #80321 has finished for PR 18861 at commit c0306d3.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 7, 2017

Test build #80320 has finished for PR 18861 at commit 413b0eb.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 7, 2017

Test build #80327 has finished for PR 18861 at commit 83ac85f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Repartition(

@SparkQA

SparkQA commented Aug 7, 2017

Test build #80328 has finished for PR 18861 at commit 3b4c679.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Repartition(

@maropu
Member Author

maropu commented Aug 7, 2017

@gatorsmile ok, could you check again?

numPartitions: Int,
shuffle: Boolean,
child: LogicalPlan,
coalescer: Option[PartitionCoalescer] = None)
extends RepartitionOperation {
require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
Member

Add a new require here?

require(!shuffle || coalescer.isEmpty, "Custom coalescer is not allowed for repartition(shuffle=true)")
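
A minimal sketch of how the two `require` checks would behave together, assuming a simplified node without the `child` plan. `RepartitionSketch`, its nested `Repartition`, and the `Coalescer` trait are hypothetical stand-ins for illustration, not Spark's classes.

```scala
import scala.util.Try

object RepartitionSketch {
  // Hypothetical placeholder for a user-supplied coalescer;
  // this is NOT Spark's PartitionCoalescer.
  trait Coalescer

  // Simplified model of the logical node: the child plan is omitted.
  final case class Repartition(
      numPartitions: Int,
      shuffle: Boolean,
      coalescer: Option[Coalescer] = None) {
    require(numPartitions > 0,
      s"Number of partitions ($numPartitions) must be positive.")
    // A custom coalescer only makes sense when no shuffle is requested.
    require(!shuffle || coalescer.isEmpty,
      "Custom coalescer is not allowed for repartition(shuffle=true)")
  }

  val custom: Coalescer = new Coalescer {}

  // Construction fails with IllegalArgumentException when shuffle = true
  // is combined with a custom coalescer.
  val rejected: Boolean = Try(Repartition(4, shuffle = true, Some(custom))).isFailure
  val accepted: Boolean = Try(Repartition(4, shuffle = false, Some(custom))).isSuccess
}
```

Because `coalescer` defaults to `None`, call sites that build the node without a coalescer would keep compiling unchanged.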

Member Author

ok

* @since 2.3.0
*/
def coalesce(numPartitions: Int, userDefinedCoalescer: Option[PartitionCoalescer])
: Dataset[T] = withTypedPlan {
Member

def coalesce(
    numPartitions: Int,
    userDefinedCoalescer: Option[PartitionCoalescer]): Dataset[T] = withTypedPlan {

case (false, true) => if (r.numPartitions >= child.numPartitions) child else r
case _ => r.copy(child = child.child)
}
case r @ Repartition(_, _, child: RepartitionOperation, None) =>
Member

?

Member Author

sorry, my bad. Fixed.

@SparkQA

SparkQA commented Aug 8, 2017

Test build #80373 has finished for PR 18861 at commit d7392c4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Aug 8, 2017

@gatorsmile ok, fixed.

@maropu
Member Author

maropu commented Aug 10, 2017

@gatorsmile ping

@@ -596,14 +596,17 @@ object CollapseProject extends Rule[LogicalPlan] {
object CollapseRepartition extends Rule[LogicalPlan] {
Member

Please also add new test cases to CollapseRepartitionSuite for the changes in this rule.
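
The behavior such tests might check can be sketched on a toy plan type. This is a model under stated assumptions, not the rule from this PR: `Plan`, `Scan`, `Repartition`, and `collapse` are hypothetical, and the choice to leave any node carrying a custom coalescer untouched is an assumption based on the `None` in the pattern quoted earlier in the review.

```scala
object CollapseSketch {
  // Hypothetical placeholder; NOT Spark's PartitionCoalescer.
  trait Coalescer

  sealed trait Plan
  final case class Scan(name: String) extends Plan
  final case class Repartition(
      numPartitions: Int,
      shuffle: Boolean,
      child: Plan,
      coalescer: Option[Coalescer] = None) extends Plan

  /** Single bottom-up pass that merges adjacent repartition nodes.
    * Assumption: nodes with a custom coalescer are never collapsed. */
  def collapse(plan: Plan): Plan = plan match {
    case Repartition(n, s, child, c) =>
      val r = Repartition(n, s, collapse(child), c)
      r match {
        case Repartition(_, _, inner @ Repartition(_, _, _, None), None) =>
          (r.shuffle, inner.shuffle) match {
            // coalesce over repartition: keep the child if the coalesce
            // would not reduce the partition count, else keep both.
            case (false, true) =>
              if (r.numPartitions >= inner.numPartitions) inner else r
            // otherwise the outer node makes the inner one redundant
            case _ => r.copy(child = inner.child)
          }
        case _ => r
      }
    case other => other
  }
}
```

With this model, a shuffle repartition over another repartition collapses to the outer node, while a coalesce that carries a custom coalescer is returned unchanged.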

Member Author

ok

@jiangxb1987
Contributor

@maropu any update on this issue?

@maropu
Member Author

maropu commented Oct 3, 2017

@gatorsmile better to close this pr and the jira?

@gatorsmile
Member

Yeah, maybe close it first. Thanks!

@maropu
Member Author

maropu commented Oct 4, 2017

ok, thanks!

@maropu maropu closed this Oct 4, 2017