[SPARK-19399][SPARKR] Add R coalesce API for DataFrame and Column #16739
Conversation
Test build #72147 has started for PR 16739 at commit
Jenkins, retest this please
Test build #72149 has finished for PR 16739 at commit
#' df <- read.json(path)
#' newDF <- coalesce(df, 1L)
#'}
#' @note coalesce(SparkDataFrame) since 2.1.1
2.2.0? Or will this be ported back to 2.1.1 too?
setMethod("coalesce",
          signature(x = "SparkDataFrame"),
          function(x, numPartitions) {
            stopifnot(is.numeric(numPartitions))
Shall we enforce the input parameter to be an integer?
It's being coerced into an integer. The reason we don't want this parameter to be an integer is to allow calls like `coalesce(df, 3)`, in which `3` is a numeric by default (vs. `3L`, which is an integer). IMO, forcing the user to call with `3L` is a bit too much.
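A minimal sketch of that coercion pattern, simplified for illustration (not the actual SparkR method body):

```r
# Accept any numeric input and coerce it; simplified illustration only.
toNumPartitions <- function(numPartitions) {
  stopifnot(is.numeric(numPartitions))  # accepts both 3 and 3L
  as.integer(numPartitions)             # coerce numeric -> integer
}

toNumPartitions(3)   # 3L: a plain numeric literal is coerced
toNumPartitions(3L)  # 3L: already an integer, passes through
```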
R/pkg/R/generics.R
Outdated
@@ -406,6 +406,13 @@ setGeneric("attach")
#' @export
setGeneric("cache", function(x) { standardGeneric("cache") })

#' @rdname coalesce
#' @param x a Column or a SparkDataFrame.
#' @param ... additional argument(s). If \code{x} is a Column, addition Columns can be optionally
addition Columns -> additional Columns?
Test build #72166 has finished for PR 16739 at commit
Minor comment on the doc; otherwise looking good. And I can see that the RDD stuff is getting annoying. Will respond on that JIRA.
#'
#' Returns a new SparkDataFrame that has exactly \code{numPartitions} partitions.
#' This operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100
#' partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of
If more partitions are requested, there will be a shuffle, right? Might be useful to add that.
Actually, no. coalesce is capped at `min(prev partitions, numPartitions)` according to CoalescedRDD here, so the partition count will be unchanged in that case.
Oh well, I guess that's worth mentioning then?
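For reference, a small SparkR sketch of that capping behavior (assumes a running Spark session; `getNumPartitions` on a SparkDataFrame is available since 2.1.1):

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(data.frame(x = 1:100))
n <- getNumPartitions(df)           # whatever df currently has

getNumPartitions(coalesce(df, 1L))  # 1: coalescing down works
# Requesting more partitions than df currently has is a no-op:
getNumPartitions(coalesce(df, as.integer(n + 5)))  # still n
```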
Test build #72232 has finished for PR 16739 at commit
Test build #72240 has finished for PR 16739 at commit
Thanks @felixcheung - I think these changes look good. cc @gatorsmile / @holdenk for doc changes in SQL, Python
So the Python docstring update looks reasonable, but maybe while we are updating the coalesce docstrings across the three languages we should consider whether we want to include the warning from RDD's coalesce?
@@ -2432,7 +2432,8 @@ class Dataset[T] private[sql](
   * Returns a new Dataset that has exactly `numPartitions` partitions.
   * Similar to coalesce defined on an `RDD`, this operation results in a narrow dependency, e.g.
   * if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of
-  * the 100 new partitions will claim 10 of the current partitions.
+  * the 100 new partitions will claim 10 of the current partitions. If a larger number of
+  * partitions is requested, it will stay at the current number of partitions.
So we seem to have left out the warning from RDD about drastic coalesces in the Dataset coalesce. Since we are updating the docstrings now anyway, would it maybe make sense to include that warning here as well? (Looking at the implementation of CoalesceExec, it seems like it would still apply unless I'm missing something.)
Surely - I think you mean https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L428
And actually, I find the current behavior a bit hard to explain. Could someone perhaps enlighten me on whether this is intentional and, if we are to document it, how best to do so?
@felixcheung I was referring to the `* However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,` warning, but the coalesce capping out based on numSlices also sounds important to document (and potentially confusing).
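To make that warning concrete, a hedged SparkR sketch of the pattern it recommends (`repartition` as the shuffle-enabled alternative for a drastic reduction):

```r
library(SparkR)
sparkR.session()
df <- createDataFrame(data.frame(x = 1:100))

# Drastic coalesce to 1 partition: no shuffle boundary, so all upstream
# computation for the final stage runs as a single task.
one_task <- coalesce(df, 1L)

# repartition inserts a shuffle, keeping the upstream stage parallel;
# this is the DataFrame analogue of RDD's coalesce(n, shuffle = true).
still_parallel <- repartition(df, 1L)
```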
Yep, #16739 (comment) - only RDD has the `shuffle` parameter.
Thus, I am wondering whether we should allow users to set it to a larger number. Or are some advanced users relying on it?
@gatorsmile Thanks for commenting. But since you are here, do you know why we see this behavior? Shouldn't I be allowed to set the partition count to 5 < n < 10, since I just repartitioned to 10?
: ) This might be caused by the optimizer rule `CollapseRepartition`.
Hmm, not as far as I can see. Perhaps during optimization the
Force-pushed from 55b99df to a0fe134
Test build #72791 has finished for PR 16739 at commit
Test build #72790 has finished for PR 16739 at commit
Let me rewrite the test cases in Scala.

```Scala
val df = spark.range(0, 10000, 1, 5)
assert(df.rdd.getNumPartitions == 5)
assert(df.coalesce(3).rdd.getNumPartitions == 3)
assert(df.coalesce(6).rdd.getNumPartitions == 5)

val df1 = df.coalesce(3)
assert(df1.rdd.getNumPartitions == 3)
assert(df1.coalesce(6).rdd.getNumPartitions == 5)
assert(df1.coalesce(4).rdd.getNumPartitions == 4)
assert(df1.coalesce(2).rdd.getNumPartitions == 2)

val df2 = df.repartition(10)
assert(df2.rdd.getNumPartitions == 10)
assert(df2.coalesce(13).rdd.getNumPartitions == 5)
assert(df2.coalesce(7).rdd.getNumPartitions == 5)
assert(df2.coalesce(3).rdd.getNumPartitions == 3)
```

The question is why the second one is 5.
Ok...
Great, looking forward to that.
The issue is fixed in #16933. If that is merged first, I will fix the test case in this PR. Thanks! : )
Force-pushed from a0fe134 to bf2373f
Test build #72925 has started for PR 16739 at commit
Jenkins, retest this please
Test build #72929 has finished for PR 16739 at commit
Merged to master and branch-2.1.
Hi, @felixcheung.
@dongjoon-hyun My apologies, and thanks for bringing this to my attention. I had to hand-merge and didn't realize the mismatch. Opened a new PR to fix that.
Thank YOU, always! :)
## What changes were proposed in this pull request?

Add coalesce on DataFrame for down partitioning without shuffle and coalesce on Column

## How was this patch tested?

manual, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes apache#16739 from felixcheung/rcoalesce.
I've commented elsewhere, but wanted to here just to make more people aware: let's refrain from backporting new APIs into patch versions unless they are really critical. We do not do this elsewhere in Spark, and we should not in SparkR. New APIs and API changes should only happen in minor versions (and ideally changes will only happen in major ones).

It's been discussed elsewhere that SparkR is more experimental than other parts of Spark, but the sooner we start treating it like a stable library, the sooner it will be a stable library. For most people, there isn't a huge difference between getting a new API in a patch version (every 1-2 months) vs. getting it in a minor version (every 4 months). Thanks all!
…nabled Repartition

### What changes were proposed in this pull request?

Observed by felixcheung in #16739: when users use the shuffle-enabled `repartition` API, they expect the partition count they get to be the exact number they provided, even if they call the shuffle-disabled `coalesce` later. Currently, the `CollapseRepartition` rule does not consider whether shuffle is enabled or not. Thus, we got the following unexpected result:

```Scala
val df = spark.range(0, 10000, 1, 5)
val df2 = df.repartition(10)
assert(df2.coalesce(13).rdd.getNumPartitions == 5)
assert(df2.coalesce(7).rdd.getNumPartitions == 5)
assert(df2.coalesce(3).rdd.getNumPartitions == 3)
```

This PR is to fix the issue. We preserve the shuffle-enabled Repartition.

### How was this patch tested?

Added a test case.

Author: Xiao Li <gatorsmile@gmail.com>

Closes #16933 from gatorsmile/CollapseRepartition.
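Translated to SparkR, the post-fix behavior would look roughly like this (a sketch with illustrative partition counts, assuming the #16933 fix is applied):

```r
library(SparkR)
sparkR.session()

df  <- createDataFrame(data.frame(x = 1:1000))
df2 <- repartition(df, 10L)   # shuffle-enabled: exactly 10 partitions
getNumPartitions(df2)         # 10

# With the fix, a later shuffle-disabled coalesce is capped by df2's 10
# partitions, not by whatever df had before the repartition:
getNumPartitions(coalesce(df2, 13L))  # 10 (capped, cannot grow)
getNumPartitions(coalesce(df2, 7L))   # 7
```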
Agree with @jkbradley on this one. We should avoid adding functions that are completely new in a patch release, given that the gap between minor versions and patch releases isn't that large. As we discussed in the other thread, let's start tagging JIRAs with
What changes were proposed in this pull request?
Add coalesce on DataFrame for down-partitioning without shuffle, and coalesce on Column.
How was this patch tested?
manual, unit tests
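For completeness, a hedged usage sketch of the two APIs this PR adds (output comments are illustrative):

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(data.frame(a = c(1, NA, 3), b = c(NA, 2, NA)))

# DataFrame coalesce: reduce the number of partitions without a shuffle.
df1 <- coalesce(df, 1L)

# Column coalesce: the first non-null value among the given columns.
collect(select(df, coalesce(df$a, df$b)))
# roughly:
#   coalesce(a, b)
# 1              1
# 2              2
# 3              3
```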