-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-13734][SPARKR] Added histogram function #11569
Conversation
Test build #52615 has finished for PR 11569 at commit
|
Test build #52618 has finished for PR 11569 at commit
|
It seems better to keep SparkR as a base package providing core functionalities, while visualization features can be implemented in other packages based on SparkR. There is an example at https://github.com/PAPL-SKKU/ggplot2.SparkR. |
Test build #52628 has finished for PR 11569 at commit
|
I agree that SparkR should not "require" ggplot... |
Yeah installing ggplot as a part of SparkR isn't a good pattern. I agree that this should be in a add-on package. |
@shivaram @sun-rui @felixcheung Yeah, that makes sense. I modified histogram() function so now it only computes the histogram statistics. There's neither rendering nor dependency on ggplot2 anymore. I think the histogram stats are still very useful for an R user and if they wanna plot it later, they're free to use any of R packages. |
Test build #53732 has finished for PR 11569 at commit
|
Test build #53831 has finished for PR 11569 at commit
|
Test build #53836 has finished for PR 11569 at commit
|
#' } | ||
setMethod("histogram", | ||
signature(df = "DataFrame"), | ||
function(df, colname, nbins = 10) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible to specify colname type in signature()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some other functions here take the Column
type, you might want to support both character
or Column
for colname
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@felixcheung Yeah, I thought of that but I don't know how to compute the min and max in one single pass given a Column (not a name). I used describe() which requires a column name. I also tried agg, but it cannot compute more than 1 stat per column:
> collect(agg(irisDF, Sepal_Length="max", Sepal_Width="min"))
max(Sepal_Length) min(Sepal_Width)
1 7.9 2
> collect(agg(irisDF, Sepal_Length="max", Sepal_Length="min"))
max(Sepal_Length)
1 7.9
Suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you need names of columns, something like
columns(select(df, colname))
to get a list of names back?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@felixcheung Yeah, but then what if the user wants to do:
hist(irisDF, irisDF$Sepal_Length + 1)
describe() would fail as this column doesn't belong to df.
Test build #54050 has finished for PR 11569 at commit
|
Test build #54051 has finished for PR 11569 at commit
|
btw, is this still the right place? functions.R works with Column, if this works with DataFrame, should it go to DataFrame.R? |
Test build #54234 has finished for PR 11569 at commit
|
#' @param colname the name of the column to build the histogram from. | ||
#' @return a data.frame with the histogram statistics, i.e., counts and centroids. | ||
#' @rdname histogram | ||
#' @family agg_funcs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#' @family DataFrame functions
looks good except 1 minor doc comment |
Test build #56711 has finished for PR 11569 at commit
|
@felixcheung @shivaram I'm done with all your suggestions. Thanks. Shall we merge? |
|
||
# Append the given column to the dataset. This is to support Columns that | ||
# don't belong to the DataFrame but are rather expressions | ||
df$x <- col |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to check if x
is a column name already present in the data frame ? For example I ran the code
irisDF$x <- irisDF$Petal_Width + 2.0
histogram(irisDF, irisDF$x, 8)
and I got an error
org.apache.spark.sql.AnalysisException: resolved attribute(s) x#141 missing from Species#4,Sepal_Length#0,x#269,Petal_Width#3,Petal_Length#2,Sepal_Width#1 in operator !Project [Sepal_Length#0,Sepal_Width#1,Petal_Length#2,Petal_Width#3,Species#4,x#269,cast((((cast(cast((((x#141 - 2.1) / 2.4) * 10000.0) as int) as double) / 10000.0) / 0.125) - CASE WHEN ((((cast(cast((((x#141 - 2.1) / 2.4) * 10000.0) as int) as double) / 10000.0) / 0.125) = cast(cast(((cast(cast((((x#141 - 2.1) / 2.4) * 10000.0) as int) as double) / 10000.0) / 0.125) as int) as double)) && NOT (x#141 = 2.1)) THEN 1.0 ELSE 0.0 END) as int) AS bins#325]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shivaram Yes, I have fixed this. Thanks!
please rebase to pick up |
Test build #56924 has finished for PR 11569 at commit
|
Test build #56927 has finished for PR 11569 at commit
|
#' @param colname the name of the column to build the histogram from. | ||
#' @return a data.frame with the histogram statistics, i.e., counts and centroids. | ||
#' @rdname histogram | ||
#' @family DataFrame functions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this has changed as well
@family SparkDataFrame functions
sorry this is such a moving target
Test build #57028 has finished for PR 11569 at commit
|
@shivaram @felixcheung Looks like the version of lint-r running on the build server is different than the one on Spark's Github. Even though lint-r passes on my local, I keep getting this errors: R/DataFrame.R:2542:40: style: Put spaces around all infix operators. |
Test build #57029 has finished for PR 11569 at commit
|
@shivaram @felixcheung I have addressed all your comments. Anything else? Or shall we merge? Thanks! |
#' @examples | ||
#' \dontrun{ | ||
#' | ||
#' # Create a DataFrame from the Iris dataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @felixcheung mentioned before DataFrame
-> SparkDataFrame
? Or we can just delete this comment
LGTM. Thanks @olarayej - I just had a couple of minor comments about using |
Test build #57038 has finished for PR 11569 at commit
|
Test build #57043 has finished for PR 11569 at commit
|
LGTM. Merging this. |
What changes were proposed in this pull request?
Added method histogram() to compute the histogram of a Column
Usage:
Note: Usage will change once SPARK-9325 is figured out so that histogram() only takes a Column as a parameter, as opposed to a DataFrame and a name
How was this patch tested?
All unit tests pass. I added specific unit cases for different scenarios.