[SPARK-13734][SPARKR] Added histogram function #11569

olarayej · 2016-03-08T00:45:12Z

What changes were proposed in this pull request?

Added method histogram() to compute the histogram of a Column

Usage:

## Create a DataFrame from the Iris dataset
irisDF <- createDataFrame(sqlContext, iris)

## Render a histogram for the Sepal_Length column 
histogram(irisDF, "Sepal_Length", nbins=12)

Note: Usage will change once SPARK-9325 is resolved, so that histogram() takes only a Column as a parameter rather than a DataFrame and a column name.

How was this patch tested?

All unit tests pass. I added unit test cases for different scenarios.

SparkQA · 2016-03-08T01:07:34Z

Test build #52615 has finished for PR 11569 at commit 0ad424b.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-08T01:48:18Z

Test build #52618 has finished for PR 11569 at commit d19992b.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

sun-rui · 2016-03-08T02:39:12Z

It seems better to keep SparkR as a base package providing core functionalities, while visualization features can be implemented in other packages based on SparkR. There is an example at https://github.com/PAPL-SKKU/ggplot2.SparkR.

SparkQA · 2016-03-08T03:31:49Z

Test build #52628 has finished for PR 11569 at commit ac8f4c9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2016-03-10T05:53:14Z

I agree that SparkR should not "require" ggplot...

shivaram · 2016-03-11T01:14:20Z

Yeah, installing ggplot as a part of SparkR isn't a good pattern. I agree that this should be in an add-on package.

olarayej · 2016-03-22T00:14:36Z

@shivaram @sun-rui @felixcheung Yeah, that makes sense. I modified the histogram() function so that it now only computes the histogram statistics. There's neither rendering nor a dependency on ggplot2 anymore. I think the histogram stats are still very useful for an R user, and if they want to plot them later, they're free to use any R package.
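
To make "histogram statistics" concrete, here is a minimal base R sketch of equal-width bin counts and centroids for a local numeric vector. The helper name local_histogram_stats and its output column names are assumptions made for illustration. This is not the SparkR code from this PR, which works on a distributed DataFrame.

```r
# Hypothetical helper (not the SparkR implementation): equal-width histogram
# statistics for a local numeric vector.
local_histogram_stats <- function(x, nbins = 10) {
  stopifnot(is.numeric(x), length(x) > 0, nbins >= 1)
  nbins <- floor(nbins)
  lo <- min(x)
  hi <- max(x)
  if (hi == lo) {
    # All values are identical: a single bin holds everything.
    return(data.frame(bins = 0, counts = length(x), centroids = lo))
  }
  width <- (hi - lo) / nbins
  # Zero-based bin index; the maximum value is folded into the last bin.
  idx <- pmin(floor((x - lo) / width), nbins - 1)
  counts <- tabulate(idx + 1, nbins = nbins)
  data.frame(
    bins = seq_len(nbins) - 1,
    counts = counts,
    centroids = lo + (seq_len(nbins) - 0.5) * width
  )
}

stats <- local_histogram_stats(iris$Sepal.Length, nbins = 12)
print(stats)
```

A data.frame of this shape can be handed to any plotting package, for example barplot(stats$counts, names.arg = round(stats$centroids, 2)) in base R.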

SparkQA · 2016-03-22T00:20:45Z

Test build #53732 has finished for PR 11569 at commit 125b82d.

This patch fails R style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-22T22:29:50Z

Test build #53831 has finished for PR 11569 at commit 971d306.

This patch fails R style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-22T23:15:21Z

Test build #53836 has finished for PR 11569 at commit c06344e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2016-03-23T04:45:49Z

R/pkg/R/functions.R

+#' } 
+setMethod("histogram",
+          signature(df = "DataFrame"),
+          function(df, colname, nbins = 10) {


is it possible to specify colname type in signature()?

Some other functions here take the Column type, you might want to support both character or Column for colname

@felixcheung Yeah, I thought of that, but I don't know how to compute the min and max in a single pass given a Column (not a name). I used describe(), which requires a column name. I also tried agg, but it cannot compute more than one stat per column:

> collect(agg(irisDF, Sepal_Length="max", Sepal_Width="min"))
  max(Sepal_Length) min(Sepal_Width)
1               7.9                2
> collect(agg(irisDF, Sepal_Length="max", Sepal_Length="min"))
  max(Sepal_Length)
1               7.9

Suggestions?

if you need names of columns, something like
columns(select(df, colname)) to get a list of names back?

@felixcheung Yeah, but then what if the user wants to do:

hist(irisDF, irisDF$Sepal_Length + 1)

describe() would fail as this column doesn't belong to df.
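
One possible reading of the agg output quoted earlier in this thread, where two stats requested for Sepal_Length came back as one: if the named arguments are looked up by column name, a repeated name can only resolve to a single entry. The base R sketch below shows that lookup behaviour. How SparkR's agg processes its arguments internally is not shown in this thread, so the connection is an assumption.

```r
# Two statistics requested under the same argument name, as in the second call.
requested <- list(Sepal_Length = "max", Sepal_Length = "min")
length(requested)            # both entries are still present in the list
requested[["Sepal_Length"]]  # name lookup returns only the first match

# A name-keyed lookup therefore sees a single statistic per column name.
by_name <- vapply(unique(names(requested)),
                  function(n) requested[[n]], character(1))
print(by_name)
```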

SparkQA · 2016-03-24T16:50:01Z

Test build #54050 has finished for PR 11569 at commit 468adbf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-24T17:04:23Z

Test build #54051 has finished for PR 11569 at commit dbc9d75.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2016-03-24T17:50:00Z

btw, is this still the right place? functions.R works with Column, if this works with DataFrame, should it go to DataFrame.R?

SparkQA · 2016-03-25T23:49:47Z

Test build #54234 has finished for PR 11569 at commit 19f995c.

This patch fails R style tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2016-04-22T00:15:09Z

R/pkg/R/DataFrame.R

+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and centroids.
+#' @rdname histogram
+#' @family agg_funcs


#' @family DataFrame functions

felixcheung · 2016-04-22T01:03:39Z

looks good except 1 minor doc comment

SparkQA · 2016-04-22T17:51:34Z

Test build #56711 has finished for PR 11569 at commit fc4c536.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

olarayej · 2016-04-22T17:52:48Z

@felixcheung @shivaram I'm done with all your suggestions. Thanks. Shall we merge?

shivaram · 2016-04-22T20:22:58Z

R/pkg/R/DataFrame.R

+
+              # Append the given column to the dataset. This is to support Columns that
+              # don't belong to the DataFrame but are rather expressions
+              df$x <- col


Do we need to check whether x is a column name already present in the data frame? For example, I ran the code

irisDF$x <- irisDF$Petal_Width + 2.0
histogram(irisDF, irisDF$x, 8)

and I got an error

org.apache.spark.sql.AnalysisException: resolved attribute(s) x#141 missing from Species#4,Sepal_Length#0,x#269,Petal_Width#3,Petal_Length#2,Sepal_Width#1 in operator !Project [Sepal_Length#0,Sepal_Width#1,Petal_Length#2,Petal_Width#3,Species#4,x#269,cast((((cast(cast((((x#141 - 2.1) / 2.4) * 10000.0) as int) as double) / 10000.0) / 0.125) - CASE WHEN ((((cast(cast((((x#141 - 2.1) / 2.4) * 10000.0) as int) as double) / 10000.0) / 0.125) = cast(cast(((cast(cast((((x#141 - 2.1) / 2.4) * 10000.0) as int) as double) / 10000.0) / 0.125) as int) as double)) && NOT (x#141 = 2.1)) THEN 1.0 ELSE 0.0 END) as int) AS bins#325]

@shivaram Yes, I have fixed this. Thanks!
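
The collision reported here comes from appending a temporary column with a fixed name. A minimal sketch of one way to pick a name that is not already taken follows. The helper fresh_colname and its counter-suffix scheme are hypothetical and are not the code merged in this PR.

```r
# Hypothetical helper, not the code merged in this PR: return a column name
# based on `prefix` that does not collide with any name in `existing`.
fresh_colname <- function(existing, prefix = "x") {
  candidate <- prefix
  i <- 0
  while (candidate %in% existing) {
    i <- i + 1
    candidate <- paste0(prefix, "_", i)
  }
  candidate
}

cols <- c("Sepal_Length", "Petal_Width", "x")
tmp <- fresh_colname(cols)
print(tmp)
```

With SparkR, the existing names would presumably come from columns(df), and the temporary column would be appended under the returned name instead of a hard-coded x.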

felixcheung · 2016-04-23T23:54:52Z

please rebase to pick up DataFrame -> SparkDataFrame class name change.

SparkQA · 2016-04-25T21:27:43Z

Test build #56924 has finished for PR 11569 at commit 7cdb9e8.

This patch fails some tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-25T21:56:44Z

Test build #56927 has finished for PR 11569 at commit 976e412.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2016-04-25T23:37:56Z

R/pkg/R/DataFrame.R

+#' @param colname the name of the column to build the histogram from.
+#' @return a data.frame with the histogram statistics, i.e., counts and centroids.
+#' @rdname histogram
+#' @family DataFrame functions


this has changed as well
@family SparkDataFrame functions
sorry this is such a moving target

SparkQA · 2016-04-26T21:14:52Z

Test build #57028 has finished for PR 11569 at commit cd7ba4c.

This patch fails R style tests.
This patch merges cleanly.
This patch adds no public classes.

olarayej · 2016-04-26T21:20:14Z

@shivaram @felixcheung It looks like the version of lint-r running on the build server is different from the one in Spark's GitHub repository. Even though lint-r passes locally, I keep getting this error:

R/DataFrame.R:2542:40: style: Put spaces around all infix operators.
collapse=""

SparkQA · 2016-04-26T21:37:09Z

Test build #57029 has finished for PR 11569 at commit e9dbc5b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

olarayej · 2016-04-26T21:48:10Z

@shivaram @felixcheung I have addressed all your comments. Anything else? Or shall we merge? Thanks!

shivaram · 2016-04-26T21:49:27Z

R/pkg/R/DataFrame.R

+#' @examples 
+#' \dontrun{
+#' 
+#' # Create a DataFrame from the Iris dataset


As @felixcheung mentioned before, DataFrame -> SparkDataFrame? Or we can just delete this comment.

shivaram · 2016-04-26T21:50:01Z

LGTM. Thanks @olarayej - I just had a couple of minor comments about using SparkDataFrame. Other than that this looks good to merge

SparkQA · 2016-04-26T22:13:12Z

Test build #57038 has finished for PR 11569 at commit fc2f6a3.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-26T22:33:08Z

Test build #57043 has finished for PR 11569 at commit 838c915.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

shivaram · 2016-04-26T22:33:35Z

LGTM. Merging this.

Oscar D. Lara Yejas added 2 commits March 7, 2016 16:19

Added histogram function

efc2f66

Added test case where some bins have zero counts

0ad424b

olarayej changed the title ~~Added histogram function~~ [SPARK-13734][SPARKR] Added histogram function Mar 8, 2016

Added check for ggplot2 package

d19992b

Suppressed warnings for loading ggplot

ac8f4c9

Modified histogram to remove ggplot2 dependency

125b82d

Fixed style issues

971d306

Fixed style issues

c06344e

felixcheung reviewed Mar 23, 2016

Oscar D. Lara Yejas added 2 commits March 23, 2016 17:28

Added example to render the histogram with ggplot2, and added documentation tags

468adbf

Round nbins to the smallest integer

dbc9d75

Added support for Columns

19f995c

Fixed style

2800492

felixcheung reviewed Apr 22, 2016

Minor docs fix

fc4c536

shivaram reviewed Apr 22, 2016

Merged changes from master stream (Renamed to SparkDataFrame)

7cdb9e8

pkg/R/DataFrame.R

976e412

felixcheung reviewed Apr 25, 2016

Oscar D. Lara Yejas added 2 commits April 26, 2016 12:28

Added dynamic colname generation to avoid colliding with existing columns

96714fd

Fixed ggplot example

cd7ba4c

Fixed style issues

e9dbc5b

shivaram reviewed Apr 26, 2016

Changes DataFrame for SparkDataFrame

fc2f6a3

Changed error message on histogram tests

838c915

asfgit closed this in 0c99c23 Apr 26, 2016
