[SPARK-13734][SPARKR] Added histogram function #11569

Closed
wants to merge 24 commits into from

@olarayej olarayej commented Mar 8, 2016

What changes were proposed in this pull request?

Added method histogram() to compute the histogram of a Column

Usage:

## Create a DataFrame from the Iris dataset
irisDF <- createDataFrame(sqlContext, iris)

## Render a histogram for the Sepal_Length column 
histogram(irisDF, "Sepal_Length", nbins=12)

[Image: histogram]

Note: Usage will change once SPARK-9325 is resolved, so that histogram() takes only a Column as a parameter rather than a DataFrame and a column name.

How was this patch tested?

All unit tests pass. I added specific unit cases for different scenarios.

@olarayej olarayej changed the title Added histogram function [SPARK-13734][SPARKR] Added histogram function Mar 8, 2016
@SparkQA

SparkQA commented Mar 8, 2016

Test build #52615 has finished for PR 11569 at commit 0ad424b.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 8, 2016

Test build #52618 has finished for PR 11569 at commit d19992b.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sun-rui
Contributor

sun-rui commented Mar 8, 2016

It seems better to keep SparkR as a base package providing core functionalities, while visualization features can be implemented in other packages based on SparkR. There is an example at https://github.com/PAPL-SKKU/ggplot2.SparkR.

@SparkQA

SparkQA commented Mar 8, 2016

Test build #52628 has finished for PR 11569 at commit ac8f4c9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

I agree that SparkR should not "require" ggplot...

@shivaram
Contributor

Yeah, installing ggplot as a part of SparkR isn't a good pattern. I agree that this should be in an add-on package.

@olarayej
Author

@shivaram @sun-rui @felixcheung Yeah, that makes sense. I modified the histogram() function so that it now only computes the histogram statistics. There's no longer any rendering or any dependency on ggplot2. I think the histogram stats are still very useful for an R user, and if they want to plot them later, they're free to use any R package.
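
To illustrate what "histogram statistics" without rendering could look like, here is a minimal base R sketch. It is not the code from this patch: `histogram_stats` is a hypothetical helper, it works on a local numeric vector rather than a Spark column, and it assumes equal-width bins that are closed on the left, with the maximum value folded into the last bin.

```r
# Hypothetical helper (not part of SparkR): equal-width histogram statistics
# for a local numeric vector, returned as counts and bin centroids.
histogram_stats <- function(x, nbins = 10) {
  lo <- min(x)
  hi <- max(x)
  stopifnot(hi > lo)  # a constant vector has no meaningful bin width
  width <- (hi - lo) / nbins

  # Bin index in 0..(nbins - 1); the maximum value is folded into the last bin.
  bins <- pmin(floor((x - lo) / width), nbins - 1)

  data.frame(
    bins = 0:(nbins - 1),
    counts = tabulate(bins + 1, nbins = nbins),
    centroids = lo + (0:(nbins - 1) + 0.5) * width
  )
}

# `iris` ships with base R; its column is named Sepal.Length there.
stats <- histogram_stats(iris$Sepal.Length, nbins = 12)

# Rendering is left to the caller, for example with base graphics.
if (interactive()) {
  barplot(stats$counts, names.arg = round(stats$centroids, 2))
}
```

Returning a plain data.frame keeps the computation independent of any plotting package, which is the separation the reviewers asked for.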

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53732 has finished for PR 11569 at commit 125b82d.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53831 has finished for PR 11569 at commit 971d306.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53836 has finished for PR 11569 at commit c06344e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#' }
setMethod("histogram",
signature(df = "DataFrame"),
function(df, colname, nbins = 10) {
Member

is it possible to specify colname type in signature()?

Member

Some other functions here take the Column type; you might want to support either character or Column for colname.

Author

@felixcheung Yeah, I thought of that but I don't know how to compute the min and max in one single pass given a Column (not a name). I used describe() which requires a column name. I also tried agg, but it cannot compute more than 1 stat per column:

> collect(agg(irisDF, Sepal_Length="max", Sepal_Width="min"))
  max(Sepal_Length) min(Sepal_Width)
1               7.9                2

> collect(agg(irisDF, Sepal_Length="max", Sepal_Length="min"))
  max(Sepal_Length)
1               7.9

Suggestions?
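
One possible reading of the second result, sketched in base R: if agg() resolves its named arguments by name (an assumption, the thread does not confirm how it handles them), then two arguments sharing the name Sepal_Length cannot both survive, because a by-name lookup only returns the first match.

```r
# Named arguments as a function taking `...` would collect them.
args <- list(Sepal_Length = "max", Sepal_Length = "min")

# Both entries are still present in the list itself.
n_args <- length(args)

# Lookup by name only ever returns the first match.
first_match <- args[["Sepal_Length"]]

# Copying entries by name into a name-keyed container keeps a single entry,
# and that entry is the first match.
env <- new.env()
for (name in names(args)) {
  env[[name]] <- args[[name]]
}
kept <- get("Sepal_Length", envir = env)
```

Under that assumption only "max" would be kept, which matches the single max(Sepal_Length) column shown above.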

Member

if you need names of columns, something like
columns(select(df, colname)) to get a list of names back?

Author

@felixcheung Yeah, but then what if the user wants to do:

histogram(irisDF, irisDF$Sepal_Length + 1)

describe() would fail as this column doesn't belong to df.

@SparkQA

SparkQA commented Mar 24, 2016

Test build #54050 has finished for PR 11569 at commit 468adbf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 24, 2016

Test build #54051 has finished for PR 11569 at commit dbc9d75.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

btw, is this still the right place? functions.R works with Column, if this works with DataFrame, should it go to DataFrame.R?

@SparkQA

SparkQA commented Mar 25, 2016

Test build #54234 has finished for PR 11569 at commit 19f995c.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#' @param colname the name of the column to build the histogram from.
#' @return a data.frame with the histogram statistics, i.e., counts and centroids.
#' @rdname histogram
#' @family agg_funcs
Member

#' @family DataFrame functions

@felixcheung
Member

looks good except 1 minor doc comment

@SparkQA

SparkQA commented Apr 22, 2016

Test build #56711 has finished for PR 11569 at commit fc4c536.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@olarayej
Author

@felixcheung @shivaram I'm done with all your suggestions. Thanks. Shall we merge?


# Append the given column to the dataset. This is to support Columns that
# don't belong to the DataFrame but are rather expressions
df$x <- col
Contributor

Do we need to check whether x is a column name already present in the data frame? For example, I ran the code

irisDF$x <- irisDF$Petal_Width + 2.0
histogram(irisDF, irisDF$x, 8)

and I got an error

org.apache.spark.sql.AnalysisException: resolved attribute(s) x#141 missing from Species#4,Sepal_Length#0,x#269,Petal_Width#3,Petal_Length#2,Sepal_Width#1 in operator !Project [Sepal_Length#0,Sepal_Width#1,Petal_Length#2,Petal_Width#3,Species#4,x#269,cast((((cast(cast((((x#141 - 2.1) / 2.4) * 10000.0) as int) as double) / 10000.0) / 0.125) - CASE WHEN ((((cast(cast((((x#141 - 2.1) / 2.4) * 10000.0) as int) as double) / 10000.0) / 0.125) = cast(cast(((cast(cast((((x#141 - 2.1) / 2.4) * 10000.0) as int) as double) / 10000.0) / 0.125) as int) as double)) && NOT (x#141 = 2.1)) THEN 1.0 ELSE 0.0 END) as int) AS bins#325]

Author

@shivaram Yes, I have fixed this. Thanks!
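
The thread does not show how the collision was fixed. As one way to avoid clobbering an existing column, a temporary name can be chosen only after checking the names already in use. The sketch below is base R, and `unique_col_name` is a hypothetical helper, not a SparkR function.

```r
# Hypothetical helper (not part of SparkR): return a temporary column name
# that does not clash with any name in `existing`.
unique_col_name <- function(existing, prefix = "x") {
  candidate <- prefix
  suffix <- 0
  # Append an increasing numeric suffix until the name is unused.
  while (candidate %in% existing) {
    suffix <- suffix + 1
    candidate <- paste0(prefix, "_", suffix)
  }
  candidate
}

# Column names from the failing example above, where "x" is already taken.
cols <- c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width",
          "Species", "x")
tmp_name <- unique_col_name(cols)
```

With a SparkDataFrame, the `existing` argument would presumably come from columns(df), and the temporary column would be appended under the returned name instead of a fixed x.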

@felixcheung
Member

please rebase to pick up DataFrame -> SparkDataFrame class name change.

@SparkQA

SparkQA commented Apr 25, 2016

Test build #56924 has finished for PR 11569 at commit 7cdb9e8.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 25, 2016

Test build #56927 has finished for PR 11569 at commit 976e412.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#' @param colname the name of the column to build the histogram from.
#' @return a data.frame with the histogram statistics, i.e., counts and centroids.
#' @rdname histogram
#' @family DataFrame functions
Member

this has changed as well
@family SparkDataFrame functions
sorry this is such a moving target

@SparkQA

SparkQA commented Apr 26, 2016

Test build #57028 has finished for PR 11569 at commit cd7ba4c.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@olarayej
Author

@shivaram @felixcheung Looks like the version of lint-r running on the build server is different from the one in Spark's GitHub repo. Even though lint-r passes locally, I keep getting this error:

R/DataFrame.R:2542:40: style: Put spaces around all infix operators.
collapse=""

@SparkQA

SparkQA commented Apr 26, 2016

Test build #57029 has finished for PR 11569 at commit e9dbc5b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@olarayej
Author

@shivaram @felixcheung I have addressed all your comments. Anything else? Or shall we merge? Thanks!

#' @examples
#' \dontrun{
#'
#' # Create a DataFrame from the Iris dataset
Contributor

As @felixcheung mentioned before: DataFrame -> SparkDataFrame? Or we can just delete this comment.

@shivaram
Contributor

LGTM. Thanks @olarayej - I just had a couple of minor comments about using SparkDataFrame. Other than that this looks good to merge

@SparkQA

SparkQA commented Apr 26, 2016

Test build #57038 has finished for PR 11569 at commit fc2f6a3.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 26, 2016

Test build #57043 has finished for PR 11569 at commit 838c915.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

LGTM. Merging this.

@asfgit asfgit closed this in 0c99c23 Apr 26, 2016
5 participants