[SPARK-34306][SQL][PYTHON][R] Use Snake naming rule across the function APIs

### What changes were proposed in this pull request?

This PR completes the snake_case naming rule for the function APIs across the languages; see also SPARK-10621.

In more detail, this PR:
- Adds `count_distinct` in Scala, Python, and R, and documents that `count_distinct` is encouraged. `countDistinct` was not deprecated because it is pretty commonly used. We could deprecate it in a future release.
- (Scala-specific) Adds `typedlit` but doesn't deprecate `typedLit`, which is arguably commonly used. Likewise, we could deprecate it in a future release.
- Deprecates and renames:
  - `sumDistinct` -> `sum_distinct`
  - `bitwiseNOT` -> `bitwise_not`
  - `shiftLeft` -> `shiftleft` (matched with SQL name in `FunctionRegistry`)
  - `shiftRight` -> `shiftright` (matched with SQL name in `FunctionRegistry`)
  - `shiftRightUnsigned` -> `shiftrightunsigned` (matched with SQL name in `FunctionRegistry`)
  - (Scala-specific) `callUDF` -> `call_udf`
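
The two behaviors listed above (a deprecated alias that warns, and an encouraged-but-silent alias) can be sketched in plain Python. This is a minimal illustration, not Spark's implementation: the function bodies operate on ordinary lists, and the `FutureWarning` category and message wording are assumptions made for the example. Only the function names come from this PR.

```python
import warnings


def sum_distinct(values):
    """New snake_case name (stand-in logic): sum of the distinct values."""
    return sum(set(values))


def sumDistinct(values):
    """Deprecated camelCase name: warn, then delegate to the new name."""
    warnings.warn(
        "sumDistinct is deprecated, use sum_distinct instead.",  # assumed wording
        FutureWarning,  # assumed warning category
        stacklevel=2,
    )
    return sum_distinct(values)


def count_distinct(values):
    """New snake_case name (stand-in logic): number of distinct values."""
    return len(set(values))


def countDistinct(values):
    """Kept as a plain alias: delegates without emitting a warning."""
    return count_distinct(values)
```

Delegating from the old name to the new one keeps a single implementation, so the two names cannot drift apart while both are supported.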

### Why are the changes needed?

To keep naming consistent across the APIs.

### Does this PR introduce _any_ user-facing change?

Yes, it deprecates some APIs and adds new, renamed APIs as described above.
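
For callers moving off the deprecated names, the renames in this PR can be captured as a lookup table. The helper below is a hypothetical sketch: neither `RENAMES` nor `migrate_source` exists in Spark, and the rewrite is purely textual, so it would also touch matching identifiers inside strings or comments.

```python
import re

# Deprecated name -> replacement, as listed in this PR's description.
RENAMES = {
    "sumDistinct": "sum_distinct",
    "bitwiseNOT": "bitwise_not",
    "shiftLeft": "shiftleft",
    "shiftRight": "shiftright",
    "shiftRightUnsigned": "shiftrightunsigned",
    "callUDF": "call_udf",  # Scala-specific
}

# Match whole identifiers only; longer names are tried first as a precaution.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(name) for name in sorted(RENAMES, key=len, reverse=True))
    + r")\b"
)


def migrate_source(text):
    """Return `text` with each deprecated function name replaced by its new name."""
    return _PATTERN.sub(lambda match: RENAMES[match.group(1)], text)
```

`countDistinct` and `typedLit` are deliberately absent from the table because this PR does not deprecate them.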

### How was this patch tested?

Unit tests were added.
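
The shape of such tests can be sketched as follows: check that the old name warns, and that the old and new names agree once the warning is silenced. This is an illustrative sketch only; `bitwise_not` and `bitwiseNOT` here are stand-ins on plain integers, and the test class and `run_suite` helper are hypothetical names, not part of Spark's test suite.

```python
import unittest
import warnings


def bitwise_not(x):
    """Stand-in for the new name: bitwise NOT of an integer."""
    return ~x


def bitwiseNOT(x):
    """Stand-in for the deprecated name: warn, then delegate."""
    warnings.warn(
        "bitwiseNOT is deprecated, use bitwise_not instead.",  # assumed wording
        FutureWarning,  # assumed warning category
        stacklevel=2,
    )
    return bitwise_not(x)


class RenamedFunctionTest(unittest.TestCase):
    def test_old_name_warns(self):
        # The deprecated name must emit a warning when called.
        with self.assertWarns(FutureWarning):
            bitwiseNOT(5)

    def test_old_and_new_agree(self):
        # Silence the deprecation warning, then compare results.
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            self.assertEqual(bitwiseNOT(5), bitwise_not(5))


def run_suite():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(RenamedFunctionTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_suite()` executes both cases and returns a result whose `wasSuccessful()` method reports whether they passed.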

Closes #31408 from HyukjinKwon/SPARK-34306.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
HyukjinKwon committed Feb 2, 2021
1 parent 9db566a commit 30468a9
Showing 27 changed files with 409 additions and 108 deletions.
6 changes: 6 additions & 0 deletions R/pkg/NAMESPACE
@@ -243,6 +243,7 @@ exportMethods("%<=>%",
"base64",
"between",
"bin",
"bitwise_not",
"bitwiseNOT",
"bround",
"cast",
@@ -259,6 +260,7 @@ exportMethods("%<=>%",
"cos",
"cosh",
"count",
"count_distinct",
"countDistinct",
"crc32",
"create_array",
@@ -391,8 +393,11 @@ exportMethods("%<=>%",
"sha1",
"sha2",
"shiftLeft",
"shiftleft",
"shiftRight",
"shiftright",
"shiftRightUnsigned",
"shiftrightunsigned",
"shuffle",
"sd",
"sign",
@@ -415,6 +420,7 @@ exportMethods("%<=>%",
"substr",
"substring_index",
"sum",
"sum_distinct",
"sumDistinct",
"tan",
"tanh",
125 changes: 102 additions & 23 deletions R/pkg/R/functions.R
@@ -484,7 +484,7 @@ setMethod("acosh",
#' \dontrun{
#' head(select(df, approx_count_distinct(df$gear)))
#' head(select(df, approx_count_distinct(df$gear, 0.02)))
#' head(select(df, countDistinct(df$gear, df$cyl)))
#' head(select(df, count_distinct(df$gear, df$cyl)))
#' head(select(df, n_distinct(df$gear)))
#' head(distinct(select(df, "gear")))}
#' @note approx_count_distinct(Column) since 3.0.0
@@ -636,20 +636,33 @@ setMethod("bin",
})

#' @details
#' \code{bitwiseNOT}: Computes bitwise NOT.
#' \code{bitwise_not}: Computes bitwise NOT.
#'
#' @rdname column_nonaggregate_functions
#' @aliases bitwiseNOT bitwiseNOT,Column-method
#' @aliases bitwise_not bitwise_not,Column-method
#' @examples
#'
#' \dontrun{
#' head(select(df, bitwiseNOT(cast(df$vs, "int"))))}
#' head(select(df, bitwise_not(cast(df$vs, "int"))))}
#' @note bitwise_not since 3.2.0
setMethod("bitwise_not",
signature(x = "Column"),
function(x) {
jc <- callJStatic("org.apache.spark.sql.functions", "bitwise_not", x@jc)
column(jc)
})

#' @details
#' \code{bitwiseNOT}: Computes bitwise NOT.
#'
#' @rdname column_nonaggregate_functions
#' @aliases bitwiseNOT bitwiseNOT,Column-method
#' @note bitwiseNOT since 1.5.0
setMethod("bitwiseNOT",
signature(x = "Column"),
function(x) {
jc <- callJStatic("org.apache.spark.sql.functions", "bitwiseNOT", x@jc)
column(jc)
.Deprecated("bitwise_not")
bitwise_not(x)
})

#' @details
@@ -1937,21 +1950,34 @@ setMethod("sum",
})

#' @details
#' \code{sumDistinct}: Returns the sum of distinct values in the expression.
#' \code{sum_distinct}: Returns the sum of distinct values in the expression.
#'
#' @rdname column_aggregate_functions
#' @aliases sumDistinct sumDistinct,Column-method
#' @aliases sum_distinct sum_distinct,Column-method
#' @examples
#'
#' \dontrun{
#' head(select(df, sumDistinct(df$gear)))
#' head(select(df, sum_distinct(df$gear)))
#' head(distinct(select(df, "gear")))}
#' @note sum_distinct since 3.2.0
setMethod("sum_distinct",
signature(x = "Column"),
function(x) {
jc <- callJStatic("org.apache.spark.sql.functions", "sum_distinct", x@jc)
column(jc)
})

#' @details
#' \code{sumDistinct}: Returns the sum of distinct values in the expression.
#'
#' @rdname column_aggregate_functions
#' @aliases sumDistinct sumDistinct,Column-method
#' @note sumDistinct since 1.4.0
setMethod("sumDistinct",
signature(x = "Column"),
function(x) {
jc <- callJStatic("org.apache.spark.sql.functions", "sumDistinct", x@jc)
column(jc)
.Deprecated("sum_distinct")
sum_distinct(x)
})

#' @details
@@ -2469,23 +2495,37 @@ setMethod("approxCountDistinct",
})

#' @details
#' \code{countDistinct}: Returns the number of distinct items in a group.
#' \code{count_distinct}: Returns the number of distinct items in a group.
#'
#' @rdname column_aggregate_functions
#' @aliases countDistinct countDistinct,Column-method
#' @note countDistinct since 1.4.0
setMethod("countDistinct",
#' @aliases count_distinct count_distinct,Column-method
#' @note count_distinct since 3.2.0
setMethod("count_distinct",
signature(x = "Column"),
function(x, ...) {
jcols <- lapply(list(...), function(x) {
stopifnot(class(x) == "Column")
x@jc
})
jc <- callJStatic("org.apache.spark.sql.functions", "countDistinct", x@jc,
jc <- callJStatic("org.apache.spark.sql.functions", "count_distinct", x@jc,
jcols)
column(jc)
})

#' @details
#' \code{countDistinct}: Returns the number of distinct items in a group.
#'
#' An alias of \code{count_distinct}, and it is encouraged to use \code{count_distinct} directly.
#'
#' @rdname column_aggregate_functions
#' @aliases countDistinct countDistinct,Column-method
#' @note countDistinct since 1.4.0
setMethod("countDistinct",
signature(x = "Column"),
function(x, ...) {
count_distinct(x, ...)
})

#' @details
#' \code{concat}: Concatenates multiple input columns together into a single column.
#' The function works with strings, binary and compatible array columns.
@@ -2550,7 +2590,7 @@ setMethod("least",
#' @note n_distinct since 1.4.0
setMethod("n_distinct", signature(x = "Column"),
function(x, ...) {
countDistinct(x, ...)
count_distinct(x, ...)
})

#' @rdname count
@@ -2893,6 +2933,21 @@ setMethod("sha2", signature(y = "Column", x = "numeric"),
column(jc)
})

#' @details
#' \code{shiftleft}: Shifts the given value numBits left. If the given value is a long value,
#' this function will return a long value else it will return an integer value.
#'
#' @rdname column_math_functions
#' @aliases shiftleft shiftleft,Column,numeric-method
#' @note shiftleft since 3.2.0
setMethod("shiftleft", signature(y = "Column", x = "numeric"),
function(y, x) {
jc <- callJStatic("org.apache.spark.sql.functions",
"shiftleft",
y@jc, as.integer(x))
column(jc)
})

#' @details
#' \code{shiftLeft}: Shifts the given value numBits left. If the given value is a long value,
#' this function will return a long value else it will return an integer value.
@@ -2901,9 +2956,22 @@ setMethod("sha2", signature(y = "Column", x = "numeric"),
#' @aliases shiftLeft shiftLeft,Column,numeric-method
#' @note shiftLeft since 1.5.0
setMethod("shiftLeft", signature(y = "Column", x = "numeric"),
function(y, x) {
.Deprecated("shiftleft")
shiftleft(y, x)
})

#' @details
#' \code{shiftright}: (Signed) shifts the given value numBits right. If the given value is a long
#' value, it will return a long value else it will return an integer value.
#'
#' @rdname column_math_functions
#' @aliases shiftright shiftright,Column,numeric-method
#' @note shiftright since 3.2.0
setMethod("shiftright", signature(y = "Column", x = "numeric"),
function(y, x) {
jc <- callJStatic("org.apache.spark.sql.functions",
"shiftLeft",
"shiftright",
y@jc, as.integer(x))
column(jc)
})
@@ -2916,9 +2984,22 @@ setMethod("shiftLeft", signature(y = "Column", x = "numeric"),
#' @aliases shiftRight shiftRight,Column,numeric-method
#' @note shiftRight since 1.5.0
setMethod("shiftRight", signature(y = "Column", x = "numeric"),
function(y, x) {
.Deprecated("shiftright")
shiftright(y, x)
})

#' @details
#' \code{shiftrightunsigned}: (Unsigned) shifts the given value numBits right. If the given value is
#' a long value, it will return a long value else it will return an integer value.
#'
#' @rdname column_math_functions
#' @aliases shiftrightunsigned shiftrightunsigned,Column,numeric-method
#' @note shiftrightunsigned since 3.2.0
setMethod("shiftrightunsigned", signature(y = "Column", x = "numeric"),
function(y, x) {
jc <- callJStatic("org.apache.spark.sql.functions",
"shiftRight",
"shiftrightunsigned",
y@jc, as.integer(x))
column(jc)
})
@@ -2932,10 +3013,8 @@ setMethod("shiftRight", signature(y = "Column", x = "numeric"),
#' @note shiftRightUnsigned since 1.5.0
setMethod("shiftRightUnsigned", signature(y = "Column", x = "numeric"),
function(y, x) {
jc <- callJStatic("org.apache.spark.sql.functions",
"shiftRightUnsigned",
y@jc, as.integer(x))
column(jc)
.Deprecated("shiftrightunsigned")
shiftrightunsigned(y, x)
})

#' @details
24 changes: 24 additions & 0 deletions R/pkg/R/generics.R
@@ -884,6 +884,10 @@ setGeneric("base64", function(x) { standardGeneric("base64") })
#' @name NULL
setGeneric("bin", function(x) { standardGeneric("bin") })

#' @rdname column_nonaggregate_functions
#' @name NULL
setGeneric("bitwise_not", function(x) { standardGeneric("bitwise_not") })

#' @rdname column_nonaggregate_functions
#' @name NULL
setGeneric("bitwiseNOT", function(x) { standardGeneric("bitwiseNOT") })
@@ -923,6 +927,10 @@ setGeneric("concat_ws", function(sep, x, ...) { standardGeneric("concat_ws") })
#' @name NULL
setGeneric("conv", function(x, fromBase, toBase) { standardGeneric("conv") })

#' @rdname column_aggregate_functions
#' @name NULL
setGeneric("count_distinct", function(x, ...) { standardGeneric("count_distinct") })

#' @rdname column_aggregate_functions
#' @name NULL
setGeneric("countDistinct", function(x, ...) { standardGeneric("countDistinct") })
@@ -1324,14 +1332,26 @@ setGeneric("sha2", function(y, x) { standardGeneric("sha2") })
#' @name NULL
setGeneric("shiftLeft", function(y, x) { standardGeneric("shiftLeft") })

#' @rdname column_math_functions
#' @name NULL
setGeneric("shiftleft", function(y, x) { standardGeneric("shiftleft") })

#' @rdname column_math_functions
#' @name NULL
setGeneric("shiftRight", function(y, x) { standardGeneric("shiftRight") })

#' @rdname column_math_functions
#' @name NULL
setGeneric("shiftright", function(y, x) { standardGeneric("shiftright") })

#' @rdname column_math_functions
#' @name NULL
setGeneric("shiftRightUnsigned", function(y, x) { standardGeneric("shiftRightUnsigned") })

#' @rdname column_math_functions
#' @name NULL
setGeneric("shiftrightunsigned", function(y, x) { standardGeneric("shiftrightunsigned") })

#' @rdname column_collection_functions
#' @name NULL
setGeneric("shuffle", function(x) { standardGeneric("shuffle") })
@@ -1388,6 +1408,10 @@ setGeneric("struct", function(x, ...) { standardGeneric("struct") })
#' @name NULL
setGeneric("substring_index", function(x, delim, count) { standardGeneric("substring_index") })

#' @rdname column_aggregate_functions
#' @name NULL
setGeneric("sum_distinct", function(x) { standardGeneric("sum_distinct") })

#' @rdname column_aggregate_functions
#' @name NULL
setGeneric("sumDistinct", function(x) { standardGeneric("sumDistinct") })
17 changes: 12 additions & 5 deletions R/pkg/tests/fulltests/test_sparkSQL.R
@@ -1397,15 +1397,17 @@ test_that("column operators", {
test_that("column functions", {
c <- column("a")
c1 <- abs(c) + acos(c) + approx_count_distinct(c) + ascii(c) + asin(c) + atan(c)
c2 <- avg(c) + base64(c) + bin(c) + bitwiseNOT(c) + cbrt(c) + ceil(c) + cos(c)
c2 <- avg(c) + base64(c) + bin(c) + suppressWarnings(bitwiseNOT(c)) +
bitwise_not(c) + cbrt(c) + ceil(c) + cos(c)
c3 <- cosh(c) + count(c) + crc32(c) + hash(c) + exp(c)
c4 <- explode(c) + expm1(c) + factorial(c) + first(c) + floor(c) + hex(c)
c5 <- hour(c) + initcap(c) + last(c) + last_day(c) + length(c)
c6 <- log(c) + (c) + log1p(c) + log2(c) + lower(c) + ltrim(c) + max(c) + md5(c)
c7 <- mean(c) + min(c) + month(c) + negate(c) + posexplode(c) + quarter(c)
c8 <- reverse(c) + rint(c) + round(c) + rtrim(c) + sha1(c) + monotonically_increasing_id()
c9 <- signum(c) + sin(c) + sinh(c) + size(c) + stddev(c) + soundex(c) + sqrt(c) + sum(c)
c10 <- sumDistinct(c) + tan(c) + tanh(c) + degrees(c) + radians(c)
c10 <- suppressWarnings(sumDistinct(c)) + sum_distinct(c) + tan(c) + tanh(c) +
degrees(c) + radians(c)
c11 <- to_date(c) + trim(c) + unbase64(c) + unhex(c) + upper(c)
c12 <- variance(c) + xxhash64(c) + ltrim(c, "a") + rtrim(c, "b") + trim(c, "c")
c13 <- lead("col", 1) + lead(c, 1) + lag("col", 1) + lag(c, 1)
@@ -1457,6 +1459,8 @@ test_that("column functions", {
expect_equal(collect(df3)[[2, 1]], FALSE)
expect_equal(collect(df3)[[3, 1]], TRUE)

df4 <- select(df, count_distinct(df$age, df$name))
expect_equal(collect(df4)[[1, 1]], 2)
df4 <- select(df, countDistinct(df$age, df$name))
expect_equal(collect(df4)[[1, 1]], 2)

@@ -1887,9 +1891,12 @@ test_that("column binary mathfunctions", {
expect_equal(collect(select(df, hypot(df$a, df$b)))[3, "HYPOT(a, b)"], sqrt(3^2 + 7^2))
expect_equal(collect(select(df, hypot(df$a, df$b)))[4, "HYPOT(a, b)"], sqrt(4^2 + 8^2))
## nolint end
expect_equal(collect(select(df, shiftLeft(df$b, 1)))[4, 1], 16)
expect_equal(collect(select(df, shiftRight(df$b, 1)))[4, 1], 4)
expect_equal(collect(select(df, shiftRightUnsigned(df$b, 1)))[4, 1], 4)
expect_equal(collect(select(df, shiftleft(df$b, 1)))[4, 1], 16)
expect_equal(collect(select(df, shiftright(df$b, 1)))[4, 1], 4)
expect_equal(collect(select(df, shiftrightunsigned(df$b, 1)))[4, 1], 4)
expect_equal(collect(select(df, suppressWarnings(shiftLeft(df$b, 1))))[4, 1], 16)
expect_equal(collect(select(df, suppressWarnings(shiftRight(df$b, 1))))[4, 1], 4)
expect_equal(collect(select(df, suppressWarnings(shiftRightUnsigned(df$b, 1))))[4, 1], 4)
expect_equal(class(collect(select(df, rand()))[2, 1]), "numeric")
expect_equal(collect(select(df, rand(1)))[1, 1], 0.636, tolerance = 0.01)
expect_equal(class(collect(select(df, randn()))[2, 1]), "numeric")
2 changes: 1 addition & 1 deletion R/pkg/vignettes/sparkr-vignettes.Rmd
@@ -331,7 +331,7 @@ A common flow of grouping and aggregation is

2. Feed the `GroupedData` object to `agg` or `summarize` functions, with some provided aggregation functions to compute a number within each group.

A number of widely used functions are supported to aggregate data after grouping, including `avg`, `countDistinct`, `count`, `first`, `kurtosis`, `last`, `max`, `mean`, `min`, `sd`, `skewness`, `stddev_pop`, `stddev_samp`, `sumDistinct`, `sum`, `var_pop`, `var_samp`, `var`. See the [API doc for aggregate functions](https://spark.apache.org/docs/latest/api/R/column_aggregate_functions.html) linked there.
A number of widely used functions are supported to aggregate data after grouping, including `avg`, `count_distinct`, `count`, `first`, `kurtosis`, `last`, `max`, `mean`, `min`, `sd`, `skewness`, `stddev_pop`, `stddev_samp`, `sum_distinct`, `sum`, `var_pop`, `var_samp`, `var`. See the [API doc for aggregate functions](https://spark.apache.org/docs/latest/api/R/column_aggregate_functions.html) linked there.

For example we can compute a histogram of the number of cylinders in the `mtcars` dataset as shown below.

2 changes: 1 addition & 1 deletion docs/sql-getting-started.md
@@ -352,7 +352,7 @@ Scalar functions are functions that return a single value per row, as opposed to

## Aggregate Functions

Aggregate functions are functions that return a single value on a group of rows. The [Built-in Aggregation Functions](sql-ref-functions-builtin.html#aggregate-functions) provide common aggregations such as `count()`, `countDistinct()`, `avg()`, `max()`, `min()`, etc.
Aggregate functions are functions that return a single value on a group of rows. The [Built-in Aggregation Functions](sql-ref-functions-builtin.html#aggregate-functions) provide common aggregations such as `count()`, `count_distinct()`, `avg()`, `max()`, `min()`, etc.
Users are not limited to the predefined aggregate functions and can create their own. For more details
about user defined aggregate functions, please refer to the documentation of
[User Defined Aggregate Functions](sql-ref-functions-udf-aggregate.html).