
[SPARK-32799][R][SQL] Add allowMissingColumns to SparkR unionByName #29813

Closed
wants to merge 3 commits
Changes from 2 commits
15 changes: 13 additions & 2 deletions R/pkg/R/DataFrame.R
@@ -2863,11 +2863,19 @@ setMethod("unionAll",
#' \code{UNION ALL} and \code{UNION DISTINCT} in SQL as column positions are not taken
#' into account. Input SparkDataFrames can have different data types in the schema.
#'
#' When the parameter `allowMissingColumns` is `TRUE`, this function allows
#' different sets of column names between the two `SparkDataFrames`.
#' Missing columns on each side will be filled with null values.
#' The columns missing from the left `SparkDataFrame` will be added at the end
#' of the schema of the union result.
#'
#' Note: This does not remove duplicate rows across the two SparkDataFrames.
#' This function resolves columns by name (not by position).
#'
#' @param x A SparkDataFrame
#' @param y A SparkDataFrame
#' @param allowMissingColumns logical, whether to allow the inputs to have different sets of column names (default `FALSE`).
#' @param ... further arguments to be passed to or from other methods.
Contributor:
`...` is not actually supported?

Contributor:
Never mind, I see below that it's added to the generic.

Member Author:

That's correct, but I am not sure if there is a better way of handling that.

Right now we have the generic defined as follows:

setGeneric("unionByName", function(x, y, ...) { standardGeneric("unionByName") })

As far as I am aware, this is the convention we use in SparkR for handling optional arguments.

Technically speaking we could have

setGeneric("unionByName", function(x, y, allowMissingColumns) { standardGeneric("unionByName") })

but then we'd have to support

signature(x = "SparkDataFrame", y = "SparkDataFrame", allowMissingColumns = "missing")

and

signature(x = "SparkDataFrame", y = "SparkDataFrame", allowMissingColumns = "logical")

if I am not mistaken, and in the past I've been told that's too much.

Am I missing something?
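
For reference, the signature-based alternative described above would look roughly like the sketch below; this is only an illustration of the option being discussed, not code from this PR:

setGeneric("unionByName",
           function(x, y, allowMissingColumns) { standardGeneric("unionByName") })

# Method used when the caller omits the argument: delegate with the default value.
setMethod("unionByName",
          signature(x = "SparkDataFrame", y = "SparkDataFrame",
                    allowMissingColumns = "missing"),
          function(x, y, allowMissingColumns) {
            unionByName(x, y, allowMissingColumns = FALSE)
          })

# Method used when the caller passes an explicit logical value.
setMethod("unionByName",
          signature(x = "SparkDataFrame", y = "SparkDataFrame",
                    allowMissingColumns = "logical"),
          function(x, y, allowMissingColumns) {
            dataFrame(callJMethod(x@sdf, "unionByName", y@sdf, allowMissingColumns))
          })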

Contributor:
The way you've done it looks natural to me.

#' @return A SparkDataFrame containing the result of the union.
#' @family SparkDataFrame functions
#' @rdname unionByName
@@ -2880,12 +2888,15 @@ setMethod("unionAll",
#' df1 <- select(createDataFrame(mtcars), "carb", "am", "gear")
#' df2 <- select(createDataFrame(mtcars), "am", "gear", "carb")
#' head(unionByName(df1, df2))
#'
#' df3 <- select(createDataFrame(mtcars), "carb")
#' head(unionByName(df1, df3, allowMissingColumns = TRUE))
#' }
#' @note unionByName since 2.3.0
setMethod("unionByName",
signature(x = "SparkDataFrame", y = "SparkDataFrame"),
- function(x, y) {
- unioned <- callJMethod(x@sdf, "unionByName", y@sdf)
+ function(x, y, allowMissingColumns=FALSE) {
+ unioned <- callJMethod(x@sdf, "unionByName", y@sdf, allowMissingColumns)
dataFrame(unioned)
})
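
To illustrate the documented behaviour, a minimal sketch assuming an active SparkR session and the df1/df3 frames from the @examples above (df1 has carb, am, gear; df3 has only carb):

u <- unionByName(df1, df3, allowMissingColumns = TRUE)
printSchema(u)                    # carb, am, gear: the columns absent from df3 are kept
count(where(u, isNull(u$gear)))   # rows contributed by df3 have null am and gear

With the arguments swapped, unionByName(df3, df1, allowMissingColumns = TRUE) keeps carb first and appends the columns missing from the left side (am and gear) at the end of the result schema.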

2 changes: 1 addition & 1 deletion R/pkg/R/generics.R
@@ -638,7 +638,7 @@ setGeneric("union", function(x, y) { standardGeneric("union") })
setGeneric("unionAll", function(x, y) { standardGeneric("unionAll") })

#' @rdname unionByName
setGeneric("unionByName", function(x, y) { standardGeneric("unionByName") })
setGeneric("unionByName", function(x, y, ...) { standardGeneric("unionByName") })

#' @rdname unpersist
setGeneric("unpersist", function(x, ...) { standardGeneric("unpersist") })
13 changes: 13 additions & 0 deletions R/pkg/tests/fulltests/test_sparkSQL.R
@@ -2696,6 +2696,19 @@ test_that("union(), unionByName(), rbind(), except(), and intersect() on a DataF
expect_error(rbind(df, df2, df3),
"Names of input data frames are different.")


df4 <- unionByName(df2, select(df2, "age"), TRUE)

expect_equal(
  sum(collect(
    select(df4, alias(isNull(df4$name), "missing_name")
  ))$missing_name),
  3
)

testthat::expect_error(unionByName(df2, select(df2, "age"), FALSE))
testthat::expect_error(unionByName(df2, select(df2, "age")))

excepted <- arrange(except(df, df2), desc(df$age))
expect_is(unioned, "SparkDataFrame")
expect_equal(count(excepted), 2)