Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-13327][SPARKR] Added parameter validations for colnames<- #11220

Closed
wants to merge 2 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 21 additions & 1 deletion R/pkg/R/DataFrame.R
Original file line number Diff line number Diff line change
Expand Up @@ -303,8 +303,28 @@ setMethod("colnames",
#' @rdname columns
#' @name colnames<-
setMethod("colnames<-",
signature(x = "DataFrame", value = "character"),
signature(x = "DataFrame"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer keeping the current signature instead of checking the class of the parameter inside the method.
@felixcheung, what's your opinion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree - letting R do type matching for the method signature seems like a better approach?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I added that check was because of this:

colnames(irisDF) <- 1
Error in colnames<-(*tmp*, value = 1) :
'dimnames' applied to non-array

After I ran that, I saw this:

showMethods("colnames<-")
Function: colnames<- (package SparkR)
x="ANY", value="ANY"
x="DataFrame", value="character"
x="DataFrame", value="numeric"
(inherited from: x="ANY", value="ANY")

So looks like R automatically adds definitions of colnames<- if value is other than character.

This does not happen with coltypes<-, as it's not part of base package and doesn't have an (ANY,ANY) signature.

coltypes(irisDF) <- 1
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘coltypes<-’ for signature ‘"DataFrame", "numeric"’

Therefore, I believe we do need to do this data type check inside colnames<-.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is interesting. @felixcheung, do you know the reason behind this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in this case the base:: definition is implemented in such a way that it should (or at least it attempts to) handle different parameter types:

> getMethod("colnames<-")
Method Definition (Class "derivedDefaultMethod"):

function (x, value)
{
    if (is.data.frame(x)) {
        names(x) <- value
    }
    else {
        dn <- dimnames(x)
        if (is.null(dn)) {
            if (is.null(value))
                return(x)
            if ((nd <- length(dim(x))) < 2L)
                stop("attempt to set 'colnames' on an object with less than two dimensions")
            dn <- vector("list", nd)
        }
        if (length(dn) < 2L)
            stop("attempt to set 'colnames' on an object with less than two dimensions")
        if (is.null(value))
            dn[2L] <- list(NULL)
        else dn[[2L]] <- value
        dimnames(x) <- dn
    }
    x
}
<bytecode: 0x34c9058>
<environment: namespace:base>

Signatures:
        x     value
target  "ANY" "ANY"
defined "ANY" "ANY"

So from what I can see the error you see actually comes from the base implement, as "sort of" expected. Though I'm ok if we are to add explicit checks to make the error more clear for the user.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @felixcheung for investigating this further!

function(x, value) {

# Check parameter integrity
if (class(value) != "character") {
stop("Invalid column names.")
}

if (length(value) != ncol(x)) {
stop(
"Column names must have the same length as the number of columns in the dataset.")
}

if (any(is.na(value))) {
stop("Column names cannot be NA.")
}

# Check if the column names have . in it
if (any(regexec(".", value, fixed=TRUE)[[1]][1] != -1)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might seem rather restrictive? As this is possibly fixed by https://issues.apache.org/jira/browse/SPARK-11976
Instead of hardcoding the check here, why not just let this through? As of now a column cannot have '.' in it anyway

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is OK as SPARK-11976 is not fixed. This restriction can be removed later, if columns having '.' can be accessed transparently

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I would suggest creating a new test case with colnames to check that a DataFrame created from iris would have column names with . replaced with _.
This test case would be broken in the event that SPARK-11976 is fixed to remind us to remove this check in colnames. Otherwise it is very easy to forget.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felixcheung @sun-rui Thanks for your input. Right now if I assign column names containing "." character, any subsequent operation on the DataFrame will fail.

Now, regarding @felixcheung's comment on the test case, right now there are two test cases with str() and with() expecting colnames of iris to be "Sepal_Length", ..., etc. Those will be broken when they fix SPARK-11976. No need to add more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify, I think having a specific test of colnames with iris will help to realize that this check code is in place and should be removed, if and when SPARK-11976 is fixed. Otherwise only tests with str and with will be fixed. Does that make sense?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felixcheung Not sure I follow your idea. Is this what you refer to?

Note: if this test is broken, remove check for "." character on colnames<- method

expect_equal(colnames(irisDF)[1] == "Sepal_Length"))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is, also what you have here is also a good test for covering this case.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. @sun-rui, @felixcheung. Shall we merge this PR?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sun-rui @felixcheung @shivaram Folks: this is a really simple thing. Shall we merge it?

stop("Colum names cannot contain the '.' symbol.")
}

sdf <- callJMethod(x@sdf, "toDF", as.list(value))
dataFrame(sdf)
})
Expand Down
11 changes: 11 additions & 0 deletions R/pkg/inst/tests/testthat/test_sparkSQL.R
Original file line number Diff line number Diff line change
Expand Up @@ -691,6 +691,17 @@ test_that("names() colnames() set the column names", {
colnames(df) <- c("col3", "col4")
expect_equal(names(df)[1], "col3")

expect_error(colnames(df) <- c("sepal.length", "sepal_width"),
"Colum names cannot contain the '.' symbol.")
expect_error(colnames(df) <- c(1, 2), "Invalid column names.")
expect_error(colnames(df) <- c("a"),
"Column names must have the same length as the number of columns in the dataset.")
expect_error(colnames(df) <- c("1", NA), "Column names cannot be NA.")

# Note: if this test is broken, remove check for "." character on colnames<- method
irisDF <- suppressWarnings(createDataFrame(sqlContext, iris))
expect_equal(names(irisDF)[1], "Sepal_Length")

# Test base::colnames base::names
m2 <- cbind(1, 1:4)
expect_equal(colnames(m2, do.NULL = FALSE), c("col1", "col2"))
Expand Down