[SparkR-209] Remaining DataFrame methods + dropTempTable
#204
Conversation
Add `dropTempTable` function and update tests to drop the test table at the end of every test.
Added another version of `select` that takes a list as an argument (instead of just specific column names or expressions).
Hmmm, this was passing on my machine. I'll look at the tests and see if the merge threw something off.
I've run ...
I've restarted it -- FWIW the errors on the previous build were ...
@shivaram I saw those, which is especially strange since those two tests rely solely on the order of rows in the inputs... which never changed. Curious to see how Travis goes this time.
Still the same errors... I'm at a loss.
Maybe it has to do with the temporary filepaths? I think Davies and I used the same filepath names inside some of our tests. Maybe we need to unlink the extra filepaths (basically everything that isn't either ...). Still not sure why none of the tests would fail locally though.
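A minimal sketch of the cleanup idea in plain R (the name `jsonPath` is an assumption for illustration, not taken from the actual SparkR tests): unlink any shared temp files at the end of each test so one test's leftovers can't leak into another.

```r
# Hypothetical per-test temp-file cleanup; `jsonPath` is an assumed
# name, not necessarily what the SparkR test suite uses.
jsonPath <- tempfile(fileext = ".json")
writeLines('{"name":"Michael"}', jsonPath)

# ... the test body would read from jsonPath here ...

# Clean up so a later test reusing the same name sees no stale file.
unlink(jsonPath)
```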
I think the problem might be that the order in which things are appended or unioned may not be deterministic. For example, if we say ...
BTW, the reason I suspect that is because on my machine the subtract test failed, and it turned out the rows were reordered.
Wow, that is very interesting, especially since it seems like the order varies from machine to machine but not from run to run (I even tried restarting my computer and running the tests, but still didn't get any errors). I'm not aware of any internal mechanism for guaranteeing the order of the rows, but I can just sort the results prior to any expect() statements, and I think that will make it consistent.
Sorting will be good, or you can also add statements that check whether any row contains "Michael", etc.
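The sorting fix can be sketched in plain R (the values come from the "Michael" test data mentioned above; the exact comparison pattern is an assumption, not the literal SparkR test code):

```r
# Two result sets containing the same rows in different orders, as a
# local machine vs. Travis might produce them.
local_result  <- c("Michael", "Andy", "Justin")
travis_result <- c("Andy", "Justin", "Michael")

# A direct comparison is order-sensitive, hence flaky across machines...
identical(local_result, travis_result)               # FALSE
# ...but sorting before the expectation makes it deterministic.
identical(sort(local_result), sort(travis_result))   # TRUE
```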
Travis is happy again! Looks like sorting fixed it. Good catch @shivaram
@davies could you take a look?
```r
@@ -768,6 +798,20 @@ setMethod("select", signature(x = "DataFrame", col = "Column"),
  dataFrame(sdf)
})

setMethod("select",
          signature(x = "DataFrame", col = "list"),
```
What's the purpose of this API? I did not find one in Scala/Python to match this.
There isn't one in Scala or Python that matches. I created it as a utility because it allows you to pass a list of arguments to `select`. Since the generic for `select` is currently specified as `function(x, col, ...)`, where `col` is either `character` or `Column`, I don't see a good way to pass in an entire list. In the Python/Scala API, you get around this by using `*args`, but that isn't quite as easy with S4. You can see how this is used in the `withColumnRenamed` below.
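The S4 dispatch issue can be illustrated in plain R with a toy generic (`pick` on `data.frame` is hypothetical, standing in for SparkR's `select` on `DataFrame`): adding a second method whose signature names `col = "list"` is how a caller gets to hand in a whole list at once.

```r
library(methods)

# Toy generic mirroring select's shape: function(x, col, ...)
setGeneric("pick", function(x, col, ...) standardGeneric("pick"))

# Dispatch on individual character column names...
setMethod("pick", signature(x = "data.frame", col = "character"),
          function(x, col, ...) x[, c(col, ...), drop = FALSE])

# ...plus a second method that accepts an entire list in one argument.
setMethod("pick", signature(x = "data.frame", col = "list"),
          function(x, col) x[, unlist(col), drop = FALSE])

df <- data.frame(age = c(29, 30), name = c("Andy", "Justin"))
pick(df, "name")               # character variant
pick(df, list("age", "name"))  # list variant, no unpacking needed
```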
I see, we won't need this if we change `withColumnRenamed` to call the Java API, right?
FWIW in R you can use `do.call` to call a varargs function with a list of arguments. For example:

```r
arglist <- list(x = runif(10), trim = 0.1, na.rm = TRUE)
do.call(mean, arglist)
```
@shivaram That's a good point, but I think having this method is still a nice built-in option. My thinking was that it gives you a very accessible way to programmatically create new DataFrames in R. Meaning, like... you've got a ton of columns, and you `lapply` a function to grab a subset of the columns (based on some condition) and pass them through to `select` without needing to even think about `do.call`.
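That workflow can be sketched in base R (the data frame and condition are made-up examples; with SparkR, the final step would hand `keep` straight to the list-accepting `select`):

```r
df <- data.frame(age = c(29, 30), name = c("Andy", "Justin"),
                 height = c(1.8, 1.7))

# Programmatically grab the subset of columns meeting some condition...
keep <- Filter(function(nm) is.numeric(df[[nm]]), names(df))
keep  # "age" "height"

# ...then pass them through in one shot, e.g. select(df, as.list(keep)),
# with no do.call required.
```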
I see -- yes, I think it is a good feature to have. I wrote up a few more APIs we should support for column selection at [1], and one of them was to include a way to specify a list of column names.
Also, with the way the signature for `select` is written, will `do.call` actually work that way? The signature specifies one explicit `col` argument prior to `...`, so I was thinking that to pass a list to the current methods, you'd need something like this:

```r
arglist <- list("age", "name", "height")
select(df, arglist[[1]], arglist[2:3])
```

Does `do.call` collapse all the other arguments down into `...` for you?
Yeah, here is a simple example:

```r
a <- function(x1, ...) { first <- x1; rest <- list(...); cat("Got ", length(rest) + 1, " arguments\n") }
do.call(a, list(1, 2, 3))
```
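To confirm the point about the explicit `col` argument: `do.call` matches list elements positionally, so the first element lands in `col` and the remainder spills into `...` automatically. A sketch with a toy function (not SparkR's actual `select`):

```r
# Toy function with select's argument shape: function(x, col, ...)
f <- function(x, col, ...) c(col, unlist(list(...)))

arglist <- list("age", "name", "height")

# do.call fills x, then col, then folds the rest into `...`, so the
# whole list can be spread across the signature in one call.
result <- do.call(f, c(list("df"), arglist))
result  # "age" "name" "height"
```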
Oh, wow, I stand corrected :)
So should I drop/refactor this method?
@cafreeman I think this looks pretty good. As I mentioned inline, I think we should extend the ... So this PR LGTM but for a minor inline comment I had. @davies any other comments?
LGTM, thanks!
Merging this
[SparkR-209] Remaining DataFrame methods + `dropTempTable`

- `dropTempTable` in `SQLContext`

DataFrame methods:
- `insertInto`
- `unionAll`
- `intersect`
- `subtract`
- `withColumn`
- `withColumnRenamed`